# Introduction to data handling and visualisation with Python

This is a Jupyter Notebook, designed to introduce some popular and useful approaches and libraries for data handling and visualisation in Python and the Python ecosystem.

We will be using the [pandas](https://pandas.pydata.org/) library for the data handling parts, and the [seaborn](https://seaborn.pydata.org/) library for visualisation. There are alternatives, but these two are great options if you want to "just learn one thing".

## Sections
- Python and duck typing
- The Python data ecosystem:
  - Reading data
  - Manipulating (anonymising) and parsing data
  - Saving data
  - Joining dataframes
- Visualising data
  - Seaborn for quick but nice figures
  - Time-series analysis


In [None]:
import pandas as pd
import numpy as np

## Python and duck typing

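Python is dynamically typed, and more specifically "duck typed" ("if it looks like a duck and quacks like a duck... it's a duck!").
That means you don't need to _declare_ the type of a variable: the type belongs to the value, and what matters is whether that value supports the operation you ask of it.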
(Note that in modern Python you CAN add a "type hint", and they are wonderful, but they are mostly used to help readability and catch some bugs.)
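
As a minimal sketch of what this looks like in practice (the function `double` and the variable names are hypothetical, made up for illustration): a name can be rebound to values of different types, and a type hint documents intent without being enforced at runtime.

```python
def double(value: int) -> int:
    # The hint says "int", but Python does not check it at runtime.
    return value * 2


thing = 21                      # no type declaration needed
doubled_number = double(thing)

thing = "quack"                 # the same name can be rebound to a different type
doubled_text = double(thing)    # still works, because str also supports "* 2"

print(doubled_number, doubled_text)
```

A static type checker would flag the second call, which is the kind of bug type hints help to catch.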

Python is unusual in that you can make an iterable `List` of varied types:

In [None]:
my_strange_list = ["hello", 42, pd.DataFrame, [1,2,3], {"this": "that"}]

my_strange_list

Note that each element in this List has a different type:

In [None]:
[type(element) for element in my_strange_list]

(By the way, that `[thing for thing in things]` form of code is called a "list comprehension". They are very handy in Python for doing an inline loop or conditional.)
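
A small sketch of the three common shapes of list comprehension, using made-up numbers:

```python
numbers = [1, 2, 3, 4, 5, 6]

# Inline loop: one output element per input element.
squares = [n * n for n in numbers]

# Inline loop with a filter: keep only the even numbers.
evens = [n for n in numbers if n % 2 == 0]

# Inline conditional expression: choose a label for every element.
labels = ["even" if n % 2 == 0 else "odd" for n in numbers]

print(squares, evens, labels)
```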

In data analysis, most people are used to `Array`s rather than Pythonic `List`s.
`Array`s tend to be fixed type, and a bit more like a mathematical vector.
For instance consider this, and before you run it... what do you THINK the result should be?:

In [None]:
[1,2,3] * 3

In the Python ecosystem, the most popular way to get a more "normal" `Array`, that works like in `R` etc, is to use `numpy`. `numpy` gives you an `Array` type that is very fast to do things with, since every element is known to be of the same type so a bunch of type-checking can be skipped:

In [None]:
np.array([1,2,3]) * 3
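
A side-by-side sketch of the two behaviours (the variable names are made up for illustration): multiplying a `List` repeats it, whereas multiplying a `numpy` array multiplies each element.

```python
import numpy as np

plain_list = [1, 2, 3]
number_array = np.array([1, 2, 3])

repeated = plain_list * 3     # list repetition: the list is concatenated with itself
scaled = number_array * 3     # element-wise arithmetic: every element is multiplied

print(repeated)
print(scaled)
print(number_array.dtype)     # one shared type for every element
```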

## Python's data ecosystem

This is just one example of the ecosystem of libraries that can build upon Python's generic design and add data-handling focussed tooling.

### Reading data

For example, you CAN read data using Python's "standard library":

In [None]:
import csv

my_data = []  # an empty list
with open("people.tsv") as my_csv_file:  # a file handler
    my_csv_reader = csv.DictReader(my_csv_file, delimiter='\t')  # from the python standard library, a way to read CSV files into dictionaries
    for row in my_csv_reader:  # iterate over all lines in the TSV
        my_data.append(row)

my_data[:10] # the first 10 items in the list
    

... but it takes much less code to do this using the `pandas` (commonly shortened to `pd`) library, which, like `numpy`, is not part of the standard library.

In [None]:
my_data = pd.read_csv("people.tsv", sep="\t")
my_data.head(10)

This object:

In [None]:
type(my_data)

is a `pandas` `DataFrame`. It is similar to data frames in other analytical / data science languages and tools.
Specifically, it is a great object for holding tabular data (or higher-dimension structured data), especially when you've got named columns, different data types like strings, numbers, and dates, and when you want to do tabular things like lookups, orderings, group-bys, and table-to-table JOINs.
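
To make those "tabular things" concrete, here is a minimal sketch on a hypothetical toy table (the names, teams and scores are invented, and are not the contents of `people.tsv`):

```python
import pandas as pd

# Hypothetical toy data with named columns of different types.
toy = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan", "Edsger"],
    "team": ["red", "blue", "red", "blue"],
    "score": [3.5, 4.0, 2.5, 5.0],
})

by_score = toy.sort_values("score", ascending=False)   # ordering
team_means = toy.groupby("team")["score"].mean()       # group-by
red_only = toy[toy["team"] == "red"]                   # lookup / filtering

print(by_score)
print(team_means)
print(red_only)
```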

---

### Manipulating a dataframe: anonymising data

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
 Task 1: anonymise that data frame.
  We have research data about patients, and want to anonymise it. To do so, at the very least we should (1) remove their names and replace them with patient identifiers, and (2) compress their dates of birth into less identifiable integer ages.
</div>

**For the names:** consider that we may need to be able to join FURTHER data, by patient name, to this. So rather than anonymise, we can pseudonymise. One simple way is to make a `hash` of their name, such that it is very easy to turn another copy of that name in future data into the hash, but (relatively) hard to do the opposite.

You can make a new column in a `pandas` `DataFrame`, that takes as input a single existing column, like this with the `.apply` method:

```python
def my_method(name):
    # reverse the letters in name
    # in reality use something cryptographic, like a method from "hashlib".
    # bonus points for implementing that!
    return name[::-1]  # Python slices are written [start:stop:step]; [::-1] steps backwards through the whole string

my_data['identifier'] = my_data['name'].apply(my_method)
```

Lastly: 
- remember to actually remove the names! Look into the method `my_data.drop(columns=...)` to remove a column. Hint: check out the `inplace=` parameter.
- "index" the dataframe by this new identifier, since the table is naturally keyed by it (each row is a person). To set the index of a `DataFrame`, look into the method `my_data.set_index(...)`
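
A minimal sketch of how `drop`, `inplace=` and `set_index` behave, on a hypothetical toy dataframe (the column names `code`, `secret` and `value` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame, unrelated to people.tsv.
toy = pd.DataFrame({"code": ["a1", "b2"], "secret": ["x", "y"], "value": [10, 20]})

# Without inplace=True, drop returns a new dataframe and leaves the original alone.
without_secret = toy.drop(columns=["secret"])

# With inplace=True, the dataframe itself is modified and the method returns None.
toy_copy = toy.copy()
result = toy_copy.drop(columns=["secret"], inplace=True)

# set_index moves a column into the index, so rows can be looked up by it.
indexed = without_secret.set_index("code")
value_for_b2 = indexed.loc["b2", "value"]

print(list(toy.columns), list(toy_copy.columns), result, value_for_b2)
```

Because the in-place form returns `None`, writing `toy = toy.drop(..., inplace=True)` would lose the dataframe; use one style or the other.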

In [None]:
people = pd.read_csv("people.tsv", sep="\t")

def pseudonymise_name(name: str):
    # ... your code

people['identifier'] = # ... your code

You should end up with a dataframe called `people` that looks like:

| identifier | date_of_birth |
| ---------- | ------------- |
| abcd123456 | 10/28/1969    |
| wxyz987654 | 12/07/1965    |

(The solution is in the collapsed cell below)

In [None]:
# solution
import hashlib

people = pd.read_csv("people.tsv", sep="\t")

def pseudonymise_name(name: str):
    return hashlib.md5(name.encode()).hexdigest()

people["identifier"] = people["name"].apply(pseudonymise_name)
people.drop(columns=["name"], inplace=True)
people.set_index("identifier", inplace=True)
people
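
A plain hash of a name is only "relatively" hard to reverse: anyone holding a list of candidate names can hash each one and compare. One possible hardening, sketched here as an assumption rather than as part of this notebook's exercises, is a keyed hash (HMAC) from the standard library's `hmac` module. The function name `pseudonymise_with_key` and the key value are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration only: a real key would be generated
# randomly and stored securely, away from the data.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymise_with_key(name: str, key: bytes = SECRET_KEY) -> str:
    # HMAC-SHA256 of the name; without the key the identifier cannot be recomputed.
    return hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()


first = pseudonymise_with_key("Ada Lovelace")
second = pseudonymise_with_key("Ada Lovelace")
other = pseudonymise_with_key("Ada Lovelace", key=b"a different key")

print(first == second, first == other)
```

The same key has to be used for every dataset that will later be joined on the identifier, otherwise the identifiers will not match.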

**For the dates of birth:** 
First, we need to parse the dates. At the moment they are just arbitrary objects:

In [None]:
people.dtypes

...but Python, and pandas, have an actual `datetime` object for this, that helps disambiguate all the weirdness of dates and time intervals etc.

The Python one is part of the standard library:

In [None]:
from datetime import datetime, timedelta
now = datetime.now()

print(now)
print(type(now))

So! We want to parse and coerce the dates of birth into datetimes, so that we can calculate their integer ages based on the `timedelta` between that date and `now`.

This is so common that `Pandas` [provides a method](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas-to-datetime) for it: `pd.to_datetime(my_pandas_series)`.

**Write code to overwrite the `people.date_of_birth` column to be parsed `datetime`s.
Note that this CSV uses the bizarre "middle-endian" date format (i.e. month/day/year) common to the USA... not a nice ISO standard! So, you'll need to [provide a `format=` argument to tell `pd.to_datetime`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas-to-datetime) how to understand this insanity.**

(Hint: `%d-%m-%Y` represents the format of a dd-mm-yyyy date like `24-10-2025`.)

You may notice that the data includes some problems... we will fix those in a moment but check out the `errors=` parameter to find a way to coerce the dates into shape first.
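
As a sketch of how `format=` and `errors=` interact, here is a made-up series that deliberately uses the `%d-%m-%Y` format from the hint rather than the format found in `people.tsv`:

```python
import pandas as pd

# Made-up values: two valid dd-mm-yyyy dates and one that cannot be parsed.
raw = pd.Series(["24-10-2025", "01-02-1999", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")

print(parsed)
print(parsed.dtype)
```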

In [None]:
people.date_of_birth = # ... your code

In [None]:
# solution
people.date_of_birth = pd.to_datetime(people.date_of_birth, format="%m/%d/%Y", errors="coerce")

Now, if you managed to coerce those bad dates... you'll notice that they appear in the DataFrame as the strange symbol `NaT` ("Not a Time"). Let's just drop those data from our set. Again, this is such a common thing that `pandas` provides a method: `my_data.dropna(...)` (`dropna` means "drop NAs", where NA is the general term for a missing value, such as `NaN`, "Not a Number", or `NaT`). Check out the help for that method to see how to ONLY drop rows which have an NA in a subset of the columns (in our case, `date_of_birth`).

Like earlier... you want to modify the existing dataframe, so check out the `inplace=` argument.
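
A minimal sketch of the difference `subset=` makes, on a hypothetical toy dataframe with columns `a` and `b`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a missing value in each column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, "x", "y"],
})

any_missing_dropped = toy.dropna()            # drops rows with an NA in ANY column
only_a_checked = toy.dropna(subset=["a"])     # drops rows only where "a" is NA

print(any_missing_dropped)
print(only_a_checked)
```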

In [None]:
# your code...

In [None]:
# solution 
people.dropna(subset=["date_of_birth"], inplace=True)
people

Finally, we can calculate their ages!

Like earlier when we pseudonymised the names, pseudonymise their ages by first creating a new `people["age"]` column, and then dropping the `date_of_birth` column.

To calculate the age, you can just do normal arithmetic with the datetime objects. We calculated `now` earlier as the current date, so `now - <some date>` gives a `Timedelta`, which has a property `.days`. You can turn `.days` into years by just dividing and pretending leap years don't exist for our simple purposes. Finally, you can cast the floating point age to an integer by just wrapping it in `int(...)` – that's just standard Python.
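
The arithmetic can be sketched with plain standard-library `datetime` objects. The two dates below are fixed, made-up values so that the result is reproducible:

```python
from datetime import datetime

# Fixed, made-up dates (instead of datetime.now()) so the result never changes.
reference_date = datetime(2025, 10, 24)
date_of_birth = datetime(2000, 10, 24)

difference = reference_date - date_of_birth   # subtracting datetimes gives a timedelta
age_in_days = difference.days
age_in_years = int(age_in_days / 365)         # ignores leap years, then truncates

print(age_in_days, age_in_years)
```

Because leap days are ignored, this estimate can be one year too high for someone who is within a few days of their next birthday.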

In [None]:
people['age'] = # your code...

In [None]:
# solution
people['age'] = people.date_of_birth.apply(lambda dob: int((now - dob).days / 365))
# by the way, a lambda is like an inline unnamed function:
# it just takes an argument (dob) and returns whatever the value after the colon is. handy!
people.drop(columns=["date_of_birth"], inplace=True)
people

### Saving dataframes

You might want to save this dataframe back to a TSV file... like `anonymous_people.tsv`.

To do so, use the `.to_csv(...)` method on the `people` dataframe. Note that a TSV (Tab Separated Value) is often a better choice than a CSV (Comma Separated Value), but is so similar that you use the same method to write it. Just set the argument `sep='\t'` (`sep` for separator, and `\t` is the control character for a tab, like `\n` is a newline).
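
To see what `sep='\t'` does without touching any files, here is a sketch that writes a hypothetical toy dataframe to an in-memory text buffer and reads it back (the names `code` and `value` are made up):

```python
import io

import pandas as pd

# Hypothetical toy frame with a named index.
toy = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.Index(["a1", "b2"], name="code"),
)

buffer = io.StringIO()            # stands in for a file on disk
toy.to_csv(buffer, sep="\t")      # the index is written as the first column
tsv_text = buffer.getvalue()

# Reading it back needs the same separator, and index_col to restore the index.
round_trip = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col="code")

print(tsv_text)
print(round_trip)
```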

In [None]:
people.
# your code...

---

### Joining dataframes

Often you will have multiple datasets that need to be joined together, e.g. from multiple data sources or multi-omics results.
Occasionally this means concatenating dataframes "row-wise" (e.g. adding more people to the table), but more often it is "column-wise" (e.g. adding more data about each person).
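
A minimal sketch of the two directions, using hypothetical toy dataframes (the identifiers `p1`, `p2`, `p3` and the column names are made up):

```python
import pandas as pd

first_batch = pd.DataFrame({"age": [30, 41]}, index=["p1", "p2"])
second_batch = pd.DataFrame({"age": [25]}, index=["p3"])
heights = pd.DataFrame({"height_cm": [170, 182]}, index=["p1", "p2"])

# Row-wise: stack more people underneath the existing ones.
more_rows = pd.concat([first_batch, second_batch])

# Column-wise: add more data about the same people, matched on the index.
more_columns = first_batch.join(heights)

print(more_rows)
print(more_columns)
```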

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
 Task 2: Join the `people` dataframe to a new dataframe read in from the measurements.tsv table.
</div>

Hints:
- `measurements.tsv` can be read into a dataframe just like before.
- `measurements` contains names... but remember we want to be pseudonymous so you'll need to convert names to identifiers exactly as before, so that the identifiers can be used to join the dataframes
- there are multiple `measurement`s for each person! This means you **shouldn't** use the `identifier` as the `index` for the `measurements` dataframe, as it wouldn't be unique.
- the method to join two dataframes is `people_measurements = measurements.join(people, on='identifier')`. The `on='identifier'` argument means that the index of `people` will be `join`ed `on`to the `identifier` column of `measurements`. By the way, this wouldn't be needed if the two dataframes were both indexed by the same identifiers.
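
The `on=` behaviour described in the last hint can be sketched on hypothetical toy data (the names `owners`, `readings` and `key` are made up, and are not the columns of `measurements.tsv`):

```python
import pandas as pd

# One row per person, indexed by a made-up key.
owners = pd.DataFrame({"age": [30, 41]}, index=pd.Index(["p1", "p2"], name="key"))

# Several rows per person; here "key" is an ordinary column, not the index.
readings = pd.DataFrame({
    "key": ["p1", "p1", "p2"],
    "reading": [1.0, 1.5, 2.0],
})

# The "key" column of readings is matched against the index of owners.
joined = readings.join(owners, on="key")

print(joined)
```

By default `join` keeps every row of the left-hand dataframe, so a reading whose key is missing from `owners` would get `NaN` in the `age` column.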

In [None]:
measurements = # your code...
people_measurements = # your code...

In [None]:
# solution
measurements = pd.read_csv("measurements.tsv", sep="\t")
measurements["identifier"] = measurements.name.apply(pseudonymise_name)
measurements.drop(columns=["name"], inplace=True)
people_measurements = measurements.join(people, on='identifier')

You might notice from looking at the `people_measurements` dataframe that there are also some missing (`NaN`) weight measurements in there.

Drop those dataframe rows like we did with the missing dates of birth.

In [None]:
people_measurements. # your code...

In [None]:
# solution
people_measurements.dropna(subset=["weight_kg"], inplace=True)
people_measurements

Lastly, we need to parse the dates as actual dates again. Interestingly, these dates are formatted in the little-endian sense of dd/mm/yyyy!

In [None]:
people_measurements.date = pd.to_datetime(people_measurements.date, format="%d/%m/%Y", errors="coerce")
people_measurements

**All being well you now have a dataframe of 941 weight measurements, for about 100 people whose ages we know.**

---

## Visualising data

An obvious place to start here would be plotting weight vs age. For simple exploratory visualisation like this, `pandas` actually provides built-in plotting methods:


In [None]:
people_measurements.plot(x="age", y="weight_kg", kind="scatter")

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
 Task 3: Explore the different "kind"s of plot pandas lets you make.
</div>
Use the popup help (press shift-tab whilst your cursor is inside the `.plot(   )` argument list), or call `help(people_measurements.plot)`, or just [read the docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html#pandas-dataframe-plot).
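
As a starting point, here is a sketch of three `kind`s on a hypothetical toy dataframe (columns `x` and `y` are made up). It assumes `matplotlib` is installed, which `pandas` needs for plotting:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 4.1, 5.9, 8.2]})

# Each call returns a matplotlib Axes object that can be customised further.
scatter_axes = toy.plot(x="x", y="y", kind="scatter")
hist_axes = toy.plot(y="y", kind="hist", bins=4)
box_axes = toy.plot(kind="box")

print(scatter_axes.get_xlabel(), scatter_axes.get_ylabel())
```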

In [None]:
people_measurements.plot(  # ... your code

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
 Are any of these useful? Are any of them... pretty?
</div>



### Seaborn: a quick route to publication-ready visualisation

`Seaborn` (commonly shortened to `sns`) and `matplotlib`'s `pyplot` (commonly `plt`) are two widely used options for data visualisation. In general `seaborn` does a lot without much configuration, whereas `matplotlib` is lower level, more general and configurable.

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
 Task 4: Use seaborn to make a "jointplot" showing the age–weight datapoints, a linear regression through the data, and the distribution of data independently across both the age and weight dimensions.
</div>

Hint: this is much easier than it sounds! Check out [the `seaborn` `jointplot` documentation](https://seaborn.pydata.org/generated/seaborn.jointplot.html#seaborn-jointplot)

In [None]:
import seaborn as sns

In [None]:
g = sns. # ... your code

In [None]:
# solution
g = sns.jointplot(
    data=people_measurements,
    x="age", y="weight_kg",
    kind="reg",
)

**Fancy a fun break?** Play around with the [aesthetics options for seaborn](https://seaborn.pydata.org/tutorial/aesthetics.html#controlling-figure-aesthetics)

The reason we set the result of the `sns....` call to a variable (`g = `) is so that we can do more with it after creation.
Remake your plot, but set the axis labels to nicer ones like "Age in years".
Use the `g.set_axis_labels(...)` method for that.

In [None]:
g.set_axis_labels(  # ... your code

In [None]:
# solution
g = sns.jointplot(
    data=people_measurements,
    x="age", y="weight_kg",
    kind="reg",
)
g.set_axis_labels(xlabel="Age in years", ylabel="Weight in kg")

### Time-series analysis

So far, we can see that our cohort increases slightly in weight with age (once reaching adulthood).
But what about individuals? We have multiple weight measurements on different dates... so we can perform a time series analysis of the individuals.

One way to do this is to group the dataframe by `identifier` (person), so that we get a group of that individual's weight rows, and also order by date, so that we can see the time series in chronological order:

In [None]:
people_measurements.sort_values(["identifier", "date"], inplace=True)

people_timeseries = people_measurements.groupby("identifier")
people_timeseries

We see that this `people_timeseries` is a `DataFrameGroupBy` object... this is hard to comprehend, until we iterate over the groups (well, just the first for now):

In [None]:
for group_label, group_content in people_timeseries:
    print(group_label)
    print(group_content)
    break

So the `group_label` there is a person identifier, and the `group_content` is a small dataframe of just their weight measurements - i.e. a timeseries!
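
The same pattern on a hypothetical toy dataframe (the names `owner` and `reading` are made up) shows the three things you typically do with a grouped object: iterate over it, pull out one group, and aggregate each group.

```python
import pandas as pd

# Hypothetical toy data: several readings per owner.
toy = pd.DataFrame({
    "owner": ["p1", "p2", "p1", "p2", "p1"],
    "reading": [1.0, 10.0, 2.0, 20.0, 3.0],
})

grouped = toy.groupby("owner")

# Iterating yields (label, small dataframe) pairs.
group_sizes = {label: len(content) for label, content in grouped}

# get_group pulls out the rows for a single label.
p1_rows = grouped.get_group("p1")

# Aggregations are applied to each group separately.
means = grouped["reading"].mean()

print(group_sizes)
print(p1_rows)
print(means)
```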

We can of course plot this timeseries:

In [None]:
first_person_identifier = people_measurements.identifier.iloc[0]  # by the way, .iloc[0] selects the item at integer position 0 in a pandas object (iloc = Integer LOCation)
first_person_timeseries = people_timeseries.get_group(first_person_identifier)
first_person_timeseries.plot(x="date", y="weight_kg")

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
 Task 5 (much harder!): Make a multi-panel plot with seaborn. Each panel should represent a 10-year age-band, and show a mean time-series of weight for people in that age band.
</div>

Steps and hints: 
- we will need to `groupby` BOTH a decadal "age band", AND e.g. what month the weight measurement happened in (so that individuals' timeseries can be aligned to some common timeline)
- the decadal age bands can be calculated by defining a list of age `bins` `0, 10, 20....` and then using `pd.cut` to make a temporary column (i.e. there will be a bin value for every person, like `(20, 30]` for a person aged 22).
- the monthly timeseries grouping can be done using a normal pandas [Grouper](https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html#pandas-grouper), along with a [frequency alias for "the start of each month"](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects).
- dataframes can be grouped by multiple groupers, by giving a **list** to `.groupby`, instead of a single column name.
- there is [a good example in the seaborn documentation of the actual visualisation](https://seaborn.pydata.org/examples/timeseries_facets.html#small-multiple-time-series)!
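
The two building blocks from the hints can be tried separately first. This sketch uses hypothetical toy values (the ages, dates and the column names `when` and `reading` are made up):

```python
import pandas as pd

# pd.cut assigns each value to a half-open interval (left, right].
ages = pd.Series([5, 22, 30, 31])
bands = pd.cut(ages, bins=[0, 10, 20, 30, 40])
band_of_22 = bands.iloc[1]
band_of_30 = bands.iloc[2]   # 30 sits on a bin edge, so it lands in the lower band

# pd.Grouper with freq="MS" groups rows by the start of their month.
toy = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-03", "2024-01-20", "2024-02-11"]),
    "reading": [1.0, 3.0, 5.0],
})
monthly_means = toy.groupby(pd.Grouper(key="when", freq="MS"))["reading"].mean()

print(bands)
print(monthly_means)
```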


Step 1: grouping the data

In [None]:
bins =  # your code...
monthly_by_age_band = people_measurements.groupby([  # your code...

In [None]:
# solution
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
monthly_by_age_band = people_measurements.groupby([
    pd.Grouper(key="date", freq="MS"),
    pd.cut(people_measurements["age"], bins=bins)
])

Step 2: calculate the mean weight each month, for the people in each age band. 
We can just select the `weight_kg` column from the grouped data and call `.mean()` on it... Pandas knows to apply that to each group.
A somewhat unintuitive step here is to call `reset_index`. This is quite common after you've done a bunch of grouping and aggregating in pandas... the goal is to just get back to a regular, single-indexed "table-like" dataframe.

In [None]:
monthly_averages_by_age_band = monthly_by_age_band["weight_kg"].mean().reset_index(name="avg_weight")
monthly_averages_by_age_band

You should see that we've ended up with a table where we have a date column, holding the first day of each month, an age column, which is a decadal age band, and the average weight of people in that cohort in that month.
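
What `reset_index` does is easiest to see on a hypothetical toy dataframe (the columns `colour`, `size` and `reading` are made up): grouping by two keys gives a result with a two-level index, and `reset_index` turns those levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue"],
    "size": ["s", "s", "s", "l"],
    "reading": [1.0, 3.0, 5.0, 7.0],
})

# Grouping by two keys gives a Series with a two-level (multi) index.
nested = toy.groupby(["colour", "size"])["reading"].mean()

# reset_index flattens it; name= labels the column of aggregated values.
flat = nested.reset_index(name="avg_reading")

print(nested)
print(flat)
```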

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
 Can you spot a slight interpretation problem with our logic here?
</div>

Step 3: plot each age band in a panel. We can use the `relplot` from `seaborn`, similar to the example in the seaborn documentation...

In [None]:
g = sns.relplot(
# your code...

In [None]:
# solution
g = sns.relplot(
    data = monthly_averages_by_age_band,
    x="date",
    y="avg_weight",
    col="age",
    hue="age",
    kind="line", palette="crest", linewidth=4, zorder=5,
    col_wrap=4, height=2, aspect=1.5, legend=False,
)

## Summary
In this notebook, you've learned why Python really needs to be extended with its ecosystem of libraries to do data analysis.
Specifically, `pandas` gives Python support for `dataframe`s, which are the basis of a lot of scientific data handling.
It can handle data parsing, manipulating (like making new columns algorithmically from existing ones), and joining multiple datasets together.
`Pandas` also gives us some basic plotting, but for more advanced visualisations – and especially for jumping straight to fairly aesthetically satisfying figures – we are better off adopting a library like `seaborn` (which, as you've seen, is compatible with `pandas` dataframes).

We finished with some time-series analysis, using `pandas` sorting and grouping methods to multiply-group our dataset, and used a `relplot` to create a grid of figure subplots.