# Lecture 13 - Introduction to Data Science

In this lecture you had a short introduction to the growing field of *Data Science*.

> 🎯 The main learning objective for this tutorial is to become experienced with the [**Pandas**](https://pandas.pydata.org/) library.

I strongly suggest you start by taking a few minutes to scroll through these:

- [5 min] [PyData cheat sheets](./docs/pydata_cheatsheets.pdf)
- [10 min] [Enthought cheat sheets](./docs/enthought_cheatsheets.pdf)

----------



## Tutorial - A gene expression dataset

For this tutorial we will focus on using gene expression data (but everything we will learn can be applied to any kind of data). 

> PS: Some exercises in this tutorial are more difficult than usual. 🤯 It is fine if sometimes you need to take a sneak peek at the solutions. 🫣


We will use data from this publication [(Lee et al., 2016)](https://www.sciencedirect.com/science/article/pii/S1550413116302480), which analysed gene expression levels in liver and adipose tissue of 12 obese patients undergoing bariatric surgery.

- The RNA-seq data was initially submitted to the Gene Expression Omnibus [(GSE83322)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE83322). Here you can see more details about the sequencing protocol and, if you want to, download the raw sequencing data.
- This data is also available at the Expression Atlas [(E-GEOD-83322)](https://www.ebi.ac.uk/gxa/experiments/E-GEOD-83322/Results). This database contains a selection of post-processed and manually curated gene and protein expression datasets. 

Let's begin by loading the **data** into a Pandas dataframe:

In [None]:
import pandas as pd
data = pd.read_csv('files/E-GEOD-83322-query-results.tpms.tsv', sep='\t', comment='#')

data.sample(10) # show 10 random rows
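
If you are wondering what `sep='\t'` and `comment='#'` do, here is a minimal sketch on a tiny made-up file (the header text, gene identifiers and values below are invented for illustration and are not taken from the real dataset):

```python
import io
import pandas as pd

# A made-up tab-separated file: one comment line, one header line, two data rows
raw = (
    "# made-up comment line, skipped by the parser\n"
    "Gene ID\tGene Name\t1, liver\n"
    "ENSG0001\tG1\t2.5\n"
    "ENSG0002\tG2\t\n"          # the last field is empty
)

# sep='\t' splits on tabs, comment='#' ignores everything after a '#'
toy = pd.read_csv(io.StringIO(raw), sep='\t', comment='#')
print(toy)
```

The empty field in the last row is read as `NaN`, which is the kind of missing value we will clean up in Exercise 1.2.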

Now let's load the **metadata** in a similar way:

> Note that for convenience we are dropping (and renaming) some columns in the metadata file:

In [None]:
mdata = pd.read_csv('files/E-GEOD-83322-experiment-design.tsv', sep='\t', usecols=[1,3,13,15],
                    header=0, names=['age', 'bmi', 'patient', 'tissue'])

mdata.sample(10)

## Exercise 1 - Data cleaning

### Exercise 1.1:

The metadata table contains two columns that don't have the appropriate type: 
- the **age** is a string (`'34 year'`), but we would like to have it as a number (`int`) instead
- the **patient** identifier was loaded as a number, but we would like to have it as a string

Try to fix those two issues.

In [None]:
# insert your code here...

Click to see (a possible) solution below:

In [None]:

mdata['age'] = mdata['age'].apply(lambda x: int(x.split()[0]))
mdata['patient'] = mdata['patient'].apply(str)
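
There is usually more than one way to do this in pandas. As an alternative sketch, the same conversion can be written with the vectorised `.str` and `.astype()` methods. The toy table below is made up and only mimics the two problematic columns:

```python
import pandas as pd

# Toy metadata with the same two issues (values are invented)
toy = pd.DataFrame({'age': ['34 year', '51 year', '47 year'],
                    'patient': [1, 2, 3]})

# Extract the leading digits from each age string and convert them to integers
toy['age'] = toy['age'].str.extract(r'(\d+)', expand=False).astype(int)

# Convert the patient identifier from a number to a string
toy['patient'] = toy['patient'].astype(str)

print(toy.dtypes)
```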

### Exercise 1.2: 

The data table contains several `NaN` values, which most likely correspond to conditions where a transcript was not detected for the respective gene. 

It also contains entries where a transcript was mapped to different variants of the same gene (example: **ABCF2** is associated with [ENSG00000033050](https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000033050) and [ENSG00000285292](https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000285292)).

- Use `.fillna()` to replace `NaN` with zeros.
- Use `.groupby()` and `.sum()` to sum up the expression levels for different variants of the same gene.
- Use `.rename()` to rename the column from `'Gene Name'` to `'gene'`.
- Use `.drop()` to remove the column `'Gene ID'`.


> Tip: use *as_index=False* with *groupby*

In [None]:
# insert your code here...

Click to see solution below:

In [None]:

data.fillna(0, inplace=True) # alternatively: data = data.fillna(0)

# sum the expression levels of all variants that share the same gene name
data = data.groupby('Gene Name', as_index=False).sum()
data.rename(columns={'Gene Name': 'gene'}, inplace=True)
data.drop(columns=['Gene ID'], inplace=True)
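
To see what the `as_index=False` tip changes, here is a small sketch on made-up data (the gene names, sample names and values are invented):

```python
import pandas as pd

# Toy table where gene 'A' appears twice (two variants of the same gene)
toy = pd.DataFrame({'Gene Name': ['A', 'A', 'B'],
                    's1': [1.0, 2.0, 5.0],
                    's2': [0.5, 0.5, 1.0]})

# Default behaviour: the grouping key becomes the index of the result
by_index = toy.groupby('Gene Name').sum()

# With as_index=False the grouping key stays a regular column
by_column = toy.groupby('Gene Name', as_index=False).sum()

print(by_index)
print(by_column)
```

Keeping the key as a regular column is convenient here, because the next steps (`.rename()` and later `.melt()`) refer to it by column name.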

### Exercise 1.3:

The column identifiers in the data table contain the patient number and the sampled tissue. This makes it harder to group samples by either patient or tissue. 

- Use `.melt()` to unpivot the table from *wide* to *long* format.
- Create two new columns (patient and tissue) by splitting the original identifiers.
- Finally, delete the old column.

In [None]:
# insert your code here...

Click to see solution below:

In [None]:

data = data.melt(id_vars='gene', var_name='sample') # one row per (gene, sample) pair
data['patient'] = data['sample'].apply(lambda x: x.split(', ')[0]) # text before the comma
data['tissue'] = data['sample'].apply(lambda x: x.split(', ')[1]) # text after the comma
data.drop(columns=['sample'], inplace=True)
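
If the wide-to-long conversion feels abstract, the sketch below repeats the same steps on a tiny made-up table. The column labels are assumed to follow the `'patient, tissue'` pattern, and `.str.split(..., expand=True)` is used here as one possible alternative to the two `apply` calls:

```python
import pandas as pd

# Toy wide table: one row per gene, one column per sample (values are invented)
wide = pd.DataFrame({'gene': ['G1', 'G2'],
                     '1, liver': [10.0, 0.0],
                     '1, adipose tissue': [3.0, 7.0]})

# Unpivot: every (gene, sample) pair becomes its own row
long = wide.melt(id_vars='gene', var_name='sample')

# Split the sample label into two new columns in a single step
long[['patient', 'tissue']] = long['sample'].str.split(', ', expand=True)
long = long.drop(columns=['sample'])

print(long)
```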

## Exercise 2 - Data Exploration

Now that we have the data in *long format* we can try to analyse the overall distribution of gene expression values by plotting a histogram. 

You can do this directly from the pandas DataFrame:

In [None]:
data['value'].hist()

Well, that wasn't so helpful after all... Let's try a box plot? 

In [None]:
data['value'].plot.box()

It looks like the distribution is *very skewed* (the median is quite low compared to the values of the outliers). 

Here is another way to inspect that:

In [None]:
data['value'].describe()

### Exercise 2.1

Try to re-scale the data by converting it to a log-scale and plot the histogram again.

> Tip: you cannot convert zero values to log-scale. Can you find a *"quick and dirty fix"* for this?
> (What is the smallest non-zero value?)

In [None]:
import numpy as np 
# now you can use np.log10()

# insert your code here

Click to see solution below:

In [None]:

# the lowest non-zero value in the data is 0.1 because the values were stored with only one decimal place
# adding a small pseudocount (1e-3, well below 0.1) to every value keeps the zeros easy to distinguish

data['log_value'] = np.log10(data['value'] + 1e-3)
data['log_value'].hist(bins=30, log=True)
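
The effect of this small offset is easy to check on a handful of made-up numbers. This is only a sketch, and the name `pseudocount` is introduced here for illustration:

```python
import numpy as np
import pandas as pd

values = pd.Series([0.0, 0.1, 1.0, 100.0])  # invented expression values
pseudocount = 1e-3                           # same offset as in the solution above

# Without the offset, log10(0) is minus infinity (we silence the warning here)
with np.errstate(divide='ignore'):
    naive = np.log10(values)

# With the offset, zeros map to a finite value, log10(0.001) = -3
shifted = np.log10(values + pseudocount)

print(naive.tolist())
print(shifted.round(3).tolist())
```

Because the smallest non-zero value is 0.1 (about -1 on the log scale), the former zeros end up clearly separated at -3.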

For the rest of this tutorial, let's keep only the genes that have been detected in all conditions:

In [None]:
data = data.groupby('gene').filter(lambda x: x['value'].min() > 0)
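
Here is what `.groupby().filter()` does on a tiny made-up example: a gene is kept only if *all* of its values pass the test. The second variant, based on `.transform()`, is shown as a possible alternative that tends to be faster on large tables:

```python
import pandas as pd

# Toy long table: G1 has one zero value, G2 is detected everywhere
toy = pd.DataFrame({'gene': ['G1', 'G1', 'G2', 'G2'],
                    'value': [2.0, 0.0, 1.5, 3.0]})

# Keep all rows of a gene only if its minimum value is above zero
kept = toy.groupby('gene').filter(lambda x: x['value'].min() > 0)

# Alternative: compute the per-gene minimum for every row and use it as a mask
mask = toy.groupby('gene')['value'].transform('min') > 0
kept_alt = toy[mask]

print(kept)
```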

### Exercise 2.2

One important aspect to consider is that *metadata* means *data about the data*. This means that you *can* (and often *should*) explore the metadata. 

Use a scatterplot (`df.plot.scatter()`) to see if there is a correlation between **age** and **BMI** (body mass index).

> Advanced: instead, use **regplot** from the [seaborn library](https://seaborn.pydata.org/generated/seaborn.regplot.html) to also plot a regression line.

In [None]:
# Type your code here...

Click to see solution below...

In [None]:

import seaborn as sns
sns.regplot(data=mdata, x='age', y='bmi')

### Exercise 2.3

Let's try to understand if there is more variation in gene expression across patients or tissues.

- Create a new dataframe in wide format using `.pivot()`.
- Use the genes as rows, patients and tissues as columns, and take the log-normalized values.
- Use seaborn [clustermap](https://seaborn.pydata.org/generated/seaborn.clustermap.html#seaborn.clustermap) to plot the data. You can use [ColorBrewer](https://colorbrewer2.org/) to find a better colormap (*cmap* argument).

In [None]:
# Type your code here...

Click to see solution below:

In [None]:

import seaborn as sns

df_wide = data.pivot(index='gene', columns=['patient', 'tissue'], values='log_value')
sns.clustermap(df_wide, cmap='YlGnBu')
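
To get a feeling for the shape of the pivoted table, here is a minimal sketch on made-up data. It assumes a reasonably recent pandas version, since passing a list to `columns` is not supported in very old releases:

```python
import pandas as pd

# Toy long table with two genes measured in two tissues of one patient
long = pd.DataFrame({
    'gene': ['G1', 'G1', 'G2', 'G2'],
    'patient': ['1', '1', '1', '1'],
    'tissue': ['liver', 'adipose tissue', 'liver', 'adipose tissue'],
    'log_value': [1.0, 0.5, 2.0, 1.5],
})

# Genes become rows; each (patient, tissue) pair becomes one column
wide = long.pivot(index='gene', columns=['patient', 'tissue'], values='log_value')

print(wide)
print(wide.loc['G1', ('1', 'liver')])  # select a single cell with a tuple
```

The columns of the result form a two-level `MultiIndex`, which is why a single column is addressed with a `(patient, tissue)` tuple.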

> 🧠 Do you think gene expression is more similar across **patients** or **tissues**?

### Exercise 2.4

Let's try to address a more specific biological question: 

- Are there any genes whose expression in one of the tissues is correlated with **age** or **BMI**? 🤔

Let's start by re-arranging the data so that we can try to answer that question:

- Use `pd.merge()` to merge the data and metadata using patient and tissue as keys.
- Use `.query()` to create two dataframes, `df_adipose` and `df_liver`, containing the *adipose tissue* and *liver* data, respectively.

In [None]:
# type your code here...

Click to see solution below: 

In [None]:

data = pd.merge(data, mdata, on=['patient', 'tissue'])
df_adipose = data.query("tissue == 'adipose tissue'")
df_liver = data.query("tissue == 'liver'")
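
The sketch below shows the same merge-then-query pattern on two tiny made-up tables. Note that the key columns need to have matching types in both tables, which is one reason for converting the patient identifier to a string in Exercise 1.1:

```python
import pandas as pd

# Toy expression data in long format (values are invented)
expr = pd.DataFrame({'gene': ['G1', 'G1'],
                     'patient': ['1', '1'],
                     'tissue': ['liver', 'adipose tissue'],
                     'value': [4.0, 2.0]})

# Toy metadata with one row per (patient, tissue) sample
meta = pd.DataFrame({'patient': ['1', '1'],
                     'tissue': ['liver', 'adipose tissue'],
                     'age': [34, 34],
                     'bmi': [41.2, 41.2]})

# Each expression row receives the age and BMI of its sample
merged = pd.merge(expr, meta, on=['patient', 'tissue'])

# query() selects rows using a string expression
liver = merged.query("tissue == 'liver'")

print(merged)
```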

We will now use a little bit of advanced pandas wizardry to create a dataframe with the correlation between each gene's expression and each of the two variables (age and BMI), computed separately for each tissue:

In [None]:
from scipy.stats import spearmanr
import seaborn as sns

corr_liver_age = df_liver.groupby('gene').apply(lambda x: spearmanr(x['age'], x['log_value'])[0])
corr_liver_bmi = df_liver.groupby('gene').apply(lambda x: spearmanr(x['bmi'], x['log_value'])[0])
corr_adipose_age = df_adipose.groupby('gene').apply(lambda x: spearmanr(x['age'], x['log_value'])[0])
corr_adipose_bmi = df_adipose.groupby('gene').apply(lambda x: spearmanr(x['bmi'], x['log_value'])[0])

df_corr = pd.DataFrame([corr_liver_age, corr_liver_bmi, corr_adipose_age, corr_adipose_bmi], 
             index=['liver_age', 'liver_bmi', 'adipose_age', 'adipose_bmi']).T

sns.violinplot(data=df_corr)
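
If the `groupby` + `apply` + `lambda` combination looks cryptic, the sketch below unpacks it into an explicit loop over the groups, using made-up data for two genes. It should be equivalent in spirit to the one-liners above, just more verbose:

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy data: G1 increases with age, G2 decreases with age (values are invented)
toy = pd.DataFrame({
    'gene': ['G1'] * 4 + ['G2'] * 4,
    'age': [30, 40, 50, 60] * 2,
    'log_value': [0.1, 0.4, 0.9, 1.6, 2.0, 1.5, 1.0, 0.2],
})

# Iterating over a groupby yields (group name, sub-dataframe) pairs
corr = {}
for gene, grp in toy.groupby('gene'):
    # spearmanr returns (correlation, p-value); we keep only the correlation
    corr[gene] = spearmanr(grp['age'], grp['log_value'])[0]

corr = pd.Series(corr)
print(corr)
```

Spearman correlation only looks at the ranks of the values, so any strictly increasing relationship gives +1 and any strictly decreasing one gives -1, regardless of the actual shape of the curve.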

### Exercise 2.5

Now find the genes with:
  - highest correlation with *age* in *liver* tissues
  - highest correlation with *BMI* in *adipose* tissues

In [None]:
# type your code here...

Click below to see solution:

In [None]:

print('highest age correlation in liver tissue:', df_corr['liver_age'].idxmax())
print('highest BMI correlation in adipose tissue:', df_corr['adipose_bmi'].idxmax())
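
Keep in mind that `idxmax()` returns the gene with the highest *positive* correlation. The sketch below, on made-up correlation values, also shows two related options: `nlargest()` for a ranked shortlist, and `abs()` if you care about the strongest correlation regardless of sign:

```python
import pandas as pd

# Toy correlation table indexed by gene (values are invented)
toy_corr = pd.DataFrame({'liver_age': [0.2, 0.9, -0.95],
                         'adipose_bmi': [0.7, -0.1, 0.3]},
                        index=['G1', 'G2', 'G3'])

top_gene = toy_corr['liver_age'].idxmax()         # label of the largest value
top_two = toy_corr['liver_age'].nlargest(2)       # the two largest values, sorted
strongest = toy_corr['liver_age'].abs().idxmax()  # largest magnitude, ignoring sign

print(top_gene, strongest)
print(top_two)
```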

### Exercise 2.6

Finally, plot the expression of those two genes as a function of age and BMI in the respective tissues.

> Tip: use *query()* and *plot.scatter()*

In [None]:
# type your code here...

Click below to see solution:

In [None]:

# using subplots just for convenience

import matplotlib.pyplot as plt
f, axs = plt.subplots(1,2, figsize=(6,3))

# the gene names below are the two genes found in Exercise 2.5
df_liver.query('gene == "AC003102.1"').plot.scatter('age', 'value', ax=axs[0])
df_adipose.query('gene == "HIRA"').plot.scatter('bmi', 'value', ax=axs[1])

f.tight_layout()

## Wrap-up

This was a challenging tutorial: some of the exercises were hard and made heavy use of so-called [*"one-liners"*](https://en.wikipedia.org/wiki/One-liner_program).

It is not a problem if you didn't understand all the code, but make sure you read through the tutorial and understand the purpose of the exercises and the biological meaning of the results... 

We will use these data again in the next tutorial 😉