# Practice on MLP
In this practical lesson we will load a real dataset using [`pandas`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), normalize it with a [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), and explore various approaches using [Keras `Dense` layers](https://keras.io/layers/core/#dense).

## What does the dataset contain?

In short, the data are cellular variables labeled as pathogenic or non-pathogenic: we want to predict which class a given datapoint belongs to.

## Loading a dataset with pandas
We will use the Python package [`pandas`](http://pandas.pydata.org/pandas-docs/stable/) to load the dataset. If you need to install it, just run:
```bash
pip install pandas
```
It is standard to import pandas as `pd`.

In [1]:
import pandas as pd

In [16]:
x = pd.read_csv("input.csv", index_col=0)
y = pd.read_csv("output.csv", index_col=0)
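
To see what `index_col=0` does without needing the real files, here is a minimal sketch that reads a tiny CSV from an in-memory string. The row labels, column names and values are made up for illustration and are not taken from `input.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for input.csv:
# the first (unnamed) column holds the row labels.
toy_csv = io.StringIO(
    ",GCContent,rareVar\n"
    "chr1.100,0.62,22\n"
    "chr1.200,0.34,38\n"
)

# index_col=0 turns the first column into the DataFrame index
# instead of treating it as a feature column.
toy = pd.read_csv(toy_csv, index_col=0)

print(toy.shape)        # (rows, feature columns)
print(list(toy.index))  # row labels taken from the first column
```

Without `index_col=0` the labels would end up in an extra column named `Unnamed: 0`, which would then have to be removed before training.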



## Exploring the dataset
In a Jupyter notebook it is extremely simple to visualize a pandas DataFrame:

In [3]:
x

Unnamed: 0,CpGobsExp,CpGperCpG,CpGperGC,DGVCount,DnaseClusteredHyp,DnaseClusteredScore,EncH3K27Ac,EncH3K4Me1,EncH3K4Me3,GCContent,...,fantom5Robust,fracRareCommon,mamPhastCons46way,mamPhyloP46way,numTFBSConserved,priPhastCons46way,priPhyloP46way,rareVar,verPhastCons46way,verPhyloP46way
chr1.55505180,0.96,17.8,61.2,3,73,1000,39.20,15.04,33.68,0.623,...,0,0.759,0.000,-1.844,0,0.106,-1.268,22,0.001,-1.855
chr11.5246715,0.00,0.0,0.0,0,0,0,5.16,26.76,10.00,0.338,...,0,0.826,0.899,2.030,0,0.260,0.459,38,0.652,0.782
chr11.5246717,0.00,0.0,0.0,0,0,0,5.16,26.76,10.00,0.338,...,0,0.826,0.995,0.839,0,0.360,0.459,38,0.984,0.923
chr11.5246718,0.00,0.0,0.0,0,0,0,5.16,26.76,10.00,0.331,...,0,0.826,0.998,0.859,0,0.391,0.459,38,0.987,0.886
chr11.5246718,0.00,0.0,0.0,0,0,0,5.16,26.76,10.00,0.331,...,0,0.826,0.998,0.859,0,0.391,0.459,38,0.987,0.886
chr11.5246720,0.00,0.0,0.0,0,0,0,5.16,26.76,10.00,0.338,...,0,0.826,1.000,2.030,0,0.429,0.454,38,0.985,2.230
chr11.5246796,0.00,0.0,0.0,0,0,0,3.00,26.40,14.20,0.470,...,0,0.844,0.010,0.662,0,0.017,0.454,38,0.003,0.402
chr11.5248257,0.00,0.0,0.0,0,0,0,17.08,22.32,5.00,0.490,...,0,0.841,0.948,0.903,0,0.929,0.650,37,0.176,0.522
chr11.5248269,0.00,0.0,0.0,0,0,0,17.08,22.32,5.00,0.517,...,0,0.841,0.976,2.810,0,0.936,0.650,37,1.000,2.110
chr11.5248280,0.00,0.0,0.0,0,0,0,13.04,22.88,5.00,0.523,...,0,0.822,0.479,1.470,0,0.958,0.650,37,0.055,1.540


### What should I (usually) check in a dataset?

#### NaN density
Most datasets contain some unknown values: for example, a patient has yet to take a blood test, so the corresponding fields in the dataset are set to `NaN`. `NaN` is a special floating-point value that is not equal to any value, including itself, so a comparison with `==` will always return `False`.

In [4]:
import numpy as np

In [5]:
np.nan == np.nan

False

To check if a given value is `NaN` we have to use a function from either [`pandas`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html) or [`numpy`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.isnan.html) as follows:

In [18]:
np.isnan(np.nan), pd.isna(np.nan)

(True, True)

Measuring the `NaN` density:

In [8]:
np.mean(pd.isna(x).values)

0.0

We are lucky! Our dataset contains no missing values!
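
The overall density can hide a single column that is mostly missing, so it can be useful to look at the density per column too. A minimal sketch on a made-up DataFrame (the names `toy`, `a` and `b` are illustrative and not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: column "a" has one missing value out of four, column "b" has none.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [0.5, 0.6, 0.7, 0.8],
})

# Fraction of missing values in each column.
nan_density_per_column = toy.isna().mean()

# Fraction of missing values over the whole table
# (the same quantity computed in the cell above).
nan_density_total = np.mean(pd.isna(toy).values)

print(nan_density_per_column)
print(nan_density_total)
```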

#### Data variance
Some columns could have extremely low variance once normalized by the mean (even 0, for constant columns), which would make them nearly useless for either training or testing a model:

In [9]:
variances = (x/x.mean()).var()

In [10]:
variances

CpGobsExp               187.790483
CpGperCpG               188.264205
CpGperGC                184.221247
DGVCount                  2.948082
DnaseClusteredHyp        20.708281
DnaseClusteredScore       8.568552
EncH3K27Ac               11.102579
EncH3K4Me1                2.213287
EncH3K4Me3                9.800345
GCContent                 0.056654
GerpRS                   49.061068
GerpRSpv                510.933610
ISCApath                  2.341130
commonVar                 0.414103
dbVARCount                2.948082
fantom5Perm            2486.381294
fantom5Robust          2145.723258
fracRareCommon            0.019075
mamPhastCons46way        14.844264
mamPhyloP46way            1.093232
numTFBSConserved        135.660186
priPhastCons46way         7.110275
priPhyloP46way            0.634043
rareVar                   0.086971
verPhastCons46way        12.481429
verPhyloP46way            1.190950
dtype: float64

Let's look for the columns with a variance below $0.05$:

In [11]:
variances[variances<0.05]

fracRareCommon    0.019075
dtype: float64

We have now identified a column with variance so low that we can consider discarding it:

In [13]:
x = x.drop(columns=variances[variances<0.05].index)
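
Dividing by the mean before taking the variance makes the measure independent of the scale of each column. The sketch below illustrates this on a toy DataFrame; the column names are hypothetical, and the same $0.05$ threshold is reused only for illustration.

```python
import pandas as pd

# Toy frame: one constant column and two columns that differ only in scale.
toy = pd.DataFrame({
    "constant": [5.0, 5.0, 5.0, 5.0],
    "small_scale": [1.0, 2.0, 3.0, 4.0],
    "large_scale": [1000.0, 2000.0, 3000.0, 4000.0],
})

# The raw variance depends heavily on the scale of each column...
raw = toy.var()

# ...while the variance of the mean-normalized columns does not.
normalized = (toy / toy.mean()).var()

# Drop the columns whose normalized variance is below the threshold.
low_variance = normalized[normalized < 0.05].index
filtered = toy.drop(columns=low_variance)

print(raw)
print(normalized)
print(list(filtered.columns))
```

Note that this normalization assumes the column mean is not close to zero: for columns centered around zero, the division inflates the result and a different scaling may be preferable.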

#### Covariance and correlation
Another important tool is the correlation coefficient between pairs of columns, since highly correlated columns add little information to the training dataset:

In [14]:
correlations = x.corr()
correlations

Unnamed: 0,CpGobsExp,CpGperCpG,CpGperGC,DGVCount,DnaseClusteredHyp,DnaseClusteredScore,EncH3K27Ac,EncH3K4Me1,EncH3K4Me3,GCContent,...,fantom5Perm,fantom5Robust,mamPhastCons46way,mamPhyloP46way,numTFBSConserved,priPhastCons46way,priPhyloP46way,rareVar,verPhastCons46way,verPhyloP46way
CpGobsExp,1.0,0.987035,0.979756,0.014144,0.373929,0.243532,0.139637,0.028156,0.382765,0.196283,...,0.011543,0.00475,0.036744,0.012239,0.035629,0.028231,0.008427,0.021246,0.035059,0.014576
CpGperCpG,0.987035,1.0,0.990182,0.014277,0.37462,0.245349,0.138654,0.028546,0.384098,0.198141,...,0.011942,0.004506,0.03437,0.011542,0.034591,0.026364,0.007892,0.020623,0.032689,0.013674
CpGperGC,0.979756,0.990182,1.0,0.016431,0.368252,0.242628,0.137141,0.030017,0.374522,0.197949,...,0.011191,0.004312,0.033383,0.010717,0.033638,0.025863,0.007744,0.022931,0.03273,0.01302
DGVCount,0.014144,0.014277,0.016431,1.0,-0.005671,-0.007249,-0.009769,-0.019347,-0.002213,0.018098,...,-0.001252,-0.002103,-0.01428,-0.020333,-0.010316,-0.022123,-0.015849,0.00117,-0.003267,-0.019308
DnaseClusteredHyp,0.373929,0.37462,0.368252,-0.005671,1.0,0.788695,0.321137,0.328771,0.405385,0.225107,...,0.06955,0.058719,0.054005,0.045076,0.046812,0.05868,0.028996,0.029058,0.040326,0.04671
DnaseClusteredScore,0.243532,0.245349,0.242628,-0.007249,0.788695,1.0,0.288488,0.38554,0.293971,0.269701,...,0.060179,0.05786,0.058014,0.060543,0.050594,0.071325,0.038185,0.047818,0.044045,0.063083
EncH3K27Ac,0.139637,0.138654,0.137141,-0.009769,0.321137,0.288488,1.0,0.531889,0.502775,0.131596,...,0.064824,0.051893,0.012873,0.036272,0.016798,0.020483,0.02696,0.025423,0.006862,0.036888
EncH3K4Me1,0.028156,0.028546,0.030017,-0.019347,0.328771,0.38554,0.531889,1.0,0.217845,0.202343,...,0.04936,0.047618,0.01394,0.075608,0.022714,0.035572,0.052691,0.011948,0.005625,0.077261
EncH3K4Me3,0.382765,0.384098,0.374522,-0.002213,0.405385,0.293971,0.502775,0.217845,1.0,0.140207,...,0.024266,0.016329,0.022408,0.03415,0.024186,0.027498,0.024898,0.023909,0.01409,0.034703
GCContent,0.196283,0.198141,0.197949,0.018098,0.225107,0.269701,0.131596,0.202343,0.140207,1.0,...,0.013568,0.011052,-0.03416,-0.119984,-0.003642,-0.053416,-0.066847,0.17714,-0.022,-0.120346


Some columns have high absolute correlation, so we will proceed to remove them. Which column of each group should we keep? It usually doesn't matter, especially when the correlation coefficient is very high.

In [15]:
# Drop every column whose absolute correlation with a later column exceeds 0.5
x = x.drop(columns=x.columns[np.any(np.triu(correlations.abs()>0.5, k=1), axis=1)])
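
To make it easier to see which columns this kind of filter keeps, here is a sketch on synthetic data. The column names are hypothetical, and the absolute value is taken so that strongly anti-correlated columns are caught as well.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

toy = pd.DataFrame({
    "a": base,
    "a_copy": base * 2.0 + 1.0,     # perfectly correlated with "a"
    "a_neg": -base,                 # perfectly anti-correlated with "a"
    "noise": rng.normal(size=200),  # independent of the other columns
})

correlations = toy.corr()

# Upper triangle without the diagonal (k=1): entry [i, j] with j > i is True
# when column i is highly correlated with a later column j.
high = np.triu((correlations.abs() > 0.5).to_numpy(), k=1)

# A column is dropped if it is highly correlated with any later column,
# so the last column of each correlated group is the one that survives.
to_drop = toy.columns[np.any(high, axis=1)]
filtered = toy.drop(columns=to_drop)

print(list(to_drop))
print(list(filtered.columns))
```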

#### Are the classes balanced?
We now check whether the two classes (pathogenic and non-pathogenic) are balanced, that is, whether they contain roughly the same number of datapoints:

In [25]:
np.mean(y==1)

x    0.000036
dtype: float64

The two classes are **strongly** unbalanced! We will need some special approaches to avoid overfitting to the class with greater cardinality.
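
One common approach, shown here only as a possibility and not necessarily the one intended by this lesson, is to weight each class inversely to its frequency. The sketch below uses `compute_class_weight` from scikit-learn on made-up labels; the array `y_toy` is hypothetical and far less unbalanced than the real dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 990 negatives and 10 positives.
y_toy = np.array([0] * 990 + [1] * 10)

classes = np.unique(y_toy)

# "balanced" assigns each class the weight n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Dictionary mapping each class label to its weight.
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

print(class_weight)
```

A dictionary of this form can be passed to the `class_weight` argument of a Keras model's `fit` method, so that errors on the rare class count more in the loss.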

## Splitting and normalizing a dataset
You should know how to do this by yourself! Let's see what you can do! You have 10 minutes!

In [19]:
# Task: Split the dataset
# Use this cell to split the input and output into training and testing
# Tip: remember to import train_test_split from sklearn.model_selection

In [20]:
# Use this cell to normalize the data.
# Tip: remember to import MinMaxScaler from sklearn.preprocessing

## Creating the model
You should know how to do this by yourself! Let's see what you can do! You have 10 minutes!

In [26]:
# Task: create your own MLP
# Use this cell to create your own model
# Tip 1: remember to import Sequential, Dense and InputLayer from keras
# Tip 2: the input size can be determined by x.shape[1]
# Tip 3: remember to set the random seed!

## Training the model
You should know how to do this by yourself! Let's see what you can do! You have 10 minutes!

In [28]:
# Task: train your MLP
# Tip: Try using large batch sizes, not 16 or 32 but closer to 4000.
# Extra points: Can you figure out why a bigger batch size can lead to better results in our task?

## Exploring more models

### Dropout layers
[`Dropout`](https://keras.io/layers/core/) is a special kind of layer that, during training, randomly turns off some neurons with a given probability: this is useful to reduce the risk of overfitting. Can you include this layer in your model?

In [29]:
# Task: include at least one dropout layer in your model.
# Tip: remember to import the Dropout layer from Keras
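
To build some intuition for what a dropout layer computes, here is a small NumPy sketch of the "inverted dropout" scheme, in which the surviving units are rescaled by `1 / (1 - rate)`. The function `dropout_forward` is a hypothetical helper written for illustration; it is not Keras code and it does not replace the `Dropout` layer in the task above.

```python
import numpy as np


def dropout_forward(activations, rate, rng, training=True):
    """Zero each unit with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        # At inference time the layer leaves its input unchanged.
        return activations
    # Each unit is kept independently with probability 1 - rate.
    keep_mask = rng.random(activations.shape) >= rate
    # Rescaling keeps the expected value of each unit unchanged.
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(42)
activations = np.ones((1000, 100))

dropped = dropout_forward(activations, rate=0.5, rng=rng)

print((dropped == 0).mean())  # fraction of zeroed units, close to the rate
print(dropped.mean())         # mean activation, close to the original mean
```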

## Free experimentation: what do you want to try?
These were the basics of Keras MLPs: we will now use the remaining time to let you experiment with Keras, and we will try to answer any questions you have about the models you encounter.