<a href="https://colab.research.google.com/github/MaidinuerSaimi/Python-courses/blob/main/Day_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop 1.5 -- data exploration with Matplotlib

## Introduction

The topic of today's workshop is **data exploration**.  This important topic tends to be an under-appreciated part of
data science.  Although there are technical skills involved (mainly slicing and dicing your Pandas dataframe
and knowing your way around the plotting functions), it is more accurately described as something between an art and a science, and it crucially involves your knowledge of the problem at hand.  You are looking out for *unexpected* features in your data, due to unforeseen issues in data collection, selection of your subjects, errors, data handling or conversion problems, or perhaps even unexpected but real phenomena in your data.

### Expect the unexpected!
Because you're looking for the unexpected, there is no fool-proof recipe for doing this.  However, in general it is a good idea to spend time *visualising* your data.  Doing so increases your chances of identifying issues, which allows you to clean up your data, and you will also learn about the process you're actually interested in, which will enable you to make better choices when you're building your model.

### Why do this?
A few bad apples spoil the bunch.  Outliers can have a major influence on the final model.  If your data has missing values, and you don't realize it and deal with it, your predictions can become unreliable.  If some columns have very skewed distributions, your model may end up all but ignoring these columns. In practice, time spent on data cleaning and pre-processing often brings more benefit than time spent on the modeling itself.

### Appearances matter
To properly explore the data it will be necessary to fine-tune your plots, because different settings (including ones you may think of as purely aesthetic, such as color schemes) can make a substantial difference in how informative the plot is -- but equally it is possible for artefacts to dominate the plot, or to emphasize an unimportant aspect.  Again, this is an art, and practice makes perfect - don't expect a recipe, but things in this workshop will get you started.

### Real data, real issues
We are going to focus on a real data set of over 40,000 physical measurements of aortic and pulmonary heart valve diameters of healthy donors, made by a North American company.  Other characteristics of the donors (sex, height, weight, age) are also available.  This allows us to build a model predicting these heart valve diameters from their donor's height etc., which is helpful to predict the required diameter when someone needs a heart valve replacement.

(The data is anonymized and slightly modified, so don't use the models you build today if you ever need a new valve)

## Workshop

Run the appropriate code below to load the dataset.

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
## If you're running this on Google Colab, use this to upload your file:

from google.colab import files
files.upload();  ## Upload 'Heart valve dissection data.csv'

In [None]:
## Load your data.
## (If you're running Jupyter and the data is not in the directory where you started Jupyter, modify the path below)

df = pd.read_csv("Heart valve dissection data.csv")

Use `df.info()` and `df.describe()` to get a first sense of the data.
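
As a minimal sketch of what these calls report, here they are on a tiny made-up dataframe. The column names follow the workshop text, but the values are invented for illustration, and `isna().sum()` is added as one possible way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real file; values are made up for illustration.
toy = pd.DataFrame({
    "Age": [34, 51, np.nan, 8],
    "Height_cm": [172.0, 165.0, 180.0, np.nan],
    "AVD": [22.5, np.nan, 24.0, 14.0],
})

toy.info()                # dtypes and non-null counts per column
print(toy.describe())     # summary statistics of the numeric columns

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real data, the non-null counts from `info()` and the min/max rows of `describe()` are often the quickest hints of missingness and implausible values.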

<div class="alert alert-info">
    
## Question 1
    
- 1) What are the target variable(s), what are predictive variables?
- 2) What do the columns mean?  What are their units? Which columns should we ignore when building a model?
- 3) Is there any missingness?
- 4) Do any variables require pre-processing?

In [None]:
# your code here

Use `df.drop` to drop the first (unnamed) index column, and the "Id" column.  Then remove all rows with missing AVD values - we want to predict the aortic valve diameter, so imputation is not an option.
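
A minimal sketch of this step on a toy dataframe follows. It assumes the unnamed index column was loaded as "Unnamed: 0" (the name pandas normally assigns to a CSV column with an empty header); check `df.columns` for the actual name in your file.

```python
import numpy as np
import pandas as pd

# Toy frame; "Unnamed: 0" is an assumed column name, the values are made up.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Id": ["a", "b", "c", "d"],
    "Height_cm": [172.0, 165.0, 180.0, 120.0],
    "AVD": [22.5, np.nan, 24.0, 14.0],
})

rows_before = len(toy)

# Drop the two bookkeeping columns, then drop rows where the target is missing
cleaned = toy.drop(columns=["Unnamed: 0", "Id"]).dropna(subset=["AVD"])

rows_after = len(cleaned)
print(rows_before, rows_after)
```

Using `subset=["AVD"]` keeps rows that are missing only predictor values, which could still be imputed later.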

<div class="alert alert-info">
    
## Question 2
    
How many rows does the dataframe have before and after removal of records with missing AVD values?

In [None]:
# your code here

In [None]:
# your code here

Convert the values in the column "dissection_date" to pandas' datetime format, using the `pd.to_datetime` function.
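
The conversion can be sketched as follows on a toy column. The ISO-style date strings are an assumption; the real file may use a different format, in which case the `format` argument of `pd.to_datetime` may be needed.

```python
import pandas as pd

# Toy column with assumed ISO-formatted date strings
toy = pd.DataFrame({"dissection_date": ["2010-03-15", "2011-07-01", "2012-12-31"]})

# Replace the string column by a proper datetime column
toy["dissection_date"] = pd.to_datetime(toy["dissection_date"])

print(toy["dissection_date"].dtype)
years = toy["dissection_date"].dt.year.tolist()  # the .dt accessor now works
print(years)
```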

Then use `sns.histplot` to plot histograms of the data in the five columns "Age", "Height_cm", "Weight_kg", "dissection_date" and "AVD".  Use `plt.show()` after each plot, otherwise you will only see the last one.

<div class="alert alert-info">
    
## Question 3
    
Do these histograms show up any obvious issues?  What is the reason for the jagged appearance and/or white gaps in some plots?  (Is it the plotting, or is this a feature of the data?)  Try plotting again, but now setting one of `bins=100`, `discrete=True`, `binwidth=2`.  Can you explain why the plots look different now?

In [None]:
# your code here

Make a separate histogram of height, using only donors with a height between 50 and 150 cm.  Use `binwidth=1`.
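
Selecting the rows for such a plot could look like the sketch below, shown on made-up heights. `Series.between` is one option among several and is inclusive at both ends by default.

```python
import pandas as pd

# Toy heights in cm, made up for illustration
toy = pd.DataFrame({"Height_cm": [45.0, 50.0, 61.0, 122.0, 150.0, 175.0]})

# Keep only rows with 50 <= Height_cm <= 150
subset = toy[toy["Height_cm"].between(50, 150)]
selected = subset["Height_cm"].tolist()
print(selected)
```

The resulting `subset` can then be passed as the `data` argument of the plotting function.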

<div class="alert alert-info">
    
## Question 4
    
What do you think is the explanation for the peaks at ~61cm, ~92cm and ~122cm?  Could this be a problem when we build a model using these data?  What could you do to mitigate this issue?

In [None]:
# your code here

A powerful way to visualise your data is using scatterplots -- these plots can convey a lot of information, and
show how two variables are related.  Look at the manual page and tutorials in Matplotlib for the many options and
different plots you can make.

Let us start with `sns.scatterplot` of "Age" and "Height_cm", with default parameters.

In [None]:
# your code here

Notice that the dataset is too large for this standard plot; you get no sense of where the highest density is,
because points are plotted over one another.  A simple first way to deal with this is decreasing the point size,
with parameter `s=2`.

In addition, the "Age" variable is discrete, so all points lie on vertical lines.  In fact, "Height_cm" is also
(mostly) discretized.  Jittering, which means adding a small amount of random noise, is the standard way to address this.

Do this by using `df["Age"] + np.random.normal(0, 0.3, len(df))` as your `x` parameter, and similarly for the "Height_cm" variable for `y`.
The `np.random.normal` function generates `len(df)` normally-distributed numbers with mean 0 and
standard deviation 0.3.
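
To see what jittering does to a discrete column, here is a small sketch on made-up ages. It uses a seeded `np.random.default_rng` generator instead of `np.random.normal` purely so the result is reproducible; that choice is an assumption of this example, not a requirement.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, for reproducibility only

# Toy ages with many ties, as in a discrete variable
toy = pd.DataFrame({"Age": [20, 20, 20, 35, 35, 60]})

# Add zero-mean normal noise with standard deviation 0.3 to each value
jittered_age = toy["Age"] + rng.normal(0, 0.3, len(toy))

# Before: 3 distinct values; after: the ties are broken
print(toy["Age"].nunique(), jittered_age.nunique())
```

The jittered values are for plotting only; the original column stays untouched for modelling.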

In [None]:
# your code here

This starts to show some features -- you can see the heights rounded to the nearest foot
that you identified using the histogram.  Still, there seems to be overplotting going on.  You can make
the points transparent to get a better sense of the density.  Make the same plot but now adding the keyword
`alpha=0.2`.
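
As a quick check of the rounding to whole feet mentioned above, the conversion below shows where heights of exactly 2, 3 and 4 feet land in centimetres. Only the conversion factor is a fact (one international foot is 30.48 cm); the choice of 2, 3 and 4 feet is simply the range relevant to the earlier histogram.

```python
CM_PER_FOOT = 30.48  # exact, by definition of the international foot

for feet in (2, 3, 4):
    print(feet, "ft =", round(feet * CM_PER_FOOT, 2), "cm")

heights_cm = [feet * CM_PER_FOOT for feet in (2, 3, 4)]
```

These values sit close to the peaks seen in the height histogram.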

In [None]:
# your code here

Instead of a scatterplot, you can achieve a similar overview using a 2D histogram.  This gives you more
flexibility for very large data sets.  Start by making a `sns.histplot` for the same variables, and defaults
for all keyword parameters.

In [None]:
# your code here

It's recognizably the same shape, but the default bin size for height is far too small.
Run again with keywords `bins=(50,50)`, and also include `cbar=True` to add a legend so you can
relate the color to the actual frequency.

In [None]:
# your code here

That's much better, but it's also obvious that there is a huge density at the bottom left, and the plot
is otherwise dominated by much lower counts that are not well separated visually.  To see more detail in
both low and high frequency regions, we can do a log transform on the counts.  In the `histplot` this
is called "normalization" (don't ask me), so add the keyword `norm=mpl.colors.LogNorm()`.
                           
(Seaborn provides default values for `vmin` and `vmax`, but these don't play well with LogNorm,
 so also add `vmin=None, vmax=None` to override these defaults, otherwise you get errors.)
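
To get a feel for what `LogNorm` does to the counts, the sketch below applies it directly to a few made-up values. Here the limits are set on the norm itself to make the mapping explicit; in the plot, leaving them unset lets the norm pick them up from the data.

```python
import matplotlib as mpl
import matplotlib.colors  # make sure the colors submodule is loaded
import numpy as np

counts = np.array([1, 10, 100, 1000])  # made-up bin counts

# Map counts to the interval [0, 1] on a logarithmic scale
norm = mpl.colors.LogNorm(vmin=counts.min(), vmax=counts.max())
scaled = np.ma.getdata(norm(counts))
print(scaled)
```

Each factor of ten in the counts takes up the same share of the color scale, so low-count and high-count regions both remain distinguishable.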

In [None]:
# your code here

That's a lot better already.  Still, the default color scheme isn't great.  Also, some vertical stripes are visible,
which are caused by the discretization of "Age": the bins sometimes catch different numbers of age categories.
Instead we can set the bin width directly.

Add the keywords `binwidth=(1,3)` (and remove `bins=(50,50)`), and also add `cmap="RdYlGn_r"` to choose a reversed
Red-Yellow-Green color map.
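
The striping effect can be reproduced without any plotting. The sketch below uses a toy set of integer ages, one donor per age (an assumption made to keep the expected counts obvious), and compares 50 equal-width bins with bins of width 1.

```python
import numpy as np

ages = np.arange(0, 90)  # toy data: one donor at each integer age 0..89

# 50 bins over 0..89 are 1.78 wide, so some catch one age and some catch two
counts_50, _ = np.histogram(ages, bins=50)

# Bins of width 1 catch exactly one age each
counts_unit, _ = np.histogram(ages, bins=np.arange(0, 91))

print(counts_50.min(), counts_50.max())
print(counts_unit.min(), counts_unit.max())
```

With perfectly uniform data the 50-bin histogram still alternates between counts of 1 and 2, which is the artefact that shows up as stripes.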

In [None]:
# your code here

Now, make the same plot, but include female donors only.  Do this by replacing `df` by `df[df["Sex"] == "female"]`
in the invocation of `histplot`.

<div class="alert alert-info">
    
## Question 5
    
Do you see anything strange in this plot? Look at the height distribution.  Can you give an argument whether or not this is likely to be an artefact, or real?
    

In [None]:
# your code here

It is tedious to make these plots for all pairs of variables.  Seaborn has a handy function, `pairplot`, to do this for
you.  It has three main parameters: `data` for the dataframe, `vars` for a list of variables, and `kind` to select the kind of
plot you need.

Run `sns.pairplot` with `vars=["Age","Height_cm","Weight_kg","AVD"]` and `kind="hist"`.

In [None]:
# your code here

That looks similar to our initial plot before our tweaks.  We can tweak `pairplot` in the same way
by giving it keywords, through the `plot_kws` parameter which expects a dictionary of keywords and values.

Run the command above again but now adding `plot_kws={"bins":50,"cmap":"RdYlGn_r","vmin":None,"vmax":None,"norm": mpl.colors.LogNorm()}`

<div class="alert alert-info">
    
## Question 6

Can you recognize the two issues we identified before?  Do you see any more potential issues?

In [None]:
# your code here

## Batch effects

Batch effects are a common issue with experimental data.  They occur because the conditions in which data are collected
change over time, or because of slight differences in the protocol used by different experimentalists or at
different locations (e.g. different hospitals each collecting part of the data).

Here all we know is that the data were collected by a single company, but over several years.  Let's plot some variables
against the dissection date to ensure no systematic problems occurred in some time periods.

Use a `sns.histplot` to plot "AVD" against "dissection_date".  Use parameters
`bins=(50,50), cbar=True, norm=mpl.colors.LogNorm(), vmin=None, vmax=None, cmap="RdYlGn_r"`.

<div class="alert alert-info">
    
## Question 7
    
Do you see any unexpected features?

In [None]:
# your code here

## Residual plots

Residual plots are a powerful way to identify issues in your data - and potential model misfits.  

The residual is the difference between a model prediction and the true value.  (The concept makes sense only
for regression, as the "difference" between two classes is meaningless.)
These residuals should be as small as possible.  If there is any systematic relation between the residuals and
the explanatory variables, or between residuals and variables that should **not** matter
(such as the day of the week), this is an indication of a problem.

Residuals are also helpful for identifying outliers -- points where the "true" and predicted values are very different.  These may be mistakes in your data (either outcome or predictors).

To make a residual plot, you need a model.  For the purpose of data exploration, let's keep it simple and
fit a `sklearn.linear_model.LinearRegression` model to the data (variables height, weight and age, target "AVD").
After running `fit`, run `predict` on the training data (again, for our current purpose we do not worry about
overtraining), and compute the difference of the target and the prediction.  Make this a new column named "residual" in your dataframe (`df['residual'] = ...`).

Then, make a `histplot` of "residual" against "dissection_date", using the color scheme as above, and `bins=(100,25)`.
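
The residual computation itself can be sketched on synthetic data as below. To keep the example dependency-free it fits the ordinary least-squares model with `np.linalg.lstsq` rather than with `sklearn.linear_model.LinearRegression`; both solve the same least-squares problem. The data, the coefficients used to generate it, and the name `toy` are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic predictors (made up, not the workshop data)
toy = pd.DataFrame({
    "Age": rng.integers(1, 80, n).astype(float),
    "Height_cm": rng.normal(165, 15, n),
    "Weight_kg": rng.normal(70, 12, n),
})
# Made-up linear relation plus noise for the target
toy["AVD"] = 5 + 0.1 * toy["Height_cm"] + 0.02 * toy["Age"] + rng.normal(0, 1, n)

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([np.ones(n), toy[["Height_cm", "Weight_kg", "Age"]].to_numpy()])
coef, *_ = np.linalg.lstsq(X, toy["AVD"].to_numpy(), rcond=None)

# Residual = target minus prediction on the training data
toy["residual"] = toy["AVD"] - X @ coef
print(toy["residual"].describe())
```

Because the model includes an intercept, the residuals average to zero by construction; what matters for exploration is whether they show structure when plotted against other variables.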

<div class="alert alert-info">
    
## Question 8
    
Is this the same issue you saw in question 7?  Will this be an issue for building a model using these data?
Can you suggest some strategies to mitigate this issue?

In [None]:
# your code here