```python
import pandas as pd
import plotly.express as px
```

# Exercise

Now that we have learned some of the basics of Python, we should practice using this new superpower. We have prepared a loosely guided exercise that focuses on data exploration and visualization with two example datasets, one on strokes and one on cirrhosis. You can also explore a dataset of your own choosing, though the questions were written with the example datasets in mind.

The [metadata](https://www.kaggle.com/datasets/fedesoriano/cirrhosis-prediction-dataset) for the cirrhosis dataset describes what each column means. For the stroke data, the meaning of the columns is more straightforward.

## 1. Data Loading

We will start with the **stroke** dataset. You can find it in the GitHub repository under Exercise/datasets. Load the stroke data into Colab using one of the two approaches detailed below and assign it to a variable named `data`.

### 1.1 Loading the data

**Option 1:**

Use the pandas CSV reader with a link to the data on GitHub. To do this, go to the GitHub repository, find the stroke dataset and click on the 'raw' button. Copy that link and pass it as the file path to the pandas CSV reader.

... or

**Option 2:**

Manually load the dataset into colab and then read it with the pandas csv reader. See steps below:

1. go to the left side bar and click on the folder icon
2. click on data upload
3. select dataset from your computer
4. call pandas csv reader with the name of the dataset
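
Both options end in the same call, `pd.read_csv`, which accepts a URL, a local file name, or a file-like object. The sketch below uses a small in-memory CSV so that it runs anywhere; its contents and column names are made up for illustration and are not the real stroke data.

```python
import io

import pandas as pd

# Toy CSV text standing in for the real file (columns are hypothetical).
csv_text = """patient_id,age,sex,outcome
1,67,Male,1
2,54,Female,0
3,80,Female,1
"""

# Option 1 would pass the raw GitHub link you copied as a string.
# Option 2 would pass the name of the file you uploaded to Colab.
# Here a file-like object is used instead so the example is self-contained.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)
print(data.head())
```

With the real dataset, only the argument to `pd.read_csv` changes.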

### 1.2 First look

Have a first look at the data. Pandas has some neat built-in methods for getting an initial understanding of a dataframe, e.g. `df.info()` and `df.describe()`.

Questions you might want to answer here:
- What different types of columns do you have?
- Is there a column that describes a variable that can be understood as an 'outcome'? Which one?
- How many values does each variable, i.e. column, have and what are some preliminary statistics of the features? (tip: use pandas `describe` function)
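
As a rough illustration of what these methods return, here is a minimal sketch on a toy dataframe. The column names and values are invented for the example; your real data will look different.

```python
import pandas as pd

# Toy dataframe (hypothetical columns, not the real stroke data).
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [67.0, 54.0, 80.0, 49.0],
    "sex": ["Male", "Female", "Female", "Male"],
    "outcome": [1, 0, 1, 0],
})

# Prints column names, non-null counts and dtypes.
df.info()

# Summary statistics; by default only numeric columns are included.
summary = df.describe()
print(summary)

# For a string column, counting the distinct values is often more useful.
print(df["sex"].value_counts())
```

Note that `describe()` treats the numerically coded `outcome` column like any other number, which is one reason to convert it later.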



It helps to know which column is the outcome variable. In the stroke dataset (and many others!) the outcome variable is coded as a numerical variable. However, during analysis it should be interpreted as categorical.

Identify the outcome column and change its type to "category" by using `astype()`. You can see an example in the [pandas user guide on categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html). Remember to save your changes!

Then, use `info()` on the dataframe again. Has it changed?
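
The conversion pattern can be sketched as follows, assuming a toy dataframe with a hypothetical column called `outcome`. The key point is that `astype()` returns a new Series, so the result has to be assigned back.

```python
import pandas as pd

# Toy dataframe; "outcome" is a hypothetical column name.
df = pd.DataFrame({"age": [67, 54, 80, 49], "outcome": [1, 0, 1, 0]})
print(df["outcome"].dtype)  # an integer dtype before the conversion

# astype() does not modify the column in place, so assign the result back.
df["outcome"] = df["outcome"].astype("category")

print(df["outcome"].dtype)
print(df["outcome"].cat.categories)
```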

## 2. Exploratory analysis

Get to know your data better. A good first step is to inspect it visually with some plots.


### 2.1 Violin plots and histograms

Consider the dataframe columns that are not the outcome variable. How are the measurements distributed?

To study the distributions we want to make **violin plots** of variables, i.e. data columns, that are numeric and **histograms** of the variables that are strings/categorical.

To check for the data type of a column, have a look at how data types are specified in `dataframe.dtypes`. Then check the data type of each column. Remember a column is a **pandas `Series`**, so it has a `dtype` attribute instead (only dataframes have `dtypes`!).
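
One possible way to tell numeric columns from the rest is the helper `is_numeric_dtype` from `pandas.api.types`. The sketch below assumes a toy dataframe with invented column names.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy dataframe (hypothetical columns).
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [67.0, 54.0, 80.0, 49.0],
    "sex": ["Male", "Female", "Female", "Male"],
    "outcome": pd.Categorical([1, 0, 1, 0]),
})

# The dataframe has `dtypes` (plural): one entry per column.
print(df.dtypes)

# A single column is a Series and has `dtype` (singular).
for name in df.columns:
    column = df[name]
    kind = "numeric" if is_numeric_dtype(column) else "not numeric"
    print(name, column.dtype, kind)

numeric_columns = [name for name in df.columns if is_numeric_dtype(df[name])]
print(numeric_columns)
```

A categorical column is reported as not numeric even when its categories are numbers, which is convenient once the outcome has been converted.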

You can start by figuring out how to make a plot of the data in one column. Once you have that, make one plot for each column that is numeric or a string (except the outcome). This is a repetitive task, so it is ideally suited for a loop. Remember to use `fig.show()` to actually display your plots during the loop.

**Pro version**: Some columns are not actually explanatory variables, such as the ID column. You can identify these columns e.g. by checking whether each of their values is unique (this would be very unlikely for a measured variable). Skip them when making the plots.


### 2.2 Correlation coefficients

Plot the correlation coefficients of all numerical features:

1. Use the method `corr()` on the dataframe. What is the result?

2. Now use a heatmap to show the correlation coefficients graphically.

3. Try some different options to make your heatmap look nicer.

### 2.3 Scatter plot

Make a scatter plot of the two variables with the highest correlation. Divide the plots by the outcome variable and add marginal plots and a trendline:

1. Find the pair of variables that has the highest correlation with each other and make a scatter plot of them.

2. Divide the scatter plot into two by the outcome variable. Have a look at ``facet`` and the visualization lecture if you have trouble.

3. Add marginal distributions and a trendline.

## 3. Data cleaning

Now, we switch to the [cirrhosis dataset](https://raw.githubusercontent.com/Center-for-Health-Data-Science/PythonTsunami/spring2022/Exercise/datasets/cirrhosis.csv).

We will investigate what data is missing and try to impute it.

A word of caution:

Note that imputation is a __complex subject__. Whether it makes sense at all, and which method to use, depends heavily on the dataset. Sometimes, the mean of a variable across all non-missing observations is a good approximation for a missing value. On the other hand, if a column records whether a person was treated with the drug or the placebo, we have no good way to guess which treatment that person received. Replacing missing values in this column with the most common value (which is that they did get the drug) will produce extremely __wrong data__ and lead you to __wrong conclusions__. Do not do that!


### 3.0 Load the data

Load the cirrhosis dataset using one of the two methods you used earlier for the stroke data. Also change the type of the outcome variable to "category".

### 3.1 Missing data

1. Use the pandas method `isnull`.

2. Get the number of missing values per column by calling `sum()` on the result of `isnull`. Which features, i.e. columns, have missing values?

3. Make a barplot that shows the number of missing values per column.


### 3.2 Omitting observations with missing values

1. Create a subset in which you omit all patients, i.e. rows, which have missing values in any column. Take care to not overwrite the original dataframe. If you did, you can re-import it.

2. How many observations, i.e. patients, would you be left with if you removed all missing values?

3. How many if you only omit patients where the outcome is missing?
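
A minimal sketch using `dropna()` on toy data is shown below (invented columns, with `outcome` as a hypothetical outcome column). Because `dropna()` returns a new dataframe, storing the result under a new name leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical columns).
df = pd.DataFrame({
    "age": [50, 61, np.nan, 45, 70],
    "drug": ["placebo", None, "drug", "drug", "placebo"],
    "outcome": ["C", "D", "C", None, "D"],
})

# Drop every row that has a missing value in any column.
complete_cases = df.dropna()

# Drop only the rows where the outcome itself is missing.
outcome_known = df.dropna(subset=["outcome"])

print(len(df), len(complete_cases), len(outcome_known))
```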


### 3.3 Effects of removing data

We can now have a look at how removing NaNs affects the data.


1. First, plot the correlation coefficients between all numerical columns in the original cirrhosis dataframe (analogous to 2.2).

2. Now, remake the plot for the subset where you have removed all rows with any missing data. Have the correlations changed?
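
Besides comparing the two heatmaps by eye, you could also subtract one correlation matrix from the other. The sketch below does this on toy data with illustrative column names. It relies on the fact that `corr()` excludes missing values pair by pair, so the two matrices are generally computed from different sets of rows.

```python
import numpy as np
import pandas as pd

# Toy dataframe with missing values (illustrative columns and values).
df = pd.DataFrame({
    "age": [50, 61, 58, 45, 70, 66],
    "bilirubin": [1.1, 3.4, np.nan, 0.9, 4.0, 2.5],
    "albumin": [3.9, 3.1, 3.3, np.nan, 2.6, 3.0],
})

# Each coefficient uses all rows where both columns of the pair are present.
corr_all = df.corr()

# Here only rows without any missing value are used.
complete_cases = df.dropna()
corr_complete = complete_cases.corr()

# Entries far from zero mark the pairs most affected by dropping rows.
difference = corr_complete - corr_all
print(difference.round(3))
```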

### 3.4 Imputation

Use the method `fillna()` to impute missing values in the columns **where it makes sense**. Have a look at the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

1. A good way to impute numerical data can be to use e.g. the mean or the median. Calculate the mean for all numerical columns.

2. Perform the imputation.

3. Re-make the barplot from 3.1. to check that it worked.

4. Recalculate correlation coefficients between all numerical columns and show it in a heatmap.
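
Mean imputation of the numeric columns could be sketched like this, again on toy data with invented columns. Passing a Series of column means to `fillna()` fills each column with its own mean. The non-numeric `drug` column is deliberately left alone, in line with the word of caution above.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical columns).
df = pd.DataFrame({
    "age": [50.0, 60.0, np.nan, 40.0],
    "bilirubin": [1.0, np.nan, 3.0, 2.0],
    "drug": ["placebo", None, "drug", "drug"],
})

numeric_columns = df.select_dtypes(include="number").columns
column_means = df[numeric_columns].mean()
print(column_means)

# Work on a copy so the original dataframe keeps its missing values.
imputed = df.copy()
imputed[numeric_columns] = imputed[numeric_columns].fillna(column_means)

print(imputed.isnull().sum())
```

Swapping `mean()` for `median()` gives median imputation with the same pattern.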
