This material has been adapted by @dcapurro from the Jupyter Notebook developed by:

Author: [Yury Kashnitsky](https://yorko.github.io). Translated and edited by [Christina Butsko](https://www.linkedin.com/in/christinabutsko/), [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina), Sergey Isaev and [Artem Trunov](https://www.linkedin.com/in/datamove/). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.


## 1. Demonstration of main Pandas methods
Well... There are dozens of cool tutorials on Pandas and visual data analysis. This one will guide us through the basic tasks of exploring a dataset (what does the data look like?).

**[Pandas](http://pandas.pydata.org)** is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like `.csv`, `.tsv`, or `.xlsx`. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with `Matplotlib` and `Seaborn`, `Pandas` provides a wide range of opportunities for visual analysis of tabular data.

The main data structures in `Pandas` are implemented with **Series** and **DataFrame** classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can see it as a dictionary of `Series` instances. `DataFrames` are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
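
To make the relationship between the two classes concrete, here is a minimal sketch that builds a `DataFrame` from two `Series`. The values are made up for illustration and are not taken from the ICU dataset.

```python
import pandas as pd

# Toy values, made up for illustration only (not taken from the ICU dataset)
age = pd.Series([54, 76, 44], name="Age")
height = pd.Series([170.2, 175.3, 160.0], name="Height")

# A DataFrame behaves much like a dictionary of Series that share one index
toy = pd.DataFrame({"Age": age, "Height": height})

print(age.dtype)          # each Series has a single dtype
print(toy.dtypes)         # each column keeps its own dtype
print(type(toy["Age"]))   # selecting one column returns a Series
```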

In [None]:
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)

We'll demonstrate the main methods in action by analyzing a dataset that is an extract of the MIMIC III Database.

Let's read the data (using `read_csv`), and take a look at the first 5 lines using the `head` method:

In [None]:
df = pd.read_csv('/home/shared/icu_2012.txt')
df.head()

<details>
<summary>About printing DataFrames in Jupyter notebooks</summary>
<p>
In Jupyter notebooks, Pandas DataFrames are printed as these pretty tables seen above while `print(df.head())` is less nicely formatted.
By default, Pandas displays 20 columns and 60 rows, so, if your DataFrame is bigger, use the `set_option` function as shown in the example below:

```python
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
```
</p>
</details>

Recall that each row corresponds to one patient, an **instance**, and columns are **features** of this instance.

Let’s have a look at data dimensionality, feature names, and feature types.

In [None]:
print(df.shape)

From the output, we can see that the table contains 4000 rows and 79 columns.

Now let's try printing out column names using `columns`:

In [None]:
print(df.columns)

We can use the `info()` method to output some general information about the dataframe: 

In [None]:
# info() prints its report itself and returns None, so no print() is needed
df.info()

`int64` and `float64` are the data types of our features; this dataset has no `bool` or `object` columns. With this same method, we can easily see if there are any missing values. Here, some columns have missing values because their non-null count is lower than the 4000 instances (rows) we saw before with `shape`.

The `describe` method shows basic statistical characteristics of each numerical feature (`int64` and `float64` types): number of non-missing values, mean, standard deviation, range, median, 0.25 and 0.75 quartiles.

In [None]:
df.describe()

The `describe` method only gives us information about numerical variables. Some of these statistics don't really make sense, for example for `RecordID` or `Gender`, but since these columns hold numbers, we get summary statistics anyway.

In order to see statistics on non-numerical features, one has to explicitly indicate data types of interest in the `include` parameter. We would use `df.describe(include=['object', 'bool'])` but in this case, the dataset only has variables of type `int` and `float`.
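
Since the ICU extract has no such columns, here is a small sketch of what `include` does on a hypothetical toy frame. The column names `Unit` and `Ventilated` and all values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical toy frame with non-numeric columns (the ICU extract has none)
toy = pd.DataFrame({
    "Unit": pd.Series(
        ["Medical ICU", "Surgical ICU", "Medical ICU", "Medical ICU"],
        dtype="object",
    ),
    "Ventilated": [True, False, True, True],
    "Age": [54, 76, 44, 68],
})

# For object and bool columns, describe() reports count, unique, top and freq
summary = toy.describe(include=["object", "bool"])
print(summary)
```

The numeric `Age` column is left out of the result because its type is not listed in `include`.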

For categorical (type `object`) and boolean (type `bool`) features we can use the `value_counts` method. This also works for variables that have been encoded as integers, like `Gender`. Let's have a look at the distribution of `Gender`:

In [None]:
df['Gender'].value_counts()

`Gender` is encoded as 0 for female and 1 for male, so 2246 instances are male patients.
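
If fractions are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on made-up coded values (not the real data):

```python
import pandas as pd

# Made-up coded values (0: female, 1: male), not the real ICU data
gender = pd.Series([1, 0, 1, 1, 0], name="Gender")

counts = gender.value_counts()                  # absolute frequencies
shares = gender.value_counts(normalize=True)    # relative frequencies, summing to 1

print(counts)
print(shares)
```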


### Sorting

A DataFrame can be sorted by the value of one of the variables (i.e., columns). For example, we can sort by *Age* (use `ascending=False` to sort in descending order):


In [None]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

df.sort_values(by='Age', ascending=False).head()

We can also sort by multiple columns:

In [None]:
df.sort_values(by=['Age', 'Height'],
        ascending=[False, False]).head()

### Indexing and retrieving data

A DataFrame can be indexed in a few different ways. 

To get a single column, you can use a `DataFrame['Name']` construction. Let's use this to answer a question about that column alone: **what is the average maximum heart rate of admitted patients in our dataframe?**

In [None]:
df['HR_max'].mean()

106 bpm is slightly elevated, but it seems reasonable for an ICU population.

**Boolean indexing** with one column is also very convenient. The syntax is `df[P(df['Name'])]`, where `P` is some logical condition that is checked for each element of the `Name` column. The result of such indexing is the DataFrame consisting only of rows that satisfy the `P` condition on the `Name` column. 

Let's use it to answer the question:

**What are average values of numerical features for male patients?**

In [None]:
df[df['Gender'] == 1].mean()

**What is the average maximum creatinine for female patients?**

In [None]:
df[df['Gender'] == 0]['Creatinine_max'].mean()
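
Boolean conditions can also be combined. The sketch below uses a toy frame with the same column names as above but made-up values, and selects male patients older than 50:

```python
import pandas as pd

# Toy data with the same column names as above; the values are made up
toy = pd.DataFrame({
    "Gender": [0, 1, 1, 0, 1],
    "Age": [81, 45, 70, 66, 59],
    "Creatinine_max": [1.1, 0.9, 2.4, 1.6, 1.0],
})

# Each condition needs its own parentheses; & means "and", | means "or"
older_men = toy[(toy["Gender"] == 1) & (toy["Age"] > 50)]
mean_creatinine = older_men["Creatinine_max"].mean()
print(mean_creatinine)
```

The parentheses matter because `&` binds more tightly than `==` and `>` in Python.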

DataFrames can be indexed by column name (label) or row name (index) or by the serial number of a row. The `loc` indexer is used for **indexing by name**, while the `iloc` indexer is used for **indexing by number**. Both are used with square brackets rather than parentheses.

In the first case below, we say *"give us the values of the rows with index from 0 to 5 (inclusive) and columns labeled from RecordID to ICUType (inclusive)"*. In the second case, we say *"give us the values of the first five rows in the first three columns"* (as in a typical Python slice: the maximal value is not included).

In [None]:
df.loc[0:5, 'RecordID':'ICUType']

In [None]:
df.iloc[0:5, 0:3]
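
The difference in how the two indexers treat the end of a slice is easy to check on a small frame. This is a sketch with made-up values; only the column names follow the dataset.

```python
import pandas as pd

# Toy frame with a default integer index; values are made up
toy = pd.DataFrame({
    "RecordID": [101, 102, 103, 104],
    "Age": [54, 76, 44, 68],
    "Gender": [0, 1, 0, 1],
    "ICUType": [4, 2, 3, 3],
})

by_label = toy.loc[0:2, "RecordID":"Gender"]   # labels: both end points included
by_position = toy.iloc[0:2, 0:2]               # positions: end point excluded

print(by_label.shape)      # (3, 3)
print(by_position.shape)   # (2, 2)
```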

If we need the first or the last line of the data frame, we can use the `df[:1]` or `df[-1:]` construct:

In [None]:
df[-1:]


### Applying Functions to Cells, Columns and Rows

**To apply functions to each column, use `apply()`:**
In this example, we will obtain the max value for each feature.


In [None]:
df.apply(np.max)
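
`apply` also accepts your own functions, and with `axis=1` it works row by row instead of column by column. A minimal sketch on toy data (`HR_min` is an assumed column name, and the values are made up):

```python
import pandas as pd

# Toy data; HR_min is an assumed column name, values are made up
toy = pd.DataFrame({
    "HR_min": [60, 72, 55],
    "HR_max": [110, 98, 130],
})

# Default axis=0: the function receives one column at a time
col_max = toy.apply(lambda col: col.max())

# axis=1: the function receives one row at a time
hr_range = toy.apply(lambda row: row["HR_max"] - row["HR_min"], axis=1)

print(col_max)
print(hr_range)
```

For simple arithmetic like this, `toy["HR_max"] - toy["HR_min"]` gives the same result and is faster; `apply` with `axis=1` is mainly useful when the row logic cannot be expressed with column operations.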

The `map` method can be used to **replace values in a column** by passing a dictionary of the form `{old_value: new_value}` as its argument. Let's replace the numeric codes for female and male with the corresponding strings:

In [None]:
d = {0 : 'Female', 1 : 'Male'}
df['Gender'] = df['Gender'].map(d)
df.head()

The same thing can be done with the `replace` method:

In [None]:
d2 = {1: 'Coronary Care Unit', 2: 'Cardiac Surgery Recovery Unit', 3: 'Medical ICU', 4: 'Surgical ICU'}
df = df.replace({'ICUType': d2})
df.head()
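
The two methods differ in how they treat values that are missing from the dictionary. The sketch below uses made-up codes, with 9 deliberately left out of the mapping:

```python
import pandas as pd

# Made-up codes; 9 is deliberately absent from the dictionary
codes = pd.Series([0, 1, 9])
labels = {0: "Female", 1: "Male"}

mapped = codes.map(labels)        # unmapped 9 becomes NaN
replaced = codes.replace(labels)  # unmapped 9 is left unchanged

print(mapped.tolist())
print(replaced.tolist())
```

So `map` is the stricter choice when every value is expected to have a translation, while `replace` is safer when only some values should change.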

We can also replace missing values when necessary. For that we use the `fillna()` method. In this case, we will replace them in the Mechanical Ventilation column.

In [None]:
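# assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['MechVent_min'] = df['MechVent_min'].fillna(0)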
df.head()
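
To check how many values are missing before and after filling, `isna().sum()` counts them per column. A minimal sketch on toy data with made-up values:

```python
import numpy as np
import pandas as pd

# Toy data; values are made up
toy = pd.DataFrame({
    "Age": [54, 76, 44],
    "MechVent_min": [1.0, np.nan, np.nan],
})

missing_before = toy.isna().sum()   # number of missing values per column
print(missing_before)

# Assigning the result back keeps the change in the DataFrame
toy["MechVent_min"] = toy["MechVent_min"].fillna(0)
missing_after = toy.isna().sum()
print(missing_after)
```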

### Histograms

Histograms are an important tool for understanding the distribution of your variables. They can help you detect errors in the data, like extreme or implausible values.

In [None]:
df['Age'].hist()

We can quickly see that the distribution of age is not normal. Let's look at `Na_max` next.

In [None]:
df['Na_max'].hist()

Not a lot of resolution here. Let's increase the number of bins to 30.

In [None]:
df['Na_max'].hist(bins=30)

Much better! It is easy to see that this is approximately a normal distribution.


### Grouping

In general, grouping data in Pandas works as follows:



```python
df.groupby(by=grouping_columns)[columns_to_show].function()
```


1. First, the `groupby` method divides the `grouping_columns` by their values. They become a new index in the resulting dataframe.
2. Then, columns of interest are selected (`columns_to_show`). If `columns_to_show` is not included, all columns other than the grouping columns will be included.
3. Finally, one or several functions are applied to the obtained groups per selected columns.
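
The three steps above can be sketched on a toy frame. The column names follow the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Toy data; values are made up for illustration
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Na_max": [140.0, 138.0, 142.0, 136.0],
    "K_max": [4.0, 4.4, 5.0, 3.6],
})

# 1. split by Gender, 2. select columns, 3. apply a function per group
means = toy.groupby("Gender")[["Na_max", "K_max"]].mean()
print(means)
```

Note that `Gender` is no longer a column in `means`: it has become the index of the result.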

Here is an example where we group the data according to `Gender` variable and display statistics of three columns in each group:

In [None]:
columns_to_show = ['Na_max', 'K_max', 
                   'HCO3_max']

df.groupby(['Gender'])[columns_to_show].describe(percentiles=[])

Let’s do the same thing, but slightly differently by passing a list of functions to `agg()`:

In [None]:
columns_to_show = ['Na_max', 'K_max', 
                   'HCO3_max']

# function names as strings avoid the deprecation warning recent pandas raises for np.mean and friends
df.groupby(['Gender'])[columns_to_show].agg(['mean', 'std', 'min', 'max'])


### Summary tables

Suppose we want to see how the observations in our sample are distributed in the context of two variables - `Gender` and `ICUType`. To do so, we can build a **contingency table** using the `crosstab` function:



In [None]:
pd.crosstab(df['Gender'], df['ICUType'])

This will resemble **pivot tables** to those familiar with Excel. And, of course, pivot tables are implemented in Pandas: the `pivot_table` method takes the following parameters:

* `values` – a list of variables to calculate statistics for,
* `index` – a list of variables to group data by,
* `aggfunc` – what statistics we need to calculate for groups, e.g. sum, mean, maximum, minimum or something else.
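
As a minimal sketch of how these three parameters fit together, here is a pivot table on a toy frame. The unit names follow the text above; the troponin values are made up.

```python
import pandas as pd

# Toy data; the unit names follow the text, the troponin values are made up
toy = pd.DataFrame({
    "ICUType": ["Medical ICU", "Medical ICU", "Surgical ICU", "Surgical ICU"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "TroponinT_max": [0.25, 0.75, 0.5, 1.5],
})

table = toy.pivot_table(
    values=["TroponinT_max"],   # what to summarize
    index=["ICUType"],          # what to group by
    aggfunc="max",              # how to summarize each group
)
print(table)
```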

Let's take a look at the average maximum troponin I and troponin T values by ICU type:

In [None]:
df.pivot_table(['TroponinI_max', 'TroponinT_max'],
               ['ICUType'], aggfunc='mean')

Nothing surprising here: patients in the coronary and cardiac units have higher troponin values.

### DataFrame transformations

Like many other things in Pandas, adding columns to a DataFrame is doable in many ways.

For example, if we want to calculate the change in creatinine, let's create the `Delta_creatinine` Series and paste it into the DataFrame:



In [None]:
Delta_creatinine = df['Creatinine_max'] - df['Creatinine_min']

df.insert(loc=len(df.columns), column='Delta_creatinine', value=Delta_creatinine) 
# loc parameter is the number of columns after which to insert the Series object
# we set it to len(df.columns) to paste it at the very end of the dataframe
df.head()

It is possible to add a column more easily without creating an intermediate Series instance:

In [None]:
df['Delta_BUN'] = df['BUN_max'] - df['BUN_min']
df.head()

To delete columns or rows, use the `drop` method, passing the required indexes and the `axis` parameter (`1` if you delete columns, and nothing or `0` if you delete rows). The `inplace` argument tells whether to change the original DataFrame. With `inplace=False`, the `drop` method doesn't change the existing DataFrame and returns a new one with dropped rows or columns. With `inplace=True`, it alters the DataFrame.
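
The effect of `inplace` is easiest to see on a small frame. A minimal sketch with made-up values:

```python
import pandas as pd

# Toy frame; values are made up
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# inplace=False (the default): a new frame is returned and toy is unchanged
without_b = toy.drop(["b"], axis=1)

# inplace=True: the row labeled 0 is removed from toy itself, and None is returned
toy.drop([0], inplace=True)

print(without_b.columns.tolist())
print(toy.columns.tolist())
print(toy.index.tolist())
```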

In [None]:
# get rid of just created columns
df.drop(['Delta_creatinine', 'Delta_BUN'], axis=1, inplace=True) 
# and here’s how you can delete rows
df.drop([1, 2]).head()

## 2. Exploring some associations


Let's see how mechanical ventilation is related to Gender. We'll do this using a `crosstab` contingency table and also through visual analysis with `Seaborn`.


In [None]:
pd.crosstab(df['MechVent_min'], df['Gender'], margins=True)

In [None]:
# some imports to set up plotting 
import matplotlib.pyplot as plt
# pip install seaborn 
import seaborn as sns
# Graphics in retina format are more sharp and legible
%config InlineBackend.figure_format = 'retina'

Now we create the plot that will show us the counts of mechanically ventilated patients by gender.

In [None]:
sns.countplot(x='Gender', hue='MechVent_min', data=df);

We see that the number (and probably the proportion) of mechanically ventilated patients is greater among males.
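
Counts alone cannot confirm the point about proportions, because the groups differ in size. One way to check is the `normalize` parameter of `crosstab`. The sketch below uses made-up values that do not reflect the ICU dataset:

```python
import pandas as pd

# Toy data; values are made up and do not reflect the ICU dataset
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male", "Male"],
    "MechVent_min": [0.0, 1.0, 1.0, 1.0, 1.0, 0.0],
})

# normalize='columns' divides each count by its column total,
# giving the share of ventilated patients within each gender
shares = pd.crosstab(toy["MechVent_min"], toy["Gender"], normalize="columns")
print(shares)
```

Each column of `shares` sums to 1, so the rows can be compared directly between genders.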

Next, let's look at the same distribution across the different ICU types, again with a summary table and a plot.

In [None]:
pd.crosstab(df['ICUType'], df['MechVent_min'], margins=True)

In [None]:
sns.countplot(x='ICUType', hue='MechVent_min', data=df);

As you can see, the proportion of patients ventilated and not ventilated is very different across the different types of ICUs. That is particularly true in the cardiac surgery recovery unit. Can you think of a reason why that might be?

## 3. Some useful resources

* ["Merging DataFrames with pandas"](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/merging_dataframes_tutorial_max_palko.ipynb) - a tutorial by Max Plako within mlcourse.ai (full list of tutorials is [here](https://mlcourse.ai/tutorials))
* ["Handle different dataset with dask and trying a little dask ML"](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/dask_objects_and_little_dask_ml_tutorial_iknyazeva.ipynb) - a tutorial by Irina Knyazeva within mlcourse.ai
* Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)
* Official Pandas [documentation](http://pandas.pydata.org/pandas-docs/stable/index.html)
* Course materials as a [Kaggle Dataset](https://www.kaggle.com/kashnitsky/mlcourse)
* Medium ["story"](https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68) based on this notebook
* If you read Russian: an [article](https://habrahabr.ru/company/ods/blog/322626/) on Habr.com with ~ the same material. And a [lecture](https://youtu.be/dEFxoyJhm3Y) on YouTube
* [10 minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)
* [Pandas cheatsheet PDF](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
* GitHub repos: [Pandas exercises](https://github.com/guipsamora/pandas_exercises/) and ["Effective Pandas"](https://github.com/TomAugspurger/effective-pandas)
* [scipy-lectures.org](http://www.scipy-lectures.org/index.html) — tutorials on pandas, numpy, matplotlib and scikit-learn