<table width=100%>
<tr>
    <td><h1 style="text-align: left; font-size:300%;">
        Introduction to Exploratory Data Analysis
    </h1></td>
    <td width="30%">
    <div style="text-align: right">
    <b> Practical Data Science Lessons</b><br><br>
    <b> Matteo Frosi</b><br>
    <a href="mailto:matteo.frosi@polimi.it">matteo.frosi@polimi.it</a><br>
    </div>
    </td>
</tr>
</table>

## Learning outcomes 🔎

*   [What is Exploratory Data Analysis (EDA)?](#what_is_eda)
*   [Preliminary Exploration](#preliminary_exp)
*   [Descriptive Statistics](#descr_stat)
*   [Data Visualization](#data_viz)
*   [Pandas, Seaborn or Matplotlib?](#pandas_seaborn_matplotlib)
*   [Summary of functions](#summary)

#### Resources:
*   *[Harvard 2021 CS109-A: Introduction to Data Science](https://harvard-iacs.github.io/2021-CS109A/)*

<a id='what_is_eda'></a>
## What is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis (EDA) is a critical phase in the data analysis process where the primary goal is **to understand the characteristics of the data at hand**. It involves **visually and statistically summarizing** the main features of a dataset, often using graphical representations, to uncover patterns, trends, anomalies, and insights. EDA helps analysts and data scientists form hypotheses, identify relationships between variables, and guide the direction of further analysis.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Load data

The file *quartets.csv* contains 4 different tiny datasets that we will use to quickly understand the value of plotting.

In [None]:
quartets = pd.read_csv('data/quartets.csv', index_col=0)

<a id='preliminary_exp'></a>
## Preliminary Exploration

In [None]:
quartets.info()

We see there are 44 entries: two numerical columns, `x` and `y`, and one column that identifies the quartet each row belongs to.

What does this dataframe look like?

In [None]:
quartets.head()

What do random samples look like?

In [None]:
quartets.sample(5)

Quartet names

In [None]:
quartets['quartet'].unique().tolist()

Display the first 3 samples from every dataset

In [None]:
quartets.groupby('quartet').head(3)

Display 2 random samples from every dataset

In [None]:
quartets.groupby('quartet').sample(2)

Display every quartet's dataset size

In [None]:
quartets.groupby('quartet').size()

<a id='descr_stat'></a>
## Descriptive Statistics

In [None]:
# agg() aggregates using one or more operations over the specified axis
quartets.groupby('quartet').agg(['mean', 'std']).round(3)

The mean and standard deviation are almost the same for every quartet.  
From these numbers alone, all four quartets could have been sampled from the same distribution.  
The datasets are tiny, so we can simply read them all!

In [None]:
quartets[quartets['quartet'] == 'I']

In [None]:
quartets[quartets['quartet'] == 'II']

In [None]:
quartets[quartets['quartet'] == 'III']

In [None]:
quartets[quartets['quartet'] == 'IV']
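
Matching means and standard deviations do not by themselves imply matching data. As a minimal sketch with made-up numbers (the `toy` frame and its `group` and `value` columns are hypothetical and not part of *quartets.csv*), two groups can agree on both statistics while containing clearly different values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups of four points each
toy = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4,
    "value": [-1.0, -1.0, 1.0, 1.0,                 # A: two tight clusters
              -np.sqrt(2), 0.0, 0.0, np.sqrt(2)],   # B: a centre plus two extremes
})

# Same kind of aggregation as the one applied to the quartets
summary = toy.groupby("group")["value"].agg(["mean", "std"]).round(3)
print(summary)
```

Both rows of `summary` are identical, which is why summary statistics are usually complemented with plots.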

### Central Tendency and Dispersion

When doing descriptive / statistical analysis we are interested in central tendency and dispersion, of which mean and standard deviation are special cases. **Central tendency** measures summarize the **center or average of the data**, while **dispersion** measures indicate **how spread out** or variable the data points are.

1. Central Tendency:

    **Mean (Average)**: The mean is the sum of all values divided by the number of observations. It represents the center of the distribution.
    $ \text{Mean} = \frac{\sum_{i=1}^{n} X_i}{n} $

    **Median**: The median is the middle value when the data is sorted. It is less sensitive to outliers than the mean and provides a measure of the central position.
    $ \text{Median} = \text{Middle value in sorted data} $

    **Mode**: The mode is the most frequently occurring value in the dataset. A distribution can be unimodal (one mode), bimodal (two modes), or multimodal (more than two modes).
    $ \text{Mode} = \text{Most frequently occurring value} $

2. Dispersion:

    **Range**: The range is the difference between the maximum and minimum values in the dataset. It provides a simple measure of the spread but is sensitive to outliers.
    $ \text{Range} = \text{Max} - \text{Min} $

    **Variance**: Variance measures the average squared difference of each data point from the mean. A higher variance indicates greater dispersion.
    $ \text{Variance} = \frac{\sum_{i=1}^{n} (X_i - \text{Mean})^2}{n} $

    **Standard Deviation**: The standard deviation is the square root of the variance. It provides a more interpretable measure of dispersion in the original units of the data.
    $ \text{Standard Deviation} = \sqrt{\text{Variance}} $

    **Interquartile Range (IQR)**: IQR is the range between the first quartile (Q1) and the third quartile (Q3). It is less sensitive to outliers than the range.
    $ \text{IQR} = Q3 - Q1 $

Understanding:
* If the mean, median, and mode are close, the distribution is likely symmetric.
* If the mean is greater than the median, the distribution may be right-skewed (positively skewed), and vice versa.
* A large range, variance, or standard deviation indicates higher dispersion.
* IQR is useful for identifying the spread of the middle 50% of the data.
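
The measures above can be computed directly with pandas. The sketch below uses a small made-up sample (the `values` Series is hypothetical). One detail worth checking: the variance formula above divides by $n$ (population variance), whereas `Series.var()` and `Series.std()` divide by $n-1$ by default (`ddof=1`); passing `ddof=0` reproduces the formula as written.

```python
import pandas as pd

# Hypothetical sample with one large value, which makes it right-skewed
values = pd.Series([2, 3, 3, 4, 5, 6, 19])

mean = values.mean()                        # sum of values / number of values
median = values.median()                    # middle value of the sorted data
mode = values.mode().tolist()               # most frequent value(s)
value_range = values.max() - values.min()   # max - min

pop_var = values.var(ddof=0)                # divides by n, as in the formula above
sample_var = values.var()                   # pandas default: divides by n - 1
pop_std = values.std(ddof=0)                # square root of the population variance

q1, q3 = values.quantile([0.25, 0.75])      # first and third quartiles
iqr = q3 - q1

print(f"mean={mean}, median={median}, mode={mode}, range={value_range}")
print(f"population variance={pop_var:.3f}, sample variance={sample_var:.3f}")
print(f"IQR={iqr}")
print("mean > median (hint of right skew):", mean > median)
```

For this sample the single large value pulls the mean above the median, while the median and IQR barely react to it.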

<a id='data_viz'></a>
## Data Visualization

Pandas plotting methods use matplotlib by default.

*Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.*

### BoxPlot

Box plots provide a summary of the distribution, including the median, quartiles, and potential outliers. The box represents the interquartile range (IQR), and the whiskers extend to the minimum and maximum values within a certain range. By default, they extend no more than 1.5 * IQR (IQR = Q3 - Q1) from the edges of the box, ending at the farthest data point within that interval. Outliers are plotted as separate dots.

In [None]:
quartets.groupby('quartet').boxplot(grid=False);
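
The 1.5 * IQR whisker rule described above can be reproduced by hand. This is a minimal sketch on a made-up sample (the `data` array is hypothetical), assuming the default linear-interpolation percentiles of `numpy.percentile`:

```python
import numpy as np

# Hypothetical sample with one extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, q3 = np.percentile(data, [25, 75])    # edges of the box
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr              # whiskers cannot extend past these fences
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the farthest data points that still lie within the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as a separate dot
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"box: [{q1}, {q3}], IQR = {iqr}")
print(f"whiskers: [{whisker_low}, {whisker_high}], outliers: {outliers.tolist()}")
```

Note that the upper whisker stops at the last observed value inside the fence, not at the fence itself.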

### [Seaborn's palettes](https://seaborn.pydata.org/tutorial/color_palettes.html)

In [None]:
sns.color_palette()

In [None]:
sns.color_palette('pastel')

In [None]:
palette = 'pastel'

**Seaborn's boxplots**

Similar boxplots with Matplotlib and Seaborn

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(8,7))
axes = axes.flatten().tolist()
for quartet, g in quartets.groupby('quartet'):
    ax = axes.pop(0)
    sns.boxplot(data=g, ax=ax, palette=palette);
    ax.set_title(f'quartet {quartet}')
plt.suptitle("Quartets' boxplots");

Using seaborn boxplots to compare the quartets' shared features

- [seaborn.boxplot()](https://seaborn.pydata.org/generated/seaborn.boxplot.html)
- [pandas.melt()](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16,4))
sns.boxplot(x='x', y='value', hue='quartet',
            data=pd.melt(quartets, id_vars='quartet', var_name='x', value_name='value'),
            ax=ax, palette=palette)
ax.set_title("quartets' features");

In [None]:
pd.melt(quartets, id_vars='quartet', var_name='x', value_name='value')

The problem with the plot above is that we are forcing different features (like `x` and `y`) to share the same y-axis.

So, another way to accomplish the goal could be this one

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16,4))
for i, col in enumerate(['x', 'y']):
    sns.boxplot(x='quartet', y=col, data=quartets, ax=axes[i], palette=palette);
    axes[i].set_title(f'variable {col}')

### Histograms

Histograms provide a visual representation of the distribution of a continuous variable. The data is divided into bins, and the height of each bar represents the frequency or count of observations within that bin.
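
The binning step can be made explicit with `numpy.histogram`, which returns the counts and bin edges that a histogram plot would draw. This is a small sketch on made-up values (the `values` array and the choice of 4 bins are hypothetical):

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])   # hypothetical sample

# Split the range [min, max] = [1, 9] into 4 equal-width bins and count per bin.
# Bins are half-open [left, right), except the last one, which includes its right edge.
counts, edges = np.histogram(values, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

Changing `bins` changes the picture: too few bins hide structure, too many leave most bins nearly empty.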

Pandas lets us easily plot the histograms of each quartet's features in one line of code.

In [None]:
quartets.groupby('quartet').hist();

The histograms allow us to start seeing some differences.

**Seaborn's histograms**

[seaborn.histplot()](https://seaborn.pydata.org/generated/seaborn.histplot.html)

We could do the same with seaborn with this code

In [None]:
for quartet, g in quartets.groupby('quartet'):
    fig, axes = plt.subplots(1 , 2, figsize=(8, 2.5))
    sns.histplot(data=g, x="x", hue='quartet', ax=axes[0], palette=palette, bins=10, kde=True);
    sns.histplot(data=g, x="y", hue='quartet', ax=axes[1], palette=palette, bins=10, kde=True);
    plt.suptitle(f'Quartet {quartet}')

We can plot the two features `x` and `y` of all quartets in two plots by moving the subplots creation out of the loop

In [None]:
# The 'element' parameter defines the visual representation of the histogram statistic
# Possible values are 'bars' (default but too noisy when plotting so many features), 'step', 'poly'
element = 'step'
fig, axes = plt.subplots(1 , 2, figsize=(12, 5))
legends = []
for quartet, g in quartets.groupby('quartet'):
    legends.append(f'quartet {quartet}')
    sns.histplot(data=g, x="x", hue='quartet', ax=axes[0], palette=palette, bins=10, kde=False, alpha=.2, element=element);
    sns.histplot(data=g, x="y", hue='quartet', ax=axes[1], palette=palette, bins=10, kde=False, alpha=.2, element=element);

axes[0].legend(legends)
axes[1].legend(legends);

### FacetGrid

This is a powerful tool that can be used in combination with plotting methods from seaborn or even matplotlib to plot multiple subplots based on some conditional relationship. **A Facet Grid allows you to create a grid of subplots based on the values of one or more categorical variables**. Each subplot in the grid represents a subset of the data based on the values of these variables. This makes it easy to compare different subsets of the data and identify patterns or trends.

[`seaborn.FacetGrid()`](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html): Multi-plot grid for plotting conditional relationships.

**Grid of histograms**

In [None]:
for feature in ['x', 'y']:
    # create the grid with condition quartet
    g = sns.FacetGrid(quartets, col="quartet", palette=palette, col_wrap=4)
    # for every condition we are going to create a subplot for the grid for column "feature"
    g.map(sns.histplot, feature, bins=10);

# col_wrap defines the number of columns. Change the value to 3 and 2 to understand its behaviour visually

We can create one FacetGrid for all. For that we need to convert the dataframe to access values based on conditions.

In [None]:
melted = pd.melt(quartets, id_vars='quartet', var_name='variable', value_name='value')
melted
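
To see what `pd.melt` does without scrolling through the full output, here is a minimal sketch on a made-up wide frame (the `wide` frame and its numbers are hypothetical): the identifier column is kept, and every other column is stacked into `(variable, value)` pairs.

```python
import pandas as pd

# Hypothetical wide frame: one row per observation, one column per feature
wide = pd.DataFrame({
    "quartet": ["I", "I", "II"],
    "x": [1.0, 2.0, 3.0],
    "y": [10.0, 20.0, 30.0],
})

# Keep 'quartet' as the identifier; stack the remaining columns row-wise
long = pd.melt(wide, id_vars="quartet", var_name="variable", value_name="value")
print(long)
```

The long frame has one row per (observation, feature) pair, so 3 rows and 2 features become 6 rows. This is the shape that the `row=` and `col=` arguments of `FacetGrid` can condition on.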

In [None]:
# create the grid with quartets as columns and variable as rows
g = sns.FacetGrid(melted, row="variable", col='quartet', palette=palette, sharex=False)
g.map(sns.histplot, 'value', bins=10);
# we need to set sharex to False to avoid distorting shapes between rows (you can try changing it to True)

### Scatter plots
Knowing that we have `x` and `y` features, we can think about using other kinds of helpful plots. Why not a scatter plot?

In [None]:
quartets.groupby('quartet').plot.scatter(x='x', y='y', s=50);

**Scatter plots with seaborn**

We can combine matplotlib with seaborn to improve the aesthetic.

In [None]:
fig, axes = plt.subplots(2,2,figsize=(7,7))
axes = axes.flatten().tolist()
for quartet, g in quartets.groupby('quartet'):
    ax = axes.pop(0)
    sns.scatterplot(data=g, x='x', y='y', ax=ax)
    ax.set_title(f'quartet {quartet}')
plt.subplots_adjust(hspace=0.3);

**Scatter plots with FacetGrid**

FacetGrid is great for avoiding too many lines of matplotlib code. In this case we can force the grid to share the x and y domains, which makes the features easier to compare across subplots.

In [None]:
g = sns.FacetGrid(quartets, col='quartet', palette=palette, col_wrap=2, sharex=True, sharey=True)
g.map(sns.scatterplot, 'x', 'y');

### Line plots
We could also use a line plot, but a line connects the points in row order, so the points must first be sorted along the x axis.

In [None]:
quartets.sort_values(by='x').groupby('quartet').plot(x='x', y='y', marker='o', lw=.7);
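
The reason for the `sort_values(by='x')` call can be checked on a tiny made-up frame (the `points` frame is hypothetical): rows stored in arbitrary order would make the line jump back and forth, while sorting by `x` restores a left-to-right path.

```python
import pandas as pd

# Hypothetical points stored in arbitrary row order
points = pd.DataFrame({"x": [3, 1, 4, 2], "y": [30, 10, 40, 20]})

# A line plot connects rows in the order they appear,
# so we order the rows by x before plotting
ordered = points.sort_values(by="x")

print("original x order:", points["x"].tolist())
print("sorted x order:  ", ordered["x"].tolist())
```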

### All in one
We also can use matplotlib to plot all groups in the same plot

In [None]:
# create one figure of 1 x 1 size.
fig, ax = plt.subplots(1,1,figsize=(16,6))
# plot all 4 quartets in the same ax
quartets.sort_values(by='x').groupby('quartet').plot(x='x', y='y', marker='o', ms=10, lw=.7, alpha=.7, ax=ax)
plt.ylabel('y')
plt.title('All in one quartets');

### Lineplots with seaborn

[`seaborn.lineplot()`](https://seaborn.pydata.org/generated/seaborn.lineplot.html) simplifies the creation of the same plot.

In [None]:
fig, ax = plt.subplots(1,1,figsize=(16,6))
sns.lineplot(data=quartets, x='x', y='y', hue='quartet', marker='o', ms=10, lw=.7, alpha=.7, ax=ax)
plt.title('All in one quartets');

And we can plot all quartets together (removing the conditional `hue` for seaborn)

In [None]:
fig, axes = plt.subplots(1,2,figsize=(16,4))
sns.lineplot(data=quartets, x='x', y='y', lw=.7, ax=axes[0])
axes[0].set_title('one line of seaborn')
quartets.plot(x='x', y='y', lw=.7, ax=axes[1])
axes[1].set_title('one line of matplotlib');

#### 🗒 Exercise

Seaborn is built on matplotlib, so modifying the function parameters should let you arrive at the same plot. Modify the parameters of Seaborn's `lineplot()` function so that the two plots are visually similar.

## Let's use a different dataset

We now load another dataset containing the marks obtained by students in various subjects. The aim is to understand the influence of the parents' background, test preparation, etc. on students' performance.

Source: https://www.kaggle.com/spscientist/students-performance-in-exams  
Original source generator: http://roycekimmons.com/tools/generated_data/exams

If you go to the original source you will find this is a fictitious dataset created specifically for data science training purposes.

In [None]:
df = pd.read_csv('data/StudentsPerformance.csv').rename(
        columns={
            'race/ethnicity': 'group',
            'parental level of education': 'parental',
            'test preparation course': 'course',
            'math score': 'math',
            'reading score': 'reading',
            'writing score': 'writing'
        }
    )

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df['group'].unique().tolist()

### Let's simplify the dataframe

We can simplify the group values to the group letter

**Series.str**

[`Series.str`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html): Vectorized string functions for Series and Index.

In [None]:
df['group'] = df['group'].str[-1]
df['group'].unique().tolist()
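
The `.str` accessor applies string operations element by element, including indexing. A minimal sketch, assuming labels of the form `"group A"` (the `groups` Series below is hypothetical):

```python
import pandas as pd

groups = pd.Series(["group A", "group B", "group C"])   # hypothetical labels

last_char = groups.str[-1]                    # element-wise indexing: last character
after_space = groups.str.split(" ").str[-1]   # alternative: last space-separated token

print(last_char.tolist())
print(after_space.tolist())
```

Taking the last character only works while the letter is a single character; splitting on the space is a slightly more defensive alternative.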

In [None]:
df.head()

In [None]:
df['course'].unique()

**Series.apply**

[`Series.apply`](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html): Invoke function on values of Series.

```python
Series.apply(func, convert_dtype=True, args=(), **kwargs)
```

In [None]:
# we check that this column's values have not been converted yet
if 'completed' in df['course'].unique().tolist():
    df['course'] = df['course'].apply(lambda x: 1 if x == 'completed' else 0)

# we can change the column values type to boolean
df['course'] = df['course'].astype(bool)
df['course'].unique()
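
`Series.apply` calls a Python function once per value. For a simple equality test like this one, a vectorized comparison gives the same boolean column in a single step. The sketch below is an alternative on made-up values (the `course` Series is hypothetical), not a replacement for the cell above:

```python
import pandas as pd

course = pd.Series(["none", "completed", "none", "completed"])   # hypothetical values

# Element-wise function call, then conversion to boolean
via_apply = course.apply(lambda x: 1 if x == "completed" else 0).astype(bool)

# Vectorized comparison: already boolean, no lambda needed
via_compare = course == "completed"

print(via_compare.tolist())
print("same result:", via_apply.equals(via_compare))
```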

In [None]:
df.head()

Are there missing values?

In [None]:
df.isna().sum()

None of the columns has missing values.

**Some questions:**
- Does gender affect math scores?
- Do reading and writing scores affect math scores?
- Do math scores affect reading and writing scores?
- Does a group perform better at math than the rest?
- Does parental level education affect math scores?

In [None]:
df[['reading','math']].sample(5)

In [None]:
df[['reading','math']].describe()

A score of zero is very unusual. Here we find a 0 in math.

In [None]:
df[df['math'] == 0]

Does this sample look possible? Why?

#### Histograms for our selected variables

In [None]:
df[['reading', 'math']].hist(bins=50, grid=False);

#### Histograms for our selected variables (seaborn)

[seaborn.histplot()](https://seaborn.pydata.org/generated/seaborn.histplot.html)

We can plot the histograms in separate plots using matplotlib subplots

In [None]:
plt.figure(figsize=(12,4))
sns.histplot(df[['reading']], bins=50, ax=plt.subplot(121), palette=palette)
sns.histplot(df[['math']], bins=50, ax=plt.subplot(122), palette=palette);

But knowing that by default sns.histplot merges all features into the same plot, it could be simpler

In [None]:
sns.histplot(df[['reading', 'math']], bins=50, palette=palette);

### Kernel Density Estimate
A Kernel Density Estimate (KDE) plot is a non-parametric way to estimate the probability density function of a continuous random variable. It provides a smooth, continuous representation of the underlying distribution of data, similar to a histogram but without discretizing the data into bins.

In [None]:
df[['reading', 'math']].plot.kde()
plt.title('KDEs');
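
Conceptually, a KDE places a small smooth bump (the kernel) on every observation and averages the bumps. The sketch below implements this with a Gaussian kernel in plain NumPy. The function `gaussian_kde_1d`, the sample `scores`, and the hand-picked `bandwidth` are all hypothetical; plotting libraries normally choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    # Standardized distances between grid points and samples,
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

scores = [52, 60, 61, 65, 70, 71, 74, 88]   # hypothetical scores
grid = np.linspace(0, 140, 1401)            # evaluation points, step 0.1
density = gaussian_kde_1d(scores, grid, bandwidth=5.0)

# A density should integrate to approximately 1
area = density.sum() * (grid[1] - grid[0])
print(f"area under the curve = {area:.3f}")
```

A smaller bandwidth gives a wigglier curve that follows individual points; a larger one gives a smoother curve that may hide real structure.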

Seaborn comes with the method [`seaborn.kdeplot()`](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) to create Kernel Density Plots, but we can simply set histplot's `kde` parameter to `True` to combine both.

In [None]:
sns.histplot(df[['reading', 'math']], bins=50, kde=True, palette=palette);

### BoxPlot

In [None]:
df[['reading', 'math']].boxplot();

At first glance the distributions look similar, as one could expect. The math score distribution looks slightly shifted down.

**Boxplots with seaborn**

[`seaborn.boxplot()`](https://seaborn.pydata.org/generated/seaborn.boxplot.html)

In [None]:
sns.boxplot(data=df[['reading', 'math']], palette=palette);

#### Boxplot on the whole dataframe

In [None]:
df.boxplot();

### Boxenplots or Letter values

A boxenplot, also known as a letter-value plot, is a variation of a box plot that provides additional information about the shape of the distribution, particularly in the tails. It is similar to a box plot but with more quantiles, resulting in a more detailed representation of the data distribution.

[`seaborn.boxenplot()`](https://seaborn.pydata.org/generated/seaborn.boxenplot.html)

In [None]:
sns.boxenplot(data=df[['reading', 'math']], palette=palette);

### Violinplots
A violin plot is a type of data visualization that combines aspects of a box plot and a kernel density plot. It is used to visualize the distribution of a continuous variable or numerical data across different categories or groups.

[`seaborn.violinplot()`](https://seaborn.pydata.org/generated/seaborn.violinplot.html)

In [None]:
sns.violinplot(data=df[['reading', 'math']], palette=palette);



What about the relation between the scores? Do they interact?
#### Scatter to the rescue

In [None]:
df.plot.scatter(x='reading', y='math', s=10, alpha=.5, figsize=(6,5))
plt.title('reading vs math');

There is a visible correlation between these variables.

### Correlation
Pandas has implemented a method named `corr()`.

`DataFrame.corr()`: Compute pairwise correlation of columns, excluding NA/null values.
```python
DataFrame.corr(method='pearson', min_periods=1)
```


In [None]:
df[['reading', 'math']].corr()

Pandas `corr()` offers different correlation methods. In most cases `pearson` and/or `spearman` are the go-to methods.

In [None]:
for method in ['pearson', 'kendall', 'spearman']:
    # iloc is used to access value at first row second column.
    corr = df[['reading', 'math']].corr(method=method).iloc[0,1]
    print(f'{method} correlation: {corr:.3f}')

We've confirmed there is a strong (linear) correlation between reading and math scores. Each variable could work as a proxy of the other variable.
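
The difference between the methods is easiest to see on data that is monotonic but not linear. In the sketch below (the `toy` frame is made up), Spearman's coefficient is computed as the Pearson correlation of the ranks, which is how it is defined; Pearson measures linear association, while Spearman only asks whether the ordering is preserved.

```python
import pandas as pd

# Hypothetical monotonic but non-linear relationship: b = a cubed
toy = pd.DataFrame({"a": range(1, 11)})
toy["b"] = toy["a"] ** 3

pearson = toy.corr(method="pearson").iloc[0, 1]

# Spearman correlation = Pearson correlation computed on the ranks
spearman = toy.rank().corr(method="pearson").iloc[0, 1]

print(f"pearson:  {pearson:.3f}")
print(f"spearman: {spearman:.3f}")
```

Here the ranks of `a` and `b` coincide exactly, so the rank-based coefficient reaches 1 while Pearson stays below 1.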

#### Correlation between all variables

In [None]:
df.corr(numeric_only=True)

Reading and writing have a really strong correlation.

**Of course one could use plots**

In [None]:
cols = ['math', 'reading', 'writing']
for i, c1 in enumerate(cols):
    c2 = cols[i+1] if i < len(cols)-1 else cols[0]
    df.plot.scatter(x=c1, y=c2, s=10, alpha=.5)
    plt.title(f'{c1} vs {c2}')

**Scatter plots with seaborn**

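Seaborn comes with [`seaborn.scatterplot()`](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) for a single pair of variables; [`seaborn.pairplot()`](https://seaborn.pydata.org/generated/seaborn.pairplot.html) draws a scatter plot for every pair of numeric columns, with each variable's distribution on the diagonal.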

In [None]:
sns.pairplot(df.select_dtypes('number'));

### Pie plot

In [None]:
df['gender'].value_counts(normalize=True).plot.pie(figsize=(6,6));

Seaborn doesn't come with a method to plot pie plots

### Heatmap

[`seaborn.heatmap()`](https://seaborn.pydata.org/generated/seaborn.heatmap.html): Plot rectangular data as a color-encoded matrix.

Heatmap is a great tool for plotting features' correlations


In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='Blues', ax=ax);

### We now want to compare the scores of the various skills by gender

In [None]:
df.head()

In [None]:
df.groupby('gender').mean(numeric_only=True)

In [None]:
df.groupby('gender').boxplot();

`pandas.melt()` is a powerful method to unpivot a dataframe. We are going to use it to simplify the use of some seaborn plots.

In [None]:
score_cols = df.select_dtypes('number').columns.tolist()
id_vars = [c for c in df.columns if c not in score_cols]
score_cols, id_vars
melted = pd.melt(df, id_vars=id_vars, var_name='skill', value_name='score')
melted.head()

When you make things easier to read for seaborn, seaborn will make the plots easier to read for you.

In [None]:
for func in [sns.boxplot, sns.boxenplot, sns.violinplot]:
    g = sns.FacetGrid(melted, col="skill")
    g.map(func, 'score', 'gender', order=None, palette=palette);

In [None]:
sns.pairplot(df, palette=palette, hue='gender');

In [None]:
df.groupby('gender').plot.kde();

In [None]:
df['is_female'] = df['gender'].apply(lambda x: 1 if x == 'female' else 0)
df['is_female'] = df['is_female'].astype(float)

df['is_male'] = df['gender'].apply(lambda x: 1 if x == 'male' else 0)
df['is_male'] = df['is_male'].astype(float)

df.head()

Instead of looking at the correlation between all variables, we want to see how the new variables `is_female` and `is_male` correlate with the scores. Pandas gives us the method `DataFrame.corrwith()` for this kind of case.

In [None]:
df[['math', 'reading', 'writing']].corrwith(df['is_female'])

In [None]:
df[['math', 'reading', 'writing']].corrwith(df['is_male'])
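
When a column has only two categories, the two indicator variables are exact complements (`is_male = 1 - is_female`), so their correlations with any score are mirror images. A minimal sketch on made-up scores (the `toy` frame is hypothetical and assumes only two gender values):

```python
import pandas as pd

# Hypothetical scores for six students
toy = pd.DataFrame({
    "gender":  ["female", "male", "female", "male", "female", "male"],
    "math":    [60, 72, 55, 80, 65, 70],
    "reading": [78, 60, 70, 66, 82, 58],
})
toy["is_female"] = (toy["gender"] == "female").astype(float)
toy["is_male"] = 1.0 - toy["is_female"]    # exact complement of is_female

corr_f = toy[["math", "reading"]].corrwith(toy["is_female"])
corr_m = toy[["math", "reading"]].corrwith(toy["is_male"])

print(pd.DataFrame({"is_female": corr_f, "is_male": corr_m}).round(3))
```

Under that assumption the second `corrwith` call carries no new information: each value is the first one with the sign flipped.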

### Who will pass (be approved)?

In [None]:
approval_threshold = 40

In [None]:
df['approved'] = df['math'] >= approval_threshold
df['approved'] = df['approved'].astype(int)
df.head()

In [None]:
df['approved'].value_counts(normalize=True).plot.bar();

Seaborn has a method for plotting counts of feature's values.

In [None]:
sns.countplot(data=df, x='group', palette=palette);

The problem is that `countplot` has no `normalize` parameter, so plotting a normalized version is not as simple as with pandas (```Series.value_counts(normalize=True).plot.bar()```).

In [None]:
df['approved'].value_counts(normalize=True).to_frame()
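
To feed a bar plot, the normalized counts need to become a regular two-column frame. The sketch below uses made-up flags (the `approved` Series and the column name `percent` are hypothetical) and names both columns explicitly, so the result does not depend on the default names a given pandas version assigns:

```python
import pandas as pd

approved = pd.Series([1, 1, 1, 0], name="approved")   # hypothetical pass/fail flags

shares = approved.value_counts(normalize=True)        # fractions that sum to 1

# rename_axis names the index, reset_index(name=...) names the value column
table = (shares * 100).rename_axis("approved").reset_index(name="percent")
print(table)
```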

In [None]:
# name the columns explicitly so the plot does not depend on pandas' default column names
pct = (df['approved'].value_counts(normalize=True) * 100).rename_axis('approved').reset_index(name='percent')
sns.barplot(data=pct, x='approved', y='percent', palette=palette);

In [None]:
df[['gender', 'course', 'reading', 'writing', 'math']].groupby('gender').corrwith(df['approved'])

In [None]:
df.groupby('approved')['gender'].value_counts(normalize=True).plot.bar();

We will try to do the same plot with seaborn

In [None]:
df.groupby('approved')['gender'].value_counts(normalize=True).to_frame()

In [None]:
# reset_index(name=...) names the percentage column explicitly
tmp = (df.groupby('approved')['gender'].value_counts(normalize=True) * 100).reset_index(name='percent')
sns.barplot(data=tmp, x='approved', y='percent', hue='gender', palette=palette);

<a id='pandas_seaborn_matplotlib'></a>
## Pandas, Seaborn or Matplotlib?

When plotting you can think of using one of these four approaches:

- Pandas
- Pandas + Matplotlib
- Pandas + Seaborn
- Pandas + Seaborn + Matplotlib

**Pandas**  
- Learning: easy
- Default Visual: bad
- Custom Visual: regular
- TIP: simply knowing which plotting methods pandas implements is enough to start plotting and extracting information for yourself (though maybe not for a presentation).

**Pandas + Matplotlib**  
- Learning: difficult
- Default Visual: regular
- Custom Visual: excellent but tricky (it's all about learning matplotlib, not easy to start from scratch)
- TIP: Think of the plot you want; `DataFrame.groupby` or a condition applied to the dataframe is usually enough to feed your plots.

**Pandas + Seaborn**   
- Learning: good
- Default Visual: good
- Custom Visual: very good
- TIP: Working with Seaborn is mostly about preparing a DataFrame to feed the plot you are looking for. You need to learn which plots Seaborn offers and probably spend some time learning pandas methods like `melt` and `pivot` to reshape the dataframe into the form Seaborn expects.

**Pandas + Matplotlib + Seaborn**   
- Learning: difficult
- Default Visual: good
- Custom Visual: excellent
- TIP: Sky is the limit. Remember that seaborn was built on matplotlib.

In [None]:
sns.countplot(data=df, x='approved', hue='gender', palette=palette);

In [None]:
ax = plt.subplot()
for group, g in df.groupby(['approved','gender']):
    g[['math']].hist(bins=50, ax=ax, alpha=.3, label=f'{group[0]} {group[1]}');
plt.legend();

To do the same plot with seaborn we will need to convert the dataframe like the melted one and add some new column that represents a combination of gender and approved. Sometimes it's better to look for alternatives that let us do the same analysis without too much coding.

In [None]:
g = sns.FacetGrid(df, col='approved', row='gender')
g.map(sns.histplot, 'math', palette=palette);

In [None]:
g = sns.FacetGrid(df, col='approved', row='gender')
g.map(sns.histplot, 'reading', palette=palette);

In [None]:
g = sns.FacetGrid(df, col='approved', row='gender')
g.map(sns.histplot, 'writing', palette=palette);

In [None]:
# let's repeat the three features with violinplots
for feature in ['reading', 'writing', 'math']:
    g = sns.FacetGrid(df, col='approved', row='gender', sharex=True, sharey=True)
    g.map(sns.violinplot, feature, order=None, palette=palette);

If we prepare the data for seaborn, seaborn will give us what we want. For instance, `seaborn.violinplot()` can split each violin using a secondary binary `hue` feature, but only when both `x` and `y` are given. In this case we can use a dummy feature for `x` to get the plot we want. Knowing this helps us improve our previous plot.

In [None]:
df['dummy'] = ''
# let's repeat the three features with violinplots
for feature in ['reading', 'writing', 'math']:
    g = sns.FacetGrid(df, col='approved', sharey=True)
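    # map_dataframe passes each facet's subset as `data`; passing data=df to map() would draw the full dataframe in every facet
    g.map_dataframe(sns.violinplot, x='dummy', y=feature, hue='gender', split=True, order=None, palette=palette);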
    g.add_legend() # we want to display the gender legend
    g.set_ylabels('score')
    g.fig.subplots_adjust(top=0.8)
    g.fig.suptitle(f'feature: {feature}', fontsize=12, font='verdana')
del df['dummy']

### PairGrid

`seaborn.PairGrid()` is a great tool that lets us extend seaborn plots easily.

In [None]:
# this should do something similar to pairplot() but without setting the histogram in the diagonal
g = sns.PairGrid(df)
g.map(sns.scatterplot);

In [None]:
del df['is_female']
del df['is_male']

Maybe you didn't see the power of PairGrids. Let's try again with a new custom PairGrid plot with multivariate KDE subplots

In [None]:
# Create a cubehelix colormap to use with kdeplot
cmap = sns.cubehelix_palette(start=0, light=.95, as_cmap=True)
g = sns.PairGrid(df, diag_sharey=False)
g.map_upper(sns.kdeplot, cmap=cmap, fill=True)
g.map_lower(sns.kdeplot, cmap=cmap, fill=True)
g.map_diag(sns.kdeplot, color='#aa0000', fill=True);

<a id='summary'></a>
## Summary of functions

**Pandas**
- pandas.read_csv()
- pandas.concat()
- pandas.get_dummies()
- DataFrame.info()
- DataFrame.head()
- DataFrame.sample()
- DataFrame.describe()
- Series.unique()
- Series.str
- DataFrame.groupby()
- DataFrame.sort_values()
- DataFrame.corr()
- DataFrame.corrwith()
- DataFrameGroupBy.size()

**Pandas (plotting)**
- DataFrame.boxplot()
- DataFrame.hist()
- DataFrame.plot()
- DataFrame.plot.kde()
- DataFrame.plot.pie()
- DataFrame.plot.scatter()

**matplotlib**
- matplotlib.pyplot.subplots()
- matplotlib.pyplot.title()
- matplotlib.pyplot.plot()
- matplotlib.pyplot.suptitle()
- matplotlib.pyplot.subplot()
- matplotlib.pyplot.subplots_adjust()
- matplotlib.pyplot.ylabel()
- matplotlib.pyplot.legend()
- matplotlib.pyplot.figure()

**seaborn**
- seaborn.boxplot()
- seaborn.boxenplot()
- seaborn.histplot()
- seaborn.barplot()
- seaborn.countplot()
- seaborn.scatterplot()
- seaborn.violinplot()
- seaborn.lineplot()
- seaborn.pairplot()
- seaborn.heatmap()
- seaborn.kdeplot()
- seaborn.FacetGrid()
- seaborn.PairGrid()