<center>
<img src="https://www.iybssd2022.org/wp-content/uploads/ASAQ.jpg" width="150"/> 
</center>

        
<center>
<h1><font color= "blue" size="+2">ASAQ Python Data Analysis Courses</font></h1>
</center>

---

<center><h1><font color="blue" size="+2">Plotting with Seaborn</font></h1></center>

## <font color="red">Objectives</font>

We want to:

- Read an Excel file.
- Use `seaborn` to create graphs.

## <font color="red">Required modules/packages</font>

- __Pandas__: A data manipulation and data analysis library. 
- __Skimpy__: A lightweight tool for creating summary statistics from DataFrames.
- __Seaborn__: A data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- __Plotly__: Python graphing library that makes interactive graphs.

## <font color="green">Run the cell below to install the extra modules when using Google Colab</font>

In [None]:
try:
    import google.colab
    print("Running in Google Colab")
except ImportError:
    print("Not running in Google Colab")
else:
    print("Installing modules in Google Colab")
    !pip install skimpy
    !pip install plotly

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import matplotlib.pyplot as plt

In [None]:
import pandas as pd

In [None]:
import seaborn as sns

In [None]:
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
pio.templates.default = "plotly_white"

print(f"Pandas  version: {pd.__version__}")
print(f"Seaborn version: {sns.__version__}")

## <font color="red">Data Access</font>

File name:

In [None]:
file_name = "L2-01-08-2023evening.xlsx"
data_url = f"https://github.com/JulesKouatchou/asaq_py/raw/main/sample_data/{file_name}"
#data_url = f"../sample_data/{file_name}"

### <font color="blue">Read the file</font>

In [None]:
df = pd.read_excel(data_url, 
                   sheet_name="Feuil1",
                   parse_dates={'t': [0]}
                  )
df

In [None]:
df.info()

We can remove the column _time_:

In [None]:
df = df.drop(columns=['time'])
df

#### Make the time the index of the DataFrame

In [None]:
df.set_index('t', inplace=True)
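
Why a time index is convenient can be sketched on a small, made-up table. The values, the column name `CO2`, and the one-minute spacing below are assumptions for illustration only and do not come from the Excel file.

```python
import pandas as pd

# Hypothetical stand-in for the sensor data: one reading per minute
toy = pd.DataFrame({
    "t": pd.date_range("2023-08-01 18:00", periods=6, freq="min"),
    "CO2": [410, 415, 420, 418, 412, 409],
})
toy = toy.set_index("t")

# With a DatetimeIndex, rows can be selected by time range (both ends included)
subset = toy.loc["2023-08-01 18:01":"2023-08-01 18:03"]
print(subset)

# The series can also be resampled to a coarser time step (3-minute means)
coarse = toy["CO2"].resample("3min").mean()
print(coarse)
```

The same index is what Seaborn receives later on when `x=df1.index` is passed to the plotting functions.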

### Define new DataFrames that are portions of the original one

In [None]:
df1 = df[['RH', 'CO2', 'TPM', 'altitude']]
df2 = df[['PM0.5', 'PM1', 'PM2.5', 'PM5', 'PM10', 'TPM']]

# Line plots

#### Create a single line plot

In [None]:
sns.lineplot(x=df1.index, y="CO2", data=df1);

#### Adjust the figure size

In [None]:
fig, ax = plt.subplots(figsize=(20, 5))  # plt.subplots returns (figure, axes)
sns.lineplot(x=df1.index, y="CO2", data=df1);

#### Add a title and axis labels

In [None]:
fig, ax = plt.subplots(figsize=(20, 5))  # plt.subplots returns (figure, axes)
sns.lineplot(x=df1.index, y="CO2", 
             data=df1).set(title='Timeseries of CO2', xlabel='Time');

#### Change the line color, style, and size

In [None]:
sns.set_theme(style='white', font_scale=2)
fig, ax = plt.subplots(figsize=(20, 5))  # plt.subplots returns (figure, axes)
sns.lineplot(x=df1.index, y="CO2", data=df1,
             linestyle='dotted', color='green', 
             linewidth=5).set(title='Timeseries of CO2', 
                              xlabel='Time');

#### Add markers and customize their color, style, and size

In [None]:
sns.set_theme(style='white', font_scale=2)
fig, ax = plt.subplots(figsize=(20, 5))  # plt.subplots returns (figure, axes)
sns.lineplot(x=df1.index, y="CO2", data=df1,
             marker='*', markerfacecolor='green', 
             markersize=9
            ).set(title='Timeseries of CO2', xlabel='Time');

#### Create a line plot with multiple lines

In [None]:
fig, ax = plt.subplots(figsize=(20, 5))
#sns.lineplot(x=df2.index, y="PM1", data=df2, label="PM1")
# Pass `label` to lineplot so that each line gets its own legend entry
sns.lineplot(x=df2.index, y="PM5", data=df2, label="PM5")
sns.lineplot(x=df2.index, y="PM10", data=df2, label="PM10");

# Important parameters

#### `hue` parameter

- Takes the name of a DataFrame column, usually one containing categorical values. Seaborn then draws one line per category, giving each line a different color. A numeric column is also accepted, in which case the colors follow a continuous color scale.

#### `style` parameter

- Works the same way as `hue` but it distinguishes between the categories by using different plot styles (solid, dashed, dotted, etc.), without affecting the color.

#### `size` parameter

- Just like `hue` and `style`, the `size` parameter draws a separate line for each category. It does not affect the color or style of the lines but gives each of them a different width.

In [None]:
df3 = df1[df1.RH > 60]

In [None]:
fig, ax = plt.subplots(figsize=(20, 5))  # plt.subplots returns (figure, axes)
sns.lineplot(x=df3.index, y="CO2", data=df3, 
             hue='RH',
             linestyle='dashed', linewidth=5);

# Boxplot 
- Can provide a quick and clear understanding of the distribution of data.
- Used to identify the existence of outliers.
- Also known as a box-and-whisker plot, it is a graphical representation of the distribution of data based on five summary statistics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
  - The box represents the interquartile range (IQR), which is the distance between the first and third quartiles.
  - The whiskers extend from the box to the most extreme data points that still lie within 1.5 times the IQR of the box edges (below Q1 and above Q3).
  - Any data points outside this range are plotted as outliers.
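
The quantities listed above can be computed by hand with Pandas. The sketch below uses a short made-up series (not the course data) and the default linear interpolation for the quartiles.

```python
import pandas as pd

# Hypothetical sample with one obviously large value
values = pd.Series([2, 4, 4, 5, 6, 7, 8, 9, 30])

q1 = values.quantile(0.25)        # first quartile
q3 = values.quantile(0.75)        # third quartile
iqr = q3 - q1                     # height of the box

# Limits beyond which a point is drawn as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The whiskers stop at the most extreme points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, iqr, whisker_low, whisker_high, outliers.tolist())
```

Here the box spans 4 to 8, the whiskers end at 2 and 9, and the value 30 is the only outlier.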


![fig_boxplot](https://miro.medium.com/max/9000/1*2c21SkzJMf3frPXPAR_gZA.png)


In [None]:
sns.boxplot(data=df1)

In [None]:
sns.boxplot(data=df2)

Let us write a simple function that allows us to display individual boxplots more clearly.

In [None]:
def create_boxplot(mydf):
    """
    This function takes a Pandas DataFrame and creates individual
    boxplots, each one with its own axis.
    """
    # Obtain the list of columns
    columns_to_plot = list(mydf.columns)
    
    # Create the figure with one subplot per column
    # (squeeze=False keeps `axes` an array even for a single column)
    fig, axes = plt.subplots(ncols=len(columns_to_plot), squeeze=False)

    # Create one boxplot per column with Seaborn
    for column, axis in zip(columns_to_plot, axes.ravel()):
        sns.boxplot(data=mydf[column], ax=axis) 
        axis.set_xlabel(column)
        axis.set(xticklabels=[], xticks=[], ylabel='')

    # Adjust the spacing so that the subplots do not overlap
    plt.tight_layout()

In [None]:
create_boxplot(df1)

In [None]:
create_boxplot(df2)

__Quick observations__

- Most of the fields have outliers.
- We need to decide what to do with the outliers before we start creating a model to analyse the data.
   - Addressing outliers usually falls into two categories: deleting them or using summary statistics less prone to outliers.
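
Both options can be sketched on a made-up column. The values below are hypothetical, and the 1.5 × IQR rule is only one possible criterion for deciding which points to drop.

```python
import pandas as pd

# Hypothetical readings with one suspicious spike
toy = pd.DataFrame({"CO2": [400, 405, 410, 415, 420, 425, 1200]})

# Option 1: delete the rows that fall outside the 1.5 * IQR limits
q1, q3 = toy["CO2"].quantile([0.25, 0.75])
iqr = q3 - q1
keep = toy["CO2"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = toy[keep]

# Option 2: keep all rows but summarize with a statistic
# that is less sensitive to extreme values (median instead of mean)
print("mean   (all rows):", toy["CO2"].mean())
print("median (all rows):", toy["CO2"].median())
print("mean   (trimmed) :", trimmed["CO2"].mean())
```

The single spike pulls the mean up to 525, while the median (415) and the trimmed mean (412.5) stay close to the bulk of the readings.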

# Scatterplot with varying point sizes and hues

In [None]:
# relplot returns a FacetGrid (not a Matplotlib Axes), hence the name `g`
g = sns.relplot(data=df3, x="RH", y='CO2', 
                hue='altitude',
                size='TPM', 
                alpha=.75, 
                #palette="muted",
                height=5,
               )
g.set(ylabel="CO2");

# Scatterplot Matrix

```python
seaborn.pairplot(data, *, hue=None, hue_order=None, 
                 palette=None, vars=None, x_vars=None, 
                 y_vars=None, kind='scatter', diag_kind='auto', 
                 markers=None, height=2.5, aspect=1, 
                 corner=False, dropna=False, plot_kws=None,
                 diag_kws=None, grid_kws=None, size=None)
```

- Plot pairwise relationships in a dataset.
- By default, this function will create a grid of Axes such that each numeric variable in `data` will be shared across the y-axes across a single row and the x-axes across a single column. 
   - The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column.

In [None]:
sns.pairplot(df1);

In [None]:
sns.pairplot(df2);

### Add the `hue`

In [None]:
#sns.set_theme(style="ticks")
#sns.pairplot(df1, hue="RH")

# Paired density and scatterplot matrix

```python
seaborn.PairGrid(data, *, hue=None, vars=None, 
                 x_vars=None, y_vars=None, hue_order=None, 
                 palette=None, hue_kws=None, corner=False, 
                 diag_sharey=True, height=2.5, aspect=1, 
                 layout_pad=0.5, despine=True, dropna=False)
```

- Subplot grid for plotting pairwise relationships in a dataset.
- This object maps each variable in a dataset onto a column and row in a grid of multiple axes. 
- Different axes-level plotting functions can be used to draw bivariate plots in the upper and lower triangles, and the marginal distribution of each variable can be shown on the diagonal.

In [None]:
g = sns.PairGrid(df1, diag_sharey=False)
g.map_upper(sns.scatterplot, s=15)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=2);

## Linear regression with marginal distributions

```python
seaborn.jointplot(data=None, *, x=None, y=None, hue=None, 
                  kind='scatter', height=6, ratio=5, space=0.2, 
                  dropna=False, xlim=None, ylim=None, 
                  color=None, palette=None, hue_order=None,
                  hue_norm=None, marginal_ticks=False, 
                  joint_kws=None, marginal_kws=None, **kwargs)
```

- Draw a plot of two variables with bivariate and univariate graphs.
- `kind`: Kind of plot to draw.
   - `{"scatter" | "kde" | "hist" | "hex" | "reg" | "resid"}`

In [None]:
sns.set_theme(style="darkgrid")

g = sns.jointplot(x="PM10", y="TPM", data=df2,
                  kind="reg", truncate=False,
                  color="m", height=7)

# Joint kernel density estimate

In [None]:
g = sns.jointplot(
    data=df3,
    x="TPM", y="CO2", hue="RH",
    kind="kde",
)

# Hexbin plot with marginal distributions

In [None]:
g = sns.jointplot(
    data=df3,
    x="TPM", y="CO2",
    kind="hex", color="#4CB391"
)

# Conditional kernel density estimate

```python
seaborn.displot(data=None, *, x=None, y=None, 
                hue=None, row=None, col=None, weights=None, 
                kind='hist', rug=False, rug_kws=None, 
                log_scale=None, legend=True, palette=None, 
                hue_order=None, hue_norm=None, color=None, 
                col_wrap=None, row_order=None, col_order=None, 
                height=5, aspect=1, facet_kws=None, **kwargs)
```

- This function provides access to several approaches for visualizing the univariate or bivariate distribution of data, including subsets of data defined by semantic mapping and faceting across multiple subplots.
- The `kind` parameter selects the approach to use:
   - `histplot()` (with kind="hist"; the default)
   - `kdeplot()` (with kind="kde")
   - `ecdfplot()` (with kind="ecdf"; univariate-only)

In [None]:
sns.displot(
    data=df3[df3.RH>55],
    x="CO2", hue="RH",
    kind="kde", height=6,
    multiple="fill", clip=(0, None),
    palette="ch:rot=-.25,hue=1,light=.75",
)

## Correlation

#### Create the heatmap with `Seaborn`

In [None]:
fig, ax = plt.subplots(figsize=(12, 11))
sns.heatmap(df2.corr(), cmap='seismic', ax=ax)
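
What `corr()` returns, and how the color scale can be pinned to the full correlation range, can be sketched with three made-up columns. The data below are assumptions chosen so that the expected correlations are obvious.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical columns: y rises with x, z falls with x
x = np.arange(10.0)
toy = pd.DataFrame({"x": x, "y": 2 * x + 1, "z": -x})

# Pairwise Pearson correlation coefficients (values between -1 and 1)
corr = toy.corr()
print(corr)

fig, ax = plt.subplots(figsize=(5, 4))
# vmin/vmax fix the color scale to [-1, 1] so that 0 sits at the center of
# the diverging colormap; annot=True writes the value in each cell
sns.heatmap(corr, cmap="seismic", vmin=-1, vmax=1, annot=True, ax=ax)
```

Without `vmin` and `vmax`, the color scale is derived from the data, so the same color may correspond to different correlation values from one heatmap to the next.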

#### Create the heatmap with `Plotly`

Note that this plot is interactive.

In [None]:
mat = px.imshow(df2.corr(), x=df2.columns, 
                y=df2.columns, 
                title="Correlation matrix", 
                width=1000, height=1000)
mat.show()