---

<center> <h1> <span style='color:#292D78'> CREWES Data Science Training </span> </h1> </center>

<center> <h2> <span style='color:#DF7F00'> Lecture 6: Data Augmentation and Transformation </span> </h2> </center>

---

In this [Jupyter Notebook](https://jupyter.org/install) we will see how to augment and transform tabular data, as well as how to interpret a correlation plot.

# Data Augmentation vs Data Transformation

While data augmentation and data transformation overlap in their methods, here we use *data augmentation* to mean the *creation* of new features from existing ones, and *data transformation* to mean converting a feature from its original distribution to something closer to a normal distribution (for example, with the log transformation).
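To make the distinction concrete, here is a minimal sketch on made-up toy data (not the Titanic dataset). The `family_size` feature is a hypothetical example of augmentation and is not used in the rest of this notebook; `np.log1p` is used as one possible transformation.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "sibsp": [0, 1, 0, 3],
    "parch": [0, 2, 0, 1],
    "fare": [7.25, 71.28, 8.05, 31.39],
})

# Data augmentation: create a NEW feature from existing ones.
# `family_size` is a hypothetical feature (the passenger plus relatives aboard).
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# Data transformation: change the distribution of an EXISTING feature.
toy["log_fare"] = np.log1p(toy["fare"])  # log(1 + fare)

print(toy)
```

Augmentation adds information that was only implicit in the table, while the transformation keeps the same information and only changes its scale.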

Let's start by loading some packages:

In [None]:
# Core
import pandas as pd
import numpy as np

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# To avoid warnings
import warnings
warnings.filterwarnings("ignore") # 😃

## Loading and Processing the Data

Seaborn provides a list of sample datasets on which we can train our skills.

In [None]:
sns.get_dataset_names()

Let's use the *Titanic* dataset, which is commonly used to predict who survived the disaster from information like gender, age, social class, etc.

In [None]:
titanic = sns.load_dataset("titanic")
titanic

Some columns duplicate others; for example, `pclass` is the numerical representation of `class`. Let's remove the redundant columns:

In [None]:
titanic.drop(["pclass", "who", "embark_town", "alive"], axis = 1, inplace = True)

### Checking Data Types

In [None]:
titanic.info()

### Statistics of the Data

In [None]:
titanic.describe(include = "all").T

### Checking for Missing Data

In [None]:
titanic.isna().sum()

In [None]:
print(titanic["deck"].unique())
print(titanic["embarked"].unique())

There are missing data in `age`, `deck`, and `embarked`.

* `age` is numeric. Let's fill the NaNs with the median per gender.
* `deck` is categorical. It is the location in the ship where the passengers were. It seems to be correlated with `class`. Let's fill it with the most frequent deck per `class`.
* `embarked` has only 2 missing values. Let's fill them with the most common value of the column.
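The group-wise filling strategy can be sketched on a small made-up table (the values below are invented for illustration). The key idea is that `transform` returns a Series aligned row by row with the original data, so it can be passed directly to `fillna`.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "age": [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
    "embarked": ["S", "S", None, "C", "S", "S"],
})

# transform("median") gives, for every row, the median age of that row's group
group_median = toy.groupby("sex")["age"].transform("median")
toy["age"] = toy["age"].fillna(group_median)

# mode() returns the most frequent value(s); [0] takes the first one
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])

print(toy)
```

Here the missing male age is filled with 30 (the median of 20 and 40) and the missing female age with 20 (the median of 10 and 30).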

In [None]:
titanic["age"] = titanic["age"].fillna(titanic.groupby(["sex"])["age"].transform("median"))
titanic["deck"] = titanic["deck"].fillna(titanic.groupby(["class"])["deck"].transform(lambda x: x.mode()[0]))
titanic["embarked"] = titanic["embarked"].fillna(titanic["embarked"].mode()[0])

In [None]:
titanic.isna().sum()

## Data Augmentation and Transformation

Aside from Seaborn, Pandas also has its own visualization tools built on top of *Matplotlib*:

### Univariate analysis

#### Numeric Features

In [None]:
titanic.hist(figsize = (20,20));

`sibsp`, `parch`, and `fare` are right-skewed, and we will apply a *log* transformation to them later.

#### Gender

In [None]:
sns.countplot(data = titanic, x = "sex");

Males are the majority.

#### Class

In [None]:
sns.countplot(data = titanic, x = "class")

There are more third class passengers than first and second combined.

#### Embarked

In [None]:
sns.countplot(data = titanic, x = "embarked")

Most passengers embarked in Southampton.

#### Adult Male

In [None]:
sns.countplot(data = titanic, x = "adult_male");

Most passengers are adult males.

#### Deck

In [None]:
sns.countplot(data = titanic, x = "deck");

Most passengers are from deck *F*. Keep in mind that this count includes the `deck` values we imputed earlier.

#### Alone

In [None]:
sns.countplot(data = titanic, x = "alone");

Most passengers were travelling alone.

### Bivariate Analysis

Let's analyze the distribution of the features by `survived`.

#### Numerical Features

In [None]:
titanic.info()

In [None]:
# Extracting names of numerical and categorical features
cols_num = list(titanic.drop(columns = ["survived"]).select_dtypes(include = ["int64", "float64"]))
cols_cat = list(titanic.drop(columns = ["survived"]).select_dtypes(include = ["object", "category", "bool"]))

print("Numerical columns:", cols_num)
print("Categorical columns:", cols_cat)

In [None]:
# converting survived to categorical feature
titanic["survived"] = titanic["survived"].astype("category")

Plotting all numerical features by `survived`.

In [None]:
for i in cols_num:
    sns.boxplot(data = titanic, x = "survived", y = i, showfliers = False)
    plt.title(i)
    plt.show()

`parch` (number of parents and children aboard) and `fare` (the ticket price) appear to be related to whether a passenger survived. In both cases, survivors tended to have larger values.

#### Categorical Features

In [None]:
for i in cols_cat:
    sns.countplot(data = titanic, x = i, hue = "survived")
    plt.title(i)
    plt.show()

* Female passengers and children had a higher chance of survival.
* Third-class and/or solo passengers were less likely to survive.
* Passengers that embarked in *Cherbourg* had a higher chance of survival, the opposite of the passengers in deck *F*.

### Correlation Plot

Let's check the correlation between the numerical features.

In [None]:
plt.figure(figsize = (10,7))
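# numeric_only = True skips the text and categorical columns (needed in pandas >= 2.0)
sns.heatmap(data = titanic.corr(numeric_only = True), vmin = -1, vmax = 1, annot = True);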

With the *Pearson* correlation, `alone` has a relatively high negative linear correlation with `sibsp` (number of siblings and spouses aboard) and `parch` (number of parents and children aboard).

In [None]:
plt.figure(figsize = (10,7))
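# numeric_only = True skips the text and categorical columns (needed in pandas >= 2.0)
sns.heatmap(data = titanic.corr(method = "spearman", numeric_only = True), vmin = -1, vmax = 1, annot = True);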

Looking at the *Spearman* correlation, both `alone` and `adult_male` show a stronger negative correlation with `sibsp`, `parch`, and `fare`. The correlation is particularly strong between `alone` and `sibsp`.
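To see why the two methods can disagree, here is a small sketch on synthetic data (not the Titanic dataset). Pearson measures how *linear* a relationship is, while Spearman works on the ranks and so measures how *monotonic* it is. The exponential relationship below is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: y always increases with x, but not along a straight line
x = np.arange(1, 11)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy.corr(method = "pearson").loc["x", "y"]
spearman = toy.corr(method = "spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")   # below 1, the relationship is not linear
print(f"Spearman: {spearman:.3f}")  # 1.000, the ranks match perfectly
```

When Spearman is clearly stronger than Pearson, as in this sketch, it may suggest a monotonic but non-linear relationship between the two features.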

### Pairs Plot

In [None]:
sns.pairplot(data = titanic[cols_num + ["survived"]], hue = "survived");

## Data Augmentation and Log Transformation

When thinking about machine learning, keep in mind that the data must be ready for modeling. Many models work better when features are approximately normally distributed, and most algorithms can't handle categorical features directly, so we will apply the [one-hot-encoding](https://www.kaggle.com/code/dansbecker/using-categorical-data-with-one-hot-encoding/notebook) method (`get_dummies` in Pandas).

### Log Transformation

There are three columns, `sibsp`, `parch`, and `fare`, that will have their distribution transformed by a log function.

In [None]:
cols_norm = ["sibsp", "parch", "fare"]

# For loop to apply the log transformation
for i in cols_norm:
    titanic["log_" + i] = np.log(titanic[i] + 1) # why do we add 1 to the features?

titanic

Checking distributions:

In [None]:
cols = cols_norm + ["log_sibsp", "log_parch", "log_fare"]

for i in cols:
    sns.histplot(data = titanic, x = i);
    plt.title(i)
    plt.show()

Even after the transformation, `sibsp` and `parch` still have right-skewed distributions due to the high number of zeros, while the $\log$ of `fare` now has a distribution closer to normal.
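The effect of the transformation can also be checked numerically. Below is a minimal sketch on made-up values (not the Titanic data) that shows why an offset of 1 is needed when a feature contains zeros, and uses `Series.skew()` as one possible measure of asymmetry.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values with several zeros
s = pd.Series([0, 0, 0, 0, 1, 1, 2, 8, 30, 250], dtype = "float64")

# log(0) is undefined (NumPy returns -inf), so zeros need an offset
with np.errstate(divide = "ignore"):
    print(np.log(s.iloc[0]))  # -inf

# np.log1p(x) computes log(1 + x), which maps 0 to 0
log_s = np.log1p(s)

# Skewness close to 0 suggests a roughly symmetric distribution
print("skew before:", s.skew())
print("skew after: ", log_s.skew())
```

In this sketch the skewness drops after the transformation but stays positive, because the block of zeros is still there. This matches what we see for `sibsp` and `parch`.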

### One-Hot-Encoding

We will use the Pandas function `get_dummies` with the parameter `drop_first = True`. This parameter drops one of the newly created columns for each encoded feature, since it can be inferred from the others and is therefore redundant.
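Here is a minimal sketch of what `drop_first` does, using a tiny made-up column (the values are for illustration only). It also shows why the dropped column is redundant: it can be rebuilt from the remaining ones.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

full = pd.get_dummies(toy)                         # one column per category
reduced = pd.get_dummies(toy, drop_first = True)   # first category dropped

print(list(full.columns))     # ['embarked_C', 'embarked_Q', 'embarked_S']
print(list(reduced.columns))  # ['embarked_Q', 'embarked_S']

# A row belongs to the dropped category "C" exactly when
# all the remaining dummy columns are zero/False
recovered_C = reduced.sum(axis = 1) == 0
print(recovered_C.tolist())   # [False, True, False, False]
```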

In [None]:
data = titanic.copy()
temp = data["survived"]
data.drop(columns = ["survived"], inplace = True)
data

In [None]:
data = pd.get_dummies(data, drop_first = True)
data["survived"] = temp
data

Now that we have the new columns one-hot encoded and the log transformations, we can drop the original columns for the modeling step.

In [None]:
data.drop(cols_norm, axis = 1, inplace = True)
data

# Summary

We finished the data analysis and pre-processing of the Titanic dataset, and it is now ready for the modeling step, which we will start next week.

From the analysis, we could observe that:

* `sibsp`, `parch`, and `fare` were right-skewed and we applied the *log* transformation to them.
* Adult male and third-class passengers are the majority and were the least likely to survive.
* Passengers that embarked in *Cherbourg* had a higher chance of survival, the opposite of the passengers in deck *F*.