<div style="text-align: center;">
    <h1 style="color: #3498db;">Artificial Intelligence & Machine Learning</h1>
    <h2 style="color: #3498db;">Part 2: Preprocessing</h2>
</div>

-------------------------------------------------------------

<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    <b>Authors:</b> K. Said<br>
    <b>Date:</b> 08-09-2023
</div>

<div style="background-color: #e6e6e6; padding: 10px; border-radius: 5px; margin-top: 10px;">
    <p>This notebook is part of the "Artificial Intelligence & Machine Learning" lecture material. The following copyright statement applies to all contents and code within this file.</p>
    <b>Copyright statement:</b>
    <p>This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors and lecturers.</p>
</div>


<h1 style="color:rgb(0,120,170)">Introduction</h1>

-----------------------------------------------

<h2 style="color:rgb(0,120,0)">What you have learned so far</h2>

--------------------------------------------------------------------

By now you should know the basics of ML, the different data types, and data analysis. In the last notebook, "Module 2 Part 1 - Data Analysis", we analysed and explored the penguins dataset (described in the first notebook). We displayed the data as a pandas dataframe, checked it for missing values, plotted e.g. the distribution of the species, and checked for correlations between features.


<h2 style="color:rgb(0,120,0)">What is preprocessing?</h2>

------------------------------------------------------
In the lecture we learned that preprocessing is one of the initial steps in preparing data for machine learning. This step involves not only cleaning and transforming data, but also organizing raw data to make it suitable for our models. Common tasks include handling missing values, scaling features, encoding categorical variables, and splitting data into training and testing sets. During this step we might also project the data onto fewer dimensions (e.g. with PCA) or select the features we feed to our model.

After this step, we should be able to simply load the data and train our model without any errors.

<h2 style="color:rgb(0,120,0)">Our Task</h2>

-------------------------------------------------

Our task in this notebook is to apply different preprocessing techniques to the dataset so that we can feed it to our model. For this, we not only have to get rid of missing values, but also encode the labels and convert the raw features into a numerical form the model can work with.

<h1 style="color:rgb(0,120,170)">Preprocessing - Example</h1>

-----------------------------------------------

Now that you are done with your first data analysis steps, it's time to bring your dataset into the right format for your ML models. To make things easier, we will show you some possible preprocessing steps. For this example we will reuse the penguins dataset and some of the insights we gained from the last exercise.

In [1]:
# Some imports we need for the preprocessing
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, LabelEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

np.random.seed(42) # DO NOT CHANGE

In [2]:
# Now we load our initial toy dataset 
penguins = sns.load_dataset("penguins")
penguins

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


<details>
<summary style="font-size: larger; color: white; background-color: rgba(255, 165, 0, 0.6); border: 1px solid grey; padding: 5px 15px; border-radius: 8px; cursor: pointer;">How to handle NaN values?</summary>

<div style="background-color: rgba(255, 204, 153, 0.6); padding: 10px; border-radius: 5px;">
   There are many ways to deal with NaN values. We could simply drop the rows with missing values, or <a href="https://scikit-learn.org/stable/modules/impute.html" target="_blank" style="color: blue; text-decoration: none;">impute them</a> instead, e.g. with the mean or median of that specific feature. In our analysis we saw that only around 3% of the values in the sex column were NaN, so in this case we will simply drop those rows.
</div>
</details>


In [3]:
df = penguins.dropna()
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
...,...,...,...,...,...,...,...
338,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,Female
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    Alright, we dropped only 11 rows, so luckily not too many. To be sure there are no NaN values left, we will check again.
</div>

In [4]:
df.isna().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
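
As an alternative to dropping rows, the imputation mentioned above could look roughly like the following. This is only a minimal sketch on a small made-up dataframe (the values and the name `toy` are assumptions for illustration, not the real penguins data): numeric gaps are filled with the column median via `SimpleImputer`, and the categorical gap is filled with the most frequent category via pandas.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical mini dataset that mimics some penguins columns (values are made up)
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 46.8],
    "body_mass_g": [3750.0, 3800.0, np.nan, 4850.0],
    "sex": ["Male", "Female", np.nan, "Female"],
})

numeric_cols = ["bill_length_mm", "body_mass_g"]

imputed = toy.copy()
# Numeric features: replace NaN with the median of the respective column
imputed[numeric_cols] = SimpleImputer(strategy="median").fit_transform(toy[numeric_cols])
# Categorical feature: replace NaN with the most frequent category
imputed["sex"] = toy["sex"].fillna(toy["sex"].mode()[0])

print(imputed)
print(imputed.isna().sum())
```

Whether dropping or imputing is the better choice depends on how much data is missing and on how much the imputed values could distort the feature distributions.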

<details>
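<summary style="font-size: larger; color: white; background-color: rgba(255, 165, 0, 0.6); border: 1px solid grey; padding: 5px 15px; border-radius: 8px; cursor: pointer;">Scale numeric features</summary>

<div style="background-color: rgba(255, 204, 153, 0.6); padding: 10px; border-radius: 5px;">
   Now it's time to scale all numeric features. Why scaling? Features measured in large units (e.g. body mass in grams) would otherwise dominate features with small ranges (e.g. bill depth in millimetres) in distance-based and gradient-based models. Bringing all features to a comparable range prevents this, and many algorithms also tend to converge faster on scaled inputs. Note that standardization in the strict sense means rescaling to zero mean and unit variance (StandardScaler); in this example we use min-max scaling (MinMaxScaler), which maps every feature to the range [0, 1].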
</div>
</details>


In [5]:
df_new = df.copy()
# Select only the numeric columns; the categorical ones are handled later
numeric_columns = df_new.select_dtypes(include=[np.number])

# Initialize the MinMaxScaler --> rescales every feature to the range [0, 1]
scaler = MinMaxScaler()

numeric_columns_scaled = scaler.fit_transform(numeric_columns)
df_new[numeric_columns.columns] = numeric_columns_scaled
df_new

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,0.254545,0.666667,0.152542,0.291667,Male
1,Adelie,Torgersen,0.269091,0.511905,0.237288,0.305556,Female
2,Adelie,Torgersen,0.298182,0.583333,0.389831,0.152778,Female
4,Adelie,Torgersen,0.167273,0.738095,0.355932,0.208333,Female
5,Adelie,Torgersen,0.261818,0.892857,0.305085,0.263889,Male
...,...,...,...,...,...,...,...
338,Gentoo,Biscoe,0.549091,0.071429,0.711864,0.618056,Female
340,Gentoo,Biscoe,0.534545,0.142857,0.728814,0.597222,Female
341,Gentoo,Biscoe,0.665455,0.309524,0.847458,0.847222,Male
342,Gentoo,Biscoe,0.476364,0.202381,0.677966,0.694444,Female


<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
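    Comparing the initial dataframe with the current one, we see that our numeric features are now all scaled to values between 0 and 1, so no feature dominates merely because of its larger numeric range. Keep in mind that min-max scaling is sensitive to outliers: a single extreme value stretches the range and compresses all other values.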
</div>
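
To make the difference between min-max scaling and z-score standardization concrete, here is a small sketch. The body masses are made-up numbers, and the last one is an artificial outlier added purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical body masses in grams; the last value is an artificial outlier
mass = np.array([[3000.0], [3500.0], [4000.0], [4500.0], [20000.0]])

# Min-max scaling: (x - min) / (max - min), result lies in [0, 1]
minmax = MinMaxScaler().fit_transform(mass)
# Z-score standardization: (x - mean) / std, result has mean 0 and std 1
zscore = StandardScaler().fit_transform(mass)

print("min-max:", minmax.ravel())
print("z-score:", zscore.ravel())
```

With the outlier present, min-max scaling squeezes the four ordinary values into less than a tenth of the [0, 1] range. In a full workflow, the scaler would typically be fitted on the training split only and then applied to the test split.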

<details>
<summary style="font-size: larger; color: white; background-color: rgba(255, 165, 0, 0.6); border: 1px solid grey; padding: 5px 15px; border-radius: 8px; cursor: pointer;">How to make use of non-numeric features?</summary>

<div style="background-color: rgba(255, 204, 153, 0.6); padding: 10px; border-radius: 5px;">
   Well, for this we can use <a href="https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html" target="_blank" style="color: blue; text-decoration: none;">One-Hot-Encoding</a>, which transforms categorical/non-numeric data into numeric values.
</div>
</details>

In [6]:
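# dtype=int keeps the 0/1 output shown below (pandas >= 2.0 returns booleans by default)
df_new2 = pd.get_dummies(df_new, columns=["sex", "island"], drop_first=False, dtype=int)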
df_new2

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex_Female,sex_Male,island_Biscoe,island_Dream,island_Torgersen
0,Adelie,0.254545,0.666667,0.152542,0.291667,0,1,0,0,1
1,Adelie,0.269091,0.511905,0.237288,0.305556,1,0,0,0,1
2,Adelie,0.298182,0.583333,0.389831,0.152778,1,0,0,0,1
4,Adelie,0.167273,0.738095,0.355932,0.208333,1,0,0,0,1
5,Adelie,0.261818,0.892857,0.305085,0.263889,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...
338,Gentoo,0.549091,0.071429,0.711864,0.618056,1,0,1,0,0
340,Gentoo,0.534545,0.142857,0.728814,0.597222,1,0,1,0,0
341,Gentoo,0.665455,0.309524,0.847458,0.847222,0,1,1,0,0
342,Gentoo,0.476364,0.202381,0.677966,0.694444,1,0,1,0,0


<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    Instead of having a single "sex" column, we now have the two columns "sex_Female" and "sex_Male". Similarly, the "island" column has been replaced by three binary columns, where a value of 1 indicates that the penguin originates from that particular island and a value of 0 that it is from a different island. But one question remains: do we really need all features/columns? For example, if we know that sex_Female is 1, then sex_Male has to be 0, which makes one of the two columns redundant. The same applies to the island columns, where we only need 2 out of 3: if island_Biscoe and island_Dream are both 0, then it has to be island_Torgersen. Therefore, we can do the exact same thing, but this time with drop_first=True, which removes the first category of each encoded feature (here sex_Female and island_Biscoe).
</div>

In [7]:
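# dtype=int keeps the 0/1 output shown below (pandas >= 2.0 returns booleans by default)
df_new2 = pd.get_dummies(df_new, columns=["sex", "island"], drop_first=True, dtype=int)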
df_new2

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex_Male,island_Dream,island_Torgersen
0,Adelie,0.254545,0.666667,0.152542,0.291667,1,0,1
1,Adelie,0.269091,0.511905,0.237288,0.305556,0,0,1
2,Adelie,0.298182,0.583333,0.389831,0.152778,0,0,1
4,Adelie,0.167273,0.738095,0.355932,0.208333,0,0,1
5,Adelie,0.261818,0.892857,0.305085,0.263889,1,0,1
...,...,...,...,...,...,...,...,...
338,Gentoo,0.549091,0.071429,0.711864,0.618056,0,0,0
340,Gentoo,0.534545,0.142857,0.728814,0.597222,0,0,0
341,Gentoo,0.665455,0.309524,0.847458,0.847222,1,0,0
342,Gentoo,0.476364,0.202381,0.677966,0.694444,0,0,0


<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    By dropping these redundant columns we lose no information, but we save space, reduce the dimensionality, slightly speed up processing, and avoid perfectly correlated features, which can cause problems for e.g. linear models.
</div>
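
The claim that the dropped column carries no extra information can be checked directly. The following sketch uses a tiny made-up `island` column (an assumption for illustration) and reconstructs the dropped dummy column from the remaining ones.

```python
import pandas as pd

# Hypothetical mini example with the three island names from the dataset
toy = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen", "Dream"]})

# dtype=int gives 0/1 columns regardless of the pandas version
full = pd.get_dummies(toy, columns=["island"], drop_first=False, dtype=int)
reduced = pd.get_dummies(toy, columns=["island"], drop_first=True, dtype=int)

# The dropped column is 1 exactly when all remaining dummy columns are 0
recovered = 1 - reduced.sum(axis=1)

print(list(reduced.columns))
print((recovered == full["island_Biscoe"]).all())
```

Here `drop_first=True` removes `island_Biscoe`, the first category in sorted order, and the comparison confirms that it can be rebuilt from `island_Dream` and `island_Torgersen`.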

<details>
<summary style="font-size: larger; color: white; background-color: rgba(255, 165, 0, 0.6); border: 1px solid grey; padding: 5px 15px; border-radius: 8px; cursor: pointer;">What about the labels?</summary>

<div style="background-color: rgba(255, 204, 153, 0.6); padding: 10px; border-radius: 5px;">
   Since we want to keep the labels in a single column, we will use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html" target="_blank" style="color: blue; text-decoration: none;">LabelEncoder</a>, which maps each class name to an integer.

</div>
</details>

In [8]:
encoded_df = df_new2.copy()
label_encoder = LabelEncoder()
encoded_df['species'] = label_encoder.fit_transform(encoded_df['species'])

encoded_df

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex_Male,island_Dream,island_Torgersen
0,0,0.254545,0.666667,0.152542,0.291667,1,0,1
1,0,0.269091,0.511905,0.237288,0.305556,0,0,1
2,0,0.298182,0.583333,0.389831,0.152778,0,0,1
4,0,0.167273,0.738095,0.355932,0.208333,0,0,1
5,0,0.261818,0.892857,0.305085,0.263889,1,0,1
...,...,...,...,...,...,...,...,...
338,2,0.549091,0.071429,0.711864,0.618056,0,0,0
340,2,0.534545,0.142857,0.728814,0.597222,0,0,0
341,2,0.665455,0.309524,0.847458,0.847222,1,0,0
342,2,0.476364,0.202381,0.677966,0.694444,0,0,0


<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    The species column was successfully encoded as integer labels, great. Before moving on, have a look at the initial dataframe and the final encoded one and compare both.
</div>
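
To see which integer belongs to which species, and to translate predictions back into names later on, the fitted encoder can be inspected. A minimal sketch, using a short hand-written list of species names instead of the full dataframe:

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written stand-in for the species column
species = ["Adelie", "Gentoo", "Chinstrap", "Adelie"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ is sorted; the position of each class name is its integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original names
print(encoder.inverse_transform(codes))
```

The classes are sorted alphabetically before codes are assigned, which is why Gentoo ends up as 2 in the encoded dataframe above.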

<details>
<summary style="font-size: larger; color: white; background-color: rgba(255, 165, 0, 0.6); border: 1px solid grey; padding: 5px 15px; border-radius: 8px; cursor: pointer;">Save the data</summary>

<div style="background-color: rgba(255, 204, 153, 0.6); padding: 10px; border-radius: 5px;">
   For this rather simple dataset, the preprocessing steps done above were all we needed. To avoid repeating the same preprocessing steps every time, we simply save our dataset and load it the next time we need it.
    Keep in mind that we won't do the train-test split in this notebook, but in the next one.
</div>
</details>

In [None]:
# Save as csv file
encoded_df.to_csv('preprocessed_dataset.csv', index=False)

<div style="background-color: #f2f2f2; padding: 10px; border-radius: 5px;">
    Checking the folder, we should now see our preprocessed dataset. If you run into problems running this code, try using an absolute path instead.
</div>
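
Loading the file again could look like the sketch below. It uses a small made-up stand-in for the encoded dataframe and a temporary directory, both of which are assumptions for illustration, so that it runs without touching your working folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the encoded dataframe (values are made up)
encoded = pd.DataFrame({
    "species": [0, 2],
    "bill_length_mm": [0.25, 0.5],
    "sex_Male": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessed_dataset.csv")
    # index=False avoids writing the row index as an extra column
    encoded.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded)
print(reloaded.equals(encoded))
```

Without `index=False`, the row index would be written as an unnamed first column and would show up as `Unnamed: 0` when the file is read back in.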

<h1 style="color:rgb(0,120,170)">Preprocessing - Your turn</h1>

-----------------------------------------------

Now it's your turn to show what you have learned.
The things we showed you above are only a very small subset of all the possible things one could do. For your preprocessing, you might want to reuse some parts of the example, have a look at the "Info" button, or implement things yourself.



<details>
<summary style="font-size: larger; color: white; background-color: #3498db; border: 1px solid #3498db; padding: 5px 15px; border-radius: 8px; cursor: pointer;">Info</summary>

<div style="background-color: #E6F7FF; padding: 10px; border-radius: 5px;">
    When preparing your data for analysis, consider the following preprocessing steps:

   - **Handling Missing Data**
     - Identify and handle missing values, either by imputing them or removing rows/columns.

   - **Feature Scaling**
     - Scale numeric features using techniques like min-max scaling or z-score standardization.

   - **Categorical Encoding**
     - Encode categorical data using methods like one-hot encoding or label encoding.

   - **Outlier Detection**
     - Detect and address outliers that may skew your analysis or models.

   - **Feature Selection**
     - Select relevant features to reduce dimensionality and improve model performance.

   - **Feature Engineering**
     - Create new features based on domain knowledge to enhance model learning.

   - **Handling Imbalanced Data**
     - Address class imbalance issues through techniques like oversampling or undersampling.

   - **Principal Component Analysis (PCA)**
     - Use PCA for dimensionality reduction in high-dimensional datasets.

   - **Handling Duplicate Data**
     - Detect and remove duplicate records to prevent bias in analysis.

   - **Data Aggregation**
     - Aggregate data at different levels, e.g., daily to monthly, for temporal analysis.

   - **Handling Noise**
     - Filter out noise in the data using techniques like smoothing or filtering.

   - **Handling Ordinal Data**
     - Encode ordinal data preserving the order information.

   - **Handling Non-Numeric Data**
     - Convert non-numeric data (e.g., dates) into a numeric format for analysis.

   - **Handling Skewed Data**
     - Apply transformations (e.g., log or Box-Cox) to mitigate skewness in features.

</div>
</details>
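
Several of the steps listed under "Info" can be chained into a single reusable object. The following is one possible sketch, not the required solution: it combines imputation, min-max scaling and one-hot encoding with `Pipeline` and `ColumnTransformer` (both already imported at the top of this notebook) on a small made-up dataframe. The column selection and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical mini dataset (values are made up)
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 46.8, 50.4],
    "body_mass_g": [3750.0, 3800.0, 4850.0, 5750.0],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Dream"],
})

numeric_cols = ["bill_length_mm", "body_mass_g"]
categorical_cols = ["island"]

# Numeric columns: fill gaps with the median, then scale to [0, 1]
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

# Apply the numeric steps and the one-hot encoding to their respective columns
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
])

features = preprocess.fit_transform(toy)
# The result may be a sparse matrix; convert it to a dense array for inspection
features = features.toarray() if hasattr(features, "toarray") else np.asarray(features)

print(features.shape)
print(features)
```

One advantage of this design is that the fitted `preprocess` object can later be applied to new data with `transform`, so the same imputation values, scaling ranges and categories are reused.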


In [None]:
# TODO: Preprocess your dataset 
