# <font color=#14F278>Unit 10 - Data Preprocessing in Pandas:</font>
---

## <font color=#14F278>1. Data Preprocessing:</font>
So far we have learnt how to build Pandas objects (Series and DataFrames) and how to explore, navigate, aggregate and combine them. Another important stage of Data Analysis is <font color=#14F278>**pre-processing**</font> the data - that is, preparing the data for the next stage of our analysis. 

<font color=#14F278>**Data Preprocessing**</font> plays a crucial role in Data Science and Machine Learning - it largely involves <font color=#14F278>**performing transformations**</font> to ensure the dataset is in a form that is fit for purpose. Depending on the analysis, this might include:
- **labelling data**
- **encoding categorical data**
- **normalising data**
- **feature engineering**

In this unit, we will explore how to perform transformations on categorical and numerical data. We will also learn how to find correlations between columns (features). This unit is a <font color=#14F278>**preparation for later training in Data Science**</font>, when we will be revisiting the covered concepts in the context of ML.

---
## <font color=#14F278>2. Types of Data:</font>

Data largely falls into two categories - <font color=#14F278>**Numerical (Quantitative)**</font> and <font color=#14F278>**Categorical (Qualitative)**</font>. Before conducting any analysis on a dataset, we first need to assess the data types we are working with - this will help us identify the type of **pre-processing** we might need to apply later:

<center>
    <div>
        <img src="..\images\preprocessing_001.png"/>
    </div>
</center>

In [1]:
import pandas as pd
import numpy as np
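
As a minimal sketch of how the data types in a DataFrame might be assessed, the example below inspects `dtypes` and splits the columns with `select_dtypes()`. The DataFrame `df_types` and its values are hypothetical and only serve as an illustration. Note that Pandas can only tell numeric from non-numeric columns; deciding whether a variable is discrete, continuous, ordinal or nominal still requires knowledge of the data.

```python
import pandas as pd

# Hypothetical example data (not part of the unit's datasets)
df_types = pd.DataFrame({
    'Age': [14, 32, 8],                        # numerical, discrete
    'Temperature': [18.11, 17.68, 9.47],       # numerical, continuous
    'Risk Score': ['high', 'low', 'medium'],   # categorical, ordinal
    'Country': ['Spain', 'Germany', 'Spain'],  # categorical, nominal
})

# The storage type of every column
print(df_types.dtypes)

# Split the column names into numeric and non-numeric groups
numeric_cols = df_types.select_dtypes(include='number').columns.tolist()
categorical_cols = df_types.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols)
print(categorical_cols)
```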

---
## <font color=#14F278>3. Numerical Data Transformations:</font>

Numerical Data can be pre-processed in multiple ways, depending on the use case. In this unit, we will introduce 3 types of transformations:
- <font color=#14F278>**Binarization**</font>
- <font color=#14F278>**Min-Max Normalisation**</font>
- <font color=#14F278>**Standardisation**</font>

<font color=#FF8181>**Note:**</font> There are other types of transformations applicable to numerical variables, such as <font color=#FF8181>**Log Transformation**</font>.
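
For reference, a Log Transformation could be sketched as follows. The `income` values are hypothetical, and `np.log1p` (which computes log(1 + x)) is just one possible choice, used here because it is also defined for zero values.

```python
import numpy as np
import pandas as pd

# Hypothetical, strongly skewed values (not part of the unit's datasets)
income = pd.Series([20_000, 35_000, 50_000, 1_000_000], name='Income')

# log1p(x) = log(1 + x); it compresses large values while preserving the order
log_income = np.log1p(income)
print(log_income)
```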



---
### <font color=#14F278>3.1 Binarization:</font>
<font color=#14F278>**Binarization**</font> is a technique whereby we choose a threshold and assign each numerical observation to one of two disjoint groups (bins), typically encoded as 0 and 1, depending on whether its value falls below or above that threshold. With multiple thresholds, the same idea produces more than two bins, which is usually referred to as binning (discretisation). Binarization can be applied to both discrete and continuous numerical variables.

\
\
Let's explore the following scenario:
- we have a patient dataset, containing a column `'Age'` - this is a discrete variable
- we can group (bin) observations, based on whether they belong to patients above or below 18 years of age:

<center>
    <div>
        <img src="..\images\preprocessing_002.png"/>
    </div>
</center>

In [2]:
# Example Dataframe
data = {'Patient':[111945, 111946, 111947, 111948, 111949],
        'Age':[14,32,8,44,28]}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,Patient,Age
0,111945,14
1,111946,32
2,111947,8
3,111948,44
4,111949,28


- To binarise in Pandas, we can simply use the `apply()` method with a function, grouping the observations in the dataset:

In [3]:
# Perform binarization using the apply() method
# Here we use a lambda function, but we could also define a named function in the conventional way
# Note: with this condition, a patient aged exactly 18 is assigned 1
df['Above_18'] = df.apply(lambda row: 0 if row['Age']<18 else 1, axis = 1)
display(df)

Unnamed: 0,Patient,Age,Above_18
0,111945,14,0
1,111946,32,1
2,111947,8,0
3,111948,44,1
4,111949,28,1
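
As an alternative sketch, the same column can be produced without `apply()` by comparing the whole column to the threshold and casting the Boolean result to integers. For multiple thresholds, `pd.cut()` can assign each observation to a bin. The bin edges and the labels `'child'`, `'adult'` and `'senior'` below are hypothetical and chosen purely for illustration.

```python
import pandas as pd

df_bin = pd.DataFrame({'Patient': [111945, 111946, 111947, 111948, 111949],
                       'Age': [14, 32, 8, 44, 28]})

# Single threshold: vectorised comparison, then cast True/False to 1/0
df_bin['Above_18'] = (df_bin['Age'] >= 18).astype(int)

# Multiple thresholds (hypothetical edges): bins are [0, 18), [18, 40), [40, 120)
df_bin['Age_Group'] = pd.cut(df_bin['Age'],
                             bins=[0, 18, 40, 120],
                             right=False,
                             labels=['child', 'adult', 'senior'])
print(df_bin)
```

The comparison `>= 18` reproduces the behaviour of the lambda function above, where a patient aged exactly 18 is assigned 1.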


---
### <font color=#14F278>3.2 Min-Max Normalisation:</font>
<font color=#14F278>**Min-Max Normalisation**</font> is a technique, whereby we rescale the numerical variable, so that its values fall in the [0,1] range. The highest observed value is assigned 1, and the lowest - 0. The formula used for the normalisation is:
$$   
\text{Min-Max Normalisation} = \frac{\text{value} - \text{min value}}{\text{max value} - \text{min value}}
$$

\
\
Consider the following example:
<center>
    <div>
        <img src="..\images\preprocessing_003.png"/>
    </div>
</center>

- **Temperature** is a numerical column, measuring the daily temperature in Celsius
- **Humidity** is a numerical column, measuring the air humidity in percentage
- Since these are numeric variables, measured in different scales, and spreading across different ranges, we can perform **Normalisation (or Min-Max Normalisation)** - all numeric columns are rescaled between 0 and 1. Normalising data resolves the issue of measuring features on different scales



In [4]:
# Example Dataframe
data = {'Date':['01/01/2011', '02/01/2011', '03/01/2011', '04/01/2011', '05/01/2011'],
        'Temperature': [18.11, 17.68, 9.47, 10.60, 11.46],
        'Humidity': [80.58, 69.6, 43.73, 59.05, 43.7]}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,Date,Temperature,Humidity
0,01/01/2011,18.11,80.58
1,02/01/2011,17.68,69.6
2,03/01/2011,9.47,43.73
3,04/01/2011,10.6,59.05
4,05/01/2011,11.46,43.7


In [5]:
# Normalising one column at a time
df['Temp_norm'] = (df['Temperature']-df['Temperature'].min())/(df['Temperature'].max()-df['Temperature'].min())
df['Hum_norm'] = (df['Humidity']-df['Humidity'].min())/(df['Humidity'].max()-df['Humidity'].min())
display(df)

Unnamed: 0,Date,Temperature,Humidity,Temp_norm,Hum_norm
0,01/01/2011,18.11,80.58,1.0,1.0
1,02/01/2011,17.68,69.6,0.950231,0.702278
2,03/01/2011,9.47,43.73,0.0,0.000813
3,04/01/2011,10.6,59.05,0.130787,0.416215
4,05/01/2011,11.46,43.7,0.230324,0.0
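
The formula can also be applied to several columns in one step, since Pandas aligns the per-column minimum and range with the matching columns. This is a minimal sketch; the names `df_w`, `cols` and `normed` are introduced here for illustration only.

```python
import pandas as pd

df_w = pd.DataFrame({'Temperature': [18.11, 17.68, 9.47, 10.60, 11.46],
                     'Humidity': [80.58, 69.6, 43.73, 59.05, 43.7]})

cols = ['Temperature', 'Humidity']

# Per-column minimum and range (max - min), each returned as a Series
col_min = df_w[cols].min()
col_range = df_w[cols].max() - col_min

# (value - min) / (max - min), applied column by column
normed = ((df_w[cols] - col_min) / col_range).add_suffix('_norm')
print(normed)
```

Note that a column in which all values are identical has a range of 0, so the division would produce `NaN` values for that column; such columns need to be handled separately.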


---
### <font color=#14F278>3.3 Standardisation (Z-Score Normalisation):</font>

An alternative rescaling technique is <font color=#14F278>**Standardisation**</font>, whereby we rescale the numerical variable, so that it has a mean of 0 and a variance of 1 (unit-variance). The formula used for standardisation is:
$$   
\text{Standardisation} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}
$$

\
\
Consider the following example:
<center>
    <div>
        <img src="..\images\preprocessing_004.png"/>
    </div>
</center>

- **Exam Score** is a numerical column, measuring each student's exam score. 
- To normalise these values we can perform a **Z-Score Normalisation (Standardisation)** - this procedure rescales the numerical feature so that it has a mean of 0 and a variance of 1. This is often used in academic settings when analysing how well an individual performed, compared to the average performance. 
- **Standardised Score** is the re-scaled version of **Exam Score**

In [6]:
# Example Dataframe
data = {'StudentID':[13090, 13091, 13092, 13093, 13094, 13095, 13096,13097, 13098, 13099],
        'Exam Score':[54,78,65,67,72,83,81,90,52,75]}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,StudentID,Exam Score
0,13090,54
1,13091,78
2,13092,65
3,13093,67
4,13094,72
5,13095,83
6,13096,81
7,13097,90
8,13098,52
9,13099,75


In [7]:
# Standardisation on a given column
df['Standardised Score'] = (df['Exam Score'] - df['Exam Score'].mean())/df['Exam Score'].std()
display(df)

Unnamed: 0,StudentID,Exam Score,Standardised Score
0,13090,54,-1.435607
1,13091,78,0.510979
2,13092,65,-0.543422
3,13093,67,-0.381206
4,13094,72,0.024332
5,13095,83,0.916518
6,13096,81,0.754302
7,13097,90,1.484272
8,13098,52,-1.597823
9,13099,75,0.267656
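
The sketch below checks that the standardised values have a mean of 0 and a standard deviation of 1. One detail worth knowing: the Pandas `std()` method computes the *sample* standard deviation by default (`ddof=1`), which is what the cell above uses. Passing `ddof=0` gives the population standard deviation instead, and the resulting scores differ slightly.

```python
import pandas as pd

scores = pd.Series([54, 78, 65, 67, 72, 83, 81, 90, 52, 75], name='Exam Score')

# Sample standard deviation (pandas default, ddof=1), as used above
z_sample = (scores - scores.mean()) / scores.std()

# Population standard deviation (ddof=0)
z_population = (scores - scores.mean()) / scores.std(ddof=0)

# Mean and standard deviation of the rescaled values (rounded to hide floating-point noise)
print(round(z_sample.mean(), 10), round(z_sample.std(), 10))
print(round(z_population.mean(), 10), round(z_population.std(ddof=0), 10))
```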


---
## <font color=#14F278>4. Categorical Data Transformations:</font>

Similarly to Numerical data, Categorical data can be pre-processed in various ways, depending on its nature and the use case. Typically stored in *text format*, categorical variables represent important information for the analysis, yet their format can pose challenges and limitations to many ML algorithms, or other tools for analysis. 

\
\
In what follows, we will learn how to:
- <font color=#14F278>**Explore categorical variables**</font>
- <font color=#14F278>**Integer Encode**</font> ordinal categorical variables
- <font color=#14F278>**One-Hot Encode**</font> nominal categorical variables

---
### <font color=#14F278>4.1 Categorical Data Exploration - `value_counts()`:</font>

- To explore the range of categories, contained in a dataset, we can use the `value_counts()` method
- apply the method to a Series object to retrieve another Series
- the index of the Series contains all distinct categories in the variable
- the value of the Series contains the number of observations per category

\
\
Consider the following example:
<center>
    <div>
        <img src="..\images\preprocessing_005.png" />
    </div>
</center>

In [8]:
# Example Dataframe
data = {'CustomerID': [111945, 111946, 111947, 111948, 111949, 111950, 111951, 111952],
        'Risk Score': ['high', 'low', 'very low', 'low', 'medium', 'very high', 'medium', 'low'], 
        'Gender': ['male', 'female', 'male', 'male', 'female', 'male', 'female', 'female']}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,CustomerID,Risk Score,Gender
0,111945,high,male
1,111946,low,female
2,111947,very low,male
3,111948,low,male
4,111949,medium,female
5,111950,very high,male
6,111951,medium,female
7,111952,low,female


In [9]:
# Explore the distribution of observations per 'Risk Score' category
df['Risk Score'].value_counts()

Risk Score
low          3
medium       2
high         1
very low     1
very high    1
Name: count, dtype: int64
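
As a small extension, `value_counts()` accepts `normalize=True` to return the share of observations per category rather than the raw count, and `nunique()` returns the number of distinct categories. A minimal sketch on the same `'Risk Score'` values:

```python
import pandas as pd

risk = pd.Series(['high', 'low', 'very low', 'low', 'medium', 'very high', 'medium', 'low'],
                 name='Risk Score')

# Proportion of observations in each category (the proportions sum to 1)
shares = risk.value_counts(normalize=True)
print(shares)

# Number of distinct categories
print(risk.nunique())
```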

---
### <font color=#14F278>4.2 Ordinal Encoding:</font>

<font color=#14F278>**Ordinal Encoding**</font> is a technique, whereby we translate <font color=#14F278>**categorical data into numbers (integers)**</font>. Although it can be applied to both ordinal and nominal categorical variables, Ordinal Encoding (as the name suggests) is <font color=#14F278>**best suited for ordinal categories**</font> - these are categories which naturally follow a given hierarchy.

\
\
In Pandas, ordinal encoding can be performed in several ways:
- by using the `replace()` method
- by using the `apply()` method with a function


\
\
Consider the following example:
- **Risk Score** is an *ordinal categorical variable* with 5 possible values - *very low*, *low, medium, high* and *very high*. Assigning these to a score from 1 to 5 is thus very easy and intuitive - *very low = 1*,  *very high = 5, etc.*
<center>
    <div>
        <img src="..\images\preprocessing_006.png" />
    </div>
</center>

In [10]:
# Make a copy of df (to avoid the need to recreate it later)
df1 = df.copy()

# Create a dictionary, instructing on the values to replace the current categories
risk_dict = {'Risk Score':{'very low':1, 'low':2, 'medium':3, 'high':4, 'very high':5}}
df1 = df1.replace(risk_dict)
display(df1)


Unnamed: 0,CustomerID,Risk Score,Gender
0,111945,4,male
1,111946,2,female
2,111947,1,male
3,111948,2,male
4,111949,3,female
5,111950,5,male
6,111951,3,female
7,111952,2,female


In [11]:
# Alternatively, define a function and use the .apply() method to encode categories
def risk_converter(row):
    if row['Risk Score'] == 'very low':
        val = 1
    elif row['Risk Score'] == 'low':
        val = 2
    elif row['Risk Score'] == 'medium':
        val = 3
    elif row['Risk Score'] == 'high':
        val = 4
    else:
        # Note: any other value, not only 'very high', falls through to 5
        val = 5
    return val

# Use the apply() method to perform the encoding
df['Risk Score Num'] = df.apply(risk_converter, axis = 1)
display(df)

Unnamed: 0,CustomerID,Risk Score,Gender,Risk Score Num
0,111945,high,male,4
1,111946,low,female,2
2,111947,very low,male,1
3,111948,low,male,2
4,111949,medium,female,3
5,111950,very high,male,5
6,111951,medium,female,3
7,111952,low,female,2
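
A further option, sketched below, is the Series `map()` method with a dictionary. Unlike the `else` branch in the function above, `map()` leaves any category missing from the dictionary as `NaN` instead of silently assigning it a score. The same integers can also be obtained from an ordered `pd.Categorical`. The names `risk_levels`, `risk_map` and `'Risk Score Code'` are introduced here for illustration only.

```python
import pandas as pd

df_risk = pd.DataFrame({'Risk Score': ['high', 'low', 'very low', 'low',
                                       'medium', 'very high', 'medium', 'low']})

# Categories listed from lowest to highest
risk_levels = ['very low', 'low', 'medium', 'high', 'very high']

# Build {'very low': 1, 'low': 2, ..., 'very high': 5}
risk_map = {level: i for i, level in enumerate(risk_levels, start=1)}

# Option 1: map each category to its integer (unmapped categories become NaN)
df_risk['Risk Score Num'] = df_risk['Risk Score'].map(risk_map)

# Option 2: ordered categorical; codes start at 0, so add 1 to match the 1-5 scale
risk_cat = pd.Categorical(df_risk['Risk Score'], categories=risk_levels, ordered=True)
df_risk['Risk Score Code'] = risk_cat.codes + 1
print(df_risk)
```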


---
### <font color=#14F278>4.3 One Hot Encoding:</font>

Despite its straightforward implementation, Ordinal Encoding has the disadvantage that the numeric values can be misinterpreted by algorithms as carrying an order or magnitude. We need to be particularly careful about this when working with <font color=#14F278>**nominal categorical data**</font>. As nominal values follow <font color=#14F278>**no hierarchical order**</font>, encoding them with integers (which have a natural numerical order) <font color=#14F278>**can create bias**</font>.
\
\
Consider the following table:
<center>
    <div>
        <img src="..\images\preprocessing_007.png" />
    </div>
</center>

- **Country** is a *nominal categorical variable* with 3 possible values - *United Kingdom, Germany* and *Spain* 
- **Country Code** is the numeric equivalent of column **Country**, produced by integer (label) encoding. Such encoding creates a bias - United Kingdom is associated with the lowest number (1), while Spain is associated with the highest number
- **Is UK? , Is Germany? and Is Spain?** are the numeric columns we can create by performing **One Hot Encoding**. This process avoids the ordering bias when tackling nominal categorical variables. Each record (i.e. each customer) is now associated with a Boolean value in each of the 3 columns - a value of 1 (or `True`) indicates that the customer is from the given country
- use the `pd.get_dummies()` function to perform **one hot encoding**

In [12]:
# Example Dataframe
data = {'CustomerID': [111945, 111946, 111947, 111948, 111949, 111950, 111951, 111952],
        'Country': ['United Kingdom', 'Germany', 'Spain', 'Spain', 'Spain', 'United Kingdom', 'United Kingdom', 'Germany']}
df = pd.DataFrame(data)

# Perform One Hot Encoding
df = pd.get_dummies(df, columns = ['Country'], prefix = ['Is'])
display(df)

Unnamed: 0,CustomerID,Is_Germany,Is_Spain,Is_United Kingdom
0,111945,False,False,True
1,111946,True,False,False
2,111947,False,True,False
3,111948,False,True,False
4,111949,False,True,False
5,111950,False,False,True
6,111951,False,False,True
7,111952,True,False,False
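
In recent Pandas versions the dummy columns are Boolean by default, as in the output above. If 0/1 integers are preferred, the `dtype` parameter of `pd.get_dummies()` can be set, as in this minimal sketch on a shortened, three-row version of the dataset (`df_c` and `encoded` are illustrative names):

```python
import pandas as pd

df_c = pd.DataFrame({'CustomerID': [111945, 111946, 111947],
                     'Country': ['United Kingdom', 'Germany', 'Spain']})

# dtype=int yields 0/1 columns instead of False/True
encoded = pd.get_dummies(df_c, columns=['Country'], prefix='Is', dtype=int)
print(encoded)

# Each row has exactly one 1 across the dummy columns
dummy_cols = ['Is_Germany', 'Is_Spain', 'Is_United Kingdom']
print(encoded[dummy_cols].sum(axis=1).tolist())
```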


<font color=#FF8181>**Question:**</font> Which of the columns in 4.2 would you recommend One Hot Encoding for, and why?

---
## <font color=#14F278> 5. Summary:</font>

**Data Preprocessing** is an important step in many Data Science and Machine Learning workflows:
- before applying any preprocessing, we need to ensure good understanding of the data types at hand
- data is either **numerical (quantitative)** or **categorical (qualitative)**
- **numerical data** is further divided into continuous and discrete
- **categorical data** is further divided into ordinal and nominal
- typical transformations on numerical data are **Binarization, Min-Max Normalisation, Standardisation**
- typical transformations on categorical data are **Ordinal and One Hot Encoding**
- useful Pandas tools when exploring and preprocessing data are the `apply()`, `value_counts()` and `replace()` methods and the `pd.get_dummies()` function

---
## <font color=#FF8181> 6. Concept Check: </font>
Suppose you have a dataset, containing exam score records of students from different universities:
- Standardise the exam scores
- One Hot Encode the university category
- further binarise the standardised exam score to create a column `'Above Average?'`
- How many students scored above the average on the exam?

<center>
    <div>
        <img src="..\images\preprocessing_008.png" />
    </div>
</center>

In [None]:
# Example Dataframe
data = {'StudentID':[13090, 13091, 13092, 13093, 13094, 13095, 13096,13097, 13098, 13099],
        'Exam Score':[54,78,65,67,72,83,81,90,52,75],
        'University':['Warwick', 'Harvard', 'LSE', 'Harvard', 'LSE', 'Warwick', 'LSE', 'Harvard', 'Harvard', 'Warwick']}
df = pd.DataFrame(data)
display(df)