# Load packages and create data

In [1]:
## run this if you haven't already downloaded the packages

## this downloads the packages listed in the requirements.txt file
# pip install -r requirements.txt

## this downloads the latest versions of the packages, which may not match the versions in the requirements.txt file
# pip install ipykernel pandas scikit-learn

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [3]:
data = [
    {"Age": 30, "Gender": "Female", "Education": "School", "Review": "Average", "Purchased": "No"},
    {"Age": 68, "Gender": "Female", "Education": "UG", "Review": "Poor", "Purchased": "No"},
    {"Age": 70, "Gender": "Female", "Education": "PG", "Review": "Good", "Purchased": "No"},
    {"Age": 72, "Gender": "Female", "Education": "PG", "Review": "Good", "Purchased": "No"},
    {"Age": 16, "Gender": "Female", "Education": "UG", "Review": "Average", "Purchased": "No"},
    {"Age": 31, "Gender": "Female", "Education": "School", "Review": "Average", "Purchased": "Yes"},
    {"Age": 18, "Gender": "Male", "Education": "School", "Review": "Good", "Purchased": "Yes"},
    {"Age": 60, "Gender": "Female", "Education": "School", "Review": "Poor", "Purchased": "Yes"},
    {"Age": 65, "Gender": "Female", "Education": "UG", "Review": "Average", "Purchased": "Yes"},
    {"Age": 74, "Gender": "Male", "Education": "UG", "Review": "Good", "Purchased": "Yes"}
]

df = pd.DataFrame(data)

print("""There are 10 rows and 5 columns in the dataframe:
      
- The Purchased column is the target variable.
      
- The other 4 columns are the features.
      
-- Age is a numerical feature.
      
-- Gender is a nominal feature.
      
-- Education and Review are ordinal features.""")

df

There are 10 rows and 5 columns in the dataframe:

- The Purchased column is the target variable.

- The other 4 columns are the features.

-- Age is a numerical feature.

-- Gender is a nominal feature.

-- Education and Review are ordinal features.


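| | Age | Gender | Education | Review | Purchased |
|-|-|-|-|-|-|
| 0 | 30 | Female | School | Average | No |
| 1 | 68 | Female | UG | Poor | No |
| 2 | 70 | Female | PG | Good | No |
| 3 | 72 | Female | PG | Good | No |
| 4 | 16 | Female | UG | Average | No |
| 5 | 31 | Female | School | Average | Yes |
| 6 | 18 | Male | School | Good | Yes |
| 7 | 60 | Female | School | Poor | Yes |
| 8 | 65 | Female | UG | Average | Yes |
| 9 | 74 | Male | UG | Good | Yes |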


# Encoding categorical data

* Remember: categorical data is either nominal (no order, e.g. gender, ice cream flavour) or ordinal (has an order, e.g. customer satisfaction, shoe size, salary banding).

* Most machine learning algorithms can only work with numbers, so we need to convert categorical data into a numerical format before our models can use it.

* There are multiple methods to encode categorical data (that is, convert it to numbers). We will focus on label encoding, one-hot encoding and ordinal encoding.

Sources of inspiration:
- https://www.kaggle.com/code/ybifoundation/encoding/notebook
- https://www.kdnuggets.com/crack-the-code-mastering-category-encoders-for-data-scientists

## Label encoding

Here, we replace each distinct value with a number.

For example, we will use:

| Original Purchased value | Map to |
|-|-|
| No | 0 |
| Yes | 1 |

In [4]:
# This is the target variable

print("Before label encoding:")
y = df['Purchased']

y


Before label encoding:


0     No
1     No
2     No
3     No
4     No
5    Yes
6    Yes
7    Yes
8    Yes
9    Yes
Name: Purchased, dtype: object

In [14]:
# Before this code is run, what do you think the output will be?

print("After label encoding:")
y_encoded = y.map({'No': 0, 'Yes': 1})
y_encoded

After label encoding:


0    0
1    0
2    0
3    0
4    0
5    1
6    1
7    1
8    1
9    1
Name: Purchased, dtype: int64
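The same label encoding can also be sketched with scikit-learn's `LabelEncoder`, which assigns integers to the sorted class labels. This is only an alternative to the `map` call above, not a replacement for it; the variable names `le`, `y_encoded_sklearn` and `label_mapping` are chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same target values as the Purchased column above
y = pd.Series(["No"] * 5 + ["Yes"] * 5, name="Purchased")

# LabelEncoder sorts the distinct labels and numbers them from 0
le = LabelEncoder()
y_encoded_sklearn = le.fit_transform(y)

# classes_ holds the labels in the order of their assigned integers
label_mapping = {label: code for code, label in enumerate(le.classes_)}

print(label_mapping)
print(y_encoded_sklearn)
```

Because the labels are sorted, "No" comes before "Yes" here. With other labels the sorted order may not match the order you intend, so it is worth checking `le.classes_` before relying on the codes.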

## One-hot encoding (for nominal data)

Here, each value becomes its own column.

A 0 in the column means the category does not apply to that row.

A 1 in the column means the category does apply.

For example, for the Gender column, the values "Male" and "Female" can each become their own column, as shown below.

In [6]:
# This is the nominal column Gender

print("Before one-hot encoding:")
nominal_gender_col = df[['Gender']]

nominal_gender_col

Before one-hot encoding:


| | Gender |
|-|-|
| 0 | Female |
| 1 | Female |
| 2 | Female |
| 3 | Female |
| 4 | Female |
| 5 | Female |
| 6 | Male |
| 7 | Female |
| 8 | Female |
| 9 | Male |


In [15]:
# Before this code is run, what do you think the output will be?

# set up the one-hot encoder
# see the documentation for the parameters: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
ohe = OneHotEncoder()
nominal_gender_col_encoded = ohe.fit_transform(nominal_gender_col)

print("""After one-hot encoding:
      
- OneHotEncoder orders the output columns by sorted category value (see ohe.categories_), not by order of appearance. Be careful with this!!
- So here the left column represents Female and the right column represents Male.

Female | Male""")

print(nominal_gender_col_encoded.toarray().astype(int))

After one-hot encoding:

- OneHotEncoder orders the output columns by sorted category value (see ohe.categories_), not by order of appearance. Be careful with this!!
- So here the left column represents Female and the right column represents Male.

Female | Male
[[1 0]
 [1 0]
 [1 0]
 [1 0]
 [1 0]
 [1 0]
 [0 1]
 [1 0]
 [1 0]
 [0 1]]


In [16]:
# Alternatively, we can use the get_dummies function from pandas to do the same thing

nominal_gender_col_encoded = pd.get_dummies(df, columns = ['Gender']).loc[:,['Gender_Female', 'Gender_Male']]
nominal_gender_col_encoded

| | Gender_Female | Gender_Male |
|-|-|-|
| 0 | True | False |
| 1 | True | False |
| 2 | True | False |
| 3 | True | False |
| 4 | True | False |
| 5 | True | False |
| 6 | False | True |
| 7 | True | False |
| 8 | True | False |
| 9 | False | True |
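The output above contains True/False values rather than 1/0. If integers are preferred straight away, `get_dummies` accepts a `dtype` argument. A minimal sketch on a small illustrative frame (the names `df_gender` and `dummies` are made up for this example):

```python
import pandas as pd

# Illustrative data with the same Gender values as the notebook
df_gender = pd.DataFrame({"Gender": ["Female", "Male", "Female"]})

# dtype=int asks pandas for 0/1 columns instead of booleans
dummies = pd.get_dummies(df_gender, columns=["Gender"], dtype=int)

print(dummies)
```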


## Ordinal encoding (for ordinal data)

Here, we replace the ordered categories with numbers that preserve their order. The gaps between the numbers can also reflect how far apart we think the categories are.

For example, we will use:

| Review | Map to |
|-|-|
| Poor | 0 |
| Average | 1 |
| Good | 2 |

We could change the values to represent the perceived distance between the categories.
For example, we may perceive Poor to be further from Average than Average is from Good, so instead we could assign poor = 0, average = 2 and good = 3.
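As a small sketch of that alternative spacing, the mapping below uses poor = 0, average = 2 and good = 3 on a few illustrative values. The names `custom_spacing` and `review_custom` are made up for this example.

```python
import pandas as pd

# A few illustrative Review values
review = pd.Series(["Average", "Poor", "Good"], name="Review")

# Unequal gaps: Poor is placed further from Average than Average is from Good
custom_spacing = {"Poor": 0, "Average": 2, "Good": 3}
review_custom = review.map(custom_spacing)

print(review_custom)
```

The order poor < average < good is unchanged; only the gaps differ.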

In [9]:
# This is the ordinal column Review

print("""Before ordinal encoding:
The order of the values is: poor < average < good""")
ordinal_review_col = df['Review']

ordinal_review_col

Before ordinal encoding:
The order of the values is: poor < average < good


0    Average
1       Poor
2       Good
3       Good
4    Average
5    Average
6       Good
7       Poor
8    Average
9       Good
Name: Review, dtype: object

In [17]:
## Before this code is run, what do you think the output will be?

print("After ordinal encoding:")
ordinal_review_col_encoded = ordinal_review_col.map({'Poor': 0,
                                                     'Average': 1,
                                                     'Good': 2})
ordinal_review_col_encoded

After ordinal encoding:


0    1
1    0
2    2
3    2
4    1
5    1
6    2
7    0
8    1
9    2
Name: Review, dtype: int64
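scikit-learn also provides an `OrdinalEncoder`. By default it would order the categories by sorted value (Average, Good, Poor), which is not the order we want, so the sketch below passes the intended order explicitly through the `categories` parameter. The data and variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data; the encoder expects a 2D input, hence a DataFrame
reviews = pd.DataFrame({"Review": ["Average", "Poor", "Good", "Good"]})

# One list of categories per column, given from lowest to highest
oe = OrdinalEncoder(categories=[["Poor", "Average", "Good"]])

# The encoder returns floats, so convert to int for readability
encoded_reviews = oe.fit_transform(reviews).ravel().astype(int)

print(encoded_reviews)
```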

### ! Question: For the Education column, what numerical values would you assign to the categories and why? 

| Education (Highest education achieved) | Map to? | Map to? | Map to? | Map to? | Map to? |
|-|-|-|-|-|-|
| No school qualifications | - | - | - | - |-|
| School (GCSE or A Level) | 1 | 2 | 0 | 0 | 1 |
| Undergraduate (Bachelors) | 2 | 4 | 1 | 3 | 3 |
| Postgraduate (Masters) | 3 | 8 | 2 | 7 | 9 |

Tip: Start from 0. This is the usual convention, and it is what scikit-learn's encoders produce.

Tip: If in doubt, increase by the same amount at each step (e.g. by 1 from lowest to highest). The model will still see that school < undergrad < postgrad.

In [11]:
ordinal_education_col = df['Education']

ordinal_education_col.value_counts()

Education
School    4
UG        4
PG        2
Name: count, dtype: int64
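One possible answer, following the tips above (start from 0, increase by 1), is sketched below. It is only one reasonable choice, and the mapping `education_order` is an assumption made for illustration, not the single correct encoding.

```python
import pandas as pd

# Same Education values as in the dataframe above
education = pd.Series(
    ["School", "UG", "PG", "PG", "UG", "School", "School", "School", "UG", "UG"],
    name="Education",
)

# Assumed ordering: School < UG < PG, evenly spaced
education_order = {"School": 0, "UG": 1, "PG": 2}
education_encoded = education.map(education_order)

# map() leaves NaN for any value missing from the dictionary, so check for gaps
has_unmapped_values = bool(education_encoded.isna().any())

print(education_encoded.tolist())
print(has_unmapped_values)
```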

# What our data might look like after encoding the categorical variables (excluding the education column)

In [12]:
print("Before encoding:")

df.drop(columns = ['Education'], inplace = False)

Before encoding:


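| | Age | Gender | Review | Purchased |
|-|-|-|-|-|
| 0 | 30 | Female | Average | No |
| 1 | 68 | Female | Poor | No |
| 2 | 70 | Female | Good | No |
| 3 | 72 | Female | Good | No |
| 4 | 16 | Female | Average | No |
| 5 | 31 | Female | Average | Yes |
| 6 | 18 | Male | Good | Yes |
| 7 | 60 | Female | Poor | Yes |
| 8 | 65 | Female | Average | Yes |
| 9 | 74 | Male | Good | Yes |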


In [18]:
print("After encoding:")


pd.concat([df['Age'], nominal_gender_col_encoded, ordinal_review_col_encoded, y_encoded], axis=1).astype(int)

After encoding:


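| | Age | Gender_Female | Gender_Male | Review | Purchased |
|-|-|-|-|-|-|
| 0 | 30 | 1 | 0 | 1 | 0 |
| 1 | 68 | 1 | 0 | 0 | 0 |
| 2 | 70 | 1 | 0 | 2 | 0 |
| 3 | 72 | 1 | 0 | 2 | 0 |
| 4 | 16 | 1 | 0 | 1 | 0 |
| 5 | 31 | 1 | 0 | 1 | 1 |
| 6 | 18 | 0 | 1 | 2 | 1 |
| 7 | 60 | 1 | 0 | 0 | 1 |
| 8 | 65 | 1 | 0 | 1 | 1 |
| 9 | 74 | 0 | 1 | 2 | 1 |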
