In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Problem statement

Many ML models, though not all, require every input feature to be numerical, since the feature values are used in mathematical operations. 
However, in the previous lab with the used cars dataset, we had to omit all features that were not already numerical since we hadn't yet learned how to deal with them.

In this lab, we'll learn a technique called One Hot Encoding, which is a method of transforming what we call *categorical* features into numerical ones.

**Let's randomly generate a dataset to work with here**

In [2]:
ages = np.random.randint(20, 60, size=10)
genders = np.random.choice(['male', 'female', 'unknown'], size=10)
salaries = np.random.randint(30000, 80000, size=10)

data = {'Age': ages, 'Gender': genders, 'Salary': salaries}
df = pd.DataFrame(data)

df

Unnamed: 0,Age,Gender,Salary
0,55,female,65813
1,55,unknown,39637
2,21,unknown,33844
3,52,unknown,43772
4,59,female,45283
5,48,male,56173
6,29,unknown,47211
7,42,unknown,53660
8,28,male,33914
9,43,male,39388


In the dataset above, the column Gender is called *categorical*.

A categorical feature is one in which we can't say that one value is more or less than another one. 

In this case, male is neither more nor less than female, i.e., we can't say that male < female or vice versa.

Some other examples of categorical data are hair color (red, blonde, or black), political affiliation (republican, democrat, or other), and favourite food (sushi, kebab, or vegetables). Again, these are considered categorical because one is not more than the other.

In contrast, the columns Age and Salary are definitely not categorical since you can directly compare their values. A salary of 60000 > 30000, and likewise an age of 39 is greater than an age of 26.
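As a rough first check, pandas can separate columns by dtype. The sketch below uses a made-up three-row frame (`toy` is a hypothetical name, not part of the lab) and only shows which columns pandas stores as numbers. The dtype is a hint, not a verdict on whether a column is categorical.

```python
import pandas as pd

# Toy frame mirroring the example above (values are made up).
toy = pd.DataFrame({
    'Age': [55, 21, 48],
    'Gender': ['female', 'unknown', 'male'],
    'Salary': [65813, 33844, 56173],
})

# Columns stored as numbers vs. everything else.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)      # ['Age', 'Salary']
print(non_numeric_cols)  # ['Gender']
```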

**One Hot Encoding**

A very simple way of dealing with categorical features is to create boolean columns, one for each of the values of our categorical column. This is called one hot encoding. We can do this in several ways, and Pandas has a pretty neat function for it as well.
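To make the idea concrete, here is a minimal by-hand sketch on a made-up frame (`toy` and `encoded` are hypothetical names): one boolean column is created per distinct value by comparing the column against that value.

```python
import pandas as pd

# Made-up categorical column for illustration.
toy = pd.DataFrame({'Gender': ['female', 'unknown', 'male', 'unknown']})

# One boolean column per distinct value of the categorical column.
for value in sorted(toy['Gender'].unique()):
    toy[f'Gender_{value}'] = toy['Gender'] == value

# The original text column is no longer needed once it is encoded.
encoded = toy.drop(columns='Gender')
print(encoded)
```

Each row has exactly one `True`, which is where the name "one hot" comes from.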

In [3]:
df

Unnamed: 0,Age,Gender,Salary
0,55,female,65813
1,55,unknown,39637
2,21,unknown,33844
3,52,unknown,43772
4,59,female,45283
5,48,male,56173
6,29,unknown,47211
7,42,unknown,53660
8,28,male,33914
9,43,male,39388


In [4]:
# 1. we call on the get_dummies function
# 2. we give it the dataset as the first argument
# 3. We tell it which columns we want to one hot encode

pd.get_dummies(df, columns = ['Gender'])

Unnamed: 0,Age,Salary,Gender_female,Gender_male,Gender_unknown
0,55,65813,True,False,False
1,55,39637,False,False,True
2,21,33844,False,False,True
3,52,43772,False,False,True
4,59,45283,True,False,False
5,48,56173,False,True,False
6,29,47211,False,False,True
7,42,53660,False,False,True
8,28,33914,False,True,False
9,43,39388,False,True,False


In [5]:
# 1. we call on the get_dummies function
# 2. we give it the dataset as the first argument
# 3. We tell it which columns we want to one hot encode
# 4. In addition, we also let the function know that we want 1/0 and not True/False

pd.get_dummies(df, columns = ['Gender'], dtype=int)

Unnamed: 0,Age,Salary,Gender_female,Gender_male,Gender_unknown
0,55,65813,1,0,0
1,55,39637,0,0,1
2,21,33844,0,0,1
3,52,43772,0,0,1
4,59,45283,1,0,0
5,48,56173,0,1,0
6,29,47211,0,0,1
7,42,53660,0,0,1
8,28,33914,0,1,0
9,43,39388,0,1,0


The result is a new dataframe, with categorical features transformed into numerical ones that we can use for the models that require it! 

If we now use these features to try to predict the salary of a given employee, we might get something like this (for a linear model):

$$ salary = w_4 \cdot (Age) + w_3 \cdot (Gender\ female) + w_2 \cdot (Gender\ male) + w_1 \cdot (Gender\ unknown) + w_0 $$
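A small worked example of the formula, with weights that are purely hypothetical (picked for easy arithmetic, not fitted to any data). Because only one of the gender columns is 1, only that column's weight contributes.

```python
# Hypothetical weights, chosen only for illustration (not fitted to any data).
w0 = 30000.0        # intercept, w_0
w_unknown = 0.0     # w_1
w_male = 1000.0     # w_2
w_female = 2000.0   # w_3
w_age = 500.0       # w_4

# One made-up employee: 40 years old, Gender_male = 1, other gender columns 0.
age, g_female, g_male, g_unknown = 40, 0, 1, 0

salary = (w_age * age
          + w_female * g_female
          + w_male * g_male
          + w_unknown * g_unknown
          + w0)
print(salary)  # 51000.0
```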

**A word of caution**

However, you must be careful to not fall into the trap of believing that numerical columns can't be categorical! 

Take this for example:

In [6]:
ages = np.random.randint(20, 60, size=10)
genders = np.random.choice(['male', 'female'], size=10)
employee_ids = np.random.randint(10, 20, size=10)
salaries = np.random.randint(30000, 80000, size=10)

data = {'Employee Dept. ID': employee_ids, 'Age': ages, 'Gender': genders, 'Salary': salaries}
df = pd.DataFrame(data)

df

Unnamed: 0,Employee Dept. ID,Age,Gender,Salary
0,13,56,male,49113
1,14,34,male,68275
2,12,21,female,50356
3,19,47,female,54050
4,18,25,male,75398
5,16,55,male,72562
6,18,39,male,73439
7,17,21,female,35667
8,10,52,female,44699
9,19,46,male,70307


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Employee Dept. ID  10 non-null     int64 
 1   Age                10 non-null     int64 
 2   Gender             10 non-null     object
 3   Salary             10 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 452.0+ bytes


Obviously, Employee Dept. ID is a numerical column and we can technically do mathematical comparisons between department IDs.

19 > 10. Obviously.

However, imagine now using that column, as is, as a feature for, say, a linear model that predicts salary. Then it might look like this:

$$ salary = w_2 \cdot (Employee\ Dept.\ ID) + w_1 \cdot (Age) + w_0 $$

According to this, a higher Employee Dept. ID contributes more to the predicted salary than a lower Employee Dept. ID. That doesn't seem to make any sense at all.

In fact, the column Employee Dept. ID is also a categorical feature, a *numerical categorical* feature. We could technically do comparisons between its values, but we really shouldn't, since it doesn't make sense. 

We can treat numerical categorical columns just like we did with ordinary categorical features - we'll one hot encode them.
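One practical detail: when `columns` is not given, `get_dummies` only encodes columns with object, string, or category dtype, so an integer ID column passes through unchanged. A minimal sketch on a made-up frame (`toy` is a hypothetical name) comparing the two calls:

```python
import pandas as pd

# Made-up frame with one numerical categorical column and one text column.
toy = pd.DataFrame({
    'Employee Dept. ID': [13, 14, 13],
    'Gender': ['male', 'female', 'male'],
})

# Without `columns`, only the text column is encoded.
default_encoded = pd.get_dummies(toy, dtype=int)

# Naming the ID column explicitly forces it to be encoded as well.
explicit_encoded = pd.get_dummies(
    toy, columns=['Gender', 'Employee Dept. ID'], dtype=int
)

print(default_encoded.columns.tolist())
print(explicit_encoded.columns.tolist())
```

So numerical categorical columns have to be listed by name, otherwise they silently stay as plain integers.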



In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Employee Dept. ID  10 non-null     int64 
 1   Age                10 non-null     int64 
 2   Gender             10 non-null     object
 3   Salary             10 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 452.0+ bytes


In [9]:
pd.get_dummies(df, columns = ['Gender', 'Employee Dept. ID'], dtype=int)  # note that we can one hot encode several columns at the same time

Unnamed: 0,Age,Salary,Gender_female,Gender_male,Employee Dept. ID_10,Employee Dept. ID_12,Employee Dept. ID_13,Employee Dept. ID_14,Employee Dept. ID_16,Employee Dept. ID_17,Employee Dept. ID_18,Employee Dept. ID_19
0,56,49113,0,1,0,0,1,0,0,0,0,0
1,34,68275,0,1,0,0,0,1,0,0,0,0
2,21,50356,1,0,0,1,0,0,0,0,0,0
3,47,54050,1,0,0,0,0,0,0,0,0,1
4,25,75398,0,1,0,0,0,0,0,0,1,0
5,55,72562,0,1,0,0,0,0,1,0,0,0
6,39,73439,0,1,0,0,0,0,0,0,1,0
7,21,35667,1,0,0,0,0,0,0,1,0,0
8,52,44699,1,0,1,0,0,0,0,0,0,0
9,46,70307,0,1,0,0,0,0,0,0,0,1


For more information, read [the documentation!](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)

In [10]:
ages = np.random.randint(20, 60, size=10)
genders = np.random.choice(['male', 'female', 'other'], size=10)
education = np.random.choice(['high school', 'bachelor', 'phd'], size=10)
salaries = np.random.randint(30000, 80000, size=10)

data = {'Age': ages, 'Gender': genders, 'Education': education, 'Salary': salaries}
df = pd.DataFrame(data)

df

Unnamed: 0,Age,Gender,Education,Salary
0,35,female,bachelor,56049
1,39,male,phd,61629
2,55,female,bachelor,41474
3,48,other,phd,77210
4,48,female,bachelor,62786
5,46,other,high school,49326
6,24,other,high school,37514
7,37,other,high school,50905
8,53,male,high school,36286
9,33,female,bachelor,54010


In [11]:
def edu_mapper(value):

    if value=='phd':
        return 3
    elif value=='bachelor': 
        return 2
    else:
        return 1

In [12]:
numerical_education = [edu_mapper(level) for level in df['Education']]

In [13]:
df['New Education'] = numerical_education

In [14]:
df

Unnamed: 0,Age,Gender,Education,Salary,New Education
0,35,female,bachelor,56049,2
1,39,male,phd,61629,3
2,55,female,bachelor,41474,2
3,48,other,phd,77210,3
4,48,female,bachelor,62786,2
5,46,other,high school,49326,1
6,24,other,high school,37514,1
7,37,other,high school,50905,1
8,53,male,high school,36286,1
9,33,female,bachelor,54010,2


$$ salary = w_2 \cdot (Age) + w_1 \cdot (New\ Education) + w_0$$

You can do as above, but you can of course also use dummy variables anyway.
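The ordinal mapping done with `edu_mapper` can also be sketched with a dictionary and `Series.map`. The frame and the name `edu_order` below are made up for illustration. One difference to be aware of: values missing from the dictionary become `NaN` here, whereas `edu_mapper` sends everything else to 1.

```python
import pandas as pd

# Made-up education column for illustration.
toy = pd.DataFrame({'Education': ['bachelor', 'phd', 'high school', 'bachelor']})

# Same ordering as edu_mapper: high school < bachelor < phd.
edu_order = {'high school': 1, 'bachelor': 2, 'phd': 3}
toy['New Education'] = toy['Education'].map(edu_order)

print(toy)
```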

In [15]:
ages = np.random.randint(20, 60, size=10)
genders = np.random.choice(['male', 'female', 'other'], size=10)
education = np.random.choice(['high school', 'bachelor', 'phd'], size=10)
salaries = np.random.randint(30000, 80000, size=10)

data = {'Age': ages, 'Gender': genders, 'Education': education, 'Salary': salaries}
df = pd.DataFrame(data)


pd.get_dummies(df, columns = ['Education'], dtype=int)  # here we one hot encode only the Education column

Unnamed: 0,Age,Gender,Salary,Education_bachelor,Education_high school,Education_phd
0,58,other,42408,0,1,0
1,30,female,59235,1,0,0
2,27,male,46899,0,0,1
3,46,male,66598,1,0,0
4,38,other,62183,0,0,1
5,30,other,72500,0,1,0
6,47,male,48020,0,1,0
7,36,female,79012,0,1,0
8,23,female,31781,0,0,1
9,41,other,64183,0,0,1


$$ salary = w_4 \cdot (Age) + w_3 \cdot (Education\ bachelor) + w_2 \cdot (Education\ high\ school) + w_1 \cdot (Education\ phd) + w_0$$

In [16]:
pd.get_dummies(df, columns = ['Gender', 'Education'], dtype=int)

Unnamed: 0,Age,Salary,Gender_female,Gender_male,Gender_other,Education_bachelor,Education_high school,Education_phd
0,58,42408,0,0,1,0,1,0
1,30,59235,1,0,0,1,0,0
2,27,46899,0,1,0,0,0,1
3,46,66598,0,1,0,1,0,0
4,38,62183,0,0,1,0,0,1
5,30,72500,0,0,1,0,1,0
6,47,48020,0,1,0,0,1,0
7,36,79012,1,0,0,0,1,0
8,23,31781,1,0,0,0,0,1
9,41,64183,0,0,1,0,0,1


**The curse of dimensionality**

One Hot Encoding is great, but you should be careful using it. If the categorical column you transform has too many values, it will translate to equally many new columns. This means that the dimensionality of your features increases (each feature is one dimension). As you increase the number of features, it also turns out that it gets more and more difficult to fit a model to it - unless you have a big enough data set!

Think about it. If you have a very small data set (10 samples), and 1000 features - you're not going to be able to make any sense of the underlying process you're trying to model.

However, if you have a relatively big data set (1M samples), and 1000 features - you **might** be able to make sense of the underlying process and produce a good model.

**Conclusion**

One Hot Encoding is great, use it. But be careful. If the column you're transforming has *a lot* of possible values (say, perhaps over 20 or 30), it might cause issues if you don't also have a correspondingly large amount of data.

Clear warning sign: if the number of features exceeds the number of training samples, you're most likely in trouble.

What you want is for the number of training samples to be much larger than the number of features:

$$ n_{samples} \gg n_{features} $$
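Before encoding, it can help to count how many new columns each categorical column would produce and compare that with the number of rows. The helper name `one_hot_width` and the toy data below are hypothetical, and the comparison is a rough sanity check rather than a rule.

```python
import pandas as pd

def one_hot_width(df, columns):
    """Number of columns get_dummies would create for each listed column."""
    return {col: df[col].nunique() for col in columns}

# Made-up data: 6 samples, a department column with 6 distinct values.
toy = pd.DataFrame({
    'Dept': [10, 11, 12, 13, 14, 15],
    'Gender': ['male', 'female', 'male', 'female', 'male', 'female'],
})

widths = one_hot_width(toy, ['Dept', 'Gender'])
n_new = sum(widths.values())
n_samples = len(toy)

print(widths)
if n_new >= n_samples:
    print(f'Warning: {n_new} one-hot columns for only {n_samples} samples')
```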

---

**Further efficiency improvements**


In [17]:
ages = np.random.randint(20, 60, size=10)
genders = np.random.choice(['male', 'female', 'unknown', 'alien'], size=10)
employee_ids = np.random.randint(10, 20, size=10)
salaries = np.random.randint(30000, 80000, size=10)

data = {'Employee Dept. ID': employee_ids, 'Age': ages, 'Gender': genders, 'Salary': salaries}
df = pd.DataFrame(data)

df

Unnamed: 0,Employee Dept. ID,Age,Gender,Salary
0,13,52,female,74294
1,16,44,alien,60795
2,16,28,female,33285
3,11,33,male,44340
4,12,29,male,79629
5,16,22,female,46060
6,13,52,alien,58751
7,18,36,unknown,44158
8,12,37,alien,69377
9,13,44,unknown,63737


In [18]:
pd.get_dummies(df, columns=['Gender'], dtype=int, drop_first=True)

Unnamed: 0,Employee Dept. ID,Age,Salary,Gender_female,Gender_male,Gender_unknown
0,13,52,74294,1,0,0
1,16,44,60795,0,0,0
2,16,28,33285,1,0,0
3,11,33,44340,0,1,0
4,12,29,79629,0,1,0
5,16,22,46060,1,0,0
6,13,52,58751,0,0,0
7,18,36,44158,0,0,1
8,12,37,69377,0,0,0
9,13,44,63737,0,0,1


One strategy to save a column is to drop one of the resulting one-hot-encoded columns, which is what `drop_first=True` does above.
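No information is lost by dropping a column: in the full encoding every row has exactly one 1, so the dropped category is simply the rows where all remaining columns are 0. A minimal sketch on made-up data (`toy`, `full`, and `reduced` are hypothetical names):

```python
import pandas as pd

# Made-up gender column with four categories.
toy = pd.DataFrame({'Gender': ['alien', 'female', 'male', 'unknown', 'alien']})

full = pd.get_dummies(toy, columns=['Gender'], dtype=int)
reduced = pd.get_dummies(toy, columns=['Gender'], dtype=int, drop_first=True)

# Which column did drop_first remove?
dropped = (set(full.columns) - set(reduced.columns)).pop()

# Rows that are all zeros in the reduced encoding belong to the dropped category.
is_dropped = reduced.sum(axis=1) == 0

print(dropped)
print(is_dropped.tolist())
```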

**Another trick is to group categories together**

In [19]:
ages = np.random.randint(20, 60, size=1000)
genders = np.random.choice(['male', 'female'], size=1000)
employee_ids = np.random.randint(10, 50, size=1000)
salaries = np.random.randint(30000, 80000, size=1000)

data = {'Employee Dept. ID': employee_ids, 'Age': ages, 'Gender': genders, 'Salary': salaries}
df = pd.DataFrame(data)

df

Unnamed: 0,Employee Dept. ID,Age,Gender,Salary
0,45,35,female,72055
1,27,42,male,43923
2,38,51,female,51870
3,33,22,male,64703
4,27,20,male,50497
...,...,...,...,...
995,41,55,male,38857
996,27,37,male,51514
997,47,37,female,50873
998,44,27,male,35060


In [20]:
pd.get_dummies(df, columns=['Employee Dept. ID'], dtype=int)

Unnamed: 0,Age,Gender,Salary,Employee Dept. ID_10,Employee Dept. ID_11,Employee Dept. ID_12,Employee Dept. ID_13,Employee Dept. ID_14,Employee Dept. ID_15,Employee Dept. ID_16,...,Employee Dept. ID_40,Employee Dept. ID_41,Employee Dept. ID_42,Employee Dept. ID_43,Employee Dept. ID_44,Employee Dept. ID_45,Employee Dept. ID_46,Employee Dept. ID_47,Employee Dept. ID_48,Employee Dept. ID_49
0,35,female,72055,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,42,male,43923,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,51,female,51870,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,22,male,64703,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,20,male,50497,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,55,male,38857,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
996,37,male,51514,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
997,37,female,50873,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
998,27,male,35060,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


As we can see above, there are simply too many Department IDs to be one-hot-encoded (they yield 40 new features, which can contribute significantly to the curse of dimensionality). We can group them together into a smaller number of categories, if there is a logic to it.

Assume now, e.g., that 

* Depts 10-19 are "Sales"
* Depts 20-29 are "Marketing"
* Depts 30-39 are "HR"
* Depts 40-49 are "IT"

In [21]:
def group_dept(dept_id):
    if 10 <= dept_id <= 19:
        return 'Sales'
    elif 20 <= dept_id <= 29:
        return 'Marketing'
    elif 30 <= dept_id <= 39:
        return 'HR'
    elif 40 <= dept_id <= 49:
        return 'IT'

In [22]:
new_dept_groupings = [group_dept(dept_id) for dept_id in df['Employee Dept. ID']]


df['Employee Dept. Group'] = new_dept_groupings
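For contiguous ranges like these, the same grouping can also be sketched with `pd.cut`, which bins a numeric column in one call. The bin edges below assume the ranges listed above, and the sample IDs are made up. With the default `right=True`, each bin includes its upper edge, so `[9, 19, 29, 39, 49]` covers 10-19, 20-29, 30-39, and 40-49.

```python
import pandas as pd

# Made-up department IDs, including the range boundaries.
dept_ids = pd.Series([45, 27, 38, 10, 19, 20], name='Employee Dept. ID')

groups = pd.cut(
    dept_ids,
    bins=[9, 19, 29, 39, 49],                    # (9, 19], (19, 29], (29, 39], (39, 49]
    labels=['Sales', 'Marketing', 'HR', 'IT'],
)

print(groups.tolist())  # ['IT', 'Marketing', 'HR', 'Sales', 'Sales', 'Marketing']
```

IDs outside all bins would come back as `NaN`, which makes unexpected department numbers easy to spot.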

In [23]:
df

Unnamed: 0,Employee Dept. ID,Age,Gender,Salary,Employee Dept. Group
0,45,35,female,72055,IT
1,27,42,male,43923,Marketing
2,38,51,female,51870,HR
3,33,22,male,64703,HR
4,27,20,male,50497,Marketing
...,...,...,...,...,...
995,41,55,male,38857,IT
996,27,37,male,51514,Marketing
997,47,37,female,50873,IT
998,44,27,male,35060,IT


In [24]:
pd.get_dummies(df, columns=['Employee Dept. Group', 'Gender'], dtype=int)

Unnamed: 0,Employee Dept. ID,Age,Salary,Employee Dept. Group_HR,Employee Dept. Group_IT,Employee Dept. Group_Marketing,Employee Dept. Group_Sales,Gender_female,Gender_male
0,45,35,72055,0,1,0,0,1,0
1,27,42,43923,0,0,1,0,0,1
2,38,51,51870,1,0,0,0,1,0
3,33,22,64703,1,0,0,0,0,1
4,27,20,50497,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...
995,41,55,38857,0,1,0,0,0,1
996,27,37,51514,0,0,1,0,0,1
997,47,37,50873,0,1,0,0,1,0
998,44,27,35060,0,1,0,0,0,1


In [25]:
almost_done_df = pd.get_dummies(df, columns=['Employee Dept. Group', 'Gender'], dtype=int, drop_first=True)

almost_done_df

Unnamed: 0,Employee Dept. ID,Age,Salary,Employee Dept. Group_IT,Employee Dept. Group_Marketing,Employee Dept. Group_Sales,Gender_male
0,45,35,72055,1,0,0,0
1,27,42,43923,0,1,0,1
2,38,51,51870,0,0,0,0
3,33,22,64703,0,0,0,1
4,27,20,50497,0,1,0,1
...,...,...,...,...,...,...,...
995,41,55,38857,1,0,0,1
996,27,37,51514,0,1,0,1
997,47,37,50873,1,0,0,0
998,44,27,35060,1,0,0,1


We can now train on the following features

In [26]:
almost_done_df.columns

Index(['Employee Dept. ID', 'Age', 'Salary', 'Employee Dept. Group_IT',
       'Employee Dept. Group_Marketing', 'Employee Dept. Group_Sales',
       'Gender_male'],
      dtype='object')

In [27]:
columns_to_keep = ['Age', 'Salary', 'Employee Dept. Group_IT',
       'Employee Dept. Group_Marketing', 'Employee Dept. Group_Sales',
       'Gender_male']

In [28]:
trainable_df = almost_done_df[columns_to_keep]
trainable_df

Unnamed: 0,Age,Salary,Employee Dept. Group_IT,Employee Dept. Group_Marketing,Employee Dept. Group_Sales,Gender_male
0,35,72055,1,0,0,0
1,42,43923,0,1,0,1
2,51,51870,0,0,0,0
3,22,64703,0,0,0,1
4,20,50497,0,1,0,1
...,...,...,...,...,...,...
995,55,38857,1,0,0,1
996,37,51514,0,1,0,1
997,37,50873,1,0,0,0
998,27,35060,1,0,0,1
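As a sketch of the next step, the target (`Salary`) is separated from the features and a linear model is fitted. The example uses a made-up stand-in for `trainable_df` and plain NumPy least squares; this is one possible way to fit the weights, not necessarily the one used in the course.

```python
import numpy as np
import pandas as pd

# Small stand-in for trainable_df (values are made up).
toy = pd.DataFrame({
    'Age': [35, 42, 51, 22, 20, 55],
    'Salary': [72055, 43923, 51870, 64703, 50497, 38857],
    'Gender_male': [0, 1, 0, 1, 1, 1],
})

X = toy.drop(columns='Salary')   # features
y = toy['Salary']                # target we want to predict

# Add a column of ones so the intercept w_0 is learned as well.
A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
weights, *_ = np.linalg.lstsq(A, y.to_numpy(dtype=float), rcond=None)

predictions = A @ weights
print(weights)       # [w_0, w_Age, w_Gender_male]
print(predictions)
```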


Just FYI, you don't have to use `drop_first`. It's cute, but nowadays not really needed.