
# Transforming Features for Machine Learning

## Read in the Dataset

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_excel('/content/drive/MyDrive/CodingDojo/02-MachineLearning/Week05/Data/Tshirt.xlsx')
df.head()

Unnamed: 0,Size,Color,Cost,Sold
0,S,Blue,5.0,Y
1,M,Red,7.49,Y
2,M,Green,8.0,N
3,XL,Green,4.0,N
4,L,Red,9.99,Y


## Format for ML

* Assign the target and features, then separate them into train and test sets

In [4]:
# assign target y and features X
y = df['Sold']
X = df.drop(columns = 'Sold')
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## Replace the Ordinal Feature (Size) with Numbers

* There are several ways to accomplish this in Python. This example creates a dictionary and uses .replace to make the changes. Note that we keep the test set separate, but apply the same preprocessing to both the train and test sets.

In [5]:
# define the mapping from each size label to its ordinal rank
sizes = {'S': 0, 'M': 1, 'L': 2, 'XL': 3}
# apply the mapping to the Size column in both the train and test sets
X_train['Size'] = X_train['Size'].replace(sizes)
X_test['Size'] = X_test['Size'].replace(sizes)
# view the dataframe to make sure it worked
X_train.head()

Unnamed: 0,Size,Color,Cost
0,0,Blue,5.0
8,0,Green,9.0
2,1,Green,8.0
4,2,Red,9.99
3,3,Green,4.0
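
As one of the other ways mentioned above, here is a minimal sketch using .map instead of .replace. The two Series are hypothetical stand-ins for the Size column of the train and test sets, and the 'XXL' value is invented to show what happens when a label is missing from the dictionary.

```python
import pandas as pd

# Hypothetical toy data standing in for the Size column of X_train / X_test
train_sizes = pd.Series(['S', 'M', 'XL', 'L'], name='Size')
test_sizes = pd.Series(['M', 'XXL'], name='Size')  # 'XXL' is not in the mapping

sizes = {'S': 0, 'M': 1, 'L': 2, 'XL': 3}

# .map returns NaN for any value missing from the dictionary,
# whereas .replace would leave the original string in place
train_encoded = train_sizes.map(sizes)
test_encoded = test_sizes.map(sizes)

print(train_encoded.tolist())
# count the test values that had no entry in the mapping
print(test_encoded.isna().sum())
```

Checking for NaN after .map is a quick way to catch labels that appear in one split but were left out of the dictionary.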


## Transforming Categorical (Nominal) Features

You need to be careful when your categories are unordered. It is not sufficient to simply map each class to a number as can be done with ordinal features.

This is because many machine learning models interpret larger numbers as representing larger values, and nominal variables have no such ordering. For example, if you just replaced the values red, green, and blue with the numbers 0, 1, and 2, the model could interpret blue as being greater than red and green. We know that this should not be the case: no color is inherently better or higher than another, so all of these colors should be treated as equals.

To deal with this, we can one-hot encode our categories. One-hot encoding creates a binary column for each class in the feature.

Note that the Color column in our dataset is an example of a nominal feature.
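
To make the ordering problem concrete, here is a small sketch in plain Python. The integer assignment and the hamming helper are illustrative assumptions, not part of the lesson's dataset code. It compares how "far apart" the colors look under integer codes versus binary indicator columns.

```python
# Integer codes for a nominal feature (an arbitrary assignment for illustration)
codes = {'Red': 0, 'Green': 1, 'Blue': 2}

# With integer codes, Blue looks twice as far from Red as Green does
blue_red_gap = abs(codes['Blue'] - codes['Red'])
green_red_gap = abs(codes['Green'] - codes['Red'])
print(blue_red_gap, green_red_gap)

# One binary indicator per color: (is_red, is_green, is_blue)
one_hot = {'Red': (1, 0, 0), 'Green': (0, 1, 0), 'Blue': (0, 0, 1)}

def hamming(a, b):
    """Count the positions where two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))

# With indicator columns, every pair of distinct colors is equally far apart
print(hamming(one_hot['Blue'], one_hot['Red']))
print(hamming(one_hot['Green'], one_hot['Red']))
print(hamming(one_hot['Blue'], one_hot['Green']))
```

The integer codes introduce an artificial ranking, while the indicator representation treats all three colors symmetrically.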


One-hot encoding will create a binary column for each color in our dataset. The cells below build these columns by hand to show what the encoding produces.

In [7]:
df['color_blue'] = df['Color'].replace({'Blue': 1, 'Green': 0, 'Red': 0})
df['color_blue'].value_counts()

0    6
1    3
Name: color_blue, dtype: int64

In [8]:
df['color_green'] = df['Color'].replace({'Blue': 0, 'Green': 1, 'Red': 0})
df['color_green'].value_counts()

0    6
1    3
Name: color_green, dtype: int64

In [9]:
df['color_red'] = df['Color'].replace({'Blue': 0, 'Green': 0, 'Red': 1})
df['color_red'].value_counts()

0    6
1    3
Name: color_red, dtype: int64

In [10]:
df.head()

Unnamed: 0,Size,Color,Cost,Sold,color_blue,color_green,color_red
0,S,Blue,5.0,Y,1,0,0
1,M,Red,7.49,Y,0,0,1
2,M,Green,8.0,N,0,1,0
3,XL,Green,4.0,N,0,1,0
4,L,Red,9.99,Y,0,0,1


Notice, in each of the rows, one (and only one) of the colors is "hot" (assigned 1). This is essentially just a bunch of binary columns that represent whether or not the value in that row is that color.

For example, the first row represents the color blue. So, it has a "1" in the "color_blue" column (signifying the presence of the color blue) and a "0" in both the "color_green" and "color_red" columns, since it is NOT green or red.

Don't worry! You do not have to do this manually. You will learn the code for applying one-hot encoding to your dataset in a future lesson.
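
As a preview, here is a minimal sketch of one common way to do this, using pandas' get_dummies. This is not necessarily the approach the future lesson will use. The toy DataFrame is a hypothetical stand-in for the Color column, and the 'color' prefix is chosen only to resemble the manual column names above.

```python
import pandas as pd

# Hypothetical toy frame mirroring the Color column from this lesson
toy = pd.DataFrame({'Color': ['Blue', 'Red', 'Green', 'Green', 'Red']})

# get_dummies builds one indicator column per distinct value;
# dtype=int yields 0/1 integers rather than True/False
encoded = pd.get_dummies(toy, columns=['Color'], prefix='color', dtype=int)
print(encoded)

# Each row should have exactly one "hot" column
print((encoded.sum(axis=1) == 1).all())
```

Note that get_dummies keeps the original capitalization of the values, so the columns come out as color_Blue, color_Green, and color_Red. If it is applied separately to the train and test sets, the two results can end up with different columns when a category is missing from one split.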