# Processing Categorical Features

Most categorical features are textual, so most ML models, which require numerical input, cannot use them directly. This section covers basic ways to deal with this issue.

### Ordinal Features
Ordinal features have a sense of "order" within them. We gave the example of survey rankings.
Say that the possible answers to a question are “Great”, “Good”, “Okay”, “Bad”, “Poor”.

How can a model read something like this? We can translate the answers into numbers that preserve the ranking.

Using the map function, we assign 0 to "Poor", 1 to "Bad", and so on.

In [21]:
import random
import pandas as pd
from sklearn import preprocessing

In [3]:
rank_list = ["Great", "Good", "Okay", "Bad", "Poor"]
items = pd.Series([random.choice(rank_list) for _ in range(100)])  # random.choice picks one item at random

print(items)

# Map each answer to its rank, from lowest (0) to highest (4)
rank_dict = {'Poor': 0, 'Bad': 1, 'Okay': 2, 'Good': 3, 'Great': 4}
encoded_items = items.map(rank_dict)

pd.DataFrame({"Rank":items, "Encoded Rank":encoded_items})

0     Great
1      Good
2      Okay
3      Poor
4       Bad
      ...  
95      Bad
96     Good
97      Bad
98    Great
99      Bad
Length: 100, dtype: object


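|     | Rank  | Encoded Rank |
|-----|-------|--------------|
| 0   | Great | 4            |
| 1   | Good  | 3            |
| 2   | Okay  | 2            |
| 3   | Poor  | 0            |
| 4   | Bad   | 1            |
| ... | ...   | ...          |
| 95  | Bad   | 1            |
| 96  | Good  | 3            |
| 97  | Bad   | 1            |
| 98  | Great | 4            |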
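
The same ranking can also be attached to the column itself with an ordered pandas categorical. This is a minimal sketch rather than part of the original notebook; the answers sample and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of survey answers (not the random data used above)
answers = pd.Series(["Great", "Poor", "Okay", "Bad", "Good", "Okay"])

# List the categories from lowest to highest rank
rank_order = ["Poor", "Bad", "Okay", "Good", "Great"]
ordered = pd.Categorical(answers, categories=rank_order, ordered=True)

# .codes gives the position of each value in rank_order (0 = "Poor", 4 = "Great")
encoded = pd.Series(ordered.codes, index=answers.index)
print(encoded.tolist())  # [4, 0, 2, 1, 3, 2]
```

A value that is missing from rank_order is encoded as -1, which makes unexpected answers easy to spot.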


### Nominal Features
Nominal features are trickier because their values have no order. Simply assigning a number to each value is risky: the model may treat the numbers as magnitudes and infer an ordering, or a proportional effect on the target variable, that does not exist. Gender, profession, sports teams, and countries are all examples of nominal features.


### One-Hot Encoding
A very simple and common way to deal with this is to use binary vectors, also known as one-hot encoding.

We create a separate column for each unique value of the categorical feature. The values in the new columns are binary: 1 if the row belongs to that category, and 0 otherwise.

There are a few ways to do this, but a very convenient one is the pandas function get_dummies.

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
pd.get_dummies(df, columns=['species'])

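|     | sepal_length | sepal_width | petal_length | petal_width | species_setosa | species_versicolor | species_virginica |
|-----|--------------|-------------|--------------|-------------|----------------|--------------------|-------------------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | 1              | 0                  | 0                 |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | 1              | 0                  | 0                 |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | 1              | 0                  | 0                 |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | 1              | 0                  | 0                 |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | 1              | 0                  | 0                 |
| ... | ...          | ...         | ...          | ...         | ...            | ...                | ...               |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | 0              | 0                  | 1                 |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | 0              | 0                  | 1                 |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | 0              | 0                  | 1                 |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | 0              | 0                  | 1                 |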
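
Two optional get_dummies arguments are worth knowing. The sketch below uses a small made-up frame (toy) instead of the iris download so that it runs offline. Recent pandas versions return boolean columns by default, so dtype=int is passed to get the 0/1 values shown above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the iris data
toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# dtype=int forces 0/1 columns instead of booleans
one_hot = pd.get_dummies(toy, columns=["species"], dtype=int)

# drop_first=True keeps k-1 columns; the dropped category is implied by all zeros
reduced = pd.get_dummies(toy, columns=["species"], dtype=int, drop_first=True)

print(one_hot.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one column removes the redundancy between the indicator columns, which can matter for linear models; tree-based models usually do not need it.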


### Label Encoding
This is very similar to what we did with the ordinal features: assign a number to every unique value.
As noted above, this may not be the best method for nominal features, because the numbers imply an order that does not exist.

In [6]:
df['species'].astype('category').cat.codes

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Length: 150, dtype: int8
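
To see which label each code stands for, the categories can be read back from the categorical column. The following sketch uses a short made-up series (species) and also shows pd.factorize, which numbers values by first appearance rather than by sorted order.

```python
import pandas as pd

# Hypothetical sample of species labels
species = pd.Series(["setosa", "virginica", "versicolor", "setosa"])

# Category codes follow the sorted order of the unique values
as_category = species.astype("category")
codes = as_category.cat.codes
code_to_label = dict(enumerate(as_category.cat.categories))

# pd.factorize numbers values in order of first appearance instead
factorized_codes, uniques = pd.factorize(species)

print(codes.tolist())             # [0, 2, 1, 0]
print(code_to_label)              # {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
print(factorized_codes.tolist())  # [0, 1, 2, 0]
```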

### Target Encoding
Target encoding replaces each category with a statistic of the target variable, such as its mean or median, computed over the rows in that category.
For these statistics to be meaningful, the target has to be numeric: typically a continuous variable, although a binary target coded as 0/1 also works with the mean.

In [9]:
url = 'https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv'
# dropna removes rows with NaN; subset limits the check to these columns
cars_df = pd.read_csv(url).dropna(subset=['Manufacturer', 'Price'])
cars_df.head()

|   | Manufacturer | Model | Type | Min.Price | Price | Max.Price | MPG.city | MPG.highway | AirBags | DriveTrain | ... | Passengers | Length | Wheelbase | Width | Turn.circle | Rear.seat.room | Luggage.room | Weight | Origin | Make |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Acura | Integra | Small | 12.9 | 15.9 | 18.8 | 25.0 | 31.0 |  | Front | ... | 5.0 | 177.0 | 102.0 | 68.0 | 37.0 | 26.5 |  | 2705.0 | non-USA | Acura Integra |
| 2 | Audi | 90 | Compact | 25.9 | 29.1 | 32.3 | 20.0 | 26.0 | Driver only | Front | ... | 5.0 | 180.0 | 102.0 | 67.0 | 37.0 | 28.0 | 14.0 | 3375.0 | non-USA | Audi 90 |
| 3 | Audi | 100 | Midsize |  | 37.7 | 44.6 | 19.0 | 26.0 | Driver & Passenger |  | ... | 6.0 | 193.0 | 106.0 |  | 37.0 | 31.0 | 17.0 | 3405.0 | non-USA | Audi 100 |
| 4 | BMW | 535i | Midsize |  | 30.0 |  | 22.0 | 30.0 |  | Rear | ... | 4.0 | 186.0 | 109.0 | 69.0 | 39.0 | 27.0 | 13.0 | 3640.0 | non-USA | BMW 535i |
| 5 | Buick | Century | Midsize | 14.2 | 15.7 | 17.3 | 22.0 | 31.0 | Driver only |  | ... | 6.0 | 189.0 | 105.0 | 69.0 | 41.0 | 28.0 | 16.0 |  | USA | Buick Century |


Let’s say we want to predict the price, so Price will be the target variable. 
Manufacturer is a nominal feature and we can use target encoding to transform it to a numerical value.

In [16]:
# Mean price per manufacturer
means = cars_df.groupby('Manufacturer')['Price'].mean()
# Replace each manufacturer with its mean price
cars_df['Manufacturer_transformed'] = cars_df['Manufacturer'].map(means)
cars_df[['Manufacturer', 'Manufacturer_transformed', 'Price']].head()

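|   | Manufacturer | Manufacturer_transformed | Price |
|---|--------------|--------------------------|-------|
| 0 | Acura        | 15.9                     | 15.9  |
| 2 | Audi         | 33.4                     | 29.1  |
| 3 | Audi         | 33.4                     | 37.7  |
| 4 | BMW          | 30.0                     | 30.0  |
| 5 | Buick        | 21.625                   | 15.7  |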
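
When the data is split into training and test sets, the category statistics are usually learned from the training rows only, so that test prices do not leak into the feature. The sketch below assumes two small made-up frames (train and test) and falls back to the global training mean for manufacturers that never appear in training; both choices are illustrative, not the notebook's method.

```python
import pandas as pd

# Hypothetical training and test rows (prices are made up)
train = pd.DataFrame({
    "Manufacturer": ["Audi", "Audi", "BMW", "Buick", "Buick"],
    "Price": [29.1, 37.7, 30.0, 15.7, 20.8],
})
test = pd.DataFrame({"Manufacturer": ["Audi", "Buick", "Volvo"]})

# Learn the per-category means from the training rows only
category_means = train.groupby("Manufacturer")["Price"].mean()
global_mean = train["Price"].mean()

# Apply the learned means; unseen categories fall back to the global mean
train["Manufacturer_transformed"] = train["Manufacturer"].map(category_means)
test["Manufacturer_transformed"] = (
    test["Manufacturer"].map(category_means).fillna(global_mean)
)
print(test)
```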


### Exercises

In [31]:
# Using the Cars dataset, perform label encoding on the Make column
le = preprocessing.LabelEncoder()
cars_df["Make_transformed"] = le.fit_transform(cars_df['Make'])
cars_df.tail()

|    | Manufacturer | Model | Type | Min.Price | Price | Max.Price | MPG.city | MPG.highway | AirBags | DriveTrain | ... | Width | Turn.circle | Rear.seat.room | Luggage.room | Weight | Origin | Make | Manufacturer_transformed | Model_transformed | Make_transformed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 87 | Volkswagen | Fox | Small | 8.7 | 9.1 | 9.5 | 25.0 | 33.0 |  | Front | ... | 63.0 | 34.0 | 26.0 | 10.0 | 2240.0 | non-USA | Volkswagen Fox | 18.025 | 9.1 | 81 |
| 88 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17.0 | 21.0 |  | Front | ... | 72.0 | 38.0 | 34.0 |  | 3960.0 |  | Volkswagen Eurovan | 18.025 | 19.7 | 80 |
| 89 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21.0 | 30.0 |  | Front | ... | 67.0 | 35.0 | 31.5 | 14.0 | 2985.0 | non-USA | Volkswagen Passat | 18.025 | 20.0 | 82 |
| 90 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18.0 | 25.0 |  | Front | ... | 66.0 | 36.0 | 26.0 | 15.0 | 2810.0 | non-USA | Volkswagen Corrado | 18.025 | 23.3 | 79 |
| 91 | Volvo | 240 | Compact | 21.8 | 22.7 | 23.5 | 21.0 | 28.0 | Driver only | Rear | ... | 67.0 | 37.0 | 29.5 | 14.0 | 2985.0 | non-USA | Volvo 240 | 22.7 | 22.7 | 83 |


In [28]:
# Target encode the Model column by a statistic of your choice.
medians = cars_df[['Model', 'Price']].groupby('Model').median()
cars_df['Model_transformed'] = cars_df['Model'].dropna().apply(lambda x: medians.loc[x, "Price"])
cars_df[['Model', 'Model_transformed', 'Price']].head()

|   | Model   | Model_transformed | Price |
|---|---------|-------------------|-------|
| 0 | Integra | 15.9              | 15.9  |
| 2 | 90      | 29.1              | 29.1  |
| 3 | 100     | 37.7              | 37.7  |
| 4 | 535i    | 30.0              | 30.0  |
| 5 | Century | 15.7              | 15.7  |
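
In the rows shown above, Model_transformed equals Price, which suggests that each of those models appears only once, so the encoded feature simply copies the target. One common mitigation is to shrink each category statistic toward the global mean, weighted by how many rows the category has. The sketch below is one possible version under stated assumptions: the data frame is made up, the smoothing weight of 5 is arbitrary, and the mean is used instead of the median from the exercise.

```python
import pandas as pd

# Hypothetical rows: three models seen once, one model seen three times
df = pd.DataFrame({
    "Model": ["Integra", "90", "100", "Century", "Century", "Century"],
    "Price": [15.9, 29.1, 37.7, 15.7, 16.3, 14.5],
})

smoothing = 5  # hypothetical weight: acts like 5 extra rows priced at the global mean

global_mean = df["Price"].mean()
stats = df.groupby("Model")["Price"].agg(["mean", "count"])

# Blend the category mean with the global mean; rare categories stay close to the global mean
smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
    stats["count"] + smoothing
)
df["Model_transformed"] = df["Model"].map(smoothed)
print(df)
```

With this blend, a model that appears once no longer reproduces its own price, while a model with many rows stays close to its own mean.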
