# Handling Categorical Variables
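Categorical variables have two major types:
- Nominal: no order associated with the categories, e.g. gender (male, female)
- Ordinal: an order is associated with the categories

Two further kinds are worth noting:
- Cyclical: the categories wrap around, e.g. Monday, Tuesday, ..., Sunday, then Monday again
- Binary: only two categories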

### Steps:
- Fill NaN values (treat them as a completely new category, say "None"): `df.loc[:, "ord_2"] = df.ord_2.fillna("None")`
- Ordinal: convert using Label Encoder or Mapping Dictionary
- Others: convert using One-Hot Encoding

In [1]:
import pandas as pd
from sklearn import preprocessing

In [2]:
df = pd.read_csv("../input/dataset/cat_train.csv")
df_backup = df.copy()

In [3]:
df.head()

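| | id | bin_0 | bin_1 | bin_2 | bin_3 | bin_4 | nom_0 | nom_1 | nom_2 | nom_3 | ... | nom_9 | ord_0 | ord_1 | ord_2 | ord_3 | ord_4 | ord_5 | day | month | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.0 | 0.0 | F | N | Red | Trapezoid | Hamster | Russia | ... | 02e7c8990 | 3.0 | Contributor | Hot | c | U | Pw | 6.0 | 3.0 | 0 |
| 1 | 1 | 1.0 | 1.0 | 0.0 | F | Y | Red | Star | Axolotl | NaN | ... | f37df64af | 3.0 | Grandmaster | Warm | e | X | pE | 7.0 | 7.0 | 0 |
| 2 | 2 | 0.0 | 1.0 | 0.0 | F | N | Red | NaN | Hamster | Canada | ... | NaN | 3.0 | NaN | Freezing | n | P | eN | 5.0 | 9.0 | 0 |
| 3 | 3 | NaN | 0.0 | 0.0 | F | N | Red | Circle | Hamster | Finland | ... | f9d456e57 | 1.0 | Novice | Lava Hot | a | C | NaN | 3.0 | 3.0 | 0 |
| 4 | 4 | 0.0 | NaN | 0.0 | T | N | Red | Triangle | Hamster | Costa Rica | ... | c5361037c | 3.0 | Grandmaster | Cold | h | C | OZ | 5.0 | 12.0 | 0 |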


In [4]:
df.shape

(600000, 25)

## 1. Label Encoding
- Step 1: Fill NaN values since Label Encoding does not handle NaN values
- Step 2: Fit & Transform

### Note:
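- Label-encoded values can be used directly in many tree-based models, but not in linear models, SVMs, or neural networks: these expect data to be normalized (or standardized) and would treat the integer codes as magnitudes
- For these types of models, we can binarize the data, i.e. write each integer code as a fixed-width binary number with one column per bit: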

`
"Freezing": 0 => 0 0 0
"Warm"    : 1 => 0 0 1
`
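
The binarization above can be sketched with NumPy. This is a minimal illustration, assuming integer label codes 0 to 5 (one per `ord_2` category); the names `codes`, `n_bits`, and `binary` are hypothetical and not part of the notebook.

```python
import numpy as np

# Hypothetical label codes, one per category (0..5).
codes = np.array([0, 1, 2, 3, 4, 5])

# Number of bits needed to represent the largest code (at least 1).
n_bits = max(int(codes.max()).bit_length(), 1)

# Shift each code right by every bit position (most significant first)
# and keep only the lowest bit, giving one 0/1 column per bit.
shifts = np.arange(n_bits - 1, -1, -1)
binary = (codes[:, None] >> shifts) & 1

print(binary)
```

Each row is the binary representation of one code, so six categories need only three columns instead of six one-hot columns.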

In [5]:
# Look at `ord_2`: models cannot work with raw text, so the categories must be converted to numbers
df.ord_2.value_counts()

Freezing       142726
Warm           124239
Cold            97822
Boiling Hot     84790
Hot             67508
Lava Hot        64840
Name: ord_2, dtype: int64

In [6]:
mapping = {
    "Freezing": 0,
    "Warm": 1,
    "Cold": 2,
    "Boiling Hot": 3,
    "Hot": 4,
    "Lava Hot": 5
}

df.loc[:, "ord_2"] = df.ord_2.map(mapping)  # manual mapping, the same idea as label encoding; NaN stays NaN
df.ord_2.value_counts()

0.0    142726
1.0    124239
2.0     97822
3.0     84790
4.0     67508
5.0     64840
Name: ord_2, dtype: int64

In [7]:
df = df_backup.copy()  # restore the original data

In [8]:
#Step 1: Fill NaN values since Label Encoding does not handle NaN values
#df.ord_2.isna().sum()
#df.ord_2[df.ord_2.isna()==True]
df.loc[:, "ord_2"] = df.ord_2.fillna("None")
#df.ord_2.value_counts()

# Step 2: fit the encoder, then transform the column to integer codes
lbl_enc = preprocessing.LabelEncoder()
lbl_enc.fit(df.ord_2.values)

df.loc[:, "ord_2"] = lbl_enc.transform(df.ord_2.values)

In [9]:
df.ord_2.value_counts()

2    142726
6    124239
1     97822
0     84790
3     67508
4     64840
5     18075
Name: ord_2, dtype: int64
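
To see which label each code stands for, the fitted encoder can be inspected. `LabelEncoder` stores its classes in sorted order in `classes_`, which is why "None" received code 5 and "Warm" code 6 above. The sketch below uses a small made-up series in place of `df.ord_2`; the names `ord_2`, `codes`, and `code_of` are hypothetical.

```python
import pandas as pd
from sklearn import preprocessing

# Toy stand-in for df.ord_2 (made-up values, not the real dataset).
ord_2 = pd.Series(["Hot", "Warm", None, "Freezing", "Cold", "Warm"], dtype=object)

lbl_enc = preprocessing.LabelEncoder()
codes = lbl_enc.fit_transform(ord_2.fillna("None").values)

# classes_ is sorted, so the position of each class is its integer code.
code_of = {label: code for code, label in enumerate(lbl_enc.classes_)}
print(code_of)

# inverse_transform maps the codes back to the original labels.
print(lbl_enc.inverse_transform(codes))
```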

## 2. One Hot Encoding

In [10]:
df = df_backup.copy()
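
As a minimal sketch of one-hot encoding, assuming scikit-learn's `OneHotEncoder` with default settings and a small made-up column standing in for a nominal feature such as `nom_0` (the frame `toy` and its values are hypothetical):

```python
import pandas as pd
from sklearn import preprocessing

# Toy stand-in for a nominal column such as nom_0 (made-up values).
toy = pd.DataFrame({"nom_0": ["Red", "Blue", None, "Green", "Red"]}, dtype=object)

# Fill missing values first so that NaN becomes its own category.
toy["nom_0"] = toy["nom_0"].fillna("None")

# With default settings the encoder returns a SciPy sparse matrix
# with one column per category.
ohe = preprocessing.OneHotEncoder()
encoded = ohe.fit_transform(toy[["nom_0"]])

print(ohe.categories_[0])  # categories in sorted order
print(encoded.toarray())   # dense view, for inspection only
```

Keeping the result sparse matters on the full dataset: a high-cardinality column such as `nom_9` would produce a very wide dense array.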

## 3. Combining Categorical Columns

In [11]:
df[df.ord_2 == "Boiling Hot"].shape

(84790, 25)

In [12]:
df.groupby(["ord_2"])["id"].count()

ord_2
Boiling Hot     84790
Cold            97822
Freezing       142726
Hot             67508
Lava Hot        64840
Warm           124239
Name: id, dtype: int64

In [13]:
# Count per ord_2 category, repeated for every row (can be assigned as a new column)
df.groupby(["ord_2"])["id"].transform("count")

0          67508.0
1         124239.0
2         142726.0
3          64840.0
4          97822.0
            ...   
599995    142726.0
599996     84790.0
599997    142726.0
599998    124239.0
599999     84790.0
Name: id, Length: 600000, dtype: float64
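
The result of `transform("count")` has one value per row, so it can be stored directly as a count feature. A minimal sketch on a made-up frame, assuming missing values are filled first as in the Steps section; the column name `ord_2_count` is hypothetical:

```python
import pandas as pd

# Toy stand-in for the dataset (made-up values).
toy = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "ord_2": ["Hot", "Warm", "Hot", None, "Hot"],
}, dtype=object)

# Treat missing values as their own category before counting.
toy["ord_2"] = toy["ord_2"].fillna("None")

# transform("count") returns a value for every row, aligned with the
# original index, so it can be assigned as a new column.
toy["ord_2_count"] = toy.groupby("ord_2")["id"].transform("count")

print(toy)
```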

- Group by multiple columns and count the rows in each group
    - For example: counts by grouping on ord_1 and ord_2

In [18]:
df.groupby(["ord_1", "ord_2"])["id"].count().reset_index(name="count")[:5]

| | ord_1 | ord_2 | count |
|---|---|---|---|
| 0 | Contributor | Boiling Hot | 15634 |
| 1 | Contributor | Cold | 17734 |
| 2 | Contributor | Freezing | 26082 |
| 3 | Contributor | Hot | 12428 |
| 4 | Contributor | Lava Hot | 11919 |


- We can combine categorical columns into a new feature
- NaN values are also converted to strings, so they appear as "nan" in the combined value

In [20]:
df["new_feature"] = df.ord_1.astype(str) + "_" + df.ord_2.astype(str)
df["new_feature"].head()

0     Contributor_Hot
1    Grandmaster_Warm
2        nan_Freezing
3     Novice_Lava Hot
4    Grandmaster_Cold
Name: new_feature, dtype: object
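
If an explicit missing-value category is preferred over the literal "nan", the columns can be filled before they are joined. A minimal sketch on a made-up frame, assuming "None" as the fill value as in the Steps section; `toy` and `combined` are hypothetical names:

```python
import pandas as pd

# Toy stand-in for the two ordinal columns (made-up values).
toy = pd.DataFrame({
    "ord_1": ["Contributor", "Grandmaster", None, "Novice"],
    "ord_2": ["Hot", "Warm", "Freezing", None],
}, dtype=object)

# Fill missing values explicitly so the combined label reads
# "None_..." rather than "nan_...".
combined = toy["ord_1"].fillna("None") + "_" + toy["ord_2"].fillna("None")
toy["new_feature"] = combined

print(toy["new_feature"].tolist())
```

The combined column is itself categorical, so it can then be label encoded or one-hot encoded like any other.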