<a href="https://colab.research.google.com/github/Rachita-G/Python_Practice/blob/main/Model_Concepts/Dealing_with_categorical_variables.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dummy variables
Dummy variables are numeric stand-ins for categories, created so that categorical data can be used in the analysis.

Most ML libraries do not accept categorical (string) variables as input. Thus, we convert them into numerical variables.

In [1]:
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import pandas as pd
X = pd.DataFrame({'color':['red','white','pink','red','pink']})
X['Size']=['S','M','L','S','M']
X

   color  Size
0    red     S
1  white     M
2   pink     L
3    red     S
4   pink     M


In [None]:
pd.Categorical(X.color)

['red', 'white', 'pink', 'red', 'pink']
Categories (3, object): ['pink', 'red', 'white']

In [None]:
X.dtypes

color    object
Size     object
dtype: object

In [None]:
X.color=X.color.astype('category')

In [None]:
X.dtypes

color    category
Size       object
dtype: object

In [None]:
X.color.cat.categories

Index(['pink', 'red', 'white'], dtype='object')

In [None]:
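# Reorder the categories and mark them as ordered. Assigning to .cat.categories would
# rename the existing values (pink -> red, red -> pink) instead of reordering them.
X.color = X.color.cat.reorder_categories(['red', 'pink', 'white'], ordered=True)
X.color

0      red
1    white
2     pink
3      red
4     pink
Name: color, dtype: category
Categories (3, object): ['red' < 'pink' < 'white']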

In [None]:
X.color.max() # ordered categoricals support comparisons such as max and min
X.color.min()

'white'

'red'
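
Once a column is an ordered categorical, its integer codes are available directly. The following is a minimal sketch, assuming the same toy colors and the same arbitrary order (red < pink < white); the variable names are illustrative only.

```python
import pandas as pd

# Same toy values as the notebook's color column.
colors = pd.Series(['red', 'white', 'pink', 'red', 'pink'])

# Declare the category order up front (an arbitrary order chosen for illustration).
color_dtype = pd.CategoricalDtype(categories=['red', 'pink', 'white'], ordered=True)
colors = colors.astype(color_dtype)

# .cat.codes gives the position of each value in the category list.
codes = colors.cat.codes
print(codes.tolist())                 # [0, 2, 1, 0, 1]
print(colors.max(), colors.min())     # white red
```

Declaring the order through `CategoricalDtype` keeps the original values intact and only changes how they compare.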

# Get Dummies- Pandas
The pandas get_dummies function is a straightforward, one-step way to create dummy variables for categorical features. The advantage is that you can apply it directly to a DataFrame: by default it encodes every column with object, string, or category dtype and leaves numeric columns unchanged.

It is best suited to nominal (unordered) variables; for ordinal variables, use a mapping or another encoder that preserves the order.

Setting drop_first=True drops the first level of each encoded feature. The purpose is to avoid multicollinearity, since the full set of dummy columns for a feature always sums to 1.

In [3]:
X

   color  Size
0    red     S
1  white     M
2   pink     L
3    red     S
4   pink     M


In [4]:
pd.get_dummies(X)

   color_pink  color_red  color_white  Size_L  Size_M  Size_S
0           0          1            0       0       0       1
1           0          0            1       0       1       0
2           1          0            0       1       0       0
3           0          1            0       0       0       1
4           1          0            0       0       1       0


In [5]:
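# If color_red and color_white are both 0, the color is pink; if Size_M and Size_S are both 0, the size is L.
# Dropping the first level therefore loses no information and reduces the number of features.
pd.get_dummies(X, drop_first=True)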

   color_red  color_white  Size_M  Size_S
0          1            0       0       1
1          0            1       1       0
2          0            0       0       0
3          1            0       0       1
4          0            0       1       0
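
The multicollinearity argument can be checked numerically. This is a small sketch on the same toy colors (the names `X_demo`, `full` and `reduced` are made up for the example): without drop_first the dummy columns of a feature always sum to 1, and the dropped level can be rebuilt from the remaining columns.

```python
import pandas as pd

X_demo = pd.DataFrame({'color': ['red', 'white', 'pink', 'red', 'pink']})

# dtype=int forces 0/1 output regardless of the pandas version's default.
full = pd.get_dummies(X_demo, dtype=int)
reduced = pd.get_dummies(X_demo, drop_first=True, dtype=int)

# Every row has exactly one active level, so the columns are linearly dependent.
row_sums = full.sum(axis=1)
print(row_sums.tolist())      # [1, 1, 1, 1, 1]

# The dropped level ('pink', first in sorted order) is implied by the others.
pink = 1 - reduced.sum(axis=1)
print(pink.tolist())          # [0, 0, 1, 0, 1]
```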


#  Mapping

In [6]:
X['Sex']=['m','f','f','m','f']
X

   color  Size  Sex
0    red     S    m
1  white     M    f
2   pink     L    f
3    red     S    m
4   pink     M    f


In [7]:
X['Sex']=X['Sex'].map({'m':0,'f':1})
X

   color  Size  Sex
0    red     S    0
1  white     M    1
2   pink     L    1
3    red     S    0
4   pink     M    1
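
The same approach works for ordinal variables, where the dictionary spells out the order explicitly. A minimal sketch, assuming the sizes rank as S < M < L (the mapping `size_order` is an assumption for illustration, not part of the notebook):

```python
import pandas as pd

sizes = pd.Series(['S', 'M', 'L', 'S', 'M'], name='Size')

# Assumed ranking: small < medium < large.
size_order = {'S': 0, 'M': 1, 'L': 2}
size_codes = sizes.map(size_order)
print(size_codes.tolist())            # [0, 1, 2, 0, 1]

# Values missing from the dictionary become NaN, so check for them after mapping.
with_unknown = pd.Series(['S', 'XL']).map(size_order)
print(with_unknown.isna().tolist())   # [False, True]
```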


# Ordinal Encoder- Scikit Learn
To convert categorical features to integer codes, we can use scikit-learn's OrdinalEncoder. This estimator transforms each categorical feature into one new feature of integers (0 to n_categories - 1):

In [8]:
from sklearn.preprocessing import OrdinalEncoder

In [9]:
X1=X.copy()
# The encoder expects 2-D input, so pass a one-column DataFrame and flatten the result.
X1['Ordinal Encoder for Color'] = OrdinalEncoder().fit_transform(X[['color']]).ravel()

In [10]:
X1 # categories are sorted alphabetically, so pink=0, red=1, white=2

   color  Size  Sex  Ordinal Encoder for Color
0    red     S    0                        1.0
1  white     M    1                        2.0
2   pink     L    1                        0.0
3    red     S    0                        1.0
4   pink     M    1                        0.0
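
By default the codes follow sorted order, which rarely matches a real ranking. A short sketch of passing an explicit order through the `categories` parameter, assuming the sizes rank as S < M < L (the frame `X_demo` is a stand-in for the notebook's data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_demo = pd.DataFrame({'Size': ['S', 'M', 'L', 'S', 'M']})

# One list of categories per input column, in the desired order (assumed: S < M < L).
enc = OrdinalEncoder(categories=[['S', 'M', 'L']])
size_codes = enc.fit_transform(X_demo[['Size']]).ravel()
print(size_codes.tolist())    # [0.0, 1.0, 2.0, 0.0, 1.0]
```

Without the `categories` argument, the sorted order L, M, S would give L=0, M=1, S=2.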


# Label Encoding

* LabelEncoder should be used to encode the target values, i.e. y, and not the input X. It assigns integer codes in sorted label order, so the codes carry no meaningful ranking.
* Ordinal encoding should be used for ordinal variables, where order matters (e.g. cold, warm, hot).

In [11]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
X1['Label Encoder for Color'] = le.fit_transform(X1['color'])

In [12]:
X1

   color  Size  Sex  Ordinal Encoder for Color  Label Encoder for Color
0    red     S    0                        1.0                        1
1  white     M    1                        2.0                        2
2   pink     L    1                        0.0                        0
3    red     S    0                        1.0                        1
4   pink     M    1                        0.0                        0


In [13]:
x=['a','b','c','d','a','d','c']
le.fit_transform(x) # codes follow the sorted order of the labels (a=0, b=1, ...); they do not rank importance

array([0, 1, 2, 3, 0, 3, 2])
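
Since LabelEncoder is meant for targets, it is useful to be able to turn predicted codes back into labels. A minimal sketch with hypothetical target labels (`y` and `le_demo` are made up for the example):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels, used only for illustration.
y = ['spam', 'ham', 'spam', 'eggs']

le_demo = LabelEncoder()
y_codes = le_demo.fit_transform(y)

# classes_ holds the labels in sorted order; a label's code is its index there.
print(le_demo.classes_.tolist())      # ['eggs', 'ham', 'spam']
print(y_codes.tolist())               # [2, 1, 2, 0]

# inverse_transform maps integer codes back to the original labels.
decoded = le_demo.inverse_transform(y_codes)
print(decoded.tolist())               # ['spam', 'ham', 'spam', 'eggs']
```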

# One Hot Encoding

One-hot encoding is dummy encoding for non-ordinal (nominal) variables in a dataset. It is the scikit-learn counterpart of get_dummies in pandas.

In [14]:
from sklearn.preprocessing import OneHotEncoder
Oh=OneHotEncoder(sparse_output=False) # dense output; scikit-learn versions before 1.2 use sparse=False instead

In [15]:
Oh.fit_transform(np.array(X1['color']).reshape(-1,1))

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [16]:
Oh.categories_

[array(['pink', 'red', 'white'], dtype=object)]

In [17]:
X1

   color  Size  Sex  Ordinal Encoder for Color  Label Encoder for Color
0    red     S    0                        1.0                        1
1  white     M    1                        2.0                        2
2   pink     L    1                        0.0                        0
3    red     S    0                        1.0                        1
4   pink     M    1                        0.0                        0


# DictVectorizer
In older scikit-learn versions (before 0.20), OneHotEncoder accepted only integer input, so LabelEncoder and OneHotEncoder usually had to be used together as a two-step procedure. A more convenient way is to use DictVectorizer, which achieves both steps at once.


In [18]:
X1[['color','Size']].to_dict()

{'color': {0: 'red', 1: 'white', 2: 'pink', 3: 'red', 4: 'pink'},
 'Size': {0: 'S', 1: 'M', 2: 'L', 3: 'S', 4: 'M'}}

In [19]:
X_dict=X1[['color','Size']].to_dict(orient='records')
X_dict

[{'color': 'red', 'Size': 'S'},
 {'color': 'white', 'Size': 'M'},
 {'color': 'pink', 'Size': 'L'},
 {'color': 'red', 'Size': 'S'},
 {'color': 'pink', 'Size': 'M'}]

orient='records' is required to turn the DataFrame into a list of {column: value} dictionaries, in which each dictionary represents one sample. Note that in this case we don't need to extract the categorical features first; we can convert the whole DataFrame into dicts, because DictVectorizer one-hot encodes string values and passes numeric values through unchanged. This is one advantage over LabelEncoder and OneHotEncoder.

In [20]:
from sklearn.feature_extraction import DictVectorizer

In [21]:
dv_X=DictVectorizer(sparse=False)
dv_X

DictVectorizer(sparse=False)

In [22]:
# apply dv_X on X_dict
X_encod = dv_X.fit_transform(X_dict)
# show X_encoded
X_encod

array([[0., 0., 1., 0., 1., 0.],
       [0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0.]])

Each row represents a sample and each column represents a feature. To see which feature each column holds, we can check the vocabulary_ of this DictVectorizer, which maps each feature name to its column index:

In [23]:
vocab = dv_X.vocabulary_
vocab

{'color=red': 4,
 'Size=S': 2,
 'color=white': 5,
 'Size=M': 1,
 'color=pink': 3,
 'Size=L': 0}

In [27]:
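# vocabulary_ is not ordered by column index, so using it as column labels would mislabel the data.
# get_feature_names_out() returns the names in column order.
dv=pd.DataFrame(X_encod,columns=dv_X.get_feature_names_out())
dv

   Size=L  Size=M  Size=S  color=pink  color=red  color=white
0     0.0     0.0     1.0         0.0        1.0          0.0
1     0.0     1.0     0.0         0.0        0.0          1.0
2     1.0     0.0     0.0         1.0        0.0          0.0
3     0.0     0.0     1.0         0.0        1.0          0.0
4     0.0     1.0     0.0         1.0        0.0          0.0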


In [29]:
# pd.merge(X,dv,on=dv.index)

# Conclusion
In older scikit-learn versions, LabelEncoder and OneHotEncoder usually had to be used together as a two-step method to encode categorical features; OneHotEncoder now accepts string categories directly.
* LabelEncoder outputs a 1-D numpy array of integer codes, while OneHotEncoder outputs a sparse matrix by default or, optionally, a dense numpy array.
* DictVectorizer is a one-step method to encode features and also supports sparse matrix output.
* Pandas get_dummies is the most straightforward and easiest way to encode categorical features, and its output remains a DataFrame.
* In my view, the first choice will be pandas get_dummies. But if the number of categorical features is huge, DictVectorizer is a good choice, as it supports sparse matrix output.

###### *See column transformation notebook!*