# Encoding Categorical Data

### Data Preprocessing and Feature Engineering
1) Detecting and Handling Outliers
2) Missing Values Imputation
3) Encoding Categorical Features
4) Feature Scaling
5) Extracting Information
6) Combining Information

### Data Types
- Numerical data expresses information as numbers, while categorical data groups information into categories with similar characteristics
- Data Types
1) Numerical or Quantitative Data
     - ***Continuous Data*** (Can be measured and can take infinitely many values within a range, e.g., Weight, Fare, GPA)
     - ***Discrete Data*** (Can be counted and cannot be subdivided meaningfully, e.g., Number of children, World Population)
2) Categorical or Qualitative Data
     - ***Nominal Data*** (Can be slotted into mutually exclusive categories that have no specific order, e.g., blood group, gender)
     - ***Ordinal Data*** (Can be slotted into mutually exclusive categories that can be ordered or ranked, e.g., Grade, Passenger class)
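
The nominal/ordinal distinction can be made explicit in pandas. The sketch below uses made-up values (not rows from the dataset used later) and the `ordered` flag of `pd.Categorical`; the variable names are illustrative only.

```python
import pandas as pd

# Nominal: categories have no inherent order (hypothetical sample values).
blood_group = pd.Categorical(
    ["A", "O", "B", "O"], categories=["A", "B", "AB", "O"], ordered=False
)

# Ordinal: categories carry a rank, declared here as low < normal < high.
bp_demo = pd.Categorical(
    ["low", "high", "normal", "low"],
    categories=["low", "normal", "high"],
    ordered=True,
)

print(blood_group.ordered)             # False
print(bp_demo.ordered)                 # True
print(bp_demo.codes.tolist())          # position of each value in the declared order
print(bp_demo.min(), bp_demo.max())    # min/max are only defined for ordered categories
```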

### Encoding Categorical Data
- The process of converting categorical data into numerical values so that it can be fed to Machine Learning models
- sklearn.preprocessing
    - 1) OrdinalEncoder
    - 2) LabelEncoder
    - 3) OneHotEncoder

### Practical

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('datasets/disease.csv')
df.sample(n=7, random_state=54)

Unnamed: 0,gender,city,age,bp,cough,disease
36,Female,Shaikhupura,38.0,normal,Mild,No
70,Female,Islamabad,68.0,normal,Strong,No
48,Male,Shaikhupura,66.0,low,Moderate,No
94,Male,Lahore,79.0,normal,Strong,Yes
81,Male,Islamabad,65.0,normal,Mild,No
46,Female,Karachi,,normal,Moderate,No
38,Female,Islamabad,49.0,high,Mild,Yes


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   gender   100 non-null    object 
 1   city     100 non-null    object 
 2   age      87 non-null     float64
 3   bp       100 non-null    object 
 4   cough    100 non-null    object 
 5   disease  100 non-null    object 
dtypes: float64(1), object(5)
memory usage: 4.8+ KB


In [5]:
df.nunique()

gender      2
city        4
age        54
bp          3
cough       3
disease     2
dtype: int64

In [6]:
df.cough.value_counts()

Mild        47
Strong      30
Moderate    23
Name: cough, dtype: int64

In [7]:
df.bp.value_counts()

normal    47
low       28
high      25
Name: bp, dtype: int64

In [8]:
df.disease.value_counts()

No     54
Yes    46
Name: disease, dtype: int64

### Sklearn LabelEncoder vs OrdinalEncoder
- Label encoding assigns each category of a variable a unique integer from 0 to n-1 (n being the number of categories), based on alphabetical ordering
- LabelEncoder is meant for encoding the output (target) variable, while OrdinalEncoder is meant for encoding input feature variables of ordinal type, which have an intrinsic order.
- LabelEncoder fits one column (a 1-D array) at a time, while OrdinalEncoder can fit multiple columns (a 2-D array) at the same time.
- By default, both encoders sort the values of a column alphabetically and then assign them integer values.
- For example, for a city column with the two values "Islamabad" and "Karachi", Islamabad is assigned the integer 0 and Karachi the integer 1. With OrdinalEncoder, however, we can specify the order in which categories are mapped to integers
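
A minimal sketch of the city example, using a small hand-made array rather than the dataset. It also shows the shape difference: `LabelEncoder` takes a 1-D array, `OrdinalEncoder` a 2-D one.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy data for illustration only (not read from disease.csv)
cities = np.array(["Karachi", "Islamabad", "Karachi", "Islamabad"])

le_demo = LabelEncoder()
city_codes = le_demo.fit_transform(cities)        # 1-D input
print(le_demo.classes_)                           # categories in sorted order
print(city_codes)                                 # Islamabad -> 0, Karachi -> 1

oe_demo = OrdinalEncoder()
city_codes_2d = oe_demo.fit_transform(cities.reshape(-1, 1))  # 2-D input
print(city_codes_2d.ravel())                      # same mapping, returned as floats
```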

In [9]:
y = df.iloc[:, -1]
y

0     Yes
1      No
2      No
3     Yes
4      No
     ... 
95     No
96    Yes
97     No
98     No
99    Yes
Name: disease, Length: 100, dtype: object

In [10]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(y)

In [11]:
dis = le.transform(y)

In [13]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [14]:
dis

array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1])

In [15]:
dis.shape

(100,)
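
A fitted `LabelEncoder` can also map the integer codes back to the original labels with `inverse_transform`, which is handy when reporting predictions. A small sketch on hypothetical labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values, standing in for the disease column
labels = ["Yes", "No", "No", "Yes"]

le_target = LabelEncoder()
encoded = le_target.fit_transform(labels)       # 'No' -> 0, 'Yes' -> 1
decoded = le_target.inverse_transform(encoded)  # back to the original strings

print(encoded)
print(decoded)
```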

In [16]:
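from sklearn.preprocessing import OrdinalEncoder
x = df.iloc[:, 3:5]    # the two ordinal feature columns: bp and cough
oe = OrdinalEncoder()  # default categories='auto' sorts each column's values alphabetically
oe.fit(x)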


In [17]:
oe_t = oe.transform(x)

In [18]:
oe.categories_

[array(['high', 'low', 'normal'], dtype=object),
 array(['Mild', 'Moderate', 'Strong'], dtype=object)]

In [20]:
# oe_t

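- With the defaults, bp is encoded as high=0, low=1, normal=2, which does not reflect the natural order low < normal < high
- Unlike LabelEncoder, OrdinalEncoder can map categories to integers in a chosen order when that order is passed through the categories argument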
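
A sketch of passing an explicit order through `categories`, one list per column. The small frame below is hand-made and only mimics the bp and cough columns; the chosen orders (low < normal < high, Mild < Moderate < Strong) are an assumption about what the intended ranking is.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hand-made rows mimicking the bp and cough columns
x_demo = pd.DataFrame(
    {"bp": ["low", "high", "normal"], "cough": ["Strong", "Mild", "Moderate"]}
)

# One list of categories per column, in the order the integers should follow
oe_ordered = OrdinalEncoder(
    categories=[["low", "normal", "high"], ["Mild", "Moderate", "Strong"]]
)
x_encoded = oe_ordered.fit_transform(x_demo)
print(x_encoded)  # low -> 0, normal -> 1, high -> 2; Mild -> 0, Moderate -> 1, Strong -> 2
```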

### Encoding Nominal Variables Using OneHotEncoder
- One-hot encoding creates dummy variables: each category gets its own binary column, so every value is represented as a one-hot vector

***Limitations***
1) Multicollinearity
   - The dummy columns of one feature always sum to 1, so any one of them can be predicted from the others. Dropping one column per feature (e.g. drop_first = True) avoids this
2) Curse of Dimensionality
   - Suppose we have a feature column with 100 unique values. If we encode this feature using one-hot encoding, we get 100 columns. This increases the dimensionality of the overall dataset, which leads to the curse of dimensionality
   - A solution is to keep the most frequent categories, say the top ten, as separate columns and assign all the less frequent categories to a new 11th category, say "Others". This way you will have a total of 11 columns instead of 100.

- Label encoding is a poor fit for linear models, SVMs and neural networks: these models are sensitive to the magnitude of their inputs, so they read the integer codes as an order and spacing that nominal categories do not have
- One-hot encoding overcomes this limitation of label encoding and can be used with both tree-based and non-tree-based algorithms
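
The "Others" grouping described above can be sketched in plain pandas. The helper name `group_rare_categories` and its parameters are hypothetical, and the series is synthetic; this is one possible way to do it, not a fixed recipe.

```python
import pandas as pd

def group_rare_categories(series, top_n=10, other_label="Others"):
    """Keep the top_n most frequent categories; replace the rest with other_label."""
    top = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top), other_label)

# Synthetic column: three frequent categories and three rare ones
s = pd.Series(["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d", "e", "f"])

grouped = group_rare_categories(s, top_n=3)
dummies = pd.get_dummies(grouped)

print(grouped.value_counts())
print(dummies.shape)  # 4 dummy columns instead of 6
```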

In [21]:
df

Unnamed: 0,gender,city,age,bp,cough,disease
0,Male,Lahore,60.0,low,Moderate,Yes
1,Male,Islamabad,27.0,low,Mild,No
2,Male,Islamabad,,normal,Strong,No
3,Female,Lahore,31.0,high,Moderate,Yes
4,Female,Karachi,65.0,high,Mild,No
...,...,...,...,...,...,...
95,Female,Shaikhupura,,normal,Mild,No
96,Female,Lahore,51.0,high,Strong,Yes
97,Female,Shaikhupura,20.0,normal,Mild,No
98,Female,Karachi,5.0,low,Moderate,No


In [5]:
x_gender = df.iloc[:,0:1]
x_gender

Unnamed: 0,gender
0,Male
1,Male
2,Male
3,Female
4,Female
...,...
95,Female
96,Female
97,Female
98,Female


In [6]:
pd.get_dummies(data = x_gender, drop_first = True)


Unnamed: 0,gender_Male
0,1
1,1
2,1
3,0
4,0
...,...
95,0
96,0
97,0
98,0


In [3]:
pd.get_dummies(data = df, columns = ['gender', 'city'], drop_first = True)


Unnamed: 0,age,bp,cough,disease,gender_Male,city_Karachi,city_Lahore,city_Shaikhupura
0,60.0,low,Moderate,Yes,1,0,1,0
1,27.0,low,Mild,No,1,0,0,0
2,,normal,Strong,No,1,0,0,0
3,31.0,high,Moderate,Yes,0,0,1,0
4,65.0,high,Mild,No,0,1,0,0
...,...,...,...,...,...,...,...,...
95,,normal,Mild,No,0,0,0,1
96,51.0,high,Strong,Yes,0,0,1,0
97,20.0,normal,Mild,No,0,0,0,1
98,5.0,low,Moderate,No,0,1,0,0
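
Note that the dtype of the dummy columns depends on the pandas version: from pandas 2.0 onward `get_dummies` returns boolean columns by default, so the 0/1 output shown above may appear as False/True. Passing `dtype=int` keeps the integer form. A sketch on a two-row hand-made frame:

```python
import pandas as pd

# Two made-up rows with the same column names as the dataset
demo = pd.DataFrame(
    {"gender": ["Male", "Female"], "city": ["Lahore", "Karachi"], "age": [60.0, 65.0]}
)

# dtype=int forces 0/1 columns regardless of the pandas default
demo_dummies = pd.get_dummies(
    demo, columns=["gender", "city"], drop_first=True, dtype=int
)
print(demo_dummies)
```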


In [9]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop = 'first', dtype = np.int8)
ohe.fit(x_gender)

In [1]:
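# `ohe` must already be defined and fitted in the current kernel session (run the previous cell first);
# calling transform after a kernel restart raises NameError: name 'ohe' is not defined
ohe_gender = ohe.transform(x_gender)
ohe_gender.shape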
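
For reference, a self-contained sketch that fits and transforms in one step, so it does not depend on earlier cells. The frame is hand-made; `.toarray()` is used because `OneHotEncoder` returns a sparse matrix by default.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hand-made rows with two nominal columns
x_nominal = pd.DataFrame(
    {"gender": ["Male", "Female", "Female"], "city": ["Lahore", "Karachi", "Islamabad"]}
)

ohe_demo = OneHotEncoder(drop="first", dtype=np.int8)
onehot = ohe_demo.fit_transform(x_nominal).toarray()  # sparse -> dense

print(ohe_demo.categories_)  # sorted categories per column; the first of each is dropped
print(onehot)                # columns: gender_Male, city_Karachi, city_Lahore
print(onehot.shape)
```

Fitting and transforming together with `fit_transform` also avoids the out-of-order execution problem, since the encoder is created in the same cell that uses it.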