# What is Categorical Data?

Categorical variables take their values from a finite set of labels and are usually stored as strings or as a dedicated category type. Here are a few examples:
1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
2. The department a person works in: Finance, Human Resources, IT, Production.
3. The highest degree a person has: High school, Diploma, Bachelors, Masters, PhD.
4. The grades of a student: A+, A, B+, B, B-, etc.

In each of these examples, the variable can take only a fixed set of possible values. Categorical data comes in two kinds:

* Ordinal Data: The categories have an inherent order
* Nominal Data: The categories do not have an inherent order
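
The distinction can be made explicit in pandas. The following is a minimal sketch using toy values taken from the examples above; the variable names (`degree_order`, `degrees`, `departments`) are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Ordinal: degrees have a natural ranking, so the order is declared explicitly.
degree_order = ["High school", "Diploma", "Bachelors", "Masters", "PhD"]
degrees = pd.Series(
    pd.Categorical(["Masters", "High school", "PhD"],
                   categories=degree_order, ordered=True)
)

# Nominal: departments have no ranking, so the categorical stays unordered.
departments = pd.Series(pd.Categorical(["IT", "Finance", "IT"]))

# Ordered categoricals support comparisons against a category value.
above_diploma = (degrees > "Diploma").tolist()

# The integer codes of an ordered categorical follow the declared order.
degree_codes = degrees.cat.codes.tolist()

print(above_diploma, degree_codes, departments.cat.ordered)
```

An unordered categorical such as `departments` would raise an error on the same comparison, which is a useful guard against accidentally treating nominal data as ranked.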

### One Hot Encoding Vs Label Encoding

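* For Nominal data we typically use One Hot Encoding, because it does not impose any order on the categories
* For Ordinal data we typically use Label Encoding, provided the integer codes follow the order of the categories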

### One Hot Encoding/Dummy Encoding

One-Hot Encoding is a popular technique for treating categorical variables. It creates one new binary feature for each unique value of the categorical feature. In every row, the feature matching that row's category is set to 1 (or True) and all the others are set to 0 (or False).
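
To make the idea concrete before turning to pandas, here is a minimal pure-Python sketch. The helper `one_hot` and the toy list `months` are hypothetical names introduced only for illustration; libraries such as pandas do the same job with more features.

```python
def one_hot(values):
    """Return (categories, rows); each row holds a single 1 marking its category."""
    categories = sorted(set(values))  # one new feature per unique value
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # switch on the column for this value
        rows.append(row)
    return categories, rows


months = ["may", "jun", "may", "aug"]
categories, encoded = one_hot(months)

print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.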

In [2]:
import pandas as pd

df=pd.read_csv('bank.csv')
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes
1,56,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes
2,41,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes
3,55,services,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes
4,54,admin.,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes


In [3]:
df['month'].unique()

array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'jan', 'feb',
       'mar', 'apr', 'sep'], dtype=object)

In [6]:
month=pd.get_dummies(df['month'])

In [7]:
pd.concat([df,month],axis=1)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,dec,feb,jan,jul,jun,mar,may,nov,oct,sep
0,59,admin.,married,secondary,no,2343,yes,no,unknown,5,...,False,False,False,False,False,False,True,False,False,False
1,56,admin.,married,secondary,no,45,no,no,unknown,5,...,False,False,False,False,False,False,True,False,False,False
2,41,technician,married,secondary,no,1270,yes,no,unknown,5,...,False,False,False,False,False,False,True,False,False,False
3,55,services,married,secondary,no,2476,yes,no,unknown,5,...,False,False,False,False,False,False,True,False,False,False
4,54,admin.,married,tertiary,no,184,no,no,unknown,5,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11157,33,blue-collar,single,primary,no,1,yes,no,cellular,20,...,False,False,False,False,False,False,False,False,False,False
11158,39,services,married,secondary,no,733,no,no,unknown,16,...,False,False,False,False,True,False,False,False,False,False
11159,32,technician,single,secondary,no,29,no,no,cellular,19,...,False,False,False,False,False,False,False,False,False,False
11160,43,technician,married,secondary,no,0,no,yes,cellular,8,...,False,False,False,False,False,False,True,False,False,False


In [9]:
pd.get_dummies(df)

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,...,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,deposit_no,deposit_yes
0,59,2343,5,1042,1,-1,0,True,False,False,...,True,False,False,False,False,False,False,True,False,True
1,56,45,5,1467,1,-1,0,True,False,False,...,True,False,False,False,False,False,False,True,False,True
2,41,1270,5,1389,1,-1,0,False,False,False,...,True,False,False,False,False,False,False,True,False,True
3,55,2476,5,579,1,-1,0,False,False,False,...,True,False,False,False,False,False,False,True,False,True
4,54,184,5,673,2,-1,0,True,False,False,...,True,False,False,False,False,False,False,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11157,33,1,20,257,1,-1,0,False,True,False,...,False,False,False,False,False,False,False,True,True,False
11158,39,733,16,83,4,-1,0,False,False,False,...,False,False,False,False,False,False,False,True,True,False
11159,32,29,19,156,2,-1,0,False,False,False,...,False,False,False,False,False,False,False,True,True,False
11160,43,0,8,9,2,172,5,False,False,False,...,True,False,False,False,True,False,False,False,True,False


In [10]:
pd.get_dummies(df,drop_first=True)

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_blue-collar,job_entrepreneur,job_housemaid,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown,deposit_yes
0,59,2343,5,1042,1,-1,0,False,False,False,...,False,False,True,False,False,False,False,False,True,True
1,56,45,5,1467,1,-1,0,False,False,False,...,False,False,True,False,False,False,False,False,True,True
2,41,1270,5,1389,1,-1,0,False,False,False,...,False,False,True,False,False,False,False,False,True,True
3,55,2476,5,579,1,-1,0,False,False,False,...,False,False,True,False,False,False,False,False,True,True
4,54,184,5,673,2,-1,0,False,False,False,...,False,False,True,False,False,False,False,False,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11157,33,1,20,257,1,-1,0,True,False,False,...,False,False,False,False,False,False,False,False,True,False
11158,39,733,16,83,4,-1,0,False,False,False,...,True,False,False,False,False,False,False,False,True,False
11159,32,29,19,156,2,-1,0,False,False,False,...,False,False,False,False,False,False,False,False,True,False
11160,43,0,8,9,2,172,5,False,False,False,...,False,False,True,False,False,False,False,False,False,False


In [None]:
# We use the pandas function "get_dummies" to create dummy variables

new_df = pd.get_dummies(df, drop_first=True)  # drop_first=True drops one dummy column per feature to avoid the dummy variable trap
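
The dummy variable trap mentioned in the comment can be seen on a tiny example. This is a sketch on a made-up `toy` DataFrame (an assumption, not the bank data): with one column per category, every row sums to 1, so any one column is fully determined by the others.

```python
import pandas as pd

toy = pd.DataFrame({"marital": ["married", "single", "divorced", "single"]})

# One column per category; dtype=int gives 0/1 instead of False/True.
full = pd.get_dummies(toy, dtype=int)

# drop_first=True removes the first category in sorted order ("divorced").
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

# Each row of the full encoding sums to 1, so the columns are perfectly
# collinear with an intercept term in a linear model.
row_sums = full.sum(axis=1).tolist()

# After dropping, "divorced" is represented by a row of all zeros.
reduced_columns = reduced.columns.tolist()
divorced_row = reduced.iloc[2].tolist()

print(row_sums, reduced_columns, divorced_row)
```

No information is lost by dropping a column: the dropped category is simply the case where all remaining dummies are 0.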

### One Hot Encoding from sklearn

In [18]:
from sklearn.preprocessing import OneHotEncoder


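# Reload the data so that we start from the original, unencoded columns
df = pd.read_csv('bank.csv')

# sparse_output=False returns a dense array (scikit-learn >= 1.2; older versions used sparse=False)
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid dummy variable trap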

# Select categorical columns to encode
categorical_cols = ['education']
# Apply OneHotEncoder
encoded_data = encoder.fit_transform(df[categorical_cols])

# Convert encoded data to a DataFrame
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_cols))

# Combine with the original DataFrame
df_encoded = pd.concat([df.drop(columns=categorical_cols), encoded_df], axis=1)
df_encoded



Unnamed: 0,age,job,marital,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit,education_secondary,education_tertiary,education_unknown
0,59,admin.,married,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes,1.0,0.0,0.0
1,56,admin.,married,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes,1.0,0.0,0.0
2,41,technician,married,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes,1.0,0.0,0.0
3,55,services,married,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes,1.0,0.0,0.0
4,54,admin.,married,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11157,33,blue-collar,single,no,1,yes,no,cellular,20,apr,257,1,-1,0,unknown,no,0.0,0.0,0.0
11158,39,services,married,no,733,no,no,unknown,16,jun,83,4,-1,0,unknown,no,1.0,0.0,0.0
11159,32,technician,single,no,29,no,no,cellular,19,aug,156,2,-1,0,unknown,no,1.0,0.0,0.0
11160,43,technician,married,no,0,no,yes,cellular,8,may,9,2,172,5,failure,no,1.0,0.0,0.0


In [17]:
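# sparse_output=False returns a dense array (scikit-learn >= 1.2; older versions used sparse=False)
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid dummy variable trap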

# Select categorical columns to encode
categorical_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

# Apply OneHotEncoder
encoded_data = encoder.fit_transform(df[categorical_cols])

# Convert encoded data to a DataFrame
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_cols))

# Combine with the original DataFrame
df_encoded = pd.concat([df.drop(columns=categorical_cols), encoded_df], axis=1)
df_encoded



Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,deposit,job_blue-collar,job_entrepreneur,...,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown
0,59,2343,5,1042,1,-1,0,yes,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,56,45,5,1467,1,-1,0,yes,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,41,1270,5,1389,1,-1,0,yes,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,55,2476,5,579,1,-1,0,yes,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,54,184,5,673,2,-1,0,yes,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11157,33,1,20,257,1,-1,0,no,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
11158,39,733,16,83,4,-1,0,no,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
11159,32,29,19,156,2,-1,0,no,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
11160,43,0,8,9,2,172,5,no,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


### Label Encoding

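Label Encoding is a popular encoding technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering. Because the codes follow alphabetical order rather than any inherent ranking of the categories, check that the resulting order is the one you intend before relying on it for ordinal data.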
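
The alphabetical assignment can be sketched in a few lines of plain Python. The helper `label_encode` and the toy list `jobs` are hypothetical and only mimic the sorted-classes behaviour; they are not the scikit-learn implementation.

```python
def label_encode(values):
    """Map each distinct label to an integer code following sorted order."""
    classes = sorted(set(values))  # sorted distinct labels
    mapping = {label: code for code, label in enumerate(classes)}
    return classes, [mapping[value] for value in values]


jobs = ["technician", "admin.", "services", "admin."]
classes, codes = label_encode(jobs)

print(classes)
print(codes)
```

Repeated labels always receive the same code, and the position of a label in `classes` is its code.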

In [19]:
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes
1,56,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes
2,41,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes
3,55,services,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes
4,54,admin.,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes


In [21]:
# We use LabelEncoder from the sklearn library

# Import the preprocessing module, which provides LabelEncoder
from sklearn import preprocessing

# label_encoder learns the mapping from string labels to integer codes
label_encoder = preprocessing.LabelEncoder()

# Encode the labels in the 'job' column
df['job'] = label_encoder.fit_transform(df['job'])

In [22]:
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,0,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes
1,56,0,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes
2,41,9,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes
3,55,7,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes
4,54,0,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes


In [26]:
label_encoder.classes_

array(['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management',
       'retired', 'self-employed', 'services', 'student', 'technician',
       'unemployed', 'unknown'], dtype=object)
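
Since `classes_` is sorted alphabetically, the integer codes do not necessarily follow a meaningful ranking. For ordinal data, one option is scikit-learn's `OrdinalEncoder`, which accepts the intended order through its `categories` parameter. The sketch below uses the degree example from the introduction; the `degrees` DataFrame and `degree_order` list are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

degrees = pd.DataFrame(
    {"degree": ["Masters", "High school", "PhD", "Diploma", "Bachelors"]}
)

# LabelEncoder sorts alphabetically, so "Bachelors" gets 0 and "High school" gets 2.
alphabetical = LabelEncoder().fit_transform(degrees["degree"]).tolist()

# OrdinalEncoder takes the intended ranking, one list per encoded column.
degree_order = ["High school", "Diploma", "Bachelors", "Masters", "PhD"]
ordinal = OrdinalEncoder(categories=[degree_order])
ranked = ordinal.fit_transform(degrees[["degree"]]).ravel().tolist()

print(alphabetical)
print(ranked)
```

With the explicit order, "High school" maps to 0 and "PhD" to 4, so the codes increase with the level of the degree.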
