# Goeduhub Technologies - ML Training - Task 10
## Registration ID: GO_STP_939 
## Name: Manoj Kannan D
---
# Assignment-10: Pandas Dummy Variables and One-Hot Encoding with sklearn
**Discuss the concepts of One-Hot Encoding, Multicollinearity and the Dummy Variable Trap. What are Nominal and Ordinal Variables?** <br>

**The salary dataset of 52 professors contains categorical columns. Apply the dummy-variable concept and one-hot encoding to those columns.**

## What are Nominal and Ordinal Variables?
### Nominal Variable: (Unordered Data)
A nominal variable is a type of variable that is used to name, label or categorize particular attributes
that are being measured. It takes qualitative values representing different categories, and there is no
intrinsic ordering of these categories.<br>
**E.g.: categories of shapes, places, colors, etc.**

### Ordinal Variable: (Ordered Data)
Unlike a nominal variable, an ordinal variable takes values that have a natural order or rank, although the distance between consecutive values is not necessarily meaningful. <br>
**E.g.: categories of rank (I, II, III), sizes (S, M, L).**
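The difference matters when encoding: an ordinal variable can be mapped to integers that preserve its order, while a nominal variable cannot. A minimal sketch, assuming a hypothetical toy series of shirt sizes (not part of the salary dataset):

```python
import pandas as pd

# Hypothetical toy data: shirt sizes have a natural order S < M < L.
sizes = pd.Series(["M", "S", "L", "M"])

# An ordered categorical records the ranking explicitly.
ordered_sizes = pd.Categorical(sizes, categories=["S", "M", "L"], ordered=True)

# Integer codes follow the declared order: S -> 0, M -> 1, L -> 2.
size_codes = pd.Series(ordered_sizes.codes)
print(size_codes.tolist())
```

For a nominal variable such as colour, integer codes like these would suggest an order that does not exist, which is why one-hot encoding is used instead.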

## What is One-Hot Encoding?
One-Hot Encoding converts a categorical feature into dummy variables: it creates one new binary column for every unique value of the feature. Each row has a 1 in the column that matches its category and a 0 in all the others.

__Note:__ One-Hot Encoding can be performed in two ways.<br>
     1. Using pandas `get_dummies`.<br>
     2. Using sklearn.preprocessing `LabelEncoder` and `OneHotEncoder`.<br>

## What is Dummy Variable Trap and Multicollinearity?

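__Dummy Variable Trap__ is the scenario in which the dummy variables created from one categorical feature are perfectly correlated: when all of them are kept, each one can be predicted exactly from the others, because they sum to 1 in every row.
The Dummy Variable Trap leads to the problem known as __Multicollinearity__. Multicollinearity occurs when there is a dependency among the independent features, which makes the coefficients of a linear model unstable or impossible to estimate uniquely. To overcome it, one dummy variable per categorical feature has to be dropped.

__i.e.__ if a feature produces 3 dummy variables, we can drop one of them, because 2 dummy variables are enough to represent all 3 categories (the dropped category is the row in which both remaining dummies are 0).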
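The dependency can be checked numerically. The sketch below assumes a hypothetical five-row toy column (not the real dataset) and compares the rank of a design matrix with an intercept plus all dummies against one with the first dummy dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with three categories.
rank = pd.Series(["full", "associate", "assistant", "full", "associate"], name="rk")

# Full one-hot encoding: one column per category.
full_dummies = pd.get_dummies(rank, dtype=int)

# Every row sums to 1, so together the dummies duplicate an intercept column.
row_sums = full_dummies.sum(axis=1)

# Intercept + all dummies: 4 columns, but they are linearly dependent.
X_full = np.column_stack([np.ones(len(rank)), full_dummies.to_numpy()])
rank_full = np.linalg.matrix_rank(X_full)

# Intercept + (k - 1) dummies: 3 columns, all linearly independent.
reduced = pd.get_dummies(rank, drop_first=True, dtype=int)
X_reduced = np.column_stack([np.ones(len(rank)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("full:   ", X_full.shape[1], "columns, rank", rank_full)
print("reduced:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

When the rank is lower than the number of columns, ordinary least squares has no unique solution, which is the practical consequence of the trap.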

### Load Dataset

In [1]:
import pandas as pd
import numpy as np

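# the file is whitespace-delimited; sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas
df = pd.read_csv('Datasets/professors_salary_data.csv', sep=r'\s+')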
df.head()

Unnamed: 0,sx,rk,yr,dg,yd,sl
0,male,full,25,doctorate,35,36350
1,male,full,13,doctorate,22,35350
2,male,full,10,doctorate,23,28200
3,female,full,7,doctorate,27,26775
4,male,full,19,masters,30,33696


### Explore Dataset

In [2]:
# number of dimensions (axes) of the dataset
df.ndim

2

In [3]:
# shape of the dataset
df.shape

(52, 6)

In [4]:
# size of the dataset (total number of elements)
df.size

312

In [5]:
# columns present in our dataset
df.columns

Index(['sx', 'rk', 'yr', 'dg', 'yd', 'sl'], dtype='object')

In [6]:
# data type of each column
df.dtypes

sx    object
rk    object
yr     int64
dg    object
yd     int64
sl     int64
dtype: object

In [7]:
# Print a concise summary of a DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   sx      52 non-null     object
 1   rk      52 non-null     object
 2   yr      52 non-null     int64 
 3   dg      52 non-null     object
 4   yd      52 non-null     int64 
 5   sl      52 non-null     int64 
dtypes: int64(3), object(3)
memory usage: 2.6+ KB


In [8]:
# summary statistics, computed only for the numerical columns
df.describe()

Unnamed: 0,yr,yd,sl
count,52.0,52.0,52.0
mean,7.480769,16.115385,23797.653846
std,5.507536,10.22234,5917.289154
min,0.0,1.0,15000.0
25%,3.0,6.75,18246.75
50%,7.0,15.5,23719.0
75%,11.0,23.25,27258.5
max,25.0,35.0,38045.0


In [9]:
# check for null elements
df.isna().sum()

sx    0
rk    0
yr    0
dg    0
yd    0
sl    0
dtype: int64

## Using Pandas get_dummies

In [10]:
# unique classes in categorical feature 'sx'
print(df['sx'].unique())
# unique classes in categorical feature 'rk'
print(df['rk'].unique())
# unique classes in categorical feature 'dg'
print(df['dg'].unique())

['male' 'female']
['full' 'associate' 'assistant']
['doctorate' 'masters']


In [11]:
df_with_dummies = pd.get_dummies(df,columns = ['sx','rk','dg'])
df_with_dummies.head()

Unnamed: 0,yr,yd,sl,sx_female,sx_male,rk_assistant,rk_associate,rk_full,dg_doctorate,dg_masters
0,25,35,36350,0,1,0,0,1,1,0
1,13,22,35350,0,1,0,0,1,1,0
2,10,23,28200,0,1,0,0,1,1,0
3,7,27,26775,1,0,0,0,1,1,0
4,19,30,33696,0,1,0,0,1,0,1


In [12]:
# drop the first dummy of each feature to avoid the dummy variable trap
df_with_dummies = pd.get_dummies(df,columns = ['sx','rk','dg'],drop_first=True)
df_with_dummies.head()

Unnamed: 0,yr,yd,sl,sx_male,rk_associate,rk_full,dg_masters
0,25,35,36350,1,0,1,0
1,13,22,35350,1,0,1,0
2,10,23,28200,1,0,1,0
3,7,27,26775,0,0,1,0
4,19,30,33696,1,0,1,1
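The outputs above show the dummies as 0/1 integers. In pandas 2.0 and later, `get_dummies` returns boolean columns by default, so the same call may print `True`/`False` instead. A minimal sketch, assuming a hypothetical two-row frame that reuses the dataset's column names, passes `dtype=int` to get integers on any version:

```python
import pandas as pd

# Hypothetical two-row frame using the dataset's column names.
toy = pd.DataFrame({"sx": ["male", "female"], "yr": [25, 7]})

# dtype=int forces 0/1 integers regardless of the pandas version's default.
toy_encoded = pd.get_dummies(toy, columns=["sx"], drop_first=True, dtype=int)
print(toy_encoded)
```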


## Using SKlearn One-Hot Encoding

In [13]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

## ColumnTransformer
Applies transformers to columns of an array or pandas DataFrame.<br>
This estimator allows different columns of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. <br>

In [14]:
# ColumnTransformer applies different transformers to different subsets of columns
# and concatenates their outputs side by side. It does not chain steps one after another (Pipeline does that).
# Syntax: ColumnTransformer([(name, transformer, columns), (name, transformer, columns), ...])
# Here columns 0, 1 and 3 are 'sx', 'rk' and 'dg'; remainder='passthrough' keeps the other
# columns unchanged and appends them after the encoded ones.

ct = ColumnTransformer([('encoder', OneHotEncoder(), [0,1,3])],
                                        remainder='passthrough')
ct

ColumnTransformer(remainder='passthrough',
                  transformers=[('encoder', OneHotEncoder(), [0, 1, 3])])

In [15]:
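# fit the transformer and encode the data; dtype=str stores every value as a string
# (use dtype=float instead if the array is going to be fed to a model)
df_column_transformer = np.array(ct.fit_transform(df), dtype=str)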
df_column_transformer[:3]

array([['0.0', '1.0', '0.0', '0.0', '1.0', '1.0', '0.0', '25.0', '35.0',
        '36350.0'],
       ['0.0', '1.0', '0.0', '0.0', '1.0', '1.0', '0.0', '13.0', '22.0',
        '35350.0'],
       ['0.0', '1.0', '0.0', '0.0', '1.0', '1.0', '0.0', '10.0', '23.0',
        '28200.0']], dtype='<U32')