### Registration ID : GO_STP_9654 

Discuss the concepts of one-hot encoding, multicollinearity, and the dummy variable trap. What are nominal and ordinal variables?
The salary dataset of 52 professors has categorical columns. Apply the dummy variable concept and one-hot encoding to the categorical
columns.

One-Hot-Encoding

One-hot encoding is one method of converting categorical data into a numeric form that an algorithm can use, which often gives
better predictions. With one-hot encoding, each distinct category becomes its own new column that holds a binary value of 1 or 0.
Each original value is therefore represented as a binary vector containing a single 1.
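
The idea can be sketched on a tiny, made-up column before it is applied to the salary data. The frame `toy` and its values are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy column, not the salary dataset.
toy = pd.DataFrame({"rank": ["full", "associate", "assistant", "full"]})

# One new column per distinct category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(toy["rank"], dtype=int)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column that matches the row's original category.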

Multicollinearity

Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model. 
This means that an independent variable can be predicted from another independent variable in a regression model.
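
A minimal sketch of this idea on synthetic data (the variables `x1`, `x2`, `x3` are invented for illustration). It regresses each predictor on the others and reports the variance inflation factor, 1 / (1 - R^2), a common diagnostic that this notebook does not otherwise use.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # nearly a rescaled copy of x1
x3 = rng.normal(size=n)                   # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

corr_12 = np.corrcoef(x1, x2)[0, 1]
vifs = [vif(X, j) for j in range(X.shape[1])]
print(corr_12, vifs)
```

A large factor for `x1` and `x2` and a factor near 1 for `x3` indicates that only the first two predictors are collinear.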

Dummy Variable Trap

The dummy variable trap is a scenario in which attributes are highly correlated (multicollinear), so that one variable
predicts the value of the others. With one-hot encoding, the dummy columns of a categorical variable always sum to 1, so any one
dummy variable (attribute) can be predicted exactly from the other dummy variables. Dropping one dummy column per categorical
variable avoids the trap.
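
The trap can be made concrete with a small sketch, assuming a regression that includes an intercept column. The toy `rank` values are made up. With all dummy columns present, the design matrix has more columns than its rank; after one dummy is dropped, it has full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not the salary dataset.
toy = pd.DataFrame({"rank": ["full", "associate", "assistant", "full", "assistant"]})

full_dummies = pd.get_dummies(toy["rank"], dtype=int)               # 3 dummy columns
reduced = pd.get_dummies(toy["rank"], dtype=int, drop_first=True)   # 2 dummy columns

intercept = np.ones((len(toy), 1))
X_trap = np.hstack([intercept, full_dummies.to_numpy()])  # intercept + 3 dummies
X_ok = np.hstack([intercept, reduced.to_numpy()])         # intercept + 2 dummies

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(X_trap.shape[1], rank_trap)  # more columns than rank: redundant column
print(X_ok.shape[1], rank_ok)      # columns equal rank: no redundancy
```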

Nominal Variable

Nominal data is made of discrete values with no numerical relationship between the different categories, so the mean and median
are meaningless. Animal species is one example: pig is not higher than bird or lower than fish.

Ordinal Variable

An ordinal variable is a categorical variable whose possible values are ordered. Ordinal variables can be considered "in between" categorical and quantitative variables. Example: educational level, where categories such as elementary school education can be ranked from lowest to highest.
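
One way to keep the ordering explicit in pandas is an ordered categorical. The level names below are illustrative assumptions; any ranked list of categories would work the same way.

```python
import pandas as pd

# Hypothetical education levels, listed from lowest to highest.
levels = ["elementary", "high school", "college"]
education = pd.Series(["college", "elementary", "high school"])

# ordered=True records that the categories have a rank, not just distinct labels.
ordered = pd.Categorical(education, categories=levels, ordered=True)
codes = ordered.codes  # integer position of each value within `levels`
print(list(codes))
```

For a nominal variable the same call with `ordered=False` would still produce integer codes, but comparisons such as "less than" would not be meaningful.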

In [50]:
# import libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [51]:
# read dataset
url = "https://data.princeton.edu/wws509/datasets/salary.dat"
df = pd.read_csv(url, sep=r"\s+")  # whitespace-delimited file; delim_whitespace is deprecated in recent pandas
print(type(df))
print(df.shape)

<class 'pandas.core.frame.DataFrame'>
(52, 6)


In [4]:
df.head()

Unnamed: 0,sx,rk,yr,dg,yd,sl
0,male,full,25,doctorate,35,36350
1,male,full,13,doctorate,22,35350
2,male,full,10,doctorate,23,28200
3,female,full,7,doctorate,27,26775
4,male,full,19,masters,30,33696


In [5]:
df.tail()

Unnamed: 0,sx,rk,yr,dg,yd,sl
47,female,assistant,2,doctorate,2,15350
48,male,assistant,1,doctorate,1,16244
49,female,assistant,1,doctorate,1,16686
50,female,assistant,1,doctorate,1,15000
51,female,assistant,0,doctorate,2,20300


In [52]:
df.columns = ['Sex', 'Rank', 'Yrs. of service', 'Degree', 'Yrs. Since phd', 'Salary']
df.head()

Unnamed: 0,Sex,Rank,Yrs. of service,Degree,Yrs. Since phd,Salary
0,male,full,25,doctorate,35,36350
1,male,full,13,doctorate,22,35350
2,male,full,10,doctorate,23,28200
3,female,full,7,doctorate,27,26775
4,male,full,19,masters,30,33696


In [7]:
df.describe()

Unnamed: 0,Yrs. of service,Yrs. Since phd,Salary
count,52.0,52.0,52.0
mean,7.480769,16.115385,23797.653846
std,5.507536,10.22234,5917.289154
min,0.0,1.0,15000.0
25%,3.0,6.75,18246.75
50%,7.0,15.5,23719.0
75%,11.0,23.25,27258.5
max,25.0,35.0,38045.0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Sex              52 non-null     object
 1   Rank             52 non-null     object
 2   Yrs. of service  52 non-null     int64 
 3   Degree           52 non-null     object
 4   Yrs. Since phd   52 non-null     int64 
 5   Salary           52 non-null     int64 
dtypes: int64(3), object(3)
memory usage: 2.6+ KB


In [9]:
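df.corr(numeric_only=True)  # correlate only the numeric columns; recent pandas raises on string columns otherwise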

Unnamed: 0,Yrs. of service,Yrs. Since phd,Salary
Yrs. of service,1.0,0.638776,0.700669
Yrs. Since phd,0.638776,1.0,0.674854
Salary,0.700669,0.674854,1.0


In [10]:
df.isnull().sum()

Sex                0
Rank               0
Yrs. of service    0
Degree             0
Yrs. Since phd     0
Salary             0
dtype: int64

### 1) Normal Method (Manual Integer Encoding)

In [11]:
df1 = df.copy()  # work on a copy so the edits below do not modify df

In [12]:
df1['Sex'].unique()

array(['male', 'female'], dtype=object)

In [13]:
# map each category to an integer code in one step
df1['Sex'] = df1['Sex'].map({'male': 1, 'female': 2})
df1.head()

Unnamed: 0,Sex,Rank,Yrs. of service,Degree,Yrs. Since phd,Salary
0,1,full,25,doctorate,35,36350
1,1,full,13,doctorate,22,35350
2,1,full,10,doctorate,23,28200
3,2,full,7,doctorate,27,26775
4,1,full,19,masters,30,33696


In [14]:
df1['Rank'].unique()

array(['full', 'associate', 'assistant'], dtype=object)

In [15]:
# map each category to an integer code in one step
df1['Rank'] = df1['Rank'].map({'full': 1, 'associate': 2, 'assistant': 3})
df1.head()

Unnamed: 0,Sex,Rank,Yrs. of service,Degree,Yrs. Since phd,Salary
0,1,1,25,doctorate,35,36350
1,1,1,13,doctorate,22,35350
2,1,1,10,doctorate,23,28200
3,2,1,7,doctorate,27,26775
4,1,1,19,masters,30,33696


In [16]:
df1['Degree'].unique()

array(['doctorate', 'masters'], dtype=object)

In [17]:
# map each category to an integer code in one step
df1['Degree'] = df1['Degree'].map({'doctorate': 1, 'masters': 2})
df1.head()

Unnamed: 0,Sex,Rank,Yrs. of service,Degree,Yrs. Since phd,Salary
0,1,1,25,1,35,36350
1,1,1,13,1,22,35350
2,1,1,10,1,23,28200
3,2,1,7,1,27,26775
4,1,1,19,2,30,33696


### 2) Dummy Variable

In [53]:
df2 = df.copy()  # independent copy of the original dataframe

In [54]:
# get_dummies() creates one column per unique value in the Sex column
dummies1 = pd.get_dummies(df2.Sex)
dummies1.head()

Unnamed: 0,female,male
0,0,1
1,0,1
2,0,1
3,1,0
4,0,1


In [55]:
# avoid the dummy variable trap: drop one of the dummy columns
dummies1 = dummies1.drop(['female'], axis=1)
dummies1.head()

Unnamed: 0,male
0,1
1,1
2,1
3,0
4,1


In [56]:
dummies2 = pd.get_dummies(df2.Rank)
dummies2.head()

Unnamed: 0,assistant,associate,full
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1


In [57]:
dummies2 = dummies2.drop(['full'], axis=1)
dummies2.head()

Unnamed: 0,assistant,associate
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0


In [58]:
dummies3 = pd.get_dummies(df2.Degree)
dummies3.head()

Unnamed: 0,doctorate,masters
0,1,0
1,1,0
2,1,0
3,1,0
4,0,1


In [59]:
dummies3 = dummies3.drop(['masters'], axis=1)
dummies3.head()

Unnamed: 0,doctorate
0,1
1,1
2,1
3,1
4,0


In [60]:
# drop the 'Sex', 'Rank' and 'Degree' columns because they all hold string data
df2 = df2.drop(['Sex','Rank','Degree'], axis=1)
df2.head()

Unnamed: 0,Yrs. of service,Yrs. Since phd,Salary
0,25,35,36350
1,13,22,35350
2,10,23,28200
3,7,27,26775
4,19,30,33696


In [62]:
# concatenate the original and dummy dataframes column-wise
df2 = pd.concat([df2,dummies1,dummies2,dummies3], axis='columns')
df2.head()

Unnamed: 0,Yrs. of service,Yrs. Since phd,Salary,male,assistant,associate,doctorate
0,25,35,36350,1,0,0,1
1,13,22,35350,1,0,0,1
2,10,23,28200,1,0,0,1
3,7,27,26775,0,0,0,1
4,19,30,33696,1,0,0,0
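
The create, drop, and concatenate steps above can also be sketched as a single `pd.get_dummies` call. The rows below are made-up toy values, not the salary data. Note that `drop_first=True` drops the alphabetically first category of each column (`female`, `assistant`, `doctorate`), so the baseline categories differ from the manual drops used above.

```python
import pandas as pd

# Hypothetical toy rows with the same categorical columns as the notebook.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Rank": ["full", "associate", "assistant"],
    "Degree": ["doctorate", "masters", "doctorate"],
    "Salary": [30000, 25000, 20000],
})

# Encode the three categorical columns and drop one dummy per column in one step.
encoded = pd.get_dummies(
    toy, columns=["Sex", "Rank", "Degree"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```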


### 3) One Hot Encoding

In [64]:
df3 = df.copy()  # work on a copy so label encoding does not overwrite df

In [65]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [66]:
# LabelEncoder only converts the string categories of a column to integer codes
df3.Sex = le.fit_transform(df3.Sex)
df3.Rank = le.fit_transform(df3.Rank)
df3.Degree = le.fit_transform(df3.Degree)
df3.head()

Unnamed: 0,Sex,Rank,Yrs. of service,Degree,Yrs. Since phd,Salary
0,1,2,25,0,35,36350
1,1,2,13,0,22,35350
2,1,2,10,0,23,28200
3,0,2,7,0,27,26775
4,1,2,19,1,30,33696


In [68]:
# OneHotEncoder creates one dummy column per unique value in the selected column
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Caution: each fit_transform returns an array with the new dummy columns first and the
# remaining columns after them, so column positions shift between steps. As written, ct2
# re-encodes the second Sex dummy column, ct3 encodes Rank, and Degree stays label-encoded.
ct1 = ColumnTransformer([( 'Sex', OneHotEncoder(), [0] )], remainder = 'passthrough')
ct2 = ColumnTransformer([( 'Rank', OneHotEncoder(), [1] )], remainder = 'passthrough')
ct3 = ColumnTransformer([( 'Degree', OneHotEncoder(), [3] )], remainder = 'passthrough')
df3 = ct1.fit_transform(df3)
df3 = ct2.fit_transform(df3)
df3 = ct3.fit_transform(df3)

In [71]:
df3

array([[0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
        0.0000e+00, 2.5000e+01, 0.0000e+00, 3.5000e+01, 3.6350e+04],
       [0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
        0.0000e+00, 1.3000e+01, 0.0000e+00, 2.2000e+01, 3.5350e+04],
       [0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
        0.0000e+00, 1.0000e+01, 0.0000e+00, 2.3000e+01, 2.8200e+04],
       [0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00,
        1.0000e+00, 7.0000e+00, 0.0000e+00, 2.7000e+01, 2.6775e+04],
       [0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
        0.0000e+00, 1.9000e+01, 1.0000e+00, 3.0000e+01, 3.3696e+04],
       [0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
        0.0000e+00, 1.6000e+01, 0.0000e+00, 2.1000e+01, 2.8516e+04],
       [0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00,
        1.0000e+00, 0.0000e+00, 1.0000e+00, 3.2000e+01, 2.4900e+04],
       [0.0000e+00, 0.0000e+00, 1.0000e+0