# Handling Categorical Variables

In [127]:
import pandas as pd
import numpy as np

## Use <font color='FF5733'>*OrdinalEncoder*</font> to transform **ordinal** categorical values

In [129]:
X = pd.DataFrame(
    np.array(['M', 'O-', 'medium',
             'M', 'O-', 'high',
              'F', 'O+', 'high',
              'F', 'AB', 'low',
              'F', 'B+', 'medium'])
              .reshape((5,3)))
X.columns = ['sex', 'blood_type', 'edu_level']
X.head()

Unnamed: 0,sex,blood_type,edu_level
0,M,O-,medium
1,M,O-,high
2,F,O+,high
3,F,AB,low
4,F,B+,medium


In [130]:
# Transform an ordinal categorical column to numerical by using OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
ordEncoder = OrdinalEncoder(categories = [['low', 'medium', 'high']]) 

print(X.edu_level.values,X.edu_level.values.shape )

X['edu_level'] = ordEncoder.fit_transform(X[['edu_level']])
X.head()

['medium' 'high' 'high' 'low' 'medium'] (5,)


Unnamed: 0,sex,blood_type,edu_level
0,M,O-,1.0
1,M,O-,2.0
2,F,O+,2.0
3,F,AB,0.0
4,F,B+,1.0
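
The explicit `categories` list is what makes the codes follow the intended order. As a minimal sketch (the names `levels`, `default_encoder`, and `explicit_encoder` are illustrative and not part of the notebook), the following compares the default behaviour, which sorts string categories alphabetically, with the explicit ordering used above:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Same education levels as in X, as a single-column 2D array (illustrative data).
levels = np.array([['medium'], ['high'], ['high'], ['low'], ['medium']])

# Default: categories are inferred and sorted alphabetically (high, low, medium).
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(levels).ravel()

# Explicit: codes follow the order we supply (low < medium < high).
explicit_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
explicit_codes = explicit_encoder.fit_transform(levels).ravel()

print(default_encoder.categories_[0], default_codes)
print(explicit_encoder.categories_[0], explicit_codes)
```

With the default ordering, `'high'` is encoded as 0 and `'medium'` as 2, which does not reflect the real ranking, so passing the order explicitly is usually the safer choice for ordinal data.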


In [131]:
X.edu_level.dtype
X['edu_level'].dtype

dtype('float64')

### <font color = blue> Change a column's type </font>

In [133]:
#Use Series.astype to convert the column to int
X['edu_level'] = X['edu_level'].astype(int)
#X = X.astype({'edu_level':int})  #Alternative
X

Unnamed: 0,sex,blood_type,edu_level
0,M,O-,1
1,M,O-,2
2,F,O+,2
3,F,AB,0
4,F,B+,1
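
As an alternative to converting afterwards with `astype`, `OrdinalEncoder` accepts a `dtype` argument. The sketch below assumes there are no missing values in the column (integer output cannot represent the default NaN code for missing values); `demo` and `int_encoder` are illustrative names, not part of the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative copy of the edu_level column.
demo = pd.DataFrame({'edu_level': ['medium', 'high', 'high', 'low', 'medium']})

# Ask the encoder for integer output directly instead of the default float64.
int_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']], dtype=np.int64)

# fit_transform returns shape (5, 1); ravel() flattens it to one column.
demo['edu_level'] = int_encoder.fit_transform(demo[['edu_level']]).ravel()

print(demo['edu_level'].dtype)
print(demo)
```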


### <font color = Red> Practice </font> <br>
Given the dataframe `testDF` in the following cell,

**1. Apply `OrdinalEncoder` to transform the `Class` column to numerical values.**
- Make a copy of `testDF` and save it to `newDF`.
- Drop the `'Shape'` column.
- Create an instance of `OrdinalEncoder()` for the `'Class'` column.
- Transform the `'Class'` column to numeric using the ordinal encoder created above.
- Use `Series.astype()` to convert the `'Class'` column to int type.
- Show `newDF` to verify. It should look like the following:

<img src="newDF_Step1.png" align="middle" style="width:130px;height:110px;"/>

In [135]:
d = {'Shape': ['square','square','oval', 'circle'], 
     'Class':['third','first','second','third'],
     'Size':['S', 'S', 'L', 'M']}
testDF = pd.DataFrame(d, columns = ['Shape', 'Class', 'Size'])
testDF

Unnamed: 0,Shape,Class,Size
0,square,third,S
1,square,first,S
2,oval,second,L
3,circle,third,M


In [136]:
# Work on a copy so that testDF stays unchanged
newDF = testDF.copy()
print(newDF)
# Drop the 'Shape' column in place
newDF.drop(columns=['Shape'], inplace=True)

    Shape   Class Size
0  square   third    S
1  square   first    S
2    oval  second    L
3  circle   third    M


In [137]:
# Encode 'Class' with an explicit order: first < second < third
encoder = OrdinalEncoder(categories=[['first', 'second', 'third']])
newDF['Class'] = encoder.fit_transform(newDF[['Class']])
print(newDF)

   Class Size
0    2.0    S
1    0.0    S
2    1.0    L
3    2.0    M


In [138]:
# Encode 'Size' with an explicit order: S < M < L
encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
newDF['Size'] = encoder.fit_transform(newDF[['Size']])
print(newDF)

   Class  Size
0    2.0   0.0
1    0.0   0.0
2    1.0   2.0
3    2.0   1.0


In [139]:
newDF['Class'] = newDF['Class'].astype(int)
newDF['Size'] = newDF['Size'].astype(int)
print(newDF)

   Class  Size
0      2     0
1      0     0
2      1     2
3      2     1
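
The two encoders above can also be combined. As a sketch under the assumption that both columns are encoded at once (the names `demo`, `both_encoder`, and `encoded_df` are illustrative), a single `OrdinalEncoder` takes one category list per column, in the same order as the columns passed to `fit_transform`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative copy of the 'Class' and 'Size' columns of testDF.
demo = pd.DataFrame({'Class': ['third', 'first', 'second', 'third'],
                     'Size': ['S', 'S', 'L', 'M']})

# One category list per column, matching the column order below.
both_encoder = OrdinalEncoder(categories=[['first', 'second', 'third'],
                                          ['S', 'M', 'L']])
encoded = both_encoder.fit_transform(demo[['Class', 'Size']])

# Build a new integer dataframe that keeps the original index and column names.
encoded_df = pd.DataFrame(encoded.astype(int),
                          columns=['Class', 'Size'],
                          index=demo.index)
print(encoded_df)
```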


## Use <font color='FF5733'> *get_dummies* </font> to transform **nominal** categorical values 

In [141]:
X

Unnamed: 0,sex,blood_type,edu_level
0,M,O-,1
1,M,O-,2
2,F,O+,2
3,F,AB,0
4,F,B+,1


In [142]:
X_sex_bloodtype = X[['sex','blood_type']]

In [143]:
nominals_new = pd.get_dummies(X_sex_bloodtype)
nominals_new = nominals_new.astype(int)
nominals_new

Unnamed: 0,sex_F,sex_M,blood_type_AB,blood_type_B+,blood_type_O+,blood_type_O-
0,0,1,0,0,0,1
1,0,1,0,0,0,1
2,1,0,0,0,1,0
3,1,0,1,0,0,0
4,1,0,0,1,0,0


In [144]:
nominals_new.columns = ['F', 'M', 'AB', 'B+', 'O+', 'O-']
nominals_new

Unnamed: 0,F,M,AB,B+,O+,O-
0,0,1,0,0,0,1
1,0,1,0,0,0,1
2,1,0,0,0,1,0
3,1,0,1,0,0,0
4,1,0,0,1,0,0


### **Dummy Variable Trap** (Perfect Collinearity)
When you convert a categorical variable into dummy variables with `get_dummies()`, every category gets its own column by default (`drop_first=False`).\
For linear models, `drop_first` is commonly set to `True`. This drops the first dummy column of each categorical variable to avoid the dummy variable trap.

🧨 What is the Dummy Variable Trap?
The dummy variable trap occurs when you include all categories of a categorical variable as separate binary (dummy) columns in a regression model, leading to perfect **multicollinearity**.


🔁 **Multicollinearity** means one feature can be predicted exactly from one or more other features. This breaks the requirement that each variable have its own independent explanatory effect, which confuses models like linear regression. Tree-based models are generally not affected by multicollinearity in their predictions.
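
The redundancy can be checked numerically. The sketch below is an illustration rather than part of the original notebook: it assumes a regression with an intercept column of ones, and `blood`, `with_intercept`, and the rank variables are hypothetical names. Because every row has exactly one dummy set to 1, the full set of dummy columns sums to the intercept column, so the design matrix loses a rank.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the blood_type column.
blood = pd.DataFrame({'blood_type': ['O-', 'O-', 'O+', 'AB', 'B+']})

full = pd.get_dummies(blood, dtype=int)                      # all 4 categories
reduced = pd.get_dummies(blood, drop_first=True, dtype=int)  # first category dropped

# Each row has exactly one 1, so the dummy columns always sum to 1.
row_sums = full.sum(axis=1)

def with_intercept(dummies):
    """Prepend a column of ones, as a linear model with an intercept would."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy()])

full_design = with_intercept(full)        # 5 columns
reduced_design = with_intercept(reduced)  # 4 columns

rank_full = np.linalg.matrix_rank(full_design)
rank_reduced = np.linalg.matrix_rank(reduced_design)

print(full_design.shape[1], rank_full)        # more columns than rank
print(reduced_design.shape[1], rank_reduced)  # columns equal rank
```

When the rank is lower than the number of columns, the regression coefficients are not uniquely determined. Dropping one dummy per variable restores full column rank.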

In [146]:
# drop_first=True drops the first dummy column of each categorical variable
# to avoid the dummy variable trap: the dropped column is fully determined by the
# remaining ones (it equals 1 minus their sum), so no information is lost.
nominals_new = pd.get_dummies(X_sex_bloodtype, drop_first = True)
nominals_new = nominals_new.astype(int)
nominals_new

Unnamed: 0,sex_M,blood_type_B+,blood_type_O+,blood_type_O-
0,1,0,0,1
1,1,0,0,1
2,0,0,1,0
3,0,0,0,0
4,0,1,0,0


In [147]:
nominals_new.columns = ['M', 'B+', 'O+', 'O-']
nominals_new

Unnamed: 0,M,B+,O+,O-
0,1,0,0,1
1,1,0,0,1
2,0,0,1,0
3,0,0,0,0
4,0,1,0,0


In [148]:
df_concat = pd.concat([nominals_new,X[['edu_level']]], axis = 1)
df_concat

Unnamed: 0,M,B+,O+,O-,edu_level
0,1,0,0,1,1
1,1,0,0,1,2
2,0,0,1,0,2
3,0,0,0,0,0
4,0,1,0,0,1


### <font color = Red> Practice </font> <br>
Given the dataframe `testDF` in the following cell,

**1. Apply `OrdinalEncoder` to transform the `Class` column to numerical values.**
- Make a copy of `testDF` and save it to `newDF`.
- Drop the `'Shape'` column.
- Create an instance of `OrdinalEncoder()` for the `'Class'` column.
- Transform the `'Class'` column to numeric using the ordinal encoder created above.
- Use `Series.astype()` to convert the `'Class'` column to int type.
- Show `newDF` to verify. It should look like the following:

<img src="newDF_Step1.png" align="middle" style="width:130px;height:110px;"/>


**2. Apply `OrdinalEncoder` to transform the `'Size'` column to numerical values, following a similar procedure as above.**

**3. Apply `get_dummies()` to transform the `Shape` column of `testDF` into three columns, as shown below:**
- Apply `get_dummies()` to the `Shape` column of `testDF` and save the result in a new dataframe `shapeColumnDF`.
- Apply `DataFrame.astype()` to convert the values from True/False to int.
- Use `pandas.concat()` to concatenate the `newDF` created in step 1 with `shapeColumnDF`, and save the result in `newDF`.

<img src="Encoding.png" align="middle" style="width:250px;height:120px;"/>

- Add `drop_first = True` to the `get_dummies()` call and check the result.

In [150]:
d = {'Shape': ['square','square','oval', 'circle'], 
     'Class':['third','first','second','third'],
     'Size':['S', 'S', 'L', 'M']}
testDF = pd.DataFrame(d, columns = ['Shape', 'Class', 'Size'])
testDF

Unnamed: 0,Shape,Class,Size
0,square,third,S
1,square,first,S
2,oval,second,L
3,circle,third,M


In [151]:
# One-hot encode 'Shape', then convert the True/False values to 0/1
ShapeColumnDF = pd.get_dummies(testDF['Shape'])
ShapeColumnDF = ShapeColumnDF.astype(int)
ShapeColumnDF

Unnamed: 0,circle,oval,square
0,0,0,1
1,0,0,1
2,0,1,0
3,1,0,0


In [152]:
newDF=pd.concat([ShapeColumnDF,newDF],axis=1)
newDF

Unnamed: 0,circle,oval,square,Class,Size
0,0,0,1,2,0
1,0,0,1,0,0
2,0,1,0,1,2
3,1,0,0,2,1
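
The last bullet of the exercise asks what happens with `drop_first = True`. A minimal sketch, using an illustrative `shapes` series rather than `testDF` itself, and the `dtype=int` argument of `get_dummies()` in place of a separate `astype` call:

```python
import pandas as pd

# Illustrative copy of the 'Shape' column.
shapes = pd.Series(['square', 'square', 'oval', 'circle'], name='Shape')

# Categories are sorted (circle, oval, square), so 'circle' is the one dropped.
shape_dummies = pd.get_dummies(shapes, drop_first=True, dtype=int)
print(shape_dummies)
```

The row for `'circle'` is now all zeros: it is identified by the absence of the other two categories.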


## Use <font color='FF5733'>*OneHotEncoder*</font> to transform **nominal** categorical values

In [154]:
X.head(3)

Unnamed: 0,sex,blood_type,edu_level
0,M,O-,1
1,M,O-,2
2,F,O+,2


In [155]:
from sklearn.preprocessing import OneHotEncoder

# Transform the categorical columns 'sex' and 'blood_type' to numerical by using OneHotEncoder.
# Note: the `sparse` keyword was renamed to `sparse_output` in scikit-learn 1.2 and
# removed in 1.4, so passing `sparse=True` raises a TypeError on recent versions.
onehot_encoder = OneHotEncoder(dtype=np.int32, sparse_output=True)
data = onehot_encoder.fit_transform(X[['sex', 'blood_type']]).toarray()

# Create a new data frame based on the transformed results
nominals = pd.DataFrame(
    data,
    columns=['F', 'M', 'AB', 'B+', 'O+', 'O-'])

# Add a new column 'edu_level', which is the transformed result of the ordinal column 'edu_level'
nominals['edu_level'] = X['edu_level']
nominals.head()