# <img style="float: left; padding-right: 100px; width: 300px" src="../images/logo.png">AI4SG Bootcamp:


## Data Processing for Categorical Variables
**Authors:** Davis David

In [89]:
# import important modules 
import numpy as np
import pandas as pd
import math
from sklearn import preprocessing
%matplotlib inline

np.random.seed(7)  # fix the random seed so sampled rows are reproducible
import warnings
warnings.filterwarnings('ignore')  # suppress warning messages

### Load and process data

In [90]:
data = pd.read_csv("../data/students_exams_results.csv") 

In [91]:
# show the first five rows
data.head() 

| | continue_drop | student_id | gender | caste | mathematics_marks | english_marks | science_marks | science_teacher | languages_teacher | guardian | internet | school_id | total_students | total_toilets | establishment_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | continue | s17477 | M | BC | 0.393 | 0.338 | 0.393 | 2 | 2 | mother | 1 | 362 | 397 | 5.0 | 1950.0 |
| 1 | continue | s16612 | M | SC | 0.745 | 0.645 | 0.745 | 4 | 3 | father | 1 | 357 | 57 | 14.0 | 1929.0 |
| 2 | continue | s04010 | M | BC | 0.788 | 0.655 | 0.788 | 8 | 9 | father | 1 | 340 | 134 | 15.0 | 1976.0 |
| 3 | drop | s11124 | F | BC | 0.623 | 0.699 | 0.623 | 6 | 0 | father | 1 | 345 | 143 | 28.0 | 1879.0 |
| 4 | continue | s04384 | M | SC | 0.951 | 0.704 | 0.951 | 8 | 4 | mother | 1 | 304 | 390 | 28.0 | 1914.0 |


In [92]:
# show the list of columns
data.columns 

Index(['continue_drop', 'student_id', 'gender', 'caste', 'mathematics_marks',
       'english_marks', 'science_marks', 'science_teacher',
       'languages_teacher', 'guardian', 'internet', 'school_id',
       'total_students', 'total_toilets', 'establishment_year'],
      dtype='object')

In [93]:
# show data information  
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17190 entries, 0 to 17189
Data columns (total 15 columns):
continue_drop         17190 non-null object
student_id            17190 non-null object
gender                17190 non-null object
caste                 17190 non-null object
mathematics_marks     17190 non-null float64
english_marks         17190 non-null float64
science_marks         17190 non-null float64
science_teacher       17190 non-null int64
languages_teacher     17190 non-null int64
guardian              17190 non-null object
internet              17190 non-null int64
school_id             17190 non-null int64
total_students        17190 non-null int64
total_toilets         16881 non-null float64
establishment_year    16881 non-null float64
dtypes: float64(5), int64(5), object(5)
memory usage: 2.0+ MB


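Five columns have the `object` dtype (`continue_drop`, `student_id`, `gender`, `caste` and `guardian`). They hold text values and need to be encoded as numbers before most models can use them.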
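The non-numeric columns can also be listed programmatically. The sketch below uses a small made-up frame (`toy` and `categorical_cols` are illustrative names, not part of the bootcamp data) and selects every column that is not numeric:

```python
import pandas as pd

# Toy stand-in for the student data (values are made up for illustration)
toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "caste": ["BC", "SC", "BC"],
    "mathematics_marks": [0.393, 0.745, 0.788],
})

# Keep only the non-numeric columns: these are the categorical candidates
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)

# Number of distinct categories in each of those columns
print(toy[categorical_cols].nunique())
```

Checking the number of distinct categories early is useful because it determines how many dummy columns an encoding step will create.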

In [94]:
# check the shape of the data
data.shape 

(17190, 15)

In [95]:
## Check for missing values 
data.isnull().sum() 

continue_drop           0
student_id              0
gender                  0
caste                   0
mathematics_marks       0
english_marks           0
science_marks           0
science_teacher         0
languages_teacher       0
guardian                0
internet                0
school_id               0
total_students          0
total_toilets         309
establishment_year    309
dtype: int64
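To express these counts as a share of the data, one option is `isnull().mean()`. A minimal sketch on a made-up frame (`toy` and `missing_pct` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value out of four rows (made-up values)
toy = pd.DataFrame({
    "total_students": [397, 57, 134, 143],
    "total_toilets": [5.0, np.nan, 15.0, 28.0],
})

# isnull() gives True/False per cell; the column mean is the missing fraction
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))
```

For the real data the same calculation gives 309 / 17190, or roughly 1.8%, for each of the two affected columns.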

About 1.8% of the rows (309 of 17,190) are missing values in the `total_toilets` and `establishment_year` columns. Let's ignore these two columns. We will also delete the `student_id` and `school_id` columns, which are identifiers. All four columns could be dropped with a single line:

```python
data.drop(['total_toilets','establishment_year', 'student_id', 'school_id'], axis=1, inplace=True)
```

The cells below do the same in two steps.

In [96]:
data.drop(['total_toilets','establishment_year'], axis=1, inplace=True)  

In [97]:
# check the data again with a random sample of five rows
data.sample(5) 

| | continue_drop | student_id | gender | caste | mathematics_marks | english_marks | science_marks | science_teacher | languages_teacher | guardian | internet | school_id | total_students |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12788 | continue | s09804 | M | OC | 0.214 | 0.684 | 0.214 | 4 | 3 | mixed | 1 | 369 | 387 |
| 509 | continue | s15046 | F | SC | 0.332 | 0.289 | 0.332 | 5 | 10 | mother | 1 | 323 | 344 |
| 1222 | continue | s14142 | F | OC | 0.48 | 0.457 | 0.48 | 2 | 9 | mother | 1 | 359 | 305 |
| 9545 | continue | s17262 | F | BC | 0.39 | 0.511 | 0.39 | 4 | 10 | mother | 1 | 390 | 140 |
| 3528 | continue | s00039 | F | ST | 0.563 | 0.626 | 0.563 | 2 | 6 | mother | 1 | 397 | 221 |


Drop the `school_id` and `student_id` columns.

In [98]:
data.drop(['student_id','school_id'], axis=1, inplace=True)   

In [99]:
# check shape again
data.shape  

(17190, 11)

In [100]:
#split features and target variable 
target = data['continue_drop']
features = data.drop(['continue_drop'], axis=1)  

## Get Dummies

The pandas `get_dummies` function is a straightforward, one-step way to create dummy variables for categorical features. Its advantage is that you can apply it directly to a DataFrame: by default it encodes the columns that hold text or `category` data and leaves the numeric columns unchanged.

Let's create dummy variables (columns of 1s and 0s) for the categorical features using the **get_dummies()** function from pandas.

In [101]:
#create dummy for categorical features 
data_with_dummies = pd.get_dummies(features)

In [102]:
data_with_dummies.head() 

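| | mathematics_marks | english_marks | science_marks | science_teacher | languages_teacher | internet | total_students | gender_F | gender_M | caste_BC | caste_OC | caste_SC | caste_ST | guardian_father | guardian_mixed | guardian_mother | guardian_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.393 | 0.338 | 0.393 | 2 | 2 | 1 | 397 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0.745 | 0.645 | 0.745 | 4 | 3 | 1 | 57 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0.788 | 0.655 | 0.788 | 8 | 9 | 1 | 134 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0.623 | 0.699 | 0.623 | 6 | 0 | 1 | 143 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0.951 | 0.704 | 0.951 | 8 | 4 | 1 | 390 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |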


In [103]:
data_with_dummies.shape

(17190, 17)
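The 17 columns are the 7 numeric features plus 2 dummies for `gender`, 4 for `caste` and 4 for `guardian`. The same mechanics can be seen on a small made-up frame. This is only a sketch: `toy` and `toy_dummies` are illustrative names, and passing `dtype=int` is an assumption made here because recent pandas versions return `True`/`False` columns by default rather than the 0/1 values shown above.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (made-up values)
toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "guardian": ["mother", "father", "mother"],
    "internet": [1, 1, 0],
})

# columns= names the columns to encode explicitly;
# dtype=int forces 0/1 output instead of booleans
toy_dummies = pd.get_dummies(toy, columns=["gender", "guardian"], dtype=int)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```

Each categorical column is replaced by one indicator column per category, named `<column>_<category>`, while `internet` is passed through untouched.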

## Label Encoder

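LabelEncoder converts each class of a given feature to an integer between 0 and `n_classes - 1`, assigned in sorted order of the class labels. Note that scikit-learn documents it as intended for target labels (`y`); here it is also applied to feature columns.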

In [104]:
# convert object types into integer types 
le = preprocessing.LabelEncoder()
data['gender'] = le.fit_transform(data['gender'])
data['guardian'] = le.fit_transform(data['guardian'])
data['caste']  = le.fit_transform(data['caste'])
data['continue_drop']= le.fit_transform( data['continue_drop']) 

In [105]:
# let's look at the first rows of the data again
data.head() 

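| | continue_drop | gender | caste | mathematics_marks | english_marks | science_marks | science_teacher | languages_teacher | guardian | internet | total_students |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0.393 | 0.338 | 0.393 | 2 | 2 | 2 | 1 | 397 |
| 1 | 0 | 1 | 2 | 0.745 | 0.645 | 0.745 | 4 | 3 | 0 | 1 | 57 |
| 2 | 0 | 1 | 0 | 0.788 | 0.655 | 0.788 | 8 | 9 | 0 | 1 | 134 |
| 3 | 1 | 0 | 0 | 0.623 | 0.699 | 0.623 | 6 | 0 | 0 | 1 | 143 |
| 4 | 0 | 1 | 2 | 0.951 | 0.704 | 0.951 | 8 | 4 | 2 | 1 | 390 |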


NB: In the `continue_drop` column, 0 represents continue and 1 represents drop.

Label encoding has the advantage of being straightforward, but the disadvantage that algorithms can misinterpret the numeric values as ordered. For example, `caste` is now encoded as 0 to 3, and 0 is obviously less than 3, but does that ordering really correspond to anything in the data set in real life?
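To see which integer stands for which class, the fitted encoder's `classes_` attribute can be inspected. A minimal sketch on a made-up list of labels (`caste`, `le_caste`, `codes`, `mapping` and `restored` are illustrative names); it uses a dedicated encoder for the column so the mapping stays available afterwards:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of caste labels
caste = ["BC", "SC", "BC", "OC", "ST"]

# One encoder per column keeps its fitted mapping for later use
le_caste = LabelEncoder()
codes = le_caste.fit_transform(caste)

# classes_ is sorted; the position of each class is its integer code
mapping = {str(label): code for code, label in enumerate(le_caste.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = [str(label) for label in le_caste.inverse_transform(codes)]
print(restored)
```

The codes follow alphabetical order of the labels, which illustrates why their numeric order carries no real meaning.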

## OneHotEncoder

After label encoding, we might confuse our model into thinking that a column has some kind of order or hierarchy when it clearly does not. To avoid this, we one-hot encode that column.

One-hot encoding takes a column of categorical data and splits it into multiple columns, one per category. Each new column holds 1 where the row belongs to that category and 0 otherwise. Current versions of scikit-learn's `OneHotEncoder` accept string categories directly, so label encoding first is not required.


In [108]:
# import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

# instantiate OneHotEncoder
# sparse=False returns a dense NumPy array instead of a sparse matrix
# (scikit-learn 1.2 renamed this argument to sparse_output; on recent
# versions use OneHotEncoder(sparse_output=False) instead)
ohe = OneHotEncoder(sparse=False)


In [109]:
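# apply OneHotEncoder to the feature columns
# note: `features` still includes the numeric columns, so every distinct
# value in every column gets its own indicator column
X_ohe = ohe.fit_transform(features)  # returns a NumPy array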

In [112]:
X_ohe

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])
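In practice one would often restrict the encoder to the categorical columns only, so that continuous columns such as the marks are not expanded into one indicator per distinct value. A minimal sketch under stated assumptions: the `toy` frame is made up, `categorical_cols`, `ohe_toy` and `encoded` are illustrative names, and calling `.toarray()` on the default sparse output is a choice made here to avoid depending on the `sparse` / `sparse_output` argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame: two categorical columns and one continuous column (made-up values)
toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "caste": ["BC", "SC", "BC"],
    "mathematics_marks": [0.393, 0.745, 0.788],
})

# Encode only the categorical columns
categorical_cols = ["gender", "caste"]

ohe_toy = OneHotEncoder()  # default output is a SciPy sparse matrix
encoded = ohe_toy.fit_transform(toy[categorical_cols]).toarray()

# categories_ lists the categories found in each input column, in column order
print(ohe_toy.categories_)
print(encoded)
```

Here the output has four columns (two for `gender`, two for `caste`), and `mathematics_marks` stays available in `toy` as a numeric feature.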

Encoding categorical variables is an important step in the data science process. Because there are multiple approaches to encoding variables, it is important to understand the options and how to apply them to your own data sets. The Python data science ecosystem offers many helpful tools for handling these problems. I encourage you to keep these ideas in mind the next time you find yourself analyzing categorical variables.