<a href="https://colab.research.google.com/github/AamirKhaan/Student-Academic-Performance/blob/master/06_Data_Encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="background-color:Aqua; padding:20px; border-radius:10px">Data Encoding</h1>

## Overview      


Most machine learning models require all input and output variables to be numeric.
If a dataset contains categorical variables, they must be encoded as numbers before a model can be fitted and evaluated.
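
As a minimal sketch of what encoding means, the snippet below maps a small, made-up categorical column to integer codes with pandas. The column name and its values are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
toy = pd.Series(['Good', 'Bad', 'Good', 'Bad'], name='satisfaction')

# Converting to the 'category' dtype assigns each distinct label an integer code.
# Categories are sorted, so 'Bad' becomes 0 and 'Good' becomes 1.
toy_cat = toy.astype('category')
codes = toy_cat.cat.codes

print(list(toy_cat.cat.categories))  # ['Bad', 'Good']
print(codes.tolist())                # [1, 0, 1, 0]
```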

<div style="background-color:Gainsboro; padding:20px; text-align:justify; font-weight: bold">
    <p>In this section we will </p>
    <ol>
        <li>Import Modified Data</li>
        <li>Encode Categorical Input Features</li>
        <li>Encode Target Output</li>
        <li>Combine Encoded Data</li>
        <li>Save Encoded Data</li>
    </ol>
</div>


### Standard Imports

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder

In [2]:
# Jupyter Notebook configuration (personal preferences)
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Disable pretty
# %pprint
%matplotlib inline

### 1. Import Modified Data

In [3]:
# Import the students' data from local storage
# Do not run this cell in Colab
data_df = pd.read_csv('./data/xAPI-Edu-Data_modified.csv')

In [4]:
# Import the students' data from GitHub
# Run this cell only in Colab
url = 'https://raw.githubusercontent.com/AamirKhaan/Student-Academic-Performance/main/data/xAPI-Edu-Data_modified.csv'
data_df = pd.read_csv(url)
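
The two loading cells above could be folded into one that picks the source automatically. This is only a sketch: it assumes Colab can be detected by the presence of the `google.colab` module, and the helper name `resolve_data_source` is hypothetical.

```python
import sys

LOCAL_PATH = './data/xAPI-Edu-Data_modified.csv'
REMOTE_URL = ('https://raw.githubusercontent.com/AamirKhaan/'
              'Student-Academic-Performance/main/data/xAPI-Edu-Data_modified.csv')


def resolve_data_source(modules=None):
    """Return the remote URL when running in Colab, else the local path."""
    modules = sys.modules if modules is None else modules
    # Assumption: the 'google.colab' module is only loaded inside Colab.
    return REMOTE_URL if 'google.colab' in modules else LOCAL_PATH


source = resolve_data_source()
print(source)
# data_df = pd.read_csv(source)  # then load as in the cells above
```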

In [5]:
data_df.head()

Unnamed: 0,gender,nationality,place_of_birth,stage_id,grade_id,section_id,topic,semester,relation,raised_hands,visited_resources,announcements_view,discussion,parent_answering_survey,parent_school_satisfaction,student_absence_days,class
0,M,Kuwait,Kuwait,LowerLevel,Lower,A,IT,F,Father,15,16,2,20,Yes,Good,Under-7,M
1,M,Kuwait,Kuwait,LowerLevel,Lower,A,IT,F,Father,20,20,3,25,Yes,Good,Under-7,M
2,M,Kuwait,Kuwait,LowerLevel,Lower,A,IT,F,Father,10,7,0,30,No,Bad,Above-7,L
3,M,Kuwait,Kuwait,LowerLevel,Lower,A,IT,F,Father,30,25,5,35,No,Bad,Above-7,L
4,M,Kuwait,Kuwait,LowerLevel,Lower,A,IT,F,Father,40,50,12,50,No,Bad,Above-7,M


In [6]:
CATEGORICAL_FEATURES = ['gender', 'nationality','place_of_birth', 'stage_id', 'grade_id', 'section_id', 'topic', 
                        'semester', 'relation', 'parent_answering_survey', 'parent_school_satisfaction', 
                        'student_absence_days']

NUMERICAL_FEATURES = ['raised_hands', 'visited_resources', 'announcements_view', 'discussion']

TARGET = ['class']

In [7]:
# Optimize data types for efficient memory utilization
for feature in CATEGORICAL_FEATURES:
    data_df[feature] = data_df[feature].astype('category')
    
for feature in NUMERICAL_FEATURES:
    data_df[feature] = data_df[feature].astype('int8')

data_df[TARGET] = data_df[TARGET].astype('category')
data_df.dtypes

gender                        category
nationality                   category
place_of_birth                category
stage_id                      category
grade_id                      category
section_id                    category
topic                         category
semester                      category
relation                      category
raised_hands                      int8
visited_resources                 int8
announcements_view                int8
discussion                        int8
parent_answering_survey       category
parent_school_satisfaction    category
student_absence_days          category
class                         category
dtype: object
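
To see the effect of the dtype conversion, and to guard the `int8` cast, something like the following could be used. It is a sketch on a made-up frame: the two columns reuse feature names from this notebook, but the values are invented. `int8` holds only values from -128 to 127, so the range check matters before down-casting.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy_df = pd.DataFrame({
    'gender': ['M', 'F'] * 500,
    'raised_hands': list(range(0, 100)) * 10,
})
toy_df['raised_hands'] = toy_df['raised_hands'].astype('int64')

before = toy_df.memory_usage(deep=True).sum()

# int8 stores values in [-128, 127]; check the range before down-casting.
limits = np.iinfo('int8')
assert toy_df['raised_hands'].between(limits.min, limits.max).all()

optimized = toy_df.copy()
optimized['gender'] = optimized['gender'].astype('category')
optimized['raised_hands'] = optimized['raised_hands'].astype('int8')

after = optimized.memory_usage(deep=True).sum()
print(f'before: {before} bytes, after: {after} bytes')
```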

### 2. Encode Categorical Input Features

In [8]:
categorical_encoder = OrdinalEncoder(dtype='int8')
categorical_ds  = categorical_encoder.fit_transform(data_df[CATEGORICAL_FEATURES])
categorical_encoder.categories_

[array(['F', 'M'], dtype=object),
 array(['Jordan', 'Kuwait', 'Others'], dtype=object),
 array(['Jordan', 'Kuwait', 'Others'], dtype=object),
 array(['HighSchool', 'LowerLevel', 'MiddleSchool'], dtype=object),
 array(['Higher', 'Lower'], dtype=object),
 array(['A', 'B', 'C'], dtype=object),
 array(['Humanities', 'IT', 'Language', 'Sciences'], dtype=object),
 array(['F', 'S'], dtype=object),
 array(['Father', 'Mum'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['Bad', 'Good'], dtype=object),
 array(['Above-7', 'Under-7'], dtype=object)]
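
The position of each label inside `categories_` is the integer code it receives. A small sketch on made-up data (the toy columns reuse two feature names from this notebook, but the rows are invented) shows the mapping and how `inverse_transform` recovers the original labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy rows, for illustration only.
toy_df = pd.DataFrame({
    'gender': ['M', 'F', 'M'],
    'section_id': ['A', 'C', 'B'],
})

encoder = OrdinalEncoder(dtype='int8')
codes = encoder.fit_transform(toy_df)

# Build a {label: code} lookup per column from the learned categories.
lookup = {
    column: {str(label): code for code, label in enumerate(categories)}
    for column, categories in zip(toy_df.columns, encoder.categories_)
}
print(lookup)

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())
```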

In [9]:
categorical_ds

array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)

In [10]:
categorical_encoded = pd.DataFrame(categorical_ds, columns=CATEGORICAL_FEATURES, dtype='int8')
categorical_encoded.head()

Unnamed: 0,gender,nationality,place_of_birth,stage_id,grade_id,section_id,topic,semester,relation,parent_answering_survey,parent_school_satisfaction,student_absence_days
0,1,1,1,1,1,0,1,0,0,1,1,1
1,1,1,1,1,1,0,1,0,0,1,1,1
2,1,1,1,1,1,0,1,0,0,0,0,0
3,1,1,1,1,1,0,1,0,0,0,0,0
4,1,1,1,1,1,0,1,0,0,0,0,0


### 3. Encode Target Output

In [11]:
target_encoder = LabelEncoder()
# LabelEncoder expects a 1-D array, so pass the target column as a Series
target_ds = target_encoder.fit_transform(data_df[TARGET[0]])
target_encoder.classes_

array(['H', 'L', 'M'], dtype=object)
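
`LabelEncoder` expects a one-dimensional array of labels and assigns codes in sorted order, which is why `classes_` above reads `H`, `L`, `M`. The sketch below builds the label-to-code mapping explicitly on a few made-up labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the three classes match those in this notebook.
labels = ['M', 'M', 'L', 'H', 'L']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # 1-D input, as LabelEncoder expects

# Pair each class with the integer code assigned to it.
mapping = {
    str(label): int(code)
    for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))
}
print(mapping)           # {'H': 0, 'L': 1, 'M': 2}
print(encoded.tolist())  # [2, 2, 1, 0, 1]
```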

In [12]:
target_ds

array([2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 0, 2, 1, 1, 0, 2, 2, 2, 2, 0, 2, 2,
       2, 1, 1, 1, 2, 1, 2, 2, 0, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2,
       2, 1, 1, 0, 0, 2, 1, 1, 2, 0, 1, 1, 1, 1, 2, 2, 1, 2, 0, 2, 1, 1,
       2, 0, 0, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 0, 1, 1, 1, 2, 0, 1, 0, 1,
       1, 1, 1, 0, 0, 0, 1, 0, 0, 2, 2, 2, 2, 0, 1, 1, 2, 1, 2, 0, 2, 2,
       0, 2, 1, 1, 1, 1, 2, 0, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2,
       1, 1, 0, 0, 0, 2, 0, 2, 1, 1, 2, 0, 1, 2, 0, 2, 2, 0, 0, 2, 0, 1,
       2, 0, 2, 2, 1, 2, 0, 2, 0, 2, 2, 0, 2, 0, 0, 2, 0, 2, 1, 1, 2, 1,
       0, 2, 0, 2, 0, 1, 0, 2, 1, 0, 2, 2, 0, 2, 1, 1, 2, 2, 2, 2, 0, 0,
       1, 2, 0, 0, 2, 2, 1, 0, 2, 2, 2, 2, 0, 2, 0, 1, 1, 1, 2, 2, 0, 2,
       2, 2, 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 0, 0, 2,
       1, 2, 0, 2, 0, 2, 1, 2, 0, 1, 2, 1, 0, 0, 0, 2, 2, 1, 1, 2, 2, 2,
       2, 0, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 2,
       0, 0, 2, 2, 1, 1, 0, 0, 2, 2, 0, 0, 2, 2, 1,

In [13]:
target_encoded = pd.DataFrame(target_ds, columns=TARGET, dtype='int8')
target_encoded.head()

Unnamed: 0,class
0,2
1,2
2,1
3,1
4,2


### 4. Combine Encoded Data

In [14]:
encoded_df = pd.concat([categorical_encoded,data_df[NUMERICAL_FEATURES],target_encoded], axis=1)
encoded_df.head()

Unnamed: 0,gender,nationality,place_of_birth,stage_id,grade_id,section_id,topic,semester,relation,parent_answering_survey,parent_school_satisfaction,student_absence_days,raised_hands,visited_resources,announcements_view,discussion,class
0,1,1,1,1,1,0,1,0,0,1,1,1,15,16,2,20,2
1,1,1,1,1,1,0,1,0,0,1,1,1,20,20,3,25,2
2,1,1,1,1,1,0,1,0,0,0,0,0,10,7,0,30,1
3,1,1,1,1,1,0,1,0,0,0,0,0,30,25,5,35,1
4,1,1,1,1,1,0,1,0,0,0,0,0,40,50,12,50,2
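
`pd.concat(..., axis=1)` aligns rows on the index, so the three pieces must share the same index. Here they presumably all carry the default integer index, which is why the rows line up. A sketch of a sanity check on made-up frames, including what happens when the index is shifted:

```python
import pandas as pd

# Hypothetical toy pieces standing in for the encoded, numerical and target frames.
left = pd.DataFrame({'gender': [1, 0, 1]})
middle = pd.DataFrame({'raised_hands': [15, 20, 10]})
right = pd.DataFrame({'class': [2, 2, 1]})

# All pieces must share one index, otherwise concat introduces NaN values.
assert left.index.equals(middle.index) and left.index.equals(right.index)

combined = pd.concat([left, middle, right], axis=1)

# A misaligned index (here shifted by one) yields an extra row and missing values.
shifted = right.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([left, shifted], axis=1)

print(combined.shape, misaligned.shape)    # (3, 3) (4, 2)
print(int(misaligned.isna().sum().sum()))  # 2
```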


In [15]:
encoded_df.dtypes

gender                        int8
nationality                   int8
place_of_birth                int8
stage_id                      int8
grade_id                      int8
section_id                    int8
topic                         int8
semester                      int8
relation                      int8
parent_answering_survey       int8
parent_school_satisfaction    int8
student_absence_days          int8
raised_hands                  int8
visited_resources             int8
announcements_view            int8
discussion                    int8
class                         int8
dtype: object

### 5. Save Encoded Data

In [16]:
# This cell will run on local storage only.
encoded_df.to_csv('./data/xAPI-Edu-Data_encoded.csv',index=False)
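
CSV files do not store dtypes, so the `int8` columns come back as `int64` by default when the file is read again. The following is a sketch of a round trip on a made-up frame, written to a temporary directory rather than `./data`; passing `dtype` to `read_csv` restores the compact type.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for encoded_df.
toy_df = pd.DataFrame({'gender': [1, 0], 'class': [2, 1]}, dtype='int8')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'encoded.csv')
    toy_df.to_csv(path, index=False)  # index=False avoids an extra index column

    reloaded = pd.read_csv(path)                # dtypes inferred from the text
    restored = pd.read_csv(path, dtype='int8')  # dtypes requested explicitly

print(reloaded.dtypes.tolist())
print(restored.equals(toy_df))
```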

## Observations
   
1. There are 12 categorical input features.
2. The target is multiclass with three labels (`H`, `L`, `M`).

## Conclusion
Based on the observations:

1. All 12 categorical features are encoded as integer codes with `OrdinalEncoder`.
2. The target class is encoded with `LabelEncoder` as integer labels (`H` = 0, `L` = 1, `M` = 2).
3. The encoded data is combined with the numerical data.
4. The combined data is saved as `xAPI-Edu-Data_encoded.csv`.