<h1 style="color:Blue; padding:20px; text-align:center; border-radius:10px">01. Data Cleaning </h1>

<div style="background-color:Gainsboro; padding:20px; text-align:justify; font-size: 110%">
    <p>In data cleaning, the following operations are performed:</p>
    <ol>
        <li>Load Data.</li>
        <li>Handle Missing Values.</li>
        <li>Clean Categorical Data.</li>
        <li>Clean Numerical Data.</li>
        <li>Optimize Data for Memory.</li>
        <li>Save Cleaned Data.</li>
    </ol>
</div>

### Standard Imports.

In [1]:
import numpy as np
import pandas as pd
import klib as kl

In [2]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
%pprint

Pretty printing has been turned OFF


### 1. Load Data.

In [3]:
# Load the student data
data = pd.read_csv('./data/xAPI-Edu-Data.csv')

In [4]:
# Check that the data is properly loaded
data.head()

Unnamed: 0,gender,NationalITy,PlaceofBirth,StageID,GradeID,SectionID,Topic,Semester,Relation,raisedhands,VisITedResources,AnnouncementsView,Discussion,ParentAnsweringSurvey,ParentschoolSatisfaction,StudentAbsenceDays,Class
0,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,15,16,2,20,Yes,Good,Under-7,M
1,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,20,20,3,25,Yes,Good,Under-7,M
2,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,10,7,0,30,No,Bad,Above-7,L
3,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,30,25,5,35,No,Bad,Above-7,L
4,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,40,50,12,50,No,Bad,Above-7,M


In [5]:
# Data Characteristics
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   gender                    480 non-null    object
 1   NationalITy               480 non-null    object
 2   PlaceofBirth              480 non-null    object
 3   StageID                   480 non-null    object
 4   GradeID                   480 non-null    object
 5   SectionID                 480 non-null    object
 6   Topic                     480 non-null    object
 7   Semester                  480 non-null    object
 8   Relation                  480 non-null    object
 9   raisedhands               480 non-null    int64 
 10  VisITedResources          480 non-null    int64 
 11  AnnouncementsView         480 non-null    int64 
 12  Discussion                480 non-null    int64 
 13  ParentAnsweringSurvey     480 non-null    object
 14  ParentschoolSatisfaction  

<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-size: 110%">
The spelling of some feature names has to be adjusted for uniformity.
</p>

In [6]:
# Clean the feature name labels.
data.rename(columns={
    'gender' : 'Gender',
    'NationalITy' :'Nationality',
    'raisedhands' : 'RaisedHands',
    'VisITedResources' :'VisitedResources',
    'ParentschoolSatisfaction':'ParentSchoolSatisfaction'    
}, inplace = True)

In [7]:
# Check the Data Characteristics again
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Gender                    480 non-null    object
 1   Nationality               480 non-null    object
 2   PlaceofBirth              480 non-null    object
 3   StageID                   480 non-null    object
 4   GradeID                   480 non-null    object
 5   SectionID                 480 non-null    object
 6   Topic                     480 non-null    object
 7   Semester                  480 non-null    object
 8   Relation                  480 non-null    object
 9   RaisedHands               480 non-null    int64 
 10  VisitedResources          480 non-null    int64 
 11  AnnouncementsView         480 non-null    int64 
 12  Discussion                480 non-null    int64 
 13  ParentAnsweringSurvey     480 non-null    object
 14  ParentSchoolSatisfaction  

### 2. Handle Missing Values.

In [8]:
# Check for missing data
data.isna().sum()

Gender                      0
Nationality                 0
PlaceofBirth                0
StageID                     0
GradeID                     0
SectionID                   0
Topic                       0
Semester                    0
Relation                    0
RaisedHands                 0
VisitedResources            0
AnnouncementsView           0
Discussion                  0
ParentAnsweringSurvey       0
ParentSchoolSatisfaction    0
StudentAbsenceDays          0
Class                       0
dtype: int64

<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-size: 110%">
The result of <strong>data.isna().sum()</strong> shows that there are no missing values in any of the columns.
</p>
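
When a dataset does contain gaps, the same check can be extended to report the share of missing values per column. The following is a minimal sketch on a made-up frame; the `toy` data and the `report` name are assumptions for illustration and are not part of the student dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps (illustrative values only).
toy = pd.DataFrame({
    'RaisedHands': [15, np.nan, 10, 30],
    'Topic': ['IT', 'Math', None, 'IT'],
    'Class': ['M', 'L', 'L', 'M'],
})

# Count of missing values per column, as in the cell above.
missing_count = toy.isna().sum()

# Fraction of missing values per column (mean of the boolean mask).
missing_share = toy.isna().mean()

# Keep only the columns that actually have gaps.
report = pd.DataFrame({'missing': missing_count, 'share': missing_share})
print(report[report['missing'] > 0])
```

Columns with a non-zero share are the ones that would need imputation or removal before the later steps.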

### 3. Clean Categorical Data.

In [9]:
# Object-typed columns are categorical; the last one ('Class') is the target, so it is excluded.
CATEGORICAL_FEATURES = [column for column in data.columns if data.dtypes[column] == 'object'][:-1]
CATEGORICAL_FEATURES

['Gender', 'Nationality', 'PlaceofBirth', 'StageID', 'GradeID', 'SectionID', 'Topic', 'Semester', 'Relation', 'ParentAnsweringSurvey', 'ParentSchoolSatisfaction', 'StudentAbsenceDays']
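
The slice `[:-1]` above depends on the target being the last column. A possible alternative, sketched here on a made-up frame, is to select the non-numeric columns and drop the target by name. The `toy` frame and the lowercase `categorical_features` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the student data (illustrative values only).
toy = pd.DataFrame({
    'Gender': ['M', 'F'],
    'RaisedHands': [15, 20],
    'StudentAbsenceDays': ['Under-7', 'Above-7'],
    'Class': ['M', 'L'],
})

TARGET = 'Class'

# Select the non-numeric columns, then exclude the target by name
# instead of relying on its position in the frame.
categorical_features = [
    column
    for column in toy.select_dtypes(exclude='number').columns
    if column != TARGET
]
print(categorical_features)
```

Excluding by name keeps the list correct even if the column order changes.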

In [10]:
TARGET = 'Class'

In [None]:
# Categorical Features
print(f'Total number of categorical features: {len(CATEGORICAL_FEATURES)}')
for i,feature in enumerate(CATEGORICAL_FEATURES):
    print((i+1),feature)

In [None]:
for i,feature in enumerate(CATEGORICAL_FEATURES):
    print(f'{i+1}. {feature}, has {len(data[feature].unique())} unique attributes. \n {data[feature].unique()} \n')

Looking at the features, the following require some cleaning:
* **Nationality**
* **PlaceofBirth**
* **StageID**

#### 3.1 Cleaning the Nationality feature.

In [None]:
# Categories in Nationality
data['Nationality'].unique()

We need to make the following changes:
1. Convert KW to Kuwait
2. Convert lebanon to Lebanon
3. Convert venzuela to Venezuela
4. Convert Lybia to Libya

In [None]:
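# Helper function for label conversion in a feature column
def label_converter(column, original, replacement):
    """Replace every occurrence of `original` in `column` with `replacement`, in place."""
    # Assumes a default integer index (0..n-1), as in the loaded data.
    for i, value in enumerate(column):
        if value == original:
            column[i] = replacement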

In [None]:
label_converter(data['Nationality'],'KW','Kuwait')

In [None]:
label_converter(data['Nationality'],'lebanon','Lebanon')

In [None]:
label_converter(data['Nationality'],'venzuela','Venezuela')

In [None]:
label_converter(data['Nationality'],'Lybia','Libya')

In [None]:
# Check the changes
data['Nationality'].unique()

#### 3.2 Cleaning the PlaceofBirth feature.

In [None]:
# Categories in PlaceofBirth
data['PlaceofBirth'].unique()

We need to make the following changes:
1. Convert KuwaIT to Kuwait
2. Convert lebanon to Lebanon
3. Convert venzuela to Venezuela
4. Convert Lybia to Libya
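
The label changes listed for **Nationality** and **PlaceofBirth** can also be expressed as a single mapping and applied with `Series.replace`, assigning the result back to the column. This is a minimal sketch on a made-up frame; the `toy` frame and the `LABEL_FIXES` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame with the misspelled labels (illustrative values only).
toy = pd.DataFrame({
    'Nationality': ['KW', 'lebanon', 'venzuela', 'Lybia', 'Egypt'],
    'PlaceofBirth': ['KuwaIT', 'lebanon', 'venzuela', 'Lybia', 'Egypt'],
})

# One mapping from each original label to its cleaned form.
LABEL_FIXES = {
    'KW': 'Kuwait',
    'KuwaIT': 'Kuwait',
    'lebanon': 'Lebanon',
    'venzuela': 'Venezuela',
    'Lybia': 'Libya',
}

# Series.replace with a dict swaps whole values that match a key;
# labels without an entry (for example 'Egypt') are left unchanged.
for column in ['Nationality', 'PlaceofBirth']:
    toy[column] = toy[column].replace(LABEL_FIXES)

print(toy)
```

Assigning the result back to `toy[column]` avoids relying on in-place modification of a column selected from the frame.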

In [None]:
label_converter(data['PlaceofBirth'],'KuwaIT','Kuwait')

In [None]:
label_converter(data['PlaceofBirth'],'lebanon','Lebanon')

In [None]:
label_converter(data['PlaceofBirth'],'venzuela','Venezuela')

In [None]:
label_converter(data['PlaceofBirth'],'Lybia','Libya')

In [None]:
# Check the changes
data['PlaceofBirth'].unique()

#### 3.3 Cleaning the StageID feature.

In [None]:
# Categories in StageID
data['StageID'].unique()

For consistency in the labels, we need to change lowerlevel to LowerLevel.

In [None]:
label_converter(data['StageID'],'lowerlevel','LowerLevel')

In [None]:
# Check the changes
data['StageID'].unique()

<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-size: 110%">
1. Features with 2 unique labels can be encoded as binary features. <br>
2. Features with more than 2 independent labels can be encoded as nominal features. (No relation exists between the labels.)<br>
3. Features with more than 2 related labels can be encoded as ordinal features. (There is an ordinal relation between the labels.)
</p>
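
The three encoding cases above can be sketched with plain pandas as follows. The `toy` frame, the chosen example columns, the `MiddleSchool` and `HighSchool` labels, and the order in `stage_order` are all assumptions for illustration; the notebook itself does not fix an encoding at this stage.

```python
import pandas as pd

# Hypothetical toy frame (illustrative values only).
toy = pd.DataFrame({
    'Gender': ['M', 'F', 'M'],
    'Topic': ['IT', 'Math', 'IT'],
    'StageID': ['LowerLevel', 'HighSchool', 'MiddleSchool'],
})

# 1. Binary feature: two labels mapped to 0 and 1.
toy['GenderBinary'] = toy['Gender'].map({'M': 0, 'F': 1})

# 2. Nominal feature: one indicator column per label (one-hot encoding).
topic_dummies = pd.get_dummies(toy['Topic'], prefix='Topic')

# 3. Ordinal feature: labels mapped to integer codes that follow an explicit order.
#    The order below is an assumption made for this sketch.
stage_order = ['LowerLevel', 'MiddleSchool', 'HighSchool']
toy['StageOrdinal'] = pd.Categorical(
    toy['StageID'], categories=stage_order, ordered=True
).codes

print(toy)
print(topic_dummies)
```

The explicit `categories` list is what makes the ordinal codes meaningful; without it, pandas would order the labels alphabetically.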

In [None]:
# Target 
print(f'{TARGET} has {len(data[TARGET].unique())} unique attributes. \n {data[TARGET].unique()} \n')

<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-size: 110%">
The target has three unique attributes and can possibly be one-hot encoded.
</p>

<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-size: 110%">
The descriptive statistics of the continuous features are tabulated with <strong>data.describe()</strong> in the next section.
</p>

### 4. Clean Numerical Data.

In [None]:
NUMERICAL_FEATURES = [column for column in data.columns if data.dtypes[column] != 'object']
NUMERICAL_FEATURES

In [None]:
print(f'Total number of continuous features: {len(NUMERICAL_FEATURES)}')
for i,feature in enumerate(NUMERICAL_FEATURES):
    print((i+1),feature)

In [None]:
stats = data.describe()
stats.transpose()

In [None]:
# Check for null values in the Numerical features
data[NUMERICAL_FEATURES].isnull().sum()

<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-size: 110%">
The result of <strong>data[NUMERICAL_FEATURES].isnull().sum()</strong> shows that there are no null values in any numerical feature.
</p>

### 5. Optimize Data for Memory.

The data is optimized for memory usage, which has the following advantages:
1. Reduces the complexity of the data types
2. Reduces the size of the data

This can speed up data loading and processing when training the ML models.
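
To illustrate the kind of saving that smaller data types give, the following sketch converts a repeated string column to `category` and downcasts an integer column using plain pandas. It does not reproduce what `kl.data_cleaning` does internally; the `toy` frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame with repeated labels and small integers (illustrative only).
toy = pd.DataFrame({
    'Topic': ['IT', 'Math', 'IT', 'Math'] * 250,
    'RaisedHands': [15, 20, 10, 30] * 250,
})

# Memory footprint before optimization; deep=True also counts string contents.
before = toy.memory_usage(deep=True).sum()

optimized = toy.copy()

# Repeated labels are stored once and referenced by small integer codes.
optimized['Topic'] = optimized['Topic'].astype('category')

# Non-negative integers are downcast to the smallest unsigned type that fits.
optimized['RaisedHands'] = pd.to_numeric(optimized['RaisedHands'], downcast='unsigned')

after = optimized.memory_usage(deep=True).sum()
print(f'{before} bytes -> {after} bytes')
print(optimized.dtypes)
```

The values are unchanged by the conversion; only their storage type differs.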

In [None]:
# Original Data
data.info()

In [None]:
optimized_data = kl.data_cleaning(data)

In [None]:
# Optimized data
optimized_data.info()

In [None]:
optimized_data.head()

### 6. Save Cleaned Data.

In [None]:
optimized_data.to_csv('./data/xAPI-Edu-Data_cleaned.csv')
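
Note that CSV is a plain-text format and does not store column data types, so optimized types are not restored automatically when the file is read back. The sketch below shows a round trip through an in-memory buffer and one way to restore the types with the `dtype` argument of `pd.read_csv`. The `toy` frame and the use of `index=False` (which omits the index column that would otherwise be written as an extra unnamed column) are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical optimized frame (illustrative values only).
toy = pd.DataFrame({
    'Topic': pd.Categorical(['IT', 'Math', 'IT']),
    'RaisedHands': pd.Series([15, 20, 10], dtype='uint8'),
})

# Write to an in-memory buffer instead of a file, without the index column.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading back without hints: pandas infers the types again from the text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading back with explicit dtypes restores the optimized types.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'Topic': 'category', 'RaisedHands': 'uint8'})

print(plain.dtypes, typed.dtypes, sep='\n\n')
```

If preserving the types matters for later notebooks, passing `dtype` on load as shown is one option.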