<a href="https://colab.research.google.com/github/AamirKhaan/Student-Academic-Performance/blob/master/01_Data_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="background-color:Aqua; padding:20px; border-radius:10px">Data Cleaning </h1>

## Overview      
<div style="background-color:Gainsboro; padding:20px; text-align:justify; font-weight: bold">
    <p>In this notebook we perform the following tasks:</p>
    <ol>
        <li>Import Data</li>
        <li>Handle Missing Values</li>
        <li>Clean Categorical Data</li>
        <li>Clean Numerical Data</li>
        <li>Optimize Data for Memory</li>
        <li>Save Cleaned Data</li>
    </ol>
</div>

### Standard Imports

In [None]:
# Run this cell only when working in Colab
!pip install klib

In [1]:
import numpy as np
import pandas as pd
import klib as kl

In [2]:
# Jupyter Notebook configuration (personal preferences)
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
%pprint

Pretty printing has been turned OFF


### 1. Import Data.

In [3]:
# Import the Student's data from local storage
# Don't run this cell in Colab
data = pd.read_csv('./data/xAPI-Edu-Data.csv')

In [4]:
# Import the Student's data from github storage
# Run only in colab
url = 'https://raw.githubusercontent.com/AamirKhaan/Student-Academic-Performance/main/data/xAPI-Edu-Data.csv'
data = pd.read_csv(url)
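
The two loading cells above could be merged into one that works in both environments. A minimal sketch, assuming a hypothetical helper `load_student_data` (not part of the original notebook) that reads the local file when it exists and otherwise falls back to the URL:

```python
from pathlib import Path

import pandas as pd

LOCAL_PATH = './data/xAPI-Edu-Data.csv'
REMOTE_URL = ('https://raw.githubusercontent.com/AamirKhaan/'
              'Student-Academic-Performance/main/data/xAPI-Edu-Data.csv')


def load_student_data(local_path=LOCAL_PATH, url=REMOTE_URL):
    """Hypothetical helper: read the local CSV if present, else the remote copy."""
    source = local_path if Path(local_path).exists() else url
    return pd.read_csv(source)
```

Calling `data = load_student_data()` would then replace both cells, so neither has to be skipped by hand.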

In [5]:
# Check that the data is properly loaded
data.head()

Unnamed: 0,gender,NationalITy,PlaceofBirth,StageID,GradeID,SectionID,Topic,Semester,Relation,raisedhands,VisITedResources,AnnouncementsView,Discussion,ParentAnsweringSurvey,ParentschoolSatisfaction,StudentAbsenceDays,Class
0,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,15,16,2,20,Yes,Good,Under-7,M
1,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,20,20,3,25,Yes,Good,Under-7,M
2,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,10,7,0,30,No,Bad,Above-7,L
3,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,30,25,5,35,No,Bad,Above-7,L
4,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,40,50,12,50,No,Bad,Above-7,M


In [6]:
# Data Characteristics
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   gender                    480 non-null    object
 1   NationalITy               480 non-null    object
 2   PlaceofBirth              480 non-null    object
 3   StageID                   480 non-null    object
 4   GradeID                   480 non-null    object
 5   SectionID                 480 non-null    object
 6   Topic                     480 non-null    object
 7   Semester                  480 non-null    object
 8   Relation                  480 non-null    object
 9   raisedhands               480 non-null    int64 
 10  VisITedResources          480 non-null    int64 
 11  AnnouncementsView         480 non-null    int64 
 12  Discussion                480 non-null    int64 
 13  ParentAnsweringSurvey     480 non-null    object
 14  ParentschoolSatisfaction  

<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-weight: bold">
The spelling of some feature names can be adjusted for uniformity.
</p>

In [7]:
# Clean the feature name labels.
data.rename(columns={
    'gender' : 'Gender',
    'NationalITy' :'Nationality',
    'raisedhands' : 'RaisedHands',
    'VisITedResources' :'VisitedResources',
    'ParentschoolSatisfaction':'ParentSchoolSatisfaction',
    'PlaceofBirth':'PlaceOfBirth'
}, inplace = True)
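
`rename` silently ignores mapping keys that do not match any column, so a typo in the mapping would go unnoticed. A small sketch on a toy frame (illustrative, not the real data) that uses `errors='raise'` to surface such mismatches:

```python
import pandas as pd

# Toy frame with two of the original column names (illustrative only)
toy = pd.DataFrame({'gender': ['M', 'F'], 'raisedhands': [15, 20]})

renamed = toy.rename(
    columns={'gender': 'Gender', 'raisedhands': 'RaisedHands'},
    errors='raise',  # raise KeyError if a key is not an existing column
)
print(list(renamed.columns))
```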

In [8]:
# Check the Data Characteristics again
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Gender                    480 non-null    object
 1   Nationality               480 non-null    object
 2   PlaceOfBirth              480 non-null    object
 3   StageID                   480 non-null    object
 4   GradeID                   480 non-null    object
 5   SectionID                 480 non-null    object
 6   Topic                     480 non-null    object
 7   Semester                  480 non-null    object
 8   Relation                  480 non-null    object
 9   RaisedHands               480 non-null    int64 
 10  VisitedResources          480 non-null    int64 
 11  AnnouncementsView         480 non-null    int64 
 12  Discussion                480 non-null    int64 
 13  ParentAnsweringSurvey     480 non-null    object
 14  ParentSchoolSatisfaction  

### 2. Handle Missing Values

In [9]:
# Check for missing data
data.isna().sum()

Gender                      0
Nationality                 0
PlaceOfBirth                0
StageID                     0
GradeID                     0
SectionID                   0
Topic                       0
Semester                    0
Relation                    0
RaisedHands                 0
VisitedResources            0
AnnouncementsView           0
Discussion                  0
ParentAnsweringSurvey       0
ParentSchoolSatisfaction    0
StudentAbsenceDays          0
Class                       0
dtype: int64

<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-size: 110%">
The result of <strong>data.isna().sum()</strong> shows there are no missing values in any of the columns.
</p>
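
`isna()` only counts real missing values (`NaN` or `None`). Placeholder strings such as an empty string or `'?'` would not be counted. A hedged sketch of an additional check follows; the placeholder list and the toy frame are assumptions, not taken from the dataset:

```python
import pandas as pd

# Assumed placeholder tokens; adjust to the conventions of the data source
PLACEHOLDERS = ['', '?', 'na', 'n/a', 'none', 'null']


def count_placeholders(frame):
    """Count placeholder-like strings per non-numeric column."""
    text = frame.select_dtypes(exclude='number')
    normalized = text.apply(lambda col: col.astype(str).str.strip().str.lower())
    return normalized.isin(PLACEHOLDERS).sum()


toy = pd.DataFrame({'Topic': ['IT', '?', 'Math'], 'Relation': ['Father', 'Mum', ' ']})
print(count_placeholders(toy))
```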

### 3. Clean Categorical Data

In [10]:
# Object-dtype columns are categorical; [:-1] drops the last one, the target 'Class'
CATEGORICAL_FEATURES = [column for column in data.columns if data.dtypes[column] == 'object'][:-1]
CATEGORICAL_FEATURES

['Gender', 'Nationality', 'PlaceOfBirth', 'StageID', 'GradeID', 'SectionID', 'Topic', 'Semester', 'Relation', 'ParentAnsweringSurvey', 'ParentSchoolSatisfaction', 'StudentAbsenceDays']
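
The same split can be obtained with `DataFrame.select_dtypes`, which avoids comparing dtypes by hand. A minimal sketch on a toy frame with the cleaned column names; dropping the target by name rather than by position is a choice made here, not the notebook's approach:

```python
import pandas as pd

TARGET = 'Class'

toy = pd.DataFrame({
    'Gender': ['M', 'F'],
    'Topic': ['IT', 'Math'],
    'RaisedHands': [15, 20],
    'Class': ['M', 'L'],
})

# Non-numeric columns, minus the target, are the categorical predictors
categorical = [c for c in toy.select_dtypes(exclude='number').columns if c != TARGET]
numerical = list(toy.select_dtypes(include='number').columns)

print(categorical)  # ['Gender', 'Topic']
print(numerical)    # ['RaisedHands']
```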

In [11]:
TARGET = 'Class'

In [12]:
# Categorical Features
print(f'Total Number of Categorical Features : {len(CATEGORICAL_FEATURES)}')
for i,feature in enumerate(CATEGORICAL_FEATURES):
    print((i+1),feature)

Total Number of Categorical Features : 12
1 Gender
2 Nationality
3 PlaceOfBirth
4 StageID
5 GradeID
6 SectionID
7 Topic
8 Semester
9 Relation
10 ParentAnsweringSurvey
11 ParentSchoolSatisfaction
12 StudentAbsenceDays


In [13]:
for i,feature in enumerate(CATEGORICAL_FEATURES):
    print(f'{i+1}. {feature}: {len(data[feature].unique())} unique labels. \n {data[feature].unique()} \n')

1. Gender: 2 unique labels. 
 ['M' 'F'] 

2. Nationality: 14 unique labels. 
 ['KW' 'lebanon' 'Egypt' 'SaudiArabia' 'USA' 'Jordan' 'venzuela' 'Iran'
 'Tunis' 'Morocco' 'Syria' 'Palestine' 'Iraq' 'Lybia'] 

3. PlaceOfBirth: 14 unique labels. 
 ['KuwaIT' 'lebanon' 'Egypt' 'SaudiArabia' 'USA' 'Jordan' 'venzuela' 'Iran'
 'Tunis' 'Morocco' 'Syria' 'Iraq' 'Palestine' 'Lybia'] 

4. StageID: 3 unique labels. 
 ['lowerlevel' 'MiddleSchool' 'HighSchool'] 

5. GradeID: 10 unique labels. 
 ['G-04' 'G-07' 'G-08' 'G-06' 'G-05' 'G-09' 'G-12' 'G-11' 'G-10' 'G-02'] 

6. SectionID: 3 unique labels. 
 ['A' 'B' 'C'] 

7. Topic: 12 unique labels. 
 ['IT' 'Math' 'Arabic' 'Science' 'English' 'Quran' 'Spanish' 'French'
 'History' 'Biology' 'Chemistry' 'Geology'] 

8. Semester: 2 unique labels. 
 ['F' 'S'] 

9. Relation: 2 unique labels. 
 ['Father' 'Mum'] 

10. ParentAnsweringSurvey: 2 unique labels. 
 ['Yes' 'No'] 

11. ParentSchoolSatisfaction: 2 unique labels. 
 ['Good' 'Bad'] 

12. StudentAbsenceDays: 2 u

The following features require some cleaning:
* **Nationality**
* **PlaceOfBirth**
* **StageID**

#### 3.1 Cleaning the Nationality feature

In [14]:
# Categories in Nationality
data['Nationality'].unique()

array(['KW', 'lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan',
       'venzuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine',
       'Iraq', 'Lybia'], dtype=object)

We need to make the following changes:
1. Convert KW to Kuwait
2. Convert lebanon to Lebanon
3. Convert venzuela to Venezuela
4. Convert Lybia to Libya

In [15]:
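# Helper function for label conversion in a feature column
def label_converter(column, original, replacement):
    # Replace every occurrence of `original` with `replacement`, in place.
    # The change reaches the DataFrame only while the passed Series shares
    # memory with it, which is not the case under pandas Copy-on-Write.
    column.loc[column == original] = replacement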

In [16]:
label_converter(data['Nationality'],'KW','Kuwait')

In [17]:
label_converter(data['Nationality'],'lebanon','Lebanon')

In [18]:
label_converter(data['Nationality'],'venzuela','Venezuela')

In [19]:
label_converter(data['Nationality'],'Lybia','Libya')

In [20]:
# Check the changes
data['Nationality'].unique()

array(['Kuwait', 'Lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan',
       'Venezuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine',
       'Iraq', 'Libya'], dtype=object)

#### 3.2 Cleaning the PlaceOfBirth feature

In [21]:
# Categories in PlaceOfBirth
data['PlaceOfBirth'].unique()

array(['KuwaIT', 'lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan',
       'venzuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Iraq',
       'Palestine', 'Lybia'], dtype=object)

We need to make the following changes:
1. Convert KuwaIT to Kuwait
2. Convert lebanon to Lebanon
3. Convert venzuela to Venezuela
4. Convert Lybia to Libya

In [22]:
label_converter(data['PlaceOfBirth'],'KuwaIT','Kuwait')

In [23]:
label_converter(data['PlaceOfBirth'],'lebanon','Lebanon')

In [24]:
label_converter(data['PlaceOfBirth'],'venzuela','Venezuela')

In [25]:
label_converter(data['PlaceOfBirth'],'Lybia','Libya')

In [26]:
# Check the changes
data['PlaceOfBirth'].unique()

array(['Kuwait', 'Lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan',
       'Venezuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Iraq',
       'Palestine', 'Libya'], dtype=object)
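
The individual `label_converter` calls in sections 3.1 and 3.2 could also be written as one mapping applied to both columns, with the result assigned back to the frame instead of relying on in-place mutation. A sketch under that assumption, using `DataFrame.replace` on a toy frame; the name `COUNTRY_FIXES` is introduced here for illustration:

```python
import pandas as pd

# Corrections listed in sections 3.1 and 3.2
COUNTRY_FIXES = {
    'KW': 'Kuwait',
    'KuwaIT': 'Kuwait',
    'lebanon': 'Lebanon',
    'venzuela': 'Venezuela',
    'Lybia': 'Libya',
}

toy = pd.DataFrame({
    'Nationality': ['KW', 'lebanon', 'Egypt'],
    'PlaceOfBirth': ['KuwaIT', 'venzuela', 'Lybia'],
})

columns = ['Nationality', 'PlaceOfBirth']
# replace() matches whole cell values and returns a new frame; assign it back
toy[columns] = toy[columns].replace(COUNTRY_FIXES)
print(toy)
```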

#### 3.3 Cleaning the StageID feature

In [27]:
# Categories in StageID
data['StageID'].unique()

array(['lowerlevel', 'MiddleSchool', 'HighSchool'], dtype=object)

For consistency in the labels, we need to change lowerlevel to LowerLevel.

In [28]:
label_converter(data['StageID'],'lowerlevel','LowerLevel')

In [29]:
# Check the changes
data['StageID'].unique()

array(['LowerLevel', 'MiddleSchool', 'HighSchool'], dtype=object)

<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-size:110%">
1. Features with 2 unique labels can be encoded as binary categories.<br>
2. Features with more than 2 labels can be encoded as nominal categories, if the labels are independent.<br>
3. Features with more than 2 labels can be encoded as ordinal categories, if there is an ordinal relation between the labels.
</p>
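
The three cases can be sketched with plain pandas. The column choices below (Gender as binary, Topic as nominal, StageID as ordinal) and the assumed order LowerLevel < MiddleSchool < HighSchool are illustrative assumptions, not decisions made in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    'Gender': ['M', 'F', 'M'],
    'Topic': ['IT', 'Math', 'IT'],
    'StageID': ['LowerLevel', 'HighSchool', 'MiddleSchool'],
})

# 1. Binary: map the two labels to 0/1
gender_binary = toy['Gender'].map({'M': 0, 'F': 1})

# 2. Nominal: one indicator column per label
topic_dummies = pd.get_dummies(toy['Topic'], prefix='Topic')

# 3. Ordinal: ordered categorical, then integer codes that follow the order
stage_order = ['LowerLevel', 'MiddleSchool', 'HighSchool']
stage_codes = pd.Categorical(toy['StageID'], categories=stage_order, ordered=True).codes

print(gender_binary.tolist())       # [0, 1, 0]
print(list(topic_dummies.columns))  # ['Topic_IT', 'Topic_Math']
print(stage_codes.tolist())         # [0, 2, 1]
```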

#### 3.4 Target class

In [30]:
# Target 
print(f' TARGET, has {len(data[TARGET].unique())} unique attributes. \n {data[TARGET].unique()} \n')

 TARGET, has 3 unique attributes. 
 ['M' 'L' 'H'] 



<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-size: 110%">
The target has three unique labels. This makes our problem a <strong>Multiclass Classification</strong> problem.
</p>
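
For a multiclass target it is useful to look at how the classes are distributed before modelling. A small sketch on made-up labels (the counts below are not the dataset's):

```python
import pandas as pd

# Made-up target values, for illustration only
target = pd.Series(['M', 'L', 'H', 'M', 'M', 'L'], name='Class')

counts = target.value_counts()
shares = target.value_counts(normalize=True).round(2)

print(counts.to_dict())  # {'M': 3, 'L': 2, 'H': 1}
print(shares.to_dict())
```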

<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-size: 110%">
The descriptive statistical properties of the numerical features are tabulated in the output of <strong>data.describe()</strong> in Section 4.
</p>

### 4. Clean Numerical Data.

In [31]:
NUMERICAL_FEATURES = [column for column in data.columns if data.dtypes[column] != 'object']
NUMERICAL_FEATURES

['RaisedHands', 'VisitedResources', 'AnnouncementsView', 'Discussion']

In [32]:
print(f'Total Number of Continuous Features : {len(NUMERICAL_FEATURES)}')
for i,feature in enumerate(NUMERICAL_FEATURES):
    print((i+1),feature)

Total Number of Continuous Features : 4
1 RaisedHands
2 VisitedResources
3 AnnouncementsView
4 Discussion


In [33]:
stats = data.describe()
stats.transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
RaisedHands,480.0,46.775,30.779223,0.0,15.75,50.0,75.0,100.0
VisitedResources,480.0,54.797917,33.080007,0.0,20.0,65.0,84.0,99.0
AnnouncementsView,480.0,37.91875,26.611244,0.0,14.0,33.0,58.0,98.0
Discussion,480.0,43.283333,27.637735,1.0,20.0,39.0,70.0,99.0
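
The quartiles in the table are enough for a quick outlier check with the common 1.5 * IQR rule. The rule is a convention chosen here for illustration; the notebook itself does not apply it. Using the tabulated values:

```python
# (min, 25%, 75%, max) copied from the describe() output above
stats = {
    'RaisedHands': (0.0, 15.75, 75.0, 100.0),
    'VisitedResources': (0.0, 20.0, 84.0, 99.0),
    'AnnouncementsView': (0.0, 14.0, 58.0, 98.0),
    'Discussion': (1.0, 20.0, 70.0, 99.0),
}


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


for name, (lo, q1, q3, hi) in stats.items():
    lower, upper = iqr_fences(q1, q3)
    outside = lo < lower or hi > upper
    print(f'{name}: fences=({lower}, {upper}), values outside fences: {outside}')
```

With these numbers, every feature's minimum and maximum fall inside its fences, so the rule flags no outliers.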


In [34]:
# Check for null values in the Numerical features
data[NUMERICAL_FEATURES].isnull().sum()

RaisedHands          0
VisitedResources     0
AnnouncementsView    0
Discussion           0
dtype: int64

<p style="background-color:Gainsboro; padding:20px; text-align:justify; font-size: 110%">
The result of <strong>data[NUMERICAL_FEATURES].isnull().sum()</strong> shows there are no null values in any numerical feature.
</p>

### 5. Optimize Data for the Memory.

Optimizing the data for memory usage has the following advantages:
1. It reduces the complexity of the data types
2. It reduces the size of the data

This can speed up downstream processing and the training of ML models.
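
What such an optimization does can be sketched with plain pandas: text columns with few distinct labels become `category`, and integers are downcast to the smallest type that holds their range. This is an illustration on synthetic data and is not a description of klib's internals:

```python
import pandas as pd

# Synthetic frame: one low-cardinality text column, one small-integer column
toy = pd.DataFrame({
    'Gender': ['M', 'F'] * 500,
    'RaisedHands': list(range(0, 100)) * 10,
})

before = toy.memory_usage(deep=True).sum()

compact = toy.copy()
compact['Gender'] = compact['Gender'].astype('category')
compact['RaisedHands'] = pd.to_numeric(compact['RaisedHands'], downcast='integer')

after = compact.memory_usage(deep=True).sum()

print(compact.dtypes.to_dict())
print(f'{before} bytes -> {after} bytes')
```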

In [35]:
# Original Data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Gender                    480 non-null    object
 1   Nationality               480 non-null    object
 2   PlaceOfBirth              480 non-null    object
 3   StageID                   480 non-null    object
 4   GradeID                   480 non-null    object
 5   SectionID                 480 non-null    object
 6   Topic                     480 non-null    object
 7   Semester                  480 non-null    object
 8   Relation                  480 non-null    object
 9   RaisedHands               480 non-null    int64 
 10  VisitedResources          480 non-null    int64 
 11  AnnouncementsView         480 non-null    int64 
 12  Discussion                480 non-null    int64 
 13  ParentAnsweringSurvey     480 non-null    object
 14  ParentSchoolSatisfaction  

In [36]:
optimized_data = kl.data_cleaning(data)

Long column names detected (>25 characters). Consider renaming the following columns ['parent_school_satisfaction'].
Shape of cleaned data: (478, 17)
Remaining NAs: 0

Changes:
Dropped rows: 2
     of which 2 duplicates. (Rows: [326, 327])
Dropped columns: 0
     of which 0 single valued.     Columns: []
Dropped missing values: 0
Reduced memory by at least: 0.05 MB (-83.33%)
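
The log reports two dropped duplicate rows. Before dropping, the affected rows can be inspected with pandas. A sketch on a toy frame follows; whether identical records in the real data are true duplicates or distinct students with the same values is a judgment call this check cannot settle:

```python
import pandas as pd

toy = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F'],
    'RaisedHands': [15, 20, 15, 30],
})

# keep=False marks every member of a duplicate group, not just the repeats
duplicate_rows = toy[toy.duplicated(keep=False)]
print(duplicate_rows)

deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(deduplicated.shape)  # (3, 2)
```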



In [37]:
# Optimized data
optimized_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   gender                      478 non-null    category
 1   nationality                 478 non-null    category
 2   place_of_birth              478 non-null    category
 3   stage_id                    478 non-null    category
 4   grade_id                    478 non-null    category
 5   section_id                  478 non-null    category
 6   topic                       478 non-null    category
 7   semester                    478 non-null    category
 8   relation                    478 non-null    category
 9   raised_hands                478 non-null    int8    
 10  visited_resources           478 non-null    int8    
 11  announcements_view          478 non-null    int8    
 12  discussion                  478 non-null    int8    
 13  parent_answering_sur

In [38]:
optimized_data.head()

Unnamed: 0,gender,nationality,place_of_birth,stage_id,grade_id,section_id,topic,semester,relation,raised_hands,visited_resources,announcements_view,discussion,parent_answering_survey,parent_school_satisfaction,student_absence_days,class
0,M,Kuwait,Kuwait,LowerLevel,G-04,A,IT,F,Father,15,16,2,20,Yes,Good,Under-7,M
1,M,Kuwait,Kuwait,LowerLevel,G-04,A,IT,F,Father,20,20,3,25,Yes,Good,Under-7,M
2,M,Kuwait,Kuwait,LowerLevel,G-04,A,IT,F,Father,10,7,0,30,No,Bad,Above-7,L
3,M,Kuwait,Kuwait,LowerLevel,G-04,A,IT,F,Father,30,25,5,35,No,Bad,Above-7,L
4,M,Kuwait,Kuwait,LowerLevel,G-04,A,IT,F,Father,40,50,12,50,No,Bad,Above-7,M


### 6. Save Cleaned Data.

In [39]:
# This cell will run on local storage only.
optimized_data.to_csv('./data/xAPI-Edu-Data_cleaned.csv',index=False)
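
CSV is a plain-text format, so the `category` and `int8` dtypes are not stored in the file and are inferred again on load. A sketch of restoring them with the `dtype` argument of `read_csv`; an in-memory buffer stands in for the file, and the two columns are a toy subset:

```python
import io

import pandas as pd

toy = pd.DataFrame({
    'gender': pd.Series(['M', 'F', 'M'], dtype='category'),
    'raised_hands': pd.Series([15, 20, 10], dtype='int8'),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default load: dtypes are inferred from the text
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Explicit dtypes restore the compact representation
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={'gender': 'category', 'raised_hands': 'int8'})

print(inferred.dtypes.to_dict())
print(restored.dtypes.to_dict())
```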

## Observations
Working with the data, we observe:

1. The feature names were not consistent
2. The category labels were also not consistent
3. The data was not optimized for memory usage
4. The target class is categorical with 3 labels
5. The predictors are a mix of categorical and numerical types

## Conclusion
Based on the observations:

1. The feature names were corrected for spelling mistakes and made consistent
2. The categorical labels were also corrected for spelling mistakes and made consistent
3. The data was optimized for better memory utilization by converting it to suitable dtypes
4. The cleaned data is stored in the file xAPI-Edu-Data_cleaned.csv

The operations performed prepare the data for further operations such as

* Data visualization and plotting
* Feature engineering
* Dimensionality reduction
