# Introduction
* In this __kernel__ I explore the dataset and try to understand the relationships between its features.
* I perform an [EDA](https://en.wikipedia.org/wiki/Exploratory_data_analysis) (Exploratory Data Analysis) and then try several commonly used Machine Learning __classifiers__.
* Finally, I build a __Model__ with a high __accuracy score__.

## Dataset Information
> This is an educational dataset collected from a learning management system (LMS) called __Kalboard 360__.

* The dataset consists of __480__ student records and __16__ features
* No __null__ or __empty__ values
* Features are classified into __three__ major categories: 

    1. __Demographic__ features such as gender and nationality. 
    2. __Academic background__ features such as educational stage, grade Level and section. 
    3. __Behavioral features__ such as raising hands in class, opening resources, parents answering the survey, and school satisfaction.
 
* The dataset consists of __305__ males and __175__ females
* Most students come from __Kuwait__ and __Jordan__
* The dataset was collected over two educational __semesters__
* Students are classified into three numerical intervals based on their total __grade/mark__:
    1. __Low-Level__: interval includes values from __0__ to __69__
    2. __Middle-Level__: interval includes values from __70__ to __89__
    3. __High-Level__: interval includes values from __90__ to __100__
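
The three intervals above can be written as a small lookup function. This is only an illustrative sketch: the function name `mark_to_class` is hypothetical, it assumes integer marks, and the letter codes `L`, `M`, `H` are the ones found in the `Class` column used later in this notebook.

```python
def mark_to_class(mark):
    """Map a total mark (0-100) to the class label used in the dataset."""
    if not 0 <= mark <= 100:
        raise ValueError(f"mark must be between 0 and 100, got {mark}")
    if mark <= 69:
        return "L"  # Low-Level: 0 to 69
    if mark <= 89:
        return "M"  # Middle-Level: 70 to 89
    return "H"      # High-Level: 90 to 100


# Check the boundaries of each interval.
for mark in (0, 69, 70, 89, 90, 100):
    print(mark, mark_to_class(mark))
```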



In [None]:
# Loading necessary packages 

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

%matplotlib inline
import seaborn as sns
sns.set()

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier, plot_importance
from sklearn.neural_network import MLPClassifier

# No imputer is needed: the dataset has no null or empty values
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import confusion_matrix

In [None]:
# loading dataset
dataset = pd.read_csv("../input/xAPI-Edu-Data.csv")

In [None]:
# A summary of the dataset
dataset.info()

In [None]:
# Brief description of the numeric features
dataset.describe()

In [None]:
dataset.plot.bar(stacked=True, figsize=(20,10))

In [None]:
fig, ax = plt.subplots(figsize=(20, 8))

dataset["raisedhands"].value_counts().sort_index().plot.bar(
    ax=ax
)
ax.set_title("Distribution of hand raises per student", fontsize=18)
ax.set_xlabel("Number of times a student raised their hand", fontsize=14)
ax.set_ylabel("Number of students", fontsize=14)

In [None]:
fig, ax = plt.subplots(figsize=(20, 8))

dataset["VisITedResources"].value_counts().sort_index().plot.bar(
    ax=ax
)
ax.set_xlabel("Number of times a student visited resources", fontsize=14)
ax.set_ylabel("Number of students", fontsize=14)

> ### Before jumping into __Data Cleaning__ and __Feature Engineering__, let's build a model on only the 3 features (raisedhands, VisITedResources, AnnouncementsView) that this [paper](https://github.com/78526Nasir/Kaggle-Student-s-Academic-Performance/blob/master/related%20resarch%20paper/Classify%20the%20Category%20of%20Students%20%20p28-alam.pdf) describes as the most effective variables

In [None]:
top_features = ["raisedhands","VisITedResources","AnnouncementsView"]
features = dataset[top_features]
labels = dataset["Class"]
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size = .20, random_state=0)

In [None]:
# model build with SVM.SVC classifier

clf = SVC(gamma='auto', kernel = 'linear')
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

In [None]:
accuracy_score(pred, labels_test)

In [None]:
# Random Forest Classifier with 200 subtrees

clf = RandomForestClassifier(n_estimators = 200)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy_score(pred, labels_test)

In [None]:
# Logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy_score(pred, labels_test)

In [None]:
# Multi-layer Perceptron classifier with (30, 30, 20) hidden layers

clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(30, 30, 20), random_state=1)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy_score(pred, labels_test)

In [None]:
# XGBoost Classifier

clf = XGBClassifier(max_depth=15, learning_rate=0.1, n_estimators=200, seed=10)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy_score(pred, labels_test)

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
plot_importance(clf, ax = ax)

> ### Best accuracy so far with the reduced-feature model: 0.6875
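
For reference, `accuracy_score` is the fraction of test rows whose predicted label matches the true label. The sketch below recomputes that fraction by hand on made-up labels (the two label lists are hypothetical, not this notebook's actual predictions). Assuming the 96-row test split that `test_size = .20` gives on 480 records, an accuracy of 0.6875 corresponds to 66 correct predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 96 test rows, the first 66 predicted correctly.
y_true = ["M"] * 96
y_pred = ["M"] * 66 + ["L"] * 30

print(accuracy(y_true, y_pred))  # 66 / 96 = 0.6875
```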

## Now let's dive deeper into the dataset, clean the data, and do some feature engineering

In [None]:
dataset.head()

In [None]:
dataset.groupby("gender").count()

In [None]:
def gen_bar(feature, size):
    highest = dataset[dataset["Class"]=="H"][feature].value_counts()
    medium = dataset[dataset["Class"]=="M"][feature].value_counts()
    low = dataset[dataset["Class"]=="L"][feature].value_counts()
    
    df = pd.DataFrame([highest, medium, low])
    df.index = ["highest","medium", "low"]
    df.plot(kind='bar',stacked=True, figsize=(size[0], size[1]))

In [None]:
gen_bar("gender",[6,5])

> This bar chart shows that __male__ students fall into the "medium" and "low" categories more often than __female__ students.
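
Raw counts are hard to compare when one group is larger than the other (305 males versus 175 females). A minimal sketch of a per-gender share table follows, using `pd.crosstab` with `normalize="index"`; the small `toy` frame is made-up data standing in for the real dataset.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustration only).
toy = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F"],
    "Class":  ["L", "M", "M", "H", "H", "M"],
})

# Each row sums to 1, so the two genders can be compared directly
# even though they contain different numbers of students.
shares = pd.crosstab(toy["gender"], toy["Class"], normalize="index")
print(shares)
```

On the real data the same call would be `pd.crosstab(dataset["gender"], dataset["Class"], normalize="index")`.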

In [None]:
# lets map the gender
gender_map = {"F" : 0, "M" : 1}
dataset["gender"] = dataset["gender"].map(gender_map)

In [None]:
dataset["gender"].value_counts().sort_index().plot.bar()

> ### __gender__ is done, let's move on to "NationalITy"

In [None]:
dataset["NationalITy"].describe()

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))

dataset["NationalITy"].value_counts().sort_index().plot.bar(
    ax = ax
)

> We can see that most of the __students__ are from __Jordan__ and __Kuwait__

In [None]:
dataset["NationalITy"] = dataset["NationalITy"].replace(["Jordan","KW"], "0")
dataset["NationalITy"] = dataset["NationalITy"].replace(["Iraq","Palestine"], "1")
dataset["NationalITy"] = dataset["NationalITy"].replace(["Tunis","lebanon", "SaudiArabia"], "2")
dataset["NationalITy"] = dataset["NationalITy"].replace(["Syria","Egypt","Iran","Morocco","USA","venzuela","Lybia"], "3")

dataset["NationalITy"] = dataset["NationalITy"].astype(int)

In [None]:
fig, ax = plt.subplots(figsize=(5, 5))
dataset["NationalITy"].value_counts().sort_index().plot.bar(
    ax = ax
)

> There is a small difference between the __PlaceofBirth__ and __NationalITy__ values, but we can ignore it and simply delete the __PlaceofBirth__ feature.
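
The size of that difference can be measured before dropping a column. The sketch below counts the rows where two columns disagree. The `toy` frame is made-up data, and it assumes both columns spell each country the same way; on the real data the check would have to run before `NationalITy` is recoded to integers.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustration only).
toy = pd.DataFrame({
    "NationalITy":  ["KW", "Jordan", "Iraq", "KW"],
    "PlaceofBirth": ["KW", "Jordan", "Jordan", "KW"],
})

# Boolean Series: True where the two columns differ.
mismatch = toy["NationalITy"] != toy["PlaceofBirth"]
print(f"{mismatch.sum()} of {len(toy)} rows differ ({mismatch.mean():.0%})")
```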

In [None]:
del dataset["PlaceofBirth"]

### Let's explore __StageID__

In [None]:
dataset["StageID"].value_counts().sort_index().plot.bar()

In [None]:
stage_map = {"HighSchool" : 0, "MiddleSchool" : 1, "lowerlevel": 2}
dataset["StageID"] = dataset["StageID"].map(stage_map)

### Working with GradeID

In [None]:
gen_bar("GradeID",[8,8])

In [None]:
dataset["GradeID"] = dataset["GradeID"].replace(["G-02","G-08","G-07"], "0")
dataset["GradeID"] = dataset["GradeID"].replace(["G-04","G-06"], "1")
dataset["GradeID"] = dataset["GradeID"].replace(["G-05","G-11", "G-12","G-09","G-10"], "2")

dataset["GradeID"] = dataset["GradeID"].astype(int)

### Working with SectionID

In [None]:
dataset.groupby("SectionID").count()

In [None]:
section_map = {"A":0, "B":1, "C":2}
dataset["SectionID"] = dataset["SectionID"].map(section_map)

### Working with Topic

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))
dataset["Topic"].value_counts().sort_index().plot.bar(ax = ax)

In [None]:
gen_bar("Topic", [8,10])

In [None]:
pd.crosstab(dataset["Class"],dataset["Topic"])

In [None]:
topic_map = {"IT":0, "Arabic":1, "French":2, "English":3, "Biology":4, "Science":5, "Chemistry":6, "Quran":7, "Geology":8, "History":9,"Math":9,"Spanish":9}
dataset["Topic"] = dataset["Topic"].map(topic_map)

In [None]:
dataset.groupby("Topic").count()

In [None]:
facet = sns.FacetGrid(dataset, hue="Class",aspect=4)
facet.map(sns.kdeplot,"Topic",shade= True)
facet.set(xlim=(0, dataset["Topic"].max()))
facet.add_legend()

plt.show()

### Working with Semester

In [None]:
dataset.groupby("Semester").count()

In [None]:
pd.crosstab(dataset["Class"], dataset["Semester"])

In [None]:
semester_map = {"F":0, "S":1}
dataset["Semester"] = dataset["Semester"].map(semester_map)

### Working with Relation Feature

In [None]:
dataset["Relation"].value_counts().sort_index().plot.bar()

In [None]:
relation_map = {"Father":0, "Mum":1}
dataset["Relation"] = dataset["Relation"].map(relation_map)

### "raisedhands", "VisITedResources", "AnnouncementsView", "Discussion" are already in decent form
<hr>
### Working with "ParentschoolSatisfaction"

In [None]:
dataset["ParentschoolSatisfaction"].nunique()

In [None]:
parent_ss_map = {"Bad": 0, "Good":1}
dataset["ParentschoolSatisfaction"] = dataset["ParentschoolSatisfaction"].map(parent_ss_map)

In [None]:
dataset.groupby("ParentschoolSatisfaction").count()

In [None]:
gen_bar("ParentschoolSatisfaction", [5,5])

### Working with "ParentAnsweringSurvey"

In [None]:
dataset.groupby("ParentAnsweringSurvey").count()

In [None]:
parent_a_s_map = {"No":0, "Yes":1}
dataset["ParentAnsweringSurvey"] = dataset["ParentAnsweringSurvey"].map(parent_a_s_map)

In [None]:
dataset["ParentAnsweringSurvey"].value_counts().sort_index().plot.bar()

### Working with StudentAbsenceDays Feature

In [None]:
dataset.groupby("StudentAbsenceDays").count()

In [None]:
student_absn_day_map = {"Above-7":0, "Under-7":1} 
dataset["StudentAbsenceDays"] = dataset["StudentAbsenceDays"].map(student_absn_day_map)

In [None]:
dataset.groupby("StudentAbsenceDays").count()

In [None]:
dataset.head()

### Last but not least: working with the "Class" feature

In [None]:
dataset.groupby("Class").count()

In [None]:
class_map = {"H":0, "M":1, "L":2}
dataset["Class"] = dataset["Class"].map(class_map)

In [None]:
dataset.groupby("Class").count()

## Data Cleaning almost done.
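
`Series.map` returns `NaN` for any value that is missing from the mapping dictionary, for example a category spelled with different capitalization. A quick check along the following lines can catch that before modelling. The `toy` frame and the deliberately incomplete `partial_stage_map` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"StageID": ["HighSchool", "MiddleSchool", "lowerlevel"]})

# Deliberately incomplete mapping: "lowerlevel" is missing.
partial_stage_map = {"HighSchool": 0, "MiddleSchool": 1}
toy["StageID"] = toy["StageID"].map(partial_stage_map)

# Count NaN values per column; any non-zero entry points at an unmapped value.
unmapped = toy.isna().sum()
print(unmapped[unmapped > 0])
```

On the real data, `dataset.isna().sum()` should show zero for every column if all the maps above were complete.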

### Finding correlation
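
`DataFrame.corr()` computes the Pearson coefficient by default. As a reminder of what that number means, here is a dependency-free sketch of the formula. The helper name `pearson` and the sample values are illustrative only, and the helper does not handle constant inputs (zero variance).

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations.

    The 1/n factors cancel, so plain sums of deviations are enough.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


raised_hands = [10, 20, 30, 40]      # made-up values
visited_resources = [5, 15, 20, 40]  # made-up values
print(round(pearson(raised_hands, visited_resources), 3))
```

Values close to 1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.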

In [None]:
dataset.corr()

In [None]:
fig, ax = plt.subplots(figsize = (15, 10))
ax = sns.heatmap(dataset.corr())

In [None]:
X = dataset.drop(columns=["Class"])  # every remaining column except the target
y = dataset["Class"]

features_train, features_test, labels_train, labels_test = train_test_split(X, y, test_size = .20, random_state=0)


In [None]:
# model build with SVM.SVC classifier

clf = SVC(gamma='auto', kernel = 'linear')
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

In [None]:
accuracy_score(pred, labels_test)

In [None]:
# Random Forest Classifier with 220 subtrees

clf = RandomForestClassifier(n_estimators = 220, random_state=10)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
rfc_pred = pred
accuracy_score(pred, labels_test)

In [None]:
# Logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy_score(pred, labels_test)

In [None]:
# Multi-layer Perceptron classifier with (10,10,10) hidden layers

clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(10,10,10), random_state=1)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy_score(pred, labels_test)

In [None]:
clf = XGBClassifier(max_depth=5, learning_rate=0.2, n_estimators=20, seed=10)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
xgb_pred = pred
accuracy_score(pred, labels_test)

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
plot_importance(clf, ax = ax)

In [None]:
# Random Forest Classifier confusion matrix
# All three classes are listed: H (0), M (1), L (2)
confusion_matrix(labels_test, rfc_pred, labels=[0, 1, 2])

In [None]:
# Random Forest Classifier mean_absolute_error
mean_absolute_error(rfc_pred, labels_test)

In [None]:
# XGBoost Classifier confusion matrix
# All three classes are listed: H (0), M (1), L (2)
confusion_matrix(labels_test, xgb_pred, labels=[0, 1, 2])

In [None]:
# XGBoost Classifier mean_absolute_error
mean_absolute_error(xgb_pred, labels_test)

## So finally, the Random Forest Classifier gives us the highest accuracy: 0.834
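
Accuracy alone does not show which class is being misclassified. The sketch below reads accuracy and per-class recall off a confusion matrix laid out the way scikit-learn returns it (rows are true classes, columns are predicted classes). The matrix values are made up for illustration and are not results from this notebook.

```python
# Made-up 3x3 confusion matrix (rows = true class, columns = predicted class),
# ordered H, M, L to match class_map = {"H": 0, "M": 1, "L": 2}.
labels = ["H", "M", "L"]
matrix = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total  # diagonal entries are the correct predictions

# Recall per class: correct predictions for the class / all true members of it.
recall = {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}

print(accuracy)
print(recall)
```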