# Supervised Learning - Building a Student Performance Prediction System


# Classification vs. Regression
The aim of this project is to predict how likely a student is to pass. Which type of supervised learning problem is this, classification or regression? Why?
Answer:
This is a classification problem because the variable to predict, whether a student passes or fails, is categorical. In this case it is a dichotomous (binary) categorical variable whose only two possible values are "pass" and "fail".

### Overview:

1. Read the problem statement.
2. Get the dataset.
3. Explore the dataset.
4. Pre-processing of dataset.
5. Transform the dataset for building machine learning model.
6. Split data into train, test set.
7. Build Model.
8. Apply the model.
9. Evaluate the model.
10. Provide insights.

## Problem Statement 

Use logistic regression to **predict student performance**. The classification goal is to predict whether a student will pass or fail.

## Dataset 

This data describes student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and they were collected from school reports and questionnaires. The source provides two datasets, one for Mathematics and one for Portuguese language; the 395-record file used here appears to be the Mathematics one.

**Source:** https://archive.ics.uci.edu/ml/datasets/Student+Performance

# Question 1 - Exploring the Data (0.5 points)
*Read the dataset file using pandas. Take care about the delimiter.*

#### Answer:

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

In [2]:
student_df = pd.read_csv("students-data.csv", sep=';')
# The delimiter here is ';', passed by keyword because recent pandas versions no longer accept it positionally
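To see why the delimiter matters, the sketch below parses a tiny semicolon-separated sample with and without `sep=";"`. The two rows are made up in the same layout as the dataset; they are not taken from the real file.

```python
import io
import pandas as pd

# Two made-up rows in a semicolon-separated layout (not the real file).
sample = "school;sex;age\nGP;F;18\nGP;M;17\n"

# With the default comma separator, each line is read as one single column.
wrong = pd.read_csv(io.StringIO(sample))

# Passing sep=";" as a keyword splits the fields correctly.
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)  # one column only
print(right.shape)  # three columns
```

A quick look at `.shape` or `.head()` right after loading is usually enough to catch a wrong delimiter.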

In [3]:
student_df.head()

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


# Question 2 - drop missing values (0.5 points)
*Set the index name of the dataframe to **"number"**. Check the data for missing values and drop them if there are any.*
*Use the .dropna() function to drop the NAs.*

#### Answer:

In [4]:
student_df=student_df.rename_axis('number')
student_df.head()

number,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [5]:
student_df.shape

(395, 33)

In [6]:
student_df.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

In [7]:
student_df.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
age,395.0,16.696203,1.276043,15.0,16.0,17.0,18.0,22.0
Medu,395.0,2.749367,1.094735,0.0,2.0,3.0,4.0,4.0
Fedu,395.0,2.521519,1.088201,0.0,2.0,2.0,3.0,4.0
traveltime,395.0,1.448101,0.697505,1.0,1.0,1.0,2.0,4.0
studytime,395.0,2.035443,0.83924,1.0,1.0,2.0,2.0,4.0
failures,395.0,0.334177,0.743651,0.0,0.0,0.0,0.0,3.0
famrel,395.0,3.944304,0.896659,1.0,4.0,4.0,5.0,5.0
freetime,395.0,3.235443,0.998862,1.0,3.0,3.0,4.0,5.0
goout,395.0,3.108861,1.113278,1.0,2.0,3.0,4.0,5.0
Dalc,395.0,1.481013,0.890741,1.0,1.0,1.0,2.0,5.0


In [8]:
student_df.isna().sum()

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64

In [9]:
#student_df.dropna()
# No need to drop any rows, since there are no missing values
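For reference, here is a minimal sketch of what the check and the drop would look like if values were missing. The toy frame and its values are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing grade (illustrative values only).
toy = pd.DataFrame({"age": [18, 17, 15], "G3": [6.0, np.nan, 10.0]})

# Count the NaNs in each column.
missing_per_column = toy.isna().sum()

# dropna() returns a new frame; toy itself is left unchanged.
cleaned = toy.dropna()

print(missing_per_column)
print(len(toy), len(cleaned))
```

Because `dropna()` does not modify the frame in place by default, the result has to be assigned back for the drop to take effect.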

# Transform Data

## Question 3 (0.5 points)

*Print all the attribute names which are not numerical.*

**Hint:** check **select_dtypes()** and its **include** and **exclude** parameters.

#### Answer:

In [10]:
Sub = student_df.select_dtypes(include = 'object')

In [11]:
Sub.columns

Index(['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob',
       'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities',
       'nursery', 'higher', 'internet', 'romantic'],
      dtype='object')
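The hint also mentions the `exclude` parameter. A small sketch on a made-up frame shows the complementary approach, which avoids relying on the text columns being stored with the `object` dtype:

```python
import pandas as pd

# Made-up frame with two text columns and one numeric column.
toy = pd.DataFrame({"school": ["GP", "MS"], "age": [18, 17], "sex": ["F", "M"]})

# Excluding numeric dtypes keeps every non-numerical column.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```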

# Question 4 - Drop variables with low variance (0.5 points)

*Find the variance of each numerical independent variable and drop those whose variance is less than 1. Use the .var() function to check the variance.*

In [12]:
student_df.var(numeric_only=True)  # restrict to numeric columns; recent pandas raises on object columns otherwise

age            1.628285
Medu           1.198445
Fedu           1.184180
traveltime     0.486513
studytime      0.704324
failures       0.553017
famrel         0.803997
freetime       0.997725
goout          1.239388
Dalc           0.793420
Walc           1.658678
health         1.932944
absences      64.049541
G1            11.017053
G2            14.148917
G3            20.989616
dtype: float64

In [13]:
# traveltime, studytime, failures, famrel, freetime and Dalc have variance below 1, so drop those columns
std_df=student_df.drop(["traveltime","studytime","failures","famrel","freetime","Dalc"], axis=1)

In [14]:
std_df.head()

number,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,higher,internet,romantic,goout,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,yes,no,no,4,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,yes,no,3,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,yes,no,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,yes,2,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,yes,no,no,2,2,5,4,6,10,10


#### Variables with low variance take almost the same value for all records, so they contribute little to classification. Note that a fixed threshold of 1 depends on the scale of each variable.
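The list of columns to drop can also be derived from the variances instead of being typed by hand. The following is a sketch on a made-up frame; the column names mirror the dataset but the values are invented.

```python
import pandas as pd

# Made-up values: one nearly constant column, one spread-out column, one text column.
toy = pd.DataFrame({
    "traveltime": [1, 1, 2, 1],
    "absences": [6, 4, 10, 2],
    "school": ["GP", "GP", "MS", "GP"],
})

# Sample variance of the numeric columns only.
variances = toy.var(numeric_only=True)

# Names of the columns whose variance falls below the threshold of 1.
low_var_cols = variances[variances < 1].index.tolist()

# Drop them; non-numeric columns are untouched.
reduced = toy.drop(columns=low_var_cols)
print(low_var_cols)
print(reduced.columns.tolist())
```

Deriving the list this way keeps the drop consistent with the threshold if the data or the cutoff changes.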

# Question 6 - Encode all categorical variables to numerical (0.5 points)

Take the list of categorical attributes (from the above result) and convert them into numerical variables. After that, print the head of the dataframe and check the values.

**Hint:** check **sklearn LabelEncoder()**

#### Answer:

In [15]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
Sub3= Sub.apply(le.fit_transform)

In [16]:
Sub3.head()

number,school,sex,address,famsize,Pstatus,Mjob,Fjob,reason,guardian,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic
0,0,0,1,0,0,0,4,0,1,1,0,0,0,1,1,0,0
1,0,0,1,0,1,0,2,0,0,0,1,0,0,0,1,1,0
2,0,0,1,1,1,0,2,2,1,1,0,1,0,1,1,1,0
3,0,0,1,0,1,1,3,1,1,0,1,1,1,1,1,1,1
4,0,0,1,0,1,2,2,1,0,0,1,1,0,1,1,0,0
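Reusing a single `LabelEncoder` through `apply` refits it on every column, so afterwards it only remembers the labels of the last column. If the codes need to be mapped back to the original labels, one option is to keep one encoder per column. This is a sketch on made-up rows; the `encoders` dictionary is an illustrative name, not something defined in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows using two of the dataset's column names.
toy = pd.DataFrame({"sex": ["F", "M", "F"], "Mjob": ["teacher", "at_home", "other"]})

encoders = {}          # one fitted encoder per column (hypothetical helper dict)
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# Codes follow the sorted order of the labels, so they can be mapped back.
print(list(encoders["Mjob"].classes_))
print(list(encoders["sex"].inverse_transform(encoded["sex"])))
```

Integer codes also impose an arbitrary order on nominal attributes such as `Mjob`, which is worth keeping in mind when interpreting a linear model's coefficients.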


In [17]:
Sub2 = std_df.select_dtypes(include = 'number')

In [18]:
Sub2.head()

number,age,Medu,Fedu,goout,Walc,health,absences,G1,G2,G3
0,18,4,4,4,1,3,6,5,6,6
1,17,1,1,3,1,3,4,5,5,6
2,15,1,1,2,3,3,10,7,8,10
3,15,4,2,2,1,5,2,15,14,15
4,16,3,3,2,2,5,4,6,10,10


In [19]:
FinalStd_df=Sub2

In [20]:
FinalStd_df = FinalStd_df.join(Sub3, how='outer')

In [21]:
FinalStd_df.head()

number,age,Medu,Fedu,goout,Walc,health,absences,G1,G2,G3,...,reason,guardian,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic
0,18,4,4,4,1,3,6,5,6,6,...,0,1,1,0,0,0,1,1,0,0
1,17,1,1,3,1,3,4,5,5,6,...,0,0,0,1,0,0,0,1,1,0
2,15,1,1,2,3,3,10,7,8,10,...,2,1,1,0,1,0,1,1,1,0
3,15,4,2,2,1,5,2,15,14,15,...,1,1,0,1,1,1,1,1,1,1
4,16,3,3,2,2,5,4,6,10,10,...,1,0,0,1,1,0,1,1,0,0


# Question 7 - Convert the continuous values of grades into classes (1 point)

*Consider the values in G1, G2 and G3 with >= 10 as pass(1) and < 10 as fail(0) and encode them into binary values. Print head of dataframe to check the values.*

#### Answer:

In [22]:
Sub4 = FinalStd_df[["G1", "G2", "G3"]].copy()  # copy so the assignments below do not write to a slice of FinalStd_df

In [23]:
Sub4.head()

number,G1,G2,G3
0,5,6,6
1,5,5,6
2,7,8,10
3,15,14,15
4,6,10,10


In [24]:
FinalStd_df = FinalStd_df.drop(["G1","G2","G3"],axis=1)

In [25]:
Sub4['G1']=Sub4['G1'].apply(lambda x: x>=10)
Sub4['G2']=Sub4['G2'].apply(lambda x: x>=10)
Sub4['G3']=Sub4['G3'].apply(lambda x: x>=10)

In [26]:
Sub4.head()

number,G1,G2,G3
0,False,False,False
1,False,False,False
2,False,False,True
3,True,True,True
4,False,True,True


In [27]:
l2=LabelEncoder()
Sub5=Sub4.apply(l2.fit_transform)

In [28]:
Sub5.head()

number,G1,G2,G3
0,0,0,0
1,0,0,0
2,0,0,1
3,1,1,1
4,0,1,1
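The same pass/fail encoding can be obtained in one vectorized step, without `apply` or `LabelEncoder`. The sketch below uses the five rows shown in the head output above:

```python
import pandas as pd

# The first five rows of G1, G2 and G3 as printed above.
grades = pd.DataFrame({
    "G1": [5, 5, 7, 15, 6],
    "G2": [6, 5, 8, 14, 10],
    "G3": [6, 6, 10, 15, 10],
})

# Compare every cell with the pass mark, then cast True/False to 1/0.
binary = (grades >= 10).astype(int)
print(binary)
```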


In [29]:
FinalStd_df = FinalStd_df.join(Sub5, how='outer')

In [30]:
FinalStd_df.head()

number,age,Medu,Fedu,goout,Walc,health,absences,school,sex,address,...,famsup,paid,activities,nursery,higher,internet,romantic,G1,G2,G3
0,18,4,4,4,1,3,6,0,0,1,...,0,0,0,1,1,0,0,0,0,0
1,17,1,1,3,1,3,4,0,0,1,...,1,0,0,0,1,1,0,0,0,0
2,15,1,1,2,3,3,10,0,0,1,...,0,1,0,1,1,1,0,0,0,1
3,15,4,2,2,1,5,2,0,0,1,...,1,1,1,1,1,1,1,1,1,1
4,16,3,3,2,2,5,4,0,0,1,...,1,1,0,1,1,0,0,0,1,1


# Question 8 (0.5 points)

*Consider G3 as the target attribute and all remaining attributes as features to predict G3. Now, separate the feature and target attributes into separate dataframes named X and y.*

In [31]:
# Build X first: pop() removes G3 from FinalStd_df in place and returns it as y
X=FinalStd_df.drop("G3", axis =1)
y=FinalStd_df.pop("G3")

# Question 9 - Training and testing data split (0.5 points)

*So far, you have converted all categorical features into numeric values. Now, split the data into training and test sets with training size of 300 records. Print the number of train and test records.*

**Hint:** check **train_test_split()** from **sklearn**

#### Answer:

In [32]:
tsize = 1 - 300 / len(FinalStd_df)  # fraction of the 395 rows left for testing
print(round(tsize, 6))

0.240506


In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.240506, random_state=0)

In [34]:
X_train.shape

(300, 26)

In [35]:
y_train.shape

(300,)
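`train_test_split` also accepts an absolute number of records, which avoids computing a fraction by hand. The sketch below uses a synthetic stand-in with the same number of rows as the dataset; `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 395 rows, the same row count as the dataset.
X_demo = np.arange(395 * 2).reshape(395, 2)
y_demo = np.arange(395) % 2

# An integer train_size is interpreted as a number of records.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=300, random_state=0
)

print(X_tr.shape, X_te.shape)
```

The remaining 95 records go to the test set automatically.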

# Question 10 - Model Implementation and Testing the Accuracy (0.5 points)

*Build a **LogisticRegression** classifier using the **fit()** function in sklearn. You need to import both LogisticRegression and accuracy_score from sklearn.*

#### Answer:

In [36]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [37]:
y_predict = model.predict(X_test)
model_score = model.score(X_test, y_test)
print(model_score)

0.9263157894736842
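The question mentions `accuracy_score`, while the cell above uses `model.score`. For a classifier the two give the same number, as this sketch on synthetic data suggests. The data and the names `clf`, `X_demo` and `y_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic, roughly linearly separable data (not the student dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Fit on the first 150 rows, evaluate on the last 50.
clf = LogisticRegression().fit(X_demo[:150], y_demo[:150])
pred = clf.predict(X_demo[150:])

acc_metric = accuracy_score(y_demo[150:], pred)     # true labels vs predictions
acc_score = clf.score(X_demo[150:], y_demo[150:])   # same quantity via the model
print(acc_metric, acc_score)
```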


# Question 11 - Print the intercept of the Logistic regression model (0.5 points)

The intercept is stored in the fitted model itself. You can read it from the `.intercept_` attribute.

In [38]:
intercept = model.intercept_
print("The intercept for our model is {}".format(intercept))

The intercept for our model is [0.41811065]
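One way to read the intercept: it is the log-odds of passing when every feature equals zero, and the logistic function turns it into a probability. The sketch below applies that to the value printed above. Since the features are not centred (an age of zero is not realistic), this is only an illustration of the formula.

```python
import math

def sigmoid(z):
    """Logistic function mapping a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

intercept = 0.41811065  # value printed by the cell above

# Probability of "pass" implied by the intercept alone (all features at zero).
baseline_prob = sigmoid(intercept)
print(round(baseline_prob, 3))
```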


# Question 12 - Print the coefficients of the model (0.5 points) and name the coefficient which has the highest impact on the dependent variable (0.5 points)

Hint: Use .coef_ to get the coefficients and use pd.DataFrame to store the coefficients in a dataframe with the same column names as the independent variable dataframe

In [39]:
print("The coefficient for the model is ", model.coef_)

The coefficient for the model is  [[-0.15345949  0.0347859  -0.24052267 -0.31167846  0.52832848 -0.06772317
  -0.04138963 -0.22881781 -0.32456753  0.03428283 -0.09890149 -0.17995288
   0.019424    0.34987072  0.0635088   0.22197009 -0.44013349 -0.21887529
   0.26563637 -0.18532831 -0.01059829  0.77838426 -0.07773611 -0.43581071
   1.16037779  3.994352  ]]
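To name the coefficient with the highest impact, the values can be paired with the feature names. The sketch below copies the coefficients printed above and the column order of `X` as shown in the earlier outputs. In the notebook itself, `pd.DataFrame(model.coef_, columns=X.columns)` should give the same pairing.

```python
import pandas as pd

# Column order of X, as listed in the outputs above.
feature_names = [
    "age", "Medu", "Fedu", "goout", "Walc", "health", "absences",
    "school", "sex", "address", "famsize", "Pstatus", "Mjob", "Fjob",
    "reason", "guardian", "schoolsup", "famsup", "paid", "activities",
    "nursery", "higher", "internet", "romantic", "G1", "G2",
]

# Coefficients copied from the printed model.coef_ output.
coefficients = [
    -0.15345949, 0.0347859, -0.24052267, -0.31167846, 0.52832848, -0.06772317,
    -0.04138963, -0.22881781, -0.32456753, 0.03428283, -0.09890149, -0.17995288,
    0.019424, 0.34987072, 0.0635088, 0.22197009, -0.44013349, -0.21887529,
    0.26563637, -0.18532831, -0.01059829, 0.77838426, -0.07773611, -0.43581071,
    1.16037779, 3.994352,
]

coef_df = pd.DataFrame({"feature": feature_names, "coefficient": coefficients})

# Rank by absolute value so strong negative effects are not overlooked.
coef_df["abs_coefficient"] = coef_df["coefficient"].abs()
top = coef_df.sort_values("abs_coefficient", ascending=False).head(3)
print(top)
```

By this ranking G2 has the largest coefficient (3.994352), followed by G1. Because the features are on different scales, comparing raw magnitudes is only a rough guide to impact.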


# Question 13 - Predict the dependent variable for both training and test dataset (0.5 points)

`accuracy_score()` should help you print the accuracies.

In [40]:
y_predict1 = model.predict(X_train)
model_score1 = model.score(X_train, y_train)
print(y_predict1,model_score1)

[0 1 1 1 1 1 1 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0
 1 1 1 0 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1 0 0 0
 1 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 0 0 1 1 1
 1 1 1 1 0 0 1 0 0 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1
 1 0 1 1 1 0 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 0 1
 1 0 1 0 1 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0
 0 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 0 1 1 0 1
 0 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1
 0 1 1 1] 0.92


In [41]:
y_predict2 = model.predict(X_test)
model_score2 = model.score(X_test, y_test)
print(y_predict2,model_score2)

[1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 0 1 1 1 1 1 0 1 0 0 0 0 1 1 1
 1 0 1 0 1 0 0 1 0 1 0 1 0 0 1 1 0 1 0 0 1 1 1 1 1 1 1 0 1 1 1 0 0 0 1 0 0
 0 0 1 1 0 1 0 0 1 1 1 1 1 1 1 0 0 0 0 1 0] 0.9263157894736842
