
### Supervised Learning
### Activity: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail or pass. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:  Classification** 

### Question-2
load necessary Python libraries and load the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Read student data
dt = pd.read_csv('student-data.csv')
dt.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


### Question-3
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [3]:
# Calculate number of students
n_students = dt.shape[0]

In [4]:
# Calculate number of features
n_features = dt.shape[1]

In [5]:
# Calculate passing students
n_passed = dt.passed.value_counts()[0]

In [6]:
# Calculate failing students
n_failed = dt.passed.value_counts()[1]

In [7]:
# Calculate graduation rate
grad_rate = ((n_passed/n_students))*100

In [8]:
# Print the results
print('Total number of students in the data set is :',n_students)
print('Total number of features for each student is:',n_features)
print('The number of students who passed = ',n_passed)
print('The number of students who failed = ',n_failed)
print('Graduation rate of the class is: ',grad_rate)

Total number of students in the data set is : 395
Total number of features for each student is: 31
The number of students who passed =  265
The number of students who failed =  130
Graduation rate of the class is:  67.08860759493672


## Preparing the Data
you will prepare the data for modeling, training and testing.

### Question-4 Identify feature and target columns


separate the student data into feature and target columns to see if any features are non-numeric.

In [9]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 31 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

In [10]:
dt.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'passed'],
      dtype='object')

In [11]:
# Extract feature columns

In [12]:
dt[dt.columns[0:30]].head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,yes,no,no,4,3,4,1,1,3,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,yes,no,5,3,3,1,1,3,4
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,yes,no,4,3,2,2,3,3,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,yes,3,2,2,1,1,5,2
4,GP,F,16,U,GT3,T,3,3,other,other,...,yes,no,no,4,3,2,1,2,5,4


In [13]:
# Extract target column

In [14]:
pd.DataFrame(dt['passed']).head()

Unnamed: 0,passed
0,no
1,no
2,yes
3,yes
4,yes


In [15]:
# Separate the data into feature data and target data (X and y, respectively)

In [16]:
x=dt.drop(['passed'],axis=1)
y=pd.DataFrame(dt['passed'])

### Question-5 Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [17]:
# label encoding
from sklearn import preprocessing
lb_en = preprocessing.LabelEncoder()
z=dt.drop(['age', 'Medu', 'Fedu','Mjob', 'Fjob', 'reason', 'traveltime', 'studytime',
       'failures','famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'passed'],axis=1)
for i in z:
    z[i]=lb_en.fit_transform(z[i])
z=pd.DataFrame(z)
z.columns

Index(['school', 'sex', 'address', 'famsize', 'Pstatus', 'guardian',
       'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher',
       'internet', 'romantic'],
      dtype='object')

In [18]:
x=x.drop(['school', 'sex', 'address', 'famsize', 'Pstatus', 'guardian','nursery',
       'schoolsup', 'famsup', 'paid', 'activities', 'higher', 'internet',
       'romantic'],axis=1)
x=pd.concat([z,x],axis=1)
x.columns

Index(['school', 'sex', 'address', 'famsize', 'Pstatus', 'guardian',
       'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher',
       'internet', 'romantic', 'age', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason',
       'traveltime', 'studytime', 'failures', 'famrel', 'freetime', 'goout',
       'Dalc', 'Walc', 'health', 'absences'],
      dtype='object')

In [19]:
x.head()

Unnamed: 0,school,sex,address,famsize,Pstatus,guardian,schoolsup,famsup,paid,activities,...,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences
0,0,0,1,0,0,1,1,0,0,0,...,2,2,0,4,3,4,1,1,3,6
1,0,0,1,0,1,0,0,1,0,0,...,1,2,0,5,3,3,1,1,3,4
2,0,0,1,1,1,1,1,0,1,0,...,1,2,3,4,3,2,2,3,3,10
3,0,0,1,0,1,1,0,1,1,1,...,1,3,0,3,2,2,1,1,5,2
4,0,0,1,0,1,0,0,1,1,0,...,1,2,0,4,3,2,1,2,5,4


In [20]:
# one hot encoding
x = pd.get_dummies(x)
x.head()

Unnamed: 0,school,sex,address,famsize,Pstatus,guardian,schoolsup,famsup,paid,activities,...,Mjob_teacher,Fjob_at_home,Fjob_health,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation
0,0,0,1,0,0,1,1,0,0,0,...,0,0,0,0,0,1,1,0,0,0
1,0,0,1,0,1,0,0,1,0,0,...,0,0,0,1,0,0,1,0,0,0
2,0,0,1,1,1,1,1,0,1,0,...,0,0,0,1,0,0,0,0,1,0
3,0,0,1,0,1,1,0,1,1,1,...,0,0,0,0,1,0,0,1,0,0
4,0,0,1,0,1,0,0,1,1,0,...,0,0,0,1,0,0,0,1,0,0


### Question - 6 Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. you will need to implement the following:
- Randomly shuffle and split the data (`X`, `y`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [21]:
# splitting the data into train and test
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42,test_size=0.25)

In [22]:
# Show the results of the split
x_train.shape

(296, 41)

In [23]:
x_test.shape

(99, 41)

In [24]:
y_train.shape

(296, 1)

In [25]:
y_test.shape

(99, 1)

### Question - 7  Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.

###  Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

In [26]:
#explaination


In [27]:
# Import the three supervised learning models from sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score,classification_report

In [28]:
# Fitting Random forest model on training data
rf=RandomForestClassifier()
rf.fit(x_train,y_train)

  rf.fit(x_train,y_train)


RandomForestClassifier()

In [29]:
# predict on the data set
y_pred=rf.predict(x_test)

In [30]:
# calculate the accuracy score
print('Accuracy_score is',round(accuracy_score(y_test,y_pred),2))

Accuracy_score is 0.67


In [31]:
pd.Series(rf.feature_importances_,index=x.columns).sort_values(ascending=False)

absences             0.097056
failures             0.067108
goout                0.058928
age                  0.051249
Medu                 0.044536
freetime             0.041003
Walc                 0.038400
famrel               0.037505
health               0.037324
Fedu                 0.036771
studytime            0.030431
traveltime           0.026993
Dalc                 0.026988
schoolsup            0.023418
higher               0.022559
famsup               0.022430
guardian             0.020712
paid                 0.020532
sex                  0.020518
reason_course        0.018524
famsize              0.017761
Mjob_other           0.016567
romantic             0.016546
Fjob_other           0.016083
reason_reputation    0.015574
activities           0.014982
nursery              0.014760
Mjob_services        0.014067
address              0.013748
Fjob_services        0.013247
Mjob_teacher         0.011930
Mjob_at_home         0.011877
internet             0.011016
reason_hom

In [32]:
x=x.drop(['Fjob_health','Fjob_teacher','reason_other','school','Mjob_health','Pstatus','Fjob_at_home','Mjob_teacher'],axis=1)

In [33]:
x.columns

Index(['sex', 'address', 'famsize', 'guardian', 'schoolsup', 'famsup', 'paid',
       'activities', 'nursery', 'higher', 'internet', 'romantic', 'age',
       'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences',
       'Mjob_at_home', 'Mjob_other', 'Mjob_services', 'Fjob_other',
       'Fjob_services', 'reason_course', 'reason_home', 'reason_reputation'],
      dtype='object')

In [34]:
x.describe()

Unnamed: 0,sex,address,famsize,guardian,schoolsup,famsup,paid,activities,nursery,higher,...,health,absences,Mjob_at_home,Mjob_other,Mjob_services,Fjob_other,Fjob_services,reason_course,reason_home,reason_reputation
count,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,...,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0
mean,0.473418,0.777215,0.288608,0.853165,0.129114,0.612658,0.458228,0.508861,0.794937,0.949367,...,3.55443,5.708861,0.149367,0.356962,0.260759,0.549367,0.281013,0.367089,0.275949,0.265823
std,0.499926,0.416643,0.45369,0.536684,0.335751,0.487761,0.498884,0.500555,0.40426,0.219525,...,1.390303,8.003096,0.356902,0.479711,0.439606,0.498188,0.450064,0.482622,0.447558,0.442331
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,...,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,...,4.0,4.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
75%,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,5.0,8.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,...,5.0,75.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [35]:
rf=RandomForestClassifier()
rf.fit(x_train,y_train)
y_pred=rf.predict(x_test)

  rf.fit(x_train,y_train)


In [36]:
print('Accuracy_score is',round(accuracy_score(y_test,y_pred),2))

Accuracy_score is 0.67


In [37]:
# fitting  Logistic Regression on traning data and predict on the test data and measure the accuracy

In [38]:
logit_model = LogisticRegression()
logit_model.fit(x_train,y_train)
y_pred = logit_model.predict(x_test)

  return f(*args, **kwargs)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [39]:
print('Accuracy_score is',round(accuracy_score(y_test,y_pred),2))

Accuracy_score is 0.69


In [40]:
# Standard Scaling

In [41]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

In [42]:
logit_model = LogisticRegression()
logit_model.fit(x_train,y_train)
y_pred = logit_model.predict(x_test)

  return f(*args, **kwargs)


In [43]:
print('Accuracy_score is',round(accuracy_score(y_test,y_pred),2))

Accuracy_score is 0.74


In [44]:
# fitting  SVM - linear on traning data and predict on the test data and measure the accuracy

In [45]:
svc_linear = SVC(kernel='linear')
svc_linear.fit(x_train,y_train)
y_pred = svc_linear.predict(x_test)

  return f(*args, **kwargs)


In [46]:
print('Accuracy_score is',round(accuracy_score(y_test,y_pred),2))

Accuracy_score is 0.68
