
### Supervised Learning
### Activity: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:**

This is a classification problem. The target variable, `passed`, is categorical with two values (`yes`/`no`), so we have to predict a discrete class (whether a student passes or fails) rather than a continuous quantity.

### Question-2
Load the necessary Python libraries and the student data. Note that the last column of this dataset, `'passed'`, will be our target label (whether or not the student graduated). All other columns are features describing each student.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
# Read student data
data=pd.read_csv('student-data.csv')
data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


In [3]:
data.shape

(395, 31)

### Question-3
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [4]:
# Calculate number of students
n_students=len(data)
n_students


395

In [5]:
# Calculate number of features
n_features=(len(data.columns[:-1]))
n_features

30

In [6]:
# Calculate passing students
n_passed=(data['passed']=='yes').sum()
n_passed


265

In [7]:
# Calculate failing students
n_failed=(data['passed']=='no').sum()
n_failed

130

In [8]:
# Calculate graduation rate
graduation_rate=(n_passed/n_students)*100
graduation_rate

67.08860759493672

In [9]:
# Print the results
print(f'total students are -{n_students}')
print(f'total no of features are -{n_features}')
print(f'no of students passed -{n_passed}')
print(f'no of students failed -{n_failed}')
print(f'graduation rate is-{np.round(graduation_rate,2)}')

total students are -395
total no of features are -30
no of students passed -265
no of students failed -130
graduation rate is-67.09


## Preparing the Data
In this section, you will prepare the data for modeling, training, and testing.

### Question-4 Identify feature and target columns


Separate the student data into feature and target columns to see if any features are non-numeric.

In [10]:
# Extract feature columns

In [11]:
feature_column=list(data.drop('passed',axis=1))
feature_column

['school',
 'sex',
 'age',
 'address',
 'famsize',
 'Pstatus',
 'Medu',
 'Fedu',
 'Mjob',
 'Fjob',
 'reason',
 'guardian',
 'traveltime',
 'studytime',
 'failures',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic',
 'famrel',
 'freetime',
 'goout',
 'Dalc',
 'Walc',
 'health',
 'absences']

In [12]:
# Extract target column 'passed'
target_column=data.columns[-1]
target_column

'passed'

In [13]:
# Separate the data into feature data and target data (X and y, respectively)

In [14]:
X=data[feature_column]
Y=data[target_column]
X.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
dtype: object

In [15]:
Y

0       no
1       no
2      yes
3      yes
4      yes
      ... 
390     no
391    yes
392     no
393    yes
394     no
Name: passed, Length: 395, dtype: object

### Question-5 Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.
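Before running the notebook's own preprocessing cells, the two conversions described above can be seen on a tiny example. This is a minimal sketch on a made-up three-row frame (the values are hypothetical, only the column names `internet` and `Fjob` come from the dataset); it is not the notebook's actual routine.

```python
import pandas as pd

# Toy frame with hypothetical values, mirroring two column types from the dataset.
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],          # binary yes/no column
    "Fjob": ["teacher", "other", "services"],  # column with more than two values
})

# Binary yes/no column -> 1/0.
toy["internet"] = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued column -> one indicator (dummy) column per category.
toy_encoded = pd.get_dummies(toy, columns=["Fjob"], dtype=int)
print(toy_encoded)
```

Each row ends up with exactly one `1` among the `Fjob_*` columns, which is the "one column per possible value" layout described in the text.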

In [16]:
data.dtypes # datatypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
passed        object
dtype: object

In [17]:
data_obj=data[['school','Mjob','Fjob','reason','sex','address','guardian','famsize','romantic','Pstatus','schoolsup','famsup','paid','activities','nursery','higher','internet']]
data_obj.head() # preview the object (string) columns

Unnamed: 0,school,Mjob,Fjob,reason,sex,address,guardian,famsize,romantic,Pstatus,schoolsup,famsup,paid,activities,nursery,higher,internet
0,GP,at_home,teacher,course,F,U,mother,GT3,no,A,yes,no,no,no,yes,yes,no
1,GP,at_home,other,course,F,U,father,GT3,no,T,no,yes,no,no,no,yes,yes
2,GP,at_home,other,other,F,U,mother,LE3,no,T,yes,no,yes,no,yes,yes,yes
3,GP,health,services,home,F,U,mother,GT3,yes,T,no,yes,yes,yes,yes,yes,yes
4,GP,other,other,home,F,U,father,GT3,no,T,no,yes,yes,no,yes,yes,no


In [18]:
data_obj['famsize'].value_counts()

GT3    281
LE3    114
Name: famsize, dtype: int64

In [19]:
# Label-encode the two-valued (binary) feature columns as 0/1
from sklearn.preprocessing import LabelEncoder
label_en=LabelEncoder()
data_lben=pd.DataFrame()
for i in ['school','Pstatus','schoolsup','famsup','paid','activities','nursery','higher','internet','sex','address','famsize','romantic']:
    data_lben[i]=label_en.fit_transform(data_obj[i])
data_lben

Unnamed: 0,school,Pstatus,schoolsup,famsup,paid,activities,nursery,higher,internet,sex,address,famsize,romantic
0,0,0,1,0,0,0,1,1,0,0,1,0,0
1,0,1,0,1,0,0,0,1,1,0,1,0,0
2,0,1,1,0,1,0,1,1,1,0,1,1,0
3,0,1,0,1,1,1,1,1,1,0,1,0,1
4,0,1,0,1,1,0,1,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,1,0,0,1,1,0,1,1,0,1,1,1,0
391,1,1,0,0,0,0,0,1,1,1,1,1,0
392,1,1,0,0,0,0,0,1,0,1,0,0,0
393,1,1,0,0,0,0,0,1,1,1,0,1,0


In [20]:
# Label-encode the target: 'no' -> 0, 'yes' -> 1
from sklearn.preprocessing import LabelEncoder
label_en=LabelEncoder()
Y=label_en.fit_transform(Y)
Y=pd.DataFrame(Y,columns=['passed'])
Y

Unnamed: 0,passed
0,0
1,0
2,1
3,1
4,1
...,...
390,0
391,1
392,0
393,1


In [21]:
#one hot encoding
data_hoten=pd.get_dummies(X[['Mjob','Fjob','reason','guardian']])
data_hoten


Unnamed: 0,Mjob_at_home,Mjob_health,Mjob_other,Mjob_services,Mjob_teacher,Fjob_at_home,Fjob_health,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0
1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0
2,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0
3,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0
4,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1
391,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0
392,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1
393,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0


In [22]:
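# Combine the label-encoded and one-hot-encoded columns.
# Note: the 13 numeric columns (age, Medu, ..., absences) are not included here,
# so X holds only the 30 encoded categorical columns.
X=pd.concat([data_lben,data_hoten],axis=1)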
X.head()

Unnamed: 0,school,Pstatus,schoolsup,famsup,paid,activities,nursery,higher,internet,sex,...,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
0,0,0,1,0,0,0,1,1,0,0,...,0,0,1,1,0,0,0,0,1,0
1,0,1,0,1,0,0,0,1,1,0,...,1,0,0,1,0,0,0,1,0,0
2,0,1,1,0,1,0,1,1,1,0,...,1,0,0,0,0,1,0,0,1,0
3,0,1,0,1,1,1,1,1,1,0,...,0,1,0,0,1,0,0,0,1,0
4,0,1,0,1,1,0,1,1,0,0,...,1,0,0,0,1,0,0,1,0,0
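The feature matrix built in this notebook contains only the encoded categorical columns. If the numeric columns such as `age` or `absences` should also reach the models, one possible approach is to encode the non-numeric columns in place and leave the rest untouched. The helper below is a hypothetical sketch (the name `encode_features` and the toy values are assumptions, not part of the notebook), shown on a small frame rather than the real data.

```python
import pandas as pd

def encode_features(features: pd.DataFrame) -> pd.DataFrame:
    """Encode non-numeric columns and keep numeric columns unchanged."""
    encoded = features.copy()
    text_cols = encoded.select_dtypes(exclude="number").columns

    # Two-valued columns become 0/1, ordered alphabetically
    # (the same ordering LabelEncoder would produce).
    binary_cols = [c for c in text_cols if encoded[c].nunique() == 2]
    for col in binary_cols:
        first, second = sorted(encoded[col].unique())
        encoded[col] = encoded[col].map({first: 0, second: 1})

    # Columns with more than two values become dummy variables.
    multi_cols = [c for c in text_cols if c not in binary_cols]
    return pd.get_dummies(encoded, columns=multi_cols, dtype=int)

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "age": [18, 17, 15],
    "internet": ["no", "yes", "yes"],
    "Mjob": ["at_home", "health", "other"],
})
X_all = encode_features(toy)
print(X_all)
```

Applied to the full feature frame, a routine like this would yield more than 30 columns, so the shapes and scores reported in this notebook would change.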


### Question - 6 Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. You will need to implement the following:
- Randomly shuffle and split the data (`X`, `y`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [23]:
# splitting the data into train and test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=.24,random_state=2)

In [24]:
X_train.shape

(300, 30)

In [25]:
X_test.shape

(95, 30)

In [26]:
# Show the results of the split
X_train.head()

Unnamed: 0,school,Pstatus,schoolsup,famsup,paid,activities,nursery,higher,internet,sex,...,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
171,0,1,0,1,1,1,1,1,1,1,...,1,0,0,0,0,0,1,0,1,0
12,0,1,0,1,1,1,1,1,1,1,...,0,1,0,1,0,0,0,1,0,0
13,0,1,0,1,1,0,1,1,1,1,...,1,0,0,1,0,0,0,0,1,0
151,0,1,0,0,0,1,1,1,0,1,...,1,0,0,1,0,0,0,0,1,0
310,0,1,0,0,0,1,0,1,0,0,...,0,1,0,0,1,0,0,0,0,1


In [27]:
X_test.head()

Unnamed: 0,school,Pstatus,schoolsup,famsup,paid,activities,nursery,higher,internet,sex,...,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
94,0,1,0,1,0,1,1,1,1,1,...,0,0,0,0,0,0,1,0,1,0
32,0,1,0,1,0,1,1,1,1,1,...,0,0,0,1,0,0,0,0,1,0
222,0,1,1,0,0,0,1,1,1,0,...,0,0,1,0,0,1,0,0,1,0
329,0,1,0,1,1,0,0,1,1,0,...,0,0,1,1,0,0,0,0,1,0
369,1,1,0,1,1,0,0,1,1,0,...,0,0,1,0,0,1,0,1,0,0


In [28]:
y_train.head()

Unnamed: 0,passed
171,1
12,1
13,1
151,1
310,0


In [29]:
y_test.head()

Unnamed: 0,passed
94,1
32,1
222,1
329,1
369,1
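The notebook reaches the 300/95 split through `test_size=.24`. As an alternative sketch, `train_test_split` also accepts an integer `test_size`, which requests the 95 test points directly, and `stratify` keeps the pass/fail ratio similar in both subsets. The arrays below are synthetic placeholders with the same shape and class counts as the student data, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 395 rows, 30 binary features, 265 passes and 130 fails.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(395, 30))
y_demo = np.array([1] * 265 + [0] * 130)

# An integer test_size asks for exactly that many test samples.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, stratify=y_demo, random_state=2
)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # pass rate in each subset
```

Whether to stratify is a design choice; with only 95 test points, it reduces the chance that the test set has an unrepresentative share of failing students.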


### Question - 7  Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.


1) Logistic Regression:

Strengths: Outputs have a straightforward probabilistic interpretation; can be regularized to avoid overfitting; can be updated easily with new data using stochastic gradient descent [2]; fast to train; performs well with a small number of observations.

Weaknesses: "Logistic regression tends to underperform when there are multiple or non-linear decision boundaries" [2]; not flexible enough to capture more complex relationships; has difficulty handling noisy data.

Applications: Investigating high employee turnover [12]; spam detection [9]; credit card fraud detection [9].

Despite the word "regression" in its name, Logistic Regression is a linear model for classification. At its core is the logistic (sigmoid) function, an S-shaped curve that maps any real-valued input to a value strictly between 0 and 1. Input values are combined linearly using different weights to predict a binary output, and the coefficient for each input must be learned from the training data. Logistic Regression was one of my choices because we have few training samples and it is a simple model that works well in that setting; it is fast to train and to predict with, and if the relationship between the features and the target is not complex, it can be enough.
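The mapping from a linear combination of inputs to a probability can be sketched in a few lines. The weights, bias, and input below are made-up numbers chosen for illustration, not coefficients learned from the student data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one hypothetical student with two features.
weights = np.array([0.8, -1.2])
bias = 0.1
x = np.array([1.0, 0.0])

z = weights @ x + bias       # linear combination of the inputs
prob = sigmoid(z)            # estimated probability of class 1 ("passed")
label = int(prob >= 0.5)     # threshold at 0.5 for the binary decision

print(round(float(prob), 3), label)
```

Training a logistic regression amounts to choosing `weights` and `bias` so that these probabilities match the observed labels as closely as possible.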


2) Decision Tree:

Strengths: Simple model; easy to interpret and explain; simple to tune; fast for a small number of training samples; works well with missing values; works well with qualitative features.

Weaknesses: Overfits easily without tuning; not well suited to big-data problems.

Applications: Star-galaxy classification [11]; control of nonlinear dynamical systems [11]; medical diagnosis [11].

Decision Tree models can be used for both classification and regression problems. They are used for inductive inference, and ID3 is a very popular algorithm for building them. For classification, a tree represents a series of "if-then" rules that result in a final decision. The tree is constructed by repeatedly splitting the data into separate branches that maximize the information gain, so that the samples within a branch are as similar as possible [7,8]. This model was chosen because it works well with a small amount of training data, its results are simple to interpret and explain, and it handles the categorical features in the data well.
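Information gain, the splitting criterion mentioned above, can be computed by hand for a small example. The sketch below uses entropy as in ID3; note that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default, so this illustrates the idea rather than the exact computation the notebook's model performs. The labels and splits are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction from splitting `labels` into mask / not-mask branches."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    left, right = labels[mask], labels[~mask]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Six hypothetical students: 1 = passed, 0 = failed.
y_toy = [1, 1, 1, 0, 0, 0]
clean_split = [True, True, True, False, False, False]  # separates the classes
mixed_split = [True, False, True, False, True, False]  # barely separates them

gain_clean = information_gain(y_toy, clean_split)
gain_mixed = information_gain(y_toy, mixed_split)
print(round(gain_clean, 3), round(gain_mixed, 3))
```

A tree builder evaluates many candidate splits this way and keeps the one with the largest gain at each node.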

3) Random Forest:

Random Forest is a classifier that builds a number of decision trees on various subsets of the given dataset and combines their outputs to improve predictive accuracy. Instead of relying on a single decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote. A greater number of trees generally makes the predictions more stable and reduces the risk of overfitting compared with a single tree, although the gains level off as more trees are added.

Strengths:
- It trains relatively quickly, since the trees can be built independently of one another.
- It predicts with high accuracy and runs efficiently even on large datasets.
- It can maintain accuracy when a large proportion of the data is missing.
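The majority-vote idea can be illustrated without training anything. The vote matrix below is invented for the example (five trees, four students). As far as the scikit-learn documentation describes it, `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so treat this as a simplified picture of the aggregation step.

```python
import numpy as np

# Hypothetical hard predictions: one row per tree, one column per student
# (1 = predicted to pass, 0 = predicted to fail).
tree_votes = np.array([
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
])

# Share of trees voting "pass" for each student.
pass_share = tree_votes.mean(axis=0)

# The forest predicts "pass" when more than half of the trees say so.
forest_prediction = (pass_share > 0.5).astype(int)

print(pass_share, forest_prediction)
```

Individual trees disagree on three of the four students, but the aggregated prediction is less sensitive to any single tree's mistakes.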

In [30]:
# Import the three supervised learning models from sklearn
# Fit model 1 on the training data, predict on the test data, and measure the accuracy

In [31]:
# Logistic regression
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
# Pass y as a 1-D array; a single-column DataFrame triggers a DataConversionWarning
model.fit(X_train,y_train.values.ravel())
y_pred=model.predict(X_test)
y_pred



array([1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 0])

In [32]:
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix
print(accuracy_score(y_test,y_pred))
confusion_matrix(y_test,y_pred)

0.6526315789473685


array([[ 5, 25],
       [ 8, 57]], dtype=int64)

In [33]:
# Fit model 2 on the training data, predict on the test data, and measure the accuracy

In [34]:
#decision tree
from sklearn.tree import DecisionTreeClassifier
model=DecisionTreeClassifier()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)

In [35]:
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix
print(accuracy_score(y_test,y_pred))
confusion_matrix(y_test,y_pred)

0.5473684210526316


array([[12, 18],
       [25, 40]], dtype=int64)

In [36]:
# Fit model 3 on the training data, predict on the test data, and measure the accuracy

In [37]:
# Random forest
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier()
# Pass y as a 1-D array; a single-column DataFrame triggers a DataConversionWarning
model.fit(X_train,y_train.values.ravel())
y_pred=model.predict(X_test)



In [38]:
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix
print(accuracy_score(y_test,y_pred))
confusion_matrix(y_test,y_pred)

0.6631578947368421


array([[ 8, 22],
       [10, 55]], dtype=int64)
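The section description asks for each model to be fitted on varying sizes of training data, while the cells above use the full training set only. One way this might be done is sketched below. It runs on synthetic data from `make_classification` so that it is self-contained; the data, the fixed `random_state` values, and `max_iter=1000` are assumptions, and the printed scores say nothing about the student dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the encoded student data.
X_syn, y_syn = make_classification(
    n_samples=395, n_features=30, weights=[0.33, 0.67], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=95, stratify=y_syn, random_state=2
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

results = []
for name, clf in models.items():
    for size in (100, 200, 300):
        # Train on the first `size` rows of the (already shuffled) training set.
        clf.fit(X_tr[:size], y_tr[:size])
        pred = clf.predict(X_te)
        results.append((name, size, accuracy_score(y_te, pred), f1_score(y_te, pred)))

for name, size, acc, f1 in results:
    print(f"{name:20s} n={size:3d} accuracy={acc:.3f} f1={f1:.3f}")
```

Reporting the F1 score next to accuracy is a common choice when the classes are imbalanced, as they are here (about 67% of students passed).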

In [41]:
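# Of the three models above, random forest gives the highest test accuracy (about 0.663),
# slightly ahead of logistic regression (about 0.653) and well ahead of the decision tree (about 0.547).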
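To put these accuracies in context, the confusion matrices printed above can be compared against a naive baseline that predicts "passed" for every student. The sketch below only re-uses the numbers already shown in the outputs; it assumes scikit-learn's convention that rows are actual classes and columns are predicted classes, with class 0 (failed) first.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: actual 0/1, columns: predicted 0/1).
confusion = {
    "logistic regression": np.array([[5, 25], [8, 57]]),
    "decision tree": np.array([[12, 18], [25, 40]]),
    "random forest": np.array([[8, 22], [10, 55]]),
}

summary = {}
for name, cm in confusion.items():
    accuracy = np.trace(cm) / cm.sum()     # correct predictions / all predictions
    fail_recall = cm[0, 0] / cm[0].sum()   # share of failing students that were caught
    summary[name] = (accuracy, fail_recall)
    print(f"{name:20s} accuracy={accuracy:.3f} fail_recall={fail_recall:.3f}")

# Baseline: always predict "passed" (class 1), correct for every actual pass.
any_cm = confusion["random forest"]
baseline_accuracy = any_cm[1].sum() / any_cm.sum()
print(f"always-pass baseline accuracy={baseline_accuracy:.3f}")
```

On this particular split the always-pass baseline scores about 0.684, which is above all three models, so accuracy alone may not be the most informative metric for an intervention system whose purpose is to find the students likely to fail.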