
### Supervised Learning
### Activity: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:**

This is a classification problem because we are predicting a discrete output: whether a student passes or fails.

### Question-2
Load the necessary Python libraries and the student data. Note that the last column of this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read student data
data = pd.read_csv("student-data.csv")
data.head()

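| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | passed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 | no |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 | no |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 | yes |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 | yes |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 | yes |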


### Question-3
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [3]:
# Calculate number of students
n_students = len(data)
n_students

395

In [4]:
# Calculate number of features
n_features = len(data.columns[:-1])
n_features

30

In [5]:
# Calculate passing students
n_passed = len(data[data.passed=="yes"])
n_passed

265

In [6]:
# Calculate failing students
n_failed = len(data[data.passed=="no"])
n_failed

130

In [7]:
# Calculate graduation rate
grad_rate = n_passed/(n_passed+n_failed)*100
grad_rate

67.08860759493672

In [8]:
# Print the results
print("Total number of students: {}".format(n_students))
print("Number of features: {}".format(n_features))
print("Number of students who passed: {}".format(n_passed))
print("Number of students who failed: {}".format(n_failed))
print("Graduation rate of the class: {:.2f}%".format(grad_rate))

Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%
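
As a cross-check on the counts above, the same summary can be obtained with `Series.value_counts`. The sketch below uses a small made-up `passed` column (the names `toy_passed`, `n_passed_toy`, `n_failed_toy`, and `grad_rate_toy` are illustrative and not part of the notebook), so the numbers it prints are not the real results.

```python
import pandas as pd

# Hypothetical toy labels standing in for data['passed'] (not the real dataset)
toy_passed = pd.Series(["yes", "no", "yes", "yes", "no", "yes"], name="passed")

# Absolute number of rows per class
counts = toy_passed.value_counts()

# Share of each class in percent; the 'yes' share plays the role of grad_rate
shares = toy_passed.value_counts(normalize=True) * 100

n_passed_toy = int(counts["yes"])
n_failed_toy = int(counts["no"])
grad_rate_toy = float(shares["yes"])

print("passed:", n_passed_toy, "failed:", n_failed_toy)
print("graduation rate: {:.2f}%".format(grad_rate_toy))
```

Applied to the real data, `data['passed'].value_counts()` should agree with `n_passed` and `n_failed` computed above.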


## Preparing the Data
In this section, you will prepare the data for modeling, training, and testing.

### Question-4 Identify feature and target columns


Separate the student data into feature and target columns to see if any features are non-numeric.

In [9]:
# Extract feature columns

In [10]:
feature_cols = list(data.columns[:-1])

In [11]:
# Extract target column 'passed'

In [12]:
target_col = data.columns[-1] 

In [13]:
# Separate the data into feature data and target data (X and y, respectively)

In [14]:
X = data[feature_cols]
y = data[target_col]

### Question-5 Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.
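
To illustrate what dummy variables look like, here is a minimal sketch on a made-up `Fjob` column. The frame `toy` and its values are assumptions chosen for illustration; `dtype=int` is passed so the result shows `1`/`0` rather than booleans.

```python
import pandas as pd

# Hypothetical toy column (values chosen for illustration only)
toy = pd.DataFrame({"Fjob": ["teacher", "other", "services", "other"]})

# One indicator column per distinct value; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy["Fjob"], prefix="Fjob", dtype=int)

print(dummies)
```

Each row has exactly one `1`, in the column that matches its original category.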

In [15]:
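def preprocess_features(X):
    # Start from an empty frame that shares the original index
    output = pd.DataFrame(index = X.index)
    # DataFrame.items() replaces iteritems(), which was removed in pandas 2.0
    for col, col_data in X.items():
        # Map binary 'yes'/'no' columns to 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Any column that is still non-numeric is categorical: one-hot encode it
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)
        output = output.join(col_data)
    return output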

In [16]:
X = preprocess_features(X)
X.head()

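| | school_GP | school_MS | sex_F | sex_M | age | address_R | address_U | famsize_GT3 | famsize_LE3 | Pstatus_A | ... | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 18 | 0 | 1 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 4 | 3 | 4 | 1 | 1 | 3 | 6 |
| 1 | 1 | 0 | 1 | 0 | 17 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 5 | 3 | 3 | 1 | 1 | 3 | 4 |
| 2 | 1 | 0 | 1 | 0 | 15 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 4 | 3 | 2 | 2 | 3 | 3 | 10 |
| 3 | 1 | 0 | 1 | 0 | 15 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 1 | 1 | 3 | 2 | 2 | 1 | 1 | 5 | 2 |
| 4 | 1 | 0 | 1 | 0 | 16 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 4 | 3 | 2 | 1 | 2 | 5 | 4 |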


In [17]:
X['Pedu']=X['Medu']+X['Fedu']
X['alc']=X['Dalc']+X['Walc']

In [18]:
X=X.drop(['Medu','Fedu','Dalc','Walc'],axis=1)

In [19]:
X.head()

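| | school_GP | school_MS | sex_F | sex_M | age | address_R | address_U | famsize_GT3 | famsize_LE3 | Pstatus_A | ... | higher | internet | romantic | famrel | freetime | goout | health | absences | Pedu | alc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 18 | 0 | 1 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 4 | 3 | 4 | 3 | 6 | 8 | 2 |
| 1 | 1 | 0 | 1 | 0 | 17 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 5 | 3 | 3 | 3 | 4 | 2 | 2 |
| 2 | 1 | 0 | 1 | 0 | 15 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 4 | 3 | 2 | 3 | 10 | 2 | 5 |
| 3 | 1 | 0 | 1 | 0 | 15 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 1 | 1 | 3 | 2 | 2 | 5 | 2 | 6 | 2 |
| 4 | 1 | 0 | 1 | 0 | 16 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 4 | 3 | 2 | 5 | 4 | 6 | 3 |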


### Question - 6 Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. You will need to implement the following:
- Randomly shuffle and split the data (`X`, `y`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [20]:
# splitting the data into train and test
from sklearn.model_selection import train_test_split
num_train = 300
num_test = X.shape[0] - num_train
# Pass both sizes explicitly so that num_test is actually used
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=num_train, test_size=num_test, random_state=42)

In [21]:
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 300 samples.
Testing set has 95 samples.
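
Because roughly two thirds of the students passed, a purely random split can leave the two subsets with slightly different class ratios. One possible refinement, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below runs on made-up arrays (`X_toy`, `y_toy` are hypothetical) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 samples, 30 labelled 'yes' and 10 labelled 'no'
X_toy = np.arange(40).reshape(-1, 1)
y_toy = np.array(["yes"] * 30 + ["no"] * 10)

# stratify=y_toy asks the splitter to keep the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=32, stratify=y_toy, random_state=42
)

# Fraction of 'yes' labels in each subset
train_yes_share = (y_tr == "yes").mean()
test_yes_share = (y_te == "yes").mean()
print(train_yes_share, test_yes_share)
```

In this notebook the equivalent call would add `stratify=y` to the existing `train_test_split` line; doing so would change which rows land in each subset, so the scores reported below would no longer be reproduced exactly.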


In [22]:
X_train.head()

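| | school_GP | school_MS | sex_F | sex_M | age | address_R | address_U | famsize_GT3 | famsize_LE3 | Pstatus_A | ... | higher | internet | romantic | famrel | freetime | goout | health | absences | Pedu | alc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 210 | 1 | 0 | 1 | 0 | 19 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 4 | 3 | 3 | 3 | 10 | 6 | 3 |
| 75 | 1 | 0 | 0 | 1 | 15 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 4 | 3 | 3 | 5 | 6 | 7 | 5 |
| 104 | 1 | 0 | 0 | 1 | 15 | 0 | 1 | 1 | 0 | 1 | ... | 1 | 1 | 0 | 5 | 4 | 4 | 1 | 0 | 7 | 2 |
| 374 | 0 | 1 | 1 | 0 | 18 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 5 | 4 | 4 | 1 | 0 | 8 | 2 |
| 16 | 1 | 0 | 1 | 0 | 16 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 3 | 2 | 3 | 2 | 6 | 8 | 3 |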


### Question - 7  Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.

###  Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

1. Logistic Regression

Strengths: Outputs have a straightforward probabilistic interpretation; can be regularized to avoid overfitting; can be updated easily with new data using stochastic gradient descent [2]; fast to train; performs well with a small number of observations.
Weaknesses: "Logistic regression tends to underperform when there are multiple or non-linear decision boundaries" [2]; not flexible enough to capture more complex relationships; struggles with noisy data.

Applications: Investigating high employee turnover [12]; spam detection [9]; credit card fraud detection [9].

Despite the word "regression" in its name, Logistic Regression is a linear model for classification. Its core is the logistic (sigmoid) function, an S-shaped curve that maps any real-valued input to a value strictly between 0 and 1. Input values are combined linearly using weights (coefficients) to predict a binary output, and the coefficient for each input is learned from the training data.
I chose Logistic Regression because we have few training samples and it is a simple model that works well in that setting; it is fast to train and to predict with, and if there are no complex relationships between the features it can be sufficient.
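
To make the logistic function concrete, here is a minimal sketch of how a linear combination of inputs is turned into a probability and then a label. The weights, bias, and feature values are made up for illustration; a fitted `LogisticRegression` would learn its own coefficients from the training data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, bias and feature values, chosen only for illustration
weights = [0.8, -0.5]
bias = 0.1
features = [2.0, 1.0]

# Linear combination of the inputs, then squashed by the logistic function
z = bias + sum(w * x for w, x in zip(weights, features))
p_pass = sigmoid(z)

# Threshold the probability at 0.5 to obtain a binary label
label = "yes" if p_pass >= 0.5 else "no"
print(round(p_pass, 4), label)
```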



2. Support Vector Machines

Strengths: Performs well on non-linear problems (given a suitable kernel); many kernels are available; good for high-dimensional data; fast to predict.
Weaknesses: Outputs are hard to interpret; slow to train on many (>10000) examples; requires a lot of memory; computationally costly to train; trickier to tune (it is difficult to find the right kernel); speed is also affected by the number of features (complexity ranges between O(n_samples^2 n_features) and O(n_samples^3 n_features)).

Applications: Face detection [10]; image classification [10]; handwriting recognition [10].

I chose Support Vector Machines because the model is versatile (many possible kernels); it works well when there are complex (non-linear) relationships; and for small training sets it is fast to train and to return predictions, so it is computationally affordable.
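
The effect of the kernel choice can be seen on a small synthetic problem. The sketch below is not part of the student-data analysis: it uses scikit-learn's `make_circles` to generate two concentric rings and compares a linear kernel with an RBF kernel. The scores are training accuracies, used here only to show separability.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: two concentric rings, which no straight line can separate
X_toy, y_toy = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Same classifier with two different kernels, scored on the data it was fit to
linear_acc = SVC(kernel="linear").fit(X_toy, y_toy).score(X_toy, y_toy)
rbf_acc = SVC(kernel="rbf").fit(X_toy, y_toy).score(X_toy, y_toy)

print("linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

On data like this the RBF kernel should separate the rings far better than the linear one, which is the kind of non-linear relationship the strengths list refers to.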


3. Decision Tree

Strengths: Simple model; easy to interpret and explain; simple to tune; fast for a small number of training samples; works well with missing values; works well with qualitative features.
Weaknesses: Overfits easily without tuning; not well suited to big-data problems.

Applications: Star-galaxy classification [11]; control of nonlinear dynamical systems [11]; medical diagnosis [11].

Decision Tree models can be used for both classification and regression problems. They are used for inductive inference, and ID3 is a very popular algorithm. For classification, a tree represents a series of "if-then" rules that lead to a final decision. The tree is constructed by repeatedly splitting the data into separate branches that maximize the information gain (grouping samples with similar features). [7,8]
I chose this model because it works well with a small amount of training data, its results are simple to interpret and explain, and it handles the categorical features in the data well.
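
As a worked example of information gain, the sketch below computes the entropy of a small made-up label set before and after one candidate split. The helper names `entropy` and `information_gain` and the toy labels are illustrative assumptions. Note also that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default; passing `criterion="entropy"` selects an entropy-based criterion like the one shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

# Hypothetical toy labels: 4 'yes' and 4 'no' before the split
parent = ["yes"] * 4 + ["no"] * 4
# A candidate split that separates the classes only partially
left = ["yes", "yes", "yes", "no"]
right = ["yes", "no", "no", "no"]

gain = information_gain(parent, left, right)
print(round(entropy(parent), 4), round(gain, 4))
```

A split that produced two pure children would have a gain equal to the parent entropy, which is the best possible outcome for a single split.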

In [23]:
def train_classifier(clf, X_train, y_train):
    # Fit the classifier to the training data
    clf.fit(X_train, y_train)


def predict_labels(clf, features, target):
    # f1_score is imported in the next cell, before this function is first called
    y_pred = clf.predict(features)
    # NOTE: despite its "Accuracy score" label, this line prints the macro-averaged F1 score
    print("Accuracy score for test set: {:.4f}.".format(f1_score(target.values, y_pred, average="macro")))
    # Return the F1 score for the positive class 'yes'
    return f1_score(target.values, y_pred, pos_label='yes')


def train_predict(clf, X_train, y_train, X_test, y_test):
    print("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))

    # Train the classifier
    train_classifier(clf, X_train, y_train)

    # Print the results of prediction for both training and testing
    #print("F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train)))
    print("F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test)))

In [24]:
# from sklearn import model_A
from sklearn.linear_model import LogisticRegression
# from sklearn import model_B
from sklearn.svm import SVC
# from sklearn import model_C
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score,accuracy_score

In [25]:
clf_A = LogisticRegression(random_state = 10)
clf_B = SVC(random_state = 20)
clf_C = DecisionTreeClassifier(random_state = 30)
X_train_100 = X_train[:100]
#print len(X_train_100)
y_train_100 = y_train[:100]

X_train_200 = X_train[:200]
#print len(X_train_200)
y_train_200 = y_train[:200]

X_train_300 = X_train
#print len(X_train_300)
y_train_300 = y_train

In [26]:

print ("*********************************")
print ("1. Logistic Regression")
print ("*********************************")
print ("\n")
print ("Training set size = 100")
train_predict(clf_A, X_train_100, y_train_100, X_test, y_test)
print ("\n")
print ("Training set size = 200")
train_predict(clf_A, X_train_200, y_train_200, X_test, y_test)
print ("\n")
print ("Training set size = 300")
train_predict(clf_A, X_train_300, y_train_300, X_test, y_test)
print ("\n")


print ("\n")
print ("-----------------------------------------------------------------")
print ("\n")

print ("*********************************")
print ("2. Support Vector Machine")
print ("*********************************")
print ("\n")
print ("Training set size = 100")
train_predict(clf_B, X_train_100, y_train_100, X_test, y_test)
print ("\n")
print ("Training set size = 200")
train_predict(clf_B, X_train_200, y_train_200, X_test, y_test)
print ("\n")
print ("Training set size = 300")
train_predict(clf_B, X_train_300, y_train_300, X_test, y_test)
print ("\n")

print ("\n")
print ("-----------------------------------------------------------------")
print ("\n")

print ("*********************************")
print ("3. Decision Tree Classifier")
print ("*********************************")
print ("\n")
print ("Training set size = 100")
train_predict(clf_C, X_train_100, y_train_100, X_test, y_test)
print ("\n")
print ("Training set size = 200")
train_predict(clf_C, X_train_200, y_train_200, X_test, y_test)
print ("\n")
print ("Training set size = 300")
train_predict(clf_C, X_train_300, y_train_300, X_test, y_test)
print ("\n")

*********************************
1. Logistic Regression
*********************************


Training set size = 100
Training a LogisticRegression using a training set size of 100. . .
Accuracy score for test set: 0.6202.
F1 score for test set: 0.7761.


Training set size = 200
Training a LogisticRegression using a training set size of 200. . .
Accuracy score for test set: 0.6293.
F1 score for test set: 0.7971.


Training set size = 300
Training a LogisticRegression using a training set size of 300. . .
Accuracy score for test set: 0.6545.
F1 score for test set: 0.8000.




-----------------------------------------------------------------


*********************************
2. Support Vector Machine
*********************************


Training set size = 100
Training a SVC using a training set size of 100. . .
Accuracy score for test set: 0.5462.
F1 score for test set: 0.7660.


Training set size = 200
Training a SVC using a training set size of 200. . .
Accuracy score for test set: 0.

### Logistic Regression is the best model

In [28]:
from sklearn.linear_model import LogisticRegression
logit_model = LogisticRegression(solver='lbfgs', max_iter=1000)
logit_model.fit(X_train,y_train)
y_pred = logit_model.predict(X_test)

In [29]:
from sklearn.metrics import f1_score,confusion_matrix,accuracy_score,precision_score,recall_score
print('Accuracy is:',accuracy_score(y_test,y_pred))
print('Precision is:',precision_score(y_test,y_pred,pos_label='yes'))
print('recall is:',recall_score(y_test,y_pred,pos_label='yes'))
print('f1 is:',f1_score(y_test,y_pred,pos_label='yes'))

Accuracy is: 0.7157894736842105
Precision is: 0.7142857142857143
recall is: 0.9166666666666666
f1 is: 0.8029197080291971


In [30]:
confusion_matrix(y_test,y_pred)

array([[13, 22],
       [ 5, 55]], dtype=int64)
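
The metrics printed earlier can be recovered by hand from this confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels in sorted order (`'no'`, `'yes'`), so with `'yes'` as the positive class the four cells are TN, FP, FN, TP. The variable names below are illustrative.

```python
# Values copied from the confusion matrix above; rows are true labels and
# columns are predicted labels, in sorted order ('no', 'yes')
tn, fp = 13, 22
fn, tp = 5, 55

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the students predicted 'yes', how many passed
recall = tp / (tp + fn)      # of the students who passed, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The 22 false positives are students predicted to pass who actually failed, which is the costly error for an intervention system.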

In [32]:
from sklearn.neighbors import KNeighborsClassifier
acc_values = []
neighbors = np.arange(3,15)
for k in neighbors:
    classifier = KNeighborsClassifier(n_neighbors=k,metric = 'minkowski')
    classifier.fit(X_train,y_train)
    y_pred=classifier.predict(X_test)
    acc = accuracy_score(y_test,y_pred)
    acc_values.append(acc)

In [33]:
acc_values

[0.6526315789473685,
 0.6631578947368421,
 0.6631578947368421,
 0.6526315789473685,
 0.6631578947368421,
 0.6736842105263158,
 0.7052631578947368,
 0.6526315789473685,
 0.6526315789473685,
 0.6526315789473685,
 0.6736842105263158,
 0.6736842105263158]
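
To read the best `k` off this list programmatically, one option is to take the index of the maximum. The sketch below copies the accuracies from the output above. Keep in mind that choosing `k` by test-set accuracy tends to give an optimistic estimate; cross-validating on the training set is a common alternative.

```python
# Accuracies copied from the output above, one per k in 3..14
acc_values = [
    0.6526315789473685, 0.6631578947368421, 0.6631578947368421,
    0.6526315789473685, 0.6631578947368421, 0.6736842105263158,
    0.7052631578947368, 0.6526315789473685, 0.6526315789473685,
    0.6526315789473685, 0.6736842105263158, 0.6736842105263158,
]
neighbors = list(range(3, 15))

# Index of the highest accuracy; max() keeps the first one if there are ties
best_index = max(range(len(acc_values)), key=acc_values.__getitem__)
best_k = neighbors[best_index]

print("best k:", best_k, "accuracy:", acc_values[best_index])
```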

In [39]:
classifier = KNeighborsClassifier(n_neighbors=5,metric = 'minkowski')
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)

In [40]:
print('Accuracy is:',accuracy_score(y_test,y_pred))
print('Precision is:',precision_score(y_test,y_pred,pos_label='yes'))
print('recall is:',recall_score(y_test,y_pred,pos_label='yes'))
print('f1 is:',f1_score(y_test,y_pred,pos_label='yes'))

Accuracy is: 0.6631578947368421
Precision is: 0.6794871794871795
recall is: 0.8833333333333333
f1 is: 0.7681159420289855


In [41]:
confusion_matrix(y_test,y_pred)

array([[10, 25],
       [ 7, 53]], dtype=int64)

In [43]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train,y_train)
y_pred = dt_model.predict(X_test)

In [44]:
print('Accuracy is:',accuracy_score(y_test,y_pred))
print('Precision is:',precision_score(y_test,y_pred,pos_label='yes'))
print('recall is:',recall_score(y_test,y_pred,pos_label='yes'))
print('f1 score is:',f1_score(y_test,y_pred,pos_label='yes'))

Accuracy is: 0.6
Precision is: 0.6666666666666666
recall is: 0.7333333333333333
f1 score is: 0.6984126984126984


In [45]:
y_test.value_counts()

yes    60
no     35
Name: passed, dtype: int64
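
These class counts also give a useful reference point: a trivial model that predicts `'yes'` for every student. The sketch below works out its accuracy and F1 score from the counts above (the variable names are illustrative), so the scores of the real models can be judged against it.

```python
# Class counts copied from y_test.value_counts() above
n_yes, n_no = 60, 35

# A trivial model that predicts 'yes' for every student
tp, fp, fn = n_yes, n_no, 0

baseline_accuracy = n_yes / (n_yes + n_no)
baseline_precision = tp / (tp + fp)
baseline_recall = tp / (tp + fn)
baseline_f1 = 2 * baseline_precision * baseline_recall / (baseline_precision + baseline_recall)

print(round(baseline_accuracy, 4), round(baseline_f1, 4))
```

A model only adds value over this baseline if it scores above roughly 0.63 accuracy and 0.77 F1 on this test set.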

In [46]:
confusion_matrix(y_test,y_pred)

array([[13, 22],
       [16, 44]], dtype=int64)

In [47]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)

In [48]:
from sklearn.metrics import f1_score,confusion_matrix
print('Accuracy is:',accuracy_score(y_test,y_pred))
print('Precision is:',precision_score(y_test,y_pred,pos_label='yes'))
print('recall is:',recall_score(y_test,y_pred,pos_label='yes'))
print('f1 score is:',f1_score(y_test,y_pred,pos_label='yes'))

Accuracy is: 0.5894736842105263
Precision is: 0.6615384615384615
recall is: 0.7166666666666667
f1 score is: 0.688


In [49]:
confusion_matrix(y_test,y_pred)

array([[13, 22],
       [17, 43]], dtype=int64)
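
To compare the four models side by side, the test-set metrics printed above can be collected and ranked. The sketch below copies those numbers (rounded to four decimals) into a hypothetical `results` dictionary. The decision tree and random forest were fit without a `random_state`, so their figures may differ slightly from run to run.

```python
# Test-set metrics copied from the outputs above (rounded to four decimals)
results = {
    "Logistic Regression": {"accuracy": 0.7158, "f1": 0.8029},
    "K-Nearest Neighbors (k=5)": {"accuracy": 0.6632, "f1": 0.7681},
    "Decision Tree": {"accuracy": 0.6000, "f1": 0.6984},
    "Random Forest": {"accuracy": 0.5895, "f1": 0.6880},
}

# Rank the models by F1 score for the 'yes' class, best first
ranked = sorted(results.items(), key=lambda item: item[1]["f1"], reverse=True)
for name, metrics in ranked:
    print("{:<28} accuracy={:.4f}  f1={:.4f}".format(name, metrics["accuracy"], metrics["f1"]))

best_model = ranked[0][0]
```

On these numbers Logistic Regression ranks first on both accuracy and F1.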