
### Supervised Learning
### Activity: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail to graduate. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:**

This is a classification problem: the goal is to decide whether or not a student needs intervention, which is a discrete (yes/no) outcome. Regression would be appropriate only if we were predicting a continuous value, such as a final grade.
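
One quick way to see that the target is discrete is to list its distinct values. The snippet below is a minimal sketch on a toy stand-in for the `'passed'` column; the five values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the 'passed' column (hypothetical values, not the real data)
passed_toy = pd.Series(["no", "no", "yes", "yes", "yes"], name="passed")

# A classification target takes a small, fixed set of labels
labels = sorted(passed_toy.unique())
is_binary = len(labels) == 2
print(labels, is_binary)
```

A target with only two distinct labels points to binary classification; a target with many distinct numeric values would point to regression instead.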
  

### Question-2
Load the necessary Python libraries and the student data. Note that the last column of this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

In [2]:
# Read student data
df=pd.read_csv('student-data.csv')

In [3]:
df.head()

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


In [4]:
df.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'passed'],
      dtype='object')

### Question-3
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [5]:
# Calculate number of students
n_students=len(df)
print('The total number of students is ',n_students )

The total number of students is  395


In [6]:
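# Calculate number of features
# Note: len(df.columns) counts every column, including the target 'passed',
# so the number of input features is one less than the value printed here.
n_features=len(df.columns)
print('The total number of features for each student is  ',n_features)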

The total number of features for each student is   31


In [7]:
# Calculate passing students
passed=df.loc[df['passed']=='yes']
n_passed = len(passed)
print('The passing students is  ',n_passed)

The passing students is   265


In [8]:
# Calculate failing students
failed=df.loc[df['passed']=='no']
n_failed = len(failed)
print('The failing students is  ',n_failed)

The failing students is   130


In [9]:
# Calculate graduation rate
total = n_passed + n_failed
grad_rate = float(n_passed * 100 / total)
print('The graduation rate is  ',grad_rate)

The graduation rate is   67.0886075949367


In [10]:
# Print the results
print('The total number of students is ',n_students )
print('The total number of features for each student is  ',n_features)
print('The passing students is  ',n_passed)
print('The failing students is  ',n_failed)
print('The graduation rate is  ',grad_rate)

The total number of students is  395
The total number of features for each student is   31
The passing students is   265
The failing students is   130
The graduation rate is   67.0886075949367
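
The pass/fail counts and the graduation rate can also be read off in one step with `value_counts`. This is only a sketch on a toy DataFrame; the four rows and the `*_toy` names are hypothetical, and the real `df` from `student-data.csv` would be used in their place.

```python
import pandas as pd

# Toy stand-in for the student data (hypothetical rows, not the real dataset)
toy = pd.DataFrame({"passed": ["yes", "no", "yes", "yes"]})

counts = toy["passed"].value_counts()                      # absolute count per label
rates = toy["passed"].value_counts(normalize=True) * 100   # share per label, in percent

n_passed_toy = int(counts.get("yes", 0))
n_failed_toy = int(counts.get("no", 0))
grad_rate_toy = float(rates.get("yes", 0.0))
print(n_passed_toy, n_failed_toy, grad_rate_toy)
```

Using `.get(..., 0)` keeps the code from raising a `KeyError` if one of the labels happens to be absent.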


## Preparing the Data
In this section, you will prepare the data for modeling, training and testing.

### Question-4 Identify feature and target columns


Separate the student data into feature and target columns to see if any features are non-numeric.

In [11]:
# Extract feature columns

In [12]:
feature_col = list(df.columns[:-1])
feature_col

['school',
 'sex',
 'age',
 'address',
 'famsize',
 'Pstatus',
 'Medu',
 'Fedu',
 'Mjob',
 'Fjob',
 'reason',
 'guardian',
 'traveltime',
 'studytime',
 'failures',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic',
 'famrel',
 'freetime',
 'goout',
 'Dalc',
 'Walc',
 'health',
 'absences']

In [13]:
# Extract target column 'passed'

In [14]:
target_col = df.columns[-1]
target_col

'passed'

In [15]:
# Separate the data into feature data and target data (X and y, respectively)

In [16]:
X=df[feature_col]
y=df[target_col]

### Question-5 Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.
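
To make the dummy-variable idea concrete, here is a small sketch of `pandas.get_dummies()` applied to a toy `Fjob` column. The four rows are invented for illustration, and `dtype=int` is passed so that the indicators show up as `1`/`0` rather than booleans on recent pandas versions.

```python
import pandas as pd

# Toy categorical column (hypothetical rows, not the real dataset)
toy = pd.DataFrame({"Fjob": ["teacher", "other", "services", "other"]})

# One indicator column per distinct value, named <prefix>_<value>
dummies = pd.get_dummies(toy["Fjob"], prefix="Fjob", dtype=int)
print(dummies)
```

Each row has exactly one `1` across the generated columns, which is what "assign a `1` to one of them and `0` to all others" means in practice.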

In [17]:
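def preprocess_data(X):
    # Build the output DataFrame column by column, keeping the original index
    W=pd.DataFrame(index=X.index)
    # DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent
    for col, col_data in X.items():
        # Map yes/no columns to 1/0
        if(col_data.dtype==object):
            col_data=col_data.replace(['yes','no'],[1,0])
        # Columns that are still non-numeric are categorical: expand into dummy variables
        if(col_data.dtype==object):
            col_data = pd.get_dummies(col_data, prefix = col)
            
        W=W.join(col_data)
    return W
        
X = preprocess_data(X)
X.head()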

,school_GP,school_MS,sex_F,sex_M,age,address_R,address_U,famsize_GT3,famsize_LE3,Pstatus_A,...,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences
0,1,0,1,0,18,0,1,1,0,1,...,1,0,0,4,3,4,1,1,3,6
1,1,0,1,0,17,0,1,1,0,0,...,1,1,0,5,3,3,1,1,3,4
2,1,0,1,0,15,0,1,0,1,0,...,1,1,0,4,3,2,2,3,3,10
3,1,0,1,0,15,0,1,1,0,0,...,1,1,1,3,2,2,1,1,5,2
4,1,0,1,0,16,0,1,1,0,0,...,1,0,0,4,3,2,1,2,5,4


### Question - 6 Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. You will need to implement the following:
- Randomly shuffle and split the data (`X`, `y`) into training and testing subsets.
  - Use 300 training points (approximately 76%) and 95 testing points (approximately 24%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [18]:
# splitting the data into train and test
from sklearn.model_selection import train_test_split

In [19]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=95,random_state=42)

In [20]:
# Show the results of the split
print("Training data has {} datapoints.".format(len(X_train)))
print("Testing data has {} datapoints.".format(len(X_test)))

Training data has 300 datapoints.
Testing data has 95 datapoints.
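
Because about two thirds of the students passed, a purely random split can leave the test set with a somewhat different pass rate than the full data. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic features with the same row count and class balance as reported earlier; `X_toy` and `y_toy` are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 395 rows, 265 'yes' and 130 'no' labels
rng = np.random.RandomState(0)
X_toy = rng.rand(395, 3)
y_toy = np.array(["yes"] * 265 + ["no"] * 130)

# An integer test_size is an absolute number of test rows;
# stratify keeps the class proportions similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=95, stratify=y_toy, random_state=42
)

test_pass_rate = float(np.mean(y_te == "yes"))
print(len(X_tr), len(X_te), round(test_pass_rate, 3))
```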


### Question - 7  Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.



The following supervised learning models are currently available in scikit-learn that you may choose from:

1. Gaussian Naive Bayes (`GaussianNB`)
2. Decision Trees
3. Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)
4. K-Nearest Neighbors (`KNeighborsClassifier`)
5. Stochastic Gradient Descent (`SGDClassifier`)
6. Support Vector Machines (SVM)
7. Logistic Regression



###  Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

**Answer:**

The three supervised learning models chosen for this problem, all available in scikit-learn, are:

   1. Support Vector Machine (SVM)

      SVMs are supervised learning models that find a maximum-margin boundary between classes and are widely used for classification.

      The advantages of SVM are:

      1. It is effective in high-dimensional spaces.
      2. With a non-linear kernel, it handles problems where the classes are not linearly separable.
      3. It is relatively memory efficient, because the decision function depends only on the support vectors.

      The disadvantages of SVM are that a suitable kernel has to be chosen and that it is comparatively slower than other classification models such as decision trees.
         
   2. Logistic Regression

      Logistic regression is a supervised classification algorithm that predicts the probability of a target variable. The target (dependent) variable is dichotomous, meaning there are only two possible classes.

      The advantages of logistic regression are:

      1. It is easy to implement and interpret, and very efficient to train.
      2. It makes no assumptions about the distributions of classes in feature space.
      3. It gives good accuracy on many simple data sets and performs well when the data is linearly separable.
      4. Its coefficients can be interpreted as indicators of feature importance.

      The disadvantages of logistic regression are:

      1. If the number of observations is smaller than the number of features, it is prone to overfitting and should not be used.
      2. Its main limitation is the assumption of a linear relationship between the independent variables and the log-odds of the dependent variable.
      3. It struggles to capture complex relationships; more powerful models such as neural networks can easily outperform it.
          
   3. Decision Trees

      A decision tree is a supervised learning technique that can be used for both classification and regression, but it is mostly preferred for classification. It is a tree-structured classifier in which internal nodes represent tests on features, branches represent decision rules, and each leaf node represents an outcome.

      The advantages of decision trees are that they can capture non-linear relationships between features without any feature transformation, and that they give quicker predictions than models such as SVMs.

      Decision trees do not work well when the true decision boundary is smooth; they work best when the target is a discontinuous, piecewise-constant function. If the target function is really linear, decision trees are not the best choice.

In [21]:
# Import the three supervised learning models from sklearn
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [22]:
# fit model-1 on training data

In [23]:
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train,y_train)

SVC(kernel='linear')

In [24]:
# predict on the test data 

In [25]:
y_pred=svm_linear.predict(X_test)

In [26]:
# calculate the accuracy score

In [27]:

print("Accuracy is:", accuracy_score(y_test,y_pred))

Accuracy is: 0.6947368421052632


In [28]:
# fit the model-2 on training data and predict on the test data and measure the accuracy

In [29]:
logit_model =LogisticRegression()
logit_model.fit(X_train ,y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
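
The warning above suggests either raising `max_iter` or scaling the data. A possible way to do both is to put a `StandardScaler` in front of the classifier with `make_pipeline`. The sketch below runs on synthetic data from `make_classification`, with one feature deliberately blown up in scale; `X_toy`, `y_toy` and the value `max_iter=1000` are illustrative assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed student features
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)
X_toy[:, 0] *= 1000.0  # exaggerate one feature's scale, as unscaled data might

# Scale features to zero mean and unit variance, then fit with a higher iteration cap
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logit.fit(X_toy, y_toy)

# n_iter_ reports how many solver iterations were actually used
n_iter = int(scaled_logit.named_steps["logisticregression"].n_iter_[0])
train_acc = scaled_logit.score(X_toy, y_toy)
print(n_iter, train_acc)
```

Keeping the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged when predicting on test data.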

In [30]:
y_pred1= logit_model.predict(X_test)

In [31]:
print("Accuracy is:", accuracy_score(y_test,y_pred1))

Accuracy is: 0.7157894736842105


In [32]:
# fit the model-3 on training data and predict on the test data and measure the accuracy

In [33]:
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train,y_train)

DecisionTreeClassifier()

In [34]:
y_pred2 =dt_model.predict(X_test)

In [35]:
print("Accuracy is:",accuracy_score(y_test,y_pred2))

Accuracy is: 0.5789473684210527
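
Question 7 also asks for each model to be fitted on varying sizes of training data. A rough sketch of that loop is shown here on synthetic data; the helper `accuracy_by_train_size` and the sizes 100, 200 and 300 are hypothetical choices for illustration, and any of the three models could be passed in place of the decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of rows as the student data
X_toy, y_toy = make_classification(n_samples=395, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=95, random_state=42)

def accuracy_by_train_size(model, sizes):
    """Fit on the first n training rows for each n and score on the fixed test set."""
    results = {}
    for n in sizes:
        model.fit(X_tr[:n], y_tr[:n])
        results[n] = accuracy_score(y_te, model.predict(X_te))
    return results

scores = accuracy_by_train_size(DecisionTreeClassifier(random_state=0), [100, 200, 300])
print(scores)
```

The test set stays fixed across sizes so that the accuracies are comparable.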


In [None]:
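Conclusion:
    On this train/test split, Logistic Regression gives the highest test accuracy (about 0.716), ahead of the linear SVM (about 0.695) and the Decision Tree (about 0.579), so it is the best fit of the three for the student intervention data.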
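
A single 95-row test set gives a fairly noisy estimate, so one way to check a ranking like this is k-fold cross-validation alongside a majority-class baseline. The following is a sketch under stated assumptions: the data is synthetic (`make_classification` with roughly a two-thirds majority class), `cv=5` is an arbitrary choice, and none of the numbers it prints describe the student data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with an imbalanced binary target
X_toy, y_toy = make_classification(
    n_samples=395, n_features=20, weights=[0.33, 0.67], random_state=0
)

models = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "SVM (linear)": SVC(kernel="linear"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 folds for each model
cv_means = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, mean_acc in cv_means.items():
    print(f"{name}: {mean_acc:.3f}")
```

The baseline row shows the accuracy obtained by always predicting the majority class, which is the number a useful model needs to beat.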