
### Supervised Learning
### Activity: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail. Which type of supervised learning problem is this, classification or regression? Why?*

**Answer:**

We need to identify the students who may need an early intervention by sorting them into two categories, "Pass" and "Fail". Because the target is a discrete label rather than a continuous value, this is a classification problem.

### Question 2
Load the necessary Python libraries and the student data. Note that the last column from this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Read student data
data = pd.read_csv("student-data.csv")
data

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,services,services,...,no,no,5,5,4,4,5,4,11,no
391,MS,M,17,U,LE3,T,3,1,services,services,...,yes,no,2,4,5,3,4,2,3,yes
392,MS,M,21,R,GT3,T,1,1,other,other,...,no,no,5,5,3,3,3,3,3,no
393,MS,M,18,R,LE3,T,3,2,services,other,...,yes,no,4,4,1,3,4,5,0,yes


### Question 3
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [3]:
# Calculate number of students
n_students = len(data)

In [4]:
# Calculate number of features
n_features = data.shape[1]-1

In [5]:
# Calculate passing students
s_passed = len(data[data['passed']=='yes'])

In [6]:
# Calculate failing students
s_failed = len(data[data['passed']=="no"])

In [7]:
# Calculate graduation rate
grad_rate = (s_passed/n_students)*100

In [8]:
# Print the results
print("Total number of students: ",n_students)
print("Number of features:",n_features)
print("Number of students who passed:",s_passed)
print("Number of students who failed:",s_failed)
print("Graduation rate of the class: %.2f"%grad_rate,"%")

Total number of students:  395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09 %
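As a cross-check, the same counts can be read off in a single pass with `value_counts`. The sketch below is only illustrative: it uses a tiny hypothetical frame (`toy`) in place of `data`, and the names `n_pass`, `n_fail`, and `rate` are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the student data; only the target column matters here.
toy = pd.DataFrame({"passed": ["yes", "no", "yes", "yes", "no"]})

# Count each label once instead of filtering the frame twice.
counts = toy["passed"].value_counts()
n_pass = int(counts.get("yes", 0))
n_fail = int(counts.get("no", 0))

# normalize=True returns fractions, so multiply by 100 for a percentage.
rate = toy["passed"].value_counts(normalize=True).get("yes", 0.0) * 100

print(f"passed={n_pass}, failed={n_fail}, rate={rate:.2f}%")
```

Using `.get(label, 0)` keeps the code from failing if one of the two labels happens to be absent.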


## Preparing the Data
In this section, you will prepare the data for modeling, training, and testing.

### Question 4 - Identify Feature and Target Columns

Separate the student data into feature and target columns to see if any features are non-numeric.

In [None]:
# Separate the data into feature data and target data (X and y, respectively)

In [None]:
# Extract feature columns

In [10]:
x = data.drop("passed", axis =1)
x.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences'],
      dtype='object')

In [None]:
# Extract target column 'passed'

In [12]:
y=data['passed']

### Question 5 - Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.

In [17]:
# Initialize new output DataFrame
X = pd.DataFrame(index=x.index)
# Investigate each feature column for the data
# (DataFrame.iteritems() was removed in pandas 2.0; items() works in old and new versions)
for col, col_data in x.items():
    # If data type is non-numeric, replace all yes/no values with 1/0
    if col_data.dtype == object:
        col_data = col_data.replace(['yes', 'no'], [1, 0])
    # If data type is still non-numeric (categorical), convert to dummy variables
    if col_data.dtype == object:
        col_data = pd.get_dummies(col_data, prefix=col)
    # Collect the revised columns
    X = X.join(col_data)
X

Unnamed: 0,school_GP,school_MS,sex_F,sex_M,age,address_R,address_U,famsize_GT3,famsize_LE3,Pstatus_A,...,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences
0,1,0,1,0,18,0,1,1,0,1,...,1,0,0,4,3,4,1,1,3,6
1,1,0,1,0,17,0,1,1,0,0,...,1,1,0,5,3,3,1,1,3,4
2,1,0,1,0,15,0,1,0,1,0,...,1,1,0,4,3,2,2,3,3,10
3,1,0,1,0,15,0,1,1,0,0,...,1,1,1,3,2,2,1,1,5,2
4,1,0,1,0,16,0,1,1,0,0,...,1,0,0,4,3,2,1,2,5,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,0,1,0,1,20,0,1,0,1,1,...,1,0,0,5,5,4,4,5,4,11
391,0,1,0,1,17,0,1,0,1,0,...,1,1,0,2,4,5,3,4,2,3
392,0,1,0,1,21,1,0,1,0,0,...,1,0,0,5,5,3,3,3,3,3
393,0,1,0,1,18,1,0,0,1,0,...,1,1,0,4,4,1,3,4,5,0
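To see the two encodings in isolation, here is a minimal sketch on a hypothetical three-row frame (`toy`). The column values are invented for illustration, and passing `dtype=int` is an assumption made here so the indicator columns come out as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical frame with one yes/no column and one multi-valued column.
toy = pd.DataFrame({
    "internet": ["yes", "no", "yes"],
    "Fjob": ["teacher", "other", "services"],
})

# Binary yes/no column: map to 1/0 explicitly.
toy["internet"] = toy["internet"].map({"yes": 1, "no": 0})

# Multi-valued column: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Fjob"], prefix="Fjob", dtype=int)

print(encoded)
```

Each row has exactly one `1` among the `Fjob_*` columns, which is the "one column per possible value" scheme described above.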


In [18]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 48 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   school_GP          395 non-null    uint8
 1   school_MS          395 non-null    uint8
 2   sex_F              395 non-null    uint8
 3   sex_M              395 non-null    uint8
 4   age                395 non-null    int64
 5   address_R          395 non-null    uint8
 6   address_U          395 non-null    uint8
 7   famsize_GT3        395 non-null    uint8
 8   famsize_LE3        395 non-null    uint8
 9   Pstatus_A          395 non-null    uint8
 10  Pstatus_T          395 non-null    uint8
 11  Medu               395 non-null    int64
 12  Fedu               395 non-null    int64
 13  Mjob_at_home       395 non-null    uint8
 14  Mjob_health        395 non-null    uint8
 15  Mjob_other         395 non-null    uint8
 16  Mjob_services      395 non-null    uint8
 17  Mjob_teacher    

### Question 6 - Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. You will need to implement the following:
- Randomly shuffle and split the data (`X`, `y`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [19]:
# splitting the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)

In [21]:
# Show the results of the split
print("The number of samples in training set is", X_train.shape[0])
print("The number of samples in testing set is", X_test.shape[0])

The number of samples in training set is 296
The number of samples in testing set is 99
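The split above yields 296 training and 99 testing samples because `test_size=0.25` takes 25% of 395 rows (98.75) and rounds up. If exactly 300 training and 95 testing points are wanted, one option is to pass an integer `test_size`, which `train_test_split` reads as an absolute count. The sketch below uses synthetic stand-ins (`X_demo`, `y_demo`) rather than the student data, and the `stratify` argument is an optional addition that keeps the pass/fail ratio similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same number of rows as the student data (395).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(395, 5))
y_demo = np.array(["yes"] * 265 + ["no"] * 130)

# An integer test_size is an absolute number of samples;
# the remaining 300 rows become the training set.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=95, random_state=42, stratify=y_demo
)

print(Xtr.shape[0], Xte.shape[0])
```

Switching to this split would change the accuracy figures reported later, so it is shown only as an alternative.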


### Question 7 - Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.

###  Model Application
*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*

We chose the following three classification models:

1) Logistic Regression\
2) SVM\
3) Decision Tree

## 1. Logistic Regression

Logistic regression is a classification algorithm that estimates the probability of an event succeeding or failing. It is used when the dependent variable is binary (0/1, True/False, Yes/No). It assigns data to discrete classes by learning the relationship between features and labels from a set of labelled data.

Pros:
* Logistic regression is easy to implement and interpret, and very efficient to train.
* It makes no assumptions about distributions of classes in feature space.
* It extends easily to multiple classes (multinomial regression) and gives a natural probabilistic view of class predictions.

Cons:
* If the number of observations is smaller than the number of features, Logistic Regression should not be used, because it may overfit.
* It constructs linear decision boundaries.
* The major limitation of Logistic Regression is the assumption of linearity between the log-odds of the dependent variable and the independent variables.
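The linearity point can be made concrete: for a binary problem, scikit-learn's `LogisticRegression` computes a linear score and passes it through the sigmoid function. The following sketch checks this on synthetic data (`X_demo`, `y_demo` are generated for the example and are not the student data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score w.x + b: the log-odds of the positive class.
log_odds = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# The sigmoid maps log-odds to a probability in (0, 1).
proba_manual = 1.0 / (1.0 + np.exp(-log_odds))
proba_sklearn = clf.predict_proba(X_demo)[:, 1]

print(np.allclose(proba_manual, proba_sklearn))
```

Because the score is linear in the features, the decision boundary (where the probability equals 0.5) is a hyperplane.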

## 2. SVM

“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification and regression. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that separates the two classes with the largest margin.

Pros:
* It works really well when there is a clear margin of separation.
* It is effective in high dimensional spaces.
* It is effective in cases where the number of dimensions is greater than the number of samples.
* It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

Cons:
* It doesn’t perform well on large data sets because the required training time is higher.
* It also doesn’t perform very well when the data set is noisy, i.e. when the target classes overlap.
* SVM doesn’t directly provide probability estimates. In scikit-learn's `SVC` class they are available by setting `probability=True`, which computes them with an expensive internal five-fold cross-validation.
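A short sketch of the last two points on synthetic data (`X_demo`, `y_demo` are generated for the example): a default `SVC` exposes signed distances to the hyperplane and keeps only a subset of the training points as support vectors, while probability estimates need the extra `probability=True` option.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification data for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Default SVC: decision_function returns signed distances, not probabilities.
svc = SVC(kernel="linear").fit(X_demo, y_demo)
margins = svc.decision_function(X_demo[:3])
n_support = len(svc.support_)  # indices of the training points kept as support vectors

# probability=True adds an internal cross-validated calibration step, so fitting is slower.
svc_proba = SVC(kernel="linear", probability=True, random_state=0).fit(X_demo, y_demo)
probs = svc_proba.predict_proba(X_demo[:3])

print("support vectors:", n_support, "of", len(X_demo))
print("margins:", margins)
print("probabilities:", probs)
```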

## 3. Decision Tree

Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.

Pros:
* Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.
* A decision tree does not require normalization of data.
* A decision tree does not require scaling of data either.
* Missing values in the data do not affect the process of building a decision tree to any considerable extent, although support for them depends on the implementation.
* A decision tree model is very intuitive and easy to explain to technical teams as well as stakeholders.

Cons:
* A small change in the data can cause a large change in the structure of the decision tree, which makes the model unstable.
* For a decision tree, the calculations can sometimes become far more complex than for other algorithms.
* As a result, training a decision tree often takes more time and is relatively expensive.
* Decision trees are less suited to regression: they predict piecewise-constant values, so they approximate continuous targets only coarsely.
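The interpretability claim can be illustrated with `export_text`, which prints the learned rules of a fitted tree. This is a sketch on synthetic data; the feature names `f0` to `f3` and the `max_depth=2` limit are assumptions chosen to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic binary classification data for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# Limiting the depth keeps the tree small enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Each line of the output is a threshold rule or a leaf with its predicted class.
rules = export_text(tree, feature_names=names)
print(rules)
```

Restricting `max_depth` is also a common way to reduce the instability listed among the cons, at the cost of a coarser model.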

In [26]:
# Import the three supervised learning models from sklearn
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression 

In [None]:
# fit model-1 on training data

In [27]:
logit_model = LogisticRegression()
logit_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
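The warning above suggests two remedies: raising `max_iter` or scaling the data. A possible way to apply both is a pipeline that standardizes the features before the logistic regression. The sketch below is not the configuration used for the results in this notebook; it runs on synthetic data, and `max_iter=1000` is an arbitrary assumption.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one column is inflated to mimic an unscaled feature.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_demo[:, 0] *= 1000

# Standardize inside the pipeline so the scaler is fit on training data only.
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Turn convergence warnings into errors to confirm the fit completes cleanly.
with warnings.catch_warnings():
    warnings.simplefilter("error", ConvergenceWarning)
    scaled_logit.fit(X_demo, y_demo)

train_acc = scaled_logit.score(X_demo, y_demo)
print(f"training accuracy: {train_acc:.4f}")
```

Applying this to `X_train` and `y_train` would likely change the reported accuracy, so any comparison with the other models would need to be rerun.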

In [None]:
# predict on the test data 

In [28]:
y_pred = logit_model.predict(X_test)

In [None]:
# calculate the accuracy score

In [35]:
from sklearn.metrics import accuracy_score
acc1=accuracy_score(y_test,y_pred)
print('accuracy=',acc1)

accuracy= 0.7070707070707071


In [None]:
# fit the model-2 on training data and predict on the test data and measure the accuracy

In [36]:
svm_linear = SVC(kernel = 'linear')
svm_linear.fit(X_train,y_train)

SVC(kernel='linear')

In [38]:
y_pred = svm_linear.predict(X_test)

In [39]:
acc2=accuracy_score(y_test,y_pred)
print('accuracy=',acc2)

accuracy= 0.6868686868686869


In [None]:
# fit the model-3 on training data and predict on the test data and measure the accuracy

In [40]:
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train,y_train)

DecisionTreeClassifier()

In [41]:
y_pred = dt_model.predict(X_test)

In [42]:
acc3=accuracy_score(y_test,y_pred)
print('accuracy=',acc3)

accuracy= 0.5252525252525253


In [44]:
print('Accuracy score for Logistic Regression model is %0.4f'%acc1)
print('Accuracy score for SVM model is %0.4f'%acc2)
print('Accuracy score for Decision Tree model is %0.4f'%acc3)

Accuracy score for Logistic Regression model is 0.7071
Accuracy score for SVM model is 0.6869
Accuracy score for Decision Tree model is 0.5253
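Question 7 also asks for the models to be fit on varying sizes of training data. One way to do that without repeating the fit/predict/score cells is a loop over models and training sizes. The sketch below runs on synthetic data, and the sizes 100, 200, and 300, the `max_iter` value, and the `random_state` settings are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with the same number of rows as the student data.
X_demo, y_demo = make_classification(n_samples=395, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=95, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    for n in (100, 200, 300):
        # Refit on the first n (already shuffled) training rows, score on the fixed test set.
        model.fit(Xtr[:n], ytr[:n])
        results[(name, n)] = accuracy_score(yte, model.predict(Xte))

for (name, n), acc in results.items():
    print(f"{name:<20} n_train={n:<4} accuracy={acc:.4f}")
```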


### Conclusion

Of the three models we studied, Logistic Regression achieved the best accuracy score (0.7071), ahead of SVM (0.6869) and the Decision Tree (0.5253).
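Since 67.09% of the students passed, a model that always predicts "yes" would score roughly 0.67 on a representative split, so it may be worth comparing each model against that baseline and averaging over several splits rather than relying on one test set. The sketch below shows one way to do this with `DummyClassifier` and `cross_val_score`. It uses synthetic data with a similar class imbalance, and the five folds are an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with roughly one third / two thirds class balance.
X_demo, y_demo = make_classification(
    n_samples=395, n_features=10, weights=[0.33, 0.67], random_state=0
)

# Baseline: always predict the most frequent class.
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_demo, y_demo, cv=5
)

# Candidate model, evaluated on the same five folds.
logit_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5
)

print(f"baseline accuracy: {baseline_scores.mean():.4f}")
print(f"logistic accuracy: {logit_scores.mean():.4f}")
```

A model is only useful for intervention decisions if it clearly beats the majority-class baseline.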

## Group12 DSA B3

## Group Members

* Akshaya V
* Navaneeth R
* Shiffa
* Sujith Narayanan
* Sidharth S
