
# Machine Learning Assignment 1

## Why scikit-learn

Many open source machine learning packages are available, such as TensorFlow, scikit-learn and Caffe. For this task I have chosen scikit-learn, an open source library written in Python. Together with other scientific Python packages (pandas, numpy, etc.), scikit-learn provides powerful data structures and machine learning features that are easy to use.

Besides its readily available implementation of the k-nearest neighbours algorithm, these are the main reasons for choosing scikit-learn:

* scikit-learn is an open source package.
* It is regularly updated, with more than one release per year, so the package stays up to date.
* It is easy to use.
* It has implementations for most machine learning tasks, such as clustering, classification and regression.
* Its documentation is very good and up to date.

scikit-learn offers the following features:

* Powerful functions for data pre-processing, transformation and normalization.
* Implementations of many classification, regression and clustering algorithms. All of them follow the same design principles, so it is easy to adopt and switch between different machine learning algorithms.
* Powerful functions for dimensionality reduction.


*Reference: https://www.oreilly.com/ideas/six-reasons-why-i-recommend-scikit-learn*

*Reference: http://scikit-learn.org/stable/index.html*

## Data Pre-processing

* The data in the dataset is tab-separated. Each row represents an attribute and each column represents an individual patient.

* To load the data into a pandas dataframe, the file is read with `read_csv` using tab (`'\t'`) as the separator. The dataframe is then transposed to bring the data into the conventional format, i.e. features in columns and observations (patients) in rows.

* Once the data is in the desired format, column names are added to make the dataframe easier to read.

* The data and response used to train and test the models should be numeric (numpy arrays). So the Autoimmune_Disease column needs to be converted to 0 and 1, representing negative and positive respectively.

#### Configuring Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
import pandas as pd
import numpy
fields = ['Age', 'Blood_Pressure', 'BMI', 'Plasma_level', 'Autoimmune_Disease', 'Adverse_events', 'Drug_in_serum', 'Liver_function', 'Activity_test', 'Secondary_test']
# The file is tab-separated and has no header row
autoimmune_data = pd.read_csv(r'/content/gdrive/My Drive/GYE06/CT475_ML/autoimmune.txt',
                              sep='\t',
                              header=None)
# Transpose so that patients are rows and attributes are columns
autoimmune_data = autoimmune_data.transpose()
autoimmune_data.columns = fields
# Encode the response as numbers: negative -> 0, positive -> 1
autoimmune_data['Autoimmune_Disease'] = autoimmune_data['Autoimmune_Disease'].map({'negative':0, 'positive':1})
print(autoimmune_data.head())

  Age Blood_Pressure   BMI Plasma_level  Autoimmune_Disease Adverse_events  \
0  30             64  35.1           61                   1              1   
1  22             74    30           40                   0              1   
2  21             70  30.8           50                   0              0   
3  23             64  34.9         59.5                   0              0   
4  25             76  53.2           81                   1              0   

  Drug_in_serum Liver_function Activity_test Secondary_test  
0           156          0.692            32           12.7  
1            60          0.527            11              0  
2            50          0.597            26           22.6  
3            92          0.725            18            1.8  
4           100          0.759            56            3.6  


## kNN (k-Nearest Neighbours)

The k-nearest neighbours algorithm is an instance-based learning algorithm used for classification. It is a lazy algorithm: it predicts the outcome by calculating the distance from the query point to all the data points. It then takes the k nearest data points and assigns the query point to the category that occurs most often among them.
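The voting procedure described above can be sketched in a few lines of numpy. This is only an illustration, assuming Euclidean distance and a simple majority vote; the helper `knn_predict` and the toy data are made up for this example and are not part of the assignment code.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those k neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up): two well-separated clusters labelled 0 and 1
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(pred_low, pred_high)
```

Because nothing is learned ahead of time, all of the work happens at prediction time, which is why the algorithm is called lazy.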

Scikit-learn's `KNeighborsClassifier` has been used to implement the k-nearest neighbours algorithm.

When the attributes in the data span widely different ranges of values, attributes with a large range have more impact on the distance calculation than attributes with a small range. Hence the data must be normalized.
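A small numeric example may make this concrete. The values below are made up and only loosely modelled on the Liver_function and Drug_in_serum columns; the point is that the attribute with the larger range decides which neighbour looks closest.

```python
import numpy as np

# Made-up patients described by [Liver_function, Drug_in_serum]
query = np.array([0.5, 100.0])
a = np.array([0.9, 101.0])   # very different liver function, almost the same drug level
b = np.array([0.5, 110.0])   # identical liver function, slightly different drug level

dist_a = np.linalg.norm(query - a)
dist_b = np.linalg.norm(query - b)

# Without scaling, `a` looks much closer than `b`, because a difference of 10
# in Drug_in_serum outweighs a difference of 0.4 in Liver_function.
print(dist_a, dist_b)
```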



#### Normalizing data

We will use z-normalisation, which scales the data so that each attribute has a mean of zero and a standard deviation of 1.

In [15]:
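from sklearn import preprocessing

# Feature matrix: every column except the response
X = autoimmune_data.drop(columns=['Autoimmune_Disease'])
X_scaled = preprocessing.scale(X)

X_scaled.mean(axis=0) # axis=0 indicates that the mean is taken per column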

array([-1.11022302e-16,  3.68499557e-16,  0.00000000e+00,  0.00000000e+00,
        7.08652994e-18,  6.37787695e-17,  9.44870659e-18,  3.77948264e-17,
       -1.32281892e-16])
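The column means above are zero up to floating-point error. As a rough check of what the scaling does, the z-score can also be computed by hand: subtract the column mean and divide by the column standard deviation. The array `X_demo` below is made up for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Made-up values for two attributes on very different scales
X_demo = np.array([[0.5, 100.0],
                   [0.7, 60.0],
                   [0.9, 140.0]])

# z-score by hand: (value - column mean) / column standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same transformation using scikit-learn
X_sklearn = preprocessing.scale(X_demo)

print(np.allclose(X_manual, X_sklearn))
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```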

In [16]:
from sklearn.model_selection import train_test_split

X = autoimmune_data.drop(columns = ['Autoimmune_Disease'])
y = autoimmune_data['Autoimmune_Disease']

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
print(X_train.shape)
print(X_test.shape)

(282, 9)
(94, 9)


#### Implementing kNN

In [25]:
from sklearn.neighbors import KNeighborsClassifier

autoimmune_knn = KNeighborsClassifier(n_neighbors=5)
autoimmune_knn.fit(X_train, y_train)
accuracy_knn = autoimmune_knn.score(X_test, y_test)
print("The accuracy of the model using k=5 is {}".format(accuracy_knn))

The accuracy of the model using k=5 is 0.7340425531914894


## Logistic Regression

Logistic regression is a machine learning technique used for classification problems. It calculates the probability of the response given the predictors.

The function used in logistic regression is the sigmoid function:

`sig(t) = 1 / (1 + e^-t)`

The output of the sigmoid function lies between 0 and 1.

In this example, after training a logistic regression model on the data, if the output of the sigmoid function for a new example is greater than 0.5 we classify it as a case of autoimmune disease; if it is less than 0.5 we classify it as negative.
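The sigmoid and the 0.5 threshold can be illustrated with a short sketch. The input values in `linear_scores` are hypothetical; in a trained model they would come from the weighted sum of the predictors.

```python
import numpy as np


def sigmoid(t):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))


# Hypothetical linear scores for three patients
linear_scores = np.array([-2.0, 0.0, 3.0])
probabilities = sigmoid(linear_scores)

# Threshold at 0.5: 1 = positive, 0 = negative
labels = (probabilities > 0.5).astype(int)
print(probabilities, labels)
```

A score of exactly 0 gives a probability of exactly 0.5, so in this sketch it falls on the negative side of the strict `>` comparison.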



In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

autoimmune_log_reg = LogisticRegression()
# Mean accuracy over 10 cross-validation folds
accuracy_log_reg = cross_val_score(autoimmune_log_reg, X, y, cv=10).mean()
print("Accuracy using Logistic Regression is {}".format(accuracy_log_reg))

Accuracy using Logistic Regression is 0.7818713450292398


References: 

Logistic Regression: https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc 

scikit-learn: https://www.dataschool.io/machine-learning-with-scikit-learn/


## 10-fold Cross Validation

k-fold cross validation is a technique where the dataset is divided into k folds. The model is trained on k-1 folds and the remaining fold is used for testing. This process is repeated k times, with a different fold held out for testing in each iteration. The accuracy of the model is calculated in each iteration, and the mean of all the accuracies gives the overall model accuracy.
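Written out as an explicit loop, the procedure looks roughly like the sketch below. It uses synthetic data from `make_classification` rather than the autoimmune dataset, and plain `KFold` splits; note that when `cross_val_score` is given an integer `cv` for a classifier it uses stratified folds instead, so the split is passed explicitly here to make the two results comparable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the autoimmune dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)
kfold = KFold(n_splits=10)

fold_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    # Train on 9 folds, test on the held-out fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

manual_mean = np.mean(fold_scores)

# The same computation using the scikit-learn helper with the same splits
helper_mean = cross_val_score(model, X_demo, y_demo, cv=kfold).mean()
print(manual_mean, helper_mean)
```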

#### 10-fold cross validation on kNN

scikit-learn provides a built-in function, `cross_val_score`, which performs k-fold cross validation for a given k. The code is shown below.

In [27]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(autoimmune_knn, X, y, cv=10)
print(scores)

[0.73684211 0.65789474 0.68421053 0.76315789 0.68421053 0.73684211
 0.76315789 0.78947368 0.72222222 0.75      ]


In [7]:
print("Mean of 10-fold cross validation scores : {}".format(scores.mean()))

Mean of 10-fold cross validation scores : 0.7288011695906433


The accuracy of the model is 73%.

#### Finding optimal values of k

In the above model, `n_neighbors` was set to 5, but this may not give the highest accuracy. To find the optimal value of k, use scikit-learn's `GridSearchCV`.

`GridSearchCV` runs the model (kNN in this case) with each value in the given range of parameters and reports the resulting accuracy scores.
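For a single parameter, the idea behind the search can be sketched as a plain loop: cross-validate a kNN model for each candidate k and keep the k with the highest mean score. This is a simplified illustration on synthetic data, not a description of how `GridSearchCV` is implemented internally; the names `candidate_ks` and `mean_scores` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the autoimmune dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

candidate_ks = list(range(1, 20))
mean_scores = []
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 10 cross-validation folds for this value of k
    mean_scores.append(cross_val_score(knn, X_demo, y_demo, cv=10).mean())

# Pick the k with the highest mean score (the first one in case of a tie)
best_index = int(np.argmax(mean_scores))
best_k = candidate_ks[best_index]
best_score = mean_scores[best_index]
print(best_k, best_score)
```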



In [28]:
from sklearn.model_selection import GridSearchCV
parm_grid = dict(n_neighbors=range(1, 20))
# Try k = 1..19 with 10-fold cross validation, starting from a default kNN classifier
autoimmune_knn = GridSearchCV(KNeighborsClassifier(), parm_grid, cv=10)
autoimmune_knn.fit(X, y)

GridSearchCV(cv=10, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': range(1, 20)}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=0)

In [29]:
print("Mean Scores")
print(autoimmune_knn.cv_results_['mean_test_score'])
print("\nStd deviation for each score")
print(autoimmune_knn.cv_results_['std_test_score'])

Mean Scores
[0.68617021 0.69414894 0.69148936 0.72606383 0.7287234  0.74202128
 0.75       0.76329787 0.75265957 0.74202128 0.75531915 0.7606383
 0.7606383  0.73670213 0.75531915 0.74734043 0.75531915 0.75265957
 0.75531915]

Std deviation for each score
[0.07074504 0.0535229  0.06917107 0.05215766 0.03976016 0.03469748
 0.03477429 0.04503125 0.03752673 0.05095012 0.04172371 0.0378729
 0.0521295  0.06448708 0.05409981 0.06716467 0.06250306 0.05368341
 0.06419102]


In [14]:
print("Best Score for normalised data : {} for {}".format(autoimmune_knn.best_score_, autoimmune_knn.best_params_))

Best Score for normalised data : 0.7632978723404256 for {'n_neighbors': 8}
