# Part 4: Performing Machine Learning Analysis on 2ndphase information and Model Evaluation

**Importing the necessary packages**

In [1]:
import csv
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn import svm
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix,classification_report 
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier


**Combining Output from 1st Phase Analysis to Data from 2nd Phase Analysis**<br>
In this stage, wewill combine the information gained from the principle component analysis which was obtained as output of phase 1 of our model with the feature selected features from our challenge dataset. 

In [2]:
df=pd.read_csv('1st Phase Output.csv')
df1=pd.read_csv('DB_Phase2.csv')
subject=df['Subject_ID']
df=df.dropna()
df=df.reset_index(drop=True)
df1=df1.drop(['Subject_ID'],axis=1)

df_phase_2=pd.concat([df,df1],axis=1)
df_phase_2=df_phase_2.values.tolist()
header=["SUBJECT","PC1","PC2","RID", "AGE", "GENDER", "EDUCATION","ETHNICITY","RACE","APOE4","MMSE","STATUS"]
csvdata=header
with open("ML.csv","w",encoding='utf-8') as csvFile:
    writer=csv.writer(csvFile)
    writer.writerow(csvdata)
csvFile.close()

csvdata=df_phase_2
with open("ML.csv","a",encoding='utf-8') as csvFile:
    writer=csv.writer(csvFile)
    writer.writerows(csvdata)
csvFile.close()

data=pd.read_csv('ML.csv')
print(data)

        SUBJECT        PC1       PC2    RID   AGE  GENDER  EDUCATION  \
0    002_S_0295   2.370924 -0.693114  295.0  84.8    Male       18.0   
1    002_S_0295   2.372282 -1.203138  295.0  84.8    Male       18.0   
2    002_S_0295   2.372031 -0.698526  295.0  84.8    Male       18.0   
3    002_S_0295   2.407100 -0.976808  295.0  84.8    Male       18.0   
4    002_S_0413   1.672813 -0.126494  413.0  76.3  Female       16.0   
5    002_S_0413   1.236724  0.096254  413.0  76.3  Female       16.0   
6    002_S_0413   1.671244 -0.111068  413.0  76.3  Female       16.0   
7    002_S_0413   1.566481  0.003706  413.0  76.3  Female       16.0   
8    002_S_0619  12.930872  3.287687  619.0  77.5    Male       12.0   
9    002_S_0619  11.014633  2.667719  619.0  77.5    Male       12.0   
10   002_S_0619  13.146209  3.300931  619.0  77.5    Male       12.0   
11   002_S_0619   3.274378  0.367081  619.0  77.5    Male       12.0   
12   002_S_0619  12.479669  3.292650  619.0  77.5    Male       

**Preparing Data for Initial ML Analysis**<br>
In most Machine Learning functions offered by Scikit Learn, the functions are not capable of being able to read Categorical data. From our created dataset. Many factors like Gender, Subject_ID, Ethnicity, Race and so on present categorical values. Hence, we have performed One Hot Encoding to binary discretize Categorical columns, and we have usedLabel Encoder to Label the Target Status 


In [3]:
data=data.dropna()
data2=data
data2=data2.drop(['GENDER'],axis=1)
data2=data2.drop(['SUBJECT'],axis=1)
data2=data2.drop(['ETHNICITY'],axis=1)
data2=data2.drop(['RACE'],axis=1)
dummies=pd.get_dummies(data.SUBJECT)
dummies2=pd.get_dummies(data.GENDER)
dummies3=pd.get_dummies(data.ETHNICITY)
dummies4=pd.get_dummies(data.RACE)
data2=pd.concat([data2,dummies],axis=1)
data2=pd.concat([data2,dummies2],axis=1)
data2=pd.concat([data2,dummies3],axis=1)
data2=pd.concat([data2,dummies4],axis=1)

le=LabelEncoder()
Y=le.fit_transform(data2.STATUS.astype(str))
X=data2.drop(['STATUS'],axis=1)
print(X)
print(Y)



           PC1       PC2    RID   AGE  EDUCATION  APOE4  MMSE  002_S_0295  \
0     2.370924 -0.693114  295.0  84.8       18.0    1.0  28.0           1   
1     2.372282 -1.203138  295.0  84.8       18.0    1.0  28.0           1   
2     2.372031 -0.698526  295.0  84.8       18.0    1.0  28.0           1   
3     2.407100 -0.976808  295.0  84.8       18.0    1.0  28.0           1   
4     1.672813 -0.126494  413.0  76.3       16.0    0.0  29.0           0   
5     1.236724  0.096254  413.0  76.3       16.0    0.0  29.0           0   
6     1.671244 -0.111068  413.0  76.3       16.0    0.0  29.0           0   
7     1.566481  0.003706  413.0  76.3       16.0    0.0  29.0           0   
8    12.930872  3.287687  619.0  77.5       12.0    2.0  22.0           0   
9    11.014633  2.667719  619.0  77.5       12.0    2.0  22.0           0   
10   13.146209  3.300931  619.0  77.5       12.0    2.0  22.0           0   
11    3.274378  0.367081  619.0  77.5       12.0    2.0  22.0           0   

**Random Forest Classification**<br>
In this stage, we will divide the Dataset into training and test data. We will then apply the Random Forest Classifier to the training input and targets. We will look at the efficiency of the algorithm on the Training data. Then we will use the algorithm to predict test data output to know the efficiency of our algoritm after which we will evaluate the model and obtain Mean Avg precision, recall and f1 scores. 

In [4]:
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.2)
clf = RandomForestClassifier(n_estimators=100)
clf=clf.fit(x_train,y_train)

scores = cross_val_score(clf, x_train, y_train, cv=5)
print("Accuracy of a Random Forest is:",round(np.mean((scores*100)),2))

yp=clf.predict(x_test)
print("Accuracy of a Random Forest predicted over actual is:",round(((accuracy_score(y_test,yp))*100),2))
listt=list(le.classes_)
y_test=list(le.inverse_transform(y_test))
yp=list(le.inverse_transform(yp))
#Evaluation
print('The confusion Matrix is as Below:\n')
print(confusion_matrix(y_test,yp,labels=listt))
print('\nThe Evaluation Report is as Below:\n')
print(classification_report(y_test,yp,target_names=listt))



Accuracy of a Random Forest is: 97.3
Accuracy of a Random Forest predicted over actual is: 97.86
The confusion Matrix is as Below:

[[18  0  1]
 [ 0 41  2]
 [ 0  0 78]]

The Evaluation Report is as Below:

              precision    recall  f1-score   support

          AD       1.00      0.95      0.97        19
          CN       1.00      0.95      0.98        43
        LMCI       0.96      1.00      0.98        78

   micro avg       0.98      0.98      0.98       140
   macro avg       0.99      0.97      0.98       140
weighted avg       0.98      0.98      0.98       140



**Probabilistic Distillation**<br>
The Output acquired on the test data in the previous phase is then divided probabilistically to denote, what % of the test data, pertains to a particular status value of the output(CN, AD, lMCI). 
Then these probabilities are binned or rounded off to the nearest 1 point decimal value and the entire probabilistic values are added to the Input of the dataset to create a new dataset. 

In [5]:
yproba=clf.predict_proba(X)

binlab=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]

bins=[0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
X['AD']=pd.cut(yproba[:,-3],bins,labels=binlab, include_lowest=True)
X['CN']=pd.cut(yproba[:,-2],bins,labels=binlab, include_lowest=True)
X['LMCI']=pd.cut(yproba[:,-1],bins,labels=binlab, include_lowest=True)

print(X)
print(Y)

           PC1       PC2    RID   AGE  EDUCATION  APOE4  MMSE  002_S_0295  \
0     2.370924 -0.693114  295.0  84.8       18.0    1.0  28.0           1   
1     2.372282 -1.203138  295.0  84.8       18.0    1.0  28.0           1   
2     2.372031 -0.698526  295.0  84.8       18.0    1.0  28.0           1   
3     2.407100 -0.976808  295.0  84.8       18.0    1.0  28.0           1   
4     1.672813 -0.126494  413.0  76.3       16.0    0.0  29.0           0   
5     1.236724  0.096254  413.0  76.3       16.0    0.0  29.0           0   
6     1.671244 -0.111068  413.0  76.3       16.0    0.0  29.0           0   
7     1.566481  0.003706  413.0  76.3       16.0    0.0  29.0           0   
8    12.930872  3.287687  619.0  77.5       12.0    2.0  22.0           0   
9    11.014633  2.667719  619.0  77.5       12.0    2.0  22.0           0   
10   13.146209  3.300931  619.0  77.5       12.0    2.0  22.0           0   
11    3.274378  0.367081  619.0  77.5       12.0    2.0  22.0           0   

**Random Forest + Probabilistic Dsitillation + Support Vector Machines**<br>
For our Unique Model. We wish to carry out ensemble training. Random Forest Method itself taking the role of an ensemble training technique, we have implemented the use of probabilistc Distillation explained above. And using the new dataset, we carried out the Support Vector Mechanism to calculate our results. In this stage, we will divide the Dataset into training and test data. We will then apply the SupportVector Machine Classifier to the training input and targets. We will look at the efficiency of the algorithm on the Training data. Then we will use the algorithm to predict test data output to know the efficiency of our algoritm after which we will evaluate the model and obtain Mean Avg precision, recall and f1 scores.

In [6]:
x_train2,x_test2,y_train2,y_test2=train_test_split(X,Y,test_size=0.2)

clf2 = svm.SVC(gamma=0.01, C=100.)
clf2=clf2.fit(x_train2,y_train2)

scores = cross_val_score(clf2, x_train2, y_train2, cv=5)
print("Accuracy of Random Forest + Probabilistic Distillation + SVM  is:",round(np.mean((scores*100)),2))

yp2=clf2.predict(x_test2)
print("Accuracy of a Random Forest + Probabilistic Distillation + SVM predicted over actual is:",round(((accuracy_score(y_test2,yp2))*100),2))
listt=list(le.classes_)
y_test2=list(le.inverse_transform(y_test2))
yp2=list(le.inverse_transform(yp2))
#Evaluation
print('The confusion Matrix is as Below:\n')
print(confusion_matrix(y_test2,yp2,labels=listt))
print('\nThe Evaluation Report is as Below:\n')
print(classification_report(y_test2,yp2,target_names=listt))




Accuracy of Random Forest + Probabilistic Distillation + SVM  is: 97.12
Accuracy of a Random Forest + Probabilistic Distillation + SVM predicted over actual is: 97.14
The confusion Matrix is as Below:

[[18  0  0]
 [ 0 43  4]
 [ 0  0 75]]

The Evaluation Report is as Below:

              precision    recall  f1-score   support

          AD       1.00      1.00      1.00        18
          CN       1.00      0.91      0.96        47
        LMCI       0.95      1.00      0.97        75

   micro avg       0.97      0.97      0.97       140
   macro avg       0.98      0.97      0.98       140
weighted avg       0.97      0.97      0.97       140



**Random Forest + Probabilistic Dsitillation + K Nearest Neighbour**<br>
For our Unique Model. We wish to carry out ensemble training. Random Forest Method itself taking the role of an ensemble training technique, we have implemented the use of probabilistc Distillation explained above. And using the new dataset, we carried out the K Nearest Neighbour Technique to calculate our results. In this stage, we will divide the Dataset into training and test data. We will then apply the Kk Nearest Neighbour Classifier to the training input and targets. We will look at the efficiency of the algorithm on the Training data. Then we will use the algorithm to predict test data output to know the efficiency of our algoritm after which we will evaluate the model and obtain Mean Avg precision, recall and f1 scores.

In [7]:
x_train3,x_test3,y_train3,y_test3=train_test_split(X,Y,test_size=0.2)

clf3=KNeighborsClassifier(n_neighbors=3)
clf3=clf3.fit(x_train3,y_train3)

scores = cross_val_score(clf3, x_train3, y_train3, cv=5)
print("Accuracy of Random Forest + Probabilistic Distillation + 3KNN  is:",round(np.mean((scores*100)),2))

yp3=clf3.predict(x_test2)
print("Accuracy of a Random Forest + Probabilistic Distillation + 3KNN predicted over actual is:",round(((accuracy_score(y_test3,yp3))*100),2))
listt=list(le.classes_)
y_test3=list(le.inverse_transform(y_test3))
yp3=list(le.inverse_transform(yp3))
#Evaluation
print('The confusion Matrix is as Below:\n')
print(confusion_matrix(y_test3,yp3,labels=listt))
print('\nThe Evaluation Report is as Below:\n')
print(classification_report(y_test3,yp3,target_names=listt))



Accuracy of Random Forest + Probabilistic Distillation + 3KNN  is: 91.92
Accuracy of a Random Forest + Probabilistic Distillation + 3KNN predicted over actual is: 42.14
The confusion Matrix is as Below:

[[ 6  5 15]
 [ 4 12 24]
 [10 23 41]]

The Evaluation Report is as Below:

              precision    recall  f1-score   support

          AD       0.30      0.23      0.26        26
          CN       0.30      0.30      0.30        40
        LMCI       0.51      0.55      0.53        74

   micro avg       0.42      0.42      0.42       140
   macro avg       0.37      0.36      0.36       140
weighted avg       0.41      0.42      0.42       140

