# Data Classification
The MAGIC Gamma Telescope dataset
(https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope) was generated to simulate the
registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma
telescope using the imaging technique. The goal is to statistically discriminate the images
caused by primary gammas (signal) from the images of hadronic showers
initiated by cosmic rays in the upper atmosphere (background).

The task is to explore the data and split it into training and test sets with class labels
g = gamma (signal) and h = hadron (background). You are asked to apply preprocessing and feature
selection techniques and to build classification models using different approaches such as Decision
Trees, AdaBoost, K-Nearest Neighbors (K-NN), and Logistic Regression. The results are then compared
across models, and with and without preprocessing and feature selection. Finally,
you should evaluate each classification model's accuracy on the test set.

In [305]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc, precision_recall_fscore_support
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif

1. fLength: continuous # major axis of ellipse [mm] 
2. fWidth: continuous # minor axis of ellipse [mm] 
3. fSize: continuous # 10-log of sum of content of all pixels [in #phot] 
4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio] 
5. fConc1: continuous # ratio of highest pixel over fSize [ratio] 
6. fAsym: continuous # distance from highest pixel to center, projected onto major axis [mm] 
7. fM3Long: continuous # 3rd root of third moment along major axis [mm] 
8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm] 
9. fAlpha: continuous # angle of major axis with vector to origin [deg] 
10. fDist: continuous # distance from origin to center of ellipse [mm] 
11. class: g,h # gamma (signal), hadron (background) 

g = gamma (signal): 12332 
h = hadron (background): 6688 


In [275]:
col_names = ['fLength', 'fWidth', 'fSize', 'fConc', 'fConc1', 'fAsym',
             'fM3Long', 'fM3Trans', 'fAlpha', 'fDist', 'class']
feature_names = col_names[:-1]  # every column except the class label
data = pd.read_csv("magic04.data", names=col_names)
X = data[feature_names]
Y = data['class']
data.head()

|   | fLength | fWidth | fSize | fConc | fConc1 | fAsym | fM3Long | fM3Trans | fAlpha | fDist | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7967 | 16.0021 | 2.6449 | 0.3918 | 0.1982 | 27.7004 | 22.011 | -8.2027 | 40.092 | 81.8828 | g |
| 1 | 31.6036 | 11.7235 | 2.5185 | 0.5303 | 0.3773 | 26.2722 | 23.8238 | -9.9574 | 6.3609 | 205.261 | g |
| 2 | 162.052 | 136.031 | 4.0612 | 0.0374 | 0.0187 | 116.741 | -64.858 | -45.216 | 76.96 | 256.788 | g |
| 3 | 23.8172 | 9.5728 | 2.3385 | 0.6147 | 0.3922 | 27.2107 | -6.4633 | -7.1513 | 10.449 | 116.737 | g |
| 4 | 75.1362 | 30.9205 | 3.1611 | 0.3168 | 0.1832 | -5.5277 | 28.5525 | 21.8393 | 4.648 | 356.462 | g |


In [306]:
# Classifiers to compare, keyed by the display name used in the printed results.
models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "Ada Boost": AdaBoostClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
}

## Without Preprocessing or feature selection

In [307]:
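# 70/30 train/test split on the raw, unscaled features.
# X must still be the DataFrame loaded above: .head() fails once X has been
# overwritten by a NumPy array (for example by a scaler's fit_transform).
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7)
X_train.head()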

AttributeError: 'numpy.ndarray' object has no attribute 'head'

In [308]:
for name, model in models.items():
    # Fit on the training split. y_pred holds predictions for the training rows,
    # so the error count and precision/recall/F-score below are training-set figures.
    y_pred = model.fit(X_train, Y_train).predict(X_train)

    print(name)
    print("Number of mislabeled points %d out of %d total points."% ((Y_train != y_pred).sum(), X_train.shape[0]))
    # r = (precision, recall, fscore, support), each with one entry per class.
    r = precision_recall_fscore_support(Y_train, y_pred)
    # Unweighted mean over the two classes (macro average).
    print("precision: ",str( (r[0][0]+r[0][1])/2 ), "\t Recall: ", str( (r[1][0]+r[1][1])/2 ), "\t FScore: ", str( (r[2][0]+r[2][1])/2 ))
    # Only this accuracy is measured on the held-out test split.
    print("Model accuracy =" , model.score(X_test,Y_test))
    print("=======================================")

Naive Bayes
Number of mislabeled points 3651 out of 13314 total points.
precision:  0.7175130246273267 	 Recall:  0.6455998579730086 	 FScore:  0.6516107192438991
Model accuracy = 0.7237995092884683
KNN
Number of mislabeled points 1632 out of 13314 total points.
precision:  0.8876186135509975 	 Recall:  0.8416156075537435 	 FScore:  0.8578529980662701
Model accuracy = 0.8427970557308097
Logistic Regression
Number of mislabeled points 2827 out of 13314 total points.
precision:  0.7775196761240242 	 Recall:  0.7409135071869462 	 FScore:  0.7523310959987268
Model accuracy = 0.7960042060988434




Random Forest
Number of mislabeled points 133 out of 13314 total points.
precision:  0.9920403161456239 	 Recall:  0.9861104121160437 	 FScore:  0.9889701476570014
Model accuracy = 0.8752190676480898
Ada Boost
Number of mislabeled points 2055 out of 13314 total points.
precision:  0.8378148972585799 	 Recall:  0.8171421060494621 	 FScore:  0.8255744557127235
Model accuracy = 0.8447248510339993
Decision Tree
Number of mislabeled points 0 out of 13314 total points.
precision:  1.0 	 Recall:  1.0 	 FScore:  1.0
Model accuracy = 0.8198387662110059
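
The Decision Tree output above reports zero mislabeled training points but a test accuracy of about 0.82, which suggests the tree memorizes the training rows. A minimal sketch of how k-fold cross-validation exposes that gap is shown below. It runs on synthetic data from `make_classification` as a stand-in for the MAGIC features (an assumption made so the block is self-contained); `X_demo`, `y_demo`, and the other names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 10 numeric features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# Accuracy on the same rows the tree was fitted on.
tree = DecisionTreeClassifier(random_state=0)
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# 5-fold cross-validation: each fold is scored by a tree that never saw it.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5)

print("training accuracy:", train_acc)
print("cross-validated accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
```

On the real data, the same two calls could be applied to `X_train` and `Y_train`, keeping the test split untouched for a final check.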


## With Preprocessing and feature selection

In [309]:
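def best_k(X):
    """Return the k for SelectKBest that gives GaussianNB the fewest errors.

    Relies on the global label vector Y. Errors are counted on the same rows
    the model was fitted on, and k stops at X.shape[1] - 1, so the full
    feature set is never tried.
    """
    err = X.shape[0]  # worst case: every row mislabeled
    best_val = 2

    for i in range(1, X.shape[1]):
        # Keep the i features with the highest ANOVA F-scores.
        X_new = SelectKBest(f_classif, k=i).fit_transform(X, Y)
        model = GaussianNB()
        y_pred = model.fit(X_new, Y).predict(X_new)
        num = (Y != y_pred).sum()
        if num < err:
            err = num
            best_val = i

    return best_val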
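
`best_k` scores each candidate `k` on the rows used for fitting, and it sees every row, including those that later land in the test split. One alternative, sketched below under stated assumptions, wraps scaling, selection, and the classifier in a scikit-learn `Pipeline` and lets `GridSearchCV` pick `k` by cross-validation on the training split only. This is not the notebook's method; the data is synthetic (`make_classification`) and the names `pipe`, `search`, and `best_k_cv` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 10 numeric features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=0
)

# Scaler and selector are refitted inside every CV fold, so no fold
# is scored with statistics computed from its own rows.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("clf", GaussianNB()),
])
param_grid = {"select__k": list(range(1, X_demo.shape[1] + 1))}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

best_k_cv = search.best_params_["select__k"]
test_acc = search.score(X_te, y_te)  # best pipeline, refitted on all of X_tr
print("selected k:", best_k_cv)
print("held-out accuracy: %.3f" % test_acc)
```

The grid here also includes `k` equal to the full feature count, so "no selection" competes with the smaller subsets.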

In [310]:
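# Standardize every feature to zero mean and unit variance.
# The result goes into a new name so the original DataFrame X stays usable by earlier cells.
X_scaled = preprocessing.StandardScaler().fit_transform(X)
K = best_k(X_scaled)

# Keep the K features with the highest ANOVA F-scores, then split 70/30.
# Note: scaler and selector are fitted on all rows, including those that end up in the test split.
X_new = SelectKBest(f_classif, k=K).fit_transform(X_scaled, Y)
X_train, X_test, Y_train, Y_test = train_test_split(X_new, Y, train_size=0.7)

for name, model in models.items():
    # Predictions are for the training rows, so the metrics below are training-set figures.
    y_pred = model.fit(X_train, Y_train).predict(X_train)

    print(name)
    print("Number of mislabeled points %d out of %d total points."% ((Y_train != y_pred).sum(), X_train.shape[0]))
    # r = (precision, recall, fscore, support), each with one entry per class.
    r = precision_recall_fscore_support(Y_train, y_pred)
    print("precision: ",str( (r[0][0]+r[0][1])/2 ), "\t Recall: ", str( (r[1][0]+r[1][1])/2 ), "\t FScore: ", str( (r[2][0]+r[2][1])/2 ))
    # Only this accuracy is measured on the held-out test split.
    print("Model accuracy =" , model.score(X_test,Y_test))
    print("=======================================")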

Naive Bayes
Number of mislabeled points 3112 out of 13314 total points.
precision:  0.7535845893935009 	 Recall:  0.7099788800773768 	 FScore:  0.7214920549768542
Model accuracy = 0.7630564318261479
KNN
Number of mislabeled points 2048 out of 13314 total points.
precision:  0.841077740602201 	 Recall:  0.8130283087696497 	 FScore:  0.8238504298937415
Model accuracy = 0.7940764107956537
Logistic Regression
Number of mislabeled points 2794 out of 13314 total points.
precision:  0.782653596475839 	 Recall:  0.7390010511242914 	 FScore:  0.751855170195262
Model accuracy = 0.7886435331230284




Random Forest
Number of mislabeled points 286 out of 13314 total points.
precision:  0.9824003737037343 	 Recall:  0.9705107785975504 	 FScore:  0.9760508723767813
Model accuracy = 0.7977567472835612
Ada Boost
Number of mislabeled points 2495 out of 13314 total points.
precision:  0.8076327696203504 	 Recall:  0.7677322540252831 	 FScore:  0.780805239649033
Model accuracy = 0.8012618296529969
Decision Tree
Number of mislabeled points 0 out of 13314 total points.
precision:  1.0 	 Recall:  1.0 	 FScore:  1.0
Model accuracy = 0.749211356466877


In [311]:
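from sklearn.metrics import accuracy_score

def knn():
    """Return the n_neighbors in 1..29 with the highest accuracy on the test split."""
    neigh_score = []
    for i in range(1, 30):
        knn_model = KNeighborsClassifier(n_neighbors=i)
        knn_model.fit(X_train, Y_train)
        pred = knn_model.predict(X_test)
        score = accuracy_score(Y_test, pred)
        neigh_score.append((i, score))
    # Pick the k with the best score; ties go to the smallest k.
    k = max(neigh_score, key=lambda x: x[1])[0]
    return k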
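
The `knn()` helper picks the neighbor count by its score on the test split, which means the test split no longer gives an independent estimate for the chosen model. A hedged alternative is to choose `k` by cross-validation on the training split and touch the test split only once at the end. The sketch below uses synthetic data from `make_classification` (an assumption, so the block runs on its own); `cv_means`, `best_neighbors`, and `final_knn` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 10 numeric features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=0
)

# Mean 5-fold CV accuracy on the training split for each candidate k.
cv_means = {}
for k in range(1, 30):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5)
    cv_means[k] = scores.mean()

# Best k by CV; the test split is used only for the final score.
best_neighbors = max(cv_means, key=cv_means.get)
final_knn = KNeighborsClassifier(n_neighbors=best_neighbors).fit(X_tr, y_tr)
test_acc = final_knn.score(X_te, y_te)

print("selected n_neighbors:", best_neighbors)
print("held-out accuracy: %.3f" % test_acc)
```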