# Glioma Grading Clinical and Mutation Features Dataset

Gliomas are the most common primary brain tumors. They are graded as Lower-Grade Glioma (LGG) or Glioblastoma Multiforme (GBM) according to histological and imaging criteria.

Clinical and molecular/mutation factors are also crucial for the grading process. In this dataset, the 20 most frequently mutated genes and 4 clinical features are considered from the TCGA-LGG and TCGA-GBM brain glioma projects.

The task is to determine whether a patient is LGG or GBM given their clinical and molecular/mutation features. The objective is to find the optimal subset of mutation genes and clinical features for the glioma grading process, to improve performance and reduce costs.

## Dataset

The dataset is the Glioma Grading Clinical and Mutation Features Dataset from the UCI Machine Learning Repository.

Link: https://archive-beta.ics.uci.edu/dataset/759/glioma+grading+clinical+and+mutation+features+dataset

## Importing the dataset

In [4]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

In [5]:
df = pd.read_csv("D:\\glioma+grading+clinical+and+mutation+features+dataset\\TCGA_GBM_LGG_Mutations_all.csv")
df.head()

Unnamed: 0,Grade,Project,Case_ID,Gender,Age_at_diagnosis,Primary_Diagnosis,Race,IDH1,TP53,ATRX,...,FUBP1,RB1,NOTCH1,BCOR,CSMD3,SMARCA4,GRIN2A,IDH2,FAT4,PDGFRA
0,LGG,TCGA-LGG,TCGA-DU-8164,Male,51 years 108 days,"Oligodendroglioma, NOS",white,MUTATED,NOT_MUTATED,NOT_MUTATED,...,MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
1,LGG,TCGA-LGG,TCGA-QH-A6CY,Male,38 years 261 days,Mixed glioma,white,MUTATED,NOT_MUTATED,NOT_MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
2,LGG,TCGA-LGG,TCGA-HW-A5KM,Male,35 years 62 days,"Astrocytoma, NOS",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
3,LGG,TCGA-LGG,TCGA-E1-A7YE,Female,32 years 283 days,"Astrocytoma, anaplastic",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,MUTATED,NOT_MUTATED
4,LGG,TCGA-LGG,TCGA-S9-A6WG,Male,31 years 187 days,"Astrocytoma, anaplastic",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED


## Eliminating null values

At first glance, a null check reports no missing values in the dataset:

In [6]:
df.isna().sum()

Grade                0
Project              0
Case_ID              0
Gender               0
Age_at_diagnosis     0
Primary_Diagnosis    0
Race                 0
IDH1                 0
TP53                 0
ATRX                 0
PTEN                 0
EGFR                 0
CIC                  0
MUC16                0
PIK3CA               0
NF1                  0
PIK3R1               0
FUBP1                0
RB1                  0
NOTCH1               0
BCOR                 0
CSMD3                0
SMARCA4              0
GRIN2A               0
IDH2                 0
FAT4                 0
PDGFRA               0
dtype: int64

In [14]:
df.Gender.unique()

array(['Male', 'Female', '--'], dtype=object)

In [11]:
df.Race.unique()

array(['white', 'asian', 'black or african american', '--',
       'not reported', 'american indian or alaska native'], dtype=object)

In [15]:
df.IDH2.unique()

array(['NOT_MUTATED', 'MUTATED'], dtype=object)

On closer examination of the dataset, however, two strings are used in place of a null value: **--** and **not reported**.

We therefore pass these strings to the *na_values* parameter of pandas' *read_csv*, so that pandas recognises these special strings as null values.

In [16]:
df = pd.read_csv("D:\\glioma+grading+clinical+and+mutation+features+dataset\\TCGA_GBM_LGG_Mutations_all.csv", na_values = ["--","not reported"])
df.head()

Unnamed: 0,Grade,Project,Case_ID,Gender,Age_at_diagnosis,Primary_Diagnosis,Race,IDH1,TP53,ATRX,...,FUBP1,RB1,NOTCH1,BCOR,CSMD3,SMARCA4,GRIN2A,IDH2,FAT4,PDGFRA
0,LGG,TCGA-LGG,TCGA-DU-8164,Male,51 years 108 days,"Oligodendroglioma, NOS",white,MUTATED,NOT_MUTATED,NOT_MUTATED,...,MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
1,LGG,TCGA-LGG,TCGA-QH-A6CY,Male,38 years 261 days,Mixed glioma,white,MUTATED,NOT_MUTATED,NOT_MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
2,LGG,TCGA-LGG,TCGA-HW-A5KM,Male,35 years 62 days,"Astrocytoma, NOS",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
3,LGG,TCGA-LGG,TCGA-E1-A7YE,Female,32 years 283 days,"Astrocytoma, anaplastic",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,MUTATED,NOT_MUTATED
4,LGG,TCGA-LGG,TCGA-S9-A6WG,Male,31 years 187 days,"Astrocytoma, anaplastic",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
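The effect of *na_values* can be checked on a small in-memory sample. The sketch below uses a hypothetical three-row CSV (not the real file) that mimics the two placeholder strings, and counts how many cells pandas then treats as missing.

```python
import io
import pandas as pd

# Hypothetical three-row sample mimicking the placeholder strings in the real file.
sample_csv = io.StringIO(
    "Gender,Race\n"
    "Male,white\n"
    "--,not reported\n"
    "Female,--\n"
)

# Both placeholder strings are read as NaN instead of ordinary text.
sample = pd.read_csv(sample_csv, na_values=["--", "not reported"])
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` check on the reloaded glioma DataFrame should likewise show non-zero counts for the affected columns.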


Renaming the **Age_at_diagnosis** column to **age**

In [18]:
df.rename(columns = {"Age_at_diagnosis":"age"}, inplace = True)

Changing the format of the elements in age column from string to numerical values.

In [19]:
df['age'] = df['age'].replace(' years ', '.', regex=True)
df['age'] = df['age'].replace(' years', '', regex=True)
df['age'] = df['age'].replace(' days','', regex = True)

In [20]:
df.age.unique()

array(['51.108', '38.261', '35.62', '32.283', '31.187', '33.78', '35.68',
       '44.239', '33.350', '87', '51.328', '54.95', '52.214', '47.123',
       '34.132', '40.192', '53.352', '41.70', '43.161', '37.159',
       '47.173', '31.8', '25.191', '66.305', '56.250', '35.362', '51.363',
       '37.32', '54.183', '32.76', '65.28', '43.131', '51.59', '43.221',
       '25.214', '45.24', '50.153', '27.166', '53.252', '46.144',
       '24.239', nan, '34.70', '29.198', '45.124', '62.90', '46.224',
       '36.247', '62.202', '70.159', '53.41', '48.124', '40.69', '40.7',
       '20.359', '57.200', '38.322', '52.192', '56.104', '59.275',
       '67.107', '48.346', '59.254', '58.147', '27.247', '51.230',
       '74.11', '52.230', '61.62', '66.146', '42.32', '31.344', '48.268',
       '33.340', '34.213', '24.54', '55.48', '27.323', '29.32', '39.131',
       '70.3', '30.338', '39.178', '25.41', '48.160', '57.17', '34.307',
       '58.137', '55.208', '60.49', '38.13', '56.113', '54.180', '31.152',
 

In [21]:
new_age = list()
for x in df.age:
    try:
        p = x.index('.')
        new_age.append(float(x[:p]) + round(float(x[p+1:]) / 365, 2))
    except ValueError:
        new_age.append(float(x))
    except AttributeError:
        new_age.append(x)
len(new_age), len(df.age)

(862, 862)
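The string replacement and parsing loop above can also be written as one vectorized step. The sketch below is an alternative, not the notebook's method: it assumes every non-null value looks like "N years" or "N years M days", and the helper name `age_to_years` is hypothetical.

```python
import pandas as pd

def age_to_years(series: pd.Series) -> pd.Series:
    """Convert 'N years M days' strings to fractional years (days / 365, rounded to 2 places)."""
    # Named groups become DataFrame columns; the days part is optional.
    parts = series.str.extract(r"(?P<years>\d+)\s+years(?:\s+(?P<days>\d+)\s+days)?")
    years = pd.to_numeric(parts["years"])
    days = pd.to_numeric(parts["days"]).fillna(0)
    # Missing inputs stay NaN because their years component is NaN.
    return years + (days / 365).round(2)

sample_ages = pd.Series(["51 years 108 days", "87 years", None])
converted = age_to_years(sample_ages)
print(converted)
```

On the sample values this reproduces the first entries of the converted column shown in the notebook (51.3 and 87.0), while the missing value remains null for the later imputation step.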

In [22]:
df["age"]=new_age

In [23]:
df.age.unique()

array([51.3 , 38.72, 35.17, 32.78, 31.51, 33.21, 35.19, 44.65, 33.96,
       87.  , 51.9 , 54.26, 52.59, 47.34, 34.36, 40.53, 53.96, 41.19,
       43.44, 37.44, 47.47, 31.02, 25.52, 66.84, 56.68, 35.99, 51.99,
       37.09, 54.5 , 32.21, 65.08, 43.36, 51.16, 43.61, 25.59, 45.07,
       50.42, 27.45, 53.69, 46.39, 24.65,   nan, 34.19, 29.54, 45.34,
       62.25, 46.61, 36.68, 62.55, 70.44, 53.11, 48.34, 40.19, 40.02,
       20.98, 57.55, 38.88, 52.53, 56.28, 59.75, 67.29, 48.95, 59.7 ,
       58.4 , 27.68, 51.63, 74.03, 52.63, 61.17, 66.4 , 42.09, 31.94,
       48.73, 33.93, 34.58, 24.15, 55.13, 27.88, 29.09, 39.36, 70.01,
       30.93, 39.49, 25.11, 48.44, 57.05, 34.84, 58.38, 55.57, 60.13,
       38.04, 56.31, 54.49, 31.42, 54.4 , 52.66, 33.15, 42.47, 20.21,
       64.29, 29.64, 36.61, 31.99, 36.18, 32.47, 38.8 , 29.82, 41.21,
       35.52, 33.11, 32.14, 49.76, 69.59, 14.42, 47.25, 33.16, 64.  ,
       62.56, 37.23, 53.5 , 53.2 , 45.21, 33.91, 53.98, 52.43, 64.83,
       71.19, 40.82,

In [24]:
# fillna returns a new Series; the result is not assigned, so df.age is unchanged here
df.age.fillna(df.age.mean())

0      51.30
1      38.72
2      35.17
3      32.78
4      31.51
       ...  
857    77.89
858    85.18
859    77.49
860    63.33
861    76.61
Name: age, Length: 862, dtype: float64

The remaining null values are then filled with **SimpleImputer**, which here replaces each missing entry with the most frequent value of its column.

In [26]:
df = pd.DataFrame(SimpleImputer(strategy = "most_frequent").fit_transform(df),columns = df.columns)
df.isna().sum()


Grade                0
Project              0
Case_ID              0
Gender               0
age                  0
Primary_Diagnosis    0
Race                 0
IDH1                 0
TP53                 0
ATRX                 0
PTEN                 0
EGFR                 0
CIC                  0
MUC16                0
PIK3CA               0
NF1                  0
PIK3R1               0
FUBP1                0
RB1                  0
NOTCH1               0
BCOR                 0
CSMD3                0
SMARCA4              0
GRIN2A               0
IDH2                 0
FAT4                 0
PDGFRA               0
dtype: int64
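To illustrate what the most-frequent strategy does, here is a minimal sketch on a hypothetical toy DataFrame (the values are invented for illustration and are not rows from the dataset).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Race": ["white", np.nan, "white", "asian"],
})

# Each NaN is replaced by the mode of its own column.
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Note that this strategy also fills a numeric column such as age with its most common value; a mean or median strategy for numeric columns would be a possible alternative.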

In [27]:
df.head()

Unnamed: 0,Grade,Project,Case_ID,Gender,age,Primary_Diagnosis,Race,IDH1,TP53,ATRX,...,FUBP1,RB1,NOTCH1,BCOR,CSMD3,SMARCA4,GRIN2A,IDH2,FAT4,PDGFRA
0,LGG,TCGA-LGG,TCGA-DU-8164,Male,51.3,"Oligodendroglioma, NOS",white,MUTATED,NOT_MUTATED,NOT_MUTATED,...,MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
1,LGG,TCGA-LGG,TCGA-QH-A6CY,Male,38.72,Mixed glioma,white,MUTATED,NOT_MUTATED,NOT_MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
2,LGG,TCGA-LGG,TCGA-HW-A5KM,Male,35.17,"Astrocytoma, NOS",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
3,LGG,TCGA-LGG,TCGA-E1-A7YE,Female,32.78,"Astrocytoma, anaplastic",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,MUTATED,NOT_MUTATED
4,LGG,TCGA-LGG,TCGA-S9-A6WG,Male,31.51,"Astrocytoma, anaplastic",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED


### Getting rid of unwanted columns

In [39]:
df.Project

0      TCGA-LGG
1      TCGA-LGG
2      TCGA-LGG
3      TCGA-LGG
4      TCGA-LGG
         ...   
857    TCGA-GBM
858    TCGA-GBM
859    TCGA-GBM
860    TCGA-GBM
861    TCGA-GBM
Name: Project, Length: 862, dtype: object

In [40]:
df.Case_ID

0      TCGA-DU-8164
1      TCGA-QH-A6CY
2      TCGA-HW-A5KM
3      TCGA-E1-A7YE
4      TCGA-S9-A6WG
           ...     
857    TCGA-19-5959
858    TCGA-16-0846
859    TCGA-28-1746
860    TCGA-32-2491
861    TCGA-06-2557
Name: Case_ID, Length: 862, dtype: object

Here we can see that the columns Project and Case_ID do not add useful information for predicting the Grade (target). Project only takes the values TCGA-LGG and TCGA-GBM, so it restates the target rather than acting as an independent feature. Case_ID is different for every patient, so no correlation between the case ID and the target can be derived. We therefore drop both columns.

In [29]:
df.drop(columns = ["Case_ID","Project"], inplace = True)

In [30]:
df.head()

Unnamed: 0,Grade,Gender,age,Primary_Diagnosis,Race,IDH1,TP53,ATRX,PTEN,EGFR,...,FUBP1,RB1,NOTCH1,BCOR,CSMD3,SMARCA4,GRIN2A,IDH2,FAT4,PDGFRA
0,LGG,Male,51.3,"Oligodendroglioma, NOS",white,MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,...,MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
1,LGG,Male,38.72,Mixed glioma,white,MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
2,LGG,Male,35.17,"Astrocytoma, NOS",white,MUTATED,MUTATED,MUTATED,NOT_MUTATED,NOT_MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
3,LGG,Female,32.78,"Astrocytoma, anaplastic",white,MUTATED,MUTATED,MUTATED,NOT_MUTATED,NOT_MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,MUTATED,NOT_MUTATED
4,LGG,Male,31.51,"Astrocytoma, anaplastic",white,MUTATED,MUTATED,MUTATED,NOT_MUTATED,NOT_MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED


## Label Encoding

Next, all categorical columns in the dataset are converted to numerical values using LabelEncoder. The age column is already numeric and is left out.

In [32]:
from sklearn.preprocessing import LabelEncoder

In [33]:
cols = list(df.columns)
cols.remove("age")

In [34]:
for x in cols:
    df[x] = LabelEncoder().fit_transform(df[x])

In [35]:
df.head()

Unnamed: 0,Grade,Gender,age,Primary_Diagnosis,Race,IDH1,TP53,ATRX,PTEN,EGFR,...,FUBP1,RB1,NOTCH1,BCOR,CSMD3,SMARCA4,GRIN2A,IDH2,FAT4,PDGFRA
0,1,1,51.3,4,3,0,1,1,1,1,...,0,1,1,1,1,1,1,1,1,1
1,1,1,38.72,3,3,0,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
2,1,1,35.17,0,3,0,0,0,1,1,...,1,1,1,1,1,1,1,1,1,1
3,1,0,32.78,1,3,0,0,0,1,1,...,1,1,1,1,1,1,1,1,0,1
4,1,1,31.51,1,3,0,0,0,1,1,...,1,1,1,1,1,1,1,1,1,1
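LabelEncoder assigns integer codes in sorted order of the class labels, which is why LGG becomes 1 and MUTATED becomes 0 in the table above. A minimal sketch on a few label values taken from the dataset:

```python
from sklearn.preprocessing import LabelEncoder

# Classes are sorted alphabetically before codes are assigned.
grade_encoder = LabelEncoder()
grade_codes = grade_encoder.fit_transform(["LGG", "GBM", "LGG"])
grade_mapping = {label: code for code, label in enumerate(grade_encoder.classes_)}

mutation_encoder = LabelEncoder()
mutation_encoder.fit(["MUTATED", "NOT_MUTATED"])
mutation_mapping = {label: code for code, label in enumerate(mutation_encoder.classes_)}

print(grade_mapping)
print(mutation_mapping)
```

Keeping the fitted encoder for each column (instead of creating a throwaway one inside the loop) would allow the codes to be mapped back with `inverse_transform` later.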


In [36]:
data = df.drop(columns = ['Grade'])
target = df.Grade
data.shape,target.shape

((862, 24), (862,))

# Pipeline and Cross Validation

A Pipeline chains a list of transformers with a final estimator. When the pipeline is fitted, each transformer is fitted and applied to the dataset in sequence, and the final estimator is then fitted on the transformed data. The final estimator only needs to implement fit.

K-fold cross-validation splits the dataset into K folds and uses each fold in turn as held-out data to evaluate how well the model handles new data. K refers to the number of groups the data sample is split into.
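The two ideas combine naturally: the whole pipeline is refitted on the training part of each fold, so scaling and PCA never see the held-out fold. The sketch below shows this on synthetic data from `make_classification` (a stand-in, not the glioma data); the step names and `n_neighbors=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the glioma features (not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_pipe = Pipeline([
    ("sc", StandardScaler()),                      # transformer 1: standardize features
    ("pca", PCA()),                                # transformer 2: rotate onto principal axes
    ("knn", KNeighborsClassifier(n_neighbors=5)),  # final estimator
])

# Stratified folds keep the class ratio roughly equal in every fold.
skf_demo = StratifiedKFold(n_splits=10)
fold_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=skf_demo)
print("Mean accuracy over folds:", fold_scores.mean())
```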

In [37]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split,StratifiedKFold
from sklearn.metrics import accuracy_score

In [38]:
def results(pipe, xtrain, ytrain, xtest, ytest):
    train_acc = pipe.score(xtrain, ytrain)
    test_acc = pipe.score(xtest, ytest)
    print("Training accuracy:",train_acc)
    print("Testing accuracy:",test_acc)
    print()
    return train_acc, test_acc

In [39]:
xtrain, xtest, ytrain, ytest = train_test_split(data,target, test_size = 0.3)

In [40]:
np.random.seed(0)
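Because the seed is set after the split, it does not make that split repeatable. One possible way to get a reproducible, class-balanced split is to pass `random_state` and `stratify` directly; the sketch below uses a hypothetical toy array and an arbitrary seed of 0.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, two balanced classes.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# The same random_state yields the same partition on every call;
# stratify=y_toy keeps the class ratio equal in train and test.
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

x_test_a, y_test_a = split_a[1], split_a[3]
print(len(x_test_a), int(y_test_a.sum()))
```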

# k Nearest Neighbors

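Trying values of k from 2 to 15 on the hold-out split to choose the number of neighbors. Because weights = 'distance' is used, each training point is its own nearest neighbor at distance zero, so the training accuracy is 1.0 for every k and only the testing accuracy is informative.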

In [41]:
for i in range(2,16):
    knn = KNeighborsClassifier(n_neighbors = i,weights = 'distance')
    knn = knn.fit(xtrain, ytrain)
    ytrain_preds = knn.predict(xtrain)
    ytest_preds = knn.predict(xtest)
    print("For k:", i)
    print("Training Accuracy:",accuracy_score(ytrain, ytrain_preds))
    print("Testing Accuracy:",accuracy_score(ytest, ytest_preds))
    print()

For k: 2
Training Accuracy: 1.0
Testing Accuracy: 0.8918918918918919

For k: 3
Training Accuracy: 1.0
Testing Accuracy: 0.8803088803088803

For k: 4
Training Accuracy: 1.0
Testing Accuracy: 0.8841698841698842

For k: 5
Training Accuracy: 1.0
Testing Accuracy: 0.888030888030888

For k: 6
Training Accuracy: 1.0
Testing Accuracy: 0.8764478764478765

For k: 7
Training Accuracy: 1.0
Testing Accuracy: 0.8764478764478765

For k: 8
Training Accuracy: 1.0
Testing Accuracy: 0.8764478764478765

For k: 9
Training Accuracy: 1.0
Testing Accuracy: 0.8571428571428571

For k: 10
Training Accuracy: 1.0
Testing Accuracy: 0.8532818532818532

For k: 11
Training Accuracy: 1.0
Testing Accuracy: 0.8494208494208494

For k: 12
Training Accuracy: 1.0
Testing Accuracy: 0.8455598455598455

For k: 13
Training Accuracy: 1.0
Testing Accuracy: 0.833976833976834

For k: 14
Training Accuracy: 1.0
Testing Accuracy: 0.8301158301158301

For k: 15
Training Accuracy: 1.0
Testing Accuracy: 0.8185328185328186
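Choosing k from a single hold-out split can be sensitive to how that split happened to fall. A possible alternative, sketched here on synthetic data rather than the glioma data, is to let `GridSearchCV` pick k by cross-validated accuracy; the 5-fold setting and step names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the real dataset).
X_search, y_search = make_classification(n_samples=200, n_features=10, random_state=0)

knn_pipe = Pipeline([
    ("sc", StandardScaler()),
    ("knn", KNeighborsClassifier(weights="distance")),
])

# "knn__n_neighbors" addresses the n_neighbors parameter of the "knn" step.
search = GridSearchCV(
    knn_pipe,
    param_grid={"knn__n_neighbors": list(range(2, 16))},
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
search.fit(X_search, y_search)
best_k = search.best_params_["knn__n_neighbors"]
print("Best k:", best_k, "CV accuracy:", search.best_score_)
```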



Evaluating the model with 10-fold stratified cross-validation

In [87]:
def knn_evaluation():
    pipe = Pipeline([('sc',StandardScaler()),
                     ('pca',PCA()),
                     ('knn', KNeighborsClassifier(n_neighbors = 6,weights = 'distance'))])
    fold = 0
    a, b = 0, 0
    trainavg , testavg = 0, 0
    skf = StratifiedKFold(n_splits = 10)
    for trainindex, testindex in skf.split(data,target):
        fold += 1
        xtrain, xtest = data.iloc[trainindex], data.iloc[testindex]
        ytrain, ytest = target.iloc[trainindex], target.iloc[testindex]
        pipe.fit(xtrain, ytrain)
        print("For Fold", fold)
        a, b = results(pipe, xtrain, ytrain, xtest, ytest)
        trainavg += a
        testavg += b
    print("Training accuracy average:", trainavg/10)
    print("Testing accuracy average:", testavg/10)

In [88]:
knn_evaluation()

For Fold 1
Training accuracy: 1.0
Testing accuracy: 0.8505747126436781

For Fold 2
Training accuracy: 1.0
Testing accuracy: 0.8850574712643678

For Fold 3
Training accuracy: 1.0
Testing accuracy: 0.872093023255814

For Fold 4
Training accuracy: 1.0
Testing accuracy: 0.8372093023255814

For Fold 5
Training accuracy: 1.0
Testing accuracy: 0.8953488372093024

For Fold 6
Training accuracy: 1.0
Testing accuracy: 0.813953488372093

For Fold 7
Training accuracy: 1.0
Testing accuracy: 0.8023255813953488

For Fold 8
Training accuracy: 1.0
Testing accuracy: 0.8953488372093024

For Fold 9
Training accuracy: 1.0
Testing accuracy: 0.8488372093023255

For Fold 10
Training accuracy: 1.0
Testing accuracy: 0.9069767441860465

Training accuracy average: 1.0
Testing accuracy average: 0.860772520716386
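The same fold loop is repeated for every model in the sections that follow. As a sketch of how it could be shared, the hypothetical helper `evaluate_model` below wraps any estimator in the scaler and PCA steps and uses scikit-learn's `cross_validate` to collect training and testing accuracy per fold. It is demonstrated on synthetic data, not the glioma data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def evaluate_model(estimator, X, y, n_splits=10):
    """Return (mean training accuracy, mean testing accuracy) over stratified folds."""
    pipe = Pipeline([("sc", StandardScaler()), ("pca", PCA()), ("model", estimator)])
    cv_results = cross_validate(
        pipe, X, y,
        cv=StratifiedKFold(n_splits=n_splits),
        return_train_score=True,  # also score each fold's training portion
    )
    return cv_results["train_score"].mean(), cv_results["test_score"].mean()

# Synthetic stand-in data (not the real dataset).
X_cv, y_cv = make_classification(n_samples=200, n_features=10, random_state=0)
train_mean, test_mean = evaluate_model(DecisionTreeClassifier(random_state=0), X_cv, y_cv)
print("Training accuracy average:", train_mean)
print("Testing accuracy average:", test_mean)
```

With such a helper, each model section would reduce to a single call such as `evaluate_model(RandomForestClassifier(), data, target)`.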


# Decision Trees

In [89]:
from sklearn.tree import DecisionTreeClassifier

In [90]:
def dt_evaluation():
    pipe = Pipeline([('sc',StandardScaler()),
                     ('pca',PCA()),
                     ('dt', DecisionTreeClassifier())])
    fold = 0
    a, b = 0, 0
    trainavg , testavg = 0, 0
    skf = StratifiedKFold(n_splits = 10)
    for trainindex, testindex in skf.split(data,target):
        fold += 1
        xtrain, xtest = data.iloc[trainindex], data.iloc[testindex]
        ytrain, ytest = target.iloc[trainindex], target.iloc[testindex]
        pipe.fit(xtrain, ytrain)
        print("For Fold", fold)
        a, b = results(pipe, xtrain, ytrain, xtest, ytest)
        trainavg += a
        testavg += b
    print("Training accuracy average:", trainavg/10)
    print("Testing accuracy average:", testavg/10)

In [91]:
dt_evaluation()

For Fold 1
Training accuracy: 1.0
Testing accuracy: 0.8160919540229885

For Fold 2
Training accuracy: 1.0
Testing accuracy: 0.8390804597701149

For Fold 3
Training accuracy: 1.0
Testing accuracy: 0.872093023255814

For Fold 4
Training accuracy: 1.0
Testing accuracy: 0.8837209302325582

For Fold 5
Training accuracy: 1.0
Testing accuracy: 0.8372093023255814

For Fold 6
Training accuracy: 1.0
Testing accuracy: 0.8255813953488372

For Fold 7
Training accuracy: 1.0
Testing accuracy: 0.9069767441860465

For Fold 8
Training accuracy: 1.0
Testing accuracy: 0.8255813953488372

For Fold 9
Training accuracy: 1.0
Testing accuracy: 0.872093023255814

For Fold 10
Training accuracy: 1.0
Testing accuracy: 0.9186046511627907

Training accuracy average: 1.0
Testing accuracy average: 0.8597032878909381


# SVM

In [92]:
from sklearn.svm import SVC

In [94]:
kernels = ['linear', 'poly', 'rbf']
def svc_evaluation():
    for x in kernels:
        pipe = Pipeline([('sc',StandardScaler()),
                         ('pca',PCA()),
                         ('svc', SVC(kernel = x))])
        fold = 0
        a, b = 0, 0
        print("For kernel:", x)
        trainavg , testavg = 0, 0
        skf = StratifiedKFold(n_splits = 10)
        for trainindex, testindex in skf.split(data,target):
            fold += 1
            xtrain, xtest = data.iloc[trainindex], data.iloc[testindex]
            ytrain, ytest = target.iloc[trainindex], target.iloc[testindex]
            pipe.fit(xtrain, ytrain)
            print("For Fold", fold)
            a, b = results(pipe, xtrain, ytrain, xtest, ytest)
            trainavg += a
            testavg += b
        print("Training accuracy average:", trainavg/10)
        print("Testing accuracy average:", testavg/10)
        print()

In [95]:
svc_evaluation()

For kernel: linear
For Fold 1
Training accuracy: 0.8735483870967742
Testing accuracy: 0.8735632183908046

For Fold 2
Training accuracy: 0.8748387096774194
Testing accuracy: 0.8620689655172413

For Fold 3
Training accuracy: 0.8711340206185567
Testing accuracy: 0.8953488372093024

For Fold 4
Training accuracy: 0.8762886597938144
Testing accuracy: 0.8488372093023255

For Fold 5
Training accuracy: 0.8737113402061856
Testing accuracy: 0.872093023255814

For Fold 6
Training accuracy: 0.875
Testing accuracy: 0.8604651162790697

For Fold 7
Training accuracy: 0.875
Testing accuracy: 0.8604651162790697

For Fold 8
Training accuracy: 0.8724226804123711
Testing accuracy: 0.8837209302325582

For Fold 9
Training accuracy: 0.8737113402061856
Testing accuracy: 0.872093023255814

For Fold 10
Training accuracy: 0.8698453608247423
Testing accuracy: 0.9069767441860465

Training accuracy average: 0.873550049883605
Testing accuracy average: 0.8735632183908045

For kernel: poly
For Fold 1
Training accuracy:

# Random forest

In [96]:
from sklearn.ensemble import RandomForestClassifier

In [99]:
def rfc_evaluation():
    pipe = Pipeline([('sc',StandardScaler()),
                     ('pca',PCA()),
                     ('rf', RandomForestClassifier())])
    fold = 0
    a, b = 0, 0
    trainavg , testavg = 0, 0
    skf = StratifiedKFold(n_splits = 10)
    for trainindex, testindex in skf.split(data,target):
        fold += 1
        xtrain, xtest = data.iloc[trainindex], data.iloc[testindex]
        ytrain, ytest = target.iloc[trainindex], target.iloc[testindex]
        pipe.fit(xtrain, ytrain)
        print("For Fold", fold)
        a, b = results(pipe, xtrain, ytrain, xtest, ytest)
        trainavg += a
        testavg += b
    print("Training accuracy average:", trainavg/10)
    print("Testing accuracy average:", testavg/10)

In [100]:
rfc_evaluation()

For Fold 1
Training accuracy: 1.0
Testing accuracy: 0.896551724137931

For Fold 2
Training accuracy: 1.0
Testing accuracy: 0.9080459770114943

For Fold 3
Training accuracy: 1.0
Testing accuracy: 0.9186046511627907

For Fold 4
Training accuracy: 1.0
Testing accuracy: 0.872093023255814

For Fold 5
Training accuracy: 1.0
Testing accuracy: 0.9418604651162791

For Fold 6
Training accuracy: 1.0
Testing accuracy: 0.9069767441860465

For Fold 7
Training accuracy: 1.0
Testing accuracy: 0.872093023255814

For Fold 8
Training accuracy: 1.0
Testing accuracy: 0.9069767441860465

For Fold 9
Training accuracy: 1.0
Testing accuracy: 0.872093023255814

For Fold 10
Training accuracy: 1.0
Testing accuracy: 0.9186046511627907

Training accuracy average: 1.0
Testing accuracy average: 0.9013900026730821


In [101]:
report_data = {"Models":["KNeighborsClassifier(k = 6)","DecisionTreeClassifier","SVC(kernel = x)","RandomForestClassifier"],
               "Training Accuracy":["100%","100%","90%","100%"],
               "Testing Accuracy":["86.08%","86%","86.42%","90.13%"]}

In [102]:
report_data

{'Models': ['KNeighborsClassifier(k = 6)',
  'DecisionTreeClassifier',
  'SVC(kernel = x)',
  'RandomForestClassifier'],
 'Training Accuracy': ['100%', '100%', '90%', '100%'],
 'Testing Accuracy': ['86.08%', '86%', '86.42%', '90.13%']}

In [103]:
report = pd.DataFrame(report_data)

In [104]:
report

Unnamed: 0,Models,Training Accuracy,Testing Accuracy
0,KNeighborsClassifier(k = 6),100%,86.08%
1,DecisionTreeClassifier,100%,86%
2,SVC(kernel = x),90%,86.42%
3,RandomForestClassifier,100%,90.13%


Thus the Random Forest classifier achieves the highest average testing accuracy across the 10 folds (about 90.1%) and outperforms the other models evaluated here.
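A comparison table can also be built directly from the fold averages instead of typing the percentages by hand, which avoids rounding slips. The sketch below copies the average testing accuracies printed in the outputs above; for SVC it uses the linear-kernel average, since that is the only complete SVC run shown, so the formatted figures may differ slightly from the hand-typed table.

```python
import pandas as pd

# Average testing accuracies copied from the 10-fold outputs above;
# the SVC value is the linear-kernel average (an assumption about which run to report).
test_averages = {
    "KNeighborsClassifier(k = 6)": 0.860772520716386,
    "DecisionTreeClassifier": 0.8597032878909381,
    "SVC(kernel = 'linear')": 0.8735632183908045,
    "RandomForestClassifier": 0.9013900026730821,
}

# Format each fraction as a percentage with two decimals.
summary = pd.DataFrame({
    "Models": list(test_averages),
    "Testing Accuracy": [f"{value:.2%}" for value in test_averages.values()],
})
best_model = max(test_averages, key=test_averages.get)
print(summary)
print("Best model:", best_model)
```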