# Genetic Data Classification Practice Challenge

**Problem Domain:** The dataset comes from genomics, which is why it has far more columns than rows; this shape is common in genetic and genomic datasets. Similar genomic/genetic challenges may follow in the near future, and this project should help you get familiar with datasets like these.

## CHALLENGE REQUIREMENTS
This challenge has the following targets:

Train a model using the file train.csv. The train.csv file can be found in the challenge dataset folder in the forum and should be used to train a machine learning model. The last (right-most) column in the file is the label, and all other columns are features.

Using the trained model, take test_x.csv as feature input and predict the probabilities of each of the 5 classes. The test_x.csv file contains the test dataset and does NOT contain the labels. Pass these features to your model and generate a prediction with 241 rows (same as test_x.csv) and 5 columns (one per class), with each value a decimal between 0 and 1.

## METRIC
This is a classification problem, and we'll be using the classification metric **ROC AUC Score** as the scoring metric.
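
With more than two classes, ROC AUC is computed from per-class probabilities and needs an averaging strategy. The challenge text does not say which strategy the scorer uses, so the following is only a minimal sketch on hypothetical toy data (not the challenge data) showing the two options scikit-learn offers through `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy example: 6 samples, 3 classes (not the challenge data).
y_true = np.array([0, 1, 2, 0, 1, 2])

# Each row holds one probability per class and sums to 1.
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
])

# For more than two classes, roc_auc_score needs a multi_class strategy:
# "ovr" (one-vs-rest) or "ovo" (one-vs-one).
auc_ovr = roc_auc_score(y_true, y_prob, multi_class="ovr")
auc_ovo = roc_auc_score(y_true, y_prob, multi_class="ovo")
print(auc_ovr, auc_ovo)
```

Passing hard class labels instead of probabilities would raise an error here, which is why the metric requires a model that exposes `predict_proba`.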

In [1]:

# Import libraries
import pandas as pd
import numpy as np
from pandas import set_option
# For data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")





from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

from sklearn.naive_bayes import GaussianNB

from sklearn.model_selection import train_test_split

# Performance evaluation metrics
from sklearn.metrics import accuracy_score, auc, roc_curve, roc_auc_score, mean_squared_error, f1_score, precision_score, recall_score





In [2]:
import os

In [3]:
path="C:/Users/Dell/Documents/Topcooder/Genetic Data Classification/public"
os.chdir(path)

In [4]:
# To load the dataset into jupyter notebook
train=pd.read_csv("train.csv",sep=" ",header=None)
test=pd.read_csv("test_x.csv", header=None)

In [36]:
train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20522,20523,20524,20525,20526,20527,20528,20529,20530,20531
0,0.0,3.801873,5.025591,6.40415,9.564754,0.0,9.997688,1.025241,0.0,0.0,...,8.507347,9.849333,8.18436,9.796564,11.607552,10.456272,9.949412,5.980037,0.0,0.0
1,0.0,2.988103,1.811471,5.763507,8.604753,0.0,7.335855,0.78785,0.0,0.0,...,8.986857,10.444663,1.824849,10.053587,11.873652,10.603654,9.60895,5.026884,0.0,1.0
2,0.0,1.913914,3.568069,6.498854,9.865512,0.0,7.66444,1.830012,0.0,0.0,...,8.969893,10.653248,3.042679,11.449443,10.559874,10.351812,9.446593,10.168559,0.888928,1.0
3,0.0,1.786638,1.76846,5.66216,10.360353,0.0,7.294281,1.056098,0.0,0.0,...,7.907323,10.123358,1.389016,9.964891,11.4073,10.644857,9.912535,8.256968,0.0,1.0
4,0.0,4.122805,2.939922,6.730137,9.508001,0.0,6.959306,0.845109,0.0,0.0,...,9.459323,10.705149,6.782736,10.62229,11.136222,10.787837,10.416871,8.318724,0.0,0.0


In [37]:
test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20521,20522,20523,20524,20525,20526,20527,20528,20529,20530
0,0.0,3.223716,3.02749,9.173387,10.05741,0.0,7.236426,0.600745,0.0,0.0,...,5.940963,8.564466,10.373311,0.0,8.904803,11.991256,10.22865,9.914923,3.95073,0.0
1,0.0,1.624335,2.880039,7.849242,9.961128,0.0,7.426315,0.0,0.0,0.0,...,5.710732,8.710858,9.516045,0.768671,8.638823,12.131397,10.235763,9.007652,2.728965,0.0
2,0.0,4.403799,5.650767,6.237965,9.928791,0.0,8.193653,0.729357,0.0,0.0,...,7.771054,9.087707,10.251198,2.791001,9.82797,11.474755,9.415547,9.388187,1.860605,0.0
3,0.0,3.998728,3.537644,6.336483,10.198433,0.0,6.605121,0.438825,0.0,0.0,...,5.782456,9.07014,9.589436,6.227893,10.301599,11.72903,10.140983,9.519872,8.503579,0.0
4,0.0,2.352504,2.541366,6.823495,10.259614,0.0,6.326269,0.0,0.0,0.0,...,5.19827,7.882233,8.963838,5.740264,9.977638,12.356075,9.423801,9.575008,5.176171,0.0


In [7]:

print("Shape of train dataset:", train.shape)
print(f"The train dataset has {train.shape[0]} samples and {train.shape[1]} columns (features plus the label column)")

Shape of train dataset: (560, 20532)
The train dataset has 560 samples and 20532 columns (features plus the label column)


In [8]:
print("Shape of test dataset:", test.shape)
print(f"The test dataset has {test.shape[0]} samples and {test.shape[1]} features")

Shape of test dataset: (241, 20531)
The test dataset has 241 samples and 20531 features


In [9]:
# Check whether the train data contains any duplicate rows
train.duplicated().any()

False

In [10]:
# Check whether the test data contains any duplicate rows
test.duplicated().any()

False

In [11]:
train.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20522,20523,20524,20525,20526,20527,20528,20529,20530,20531
count,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0,...,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0
mean,0.025693,3.006491,3.093432,6.710345,9.807214,0.0,7.389785,0.500604,0.017663,0.013226,...,8.761257,10.06177,4.903955,9.760919,11.750069,10.131919,9.607517,5.529026,0.096052,1.551786
std,0.133284,1.208505,1.050021,0.635975,0.506348,0.0,1.138069,0.511205,0.140522,0.19999,...,0.610046,0.379201,2.373578,0.541024,0.667601,0.599084,0.548379,2.018563,0.346781,1.538125
min,0.0,0.0,0.0,5.083098,8.435999,0.0,4.244529,0.0,0.0,0.0,...,6.678368,8.669456,0.0,7.974942,9.210036,7.530141,7.864533,0.593975,0.0,0.0
25%,0.0,2.289332,2.371169,6.295731,9.465073,0.0,6.595209,0.0,0.0,0.0,...,8.383445,9.824306,3.245974,9.424618,11.32879,9.828966,9.272957,4.084589,0.0,0.0
50%,0.0,3.116363,3.135089,6.644747,9.782637,0.0,7.423935,0.438133,0.0,0.0,...,8.78164,10.085864,5.476816,9.798278,11.765061,10.175967,9.584944,5.30046,0.0,1.0
75%,0.0,3.897428,3.800811,7.027516,10.137257,0.0,8.137671,0.794682,0.0,0.0,...,9.15312,10.305545,6.682808,10.100454,12.180619,10.562541,9.937343,6.795979,0.0,3.0
max,1.241108,6.051542,6.063484,10.129528,11.355621,0.0,10.71819,2.779008,1.785592,4.067604,...,11.105431,11.318243,9.207495,11.811632,13.715361,11.455764,12.276936,11.205836,5.254133,4.0


In [12]:
# to check the statistical information
test.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20521,20522,20523,20524,20525,20526,20527,20528,20529,20530
count,241.0,241.0,241.0,241.0,241.0,241.0,241.0,241.0,241.0,241.0,...,241.0,241.0,241.0,241.0,241.0,241.0,241.0,241.0,241.0,241.0
mean,0.028845,3.021176,3.099806,6.750097,9.82848,0.0,7.442047,0.498203,0.014609,0.013897,...,5.876309,8.776659,10.043431,4.717074,9.697998,11.724007,10.209533,9.551711,5.526203,0.093923
std,0.145059,1.185225,1.103169,0.645852,0.507718,0.0,1.036971,0.504215,0.116312,0.215745,...,0.730236,0.588012,0.379933,2.403726,0.515377,0.677809,0.532319,0.597614,2.201337,0.40353
min,0.0,0.0,0.0,5.009284,8.555398,0.0,3.930747,0.0,0.0,0.0,...,3.467045,7.342235,8.858969,0.0,8.374301,9.045255,8.768882,8.061706,1.013641,0.0
25%,0.0,2.322707,2.408331,6.318601,9.456295,0.0,6.878701,0.0,0.0,0.0,...,5.441779,8.395474,9.826027,2.933403,9.38324,11.300524,9.869359,9.210313,4.130519,0.0
50%,0.0,3.151339,3.043834,6.682461,9.811901,0.0,7.536667,0.449957,0.0,0.0,...,5.978582,8.792959,10.031164,5.401607,9.72305,11.712239,10.268753,9.508662,5.089104,0.0
75%,0.0,3.821935,3.802876,7.089932,10.165234,0.0,8.013993,0.781486,0.0,0.0,...,6.392295,9.137073,10.287759,6.532682,10.059304,12.172302,10.588349,9.859704,7.163368,0.0
max,1.482332,6.237034,5.848044,9.173387,11.269372,0.0,10.03885,1.974419,1.440686,3.349266,...,7.771054,10.838558,11.092539,9.139459,11.088907,13.36852,11.675653,12.81332,10.659612,4.180896
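
The summaries above show at least one column (column 5) with a standard deviation of 0.0 in both files, so it carries no information for a classifier. The notebook does not remove such columns; the following is only a sketch, on a hypothetical toy frame, of how constant columns could be detected with scikit-learn's `VarianceThreshold`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame standing in for the feature columns of train.csv.
toy = pd.DataFrame({
    0: [0.0, 1.2, 0.4, 0.0],
    1: [3.8, 2.9, 1.9, 1.7],
    2: [0.0, 0.0, 0.0, 0.0],  # constant column, like column 5 above
})

# threshold=0.0 (the default) drops only features whose variance is exactly zero.
selector = VarianceThreshold(threshold=0.0)
reduced = selector.fit_transform(toy.values)

# get_support() returns a boolean mask over the original columns.
kept = toy.columns[selector.get_support()]
print(list(kept), reduced.shape)
```

If this were applied to the real data, the selector would need to be fitted on the train features and then reused to transform the test features so both keep the same columns.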


In [13]:
## Normalize the train dataset
# Rescale each sample (row) to unit L2 norm
from sklearn.preprocessing import Normalizer

array = train.values
# Separate the array into input features and the label column
X = array[:, 0:-1]
Y = array[:, -1]
# Normalizer is stateless, so fit() does not learn anything from the data
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# Summarize the transformed data
print(normalizedX)


[[0.         0.00345848 0.00457167 ... 0.00905077 0.00543991 0.        ]
 [0.         0.0026924  0.00163221 ... 0.00865804 0.00452942 0.        ]
 [0.         0.00174813 0.003259   ... 0.00862832 0.00928775 0.00081193]
 ...
 [0.         0.00345578 0.00312666 ... 0.00935699 0.00469692 0.        ]
 [0.         0.00314579 0.00348193 ... 0.00875243 0.00520663 0.        ]
 [0.         0.002992   0.00351071 ... 0.00889044 0.00215762 0.        ]]


In [14]:
## Normalize the test dataset
# Rescale each sample (row) to unit L2 norm
from sklearn.preprocessing import Normalizer

array = test.values
# The test file has no label column, so every column is a feature
X = array
# Normalizer is stateless, so fit() does not learn anything from the data
scaler = Normalizer().fit(X)
normalized_test = scaler.transform(X)
# Summarize the transformed data
print(normalized_test)


[[0.         0.00299137 0.00280929 ... 0.00920031 0.00366598 0.        ]
 [0.         0.00153825 0.00272741 ... 0.00853029 0.00258434 0.        ]
 [0.         0.00405341 0.00520116 ... 0.00864121 0.00171256 0.        ]
 ...
 [0.         0.00477632 0.00339374 ... 0.00888152 0.00354647 0.        ]
 [0.         0.00275289 0.00317276 ... 0.00828006 0.00132577 0.        ]
 [0.         0.00229556 0.00140002 ... 0.00794642 0.00579084 0.        ]]
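
The `Normalizer` used above works per row, not per column: each sample is divided by its own L2 norm, which is why the values shrink to the order of 0.001 to 0.01 when a row has about 20,000 features. A minimal sketch on a hypothetical toy matrix makes the effect visible.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy matrix: 2 samples, 3 features.
X_toy = np.array([[3.0, 4.0, 0.0],
                  [1.0, 2.0, 2.0]])

# Normalizer rescales each ROW (sample) to unit L2 norm by default.
X_unit = Normalizer(norm="l2").fit_transform(X_toy)

# After the transform, every row has Euclidean length 1.
row_norms = np.linalg.norm(X_unit, axis=1)
print(X_unit)
print(row_norms)
```

Because each row is scaled using only its own values, transforming the train and test sets separately gives the same result as transforming them together.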


In [15]:
norms_train=pd.DataFrame(normalizedX,columns=train.columns[:-1])


In [16]:
norms_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20521,20522,20523,20524,20525,20526,20527,20528,20529,20530
0,0.0,0.003458,0.004572,0.005826,0.008701,0.0,0.009095,0.000933,0.0,0.0,...,0.005385,0.007739,0.008960,0.007445,0.008912,0.010559,0.009512,0.009051,0.005440,0.000000
1,0.0,0.002692,0.001632,0.005193,0.007753,0.0,0.006610,0.000710,0.0,0.0,...,0.006448,0.008098,0.009411,0.001644,0.009059,0.010699,0.009554,0.008658,0.004529,0.000000
2,0.0,0.001748,0.003259,0.005936,0.009011,0.0,0.007001,0.001671,0.0,0.0,...,0.005472,0.008193,0.009730,0.002779,0.010458,0.009645,0.009455,0.008628,0.009288,0.000812
3,0.0,0.001638,0.001621,0.005192,0.009499,0.0,0.006688,0.000968,0.0,0.0,...,0.005447,0.007250,0.009282,0.001274,0.009137,0.010459,0.009760,0.009089,0.007571,0.000000
4,0.0,0.003785,0.002699,0.006178,0.008728,0.0,0.006389,0.000776,0.0,0.0,...,0.006154,0.008684,0.009827,0.006227,0.009751,0.010223,0.009903,0.009563,0.007637,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
555,0.0,0.001166,0.002242,0.006467,0.009262,0.0,0.006974,0.000388,0.0,0.0,...,0.005491,0.008310,0.009217,0.004329,0.009107,0.010633,0.009961,0.008654,0.003937,0.000000
556,0.0,0.003912,0.003564,0.007502,0.008727,0.0,0.007497,0.001153,0.0,0.0,...,0.005929,0.008820,0.009620,0.001389,0.008377,0.010998,0.010100,0.009517,0.003652,0.000000
557,0.0,0.003456,0.003127,0.006432,0.009112,0.0,0.007428,0.000553,0.0,0.0,...,0.005500,0.008416,0.009467,0.007010,0.009192,0.010175,0.009740,0.009357,0.004697,0.000000
558,0.0,0.003146,0.003482,0.006023,0.008821,0.0,0.007874,0.000526,0.0,0.0,...,0.005194,0.007658,0.009258,0.006316,0.009035,0.010709,0.009193,0.008752,0.005207,0.000000


In [17]:
norms_train.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20521,20522,20523,20524,20525,20526,20527,20528,20529,20530
count,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0,...,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0
mean,2.3e-05,0.002754,0.002834,0.006152,0.00899,0.0,0.00677,0.000458,1.6e-05,1.2e-05,...,0.005411,0.008031,0.009221,0.004492,0.008945,0.010767,0.009285,0.008806,0.005061,8.7e-05
std,0.000121,0.001107,0.000959,0.000607,0.000508,0.0,0.001033,0.000467,0.000128,0.000183,...,0.000688,0.000583,0.000356,0.002179,0.000492,0.000592,0.000548,0.000524,0.001841,0.000315
min,0.0,0.0,0.0,0.00469,0.007573,0.0,0.003911,0.0,0.0,0.0,...,0.002633,0.006033,0.007866,0.0,0.007438,0.008418,0.006845,0.007244,0.00056,0.0
25%,0.0,0.002104,0.002191,0.005728,0.008647,0.0,0.006054,0.0,0.0,0.0,...,0.004991,0.007645,0.009002,0.002961,0.008633,0.010397,0.008969,0.008466,0.003742,0.0
50%,0.0,0.002866,0.002853,0.006067,0.008966,0.0,0.006799,0.000401,0.0,0.0,...,0.005473,0.008034,0.009217,0.005004,0.008954,0.010745,0.009313,0.008783,0.004862,0.0
75%,0.0,0.003572,0.003497,0.006452,0.009329,0.0,0.00745,0.000726,0.0,0.0,...,0.005885,0.008398,0.009451,0.006121,0.009271,0.011135,0.009665,0.0091,0.006177,0.0
max,0.001124,0.005477,0.005538,0.009259,0.010719,0.0,0.01018,0.002481,0.001623,0.003727,...,0.00716,0.010166,0.01025,0.008584,0.011234,0.012447,0.010785,0.01156,0.010357,0.004738


In [18]:
norms_train[20531]=train[20531]

In [19]:
norms_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20522,20523,20524,20525,20526,20527,20528,20529,20530,20531
0,0.0,0.003458,0.004572,0.005826,0.008701,0.0,0.009095,0.000933,0.0,0.0,...,0.007739,0.008960,0.007445,0.008912,0.010559,0.009512,0.009051,0.005440,0.000000,0.0
1,0.0,0.002692,0.001632,0.005193,0.007753,0.0,0.006610,0.000710,0.0,0.0,...,0.008098,0.009411,0.001644,0.009059,0.010699,0.009554,0.008658,0.004529,0.000000,1.0
2,0.0,0.001748,0.003259,0.005936,0.009011,0.0,0.007001,0.001671,0.0,0.0,...,0.008193,0.009730,0.002779,0.010458,0.009645,0.009455,0.008628,0.009288,0.000812,1.0
3,0.0,0.001638,0.001621,0.005192,0.009499,0.0,0.006688,0.000968,0.0,0.0,...,0.007250,0.009282,0.001274,0.009137,0.010459,0.009760,0.009089,0.007571,0.000000,1.0
4,0.0,0.003785,0.002699,0.006178,0.008728,0.0,0.006389,0.000776,0.0,0.0,...,0.008684,0.009827,0.006227,0.009751,0.010223,0.009903,0.009563,0.007637,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
555,0.0,0.001166,0.002242,0.006467,0.009262,0.0,0.006974,0.000388,0.0,0.0,...,0.008310,0.009217,0.004329,0.009107,0.010633,0.009961,0.008654,0.003937,0.000000,0.0
556,0.0,0.003912,0.003564,0.007502,0.008727,0.0,0.007497,0.001153,0.0,0.0,...,0.008820,0.009620,0.001389,0.008377,0.010998,0.010100,0.009517,0.003652,0.000000,2.0
557,0.0,0.003456,0.003127,0.006432,0.009112,0.0,0.007428,0.000553,0.0,0.0,...,0.008416,0.009467,0.007010,0.009192,0.010175,0.009740,0.009357,0.004697,0.000000,0.0
558,0.0,0.003146,0.003482,0.006023,0.008821,0.0,0.007874,0.000526,0.0,0.0,...,0.007658,0.009258,0.006316,0.009035,0.010709,0.009193,0.008752,0.005207,0.000000,4.0


In [20]:
norms_test=pd.DataFrame(normalized_test,columns=test.columns)


In [21]:
X=norms_train.iloc[:,:-1]
Y=norms_train.iloc[:,-1]

In [22]:
# Split the dataset into train and validation sets
test_size = 0.20
seed = 42
xtrain, xVal, ytrain, yVal = train_test_split(X, Y, test_size=test_size, stratify=Y, random_state=seed)
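
The `stratify` argument keeps the class proportions the same in both parts of the split, which matters when some of the 5 classes are rarer than others. The sketch below uses hypothetical imbalanced labels (not the challenge labels) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 samples of class 0, 30 of class 1, 10 of class 2.
y_toy = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

# With stratify, the class proportions in each part match the full label vector.
train_counts = np.bincount(y_tr)
val_counts = np.bincount(y_val)
print(train_counts, val_counts)
```

Without `stratify`, a small validation set could by chance contain very few samples of the rarest class, which makes per-class metrics unstable.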

In [23]:
def Models(model, name, d):
    """Fit `model`, score it on the train/validation split, and append the metrics to `d`."""
    print("Working on {} model".format(name))

    cla = model
    cla.fit(xtrain, ytrain)

    # Accuracy on the training split (stored under the 'Training ACU' key)
    predicted = cla.predict(xtrain)
    tr_acc = accuracy_score(ytrain, predicted) * 100

    # Accuracy on the validation split (stored under the 'Testing ACU' key)
    predicted = cla.predict(xVal)
    val_acc = accuracy_score(yVal, predicted) * 100

    # scikit-learn metrics expect (y_true, y_pred) in that order
    Recall = recall_score(yVal, predicted, average='micro')
    Precision = precision_score(yVal, predicted, average='micro')
    F1 = f1_score(yVal, predicted, average='micro')

    # Multi-class ROC AUC needs class probabilities, not hard labels
    y_pred = cla.predict_proba(xVal)
    roc = roc_auc_score(yVal, y_pred, multi_class="ovo")

    # MSE treats the class labels as numbers, so it is only a rough indicator here
    MSE = mean_squared_error(yVal, predicted)

    d['Name'].append(name)
    d['Training ACU'].append(tr_acc)
    d['Testing ACU'].append(val_acc)
    d['Recall'].append(Recall)
    d['Precision'].append(Precision)
    d['F1_Score'].append(F1)
    d['MSE'].append(MSE)
    d["AUC_Score"].append(roc)

    print("**********"*5)
    print()
    return d

In [25]:
random_state = 42
d = {'Name': [], 'Training ACU': [], 'Testing ACU': [],
     'Recall': [], 'Precision': [], 'F1_Score': [], 'MSE': [], 'AUC_Score': []}
# Each entry pairs an estimator with the display name used in the results table
models = [
    [RandomForestClassifier(n_estimators=350, random_state=random_state), 'Random Forest'],
    [DecisionTreeClassifier(random_state=random_state), 'Decision Tree'],
    [MLPClassifier(max_iter=100, hidden_layer_sizes=100, activation="logistic", random_state=random_state), "MlpClassifer"],
    [GaussianNB(), 'Naive Bayes'],
    [LogisticRegression(random_state=random_state), 'Logistic Regression'],
]

for model, name in models:
    d = Models(model, name, d)

acu_data = pd.DataFrame(data=d)

Working on Random Forest model
**************************************************

Working on Decision Tree model
**************************************************

Working on MlpClassifer model
**************************************************

Working on Naive Bayes model
**************************************************

Working on Logistic Regression model
**************************************************



In [26]:
acu_data

Unnamed: 0,Name,Training ACU,Testing ACU,Recall,Precision,F1_Score,MSE,AUC_Score
0,Random Forest,100.0,100.0,1.0,1.0,1.0,0.0,1.0
1,Decision Tree,100.0,97.321429,0.973214,0.973214,0.973214,0.169643,0.978343
2,MlpClassifer,99.553571,100.0,1.0,1.0,1.0,0.0,1.0
3,Naive Bayes,100.0,82.142857,0.821429,0.821429,0.821429,1.089286,0.853977
4,Logistic Regression,80.133929,83.035714,0.830357,0.830357,0.830357,1.080357,1.0
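
The scores above come from a single validation split of 112 samples (20% of 560), so they may vary with the choice of split. One common way to get a steadier estimate is stratified k-fold cross-validation. The notebook does not do this; the following is a sketch on synthetic data, where `make_classification` and all its parameters are assumptions chosen only to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 5-class data standing in for the normalized features (assumption).
X_syn, y_syn = make_classification(
    n_samples=200, n_features=50, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)

# Five stratified folds: every fold keeps the overall class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# "roc_auc_ovo" scores predict_proba output with one-vs-one ROC AUC.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_syn, y_syn, cv=cv, scoring="roc_auc_ovo",
)
print(scores.mean(), scores.std())
```

The spread across folds gives a rough sense of how much a single-split score could move.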


In [27]:
# Pick the best model based on the performance report
best_model = RandomForestClassifier(n_estimators = 350,random_state=42)

In [28]:
best_model.fit(xtrain,ytrain)

In [29]:
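# Predict class labels for the test set with the best model.
# Note: predict() returns hard labels; predict_proba() returns the
# per-class probabilities that the challenge requirements ask for.
prediction =  best_model.predict(norms_test)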

In [30]:
prediction

array([2., 2., 3., 0., 3., 2., 1., 1., 3., 2., 0., 3., 1., 0., 0., 3., 0.,
       1., 0., 1., 0., 4., 4., 4., 0., 3., 4., 3., 4., 0., 3., 0., 0., 1.,
       3., 1., 1., 0., 0., 1., 0., 1., 4., 2., 2., 0., 2., 2., 4., 3., 0.,
       3., 0., 0., 0., 1., 0., 0., 2., 3., 3., 1., 4., 2., 4., 0., 0., 4.,
       3., 4., 0., 1., 3., 0., 1., 3., 0., 4., 0., 1., 0., 0., 0., 0., 0.,
       4., 1., 3., 2., 0., 2., 0., 3., 0., 4., 4., 4., 1., 1., 4., 4., 3.,
       1., 3., 0., 2., 3., 1., 3., 4., 3., 0., 3., 1., 1., 4., 3., 2., 1.,
       3., 4., 0., 3., 0., 0., 1., 0., 2., 0., 1., 2., 0., 4., 2., 0., 0.,
       1., 0., 0., 1., 3., 3., 0., 0., 2., 0., 4., 4., 0., 0., 1., 4., 2.,
       1., 3., 1., 2., 3., 3., 1., 2., 2., 4., 0., 4., 0., 3., 3., 3., 0.,
       1., 1., 2., 0., 0., 0., 4., 4., 4., 4., 3., 0., 3., 3., 4., 1., 1.,
       0., 1., 3., 2., 0., 0., 4., 0., 0., 0., 0., 0., 0., 0., 4., 0., 0.,
       4., 4., 4., 2., 0., 0., 4., 0., 0., 3., 2., 0., 4., 0., 1., 1., 1.,
       0., 1., 0., 4., 0.

In [31]:
# Check that there is one prediction per test row
print(len(prediction),len(test))

241 241


In [34]:
# Save the predictions from the best model to a CSV file
prediction=pd.DataFrame(prediction)
prediction.to_csv("solution.csv",header=False)

Unnamed: 0,0
0,2.0
1,2.0
2,3.0
3,0.0
4,3.0
...,...
236,4.0
237,0.0
238,4.0
239,2.0
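
The challenge requirements ask for a probability for each of the 5 classes, whereas the file saved above holds one predicted label per row. The sketch below shows how a probability table could be produced with `predict_proba`. It runs on synthetic data, and the exact file layout the scorer expects (header, index, column order) is an assumption, not something the challenge text confirms.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the normalized train/test features (assumption).
X_syn, y_syn = make_classification(
    n_samples=120, n_features=30, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)
X_fit, y_fit, X_new = X_syn[:100], y_syn[:100], X_syn[100:]

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_fit, y_fit)

# predict_proba returns one column per class, ordered as in clf.classes_.
proba = pd.DataFrame(clf.predict_proba(X_new), columns=clf.classes_)

# Without a path, to_csv returns the CSV text; pass a file name such as
# "solution.csv" as the first argument to write a file instead.
csv_text = proba.to_csv(index=False, header=False)
print(proba.shape)
```

In the notebook, the equivalent call would be `best_model.predict_proba(norms_test)`, which should give 241 rows and 5 columns with each row summing to 1.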
