# CV CW1

The data set for the coursework is a sample from Stallkamp et al.'s German Traffic Sign Recognition
Benchmark (GTSRB). The original data set consists of 39,209 RGB training images and 12,630 RGB
test images of varying sizes, showing 43 different types of German traffic signs. The images are
not centred and were taken at different times of the day.

This data set is considered an important benchmark in Computer Vision and is closely related
to the street sign recognition tasks that autonomous cars have to perform. Safe deployment of
autonomous cars is the next big challenge facing researchers and engineers.

You will be working with a sample of this data set consisting of 10 classes and 9690 images. The
images have been converted to grey-scale, with pixel values ranging from 0 to 255, and rescaled
to a common size of 48*48 pixels. Hence, each image is represented as a row vector of 2304 pixel
features; together with its class label, each row (= feature vector) in the data set has 2305
entries. We relabelled the classes from the original data set, so the classes used here are numbered
from 0 to 9. Compensating for the lighting conditions and the position of the signs is not necessary
for the coursework and is left to the interested student.
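
As a rough illustration of the row-vector layout described above, the sketch below reshapes a flat row of 2304 pixel values back into a 48*48 array and maps a column index to a pixel position. The helper names `row_to_image` and `pixel_position` are hypothetical, and the row-major (row by row) flattening order is an assumption that the data description does not state.

```python
import numpy as np

IMAGE_SIDE = 48  # images were rescaled to 48*48 pixels


def row_to_image(row):
    """Reshape a flat row of 2304 pixel values into a 48x48 array (hypothetical helper)."""
    pixels = np.asarray(row, dtype=float)
    if pixels.size != IMAGE_SIDE * IMAGE_SIDE:
        raise ValueError(f"expected {IMAGE_SIDE * IMAGE_SIDE} values, got {pixels.size}")
    # ASSUMPTION: the rows were flattened in row-major order.
    return pixels.reshape(IMAGE_SIDE, IMAGE_SIDE)


def pixel_position(column_index):
    """Map a flat column index to (row, col) under the same row-major assumption."""
    return divmod(int(column_index), IMAGE_SIDE)


# Stand-in row whose values equal their own column index, so positions are easy to check.
example_row = np.arange(IMAGE_SIDE * IMAGE_SIDE)
image = row_to_image(example_row)
print(image.shape)
print(pixel_position(744))
```

Under that assumption, the attribute named `744` would sit at row 15, column 24 of the image.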

Below, the class labels and their meanings are displayed:

| Class label | Meaning |
|---|---|
| 0 | speed limit 20 |
| 1 | speed limit 30 |
| 2 | speed limit 50 |
| 3 | speed limit 60 |
| 4 | speed limit 70 |
| 5 | left turn |
| 6 | right turn |
| 7 | beware pedestrian crossing |
| 8 | beware children |
| 9 | beware cycle route ahead |
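
For readable summaries later on, the label meanings listed above could be kept in a small lookup. This is only a convenience sketch; `CLASS_NAMES` and `describe_label` are hypothetical names that do not appear elsewhere in the notebook.

```python
# Class labels and meanings copied from the list above.
CLASS_NAMES = {
    0: "speed limit 20",
    1: "speed limit 30",
    2: "speed limit 50",
    3: "speed limit 60",
    4: "speed limit 70",
    5: "left turn",
    6: "right turn",
    7: "beware pedestrian crossing",
    8: "beware children",
    9: "beware cycle route ahead",
}


def describe_label(label):
    """Return the readable meaning of a numeric class label (hypothetical helper)."""
    return CLASS_NAMES[int(label)]


print(describe_label(5))
```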

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)


In [2]:
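# Scikit-Learn ≥0.20 is required
import sklearn
# compare numeric (major, minor) parts; a plain string comparison misorders versions such as "0.9"
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)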

In [3]:
# Common imports
import numpy as np
import os
import tarfile
import urllib
import pandas as pd

In [4]:
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

In [5]:
# read the attribute CSV files into their respective dataframes

X = pd.read_csv('x_train_gr_smpl.csv')
Xbin = pd.read_csv('x_train_smpl_bin.csv')
print(X)
print(Xbin)

          0      1      2      3      4      5      6      7      8      9  \
0      78.0   77.0   76.0   82.0   87.0   92.0  104.0  119.0  117.0  120.0   
1      73.0   75.0   79.0   78.0   76.0   75.0   89.0  107.0  133.0  125.0   
2      72.0   75.0   79.0   77.0   81.0   89.0  105.0  109.0   86.0   90.0   
3      67.0   70.0   74.0   80.0   93.0  107.0  110.0   96.0   69.0  100.0   
4      74.0   74.0   73.0   72.0   77.0   87.0  104.0  109.0   84.0   83.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  247.0  220.0  254.0  213.0  129.0  208.0  254.0  255.0  255.0  255.0   
9686  151.0  118.0  254.0  255.0  255.0  255.0  254.0  254.0  254.0  252.0   
9687  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0   
9688  255.0  253.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0   
9689  252.0  189.0  238.0  255.0  255.0  245.0  219.0  212.0  140.0   40.0   

      ...   2294  2295  2296   2297  2298  2299  2300   2301   

In [7]:
# read the class attribute file into the Y dataframe
Y = pd.read_csv('y_train_smpl.csv')
Y.columns = ['Class']
print(Y)

      Class
0         0
1         0
2         0
3         0
4         0
...     ...
9685      9
9686      9
9687      9
9688      9
9689      9

[9690 rows x 1 columns]


In [8]:
# join the two so that Y is the last column

XYraw = pd.concat([X, Y], axis=1)

# check the concatenation
print(XYraw)



          0      1      2      3      4      5      6      7      8      9  \
0      78.0   77.0   76.0   82.0   87.0   92.0  104.0  119.0  117.0  120.0   
1      73.0   75.0   79.0   78.0   76.0   75.0   89.0  107.0  133.0  125.0   
2      72.0   75.0   79.0   77.0   81.0   89.0  105.0  109.0   86.0   90.0   
3      67.0   70.0   74.0   80.0   93.0  107.0  110.0   96.0   69.0  100.0   
4      74.0   74.0   73.0   72.0   77.0   87.0  104.0  109.0   84.0   83.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  247.0  220.0  254.0  213.0  129.0  208.0  254.0  255.0  255.0  255.0   
9686  151.0  118.0  254.0  255.0  255.0  255.0  254.0  254.0  254.0  252.0   
9687  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0   
9688  255.0  253.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0   
9689  252.0  189.0  238.0  255.0  255.0  245.0  219.0  212.0  140.0   40.0   

      ...  2295  2296   2297  2298  2299  2300   2301   2302   

In [9]:
# inspect the first five rows
XYraw.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2295,2296,2297,2298,2299,2300,2301,2302,2303,Class
0,78.0,77.0,76.0,82.0,87.0,92.0,104.0,119.0,117.0,120.0,...,79.0,72.0,76.0,83.0,95.0,99.0,98.0,95.0,94.0,0
1,73.0,75.0,79.0,78.0,76.0,75.0,89.0,107.0,133.0,125.0,...,93.0,85.0,77.0,69.0,73.0,83.0,100.0,101.0,101.0,0
2,72.0,75.0,79.0,77.0,81.0,89.0,105.0,109.0,86.0,90.0,...,95.0,88.0,80.0,73.0,71.0,74.0,80.0,89.0,95.0,0
3,67.0,70.0,74.0,80.0,93.0,107.0,110.0,96.0,69.0,100.0,...,92.0,87.0,82.0,77.0,72.0,70.0,72.0,81.0,88.0,0
4,74.0,74.0,73.0,72.0,77.0,87.0,104.0,109.0,84.0,83.0,...,98.0,99.0,100.0,99.0,89.0,78.0,66.0,68.0,72.0,0


In [10]:
# general info on attributes 
XYraw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9690 entries, 0 to 9689
Columns: 2305 entries, 0 to Class
dtypes: float64(2304), int64(1)
memory usage: 170.4 MB
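
The reported memory usage can be checked by hand: 9690 rows times 2305 columns times 8 bytes per 64-bit value. The sketch below does that arithmetic and, on a small made-up frame, shows how storing whole-number pixel values as `uint8` would shrink the footprint. The downcast is an optional idea for illustration, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Shape reported by XYraw.info(): 9690 rows, 2305 columns, 8 bytes per value.
estimated_mib = 9690 * 2305 * 8 / 2**20
print(f"estimated size: {estimated_mib:.1f} MiB")

# Toy stand-in for the pixel columns (values are made up for this sketch).
toy = pd.DataFrame(np.full((10, 4), 255.0))

# Safe only because the pixel values are whole numbers in the range 0..255.
compact = toy.astype("uint8")

bytes_float = toy.memory_usage(index=False).sum()
bytes_uint8 = compact.memory_usage(index=False).sum()
print(bytes_float, bytes_uint8)
```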


In [11]:
# number of instances per class label
XYraw["Class"].value_counts()

2    2250
1    2220
4    1980
3    1410
8     540
6     360
9     270
7     240
5     210
0     210
Name: Class, dtype: int64
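
The counts above are far from uniform. As a quick way to quantify that, the sketch below recomputes class proportions and the ratio between the largest and smallest class from the printed counts; the variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
class_counts = pd.Series(
    {2: 2250, 1: 2220, 4: 1980, 3: 1410, 8: 540, 6: 360, 9: 270, 7: 240, 5: 210, 0: 210}
)

total = class_counts.sum()
proportions = (class_counts / total).sort_index()
imbalance_ratio = class_counts.max() / class_counts.min()

print(proportions.round(3))
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```

Classes 1 to 4 together account for roughly 81% of the images, which is worth keeping in mind when reading accuracy figures.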

In [12]:
print(XYraw.shape)

(9690, 2305)


In [12]:
print(X.shape)

(9690, 2304)


In [13]:
# check for NaNs
anyNans = XYraw.isnull().sum().sum()
#print NaN count
print('\nNaN Count : ' + str(anyNans)) 


NaN Count : 0


In [14]:
# build a dataframe from the top 20 attributes per class label, as selected with Weka (the list contains repeats)

Top_20_All = XYraw[['744',
'745',
'792',
'742',
'747',
'741',
'697',
'698',
'1214',
'2018',
'2068',
'1869',
'1119',
'648',
'2062',
'2019',
'1676',
'2111',
'1120',
'1820','1168',
'1167',
'1120',
'1215',
'1216',
'1119',
'1263',
'1073',
'1264',
'1072',
'1121',
'1166',
'1214',
'1262',
'1074',
'697',
'648',
'744',
'698',
'743',
'1609',
'1656',
'1610',
'1608',
'1658',
'746',
'1314',
'794',
'1315',
'793',
'1561',
'747',
'1311',
'1312',
'1313',
'841',
'1611',
'1316',
'795',
'1267',
'1609',
'1608',
'1656',
'1657',
'1265',
'1313',
'1655',
'1264',
'1312',
'1607',
'1560',
'1658',
'1610',
'1654',
'1561',
'1266',
'1314',
'1704',
'1705',
'1263',
'983',
'1031',
'935',
'982',
'1363',
'2138',
'2268',
'1030',
'2165',
'2024',
'2281',
'1973',
'934',
'2187',
'2069',
'1316',
'2269',
'2209',
'2077',
'2029',
'1509',
'1461',
'1462',
'1508',
'1460',
'744',
'745',
'746',
'1510',
'1557',
'1264',
'1507',
'1556',
'1555',
'1459',
'1263',
'1519',
'1472',
'1216',
'1471',
'793',
'746',
'745',
'744',
'794',
'743',
'792',
'795',
'1264',
'747',
'697',
'742',
'841',
'696',
'1312',
'791',
'1265',
'1263',
'695',
'1311',
'1657',
'1656',
'1608',
'1705',
'1704',
'1703',
'1753',
'1752',
'1601',
'1650',
'1663',
'1662',
'1751',
'1649',
'1600',
'1743',
'1648',
'1651',
'1713',
'1615',
'1695',
'1714',
'1173',
'1125',
'1647',
'1126',
'1713',
'1174',
'1715',
'1666',
'1078',
'1696',
'1743',
'1221',
'1077',
'1472',
'1762',
'1694',
'1742',
'1665',
'1655',
'1654',
'1560',
'1561',
'1606',
'1607',
'1610',
'1559',
'1611',
'1653',
'1558',
'1605',
'1511',
'1562',
'1660',
'1512',
'1604',
'1462',
'1600',
'1601',


                    'Class'
]]

In [15]:
# confirm 201 columns: top 20 attributes x 10 class labels + the class attribute
print(Top_20_All.shape)

(9690, 201)


In [16]:
# remove any duplicated columns

Top_20_WithoutDuplicates = Top_20_All.loc[:,~Top_20_All.T.duplicated(keep='first')]


In [17]:
# confirm removal of duplicated columns

print(Top_20_WithoutDuplicates.shape)

(9690, 140)
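
Note that `.T.duplicated()` compares column values, so it would also drop two differently named columns that happen to hold identical values. The toy sketch below contrasts that with de-duplicating by column name via `columns.duplicated()`. The frame and its values are made up, and this is an alternative for comparison rather than what the notebook runs.

```python
import pandas as pd

# Toy frame: column "c" has a different name but the same values as "a".
toy = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [1, 2, 3], "Class": [0, 1, 0]}
)

# Selecting "a" twice mimics the repeated attribute names in the selection list.
selected = toy[["a", "b", "a", "c", "Class"]]

# De-duplicate by column name: only the repeated "a" is dropped.
by_name = selected.loc[:, ~selected.columns.duplicated(keep="first")]

# De-duplicate by column values (the notebook's approach): "c" is dropped as well.
value_mask = ~selected.T.duplicated(keep="first").to_numpy()
by_value = selected.loc[:, value_mask]

print(list(by_name.columns))
print(list(by_value.columns))
```

The name-based variant also avoids transposing the full frame, which may matter for wide selections.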


In [18]:
#randomize

Top_20_WithoutDuplicates_randomized = Top_20_WithoutDuplicates.sample(frac=1).reset_index(drop=True)
print(Top_20_WithoutDuplicates_randomized)

        744    745    792    742    747    741    697    698   1214   2018  \
0      69.0   71.0   81.0   60.0   67.0   47.0   37.0   37.0   42.0   16.0   
1     252.0  250.0  248.0  255.0  246.0  255.0  253.0  251.0  255.0   46.0   
2     191.0  193.0  154.0  190.0  194.0  191.0  193.0  193.0  132.0   52.0   
3      34.0   35.0   62.0   39.0   38.0   44.0   47.0   48.0   45.0   22.0   
4     255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  252.0  141.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  204.0  205.0  175.0  202.0  220.0  204.0  225.0  222.0  192.0   21.0   
9686  209.0  208.0  210.0  212.0  197.0  210.0  201.0  195.0  220.0  255.0   
9687   95.0   99.0  118.0   89.0   95.0   84.0   76.0   76.0   98.0   63.0   
9688   80.0   80.0   79.0   76.0   78.0   76.0   74.0   73.0   66.0   64.0   
9689   55.0   55.0   52.0   53.0   55.0   52.0   51.0   51.0   38.0   17.0   

      ...   1559   1653   1558   1605   1511   1562   1660   15
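
Because `sample(frac=1)` is unseeded here, every run produces a different row order. If a repeatable shuffle is wanted, one option is to pass `random_state`; the sketch below shows this on a made-up frame, with the seed value chosen arbitrarily.

```python
import pandas as pd

# Toy frame standing in for the selected attributes plus the class column.
toy = pd.DataFrame({"744": [10.0, 20.0, 30.0, 40.0], "Class": [0, 0, 1, 1]})

SEED = 42  # arbitrary value chosen for this sketch

# Two shuffles with the same seed give the same row order.
shuffled_a = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
```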

In [19]:
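# save the dataframe as CSV
# note: this writes Top_20_WithoutDuplicates (original row order), not the shuffled copy from the previous cell

Top_20_WithoutDuplicates.to_csv('Top_20_random3.csv', index=False)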

In [20]:
# create a dataframe with the top 5 attributes for each class label
# (one line per group of five; some attributes repeat across groups)

Top_5_All = XYraw[[
    '744', '745', '792', '742', '747',
    '1168', '1167', '1120', '1215', '1216',
    '1609', '1656', '1610', '1608', '1658',
    '1609', '1608', '1656', '1657', '1265',
    '983', '1031', '935', '982', '1363',
    '1509', '1461', '1462', '1508', '1460',
    '793', '746', '745', '744', '794',
    '1657', '1656', '1608', '1705', '1704',
    '1695', '1714', '1173', '1125', '1647',
    '1655', '1654', '1560', '1561', '1606',
    'Class'
]]

In [21]:
# confirm selection
print(Top_5_All.shape)

(9690, 51)


In [22]:
#remove duplicates
Top_5_WithoutDuplicates = Top_5_All.loc[:,~Top_5_All.T.duplicated(keep='first')]

In [23]:
#confirm duplicate removal
print(Top_5_WithoutDuplicates.shape)

(9690, 43)
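
The drop from 51 to 43 columns can be reproduced from the attribute names alone. The sketch below flattens the ten groups of five and removes repeated names while keeping first-seen order. It works on names only and does not touch the data; the variable names are illustrative.

```python
# Groups copied from the Top_5_All selection above, in the order they appear.
top5_groups = [
    ["744", "745", "792", "742", "747"],
    ["1168", "1167", "1120", "1215", "1216"],
    ["1609", "1656", "1610", "1608", "1658"],
    ["1609", "1608", "1656", "1657", "1265"],
    ["983", "1031", "935", "982", "1363"],
    ["1509", "1461", "1462", "1508", "1460"],
    ["793", "746", "745", "744", "794"],
    ["1657", "1656", "1608", "1705", "1704"],
    ["1695", "1714", "1173", "1125", "1647"],
    ["1655", "1654", "1560", "1561", "1606"],
]

# Flatten the groups, then drop repeated names while keeping first-seen order.
all_names = [name for group in top5_groups for name in group]
unique_names = list(dict.fromkeys(all_names))
columns = unique_names + ["Class"]

print(len(all_names) + 1, len(columns))
```

The 43 names match the `(9690, 43)` shape reported above, so for this selection the name-based and value-based de-duplication agree.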


In [24]:
#randomize
Top_5_WithoutDuplicates_randomized = Top_5_WithoutDuplicates.sample(frac=1).reset_index(drop=True)
print(Top_5_WithoutDuplicates_randomized)

        744    745    792    742    747   1168   1167   1120   1215   1216  \
0     133.0  134.0  137.0  134.0  134.0  134.0  133.0  134.0  133.0  132.0   
1     136.0  124.0  159.0  244.0  172.0  132.0  132.0  130.0  132.0  146.0   
2     157.0  156.0  156.0  159.0  154.0   76.0  127.0   96.0  109.0   61.0   
3      29.0   33.0   37.0   25.0   27.0   39.0   26.0   38.0   27.0   38.0   
4      20.0   19.0   21.0   21.0   21.0   34.0   26.0   31.0   27.0   36.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  144.0  144.0  144.0  146.0  145.0  146.0  146.0  143.0  146.0  146.0   
9686   67.0   70.0   79.0   53.0   68.0   62.0   43.0   60.0   41.0   61.0   
9687  103.0  114.0  158.0   78.0  110.0  119.0   72.0   99.0   71.0  115.0   
9688   56.0   54.0   87.0   47.0   47.0   88.0   83.0   85.0   84.0   91.0   
9689  218.0  219.0  242.0  210.0  190.0  246.0  251.0  209.0  254.0  254.0   

      ...   1714   1173   1125   1647   1655   1654   1560   15

In [25]:
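# save the dataframe as CSV for further Weka analyses
# note: this writes Top_5_WithoutDuplicates (original row order), not the shuffled copy from the previous cell

Top_5_WithoutDuplicates.to_csv('Top_5_random3.csv', index=False)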

In [26]:
# create a dataframe with the top 10 attributes for each class label
# (two lines per group of ten; some attributes repeat across groups)

Top_10_All = XYraw[[
    '744', '745', '792', '742', '747',
    '741', '697', '698', '1214', '2018',

    '1168', '1167', '1120', '1215', '1216',
    '1119', '1263', '1073', '1264', '1072',

    '1609', '1656', '1610', '1608', '1658',
    '746', '1314', '794', '1315', '793',

    '1609', '1608', '1656', '1657', '1265',
    '1313', '1655', '1264', '1312', '1607',

    '983', '1031', '935', '982', '1363',
    '2138', '2268', '1030', '2165', '2024',

    '1509', '1461', '1462', '1508', '1460',
    '744', '745', '746', '1510', '1557',

    '793', '746', '745', '744', '794',
    '743', '792', '795', '1264', '747',

    '1657', '1656', '1608', '1705', '1704',
    '1703', '1753', '1752', '1601', '1650',

    '1695', '1714', '1173', '1125', '1647',
    '1126', '1713', '1174', '1715', '1666',

    '1655', '1654', '1560', '1561', '1606',
    '1607', '1610', '1559', '1611', '1653',

    'Class'
]]

In [27]:
# confirm the attribute selection
print(Top_10_All.shape)

(9690, 101)


In [28]:
# remove duplicates

Top_10_WithoutDuplicates = Top_10_All.loc[:,~Top_10_All.T.duplicated(keep='first')]
print(Top_10_WithoutDuplicates)

        744    745    792    742    747    741    697    698   1214   2018  \
0     106.0  103.0  180.0  103.0   87.0   95.0   85.0   86.0  168.0  123.0   
1     164.0  163.0  221.0  123.0  138.0  108.0  114.0  108.0   95.0  122.0   
2     197.0  198.0  234.0  196.0  175.0  168.0  145.0  146.0  147.0  118.0   
3     237.0  233.0  244.0  235.0  205.0  214.0  209.0  192.0  167.0  136.0   
4      85.0   85.0  148.0   78.0   79.0   76.0   71.0   71.0  162.0  169.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685   82.0   33.0  100.0   96.0   19.0   23.0   23.0   19.0   21.0   59.0   
9686   96.0   38.0  102.0  104.0   19.0   62.0   28.0   18.0   19.0   40.0   
9687   92.0   34.0  101.0  101.0   18.0   68.0   26.0   18.0   17.0   41.0   
9688   94.0   38.0   92.0   94.0   16.0   60.0   27.0   16.0   16.0   35.0   
9689   83.0   26.0   88.0   91.0   15.0   52.0   19.0   15.0   14.0   37.0   

      ...   1715   1666   1654   1560   1561   1606   1559   16

In [29]:
#confirm duplicate removal 
print(Top_10_WithoutDuplicates.shape)

(9690, 80)


In [30]:
#randomize
Top_10_WithoutDuplicates_randomized = Top_10_WithoutDuplicates.sample(frac=1).reset_index(drop=True)
print(Top_10_WithoutDuplicates_randomized)

        744    745    792    742    747    741    697    698   1214   2018  \
0      53.0   53.0   53.0   52.0   51.0   52.0   51.0   44.0   52.0   48.0   
1     125.0  125.0  124.0  126.0  123.0  127.0  124.0  122.0  120.0  243.0   
2      32.0   27.0   57.0   33.0   40.0   29.0   28.0   38.0   37.0   19.0   
3      32.0   20.0   39.0   37.0   16.0   39.0   16.0   17.0   17.0   23.0   
4     193.0  195.0  193.0  190.0  181.0  186.0  193.0  191.0  196.0  201.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  128.0  130.0  129.0  129.0  130.0  128.0  130.0  128.0  118.0   54.0   
9686   42.0   43.0   42.0   41.0   41.0   42.0   41.0   40.0   39.0  132.0   
9687   98.0   98.0   95.0   99.0   98.0   99.0   97.0   98.0   95.0   84.0   
9688  175.0  175.0  175.0  171.0  159.0  165.0  149.0  144.0  117.0   72.0   
9689  255.0  255.0  255.0  255.0  255.0  255.0  254.0  253.0  255.0   38.0   

      ...   1715   1666   1654   1560   1561   1606   1559   16

In [31]:
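# save the dataframe as CSV for further Weka analyses
# note: this writes Top_10_WithoutDuplicates (original row order), not the shuffled copy from the previous cell
Top_10_WithoutDuplicates.to_csv('Top_10_random2.csv', index=False)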

In [32]:
# create a dataframe with the top 20 attributes from the exploratory feature selection analyses

Top_20_Exploratory = XYraw[['792',
'742',
'747',
'741',
'697',
'698',
'1214',
'2068',
'1869',
'1119',
'648',
'1676',
'2111',
'1120',
'2067',
'2260',
'2162',
'2301',
'2212',
'2210',
                            '1168',
'1167',
'1120',
'1263',
'1073',
'1072',
'1121',
'1074',
'1310',
'599',
    '1314',
'1315',
'1316',
'1268',
'1264',
'1030',
'1364',
'981',
'982',
'1317',
'1269',
   '1265',
'1313',
'1312',
'1266',
'1217',
'1361',
'1218',
'1219',
'1170',
'1171',
'1220',
'1172',
       '983',
'1031',
'935',
'982',
'1030',
'934',
      '1509',
'1461',
'1462',
'1508',
'1460',
'745',
'746',
'1507',
'1459',
'1519',
'1471',
'1458',
'794',
'1792',
'1126',
'1793',
'1173',
'1842',
'1840',
'1843',
    '793',
'746',
'745',
'794',
'795',
'1264',
'1134',
'842',
'1086',
'1168',
'1085',
'1037',
'1135',
'843',
'1234',
    '1601',
'1663',
'1649',
'1600',
'1743',
'1648',
'1713',
'1714',
'1744',
'1518',
'1507',
'1695',
'1470',
'1519',
'1742',
'1760',
'1459',
'1506',
'1761',
'1517',
     '1695',
'1714',
'1173',
'1125',
'1713',
'1715',
'1666',
'1743',
'1762',
'1742',
'1667',
'1471',
     '1606',
'1558',
'1511',
'1462',
'1600',
'1601',
'1510',
'1649',
'1648',
'706',
'1505',
'707',
'1553',
'560',
'1506',
'1125',
'1647',
'1030',
'1172',
'2174',
  
                            'Class'
]]

In [33]:
# confirm selection - note: fewer than 20 attributes were selected for some class labels
print(Top_20_Exploratory.shape)

(9690, 147)


In [34]:
#remove duplicates
Top_20_Exploratory_WithoutDuplicates = Top_20_Exploratory.loc[:,~Top_20_Exploratory.T.duplicated(keep='first')]
print(Top_20_Exploratory_WithoutDuplicates)

        792    742    747    741    697    698   1214   2068  1869   1119  \
0     180.0  103.0   87.0   95.0   85.0   86.0  168.0  117.0  79.0  202.0   
1     221.0  123.0  138.0  108.0  114.0  108.0   95.0  137.0  79.0  117.0   
2     234.0  196.0  175.0  168.0  145.0  146.0  147.0  127.0  79.0  201.0   
3     244.0  235.0  205.0  214.0  209.0  192.0  167.0  121.0  80.0  199.0   
4     148.0   78.0   79.0   76.0   71.0   71.0  162.0  135.0  75.0  163.0   
...     ...    ...    ...    ...    ...    ...    ...    ...   ...    ...   
9685  100.0   96.0   19.0   23.0   23.0   19.0   21.0   56.0  30.0   20.0   
9686  102.0  104.0   19.0   62.0   28.0   18.0   19.0   55.0  36.0   18.0   
9687  101.0  101.0   18.0   68.0   26.0   18.0   17.0   49.0  32.0   17.0   
9688   92.0   94.0   16.0   60.0   27.0   16.0   16.0   40.0  27.0   16.0   
9689   88.0   91.0   15.0   52.0   19.0   15.0   14.0   27.0  28.0   15.0   

      ...   1511   1510    706   1505    707   1553    560   1647   2174  \

In [35]:
# randomize
Top_20_Exploratory_WithoutDuplicates_randomized = Top_20_Exploratory_WithoutDuplicates.sample(frac=1).reset_index(drop=True)
print(Top_20_Exploratory_WithoutDuplicates_randomized)

        792    742    747    741    697    698   1214   2068   1869   1119  \
0      55.0   48.0   46.0   47.0   42.0   40.0   53.0   59.0   48.0   56.0   
1      70.0   70.0   72.0   70.0   72.0   71.0   69.0  143.0  121.0   41.0   
2      28.0   19.0   20.0   20.0   20.0   22.0   74.0   82.0  138.0   76.0   
3     156.0  155.0  155.0  148.0  124.0  121.0  118.0   48.0   46.0  145.0   
4      26.0   24.0   19.0   24.0   17.0   13.0   24.0   12.0    9.0   24.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685   83.0   85.0   85.0   85.0   86.0   86.0   84.0   70.0   95.0   73.0   
9686  107.0  106.0  103.0  104.0   99.0   92.0   98.0   22.0   22.0  106.0   
9687  255.0  255.0  251.0  255.0  255.0  254.0  255.0  255.0  249.0  255.0   
9688   53.0   55.0   54.0   54.0   55.0   54.0   55.0   21.0   51.0   56.0   
9689   68.0   38.0   31.0   36.0   20.0   21.0   89.0   29.0   30.0   65.0   

      ...   1511   1510    706   1505    707   1553    560   16

In [36]:
# save dataframe as csv for weka analyses
Top_20_Exploratory_WithoutDuplicates_randomized.to_csv('Top_20_Exploratory_random2.csv', index = False)

In [37]:
D2 = pd.read_csv ('Top_20_Exploratory_random2.csv')
print(D2)

        792    742    747    741    697    698   1214   2068   1869   1119  \
0      55.0   48.0   46.0   47.0   42.0   40.0   53.0   59.0   48.0   56.0   
1      70.0   70.0   72.0   70.0   72.0   71.0   69.0  143.0  121.0   41.0   
2      28.0   19.0   20.0   20.0   20.0   22.0   74.0   82.0  138.0   76.0   
3     156.0  155.0  155.0  148.0  124.0  121.0  118.0   48.0   46.0  145.0   
4      26.0   24.0   19.0   24.0   17.0   13.0   24.0   12.0    9.0   24.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685   83.0   85.0   85.0   85.0   86.0   86.0   84.0   70.0   95.0   73.0   
9686  107.0  106.0  103.0  104.0   99.0   92.0   98.0   22.0   22.0  106.0   
9687  255.0  255.0  251.0  255.0  255.0  254.0  255.0  255.0  249.0  255.0   
9688   53.0   55.0   54.0   54.0   55.0   54.0   55.0   21.0   51.0   56.0   
9689   68.0   38.0   31.0   36.0   20.0   21.0   89.0   29.0   30.0   65.0   

      ...   1511   1510    706   1505    707   1553    560   16

In [38]:
D = pd.read_csv ('Top_20_Exploratory_random.csv')
print(D)

        792    742    747    741    697    698   1214   2068  1869   1119  \
0      33.0   30.0   17.0   16.0   12.0   14.0   21.0   10.0  11.0   20.0   
1     196.0  223.0  222.0  251.0  210.0  213.0  191.0  110.0  57.0  187.0   
2     155.0   73.0   74.0   70.0   83.0   76.0   70.0   23.0  22.0   88.0   
3     165.0  168.0  166.0  165.0  170.0  170.0  123.0   30.0  28.0   44.0   
4      58.0   47.0   50.0   45.0   36.0   34.0   42.0   17.0  25.0   50.0   
...     ...    ...    ...    ...    ...    ...    ...    ...   ...    ...   
9685   67.0   60.0   21.0   22.0   22.0   19.0   20.0   31.0  34.0   22.0   
9686  215.0  222.0  225.0  223.0  221.0  221.0  200.0   31.0  33.0  215.0   
9687  158.0  115.0  154.0   94.0  112.0  110.0   46.0   48.0  58.0   75.0   
9688   70.0   68.0   68.0   70.0   69.0   70.0   72.0   47.0  46.0   70.0   
9689  155.0  108.0  106.0  108.0  129.0  136.0  243.0   66.0  65.0  246.0   

      ...   1511   1510    706   1505    707   1553    560   1647  2174  Cl

In [39]:
A = pd.read_csv ('Top_5_random2.csv')
print(A)

        744    745    792    742    747   1168   1167   1120   1215   1216  \
0     106.0  103.0  180.0  103.0   87.0  224.0  211.0  207.0  220.0  231.0   
1     164.0  163.0  221.0  123.0  138.0  198.0  122.0  183.0  124.0  199.0   
2     197.0  198.0  234.0  196.0  175.0  220.0  189.0  211.0  196.0  225.0   
3     237.0  233.0  244.0  235.0  205.0  233.0  202.0  227.0  206.0  230.0   
4      85.0   85.0  148.0   78.0   79.0  180.0  186.0  178.0  199.0  202.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685   82.0   33.0  100.0   96.0   19.0   19.0   20.0   18.0   19.0   22.0   
9686   96.0   38.0  102.0  104.0   19.0   19.0   18.0   18.0   19.0   30.0   
9687   92.0   34.0  101.0  101.0   18.0   18.0   17.0   17.0   18.0   36.0   
9688   94.0   38.0   92.0   94.0   16.0   16.0   16.0   16.0   16.0   25.0   
9689   83.0   26.0   88.0   91.0   15.0   15.0   14.0   15.0   15.0   26.0   

      ...   1714   1173   1125   1647   1655   1654   1560   15

In [40]:
B = pd.read_csv ('Top_10_random2.csv')
print(B)

        744    745    792    742    747    741    697    698   1214   2018  \
0     106.0  103.0  180.0  103.0   87.0   95.0   85.0   86.0  168.0  123.0   
1     164.0  163.0  221.0  123.0  138.0  108.0  114.0  108.0   95.0  122.0   
2     197.0  198.0  234.0  196.0  175.0  168.0  145.0  146.0  147.0  118.0   
3     237.0  233.0  244.0  235.0  205.0  214.0  209.0  192.0  167.0  136.0   
4      85.0   85.0  148.0   78.0   79.0   76.0   71.0   71.0  162.0  169.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685   82.0   33.0  100.0   96.0   19.0   23.0   23.0   19.0   21.0   59.0   
9686   96.0   38.0  102.0  104.0   19.0   62.0   28.0   18.0   19.0   40.0   
9687   92.0   34.0  101.0  101.0   18.0   68.0   26.0   18.0   17.0   41.0   
9688   94.0   38.0   92.0   94.0   16.0   60.0   27.0   16.0   16.0   35.0   
9689   83.0   26.0   88.0   91.0   15.0   52.0   19.0   15.0   14.0   37.0   

      ...   1715   1666   1654   1560   1561   1606   1559   16

In [41]:
C = pd.read_csv ('Top_20_random2.csv')
print(C)

        744    745    792    742    747    741    697    698   1214   2018  \
0     106.0  103.0  180.0  103.0   87.0   95.0   85.0   86.0  168.0  123.0   
1     164.0  163.0  221.0  123.0  138.0  108.0  114.0  108.0   95.0  122.0   
2     197.0  198.0  234.0  196.0  175.0  168.0  145.0  146.0  147.0  118.0   
3     237.0  233.0  244.0  235.0  205.0  214.0  209.0  192.0  167.0  136.0   
4      85.0   85.0  148.0   78.0   79.0   76.0   71.0   71.0  162.0  169.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685   82.0   33.0  100.0   96.0   19.0   23.0   23.0   19.0   21.0   59.0   
9686   96.0   38.0  102.0  104.0   19.0   62.0   28.0   18.0   19.0   40.0   
9687   92.0   34.0  101.0  101.0   18.0   68.0   26.0   18.0   17.0   41.0   
9688   94.0   38.0   92.0   94.0   16.0   60.0   27.0   16.0   16.0   35.0   
9689   83.0   26.0   88.0   91.0   15.0   52.0   19.0   15.0   14.0   37.0   

      ...   1559   1653   1558   1605   1511   1562   1660   15

In [42]:
A_randomized = A.sample(frac=1).reset_index(drop=True)

In [43]:
B_randomized = B.sample(frac=1).reset_index(drop=True)

In [44]:
C_randomized = C.sample(frac=1).reset_index(drop=True)

In [45]:
D_randomized = D.sample(frac=1).reset_index(drop=True)

In [46]:
D2_randomized = D2.sample(frac=1).reset_index(drop=True)

In [47]:
# second, independent shuffle of D2 (no separate D3 frame is loaded)
D3_randomized = D2.sample(frac=1).reset_index(drop=True)

In [48]:
print(A_randomized)

        744    745    792    742    747   1168   1167   1120   1215   1216  \
0      67.0   72.0  112.0   55.0   70.0  127.0   68.0  108.0   77.0  146.0   
1      11.0   13.0   15.0   17.0   21.0   17.0   19.0   19.0   19.0   17.0   
2     106.0  105.0  126.0   88.0  100.0   97.0   76.0   92.0   77.0   98.0   
3     149.0  152.0  174.0  129.0  132.0  131.0   99.0  126.0   89.0  125.0   
4     255.0  255.0  181.0  254.0  255.0  194.0  254.0  169.0  255.0  237.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  255.0  255.0  255.0  255.0  255.0  254.0  255.0  255.0  255.0  254.0   
9686  108.0   89.0  148.0   45.0   43.0   68.0  102.0   85.0   78.0   42.0   
9687   60.0   59.0   59.0   60.0   59.0   59.0   56.0   52.0   59.0   60.0   
9688   63.0   64.0   96.0   55.0   59.0  138.0   96.0  125.0  105.0  146.0   
9689  156.0  159.0  155.0  160.0  162.0   99.0  151.0  128.0  139.0   70.0   

      ...   1714   1173   1125   1647   1655   1654   1560   15

In [49]:
print(B_randomized)

        744    745    792    742    747    741    697    698   1214   2018  \
0     221.0  218.0  255.0  185.0  198.0  155.0  110.0  126.0  154.0   47.0   
1     119.0  121.0  144.0  111.0  114.0  107.0  116.0  116.0  111.0   65.0   
2     128.0  129.0  125.0  130.0  129.0  129.0  129.0  127.0  127.0  196.0   
3     255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  136.0   
4     179.0  181.0  184.0  185.0  170.0  184.0  178.0  182.0  183.0   50.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685   95.0   94.0  115.0   90.0   84.0   81.0   66.0   63.0  101.0   67.0   
9686  134.0  133.0  133.0  129.0  132.0  128.0  133.0  134.0   72.0  243.0   
9687   37.0   39.0   50.0   33.0   40.0   32.0   26.0   27.0   26.0  163.0   
9688   57.0   81.0   75.0   18.0   43.0   26.0   66.0   53.0   51.0   91.0   
9689  107.0  105.0  108.0  105.0  105.0  105.0  103.0  106.0  108.0   48.0   

      ...   1715   1666   1654   1560   1561   1606   1559   16

In [50]:
print(C_randomized) 

        744    745    792    742    747    741    697    698   1214   2018  \
0      49.0   53.0   73.0   39.0   56.0   35.0   34.0   36.0   31.0   38.0   
1     228.0  225.0  255.0  214.0  223.0  182.0  178.0  184.0  161.0   30.0   
2      48.0   48.0   47.0   45.0   47.0   34.0   49.0   49.0   18.0   57.0   
3     169.0  171.0  159.0  169.0  170.0  166.0  174.0  173.0  161.0   21.0   
4     137.0  138.0  193.0  109.0  129.0   80.0   84.0   83.0   70.0   95.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685   73.0   74.0  100.0   68.0   70.0   64.0   58.0   59.0   69.0   21.0   
9686  152.0  153.0  152.0  143.0  146.0  136.0  134.0  133.0  121.0   40.0   
9687  193.0  192.0  188.0  192.0  191.0  186.0  191.0  190.0  174.0   44.0   
9688   72.0   71.0   72.0   71.0   69.0   72.0   71.0   70.0   59.0  255.0   
9689  147.0  148.0  146.0  150.0  147.0  151.0  144.0  143.0  128.0   25.0   

      ...   1559   1653   1558   1605   1511   1562   1660   15

In [51]:
print(D_randomized) 

        792    742    747    741    697    698   1214   2068   1869   1119  \
0      87.0   83.0   86.0   73.0   89.0   89.0   74.0   69.0   69.0   72.0   
1     103.0   38.0   48.0   40.0   39.0   48.0   44.0   33.0   42.0   44.0   
2     147.0   67.0   64.0   66.0   55.0   55.0   87.0   66.0  104.0  125.0   
3      79.0   75.0   76.0   71.0   69.0   68.0   57.0   68.0   68.0   65.0   
4     104.0  102.0  103.0  103.0  101.0  100.0   82.0   25.0   23.0   86.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  157.0  147.0  155.0  121.0  127.0  131.0   81.0   54.0   60.0  123.0   
9686  153.0  155.0  158.0  154.0  157.0  157.0  140.0   21.0   25.0   61.0   
9687   60.0   21.0   42.0   35.0   51.0   43.0   73.0   89.0   24.0   74.0   
9688  255.0  255.0  255.0  255.0  255.0  244.0  227.0  110.0   26.0  255.0   
9689   21.0   22.0   20.0   21.0   21.0   20.0   13.0    7.0    7.0   17.0   

      ...   1511   1510    706   1505    707   1553    560   16

In [52]:
print(D2_randomized)

        792    742    747    741    697    698   1214   2068   1869   1119  \
0     252.0  255.0  255.0  255.0  255.0  255.0  255.0   69.0   95.0  255.0   
1      70.0   48.0   64.0   40.0   44.0   42.0   25.0   26.0   23.0   31.0   
2      59.0   59.0   61.0   60.0   61.0   61.0   60.0  202.0  204.0   59.0   
3     140.0   69.0   41.0   34.0   79.0   46.0   80.0  115.0  231.0   71.0   
4      99.0  109.0  108.0  108.0  108.0  108.0  106.0  255.0  255.0   59.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  213.0  221.0  179.0  221.0  216.0  216.0  218.0   65.0  132.0  226.0   
9686  255.0  255.0  255.0  255.0  255.0  255.0  251.0   70.0   44.0  255.0   
9687  255.0  255.0  255.0  255.0  254.0  251.0  211.0   29.0   32.0  255.0   
9688   83.0   82.0   81.0   79.0   79.0   76.0   71.0   23.0   19.0   78.0   
9689   87.0  143.0   41.0  108.0   87.0   51.0  178.0  159.0  138.0  183.0   

      ...   1511   1510    706   1505    707   1553    560   16

In [53]:
B_randomized.describe()

Unnamed: 0,744,745,792,742,747,741,697,698,1214,2018,...,1715,1666,1654,1560,1561,1606,1559,1611,1653,Class
count,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,...,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0
mean,125.463055,124.093189,132.633333,120.565738,119.066873,115.796285,113.814241,112.488235,110.943034,69.816718,...,65.810423,77.287203,130.587203,136.417441,135.35418,131.1387,132.056553,132.224768,125.818266,3.148607
std,76.319759,76.530594,75.153805,76.688047,76.541369,76.585076,76.484526,76.41206,76.25081,64.217608,...,47.78397,55.796508,75.556268,73.589699,73.731462,74.006799,73.179806,74.342174,75.485332,2.177158
min,7.0,5.0,8.0,9.0,5.0,6.0,5.0,5.0,5.0,4.0,...,4.0,5.0,9.0,12.0,12.0,11.0,11.0,7.0,8.0,0.0
25%,59.0,57.0,66.0,55.0,54.0,52.0,50.0,48.25,48.0,27.0,...,31.0,34.0,66.0,74.25,73.0,68.0,70.0,69.0,61.0,1.0
50%,110.0,109.0,121.0,102.0,101.0,95.0,93.5,91.0,89.0,45.0,...,53.0,62.0,117.0,125.0,123.0,118.0,119.0,119.0,111.0,3.0
75%,184.0,183.0,194.0,179.0,175.0,170.0,166.0,164.0,163.0,86.0,...,87.0,106.0,192.0,196.0,195.0,191.0,188.0,192.0,184.0,4.0
max,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,...,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,9.0


In [54]:
A_randomized.describe()

Unnamed: 0,744,745,792,742,747,1168,1167,1120,1215,1216,...,1714,1173,1125,1647,1655,1654,1560,1561,1606,Class
count,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,...,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0,9690.0
mean,125.463055,124.093189,132.633333,120.565738,119.066873,112.842002,113.899897,111.068421,115.070175,117.006914,...,69.429825,93.662745,96.072652,87.346646,134.297833,130.587203,136.417441,135.35418,131.1387,3.148607
std,76.319759,76.530594,75.153805,76.688047,76.541369,75.457465,75.845408,74.151891,76.906898,76.843335,...,50.842278,63.462087,62.802525,64.680782,75.679938,75.556268,73.589699,73.731462,74.006799,2.177158
min,7.0,5.0,8.0,9.0,5.0,5.0,5.0,5.0,5.0,6.0,...,4.0,5.0,4.0,5.0,9.0,9.0,12.0,12.0,11.0,0.0
25%,59.0,57.0,66.0,55.0,54.0,49.0,51.0,49.0,51.0,52.0,...,31.0,43.0,46.0,38.0,69.0,66.0,74.25,73.0,68.0,1.0
50%,110.0,109.0,121.0,102.0,101.0,93.0,93.0,92.0,94.0,98.0,...,56.0,77.0,80.0,68.0,123.0,117.0,125.0,123.0,118.0,3.0
75%,184.0,183.0,194.0,179.0,175.0,169.0,169.0,164.0,172.0,177.0,...,92.0,131.0,134.0,119.0,198.0,192.0,196.0,195.0,191.0,4.0
max,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,...,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,9.0


In [55]:
from sklearn.model_selection import train_test_split

train_setD, test_setD = train_test_split(D, test_size=0.2, random_state=1938)

In [56]:
from sklearn.model_selection import train_test_split

train_setB, test_setB = train_test_split(B, test_size=0.2, random_state=1938)

In [57]:
test_setB.head()

Unnamed: 0,744,745,792,742,747,741,697,698,1214,2018,...,1715,1666,1654,1560,1561,1606,1559,1611,1653,Class
4696,55.0,56.0,58.0,55.0,55.0,55.0,49.0,47.0,54.0,29.0,...,18.0,20.0,54.0,54.0,56.0,48.0,50.0,54.0,53.0,3
1341,122.0,123.0,123.0,123.0,118.0,122.0,121.0,123.0,123.0,31.0,...,27.0,44.0,116.0,121.0,122.0,115.0,120.0,121.0,119.0,1
5114,84.0,101.0,122.0,47.0,105.0,35.0,40.0,43.0,39.0,20.0,...,34.0,43.0,130.0,132.0,145.0,133.0,117.0,143.0,105.0,3
8461,51.0,62.0,60.0,38.0,49.0,36.0,52.0,50.0,24.0,27.0,...,37.0,43.0,45.0,77.0,80.0,71.0,80.0,64.0,45.0,6
4273,100.0,99.0,99.0,100.0,99.0,101.0,99.0,101.0,93.0,43.0,...,21.0,48.0,101.0,101.0,99.0,98.0,95.0,104.0,101.0,2


In [58]:
train_setD.head()

Unnamed: 0,792,742,747,741,697,698,1214,2068,1869,1119,...,1511,1510,706,1505,707,1553,560,1647,2174,Class
5212,255.0,255.0,255.0,255.0,255.0,254.0,255.0,103.0,112.0,255.0,...,236.0,157.0,72.0,148.0,67.0,219.0,96.0,224.0,73.0,2
3768,204.0,210.0,200.0,210.0,203.0,203.0,100.0,64.0,101.0,116.0,...,198.0,180.0,92.0,87.0,90.0,81.0,83.0,64.0,105.0,1
4456,152.0,170.0,171.0,167.0,174.0,172.0,93.0,44.0,79.0,147.0,...,158.0,157.0,47.0,96.0,51.0,75.0,55.0,68.0,36.0,4
9250,119.0,118.0,118.0,118.0,116.0,115.0,114.0,23.0,35.0,112.0,...,98.0,68.0,26.0,31.0,26.0,45.0,29.0,80.0,22.0,1
7235,217.0,186.0,182.0,173.0,147.0,142.0,195.0,71.0,197.0,209.0,...,164.0,136.0,70.0,171.0,82.0,183.0,104.0,59.0,243.0,2


In [61]:
train_setB.head()

Unnamed: 0,744,745,792,742,747,741,697,698,1214,2018,...,1715,1666,1654,1560,1561,1606,1559,1611,1653,Class
5212,30.0,31.0,31.0,33.0,30.0,34.0,31.0,32.0,23.0,187.0,...,12.0,11.0,31.0,32.0,33.0,32.0,32.0,31.0,32.0,3
3768,138.0,139.0,140.0,140.0,140.0,139.0,136.0,133.0,133.0,16.0,...,32.0,32.0,129.0,127.0,131.0,122.0,113.0,131.0,130.0,2
4456,181.0,185.0,215.0,143.0,162.0,106.0,87.0,85.0,147.0,33.0,...,66.0,53.0,214.0,199.0,211.0,189.0,164.0,206.0,210.0,2
9250,24.0,24.0,26.0,30.0,27.0,26.0,20.0,21.0,15.0,17.0,...,62.0,68.0,51.0,55.0,69.0,44.0,42.0,59.0,69.0,8
7235,42.0,42.0,81.0,33.0,35.0,29.0,29.0,28.0,28.0,19.0,...,19.0,20.0,30.0,79.0,75.0,39.0,74.0,39.0,31.0,4


In [62]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1938)

In [63]:
# split() yields positional indices, so select rows with .iloc
for train_index, test_index in split.split(D2_randomized, D2_randomized["Class"]):
    strat_train_setD2 = D2_randomized.iloc[train_index]
    strat_test_setD2 = D2_randomized.iloc[test_index]

In [64]:
# split() yields positional indices, so select rows with .iloc
for train_index, test_index in split.split(D3_randomized, D3_randomized["Class"]):
    strat_train_setD3 = D3_randomized.iloc[train_index]
    strat_test_setD3 = D3_randomized.iloc[test_index]

In [65]:
strat_train_setD2.shape

(7752, 120)
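The point of the stratified split is that each class keeps roughly the same share in the train and test sets as in the full data. The following is a minimal sketch of how that could be checked. It runs on a small synthetic frame (`toy`, with made-up columns `p0` to `p2`), not on the coursework data, so the numbers are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical stand-in for one of the *_randomized frames:
# 1000 rows, 3 pixel-like features, and an imbalanced "Class" column.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 256, size=(1000, 3)), columns=["p0", "p1", "p2"])
toy["Class"] = np.repeat([0, 1, 2], [600, 300, 100])

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1938)
train_idx, test_idx = next(splitter.split(toy, toy["Class"]))

# split() yields positional indices, so rows are selected with .iloc
toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# Class proportions in the full data and in each split, side by side
proportions = pd.DataFrame({
    "overall": toy["Class"].value_counts(normalize=True),
    "train": toy_train["Class"].value_counts(normalize=True),
    "test": toy_test["Class"].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

The same comparison could be run on any of the `strat_train_set*` / `strat_test_set*` pairs by swapping them in for `toy_train` and `toy_test`.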

In [66]:
strat_train_setD2.head()

Unnamed: 0,792,742,747,741,697,698,1214,2068,1869,1119,...,1511,1510,706,1505,707,1553,560,1647,2174,Class
7728,49.0,40.0,35.0,38.0,31.0,33.0,31.0,44.0,255.0,37.0,...,46.0,49.0,214.0,42.0,237.0,44.0,254.0,23.0,48.0,3
5635,255.0,255.0,255.0,255.0,255.0,255.0,255.0,60.0,33.0,254.0,...,250.0,217.0,192.0,221.0,156.0,252.0,142.0,251.0,213.0,2
7459,78.0,111.0,62.0,120.0,81.0,76.0,94.0,146.0,162.0,88.0,...,92.0,99.0,109.0,109.0,110.0,93.0,147.0,105.0,122.0,1
7545,99.0,98.0,79.0,98.0,70.0,63.0,87.0,37.0,26.0,90.0,...,78.0,71.0,39.0,53.0,40.0,65.0,39.0,57.0,35.0,3
7340,24.0,39.0,21.0,53.0,16.0,18.0,15.0,21.0,13.0,15.0,...,42.0,37.0,17.0,15.0,14.0,30.0,14.0,42.0,17.0,8


In [67]:
# split() yields positional indices, so select rows with .iloc
for train_index, test_index in split.split(C_randomized, C_randomized["Class"]):
    strat_train_setC = C_randomized.iloc[train_index]
    strat_test_setC = C_randomized.iloc[test_index]

In [68]:
# split() yields positional indices, so select rows with .iloc
for train_index, test_index in split.split(B_randomized, B_randomized["Class"]):
    strat_train_setB = B_randomized.iloc[train_index]
    strat_test_setB = B_randomized.iloc[test_index]

In [69]:
# split() yields positional indices, so select rows with .iloc
for train_index, test_index in split.split(A_randomized, A_randomized["Class"]):
    strat_train_set = A_randomized.iloc[train_index]
    strat_test_set = A_randomized.iloc[test_index]

In [70]:
# split() yields positional indices, so select rows with .iloc
for train_index, test_index in split.split(D_randomized, D_randomized["Class"]):
    strat_train_setD = D_randomized.iloc[train_index]
    strat_test_setD = D_randomized.iloc[test_index]

In [71]:
strat_train_set.head()

Unnamed: 0,744,745,792,742,747,1168,1167,1120,1215,1216,...,1714,1173,1125,1647,1655,1654,1560,1561,1606,Class
7937,15.0,15.0,16.0,15.0,14.0,22.0,20.0,21.0,21.0,22.0,...,15.0,18.0,19.0,14.0,23.0,23.0,23.0,25.0,22.0,3
5693,147.0,144.0,153.0,138.0,105.0,93.0,111.0,68.0,122.0,112.0,...,60.0,51.0,73.0,33.0,99.0,100.0,103.0,105.0,122.0,2
7338,172.0,179.0,157.0,131.0,141.0,186.0,181.0,188.0,181.0,185.0,...,93.0,162.0,132.0,85.0,154.0,144.0,183.0,186.0,177.0,1
7728,64.0,65.0,66.0,65.0,65.0,21.0,22.0,27.0,17.0,13.0,...,30.0,48.0,61.0,50.0,60.0,60.0,58.0,52.0,57.0,3
7366,148.0,183.0,182.0,50.0,105.0,75.0,106.0,99.0,92.0,60.0,...,205.0,228.0,218.0,191.0,129.0,124.0,215.0,204.0,124.0,8


In [72]:
print(strat_train_setB.shape)
print(strat_train_set.shape)

(7752, 80)
(7752, 43)


In [73]:
print(strat_train_setD.shape)

(7752, 120)


In [74]:
# check for NaNs in both splits (summed, so the test count is not overwritten)
anyNans = strat_test_setD3.isnull().sum().sum() + strat_train_setD3.isnull().sum().sum()
# print NaN count
print('\nNaN Count : ' + str(anyNans))


NaN Count : 0


In [75]:
# check for NaNs in both splits (summed, so the test count is not overwritten)
anyNans = test_setD.isnull().sum().sum() + train_setD.isnull().sum().sum()
# print NaN count
print('\nNaN Count : ' + str(anyNans))



NaN Count : 0
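Since the same NaN check is repeated for several train/test pairs, a small helper that reports one count per frame may be easier to read. This is only a sketch: `count_nans`, `demo_train` and `demo_test` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def count_nans(frames):
    """Return a dict mapping each frame's name to its total NaN count."""
    return {name: int(df.isnull().sum().sum()) for name, df in frames.items()}

# Hypothetical stand-ins for a train/test pair
demo_train = pd.DataFrame({"a": [1.0, 2.0], "Class": [0, 1]})
demo_test = pd.DataFrame({"a": [np.nan, 4.0], "Class": [1, 0]})

nan_counts = count_nans({"train": demo_train, "test": demo_test})
print(nan_counts)
```

Reporting the counts separately makes it obvious which split contains missing values, which a single overwritten variable would hide.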


In [76]:
strat_train_setB.head()

Unnamed: 0,744,745,792,742,747,741,697,698,1214,2018,...,1715,1666,1654,1560,1561,1606,1559,1611,1653,Class
7809,43.0,44.0,44.0,43.0,44.0,42.0,44.0,43.0,43.0,30.0,...,27.0,46.0,44.0,44.0,46.0,40.0,42.0,38.0,44.0,3
5546,110.0,110.0,106.0,111.0,107.0,110.0,109.0,108.0,112.0,154.0,...,28.0,46.0,107.0,104.0,105.0,105.0,96.0,109.0,108.0,2
7326,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,55.0,...,107.0,232.0,255.0,248.0,225.0,255.0,255.0,255.0,255.0,1
7630,55.0,59.0,71.0,49.0,63.0,47.0,39.0,38.0,45.0,35.0,...,43.0,48.0,92.0,105.0,113.0,105.0,98.0,119.0,83.0,3
7545,137.0,113.0,141.0,89.0,38.0,43.0,92.0,53.0,38.0,200.0,...,118.0,109.0,99.0,127.0,94.0,81.0,136.0,97.0,104.0,8


In [77]:
strat_train_setC.head()

Unnamed: 0,744,745,792,742,747,741,697,698,1214,2018,...,1559,1653,1558,1605,1511,1562,1660,1512,1604,Class
7960,68.0,71.0,85.0,58.0,66.0,47.0,36.0,35.0,43.0,36.0,...,81.0,81.0,71.0,85.0,65.0,88.0,89.0,80.0,82.0,3
5623,133.0,134.0,148.0,122.0,123.0,111.0,102.0,101.0,90.0,32.0,...,144.0,101.0,139.0,124.0,129.0,150.0,104.0,135.0,114.0,2
7402,214.0,214.0,217.0,215.0,213.0,217.0,217.0,216.0,214.0,83.0,...,206.0,205.0,189.0,202.0,187.0,205.0,200.0,204.0,189.0,1
7748,57.0,58.0,60.0,51.0,57.0,47.0,49.0,46.0,28.0,23.0,...,54.0,26.0,52.0,38.0,50.0,54.0,30.0,53.0,31.0,3
7361,175.0,155.0,186.0,155.0,119.0,126.0,136.0,130.0,163.0,233.0,...,192.0,168.0,186.0,187.0,175.0,165.0,164.0,186.0,189.0,8


In [78]:
strat_test_set.head()

Unnamed: 0,744,745,792,742,747,1168,1167,1120,1215,1216,...,1714,1173,1125,1647,1655,1654,1560,1561,1606,Class
6043,58.0,57.0,58.0,57.0,57.0,55.0,54.0,53.0,55.0,56.0,...,17.0,22.0,22.0,40.0,54.0,54.0,53.0,53.0,55.0,4
5140,56.0,56.0,50.0,58.0,56.0,68.0,53.0,64.0,55.0,73.0,...,47.0,72.0,63.0,53.0,92.0,89.0,83.0,90.0,86.0,1
9333,179.0,159.0,198.0,215.0,157.0,209.0,154.0,186.0,175.0,224.0,...,254.0,254.0,255.0,249.0,235.0,231.0,170.0,167.0,186.0,8
804,53.0,53.0,73.0,46.0,50.0,69.0,56.0,65.0,58.0,70.0,...,26.0,45.0,66.0,24.0,69.0,65.0,58.0,65.0,64.0,2
8890,219.0,247.0,254.0,221.0,144.0,255.0,254.0,219.0,255.0,255.0,...,116.0,170.0,161.0,193.0,255.0,255.0,255.0,253.0,255.0,1


In [79]:
strat_test_setB.head()

Unnamed: 0,744,745,792,742,747,741,697,698,1214,2018,...,1715,1666,1654,1560,1561,1606,1559,1611,1653,Class
5901,220.0,217.0,223.0,206.0,194.0,191.0,148.0,141.0,157.0,44.0,...,82.0,57.0,209.0,216.0,213.0,217.0,216.0,215.0,198.0,4
5394,142.0,143.0,143.0,143.0,139.0,142.0,142.0,139.0,145.0,25.0,...,48.0,66.0,136.0,139.0,141.0,138.0,138.0,133.0,137.0,1
9375,147.0,138.0,154.0,71.0,62.0,37.0,117.0,81.0,60.0,57.0,...,120.0,119.0,127.0,139.0,121.0,54.0,126.0,91.0,114.0,8
957,123.0,123.0,120.0,123.0,120.0,119.0,114.0,114.0,118.0,63.0,...,53.0,43.0,112.0,119.0,123.0,115.0,114.0,116.0,110.0,2
8695,81.0,80.0,80.0,81.0,78.0,82.0,73.0,71.0,83.0,121.0,...,56.0,55.0,64.0,69.0,64.0,66.0,71.0,63.0,63.0,1


In [80]:
strat_train_set.to_csv('Top_5_train_set.csv', index = False)

In [81]:
strat_test_set.to_csv('Top_5_test_set.csv', index = False)

In [82]:
strat_train_set[['Class']] = strat_train_set[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])


In [83]:
strat_train_setD2[['Class']] = strat_train_setD2[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])

In [84]:
strat_train_setD3[['Class']] = strat_train_setD3[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])

In [85]:
strat_test_setD2[['Class']] = strat_test_setD2[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])

In [86]:
strat_train_setB[['Class']] = strat_train_setB[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])

In [87]:
strat_train_setC[['Class']] = strat_train_setC[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])

In [88]:
strat_train_setD[['Class']] = strat_train_setD[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])
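The numeric-to-nominal conversion above is repeated for every split with the same two lists. One possible alternative, sketched here under the assumption that a shared dictionary is acceptable, keeps the mapping in one place. `CLASS_NAMES`, `to_nominal` and the `demo` frame are hypothetical names used only for this illustration.

```python
import pandas as pd

# Single source of truth for the numeric-to-nominal class mapping
CLASS_NAMES = {
    0: "zero", 1: "one", 2: "two", 3: "three", 4: "four",
    5: "five", 6: "six", 7: "seven", 8: "eight", 9: "nine",
}

def to_nominal(df, column="Class"):
    """Return a copy of df with numeric class labels replaced by names."""
    out = df.copy()
    # Series.map turns any label missing from CLASS_NAMES into NaN,
    # which makes unexpected labels easy to spot afterwards.
    out[column] = out[column].map(CLASS_NAMES)
    return out

# Tiny made-up frame standing in for one of the splits
demo = pd.DataFrame({"744": [15.0, 147.0, 172.0], "Class": [3, 2, 8]})
demo_nom = to_nominal(demo)
print(demo_nom["Class"].tolist())
```

Because the helper works on a copy, the numeric version of each split stays available, for example for saving both the numeric and the nominal CSV files.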

In [90]:
strat_train_setB.head()

Unnamed: 0,744,745,792,742,747,741,697,698,1214,2018,...,1715,1666,1654,1560,1561,1606,1559,1611,1653,Class
7809,43.0,44.0,44.0,43.0,44.0,42.0,44.0,43.0,43.0,30.0,...,27.0,46.0,44.0,44.0,46.0,40.0,42.0,38.0,44.0,three
5546,110.0,110.0,106.0,111.0,107.0,110.0,109.0,108.0,112.0,154.0,...,28.0,46.0,107.0,104.0,105.0,105.0,96.0,109.0,108.0,two
7326,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,255.0,55.0,...,107.0,232.0,255.0,248.0,225.0,255.0,255.0,255.0,255.0,one
7630,55.0,59.0,71.0,49.0,63.0,47.0,39.0,38.0,45.0,35.0,...,43.0,48.0,92.0,105.0,113.0,105.0,98.0,119.0,83.0,three
7545,137.0,113.0,141.0,89.0,38.0,43.0,92.0,53.0,38.0,200.0,...,118.0,109.0,99.0,127.0,94.0,81.0,136.0,97.0,104.0,eight


In [91]:
strat_train_setD2.head()

Unnamed: 0,792,742,747,741,697,698,1214,2068,1869,1119,...,1511,1510,706,1505,707,1553,560,1647,2174,Class
7728,49.0,40.0,35.0,38.0,31.0,33.0,31.0,44.0,255.0,37.0,...,46.0,49.0,214.0,42.0,237.0,44.0,254.0,23.0,48.0,three
5635,255.0,255.0,255.0,255.0,255.0,255.0,255.0,60.0,33.0,254.0,...,250.0,217.0,192.0,221.0,156.0,252.0,142.0,251.0,213.0,two
7459,78.0,111.0,62.0,120.0,81.0,76.0,94.0,146.0,162.0,88.0,...,92.0,99.0,109.0,109.0,110.0,93.0,147.0,105.0,122.0,one
7545,99.0,98.0,79.0,98.0,70.0,63.0,87.0,37.0,26.0,90.0,...,78.0,71.0,39.0,53.0,40.0,65.0,39.0,57.0,35.0,three
7340,24.0,39.0,21.0,53.0,16.0,18.0,15.0,21.0,13.0,15.0,...,42.0,37.0,17.0,15.0,14.0,30.0,14.0,42.0,17.0,eight


In [92]:
strat_train_setD3.head()

Unnamed: 0,792,742,747,741,697,698,1214,2068,1869,1119,...,1511,1510,706,1505,707,1553,560,1647,2174,Class
8018,159.0,156.0,157.0,151.0,145.0,140.0,126.0,55.0,67.0,144.0,...,80.0,60.0,37.0,87.0,37.0,95.0,52.0,47.0,81.0,three
5463,179.0,169.0,155.0,158.0,133.0,126.0,119.0,17.0,16.0,146.0,...,174.0,162.0,48.0,145.0,50.0,111.0,52.0,55.0,19.0,two
7423,142.0,142.0,138.0,138.0,136.0,134.0,139.0,55.0,54.0,144.0,...,89.0,24.0,85.0,33.0,47.0,45.0,49.0,96.0,35.0,one
7822,153.0,155.0,150.0,155.0,151.0,143.0,113.0,45.0,29.0,116.0,...,147.0,142.0,48.0,102.0,68.0,135.0,68.0,39.0,28.0,three
7564,42.0,39.0,18.0,29.0,23.0,19.0,20.0,27.0,27.0,18.0,...,45.0,39.0,33.0,28.0,32.0,35.0,52.0,44.0,22.0,eight


In [93]:
strat_test_setD2.head()

Unnamed: 0,792,742,747,741,697,698,1214,2068,1869,1119,...,1511,1510,706,1505,707,1553,560,1647,2174,Class
5683,68.0,72.0,71.0,72.0,74.0,73.0,79.0,23.0,25.0,80.0,...,79.0,77.0,21.0,69.0,18.0,68.0,18.0,24.0,18.0,four
5403,133.0,134.0,128.0,130.0,131.0,132.0,127.0,189.0,72.0,128.0,...,112.0,75.0,90.0,48.0,69.0,73.0,56.0,77.0,255.0,one
9412,167.0,44.0,85.0,35.0,168.0,122.0,90.0,43.0,74.0,93.0,...,156.0,113.0,63.0,133.0,82.0,175.0,64.0,176.0,39.0,eight
841,252.0,255.0,255.0,255.0,255.0,255.0,255.0,23.0,29.0,250.0,...,247.0,196.0,126.0,176.0,89.0,255.0,69.0,171.0,202.0,two
8891,36.0,36.0,34.0,35.0,35.0,34.0,36.0,19.0,13.0,39.0,...,38.0,39.0,19.0,29.0,19.0,26.0,21.0,22.0,17.0,one


In [94]:
train_setD.head()

Unnamed: 0,792,742,747,741,697,698,1214,2068,1869,1119,...,1511,1510,706,1505,707,1553,560,1647,2174,Class
5212,255.0,255.0,255.0,255.0,255.0,254.0,255.0,103.0,112.0,255.0,...,236.0,157.0,72.0,148.0,67.0,219.0,96.0,224.0,73.0,two
3768,204.0,210.0,200.0,210.0,203.0,203.0,100.0,64.0,101.0,116.0,...,198.0,180.0,92.0,87.0,90.0,81.0,83.0,64.0,105.0,one
4456,152.0,170.0,171.0,167.0,174.0,172.0,93.0,44.0,79.0,147.0,...,158.0,157.0,47.0,96.0,51.0,75.0,55.0,68.0,36.0,four
9250,119.0,118.0,118.0,118.0,116.0,115.0,114.0,23.0,35.0,112.0,...,98.0,68.0,26.0,31.0,26.0,45.0,29.0,80.0,22.0,one
7235,217.0,186.0,182.0,173.0,147.0,142.0,195.0,71.0,197.0,209.0,...,164.0,136.0,70.0,171.0,82.0,183.0,104.0,59.0,243.0,two


In [95]:
test_setD.head()

Unnamed: 0,792,742,747,741,697,698,1214,2068,1869,1119,...,1511,1510,706,1505,707,1553,560,1647,2174,Class
4696,18.0,17.0,17.0,18.0,18.0,18.0,18.0,22.0,9.0,18.0,...,14.0,9.0,9.0,8.0,8.0,12.0,8.0,7.0,14.0,3
1341,174.0,127.0,150.0,93.0,72.0,72.0,48.0,42.0,49.0,88.0,...,175.0,159.0,89.0,85.0,87.0,54.0,86.0,68.0,34.0,4
5114,137.0,138.0,137.0,135.0,138.0,139.0,124.0,92.0,45.0,138.0,...,65.0,55.0,103.0,77.0,81.0,113.0,40.0,46.0,70.0,1
8461,228.0,229.0,232.0,228.0,224.0,223.0,176.0,57.0,228.0,217.0,...,181.0,156.0,47.0,192.0,46.0,168.0,59.0,55.0,159.0,1
4273,86.0,87.0,87.0,86.0,88.0,88.0,73.0,66.0,23.0,70.0,...,84.0,79.0,30.0,65.0,25.0,80.0,23.0,27.0,21.0,3


In [96]:
strat_train_setC.head()

Unnamed: 0,744,745,792,742,747,741,697,698,1214,2018,...,1559,1653,1558,1605,1511,1562,1660,1512,1604,Class
7960,68.0,71.0,85.0,58.0,66.0,47.0,36.0,35.0,43.0,36.0,...,81.0,81.0,71.0,85.0,65.0,88.0,89.0,80.0,82.0,three
5623,133.0,134.0,148.0,122.0,123.0,111.0,102.0,101.0,90.0,32.0,...,144.0,101.0,139.0,124.0,129.0,150.0,104.0,135.0,114.0,two
7402,214.0,214.0,217.0,215.0,213.0,217.0,217.0,216.0,214.0,83.0,...,206.0,205.0,189.0,202.0,187.0,205.0,200.0,204.0,189.0,one
7748,57.0,58.0,60.0,51.0,57.0,47.0,49.0,46.0,28.0,23.0,...,54.0,26.0,52.0,38.0,50.0,54.0,30.0,53.0,31.0,three
7361,175.0,155.0,186.0,155.0,119.0,126.0,136.0,130.0,163.0,233.0,...,192.0,168.0,186.0,187.0,175.0,165.0,164.0,186.0,189.0,eight


In [97]:
strat_train_setD.head()

Unnamed: 0,792,742,747,741,697,698,1214,2068,1869,1119,...,1511,1510,706,1505,707,1553,560,1647,2174,Class
8028,108.0,109.0,107.0,110.0,109.0,108.0,99.0,80.0,59.0,116.0,...,92.0,81.0,43.0,99.0,34.0,72.0,30.0,34.0,255.0,three
5724,100.0,108.0,107.0,110.0,107.0,107.0,112.0,136.0,146.0,102.0,...,69.0,42.0,50.0,60.0,32.0,93.0,31.0,46.0,97.0,two
7331,107.0,107.0,100.0,108.0,111.0,110.0,104.0,28.0,29.0,106.0,...,81.0,73.0,42.0,36.0,25.0,69.0,27.0,80.0,43.0,one
7904,60.0,59.0,56.0,56.0,40.0,37.0,40.0,21.0,20.0,60.0,...,58.0,52.0,24.0,46.0,29.0,37.0,26.0,22.0,19.0,three
7183,189.0,217.0,161.0,201.0,138.0,156.0,119.0,106.0,253.0,114.0,...,173.0,188.0,152.0,148.0,156.0,179.0,119.0,235.0,79.0,eight


In [98]:
strat_train_set.head()

Unnamed: 0,744,745,792,742,747,1168,1167,1120,1215,1216,...,1714,1173,1125,1647,1655,1654,1560,1561,1606,Class
7937,15.0,15.0,16.0,15.0,14.0,22.0,20.0,21.0,21.0,22.0,...,15.0,18.0,19.0,14.0,23.0,23.0,23.0,25.0,22.0,three
5693,147.0,144.0,153.0,138.0,105.0,93.0,111.0,68.0,122.0,112.0,...,60.0,51.0,73.0,33.0,99.0,100.0,103.0,105.0,122.0,two
7338,172.0,179.0,157.0,131.0,141.0,186.0,181.0,188.0,181.0,185.0,...,93.0,162.0,132.0,85.0,154.0,144.0,183.0,186.0,177.0,one
7728,64.0,65.0,66.0,65.0,65.0,21.0,22.0,27.0,17.0,13.0,...,30.0,48.0,61.0,50.0,60.0,60.0,58.0,52.0,57.0,three
7366,148.0,183.0,182.0,50.0,105.0,75.0,106.0,99.0,92.0,60.0,...,205.0,228.0,218.0,191.0,129.0,124.0,215.0,204.0,124.0,eight


In [99]:
strat_test_set[['Class']] = strat_test_set[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])

In [100]:
strat_test_setD3[['Class']] = strat_test_setD3[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])

In [101]:
strat_test_setB[['Class']] = strat_test_setB[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])

In [102]:
strat_test_setC[['Class']] = strat_test_setC[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])

In [103]:
strat_test_setD[['Class']] = strat_test_setD[['Class']].replace(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'])

In [106]:
strat_test_set.head()

Unnamed: 0,744,745,792,742,747,1168,1167,1120,1215,1216,...,1714,1173,1125,1647,1655,1654,1560,1561,1606,Class
6043,58.0,57.0,58.0,57.0,57.0,55.0,54.0,53.0,55.0,56.0,...,17.0,22.0,22.0,40.0,54.0,54.0,53.0,53.0,55.0,four
5140,56.0,56.0,50.0,58.0,56.0,68.0,53.0,64.0,55.0,73.0,...,47.0,72.0,63.0,53.0,92.0,89.0,83.0,90.0,86.0,one
9333,179.0,159.0,198.0,215.0,157.0,209.0,154.0,186.0,175.0,224.0,...,254.0,254.0,255.0,249.0,235.0,231.0,170.0,167.0,186.0,eight
804,53.0,53.0,73.0,46.0,50.0,69.0,56.0,65.0,58.0,70.0,...,26.0,45.0,66.0,24.0,69.0,65.0,58.0,65.0,64.0,two
8890,219.0,247.0,254.0,221.0,144.0,255.0,254.0,219.0,255.0,255.0,...,116.0,170.0,161.0,193.0,255.0,255.0,255.0,253.0,255.0,one


In [107]:
test_setD.head()

Unnamed: 0,792,742,747,741,697,698,1214,2068,1869,1119,...,1511,1510,706,1505,707,1553,560,1647,2174,Class
4696,18.0,17.0,17.0,18.0,18.0,18.0,18.0,22.0,9.0,18.0,...,14.0,9.0,9.0,8.0,8.0,12.0,8.0,7.0,14.0,three
1341,174.0,127.0,150.0,93.0,72.0,72.0,48.0,42.0,49.0,88.0,...,175.0,159.0,89.0,85.0,87.0,54.0,86.0,68.0,34.0,four
5114,137.0,138.0,137.0,135.0,138.0,139.0,124.0,92.0,45.0,138.0,...,65.0,55.0,103.0,77.0,81.0,113.0,40.0,46.0,70.0,one
8461,228.0,229.0,232.0,228.0,224.0,223.0,176.0,57.0,228.0,217.0,...,181.0,156.0,47.0,192.0,46.0,168.0,59.0,55.0,159.0,one
4273,86.0,87.0,87.0,86.0,88.0,88.0,73.0,66.0,23.0,70.0,...,84.0,79.0,30.0,65.0,25.0,80.0,23.0,27.0,21.0,three


In [108]:
strat_test_setC.head()

Unnamed: 0,744,745,792,742,747,741,697,698,1214,2018,...,1559,1653,1558,1605,1511,1562,1660,1512,1604,Class
5897,253.0,253.0,224.0,253.0,253.0,253.0,254.0,254.0,255.0,44.0,...,255.0,255.0,255.0,255.0,255.0,238.0,254.0,255.0,255.0,four
5353,34.0,33.0,32.0,34.0,33.0,32.0,34.0,34.0,29.0,9.0,...,32.0,37.0,34.0,36.0,27.0,26.0,35.0,26.0,38.0,one
9409,89.0,109.0,95.0,107.0,144.0,109.0,80.0,90.0,140.0,46.0,...,193.0,121.0,181.0,174.0,162.0,164.0,186.0,184.0,174.0,eight
995,192.0,205.0,167.0,186.0,207.0,187.0,207.0,211.0,190.0,24.0,...,171.0,203.0,104.0,161.0,100.0,182.0,194.0,181.0,128.0,two
8813,118.0,117.0,117.0,117.0,113.0,119.0,104.0,100.0,101.0,25.0,...,120.0,104.0,119.0,120.0,116.0,112.0,100.0,119.0,115.0,one


In [109]:
strat_test_setB.head()

Unnamed: 0,744,745,792,742,747,741,697,698,1214,2018,...,1715,1666,1654,1560,1561,1606,1559,1611,1653,Class
5901,220.0,217.0,223.0,206.0,194.0,191.0,148.0,141.0,157.0,44.0,...,82.0,57.0,209.0,216.0,213.0,217.0,216.0,215.0,198.0,four
5394,142.0,143.0,143.0,143.0,139.0,142.0,142.0,139.0,145.0,25.0,...,48.0,66.0,136.0,139.0,141.0,138.0,138.0,133.0,137.0,one
9375,147.0,138.0,154.0,71.0,62.0,37.0,117.0,81.0,60.0,57.0,...,120.0,119.0,127.0,139.0,121.0,54.0,126.0,91.0,114.0,eight
957,123.0,123.0,120.0,123.0,120.0,119.0,114.0,114.0,118.0,63.0,...,53.0,43.0,112.0,119.0,123.0,115.0,114.0,116.0,110.0,two
8695,81.0,80.0,80.0,81.0,78.0,82.0,73.0,71.0,83.0,121.0,...,56.0,55.0,64.0,69.0,64.0,66.0,71.0,63.0,63.0,one


In [110]:
strat_test_setD.head()

Unnamed: 0,792,742,747,741,697,698,1214,2068,1869,1119,...,1511,1510,706,1505,707,1553,560,1647,2174,Class
5936,34.0,33.0,33.0,33.0,34.0,34.0,32.0,14.0,17.0,32.0,...,33.0,33.0,12.0,31.0,14.0,29.0,19.0,11.0,14.0,four
5274,154.0,156.0,153.0,158.0,155.0,154.0,149.0,41.0,254.0,152.0,...,144.0,127.0,78.0,45.0,64.0,83.0,51.0,100.0,57.0,one
9229,70.0,27.0,38.0,33.0,49.0,42.0,34.0,22.0,25.0,34.0,...,57.0,53.0,27.0,79.0,25.0,84.0,22.0,67.0,31.0,eight
907,229.0,255.0,255.0,254.0,255.0,255.0,255.0,38.0,113.0,252.0,...,255.0,253.0,170.0,254.0,106.0,255.0,76.0,122.0,61.0,two
8814,90.0,91.0,77.0,90.0,71.0,68.0,93.0,75.0,49.0,97.0,...,83.0,72.0,63.0,79.0,62.0,83.0,61.0,65.0,73.0,one


In [111]:
strat_test_setD3.head()

Unnamed: 0,792,742,747,741,697,698,1214,2068,1869,1119,...,1511,1510,706,1505,707,1553,560,1647,2174,Class
6001,255.0,254.0,255.0,252.0,239.0,238.0,255.0,26.0,68.0,254.0,...,255.0,255.0,149.0,253.0,169.0,251.0,173.0,154.0,42.0,four
5257,137.0,99.0,111.0,97.0,102.0,102.0,99.0,167.0,165.0,92.0,...,106.0,117.0,145.0,155.0,169.0,142.0,183.0,112.0,157.0,one
9250,84.0,51.0,88.0,60.0,47.0,65.0,57.0,40.0,35.0,58.0,...,192.0,167.0,43.0,168.0,41.0,194.0,39.0,191.0,63.0,eight
861,155.0,129.0,158.0,108.0,122.0,126.0,46.0,57.0,29.0,78.0,...,141.0,141.0,29.0,67.0,47.0,42.0,64.0,74.0,32.0,two
8825,133.0,130.0,132.0,129.0,132.0,133.0,132.0,57.0,55.0,129.0,...,128.0,117.0,34.0,35.0,33.0,86.0,34.0,98.0,54.0,one


In [112]:
strat_train_set.to_csv('Top_5_train_set_nom.csv', index = False)

In [113]:
strat_test_set.to_csv('Top_5_test_set_nom.csv', index = False)

In [114]:
strat_train_setB.to_csv('Top_10_train_set_nom.csv', index = False)

In [115]:
strat_test_setB.to_csv('Top_10_test_set_nom.csv', index = False)

In [116]:
strat_train_setC.to_csv('Top_20_train_set_nom.csv', index = False)

In [117]:
strat_test_setC.to_csv('Top_20_test_set_nom.csv', index = False)

In [118]:
strat_train_setD.to_csv('Top_20_Explo_train_set_nom.csv', index = False)

In [119]:
strat_train_setD2.to_csv('Top_20_Explo_train_set_nom_2.csv', index = False)

In [120]:
strat_train_setD3.to_csv('Top_20_Explo_train_set_nom_3.csv', index = False)

In [121]:
strat_test_setD3.to_csv('Top_20_Explo_test_set_nom_3.csv', index = False)

In [122]:
strat_test_setD2.to_csv('Top_20_Explo_test_set_nom_2.csv', index = False)

In [123]:
strat_test_setD.to_csv('Top_20_Explo_test_set_nom.csv', index = False)

In [124]:
train_setD.to_csv('Top_20_Explo_train_set.csv', index = False)

In [125]:
test_setD.to_csv('Top_20_Explo_test_set.csv', index = False)
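All exports above use `index=False`, so the shuffled row labels are not written to disk. A quick round trip can confirm that the pixel columns and the nominal `Class` column survive the export. The sketch below uses a made-up two-row frame and a temporary directory. The file name `demo_train_set_nom.csv` is hypothetical and not one of the coursework files.

```python
import os
import tempfile
import pandas as pd

# Tiny made-up frame with one pixel column and nominal class labels
demo = pd.DataFrame({"744": [15.0, 147.0], "Class": ["three", "two"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_train_set_nom.csv")
    # index=False: only the data columns are written, not the row index
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print(reloaded.shape)
```

After reloading, the rows are numbered from 0 again, so the original row labels shown in the `head()` outputs cannot be recovered from the CSV files.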

In [126]:
XYraw = XYraw.sample(frac=1).reset_index(drop=True)