# CV CW1

The data set for the coursework is a sample from Stallkamp et al.'s German Traffic Sign Recognition
Benchmark (GTSRB). The original data set consists of 39,209 RGB training images and 12,630 RGB
test images of different sizes, showing 43 different types of German traffic signs. The images are
not centred and were taken at different times of the day.

This data set is considered an important benchmark for Computer Vision and is closely related
to the street sign recognition tasks that autonomous cars have to perform. The safe deployment of
autonomous cars is the next big challenge that researchers and engineers face.

You will be working with a sample of this data set which consists of 10 classes and 9,690 images. The
images have been converted to grey-scale with pixel values ranging from 0 to 255 and rescaled
to a common size of 48 × 48 pixels. Hence, each row in the data set represents a single image in
row-vector format (2304 pixel features) plus its associated class label, giving 2305 columns in
total. We changed the class labels from the original data set so that the classes we use are now
labelled from 0 to 9. Compensating for the lighting conditions and the position of the signs is not
necessary for the coursework and is left to the interested student.
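
To make the row-vector layout concrete, the sketch below turns one 2304-value row back into a 48 × 48 image. The row here is synthetic, and the row-major pixel order is an assumption, since the description does not state how the images were flattened.

```python
import numpy as np

IMAGE_SIDE = 48  # images are 48 x 48 pixels, per the description above

# Synthetic stand-in for one data row: 2304 grey-scale values in [0, 255].
rng = np.random.default_rng(0)
row = rng.integers(0, 256, size=IMAGE_SIDE * IMAGE_SIDE).astype(float)

# Assumption: pixels are stored row by row (row-major order).
image = row.reshape(IMAGE_SIDE, IMAGE_SIDE)

print(row.shape)    # (2304,)
print(image.shape)  # (48, 48)
```

With a real row, `plt.imshow(image, cmap="gray")` would display the sign, provided the ordering assumption holds.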

Below, the class labels and their meanings are displayed:

| Class label | Meaning |
|---|---|
| 0 | speed limit 20 |
| 1 | speed limit 30 |
| 2 | speed limit 50 |
| 3 | speed limit 60 |
| 4 | speed limit 70 |
| 5 | left turn |
| 6 | right turn |
| 7 | beware pedestrian crossing |
| 8 | beware children |
| 9 | beware cycle route ahead |

In [2]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)


In [3]:
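# Scikit-Learn ≥0.20 is required
import sklearn
# Compare the numeric (major, minor) parts; comparing raw strings would rank "0.9" above "0.20"
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)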



In [4]:
# Common imports
import numpy as np
import os
import tarfile
import urllib
import pandas as pd

In [5]:
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

In [6]:
# read the attribute CSV files into their respective dataframes

X = pd.read_csv('x_train_gr_smpl.csv')
Xbin = pd.read_csv('x_train_smpl_bin.csv')
print(X)
print(Xbin)

          0      1      2      3      4      5      6      7      8      9  \
0      78.0   77.0   76.0   82.0   87.0   92.0  104.0  119.0  117.0  120.0   
1      73.0   75.0   79.0   78.0   76.0   75.0   89.0  107.0  133.0  125.0   
2      72.0   75.0   79.0   77.0   81.0   89.0  105.0  109.0   86.0   90.0   
3      67.0   70.0   74.0   80.0   93.0  107.0  110.0   96.0   69.0  100.0   
4      74.0   74.0   73.0   72.0   77.0   87.0  104.0  109.0   84.0   83.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  247.0  220.0  254.0  213.0  129.0  208.0  254.0  255.0  255.0  255.0   
9686  151.0  118.0  254.0  255.0  255.0  255.0  254.0  254.0  254.0  252.0   
9687  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0   
9688  255.0  253.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0   
9689  252.0  189.0  238.0  255.0  255.0  245.0  219.0  212.0  140.0   40.0   

      ...   2294  2295  2296   2297  2298  2299  2300   2301   
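
The later cells select columns with string names such as `'744'`. That works because `read_csv` takes column names from the header row as strings, even when they look like numbers. A minimal sketch with a tiny in-memory CSV (made up for illustration, not the coursework file):

```python
import io
import pandas as pd

# Tiny in-memory CSV with a numeric-looking header row, mimicking the pixel files.
csv_text = "0,1,2\n78.0,77.0,76.0\n73.0,75.0,79.0\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())  # ['0', '1', '2']
print(df["1"].tolist())     # [77.0, 75.0]
```

Indexing with the integer `1` instead of the string `"1"` would raise a `KeyError` on this frame.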

In [7]:
# read the class attribute file into the Y dataframe
Y = pd.read_csv('y_train_smpl.csv')
Y.columns = ['Class']
print(Y)

      Class
0         0
1         0
2         0
3         0
4         0
...     ...
9685      9
9686      9
9687      9
9688      9
9689      9

[9690 rows x 1 columns]


In [8]:
# join the two so that Y becomes the last column

XYraw = pd.concat([X, Y], axis=1)

# check the concatenation
print(XYraw)



          0      1      2      3      4      5      6      7      8      9  \
0      78.0   77.0   76.0   82.0   87.0   92.0  104.0  119.0  117.0  120.0   
1      73.0   75.0   79.0   78.0   76.0   75.0   89.0  107.0  133.0  125.0   
2      72.0   75.0   79.0   77.0   81.0   89.0  105.0  109.0   86.0   90.0   
3      67.0   70.0   74.0   80.0   93.0  107.0  110.0   96.0   69.0  100.0   
4      74.0   74.0   73.0   72.0   77.0   87.0  104.0  109.0   84.0   83.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  247.0  220.0  254.0  213.0  129.0  208.0  254.0  255.0  255.0  255.0   
9686  151.0  118.0  254.0  255.0  255.0  255.0  254.0  254.0  254.0  252.0   
9687  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0   
9688  255.0  253.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0   
9689  252.0  189.0  238.0  255.0  255.0  245.0  219.0  212.0  140.0   40.0   

      ...  2295  2296   2297  2298  2299  2300   2301   2302   

In [9]:
#explore the header
XYraw.head()

       0      1      2      3      4      5      6      7      8      9  ...   2295   2296   2297   2298   2299   2300   2301   2302   2303  Class
0   78.0   77.0   76.0   82.0   87.0   92.0  104.0  119.0  117.0  120.0  ...   79.0   72.0   76.0   83.0   95.0   99.0   98.0   95.0   94.0      0
1   73.0   75.0   79.0   78.0   76.0   75.0   89.0  107.0  133.0  125.0  ...   93.0   85.0   77.0   69.0   73.0   83.0  100.0  101.0  101.0      0
2   72.0   75.0   79.0   77.0   81.0   89.0  105.0  109.0   86.0   90.0  ...   95.0   88.0   80.0   73.0   71.0   74.0   80.0   89.0   95.0      0
3   67.0   70.0   74.0   80.0   93.0  107.0  110.0   96.0   69.0  100.0  ...   92.0   87.0   82.0   77.0   72.0   70.0   72.0   81.0   88.0      0
4   74.0   74.0   73.0   72.0   77.0   87.0  104.0  109.0   84.0   83.0  ...   98.0   99.0  100.0   99.0   89.0   78.0   66.0   68.0   72.0      0


In [10]:
# general info on attributes 
XYraw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9690 entries, 0 to 9689
Columns: 2305 entries, 0 to Class
dtypes: float64(2304), int64(1)
memory usage: 170.4 MB


In [11]:
# number of instances for each class label
XYraw["Class"].value_counts()

2    2250
1    2220
4    1980
3    1410
8     540
6     360
9     270
7     240
5     210
0     210
Name: Class, dtype: int64
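
The counts above show a marked class imbalance. As a rough illustration, the sketch below recomputes each class's share and the ratio between the largest and smallest class from the printed counts. The numbers are copied by hand from the output above, not re-read from the CSV.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series(
    {2: 2250, 1: 2220, 4: 1980, 3: 1410, 8: 540,
     6: 360, 9: 270, 7: 240, 5: 210, 0: 210},
    name="Class",
)

share = (counts / counts.sum() * 100).round(1)  # percentage of instances per class
imbalance_ratio = counts.max() / counts.min()   # largest class vs smallest class

print(share)
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

A ratio this far from 1 suggests that plain accuracy may be dominated by the speed-limit classes, so per-class metrics are worth checking in later analyses.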

In [12]:
print(XYraw.shape)

(9690, 2305)


In [13]:
print(X.shape)

(9690, 2304)


In [14]:
# check for NaNs
anyNans = XYraw.isnull().sum().sum()
#print NaN count
print('\nNaN Count : ' + str(anyNans)) 


NaN Count : 0


In [54]:
# generate a dataframe with the top 20 attributes per class label, as selected using Weka
# (attributes picked for several classes appear more than once in the list)

Top_20_All = XYraw[['744',
'745',
'792',
'742',
'747',
'741',
'697',
'698',
'1214',
'2018',
'2068',
'1869',
'1119',
'648',
'2062',
'2019',
'1676',
'2111',
'1120',
'1820','1168',
'1167',
'1120',
'1215',
'1216',
'1119',
'1263',
'1073',
'1264',
'1072',
'1121',
'1166',
'1214',
'1262',
'1074',
'697',
'648',
'744',
'698',
'743',
'1609',
'1656',
'1610',
'1608',
'1658',
'746',
'1314',
'794',
'1315',
'793',
'1561',
'747',
'1311',
'1312',
'1313',
'841',
'1611',
'1316',
'795',
'1267',
'1609',
'1608',
'1656',
'1657',
'1265',
'1313',
'1655',
'1264',
'1312',
'1607',
'1560',
'1658',
'1610',
'1654',
'1561',
'1266',
'1314',
'1704',
'1705',
'1263',
'983',
'1031',
'935',
'982',
'1363',
'2138',
'2268',
'1030',
'2165',
'2024',
'2281',
'1973',
'934',
'2187',
'2069',
'1316',
'2269',
'2209',
'2077',
'2029',
'1509',
'1461',
'1462',
'1508',
'1460',
'744',
'745',
'746',
'1510',
'1557',
'1264',
'1507',
'1556',
'1555',
'1459',
'1263',
'1519',
'1472',
'1216',
'1471',
'793',
'746',
'745',
'744',
'794',
'743',
'792',
'795',
'1264',
'747',
'697',
'742',
'841',
'696',
'1312',
'791',
'1265',
'1263',
'695',
'1311',
'1657',
'1656',
'1608',
'1705',
'1704',
'1703',
'1753',
'1752',
'1601',
'1650',
'1663',
'1662',
'1751',
'1649',
'1600',
'1743',
'1648',
'1651',
'1713',
'1615',
'1695',
'1714',
'1173',
'1125',
'1647',
'1126',
'1713',
'1174',
'1715',
'1666',
'1078',
'1696',
'1743',
'1221',
'1077',
'1472',
'1762',
'1694',
'1742',
'1665',
'1655',
'1654',
'1560',
'1561',
'1606',
'1607',
'1610',
'1559',
'1611',
'1653',
'1558',
'1605',
'1511',
'1562',
'1660',
'1512',
'1604',
'1462',
'1600',
'1601',


                    'Class'
]]

In [55]:
# confirm 201 columns: top 20 attributes x 10 class labels + the class attribute
print(Top_20_All.shape)

(9690, 201)


In [56]:
# remove any duplicate columns

Top_20_WithoutDuplicates = Top_20_All.loc[:,~Top_20_All.T.duplicated(keep='first')]


In [57]:
# confirm removal of duplicate columns

print(Top_20_WithoutDuplicates.shape)

(9690, 140)
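
The transpose-based filter compares column values, which means transposing the whole frame. Because the duplicates here come from listing the same attribute name for several classes, one alternative is to de-duplicate the names before indexing. The sketch below uses a toy frame with made-up values; it is not the notebook's own method.

```python
import pandas as pd

# Toy frame standing in for XYraw (hypothetical values).
df = pd.DataFrame({"744": [1.0, 2.0], "745": [3.0, 4.0],
                   "792": [5.0, 6.0], "Class": [0, 1]})

# Per-class picks concatenated into one list; "744" and "745" were chosen twice.
selected = ["744", "745", "792", "744", "745"]

# dict.fromkeys keeps the first occurrence of each name, in the original order.
unique_selected = list(dict.fromkeys(selected))

subset = df[unique_selected + ["Class"]]
print(subset.columns.tolist())  # ['744', '745', '792', 'Class']
```

The two approaches differ in one case: two distinct pixels whose values are identical in every row would be merged by the value-based filter but kept by the name-based one.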


In [69]:
#randomize

Top_20_WithoutDuplicates_randomized = Top_20_WithoutDuplicates.sample(frac=1).reset_index(drop=True)
print(Top_20_WithoutDuplicates_randomized)

        744    745    792    742    747    741    697    698   1214   2018  \
0     134.0  132.0  124.0  135.0  132.0  134.0  137.0  136.0   55.0   28.0   
1      45.0   47.0   60.0   45.0   39.0   41.0   31.0   30.0   35.0   20.0   
2     147.0  148.0  146.0  150.0  147.0  151.0  144.0  143.0  128.0   25.0   
3     123.0  123.0  122.0  123.0  123.0  122.0  123.0  123.0  120.0   60.0   
4      75.0   75.0   75.0   77.0   43.0   76.0   75.0   76.0   77.0   37.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  128.0   60.0  187.0  199.0   64.0   95.0   52.0   55.0   63.0   58.0   
9686  118.0  121.0   97.0  110.0  119.0  108.0  125.0  124.0   90.0   28.0   
9687   85.0   82.0   93.0   70.0   71.0   63.0   60.0   55.0   68.0   44.0   
9688  139.0  150.0  197.0  115.0  128.0  107.0  109.0  107.0  135.0  126.0   
9689  255.0  255.0  253.0  255.0  255.0  255.0  255.0  255.0  255.0   41.0   

      ...   1559   1653   1558   1605   1511   1562   1660   15

In [73]:
# save the randomized dataframe as csv

Top_20_WithoutDuplicates_randomized.to_csv('Top_20_random3.csv', index=False)
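
The shuffle above draws a different row order on every run, so the exported file cannot be regenerated exactly. If reproducibility matters, a fixed seed can be passed to `sample`. A minimal sketch on a toy frame, with an arbitrary seed value:

```python
import pandas as pd

# Toy frame standing in for the de-duplicated attribute table.
df = pd.DataFrame({"744": [10.0, 20.0, 30.0, 40.0], "Class": [0, 0, 1, 1]})

SEED = 42  # arbitrary value chosen for illustration

# frac=1 samples every row exactly once, i.e. a permutation of the rows.
shuffled = df.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_again = df.sample(frac=1, random_state=SEED).reset_index(drop=True)

print(shuffled.equals(shuffled_again))  # True: same seed, same order
```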

In [72]:
#create dataframe with top 5 attributes for each class label 

Top_5_All = XYraw[[
    '744', '745', '792', '742', '747',
    '1168', '1167', '1120', '1215', '1216',
    '1609', '1656', '1610', '1608', '1658',
    '1609', '1608', '1656', '1657', '1265',
    '983', '1031', '935', '982', '1363',
    '1509', '1461', '1462', '1508', '1460',
    '793', '746', '745', '744', '794',
    '1657', '1656', '1608', '1705', '1704',
    '1695', '1714', '1173', '1125', '1647',
    '1655', '1654', '1560', '1561', '1606',
    'Class'
]]

In [75]:
# confirm selection
print(Top_5_All.shape)

(9690, 51)


In [76]:
#remove duplicates
Top_5_WithoutDuplicates = Top_5_All.loc[:,~Top_5_All.T.duplicated(keep='first')]

In [77]:
#confirm duplicate removal
print(Top_5_WithoutDuplicates.shape)

(9690, 43)
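
The drop from 51 to 43 columns can be cross-checked without touching the data, by counting how often each attribute name was listed. The names below are copied from the `Top_5_All` cell; treating each run of five as one class's picks is an assumption based on the cell's comment.

```python
from collections import Counter

# Attribute names copied from the Top_5_All cell, in groups of five.
# Assumption: each group holds the picks for one class label.
groups = [
    ["744", "745", "792", "742", "747"],
    ["1168", "1167", "1120", "1215", "1216"],
    ["1609", "1656", "1610", "1608", "1658"],
    ["1609", "1608", "1656", "1657", "1265"],
    ["983", "1031", "935", "982", "1363"],
    ["1509", "1461", "1462", "1508", "1460"],
    ["793", "746", "745", "744", "794"],
    ["1657", "1656", "1608", "1705", "1704"],
    ["1695", "1714", "1173", "1125", "1647"],
    ["1655", "1654", "1560", "1561", "1606"],
]

listed = [name for group in groups for name in group]
times_listed = Counter(listed)
shared = {name: n for name, n in times_listed.items() if n > 1}

print(len(listed), len(times_listed))  # 50 42
print(shared)
```

The 42 distinct names plus the `Class` column match the 43 columns reported above.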


In [86]:
#randomize
Top_5_WithoutDuplicates_randomized = Top_5_WithoutDuplicates.sample(frac=1).reset_index(drop=True)
print(Top_5_WithoutDuplicates_randomized)

        744    745    792    742    747   1168   1167   1120   1215   1216  \
0     255.0  255.0  255.0  255.0  255.0  254.0  255.0  253.0  255.0  255.0   
1     128.0  128.0  127.0  124.0  125.0  124.0  125.0  124.0  124.0  123.0   
2      68.0   68.0   67.0   67.0   67.0   70.0   70.0   69.0   68.0   68.0   
3      67.0   67.0   58.0   65.0   83.0   50.0   49.0   49.0   50.0   52.0   
4     122.0  120.0  174.0  113.0  107.0  238.0  207.0  226.0  223.0  243.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685   23.0   22.0   36.0   21.0   21.0   47.0   32.0   44.0   34.0   49.0   
9686   94.0   94.0   95.0   97.0   94.0   89.0   95.0   88.0   95.0   91.0   
9687  106.0   68.0  145.0   73.0   54.0   97.0  107.0  104.0  108.0   77.0   
9688  121.0  120.0  110.0  121.0  118.0   84.0   82.0   71.0   94.0   96.0   
9689   58.0   56.0   60.0   57.0   40.0   51.0   59.0   54.0   58.0   44.0   

      ...   1714   1173   1125   1647   1655   1654   1560   15

In [87]:
# save the randomized dataframe as csv for further Weka analyses

Top_5_WithoutDuplicates_randomized.to_csv('Top_5_random3.csv', index=False)

In [88]:
#create dataframe with top 10 attributes for each class label

Top_10_All = XYraw[[
    '744', '745', '792', '742', '747', '741', '697', '698', '1214', '2018',
    '1168', '1167', '1120', '1215', '1216', '1119', '1263', '1073', '1264', '1072',
    '1609', '1656', '1610', '1608', '1658', '746', '1314', '794', '1315', '793',
    '1609', '1608', '1656', '1657', '1265', '1313', '1655', '1264', '1312', '1607',
    '983', '1031', '935', '982', '1363', '2138', '2268', '1030', '2165', '2024',
    '1509', '1461', '1462', '1508', '1460', '744', '745', '746', '1510', '1557',
    '793', '746', '745', '744', '794', '743', '792', '795', '1264', '747',
    '1657', '1656', '1608', '1705', '1704', '1703', '1753', '1752', '1601', '1650',
    '1695', '1714', '1173', '1125', '1647', '1126', '1713', '1174', '1715', '1666',
    '1655', '1654', '1560', '1561', '1606', '1607', '1610', '1559', '1611', '1653',
    'Class'
]]

In [89]:
# confirm the attribute selection
print(Top_10_All.shape)

(9690, 101)


In [92]:
# remove duplicates

Top_10_WithoutDuplicates = Top_10_All.loc[:,~Top_10_All.T.duplicated(keep='first')]
print(Top_10_WithoutDuplicates)

        744    745    792    742    747    741    697    698   1214   2018  \
0     106.0  103.0  180.0  103.0   87.0   95.0   85.0   86.0  168.0  123.0   
1     164.0  163.0  221.0  123.0  138.0  108.0  114.0  108.0   95.0  122.0   
2     197.0  198.0  234.0  196.0  175.0  168.0  145.0  146.0  147.0  118.0   
3     237.0  233.0  244.0  235.0  205.0  214.0  209.0  192.0  167.0  136.0   
4      85.0   85.0  148.0   78.0   79.0   76.0   71.0   71.0  162.0  169.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685   82.0   33.0  100.0   96.0   19.0   23.0   23.0   19.0   21.0   59.0   
9686   96.0   38.0  102.0  104.0   19.0   62.0   28.0   18.0   19.0   40.0   
9687   92.0   34.0  101.0  101.0   18.0   68.0   26.0   18.0   17.0   41.0   
9688   94.0   38.0   92.0   94.0   16.0   60.0   27.0   16.0   16.0   35.0   
9689   83.0   26.0   88.0   91.0   15.0   52.0   19.0   15.0   14.0   37.0   

      ...   1715   1666   1654   1560   1561   1606   1559   16

In [93]:
#confirm duplicate removal 
print(Top_10_WithoutDuplicates.shape)

(9690, 80)


In [104]:
#randomize
Top_10_WithoutDuplicates_randomized = Top_10_WithoutDuplicates.sample(frac=1).reset_index(drop=True)
print(Top_10_WithoutDuplicates_randomized)

        744    745    792    742    747    741    697    698   1214   2018  \
0     203.0  172.0  223.0  231.0  159.0  204.0  160.0  156.0  113.0  199.0   
1     115.0  113.0  114.0  110.0  107.0  105.0   96.0   92.0   90.0   43.0   
2      19.0   14.0   28.0   39.0   22.0   31.0   15.0   20.0   18.0   13.0   
3     255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  136.0   
4     255.0  255.0  255.0  255.0  230.0  255.0  192.0  206.0  251.0   35.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685   66.0   66.0   72.0   65.0   62.0   62.0   59.0   59.0   41.0  145.0   
9686   38.0   38.0   37.0   36.0   37.0   38.0   38.0   38.0   36.0   19.0   
9687  157.0  155.0  156.0  161.0  150.0  159.0  155.0  154.0  162.0  106.0   
9688  162.0  165.0  184.0  155.0  148.0  143.0  136.0  128.0  120.0   79.0   
9689  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0  255.0   48.0   

      ...   1715   1666   1654   1560   1561   1606   1559   16

In [105]:
# save the randomized dataframe as csv for further Weka analyses
Top_10_WithoutDuplicates_randomized.to_csv('Top_10_random2.csv', index=False)

In [106]:
# create dataframe with the top 20 attributes per class label from the exploratory feature selection analyses

Top_20_Exploratory = XYraw[['792',
'742',
'747',
'741',
'697',
'698',
'1214',
'2068',
'1869',
'1119',
'648',
'1676',
'2111',
'1120',
'2067',
'2260',
'2162',
'2301',
'2212',
'2210',
                            '1168',
'1167',
'1120',
'1263',
'1073',
'1072',
'1121',
'1074',
'1310',
'599',
    '1314',
'1315',
'1316',
'1268',
'1264',
'1030',
'1364',
'981',
'982',
'1317',
'1269',
   '1265',
'1313',
'1312',
'1266',
'1217',
'1361',
'1218',
'1219',
'1170',
'1171',
'1220',
'1172',
       '983',
'1031',
'935',
'982',
'1030',
'934',
      '1509',
'1461',
'1462',
'1508',
'1460',
'745',
'746',
'1507',
'1459',
'1519',
'1471',
'1458',
'794',
'1792',
'1126',
'1793',
'1173',
'1842',
'1840',
'1843',
    '793',
'746',
'745',
'794',
'795',
'1264',
'1134',
'842',
'1086',
'1168',
'1085',
'1037',
'1135',
'843',
'1234',
    '1601',
'1663',
'1649',
'1600',
'1743',
'1648',
'1713',
'1714',
'1744',
'1518',
'1507',
'1695',
'1470',
'1519',
'1742',
'1760',
'1459',
'1506',
'1761',
'1517',
     '1695',
'1714',
'1173',
'1125',
'1713',
'1715',
'1666',
'1743',
'1762',
'1742',
'1667',
'1471',
     '1606',
'1558',
'1511',
'1462',
'1600',
'1601',
'1510',
'1649',
'1648',
'706',
'1505',
'707',
'1553',
'560',
'1506',
'1125',
'1647',
'1030',
'1172',
'2174',
  
                            'Class'
]]

In [107]:
# confirm selection - note: not every class label has 20 attributes selected
print(Top_20_Exploratory.shape)

(9690, 147)


In [108]:
#remove duplicates
Top_20_Exploratory_WithoutDuplicates = Top_20_Exploratory.loc[:,~Top_20_Exploratory.T.duplicated(keep='first')]
print(Top_20_Exploratory_WithoutDuplicates)

        792    742    747    741    697    698   1214   2068  1869   1119  \
0     180.0  103.0   87.0   95.0   85.0   86.0  168.0  117.0  79.0  202.0   
1     221.0  123.0  138.0  108.0  114.0  108.0   95.0  137.0  79.0  117.0   
2     234.0  196.0  175.0  168.0  145.0  146.0  147.0  127.0  79.0  201.0   
3     244.0  235.0  205.0  214.0  209.0  192.0  167.0  121.0  80.0  199.0   
4     148.0   78.0   79.0   76.0   71.0   71.0  162.0  135.0  75.0  163.0   
...     ...    ...    ...    ...    ...    ...    ...    ...   ...    ...   
9685  100.0   96.0   19.0   23.0   23.0   19.0   21.0   56.0  30.0   20.0   
9686  102.0  104.0   19.0   62.0   28.0   18.0   19.0   55.0  36.0   18.0   
9687  101.0  101.0   18.0   68.0   26.0   18.0   17.0   49.0  32.0   17.0   
9688   92.0   94.0   16.0   60.0   27.0   16.0   16.0   40.0  27.0   16.0   
9689   88.0   91.0   15.0   52.0   19.0   15.0   14.0   27.0  28.0   15.0   

      ...   1511   1510    706   1505    707   1553    560   1647   2174  \

In [112]:
# randomize
Top_20_Exploratory_WithoutDuplicates_randomized = Top_20_Exploratory_WithoutDuplicates.sample(frac=1).reset_index(drop=True)
print(Top_20_Exploratory_WithoutDuplicates_randomized)

        792    742    747    741    697    698   1214   2068   1869   1119  \
0     142.0  136.0  141.0  130.0  140.0  139.0   79.0  156.0   67.0  123.0   
1     255.0  169.0  123.0  113.0  241.0  156.0  116.0  255.0  255.0   92.0   
2     255.0  255.0  255.0  255.0  245.0  245.0  186.0   24.0   43.0  255.0   
3     215.0  162.0  169.0  153.0  132.0  133.0  250.0  140.0  241.0  247.0   
4     156.0  148.0  157.0  143.0  154.0  156.0   69.0  120.0   25.0   65.0   
...     ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
9685  255.0  255.0  255.0  255.0  253.0  253.0  255.0   83.0   80.0  251.0   
9686   94.0   91.0   90.0   90.0   92.0   91.0   72.0   81.0   32.0   87.0   
9687  253.0  255.0  255.0  253.0  255.0  255.0  167.0   75.0   88.0  212.0   
9688   60.0   55.0   52.0   53.0   44.0   41.0   50.0   18.0   17.0   50.0   
9689   30.0   26.0   41.0   24.0   26.0   38.0   46.0   29.0   45.0   43.0   

      ...   1511   1510    706   1505    707   1553    560   16

In [113]:
# save dataframe as csv for weka analyses
Top_20_Exploratory_WithoutDuplicates_randomized.to_csv('Top_20_Exploratory_random.csv', index = False)
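
The selected attribute names are flattened pixel indices, so they can be mapped back to image coordinates to see which regions of the sign the selection favours. The helper below is hypothetical and assumes row-major flattening of the 48 × 48 images, which the data description does not confirm.

```python
IMAGE_SIDE = 48  # images are 48 x 48, so column names run from 0 to 2303


def pixel_position(column_name, side=IMAGE_SIDE):
    """Return (row, col) for a flattened pixel index, assuming row-major order."""
    return divmod(int(column_name), side)


# A few attribute names taken from the selections above.
for name in ["744", "792", "1168", "1656"]:
    print(name, pixel_position(name))
```

Under that assumption, attribute `744` corresponds to row 15, column 24, close to the horizontal centre of the upper third of the image.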