**Overview:**
The data contains features extracted from the silhouettes of vehicles viewed from different angles. 
Four "Corgie" model vehicles were used for the experiment: a double decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. 
This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily 
distinguishable, but that it would be more difficult to distinguish between the two cars.

**Objective:**
The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. 
The vehicle may be viewed from one of many different angles.

In [None]:
%matplotlib inline

# Numerical libraries
import numpy as np  

# to handle data in form of rows and columns 
import pandas as pd  

# preprocessing
from sklearn.preprocessing import StandardScaler


# scikit-learn's function for randomly splitting data into train and test sets
from sklearn.model_selection import train_test_split

# calculate accuracy measures and confusion matrix
from sklearn import metrics  

# importing plotting libraries
import matplotlib.pyplot as plt   

#importing seaborn for statistical plots
import seaborn as sns

# Label encoder 
from sklearn.preprocessing import LabelEncoder

# Support Vector Classifier
from sklearn.svm import SVC

# PCA Related
from sklearn.decomposition import PCA

# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score, confusion_matrix

# Cross Validation related
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score




In [None]:
#load the csv file and make the data frame
df = pd.read_csv('/kaggle/input/vehicle/vehicle.csv')

In [None]:
df.head()

## EDA

In [None]:
df.shape
# 846 rows, 19 columns

In [None]:
# Check data type and other imp information of each column
df.info()

In [None]:
# All fields are numeric except class, so no data type conversion is needed
# There are missing values in many columns, e.g. circularity, distance_circularity, radius_ratio

In [None]:
# Explore distribution of vehicle in each class
df['class'].value_counts()

In [None]:
# There are almost twice as many cars as buses or vans. Van is the smallest class

In [None]:
#Label encode the target class
labelencoder = LabelEncoder()
df['class'] = labelencoder.fit_transform(df['class'])
df['class'].value_counts()

In [None]:
#1-car
#0-bus
#2-van
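
The mapping noted above follows from how `LabelEncoder` assigns codes: it sorts the distinct labels and numbers them in that order. A minimal sketch, using a made-up list of labels rather than the real `class` column, shows how to read the mapping back from the fitted encoder instead of writing it down by hand:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'class' column (illustrative values only)
labels = ["van", "car", "bus", "car", "van"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so the label at position i is encoded as integer i
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Keeping the fitted encoder around also allows `encoder.inverse_transform` to turn predicted integers back into class names.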

### Pairplot

In [None]:

sns.pairplot(df,diag_kind='kde',hue='class')

**Inferences:**
* The spread of compactness is smallest for vans. Mean compactness is highest for cars. For buses, compactness is right skewed, indicating that few buses have high compactness.
* Mean circularity is higher for cars
* Mean distance_circularity is also higher for cars
* Mean radius_ratio is highest for cars, followed by buses. It is lowest for vans
* pr.axis_aspect_ratio has almost the same distribution for cars, vans and buses
* max.length_aspect_ratio is almost the same for cars and vans, and lower for buses
* Mean scatter_ratio is highest for cars, followed by buses and vans
* Mean elongatedness is highest for vans, followed by buses and cars
* pr.axis_rectangularity is highest for cars, followed by buses and then vans
* The distribution of max.length_rectangularity is almost the same for cars, buses and vans
* Mean scaled_variance is highest for cars, followed by buses and then vans
* Mean scaled_variance.1 is highest for cars, followed by buses and then vans
* 'scaled_radius_of_gyration', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1' and 'skewness_about.2' have almost the same distribution for cars, buses and vans.
* 'hollows_ratio' is lower for buses than for cars and vans
* Many columns have long tails, indicating outliers
* pr.axis_aspect_ratio and radius_ratio have a strong positive relationship for vans; for cars and buses they vary within a small range
* scatter_ratio and scaled_variance.1 have an almost perfect positive linear relationship
* Many features are highly correlated, which indicates redundant dimensions; we will use PCA to reduce them

### Correlation & Heatmap

In [None]:
df.corr()

In [None]:
# Heatmap
#Correlation Matrix
corr = df.corr() # correlation matrix
lower_triangle = np.tril(corr, k = -1)  # select only the lower triangle of the correlation matrix
mask = lower_triangle == 0  # to mask the upper triangle in the following heatmap

plt.figure(figsize = (15,8))  # setting the figure size
sns.set_style(style = 'white')  # Setting it to white so that we do not see the grid lines
sns.heatmap(lower_triangle, center=0.5, cmap= 'Blues', annot= True, xticklabels = corr.index, yticklabels = corr.columns,
            cbar= False, linewidths= 1, mask = mask)   # Da Heatmap
plt.xticks(rotation = 50)   # Aesthetic purposes
plt.yticks(rotation = 20)   # Aesthetic purposes
plt.show()

**Inference from heat map:**

From the above correlation matrix we can see that many features are highly correlated. Looking carefully, scaled_variance.1 and scatter_ratio have a correlation of about 1, and many other pairs have a correlation above 0.9 in absolute value, e.g. skewness_about.2 and hollows_ratio, scaled_variance and scaled_variance.1, elongatedness and scaled_variance, elongatedness and scaled_variance.1.

There are a lot of dimensions with an absolute correlation above 0.7, and it is difficult to decide manually which dimensions to drop. We will use PCA to reduce the dimensionality instead.
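
As a complement to reading the heatmap by eye, the highly correlated pairs can be listed programmatically. The sketch below is one possible way to do it; the helper name `high_corr_pairs`, the 0.9 threshold and the synthetic data are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, corr) for pairs with |corr| >= threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # NaN entries (masked cells) never pass this comparison
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(v)) for (a, b), v in pairs.items()]

# Synthetic data: 'b' is nearly a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
result = high_corr_pairs(toy)
print(result)
```

On the vehicle data the same helper could be called on the feature columns of `df` to get the list of pairs that the heatmap shows in its darkest cells.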



In [None]:
df.describe().T

From the above table it is clear that there are missing values in many columns: circularity, distance_circularity, radius_ratio, pr.axis_aspect_ratio, scatter_ratio, elongatedness, pr.axis_rectangularity, scaled_variance, scaled_variance.1, scaled_radius_of_gyration, scaled_radius_of_gyration.1, skewness_about, skewness_about.1 and skewness_about.2.

In [None]:
#Columns having missing values
missing_values_cols=df.columns[df.isnull().any()]
# Number of missing values in each column
df[missing_values_cols].isnull().sum()

In [None]:
# List the rows that have a missing value in at least one column

df[df.isnull().any(axis=1)][missing_values_cols].head()

In [None]:
df[df.isnull().any(axis=1)][missing_values_cols].shape

There are a total of 33 rows with missing values in one or more of 14 columns.

### Missing Values Treatment: 
For each column, find the individual rows with missing values and then decide whether to drop or impute them.
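
The column-by-column treatment that follows replaces each missing value with the median of that feature within the row's class. A compact way to express the same idea is a grouped `transform`. This is a minimal sketch on a made-up frame (the values and the two-class setup are assumptions for illustration), not the procedure actually run in this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame: two classes, one missing value per feature (illustrative values)
toy = pd.DataFrame({
    "class": [0, 0, 0, 1, 1, 1],
    "circularity": [40.0, np.nan, 44.0, 50.0, 52.0, 54.0],
    "radius_ratio": [150.0, 160.0, 170.0, np.nan, 200.0, 220.0],
})
feature_cols = ["circularity", "radius_ratio"]

# Median of each feature within each class, broadcast back to every row
class_medians = toy.groupby("class")[feature_cols].transform("median")

# Fill only the missing cells; observed values are left unchanged
imputed = toy.copy()
imputed[feature_cols] = toy[feature_cols].fillna(class_medians)
print(imputed)
```

Rows with missing values in more than one column could be removed beforehand with a filter such as `toy[toy.isnull().sum(axis=1) <= 1]`, which mirrors the drop-or-impute rule used below.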

#### Missing Treatment Values for circularity

In [None]:
df[df['circularity'].isnull()][missing_values_cols]

In [None]:
# 5 rows have missing values for circularity. One of the 5 rows also has a missing value for distance_circularity.
# Another row has missing values for scaled_variance and skewness_about.1. One of the rows has a missing value for scaled_radius_of_gyration.
# We will drop the 3 rows that have a missing value in another column besides circularity
# and impute the missing value in the remaining 2 rows.

In [None]:
# Rows 105, 118 and 266 have missing values in more than one column; drop them
df.drop([105,118,266], inplace=True)

In [None]:
# Now let's check the class label of the remaining 2 rows; we will replace the value with the median of the corresponding class
df.loc[5].loc['class'],df.loc[396].loc['class']

In [None]:
# Belong to Bus Class
Median_circularity_bus=df['circularity'][df['class']==0].median()
Median_circularity_bus

In [None]:
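# Assign the result back: an inplace fillna on a selected column may not update df in recent pandas versions
df['circularity'] = df['circularity'].fillna(Median_circularity_bus)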

In [None]:
# Double check that the missing values have been treated for circularity
df[df['circularity'].isnull()][missing_values_cols]

In [None]:
# Missing value for Circularity treated

#### Missing Treatment Values for distance_circularity

In [None]:
df[df['distance_circularity'].isnull()][missing_values_cols]

In [None]:
# 3 rows have missing values. Row 207 has missing values in more than one column, so we will drop it
# Rows 35 and 319 have a missing value in just one column. We will fill it with the median of the corresponding class

In [None]:
df.drop(207, inplace=True)

In [None]:
df.shape

In [None]:
# Now let's check the class label of the remaining 2 rows; we will replace the value with the median of the corresponding class
df.loc[35].loc['class'],df.loc[319].loc['class']

In [None]:
Median_distance_circularity_van=df['distance_circularity'][df['class']==2].median()
Median_distance_circularity_bus=df['distance_circularity'][df['class']==0].median()
Median_distance_circularity_van,Median_distance_circularity_bus

In [None]:
df.loc[35]=df.loc[35].replace(np.nan,Median_distance_circularity_van)

In [None]:
df.loc[319]=df.loc[319].replace(np.nan,Median_distance_circularity_bus)

In [None]:
df.loc[[35,319]]

In [None]:
# Double Check if missing values have been handled
df[df['distance_circularity'].isnull()][missing_values_cols]



In [None]:
# Missing values handled for distance_circularity

#### Missing Treatment Values for radius_ratio

In [None]:
df[df['radius_ratio'].isnull()][missing_values_cols]

In [None]:
# In every row with a missing radius_ratio, radius_ratio is the only missing value; all other columns have values.
# We will not drop any of these rows; we will replace the value with the median of the corresponding class.

In [None]:
df.loc[[9,78,159,287,345,467]]['class']

In [None]:
# Lets find median value for car, bus,van
Median_distance_radius_ratio_van=df['radius_ratio'][df['class']==2].median()
Median_distance_radius_ratio_bus=df['radius_ratio'][df['class']==0].median()
Median_distance_radius_ratio_car=df['radius_ratio'][df['class']==1].median()
Median_distance_radius_ratio_van,Median_distance_radius_ratio_bus,Median_distance_radius_ratio_car

In [None]:
# replace rows 9,159 and 467 with car median, 78,345 with bus median and 287 with  van

In [None]:
df.loc[[9,159,467]]=df.loc[[9,159,467]].replace(np.nan,Median_distance_radius_ratio_car)

In [None]:
df.loc[[9,159,467]]

In [None]:
df.loc[[78,345 ]]=df.loc[[ 78,345 ]].replace(np.nan,Median_distance_radius_ratio_bus)

In [None]:
df.loc[[78,345 ]]

In [None]:
df.loc[287]=df.loc[287].replace(np.nan,Median_distance_radius_ratio_van)

In [None]:
df.loc[[287]]

#### Missing Treatment Values for pr.axis_aspect_ratio 

In [None]:
df[df['pr.axis_aspect_ratio'].isnull()][missing_values_cols]

In [None]:
# There are 2 rows with missing values. One row has a missing value in one more column in addition to pr.axis_aspect_ratio
# We will drop that row and treat the missing pr.axis_aspect_ratio in the other row with the median of the corresponding class

In [None]:
# drop row 222
df.drop(222, inplace=True)

In [None]:
df.loc[19]['class']

In [None]:

Median_distance_pr_axis_aspect_ratio_car=df['pr.axis_aspect_ratio'][df['class']==1].median()
Median_distance_pr_axis_aspect_ratio_car

In [None]:
df.loc[19]=df.loc[19].replace(np.nan,Median_distance_pr_axis_aspect_ratio_car)

In [None]:
df[df['pr.axis_aspect_ratio'].isnull()][missing_values_cols]

#### Missing Treatment Values for scatter_ratio

In [None]:
df[df['scatter_ratio'].isnull()][missing_values_cols]

In [None]:
# Only one row and 2 cols have missing value in that row including scatter_ratio
# we will drop this row

In [None]:
df.drop(249, inplace=True)

#### Missing Treatment Values for elongatedness

In [None]:
df[df['elongatedness'].isnull()][missing_values_cols]

In [None]:
df.loc[215]['class']

In [None]:
Median_distance_elongatedness_car=df['elongatedness'][df['class']==1].median()
Median_distance_elongatedness_car

In [None]:
df.loc[215]=df.loc[215].replace(np.nan,Median_distance_elongatedness_car)

In [None]:
df[df['elongatedness'].isnull()][missing_values_cols]

#### Missing Treatment Values for pr.axis_rectangularity

In [None]:
df[df['pr.axis_rectangularity'].isnull()][missing_values_cols]

In [None]:
# 3 rows have missing values for pr.axis_rectangularity and only this column has missing value
# We will replace this with median value of the corresponding class

In [None]:
# Let's look at the class label of the rows with missing values
df.loc[[70,237,273]]['class']

In [None]:
Median_distance_pr_axis_rectangularity_van=df['pr.axis_rectangularity'][df['class']==2].median()
Median_distance_pr_axis_rectangularity_car=df['pr.axis_rectangularity'][df['class']==1].median()
Median_distance_pr_axis_rectangularity_bus=df['pr.axis_rectangularity'][df['class']==0].median()
Median_distance_pr_axis_rectangularity_van,Median_distance_pr_axis_rectangularity_car,Median_distance_pr_axis_rectangularity_bus

In [None]:
df.loc[70]=df.loc[70].replace(np.nan,Median_distance_pr_axis_rectangularity_car)
df.loc[237]=df.loc[237].replace(np.nan,Median_distance_pr_axis_rectangularity_bus)
df.loc[273]=df.loc[273].replace(np.nan,Median_distance_pr_axis_rectangularity_van)

In [None]:
# Double Check if missing values have been treated
df[df['pr.axis_rectangularity'].isnull()][missing_values_cols]

#### Missing Treatment Values for scaled_variance

In [None]:
df[df['scaled_variance'].isnull()][missing_values_cols]

In [None]:
# 2 rows have missing values for scaled_variance, no other columns have missing values for these rows. We will replace with median
# of corresponding class

In [None]:
df.loc[[372,522]]['class']

In [None]:
Median_distance_scaled_variance_van=df['scaled_variance'][df['class']==2].median()
Median_distance_scaled_variance_car=df['scaled_variance'][df['class']==1].median()
Median_distance_scaled_variance_van,Median_distance_scaled_variance_car

In [None]:
df.loc[372]=df.loc[372].replace(np.nan,Median_distance_scaled_variance_van)
df.loc[522]=df.loc[522].replace(np.nan,Median_distance_scaled_variance_car)

In [None]:
df[df['scaled_variance'].isnull()][missing_values_cols]

#### Missing Treatment Values for scaled_variance.1

In [None]:
df[df['scaled_variance.1'].isnull()][missing_values_cols]

In [None]:
# 2 rows have missing values for scaled_variance.1, and no other columns have missing values for these rows. We will replace them with the median
# of the corresponding class

In [None]:
df.loc[[308,496]]['class']

In [None]:
Median_distance_scaled_variance1_car=df['scaled_variance.1'][df['class']==1].median()
Median_distance_scaled_variance1_car

In [None]:
df.loc[[308,496]]=df.loc[[ 308,496]].replace(np.nan,Median_distance_scaled_variance1_car)

In [None]:
df[df['scaled_variance.1'].isnull()][missing_values_cols]

#### Missing Treatment Values for scaled_radius_of_gyration.1

In [None]:
df[df['scaled_radius_of_gyration.1'].isnull()][missing_values_cols]

In [None]:
# There are 4 rows with missing values for scaled_radius_of_gyration.1
# The row with index 66 has missing values in 2 columns and will be dropped
# The missing values in the other 3 rows will be replaced with the median of the corresponding class

In [None]:
# Drop row 66
df.drop(66, inplace=True)

In [None]:
df.loc[[77,192,329]]['class']


In [None]:
Median_distance_radius_gyr1_car=df['scaled_radius_of_gyration.1'][df['class']==1].median()
Median_distance_radius_gyr1_car

In [None]:
df.loc[[77,192,329]]=df.loc[[ 77,192,329]].replace(np.nan,Median_distance_radius_gyr1_car)

In [None]:
df[df['scaled_radius_of_gyration.1'].isnull()][missing_values_cols]

#### Missing Values Treatment for skewness_about

In [None]:
df[df['skewness_about'].isnull()][missing_values_cols]

In [None]:
# 3 rows have missing values  in skewness_about column , no other column has missing value for these rows. 
# we will replace these values with median of the corresponding class

In [None]:
df.loc[[141,177,285]]['class']

In [None]:
Median_distance_skewness_about_car=df['skewness_about'][df['class']==1].median()
Median_distance_skewness_about_bus=df['skewness_about'][df['class']==0].median()
Median_distance_skewness_about_car,Median_distance_skewness_about_bus

In [None]:
df.loc[[141,177]]=df.loc[[141,177]].replace(np.nan,Median_distance_skewness_about_bus)

In [None]:
df.loc[[285]]=df.loc[[285]].replace(np.nan,Median_distance_skewness_about_car)

In [None]:
df[df['skewness_about'].isnull()][missing_values_cols]

#### Missing Values Treatment for skewness_about.1

In [None]:
df[df['skewness_about.1'].isnull()][missing_values_cols]

In [None]:
#  No longer Missing values- corresponding row
#has been dropped while treating other missing values

#### Missing Values Treatment for skewness_about.2

In [None]:
df[df['skewness_about.2'].isnull()][missing_values_cols]

In [None]:
# One row has missing value for skewness_about.2 and no other value is missing for that row
# Lets replace that value with median of the corresponding class

In [None]:
df.loc[419]['class']

In [None]:
Median_distance_skewness_about2_car=df['skewness_about.2'][df['class']==1].median()
Median_distance_skewness_about2_car

In [None]:
df.loc[[419]]=df.loc[[419]].replace(np.nan,Median_distance_skewness_about2_car)

In [None]:
df[df['skewness_about.2'].isnull()][missing_values_cols]

#### Data Frame Summary Statistics after missing values treatment

In [None]:
df.describe().T

#### Only 7 of 846 rows (about 0.8% of the records) have been dropped, which should be acceptable

### Outlier Treatment

In [None]:
# Split data into train and test set. Outlier treatment will be done only on train set
# We will divide into feature and target set during PCA and model building

In [None]:
#Split into Train -Test set
test_size = 0.30 # taking 70:30 training and test set
seed = 7  # Random seed for repeatability of the code
df_train, df_test= train_test_split(df, test_size=test_size, random_state=seed)
df_train.shape, df_test.shape

In [None]:
# Function to report the quartiles and the outliers beyond the 1.5 * IQR fences.
# We will analyse each of the outliers and follow the strategy below
# 1. A high outlier close to the max value will be replaced with the max value of the corresponding class
# 2. A high outlier much above the 75th percentile will be dropped from our analysis
# 3. A low outlier close to the min value will be replaced by the min value of the corresponding class
# 4. A low outlier much below the 25th percentile will be dropped from our analysis

def handleOutlier(aSeries):
    
    q1 = aSeries.quantile(0.25)
    q3 = aSeries.quantile(0.75)
   
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    outliers_low = aSeries[(aSeries < fence_low)]
    outliers_high= aSeries[(aSeries > fence_high)]
    
    print ("25th Quantile value: ", q1)
    print('Outlier low Count =', outliers_low.count())
    print('List of Low outliers: \n')
    print(outliers_low)

    print ("75th Quantile value: ", q3)
    print('Outlier High Count = ', outliers_high.count())
    print('List of High outliers: \n')
    print(outliers_high)
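
The function above prints its findings. When the outlier indices are needed later (for example to pass them to `df_train.loc[...]`), a variant that returns them can be convenient. This is a small sketch under that assumption; the name `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return (low_outliers, high_outliers) using Tukey's k * IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    low = series[series < q1 - k * iqr]    # values below the lower fence
    high = series[series > q3 + k * iqr]   # values above the upper fence
    return low, high

# Made-up values with one obvious high outlier
values = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 60])
low, high = iqr_outliers(values)
print(list(high.index), high.tolist())
```

The returned Series keep the original index, so `high.index` can be used directly to select the full rows.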
    

#### Compactness

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['compactness'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['compactness'],ax=ax2)
ax2.set_title("Box Plot")

In [None]:
handleOutlier(df_train['compactness'])

In [None]:
#Lets see the complete row
df_train.loc[[44]]

In [None]:
#class is car. Lets observe few rows with class car- in terms of max values as it is high outlier
df_train[df_train['class']==1]['compactness'].sort_values( ascending=False).head(5)

In [None]:
# The class also has values such as 117 and 116, so 119 does not seem to be far off. We will not treat this outlier.

#### Circularity

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['circularity'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['circularity'],ax=ax2)
ax2.set_title("Box Plot")

From above we can see that there are no outliers in circularity

#### Distance Circularity

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['distance_circularity'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['distance_circularity'],ax=ax2)
ax2.set_title("Box Plot")


From the above we can see that there are no outliers in the distance_circularity column. The distribution plot shows two peaks and a right skew, because the long tail is on the right side (mean > median).

#### radius_ratio

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['radius_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['radius_ratio'],ax=ax2)
ax2.set_title("Box Plot")

In [None]:
# There are some outliers on the right side (high outliers). Let's analyse them and decide on their treatment
handleOutlier(df_train['radius_ratio'])

In [None]:
# Lets observe full rows for these outliers
df_train.loc[[37,135,388]]

In [None]:
# All these are for class van. Lets observe maximum radius_ratio for class van
df_train[df_train['class']==2]['radius_ratio'].sort_values( ascending=False).head(8)

In [None]:
# values of radius ratio for outlier are far away  from the max value 250. Lets replace these values with 250
df_train.loc[[37,135,388],'radius_ratio']=250.0

In [None]:
#Double check the values if replaced correctly
df_train.loc[[37,135,388]]

In [None]:
#All Done for radius ratio!

#### pr.axis_aspect_ratio

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['pr.axis_aspect_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['pr.axis_aspect_ratio'],ax=ax2)
ax2.set_title("Box Plot")

There are many high outliers. Let's observe each of them and treat them.

In [None]:
handleOutlier(df_train['pr.axis_aspect_ratio'])

In [None]:
# Lets observe full rows for these outliers
df_train.loc[[4,37,135,291,388,523,706]]

In [None]:
# Index 4  belongs to class Bus while others belong to class van. Lets observe max values of this column for
#both bus and van

In [None]:
# Lets Check for Bus first
df_train[df_train['class']==0]['pr.axis_aspect_ratio'].sort_values( ascending=False).head(8)

In [None]:
# For bus we can see values around 75 and a max value of 76. It is better to drop this row, as the value 103 is
# significantly higher

In [None]:
df_train.drop(4, inplace=True)

In [None]:
# Now let's check for van
df_train[df_train['class']==2]['pr.axis_aspect_ratio'].sort_values( ascending=False).head(20)

In [None]:
##From 72 to 97 it is big jump in value and then other outlier values are even higher upto 138. It is better to drop 
#these rows

In [None]:
df_train.drop([37,135,291,388,523,706], inplace=True)

#### max.length_aspect_ratio

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['max.length_aspect_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['max.length_aspect_ratio'],ax=ax2)
ax2.set_title("Box Plot")

In [None]:
handleOutlier(df_train['max.length_aspect_ratio'])

In [None]:
# Lets observe full rows for these outliers
df_train.loc[[391,127,815,544]]

In [None]:
# The row with index 391 is a van and the others are buses. Let's observe the max values, as these are high outliers

In [None]:
# Let's check for van first
df_train[df_train['class']==2]['max.length_aspect_ratio'].sort_values( ascending=False).head(20)

In [None]:
# Outlier is double the max value which is 12. better drop this row

In [None]:
df_train.drop(391, inplace=True)

In [None]:
# Lets Check for bus now
df_train[df_train['class']==0]['max.length_aspect_ratio'].sort_values( ascending=False).head(20)

In [None]:
# Again, for bus the max length aspect ratio is 8, and the jump from 8 to 19/22 is too high. Let's drop these outliers from the train set

In [None]:
df_train.drop([127,815,544], inplace=True)

#### Scatter Ratio

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['scatter_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['scatter_ratio'],ax=ax2)
ax2.set_title("Box Plot")

No Outlier in scatter ratio

#### elongatedness

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['elongatedness'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['elongatedness'],ax=ax2)
ax2.set_title("Box Plot")

No Outlier in elongatedness

#### pr.axis_rectangularity

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['pr.axis_rectangularity'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['pr.axis_rectangularity'],ax=ax2)
ax2.set_title("Box Plot")

No Outlier in pr_axis_rectangularity

#### max.length_rectangularity

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['max.length_rectangularity'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['max.length_rectangularity'],ax=ax2)
ax2.set_title("Box Plot")

No Outlier in max.length_rectangularity

#### scaled_variance

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['scaled_variance'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['scaled_variance'],ax=ax2)
ax2.set_title("Box Plot")

No Outlier in scaled_variance

#### scaled_variance.1

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['scaled_variance.1'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['scaled_variance.1'],ax=ax2)
ax2.set_title("Box Plot")

There is one outlier in scaled_Variance.1

In [None]:
handleOutlier(df_train['scaled_variance.1'])

In [None]:
# Lets observe full row for this outliers
df_train.loc[[85]]

In [None]:
# Look up the outlier's class and observe the max values for that class, as it is a high outlier
outlier_class = df_train.loc[85, 'class']
df_train[df_train['class']==outlier_class]['scaled_variance.1'].sort_values( ascending=False).head(8)

There are nearby values such as 982, 987 and 962, so 998 does not look very high. We will leave this outlier as is.



#### scaled_radius_of_gyration.1

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['scaled_radius_of_gyration.1'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['scaled_radius_of_gyration.1'],ax=ax2)
ax2.set_title("Box Plot")

There are a lot of high outliers.

In [None]:
handleOutlier(df_train['scaled_radius_of_gyration.1'])

In [None]:
# Lets observe full row for this outliers
df_train.loc[[687,734,492,834,515,351,41,231,232,160,553,79,568,612,230,655,420,463,790,47,381]]


In [None]:
# Let's observe the full rows of these outliers for class bus
# (filter on the selected rows themselves so the boolean mask has a matching index)
df_train.loc[[687,734,492,834,515,351,41,231,232,160,553,79,568,612,230,655,420,463,790,47,381]].loc[lambda d: d['class']==0]

In [None]:
# Let's observe the max values for class bus, as these are high outliers
df_train[df_train['class']==0]['scaled_radius_of_gyration.1'].sort_values( ascending=False).head(20)

In [None]:
# The outlier values for buses are almost in the range of the max. We will neither delete nor replace them; leave as is

In [None]:
# Let's observe the full rows of these outliers for class van
df_train.loc[[687,734,492,834,515,351,41,231,232,160,553,79,568,612,230,655,420,463,790,47,381]].loc[lambda d: d['class']==2]

In [None]:
# Let's observe the max values for class van, as these are high outliers
df_train[df_train['class']==2]['scaled_radius_of_gyration.1'].sort_values( ascending=False).head(20)

In [None]:
# The outlier values for vans are almost in the range of the max. We will neither delete nor replace them; leave as is

In [None]:
# Let's observe the full rows of these outliers for class car
df_train.loc[[687,734,492,834,515,351,41,231,232,160,553,79,568,612,230,655,420,463,790,47,381]].loc[lambda d: d['class']==1]

In [None]:
# Let's observe the max values for class car, as these are high outliers
df_train[df_train['class']==1]['scaled_radius_of_gyration.1'].sort_values( ascending=False).head(20)

In [None]:
# The outlier values for cars are almost in the range of the max. We will neither delete nor replace them; leave as is

#### skewness_about

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['skewness_about'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['skewness_about'],ax=ax2)
ax2.set_title("Box Plot")

No outlier in skewness_about field

#### skewness_about.1

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['skewness_about.1'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['skewness_about.1'],ax=ax2)
ax2.set_title("Box Plot")

There is one high outlier

In [None]:
handleOutlier(df_train['skewness_about.1'])

In [None]:
#Lets observe the full row of the outlier
df_train.loc[[132]]
# Outlier belongs to class 1 that is car

In [None]:
##Lets observe max values for car class
df_train[df_train['class']==1]['skewness_about.1'].sort_values( ascending=False).head(20)

Value is well in range of max value of skewness_about.1 for cars. we will not delete or replace it

#### skewness_about.2

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['skewness_about.2'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['skewness_about.2'],ax=ax2)
ax2.set_title("Box Plot")

No Outliers for skewness_about.2

#### hollows ratio

In [None]:
fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
fig.set_size_inches(20,4)
sns.distplot(df_train['hollows_ratio'],ax=ax1)
ax1.set_title("Distribution Plot")

sns.boxplot(df_train['hollows_ratio'],ax=ax2)
ax2.set_title("Box Plot")

No Outliers for hollows_ratio

### Final shape and statistics of train set after missing values and outlier treatment

In [None]:
df_train.shape

In [None]:
df_train.describe().T

## PCA & Dimensionality Reduction

In [None]:
# Divide train and test set into feature and target sets
X_train=df_train.drop(labels='class', axis=1)
y_train=df_train['class']
X_test=df_test.drop(labels='class', axis=1)
y_test=df_test['class']
X_train.shape,y_train.shape, X_test.shape, y_test.shape

In [None]:
sc = StandardScaler()

In [None]:
sc.fit(X_train) # Fit scaler in train set

In [None]:
# transform train set
#Transform X_train
X_train_std=sc.transform(X_train)
#Transform X_test ( with same fit as train) to prevent data leak
X_test_std=sc.transform(X_test)

In [None]:
# Covariance Matrix
cov_matrix = np.cov(X_train_std.T)

print('Covariance Matrix \n', cov_matrix)

In [None]:
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n', eig_vecs)
print('\n Eigen Values \n', eig_vals)

In [None]:
print("Eigen Values:")
pd.DataFrame(eig_vals).transpose()

In [None]:
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)   # array of size =  as many PC dimensions
print("Cumulative Variance Explained", cum_var_exp)

In [None]:
# Ploting 
plt.figure(figsize=(15 , 6))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()

 First 8 principal components explain 98% of the variance in the data. 
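
The number of components to keep can also be derived from a target share of explained variance instead of being read off the plot. The following is a minimal sketch on synthetic data; the helper name `n_components_for`, the 0.98 target and the generated matrix are assumptions for illustration and do not reproduce the figures above.

```python
import numpy as np

def n_components_for(X_std, target=0.98):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`."""
    cov = np.cov(X_std, rowvar=False)
    # eigvalsh is meant for symmetric matrices and returns real eigenvalues
    eig_vals = np.linalg.eigvalsh(cov)[::-1]        # descending order
    cum_ratio = np.cumsum(eig_vals) / eig_vals.sum()
    # First position where the cumulative ratio is at least `target`
    k = int(np.searchsorted(cum_ratio, target)) + 1
    return k, cum_ratio

# Synthetic data: 5 columns built from only 2 independent signals plus tiny noise
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])
X = X + 1e-3 * rng.normal(size=X.shape)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

k, cum_ratio = n_components_for(X_std, target=0.98)
print(k, np.round(cum_ratio, 4))
```

scikit-learn offers a similar shortcut: `PCA(n_components=0.98)` keeps as many components as are needed to reach that share of variance.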

In [None]:
# Make a list of (eigenvalue, eigenvector) pairs.
# np.linalg.eig returns the eigenvectors as columns, so eigenvector i is eig_vecs[:, i]
eig_pairs = [(eig_vals[index], eig_vecs[:, index]) for index in range(len(eig_vals))]

# Sort the pairs from highest to lowest eigenvalue (compare on the eigenvalue only)
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)


# Note: always pair each eigenvalue with its eigenvector before sorting

# Extract the eigenvalues and eigenvectors in descending order
eigvalues_sorted = [pair[0] for pair in eig_pairs]
eigvectors_sorted = [pair[1] for pair in eig_pairs]
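
When pairing eigenvalues with eigenvectors it helps to remember that NumPy returns the eigenvectors as the columns of the result, not the rows. The sketch below checks this numerically on random data; it uses `np.linalg.eigh` (the variant for symmetric matrices) as an assumption of convenience, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
cov = np.cov(X, rowvar=False)

# eigh is the symmetric-matrix variant of eig; both return eigenvectors as columns
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Sort by descending eigenvalue and reorder the columns to match
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Column 0 satisfies cov @ v = lambda * v
first_vec = eig_vecs[:, 0]
column_ok = np.allclose(cov @ first_vec, eig_vals[0] * first_vec)

# A row of the same matrix is generally not an eigenvector
row_ok = np.allclose(cov @ eig_vecs[0], eig_vals[0] * eig_vecs[0])

# Project the data onto the two leading components
projected = X @ eig_vecs[:, :2]
print(column_ok, row_ok, projected.shape)
```

Sorting with `argsort` and indexing the columns keeps values and vectors aligned without building a list of pairs.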

In [None]:
#Dimesionality reduction 

P_reduce = np.array(eigvectors_sorted[0:8]).transpose()   # Selecting the first 8 eigenvectors out of 18

Proj_train_data = np.dot(X_train_std,P_reduce)   # projecting training data onto the eight eigen vectors

Proj_test_data = np.dot(X_test_std,P_reduce)    # projecting test data onto the eight eigen vectors

In [None]:
#Check shapes of train and test new feature and target set after PCA
Proj_train_data.shape,y_train.shape,Proj_test_data.shape,y_test.shape

## Modelling,Hyperparameter tuning & Cross Validation

With Linear Kernel

In [None]:
# Use SVM

from sklearn.svm import SVC

# Building a Support Vector Machine on train data
svc_model = SVC(C= .1, kernel='linear', gamma= 1)
svc_model.fit(Proj_train_data, y_train)

prediction = svc_model.predict(Proj_test_data)

In [None]:
# Check the accuracy on the training set and on the test set
print(svc_model.score(Proj_train_data, y_train))
print(svc_model.score(Proj_test_data, y_test))

In [None]:
print("Confusion Matrix:\n",confusion_matrix(y_test,prediction))  # rows = true class, columns = predicted class

With RBF Kernel

In [None]:
# Building a Support Vector Machine on train data
svc_model = SVC(kernel='rbf')
svc_model.fit(Proj_train_data, y_train)

prediction = svc_model.predict(Proj_test_data)

In [None]:
print(svc_model.score(Proj_train_data, y_train))
print(svc_model.score(Proj_test_data, y_test))

In [None]:
print("Confusion Matrix:\n",confusion_matrix(y_test,prediction))  # rows = true class, columns = predicted class

### Hyper Parameter Tuning

In [None]:
#With Hyper Parameters Tuning
#2-3,SVM
#importing modules
from sklearn.model_selection import GridSearchCV
from sklearn import svm
#making the instance
model=svm.SVC()
#Hyper Parameters Set
params = {'C': [0.01, 0.05, 0.5, 1], 
      #    'gamma':[0.01, 0.02 , 0.03 , 0.04, 0.05],
          'kernel': ['linear','rbf']}
#Making models with hyper parameters sets
gs = GridSearchCV(model, param_grid=params, n_jobs=-1,cv=10)
#Learning
gs.fit(Proj_train_data,y_train)


In [None]:
#The best hyper parameters set
print("Best Hyper Parameters:\n",gs.best_params_)
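
An alternative worth considering is to wrap scaling, PCA and the classifier in a single `Pipeline`, so that each cross-validation fold refits the scaler and the projection on its own training part. The sketch below illustrates the idea on synthetic data from `make_classification`; the data, the step names and the use of scikit-learn's `PCA` in place of the manual eigen-decomposition are all assumptions for illustration, not the setup used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the vehicle features (not the real data)
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=7)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=8)),
    ("svc", SVC()),
])

# The step-name prefix routes each parameter to the right pipeline step
param_grid = {"svc__C": [0.01, 0.05, 0.5, 1], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

With this layout the raw (unscaled) training features are passed to `fit`, and the fitted `search` object applies the same scaling and projection when predicting on the test set.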

*K-fold cross validation (on the train set, using the tuned hyperparameter search, i.e. gs)*

In [None]:
num_folds = 10
seed = 7

kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)  # random_state only takes effect when shuffle=True
model = gs
results = cross_val_score(gs,Proj_train_data,y_train, cv=kfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

In [None]:
#plt.hist(results,normed= True)
sns.distplot(results,kde=True,bins=10)
plt.xlabel("Accuracy")
plt.show()
# confidence intervals
alpha = 0.95                             # for 95% confidence 
p = ((1.0-alpha)/2.0) * 100              # lower tail: 2.5% is cut off on each side for a 95% interval
lower = max(0.0, np.percentile(results, p))  
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(results, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
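
The percentile calculation above can be packaged as a small helper. This is a sketch with hypothetical fold accuracies (the numbers are made up and are not the notebook's results); the function name `percentile_interval` is likewise an assumption.

```python
import numpy as np

def percentile_interval(scores, confidence=0.95):
    """Empirical interval covering the central `confidence` share of `scores`."""
    tail = (1.0 - confidence) / 2.0 * 100           # 2.5 for a 95% interval
    lower, upper = np.percentile(scores, [tail, 100 - tail])
    # Accuracy is bounded, so clip the interval to [0, 1]
    return max(0.0, float(lower)), min(1.0, float(upper))

# Hypothetical fold accuracies, for illustration only
fold_scores = np.array([0.90, 0.92, 0.93, 0.94, 0.95, 0.95, 0.96, 0.97, 0.98, 1.00])
lower, upper = percentile_interval(fold_scores)
print("%.1f%% to %.1f%%" % (lower * 100, upper * 100))
```

With only 10 fold scores, such an interval is best read as a rough range of the observed fold accuracies rather than a precise confidence statement.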

### Test Accuracy with Hypertuned parameter 

In [None]:
prediction=gs.predict(Proj_test_data)
print("Accuracy:",metrics.accuracy_score(y_test,prediction))
# Evaluation (confusion matrix)
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test,prediction))


Looking at the confusion matrix, the model predicts all the vans correctly from their silhouettes (100%), 59/62 buses are predicted correctly (95%) and 129/138 cars are predicted correctly (93.5%).
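
The per-class percentages quoted above are recalls: the diagonal of the confusion matrix divided by the row totals, given that rows hold the true classes. A minimal sketch with a made-up matrix (the counts are hypothetical and are not this notebook's output) shows the arithmetic:

```python
import numpy as np

# Hypothetical 3x3 confusion matrix (rows = true class, columns = predicted class);
# the counts are made up for illustration and are not the notebook's results
cm = np.array([
    [18,  1,  1],   # true bus
    [ 2, 36,  2],   # true car
    [ 0,  1, 19],   # true van
])

# Recall per class: correct predictions divided by the true count of that class
recall = np.diag(cm) / cm.sum(axis=1)

# Overall accuracy: all correct predictions over all samples
accuracy = np.trace(cm) / cm.sum()
print(np.round(recall, 3), round(float(accuracy), 4))
```

`metrics.classification_report(y_test, prediction)` reports the same recalls together with precision and F1 for each class.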

The test accuracy (95.64%) is well within the 95% confidence interval (86.8% to 99.6%).