# Statlog Vehicle Silhouettes Project
By : Shashidhar. B

### Introduction
The purpose of the case study is to classify a given silhouette as one of four different types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.


### Data Source
The original dataset is available at UCI Machine Learning Repository and can be downloaded from this address: https://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes)

## Objectives
1. Data pre-processing - Understand the data and treat missing values.

2. Understanding the attributes - Find the relationships between the attributes (independent variables) and decide, with justification, which attributes should be part of the analysis.

3. Use PCA from scikit-learn and an elbow plot to find the reduced number of dimensions (covering more than 95% of the variance).

4. Use support vector machines with grid search (try C values 0.01, 0.05, 0.5, 1 and kernel = linear, rbf) to find the best hyperparameters, and use cross-validation to estimate the accuracy.

## Steps carried out to achieve objectives
1. Import Necessary Libraries and Packages.

2. Import the Data.

3. EDA [Exploratory Data Analysis].

4. Data Cleaning and Data Visualisation.
   
5. PCA and SVM Analysis.

### 1. Import Necessary Libraries and Packages

In [4]:
#For numerical operations, numpy (Numerical Python) is used
import numpy as np

#For Data Analysis pandas(Python Data Analysis Library) is used
import pandas as pd

#For plotting 2D graphs, the matplotlib library is used
import matplotlib.pyplot as plt
#To enable matplotlib to plot in Jupyter Notebook
%matplotlib inline

#For better visualisation of Statistical Data seaborn library is used
import seaborn as sns

### 2. Import Data

In [5]:
#Reading the data
data=pd.read_csv("vehicle.csv",sep=",")

### 3. EDA [Exploratory Data Analysis]

In [None]:
data.head(10)

In [None]:
data.shape

The dataset has 846 rows and 19 columns.

In [None]:
data.info()

Only the class attribute is of object type; all others are numeric.

There are missing values in various columns; let's explore them.

In [None]:
data.isnull().sum()

There are null entries as shown above.

In [None]:
data.isna().sum()

There are NaN (Not a Number) entries as shown, matching the null-value output above, since isna() and isnull() are aliases.

In [None]:
(data==0).sum()

There are zero values entered in columns 'skewness_about' and 'skewness_about.1'; these are actual entries, not missing values.

In [None]:
data.describe(include='all').T

### 4. Data Cleaning and Data Visualisation

In [None]:
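#First, replace the null values in each column with a random integer from roughly its IQR [interquartile range], i.e. between the 25th and 75th percentiles.
#Note: fillna receives a single value per column, so every missing entry in a column gets the same number.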
import random

In [None]:
data['circularity']=data['circularity'].fillna(random.randrange(40, 49))

data['distance_circularity']=data['distance_circularity'].fillna(random.randrange(70, 98))

data['radius_ratio']=data['radius_ratio'].fillna(random.randrange(114, 195))

data['pr.axis_aspect_ratio']=data['pr.axis_aspect_ratio'].fillna(random.randrange(57, 65))

data['scatter_ratio']=data['scatter_ratio'].fillna(random.randrange(147, 198))

data['elongatedness']=data['elongatedness'].fillna(random.randrange(33, 46))

data['pr.axis_rectangularity']=data['pr.axis_rectangularity'].fillna(random.randrange(19, 23))

data['scaled_variance']=data['scaled_variance'].fillna(random.randrange(167, 217))

data['scaled_variance.1']=data['scaled_variance.1'].fillna(random.randrange(318, 587))

data['scaled_radius_of_gyration']=data['scaled_radius_of_gyration'].fillna(random.randrange(149, 198))

data['scaled_radius_of_gyration.1']=data['scaled_radius_of_gyration.1'].fillna(random.randrange(67, 75))

data['skewness_about']=data['skewness_about'].fillna(random.randrange(2, 9))

data['skewness_about.1']=data['skewness_about.1'].fillna(random.randrange(5, 19))

data['skewness_about.2']=data['skewness_about.2'].fillna(random.randrange(184, 193))

In [None]:
data.isnull().sum()

From the above, we can see that all the missing values are taken care of.
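The fillna calls above pass one random integer per column, so all missing entries in a column share that number. A minimal sketch of an alternative follows, assuming one wants an independent draw for each missing row and quartiles computed from the column itself rather than typed in by hand. The helper name `fill_from_iqr` and the demo values are hypothetical, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_from_iqr(series, rng):
    """Fill each NaN with its own random draw from the column's 25%-75% range."""
    q1, q3 = series.quantile([0.25, 0.75])  # NaN entries are skipped by quantile
    filled = series.copy()
    missing = filled.isna()
    # One independent uniform draw per missing row, between Q1 and Q3
    filled[missing] = rng.uniform(q1, q3, size=missing.sum())
    return filled

rng = np.random.default_rng(0)  # fixed seed so the fill is reproducible
demo = pd.Series([40.0, 44.0, np.nan, 49.0, np.nan, 42.0])  # made-up values
demo_filled = fill_from_iqr(demo, rng)
print(demo_filled)
```

On the real data this could be applied column by column, for example `data[col] = fill_from_iqr(data[col], rng)` for each numeric column with missing values.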

#### Data Visualisation 

In [None]:
f,axes=plt.subplots(3,3,figsize=(20,15))
sns.boxplot(data['compactness'],ax=axes[0,0])
sns.boxplot(data['circularity'],ax=axes[0,1])
sns.boxplot(data['distance_circularity'],ax=axes[0,2])
sns.boxplot(data['radius_ratio'],ax=axes[1,0])
sns.boxplot(data['pr.axis_aspect_ratio'],ax=axes[1,1])
sns.boxplot(data['max.length_aspect_ratio'],ax=axes[1,2])
sns.boxplot(data['scatter_ratio'],ax=axes[2,0])
sns.boxplot(data['elongatedness'],ax=axes[2,1])
sns.boxplot(data['pr.axis_rectangularity'],ax=axes[2,2])
plt.show()

In [None]:
f,axes=plt.subplots(3,3,figsize=(20,15))
sns.boxplot(data['max.length_rectangularity'],ax=axes[0,0])
sns.boxplot(data['scaled_variance'],ax=axes[0,1])
sns.boxplot(data['scaled_variance.1'],ax=axes[0,2])
sns.boxplot(data['scaled_radius_of_gyration'],ax=axes[1,0])
sns.boxplot(data['scaled_radius_of_gyration.1'],ax=axes[1,1])
sns.boxplot(data['skewness_about'],ax=axes[1,2])
sns.boxplot(data['skewness_about.1'],ax=axes[2,0])
sns.boxplot(data['skewness_about.2'],ax=axes[2,1])
sns.boxplot(data['hollows_ratio'],ax=axes[2,2])
plt.show()

In [None]:
for i in range(len(data['radius_ratio'])):
    if data.loc[i,'radius_ratio']>300:
        data.loc[i,'radius_ratio']= random.randrange(114, 195)

In [None]:
for i in range(len(data['pr.axis_aspect_ratio'])):
    if data.loc[i,'pr.axis_aspect_ratio']>90:
        data.loc[i,'pr.axis_aspect_ratio']= random.randrange(57, 65)

In [None]:
for i in range(len(data['max.length_aspect_ratio'])):
    if data.loc[i,'max.length_aspect_ratio']>18:
        data.loc[i,'max.length_aspect_ratio']= random.randrange(7, 10)

In [None]:
for i in range(len(data['max.length_aspect_ratio'])):
    if data.loc[i,'max.length_aspect_ratio']<3:
        data.loc[i,'max.length_aspect_ratio']= random.randrange(7, 10)

In [None]:
for i in range(len(data['scaled_variance'])):
    if data.loc[i,'scaled_variance']>300:
        data.loc[i,'scaled_variance']= random.randrange(167, 217)

In [None]:
for i in range(len(data['scaled_variance.1'])):
    if data.loc[i,'scaled_variance.1']>982:
        data.loc[i,'scaled_variance.1']= random.randrange(318, 587)

In [None]:
for i in range(len(data['scaled_radius_of_gyration.1'])):
    if data.loc[i,'scaled_radius_of_gyration.1']>87:
        data.loc[i,'scaled_radius_of_gyration.1']= random.randrange(67, 75)

In [None]:
for i in range(len(data['skewness_about'])):
    if data.loc[i,'skewness_about']>18:
        data.loc[i,'skewness_about']= random.randrange(2, 9)

In [None]:
for i in range(len(data['skewness_about.1'])):
    if data.loc[i,'skewness_about.1']>39:
        data.loc[i,'skewness_about.1']= random.randrange(5, 19)
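The loops above use thresholds read off the boxplots by eye. As a hedged sketch of the same idea in vectorized form, the helper below flags values outside the usual 1.5 × IQR whiskers and replaces them with draws from the interquartile range. The 1.5 × IQR rule is a swapped-in convention, not the hand-picked cut-offs used in this notebook, and the name `replace_outliers` and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def replace_outliers(series, rng, k=1.5):
    """Replace values beyond Q1 - k*IQR or Q3 + k*IQR with draws from [Q1, Q3]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    # Boolean mask marking every row outside the whisker range
    outlier = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    cleaned = series.astype(float).copy()
    cleaned[outlier] = rng.uniform(q1, q3, size=outlier.sum())
    return cleaned, int(outlier.sum())

rng = np.random.default_rng(0)
demo = pd.Series([150, 160, 170, 180, 190, 1000])  # made-up values, one obvious outlier
cleaned, n_replaced = replace_outliers(demo, rng)
print(n_replaced, cleaned.max())
```

A boolean mask avoids the row-by-row `data.loc[i, ...]` lookups, which matters little at 846 rows but keeps the rule for every column in one place.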

In [None]:
data_new=data.drop('class',axis=1)

In [None]:
from scipy.stats import zscore
data_scaled=data_new.apply(zscore)
data_scaled.head()

In [None]:
data_scaled.boxplot(column=['compactness','circularity','distance_circularity','radius_ratio','pr.axis_aspect_ratio','max.length_aspect_ratio','scatter_ratio','elongatedness','pr.axis_rectangularity'],figsize=(20,5))

In [None]:
data_scaled.boxplot(column=['max.length_rectangularity','scaled_variance','scaled_variance.1','scaled_radius_of_gyration','scaled_radius_of_gyration.1','skewness_about','skewness_about.1','skewness_about.2','hollows_ratio'],figsize=(20,5))

From the above we can see that the z-scored data is free from extreme outliers and that all attributes are on the same scale.

In [None]:
#Let's see how these data are correlated
col_names =data_new.columns
corr_matrix = data_new[col_names].corr().abs()
plt.figure(figsize = (20,20))
cmap = sns.diverging_palette(500, 10, as_cmap=True)
sns.heatmap(corr_matrix,annot=True, xticklabels=corr_matrix.columns.values, yticklabels=corr_matrix.columns.values, vmax=.5, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .82})
plt.title('Heatmap of Correlation Matrix')

We can note that the attribute 'pr.axis_aspect_ratio' is very weakly correlated with all of the other attributes. Hence, it could be dropped.

The majority of the attributes are moderately to highly correlated with each other, so PCA is well suited to this data.

Attributes with high correlation are as follows:

1. 'elongatedness' and 'scatter_ratio'

2. 'pr.axis_rectangularity' and 'scatter_ratio'

3. 'pr.axis_rectangularity' and 'elongatedness'

4. 'max.length_rectangularity' and 'circularity'
 
5. 'scaled_variance' and 'scatter_ratio'

6. 'scaled_variance' and 'elongatedness'

7. 'scaled_variance' and 'pr.axis_rectangularity'

8. 'scaled_variance.1' and 'scatter_ratio'

9. 'scaled_variance.1' and 'elongatedness'

10. 'scaled_variance.1' and 'pr.axis_rectangularity'

We have 10 combinations of 7 attributes which are highly correlated (absolute correlation above 0.9).
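The list above was read off the heatmap. A minimal sketch of extracting such pairs programmatically is shown here, assuming a threshold of 0.9 on the absolute correlation. The function name `high_corr_pairs` and the synthetic demo frame are hypothetical and stand in for `data_new`.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |r|) for every column pair with |r| above threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return [(a, b, round(float(r), 3)) for (a, b), r in pairs.items()]

# Synthetic demo: 'b' is nearly a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})
result = high_corr_pairs(demo)
print(result)
```

Calling `high_corr_pairs(data_new)` would be expected to reproduce the ten pairs listed above if the heatmap was read correctly.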

### 5. PCA and SVM Analysis

To run PCA:

1. The dataset should be free from the dependent variable.

2. The attributes should be on the same scale to create the PCA dimensions.

We have already transformed and scaled the data to suit these requirements using z-scores.

In [None]:
# Step 1 - Create covariance matrix
# Use the z-scored features; 'data' still contains the non-numeric 'class' column
cov_matrix = np.cov(data_scaled.T)
print('Covariance Matrix \n', cov_matrix)

In [None]:
# Step 2 - Get eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n', eig_vecs)
print('\n Eigen Values \n', eig_vals)

In [None]:
#Step 3- Find variance and cumulative variance by each eigen vector

tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)

The eigenvalues are sorted from highest to lowest variance.

All 18 components are computed here, but 7 components are enough to meet the requirement of 95% of the variance.
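The component count for a given variance target can also be computed rather than read from the printed array. The sketch below assumes eigenvalues like `eig_vals` from the step above; the helper name `components_for_variance` and the demo eigenvalues are made up for illustration.

```python
import numpy as np

def components_for_variance(eigenvalues, target_pct=95.0):
    """Smallest number of leading components whose cumulative variance reaches target_pct."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cum_pct = np.cumsum(vals) / vals.sum() * 100
    # argmax returns the first index where the condition is True
    return int(np.argmax(cum_pct >= target_pct) + 1)

demo_eigs = [5.0, 3.0, 1.0, 0.6, 0.4]  # made-up eigenvalues summing to 10
n_needed = components_for_variance(demo_eigs)
print(n_needed)
```

With the notebook's own values, `components_for_variance(eig_vals)` would give the number used later for the reduced PCA.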

In [None]:
plt.plot(var_exp)

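Visually, we can observe that there is a steep drop in the variance explained as the number of PCs increases.

We can note that the elbow/knee is found at around 2.

This means that the first 2 principal components carry the largest share of the variance. The elbow refers to a number of components, not clusters, and reaching the 95% variance requirement still takes 7 components.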

In [None]:
# Plotting
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()

#### PCA on Raw/Unaltered data

In [None]:
# Using scikit-learn PCA here. It does all the above steps and maps data to PCA dimensions in one shot
from sklearn.decomposition import PCA

# NOTE - we first generate all 18 PCA dimensions

pca = PCA(n_components=18)
data_full = pca.fit_transform(data_new)
# Display the transformed data, one row per principal component (no need to refit)
data_full.transpose()

In [None]:
pca.components_

In [None]:
#Converting the transformed data (not pca.components_, which holds the loadings) to a dataframe
data_comp_full = pd.DataFrame(data_full,columns=['PC'+str(i) for i in range(1,19)])

#reinserting the dropped target variable
data_comp_full['class'] = data['class']
data_comp_full.head()

We can note that the values have changed.

Each value here is a principal-component score, that is, a linear combination of all the original attributes for that row.
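To make the "combination of all attributes" point concrete, the sketch below checks on made-up data that scikit-learn's transformed scores equal the mean-centred data multiplied by the transposed loading matrix `components_`. The variable names and the random demo matrix are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # made-up data, 50 rows and 4 attributes

pca_demo = PCA(n_components=2).fit(X)
scores = pca_demo.transform(X)

# Each score is a weighted sum of the centred attributes; the weights are the loadings
manual = (X - pca_demo.mean_) @ pca_demo.components_.T
print(np.allclose(scores, manual))
```

This is why `pca.components_` has one row per component and one column per original attribute, while the transformed data has one row per sample.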

In [None]:
#Let us check it visually
#sns.pairplot(data_comp_full, diag_kind='kde') 

#### SVM 

In [None]:
from sklearn.model_selection import train_test_split

# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score

target = data_comp_full["class"]
features = data_comp_full.drop("class", axis=1)
X_train, X_test, y_train, y_test = train_test_split(features,target, test_size = 0.35, random_state = 10)

The dataset is split into train and test sets in the ratio 65:35 (test_size = 0.35).

Setting random_state makes the split reproducible, so the same train and test sets are obtained across multiple iterations.

In [None]:
from sklearn.svm import SVC

# Building a Support Vector Machine on train data
svc_model = SVC(C= .1, kernel='linear')
svc_model.fit(X_train, y_train)

prediction = svc_model.predict(X_test)

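gamma controls the influence of a single data point: it is the inverse of the distance over which that point has influence, so a higher gamma means a shorter reach. It applies to the rbf kernel, not the linear one. C is the regularization parameter, which governs the complexity of the model.

A lower C value produces a simpler, smoother decision surface, while a higher C produces a more complex one.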
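A rough way to see the effect of C and gamma is to fit the same rbf model over a few settings and compare training accuracy and the number of support vectors. The sketch below uses synthetic data from `make_classification` as a stand-in for the vehicle features; the specific C and gamma values are illustrative assumptions, not the ones required by the objectives.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (not the vehicle dataset)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=0)

results = {}
for C in [0.01, 1, 100]:
    for gamma in [0.01, 1]:
        model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
        # Training accuracy and support-vector count hint at how tightly the surface fits
        results[(C, gamma)] = (model.score(X, y), int(model.n_support_.sum()))

for key, (acc, n_sv) in results.items():
    print(key, round(acc, 3), n_sv)
```

Training accuracy alone rewards overfitting, so comparisons like this are best paired with held-out or cross-validated scores.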

In [None]:
# check the accuracy on the training set and then on the test set
print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))

In [None]:
# Building a Support Vector Machine on train data (the parameter name is uppercase C)
svc_model = SVC(kernel='rbf',C=1)
svc_model.fit(X_train, y_train)

In [None]:
print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))

#### PCA and SVM on Reduced data

In [None]:
# NOTE - we shall generate PC dimensions for raw data (7 dimensions to capture 95% variance)

pca = PCA(n_components=7)
data_pca = pca.fit_transform(data_new)
#pca.fit_transform(data_new).transpose()

In [None]:
pca.components_

In [None]:
#Converting to dataframe
data_comp_full = pd.DataFrame(data_pca,columns=['PC1','PC2','PC3','PC4','PC5','PC6','PC7'])

#reinserting the dropped target variable
data_comp_full['class'] = data['class']
data_comp_full.head(20)

In [None]:
#Let us check it visually
sns.pairplot(data_comp_full, diag_kind='kde') 

We can see from the scatter plots that the principal components show no linear relationship with one another, which indicates that the collinearity has been removed.

In [None]:
#Now let us check again to see if PCA has removed multicollinearity among the new PC dimensions
#The non-numeric 'class' column is excluded before computing correlations
col_names = data_comp_full.columns.drop('class')
corr_matrix = data_comp_full[col_names].corr().abs()
plt.figure(figsize = (7,7))
cmap = sns.diverging_palette(500, 10, as_cmap=True)
sns.heatmap(corr_matrix,annot=True, xticklabels=corr_matrix.columns.values, yticklabels=corr_matrix.columns.values, vmax=.5, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .82})
plt.title('Heatmap of Correlation Matrix')

Multicollinearity is now removed and the number of dimensions is reduced. These 7 PC features capture 95% of the variance.
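scikit-learn can also choose the number of components from a variance target directly: a float between 0 and 1 passed as `n_components` is read as the fraction of variance to retain. The sketch below shows this on synthetic, standardized data; the demo matrix and the variable names are assumptions, and the result on the vehicle data may differ from 7 depending on whether the features are scaled first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 6 columns driven by 2 hidden factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))

X_std = StandardScaler().fit_transform(X)
pca_95 = PCA(n_components=0.95)  # a float in (0, 1) is read as a variance target
scores = pca_95.fit_transform(X_std)

print(pca_95.n_components_, pca_95.explained_variance_ratio_.sum())
```

After fitting, `n_components_` holds the number of components that were actually kept.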

In [None]:
from sklearn.model_selection import train_test_split

# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score

target = data_comp_full["class"]
features = data_comp_full.drop("class", axis=1)
X_train, X_test, y_train, y_test = train_test_split(features,target, test_size = 0.30, random_state = 7)

The dataset is split into train and test sets in the ratio 70:30.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

Using grid search with C values 0.01, 0.05, 0.5 and 1, kernels linear and RBF, and cross-validation with 5 folds.

In [6]:
params = {'C': [0.01, 0.05, 0.5, 1], 
          'kernel': ['linear','rbf']}
model = SVC()
#Making models with hyper parameters sets
model1 = GridSearchCV(model, param_grid=params, cv=5,n_jobs=-1)
#Learning
model1.fit(X_train,y_train)
#The best hyper parameters set
print("Best Hyper Parameters:\n",model1.best_params_)
#Prediction
prediction=model1.predict(X_test)
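#importing the metrics module
from sklearn import metrics
#evaluation (accuracy); scikit-learn expects the true labels first, then the predictions
print("Accuracy:",metrics.accuracy_score(y_test,prediction))
#evaluation (confusion matrix); rows are true classes, columns are predicted classes
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test,prediction))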
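In the cell above, PCA is fitted on the full dataset before the train/test split. A hedged sketch of an alternative is to wrap scaling, PCA and the SVM in a scikit-learn `Pipeline`, so that each cross-validation fold fits the scaler and PCA on its own training part only. The data here is synthetic (`make_classification` with 18 features and 4 classes as a stand-in), and the step names `scale`, `pca` and `svc` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape idea as the vehicle data
X, y = make_classification(n_samples=400, n_features=18, n_informative=8,
                           n_classes=4, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),   # z-score each feature
    ("pca", PCA(n_components=7)),  # keep 7 components, as in the reduced analysis
    ("svc", SVC()),
])

# Same grid as above; the 'svc__' prefix routes parameters to the SVC step
param_grid = {"svc__C": [0.01, 0.05, 0.5, 1], "svc__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X, y)

# best_score_ is the mean 5-fold cross-validated accuracy of the best setting
print(search.best_params_, round(search.best_score_, 3))
```

Scores on this synthetic data say nothing about the vehicle dataset; the point is only the structure of the search.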

## Conclusion

From the confusion matrix we can note that only minimal misclassification is found, corresponding to an accuracy of 92.9% (approximately 93%) with kernel = rbf and C = 1.