## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from scipy.stats import zscore
from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

### Loading the dataset.

In [None]:
vehicle_data = pd.read_csv('/kaggle/input/vehicle-silhouettes/vehicle.csv')

### Shape of the dataset. 

In [None]:
vehicle_data.shape

### Datatype information for each column.

In [None]:
vehicle_data.info()

### Checking if there are any null values present in the dataset.

In [None]:
vehicle_data.isnull().sum()

### The target variable 'class' count. 

In [None]:
vehicle_data['class'].value_counts()

### Converting the categorical values of the target column into numerical values.

In [None]:
#Label encode the target class
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
vehicle_data['class'] = labelencoder.fit_transform(vehicle_data['class'])
vehicle_data['class'].value_counts()
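
After encoding, the integer codes no longer show which vehicle type they stand for. The sketch below shows one way to recover that mapping from a fitted `LabelEncoder`. The label list is hypothetical and only stands in for the `class` column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the 'class' column
labels = ["van", "car", "bus", "car", "van"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, and the label at position i is encoded as integer i
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
restored = list(encoder.inverse_transform(codes))
print(restored)
```

Keeping the fitted encoder around makes it possible to report predictions as class names rather than integers.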

### Dataset description.
The step below shows the mean, standard deviation, quartiles, minimum and maximum of each column in the dataset.

In [None]:
vehicle_data.describe().transpose()

In [None]:
vehicle_data.head()

# 1. Performing necessary data pre-processing steps.

## Dealing with Missing Values

There are missing values in the vehicle dataset, and they have to be dealt with before we train a model. We will replace each missing value with the ***mean value*** of its column.

In [None]:
vehicle_data.isnull().sum()

### The heatmap below shows where missing values occur in the dataset.

In [None]:
sns.heatmap(vehicle_data.isnull(),yticklabels=False,cbar=False,cmap='viridis')

### Replacing the missing values with the column means.

In [None]:
vehicle_data.fillna(vehicle_data.mean(), inplace=True)
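
To make the effect of mean imputation concrete, here is a minimal sketch on a toy frame. The column names `a` and `b` are made up for illustration. `DataFrame.mean` skips missing values by default, so each gap is filled with the mean of the observed entries in its own column.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (hypothetical data)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})

# Column means are computed over the non-missing entries only
column_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column with its matching entry
filled = toy.fillna(column_means)
print(filled)
```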

#### The heatmap below shows that no missing values remain in the dataset.

In [None]:
sns.heatmap(vehicle_data.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
vehicle_data.isnull().sum()

### Below are box plots of the columns, grouped by class. The outliers present in the dataset are visible in the box plots.

In [None]:
num_features = list(vehicle_data.select_dtypes(np.number).columns)

plt.figure(figsize=(20,20))
for i, col in enumerate(num_features, start=1):
    plt.subplot(5, 4, i)
    # Pass x and y as keywords; newer seaborn releases do not accept them positionally
    sns.boxplot(x=vehicle_data['class'], y=vehicle_data[col])
plt.show()

## Dealing with the outliers present in the dataset.

In [None]:
vehicle_data.drop(vehicle_data[vehicle_data['radius_ratio']>276].index,axis=0,inplace=True)

In [None]:
vehicle_data.drop(vehicle_data[vehicle_data['pr.axis_aspect_ratio']>77].index,axis=0,inplace=True)

In [None]:
vehicle_data.drop(vehicle_data[vehicle_data['max.length_aspect_ratio']>14.5].index,axis=0,inplace=True)
vehicle_data.drop(vehicle_data[vehicle_data['max.length_aspect_ratio']<2.5].index,axis=0,inplace=True)

In [None]:
vehicle_data[vehicle_data['scaled_variance']>292]

In [None]:
vehicle_data.drop(vehicle_data[vehicle_data['scaled_variance.1']>989.5].index,axis=0,inplace=True)

In [None]:
vehicle_data.drop(vehicle_data[vehicle_data['scaled_radius_of_gyration.1']>87].index,axis=0,inplace=True)

In [None]:
vehicle_data.drop(vehicle_data[vehicle_data['skewness_about']>19.5].index,axis=0,inplace=True)

In [None]:
vehicle_data.drop(vehicle_data[vehicle_data['skewness_about.1']>40].index,axis=0,inplace=True)

In [None]:
print("Shape of the dataset after fixing the outliers:",vehicle_data.shape)

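The cut-off values used above were read off the box plots by hand. As a rough sketch, the same kind of filter can be written generically with the 1.5 × IQR whisker rule that box plots use by default. The helper names `iqr_bounds` and `drop_outliers` and the toy data are hypothetical, and this is not necessarily how the thresholds above were derived.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) whisker limits for one column."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(df, columns, k=1.5):
    """Keep only rows that lie inside the whiskers for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        low, high = iqr_bounds(df[col], k)
        # between() is inclusive at both ends and is False for missing values
        mask &= df[col].between(low, high)
    return df[mask]

# Toy data: 100 is far outside the whiskers of column x
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 100], "y": [10, 11, 12, 13, 14, 15]})
cleaned = drop_outliers(toy, ["x", "y"])
print(cleaned.shape)
```

Unlike the cells above, which drop rows one column at a time, this sketch computes all limits on the full frame before filtering, so the two approaches can remove slightly different rows.
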
# 2. Understanding how the attributes relate to each other and finding the correlation between them.

In [None]:
sns.pairplot(vehicle_data,diag_kind='kde', hue='class')
plt.show()

From the pair plots above we can see that many columns are ***correlated***, and that no column has a long tail, which indicates that ***no outliers remain***.

In [None]:
num_features = list(vehicle_data.select_dtypes(np.number).columns)

plt.figure(figsize=(20,20))
for i, col in enumerate(num_features, start=1):
    plt.subplot(5, 4, i)
    # distplot is deprecated; histplot with a KDE overlay draws a comparable plot
    sns.histplot(vehicle_data[col], kde=True, stat='density')
plt.show()

From the plots above we can see that most of the columns are roughly normally distributed, while some, such as distance_circularity and elongatedness, have multiple peaks.

In [None]:
plt.figure(figsize=(20,4))
sns.heatmap(vehicle_data.corr(),annot=True)
plt.show()

1. Our main goal is to categorize an object's vehicle class (for example, van or car) based on the input features.
2. We would like the features used for this classification to be truly independent, with no multicollinearity between them.
3. If two features are highly correlated, there is little benefit in using both. In such a scenario we can "drop one feature".
4. The heatmap above shows the correlation matrix, where we can see which features are highly correlated.
5. From this correlation matrix we can see that many features are highly correlated.
6. For example, "scaled_variance.1" and "scatter_ratio" have a correlation of 0.99 (~1).
7. In total, 8 columns are highly correlated with other columns.
8. They are max.length_rectangularity, scaled_radius_of_gyration, skewness_about.2, scatter_ratio, elongatedness, pr.axis_rectangularity, scaled_variance and scaled_variance.1.
9. Since some features are correlated, we will remove this redundancy so that the inputs to the model are uncorrelated.
   "***We will achieve this by using PCA for dimensionality reduction***".

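Reading highly correlated pairs off an annotated heatmap is error-prone when there are many columns. As a hedged sketch, the pairs can also be listed programmatically from the correlation matrix. The helper name `correlated_pairs`, the 0.9 threshold and the toy columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: b is almost a multiple of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```

On the vehicle data, a call such as `correlated_pairs(vehicle_data.drop(columns='class'))` would list the candidate pairs directly.
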
# 3. Split the data into train and test sets.

#### Standardising the values in the dataset before training a model.

In [None]:
scaler = StandardScaler()
scaled_df = scaler.fit_transform(vehicle_data.drop(columns = 'class'))

In [None]:
X = scaled_df
y = vehicle_data['class']

X_train, X_test, Y_train, Y_test = train_test_split(X,y, test_size = 0.3,random_state = 10)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

# 4. Train a support vector machine on the train set and get the accuracy on the test set using the original scaled attributes.

In [None]:
# Training an SVC using the actual attributes(scaled)

model = SVC(gamma = 'auto')

model.fit(X_train,Y_train)

score_using_actual_attributes = model.score(X_test, Y_test)

print(score_using_actual_attributes)

# 5. Perform K-fold cross-validation on the original scaled attributes and get the cross-validation score of the model.

In [None]:
model = SVC(C=1, kernel="rbf", gamma='auto')

scores = cross_val_score(model, X, y, cv=10)

CV_score = scores.mean()
print(CV_score)
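
In the cell above, the scaler was fitted on the whole dataset before cross-validation, so every fold has already seen statistics from its own validation rows. One common alternative, sketched here as an assumption rather than as part of the original analysis, is to put the scaler and the classifier in a pipeline so that scaling is refitted inside each fold. The example uses scikit-learn's bundled wine dataset as a stand-in for the vehicle data.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset; the vehicle features and labels would be used the same way
X_demo, y_demo = load_wine(return_X_y=True)

# The scaler is fitted on the training part of each fold only
pipeline = make_pipeline(StandardScaler(), SVC(C=1, kernel="rbf", gamma="auto"))

demo_scores = cross_val_score(pipeline, X_demo, y_demo, cv=10)
print(demo_scores.mean())
```

With a dataset of this size the difference in score is usually small, but the pipeline version gives a cleaner estimate of how the model behaves on unseen data.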

# 6. Using PCA from scikit-learn, extract the principal components that capture about 95% of the variance in the data.

In [None]:
pca = PCA().fit(scaled_df)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
print(np.cumsum(pca.explained_variance_ratio_))

In [None]:
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5,align='center')
plt.ylabel('variation explained')
plt.xlabel('principal component')
plt.show()

In [None]:
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_),where= 'mid')
plt.ylabel('cumulative variation explained')
plt.xlabel('number of principal components')
plt.show()
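
The number of components needed for 95% of the variance can also be computed instead of read off the plot. The sketch below shows two equivalent routes: finding the first position where the cumulative ratio reaches 0.95, and passing a fraction to `PCA(n_components=...)`, which scikit-learn interprets as a variance target. The wine dataset is used here only as a stand-in for the scaled vehicle features.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data; scaled_df from the notebook would be used the same way
X_demo, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_demo)

# Route 1: first index where the cumulative explained variance reaches 95%
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1

# Route 2: let PCA choose the number of components for a 95% variance target
pca_auto = PCA(n_components=0.95).fit(X_scaled)

print(n_components, pca_auto.n_components_)
```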

## 7. Picking 8 principal components, as the first 8 capture more than 95% of the variance in the data.
## 7.a. Let's split the dataset into training and test data.

In [None]:
pca = PCA(n_components=8)

X = pca.fit_transform(scaled_df)
Y = vehicle_data['class']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state=10)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

## 7.b. Train a support vector machine using the principal components as features.

In [None]:
# Training an SVC using the PCs instead of the actual attributes 
model = SVC(gamma= 'auto')

model.fit(X_train,Y_train)

score_PCs = model.score(X_test, Y_test)

print(score_PCs)

## 7.c. Perform K-fold cross-validation on the principal components and get the cross-validation score of the model.

In [None]:
model = SVC(C=1, kernel="rbf", gamma='auto')

# X now holds the 8 principal components; use the matching target Y from the same cell
scores = cross_val_score(model, X, Y, cv=10)

CV_score_pca = scores.mean()
print(CV_score_pca)

## 8. Compare the accuracy scores and cross-validation scores of the support vector machines, one trained using raw data and the other using principal components, and mention your findings.

In [None]:
matrix = pd.DataFrame({'SVC' : ['All scaled attributes', '8 principal components'],
                      'Accuracy' : [score_using_actual_attributes,score_PCs],
                      'Cross-validation score' : [CV_score,CV_score_pca]})
matrix

## Conclusion:
***From the above we can conclude that PCA does a good job.
Accuracy with PCA is approximately 95%, against approximately 96% with the original attributes.
Reaching about 95% accuracy with only 8 dimensions instead of the initial 18 is a very good result:
nearly what we achieve with 18 dimensions can be achieved with 8.
The SVC algorithm is a good fit for this dataset, as it gives high accuracy (94%) and a high cross-validation score (94%)***.