# Car Evaluation

> ## Business Problem: 

To classify a car as **acceptable**, **unacceptable**, **good** or **very good** based on its price, characteristics and maintenance cost.

> ## Description : 

 - **Model concept structure**:

      
       Price                    overall price
       Maintenance Cost         price of the maintenance
       Number of Doors          number of doors
       Capacity                 capacity in terms of persons to carry
       Size of Luggage boot     the size of luggage boot
       safety                   estimated safety of the car

  
 - **Number of Instances**: 1728
   (instances completely cover the attribute space)

 - **Number of Attributes**: 6

 - **Attribute Values**:

       Price                   v-high, high, med, low
       Maintenance Cost        v-high, high, med, low
       Number of Doors         2, 3, 4, 5-more
       Capacity                2, 4, more
       Size of Luggage boot    small, med, big
       safety                  low, med, high

 - **Missing Attribute Values**: None

 - **Class Distribution (number of instances per class)**

       class      N          N[%]
       -----------------------------
       unacc     1210     (70.023 %) 
       acc        384     (22.222 %) 
       good        69     ( 3.993 %) 
       v-good      65     ( 3.762 %) 

> ## Approach : 

You will understand the dataset and build a model in the following stages: 

- **Data Specifications and Cleaning**
- **Exploratory Data Analysis**
      - Uni-Variate Analysis : Pie Charts
      - Bi-Variate Analysis : Stacked Bar Graphs , Violin Plots and Box Plots
- **Data Processing , Label Encoding**
- **Data Splitting into Train and Test sets**
- **Modelling and Hypertuning**
      - KNN Classifier, HyperTuning
      - Random Forest, HyperTuning: Grid Search
- **Conclusion**

## `1` Data Specifications and Cleaning

**`1.1` Importing basic Libraries and Dataset**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('C:/Users/kusht/Downloads/car_evaluation.csv',header=None)
# The original dataset has no header row, so assign the column names manually

data.columns = ['Price', 'Maintenance Cost', 'Number of Doors', 'Capacity', 'Size of Luggage Boot', 'safety', 'Decision']

In [None]:
data.head()

**TASK : Print the shape of the dataset**

In [None]:
### START CODE HERE (~ 1 line of code)

### END CODE HERE

**`1.2` Data Insights**

**TASK : Print information about dataset using `info()` method**

In [None]:
###ENTER CODE HERE (~ 1 line of code)

###END CODE HERE 

**TASK: Describe the dataset using the `describe()` method**

In [None]:
###ENTER CODE HERE (~ 1 line of code)

###END CODE HERE 

**TASK: Print counts of each value of each attribute using for loop and `value_counts()` method**

In [None]:
### START CODE HERE (~1 Line of code)
for i in data.columns : 
    # write code to print value_counts 

### END CODE

Having seen the basic characteristics of the dataset (its shape, and whether it is balanced or skewed), keep these details in mind when analysing the data and building a model.

## `2` Exploratory Data Analysis 

After getting a first insight into the data, you'll now understand it better through visualisations, doing Uni-Variate and Bi-Variate analysis.

## `2.1` Uni-Variate Analysis

### - PIE CHARTS 

**TASK : Make `Pie chart` for `Decision`**

**HINT** : The size of each pie wedge is the count of the corresponding value. Use `value_counts()` to find the sizes.
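
To illustrate the hint without giving away the task, here is a minimal sketch of a pie chart built from `value_counts()`. The `fruit` column and its values are made up for illustration and are not part of the car dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy column, not the car dataset.
fruit = pd.Series(["apple", "apple", "apple", "pear", "pear", "plum"], name="fruit")

counts = fruit.value_counts()          # apple 3, pear 2, plum 1
labels = counts.index.tolist()         # wedge labels, in the same order as the sizes
sizes = counts.values                  # wedge sizes
explode = [0.05] * len(sizes)          # small gap between wedges

fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(
    sizes, labels=labels, explode=explode, autopct="%.1f%%"
)
ax.set_title("Share of each fruit")
ax.axis("off")
ax.legend(wedges, labels, loc="best")
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```

Taking both the labels and the sizes from the same `value_counts()` result keeps them in matching order.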

In [None]:
## Make a PIE chart of `Decision`

## START CODE HERE : (FULL CODE , FILL THE PLACES WITH A SINGLE HASH)

# labels are the names of the elements in the feature that we want to output in the pie chart 
labels = #

# Colors provides the colors of the pie wedges
colors = #

# Sizes of the pie wedges come from the values returned by the value_counts method 
size = #

# Explode provides the spacing between each pie wedge 
explode = #

# Set the figure size to (6,6)

# Plot Pie chart 
### autopct shows the percentage on each wedge 

# Use plt.pie and use the above set variables as arguments to create a pie chart 

# Set Title , Axis = 'off' and Legend
    
# Show the plot 


### END CODE HERE

Analyse this graph and justify the imbalance with a suitable reason

**TASK : MAKE A `PIE CHART` OF `Price`**


 - print the value counts of each element in attribute `Price` using `value_counts` method

In [None]:
###ENTER CODE HERE (~1 Line of code)

###END CODE HERE

- Make similar <i>pie chart</i>  for `Price` 

In [None]:
## PIE CHART OF 'PRICE'

### START CODE HERE : (FULL CODE)




### END CODE

##### GOOD WORK! 
However, since all the other attributes are evenly balanced, every remaining uni-variate plot would look the same: the percentages are always `33.3%` for an attribute with three values and `25%` for one with four. This completes the Uni-Variate Analysis, so let's move on to Bi-Variate analysis to understand the data better! 
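
The even balance follows from the fact, stated in the description, that the instances completely cover the attribute space. As a small sketch using only the attribute values listed above (the dictionary name `levels` is an illustrative choice), enumerating every combination reproduces the 1728 rows and the equal counts per value:

```python
from collections import Counter
from itertools import product

# Attribute values as listed in the dataset description above.
levels = {
    "Price": ["v-high", "high", "med", "low"],
    "Maintenance Cost": ["v-high", "high", "med", "low"],
    "Number of Doors": ["2", "3", "4", "5-more"],
    "Capacity": ["2", "4", "more"],
    "Size of Luggage Boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Every combination of attribute values appears exactly once.
grid = list(product(*levels.values()))
print(len(grid))  # 4 * 4 * 4 * 3 * 3 * 3 = 1728

# Count how often each value of each attribute occurs in the full grid.
for position, name in enumerate(levels):
    counts = Counter(row[position] for row in grid)
    print(name, dict(counts))
```

Each value of a four-valued attribute occurs 1728 / 4 = 432 times (25%), and each value of a three-valued attribute occurs 1728 / 3 = 576 times (33.3%).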

## `2.2` Bi-Variate Analysis 

### - STACKED BAR GRAPH

A **STACKED BAR GRAPH** is a convenient, easy-to-read visualisation for two categorical variables. To make one, we'll use an important pandas function, `crosstab`. It builds a cross table of the given attributes in which each cell holds the frequency/count of rows having that combination of values.
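
As a hedged illustration of what `crosstab` returns, the sketch below uses a made-up two-column DataFrame (`size` and `verdict` are hypothetical names, not columns of the car dataset) and also shows the row-wise normalisation used for stacked bars.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns (not the car dataset).
toy = pd.DataFrame({
    "size":    ["small", "small", "big", "big", "big", "small"],
    "verdict": ["no",    "no",    "yes", "no",  "yes", "yes"],
})

# Rows: values of the first argument; columns: values of the second; cells: counts.
ctab = pd.crosstab(toy["size"], toy["verdict"])
print(ctab)

# Dividing each row by its total turns counts into within-row proportions.
row_share = ctab.div(ctab.sum(axis=1), axis=0)
print(row_share)
```

Because every row of `row_share` sums to 1, plotting it as a stacked bar chart gives bars of equal height whose segments show the class proportions.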

**TASK : Understand `crosstab` function by making one between `Price` and `Decision`**

In [None]:
# An example of crosstab between price and decision : 
# Put the variables you want the crosstab between in place of '#'
pd.crosstab(# , #)

**TASK  : Make `Stacked Bar Graph` between all attributes and `Decision` using for loop and crosstab function**

In [None]:
## Making STACKED BAR graphs between all attributes and 'Decision'

for i in data.columns[:-1] :   # every attribute except the target 'Decision' itself
    ## Write the variables for crosstab. 
    ## Remember one variable 'Decision' is fixed and the other variable 'i' gives the column name
    ctab = pd.crosstab(#,#)
    
    ## Dividing each row by its sum gives proportions between 0 and 1
    ## Write the arguments of plot() in place of '#' . To create a bar plot use kind='bar' and don't forget to set stacked=True
    ## Keep the figure size as (6,6)
    ctab.div(ctab.sum(1).astype(float), axis = 0).plot(#)
    
    ## Write Title , labels and legend
    ## Use '{}'.format() to write a title corresponding to each particular column
 
    plt.show()

You can also do the above exercise manually, writing individual plots for each attribute, but a for loop makes it convenient and effortless.

##### GOOD WORK! 
Now you have plotted stacked bar graphs of all attributes. Analyse them and gain insights about the data. Next you'll use another visualisation technique called `VIOLIN PLOTS`, which only works with numerical attributes. So first convert the attributes to numerical ones, either with `LabelEncoder` or by `hard coding` the values manually, and then plot the violin plots.

## `3` Data Processing 

### `3.1` Label encoding

Since most algorithms and many visualisations require numerical data, you have to convert all the categorical attributes into numerical ones. 

You can use `LabelEncoder()` directly, or you can hard code them by individually assigning a value to each category of each attribute. 
Hard coding is preferred here because the attributes are ordinal, so the values and their order are known exactly, and there are only a few attributes and unique values. `LabelEncoder` would assign integers in the sorted (alphabetical) order of the labels, which does not match their natural order (for example `high` < `low` < `med` < `v-high`).
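
The difference between the two options can be sketched on a toy column. The integer values in the explicit mapping are an illustrative assumption; any increasing sequence would preserve the order. Note that `LabelEncoder` assigns its codes by sorting the labels, so its numbering does not follow the natural order of the categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

price = pd.Series(["low", "med", "high", "v-high", "med"])

# LabelEncoder assigns integers following the sorted order of the labels.
encoder = LabelEncoder()
encoded = encoder.fit_transform(price)
print(list(encoder.classes_))   # sorted labels: the index in this list is the code
print(list(encoded))

# Hard coding: an explicit mapping that follows the natural order low < med < high < v-high.
# The integer values themselves are an illustrative choice.
order = {"low": 0, "med": 1, "high": 2, "v-high": 3}
hard_coded = price.map(order)
print(list(hard_coded))
```

With the explicit mapping, a larger number always means a higher price, which keeps plots and distance-based models such as KNN easier to interpret.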

**TASK  : Convert all the categorical attributes to numerical ones using  `LabelEncoder` or Manually `Hard Code` all the categorical attributes as per your choice**

In [None]:
### ENTER CODE HERE (FULL CODE)


### END CODE HERE

##### GREAT! 
You're done with encoding. Now check whether the encoding has worked properly by looking at the dataset and by using the `info()` method.

If `info()` still shows attributes of `object` datatype, these need to be converted to `int` datatype.

**TASK  : Convert all attributes to `int`**

In [73]:
### START CODE HERE: 


### END CODE

### - Violin Plots and Box plots

`Violin plots` and `Box plots` are a great way to visualise numerical data. The plots conveniently give us information such as : 
       
   - Median (a white dot on the violin plot and the center line in the box plot)
   - Interquartile range (the black bar in the center of the violin and the boundaries of the box in the box plot)
   - The lower/upper adjacent values (the black lines stretched from the bar): the most extreme observations that still lie within `first quartile - 1.5 IQR` and `third quartile + 1.5 IQR` respectively. These limits can be used in a simple outlier detection technique (Tukey's fences): observations lying outside of these "fences" can be considered outliers.
   - The width of the violin gives us the relative counts/frequencies of elements having that value
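
The quantities in the list above can be computed by hand. The following sketch uses a small made-up sample and NumPy's default percentile interpolation; plotting libraries may use slightly different quartile conventions.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 12])  # hypothetical sample

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Tukey's fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers reach the most extreme observations that are still inside the fences.
print("median:", median, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("whiskers:", inside.min(), inside.max())
print("outliers:", outliers)
```

Here the value 12 lies above the upper fence, so it would be drawn as an individual outlier point in a box plot.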

<img src="https://miro.medium.com/max/1040/1*TTMOaNG1o4PgQd-e8LurMg.png">

**TASK : Make  `Violin Plots` between all attributes and `Decision`**

You now have to make violin plots of all attributes. Instead of manually writing the code every time, use a `for loop` again to simplify the process. Fill in the spaces where a single `#` is given. 

In [None]:
sns.set(style = 'dark', palette = 'colorblind', color_codes = True)
### START CODE HERE : 

## SET THE FIGURE SIZE USING PLT.RCPARAMS :
#

## SET 5 COLORS OF YOUR CHOICE 
color = #

## PLOT VIOLIN PLOTS FOR ALL ATTRIBUTES EXCEPT 'SAFETY'
## CREATE A VARIABLE 'COLS' HAVING ALL ATTRIBUTE COLUMN NAMES EXCEPT 'SAFETY' (5 COLUMNS, ONE PER COLOR; 'DECISION' GOES ON THE Y AXIS)

#

## FOR LOOP : 
for c,i in zip(color,cols) : 
    ## WRITE CODE FOR VIOLIN PLOT IN THE VARIABLE 'AX' 
    ## REMEMBER TO PUT 'COLOR' ARGUMENT = c 
    #
    ax.set_title('Violin Plot to show relation between {} and Decision'.format(i), fontsize = 20)
    ax.set_xlabel('{} in Increasing range'.format(i), fontsize = 15)
    ## SET THE YLABEL TO 'Decision'
    #
    plt.show()    

**TASK : PLOT `BOXPLOT`**

Plot a boxplot in a similar way, using almost the same code as for the previous violin plots, but use `sns.boxplot` instead of `sns.violinplot`. Create a boxplot between **Safety-Decision** : 

In [None]:
### START CODE HERE (FULL CODE)

#SET STYLE,FIGSIZE,BOXPLOT,TITLE,LABELS

### END CODE HERE

Study the violin plots and box plots to gain better insight into the data.

### `3.2` CORRELATION MATRIX

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses
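
A minimal sketch of how such a table is read, using made-up numeric columns (`x`, `y`, `z` are hypothetical and not part of the car dataset). `DataFrame.corr()` computes the Pearson correlation by default.

```python
import pandas as pd

# Hypothetical numeric columns (not the car dataset).
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # rises exactly in step with x
    "z": [5, 3, 4, 1, 2],    # mostly falls as x rises
})

corr = toy.corr()            # Pearson correlation by default
print(corr.round(2))
```

The diagonal is always 1, the matrix is symmetric, and values near +1 or -1 indicate a strong linear relationship. Passing `corr` to `sns.heatmap(corr, annot=True)` draws the same table as a coloured grid. For ordinal-encoded categories the coefficient is only a rough indicator, since it depends on the integer codes chosen.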

**TASK : Make a `Correlation Matrix` using `corr()` and make it visually aesthetic using `sns.heatmap`**

In [None]:
### START CODE HERE  : (~1 Line of code)

### END CODE  

##### Great!! 
The matrix shows no strong correlation between any of the attributes; the only notable correlations are with `Decision`. This suggests there is little to gain from `multi variate analysis`, so you've completed the analysis of the data. Now it's time to build a model based on that information: split the data into train and test sets and then fit a model on the training set. 

## `4`  Data Split into Train and Test Set  

**TASK  : Split the data into dependent and independent variables and print their shapes**

In [None]:
### Splitting the dataset into dependent and independent variables

## START CODE HERE : (~ 2 Lines of code)


### END CODE

**TASK  : Split the data into train and test sets with test size=0.15, random state = 0 and print their shapes**

In [None]:
## Splitting the dataset into train and test sets
## Import the required Library 

### START CODE HERE : 


### END CODE 
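
As a sanity check on the shapes to expect, the sketch below splits placeholder arrays with the same number of rows as the dataset (1728) and 6 feature columns. The arrays contain dummy values only; the names `X` and `y` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (1728) and 6 features.
X = np.zeros((1728, 6))
y = np.arange(1728) % 4        # dummy labels, only there to have something to split

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)
print(x_train.shape, x_test.shape)   # (1468, 6) (260, 6)
print(y_train.shape, y_test.shape)   # (1468,) (260,)
```

The test set size is 0.15 × 1728 = 259.2 rounded up to 260, which leaves 1468 rows for training. Because the classes are imbalanced, passing `stratify=y` is an option worth considering so that both sets keep similar class proportions; the task as stated does not require it.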

## `5`  Creating a model

You'll be creating models based on two algorithms : 
 - **KNN Classifier**
 - **Random Forest** 

After that, each model will be improved via `HyperTuning`, and finally you'll analyse the results and choose the best model. 

### `5.1` KNN Classifier 

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems . The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other , "Birds of a feather flock together" .

<img src="https://upload.wikimedia.org/wikipedia/commons/e/e9/Map1NNReducedDataSet.png">

As can be seen from the image, the algorithm forms boundaries between the different classes, and elements are classified based on the region they fall in.
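
To make the idea concrete, here is a from-scratch sketch of the nearest-neighbour vote on made-up 2-D points. It is a simplified illustration using Euclidean distance and a plain majority vote, not scikit-learn's implementation; the helper name `knn_predict` is hypothetical.

```python
from collections import Counter

import numpy as np

# Hypothetical 2-D training points with two classes.
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                   [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
labels = np.array(["A", "A", "A", "B", "B", "B"])

def knn_predict(query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    distances = np.linalg.norm(points - query, axis=1)   # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(labels[nearest].tolist())
    return votes.most_common(1)[0][0]

print(knn_predict(np.array([1.2, 1.4])))   # close to the "A" cluster
print(knn_predict(np.array([6.4, 6.2])))   # close to the "B" cluster
```

Since the prediction depends on distances, the numeric encoding of the attributes directly affects which points count as "near".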

**TASK : Make a `KNN classifier model` and print `train and test accuracy` along with `confusion matrix`**

In [100]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

### Make a KNN MODEL : 

### START CODE HERE : 

## Creating a model :

model = # Write the code for a base model of KNeighborsClassifier() with n_jobs=-1

## Fit : 
# fit the model on x_train and y_train 

# Predict the values for x_test using predict() method and put it in a variable 'y_pred'

# Print the training and testing accuracy :

# printing the confusion Matrix : 

### END CODE HERE : 

**Does high accuracy mean that this model is a good predictor?**

Even a high accuracy doesn't necessarily imply a great model, especially with `Imbalanced Multiclass Classification`. Accuracy can be misleading: a model that predicts everything as `unacc` would still score about 70% here, simply because the majority of the instances are actually `unacc`, yet it would be a poor predictor. So you have to analyse the model using other measures such as `F1 score`, `Precision` and `Recall`.

For a better understanding, the formulas for precision, recall and F1 score are: 
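
The point can be checked with a quick calculation. The sketch below builds labels from the class distribution table in the description and scores a hypothetical "model" that always predicts `unacc`. It is evaluated on the whole dataset purely for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the class distribution table above.
counts = {"unacc": 1210, "acc": 384, "good": 69, "v-good": 65}

y_true = [label for label, n in counts.items() for _ in range(n)]
y_pred = ["unacc"] * len(y_true)          # a "model" that always predicts the majority class

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy: {accuracy:.3f}")        # about 0.700
print(f"macro F1: {macro_f1:.3f}")        # about 0.206
```

The accuracy looks respectable, while the macro-averaged F1 score exposes that three of the four classes are never predicted.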

<img src="https://miro.medium.com/max/2000/1*6NkN_LINs2erxgVJ9rkpUA.png">

<img src="https://miro.medium.com/max/752/1*UJxVqLnbSj42eRhasKeLOA.png">

`True Positives` , `False positives` and `False negatives` are found out using the earlier calculated confusion matrix

<img src="https://miro.medium.com/max/1400/1*CPnO_bcdbE8FXTejQiV2dg.png"> 

In short, `recall` expresses the ability to find all relevant instances in a dataset, while `precision` expresses the proportion of the data points our model says are relevant that actually are relevant. A high recall means the model misses few relevant instances, and a high precision means few of the instances it flags as relevant are wrong. The F1 score is the harmonic mean of the two, so it is high only when both precision and recall are high, which is the favorable case. 
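
For a multiclass problem these measures are computed per class. As a worked sketch, assume a made-up 3-class confusion matrix in the layout used by `sklearn.metrics.confusion_matrix` (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are true classes, columns are predictions.
cm = np.array([[50,  5,  0],
               [10, 20,  5],
               [ 0,  5,  5]])

tp = np.diag(cm)                 # correctly predicted instances of each class
fp = cm.sum(axis=0) - tp         # predicted as the class but actually another class
fn = cm.sum(axis=1) - tp         # belong to the class but predicted as another class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(["class 0", "class 1", "class 2"], precision, recall, f1):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Column sums give the denominators for precision and row sums give the denominators for recall, which is why the orientation of the confusion matrix matters when reading it.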

**TASK : Make a `Classification Report` to properly assess the model's performance**

In [115]:
### START CODE HERE (~1 Line of code)


### END CODE HERE

#### KNN HYPERTUNING 
Now try to gain more accuracy by hypertuning the parameter `n_neighbors` (the number of neighbors).

You have to find the best hyperparameter using the code below. Explicit for loops are generally discouraged because they are slower, but we'll use one here so that the scores can be visualised. 
This can also be done with `GridSearch`, which we'll explore in the next model. 

**TASK : Use a for loop to give `n_neighbors` the values from 2 to 29 (`range(2,30)`) and calculate the average `cross validation score` for each of them**

In [103]:
### START CODE HERE : 

## IMPORT LIBRARY FOR CROSS_VAL_SCORE: 

avg_score=[]
for k in range(2,30):
    # PUT ARGUMENTS OF n_jobs and n_neighbors
    knn=KNeighborsClassifier(#)
    ## CALCULATE CROSS_VAL_SCORE with cv=5 and scoring='accuracy' and store it in variable 'score' : 
    #
    avg_score.append(score.mean())

### END CODE

**TASK : Plot the average scores of all `k's`**

In [None]:
### START CODE HERE: 

## Set the figure size to (12,8)
plt.figure(#)
# Plot the figure using plt.plot() where x values are range(2,30) and y values the average scores :

# Keep the xlabel as n_neighbours and ylabel as accuracy

### END CODE

Analyse the curve and extract the two values of `n_neighbors` with the highest accuracies.

**TASK : Calculate `Accuracy` and  `F1 score` for both the values of n_neighbors**

In [None]:
### START CODE HERE : (FULL CODE)

# MAKE MODELS FOR both values of n_neighbors , FIT THEM , PREDICT , PRINT ACCURACY AND CLASSIFICATION REPORT

### END CODE 

Analyse both models, choose the best one and write its characteristics here in the blank spaces : 

 - **Optimised KNN model : n_neighbors = _**
 - **Accuracy ~ `_`**
 - **F1 Score ~ `_`**

This should make it clearer that better accuracy doesn't necessarily mean a better F1 score or a better model. One has to analyse everything before finalising a model.

### `5.2` Random Forest

To understand the Random Forest classifier, let's first get a brief idea about `Decision Trees` in general. 
Decision Trees are very intuitive, and everyone has used one knowingly or unknowingly at some point. Basically, the model keeps sorting the data into categories by the answers to a series of questions (decisions), forming a large tree, and that's why it's called a decision tree. An image example helps to understand it better : 

<img src="https://miro.medium.com/max/1000/1*LMoJmXCsQlciGTEyoSN39g.jpeg">

`Random Forest` : Random forest, like its name implies, consists of a large number of individual decision trees that operate as an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning) . Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction. 

<img src="https://miro.medium.com/max/1000/1*VHDtVaDPNepRglIAv72BFg.jpeg"> 

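The fundamental concept is that a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. Here "uncorrelated" refers to the errors of the individual trees, which random forest encourages by training each tree on a random sample of the rows and a random subset of the attributes. The low correlation between attributes in this dataset also means that each attribute carries its own information, so random forest can be a good option. 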
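
The committee effect can be illustrated with a simplified calculation. Assume, as an idealisation, a two-class problem in which every voter is right with the same probability and the voters' errors are fully independent. Real trees in a forest are only partly independent, so this is an upper-bound style sketch; the helper name `majority_correct` is hypothetical.

```python
from math import comb

def majority_correct(n_voters, p_correct):
    """Probability that more than half of n independent voters are right."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

# Odd committee sizes avoid ties.
for n in (1, 5, 25, 101):
    print(n, round(majority_correct(n, 0.6), 3))
```

Under these assumptions, a committee of voters that are each only 60% accurate becomes steadily more reliable as it grows, which is the intuition behind combining many trees.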

**TASK : Make a Random forest Classifier and print the `Accuracy` and `F1 score`**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
# This allows using the f1_score() function directly

### START YOUR CODE HERE : (FULL CODE)

### END CODE

The accuracy and F1 score above are the base model's measures. Now you will hypertune it using `GridSearch` to make it a better model. 

**Random Forest HyperTuning : Grid Search** 

Grid search is used to find the hyperparameters of a model that give the most 'accurate' predictions. It builds a model for every combination of the specified hyperparameter values and evaluates each one. A more efficient technique for hyperparameter tuning is randomized search, where only a random sample of hyperparameter combinations is tried. However, for a small dataset like the current one, grid search is also fine. 
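
A minimal sketch of the mechanics with scikit-learn's `GridSearchCV`, run on synthetic stand-in data rather than the car dataset. The grid values, the 3-fold setting and the data shape are illustrative assumptions, not recommended settings for the task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the car dataset).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

# Illustrative grid: 2 x 2 = 4 combinations, each evaluated with 3-fold cross validation.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)            # the combination with the best mean CV score
print(round(search.best_score_, 3))   # that mean cross-validated accuracy
```

The number of fitted models grows as the product of the grid sizes times the number of folds (here 4 × 3 = 12, plus one final refit), which is why large grids quickly become expensive.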

**TASK : Do a `GridSearch` and print the best hyperparameters for this random forest classifier**

In [113]:
### Import the required library : 

### START YOUR CODE HERE (FULL CODE)







### END CODE HERE 

**TASK  : Print the `Accuracy` , `Precision` , `Recall` and `F1 Score` for the optimised Random forest model and compare with previous models** 

In [114]:
### START CODE HERE : (FULL CODE)



### END CODE HERE

#### AWESOME! 
Now, you are completely done with `Modelling` and `Hypertuning the Parameters`. The last thing that's left is to write a conclusion stating which model is the best, along with its scores, and that finishes this project. 

## `6` Conclusion

Write the conclusion here