<div class="alert alert-block alert-success">
    <h1 align="center">🚗CAR PRICE PREDICTION</h1>

</div>


    
### **100,000 UK Used Car Dataset**
**100,000 scraped used car listings, cleaned and split into car make.**
    
### Problem Statement
Predict used car prices from the listing attributes in the dataset.
    
### Content
The cleaned dataset contains information on price, transmission, mileage, fuel type, road tax, miles per gallon (mpg), and engine size.
    


## TABLE OF CONTENTS

* [1.Importing Libraries](#1)

* [2.Load the data](#2)
    
* [3.Meta information of dataframe](#3)
    
* [4.Duplicated rows.](#4)

* [5.Handling Missing Values.](#5)
    
* [6.Statistical information of Dataframe](#6) 

* [7.Categorical Features Into Numerical](#7)

* [8.Correlation](#8)

* [9.EDA & Visualization](#9)

* [10.Split training and testing](#10)

* [11.Modelling](#11)

* [12.Evaluation Metric](#12)

* [13.Feature Importance](#13)

<a id="1"></a>
## 1. Import Libraries
    

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
from plotly import tools


%matplotlib inline 



<a id="2"></a>
## Load Data

### Data Description

* **Model:** Model type.
* **Year:** Registration Year.
* **Price:** Price in euros.
* **Transmission:** Type of Gearbox.
* **Mileage:** Distance driven.
* **FuelType:** Engine Fuel.
* **Tax:** Road Tax.
* **mpg:** Miles per Gallon.
* **EngineSize:** Size in litres.
* **Brand:** Name of the car brand.

In [3]:
# Different Car's dataset
audi = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/audi.csv")
bmw = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/bmw.csv")
ford = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/ford.csv")
hyundi = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/hyundi.csv")
merc = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/merc.csv")
skoda = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/skoda.csv")
toyota = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/toyota.csv")
vauxhall = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/vauxhall.csv")
vw = pd.read_csv("../input/used-car-dataset-ford-and-mercedes/vw.csv")

**There is a separate dataset for each brand, so we import all of them.**

In [4]:
audi['brand'] = 'audi'
bmw['brand'] = 'bmw'
ford['brand'] = 'ford'
hyundi['brand'] = 'hyundi'
merc['brand'] = 'merc'
skoda['brand'] = 'skoda'
toyota['brand'] = 'toyota'
vauxhall['brand'] = 'vauxhall'
vw['brand'] = 'vw'

**The individual datasets do not contain a separate column for brand, so we create a `brand` column in each of them.**

In [5]:
df = pd.concat([audi,bmw,
               ford,hyundi,merc,skoda,toyota,
               vauxhall,vw])
df.head()

**Concatenate the per-brand datasets into a single dataframe.**
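The read, tag, and concatenate steps above can also be written as one loop. This is a minimal sketch, not the notebook's own code: the helper name `load_brand_frames` and its parameters are hypothetical, and it assumes each file is named `<brand>.csv` inside one directory.

```python
from pathlib import Path

import pandas as pd


def load_brand_frames(data_dir, brands):
    """Read one CSV per brand and tag every row with its brand name.

    Hypothetical helper: assumes files are named '<brand>.csv' in data_dir.
    """
    frames = []
    for brand in brands:
        frame = pd.read_csv(Path(data_dir) / f"{brand}.csv")
        frame["brand"] = brand
        frames.append(frame)
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    # instead of repeating each file's own row numbers.
    return pd.concat(frames, ignore_index=True)
```

With the file layout used in this notebook, the call would look like `load_brand_frames("../input/used-car-dataset-ford-and-mercedes", ["audi", "bmw", "ford"])`, extended with the remaining brand names.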

In [6]:
df.tail()

In [7]:
# drop tax column
df.drop('tax(£)',axis=1,inplace=True)


**Dropping the `tax(£)` column, because road tax is already stored in the `tax` column. Any rows left with a null `tax` are handled in the missing-values step.**
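An alternative to dropping `tax(£)` outright is to copy its values into `tax` first, so that rows which only had `tax(£)` keep their real road tax. The sketch below uses two made-up toy frames; it assumes (this is not verified against the real files) that a brand file carries either `tax` or `tax(£)`, never both.

```python
import pandas as pd

# Toy frames standing in for two brand files with different tax column names.
# The values are invented for illustration.
with_tax = pd.DataFrame({"model": ["A1"], "tax": [145.0]})
with_tax_gbp = pd.DataFrame({"model": ["I10"], "tax(£)": [20.0]})
toy = pd.concat([with_tax, with_tax_gbp], ignore_index=True)

# Fill the gaps in `tax` from `tax(£)`, then drop the now-redundant column.
toy["tax"] = toy["tax"].fillna(toy["tax(£)"])
toy = toy.drop(columns="tax(£)")
print(toy)
```

Coalescing first means fewer values need to be imputed with the column mean later on.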

In [8]:
df

**Dataframe with all datasets concatenated**

In [9]:
# shuffle the dataset
df = df.sample(frac=1)
df

**The concatenated dataframe is ordered brand by brand, like `audi->bmw->......->vw`. Passing the rows to a model in that order is not good practice, so we shuffle the entire dataframe (`frac=1` samples 100% of the rows in random order).**
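A reproducible variant of the shuffle is sketched below on a toy frame. The `random_state` value and the `reset_index(drop=True)` call are additions for illustration, not part of the original cell. Note also that `train_test_split` shuffles by default, so this step mainly affects how the dataframe looks when displayed or sliced.

```python
import pandas as pd

# Toy frame ordered brand by brand, like the concatenated dataset.
toy = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "bmw", "vw", "vw"],
    "price": [10, 11, 20, 21, 30, 31],
})

# frac=1 returns every row in random order; random_state makes the order
# repeatable, and reset_index(drop=True) discards the old row labels.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```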

<a id="3"></a>
## Meta Information about DataFrame

In [10]:
# info
df.info()

<a id="4"></a>
## Duplicated Rows

In [11]:
# check for duplicated rows
print("Numbers of duplicated rows :",df.duplicated().sum())

In [12]:
#dropping the duplicated rows 
df=df.drop_duplicates(keep="first")
print("After removing,now number of duplicated rows are:",df.duplicated().sum())

<a id="5"></a>
## Handling Missing Values

In [13]:
# Check any null values
missingno.matrix(df)

**Only the `tax` column contains `null` values; all other columns are complete.**

In [14]:
# fill missing tax values with the column mean (plain assignment instead of chained inplace)
df['tax'] = df['tax'].fillna(df['tax'].mean())


In [15]:
missingno.matrix(df)

**Now there are no null values**

<a id="6"></a>
## Statistical Information about DataFrame

In [16]:
# describe
df.describe().T.style.bar(subset=['mean'], color='#205ff2').background_gradient(
subset=['std'],cmap='mako').background_gradient(subset=['50%'],cmap='coolwarm')

<a id="7"></a>
## Categorical Features Into Numerical Features

In [17]:
df_1 =df.copy()
df_1

In [18]:
def preprocess_data(df):
    
    """
    Performs transformation on df and returns transformed df.
    
    """
    for label,content in df.items():

        # Turn non-numeric (categorical) columns into integer codes
        if not pd.api.types.is_numeric_dtype(content):
            # We add +1 to the category code because pandas encodes missing categories as -1
            df[label] = pd.Categorical(content).codes+1
    
    return df

In [19]:
df_transformed = preprocess_data(df_1)
df_transformed.head()

**Turning Categorical values into Numerical values**
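To see which integer each category receives, the encoding can be reproduced on a small series. This is a sketch with made-up values; the `mapping` dictionary is an illustrative extra and is not built anywhere in the notebook. `pd.Categorical` orders the inferred categories by sorting them, so the codes follow alphabetical order.

```python
import pandas as pd

transmission = pd.Series(["Manual", "Semi-Auto", "Automatic", "Manual", "Other"])

cat = pd.Categorical(transmission)
codes = cat.codes + 1  # same +1 shift as in preprocess_data

# Illustrative lookup table from category label to encoded value.
mapping = {category: code + 1 for code, category in enumerate(cat.categories)}
print(mapping)      # {'Automatic': 1, 'Manual': 2, 'Other': 3, 'Semi-Auto': 4}
print(list(codes))  # [2, 4, 1, 2, 3]
```

Keeping such a mapping around makes it possible to encode a new listing by hand without guessing which number stands for which label.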

<a id="8"></a>
## Correlation

In [20]:
# Check the correlation between the target and each column
# (use the encoded dataframe so that every column is numeric)
df_transformed.corr()['price']

In [21]:
plt.figure(figsize=(20,10))

# plot heatmap
sns.heatmap(df_transformed.corr(), annot=True,cmap='coolwarm', linecolor='black')

**The columns `year`, `transmission`, `tax` and `engineSize` are the most strongly *correlated* with *price*.**
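Reading the strongest correlations off a heatmap is easier when they are ranked. The sketch below ranks columns by the absolute value of their correlation with `price`; the toy numbers are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented toy data: price rises with year and falls with mileage,
# while tax is nearly unrelated to price.
toy = pd.DataFrame({
    "price":   [5000, 9000, 12000, 15000, 21000],
    "year":    [2010, 2013, 2015, 2017, 2019],
    "mileage": [90000, 60000, 45000, 30000, 8000],
    "tax":     [145, 30, 145, 20, 145],
})

# Drop the target's self-correlation, take absolute values so that strong
# negative relationships rank as high as strong positive ones, then sort.
ranked = toy.corr()["price"].drop("price").abs().sort_values(ascending=False)
print(ranked)
```

The same expression applied to `df_transformed` would give a ranked list for the real data. Keep in mind that correlations computed on integer category codes depend on the arbitrary ordering of the codes.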

<a id="9"></a>
## EDA & Visualization

In [22]:
plot_columns=[ 'transmission','fuelType','brand']
colors = ["#00FFFF","#FFA500","#ADD8E6","#ED00D9","#ED1400","#EAE7C6","#CF6523","#99ACD2","#4EBA73","#DDA8D7"]
textprops = {"fontsize":22}
i=1
plt.figure(figsize=(45,95))
for col in plot_columns:
    plt.subplot(11,2,i)
    sns.countplot(data=df,x=col,palette='gist_rainbow',order=df[col].value_counts().index)
    plt.xticks(fontsize=25)
    plt.yticks(fontsize=25)
    plt.xlabel(col,fontsize=25)
    plt.ylabel('count',fontsize=25)
    i=i+1
    plt.subplot(11,2,i)
    df[col].value_counts().plot(kind='pie',autopct='%.2f%%',
                               colors=colors,textprops=textprops,shadow=True,radius=1.1)
    plt.xticks(fontsize=25)
    plt.yticks(fontsize=25)
    plt.xlabel(col,fontsize=25)
    plt.ylabel('count',fontsize=25)
    i=i+1
    
plt.show()

**1) The first figure shows value counts for the `transmission` column, with categories `Manual`, `Semi-Auto`, `Automatic` and `other`. `Manual` has the highest count.**

**2) The second figure shows value counts for the `fuelType` column, with categories `Petrol`, `Diesel`, `Hybrid`, `other` and `Electric`. `Petrol` has the highest count.**

**3) The third figure shows value counts for the `brand` column across all brands. `ford` has the highest count.**

In [23]:
# model
plt.figure(figsize=(25,5))
sns.countplot(x='model',data=df[:5000])
plt.title("Model",fontsize=15)
plt.xticks(rotation=90)
plt.show();

plt.figure(figsize=(25,5))
sns.countplot(x='model',data=df[5000:9000])
plt.xticks(rotation=90)
plt.show();


**Value counts for the `model` column over the first 9,000 records**

In [24]:
# Plot scatterplot between model and year, colored by fuelType
plt.figure(figsize=(25,5))
sns.scatterplot(x='model',y='year',data=df[:1000],hue='fuelType',s=150)
plt.title("Model Vs Year")
plt.xlabel("Model",fontsize=15)
plt.ylabel("Year",fontsize=15)
plt.xticks(rotation=90,fontsize=12,fontstyle='oblique')
plt.show();

**This plot shows the relation between `model` and `year`, colored by `fuelType`.**
**Most models from 2006-2020 have the `Diesel` or `Petrol` fuelType.**

In [25]:
# Plot scatterplot between model and year, colored by transmission
plt.figure(figsize=(25,5))
sns.scatterplot(x='model',y='year',data=df[:1000],hue='transmission',s=150,palette=['yellow','lightgreen','black'],legend='full')
plt.title("Model Vs Year")
plt.xlabel("Model",fontsize=15)
plt.ylabel("Year",fontsize=15)
plt.xticks(rotation=90,fontsize=12,fontstyle='oblique')
plt.show();

**This plot shows the relation between `model` and `year`, colored by `transmission`.**
**Most models from 2006-2020 have `Manual` transmission.**

In [26]:
# Plot scatterplot between model and price, colored by fuelType
plt.figure(figsize=(25,5))
sns.scatterplot(x='model',y='price',data=df[:1000],hue='fuelType',s=100,palette=['green','brown','dodgerblue','red'],legend='full')
plt.title("Model Vs price")
plt.xlabel("Model",fontsize=15)
plt.ylabel("Price",fontsize=15)
plt.xticks(rotation=90,fontsize=12,fontstyle='oblique')
plt.show();

**Most of the high-priced `models` have the `Diesel` fuelType, followed by `Petrol`.**

In [27]:
# Plot scatterplot between model and price, colored by transmission
plt.figure(figsize=(25,5))
sns.scatterplot(x='model',y='price',data=df[:1000],hue='transmission',s=100,palette=['blue','maroon','grey'],legend='full')
plt.title("Model Vs price")
plt.ylabel("Price",fontsize=15)
plt.xlabel("Model",fontsize=15)
plt.xticks(rotation=90,fontsize=12,fontstyle='oblique')
plt.show();

**Most of the high-priced `models` have `Manual` transmission, followed by `Automatic`.**

In [28]:
plt.figure(figsize=(10,5))
sns.pointplot(x='transmission',y='year',data=df,hue='fuelType')
plt.title("Transmission With Year")
plt.xlabel('Transmission',fontsize=15)
plt.ylabel('Year',fontsize=15)
plt.legend(bbox_to_anchor=(1.2,1.0),title='FuelType')
plt.grid(True)
plt.show();

**This is a `pointplot` showing the relationship between `transmission` and `year`. The `Electric` fuelType appears mostly with `Automatic` transmission. The `Diesel` fuelType is popular across all transmissions until 2017; after that, demand for `Petrol` rises relative to `Diesel`.**

In [29]:
plt.figure(figsize=(10,5))
sns.pointplot(x='brand',y='year',data=df,hue='fuelType')
plt.title("Brand With Year")
plt.xlabel('Brand',fontsize=15)
plt.ylabel('Year',fontsize=15)
plt.legend(bbox_to_anchor=(1.2,1.0),title='FuelType')
plt.grid(True)
plt.show();

**The `Electric` fuelType appears only for `bmw` and `vauxhall`. The `Hybrid` fuelType fluctuates across brands; for `skoda` it is mostly seen in 2020. The `Petrol` and `Diesel` fuelTypes are common to all brands.**

In [30]:
plt.figure(figsize=(10,5))
sns.stripplot(x='brand',y='price',data=df,hue='transmission')
plt.title("Brand With Price")
plt.xlabel('Brand',fontsize=15)
plt.ylabel('Price',fontsize=15)
plt.legend(bbox_to_anchor=(1.2,1.0),title='Transmission')
plt.show();

**Across brands, cars with `Semi-Auto` transmission tend to have higher prices, and `bmw` and `merc` have the most `Semi-Auto` cars. Demand for `Automatic` transmission is visible for the `toyota` brand. `Manual` transmission is common to all brands, but `merc`, `toyota` and `bmw` have fewer `Manual` cars.**

In [31]:
plt.figure(figsize=(10,5))
sns.stripplot(x='brand',y='year',data=df,hue='fuelType')
plt.title("Brand with Year")
plt.xlabel('Brand',fontsize=15)
plt.ylabel('Year',fontsize=15)
plt.legend(bbox_to_anchor=(1.2,1.0),title='FuelType')
plt.show();

**Most listings, across brands and fuelTypes, fall between 2000 and 2020. All brands have the `Diesel` and `Petrol` fuelTypes.**

In [32]:
sns.set_theme(style="darkgrid")
plt.figure(figsize=(10,5))
sns.barplot(x='brand',y='mileage',data=df,hue='fuelType')
plt.xlabel('Brand',fontsize=15)
plt.ylabel('Mileage',fontsize=15)
plt.legend(bbox_to_anchor=(1.2,1.0),title='FuelType')
plt.grid(True)
plt.show();

**In terms of `mileage`, `Hybrid` cars come out highest for every brand except `hyundi` and `skoda`. `Diesel` and `Petrol` rank second and third respectively, and both are widely used in all brands.**

In [33]:
sns.set_theme(style="darkgrid")
plt.figure(figsize=(10,5))
sns.barplot(x='brand',y='mileage',data=df,hue='transmission')
plt.xlabel('Brand',fontsize=15)
plt.ylabel('Mileage',fontsize=15)
plt.legend(bbox_to_anchor=(1.2,1.0),title='Transmission')
plt.grid(True)
plt.show();

**`Manual` and `Automatic` transmissions are common to all brands with respect to `mileage`.**

In [34]:
sns.set_theme(style="darkgrid")
plt.figure(figsize=(10,5))
sns.barplot(x='brand',y='tax',data=df,hue='transmission',palette='tab20')
plt.xlabel('Brand',fontsize=15)
plt.ylabel('Tax',fontsize=15)
plt.legend(bbox_to_anchor=(1.2,1.0),title='Transmission')
plt.grid(True)
plt.show();

**Across brands, cars with `Automatic` transmission carry higher `tax`, while cars with `Manual` transmission carry lower `tax`.**

In [35]:
sns.set_theme(style="darkgrid")
plt.figure(figsize=(10,5))
sns.barplot(x='brand',y='tax',data=df,hue='fuelType',palette='tab10')
plt.xlabel('Brand',fontsize=15)
plt.ylabel('Tax',fontsize=15)
plt.legend(bbox_to_anchor=(1.2,1.0),title='FuelType')
plt.grid(True)
plt.show();

**Across brands, `Petrol` cars carry the highest `tax`, followed by `Diesel` cars.**

In [36]:
plt.figure(figsize=(15,5))
sns.relplot(x='price',y='year',data=df,hue='fuelType',palette=['green','lightblue','yellow','red','black'])
plt.xlabel("Price")
plt.ylabel("Year")
plt.xticks(rotation=90)
plt.grid(True)
plt.show();

<a id="10"></a>
## Split Dataset into Training and Testing

In [37]:
X = df_transformed.drop('price',axis=1)
y = df_transformed['price']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

len(X_train),len(X_test),len(y_train),len(y_test)

<a id="11"></a>
## Modelling

In [38]:
# Create model and train
def models_score(models,X_train,X_test,y_train,y_test):
    scores = {}
    
    for name, model in models.items():
        model.fit(X_train,y_train)
        y_pred = model.predict(X_test)
        scores[name] = r2_score(y_test,y_pred)
        
        # print the model name and its regression metrics
        print("Model name: ",model)
        print("R2 score : ",r2_score(y_test,y_pred))
        print("Mean Absolute Error : ",mean_absolute_error(y_test,y_pred))
        print("Mean Squared Error : ",mean_squared_error(y_test,y_pred))

        print("\n<<<<------------------------------------------------------------->>>>\n")
        
    model_scores = pd.DataFrame(scores, index=['R2 Score']).T
    model_scores = model_scores.sort_values('R2 Score',ascending=False)
    return model_scores
        

In [39]:
# Initialize the models
np.random.seed(42)
models = {"LinearRegression":LinearRegression(),
          "GradientBoost":GradientBoostingRegressor(),
         "RandomForest":RandomForestRegressor(),
         "XgBoost": XGBRegressor(verbosity=0),
         "DecisionTreeRegressor":DecisionTreeRegressor(),
         "CatBoost":CatBoostRegressor(verbose=0),
         "LightGBM":LGBMRegressor()}

In [40]:
model_scores = models_score(models,X_train,X_test,y_train,y_test)
model_scores
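The mean squared error printed above is in squared price units, which makes it hard to read next to the mean absolute error. Taking its square root gives the RMSE, which is back in price units. A small worked sketch on made-up numbers (not results from the notebook's models):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true prices and predictions; every prediction is off by 1000.
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_hat = np.array([11000.0, 14000.0, 21000.0, 24000.0])

mae = mean_absolute_error(y_true, y_hat)            # 1000.0
rmse = np.sqrt(mean_squared_error(y_true, y_hat))   # sqrt(1e6) = 1000.0
r2 = r2_score(y_true, y_hat)                        # 1 - 4e6 / 125e6 = 0.968
print(mae, rmse, r2)
```

When RMSE is much larger than MAE, a few large errors dominate. Here the two are equal because every error has the same size.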

<a id="12"></a>
## Evaluation Metric

In [41]:
model_scores = model_scores.reset_index().rename({"index":"Models"},axis=1)
model_scores.style.bar()

In [42]:
fig = px.bar(data_frame = model_scores,
             x="Models",
             y="R2 Score",
             color="Models", title = "<b>Models Score</b>", template = 'plotly_dark')

fig.update_layout(bargap=0.2)

fig.show()

In [43]:
label = model_scores['Models']
value = model_scores['R2 Score']


fig = go.Figure(data=[go.Pie(labels = label, values = value, rotation = 90)])

fig.update_traces(textposition='inside',
                  textinfo='percent+label',
                  marker=dict(line=dict(color='#000000', width = 1.5)))

fig.update_layout(title_x=0.5,
                  title_font=dict(size=20),
                  uniformtext_minsize=15,template='plotly_dark')


fig.show()

### Predicting on unseen data with our best performing model

In [44]:

model = CatBoostRegressor(verbose=0)
model.fit(X_train,y_train)


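 **Input values, in the column order of `X`: model = 70,
 year = 2006,
 transmission = 4,
 mileage = 16356,
 fuelType = 1,
 tax = 305.0,
 mpg = 58.0,
 engineSize = 4.4,
 brand = 1. The categorical values are the integer codes produced by `preprocess_data`, which numbers each column's categories in sorted order starting from 1.**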

In [45]:
features = [70,2006,4,16356,1,305.0,58.0,4.4,1]
y_pred = model.predict(features)
print(f"Price in euros : {y_pred}")
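Passing a bare list to `predict` relies on the values being in exactly the training column order. A less error-prone pattern is to build a one-row dataframe with named columns. The sketch below illustrates this with a small `LinearRegression` on invented data; `toy_model`, `train`, `target` and `new_car` are hypothetical names, and the model is a stand-in for the CatBoost model fitted above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented training data with two named feature columns.
train = pd.DataFrame({
    "year": [2012, 2014, 2016, 2018],
    "mileage": [82000, 55000, 43000, 18000],
})
target = pd.Series([6000.0, 9000.0, 12000.0, 15000.0])
toy_model = LinearRegression().fit(train, target)

# One new listing as a one-row frame; reusing train.columns guarantees
# the same column order the model saw during fitting.
new_car = pd.DataFrame([{"mileage": 50000, "year": 2015}], columns=train.columns)
prediction = toy_model.predict(new_car)
print(prediction.shape)  # (1,)
```

The dictionary keys are given out of order on purpose: the `columns=` argument puts them back into the training order.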

<a id="13"></a>
## Feature Importance

In [46]:
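# Print the name and importance score of each feature
# (X.columns is used because df.columns still contains the target `price`, which would misalign the names)
for feature in zip(X.columns,model.feature_importances_):
    print(feature)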

In [47]:
# visualize feature importance
feature_dict = dict(zip(X.columns,list(model.feature_importances_)))
feature_df = pd.DataFrame(feature_dict,index=[0])
feature_df.T.plot.bar(title="Feature Importance",legend=False);
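Built-in importances are specific to each model family. A model-agnostic cross-check is permutation importance from scikit-learn, which measures how much the score drops when one column is shuffled. The sketch below is an assumption-laden illustration: it uses synthetic data and a small random forest rather than the notebook's CatBoost model, and it scores on the training data for brevity, where held-out data would normally be preferred.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends on `signal` only; `noise` is irrelevant.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = 3.0 * X_toy["signal"] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Shuffle each column n_repeats times and record the average drop in score.
result = permutation_importance(forest, X_toy, y_toy, n_repeats=5, random_state=0)
importances = pd.Series(result.importances_mean, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Indexing the result by `X_toy.columns` keeps names and scores aligned, the same concern as with `X.columns` in the cells above.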

## End🔚