# BMW Price prediction

<img src="https://static.bangkokpost.com/media/content/20200305/c1_1872299.jpg" alt="BMW logo" width="800.33" height="200">

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Introduction

Dreaming of owning a BMW? This notebook is a tool to help you find the best-value BMW on the market.



<a class="anchor" id="0.1"></a>

## Table of Contents

1. [Import libraries](#1)
1. [Download datasets](#2)
1. [EDA](#3)
1. [Preparing to modeling](#4)
1. [Tuning models](#5)
    -  [Linear Regression](#5.1)
    -  [Support Vector Machines](#5.2)
    -  [Linear SVR](#5.3)
    -  [MLPRegressor](#5.4)
    -  [Stochastic Gradient Descent](#5.5)
    -  [Decision Tree Regressor](#5.6)
    -  [Random Forest with GridSearchCV](#5.7)
    -  [XGB](#5.8)
    -  [LGBM](#5.9)
    -  [GradientBoostingRegressor with HyperOpt](#5.10)
    -  [RidgeRegressor](#5.11)
    -  [BaggingRegressor](#5.12)
    -  [ExtraTreesRegressor](#5.13)
    -  [AdaBoost Regressor](#5.14)
    -  [VotingRegressor](#5.15)
1. [Models comparison](#6)
1. [Prediction](#7)

# 1.  Data import and cleaning

## Dataset explanation

The data come from the 100,000 UK Used Car dataset (https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes). This notebook considers only BMW cars and uses the following features to predict price:

1. model : model of the car (e.g. 5 Series, X3)
2. year : year the car was first purchased
3. transmission : transmission type (manual, automatic or semi-auto)
4. mileage : total mileage of the car
5. tax : road tax incurred
6. mpg : fuel economy in miles per gallon
7. engineSize : engine size in litres
8. fuelType : fuel type (e.g. Diesel, Petrol)

In [None]:
df = pd.read_csv('../input/used-car-dataset-ford-and-mercedes/bmw.csv')
df.head()

In [None]:
df.info()

The dataset is complete: `df.info()` shows no missing values in any column, so the preprocessing step does not need imputation.
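The missing-value check can also be made explicit by counting nulls per column. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the BMW data); the same two calls would apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the BMW data (values are invented)
toy = pd.DataFrame({
    "model": ["5 Series", "X3", "1 Series"],
    "mileage": [67068, 14827, np.nan],
    "engineSize": [2.0, 2.0, 1.5],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Only columns with at least one missing value would need imputation
columns_needing_imputation = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_needing_imputation)
```

An empty list from the last line would confirm that imputation can be skipped.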

To gain further insight, I add a new column holding the vehicle's age, computed from its year, since age may be a useful predictor of car price.

In [None]:
# Vehicle age in years, taking 2020 as the present year (present year - year of purchase)
df['age'] = 2020 - df['year']
df = df.drop(columns = 'year')

In [None]:
# Check for possible data-entry errors: petrol and diesel cars listed with an engine size of 0
df[df['engineSize'] == 0]

In [None]:
# Drop rows whose fuelType is Diesel or Petrol but whose engineSize is 0.0, then re-check what remains
df = df.drop(df[(df['engineSize'] == 0) & (df['fuelType'].isin(['Diesel','Petrol']))].index)
df[df['engineSize'] == 0]

# 2. Exploratory Data Analysis

In [None]:
df.dtypes

In [None]:
df['engineSize'].value_counts()

In [None]:
cat_col = df.select_dtypes(include = object).columns.tolist() + ['engineSize']
num_col = df.select_dtypes(exclude = object).columns.tolist()
num_col.remove('engineSize')

## Univariate analysis

In [None]:
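# Numerical features: histogram with a KDE overlay for each column
fig = plt.figure(figsize=(20,20))
sns.set_style('darkgrid')
for index,col in enumerate(num_col):
    plt.subplot(3,2,index+1)
    sns.histplot(df[col], kde=True)  # histplot replaces the deprecated distplot
fig.tight_layout(pad=1.0)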

Data observation #1

In [None]:
# Categorical features: count plot per column, most frequent category first
fig = plt.figure(figsize=(20,20))
sns.set(style='darkgrid', font_scale=1.5)
for index,col in enumerate(cat_col):
    plt.subplot(2,2,index+1)
    sns.countplot(x=df[col], order=df[col].value_counts().index)
    if index == 0:
        plt.xticks(rotation=90)  # rotate the long model names so they stay readable
fig.tight_layout(pad=1.0)

Data observation #2

## Bivariate Analysis

In [None]:
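# Price vs. mileage, colored by transmission and sized by vehicle age
# (relplot is figure-level and creates its own figure, so no plt.figure call is needed)
sns.set(font_scale = 1.0)
sns.relplot(x="mileage", y="price", hue="transmission", size="age",
            sizes = (20,200), alpha=.5, palette="muted", aspect = 1.5 ,data=df)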

In [None]:
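# Correlation heatmap over the numeric columns only (text columns cannot be correlated)
ax = sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='RdBu')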

Finding 1

In [None]:
sns.set(style="ticks")

# Initialize the figure
f, ax = plt.subplots(figsize=(8, 10))

# Plot the price distribution of each model as horizontal boxes, ordered by median price
sns.boxplot(x="price", y="model", data=df,
            whis=[0, 100], palette="vlag",
            order = df.groupby('model')['price'].median().sort_values().index)

# Tweak the visual presentation
ax.xaxis.grid(True)
ax.set(ylabel="Model")
sns.despine(trim=True, left=True)

Finding 2

# 3. Data preprocessing

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import ElasticNet
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import xgboost as xgb

model_list = [(ElasticNet(),'ElasticNet'),
              (SGDRegressor(),'SGDRegressor'),
              (SVR(kernel='linear'),'SVR-linear'),
              (SVR(kernel='rbf'),'SVR-rbf'),
              (RandomForestRegressor(),'RandomForestRegressor'),
              (xgb.XGBRegressor(),'XGBoost')
             ]

In [None]:
X = df.copy().drop(columns='price')
y = df['price'].copy()
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=1, test_size = 0.2)

In [None]:
cat_col = ['model', 'transmission', 'fuelType']
num_col = ['mileage', 'tax', 'mpg', 'age','engineSize']

In [None]:
# Scale the numerical columns and one-hot encode the categorical ones;
# handle_unknown='ignore' encodes categories unseen during fitting as all zeros
full_pipeline = ColumnTransformer([
    ('num', StandardScaler(), num_col),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_col)
])

In [None]:
X_train_prepared = full_pipeline.fit_transform(X_train)

In [None]:
model_score = []

# Score each candidate with 4-fold cross-validation (R^2) on the prepared training set
for model, name in model_list:
    score = cross_val_score(model,X_train_prepared,y_train,cv=4, scoring='r2')
    print(f'{name} score = {score.mean()}')
    model_score.append([name,score.mean()])

Based on the cross-validation scores, we continue development with the XGBoost model.
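One caveat about the comparison above: the scaler and encoder were fitted on the whole training set before cross-validation, so each validation fold has already influenced the preprocessing. A possible variant, sketched below, wraps the preprocessing and the model in a single `Pipeline` so that every fold fits its own scaler and encoder. The data here are synthetic and the names `X_demo`, `y_demo` and `demo_pipeline` are hypothetical; this is an illustration of the idea, not the notebook's actual procedure or results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data (invented values, not the BMW dataset)
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "mileage": rng.uniform(1_000, 100_000, n),
    "age": rng.integers(0, 15, n),
    "transmission": rng.choice(["Manual", "Automatic", "Semi-Auto"], n),
})
y_demo = 30_000 - 0.1 * X_demo["mileage"] - 1_000 * X_demo["age"] + rng.normal(0, 500, n)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["mileage", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["transmission"]),
])

# Chaining preprocessing and model means each CV fold fits the scaler and
# encoder on its own training part only
demo_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", ElasticNet()),
])

# Pass the raw (unprepared) features; the pipeline handles the transformation per fold
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=4, scoring="r2")
print(f"mean R^2 = {demo_scores.mean():.3f}")
```

With this layout the same loop over `model_list` could be kept, building one pipeline per candidate and passing `X_train` instead of `X_train_prepared`.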