# Car Price Prediction Exercise

This notebook trains a machine learning model to predict car prices in Spain from the car's brand, age, fuel type, mileage, and gearbox type. Running it creates the trained model and saves it, together with the fitted one-hot encoder, in the working directory. You can use both files in your final product.

## Task
Build a web interface, web application, or API that predicts car prices from user inputs (make, model year, fuel type, gearbox type, and mileage) using the provided pre-trained machine learning model.

## Requirements ##
1. Load the pre-trained car price prediction model created by this Python notebook.
2. Create a web interface, web application, or API that accepts the following user inputs:
    1. Car make (e.g. BMW)  
    2. Model year (e.g. 2018)  
    3. Fuel type (e.g. Diesel)  
    4. Transmission type (e.g. Manual)  
    5. Mileage (e.g. 10000)  
3. Encode the user inputs the same way the notebook does, then pass them to the pre-trained model to generate a predicted car price.
4. Display the predicted price back to the user.
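
Requirements 3 and 4 depend on rebuilding the feature layout used in training: age and mileage first, then the one-hot columns, all with string column names. The following is a minimal sketch of that step, not the notebook's own code. `make_features` and `predict_price` are hypothetical helpers, and the four-row `toy` frame contains made-up values so that the example runs on its own. A real application would load `enc` and `regr` from the saved `encoder` and `model.joblib` files instead of fitting them.

```python
import datetime

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

CATEGORICAL = ["brand", "fuel", "gearbox"]

# Made-up rows standing in for used_cars_data.csv (illustration only).
toy = pd.DataFrame({
    "brand": ["BMW", "SEAT", "Audi", "SEAT"],
    "fuel": ["Diésel", "Gasolina", "Diésel", "Gasolina"],
    "gearbox": ["Manual", "Manual", "Automatica", "Automatica"],
    "year": [2018, 2015, 2020, 2012],
    "mileage (kms)": [60000, 120000, 20000, 180000],
    "price (eur)": [20000, 8000, 30000, 5000],
})


def make_features(enc, brand, year, fuel, gearbox, mileage, current_year=None):
    """Return a one-row frame laid out like the notebook's X."""
    if current_year is None:
        current_year = datetime.datetime.now().year
    cats = pd.DataFrame([[brand, fuel, gearbox]], columns=CATEGORICAL)
    onehot = pd.DataFrame(enc.transform(cats).toarray())
    nums = pd.DataFrame({"age": [current_year - year], "mileage (kms)": [mileage]})
    row = pd.concat([nums, onehot], axis=1)
    row.columns = row.columns.astype(str)  # the model is fitted on string names
    return row


# Fit a throwaway encoder and model; a real app would load the saved ones.
enc = OneHotEncoder(handle_unknown="ignore").fit(toy[CATEGORICAL])
X = pd.concat(
    [
        make_features(enc, b, y, f, g, m, current_year=2024)
        for b, y, f, g, m in zip(
            toy["brand"], toy["year"], toy["fuel"], toy["gearbox"], toy["mileage (kms)"]
        )
    ],
    ignore_index=True,
)
regr = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, toy["price (eur)"])


def predict_price(brand, year, fuel, gearbox, mileage, current_year=None):
    """Encode one set of user inputs and return the predicted price in EUR."""
    row = make_features(enc, brand, year, fuel, gearbox, mileage, current_year)
    return float(regr.predict(row)[0])


print(predict_price("BMW", 2018, "Diésel", "Manual", 60000, current_year=2024))
```

Passing `current_year` explicitly keeps the age calculation reproducible; leaving it out falls back to the current calendar year, as the notebook does.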

## Guidelines ##
1. Use any frontend (e.g. React, Vue), backend (e.g. Flask, FastAPI, Node.js), or web application framework (e.g. Streamlit, Django).
2. Host the app locally; there is no need to deploy it online.
3. Include the code in your GitHub repository for review.
4. Don't forget the README file and comments in your code!


In [266]:
import pandas as pd
import datetime
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

import pickle
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score



## Data Ingestion

In [267]:
df = pd.read_csv("used_cars_data.csv")

In [268]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 792 entries, 0 to 791
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     792 non-null    int64 
 1   brand          792 non-null    object
 2   model          792 non-null    object
 3   price (eur)    792 non-null    int64 
 4   engine         792 non-null    object
 5   year           792 non-null    int64 
 6   mileage (kms)  792 non-null    int64 
 7   fuel           792 non-null    object
 8   gearbox        792 non-null    object
 9   location       792 non-null    object
dtypes: int64(4), object(6)
memory usage: 62.0+ KB


## Data Preprocessing

In [269]:
# One-hot encode the categorical features.
# handle_unknown='ignore' encodes unseen categories as all zeros instead of raising an error.

enc = OneHotEncoder(handle_unknown='ignore')

In [270]:
X = df[['brand', 'fuel', 'gearbox']]

In [271]:
enc.fit(X)

In [272]:
enc.categories_

[array(['Abarth', 'Alfa', 'Audi', 'BMW', 'Chevrolet', 'Citroen', 'Cupra',
        'DS', 'Dacia', 'Fiat', 'Ford', 'Honda', 'Hyundai', 'Jaguar',
        'Jeep', 'Kia', 'Land', 'Lexus', 'Mazda', 'Mercedes', 'Mini',
        'Mitsubishi', 'Nissan', 'Opel', 'Peugeot', 'Porsche', 'Renault',
        'SEAT', 'Skoda', 'Smart', 'Ssangyong', 'Subaru', 'Suzuki',
        'Toyota', 'Volkswagen', 'Volvo'], dtype=object),
 array(['Diésel', 'Eléctrico', 'GLP', 'Gasolina', 'Híbrido'], dtype=object),
 array(['Automatica', 'Manual'], dtype=object)]
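
The encoder was fitted on the Spanish labels shown above. Because it was created with `handle_unknown='ignore'`, a label it has not seen (for example `Diesel` without the accent, or `Automatic`) is silently encoded as all zeros rather than raising an error. A web form therefore needs to translate its own labels before calling the encoder. The sketch below shows one way to do that. The English labels and the names `FUEL_LABELS`, `GEARBOX_LABELS`, and `to_training_label` are assumptions for illustration and are not part of the notebook.

```python
# Hypothetical mapping from user-facing labels to the labels seen in training.
FUEL_LABELS = {
    "diesel": "Diésel",
    "electric": "Eléctrico",
    "lpg": "GLP",
    "petrol": "Gasolina",
    "hybrid": "Híbrido",
}
GEARBOX_LABELS = {
    "automatic": "Automatica",
    "manual": "Manual",
}


def to_training_label(value, mapping):
    """Translate a user-facing label, failing loudly on unknown values."""
    key = value.strip().lower()
    if key not in mapping:
        raise ValueError(
            f"Unsupported value {value!r}; expected one of {sorted(mapping)}"
        )
    return mapping[key]


print(to_training_label("Diesel", FUEL_LABELS))
print(to_training_label("Automatic", GEARBOX_LABELS))
```

Raising an error for unknown labels is a design choice: it surfaces bad input to the user instead of returning a price based on an all-zero encoding.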

In [273]:
enc.transform(df[['brand', 'fuel', 'gearbox']]).toarray()

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [274]:
X_features = pd.DataFrame(enc.transform(df[['brand', 'fuel', 'gearbox']]).toarray())

In [275]:
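# Feature engineering: derive the car's age from its model year.
# The age is relative to the year in which the notebook is run.
year = datetime.datetime.now().year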

In [276]:
df['age'] = year-df['year']

In [277]:
X_num = df[['age', 'mileage (kms)']]
y = df['price (eur)']

In [278]:
X = pd.concat([X_num, X_features], axis=1)

In [279]:
y[:5]

0     8990
1     9990
2    13490
3    24990
4    10460
Name: price (eur), dtype: int64

## Model Training

In [280]:
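# Hold out 33% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# The column names mix strings and integers; convert them all to strings for scikit-learn
X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

# Random forest with default hyperparameters (no random_state, so results vary between runs)
regr = RandomForestRegressor()

regr.fit(X_train, y_train)

y_pred = regr.predict(X_test)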

In [281]:
X_train.head()

Unnamed: 0,age,mileage (kms),0,1,2,3,4,5,6,7,...,33,34,35,36,37,38,39,40,41,42
615,6,63202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
18,5,69317,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
580,5,118548,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
722,12,124211,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
83,6,99658,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0


## Model Evaluation

In [282]:
# RMSE; newer scikit-learn versions replace `squared=False` with `root_mean_squared_error`
errors = mean_squared_error(y_test, y_pred, squared=False)

errors

4562.903455544323

In [283]:
# Mean absolute error (MAE), in EUR
errors2 = mean_absolute_error(y_test, y_pred)
errors2

3273.48893129771

In [284]:
# Coefficient of determination (R²)
r2_score(y_test, y_pred)


0.4821810697709207
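
The three outputs above are the root mean squared error (about 4,563 EUR), the mean absolute error (about 3,273 EUR), and R² (about 0.48, meaning the model explains roughly half of the price variance on the test split). As a reminder of what each metric measures, the sketch below computes them from their definitions on made-up values. The function names are illustrative and the numbers are not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalises large misses more heavily."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: the average size of a miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices in EUR.
y_true = [10000, 20000, 30000]
y_pred = [12000, 18000, 33000]

print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```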

## Saving the model

In [285]:
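from joblib import dump, load

# Save the trained model with joblib
dump(regr, 'model.joblib')

# Save the fitted encoder with pickle (the file is named "encoder", with no extension)
with open("encoder", "wb") as f:
    pickle.dump(enc, f)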

In [286]:
# To load the model and the encoder in your product:

# regr = load('model.joblib')
# with open("encoder", "rb") as f:
#     enc = pickle.load(f)
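
The application needs both saved artifacts: the model and the encoder that produced its input columns. The round trip below is a minimal sketch of saving and reloading the pair. It uses a throwaway model fitted on made-up data and writes to a temporary directory, so the file locations and training values are assumptions for illustration only.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

# Made-up training data: one categorical feature and a price target.
X_cat = np.array([["BMW"], ["SEAT"], ["BMW"], ["Audi"]], dtype=object)
y = [20000, 8000, 22000, 25000]

enc = OneHotEncoder(handle_unknown="ignore").fit(X_cat)
regr = RandomForestRegressor(n_estimators=5, random_state=0)
regr.fit(enc.transform(X_cat).toarray(), y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.joblib"
    encoder_path = Path(tmp) / "encoder"

    # Save both artifacts, as the notebook does.
    dump(regr, model_path)
    with open(encoder_path, "wb") as f:
        pickle.dump(enc, f)

    # Load them back, as the application would at start-up.
    loaded_regr = load(model_path)
    with open(encoder_path, "rb") as f:
        loaded_enc = pickle.load(f)

# The restored pair should reproduce the original prediction.
sample = np.array([["SEAT"]], dtype=object)
original = regr.predict(enc.transform(sample).toarray())
restored = loaded_regr.predict(loaded_enc.transform(sample).toarray())
print(original[0] == restored[0])
```

Pickled scikit-learn objects are tied to the library version that created them, so it is safest to load these files with the same scikit-learn version that the notebook used.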