<a href="https://colab.research.google.com/github/Folkas/folkas-portfolio/blob/main/Car_pricing_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Capstone Project

## Background

Time really flies: you have already reached the last sprint of the second module of the course, and you should be proud of yourself. Over the past three sprints you have gained knowledge that helped you acquire data engineering skills. By now you should know what good Python code looks like, why OOP is used, how to structure a Python project, how to work with SQL, and how to develop and deploy a web application. These skills will enable you to build outstanding projects that not only cover data analysis and modeling but also make your discoveries accessible to other people.

Now the time has come to put everything you have learned into one place and complete the second capstone project of the course. During this project you will create a Python package, collect a dataset using web scraping, train a model, and deploy it for others to use.

Most importantly, you will have to create a whole end-to-end (E2E) machine learning plan: establish the problem, collect a dataset, train a model, evaluate it, and deploy it. By completing this project, you will strengthen your data engineering skills and prove to yourself and others that you are capable of planning and executing data science projects.

<div style="text-align: center;">
<img src="https://miro.medium.com/max/700/1*x7P7gqjo8k2_bj2rTQWAfg.jpeg" width="300px"/>
</div>

---

## Requirements
The capstone project requires you to execute a full-featured E2E machine learning project. Here is what you have to complete:

### Define problem you want to solve
This is the part where you select a problem. You can choose from the following topics: text classification, price prediction, or item category classification. Throughout the second module of the course you saw a few examples of datasets that could be used to solve these problems (eBay listings, Reddit posts, Twitter tweets). In this stage you have to:
- Define the problem and create a short presentation
- Explain what you want to solve and what the potential value of your solution is
- Define the data source you will collect data from

### Collecting data
During this stage you will need to create a Python package that can scrape a specific website. Throughout the second module you saw many examples of functions that take a few arguments (`keywords`, `number of samples`, etc.) and output pandas `DataFrame`s. Now you will need to turn this functionality into a Python package that is installable through pip.
- Create a Python package that can scrape a specific webpage
- The package should be installable through `pip`
- The package should meet all expected Python package standards: clean code, tests, documentation.
- Collect and process a dataset using the package you created
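
As a rough illustration of the kind of function such a package might expose, the sketch below parses ad markup into one dictionary per ad using only the standard library. The HTML structure, the CSS class names (`ad`, `ad-title`, `ad-price`), and the names `AdParser` and `parse_ads` are all assumptions made for this example. They do not describe the real website or any existing package.

```python
from html.parser import HTMLParser


class AdParser(HTMLParser):
    """Collect title and price from hypothetical ad markup."""

    # Assumed CSS classes mapped to output field names
    FIELDS = {"ad-title": "title", "ad-price": "price"}

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per parsed ad
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "ad":
            self.rows.append({})  # a new ad begins
        elif css_class in self.FIELDS:
            self._field = self.FIELDS[css_class]

    def handle_data(self, data):
        # Store the text that follows a recognised field tag
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()
            self._field = None


def parse_ads(html, number_of_samples):
    """Return at most number_of_samples ads as a list of dicts."""
    parser = AdParser()
    parser.feed(html)
    return parser.rows[:number_of_samples]


sample_html = """
<div class="ad"><span class="ad-title">Ford Mondeo</span><span class="ad-price">8450</span></div>
<div class="ad"><span class="ad-title">Peugeot 207</span><span class="ad-price">2850</span></div>
"""
ads = parse_ads(sample_html, number_of_samples=1)
print(ads)
```

A list of dictionaries like this can be passed straight to `pd.DataFrame(...)` to get the tabular output the requirement asks for.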



## Project description 

This capstone project is dedicated to solving a business problem. Let's say that I'm opening a second-hand car shop. I need to set a price for each car, but I don't know its value in the Lithuanian market. Therefore, I'm going to create a model that predicts a car's price based on several attributes: manufacturing date, engine size (in liters), engine power (in kW), mileage (in km), and whether it has an automatic or manual gearbox. The model will use data scraped from the en.autoplius.lt website. It will help me estimate a car's value in euros and keep my shop economically sustainable.

In [None]:
!pip install git+https://github.com/Folkas/autoplius-scraper.git

Collecting git+https://github.com/Folkas/autoplius-scraper.git
  Cloning https://github.com/Folkas/autoplius-scraper.git to /tmp/pip-req-build-9mghdkeu
  Running command git clone -q https://github.com/Folkas/autoplius-scraper.git /tmp/pip-req-build-9mghdkeu
Building wheels for collected packages: autoplius-scraper
  Building wheel for autoplius-scraper (setup.py) ... done
  Created wheel for autoplius-scraper: filename=autoplius_scraper-0.0.1-cp36-none-any.whl size=8317 sha256=265366d863c57cd29a077fd3ab5d972302daab986bfa2e7249ad24f86ec7d4af
  Stored in directory: /tmp/pip-ephem-wheel-cache-e3q6wa2j/wheels/96/bc/04/407bd5aa8c55645ff1e97b3ab8d33de8efcea433ba559d99b6
Successfully built autoplius-scraper
Installing collected packages: autoplius-scraper
Successfully installed autoplius-scraper-0.0.1


In [None]:
from function.autoplius_scraper import autoplius_scraper
scraper = autoplius_scraper()
#scraping 20 ads
scraper.multiple_scrapes(20)

In [None]:
#showing ads in pandas DataFrame table
scraper.into_pandas()

---

### Training and saving the model
During this step you will use your collected data to train, test, and save a machine learning model. Do not spend too much time on this step; just make sure that:
- A suitable machine learning algorithm is selected
- The model is successfully trained (remember the first module of the course)
- The model is saved for later deployment



# EDA

In [None]:
import pandas as pd
import numpy as np
#loading dataset
df1 = pd.read_csv("autoplius.csv")
df1.head()

Unnamed: 0,Marque,CarType,FuelType,Gearbox,ManufacturingDate,Engine_l,Power_kW,Mileage_km,Price_euro
0,Ford Tourneo Custom,commercial,Diesel,Manual,2016,1.5,70.0,188928.0,7550
1,Peugeot 207,hatchback,Diesel,Manual,2011,1.6,68.0,150000.0,2850
2,BMW 730,saloon / sedan,Diesel,Automatic,2004,3.0,160.0,303800.0,3200
3,Ford Mondeo,wagon,Diesel,Manual,2016,1.5,88.0,188987.0,8450
4,Ford Mondeo,hatchback,Diesel,Manual,2016,1.5,88.0,113090.0,9150


In [None]:
#converting Gearbox into a dummy variable
GearboxDummy = pd.get_dummies(df1[["Gearbox"]])
GearboxDummy.head()

Unnamed: 0,Gearbox_Automatic,Gearbox_Manual
0,0,1
1,0,1
2,1,0
3,0,1
4,0,1


In [None]:
#altering the dataset
df1 = pd.concat([df1, GearboxDummy], axis=1).drop(["Gearbox", "Marque", "CarType", "FuelType"], axis=1)
df1

Unnamed: 0,ManufacturingDate,Engine_l,Power_kW,Mileage_km,Price_euro,Gearbox_Automatic,Gearbox_Manual
0,2016,1.5,70.0,188928.0,7550,0,1
1,2011,1.6,68.0,150000.0,2850,0,1
2,2004,3.0,160.0,303800.0,3200,1,0
3,2016,1.5,88.0,188987.0,8450,0,1
4,2016,1.5,88.0,113090.0,9150,0,1
...,...,...,...,...,...,...,...
4995,2006,1.6,79.0,154000.0,1299,0,1
4996,2005,1.8,85.0,255000.0,1299,1,0
4997,2003,1.9,88.0,235000.0,1299,0,1
4998,2002,2.4,103.0,287000.0,1299,1,0


In [None]:
#checking for wrong values
df1.Engine_l.unique()

array(['1.5', '1.6', '3.0', '2.0', 'wagon', '2.4', '3.2', '1.3', '2.2',
       '1.0', '1.2', '2.9', '1.9', '4.0', '1.4', '2.5', '6.0', '4.4',
       '1.8', '2.7', '0.3', '1.7', '2.1', '5.0', '3.5', '3.6', 'pick-up',
       '2.8', '2.3', '0.4', 'mpv', 'saloon', '0.9', 'suv', '3.3', '4.5',
       '4.3', '4.9', 'hatchback', 'passenger', '3.7', '4.7', '5.4', '5.5',
       '4.8', '5.7', '4.2', '5.6', '0.7', 'commercial', '0.2', '2.6',
       '4.6', '6.2', 'coupe', '6.9', '3.8', '1.1', '0.6', '3.1', 'other',
       '0.8'], dtype=object)

In [None]:
#deleting rows with wrong values from the dataframe (227 rows)
engine_str = ["wagon", "mpv", "pick-up", "saloon", "suv", "hatchback", "passenger", "commercial", "coupe", "other"]
#.copy() makes df1 an independent DataFrame, so the assignment below does not raise pandas' SettingWithCopyWarning
df1 = df1[~df1["Engine_l"].isin(engine_str)].copy()

#converting Engine_l into float variable
df1["Engine_l"] = df1["Engine_l"].astype("float")


In [None]:
#checking for null values
df1.isnull().any()

ManufacturingDate    False
Engine_l             False
Power_kW              True
Mileage_km            True
Price_euro           False
Gearbox_Automatic    False
Gearbox_Manual       False
dtype: bool

In [None]:
#deleting rows with Na values
df1=df1.dropna()

In [None]:
df1.describe()

Unnamed: 0,ManufacturingDate,Engine_l,Power_kW,Mileage_km,Price_euro,Gearbox_Automatic,Gearbox_Manual
count,4014.0,4014.0,4014.0,4014.0,4014.0,4014.0,4014.0
mean,2005.786248,1.990907,100.038117,233383.1,4357.484803,0.30568,0.69432
std,5.711753,0.572059,58.376151,138531.2,7426.231354,0.460752,0.460752
min,1968.0,0.2,4.0,0.0,1.0,0.0,0.0
25%,2002.0,1.6,74.0,174512.5,1054.0,0.0,0.0
50%,2005.0,2.0,90.0,235825.0,1299.0,0.0,1.0
75%,2009.0,2.2,112.0,287000.0,4700.0,1.0,1.0
max,2021.0,6.2,2656.0,3300000.0,129900.0,1.0,1.0


In [None]:
df1 = df1.reset_index(drop=True)

In [None]:
df1

Unnamed: 0,ManufacturingDate,Engine_l,Power_kW,Mileage_km,Price_euro,Gearbox_Automatic,Gearbox_Manual
0,2016,1.5,70.0,188928.0,7550,0,1
1,2011,1.6,68.0,150000.0,2850,0,1
2,2004,3.0,160.0,303800.0,3200,1,0
3,2016,1.5,88.0,188987.0,8450,0,1
4,2016,1.5,88.0,113090.0,9150,0,1
...,...,...,...,...,...,...,...
4009,2006,1.6,79.0,154000.0,1299,0,1
4010,2005,1.8,85.0,255000.0,1299,1,0
4011,2003,1.9,88.0,235000.0,1299,0,1
4012,2002,2.4,103.0,287000.0,1299,1,0


## Standard scaling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import pickle

#scaling only features
scaler = StandardScaler()

features = ["ManufacturingDate", "Engine_l", "Power_kW", "Mileage_km", "Gearbox_Automatic", "Gearbox_Manual"]

scaled_features = pd.DataFrame(scaler.fit_transform(df1[features]), columns=features)

(4014,)

In [None]:
data = pd.concat([scaled_features, df1.Price_euro], axis=1)

In [None]:
data.describe()

Unnamed: 0,ManufacturingDate,Engine_l,Power_kW,Mileage_km,Gearbox_Automatic,Gearbox_Manual
0,1.788422,-0.858248,-0.514626,-0.320943,-0.663520,0.663520
1,0.912925,-0.683419,-0.548890,-0.601983,-0.663520,0.663520
2,-0.312771,1.764188,1.027292,0.508374,1.507114,-1.507114
3,1.788422,-0.858248,-0.206242,-0.320517,-0.663520,0.663520
4,1.788422,-0.858248,-0.206242,-0.868455,-0.663520,0.663520
...,...,...,...,...,...,...
4009,0.037428,-0.683419,-0.360434,-0.573105,-0.663520,0.663520
4010,-0.137672,-0.333761,-0.257639,0.156063,1.507114,-1.507114
4011,-0.487870,-0.158932,-0.206242,0.011673,-0.663520,0.663520
4012,-0.662970,0.715214,0.050744,0.387086,1.507114,-1.507114
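
The scaled values above can be checked by hand. `StandardScaler` replaces each value `x` with `z = (x - mean) / std`, where `std` is the population standard deviation (divisor `n`). `df1.describe()` reports the sample standard deviation (divisor `n - 1`). The sketch below uses figures copied from the outputs in this notebook and works out the first scaled `ManufacturingDate`. It is a plausibility check on the numbers shown, not a recomputation from the data.

```python
import math

n = 4014                 # row count from df1.describe()
mean_year = 2005.786248  # mean of ManufacturingDate
sample_std = 5.711753    # std of ManufacturingDate as reported by pandas (ddof=1)

# Convert the sample std (n - 1) to the population std (n) used by StandardScaler
population_std = sample_std * math.sqrt((n - 1) / n)

# First row of the dataset: ManufacturingDate = 2016
z = (2016 - mean_year) / population_std
print(round(z, 4))
```

The result agrees with the first `ManufacturingDate` entry of the scaled table (1.788422) to the precision of the rounded inputs.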


In [None]:
#dividing dataset into train and test subsets

from sklearn.model_selection import train_test_split
x_train2, x_test2, y_train2, y_test2 = train_test_split(data[features], data.Price_euro, random_state=8, train_size=0.7)

In [None]:
#training the regression model
from sklearn.linear_model import LinearRegression

scaled_model = LinearRegression()
scaled_model.fit(x_train2, y_train2)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [None]:
# evaluating the trained model
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

predicted2 = scaled_model.predict(x_test2)
expected2 = y_test2

print(f"Mean Squared Error: {round(mean_squared_error(expected2, predicted2), 2)}")
print(f"R2 Score: {round(r2_score(expected2, predicted2), 2)}")

Mean Squared Error: 22103703.44
R2 Score: 0.58
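
The Mean Squared Error is expressed in squared euros, which makes it hard to interpret. As a quick worked check, taking its square root gives the root mean squared error (RMSE) in euros, which can then be compared with the price statistics from `df1.describe()`. The numbers below are copied from the outputs in this notebook; nothing is recomputed from the data.

```python
import math

mse = 22103703.44         # Mean Squared Error printed above (euros squared)
mean_price = 4357.484803  # mean of Price_euro from df1.describe()
std_price = 7426.231354   # std of Price_euro from df1.describe()

rmse = math.sqrt(mse)     # typical size of a prediction error, in euros
print(f"RMSE: {rmse:.2f} euros")
print(f"RMSE / mean price: {rmse / mean_price:.2f}")
print(f"RMSE / price std: {rmse / std_price:.2f}")
```

The RMSE comes out at roughly 4,700 euros, which is larger than the mean listed price. The model captures part of the price variation, in line with R2 = 0.58, but individual predictions can still be far off.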


In [None]:
# saving model to file
pickle.dump(lrmodel, open("model\lrmodel.pkl", "wb"))

# saving scaler to file
pickle.dump(scaler, open("model\scaler.pkl", "wb"))
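
Before deployment it can be worth checking that a pickled object loads back unchanged. The sketch below is only a stand-in: it round-trips a plain dictionary through a temporary directory rather than the trained model, and the file name is illustrative. `os.path.join` builds the path so the same code works on both Windows and Linux.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model; any picklable object behaves the same way
artifact = {"coef": [1.0, 2.0], "intercept": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lrmodel.pkl")  # portable path, no backslashes
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)
```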

### Creating API for the trained model
This is a step you have done at least a couple of times. You will need to create an API using Flask. While creating the application you will need to:
- Load the trained model
- Create an inference pipeline
- Create a `POST` route to reach the model and send its outputs as the response
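
One possible shape for the inference pipeline is sketched below, assuming the request body is a JSON object whose keys mirror the dataset columns plus a `Gearbox` field. The payload keys and the names `build_feature_row` and `predict_price` are hypothetical. The `scaler` and `model` arguments stand for the objects loaded from the pickle files, and a Flask `POST` route would call `predict_price` on the parsed request body.

```python
# Feature order must match the order used when the scaler and model were fitted
FEATURES = ["ManufacturingDate", "Engine_l", "Power_kW", "Mileage_km",
            "Gearbox_Automatic", "Gearbox_Manual"]


def build_feature_row(payload):
    """Turn a request payload (dict) into one row ordered like FEATURES."""
    gearbox = payload["Gearbox"]
    if gearbox not in ("Automatic", "Manual"):
        raise ValueError("Gearbox must be 'Automatic' or 'Manual'")
    values = {
        "ManufacturingDate": int(payload["ManufacturingDate"]),
        "Engine_l": float(payload["Engine_l"]),
        "Power_kW": float(payload["Power_kW"]),
        "Mileage_km": float(payload["Mileage_km"]),
        # Recreate the dummy columns produced during training
        "Gearbox_Automatic": int(gearbox == "Automatic"),
        "Gearbox_Manual": int(gearbox == "Manual"),
    }
    return [values[name] for name in FEATURES]


def predict_price(payload, scaler, model):
    """Scale one row with the saved scaler, then predict with the saved model."""
    row = [build_feature_row(payload)]
    return float(model.predict(scaler.transform(row))[0])


example = {"ManufacturingDate": 2004, "Engine_l": 3.0, "Power_kW": 160,
           "Mileage_km": 303800, "Gearbox": "Automatic"}
print(build_feature_row(example))
```

Keeping the payload-to-row conversion in its own function makes it easy to unit test without starting the web server.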

### Tracking model's predictions
Now you will need to enable tracking of the model's predictions. During this step you will connect your Flask application to a PostgreSQL database hosted by Heroku and store the model's inputs and outputs in one table:
- Create a PostgreSQL database hosted by Heroku
- Create a table for prediction tracking. It should have columns for the model's inputs and outputs
- On every request to the model, insert the required values into the database
- Create a new route in the Flask application that returns the 10 most recent requests and responses in JSON format
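
The tracking logic can be sketched as follows. To keep the example self-contained it uses the standard library's `sqlite3` with an in-memory database as a stand-in for PostgreSQL. The table name, column names, and function names are assumptions. With a PostgreSQL driver such as psycopg2 the placeholder would be `%s` instead of `?`, and the auto-incrementing key would be declared differently (for example with `SERIAL`).

```python
import json
import sqlite3

# Stand-in connection: an in-memory SQLite database instead of Heroku Postgres
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE predictions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        inputs TEXT NOT NULL,   -- model inputs stored as JSON text
        output REAL NOT NULL    -- predicted price in euros
    )
    """
)


def log_prediction(conn, inputs, output):
    """Insert one request/response pair into the tracking table."""
    conn.execute(
        "INSERT INTO predictions (inputs, output) VALUES (?, ?)",
        (json.dumps(inputs), output),
    )
    conn.commit()


def recent_predictions(conn, limit=10):
    """Return the most recent predictions, newest first, as JSON-ready dicts."""
    rows = conn.execute(
        "SELECT inputs, output FROM predictions ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [{"inputs": json.loads(i), "output": o} for i, o in rows]


# Log 12 dummy predictions, then fetch the 10 most recent ones
for n in range(12):
    log_prediction(conn, {"Mileage_km": 1000 * n}, 100.0 * n)
latest = recent_predictions(conn)
print(len(latest))
```

The list returned by `recent_predictions` can be serialised directly as the JSON body of the route that returns the latest requests.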

### Deploying the application
After completing all the steps above, you will need to deploy your application to Heroku, following the steps provided in the fourth lesson of this sprint.
- Make sure all secrets and passwords are set as ENV variables in Heroku
- Deploy the application to Heroku
- Ensure that your application is accessible (provide a link to it)
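
Reading secrets from environment variables might look like the minimal sketch below. Heroku Postgres exposes the connection string through the `DATABASE_URL` config var. The helper name `get_database_url` and the example URL are illustrative only.

```python
import os


def get_database_url(env=os.environ):
    """Read the database connection string from the environment."""
    url = env.get("DATABASE_URL")
    if not url:
        # Fail early with a clear message instead of connecting with bad settings
        raise RuntimeError("DATABASE_URL is not set")
    return url


# Usage with an explicit mapping, so the example runs without real credentials
print(get_database_url({"DATABASE_URL": "postgres://user:password@host:5432/dbname"}))
```

Passing the mapping as an argument keeps the function testable and means no credential ever needs to appear in the source code.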

## Evaluation criteria
- All requirements are met
- The project is well thought out. The defined problem is clearly presented
- The model actually works and makes predictions that make sense
- The code is clear and clean and meets the PEP 8 standards