# Description of the project

The used-car sales service "Not beaten, not beautiful" is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. Historical data is available: technical specifications, trim levels, and prices of cars. The task is to build a model that predicts the price.

The customer cares about:
- prediction quality;
- prediction speed;
- training time.

# Description of data

The data is in the `/datasets/autos.csv` file.

Features:

- DateCrawled - date the listing was downloaded from the database
- VehicleType - body type
- RegistrationYear - year of registration
- Gearbox - gearbox type
- Power - power (hp)
- Model - car model
- Kilometer - mileage (km)
- RegistrationMonth - month of registration
- FuelType - fuel type
- Brand - car brand
- NotRepaired - whether the car has been repaired or not
- DateCreated - date the listing was created
- NumberOfPictures - number of photos of the car
- PostalCode - postal code of the listing owner (user)
- LastSeen - date of the user's last activity

Target feature:
- Price - price (EUR)

# Action plan

1. Load data

2. Analyze and prepare data

3. Build price forecast models and evaluate their quality

4. Conclusion

Notes:

- Model quality is assessed with the RMSE metric
- The gradient boosting model is implemented with the LightGBM library
- Since gradient boosting can take a long time to train, only 2-3 of its parameters are tuned
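
For reference, RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same unit as the price (EUR). A minimal sketch with invented numbers; the helper name `rmse` is hypothetical and not part of the notebook:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy prices in EUR, invented for illustration only
actual = [1000, 2000, 3000]
predicted = [1100, 1900, 3200]
toy_rmse = rmse(actual, predicted)
print(toy_rmse)
```

The notebook computes the same quantity as `mean_squared_error(...) ** 0.5`.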

# Loading data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

import lightgbm as lgb

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns',100)

In [None]:
df = pd.read_csv('/datasets/autos.csv', sep=',')
# df = pd.read_csv('datasets/autos.csv', sep=',')
df

In [None]:
df.info()

In [None]:
df.describe()

# Data analysis and preparation

The following features are used to predict the price:
<br>VehicleType, RegistrationYear, Gearbox, Power, Kilometer, FuelType, NotRepaired. Price is the target.
<br>We will continue to work with them.

Whether the number of photos matters is debatable, but we will assume that it does not affect the price.

In [None]:
# Take an explicit copy to avoid chained-assignment warnings when df_n is modified later
df_n = df[['VehicleType', 'RegistrationYear', 'Gearbox', 'Power', 'Kilometer',
           'FuelType', 'NotRepaired', 'Price']].copy()

## Car type

Number of non-missing values and number of missing values

In [None]:
a = df_n['VehicleType'].count()
b = df_n['VehicleType'].isna().sum()
print(a)
print(b)
100*b/a
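
The count-and-percentage check above recurs for every column. A minimal sketch of a reusable helper follows; the name `missing_summary` and the toy frame are hypothetical and not part of the notebook. It uses `len()` for the total number of rows, because `count()` skips missing values.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return total rows, missing count and missing share (%) per column."""
    total = len(frame)              # all rows, including those with NaN
    missing = frame.isna().sum()    # number of NaN values in each column
    return pd.DataFrame({
        'rows': total,
        'missing': missing,
        'missing_pct': 100 * missing / total,
    })

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'VehicleType': ['sedan', None, 'wagon', None],
    'Power': [90, 120, np.nan, 75],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df_n)` once would give the same figures for all columns at a glance.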

Let's take a look at the data

In [None]:
df_n['VehicleType'].sort_values().unique()

Remove missing values

In [None]:
df_n = df_n.dropna(subset=['VehicleType']).reset_index(drop=True)
df_n.info()

## Year of registration

Number of non-missing values and number of missing values

In [None]:
a = df_n['RegistrationYear'].count()
b = df_n['RegistrationYear'].isna().sum()
print(a)
print(b)
100*b/a

In [None]:
sns.displot(df_n['RegistrationYear'],kde=True)

The spread is very wide.
<br>Let's set the minimum year to 1990. It seems to me that cars registered after that year can no longer be considered rare; older cars are rarities, and their price is determined by specific characteristics.
<br>The maximum year is set to 2016 in the code below (the current year is 2020); a car cannot be registered after its listing was downloaded, so the bound should not exceed the latest `DateCrawled` year.

In [None]:
index = df_n.loc[(df_n['RegistrationYear'] < 1990) | (df_n['RegistrationYear'] > 2016)].index
df_n = df_n.drop(index).reset_index(drop=True)
sns.displot(df_n['RegistrationYear'],kde=True)
plt.show()
df_n.info()
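
The upper bound for the registration year could also be derived from the data instead of being hard-coded. A minimal sketch, assuming `DateCrawled` holds timestamps that `pd.to_datetime` can parse; the toy values below are invented and the variable names are hypothetical:

```python
import pandas as pd

# Toy listings, invented for illustration only
toy = pd.DataFrame({
    'DateCrawled': ['2016-03-24 11:52:17', '2016-04-07 03:16:57', '2016-03-14 12:52:21'],
    'RegistrationYear': [1993, 2018, 2011],
})

# The latest year in which any listing was downloaded
crawl_year = pd.to_datetime(toy['DateCrawled']).dt.year.max()

# Keep only cars registered between 1990 and the crawl year (inclusive)
valid = toy[toy['RegistrationYear'].between(1990, crawl_year)]
print(crawl_year, len(valid))
```

Here the car registered in 2018 is dropped because it is newer than the newest listing.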

## Gearbox type

Number of non-missing values and number of missing values

In [None]:
a = df_n['Gearbox'].count()
b = df_n['Gearbox'].isna().sum()
print(a)
print(b)
100*b/a

Let's take a look at the data

In [None]:
df_n['Gearbox'].sort_values().unique()

Remove missing values

In [None]:
df_n = df_n.dropna(subset=['Gearbox']).reset_index(drop=True)
df_n.info()

## Power

Number of non-missing values and number of missing values

In [None]:
a = df_n['Power'].count()
b = df_n['Power'].isna().sum()
print(a)
print(b)
100*b/a

In [None]:
sns.displot(df_n['Power'],kde=True)

The spread is very wide.
<br>Minimum limit: 20 hp, in case someone lists a car with characteristics similar to a Zaporozhets.
<br>Maximum limit: 700 hp. This is the power of the most powerful tractor to date. Of course, there are power ratings above 1000 hp, up to 1600 hp, but such vehicles are extremely rare and unlikely to be sold on this site.

In [None]:
index = df_n.loc[(df_n['Power'] < 20) | (df_n['Power'] > 700)].index
df_n = df_n.drop(index).reset_index(drop=True)
sns.displot(df_n['Power'],kde=True)
plt.show()
df_n.info()

## Mileage

Number of non-missing values and number of missing values

In [None]:
a = df_n['Kilometer'].count()
b = df_n['Kilometer'].isna().sum()
print(a)
print(b)
100*b/a

In [None]:
sns.displot(df_n['Kilometer'],height=8,aspect=1,kde=True)

Mileage can take any value in this case

## Fuel type

Number of non-missing values and number of missing values

In [None]:
a = df_n['FuelType'].count()
b = df_n['FuelType'].isna().sum()
print(a)
print(b)
100*b/a

Let's take a look at the data

In [None]:
df_n['FuelType'].sort_values().unique()

Remove missing values

In [None]:
df_n = df_n.dropna(subset=['FuelType']).reset_index(drop=True)
df_n.info()

## Repair mark

Number of non-missing values and number of missing values

In [None]:
a = df_n['NotRepaired'].count()
b = df_n['NotRepaired'].isna().sum()
print(a)
print(b)
100*b/a

Let's take a look at the data

In [None]:
df_n['NotRepaired'].sort_values().unique()

Remove missing values

In [None]:
df_n = df_n.dropna(subset=['NotRepaired']).reset_index(drop=True)
df_n.info()

In [None]:
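# Safeguard only: rows with a missing NotRepaired value were dropped in the previous cell, so nothing is filled here
df_n['NotRepaired'] = df_n['NotRepaired'].fillna('no')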
df_n.info()

## Price

Number of non-missing values and number of missing values

In [None]:
a = df_n['Price'].count()
b = df_n['Price'].isna().sum()
print(a)
print(b)
100*b/a

In [None]:
sns.displot(df_n['Price'],kde=True)

The spread is very wide.
<br>Minimum limit: 100 EUR. Cheaper offers are unlikely to be found on European sites.
<br>No upper limit is needed in this case.

In [None]:
index = df_n.loc[(df_n['Price'] < 100)].index
df_n = df_n.drop(index).reset_index(drop=True)
sns.displot(df_n['Price'],kde=True)
plt.show()
df_n.info()

# Building price forecast models

First, we build models without gradient boosting: linear regression, a decision tree, and a random forest.
<br>Then we build a gradient boosting model with LightGBM.

Let's convert categorical features into numeric ones with one-hot encoding

In [None]:
df_ohe = pd.get_dummies(df_n, drop_first=True)
df_ohe.info()
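
To illustrate what `pd.get_dummies(..., drop_first=True)` does, here is a minimal sketch on an invented two-column frame. Each categorical column is replaced by indicator columns, and the first category (in sorted order) is dropped because it is implied when all the others are zero.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Gearbox': ['manual', 'auto', 'manual'],
    'Power': [90, 150, 75],
})

encoded = pd.get_dummies(toy, drop_first=True)
columns = list(encoded.columns)
print(columns)
print(encoded['Gearbox_manual'].astype(int).tolist())
```

The category `auto` gets no column of its own: a row with `Gearbox_manual` equal to 0 is an automatic.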

Split the data into the features and the target

In [None]:
features = df_ohe.drop(['Price'], axis=1)
target = df_ohe['Price']

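Let's split the data into three subsets, train, validation and test, in the ratio `3:1:1`: 20% goes to validation, then 25% of the remaining 80% (20% of the total) goes to test, leaving 60% for training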

In [None]:
# Split features and target in the same call so that their rows stay aligned
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.20, random_state=12345)
features_train, features_test, target_train, target_test = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345)

print(features.shape)
print(target.shape)
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)

Scale features

In [None]:
scaler = StandardScaler()
scaler.fit(features_train)
features_train = scaler.transform(features_train)
features_valid = scaler.transform(features_valid)
features_test = scaler.transform(features_test)

## Linear regression

In [None]:
%%time
model_lr = LinearRegression()
model_lr.fit(features_train, target_train)
predictions = model_lr.predict(features_valid)
rmse = mean_squared_error(target_valid, predictions) ** 0.5
print('RMSE linear regression:', rmse)

## Decision tree

In [None]:
%%time
param_grid = {'max_depth': range(1,100,10)}

dtr = GridSearchCV(estimator=DecisionTreeRegressor(random_state=12345), param_grid=param_grid, cv=5)
dtr.fit(features_train, target_train)
dtr.best_params_
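
By default, `GridSearchCV` ranks candidates by the estimator's own `score` method, which is R² for scikit-learn regressors rather than RMSE. A minimal sketch of selecting by squared error instead, on synthetic data invented for illustration (the variable names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, invented for illustration only
rng = np.random.RandomState(12345)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, size=200)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=12345),
    param_grid={'max_depth': [1, 3, 5]},
    scoring='neg_mean_squared_error',  # higher is better, hence the negated MSE
    cv=5,
)
search.fit(X, y)

# Square root of the mean cross-validated MSE of the best candidate
cv_rmse = (-search.best_score_) ** 0.5
print(search.best_params_, cv_rmse)
```

The same `scoring` argument could be passed to the decision tree and random forest searches in this notebook, so that the hyperparameters are chosen with the metric used for the final comparison.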

In [None]:
predictions = dtr.predict(features_valid)
rmse = mean_squared_error(target_valid, predictions) ** 0.5
print('RMSE decision tree:', rmse)

## Random forest

In [None]:
%%time
param_grid = {'n_estimators': range(1,40,10), 'max_depth': range(1,40,10)}

rfr = GridSearchCV(estimator=RandomForestRegressor(random_state=12345), param_grid=param_grid, cv=5)
rfr.fit(features_train, target_train)
rfr.best_params_

In [None]:
predictions = rfr.predict(features_valid)
rmse = mean_squared_error(target_valid, predictions) ** 0.5
print('RMSE random forest:', rmse)

## LightGBM

In [None]:
hyper_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.005,
    'verbose': 0,
    "max_depth": 8,
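    # `num_iterations` and `n_estimators` are aliases for the number of boosting rounds;
    # they were set to conflicting values (20000 and 1000), so a single value is kept
    "n_estimators": 20000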
}

In [None]:
%%time
gbm = lgb.LGBMRegressor(**hyper_params)
gbm.fit(features_train, target_train, 
        eval_set=[(features_valid, target_valid)],
        eval_metric='rmse', verbose=0)

print('RMSE LightGBM:', gbm.best_score_['valid_0']['rmse'])

# Checking models on a test dataset

In [None]:
predictions = model_lr.predict(features_test)
rmse = mean_squared_error(target_test, predictions) ** 0.5
print('RMSE Linear Regression:', rmse)

predictions = dtr.predict(features_test)
rmse = mean_squared_error(target_test, predictions) ** 0.5
print('RMSE Decision Tree:', rmse)

predictions = rfr.predict(features_test)
rmse = mean_squared_error(target_test, predictions) ** 0.5
print('RMSE Random Forest:', rmse)

predictions = gbm.predict(features_test)
rmse = mean_squared_error(target_test, predictions) ** 0.5
print('RMSE LightGBM:', rmse)

Feature importance

In [None]:
feature_importances = pd.DataFrame(gbm.feature_importances_,
                                   index = features.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances

# Conclusion

As the results on the test dataset show, LightGBM ranks first and random forest second.
<br>It should also be noted that, in this case, LightGBM runs 5 times faster than random forest.
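
Since the customer cares about training time and prediction speed separately, the two can be measured separately; the `%%time` cells for linear regression and LightGBM time fitting and prediction together. A minimal sketch on synthetic data; the helper name `timed` is hypothetical and not part of the notebook:

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Synthetic data, invented for illustration only
rng = np.random.RandomState(12345)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0])

model = LinearRegression()
_, fit_seconds = timed(model.fit, X, y)
preds, predict_seconds = timed(model.predict, X)
print(f'fit: {fit_seconds:.4f} s, predict: {predict_seconds:.4f} s')
```

Applying the same helper to each model would give one training time and one prediction time per model for the comparison.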

# Appendix 1

Combine train and validation datasets

In [None]:
features_tr_v = np.concatenate([features_train, features_valid])
target_tr_v = np.concatenate([target_train, target_valid])

print(features_train.shape)
print(features_valid.shape)
print(features_tr_v.shape)
print()
print(target_train.shape)
print(target_valid.shape)
print(target_tr_v.shape)

Linear Regression

In [None]:
%%time
model_lr = LinearRegression()
model_lr.fit(features_tr_v, target_tr_v)
predictions = model_lr.predict(features_test)
rmse = mean_squared_error(target_test, predictions) ** 0.5
print('RMSE Linear Regression:', rmse)

LightGBM

In [None]:
hyper_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.005,
    'verbose': 0,
    "max_depth": 8,
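    # `num_iterations` and `n_estimators` are aliases for the number of boosting rounds;
    # they were set to conflicting values (20000 and 1000), so a single value is kept
    "n_estimators": 20000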
}

In [None]:
%%time
gbm = lgb.LGBMRegressor(**hyper_params)
gbm.fit(features_tr_v, target_tr_v, 
        eval_set=[(features_test, target_test)],
        eval_metric='rmse', verbose=0)

print('RMSE LightGBM:', gbm.best_score_['valid_0']['rmse'])

Hmm, predictions haven't improved. Perhaps I missed something.

# Appendix 2

Let's present the test-set predictions as a dataset

In [None]:
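# Keep the original row labels so that each prediction can be matched to its own row
predictions = pd.DataFrame(gbm.predict(features_test), columns=['Predictions'],
                           index=target_test.index)

We combine all the features and predictions of the test set, and also calculate the absolute difference between predictions and actual prices

In [None]:
# Join on the original index first and reset it only afterwards; resetting before the
# second merge would pair each prediction with an unrelated row
diff = df_n.merge(target_test, left_index=True, right_index=True)
diff = diff.merge(predictions, left_index=True, right_index=True).reset_index(drop=True)
diff['difference'] = abs(diff['Price_y'] - diff['Predictions'])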
diff.head()

I think the price of the same car can vary by about 500 euros. The share of errors below 500 euros is:

In [None]:
100 * diff.loc[diff['difference'] < 500, 'difference'].count() / diff['difference'].count()
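
The share is only meaningful if each prediction is compared with the price of its own row. A minimal sketch on invented numbers shows the computation with pandas index alignment; the series and variable names are hypothetical:

```python
import pandas as pd

# Toy actual prices with a shuffled index, as produced by a random split
actual = pd.Series([1000, 5000, 3000, 800], index=[7, 2, 9, 4], name='Price')

# Predictions carry the same index, so subtraction matches rows by label
predicted = pd.Series([1200, 4000, 3100, 1500], index=actual.index, name='Predictions')

abs_error = (actual - predicted).abs()
share_under_500 = 100 * (abs_error < 500).mean()
print(share_under_500)
```

Here two of the four errors (200 and 100) are below 500, so the share is 50%.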

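In the run recorded here the share was below 10%, which is not great; the figure is only meaningful if each prediction is matched to its own row. Let's look at the rows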

In [None]:
under_500 = diff.query('difference < 500')
under_500

In [None]:
more_500 = diff.query('difference >= 500')  # >= so that errors of exactly 500 are not lost
more_500