## Audi Prices

Hi! I'm new to ML, so feel free to give any advice or feedback!
Some of the feature engineering code was inspired by Gireesh's kernel: https://www.kaggle.com/gireeshs/volkswagen-price-regression-r-2-0-9555

**The objective of this analysis is to compare different regression algorithms and find the one that best predicts the selling price of Audi cars from the available variables.**

*Importing packages*

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split # Train-test split
from sklearn.preprocessing import MinMaxScaler # Scaling data
from sklearn.feature_selection import SelectKBest, f_regression # Feature selection (number of variables)
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR

import warnings
warnings.filterwarnings('ignore') # Ignoring sklearn warnings

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

sns.set()

In [None]:
dataset = pd.read_csv('../input/used-car-dataset-ford-and-mercedes/audi.csv')
dataset.head()

In [None]:
dataset.describe().T

**The dataset has:**

- 5 numerical columns: price, mileage, tax, mpg and engineSize.
- 3 categorical columns: model, transmission and fuelType.
- 1 date column: year.
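
A quick way to double-check this kind of column breakdown is `select_dtypes`. This is just a minimal sketch on a made-up frame (the rows in `toy` are hypothetical, not taken from audi.csv); note that pandas treats `year` as a plain number.

```python
import pandas as pd

# Hypothetical rows, only to illustrate the dtype split
toy = pd.DataFrame({
    "model": [" A1", " A4"],
    "year": [2017, 2019],
    "price": [12500, 21000],
    "engineSize": [1.4, 2.0],
})

# Numeric columns (year is stored as an integer, so it lands here)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical in this sketch
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```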

**Getting to know the data**

*How are the cars distributed by the registration year?*

In [None]:
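# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same histogram + KDE
sns.histplot(dataset['year'], bins = 10, color = 'orange', kde = True, label = 'Registration year')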
plt.legend()
plt.gcf().set_size_inches(12, 5)

*How many models are in the set?*
*How many types of transmission?*
*How many types of fuel?*

In [None]:
print(dataset.model.unique())
print('--'* 50)

print(dataset.transmission.unique())
print('--'* 50)

print(dataset.fuelType.unique())
print('--'* 50)

In [None]:
fig, ax =plt.subplots(1,2, sharey = True)
plt.gcf().set_size_inches(12, 5)
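# Pass the column through x= explicitly; recent seaborn versions expect keyword arguments here
sns.countplot(x = dataset['fuelType'], ax = ax[0])
sns.countplot(x = dataset['transmission'], ax = ax[1])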
plt.show()

**So there are just 3 types of transmission and 3 types of fuel, but there seem to be many models.**

In [None]:
Model = dataset['model'].value_counts() # Series: index = model name, values = number of cars

In [None]:
sns.barplot(x = Model.index, y = Model.values)

plt.gcf().set_size_inches(15, 7)

plt.title('Models vs quantity', fontsize = 20)
plt.xlabel('Model', fontsize = 15)
plt.ylabel('Quantity', fontsize = 15)

plt.xticks(rotation = 'vertical') # Rotate the existing ticks instead of hard-coding their positions
plt.show()

*Price analysis*

In [None]:
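# numeric_only=True (pandas >= 1.5) keeps the text columns out of the correlation matrix
sns.heatmap(dataset.corr(numeric_only = True), annot = True, linewidths=.5, cmap='cubehelix')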
plt.title('Correlation', fontsize = 20)
plt.show()

**Apart from 'tax', every numerical variable shows a noticeable correlation with the car price. The model, fuel type and transmission are not part of this matrix because they are categorical, but they very likely influence the price too, so later on we will one-hot encode (OHE) those variables.**

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey = True)

plt.gcf().set_size_inches(15, 7)
ax1.scatter(dataset.mileage, dataset.price, c = 'green') # Scatter, since the rows are not ordered by mileage
ax1.set_title('Mileage vs. Price', c = 'green', fontsize = 25)
ax2.scatter(dataset.engineSize, dataset.price, c='red')
ax2.set_title('Engine size vs. Price', c ='red', fontsize = 25)

ax1.set_ylabel('Price', fontsize = 25) # Shared y axis, so label the left plot

plt.show()

**There are clear correlations:**

- The price goes down as the car has more mileage (it might be really worn out).
- The price goes up as the engine size gets bigger. There are some cars that do not follow this tendency, so we should assume that the different models have an impact on the price.
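
To put numbers on tendencies like these, one option is to sort the Pearson correlations with the price. A minimal sketch, assuming a tiny made-up frame (`toy` and its values are hypothetical, not rows from the dataset):

```python
import pandas as pd

# Hypothetical cars, only to illustrate the sign of each correlation
toy = pd.DataFrame({
    "mileage": [5000, 20000, 40000, 60000, 90000],
    "engineSize": [2.0, 1.4, 2.0, 1.6, 1.4],
    "price": [30000, 21000, 19000, 14000, 9000],
})

# Pearson correlation of every feature with price, most negative first
corr_with_price = toy.corr()["price"].drop("price").sort_values()
print(corr_with_price)
```

In this toy example mileage comes out negative and engine size positive, which is the same direction as in the plots above.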

*Preparing the data for the model*

In [None]:
dataset2 = dataset.copy()
dataset2 = dataset2[['model','year','transmission','mileage','fuelType','tax','mpg','engineSize','price']]
dataset2.head(3)

**OHE for model, transmission and fuel type**

In [None]:
data_audi_D = pd.get_dummies(dataset2)
data_audi_D.head(3)

**To avoid multicollinearity issues, I'll drop one column for model, one for transmission, and one for fuel type.**

In [None]:
data_audi_D = data_audi_D.drop(['model_ A1', 'transmission_Automatic', 'fuelType_Diesel'], axis=1)
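
As a side note, `pd.get_dummies` can also drop one level per variable by itself through `drop_first=True`. Here is a minimal sketch on a made-up frame (the values in `toy` are hypothetical). It drops the first level in sorted order, which may not be the same level I dropped by hand above.

```python
import pandas as pd

# Hypothetical values, only to illustrate the encoding
toy = pd.DataFrame({
    "fuelType": ["Petrol", "Diesel", "Hybrid", "Petrol"],
    "price": [12500, 16500, 21000, 11000],
})

full = pd.get_dummies(toy, columns=["fuelType"])
reduced = pd.get_dummies(toy, columns=["fuelType"], drop_first=True)

# In the full encoding the dummy columns of one variable always sum to 1.
# That redundancy is what causes multicollinearity in a model with an intercept.
dummy_cols = [c for c in full.columns if c.startswith("fuelType_")]
row_sums = full[dummy_cols].sum(axis=1)

print(list(full.columns))
print(list(reduced.columns))
```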

**Scaling the data**

In [None]:
scaler = MinMaxScaler() # Use a separate name so the MinMaxScaler class is not overwritten
data_audi_D_Scaled = scaler.fit_transform(data_audi_D)
data_audi_D_Scaled = pd.DataFrame(data_audi_D_Scaled, columns = data_audi_D.columns)
data_audi_D_Scaled.head(3)
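
Since the price is scaled together with the features, every prediction later on is in the 0 to 1 range rather than in currency. A minimal sketch of what min-max scaling does and how it can be undone, assuming a separate scaler fitted only on the price column (`price_scaler` and the three prices are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical prices, as a single column
prices = np.array([[10000.0], [20000.0], [40000.0]])

price_scaler = MinMaxScaler()
# Each value becomes (x - min) / (max - min)
scaled = price_scaler.fit_transform(prices)
# inverse_transform maps scaled values back to the original units
restored = price_scaler.inverse_transform(scaled)

print(scaled.ravel())
print(restored.ravel())
```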

**Separating variables**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data_audi_D_Scaled.drop(columns = ['price']),
                                                    data_audi_D_Scaled[['price']],
                                                    test_size = 0.2, random_state = 0)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

*There are 35 independent variables*

**Selecting the best features for the regression with SelectKBest, f_regression**

In [None]:
column_names = data_audi_D_Scaled.drop(columns = ['price']).columns

no_of_features = []
r_squared_train = []
r_squared_test = []

for k in range(3, 35, 2): # k = 3, 5, ..., 33 (odd values only, the range stops before 35)
    selector = SelectKBest(f_regression, k = k)
    X_train_transformed = selector.fit_transform(X_train, y_train)
    X_test_transformed = selector.transform(X_test)
    regressor = LinearRegression()
    regressor.fit(X_train_transformed, y_train)
    no_of_features.append(k)
    r_squared_train.append(regressor.score(X_train_transformed, y_train))
    r_squared_test.append(regressor.score(X_test_transformed, y_test))
    
sns.lineplot(x = no_of_features, y = r_squared_train, label = 'Train R2')
sns.lineplot(x = no_of_features, y = r_squared_test, label = 'Test R2')
plt.show()

*The curve is stable from 27 variables on, so that is the number of variables I'm using*

In [None]:
# Doing the same as above but only with k = 27

selector = SelectKBest(f_regression, k = 27)
X_train_transformed = selector.fit_transform(X_train, y_train)
X_test_transformed = selector.transform(X_test)
column_names[selector.get_support()]
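
To get a feel for what `SelectKBest` with `f_regression` keeps, here is a minimal sketch on synthetic data (everything in it is made up: 5 random features, of which only columns 0 and 3 drive the target). It scores each feature on its own with a univariate F-test and keeps the `k` best.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Synthetic target: only columns 0 and 3 matter, plus a little noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(f_regression, k=2)
X_selected = selector.fit_transform(X, y)

# get_support() is a boolean mask over the original columns
kept = np.flatnonzero(selector.get_support())
print(kept)
```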

**Linear Modeling**

In [None]:
def regression_model(model):
    """
    Will fit the regression model passed and will return the regressor object and the score
    """
    regressor = model
    regressor.fit(X_train_transformed, y_train)
    score = regressor.score(X_test_transformed, y_test) # R2
    return regressor, score
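
For reference, the score returned by `.score()` on a scikit-learn regressor is the R² coefficient. A minimal sketch of how it is computed, on made-up numbers (`y_true` and `y_hat` are hypothetical):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # Residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # Total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```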

In [None]:
models_to_evaluate = [LinearRegression(), Ridge(), Lasso(), SVR(), RandomForestRegressor(), MLPRegressor()]

# Collect the rows in a list first: DataFrame.append was removed in pandas 2.0
linear_rows = []
for model in models_to_evaluate:
    regressor, score = regression_model(model)
    linear_rows.append({"Features": "Linear", "Model": model, "Score": score})

model_performance = pd.DataFrame(linear_rows, columns = ["Features", "Model", "Score"])
model_performance

**Polynomial Modelling**

*Transforming X variable*

In [None]:
poly = PolynomialFeatures()
X_train_transformed_poly = poly.fit_transform(X_train)
X_test_transformed_poly = poly.transform(X_test)

print(X_train_transformed_poly.shape)
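
The number of columns printed above grows quickly with the number of inputs. With the default settings (degree 2, bias column included) `PolynomialFeatures` produces C(n + 2, 2) columns for n inputs. A minimal sketch with a made-up 4-column matrix:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 4
X = np.arange(12, dtype=float).reshape(3, n_features)  # Hypothetical 3 x 4 matrix

poly = PolynomialFeatures()  # Defaults: degree=2, include_bias=True
X_poly = poly.fit_transform(X)

# 1 bias + 4 linear + 4 squared + 6 pairwise products
expected = comb(n_features + 2, 2)
print(X_poly.shape, expected)
```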

In [None]:
no_of_features = []
r_squared = []

for k in range(10, 400, 5): # k = 10, 15, ..., 395: seeing what happens up to almost 400 variables
    selector = SelectKBest(f_regression, k = k)
    X_train_transformed = selector.fit_transform(X_train_transformed_poly, y_train)
    regressor = LinearRegression()
    regressor.fit(X_train_transformed, y_train)
    no_of_features.append(k)
    r_squared.append(regressor.score(X_train_transformed, y_train))
    
sns.lineplot(x = no_of_features, y = r_squared)
plt.show()

*The curve is stable from 250 variables on, so that is the number of variables I'm using*

In [None]:
selector = SelectKBest(f_regression, k = 250)

X_train_transformed = selector.fit_transform(X_train_transformed_poly, y_train)
X_test_transformed = selector.transform(X_test_transformed_poly)

models_to_evaluate = [LinearRegression(), Ridge(), Lasso(), SVR(), RandomForestRegressor(), MLPRegressor()]

# DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once
poly_rows = []
for model in models_to_evaluate:
    regressor, score = regression_model(model)
    poly_rows.append({"Features": "Polynomial", "Model": model, "Score": score})

model_performance = pd.concat([model_performance, pd.DataFrame(poly_rows)], ignore_index=True)
model_performance

**The maximum score (R²) is 0.96, obtained with RandomForest on the polynomial features**

**Predictions with RandomForest polynomial regression**

In [None]:
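regressor_final = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Fit and predict with regressor_final (not the leftover `regressor` from the loop above)
regressor_final.fit(X_train_transformed_poly, y_train.values.ravel()) # ravel: sklearn expects a 1-D target

y_pred = regressor_final.predict(X_test_transformed_poly)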

In [None]:
y_test = y_test.reset_index() # Reset the index so the join below aligns row by row

y_pred_df = pd.DataFrame({'Price_prediction': y_pred.flatten()})

In [None]:
Comparison = y_test.join(y_pred_df) 
Comparison = Comparison.drop(['index'], axis=1)
Comparison.head()

In [None]:
Fifty_comparison = Comparison.head(50)

In [None]:
Fifty_comparison.plot(kind = 'bar', figsize=(20,15))
plt.grid(which = 'both', linestyle = '-', linewidth = 0.5, color = 'green')
plt.show()

Thanks for reaching the end!