# Can we predict the price of a car based on its properties?

Can we predict the price of a car from the dataset we have using a linear regression?
This is a fairly classic linear regression example: I will be working with a dataset that describes some properties of more than 45k cars in order to predict their price.

As usual, we start by importing the relevant libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Then we load our dataset, again as usual.

In [None]:
df = pd.read_csv('../input/cars-germany/autoscout24-germany-dataset.csv')

In [None]:
pd.options.display.max_rows = 1000

In [None]:
df

## Feature Engineering

### Outliers

Let's detect outliers, if any, and get rid of them.

In [None]:
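df.select_dtypes(include = 'number').corr()['price'].sort_values()
#We restrict the correlation to the numeric columns, since string columns cannot be correlated.
#As seen, 'hp' is the property most strongly correlated with the price.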

Let's plot the data to see the outliers visually, again if any.

In [None]:
sns.scatterplot(x='hp',y='price', data = df)

In [None]:
sns.scatterplot(x='year',y='price', data = df)

So, we do have a few outliers, as visually seen from the graphs. We will remove them from our dataset.

In [None]:
df[(df['price']>600000)]
#These three cars are outliers due to their prices.

In [None]:
drop_ind = df[(df['price']>600000)].index
df = df.drop(drop_ind, axis = 0)
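
The 600000 threshold above was read off the scatter plots. As a rough alternative, here is a minimal sketch of the common 1.5 × IQR rule on a toy column. The `toy` frame and its prices are made up for illustration and are not taken from the dataset. Since car prices are usually right-skewed, this rule may flag many legitimately expensive cars, so I would treat it as a way to find candidates, not as a replacement for looking at the data.

```python
import pandas as pd

# Hypothetical toy prices; the last value plays the role of an extreme listing.
toy = pd.DataFrame({"price": [5000, 7000, 9000, 11000, 13000, 15000, 17000, 1_000_000]})

# Interquartile range of the price column.
q1 = toy["price"].quantile(0.25)
q3 = toy["price"].quantile(0.75)
iqr = q3 - q1

# Anything above the upper fence is flagged as a candidate outlier.
upper_fence = q3 + 1.5 * iqr
flagged = toy[toy["price"] > upper_fence]
kept = toy[toy["price"] <= upper_fence]

print(upper_fence)
print(flagged)
```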

Let's see the current version of our dataset.

In [None]:
sns.scatterplot(x='hp',y='price', data = df)
#The data is still dispersed, especially above 600 hp, but this version is better to work with.

In [None]:
df.info()
#We have some null values in our dataset. We need to either get rid of them, or fill them with some rational values.

In [None]:
df.isnull().sum()
#This is a better representation of our null data.

Let's see what percentage of each column these null values make up.

In [None]:
100 * df.isnull().sum() / len(df)

1) As seen from the table, there are 3 features containing null values. We are more interested in 'make', 'mileage', 'hp' and 'year'.

2) In order to keep as much data as possible, I will try to fill these values rather than simply removing the rows.


In [None]:
df['model'] = df['model'].fillna('None')
df['gear'] = df['gear'].fillna('None')
100 * df.isnull().sum() / len(df)

In order not to remove the rows where 'hp' is missing, I will adopt a rather crude approach and fill them with the mean of 'hp', which, as we will see in the next cell, is about 132.

In [None]:
df['hp'].mean()

In [None]:
df['hp'] = df['hp'].fillna(132)

In [None]:
100 * df.isnull().sum() / len(df)
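
One caveat with filling 'hp' before the split is that the mean is computed from all rows, including those that later end up in the test set. A minimal sketch of an alternative, assuming scikit-learn's `SimpleImputer` is acceptable here, is to learn the mean from the training rows only and reuse it for the test rows. The `toy` frame below is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with two missing 'hp' values.
toy = pd.DataFrame({
    "hp": [100.0, 150.0, np.nan, 90.0, 200.0, np.nan, 120.0, 110.0],
    "price": [9000, 15000, 12000, 8000, 30000, 11000, 13000, 10000],
})
X_toy = toy[["hp"]]
y_toy = toy["price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# The mean is learned from the training rows only ...
imputer = SimpleImputer(strategy="mean")
X_tr_filled = imputer.fit_transform(X_tr)
# ... and the same value is then used to fill the test rows.
X_te_filled = imputer.transform(X_te)

print(imputer.statistics_)
```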

## Creating the Dummy Variables

In [None]:
my_object_df = df.select_dtypes(include = 'object')
my_numeric_df = df.select_dtypes(exclude = 'object')
my_object_df

In [None]:
df_objects_dummies = pd.get_dummies(my_object_df, drop_first = True)
df_objects_dummies
#So we have created dummy variables, instead of having them as strings.
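
To make it clearer what `drop_first = True` does, here is a small sketch on a toy 'gear' column. The category values are assumptions for illustration and may not match the ones in the dataset. The first category in sorted order gets no column of its own and is represented by all the remaining dummies being zero.

```python
import pandas as pd

# Hypothetical toy column; the category values are made up.
toy = pd.DataFrame({"gear": ["Automatic", "Manual", "Semi-automatic", "Manual"]})

# With drop_first=True the first category in sorted order ('Automatic') is dropped.
dummies = pd.get_dummies(toy, drop_first=True)

print(dummies.columns.tolist())
print(dummies)
```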

In [None]:
final_df = pd.concat([my_numeric_df,df_objects_dummies],axis=1)
final_df
#We are concatenating the dummy variables with the numeric columns.

In [None]:
final_df.info()
#Let's see the final version of our dataframe.

## Creating our Features (X) and Target (y)

In [None]:
X = final_df.drop('price', axis = 1)
y = final_df['price']

## Creating our Training and Test Sets

In [None]:
from sklearn.model_selection import train_test_split
#At this stage, we import the train_test_split function from sklearn.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle = True)
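
Because the split above is shuffled without a fixed seed, every run produces a different split and therefore slightly different scores. A minimal sketch of how a seed makes the split repeatable, on made-up arrays (the value 42 is an arbitrary choice, not something taken from this notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy features and target.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# The same random_state gives the same split on every call.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(np.array_equal(a_test, b_test))
```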

## Scaling our X Features

In [None]:
from sklearn.preprocessing import StandardScaler
#For this we need the StandardScaler class from the sklearn library.

In [None]:
scaler = StandardScaler()

In [None]:
scaled_X_train = scaler.fit_transform(X_train)

In [None]:
scaled_X_test = scaler.transform(X_test)
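
The reason for calling `fit_transform` on the training set but only `transform` on the test set is that the mean and standard deviation should come from the training rows alone. A small sketch on made-up numbers (the arrays and `toy_scaler` are hypothetical, not part of the notebook) shows what that means in practice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature, e.g. something like horsepower.
rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=20.0, size=(200, 1))
test = rng.normal(loc=100.0, scale=20.0, size=(50, 1))

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = toy_scaler.transform(test)        # reuses the train mean and std

# The test rows are scaled with the training statistics, not their own.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```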

## Creating our Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
#We import the LinearRegression class in order to create and run our linear regression.

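In [None]:
reg = LinearRegression()
reg.fit(scaled_X_train, y_train)
#We fit on the scaled training features only.

In [None]:
reg.score(scaled_X_train, y_train)
#Our training R^2 score is around 0.924. That does not seem so bad, I guess.

In [None]:
reg.score(scaled_X_test, y_test)
#We score the model fitted on the training set; it must not be refit on the test set first.

Refitting the regression on the test set and scoring it there gives around 0.9277, but that is effectively a second training score, so it says nothing about overfitting either way. The number that matters is the one from the cell above, where the model fitted on the training set is scored on the unseen test set. If it stays close to the training score of around 0.924, the linear regression generalizes satisfactorily; if it is clearly lower, the model is overfitting.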
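
One way to make sure the model is fitted only on the training rows and then judged on unseen rows is to wrap the scaler and the regression in a single pipeline. The sketch below uses synthetic data, not the car dataset, so the numbers it prints say nothing about the real model. It also reports the mean absolute error, which is easier to read in price units than R^2. All names here (`X_toy`, `y_toy`, `model`) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one feature and a noisy linear target.
rng = np.random.default_rng(0)
X_toy = rng.uniform(50, 400, size=(500, 1))
y_toy = 100.0 * X_toy[:, 0] + rng.normal(0, 2000, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# The pipeline scales and fits using the training rows only.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

# Both scores come from the same fitted model; only the data being scored differs.
train_r2 = model.score(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
test_mae = mean_absolute_error(y_te, model.predict(X_te))

print(train_r2, test_r2, test_mae)
```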