# Car Price Prediction


### Problem Statement
We are given data on cars sold in the US. We need to build a model that predicts their prices.

### Business Goal 

We also need to find the features that correlate most strongly with the price. We aren't looking for complex dependencies, so each feature is compared with the target separately.

# Imports

In [1]:
# to remove some user warnings from pandas and numpy
import warnings
warnings.filterwarnings('ignore')

# for numerical computations
import numpy as np
import pandas as pd

# for plotting and visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# for training models, evaluations and feature processing
from sklearn.metrics import mean_squared_error, make_scorer, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_validate

# Helper Functions

In [2]:
def hist_plot(dataframe, features, rows, cols):
    fig = plt.figure(figsize=(20, 20))
    for i, feature in enumerate(features):
        ax = fig.add_subplot(rows, cols, i+1)
        dataframe[feature].hist(bins=20, ax=ax, facecolor='midnightblue')
        ax.set_title(feature+" Distribution", color='DarkRed')

    fig.tight_layout()
    plt.show()

In [3]:
def pie_plot(series, figsize=(7, 7)):
    ax_ = plt.figure(figsize=figsize).add_subplot()
    ax_.pie(series.value_counts(), labels=series.value_counts().index)

In [4]:
def corr_plot(dataframe, figsize=(10, 10)):
    axes_ = plt.figure(figsize=figsize).add_subplot()
    corr = dataframe.corr()
    ax = sns.heatmap(
        corr,
        vmin=-1, vmax=1, center=0,
        cmap=sns.diverging_palette(20, 220, n=200),
        square=True
    )
    ax.set_xticklabels(
        ax.get_xticklabels(),
        rotation=45,
        horizontalalignment='right'
    )

In [5]:
def barh_plot(series, figsize=(10, 10)):
    ax_ = plt.figure(figsize=figsize).add_subplot()
    ax_.barh(series.index, series.values)

In [6]:
def show_top_unique_values(dataframe, features=None, top_n=10):
    if not features:
        features = dataframe.columns
    
    data = {}
    for feature in features:
        if feature in dataframe.columns:
            data[feature] = dataframe[feature].value_counts().index.values[:top_n]
    
    return data

# Reading and Understanding the Data

Let's start with the following steps:

1. Importing data using the pandas library
2. Understanding the structure of the data

## Read Data

In [7]:
# extra code for Colab
# download a single file from GitHub -> https://raw.githubusercontent.com/user/repository/branch/filename
# our case -> https://raw.githubusercontent.com/MaxinAI/school-of-ai/master/data/workshop_1/CarPrice_Assignment.csv


try:
    import google.colab
    IN_COLAB = True
except ImportError:
    # google.colab is only importable inside a Colab runtime
    IN_COLAB = False

if IN_COLAB:
    print('colab')
    import os
    os.system("wget https://raw.githubusercontent.com/MaxinAI/school-of-ai/master/data/workshop_1/CarPrice_Assignment.csv")
    # read the data
    cars = pd.read_csv('./CarPrice_Assignment.csv')
else:
    # running locally: read the file from the repository's data folder
    cars = pd.read_csv('data/workshop_1/CarPrice_Assignment.csv')

In [8]:
cars.head()

| | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |
| 3 | 4 | 2 | audi 100 ls | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950.0 |
| 4 | 5 | 2 | audi 100ls | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450.0 |


## Number of records

Because the dataset contains very few records, the model will not generalize as well as one trained on more data. Still, it is a useful example for practicing model selection, hyperparameter tuning, and model scoring.

In [None]:
cars.shape

## Features

The names of these features are largely self-explanatory.

In [None]:
cars.columns

## See top frequent values from each column

To get a quick overview of the values of each feature, we can list the most frequent values in each column.

In [None]:
show_top_unique_values(cars, top_n=5)

## List Numerical Features

If you check the data type of each feature, you can select the numerical ones. Numerical values may be stored with the `object` dtype if the numbers are represented as strings, so double-check that you selected them correctly!

In [None]:
# find numerical features by checking their data type not equals to "object" 
numerical_features = 
numerical_features
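As a rough illustration of dtype-based selection, here is a minimal sketch on a hypothetical toy frame (the `toy` DataFrame and its values are made up for this example, not taken from the dataset). It uses `select_dtypes`, which is one alternative to comparing each dtype against `object`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `cars`; the values are illustrative only.
toy = pd.DataFrame({
    "CarName": ["audi 100 ls", "bmw x1", "mazda rx3"],
    "fueltype": ["gas", "gas", "diesel"],
    "horsepower": [102, 121, 68],
    "price": [13950.0, 16430.0, 5195.0],
})

# Keep only the columns with a numeric dtype (integers and floats).
numerical = toy.select_dtypes(include="number").columns.tolist()
print(numerical)
```

If a column that should be numeric is missing from the result, it was probably read as strings; `pd.to_numeric` can convert it.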

## We have no empty (NaN) values

In [None]:
# count number of nan's in total in data
print()
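One possible way to count missing values is `isna()` followed by `sum()`. The sketch below runs on a hypothetical toy frame with two NaNs planted on purpose, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two deliberately missing values.
toy = pd.DataFrame({
    "horsepower": [102.0, np.nan, 68.0],
    "price": [13950.0, 16430.0, np.nan],
})

# isna() marks missing cells; the first sum() counts them per column,
# the second sum() adds the per-column counts into a single total.
nan_per_column = toy.isna().sum()
total_nans = int(nan_per_column.sum())
print(total_nans)
```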

# Data Preparation

## Take only numerical features

For simplicity, we use only numerical features for now. In practice, all features are usually considered, weighted by their importance and correlation with the target. Feature engineering will be taught later in a separate lecture. You can find some examples [here](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html).

In [None]:
# take only chosen numerical features and assign back to dataframe
cars = 

# Visualizing the data


## Show target distribution

In [None]:
plt.figure(figsize=(20,8))

plt.subplot(1,2,1)
plt.title('Car Price Distribution Plot')
# histplot replaces the deprecated distplot; kde=True overlays the density curve
sns.histplot(cars.price, bins=20, kde=True)

plt.subplot(1,2,2)
plt.title('Car Price Spread')
sns.boxplot(y=cars.price)

plt.show()

### Show some percentiles

In [None]:
print(cars.price.describe(percentiles = [0.25,0.50,0.75,0.85,0.90,1]))

#### Inference :

1. The distribution appears right-skewed: most prices in the dataset are low (below 15,000).
2. There is a significant difference between the mean and the median of the price distribution, which is consistent with the skew.
3. The data points are spread far from the mean, which indicates high variance in car prices (85% of the prices are below 18,500, whereas the remaining 15% are between 18,500 and 45,400).
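The link between right skew and a mean that exceeds the median can be checked numerically. This is a small sketch on made-up prices (the `prices` series is hypothetical), not a computation on the real data.

```python
import pandas as pd

# Hypothetical prices: mostly low values plus one expensive outlier.
prices = pd.Series([5000, 6000, 7000, 8000, 9000, 40000])

mean_price = prices.mean()      # pulled upward by the outlier
median_price = prices.median()  # robust to the outlier
skewness = prices.skew()        # positive for a right-skewed sample

print(mean_price, median_price, skewness)
```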

## Visualising numerical data

### Show histograms

In [None]:
hist_plot(dataframe=cars, features=cars.columns, rows=6, cols=3)

## Look at main statistical properties

In [None]:
cars.describe()

## Show Feature PairWise Scatter Plots

In [None]:
pd.plotting.scatter_matrix(cars, figsize=(20, 20));

## Show feature correlations

In [None]:
corr_plot(cars, figsize=(20,20))

## Show feature correlations with the target, sorted by absolute value

In [None]:
barh_plot(cars.corr()['price'].abs().sort_values(ascending=True))

In [None]:
cars.corr()['price'].abs().sort_values(ascending=False)

In [None]:
# take top correlated features with correlation more than 0.5
top_correlated_features = 
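A possible way to apply such a threshold is boolean indexing on the correlation series. The sketch below uses a hypothetical toy frame; the column names echo the dataset but the numbers are invented.

```python
import pandas as pd

# Hypothetical toy data: two columns track price closely, "noise" does not.
toy = pd.DataFrame({
    "enginesize": [90, 110, 130, 150, 180, 210],
    "citympg": [38, 31, 27, 24, 19, 16],
    "noise": [3, 1, 2, 2, 1, 3],
    "price": [6000, 8500, 11000, 14000, 20000, 27000],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_price = toy.corr()["price"].drop("price")

# Keep features whose absolute correlation exceeds the 0.5 threshold.
top = corr_with_price[corr_with_price.abs() > 0.5].index.tolist()
print(top)
```

Taking the absolute value matters here: `citympg` is negatively correlated with price but is still a strong signal.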

## Take top correlated features and target only

In [None]:
# take only high correlated features + target and assign all of them to cars dataframe (overwrite)
cars = 
cars.head()

## Show Pair Plot of top correlated features

In [None]:
sns.pairplot(cars)
plt.show()

# Train-Test Split and feature scaling

## Split Data

In [None]:
RANDOM_SEED = 42
TEST_SIZE = 0.3

In [None]:
# split data into training and test parts using the random seed and test size (we don't stratify because price is continuous, not categorical)
df_train, df_test = 
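For reference, `train_test_split` accepts a DataFrame directly and returns two DataFrames. A minimal sketch on a hypothetical 20-row frame, assuming the same seed and test size as above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 20-row frame; the values are illustrative only.
toy = pd.DataFrame({
    "enginesize": range(100, 120),
    "price": range(5000, 25000, 1000),
})

# A fixed random_state makes the shuffle reproducible across runs.
train, test = train_test_split(toy, test_size=0.3, random_state=42)
print(len(train), len(test))
```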

## Scale Features

The min-max scaler rescales each feature into the \[0,1\] range. It is the most basic scaling function available. You can explore a variety of scalers [here](https://scikit-learn.org/stable/modules/preprocessing.html).

Initialize the scaler object.

You can find MinMaxScaler [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) 

In [None]:
# create a min-max scaler object from scikit-learn with default parameters
scaler = 

List the numerical features.

In [None]:
# list numerical features again (including price, since it is continuous rather than binary)
numerical_features = 

Fit the scaler on the training data and then transform it. We will transform the test data using the fitted scaler. This is the usual approach because it simulates how unseen data is processed: the test set must not influence the scaling parameters.

In [None]:
# fit and transform numerical features from training data using scaler object created above 
df_train[numerical_features] = 

In [None]:
# transform numerical features from test data using scaler object trained on training data 
df_test[numerical_features] = 
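The fit-on-train, transform-on-test pattern can be sketched as follows. The two tiny frames are hypothetical; they are chosen so that the effect of reusing the training minimum and maximum is easy to verify by hand.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train and test frames with hand-checkable values.
train = pd.DataFrame({"enginesize": [100.0, 150.0, 200.0],
                      "price": [5000.0, 10000.0, 15000.0]})
test = pd.DataFrame({"enginesize": [125.0, 250.0],
                     "price": [7500.0, 20000.0]})
cols = ["enginesize", "price"]

scaler = MinMaxScaler()
# fit_transform learns min and max from the training data only.
train[cols] = scaler.fit_transform(train[cols])
# transform reuses the training min and max; nothing is learned from the test set.
test[cols] = scaler.transform(test[cols])

print(test)
```

Note that scaled test values can fall outside \[0,1\] when a test point lies beyond the training range, as the second test row does here.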

## Check the scaled features using a pair plot

If you compare this to the original pair plot (without transformations), you will see that only the axis ranges changed: min-max scaling is a linear transformation, so the shapes of the relationships and the Pearson correlations stay the same. Feature scaling is mainly done to speed up convergence during training. Good feature engineering and scaling can also improve model performance, but we aren't doing that for now.

In [None]:
sns.pairplot(pd.concat([df_train, df_test], axis=0))
plt.show()
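Min-max scaling applies an affine map with a positive slope to each column, so the Pearson correlation between two columns should be unaffected by it. A quick check on synthetic data (the `toy` frame is randomly generated for this sketch):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: "price" depends linearly on "feature" plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
toy = pd.DataFrame({"feature": x, "price": 3.0 * x + rng.normal(size=50)})

# Scale every column into [0, 1].
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

corr_before = toy.corr().loc["feature", "price"]
corr_after = scaled.corr().loc["feature", "price"]
print(corr_before, corr_after)
```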

## Dividing data into X and y variables

In [None]:
# split all columns in training and test data into input features and target_feature 
input_features = 
target_feature = 

In [None]:
# separate input features and target for training data
y_train = 
X_train = 

In [None]:
# separate input features and target for test data
y_test = 
X_test = 

# Linear Regression

Linear regression is a type of regression analysis used to predict a numerical dependent variable from a set of predictor (independent) variables.

[Linear Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

## Create Linear Regression Model

### Init Model

In [None]:
# create linear regression model
model = 

### Create Scoring for cross validation

In [None]:
# create scoring parameter with mean squared error measure named as "mse"
SCORING = {
    
}
CV = 20

### Cross validate and get scores and models

In [None]:
# calculate cross-validation scores of the linear regression model on the training data using SCORING and CV
# use return_estimator=True to also get the estimator fitted on each split
scores = 

Since linear regression has no hyperparameters to tune, all the models share the same configuration and differ only in the training folds they were fitted on. The scores therefore describe how hard each split is to fit with a linear hyperplane.
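To see what `cross_validate` returns with a custom scorer, here is a minimal sketch on synthetic data. The data, the 5-fold setting, and the variable names are assumptions made for this example; `make_scorer` is used with its default `greater_is_better=True`, so the reported values are plain (positive) MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_validate

# Synthetic regression data: a linear signal plus a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# The dictionary key "mse" becomes the "test_mse" entry in the result.
scoring = {"mse": make_scorer(mean_squared_error)}

scores = cross_validate(
    LinearRegression(), X, y,
    scoring=scoring, cv=5, return_estimator=True,
)
print(sorted(scores.keys()))
```

With `greater_is_better=False` the scorer would negate the values, which matters later when picking the "lowest" score.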

In [None]:
sorted(scores.keys())

### Show Scores and models

In [None]:
# take out "estimator" and "test_mse" from scores data 
all_models = 
all_scores = 

In [None]:
all_models, all_scores

### Check Scores Histogram

In [None]:
pd.Series(all_scores).hist(bins=20);

As noted above, there is no meaningfully best model here: the score differences come from the splits, not from the models.

### Choose one model

In [None]:
# get "Best" model from all models with lowest mse score
best_model = 
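Selecting the entry with the lowest score can be done with `np.argmin`. The sketch below uses hypothetical stand-ins (strings instead of fitted estimators, invented scores) and assumes the scores are positive MSE values, so lower is better.

```python
import numpy as np

# Hypothetical stand-ins for the fitted estimators and their MSE scores.
models = ["model_0", "model_1", "model_2"]
mse_scores = np.array([0.012, 0.007, 0.015])

# argmin returns the position of the smallest score.
best_index = int(np.argmin(mse_scores))
best = models[best_index]
print(best_index, best)
```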

### Do prediction on training data

In [None]:
# predict on training data using best model
y_train_pred = 

### Show error terms distribution

Let's check how the error terms are distributed. If the distribution is close to normal, it suggests that the hyperplane fits our data well.

In [None]:
fig = plt.figure()
# histplot replaces the deprecated distplot; kde=True overlays the density curve
sns.histplot((y_train - y_train_pred), bins=20, kde=True)
fig.suptitle('Error Terms', fontsize = 20) 
plt.xlabel('Errors', fontsize = 18); 

The error terms seem to be approximately normally distributed, so the normality assumption of linear modeling appears to hold.

# Prediction and Evaluation

## Predict on test data

In [None]:
# predict on test data using best model
y_test_pred = best_model.predict(X_test)

## MSE and R2

In [None]:
# calculate and print "mse" and "r2" scores for test data predictions (hint: use imported metrics from sklearn.metrics)
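As a reminder of how the two metrics behave, here is a small sketch with hand-checkable numbers. The arrays `y_true` and `y_pred` are hypothetical and unrelated to the car data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions; two predictions are off by 0.5.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# MSE: mean of squared errors = (0.25 + 0 + 0.25 + 0) / 4
mse = mean_squared_error(y_true, y_pred)
# R2: 1 - SS_res / SS_tot = 1 - 0.5 / 20
r2 = r2_score(y_true, y_pred)

print(f"mse={mse:.3f}, r2={r2:.3f}")
```

Both functions take the true values first and the predictions second; swapping them leaves MSE unchanged but changes R2.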


## Plotting y_test and y_test_pred to understand the spread

In [None]:
fig = plt.figure()
plt.scatter(y_test, y_test_pred)
fig.suptitle('y_test vs y_test_pred', fontsize=20) 
plt.xlabel('y_test', fontsize=18)
plt.ylabel('y_test_pred', fontsize=16); 