# Car Price Prediction Using Keras

#### This notebook follows these steps:

1. [Step 1: Reading and Understanding the Data](#1)
1. [Step 2: Cleaning the Data](#2)
    - Missing Value check
    - Data type check
    - Duplicate check
1. [Step 3: Data Visualization](#3)
    - Heatmap
1. [Step 4: Data Preprocessing](#4) 
   - One-Hot Encoding
1. [Step 5: Splitting the Data into Training and Testing Sets](#5)
1. [Step 6: Normalizing the Data](#6)
1. [Step 7: Building a Model](#7)
1. [Step 8: K-Fold Validation](#8)
1. [Step 9: Training](#9)
1. [Step 10: Model Evaluation](#10)
   - MSE Score
1. [Step 11: Prediction](#11)

## Setting Up the Environment

First, we import all the required libraries.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import os
plt.rcParams['figure.figsize']=(12,5)
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras import regularizers
#!pip install openpyxl

<a id="1"></a> <br>
## Loading Data

In [None]:
df_car = pd.read_csv("../input/car-price-prediction/CarPrice_Assignment.csv")
#data_car = pd.read_excel("../input/car-price-prediction/Data Dictionary - carprices.xlsx")

In [None]:
df_car.head()

#### Checking Shape and Size

In [None]:
print(df_car.shape)
print(df_car.size)

<a id="2"></a> <br>
## Cleaning the Data

In [None]:
df_car.info()

There are no missing values: `info()` reports a full non-null count for every column.
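
The outline also lists a data type check and a duplicate check. A minimal sketch of all three checks on a small made-up frame could look like the following. The `toy` frame and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing value and one fully duplicated row.
toy = pd.DataFrame({
    "horsepower": [111.0, 111.0, np.nan, 154.0],
    "fueltype": ["gas", "gas", "diesel", "gas"],
})

missing_per_column = toy.isnull().sum()      # number of NaN values in each column
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
column_types = toy.dtypes                    # dtype of each column

print(missing_per_column)
print(n_duplicates)
print(column_types)
```

On the real data the same calls would be `df_car.isnull().sum()`, `df_car.duplicated().sum()` and `df_car.dtypes`.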

In [None]:
df_car.describe()

#### Dropping Useless Column

In [None]:
df_car.drop(columns = ['car_ID'], inplace= True)

<a id="3"></a> <br>
## Data Visualization

In [None]:
plt.figure(figsize=(16,10))
sns.heatmap(df_car.select_dtypes(include='number').corr(), annot=True)  # 'number' selects every numeric dtype

Price is strongly positively correlated with wheelbase, carlength, carwidth, curbweight and enginesize, and negatively correlated with citympg and highwaympg.
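
Reading correlations off a heatmap by eye gets harder as the number of columns grows. One alternative is to sort the `price` column of the correlation matrix. The sketch below uses a small made-up frame; the `toy` values are assumptions chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical toy data (not the real dataset): price rises with enginesize
# and falls with citympg.
toy = pd.DataFrame({
    "enginesize": [90, 110, 130, 150, 180],
    "citympg":    [38, 31, 27, 24, 19],
    "price":      [6000, 8500, 12000, 16000, 24000],
})

# Pearson correlation of every other column with price, largest first.
price_corr = (
    toy.corr()["price"]
    .drop("price")
    .sort_values(ascending=False)
)
print(price_corr)
```

Features near +1 or -1 at the two ends of this sorted series are the ones most linearly related to price.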

<a id="4"></a> <br>
## Data Preprocessing

In [None]:
df_car.columns

In [None]:
# Converting categorical data to dummy variables
car_dummies = pd.get_dummies(df_car,columns=['symboling','CarName', 'fueltype', 'aspiration', 'doornumber','carbody', 
                                             'drivewheel','enginelocation', 'enginetype', 'cylindernumber', 'fuelsystem'])

In [None]:
car_dummies.describe()
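
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a two-column frame. The `toy` frame is a made-up assumption; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "horsepower": [111, 56, 154],
})

# Each category of fueltype becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["fueltype"])
print(encoded.columns.tolist())
print(encoded)
```

The columns that are not encoded stay at the front and the indicator columns are appended after them, which is why the numeric features can later be selected by position.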

<a id="5"></a> <br>
## Splitting the Data
We split the data into training and testing sets with a random boolean mask, so roughly 72% of the rows go to training.

In [None]:
# Training and testing data
np.random.seed(11111)
msk = np.random.rand(len(car_dummies)) < 0.72
X_train = car_dummies[msk].copy()  # explicit copies avoid SettingWithCopyWarning on later in-place edits
X_test = car_dummies[~msk].copy()

In [None]:
print(len(X_train))
print(len(X_test))
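
A random mask gives only an approximate split: each row lands in the training set with probability 0.72, so the realized fraction varies with the seed. The sketch below illustrates this with NumPy alone. The row count `n_rows = 205` is an assumption used for illustration.

```python
import numpy as np

n_rows = 205  # assumed row count, for illustration only

np.random.seed(11111)
mask = np.random.rand(n_rows) < 0.72   # True means the row goes to the training set

n_train = int(mask.sum())
n_test = int((~mask).sum())
print(n_train, n_test, round(n_train / n_rows, 3))
```

Every row ends up in exactly one of the two sets, but the training share is only close to 72%, not exactly 72%.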

In [None]:
# Target Data 
y_train = X_train.pop('price')
y_test = X_test.pop('price')

In [None]:
dict(enumerate(X_train.columns))  # map each column position to its name

<a id="6"></a> <br>
## Normalizing the Data
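Here we standardize the numeric features: we subtract the training mean and divide by the training standard deviation. The test set is scaled with the same training statistics, so no information from the test data leaks into preprocessing.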

In [None]:
num_cols = X_train.columns[:13]  # the first 13 columns are the numeric (non-dummy) features
X_mean = X_train[num_cols].mean(axis=0)  # mean of the training data
X_std = X_train[num_cols].std(axis=0)  # std of the training data
X_train[num_cols] = (X_train[num_cols] - X_mean) / X_std  # standardize the training data
X_test[num_cols] = (X_test[num_cols] - X_mean) / X_std  # standardize the test data with training statistics
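
The same standardization can be checked on plain NumPy arrays. This is a minimal sketch with made-up numbers; `train_feats` and `test_feats` are hypothetical stand-ins for the numeric feature columns.

```python
import numpy as np

# Hypothetical feature matrices: rows are cars, columns are numeric features.
train_feats = np.array([[88.6, 2548.0],
                        [94.5, 2823.0],
                        [99.8, 2337.0],
                        [105.8, 3086.0]])
test_feats = np.array([[96.0, 2500.0]])

feat_mean = train_feats.mean(axis=0)
feat_std = train_feats.std(axis=0, ddof=1)  # ddof=1 matches the pandas default for .std()

train_scaled = (train_feats - feat_mean) / feat_std
test_scaled = (test_feats - feat_mean) / feat_std  # reuse the training statistics

print(train_scaled.mean(axis=0))         # approximately 0 for every column
print(train_scaled.std(axis=0, ddof=1))  # approximately 1 for every column
```

After scaling, the training columns have zero mean and unit standard deviation. The test rows generally do not, because they are scaled with statistics they did not contribute to.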

In [None]:
y_mean = y_train.mean() 
y_train -= y_mean
y_std = y_train.std()
y_train /= y_std
y_test -= y_mean
y_test /= y_std

### Changing Data Type To Float

In [None]:
X_train = np.asarray(X_train).astype(float)
X_test = np.asarray(X_test).astype(float)

y_train = np.asarray(y_train).astype(float)
y_test = np.asarray(y_test).astype(float)

In [None]:
print(len(X_train))
print(len(y_train))
print(len(X_test))
print(len(y_test))

<a id="7"></a> <br>
## Building a Model

In [None]:
def build_model():
    model = Sequential()
    model.add(Dense(80 , activation='relu', input_shape=(X_train.shape[-1],))) # First hidden layer; input_shape sets the number of input features
    model.add(Dropout(0.5)) # Dropout Layer
    model.add(Dense(40 , activation='relu'))
    model.add(Dropout(0.5)) # Dropout Layer
    model.add(Dense(20 , activation='relu'))
    model.add(Dropout(0.5)) # Dropout Layer
    model.add(Dense(10 , activation='relu'))
    model.add(Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae']) # Compiling Model
    return model

**Note** that the network is compiled with the `mse` loss function (mean squared error),
the average of the squared differences between the predictions and the targets. This is a widely
used loss function for regression problems. The `mae` metric (mean absolute error) is tracked alongside it.
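
Both quantities are easy to compute by hand, which helps when reading the numbers Keras reports. A small sketch with made-up targets and predictions:

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only.
targets = np.array([1.0, 2.0, 3.0, 4.0])
predictions = np.array([1.5, 2.0, 2.0, 6.0])

errors = predictions - targets           # [0.5, 0.0, -1.0, 2.0]
mse = float(np.mean(errors ** 2))        # (0.25 + 0 + 1 + 4) / 4
mae = float(np.mean(np.abs(errors)))     # (0.5 + 0 + 1 + 2) / 4

print(mse, mae)
```

Because errors are squared, the single large miss of 2.0 dominates the MSE, while the MAE weights all misses linearly.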

In [None]:
build_model().summary()

<a id="8"></a> <br>
## K-Fold Validation

In [None]:
k = 4  # number of folds
num_val_samples = len(X_train) // k
num_epochs = 100
all_scores_relu = []
for i in range(k):
    print('processing fold #', i)
    val_X = X_train[i * num_val_samples: (i + 1) * num_val_samples]
    val_y = y_train[i * num_val_samples: (i + 1) * num_val_samples]
    partial_train_data = np.concatenate([X_train[:i * num_val_samples],X_train[(i + 1) * num_val_samples:]],  axis=0)
    # print(partial_train_data)
    partial_train_targets = np.concatenate([y_train[:i * num_val_samples],y_train[(i + 1) * num_val_samples:]],axis=0)
    model = build_model()
    model.fit(partial_train_data, partial_train_targets,epochs=num_epochs, batch_size=1, verbose=0)
    val_mse, val_mae = model.evaluate(val_X, val_y, verbose=0)
    all_scores_relu.append(val_mae)
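
The slicing in the loop above is easier to follow on bare indices. The sketch below reproduces the same contiguous-block scheme with a hypothetical helper, `fold_indices`, which is not part of the notebook.

```python
import numpy as np

def fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs using contiguous validation blocks."""
    fold_size = n_samples // k
    indices = np.arange(n_samples)
    for i in range(k):
        val_idx = indices[i * fold_size:(i + 1) * fold_size]
        train_idx = np.concatenate(
            [indices[:i * fold_size], indices[(i + 1) * fold_size:]]
        )
        yield train_idx, val_idx

folds = list(fold_indices(n_samples=10, k=4))
for train_idx, val_idx in folds:
    print("val:", val_idx, "train:", train_idx)
```

With 10 samples and 4 folds the fold size is 2, so samples 8 and 9 are never used for validation. The same happens in the notebook whenever `len(X_train)` is not divisible by `k`.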

##### Validation MSE (Last Fold Only)

In [None]:
val_mse

In [None]:
all_scores_relu
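
`all_scores_relu` holds one validation MAE per fold, so the usual summary is their mean, with the spread as a rough indication of how much the score depends on the split. A minimal sketch follows; the values in `fold_mae` are made-up placeholders, not results from this notebook.

```python
import numpy as np

# Hypothetical per-fold validation MAE values (placeholders, not real results).
fold_mae = [0.31, 0.27, 0.35, 0.29]

mean_mae = float(np.mean(fold_mae))  # average score across folds
spread = float(np.std(fold_mae))     # fold-to-fold variation

print(round(mean_mae, 3), round(spread, 3))
```

In the notebook the same summary would be `np.mean(all_scores_relu)` and `np.std(all_scores_relu)`.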

<a id="9"></a> <br>
## Training
Here we build a fresh model, train it on the full training set, and evaluate it on the test set.

In [None]:
model_relu = build_model()
model_relu.fit(X_train, y_train,epochs= 80, batch_size=1, verbose=0)
test_mse_score, test_mae_score = model_relu.evaluate(X_test, y_test)

<a id="10"></a> <br>
## Model Evaluation

In [None]:
# MSE Score
test_mse_score

In [None]:
# MAE Score
test_mae_score
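
Both scores are in normalized units, because the targets were divided by `y_std` before training. To express them in the original price units, the MAE can be multiplied by the standard deviation, and the square root of the MSE (the RMSE) can be scaled the same way. The sketch below uses made-up numbers; `toy_y_std` and both scores are assumptions.

```python
import math

# Hypothetical values, for illustration only.
toy_y_std = 8000.0      # standard deviation of the training prices
normalized_mae = 0.25   # MAE reported on normalized targets
normalized_mse = 0.16   # MSE reported on normalized targets

# The mean shift cancels in (prediction - target), so only the scale matters.
mae_in_price_units = normalized_mae * toy_y_std
rmse_in_price_units = math.sqrt(normalized_mse) * toy_y_std

print(mae_in_price_units, rmse_in_price_units)
```

In the notebook this would correspond to `test_mae_score * y_std` and `np.sqrt(test_mse_score) * y_std`.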

<a id="11"></a> <br>
## Prediction

In [None]:
x_relu = model_relu.predict(X_test[5].reshape(1,X_test.shape[1]))

**Note** that the targets were normalized as `x = (y - mean) / std`, so we reverse the process to get the price back in its original units: `y = x * std + mean`. We then compare the result with the actual target value.

In [None]:
x_relu * y_std + y_mean

### Actual Value

In [None]:
y_test[5] * y_std + y_mean
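
A quick round trip confirms that the inverse transform recovers the original values. This is a minimal sketch on made-up prices; `toy_prices` and the derived statistics are assumptions.

```python
import numpy as np

# Hypothetical prices, for illustration only.
toy_prices = np.array([5000.0, 10000.0, 15000.0, 30000.0])
price_mean = toy_prices.mean()
price_std = toy_prices.std(ddof=1)

normalized = (toy_prices - price_mean) / price_std  # forward: x = (y - mean) / std
recovered = normalized * price_std + price_mean     # inverse: y = x * std + mean

print(recovered)
```

The recovered array matches the input up to floating-point rounding. The same arithmetic applies to the model output, which Keras returns as an array of shape `(1, 1)` for a single input row and a one-unit output layer.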
