# ***Car Sales Price Prediction***

## Understanding the Problem & Dataset
*Perform EDA and derive insights from the CAR DETAILS dataset using Python data analysis and visualization libraries such as Pandas, Matplotlib, and Seaborn. Create and deploy an ML model that anyone can access, using Streamlit and GitHub.*

**About Dataset :**  
*This dataset contains information about used cars. This data can be used for a lot of purposes such as price prediction to exemplify the use of linear regression in Machine Learning.*

*The columns in the given dataset are as follows:*
- name
- year
- selling_price
- km_driven
- fuel
- seller_type
- transmission
- owner

### Questions
 - Explore the data using Excel. Understand the data and prepare a short summary of the dataset in the PPT.
 - Download the CAR DETAILS dataset and perform data cleaning and pre-processing if necessary.
 - Use pre-processing methods such as handling null values, one-hot encoding, imputation, and scaling where necessary.
 - Perform exploratory data analysis (EDA) and graphical analysis on the data. Include the graphs with the conclusions drawn from them.
 - Prepare the data for machine learning modeling.
 - Apply various machine learning techniques such as regression or classification, bagging, and ensemble techniques, and find the best model using various model evaluation metrics.
 - Save the best model and load the model.
 - Take the original dataset, create another dataset by randomly picking 20 data points from the CAR DETAILS dataset, and apply the saved model to it to test the model.
 - Create a GitHub account on the GitHub website. Create a repository named Data Science Capstone Project and upload the model with the dataset and code file.
 - Create a Streamlit account on the Streamlit website. Connect your GitHub account with Streamlit.
 - Create an app.py file and the other dependency files needed to deploy the Streamlit app on Streamlit Cloud. Make a simple website, deploy your ML model on Streamlit, and make the website public.
 - Share the Streamlit website and GitHub repository links in the project PPT.

# Model Building, Training & Testing

#### Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR
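# Import only the metrics used in eval_model instead of a wildcard import
from sklearn.metrics import mean_absolute_error, mean_squared_error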

import warnings 
warnings.filterwarnings('ignore')

## Loading and Understanding the dataset

In [2]:
df = pd.read_csv(r"C:\Users\shubh\Downloads\2\cleaned data.csv")
df.head()

Unnamed: 0,brand,year,km_driven,fuel,seller_type,transmission,owner,selling_price
0,Maruti,2007.0,70000.0,Petrol,Individual,Manual,First Owner,60000.0
1,Maruti,2007.0,50000.0,Petrol,Individual,Manual,First Owner,135000.0
2,Hyundai,2012.0,100000.0,Diesel,Individual,Manual,First Owner,600000.0
3,Other,2017.0,46000.0,Petrol,Individual,Manual,First Owner,250000.0
4,Honda,2014.0,141000.0,Diesel,Individual,Manual,Second Owner,450000.0


### Encode the Categorical Features

In [3]:
cat_cols = df.dtypes[df.dtypes=='object'].index
print(cat_cols)

Index(['brand', 'fuel', 'seller_type', 'transmission', 'owner'], dtype='object')


In [4]:
for i in cat_cols:
    print(i,df[i].unique(),df[i].nunique())
    print()

brand ['Maruti' 'Hyundai' 'Other' 'Honda' 'Tata' 'Chevrolet' 'Toyota' 'Skoda'
 'Mahindra' 'Ford' 'Nissan' 'Renault' 'Volkswagen'] 13

fuel ['Petrol' 'Diesel' 'CNG' 'LPG' 'Electric'] 5

seller_type ['Individual' 'Dealer' 'Trustmark Dealer'] 3

transmission ['Manual' 'Automatic'] 2

owner ['First Owner' 'Second Owner' 'Fourth & Above Owner' 'Third Owner'
 'Test Drive Car'] 5
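
The task list mentions one-hot encoding, while the cells that follow use label encoding, which maps each category to an integer and so implies an order that linear models may pick up. As a point of comparison, here is a minimal sketch of one-hot encoding with `pd.get_dummies`. The `toy` frame is a hypothetical miniature with illustrative values, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the cleaned data; values are illustrative only.
toy = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "CNG", "Petrol"],
    "transmission": ["Manual", "Manual", "Automatic", "Manual"],
    "km_driven": [70000.0, 100000.0, 46000.0, 50000.0],
})

# One indicator column per category; numeric columns pass through unchanged.
toy_encoded = pd.get_dummies(toy, columns=["fuel", "transmission"], dtype=int)
print(toy_encoded.columns.tolist())
```

With 13 brands this approach would add 13 indicator columns for `brand` alone, which is a trade-off to weigh against the implied ordering of label encoding.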




In [6]:
df.dtypes

brand             object
year             float64
km_driven        float64
fuel              object
seller_type       object
transmission      object
owner             object
selling_price    float64
dtype: object

In [7]:
df.head()

Unnamed: 0,brand,year,km_driven,fuel,seller_type,transmission,owner,selling_price
0,Maruti,2007.0,70000.0,Petrol,Individual,Manual,First Owner,60000.0
1,Maruti,2007.0,50000.0,Petrol,Individual,Manual,First Owner,135000.0
2,Hyundai,2012.0,100000.0,Diesel,Individual,Manual,First Owner,600000.0
3,Other,2017.0,46000.0,Petrol,Individual,Manual,First Owner,250000.0
4,Honda,2014.0,141000.0,Diesel,Individual,Manual,Second Owner,450000.0


In [8]:
lb = LabelEncoder()

In [9]:
for col in cat_cols:
    df[col] = lb.fit_transform(df[col])

In [10]:
df.head()

Unnamed: 0,brand,year,km_driven,fuel,seller_type,transmission,owner,selling_price
0,5,2007.0,70000.0,4,1,1,0,60000.0
1,5,2007.0,50000.0,4,1,1,0,135000.0
2,3,2012.0,100000.0,1,1,1,0,600000.0
3,7,2017.0,46000.0,4,1,1,0,250000.0
4,2,2014.0,141000.0,1,1,1,2,450000.0
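
The loop above reuses a single `LabelEncoder`, so after it finishes only the mapping for the last column (`owner`) is still available. If the mappings are needed later, for example to encode user input in the Streamlit app, one option is to keep a separate encoder per column. The sketch below assumes a hypothetical miniature frame `toy`; the `encoders` dictionary is an illustrative name, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame; the notebook would use df and cat_cols instead.
toy = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "CNG", "LPG", "Electric"],
    "transmission": ["Manual", "Automatic", "Manual", "Manual", "Automatic"],
})

encoders = {}  # column name -> fitted LabelEncoder (illustrative helper)
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# LabelEncoder assigns integer codes in sorted order of the labels.
fuel_mapping = {label: code for code, label in enumerate(encoders["fuel"].classes_)}
print(fuel_mapping)
```

The resulting fuel mapping (CNG 0, Diesel 1, Electric 2, LPG 3, Petrol 4) matches the codes visible in the encoded `df.head()` output, where Petrol appears as 4 and Diesel as 1.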


### Select x and y

In [11]:
x = df.drop('selling_price',axis=1)
y = df['selling_price']
print(x.shape)
print(y.shape)

(3577, 7)
(3577,)


### Split the data into train and test

In [12]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.30,
                                                random_state=42)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(2503, 7)
(1074, 7)
(2503,)
(1074,)
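
The split sizes can be checked by hand. With a fractional `test_size`, scikit-learn rounds the number of test rows up, and the remaining rows form the training set. A short sketch of that arithmetic:

```python
import math

n_samples = 3577   # rows in x, from the shape printed earlier
test_size = 0.30

n_test = math.ceil(test_size * n_samples)   # 1073.1 rounds up to 1074
n_train = n_samples - n_test
print(n_train, n_test)
```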


In [13]:
x_train.head()

Unnamed: 0,brand,year,km_driven,fuel,seller_type,transmission,owner
1078,5,2012.0,90000.0,4,1,1,4
844,5,2019.0,5000.0,4,1,1,0
1656,11,2013.0,80000.0,1,1,1,4
2887,4,2013.0,149534.8,1,1,1,0
1731,5,2003.0,35000.0,4,1,1,0


In [14]:
x_test.head()

Unnamed: 0,brand,year,km_driven,fuel,seller_type,transmission,owner
907,5,2018.0,20000.0,4,1,1,0
2684,5,2017.0,39000.0,4,0,1,0
1373,11,2016.0,146000.0,1,0,1,0
538,7,2018.0,10000.0,1,0,0,0
1454,2,2007.0,70000.0,4,1,1,2


#### Create Function to Evaluate the Model

In [15]:
def eval_model(model, mname):
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    train_r2 = model.score(x_train, y_train)
    test_r2 = model.score(x_test, y_test)
    test_mae = mean_absolute_error(y_test, y_pred)
    test_mse = mean_squared_error(y_test, y_pred)
    test_rmse = np.sqrt(test_mse)
    res_df = pd.DataFrame({
        'Train_R2': train_r2,
        'Test_R2': test_r2,
        'Test_MAE': test_mae,
        'Test_MSE': test_mse,
        'Test_RMSE': test_rmse
    }, index=[mname])
    return res_df
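
For reference, the metrics collected by `eval_model` can be written out directly with NumPy. The sketch below is not part of the original notebook; `regression_metrics` is a hypothetical helper and the prices are illustrative. For a regressor, `model.score` returns R², which is computed here as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R2, MAE, MSE and RMSE from actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": rmse}

# Illustrative prices only, not taken from the dataset.
example = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 320.0])
print(example)
```

MAE and RMSE are in the same unit as `selling_price`, which makes them easier to interpret than MSE.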

#### Build ML models

#### 1) Linear Regression

In [16]:
lr1 = LinearRegression()

lr1_res = eval_model(lr1,'LinearRegressor')
lr1_res

Unnamed: 0,Train_R2,Test_R2,Test_MAE,Test_MSE,Test_RMSE
LinearRegressor,0.555788,0.579887,153230.769931,39446870000.0,198612.357479


#### 2) Ridge Regression

In [17]:
ridge = Ridge()

ridge_res = eval_model(ridge,'ridge')
ridge_res

Unnamed: 0,Train_R2,Test_R2,Test_MAE,Test_MSE,Test_RMSE
ridge,0.555785,0.579828,153256.885429,39452430000.0,198626.367221


#### 3) Lasso Regression

In [18]:
lasso = Lasso()

lasso_res = eval_model(lasso,'lasso')
lasso_res

Unnamed: 0,Train_R2,Test_R2,Test_MAE,Test_MSE,Test_RMSE
lasso,0.555788,0.579887,153230.914584,39446900000.0,198612.429698


#### 4) Decision Tree Regression

In [19]:
dt1 = DecisionTreeRegressor(max_depth=8,min_samples_split=12)  # random_state is not set, so scores may vary slightly between runs

dt1_res = eval_model(dt1,'DecisionTreeRegressor')
dt1_res

Unnamed: 0,Train_R2,Test_R2,Test_MAE,Test_MSE,Test_RMSE
DecisionTreeRegressor,0.77143,0.678067,120035.656589,30228230000.0,173862.676653


#### 5) Random Forest Regression

In [20]:
rf1 = RandomForestRegressor(n_estimators=80,max_depth=8,
                            min_samples_split=12)

rf1_res = eval_model(rf1,'RandomForestRegressor')
rf1_res

Unnamed: 0,Train_R2,Test_R2,Test_MAE,Test_MSE,Test_RMSE
RandomForestRegressor,0.788152,0.734938,112866.709593,24888280000.0,157760.202851


#### 6) Gradient Boosting Regressor

In [21]:
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

gbr_res = eval_model(gbr,'GradientBoostingRegressor')
gbr_res

Unnamed: 0,Train_R2,Test_R2,Test_MAE,Test_MSE,Test_RMSE
GradientBoostingRegressor,0.756626,0.73666,115429.761771,24726510000.0,157246.649185


#### 7) KNeighborsRegressor

In [22]:
knn = KNeighborsRegressor(n_neighbors=5)
knn_res = eval_model(knn,'KNeighborsRegressor')
knn_res

Unnamed: 0,Train_R2,Test_R2,Test_MAE,Test_MSE,Test_RMSE
KNeighborsRegressor,0.534877,0.327522,179414.437803,63142970000.0,251282.652291
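
KNN predicts from distances between rows, and here `km_driven` is in the tens of thousands while the encoded categories are single digits, so the unscaled distances are likely dominated by `km_driven`. This is one possible contributor to the low test score and was not verified in this notebook. The sketch below uses synthetic data (an assumption, not the car dataset) to show how a `StandardScaler` placed in front of the regressor changes the outcome when a large-scale feature carries no signal.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: a large-scale feature unrelated to the target and
# a small-scale feature that drives it.
large_scale = rng.uniform(0, 150000, size=400)
small_scale = rng.integers(0, 5, size=400)
X = np.column_stack([large_scale, small_scale])
y = 100000.0 * small_scale + rng.normal(0, 1000, size=400)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Same estimator, with and without standardization of the inputs.
raw_knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=5)
).fit(X_train, y_train)

raw_r2 = raw_knn.score(X_test, y_test)
scaled_r2 = scaled_knn.score(X_test, y_test)
print(raw_r2, scaled_r2)
```

Because the pipeline exposes `fit`, `predict`, and `score`, it could be passed to `eval_model` in place of the bare regressor.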


#### 8) AdaBoostRegressor

In [23]:
base_regressor = DecisionTreeRegressor(max_depth=4)

# Initialize the AdaBoostRegressor
ada_regressor = AdaBoostRegressor(estimator=base_regressor, n_estimators=100, random_state=42)
ada_res = eval_model(ada_regressor,'AdaBoostRegressor')
ada_res

Unnamed: 0,Train_R2,Test_R2,Test_MAE,Test_MSE,Test_RMSE
AdaBoostRegressor,0.546912,0.553408,172073.81943,41933120000.0,204775.79026


#### 9) BaggingRegressor

In [24]:
base_regressor = DecisionTreeRegressor()

bagging_regressor = BaggingRegressor(estimator=base_regressor, n_estimators=100, random_state=42)
bagging_res = eval_model(bagging_regressor,'BaggingRegressor')
bagging_res

Unnamed: 0,Train_R2,Test_R2,Test_MAE,Test_MSE,Test_RMSE
BaggingRegressor,0.939412,0.714573,116540.092679,26800440000.0,163708.404279


In [25]:
all_res = pd.concat([lr1_res, ridge_res, lasso_res, dt1_res, rf1_res, gbr_res, knn_res, ada_res, bagging_res])
all_res

Unnamed: 0,Train_R2,Test_R2,Test_MAE,Test_MSE,Test_RMSE
LinearRegressor,0.555788,0.579887,153230.769931,39446870000.0,198612.357479
ridge,0.555785,0.579828,153256.885429,39452430000.0,198626.367221
lasso,0.555788,0.579887,153230.914584,39446900000.0,198612.429698
DecisionTreeRegressor,0.77143,0.678067,120035.656589,30228230000.0,173862.676653
RandomForestRegressor,0.788152,0.734938,112866.709593,24888280000.0,157760.202851
GradientBoostingRegressor,0.756626,0.73666,115429.761771,24726510000.0,157246.649185
KNeighborsRegressor,0.534877,0.327522,179414.437803,63142970000.0,251282.652291
AdaBoostRegressor,0.546912,0.553408,172073.81943,41933120000.0,204775.79026
BaggingRegressor,0.939412,0.714573,116540.092679,26800440000.0,163708.404279
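
The comparison table can be ranked programmatically rather than by eye. The sketch below rebuilds the test metrics from the table above in a standalone frame named `results` (a hypothetical name); in the notebook the same calls could be applied to `all_res` directly.

```python
import pandas as pd

# Test metrics copied from the comparison table above.
results = pd.DataFrame(
    {
        "Test_R2": [0.579887, 0.579828, 0.579887, 0.678067, 0.734938,
                    0.736660, 0.327522, 0.553408, 0.714573],
        "Test_MAE": [153230.769931, 153256.885429, 153230.914584,
                     120035.656589, 112866.709593, 115429.761771,
                     179414.437803, 172073.81943, 116540.092679],
        "Test_RMSE": [198612.357479, 198626.367221, 198612.429698,
                      173862.676653, 157760.202851, 157246.649185,
                      251282.652291, 204775.79026, 163708.404279],
    },
    index=["LinearRegressor", "ridge", "lasso", "DecisionTreeRegressor",
           "RandomForestRegressor", "GradientBoostingRegressor",
           "KNeighborsRegressor", "AdaBoostRegressor", "BaggingRegressor"],
)

best_by_r2 = results["Test_R2"].idxmax()      # higher is better
best_by_mae = results["Test_MAE"].idxmin()    # lower is better
best_by_rmse = results["Test_RMSE"].idxmin()  # lower is better
print(best_by_r2, best_by_mae, best_by_rmse)
```

The ranking depends on the metric: the random forest and gradient boosting models are within a fraction of a percent of each other on every test metric.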


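### Selected model: RandomForestRegressor

RandomForestRegressor has the lowest Test_MAE (112866.71). GradientBoostingRegressor is marginally ahead on Test_R2 (0.73666 vs 0.734938) and Test_RMSE (157246.65 vs 157760.20), so the two models are effectively tied. RandomForestRegressor is the model saved below.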

### Saving the Model

In [26]:
import pickle

In [27]:
# pickle.dump(gbr,open('GradientBoosting.pkl','wb'))
with open('RandomForest.pkl', 'wb') as file:  # wb = write binary; the file is closed on exit
    pickle.dump(rf1, file)

### Loading the saved model

In [28]:
model_path = "RandomForest.pkl"
with open(model_path, "rb") as file:  # rb = read binary
    load_model = pickle.load(file)
print(f"Name of loaded Model : {model_path}")
load_model

Name of loaded Model : RandomForest.pkl


In [29]:
with open('RandomForest.pkl', 'rb') as file:
    load_model = pickle.load(file)
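
A quick way to confirm that a save and load round trip preserved the model is to compare predictions before and after. The sketch below is self-contained and uses synthetic data with an in-memory buffer instead of a file, both of which are assumptions made for illustration. In the notebook the comparison would be between `rf1.predict(x_test)` and `load_model.predict(x_test)`.

```python
import io
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(50, 3))       # synthetic features
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])    # synthetic target

model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Serialize to an in-memory buffer so the sketch does not touch the file system.
buffer = io.BytesIO()
pickle.dump(model, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickled scikit-learn models are generally expected to be loaded with the same library version that created them, which is worth pinning in the deployment dependencies.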

#### Take the original data set and make another dataset by randomly picking 20 data points from the CAR DETAILS dataset and apply the saved model on the same Dataset and test the model.

#### Generating sample data from cleaned df to test on the trained model.

In [30]:
random_datasample = df.sample(20)
random_datasample_df = random_datasample.drop("selling_price", axis=1)
print(random_datasample_df.shape)
random_datasample_df.head()

(20, 7)


Unnamed: 0,brand,year,km_driven,fuel,seller_type,transmission,owner
2684,5,2017.0,39000.0,4,0,1,0
1583,5,2008.0,35008.0,4,0,1,0
2669,5,2002.0,60000.0,4,1,1,4
2284,5,2012.0,149534.8,1,1,1,0
2416,4,2005.0,149534.8,1,1,1,4


#### Resetting the index, since the randomly sampled rows have no continuous index (temporary, for understanding only)

In [31]:
random_datasample_df.reset_index()

Unnamed: 0,index,brand,year,km_driven,fuel,seller_type,transmission,owner
0,2684,5,2017.0,39000.0,4,0,1,0
1,1583,5,2008.0,35008.0,4,0,1,0
2,2669,5,2002.0,60000.0,4,1,1,4
3,2284,5,2012.0,149534.8,1,1,1,0
4,2416,4,2005.0,149534.8,1,1,1,4
5,3292,5,2005.0,70000.0,4,1,1,2
6,2598,0,2012.0,120000.0,1,1,0,4
7,2450,0,2014.0,52000.0,1,0,1,0
8,1769,5,2016.0,40000.0,1,1,1,2
9,2754,3,2015.0,70000.0,1,1,1,0


In [32]:
random_datasample_df.to_csv("20_random_sample.csv", index=False)

#### Loading the sample data and checking basics

In [33]:
testsample_df = pd.read_csv("20_random_sample.csv")
print(
    "Shape of loaded sample dataframe:",
    testsample_df.shape,
    "\n\nSample Dataframe contents",
)
testsample_df

Shape of loaded sample dataframe: (20, 7) 

Sample Dataframe contents


Unnamed: 0,brand,year,km_driven,fuel,seller_type,transmission,owner
0,5,2017.0,39000.0,4,0,1,0
1,5,2008.0,35008.0,4,0,1,0
2,5,2002.0,60000.0,4,1,1,4
3,5,2012.0,149534.8,1,1,1,0
4,4,2005.0,149534.8,1,1,1,4
5,5,2005.0,70000.0,4,1,1,2
6,0,2012.0,120000.0,1,1,0,4
7,0,2014.0,52000.0,1,0,1,0
8,5,2016.0,40000.0,1,1,1,2
9,3,2015.0,70000.0,1,1,1,0


#### Making Predictions on sample dataset against the trained model

In [34]:
# Predict selling prices for the 20 sampled rows with the loaded model
predicted_data = load_model.predict(testsample_df)
print("The predicted data from RandomForest model:\n", predicted_data)

The predicted data from RandomForest model:
 [ 399172.39560742  126646.40715668   84215.64066698  406224.98349041
  203691.54928784   99076.8162395   542356.42035961  410605.03177213
  625454.48858924  513792.46656592  299867.19702465  259507.98039449
  461813.86052521  126646.40715668  309616.51446951 1076302.5457455
  229562.55361735  477055.83737642  447721.57131376  478754.4430406 ]
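
The sample was saved without `selling_price`, so the predictions above cannot be checked against the actual prices. A minimal sketch of such a check follows. `compare_predictions` is a hypothetical helper and the numbers are illustrative. In the notebook, the actual values would be `random_datasample["selling_price"]` and the predicted values would be `predicted_data`. Since the sample is drawn from the full `df`, some of its rows were probably also in the training split, so the error on it may look better than on unseen data.

```python
import numpy as np
import pandas as pd

def compare_predictions(actual, predicted):
    """Return a per-row table of actual price, predicted price and absolute error."""
    report = pd.DataFrame({
        "actual": np.asarray(actual, dtype=float),
        "predicted": np.asarray(predicted, dtype=float),
    })
    report["abs_error"] = (report["actual"] - report["predicted"]).abs()
    return report

# Illustrative numbers only, not results from the trained model.
report = compare_predictions(
    [60000.0, 135000.0, 600000.0],
    [84000.0, 126000.0, 540000.0],
)
print(report)
print("Sample MAE:", report["abs_error"].mean())
```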
