Aim: predict the selling price of a car from its features, including its present price.

In [3]:
#importing the libraries
import pickle
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings('ignore')

Loading the exploratory data set

In [4]:
#loading the pickled DataFrame; the with-block closes the file afterwards
with open('C:\\Users\\acer\\Desktop\\Car-Price-Prediction\\models\\ExploratoryDataAnalysis.pkl','rb') as file:
    df_eda = pickle.load(file)
df_eda.head()

AttributeError: Can't get attribute '_unpickle_block' on <module 'pandas._libs.internals' from 'C:\\Users\\acer\\anaconda3\\lib\\site-packages\\pandas\\_libs\\internals.cp39-win_amd64.pyd'>
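
An AttributeError that mentions `_unpickle_block` usually points to a pandas version mismatch: the pickle was most likely written by a newer pandas release than the one loading it. Upgrading pandas in the loading environment is one option. Another, sketched below under the assumption that the frame can be re-exported from the environment that created it, is to store it as CSV, which does not depend on pandas internals. The column values and the CSV file name are hypothetical stand-ins.

```python
import os
import tempfile

import pandas as pd

# Compare this with the pandas version that wrote the pickle
print('pandas version:', pd.__version__)

# Hypothetical stand-in for the exploratory DataFrame
df_demo = pd.DataFrame({'present_price': [5.59, 9.54, 9.85],
                        'selling_price': [3.35, 4.75, 7.25]})

# In the environment that created the data: export to CSV instead of pickle
csv_path = os.path.join(tempfile.mkdtemp(), 'ExploratoryDataAnalysis.csv')
df_demo.to_csv(csv_path, index=False)

# In the environment that trains the models: read the CSV back
df_restored = pd.read_csv(csv_path)
print(df_restored.head())
```

CSV drops dtype details such as categoricals, so this only fits frames whose columns survive a text round trip.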

In [None]:
#independent features (X) and dependent feature (y)
X=df_eda.drop(columns='selling_price')
y=df_eda['selling_price']

In [None]:
#splitting the data into train (70%) and test (30%) sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1)
print(X_train)
print(X_test)
print(y_train)

In [None]:
#standardization of the data
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)  #fit the scaler on the training data only
X_test=scaler.transform(X_test)        #reuse the training mean and variance on the test data
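
The scaler's mean and standard deviation should come from the training split only, so that no information from the test split leaks into preprocessing. A minimal sketch on synthetic data (the arrays and the `_demo` names are hypothetical) showing that `transform` applies the training statistics to the test split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the train and test feature matrices
X_train_demo = rng.normal(loc=10.0, scale=3.0, size=(70, 2))
X_test_demo = rng.normal(loc=10.0, scale=3.0, size=(30, 2))

scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean_ and scale_ from train
X_test_scaled = scaler_demo.transform(X_test_demo)        # reuses the training statistics

# The test split is scaled with the training mean and standard deviation
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
print(np.allclose(X_test_scaled, manual))
```

Calling `fit_transform` on the test split instead would rescale it with its own statistics, so the model would see test features on a slightly different scale than it was trained on.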

In [None]:
#machine learning algorithms
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree

Building a linear regression model

In [None]:
#linear regression model
lin_reg=LinearRegression()
lin_reg.fit(X_train, y_train)#training the algorithm

#getting coefficients and intercepts
print('coefficients:',lin_reg.coef_)
print('intercept:',lin_reg.intercept_)

#predicting on the test data
y_pred_lin=lin_reg.predict(X_test)

#compare the actual output values and predicted values
df = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred_lin})
df.reset_index(drop=True, inplace=True)  #returns None when inplace=True, so print df separately
print(df)

#showing the difference between the actual and predicted values
df1=df.head(25)
df1.plot(kind='bar', figsize=(15,5))
plt.show()

#evaluate the model with regression metrics
from sklearn import metrics

print('r2_score:',metrics.r2_score(y_test,y_pred_lin))
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred_lin))
print('Mean Squared Error:',metrics.mean_squared_error(y_test,y_pred_lin))
print('Root Mean Square Error:',np.sqrt(metrics.mean_squared_error(y_test,y_pred_lin)))

#or: LinearRegression.score() also returns the R^2 score
print('R^2 score on training data:',lin_reg.score(X_train,y_train))
print('R^2 score on testing data:',lin_reg.score(X_test,y_test))


In [None]:
sns.set(rc={'figure.facecolor':'lightblue'})
plt.scatter(y_test,y_pred_lin)
plt.xlabel('Actual selling price')
plt.ylabel('Predicted selling price')

In [None]:
#building a LR model using the statsmodels ordinary least squares (OLS) model
#!pip install statsmodels
import statsmodels.api as sm

y=df_eda['selling_price']
X=df_eda.drop(['selling_price'],axis=1)
X_constant=sm.add_constant(X)  #adds the intercept column, which sm.OLS does not add by itself
model=sm.OLS(y,X_constant).fit()
y_pred_ols=model.predict(X_constant)  #in-sample predictions
print(model.summary())

Regularized Regression
1. Ridge Regression
Ridge regression is an extension of linear regression in which the loss function is modified to penalize model complexity. The modification adds a penalty term equal to alpha times the sum of the squared coefficients (the squared L2 norm).

Loss function = OLS + alpha * summation (squared coefficient values)

In the above loss function, alpha is the parameter we need to select. A low alpha value can lead to over-fitting, whereas a high alpha value can lead to under-fitting.
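
To make the effect of alpha concrete, here is a small sketch on synthetic data (the data and the helper `ridge_closed_form` are illustrative, not part of the car data set). It solves the ridge loss in closed form without an intercept, checks the result against scikit-learn's `Ridge`, and prints how the coefficient norm shrinks as alpha grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms, matches = [], []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    w = ridge_closed_form(X_demo, y_demo, alpha)
    sk = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo)
    matches.append(np.allclose(w, sk.coef_))  # closed form agrees with scikit-learn
    norms.append(np.linalg.norm(w))
    print(f'alpha={alpha:>8}: ||w|| = {norms[-1]:.4f}')
```

A very small alpha leaves the coefficients close to the ordinary least-squares solution, while a very large alpha pushes them toward zero, which is the under-fitting regime described above.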

Instead of choosing alpha arbitrarily, it is better to select this tuning parameter by cross-validation. We can do this with the cross-validated ridge regression estimator, RidgeCV().

In [None]:
#building ridge regression model
from sklearn.linear_model import Ridge, RidgeCV

#candidate alphas on a log scale, from 5e9 down to 5e-3
alphas=10**np.linspace(10,-2,100)*0.5

#normalize=True was removed in scikit-learn 1.2; the features are already standardized above
ridge_cv= RidgeCV(alphas=alphas)
ridge_cv.fit(X_train,y_train)
print(ridge_cv.alpha_)  #alpha selected by cross-validation

In [None]:
#ridge regression (L2 regularization) with the cross-validated alpha
#normalize=True was removed in scikit-learn 1.2; the features are already standardized above
ridge=Ridge(alpha=ridge_cv.alpha_)
ridge.fit(X_train,y_train)
y_predr=ridge.predict(X_test)

print('Root Mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_test,y_predr)))
print('r2_score:',r2_score(y_test,y_predr))

2. Lasso Regression
Lasso regression, or the Least Absolute Shrinkage and Selection Operator, is also a modification of linear regression. In Lasso, the loss function is modified to penalize model complexity by adding the sum of the absolute values of the model coefficients (the L1 norm). Unlike the ridge penalty, this penalty can shrink some coefficients exactly to zero.

The loss function for Lasso Regression can be expressed as below:

Loss function = OLS + alpha * summation (absolute values of the magnitude of the coefficients)

We now ask whether the lasso can yield a more accurate or a more interpretable model than ridge regression. To fit a lasso model, we use the Lasso() estimator, this time passing max_iter=10000 so that the coordinate-descent solver has enough iterations to converge. Other than that change, we proceed just as we did when fitting the ridge model:

In [None]:
#building lasso regression model
from sklearn.linear_model import Lasso,LassoCV
#with the default alpha=1.0 and no cross-validation
#(normalize=True was removed in scikit-learn 1.2; the features are already standardized above)
lasso=Lasso(max_iter=10000)
lasso.fit(X_train,y_train)
y_predl=lasso.predict(X_test)
print(lasso.coef_)

#with 10-fold CV to choose alpha
lasso_cv=LassoCV(cv=10,max_iter=100000)
lasso_cv.fit(X_train,y_train)

#refit with the selected alpha
lasso.set_params(alpha=lasso_cv.alpha_)
lasso.fit(X_train,y_train)
y_predll=lasso.predict(X_test)
print(lasso.coef_)
print('r2_score:',r2_score(y_test,y_predll))
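
The L1 penalty is what can make the lasso more interpretable than ridge: it can set coefficients exactly to zero. A minimal sketch on synthetic data, where only the first two of eight features carry signal (the data and the `_demo` names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
# Only the first two features influence the target
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

zero_counts = {}
for alpha in [0.001, 0.5]:
    lasso_demo = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo)
    zero_counts[alpha] = int(np.sum(lasso_demo.coef_ == 0))
    print(f'alpha={alpha}: coef={np.round(lasso_demo.coef_, 3)}, zeros={zero_counts[alpha]}')
```

With the larger alpha the uninformative features drop out of the model entirely, while the two informative coefficients are shrunk but stay nonzero.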

In [None]:
#DecisionTreeRegressor
from sklearn.tree import plot_tree

dtr=DecisionTreeRegressor(max_depth=60, min_samples_leaf=10,min_samples_split=10)
dtr.fit(X_train,y_train)

y_pred_dtr=dtr.predict(X_test)

print('r2_score:',metrics.r2_score(y_test,y_pred_dtr))

In [None]:
#Random forest Regressor with default hyperparameters
rfr=RandomForestRegressor()
rfr.fit(X_train,y_train)
y_pred_rf=rfr.predict(X_test)
print(metrics.r2_score(y_test,y_pred_rf))

#with best parameters
#max_features='sqrt' considers sqrt(n_features) features at each split
rfr_best_model=RandomForestRegressor(n_estimators=300,
                                     max_features='sqrt',
                                     min_samples_split=10,min_samples_leaf=1,max_depth=30)
rfr_best_model.fit(X_train,y_train)
y_pred_rfr=rfr_best_model.predict(X_test)
print(metrics.r2_score(y_test,y_pred_rfr))

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingRegressor

#voting regressor: averages the predictions of the base models
vote = VotingRegressor(estimators=[('LR', lin_reg), ('RFR', rfr), ('DTR', dtr)])
vote.fit(X_train, y_train)
print(vote.estimators_)
y_pred_voting = vote.predict(X_test)

#R^2 score (r2_score expects y_true first, then y_pred)
score = metrics.r2_score(y_test, y_pred_voting)
print('r2_score:', score)

cv = KFold(n_splits=10, random_state=1, shuffle=True)

labels = ['Linear Regression', 'Random Forest Regressor', 'Decision Tree Regressor']

#for regressors, cross_val_score reports R^2 by default
for regressor, label in zip([lin_reg, rfr, dtr], labels):
    scores = cross_val_score(estimator=regressor, X=X_train, y=y_train, cv=cv)
    print('R^2 (mean, std):',scores.mean(), scores.std(), label)

From these observations, we can conclude that the Random Forest Regressor is the best of the models tried for our car price prediction.

Random Forest is an ensemble-learning regression model. It trains multiple decision trees and combines them into an ensemble whose members collectively produce a prediction.

The benefit of this model is that the trees can be built in parallel and are relatively uncorrelated, so averaging their outputs tends to produce good results.
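
How the trees "collectively" produce a prediction can be checked directly: in scikit-learn's RandomForestRegressor, the forest's prediction is the mean of the individual trees' predictions. A small sketch on synthetic data from `make_regression` (not the car data set; the `_demo` names are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# One row of predictions per fitted tree, for the first five samples
tree_preds = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)
forest_pred = forest.predict(X_demo[:5])

print(manual_average)
print(forest_pred)
```

Each tree is trained on a different bootstrap sample, which is what keeps the trees relatively uncorrelated.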

In [None]:
#saving the best model to the models directory
with open('C:\\Users\\acer\\Desktop\\Car-Price-Prediction\\models\\best_model_car_prediction.pkl', 'wb') as file:
    pickle.dump(rfr_best_model, file)

#loading a pickle file from models directory
with open('C:\\Users\\acer\\Desktop\\Car-Price-Prediction\\models\\best_model_car_prediction.pkl', 'rb') as file:
    model_file=pickle.load(file)