<a href="https://colab.research.google.com/github/Tanu-N-Prabhu/UsedCarPricePredictionSystem-Files/blob/master/Model_Planning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Model Planning**

## **Variable Selection**

In [0]:
import pandas as pd    
import joblib
                              # Loading the dataset into a dataframe and performing the desired operations

In [0]:
# Storing the dataset (CSV file) as a pandas dataframe

df = pd.read_csv("/content/drive/My Drive/Dataset/CleanedData.csv")   # Storing the CSV file into a dataframe
df.head(5)

Unnamed: 0,price,yearOfRegistration,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,postalCode,vehicleType_0,vehicleType_1,vehicleType_2,vehicleType_3,vehicleType_4,vehicleType_5,vehicleType_6,vehicleType_7,gearbox_0,gearbox_1
0,650,1995,102,11,150000,10,1,2,33775,0,0,0,0,0,0,1,0,0,1
1,2000,2004,105,10,150000,12,1,19,96224,0,0,0,0,0,0,1,0,0,1
2,2799,2005,140,160,150000,12,3,37,57290,0,0,0,0,0,1,0,0,0,1
3,999,1995,115,160,150000,11,1,37,37269,0,0,0,0,0,1,0,0,0,1
4,2500,2004,131,160,150000,2,1,37,90762,0,0,0,0,0,1,0,0,0,1


In [0]:
features = list(df.columns.values)
features

['price',
 'yearOfRegistration',
 'powerPS',
 'model',
 'kilometer',
 'monthOfRegistration',
 'fuelType',
 'brand',
 'postalCode',
 'vehicleType_0',
 'vehicleType_1',
 'vehicleType_2',
 'vehicleType_3',
 'vehicleType_4',
 'vehicleType_5',
 'vehicleType_6',
 'vehicleType_7',
 'gearbox_0',
 'gearbox_1']

In [0]:
selectedFeatures = ['yearOfRegistration','powerPS','model','kilometer','monthOfRegistration','fuelType','brand','postalCode','vehicleType_0','vehicleType_1','vehicleType_2','vehicleType_3','vehicleType_4','vehicleType_5','vehicleType_6','vehicleType_7','gearbox_0','gearbox_1']

In [0]:
X = df[selectedFeatures]
y = df['price']



---



## **Model Selection**

There are 4 types of model selection

1.   Classification
2.   Regression
3.   Association Rules
4. Text/Image/Video Analysis




**What kind of problem am I solving and how will I solve it?**

I want to determine the relationship between the outcome and the input variables.






---



# **Model Building**

Build training and test datasets (~80% for training (labeled), 20% for testing)

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state = 42)

**Training the selected model**

In [0]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

**Random Forest Regressor - Train and Test**

In [0]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor().fit(X_train, y_train)
print('Accuracy of Random Forest Regressor  on training set: {:.2f}'
     .format(rfr.score(X_train, y_train)))
print('Accuracy of Random Forest Regressor on test set: {:.2f}'
     .format(rfr.score(X_test, y_test)))

Accuracy of Random Forest Regressor  on training set: 0.98
Accuracy of Random Forest Regressor on test set: 0.85


**Decision Tree Regressor - Train and Test**

In [0]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(max_depth=20).fit(X_train, y_train)
print('Accuracy of Decision Tree Regression on training set: {:.2f}'
     .format(dtr.score(X_train, y_train)))
print('Accuracy of Decision Tree Regression on test set: {:.2f}'
     .format(dtr.score(X_test, y_test)))

Accuracy of Decision Tree Regression on training set: 0.98
Accuracy of Decision Tree Regression on test set: 0.75


**Linear Regression - Train and Test**

In [0]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
print('Accuracy of Linear Regression on training set: {:.2f}'
     .format(linreg.score(X_train, y_train)))
print('Accuracy of Linear Regression on test set: {:.2f}'
     .format(linreg.score(X_test, y_test)))

Accuracy of Linear Regression on training set: 0.56
Accuracy of Linear Regression on test set: 0.57


**Lasso Regression - Train and Test**

In [0]:
from sklearn import linear_model
lasso = linear_model.Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print('Accuracy of Lasso Regression on training set: {:.2f}'
     .format(lasso.score(X_train, y_train)))
print('Accuracy of Lasso Regression on test set: {:.2f}'
     .format(lasso.score(X_test, y_test)))

Accuracy of Lasso Regression on training set: 0.56
Accuracy of Lasso Regression on test set: 0.57


**Ridge Regression - Train and Test**

In [0]:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print('Accuracy of Ridge Regression on training set: {:.2f}'
     .format(ridge.score(X_train, y_train)))
print('Accuracy of Ridge Regression on test set: {:.2f}'
     .format(ridge.score(X_test, y_test)))

Accuracy of Ridge Regression on training set: 0.56
Accuracy of Ridge Regression on test set: 0.57




---



## **Pipelining**

In [0]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [0]:
## Pipelines Creation
## 1. Data Preprocessing by using Standard Scaler
## 2. Reduce Dimension using PCA
## 3. Apply  Classifier

In [0]:
pipeline_rf=Pipeline([('scalar1',StandardScaler()),
    
                     ('lr_classifier',RandomForestRegressor())])

In [0]:
pipeline_dt=Pipeline([('scalar1',StandardScaler()),
    
                     ('lr_classifier',DecisionTreeRegressor(max_depth=20))])

In [0]:
pipeline_lr=Pipeline([('scalar1',StandardScaler()),
    
                     ('lr_classifier',LinearRegression())])

In [0]:
#pipeline_r=Pipeline([('scalar1',StandardScaler()),
    
                    # ('lr_classifier',linear_model.SGDRegressor())])

In [0]:
pipelines = [pipeline_rf, pipeline_dt, pipeline_lr]


In [0]:
best_accuracy=0.0
best_classifier=0
best_pipeline=""

In [0]:
# Dictionary of pipelines and classifier types for ease of reference
pipe_dict = {0: 'Random Forest Regression', 1: 'Decision Tree Regressor', 2: 'Linear Regression'}

# Fit the pipelines
for pipe in pipelines:
	pipe.fit(X_train, y_train)

In [0]:
for i,model in enumerate(pipelines):
    print("{} Test Accuracy: {}".format(pipe_dict[i],model.score(X_test,y_test)))

Random Forest Regression Test Accuracy: 0.8524731823476703
Decision Tree Regressor Test Accuracy: 0.7539438571006447
Linear Regression Test Accuracy: 0.5695008332653897


In [0]:
for i,model in enumerate(pipelines):
    if model.score(X_test,y_test)>best_accuracy:
        best_accuracy=model.score(X_test,y_test)
        best_pipeline=model
        best_classifier=i
print('Regressor with best accuracy is : {}'.format(pipe_dict[best_classifier]))

Regressor with best accuracy is : Random Forest Regression




---



In [0]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

Training

In [0]:
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf.fit(X_train, y_train)
pred = clf.predict(X_train)

In [0]:
print("Mean Absolute Error is :", mean_absolute_error(y_train, pred))
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("Mean Squared Error is :", mean_squared_error(y_train, pred))
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("The R2 square value of Random Forest Regression is :",clf.score(X_train, y_train)* 100)

Mean Absolute Error is : 417.2985144637495
 — — — — — — — — — — — — — — — — — — — — — — — 
Mean Squared Error is : 400278.6488965146
 — — — — — — — — — — — — — — — — — — — — — — — 
The R2 square value of Random Forest Regression is : 97.9081517675405


Testing

In [0]:
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf.fit(X_test, y_test)
pred = clf.predict(X_test)

In [0]:
print("Mean Absolute Error is :", mean_absolute_error(y_test, pred))
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("Mean Squared Error is :", mean_squared_error(y_test, pred))
print(" — — — — — — — — — — — — — — — — — — — — — — — ")
print("The R2 square value of Random Forest Regression is :",clf.score(X_test, y_test)* 100)

Mean Absolute Error is : 444.9417102351957
 — — — — — — — — — — — — — — — — — — — — — — — 
Mean Squared Error is : 437277.1055813621
 — — — — — — — — — — — — — — — — — — — — — — — 
The R2 square value of Random Forest Regression is : 97.70609734335183




---



In [0]:
#clf = RandomForestRegressor()
#clf.fit(X_train, y_train)
#pred = clf.predict(X_test)

In [0]:
import pickle

In [0]:
clf = DecisionTreeRegressor(max_depth=10)
clf.fit(X_train, y_train)


DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=10,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [0]:

pickle.dump(clf, open('/content/drive/My Drive/Dataset/decision.pkl', 'wb'))

In [0]:
!python --version

Python 3.6.9


In [0]:
import joblib


In [0]:
# save the model to disk
filename = '/content/drive/My Drive/Dataset/1.sav'
joblib.dump(clf, filename)

['/content/drive/My Drive/Dataset/1.sav']

In [0]:
loaded_model = joblib.load(filename)


In [0]:
loaded_model.predict([[1995, 102, 11, 150000, 1, 1, 2, 33775, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]])

array([12500.])

In [0]:
#load the model from disk
loaded_model = pickle.load(open('/content/drive/My Drive/Dataset/decision.pkl', 'rb'))


In [0]:
from sklearn.metrics import r2_score
print("The R2 square value of Linear Regression is :", r2_score(y_test, pred)* 100)

The R2 square value of Linear Regression is : 81.76520661089148


In [0]:
predictPrice = []
for i in range(1000):
    data = df.loc[i]
    del data['price']
    clf.predict([data])
    predictPrice.append(clf.predict([data])[0])

In [0]:

actualPrice = []
for i in range(1000):
    data = df.loc[i]
    data['price']
    actualPrice.append(data['price'])

In [0]:
total_lis=[]
for i in range(1000):
    diff = abs(predictPrice[i] - actualPrice[i])
    if diff >= 200:
        print("index--%s, actual price--%s, predicted price--%s, diff--%s"%(i, actualPrice[i], predictPrice[i], diff))
        total_lis.append(i)
print(len(total_lis))

index--0, actual price--650, predicted price--999.469964664311, diff--349.46996466431096
index--1, actual price--2000, predicted price--2755.4155844155844, diff--755.4155844155844
index--2, actual price--2799, predicted price--5517.2770330652365, diff--2718.2770330652365
index--4, actual price--2500, predicted price--3473.740932642487, diff--973.7409326424872
index--5, actual price--3699, predicted price--6060.296875, diff--2361.296875
index--6, actual price--500, predicted price--1319.9939698492462, diff--819.9939698492462
index--7, actual price--2500, predicted price--4015.7854251012145, diff--1515.7854251012145
index--8, actual price--5555, predicted price--3096.157894736842, diff--2458.842105263158
index--9, actual price--3300, predicted price--1734.795294117647, diff--1565.204705882353
index--10, actual price--3500, predicted price--4275.022727272727, diff--775.022727272727
index--11, actual price--12500, predicted price--10324.53596287703, diff--2175.4640371229707
index--13, actu