# Group 2 Project: Used Cars

## Description to run the notebook:

### After downloading the vehicles.csv file, the file needs to be put into the same folder which holds our project (and have the name "vehicles.csv"). Our csv reader assumes that the file is directly accessible from the same folder, and needs to see it, otherwise it'll throw an error. From there, all cells can be ran as normal. (NOTE: Our dataset has approximately 450,000 entries, so the file is 1.4 Gigabytes and will take some time to upload to jupyter lab and read).

In [1]:
import math
import pandas as pd
from typing import List
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import root_mean_squared_error, r2_score, mean_absolute_error
#Creating a function to collect the relative absolute error
def relative_absolute_error(a, b):
    meanabs = mean_absolute_error(a, b)
    a_mean = np.mean(a)
    return meanabs/(mean_absolute_error(a, np.full(a.shape, a_mean)))
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from typing import Dict
import csv
from typing import NamedTuple
from scipy.spatial import distance
from collections import defaultdict
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

Vector = List[float]
pd.options.mode.chained_assignment = None 

In [None]:
# The car dataset in its original form before any testing
original = pd.read_csv("vehicles.csv", engine = "python", encoding = "latin1")

In [None]:
#########
#Extracting the input features from the original dataset
usedcars = original[["price", "year", "manufacturer", "condition", "cylinders", "odometer", "title_status", "transmission", "drive", "type", "model"]]
#########

In [None]:
#Consider only values which don't have a null value
newUsedCars = usedcars.dropna()

#Find outliers within the dataset to reduce extremeities and skewness
def find_outliers_iqr(df, column):
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    df.drop(df[(df[column] < lower_bound)].index, inplace = True)
    df.drop(df[(df[column] > upper_bound)].index, inplace = True)
    
find_outliers_iqr(newUsedCars, 'price')
find_outliers_iqr(newUsedCars, 'odometer')
find_outliers_iqr(newUsedCars, 'year')

##Turns string to int value for relevant input features
le = LabelEncoder()
newUsedCars['conditionint'] = le.fit_transform(newUsedCars['condition'])
newUsedCars["title_statusint"] = le.fit_transform(newUsedCars["title_status"])
newUsedCars["transmissionint"] = le.fit_transform(newUsedCars["transmission"])
newUsedCars["manufacturerint"] = le.fit_transform(newUsedCars["manufacturer"])
newUsedCars["cylindersint"] = le.fit_transform(newUsedCars["cylinders"])
newUsedCars["driveint"] = le.fit_transform(newUsedCars["drive"])
newUsedCars["typeint"] = le.fit_transform(newUsedCars["type"])
newUsedCars["modelint"] = le.fit_transform(newUsedCars["model"])

# Data Visualization

## Bar Charts for categorical variables and their mean prices

In [None]:
newUsedCars.groupby(["condition"])["price"].mean().plot(kind="bar", title="Prices by Condition")

#### As seen by this graph, the mean value of a car generally decreases as its condition gets worse, though one exception are cars in "good" condition, which has a slightly higher mean price than the other categories. This could be the result of most of the cars in the dataset having the "good" condition applied to them compared to the others

In [None]:
newUsedCars.groupby(["title_status"])["price"].mean().plot(kind="bar", title="Prices by Car Status")

#### The results from this graph are as expected, with cars that are more complete and finished being more valuable. Cars with a lien status ont hem are sold off at even higher prices than the rest, though the means are still closer together

In [None]:
newUsedCars.groupby(["transmission"])["price"].mean().plot(kind="bar", title="Prices by Transmission Type")

#### There isn't much difference between the mean prices of the types of transmissions, though cars with alternative transmissions are more valuable overall.

In [None]:
newUsedCars.groupby(["manufacturer"])["price"].mean().plot(kind="bar", title="Prices by Manufacturer")

#### There are strong differences among the mean prices of cars by their manufacutrer. Generally more high-class manufactuers like Alfa-Romeo and Aston-Martin are far more valuable than more general car manufacturers like Chevrolet and Ford.

In [None]:
newUsedCars.groupby(["cylinders"])["price"].mean().plot(kind="bar", title="Prices by Number of Cylinders")

#### Though both 10-cylinder and 5-cylinder cars break from this trend, the mean price of a car generally increases as the number of cylinders increases giving a somewhat linear relationship between cylinders and prices.

In [None]:
newUsedCars.groupby(["drive"])["price"].mean().plot(kind="bar", title="Prices by Their Type of Drive")

#### Cars which have a forward wheel drive have far lower mean prices than cars with a rear wheel or 4 wheel drive

In [None]:
newUsedCars.groupby(["type"])["price"].mean().plot(kind="bar", title="Prices by the type of car")

#### Mean prices differ heavily based on what type a specific car is, ranging from hatchbacks and mini-vans to coupes and pickups, the differences in types should be a good indicator of a specific car's value.

## Exploration of independent sellers vs businesses on Cragislist

In [None]:
vehicles = original.copy() # Creating a copy to preserve the
vehicles = vehicles.dropna(subset=['description', 'price', 'odometer', 'manufacturer', 'model']) 
find_outliers_iqr(vehicles, 'price')
find_outliers_iqr(vehicles, 'odometer')
find_outliers_iqr(vehicles, 'year')
vehicles = vehicles.drop_duplicates(subset=['description'], keep='first')
ogDf = vehicles.copy()
carvana = vehicles[vehicles['description'].str.contains("Carvana", case=False, na=False)]
dealership = vehicles[vehicles['description'].str.contains("dealership", case=False, na=False)]
newCar =  vehicles[
    vehicles["description"].str.contains("new car", case=False, na=False) &
    ~vehicles["description"].str.contains("like new", case=False, na=False)
]
finance = vehicles[vehicles['description'].str.contains("financing available", case=False, na=False)]

combined_indices = newCar.index.union(carvana.index).union(finance.index).union(dealership.index)
vehicles = vehicles.drop(combined_indices)
vehicles = vehicles.reset_index(drop=True)
carvana = carvana.reset_index(drop=True)
newCar = newCar.reset_index(drop=True)
finance = finance.reset_index(drop=True)
dealership = dealership.reset_index(drop=True)
usedDealerShips = pd.concat([carvana, dealership, newCar, finance], ignore_index=True)


In [None]:
print(f"Dataset size (Bar duplicates and null values): {ogDf.shape[0]}")
print(f"Vehicles listed by Carvana: {carvana.shape[0]}")
print(f"Dealership vehicles: {dealership.shape[0]}")
print(f"New cars (excluding 'like new'): {newCar.shape[0]}")
print(f"Vehicles with financing available: {finance.shape[0]}")
print(f"Remaining vehicles after removing combined indices: {vehicles.shape[0]}")
print(f"Original dataset size: {original.shape[0]}")

In [None]:
iterable = ['price', 'odometer', 'year']
for i in iterable:
    plt.hist(vehicles[i], label = "Independent sellers", alpha=0.8)
    plt.hist(usedDealerShips[i], label = "Dealerships", alpha=0.8)
    plt.title(f'Distribution of {i}')
    plt.xlabel(i)
    plt.ylabel('Frequency')
    plt.legend()
    plt.show()

## The histograms for year, price, and year indicate skew. For dealerships, the distibution of year of cars is skewed left but independent sellers aren't as skewed. The distribution of odometer is also heavily skewed by dealerships but right this time instead of left. The distribution of odometer isn't heavily skewed in entries by independent sellers. In the distribution of price we see the opposite. The distribution of price in entries by independent sellers are skewed right heavily but dealerships is not skewed.

In [None]:
dfs = [vehicles, usedDealerShips, ogDf, original, newUsedCars]
dfNames = [ 'Dataset of cars sold by independent sellers (with outliers removed, without duplicates and null values for certain features kept)',
           'Dataset of cars sold by dealerships (with outliers removed, without duplicates and null values for certain features kept)',
           "Data set with both sellers (with outliers removed, without duplicates and null values for certain features kept)",
           "Dataset in it's original state",
           "Data set with outliers removed and all null values for all features removed"]
for i in dfs:
    i['conditionint'] = le.fit_transform(i['condition'])
    i["title_statusint"] = le.fit_transform(i["title_status"])
    i["transmissionint"] = le.fit_transform(i["transmission"])
    i["manufacturerint"] = le.fit_transform(i["manufacturer"])
    i["modelint"] = le.fit_transform(i["model"])
    i["cylindersint"] = le.fit_transform(i["cylinders"])
    i["driveint"] = le.fit_transform(i["drive"])
    i["typeint"] = le.fit_transform(i["type"])
for df, name in zip(dfs, dfNames):
    print(f"\nAnalysis for: {name}\n")

    # Correlation matrix
    print("Correlation with 'price':")
    correlation_matrix = df[['price', 'odometer', 'conditionint', 'title_statusint', 'transmissionint', 
                              'manufacturerint', 'cylindersint', 'driveint', 'typeint', 'modelint']].corr()
    print(correlation_matrix['price'].sort_values(ascending=False))
    
    # Standard deviation
    dev = df[['price', 'odometer', 'year', 'conditionint', 'title_statusint', 'transmissionint', 
                  'manufacturerint', 'cylindersint', 'driveint', 'typeint', 'modelint']].std()
    print("\nStandard Deviation:")
    print(dev)
    
    # Covariance matrix
    covariance_matrix = df[['price', 'odometer', 'year', 'conditionint', 'title_statusint', 'transmissionint', 
                             'manufacturerint', 'cylindersint', 'driveint', 'typeint', 'modelint']].cov()
    print("\nCovariance with 'price':")
    print(covariance_matrix['price'])
    # Variance
    variance = df[['price', 'odometer', 'year', 'conditionint', 'title_statusint', 'transmissionint', 
                   'manufacturerint', 'cylindersint', 'driveint', 'typeint', 'modelint']].var()
    print("\nVariance:")
    print(variance)
    print("\n----------------------------------------------------------------------------------------------------\n")

#### Outliers were excluding using IQR earlier, so it was omitted from this analysis. The raw data is basically useless due to noisy data. Correlation is generally highest magnitude with odometer, transimission, and sometimes cylinders. Price has the highest varaince followed by odometer, price and year. Covariance is high for odometer and model. Covariance is not as high but still moderate for year and cylinders. Standard deviation for price and odometer are fairly large numbers and are consistent through the data sets. 

## Scatterplots for relationship between quantitative variables and car prices

In [None]:
#Setting up a scatter plot between the year and the price
fig, ax = plt.subplots()
year = newUsedCars["year"]
price = newUsedCars["price"]
ax.scatter(year, price)

# Creating a regression line to show the trend

m, b = np.polyfit(year, price, deg=1)
ax.plot(year, m * year + b, color="red")

#labels
plt.xlabel("Year")
plt.ylabel("Price (Dollars)")

# setting the limits to make it easy to view
plt.ylim(-1000, 60000)
plt.show()
#plt.xlim(1900, 2022)

#### Looking at this graph, there seems to be a slightly positive relationship between the year and the price. There is a strong positve linear relationship between the age of a car and its price, showing that age is a good indicator of price and a valuable input for our model.

In [None]:
#Setting up a scatter plot between the mileage and the price
fig, ax = plt.subplots()
mileage = newUsedCars["odometer"]
price = newUsedCars["price"]

#plotting the line
ax.scatter(mileage, price)

# Creating a regression line to show the trend
m, b = np.polyfit(mileage, price, deg=1)
ax.plot(mileage, m * mileage + b, color="red")

#labels
plt.xlabel("Mileage (Miles)")
plt.ylabel("Price (Dollars)")

# setting the limits to make it easy to view
plt.ylim(-1000, 60000)
plt.xlim(-5000, 300000)
plt.show()

#### Unlike the relationship between the year and the price, the relationship between the mileage of a car and its price is negative, meaning that the more a car has been driven, the lower its price becomes. The strongly downward sloping nature of the regression line means mileage is a valuable metric to judge a car's price

## Heat Map of Correlations

In [None]:

#Correlating each relevant category with others
correlations = newUsedCars[["price", "year", "odometer", "conditionint", "title_statusint", "transmissionint", "manufacturerint", "cylindersint", "driveint", "typeint"]].corr()
#Setting the axes up (Transm. = Transmission and Manuf. = Manufacturer, reduced to avoid word overlap)
ax = plt.axes()
ax.set_xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
ax.set_xticklabels(["Price", "Year", "Mileage", "Condition", "Status", "Transm.", "Manuf.", "Cylinders", "Drive", "Type"], fontsize = 6)
ax.set_yticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
ax.set_yticklabels(["Price", "Year", "Mileage", "Condition", "Status", "Transm.", "Manuf.", "Cylinders", "Drive", "Type"], fontsize = 6)

#Setting the title
plt.title("Correlations between selected inputs and price")
plt.imshow(correlations, cmap="winter")
plt.colorbar()
#Putting the actual correlation coefficients with their respective correlation
for i in range(10):
    for j in range(10):
        plt.annotate(str(round(correlations.values[i][j], 3)), xy=(j - 0.225, i), fontsize=6)

#### The correlation matrix shows which inputs have the most impact on price and, as a result, which inputs will be the most valuable to improving our dataset. Variables with a higher absolute correlation, like the Year and Mileage, have a far greater impact on predictions than less correlative variables like the Status and Manufacturer of the car

## Box Plot for price

In [None]:
newUsedCars.boxplot(column=["price"], vert=False, xlabel="Price (in millions)")
plt.title("Boxplot for the price of used cars")

#### As shown by the boxplot, when major outliers are removed from the data set, the price of most cars tends to sit in the 10,000 to 20,000 dollar range. The data is somewhat skewed right, but not to a troubling extent that can get in the way of accurate testing.

# Model Creation and Evaluation

# -------------------------Polynomial Regression (Roshan)------------------------------
#### For my progression, I initially started with both a Quadratic and Cubic model which tested the 2 original inputs, the year and mileage, then the four best inputs (year, mileage, transmission, number of cylinders), and finally all inputs. Throughout my progress, I compared and contrasted the accuracy between both the models while seeing the rate at which each model developed and improved. The testing for Polynomial Regression is split into the Quadratic and Cubic model, with each model split into three with the different set of inputs used for each model.

### Two Original Inputs - Quadratic

In [None]:
#The inputs to predict the price
Inputs = newUsedCars[["year", "odometer"]]
#Price, the output
Price = newUsedCars["price"]
#Splitting the dataset into 80% training and 20% testing
trainerX, testerX, trainerY, testerY = train_test_split(Inputs, Price, test_size=0.2, random_state=27)

In [None]:
## Quadratic Regression Model

PolyRegress = PolynomialFeatures(2)
polyFTrainer = PolyRegress.fit_transform(trainerX)
polyFTester = PolyRegress.fit_transform(testerX)
polyM = linear_model.LinearRegression()
polyM.fit(polyFTrainer, trainerY)
trainerYPredictions = polyM.predict(polyFTrainer)
testerYPredictions = polyM.predict(polyFTester)

#### Evaluation

In [None]:
print(f'Root mean squared error: {root_mean_squared_error(testerYPredictions, testerY)}')
print(f'Mean absolute error: {mean_absolute_error(testerYPredictions, testerY)}')
print(f'R-Squared: {r2_score(testerYPredictions, testerY)}')
print(f'Relative absolute error: {relative_absolute_error(testerYPredictions, testerY)}')
print(f'Correlation Coefficients: {polyM.coef_}')

### Four Best Inputs - Quadratic

In [None]:
#The inputs to predict the price
Inputs = newUsedCars[["year", "odometer", "transmissionint", "cylindersint"]]
#Price, the output
Price = newUsedCars["price"]
#Splitting the dataset into 80% training and 20% testing
trainerX, testerX, trainerY, testerY = train_test_split(Inputs, Price, test_size=0.2, random_state=27)

In [None]:
## Quadratic Regression Model

PolyRegress = PolynomialFeatures(2)
polyFTrainer = PolyRegress.fit_transform(trainerX)
polyFTester = PolyRegress.fit_transform(testerX)
polyM = linear_model.LinearRegression()
polyM.fit(polyFTrainer, trainerY)
trainerYPredictions = polyM.predict(polyFTrainer)
testerYPredictions = polyM.predict(polyFTester)

#### Evaluation

In [None]:
print(f'Root mean squared error: {root_mean_squared_error(testerYPredictions, testerY)}')
print(f'Mean absolute error: {mean_absolute_error(testerYPredictions, testerY)}')
print(f'R-Squared: {r2_score(testerYPredictions, testerY)}')
print(f'Relative absolute error: {relative_absolute_error(testerYPredictions, testerY)}')
print(f'Correlation Coefficients: {polyM.coef_}')

### All Inputs - Quadratic

In [None]:
#The inputs to predict the price
Inputs = newUsedCars[["year", "odometer", "conditionint", "title_statusint", "transmissionint", "manufacturerint", "cylindersint", "driveint", "typeint"]]
#Price, the output
Price = newUsedCars["price"]
#Splitting the dataset into 80% training and 20% testing
trainerX, testerX, trainerY, testerY = train_test_split(Inputs, Price, test_size=0.2, random_state=27)
 

In [None]:
## Quadratic Regression Model

PolyRegress = PolynomialFeatures(2)
polyFTrainer = PolyRegress.fit_transform(trainerX)
polyFTester = PolyRegress.fit_transform(testerX)
polyM = linear_model.LinearRegression()
polyM.fit(polyFTrainer, trainerY)
trainerYPredictions = polyM.predict(polyFTrainer)
testerYPredictions = polyM.predict(polyFTester)


#### Evaluation

In [None]:
print(f'Root mean squared error: {root_mean_squared_error(testerYPredictions, testerY)}')
print(f'Mean absolute error: {mean_absolute_error(testerYPredictions, testerY)}')
print(f'R-Squared: {r2_score(testerYPredictions, testerY)}')
print(f'Relative absolute error: {relative_absolute_error(testerYPredictions, testerY)}')
print(f'Correlation Coefficients: {polyM.coef_}')

### Two Original Inputs - Cubic

In [None]:
#The inputs to predict the price
Inputs = newUsedCars[["year", "odometer"]]
#Price, the output
Price = newUsedCars["price"]
#Splitting the dataset into 80% training and 20% testing
trainerX, testerX, trainerY, testerY = train_test_split(Inputs, Price, test_size=0.2, random_state=27)

In [None]:
## Cubic Regression Model

PolyRegress = PolynomialFeatures(3)
polyFTrainer = PolyRegress.fit_transform(trainerX)
polyFTester = PolyRegress.fit_transform(testerX)
polyM = linear_model.LinearRegression()
polyM.fit(polyFTrainer, trainerY)
trainerYPredictions = polyM.predict(polyFTrainer)
testerYPredictions = polyM.predict(polyFTester)

### Evaluation

In [None]:
print(f'Root mean squared error: {root_mean_squared_error(testerYPredictions, testerY)}')
print(f'Mean absolute error: {mean_absolute_error(testerYPredictions, testerY)}')
print(f'R-Squared: {r2_score(testerYPredictions, testerY)}')
print(f'Relative absolute error: {relative_absolute_error(testerYPredictions, testerY)}')
print(f'Correlation Coefficients: {polyM.coef_}')

### Four Best Inputs - Cubic

In [None]:
#The inputs to predict the price
Inputs = newUsedCars[["year", "odometer", "transmissionint", "cylindersint"]]
#Price, the output
Price = newUsedCars["price"]
#Splitting the dataset into 80% training and 20% testing
trainerX, testerX, trainerY, testerY = train_test_split(Inputs, Price, test_size=0.2, random_state=27)

In [None]:
## Cubic Regression Model

PolyRegress = PolynomialFeatures(3)
polyFTrainer = PolyRegress.fit_transform(trainerX)
polyFTester = PolyRegress.fit_transform(testerX)
polyM = linear_model.LinearRegression()
polyM.fit(polyFTrainer, trainerY)
trainerYPredictions = polyM.predict(polyFTrainer)
testerYPredictions = polyM.predict(polyFTester)

### Evaluation

In [None]:
print(f'Root mean squared error: {root_mean_squared_error(testerYPredictions, testerY)}')
print(f'Mean absolute error: {mean_absolute_error(testerYPredictions, testerY)}')
print(f'R-Squared: {r2_score(testerYPredictions, testerY)}')
print(f'Relative absolute error: {relative_absolute_error(testerYPredictions, testerY)}')
print(f'Correlation Coefficients: {polyM.coef_}')

### All Inputs - Cubic

In [None]:
#The inputs to predict the price
Inputs = newUsedCars[["year", "odometer", "conditionint", "title_statusint", "transmissionint", "manufacturerint", "cylindersint", "driveint", "typeint"]]
#Price, the output
Price = newUsedCars["price"]
#Splitting the dataset into 80% training and 20% testing
trainerX, testerX, trainerY, testerY = train_test_split(Inputs, Price, test_size=0.2, random_state=27)

In [None]:
## Cubic Regression Model

PolyRegress = PolynomialFeatures(3)
polyFTrainer = PolyRegress.fit_transform(trainerX)
polyFTester = PolyRegress.fit_transform(testerX)
polyM = linear_model.LinearRegression()
polyM.fit(polyFTrainer, trainerY)
trainerYPredictions = polyM.predict(polyFTrainer)
testerYPredictions = polyM.predict(polyFTester)


### Evaluation

In [None]:
print(f'Root mean squared error: {root_mean_squared_error(testerYPredictions, testerY)}')
print(f'Mean absolute error: {mean_absolute_error(testerYPredictions, testerY)}')
print(f'R-Squared: {r2_score(testerYPredictions, testerY)}')
print(f'Relative absolute error: {relative_absolute_error(testerYPredictions, testerY)}')
print(f'Correlation Coefficients: {polyM.coef_}')

## Conclusions

* When both the Quadratic and Cubic models had very few inputs, originally only the age and mileage, they underfit tremendously, to the point of having a negative R-squared value
* Even inputs with lower correlation values can still contribute to greater accuracy, both the Quadrtaic and Cubic models improved their accuracy tremendously when they had more inputs to work with regardless of the strength of correlation
* Neither this model nor the quadratic model ran into overfitting the training dataset, meaning both the number of inputs and the degree of the model were low enough to keep the training set as a good indicator for testing.
* Cubic regression seemed to be more extreme in its accuracy, where the more inputs that were added the more exponentially the accuracy of the model rose
* This could indicate that Cubic regression is better for studying such a dataset than Quadratic regression, but only if the Cubic model has a large array of inputs to work with


# -------------------------------Lasso Regression (Rachel)----------------------------------------

For Lasso Regression I will create two models, one with all features and one with only the features with high correlations to price.

The 3 features with the most correlation are:

    Mileage
    Transmission
    Year



In [None]:
lass = Lasso()
lass.fit(trainerX, trainerY)
print(lass.intercept_)
print(lass.coef_)
lassPredict = lass.predict(testerX)
print("Score ", root_mean_squared_error(testerY, lassPredict))
print(f'R-Squared: {r2_score(testerY, lassPredict)}')


print()
print()
refinedInput = newUsedCars[["year", "odometer","transmissionint"]]
rtrainerX, rtesterX, rtrainerY, rtesterY = train_test_split(refinedInput, Price, test_size=0.2, random_state=27)

rlass = Lasso()
rlass.fit(rtrainerX, rtrainerY)
print(rlass.intercept_)
print(rlass.coef_)
rlassPredict = rlass.predict(rtesterX)
print("Score ", root_mean_squared_error(rtesterY, rlassPredict))
print(f'R-Squared: {r2_score(rtesterY, rlassPredict)}')



**My theory that less criterias would lead to a higher R-Squared value was wrong.**

My finding with Lasso Regression is that there is a 55% accuracy with this models output. Because Lasso experiences hardly any changes when tuning is done though the changing of hyperparameters, I will not be changing the alpha in attempts to make the model more accurate.

**However, improvements can be made with data scaling**


# ------------Decision tree Regression (Bilal)------------

In [None]:
x = newUsedCars[['odometer']]
y = newUsedCars['price']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=27)
RTModel = DecisionTreeRegressor(random_state=27)
RTModel.fit(X_train, y_train)
y_pred = RTModel.predict(X_test)

print(f'Root mean squared error: {root_mean_squared_error(y_test, y_pred)}')
print(f'Mean absolute error: {mean_absolute_error(y_test, y_pred)}')
print(f'R-Squared: {r2_score(y_test, y_pred)}')
print(f'Relative absolute error: {relative_absolute_error(y_test, y_pred)}')
print(f'Correlation Coefficients: {RTModel.feature_importances_}')

In [None]:
x = newUsedCars[['odometer', 'year', 'modelint']]
y = newUsedCars['price']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=27)
RTModel = DecisionTreeRegressor(random_state=27)
RTModel.fit(X_train, y_train)
y_pred = RTModel.predict(X_test)

print(f'Root mean squared error: {root_mean_squared_error(y_test, y_pred)}')
print(f'Mean absolute error: {mean_absolute_error(y_test, y_pred)}')
print(f'R-Squared: {r2_score(y_test, y_pred)}')
print(f'Relative absolute error: {relative_absolute_error(y_test, y_pred)}')
print(f'Correlation Coefficients: {RTModel.feature_importances_}')

In [None]:
x = newUsedCars[['odometer', 'year', 'modelint', 'transmissionint', 'manufacturerint']]
y = newUsedCars['price']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=27)
RTModel = DecisionTreeRegressor(random_state=27)
RTModel.fit(X_train, y_train)
y_pred = RTModel.predict(X_test)

print(f'Root mean squared error: {root_mean_squared_error(y_test, y_pred)}')
print(f'Mean absolute error: {mean_absolute_error(y_test, y_pred)}')
print(f'R-Squared: {r2_score(y_test, y_pred)}')
print(f'Relative absolute error: {relative_absolute_error(y_test, y_pred)}')
print(f'Correlation Coefficients: {RTModel.feature_importances_}')

In [None]:
x = newUsedCars[['odometer', 'year', 'modelint', 'transmissionint', 'manufacturerint', 'cylindersint']]
y = newUsedCars['price']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=27)
RTModel = DecisionTreeRegressor( random_state=27)
RTModel.fit(X_train, y_train)
y_pred = RTModel.predict(X_test)

print(f'Root mean squared error: {root_mean_squared_error(y_test, y_pred)}')
print(f'Mean absolute error: {mean_absolute_error(y_test, y_pred)}')
print(f'R-Squared: {r2_score(y_test, y_pred)}')
print(f'Relative absolute error: {relative_absolute_error(y_test, y_pred)}')
print(f'Correlation Coefficients: {RTModel.feature_importances_}')

#### The decision tree regression model performed quite well on the data set. As more input features were added the performance increased quite a lot. The decision tree regression model is quite suitable for this dataset.

In [None]:
x = newUsedCars[['odometer', 'year', 'modelint', 'transmissionint', 'manufacturerint', 'cylindersint', 'driveint']]
y = newUsedCars['price']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=27)
RTModel = DecisionTreeRegressor(random_state=27)
RTModel.fit(X_train, y_train)
y_pred = RTModel.predict(X_test)

print(f'Root mean squared error: {root_mean_squared_error(y_test, y_pred)}')
print(f'Mean absolute error: {mean_absolute_error(y_test, y_pred)}')
print(f'R-Squared: {r2_score(y_test, y_pred)}')
print(f'Relative absolute error: {relative_absolute_error(y_test, y_pred)}')
print(f'Correlation Coefficients: {RTModel.feature_importances_}')

In [None]:
x = newUsedCars[['odometer', 'year', 'modelint', 'transmissionint', 'manufacturerint', 'cylindersint', 'driveint']]
y = newUsedCars['price']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=27)
RTModel = DecisionTreeRegressor(max_depth=26, min_samples_split=18, min_samples_leaf=9,random_state=27)
RTModel.fit(X_train, y_train)
y_pred = RTModel.predict(X_test)
            
print(f'Root mean squared error: {root_mean_squared_error(y_test, y_pred)}')
print(f'Mean absolute error: {mean_absolute_error(y_test, y_pred)}')
print(f'R-Squared: {r2_score(y_test, y_pred)}')
print(f'Relative absolute error: {relative_absolute_error(y_test, y_pred)}')
print(f'Correlation Coefficients: {RTModel.feature_importances_}')

In [None]:
x = usedDealerShips[['odometer', 'year','modelint', 'transmissionint', 'manufacturerint',  "cylindersint"]]
y = usedDealerShips['price']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=27)
RTModel = DecisionTreeRegressor(max_depth=23, min_samples_split=27, min_samples_leaf=5, random_state=27)
RTModel.fit(X_train, y_train)
y_pred = RTModel.predict(X_test)

print(f'Root mean squared error: {root_mean_squared_error( y_test, y_pred)}')
print(f'Mean absolute error: {mean_absolute_error(y_test, y_pred)}')
print(f'R-Squared: {r2_score(y_test, y_pred)}')
print(f'Relative absolute error: {relative_absolute_error(y_test, y_pred)}')
print(f'Correlation Coefficients: {RTModel.feature_importances_}')

In [None]:
x = vehicles[['odometer', 'year','modelint', 'transmissionint', 'manufacturerint']]
y = vehicles['price']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=27)
RTModel = DecisionTreeRegressor(max_depth=23, min_samples_split=23, min_samples_leaf=7,random_state=27)
RTModel.fit(X_train, y_train)
y_pred = RTModel.predict(X_test)
print(f'Root mean squared error: {root_mean_squared_error( y_test, y_pred)}')
print(f'Mean absolute error: {mean_absolute_error(y_test, y_pred)}')
print(f'R-Squared: {r2_score(y_test, y_pred)}')
print(f'Relative absolute error: {relative_absolute_error(y_test, y_pred)}')
print(f'Correlation Coefficients: {RTModel.feature_importances_}')

#### When compared between the 3 datasets with different preprocessing, it performed the best on the data set where no distinction was made between independent sellers and dealerships. The data set where it performed the best was also the dataset where duplicate descriptions were not removed which could be a factor on why predictive modeling is better on the dataset. The dataset with entries that are likely from dealerships performed better than the dataset with independent sellers. This can indicate that the dealerships use the input features in their pricing decisions to a larger extent.