In [448]:
import numpy as np 
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split
from sklearn.tree import export_text
from matplotlib import pyplot as plt



df = pd.read_csv("../input/carsforsale/cars_raw.csv")
df['FuelType'].unique()

Here we can see all fuel types present in the dataset. In this analysis, we want to compare the common fuel types: Gasoline, Electric, Hybrid, and Diesel. We will fold "Diesel Fuel" into the Diesel category and "Gasoline Fuel" into the Gasoline category, and drop the remaining, less common fuel types.

In [449]:
df['FuelType'] = df['FuelType'].replace({"Diesel Fuel": "Diesel", "Gasoline Fuel": "Gasoline"})

names = ["Gasoline", "Diesel", "Electric", "Hybrid"]

df = df[df['FuelType'].isin(names)]

print(df['FuelType'].unique())
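Before modelling, it can help to look at how many rows each fuel type has, since `unique()` only lists the labels. The sketch below repeats the merge-and-filter steps on a small made-up frame (the rows and the "E85 Flex Fuel" label are hypothetical, not taken from cars_raw.csv) and prints the share of each class.

```python
import pandas as pd

# Hypothetical toy data standing in for the FuelType column of cars_raw.csv.
toy = pd.DataFrame({
    "FuelType": ["Gasoline"] * 6
    + ["Gasoline Fuel", "Diesel Fuel", "Electric", "Hybrid", "E85 Flex Fuel"]
})

# Same merge-and-filter steps as in the cell above.
toy["FuelType"] = toy["FuelType"].replace(
    {"Diesel Fuel": "Diesel", "Gasoline Fuel": "Gasoline"}
)
toy = toy[toy["FuelType"].isin(["Gasoline", "Diesel", "Electric", "Hybrid"])]

# Share of rows per class, largest first. A heavily skewed result is a warning
# that a classifier may simply default to the majority class.
class_share = toy["FuelType"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same `value_counts(normalize=True)` call on `df['FuelType']` would show how dominant the largest category is.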

Now we want to fit a decision tree to the data and use it to predict each car's fuel type. First, we have to do some pre-processing: removing columns that are irrelevant (such as VIN) or that give the answer away (such as Make, since a Tesla is always electric). We also need to encode the categorical columns SellerType and Used/New as integers, one number per category.

In [450]:
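# Drop columns that are irrelevant to the prediction or would give the fuel type away.
df = df.drop(
    ['Make', 'Model', 'SellerName', 'StreetName', 'State', 'Zipcode', 'DealType',
     'InteriorColor', 'ExteriorColor', 'Drivetrain', 'MinMPG', 'MaxMPG',
     'Transmission', 'Engine', 'VIN', 'Stock#', 'Price'],
    axis=1,
)

# Encode each categorical column as integer codes (0, 1, ...), one per category.
columnsToFactorize = ['SellerType', 'Used/New']

for item in columnsToFactorize:
    df[item] = pd.factorize(df[item])[0].tolist()

# Express Year as an offset from 2000 (e.g. 2019 -> 19).
df['Year'] -= 2000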
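To make the encoding step concrete, here is a small sketch of what `pd.factorize` returns. The seller values are hypothetical; the real SellerType categories may differ. Keeping the `uniques` array around lets us translate a split on the encoded column back into the original label.

```python
import pandas as pd

# Hypothetical values; the real SellerType categories may differ.
seller_type = pd.Series(["Dealer", "Private", "Dealer", "Private", "Dealer"])

# factorize returns (codes, uniques): codes are integers assigned in order of
# first appearance, and uniques[code] recovers the original label.
codes, uniques = pd.factorize(seller_type)
print(codes.tolist())
print(list(uniques))

# Keep the mapping so tree splits on the encoded column can be read back.
code_to_label = dict(enumerate(uniques))
print(code_to_label)
```

Note that the integer codes carry no meaningful order, so a split such as `SellerType <= 0.5` only separates the first-seen category from the rest.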

Next, we fit a decision tree classifier, using a 75/25 training/test split.

In [451]:
# Use the first 13 columns as features; this slice assumes FuelType is not among them.
Features = list(df.columns[:13])
Target = 'FuelType'
Features_train, Features_test, Target_train, Target_test = train_test_split(df[Features], df[Target], test_size = 0.25, random_state = 7)

dt = DecisionTreeClassifier(min_samples_split=50, random_state=7, max_depth = 3)
dt.fit(Features_train, Target_train)

r = export_text(dt, feature_names=Features)
print(r)
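The tree above is fit on the training split, but the held-out 25% is never scored. The following is a minimal sketch of how that could be done, using synthetic imbalanced data from `make_classification` as a stand-in for the car features (the data, and the names `dt_demo`, `X_train`, and so on, are assumptions for illustration). It compares test accuracy against the accuracy of always predicting the most common class, and prints a confusion matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced 3-class data standing in for the car features (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=13, n_informative=5, n_classes=3,
    weights=[0.9, 0.05, 0.05], random_state=7,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

# Same hyperparameters as the tree in the notebook.
dt_demo = DecisionTreeClassifier(min_samples_split=50, random_state=7, max_depth=3)
dt_demo.fit(X_train, y_train)

# Accuracy on the held-out 25%.
test_accuracy = dt_demo.score(X_test, y_test)

# Accuracy of always predicting the most common training class, for comparison.
majority_class = np.bincount(y_train).argmax()
baseline_accuracy = float((y_test == majority_class).mean())

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, dt_demo.predict(X_test))
print(test_accuracy, baseline_accuracy)
print(cm)
```

With the notebook's own variables, the equivalent calls would be `dt.score(Features_test, Target_test)` and `confusion_matrix(Target_test, dt.predict(Features_test))`. If test accuracy is close to the majority-class baseline, the tree has learned little beyond the class frequencies.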


Here we can see that the model mostly predicts Gasoline (in fact, no leaf of the tree predicts Hybrid or Diesel), most likely because gasoline cars dominate the dataset. We will refit the model with Gasoline removed from the dataset, keeping the tree shallow because even less data remains.
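Dropping the majority class is one way to deal with the imbalance. A different technique, which this notebook does not use, is to keep every class and pass `class_weight="balanced"` to `DecisionTreeClassifier`, which reweights samples inversely to class frequency. The sketch below shows the idea on synthetic data (an assumption, not the car dataset) by counting how often each class is predicted with and without the weighting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data where class 0 plays the role of the dominant category (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=13, n_informative=5, n_classes=3,
    weights=[0.9, 0.05, 0.05], random_state=7,
)

# Unweighted tree, same hyperparameters as in the notebook.
plain = DecisionTreeClassifier(max_depth=3, min_samples_split=50, random_state=7)
plain.fit(X, y)

# Same tree, but each class contributes equal total weight to the split criterion.
weighted = DecisionTreeClassifier(
    max_depth=3, min_samples_split=50, random_state=7, class_weight="balanced"
)
weighted.fit(X, y)

plain_pred = plain.predict(X)
weighted_pred = weighted.predict(X)

# How often each class label is predicted by each tree.
print(np.unique(plain_pred, return_counts=True))
print(np.unique(weighted_pred, return_counts=True))
```

The weighted tree will usually predict the rare classes more often, at the cost of more mistakes on the common class, so whether it is preferable depends on what the analysis is for.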

In [452]:
names = ["Diesel", "Electric", "Hybrid"]

df = df[df['FuelType'].isin(names)]

Features = list(df.columns[:13])
Target = 'FuelType'
Features_train, Features_test, Target_train, Target_test = train_test_split(df[Features], df[Target], test_size = 0.25, random_state = 7)

dt = DecisionTreeClassifier(min_samples_split=50, random_state=7, max_depth = 3)
dt.fit(Features_train, Target_train)

r = export_text(dt, feature_names=Features)
print(r)

Here, we can see some interesting results. First, the tree associates Electric cars with distinctly low exterior styling ratings, which is an interesting counterpoint to the common perception that electric cars are mainly flashy toys. Among cars with high styling ratings, the tree also assigns Electric to those with higher performance ratings than Diesel and Hybrid cars. As for Hybrid and Diesel cars, the tree indicates that Hybrids have higher reliability ratings than Diesels. We can check these patterns by plotting the mean of each rating per fuel type.

In [453]:
labels = ['Electric', 'Hybrid', 'Diesel']
ReliabilityRating = []
PerformanceRating = []
ExteriorStylingRating = []

for item in labels:
    ReliabilityRating.append(df[df['FuelType'] == item]['ReliabilityRating'].mean())
    PerformanceRating.append(df[df['FuelType'] == item]['PerformanceRating'].mean())
    ExteriorStylingRating.append(df[df['FuelType'] == item]['ExteriorStylingRating'].mean())




# One bar chart per rating, comparing the mean across fuel types.
fig, axes = plt.subplots(1, 3, figsize=(12, 6))
axes[0].bar(labels, ReliabilityRating)
axes[0].set_title('Reliability')
axes[1].bar(labels, PerformanceRating)
axes[1].set_title('Performance')
axes[2].bar(labels, ExteriorStylingRating)
axes[2].set_title('Exterior Styling')
plt.show()
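The per-fuel-type means computed in the loop above can also be obtained in one step with `groupby`. This is a sketch on a small made-up frame (the rating values are hypothetical); the column names match those used in the notebook.

```python
import pandas as pd

# Hypothetical ratings; only the column names come from the notebook.
toy = pd.DataFrame({
    "FuelType": ["Electric", "Electric", "Hybrid", "Hybrid", "Diesel", "Diesel"],
    "ReliabilityRating": [4.0, 5.0, 4.5, 4.5, 4.0, 4.0],
    "PerformanceRating": [5.0, 4.0, 4.0, 4.0, 3.0, 4.0],
    "ExteriorStylingRating": [3.0, 4.0, 5.0, 4.0, 4.0, 5.0],
})

labels = ["Electric", "Hybrid", "Diesel"]
rating_cols = ["ReliabilityRating", "PerformanceRating", "ExteriorStylingRating"]

# Mean of each rating per fuel type; reindex fixes the row order to match labels.
means = toy.groupby("FuelType")[rating_cols].mean().reindex(labels)
print(means)
```

Each column of `means` can then be passed to `bar` in place of the three separate lists, for example `means["ReliabilityRating"]`.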

Overall, the model could be substantially improved with a larger, more balanced dataset. Removing gasoline cars allowed a more nuanced view of the other categories, but it also removed enough rows to reduce the effectiveness of the model. At the moment, we simply don't have enough training data to fit an effective model for most of our fuel types. One interesting result is that, in this dataset, electric cars still rank low compared to other kinds of cars in many rating categories, which may indicate that they still need development to be competitive in the market.