# Project 1: Used car price prediction
### Dataset: https://www.kaggle.com/datasets/therohithanand/used-car-price-prediction
### Introduction: This is a practice project on basic machine learning techniques: decision trees, train/validation splits, and mean absolute error (MAE).
In this project, I downloaded the pre-cleaned data from https://www.kaggle.com/datasets/therohithanand/used-car-price-prediction, trained a model to predict the price of used cars, and evaluated the model's accuracy using MAE.

In [3]:
import pandas as pd

# Load the data set from a local CSV file
used_car_file_path = "/Users/wangxinyuan/Desktop/Dataset/used_car_price_dataset_extended.csv"
used_car_data = pd.read_csv(used_car_file_path)

# Preview the first five rows
used_car_data.head()

,make_year,mileage_kmpl,engine_cc,fuel_type,owner_count,price_usd,brand,transmission,color,service_history,accidents_reported,insurance_valid
0,2001,8.17,4000,Petrol,4,8587.64,Chevrolet,Manual,White,,0,No
1,2014,17.59,1500,Petrol,4,5943.5,Honda,Manual,Black,,0,Yes
2,2023,18.09,2500,Diesel,5,9273.58,BMW,Automatic,Black,Full,1,Yes
3,2009,11.28,800,Petrol,1,6836.24,Hyundai,Manual,Blue,Full,0,Yes
4,2005,12.23,1000,Petrol,2,4625.79,Nissan,Automatic,Red,Full,0,Yes
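
Although the data set is described as pre-cleaned, the service_history column is empty in the first two preview rows, so it may be worth counting missing values before modeling. The sketch below rebuilds a few columns of the five preview rows by hand (an assumption made so the example runs without the CSV file); with the real data, the same method could be called on used_car_data.

```python
import pandas as pd

# Hand-copied subset of the five preview rows shown above (illustrative only).
# None marks the two empty service_history cells.
preview = pd.DataFrame({
    "make_year": [2001, 2014, 2023, 2009, 2005],
    "price_usd": [8587.64, 5943.5, 9273.58, 6836.24, 4625.79],
    "service_history": [None, None, "Full", "Full", "Full"],
})

# Count missing values per column
missing_counts = preview.isna().sum()
print(missing_counts)
```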


In [6]:
# Step 1: specify the prediction target
y = used_car_data.price_usd

In [10]:
# Step 2: specify the feature names (numeric columns only)
feature_names = ["make_year", "mileage_kmpl", "engine_cc", "owner_count"]
X = used_car_data[feature_names]
X.head()

,make_year,mileage_kmpl,engine_cc,owner_count
0,2001,8.17,4000,4
1,2014,17.59,1500,4
2,2023,18.09,2500,5
3,2009,11.28,800,1
4,2005,12.23,1000,2


In [11]:
# Step 3: fit the model (decision tree)
from sklearn.tree import DecisionTreeRegressor

used_car_price_model = DecisionTreeRegressor(random_state=1)
used_car_price_model.fit(X, y)

parameter,value
criterion,'squared_error'
splitter,'best'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,None
random_state,1
max_leaf_nodes,None
min_impurity_decrease,0.0


In [13]:
# Step 4: make predictions (in-sample: on the same rows used for fitting)
predictions = used_car_price_model.predict(X)
predictions

array([ 8587.64,  5943.5 ,  9273.58, ...,  4557.1 ,  7413.59, 11634.09],
      shape=(10000,))
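
The first three predictions (8587.64, 5943.5, 9273.58) are identical to the first three prices in the preview table, because the model is predicting rows it was fitted on. A minimal sketch of this effect, using only the five preview rows copied by hand (an assumption so the example is self-contained), shows an unrestricted tree reproducing its training targets:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Five preview rows copied by hand from the output above (illustrative subset)
X_small = pd.DataFrame({
    "make_year": [2001, 2014, 2023, 2009, 2005],
    "mileage_kmpl": [8.17, 17.59, 18.09, 11.28, 12.23],
    "engine_cc": [4000, 1500, 2500, 800, 1000],
    "owner_count": [4, 4, 5, 1, 2],
})
y_small = pd.Series([8587.64, 5943.5, 9273.58, 6836.24, 4625.79])

# Fit an unrestricted tree and predict the same rows it was trained on
tree = DecisionTreeRegressor(random_state=1)
tree.fit(X_small, y_small)
in_sample_mae = mean_absolute_error(y_small, tree.predict(X_small))

# Each row ends up in its own leaf, so the in-sample error is zero
print(in_sample_mae)
```

An in-sample error like this says little about how the model handles unseen cars, which is why the error is measured on held-out data next.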

### Split the data into training data and validation data

In [16]:
from sklearn.model_selection import train_test_split

# Step 1: split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
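
When neither test_size nor train_size is given, train_test_split holds out 25% of the rows for validation, so the 10,000 rows here would be divided into 7,500 training rows and 2,500 validation rows. A small sketch on placeholder data (the arrays below are stand-ins, not the car data) illustrates the default split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the car data set
X_demo = np.arange(10000).reshape(-1, 1)
y_demo = np.arange(10000)

# Default split: 75% training rows, 25% validation rows
train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=1
)
print(len(train_X_demo), len(val_X_demo))
```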

In [17]:
# Step 2: specify and fit the model on the training data only
used_car_price_model2 = DecisionTreeRegressor(random_state=1)
used_car_price_model2.fit(train_X, train_y)

parameter,value
criterion,'squared_error'
splitter,'best'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,None
random_state,1
max_leaf_nodes,None
min_impurity_decrease,0.0


In [18]:
# Step 3: make predictions on the validation data
predictions_val = used_car_price_model2.predict(val_X)

In [21]:
# Step 4: calculate the Mean Absolute Error (MAE) on the validation data
from sklearn.metrics import mean_absolute_error

# scikit-learn expects the true values first, then the predictions
mae = mean_absolute_error(val_y, predictions_val)
print(mae)

1499.409026
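
MAE is the average of the absolute differences between each prediction and the true value, so an MAE of about 1,499 means the predictions are off by roughly 1,499 USD on average. The sketch below computes it by hand; the true prices are taken from the preview rows and the predicted prices are made-up numbers for illustration.

```python
# True prices from the preview rows; predictions are hypothetical values
actual = [8587.64, 5943.5, 9273.58]
predicted = [8000.0, 6500.0, 9000.0]

# MAE = mean of |actual - predicted|
absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
manual_mae = sum(absolute_errors) / len(absolute_errors)
print(round(manual_mae, 2))
```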


### Overfitting and Underfitting 
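The validation MAE of about 1,499 USD is large relative to the prices in the preview rows (roughly 4,600 to 9,300 USD), so the predictions are not very good. An unrestricted tree keeps splitting until it fits the training data almost exactly (overfitting), while a very small tree misses real patterns in the data (underfitting). Next, I will tune the size of the tree with max_leaf_nodes to find a better balance.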
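
To illustrate the two failure modes, the sketch below uses synthetic data (an assumption; it is not the car data) and compares training and validation MAE for a very small tree and an unrestricted tree. A large gap between the two errors suggests overfitting, while a high error on both suggests underfitting.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy linear relationship (illustrative only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(500, 1))
y_syn = 3 * X_syn[:, 0] + rng.normal(0, 2, size=500)
tr_X, va_X, tr_y, va_y = train_test_split(X_syn, y_syn, random_state=1)

results = {}
for leaves in [2, None]:  # None leaves the tree size unrestricted
    model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    model.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, model.predict(tr_X))
    val_mae = mean_absolute_error(va_y, model.predict(va_X))
    results[leaves] = (train_mae, val_mae)
    print(f"max_leaf_nodes={leaves}: train MAE={train_mae:.2f}, validation MAE={val_mae:.2f}")
```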

In [22]:
# Step 0: write the function to get the validation MAE for a given tree size
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    pred_vals = model.predict(val_X)
    mae_err = mean_absolute_error(val_y, pred_vals)
    return mae_err

In [23]:
# Step 1: compare different tree sizes (%d prints the MAE truncated to an integer)
candidate_max_leaf_nodes = [5, 25, 50, 75, 100, 500]
for max_leaf_nodes in candidate_max_leaf_nodes:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" % (max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  1514
Max leaf nodes: 25  		 Mean Absolute Error:  1273
Max leaf nodes: 50  		 Mean Absolute Error:  1210
Max leaf nodes: 75  		 Mean Absolute Error:  1194
Max leaf nodes: 100  		 Mean Absolute Error:  1181
Max leaf nodes: 500  		 Mean Absolute Error:  1197
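
Validation MAE falls as the tree grows up to 100 leaf nodes and rises again at 500, which is consistent with overfitting at larger sizes. Rather than reading the best size off the printout, it can be picked programmatically. The sketch below hard-codes the scores printed above; in the notebook they would come from get_mae.

```python
# Validation MAE per tree size, copied from the printed results above
scores = {5: 1514, 25: 1273, 50: 1210, 75: 1194, 100: 1181, 500: 1197}

# Pick the tree size with the lowest validation MAE
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)
```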


In [24]:
# 100 leaf nodes gave the lowest validation MAE (1181) among the candidates
max_leaf_nodes = 100

In [25]:
# Step 2: refit the model on all the data using the chosen tree size
model2 = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
model2.fit(X, y)

parameter,value
criterion,'squared_error'
splitter,'best'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,None
random_state,1
max_leaf_nodes,100
min_impurity_decrease,0.0
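
Once refitted, the final model can price a car it has not seen. The sketch below is self-contained, so it fits on the five preview rows copied by hand instead of the full data, and the new car's feature values are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

feature_names = ["make_year", "mileage_kmpl", "engine_cc", "owner_count"]

# Five preview rows copied by hand (stand-in for the full data set)
X_small = pd.DataFrame(
    [[2001, 8.17, 4000, 4], [2014, 17.59, 1500, 4], [2023, 18.09, 2500, 5],
     [2009, 11.28, 800, 1], [2005, 12.23, 1000, 2]],
    columns=feature_names,
)
y_small = pd.Series([8587.64, 5943.5, 9273.58, 6836.24, 4625.79])

final_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
final_model.fit(X_small, y_small)

# Hypothetical car to price; columns must match the training features
new_car = pd.DataFrame([[2018, 15.0, 1200, 2]], columns=feature_names)
predicted_price = final_model.predict(new_car)
print(predicted_price)
```

Passing the new car as a DataFrame with the same column names as the training features keeps the inputs aligned with what the model was fitted on.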
