# Modelling

In this notebook, we train a model to predict the future sale price of a bulldozer.

In the first part, we consider all the features, even those with 70% or more missing data. In the second part, we drop those features and compare the results.

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor


## Data Wrangling

Loading the dataset and pre-processing data to train a model.

In [2]:
df = pd.read_csv("/work/bulldozer-data-date-parsing.csv")



In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 57 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   412698 non-null  int64  
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64  
 3   ModelID                   412698 non-null  int64  
 4   datasource                412698 non-null  int64  
 5   auctioneerID              392562 non-null  float64
 6   YearMade                  412698 non-null  int64  
 7   MachineHoursCurrentMeter  147504 non-null  float64
 8   UsageBand                 73670 non-null   object 
 9   fiModelDesc               412698 non-null  object 
 10  fiBaseModel               412698 non-null  object 
 11  fiSecondaryDesc           271971 non-null  object 
 12  fiModelSeries             58667 non-null   object 
 13  fiModelDescriptor         74816 non-null   o

### Filling numeric variables

In [4]:
for label, content in df.items():
    if pd.api.types.is_numeric_dtype(content):
        df[label] = content.fillna(content.mean())
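
The loop above replaces each missing numeric value with the mean of its column. A minimal sketch on a toy frame shows the effect; the values are made up for illustration, and text columns are left untouched at this stage.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one numeric gap and one text column.
toy = pd.DataFrame({
    "MachineHoursCurrentMeter": [100.0, np.nan, 300.0],
    "UsageBand": ["Low", None, "High"],
})

# Collect the numeric columns first, then fill each gap with the column mean.
numeric_cols = [c for c in toy.columns if pd.api.types.is_numeric_dtype(toy[c])]
for col in numeric_cols:
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
```

Note that a mean computed on the whole dataset also uses validation rows; computing it on the training rows only would be a stricter alternative.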

### Convert object variables to category and fill missing values

In [5]:
for label, content in df.items():
    if pd.api.types.is_string_dtype(content):
        df[label] = content.astype("category").cat.as_ordered()


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 57 columns):
 #   Column                    Non-Null Count   Dtype   
---  ------                    --------------   -----   
 0   SalesID                   412698 non-null  int64   
 1   SalePrice                 412698 non-null  float64 
 2   MachineID                 412698 non-null  int64   
 3   ModelID                   412698 non-null  int64   
 4   datasource                412698 non-null  int64   
 5   auctioneerID              412698 non-null  float64 
 6   YearMade                  412698 non-null  int64   
 7   MachineHoursCurrentMeter  412698 non-null  float64 
 8   UsageBand                 73670 non-null   category
 9   fiModelDesc               412698 non-null  category
 10  fiBaseModel               412698 non-null  category
 11  fiSecondaryDesc           271971 non-null  category
 12  fiModelSeries             58667 non-null   category
 13  fiModelDescriptor         748

In [7]:
for label, content in df.items():
    if not pd.api.types.is_numeric_dtype(content):
        df[label] = pd.Categorical(content).codes + 1
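
Adding 1 to the category codes matters because pandas encodes a missing category as -1, so after the shift missing values become 0 and real categories start at 1. A small sketch with made-up values illustrates this:

```python
import pandas as pd

# Hypothetical column with one missing entry.
s = pd.Series(["Low", None, "High", "Low"])

# Categories are sorted alphabetically: "High" -> 0, "Low" -> 1, missing -> -1.
# Shifting by one maps missing to 0 and keeps every code non-negative.
codes = pd.Categorical(s).codes + 1

print(list(codes))  # [2, 0, 1, 2]
```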

In [8]:
df.isnull().sum()

SalesID                     0
SalePrice                   0
MachineID                   0
ModelID                     0
datasource                  0
auctioneerID                0
YearMade                    0
MachineHoursCurrentMeter    0
UsageBand                   0
fiModelDesc                 0
fiBaseModel                 0
fiSecondaryDesc             0
fiModelSeries               0
fiModelDescriptor           0
ProductSize                 0
fiProductClassDesc          0
state                       0
ProductGroup                0
ProductGroupDesc            0
Drive_System                0
Enclosure                   0
Forks                       0
Pad_Type                    0
Ride_Control                0
Stick                       0
Transmission                0
Turbocharged                0
Blade_Extension             0
Blade_Width                 0
Enclosure_Type              0
Engine_Horsepower           0
Hydraulics                  0
Pushblock                   0
Ripper    

## Modelling

In [9]:
# Create a RandomForestRegressor whose trees each draw only 10,000 samples, to see how it performs.

rf = RandomForestRegressor(random_state = 42, max_samples = 10000)
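
As a side note, `max_samples` limits how many rows each tree draws for its bootstrap sample; the forest as a whole can still see more of the data across trees. The sketch below uses synthetic data and small, arbitrary sizes (500 rows, 20 trees, 100 rows per tree) purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption: not related to the bulldozer dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Each of the 20 trees is fit on a bootstrap sample of 100 rows, not on all 500.
rf_small = RandomForestRegressor(n_estimators=20, max_samples=100, random_state=42)
rf_small.fit(X, y)

print(len(rf_small.estimators_), "trees fitted")
```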

Before training our model, we need to create our own training and validation sets.

In [10]:
# Make a copy of our dataset
df_copy = df.copy()

# Sort our data by the sale year
df_copy.sort_values(by = "SaleYear", inplace = True)

In [11]:
# Split by sale year: rows before 2012 form the training set, rows from 2012 the validation set

train = df_copy[df_copy["SaleYear"] < 2012]
val = df_copy[df_copy["SaleYear"] == 2012]

In [12]:
# Create X and y

X_train, y_train = train.drop("SalePrice", axis = 1), train["SalePrice"]
X_val, y_val = val.drop("SalePrice", axis = 1), val["SalePrice"]
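
The same time-based split can be sketched as a small reusable function. The helper name `split_by_year` and the toy values are hypothetical; a strict `<` comparison keeps any later year out of the training set, so the model is always validated on data that comes after what it was trained on.

```python
import pandas as pd

def split_by_year(frame, year_col, val_year):
    """Hypothetical helper: rows before val_year train, rows in val_year validate."""
    train_part = frame[frame[year_col] < val_year]
    val_part = frame[frame[year_col] == val_year]
    return train_part, val_part

# Toy data with made-up prices.
toy = pd.DataFrame({
    "SaleYear": [2010, 2011, 2012, 2012, 2011],
    "SalePrice": [10.0, 12.0, 15.0, 14.0, 11.0],
})

toy_train, toy_val = split_by_year(toy, "SaleYear", 2012)
print(len(toy_train), len(toy_val))  # 3 2
```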

In [13]:
rf.fit(X_train, y_train)

In [14]:
rf.score(X_val, y_val)

0.8335455561396891

In [15]:
# Import other metrics to test our model. 
from sklearn.metrics import mean_absolute_error, mean_squared_log_error, r2_score

def rmsle(y_val, y_pred):
    msle = mean_squared_log_error(y_val, y_pred)
    return np.sqrt(msle)
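
RMSLE measures error on a log scale, so being off by 10% costs roughly the same on a cheap machine as on an expensive one. The sketch below, on made-up prices, checks that the scikit-learn based computation agrees with the formula written out by hand using `np.log1p`.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical prices, for illustration only.
y_true = np.array([10000.0, 25000.0, 60000.0])
y_hat = np.array([12000.0, 24000.0, 50000.0])

# RMSLE via scikit-learn, as in the helper above.
rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true, y_hat))

# The same quantity by hand: root mean squared error of log(1 + value).
rmsle_manual = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_hat)) ** 2))

print(np.isclose(rmsle_sklearn, rmsle_manual))
```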

In [22]:
predictions = rf.predict(X_val)

def scoring(y_val, y_pred):
    score_dict = {}
    mae = mean_absolute_error(y_val, y_pred)
    score_dict["MAE"] = mae
    logerr = rmsle(y_val, y_pred)
    score_dict["RMSLE"] = logerr
    r2 = r2_score(y_val, y_pred)
    score_dict["R2"] = r2

    print(f"Mean absolute error: {mae}")
    print(f"Root mean squared log error: {logerr}")
    print(f"R2 score: {r2}")
    return score_dict

The model performs fairly well, even though each tree was trained on only 10,000 samples.

In [18]:
score = scoring(y_val, predictions)

Mean absolute error: 7134.698415276938
Root mean squared log error: 0.2919730060328132
R2 score: 0.8335455561396891


In [19]:
# Now we train the model on the full training set
rf_all = RandomForestRegressor(random_state = 42)
rf_all.fit(X_train, y_train)

In [23]:
y_pred = rf_all.predict(X_val)
score_2 = scoring(y_val, y_pred)

Mean absolute error: 6112.923909098765
Root mean squared log error: 0.25404690119729373
R2 score: 0.8731890838521611
Root mean squared log error: 0.25404690119729373
R2 score: 0.8731890838521611


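Training on the full training set improved validation performance: the mean absolute error fell from about 7,135 to 6,113, the RMSLE from 0.292 to 0.254, and the R2 score rose from 0.834 to 0.873.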
