# 🚜 Predicting the Sale Price of Bulldozers using Machine Learning

## 1. Problem definition

> How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?

## 2. Data

The data is downloaded from the Kaggle Bluebook for Bulldozers competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data

There are 3 main datasets:

* Train.csv is the training set, which contains data through the end of 2011.
* Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012.
* Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012.

## 3. Evaluation

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

For more on the evaluation of this project check: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation
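
As a rough illustration of how RMSLE behaves, the sketch below computes it by hand for a few made-up prices (the numbers are invented for this example, not taken from the dataset). It assumes the usual definition, the square root of the mean of `(log(1 + predicted) - log(1 + actual))^2`, and uses only the standard library so the arithmetic stays visible. The function name `rmsle_by_hand` is hypothetical.

```python
import math

def rmsle_by_hand(actual, predicted):
    """Root mean squared log error, using log(1 + x) for each value."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Made-up prices, purely for illustration.
actual = [10_000, 50_000, 100_000]

# Every prediction is 10% too high: the same ratio error at every price level.
ten_percent_over = [a * 1.1 for a in actual]

# Every prediction is 5,000 too high: a much larger ratio error on the cheap machine.
five_thousand_over = [a + 5_000 for a in actual]

score_ratio = rmsle_by_hand(actual, ten_percent_over)
score_absolute = rmsle_by_hand(actual, five_thousand_over)
print(score_ratio)
print(score_absolute)
```

Because the error is measured on a log scale, it reflects relative rather than absolute differences, which suits auction prices that span a wide range.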

## 4. Features

Kaggle provides a data dictionary detailing all of the features of the dataset. 
Google Sheets: https://docs.google.com/spreadsheets/d/18ly-bLR8sbDJLITkWG7ozKm8l3RyieQ2Fpgix-beSYI/edit?usp=sharing

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

In [None]:
#Import training and validation sets

df=pd.read_csv("../input/bluebook-for-bulldozers/TrainAndValid.csv",
              low_memory=False)

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.head().T

In [None]:
fig, ax= plt.subplots()
ax.scatter(df["saledate"][:1000], df["SalePrice"][:1000])

In [None]:
df.saledate[:1000]

In [None]:
df.SalePrice.plot.hist()

In [None]:
df.ModelID.value_counts()

### Parsing dates

Because this is time series data, we want to enrich the date-time component of the data as much as possible.
We can do that by telling pandas which of our columns has dates in it using the `parse_dates` parameter.
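
To see what `parse_dates` changes, here is a minimal sketch on a tiny in-memory CSV. The rows and the ISO date format are made up for the example and stand in for the real file.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for the real file.
csv_text = "SalesID,SalePrice,saledate\n1,66000,2006-11-16\n2,57000,2004-03-26\n"

# Without parse_dates, the column is read as strings, not datetimes.
df_raw = pd.read_csv(io.StringIO(csv_text))
print(df_raw["saledate"].dtype)

# With parse_dates, pandas converts the column to a datetime dtype.
df_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])
print(df_parsed["saledate"].dtype)

# Datetime columns expose the .dt accessor used later in this notebook.
sale_years = df_parsed["saledate"].dt.year.tolist()
print(sale_years)
```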

In [None]:
#Import data again with parse dates
df= pd.read_csv("../input/bluebook-for-bulldozers/TrainAndValid.csv",
               low_memory=False,
               parse_dates=["saledate"])

In [None]:
df.saledate.dtype

In [None]:
df.saledate[:1000]

In [None]:
fig, ax= plt.subplots()
ax.scatter(df["saledate"][:1000], df["SalePrice"][:1000])

In [None]:
df.head().T

### Sort DataFrame by saledate

When working with time series data, it's a good idea to sort it by date.

In [None]:
#Sort DataFrame in date order
df.sort_values(by=["saledate"], inplace=True, ascending=True)
df.saledate.head(20)

### Make a copy of the original DataFrame

We make a copy of the original dataframe so when we manipulate the copy, we've still got our original data.

In [None]:
#Make a copy
df_tmp = df.copy()

In [None]:
### Add datetime parameters for saledate 
df_tmp["saleYear"] = df_tmp.saledate.dt.year
df_tmp["saleMonth"] = df_tmp.saledate.dt.month
df_tmp["saleDay"] = df_tmp.saledate.dt.day
df_tmp["saleDayOfWeek"] = df_tmp.saledate.dt.dayofweek
df_tmp["saleDayOfYear"] = df_tmp.saledate.dt.dayofyear

In [None]:
df_tmp.head().T

In [None]:
#saledate column can be dropped now
df_tmp.drop("saledate",axis=1,inplace=True)

In [None]:
#Check the values of different columns
df_tmp.state.value_counts()

## 5. Modeling



In [None]:
# Let's build a machine learning model 
#from sklearn.ensemble import RandomForestRegressor

#model = RandomForestRegressor(n_jobs=-1,
#                              random_state=42)

#model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])

If uncommented, the cell above raises an error because the DataFrame still contains non-numeric (object dtype) columns.

In [None]:
#Attributes are in object type
#Will need to convert to numerical for training model
df_tmp["UsageBand"].dtype

### Convert string to categories

Convert all data into numbers by using pandas categories.

Documentation to check for different datatypes:
https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#data-types-related-functionality
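
As a small sketch of what the conversion does, the example below applies it to a made-up Series rather than the real `UsageBand` column. Categories are sorted alphabetically, each value gets an integer code, and missing values get the code -1.

```python
import pandas as pd

# Made-up values, including one missing entry.
usage = pd.Series(["Low", "High", None, "Medium"])

# Same conversion the notebook applies to every string column.
usage_cat = usage.astype("category").cat.as_ordered()

categories = list(usage_cat.cat.categories)
codes = usage_cat.cat.codes.tolist()
print(categories)  # ['High', 'Low', 'Medium']
print(codes)       # [1, 0, -1, 2]
```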

In [None]:
df_tmp.head().T

In [None]:
pd.api.types.is_string_dtype(df_tmp["UsageBand"])

In [None]:
#Find columns which contain strings
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

In [None]:
#Turn all string value to category values
for label,content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        df_tmp[label]=content.astype("category").cat.as_ordered()

In [None]:
df_tmp.info()

In [None]:
df_tmp.UsageBand.cat.categories

In [None]:
#-1: missing value
#0: High
#1: Low
#2: Medium
df_tmp.UsageBand.cat.codes

In [None]:
#Check missing data
df_tmp.isnull().sum()/len(df_tmp) >0  # True for every column with at least one missing value

### Save preprocessed data
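
One thing worth knowing about this step: CSV files have no notion of a category dtype, so a column saved as categorical does not come back as categorical when it is read again. The sketch below shows this on a made-up one-column frame, with an in-memory buffer standing in for `train_tmp.csv`. The later cells in this notebook re-encode the non-numeric columns with `pd.Categorical`, so the round trip does not break the workflow.

```python
import io
import pandas as pd

# Made-up frame with one categorical column and one missing value.
df_demo = pd.DataFrame({"UsageBand": ["Low", "High", None]})
df_demo["UsageBand"] = df_demo["UsageBand"].astype("category")

# Write to an in-memory buffer instead of a file on disk, then read it back.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)

print(df_demo["UsageBand"].dtype)       # category before saving
print(df_roundtrip["UsageBand"].dtype)  # no longer category after reloading
n_missing_after = int(df_roundtrip["UsageBand"].isna().sum())
print(n_missing_after)                  # the missing value survives the round trip
```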

In [None]:
#Export current tmp dataframe
df_tmp.to_csv("train_tmp.csv",index=False)

In [None]:
#Import preprocessed data
df_tmp = pd.read_csv("train_tmp.csv",low_memory=False)
df_tmp.head().T

In [None]:
df_tmp.isna().sum()

## Fill missing values

### Fill numerical missing values first
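
The idea used in the cells below can be sketched on a toy column: record which rows were missing in a separate boolean column, then fill the gaps with the median. The column name `hours_used` and its values are hypothetical. The median is used here because, unlike the mean, it is not pulled upward by a single very large value.

```python
import pandas as pd

# Hypothetical column with one missing value and one outlier.
df_demo = pd.DataFrame({"hours_used": [100.0, None, 300.0, 5000.0]})
col = "hours_used"

# Record which rows were missing before filling them.
df_demo[col + "_is_missing"] = df_demo[col].isna()

# The median of the observed values is 300.0, while the mean would be 1800.0.
median_value = df_demo[col].median()
mean_value = df_demo[col].mean()
df_demo[col] = df_demo[col].fillna(median_value)

print(df_demo)
```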

In [None]:
#Check which numeric columns have null values
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
#Fill missing values in numeric columns with the median
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Add a binary column which tells us if the data was missing or not
            df_tmp[label+"_is_missing"] = pd.isnull(content)
            # Fill missing numeric values with median
            df_tmp[label] = content.fillna(content.median())

In [None]:
#Check if there's any null numeric values now
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
#Check to see how many examples were missing
df_tmp.auctioneerID_is_missing.value_counts()

In [None]:
df_tmp.isna().sum()

### Filling and turning categorical variables into numbers

In [None]:
#Check for columns which aren't numeric
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
# Turn categorical variables into numbers and fill missing values
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Add binary column to indicate whether sample had missing value
        df_tmp[label+"_is_missing"] = pd.isnull(content)
        # Turn categories into numbers and add +1
        df_tmp[label] = pd.Categorical(content).codes+1 

In [None]:
df_tmp.info()

In [None]:
df_tmp.isna().sum()

## Model Building

In [None]:
len(df_tmp)

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
%%time
#Instantiate model
model = RandomForestRegressor(n_jobs=-1,
                             random_state=42)

#Fit model
model.fit(df_tmp.drop("SalePrice",axis=1),df_tmp["SalePrice"])

In [None]:
#Score the model (on the same data it was trained on, so this score is optimistic)
model.score(df_tmp.drop("SalePrice",axis=1),df_tmp["SalePrice"])

### Splitting data into train/validation sets
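
Since the validation set described in the Data section covers sales from 2012, the split is made on `saleYear` rather than at random, so the model is always evaluated on sales that happened after the ones it was trained on. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows, for illustration only.
df_demo = pd.DataFrame({
    "saleYear": [2010, 2011, 2011, 2012, 2012],
    "SalePrice": [20000, 31000, 15000, 42000, 27000],
})

# Everything sold in 2012 goes to validation, everything earlier to training.
df_val_demo = df_demo[df_demo["saleYear"] == 2012]
df_train_demo = df_demo[df_demo["saleYear"] != 2012]

print(len(df_train_demo), len(df_val_demo))

# No training row is newer than the oldest validation row.
print(df_train_demo["saleYear"].max() < df_val_demo["saleYear"].min())
```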

In [None]:
df_tmp.saleYear

In [None]:
df_tmp.saleYear.value_counts()

In [None]:
#Split data into training and validation
df_val = df_tmp[df_tmp.saleYear ==2012]
df_train = df_tmp[df_tmp.saleYear !=2012]

len(df_val),len(df_train)

In [None]:
#Split data into X and y
X_train, y_train = df_train.drop("SalePrice", axis=1), df_train.SalePrice
X_valid, y_valid = df_val.drop("SalePrice", axis=1), df_val.SalePrice

X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

## Building an evaluation function

In [None]:
#Create evaluation function (the competition uses RMSLE)
from sklearn.metrics import mean_squared_log_error,mean_absolute_error,r2_score

def rmsle(y_test,y_preds):
    """
    Calculate root mean squared log error between predictions and true labels
    """
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

#Create function to evaluate model on a few different levels
def show_scores(model):
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_valid)
    scores = {"Training MAE": mean_absolute_error(y_train,train_preds),
              "Valid MAE": mean_absolute_error(y_valid,val_preds),
              "Training RMSLE": rmsle(y_train, train_preds),
              "Valid RMSLE": rmsle(y_valid, val_preds),
              "Training R^2": r2_score(y_train, train_preds),
              "Valid R^2": r2_score(y_valid, val_preds)}
    return scores

## Test the model on a subset (to tune the hyperparameters)
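
A back-of-the-envelope calculation shows why capping `max_samples` helps. It assumes scikit-learn's default of 100 trees and the 401,125 training rows noted in the cells below, and it only counts rows drawn per tree, so it is a rough guide rather than a timing prediction.

```python
n_rows = 401_125      # len(X_train), as noted in this notebook
n_estimators = 100    # assumed: scikit-learn's default for RandomForestRegressor
max_samples = 10_000  # cap on the rows each tree is allowed to see

# With bootstrapping and no cap, each tree draws as many rows as the training set has.
rows_full = n_rows * n_estimators

# With the cap, each tree draws only max_samples rows.
rows_capped = max_samples * n_estimators

reduction_factor = rows_full / rows_capped
print(rows_full, rows_capped, reduction_factor)
```

Under these assumptions the forest processes roughly 40 times fewer rows, at the cost of each tree seeing a smaller slice of the data.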

In [None]:
# # This takes far too long... for experimenting

# %%time
# model = RandomForestRegressor(n_jobs=-1, 
#                               random_state=42)

# model.fit(X_train, y_train)

In [None]:
#X_train has 401125 rows
#Limit each estimator to 10000 of them with max_samples
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42,
                              max_samples=10000)

In [None]:
%%time
#Cutting down on the max number of samples each estimator can see improves training time
model.fit(X_train,y_train)

In [None]:
show_scores(model)

## Hyperparameter tuning with RandomizedSearchCV
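
To put the search below in perspective, this sketch counts how many combinations the grid contains and how many model fits `n_iter=2` with `cv=5` actually performs. It mirrors the grid with plain `range` objects instead of `np.arange`, and the name `grid_sizes` is hypothetical.

```python
import math

# Number of candidate values per hyperparameter, mirroring the grid used below.
grid_sizes = {
    "n_estimators": len(range(10, 100, 10)),    # 10, 20, ..., 90
    "max_depth": 4,                             # None, 3, 5, 10
    "min_samples_split": len(range(2, 20, 2)),  # 2, 4, ..., 18
    "min_samples_leaf": len(range(1, 20, 2)),   # 1, 3, ..., 19
    "max_features": 4,
    "max_samples": 1,
}

total_combinations = math.prod(grid_sizes.values())

# RandomizedSearchCV samples n_iter combinations and fits each one cv times.
n_iter, cv = 2, 5
search_fits = n_iter * cv

print(total_combinations, search_fits)
```

With only 2 of the possible combinations sampled, the search is a quick smoke test. A larger `n_iter` explores more of the grid but takes proportionally longer.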

In [None]:
%%time
from sklearn.model_selection import RandomizedSearchCV

# Different RandomForestRegressor hyperparameters
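rf_grid = {"n_estimators": np.arange(10, 100, 10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2),
           # "auto" was removed in scikit-learn 1.3; for regressors it meant
           # "use all features", which 1.0 expresses explicitly
           "max_features": [0.5, 1, "sqrt", 1.0],
           "max_samples": [10000]}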

# Instantiate RandomizedSearchCV model
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                    random_state=42),
                              param_distributions=rf_grid,
                              n_iter=2,
                              cv=5,
                              verbose=True)

# Fit the RandomizedSearchCV model
rs_model.fit(X_train, y_train)

In [None]:
#Find best model hyperparameters
rs_model.best_params_

In [None]:
show_scores(rs_model)

### Train a model with the best hyperparameters

In [None]:
%%time

#Hyperparameters chosen for the final model
ideal_model = RandomForestRegressor(n_estimators=40,
                                    min_samples_leaf=1,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None,
                                    random_state=42) # random state so our results are reproducible

#Fit the model
ideal_model.fit(X_train,y_train)

In [None]:
#Scores for ideal model (trained on all the data)
show_scores(ideal_model)

In [None]:
#Scores on rs_model (only trained on ~10000 examples)
show_scores(rs_model)

## Make predictions on test data

In [None]:
#Import the test data
df_test = pd.read_csv("../input/bluebook-for-bulldozers/Test.csv",
                     low_memory = False,
                     parse_dates=["saledate"])

df_test.head()

## Preprocessing the test data (to same format as our training dataset)
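
One caveat when encoding the test set separately from the training set: `pd.Categorical(...).codes` numbers categories by their sorted position within whatever data it is given, so the same string can receive different codes in the two sets. The sketch below shows this on made-up `state` values and then one possible refinement, reusing the training categories. This refinement is an assumption on my part and is not what the function below does.

```python
import pandas as pd

# Made-up values: the test set happens to lack "Alabama".
train_states = pd.Series(["Alabama", "Florida", "Texas"])
test_states = pd.Series(["Florida", "Texas"])

# Encoding each set on its own, as this notebook does (+1 so missing becomes 0).
train_codes = (pd.Categorical(train_states).codes + 1).tolist()
test_codes_independent = (pd.Categorical(test_states).codes + 1).tolist()
print(train_codes)             # [1, 2, 3]
print(test_codes_independent)  # [1, 2] -> "Florida" is 1 here but 2 in training

# Possible refinement: reuse the categories learned from the training data.
train_categories = pd.Categorical(train_states).categories
test_codes_shared = (
    pd.Categorical(test_states, categories=train_categories).codes + 1
).tolist()
print(test_codes_shared)       # [2, 3] -> consistent with training
```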

In [None]:
def preprocess_data(df):
    """
    Performs transformations on df and returns transformed df.
    """
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayOfWeek"] = df.saledate.dt.dayofweek
    df["saleDayOfYear"] = df.saledate.dt.dayofyear
    
    df.drop("saledate", axis=1, inplace=True)
    
    # Fill the numeric rows with median
    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                # Add a binary column which tells us if the data was missing or not
                df[label+"_is_missing"] = pd.isnull(content)
                # Fill missing numeric values with median
                df[label] = content.fillna(content.median())
    
        # Fill categorical missing data and turn categories into numbers
        if not pd.api.types.is_numeric_dtype(content):
            df[label+"_is_missing"] = pd.isnull(content)
            # We add +1 to the category code because pandas encodes missing categories as -1
            df[label] = pd.Categorical(content).codes+1
    
    return df

In [None]:
#Process the test data
df_test = preprocess_data(df_test)
df_test.head()

In [None]:
#df_test has 101 columns but training dataset has 102 columns
X_train.head()

In [None]:
#Find how the columns differ using sets
set(X_train.columns)- set(df_test.columns)

In [None]:
#Manually add the missing column
df_test["auctioneerID_is_missing"] = False
df_test.head()

The test dataset now has the same columns as the training dataset, so we can make predictions.
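
The column check and fix can also be done in one step. The sketch below uses made-up frames and `DataFrame.reindex`, which adds any columns the test set lacks and puts all columns in the training order. Filling with `False` is only appropriate here because the missing column is an `_is_missing` indicator. The frame and column names are hypothetical.

```python
import pandas as pd

# Made-up frames: the test frame lacks one indicator column and is ordered differently.
X_train_demo = pd.DataFrame({"a": [1], "a_is_missing": [False], "b": [2]})
X_test_demo = pd.DataFrame({"b": [5], "a": [3]})

# Same set difference as in the notebook.
missing_columns = set(X_train_demo.columns) - set(X_test_demo.columns)
print(missing_columns)

# Add the missing column (filled with False) and match the training column order.
X_test_aligned = X_test_demo.reindex(columns=X_train_demo.columns, fill_value=False)
print(list(X_test_aligned.columns))
```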

In [None]:
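#Make predictions on test data
#Reorder the columns to match X_train, since scikit-learn expects the feature order used in fit
test_preds = ideal_model.predict(df_test[X_train.columns])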

In [None]:
test_preds

Format the predictions into the same format as the competition asked for:

https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation


In [None]:
#Format the prediction outcome
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalePrice"] = test_preds  # same name as the target column
df_preds

In [None]:
#Export prediction data
df_preds.to_csv("test_predictions.csv",index=False)

## Feature importance

Feature importance seeks to figure out which different attributes of the data were most important when it comes to predicting the target variable (SalePrice).
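
As a small sanity check of what `feature_importances_` reports, the sketch below fits a random forest on synthetic data in which only the first of three features influences the target. The data and the name `toy_model` are made up for the example. The importances are impurity-based and normalized to sum to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on feature 0 only, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=200)

toy_model = RandomForestRegressor(n_estimators=20, random_state=0)
toy_model.fit(X, y)

importances = toy_model.feature_importances_
print(importances)        # feature 0 should dominate
print(importances.sum())  # importances are normalized to sum to 1
```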

In [None]:
#Find feature importance of our best model
ideal_model.feature_importances_

In [None]:
#Helper function for plotting the feature importance
def plot_features(columns,importances,n=20):
    df = (pd.DataFrame({"features": columns,
                       "feature_importances": importances})
         .sort_values("feature_importances",ascending=False)
         .reset_index(drop=True))
    
    #plot the dataframe
    fig,ax = plt.subplots()
    ax.barh(df["features"][:n],df["feature_importances"][:n])
    ax.set_ylabel("Features")
    ax.set_xlabel("Feature importance")
    ax.invert_yaxis()

In [None]:
plot_features(X_train.columns,ideal_model.feature_importances_)