# 🚜 Predicting the Sale Price of Bulldozers using Machine Learning
In this notebook, we're going to go through an example machine learning project with the goal of predicting the sale price of bulldozers.

### 1. Problem definition
How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?

### 2. Data
The data is downloaded from the Kaggle Bluebook for Bulldozers competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data

There are 3 main datasets:

* Train.csv is the training set, which contains data through the end of 2011.
* Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
* Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.


### 3. Evaluation
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

For more on the evaluation of this project check: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation

Note: The goal for most regression evaluation metrics is to minimize the error. For example, our goal for this project will be to build a machine learning model which minimises RMSLE.
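As a rough illustration of the metric, RMSLE is the square root of the mean of the squared differences between `log(1 + prediction)` and `log(1 + actual)`. The sketch below writes it out with NumPy on made-up prices (the values and the name `rmsle_sketch` are hypothetical, not from the dataset). It shows that the metric depends on the ratio between prediction and actual price rather than on the absolute difference.

```python
import numpy as np

def rmsle_sketch(y_true, y_pred):
    """Root mean squared log error, written out with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), which keeps zero prices well-defined
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(log_diff ** 2))

# Hypothetical prices: both predictions are too low by a factor of about 10
cheap_error = rmsle_sketch([99], [9])
expensive_error = rmsle_sketch([9999], [999])
print(cheap_error, expensive_error)
```

Both errors come out the same, even though the second prediction is off by thousands of dollars more than the first. This is why RMSLE suits auction prices that span several orders of magnitude.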

### 4. Features
Kaggle provides a data dictionary detailing all of the features of the dataset. You can view this data dictionary on Google Sheets: https://docs.google.com/spreadsheets/d/18ly-bLR8sbDJLITkWG7ozKm8l3RyieQ2Fpgix-beSYI/edit?usp=sharing

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.ensemble import RandomForestRegressor

In [None]:
# import Data (training and validation sets)
df = pd.read_csv("../input/bluebook-for-bulldozers/TrainAndValid.csv", low_memory=False)

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
fig, ax =  plt.subplots()
ax.scatter(df["saledate"][:1000], df.SalePrice[:1000])

In [None]:
df.SalePrice.plot.hist();

### Parsing Dates
When we work with time series data, we want to enrich the time and date component as much as possible.

We can do that by telling pandas which of our columns contain dates, using the `parse_dates` parameter of `pd.read_csv`.
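A minimal sketch of the difference, using a hypothetical two-row CSV held in memory instead of the real file (the rows are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = "SalesID,saledate,SalePrice\n1,2006-11-16,66000\n2,2004-03-26,57000\n"

# Without parse_dates, the saledate column is read as plain strings
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, pandas converts the column to datetime64
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(raw["saledate"].dtype, parsed["saledate"].dtype)
# The .dt accessor only works on the parsed column
print(parsed["saledate"].dt.year.tolist())
```

Once the column is a real datetime, plots get a proper time axis and attributes such as year or month can be extracted directly.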

In [None]:
# Import data again, but this time parse dates

df = pd.read_csv("../input/bluebook-for-bulldozers/TrainAndValid.csv", low_memory=False, parse_dates=["saledate"])

In [None]:
df.saledate.dtype

In [None]:
df.saledate[:1000]

In [None]:
fig, ax= plt.subplots()

ax.scatter(df.saledate[:1000], df.SalePrice[:1000])

In [None]:
df.head()

In [None]:
df.head().T

In [None]:
df.saledate.head(20)

# Sort DataFrame by saledate
When working with time series data, it's a good idea to sort it by date.

In [None]:
df.sort_values(by=["saledate"], inplace=True, ascending=True)
df.saledate.head(20)

# Make a copy of the original dataframe

We make a copy of the original dataframe so when we manipulate the copy, we've still got our original data.
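A small sketch of why the copy matters, on a hypothetical two-row DataFrame: changes made to the copy leave the original untouched.

```python
import pandas as pd

# Hypothetical miniature DataFrame
original = pd.DataFrame({"SalePrice": [66000, 57000]})

working_copy = original.copy()  # independent copy of the data
working_copy["SalePrice"] = working_copy["SalePrice"] * 2

print(original["SalePrice"].tolist())
print(working_copy["SalePrice"].tolist())
```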


In [None]:
df_temp= df.copy()

### Add datetime parameters for the `saledate` column

In [None]:
df_temp["saleYear"]= df_temp.saledate.dt.year
df_temp["saleMonth"]= df_temp.saledate.dt.month
df_temp["saleDay"]= df_temp.saledate.dt.day
df_temp["saleDayOfWeek"]= df_temp.saledate.dt.dayofweek
df_temp["saleDayOfYear"]= df_temp.saledate.dt.dayofyear

In [None]:
df_temp.head().T

In [None]:
# Now that we've enriched our DataFrame with datetime features, we can remove saledate

df_temp.drop("saledate", axis=1, inplace=True)

In [None]:
# check values of different dataset columns
df_temp.state.value_counts()

# 5. Modelling

We've done enough EDA (we could always do more), so let's start on some model-driven EDA.

# Convert string to categories
One way we can turn all of our data into numbers is by converting the string columns into pandas categories.

We can check the different datatypes compatible with pandas here: https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#data-types-related-functionality
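To make the idea concrete, here is a minimal sketch on a hypothetical column (the values are made up). It shows how a category column stores its labels once and refers to them by integer code.

```python
import pandas as pd

# Hypothetical column with one missing value
usage = pd.Series(["Low", "High", None, "Medium"])

as_category = usage.astype("category")

print(list(as_category.cat.categories))  # categories are sorted alphabetically
print(as_category.cat.codes.tolist())    # the missing value gets the code -1
```

Note that the missing entry is encoded as -1 rather than as a category of its own, which is something to handle before modelling.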

In [None]:
df_temp.head().T

In [None]:
pd.api.types.is_string_dtype(df_temp["UsageBand"])

In [None]:
# Find out which columns contain strings

for label, content in df_temp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

In [None]:
# This will turn all of the string values into category values

for label, content in df_temp.items():
    if pd.api.types.is_string_dtype(content):
        df_temp[label]= content.astype("category").cat.as_ordered()

In [None]:
df_temp.info()

In [None]:
df_temp.state.cat.categories

In [None]:
df_temp.state.cat.codes

# Thanks to pandas categories, we now have a way to access all of our data in the form of numbers. But we still have a bunch of missing data

In [None]:
# Check the fraction of null values per column
df_temp.isnull().sum()/len(df_temp)

## Save df_temp to a new csv 

In [None]:
#Export current tmp csv

#df_temp.to_csv("data/train_tmp.csv", index=False)

In [None]:
# Import preprocessed data

#df_temp= pd.read_csv("data/train_tmp.csv", low_memory=False)

In [None]:
df_temp.isna().sum()

## Fill missing values

### Fill numeric missing values first
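The cells below fill numeric gaps with the column median and record where the gaps were. A minimal sketch of that idea on a hypothetical four-row column (the values are made up; the column name is borrowed from the dataset only for flavour):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value
toy = pd.DataFrame({"MachineHoursCurrentMeter": [100.0, np.nan, 300.0, 5000.0]})

col = "MachineHoursCurrentMeter"
# Record which rows were missing before filling, so that information is not lost
toy[col + "_is_missing"] = toy[col].isna()
# The median (300.0 here) is less sensitive to the outlier 5000.0 than the mean
toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

The indicator column lets the model learn from the fact that a value was missing, which can itself be informative.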

In [None]:
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
# check for which numeric columns have null values

for label,content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
#Fill numeric rows with median

for label,content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Add a binary column which tells us if the data was missing or not
            df_temp[label+"_is_missing"]= pd.isnull(content)
            #Fill missing numeric values with median
            df_temp[label] = content.fillna(content.median())

In [None]:
# Check that no numeric column has null values left (this should print nothing)
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
df_temp.head().T

In [None]:
# List the non-numeric (categorical) columns whose missing values we still need to fill
for label, content in df_temp.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
# Turn categorical values into numbers and fill missing values.
# pandas encodes missing categories with the code -1. As we do not want any
# negative values in our data, we add 1 so that missing values become 0.
for label, content in df_temp.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Add a binary column to indicate whether the sample had a missing value
        df_temp[label+"_is_missing"] = pd.isnull(content)
        # Turn categories into numbers and add 1
        df_temp[label] = pd.Categorical(content).codes + 1

In [None]:
pd.Categorical(df_temp["state"]).codes +1


In [None]:
df_temp.info()

In [None]:
df_temp.head().T

In [None]:
df_temp.isna().sum()[:20]

# Now that all of our data is numeric and the DataFrame has no missing values, we should be able to build a machine learning model

In [None]:
%%time
# Instantiate model 

#model= RandomForestRegressor(n_jobs=-1,
#                            random_state=42)

#model.fit(df_temp.drop("SalePrice", axis=1), df_temp["SalePrice"])

In [None]:
#score the model
#model.score(df_temp.drop("SalePrice", axis=1), df_temp["SalePrice"])

**Question:** Why is the above score not reliable?
Because we scored the model on the same dataset we trained it on, so the score says nothing about how well the model generalises to unseen data.
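A small sketch of the problem, using synthetic noise rather than the bulldozer data (all names and numbers here are made up for illustration). The target is random, so there is nothing real to learn, yet the score on the training rows still looks good.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic data: the target is pure noise, so there is nothing real to learn
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)

X_fit, y_fit = X[:150], y[:150]    # rows the model sees
X_held, y_held = X[150:], y[150:]  # rows the model never sees

demo_model = RandomForestRegressor(n_estimators=20, random_state=42)
demo_model.fit(X_fit, y_fit)

train_r2 = demo_model.score(X_fit, y_fit)
held_out_r2 = demo_model.score(X_held, y_held)
print(train_r2, held_out_r2)
```

The gap between the two scores is the reason to always evaluate on data the model has not seen during training.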

In [None]:
## splitting data into train and validation sets
df_temp.saleYear.value_counts()

In [None]:
#Split data into training and validation
df_val=df_temp[df_temp.saleYear==2012]
df_train=df_temp[df_temp.saleYear!=2012]

len(df_val), len(df_train)

In [None]:
#Split data into X and y
X_train, y_train =df_train.drop("SalePrice", axis=1), df_train.SalePrice


In [None]:
X_valid, y_valid = df_val.drop("SalePrice", axis=1), df_val.SalePrice

In [None]:
X_train.shape , y_train.shape, X_valid.shape, y_valid.shape

### Building an evaluation function


In [None]:
#create an evaluation function so that we can use this functionality multiple times over different params
from sklearn.metrics import mean_absolute_error, mean_squared_log_error, r2_score

def rmsle(y_test, y_preds):
    """
    Calculates the root mean squared log error between predictions and true labels.
    """
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

# Create a function to evaluate the model with a few different metrics

def show_scores(model):
    train_preds= model.predict(X_train)
    val_preds = model.predict(X_valid)
    scores={"Training MAE": mean_absolute_error(y_train, train_preds),
           "Valid MAE": mean_absolute_error(y_valid, val_preds),
           "Training RMSLE": rmsle(y_train, train_preds),
           "Valid RMSLE": rmsle(y_valid, val_preds),
           "Training R2": r2_score(y_train, train_preds),
           "Valid R2": r2_score(y_valid, val_preds)}
    return scores

## Testing our model on a subset (to tune hyperparams)

In [None]:
#this takes far too long for experimenting
#%%time
#model= RandomForestRegressor(n_jobs=-1,
#                            random_state=42)

#model.fit(X_train, y_train)

# Change max samples value

In [None]:
model= RandomForestRegressor(n_jobs=-1, random_state=42,
                            max_samples=10000)



In [None]:
%%time
# Cut down max_samples to see how much it improves training time
model.fit(X_train, y_train)

In [None]:
%%time
show_scores(model)

In [None]:
# Hyperparameter tuning with RandomizedSearchCV

In [None]:
%%time
from sklearn.model_selection import RandomizedSearchCV

# Different RandomForestRegressor hyperparameters
rf_grid = {"n_estimators": np.arange(10, 100, 10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 10, 2),
           "min_samples_leaf": np.arange(1, 20, 2),
           # "auto" was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
           "max_features": [0.5, 1, "sqrt", 1.0],
           "max_samples": [10000]}

# Instantiate the RandomizedSearchCV model
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                    random_state=42),
                              param_distributions=rf_grid,
                              n_iter=5,
                              cv=5,
                              verbose=True)

# Fit the RandomizedSearchCV model
rs_model.fit(X_train, y_train)

In [None]:
#Finding the best model params
rs_model.best_params_

In [None]:
#Evaluate the randomized search models(only trained on 10000 examples)
show_scores(rs_model)

## Train a model with the best hyperparameters
**Note:** These were found after 100 iterations of RandomizedSearchCV.

In [None]:
%%time
# Most ideal parameters:
ideal_model=RandomForestRegressor(n_estimators=40,
                                 min_samples_leaf=1,
                                 min_samples_split=14,
                                 max_features=0.5,
                                 n_jobs=-1,
                                 max_samples=None,
                                 random_state=42)

ideal_model.fit(X_train, y_train)

In [None]:
show_scores(ideal_model)

# The 'Valid RMSLE' value is the one we are looking for. It is around 0.245, which would put our model in the top 30 of the submissions

# Make preds on test data

In [None]:
#import test data
df_test = pd.read_csv("../input/bluebook-for-bulldozers/Test.csv", low_memory=False, parse_dates=["saledate"])
df_test.shape

Calling `ideal_model.predict(df_test)` at this point will not work, because the test data has not yet been manipulated, filtered or cleaned the way the training data was.

## Preprocessing the data (getting the test dataset into the same format as our training dataset)

In [None]:
def preprocess_data(df):
    """
    Performs transformations on df and returns transformed df.
    """
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayOfWeek"] = df.saledate.dt.dayofweek
    df["saleDayOfYear"] = df.saledate.dt.dayofyear
    
    df.drop("saledate", axis=1, inplace=True)
    
    # Fill the numeric rows with median
    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                # Add a binary column which tells us if the data was missing or not
                df[label+"_is_missing"] = pd.isnull(content)
                # Fill missing numeric values with median
                df[label] = content.fillna(content.median())
    
        # Fill categorical missing data and turn categories into numbers
        if not pd.api.types.is_numeric_dtype(content):
            df[label+"_is_missing"] = pd.isnull(content)
            # We add +1 to the category code because pandas encodes missing categories as -1
            df[label] = pd.Categorical(content).codes+1
    
    return df

In [None]:
#Process test data
df_test= preprocess_data(df_test)
df_test.head()

# we can find how the columns differ using sets
set(X_train.columns) - set(df_test.columns)

In [None]:
df_test.head()

In [None]:
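# Manually adjust df_test to have the auctioneerID_is_missing column

df_test["auctioneerID_is_missing"] = False
# Match the column order used in training (scikit-learn checks feature names and their order)
df_test = df_test[X_train.columns]
df_test.head()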

Finally, now that our test DataFrame has the same features as our training DataFrame, we can make predictions.
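One way to guard against this kind of mismatch in general is to align the test columns to the training columns in a single step. The sketch below does this on hypothetical miniature frames (the rows are made up; only the column names mirror the notebook). It is an illustration, not a replacement for checking why a column is absent.

```python
import pandas as pd

# Hypothetical miniature frames; column names mirror the notebook
train_features = pd.DataFrame({"SalesID": [1, 2],
                               "auctioneerID": [3.0, 1.0],
                               "auctioneerID_is_missing": [False, True]})
test_features = pd.DataFrame({"auctioneerID": [2.0], "SalesID": [10]})

# Columns present in training but absent from the test set
missing_cols = set(train_features.columns) - set(test_features.columns)
print(missing_cols)

# Add the absent columns filled with False and match the training column order
aligned = test_features.reindex(columns=train_features.columns, fill_value=False)
print(list(aligned.columns))
```

Filling with `False` is only sensible for `_is_missing` indicator columns. Any other absent column points to a preprocessing difference that should be fixed at its source.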

In [None]:
test_preds= ideal_model.predict(df_test)

In [None]:
len(test_preds)

# Format predictions into the format Kaggle asks for


In [None]:
df_preds= pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalePrice"] = test_preds
df_preds

In [None]:
#Save the predictions as per the competition format and check out the results
#df_preds.to_csv("data/bluebook_for_bulldozer_test_predictions.csv", index=False)

## Feature Importance

Which attributes of the data were most important when it comes to predicting the target variable (SalePrice)?

In [None]:
len(ideal_model.feature_importances_)

In [None]:
len(X_train.columns)

In [None]:
# Helper function for plotting feature importance


def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                        "feature_importances": importances})
          .sort_values("feature_importances", ascending=False)
          .reset_index(drop=True))
    
    # Plot the dataframe
    fig, ax = plt.subplots()
    ax.barh(df["features"][:n], df["feature_importances"][:n])
    ax.set_ylabel("Features")
    ax.set_xlabel("Feature importance")
    ax.invert_yaxis()

In [None]:
plot_features(X_train.columns, ideal_model.feature_importances_)


**Question to finish: Why is knowing the feature importances of a trained machine learning model helpful?**

Final challenge/extension: What other machine learning models could you try on our dataset? Hint: check out the regression section of the map at https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html, or look at something like CatBoost.ai or XGBoost.ai.