# 🚜 Predicting the Sale Price of Bulldozers using Machine Learning

## Problem Definition

For this dataset, the problem we're trying to solve, or rather the question we're trying to answer, is:

How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have sold for?

## Data

Looking at the dataset from Kaggle, you can see it's a time series problem. This means there's a time attribute to the dataset.

In this case, it's historical sales data of bulldozers, including things like model type, size, sale date and more.

There are 3 datasets:

- `Train.csv` - Historical bulldozer sales examples up to 2011 (close to 400,000 examples with 50+ different attributes, including `SalePrice`, which is the target variable).
- `Valid.csv` - Historical bulldozer sales examples from January 1 2012 to April 30 2012 (close to 12,000 examples with the same attributes as `Train.csv`).
- `Test.csv` - Historical bulldozer sales examples from May 1 2012 to November 2012 (close to 12,000 examples but missing the `SalePrice` attribute, as this is what we'll be trying to predict).
    
## Evaluation

For this problem, Kaggle has set the evaluation metric to root mean squared log error (RMSLE). As with many regression metrics, the goal is to get this value as low as possible.

To see how well our model is doing, we'll calculate the RMSLE and then compare our results to others on the Kaggle leaderboard.
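
To make the metric concrete, here is a minimal sketch of RMSLE written out by hand with only the standard library. The function name `rmsle_by_hand` and the prices are made up for illustration and are not taken from the dataset.

```python
import math

def rmsle_by_hand(y_true, y_pred):
    """Root mean squared log error, written out term by term."""
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Illustrative prices only (assumed values, not from the bulldozer data)
actual = [10_000, 50_000]
too_high = [20_000, 100_000]  # every prediction is double the true price
too_low = [5_000, 25_000]     # every prediction is half the true price

print(rmsle_by_hand(actual, too_high))
print(rmsle_by_hand(actual, too_low))
```

Both calls give a value close to ln(2), roughly 0.693, even though the absolute errors differ a lot between the cheap and the expensive machine. RMSLE works on the ratio between predicted and actual price, so being off by a factor of two costs about the same at any price level.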

## Features
Features are different parts of the data. During this step, you'll want to start finding out what you can about the data.

One of the most common ways to do this is to create a data dictionary.

For this dataset, Kaggle provides a data dictionary which contains information about what each attribute of the dataset means. You can download this file directly from the Kaggle competition page (account required) or view it on Google Sheets.

With all of this being known, let's get started!

First, we'll import the dataset and start exploring. Since we know the evaluation metric we're trying to minimise, our first goal will be building a baseline model and seeing how it stacks up against the competition.

In [None]:
## Importing the data and preparing it for modelling

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from sklearn.ensemble import RandomForestRegressor

In [None]:
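df = pd.read_csv("../input/bluebook-for-bulldozers/TrainAndValid.csv",
                 low_memory=False,
                 # error_bad_lines was removed in pandas 2.0; on_bad_lines="skip" (pandas >= 1.3) also skips malformed rows
                 on_bad_lines="skip")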

In [None]:
df.shape

In [None]:
df.info()

There are a total of 52 columns here, and `SalePrice` is the target column.

In [None]:
fig , ax = plt.subplots(figsize = (20,10))
ax.scatter(df['saledate'][:1000],df['SalePrice'][:1000])

In [None]:
df.SalePrice.plot.hist()

### Parsing dates

In the dataframe, the `saledate` column has the object dtype, so we have to convert it to a datetime type.

We can do this with the `parse_dates` parameter of the `read_csv` function.

In [None]:
df = pd.read_csv('../input/bluebook-for-bulldozers/TrainAndValid.csv', 
                 low_memory=False,
                 parse_dates=['saledate']
                )

df.info()

In [None]:
fig , ax = plt.subplots()
ax.scatter(df["saledate"][:1000], df["SalePrice"][:1000])

In [None]:
df.head().T

In [None]:
df.saledate.head(10)

### Sorting the Dataframe by saledate

Since we are working on a time series problem, the data should be in chronological order. We can do that by sorting on the `saledate` column.

In [None]:
df.sort_values(by=['saledate'],ascending = True,inplace = True)
df.saledate.head(10)

### Making a copy of the original data

In [None]:
df_tmp = df.copy()

### Adding datetime parameters separately for the saledate column

Why?

So we can enrich our dataset with as much information as possible.

Because we imported the data using `read_csv()` and asked pandas to parse the dates using `parse_dates=["saledate"]`, we can now access the different datetime attributes of the `saledate` column.

In [None]:
df_tmp['saleyear'] = df_tmp.saledate.dt.year
df_tmp['salemonth'] = df_tmp.saledate.dt.month
df_tmp['saleday'] = df_tmp.saledate.dt.day
df_tmp['saledayofweek'] = df_tmp.saledate.dt.dayofweek
df_tmp['saledayofyear'] = df_tmp.saledate.dt.dayofyear

#dropping the original saledate column
df_tmp.drop('saledate',axis=1,inplace = True)

In [None]:
df_tmp.head().T

## Converting strings to categories

One way to help turn all of our data into numbers is to convert the columns with the string datatype into a category datatype.

Why categorical?

**Under the hood, pandas stores each category as an integer code.**

To do this we can use the pandas API types, which let us inspect and work with the data types of our columns.

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#dtype-introspection 


In [None]:
df_tmp.dtypes

In [None]:
# Check whether a column has a string dtype

pd.api.types.is_string_dtype(df_tmp['UsageBand'])

In [None]:
# These columns contain strings

for label , content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

In [None]:
# Convert all the string columns to ordered categoricals

for label , content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        df_tmp[label] = content.astype('category').cat.as_ordered()

In [None]:
df_tmp.info()

In [None]:
df_tmp.state.cat.categories

**Pandas assigns each category an integer code, which we can view with `.cat.codes`. Missing values are assigned -1.**

In [None]:
df_tmp.state.cat.codes
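
A small sketch of what the categories and codes look like on a toy column may help here. The `usage` series below is invented for illustration and only mimics a column such as `UsageBand`.

```python
import pandas as pd

# Toy column with one missing value (assumed data, not from the bulldozer set)
usage = pd.Series(["Low", "High", None, "Medium", "Low"])
usage_cat = usage.astype("category").cat.as_ordered()

# Categories are sorted alphabetically
print(list(usage_cat.cat.categories))

# One integer code per row; the missing value gets -1
print(list(usage_cat.cat.codes))
```

Note that the ordering pandas picks is alphabetical, so "High" sorts before "Low" here. The codes are just labels turned into integers and do not reflect any real-world ordering unless you set the categories yourself.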

## Handling the missing values

Our data seems to have plenty of missing values.

In [None]:
df_tmp.isna().sum()

### Filling the numerical missing values
 
We are going to fill all the missing numerical values with the column `median`.

In [None]:
for label , content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)

These are the numerical columns present.

In [None]:
for label , content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

These are the numerical columns that have missing values.

In [None]:
for label , content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            #Adding a binary column which tells if the data is missing or not
            df_tmp[label+'_is_missing'] = pd.isnull(content)
            #Filling the numeric place with the median value
            df_tmp[label] = content.fillna(content.median())

Why add a binary column indicating whether the data was missing or not?

We can easily fill all of the missing numeric values in our dataset with the median. However, a numeric value may be missing for a reason. In other words, absence of evidence may be evidence of absence. Adding a binary column which indicates whether the value was missing or not helps to retain this information.
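
The same idea on a toy frame, as a minimal sketch. The `hours` column and its values are assumptions made up for this example.

```python
import pandas as pd

# Toy numeric column with two missing entries (assumed data)
toy = pd.DataFrame({"hours": [100.0, None, 300.0, None, 500.0]})

# Record which rows were missing BEFORE filling, otherwise the information is lost
toy["hours_is_missing"] = toy["hours"].isna()

# The median ignores missing values, so it is computed from 100, 300 and 500
toy["hours"] = toy["hours"].fillna(toy["hours"].median())

print(toy)
```

The order of the two steps matters: once the column has been filled, there is no way to tell the imputed 300s apart from the genuine one, which is exactly what the indicator column preserves.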

### Filling and turning categorical to numbers

In [None]:
for label , content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Add binary column to indicate whether sample had missing value
        df_tmp[label+"_is_missing"] = pd.isnull(content)
        # We add the +1 because pandas encodes missing categories as -1
        df_tmp[label] = pd.Categorical(content).codes+1

**We have now filled all the missing values and converted the categorical variables to numbers using the category codes generated by pandas.**

In [None]:
df_tmp.info()

In [None]:
len(df_tmp)

In [None]:
df_tmp.shape

Now there are 103 columns in the training dataframe including the target column.

## Model building

### Training and validation set splitting

There is a separate `Test.csv` file whose sale prices we have to predict. To tune the model's hyperparameters and improve the score before that, we split the data into training and validation sets.

In [None]:
df_tmp.saleyear

The sale years run from 1989 to 2012, so we train our model on 1989 to 2011 and the validation set consists of all the rows with a 2012 sale year.

In [None]:
df_tmp.saleyear.value_counts()

In [None]:
df_val = df_tmp[df_tmp.saleyear == 2012]
df_train = df_tmp[df_tmp.saleyear != 2012]

len(df_train) , len(df_val)

In [None]:
X_train , y_train = df_train.drop('SalePrice',axis = 1),df_train['SalePrice']
X_val , y_val = df_val.drop('SalePrice',axis = 1),df_val['SalePrice']

X_train.shape , y_train.shape , X_val.shape , y_val.shape

### Building evaluation function

We know the evaluation metric for this problem is RMSLE (root mean squared log error).

In [None]:
from sklearn.metrics import mean_squared_log_error , mean_absolute_error,r2_score


# Function to return the RMSLE

def rmsle(y_true, y_preds):
    """
    Calculates root mean squared log error for given y_true and y_preds
    """
    return np.sqrt(mean_squared_log_error(y_true, y_preds))


#Function to evaluate model on different metrics

def show_scores(model):
    
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_val)
    
    scores = {'Training MAE':mean_absolute_error(y_train,train_preds),
              'Validation MAE':mean_absolute_error(y_val,val_preds),
              'Training RMSLE':rmsle(y_train,train_preds),
              'Validation RMSLE':rmsle(y_val,val_preds),
              'Training R2':r2_score(y_train,train_preds),
              'Validation R2':r2_score(y_val,val_preds)
             }
    

    return scores


### Training the model

In [None]:
len(X_train)

There is a large number of samples in the training set, so we set `max_samples=10000`. Each tree is then trained on only 10,000 samples drawn from the training set, which reduces the training time.

**We use these 10,000-sample fits to tune the hyperparameters, then train on the whole dataset using the best params.**

In [None]:
%%time

model = RandomForestRegressor(random_state=42,
                              max_samples=10000)

model.fit(X_train,y_train)

In [None]:
show_scores(model)

### RandomizedSearchCV for Hyperparameter tuning

In [None]:
%%time

from sklearn.model_selection import RandomizedSearchCV

rf_grid = {'n_estimators':np.arange(10,100,10),
           'max_depth':[None,3,5,10],
           'min_samples_split': np.arange(2,20,2),
           'min_samples_leaf': np.arange(1,20,2),
           'max_features': [0.5,1,'sqrt',1.0],  # 1.0 = all features; 'auto' (same meaning here) was removed in scikit-learn 1.3
           'max_samples' : [10000]
          }

rs_model = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                             param_distributions=rf_grid,
                             n_iter=10,
                             cv = 5,
                             verbose=True
                             )

rs_model.fit(X_train,y_train)

In [None]:
show_scores(rs_model)

In [None]:
rs_model.best_params_
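
Rather than copying the best parameters by hand, they can be unpacked straight into a new estimator. The following is a minimal sketch on synthetic data. The names `X_toy`, `y_toy`, `toy_grid` and `toy_search` are hypothetical, and the grid is much smaller than the one used above so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (assumed, only for illustration)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = X_toy[:, 0] * 3 + rng.normal(scale=0.1, size=200)

toy_grid = {"n_estimators": [10, 20],
            "max_depth": [None, 3, 5],
            "min_samples_leaf": [1, 3],
            "max_samples": [50]}  # small per-tree sample, used only to speed up the search

toy_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                param_distributions=toy_grid,
                                n_iter=3, cv=3, random_state=42)
toy_search.fit(X_toy, y_toy)

# Drop max_samples so the final model can draw from every training row
final_params = {name: value for name, value in toy_search.best_params_.items()
                if name != "max_samples"}

final_model = RandomForestRegressor(random_state=42, **final_params)
final_model.fit(X_toy, y_toy)
print(final_params)
```

Unpacking `best_params_` avoids transcription mistakes, and filtering out `max_samples` reflects the plan of tuning on a subsample and then fitting on all of the data.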

### Training the model with the best params

In [None]:
%%time

ideal_model = RandomForestRegressor( n_estimators= 60,
                                     min_samples_split= 10,
                                     min_samples_leaf= 1,
                                     max_features= 1.0,  # all features; equivalent to the former 'auto', removed in scikit-learn 1.3
                                     max_depth= 10,
                                    random_state = 42)

ideal_model.fit(X_train,y_train)

In [None]:
show_scores(ideal_model)

## Prediction on test data

In [None]:
df_test = pd.read_csv('../input/bluebook-for-bulldozers/Test.csv',
                      low_memory=False,
                      parse_dates=['saledate']
                     )
df_test.head()

In [None]:
df_test.isna().sum()

In [None]:
df_test.dtypes

### Preprocess the Test data

In [None]:
def preprocess_data(df):
    """
    Performs the transformations on the data and returns it
    """
    df['saleyear'] = df.saledate.dt.year
    df['salemonth'] = df.saledate.dt.month
    df['saleday'] = df.saledate.dt.day
    df['saledayofweek'] = df.saledate.dt.dayofweek
    df['saledayofyear'] = df.saledate.dt.dayofyear
    
    df.drop('saledate',axis=1,inplace = True)
    
    #Filling the numeric rows with median    
    for label , content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                df[label+'_is_missing'] = pd.isnull(content)
                
                df[label] = content.fillna(content.median())
                
        #Filling the categorical missing data and turn categories into numeric
        if not pd.api.types.is_numeric_dtype(content):
            df[label+'_is_missing'] = pd.isnull(content)
            
            df[label] = pd.Categorical(content).codes+1
            
    return df

In [None]:
df_test = preprocess_data(df_test)
df_test.head()

**Note:** After preprocessing, the test data has one column fewer than the training data.

In [None]:
X_train.shape , df_test.shape

In [None]:
set(X_train.columns) - set(df_test.columns)

We can see that the `auctioneerID_is_missing` column is not present in the test dataset, so we add that column manually.

In [None]:
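df_test['auctioneerID_is_missing'] = False

# Match the column order used when fitting (recent scikit-learn versions check feature names and their order)
df_test = df_test[X_train.columns]

df_test.shape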
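
A more general way to line up the test columns with the training columns is `DataFrame.reindex`. This is a minimal sketch on toy data. The column list `train_cols` and the frame `toy_test` are stand-ins for `X_train.columns` and `df_test`, and filling with `False` assumes the only missing columns are `_is_missing` flags.

```python
import pandas as pd

# Stand-in for X_train.columns (assumed, shortened for illustration)
train_cols = ["SalesID", "auctioneerID", "auctioneerID_is_missing"]

# Stand-in for the preprocessed test data: one column missing, different order
toy_test = pd.DataFrame({"auctioneerID": [1.0, 2.0], "SalesID": [10, 11]})

# Add any missing columns (filled with False) and put them in training order
aligned = toy_test.reindex(columns=train_cols, fill_value=False)

print(aligned)
```

This handles both problems at once: columns absent from the test set are created, and the result follows the training column order. It would also silently drop any extra test-only columns, so it is worth checking the set difference in both directions first.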

### Making predictions with our ideal_model

In [None]:
test_preds = ideal_model.predict(df_test)

In [None]:
test_preds

In [None]:
# Format predictions into the same format Kaggle is after
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"]
df_preds["SalePrice"] = test_preds
df_preds

### Creating submission file for kaggle

In [None]:
df_preds.to_csv('submission1.csv',index=False)

### Feature Importances

In [None]:
ideal_model.feature_importances_

In [None]:
# Function for plotting feature importances
def plot_features(columns , importances , n=20):
    """
    Plots the n most important features for the model's predictions
    """
    df = (pd.DataFrame({'features':columns,
                        'feature_importances':importances})
                     .sort_values('feature_importances',ascending=False)
                     .reset_index(drop=True))
    
    fig,ax = plt.subplots()
    ax.barh(df['features'][:n] , df['feature_importances'][:n])
    ax.set_ylabel('Features')
    ax.set_xlabel('Feature importance')
    ax.invert_yaxis()

In [None]:
plot_features(X_train.columns , ideal_model.feature_importances_)

**These are the features that contributed most to the model's predictions.**