# 🎯 Predicting the Sale Price of Bulldozers with Machine Learning
![](https://media.giphy.com/media/jVFgWDtkY5v2M/giphy.gif?cid=790b7611shywkymfgi4qzy76gk4gfjc64jxufqai5r87v1a0&ep=v1_gifs_search&rid=giphy.gif&ct=g)

This notebook works through a machine learning project whose goal is to predict the sale price of bulldozers (a Kaggle competition).

Since we're trying to predict a continuous value (a number), this kind of problem is known as a **regression problem**.

The data and the evaluation metric we'll be using (root mean squared log error, or RMSLE) are from the [Kaggle Bluebook for Bulldozers competition](https://www.kaggle.com/c/bluebook-for-bulldozers/overview).

## What we'll end up with

Since we already have a dataset, we'll approach the problem with the following machine learning modelling framework.

| <img src="./images/6-step-ml-framework.png" width=500/> | 
|:--:| 
| 6 Step Machine Learning Modelling Framework |

To work through these topics, we'll use pandas, Matplotlib, Plotly and NumPy for data analysis, as well as Scikit-Learn for machine learning and modelling tasks.

| <img src="./images/tools-used.png" width=500/> | 
|:--:| 
| Tools which can be used for each step of the machine learning modelling process. |

We'll work through each step and by the end of the notebook, we'll have a trained machine learning model which predicts the sale price of a bulldozer given different characteristics about it.

## 1. Problem Definition

The problem here is that we have to predict the sale price of bulldozers!

> We have to create a machine learning model that can predict the sale price of a bulldozer, given the characteristics of bulldozers sold in the past!

Clearly, this is a regression problem: we're predicting a continuous value (price).

## 2. Data Available

Looking at the data on [Kaggle](https://www.kaggle.com/c/bluebook-for-bulldozers/data), we can treat this as a time series problem, because the dataset has a time attribute (the sale date). It also includes features such as size, manufacture date, model type, etc.

There are 3 main datasets:
1. **Train.csv** - Historical bulldozer sales examples up to 2011 (close to 400,000 examples with 50+ different attributes, including `SalePrice` which is the **target variable**).
2. **Valid.csv** - Historical bulldozer sales examples from January 1 2012 to April 30 2012 (close to 12,000 examples with the same attributes as **Train.csv**).
3. **Test.csv** - Historical bulldozer sales examples from May 1 2012 to November 2012 (close to 12,000 examples but missing the `SalePrice` attribute, as this is what we'll be trying to predict).
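
Because the three files are separated by sale date, any split we make ourselves should respect time order as well. Here is a minimal sketch of a date-based split in pandas. The `sales` frame and its values are made up for illustration (the real files have many more columns), and the names `train_part` and `valid_part` are hypothetical.

```python
import pandas as pd

# Toy rows standing in for the real sales data (all values are made up)
sales = pd.DataFrame({
    "saledate": ["2006-11-16", "2004-03-26", "2012-02-01", "2012-04-15"],
    "SalePrice": [66000, 57000, 31000, 45500],
})

# Parse the date strings and put the rows in chronological order
sales["saledate"] = pd.to_datetime(sales["saledate"])
sales = sales.sort_values("saledate").reset_index(drop=True)

# Everything up to the end of 2011 is training data, 2012 is held out
train_part = sales[sales["saledate"].dt.year <= 2011]
valid_part = sales[sales["saledate"].dt.year == 2012]

print(len(train_part), len(valid_part))
```

Splitting on the date rather than at random keeps the model from being trained on sales that happened after the ones it is validated on.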

## 3. Evaluation

For this problem, [Kaggle has set the evaluation metric to be root mean squared log error (RMSLE)](https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation). As with many regression metrics, the goal is to get this value as low as possible.

To see how well our model is doing, we'll calculate the RMSLE and then compare our results to others on the [Kaggle leaderboard](https://www.kaggle.com/c/bluebook-for-bulldozers/leaderboard).
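
As a rough illustration of the metric, RMSLE takes the square root of the mean squared difference between the logarithms of the predicted and actual prices, so it penalises relative errors rather than absolute ones. The sketch below assumes the common $\log(1 + x)$ form, which is also what Scikit-Learn's `mean_squared_log_error` uses. The function name `rmsle` and the example prices are illustrative, not taken from the competition data.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_pred):
    """Root mean squared log error, using the log(1 + x) convention."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up prices, only to exercise the function
y_true = [10000, 25000, 60000]
y_pred = [12000, 24000, 50000]

score = rmsle(y_true, y_pred)

# Cross-check against Scikit-Learn's mean squared log error
sklearn_score = float(np.sqrt(mean_squared_log_error(y_true, y_pred)))

print(score, sklearn_score)
```

Because the error is measured on the log scale, being off by 2,000 on a 10,000 machine counts for more than being off by 2,000 on a 60,000 machine.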

## 4. Features

Features are different parts of the data that will help us to predict the target (price in this case).

We have to explore our dataset to fully understand what each feature means!

One of the most common ways to do this is to create a **data dictionary**.

For this dataset, [Kaggle provides a data dictionary](https://www.kaggle.com/c/bluebook-for-bulldozers/data) which contains information about what each attribute of the dataset means!

Let's see it here! (You can also refer to this [Google Sheet](https://docs.google.com/spreadsheets/d/1Im3Yq6ez5Yzz7lTfW66PThq1Se__a4P2r1apFkj2asc/edit?usp=sharing).)

In [27]:
import pandas as pd

data_dictionary = pd.read_csv("./data/bluebook-for-bulldozers/Data Dictionary.csv")
data_dictionary.head() # Show the first rows; explore further to understand each attribute!
# Open it in the variable viewer to read the full descriptions!

Unnamed: 0,Variable,Description
0,SalesID,unique identifier of a particular sale of a ...
1,MachineID,identifier for a particular machine; machin...
2,ModelID,identifier for a unique machine model (i.e. ...
3,datasource,source of the sale record; some sources are...
4,auctioneerID,"identifier of a particular auctioneer, i.e. ..."


Now that we know what the features mean, let's start!

Let's explore our data further. Since we have an evaluation metric to minimise, we'll first try to create a baseline model to see how it scores against that metric!

For that, let's perform -

### Exploratory Data Analysis (EDA)

Here, we'll try to understand and interpret our data as much as possible: feature types, how features relate to each other, missing data, outliers, etc.!

We'll only modify the data where it helps exploration and visualization! All the preprocessing comes after exploration, and along the way we'll keep track of *what to preprocess*!
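
One way to keep track of *what to preprocess* is a per-column summary of types and missing values. The helper below is a minimal sketch: `summarise_columns` is a hypothetical name, and the `toy` frame only borrows a few column names from the dataset, with made-up values.

```python
import numpy as np
import pandas as pd

def summarise_columns(df):
    """Return dtype, missing count and missing percentage for each column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    # Columns with the most missing data come first
    return summary.sort_values("pct_missing", ascending=False)

# Toy frame with made-up values, just to show the shape of the report
toy = pd.DataFrame({
    "YearMade": [2004, 1996, 2001, 2007],
    "UsageBand": ["Low", None, None, "High"],
    "MachineHoursCurrentMeter": [68.0, np.nan, 2838.0, 3486.0],
})

report = summarise_columns(toy)
print(report)
```

Once the real data is loaded, the same helper can be called on that DataFrame to list the columns that will need imputation or encoding.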

#### But first, let's load and view our data!

> ***Note:*** Here, we're using [`TrainAndValid.csv`](./data/bluebook-for-bulldozers/TrainAndValid.csv) to explore the dataset as a whole! But we'll use only [`Train.csv`](./data/bluebook-for-bulldozers/Train.csv) to fit the `preprocessors`, and then apply the same fitted preprocessors to new data, i.e. the valid and test sets (to avoid data leakage!)
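
The fit-on-train-only idea from the note can be sketched with a single preprocessor. This example assumes Scikit-Learn's `SimpleImputer` with a median strategy, which is only one possible choice and not necessarily the preprocessing used later in this notebook. The `train` and `valid` frames hold made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and validation values for one numeric column
train = pd.DataFrame({"MachineHoursCurrentMeter": [100.0, np.nan, 300.0, 500.0]})
valid = pd.DataFrame({"MachineHoursCurrentMeter": [np.nan, 10000.0]})

imputer = SimpleImputer(strategy="median")

# The median is learned from the training data only
train_filled = imputer.fit_transform(train)

# The validation set reuses that median; the imputer is not re-fitted
valid_filled = imputer.transform(valid)

print(imputer.statistics_, valid_filled.ravel())
```

If the imputer were fitted on the combined data, the large validation value would shift the median, which is exactly the kind of leakage the note warns about.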

In [28]:
# Import the tools for EDA (Analysis and visualization)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px # Will use plotly for more interactive exploration!

In [32]:
# Load the data
bdozer_df = pd.read_csv("./data/bluebook-for-bulldozers/Train.csv", low_memory=False)

In [33]:
# Let's get `Information` about our data
bdozer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401125 entries, 0 to 401124
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   401125 non-null  int64  
 1   SalePrice                 401125 non-null  int64  
 2   MachineID                 401125 non-null  int64  
 3   ModelID                   401125 non-null  int64  
 4   datasource                401125 non-null  int64  
 5   auctioneerID              380989 non-null  float64
 6   YearMade                  401125 non-null  int64  
 7   MachineHoursCurrentMeter  142765 non-null  float64
 8   UsageBand                 69639 non-null   object 
 9   saledate                  401125 non-null  object 
 10  fiModelDesc               401125 non-null  object 
 11  fiBaseModel               401125 non-null  object 
 12  fiSecondaryDesc           263934 non-null  object 
 13  fiModelSeries             56908 non-null   o