# Bulldozer sale price prediction 
In this notebook, we're going to go through an example machine learning project with the goal of predicting the sale price of bulldozers.
#### 1. Problem Definition
> How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have sold for?
#### 2. Data
A brief overview of the data:

* Train.csv is the training set, which contains data through the end of 2011.
* Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
* Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

#### 3. Evaluation
* The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

* NOTE - The goal for most regression evaluation metrics is to minimize the error. For example, my goal for this project will be to build a machine learning model which minimizes RMSLE.
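
To make the metric concrete, here is a minimal sketch of RMSLE computed by hand with NumPy. The function name `rmsle_manual` and the prices are made up for illustration (they are not taken from the dataset). The formula assumed is the square root of the mean of `(log(1 + predicted) - log(1 + actual))**2`.

```python
import numpy as np

def rmsle_manual(y_true, y_pred):
    """Root mean squared log error, written out step by step."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x); the error is measured on the log scale
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Hypothetical prices: every prediction is 10% too high
actual = [10000, 20000, 40000]
over_by_10pct = [11000, 22000, 44000]

print(rmsle_manual(actual, actual))         # perfect predictions
print(rmsle_manual(actual, over_by_10pct))  # roughly log(1.1)
```

Because the error is taken on the log scale, being 10% off on a cheap machine costs about the same as being 10% off on an expensive one.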
#### 4. Features 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Importing training and validation sets
df = pd.read_csv('data/TrainAndValid.csv', low_memory=False)
len(df)

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.columns

In [None]:
fig, ax = plt.subplots()
ax.scatter(df['saledate'][:1000], df['SalePrice'][:1000]);

In [None]:
df['SalePrice'].hist()

### Parsing Dates
When we work with time series data, we want to enrich the time and date component as much as possible. 

We can do that by telling pandas which of our columns contain dates, using the `parse_dates` parameter.
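
As a small illustration of what `parse_dates` changes, the sketch below reads the same tiny CSV twice. The CSV text is made up for the example and only borrows the `saledate` and `SalePrice` column names from this notebook.

```python
import io
import pandas as pd

# Hypothetical two-row CSV, held in memory instead of on disk
csv_text = "saledate,SalePrice\n2006-11-16,66000\n2004-03-26,57000\n"

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(raw["saledate"].dtype)     # plain text, not a datetime
print(parsed["saledate"].dtype)  # a datetime64 dtype
# Only the parsed column supports the .dt accessor
print(parsed["saledate"].dt.year.tolist())
```

Without `parse_dates` the dates stay as text, so they cannot be sorted chronologically or split into year, month, and day parts.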

In [None]:
# Import data again but this time parse dates
df = pd.read_csv('data/TrainAndValid.csv', parse_dates=['saledate'], low_memory=False)
df.saledate

In [None]:
df.saledate.dtype

In [None]:
fig, ax = plt.subplots()
ax.scatter(df['saledate'][:1000], df['SalePrice'][:1000])

In [None]:
df.head().T

### Sorting Dataframe by saledate 
When working with time series data, it's a good idea to sort it by date.

In [None]:
# Sort dataframe in date order
df.sort_values('saledate', inplace=True)
df.saledate.head()

### Make a copy of the original dataFrame
We make a copy of the original dataframe so that when we manipulate the copy, we still have our original data.

In [None]:
df_temp = df.copy() 

In [None]:
df_temp.saledate

### Adding datetime features from the `saledate` column

In [None]:
df_temp['saleYear'] = df_temp.saledate.dt.year
df_temp['saleMonth'] = df_temp.saledate.dt.month
df_temp['saleDay'] = df_temp.saledate.dt.day
df_temp['saleDayOfWeek'] = df_temp.saledate.dt.dayofweek
df_temp['saleDayOfYear'] = df_temp.saledate.dt.dayofyear 

In [None]:
df_temp.head().T

In [None]:
# Now that we've enriched our dataframe with datetime features, we can remove the saledate column
df_temp.drop('saledate', axis=1, inplace=True)

In [None]:
# Checking values of different columns
df_temp.state.value_counts()

### Converting Strings into categories
One way we can turn all of our data into numbers is by converting the string columns into pandas categories.
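
The sketch below shows what happens under the hood on a toy column. The dataframe `toy` and its values are made up for illustration; only the `state` column name comes from this dataset.

```python
import pandas as pd

# Hypothetical values for a single string column
toy = pd.DataFrame({"state": ["Texas", "Alabama", "Texas", "Florida"]})
toy["state"] = toy["state"].astype("category").cat.as_ordered()

# The column still displays as strings...
print(toy["state"].tolist())
# ...but pandas stores one sorted list of categories
print(toy["state"].cat.categories.tolist())
# ...plus an integer code per row pointing into that list
print(toy["state"].cat.codes.tolist())
```

Those integer codes are what later lets us hand the column to a model that only accepts numbers.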

In [None]:
pd.api.types.is_string_dtype(df_temp['UsageBand'])

In [None]:
for label, content in df_temp.items():
    if pd.api.types.is_string_dtype(content): 
        print(label)

In [None]:
# Changing strings into categories
for label, content in df_temp.items():
    if pd.api.types.is_string_dtype(content):
        df_temp[label] = content.astype('category').cat.as_ordered()

In [None]:
df_temp.info()

In [None]:
df_temp['state'].dtype

In [None]:
# Checking missing data
df_temp.isnull().sum() / len(df_temp)

### Fill missing values
#### Filling numeric missing values first
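
Before applying this to the whole dataframe, here is a minimal sketch of the idea on a toy column: record where values were missing, then fill the gaps with the median. The column name `hours` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value and one outlier
toy = pd.DataFrame({"hours": [100.0, np.nan, 300.0, 5000.0]})

# Keep a record of which rows were missing before filling them
toy["hours_is_missing"] = toy["hours"].isna()
# The median ignores NaN and is less sensitive to the 5000.0 outlier than the mean
toy["hours"] = toy["hours"].fillna(toy["hours"].median())

print(toy)
```

The median of the three known values is 300, whereas the mean would be 1800, so the median gives a more typical fill value when a column is skewed.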

In [None]:
# Checking which numerical columns have missing values
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
# Filling numerical columns with missing values
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Add a binary column which tells us if data was missing
            df_temp[label + "_is_missing"] = pd.isnull(content)
            # Filling missing values with median
            df_temp[label] = content.fillna(content.median() )

### Filling and turning categorical variables into numbers
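
As a rough sketch of the encoding used in this section: pandas gives a missing category the code `-1`, so adding 1 shifts missing values to `0` and real categories to `1, 2, 3, ...`. The values below are illustrative, not read from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry
toy = pd.DataFrame({"UsageBand": ["Low", np.nan, "High", "Medium"]})

# Record missingness first, because the codes overwrite the original values
toy["UsageBand_is_missing"] = toy["UsageBand"].isna()
# Categories sort as High, Low, Medium -> codes 0, 1, 2 (NaN -> -1), then shift by 1
toy["UsageBand"] = pd.Categorical(toy["UsageBand"]).codes + 1

print(toy)
```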

In [None]:
# Check for columns which aren't numeric (the categorical ones)
for label, content in df_temp.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
# Turn categorical values into numbers and fill the missing
for label, content in df_temp.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Add a binary column to indicate if the column previously had missing values
        df_temp[label + "_is_missing"] = pd.isnull(content)
        # turning categories into numbers 
        df_temp[label] = pd.Categorical(content).codes + 1

In [None]:
df_temp.head().T

In [None]:
df_temp.isna().sum()

### 5. Modelling
Now that all of our data is numeric and has no missing values, we should be able to build a machine learning model.

In [None]:
#  Let's make our machine learning model
from sklearn.ensemble import RandomForestRegressor 
# Instantiating the model
model = RandomForestRegressor()
# Splitting the data into x and y
x = df_temp.drop('SalePrice', axis=1)
y = df_temp['SalePrice']

In [None]:
# Splitting data into train and validation sets (validation = sales from 2012)
df_val = df_temp[df_temp.saleYear == 2012]
df_train = df_temp[df_temp.saleYear != 2012]
x_train, y_train = df_train.drop('SalePrice', axis=1), df_train.SalePrice
x_valid, y_valid = df_val.drop('SalePrice', axis=1), df_val.SalePrice

### Building a custom Evaluation function

In [None]:
# Create Evaluation Function (The competition uses RMSLE)
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, r2_score
def rmsle(y_test, y_preds):
    """Calculate RootMeanSquaredLogError between predictions and true labels."""
    return np.sqrt(mean_squared_log_error(y_test, y_preds))

# Create a function to evaluate a model on several metrics, for both training and validation data
def show_scores(model):
    train_preds = model.predict(x_train)
    valid_preds = model.predict(x_valid) 
    scores = {
        'Training MAE': mean_absolute_error(y_train, train_preds),
        'Valid MAE': mean_absolute_error(y_valid, valid_preds),
        'Training RMSLE': rmsle(y_train, train_preds),
        'Valid RMSLE': rmsle(y_valid, valid_preds),
        'Training R^2': r2_score(y_train, train_preds),
        'Valid R^2': r2_score(y_valid, valid_preds)
    }
    
    return scores

### Testing our model on a subset (To tune Hyperparameters)

In [None]:
# Limit each tree to 10,000 samples (max_samples) so training stays fast while we experiment
model = RandomForestRegressor(n_jobs=-1, random_state=42, max_samples=10000)

In [None]:
# Fitting the model
model.fit(x_train, y_train)

In [None]:
# Evaluating the model using custom evaluation function
show_scores(model)

### HyperParameter Tuning with RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV
# Search over different RandomForestRegressor hyperparameters with RandomizedSearchCV
rs_grid = {
    'n_estimators': np.arange(10, 100, 10),
    'max_depth': [None, 3, 5, 10],
    'min_samples_split': np.arange(2, 20, 2),
    'min_samples_leaf': np.arange(1, 20, 2),
    'max_features': [0.5, 1, 'sqrt', 'log2'],
    'max_samples': [10000]
    }
rs_model = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42), 
    param_distributions=rs_grid,
    n_iter=5,
    cv=5,
    verbose=True,
    error_score='raise'
    )
# Fitting RandomizedSearchCV model
rs_model.fit(x_train, y_train)

In [None]:
# Best hyperParameters 
rs_model.best_params_

In [None]:
show_scores(rs_model)

### Train a model with best hyperparameters
NOTE: These were found after 100 iterations of `RandomizedSearchCV` (the search above runs only 5, via `n_iter=5`).

In [None]:
# Most ideal hyperparameters
ideal_model = RandomForestRegressor(n_estimators=40,
                                    min_samples_leaf=1,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None,
                                    random_state=42)
# Fitting the ideal model
ideal_model.fit(x_train, y_train)

In [None]:
# Scores for ideal model, trained on all the data 
show_scores(ideal_model)

### Make predictions on test data

In [None]:
# Importing test dataset
df_test = pd.read_csv('data/Test.csv', parse_dates=['saledate'], low_memory=False)
df_test.head()

In [None]:
# Making predictions on the raw test dataset raises an error, because df_test has not been
# preprocessed yet and its columns don't match the data the model was trained on
# test_predicts = ideal_model.predict(df_test)

In [None]:
df_test.isna().sum()

### Preprocessing our data (i.e. getting the test dataset into the same format as our training dataset)
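
Getting the test set into the training format ultimately means it must expose the same columns, in the same order, as the data the model was fitted on. The sketch below shows one way to check and repair that on toy frames. `train_X` and `test_X` are hypothetical stand-ins, and filling with `False` is an assumption that only makes sense for `_is_missing` indicator columns.

```python
import pandas as pd

# Hypothetical training features, including an indicator column
train_X = pd.DataFrame({
    "SalesID": [1, 2],
    "auctioneerID": [3.0, 3.0],
    "auctioneerID_is_missing": [False, True],
})
# Hypothetical test features: one column absent, others in a different order
test_X = pd.DataFrame({"auctioneerID": [1.0], "SalesID": [9]})

# Set difference reveals the columns the test frame lacks
missing_cols = set(train_X.columns) - set(test_X.columns)
print(missing_cols)

# Assumption: the missing columns are indicators, so False means "nothing was missing"
for name in sorted(missing_cols):
    test_X[name] = False

# Reorder the test columns to match the training order exactly
test_X = test_X[train_X.columns]
print(list(test_X.columns))
```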

In [None]:
def preprocess_data(df):
    """Apply the training-set transformations to df and return the transformed df."""
    # Add datetime features from the saledate column, then drop it
    df["saleYear"] = df.saledate.dt.year
    df["saleMonth"] = df.saledate.dt.month
    df["saleDay"] = df.saledate.dt.day
    df["saleDayOfWeek"] = df.saledate.dt.dayofweek
    df["saleDayOfYear"] = df.saledate.dt.dayofyear
    df.drop("saledate", axis=1, inplace=True)

    # Convert string dtypes into categorical dtype
    for label, content in df.items():
        if pd.api.types.is_string_dtype(content):
            df[label] = content.astype("category").cat.as_ordered()

    # Fill missing numerical rows with median
    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                # Add a binary column which tells us if data was missing
                df[label + "_is_missing"] = pd.isnull(content)
                # Filling missing values with median
                df[label] = content.fillna(content.median())

        # Fill categorical missing data and turn categories into numbers
        if not pd.api.types.is_numeric_dtype(content):
            df[label + "_is_missing"] = pd.isnull(content)
            # We add + 1 to category codes because pandas encodes null value as -1 and we want 0
            df[label] = pd.Categorical(content).codes + 1

    return df

In [None]:
df_test_copy = df_test.copy()

In [None]:
# Process the test data
preprocess_data(df_test)

In [None]:
x_train.head()

In [None]:
# We can find how the columns differ by using sets
set(x_train.columns) - set(df_test.columns)

In [None]:
# df_test has no auctioneerID_is_missing column, so add it manually, set it to False,
# and reorder the columns to match x_train
df_test['auctioneerID_is_missing'] = False
df_test = df_test.reindex(columns=list(x_train.columns))
df_test.head()

In [None]:
# Now we can make predictions, as df_test is in the same format as the training data
test_preds = ideal_model.predict(df_test)

In [None]:
test_preds

We've made some predictions, but they're not in the format Kaggle wants.

In [None]:
# Format the predictions the way Kaggle is after
df_preds = pd.DataFrame()
df_preds['SalesID'] = df_test.SalesID
# Use the model's predictions (test_preds) as the price column
df_preds['SalesPrice'] = test_preds

In [None]:
df_preds

In [None]:
# index=False keeps the dataframe's row index out of the exported file
df_preds.to_csv('data/test_predictions_practice.csv', index=False)

### Feature Importance
Feature importance seeks to figure out which attributes of the data matter most when predicting the target variable (`SalePrice`).

In [None]:
# Find feature importance of our best model
ideal_model.feature_importances_

In [None]:
# helper function for plotting feature importance
def plot_features(columns, importances, n=20):
    df = (
        pd.DataFrame({"features": columns, "feature_importances": importances})
        .sort_values("feature_importances", ascending=False)
        .reset_index(drop=True)
    )

    # Plot dataframe
    fig, ax = plt.subplots()
    ax.barh(df["features"][:n], df["feature_importances"][:n])
    ax.set_ylabel("Features")
    ax.set_xlabel("Feature Importance")
    ax.invert_yaxis()

In [None]:
plot_features(x_train.columns, ideal_model.feature_importances_)