<a href="https://colab.research.google.com/github/a7medElSayed/Blue-Book-for-Bulldozers/blob/main/bludzers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This repository contains the notebook for the Kaggle competition "Blue Book for Bulldozers": https://www.kaggle.com/c/bluebook-for-bulldozers

 **Introduction**

The goal is to predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have sold for.

***Importing libraries***

In [None]:
import pandas as pd 
import numpy as np
from sklearn.ensemble import  RandomForestRegressor
import re
from sklearn.impute import SimpleImputer
from IPython.display import display
from pandas.api.types import is_string_dtype, is_numeric_dtype

 **Data**
 

# We have three main datasets:


*   Train Data
*   Valid Data
*   Test Data



**Parsing dates**

When we work with time series data, we want to enrich the time & date component as much as possible.

We can do that by telling pandas which of our columns has dates in it using the parse_dates parameter.

In [None]:
data_set = pd.read_csv('/content/TrainAndValid.csv.zip', low_memory=False, 
                     parse_dates=["saledate"])

data_test=pd.read_csv('/content/Test.csv.zip', low_memory=False, 
                     parse_dates=["saledate"])
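As a small illustration of what `parse_dates` does, the sketch below reads a two-row CSV from an in-memory string instead of the competition files. The toy rows are made up for the example; only the `saledate` column name comes from the notebook.

```python
import io

import pandas as pd

# Toy CSV standing in for the real file (values are illustrative only)
csv_text = "SalesID,saledate\n1,2006-11-16\n2,2004-03-26\n"

# parse_dates tells pandas to convert the listed columns to datetime64
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(df.dtypes)
print(df.saledate.dt.year.tolist())
```

Without `parse_dates`, the column would be read as plain strings and the `.dt` accessor used later would not be available.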

In [None]:
data_set.info()

**Preprocessing the data**

This dataset contains a mix of continuous and categorical variables, so we preprocess it before modelling.


**Add datetime parameters for saledate column**

We know from the problem description that `saledate` is a date, so we split it into several integer fields (year, month and day).

In [None]:
data_set['saleyear']=data_set.saledate.dt.year
data_set['salemonth']=data_set.saledate.dt.month
data_set['saleday']=data_set.saledate.dt.day

In [None]:
# Test Data
data_test['saleyear']=data_test.saledate.dt.year
data_test['salemonth']=data_test.saledate.dt.month
data_test['saleday']=data_test.saledate.dt.day

In [None]:
data_set.drop(columns=['saledate'],inplace=True)
data_test.drop(columns=['saledate'],inplace=True)
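The same `.dt` accessor exposes other calendar fields as well. The sketch below repeats the three features from the cells above on a toy frame and adds two more, `saledayofweek` and `saledayofyear`; those two extra columns are an assumption for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame; only the column name `saledate` comes from the notebook
toy = pd.DataFrame(
    {"saledate": pd.to_datetime(["2006-11-16", "2004-03-26", "2011-05-19"])}
)

# Same features as in the notebook
toy["saleyear"] = toy.saledate.dt.year
toy["salemonth"] = toy.saledate.dt.month
toy["saleday"] = toy.saledate.dt.day

# Extra calendar features (hypothetical additions, not in the notebook)
toy["saledayofweek"] = toy.saledate.dt.dayofweek  # Monday is 0
toy["saledayofyear"] = toy.saledate.dt.dayofyear

# Drop the raw datetime column once the numeric fields exist
toy = toy.drop(columns=["saledate"])
print(toy)
```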

**Convert string to categories**


In [None]:
data_set.state.unique()

One-hot encoding is impractical here because, as we can see, some features have more than 15 distinct values.


In [None]:

for col ,val in data_set.items():
  if pd.api.types.is_string_dtype(val):
    data_set[col]=val.astype("category").cat.as_ordered()


In [None]:
#Test Data
for col ,val in data_test.items():
  if pd.api.types.is_string_dtype(val):
    data_test[col]=val.astype("category").cat.as_ordered()


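We can't use `LabelEncoder` from scikit-learn because it can't handle NaN values. Instead we use the pandas category codes: missing values get the code -1, so adding 1 maps them to 0 and shifts every real category to a positive integer.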



In [None]:
for col,val in data_set.items():
  if  not pd.api.types.is_numeric_dtype(val):
    data_set[col]=pd.Categorical(val).codes+1

In [None]:
for col,val in data_test.items():
  if  not pd.api.types.is_numeric_dtype(val):
    data_test[col]=pd.Categorical(val).codes+1
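A minimal sketch of how `pd.Categorical(...).codes + 1` behaves on a toy column with a missing value. It also shows one possible way to reuse the training categories when encoding test data, so that the same string gets the same integer in both frames; that second step is an assumption for illustration and is not what the cells above do (they encode each frame independently).

```python
import numpy as np
import pandas as pd

# Toy columns (illustrative values only)
train_states = pd.Series(["Texas", np.nan, "Alabama", "Texas"])
test_states = pd.Series(["Texas", "Ohio"])

# Categories are sorted, missing values get code -1, so +1 maps NaN to 0
train_cat = pd.Categorical(train_states)
train_codes = train_cat.codes + 1
print(train_cat.categories.tolist(), train_codes)

# Reuse the training categories for the test column (hypothetical variation):
# values unseen in training are treated as missing and also end up as 0
test_codes = pd.Categorical(test_states, categories=train_cat.categories).codes + 1
print(test_codes)
```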

**Handling missing values**

In [None]:
for col,val in data_set.items():
  if pd.api.types.is_numeric_dtype(val):
    if pd.isnull(val).sum():
        data_set[col]=val.fillna(val.median())


In [None]:
#Test Data
for col,val in data_test.items():
  if pd.api.types.is_numeric_dtype(val):
    if pd.isnull(val).sum():
        data_test[col]=val.fillna(val.median())


Check whether any null values remain:

In [None]:
for col,val in data_set.items():
  if pd.api.types.is_numeric_dtype(val):
    if pd.isnull(val).sum():
        print(col)
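The cells above fill each frame with its own median. A common variation, sketched here under assumptions, is to compute the median on the training data only and reuse it for the test data, optionally recording which rows were missing. The toy values and the `_is_missing` indicator columns are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frames (illustrative values only)
train = pd.DataFrame({"MachineHoursCurrentMeter": [10.0, np.nan, 30.0, 50.0]})
test = pd.DataFrame({"MachineHoursCurrentMeter": [np.nan, 5.0]})

for col in list(train.columns):
    if pd.api.types.is_numeric_dtype(train[col]) and train[col].isnull().any():
        # Median learned from the training data only
        median = train[col].median()
        # Hypothetical indicator columns marking where values were missing
        train[col + "_is_missing"] = train[col].isnull()
        test[col + "_is_missing"] = test[col].isnull()
        # Fill both frames with the training median
        train[col] = train[col].fillna(median)
        test[col] = test[col].fillna(median)

print(train)
print(test)
```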

In [None]:
y=data_set.SalePrice
data_set.drop(columns=['SalePrice'],inplace=True)
sales_id=data_test.SalesID
data_set.drop(columns=['SalesID'],inplace=True)
data_test.drop(columns=['SalesID'],inplace=True)



In [None]:
type(data_set)

pandas.core.frame.DataFrame

***Scaling the Data***

In [None]:
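from sklearn.preprocessing import StandardScaler
SC = StandardScaler()
# Fit the scaler on the training features only, then reuse the same
# mean and standard deviation to transform the test features
data_set[data_set.columns] = SC.fit_transform(data_set[data_set.columns])
data_test[data_test.columns] = SC.transform(data_test[data_set.columns])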


In [None]:
type(data_set)

pandas.core.frame.DataFrame
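For the scaling step, the usual convention is to learn the mean and standard deviation from the training features and apply those same statistics to any other data. A minimal sketch on toy arrays (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (illustrative values only)
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
# fit_transform learns mean and std from the training data and scales it
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the training mean and std without re-fitting
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled.ravel())
```

With this convention a test value equal to the training mean is mapped to 0, regardless of how the test data itself is distributed.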

Splitting the data


1.   Train data
2.   Validation data



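**Calculate the size of the validation data**

We choose the validation fraction so that the validation set has roughly as many rows as the test set.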

In [None]:
test_valid_size=data_test.shape[0]/ (data_set.shape[0]) 

test_valid_size

0.030184299415068647

In [None]:
from sklearn.model_selection import  train_test_split
x_train,x_valid,y_train,y_valid=train_test_split(data_set,y,test_size=test_valid_size,random_state=44)
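When `test_size` is a float, `train_test_split` treats it as the fraction of rows to hold out. A small sketch on synthetic arrays, assuming a fraction of about 3% similar to the one computed above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real features and target
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# Hold out roughly 3% of the rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.03, random_state=44)

print(len(X_tr), len(X_va))
```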

**Building an evaluation function**


The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and
predicted auction prices

In [None]:
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_preds):
    """
    Calculates the root mean squared log error between true labels and predictions.
    """
    return np.sqrt(mean_squared_log_error(y_true, y_preds))

def show_evalution_score(model):
    # model.score returns the R^2 coefficient for regressors
    print("Training score", model.score(x_train, y_train))
    print("Valid score", model.score(x_valid, y_valid))
    print("Training RMSLE", rmsle(y_train, model.predict(x_train)))
    print("Valid RMSLE", rmsle(y_valid, model.predict(x_valid)))
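To make the metric concrete, the sketch below computes RMSLE by hand with NumPy and compares it with the scikit-learn helper used above. The prices are toy values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Toy prices (illustrative values only)
y_true = np.array([10000.0, 25000.0, 60000.0])
y_pred = np.array([12000.0, 20000.0, 66000.0])

# RMSLE: square root of the mean squared difference of log(1 + value)
manual = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Same quantity via scikit-learn
library = np.sqrt(mean_squared_log_error(y_true, y_pred))

print(manual, library)
```

Because the error is taken on the log scale, it reflects relative rather than absolute differences, so being off by 10% on a cheap machine counts about as much as being off by 10% on an expensive one.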



**Train the model**

In [None]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(x_train, y_train)




RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=-1, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [None]:
show_evalution_score(m)

Training score 0.9874250289527058
Valid score 0.9125503699681762
Training RMSLE 0.08438261449267169
Valid RMSLE 0.21031149422681375


The gap between the training and validation scores indicates that the model is overfitting.

**Hyperparameter tuning with RandomizedSearchCV**

We will use a randomized search to look for better hyperparameters.

In [None]:

from sklearn.model_selection import RandomizedSearchCV

 # Different RandomForestRegressor hyperparameters
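rf_grid = {"n_estimators": np.arange(20, 100, 20),
          "max_depth": [None, 3, 5, 10],
          "min_samples_split": np.arange(2, 20, 2),
          "min_samples_leaf": np.arange(1, 20, 2),
          # 0.5 means half of the features; "auto" was removed in newer
          # scikit-learn releases, where 1.0 can be used instead
          "max_features": [0.5, 1, "sqrt", "auto"],
          "max_samples": [10000]}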

 # Instantiate RandomizedSearchCV
scv_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                   random_state=12),
                             param_distributions=rf_grid,
                             n_iter=50,
                             cv=5,
                             verbose=True)

 # Fit the RandomizedSearchCV
scv_model.fit(x_train, y_train)

In [None]:
scv_model.best_params_
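A scaled-down sketch of the same search on synthetic data, small enough to run in a few seconds. The data, the reduced parameter lists, and the `n_iter` and `cv` values are assumptions chosen for speed and do not reproduce the search above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for the bulldozer data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Reduced search space (illustrative values only)
param_distributions = {
    "n_estimators": [10, 20],
    "max_depth": [None, 3],
    "min_samples_leaf": [1, 5],
    "max_features": [0.5, "sqrt"],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=12),
    param_distributions=param_distributions,
    n_iter=4,         # number of random combinations to try
    cv=3,             # 3-fold cross-validation per combination
    random_state=12,  # makes the sampled combinations reproducible
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

With the default `refit=True`, the search object retrains the best combination on all the data passed to `fit`, so calling `predict` or `score` on the search object uses that refitted estimator (also available as `best_estimator_`).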

In [None]:
show_evalution_score(scv_model)

Training score 0.8183773855054859
Valid score 0.8145317375991589
Training RMSLE 0.2927624705120116
Valid RMSLE 0.2978807541685223


**Train a model with the best hyperparameters**

In [None]:
best_model = RandomForestRegressor(n_estimators=60,
                                   min_samples_leaf=7,
                                   min_samples_split=12,
                                   max_features='auto',
                                   n_jobs=-1,
                                   max_samples=None,
                                   random_state=12)

In [None]:
best_model.fit(x_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=7,
                      min_samples_split=12, min_weight_fraction_leaf=0.0,
                      n_estimators=60, n_jobs=-1, oob_score=False,
                      random_state=12, verbose=0, warm_start=False)

In [None]:
show_evalution_score(best_model)

Training score 0.9412147475676828
Valid score 0.9037022378478624
Training RMSLE 0.1688309965013587
Valid RMSLE 0.21899367711846998


**Make predictions on test data**

In [None]:
test_pred=best_model.predict(data_test)
test_pred

array([28051.4496712 , 23015.86599842, 92823.75033105, ...,
       11436.3532862 , 21797.81002794, 28294.83049189])

In [None]:
df_predict=pd.DataFrame()
df_predict["SalesID"] = sales_id
df_predict["SalesPrice"] = test_pred
df_predict

Unnamed: 0,SalesID,SalesPrice
0,1227829,28051.449671
1,1227844,23015.865998
2,1227847,92823.750331
3,1227848,98353.769183
4,1227863,57427.565476
...,...,...
12452,6643171,48993.317846
12453,6643173,19506.552379
12454,6643184,11436.353286
12455,6643186,21797.810028
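A possible last step is to write the predictions to a CSV file for submission. The sketch below uses two toy rows and a temporary file; the file name `submission.csv` is hypothetical, and the expected column header should be checked against the competition page before uploading.

```python
import os
import tempfile

import pandas as pd

# Toy predictions standing in for df_predict (illustrative values only)
submission = pd.DataFrame(
    {"SalesID": [1227829, 1227844], "SalesPrice": [28051.45, 23015.87]}
)

# Hypothetical output path inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "submission.csv")

# index=False keeps the DataFrame index out of the file
submission.to_csv(path, index=False)

# Read the file back to confirm its layout
check = pd.read_csv(path)
print(check)
```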
