<a href="https://colab.research.google.com/github/CarlaFFochs/Udemy_Projects/blob/main/Structured_Data_Project_2_Predicting_the_sale_price_of_Bulldozers_(regression).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting the Sale Price of Bulldozers using Machine Learning

In this notebook we're going to go through an example machine learning project with the goal of predicting the sale price of bulldozers.

### 1. Problem definition

> How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?

### 2. Data

The data is downloaded from the Kaggle competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data

  The data for this competition is split into three parts:

  * Train.csv is the training set, which contains data through the end of 2011.
  * Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
  * Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

### 3. Evaluation

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices. For more info check: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation

**Note:** The goal for most regression evaluation metrics is to minimise the error. For example, the goal for this project will be to build an ML model which minimises RMSLE.
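
To make the metric concrete, here is a minimal sketch of RMSLE in plain Python. It assumes the `log(1 + x)` form that scikit-learn's `mean_squared_log_error` uses; the prices are hypothetical and only illustrate that the same absolute error is penalised more on a cheaper machine.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x) for both inputs."""
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical prices, for illustration only
actual = [10000, 50000]
too_high_by_10k = [20000, 60000]

# Both predictions are off by 10,000, but RMSLE measures the error
# as a ratio, so the cheaper machine contributes the larger error.
cheap_error = rmsle(actual[:1], too_high_by_10k[:1])
expensive_error = rmsle(actual[1:], too_high_by_10k[1:])
print(cheap_error > expensive_error)
```

Because the error is computed on logarithms, RMSLE rewards getting the ratio between predicted and actual price right rather than the absolute difference.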

### 4. Features

Kaggle provides a data dictionary detailing all the features of the data set: https://docs.google.com/spreadsheets/d/1Mhm5o1ZXLt2o-uE2GPq506iyoJDpoQX_/edit?usp=sharing&ouid=104958891807092880404&rtpof=true&sd=true


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

In [None]:
# Import training and validation sets
df = pd.read_csv("/content/drive/MyDrive/MASTER DATA SCIENCE/M0/M0 - UDEMY/Time Series (Supervised Learning)/data/bluebook-for-bulldozers/TrainAndValid.csv",
                 low_memory=False)

# We don't need to minimise memory usage here (low_memory=False)

In [None]:
df.info()

In [None]:
df.isna().sum() # check for null values

In [None]:
df.columns

In [None]:
fig, ax = plt.subplots()
ax.scatter(df["saledate"][:1000], df["SalePrice"][:1000])

In [None]:
df.saledate[:1000] # the date format isn't what we want

In [None]:
df.SalePrice.plot.hist()

### Parsing dates

When we work with time series data, we want to enrich the time & date component as much as possible.

We can do that by telling pandas which of our columns contain dates, using the "parse_dates" parameter.
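
As a small illustration of the difference, the sketch below reads the same CSV text twice, once without and once with "parse_dates". The two rows are hypothetical stand-ins for the real file, not values taken from it.

```python
import io
import pandas as pd

# Hypothetical two-row CSV, standing in for the real file
csv_text = (
    "SalesID,saledate,SalePrice\n"
    "1,11/16/2006 0:00,66000\n"
    "2,3/26/2004 0:00,57000\n"
)

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(raw["saledate"].dtype)                # dates kept as plain strings
print(parsed["saledate"].dtype)             # a datetime64 dtype
print(parsed["saledate"].dt.year.tolist())  # the .dt accessor now works
```

Only the parsed column exposes the `.dt` accessor, which is what the later feature-engineering cells rely on.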

In [None]:
# Import data again but this time parse dates

df = pd.read_csv("/content/drive/MyDrive/MASTER DATA SCIENCE/M0/M0 - UDEMY/Time Series (Supervised Learning)/data/bluebook-for-bulldozers/TrainAndValid.csv",
                 low_memory=False, parse_dates=["saledate"])

In [None]:
df["saledate"].dtype # equivalent to "datetime64[ns]"

In [None]:
df["saledate"] # thanks to "parse_dates", the dates are now in the international YYYY-MM-DD format

In [None]:
fig, ax = plt.subplots()
ax.scatter(df["saledate"][:1000], df["SalePrice"][:1000])

### Sort DataFrame by saledate

When working with time series data, it's a good idea to sort it by date.

In [None]:
# Sort DataFrame in date order
df.sort_values(by=["saledate"], inplace=True, ascending=True)
df.saledate.head(20)

### Make a copy of the original DataFrame

We make a copy of the original DataFrame so that when we manipulate the copy, we've still got our original data.

In [None]:
df_tmp= df.copy()

In [None]:
df_tmp.saledate.head(20)

### Add datetime parameters for 'saledate' column

In [None]:
df_tmp[:1]["saledate"]

In [None]:
df_tmp[:1]["saledate"].dt.year

In [None]:
df_tmp[:1]["saledate"].dt.day

In [None]:
df_tmp["saleYear"]= df_tmp["saledate"].dt.year
df_tmp["saleMonth"]= df_tmp["saledate"].dt.month
df_tmp["saleDay"]= df_tmp["saledate"].dt.day
df_tmp["saleDayofWeek"]= df_tmp["saledate"].dt.dayofweek
df_tmp["saleDayofYear"]= df_tmp["saledate"].dt.dayofyear

In [None]:
df_tmp.T

In [None]:
# We can see that the new columns were added at the end of the DataFrame
# We no longer need "saledate"

In [None]:
# Now that we've enriched our DataFrame with datetime features, we can remove "saledate"

df_tmp.drop("saledate", axis=1, inplace=True)

In [None]:
# Check the values of different columns
df_tmp["state"].value_counts() # get the number of sales per state

In [None]:
len(df_tmp)

## 5. Modelling

We've done enough EDA (we could always do more) but let's start to do some model-driven EDA.

In [None]:
# Let's build a ML model
#from sklearn.ensemble import RandomForestRegressor

#model= RandomForestRegressor(n_jobs=-1,
#                             random_state=42)

#model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])


# Error we get:
# could not convert string to float: 'Low'

In [None]:
df_tmp["UsageBand"].dtype 

In [None]:
df.isna().sum()

### Convert string into categories

One way we can turn our data into numbers is by converting them into pandas categories.
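
A minimal sketch of what this conversion does, using hypothetical values modelled on the UsageBand column. Note that pandas sorts the inferred categories alphabetically, so the resulting order is lexical rather than Low < Medium < High.

```python
import pandas as pd

# Hypothetical values, modelled on the UsageBand column
usage = pd.Series(["Low", "High", None, "Medium", "Low"])
usage_cat = usage.astype("category").cat.as_ordered()

print(list(usage_cat.cat.categories))  # the labels, sorted alphabetically
print(usage_cat.cat.codes.tolist())    # one integer per row, -1 for missing
```

Each label maps to its position in the category list, and a missing value gets the code -1.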

In [None]:
df_tmp.head().T

In [None]:
pd.api.types.is_string_dtype(df_tmp["UsageBand"])

In [None]:
# Find the columns which contain strings

for label, content in df_tmp.items():
  if pd.api.types.is_string_dtype(content):
    print(label)

In [None]:
# Now we have all the columns that contain strings

In [None]:
# If you are wondering what df.items() does, here's an example:

random_dict = {"key1": "hello",
               "key2": "world"}

for key, value in random_dict.items():
  print(f"this is a key:  {key}",
        f"this is a value: {value}")

In [None]:
# This will turn all of the string values to category values

for label, content in df_tmp.items():
  if pd.api.types.is_string_dtype(content):
    df_tmp[label] = content.astype("category").cat.as_ordered()

In [None]:
df_tmp.info()

In [None]:
df_tmp.state.cat.categories # cat.as_ordered() marks the categories as ordered; under the hood pandas stores each label as an integer code, which we check below

In [None]:
df_tmp.state.cat.codes # see which number was assigned to each "state"

Thanks to pandas Categories we now have a way to access all of our data in the form of numbers, but we still have a bunch of missing data...

In [None]:
# Check missing data

df_tmp.isnull().sum()/len(df_tmp)

### Save preprocessed data

In [None]:
# Export current tmp dataframe
# Save the manipulated DataFrame

df_tmp.to_csv("/content/drive/MyDrive/MASTER DATA SCIENCE/M0/M0 - UDEMY/Time Series (Supervised Learning)/data/train_tmp.csv",
              index=False)

In [None]:
# Import preprocessed data

df_tmp = pd.read_csv("/content/drive/MyDrive/MASTER DATA SCIENCE/M0/M0 - UDEMY/Time Series (Supervised Learning)/data/train_tmp.csv",
              low_memory=False)

In [None]:
df_tmp.head().T

## Fill missing values

### Fill numerical values


In [None]:
for label, content in df_tmp.items():
  if pd.api.types.is_numeric_dtype(content):
    print(label)

In [None]:
df_tmp.ModelID

In [None]:
# Check for which numeric columns have null values

for label, content in df_tmp.items():
  if pd.api.types.is_numeric_dtype(content):
    if pd.isnull(content).sum(): # true only when the column has at least one null (a sum of 0 is falsy)
      print(label)

In [None]:
# Fill numeric rows with the median

for label, content in df_tmp.items():
  if pd.api.types.is_numeric_dtype(content):
    if pd.isnull(content).sum(): # true only when the column has at least one null (a sum of 0 is falsy)
      # Add a binary column which tells us if the data was missing (so we keep a record of which values were originally absent)
      df_tmp[label+"_is_missing"] = pd.isnull(content)
      # Fill missing numeric values with the median
      df_tmp[label] = content.fillna(content.median()) # the median is more robust to outliers than the mean

In [None]:
# Demonstrate how median is more robust than mean

hundreds = np.full((1000,), 100)
hundreds_billion = np.append(hundreds, 1000000000)
np.mean(hundreds), np.mean(hundreds_billion), np.median(hundreds), np.median(hundreds_billion)

In [None]:
# Check if there are any null numeric values

for label, content in df_tmp.items():
  if pd.api.types.is_numeric_dtype(content):
    if pd.isnull(content).sum():
      print(label)

In [None]:
# Check to see what the binary column has done
df_tmp.auctioneerID_is_missing.value_counts()

In [None]:
# We filled 20136 values with the median

In [None]:
df_tmp.isna().sum() # we still have to fill the missing values (in the categorical columns...)

## Filling and turning categorical variables into numbers

In [None]:
# Check for columns which aren't numeric

for label, content in df_tmp.items():
  if not pd.api.types.is_numeric_dtype(content):
    print(label)

In [None]:
pd.Categorical(df_tmp["state"]).dtype

In [None]:
pd.Categorical(df_tmp["UsageBand"]).codes 

In [None]:
# Turn categorical variables into numbers and fill missing

for label, content in df_tmp.items():
  if not pd.api.types.is_numeric_dtype(content):
    # Add binary column to indicate whether sample had missing value
    df_tmp[label+"is_missing"] = pd.isnull(content)
    # Turn categories into numbers and add +1
    df_tmp[label] = pd.Categorical(content).codes + 1

In [None]:
pd.Categorical(df_tmp["UsageBand"]).codes # pandas assigns -1 to a missing category, but we want missing to be 0 (hence the +1)

In [None]:
pd.Categorical(df_tmp["state"]).codes

In [None]:
pd.Categorical(df_tmp["state"]).codes +1

In [None]:
df_tmp.info()

In [None]:
df_tmp.head().T

In [None]:
df_tmp.isna().sum()[:20]

Now that all of our data is numeric and our DataFrame has no missing values, we should be able to build an ML model.

In [None]:
df_tmp.head()

In [None]:
len(df_tmp)

In [None]:
%%time
from sklearn.ensemble import RandomForestRegressor

# Instantiate model
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42)

# Fit the model
model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])

In [None]:
# Score the model
model.score(df_tmp.drop("SalePrice", axis=1), df_tmp["SalePrice"])

**Question:** Why doesn't the above metric hold water? (Why isn't the metric reliable?)

In [None]:
# We get such a high score because we EVALUATED the model on the same data we used as the TRAINING set.

# It's like learning the class material and then, instead of sitting a final exam with new questions, being tested on exactly the same questions from the book we read beforehand.
# That's fine, but what we really want is our model's ability to generalise (an ML model's ability to perform well on data it has never seen).


### Splitting data into train/validation sets

In [None]:
df_tmp.saleYear

In [None]:
df_tmp.saleYear.value_counts()

In [None]:
# Split data into training and validation
df_val = df_tmp[df_tmp.saleYear == 2012]
df_train = df_tmp[df_tmp.saleYear != 2012]

len(df_val), len(df_train)

In [None]:
# Split data into X & y
X_train, y_train = df_train.drop("SalePrice", axis=1), df_train["SalePrice"]
X_valid, y_valid = df_val.drop("SalePrice", axis=1), df_val["SalePrice"]

X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

In [None]:
y_train

### Building an evaluation function


In [None]:
# Create evaluation function (the competition uses RMSLE)
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, r2_score

def rmsle(y_test, y_preds):
  """
  Calculates root mean squared log error between predictions and true labels
  """
  return np.sqrt(mean_squared_log_error(y_test, y_preds))

# Create function to evaluate model on a few different levels
def show_scores(model):
  train_preds = model.predict(X_train)
  # The validation set usually scores worse than the training set; a large gap between the two hints at overfitting
  val_preds = model.predict(X_valid)
  scores = {"Training MAE": mean_absolute_error(y_train, train_preds),
            "Valid MAE": mean_absolute_error(y_valid, val_preds),
            "Training RMSLE": rmsle(y_train, train_preds),
            "Valid RMSLE": rmsle(y_valid, val_preds),
            "Training R^2": r2_score(y_train, train_preds),
            "Valid R^2": r2_score(y_valid, val_preds)}

  return scores

### Testing our model on a subset (to tune the hyperparameters)

In [None]:
# This takes far too long...for experimenting

# %%time
# model = RandomForestRegressor(n_jobs=-1,
#                              random_state=42)
# model.fit(X_train,y_train)

In [None]:
len(X_train)

In [None]:
# Change max_samples value
model = RandomForestRegressor(n_jobs=-1,
                            random_state=42,
                            max_samples=10000)
model

In [None]:
%%time
# Cutting down on the max number of samples each estimator can see improves training time
model.fit(X_train,y_train)

In [None]:
X_train.shape[0]

In [None]:
show_scores(model)

### Hyperparameter tuning with RandomizedSearchCV

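Rather than trying every combination in a grid, RandomizedSearchCV samples "n_iter" random combinations of hyperparameters and cross-validates each one.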
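
The sampling idea can be sketched in a few lines of plain Python. This is an illustration only, not scikit-learn's implementation: the grid is a small hypothetical one, `sample_combinations` is a made-up helper, and it samples with replacement, which scikit-learn avoids when every parameter is given as a list.

```python
import random

# A small hypothetical grid, not the one used in this notebook
grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [None, 3, 5, 10],
    "min_samples_leaf": [1, 5, 15],
}

def sample_combinations(grid, n_iter, seed=42):
    """Draw n_iter random hyperparameter combinations from the grid."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]

# Size of the full grid: the product of the number of options per parameter
total = 1
for values in grid.values():
    total *= len(values)

candidates = sample_combinations(grid, n_iter=2)
print(f"{len(candidates)} of {total} possible combinations will be tried")
```

With only a handful of draws from a large grid, the best combination found is unlikely to be the best one overall, so "n_iter" trades search quality against run time.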

In [None]:
%%time
from sklearn.model_selection import RandomizedSearchCV

# Different RandomForestRegressor hyperparameters
rf_grid = {"n_estimators": np.arange(10,100,10),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2,20,2),
           "min_samples_leaf": np.arange(1,20,2),
           "max_features": [0.5, 1, "sqrt", 1.0], # 1.0 = all features; it replaces "auto", which was removed in scikit-learn 1.3
           "max_samples": [10000]}

# Instantiate RandomizedSearch CV model
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                    random_state=42),
                              param_distributions=rf_grid,
                              n_iter=2,
                              cv=5,
                              verbose=True)

# Fit the RandomizedSearchCV model (n_iter=2, so it tries 2 combinations of parameters)
rs_model.fit(X_train, y_train)

In [None]:
# Find the best hyperparameters

rs_model.best_params_

In [None]:
# We only searched 2 combinations, so these are unlikely to be the best

show_scores(rs_model)

In [None]:
# We can see the RMSLE got slightly worse, which makes sense because we only ran 2 iterations to search for the best parameters...

### Train a model with the best hyperparameters

**Note:** These were found after 100 iterations of RandomizedSearchCV


In [None]:
%%time

# Most ideal hyperparameters
ideal_model = RandomForestRegressor(max_depth= 10,
                                    max_features= 0.5,
                                    max_samples= None, # use all of the training samples
                                    min_samples_leaf= 15,
                                    min_samples_split= 18,
                                    n_estimators= 30,
                                    n_jobs=-1)

# Fit the ideal model
ideal_model.fit(X_train, y_train)

In [None]:
# Scores for ideal_model (trained on all the data)

show_scores(ideal_model)

In [None]:
# Scores for rs_model (trained on 10,000 examples)

show_scores(rs_model)

## Make predictions on test data

In [None]:
# Import the test data

df_test = pd.read_csv("/content/drive/MyDrive/MASTER DATA SCIENCE/M0/M0 - UDEMY/Time Series (Supervised Learning)/data/bluebook-for-bulldozers/Test.csv",
                      low_memory = False,
                      parse_dates= ["saledate"])
df_test.head()

In [None]:
df_test.columns

In [None]:
# Make predictions on the test dataset
# test_preds = ideal_model.predict(df_test)

# ValueError: could not convert string to float: 'Low'

In [None]:
df_test.isna().sum() # we have null values

In [None]:
df_test.info() # not everything is numeric

In [None]:
# The test set isn't in the same format as the training set, so we have to preprocess it as before...

### Preprocessing the data (getting the test dataset in the same format as our training dataset)

In [None]:
def preprocess_data(df):
  """
  Performs transformations on df and returns transformed df.
  """
  df["saleYear"]= df["saledate"].dt.year
  df["saleMonth"]= df["saledate"].dt.month
  df["saleDay"]= df["saledate"].dt.day
  df["saleDayofWeek"]= df["saledate"].dt.dayofweek
  df["saleDayofYear"]= df["saledate"].dt.dayofyear

  df.drop("saledate", axis=1, inplace=True)

  # Fill the numeric rows with median
  for label, content in df.items():
    if pd.api.types.is_numeric_dtype(content):
      if pd.isnull(content).sum(): # true only when the column has at least one null (a sum of 0 is falsy)
        # Add a binary column which tells us if the data was missing (so we keep a record of which values were originally absent)
        df[label+"_is_missing"] = pd.isnull(content)
        # Fill missing numeric values with the median
        df[label] = content.fillna(content.median()) # the median is more robust to outliers than the mean

  # Fill categorical missing data and turn categories into numbers
  for label, content in df.items():
    if not pd.api.types.is_numeric_dtype(content):
      # Add binary column to indicate whether sample had missing value
      df[label+"is_missing"] = pd.isnull(content)
      # We add +1 to the category code because pandas encodes missing categories as -1
      df[label] = pd.Categorical(content).codes + 1

  return df

In [None]:
df_test.columns

In [None]:
# Process the test data
df_test = preprocess_data(df_test)

In [None]:
df_test.head()

In [None]:
X_train.shape

In [None]:
# Make predictions on updated test data
# test_preds= ideal_model.predict(df_test)

In [None]:
# The train and test sets don't have the same shape...
# We can find how the columns differ using sets
set(X_train.columns) - set(df_test.columns)

In [None]:
# df_test doesn't have the 'auctioneerID_is_missing' column

# Manually adjust df_test to have auctioneerID_is_missing
df_test["auctioneerID_is_missing"] = False # the test set had no missing auctioneerID values
df_test.head()

Finally, our test DataFrame has the same features as our training DataFrame. We can make predictions!
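
A mismatch like this can also be handled in one step. The sketch below uses two hypothetical miniature frames in place of X_train and df_test, and assumes that every absent column is a boolean "is missing" indicator, so filling it with False is appropriate.

```python
import pandas as pd

# Hypothetical miniature frames standing in for X_train and df_test
train_features = pd.DataFrame(
    {"a": [1, 2], "a_is_missing": [False, True], "b": [3, 4]}
)
test_features = pd.DataFrame({"b": [5], "a": [6]})

# Columns present in training but absent from the test frame
missing = set(train_features.columns) - set(test_features.columns)

# Add the absent indicator columns as False and match the training column order
aligned = test_features.reindex(columns=train_features.columns, fill_value=False)

print(missing)
print(aligned.columns.tolist())
```

Matching the column order matters as well as the column names, because recent scikit-learn versions check that the feature names at prediction time match those seen during fitting.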

In [None]:
test_preds = ideal_model.predict(df_test[X_train.columns]) # reorder the columns to match the order used in training

In [None]:
test_preds

We've made some predictions but they're not in the format Kaggle is asking for: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation

In [None]:
# Format predictions into the format Kaggle asks for:
df_preds = pd.DataFrame()
df_preds["SalesID"] = df_test["SalesID"] # create the "SalesID" column
df_preds["SalePrice"] = test_preds # create the "SalePrice" column, filled with the values from test_preds
df_preds

In [None]:
# Export prediction data

df_preds.to_csv("/content/drive/MyDrive/MASTER DATA SCIENCE/M0/M0 - UDEMY/Time Series (Supervised Learning)/data/test_predictions.csv", index= False) # we name the file "test_predictions.csv"

### Feature importance

Feature importance seeks to figure out which attributes of the data were most important when it comes to predicting the **target variable** (SalePrice).
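
To see the idea on data where the answer is known, here is a minimal sketch on synthetic data in which only the first of three features drives the target. The data, the model settings, and the variable names are illustrative assumptions, not part of the bulldozer workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only the first of three features drives the target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 5 * X[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X, y)

importances = forest.feature_importances_
print(importances.round(3))  # one value per feature
print(importances.sum())     # impurity-based importances are normalised to sum to 1
```

The first feature should receive almost all of the importance, which is the kind of signal we look for in the real model.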

In [None]:
# Find feature importance of our best model
ideal_model.feature_importances_

In [None]:
len(ideal_model.feature_importances_)

In [None]:
X_train.shape

In [None]:
X_train

We are getting a value for each feature

In [None]:
# Helper function for plotting feature importance

def plot_features(columns, importances, n=20): # plot the top n features (20 by default)
  df = (pd.DataFrame({"features": columns,
                      "feature_importances": importances})
        .sort_values("feature_importances", ascending=False)
        .reset_index(drop=True))

  # Plot the dataframe
  fig, ax = plt.subplots() # instantiate the plot
  ax.barh(df["features"][:n], df["feature_importances"][:n])
  ax.set_ylabel("Features")
  ax.set_xlabel("Feature importance")
  ax.invert_yaxis() # so the bars run from most to least important (descending)

In [None]:
plot_features(X_train.columns, ideal_model.feature_importances_)

In [None]:
df["ProductSize"].value_counts()

In [None]:
df["Enclosure"].value_counts()

**Question to finish:** Why might knowing the feature importances of a trained model be helpful?

**Final challenge:** What other machine learning models could we try on our dataset? Hint: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html or try to look at something like CatBoost.ai or XGBoost.ai.