<img src="https://i.imgur.com/GNiaWNs.png">
<center><h1>Predict House Prices in Iowa</h1></center>
<center><h2>CORGI Workshop</h2></center>

# Introduction

### What is a Regression Problem?

Regression is a **supervised machine learning task** whose aim is to predict a *real* or *continuous* variable (e.g. salary, height, house prices 😉) by looking at other *helper* variables (also called *features*).

Hence, we're trying to:
* predict a TARGET
* by looking at FEATURES
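
To make the TARGET / FEATURES idea concrete, here is a minimal sketch on made-up numbers. The variable names (`area`, `price`, `toy_model`) and the values are assumptions for illustration only and have nothing to do with the Iowa data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one FEATURE (living area) and one TARGET (price).
# The numbers are made up for illustration.
area = np.array([[50.0], [80.0], [100.0], [120.0]])             # FEATURES (2D)
price = np.array([100_000.0, 160_000.0, 200_000.0, 240_000.0])  # TARGET (1D)

# The model learns the relationship between the feature and the target
toy_model = LinearRegression()
toy_model.fit(area, price)

# Predict the price of a house the model has not seen
predicted_price = toy_model.predict(np.array([[90.0]]))[0]
print(f"Predicted price: {predicted_price:,.0f}")
```

In this toy set the price is exactly 2,000 times the area, so the model can recover the relationship perfectly. Real data is never that tidy.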

<img src="https://i.imgur.com/eXHXYdZ.png" width = 600>


### Predicting House Prices in Iowa (Kaggle Competition) 🏡

This competition aims to predict the Sales Price of Houses in Iowa, by looking at various aspects of the house or area, such as *number of rooms, number of floors, garage, the street, pool, utilities etc.*

There are 2 Chapters that we'll cover to achieve our goal:
1. **Data Preprocessing**
2. **Data Modeling**


# 1. Data Preprocessing 🛠

In this Chapter we'll learn how to do the following:

1. Import and analyse the data
2. Clean the data from missing values
3. Encode categorical variables

### Libraries 📚

In [None]:
# Import Data Libraries
import pandas as pd
import numpy as np

# Import Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## 1.1 Importing the data

In [None]:
# Read in the datasets
train = pd.read_csv("../input/iowa-house-prices/train.csv")
test = pd.read_csv("../input/iowa-house-prices/test.csv")

# Drop ID columns (columns= already implies axis=1)
train.drop(columns="Id", inplace=True)
test.drop(columns="Id", inplace=True)

We have 2 datasets that we need to import:
* `train` : it's the TRAINING data. The model *learns* by looking at it and understanding patterns through different algorithms. It is **labeled** (meaning that it contains the house *price* column).
* `test` : it's the data for TESTING. Once we have created our model and we know it is working, we can test it on a NEW, completely unseen dataset, to see if it can properly generalise the information learned during training.

> Splitting the data into TRAIN and TEST lets us check that the model is **robust**, meaning that it is able to **generalise**. 

<img src="https://i.imgur.com/WrMWQT3.png" width=600>

In [None]:
# Explore the format
print("Train shape: {}".format(train.shape))
print("Test shape: {}".format(test.shape))

# Explore the head of the dataframe
train.head()

In [None]:
test.head()

In [None]:
# # Other helpful commands
# train.dtypes          # check the datatypes of the columns
# train.columns         # check name of columns
# train.shape           # check the shape (rows, columns) of the dataframe
# train.isna().sum()    # check how much missing data in each column

## 1.2 Check for Leakages 🚿

**Leakage** happens when you train your model on *features* that you wouldn't have available when a **new case** comes up.

Imagine trying to predict the price you will sell your house for, but your model all of a sudden expects you to input the *day you sold the house*, or whether the transaction was *cash or through a bank order*. You can't possibly know these aspects, as you **haven't sold the house yet**.

Hence, we need to look for leakages in our data before proceeding with the analysis.
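
As a small illustration, one way to handle suspected leaks is to keep a list of suspects and drop only the ones that really exist in the dataframe. This is a sketch on a made-up table: `toy`, `DaySold` and `PaymentType` are hypothetical names, not columns of the competition data.

```python
import pandas as pd

# Hypothetical toy listing table; "DaySold" stands in for a leaked column
toy = pd.DataFrame({
    "Rooms": [3, 4, 2],
    "GarageCars": [1, 2, 0],
    "DaySold": [12, 3, 27],   # only known AFTER the sale -> leakage
})

# Columns we suspect of leaking information (assumed names)
suspected_leaks = ["DaySold", "PaymentType"]

# Keep only the suspects that actually exist in the dataframe
present_leaks = [col for col in suspected_leaks if col in toy.columns]
toy_clean = toy.drop(columns=present_leaks)

print("Dropped:", present_leaks)
print("Remaining:", list(toy_clean.columns))
```

Filtering the list first avoids a `KeyError` when a suspect is missing from one of the datasets.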

In [None]:
train.columns

In [None]:
# Leaked columns
leaked_columns = ['MoSold', 'YrSold', 'SaleType', 'SaleCondition']

# Remove in BOTH datasets
train.drop(labels = leaked_columns, axis = 1, inplace = True)
test.drop(labels = leaked_columns, axis = 1, inplace= True)

In [None]:
# # Check if the columns have disappeared
# train.columns

## 1.3 Data Preprocessing 🧹

There are many techniques that can be performed during this phase. Some of them are:
* checking for **missing** data
* analysis of the distributions and **patterns** in the data
* creating **new features** from the existing data (feature engineering)
* **encoding** the categorical features
* etc.
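
For the first item on the list, a quick "missing data report" is often a handy starting point. The sketch below runs on a made-up dataframe (the values are invented, and `toy` and `missing_pct` are names chosen just for this example).

```python
import numpy as np
import pandas as pd

# Made-up values, only the idea matters
toy = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "Rooms": [3, 4, 2, 5],
})

# Share of missing values per column, in percent, largest first
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)

# Show only the columns that have something missing
print(missing_pct[missing_pct > 0])
```

`isna().mean()` works because `True` counts as 1 and `False` as 0, so the mean of the mask is the fraction of missing entries.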

In [None]:
# Check the data types of the columns
train.dtypes

### #I. Looking at the Target column

In [None]:
# Check for missing values in the target column
train["SalePrice"].isna().sum()

In [None]:
# Plot of the target column
plt.figure(figsize = (16, 4))
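# histplot (with a KDE curve) replaces distplot, which newer seaborn versions deprecate
sns.histplot(x = train["SalePrice"], kde = True, stat = "density", color = "#FF7F50")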
plt.title("Distribution of Sales Price", fontsize=16);

In [None]:
# Storing the target variable separately
y = train["SalePrice"]
train.drop(columns=["SalePrice"], axis=1, inplace=True)

### #II. Numerical Data

> Numerical columns are the ones that are of type `int` or `float`.
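
As an alternative to comparing dtypes by hand, pandas offers `select_dtypes`, which picks up every integer and float width in one go. A minimal sketch on a made-up dataframe (`toy` and `numeric_toy_cols` are names assumed for this example):

```python
import pandas as pd

toy = pd.DataFrame({
    "Rooms": [3, 4],               # integers
    "LotFrontage": [65.0, 80.0],   # floats
    "Street": ["Pave", "Grvl"],    # text
})

# include="number" keeps every integer and float column
numeric_toy_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_toy_cols)
```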

In [None]:
# Select ONLY numerical columns
numerical_cols = [col for col in train.columns if 
                  train[col].dtype in ["int64", "float64"]]

Let's see if there are any missing values among these columns.

In [None]:
# Find the columns that have null values
na_count_n = train[numerical_cols].isna().sum()   # this becomes a Pandas Series
na_count_n[na_count_n > 0]                          # filter

**How should we make the imputation?**

> **Imputation** is a technique that replaces missing values in the data with another value (like the mean, median, mode, or the result of a more complex operation). *Be wary of the bias*!
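
To see where the bias can come from, here is a tiny sketch on invented numbers with one extreme outlier. The names `values`, `mean_fill` and `median_fill` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Made-up values with one extreme outlier and one missing entry
values = pd.Series([60.0, 65.0, 70.0, 75.0, 500.0, np.nan])

mean_fill = values.mean()      # pulled upwards by the outlier
median_fill = values.median()  # robust to the outlier

print(f"Mean:   {mean_fill}")
print(f"Median: {median_fill}")

# Replace the missing entry with the median
imputed = values.fillna(median_fill)
print(imputed.tolist())
```

Here the mean lands far above most of the observed values, while the median stays in the middle of them. That is one reason to look at the distribution before choosing a strategy.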

In [None]:
# Select columns with NAs
na_numeric_columns = na_count_n[na_count_n > 0].index
na_numeric_data = train[na_numeric_columns].dropna(axis=0)

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Plot the distribution of these variables
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

plt.figure(figsize = (16, 4))
sns.histplot(x = na_numeric_data[na_numeric_columns[0]], kde = True, stat = "density", color = "blue")
plt.title(f"{na_numeric_columns[0]}", fontsize=16);

In [None]:
plt.figure(figsize = (16, 4))
sns.histplot(x = na_numeric_data[na_numeric_columns[1]], kde = True, stat = "density", color = "purple")
plt.title(f"{na_numeric_columns[1]}", fontsize=16);

In [None]:
plt.figure(figsize = (16, 4))
sns.histplot(x = na_numeric_data[na_numeric_columns[2]], kde = True, stat = "density", color = "orange")
plt.title(f"{na_numeric_columns[2]}", fontsize=16);

#### Introducing "SimpleImputer"🥁

In [None]:
# Import SimpleImputer from sklearn
from sklearn.impute import SimpleImputer

In [None]:
# Prepare the Imputation Objects
median_impute = SimpleImputer(strategy = 'median')
mode_impute = SimpleImputer(strategy = 'most_frequent')
mean_impute = SimpleImputer(strategy = 'mean')

In [None]:
def apply_imputation(impute_object, column):
    '''Function that applies the imputation to the desired column.
    Returns the imputed values for train and test.'''

    # fit_transform LEARNS the fill value from train and applies it;
    # transform only REUSES that learned value, so test never influences it
    imputed_train = impute_object.fit_transform(X = train[[column]])
    imputed_test = impute_object.transform(X = test[[column]])
    
    # ravel() flattens the (n, 1) arrays into plain 1D columns
    return imputed_train.ravel(), imputed_test.ravel()
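
The difference between `fit_transform` and `transform` is easier to see on a toy example. The sketch below uses made-up arrays (`toy_train`, `toy_test`) and a separate imputer object, so it doesn't touch the real data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up toy columns (2D, as sklearn expects)
toy_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
toy_test = np.array([[np.nan], [10.0]])

toy_imputer = SimpleImputer(strategy="median")

# fit_transform: LEARN the median from the training data, then fill it in
filled_train = toy_imputer.fit_transform(toy_train)

# transform: REUSE the median learned on train, without looking at test
filled_test = toy_imputer.transform(toy_test)

print("Learned median:", toy_imputer.statistics_[0])
print("Filled test:", filled_test.ravel())
```

The missing test value is filled with the median of the *training* column, not with anything computed from the test data. This keeps the test set truly unseen.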

In [None]:
# Make the Imputation (one call per column returns both train and test values)
train['LotFrontage'], test['LotFrontage'] = apply_imputation(median_impute, 'LotFrontage')

train['MasVnrArea'], test['MasVnrArea'] = apply_imputation(mode_impute, 'MasVnrArea')

train['GarageYrBlt'], test['GarageYrBlt'] = apply_imputation(mean_impute, 'GarageYrBlt')

#### Scale the Data

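You may also want to scale the data, so that all the numerical variables end up on a comparable scale. Scaling can also speed up the computations of some models. The scaling cell below is left commented out, so this step is optional in this notebook.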

#### Introducing "StandardScaler"🥁

In [None]:
# # Import the Standard Scaler
# from sklearn.preprocessing import StandardScaler

# # Scale the data
# scaler = StandardScaler()
# scaled_matrix_train = pd.DataFrame(scaler.fit_transform(train[numerical_cols]),
#                                    columns=numerical_cols)
# scaled_matrix_test = pd.DataFrame(scaler.transform(test[numerical_cols]),
#                                    columns=numerical_cols)

# # Erase old data and append scaled one
# train.drop(columns=numerical_cols, axis=1, inplace=True)
# test.drop(columns=numerical_cols, axis=1, inplace=True)

# train = pd.concat([train, scaled_matrix_train], axis=1)
# test = pd.concat([test, scaled_matrix_test], axis=1)
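
To get a feel for what `StandardScaler` does, here is a minimal sketch on a made-up matrix with two features on very different scales (`toy_features` and `toy_scaler` are names assumed for this example).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
toy_features = np.array([[1.0, 1000.0],
                         [2.0, 2000.0],
                         [3.0, 3000.0]])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_features)

# Each column now has mean 0 and standard deviation 1
print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```

After scaling, both columns carry the same values, because each one is expressed as "how many standard deviations away from its own mean".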

### #III. Categorical Data

> Categorical data is stored in columns of type `object`.

In [None]:
# Select ONLY categorical columns
categ_cols = [col for col in train.columns if 
              train[col].dtype in ["object"]]

Let's see if there are any missing values among these columns.

In [None]:
# Find the columns that have null values
na_count_s = train[categ_cols].isna().sum()     # this becomes a Pandas Series
na_count_s[na_count_s > 0]                        # filter

In [None]:
# Drop all columns with more than 50% missing data
to_drop = na_count_s[na_count_s > train.shape[0]*0.5].index

train.drop(labels=to_drop, axis=1, inplace=True)
test.drop(labels=to_drop, axis=1, inplace=True)

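# Update the categ_cols (a list comprehension keeps the original column order,
# unlike a set difference, whose order can change between runs)
categ_cols = [col for col in categ_cols if col not in to_drop]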

Now all we have to do is **impute** the rest of the data and **encode** it (transform it from data type `object` to numbers).

In [None]:
# The computer doesn't know how to read letters :)
train[categ_cols]

#### Introducing "One Hot Encoder" 🥁

There are different types of encoding methodologies:
* **Label Encoding**: when you convert the categories (e.g.: day, night, noon) into numbers/labels (e.g.: 1, 2, 3).
* **One Hot Encoding**: when you create a *flag* for each category, with 1 where that category appears and 0 otherwise.
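
The two encodings can be compared side by side on the day / night / noon example. This is only a sketch: it uses sklearn's `OrdinalEncoder` for the label-style encoding (a swap made here because it is designed for feature columns, whereas `LabelEncoder` is meant for targets), and the `toy` dataframe is made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({"TimeOfDay": ["day", "night", "noon", "day"]})

# Label-style encoding: one integer per category (sorted alphabetically)
toy_ordinal = OrdinalEncoder()
labels = toy_ordinal.fit_transform(toy[["TimeOfDay"]])
print(labels.ravel())

# One hot encoding: one 0/1 flag column per category
toy_onehot = OneHotEncoder(handle_unknown="ignore")
flags = toy_onehot.fit_transform(toy[["TimeOfDay"]]).toarray()
print(toy_onehot.categories_[0])
print(flags)
```

Label-style encoding implies an order between the categories (here day < night < noon), which is rarely meaningful. One hot encoding avoids that at the cost of more columns.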

<img src="https://i.imgur.com/3725Gwc.png" width=600>

In [None]:
# Import the encoders from sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Create the imputation object
c_mode_impute = SimpleImputer(strategy = 'most_frequent')
# Create the encoder object
encoder = OneHotEncoder(handle_unknown='ignore')

In [None]:
# Make the Imputation
for column in categ_cols:
    # One call returns both the train and the test values
    train[column], test[column] = apply_imputation(c_mode_impute, column)
    

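# Perform One Hot Encoding
# String column names avoid mixing int and str names, which newer sklearn versions reject
# (get_feature_names_out needs sklearn >= 1.0; older versions call it get_feature_names)
encoded_train = pd.DataFrame(encoder.fit_transform(train[categ_cols]).toarray(),
                             columns=encoder.get_feature_names_out(categ_cols),
                             index=train.index)
encoded_test = pd.DataFrame(encoder.transform(test[categ_cols]).toarray(),
                            columns=encoder.get_feature_names_out(categ_cols),
                            index=test.index)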

# Drop old columns and replace with encoded ones
train.drop(columns=categ_cols, axis=1, inplace=True)
test.drop(columns=categ_cols, axis=1, inplace=True)

train = pd.concat([train, encoded_train], axis=1)
test = pd.concat([test, encoded_test], axis=1)

# 2. Model Training 💻⏰

In this Chapter we'll learn how to do the following:

1. **Prepare** the data to properly feed to the model
2. Create multiple **models** and assess which one is the best

### Libraries 📚

In [None]:
# Data Splitting
from sklearn.model_selection import train_test_split

# Models (or Algorithms)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Model Evaluation
from sklearn.metrics import mean_absolute_error

# Ignore warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

## 2.1 Data Validation

This step is extremely important. It ensures that during the training session you have some sort of **indicator on how your model is performing**.

It also helps you detect model **overfitting**: a model overfits when it memorises the training points, noise included, instead of learning patterns and generalities from the data. Fitting the points too well might lead to a very high score during training, but a very low score when you actually deploy the model into production:

<img src="https://i.imgur.com/7632QAP.png" width=600>

One solution to this is to *split* the training data into a **training** part and **validation** part.

Hence, you can **train** the model on the Training Data and then **predict** on the Validation Data. This way, you can use your labeled data not only for training, but also for validating how your model performs.
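
Here is a small sketch of how a validation split exposes overfitting. It runs on synthetic data (a noisy straight line, generated just for this example) and all the variable names are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy straight line
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = 3 * X_toy[:, 0] + rng.normal(0, 2, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=0)

# An unrestricted tree can memorise every training point
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

mae_seen = mean_absolute_error(y_tr, deep_tree.predict(X_tr))
mae_unseen = mean_absolute_error(y_va, deep_tree.predict(X_va))

print(f"MAE on training data:   {mae_seen:.2f}")
print(f"MAE on validation data: {mae_unseen:.2f}")
```

A large gap between the two numbers is the typical signature of overfitting: the model looks perfect on data it has seen and noticeably worse on data it hasn't.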

<img src="https://miro.medium.com/max/1552/1*Nv2NNALuokZEcV6hYEHdGA.png" width=600>

There are other variations of this technique, such as K Fold or Stratified K Fold, but we won't get into them in this notebook.
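
For the curious, this is roughly what K Fold looks like with sklearn. The rest of the notebook sticks to a single split, so treat this as an optional sketch on synthetic data; the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, generated only for this example
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(100, 2))
y_toy = 2 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(0, 1, size=100)

# 5 folds: each fold is used once for validation and 4 times for training
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# sklearn maximises scores, so MAE is reported as a negative number
neg_mae = cross_val_score(LinearRegression(), X_toy, y_toy, cv=folds,
                          scoring="neg_mean_absolute_error")
fold_mae = -neg_mae

print("MAE per fold:", fold_mae.round(2))
print("Average MAE:", round(fold_mae.mean(), 2))
```

Averaging over several folds gives a steadier estimate than a single split, at the price of training the model several times.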

In [None]:
# train -> training data
# test -> testing data (unlabeled)
# y -> target variable (we stored it from the train data)

# Target Variable: y
# Features -> X
X = train

# Split data further
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.3, 
                                                      random_state = 0)

## 2.2 Model Selection: trial and error

Now we try the 4 algorithms we imported earlier, one by one. Remember, there are many other algorithms that you can try for this problem. Head to the [sklearn documentation](https://scikit-learn.org/stable/) to find more.
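
Since every model follows the same fit / predict / score pattern, the trial and error can also be written as a loop. The sketch below is one possible way to do it, on synthetic data and with a hypothetical `candidates` dictionary. It is not how the cells that follow are organised.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption for illustration)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 3))
y_toy = X_toy[:, 0] ** 2 + 5 * X_toy[:, 1] + rng.normal(0, 1, size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=0)

candidates = {
    "Linear": LinearRegression(),
    "Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit every candidate the same way and collect its validation MAE
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = mean_absolute_error(y_va, model.predict(X_va))

# Lowest MAE first
for name, mae in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {mae:,.2f}")
```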

### #I. Linear Model

<img src="https://backlog.com/wp-blog-app/uploads/2019/12/Nulab-Gradient-descent-for-linear-regression-using-Golang-Blog.png" width=300>

In [None]:
# ~~~~~~~~~~~~~~~~
# LinearRegression
# ~~~~~~~~~~~~~~~~

linear_model = LinearRegression()

# Train the model on training data
linear_model.fit(X=X_train, y=y_train)

# Predict on validation data
predictions = linear_model.predict(X_valid)

# Get how well it performed
mae_linear = mean_absolute_error(y_valid, predictions)

print("Linear: {:,}".format(mae_linear))

The linear model looks like it's **underfitting** big time.

### #II. Decision Tree Regressor

<img src="https://miro.medium.com/max/2000/1*WerHJ14JQAd3j8ASaVjAhw.jpeg" width=300>

In [None]:
# ~~~~~~~~~~~~~~~~~~~~~
# DecisionTreeRegressor
# ~~~~~~~~~~~~~~~~~~~~~

tree_model = DecisionTreeRegressor()

# Train the model on training data
tree_model.fit(X=X_train, y=y_train)

# Predict on validation data
predictions = tree_model.predict(X_valid)

# Get how well it performed
mae_tree = mean_absolute_error(y_valid, predictions)

print("Tree: {:,}".format(mae_tree))

The Decision Tree is MUCH better, with an error of only 24,823 (the exact figure can change slightly between runs, because no `random_state` is set).

### #III. Random Forest Regressor

<img src="https://upload.wikimedia.org/wikipedia/commons/7/76/Random_forest_diagram_complete.png" width=300>

In [None]:
# ~~~~~~~~~~~~~~~~~~~~~
# RandomForestRegressor
# ~~~~~~~~~~~~~~~~~~~~~

rf_model = RandomForestRegressor()

# Train the model on training data
rf_model.fit(X=X_train, y=y_train)

# Predict on validation data
predictions = rf_model.predict(X_valid)

# Get how well it performed
mae_rf = mean_absolute_error(y_valid, predictions)

print("Random Forest: {:,}".format(mae_rf))

Again, a very nice improvement. The Random Forest is performing better than the Decision Tree.

### #IV. XGBoost

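> Still an ensemble of trees, but more complicated: the trees are built one after another, each one correcting the errors of the previous ones (gradient boosting). You can find [documentation here](https://xgboost.readthedocs.io/en/latest/tutorials/model.html).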
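
The core idea behind boosting can be sketched in a few lines: every new small tree is trained on what the ensemble still gets wrong. The code below is a simplified, hand-rolled illustration with sklearn trees on synthetic data. It is not XGBoost's actual implementation, which adds regularisation and many optimisations on top.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, generated only for this example
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) * 10 + rng.normal(0, 1, size=200)

learning_rate = 0.1

# Start from a constant guess: the mean of the target
boosted_pred = np.full_like(y_toy, y_toy.mean())
mae_start = mean_absolute_error(y_toy, boosted_pred)

for _ in range(100):
    # Each small tree is trained on what the ensemble still gets wrong
    residuals = y_toy - boosted_pred
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    small_tree.fit(X_toy, residuals)
    # Add a damped version of its correction to the running prediction
    boosted_pred += learning_rate * small_tree.predict(X_toy)

mae_end = mean_absolute_error(y_toy, boosted_pred)
print(f"Training MAE before boosting: {mae_start:.2f}")
print(f"Training MAE after boosting:  {mae_end:.2f}")
```

The `learning_rate` plays the same role as in `XGBRegressor`: smaller values make each tree's correction more cautious, so more trees are needed.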

In [None]:
# ~~~~~~~~~~~~
# XGBRegressor
# ~~~~~~~~~~~~

xgb_model = XGBRegressor(n_estimators=600)

# Train the model on training data
xgb_model.fit(X=X_train, y=y_train)

# Predict on validation data
predictions = xgb_model.predict(X_valid)

# Get how well it performed
mae_xgb = mean_absolute_error(y_valid, predictions)

print("XGBoost: {:,}".format(mae_xgb))

# 3. There's more

There's soooooo much more to this. I'll leave here some **names** of techniques or algorithms that you might want to check out in order to further improve this score:

* Feature Engineering ([check out this notebook](https://www.kaggle.com/gunesevitan/titanic-advanced-feature-engineering-tutorial))
* Fancy Impute (or any other Imputation technique)
* PCA before model training
* K Fold or Stratified K Fold
* Model Selection (try other models)
* Hyperparameter Tuning
* many many many more, you just need to be curious 🙃
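
As a taste of hyperparameter tuning, here is a minimal Grid Search sketch with sklearn's `GridSearchCV`. The data is synthetic and the grid is a deliberately tiny, hypothetical one. It is not the search that was used for this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, generated only for this example
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(150, 3))
y_toy = 4 * X_toy[:, 0] + X_toy[:, 1] ** 2 + rng.normal(0, 1, size=150)

# Hypothetical, deliberately tiny grid; real searches are usually larger
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best cross-validated MAE:", round(-search.best_score_, 2))
```

Grid Search simply tries every combination in the grid (4 here) with cross-validation and keeps the best one, so the cost grows quickly with the size of the grid.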

# 4. Submit to Competition

Once you're ready, you can use your model to predict on the `test` dataset and submit to the [Iowa Housing Prices Competition](https://www.kaggle.com/c/home-data-for-ml-course).

If you want to learn more about Machine Learning or just take a deep dive into this tutorial, check out the [Kaggle Courses](https://www.kaggle.com/learn/overview) in the Machine Learning Series:
* Intro to ML
* Intermediate ML
* ML Explainability

In [None]:
# This model is created after Grid Search
### check out version 4 of this notebook to see how I did it :)
# Deprecated aliases from older xgboost releases are folded into their replacements:
# nthread=4 -> n_jobs=4, seed=27 -> random_state=27; silent=None and missing=None are dropped
my_model = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                        colsample_bynode=1, colsample_bytree=0.8, gamma=0.0,
                        importance_type='gain', learning_rate=0.005, 
                        max_delta_step=0, max_depth=4, min_child_weight=1, 
                        n_estimators=5000, n_jobs=4, 
                        objective='reg:squarederror', random_state=27,
                        reg_alpha=1, reg_lambda=1, scale_pos_weight=1,
                        subsample=0.8, verbosity=1)

# Fit on the entire training data
my_model.fit(X, y)

In [None]:
# Predict and Submit
final_predictions = my_model.predict(test)

# Import test to get ID
X_test = pd.read_csv("../input/iowa-house-prices/test.csv", index_col ='Id')

output = pd.DataFrame({'Id': X_test.index, 'SalePrice' : final_predictions})
output.to_csv('submission_final.csv', index = False)

> **Happy Data Sciencin'!**

<img src="https://i.imgur.com/cUQXtS7.png">

# Specs on how I prepped & trained ⌨️🎨
### (*locally*)
* Z8 G4 Workstation 🖥
* 2 CPUs & 96GB Memory 💾
* NVIDIA Quadro RTX 8000 🎮
* RAPIDS version 0.17 🏃🏾‍♀️