# Housing Prices Competition for Kaggle Learn Users
This notebook is meant as an entry in the competition linked below:

https://www.kaggle.com/competitions/home-data-for-ml-course/overview

As preparation for this competition, I completed the following two courses:

1. Intermediate Machine Learning, https://www.kaggle.com/learn/intermediate-machine-learning



## Competition Details
---

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

**Practice Skills**

- Creative feature engineering
- Advanced regression techniques like random forest and gradient boosting


**Acknowledgments**

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.

**Goal**

It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.

**Metric**

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
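
As a rough illustration of the metric described above, the sketch below computes the RMSE between log predictions and log observed prices. The function name `log_rmse` and the two example houses are made up for illustration and are not part of the competition code.

```python
import numpy as np

def log_rmse(y_true, y_pred):
    """RMSE between log(predicted) and log(observed) values (hypothetical helper)."""
    log_diff = np.log(np.asarray(y_pred, dtype=float)) - np.log(np.asarray(y_true, dtype=float))
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Two made-up houses, each over-predicted by 10%
actual = [100_000, 500_000]
predicted = [110_000, 550_000]

score = log_rmse(actual, predicted)
print(round(score, 4))  # 0.0953, i.e. ln(1.1)
```

The raw errors differ (10,000 versus 50,000), but in log space both houses contribute the same error, which is the point of taking logs.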

## Table of Contents
---
1. Mount, Import and Load
2. Self Defined Functions
3. Data Processing
4. Model Tuning
5. Final Model
6. Score

## Mount, Import and Load
---

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install scikit-optimize
!pip install category_encoders

Collecting scikit-optimize
  Downloading scikit_optimize-0.10.0-py2.py3-none-any.whl (107 kB)
Collecting pyaml>=16.9 (from scikit-optimize)
  Downloading pyaml-23.12.0-py3-none-any.whl (23 kB)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-23.12.0 scikit-optimize-0.10.0
Collecting category_encoders
  Downloading category_encoders-2.6.3-py2.py3-none-any.whl (81 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.3


In [3]:
#Data Manipulation Libraries
import numpy as np
import pandas as pd

#Data Processing Libraries
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from category_encoders.target_encoder import TargetEncoder

#Testing Model Libraries
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

#XGBoost
from xgboost import XGBRegressor

In [4]:
#Load Training Data
df = pd.read_csv('/content/drive/MyDrive/Kaggle/HousingPricesCompetitionForKaggleLearnUsers/train.csv', index_col= 'Id')

#Load Testing Data
X_test = pd.read_csv('/content/drive/MyDrive/Kaggle/HousingPricesCompetitionForKaggleLearnUsers/test.csv', index_col = 'Id')

## Self Defined Functions
---

In [5]:
def score_model(model, X_t, X_v, y_t, y_v):
  """Fits model on the training split and prints its root mean squared error (RMSE) on the validation split"""
  model.fit(X_t, y_t)
  preds = model.predict(X_v)
  print(f"{model} has an RMSE of {np.sqrt(mean_squared_error(y_v, preds))}")



def train_and_predict(model, X_t, X_v, y_t):
  """Trains model on X_t and y_t, predicts for X_v, and writes the predictions to submission.csv"""
  model.fit(X_t, y_t)

  predictions = model.predict(X_v)

  output = pd.DataFrame({'Id': list(X_v.index), 'SalePrice': predictions})
  output = output.set_index('Id')
  output.to_csv('submission.csv')

  print(output)
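
`score_model` expects a training split and a validation split, but the notebook never creates one. The sketch below shows one way to build such a split and compute a validation RMSE. It assumes scikit-learn's `train_test_split`, uses `LinearRegression` as a stand-in model, and runs on synthetic data with made-up column values rather than the competition files.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (values are made up)
rng = np.random.default_rng(8)
X_demo = pd.DataFrame({
    "LotArea": rng.uniform(5000, 15000, 200),
    "YearBuilt": rng.integers(1950, 2010, 200),
})
y_demo = 20 * X_demo["LotArea"] + 500 * (X_demo["YearBuilt"] - 1950) + rng.normal(0, 5000, 200)

# Hold out 25% of the rows for validation
X_t, X_v, y_t, y_v = train_test_split(X_demo, y_demo, test_size=0.25, random_state=8)

# Same steps score_model performs: fit, predict, take the square root of the MSE
model = LinearRegression()
model.fit(X_t, y_t)
preds = model.predict(X_v)
rmse = np.sqrt(mean_squared_error(y_v, preds))
print(f"{model} has a validation RMSE of {rmse:.0f}")
```

With the real data, the same four split variables could be passed straight to `score_model`.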

## Data Processing
---

In [6]:
#Defines X and y for our models
y = df['SalePrice']
X = df.drop('SalePrice', axis = 1)

## Model Tuning
---

In [7]:
#Sets encoder to be used
estimators = [
    ('encoder', TargetEncoder()),
    ('clf', XGBRegressor(random_state=8)) # can customize objective function with the objective parameter
]
pipe = Pipeline(steps=estimators)
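
To make the encoder step less of a black box, here is a simplified sketch of the idea behind target encoding: each category is replaced by the mean of the target for that category. This is plain pandas on a made-up toy table, not the `category_encoders` implementation, which additionally smooths each category mean toward the overall mean.

```python
import pandas as pd

# Toy data with made-up values, for illustration only
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B", "C"],
    "SalePrice": [100000, 120000, 200000, 220000, 240000, 150000],
})

# Mean SalePrice per category (unsmoothed target encoding)
category_means = toy.groupby("Neighborhood")["SalePrice"].mean()

# Replace each category label with its mean target value
toy["Neighborhood_encoded"] = toy["Neighborhood"].map(category_means)
print(toy)
```

The result is a single numeric column per categorical feature, which keeps the feature count small compared with one-hot encoding.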


In [8]:
#Set ranges for parameters to be searched
search_space = {
    'clf__n_estimators' : Integer(100, 2000),
    'clf__max_depth' : Integer(2,8),
    'clf__learning_rate': Real(0.001, 1.0, prior='log-uniform'),
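    'clf__subsample': Real(0.1, 1.0), # XGBoost documents the valid range as (0, 1]; the 0.1 lower bound is an assumed choice
    'clf__colsample_bytree': Real(0.1, 1.0), # same (0, 1] constraint; the 0.1 lower bound is an assumed choice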
    'clf__colsample_bylevel': Real(0.5, 1.0),
    'clf__colsample_bynode' : Real(0.5, 1.0),
    'clf__reg_alpha': Real(0.0, 50.0),
    'clf__reg_lambda': Real(0.0, 10.0),
    'clf__gamma': Real(0.0, 10.0)
}

opt = BayesSearchCV(pipe, search_space, cv=4, n_iter=100, scoring='neg_root_mean_squared_error', random_state=8)
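
The learning rate range above uses a log-uniform prior. The sketch below illustrates with numpy what that means. It is not skopt's own sampler, only an illustration of the distribution: the exponent is drawn uniformly, so each power-of-ten interval receives roughly the same share of samples.

```python
import numpy as np

rng = np.random.default_rng(8)
low, high = 0.001, 1.0

# Log-uniform: draw the exponent uniformly, then exponentiate
samples = 10 ** rng.uniform(np.log10(low), np.log10(high), size=100_000)

# Share of samples falling in each decade: [0.001, 0.01), [0.01, 0.1), [0.1, 1.0]
decade_edges = [0.001, 0.01, 0.1, 1.0]
counts, _ = np.histogram(samples, bins=decade_edges)
fractions = counts / counts.sum()
print(fractions)  # each entry is close to one third
```

A plain uniform prior over the same range would place about 90% of the samples above 0.1, so small learning rates would rarely be tried.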

The code cell below searches for the best parameters. It takes about an hour to run, so leave it commented out unless you are tuning the model.

In [None]:
# opt.fit(X, y)
# opt.best_params_
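
Because the search runs over a pipeline, the keys in `opt.best_params_` carry the step prefix `clf__`. The sketch below shows one way to strip that prefix so the values can be passed to the regressor directly. The dictionary here is an illustrative placeholder, not actual search output.

```python
# Placeholder standing in for opt.best_params_ (values are illustrative only)
best_params = {"clf__max_depth": 4, "clf__subsample": 0.5}

# Drop the pipeline step prefix so the keys match the regressor's argument names
prefix = "clf__"
model_params = {
    key[len(prefix):]: value
    for key, value in best_params.items()
    if key.startswith(prefix)
}
print(model_params)  # {'max_depth': 4, 'subsample': 0.5}

# The cleaned dictionary could then be unpacked, e.g. XGBRegressor(**model_params)
```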

## Final Model
---

In [9]:
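# Parameters based on the search above. Note that n_estimators = 3000 lies outside the searched range (100 to 2000), so it was presumably adjusted by hand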
final_model = XGBRegressor(
    colsample_bylevel = 1,
    colsample_bynode = 0.5,
    colsample_bytree =  0.5,
    gamma =  4.081132340756288,
    learning_rate =  0.00904526008269213,
    max_depth = 4,
    n_estimators = 3000,
    reg_alpha =  20,
    reg_lambda = 0,
    subsample = 0.5)

In [10]:
#Predicts and outputs the predictions as submission.csv
my_pipeline = Pipeline(steps=[('encoder', TargetEncoder()),
                              ('model', final_model)
                             ])

#This function trains the model, predicts the values of y_test, and then writes them to submission.csv
train_and_predict(my_pipeline, X, X_test, y)

          SalePrice
Id                 
1461  124278.796875
1462  156541.109375
1463  184426.656250
1464  194690.187500
1465  184657.875000
...             ...
2915   84330.015625
2916   80403.914062
2917  163890.859375
2918  118695.085938
2919  207813.562500

[1459 rows x 1 columns]


## Score
---
When submitted to the competition, our predictions received an error score of 13,049. At the time of submission, this earned me 421st place out of 86,278 entries, which puts my entry in the top 0.5%.
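
The percentile quoted above follows from the rank and the number of entries. A quick check of the arithmetic:

```python
rank = 421
entries = 86278

# Share of the leaderboard at or above this rank, as a percentage
top_percent = rank / entries * 100
print(f"Top {top_percent:.2f}%")  # Top 0.49%
```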

