# House Prices - Advanced Regression Techniques

In this notebook we will analyse the data and experiment with it.

*To view the full code, refer to `kaggle.ipynb`. This notebook was submitted to the competition.*

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Loading the Data

In [2]:
train_dataset_path = "data/train.csv"
test_dataset_path = "data/test.csv"

In [3]:
train_dataset = pd.read_csv(train_dataset_path)

test_dataset = pd.read_csv(test_dataset_path)

# Analysing the Data

In [4]:
train_dataset.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [5]:
test_dataset.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


## Missing Values

In [6]:
from helper_functions import pd_to_csv

# Saving Missing Values in train_dataset
pd_to_csv(train_dataset.isna().sum(), "raw_data/train_missing_values.csv")

# Saving Missing Values in test_dataset
pd_to_csv(test_dataset.isna().sum(), "raw_data/test_missing_values.csv")

'DONE'

In [7]:
# Viewing the Missing Values
train_missing_values = pd.read_csv("raw_data/train_missing_values.csv")

train_missing_values

Unnamed: 0.1,Unnamed: 0,0
0,Id,0
1,MSSubClass,0
2,MSZoning,0
3,LotFrontage,259
4,LotArea,0
...,...,...
76,MoSold,0
77,YrSold,0
78,SaleType,0
79,SaleCondition,0
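Since most columns report zero missing entries, it can help to filter the counts down to the non-zero ones. The snippet below is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the competition data), and the helper name `missing_summary` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.Series:
    """Return per-column NaN counts, keeping only columns with at least one NaN."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
})

summary = missing_summary(toy)
print(summary)
```

Applied to `train_dataset`, the same helper would list only the columns that need attention, largest count first.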


## DataTypes

In [8]:
train_dataset.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 81, dtype: object

*Here, we can see that not all columns are integer or float: some are `object` (i.e. strings). These need to be converted to numbers before the data can be converted to tensors.*
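One common way to turn string columns into numbers is integer (label) encoding. The sketch below uses `pandas.factorize` on a toy frame. It is an assumption about one possible approach, not the logic inside `transform_csv`, and the helper name `encode_object_columns` is hypothetical.

```python
import pandas as pd


def encode_object_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where every non-numeric column is replaced by integer codes."""
    encoded = frame.copy()
    for column in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[column]):
            # factorize assigns 0, 1, 2, ... in order of first appearance;
            # missing values receive the code -1
            codes, _ = pd.factorize(encoded[column])
            encoded[column] = codes
    return encoded


# Toy data for illustration only
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RH", "RL"],
    "LotShape": ["Reg", "IR1", "IR1", None],
    "LotArea": [8450, 9600, 11250, 9550],
})

encoded = encode_object_columns(toy)
print(encoded)
```

Integer codes impose an arbitrary order on the categories. Tree-based models usually tolerate this, but for linear models one-hot encoding may be a safer choice.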

## Duplicated Values

In [9]:
print("Training Dataset Duplicated Values")
print(train_dataset.duplicated().sum())

print("\n----------------\n")

print("Test Dataset Duplicated Values")
print(test_dataset.duplicated().sum())

Training Dataset Duplicated Values
0

----------------

Test Dataset Duplicated Values
0


**NOTE**: After reviewing the data and `data_description`, I concluded that no data is actually missing: everything is given, and we only need to convert the values to the desired format.

# Pre-Processing Data

This step transforms the data into the form best suited to our ML model.

For details on how this was done, please refer to the [Transform Data Guide](https://github.com/adityajideveloper/kaggle-competition/house-prices/transform_data.md).

In [10]:
from helper_functions import transform_csv

transform_csv(train_dataset, "raw_data/train.csv")
transform_csv(test_dataset, "raw_data/test.csv")

## Getting Data Ready

In [11]:
df = pd.read_csv("raw_data/train.csv")
df_test = pd.read_csv("raw_data/test.csv")

df.head(5)

Unnamed: 0.1,Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,0,1,60,5,65.0,8450,2,0.0,0,0,...,0,0.0,0.0,0.0,0,2,2008,0,0,208500
1,1,2,20,5,80.0,9600,2,0.0,0,0,...,0,0.0,0.0,0.0,0,5,2007,0,0,181500
2,2,3,60,5,68.0,11250,2,0.0,1,0,...,0,0.0,0.0,0.0,0,9,2008,0,0,223500
3,3,4,70,5,60.0,9550,2,0.0,1,0,...,0,0.0,0.0,0.0,0,2,2006,0,1,140000
4,4,5,60,5,84.0,14260,2,0.0,1,0,...,0,0.0,0.0,0.0,0,12,2008,0,0,250000


In [12]:
# Removing the saved index column and Id
df = df.drop(df.columns[0], axis=1)
df = df.drop("Id", axis=1)

In [13]:
# Splitting the Data

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)

In [14]:
# Further splitting data into X and y
X_train = train_set.drop("SalePrice", axis=1).to_numpy()
y_train = train_set['SalePrice'].to_numpy()

X_test = test_set.drop("SalePrice", axis=1).to_numpy()
y_test = test_set['SalePrice'].to_numpy()

In [15]:
# This function was written as described in Competition Evaluation Overview
# https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/evaluation

def acc_fn(y_pred, y_true):
    # Root mean squared error between the logs of the true and predicted prices
    log_error = np.log(y_true) - np.log(y_pred)
    square_error = np.square(log_error)
    MSE = np.mean(square_error)
    RMSE = np.sqrt(MSE)

    return RMSE
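Because the metric compares logarithms, it penalises relative rather than absolute error: predicting 10% too high costs the same for a cheap house as for an expensive one. The sketch below illustrates this with made-up prices. The function name `rmsle` is a stand-in that mirrors `acc_fn`.

```python
import numpy as np


def rmsle(y_pred, y_true):
    """Root mean squared error between log(target) and log(prediction)."""
    log_error = np.log(y_true) - np.log(y_pred)
    return np.sqrt(np.mean(np.square(log_error)))


# Made-up prices for illustration only
y_true = np.array([100_000.0, 500_000.0])

# Both predictions are 10% too high, so each log error has magnitude log(1.1)
y_pred = y_true * 1.1
score = rmsle(y_pred, y_true)
print(score)

# A perfect prediction gives a score of zero
perfect = rmsle(y_true, y_true)
print(perfect)
```

Lower is better, so despite its name `acc_fn` returns an error rather than an accuracy.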

# Creating the Gradient Boosting Model

In [16]:
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    criterion="squared_error",
    random_state=42,
    learning_rate=0.0001,
    loss="squared_error",
    n_estimators=100_000,
)

# Training the Model

In [17]:
model.fit(X_train, y_train)

In [18]:
print(f"Train Score -> {acc_fn(model.predict(X_train), y_train)}")
print(f"Test Score -> {acc_fn(model.predict(X_test), y_test)}")

Train Score -> 0.08403201759971402
Test Score -> 0.13972577454415008


# Submission

In [19]:
# Dropping the saved index column and Id to match the training features
test_data = df_test.drop(df_test.columns[0], axis=1).drop("Id", axis=1).to_numpy()
test_ids = df_test["Id"].to_numpy()

print(f"Total Ids -> {len(test_ids)}")

# Opening CSV file
import csv

# newline="" stops the csv module from writing blank rows on Windows
with open("raw_data/submission.csv", "w", newline="") as f:
    writer = csv.writer(f)

    writer.writerow(["Id", "SalePrice"])

    # Writing one row per test Id with its predicted price
    for test_id, sale_price in zip(test_ids, model.predict(test_data)):
        writer.writerow([test_id, sale_price])

Total Ids -> 1459
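Before uploading, it can be worth reading the file back and checking its shape. The sketch below writes a tiny made-up submission to a temporary file and validates it. The helper name `check_submission` and the checks it performs are assumptions for illustration, based only on the `Id`/`SalePrice` header used above.

```python
import os
import tempfile

import pandas as pd


def check_submission(path: str, expected_rows: int) -> bool:
    """Return True if the file has the expected columns and row count,
    contains no NaN, and every predicted price is positive."""
    submission = pd.read_csv(path)
    return (
        list(submission.columns) == ["Id", "SalePrice"]
        and len(submission) == expected_rows
        and not submission.isna().any().any()
        and bool((submission["SalePrice"] > 0).all())
    )


# Toy submission for illustration only
toy = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [120000.0, 155000.5, 180250.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "submission.csv")
    toy.to_csv(toy_path, index=False)
    ok = check_submission(toy_path, expected_rows=3)
    wrong_count = check_submission(toy_path, expected_rows=4)

print(ok, wrong_count)
```

For the real file, the call would look like `check_submission("raw_data/submission.csv", expected_rows=len(test_ids))`.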
