<a href="https://colab.research.google.com/github/DoubleCyclone/house-price-prediction/blob/main/notebooks/house_price_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I will be working with the "**House Prices - Advanced Regression Techniques**" dataset today to perform
*   Exploratory Data Analysis
*   Data Preprocessing
*   Feature Engineering
*   Model Building
*   Evaluation
*   and Visualisation

First things first, I am going to mount Google Drive so that I can upload the dataset there and access it easily from the notebook.

The dataset (and the competition) is available on Kaggle: [Kaggle Competition/Dataset Link](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview)

In [None]:
from google.colab import drive
import pandas as pd

# Mount the Google Drive
drive.mount('/content/drive')

In [None]:
# Let's import the other dependencies as well
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Then I will read train.csv and test.csv, which are the training and test datasets, and display the first few rows of each.

In [None]:
# Load the training and test datasets, examine their shapes and contents
data_train = pd.read_csv('/content/drive/MyDrive/Colab_Materials/House_Price_Estimation/train.csv')
data_test = pd.read_csv('/content/drive/MyDrive/Colab_Materials/House_Price_Estimation/test.csv')
# I will drop the ID columns as they are not used for training models
data_train = data_train.drop('Id', axis=1)
data_test = data_test.drop('Id', axis=1)

print(f"Shape of the train dataset = {data_train.shape}")
data_train.head()

In [None]:
print(f"Shape of the test dataset = {data_test.shape}")
data_test.head()
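The two shapes printed above differ in their column count, and the columns behind that difference can be listed programmatically. This is a minimal sketch on two hypothetical stand-in frames (`toy_train` and `toy_test`, with made-up values), not on the real files:

```python
import pandas as pd

# Hypothetical stand-ins for data_train and data_test (values are made up)
toy_train = pd.DataFrame({"LotArea": [1000, 2000], "SalePrice": [100000, 150000]})
toy_test = pd.DataFrame({"LotArea": [1500, 2500]})

# Columns that appear in the training frame but not in the test frame
only_in_train = toy_train.columns.difference(toy_test.columns).tolist()
print(only_in_train)
```

Applied to the real frames, the same call would be `data_train.columns.difference(data_test.columns)`.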

The two datasets contain almost the same number of rows. The train dataset also has one extra column, **SalePrice**, which holds the sale price of each house and serves as the label the model will learn from. How about getting an idea of its distribution in the dataset?

In [None]:
# Use Series.describe to generate summary statistics for the numerical SalePrice column
print(data_train['SalePrice'].describe())

# Plot the distribution of the SalePrice column
# (sns.distplot is deprecated; histplot with kde=True and stat='density' gives a comparable plot)
plt.figure(figsize=(5, 5))
sns.histplot(data_train['SalePrice'], color='g', bins=100, kde=True, stat='density', alpha=0.4);

From the output of `describe` and the graph, we can see that the mean sale price is around 180000 and that the distribution has a long tail towards expensive houses. How do I decide whether the standard deviation is low or not? First, I calculate the range of the data, which is **max - min**; in this case **755000 - 34900 = 720100**. The standard deviation is roughly a tenth of that range, which looks small at first glance. The range is set entirely by the two most extreme houses, though, and a standard deviation can never exceed half the range, so this comparison alone does not say much. A more common yardstick is the ratio of the standard deviation to the mean (the coefficient of variation). Here that ratio is roughly 0.4, so sale prices typically deviate from the average by a little under half of it. I would call that moderate rather than low variability, and the right skew of the plot points the same way. <br><br>
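One way to put a number on "low" or "high" variability is to relate the standard deviation to the mean (the coefficient of variation) and to check the skewness alongside the range. This is a minimal sketch on a hypothetical toy series, not the real `SalePrice` column; `spread_summary` and `toy_prices` are names made up for illustration:

```python
import pandas as pd

def spread_summary(values: pd.Series) -> dict:
    """Summarise the spread of a numeric column relative to its centre."""
    mean = values.mean()
    std = values.std()  # sample standard deviation (ddof=1), as reported by describe()
    return {
        "range": values.max() - values.min(),
        "std": std,
        "cv": std / mean,       # coefficient of variation: spread relative to the mean
        "skew": values.skew(),  # a positive value indicates a longer right tail
    }

# Hypothetical toy prices, NOT taken from the dataset
toy_prices = pd.Series([100_000, 120_000, 130_000, 150_000, 400_000])
summary = spread_summary(toy_prices)
print(summary)
```

On the notebook's data, the call would be `spread_summary(data_train['SalePrice'])`. There is no universal cut-off for a "low" coefficient of variation, so the result is best read alongside the histogram.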
Now let's see what type of data is stored in the training dataset.

In [None]:
# Print the list of unique data types
list(set(data_train.dtypes.tolist()))

Now let's store the numerical data in a new DataFrame so that we can use it easily.

In [None]:
# Create a new DataFrame for numerical data only
train_num = data_train.select_dtypes(include = ['float64', 'int64'])
train_num.head()

While we're at it, let's plot the distributions of all these numerical features at once.

In [None]:
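# Draw one histogram per numerical column (the trailing semicolon hides the returned array of axes)
train_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);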