![AIRBNB](https://www.stevenridercpa.au/wp-content/uploads/2022/09/airbnb-tax.jpeg)

# Airbnb - Price Prediction
-------

## Table of Contents

1. [Introduction](##Introduction)
    - [Problem Statement](###Problem-Statement)
    - [Objective](###Objective)
    - [Dataset Overview](###Dataset-Overview)
2. [Setup](##Setup)
3. [Data Loading and Exploration](##Data-Loading-and-Exploration)
    - [Loading the Dataset](###Loading-the-Dataset)
    - [Exploratory Data Analysis (EDA)](###Exploratory-Data-Analysis-(EDA))
    - [Data Preprocessing](###Data-Preprocessing)
4. [Baseline Model](##Baseline-Model)
    - [Model Architecture](###Model-Architecture)
    - [Model Compilation](###Model-Compilation)
    - [Model Training](###Model-Training)
    - [Model Evaluation](###Model-Evaluation)
5. [Hyperparameter Tuning](##Hyperparameter-Tuning)
    - [Grid Search Setup](###Grid-Search-Setup)
    - [Execution of Grid Search](###Execution-of-Grid-Search)
    - [Analysis of Grid Search Results](###Analysis-of-Grid-Search-Results)
6. [Advanced Techniques Implementation](##Advanced-Techniques-Implementation)
    - [Batch Normalization](###Batch-Normalization)
    - [Gradient Normalization and/or Gradient Clipping](###Gradient-Normalization-and/or-Gradient-Clipping)
7. [Final Model](##Final-Model)
    - [Model Architecture](###Model-Architecture)
    - [Model Compilation](###Model-Compilation)
    - [Model Training](###Model-Training)
    - [Model Evaluation](###Model-Evaluation)
8. [Participation in Kaggle Competition](##Participation-in-Kaggle-Competition)
    - [Submission Preparation](###Submission-Preparation)
    - [Submission to Kaggle](###Submission-to-Kaggle)
9. [Conclusion](##Conclusion)
    - [Summary](###Summary)
    - [Future Work](###Future-Work)
10. [References](##References)
11. [Appendices](##Appendices)
    - [Supplementary Scripts](###Supplementary-Scripts)
    - [Additional Resources](###Additional-Resources)

## Introduction

### Problem Statement

A dataset containing information about accommodations listed on AirBnB, together with their respective prices, is provided. The training set is approximately 1.5 GB and the test set approximately 0.5 GB. The data includes 84 predictor variables, which can be used in whatever way is deemed appropriate.

The objective is to assign the correct price to the listed accommodations. 

In addition to the dataset, you are provided with this notebook containing the data loading script and a baseline model corresponding to a feed forward architecture.


### Objective

The primary objective of this project is to develop a predictive model capable of accurately estimating the rental prices of accommodations listed on AirBnB. By leveraging a dataset consisting of various attributes and historical pricing data of listed accommodations, we aim to build a model that minimizes the error in price prediction. 

The success of this endeavor will be evaluated based on the Mean Absolute Error (MAE) metric, with the goal of achieving an MAE of less than 70 points (based on the [participation in Kaggle](##Participation-in-Kaggle-Competition)). This objective aligns with the criteria set forth in the associated Kaggle competition, which serves as a structured platform for benchmarking the performance of our model against others.
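
To make the metric concrete, here is a minimal sketch of how MAE can be computed with NumPy. The helper name `mean_absolute_error` and the toy prices are illustrative assumptions, not values from the dataset or the competition's scoring code.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Toy values for illustration only (not taken from the dataset)
true_prices = [100.0, 250.0, 80.0]
predicted_prices = [110.0, 200.0, 95.0]

mae = mean_absolute_error(true_prices, predicted_prices)
print(f"MAE: {mae:.2f}")
```

Because MAE is expressed in the same units as `Price`, a target of less than 70 means predictions should be off by less than 70 price units on average.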

Several tasks have been outlined to aid in the accomplishment of this objective:

- Conduct thorough [exploratory data analysis](###Exploratory-Data-Analysis-(EDA)) to understand the underlying patterns and characteristics of the data.
- [Preprocess the data](###Data-Preprocessing) to ensure it is well-suited for training machine learning models.
- Establish a [baseline model](##Baseline-Model) using a feed-forward neural network architecture, against which further models and techniques can be compared.
- Engage in a methodical [grid search](###Grid-Search-Setup) to identify the optimal hyperparameters for our model, utilizing tools such as [Weights and Biases](https://wandb.ai/site) for systematic exploration and logging.
- Incorporate advanced techniques including [Batch Normalization](###Batch-Normalization) and [Gradient Normalization/Clipping](###Gradient-Normalization-and/or-Gradient-Clipping) to enhance the learning process and stability of the model.
- Continuously evaluate the performance of the model, iterating on the architecture and training process as necessary to inch closer to the desired MAE goal.

By adhering to a structured approach encompassing data exploration, preprocessing, model building, hyperparameter tuning, and advanced technique implementation, we aspire to develop a robust model that stands up to the competition standards and possibly exceeds them, thereby moving closer to solving the real-world problem of accurate price prediction in the peer-to-peer accommodation rental domain.


### Dataset Overview

The dataset provided for this project comprises information pertaining to accommodations listed on AirBnB, captured across 85 different attributes or columns. This dataset is housed within a Pandas DataFrame and totals 326,287 entries, extending from index 0 to 326,286. Below is a high-level summary of the dataset's structure and contained attributes:

- **Entries:** 326,287
- **Attributes:** 85
- **Target Variable:** `Price`
- **Data Types:** 
    - Integer: 2
    - Float: 31
    - Object: 52
- **Memory Usage:** 211.6+ MB

#### Attribute Highlights

1. **Identifier Attributes:**
   - `id`: Unique identifier for each listing.
   - `Host ID`: Unique identifier for each host.

2. **Textual Descriptions:**
   - `Name`, `Summary`, `Description`: Textual descriptions of the listing.
   - `Neighborhood Overview`, `Notes`, `Transit`: Additional textual information about the listing’s neighborhood and transit options.

3. **Host Information:**
   - `Host Name`, `Host Since`, `Host Location`: Information regarding the host.
   - `Host Response Time`, `Host Response Rate`: Host’s responsiveness metrics.

4. **Location and Property Attributes:**
   - `Street`, `Neighbourhood`, `City`, `State`, `Country`: Location-related attributes.
   - `Property Type`, `Room Type`: Descriptors of the property type and room type.

5. **Accommodation Features:**
   - `Accommodates`, `Bathrooms`, `Bedrooms`, `Beds`: Attributes indicating the accommodation capacity and facilities.
   - `Amenities`: List of amenities provided.

6. **Pricing and Booking Information:**
   - `Price`, `Security Deposit`, `Cleaning Fee`: Pricing-related information.
   - `Guests Included`, `Extra People`, `Minimum Nights`, `Maximum Nights`: Booking-related attributes.

7. **Availability and Review Metrics:**
   - `Availability 30`, `Availability 60`, `Availability 90`, `Availability 365`: Availability metrics over different time horizons.
   - `Number of Reviews`, `Review Scores Rating`, `Reviews per Month`: Review-related metrics.

8. **Miscellaneous:**
   - `Features`: Other features of the listing.
   - `Geolocation`: Geographical coordinates of the listing.

This dataset presents a rich and diverse set of attributes, offering a substantial foundation upon which to build predictive models aimed at accurately estimating rental prices for AirBnB listings. The extensive variety of data attributes spans textual descriptions, categorical variables, numerical metrics, and date-related information, providing a well-rounded basis for a comprehensive exploratory data analysis (EDA) and subsequent model development.

> The dataset occupies at least 211.6 MB in memory (the `+` in the figure above marks a lower bound, because the contents of object columns are not fully counted). This calls for efficient data handling and processing techniques to keep model training and evaluation smooth.
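
One possible way to reduce the memory footprint is to downcast numeric columns and store frequently repeated strings as categories. The sketch below is an assumption about how this could be done, not part of the original notebook; `shrink_dataframe`, the `max_unique_ratio` threshold, and the toy values are hypothetical. Note that downcasting floats to 32 bits loses some precision.

```python
import pandas as pd
from pandas.api import types as ptypes


def shrink_dataframe(df, max_unique_ratio=0.5):
    """Return a copy with downcast numerics and low-cardinality text as categories."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if ptypes.is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast="integer")
        elif ptypes.is_float_dtype(series):
            # float64 -> float32 when possible; this trades precision for memory
            out[col] = pd.to_numeric(series, downcast="float")
        elif ptypes.is_object_dtype(series) or ptypes.is_string_dtype(series):
            # Convert only when values repeat often; free text is left untouched
            ratio = series.nunique(dropna=True) / max(len(series), 1)
            if ratio <= max_unique_ratio:
                out[col] = series.astype("category")
    return out


# Toy frame for illustration only (values are made up)
toy = pd.DataFrame({
    "Room Type": ["Entire home/apt", "Private room"] * 500,
    "Accommodates": [2, 4] * 500,
    "Price": [120.0, 45.0] * 500,
})

before = toy.memory_usage(deep=True).sum()
small = shrink_dataframe(toy)
after = small.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

Passing `deep=True` to `memory_usage` counts the actual string contents, which gives a more realistic figure than the default estimate.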


## Setup

### Imports

In this project, we use [PyTorch](https://pytorch.org/) to build and train our models. PyTorch is an open-source machine learning library used for a variety of tasks, but primarily for training deep neural networks. Its `torch.nn` sub-module provides the classes and functions needed to define network architectures.

To ensure that the necessary dependencies are correctly installed and managed throughout the project, we'll be utilizing [Conda](https://docs.conda.io/en/latest/) as our package manager. Conda is an open-source package management and environment management system that runs on Windows, macOS, and Linux.

The `environment.yml` file located in the root of the project directory contains the list of all necessary packages and their respective versions required for this project. This file will allow us to create a Conda environment with the specified dependencies, ensuring a consistent environment across different setups.


In [None]:
# PyTorch is an open-source machine learning library used for a variety of tasks,
# but primarily for training deep neural networks.
import torch

# nn is a sub-module in PyTorch that contains useful classes and functions to build neural networks.
import torch.nn as nn

# F is a sub-module in PyTorch that contains useful functions for building neural networks.
import torch.nn.functional as F

# DataLoader is a PyTorch utility for loading and batching data efficiently.
from torch.utils.data import DataLoader

# torchvision contains various utilities, pre-trained models, and datasets specifically
# geared towards computer vision tasks.
import torchvision

# transforms are a set of common image transformations that are often required when
# working with image data.
from torchvision import transforms

# ImageFolder is a utility for loading images directly from a directory structure where
# each sub-directory represents a different class.
from torchvision.datasets import ImageFolder

# random_split is a utility function to randomly split a dataset into non-overlapping
# new datasets of given lengths.
from torch.utils.data import random_split


# SummaryWriter is a PyTorch utility for logging information to be displayed in TensorBoard.
from torch.utils.tensorboard import SummaryWriter

# summary is a PyTorch utility for displaying the summary of a PyTorch model.
from torchinfo import summary

# tqdm is a Python library that adds a progress bar to an iterable object.
from tqdm import tqdm

# Matplotlib is a plotting library that is useful for visualizing data, plotting graphs, etc.
import matplotlib.pyplot as plt

# Seaborn is a Python data visualization library based on Matplotlib.
import seaborn as sns

# NumPy is a library for numerical operations and is especially useful for array and
# matrix computations.
import numpy as np

# Pandas is a library for data manipulation and analysis.
import pandas as pd


# PIL is a library for image processing.
from PIL import Image

# os is a Python module that provides a portable way of using operating system dependent
# functionality, such as working with file paths.
import os

# time is a module that provides various time-related functions.
import time

# random is a module that implements pseudo-random number generators for various distributions.
import random

# accuracy_score computes the accuracy classification score.
# confusion_matrix computes confusion matrix to evaluate the accuracy of a classification.
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

# itertools is a module that provides functions for building and combining iterators;
# product yields the Cartesian product of its inputs (useful for grid search).
from itertools import product

# math is a module that provides access to the mathematical functions.
import math

### Setting the Random Seed for Reproducibility

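For any machine learning experiment, reproducibility is crucial. Setting a random seed ensures that the random numbers generated by our code are the same across different runs, making the results reproducible. In this project, the seed is set for PyTorch, NumPy, and Python's built-in `random` module, and cuDNN is configured to use deterministic algorithms.

In [None]:
SEED = 117
# Set the seed for every random number generator used in the notebook
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
# Request deterministic cuDNN kernels and disable the auto-tuner,
# which can otherwise select different algorithms between runs
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False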

## Data Loading and Exploration
In this section, we delve into the initial phase of our project where we load the dataset into our environment and perform an exploratory data analysis (EDA) to better understand the nature and characteristics of the data we are dealing with.

### Loading the Dataset
The dataset for this project is expected in the `./data` directory. Since the dataset is too large to be uploaded to GitHub, it is not included in the repository. However, it can be downloaded from the [Kaggle competition page](https://www.kaggle.com/competitions/obligatorio-deep-learning-2023).


In [None]:
TRAIN_PATH = './data/public_train_data.csv'
SUBMISSION_PATH = './data/private_data_to_predict.csv'
train_df = pd.read_csv(TRAIN_PATH)

### Exploratory Data Analysis (EDA)

The Exploratory Data Analysis (EDA) is a vital step that helps in understanding the intricacies of the data, spotting any anomalies, and uncovering patterns that could be instrumental in building a precise predictive model. The steps involved in the EDA for this dataset are outlined as follows:

1. **Summary Statistics**
   Acquiring summary statistics will provide insights into the central tendency and dispersion of the numerical attributes.

2. **Data Type Analysis**
   A review of the data types of each attribute to ensure they are in the correct format for analysis and modeling.

3. **Missing Values Assessment**
   Identifying and addressing missing values across different attributes to ensure completeness of the data.

4. **Categorical Variable Analysis**
   Exploring the unique values and counts of categorical variables to understand the distribution across different categories.

5. **Correlation Analysis**
   Analyzing the correlation between numerical variables, especially with respect to the target variable `Price`, to understand any strong relationships that might exist.

6. **Visualization**
   Employing visualization techniques to create histograms, box plots, and scatter plots to visualize data distribution, outliers, and relationships between variables.

7. **Text Data Overview**
   Reviewing textual data to understand the quality and potential feature extraction opportunities it presents.

Through a detailed EDA, the aim is to garner insights that will be pivotal in guiding the subsequent data preprocessing and model building stages, thereby ensuring a solid foundation for developing an accurate price prediction model.


#### Summary Statistics

Summary statistics provide a high-level overview of the numerical attributes in the dataset. They describe the central tendency, dispersion, and shape of each attribute's distribution, considering each attribute on its own rather than in relation to the others. These statistics help characterize typical values, flag potential outliers, and show how the data points are spread across each attribute.

##### Key Components of Summary Statistics:

1. **Count:** The number of non-null entries for each attribute.
2. **Mean:** The average value of each attribute, providing a measure of central tendency.
3. **Standard Deviation (std):** A measure of the amount of variation or dispersion of a set of values.
4. **Minimum (min) and Maximum (max):** The smallest and largest values in each attribute, respectively.
5. **25th, 50th (median), and 75th Percentiles:** These values provide a summary of the distribution of values, where for instance, 25% of the data points are below the 25th percentile.

These statistics can be tabulated for each numerical attribute in the dataset. The tabular format gives a clear, concise view of the dataset's overall behavior and helps identify potential anomalies or outliers that may require further investigation.

Furthermore, summary statistics play a vital role in the data preprocessing stage, where understanding the distribution of data is crucial for tasks such as normalization, handling outliers, and feature scaling. By thoroughly analyzing these summary statistics, one can make informed decisions on the necessary preprocessing steps to enhance the model's performance in subsequent stages of the project.


In [None]:
summary_statistics = train_df.describe()

# Displaying the summary statistics
print(summary_statistics)
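
The quartiles reported by `describe()` can also be used to flag candidate outliers. The following is a minimal sketch using the common 1.5 × IQR rule; the helper `iqr_bounds`, the multiplier `k`, and the toy prices are assumptions for illustration, and the notebook may handle outliers differently.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Lower and upper fences based on the interquartile range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy prices for illustration only (not taken from the dataset)
toy_prices = pd.Series([40, 55, 60, 75, 80, 95, 110, 900], name="Price")

low, high = iqr_bounds(toy_prices)
outliers = toy_prices[(toy_prices < low) | (toy_prices > high)]
print(f"fences: ({low}, {high})")
print(outliers.tolist())
```

On the real data, the same function could be applied to `train_df['Price']` to see how many listings fall outside the fences before deciding whether to clip, transform, or keep them.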

#### Data Type Analysis

Analyzing the data types of each attribute is a crucial step in understanding the kind of data you are dealing with. This analysis helps in ensuring that each attribute is formatted correctly, which is essential for both data preprocessing and modeling stages of the project.

In [None]:
train_df.info()
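
When `info()` reports a column as `object` even though it holds dates or numbers, the column can be converted explicitly. The sketch below assumes, hypothetically, that `Host Since` contains ISO-formatted date strings and that `Host Response Rate` contains numbers stored as text; the actual formats in the dataset should be checked first.

```python
import pandas as pd

# Toy frame for illustration only; the value formats are assumptions
toy = pd.DataFrame({
    "Host Since": ["2015-03-01", "2017-11-20", "not a date"],
    "Host Response Rate": ["100", "85", None],
})

converted = toy.copy()
# errors="coerce" turns unparseable entries into NaT / NaN instead of raising
converted["Host Since"] = pd.to_datetime(converted["Host Since"], errors="coerce")
converted["Host Response Rate"] = pd.to_numeric(
    converted["Host Response Rate"], errors="coerce"
)

print(converted.dtypes)
print(converted.isnull().sum())
```

Coercing errors makes malformed entries visible as missing values, so they can be handled together with the rest of the missing data.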


#### Missing Values Assessment

Assessing missing values is a critical step in the data exploration process. Missing data can lead to incorrect or biased analyses and conclusions. Identifying the presence and extent of missing values in the dataset is crucial to decide on the appropriate handling strategies.

Implementing appropriate strategies to handle missing values is crucial to ensure the robustness and accuracy of the predictive model. The strategy chosen can significantly affect the model's performance and the insights derived from the data analysis.


In [None]:
# Setting the aesthetic style of the plots
sns.set(style="whitegrid")

# Create a heatmap to visualize the missing values
plt.figure(figsize=(20, 8))
sns.heatmap(train_df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Missing values heatmap')
plt.show()
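
A heatmap shows where values are missing, but a per-column table makes it easier to decide which columns to drop or impute. This is a minimal sketch; the helper `missing_report` and the toy values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Per-column count and share of missing values, most incomplete first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    # Keep only columns that actually have gaps
    return report[report["missing"] > 0].sort_values("share", ascending=False)


# Toy frame for illustration only (values are made up)
toy = pd.DataFrame({
    "Price": [100.0, 80.0, 60.0, 120.0],
    "Bathrooms": [1.0, np.nan, 2.0, 1.0],
    "Notes": [None, None, "Quiet street", None],
})

report = missing_report(toy)
print(report)
```

Calling `missing_report(train_df)` would list the real columns ordered by their share of missing values.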


In [None]:

# display all columns
pd.set_option('display.max_columns', None)
train_df.head()

In [None]:
# Select only numeric columns
df_numeric = train_df.select_dtypes(include=['float64', 'int64'])

# Calculate the correlation matrix
correlation_matrix = df_numeric.corr()

# Get the correlation of every numeric feature with the target, 'Price'
price_correlation = correlation_matrix['Price'].sort_values(ascending=False)

# Keep only the features whose absolute correlation with 'Price' is at least 0.1
important_features = price_correlation[price_correlation.abs() >= 0.1]

# Room Type, Smart Location

In [None]:
print(important_features)
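
The correlation matrix covers only numeric columns, so categorical attributes such as `Room Type` need a different kind of summary. One possible approach, sketched here with made-up values, is to look at category counts and the median `Price` per category.

```python
import pandas as pd

# Toy frame for illustration only (values are made up)
toy = pd.DataFrame({
    "Room Type": [
        "Entire home/apt", "Private room", "Entire home/apt",
        "Shared room", "Private room", "Entire home/apt",
    ],
    "Price": [150.0, 60.0, 130.0, 25.0, 70.0, 170.0],
})

# How many listings fall into each category
counts = toy["Room Type"].value_counts()

# Median price per category, highest first (the median is robust to outliers)
median_price = (
    toy.groupby("Room Type")["Price"]
    .median()
    .sort_values(ascending=False)
)

print(counts)
print(median_price)
```

Large differences in median price between categories suggest that the column carries useful signal and is worth encoding during preprocessing.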

### Data Preprocessing
## Baseline Model
### Model Architecture
### Model Compilation
### Model Training
### Model Evaluation
## Hyperparameter Tuning
### Grid Search Setup
### Execution of Grid Search
### Analysis of Grid Search Results
## Advanced Techniques Implementation
### Batch Normalization
### Gradient Normalization and/or Gradient Clipping
## Final Model
### Model Architecture
### Model Compilation
### Model Training
### Model Evaluation
## Participation in Kaggle Competition
### Submission Preparation
### Submission to Kaggle
## Conclusion
### Summary
### Future Work
## References
## Appendices
### Supplementary Scripts
### Additional Resources