<img src="Map_of_London_v2.png" width="1000" height="100">

# Airbnb Regression Project

## 1. Introduction

Inside Airbnb hosts and provides data scraped from Airbnb's website for a number of cities across the world. The data is openly available [here](http://insideairbnb.com/get-the-data/).

This data includes information about the listings such as:

 -  GIS Locations (longitude/latitude)
 -  number of bathrooms
 -  how many people it accommodates
 -  average number of maximum and minimum nights for stays.

This project will focus on the data in relation to London.
Files downloaded for London: neighbourhoods.csv, listings.csv, calendar.csv, reviews.csv

## 2. Project Aims

These data can be used to understand the distribution of Airbnb stays across the city, and the relationship between the area, the type of listing, and its value.

Using supplementary data sources, a regression model may be built to estimate the value of a new listing from its location and properties.

## 4. Estimating the value of a new listing based on property characteristics and location

### 4.1 The Problem

1. Perform any cleaning, feature engineering, and EDA you deem necessary.
2. Address any missing data.
3. Identify features that can predict price.
4. Investigate the distribution of listings across the city.
5. Train a model on 90% of the data and evaluate its performance on the remaining 10%.
6. Characterise your model. How well does it perform? What are the best estimates of price?
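
Step 5 can be sketched as follows. This is a minimal illustration on made-up data, not the model built later in this project: the `toy` table, its coefficients, and the choice of ordinary least squares with RMSE as the metric are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the cleaned listings table.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=200),
    "accommodates": rng.integers(1, 9, size=200),
})
toy["price"] = 40 * toy["bedrooms"] + 10 * toy["accommodates"] + rng.normal(0, 5, size=200)

# Train on a random 90% of the rows and hold out the remaining 10%.
train = toy.sample(frac=0.9, random_state=42)
test = toy.drop(train.index)

# Fit ordinary least squares with an intercept term.
features = ["bedrooms", "accommodates"]
X_train = np.column_stack([np.ones(len(train)), train[features].to_numpy()])
coef, *_ = np.linalg.lstsq(X_train, train["price"].to_numpy(), rcond=None)

# Evaluate on the held-out rows with root-mean-squared error.
X_test = np.column_stack([np.ones(len(test)), test[features].to_numpy()])
pred = X_test @ coef
rmse = float(np.sqrt(np.mean((test["price"].to_numpy() - pred) ** 2)))
print(f"train rows: {len(train)}, test rows: {len(test)}, RMSE: {rmse:.2f}")
```

Fixing `random_state` keeps the split reproducible, so later changes to the features can be compared on the same held-out rows.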

## 3. EDA

### 3.1 Reading in data

In [None]:
import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import logging

main_path = Path(r'C:\Users\user\Documents\Data Science Upskilling')

### 3.2 Exploring Neighbourhoods dataset

In [None]:
neighbourhoods = pd.read_csv(main_path / 'neighbourhoods.csv')

### 3.3 Exploring the listings dataset

In [None]:
listings = pd.read_csv(main_path / 'listings_in_depth.csv' / 'listings.csv')
# Print the summary explicitly: a notebook cell only displays its last expression
print(listings.describe())
listings.info()

In [None]:
listings.id.nunique()

In [None]:
len(listings)

As the number of rows matches the number of unique ids, it is assumed that the dataset contains no duplicate listings.
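
The same check can be made explicit without comparing two numbers by eye. The sketch below uses a small made-up table (`toy`) that deliberately contains a repeated id; on the real data the same calls would be applied to `listings`.

```python
import pandas as pd

# Hypothetical toy table; id 103 is repeated on purpose.
toy = pd.DataFrame({"id": [101, 102, 103, 103], "price": [50, 80, 65, 65]})

ids_unique = toy["id"].is_unique               # True only if no id repeats
n_dup_ids = int(toy["id"].duplicated().sum())  # repeats after the first occurrence
n_dup_rows = int(toy.duplicated().sum())       # rows identical in every column
print(ids_unique, n_dup_ids, n_dup_rows)
```

Unique ids rule out repeated listing records, but not the same property listed twice under different ids, which would need a comparison on other columns.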

In [None]:
listings.hist(bins=50, figsize=(20,15))
plt.show()

Number of entries with a non-null `neighbourhood_cleansed` value:

In [None]:
listings.neighbourhood_cleansed.count()
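
Beyond a single non-null count, a per-neighbourhood tally gives a first view of how listings are distributed across the city. The following is a sketch on made-up values; the neighbourhood names in `toy` are placeholders, not results from the London data.

```python
import pandas as pd

# Hypothetical toy column with one missing neighbourhood.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Camden", "Hackney", "Camden", None, "Westminster", "Camden"]
})

# value_counts() ignores missing values by default and sorts by frequency.
counts = toy["neighbourhood_cleansed"].value_counts()
share = counts / counts.sum()  # fraction of non-missing listings per neighbourhood
n_missing = int(toy["neighbourhood_cleansed"].isna().sum())
print(counts)
print("missing:", n_missing)
```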

### 3.4 Exploring the calendar dataset

In [None]:
calendar = pd.read_csv(main_path / 'calendar.csv.gz')
calendar.describe()

## 5. Data cleaning and feature engineering

Features of an Airbnb listing that may determine its value:

1. Number of bedrooms
2. Number of bathrooms
3. Type of listing (private/shared/tent etc)
4. Number of people it accommodates
5. Amenities
6. Neighbourhood
7. License
8. Number of positive reviews
9. Host response time/rate/acceptance rate
10. Instant bookable

These data are available within the listings dataset, but their quality and completeness need to be ascertained.
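
One way to assess completeness for several candidate features at once is a small summary table of non-null counts and percentages. The sketch below runs on a made-up four-row table; the column names mirror those used in this notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with a few gaps.
toy = pd.DataFrame({
    "bedrooms": [1.0, 2.0, np.nan, 3.0],
    "bathrooms_text": ["1 bath", None, "1.5 shared baths", "2 baths"],
    "accommodates": [2, 4, 2, 6],
    "instant_bookable": ["t", "f", "f", "t"],
})

# notna() marks present values; sum() counts them and mean() gives the fraction.
completeness = pd.DataFrame({
    "non_null": toy.notna().sum(),
    "pct_complete": (toy.notna().mean() * 100).round(1),
}).sort_values("pct_complete")
print(completeness)
```

Sorting by `pct_complete` puts the least complete columns first, which is where imputation or exclusion decisions are most likely to be needed.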

Checking how complete the bedroom and bathroom data are:

In [None]:
print('Number of bathroom types: ', listings.bathrooms_text.nunique())
# notna() returns one boolean per row, so sum() it to count the non-null rows
print('Number of rows with a bathroom type assigned:', listings.bathrooms_text.notna().sum())

In [None]:
# Count non-null rows; len() of the boolean mask would always equal len(listings)
print('Number of rows with bedroom data:', listings.bedrooms.notna().sum())
listings.bedrooms.hist(bins=50, figsize=(20,15))

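The bathroom and bedroom data can be used in the model as they stand only if they are complete, i.e. the count of non-null values (`notna().sum()`) equals the number of rows. Note that `len(series.notna())` always returns the total row count, so it cannot confirm completeness by itself.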
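
Because `bathrooms_text` is free text, it needs converting to numbers before it can enter a regression. The sketch below shows one possible approach. The example labels (such as "1.5 shared baths" and "Half-bath") are assumptions about how the column is worded and should be checked against the actual unique values before relying on it.

```python
import pandas as pd

# Hypothetical examples of bathroom labels, including a missing value.
toy = pd.Series(
    ["1 bath", "1.5 shared baths", "Half-bath", "2 private baths", None],
    name="bathrooms_text",
)

# Pull the leading number out of the text; rows without one become NaN.
n_baths = pd.to_numeric(
    toy.str.extract(r"(\d+(?:\.\d+)?)", expand=False), errors="coerce"
)

# Treat any "half-bath" wording as 0.5 bathrooms (an assumption about the labels).
is_half = toy.str.contains("half", case=False, na=False)
n_baths = n_baths.mask(is_half, 0.5)

# Keep "shared" as a separate boolean feature.
is_shared = toy.str.contains("shared", case=False, na=False)
print(pd.DataFrame({"text": toy, "n_baths": n_baths, "is_shared": is_shared}))
```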

In [None]:
# Count non-null rows rather than taking len() of the boolean mask
print('Number of rows with maximum occupancy data:', listings.accommodates.notna().sum())
listings.accommodates.hist(bins=50, figsize=(20,15))

In [None]:
print('Number of instant bookable categories: ', listings.instant_bookable.nunique())
# Count non-null rows rather than taking len() of the boolean mask
print('Number of rows with instant bookable categories assigned:', listings.instant_bookable.notna().sum())
print('Instant bookable categories:', listings.instant_bookable.unique())
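
If the categories turn out to be a two-valued flag, the column can be mapped to 0/1 for use in a regression. The sketch below assumes the flag is encoded as the strings `'t'` and `'f'`; that encoding is an assumption and should be confirmed from the categories printed above.

```python
import pandas as pd

# Hypothetical toy column; 't'/'f' encoding is assumed, not confirmed.
toy = pd.Series(["t", "f", "f", None, "t"], name="instant_bookable")

# Inspect the categories, including missing values, before encoding.
print(toy.value_counts(dropna=False))

# Map the flag to 1/0; missing or unexpected values become NaN.
instant = toy.map({"t": 1, "f": 0})
print(instant)
```

Leaving unmapped values as NaN, rather than defaulting them to 0, keeps any unexpected categories visible in the later missing-data checks.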

## 6. Building the Model

### 6.1 Parameter preparation