<img src="./assets/airbnb_logo.png"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>
# Group Project AirBnB
**Useful links**

<a href="https://github.com/ShimantoRahman/aCRM-Group-Project-AirBnB"><img src="./assets/github_logo.png" style="width:120px; margin: 0 40px 40px 40px;"></a>

> [Inside AirBnB: New York](http://insideairbnb.com/new-york-city/?fbclid=IwAR3lvDyNFboZqns1jNJ8v4OzqzG8sLFsqeSlRjqb_-tyvk4iM_XRSYdwmEQ)

> [Airbnb Rental Listings Dataset Mining](https://towardsdatascience.com/airbnb-rental-listings-dataset-mining-f972ed08ddec)

## 1. Setup
### 1.1 Import modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

![green-divider](./assets/green_divider.png)

### 1.2 Read data

In [None]:
calendar_detail = pd.read_csv("./data/calendar_detail.csv")
listings_summary = pd.read_csv("./data/listings_summary.csv")
reviews_summary = pd.read_csv("./data/reviews_summary.csv")
neighbourhoods = pd.read_csv("./data/neighbourhoods.csv")

In [None]:
listings_detail = pd.read_csv("./data/listings_detail.csv")

In [None]:
reviews_detail = pd.read_csv("./data/reviews_detail.csv")

In [None]:
calendar_detail.head()

In [None]:
listings_summary.head()

In [None]:
reviews_summary.head()

In [None]:
neighbourhoods.head()

In [None]:
listings_detail.head()

In [None]:
reviews_detail.head()

![green-divider](./assets/green_divider.png)

### 1.3 Data preparation
#### 1.3.1 Detecting NaN values

In [None]:
print("listings_summary\n")
print(listings_summary.isnull().sum())

In [None]:
print("listings_detail\n")
print(listings_detail.isnull().sum())

In [None]:
print("reviews_summary\n")
print(reviews_summary.isnull().sum())

In [None]:
print("reviews_detail\n")
print(reviews_detail.isnull().sum())

In [None]:
print("neighbourhoods\n")
print(neighbourhoods.isnull().sum())

In [None]:
print("calendar_detail\n")
print(calendar_detail.isnull().sum())

![green-divider](./assets/green_divider.png)

#### 1.3.2 Cleansing data

In [None]:
# removing listings where only one of first_review and last_review is missing
first_missing = listings_detail["first_review"].isnull()
last_missing = listings_detail["last_review"].isnull()
listings_detail = listings_detail[first_missing == last_missing]

In [None]:
# replacing NaN values for reviews_per_month with 0
# rows with a NaN value for reviews_per_month have no first or last review, so they have 0 reviews per month
column_imputations = {"reviews_per_month": 0}
listings_detail = listings_detail.fillna(value=column_imputations)

![green-divider](./assets/green_divider.png)

#### 1.3.3 Correcting data types

In [None]:
# dates
def column_to_date(df, column):
    # parse the column in place; the result must be assigned back to the DataFrame
    df[column] = pd.to_datetime(df[column], format="%Y-%m-%d")
    
# listings_summary
column_to_date(listings_summary, "last_review")

# reviews_summary
column_to_date(reviews_summary, "date")

# calendar_detail
column_to_date(calendar_detail, "date")

# listings_detail
column_to_date(listings_detail, "first_review")
column_to_date(listings_detail, "last_review")
column_to_date(listings_detail, "last_scraped")
column_to_date(listings_detail, "calendar_last_scraped")
column_to_date(listings_detail, "host_since")

# reviews_detail

In [None]:
listings_detail.host_response_rate

In [None]:
listings_detail.dtypes

In [None]:
# change t/f columns to 1/0
# important to check missing values first, else missing values will be replaced with 0
def replace_wrong_values(column_obs):
    # "t" becomes 1; everything else (including "f" and NaN) becomes 0
    return 1 if column_obs == "t" else 0

def column_to_bool(df, column):
    df[column] = df[column].apply(replace_wrong_values)
    df[column] = pd.to_numeric(df[column])

# convert each column exactly once: a second pass would map the new 1/0 values to 0
column_to_bool(listings_detail, "instant_bookable")
column_to_bool(listings_detail, "requires_license")
column_to_bool(listings_detail, "is_business_travel_ready")
column_to_bool(listings_detail, "require_guest_profile_picture")
column_to_bool(listings_detail, "require_guest_phone_verification")
column_to_bool(listings_detail, "host_is_superhost")
column_to_bool(listings_detail, "has_availability")


In [None]:
# convert percentage strings such as "100%" to fractions such as 1.0
def percent_to_fraction(column_obs):
    # missing response rates stay missing
    if pd.isna(column_obs):
        return np.nan
    # drop the trailing "%" and rescale to the 0-1 range
    return float(str(column_obs).rstrip("%")) / 100

def column_to_float(df, column):
    df[column] = df[column].apply(percent_to_fraction)
    df[column] = pd.to_numeric(df[column])

column_to_float(listings_detail, "host_response_rate")
# listings_detail.host_response_rate.dtype


![green-divider](./assets/green_divider.png)

## 2. Analysis
### 2.1 Calculate the average listing price per neighbourhood

In [None]:
# select the price column before aggregating so non-numeric columns are not averaged
print(listings_summary.groupby("neighbourhood_group")["price"].mean())

### 2.2 Plot how the average price evolves through the year across New York.

### 2.3 Identify which neighborhood has the largest price fluctuations across the year. Plot the fluctuations for this neighborhood.
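
One possible way to quantify "fluctuation" is the standard deviation of a neighbourhood's monthly average price. The sketch below assumes the calendar prices have already been joined to a neighbourhood and averaged per month; the helper name `largest_fluctuation`, the input shape, and the sample numbers are illustrative assumptions, not taken from the data.

```python
from statistics import pstdev

def largest_fluctuation(monthly_prices):
    """Return (neighbourhood, std dev) for the series that varies most.

    monthly_prices maps a neighbourhood name to its list of monthly
    average prices (hypothetical input shape).
    """
    spread = {name: pstdev(prices) for name, prices in monthly_prices.items()}
    name = max(spread, key=spread.get)
    return name, spread[name]

# made-up numbers, for illustration only
example = {
    "Harlem": [100, 102, 98, 101],
    "Midtown": [200, 260, 180, 240],
}
name, spread = largest_fluctuation(example)
```

Other measures, such as the range or the coefficient of variation, would fit the same structure and may rank neighbourhoods differently.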

### 2.4 In marketing, there is a phenomenon known as ‘the long tail’ (Hint: look it up). This also translates to the number of reviews. Plot this on an intuitive graph.
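
A long tail shows up when listings are ranked by review count: a few listings hold most of the reviews and the rest trail off. The sketch below prepares the two series such a plot needs; `cumulative_review_share` is a hypothetical helper, and the sample counts are invented.

```python
def cumulative_review_share(review_counts):
    """Rank counts from high to low and compute the cumulative share of reviews."""
    ordered = sorted(review_counts, reverse=True)
    total = sum(ordered)
    shares = []
    running = 0
    for count in ordered:
        running += count
        # guard against an all-zero input
        shares.append(running / total if total else 0.0)
    return ordered, shares

# invented review counts, for illustration only
ordered, shares = cumulative_review_share([1, 50, 3, 0, 2, 120, 4])
```

Plotting `ordered` against its rank, for example with matplotlib, gives the familiar steep head and flat tail; `shares` shows how quickly the top listings accumulate the total.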

### 2.5 Run a regression to explain the price per listing. (Hint: location, reviews, etc. may all explain this).

### 2.6 Find additional data sources to explain the average listing price per neighbourhood. (Hint : think demographics)

### 2.7 Plot how the average prices differ across New York using a color-coded heat map of New York neighborhoods.

### 2.8 The latitude of Statue of Liberty National Monument, New York, USA is 40.68927, and the longitude is -74.044502. This monument is one of the most popular tourist places in New York. Statistically test whether a distance smaller than 2 miles to the monument increases average listing price.
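
Before any test can be run, each listing needs a distance to the monument. A minimal sketch using the haversine great-circle formula follows; the function names and the approximate Earth radius are assumptions introduced here, while the coordinates come from the task text.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius
STATUE_OF_LIBERTY = (40.68927, -74.044502)  # latitude, longitude from the task text

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def is_near_statue(lat, lon, max_miles=2.0):
    """True when a point lies closer than max_miles to the monument."""
    return haversine_miles(lat, lon, *STATUE_OF_LIBERTY) < max_miles
```

The resulting flag splits the listings into a "near" and a "far" group, whose prices could then be compared with a two-sample test such as Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`), assuming SciPy is available.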

### 2.9 Create a timeline and plot for each year the highest, Q1, the median, Q3 and lowest price on one graph. Do this for each neighborhood group as well as for the entire city. Determine which neighborhood group stands out the most and create a comparative graph of this neighborhood with all other groups.
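
The five values requested per year form a five-number summary. As a rough sketch, the hypothetical helper below computes it for one list of prices with the standard library; in the notebook it would be applied once per year and per neighbourhood group. The sample prices are invented.

```python
from statistics import quantiles

def five_number_summary(prices):
    """Return the lowest, Q1, median, Q3 and highest value of a price list."""
    # method="inclusive" treats the data as the full population, endpoints included
    q1, median, q3 = quantiles(prices, n=4, method="inclusive")
    return {
        "min": min(prices),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(prices),
    }

# invented prices, for illustration only
summary = five_number_summary([50, 80, 100, 120, 300])
```

Different quartile conventions give slightly different Q1 and Q3 values on small samples, so the chosen method is worth stating alongside the plot.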

### 2.10 Plot the number of rooms per host as a function of the number of reviews per host.

### 2.11 Are there a lot of hosts having multiple locations? Do most people just rent their own place? Is there a ‘host long tail’? Make a comprehensive plot.

### 2.12 Do hosts with multiple locations stay within the same neighbourhood? (hint: use subset)
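
One way to frame this question is as a share: of all hosts with more than one listing, how many keep every listing in a single neighbourhood? The sketch below assumes the listings can be reduced to `(host_id, neighbourhood)` pairs; the function name and the sample pairs are hypothetical.

```python
from collections import defaultdict

def share_of_single_neighbourhood_hosts(listings):
    """Share of multi-listing hosts whose listings all sit in one neighbourhood.

    listings is an iterable of (host_id, neighbourhood) pairs.
    """
    by_host = defaultdict(list)
    for host_id, neighbourhood in listings:
        by_host[host_id].append(neighbourhood)

    # keep only the subset of hosts with more than one listing
    multi = [areas for areas in by_host.values() if len(areas) > 1]
    if not multi:
        return 0.0

    same = sum(1 for areas in multi if len(set(areas)) == 1)
    return same / len(multi)

# invented pairs: host 1 stays in "A", host 2 spans "A" and "B", host 3 has one listing
pairs = [(1, "A"), (1, "A"), (2, "A"), (2, "B"), (3, "C")]
share = share_of_single_neighbourhood_hosts(pairs)
```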

### 2.13 What are the 5 most used words in reviews that are not stop words? (e.g. the, or, etc. Python can filter these automatically using packages such as NLTK).
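
The counting step can be sketched with the standard library alone. The stop-word set below is a tiny illustrative stand-in (an assumption, not NLTK's list); in practice it could be replaced by `nltk.corpus.stopwords.words("english")`. The helper name `top_words` and the sample comments are also made up.

```python
import re
from collections import Counter

# tiny illustrative stop-word set; NLTK's English list is far more complete
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "was", "in", "to", "of", "it", "we", "i", "very"}

def top_words(comments, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop words across all comments."""
    counts = Counter()
    for comment in comments:
        # skip missing comments (NaN or None) instead of failing on them
        if not isinstance(comment, str):
            continue
        words = re.findall(r"[a-z']+", comment.lower())
        counts.update(word for word in words if word not in stop_words)
    return counts.most_common(n)

# invented comments, for illustration only
sample = ["The place was great and clean", "Great location, great host", None]
most_used = top_words(sample, n=1)
```

The regular expression keeps only ASCII letters and apostrophes, so reviews in other languages or with accented characters would need a broader tokenizer.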

### 2.14 Do these most frequent words differ across neighborhoods? What are the ‘most different’ areas? What distinguishes them? Interpret.

### 2.15 Plot the number of reviews across time.

### 2.16 Is there a link between availability (days per year) and the price? Determine both graphically and statistically.

### 2.17 Is there a link between how many times the word ‘great’ appears in a review and the listing price? Determine both graphically and statistically. 

### 2.18 Plot how the number of Airbnb locations are distributed across the city on a map. Plot the number of locations per neighborhood and color code according to neighborhood group.

### 2.19 Williamsburg is a ‘hip’ area in Brooklyn with a lot of Airbnb locations on offer. Explore how this area differs from other locations and visualize. You may also use external data sources.

### 2.20 Create a stacked bar chart of the distribution of room type per neighborhood group. Statistically test whether these differences are significant.

### 2.21 Plot the most popular room type per neighborhood on a color-coded city map.