# SI618 Project
### Analyzing the Impact of Various Factors on B&B Visitors' Reviews
#### — A study based on Airbnb datasets

Team members: Qian Dong (dqq) section 001; Yujun Zhang (yukiz) section 001; Yinuo Wei (seesaway) section 001


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap

### Cleaning and manipulation
1. Primary dataset description

In [None]:
ab=pd.read_csv('data/AB_NYC_2019.csv')

In [None]:
ab.head()

In [None]:
ab.shape

In [None]:
ab.columns

In [None]:
ab.describe()

In [None]:
ab.select_dtypes(exclude=['object'])\
    .plot(kind='box', subplots=True, layout=(3,4), figsize=(14,14), fontsize=14)

In [None]:
ab.select_dtypes(exclude=['object']).isna().sum()

Only reviews_per_month has missing values.

In [None]:
ab['reviews_per_month'].min()

Because no recorded value is 0, we can infer that a missing value means the listing has no reviews, so it should be 0.

In [None]:
# replace missing values with 0 (listings with no reviews)
ab['reviews_per_month'] = ab['reviews_per_month'].fillna(0)

Updated box plot:

In [None]:
ab['reviews_per_month'].plot(kind='box')

In [None]:
ab.select_dtypes(include=['object']).head()

In [None]:
ab['neighbourhood_group'].value_counts().plot(kind='bar')

In [None]:
ab['neighbourhood'].nunique()

In [None]:
ab['neighbourhood'].value_counts().head(10)

In [None]:
ab['room_type'].value_counts().plot(kind='bar')

In [None]:
ab['last_review']=pd.to_datetime(ab['last_review'])
ab['last_review'].dt.year.value_counts().sort_index().plot(kind='bar')

In [None]:
ab['last_review'].dt.month.value_counts().sort_index().plot(kind='bar')

In [None]:
# latest and earliest review dates
ab['last_review'].max(), ab['last_review'].min()

In [None]:
ab['last_review'].value_counts().head(10)

In [None]:
ab.select_dtypes(exclude=['number']).isna().sum()

Missing names cannot be meaningfully imputed, so we leave them as they are. Of the non-numeric columns, only last_review could plausibly be filled. However, its missing values correspond to the missing values of reviews_per_month, so we keep them as null: a null last_review carries the meaning that the listing has no reviews.

In [None]:
ab[ab['last_review'].isna()][['number_of_reviews', 'last_review', 'reviews_per_month']].sample(5)

In [None]:
ab[ab['last_review'].isna()][['number_of_reviews', 'last_review', 'reviews_per_month']].nunique()

It turns out that the review-related values are missing only because those listings have no reviews.
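
The same conclusion can be checked on every row rather than on a random sample. The following is a minimal sketch that uses a small made-up frame (`toy`, a hypothetical stand-in for `ab` with invented values) and assumes a listing without reviews is recorded with `number_of_reviews == 0`:

```python
import pandas as pd

# Toy stand-in for `ab`; the values are invented for illustration only.
toy = pd.DataFrame({
    'number_of_reviews': [0, 3, 0, 12],
    'last_review': pd.to_datetime([None, '2019-05-01', None, '2019-06-23']),
})

# Rows with no reviews and rows with a missing review date should be the same rows.
no_review = toy['number_of_reviews'].eq(0)
missing_date = toy['last_review'].isna()

# Count the rows where the two flags disagree; 0 means the pattern holds everywhere.
mismatches = int((no_review != missing_date).sum())
print(mismatches)
```

Running the same comparison on `ab` would flag any listing whose review date is missing for another reason.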

2. Secondary dataset description

In [None]:
reviews=pd.read_csv('data/AB_reviews_NYC.csv')

In [None]:
reviews.head()

In [None]:
reviews.shape

In [None]:
reviews.isna().sum()

There are no missing values.

In [None]:
reviews['listing_id'].nunique()

In [None]:
reviews['url'].nunique()

The number of unique url values matches the number of unique listing_id values, so the two columns correspond to each other. url is therefore not needed for the analysis.
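
Equal unique counts suggest, but do not strictly prove, a one-to-one mapping. A stricter check could look like the sketch below, which uses a hypothetical toy frame (`toy_reviews`, with invented IDs and example.com URLs) in place of `reviews`:

```python
import pandas as pd

# Toy stand-in for `reviews`; IDs and URLs are invented for illustration only.
toy_reviews = pd.DataFrame({
    'listing_id': [101, 101, 202, 303],
    'url': [
        'https://example.com/rooms/101',
        'https://example.com/rooms/101',
        'https://example.com/rooms/202',
        'https://example.com/rooms/303',
    ],
})

# Each listing should map to exactly one url, and each url to exactly one listing.
urls_per_listing = toy_reviews.groupby('listing_id')['url'].nunique()
listings_per_url = toy_reviews.groupby('url')['listing_id'].nunique()

one_to_one = bool((urls_per_listing == 1).all() and (listings_per_url == 1).all())
print(one_to_one)
```

If this holds on the real data, dropping url loses no information that listing_id does not already carry.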

In [None]:
reviews.drop(columns=['url'], inplace=True)

In [None]:
# convert review_posted_date to datetime
reviews['review_posted_date']=pd.to_datetime(reviews['review_posted_date'])
#plot review_posted_date
reviews['review_posted_date'].dt.year.value_counts().sort_index().plot(kind='bar')

In [None]:
reviews['review_posted_date'].dt.month.value_counts().sort_index().plot(kind='bar')

In [None]:
# plot histogram of review length
reviews['review'].str.len().plot(kind='hist', bins=50)

### Visualization

1. Heatmap of Correlations of the Primary dataset

In [None]:
# keep numeric columns only (last_review is now datetime) and drop the ID columns
sns.heatmap(ab.select_dtypes(include=['number']).drop(['id', 'host_id'], axis=1)
            .corr(), cmap='coolwarm', center=0)

The review-related variables are, unsurprisingly, positively correlated with each other. Longitude is negatively correlated with price and with host listing count, and positively correlated with reviews_per_month, which reflects a geographic effect. Minimum nights and reviews_per_month are negatively correlated, which is logical: longer minimum stays leave room for fewer bookings, and so fewer reviews, per month. availability_365 is positively correlated with the number of reviews, while host listing count is negatively correlated with it. Host listing count and availability_365 are also positively related.
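
Reading pairwise relationships off a heatmap by eye is error-prone, so it can help to rank the pairs numerically. The sketch below does this on synthetic data (the columns `a`, `b`, `c` are invented for illustration and are not taken from the Airbnb dataset); the same steps could be applied to the correlation matrix of `ab`:

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: `b` is built from `a`, while `c` is independent noise.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
toy = pd.DataFrame({
    'a': x,
    'b': 2 * x + rng.normal(scale=0.1, size=n),
    'c': rng.normal(size=n),
})

corr = toy.corr()

# Collect each unordered pair of columns once, skipping the diagonal.
pairs = pd.Series(
    {(left, right): corr.loc[left, right]
     for left, right in combinations(corr.columns, 2)}
)

# Sort by absolute correlation so strong negative pairs rank as high as positive ones.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```

Sorting by absolute value keeps the sign visible in the output while still surfacing the strongest relationships first.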

In [None]:
m = folium.Map(location=[ab['latitude'].mean(), ab['longitude'].mean()], zoom_start=12)
HeatMap(data=ab[['latitude', 'longitude']], radius=15).add_to(m)
m.save('heatmap.html')
m