<a href="https://colab.research.google.com/github/MehzHats/Boston-Data-Analysis/blob/main/Boston%20Airbnb%20EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Boston Airbnb Data Analysis

The dataset used in this notebook was chosen from the Boston Airbnb Open Data hosted on Kaggle. It contains bookings, listings, and reviews of homestay activity from 2008 to 2016.

The analysis follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) and was carried out as part of the "Write a Data Science Blog Post" project for Udacity's Data Scientist Nanodegree.

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the basis for a data science process. It has six sequential phases:

1. **Business understanding** – What are the needs of the business?
2. **Data understanding** – What data do we have / need? Is it clean?
3. **Data preparation** – How is the data organised for modeling?
4. **Modeling** – Which modeling techniques are applied?
5. **Evaluation** – Which model best meets the business objectives?
6. **Deployment** – How do stakeholders access the results?




## About Data

The following Airbnb activity is included in this Boston dataset:

1. Listings, including full descriptions and average review score.
2. Reviews, including a unique id for each reviewer and detailed comments.
3. Calendar, including listing id and the price and availability for that day.

## Section 1: Business Understanding
Airbnb has made public its listings of rental homes in and around Boston, Massachusetts, together with guest reviews and the dates on which the homes were available. In this blog we examine that data to learn more about the cities and neighbourhoods covered, the influence of individual columns in the datasets, how positive and negative reviews affect a host's listings, the other factors that contribute to guests' choices, and how price varies with the time of booking.

We will also try to understand the types of homes on offer, their costs, the distribution of the listings, the number of reviews, and, finally, how visitors described the properties after their stay.


**We aim to answer the following questions from the dataset.**
1. Which column has the most impact on the data?
2. Which cities are the most expensive?
3. How do reviews affect listings?
4. What factors affect which listings customers book?
5. What is the price difference between the peak period and the off season?

## Section 2: Data Understanding

Airbnb has made its listings of rental properties in and around Boston, MA public, along with the reviews left by guests and the dates when properties were available. In this blog, we review this data to learn about the neighborhoods and to understand which types of properties are available, their prices, the concentration of the listings, the number of reviews, and, lastly, how guests described the properties after their stay.

### Exploration of dataset

In [None]:
import pandas as pd

In [None]:
calendar = pd.read_csv('calendar.csv')
listings = pd.read_csv('listings.csv')
reviews = pd.read_csv('reviews.csv')


#### Calendar Data

In [None]:
#print the head of calendar
calendar.head()

Unnamed: 0,listing_id,date,available,price
0,12147973,2017-09-05,f,
1,12147973,2017-09-04,f,
2,12147973,2017-09-03,f,
3,12147973,2017-09-02,f,
4,12147973,2017-09-01,f,


In [None]:
calendar[(calendar["price"].isna()) & (calendar["available"] != "f")]

Unnamed: 0,listing_id,date,available,price


In [None]:
calendar[(calendar["price"].isna()) & (calendar["available"] == "t")]

Unnamed: 0,listing_id,date,available,price


In [None]:
calendar[(calendar["price"].isna()) & (calendar["available"] == "f")]

Unnamed: 0,listing_id,date,available,price
0,12147973,2017-09-05,f,
1,12147973,2017-09-04,f,
2,12147973,2017-09-03,f,
3,12147973,2017-09-02,f,
4,12147973,2017-09-01,f,
...,...,...,...,...
1308885,14504422,2016-09-10,f,
1308886,14504422,2016-09-09,f,
1308887,14504422,2016-09-08,f,
1308888,14504422,2016-09-07,f,


In [None]:
# calendar["price"] = calendar["price"].fillna("$0")
# calendar["price"] = calendar["price"].replace(r'[$,]', '', regex=True).astype(float)

In [None]:
# calendar[calendar.columns[0]].count()
# len(calendar.index)
calendar.shape

(1308890, 4)

In [None]:
#print description of data
calendar.describe()

Unnamed: 0,listing_id
count,1308890.0
mean,8442118.0
std,4500149.0
min,3353.0
25%,4679319.0
50%,8578710.0
75%,12796030.0
max,14933460.0


In [None]:
#print concise summary of data
calendar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1308890 entries, 0 to 1308889
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   listing_id  1308890 non-null  int64 
 1   date        1308890 non-null  object
 2   available   1308890 non-null  object
 3   price       643037 non-null   object
dtypes: int64(1), object(3)
memory usage: 39.9+ MB


In [None]:
# Count missing prices, to compare against the number of days with available == 'f'
calendar.price.isna().sum()

665853
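
The missing-price count above is meant to be compared with the number of unavailable days. A minimal sketch of that comparison follows. It runs on a small made-up frame (`sample` and its values are illustrative, not taken from `calendar.csv`), so it shows only the shape of the check and not the result for the real data.

```python
import pandas as pd

# Toy stand-in for the calendar data; the rows are invented for illustration.
sample = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "available": ["f", "t", "f", "t"],
    "price": [None, "$65.00", None, "$1,250.00"],
})

missing_price = sample["price"].isna()
unavailable = sample["available"] == "f"

# True only if every missing price falls on an unavailable day and vice versa.
masks_match = bool((missing_price == unavailable).all())

print(int(missing_price.sum()), int(unavailable.sum()), masks_match)
```

Comparing the two boolean masks row by row is stricter than comparing the two totals, since equal counts alone would not show that the same rows are involved.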

In [None]:
calendar.duplicated().sum()

365

In [None]:
# calendar[(calendar["listing_id"] == 12898806) & (calendar["date"] == "2017-06-15")]
calendar[calendar.duplicated()]

Unnamed: 0,listing_id,date,available,price
748468,12898806,2017-06-15,f,
748469,12898806,2017-06-14,f,
748470,12898806,2017-06-13,f,
748471,12898806,2017-06-12,f,
748472,12898806,2017-06-11,f,
...,...,...,...,...
748975,12898806,2016-12-17,f,
748976,12898806,2016-12-16,f,
748977,12898806,2016-12-15,f,
748978,12898806,2016-12-14,f,
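
The duplicated rows above all belong to one listing and repeat the same date. One possible way to handle them, sketched here on invented rows (the `sample` frame is an assumption for illustration), is to keep the first copy of each fully identical row with `drop_duplicates`.

```python
import pandas as pd

# Invented rows that mimic the pattern above: one exact repeat for a listing.
sample = pd.DataFrame({
    "listing_id": [12898806, 12898806, 12898806, 3353],
    "date": ["2017-06-15", "2017-06-15", "2017-06-14", "2017-06-15"],
    "available": ["f", "f", "f", "t"],
    "price": [None, None, None, "$40.00"],
})

# duplicated() flags every repeat after the first occurrence of a row.
n_dupes = int(sample.duplicated().sum())

# Keep the first copy of each row and rebuild a clean 0..n-1 index.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```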


**Calendar Data Issues Identified**

- listing_id: convert to `str`
- date: convert to `datetime`
- available: convert to dummy
- price: drop $ and comma and convert to float/int
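
The conversions listed above could be sketched as follows. This is one possible reading, not the notebook's final preparation code: "dummy" is taken to mean a 0/1 indicator, missing prices are left as `NaN` rather than filled, and the `sample` frame is invented for illustration.

```python
import pandas as pd

# Invented rows in the same shape as calendar.csv.
sample = pd.DataFrame({
    "listing_id": [12147973, 3075044, 3075044],
    "date": ["2017-09-05", "2017-08-22", "2017-08-21"],
    "available": ["f", "t", "t"],
    "price": [None, "$65.00", "$1,250.00"],
})

cleaned = sample.assign(
    # Identifiers are labels, not quantities, so store them as strings.
    listing_id=sample["listing_id"].astype(str),
    # Parse ISO-formatted date strings into datetime values.
    date=pd.to_datetime(sample["date"], format="%Y-%m-%d"),
    # Encode availability as 1 (available) or 0 (not available).
    available=(sample["available"] == "t").astype(int),
    # Strip "$" and "," and convert to float; missing prices stay NaN.
    price=pd.to_numeric(sample["price"].str.replace(r"[$,]", "", regex=True)),
)

print(cleaned.dtypes)
```

Leaving unavailable days as `NaN` keeps them out of price averages, whereas filling them with 0 would pull those averages down.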




#### Listings Data

In [None]:
#print the head of listings
listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25


In [None]:
# len_listings = len(listings)
# print(len_listings)
#listings[listings.columns[0]].count()
#len(listings.index)
listings.shape

(3585, 95)

In [None]:
#print description of data
listings.describe()

Unnamed: 0,id,scrape_id,host_id,host_listings_count,host_total_listings_count,neighbourhood_group_cleansed,latitude,longitude,accommodates,bathrooms,...,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,jurisdiction_names,calculated_host_listings_count,reviews_per_month
count,3585.0,3585.0,3585.0,3585.0,3585.0,0.0,3585.0,3585.0,3585.0,3571.0,...,2762.0,2767.0,2765.0,2767.0,2763.0,2764.0,0.0,0.0,3585.0,2829.0
mean,8440875.0,20160910000000.0,24923110.0,58.902371,58.902371,,42.340032,-71.084818,3.041283,1.221647,...,9.431571,9.258041,9.646293,9.646549,9.414043,9.168234,,,12.733891,1.970908
std,4500787.0,0.0,22927810.0,171.119663,171.119663,,0.024403,0.031565,1.778929,0.501487,...,0.931863,1.168977,0.762753,0.735507,0.903436,1.011116,,,29.415076,2.120561
min,3353.0,20160910000000.0,4240.0,0.0,0.0,,42.235942,-71.171789,1.0,0.0,...,2.0,2.0,2.0,4.0,2.0,2.0,,,1.0,0.01
25%,4679319.0,20160910000000.0,6103425.0,1.0,1.0,,42.329995,-71.105083,2.0,1.0,...,9.0,9.0,9.0,9.0,9.0,9.0,,,1.0,0.48
50%,8577620.0,20160910000000.0,19281000.0,2.0,2.0,,42.345201,-71.078429,2.0,1.0,...,10.0,10.0,10.0,10.0,10.0,9.0,,,2.0,1.17
75%,12789530.0,20160910000000.0,36221470.0,7.0,7.0,,42.354685,-71.062155,4.0,1.0,...,10.0,10.0,10.0,10.0,10.0,10.0,,,6.0,2.72
max,14933460.0,20160910000000.0,93854110.0,749.0,749.0,,42.389982,-71.0001,16.0,6.0,...,10.0,10.0,10.0,10.0,10.0,10.0,,,136.0,19.15


In [None]:
#print concise summary of data
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3585 entries, 0 to 3584
Data columns (total 95 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                3585 non-null   int64  
 1   listing_url                       3585 non-null   object 
 2   scrape_id                         3585 non-null   int64  
 3   last_scraped                      3585 non-null   object 
 4   name                              3585 non-null   object 
 5   summary                           3442 non-null   object 
 6   space                             2528 non-null   object 
 7   description                       3585 non-null   object 
 8   experiences_offered               3585 non-null   object 
 9   neighborhood_overview             2170 non-null   object 
 10  notes                             1610 non-null   object 
 11  transit                           2295 non-null   object 
 12  access

In [None]:
# Check if all rows have unique IDs
listings.id.nunique()

3585

In [None]:
# Are there fewer unique host ids than listings?
listings.host_id.nunique()

2181
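
Fewer unique hosts than listings means that some hosts manage more than one listing. A small sketch of how the listings per host could be counted follows, using an invented `sample` frame with only the `id` and `host_id` columns.

```python
import pandas as pd

# Invented listings: host 100 owns three of the five.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "host_id": [100, 100, 200, 300, 100],
})

# Number of listings attached to each host, largest first.
listings_per_host = sample["host_id"].value_counts()

# How many hosts have more than one listing.
multi_listing_hosts = int((listings_per_host > 1).sum())

print(listings_per_host.to_dict(), multi_listing_hosts)
```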

In [None]:
# Count the distinct host response time categories
print(listings.host_response_time.nunique())

4


In [None]:
listings.host_neighbourhood.nunique()


53

In [None]:
# Check values for city
listings.city.value_counts()

Boston                       3381
Roxbury Crossing               24
Somerville                     19
Jamaica Plain                  18
Brookline                      18
Cambridge                      16
Dorchester                     15
Brighton                       15
Charlestown                    15
Allston                        12
Roslindale                      6
West Roxbury                    5
ROXBURY CROSSING                4
Mattapan                        3
East Boston                     3
ALLSTON                         2
South Boston                    2
Jamaica Plain, Boston           2
Hyde Park                       2
Jamaica Plain                   2
Boston, Massachusetts, US       2
Boston                          1
Roslindale, Boston              1
dorchester, boston              1
Milton                          1
Jamaica Plain (Boston)          1
Newton                          1
波士顿                             1
Jamaica Plain, MA               1
Watertown     
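
The counts above list the same place under several spellings (for example "ROXBURY CROSSING" and "Roxbury Crossing"), and some names appear twice, presumably because of stray whitespace. A minimal sketch of a first normalisation pass follows. The `city` series is invented for illustration, and trimming plus title-casing is only one possible approach.

```python
import pandas as pd

# Invented values that mimic the inconsistencies seen in the city column.
city = pd.Series([
    "Boston", "Boston ", "ROXBURY CROSSING", "Roxbury Crossing",
    "Jamaica Plain, Boston", "dorchester, boston",
])

# Trim surrounding whitespace, then unify capitalisation.
normalized = city.str.strip().str.title()

print(normalized.value_counts().to_dict())
```

Variants such as "Boston, Massachusetts, US" or "波士顿" are not merged by this step and would need an explicit mapping.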

#### Reviews Data 

In [None]:
# Print the first rows of reviews
reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,1178162,4724140,2013-05-21,4298113,Olivier,My stay at islam's place was really cool! Good...
1,1178162,4869189,2013-05-29,6452964,Charlotte,Great location for both airport and city - gre...
2,1178162,5003196,2013-06-06,6449554,Sebastian,We really enjoyed our stay at Islams house. Fr...
3,1178162,5150351,2013-06-15,2215611,Marine,The room was nice and clean and so were the co...
4,1178162,5171140,2013-06-16,6848427,Andrew,Great location. Just 5 mins walk from the Airp...


In [None]:
#reviews[reviews.columns[0]].count()
#len(reviews.index)
reviews.shape

(68275, 6)

In [None]:
#print description of data
reviews.describe()

Unnamed: 0,listing_id,id,reviewer_id
count,68275.0,68275.0,68275.0
mean,4759910.0,52465160.0,28023890.0
std,3788990.0,27909910.0,22340970.0
min,3353.0,1021.0,143.0
25%,1458081.0,30104200.0,9001346.0
50%,4080000.0,52231210.0,23051790.0
75%,7377034.0,76632480.0,42134540.0
max,14843780.0,99990450.0,93350340.0


In [None]:
#print concise summary of data
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68275 entries, 0 to 68274
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   listing_id     68275 non-null  int64 
 1   id             68275 non-null  int64 
 2   date           68275 non-null  object
 3   reviewer_id    68275 non-null  int64 
 4   reviewer_name  68275 non-null  object
 5   comments       68222 non-null  object
dtypes: int64(3), object(3)
memory usage: 3.1+ MB


In [None]:
reviews.reviewer_id.nunique()

63789

In [None]:
reviews.duplicated("reviewer_id").sum()

4486

In [None]:
reviews.duplicated(subset=["reviewer_id", "listing_id"]).sum()

947

In [None]:
print(reviews.duplicated(subset=["reviewer_id", "listing_id", "date"]).sum())
reviews[reviews.duplicated(subset=["reviewer_id", "listing_id", "date"], keep=False)]

10


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
6558,1529321,32216803,2015-05-15,30877683,Jonathan,Nice neighborhood with a lot of local life and...
6559,1529321,32223055,2015-05-15,30877683,Jonathan,"A+ treatment all the way, you are an adult in ..."
29370,568234,19256952,2014-09-08,16199611,Lyn,The reservation was canceled 23 days before ar...
29371,568234,19256954,2014-09-08,16199611,Lyn,The reservation was canceled 24 days before ar...
30826,4402209,66858887,2016-03-25,63239764,Sandiya,My night at Aris house was perfect yet again! ...
30827,4402209,66963792,2016-03-25,63239764,Sandiya,"The hosts communication was great, the listin..."
33673,3897963,68851332,2016-04-07,47193495,Emily,Ari was a great host! Easy check in and check ...
33674,3897963,68854695,2016-04-07,47193495,Emily,Very easy check in and check out and super com...
35146,11826815,65924361,2016-03-18,51343202,Jon,Derian offered me a room last minute without h...
35147,11826815,65931647,2016-03-18,51343202,Jon,"Again, It was exceptional."


In [None]:
print(reviews.duplicated(subset=["reviewer_id", "listing_id", "comments"]).sum())

reviews[reviews.duplicated(subset=["reviewer_id", "listing_id", "comments"], keep=False)]

18


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
130,1178162,36990201,2015-07-03,35957819,Austin,"Great space. Cozy and right off the airport, v..."
132,1178162,37500401,2015-07-07,35957819,Austin,"Great space. Cozy and right off the airport, v..."
3493,3415033,34129761,2015-06-05,29861162,Matthew,Donny and Chris were extremely welcoming. I ha...
3494,3415033,34531976,2015-06-09,29861162,Matthew,Donny and Chris were extremely welcoming. I ha...
4067,6500646,85351737,2016-07-11,8387157,Ion,"Staying here was a great experience, the host ..."
4068,6500646,86638486,2016-07-17,8387157,Ion,"Staying here was a great experience, the host ..."
4714,6347026,35601363,2015-06-20,13825246,Jay,Megan and Stephen opened up their home to me a...
4715,6347026,35718502,2015-06-21,13825246,Jay,Megan and Stephen opened up their home to me a...
10469,24240,135659,2010-11-08,216891,David,"Best houseboat ever! Gretchen was very nice, ..."
10470,24240,138161,2010-11-13,216891,David,"Best houseboat ever! Gretchen was very nice, ..."


In [None]:
reviews[reviews.duplicated(subset=["reviewer_id", "listing_id", "date"])]

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
6559,1529321,32223055,2015-05-15,30877683,Jonathan,"A+ treatment all the way, you are an adult in ..."
29371,568234,19256954,2014-09-08,16199611,Lyn,The reservation was canceled 24 days before ar...
30827,4402209,66963792,2016-03-25,63239764,Sandiya,"The hosts communication was great, the listin..."
33674,3897963,68854695,2016-04-07,47193495,Emily,Very easy check in and check out and super com...
35147,11826815,65931647,2016-03-18,51343202,Jon,"Again, It was exceptional."
35148,11826815,66021514,2016-03-18,51343202,Jon,"Again, It was excellent. Thank you."
38564,1173306,7110330,2013-09-06,8492099,Tommy,"Great spot walking distance to food, drink, an..."
42897,3901439,66794902,2016-03-24,9618964,Gregory,I've stayed several times at Ari's properties....
54567,3897995,66654316,2016-03-23,39441871,Doron,Great host and place. I airbnb at Ari's each ...
56980,3866526,66022211,2016-03-18,63239764,Sandiya,My stay at Aris was just as he described. I ...
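
The two views of the same-day duplicates differ in the `keep` argument of `duplicated`: the default `keep="first"` flags only the later copies, while `keep=False` flags every member of a duplicated group. The toy frame below (invented values) shows the difference for a group of three.

```python
import pandas as pd

# Invented reviews: one reviewer posts three times on the same listing and day.
sample = pd.DataFrame({
    "reviewer_id": [1, 1, 1, 2],
    "listing_id": [10, 10, 10, 20],
    "date": ["2016-03-18", "2016-03-18", "2016-03-18", "2016-03-19"],
})
keys = ["reviewer_id", "listing_id", "date"]

# Default keep="first": the first row of each group is not flagged.
later_copies = sample.duplicated(subset=keys)

# keep=False: every row that belongs to a duplicated group is flagged.
all_copies = sample.duplicated(subset=keys, keep=False)

print(int(later_copies.sum()), int(all_copies.sum()))
```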


In [None]:
reviews.comments.isnull().value_counts()


False    68222
True        53
Name: comments, dtype: int64

**Review Data Issues Identified**

- reviewer_id: convert to `str`
- date: convert to `datetime`
- reviewer_id: convert to dummy
- comments: decide how to handle missing values (53 rows) and duplicated reviews
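
A possible cleaning pass for the reviews table is sketched below on invented rows. Dropping reviews without comments and dropping repeated comments from the same reviewer on the same listing are assumptions made for illustration, and either choice may be wrong for a particular analysis.

```python
import pandas as pd

# Invented reviews: rows 1 and 2 repeat a comment, row 4 has no comment.
sample = pd.DataFrame({
    "listing_id": [1178162, 1178162, 1178162, 24240],
    "id": [1, 2, 3, 4],
    "date": ["2015-07-03", "2015-07-07", "2013-05-21", "2010-11-08"],
    "reviewer_id": [35957819, 35957819, 4298113, 216891],
    "comments": ["Great space.", "Great space.", "Really cool!", None],
})

cleaned = (
    sample.assign(
        # Parse ISO-formatted date strings into datetime values.
        date=pd.to_datetime(sample["date"], format="%Y-%m-%d"),
        # Identifiers are labels, so store them as strings.
        reviewer_id=sample["reviewer_id"].astype(str),
    )
    # Remove reviews that have no comment text.
    .dropna(subset=["comments"])
    # Keep the first copy of a repeated comment by one reviewer on one listing.
    .drop_duplicates(subset=["reviewer_id", "listing_id", "comments"])
    .reset_index(drop=True)
)

print(cleaned["id"].tolist())
```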



## Section 3: Data Preparation