# Seattle and Boston Airbnb Data

"Airbnb (ABNB) is an online marketplace that connects people who want to rent out their homes with people who are looking for accommodations in specific locales.

The company has come a long way since 2007, when its co-founders first came up with the idea to invite paying guests to sleep on an air mattress in their living room. According to Airbnb's latest data, it has in excess of six million listings, covering more than 100,000 cities and towns and 220-plus countries worldwide." 

Information gathered from:
https://www.investopedia.com/articles/personal-finance/032814/pros-and-cons-using-airbnb.asp

This project analyzes the Seattle and Boston Airbnb datasets. For each city, they contain listing details, guest reviews and review scores, and calendar information that includes prices.

<br> There are some suggested questions for both datasets:</br>

<br>1) Can you describe the vibe of each city neighborhood using listing descriptions?</br>
<br>2) What are the busiest times of the year to visit each city? By how much do prices spike?</br>
<br>3) Is there a general upward trend of both new Airbnb listings and total Airbnb visitors to each city?</br>
<br>4) Which city has the best Airbnb ratings?</br>
<br>5) Is the review related to the price?</br>
<br>6) Which are the best rated hosts?</br>

### A Look at the Data

To get a better understanding of the data used throughout this project, let's take a look at some of the characteristics of each dataset.

First, let's read in the data and necessary libraries.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython import display

# Load the calendar, listings, and reviews data for each city
Seattle_calendar = pd.read_csv('./Seattle/calendar.csv')
Seattle_listings = pd.read_csv('./Seattle/listings.csv')
Seattle_reviews = pd.read_csv('./Seattle/reviews.csv')

Boston_calendar = pd.read_csv('./Boston/calendar.csv')
Boston_listings = pd.read_csv('./Boston/listings.csv')
Boston_reviews = pd.read_csv('./Boston/reviews.csv')


**1.** Number of rows and columns.

In [7]:
print("Seattle Calendar:" + str(Seattle_calendar.shape))
print("Seattle listings:" + str(Seattle_listings.shape))
print("Seattle reviews:" + str(Seattle_reviews.shape))
print("Boston Calendar:" + str(Boston_calendar.shape))
print("Boston listings:" + str(Boston_listings.shape))
print("Boston reviews:" + str(Boston_reviews.shape))

Seattle Calendar:(1393570, 4)
Seattle listings:(3818, 92)
Seattle reviews:(84849, 6)
Boston Calendar:(1308890, 4)
Boston listings:(3585, 95)
Boston reviews:(68275, 6)


**2.** Which columns had no missing values? Provide a set of column names that have no missing values.

In [11]:
# info() prints its summary directly and returns None, so it is not wrapped in print()
Seattle_calendar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1393570 entries, 0 to 1393569
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   listing_id  1393570 non-null  int64 
 1   date        1393570 non-null  object
 2   available   1393570 non-null  object
 3   price       934542 non-null   object
dtypes: int64(1), object(3)
memory usage: 42.5+ MB


In [14]:
Seattle_calendar.price.isna().sum()/len(Seattle_calendar)

0.32938998399793334

In [32]:
Seattle_calendar.head()

Unnamed: 0,listing_id,date,available,price
0,241032,2016-01-04,t,$85.00
1,241032,2016-01-05,t,$85.00
2,241032,2016-01-06,f,
3,241032,2016-01-07,f,
4,241032,2016-01-08,f,


32.94% of the Seattle price values are null. This proportion matters because it determines how the missing values should be treated.
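The first rows shown above suggest that the price is empty on dates where `available` is `f`. A minimal sketch of how that relationship could be checked, using a small made-up frame in place of `Seattle_calendar` (the toy values are illustrative only and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for Seattle_calendar; values are illustrative only.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2],
    "date": ["2016-01-04", "2016-01-05", "2016-01-06", "2016-01-04", "2016-01-05"],
    "available": ["t", "t", "f", "f", "t"],
    "price": ["$85.00", "$85.00", None, None, "$120.00"],
})

# Cross-tabulate availability against whether the price is missing.
null_by_availability = pd.crosstab(calendar["available"], calendar["price"].isna())
print(null_by_availability)

# True only if a price is missing exactly on the unavailable dates.
nulls_match_unavailable = bool(
    (calendar["price"].isna() == (calendar["available"] == "f")).all()
)
print(nulls_match_unavailable)
```

If the same check holds on the real data, the null prices would be a consequence of unavailability rather than a data quality problem, which would affect how they are treated.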

In [33]:
Seattle_listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,...,10.0,f,,WASHINGTON,f,moderate,f,f,2,4.07
1,953595,https://www.airbnb.com/rooms/953595,20160104002432,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",...,10.0,f,,WASHINGTON,f,strict,t,t,6,1.48
2,3308979,https://www.airbnb.com/rooms/3308979,20160104002432,2016-01-04,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,...,10.0,f,,WASHINGTON,f,strict,f,f,2,1.15
3,7421966,https://www.airbnb.com/rooms/7421966,20160104002432,2016-01-04,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,none,,...,,f,,WASHINGTON,f,flexible,f,f,1,
4,278830,https://www.airbnb.com/rooms/278830,20160104002432,2016-01-04,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,...,9.0,f,,WASHINGTON,f,strict,f,f,1,0.89


In [15]:
Seattle_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 92 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                3818 non-null   int64  
 1   listing_url                       3818 non-null   object 
 2   scrape_id                         3818 non-null   int64  
 3   last_scraped                      3818 non-null   object 
 4   name                              3818 non-null   object 
 5   summary                           3641 non-null   object 
 6   space                             3249 non-null   object 
 7   description                       3818 non-null   object 
 8   experiences_offered               3818 non-null   object 
 9   neighborhood_overview             2786 non-null   object 
 10  notes                             2212 non-null   object 
 11  transit                           2884 non-null   object 
 12  thumbn

In [30]:
# Columns where more than 75% of the values are missing
missing_share = Seattle_listings.isnull().mean()
most_missing_cols = missing_share[missing_share > 0.75].index.tolist()

In [31]:
most_missing_cols

['square_feet', 'license']

The columns that have more than 75% of missing values are square_feet and license. Therefore, those columns are going to be deleted. 

In addition, the URL columns, scrape_id, and last_scraped do not provide useful information for this analysis, so they will be deleted as well.
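A minimal sketch of how these columns could be dropped, assuming a small made-up frame in place of `Seattle_listings`. Selecting URL columns by the `_url` suffix is an assumption about the naming, and `listings_clean` is a hypothetical name:

```python
import pandas as pd

# Toy stand-in for Seattle_listings; values are illustrative only.
listings = pd.DataFrame({
    "id": [1, 2],
    "listing_url": ["https://example.com/1", "https://example.com/2"],
    "scrape_id": [100, 100],
    "last_scraped": ["2016-01-04", "2016-01-04"],
    "square_feet": [None, None],
    "license": [None, None],
    "name": ["A", "B"],
})

# Columns flagged as mostly empty, plus the scrape metadata.
mostly_missing = ["square_feet", "license"]
metadata = ["scrape_id", "last_scraped"]

# URL columns are picked up by name (assumes they all end in "_url").
url_like = [c for c in listings.columns if c.endswith("_url")]

to_drop = sorted(set(mostly_missing + metadata + url_like))
listings_clean = listings.drop(columns=to_drop)
print(listings_clean.columns.tolist())
```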

The price, weekly price, monthly price, security deposit, and cleaning fee columns have the wrong data type.
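The calendar preview shows prices formatted as strings such as `$85.00`. Assuming the listings price columns use the same format, one possible conversion is sketched below on made-up values. The helper `currency_to_float` and the column names in `money_cols` are hypothetical and would need to be checked against the real frame:

```python
import pandas as pd

# Toy stand-in for the listings data; values are illustrative only.
listings = pd.DataFrame({
    "price": ["$85.00", "$1,250.00", None],
    "cleaning_fee": ["$40.00", None, "$15.50"],
})

def currency_to_float(series: pd.Series) -> pd.Series:
    """Strip '$' and thousands separators, then convert to float (missing stays NaN)."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

money_cols = ["price", "cleaning_fee"]  # assumed column names
listings[money_cols] = listings[money_cols].apply(currency_to_float)
print(listings.dtypes)
```

Using `errors="coerce"` turns any value that still fails to parse into NaN instead of raising, so it is worth comparing the null counts before and after the conversion.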

In [35]:
Seattle_reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb..."
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...


In [36]:
Seattle_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84849 entries, 0 to 84848
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   listing_id     84849 non-null  int64 
 1   id             84849 non-null  int64 
 2   date           84849 non-null  object
 3   reviewer_id    84849 non-null  int64 
 4   reviewer_name  84849 non-null  object
 5   comments       84831 non-null  object
dtypes: int64(3), object(3)
memory usage: 3.9+ MB


In [41]:
Seattle_reviews.comments.isna().sum()/len(Seattle_reviews)*100

0.021214156914047308

The date column needs to be converted to a datetime type because it is currently stored as an object.
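A minimal sketch of the date conversion, using a made-up frame in place of `Seattle_reviews`. It assumes the dates follow the `YYYY-MM-DD` format seen in the preview:

```python
import pandas as pd

# Toy stand-in for Seattle_reviews; values are illustrative only.
reviews = pd.DataFrame({
    "listing_id": [10, 10, 11],
    "date": ["2015-07-19", "2015-07-20", "2015-08-02"],
})

# Parse the ISO-formatted strings into datetime64 values.
reviews["date"] = pd.to_datetime(reviews["date"], format="%Y-%m-%d")

# Once the column is a datetime, calendar parts are available through .dt.
reviews["month"] = reviews["date"].dt.month
print(reviews.dtypes)
```

Having the month available as a number is one way to approach the question about the busiest times of the year.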

In [42]:
Boston_calendar.head()

Unnamed: 0,listing_id,date,available,price
0,12147973,2017-09-05,f,
1,12147973,2017-09-04,f,
2,12147973,2017-09-03,f,
3,12147973,2017-09-02,f,
4,12147973,2017-09-01,f,


In [43]:
Boston_calendar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1308890 entries, 0 to 1308889
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   listing_id  1308890 non-null  int64 
 1   date        1308890 non-null  object
 2   available   1308890 non-null  object
 3   price       643037 non-null   object
dtypes: int64(1), object(3)
memory usage: 39.9+ MB


In [46]:
Boston_calendar.price.isna().sum()/len(Boston_calendar)*100

50.87157820748879

About 50.87% of the Boston prices are null. In addition, the date column is stored as an object rather than a date.
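The share of missing values can also be computed for every column in one pass instead of one column at a time. A small sketch on a made-up frame standing in for `Boston_calendar` (values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for Boston_calendar; values are illustrative only.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "date": ["2017-09-05", "2017-09-04", "2017-09-05", "2017-09-04"],
    "available": ["f", "f", "t", "t"],
    "price": [None, None, "$65.00", "$65.00"],
})

# isna() marks missing cells; the column mean of those booleans is the null share.
null_pct = calendar.isna().mean().mul(100).round(2).sort_values(ascending=False)
print(null_pct)
```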

In [47]:
Boston_listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25


In [48]:
Boston_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3585 entries, 0 to 3584
Data columns (total 95 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                3585 non-null   int64  
 1   listing_url                       3585 non-null   object 
 2   scrape_id                         3585 non-null   int64  
 3   last_scraped                      3585 non-null   object 
 4   name                              3585 non-null   object 
 5   summary                           3442 non-null   object 
 6   space                             2528 non-null   object 
 7   description                       3585 non-null   object 
 8   experiences_offered               3585 non-null   object 
 9   neighborhood_overview             2170 non-null   object 
 10  notes                             1610 non-null   object 
 11  transit                           2295 non-null   object 
 12  access

In [54]:
# Columns where more than 75% of the values are missing
missing_shareB = Boston_listings.isnull().mean()
most_missing_colsB = missing_shareB[missing_shareB > 0.75].index.tolist()

In [55]:
most_missing_colsB

['neighbourhood_group_cleansed',
 'square_feet',
 'weekly_price',
 'monthly_price',
 'has_availability',
 'license',
 'jurisdiction_names']

The columns with more than 75% missing values are: 'neighbourhood_group_cleansed', 'square_feet', 'weekly_price', 'monthly_price', 'has_availability', 'license', and 'jurisdiction_names'. It remains to be decided whether they will be deleted or treated, for example by setting null values to 0. Either way, the final set of columns should be the same as in the Seattle listings data.
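Since the Boston listings have 95 columns and the Seattle listings have 92, it may help to list the columns that differ before deciding what to drop. A sketch with made-up frames; `extra_col` is a hypothetical placeholder, and the column sets here are far smaller than the real ones:

```python
import pandas as pd

# Toy stand-ins for the two listings frames; column sets are illustrative only.
seattle = pd.DataFrame(columns=["id", "name", "price"])
boston = pd.DataFrame(columns=["id", "name", "price", "access", "extra_col"])

# Columns present in one frame but not in the other.
only_in_boston = boston.columns.difference(seattle.columns).tolist()
only_in_seattle = seattle.columns.difference(boston.columns).tolist()
print(only_in_boston, only_in_seattle)

# Restrict Boston to the shared columns, keeping the Seattle column order.
shared = [c for c in seattle.columns if c in boston.columns]
boston_aligned = boston[shared]
print(boston_aligned.columns.tolist())
```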
 
In addition, the URL columns, scrape_id, and last_scraped do not provide useful information for this analysis, so they will be deleted.

The price, weekly price, monthly price, security deposit, and cleaning fee columns have the wrong data type.

In [56]:
Boston_reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,1178162,4724140,2013-05-21,4298113,Olivier,My stay at islam's place was really cool! Good...
1,1178162,4869189,2013-05-29,6452964,Charlotte,Great location for both airport and city - gre...
2,1178162,5003196,2013-06-06,6449554,Sebastian,We really enjoyed our stay at Islams house. Fr...
3,1178162,5150351,2013-06-15,2215611,Marine,The room was nice and clean and so were the co...
4,1178162,5171140,2013-06-16,6848427,Andrew,Great location. Just 5 mins walk from the Airp...


In [57]:
Boston_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68275 entries, 0 to 68274
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   listing_id     68275 non-null  int64 
 1   id             68275 non-null  int64 
 2   date           68275 non-null  object
 3   reviewer_id    68275 non-null  int64 
 4   reviewer_name  68275 non-null  object
 5   comments       68222 non-null  object
dtypes: int64(3), object(3)
memory usage: 3.1+ MB


**3.** Describe the information in each dataset.

## General Findings
<br><b>Seattle calendar</b></br>
<br> * 32.94% of the Seattle price values are null.</br>
<br> * The date column in the Seattle calendar is stored as an object rather than a date, and the price column is also stored as an object. Both data types need to be changed.</br>
<br><b>Seattle listings</b></br>
<br> * The columns with more than 75% missing values in the Seattle listings are square_feet and license. Therefore, those columns are going to be deleted.</br>
<br> * In addition, the listing URL does not provide useful information for this analysis, so it will be deleted.</br>
<br> * The price, weekly price, monthly price, security deposit, and cleaning fee columns have the wrong data type.</br>
<br><b>Seattle reviews</b></br>
<br> * The date column needs to be converted to a date type because it is stored as an object.</br>
<br><b>Boston calendar</b></br>
<br> * About 50.87% of the prices are null. In addition, the date column is stored as an object rather than a date.</br>
<br><b>Boston listings</b></br>
<br> * The columns with more than 75% missing values are: 'neighbourhood_group_cleansed', 'square_feet', 'weekly_price', 'monthly_price', 'has_availability', 'license', and 'jurisdiction_names'. It remains to be decided whether they will be deleted or treated, for example by setting null values to 0. Either way, the final set of columns should be the same as in the Seattle listings data.</br>
<br> * In addition, the listing URL does not provide useful information for this analysis, so it will be deleted.</br>
<br> * The price, weekly price, monthly price, security deposit, and cleaning fee columns have the wrong data type.</br>
<br><b>Boston reviews</b></br>
<br> * The date column needs to be converted to a date type because it is stored as an object.</br>
<br><b>General</b></br>
<br> * There should be one dataset for each dimension and one for the facts, regardless of the city.</br>
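One possible way to keep a single table per kind of data regardless of the city is to stack the two city frames and record the city in its own column. This is only a sketch on made-up values. The `city` column and the `combined_calendar` name are assumptions, not part of the original datasets:

```python
import pandas as pd

# Toy stand-ins for the two calendar frames; values are illustrative only.
seattle_calendar = pd.DataFrame({
    "listing_id": [1, 2],
    "date": ["2016-01-04", "2016-01-04"],
    "available": ["t", "f"],
})
boston_calendar = pd.DataFrame({
    "listing_id": [7],
    "date": ["2017-09-05"],
    "available": ["f"],
})

# Tag each frame with its city, then stack them into a single table.
combined_calendar = pd.concat(
    [
        seattle_calendar.assign(city="Seattle"),
        boston_calendar.assign(city="Boston"),
    ],
    ignore_index=True,
)
print(combined_calendar)
```

The same pattern would apply to the listings and reviews frames once their columns have been aligned.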