# Airbnb Raw Data Aggregation

### Introduction

In this notebook I aggregate about a year’s worth of listings, reviews, and calendar data published by [Inside Airbnb](http://insideairbnb.com/get-the-data.html).

The data pertains to the San Francisco (SF) area from December of 2018 through December of 2019.

I aggregate the listings, reviews, and calendar information into separate data frames and write them to CSV files for downstream analysis. The aggregated files are not available on GitHub; however, all of the raw data used can be viewed [here](https://github.com/KishenSharma6/Airbnb-SF_ML_-_Text_Analysis/tree/master/Data/01_Raw/SF%20Airbnb%20Raw%20Data).

### Aggregation

In [1]:
#Read in libraries
import pandas as pd
import glob

#Set path to location of SF Airbnb raw data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data/'

**SF Listings Data Aggregation (11/2018 - 12/2019)**

In [2]:
#Capture listings data from path
all_files = glob.glob(path + "listings?*.gz")

#Create empty list to append csv files
li = []

#Read in listings data and append to li
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)


In [3]:
#Create aggregated dataframe
listings = pd.concat(li, sort=True, axis=0, ignore_index=True)

#Drop any duplicate rows
listings.drop_duplicates(inplace=True)

#View listings shape
print('Listings shape: ', listings.shape)

#Check to make sure columns have been properly concatenated
listings.head()

Listings shape:  (98796, 106)


Unnamed: 0,access,accommodates,amenities,availability_30,availability_365,availability_60,availability_90,bathrooms,bed_type,bedrooms,...,space,square_feet,state,street,summary,thumbnail_url,transit,weekly_price,xl_picture_url,zipcode
0,*Full access to patio and backyard (shared wit...,3,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Pets liv...",0,77,0,1,1.0,Real Bed,1.0,...,"Newly remodeled, modern, and bright garden uni...",,CA,"San Francisco, CA, United States",New update: the house next door is under const...,,*Public Transportation is 1/2 block away. *Ce...,"$1,120.00",,94117
1,"Our deck, garden, gourmet kitchen and extensiv...",5,"{Internet,Wifi,Kitchen,Heating,""Family/kid fri...",0,0,0,0,1.0,Real Bed,2.0,...,We live in a large Victorian house on a quiet ...,,CA,"San Francisco, CA, United States",,,The train is two blocks away and you can stop ...,"$1,600.00",,94110
2,,2,"{TV,Internet,Wifi,Kitchen,""Free street parking...",30,365,60,90,4.0,Real Bed,1.0,...,Room rental-sunny view room/sink/Wi Fi (inner ...,,CA,"San Francisco, CA, United States",Nice and good public transportation. 7 minute...,,N Juda Muni and bus stop. Street parking.,$485.00,,94117
3,,2,"{TV,Internet,Wifi,Kitchen,""Free street parking...",30,365,60,90,4.0,Real Bed,1.0,...,Room rental Sunny view Rm/Wi-Fi/TV/sink/large ...,,CA,"San Francisco, CA, United States",Nice and good public transportation. 7 minute...,,"N Juda Muni, Bus and UCSF Shuttle. small shopp...",$490.00,,94117
4,Guests have access to everything listed and sh...,5,"{TV,Internet,Wifi,Kitchen,Heating,""Family/kid ...",30,90,60,90,1.5,Real Bed,2.0,...,Please send us a quick message before booking ...,,CA,"San Francisco, CA, United States",Pls email before booking. Interior featured i...,,,,,94117
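
The read, concatenate, and de-duplicate steps above apply equally to the reviews and calendar files, so they could be wrapped in one function. The following is a minimal sketch, not part of the original notebook: `aggregate_snapshots` is a hypothetical helper name, and the two toy files and their values are made up purely so the example runs on its own. Unlike the cell above, it also resets the index after dropping duplicates.

```python
import glob
import os
import tempfile

import pandas as pd


def aggregate_snapshots(folder, prefix):
    """Read every '<prefix>?*.gz' file in `folder` into one de-duplicated DataFrame.

    Hypothetical helper that mirrors the read -> concat -> drop_duplicates steps.
    """
    # sorted() keeps the concatenation order reproducible across runs
    files = sorted(glob.glob(os.path.join(folder, prefix + "?*.gz")))
    if not files:
        raise FileNotFoundError("No files matching " + prefix + "?*.gz in " + folder)

    # pandas infers gzip compression from the .gz extension
    frames = [pd.read_csv(f, index_col=None, header=0) for f in files]

    combined = pd.concat(frames, sort=True, axis=0, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)


# Toy demonstration: two overlapping snapshot files with made-up values
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}).to_csv(
        os.path.join(tmp, "listings_2018-12.csv.gz"), index=False)
    pd.DataFrame({"id": [2, 3], "name": ["b", "c"]}).to_csv(
        os.path.join(tmp, "listings_2019-01.csv.gz"), index=False)
    demo = aggregate_snapshots(tmp, "listings")

print("Demo shape: ", demo.shape)
```

With the notebook's own variables, the call would look like `aggregate_snapshots(path, "listings")`, assuming `path` points at the raw data folder.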


In [4]:
#Set path to write df to csv
path_to_file = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Listings_Raw_Aggregated.csv'

#Write to csv file
listings.to_csv(path_to_file, sep=',')
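
One detail worth knowing about `to_csv`: by default it writes the DataFrame index as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back with `read_csv`. Whether that matters depends on how the downstream notebooks load these files, which is not shown here. The sketch below uses a small toy frame (illustrative values only) to compare the default with `index=False`.

```python
import io

import pandas as pd

# Toy frame standing in for the aggregated data (values are illustrative only)
toy = pd.DataFrame({"id": [958, 5858], "zipcode": [94117, 94110]})

# Default behaviour: the index is written as an unnamed first column
buf_default = io.StringIO()
toy.to_csv(buf_default, sep=",")
buf_default.seek(0)
with_index = pd.read_csv(buf_default)

# index=False: only the data columns are written
buf_clean = io.StringIO()
toy.to_csv(buf_clean, sep=",", index=False)
buf_clean.seek(0)
without_index = pd.read_csv(buf_clean)

print(list(with_index.columns))     # ['Unnamed: 0', 'id', 'zipcode']
print(list(without_index.columns))  # ['id', 'zipcode']
```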

**SF Reviews Data Aggregation (11/2018 - 12/2019)**

In [5]:
#Capture reviews data
all_files = glob.glob(path + "reviews?*.gz")

#Create empty list
li = []

#Read in reviews data and append to li
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

#Create aggregated dataframe
reviews = pd.concat(li, sort=True, axis=0, ignore_index=True)

#Drop duplicate rows
reviews.drop_duplicates(inplace=True)

#View Shape
print('Reviews shape: ', reviews.shape)

#Check to make sure columns have been properly concatenated
reviews.head()

Reviews shape:  (458157, 6)


Unnamed: 0,comments,date,id,listing_id,reviewer_id,reviewer_name
0,"Our experience was, without a doubt, a five st...",2009-07-23,5977,958,15695,Edmund C
1,Returning to San Francisco is a rejuvenating t...,2009-08-03,6660,958,26145,Simon
2,We were very pleased with the accommodations a...,2009-09-27,11519,958,25839,Denis
3,We highly recommend this accomodation and agre...,2009-11-05,16282,958,33750,Anna
4,Holly's place was great. It was exactly what I...,2010-02-13,26008,958,15416,Venetia
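
`drop_duplicates()` with no arguments removes only rows that match in every column. If the same review appeared in two snapshots with any difference (an edited comment, say), both copies would survive. Whether that happens in this data is not established here; the sketch below simply shows one way to check for it, using a toy frame with made-up values and the `id` column as the assumed unique key.

```python
import pandas as pd

# Toy stand-in for the aggregated reviews (values are made up for illustration)
toy_reviews = pd.DataFrame({
    "id": [5977, 5977, 6660, 6660],
    "listing_id": [958, 958, 958, 958],
    "comments": ["Great stay", "Great stay", "Lovely", "Lovely!"],
})

# Exact-row de-duplication, as in the cell above
deduped = toy_reviews.drop_duplicates()

# Rows that still share an `id` differ in at least one other column
repeated_ids = deduped[deduped.duplicated(subset="id", keep=False)]

# Rows before, rows after, and number of ids that still repeat
print(len(toy_reviews), len(deduped), repeated_ids["id"].nunique())  # 4 3 1
```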


In [6]:
#Set path to write df to csv
path_to_file = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Reviews_Raw_Aggregated.csv'

#Write to csv file
reviews.to_csv(path_to_file, sep=',')

**SF Calendar Data Aggregation (09/2018 - 10/2019)**

In [7]:
#Capture calendar data
all_files = glob.glob(path + "calendar?*.gz")

#Create empty list
li = []

#Read in calendar data and append to li
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

#Create data frame
calendar = pd.concat(li, sort=True, axis=0, ignore_index=True)

#Drop any duplicate rows
calendar.drop_duplicates(inplace=True)

#View Shape
print('Calendar shape: ', calendar.shape)

#Check to make sure columns have been properly concatenated
calendar.head()

Calendar shape:  (18089659, 7)


Unnamed: 0,adjusted_price,available,date,listing_id,maximum_nights,minimum_nights,price
0,$80.00,f,2019-04-03,187730,120.0,3.0,$80.00
1,$80.00,f,2019-04-04,187730,120.0,3.0,$80.00
2,$82.00,t,2019-04-05,187730,120.0,3.0,$82.00
3,$82.00,t,2019-04-06,187730,120.0,3.0,$82.00
4,$81.00,t,2019-04-07,187730,120.0,3.0,$81.00


In [None]:
#Set path to write df to csv
path_to_file = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Calendar_Raw_Aggregated.csv'

#Write to csv file
calendar.to_csv(path_to_file, sep=',')
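
At roughly 18 million rows, the aggregated calendar makes for a large plain-text CSV. One option, sketched below under stated assumptions, is to write it gzip-compressed, which `read_csv` can load back directly. The output file name and temporary directory are hypothetical, `index=False` is a choice made for this sketch rather than the notebook's setting, and the toy rows reuse values from the preview above.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the aggregated calendar (two rows taken from the preview)
toy_calendar = pd.DataFrame({
    "listing_id": [187730, 187730],
    "date": ["2019-04-03", "2019-04-04"],
    "available": ["f", "f"],
    "price": ["$80.00", "$80.00"],
})

# Hypothetical output location; a temporary directory keeps the sketch self-contained
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "calendar_aggregated.csv.gz")

# compression="gzip" is stated explicitly; pandas would also infer it from ".gz"
toy_calendar.to_csv(out_path, sep=",", index=False, compression="gzip")

# read_csv infers the compression from the file extension when reading back
round_trip = pd.read_csv(out_path)
print("Round-trip shape: ", round_trip.shape)
```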