### Introduction

In this notebook, I clean a raw file of review data from Inside Airbnb and merge in review scores from a cleaned listings file.

**Read in libraries**

In [23]:
import numpy as np
import pandas as pd
import swifter

**Set notebook preferences**

In [24]:
#Set pandas preferences
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

**Read in data**

In [25]:
#Set path to raw data on local machine
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\01_Raw\SF Airbnb'

#Read in raw review data
reviews = pd.read_csv(path + '/2020_0526_Aggregated_Reviews.csv',
                      parse_dates=['date'], index_col=0)

#Set path to cleaned data on local machine
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\02_Cleaned'

#Read in cleaned listings data
listings = pd.read_csv(path + '/2020_0520_Listings_Cleaned.csv', index_col=0,
                       parse_dates=['host_since', 'last_review'])
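
The two reads above build file paths by string concatenation against hard-coded Windows directories. A minimal sketch of the same folder layout with `pathlib`, assuming a hypothetical project root `DATA_ROOT` (not part of the original notebook) that contains the `01_Raw/SF Airbnb` and `02_Cleaned` folders:

```python
from pathlib import Path

# Hypothetical data root; point this at the real "Data" folder on your machine.
DATA_ROOT = Path("Data")

# Folder and file names are taken from the notebook.
RAW_DIR = DATA_ROOT / "01_Raw" / "SF Airbnb"
CLEAN_DIR = DATA_ROOT / "02_Cleaned"

reviews_path = RAW_DIR / "2020_0526_Aggregated_Reviews.csv"
listings_path = CLEAN_DIR / "2020_0520_Listings_Cleaned.csv"

# Path objects can be passed straight to pandas, for example:
# reviews = pd.read_csv(reviews_path, parse_dates=["date"], index_col=0)
print(reviews_path)
print(listings_path)
```

Keeping the root in one variable means only `DATA_ROOT` needs to change when the project moves to another machine.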

### Data Overview

**Preview Data**

In [26]:
#Display data, print shape
print('Review data shape:', reviews.shape)
display(reviews.head(3))

Review data shape: (466004, 6)


| | comments | date | id | listing_id | reviewer_id | reviewer_name |
|---|---|---|---|---|---|---|
| 0 | Our experience was, without a doubt, a five st... | 2009-07-23 | 5977 | 958 | 15695 | Edmund C |
| 1 | Returning to San Francisco is a rejuvenating t... | 2009-08-03 | 6660 | 958 | 26145 | Simon |
| 2 | We were very pleased with the accommodations a... | 2009-09-27 | 11519 | 958 | 25839 | Denis |


**View data description**

In [27]:
#View data description
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 466004 entries, 0 to 359216
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   comments       465824 non-null  object        
 1   date           466004 non-null  datetime64[ns]
 2   id             466004 non-null  int64         
 3   listing_id     466004 non-null  int64         
 4   reviewer_id    466004 non-null  int64         
 5   reviewer_name  466003 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 24.9+ MB
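
Reading the non-null counts above, 466004 - 465824 = 180 reviews have no comment text and one review has no reviewer name. A short sketch of how those gaps could be counted directly, using a small made-up frame rather than the real data:

```python
import pandas as pd

# Toy data for illustration only; the values are not from the Inside Airbnb file.
toy = pd.DataFrame({
    "comments": ["Great stay", None, "Clean and quiet"],
    "reviewer_name": ["Ann", "Bo", None],
    "listing_id": [958, 958, 5858],
})

# Count missing values per column and keep only columns that have any.
missing_counts = toy.isna().sum()
print(missing_counts[missing_counts > 0])
```

On the real data, the same two lines applied to `reviews` should report only `comments` and `reviewer_name`.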


## Data Cleaning

### Drop Columns

**Subset listings columns for merge**

In [28]:
#Subset
listings = listings[['id','review_scores_rating','last_review','host_id','host_name']]

#Check
display(listings.head())

| | id | review_scores_rating | last_review | host_id | host_name |
|---|---|---|---|---|---|
| 0 | 958 | 97.0 | 2020-03-13 | 1169 | Holly |
| 1 | 5858 | 98.0 | 2017-08-06 | 8904 | Philip And Tania |
| 2 | 7918 | 84.0 | 2020-03-06 | 21994 | Aaron |
| 3 | 8142 | 93.0 | 2018-09-12 | 21994 | Aaron |
| 4 | 8339 | 97.0 | 2019-06-28 | 24215 | Rosy |


### Merge Data

In [29]:
#Left-merge reviews with listings, matching each review to a listing row
#with the same listing id whose last_review equals the review date
merged_df = pd.merge(left=reviews, right=listings, how='left',
                     left_on=['listing_id', 'date'], right_on=['id', 'last_review'])

#Drop rows with no review_scores_rating (this includes reviews with no matching listing row),
#then drop redundant columns and duplicate rows
merged_df = (merged_df.dropna(subset=['review_scores_rating'])
                      .drop(columns=['last_review', 'id_x', 'id_y'])
                      .drop_duplicates())

#Check
display(merged_df.head())

| | comments | date | listing_id | reviewer_id | reviewer_name | review_scores_rating | host_id | host_name |
|---|---|---|---|---|---|---|---|---|
| 184 | Holly's place is in a prime location with ever... | 2019-04-17 | 958 | 27129990 | Brian | 97.0 | 1169.0 | Holly |
| 191 | On a very quiet residential street, but close ... | 2019-05-16 | 958 | 45838839 | Lucy | 97.0 | 1169.0 | Holly |
| 193 | We recently stayed at Holly's place and it was... | 2019-05-31 | 958 | 138005765 | Rachel | 97.0 | 1169.0 | Holly |
| 198 | Clean, quiet, has all the amenities you need f... | 2019-07-19 | 958 | 11709118 | Anna | 97.0 | 1169.0 | Holly |
| 209 | This place is great! It was perfect for my qui... | 2019-08-28 | 958 | 70436481 | Nathan | 97.0 | 1169.0 | Holly |
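
The merge keeps only reviews whose `listing_id` and `date` match a listing's `id` and `last_review`, and the `1169.0` values in the output suggest that `host_id` was upcast to float by the unmatched rows of the left merge. A sketch on made-up toy frames of how the match rate could be inspected with the `indicator` option of `pd.merge` and how the integer dtype could be restored afterwards; the toy values and the names `reviews_toy`, `listings_toy`, and `matched` are assumptions for illustration:

```python
import pandas as pd

# Toy frames for illustration only.
reviews_toy = pd.DataFrame({
    "listing_id": [958, 958, 5858],
    "date": pd.to_datetime(["2020-03-01", "2020-03-13", "2017-08-06"]),
    "comments": ["a", "b", "c"],
})
listings_toy = pd.DataFrame({
    "id": [958, 5858],
    "last_review": pd.to_datetime(["2020-03-13", "2017-08-06"]),
    "review_scores_rating": [97.0, 98.0],
    "host_id": [1169, 8904],
})

# indicator=True adds a "_merge" column: "both" for matches, "left_only" otherwise.
merged_toy = pd.merge(
    reviews_toy, listings_toy, how="left",
    left_on=["listing_id", "date"], right_on=["id", "last_review"],
    indicator=True,
)
print(merged_toy["_merge"].value_counts())

# Keep matched rows only, then cast host_id back to an integer dtype.
matched = merged_toy[merged_toy["_merge"] == "both"].drop(columns="_merge")
matched = matched.astype({"host_id": "int64"})
print(matched[["listing_id", "date", "review_scores_rating", "host_id"]])
```

Checking `_merge` before filtering makes it visible how many reviews are discarded by the date match, which is most of them in the notebook (466004 rows in, 40178 rows out).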


### Write CSV

In [30]:
#Print merged_df shape
print("Data shape:", merged_df.shape)

#Set path and write file
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\02_Cleaned'
merged_df.to_csv(path + '/2020_0526_Reviews_Cleaned.csv')

Data shape: (40178, 8)
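
Because `to_csv` writes the index as the first column, the cleaned file needs `index_col=0` when it is read back, as the reads at the top of the notebook already do. A small round-trip sketch on toy data, writing to a temporary directory under a hypothetical file name:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame for illustration only, with a non-default index like merged_df.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2019-04-17", "2019-05-16"]),
        "listing_id": [958, 958],
        "review_scores_rating": [97.0, 97.0],
    },
    index=[184, 191],
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "reviews_cleaned_toy.csv"  # hypothetical file name
    toy.to_csv(out_path)
    # index_col=0 restores the saved index; parse_dates restores the datetime column.
    restored = pd.read_csv(out_path, index_col=0, parse_dates=["date"])

print(restored.shape == toy.shape)
```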
