# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows the follow steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
import numpy as np

import configparser
from datetime import datetime
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import monotonically_increasing_id, row_number, desc
from pyspark.sql.window import Window

### Step 1: Scope the Project and Gather Data

#### Scope 
In this project I want to create a postgres database to collect info about airbnb data on the city of Berlin.
The goal is to create a snowflake schema and in order to make it available for querying once connected to a BI tool. 

#### Describe and Gather Data 
The data set I am using has been gathered from this website () and contains various csv files with information about listing of the properties in airbnb, neighbourhood, and reviews left by visitors.

In [70]:
calendar_csv = 'airbnb_data/calendar.csv'
list_det_csv = 'airbnb_data/listings_detailed.csv'
list_csv = 'airbnb_data/listings.csv'
hoods_csv = 'airbnb_data/neighbourhoods.csv'
reviews_csv = 'airbnb_data/reviews_detailed.csv'

##pandas df for data exploration
calendar_df = pd.read_csv(calendar_csv)
list_det_df = pd.read_csv(list_det_csv)
list_df = pd.read_csv(list_csv)
hoods_df = pd.read_csv(hoods_csv)
reviews_df = pd.read_csv(reviews_csv)

In [71]:
#rename price at it is also found in another future table
calendar_df = calendar_df.rename(columns={'price': 'requested_price'})

calendar_df.head(2)

Unnamed: 0,listing_id,date,available,requested_price,adjusted_price,minimum_nights,maximum_nights
0,652868795892201022,2022-09-15,f,$85.00,$85.00,1.0,1125.0
1,652868795892201022,2022-09-16,f,$85.00,$85.00,1.0,1125.0


In [49]:
list_df.head(2)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,652868795892201022,Kleine Auszeit? Oder Business-Trip? Alles mögl...,21708794,Familie Sek,Tempelhof - Schöneberg,Lichtenrade,52.357652,13.399098,Entire home/apt,88,1,0,,,1,6,0,
1,27080612,Apartment with Living/Sleeping Room & own Kitchen,130216168,Tommy,Marzahn - Hellersdorf,Mahlsdorf,52.52006,13.65956,Entire home/apt,60,2,126,2022-09-11,2.54,2,163,18,


In [72]:
#drop min and max nights as they appear in another future table
to_drop = ['maximum_nights','minimum_nights','price']
list_det_df = list_det_df.drop(to_drop, axis=1)

list_det_df.head(2)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,652868795892201022,https://www.airbnb.com/rooms/652868795892201022,20220915162225,2022-09-15,city scrape,Kleine Auszeit? Oder Business-Trip? Alles mögl...,"Hallo ihr Lieben,<br /><br />kommt und verbrin...",,https://a0.muscache.com/pictures/miso/Hosting-...,21708794,...,,,,,t,1,1,0,0,
1,29077694,https://www.airbnb.com/rooms/29077694,20220915162225,2022-09-16,previous scrape,Wohnung im Grünen nah an der Metropole,Wohnung befindet sich im Souterrain eines Einf...,"We live in a green, quite and save area with ...",https://a0.muscache.com/pictures/e2642fca-3833...,219116245,...,4.93,4.52,4.66,,f,1,1,0,0,0.63


In [None]:
hoods_df.head(2)

In [23]:
reviews_df.head(2)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,3176,4283,2009-06-20,21475,Milan,"excellent stay, i would highly recommend it. a..."
1,22438,218181,2011-04-05,401483,Alexandre,Javier gave us quite a fright when none of his...


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [3]:
#merging the two listing df to create one, left join on the detailed
listing_df = pd.merge(list_det_df, list_df,  how='left', left_on=['id'], right_on = ['id'])

# Transpose the dataframe and drop duplicate columns
listing_df_transposed = listing_df.T
listing_df_unique = listing_df_transposed.drop_duplicates()

# Transpose the dataframe back to its original shape
listing_df = listing_df_unique.T

#remove the suffix _x generated by column duplication
suffix = '_x'
listing_df = listing_df.rename(columns={col: col.replace(suffix, '') for col in listing_df.columns})

#dropping two columns
listing_df = listing_df.drop(['name_y','price_y'],axis=1)
##listing_df.columns

In [None]:
# % of missing values per column
for col in listing_df.columns:
    pct_missing = np.mean(listing_df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))

In [4]:
#dropping bathroom as it is always empty
listing_df = listing_df.drop('bathrooms',axis=1)

In [5]:
config = configparser.ConfigParser()
config.read_file(open('creds.cfg'))

KEY             = config.get('AWS','KEY')
SECRET          = config.get('AWS','SECRET')

pd.DataFrame({"Param":
                  ["KEY", "SECRET"],
              "Value":
                  [KEY, SECRET] })

Unnamed: 0,Param,Value
0,KEY,AKIA3NPZEDQS2FHCROMV
1,SECRET,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


In [6]:
config = configparser.ConfigParser()
config.read('creds.cfg')

os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['KEY']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['SECRET']

In [7]:
spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
        .config("spark.executor.instances", 10) \
        .config("spark.executor.memory", "8g") \
        .getOrCreate()

In [79]:
#forcing the schema onto the dataframes as spark reads all fields as string
from pyspark.sql.types import IntegerType, TimestampType, StructType, StructField, StringType, DateType, BooleanType, DecimalType, DoubleType

listing_schema = StructType([
        StructField('id', IntegerType()),
        StructField('listing_url', StringType()),
        StructField('scrape_id', IntegerType()),
        StructField('last_scraped', TimestampType()),
        StructField('source', StringType()),
        StructField('name', StringType()),
        StructField('description', StringType()),
        StructField('neighborhood_overview', StringType()),
        StructField('picture_url', StringType()),
        StructField('host_id', IntegerType()),
        StructField('host_url', StringType()),
        StructField('host_name', StringType()),
        StructField('host_since', TimestampType()),
        StructField('host_location', StringType()),
        StructField('host_about', StringType()),
        StructField('host_response_time', IntegerType()),
        StructField('host_response_rate', IntegerType()),
        StructField('host_acceptance_rate', IntegerType()),
        StructField('host_is_superhost', StringType()),
        StructField('host_thumbnail_url', StringType()),
        StructField('host_picture_url', StringType()),
        StructField('host_neighbourhood', StringType()),
        StructField('host_listings_count', IntegerType()),
        StructField('host_total_listings_count', IntegerType()),
        StructField('host_verifications', StringType()),
        StructField('host_has_profile_pic', StringType()),
        StructField('host_identity_verified', StringType()),
        StructField('neighbourhood', StringType()),
        StructField('neighbourhood_cleansed', StringType()),
        StructField('neighbourhood_group_cleansed', StringType()),
        StructField('latitude', DoubleType()),
        StructField('longitude', DoubleType()),
        StructField('property_type', StringType()),
        StructField('room_type', StringType()),
        StructField('accommodates', StringType()),
        #StructField('bathrooms', StringType()),
        StructField('bathrooms_text', StringType()),
        StructField('bedrooms', IntegerType()),
        StructField('beds', IntegerType()),
        StructField('amenities', StringType()),
        StructField('price', DoubleType()),
        StructField('minimum_nights', IntegerType()),
        StructField('maximum_nights', IntegerType()),
        StructField('minimum_minimum_nights', IntegerType()),
        StructField('maximum_minimum_nights', IntegerType()),
        StructField('minimum_maximum_nights', IntegerType()),
        StructField('maximum_maximum_nights', IntegerType()),
        StructField('minimum_nights_avg_ntm', IntegerType()),
        StructField('maximum_nights_avg_ntm', IntegerType()),
        #StructField('calendar_updated', TimestampType()),
        StructField('has_availability', StringType()),
        StructField('availability_30', IntegerType()),
        StructField('availability_60', IntegerType()),
        StructField('availability_90', IntegerType()),
        StructField('availability_365', IntegerType()),
        StructField('calendar_last_scraped', StringType()),
        StructField('number_of_reviews', IntegerType()),
        StructField('number_of_reviews_ltm', IntegerType()),
        StructField('number_of_reviews_l30d', IntegerType()),
        StructField('first_review', TimestampType()),
        StructField('last_review', TimestampType()),
        StructField('review_scores_rating', IntegerType()),
        StructField('review_scores_accuracy', IntegerType()),
        StructField('review_scores_cleanliness', IntegerType()),
        StructField('review_scores_checkin', IntegerType()),
        StructField('review_scores_communication', IntegerType()),
        StructField('review_scores_location', IntegerType()),
        StructField('review_scores_value', IntegerType()),
        StructField('license', StringType()),
        StructField('instant_bookable', StringType()),
        StructField('calculated_host_listings_count', IntegerType()),
        StructField('calculated_host_listings_count_entire_homes', IntegerType()),
        StructField('calculated_host_listings_count_private_rooms', IntegerType()),
        StructField('calculated_host_listings_count_shared_rooms', IntegerType()),
        StructField('reviews_per_month', IntegerType())
    ])

calendar_schema = StructType([
        StructField('listing_id', StringType(), True),
        StructField('date', TimestampType(), True),
        StructField('available', BooleanType(), True),
        StructField('requested_price', DecimalType(), True),
        StructField('adjusted_price', DecimalType(), True)
    ])

hoods_schema = StructType([
        StructField('neighbourhood_group', StringType()),
        StructField('neighbourhood', StringType())
    ])

reviews_schema = StructType([
        StructField('listing_id', IntegerType()),
        StructField('id', IntegerType()),
        StructField('date', TimestampType()),
        StructField('reviewer_id', IntegerType()),
        StructField('reviewer_name', StringType()),
        StructField('comments', StringType())
    ])

the code below works better already inferring the schema, still doesn't work for listing.
For listing I have merged at the DF level already above, I need to create the spark dataframe from the pandas df instead of the csv.
Then recheck the schema

In [9]:
#cleaning the columns further in order to cast correct data types in order to convert from pd to spark df
listing_df = listing_df.fillna('')

listing_df['price'] = listing_df['price'].replace('$', '')
null_to_zero = ['bedrooms','beds','review_scores_rating','review_scores_accuracy','review_scores_cleanliness',
                'review_scores_checkin','reviews_per_month','review_scores_communication','review_scores_location','review_scores_value']
for i in null_to_zero:
    listing_df[i] = listing_df[i].replace('', 0)

In [11]:
listing_df['id'] = listing_df['id'].astype('int32')
listing_df['listing_url'] = listing_df['listing_url'].astype('str')
listing_df['scrape_id'] = listing_df['scrape_id'].astype('int32')
listing_df['last_scraped'] = pd.to_datetime(listing_df['last_scraped'])
listing_df['source'] = listing_df['source'].astype('str')
listing_df['name'] = listing_df['name'].astype('str')
listing_df['description'] = listing_df['description'].astype('str')
listing_df['neighborhood_overview'] = listing_df['neighborhood_overview'].astype('str')
listing_df['picture_url'] = listing_df['picture_url'].astype('str')
listing_df['host_id'] = listing_df['host_id'].astype('int32')
listing_df['host_url'] = listing_df['host_url'].astype('str')
listing_df['host_name'] = listing_df['host_name'].astype('str')
listing_df['host_since'] = pd.to_datetime(listing_df['host_since'])
listing_df['host_location'] = listing_df['host_location'].astype('str')
listing_df['host_about'] = listing_df['host_about'].astype('str')
listing_df['host_response_time'] = listing_df['host_response_time'].astype('str')
listing_df['host_response_rate'] = listing_df['host_response_rate'].astype('str')
listing_df['host_acceptance_rate'] = listing_df['host_acceptance_rate'].astype('str')
listing_df['host_is_superhost'] = listing_df['host_is_superhost'].astype('str')
listing_df['host_thumbnail_url'] = listing_df['host_thumbnail_url'].astype('str')
listing_df['host_picture_url'] = listing_df['host_picture_url'].astype('str')
listing_df['host_neighbourhood'] = listing_df['host_neighbourhood'].astype('str')
listing_df['host_listings_count'] = listing_df['host_listings_count'].astype('str')
listing_df['host_total_listings_count'] = listing_df['host_total_listings_count'].astype('str')
listing_df['host_verifications'] = listing_df['host_verifications'].astype('str')
listing_df['host_has_profile_pic'] = listing_df['host_has_profile_pic'].astype('str')
listing_df['host_identity_verified'] = listing_df['host_identity_verified'].astype('str')
listing_df['neighbourhood'] = listing_df['neighbourhood'].astype('str')
listing_df['neighbourhood_cleansed'] = listing_df['neighbourhood_cleansed'].astype('str')
listing_df['neighbourhood_group_cleansed'] = listing_df['neighbourhood_group_cleansed'].astype('str')
listing_df['latitude'] = listing_df['latitude'].astype('float')
listing_df['longitude'] = listing_df['longitude'].astype('float')
listing_df['property_type'] = listing_df['property_type'].astype('str')
listing_df['room_type'] = listing_df['room_type'].astype('str')
listing_df['accommodates'] = listing_df['accommodates'].astype('int32')
listing_df['bathrooms_text'] = listing_df['bathrooms_text'].astype('str')
listing_df['bedrooms'] = listing_df['bedrooms'].astype('str')
listing_df['beds'] = listing_df['beds'].astype('int32')
listing_df['amenities'] = listing_df['amenities'].astype('str')
listing_df['price'] = listing_df['price'].replace('$', '', regex=True).astype('float', errors='ignore')
listing_df['minimum_nights'] = listing_df['minimum_nights'].astype('int32')
listing_df['maximum_nights'] = listing_df['maximum_nights'].astype('int32')
listing_df['minimum_minimum_nights'] = listing_df['minimum_minimum_nights'].astype('int32')
listing_df['maximum_minimum_nights'] = listing_df['maximum_minimum_nights'].astype('int32')
listing_df['minimum_maximum_nights'] = listing_df['minimum_maximum_nights'].astype('int32')
listing_df['maximum_maximum_nights'] = listing_df['maximum_maximum_nights'].astype('int32')
listing_df['minimum_nights_avg_ntm'] = listing_df['minimum_nights_avg_ntm'].astype('int32')
listing_df['maximum_nights_avg_ntm'] = listing_df['maximum_nights_avg_ntm'].astype('int32')
listing_df['has_availability'] = listing_df['has_availability'].astype('str')
listing_df['availability_30'] = listing_df['availability_30'].astype('int32')
listing_df['availability_60'] = listing_df['availability_60'].astype('int32')
listing_df['availability_90'] = listing_df['availability_90'].astype('int32')
listing_df['availability_365'] = listing_df['availability_365'].astype('int32')
#listing_df['calendar_last_scraped'] = pd.to_datetime(listing_df['calendar_last_scraped'])
listing_df['number_of_reviews'] = listing_df['number_of_reviews'].astype('int32')
listing_df['number_of_reviews_ltm'] = listing_df['number_of_reviews_ltm'].astype('int32')
listing_df['number_of_reviews_l30d'] = listing_df['number_of_reviews_l30d'].astype('int32')
listing_df['first_review'] = pd.to_datetime(listing_df['first_review'])
listing_df['last_review'] = pd.to_datetime(listing_df['last_review'])
listing_df['review_scores_rating'] = listing_df['review_scores_rating'].astype('int32')
listing_df['review_scores_accuracy'] = listing_df['review_scores_accuracy'].astype('int32')
listing_df['review_scores_cleanliness'] = listing_df['review_scores_cleanliness'].astype('int32')
listing_df['review_scores_checkin'] = listing_df['review_scores_checkin'].astype('int32')
listing_df['review_scores_communication'] = listing_df['review_scores_communication'].astype('int32')
listing_df['review_scores_location'] = listing_df['review_scores_location'].astype('int32')
listing_df['review_scores_value'] = listing_df['review_scores_value'].astype('int32')
#listing_df['requires_license'] = listing_df['requires_license'].astype('str')
listing_df['license'] = listing_df['license'].astype('str')
#listing_df['jurisdiction_names'] = listing_df['jurisdiction_names'].astype('str')
#listing_df['cancellation_policy'] = listing_df['cancellation_policy'].astype('str')
#listing_df['is_business_travel_ready'] = listing_df['is_business_travel_ready'].astype('str')
listing_df['instant_bookable'] = listing_df['instant_bookable'].astype('str')
#listing_df['require_guest_profile_picture'] = listing_df['require_guest_profile_picture'].astype('str')
#listing_df['require_guest_phone_verification'] = listing_df['require_guest_phone_verification'].astype('str')
listing_df['calculated_host_listings_count'] = listing_df['calculated_host_listings_count'].astype('str')
listing_df['calculated_host_listings_count_entire_homes'] = listing_df['calculated_host_listings_count_entire_homes'].astype('str')
listing_df['calculated_host_listings_count_private_rooms'] = listing_df['calculated_host_listings_count_private_rooms'].astype('str')
listing_df['calculated_host_listings_count_shared_rooms'] = listing_df['calculated_host_listings_count_shared_rooms'].astype('str')
listing_df['reviews_per_month'] = listing_df['reviews_per_month'].astype('float')

In [None]:
print(listing_df['review_scores_communication'].unique())

In [81]:
#creating spark dataframes and passing the schemas to impart data type, exept for listing which converts from a pandas df

#df_listing = spark.read.csv(list_det_csv, header=True, sep=",", schema=listing_schema)
#df_calendar = spark.read.csv(calendar_csv, header=True, sep=",",schema=calendar_schema)
df_hoods = spark.read.csv(hoods_csv, header=True, sep=",", schema=hoods_schema)
df_reviews = spark.read.csv(reviews_csv, header=True, sep=",", schema=reviews_schema)
df_reviews = df_reviews.withColumnRenamed("id", "review_id").withColumnRenamed("date", "review_date")

df_listing = spark.createDataFrame(listing_df)
df_calendar = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true').option('sep', ',').load(calendar_csv)

In [83]:
df_calendar = df_calendar.drop('price')
df_calendar.printSchema()

root
 |-- listing_id: long (nullable = true)
 |-- date: string (nullable = true)
 |-- available: string (nullable = true)
 |-- adjusted_price: string (nullable = true)
 |-- minimum_nights: integer (nullable = true)
 |-- maximum_nights: integer (nullable = true)



In [None]:
# df_calendar = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true').option('sep', ',').load(calendar_csv)
# df_listing = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true').option('sep', ',').load(list_det_csv)
# df_hoods = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true').option('sep', ',').load(hoods_csv)
# df_reviews = spark.read.format('csv').option('header', 'true').option('inferSchema', 'true').option('sep', ',').load(reviews_csv)

In [None]:
#spark can run sql queries on the fly without create or insert statements
df_reviews.createOrReplaceTempView("reviews")
spark.sql("""
    SELECT listing_id, count(*) as reviews
    FROM reviews
    WHERE comments is not null
    GROUP by 1
    ORDER by reviews desc
    LIMIT 5
""").show()

In [None]:
df_reviews.createOrReplaceTempView("reviews")
df_hoods.createOrReplaceTempView("hoods")
df_listing.createOrReplaceTempView("listing")
df_calendar.createOrReplaceTempView("calendar")

spark.sql("""
    SELECT *
    from calendar 
    left join listing 
        on listing.id=calendar.listing_id
    left join reviews 
        using (listing_id)
    limit 1
""").show()

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Star Schema would be the model of choice as I believe for the way the data is presented and already available, it can very well be tranformed into dimension tables without the need to normalise any further. Additionally I am planning to build a central fact table by joining calendar, listing and reviews. 

#### 3.2 Mapping Out Data Pipelines
I am currently using Spark, so I will be parsing and querying the data on the fly in the notebook without building table and filling them.
Once the fields to keep in the dimension tables and the structure of the fact table is clear, I will upload the files in parquet format in an S3 bucket.
From them I will query directly from the parquet files for demonstration purposes and run a few queries.

In [12]:
output_bucket = "s3a://udacitycapstone123/"

In [None]:
# extract columns to create review table
df_reviews = df_reviews.na.drop()
df_reviews = df_reviews.withColumn('month', month('review_date'))

reviews_table = df_reviews.select('listing_id', 'review_id', 'review_date', 'reviewer_id', 'reviewer_name', 'comments', 'month')
print(reviews_table.show(3))

# write time table to parquet files partitioned by month and date (day)
reviews_table.write.mode('overwrite').partitionBy('month').parquet(output_bucket + 'reviews')
print('Review Table is created in the S3 bucket!')

In [None]:
# extract columns to create Hoods table
print(df_hoods.show(3))
hoods_table = df_hoods.select('neighbourhood_group', 'neighbourhood')

# write time table to parquet files partitioned by year and month
hoods_table.write.mode('overwrite').partitionBy('neighbourhood_group').parquet(output_bucket + 'neighbourhoods')
print('Neighbourhoods Table is created in the S3 bucket!')

In [None]:
# extract columns to create review table
df_calendar = df_calendar.withColumn('month', month('date'))

calendar_table = df_calendar.select('listing_id', 'month', 'date', 'available', 'price', 'adjusted_price', 'minimum_nights', 'maximum_nights')
print(calendar_table.show(3))

# write time table to parquet files partitioned by year and month
calendar_table.write.mode('overwrite').partitionBy('month').parquet(output_bucket + 'calendar')
print('Calendar Table is created in the S3 bucket!')

In [77]:
# extract columns to create review table
df_listing = df_listing.withColumn('month', month('host_since'))

#including only some columns in the listing table for demostration purposes and because it is very slow
listing_table = df_listing.select('id', 'month','name','description','host_id','host_name','host_since','source','latitude','longitude','price','review_scores_rating','reviews_per_month','room_type')
print(listing_table.show(3))

# write time table to parquet files partitioned by year and month
listing_table.write.mode('overwrite').partitionBy('month', 'room_type').parquet(output_bucket + 'listing')
print('Listing Table is created in the S3 bucket!')

+----------+-----+--------------------+--------------------+---------+-----------+-------------------+---------------+-----------------+------------------+------+--------------------+-----------------+---------------+
|        id|month|                name|         description|  host_id|  host_name|         host_since|         source|         latitude|         longitude| price|review_scores_rating|reviews_per_month|      room_type|
+----------+-----+--------------------+--------------------+---------+-----------+-------------------+---------------+-----------------+------------------+------+--------------------+-----------------+---------------+
|-132680130|    9|Kleine Auszeit? O...|Hallo ihr Lieben,...| 21708794|Familie Sek|2014-09-24 00:00:00|    city scrape|52.35765223245062|13.399097844958305|$88.00|                   0|              0.0|Entire home/apt|
|  29077694|   10|Wohnung im Grünen...|Wohnung befindet ...|219116245|    Tilmann|2018-10-06 00:00:00|previous scrape|         5

In [84]:
##Creating the fact table now
df_bookings_stg = df_calendar.join(listing_table, on=[df_calendar.listing_id == listing_table.id], how='left')
df_bookings = df_bookings_stg.join(df_reviews, on=[df_bookings_stg.listing_id == df_reviews.listing_id], how='left')
df_bookings.printSchema()

root
 |-- listing_id: long (nullable = true)
 |-- date: string (nullable = true)
 |-- available: string (nullable = true)
 |-- adjusted_price: string (nullable = true)
 |-- minimum_nights: integer (nullable = true)
 |-- maximum_nights: integer (nullable = true)
 |-- id: long (nullable = true)
 |-- month: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- host_id: long (nullable = true)
 |-- host_name: string (nullable = true)
 |-- host_since: timestamp (nullable = true)
 |-- source: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- price: string (nullable = true)
 |-- review_scores_rating: long (nullable = true)
 |-- reviews_per_month: double (nullable = true)
 |-- room_type: string (nullable = true)
 |-- listing_id: integer (nullable = true)
 |-- review_id: integer (nullable = true)
 |-- review_date: timestamp (nullable = true)
 |-- reviewer_id: integer (nullable = t

In [86]:
df_bookings.show()

+----------+----------+---------+--------------+--------------+--------------+-----+-----+--------------------+--------------------+-------+---------+-------------------+-----------+--------+---------+-------+--------------------+-----------------+---------------+----------+---------+-------------------+-----------+-------------+--------------------+
|listing_id|      date|available|adjusted_price|minimum_nights|maximum_nights|   id|month|                name|         description|host_id|host_name|         host_since|     source|latitude|longitude|  price|review_scores_rating|reviews_per_month|      room_type|listing_id|review_id|        review_date|reviewer_id|reviewer_name|            comments|
+----------+----------+---------+--------------+--------------+--------------+-----+-----+--------------------+--------------------+-------+---------+-------------------+-----------+--------+---------+-------+--------------------+-----------------+---------------+----------+---------+---------

In [None]:
# extract columns to create review table
df_bookings = df_bookings.withColumn('month', month('host_since'))

#including only some columns in the listing table for demostration purposes and because it is very slow
booking_table = df_bookings.select('id', 'month','name','description','host_id','host_name','host_since','source','latitude','longitude','price','review_scores_rating','reviews_per_month','room_type')
print(booking_table.show(3))

# write time table to parquet files partitioned by year and month
booking_table.write.mode('overwrite').partitionBy('month', 'room_type').parquet(output_bucket + 'booking')
print('Booking Table is created in the S3 bucket!')

In [None]:
#reading the tables from the S3 in order to parse, run analysis etc
booking = spark.read.parquet(output_bucket + 'booking')
listing = spark.read.parquet(output_bucket + 'listing')
reviews = spark.read.parquet(output_bucket + 'reviews')
neighbourhoods = spark.read.parquet(output_bucket + 'neighbourhoods')

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.