# Project: Capstone with EMR ( Spark & Co. @ AWS )
### Data Engineering Capstone Project


#### Project Summary
--describe your project at a high level-- (see also @ README.md at GitHub repo)


#### Overview and Steps
The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

<font color="red">Clear fields at [</font>aws-access.cfg<font color="red">] before submitting project for review!  
    KEY = 'YOUR_AWS_KEY'  
    SECRET='YOUR_AWS_SECRET'</font>

Version 1 **(LOC = Udacity Workspace)** - Revision 07 - 2023/04/26 - Mr Morphy - GitHub Profile (https://github.com/MrMorphy)  
GitHub Project - udacity-course-proj-final-capstone (https://github.com/mrmorphy/udacita-course-proj-final-capstone)

### Step 0: Create Base Config
* Setup IAM-Role (aws-access.cfg)
* Setup EMR Cluster (with Spark) at AWS UI
* Create a Notebook at AWS UI

Load the following libraries to execute the code.

In [1]:
!python --version

Python 3.6.3


In [2]:
# Do all imports and installs here
import configparser              # required for reading out config-file [*.cfg]
from datetime import datetime    # 
import os                        # required for setting environment variables, like IAM authentication

In [3]:
# Optional for Analytics & Visualisation (pending/ in development), if next command fails.
#!pip install numpy matplotlib     # later use of a Bar Graph:  plt.hist() - 
                                 # plt.xlabel('time') - plt.ylabel('Value') - 
                                 # plt.title('Distribution of xxx') - plt.show()

In [2]:
# Optional ... (pending/ in development)
import numpy as np
import matplotlib.pyplot as plt  # Pyplot - Bar Graph, Line Graph, Others (pending)
                                 # i.e. plt.scatter(guests_hour_pd["hour"], guests_hour_pd["count"])
                                 # plt.xlim(-1,24); - plt.ylim(0, 1.2 * max( guests_hour_pd["count"]))
                                 # plt.xlabel("Hour"); - plt.ylabel("Accommodations")

In [3]:
from pyspark.sql import SparkSession

In [4]:
from pyspark.sql.functions import udf, col   #, monotonically_increasing_id
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format

In [5]:
from pyspark.sql.types import StructType  as R,   StructField as Fld, \
                              DoubleType  as Dbl, StringType  as Str, \
                              IntegerType as Int, DateType    as Date, \
                              BooleanType as Bol, DecimalType as Dec, \
                              FloatType   as Flt, TimestampType as TS, \
                              LongType    as Lng

In [6]:
# only necessary for tests in the Jupyter notebook
import pandas as pd

<font color="red">INFO: The file aws-access.cfg must contain an [AWS] section, because the code reads the values through config['AWS'][...].</font>

In [7]:
config = configparser.ConfigParser()
config.read('aws-access.cfg')
print(">> Read Out Config-Infos from [aws-access.cfg]")

os.environ['AWS_ACCESS_KEY_ID']     = config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = config['AWS']['AWS_SECRET_ACCESS_KEY']

>> Read Out Config-Infos from [aws-access.cfg]
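
A minimal sketch of a guard for the configuration step above, assuming the file uses an `[AWS]` section with the two key names that the code reads. The helper name `load_aws_credentials` and the inline sample text are hypothetical, and the sample contains placeholder values only.

```python
import configparser

# Key names taken from the cell above; the section name [AWS] is assumed from config['AWS'].
REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def load_aws_credentials(cfg_text):
    """Parse config text and return the two AWS values, or raise a clear error."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if not parser.has_section("AWS"):
        raise KeyError("section [AWS] is missing in the config file")
    missing = [key for key in REQUIRED_KEYS if not parser.has_option("AWS", key)]
    if missing:
        raise KeyError("missing keys in [AWS]: " + ", ".join(missing))
    return {key: parser.get("AWS", key) for key in REQUIRED_KEYS}


# Hypothetical sample content with placeholders, not real credentials.
sample_cfg = """
[AWS]
AWS_ACCESS_KEY_ID = YOUR_AWS_KEY
AWS_SECRET_ACCESS_KEY = YOUR_AWS_SECRET
"""
credentials = load_aws_credentials(sample_cfg)
print(sorted(credentials))
```

Note that `configparser` keeps surrounding quotes as part of the value, so the values in the file are best written without quotes.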


####  Step0 (2): Create Spark Session

In [8]:
print(">> START (main) - [A] def configure_spark()")

>> START (main) - [A] def configure_spark()


In [9]:
# def create_spark_session():
spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.10.0") \
    .getOrCreate()
#return Spark
# @ EMR (AWS): "emr.5.20.0" > "hadoop 2.7.0"
# @ EMR (AWS): "emr-5.31.0" > "hadoop 2.10.0"
print(">> spark session created")

>> spark session created


In [10]:
# DEBUG
print(">> Spark information details")
spark

>> Spark information details


### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use? etc.

* For project plan see README.md on GitHub repo.
* Data from Inside Airbnb, specifically for Mexico City.
* End solution: dimension and fact tables for a "Self Service Dashboard" (Microsoft Power BI or AWS QuickSight).
* Selected Tools: Spark (EMR on EC2 at AWS) and S3 Bucket.


#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included?  


| FILE NAME | DESCRIPTION |
| ----- | ----- |
| calendar.csv.gz | Detailed Calendar Data |
| listings.csv | Summary information and metrics for listings in Mexico City (good for visualisations) |
| listings.csv.gz | Detailed Listings Data |
| neighbourhoods.csv | Neighbourhood list for geo filter. Sourced from city or open source GIS files. |
| neighbourhoods.geojson | GeoJSON file of neighbourhoods of the city |
| reviews.csv | Summary Review data and Listing ID (to facilitate time-based analytics and visualisations linked to a listing) |
| reviews.csv.gz | Detailed Review Data |

In [11]:
# DEBUG
!ls -all

total 879540
drwxr-xr-x 4 root root      4096 Apr 26 21:46 .
drwxr-xr-x 3 root root      4096 Apr 26 17:32 ..
-rw-r--r-- 1 root root       107 Apr 25 19:38 aws-access.cfg
-rw-r--r-- 1 root root 462719658 Apr 26 00:27 calendar.csv
-rw-r--r-- 1 root root  26455179 Apr 25 22:56 calendar.csv.gz
-rw-r--r-- 1 root root    129851 Apr 26 21:46 Capstone-(LOC)-Notebook-MrMorphy.ipynb
-rw-r--r-- 1 root root     13059 Apr 22 22:00 Capstone Project Template.ipynb
drwxr-xr-x 2 root root      4096 Apr 26 17:32 .ipynb_checkpoints
-rw-r--r-- 1 root root   3747993 Apr 26 01:50 listings.csv
-rw-r--r-- 1 root root  15445405 Apr 25 22:55 listings.csv.gz
-rw-r--r-- 1 root root  60742007 Apr 26 00:23 listings_detailed.csv
-rw-r--r-- 1 root root       275 Apr 24 23:03 neighbourhoods.csv
-rw-r--r-- 1 root root  18548579 Apr 25 22:17 reviews.csv
-rw-r--r-- 1 root root  86708818 Apr 25 22:52 reviews.csv.gz
-rw-r--r-- 1 root root 226117945 Apr 26 00:35 reviews_detailed.csv
drwxr-xr-x 3 root root  

In [12]:
# Read in the data here
# Local data (Udacity Workspace)
input_data  = "./"  # "data/"  # "data_airbnb/"
output_data = "data_output/"

In [13]:
# define filenames 
# Making 1st tests...
fnNeighbourhoodsCSV = "neighbourhoods.csv"

#list_csv     = "listings.csv"
#list_det_csv = "listings_detailed.csv"  # "listings.csv.gz"
list_gz      = "listings.csv.gz"
#calendar_csv = "calendar.csv"           # "calendar.csv.gz"
calendar_gz  = "calendar.csv.gz"
#reviews_csv  = "reviews.csv"
#rev_det_csv  = "reviews_detailed.csv"   # "reviews.csv.gz"
reviews_gz   = "reviews.csv.gz"

In [14]:
print(">> processing compressed data, preparation for exploring data")

>> processing compressed data, preparation for exploring data


#### Extract "listings_detailed.csv" from GZ file "listings.csv.gz"

In [17]:
import gzip

In [18]:
# DEBUG
# list of all gzipped files
!ls -all *.gz

-rw-r--r-- 1 root root 26455179 Apr 25 22:56 calendar.csv.gz
-rw-r--r-- 1 root root 15445405 Apr 25 22:55 listings.csv.gz
-rw-r--r-- 1 root root 86708818 Apr 25 22:52 reviews.csv.gz


In [19]:
# DEBUG
# list only files beginning with "list"
!ls -all list*

-rw-r--r-- 1 root root  3747993 Apr 26 01:50 listings.csv
-rw-r--r-- 1 root root 15445405 Apr 25 22:55 listings.csv.gz
-rw-r--r-- 1 root root 60742007 Apr 26 00:23 listings_detailed.csv


In [60]:
# TODO: Adapt to use the filenames predefined above!
# SINGLE EXECUTION - ONLY ONCE REQUIRED, for unpacking ALL on Udacity Workbook Environment
# The extracted CSV is what the Spark steps below read. (Spark, like pandas, can also read gzip-compressed CSV files directly.)

# path_to_file_to_be_extracted
ip = input_data + 'listings.csv.gz'

# output_file_to_be_filled
op_fn = ip[:-7] + '_detailed.csv'

print("DBG: source filename: [" + ip + "]")
print("DBG: ..without last 3 > target filename: [" + ip[:-3] + "]")
print("DBG: ..without last 7 > target filename: [" + ip[:-7] + "]")
print("DBG: > target filename: [" + op_fn + "]")

# Write the decompressed content; both files are closed automatically
with gzip.open(ip, 'rb') as ip_byte, open(op_fn, 'w') as op:
    op.write(ip_byte.read().decode('utf-8'))

DBG: source filename: [./listings.csv.gz]
DBG: ..without last 3 > target filename: [./listings.csv]
DBG: ..without last 7 > target filename: [./listings]
DBG: > target filename: [./listings_detailed.csv]
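
The extraction cells read each archive fully into memory before writing it out. As a minimal sketch, assuming only the standard library, the same step could be done in chunks with one reusable helper. The function name `extract_gz` and the temporary demo file are hypothetical and not part of the project.

```python
import gzip
import os
import shutil
import tempfile


def extract_gz(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress src_path into dst_path in chunks instead of loading it whole."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Self-contained demo on a temporary file (no project data needed).
demo_dir = tempfile.mkdtemp()
demo_gz = os.path.join(demo_dir, "demo.csv.gz")
demo_text = "listing_id,date\n44616,2011-11-09\n"
with gzip.open(demo_gz, "wb") as f:
    f.write(demo_text.encode("utf-8"))

# Dropping the last three characters (".gz") yields the target filename.
demo_csv = extract_gz(demo_gz, demo_gz[:-3])
with open(demo_csv, encoding="utf-8") as f:
    extracted_text = f.read()
print(extracted_text)
```

With such a helper, the three extraction cells would differ only in the source and target filenames.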


In [21]:
# DEBUG
# changes? 
!ls -all list*

-rw-r--r-- 1 root root  3747993 Apr 26 01:50 listings.csv
-rw-r--r-- 1 root root 15445405 Apr 25 22:55 listings.csv.gz
-rw-r--r-- 1 root root 60742007 Apr 26 00:23 listings_detailed.csv


#### Extract "calendar.csv" from GZ file "calendar.csv.gz"

In [20]:
# DEBUG
# list only files beginning with "cal"
!ls -all cal*

-rw-r--r-- 1 root root 462719658 Apr 26 00:27 calendar.csv
-rw-r--r-- 1 root root  26455179 Apr 25 22:56 calendar.csv.gz


In [63]:
# TODO: Adapt to use the filenames predefined above!
# SINGLE EXECUTION - ONLY ONCE REQUIRED, for unpacking ALL on Udacity Workbook Environment

# path_to_file_to_be_extracted
ip = input_data + 'calendar.csv.gz'

# output_file_to_be_filled
op_fn = ip[:-3] #+ '_detailed.csv'

print("DBG: source filename: [" + ip + "]")
print("DBG: ..without last 3 > target filename: [" + ip[:-3] + "]")
print("DBG: > target filename: [" + op_fn + "]")

# Write the decompressed content; both files are closed automatically
with gzip.open(ip, 'rb') as ip_byte, open(op_fn, 'w') as op:
    op.write(ip_byte.read().decode('utf-8'))

DBG: source filename: [./calendar.csv.gz]
DBG: ..without last 3 > target filename: [./calendar.csv]
DBG: > target filename: [./calendar.csv]


In [65]:
# DEBUG
# unzipped? 
!ls -all cal*

-rw-r--r-- 1 root root 462719658 Apr 26 00:27 calendar.csv
-rw-r--r-- 1 root root  26455179 Apr 25 22:56 calendar.csv.gz


#### Extract "reviews_detailed.csv" from GZ file "reviews.csv.gz"

In [22]:
# DEBUG
# list only files beginning with "rev"
!ls -all rev*

-rw-r--r-- 1 root root  18548579 Apr 25 22:17 reviews.csv
-rw-r--r-- 1 root root  86708818 Apr 25 22:52 reviews.csv.gz
-rw-r--r-- 1 root root 226117945 Apr 26 00:35 reviews_detailed.csv


In [69]:
# TODO: Adapt to use the filenames predefined above!
# SINGLE EXECUTION - ONLY ONCE REQUIRED, for unpacking ALL on Udacity Workbook Environment

# path_to_file_to_be_extracted
ip = input_data + 'reviews.csv.gz'

# output_file_to_be_filled
op_fn = ip[:-7] + '_detailed.csv'

print("DBG: source filename: [" + ip + "]")
print("DBG: ..without last 3 > target filename: [" + ip[:-3] + "]")
print("DBG: ..without last 7 > target filename: [" + ip[:-7] + "]")
print("DBG: > target filename: [" + op_fn + "]")

# Write the decompressed content; both files are closed automatically
with gzip.open(ip, 'rb') as ip_byte, open(op_fn, 'w') as op:
    op.write(ip_byte.read().decode('utf-8'))

DBG: source filename: [./reviews.csv.gz]
DBG: ..without last 3 > target filename: [./reviews.csv]
DBG: ..without last 7 > target filename: [./reviews]
DBG: > target filename: [./reviews_detailed.csv]


In [71]:
# DEBUG
# unzipped?
!ls -all rev*

-rw-r--r-- 1 root root  18548579 Apr 25 22:17 reviews.csv
-rw-r--r-- 1 root root  86708818 Apr 25 22:52 reviews.csv.gz
-rw-r--r-- 1 root root 226117945 Apr 26 00:35 reviews_detailed.csv


#### + + + + END GZ extracting + + + +  

#### Exploring source data files step-by-step

In [19]:
nb_data = input_data + fnNeighbourhoodsCSV
print(">> Source path [" + fnNeighbourhoodsCSV + "]: [" + nb_data + "]")

>> Source path [neighbourhoods.csv]: [./neighbourhoods.csv]


In [20]:
df_nb = spark.read.csv(nb_data)

In [21]:
df_nb.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)



In [22]:
df_nb.head()

Row(_c0='neighbourhood_group', _c1='neighbourhood')

In [25]:
# count of content
print(">> [" + str(df_nb.count()) + "] neighbourhoods from df_nb read out in CSV format")

>> [17] neighbourhoods from df_nb read out in CSV format


In [26]:
# DEBUG
df_nb.limit(5).toPandas()

Unnamed: 0,_c0,_c1
0,neighbourhood_group,neighbourhood
1,,Álvaro Obregón
2,,Azcapotzalco
3,,Benito Juárez
4,,Coyoacán


In [27]:
# define schema 
neighbourhoodSchema = R([
    Fld("neighbourhood_group", Str()),
    Fld("neighbourhood",       Str()),
])

In [40]:
# reread with prepared schema
df_nb = spark.read.csv(nb_data, neighbourhoodSchema)

In [31]:
# DEBUG 
# neighbourhoodSchema given
df_nb.printSchema()
print(">> WRITE OUT FOR DATA DICTIONARY...")

root
 |-- neighbourhood_group: string (nullable = true)
 |-- neighbourhood: string (nullable = true)

>> WRITE OUT FOR DATA DICTIONARY...


In [41]:
df_nb.limit(5).toPandas()

Unnamed: 0,neighbourhood_group,neighbourhood
0,neighbourhood_group,neighbourhood
1,,Álvaro Obregón
2,,Azcapotzalco
3,,Benito Juárez
4,,Coyoacán


In [42]:
# re-read with the prepared schema, treating the first row as the header (column names)
df_nb = spark.read.csv(nb_data, header='true', sep=',', schema=neighbourhoodSchema)

# df = spark.read.option("header","true").format("csv").schema(myManualSchema).load("maestraDestacados.csv")
# https://stackoverflow.com/questions/34832312/how-to-make-the-first-row-as-header-when-reading-a-file-in-pyspark-and-convertin

In [43]:
# DEBUG - Verify whether some data content breaks the schema while reading
# .. if broken, "None" will be displayed in all columns!
df_nb.limit(5).toPandas()

Unnamed: 0,neighbourhood_group,neighbourhood
0,,Álvaro Obregón
1,,Azcapotzalco
2,,Benito Juárez
3,,Coyoacán
4,,Cuajimalpa de Morelos
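
The count of 17 further up was taken before the header option was set, so it included the header line as a data row. A minimal sketch of that difference, using Python's `csv` module instead of Spark and a shortened, illustrative copy of the file content:

```python
import csv
import io

# Shortened sample in the layout of neighbourhoods.csv (first column is empty in the source).
sample = "neighbourhood_group,neighbourhood\n,Álvaro Obregón\n,Azcapotzalco\n"

# Without header handling, the header line is counted as a data row.
rows_raw = list(csv.reader(io.StringIO(sample)))

# With header handling, the first line supplies the column names.
rows_named = list(csv.DictReader(io.StringIO(sample)))

print(len(rows_raw), len(rows_named))
print(rows_named[0]["neighbourhood"])
```

The same reasoning applies to the row counts of the other files whenever they are read without the header option.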


### Apply the same steps as before to the other CSV files

#### Explore the data

Explore the remaining files in the same way, in order to write down a possible data dictionary and to be able to design the data model at the end.

In [15]:
# define (csv) filenames 
nhoods_csv_fn   = "neighbourhoods.csv"
list_csv_fn     = "listings.csv"
list_det_csv_fn = "listings_detailed.csv"  # < listings.csv.gz
calendar_csv_fn = "calendar.csv"           # < calendar.csv.gz
reviews_csv_fn  = "reviews.csv"
rev_det_csv_fn  = "reviews_detailed.csv"   # < reviews.csv.gz

In [16]:
nhoods_csv   = input_data + nhoods_csv_fn
list_csv     = input_data + list_csv_fn
list_det_csv = input_data + list_det_csv_fn
calendar_csv = input_data + calendar_csv_fn
reviews_csv  = input_data + reviews_csv_fn
rev_det_csv  = input_data + rev_det_csv_fn

In [17]:
# DEBUG
print(">> Source path [" + nhoods_csv_fn   + "]: [" + nhoods_csv   + "]")
print(">> Source path [" + list_csv_fn     + "]: [" + list_csv     + "]")
print(">> Source path [" + list_det_csv_fn + "]: [" + list_det_csv + "]")
print(">> Source path [" + calendar_csv_fn + "]: [" + calendar_csv + "]")
print(">> Source path [" + reviews_csv_fn  + "]: [" + reviews_csv  + "]")
print(">> Source path [" + rev_det_csv_fn  + "]: [" + rev_det_csv  + "]")

>> Source path [neighbourhoods.csv]: [./neighbourhoods.csv]
>> Source path [listings.csv]: [./listings.csv]
>> Source path [listings_detailed.csv]: [./listings_detailed.csv]
>> Source path [calendar.csv]: [./calendar.csv]
>> Source path [reviews.csv]: [./reviews.csv]
>> Source path [reviews_detailed.csv]: [./reviews_detailed.csv]


In [18]:
# DEBUG
!ls -all *.csv

-rw-r--r-- 1 root root 462719658 Apr 26 00:27 calendar.csv
-rw-r--r-- 1 root root   3747993 Apr 26 01:50 listings.csv
-rw-r--r-- 1 root root  60742007 Apr 26 00:23 listings_detailed.csv
-rw-r--r-- 1 root root       275 Apr 24 23:03 neighbourhoods.csv
-rw-r--r-- 1 root root  18548579 Apr 25 22:17 reviews.csv
-rw-r--r-- 1 root root 226117945 Apr 26 00:35 reviews_detailed.csv


In [70]:
# read out and create spark dataframes ("_df"), NOT Data Feed ("df_")
nhoods_df   = spark.read.csv(nhoods_csv)
list_df     = spark.read.csv(list_csv)
calendar_df = spark.read.csv(calendar_csv)
reviews_df  = spark.read.csv(reviews_csv)
rev_det_df  = spark.read.csv(rev_det_csv)   # Special quotes/ delimiters required for read out? 
list_det_df = spark.read.csv(list_det_csv)

# INFO: At the end, do not forget to transfer the complete call (with parameters & Schema) to 'etl.py' file!

#### Exploring "listings.csv" - (Summary)

In [52]:
# by default, the data type "string" is used for every column!
list_df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)



In [53]:
# unformatted list of column names
list_df.head()

Row(_c0='id', _c1='name', _c2='host_id', _c3='host_name', _c4='neighbourhood_group', _c5='neighbourhood', _c6='latitude', _c7='longitude', _c8='room_type', _c9='price', _c10='minimum_nights', _c11='number_of_reviews', _c12='last_review', _c13='reviews_per_month', _c14='calculated_host_listings_count', _c15='availability_365', _c16='number_of_reviews_ltm', _c17='license')

In [54]:
print(">> [" + str(list_df.count()) + "] ... read out of CSV format")

>> [24382] ... read out of CSV format


In [55]:
list_df.limit(5).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17
0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
1,35797,Villa Dante,153786,Dici,,Cuajimalpa de Morelos,19.38283,-99.27178,Entire home/apt,3658,1,0,,,1,363,0,
2,696037,"3 Bedrooms, 2 blocks from Polanco - ALL RENOVA...",3531879,Gonzalo & Sandra,,Miguel Hidalgo,19.4418,-99.18402,Entire home/apt,1469,24,39,2022-12-21,0.31,3,87,2,
3,2056638,Amplio y luminoso loft en Coyoacán,10531228,Maria,,Coyoacán,19.35353,-99.16299,Entire home/apt,1434,1,21,2018-06-24,0.19,3,324,0,
4,44616,CONDESA HAUS B&B,196253,Condesa Haus Bed & Breakfast CDMX,,Cuauhtémoc,19.41162,-99.17794,Entire home/apt,18000,1,64,2023-03-26,0.46,12,357,12,


In [59]:
# define schema, as data types are required
listSchema = R([
    Fld("id",                             Int()),
    Fld("name",                           Str()),
    Fld("host_id",                        Int()),
    Fld("host_name",                      Str()),
    Fld("neighbourhood_group",            Str()),
    Fld("neighbourhood",                  Str()),
    Fld("latitude",                       Flt()),
    Fld("longitude",                      Flt()),
    Fld("room_type",                      Str()),
    Fld("requested_price",                Int()),
    Fld("minimum_nights",                 Int()),
    Fld("number_of_reviews",              Int()),
    Fld("last_review",                    Date()),
    Fld("reviews_per_month",              Flt()),
    Fld("calculated_host_listings_count", Int()),
    Fld("availability_365",               Int()),
    Fld("number_of_reviews_ltm",          Int()),
    Fld("license",                        Int()),
])

In [60]:
list_df = spark.read.option("header","true").format("csv").schema(listSchema).load(list_csv)

In [61]:
# DEBUG 
# ___Schema given
list_df.printSchema()
print(">> WRITE OUT FOR DATA DICTIONARY...")

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- host_id: integer (nullable = true)
 |-- host_name: string (nullable = true)
 |-- neighbourhood_group: string (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- latitude: float (nullable = true)
 |-- longitude: float (nullable = true)
 |-- room_type: string (nullable = true)
 |-- price: integer (nullable = true)
 |-- minimum_nights: integer (nullable = true)
 |-- number_of_reviews: integer (nullable = true)
 |-- last_review: date (nullable = true)
 |-- reviews_per_month: float (nullable = true)
 |-- calculated_host_listings_count: integer (nullable = true)
 |-- availability_365: integer (nullable = true)
 |-- number_of_reviews_ltm: integer (nullable = true)
 |-- license: integer (nullable = true)

>> WRITE OUT FOR DATA DICTIONARY...


In [62]:
list_df.limit(5).toPandas()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,35797,Villa Dante,153786,Dici,,Cuajimalpa de Morelos,19.38283,-99.271782,Entire home/apt,3658,1,0,,,1,363,0,
1,696037,"3 Bedrooms, 2 blocks from Polanco - ALL RENOVA...",3531879,Gonzalo & Sandra,,Miguel Hidalgo,19.441799,-99.184021,Entire home/apt,1469,24,39,2022-12-21,0.31,3,87,2,
2,2056638,Amplio y luminoso loft en Coyoacán,10531228,Maria,,Coyoacán,19.353531,-99.162987,Entire home/apt,1434,1,21,2018-06-24,0.19,3,324,0,
3,44616,CONDESA HAUS B&B,196253,Condesa Haus Bed & Breakfast CDMX,,Cuauhtémoc,19.411619,-99.17794,Entire home/apt,18000,1,64,2023-03-26,0.46,12,357,12,
4,2072354,Coyoacan Historic Studio Apartment,16840050,Mónica,,Coyoacán,19.35358,-99.169479,Entire home/apt,830,3,61,2022-11-04,0.54,2,346,1,
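
In the typed output, a latitude that reads 19.4418 in the source file is shown as 19.441799. This is consistent with the rounding of 32-bit floats, which is what `FloatType` stores. A minimal sketch, assuming only the standard library, that reproduces the effect outside Spark (the helper name `as_float32` is hypothetical):

```python
import struct


def as_float32(value):
    """Round-trip a Python float through 32-bit IEEE 754 storage."""
    return struct.unpack("f", struct.pack("f", value))[0]


latitude = 19.4418
stored = as_float32(latitude)

# The 32-bit value differs slightly from the original 64-bit value.
print(stored == latitude)
print(abs(stored - latitude) < 1e-5)
```

If the coordinates should keep their written precision, `DoubleType` (already imported as `Dbl`) may be the better fit for `latitude` and `longitude`.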


#### Exploring "calendar.csv"

#### "calendar.csv": Variant (A): direct reading and creating data frame ("_df") through spark.read()

In [28]:
calendar_df.printSchema()
# was read through:
# calendar_df = spark.read.csv(calendar_csv)

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)



In [29]:
calendar_df.head()

Row(_c0='listing_id', _c1='date', _c2='available', _c3='price', _c4='adjusted_price', _c5='minimum_nights', _c6='maximum_nights')

In [30]:
print(">> [" + str(calendar_df.count()) + "] rows read out from CSV file")

>> [8841402] rows read out from CSV file


In [31]:
calendar_df.limit(5).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6
0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
1,2056638,2023-03-30,f,"$1,434.00","$1,434.00",1,1125
2,2056638,2023-03-31,f,"$1,434.00","$1,434.00",1,1125
3,2056638,2023-04-01,f,"$1,434.00","$1,434.00",1,1125
4,2056638,2023-04-02,f,"$1,434.00","$1,434.00",1,1125


In [32]:
# define schema - 1st attempt - but! >> ERROR: only "None" is shown - WHY?! The data is incompatible with the given types!
calendarSchema = R([
    Fld("listing_id",     Str()),
    Fld("date",           Date()),
    Fld("available",      Bol()),  # 'f' <> 'False'  >> replace it afterwards!
    Fld("price",          Dec()),  # '$'             >> replace it afterwards!
    Fld("adjusted_price", Dec()),  # '$'             >> replace it afterwards!
    Fld("minimum_nights", Int()),
    Fld("maximum_nights", Int()),
])

In [33]:
calendar_df = spark.read.option("header","true").option('sep',',').format("csv").schema(calendarSchema).load(calendar_csv)

In [34]:
# DEBUG 
# ___Schema given
calendar_df.printSchema()
# print(">> WRITE OUT FOR DATA DICTIONARY...")
# >> ERROR: only "None" is shown - WHY?! The data is incompatible with the given types!

root
 |-- listing_id: string (nullable = true)
 |-- date: date (nullable = true)
 |-- available: boolean (nullable = true)
 |-- price: decimal(10,0) (nullable = true)
 |-- adjusted_price: decimal(10,0) (nullable = true)
 |-- minimum_nights: integer (nullable = true)
 |-- maximum_nights: integer (nullable = true)



In [37]:
# df1 = calendar_df.filter(calendar_df.listing_id == "2056638")
# df1.limit(5).toPandas()
calendar_df.limit(5).toPandas()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,,,,,,,
1,,,,,,,
2,,,,,,,
3,,,,,,,
4,,,,,,,
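
To narrow down which columns cause the empty rows, the raw text of one row can be checked against the intended types. The sketch below models strict parsing in plain Python; it is not Spark's parser, and the rule that only "true"/"false" count as booleans is an assumption made for illustration. The names `strict_bool`, `PARSERS` and `failing_columns` are hypothetical.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def strict_bool(text):
    # Assumption: only the words "true"/"false" are accepted as booleans.
    lowered = text.lower()
    if lowered not in ("true", "false"):
        raise ValueError(text)
    return lowered == "true"


# One parser per column, mirroring the types of the first schema attempt.
PARSERS = {
    "listing_id": str,
    "date": lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
    "available": strict_bool,
    "price": Decimal,
    "adjusted_price": Decimal,
    "minimum_nights": int,
    "maximum_nights": int,
}


def failing_columns(row):
    """Return the names of columns whose raw text does not fit the target type."""
    failed = []
    for name, parse in PARSERS.items():
        try:
            parse(row[name])
        except (ValueError, InvalidOperation):
            failed.append(name)
    return failed


# First data row of calendar.csv as shown in the untyped output.
sample_row = {
    "listing_id": "2056638", "date": "2023-03-30", "available": "f",
    "price": "$1,434.00", "adjusted_price": "$1,434.00",
    "minimum_nights": "1", "maximum_nights": "1125",
}
print(failing_columns(sample_row))
```

Under these assumptions, the three flagged columns are the ones that need cleaning before they can be typed.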


In [65]:
# define schema - 2nd attempt - CORRECTIONS! - no more "None" displayed - neutralized through StringType()
calendarSchema = R([
    Fld("listing_id",     Str()),
    Fld("date",           Date()),
    Fld("available",      Str()), # Bol()),
    Fld("price",          Str()), # Dec()),
    Fld("adjusted_price", Str()), # Dec()),
    Fld("minimum_nights", Int()),
    Fld("maximum_nights", Int()),
])

In [66]:
calendar_df = spark.read.option('header','true') \
                   .option('sep',',') \
                   .format('csv') \
                   .schema(calendarSchema) \
                   .load(calendar_csv)

In [60]:
#calendar_df = spark.read.option('header','true') \
#                   .option('sep',',') \
#                   .option('inferSchema', 'true') \
#                   .format('csv') \
#                   .schema(calendarSchema) \
#                   .load(calendar_csv)

In [49]:
calendar_df.count()

8841401

In [67]:
calendar_df.limit(5).toPandas()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,2056638,2023-03-30,f,"$1,434.00","$1,434.00",1,1125
1,2056638,2023-03-31,f,"$1,434.00","$1,434.00",1,1125
2,2056638,2023-04-01,f,"$1,434.00","$1,434.00",1,1125
3,2056638,2023-04-02,f,"$1,434.00","$1,434.00",1,1125
4,2056638,2023-04-03,f,"$1,434.00","$1,434.00",1,1125


In [68]:
# DEBUG 
# ___Schema given
calendar_df.printSchema()
print(">> WRITE OUT (temporary) FOR DATA DICTIONARY...")

root
 |-- listing_id: string (nullable = true)
 |-- date: date (nullable = true)
 |-- available: string (nullable = true)
 |-- price: string (nullable = true)
 |-- adjusted_price: string (nullable = true)
 |-- minimum_nights: integer (nullable = true)
 |-- maximum_nights: integer (nullable = true)

>> WRITE OUT (temporary) FOR DATA DICTIONARY...


In [None]:
# TODO: 
# - replace '$' by ''
# - replace 'f' to False
# calendar_df['price'] = calendar_df['price'].replace('$', '')

In [69]:
# TODO - Better?
calendar_df.limit(5).toPandas()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,2056638,2023-03-30,f,"$1,434.00","$1,434.00",1,1125
1,2056638,2023-03-31,f,"$1,434.00","$1,434.00",1,1125
2,2056638,2023-04-01,f,"$1,434.00","$1,434.00",1,1125
3,2056638,2023-04-02,f,"$1,434.00","$1,434.00",1,1125
4,2056638,2023-04-03,f,"$1,434.00","$1,434.00",1,1125


Because some values need a replacement first (e.g. removing the "$"), reading directly into typed columns does not work this way.  
Workaround: go through pandas.

#### "calendar.csv": Variant (B): direct reading and creating data frame ("_df") through pandas.read_csv()

In [55]:
# read csv(s) into pandas dataframe ("_df")
calendar_df = pd.read_csv(calendar_csv)

In [56]:
calendar_df.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,2056638,2023-03-30,f,"$1,434.00","$1,434.00",1.0,1125.0
1,2056638,2023-03-31,f,"$1,434.00","$1,434.00",1.0,1125.0
2,2056638,2023-04-01,f,"$1,434.00","$1,434.00",1.0,1125.0
3,2056638,2023-04-02,f,"$1,434.00","$1,434.00",1.0,1125.0
4,2056638,2023-04-03,f,"$1,434.00","$1,434.00",1.0,1125.0


In [40]:
# https://www.geeksforgeeks.org/python-pandas-dataframe-replace/
## calendar_df.replace(to_replace="f", value="S")
# DataFrame.replace() only matches complete cell values: "f" > "S" works, but "$" inside "$1,434.00" is not replaced.
# df.replace( to_replace= ["old-a", "old-b"], value="New-Value")
# df.replace( to_replace= np.nan, value= -9999 )  # NaN values will be replaced by the value -9999

# https://www.geeksforgeeks.org/replace-values-in-pandas-dataframe-using-regex/
# calendar_df.replace('$','',regex=True) # NO changes/ update, because "$" is the regex end-of-string anchor (escape it as r'\$')
# calendar_df['minimum_nights'].replace('.','/',regex=True) # does NOT work, because of the [float64] data type


#calendar_df['price'] = calendar_df['price'].str.replace('$','')
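
The regex variant tried above leaves the data unchanged because, in a regular expression, `$` means "end of string" rather than the dollar character. A minimal sketch with Python's `re` module (not pandas) that shows the difference on one price value from the file:

```python
import re

raw = "$1,434.00"

# As a regular expression, "$" is the end-of-string anchor, so nothing is removed.
unchanged = re.sub("$", "", raw)

# Escaping the character (or using a literal replace) removes the currency sign.
escaped = re.sub(r"\$", "", raw)
literal = raw.replace("$", "")

# Removing the thousands separator as well gives text that parses as a number.
numeric_text = re.sub(r"[$,]", "", raw)
print(unchanged, escaped, literal, numeric_text)
```

The last pattern removes both characters in one pass, which matters later when the price is converted to a numeric type.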

In [57]:
calendar_df['price'] = calendar_df['price'].str.replace('$', '', regex=False)   # literal "$", not a regex

In [59]:
calendar_df['adjusted_price'] = calendar_df['adjusted_price'].str.replace('$', '', regex=False)

In [60]:
calendar_df.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,2056638,2023-03-30,f,1434.0,1434.0,1.0,1125.0
1,2056638,2023-03-31,f,1434.0,1434.0,1.0,1125.0
2,2056638,2023-04-01,f,1434.0,1434.0,1.0,1125.0
3,2056638,2023-04-02,f,1434.0,1434.0,1.0,1125.0
4,2056638,2023-04-03,f,1434.0,1434.0,1.0,1125.0


In [61]:
calendar_df.count()
# !! Advantage of reading with PANDAS: quick check, how well columns are filled !!

listing_id        8841401
date              8841401
available         8841401
price             8841401
adjusted_price    8841401
minimum_nights    8841397
maximum_nights    8841397
dtype: int64
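
The counts show that `minimum_nights` and `maximum_nights` each have four fewer values than the other columns. Such gaps are also the usual reason why pandas displays these columns as floats (1.0, 1125.0): a column with missing values is stored as float64 by default. A minimal sketch of the same per-column check without pandas, on a made-up two-row sample (the helper `filled_counts` and the empty second row are hypothetical):

```python
import csv
import io

# Illustrative sample; the row with empty fields is invented for the demo.
sample = (
    "listing_id,date,minimum_nights,maximum_nights\n"
    "2056638,2023-03-30,1,1125\n"
    "2056638,2023-03-31,,\n"
)


def filled_counts(csv_text):
    """Count non-empty values per column, similar in spirit to pandas' count()."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value != "":
                counts[name] += 1
    return counts


counts = filled_counts(sample)
print(counts)
```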

In [62]:
# print first 3 rows of the data frame
calendar_df[:3]

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,2056638,2023-03-30,f,1434.0,1434.0,1.0,1125.0
1,2056638,2023-03-31,f,1434.0,1434.0,1.0,1125.0
2,2056638,2023-04-01,f,1434.0,1434.0,1.0,1125.0


<font color="red">Trying to create a Spark data frame from the "cleaned" pandas data, BUT it fails: the prices are still strings and still contain the thousands separator ",".</font>

In [76]:
# define schema - next attempt (after cleaning in pandas) - price columns typed as DecimalType()
calendarSchema = R([
    Fld("listing_id",     Str(),  True),
    Fld("date",           Str(),  True),   # TS(),   True),    # Date(), True),
    Fld("available",      Str(),  True),   # Bol(),  True),
    Fld("price",          Dec(),  True),
    Fld("adjusted_price", Dec(),  True),
    Fld("minimum_nights", Int(),  True),
    Fld("maximum_nights", Int(),  True)
])

In [77]:
df_calendar = spark.createDataFrame(calendar_df, calendarSchema)

TypeError: field price: DecimalType(10,0) can not accept object '1,434.00' in type <class 'str'>
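
The error message names two problems: the value is still a `str`, and it still contains the thousands separator. A minimal sketch of a conversion that addresses both, in plain Python. The helper names `to_decimal` and `to_bool` are hypothetical, and the flag value 't' for "available" is an assumption, since only 'f' appears in the rows shown here.

```python
from decimal import Decimal


def to_decimal(price_text):
    """Turn text such as '$1,434.00' or '1,434.00' into a Decimal with two places."""
    cleaned = price_text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned).quantize(Decimal("0.01"))


def to_bool(flag_text):
    """Map the 't'/'f' flags of the calendar file to True/False ('t' is assumed)."""
    return {"t": True, "f": False}[flag_text]


price = to_decimal("1,434.00")
available = to_bool("f")
print(price, available)
```

Note also that the printed schema shows `decimal(10,0)`, the default of `DecimalType()`, which has no decimal places; a declaration such as `DecimalType(10, 2)` would be needed to keep the cents.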


<font color="red">**NOTE:** The structure is already trimmed here: 'price' and the 'm*' columns ('minimum_nights', 'maximum_nights') are left out!</font>

In [None]:
# define schema - 3rd attempt - already cutting columns!
calendarSchema = R([
    Fld("listing_id",     Str()),
    Fld("date",           Date()),
    Fld("available",      Str()),
    Fld("adjusted_price", Str()),
])

#### Exploring "reviews.csv"

In [20]:
reviews_df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)



In [21]:
reviews_df.head()

Row(_c0='listing_id', _c1='date')

In [23]:
print(">> [" + str(reviews_df.count()) + "] rows read out from CSV file")

>> [886241] rows read out from CSV file


In [24]:
reviews_df.limit(5).toPandas()

Unnamed: 0,_c0,_c1
0,listing_id,date
1,696037,2012-10-31
2,44616,2011-11-09
3,44616,2012-08-16
4,44616,2012-12-28


In [25]:
# define schema 
reviewsSchema = R([
    Fld("listing_id", Int()),
    Fld("date",       Date()),
])

In [26]:
reviews_df = spark.read.option("header","true").format("csv").schema(reviewsSchema).load(reviews_csv)

In [27]:
# DEBUG 
# ___Schema given
reviews_df.printSchema()
print(">> WRITE OUT FOR DATA DICTIONARY...")

root
 |-- listing_id: integer (nullable = true)
 |-- date: date (nullable = true)

>> WRITE OUT FOR DATA DICTIONARY...


In [28]:
reviews_df.limit(5).toPandas()

Unnamed: 0,listing_id,date
0,696037,2012-10-31
1,44616,2011-11-09
2,44616,2012-08-16
3,44616,2012-12-28
4,44616,2013-01-04


#### Exploring "reviews_detailed.csv"

In [31]:
rev_det_df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)



In [32]:
rev_det_df.head()

Row(_c0='listing_id', _c1='id', _c2='date', _c3='reviewer_id', _c4='reviewer_name', _c5='comments')

In [33]:
print(">> [" + str(rev_det_df.count()) + "] rows read out from CSV file")

>> [902194] rows read out from CSV file


In [34]:
rev_det_df.limit(5).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5
0,listing_id,id,date,reviewer_id,reviewer_name,comments
1,44616,706908,2011-11-09,634733,Lindsay,Forget staying in a hotel. Stay at condesa hau...
2,2056638,9623913,2014-01-03,6743067,Nora Carolina,"El sitio es precioso, está muy bien ubicado al..."
3,<br/>Las vías de comunicación son muy buenas,los servicios públicos también son buenos.,,,,
4,<br/>Es deseable que el lugar tenga muebles un...,que la cocina esté mejor equipada también,ya que,aunque existen muchos sitios donde comer,elegimos rentar a través de airbnb por la com...,que es lo que ofrecen.


In [66]:
# define schema 
rev_detSchema = R([
    Fld("listing_id",    Int()),
    Fld("id",            Int()),
    Fld("date",          Date()),
    Fld("reviewer_id",   Int()),
    Fld("reviewer_name", Str()),
    Fld("comments",      Str()),
])

In [67]:
rev_det_df = spark.read.format("csv") \
                  .option("header",    True) \
                  .option('quote',     '"') \
                  .option("delimiter", ',') \
                  .schema(rev_detSchema) \
                  .load(rev_det_csv)

In [68]:
# DEBUG 
# ___Schema given
rev_det_df.printSchema()
print(">> WRITE OUT FOR DATA DICTIONARY...")

root
 |-- listing_id: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- date: date (nullable = true)
 |-- reviewer_id: integer (nullable = true)
 |-- reviewer_name: string (nullable = true)
 |-- comments: string (nullable = true)

>> WRITE OUT FOR DATA DICTIONARY...


In [69]:
rev_det_df.limit(5).toPandas()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,44616.0,706908.0,2011-11-09,634733.0,Lindsay,Forget staying in a hotel. Stay at condesa hau...
1,2056638.0,9623913.0,2014-01-03,6743067.0,Nora Carolina,"El sitio es precioso, está muy bien ubicado al..."
2,,,,,,
3,,,,,,
4,,,,,,


<font color="red">FAQ: Why can't I read the remaining lines? Which character breaks the parsing?</font>

**REFERENCE**  

![reviews_detailed.csv looking inside](reviews_detailed.csv-viewing-data.png)
_Image: [reviews_detailed.csv-viewing-data.png] Searching for the problematic characters_
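
One likely cause, judging from rows 3 and 4 of the raw preview above, is that the `comments` field contains line breaks (and `<br/>` markup) inside a quoted value. A line-based reader then treats every continuation line as a new record, which would also explain why the row count (902194) is higher than for `reviews.csv` (886241). The sketch below uses Python's standard `csv` module on a small made-up sample (not the real data) to show the difference between counting physical lines and parsing quoted multi-line fields.

```python
import csv
import io

# Made-up sample that mimics reviews_detailed.csv: the second record has a
# quoted comment containing line breaks and doubled quotes ("").
sample = (
    'listing_id,id,date,reviewer_id,reviewer_name,comments\n'
    '44616,706908,2011-11-09,634733,Lindsay,Forget staying in a hotel.\n'
    '2056638,9623913,2014-01-03,6743067,Nora Carolina,"El sitio es precioso, ""muy"" bien ubicado.\n'
    '<br/>Las vias de comunicacion son muy buenas, los servicios tambien.\n'
    '<br/>Es deseable que la cocina este mejor equipada."\n'
)

# Naive view: one physical line = one record (header excluded).
naive_rows = sample.strip().split('\n')[1:]

# Quote-aware view: csv.reader keeps a quoted multi-line field in one record.
# newline='' passes embedded line breaks through untouched.
reader = csv.reader(io.StringIO(sample, newline=''))
header = next(reader)
parsed_rows = list(reader)

print(len(naive_rows), "physical lines vs.", len(parsed_rows), "logical records")
print(parsed_rows[1][5])
```

In Spark, the CSV reader options `multiLine` (set to `True`) and `escape` (set to `'"'`) address the same situation. Whether they fully repair this particular file is an assumption that still has to be verified on the cluster.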

#### Exploring "listings_detailed.csv"

In [71]:
list_det_df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- _c20: string (nullable = true)
 |-- _c21: string (nullable = true)
 |-- _c22: string (nullable = true)
 |-- _c23: string (nullable = true)
 |-- _c24: string (nullable = true)
 |-- _c25: string (nullable = true)
 |-- _c26: string (nullable = true)
 |-- _c27: string (nullable = tru

In [72]:
list_det_df.head()

Row(_c0='id', _c1='listing_url', _c2='scrape_id', _c3='last_scraped', _c4='source', _c5='name', _c6='description', _c7='neighborhood_overview', _c8='picture_url', _c9='host_id', _c10='host_url', _c11='host_name', _c12='host_since', _c13='host_location', _c14='host_about', _c15='host_response_time', _c16='host_response_rate', _c17='host_acceptance_rate', _c18='host_is_superhost', _c19='host_thumbnail_url', _c20='host_picture_url', _c21='host_neighbourhood', _c22='host_listings_count', _c23='host_total_listings_count', _c24='host_verifications', _c25='host_has_profile_pic', _c26='host_identity_verified', _c27='neighbourhood', _c28='neighbourhood_cleansed', _c29='neighbourhood_group_cleansed', _c30='latitude', _c31='longitude', _c32='property_type', _c33='room_type', _c34='accommodates', _c35='bathrooms', _c36='bathrooms_text', _c37='bedrooms', _c38='beds', _c39='amenities', _c40='price', _c41='minimum_nights', _c42='maximum_nights', _c43='minimum_minimum_nights', _c44='maximum_minimum_ni

In [73]:
print(">> [" + str(list_det_df.count()) + "] rows read out from CSV file")

>> [42432] rows read out from CSV file


In [74]:
list_det_df.limit(5).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,...,_c65,_c66,_c67,_c68,_c69,_c70,_c71,_c72,_c73,_c74
0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
1,2056638,https://www.airbnb.com/rooms/2056638,20230329041210,2023-03-30,city scrape,Amplio y luminoso loft en Coyoacán,Cómodo loft de dos pisos magníficamente ubicad...,,https://a0.muscache.com/pictures/28353712/4379...,10531228,...,2014-01-03,2018-06-24,4.95,4.71,4.95,5.0,4.95,5.0,4.86,
2,2072354,https://www.airbnb.com/rooms/2072354,20230329041210,2023-03-30,city scrape,Coyoacan Historic Studio Apartment,This studio flat is adjacent to the owner's ho...,"Located in Coyoacan, in a quiet neighborhood w...",https://a0.muscache.com/pictures/369f3371-593a...,16840050,...,"""""Clothing storage: closet""""","""""Body soap""""]""",$830.00,3,1125,3,3,1125,1125,3.0
3,696037,https://www.airbnb.com/rooms/696037,20230329041210,2023-03-29,city scrape,"3 Bedrooms, 2 blocks from Polanco - ALL RENOVA...","Beautifully decorated 3 bedroom apartment, it ...","This area is called Nuevo Polanco, it has bein...",https://a0.muscache.com/pictures/10960397/67b2...,3531879,...,,,,,,,,,,
4,I love Mexico City,and I know every single thing to do and see,specially historical places and museums. More...,within an hour,100%,90%,t,https://a0.muscache.com/im/users/3531879/profi...,https://a0.muscache.com/im/users/3531879/profi...,Centro Histórico,...,"""""Dryer""""","""""Clothing storage: closet""""","""""Shared indoor pool - heated""""","""""Shampoo""""]""","$1,469.00",24,1125,24,24,1125


In [76]:
# define schema with only the relevant columns (a first data-cleaning step)
list_detSchema = R([
    Fld("id",                   Int()),
    Fld("name",                 Str()),
    Fld("description",          Str()),
    Fld("host_id",              Int()),
    Fld("host_name",            Str()),
    Fld("host_since",           Date()),
    Fld("source",               Str()),
    Fld("latitude",             Flt()),
    Fld("longitude",            Flt()),
    Fld("price",                Flt()),
    Fld("review_scores_rating", Int()),
    Fld("reviews_per_month",    Flt()),
    Fld("roomtype",             Str()),
    Fld("accommodates",         Int()),
    Fld("bathrooms",            Int()),
    Fld("bedrooms",             Int()),
    Fld("beds",                 Int()),
])

In [77]:
list_det_df = spark.read.option("header","true").format("csv").schema(list_detSchema).load(list_det_csv)

In [78]:
# DEBUG 
# ___Schema given
list_det_df.printSchema()
print(">> WRITE OUT FOR DATA DICTIONARY...")

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- host_id: integer (nullable = true)
 |-- host_name: string (nullable = true)
 |-- host_since: date (nullable = true)
 |-- source: string (nullable = true)
 |-- latitude: float (nullable = true)
 |-- longitude: float (nullable = true)
 |-- price: float (nullable = true)
 |-- review_scores_rating: integer (nullable = true)
 |-- reviews_per_month: float (nullable = true)
 |-- roomtype: string (nullable = true)
 |-- accommodates: integer (nullable = true)
 |-- bathrooms: integer (nullable = true)
 |-- bedrooms: integer (nullable = true)
 |-- beds: integer (nullable = true)

>> WRITE OUT FOR DATA DICTIONARY...


In [79]:
list_det_df.limit(5).toPandas()

Unnamed: 0,id,name,description,host_id,host_name,host_since,source,latitude,longitude,price,review_scores_rating,reviews_per_month,roomtype,accommodates,bathrooms,bedrooms,beds
0,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,


<font color="red">FAQ: Why can't I read the remaining lines? Which character breaks the parsing?</font>

**REFERENCE**  

![listings_detailed.csv looking inside](listings_detailed.csv-viewing-data.png)
_Image: [listings_detailed.csv-viewing-data.png] Searching for the problematic characters_
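
A plausible explanation for the all-null rows: when a schema is passed to Spark's CSV reader, fields are assigned by position, not by header name. `list_detSchema` declares 17 fields while the file has 75 columns, so `name` would be matched against `listing_url`, `description` against `scrape_id`, and so on. This reading is an assumption based on the output above, not a confirmed diagnosis. The sketch below shows the alternative of selecting columns by header name, using the standard `csv` module on a made-up, shortened sample. The helper `select_columns` and the `WANTED` mapping are hypothetical names introduced for illustration.

```python
import csv
import io

# Made-up, shortened sample: the file has more columns than we want to keep.
sample = (
    'id,listing_url,name,host_id,room_type,price\n'
    '2056638,https://www.airbnb.com/rooms/2056638,Loft,10531228,Entire home/apt,$830.00\n'
    '696037,https://www.airbnb.com/rooms/696037,"3 Bedrooms, renovated",3531879,Private room,"$1,469.00"\n'
)

# Hypothetical mapping: wanted column name -> conversion function.
WANTED = {"id": int, "name": str, "host_id": int, "room_type": str}


def select_columns(text, wanted):
    """Pick columns by header name (not by position) and convert their types."""
    rows = []
    for record in csv.DictReader(io.StringIO(text, newline='')):
        rows.append({col: cast(record[col]) for col, cast in wanted.items()})
    return rows


listings = select_columns(sample, WANTED)
for row in listings:
    print(row)
```

In PySpark, the comparable approach would be to read the file with the header and without a schema, then `select` the wanted columns and `cast` them. Note that the header names the column `room_type`, while the schema above uses `roomtype`.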

In [None]:
# !! Re-run, reading with pandas, to check via
# .. "list_det_fd.count()" how well the columns are populated.
# .. Some are "0"!
# ['bed'] possibly contains "NA"/"inf" - ARGH

In [None]:
# https://stackoverflow.com/questions/31028815/how-to-unzip-gz-file-using-python
import gzip
import pandas as pd

gz_file = input_data + 'listings.csv.gz'

# Decompress on the fly and let pandas parse the CSV
with gzip.open(gz_file) as f:
    list_det_fd = pd.read_csv(f)

In [39]:
list_det_fd.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2056638,https://www.airbnb.com/rooms/2056638,20230329041210,2023-03-30,city scrape,Amplio y luminoso loft en Coyoacán,Cómodo loft de dos pisos magníficamente ubicad...,,https://a0.muscache.com/pictures/28353712/4379...,10531228,...,4.95,5.0,4.86,,f,3,3,0,0,0.19
1,2072354,https://www.airbnb.com/rooms/2072354,20230329041210,2023-03-30,city scrape,Coyoacan Historic Studio Apartment,This studio flat is adjacent to the owner's ho...,"Located in Coyoacan, in a quiet neighborhood w...",https://a0.muscache.com/pictures/369f3371-593a...,16840050,...,5.0,4.95,4.84,,f,2,2,0,0,0.54
2,696037,https://www.airbnb.com/rooms/696037,20230329041210,2023-03-29,city scrape,"3 Bedrooms, 2 blocks from Polanco - ALL RENOVA...","Beautifully decorated 3 bedroom apartment, it ...","This area is called Nuevo Polanco, it has bein...",https://a0.muscache.com/pictures/10960397/67b2...,3531879,...,4.92,4.58,4.71,,f,3,3,0,0,0.31
3,35797,https://www.airbnb.com/rooms/35797,20230329041210,2023-03-29,city scrape,Villa Dante,"Dentro de Villa un estudio de arte con futon, ...","Centro comercial Santa Fe, parque interlomas y...",https://a0.muscache.com/pictures/f395ab78-1185...,153786,...,,,,,f,1,1,0,0,
4,44616,https://www.airbnb.com/rooms/44616,20230329041210,2023-03-30,city scrape,CONDESA HAUS B&B,A new concept of hosting in mexico through a b...,,https://a0.muscache.com/pictures/251410/ec75fe...,196253,...,4.78,4.98,4.48,,f,12,3,2,0,0.46


In [7]:
list_det_fd.count()

id                                              24224
listing_url                                     24224
scrape_id                                       24224
last_scraped                                    24224
source                                          24224
name                                            24223
description                                     23296
neighborhood_overview                           15145
picture_url                                     24224
host_id                                         24224
host_url                                        24224
host_name                                       24224
host_since                                      24224
host_location                                   19052
host_about                                      13560
host_response_time                              21133
host_response_rate                              21133
host_acceptance_rate                            22181
host_is_superhost           

In [40]:
list_det_fd.show()

AttributeError: 'DataFrame' object has no attribute 'show'

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [None]:
# Performing cleaning tasks here
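
As a starting point for the cleaning tasks, the sketch below covers two issues visible in the previews above: prices stored as strings such as `$1,469.00`, and flags stored as `t`/`f`. The helper names `parse_price` and `parse_flag` are hypothetical, and treating empty or unparseable values as missing (`None`) is an assumption, not a decision taken in this project.

```python
from decimal import Decimal, InvalidOperation


def parse_price(raw):
    """Convert a price string like '$1,469.00' to Decimal; None if missing or invalid."""
    if raw is None:
        return None
    cleaned = raw.strip().replace("$", "").replace(",", "")
    if not cleaned:
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def parse_flag(raw):
    """Map 't'/'f' to True/False; anything else is treated as missing (None)."""
    return {"t": True, "f": False}.get((raw or "").strip().lower())


print(parse_price("$1,469.00"))
print(parse_price(""))
print(parse_flag("f"))
```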





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
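
The three kinds of checks listed above could be sketched as plain Python functions, shown here on made-up rows standing in for a cleaned table. The function names and the sample rows are hypothetical. On the cluster, the same ideas would be expressed against the Spark DataFrames instead.

```python
def check_row_count(rows, minimum=1):
    """Completeness check: the table must contain at least `minimum` rows."""
    return len(rows) >= minimum


def check_unique_key(rows, key):
    """Integrity check: `key` must be present (not None) and unique across rows."""
    values = [row.get(key) for row in rows]
    return None not in values and len(values) == len(set(values))


def check_types(rows, expected):
    """Type check: each non-missing value must be an instance of the expected type."""
    return all(
        row[col] is None or isinstance(row[col], typ)
        for row in rows
        for col, typ in expected.items()
    )


# Made-up rows standing in for the cleaned listings table.
listings = [
    {"id": 2056638, "host_id": 10531228, "name": "Loft"},
    {"id": 696037, "host_id": 3531879, "name": None},
]

results = {
    "row_count": check_row_count(listings),
    "unique_id": check_unique_key(listings, "id"),
    "types": check_types(listings, {"id": int, "host_id": int, "name": str}),
}
print(results)
```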

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

Planned overview of the defined schemas (TODO: expand with a "Description" for each field):

`Neighbourhood  
++++++++++++++  
 |-- neighbourhood_group: string (nullable = true)  
 |-- neighbourhood: string (nullable = true)  
`

`reviews ( Summary ) NOT Detailed  
++++++++  
 |-- listing_id: integer (nullable = true)  
 |-- date: date (nullable = true)  
`

`reviews_detailed  
+++++++++++++++++  
 |-- listing_id: integer (nullable = true)  
 |-- id: integer (nullable = true)  
 |-- date: date (nullable = true)  
 |-- reviewer_id: integer (nullable = true)  
 |-- reviewer_name: string (nullable = true)  
 |-- comments: string (nullable = true)  
`

`calendar  
++++++++++  
 |-- listing_id: integer (nullable = true)  
 |-- date: date (nullable = true)  
 |-- available: boolean (nullable = true)  
 |-- adjusted_price: decimal(10,0) (nullable = true)  
`

`listings (Summary) - NOT Detailed!  
+++++++++  
 |-- id: integer (nullable = true)  
 |-- name: string (nullable = true)  
 |-- host_id: integer (nullable = true)  
 |-- host_name: string (nullable = true)  
 |-- neighbourhood_group: string (nullable = true)  
 |-- neighbourhood: string (nullable = true)  
 |-- latitude: float (nullable = true)  
 |-- longitude: float (nullable = true)  
 |-- room_type: string (nullable = true)  
 |-- requested_price: integer (nullable = true)  
 |-- minimum_nights: integer (nullable = true)  
 |-- number_of_reviews: integer (nullable = true)  
 |-- last_review: date (nullable = true)  
 |-- reviews_per_month: float (nullable = true)  
 |-- calculated_host_listings_count: integer (nullable = true)  
 |-- availability_365: integer (nullable = true)  
 |-- number_of_reviews_ltm: integer (nullable = true)  
 |-- license: integer (nullable = true)  
`

`listings_detailed  
++++++++++++++++++  
 |-- id: integer (nullable = true)  
 |-- name: string (nullable = true)  
 |-- description: string (nullable = true)  
 |-- host_id: integer (nullable = true)  
 |-- host_name: string (nullable = true)  
 |-- host_since: date (nullable = true)  
 |-- source: string (nullable = true)  
 |-- latitude: float (nullable = true)  
 |-- longitude: float (nullable = true)  
 |-- price: float (nullable = true)  
 |-- review_scores_rating: integer (nullable = true)  
 |-- reviews_per_month: float (nullable = true)  
 |-- roomtype: string (nullable = true)  
 |-- accommodates: integer (nullable = true)  
 |-- bathrooms: integer (nullable = true)  
 |-- bedrooms: integer (nullable = true)  
 |-- beds: integer (nullable = true)  
`
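
To add the pending descriptions, the schema listings above could be rendered as Markdown tables with a description column. The sketch below is one possible way to do that. The function `to_markdown_table` and the `reviews_fields` list are hypothetical, and the descriptions are left as placeholders for the author to fill in.

```python
# Hypothetical excerpt of one schema: (field, type, description).
# Descriptions are placeholders to be written by the author.
reviews_fields = [
    ("listing_id", "integer", "TODO"),
    ("date", "date", "TODO"),
]


def to_markdown_table(table_name, fields):
    """Render one table's fields as a Markdown data-dictionary table."""
    lines = [
        f"**{table_name}**",
        "",
        "| Field | Type | Description |",
        "| --- | --- | --- |",
    ]
    lines += [f"| {name} | {dtype} | {desc} |" for name, dtype, desc in fields]
    return "\n".join(lines)


print(to_markdown_table("reviews", reviews_fields))
```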

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.