# Airbnb Data Warehousing Data Transformation Script

## Set Up

### Import Required Modules

In [1]:
import pandas as pd


### Read in files

In [2]:
reviews = pd.read_csv("data/reviews.csv")
listings = pd.read_csv("data/listings.csv")

### Getting General Information on Listings Table

In [3]:
listings.info(max_cols=None)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2668 entries, 0 to 2667
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            2668 non-null   int64  
 1   listing_url                                   2668 non-null   object 
 2   scrape_id                                     2668 non-null   int64  
 3   last_scraped                                  2668 non-null   object 
 4   source                                        2668 non-null   object 
 5   name                                          2668 non-null   object 
 6   description                                   0 non-null      float64
 7   neighborhood_overview                         1826 non-null   object 
 8   picture_url                                   2668 non-null   object 
 9   host_id                                       2668 non-null   i

### Dropping redundant/empty columns

In [4]:
# These two columns hold the same dates, so we can drop one of them
print(pd.to_datetime(listings["last_scraped"]).equals(pd.to_datetime(listings["calendar_last_scraped"])))

# The others have no non-null values, so let's remove all the unnecessary fluff
listings = listings.drop(
    columns=[
        "calendar_last_scraped",
        "description",
        "calendar_updated",
        "bedrooms",
        "bathrooms",
        "neighbourhood_group_cleansed",
    ]
)



True
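
Rather than listing the empty columns by hand, they can also be detected from the data. The following is a minimal sketch on a made-up frame (`demo` and its values are assumptions for illustration, only the column names come from the notebook); it is not the approach the notebook itself takes.

```python
import pandas as pd

# Toy frame standing in for `listings`; values are invented for illustration.
demo = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "description": [None, None, None],
        "calendar_updated": [float("nan")] * 3,
        "name": ["a", "b", None],
    }
)

# Collect every column in which all values are missing
empty_cols = [col for col in demo.columns if demo[col].isna().all()]
print(empty_cols)

# Drop them in a single call
cleaned = demo.drop(columns=empty_cols)
print(cleaned.columns.tolist())
```

A column such as `name`, which is only partly empty, is kept.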


### Setting datetime objects to datetime

In [5]:
# Changing the datetime columns to the datetime data type
listings.last_scraped = pd.to_datetime(listings.last_scraped)
listings.host_since = pd.to_datetime(listings.host_since)
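
As an alternative, the conversion can happen while the file is read. The sketch below uses an in-memory CSV with invented rows (the `csv_text` content and the choice of `first_review` as a third date column are assumptions, not taken from the real file) and relies on the `parse_dates` argument of `pd.read_csv`.

```python
import io
import pandas as pd

# Invented sample rows; the real data lives in data/listings.csv
csv_text = """id,last_scraped,host_since,first_review
1,2023-03-28,2015-06-01,2016-01-15
2,2023-03-29,2018-11-20,
"""

date_cols = ["last_scraped", "host_since", "first_review"]

# parse_dates converts the named columns to datetime64 during loading;
# empty cells become NaT
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=date_cols)
print(demo.dtypes)
```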


## Making all the dimension tables

### Starting Off With Host Dimension Tables

In [6]:
host_df = listings[
    [
        "host_id",
        "host_url",
        "host_name",
        "host_since",
        "host_location",
        "host_about",
        "host_thumbnail_url",
        "host_picture_url",
        "host_neighbourhood",
        "host_response_time",
        "host_response_rate",
        "host_acceptance_rate",
        "host_is_superhost",
        "host_listings_count",
        "host_total_listings_count",
        "host_verifications",
        "host_has_profile_pic",
        "host_identity_verified",
        "calculated_host_listings_count",
        "calculated_host_listings_count_entire_homes",
        "calculated_host_listings_count_private_rooms",
        "calculated_host_listings_count_shared_rooms",
    ]
].drop_duplicates().reset_index()


#### The Dimensions of the Hosts Table

In [7]:
host_ld_df = host_df[
    [
        "host_response_time",
        "host_response_rate",
        "host_acceptance_rate",
        "host_is_superhost",
        "host_listings_count",
        "host_total_listings_count",
        "host_verifications",
        "host_has_profile_pic",
        "host_identity_verified",
    ]
].drop_duplicates().reset_index()

hqad_df = host_df[
    [
        "calculated_host_listings_count",
        "calculated_host_listings_count_entire_homes",
        "calculated_host_listings_count_private_rooms",
        "calculated_host_listings_count_shared_rooms",
    ]
].drop_duplicates().reset_index()

#### Giving Each Dimension of The Host Dimension A Primary Key

In [8]:
host_ld_df["listing_diagnostics_id"] = host_ld_df.index
hqad_df["hqad_id"] = hqad_df.index

#### And Then Reorganizing so The ID Is The First Column of Each Host Dimension

In [9]:
host_ld_df = host_ld_df[
    [
        "listing_diagnostics_id",
        "host_response_time",
        "host_response_rate",
        "host_acceptance_rate",
        "host_is_superhost",
        "host_listings_count",
        "host_total_listings_count",
        "host_verifications",
        "host_has_profile_pic",
        "host_identity_verified",
    ]
]

hqad_df = hqad_df[
    [
        "hqad_id",
        "calculated_host_listings_count",
        "calculated_host_listings_count_entire_homes",
        "calculated_host_listings_count_private_rooms",
        "calculated_host_listings_count_shared_rooms",
    ]
]
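
The three steps above (select and de-duplicate, assign an ID, move the ID to the front) repeat for every dimension table that follows. As a rough sketch, they could be folded into one helper. `build_dimension` is a hypothetical name introduced here, and the toy frame is invented for illustration.

```python
import pandas as pd


def build_dimension(source, columns, key_name):
    """Select `columns`, drop duplicate rows, and prepend an ID column."""
    # drop=True discards the old row labels instead of keeping them as a column
    dim = source[columns].drop_duplicates().reset_index(drop=True)
    # Insert the fresh 0..n-1 index as the first column
    dim.insert(0, key_name, dim.index)
    return dim


# Toy stand-in for host_df
demo = pd.DataFrame(
    {
        "calculated_host_listings_count": [1, 1, 3],
        "calculated_host_listings_count_entire_homes": [1, 1, 2],
        "host_name": ["Ann", "Bob", "Cy"],
    }
)

hqad_demo = build_dimension(
    demo,
    ["calculated_host_listings_count", "calculated_host_listings_count_entire_homes"],
    "hqad_id",
)
print(hqad_demo)
```

Using `reset_index(drop=True)` also avoids the leftover `index` column that a plain `reset_index()` adds.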

### We'll Be Doing What We Just Did To The Other Tables

### Next Up, Property Dimension Table

In [10]:
property_df = listings[[
"latitude", 
"longitude", 
"property_type", 
"room_type", 
"accommodates", 
"bathrooms_text", 
"beds", 
"amenities", 
"price", 
]].drop_duplicates().reset_index()

In [11]:
# Use the row index as the primary key
property_df["property_id"] = property_df.index

In [12]:
property_df = property_df[
    [
    "property_id", 
    "latitude", 
    "longitude", 
    "property_type", 
    "room_type", 
    "accommodates", 
    "bathrooms_text", 
    "beds", 
    "amenities", 
    "price", 
    ]
]

### Now The Reviews Diagnostics Dimension Table

In [13]:
reviews_diagnostics_df = listings[
    [
        "number_of_reviews", 
        "number_of_reviews_ltm", 
        "number_of_reviews_l30d", 
        "first_review", 
        "last_review", 
        "review_scores_rating", 
        "review_scores_accuracy", 
        "review_scores_cleanliness", 
        "review_scores_checkin", 
        "review_scores_communication", 
        "review_scores_location", 
        "review_scores_value", 
        "reviews_per_month",
    ]
].drop_duplicates().reset_index()

In [14]:
reviews_diagnostics_df["rev_diag_id"] = reviews_diagnostics_df.index

In [15]:
reviews_diagnostics_df = reviews_diagnostics_df[
    [
        "rev_diag_id",
        "number_of_reviews", 
        "number_of_reviews_ltm", 
        "number_of_reviews_l30d", 
        "first_review", 
        "last_review", 
        "review_scores_rating", 
        "review_scores_accuracy", 
        "review_scores_cleanliness", 
        "review_scores_checkin", 
        "review_scores_communication", 
        "review_scores_location", 
        "review_scores_value", 
        "reviews_per_month",
    ]
]

### Scrapings Dimension Table

In [16]:
scrapings_df = listings[
    [
        "scrape_id",
        "last_scraped",
        "source" 
    ]
].drop_duplicates().reset_index()

In [17]:
scrapings_df["scrapings_id"] = scrapings_df.index

In [18]:
scrapings_df = scrapings_df[
    [
        "scrapings_id",
        "scrape_id",
        "last_scraped",
        "source" 
    ]
]

### Neighbourhood Dimension Table

In [19]:
neighbourhood_df = listings[
    [
        "neighbourhood",
        "neighborhood_overview",
        "neighbourhood_cleansed",
    ]
].drop_duplicates().reset_index()

#### Fixing Spelling Error and Assigning ID

In [20]:
# Renaming the column so its spelling matches the other neighbourhood columns
neighbourhood_df = neighbourhood_df.rename(columns={"neighborhood_overview": "neighbourhood_overview"})

# Assigning ID
neighbourhood_df["neighbourhood_id"] = neighbourhood_df.index

In [21]:
neighbourhood_df = neighbourhood_df[
    [
        "neighbourhood_id",
        "neighbourhood",
        "neighbourhood_overview",
        "neighbourhood_cleansed",
    ]
]

### Min And Max Insights Dimension Table

In [22]:
minmax_insights_df = listings[
    [
        "maximum_nights",
        "minimum_nights",
        "minimum_minimum_nights",
        "maximum_minimum_nights",
        "minimum_maximum_nights",
        "maximum_maximum_nights",
        "minimum_nights_avg_ntm",
        "maximum_nights_avg_ntm",
    ]
].drop_duplicates().reset_index()

In [23]:
minmax_insights_df["minmax_insights_id"] = minmax_insights_df.index

In [24]:
minmax_insights_df = minmax_insights_df[
    [
        "minmax_insights_id",
        "maximum_nights",
        "minimum_nights",
        "minimum_minimum_nights",
        "maximum_minimum_nights",
        "minimum_maximum_nights",
        "maximum_maximum_nights",
        "minimum_nights_avg_ntm",
        "maximum_nights_avg_ntm",
    ]
]

### Availability Dimension Table

In [25]:
availibility_df = listings[
    [
        "has_availability",
        "availability_30",
        "availability_60",
        "availability_90",
        "availability_365",
    ]
].drop_duplicates().reset_index()

In [26]:
availibility_df["avail_id"] = availibility_df.index

In [27]:
availibility_df = availibility_df[
    [
        "avail_id",
        "has_availability",
        "availability_30",
        "availability_60",
        "availability_90",
        "availability_365",
    ]
]

### Finding Unique Values To Base The Joins

#### Host Dimension Table 

In [28]:
pd.set_option("display.max_columns", None)
# Host ID works
host_df.nunique()

index                                           977
host_id                                         977
host_url                                        977
host_name                                       720
host_since                                      867
host_location                                    95
host_about                                      513
host_thumbnail_url                              940
host_picture_url                                940
host_neighbourhood                              151
host_response_time                                4
host_response_rate                               26
host_acceptance_rate                             58
host_is_superhost                                 2
host_listings_count                              47
host_total_listings_count                        55
host_verifications                                5
host_has_profile_pic                              2
host_identity_verified                            2
calculated_h

##### The Host Qualifications and Diagnostics (HQAD), And The Host Listings Diagnostics

In [29]:
host_ld_df.nunique()

listing_diagnostics_id       609
host_response_time             4
host_response_rate            26
host_acceptance_rate          58
host_is_superhost              2
host_listings_count           47
host_total_listings_count     55
host_verifications             5
host_has_profile_pic           2
host_identity_verified         2
dtype: int64

In [30]:
hqad_df.nunique()

hqad_id                                         69
calculated_host_listings_count                  31
calculated_host_listings_count_entire_homes     33
calculated_host_listings_count_private_rooms    13
calculated_host_listings_count_shared_rooms      3
dtype: int64

#### Property Dimension Table

In [31]:
print(property_df.count())
# This does not quite work, but it looks like there are multiple occurrences of the same property with minuscule differences from each other.
property_df[["latitude", "longitude"]].drop_duplicates().count()

property_id       2649
latitude          2649
longitude         2649
property_type     2649
room_type         2649
accommodates      2649
bathrooms_text    2649
beds              2618
amenities         2649
price             2560
dtype: int64


latitude     2581
longitude    2581
dtype: int64
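
To see what separates the rows behind the gap between 2649 property rows and 2581 distinct coordinate pairs, the rows that share coordinates can be pulled out and compared side by side. This is a small sketch on invented data (`demo` and its values are assumptions), using `DataFrame.duplicated` with `keep=False` so that every member of a duplicate group is returned.

```python
import pandas as pd

# Invented rows: the first two share coordinates but differ in room_type
demo = pd.DataFrame(
    {
        "latitude": [41.1, 41.1, 41.2],
        "longitude": [-73.9, -73.9, -73.8],
        "room_type": ["Entire home/apt", "Private room", "Private room"],
    }
)

coords = ["latitude", "longitude"]

# keep=False flags all rows of a duplicate group, not only the later ones
shared = demo[demo.duplicated(subset=coords, keep=False)]
print(shared.sort_values(coords))
```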

#### Reviews Dimension Table

In [32]:
print(reviews_diagnostics_df.count())
reviews_diagnostics_df[
    [ 
        "number_of_reviews", 
        "number_of_reviews_ltm", 
        "number_of_reviews_l30d", 
        "first_review", 
        "last_review",
        "review_scores_rating",
        "review_scores_accuracy",
        "review_scores_cleanliness",
        "review_scores_checkin",
        "review_scores_communication",
        "review_scores_location",
        "review_scores_value",
        "reviews_per_month",
    ]
].drop_duplicates().count()

rev_diag_id                    2272
number_of_reviews              2272
number_of_reviews_ltm          2272
number_of_reviews_l30d         2272
first_review                   2269
last_review                    2269
review_scores_rating           2271
review_scores_accuracy         2271
review_scores_cleanliness      2270
review_scores_checkin          2271
review_scores_communication    2271
review_scores_location         2271
review_scores_value            2271
reviews_per_month              2269
dtype: int64


number_of_reviews              2272
number_of_reviews_ltm          2272
number_of_reviews_l30d         2272
first_review                   2269
last_review                    2269
review_scores_rating           2271
review_scores_accuracy         2271
review_scores_cleanliness      2270
review_scores_checkin          2271
review_scores_communication    2271
review_scores_location         2271
review_scores_value            2271
reviews_per_month              2269
dtype: int64
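
Note that `count()` reports non-null values per column, which is why `first_review` shows 2269 while `rev_diag_id` shows 2272. A more direct way to test whether a set of columns identifies every row is to look for duplicated combinations. The helper below is a sketch; `is_candidate_key` is a hypothetical name and the data is invented.

```python
import pandas as pd


def is_candidate_key(df, cols):
    """True when no two rows of `df` share the same values in `cols`."""
    return not df.duplicated(subset=cols).any()


# Invented rows: the last two agree on everything except the rating
demo = pd.DataFrame(
    {
        "number_of_reviews": [0, 5, 5],
        "first_review": [None, "2020-01-01", "2020-01-01"],
        "review_scores_rating": [None, 4.8, 4.9],
    }
)

print(len(demo))     # number of rows, missing values included
print(demo.count())  # non-null values per column
print(is_candidate_key(demo, ["number_of_reviews", "first_review"]))
print(is_candidate_key(demo, list(demo.columns)))
```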

#### Scrapings Dimension Table

In [33]:
# We can use last_scraped and source as unique identifiers
scrapings_df.count()
scrapings_df[["last_scraped", "source"]].drop_duplicates().count()

last_scraped    3
source          3
dtype: int64

#### Neighbourhood Dimension Table

In [34]:
# Turns out we can get all of the possible values by using the overview and the cleansed attribute together.
# Doing this for other data might pose some problems, but here it works without losing any data.
neighbourhood_df.count()
neighbourhood_df[["neighbourhood_overview", "neighbourhood_cleansed"]].drop_duplicates().count()

neighbourhood_overview    1274
neighbourhood_cleansed    1299
dtype: int64

#### Min Max Insights Dimension Table

In [40]:
# We have to use all of the columns for this one as well to not lose any data
minmax_insights_df.count()
minmax_insights_df[
    [
        "maximum_nights", 
        "minimum_nights",
        "minimum_minimum_nights", 
        "maximum_minimum_nights", 
        "minimum_maximum_nights",
        "maximum_maximum_nights",
        "minimum_nights_avg_ntm",
        "maximum_nights_avg_ntm"
    ]
].drop_duplicates().count()

maximum_nights            795
minimum_nights            795
minimum_minimum_nights    795
maximum_minimum_nights    795
minimum_maximum_nights    795
maximum_maximum_nights    795
minimum_nights_avg_ntm    795
maximum_nights_avg_ntm    795
dtype: int64

#### Availability Info Dimension Table

In [50]:
# Again forced to use all columns for identifying
print(availibility_df.count())
availibility_df[
    [
        "has_availability",
        "availability_30",
        "availability_60",
        "availability_90",
        "availability_365",
    ]
].drop_duplicates().count()

avail_id            1336
has_availability    1332
availability_30     1336
availability_60     1336
availability_90     1336
availability_365    1336
dtype: int64


has_availability    1332
availability_30     1336
availability_60     1336
availability_90     1336
availability_365    1336
dtype: int64

#### Summary
* Host Dimension Table
    - This one is pretty easy: the `host_id` itself.
    - Host Qualifications and Diagnostics (HQAD) and Host Listings Diagnostics Dimension Tables
        * We're forced to use all the columns together as a unique identifier. This might seem a bit useless, but it is the best approach for understanding how data warehousing works, and it avoids losing data. We also remove some redundant data, so this takes us through the normal forms.

* Property Dimension Table
    - This one's not so bad. We use `longitude` and `latitude` as unique identifiers, keeping in mind that the table has 2649 rows but only 2581 distinct coordinate pairs.

* Reviews Diagnostics Dimension Table
    - Unfortunately, we have to use all of the columns as unique identifiers here as well. This might affect run time later, especially with an even bigger dataset.

* Scrapings Dimension Table
    - We can use the `last_scraped` and `source` columns as unique identifiers, since the `scrape_id` is the same for all rows.

* Neighbourhood Dimension Table
    - We'll use the `neighbourhood_overview` and `neighbourhood_cleansed` columns as unique identifiers. This works here without losing data, but it cannot be generalized to all data.

* Min Max Insights Dimension Table
    - We are again forced to use all the columns as a unique identifier.

* Host's Availability Info Dimension Table
    - Again, we are forced to use all the columns as a unique identifier.
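
The identifiers chosen above are what a later join would match on to bring each dimension's ID back to the listings. The following is a hedged sketch of that step for the availability table, on invented data (`listings_demo`, `avail_dim`, and `fact_demo` are hypothetical names); the notebook section shown here does not perform this join itself.

```python
import pandas as pd

# Invented stand-in for the listings table
listings_demo = pd.DataFrame(
    {
        "id": [10, 11, 12],
        "has_availability": ["t", "t", None],
        "availability_30": [5, 5, 0],
    }
)

# The columns that together identify a row of the dimension
natural_key = ["has_availability", "availability_30"]

# Build the dimension the same way as above: de-duplicate, then assign an ID
avail_dim = listings_demo[natural_key].drop_duplicates().reset_index(drop=True)
avail_dim.insert(0, "avail_id", avail_dim.index)

# Left join on the identifying columns; validate raises an error if the
# dimension side contains the same key combination more than once
fact_demo = listings_demo.merge(
    avail_dim, on=natural_key, how="left", validate="many_to_one"
)
fact_demo = fact_demo[["id", "avail_id"]]
print(fact_demo)
```

One design point worth knowing: pandas matches missing key values against each other during a merge, which differs from the usual SQL join behaviour, so rows with a missing `has_availability` still receive an ID here.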