![NYC Skyline](nyc.jpg)

Welcome to New York City, one of the most-visited cities in the world. There are many Airbnb listings in New York City to meet the high demand for temporary lodging for travelers, with stays ranging from a few nights to many months. In this project, we will take a closer look at the New York Airbnb market by combining data from multiple file types: `.csv`, `.tsv`, and `.xlsx`.

We will work with three files containing data on 2019 Airbnb listings described as follows:

**data/airbnb_price.csv**
This is a CSV file containing data on Airbnb listing prices and locations.
- **`listing_id`**: unique identifier of listing
- **`price`**: nightly listing price in USD
- **`nbhood_full`**: name of borough and neighborhood where listing is located

**data/airbnb_room_type.xlsx**
This is an Excel file containing data on Airbnb listing descriptions and room types.
- **`listing_id`**: unique identifier of listing
- **`description`**: listing description
- **`room_type`**: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments

**data/airbnb_last_review.tsv**
This is a TSV file containing data on Airbnb host names and review dates.
- **`listing_id`**: unique identifier of listing
- **`host_name`**: name of listing host
- **`last_review`**: date when the listing was last reviewed
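
Because all three files share the `listing_id` column, they can be joined into a single table when an analysis needs fields from more than one source. The following is a minimal sketch on made-up rows: the host names, prices, and descriptions are invented for illustration, and the Excel sheet is stood in for by an in-memory DataFrame so that the example runs without the data files.

```python
import io
import pandas as pd

# Tiny invented samples that mimic the column layout described above.
prices_csv = io.StringIO(
    'listing_id,price,nbhood_full\n'
    '1,100 dollars,"Manhattan, Midtown"\n'
    '2,80 dollars,"Brooklyn, Clinton Hill"\n'
)
reviews_tsv = io.StringIO(
    "listing_id\thost_name\tlast_review\n"
    "1\tAlex\tMay 21 2019\n"
    "2\tSam\tJune 09 2019\n"
)

prices = pd.read_csv(prices_csv)
reviews = pd.read_csv(reviews_tsv, sep="\t")
# Stand-in for the Excel sheet (hypothetical rows).
room_types = pd.DataFrame({
    "listing_id": [1, 2],
    "description": ["Sunny studio", "Quiet room"],
    "room_type": ["Entire home/apt", "private room"],
})

# Outer joins keep listings that are missing from one of the sources.
listings = (
    prices
    .merge(room_types, on="listing_id", how="outer")
    .merge(reviews, on="listing_id", how="outer")
)
print(listings.columns.tolist())
```

An outer join is used here so that a listing absent from one file would show up with missing values rather than silently disappear; an inner join would be the stricter alternative.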

## The Scenario

As consultants working for a real estate start-up, we have collected Airbnb listing data from various sources to investigate the short-term rental market in New York. We'll analyze this data to provide insights on private rooms to the real estate company.

### First Research Question - *What are the dates of the earliest and most recent reviews?*

In [1]:
# Import necessary packages
import pandas as pd

# consolidate the data sources for easy access
data_sources = {
    "prices": {"file_path": "./data/airbnb_price.csv",
               "file_type": "text/CSV",
               "delimiter": ","},
    "room_types": {"file_path": "./data/airbnb_room_type.xlsx",
                   "file_type": "binary/Excel",
                   "delimiter": None},  # delimiter not applicable for binary files
    "reviews": {"file_path": "./data/airbnb_last_review.tsv",
                "file_type": "text/TSV",  # tab-separated values
                "delimiter": "\t"}
}

df_reviews = pd.read_csv(data_sources["reviews"]["file_path"], sep=data_sources["reviews"]["delimiter"])
print(f"Total number of reviews: {df_reviews.shape[0]}\n")
print(df_reviews.info(), "\n")
df_reviews.head()

Total number of reviews: 25209

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25209 entries, 0 to 25208
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   listing_id   25209 non-null  int64 
 1   host_name    25201 non-null  object
 2   last_review  25209 non-null  object
dtypes: int64(1), object(2)
memory usage: 591.0+ KB
None 



|   | listing_id | host_name | last_review |
|---|---|---|---|
| 0 | 2595 | Jennifer | May 21 2019 |
| 1 | 3831 | LisaRoxanne | July 05 2019 |
| 2 | 5099 | Chris | June 22 2019 |
| 3 | 5178 | Shunichi | June 24 2019 |
| 4 | 5238 | Ben | June 09 2019 |


#### First task - convert the `last_review` column to datetime

+ check how dates are encoded
+ convert to datetime so we can find earliest and most recent reviews

In [2]:
# how are dates formatted?
all_dates = df_reviews["last_review"].unique()
all_dates[:10]

array(['May 21 2019', 'July 05 2019', 'June 22 2019', 'June 24 2019',
       'June 09 2019', 'June 23 2019', 'June 29 2019', 'June 28 2019',
       'July 01 2019', 'January 01 2019'], dtype=object)

Dates appear to be encoded consistently. If some are not, `pd.to_datetime` should raise an error when parsing, and we can dig further into the offending values at that point.
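
If we wanted to locate inconsistent dates rather than just hit an error, one option is to parse with an explicit format and `errors="coerce"`, then inspect whatever came back as `NaT`. This is a sketch on invented values; the format string `"%B %d %Y"` is an assumption based on the sample shown above (full month name, zero-padded day, four-digit year).

```python
import pandas as pd

# Invented sample: two well-formed dates and two that break the pattern.
sample = pd.Series(["May 21 2019", "July 05 2019", "2019/06/22", "not a date"])

# Values that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(sample, format="%B %d %Y", errors="coerce")

# The rows that failed to parse, shown in their original form.
bad = sample[parsed.isna()]
print(bad.tolist())
```

Passing an explicit `format` also makes the parsing rule visible to the reader instead of relying on inference.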

In [3]:
# should be able to parse to datetime without issues
df_reviews["last_review"] = pd.to_datetime(df_reviews["last_review"])
df_reviews.head()

|   | listing_id | host_name | last_review |
|---|---|---|---|
| 0 | 2595 | Jennifer | 2019-05-21 |
| 1 | 3831 | LisaRoxanne | 2019-07-05 |
| 2 | 5099 | Chris | 2019-06-22 |
| 3 | 5178 | Shunichi | 2019-06-24 |
| 4 | 5238 | Ben | 2019-06-09 |


In [4]:
earliest_review = df_reviews["last_review"].min()
most_recent_review = df_reviews["last_review"].max()
print(f"earliest_review: {earliest_review}, most recent review: {most_recent_review}")

earliest_review: 2019-01-01 00:00:00, most recent review: 2019-07-09 00:00:00


### Second Research Question - *How many of the listings are private rooms?*

+ Required data resides in the file `data/airbnb_room_type.xlsx`
+ This spreadsheet has a single sheet: `airbnb_room_type`

In [5]:
excel_data = pd.ExcelFile(data_sources["room_types"]["file_path"])
print(excel_data.sheet_names, "\n")  # single sheet: 'airbnb_room_type'
df_room_types = excel_data.parse('airbnb_room_type')  # could have used 0 (sheet index), but the name is clearer
print(df_room_types.shape, "\n")
df_room_types.head()

['airbnb_room_type'] 

(25209, 3) 



|   | listing_id | description | room_type |
|---|---|---|---|
| 0 | 2595 | Skylit Midtown Castle | Entire home/apt |
| 1 | 3831 | Cozy Entire Floor of Brownstone | Entire home/apt |
| 2 | 5099 | Large Cozy 1 BR Apartment In Midtown East | Entire home/apt |
| 3 | 5178 | Large Furnished Room Near B'way | private room |
| 4 | 5238 | Cute & Cozy Lower East Side 1 bdrm | Entire home/apt |


In [6]:
# check the distinct room type labels
print(df_room_types["room_type"].unique())

['Entire home/apt' 'private room' 'Private room' 'entire home/apt'
 'PRIVATE ROOM' 'shared room' 'ENTIRE HOME/APT' 'Shared room'
 'SHARED ROOM']


Looks like we have three differently capitalized versions of *private room*, and the same is true of the other two types (*entire home/apt* and *shared room*). We can handle all of them by simply converting everything to lower case.

In [7]:
df_room_types["room_type"] = df_room_types["room_type"].str.lower()
# check
print(df_room_types["room_type"].unique())

['entire home/apt' 'private room' 'shared room']


In [8]:
room_type_counts = df_room_types["room_type"].value_counts()
print(room_type_counts)

room_type
entire home/apt    13266
private room       11356
shared room          587
Name: count, dtype: int64


In [9]:
private_room_count = room_type_counts['private room']
print(f"There are {private_room_count} private room listings")

There are 11356 private room listings
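
As a cross-check, the same count can be computed with a boolean mask on a normalized copy of the labels, which leaves the original column untouched. The sketch below uses invented labels; the extra `str.strip()` is a defensive assumption, since nothing in the data shown above indicates stray whitespace.

```python
import pandas as pd

# Invented labels with mixed capitalization and one padded value.
room_type = pd.Series([
    "Entire home/apt", "private room", "Private room",
    "PRIVATE ROOM", "shared room", " private room ",
])

# Normalize a copy: trim whitespace, then lower-case.
normalized = room_type.str.strip().str.lower()

# True for every row that is a private room, regardless of original spelling.
is_private = normalized.eq("private room")
print(int(is_private.sum()))
```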


### Third Research Question - *What is the average listing price?*

+ Required data resides in the file `data/airbnb_price.csv`
+ `price` column originally comes in as string (object, as shown below)
  + Values are formatted as `ddd dollars`, where `ddd` is the amount, so the ` dollars` suffix needs to be removed before converting to a number
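
One detail worth knowing when removing a unit suffix: `str.strip` treats its argument as a set of characters to trim from both ends, not as a literal suffix. It gives the expected result on values like `225 dollars` only because digits are not in that set. The sketch below, on invented values, contrasts it with `str.removesuffix` (available in pandas 1.4 and later), which removes the exact trailing text.

```python
import pandas as pd

# Invented prices in the "ddd dollars" format.
raw = pd.Series(["225 dollars", "89 dollars", "150 dollars"])

# removesuffix drops the literal trailing text and nothing else.
price = raw.str.removesuffix(" dollars").astype(float)
print(price.tolist())

# A contrived string showing how strip trims any of the characters
# ' ', 'd', 'o', 'l', 'a', 'r', 's' from both ends.
tricky = pd.Series(["dollars and sense"])
stripped = tricky.str.strip(" dollars")
unchanged = tricky.str.removesuffix(" dollars")
print(stripped.iloc[0], "|", unchanged.iloc[0])
```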

In [10]:
df_prices = pd.read_csv(data_sources["prices"]["file_path"], sep=data_sources["prices"]["delimiter"])
print(df_prices.info(), "\n")
df_prices.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25209 entries, 0 to 25208
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   listing_id   25209 non-null  int64 
 1   price        25209 non-null  object
 2   nbhood_full  25209 non-null  object
dtypes: int64(1), object(2)
memory usage: 591.0+ KB
None 



|   | listing_id | price | nbhood_full |
|---|---|---|---|
| 0 | 2595 | 225 dollars | Manhattan, Midtown |
| 1 | 3831 | 89 dollars | Brooklyn, Clinton Hill |
| 2 | 5099 | 200 dollars | Manhattan, Murray Hill |
| 3 | 5178 | 79 dollars | Manhattan, Hell's Kitchen |
| 4 | 5238 | 150 dollars | Manhattan, Chinatown |


In [11]:
# remove the " dollars" suffix from the price column and convert to float
# (str.strip would treat " dollars" as a set of characters, not a literal suffix)
df_prices['price'] = df_prices['price'].str.removesuffix(" dollars")
df_prices['price'] = df_prices['price'].astype(float)
df_prices.head()

|   | listing_id | price | nbhood_full |
|---|---|---|---|
| 0 | 2595 | 225.0 | Manhattan, Midtown |
| 1 | 3831 | 89.0 | Brooklyn, Clinton Hill |
| 2 | 5099 | 200.0 | Manhattan, Murray Hill |
| 3 | 5178 | 79.0 | Manhattan, Hell's Kitchen |
| 4 | 5238 | 150.0 | Manhattan, Chinatown |


In [12]:
average_listing_price = round(df_prices['price'].mean(), 2)
print(f"The average listing price is: ${average_listing_price}")

The average listing price is: $141.78


In [13]:
review_dates = pd.DataFrame({"first_reviewed": [earliest_review],
                              "last_reviewed": [most_recent_review],
                              "nb_private_rooms": [private_room_count],
                              "avg_price": [average_listing_price]})
review_dates

|   | first_reviewed | last_reviewed | nb_private_rooms | avg_price |
|---|---|---|---|---|
| 0 | 2019-01-01 | 2019-07-09 | 11356 | 141.78 |
