# Exploratory Data Analysis (EDA)

In this section we conduct a preliminary analysis to better understand our data set.

The analysis is intended to give the course leaders a clearer understanding of the data.

It should also help us anticipate shortcomings, outliers, or systematic errors that our group will need to account for in later analysis.

### There will be 4 main sections of this EDA:
1. Precise descriptions of the data fields and their units of measurement
2. Developing Summary Statistics, Distributions and Outliers
3. Appropriate plots to communicate the distribution of key fields
4. Appropriate plots to illustrate the relationship between key fields

#### 0. Load libraries and read in data

In [3]:
# Load libraries
import pandas as pd
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import scipy.stats
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from src import drop_column_using_vif_, show_vif_values



Remove unnecessary columns.

In [13]:
# Open yvr_listing_data.csv in the data folder
listings_df = pd.read_csv(os.path.join('data', 'yvr_listing_data.csv'))

# Manually exclude columns that are purely textual descriptions or apparently unrelated to legality (including coordinates).
# Also exclude redundant variables, e.g. 'neighbourhood', which is dropped in favour of 'neighbourhood_cleansed'.

excluded_columns = ['listing_url','scrape_id', 'last_scraped', 'source', 
                       'name','description', 'neighborhood_overview', 'picture_url', 
                       'host_id', 'host_url', 'host_name', 'host_since', 
                       'host_location', 'host_about', 'host_thumbnail_url', 
                       'host_picture_url', 'latitude', 'longitude', 'calendar_updated', 
                       'calendar_last_scraped', 'amenities', 'bathrooms_text',
                       'first_review','last_review','neighbourhood','property_type','host_neighbourhood',
                       'maximum_minimum_nights','maximum_nights','minimum_minimum_nights',
                       'maximum_maximum_nights','minimum_maximum_nights','minimum_nights_avg_ntm','maximum_nights_avg_ntm']

remained_columns = [col for col in listings_df if col not in excluded_columns]
remained_columns = list(set(remained_columns))

# Keep only the remaining (non-excluded) columns

listings_df = listings_df[remained_columns]

# Drop completely empty columns
listings_df = listings_df.dropna(axis=1, how='all')

# Drop listings with 'minimum_nights > 30' based on the regulation in Vancouver
listings_df = listings_df[listings_df['minimum_nights']<=30]
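
One side effect of the column selection above is that passing the kept names through `set` discards the original column order, which is why the columns appear shuffled in later output. A minimal sketch of an order-preserving alternative follows. It uses a small made-up frame, and `toy_df` and `toy_excluded` are illustrative names, not part of the project data.

```python
import pandas as pd

# Toy stand-in for listings_df; the column names are illustrative only.
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "listing_url": ["a", "b", "c"],
    "minimum_nights": [2, 30, 90],
    "empty_col": [None, None, None],
})
toy_excluded = ["listing_url", "not_a_column"]

# errors="ignore" skips names that are absent instead of raising a KeyError,
# and drop() keeps the remaining columns in their original order.
toy_df = toy_df.drop(columns=toy_excluded, errors="ignore")

# Drop columns in which every value is missing.
toy_df = toy_df.dropna(axis=1, how="all")

# Keep only rows with minimum_nights of 30 or fewer.
toy_df = toy_df[toy_df["minimum_nights"] <= 30]

print(list(toy_df.columns))
print(len(toy_df))
```

Whether a stable column order matters depends on how the downstream steps consume the frame. It mainly makes outputs reproducible between runs.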

In [12]:
listings_df['room_type'].value_counts()

room_type
Entire home/apt    5134
Private room       1195
Shared room          21
Hotel room            3
Name: count, dtype: int64

In [11]:
listings_df.columns

Index(['reviews_per_month', 'review_scores_value', 'minimum_nights',
       'host_has_profile_pic', 'room_type', 'host_acceptance_rate',
       'review_scores_location', 'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'number_of_reviews_l30d', 'host_listings_count',
       'calculated_host_listings_count', 'id', 'availability_90',
       'review_scores_communication', 'host_response_time', 'beds',
       'review_scores_accuracy', 'host_response_rate', 'review_scores_rating',
       'calculated_host_listings_count_shared_rooms', 'price',
       'availability_30', 'has_availability', 'neighbourhood_cleansed',
       'bedrooms', 'host_total_listings_count', 'host_identity_verified',
       'instant_bookable', 'review_scores_cleanliness',
       'number_of_reviews_ltm', 'host_verifications', 'accommodates',
       'host_is_superhost', 'availability_365', 'availability_60',
       'number_of_reviews', 'review_scores_checkin', 'leg

#### Finding "Legal" Listings

Using regex, we scan through the listings' license fields and determine which ones look valid.

In [14]:
%%capture --no-stdout
r"""
Create a new column titled "legal_listing" that holds a boolean describing whether or not the listing has a valid license.
The column is True if the listing has a valid license or does not require one, and False otherwise.
To compute the value of the column, we use the following logic:

If the "license" column contains a number matching the regex pattern r'.*?(\d{2}[-\s]?\d{3}[-\s]?\d{3}).*?'
OR the "minimum_nights" column has a value equal to or greater than 30,
THEN "legal_listing" is True. ELSE "legal_listing" is False.

Note:
The regex pattern matches numbers of the form ##-###### or ##-###-###, with
a space, a dash, or nothing between the digit groups. The number can be surrounded by any number of characters.
TODO: Verify this is the correct pattern for the license numbers and find any other ways of verifying legitimate license numbers.
"""

# Some license values have the form 'dd-ddd-ddd', so the pattern was widened to accept them
#regex_pattern = re.compile(r'.*?(\d{2}[-\s]?\d{6}).*?')
regex_pattern = re.compile(r'.*?(\d{2}[-\s]?\d{3}[-\s]?\d{3}).*?')

# Create the legal_listing column using the logic described above
# (na=False treats a missing license as "no match")
listings_df['legal_listing'] = listings_df['license'].str.contains(regex_pattern, na=False) | (listings_df['minimum_nights'] >= 30)

# Create new dataframe storing values after normalization or preprocessing
listings_df_cleaned = pd.DataFrame()
listings_df_cleaned['id'] = listings_df['id']
listings_df_cleaned['legal_listing'] = listings_df['legal_listing']

# Drop the 'license' column for better processing
listings_df.drop('license',axis=1, inplace=True)

# Print count of valid and invalid licenses
print(listings_df['legal_listing'].value_counts())

legal_listing
True     4533
False    1820
Name: count, dtype: int64
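
To make the behaviour of the license pattern concrete, the sketch below applies the same regular expression to a handful of made-up strings. The sample values are invented for illustration and are not taken from the data set.

```python
import re

# Same pattern as in the cell above.
license_pattern = re.compile(r'.*?(\d{2}[-\s]?\d{3}[-\s]?\d{3}).*?')

# Made-up license strings, for illustration only.
samples = [
    "23-123456",
    "23-123-456",
    "Licence 23 123 456 (renewed)",
    "Exempt",
    "12345",
]

for text in samples:
    match = license_pattern.search(text)
    print(f"{text!r}: {match.group(1) if match else None}")
```

Note that the pattern is a format check only: any run of eight digits, with optional separators after the second and fifth digit, is accepted, so it cannot confirm that a license number was actually issued.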


## Dealing with Data Types

- Converting variables to the correct data types while also cleaning unnecessary characters.
- Accounting for categorical data with one-hot encoding.

In [15]:
#print(listings_df.columns)
listings_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6353 entries, 0 to 6694
Data columns (total 39 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   reviews_per_month                             5429 non-null   float64
 1   review_scores_value                           5419 non-null   float64
 2   minimum_nights                                6353 non-null   int64  
 3   host_has_profile_pic                          6353 non-null   object 
 4   room_type                                     6353 non-null   object 
 5   host_acceptance_rate                          5575 non-null   object 
 6   review_scores_location                        5419 non-null   float64
 7   calculated_host_listings_count_entire_homes   6353 non-null   int64  
 8   calculated_host_listings_count_private_rooms  6353 non-null   int64  
 9   number_of_reviews_l30d                        6353 non-null   int64 

**Drop columns with low variance**

In [17]:
# Select all columns with dtype float or int
# This avoids penalising one-hot columns, whose variance is naturally low
listings_df_num = listings_df.select_dtypes(include=['float64','int64'])

# Calculate the variance
variances = listings_df_num.var()

In [18]:
# Set the lower and upper variance thresholds
low_threshold = 0.1
high_threshold = 1000000
# Drop columns whose variance falls outside the thresholds
low_variance_cols = variances[variances < low_threshold].index
high_variance_cols = variances[variances > high_threshold].index
listings_df.drop(low_variance_cols, axis=1, inplace=True)
listings_df.drop(high_variance_cols, axis=1, inplace=True)
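
As a small illustration of the variance filter, the sketch below applies the same two thresholds to a toy frame. The column names and values are invented for the example.

```python
import pandas as pd

# Toy data: one nearly constant column, one ordinary column, one on a very large scale.
toy = pd.DataFrame({
    "nearly_constant": [1.0, 1.0, 1.0, 1.1],
    "moderate": [1.0, 2.0, 3.0, 4.0],
    "huge_scale": [0.0, 5000.0, 10000.0, 20000.0],
})

low_threshold = 0.1
high_threshold = 1_000_000

# DataFrame.var() returns the sample variance (ddof=1) of each column.
variances = toy.var()

# Keep columns whose variance lies within the two thresholds.
keep = variances[(variances >= low_threshold) & (variances <= high_threshold)].index
filtered = toy[keep]

print(variances)
print(list(filtered.columns))
```

Raw variance depends on the units of each column, so fixed thresholds like these treat a price in dollars and a score on a 0 to 5 scale very differently. Scaling the columns first is one way to make the cut-offs comparable.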

In [19]:
"""
# Deal with the calculated_ columns
calculated_columns = ['calculated_host_listings_count_private_rooms',
                      'calculated_host_listings_count_shared_rooms',
                      'calculated_host_listings_count_entire_homes',
                      'calculated_host_listings_count']
for name in calculated_columns:
    counts = listings_df[name].value_counts()
    count_0 = counts.get(0, 0)
    count_1 = counts.get(1, 0)

    print(f"amount of 0 in column '{name}': {count_0}")
    print(f"amount of 1 in column '{name}': {count_1}")

# Because the proportion of 0's in 'calculated_host_listings_count_shared_rooms' is above 90%
# So just drop the column
listings_df.drop('calculated_host_listings_count_shared_rooms', axis=1, inplace=True)
"""


### Dealing with Object Columns

In [20]:
# Print names of object columns
print(listings_df.select_dtypes(include=['object']).columns)

Index(['host_has_profile_pic', 'room_type', 'host_acceptance_rate',
       'host_response_time', 'host_response_rate', 'price', 'has_availability',
       'neighbourhood_cleansed', 'host_identity_verified', 'instant_bookable',
       'host_verifications', 'host_is_superhost'],
      dtype='object')


In [23]:
# Convert 'price' to a float variable
# (regex=False so that '$' is treated as a literal character, not as a regex anchor)
if listings_df['price'].dtype == 'object':
    listings_df['price'] = listings_df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)

# Convert 'host_acceptance_rate' to a float variable
if listings_df['host_acceptance_rate'].dtype == 'object':
    listings_df['host_acceptance_rate'] = listings_df['host_acceptance_rate'].str.replace('%', '').astype(float)

# Convert 'host_response_time' to a float variable
# The evenly spaced scores (0, 0.25, 0.5, 0.75, 1) are somewhat arbitrary; they are chosen to make the variable easy to use in a regression model.
# The ordering itself is meaningful: faster responses get higher scores, and missing values get 0.
if listings_df['host_response_time'].dtype == 'object':
    listings_df['host_response_time'] = listings_df['host_response_time'].map({
        'within an hour': 1, 'within a few hours': 0.75, 'within a day': 0.5, 'a few days or more': 0.25}).fillna(0)

# Convert 'host_response_rate' to a float variable
if listings_df['host_response_rate'].dtype == 'object':
    listings_df['host_response_rate'] = listings_df['host_response_rate'].str.replace('%', '').astype(float)

# Convert 'host_verifications' to a float variable
if listings_df['host_verifications'].dtype == 'object':
    listings_df['host_verifications'] = listings_df['host_verifications'].map({
        "['email', 'phone', 'photographer', 'work_email']": 1, "['email', 'phone', 'work_email']": 0.75, 
        "['email', 'phone']": 0.5, "['phone', 'work_email']":0.5, 
        "['phone']": 0.25, "['email']": 0.25}).fillna(0)


# Convert the 't'/'f' flag columns to 0/1 indicator variables
flag_columns = ['host_is_superhost', 'host_has_profile_pic', 'has_availability',
                'instant_bookable', 'host_identity_verified']
for flag_col in flag_columns:
    if listings_df[flag_col].dtype == 'object':
        listings_df[flag_col] = listings_df[flag_col].map({'t': 1, 'f': 0})
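
The conversions above can be tried out on a toy frame. The sketch below is only an illustration with invented values. It strips the currency and percent symbols and maps the `'t'`/`'f'` flags to 0/1, and it shows that missing values stay missing after the conversion.

```python
import pandas as pd

# Made-up values in the same raw formats as the listing data.
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", None],
    "instant_bookable": ["f", "t", "t"],
    "host_response_rate": ["100%", "87%", None],
    "price": ["$1,250.00", "$99.00", "$45.50"],
})

# Map 't'/'f' flags to 1/0; values not in the mapping (e.g. None) become NaN.
for col in ["host_is_superhost", "instant_bookable"]:
    toy[col] = toy[col].map({"t": 1, "f": 0})

# Strip the trailing percent sign, then convert to float.
toy["host_response_rate"] = toy["host_response_rate"].str.rstrip("%").astype(float)

# Inside a character class, '$' is a literal dollar sign, so this removes '$' and ','.
toy["price"] = toy["price"].str.replace(r"[$,]", "", regex=True).astype(float)

print(toy)
print(toy.dtypes)
```

A flag column that contains missing values ends up as a float column after `map`, because NaN cannot be stored in a plain integer column.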

In [22]:
# Check the object columns again
object_columns = listings_df.select_dtypes(include='object')

object_columns_name = list(object_columns.columns)
object_columns_name

['room_type', 'neighbourhood_cleansed']

In [24]:
listings_df['host_verifications'].unique()

array([0.5 , 0.25, 0.75, 1.  ])

### One-hot encode categorical columns

In [25]:
print("Dropped categories:")
for colname in object_columns_name:
    # convert the column to 'category' dtype
    listings_df[colname] = listings_df[colname].astype('category')

    # Since we will be dropping the first category of each column,
    # let's print it out so we know what we are dropping
    print(colname, ':', listings_df[colname].cat.categories[0])

    # apply one-hot encoding (drop_first removes one category per column to avoid perfect multicollinearity)
    one_hot_encoded = pd.get_dummies(listings_df[colname], prefix=colname, drop_first=True)
    # join new columns back to DataFrame
    listings_df = listings_df.join(one_hot_encoded)

Dropped categories:
room_type : Entire home/apt
neighbourhood_cleansed : Arbutus Ridge
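
To show what `drop_first=True` does, here is a minimal sketch on a toy `room_type` column with invented rows. The first category in sorted order becomes the baseline and is represented by all-zero indicator columns.

```python
import pandas as pd

# Toy column with three room types; the rows are invented.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"]
})
toy["room_type"] = toy["room_type"].astype("category")

# Categories are sorted, so the first one is the category that gets dropped.
baseline = toy["room_type"].cat.categories[0]

# One indicator column per remaining category.
dummies = pd.get_dummies(toy["room_type"], prefix="room_type", drop_first=True)

print("Baseline category:", baseline)
print(dummies)
```

A row belonging to the baseline category has no indicator set, so coefficients for the remaining indicators are read relative to that baseline.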


In [26]:
# Print types of all columns
listings_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6353 entries, 0 to 6694
Data columns (total 61 columns):
 #   Column                                           Non-Null Count  Dtype   
---  ------                                           --------------  -----   
 0   reviews_per_month                                5429 non-null   float64 
 1   review_scores_value                              5419 non-null   float64 
 2   minimum_nights                                   6353 non-null   int64   
 3   host_has_profile_pic                             6353 non-null   int64   
 4   room_type                                        6353 non-null   category
 5   host_acceptance_rate                             5575 non-null   float64 
 6   calculated_host_listings_count_entire_homes      6353 non-null   int64   
 7   calculated_host_listings_count_private_rooms     6353 non-null   int64   
 8   number_of_reviews_l30d                           6353 non-null   int64   
 9   host_listings_count     

## Preparing for VIF Analysis

In [27]:
listings_df_VIF = listings_df.select_dtypes(include=['bool','float64','int64'])
listings_df_VIF = listings_df_VIF.astype('float64')

**Using VIF to filter correlated variables**

In [28]:
# Prepare the data for the VIF calculation

# Drop all rows containing NAs or infs in listings_df_VIF
listings_df_VIF.replace([np.inf, -np.inf], np.nan, inplace=True)
listings_df_VIF.dropna(inplace=True)

## VIF Filtering

In [29]:
%%capture --no-stdout

listings_df_VIF_new = drop_column_using_vif_(listings_df_VIF.drop('legal_listing', axis=1), thresh=2)

Dropping: calculated_host_listings_count (VIF: 28166.657801674846)
Dropping: host_listings_count (VIF: 58.940057139676455)
Dropping: availability_60 (VIF: 15.442869964457207)
Dropping: neighbourhood_cleansed_Downtown (VIF: 15.204804174421792)
Dropping: calculated_host_listings_count_entire_homes (VIF: 7.387763613248074)
Dropping: review_scores_rating (VIF: 5.824788487134804)
Dropping: accommodates (VIF: 5.722489934595499)
Dropping: number_of_reviews_ltm (VIF: 3.7127987559084463)
Dropping: bedrooms (VIF: 3.213550936569566)
Dropping: review_scores_accuracy (VIF: 2.9440856290313926)
Dropping: reviews_per_month (VIF: 2.910677131433851)
Dropping: review_scores_communication (VIF: 2.476927248705724)
Dropping: availability_90 (VIF: 2.4295637671524397)
Dropping: review_scores_value (VIF: 2.0853368160573367)


In [30]:
# After VIF filtering we have 'listings_df_VIF_new'
print(f"There are {listings_df_VIF_new.shape[1]} variables after VIF operation.")

# Add legal_listing back to the dataframe
listings_df_VIF_new['legal_listing'] = listings_df_VIF['legal_listing']


# Save the new dataframe to a CSV file
listings_df_VIF_new.to_csv(os.path.join('data','yvr_listing_data_cleaned.csv'),index=False)

There are 44 variables after VIF operation.


#### 1. Precise descriptions of the data fields and their units of measurement: 
Gaining an understanding of the data, its fields and units of measurement

In [34]:
# View first 10 rows
print(listings_df.head(10))

    reviews_per_month  review_scores_value  minimum_nights  \
0                1.68                 4.80               2   
1                2.96                 4.65               1   
2                0.66                 4.89              30   
3                0.22                 4.71               3   
4                1.63                 4.73              30   
5                0.11                 4.29               3   
7                1.53                 4.66               5   
8                 NaN                  NaN              30   
9                0.78                 4.49              30   
10               3.30                 4.95               1   

    host_has_profile_pic        room_type  host_acceptance_rate  \
0                      1  Entire home/apt                 100.0   
1                      1  Entire home/apt                  98.0   
2                      1  Entire home/apt                  96.0   
3                      1  Entire home/apt        

In [36]:
# Print data fields and their types
listings_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6353 entries, 0 to 6694
Data columns (total 61 columns):
 #   Column                                           Non-Null Count  Dtype   
---  ------                                           --------------  -----   
 0   reviews_per_month                                5429 non-null   float64 
 1   review_scores_value                              5419 non-null   float64 
 2   minimum_nights                                   6353 non-null   int64   
 3   host_has_profile_pic                             6353 non-null   int64   
 4   room_type                                        6353 non-null   category
 5   host_acceptance_rate                             5575 non-null   float64 
 6   calculated_host_listings_count_entire_homes      6353 non-null   int64   
 7   calculated_host_listings_count_private_rooms     6353 non-null   int64   
 8   number_of_reviews_l30d                           6353 non-null   int64   
 9   host_listings_count     

In [40]:
# Generate descriptive statistics on numerical columns
print(listings_df.describe())

In [38]:
# Combining this preliminary analysis together

# Summary DataFrame

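# 1. Inspecting Data
# Note: DataFrame.info() prints its report directly and returns None,
# so the 'Info' entry below holds None rather than the report text.
data_inspection = pd.DataFrame({
    'First Few Rows': [listings_df.head()],
    'Info': [listings_df.info()],
    'Descriptive Statistics': [listings_df.describe()]
})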

# 2. Display the combined DataFrame
for key, value in data_inspection.items():
    print(f"\n{key}:\n{value}\n{'='*50}")

data_inspection

<class 'pandas.core.frame.DataFrame'>
Index: 6353 entries, 0 to 6694
Data columns (total 61 columns):
 #   Column                                           Non-Null Count  Dtype   
---  ------                                           --------------  -----   
 0   reviews_per_month                                5429 non-null   float64 
 1   review_scores_value                              5419 non-null   float64 
 2   minimum_nights                                   6353 non-null   int64   
 3   host_has_profile_pic                             6353 non-null   int64   
 4   room_type                                        6353 non-null   category
 5   host_acceptance_rate                             5575 non-null   float64 
 6   calculated_host_listings_count_entire_homes      6353 non-null   int64   
 7   calculated_host_listings_count_private_rooms     6353 non-null   int64   
 8   number_of_reviews_l30d                           6353 non-null   int64   
 9   host_listings_count     

Unnamed: 0,First Few Rows,Info,Descriptive Statistics
0,reviews_per_month review_scores_value min...,,reviews_per_month review_scores_value ...
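
`DataFrame.info()` prints its report and returns `None`, which is why the `Info` column in the summary above is empty. If the report text is wanted as a value, one option is to pass a text buffer through the `buf` parameter. The sketch below does this on a small invented frame and collects the three pieces in a plain dictionary. The names `toy` and `summary` are illustrative.

```python
import io

import pandas as pd

# Small invented frame standing in for listings_df.
toy = pd.DataFrame({
    "price": [100.0, 250.0, None],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
})

# info() writes to the buffer instead of stdout when buf is given.
buffer = io.StringIO()
toy.info(buf=buffer)
info_text = buffer.getvalue()

# Collect the three inspection outputs as text.
summary = {
    "First Few Rows": toy.head().to_string(),
    "Info": info_text,
    "Descriptive Statistics": toy.describe().to_string(),
}

for key, value in summary.items():
    print(f"\n{key}:\n{value}\n{'=' * 50}")
```

A dictionary of strings is used here instead of a one-row DataFrame because nesting whole frames inside cells makes the combined object hard to read when displayed.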
