# Exploratory Data Analysis for Predicting Short-term Rental Price in New York City


# 1. Problem

Assume you are working for a property management company that rents rooms and properties for short periods on various platforms. The client needs you to analyze which factors influence a property's price, based on similar properties.

Before we dive into complex modeling, we need to understand our data. To do that, we conduct exploratory data analysis (EDA). In this process we analyze two aspects of the data: the distribution of each variable and the correlations between variables.

EDA also checks data types, missing values, columns with high cardinality, and so on, using statistics and data visualization.
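The checks listed above can be sketched by hand with plain pandas before reaching for an automated tool. The snippet below is only an illustration: the `toy` dataframe and its values are made up, and only the column names are borrowed from the listing data.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "name": ["Cozy room", "Sunny loft", None, "Quiet studio"],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Private room"],
    "price": [80, 200, 65, 120],
    "last_review": ["2019-05-21", None, "2018-11-03", "2019-06-30"],
})

# One row per column: data type, number of missing values,
# and number of distinct non-missing values (cardinality).
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "unique": toy.nunique(),
})
print(summary)
```

A column whose `unique` count is close to the number of rows (such as `name` here) is a high-cardinality column.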

# 2. Import Libraries

In [3]:
import wandb
import pandas as pd
import pandas_profiling

# 3. Load data with wandb

The first step in our analysis is to load the data. Before doing that, we need to connect our project to Weights & Biases (wandb).

The data is retrieved from an artifact stored in wandb.

In [4]:
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)


# 4. EDA Using Pandas Profiling

We will use pandas profiling to automate the EDA. There are two steps: create a report from the dataframe, then display the report using the to_widgets method.

In [5]:
import pandas_profiling

profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()

Based on the resulting report, we will try to briefly dissect the data we have. The discussion will be divided into several parts:

- Overview: 2.6% of our data is missing (in the last_review and reviews_per_month columns). Every row is unique (no duplicates).
- Variables: The price and number_of_reviews columns contain outliers that skew their distributions. The name column has a nearly uniform distribution, which is common for columns whose values are unique or nearly unique. Also, the last_review column is stored as a string but should be a datetime.
- Correlation: calculated_host_listings_count, number_of_reviews, and reviews_per_month have weak negative Spearman correlations with price. This indicates that a higher-priced property tends to have fewer monthly reviews, fewer total reviews, and a lower calculated host listings count.
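The Spearman correlations that the report shows can also be computed directly with pandas. The following is a minimal sketch on made-up numbers (the `toy` dataframe is hypothetical, so the magnitudes here say nothing about the real data); it only shows how a negative rank correlation with price is obtained.

```python
import pandas as pd

# Hypothetical toy data: prices rise while review counts mostly fall.
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 350],
    "number_of_reviews": [40, 35, 12, 20, 3],
    "reviews_per_month": [2.5, 2.0, 0.8, 1.1, 0.2],
})

# Spearman correlation works on ranks, so it is less sensitive to
# outliers in price than the Pearson correlation.
corr_with_price = toy.corr(method="spearman")["price"].drop("price")
print(corr_with_price)
```

A negative value means that listings ranked higher on price tend to rank lower on the other variable.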

Based on this, we need to clean the data by removing outliers and correcting data formats. To remove outliers in the price column, we drop the rows whose price falls outside a range we have set (based on discussions with stakeholders). Missing values will be addressed at the data preprocessing stage.
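To see why a price range helps, it can be useful to compare the skewness of the price column before and after filtering. The sketch below uses a made-up `prices` series with a single extreme value; the bounds 10 and 350 are the ones used in this notebook, everything else is an assumption for illustration.

```python
import pandas as pd

# Hypothetical prices with one extreme outlier.
prices = pd.Series([60, 75, 80, 95, 110, 120, 150, 5000], name="price")

min_price, max_price = 10, 350
in_range = prices.between(min_price, max_price)

skew_before = prices.skew()            # dominated by the outlier
skew_after = prices[in_range].skew()   # outlier removed
n_dropped = int((~in_range).sum())

print(f"dropped {n_dropped} row(s); skew {skew_before:.2f} -> {skew_after:.2f}")
```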

In [6]:
# Drop outliers: keep only listings priced within [min_price, max_price]
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review from string to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
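Two details of this cleaning step are easy to miss: between includes both bounds by default, and pd.to_datetime turns missing values into NaT rather than failing. The sketch below checks both on a hypothetical `toy` dataframe (values are made up).

```python
import pandas as pd

# Hypothetical toy data with prices at, inside, and outside the bounds,
# and one missing review date.
toy = pd.DataFrame({
    "price": [5, 10, 120, 350, 900],
    "last_review": ["2019-01-15", None, "2019-06-30", "2018-12-01", "2019-03-09"],
})

min_price, max_price = 10, 350
# between() is inclusive on both ends by default, so 10 and 350 are kept.
kept = toy[toy["price"].between(min_price, max_price)].copy()
# Missing dates become NaT; the column dtype becomes datetime.
kept["last_review"] = pd.to_datetime(kept["last_review"])
print(kept)
```

Because the missing dates survive as NaT, they can still be handled later at the preprocessing stage.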

Check the results again after processing.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46428 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              46428 non-null  int64         
 1   name                            46413 non-null  object        
 2   host_id                         46428 non-null  int64         
 3   host_name                       46407 non-null  object        
 4   neighbourhood_group             46428 non-null  object        
 5   neighbourhood                   46428 non-null  object        
 6   latitude                        46428 non-null  float64       
 7   longitude                       46428 non-null  float64       
 8   room_type                       46428 non-null  object        
 9   price                           46428 non-null  int64         
 10  minimum_nights                  46428 non-null  int64         
 11  nu

Based on these results, the last_review column has been successfully converted to datetime format. The number of rows has also been reduced (from 20000 to 19001).

End the EDA process and save this version on wandb.

In [9]:
run.finish()
