# Exploratory Data Analysis

In [4]:
import wandb
import pandas as pd

wandb.login()
run = wandb.init(
    project="nyc_airbnb",
    group="eda",
    save_code=True
)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

[34m[1mwandb[0m: Currently logged in as: [33meoinkeohane[0m ([33meoinkeohane-learning[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


After loading our artifact from W&B, we use `ydata_profiling` to generate a profile report of the dataset.

In [7]:
import ydata_profiling

profile = ydata_profiling.ProfileReport(df)
profile.to_notebook_iframe()


We notice:
- Missing values in several columns
- The column `last_review` is stored as a string instead of a datetime
- Outliers in the `price` column, including zeroes and extreme highs
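
These observations can also be confirmed directly with pandas rather than read off the report. The sketch below is a minimal illustration on a small made-up frame: the values are invented for the example, and only the column names `name`, `price` and `last_review` come from the dataset. In the notebook, the same three calls would be run on `df`.

```python
import pandas as pd

# Toy stand-in for the real sample; all values are invented for illustration.
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Big flat"],
    "price": [0, 120, 85, 10000],
    "last_review": ["2019-05-21", None, "2018-11-02", "2019-01-15"],
})

# 1. Count missing values per column
missing = toy.isna().sum()
print(missing)

# 2. Check whether the date column is actually a datetime dtype
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["last_review"])
print(is_datetime)

# 3. Summarize the price range to spot zeroes and extreme highs
price_summary = toy["price"].describe()
print(price_summary[["min", "max"]])
```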

We will apply two fixes: dropping the `price` outliers and converting `last_review` to a datetime. Missing values will be handled later in the inference pipeline so that the system can also manage them at prediction time.

In [8]:
# drop outliers: keep only rows with min_price <= price <= max_price
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()

# convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
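
Two details of this cell are easy to miss: `Series.between` includes both endpoints by default, and `pd.to_datetime` turns missing entries into `NaT` instead of raising. A minimal sketch on invented values (the frame `toy` is illustrative only, not the real sample):

```python
import pandas as pd

# Toy data, invented for illustration: prices just outside and on the bounds,
# and a date column with missing entries.
toy = pd.DataFrame({
    "price": [0, 10, 200, 350, 351],
    "last_review": ["2019-05-21", None, "2018-11-02", None, "2019-01-15"],
})

# between() is inclusive on both ends by default, so 10 and 350 are kept
mask = toy["price"].between(10, 350)
kept = toy[mask].copy()

# Missing strings become NaT; the column dtype becomes datetime64
kept["last_review"] = pd.to_datetime(kept["last_review"])
print(kept)
```

Because the missing dates survive as `NaT`, this step does not remove any rows; only the price filter does.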

We quickly verify our fixes:

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  number_

In [10]:
df['price'].max()

350

In [11]:
df['price'].min()

10
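
The same checks can be written as assertions so that they fail loudly if a later edit breaks the cleaning step. This is a sketch only: `check_cleaned` is a hypothetical helper, not part of the notebook or of any library, and it is exercised here on an invented frame.

```python
import pandas as pd


def check_cleaned(frame, min_price, max_price):
    """Raise AssertionError if the two cleaning fixes are not in place."""
    # Every price must lie within the inclusive [min_price, max_price] range
    assert frame["price"].between(min_price, max_price).all(), "price out of range"
    # last_review must have been converted to a datetime dtype
    assert pd.api.types.is_datetime64_any_dtype(frame["last_review"]), \
        "last_review is not datetime"
    return True


# Toy cleaned frame, invented for illustration
toy = pd.DataFrame({
    "price": [10, 120, 350],
    "last_review": pd.to_datetime(["2019-05-21", None, "2018-11-02"]),
})
result = check_cleaned(toy, min_price=10, max_price=350)
print(result)
```

In the notebook, the equivalent call would be `check_cleaned(df, min_price, max_price)`.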

With our fixes in place and their changes verified, we have completed our EDA. We can now end the run and close our notebook.

In [12]:
run.finish()