# EDA on the initial dataset
In this notebook we perform a simple exploratory data analysis (EDA) of the raw dataset using the pandas_profiling library.

## 1. Data Import

In [7]:
import wandb
import pandas as pd
import os

os.environ["WANDB_NOTEBOOK_NAME"] = "./src/eda/EDA"

# start a wandb run so that the EDA is tracked
run = wandb.init(project='nyc_airbnb', group='eda', save_code=True)

# fetch the raw data artifact from wandb and get its local file path
local_path = wandb.use_artifact('sample.csv:latest').file()

df = pd.read_csv(local_path)

## 2. Perform EDA

In [9]:
import pandas_profiling

profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()

## 3. Basic data cleaning
Here we apply some basic cleaning to address two issues noticed in the profiling report:
* remove price outliers by keeping only rows whose price lies between 10 and 350 (inclusive)
* convert the last_review column to datetime format

In [10]:
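# keep only rows whose price is within [min_price, max_price] (inclusive)
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# parse the last_review strings into datetime values
df['last_review'] = pd.to_datetime(df['last_review'])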
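
The same two cleaning steps can be tried on a small, made-up table to see their effect without downloading the artifact. This is only a sketch: the helper name `clean_listings` and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd


def clean_listings(df, min_price=10, max_price=350):
    """Hypothetical helper: filter price outliers and parse last_review."""
    # between() is inclusive on both ends by default
    in_range = df["price"].between(min_price, max_price)
    cleaned = df[in_range].copy()
    # missing dates become NaT instead of raising an error
    cleaned["last_review"] = pd.to_datetime(cleaned["last_review"])
    return cleaned


# Toy data (invented): two price outliers and one missing review date
toy = pd.DataFrame({
    "price": [5, 10, 120, 350, 9999],
    "last_review": ["2019-05-21", None, "2018-11-03", "2019-07-01", "2019-01-15"],
})

cleaned_toy = clean_listings(toy)
print(cleaned_toy)
```

Rows priced exactly at the bounds are kept, and the original `toy` frame is left unchanged because the filtered result is copied before the date column is modified.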

## 4. Check cleaning results

In [11]:
profile_new = pandas_profiling.ProfileReport(df)
profile_new.to_widgets()

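Besides inspecting the new profile visually, the two cleaning goals can also be verified with a couple of direct checks. The sketch below assumes a hypothetical helper `check_cleaned` and uses invented example data; in the notebook it would be called on the cleaned `df`.

```python
import pandas as pd


def check_cleaned(df, min_price=10, max_price=350):
    """Hypothetical helper: report whether the cleaning goals hold for df."""
    return {
        # every remaining price must lie within the inclusive bounds
        "price_in_range": bool(df["price"].between(min_price, max_price).all()),
        # last_review must have a datetime dtype
        "last_review_is_datetime": bool(
            pd.api.types.is_datetime64_any_dtype(df["last_review"])
        ),
    }


# Invented example that satisfies both goals
example = pd.DataFrame({
    "price": [10, 200, 350],
    "last_review": pd.to_datetime(["2019-05-21", None, "2019-07-01"]),
})

checks = check_cleaned(example)
print(checks)
```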

In [12]:
# shut down wandb run
run.finish()
