
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Udacity     :  Machine Learning DevOps Engineer (MLOps) Nano-degree
  Project     :  2 - Build an ML Pipeline for Short-term Rental Prices in NYC
  Step        :  Pipeline Execution
  Description :  EDA Operations
  Author      :  Rakan Yamani
  Date        :  15 June 2023
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


### EDA Prior Steps:

*  The original `main.py` script in the project folder already implements the download step, which downloads the starter dataset. 
    * Run the pipeline to get a sample of the data. The pipeline will also upload it to Weights & Biases:
    `mlflow run . -P steps=download`
    * A message like the following will be shown: `2021-03-12 15:44:39,840 Uploading sample.csv to Weights & Biases`, indicating that the data is being stored in W&B as the artifact named `sample.csv`.
* To execute the EDA step, run `mlflow run src/eda`. This will install Jupyter and all the dependencies for `pandas-profiling`, and open a Jupyter notebook instance. 


### EDA Operations:

#### 1. Fetch W&B artifact (sample.csv):
Within the notebook, fetch the artifact (`sample.csv`) from W&B and read it with pandas. 
Use `save_code=True` in the call to `wandb.init` so the notebook is uploaded and versioned by W&B.

In [None]:
import wandb
import pandas as pd
import seaborn as sns

run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("nyc_airbnb/sample.csv:latest").file()
df = pd.read_csv(local_path)


In [14]:
df['last_review'] = pd.to_datetime(df['last_review'])
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,9138664,Private Lg Room 15 min to Manhattan,47594947,Iris,Queens,Sunnyside,40.74271,-73.92493,Private room,74,2,6,2019-05-26,0.13,1,5
1,31444015,TIME SQUARE CHARMING ONE BED IN HELL'S KITCHEN...,8523790,Johlex,Manhattan,Hell's Kitchen,40.76682,-73.98878,Entire home/apt,170,3,0,NaT,,1,188
2,8741020,Voted #1 Location Quintessential 1BR W Village...,45854238,John,Manhattan,West Village,40.73631,-74.00611,Entire home/apt,245,3,51,2018-09-19,1.12,1,0
3,34602077,Spacious 1 bedroom apartment 15min from Manhattan,261055465,Regan,Queens,Astoria,40.76424,-73.92351,Entire home/apt,125,3,1,2019-05-24,0.65,1,13
4,23203149,Big beautiful bedroom in huge Bushwick apartment,143460,Megan,Brooklyn,Bushwick,40.69839,-73.92044,Private room,65,2,8,2019-06-23,0.52,2,8
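
The `NaT` in the second row above is what `pd.to_datetime` produces for a listing with no review date. The following is a minimal sketch of that behaviour on a hypothetical three-row frame (the `sample` frame and its values are made up for illustration and are not the project data):

```python
import pandas as pd

# Hypothetical three-row sample; the middle listing has no review date.
sample = pd.DataFrame({"last_review": ["2019-05-26", None, "2018-09-19"]})

# Parsing turns the date strings into datetime values and the gap into NaT.
sample["last_review"] = pd.to_datetime(sample["last_review"])

# NaT counts as missing, so isna() can be used to count listings without reviews.
n_missing = int(sample["last_review"].isna().sum())
print(sample["last_review"].dtype)
print(n_missing)
```

Because `NaT` is treated as a missing value, later steps that impute or drop missing data will also see these rows.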


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              20000 non-null  int64         
 1   name                            19993 non-null  object        
 2   host_id                         20000 non-null  int64         
 3   host_name                       19992 non-null  object        
 4   neighbourhood_group             20000 non-null  object        
 5   neighbourhood                   20000 non-null  object        
 6   latitude                        20000 non-null  float64       
 7   longitude                       20000 non-null  float64       
 8   room_type                       20000 non-null  object        
 9   price                           20000 non-null  int64         
 10  minimum_nights                  20000 non-null  int64         
 11  nu

#### 2. Create Analysis Profile:
2.1 Using `pandas-profiling`, create an analysis profile of the dataset. Look around and see what you can find:
* There are missing values in a few columns. 
* The `last_review` column holds dates but is stored in string format. 
* The `price` column has outliers. 
* There are some zeros and some very high prices. 
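
The same findings can also be checked directly with pandas, without the profile report. This is only a sketch on a hypothetical four-row frame (`toy` and its values are invented for illustration); on the real data the same calls would be applied to `df`:

```python
import pandas as pd

# Hypothetical mini-sample, not the project data.
toy = pd.DataFrame({
    "name": ["Cozy room", None, "Loft", "Studio"],
    "price": [74, 0, 9999, 125],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Listings priced at zero are suspicious for a rental.
n_zero_price = int((toy["price"] == 0).sum())

# min/max in the summary expose the extreme prices.
price_summary = toy["price"].describe()

print(missing_per_column)
print(n_zero_price)
print(price_summary[["min", "max"]])
```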

In [17]:
import pandas_profiling

profile = pandas_profiling.ProfileReport(df, title="Report generated by Pandas Profiling")
profile.to_widgets()


2.2 Fix some of the minor problems found in the data with the following code:

In [18]:
# Drop outliers
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()

# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
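
Note that `Series.between` includes both bounds by default, so listings priced at exactly `min_price` or `max_price` are kept. A small sketch on a hypothetical price series (the values are made up to sit on and around the bounds):

```python
import pandas as pd

# Hypothetical prices chosen to sit on and around the bounds.
prices = pd.Series([0, 10, 74, 350, 351, 9999], name="price")

min_price, max_price = 10, 350

# Boolean mask; both endpoints are inclusive by default.
keep = prices.between(min_price, max_price)

kept = prices[keep]
n_dropped = int((~keep).sum())

print(kept.tolist())
print(n_dropped)
```

Counting the rows outside the mask before filtering is a cheap way to confirm how much data the chosen bounds remove.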

In [19]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,9138664,Private Lg Room 15 min to Manhattan,47594947,Iris,Queens,Sunnyside,40.74271,-73.92493,Private room,74,2,6,2019-05-26,0.13,1,5
1,31444015,TIME SQUARE CHARMING ONE BED IN HELL'S KITCHEN...,8523790,Johlex,Manhattan,Hell's Kitchen,40.76682,-73.98878,Entire home/apt,170,3,0,NaT,,1,188
2,8741020,Voted #1 Location Quintessential 1BR W Village...,45854238,John,Manhattan,West Village,40.73631,-74.00611,Entire home/apt,245,3,51,2018-09-19,1.12,1,0
3,34602077,Spacious 1 bedroom apartment 15min from Manhattan,261055465,Regan,Queens,Astoria,40.76424,-73.92351,Entire home/apt,125,3,1,2019-05-24,0.65,1,13
4,23203149,Big beautiful bedroom in huge Bushwick apartment,143460,Megan,Brooklyn,Bushwick,40.69839,-73.92044,Private room,65,2,8,2019-06-23,0.52,2,8


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  nu

2.3 Terminate the W&B run so that it is marked as finished:


In [28]:
run.finish()