# 1.0 Airbnb regression problem (Part I)



## 1.1 Dataset description



We'll be looking at Airbnb listings. The **data** comes from a `listings.csv` file in which each row is one listing, described by attributes such as **neighbourhood**, **room type**, **number of bedrooms**, **availability**, **review scores**, and more. The **target column**, or what we want to predict, is the listing **price**.

You can download the data from the [project's GitHub repository](https://raw.githubusercontent.com/Kaioh95/mlops-airbnb/main/data/listings.csv).

Let's take the following steps:

1. Load Libraries
2. Fetch Data, including EDA
3. Pre-processing
4. Data Segregation

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 1.2 Load libraries

In [120]:
import wandb
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
import tempfile
import os

## 1.3 Get data & Exploratory Data Analysis (EDA)

### 1.3.1 Create the raw_data artifact

In [121]:
# columns used 
columns = ["neighbourhood_cleansed","property_type","room_type",
           "accommodates","bathrooms_text","bedrooms","beds","amenities",
           "price","minimum_nights","maximum_nights","minimum_minimum_nights",
           "maximum_minimum_nights","minimum_maximum_nights","maximum_maximum_nights",
           "minimum_nights_avg_ntm","maximum_nights_avg_ntm","has_availability",
           "availability_30","availability_60","availability_90",
           "availability_365","number_of_reviews",
           "number_of_reviews_ltm","number_of_reviews_l30d",
           "review_scores_rating","review_scores_accuracy",
           "review_scores_cleanliness","review_scores_checkin","review_scores_communication",
           "review_scores_location","review_scores_value",
           "instant_bookable","calculated_host_listings_count",
           "calculated_host_listings_count_entire_homes",
           "calculated_host_listings_count_private_rooms",
           "calculated_host_listings_count_shared_rooms","reviews_per_month"]
# importing the dataset
data = pd.read_csv("https://raw.githubusercontent.com/Kaioh95/mlops-airbnb/main/data/listings.csv")
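
Since `columns` is typed by hand, it can be worth checking that every name really exists in the loaded frame before it is used for selection. The following is only a minimal sketch: `missing_columns` and the toy frame are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import pandas as pd


def missing_columns(frame, wanted):
    """Return the names in `wanted` that are absent from `frame`, in order."""
    present = set(frame.columns)
    return [name for name in wanted if name not in present]


# Toy frame standing in for the listings data (hypothetical values).
toy = pd.DataFrame({"price": ["$350.00"], "beds": [2.0], "id": [1]})
absent = missing_columns(toy, ["price", "beds", "room_type"])
print(absent)
```

In the notebook, a call such as `missing_columns(data, columns)` would be expected to return an empty list if all selected names are present.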

In [122]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24549 entries, 0 to 24548
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            24549 non-null  int64  
 1   listing_url                                   24549 non-null  object 
 2   scrape_id                                     24549 non-null  int64  
 3   last_scraped                                  24549 non-null  object 
 4   name                                          24528 non-null  object 
 5   description                                   23336 non-null  object 
 6   neighborhood_overview                         13212 non-null  object 
 7   picture_url                                   24549 non-null  object 
 8   host_id                                       24549 non-null  int64  
 9   host_url                                      24549 non-null 

In [123]:
data.to_csv("raw_data.csv",index=False)

In [124]:
# Login to Weights & Biases
wandb.login(relogin=True)

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:  ········································
wandb: Appending key for api.wandb.ai to your netrc file: /home/kaio/.netrc

True

In [125]:
# Upload raw_data.csv to W&B and store it as an artifact
!wandb artifact put \
      --name airbnb_eda/raw_data.csv \
      --type raw_data \
      --description "The raw data from airbnb" raw_data.csv

wandb: Uploading file raw_data.csv to: "mlops-kaio/airbnb_eda/raw_data.csv:latest" (raw_data)
wandb: Currently logged in as: mlops-kaio (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.20 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.6
wandb: Syncing run colorful-snowball-1
wandb:  View project at https://wandb.ai/mlops-kaio/airbnb_eda
wandb:  View run at https://wandb.ai/mlops-kaio/airbnb_eda/runs/3pm92xua
wandb: Run data is saved locally in /home/kaio/Documentos/9-UltSemestre/mlops/mlops-airbnb/eda/download-init-eda/wandb/run-20220701_163123-3pm92xua
wandb: Run `wandb offline` to turn off syncing.

Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("mlops-kaio/airbnb_eda/raw_da

### 1.3.2 Download raw_data artifact from Wandb

In [126]:
# save_code=True tracks changes to the notebook and syncs them with W&B
run = wandb.init(project="airbnb_eda", job_type="preprocessing", save_code=True)

wandb: wandb version 0.12.20 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


In [127]:
# download the latest version of the raw_data.csv artifact
artifact = run.use_artifact("airbnb_eda/raw_data.csv:latest")

# create a dataframe from the artifact
df = pd.read_csv(artifact.file())

### 1.3.3 Pandas Profiling

In [128]:
#ProfileReport(df, title="Pandas Profiling Report", explorative=True)

In [129]:
df = df[columns]

In [130]:
df = df.dropna()
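
Dropping every row with any missing value reduces the data from 24,549 to 14,013 rows, so it may help to see which columns are responsible before calling `dropna()`. The sketch below uses a small made-up frame (the values are hypothetical) to show the idea: count missing values per column, then compare row counts before and after.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are invented for illustration only.
toy = pd.DataFrame({
    "price": ["$350.00", "$296.00", None, "$172.00"],
    "bedrooms": [1.0, np.nan, 2.0, 1.0],
    "review_scores_rating": [4.8, np.nan, np.nan, 5.0],
})

# Missing values per column, worst offenders first.
missing_per_column = toy.isnull().sum().sort_values(ascending=False)

# Rows that survive a blanket dropna().
rows_kept = len(toy.dropna())

print(missing_per_column)
print(f"rows kept after dropna: {rows_kept} of {len(toy)}")
```

Run against the real `df`, the same two lines would indicate whether a handful of sparse columns account for most of the dropped rows.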

In [131]:
# Strip the currency symbol and the thousands separator, then convert to float.
# regex=False treats "$" as a literal character instead of a regex anchor.
df['price'] = df['price'].str.replace("$", "", regex=False)
df['price'] = df['price'].str.replace(",", "", regex=False)
df['price'] = df['price'].astype("float")

#df['bathrooms_text'] = df['bathrooms_text'].str.replace("baths", "")
#df['bathrooms_text'] = df['bathrooms_text'].str.replace("bath", "")
#df['bathrooms_text'] = df['bathrooms_text'].astype("float")


In [132]:
df['price']

0         350.0
1         296.0
2         387.0
3         172.0
4         260.0
          ...  
24286    1197.0
24292     775.0
24350     256.0
24499     221.0
24545     460.0
Name: price, Length: 14013, dtype: float64
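
The price conversion above can be wrapped in a small helper so it is easy to test on its own. This is a sketch under the assumption that raw prices look like `"$1,197.00"`; `parse_price` is a hypothetical name, not something defined in the notebook. Passing `regex=False` keeps `"$"` from being interpreted as the end-of-string anchor.

```python
import pandas as pd


def parse_price(raw):
    """Convert strings such as "$1,197.00" to floats."""
    cleaned = (
        raw.str.replace("$", "", regex=False)   # drop the currency symbol
           .str.replace(",", "", regex=False)   # drop the thousands separator
    )
    return cleaned.astype("float")


# Toy input in the assumed raw format.
toy_prices = pd.Series(["$350.00", "$1,197.00", "$46.00"])
parsed = parse_price(toy_prices)
print(parsed)
```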

In [133]:
df.isnull().sum()

neighbourhood_cleansed                          0
property_type                                   0
room_type                                       0
accommodates                                    0
bathrooms_text                                  0
bedrooms                                        0
beds                                            0
amenities                                       0
price                                           0
minimum_nights                                  0
maximum_nights                                  0
minimum_minimum_nights                          0
maximum_minimum_nights                          0
minimum_maximum_nights                          0
maximum_maximum_nights                          0
minimum_nights_avg_ntm                          0
maximum_nights_avg_ntm                          0
has_availability                                0
availability_30                                 0
availability_60                                 0


In [134]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14013 entries, 0 to 24545
Data columns (total 38 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   neighbourhood_cleansed                        14013 non-null  object 
 1   property_type                                 14013 non-null  object 
 2   room_type                                     14013 non-null  object 
 3   accommodates                                  14013 non-null  int64  
 4   bathrooms_text                                14013 non-null  object 
 5   bedrooms                                      14013 non-null  float64
 6   beds                                          14013 non-null  float64
 7   amenities                                     14013 non-null  object 
 8   price                                         14013 non-null  float64
 9   minimum_nights                                14013 non-null 
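
The `df.info()` output lists `bathrooms_text` as an `object` column, and the conversion attempt in the cleaning cell is commented out. One possible way to get a number out of it is sketched below. It assumes values such as `"1 bath"`, `"1.5 baths"`, `"2 shared baths"` or `"Half-bath"`; that format is an assumption, and `bathrooms_to_float` is a hypothetical helper.

```python
import re

import pandas as pd


def bathrooms_to_float(text):
    """Extract the bathroom count from free text (assumed format)."""
    lowered = text.lower()
    match = re.search(r"\d+(?:\.\d+)?", lowered)
    if match:
        return float(match.group())
    if "half" in lowered:
        # Entries like "Half-bath" carry no digit at all.
        return 0.5
    return float("nan")


# Toy values in the assumed format.
toy = pd.Series(["1 bath", "1.5 baths", "2 shared baths", "Half-bath"])
bathrooms = toy.map(bathrooms_to_float)
print(bathrooms)
```

A plain `str.replace("bath", "")` followed by `astype("float")` would fail on entries like `"2 shared baths"`, which may be why the original attempt was left commented out.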

In [135]:
df.describe()

|  | accommodates | bedrooms | beds | price | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights | maximum_maximum_nights | ... | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 | ... | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 | 14013.0 |
| mean | 4.127025 | 1.629772 | 2.626846 | 797.088204 | 3.936273 | 623.047813 | 3.796689 | 4.928067 | 737.240063 | 771.809177 | ... | 4.680081 | 4.869478 | 4.850628 | 4.851694 | 4.650208 | 6.522872 | 5.718547 | 0.702919 | 0.077357 | 0.692056 |
| std | 2.369938 | 0.99317 | 2.13687 | 2925.722307 | 16.894587 | 693.480278 | 16.714244 | 18.7662 | 684.154999 | 674.586231 | ... | 0.551408 | 0.397769 | 0.419797 | 0.382591 | 0.505536 | 17.203389 | 16.95976 | 1.559191 | 0.577565 | 0.926727 |
| min | 1.0 | 1.0 | 1.0 | 33.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 2.0 | 1.0 | 1.0 | 236.0 | 2.0 | 60.0 | 2.0 | 2.0 | 90.0 | 90.0 | ... | 4.6 | 4.9 | 4.87 | 4.86 | 4.54 | 1.0 | 1.0 | 0.0 | 0.0 | 0.1 |
| 50% | 4.0 | 1.0 | 2.0 | 431.0 | 2.0 | 900.0 | 2.0 | 3.0 | 1125.0 | 1125.0 | ... | 4.87 | 5.0 | 5.0 | 5.0 | 4.78 | 2.0 | 1.0 | 0.0 | 0.0 | 0.33 |
| 75% | 5.0 | 2.0 | 3.0 | 800.0 | 3.0 | 1125.0 | 3.0 | 5.0 | 1125.0 | 1125.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 4.0 | 2.0 | 1.0 | 0.0 | 0.96 |
| max | 16.0 | 20.0 | 50.0 | 129080.0 | 1000.0 | 47036.0 | 1000.0 | 1000.0 | 47036.0 | 47036.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 176.0 | 172.0 | 13.0 | 9.0 | 21.79 |
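
In the summary above, the 75th percentile of `price` is 800 while the maximum is 129,080, which suggests a long right tail. Two common ways to look at such a column are a log transform and an interquartile-range (IQR) fence. The sketch below applies both to five values taken from the `min`, quartile and `max` rows of the table; it is an illustration of the technique, not a step the notebook performs.

```python
import numpy as np
import pandas as pd

# The min, 25%, 50%, 75% and max of price from the describe() output.
toy_price = pd.Series([33.0, 236.0, 431.0, 800.0, 129080.0])

# log1p compresses the tail so the extreme value no longer dominates.
log_price = np.log1p(toy_price)

# Classic IQR rule: flag anything above Q3 + 1.5 * IQR.
q1, q3 = toy_price.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = toy_price[toy_price > upper_fence]

print(upper_fence)
print(outliers)
```

Whether such listings should be dropped, capped, or kept is a modelling decision that this sketch does not settle.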


In [136]:
df.to_csv("clean_data.csv", index=False)

In [137]:
# !wandb artifact put \
#       --name airbnb_eda/clean_data.csv \
#       --type clean_data \
#       --description "The clean data from airbnb" clean_data.csv

        
artifact = wandb.Artifact(
    name="clean_data.csv",
    type="clean_data",
    description="The clean data from airbnb",
)
artifact.add_file("clean_data.csv")

run.log_artifact(artifact)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f7ca3d0f400>

In [138]:
run.finish()


### 1.3.4 Manual EDA

In [139]:
# save_code=True tracks changes to the notebook and syncs them with W&B
run = wandb.init(project="airbnb_eda", job_type="eda_and_data_segration", save_code=True)

wandb: wandb version 0.12.20 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


In [140]:
artifact = run.use_artifact("airbnb_eda/clean_data.csv:latest")

# create a dataframe from the artifact
df_clean = pd.read_csv(artifact.file())

In [141]:
df_clean.nunique()

neighbourhood_cleansed                            131
property_type                                      59
room_type                                           4
accommodates                                       16
bathrooms_text                                     45
bedrooms                                           16
beds                                               33
amenities                                       13545
price                                            1884
minimum_nights                                     47
maximum_nights                                    167
minimum_minimum_nights                             47
maximum_minimum_nights                             53
minimum_maximum_nights                            144
maximum_maximum_nights                            143
minimum_nights_avg_ntm                            173
maximum_nights_avg_ntm                            506
has_availability                                    2
availability_30             
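
With 13,545 distinct values, `amenities` is close to unique per listing, so it is unlikely to be useful as a plain categorical column. One option is to reduce it to a count. The sketch below assumes each value is a JSON array encoded as a string (for example `'["Wifi", "Kitchen"]'`); that format is an assumption about the data, not something shown in the notebook output.

```python
import json

import pandas as pd

# Toy values in the assumed JSON-array-as-string format.
toy = pd.Series([
    '["Wifi", "Kitchen", "Air conditioning"]',
    '["Wifi"]',
    '[]',
])

# Number of amenities per listing.
amenities_count = toy.map(lambda raw: len(json.loads(raw)))
print(amenities_count)
```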

In [142]:
# how does price vary with room_type?
pd.crosstab(df_clean.price,df_clean.room_type,margins=True)

| price (rows) / room_type (columns) | Entire home/apt | Hotel room | Private room | Shared room | All |
|---|---|---|---|---|---|
| 33.0 | 0 | 0 | 1 | 0 | 1 |
| 40.0 | 0 | 0 | 1 | 0 | 1 |
| 45.0 | 0 | 0 | 0 | 3 | 3 |
| 46.0 | 0 | 0 | 1 | 0 | 1 |
| 50.0 | 0 | 0 | 1 | 1 | 2 |
| ... | ... | ... | ... | ... | ... |
| 92620.0 | 1 | 0 | 0 | 0 | 1 |
| 96438.0 | 1 | 0 | 0 | 0 | 1 |
| 110688.0 | 1 | 0 | 0 | 0 | 1 |
| 129080.0 | 3 | 0 | 0 | 0 | 3 |


In [143]:
# same crosstab, normalized to fractions of all listings
pd.crosstab(df_clean.price,df_clean.room_type,margins=True,normalize=True)

| price (rows) / room_type (columns) | Entire home/apt | Hotel room | Private room | Shared room | All |
|---|---|---|---|---|---|
| 33.0 | 0.000000 | 0.000000 | 0.000071 | 0.000000 | 0.000071 |
| 40.0 | 0.000000 | 0.000000 | 0.000071 | 0.000000 | 0.000071 |
| 45.0 | 0.000000 | 0.000000 | 0.000000 | 0.000214 | 0.000214 |
| 46.0 | 0.000000 | 0.000000 | 0.000071 | 0.000000 | 0.000071 |
| 50.0 | 0.000000 | 0.000000 | 0.000071 | 0.000071 | 0.000143 |
| ... | ... | ... | ... | ... | ... |
| 92620.0 | 0.000071 | 0.000000 | 0.000000 | 0.000000 | 0.000071 |
| 96438.0 | 0.000071 | 0.000000 | 0.000000 | 0.000000 | 0.000071 |
| 110688.0 | 0.000071 | 0.000000 | 0.000000 | 0.000000 | 0.000071 |
| 129080.0 | 0.000214 | 0.000000 | 0.000000 | 0.000000 | 0.000214 |
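
Because `price` has 1,884 distinct values, a crosstab with one row per price is hard to read. Binning the price first gives a much smaller table. The following sketch uses `pd.qcut` to split a toy price column into quartiles before cross-tabulating; the toy frame and the `price_quartile` column name are hypothetical.

```python
import pandas as pd

# Toy listings; prices and room types are invented for illustration.
toy = pd.DataFrame({
    "price": [33.0, 45.0, 50.0, 172.0, 260.0, 350.0, 775.0, 1197.0],
    "room_type": [
        "Private room", "Shared room", "Private room", "Private room",
        "Entire home/apt", "Entire home/apt", "Entire home/apt",
        "Entire home/apt",
    ],
})

# labels=False returns integer codes 0..3 (0 = cheapest quarter).
toy["price_quartile"] = pd.qcut(toy["price"], q=4, labels=False)

table = pd.crosstab(toy["price_quartile"], toy["room_type"], margins=True)
print(table)
```

On the real data the same call would use `df_clean` in place of `toy`, producing one row per quartile rather than one row per distinct price.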


In [144]:
# price vs bedrooms, normalized to fractions of all listings
pd.crosstab(df_clean.price,df_clean.bedrooms,margins=True,normalize=True)

| price (rows) / bedrooms (columns) | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 | 11.0 | 12.0 | 13.0 | 15.0 | 17.0 | 20.0 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33.0 | 0.000071 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000071 |
| 40.0 | 0.000071 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000071 |
| 45.0 | 0.000214 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000214 |
| 46.0 | 0.000071 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000071 |
| 50.0 | 0.000143 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000143 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 92620.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000071 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000071 |
| 96438.0 | 0.000000 | 0.000071 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000071 |
| 110688.0 | 0.000071 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000071 |
| 129080.0 | 0.000143 | 0.000000 | 0.000071 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000214 |
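
An alternative to cross-tabulating every price against `bedrooms` is to summarize the price within each bedroom count. The sketch below does this with `groupby` on a toy frame (values are invented); the median is used because it is less sensitive to extreme prices than the mean.

```python
import pandas as pd

# Toy listings; values are hypothetical.
toy = pd.DataFrame({
    "bedrooms": [1.0, 1.0, 2.0, 2.0, 3.0],
    "price": [100.0, 200.0, 300.0, 500.0, 900.0],
})

# One row per bedroom count: how many listings and their median price.
summary = toy.groupby("bedrooms")["price"].agg(["count", "median"])
print(summary)
```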


In [145]:
# %matplotlib inline

# sns.catplot(x="room_type", 
#             hue="bedrooms", 
#             col="price",
#             data=df_clean, kind="count",
#             height=4, aspect=.7)
# plt.show()

In [146]:
# g = sns.catplot(x="room_type", 
#                 hue="bedrooms", 
#                 col="price",
#                 data=df_clean, kind="count",
#                 height=4, aspect=.7)

# g.savefig("Price_RoomType_Bedrooms.png", dpi=100)

# run.log(
#         {
#             "Price vs RoomType vs Bedrooms": wandb.Image("Price_RoomType_Bedrooms.png")
#         }
#     )

## 1.4 Data segregation: train/test split

In [147]:
splits = {}
splits["train"], splits["test"] = train_test_split(df_clean,
                                                   test_size=0.30,
                                                   random_state=41)
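
Fixing `random_state` is what makes the split repeatable between runs. The following sketch, on a ten-row toy frame with made-up values, calls `train_test_split` twice with the same arguments as above and checks that both calls return the same rows and that train and test do not overlap.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten toy rows; values are invented for illustration.
toy = pd.DataFrame({"price": range(10), "beds": range(10, 20)})

# Two independent calls with identical arguments.
train_a, test_a = train_test_split(toy, test_size=0.30, random_state=41)
train_b, test_b = train_test_split(toy, test_size=0.30, random_state=41)

print(train_a.equals(train_b), test_a.equals(test_b))
print(len(train_a), len(test_a))
```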

In [148]:
# Save the artifacts. We use a temporary directory so we do not leave
# any trace behind

with tempfile.TemporaryDirectory() as tmp_dir:

    # Use split_df so the loop does not overwrite the earlier `df` variable
    for split, split_df in splits.items():

        # Build the artifact name from the name of the split
        artifact_name = f"data_{split}.csv"

        # Get the path on disk within the temp directory
        temp_path = os.path.join(tmp_dir, artifact_name)

        # Save then upload to W&B
        split_df.to_csv(temp_path,index=False)

        artifact = wandb.Artifact(
            name=artifact_name,
            type="clean_data",
            description=f"{split} split of dataset airbnb_eda/clean_data.csv:latest",
        )
        artifact.add_file(temp_path)

        run.log_artifact(artifact)

        # This waits for the artifact to be uploaded to W&B. If you
        # do not add this, the temp directory might be removed before
        # W&B had a chance to upload the datasets, and the upload
        # might fail
        artifact.wait()

### 1.4.1 Download the train and test artifacts

In [149]:
# download the latest version of artifacts data_test.csv and data_train.csv
artifact_train = run.use_artifact("airbnb_eda/data_train.csv:latest")
artifact_test = run.use_artifact("airbnb_eda/data_test.csv:latest")

# create a dataframe from each artifact
df_train = pd.read_csv(artifact_train.file())
df_test  = pd.read_csv(artifact_test.file())

In [150]:
print("Train: {}".format(df_train.shape))
print("Test: {}".format(df_test.shape))

Train: (9809, 38)
Test: (4204, 38)
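
The shapes above are consistent with a 30% test fraction applied to the 14,013 cleaned rows. As a quick arithmetic check, the sketch below assumes the test count is rounded up and the training set receives the remainder, which is, to the best of my understanding, how scikit-learn resolves a fractional `test_size` when `train_size` is not given.

```python
import math

n_rows = 14013          # rows in clean_data.csv, per df.info()
test_fraction = 0.30    # test_size passed to train_test_split

# Assumed rounding: test count rounded up, remainder goes to train.
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```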


In [151]:
run.finish()
