# 2.0 Airbnb Regression Problem (Part II)



## 2.1 Data Checks

In this section, we run a set of data checks to keep bad data out of our pipeline. We use both deterministic tests and non-deterministic (statistical) tests.

Part I of this notebook is available in this repository under the "eda-datasegregation" folder.

Let's take the following steps:

1. Load libraries
2. Fetch the data, using a fixture to make it global
3. Deterministic tests
4. Non-deterministic tests

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 2.2 Load libraries

In [1]:
import wandb
import pandas as pd
import numpy as np
import pytest
import tempfile
import scipy.stats
import os

## 2.3 Logging into Wandb and Getting Our Data

In [2]:
# Login to Weights & Biases
wandb.login(relogin=True)

wandb: You can find your API key in your browser here: https://wandb.ai/authorize

wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:  ········································

wandb: Appending key for api.wandb.ai to your netrc file: /home/kaio/.netrc


True

To have the artifact used below available, first run the steps in Part I of this notebook.

In [4]:
run = wandb.init(project="airbnb_eda", job_type="datachecks", save_code=True)

# download the latest version of the clean_data.csv artifact
artifact = run.use_artifact("airbnb_eda/clean_data.csv:latest")

# create a dataframe from the artifact
clean_data_df = pd.read_csv(artifact.file())



## 2.4 Doing some Tests

### 2.4.1 Deterministic Tests

In [20]:
def test_column_presence_and_type():
    """
    Check that every required column is present and has the expected dtype.
    """
    
    required_columns = {
        "neighbourhood_cleansed": pd.api.types.is_object_dtype,
        "property_type": pd.api.types.is_object_dtype,
        "room_type": pd.api.types.is_object_dtype,
        "accommodates": pd.api.types.is_int64_dtype,
        "bathrooms_text": pd.api.types.is_object_dtype,
        "bedrooms": pd.api.types.is_float_dtype,
        "beds": pd.api.types.is_float_dtype,
        "has_availability": pd.api.types.is_object_dtype,
        "instant_bookable": pd.api.types.is_object_dtype,
        "price": pd.api.types.is_float_dtype,
        "minimum_nights_avg_ntm": pd.api.types.is_float_dtype,
        "maximum_nights_avg_ntm": pd.api.types.is_float_dtype,
        "review_scores_rating": pd.api.types.is_float_dtype,
        "review_scores_accuracy": pd.api.types.is_float_dtype,
        "review_scores_cleanliness": pd.api.types.is_float_dtype,
        "review_scores_checkin": pd.api.types.is_float_dtype,
        "review_scores_communication": pd.api.types.is_float_dtype,
        "review_scores_location": pd.api.types.is_float_dtype,
        "review_scores_value": pd.api.types.is_float_dtype,
        "reviews_per_month": pd.api.types.is_float_dtype,
        "minimum_minimum_nights": pd.api.types.is_int64_dtype,
        "maximum_minimum_nights": pd.api.types.is_int64_dtype,
        "minimum_maximum_nights": pd.api.types.is_int64_dtype,
        "maximum_maximum_nights": pd.api.types.is_int64_dtype,
        "availability_30": pd.api.types.is_int64_dtype,
        "availability_60": pd.api.types.is_int64_dtype,
        "availability_90": pd.api.types.is_int64_dtype,
        "availability_365": pd.api.types.is_int64_dtype,
        "number_of_reviews": pd.api.types.is_int64_dtype,
        "number_of_reviews_ltm": pd.api.types.is_int64_dtype,
        "number_of_reviews_l30d": pd.api.types.is_int64_dtype,
        "calculated_host_listings_count": pd.api.types.is_int64_dtype,
        "calculated_host_listings_count_entire_homes": pd.api.types.is_int64_dtype,
        "calculated_host_listings_count_private_rooms": pd.api.types.is_int64_dtype,
        "calculated_host_listings_count_shared_rooms": pd.api.types.is_int64_dtype,
        "minimum_nights": pd.api.types.is_int64_dtype,
        "maximum_nights": pd.api.types.is_int64_dtype,
    }

    # Check column presence
    assert set(clean_data_df.columns.values).issuperset(set(required_columns.keys()))

    for col_name, format_verification_funct in required_columns.items():

        assert format_verification_funct(clean_data_df[col_name]), f"Column {col_name} failed test {format_verification_funct}"

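
The dtype checks are stricter than they may look: a single missing value turns an integer column into a float column when pandas reads it, so an integer check also guards against unexpected gaps. A small sketch on toy data (the values are made up, not taken from the Airbnb dataset, and it uses the more general `is_integer_dtype` rather than `is_int64_dtype`):

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# toy data (hypothetical values, not taken from the Airbnb dataset)
complete = pd.DataFrame({"accommodates": [2, 4, 1], "bedrooms": [1.0, 2.0, 3.0]})
with_gap = pd.DataFrame({"accommodates": [2, np.nan, 1], "bedrooms": [1.0, np.nan, 3.0]})

for name, frame in [("complete", complete), ("with_gap", with_gap)]:
    print(
        name,
        "| accommodates is integer:", is_integer_dtype(frame["accommodates"]),
        "| bedrooms is float:", is_float_dtype(frame["bedrooms"]),
    )
```

In the `with_gap` frame the `accommodates` column no longer passes the integer check, while `bedrooms` stays float either way.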
In [21]:
def test_column_ranges():
    """
    Check that selected numerical columns fall within their expected ranges.
    """
    ranges = {
        "accommodates": (1, 16),
        "bedrooms": (1, 20),
        "beds": (1, 50),
        "review_scores_cleanliness": (0, 5),
    }

    for col_name, (minimum, maximum) in ranges.items():

        assert clean_data_df[col_name].dropna().between(minimum, maximum).all(), (
            f"Column {col_name} failed the test. Should be between {minimum} and {maximum}, "
            f"instead min={clean_data_df[col_name].min()} and max={clean_data_df[col_name].max()}"
        )
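
To see why the range test calls `dropna()` before `between()`, here is a small sketch on made-up values. `between` is inclusive on both ends by default and evaluates to `False` for missing values, so without `dropna()` every NaN would count as a range violation:

```python
import numpy as np
import pandas as pd

# hypothetical values: one missing entry and one out-of-range entry
accommodates = pd.Series([1, 4, np.nan, 16, 17])

valid = accommodates.dropna()
in_range = valid.between(1, 16)   # inclusive on both ends
offenders = valid[~in_range]      # rows that break the rule

print("all in range:", in_range.all())
print("offending values:", offenders.tolist())
print("NaN counted as in range:", accommodates.between(1, 16)[2])
```

Listing the offending values, as done here, can make a failed assertion easier to debug than reporting only the column minimum and maximum.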

### 2.4.2 Non-deterministic Test

In [22]:
# download the latest versions of the data_train.csv and data_test.csv artifacts
artifact_train = run.use_artifact("airbnb_eda/data_train.csv:latest")
artifact_test = run.use_artifact("airbnb_eda/data_test.csv:latest")

# create a dataframe from each artifact
df_train = pd.read_csv(artifact_train.file())
df_test  = pd.read_csv(artifact_test.file())

In [23]:
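def test_kolmogorov_smirnov():
    """
    Compare the train and test splits column by column with a
    two-sample Kolmogorov-Smirnov test.
    """
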
    sample1 = df_train
    sample2 = df_test
    ks_alpha = 0.05

    numerical_columns = [
        "accommodates",
        "bedrooms",
        "beds",
        "price",
        "minimum_nights_avg_ntm",
        "maximum_nights_avg_ntm",
        "review_scores_rating",
        "review_scores_accuracy",
        "review_scores_cleanliness",
        "review_scores_checkin",
        "review_scores_communication",
        "review_scores_location",
        "review_scores_value",
        "reviews_per_month",
        "minimum_minimum_nights",
        "maximum_minimum_nights",
        "minimum_maximum_nights",
        "maximum_maximum_nights",
        "availability_30",
        "availability_60",
        "availability_90",
        "availability_365",
        "number_of_reviews",
        "number_of_reviews_ltm",
        "number_of_reviews_l30d",
        "calculated_host_listings_count",
        "calculated_host_listings_count_entire_homes",
        "calculated_host_listings_count_private_rooms",
        "calculated_host_listings_count_shared_rooms",
        "minimum_nights",
        "maximum_nights",
    ]

    # Šidák correction for multiple hypothesis testing
    # (close to, and slightly less conservative than, Bonferroni's ks_alpha / n)
    alpha_prime = 1 - (1 - ks_alpha)**(1 / len(numerical_columns))

    for col in numerical_columns:

        # two-sided: the null hypothesis is that the two distributions are identical;
        # the alternative is that they are not identical.
        # NaNs are dropped first: depending on the SciPy version, missing values
        # can propagate into a NaN p-value and make the assertion fail spuriously.
        ts, p_value = scipy.stats.ks_2samp(
            sample1[col].dropna(),
            sample2[col].dropna(),
            alternative='two-sided'
        )

        # NOTE: the p-value is the probability of obtaining, by chance, a test
        # statistic (TS) equal to or more extreme than the one we got when the
        # null hypothesis is true. If this probability is not larger than the
        # corrected threshold, the dataset should be looked at carefully, hence we fail.
        assert p_value > alpha_prime, f"Column {col} failed the KS test: p-value={p_value}"
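
To make the corrected threshold concrete, the sketch below evaluates the formula used in the test for the 31 numerical columns and compares it with the plain Bonferroni bound `ks_alpha / n`. It then runs the same two-sample test on synthetic data. The normal samples, the seed, and the sample sizes are assumptions chosen only for illustration:

```python
import numpy as np
import scipy.stats

ks_alpha = 0.05
n_tests = 31  # number of numerical columns compared in the test above

sidak = 1 - (1 - ks_alpha) ** (1 / n_tests)   # formula used in the test
bonferroni = ks_alpha / n_tests               # simpler, slightly stricter bound
print(f"Sidak: {sidak:.6f}  Bonferroni: {bonferroni:.6f}")

# synthetic samples (not Airbnb data): same distribution vs. a shifted one
rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=500)
test_same = rng.normal(loc=0.0, scale=1.0, size=500)
test_shifted = rng.normal(loc=1.0, scale=1.0, size=500)

_, p_same = scipy.stats.ks_2samp(train, test_same, alternative="two-sided")
_, p_shifted = scipy.stats.ks_2samp(train, test_shifted, alternative="two-sided")

print("same distribution passes:", p_same > sidak)
print("shifted distribution passes:", p_shifted > sidak)
```

Both thresholds come out at roughly 0.0016, so each individual column needs strong evidence of a mismatch before the test fails. A split whose distribution is shifted by a full standard deviation, as in `test_shifted`, is rejected very clearly.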

In [24]:
# Executing tests
test_kolmogorov_smirnov()
test_column_presence_and_type()
test_column_ranges()

In [25]:
run.finish()
