# 1.0 An end-to-end classification problem (Data Check)



## 1.1 Dataset description

The notebooks focus on a **credit modeling problem** for borrowers. The dataset was obtained through a Dataquest project and is available at the link below. It comes from **Lending Club** and covers loans made between **2007 and 2011**. Lending Club is a marketplace for personal loans that matches borrowers seeking a loan with investors looking to lend money and earn a return. The **target variable**, that is, what we want to predict, is whether a person will repay the loan, given their history.

You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).

Let's take the following steps:

1. ETL (done!!!)
2. Data Checks

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 1.2 Install, load libraries and setup wandb

In [None]:
!pip install wandb

In [None]:
!pip install pytest pytest-sugar

In [None]:
import wandb

In [None]:
# Login to Weights & Biases
!wandb login --relogin

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## 1.2 Pytest


### 1.2.1 How pytest discovers tests



pytest uses the following [conventions](https://docs.pytest.org/en/latest/goodpractices.html#conventions-for-python-test-discovery) to discover tests automatically:
  1. files containing tests should be named `test_*.py` or `*_test.py`
  2. test function names should start with `test_`
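
The two conventions above can be mimicked with a few lines of plain Python. This is only a rough sketch of the naming rules as listed here, not pytest's actual collection code; the helper names `looks_like_test_file` and `looks_like_test_function` are hypothetical and exist only for illustration.

```python
from fnmatch import fnmatchcase


def looks_like_test_file(filename: str) -> bool:
    """Hypothetical helper: True if the file name matches test_*.py or *_test.py."""
    return fnmatchcase(filename, "test_*.py") or fnmatchcase(filename, "*_test.py")


def looks_like_test_function(name: str) -> bool:
    """Hypothetical helper: True if the function name starts with test_."""
    return name.startswith("test_")


# A few candidate names checked against the conventions
for candidate in ["test_data.py", "data_test.py", "data_checks.py"]:
    print(candidate, looks_like_test_file(candidate))

for candidate in ["test_data_length", "check_data_length"]:
    print(candidate, looks_like_test_function(candidate))
```

With these rules, a file named `test_data.py` containing a function `test_data_length` would be picked up, while `data_checks.py` would be skipped unless it is passed to pytest explicitly.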




### 1.2.2 Fixture


An important aspect of using ``pytest`` is understanding how a fixture's scope works.

The scope of a fixture can take a few legal values, described [here](https://docs.pytest.org/en/6.2.x/fixture.html#fixture-scopes). We are going to consider only **session** and **function**: with the former, the fixture is executed only once per pytest session and the value it returns is shared by all the tests that request it; with the latter, every test function gets a fresh copy of the data. The latter is useful, for example, when tests modify the input in a way that would make other tests fail.
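
The difference between the two scopes is easier to see with a small experiment. The sketch below is illustrative only: the fixture names `shared_rows` and `fresh_rows`, the `calls` counter and the file name `test_scope_demo.py` are all made up for this example, and it assumes pytest runs the tests of a file in definition order (its default behavior without reordering plugins). It writes a tiny test module to a temporary directory and runs it with `pytest.main`.

```python
import tempfile
from pathlib import Path

import pytest

# Contents of a throwaway test module (all names are hypothetical)
TEST_SOURCE = '''
import pytest

calls = {"session": 0, "function": 0}


@pytest.fixture(scope="session")
def shared_rows():
    calls["session"] += 1      # executed once for the whole session
    return [1, 2, 3]


@pytest.fixture(scope="function")
def fresh_rows():
    calls["function"] += 1     # executed again for every test
    return [1, 2, 3]


def test_mutates_both(shared_rows, fresh_rows):
    shared_rows.append(99)
    fresh_rows.append(99)


def test_sees_the_difference(shared_rows, fresh_rows):
    # The session-scoped list is the same object the previous test modified
    assert shared_rows == [1, 2, 3, 99]
    # The function-scoped list was rebuilt for this test
    assert fresh_rows == [1, 2, 3]
    assert calls == {"session": 1, "function": 2}
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    test_file = Path(tmp_dir) / "test_scope_demo.py"
    test_file.write_text(TEST_SOURCE)
    # pytest.main returns the exit code of the run (0 means all tests passed)
    exit_code = pytest.main(["-q", str(test_file)])

print("pytest exit code:", int(exit_code))
```

The second test passes only because the mutation made by the first test leaks through the session-scoped fixture but not through the function-scoped one. A session scope is a reasonable choice for a large, read-only dataset, as long as no test modifies it in place.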

### 1.2.3 Create and run a test file


In [None]:
%%file test_data.py
import pytest
import wandb
import pandas as pd

# This is global so all tests are collected under the same run
run = wandb.init(project="risk_credit", job_type="data_checks")

@pytest.fixture(scope="session")
def data():

    local_path = run.use_artifact("risk_credit/preprocessed_data.csv:latest").file()
    df = pd.read_csv(local_path)

    return df

def test_data_length(data):
    """
    We test that we have enough data to continue
    """
    assert len(data) > 1000


def test_number_of_columns(data):
    """
    We test that we have the expected number of columns
    """
    assert data.shape[1] == 18

def test_column_presence_and_type(data):

    required_columns = {
        "loan_amnt": pd.api.types.is_float_dtype,
        "term": pd.api.types.is_object_dtype,
        "int_rate": pd.api.types.is_float_dtype,
        "installment": pd.api.types.is_float_dtype,
        "emp_length": pd.api.types.is_object_dtype,
        "home_ownership": pd.api.types.is_object_dtype,
        "annual_inc": pd.api.types.is_float_dtype,
        "loan_status": pd.api.types.is_object_dtype,
        "purpose" : pd.api.types.is_object_dtype,
        "dti": pd.api.types.is_float_dtype, 
        "delinq_2yrs": pd.api.types.is_float_dtype,
        "inq_last_6mths": pd.api.types.is_float_dtype,
        "open_acc": pd.api.types.is_float_dtype,
        "pub_rec": pd.api.types.is_float_dtype,
        "verification_status": pd.api.types.is_object_dtype,
        "revol_bal": pd.api.types.is_float_dtype,
        "revol_util": pd.api.types.is_float_dtype,
        "total_acc": pd.api.types.is_float_dtype
    }

    # Check column presence
    assert set(data.columns.values).issuperset(set(required_columns.keys()))

    for col_name, format_verification_funct in required_columns.items():

        assert format_verification_funct(data[col_name]), f"Column {col_name} failed test {format_verification_funct}"


def test_class_names(data):

    # Check that only the known classes are present
    known_classes = [
        "Fully Paid",
        "Charged Off"
    ]

    assert data["loan_status"].isin(known_classes).all()


def test_column_ranges(data):
  
    ranges = {
        "loan_amnt": (0, 100000),
        "int_rate": (0, 30),
        "installment": (0, 10000),
        "annual_inc": (0, 10**10),
        "dti": (0, 100),
        "delinq_2yrs": (0, 10**10),
        "open_acc": (0, 100),
        "pub_rec": (0, 100),
        "revol_bal": (0, 10**10),
        "revol_util": (0, 100),
        "total_acc": (0, 100)
    }

    for col_name, (minimum, maximum) in ranges.items():

        assert data[col_name].dropna().between(minimum, maximum).all(), (
            f"Column {col_name} failed the test. Should be between {minimum} and {maximum}, "
            f"instead min={data[col_name].min()} and max={data[col_name].max()}"
        )

Writing test_data.py
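
To see why the range test calls `dropna()` before `between()`, and how the `isin` check flags unexpected classes, it may help to try the same pandas calls on a toy DataFrame. The values below are invented for illustration and do not come from the Lending Club data; in particular, the `"Current"` label is only a stand-in for a class the test does not know about.

```python
import numpy as np
import pandas as pd

# Toy data (invented values, not taken from the real dataset)
toy = pd.DataFrame({
    "int_rate": [10.65, 15.27, np.nan, 30.0],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Current"],
})

# Type check, as in test_column_presence_and_type
is_float = pd.api.types.is_float_dtype(toy["int_rate"])

# between() is inclusive on both ends by default, so 30.0 is accepted,
# but a missing value is never "between" anything and yields False
passes_with_nan = bool(toy["int_rate"].between(0, 30).all())
passes_without_nan = bool(toy["int_rate"].dropna().between(0, 30).all())

# Class check, as in test_class_names: list the labels that are not known
known_classes = ["Fully Paid", "Charged Off"]
unknown = toy.loc[~toy["loan_status"].isin(known_classes), "loan_status"].tolist()

print(is_float, passes_with_nan, passes_without_nan, unknown)
```

Dropping missing values first means the range test says nothing about how many values are missing; if that matters, it would need its own check.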


Now let's run pytest.

In [None]:
!pytest . -vv

Test session starts (platform: linux, Python 3.7.13, pytest 3.6.4, pytest-sugar 0.9.4)
cachedir: .pytest_cache
rootdir: /content, inifile:
plugins: typeguard-2.7.1, sugar-0.9.4

 test_data.py::test_data_length ✓                                 20% ██
 test_data.py::test_number_of_columns ✓                           40% ████
 test_data.py::test_column_presence_and_type ✓                    60% ██████
 test_data.py::test_class_names ✓                                 80% ████████
 test_data.py::test_column_ranges ✓

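In [None]:
# Close the run.
# Wait a while after running the previous cell before executing this one.
# `run` was created inside test_data.py, in the pytest process, so that name
# does not exist in the notebook and `run.finish()` raises a NameError here.
# wandb.finish() closes a run that is active in this process, if there is one.
wandb.finish()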