In [1]:
%load_ext nb_black
import pandas as pd
import numpy as np


# Testing: Why is it useful?

1. Checking your own work.
2. Maintainability.
3. Forces you to write better code.
4. Smoother integration.


**5. Peace of Mind.**


## Worked examples + what to do

### Checking your own work

Let's show an example where it's easy to make a mistake. The pattern of mistake is often as follows:

* We write some code with the intention that it performs task A.
* The code **looks** like it works when we apply it to our use case a.
* In reality, the code performs task B, which gives the same result as task A on some use cases (like a) but not on others.


Let's write a function that cleans up a string by replacing each run of spaces with a single underscore character.

In [2]:
def clean_string(s):
    # Replace any number of spaces by one underscore
    clean = s.replace("  ", " ")
    clean = clean.replace(" ", "_")
    return clean


This seems to work on a couple of simple examples:

In [3]:
print(clean_string("test case"))
print(clean_string("test  case"))


# Both cases look fine, but one more space breaks it (see below), and you might not catch it until it's too late!

test_case
test_case



However, it actually doesn't quite do what we want it to do:

In [4]:
print(clean_string("test   case"))

test__case



We could easily miss this mistake, and it could come back to haunt us. For example, on a particularly messy dataset this function would run without errors but produce results that go against our assumptions.

So what do we do?

We introduce testing. If you're writing code in a notebook, this can be as simple as writing a few assert statements. You can start with the following steps:

* Write your function.
* Go for a coffee and forget about how you wrote that function.
* Come up with a few test cases, i.e. an input and an expected output that you've worked out on your own.
* Try to come up with examples that are as general as possible.
* Insert assert statements in your code.

Let's show this with the corrected function:

In [5]:
def clean_string_correct(s):
    # We could do this with regex, but here we'll cheat:
    return "_".join(s.split())

assert clean_string_correct("test case") == 'test_case'
assert clean_string_correct("test  case") == 'test_case'
assert clean_string_correct("test   case") == 'test_case'



Every time we run our function definition, a few checks run along with it.
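The comment in `clean_string_correct` mentions regex. A minimal sketch of that variant follows, assuming we only want to collapse runs of the space character (the name `clean_string_regex` is ours, for illustration). It is not a drop-in replacement: it disagrees with the `split`/`join` version on leading and trailing spaces, which is exactly the kind of difference a few asserts make visible.

```python
import re


def clean_string_regex(s):
    # Replace every run of one or more spaces with a single underscore
    return re.sub(r" +", "_", s)


# Same checks as for clean_string_correct
assert clean_string_regex("test case") == "test_case"
assert clean_string_regex("test  case") == "test_case"
assert clean_string_regex("test   case") == "test_case"

# Edge case where the two approaches disagree:
# the regex keeps leading/trailing runs, split() drops them
assert clean_string_regex(" test case ") == "_test_case_"
assert "_".join(" test case ".split()) == "test_case"
```

Which behaviour is "correct" depends on what the function is meant to do, so it is worth writing the edge cases down as asserts before choosing.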

### Maintainability

Let's demonstrate this by defining our own custom train-test split function for a dataframe. Let's make a dummy dataframe:

In [6]:
df = pd.DataFrame(
    {"col_a": np.random.randint(1, 100, 100), "col_b": np.random.uniform(0, 1, 100)}
)
df.head()

Unnamed: 0,col_a,col_b
0,44,0.07461
1,28,0.604181
2,65,0.508915
3,67,0.116402
4,45,0.583659



Next we'll define our function:

In [7]:
def train_test_split(df, test_frac):
    # Get the number of test rows
    n_test = int(len(df) * test_frac)

    # Make a shuffled copy of the df
    shuffled = df.sample(frac=1.0)

    return (shuffled.iloc[-n_test:], shuffled.iloc[:-n_test])


train_df, test_df = train_test_split(df, 0.2)


Actually, there's something wrong with our function. Can you see what it is?












<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />


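That's right: we actually defined a test-train split function instead of a train-test split function. The return statement puts the test rows first, but the caller unpacks the result as `train_df, test_df`. That sort of mistake can be hard to spot and can cause lots of issues.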

In [8]:
print(f"Train size {len(train_df)}")
print(f"Test size {len(test_df)}")

Train size 20
Test size 80



Let's fix that and add a quick assertion:

In [9]:
def train_test_split_correct(df, test_frac):
    # Get the number of test rows
    n_test = int(len(df) * test_frac)

    # Make a shuffled copy of the df
    shuffled = df.sample(frac=1.0)

    # Let's be explicit about which is which to avoid confusion
    test = shuffled.iloc[-n_test:]
    train = shuffled.iloc[:-n_test]

    return train, test


# Create some parameters
n_dummy = 100
dummy_test_frac = 0.2

# Create some dummy data
dummy_df = pd.DataFrame(
    {
        "col_a": np.random.randint(1, 100, n_dummy),
        "col_b": np.random.uniform(0, 1, n_dummy),
    }
)

# Test the data!
dummy_train_df, dummy_test_df = train_test_split_correct(
    dummy_df, test_frac=dummy_test_frac
)
assert len(dummy_train_df) == n_dummy * (1 - dummy_test_frac)
assert len(dummy_test_df) == n_dummy * dummy_test_frac

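Size checks alone would not notice rows that end up in both sets or in neither. As a hedged sketch, the helper below (`check_split` is a hypothetical name, not part of the original notebook) adds those checks. It assumes the dataframe has a unique index and that `test_frac` is large enough for at least one row to land in the test set.

```python
import numpy as np
import pandas as pd


def train_test_split_correct(df, test_frac):
    # Same logic as above, repeated so this block runs on its own
    n_test = int(len(df) * test_frac)
    shuffled = df.sample(frac=1.0)
    return shuffled.iloc[:-n_test], shuffled.iloc[-n_test:]


def check_split(original, train, test):
    # Hypothetical helper: no row may appear in both sets...
    assert train.index.intersection(test.index).empty
    # ...and together the two sets must cover every original row exactly once
    assert len(train) + len(test) == len(original)
    assert set(train.index) | set(test.index) == set(original.index)


dummy_df = pd.DataFrame({"col_a": np.arange(100), "col_b": np.linspace(0, 1, 100)})
dummy_train_df, dummy_test_df = train_test_split_correct(dummy_df, test_frac=0.2)
check_split(dummy_df, dummy_train_df, dummy_test_df)
```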

Great, so that's another example of dangerous bugs sneaking into our code. What about maintainability?

Imagine that this function is called by another function, called preprocess:



In [10]:
def preprocess(df):
    # Other preprocessing functions here
    # ...

    train_df, test_df = train_test_split_correct(df, test_frac=0.1)

    return train_df, test_df


train_df, test_df = preprocess(df)
train_df.head()

Unnamed: 0,col_a,col_b
81,19,0.268522
18,74,0.999803
16,92,0.764396
85,43,0.457127
19,59,0.791084



Works great. Now say you decide to change the way your train-test split function works. Maybe you also want it to return some summary statistics for convenience. The function gets redefined:

In [11]:
def train_test_split_correct(df, test_frac):
    # Get the number of test rows
    n_test = int(len(df) * test_frac)

    # Make a shuffled copy of the df
    shuffled = df.sample(frac=1.0)

    # Let's be explicit about which is which to avoid confusion
    test = shuffled.iloc[-n_test:]
    train = shuffled.iloc[:-n_test]

    return train, test, train.mean(), test.mean()


train_df, test_df, train_means, test_means = train_test_split_correct(df, 0.1)



That seems to work! However, we've actually broken `preprocess`:

In [12]:
train_df, test_df = preprocess(df)
train_df.head()

ValueError: too many values to unpack (expected 2)


In this case, the error is obvious and the code breaks. Having a test for preprocess would remind us that we need to update that function as well.

What's more, there will be cases where the preprocess function **doesn't break** but starts behaving differently due to the change in test train split. These will be very easy to miss and very hard to track down.

Testing functions like preprocess with known data will ensure that when you expand or modify your codebase, you miss fewer sneaky bugs. This is crucial in larger projects where everything mysteriously breaks as soon as one small change happens.
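As an illustration, a test for `preprocess` on known data might look like the sketch below. The function `test_preprocess` and the 20-row input are our own assumptions, chosen so the expected sizes are easy to work out by hand. The split function is repeated in its two-value form so the block runs on its own.

```python
import pandas as pd


def train_test_split_correct(df, test_frac):
    # Two-value version from earlier, repeated so this block runs on its own
    n_test = int(len(df) * test_frac)
    shuffled = df.sample(frac=1.0)
    return shuffled.iloc[:-n_test], shuffled.iloc[-n_test:]


def preprocess(df):
    train_df, test_df = train_test_split_correct(df, test_frac=0.1)
    return train_df, test_df


def test_preprocess():
    # Small, known input: 20 rows, so a 10% split should give 18 / 2
    known_df = pd.DataFrame({"col_a": range(20), "col_b": range(20, 40)})
    result = preprocess(known_df)

    # Contract 1: exactly two objects come back, both dataframes
    assert len(result) == 2
    train_df, test_df = result
    assert isinstance(train_df, pd.DataFrame)
    assert isinstance(test_df, pd.DataFrame)

    # Contract 2: sizes and order (train first) are what callers expect
    assert len(train_df) == 18
    assert len(test_df) == 2

    # Contract 3: the columns are untouched
    assert list(train_df.columns) == ["col_a", "col_b"]


test_preprocess()
```

The size check is what catches the silent kind of breakage: if the split function ever returned the two sets in the opposite order, `preprocess` would still run, but this test would fail.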




### Better, more modular code and smoother integration

Let's say I need to process my data by applying a few transforms. 
We have data with values and groups:

In [39]:
df = pd.DataFrame(
    {"group": ["A", "A", "A", "B", "B", "B"], "value": [4, 4, 1, 0, 3, 3]}
)
df

Unnamed: 0,group,value
0,A,4
1,A,4
2,A,1
3,B,0
4,B,3
5,B,3



Let's say that we want to normalise our data. 

For each row, we'd like to normalise value by adding its group mean and subtracting its group median.

We've learned the lesson of testing so we'll check that the preprocessing function works on this dataset.

We do the maths and figure out the answer manually:

In [None]:
EXPECTED_NORMALISED_VALUE = [3, 3, 0, -1, 2, 2]

We define a function that performs all of this and check that it works:

In [15]:
def messy_preprocess(df):

    # Find the overall mean

    mean = df["value"].mean()
    # Let's iterate through df group by group. Start by sorting the dataframe.
    df = df.sort_values("group")

    # Define the starting group
    current_group = df.iloc[0]["group"]

    rows = []
    # Start looping through the dataframe row by row and performing the change
    for _, row in df.iterrows():

        # Get the mean for the current group
        group_mean = df.loc[df["group"] == current_group, "value"].mean()

        # Get the median for the current group
        group_median = df.loc[df["group"] == current_group, "value"].median()

        row["value_normalised"] = row["value"] + group_mean - group_median

        #         # Calculate the normalised value
        #         row["norm_value"] = (row["value"] - group_mean) / mean

        # Accumulate the rows
        rows.append(row)

    return pd.DataFrame(rows)


np.testing.assert_array_equal(
    messy_preprocess(df)["value_normalised"], EXPECTED_NORMALISED_VALUE
)


It works! Note the use of `np.testing`: this provides a bunch of useful functions for testing arrays.
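One of those functions is worth knowing early: `np.testing.assert_allclose` compares within a tolerance, which matters as soon as the expected values are not exactly representable floats. The sketch below uses made-up numbers purely for illustration.

```python
import numpy as np

computed = np.array([0.1 + 0.2, 1.0])
expected = np.array([0.3, 1.0])

# Exact comparison fails because 0.1 + 0.2 is not exactly 0.3 in floating point
try:
    np.testing.assert_array_equal(computed, expected)
    exact_match = True
except AssertionError:
    exact_match = False

# Comparison within a relative tolerance passes
np.testing.assert_allclose(computed, expected, rtol=1e-9)
```

In the examples here the results are whole numbers, so the exact comparison is fine.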

Problem #1: coming up with the answers was a little tedious. If I were the person reviewing this code, I would struggle to quickly assess whether the expected values are correct.

Problem #2: this actually has a really nasty bug. With slightly different data, it no longer works:


In [16]:
df_2 = pd.DataFrame(
    {"group": ["A", "A", "A", "B", "B", "B"], "value": [4, 4, 1, 0, 0, 15]}
)
EXPECTED_NORMALISED_VALUE_2 = [3, 3, 0, 5, 5, 20]

np.testing.assert_array_equal(
    messy_preprocess(df_2)["value_normalised"], EXPECTED_NORMALISED_VALUE_2
)



AssertionError: 
Arrays are not equal

Mismatched elements: 3 / 6 (50%)
Max absolute difference: 6.
Max relative difference: 1.2
 x: array([ 3.,  3.,  0., -1., -1., 14.])
 y: array([ 3,  3,  0,  5,  5, 20])


So what lessons can we draw from this?

An unfortunate bug has slipped into our code: `current_group` is set once before the loop and never updated, so every row is normalised with group A's mean and median. The first dataset hid this because both groups happened to have the same difference between mean and median (-1). More fundamentally, the code we wrote isn't very high quality. Even without the bug, it would be hard to maintain and understand. Part of the problem is that this one function does several things at once.

We need to simplify and streamline. Testing encourages us to break this code up into smaller, more manageable, and easier-to-test chunks. Let's do this.


<span style="font-size:larger;">**Exercise:**</span>

1. Write one or more functions to perform the above task
2. Write some checks that you are getting the expected data.
3. Be as thorough as you can!

<span style="font-size:larger;">**Solution:**</span>


In [37]:
def add_group_mean(df):
    group_means = (
        df.groupby("group")[["value"]].mean().rename(columns={"value": "group_mean"})
    )
    return df.merge(group_means, on="group")


def add_group_median(df):
    group_medians = (
        df.groupby("group")[["value"]]
        .median()
        .rename(columns={"value": "group_median"})
    )
    return df.merge(group_medians, on="group")


def preprocess(df):

    df = add_group_mean(df)
    df = add_group_median(df)
    df["value_normalised"] = df["value"] + df["group_mean"] - df["group_median"]
    return df.drop(columns=["group_median", "group_mean"])




In [38]:
# Test Means

EXPECTED_MEANS = [3, 3, 3, 2, 2, 2]
# df_2: group A is (4, 4, 1) -> mean 3; group B is (0, 0, 15) -> mean 5
EXPECTED_MEANS_2 = [3, 3, 3, 5, 5, 5]

np.testing.assert_array_equal(add_group_mean(df)["group_mean"], EXPECTED_MEANS)
np.testing.assert_array_equal(add_group_mean(df_2)["group_mean"], EXPECTED_MEANS_2)

# Test Medians
EXPECTED_MEDIANS = [4, 4, 4, 3, 3, 3]
# df_2: group A median is 4; group B median is 0
EXPECTED_MEDIANS_2 = [4, 4, 4, 0, 0, 0]

np.testing.assert_array_equal(add_group_median(df)["group_median"], EXPECTED_MEDIANS)
np.testing.assert_array_equal(
    add_group_median(df_2)["group_median"], EXPECTED_MEDIANS_2
)


np.testing.assert_array_equal(
    preprocess(df)["value_normalised"], EXPECTED_NORMALISED_VALUE
)
np.testing.assert_array_equal(
    preprocess(df_2)["value_normalised"], EXPECTED_NORMALISED_VALUE_2
)

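For comparison, here is a more compact sketch of the same normalisation using `groupby(...).transform`, which returns one value per original row, aligned on the index. The name `preprocess_transform` and the interleaved test data are our own, for illustration; this is one possible alternative, not the notebook's solution.

```python
import numpy as np
import pandas as pd


def preprocess_transform(df):
    # Work on a copy so the caller's dataframe is left untouched
    out = df.copy()
    grouped = out.groupby("group")["value"]
    # transform broadcasts each group's statistic back onto its rows
    out["value_normalised"] = (
        out["value"] + grouped.transform("mean") - grouped.transform("median")
    )
    return out


# Same values as df_2, with rows deliberately interleaved
# to check that the original row order is preserved
shuffled_df = pd.DataFrame(
    {"group": ["B", "A", "B", "A", "B", "A"], "value": [0, 4, 0, 4, 15, 1]}
)
np.testing.assert_array_equal(
    preprocess_transform(shuffled_df)["value_normalised"], [5, 3, 5, 3, 20, 0]
)
```

The trade-off is that the intermediate `group_mean` and `group_median` columns are no longer available to test on their own, so the smaller helper functions above remain easier to check step by step.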

## Exercise

We have time series data of the following form:

In [31]:
df = pd.DataFrame({"time": [1, 2, 3, 4, 5], "value": [0, 3, 1, 4, 6]})
df

Unnamed: 0,time,value
0,1,0
1,2,3
2,3,1
3,4,4
4,5,6



We want a function `add_norm_last_value` which adds a `norm_last_value` column. This is computed by:

* At time stamp t, take the value for t-1. At the first time stamp, assume this is 0.
* Divide the result by the sum of values seen until t (not inclusive). If that sum is 0, just return 0.
* Note: if some of the values are missing, assume they are 0 for the purpose of the cumulative sum normalisation. (Keep the value as NaN for the numerator.)

Given the example dataframe above, the expected result is:




In [33]:
RES = pd.DataFrame(
    {
        "time": [1, 2, 3, 4, 5],
        "value": [0, 3, 1, 4, 6],
        "norm_last_value": [0, 0, 1, 0.25, 0.5],
    }
)
RES

Unnamed: 0,time,value,norm_last_value
0,1,0,0.0
1,2,3,0.0
2,3,1,1.0
3,4,4,0.25
4,5,6,0.5



We define a testing function which takes in some test data and the expected results, and throws an error if they don't match.

In [34]:
def test_add_norm_last_value(df, expected_df):
    res = add_norm_last_value(df)

    # The check_like argument allows the check to ignore ordering
    pd.testing.assert_frame_equal(res, expected_df, check_like=True)


### Questions

**Part A**

Write `add_norm_last_value`. Check that it runs with our example of `df` and `RES` defined above.

**Part B**

Generate some data of your own and check that it also works.


When you're done, some test data will be sent your way and your function will be put to the test!


### Solution
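One possible sketch of Part A is shown below. It is not the only valid answer and rests on a few assumptions that the exercise text leaves open: rows are already sorted by `time`, and when the running sum is 0 the result is 0 even if the previous value is missing.

```python
import numpy as np
import pandas as pd


def add_norm_last_value(df):
    # Assumes rows are already sorted by "time"; work on a copy
    out = df.copy()

    # Numerator: previous value, 0 for the first row; NaNs in the data are kept
    last_value = out["value"].shift(1, fill_value=0)

    # Denominator: sum of all earlier values, with missing values counted as 0
    previous_sum = out["value"].fillna(0).cumsum().shift(1, fill_value=0)

    # Where the running sum is 0 return 0, otherwise divide
    out["norm_last_value"] = np.where(
        previous_sum == 0, 0.0, last_value / previous_sum
    )
    return out


df = pd.DataFrame({"time": [1, 2, 3, 4, 5], "value": [0, 3, 1, 4, 6]})
RES = pd.DataFrame(
    {
        "time": [1, 2, 3, 4, 5],
        "value": [0, 3, 1, 4, 6],
        "norm_last_value": [0, 0, 1, 0.25, 0.5],
    }
)
pd.testing.assert_frame_equal(add_norm_last_value(df), RES, check_like=True)
```

For Part B, data with a missing value in the middle and data that starts with several zeros are good candidates, since they exercise the two special rules.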