# Hypothesis for the scientific stack
Hypothesis provides special strategies for scientific libraries such as numpy and pandas, available through the `hypothesis[numpy]` extra. Since pandas is built on top of numpy arrays

![pandas_internal](df_blocks.png)
(source: https://www.dataquest.io/blog/pandas-big-data/)

we will have a look at some of the numpy- and pandas-specific strategies. But first, make sure you have the extras available. If not, reinstall Hypothesis with the extras (not needed if you set up your environment using the `requirements.txt` in this repo).

In [None]:
# uncomment the following line and run to reinstall hypothesis
# !pip install -U hypothesis[numpy]

## Generating a numpy array

Since a numpy array is not a simple object like an integer or a piece of text, we need to provide some parameters, as we did with `st.lists` in the last exercise. First, let's have a look at the documentation of the `arrays` strategy:

In [None]:
import hypothesis.extra.numpy as npst

?npst.arrays

We need to provide a `dtype` and a `shape`. Optionally, we can also specify how the elements are generated, which gives us more control over what the generated arrays look like. Let's try to discover how we can do that.

In [None]:
import numpy as np

# TODO: generate a 1-D array of 10 integers

npst.arrays(np.int8, 10).example()

In [None]:
# TODO: generate a 4x4 array with all unique elements

npst.arrays(np.int8, (4,4), unique=True).example()

In [None]:
# TODO: generate a 2D array with random shape

npst.arrays(np.int8, npst.array_shapes(min_dims=2, max_dims=2)).example()

In [None]:
import hypothesis.strategies as st

# TODO: generate a 2x3 array with floats between -1 and 1

npst.arrays(np.float32, (2,3), elements=st.floats(-1, 1, width=32)).example()

In [None]:
# TODO: generate an 8x6 array with only even numbers

even_num = st.builds(lambda x: x*2, st.integers(min_value=0, max_value=126 // 2))
npst.arrays(np.int8, (8,6), elements=even_num).example()

In [None]:
# TODO: generate an 8x6 array with only unique even numbers

even_num = st.builds(lambda x: x*2, st.integers(min_value=0, max_value=126 // 2))
npst.arrays(np.int8, (8,6), elements=even_num, unique=True).example()

You may notice that the bigger the array, the longer it takes to generate, and that unique arrays take even longer. Do you know why? Have another look at the documentation of `arrays` to understand the generation process.
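
As a hint, the `arrays` documentation describes generation in two parts: some coordinates are drawn from `elements`, and any remaining ones may be set to a single value drawn from `fill`. The sketch below contrasts the three cases. It assumes the documented behaviour that `fill=st.nothing()` disables filling and that no filling happens by default when `unique=True`; the test names are made up for illustration.

```python
import numpy as np
import hypothesis.extra.numpy as npst
import hypothesis.strategies as st
from hypothesis import given, settings


# Default: a non-unique array may reuse one `fill` value for many cells,
# so fewer individual draws are needed.
@settings(deadline=None, max_examples=20)
@given(npst.arrays(np.int8, (4, 4)))
def test_default_fill(arr):
    assert arr.shape == (4, 4)


# fill=st.nothing() disables filling, so every element is drawn on its own.
@settings(deadline=None, max_examples=20)
@given(npst.arrays(np.int8, (4, 4), fill=st.nothing()))
def test_no_fill(arr):
    assert arr.shape == (4, 4)


# unique=True: all elements must differ, so none of them can share a fill value.
@settings(deadline=None, max_examples=20)
@given(npst.arrays(np.int8, (4, 4), unique=True))
def test_unique(arr):
    assert np.unique(arr).size == arr.size


# Calling a @given-decorated function directly runs the property.
test_default_fill()
test_no_fill()
test_unique()
```

Drawing every element separately, and additionally keeping them distinct, is more work per array than filling most cells with one shared value.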

Now, let's test a few of our favourite [scikit-learn preprocessing tools](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing).

In [None]:
import ipytest

ipytest.autoconfig()

In [None]:
%%ipytest

# TODO: test Normalizer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html)

from sklearn.preprocessing import Normalizer
from hypothesis import given

@given(npst.arrays(np.int8, npst.array_shapes(min_dims=2, max_dims=2)))
def test_normalizer_less_than_1(arr):
    transformer = Normalizer().fit(arr)
    assert (transformer.transform(arr) <= 1).all()
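
A stronger property than "every value is at most 1" is that each row ends up with an L2 norm of either 0 (an all-zero row) or 1. The sketch below checks that property against a hand-written helper, `l2_normalize_rows`, which is a hypothetical stand-in for `Normalizer` so that the example only needs numpy; it is not scikit-learn's implementation.

```python
import numpy as np
import hypothesis.extra.numpy as npst
from hypothesis import given, settings


def l2_normalize_rows(arr):
    """Hypothetical stand-in for Normalizer: scale each row to unit L2 norm."""
    x = arr.astype(np.float64)  # cast first so squaring int8 values cannot overflow
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged instead of dividing by 0
    return x / norms


@settings(deadline=None, max_examples=50)
@given(npst.arrays(np.int8, npst.array_shapes(min_dims=2, max_dims=2)))
def test_row_norms_are_zero_or_one(arr):
    norms = np.linalg.norm(l2_normalize_rows(arr), axis=1)
    assert np.all(np.isclose(norms, 0.0) | np.isclose(norms, 1.0))


# Calling a @given-decorated function directly runs the property.
test_row_norms_are_zero_or_one()
```

The same assertion could be pointed at `Normalizer().fit(arr).transform(arr)` inside an `%%ipytest` cell.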

In [None]:
%%ipytest

# TODO: test Binarizer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html)

from sklearn.preprocessing import Binarizer

@given(npst.arrays(np.int8, npst.array_shapes(min_dims=2, max_dims=2)))
def test_binarizer_0_or_1(arr):
    transformer = Binarizer().fit(arr)
    zero_or_one = lambda x: np.logical_or(x==1, x==0)
    assert zero_or_one(transformer.transform(arr)).all()

Now that you have learned how to test using arrays, let's move on to testing with pandas dataframes. For more information about using `arrays`, see [hypothesis for numpy](https://hypothesis.readthedocs.io/en/latest/numpy.html#numpy).

## Generating a pandas Data Frame

Before we dive into generating a pandas dataframe, note that a dataframe is essentially several series bundled together, so we will first look at how to generate a pandas series.

In [None]:
import hypothesis.extra.pandas as pdst

?pdst.series

As you can see, generating a pandas series is quite similar to generating a numpy array, as both allow only a single dtype. We will again start by generating some series with the `series` strategy.

In [None]:
# TODO: generate a series with integers

pdst.series(dtype=int).example()

In [None]:
# TODO: generate a series with floats between -1 and 1

pdst.series(dtype=np.float32, elements=st.floats(-1, 1, width=32)).example()

In [None]:
# TODO: generate a series with only even numbers

even_num = st.builds(lambda x: x*2, st.integers(min_value=0, max_value=126/2))
pdst.series(dtype=np.int8, elements=even_num).example()

In [None]:
# TODO: generate a series of text containing only English letters, with at least 1 character

pdst.series(elements=st.from_regex(r"[a-zA-Z]+", fullmatch=True)).example()

In [None]:
# TODO: generate a series with index 0 to 9

pdst.series(dtype=np.int8, index=pdst.range_indexes(min_size=10, max_size=10)).example()

In [None]:
# TODO: generate a series with datetime index

pdst.series(dtype=np.int8, index=pdst.indexes(dtype='datetime64[s]')).example()

Now that you know how to generate series, let's look at `data_frames`. Here is the documentation ([here](https://hypothesis.readthedocs.io/en/latest/numpy.html#hypothesis.extra.pandas.data_frames) if you prefer the HTML version):

In [None]:
?pdst.data_frames

It looks quite scary, but it is not that complicated if you think about how you would construct a dataframe with pandas. You can look at a dataframe either column by column or row by row, so there are two major ways of generating one: by columns or by rows. First, let's look at generating by columns.
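
For reference, the two views correspond to two ways of building the same frame in plain pandas. This is a minimal sketch with made-up sample values, not part of the exercises:

```python
import pandas as pd

# Column by column: each key is a column name, each value is that column's data.
by_columns = pd.DataFrame({"Name": ["ANN", "BOB"], "Score": [90, 75]})

# Row by row: each tuple is one record; column names are supplied separately.
by_rows = pd.DataFrame([("ANN", 90), ("BOB", 75)], columns=["Name", "Score"])

# Both constructions describe the same table.
print(by_columns.equals(by_rows))
```

The `columns` and `rows` arguments of `data_frames` mirror these two views.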

In [None]:
# TODO: generate a data frame of test scores for a class, with two columns
# Name - name in ALL CAPS; only English letters, at least 1 character
# Score - the full score is 100

name = pdst.column(name='Name', elements=st.from_regex(r"[A-Z]+", fullmatch=True))
score = pdst.column(name='Score', elements=st.integers(min_value=0, max_value=100))
pdst.data_frames([name,score]).example()

In [None]:
# TODO: now assume there are always 20 students in a class

pdst.data_frames([name,score], index=pdst.range_indexes(min_size=20, max_size=20)).example()

Most of the time, you will want to generate a data frame by columns. However, there may be exceptional cases where you want to generate by rows instead, for example when the values in different columns of the same row depend on each other.

In [None]:
# TODO: now, instead of the score, we will have a column that shows the length of the name
# hint: we will use `st.builds`

def count_name(name: str):
    return (name, len(name))

student = st.builds(count_name, st.from_regex(r"[A-Z]+", fullmatch=True))
pdst.data_frames(rows=student, index=pdst.range_indexes(min_size=20, max_size=20)).example()
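
As a small preview of how such a frame can be used, the sketch below checks that the dependency between the two columns really holds in every generated row. It assumes that, with only `rows` given, the columns are labelled by position (`0` and `1`); the test name and the smaller index size are choices made for this illustration.

```python
import hypothesis.extra.pandas as pdst
import hypothesis.strategies as st
from hypothesis import given, settings


def count_name(name: str):
    # One row: the name and a second value derived from it.
    return (name, len(name))


student = st.builds(count_name, st.from_regex(r"[A-Z]+", fullmatch=True))


@settings(deadline=None, max_examples=25)
@given(pdst.data_frames(rows=student, index=pdst.range_indexes(min_size=1, max_size=5)))
def test_length_column_matches_name(df):
    # Column 0 holds the names and column 1 holds their lengths.
    assert (df[0].str.len() == df[1]).all()


# Calling a @given-decorated function directly runs the property.
test_length_column_matches_name()
```

Generating `Name` and its length as two independent columns could not guarantee this relationship, which is why the rows are built together.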

Now that you have learned how to generate a pandas data frame, we will leave applying it in tests to the final chapter, where we do the final project of testing a machine learning pipeline.