# Complex packages and testing

In this notebook we will learn how to write more complex packages, building on the blueprint of the first package we already started. We will also write simple tests to ensure code quality and verify our algorithms.

## Exercise 6.1

This package is supposed to make your life as a data scientist easier. Packages for these tasks may already exist, but for the sake of practice and customizability we are going to write our own toolkit for our custom analysis.

In the later stages of the exercises we will come back to some of these functions and methods to speed them up or make them more readable.

Let us start with the basic things we will always need: reading, writing and visualizing data.

### Exercise 6.1.1

Write a module called ds_toolkit.io. This module will take care of data handling, that is, reading and writing data.
The module should have functions for different file formats.
Common file formats found in Data Science are HDF5, ROOT, FITS and CSV.

Write functions to handle the different formats. The format does not have to be specified in the function's name.
There are packages that can infer the file format from the file extension.
You can use the read functions from existing libraries like pandas.

For example:

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
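def read_file_example(filepath):
    # Split off the extension (e.g. '.fits') to decide which reader to use.
    _, ext = os.path.splitext(filepath)
    if ext == '.fits':
        # Read the file with a FITS-capable package here.
        pass
    # Placeholder: a real implementation returns the loaded data.
    return pd.DataFrame()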
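
One possible way to organise this dispatch is a lookup table from file extension to handler, sketched below. The names `READERS`, `WRITERS`, `read_table` and `write_table` are hypothetical and only serve as illustration; they are not the solution expected for `ds_toolkit.io`. Only CSV and HDF5 are covered, since pandas handles them directly (HDF5 additionally needs the PyTables package); ROOT and FITS would need further packages and are left out.

```python
import os
import tempfile

import pandas as pd

# Hypothetical lookup tables: file extension -> pandas reader / writer.
READERS = {".csv": pd.read_csv, ".h5": pd.read_hdf}
WRITERS = {
    ".csv": lambda df, path, **kw: df.to_csv(path, index=False, **kw),
    # HDF5 needs the PyTables package and a key=... keyword argument.
    ".h5": lambda df, path, **kw: df.to_hdf(path, **kw),
}


def _pick(table, filepath):
    """Return the handler registered for the extension of filepath."""
    ext = os.path.splitext(filepath)[1].lower()
    if ext not in table:
        raise ValueError(f"Unsupported file extension: {ext!r}")
    return table[ext]


def read_table(filepath, **kwargs):
    """Read a file into a DataFrame, choosing the reader by extension."""
    return _pick(READERS, filepath)(filepath, **kwargs)


def write_table(df, filepath, **kwargs):
    """Write a DataFrame to disk, choosing the writer by extension."""
    _pick(WRITERS, filepath)(df, filepath, **kwargs)


# Usage: round-trip a tiny table through CSV in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    write_table(pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]}), path)
    loaded = read_table(path)
```

Supporting a further format then only means adding one entry to each dictionary, and unknown extensions fail early with a clear error.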

### Exercise 6.1.2
To test the basic functionality of your functions, write a randomly generated table to your hard drive and read it back in.

In [3]:
np.random.seed(1337)
test_df = pd.DataFrame({'A': np.random.randint(1,100, size=100), 'B': np.random.uniform(size=100)})
test_df.head()

,A,B
0,24,0.889885
1,62,0.538329
2,93,0.766726
3,40,0.46023
4,90,0.667329


In [4]:
from ds_toolkit.io import write_file, read_file

In [5]:
write_file(test_df,'data/06/testfile1.h5',key='test_df')

data/06/testfile1 written to data/06/testfile1.h5


In [6]:
read_df = read_file('data/06/testfile1.h5')

Reading data/06/testfile1
The filetype is: .h5


In [7]:
read_df.head()

,A,B
0,24,0.889885
1,62,0.538329
2,93,0.766726
3,40,0.46023
4,90,0.667329


You can check whether two DataFrames are equal with the pandas.testing module (called pandas.util.testing in older pandas versions).

An assert statement raises an AssertionError, which stops execution of the cell or script, if its condition is not True.
This is the basis for writing test scripts.
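
The behaviour described above can be seen in a small sketch with two hand-made DataFrames (`left` and `right` are made-up example data). `assert_frame_equal` returns silently when the frames match and raises an `AssertionError` when they differ; the error is caught here only so that the example runs to the end.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"A": [1, 2, 3]})
right = pd.DataFrame({"A": [1, 2, 4]})  # differs in the last row

# Identical content: no exception, execution simply continues.
assert_frame_equal(left, left.copy())

# Different content: an AssertionError describing the mismatch is raised.
try:
    assert_frame_equal(left, right)
except AssertionError as err:
    mismatch_detected = True
    print("Frames differ:", err)
else:
    mismatch_detected = False
```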

Build a test_io.py script in the tests directory. 
The test_io.py can be called using ```pytest test_io.py```.
Another possibility is to use ```tox```, which is already built into the cookiecutter package we chose.
Tox automates testing for different environments and versions of Python.
Tox does struggle with Anaconda, however, so you can ```pip install tox-conda``` and alter the tox.ini file to fit your needs.
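
A minimal sketch of what `tests/test_io.py` could look like, assuming pytest's built-in `tmp_path` fixture for a throwaway directory. To keep the sketch self-contained, pandas' own `to_csv` and `read_csv` stand in for `write_file` and `read_file`; in your script you would import and call the functions from `ds_toolkit.io` instead. The helper `make_test_frame` is a hypothetical name.

```python
from pathlib import Path

import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def make_test_frame(n_rows=100, seed=1337):
    """Build a reproducible random table with an integer and a float column."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "A": rng.integers(1, 100, size=n_rows),
        "B": rng.uniform(size=n_rows),
    })


def test_csv_roundtrip(tmp_path: Path):
    """Write a table, read it back and require both frames to be equal."""
    expected = make_test_frame()
    target = tmp_path / "testfile.csv"

    expected.to_csv(target, index=False)  # stand-in for write_file
    result = pd.read_csv(target)          # stand-in for read_file

    assert_frame_equal(expected, result)
```

pytest collects every function whose name starts with `test_` and injects a fresh temporary directory as `tmp_path`, so the test leaves no files behind.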

You can use a self-generated DataFrame that you write, read and then compare for equality in your tests.
Another possibility is to use a higher-level, property-based testing framework like ```hypothesis``` (https://hypothesis.readthedocs.io/en/latest/).
This might, however, be tricky depending on your Python installation and might require some more work in the ```tox.ini```.
Also make sure that the imported packages are listed in the ```requirements_dev.txt```.
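
As a rough sketch of the hypothesis approach, the test below lets hypothesis generate many lists of integers and checks that each one survives a CSV round trip. It assumes hypothesis is installed and, to stay self-contained, uses pandas' `to_csv` and `read_csv` with an in-memory buffer as stand-ins for the `ds_toolkit.io` functions. The value range and list sizes are arbitrary choices for illustration.

```python
import io

import pandas as pd
from hypothesis import given, settings, strategies as st
from pandas.testing import assert_frame_equal


@settings(max_examples=50, deadline=None)
@given(
    st.lists(
        st.integers(min_value=-10**6, max_value=10**6),
        min_size=1,
        max_size=50,
    )
)
def test_csv_roundtrip_property(values):
    """Any non-empty integer column should survive a CSV round trip."""
    expected = pd.DataFrame({"A": values})

    buffer = io.StringIO()
    expected.to_csv(buffer, index=False)  # stand-in for write_file
    buffer.seek(0)
    result = pd.read_csv(buffer)          # stand-in for read_file

    assert_frame_equal(expected, result)
```

When a generated input fails, hypothesis reports a minimal failing example, which often points at edge cases a single hand-written table would miss.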

In [8]:
from pandas.testing import assert_frame_equal

assert_frame_equal(test_df, read_df)