# Experiment Organization

- Git(Lab): [git.fim.uni-passau.de](https://git.fim.uni-passau.de/)
- Environment Management: [Conda (Anaconda)](https://conda.io/)

# Workflows

**Rules for storing data:**
Depending on what kind of data you work with, how it is structured, and how easily it can be obtained or prepared for use, try to apply the following rules for what you actually persist in git:
- 1) if the data is an online resource that can be retrieved quickly, and the source can be considered reliably available for at least a year: store only a routine that retrieves the data locally, e.g. *retrieve_data.py*
- 2) if the data is small, readable, and well-structured, such as a CSV or JSON file, store it directly in git so it is immediately accessible when checking out your repository, e.g. *data/2020-05-05-some_structured_infos.json*. It always makes sense to note where you retrieved the data from, e.g. by putting a file next to it, *data/2020-05-05-some_structured_infos.md*, which states where you retrieved it from, what license might apply, etc.
- 3) if the data is large or binary, store it in git only if you also use ``git lfs track`` (see the documentation on [git lfs](https://git-lfs.github.com/))
- 4) if your repository does not support git-lfs or you want to make the data available to others, consider an online storage such as Amazon S3 or a data provider such as Zenodo
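
Rule 1 can be sketched as a small retrieval script. This is only a minimal sketch under assumptions: the function name `retrieve`, its `fetch` parameter, and the example URL are hypothetical and not part of the course material.

```python
import urllib.request
from pathlib import Path


def retrieve(url, target, fetch=urllib.request.urlretrieve):
    """Download `url` to `target` unless the file is already present locally.

    `fetch` is any callable taking (url, filename); it defaults to the
    standard library's urlretrieve and can be swapped out for testing.
    """
    target = Path(target)
    if target.exists():
        return target  # already retrieved, nothing to do
    target.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(target))
    return target


# Hypothetical usage; the URL is a placeholder, not a real data source:
# retrieve("https://example.org/some_structured_infos.json",
#          "data/2020-05-05-some_structured_infos.json")
```

Skipping the download when the file already exists keeps the routine cheap to re-run, so it can be called at the top of every notebook.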

**Workflows:**
- a) **Data:** retrieve (download) data locally
- b) **Analysis:** load data into memory (partially) -> restructure & prepare for investigation -> extract information or build a visualization
- c) **Model Training:** load data into memory (partially) -> train model(s) -> store the model (for expensive model computations) or store the results (for models that train quickly) (again: apply the rules for storing data)
- d) **Evaluation:** load computed results into memory -> extract and prepare data for visualization -> configure visualization parameters -> produce visualization
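
As an illustration, the analysis workflow (b) might be split into one small function per step. This is a sketch using only the standard library; the function names, the `Age` column, and the in-memory sample values are made up for the example.

```python
import csv
import io
import itertools
import statistics


def load_rows(handle, limit=None):
    """Load data (partially): read at most `limit` rows from a CSV handle."""
    reader = csv.DictReader(handle)
    return list(itertools.islice(reader, limit))


def prepare_ages(rows):
    """Restructure: keep only rows that have an age and convert it to float."""
    return [float(row["Age"]) for row in rows if row.get("Age")]


def summarize(values):
    """Extract information: a few descriptive statistics."""
    return {"count": len(values), "median": statistics.median(values)}


# Tiny in-memory stand-in for a real data file (made-up values)
sample = io.StringIO("Name,Age\nA,22\nB,\nC,38\nD,26\n")
summary = summarize(prepare_ages(load_rows(sample, limit=3)))
```

Keeping each step in its own function makes it easier to later move a prototype step into a standalone script.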

# Conda Environment Reproducibility


Why use conda or another dependency manager?
- easily manage your dependencies, e.g. a library you are using to compute models or to visualize data
- share the managed environment with your teammates
- force your team to make your work reproducible. If you check out the repository a year from now, you can quickly set up the whole experiment again, even if your computer setup has changed

The environment file we use for the Data Science Lab demonstration notebooks:
```yaml
name: sur-dsl2020
channels:
- defaults
dependencies:
- python>=3.8
- jupyter
- matplotlib
- pandas
- numpy
- scikit-learn
- seaborn
- tqdm
- pip>=20.0
- pip:
  - kaggle
```

Start by creating an *environment.yml* for conda.

*environment.yml*
```yaml
name: sur-dsl2020-corona-forecasting
channels:
- defaults
dependencies:
- python>=3.7
- jupyter
- matplotlib
- pandas
- numpy
- scikit-learn
- seaborn
```

If you have just downloaded the team repository or freshly created the *environment.yml*, run ``conda env create -f environment.yml`` in the local directory to create a python environment for your Data Science Lab experiments.
Activate the environment with ``conda activate sur-dsl2020-corona-forecasting`` (or whatever name you gave your shared environment).
Make sure not to add your local environment path (e.g. a ``prefix:`` entry as written by ``conda env export``) to the *environment.yml*; it is likely different on your teammates' machines.
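
To guard against running a notebook in the wrong environment, a check along these lines could be placed in its first cell. It is a sketch that assumes a named (not path-based) environment, for which conda sets the `CONDA_DEFAULT_ENV` variable on activation; both helper names are hypothetical.

```python
import os


def env_name_from_yaml(text):
    """Return the value of the top-level `name:` key of an environment.yml.

    A deliberately simple line-based parser so no YAML library is required.
    """
    for line in text.splitlines():
        if line.startswith("name:"):
            return line.split(":", 1)[1].strip()
    return None


def active_env_matches(expected, environ=os.environ):
    """Compare `expected` with CONDA_DEFAULT_ENV, set by `conda activate`."""
    return environ.get("CONDA_DEFAULT_ENV") == expected


# Hypothetical usage inside the repository directory:
# with open("environment.yml") as fh:
#     assert active_env_matches(env_name_from_yaml(fh.read()))
```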


If you want to add a new library to your experiment, add it first in *environment.yml*:

*environment.yml*
```yaml
name: sur-dsl2020-corona-forecasting
channels:
- defaults
dependencies:
- python>=3.7
- jupyter
- matplotlib
- pandas
- numpy
- scikit-learn
- seaborn
- tqdm
```

After adding a new library, make sure you have activated your environment and then invoke an update in your experiment working directory by calling ``conda env update`` (which uses the local *./environment.yml* to update your python-conda-environment).

You can also add pip packages (if they are not available in conda).
First check for availability in conda by calling ``conda search <package>``, e.g. ``conda search kaggle`` (at the time of writing, kaggle is not available in the default conda channel).
If it is not available there, look the package up on [PyPI](https://pypi.org/) (``pip search`` is no longer supported by PyPI). To add it to the environment, specify it as follows:

*environment.yml*
```yaml
name: sur-dsl2020-corona-forecasting
channels:
- defaults
dependencies:
- python>=3.7
- jupyter
- matplotlib
- pandas
- numpy
- scikit-learn
- seaborn
- tqdm
- pip>=20.0
- pip:
  - kaggle
```
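
After ``conda env update`` it can be useful to verify that the listed libraries are actually importable. The following is a minimal sketch; `missing_modules` is a hypothetical helper, and the list uses import names, which may differ from package names.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names, not package names: the scikit-learn package,
# for example, is imported as `sklearn`.
expected = ["pandas", "numpy", "sklearn", "seaborn", "tqdm", "kaggle"]
# print(missing_modules(expected))  # an empty list means everything is installed
```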

# From Prototyping to Polished Results
Initially you might have no idea which library to use, what kind of data is useful, or which visualization technique works best to present your results.
Jupyter notebooks (such as this one) are well suited for fast prototyping, inspecting data, inspecting functional behaviour, or creating pretty visualizations.
But your code can get quite messy after some time.
This is because you are mostly exploring new terrain.
To make your work more accessible for you and your teammates, it helps to move from prototyping to polished results from time to time.
This is inspired by agile development: you first prototype quickly and, after a few phases, go back and separate those prototypes into polished results.
A polished result could be, e.g., a routine (a short python script and an explanation) to retrieve data, or a single kind of visualization of your data (e.g. a notebook that is solely meant for creating the boxplots you intend to use in your report).

# Data retrieval routine

**Example description for the kaggle titanic dataset:**

Download the kaggle dataset:
- Invoke the command below
- Make sure kaggle is installed in your current environment. Have a look at their [GitHub page](https://github.com/Kaggle/kaggle-api)
- Make sure your kaggle credentials are stored in your home folder under *.kaggle/kaggle.json* (otherwise you might get a "*403 - Forbidden*" response). You can download the file from your account page at https://www.kaggle.com/{username}/account
- Make sure you have accepted the rules for the particular dataset, e.g. at https://www.kaggle.com/c/titanic/rules (otherwise you might also get a "*403 - Forbidden*" response)

Put your credentials into *~/.kaggle/kaggle.json* and restrict access to the file with ``chmod 600 ~/.kaggle/kaggle.json``
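
The credential setup can be verified from Python before calling the kaggle command. This is a sketch for POSIX systems; `check_credentials` is a hypothetical helper and not part of the kaggle package.

```python
import stat
from pathlib import Path


def check_credentials(path=Path.home() / ".kaggle" / "kaggle.json"):
    """Return a list of problems with the credentials file (empty if fine)."""
    path = Path(path)
    if not path.is_file():
        return ["{} does not exist".format(path)]
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:  # any permission bit set for group or others
        return ["{} has mode {:o}; run chmod 600 on it".format(path, mode)]
    return []


# print(check_credentials())  # an empty list means the file looks fine
```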

Then download the data:

In [9]:
!kaggle competitions download -c titanic

Downloading titanic.zip to /home/julian/projects/teaching/20ss-Data-Science-Lab/course-experiment-skeleton
  0%|                                               | 0.00/34.1k [00:00<?, ?B/s]
100%|███████████████████████████████████████| 34.1k/34.1k [00:00<00:00, 556kB/s]


In [1]:
!ls

README.md                  experiment-workflows.ipynb
data                       src
environment.yml            sur-dsl2020-skeleton


In [2]:
!mkdir -p data/


In [3]:
!unzip titanic.zip -d data/

unzip:  cannot find or open titanic.zip, titanic.zip.zip or titanic.zip.ZIP.
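
The `unzip` error above indicates that *titanic.zip* was not found in the current working directory, for example because the download cell was run from a different directory. As a platform-independent alternative, the extraction step could be sketched in Python; `extract_archive` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_archive(archive, target_dir="data"):
    """Extract `archive` into `target_dir` and return the extracted file names."""
    archive = Path(archive)
    if not archive.is_file():
        raise FileNotFoundError(
            "{} not found; run the download step in this directory first".format(archive))
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


# extract_archive("titanic.zip", "data")
```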


# Prototype First Data Inspection

In [5]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
data_file = 'data/train.csv'

In [7]:
# Load data into memory
df = pd.read_csv(data_file)  # df is short for dataframe, a common, but sadly generic name
df.head()

FileNotFoundError: [Errno 2] File data/train.csv does not exist: 'data/train.csv'

In [None]:
# Think about which aspect you want to visualize
# As an example, we want to get an impression of the distribution of a certain feature
sns.boxplot(x=df['Age'])  # pass the column by keyword so the plot is horizontal across seaborn versions

In [None]:
# Now we iterate on the visualization to make it visually appealing
# and usable in our report
sns.set(font_scale=1.5)
fig, ax = plt.subplots()
ax.set_title('Age distribution in original Titanic dataset')
sns.boxplot(x=df['Age'], ax=ax)

In [None]:
# We want to store the figure locally in a common dir
visualization_path = 'vis/2020-05-titanic-age-distribution.png'

Make sure you actually have a subdirectory for visualizations:

In [None]:
!mkdir -p vis/

In [None]:
# Store the figure locally (commit the figure itself to git only if it is expensive to compute; the routine that creates it is what matters)
fig.savefig(visualization_path)

In [None]:
# You may notice that the resolution of the figure is too low for your report, so increase it:
fig.savefig(visualization_path, dpi=150)

In [None]:
# Now you might notice that the bottom axis label was cut off, so we need to adjust our subplot:
sns.set(font_scale=1.5)
fig, ax = plt.subplots()
ax.set_title('Age distribution in original Titanic dataset')
sns.boxplot(x=df['Age'], ax=ax)
fig.subplots_adjust(bottom=0.2)  # We added this adjustment line
fig.savefig(visualization_path, dpi=150)

Now we can assemble everything for this particular boxplot into a separate visualization-generation routine, assuming the titanic data is available locally.
The cell below can serve as a standalone python script *visualize_titanic_age.py*; you can check that it actually works by restarting the kernel of the jupyter notebook and running the cell on its own.

In [None]:
import os
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

visualization_base_path = 'vis/'
data_file = 'data/train.csv'

# Prepare path to save to
if not os.path.exists(visualization_base_path):
    os.makedirs(visualization_base_path)
if not os.path.isdir(visualization_base_path):
    raise ValueError('<{p}> is not a valid path to store visualizations to'.format(p=visualization_base_path))

date_prefix = datetime.datetime.today().strftime("%Y-%m-%d")
visualization_name = '{prefix}-titanic-age-distribution.png'.format(prefix=date_prefix)
save_path = os.path.join(visualization_base_path, visualization_name)

# Load the titanic train data into memory
df = pd.read_csv(data_file)

# Initialize seaborn and matplotlib config
sns.set(font_scale=1.5)

# Create the boxplot on the axes we just created
fig, ax = plt.subplots()
sns.boxplot(x=df['Age'], ax=ax)

# Apply figure configs after we created the plots
fig.subplots_adjust(bottom=0.2)
ax.set_title('Age distribution in original Titanic dataset')

# Store the visualization result with high resolution
fig.savefig(save_path, dpi=150)
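
If several visualization scripts share the same date-prefixed naming scheme, the path handling could be factored into a small helper. This is a sketch only; `build_save_path` and its parameters are hypothetical names introduced for illustration.

```python
import datetime
import os


def build_save_path(base_path, name, today=None):
    """Return '<base_path>/<YYYY-MM-DD>-<name>.png', creating base_path if needed."""
    today = today or datetime.date.today()
    os.makedirs(base_path, exist_ok=True)
    file_name = "{prefix}-{name}.png".format(
        prefix=today.strftime("%Y-%m-%d"), name=name)
    return os.path.join(base_path, file_name)


# save_path = build_save_path("vis/", "titanic-age-distribution")
```

Passing `today` explicitly makes the resulting file name reproducible, which helps when a figure needs to be regenerated under its original name.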