# Real-world example: Data Analysis

## Scenario
Your openBIS instance contains an EXPERIMENTAL_STEP with an associated dataset. You need to download the dataset files and analyse them. The results are written to a new dataset and uploaded to the experimental step, and a comment is appended to the step's Notes property.

We use the example [IRIS data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) and some dummy analysis code just to demonstrate the workflow.

This example shows the **interactive development process** - step by step from the first line to the complete script.

## Start: connecting to openBIS
Connect with the server URL and either username and password or a personal access token (PAT).

In [None]:
from pybis import Openbis
o = Openbis('https://schulung.datastore.bam.de')
o.login('mmusterm', 'bamisgreat')
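
To keep credentials out of the notebook, the connection settings can be read from environment variables. The following is only a minimal sketch: the variable names `OPENBIS_URL` and `OPENBIS_TOKEN` and the helper `read_openbis_settings` are hypothetical names chosen for this example, and the commented-out lines assume that your pyBIS version accepts a token via `o.set_token()`.

```python
import os

def read_openbis_settings(env=os.environ):
    """Collect connection settings from environment variables.

    OPENBIS_URL and OPENBIS_TOKEN are hypothetical names chosen for this
    sketch; they are not defined by pyBIS itself.
    """
    url = env.get('OPENBIS_URL', 'https://schulung.datastore.bam.de')
    token = env.get('OPENBIS_TOKEN')
    if not token:
        raise RuntimeError('OPENBIS_TOKEN is not set')
    return url, token

# Intended use (not executed here, it needs a reachable server):
# url, token = read_openbis_settings()
# o = Openbis(url)
# o.set_token(token)   # assumption: available in your pyBIS version
```

Passing the environment as a parameter makes the helper easy to try out with a plain dictionary instead of the real environment.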

## OPTIONAL: create dummy data for this example
This example needs an EXPERIMENTAL_STEP with the IRIS data attached as a RAW_DATA dataset to work. If you don't have this already, you may create it with the following code.

In [None]:
SPACE='MMUSTERM'
PROJECT='PYBIS_ANALYSIS'
COLLECTION='IRIS_STEPS'
OBJECT='IRIS_ANALYSIS'

space = o.get_space(SPACE)
try:
    proj = space.get_project(PROJECT)
except ValueError:
    proj=o.new_project(space=SPACE, code=PROJECT, description='just for learning pyBIS')
    proj.save()
try:
    coll = space.get_collection(COLLECTION)
except ValueError:
    coll=o.new_collection(project=proj, code=COLLECTION, type='DEFAULT_EXPERIMENT')
    coll.save()
steps = space.get_objects(project=proj, type='EXPERIMENTAL_STEP', code=OBJECT)
if not steps:
    step =  o.new_object(type='EXPERIMENTAL_STEP', space=space, collection=coll, 
        code=OBJECT, props={
        '$name': 'My IRIS analysis',
        'experimental_step.experimental_description': 'handling the well known data set',
        'notes': 'experimental step created'
    })
    step.save()
else:
    step = steps[0]
datasets = step.get_datasets(type='RAW_DATA')
if not datasets:
    import requests
    # download the IRIS CSV and stop early if the request failed
    resp = requests.get('https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv')
    resp.raise_for_status()
    with open('iris.csv', 'w') as csvfile:
        csvfile.write(resp.text)
    dataset = o.new_dataset(
        type = 'RAW_DATA',
        collection = coll,
        object = step,
        files = ['iris.csv']
    )
    dataset.save()

## Analyse the IRIS data - step by step

### Search all EXPERIMENTAL_STEPs in project

In [None]:
SPACE='MMUSTERM'
PROJECT='PYBIS_ANALYSIS'
o.get_objects(space=SPACE, project=PROJECT, type='EXPERIMENTAL_STEP')

### Get list of steps to analyse
Search only for EXPERIMENTAL_STEPs that have datasets of type RAW_DATA but none of type ANALYZED_DATA.

In [None]:
steps_to_analyse = []
for step in o.get_objects(space=SPACE, project=PROJECT, type='EXPERIMENTAL_STEP'):
    raw = step.get_datasets(type='RAW_DATA')
    res = step.get_datasets(type='ANALYZED_DATA')
    if raw and not res:
        steps_to_analyse.append(step)
if steps_to_analyse:
    print([s.code for s in steps_to_analyse])
    step = steps_to_analyse[0]
else:
    print('Nothing to do - have a break!')
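
The selection rule above can also be written as a small predicate, which makes it easy to check without a server. This is only a sketch: `needs_analysis` and the stand-in class `FakeStep` are hypothetical names introduced for illustration. `FakeStep` mimics nothing more than the `code` attribute and the `get_datasets(type=...)` call used in this notebook.

```python
def needs_analysis(step):
    """True if the step has RAW_DATA but no ANALYZED_DATA dataset yet."""
    raw = step.get_datasets(type='RAW_DATA')
    res = step.get_datasets(type='ANALYZED_DATA')
    return bool(raw) and not res

class FakeStep:
    """Hypothetical stand-in for a pyBIS object, for local testing only."""
    def __init__(self, code, dataset_types):
        self.code = code
        self._dataset_types = dataset_types

    def get_datasets(self, type):
        # return only the "datasets" of the requested type
        return [t for t in self._dataset_types if t == type]

fake_steps = [
    FakeStep('STEP_A', ['RAW_DATA']),
    FakeStep('STEP_B', ['RAW_DATA', 'ANALYZED_DATA']),
    FakeStep('STEP_C', []),
]
todo = [s.code for s in fake_steps if needs_analysis(s)]
print(todo)
```

Only the step that has raw data and no results yet is selected. With a live connection the same predicate could be applied to the objects returned by `o.get_objects(...)`.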

### Download dataset to local computer

In [None]:
dsraw = step.get_datasets(type='RAW_DATA')[0]
folder = dsraw.download(destination='raw/', create_default_folders=False)

### Read and process CSV content
This is really just a placeholder for real code - it does nothing useful, it just computes the mean value of two columns.

In [None]:
import csv
import os

num_lines = 0
slength_sum = 0
plength_sum = 0

# os.path.join works whether or not `folder` ends with a path separator
with open(os.path.join(folder, 'iris.csv'), 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    next(csvreader) # just skip the header line
    for row in csvreader:
        slength_sum += float(row[0])  # column 0: sepal.length
        plength_sum += float(row[2])  # column 2: petal.length
        num_lines += 1
slength_med = slength_sum / num_lines
plength_med = plength_sum / num_lines
print('mean sepal.length: %f, mean petal.length: %f' % (slength_med, plength_med))
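
As an alternative to counting rows and summing by column index, the columns can be addressed by name. The sketch below uses `csv.DictReader` and `statistics.mean` from the standard library. It runs on a tiny inline sample instead of the downloaded file, and it assumes the header names `sepal.length` and `petal.length`. The helper `column_means` is a hypothetical name for this example.

```python
import csv
import io
from statistics import mean

# Tiny inline sample standing in for iris.csv (header names are an assumption)
sample = io.StringIO(
    '"sepal.length","sepal.width","petal.length","petal.width","variety"\n'
    '5.1,3.5,1.4,0.2,"Setosa"\n'
    '7.0,3.2,4.7,1.4,"Versicolor"\n'
    '6.3,3.3,6.0,2.5,"Virginica"\n'
)

def column_means(csvfile, columns):
    """Return {column name: mean value} for the given numeric columns."""
    rows = list(csv.DictReader(csvfile))
    return {col: mean(float(row[col]) for row in rows) for col in columns}

means = column_means(sample, ['sepal.length', 'petal.length'])
print(means)
```

Addressing columns by name keeps the code working if the column order of the file changes. For the real file you would pass the opened `iris.csv` instead of `sample`.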

### Write results into file

In [None]:
with open('iris_results.txt', 'w') as resfile:
    resfile.write('mean sepal.length: %f\n' % slength_med)
    resfile.write('mean petal.length: %f\n' % plength_med)

### Upload result file as new dataset

In [None]:
dataset = o.new_dataset(
    type = 'ANALYZED_DATA',
    object = step,
    files = ['iris_results.txt']
)
dataset.save()

Now go back and run the search again for steps that still need to be analysed - this checks that our search is working.

### Create a plot
For this step you need the packages pandas and matplotlib installed or you will get an error. Try installing with `pip install matplotlib` in your Anaconda prompt and wait for completion.
Now create and view a plot of the IRIS data.

In [None]:
import os
import pandas
# os.path.join works whether or not `folder` ends with a path separator
df = pandas.read_csv(os.path.join(folder, 'iris.csv'))
iris_plot = df.plot()

### Save and upload plot as a preview image for the experimental step
A dataset of type ELN_PREVIEW will be used by the ELN to show an image in the entity's preview.

In [None]:
iris_plot.get_figure().savefig('iris_plot.png')
preview_dataset = o.new_dataset(
    type = 'ELN_PREVIEW',
    object = step,
    files = ['iris_plot.png']
)
preview_dataset.save()

### Append a note to the experimental step

In [None]:
# an unset property would be returned as None, so fall back to an empty string
notes = step.props['notes'] or ''
step.props['notes'] = notes+'<p>Data analysed via <b>pyBIS</b>!</p>'
step.save()
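
Appending to a property that has never been set is a common stumbling block, because there is nothing to concatenate to. A small helper can make the intent explicit. This is only a sketch: `append_note` is a hypothetical name, and it assumes that an unset Notes property is returned as `None`.

```python
def append_note(existing, addition):
    """Return the existing notes with `addition` appended.

    Treats None (assumed value of an unset property) like an empty string.
    """
    return (existing or '') + addition

note = '<p>Data analysed via <b>pyBIS</b>!</p>'
print(append_note(None, note))
print(append_note('experimental step created', note))
```

With a real step this could be used as `step.props['notes'] = append_note(step.props['notes'], note)` followed by `step.save()`.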

### Don't forget to log out!

In [None]:
o.logout()

## Putting it all together - the complete script
Now we combine all of the code above into a cell/script that can be used standalone. Some cosmetic changes include:
* move all imports to the top
* move adjustable settings near the top
* move analysis code into a separate function
* combine handling of experimental steps in the main loop

In [None]:
from pybis import Openbis
import pandas
import csv
import os

SPACE='MMUSTERM'
PROJECT='PYBIS_ANALYSIS'

# connect and login - you should use a PAT instead
o = Openbis('https://schulung.datastore.bam.de')
o.login('mmusterm', 'bamisgreat')

# just a separate function for data analysis
def analyse_iris(folder):
    num_lines = 0
    slength_sum = 0
    plength_sum = 0
    # os.path.join works whether or not `folder` ends with a path separator
    with open(os.path.join(folder, 'iris.csv'), 'r') as csvfile:
        csvreader = csv.reader(csvfile)
        next(csvreader) # just skip the header line
        for row in csvreader:
            slength_sum += float(row[0])  # column 0: sepal.length
            plength_sum += float(row[2])  # column 2: petal.length
            num_lines += 1
    slength_med = slength_sum / num_lines
    plength_med = plength_sum / num_lines
    print('mean sepal.length: %f, mean petal.length: %f' % (slength_med, plength_med))
    return slength_med, plength_med

for step in o.get_objects(space=SPACE, project=PROJECT, type='EXPERIMENTAL_STEP'):
    raw = step.get_datasets(type='RAW_DATA')
    res = step.get_datasets(type='ANALYZED_DATA')
    if raw and not res:
        print('Processing: %s' % step.code)
        # download raw data
        dsraw = raw[0]
        folder = dsraw.download(destination='raw/', create_default_folders=False)
        # analyse the data and write the results into a file
        slength_med, plength_med = analyse_iris(folder)
        with open('iris_results.txt', 'w') as resfile:
            resfile.write('mean sepal.length: %f\n' % slength_med)
            resfile.write('mean petal.length: %f\n' % plength_med)
        # upload the result file as a new dataset
        dataset = o.new_dataset(
            type = 'ANALYZED_DATA',
            object = step,
            files = ['iris_results.txt']
        )
        dataset.save()
        # create a plot and upload it as preview image
        df = pandas.read_csv(os.path.join(folder, 'iris.csv'))
        iris_plot = df.plot()
        iris_plot.get_figure().savefig('iris_plot.png')
        preview_dataset = o.new_dataset(
            type = 'ELN_PREVIEW',
            object = step,
            files = ['iris_plot.png']
        )
        preview_dataset.save()
        # append a note; an unset property would be returned as None
        notes = step.props['notes'] or ''
        step.props['notes'] = notes+'<p>Data analysed via <b>pyBIS</b>!</p>'
        step.save()
    else:
        print('Skipping: %s' % step.code)
o.logout()