# Welcome to Feature Factory for Biodegradability

Feature Factory is an online infrastructure for quickly prototyping and testing features for different machine learning problems.

Before you begin using Feature Factory, we highly recommend that you familiarize yourself with Jupyter Notebook. Jupyter Notebook is an interactive environment, backed by a Python kernel, that lets you run code in separate cells. Variables created by the code live in the kernel and can be accessed at any time from any cell. More information can be found [here](https://jupyter.readthedocs.io/en/latest/content-quickstart.html).

# Biodegradability Machine Learning Competition

## Problem Statement

The persistence of chemicals in the environment (or to environmental influences) is welcome only until the chemicals have fulfilled their role. After that time, or if they happen to be in the wrong place, the chemicals are considered pollutants.
In this phase of a chemical's life span, we want the chemicals to disappear as soon as possible. The most ecologically acceptable (and a very cost-effective) way of 'disappearing' is degradation to components that are not considered pollutants (e.g. mineralization of organic compounds). Degradation in the environment can take several forms, from physical pathways (erosion, photolysis, etc.), through chemical pathways (hydrolysis, oxidation, diverse chemolyses, etc.), to biological pathways (biolysis). These pathways are usually combined and interrelated, which makes degradation even more complex.

In our study we focus on biodegradation in an aqueous environment under aerobic conditions, which affects the quality of surface and groundwater.

In this competition, you will be given a dataset of chemical properties measured during a study on biodegradation in an aqueous environment under aerobic conditions, in which the water/octanol partition coefficient (LOGP) value of each molecule has been used to classify them into multiple classes.

You are challenged to work with this dataset and to identify, derive, or generate the features that help the most in predicting the LOGP class of any molecule from the rest of the measured values.


## Data

The dataset is in a relational format, split among multiple files. When you retrieve the dataset with `commands.get_sample_dataset()`, the files are provided as a list of pandas `DataFrame` objects whose columns are shown in the following data model:

![Biodegradability Data Model](https://relational.fit.cvut.cz/assets/img/datasets-generated/Biodegradability.svg)
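
Because the data is relational, most features require joining or aggregating a child table back onto the molecule table. The following is a minimal sketch of such a join using small hand-made frames in place of the real data. The column names (`molecule_id`, `atom_id`, `type`, `mweight`) are assumptions taken from the data model diagram, so check them against the `columns` attribute of the real frames before relying on them.

```python
import pandas as pd

# Toy stand-ins for the molecule and atom tables.
# The column names are assumptions; inspect the real frames to confirm them.
molecule = pd.DataFrame({
    "molecule_id": ["m1", "m2", "m3"],
    "mweight": [78.1, 46.1, 18.0],
})
atom = pd.DataFrame({
    "atom_id": ["a1", "a2", "a3", "a4"],
    "molecule_id": ["m1", "m1", "m2", "m1"],
    "type": ["c", "c", "o", "h"],
})

# A left join keeps every molecule, including those with no atom rows (m3),
# which end up with missing values in the atom columns.
joined = molecule.merge(atom, on="molecule_id", how="left")
print(joined)
```

A left join is used here so that molecules without any matching child rows are not silently dropped.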

## Begin

Execute this cell to initialize your FeatureFactory session.

In [None]:
from featurefactory.problems.biodegradability import commands

## Load sample data

Get a sample dataset. This will allow you to test your feature before running it on the full data on the server. Remember that the dataset is a list of [Pandas DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [None]:
dataset = commands.get_sample_dataset()
# dataset[0] <- this refers to the molecule data
# dataset[1] <- this refers to the atom data
# dataset[2] <- this refers to the bond data
# dataset[3] <- this refers to the gmember data
# dataset[4] <- this refers to the group data

In [None]:
dataset[0][:5]

In [None]:
dataset[1][:5]

In [None]:
dataset[2][:5]

In [None]:
dataset[3][:5]

In [None]:
dataset[4][:5]

## Example: write and register a feature

The name you give to the function is the name which will be used later on to register your feature extraction function and the score which it obtains.

Your function should take the dataset list as its only parameter and return an N x M numpy matrix or pandas DataFrame, where N is the number of molecules (one row per molecule) and M is the number of features that will be used for the prediction.
Bear in mind that row order matters: to evaluate your function's score properly, the extracted features must preserve the row order of the molecule table (`dataset[0]`).
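
As an illustration of keeping row order when a feature is aggregated from another table, the sketch below counts atoms per molecule and then realigns the counts to the molecule table. The function name `atom_count` and the `molecule_id` column are assumptions for this example, and the toy frames stand in for the real dataset.

```python
import pandas as pd

def atom_count(dataset):
    """Count atoms per molecule, aligned to the molecule table's row order."""
    import pandas as pd  # imported inside the function so it stays self-contained

    molecule, atom = dataset[0], dataset[1]
    counts = atom.groupby("molecule_id").size()
    # reindex follows the molecule table order; molecules without atoms get 0
    aligned = counts.reindex(molecule["molecule_id"].to_numpy(), fill_value=0)
    return pd.DataFrame({"atom_count": aligned.to_numpy()}, index=molecule.index)

# Toy data: the molecule table is deliberately not sorted by id.
molecule = pd.DataFrame({"molecule_id": ["m2", "m1", "m3"]})
atom = pd.DataFrame({
    "atom_id": ["a1", "a2", "a3", "a4"],
    "molecule_id": ["m1", "m1", "m2", "m1"],
})

features = atom_count([molecule, atom])
print(features)
```

Without the `reindex` step, `groupby` would return the counts sorted by id, and molecules with no atoms would be missing from the result.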

Also note that, even though the system allows you to do so, any feature extraction function which makes use of the outcome column will be disqualified.

**WARNING:** Your functions have to be self-contained!

This means that you can use helper functions or import external modules but that any import or variable definition needs to be made within the functions which use them.

Cross validation is (intentionally) run in a separate process to make sure that this scope pattern is preserved, and it will fail if the function uses anything defined elsewhere in the notebook.

You might be wondering why we require this. The reason is that the code of your function might be executed and further evaluated in different environments where the variables and modules defined in your notebook will not be available.
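
One way to check locally whether a function is self-contained is to call it with empty globals, so that nothing defined at notebook level is visible to it. The helper `run_isolated` below is a hypothetical sketch for this purpose. It is not part of Feature Factory and only approximates what running in a separate process does (for instance, it ignores default argument values and closures).

```python
import builtins
import types

def run_isolated(func, *args):
    """Call func with empty globals so notebook-level names are not visible."""
    isolated = types.FunctionType(
        func.__code__, {"__builtins__": builtins}, func.__name__
    )
    return isolated(*args)

import math  # notebook-level import, invisible inside run_isolated

def leaky(value):
    return math.log(value)  # relies on the notebook-level import

def self_contained(value):
    import math  # imported inside the function, as required
    return math.log(value)

print(run_isolated(self_contained, 1.0))
try:
    run_isolated(leaky, 1.0)
except NameError as err:
    print("not self-contained:", err)
```

A function that passes this check can still fail on the server for other reasons, so treat it as a quick sanity check rather than a guarantee.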

In [None]:
def example_feature(dataset):
    return dataset[0][['mweight']]


Evaluate the score of your feature extraction function before submitting it.

You can use the `cross_validate` command as many times as needed to preview the score your function will obtain.

In [None]:
commands.cross_validate(example_feature)


Register your function in the system

Once you are satisfied with the results, you can call the `register_feature` command passing your function as an argument.
This will `cross_validate` the function again and store your code and your score for future analysis.

Again, remember that your function code must be self-contained and must import or define everything it needs to run successfully.

In [None]:
commands.register_feature(example_feature)


Optional: Modify and update your function code.

If you discover that your function can be improved, you can add it to the system again, as many times as required, under the same function name.

However, for clarity, we recommend that you use this option only to fix problems or make small improvements within a similar approach.

If you want to start a different feature extraction strategy, we strongly recommend that you register it under a new name.

In [None]:
def imports():    # numpy must be imported from within our functions
    global np
    import numpy as np

def compute_log(feature):
    """Compute the natural log of the given column."""
    return np.log(feature)

def example_feature(dataset):
    imports()

    df = dataset[0][['mweight', 'activitynorm']].copy()
    # mweight is positive, so take its log directly (the log of a negative value is NaN)
    df['log'] = compute_log(df['mweight'])
    return df

commands.register_feature(example_feature)

## Write and register your features here