&nbsp;
&nbsp;

# Welcome to Feature Factory for Biodegradability

Feature factory is an online infrastructure that allows one to quickly prototype and test features for different machine learning problems. 

Before beginning to use Feature Factory, we highly recommend that you familiarize yourself with what IPython Notebook. IPython Notebook is an interactive python kernel that allows you to run code in different cells. Variables created by the code live in the IPython Notebook python kernel and can be accessed at any time, by any cell. More information can be found at http://ipython.org/notebook.html

# Creating your own IPython Notebook

To get started with Feature Factory, please clone the Template notebook. To do this, click "File"->"Make a Copy". This should spawn a new tab within your browser with the copied notebook. Rename the notebook to your liking and make all edits on that notebook.

&nbsp;
&nbsp;


# Biodegradability Machine Learning Competition

## Problem Statement

The persistence of chemicals in the environment (or to environmental inuences) is welcome only until the time the chemicals fulll their role. After that time or if they happen to be at the wrong place, the chemicals are considered pollutants.
In this phase of chemicals' life-span we wish that the chemicals disappear as soon as possible. The most ecologically acceptable (and a very cost-eective) way of 'disappearing' is degradation to components which are not considered pollutants (e.g. mineralization of organic compounds). Degradation in the environment can take several forms, from physical pathways (erosion, photolysis, etc.), through chemical pathways (hydrolysis, oxydation, diverse chemolises, etc.) to biological pathways (biolysis). Usually the pathways are combined and interrelated, thus making degradation even more complex.

In our study we focus on biodegradation in an aqueous environment under aerobic conditions, which affects the quality of surface and groundwater.

In this competition, you will be given a dataset of chemical properties measured during a study on biodegradation in an aqueous environment under aerobic conditions, in which the water/octanol partition coefficient (LOGP) value of each molecule has been used to classify them into multiple classes.

You are challenged to work with this dataset and attempt to identify and derive or generate the features which would help the most in predicting the logp class of any of the molecules based on the rest of values measured.


## Data

The dataset is in a relational format, split among mutliple files. When using **commands.get_sample_dataset()** to retrieve the dataset, the files are provided as a list of *pandas.DataFrame* objects with the following columns:

![Biodegradability Data Model](https://relational.fit.cvut.cz/assets/img/datasets-generated/Biodegradability.svg)

## Step-by-Step Example

Step 1: Import the feature factory infrastructure

In [1]:
from problems.biodegradability import commands

&nbsp;
&nbsp;

Step 2: Create a username/password or login into an existing account. If you create an account and it is successful, you don't need to login - you are logged in automatically. 

In [2]:
commands.create_user('a_user', 'a_password')

user successfully created


In [3]:
commands.login('a_user', 'a_password')

user successfully logged in


&nbsp;
&nbsp;

Step 3: To ensure that this notebook is mapped to your username, it is required that you execute the command below. 

In [4]:
commands.add_notebook('a_notebook_name')

Notebook a_notebook_name successfully registered


&nbsp;
&nbsp;

Step 4: Get a sample dataset. This will allow you to test your feature before running it on the full data in the server. Remember that the dataset as a list of [Pandas DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [5]:
dataset = commands.get_sample_dataset()
# dataset[0] <- this refers to the molecule data
# dataset[1] <- this refers to the atom data
# dataset[2] <- this refers to the bond data
# dataset[3] <- this refers to the gmember data
# dataset[4] <- this refers to the group data

In [6]:
dataset[0][:5]

Unnamed: 0,molecule_id,activity,logp,mweight,activitynorm,logpnorm,mweightnorm
0,i100_02_7i,4.53367,1.91,139.11,5.8e-05,0.457317,0.439024
1,i100_21_0i,4.56435,1.76,166.131,5.8e-05,0.426829,0.533537
2,i100_41_4i,5.04986,3.03,106.167,6.4e-05,0.670732,0.27439
3,i100_42_5i,6.22258,2.89,104.151,7.9e-05,0.64939,0.259146
4,i100_44_7i,6.04025,2.79,126.585,7.7e-05,0.640244,0.390244


In [7]:
dataset[1][:5]

Unnamed: 0,atom_id,molecule_id,type
0,i100_02_7_10i,i100_02_7i,c
1,i100_02_7_10_1i,i100_02_7i,h
2,i100_02_7_1i,i100_02_7i,o
3,i100_02_7_2i,i100_02_7i,n
4,i100_02_7_3i,i100_02_7i,o


In [8]:
dataset[2][:5]

Unnamed: 0,atom_id,atom_id2,type
0,i100_02_7_10i,i100_02_7_10_1i,1
1,i100_02_7_1i,i100_02_7_2i,2
2,i100_02_7_2i,i100_02_7_3i,2
3,i100_02_7_2i,i100_02_7_4i,1
4,i100_02_7_4i,i100_02_7_10i,7


In [9]:
dataset[3][:5]

Unnamed: 0,atom_id,group_id
0,i1120_71_4_1i,g0
1,i1120_71_4_2i,g0
2,i1120_71_4_3i,g0
3,i1120_71_4_4i,g0
4,i62_50_0_1i,g1


In [10]:
dataset[4][:5]

Unnamed: 0,group_id,type
0,g0,sulfo
1,g1,sulfo
2,g10,nitro
3,g100,methyl
4,g1000,c2n


&nbsp;
&nbsp;

Step 5: Define your feature extraction function.

The name you give to the function is the name which will be used later on to register your feature extaction function and the score which it obtains.

Your function should simply take in the dataset list as a parameter and output a N x M numpy matrix or pandas dataframe where N is number of molecules, one row per molecules, and M is the number of features which will be used for the prediction.
Bear in mind that sorting is important and that, in order to properly evaluate your function score, the extracted features should preserve the order of the molecules table.

Also note that, even though the system allows you to do so, any feature extraction function which makes use of the outcome column will be disqualified.

**WARNING:** Your functions have to be self contained!

This means that you can use helper functions or import external modules but that any import or variable definition needs to be made within the functions which use them.

Cross validation is (intentionally) run in a separated process in order to make sure that this scope pattern is preserved, and will fail if the function uses anything defined somewhere else in the notebook.

You might be wondering why we require this. The reason is that the code of your function might be executed and further evaluated in different environments where the variables and modules defined in your notebook will not be available.

In [11]:
def example_feature(dataset):
    return dataset[0][['mweight']]

&nbsp;
&nbsp;

Step 6: Evaluate the score of your feature extraction function before submitting it.

You can make use of the cross_validate command as many times a needed in order to have a preview of what the score of your function will be.

In [12]:
commands.cross_validate(example_feature)

Obtaining dataset
Extracting features
Cross validating


0.095583545438438722

&nbsp;
&nbsp;

Step 7: Register your function in the system

Once you are satisfied with the results, you can call the add_feature command passing your function as an argument.
This will cross_validate the function again and store your code and your score for future analysis.

Again, remember that your function code must be self contained and import or define everything it needs to be run successfully.

In [11]:
commands.add_feature(example_feature)

Obtaining dataset
Extracting features
Cross validating
Your feature example_feature scored 0.6488308585082779
Feature example_feature successfully registered


&nbsp;
&nbsp;

Step 8: (Optional) Modify and update your function code.

If you discover that your function can be improved you can add it again into the system as many times as required with the same function name.

However, for improved clarity, we recommend you to use this option only to fix problems or make small improvements within a similar approach.

So, in case you want start a different feature extraction strategy, we strongly recommend you to register it with a new name.

In [22]:
def imports():    # We need to import pandas within our functions
    global np
    import numpy as np

def compute_log(feature):
    """Compute log of the given column."""
    return np.log(feature)
    
def example_feature(dataset):
    imports()
    
    df = dataset[0][['mweight', 'activitynorm']].copy()
    df['log'] = compute_log(-df['mweight'])
    return df

commands.add_feature(example_feature)

Obtaining dataset
Extracting features
Cross validating
Your feature example_feature scored 0.8351254480286738
Feature example_feature successfully registered
