<a href="https://colab.research.google.com/github/AI-Junction/Personalized-Medicine/blob/master/kernel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**About Human Protein Atlas Image Classification**

Here, the objective is to develop ML models capable of classifying mixed patterns of proteins in microscope images. The Human Protein Atlas will use these models to build a tool integrated with their smart-microscopy system to identify a protein's location(s) from a high-throughput image. Proteins are “the doers” in the human cell, executing many functions that together enable life. In order to fully understand the complexity of the human cell, models must classify mixed patterns of proteins across a range of different human cells.

Images visualizing proteins in cells are commonly used for biomedical research, and these cells could hold the key for the next breakthrough in medicine. However, thanks to advances in high-throughput microscopy, these images are generated at a far greater pace than what can be manually evaluated. Therefore, the need is greater than ever for automating biomedical image analysis to accelerate the understanding of human cells and disease.

In short, the goal is to predict the protein localization patterns visible in cellular images. There are 28 different target labels, and several of them can apply to one image (multilabel classification).

**Data**

A) Each image is provided in two versions:  
i) a scaled set of 512x512 PNG files in train.zip and test.zip;  
ii) the full-size original images (a mix of 2048x2048 and 3072x3072 TIFF files) in train_full_size.7z and test_full_size.7z (warning: these are ~250 GB in total).

B) The training labels are given in train.csv, and the filenames for the test set are listed in sample_submission.csv.

The data comes in two parts. First, the labels for each sample are provided in train.csv.

Second, the bulk of the data is in the images: train.zip and test.zip.  
Within each of these is a folder containing four files per sample.  
Each file represents a different filter on the subcellular protein patterns represented by the sample.  
The files are named [filename]_[filter color].png for the PNG files and [filename]_[filter color].tif for the TIFF files.
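
The naming scheme above can be sketched as a small helper that builds the four file paths for one sample. The function name `channel_paths` and the directory `../input/train` are assumptions made for illustration; only the `[filename]_[filter color]` pattern and the four colors come from the description.

```python
from pathlib import Path

# The four filter colors named in the dataset description.
FILTER_COLORS = ("red", "green", "blue", "yellow")


def channel_paths(sample_id, image_dir, extension="png"):
    """Return {color: path} for the four files that make up one sample."""
    image_dir = Path(image_dir)
    return {
        color: image_dir / f"{sample_id}_{color}.{extension}"
        for color in FILTER_COLORS
    }


# "../input/train" is a hypothetical location for the extracted train.zip.
paths = channel_paths("00008af0-bad0-11e8-b2b8-ac1f6b6435d0", "../input/train")
for color, path in paths.items():
    print(color, path.name)
```

Passing `extension="tif"` yields the names used by the full-size TIFF files.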

**Prediction**

The model should predict protein organelle localization labels for each sample.  
There are 28 different labels in total. The dataset comprises 27 different cell types of highly different morphology, which affects the protein patterns of the different organelles.  
All image samples are represented by four filters (stored as individual files) - the protein of interest (green) plus three cellular landmarks: nucleus (blue), microtubules (red), endoplasmic reticulum (yellow). The green filter should hence be used to predict the label, and the other filters are used as references.
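
One common way to hand all four filters to a model is to stack them into a single multi-channel array. The sketch below uses synthetic arrays in place of real image files, and the channel order (red, green, blue, yellow) is an arbitrary choice made here, not something the dataset prescribes.

```python
import numpy as np


def stack_channels(red, green, blue, yellow):
    """Stack four single-channel images into one (H, W, 4) array."""
    return np.stack([red, green, blue, yellow], axis=-1)


# Synthetic 512x512 stand-ins for the four filter images of one sample.
rng = np.random.default_rng(0)
channels = [rng.integers(0, 256, size=(512, 512), dtype=np.uint8) for _ in range(4)]

image = stack_channels(*channels)
print(image.shape)  # (512, 512, 4)
```

With this order the protein of interest (green) sits at index 1 of the last axis.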

The labels are represented as integers that map to the following:

0.  Nucleoplasm  
1.  Nuclear membrane   
2.  Nucleoli   
3.  Nucleoli fibrillar center   
4.  Nuclear speckles   
5.  Nuclear bodies   
6.  Endoplasmic reticulum   
7.  Golgi apparatus   
8.  Peroxisomes   
9.  Endosomes   
10.  Lysosomes   
11.  Intermediate filaments   
12.  Actin filaments   
13.  Focal adhesion sites   
14.  Microtubules   
15.  Microtubule ends   
16.  Cytokinetic bridge   
17.  Mitotic spindle   
18.  Microtubule organizing center   
19.  Centrosome   
20.  Lipid droplets   
21.  Plasma membrane   
22.  Cell junctions   
23.  Mitochondria   
24.  Aggresome   
25.  Cytosol   
26.  Cytoplasmic bodies   
27.  Rods & rings 
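
For convenience, the mapping above can be kept as a Python list whose index is the integer label. The helper `decode_labels` is a hypothetical name, and it assumes labels arrive as a space-separated string, as in the submission format.

```python
# Index in this list == integer label from the table above.
LABEL_NAMES = [
    "Nucleoplasm", "Nuclear membrane", "Nucleoli", "Nucleoli fibrillar center",
    "Nuclear speckles", "Nuclear bodies", "Endoplasmic reticulum",
    "Golgi apparatus", "Peroxisomes", "Endosomes", "Lysosomes",
    "Intermediate filaments", "Actin filaments", "Focal adhesion sites",
    "Microtubules", "Microtubule ends", "Cytokinetic bridge", "Mitotic spindle",
    "Microtubule organizing center", "Centrosome", "Lipid droplets",
    "Plasma membrane", "Cell junctions", "Mitochondria", "Aggresome",
    "Cytosol", "Cytoplasmic bodies", "Rods & rings",
]


def decode_labels(target):
    """Turn a space-separated label string such as "0 25" into label names."""
    return [LABEL_NAMES[int(token)] for token in target.split()]


print(decode_labels("0 25"))  # ['Nucleoplasm', 'Cytosol']
```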

**Data fields**

Id - the base filename of the sample. As noted above, all samples consist of four files: blue, green, red, and yellow.


**Evaluation**

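Submissions are evaluated on the macro F1 score, that is, the F1 score computed per label and then averaged over all labels.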

The F1 score is the harmonic mean of precision and recall. Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.

F1 score

![Imgur](https://i.imgur.com/PgPjd1X.jpg)

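To compute the F1 score, call an f1_score() function such as the one in scikit-learn (sklearn.metrics.f1_score), passing average='macro' for the macro-averaged variant.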
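
To make the macro averaging concrete, here is a minimal NumPy sketch that computes F1 per label and then takes the plain mean. The function name `macro_f1` and the toy arrays are illustrative, and scoring a label with no true and no predicted positives as 0 is an assumption of this sketch. On this example it should agree with scikit-learn's `f1_score(y_true, y_pred, average="macro")`.

```python
import numpy as np


def macro_f1(y_true, y_pred):
    """Macro F1 for multi-hot arrays of shape (n_samples, n_labels)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0)   # true positives per label
    fp = (~y_true & y_pred).sum(axis=0)  # false positives per label
    fn = (y_true & ~y_pred).sum(axis=0)  # false negatives per label
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); labels with an empty denominator score 0 here.
    f1_per_label = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return f1_per_label.mean()


# Toy example with 3 samples and 3 labels (real data has 28 labels).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
print(macro_f1(y_true, y_pred))
```

The per-label scores here are 1, 2/3 and 0, so the macro average is 5/9. The missed third label drags the score down as much as a frequent label would, which is why rare classes matter under this metric.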

For each Id in the test set, predict a class for the Target variable. Note that multiple labels can be predicted for each sample.

The submission file should contain a header and have the following format:

Id,Predicted  
00008af0-bad0-11e8-b2b8-ac1f6b6435d0,0 1  
0000a892-bacf-11e8-b2b8-ac1f6b6435d0,2 3  
0006faa6-bac7-11e8-b2b7-ac1f6b6435d0,0  
0008baca-bad7-11e8-b2b9-ac1f6b6435d0,0  
000cce7e-bad4-11e8-b2b8-ac1f6b6435d0,0  
00109f6a-bac8-11e8-b2b7-ac1f6b6435d0,1 28
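
As a sketch of how such a file could be produced, the helpers below turn multi-hot prediction rows into space-separated label strings and prepend the header. The names `format_prediction` and `submission_lines` are hypothetical, and the prediction rows are shortened to four labels for readability.

```python
def format_prediction(multi_hot):
    """Convert a multi-hot row (e.g. [1, 1, 0, 0]) to a string like "0 1"."""
    return " ".join(str(i) for i, flag in enumerate(multi_hot) if flag)


def submission_lines(ids, predictions):
    """Build the CSV lines, header included, from ids and multi-hot rows."""
    lines = ["Id,Predicted"]
    for sample_id, row in zip(ids, predictions):
        lines.append(f"{sample_id},{format_prediction(row)}")
    return lines


ids = [
    "00008af0-bad0-11e8-b2b8-ac1f6b6435d0",
    "0006faa6-bac7-11e8-b2b7-ac1f6b6435d0",
]
predictions = [[1, 1, 0, 0], [1, 0, 0, 0]]  # truncated to 4 labels for brevity

lines = submission_lines(ids, predictions)
print("\n".join(lines))
```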

Set fit_baseline and/or fit_improved_baseline of the KernelSettings class to False if you don't want to wait for the computation:

In [0]:
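class KernelSettings:
    """Switches that control which (slow) model fits are run in this kernel."""

    def __init__(self, fit_baseline=False, fit_improved_baseline=False):
        # Set a flag to True to fit that model; False skips the computation.
        self.fit_baseline = fit_baseline
        self.fit_improved_baseline = fit_improved_baseline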

In [0]:
kernelsettings = KernelSettings(fit_baseline=False, fit_improved_baseline=False)

In [0]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [0]:
train_labels = pd.read_csv("../input/train.csv")
train_labels.head()
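
Once the labels are loaded, a typical next step is to turn each label string into a multi-hot vector. The sketch below uses a toy DataFrame instead of the real file, and it assumes that the `Target` column of train.csv holds space-separated integer labels, mirroring the submission format. The ids and the helper name `to_multi_hot` are made up for illustration.

```python
import numpy as np
import pandas as pd

N_LABELS = 28  # number of labels listed in the description


def to_multi_hot(target, n_labels=N_LABELS):
    """Encode a string like "0 25" as a 0/1 vector of length n_labels."""
    vec = np.zeros(n_labels, dtype=np.int8)
    vec[[int(token) for token in str(target).split()]] = 1
    return vec


# Toy stand-in for train_labels; the real frame comes from pd.read_csv above.
toy_labels = pd.DataFrame({
    "Id": ["sample_a", "sample_b", "sample_c"],
    "Target": ["0 25", "7", "0 2 21"],
})

y = np.stack([to_multi_hot(t) for t in toy_labels["Target"]])
label_counts = y.sum(axis=0)  # how often each label occurs

print(y.shape)          # (3, 28)
print(label_counts[0])  # 2
```

The resulting matrix `y` has one row per sample and one column per label, which is the shape a multilabel classifier and the macro F1 metric both expect.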