# Week 09 Hands-on
In this week's hands-on, we will work with Kaggle, one of the largest data science community platforms. We will join a Kaggle competition by building a model that predicts MoA (Mechanisms of Action) in drug development. A general guide to what you need to do:
1. Register on Kaggle with your full name (if you don't have an account yet),
2. Download the dataset,
3. Build a model,
4. Run inference on the given test data,
5. Submit the inference result.

Competition link: [click here](https://www.kaggle.com/c/lish-moa/overview)

Submission:
1. This Jupyter notebook: it must contain at least three blocks of code, namely data preparation, modelling, and inference. You are free to modify this structure, e.g., by further breaking down the data preparation block into EDA and data preprocessing.
2. The CSV file that you submitted to the competition.
3. A screenshot of your position on the leaderboard (jpg file).

Zip the three files above into a file named "W09_your-student-id_your-name.zip" and submit it to the course portal. If the zip file exceeds the allowed size, upload it elsewhere first (e.g., Google Drive), then upload a txt file containing the download URL to the course portal. In that case, please make sure that the URL is publicly accessible.

# Data Preparation

## Exploratory Data Analysis

In [85]:
import pandas as pd

# Observe training features
train_features = pd.read_csv("dataset/train_features.csv")
train_features.head()

Unnamed: 0,sig_id,cp_type,cp_time,cp_dose,g-0,g-1,g-2,g-3,g-4,g-5,...,c-90,c-91,c-92,c-93,c-94,c-95,c-96,c-97,c-98,c-99
0,id_000644bb2,trt_cp,24,D1,1.062,0.5577,-0.2479,-0.6208,-0.1944,-1.012,...,0.2862,0.2584,0.8076,0.5523,-0.1912,0.6584,-0.3981,0.2139,0.3801,0.4176
1,id_000779bfc,trt_cp,72,D1,0.0743,0.4087,0.2991,0.0604,1.019,0.5207,...,-0.4265,0.7543,0.4708,0.023,0.2957,0.4899,0.1522,0.1241,0.6077,0.7371
2,id_000a6266a,trt_cp,48,D1,0.628,0.5817,1.554,-0.0764,-0.0323,1.239,...,-0.725,-0.6297,0.6103,0.0223,-1.324,-0.3174,-0.6417,-0.2187,-1.408,0.6931
3,id_0015fd391,trt_cp,48,D1,-0.5138,-0.2491,-0.2656,0.5288,4.062,-0.8095,...,-2.099,-0.6441,-5.63,-1.378,-0.8632,-1.288,-1.621,-0.8784,-0.3876,-0.8154
4,id_001626bd3,trt_cp,72,D2,-0.3254,-0.4009,0.97,0.6919,1.418,-0.8244,...,0.0042,0.0048,0.667,1.069,0.5523,-0.3031,0.1094,0.2885,-0.3786,0.7125


The dataset contains gene features (named 'g-*') and cell features (named 'c-*').

In [86]:
# Get the number of rows and columns
train_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23814 entries, 0 to 23813
Columns: 876 entries, sig_id to c-99
dtypes: float64(872), int64(1), object(3)
memory usage: 159.2+ MB


In [87]:
# Observe the non-scored training targets
train_targets_nonbinary = pd.read_csv("dataset/train_targets_nonscored.csv")
train_targets_nonbinary.head()

Unnamed: 0,sig_id,abc_transporter_expression_enhancer,abl_inhibitor,ace_inhibitor,acetylcholine_release_enhancer,adenosine_deaminase_inhibitor,adenosine_kinase_inhibitor,adenylyl_cyclase_inhibitor,age_inhibitor,alcohol_dehydrogenase_inhibitor,...,ve-cadherin_antagonist,vesicular_monoamine_transporter_inhibitor,vitamin_k_antagonist,voltage-gated_calcium_channel_ligand,voltage-gated_potassium_channel_activator,voltage-gated_sodium_channel_blocker,wdr5_mll_interaction_inhibitor,wnt_agonist,xanthine_oxidase_inhibitor,xiap_inhibitor
0,id_000644bb2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,id_000779bfc,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,id_000a6266a,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,id_0015fd391,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,id_001626bd3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [88]:
# Observe the scored training targets
train_targets = pd.read_csv("dataset/train_targets_scored.csv")
train_targets.head()

Unnamed: 0,sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,...,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
0,id_000644bb2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,id_000779bfc,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,id_000a6266a,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,id_0015fd391,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,id_001626bd3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The training features have 876 columns, and the scored targets have 207 columns (`sig_id` plus one column per target label). Each row can therefore have several active labels at once.
#### Therefore, a multilabel classifier is needed. PCA may also be needed if training takes too long.
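As a rough illustration of the idea, the sketch below chains PCA with a multilabel classifier in scikit-learn. It is not the model used in this notebook: the data is synthetic (generated with `make_multilabel_classification`), and the choices of 10 principal components and one logistic regression per label are assumptions made only for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the MoA data: 500 rows, 40 features, 5 binary labels.
X, Y = make_multilabel_classification(
    n_samples=500, n_features=40, n_classes=5, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Scale the features, reduce them to 10 principal components (assumed value),
# then fit one binary logistic regression per label.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, random_state=0),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(X_train, Y_train)

# One probability per (row, label) pair.
probabilities = model.predict_proba(X_test)
print(probabilities.shape)
```

Putting the scaler and PCA inside the pipeline means they are fitted on the training split only, so no information from the test rows leaks into the projection.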

In [89]:
# Detect existence of missing values
print(train_features.isnull().values.any())
print(train_targets_nonbinary.isnull().values.any())
print(train_targets.isnull().values.any())

False
False
False


None of these dataframes contains missing values.

In [90]:
# Analyze gene features and cell features:
train_features_list = train_features.columns.to_list()
g_list = [i for i in train_features_list if i.startswith('g-')]
c_list = [i for i in train_features_list if i.startswith('c-')]

print('Number of gene features: ', len(g_list))
print('Number of cell features: ', len(c_list))

Number of gene features:  772
Number of cell features:  100


In [91]:
# Check pairwise correlations among cp_time, the gene features and the cell features
correlation_columns = ['cp_time'] + g_list + c_list

# Collect every column that appears in at least one pair with |r| > 0.9
highly_correlated_columns = []
for i in range(len(correlation_columns)):
    for j in range(i + 1, len(correlation_columns)):
        corr_i = train_features[correlation_columns[i]]
        corr_j = train_features[correlation_columns[j]]

        if abs(corr_i.corr(corr_j)) > 0.9:
            highly_correlated_columns.extend([correlation_columns[i], correlation_columns[j]])

In [95]:
# Remove duplicates, since a column can appear in more than one correlated pair
highly_correlated_columns = list(set(highly_correlated_columns))
print('Number of highly correlated columns: ', len(highly_correlated_columns))

Number of highly correlated columns:  35
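The pairwise loop above computes one correlation at a time, which is slow for hundreds of columns. A possible vectorized alternative is sketched below on a small toy frame (the column names `a`, `b`, `c` and the data are made up for the example): `DataFrame.corr()` builds the whole matrix in one call, and a boolean upper-triangle mask keeps each pair once.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_features:
# "b" is a noisy copy of "a", while "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

# Absolute Pearson correlations for all column pairs in one call.
abs_corr = toy.corr().abs()

# Keep only the strict upper triangle so that each pair is considered once
# and the diagonal (always 1.0) is ignored.
upper = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
pairs = abs_corr.where(upper).stack()

# Columns that take part in at least one pair above the threshold.
strong_pairs = pairs[pairs > 0.9]
highly_correlated = sorted(
    set(strong_pairs.index.get_level_values(0))
    | set(strong_pairs.index.get_level_values(1))
)
print(highly_correlated)
```

On the real data, the same steps would be applied to `train_features[correlation_columns]` instead of `toy`.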


In [97]:
# Visualize the correlation matrix
corr_matrix_df = train_features[highly_correlated_columns]
corr_matrix_df.corr().style.background_gradient(cmap='coolwarm')

Unnamed: 0,g-50,c-94,c-55,c-52,c-40,c-26,c-82,c-54,c-90,c-33,c-73,c-66,c-6,c-62,c-1,c-75,c-11,c-93,c-38,c-2,c-63,c-8,c-17,g-37,c-81,c-51,c-96,c-60,c-10,c-42,c-4,c-13,c-31,c-85,c-18
g-50,1.0,0.720132,0.694504,0.726304,0.713067,0.770037,0.71879,0.710512,0.714887,0.75082,0.722235,0.693877,0.747266,0.691721,0.704723,0.721922,0.714645,0.723334,0.761689,0.716797,0.739374,0.714863,0.722449,0.907061,0.702472,0.721143,0.707551,0.745132,0.722228,0.742687,0.709274,0.742446,0.718473,0.706089,0.739992
c-94,0.720132,1.0,0.906384,0.89438,0.902148,0.899595,0.884253,0.890543,0.895854,0.900671,0.908041,0.893508,0.90564,0.893984,0.88675,0.900839,0.903916,0.882516,0.914368,0.906584,0.89263,0.890667,0.892172,0.724623,0.880354,0.893671,0.894631,0.908084,0.894825,0.908149,0.900309,0.914001,0.894344,0.885481,0.884627
c-55,0.694504,0.906384,1.0,0.899908,0.900873,0.884392,0.883645,0.891262,0.910217,0.881111,0.875227,0.880209,0.896212,0.895995,0.890182,0.887904,0.914637,0.873325,0.908992,0.911787,0.891555,0.886785,0.865574,0.694617,0.894636,0.880704,0.893962,0.886045,0.878275,0.905724,0.911288,0.898549,0.900367,0.87795,0.892813
c-52,0.726304,0.89438,0.899908,1.0,0.900736,0.901802,0.895962,0.892722,0.904919,0.890492,0.880502,0.904704,0.900672,0.903559,0.888692,0.894957,0.899381,0.872864,0.899129,0.905652,0.891816,0.894026,0.877124,0.726444,0.903354,0.898339,0.898668,0.883315,0.893063,0.924619,0.913649,0.899093,0.907514,0.895606,0.884884
c-40,0.713067,0.902148,0.900873,0.900736,1.0,0.888582,0.892649,0.89171,0.891964,0.885288,0.88719,0.894928,0.898693,0.895271,0.891745,0.884119,0.899519,0.86988,0.894536,0.909829,0.884208,0.889921,0.867642,0.715617,0.898611,0.89388,0.894214,0.887611,0.8936,0.898678,0.904528,0.906546,0.892917,0.882834,0.888833
c-26,0.770037,0.899595,0.884392,0.901802,0.888582,1.0,0.893044,0.878143,0.890534,0.891325,0.889977,0.89049,0.897879,0.874344,0.88723,0.880743,0.888839,0.868613,0.90687,0.899897,0.897345,0.884271,0.870553,0.765286,0.883872,0.887327,0.886232,0.890189,0.89625,0.900929,0.900002,0.921875,0.881954,0.870699,0.898464
c-82,0.71879,0.884253,0.883645,0.895962,0.892649,0.893044,1.0,0.892578,0.875975,0.897037,0.892608,0.907802,0.891495,0.882225,0.868565,0.868451,0.88377,0.862628,0.88502,0.892636,0.885913,0.88508,0.873824,0.723503,0.87165,0.900705,0.889346,0.875511,0.909512,0.910847,0.902625,0.901745,0.8877,0.886348,0.891688
c-54,0.710512,0.890543,0.891262,0.892722,0.89171,0.878143,0.892578,1.0,0.889137,0.88206,0.89858,0.891012,0.882746,0.880517,0.868874,0.87894,0.897029,0.869755,0.885917,0.897013,0.882656,0.883991,0.883215,0.710089,0.885038,0.892608,0.886412,0.873961,0.88842,0.903329,0.903751,0.899258,0.884709,0.881674,0.880311
c-90,0.714887,0.895854,0.910217,0.904919,0.891964,0.890534,0.875975,0.889137,1.0,0.887176,0.889224,0.8863,0.896115,0.890881,0.875405,0.884454,0.895012,0.877901,0.886021,0.902964,0.871469,0.881107,0.880831,0.715893,0.870061,0.889531,0.888795,0.890114,0.885242,0.903708,0.895967,0.905103,0.898068,0.884886,0.864383
c-33,0.75082,0.900671,0.881111,0.890492,0.885288,0.891325,0.897037,0.88206,0.887176,1.0,0.897423,0.889654,0.91473,0.876369,0.86613,0.878771,0.88436,0.884674,0.897265,0.890263,0.89364,0.866629,0.903937,0.753274,0.8532,0.886033,0.900685,0.902146,0.897416,0.900855,0.89119,0.902633,0.88614,0.887822,0.878634


Since train_targets_scored.csv has 207 columns and most of the values shown above are 0, the targets are likely to be imbalanced.

In [93]:
# Check the distribution of targets

# Count the active targets in each row (astype(bool) is safe since the values are only 0 and 1)
train_targets_distribution = train_targets.drop(['sig_id'], axis=1).astype(bool).sum(axis=1).reset_index()
# After reset_index, the first column holds the row index and the second the activation count
train_targets_distribution.columns = ['number of rows', 'number of activations']

# Count how many rows share each activation count
train_targets_distribution = train_targets_distribution.groupby(['number of activations'])['number of rows'].count().reset_index()
print(train_targets_distribution)

   number of activations  number of rows
0                      0            9367
1                      1           12532
2                      2            1538
3                      3             303
4                      4              55
5                      5              13
6                      7               6


In [94]:
# Express each activation count as a percentage of all rows
total_rows = sum(train_targets_distribution['number of rows'])
for i in range(len(train_targets_distribution)):
    percentage = (train_targets_distribution['number of rows'][i] / total_rows) * 100
    print('Percentage of', train_targets_distribution['number of activations'][i], ': ', percentage, '%')

Percentage of 0 :  39.33400520702108 %
Percentage of 1 :  52.624506592760554 %
Percentage of 2 :  6.458385823465189 %
Percentage of 3 :  1.2723607961703198 %
Percentage of 4 :  0.23095658016292936 %
Percentage of 5 :  0.05458973712941967 %
Percentage of 7 :  0.02519526329050139 %


About 92% of the rows in the training targets have zero or one active label (39.3% with none and 52.6% with exactly one). Rows with several active labels are rare, so the data is imbalanced.
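The counts above look at imbalance per row. A complementary view is the fraction of rows in which each individual label is active. The sketch below shows one way to compute it on a toy frame laid out like `train_targets`; the label names and values are hypothetical.

```python
import pandas as pd

# Toy targets in the same layout as train_targets:
# an id column followed by binary label columns (hypothetical names).
toy_targets = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "label_x": [1, 0, 1, 1],
    "label_y": [0, 0, 0, 1],
    "label_z": [0, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of rows where that label is active.
positive_rate = (
    toy_targets.drop(columns=["sig_id"])
    .mean()
    .sort_values(ascending=False)
)
print(positive_rate)
```

Labels with a rate close to zero are the ones a classifier will see least often during training.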

# Modelling

# Inference