# Mechanisms of Action (MoA)

## Basic Study

### MoA

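A mechanism of action is the way a drug produces its therapeutic effect. It is usually described at the molecular level: which target molecule (typically a protein) the drug interacts with, and what effect that interaction has that leads to the therapeutic outcome. When resistance to one drug develops, a drug with a different mechanism of action may achieve the same therapeutic effect.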

### Training data

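- `g-x`: gene expression, 772 columns. Gene expression is the process by which the information encoded in a gene is used to direct the assembly of a protein molecule. The cell reads the gene's sequence in groups of three bases.
- `c-x`: cell viability, 100 columns. Cell viability measures the proportion of live, healthy cells in a population. Cell viability assays are used to determine the overall health of cells, to optimize culture or experimental conditions, and to measure cell survival after treatment with compounds, as in drug screening.
- `cp_type`: indicates whether the sample was treated with a compound (`trt_cp`) or with a control perturbation (`ctl_vehicle`). Control perturbations have no MoA.
- `cp_time` and `cp_dose`: treatment duration (24, 48, or 72 hours) and dose (high or low; the column takes the values `D1` and `D2`).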
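
The column groups listed above can be recovered from the column names alone. The following is a minimal sketch on a toy frame; the values are invented and the helper name `split_feature_columns` is hypothetical, not part of the original notebook. The real `train_features.csv` has 772 `g-` columns and 100 `c-` columns.

```python
import pandas as pd


def split_feature_columns(df):
    """Group columns by naming convention (hypothetical helper for illustration)."""
    g_cols = [c for c in df.columns if c.startswith("g-")]
    c_cols = [c for c in df.columns if c.startswith("c-")]
    # Everything else is metadata: sig_id, cp_type, cp_time, cp_dose.
    meta_cols = [c for c in df.columns if c not in g_cols and c not in c_cols]
    return {"gene": g_cols, "cell": c_cols, "meta": meta_cols}


# Toy stand-in for train_features.csv; all values are made up.
toy = pd.DataFrame({
    "sig_id": ["id_a", "id_b"],
    "cp_type": ["trt_cp", "ctl_vehicle"],
    "cp_time": [24, 72],
    "cp_dose": ["D1", "D2"],
    "g-0": [1.0, -0.5],
    "g-1": [0.2, 0.3],
    "c-0": [0.1, -10.0],
})

groups = split_feature_columns(toy)
print({name: len(cols) for name, cols in groups.items()})
```

On the real data, the three group sizes should add up to the 876 columns reported by `train_df.info()`.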

### Target variables

- For each sample, predict the response for each of the 206 mechanisms of action.
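
To make the shape of this prediction task concrete, here is a small sketch that treats the targets as independent binary columns, one per MoA. That is an assumption based on the `int64` target columns and the 0/1 values shown further down. The toy frame uses three invented MoA columns (`moa_x`, `moa_y`, `moa_z`) instead of the real 206.

```python
import pandas as pd

# Toy stand-in for train_targets_scored.csv; column names and values are invented.
toy_targets = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "moa_x": [1, 0, 0, 0],
    "moa_y": [1, 0, 1, 0],
    "moa_z": [0, 0, 0, 0],
})
label_cols = [c for c in toy_targets.columns if c != "sig_id"]

# Number of active MoA labels per sample (a row may have zero, one, or several).
labels_per_sample = toy_targets[label_cols].sum(axis=1)

# Fraction of samples that are positive for each MoA.
positive_rate = toy_targets[label_cols].mean(axis=0)

print(labels_per_sample.tolist())
print(positive_rate.to_dict())
```

Running the same two reductions on the real `target_df` would show how sparse each label is before any modeling choice is made.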

In [29]:
import numpy as np
import pandas as pd

In [30]:
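def df_stats(df):
    """Summarize each column: cardinality, most frequent value, missing rate, dtype."""
    stats = []
    for col in df.columns:
        counts = df[col].value_counts()  # NaN excluded, sorted by frequency
        shares = df[col].value_counts(normalize=True, dropna=False)  # NaN included
        stats.append(
            (
                col,
                df[col].nunique(),
                counts.index[0],
                counts.values[0],
                df[col].isnull().sum() * 100 / df.shape[0],
                shares.values[0] * 100,
                df[col].dtype,
            )
        )
    # Column labels, in order: column name, number of unique values, most frequent value,
    # count of the most frequent value, % missing, % share of the most frequent value, dtype.
    return pd.DataFrame(
        stats, columns=["カラム名", "カラムごとのユニーク値数", "最も出現頻度の高い値", "最も出現頻度の高い値の出現回数", "欠損損値の割合", "最も多いカテゴリの割合", "Type"]
    )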

In [31]:
train_df = pd.read_csv('../input/lish-moa/train_features.csv')
test_df = pd.read_csv('../input/lish-moa/test_features.csv')
target_df = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
submit_df = pd.read_csv('../input/lish-moa/sample_submission.csv')
non_target_df = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')

In [32]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23814 entries, 0 to 23813
Columns: 876 entries, sig_id to c-99
dtypes: float64(872), int64(1), object(3)
memory usage: 159.2+ MB


In [33]:
train_df['cp_type'].value_counts()

trt_cp         21948
ctl_vehicle     1866
Name: cp_type, dtype: int64
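
Since control perturbations (`ctl_vehicle`) are described as having no MoA, one sanity check is to confirm that their target rows are all zero, and then to separate them from the treated samples. The sketch below does this on toy frames; the MoA column names and values are invented, and whether to drop controls before training is a modeling choice that this notebook does not make.

```python
import pandas as pd

# Toy stand-ins for train_features.csv and train_targets_scored.csv.
features = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c"],
    "cp_type": ["trt_cp", "ctl_vehicle", "trt_cp"],
})
targets = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c"],
    "moa_x": [1, 0, 0],
    "moa_y": [0, 0, 1],
})

# Join on sig_id; validate= raises if either side has duplicate ids.
merged = features.merge(targets, on="sig_id", how="inner", validate="one_to_one")
label_cols = [c for c in targets.columns if c != "sig_id"]
is_control = merged["cp_type"] == "ctl_vehicle"

# True if every control row has no positive label.
controls_have_no_moa = bool((merged.loc[is_control, label_cols].sum(axis=1) == 0).all())

# Treated samples only, for any step that should ignore controls.
treated = merged.loc[~is_control].reset_index(drop=True)

print(controls_have_no_moa, treated["sig_id"].tolist())
```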

In [34]:
train_df['cp_time'].value_counts()

48    8250
72    7792
24    7772
Name: cp_time, dtype: int64

In [35]:
train_df['cp_dose'].value_counts()

D1    12147
D2    11667
Name: cp_dose, dtype: int64

In [36]:
train_df.describe()

Unnamed: 0,cp_time,g-0,g-1,g-2,g-3,g-4,g-5,g-6,g-7,g-8,...,c-90,c-91,c-92,c-93,c-94,c-95,c-96,c-97,c-98,c-99
count,23814.0,23814.0,23814.0,23814.0,23814.0,23814.0,23814.0,23814.0,23814.0,23814.0,...,23814.0,23814.0,23814.0,23814.0,23814.0,23814.0,23814.0,23814.0,23814.0,23814.0
mean,48.020156,0.248366,-0.095684,0.152253,0.081971,0.057347,-0.138836,0.035961,-0.202651,-0.190083,...,-0.469244,-0.461411,-0.513256,-0.500142,-0.507093,-0.353726,-0.463485,-0.378241,-0.470252,-0.301505
std,19.402807,1.393399,0.812363,1.035731,0.950012,1.032091,1.179388,0.882395,1.125494,1.749885,...,2.000488,2.042475,2.001714,2.107105,2.159589,1.629291,2.059725,1.703615,1.834828,1.407918
min,24.0,-5.513,-5.737,-9.104,-5.998,-6.369,-10.0,-10.0,-10.0,-10.0,...,-10.0,-10.0,-10.0,-10.0,-10.0,-10.0,-10.0,-10.0,-10.0,-10.0
25%,24.0,-0.473075,-0.5622,-0.43775,-0.429575,-0.470925,-0.602225,-0.4939,-0.525175,-0.511675,...,-0.566175,-0.565975,-0.589975,-0.5687,-0.563775,-0.567975,-0.552575,-0.561,-0.5926,-0.5629
50%,48.0,-0.00885,-0.0466,0.0752,0.00805,-0.0269,-0.01565,-0.00065,-0.0179,0.01,...,-0.0099,0.00325,-0.0091,-0.01375,-0.0033,-0.01025,-0.00125,-0.0068,0.014,-0.0195
75%,72.0,0.5257,0.403075,0.663925,0.4634,0.465375,0.510425,0.528725,0.4119,0.549225,...,0.45775,0.4615,0.445675,0.4529,0.4709,0.44475,0.465225,0.4464,0.461275,0.43865
max,72.0,10.0,5.039,8.257,10.0,10.0,7.282,7.333,5.473,8.887,...,4.069,3.96,3.927,3.596,3.747,2.814,3.505,2.924,3.111,3.805


In [37]:
train_df.head()

Unnamed: 0,sig_id,cp_type,cp_time,cp_dose,g-0,g-1,g-2,g-3,g-4,g-5,...,c-90,c-91,c-92,c-93,c-94,c-95,c-96,c-97,c-98,c-99
0,id_000644bb2,trt_cp,24,D1,1.062,0.5577,-0.2479,-0.6208,-0.1944,-1.012,...,0.2862,0.2584,0.8076,0.5523,-0.1912,0.6584,-0.3981,0.2139,0.3801,0.4176
1,id_000779bfc,trt_cp,72,D1,0.0743,0.4087,0.2991,0.0604,1.019,0.5207,...,-0.4265,0.7543,0.4708,0.023,0.2957,0.4899,0.1522,0.1241,0.6077,0.7371
2,id_000a6266a,trt_cp,48,D1,0.628,0.5817,1.554,-0.0764,-0.0323,1.239,...,-0.725,-0.6297,0.6103,0.0223,-1.324,-0.3174,-0.6417,-0.2187,-1.408,0.6931
3,id_0015fd391,trt_cp,48,D1,-0.5138,-0.2491,-0.2656,0.5288,4.062,-0.8095,...,-2.099,-0.6441,-5.63,-1.378,-0.8632,-1.288,-1.621,-0.8784,-0.3876,-0.8154
4,id_001626bd3,trt_cp,72,D2,-0.3254,-0.4009,0.97,0.6919,1.418,-0.8244,...,0.0042,0.0048,0.667,1.069,0.5523,-0.3031,0.1094,0.2885,-0.3786,0.7125


In [38]:
df_stats(train_df)

Unnamed: 0,カラム名,カラムごとのユニーク値数,最も出現頻度の高い値,最も出現頻度の高い値の出現回数,欠損損値の割合,最も多いカテゴリの割合,Type
0,sig_id,23814,id_772dea8ce,1,0.0,0.004199,object
1,cp_type,2,trt_cp,21948,0.0,92.164273,object
2,cp_time,3,48,8250,0.0,34.643487,int64
3,cp_dose,2,D1,12147,0.0,51.007811,object
4,g-0,14367,0,22,0.0,0.092383,float64
...,...,...,...,...,...,...,...
871,c-95,14693,-10,53,0.0,0.222558,float64
872,c-96,14493,-10,385,0.0,1.616696,float64
873,c-97,14757,-10,53,0.0,0.222558,float64
874,c-98,14812,-10,79,0.0,0.331738,float64


In [39]:
train_df[['c-31', 'c-32', 'c-78']].describe()

Unnamed: 0,c-31,c-32,c-78
count,23814.0,23814.0,23814.0
mean,-0.43442,-0.32299,-0.412918
std,1.988458,1.772399,1.888788
min,-10.0,-10.0,-10.0
25%,-0.5373,-0.533125,-0.568275
50%,-0.00215,0.0005,-0.01435
75%,0.454775,0.4824,0.451975
max,6.099,4.073,2.851


In [40]:
target_df[target_df['proteasome_inhibitor'] == 1]['proteasome_inhibitor'].describe()

count    726.0
mean       1.0
std        0.0
min        1.0
25%        1.0
50%        1.0
75%        1.0
max        1.0
Name: proteasome_inhibitor, dtype: float64

In [41]:
test_df.head()

Unnamed: 0,sig_id,cp_type,cp_time,cp_dose,g-0,g-1,g-2,g-3,g-4,g-5,...,c-90,c-91,c-92,c-93,c-94,c-95,c-96,c-97,c-98,c-99
0,id_0004d9e33,trt_cp,24,D1,-0.5458,0.1306,-0.5135,0.4408,1.55,-0.1644,...,0.0981,0.7978,-0.143,-0.2067,-0.2303,-0.1193,0.021,-0.0502,0.151,-0.775
1,id_001897cda,trt_cp,72,D1,-0.1829,0.232,1.208,-0.4522,-0.3652,-0.3319,...,-0.119,-0.1852,-1.031,-1.367,-0.369,-0.5382,0.0359,-0.4764,-1.381,-0.73
2,id_002429b5b,ctl_vehicle,24,D1,0.1852,-0.1404,-0.3911,0.131,-1.438,0.2455,...,-0.2261,0.337,-1.384,0.8604,-1.953,-1.014,0.8662,1.016,0.4924,-0.1942
3,id_00276f245,trt_cp,24,D2,0.4828,0.1955,0.3825,0.4244,-0.5855,-1.202,...,0.126,0.157,-0.1784,-1.12,-0.4325,-0.9005,0.8131,-0.1305,0.5645,-0.5809
4,id_0027f1083,trt_cp,48,D1,-0.3979,-1.268,1.913,0.2057,-0.5864,-0.0166,...,0.4965,0.7578,-0.158,1.051,0.5742,1.09,-0.2962,-0.5313,0.9931,1.838


In [42]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3982 entries, 0 to 3981
Columns: 876 entries, sig_id to c-99
dtypes: float64(872), int64(1), object(3)
memory usage: 26.6+ MB


In [43]:
test_df[['c-31', 'c-32', 'c-78']].describe()

Unnamed: 0,c-31,c-32,c-78
count,3982.0,3982.0,3982.0
mean,-0.404966,-0.32653,-0.398559
std,1.981752,1.816726,1.924608
min,-10.0,-10.0,-10.0
25%,-0.512525,-0.53195,-0.55475
50%,0.01555,0.0168,-0.0011
75%,0.4664,0.495875,0.4676
max,4.541,4.169,5.597
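
The train and test summaries for `c-31`, `c-32`, and `c-78` are easier to compare when placed side by side. A minimal sketch, assuming both frames share the selected columns; `compare_describe` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd


def compare_describe(train, test, cols):
    """Put train and test summary statistics next to each other (hypothetical helper)."""
    return pd.concat(
        {"train": train[cols].describe(), "test": test[cols].describe()},
        axis=1,  # dict keys become the outer level of the column MultiIndex
    )


# Toy stand-ins for train_df and test_df.
toy_train = pd.DataFrame({"c-31": [-10.0, 0.0, 1.0, 2.0], "c-32": [0.5, 0.5, -1.0, 0.0]})
toy_test = pd.DataFrame({"c-31": [-10.0, 0.5], "c-32": [0.0, 1.0]})

summary = compare_describe(toy_train, toy_test, ["c-31", "c-32"])
print(summary.loc[["count", "min", "max"]])
```

With the real data this would be called as `compare_describe(train_df, test_df, ['c-31', 'c-32', 'c-78'])`.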


In [44]:
target_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23814 entries, 0 to 23813
Columns: 207 entries, sig_id to wnt_inhibitor
dtypes: int64(206), object(1)
memory usage: 37.6+ MB


In [45]:
target_df.head()

Unnamed: 0,sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,...,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
0,id_000644bb2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,id_000779bfc,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,id_000a6266a,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,id_0015fd391,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,id_001626bd3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
len(train_df.columns)


876

In [48]:
g_cols = [col for col in train_df.columns if col.startswith('g-')]
g_cols


['g-0',
 'g-1',
 'g-2',
 'g-3',
 'g-4',
 'g-5',
 'g-6',
 'g-7',
 'g-8',
 'g-9',
 'g-10',
 'g-11',
 'g-12',
 'g-13',
 'g-14',
 'g-15',
 'g-16',
 'g-17',
 'g-18',
 'g-19',
 'g-20',
 'g-21',
 'g-22',
 'g-23',
 'g-24',
 'g-25',
 'g-26',
 'g-27',
 'g-28',
 'g-29',
 'g-30',
 'g-31',
 'g-32',
 'g-33',
 'g-34',
 'g-35',
 'g-36',
 'g-37',
 'g-38',
 'g-39',
 'g-40',
 'g-41',
 'g-42',
 'g-43',
 'g-44',
 'g-45',
 'g-46',
 'g-47',
 'g-48',
 'g-49',
 'g-50',
 'g-51',
 'g-52',
 'g-53',
 'g-54',
 'g-55',
 'g-56',
 'g-57',
 'g-58',
 'g-59',
 'g-60',
 'g-61',
 'g-62',
 'g-63',
 'g-64',
 'g-65',
 'g-66',
 'g-67',
 'g-68',
 'g-69',
 'g-70',
 'g-71',
 'g-72',
 'g-73',
 'g-74',
 'g-75',
 'g-76',
 'g-77',
 'g-78',
 'g-79',
 'g-80',
 'g-81',
 'g-82',
 'g-83',
 'g-84',
 'g-85',
 'g-86',
 'g-87',
 'g-88',
 'g-89',
 'g-90',
 'g-91',
 'g-92',
 'g-93',
 'g-94',
 'g-95',
 'g-96',
 'g-97',
 'g-98',
 'g-99',
 'g-100',
 'g-101',
 'g-102',
 'g-103',
 'g-104',
 'g-105',
 'g-106',
 'g-107',
 'g-108',
 'g-109',
 'g-110',
