`train_peptides.csv` Mass spectrometry data at the peptide level. Peptides are the component subunits of proteins.

- visit_id - ID code for the visit.
- visit_month - The month of the visit, relative to the first visit by the patient.
- patient_id - An ID code for the patient.
- UniProt - The UniProt ID code for the associated protein. There are often several peptides per protein.
- Peptide - The sequence of amino acids included in the peptide. See this table for the relevant codes. Some rare annotations may not be included in the table.
- PeptideAbundance - The frequency of the peptide in the sample.

`train_proteins.csv` Protein expression frequencies aggregated from the peptide level data.

- visit_id - ID code for the visit.
- visit_month - The month of the visit, relative to the first visit by the patient.
- patient_id - An ID code for the patient.
- UniProt - The UniProt ID code for the associated protein. There are often several peptides per protein.
- NPX - Normalized protein expression. The frequency of the protein's occurrence in the sample. May not have a 1:1 relationship with the component peptides as some proteins contain repeated copies of a given peptide.
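
The note above that NPX may not map 1:1 onto its component peptides can be checked by summing peptide abundances per protein and comparing the total with NPX. The sketch below is illustrative only: it uses the handful of rows visible in the `head()` outputs of this notebook, so the O00533 total is incomplete (only four of its peptides appear in those rows), and the `peptide_sum` and `ratio` column names are assumptions introduced here.

```python
import pandas as pd

# Toy rows copied from the train_peptides.head() output shown in this notebook.
peptides = pd.DataFrame({
    "visit_id": ["55_0"] * 5,
    "UniProt": ["O00391", "O00533", "O00533", "O00533", "O00533"],
    "PeptideAbundance": [11254.3, 102060.0, 174185.0, 27278.9, 30838.7],
})
# Matching rows from the train_proteins.head() output.
proteins = pd.DataFrame({
    "visit_id": ["55_0", "55_0"],
    "UniProt": ["O00391", "O00533"],
    "NPX": [11254.3, 732430.0],
})

# Sum peptide abundances per visit and protein, then line them up with NPX.
peptide_sum = (
    peptides.groupby(["visit_id", "UniProt"], as_index=False)["PeptideAbundance"]
    .sum()
    .rename(columns={"PeptideAbundance": "peptide_sum"})
)
comparison = proteins.merge(peptide_sum, on=["visit_id", "UniProt"], how="left")
comparison["ratio"] = comparison["peptide_sum"] / comparison["NPX"]
print(comparison)
```

Run against the full files, the same comparison would show for which proteins NPX equals the peptide total and for which it does not.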

`train_clinical_data.csv`

- visit_id - ID code for the visit.
- visit_month - The month of the visit, relative to the first visit by the patient.
- patient_id - An ID code for the patient.
- updrs_[1-4] - The patient's score for part N of the Unified Parkinson's Disease Rating Scale. Higher numbers indicate more severe symptoms. Each sub-section covers a distinct category of symptoms, such as mood and behavior for Part 1 and motor functions for Part 3.
- upd23b_clinical_state_on_medication - Whether or not the patient was taking medication such as Levodopa during the UPDRS assessment. Expected to mainly affect the scores for Part 3 (motor function). These medications wear off fairly quickly (on the order of one day) so it's common for patients to take the motor function exam twice in a single month, both with and without medication.

In [1]:
import pandas as pd
from pydantic import BaseModel

## Create configs

In [2]:
configs = {
    'MAX_PROTEINS': 20,
    'PROFILE_REPORT' : True,
    'SAMPLE_SUBMISSION' : "/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/sample_submission.csv",
    'SUPPLEMENTAL_CLINICAL_DATA' : "/kaggle/input/amp-parkinsons-disease-progression-prediction/supplemental_clinical_data.csv",
    'TRAIN_CLINICAL_DATA' : "/kaggle/input/amp-parkinsons-disease-progression-prediction/train_clinical_data.csv",
    'TRAIN_PEPTIDES' : "/kaggle/input/amp-parkinsons-disease-progression-prediction/train_peptides.csv",
    'TRAIN_PROTEINS'  :"/kaggle/input/amp-parkinsons-disease-progression-prediction/train_proteins.csv",
    'TEST_CLINICAL_DATA' : "/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test.csv",
    'TEST_PEPTIDES' : "/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test_peptides.csv",
    'TEST_PROTEINS' : "/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test_proteins.csv",
}
class Configs(BaseModel):
    MAX_PROTEINS: int
    PROFILE_REPORT: bool
    SAMPLE_SUBMISSION: str
    SUPPLEMENTAL_CLINICAL_DATA: str
    TRAIN_CLINICAL_DATA: str
    TRAIN_PEPTIDES: str
    TRAIN_PROTEINS: str
    TEST_CLINICAL_DATA: str
    TEST_PEPTIDES: str
    TEST_PROTEINS: str

base_configs = Configs(**configs)

## Load data

### Load dataset clinical data

In [3]:
train_clinical_data = pd.read_csv(base_configs.TRAIN_CLINICAL_DATA)
print(train_clinical_data.shape)
train_clinical_data.head()

(2615, 8)


Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication
0,55_0,55,0,10.0,6.0,15.0,,
1,55_3,55,3,10.0,7.0,25.0,,
2,55_6,55,6,8.0,10.0,34.0,,
3,55_9,55,9,8.0,9.0,30.0,0.0,On
4,55_12,55,12,10.0,10.0,41.0,0.0,On


In [4]:
import pandas_profiling

train_clinical_data.profile_report()


### Load dataset peptides

In [5]:
train_peptides = pd.read_csv(base_configs.TRAIN_PEPTIDES)
print(train_peptides.shape)
train_peptides.head()

(981834, 6)


Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide,PeptideAbundance
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0
2,55_0,0,55,O00533,IEIPSSVQQVPTIIK,174185.0
3,55_0,0,55,O00533,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,27278.9
4,55_0,0,55,O00533,SMEQNGPGLEYR,30838.7


In [6]:
import pandas_profiling  

train_peptides.profile_report()


### Load dataset proteins

In [7]:
train_proteins = pd.read_csv(base_configs.TRAIN_PROTEINS)
print(train_proteins.shape)
train_proteins.head()

(232741, 5)


Unnamed: 0,visit_id,visit_month,patient_id,UniProt,NPX
0,55_0,0,55,O00391,11254.3
1,55_0,0,55,O00533,732430.0
2,55_0,0,55,O00584,39585.8
3,55_0,0,55,O14498,41526.9
4,55_0,0,55,O14773,31238.0


In [8]:
train_proteins.profile_report()


### Load dataset SUPPLEMENTAL CLINICAL DATA

In [9]:
supplemental_clinical_data = pd.read_csv(base_configs.SUPPLEMENTAL_CLINICAL_DATA)
print(f"Supplemental Clinical Dataframe shape has {supplemental_clinical_data.shape[0]} rows and {supplemental_clinical_data.shape[1]} columns")
supplemental_clinical_data.head()

Supplemental Clinical Dataframe shape has 2223 rows and 8 columns


Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication
0,35_0,35,0,5.0,3.0,16.0,0.0,
1,35_36,35,36,6.0,4.0,20.0,0.0,
2,75_0,75,0,4.0,6.0,26.0,0.0,
3,75_36,75,36,1.0,8.0,38.0,0.0,On
4,155_0,155,0,,,0.0,,


In [10]:
supplemental_clinical_data.profile_report()


## Pre-processing data

In [11]:
train_proteins.loc[:, "UniProt"] = train_proteins.loc[:, "UniProt"].fillna("None")
train_proteins.loc[:, "NPX"] = train_proteins.loc[:, "NPX"].fillna(0)

In [12]:
train_proteins.head()

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,NPX
0,55_0,0,55,O00391,11254.3
1,55_0,0,55,O00533,732430.0
2,55_0,0,55,O00584,39585.8
3,55_0,0,55,O14498,41526.9
4,55_0,0,55,O14773,31238.0


In [13]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_proteins['UniProt'] = le.fit_transform(train_proteins['UniProt'])
train_proteins.head()

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,NPX
0,55_0,0,55,0,11254.3
1,55_0,0,55,1,732430.0
2,55_0,0,55,2,39585.8
3,55_0,0,55,3,41526.9
4,55_0,0,55,4,31238.0


In [14]:
protein_aux_1 = pd.DataFrame(train_proteins.groupby(by=["patient_id", "visit_id", "visit_month"])["UniProt"].apply(list))
protein_aux_1.head()

patient_id,visit_id,visit_month,UniProt
55,55_0,0,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,..."
55,55_12,12,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,..."
55,55_36,36,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,..."
55,55_6,6,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,..."
942,942_12,12,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14,..."


In [15]:
protein_aux_2 = pd.DataFrame(train_proteins.groupby(by=["patient_id", "visit_id", "visit_month"])["NPX"].apply(list))
protein_aux_2.head()

patient_id,visit_id,visit_month,NPX
55,55_0,0,"[11254.3, 732430.0, 39585.8, 41526.9, 31238.0,..."
55,55_12,12,"[15257.6, 815083.0, 41650.9, 39763.3, 30703.6,..."
55,55_36,36,"[13530.8, 753832.0, 43048.9, 43503.6, 33577.6,..."
55,55_6,6,"[13163.6, 630465.0, 35220.8, 41295.0, 26219.9,..."
942,942_12,12,"[6757.32, 360858.0, 18367.6, 14760.7, 18603.4,..."


In [16]:
protein_aux_3 = pd.DataFrame(protein_aux_1.UniProt.values.tolist()).add_prefix('UniProt_')
protein_aux_3.head()

Unnamed: 0,UniProt_0,UniProt_1,UniProt_2,UniProt_3,UniProt_4,UniProt_5,UniProt_6,UniProt_7,UniProt_8,UniProt_9,...,UniProt_214,UniProt_215,UniProt_216,UniProt_217,UniProt_218,UniProt_219,UniProt_220,UniProt_221,UniProt_222,UniProt_223
0,0,1,2,3,4,5,6,7,8,9,...,221.0,222.0,224.0,225.0,226.0,,,,,
1,0,1,2,3,4,5,6,7,8,9,...,219.0,220.0,221.0,222.0,223.0,224.0,225.0,226.0,,
2,0,1,2,3,4,5,6,7,8,9,...,217.0,218.0,219.0,220.0,221.0,222.0,223.0,224.0,225.0,226.0
3,0,1,2,3,4,5,6,7,8,9,...,222.0,224.0,225.0,226.0,,,,,,
4,0,1,2,3,4,5,6,7,8,9,...,,,,,,,,,,


In [17]:
protein_aux_4 = pd.DataFrame(protein_aux_2.NPX.values.tolist()).add_prefix('NPX_')
protein_aux_4.head()

Unnamed: 0,NPX_0,NPX_1,NPX_2,NPX_3,NPX_4,NPX_5,NPX_6,NPX_7,NPX_8,NPX_9,...,NPX_214,NPX_215,NPX_216,NPX_217,NPX_218,NPX_219,NPX_220,NPX_221,NPX_222,NPX_223
0,11254.3,732430.0,39585.8,41526.9,31238.0,4202.71,177775.0,62898.2,333376.0,166850.0,...,60912.6,408698.0,29758.8,23833.7,18953.5,,,,,
1,15257.6,815083.0,41650.9,39763.3,30703.6,4343.6,151073.0,66963.1,332401.0,151194.0,...,114921.0,21860.1,61598.2,318553.0,65762.6,29193.4,28536.1,19290.9,,
2,13530.8,753832.0,43048.9,43503.6,33577.6,5367.06,101056.0,67588.6,317490.0,122902.0,...,303597.0,48188.4,109794.0,23930.6,70223.5,377550.0,74976.1,31732.6,22186.5,21717.1
3,13163.6,630465.0,35220.8,41295.0,26219.9,4416.42,165638.0,62567.5,277833.0,170345.0,...,369870.0,22935.2,17722.5,16642.7,,,,,,
4,6757.32,360858.0,18367.6,14760.7,18603.4,1722.77,86847.4,37741.3,212132.0,100519.0,...,,,,,,,,,,


In [18]:
protein_aux_3.index = protein_aux_1.index
protein_aux_4.index = protein_aux_2.index
new_train_proteins = protein_aux_3.merge(protein_aux_4, left_index=True, right_index=True, how="inner")
del protein_aux_1, protein_aux_2, protein_aux_3, protein_aux_4
new_train_proteins.head()

patient_id,visit_id,visit_month,UniProt_0,UniProt_1,UniProt_2,UniProt_3,UniProt_4,UniProt_5,UniProt_6,UniProt_7,UniProt_8,UniProt_9,...,NPX_214,NPX_215,NPX_216,NPX_217,NPX_218,NPX_219,NPX_220,NPX_221,NPX_222,NPX_223
55,55_0,0,0,1,2,3,4,5,6,7,8,9,...,60912.6,408698.0,29758.8,23833.7,18953.5,,,,,
55,55_12,12,0,1,2,3,4,5,6,7,8,9,...,114921.0,21860.1,61598.2,318553.0,65762.6,29193.4,28536.1,19290.9,,
55,55_36,36,0,1,2,3,4,5,6,7,8,9,...,303597.0,48188.4,109794.0,23930.6,70223.5,377550.0,74976.1,31732.6,22186.5,21717.1
55,55_6,6,0,1,2,3,4,5,6,7,8,9,...,369870.0,22935.2,17722.5,16642.7,,,,,,
942,942_12,12,0,1,2,3,4,5,6,7,8,9,...,,,,,,,,,,


In [19]:
new_train_proteins.reset_index(inplace=True)
display(new_train_proteins.head())
train_df = train_clinical_data.merge(new_train_proteins,
                                     on=["patient_id", "visit_id", "visit_month"],
                                     how="left")

Unnamed: 0,patient_id,visit_id,visit_month,UniProt_0,UniProt_1,UniProt_2,UniProt_3,UniProt_4,UniProt_5,UniProt_6,...,NPX_214,NPX_215,NPX_216,NPX_217,NPX_218,NPX_219,NPX_220,NPX_221,NPX_222,NPX_223
0,55,55_0,0,0,1,2,3,4,5,6,...,60912.6,408698.0,29758.8,23833.7,18953.5,,,,,
1,55,55_12,12,0,1,2,3,4,5,6,...,114921.0,21860.1,61598.2,318553.0,65762.6,29193.4,28536.1,19290.9,,
2,55,55_36,36,0,1,2,3,4,5,6,...,303597.0,48188.4,109794.0,23930.6,70223.5,377550.0,74976.1,31732.6,22186.5,21717.1
3,55,55_6,6,0,1,2,3,4,5,6,...,369870.0,22935.2,17722.5,16642.7,,,,,,
4,942,942_12,12,0,1,2,3,4,5,6,...,,,,,,,,,,


In [20]:
train_df.head()

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication,UniProt_0,UniProt_1,...,NPX_214,NPX_215,NPX_216,NPX_217,NPX_218,NPX_219,NPX_220,NPX_221,NPX_222,NPX_223
0,55_0,55,0,10.0,6.0,15.0,,,0.0,1.0,...,60912.6,408698.0,29758.8,23833.7,18953.5,,,,,
1,55_3,55,3,10.0,7.0,25.0,,,,,...,,,,,,,,,,
2,55_6,55,6,8.0,10.0,34.0,,,0.0,1.0,...,369870.0,22935.2,17722.5,16642.7,,,,,,
3,55_9,55,9,8.0,9.0,30.0,0.0,On,,,...,,,,,,,,,,
4,55_12,55,12,10.0,10.0,41.0,0.0,On,0.0,1.0,...,114921.0,21860.1,61598.2,318553.0,65762.6,29193.4,28536.1,19290.9,,
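
In the wide table above, `UniProt_k` and `NPX_k` hold the k-th protein present in each visit, so a given column may refer to different proteins in different rows whenever a visit is missing a protein. One possible alternative, shown here only as a sketch and not as the approach used in this notebook, is to pivot the long table so that each protein gets its own NPX column. The toy values come from the outputs above, except that O00533 is deliberately left out of visit `55_6` to show how a missing protein appears.

```python
import pandas as pd

# Toy long-format protein table; values are taken from the outputs above,
# except that O00533 is deliberately omitted from visit 55_6 for illustration.
toy_proteins = pd.DataFrame({
    "visit_id": ["55_0", "55_0", "55_0", "55_6", "55_6"],
    "patient_id": [55] * 5,
    "visit_month": [0, 0, 0, 6, 6],
    "UniProt": ["O00391", "O00533", "O00584", "O00391", "O00584"],
    "NPX": [11254.3, 732430.0, 39585.8, 13163.6, 35220.8],
})

# One row per visit, one NPX column per protein; absent proteins become NaN.
wide = toy_proteins.pivot_table(
    index=["visit_id", "patient_id", "visit_month"],
    columns="UniProt",
    values="NPX",
    aggfunc="mean",
).add_prefix("NPX_").reset_index()
wide.columns.name = None
print(wide)
```

With this layout a column such as `NPX_O00533` always refers to the same protein, and the label-encoded `UniProt_k` columns are no longer needed.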


In [21]:
list(train_df.columns)

['visit_id',
 'patient_id',
 'visit_month',
 'updrs_1',
 'updrs_2',
 'updrs_3',
 'updrs_4',
 'upd23b_clinical_state_on_medication',
 'UniProt_0',
 'UniProt_1',
 'UniProt_2',
 'UniProt_3',
 'UniProt_4',
 'UniProt_5',
 'UniProt_6',
 'UniProt_7',
 'UniProt_8',
 'UniProt_9',
 'UniProt_10',
 'UniProt_11',
 'UniProt_12',
 'UniProt_13',
 'UniProt_14',
 'UniProt_15',
 'UniProt_16',
 'UniProt_17',
 'UniProt_18',
 'UniProt_19',
 'UniProt_20',
 'UniProt_21',
 'UniProt_22',
 'UniProt_23',
 'UniProt_24',
 'UniProt_25',
 'UniProt_26',
 'UniProt_27',
 'UniProt_28',
 'UniProt_29',
 'UniProt_30',
 'UniProt_31',
 'UniProt_32',
 'UniProt_33',
 'UniProt_34',
 'UniProt_35',
 'UniProt_36',
 'UniProt_37',
 'UniProt_38',
 'UniProt_39',
 'UniProt_40',
 'UniProt_41',
 'UniProt_42',
 'UniProt_43',
 'UniProt_44',
 'UniProt_45',
 'UniProt_46',
 'UniProt_47',
 'UniProt_48',
 'UniProt_49',
 'UniProt_50',
 'UniProt_51',
 'UniProt_52',
 'UniProt_53',
 'UniProt_54',
 'UniProt_55',
 'UniProt_56',
 'UniProt_57',
 'UniPro

In [22]:
# Take the first 20 UniProt and NPX columns
cols1  = [col for col in train_df.columns if 'UniProt' in col][0:base_configs.MAX_PROTEINS]
cols2  = [col for col in train_df.columns if 'NPX' in col][0:base_configs.MAX_PROTEINS]

train_cols  = cols1 + cols2
target_cols = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"]
train_df.drop(columns = ['upd23b_clinical_state_on_medication'],inplace = True)

In [23]:
train_df.loc[:,target_cols] = train_df.loc[:,target_cols].fillna(0)
train_df.loc[:,cols1] = train_df.loc[:,cols1].fillna(-1)
train_df.loc[:,cols2] = train_df.loc[:,cols2].fillna(0)

In [24]:
train_df.set_index(["visit_id", "patient_id", "visit_month"], inplace=True)
train_df.head()

visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,UniProt_0,UniProt_1,UniProt_2,UniProt_3,UniProt_4,UniProt_5,...,NPX_214,NPX_215,NPX_216,NPX_217,NPX_218,NPX_219,NPX_220,NPX_221,NPX_222,NPX_223
55_0,55,0,10.0,6.0,15.0,0.0,0.0,1.0,2.0,3.0,4.0,5.0,...,60912.6,408698.0,29758.8,23833.7,18953.5,,,,,
55_3,55,3,10.0,7.0,25.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,,,,,,,,,,
55_6,55,6,8.0,10.0,34.0,0.0,0.0,1.0,2.0,3.0,4.0,5.0,...,369870.0,22935.2,17722.5,16642.7,,,,,,
55_9,55,9,8.0,9.0,30.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,,,,,,,,,,
55_12,55,12,10.0,10.0,41.0,0.0,0.0,1.0,2.0,3.0,4.0,5.0,...,114921.0,21860.1,61598.2,318553.0,65762.6,29193.4,28536.1,19290.9,,


In [25]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(
    train_df.loc[:,train_cols],train_df.loc[:,target_cols],test_size = .2, random_state = 42
)

X_train.shape,X_test.shape,y_train.shape,y_test.shape

((2092, 40), (523, 40), (2092, 4), (523, 4))

In [26]:
X_train.head()

visit_id,patient_id,visit_month,UniProt_0,UniProt_1,UniProt_2,UniProt_3,UniProt_4,UniProt_5,UniProt_6,UniProt_7,UniProt_8,UniProt_9,...,NPX_10,NPX_11,NPX_12,NPX_13,NPX_14,NPX_15,NPX_16,NPX_17,NPX_18,NPX_19
6054_60,6054,60,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
52266_60,52266,60,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2660_48,2660,48,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,...,8379.02,22749.5,74585.1,753128.0,762221.0,36899.1,764805.0,125390.0,346585.0,64577.2
58648_0,58648,0,1.0,2.0,3.0,4.0,5.0,6.0,8.0,9.0,10.0,12.0,...,24838.7,575893.0,365578.0,995740.0,36841.6,202915.0,10962.0,330101.0,1122430.0,7378550.0
23175_12,23175,12,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,113211.0,4510.88,18029.1,64964.1,1146950.0,483586.0,93556.0,2351970.0,142615.0,346315.0


## Using module [Lazypredict](https://github.com/shankarpandala/lazypredict)

In [None]:
!pip install lazypredict
from lazypredict.Supervised import LazyRegressor

In [None]:
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)
print(models)

In [None]:
predictions

## Conclusions drawn from the Lazypredict results

The Lazypredict output table lists several regression models along with their performance metrics: Adjusted R-Squared, R-Squared, RMSE (Root Mean Squared Error), and Time Taken.

When evaluating regression models, we generally aim for higher R-Squared values (closer to 1) and lower RMSE values. The R-Squared value indicates the proportion of the variance in the dependent variable that is explained by the independent variables. The RMSE is the square root of the mean squared difference between the predicted and actual values, expressed in the same units as the target.
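
To make the two metrics concrete, here is a minimal sketch that computes R-Squared and RMSE by hand. The `y_true` values are the `updrs_3` scores from the first five clinical rows shown earlier; the `y_pred` values are made up for illustration and are not model output.

```python
import math

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# updrs_3 values from the first five clinical rows shown earlier;
# the predictions are invented for illustration.
y_true = [15.0, 25.0, 34.0, 30.0, 41.0]
y_pred = [20.0, 24.0, 30.0, 31.0, 38.0]
print(r_squared(y_true, y_pred), rmse(y_true, y_pred))
```

A model that always predicts the mean of `y_true` scores an R-Squared of exactly 0 under this definition, which is why values near 0 indicate little predictive power.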

Based on the provided data, we can draw the following conclusions:

The best-performing models in terms of R-Squared are RandomForestRegressor and ExtraTreesRegressor, with a value of 0.02. However, these values are quite low, suggesting that the models have limited predictive power.

The models with the lowest RMSE are RandomForestRegressor and ExtraTreesRegressor, with RMSEs of 8.46 and 8.47, respectively. Lower RMSE values indicate better accuracy in predicting the target variable.

The models with the highest Adjusted R-Squared values are RandomForestRegressor and ExtraTreesRegressor, both at -0.06. A negative Adjusted R-Squared means that, once the penalty for the number of predictors is applied, the model explains no more variance than a model with no predictors.
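
The step from an R-Squared of 0.02 to an Adjusted R-Squared of -0.06 can be reproduced with the standard adjustment formula, taking the 523 test rows and 40 features from the split shapes printed earlier. That Lazypredict applies exactly this formula is an assumption here, and the inputs are the rounded figures quoted above.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Standard adjustment: penalize R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# 523 test rows and 40 feature columns, as in the X_test shape shown earlier.
adj = adjusted_r_squared(0.02, n_samples=523, n_features=40)
print(round(adj, 2))
```

With 40 predictors and only 523 rows, the penalty factor (522 / 482) is large enough to push a small positive R-Squared below zero.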

Among the models listed, RandomForestRegressor and ExtraTreesRegressor also have relatively low Time Taken values, at 1.81 and 0.98 seconds respectively, so they are inexpensive to train at this data size.

Based on these observations, RandomForestRegressor and ExtraTreesRegressor stand out as the best models among the ones listed. They have the lowest RMSE, relatively better R-Squared (though still low), and reasonable computation time. However, it is important to note that the overall performance of these models is not very strong, as indicated by the low R-Squared values.

It's worth considering that the choice of the best model can depend on the specific requirements and context of the problem at hand. It is recommended to further analyze and compare the performance of these models using cross-validation, additional evaluation metrics, and considering the specific characteristics and assumptions of the problem domain.
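
As one way to follow up on the cross-validation suggestion, the sketch below runs 5-fold cross-validation with scikit-learn's `GroupKFold`, keeping all visits of a patient in the same fold so that no patient appears in both the training and validation parts. The data is synthetic and the variable names are placeholders; in this notebook the groups would plausibly be the `patient_id` values, which is an assumption about how the folds should be built.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(42)
n_patients, visits_per_patient, n_features = 30, 4, 5

# Synthetic stand-ins for the feature matrix, one target, and patient IDs.
X = rng.normal(size=(n_patients * visits_per_patient, n_features))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=len(X))
groups = np.repeat(np.arange(n_patients), visits_per_patient)

model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)
# scikit-learn returns negated RMSE so that higher is always better.
scores = cross_val_score(
    model, X, y,
    groups=groups,
    cv=GroupKFold(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
print("RMSE per fold:", -scores)
```

Reporting the spread of the per-fold RMSE, rather than a single split, gives a better sense of how stable the model ranking is.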

## Using model RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Create a Random Forest Regressor model
rf_reg = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=100, max_depth=10)
)

# Fit the model to the training data
rf_reg = rf_reg.fit(X_train, y_train)

# Predict the target variable for the test data
y_pred = rf_reg.predict(X_test)

In [None]:
def process_predictions(y_pred, X_test):
    predictions = []

    for i in range(y_pred.shape[1]):  
        pred = pd.DataFrame(y_pred[:,i])
        pred.columns = ["rating"]
        pred["updrs"] = target_cols[i]
        pred.index = X_test.index
        predictions.append(pred)

    predictions = pd.concat(predictions)
    predictions.reset_index(inplace=True)
    return predictions

predictions = process_predictions(y_pred, X_test)
predictions.head()

In [None]:
def process_test(test_clinical_data, test_proteins):
    # === Preprocess Protein DataFrame ===
    test_proteins.loc[:, "UniProt"] = test_proteins.loc[:, "UniProt"].fillna("None")
    test_proteins.loc[:, "NPX"] = test_proteins.loc[:, "NPX"].fillna(0)
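    # Reuse the encoder fitted on the training data so protein codes match the
    # ones the model saw; proteins unseen in training map to -1, like missing values.
    known_codes = {label: code for code, label in enumerate(le.classes_)}
    test_proteins['UniProt'] = test_proteins['UniProt'].map(known_codes).fillna(-1).astype(int)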
    protein_aux_1 = pd.DataFrame(test_proteins.groupby(by=["patient_id", "visit_id", "visit_month"])["UniProt"].apply(list))#.reset_index()
    protein_aux_2 = pd.DataFrame(test_proteins.groupby(by=["patient_id", "visit_id", "visit_month"])["NPX"].apply(list))#.reset_index()
    protein_aux_3 = pd.DataFrame(protein_aux_1.UniProt.values.tolist()).add_prefix('UniProt_')
    protein_aux_4 = pd.DataFrame(protein_aux_2.NPX.values.tolist()).add_prefix('NPX_')
    protein_aux_3.index = protein_aux_1.index
    protein_aux_4.index = protein_aux_2.index
    new_test_proteins = protein_aux_3.merge(protein_aux_4, left_index=True, right_index=True, how="inner")
    del protein_aux_1, protein_aux_2, protein_aux_3, protein_aux_4
    new_test_proteins.reset_index(inplace=True)
    
    # === Merge Clinical Data with Protein Data ===
    test_df = test_clinical_data.merge(new_test_proteins,
                                     on=["patient_id", "visit_id", "visit_month"],
                                     how="left")
    
    # === Select Label Encoder Columns ===
    le_cols_1 = [col for col in test_df.columns if 'UniProt' in col][0:base_configs.MAX_PROTEINS]
    le_cols_2 = [col for col in test_df.columns if 'NPX' in col][0:base_configs.MAX_PROTEINS]
    TRAIN_COLUMNS = le_cols_1 + le_cols_2 
    
    # === Fill NA ===
    test_df.loc[:, test_df.columns.isin(le_cols_1)] = test_df.loc[:, test_df.columns.isin(le_cols_1)].fillna(-1)
    test_df.loc[:, test_df.columns.isin(le_cols_2)] = test_df.loc[:, test_df.columns.isin(le_cols_2)].fillna(0)
    test_df.loc[:, test_df.columns.isin(target_cols)] = test_df.loc[:, test_df.columns.isin(target_cols)].fillna(0)
    
    # === Build the feature matrix and predict ===
    test_df.set_index(["visit_id", "patient_id", "visit_month"], inplace=True)
    X_test = test_df.loc[:, test_df.columns.isin(TRAIN_COLUMNS)]
    y_pred = rf_reg.predict(X_test)
    predictions = process_predictions(y_pred, X_test)
    prediction_id = []

    auxs = [0,6,12,24]
    for index, group in predictions.groupby(["visit_id", "patient_id", "visit_month", "updrs"]):
        i = 0
        for index, row in group.iterrows():
            prediction_id.append(row["visit_id"] + "_" + row["updrs"] + "_plus_" + str(auxs[i]) + "_months")
            i += 1
    predictions["prediction_id"] = prediction_id
    print(f"Predictions shape has {predictions.shape[0]} rows and {predictions.shape[1]} columns")
    return predictions

test = pd.read_csv(base_configs.TEST_CLINICAL_DATA)
print(f"Test Dataframe shape has {test.shape[0]} rows and {test.shape[1]} columns")
display(test.head())

test_proteins = pd.read_csv(base_configs.TEST_PROTEINS)
print(f"Test Proteins shape has {test_proteins.shape[0]} rows and {test_proteins.shape[1]} columns")
display(test_proteins.head())

sample_submission = pd.read_csv(base_configs.SAMPLE_SUBMISSION)
print(f"Sample Submission shape has {sample_submission.shape[0]} rows and {sample_submission.shape[1]} columns")
display(sample_submission.head())

In [None]:
import amp_pd_peptide
env = amp_pd_peptide.make_env()
iter_test = env.iter_test()

counter = 0
# The API will deliver four dataframes in this specific order:
for (test, test_peptides, test_proteins, sample_submission) in iter_test:
    predictions = process_test(test, test_proteins)
    submission = predictions[["prediction_id", "rating"]]
    env.predict(submission)
    
    if counter == 0:
        display(test)
        display(sample_submission)
        
    counter += 1
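
The `prediction_id` strings assembled in `process_test` combine a visit ID, a target name, and a horizon taken from `[0, 6, 12, 24]`. The sketch below isolates that format and enumerates every combination for a single visit. `build_prediction_ids` is a hypothetical helper written for illustration, not part of the competition API, and whether the API expects exactly this set of IDs should be checked against `sample_submission`.

```python
from itertools import product

def build_prediction_ids(visit_ids, targets, horizons):
    """Return one ID per (visit, target, horizon) combination."""
    return [
        f"{visit_id}_{target}_plus_{months}_months"
        for visit_id, target, months in product(visit_ids, targets, horizons)
    ]

ids = build_prediction_ids(
    visit_ids=["55_0"],  # hypothetical example visit
    targets=["updrs_1", "updrs_2", "updrs_3", "updrs_4"],
    horizons=[0, 6, 12, 24],
)
print(len(ids), ids[:4])
```

Generating the IDs directly from the three components avoids depending on the row order of an intermediate DataFrame.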