# Feature Documentation

## Peptide
--- 

* visit_id - ID code for the visit.
* visit_month - The month of the visit, relative to the first visit by the patient.
* patient_id - An ID code for the patient.
* UniProt - The UniProt ID code for the associated protein. There are often several peptides per protein.
* Peptide - The sequence of amino acids included in the peptide. See this table for the relevant codes. Some rare annotations may not be included in the table. The test set may include peptides not found in the train set.
* PeptideAbundance - The frequency of the peptide (the amino acid sequence above) in the sample.

## Protein (Aggregated from Peptide Level)
---

* visit_id - ID code for the visit.
* visit_month - The month of the visit, relative to the first visit by the patient.
* patient_id - An ID code for the patient.
* UniProt - The UniProt ID code for the associated protein. There are often several peptides per protein. The test set may include proteins not found in the train set.
* NPX - Normalized protein expression. The frequency of the protein's occurrence in the sample. May not have a 1:1 relationship with the component peptides as some proteins contain repeated copies of a given peptide.
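
The note about NPX not mapping 1:1 onto peptide abundances can be checked by summing the peptides of each protein and comparing the total with NPX. The sketch below uses a small made-up table rather than the real files; the values and the `peptide_sum` and `ratio` column names are illustrative assumptions.

```python
import pandas as pd

# Toy long-format peptide table (values are made up for illustration).
peptides = pd.DataFrame({
    "visit_id": ["1_0", "1_0", "1_0"],
    "UniProt": ["P1", "P1", "P2"],
    "Peptide": ["AAK", "GGR", "TTK"],
    "PeptideAbundance": [100.0, 250.0, 40.0],
})

# Toy protein table for the same visit.
proteins = pd.DataFrame({
    "visit_id": ["1_0", "1_0"],
    "UniProt": ["P1", "P2"],
    "NPX": [500.0, 40.0],
})

# Sum peptide abundances per visit and protein.
peptide_sum = (
    peptides.groupby(["visit_id", "UniProt"], as_index=False)["PeptideAbundance"]
    .sum()
    .rename(columns={"PeptideAbundance": "peptide_sum"})
)

# Put the sum next to NPX; a ratio different from 1 flags a non-1:1 protein.
check = proteins.merge(peptide_sum, on=["visit_id", "UniProt"], how="left")
check["ratio"] = check["NPX"] / check["peptide_sum"]
print(check)
```

In this toy data, protein `P2` has a single peptide and a ratio of 1, while `P1` does not match the sum of its peptides.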

## TimeSeries Clinical Testing Results (Train/Test)
---
* visit_id - ID code for the visit.
* visit_month - The month of the visit, relative to the first visit by the patient.
* patient_id - An ID code for the patient.
* updrs_[1-4] - The patient's score for part N of the Unified Parkinson's Disease Rating Scale. Higher numbers indicate more severe symptoms. Each sub-section covers a distinct category of symptoms, such as mood and behavior for Part 1 and motor functions for Part 3.
* upd23b_clinical_state_on_medication - Whether or not the patient was taking medication such as Levodopa during the UPDRS assessment. Expected to mainly affect the scores for Part 3 (motor function). These medications wear off fairly quickly (on the order of one day) so it's common for patients to take the motor function exam twice in a single month, both with and without medication.
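
In the sample rows shown later in this notebook, `visit_id` looks like `patient_id` and `visit_month` joined by an underscore (for example `55_0`). That is an observation from those samples, not something the feature list states, so a quick check like the following sketch (on made-up rows) may be worth running before relying on it.

```python
import pandas as pd

# Toy rows in the same shape as the clinical table.
visits = pd.DataFrame({
    "visit_id": ["55_0", "55_3", "942_6"],
    "patient_id": [55, 55, 942],
    "visit_month": [0, 3, 6],
})

# Rebuild the ID from its assumed parts and compare with the stored value.
rebuilt = visits["patient_id"].astype(str) + "_" + visits["visit_month"].astype(str)
consistent = bool((rebuilt == visits["visit_id"]).all())
print(consistent)
```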

In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
import pandas as pd
import tensorflow as tf
import keras
import seaborn as sns 
import matplotlib.pyplot as plt
from tensorflow.keras import optimizers
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report

2023-04-11 20:26:43.195593: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


### Peptide Features 

In [2]:
peptide = pd.read_csv('train_peptides.csv')
peptide.head(3)

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide,PeptideAbundance
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0
2,55_0,0,55,O00533,IEIPSSVQQVPTIIK,174185.0


## Transforming Features
---

In [3]:
transform_peptide = peptide.pivot(index = peptide.columns[:3].tolist(), columns = 'Peptide', values = 'PeptideAbundance').reset_index()

In [4]:
transform_peptide.fillna(0, inplace = True)

In [5]:
transform_peptide.head()

Peptide,visit_id,visit_month,patient_id,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,AAFTEC(UniMod_4)C(UniMod_4)QAADK,AANEVSSADVK,AATGEC(UniMod_4)TATVGKR,AATVGSLAGQPLQER,AAVYHHFISDGVR,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
0,10053_0,0,10053,6580710.0,31204.4,7735070.0,0.0,0.0,0.0,46620.3,...,202274.0,0.0,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,0.0,7207.3
1,10053_12,12,10053,6333510.0,52277.6,5394390.0,0.0,0.0,0.0,57554.5,...,201009.0,0.0,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.8
2,10053_18,18,10053,7129640.0,61522.0,7011920.0,35984.7,17188.0,19787.3,36029.4,...,220728.0,0.0,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.7
3,10138_12,12,10138,7404780.0,46107.2,10610900.0,0.0,20910.2,66662.3,55253.9,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
4,10138_24,24,10138,13788300.0,56910.3,6906160.0,13785.5,11004.2,63672.7,36819.8,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,0.0,56977.6,4903.09
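
To make the long-to-wide step easier to follow, here is a minimal sketch of the same `pivot` and `fillna(0)` pattern on a made-up three-row table. Note that filling with zero treats a peptide that was not reported for a visit as having zero abundance, which is a modeling assumption rather than a measured value.

```python
import pandas as pd

# Toy long-format table: one row per (visit, peptide).
long_df = pd.DataFrame({
    "visit_id": ["1_0", "1_0", "1_6"],
    "visit_month": [0, 0, 6],
    "patient_id": [1, 1, 1],
    "Peptide": ["AAK", "GGR", "AAK"],
    "PeptideAbundance": [100.0, 250.0, 120.0],
})

# One row per visit, one column per peptide.
wide_df = long_df.pivot(
    index=["visit_id", "visit_month", "patient_id"],
    columns="Peptide",
    values="PeptideAbundance",
).reset_index()

# Peptide GGR was not reported for visit 1_6, so the pivot leaves NaN there.
n_missing = int(wide_df["GGR"].isna().sum())

# Same choice as the notebook: replace missing abundances with 0.
wide_filled = wide_df.fillna(0)
print(wide_filled)
```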


## Protein Features
---


In [6]:
protein = pd.read_csv('train_proteins.csv')
protein.head()

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,NPX
0,55_0,0,55,O00391,11254.3
1,55_0,0,55,O00533,732430.0
2,55_0,0,55,O00584,39585.8
3,55_0,0,55,O14498,41526.9
4,55_0,0,55,O14773,31238.0


In [7]:
transform_protein = protein.pivot(index = protein.columns[:3].tolist(), columns = 'UniProt', values = 'NPX').reset_index().fillna(0)

In [8]:
transform_protein.head()

UniProt,visit_id,visit_month,patient_id,O00391,O00533,O00584,O14498,O14773,O14791,O15240,...,Q9HDC9,Q9NQ79,Q9NYU2,Q9UBR2,Q9UBX5,Q9UHG2,Q9UKV8,Q9UNU6,Q9Y646,Q9Y6R7
0,10053_0,0,10053,9104.27,402321.0,0.0,0.0,7150.57,2497.84,83002.9,...,0.0,9469.45,94237.6,0.0,23016.0,177983.0,65900.0,15382.0,0.0,19017.4
1,10053_12,12,10053,10464.2,435586.0,0.0,0.0,0.0,0.0,197117.0,...,0.0,14408.4,0.0,0.0,28537.0,171733.0,65668.1,0.0,9295.65,25697.8
2,10053_18,18,10053,13235.7,507386.0,7126.96,24525.7,0.0,2372.71,126506.0,...,317477.0,38667.2,111107.0,0.0,37932.6,245188.0,59986.1,10813.3,0.0,29102.7
3,10138_12,12,10138,12600.2,494581.0,9165.06,27193.5,22506.1,6015.9,156313.0,...,557904.0,44556.9,155619.0,14647.9,36927.7,229232.0,106564.0,26077.7,21441.8,7642.42
4,10138_24,24,10138,12003.2,522138.0,4498.51,17189.8,29112.4,2665.15,151169.0,...,0.0,47836.7,177619.0,17061.1,25510.4,176722.0,59471.4,12639.2,15091.4,6168.55


## Joining Features 

In [9]:
join_features = pd.merge(transform_peptide, transform_protein, on = ['visit_id', 'visit_month', 'patient_id'])

In [10]:
join_features.head()

Unnamed: 0,visit_id,visit_month,patient_id,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,AAFTEC(UniMod_4)C(UniMod_4)QAADK,AANEVSSADVK,AATGEC(UniMod_4)TATVGKR,AATVGSLAGQPLQER,AAVYHHFISDGVR,...,Q9HDC9,Q9NQ79,Q9NYU2,Q9UBR2,Q9UBX5,Q9UHG2,Q9UKV8,Q9UNU6,Q9Y646,Q9Y6R7
0,10053_0,0,10053,6580710.0,31204.4,7735070.0,0.0,0.0,0.0,46620.3,...,0.0,9469.45,94237.6,0.0,23016.0,177983.0,65900.0,15382.0,0.0,19017.4
1,10053_12,12,10053,6333510.0,52277.6,5394390.0,0.0,0.0,0.0,57554.5,...,0.0,14408.4,0.0,0.0,28537.0,171733.0,65668.1,0.0,9295.65,25697.8
2,10053_18,18,10053,7129640.0,61522.0,7011920.0,35984.7,17188.0,19787.3,36029.4,...,317477.0,38667.2,111107.0,0.0,37932.6,245188.0,59986.1,10813.3,0.0,29102.7
3,10138_12,12,10138,7404780.0,46107.2,10610900.0,0.0,20910.2,66662.3,55253.9,...,557904.0,44556.9,155619.0,14647.9,36927.7,229232.0,106564.0,26077.7,21441.8,7642.42
4,10138_24,24,10138,13788300.0,56910.3,6906160.0,13785.5,11004.2,63672.7,36819.8,...,0.0,47836.7,177619.0,17061.1,25510.4,176722.0,59471.4,12639.2,15091.4,6168.55


## Clinical Results
---



In [11]:
Clinical_Results = pd.read_csv('train_clinical_data.csv')

In [12]:
Clinical_Results.head()

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication
0,55_0,55,0,10.0,6.0,15.0,,
1,55_3,55,3,10.0,7.0,25.0,,
2,55_6,55,6,8.0,10.0,34.0,,
3,55_9,55,9,8.0,9.0,30.0,0.0,On
4,55_12,55,12,10.0,10.0,41.0,0.0,On


In [13]:
# Filling Nulls with Zero 
Clinical_Results.fillna(0, inplace = True)

In [14]:
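# Categorical data: label the zero-filled (previously missing) medication state as 'Off'
Clinical_Results['upd23b_clinical_state_on_medication'] = Clinical_Results['upd23b_clinical_state_on_medication'].replace(0, 'Off')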

In [15]:
# Categorical data
cat = pd.get_dummies(Clinical_Results['upd23b_clinical_state_on_medication'], drop_first = True)
cat.head()

Unnamed: 0,On
0,0
1,0
2,0
3,1
4,1
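
The encoding above can be reproduced on a small made-up series. With `drop_first = True`, `get_dummies` drops the first category in sorted order (`Off`) and keeps a single `On` indicator. Passing `dtype=int` is an addition in this sketch: recent pandas versions return boolean columns by default, and this keeps the 0/1 output shown above.

```python
import pandas as pd

# Toy medication-state column after missing values were relabeled as 'Off'.
state = pd.Series(["Off", "Off", "On", "On", "Off"], name="state")

# Two categories collapse into one indicator column named 'On'.
dummies = pd.get_dummies(state, drop_first=True, dtype=int)
print(dummies)
```

Because missing states were filled before encoding, a 0 in `On` covers both "off medication" and "not recorded".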


In [16]:
results_final = pd.merge(Clinical_Results, cat, left_index = True, right_index = True).drop('upd23b_clinical_state_on_medication', axis = 1)

In [17]:
results_final.head()

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,On
0,55_0,55,0,10.0,6.0,15.0,0.0,0
1,55_3,55,3,10.0,7.0,25.0,0.0,0
2,55_6,55,6,8.0,10.0,34.0,0.0,0
3,55_9,55,9,8.0,9.0,30.0,0.0,1
4,55_12,55,12,10.0,10.0,41.0,0.0,1


In [18]:
# Peptide visualization dataset 
transform_peptide = pd.merge(results_final, transform_peptide, on = ['visit_id', 'patient_id', 'visit_month'])

transform_peptide.to_csv('Resources_Clean/visualization_peptide.csv', index = False)

# Protein visualization dataset
transform_protein = pd.merge(results_final, transform_protein, on = ['visit_id', 'patient_id', 'visit_month'])


transform_protein.to_csv('Resources_Clean/visualization_protein.csv', index = False)

In [19]:
transform_peptide.head(3)

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,On,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
0,55_0,55,0,10.0,6.0,15.0,0.0,0,8984260.0,53855.6,...,201158.0,16492.3,3810270.0,106894.0,580667.0,131155.0,165851.0,437305.0,46289.2,14898.4
1,55_6,55,6,8.0,10.0,34.0,0.0,0,8279770.0,45251.9,...,171079.0,13198.8,4119520.0,113385.0,514861.0,103512.0,144607.0,457891.0,40047.7,20703.9
2,55_12,55,12,10.0,10.0,41.0,0.0,1,8382390.0,53000.9,...,231772.0,17873.8,5474140.0,116286.0,711815.0,136943.0,181763.0,452253.0,54725.1,21841.1


In [20]:
transform_protein.head(3)

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,On,O00391,O00533,...,Q9HDC9,Q9NQ79,Q9NYU2,Q9UBR2,Q9UBX5,Q9UHG2,Q9UKV8,Q9UNU6,Q9Y646,Q9Y6R7
0,55_0,55,0,10.0,6.0,15.0,0.0,0,11254.3,732430.0,...,365475.0,35528.0,97005.6,23122.5,60912.6,408698.0,0.0,29758.8,23833.7,18953.5
1,55_6,55,6,8.0,10.0,34.0,0.0,0,13163.6,630465.0,...,405676.0,30332.6,109174.0,23499.8,51655.8,369870.0,0.0,22935.2,17722.5,16642.7
2,55_12,55,12,10.0,10.0,41.0,0.0,1,15257.6,815083.0,...,303953.0,43026.2,114921.0,21860.1,61598.2,318553.0,65762.6,29193.4,28536.1,19290.9


### Combined Information

In [21]:
combined = pd.merge(results_final, join_features, on = ['visit_id', 'visit_month', 'patient_id'])

In [22]:
combined.head()

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,On,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,...,Q9HDC9,Q9NQ79,Q9NYU2,Q9UBR2,Q9UBX5,Q9UHG2,Q9UKV8,Q9UNU6,Q9Y646,Q9Y6R7
0,55_0,55,0,10.0,6.0,15.0,0.0,0,8984260.0,53855.6,...,365475.0,35528.0,97005.6,23122.5,60912.6,408698.0,0.0,29758.8,23833.7,18953.5
1,55_6,55,6,8.0,10.0,34.0,0.0,0,8279770.0,45251.9,...,405676.0,30332.6,109174.0,23499.8,51655.8,369870.0,0.0,22935.2,17722.5,16642.7
2,55_12,55,12,10.0,10.0,41.0,0.0,1,8382390.0,53000.9,...,303953.0,43026.2,114921.0,21860.1,61598.2,318553.0,65762.6,29193.4,28536.1,19290.9
3,55_36,55,36,17.0,18.0,51.0,0.0,1,10671500.0,58108.4,...,303597.0,48188.4,109794.0,23930.6,70223.5,377550.0,74976.1,31732.6,22186.5,21717.1
4,942_6,942,6,8.0,2.0,21.0,0.0,0,6177730.0,42682.6,...,253373.0,27431.8,93796.7,17450.9,21299.1,306621.0,82335.5,24018.7,18939.5,15251.2
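
`pd.merge` defaults to an inner join, so clinical visits without matching protein or peptide rows drop out of `combined` (visits `55_3` and `55_9` appear in the clinical data above but not here). A sketch of how one might count what an inner join discards, using made-up frames and the `indicator` option:

```python
import pandas as pd

# Toy clinical and feature tables with partially overlapping visits.
left = pd.DataFrame({"visit_id": ["1_0", "1_6", "2_0"], "updrs_1": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"visit_id": ["1_0", "2_0", "3_0"], "prot_X": [10.0, 30.0, 50.0]})

# Default behavior: only visits present in both tables survive.
inner = pd.merge(left, right, on="visit_id")

# Outer join with an indicator column shows where each row came from.
audit = pd.merge(left, right, on="visit_id", how="outer", indicator=True)
counts = audit["_merge"].value_counts()
print(len(inner))
print(counts)
```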


In [23]:
# Export Data For Visualization 
combined.to_csv('Resources_Clean/vizualization.csv', index = False)

In [24]:
combined.drop('visit_id', axis = 1, inplace = True)

### One-Hot Encoding Patient ID

In [25]:
cat = pd.get_dummies(combined['patient_id'])
cat.head()

Unnamed: 0,55,942,1517,1923,2660,3636,3863,4161,4172,4923,...,62329,62437,62723,62732,62792,63875,63889,64669,64674,65043
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
combined_final = pd.merge(combined, cat, left_index = True, right_index = True).drop('patient_id', axis = 1)

In [27]:
combined_final.head()

Unnamed: 0,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,On,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,AAFTEC(UniMod_4)C(UniMod_4)QAADK,AANEVSSADVK,...,62329,62437,62723,62732,62792,63875,63889,64669,64674,65043
0,0,10.0,6.0,15.0,0.0,0,8984260.0,53855.6,8579740.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,6,8.0,10.0,34.0,0.0,0,8279770.0,45251.9,8655890.0,49927.5,...,0,0,0,0,0,0,0,0,0,0
2,12,10.0,10.0,41.0,0.0,1,8382390.0,53000.9,8995640.0,45519.2,...,0,0,0,0,0,0,0,0,0,0
3,36,17.0,18.0,51.0,0.0,1,10671500.0,58108.4,9985420.0,52374.0,...,0,0,0,0,0,0,0,0,0,0
4,6,8.0,2.0,21.0,0.0,0,6177730.0,42682.6,3596660.0,25698.8,...,0,0,0,0,0,0,0,0,0,0
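
Encoding the `patient_id` series on its own produces columns labeled with integers (`55`, `942`, ...), which then sit next to string-labeled feature columns. One possible alternative, sketched here on made-up rows, is to let `get_dummies` encode the column in place with a prefix; the `patient` prefix and `dtype=int` are choices made for this sketch, not part of the notebook.

```python
import pandas as pd

# Toy frame with the same kind of columns as `combined`.
df = pd.DataFrame({
    "patient_id": [55, 55, 942],
    "visit_month": [0, 6, 6],
    "updrs_1": [10.0, 8.0, 8.0],
})

# Encode patient_id in one step; the original column is replaced by
# string-named indicator columns such as 'patient_55'.
encoded = pd.get_dummies(df, columns=["patient_id"], prefix="patient", dtype=int)
print(encoded)
```

This avoids the separate index-based merge and keeps every column label a string.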


In [28]:
combined_final.to_csv('Resources_Clean/Result_Protein_Peptide_Combine.csv', index = False)