**References:**
* <a href="https://www.kaggle.com/code/gusthema/parkinson-s-disease-progression-prediction-w-tfdf" style="text-decoration:none">Parkinson's Disease Progression Prediction w TFDF</a>

# Import the Required Libraries

In [1]:
import tensorflow as tf
import tensorflow_decision_forests as tfdf

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



In [3]:
print("TensorFlow Version: ", tf.__version__)
print("TF-DF Version: ", tfdf.__version__)

TensorFlow Version:  2.11.0
TF-DF Version:  1.2.0


<br>

# Load the Dataset

In [9]:
data_dir = "/kaggle/input/amp-parkinsons-disease-progression-prediction/"
data_dir

'/kaggle/input/amp-parkinsons-disease-progression-prediction/'

In [11]:
df_train_proteins = pd.read_csv(data_dir + "train_proteins.csv")
df_train_peptides = pd.read_csv(data_dir + "train_peptides.csv")
df_train_clinical = pd.read_csv(data_dir + "train_clinical_data.csv")

We will now examine each of these DataFrames in detail.

**UPDRS** (Unified Parkinson’s Disease Rating Scale) is a rating instrument used to measure the severity and progression of Parkinson’s disease in patients. When a patient visits the clinic, the clinic records how the patient scored on four parts of the UPDRS test. This data can be found in `train_clinical`, where the ratings for the four parts are available as ***updrs_1***, ***updrs_2***, ***updrs_3*** and ***updrs_4***. Our goal is to train a model to predict these UPDRS ratings.

Let us examine the shape of the `train_clinical` DataFrame.

In [13]:
df_train_clinical.shape

(2615, 8)

In [14]:
df_train_clinical.head(5)

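| | visit_id | patient_id | visit_month | updrs_1 | updrs_2 | updrs_3 | updrs_4 | upd23b_clinical_state_on_medication |
|---|---|---|---|---|---|---|---|---|
| 0 | 55_0 | 55 | 0 | 10.0 | 6.0 | 15.0 | NaN | NaN |
| 1 | 55_3 | 55 | 3 | 10.0 | 7.0 | 25.0 | NaN | NaN |
| 2 | 55_6 | 55 | 6 | 8.0 | 10.0 | 34.0 | NaN | NaN |
| 3 | 55_9 | 55 | 9 | 8.0 | 9.0 | 30.0 | 0.0 | On |
| 4 | 55_12 | 55 | 12 | 10.0 | 10.0 | 41.0 | 0.0 | On |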
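
To make the target columns concrete, here is a minimal sketch that rebuilds the five rows shown above as a small DataFrame and counts the missing values in each UPDRS column. It assumes the blank cells in the output are missing values (NaN). The names `sample_clinical`, `target_cols` and `missing_per_target` are introduced here for illustration only and do not come from the original notebook; on the real data the same calls would be applied to `df_train_clinical`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train_clinical: the five rows printed by head(5).
# Blank cells in the printed output are assumed to be NaN.
sample_clinical = pd.DataFrame({
    "visit_id": ["55_0", "55_3", "55_6", "55_9", "55_12"],
    "patient_id": [55, 55, 55, 55, 55],
    "visit_month": [0, 3, 6, 9, 12],
    "updrs_1": [10.0, 10.0, 8.0, 8.0, 10.0],
    "updrs_2": [6.0, 7.0, 10.0, 9.0, 10.0],
    "updrs_3": [15.0, 25.0, 34.0, 30.0, 41.0],
    "updrs_4": [np.nan, np.nan, np.nan, 0.0, 0.0],
    "upd23b_clinical_state_on_medication": [np.nan, np.nan, np.nan, "On", "On"],
})

# The four columns the model is meant to predict.
target_cols = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"]

# Count missing ratings per target column.
missing_per_target = sample_clinical[target_cols].isna().sum()
print(missing_per_target)
```

In this small sample only `updrs_4` has gaps, which suggests checking how much of each target is missing in the full table before training.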


<br>

The clinic also records the patient's **NPX** (Normalized Protein Expression) value for each protein relevant to Parkinson's disease during each visit. **NPX** represents the concentration of the protein in the sample. This data is available in the `train_proteins` DataFrame.

Let us examine the shape of the `train_proteins` DataFrame.

In [15]:
df_train_proteins.shape

(232741, 5)

In [16]:
df_train_proteins.head(5)

| | visit_id | visit_month | patient_id | UniProt | NPX |
|---|---|---|---|---|---|
| 0 | 55_0 | 0 | 55 | O00391 | 11254.3 |
| 1 | 55_0 | 0 | 55 | O00533 | 732430.0 |
| 2 | 55_0 | 0 | 55 | O00584 | 39585.8 |
| 3 | 55_0 | 0 | 55 | O14498 | 41526.9 |
| 4 | 55_0 | 0 | 55 | O14773 | 31238.0 |
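
The protein table is in long format: each row holds one protein measurement for one visit. One common way to get a single row per visit, with one column per protein, is a pivot. The sketch below shows this on the five rows above; it is an illustration under that assumption, not necessarily the approach taken later in this notebook, and the names `sample_proteins` and `wide_proteins` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df_train_proteins: the five rows printed by head(5).
sample_proteins = pd.DataFrame({
    "visit_id": ["55_0", "55_0", "55_0", "55_0", "55_0"],
    "visit_month": [0, 0, 0, 0, 0],
    "patient_id": [55, 55, 55, 55, 55],
    "UniProt": ["O00391", "O00533", "O00584", "O14498", "O14773"],
    "NPX": [11254.3, 732430.0, 39585.8, 41526.9, 31238.0],
})

# Long -> wide: one row per visit, one column per UniProt ID, NPX as the cell value.
wide_proteins = sample_proteins.pivot(index="visit_id", columns="UniProt", values="NPX")
print(wide_proteins)
```

All five sample rows belong to visit `55_0`, so the result has a single row with five protein columns. On the full table, any protein not measured at a given visit would appear as NaN in that visit's row.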


<br>

Proteins are long molecules made up of multiple peptides. The clinic also records the **Peptide Abundance** of each peptide in the proteins relevant to Parkinson's disease. Like NPX for proteins, it represents the concentration of the peptide. This data can be found in the `train_peptides` DataFrame.

Let us examine the shape of the `train_peptides` DataFrame.

In [17]:
df_train_peptides.shape

(981834, 6)

In [18]:
df_train_peptides.head(5)

| | visit_id | visit_month | patient_id | UniProt | Peptide | PeptideAbundance |
|---|---|---|---|---|---|---|
| 0 | 55_0 | 0 | 55 | O00391 | NEQEQPLGQWHLS | 11254.3 |
| 1 | 55_0 | 0 | 55 | O00533 | GNPEPTFSWTK | 102060.0 |
| 2 | 55_0 | 0 | 55 | O00533 | IEIPSSVQQVPTIIK | 174185.0 |
| 3 | 55_0 | 0 | 55 | O00533 | KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK | 27278.9 |
| 4 | 55_0 | 0 | 55 | O00533 | SMEQNGPGLEYR | 30838.7 |
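
Since each protein (`UniProt`) can appear with several peptides, a quick way to see that relationship is to count distinct peptides per protein. The sketch below does this on the five rows above only; the names `sample_peptides` and `peptides_per_protein` are hypothetical, and the counts describe this sample, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for df_train_peptides: the five rows printed by head(5).
sample_peptides = pd.DataFrame({
    "visit_id": ["55_0", "55_0", "55_0", "55_0", "55_0"],
    "visit_month": [0, 0, 0, 0, 0],
    "patient_id": [55, 55, 55, 55, 55],
    "UniProt": ["O00391", "O00533", "O00533", "O00533", "O00533"],
    "Peptide": [
        "NEQEQPLGQWHLS",
        "GNPEPTFSWTK",
        "IEIPSSVQQVPTIIK",
        "KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK",
        "SMEQNGPGLEYR",
    ],
    "PeptideAbundance": [11254.3, 102060.0, 174185.0, 27278.9, 30838.7],
})

# Number of distinct peptides recorded for each protein in this sample.
peptides_per_protein = sample_peptides.groupby("UniProt")["Peptide"].nunique()
print(peptides_per_protein)
```

On the real data the same `groupby` would be applied to `df_train_peptides`, possibly grouping by `visit_id` as well to look at one visit at a time.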


<br>

# Plotting clinical data