# MID_TERM_PROJECT / MID_TERM_EXAM

**<center> Goal:</center>**
The goal of this competition is to predict MDS-UPDRS scores, which measure progression in patients with Parkinson's disease. The Movement Disorder Society-Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) is a comprehensive assessment of both motor and non-motor symptoms associated with Parkinson's. You will develop a model trained on data of protein and peptide levels over time in subjects with Parkinson’s disease versus normal age-matched control subjects.

Your work could help provide important breakthrough information about which molecules change as Parkinson’s disease progresses.

Source : https://www.kaggle.com/competitions/amp-parkinsons-disease-progression-prediction/overview

Documentation for the Model Keras : https://www.tensorflow.org/guide/keras/sequential_model?hl=fr

Documentation Loss Function : https://keras.io/api/losses/

*CV stands for cross-validation: it is the score measured on your own validation set. In a competition, the leaderboard (LB) score is normally computed on only 20-30 % of the test data. If you submit every day to chase a high LB score even though your CV score is poor, your model ends up overfitting to the 20-30 % of the data used for the LB.*
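
To make the idea of a validation score concrete, here is a minimal sketch of a patient-wise hold-out split in plain Python. The function name `split_by_patient`, the 20 % validation fraction and the toy patient IDs are assumptions made for illustration; they are not part of the competition code. Keeping all visits of one patient on the same side of the split is one possible way to obtain a validation score that does not depend on the leaderboard.

```python
import random


def split_by_patient(patient_ids, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) so that no patient appears in both sets."""
    unique_patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_patients)

    # Reserve at least one patient for validation.
    n_val = max(1, int(len(unique_patients) * val_fraction))
    val_patients = set(unique_patients[:n_val])

    train_idx = [i for i, p in enumerate(patient_ids) if p not in val_patients]
    val_idx = [i for i, p in enumerate(patient_ids) if p in val_patients]
    return train_idx, val_idx


# Hypothetical toy data: one entry per visit, several visits per patient.
patient_ids = [55, 55, 55, 942, 942, 1517, 1517, 1517, 1923, 2660]
train_idx, val_idx = split_by_patient(patient_ids, val_fraction=0.2)
print("train rows:", train_idx)
print("validation rows:", val_idx)
```

The validation rows can then be scored with the same metric as the leaderboard, and that score is the one to trust when comparing models.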

# Import data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from keras import layers

train_clinical_data = pd.read_csv("../Mid_term_project/train_clinical_data.csv")
train_peptides_data = pd.read_csv("../Mid_term_project/train_peptides.csv")
train_protiens_data = pd.read_csv("../Mid_term_project/train_proteins.csv")
supplemental_clinical_data = pd.read_csv("../Mid_term_project/supplemental_clinical_data.csv")

In [None]:
train_clinical_data

In [None]:
print("Found {:,d} unique patient_id values".format(train_clinical_data["patient_id"].nunique()))
print("Found {:,d} unique visit_month values".format(train_clinical_data["visit_month"].nunique()))

In [None]:
train_peptides_data

In [None]:
print("Found {:,d} unique patient_id values".format(train_peptides_data["patient_id"].nunique()))
print("Found {:,d} unique visit_month values".format(train_peptides_data["visit_month"].nunique()))
print("Found {:,d} unique peptide values".format(train_peptides_data["Peptide"].nunique()))
print("Found {:,d} unique UniProt values".format(train_peptides_data["UniProt"].nunique()))

In [None]:
train_protiens_data

In [None]:
print("Found {:,d} unique patient_id values".format(train_protiens_data["patient_id"].nunique()))
print("Found {:,d} unique visit_month values".format(train_protiens_data["visit_month"].nunique()))
print("Found {:,d} unique UniProt values".format(train_protiens_data["UniProt"].nunique()))

In [None]:
supplemental_clinical_data

In [None]:
print("Found {:,d} unique patient_id values".format(supplemental_clinical_data["patient_id"].nunique()))
print("Found {:,d} unique visit_month values".format(supplemental_clinical_data["visit_month"].nunique()))

**<center> Observations : </center>**
 
**1. Train_clinical_data :**

There are 248 unique patients, each with a disease severity described by the updrs_[1-4] scores.
 
 updrs_[1-4] - The patient's score for part N of the Unified Parkinson's Disease Rating Scale. Higher numbers indicate more severe symptoms. Each sub-section covers a distinct category of symptoms, such as mood and behavior for Part 1 and motor functions for Part 3.

* Found 248 unique patient_id values
* Found 17 unique visit_month values
 
**2. Train_peptides_data :**

UniProt - The UniProt ID code for the associated protein. There are often several peptides per protein.

Peptide - The sequence of amino acids included in the peptide. Some rare annotations may not be included in the table. The test set may include peptides not found in the train set.

* Found 248 unique patient_id values
* Found 968 unique peptide values
* Found 227 unique UniProt values

**3. Train_proteins_data :**

* Found 248 unique patient_id values
* Found 227 unique UniProt values

**4. Supplemental_clinical_data :**

There are no protein records. We can use this data to observe whether a patient is taking PD (Parkinson's disease) medication and the disease severity given by updrs_[1-4].

* Found 771 unique patient_id values
* Found 8 unique visit_month values

# Visualization of data

In [None]:
# Counting the number of NaN in the train_clinical_data

null_count = train_clinical_data.isna().sum().to_list()
plt.bar(list(train_clinical_data),null_count)
plt.xticks(list(train_clinical_data), rotation=90)

In [None]:
# Counting the number of NaN in the train_peptides_data

null_count_1 = train_peptides_data.isna().sum().to_list()
print(null_count_1)

In [None]:
# Counting the number of NaN in the train_protiens_data

null_count_2 = train_protiens_data.isna().sum().to_list()
print(null_count_2)

In [None]:
# Counting the number of NaN in the supplemental_clinical_data

null_count_3 = supplemental_clinical_data.isna().sum().to_list()
plt.bar(list(supplemental_clinical_data),null_count_3)
plt.xticks(list(supplemental_clinical_data), rotation=90)

**<center> Observations 1 : </center>**

We can see that train_peptides_data and train_proteins_data have no NaN values, which is great.

However, many rows of train_clinical_data contain NaN values: upd23b_clinical_state_on_medication is missing in 1327 of 2615 rows (ratio 0.507), i.e. about half of the rows.

Likewise, many rows of supplemental_clinical_data contain NaN values: upd23b_clinical_state_on_medication is missing in 1101 of 2223 rows (ratio 0.495), again about half of the rows.

One possible solution to this problem is to replace each NaN value with the previous value.
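
As a rough sketch of what "replace by the previous value" can look like in pandas, the example below forward-fills the medication state within each patient. The toy DataFrame and the `state_filled` column are hypothetical; grouping by `patient_id` before filling is an assumption added here so that one patient's value is not carried over to the next patient.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the clinical files.
toy = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit_month": [0, 6, 12, 0, 6],
    "upd23b_clinical_state_on_medication": ["On", np.nan, np.nan, np.nan, "Off"],
})

# Order the visits in time, then carry the last known state forward per patient.
toy = toy.sort_values(["patient_id", "visit_month"])
toy["state_filled"] = (
    toy.groupby("patient_id")["upd23b_clinical_state_on_medication"].ffill()
)
print(toy)
```

A first visit with no recorded state (patient 2, month 0) stays NaN, because there is no earlier value to copy.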

**Now I am going to concatenate the clinical and supplemental_clinical datasets, remove the rows containing NaN values, and see how the UPDRS scores evolve over the months.**



In [None]:
concat_data = pd.concat([train_clinical_data, supplemental_clinical_data]).dropna()
concat_data

In [None]:
print("Found {:,d} unique patient_id values".format(concat_data["patient_id"].nunique()))
print("Found {:,d} unique visit_month values".format(concat_data["visit_month"].nunique()))

**How do the UPDRS scores evolve over the months in the concatenated data, depending on whether medication is on or off?**

Sort the data by month and plot the mean UPDRS scores over all patients.

Both medication-on and medication-off data are included.

In [None]:
clinical_data = concat_data.sort_values(by=["visit_month"])

# Average each UPDRS part over all patients for every visit month.
# groupby returns the months in ascending order, one row of means per month.
updrs_columns = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"]
monthly_means = clinical_data.groupby("visit_month")[updrs_columns].mean()

mois = monthly_means.index.tolist()
mean_updrs_1 = monthly_means["updrs_1"].tolist()
mean_updrs_2 = monthly_means["updrs_2"].tolist()
mean_updrs_3 = monthly_means["updrs_3"].tolist()
mean_updrs_4 = monthly_means["updrs_4"].tolist()

print("Months:",mois)

figure, axis = plt.subplots(2, 2)

axis[0, 0].scatter(mois, mean_updrs_1)
axis[0, 0].set_title("mean_updrs_1 over months")
  

axis[0, 1].scatter(mois, mean_updrs_2)
axis[0, 1].set_title("mean_updrs_2 over months")
  

axis[1, 0].scatter(mois, mean_updrs_3)
axis[1, 0].set_title("mean_updrs_3 over months")

axis[1, 1].scatter(mois, mean_updrs_4)
axis[1, 1].set_title("mean_updrs_4 over months")
  
figure.tight_layout(pad=2.5)
plt.show()

figure1, axis1 = plt.subplots(2, 2)

axis1[0, 0].plot(mois, mean_updrs_1)
axis1[0, 0].set_title("mean_updrs_1 over months")
  

axis1[0, 1].plot(mois, mean_updrs_2)
axis1[0, 1].set_title("mean_updrs_2 over months")
  

axis1[1, 0].plot(mois, mean_updrs_3)
axis1[1, 0].set_title("mean_updrs_3 over months")

axis1[1, 1].plot(mois, mean_updrs_4)
axis1[1, 1].set_title("mean_updrs_4 over months")
  
figure1.tight_layout(pad=2.5)
plt.show()

Sort the data by month and plot the mean UPDRS scores over all patients.

Only medication-off data are included.

In [None]:
clinical_data = concat_data.sort_values(by=["visit_month"])

# Keep only the visits recorded with medication off.
off_data = clinical_data[clinical_data["upd23b_clinical_state_on_medication"] == "Off"]

# Average each UPDRS part over all patients for every visit month.
updrs_columns = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"]
monthly_means_off = off_data.groupby("visit_month")[updrs_columns].mean()

mois = monthly_means_off.index.tolist()
mean_updrs_1_off = monthly_means_off["updrs_1"].tolist()
mean_updrs_2_off = monthly_means_off["updrs_2"].tolist()
mean_updrs_3_off = monthly_means_off["updrs_3"].tolist()
mean_updrs_4_off = monthly_means_off["updrs_4"].tolist()

print("Months:",mois)

figure, axis = plt.subplots(2, 2)

axis[0, 0].scatter(mois, mean_updrs_1_off)
axis[0, 0].set_title("mean_updrs_1_off over months")
  

axis[0, 1].scatter(mois, mean_updrs_2_off)
axis[0, 1].set_title("mean_updrs_2_off over months")
  

axis[1, 0].scatter(mois, mean_updrs_3_off)
axis[1, 0].set_title("mean_updrs_3_off over months")

axis[1, 1].scatter(mois, mean_updrs_4_off)
axis[1, 1].set_title("mean_updrs_4_off over months")
  
figure.tight_layout(pad=2.5)
plt.show()

figure1, axis1 = plt.subplots(2, 2)

axis1[0, 0].plot(mois, mean_updrs_1_off)
axis1[0, 0].set_title("mean_updrs_1_off over months")
  

axis1[0, 1].plot(mois, mean_updrs_2_off)
axis1[0, 1].set_title("mean_updrs_2_off over months")
  

axis1[1, 0].plot(mois, mean_updrs_3_off)
axis1[1, 0].set_title("mean_updrs_3_off over months")

axis1[1, 1].plot(mois, mean_updrs_4_off)
axis1[1, 1].set_title("mean_updrs_4_off over months")
  
figure1.tight_layout(pad=2.5)
plt.show()


Sort the data by month and plot the mean UPDRS scores over all patients.

Only medication-on data are included.

In [None]:
clinical_data = concat_data.sort_values(by=["visit_month"])

# Keep only the visits recorded with medication on.
on_data = clinical_data[clinical_data["upd23b_clinical_state_on_medication"] == "On"]

# Average each UPDRS part over all patients for every visit month.
updrs_columns = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"]
monthly_means_on = on_data.groupby("visit_month")[updrs_columns].mean()

mois = monthly_means_on.index.tolist()
mean_updrs_1_on = monthly_means_on["updrs_1"].tolist()
mean_updrs_2_on = monthly_means_on["updrs_2"].tolist()
mean_updrs_3_on = monthly_means_on["updrs_3"].tolist()
mean_updrs_4_on = monthly_means_on["updrs_4"].tolist()

print("Months:",mois)

figure, axis = plt.subplots(2, 2)

axis[0, 0].scatter(mois, mean_updrs_1_on)
axis[0, 0].set_title("mean_updrs_1_on over months")
  

axis[0, 1].scatter(mois, mean_updrs_2_on)
axis[0, 1].set_title("mean_updrs_2_on over months")
  

axis[1, 0].scatter(mois, mean_updrs_3_on)
axis[1, 0].set_title("mean_updrs_3_on over months")

axis[1, 1].scatter(mois, mean_updrs_4_on)
axis[1, 1].set_title("mean_updrs_4_on over months")
  
figure.tight_layout(pad=2.5)
plt.show()

figure1, axis1 = plt.subplots(2, 2)

axis1[0, 0].plot(mois, mean_updrs_1_on)
axis1[0, 0].set_title("mean_updrs_1_on over months")
  

axis1[0, 1].plot(mois, mean_updrs_2_on)
axis1[0, 1].set_title("mean_updrs_2_on over months")
  

axis1[1, 0].plot(mois, mean_updrs_3_on)
axis1[1, 0].set_title("mean_updrs_3_on over months")

axis1[1, 1].plot(mois, mean_updrs_4_on)
axis1[1, 1].set_title("mean_updrs_4_on over months")
  
figure1.tight_layout(pad=2.5)
plt.show()

**<center> Observations 2 : </center>**

When medication is off, the updrs_[1-4] scores stay roughly constant over the months.

When medication is on, the progression of the updrs_[1-4] scores appears to be slowed by the medication; we can see an increase around month 75.

In general, the updrs_[1-4] scores increase over the months, with or without medication.

We could consider a regression to predict the updrs_[1-4] values.
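
As an illustration of the simplest kind of regression, the sketch below fits a straight line of score against visit month with ordinary least squares in plain Python. The helper `fit_line` and the toy months and scores are hypothetical values chosen for the example; they are not taken from the dataset, and this is not the model used later in this notebook.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical mean scores per visit month (made-up numbers).
months = [0, 6, 12, 24]
scores = [5.0, 5.6, 6.2, 7.4]

slope, intercept = fit_line(months, scores)
predicted_month_36 = slope * 36 + intercept
print("slope per month:", slope)
print("predicted score at month 36:", predicted_month_36)
```

A linear trend is only a baseline; it ignores the protein and peptide measurements entirely.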

# Re-arrange the data : Merging the percentage of protein and the updrs

In [None]:
# Let's dive deeper. First, I drop the medication state and patient_id because I can't handle every parameter for now.

train = train_clinical_data.drop(["upd23b_clinical_state_on_medication", "patient_id"],axis=1)
train

In [None]:
# Let's merge the train_protiens and train_peptides data.

pep_pro = pd.merge(train_protiens_data, train_peptides_data, on=['visit_id', 'visit_month', 'patient_id', 'UniProt'])
print("Found {:,d} unique patient_id values".format(pep_pro["patient_id"].nunique()))
print("Found {:,d} unique visit_month values".format(pep_pro["visit_month"].nunique()))

In [None]:
# Compute Percentage_of_pep_in_Uniprot: the ratio of each peptide's abundance to the NPX value of its protein.
# A peptide is a short sequence of amino acids.
# A longer chain of amino acids is called a protein, and a protein contains more than one peptide.
pep_pro["Percentage_of_pep_in_Uniprot"] = pep_pro["PeptideAbundance"] / pep_pro["NPX"]
pep_pro

For example : ![image](Midterm_image.PNG)
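
The same computation can be reproduced on a tiny table. The sketch below uses made-up protein and peptide names and made-up abundance values; only the column names and the merge-then-divide steps follow the cells above.

```python
import pandas as pd

# Hypothetical protein-level table: one NPX value per protein and visit.
proteins = pd.DataFrame({
    "visit_id": ["55_0", "55_0"],
    "UniProt": ["PROT_A", "PROT_B"],
    "NPX": [100.0, 400.0],
})

# Hypothetical peptide-level table: one or more peptides per protein.
peptides = pd.DataFrame({
    "visit_id": ["55_0", "55_0", "55_0"],
    "UniProt": ["PROT_A", "PROT_B", "PROT_B"],
    "Peptide": ["PEP_1", "PEP_2", "PEP_3"],
    "PeptideAbundance": [100.0, 100.0, 300.0],
})

# Attach the protein NPX to each of its peptides, then take the ratio.
merged = pd.merge(proteins, peptides, on=["visit_id", "UniProt"])
merged["Percentage_of_pep_in_Uniprot"] = merged["PeptideAbundance"] / merged["NPX"]
print(merged[["UniProt", "Peptide", "Percentage_of_pep_in_Uniprot"]])
```

Note that the value is a ratio between 0 and 1 in this toy case rather than a percentage; multiplying by 100 would give a percentage.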

In [None]:
# Drop patient_id and visit_month because I will re-integrate them later.

pep_pro = pep_pro.drop(["patient_id","visit_month"], axis=1)
pep_pro

In [None]:
pep_pro = pep_pro.pivot(index=["visit_id"], columns=["Peptide"], values=(["Percentage_of_pep_in_Uniprot"]))
pep_pro

In [None]:
pep_pro.columns = pep_pro.columns.droplevel()
pep_pro = pep_pro.reset_index()
pep_pro

In [None]:
# Merge the clinical data (called train here) with the peptide and protein data.
# This brings back visit_month, which is a parameter to take into account in the training features.
df = pd.merge(train, pep_pro, on="visit_id", how="left")
df = df.set_index('visit_id')
df

In [None]:
# Select only patient ID 55 (the first 13 rows)

visit_id_specific = df.iloc[0:13]
visit_id_specific

In [None]:
# We can see that some of the values are missing over the months.

visit_id_specific.plot(use_index=True, y="AADDTWEPFASGK", kind='bar')
plt.show()

In [None]:
visit_id_specific.plot(use_index=True, y="AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K", kind='bar')
plt.show()

In [None]:
# Forward fill: each NaN takes the value of the previous row in the same column.
df1 = df.ffill()
df1

Remark: I fill every NaN value with the previous one and use the result in the model, as we will see later.

# Model and Neural Network

## Model: Mean absolute error

*The mean absolute error is one of a number of ways of comparing forecasts with their eventual outcomes.*
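
For reference, the mean absolute error over n predictions is MAE = (1/n) * sum(|y_i - y_hat_i|). The short sketch below computes it in plain Python on made-up values; the function name and the toy numbers are illustrative only. Keras computes the same quantity when the loss is set to `mean_absolute_error`.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical UPDRS targets and model predictions.
y_true = [10.0, 6.0, 15.0, 0.0]
y_pred = [8.0, 6.0, 18.0, 1.0]

mae = mean_absolute_error(y_true, y_pred)  # (2 + 0 + 3 + 1) / 4
print("MAE:", mae)
```

Because the error is expressed in the same unit as the target, an MAE of 1.5 here means the predictions are off by 1.5 score points on average.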

Starter

In [None]:
# def prepare_data(train_dataset, test_dataset, label):
    
#     # Break into target and features
#     train_dataset = train_dataset.dropna(subset=[label])
#     test_dataset = test_dataset.dropna(subset=[label])

#     # only the peptide percentage in the protein
#     train_features = train_dataset.drop(target, axis=1).copy()
#     test_features = test_dataset.drop(target, axis=1).copy()

#     # only the updrs
#     train_labels = train_dataset[label]
#     test_labels = test_dataset[label]
    
#     # Fill the Nan values by the mean value of the column
#     for c in train_features.columns:
#         m = train_features[c].mean()
#         train_features[c] = train_features[c].fillna(m)

#     for c in test_features.columns:
#         m = test_features[c].mean()
#         test_features[c] = test_features[c].fillna(m)
        
#     return train_features, test_features, train_labels, test_labels

# target = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"]

# train_dataset = df.sample(frac=0.7, random_state=0)
# test_dataset = df.drop(train_dataset.index)

# for label in target:

#     train_features, test_features, train_labels, test_labels = prepare_data(train_dataset, test_dataset,label)
#     train__features_val = train_features[-200:]
#     train_labels_val = train_labels[-200:]
    
#     # Normalization

#     features = np.array(train_features)
#     feat_normalizer = layers.Normalization(axis=-1)
#     feat_normalizer.adapt(features)

#     # Model
#     model = tf.keras.Sequential([
#         feat_normalizer,
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(484, activation='relu'),
#         layers.Dense(242, activation='relu'),
#         layers.Dense(121, activation='relu'),
#         layers.Dense(60, activation='relu'),
#         layers.Dense(units=1)
#     ])

#     # Training
#     model.compile(
#         optimizer=tf.keras.optimizers.Adam(learning_rate=10e-3),
#         loss='mean_absolute_error',
#         metrics=["mean_absolute_error"]
#         )

#     # Fitting
#     history = model.fit(
#         train_features,
#         train_labels,
#         batch_size=64,
#         epochs=100,
#         # Calculate validation results on 20% of the training data.
#         validation_split = 0.2,
#         )

#     # Evaluate the model on the test data using `evaluate`
#     print("Evaluate on test data")
#     results = model.evaluate(test_features, test_labels, batch_size=128)
#     print("test loss, test MAE:", results)

#     # Generate predictions (regression outputs of the last layer)
#     # on new data using `predict`
#     print("Generate predictions for 3 samples")
#     predictions = model.predict(test_features[:3])
#     print("predictions shape:", predictions.shape)

#     # summarize history for mean_absolute_error
#     plt.plot(history.history['mean_absolute_error'])
#     plt.plot(history.history['val_mean_absolute_error'])
#     plt.title('model mean_absolute_error')
#     plt.ylabel('mean_absolute_error')
#     plt.xlabel('epoch')
#     plt.legend(['train', 'validation'], loc='upper left')
#     plt.show()
#     # summarize history for loss
#     plt.plot(history.history['loss'])
#     plt.plot(history.history['val_loss'])
#     plt.title('model loss')
#     plt.ylabel('loss')
#     plt.xlabel('epoch')
#     plt.legend(['train', 'validation'], loc='upper left')
#     plt.show()

More layers

In [None]:
# def prepare_data(train_dataset, test_dataset, label):
    
#     # Break into target and features
#     train_dataset = train_dataset.dropna(subset=[label])
#     test_dataset = test_dataset.dropna(subset=[label])

#     # only the peptide percentage in the protein
#     train_features = train_dataset.drop(target, axis=1).copy()
#     test_features = test_dataset.drop(target, axis=1).copy()

#     # only the updrs
#     train_labels = train_dataset[label]
#     test_labels = test_dataset[label]
    
#     # Fill the Nan values by the mean value of the column
#     for c in train_features.columns:
#         m = train_features[c].mean()
#         train_features[c] = train_features[c].fillna(m)

#     for c in test_features.columns:
#         m = test_features[c].mean()
#         test_features[c] = test_features[c].fillna(m)
        
#     return train_features, test_features, train_labels, test_labels

# target = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"]

# train_dataset = df.sample(frac=0.7, random_state=0)
# test_dataset = df.drop(train_dataset.index)

# for label in target:

#     train_features, test_features, train_labels, test_labels = prepare_data(train_dataset, test_dataset,label)
#     train__features_val = train_features[-200:]
#     train_labels_val = train_labels[-200:]
    
#     # Normalization

#     features = np.array(train_features)
#     feat_normalizer = layers.Normalization(axis=-1)
#     feat_normalizer.adapt(features)

#     # Model
#     model = tf.keras.Sequential([
#         feat_normalizer,
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(484, activation='relu'),
#         layers.Dense(242, activation='relu'),
#         layers.Dense(121, activation='relu'),
#         layers.Dense(60, activation='relu'),
#         layers.Dense(units=1)
#     ])

#     # Training
#     model.compile(
#         optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
#         loss='mean_absolute_error',
#         metrics=["mean_absolute_error"]
#         )

#     # Fitting
#     history = model.fit(
#         train_features,
#         train_labels,
#         batch_size=64,
#         epochs=100,
#         # Calculate validation results on 20% of the training data.
#         validation_split = 0.2,
#         )

#     # Evaluate the model on the test data using `evaluate`
#     print("Evaluate on test data")
#     results = model.evaluate(test_features, test_labels, batch_size=128)
#     print("test loss, test MAE:", results)

#     # Generate predictions (regression outputs of the last layer)
#     # on new data using `predict`
#     print("Generate predictions for 3 samples")
#     predictions = model.predict(test_features[:3])
#     print("predictions shape:", predictions.shape)

#     # summarize history for mean_absolute_error
#     plt.plot(history.history['mean_absolute_error'])
#     plt.plot(history.history['val_mean_absolute_error'])
#     plt.title('model mean_absolute_error')
#     plt.ylabel('mean_absolute_error')
#     plt.xlabel('epoch')
#     plt.legend(['train', 'validation'], loc='upper left')
#     plt.show()
#     # summarize history for loss
#     plt.plot(history.history['loss'])
#     plt.plot(history.history['val_loss'])
#     plt.title('model loss')
#     plt.ylabel('loss')
#     plt.xlabel('epoch')
#     plt.legend(['train', 'validation'], loc='upper left')
#     plt.show()

Reduce the learning rate

In [None]:
# def prepare_data(train_dataset, test_dataset, label):
    
#     # Break into target and features
#     train_dataset = train_dataset.dropna(subset=[label])
#     test_dataset = test_dataset.dropna(subset=[label])

#     # only the peptide percentage in the protein
#     train_features = train_dataset.drop(target, axis=1).copy()
#     test_features = test_dataset.drop(target, axis=1).copy()

#     # only the updrs
#     train_labels = train_dataset[label]
#     test_labels = test_dataset[label]
    
#     # Fill the Nan values by the mean value of the column
#     for c in train_features.columns:
#         m = train_features[c].mean()
#         train_features[c] = train_features[c].fillna(m)

#     for c in test_features.columns:
#         m = test_features[c].mean()
#         test_features[c] = test_features[c].fillna(m)
        
#     return train_features, test_features, train_labels, test_labels

# target = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"]

# train_dataset = df.sample(frac=0.7, random_state=0)
# test_dataset = df.drop(train_dataset.index)

# for label in target:

#     train_features, test_features, train_labels, test_labels = prepare_data(train_dataset, test_dataset,label)
#     train__features_val = train_features[-200:]
#     train_labels_val = train_labels[-200:]
    
#     # Normalization

#     features = np.array(train_features)
#     feat_normalizer = layers.Normalization(axis=-1)
#     feat_normalizer.adapt(features)

#     # Model
#     model = tf.keras.Sequential([
#         feat_normalizer,
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(484, activation='relu'),
#         layers.Dense(242, activation='relu'),
#         layers.Dense(121, activation='relu'),
#         layers.Dense(60, activation='relu'),
#         layers.Dense(units=1)
#     ])

#     # Training
#     model.compile(
#         optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
#         loss='mean_absolute_error',
#         metrics=["mean_absolute_error"]
#         )

#     # Fitting
#     history = model.fit(
#         train_features,
#         train_labels,
#         batch_size=64,
#         epochs=100,
#         # Calculate validation results on 20% of the training data.
#         validation_split = 0.2,
#         )

#     # Evaluate the model on the test data using `evaluate`
#     print("Evaluate on test data")
#     results = model.evaluate(test_features, test_labels, batch_size=128)
#     print("test loss, test MAE:", results)

#     # Generate predictions (regression outputs of the last layer)
#     # on new data using `predict`
#     print("Generate predictions for 3 samples")
#     predictions = model.predict(test_features[:3])
#     print("predictions shape:", predictions.shape)

#     # summarize history for mean_absolute_error
#     # (the keys must match the metric name passed to `compile`)
#     plt.plot(history.history['mean_absolute_error'])
#     plt.plot(history.history['val_mean_absolute_error'])
#     plt.title('model mean_absolute_error')
#     plt.ylabel('mean_absolute_error')
#     plt.xlabel('epoch')
#     plt.legend(['train', 'validation'], loc='upper left')
#     plt.show()
#     # summarize history for loss
#     plt.plot(history.history['loss'])
#     plt.plot(history.history['val_loss'])
#     plt.title('model loss')
#     plt.ylabel('loss')
#     plt.xlabel('epoch')
#     plt.legend(['train', 'validation'], loc='upper left')
#     plt.show()

50 % training data and 50 % test data

In [None]:
# def prepare_data(train_dataset, test_dataset, label):
    
#     # Break into target and features
#     train_dataset = train_dataset.dropna(subset=[label])
#     test_dataset = test_dataset.dropna(subset=[label])

#     # only the peptide percentage in the protein
#     train_features = train_dataset.drop(target, axis=1).copy()
#     test_features = test_dataset.drop(target, axis=1).copy()

#     # only the updrs
#     train_labels = train_dataset[label]
#     test_labels = test_dataset[label]
    
#     # Fill the Nan values by the mean value of the column
#     for c in train_features.columns:
#         m = train_features[c].mean()
#         train_features[c] = train_features[c].fillna(m)

#     for c in test_features.columns:
#         m = test_features[c].mean()
#         test_features[c] = test_features[c].fillna(m)
        
#     return train_features, test_features, train_labels, test_labels

# target = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"]

# train_dataset = df.sample(frac=0.5, random_state=0)
# test_dataset = df.drop(train_dataset.index)

# for label in target:

#     train_features, test_features, train_labels, test_labels = prepare_data(train_dataset, test_dataset,label)
#     train__features_val = train_features[-200:]
#     train_labels_val = train_labels[-200:]
    
#     # Normalization

#     features = np.array(train_features)
#     feat_normalizer = layers.Normalization(axis=-1)
#     feat_normalizer.adapt(features)

#     # Model
#     model = tf.keras.Sequential([
#         feat_normalizer,
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(484, activation='relu'),
#         layers.Dense(242, activation='relu'),
#         layers.Dense(121, activation='relu'),
#         layers.Dense(60, activation='relu'),
#         layers.Dense(units=1)
#     ])

#     # Training
#     model.compile(
#         optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
#         loss='mean_absolute_error',
#         metrics=["mean_absolute_error"]
#         )

#     # Fitting
#     history = model.fit(
#         train_features,
#         train_labels,
#         batch_size=64,
#         epochs=100,
#         # Calculate validation results on 20% of the training data.
#         validation_split = 0.2,
#         )

#     # Evaluate the model on the test data using `evaluate`
#     print("Evaluate on test data")
#     results = model.evaluate(test_features, test_labels, batch_size=128)
#     print("test loss, test metric:", results)

#     # Generate predictions (regression outputs of the last layer)
#     # on new data using `predict`
#     print("Generate predictions for 3 samples")
#     predictions = model.predict(test_features[:3])
#     print("predictions shape:", predictions.shape)

#     # summarize history for mean_absolute_error
#     plt.plot(history.history['mean_absolute_error'])
#     plt.plot(history.history['val_mean_absolute_error'])
#     plt.title('model mean_absolute_error')
#     plt.ylabel('mean_absolute_error')
#     plt.xlabel('epoch')
#     plt.legend(['train', 'validation'], loc='upper left')
#     plt.show()
#     # summarize history for loss
#     plt.plot(history.history['loss'])
#     plt.plot(history.history['val_loss'])
#     plt.title('model loss')
#     plt.ylabel('loss')
#     plt.xlabel('epoch')
#     plt.legend(['train', 'validation'], loc='upper left')
#     plt.show()

Reduce the learning rate again

In [None]:
# def prepare_data(train_dataset, test_dataset, label):
    
#     # Break into target and features
#     train_dataset = train_dataset.dropna(subset=[label])
#     test_dataset = test_dataset.dropna(subset=[label])

#     # only the peptide percentage in the protein
#     train_features = train_dataset.drop(target, axis=1).copy()
#     test_features = test_dataset.drop(target, axis=1).copy()

#     # only the updrs
#     train_labels = train_dataset[label]
#     test_labels = test_dataset[label]
    
#     # Fill the NaN values with the column means of the training set
#     # (the test set reuses the training means so no test statistics leak in)
#     train_means = train_features.mean()
#     train_features = train_features.fillna(train_means)
#     test_features = test_features.fillna(train_means)
        
#     return train_features, test_features, train_labels, test_labels

# target = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"]

# train_dataset = df.sample(frac=0.5, random_state=0)
# test_dataset = df.drop(train_dataset.index)

# for label in target:

#     train_features, test_features, train_labels, test_labels = prepare_data(train_dataset, test_dataset,label)
#     train__features_val = train_features[-200:]
#     train_labels_val = train_labels[-200:]
    
#     # Normalization

#     features = np.array(train_features)
#     feat_normalizer = layers.Normalization(axis=-1)
#     feat_normalizer.adapt(features)

#     # Model
#     model = tf.keras.Sequential([
#         feat_normalizer,
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(969, activation='relu'),
#         layers.Dense(484, activation='relu'),
#         layers.Dense(242, activation='relu'),
#         layers.Dense(121, activation='relu'),
#         layers.Dense(60, activation='relu'),
#         layers.Dense(units=1)
#     ])

#     # Training
#     model.compile(
#         optimizer=tf.keras.optimizers.Adam(learning_rate=1e-8),
#         loss='mean_absolute_error',
#         metrics=["mean_absolute_error"]
#         )

#     # Fitting
#     history = model.fit(
#         train_features,
#         train_labels,
#         batch_size=64,
#         epochs=100,
#         # Calculate validation results on 20% of the training data.
#         validation_split = 0.2,
#         )

#     # Evaluate the model on the test data using `evaluate`
#     print("Evaluate on test data")
#     results = model.evaluate(test_features, test_labels, batch_size=128)
#     print("test loss, test MAE:", results)

#     # Generate predictions (regression outputs of the last layer)
#     # on new data using `predict`
#     print("Generate predictions for 3 samples")
#     predictions = model.predict(test_features[:3])
#     print("predictions shape:", predictions.shape)

#     # summarize history for mean_absolute_error
#     plt.plot(history.history['mean_absolute_error'])
#     plt.plot(history.history['val_mean_absolute_error'])
#     plt.title('model mean_absolute_error')
#     plt.ylabel('mean_absolute_error')
#     plt.xlabel('epoch')
#     plt.legend(['train', 'validation'], loc='upper left')
#     plt.show()
#     # summarize history for loss
#     plt.plot(history.history['loss'])
#     plt.plot(history.history['val_loss'])
#     plt.title('model loss')
#     plt.ylabel('loss')
#     plt.xlabel('epoch')
#     plt.legend(['train', 'validation'], loc='upper left')
#     plt.show()

*This looks like a promising way to drive the loss and the mean absolute error toward zero, but at this learning rate it may need around 10,000 epochs, which takes a long time.*
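
To get a feel for why so many epochs are needed, the training budget can be estimated with simple arithmetic. The sketch below is only a rough illustration: the row count of 500 is hypothetical, the helper `update_budget` is not part of the notebook, and it assumes that a single Adam step moves each weight by roughly the learning rate at most.

```python
import math


def update_budget(n_samples, validation_split, batch_size, epochs, learning_rate):
    """Return (steps_per_epoch, total_steps, rough maximum shift per weight)."""
    # Keras holds out the trailing `validation_split` fraction of the rows
    n_train = n_samples - round(n_samples * validation_split)
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    # Assumption: one Adam step changes a weight by about `learning_rate` at most
    max_shift = total_steps * learning_rate
    return steps_per_epoch, total_steps, max_shift


# n_samples=500 is a hypothetical row count, not the size of the real dataset
for epochs in (100, 1000, 10000):
    steps, total, shift = update_budget(500, 0.2, 64, epochs, 1e-8)
    print(f"{epochs:>6} epochs: {total:>6} updates, max shift per weight ~ {shift:.1e}")
```

Under these assumptions, the weights can only drift a very small distance from their initial values in a few hundred epochs, which is consistent with the slow but steady decrease seen in the curves.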

Using the df1 dataframe

In [None]:
def prepare_data(train_dataset, test_dataset, label):
    """Split both datasets into features and the labels of one UPDRS target."""

    # Drop the rows where the label is missing
    train_dataset = train_dataset.dropna(subset=[label])
    test_dataset = test_dataset.dropna(subset=[label])

    # Features: only the peptide percentage in the protein, i.e. every column
    # except the UPDRS targets (`target` is the global list defined below)
    train_features = train_dataset.drop(target, axis=1).copy()
    test_features = test_dataset.drop(target, axis=1).copy()

    # Labels: only the selected UPDRS score
    train_labels = train_dataset[label]
    test_labels = test_dataset[label]

    # Fill the NaN values with the column means of the training set.
    # The test set reuses the training means so that no test statistics
    # leak into the preprocessing.
    train_means = train_features.mean()
    train_features = train_features.fillna(train_means)
    test_features = test_features.fillna(train_means)

    return train_features, test_features, train_labels, test_labels

target = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"]

train_dataset = df1.sample(frac=0.5, random_state=0)
test_dataset = df1.drop(train_dataset.index)

mean_absolute_error_dict = {}
loss_dict = {}

for label in target:

    # The validation set is taken by `validation_split` in `model.fit` below
    train_features, test_features, train_labels, test_labels = prepare_data(train_dataset, test_dataset, label)
    
    # Normalization

    features = np.array(train_features)
    feat_normalizer = layers.Normalization(axis=-1)
    feat_normalizer.adapt(features)

    # Model
    model = tf.keras.Sequential([
        feat_normalizer,
        layers.Dense(969, activation='relu'),
        layers.Dense(969, activation='relu'),
        layers.Dense(969, activation='relu'),
        layers.Dense(969, activation='relu'),
        layers.Dense(969, activation='relu'),
        layers.Dense(969, activation='relu'),
        layers.Dense(969, activation='relu'),
        layers.Dense(969, activation='relu'),
        layers.Dense(484, activation='relu'),
        layers.Dense(242, activation='relu'),
        layers.Dense(121, activation='relu'),
        layers.Dense(60, activation='relu'),
        layers.Dense(units=1)
    ])

    # Training
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-8),
        loss='mean_absolute_error',
        metrics=["mean_absolute_error"],
    )

    # Fitting
    history = model.fit(
        train_features,
        train_labels,
        batch_size=64,
        epochs=1000,
        # Calculate validation results on 20% of the training data.
        validation_split = 0.2,
        verbose = 0,
        )

    # Evaluate the model on the test data using `evaluate`
    print("Evaluate on test data")
    results = model.evaluate(test_features, test_labels, batch_size=128)
    print("test loss, test MAE:", results)

    # Generate predictions (regression outputs of the last layer)
    # on new data using `predict`
    print("Generate predictions for 3 samples")
    predictions = model.predict(test_features[:3])
    print("predictions shape:", predictions.shape)
    

    mean_absolute_error_dict["mean_absolute_error_{0}".format(label)] = history.history['mean_absolute_error']
    mean_absolute_error_dict["val_mean_absolute_error_{0}".format(label)] = history.history['val_mean_absolute_error']
    loss_dict["loss_{0}".format(label)] = history.history['loss']
    loss_dict["val_loss_{0}".format(label)] = history.history['val_loss']
    

# Plot the training and validation MAE curves of every target
for name, values in mean_absolute_error_dict.items():
    plt.plot(values, label=name)

plt.ylabel('mean_absolute_error')
plt.xlabel('epoch')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=5)
plt.show()

# Plot the training and validation loss curves of every target
for name, values in loss_dict.items():
    plt.plot(values, label=name)

plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=5)
plt.show()
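
With eight curves on one figure, it can be easier to compare the targets by looking only at the value reached at the last epoch. The sketch below is a possible helper, not part of the original notebook: `final_values` is a hypothetical name, and the numbers in `example_dict` are made up to stand in for `mean_absolute_error_dict`.

```python
def final_values(history_dict):
    """Map each curve name to the value recorded at its last epoch."""
    return {name: values[-1] for name, values in history_dict.items()}


# Made-up histories standing in for `mean_absolute_error_dict`
example_dict = {
    "mean_absolute_error_updrs_1": [5.0, 4.5, 4.2],
    "val_mean_absolute_error_updrs_1": [5.5, 5.1, 5.0],
}

summary = final_values(example_dict)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In the notebook, the same helper could be called as `final_values(mean_absolute_error_dict)` once the training loop has finished.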
