# The Goal of This Notebook
* The goal of this notebook is to find a single optimized constant prediction for each of the targets: updrs_[1-4].
* This updated version finds one value per visit month for each target and fills any missing month with the value from the previous available month.
* Every patient receives the same predictions. This is not a sophisticated method and it will not give the best possible score, but it is a good baseline to start with.
* This method can be improved. Optimizing the constant directly against SMAPE seems to work best when there is a large amount of training data. With fewer data points, a simple median seems to be more effective.
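
To illustrate the last point, here is a minimal sketch on made-up scores (not competition data) that compares the median with a constant chosen by grid search over SMAPE. The `smape` helper mirrors the metric defined in the next section; the `scores` array and the variable names are assumptions for illustration only.

```python
import numpy as np

def smape(y_true, y_pred):
    """SMAPE in percent; terms where both values are 0 count as 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    num = np.abs(y_true - y_pred)
    dem = (np.abs(y_true) + np.abs(y_pred)) / 2
    out = np.zeros_like(num)
    mask = dem != 0
    out[mask] = num[mask] / dem[mask]
    return 100 * out.mean()

# Made-up scores, skewed toward small values (illustrative only)
scores = np.array([0, 0, 1, 2, 3, 5, 8, 13, 21], dtype=float)

# Score every candidate constant on the same grid the notebook uses
grid = np.arange(0, 30, 0.1)
grid_scores = [smape(scores, np.full(len(scores), c)) for c in grid]
best_constant = grid[int(np.argmin(grid_scores))]
best_score = min(grid_scores)

# Score the median as a constant prediction for comparison
median_constant = np.median(scores)
median_score = smape(scores, np.full(len(scores), median_constant))

print(f"grid search: {best_constant:.1f} -> {best_score:.2f}")
print(f"median:      {median_constant:.1f} -> {median_score:.2f}")
```

On the data it was fit to, the grid-search constant can never score meaningfully worse than the median, because the grid contains a point next to the median. Whether that advantage carries over to unseen data is a separate question, which is why small samples may favor the median.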

## Competition Metric
### *This function calculates SMAPE, the metric used to score our predictions in this competition*
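
Written out, the quantity computed by the function below appears to be the following, where $y_i$ is the true value, $\hat{y}_i$ is the prediction, and any term whose denominator is zero is counted as 0:

$$
\mathrm{SMAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{\lvert y_i - \hat{y}_i \rvert}{\left(\lvert y_i \rvert + \lvert \hat{y}_i \rvert\right) / 2}
$$

Each term lies between 0 and 2, so the score ranges from 0 to 200.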

In [1]:
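import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    num = np.abs(y_true - y_pred)
    dem = (np.abs(y_true) + np.abs(y_pred)) / 2

    # Where both values are 0 the denominator is 0; count that term as 0 error
    smap = np.zeros(len(y_true))
    pos_ind = dem != 0
    smap[pos_ind] = num[pos_ind] / dem[pos_ind]

    return 100 * np.mean(smap)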
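
As a quick sanity check, the sketch below evaluates the same logic on two tiny hand-made inputs. The function is repeated so the snippet runs on its own, and the input values are invented for illustration.

```python
import numpy as np

def smape(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    num = np.abs(y_true - y_pred)
    dem = (np.abs(y_true) + np.abs(y_pred)) / 2
    smap = np.zeros(len(y_true))
    pos_ind = dem != 0
    smap[pos_ind] = num[pos_ind] / dem[pos_ind]
    return 100 * np.mean(smap)

# Terms: 0 (both zero), 0 (exact match), |5 - 15| / 10 = 1, so the mean is 1/3
mixed = smape([0, 10, 5], [0, 10, 15])

# Predicting any nonzero value for a true 0 gives the maximum error per term
worst = smape([0, 0], [1, 1])

print(mixed)  # about 33.33
print(worst)  # 200.0
```

The second case matters for updrs_4, where many true values are 0: a nonzero constant is penalized at the maximum rate on every such row.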

# Data Exploration
#### There are some NaN values in our target columns. NaN would propagate through the smape function, so these values are dropped when searching for the best constant estimate.

In [2]:
import pandas as pd
import numpy as np

train = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/train_clinical_data.csv')

In [3]:
train.isna().sum()

visit_id                                  0
patient_id                                0
visit_month                               0
updrs_1                                   1
updrs_2                                   2
updrs_3                                  25
updrs_4                                1038
upd23b_clinical_state_on_medication    1327
dtype: int64
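
The reason NaN is a problem can be seen in a small sketch. The `toy` frame and the constant `1.0` are made up for illustration; the point is only that NumPy arithmetic and `np.mean` propagate NaN, while `dropna()` removes the missing entries first.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training table with missing updrs_4 values
toy = pd.DataFrame({
    "visit_month": [0, 0, 6, 6],
    "updrs_4": [0.0, np.nan, 2.0, np.nan],
})

# A single missing target turns the whole mean into NaN
raw = toy["updrs_4"].to_numpy()
naive_mean = np.mean(np.abs(raw - 1.0))

# Dropping NaN first gives a usable result
clean = toy["updrs_4"].dropna().to_numpy()
clean_mean = np.mean(np.abs(clean - 1.0))

print(naive_mean, clean_mean)
```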

In [4]:
train.head()

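| | visit_id | patient_id | visit_month | updrs_1 | updrs_2 | updrs_3 | updrs_4 | upd23b_clinical_state_on_medication |
|---|---|---|---|---|---|---|---|---|
| 0 | 55_0 | 55 | 0 | 10.0 | 6.0 | 15.0 | NaN | NaN |
| 1 | 55_3 | 55 | 3 | 10.0 | 7.0 | 25.0 | NaN | NaN |
| 2 | 55_6 | 55 | 6 | 8.0 | 10.0 | 34.0 | NaN | NaN |
| 3 | 55_9 | 55 | 9 | 8.0 | 9.0 | 30.0 | 0.0 | On |
| 4 | 55_12 | 55 | 12 | 10.0 | 10.0 | 41.0 | 0.0 | On |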


### This loop searches for the constant estimate with the lowest smape score on the training set. This is repeated for every month of every target variable.

In [5]:
estimates = {}
months = train.visit_month.unique()
targets = ['updrs_1', 'updrs_2', 'updrs_3', 'updrs_4']
for m in months:
    for target in targets:
        t = train[train.visit_month == m][target].dropna().values
        if len(t) >= 200:
            # Enough data: grid-search the constant that minimizes SMAPE
            best_estimate = 0
            best_smape = 200
            for i in np.arange(0, 30, 0.1):
                score = smape(t, np.full(len(t), i))
                if score < best_smape:
                    best_smape = score
                    best_estimate = i
        else:
            # Too few observations: fall back to the median
            best_estimate = np.median(t)
        estimates[(m, target)] = best_estimate

# Fill months with no visits using the previous month's estimate
for i in range(sorted(months)[-1] + 1):
    for target in targets:
        if (i, target) not in estimates:
            estimates[(i, target)] = estimates[(i - 1, target)]
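
The fill step above can also be expressed with pandas. This is a sketch of an alternative formulation, not the code the notebook runs: it uses `reindex` followed by `ffill` on a hypothetical `toy_estimates` dictionary for a single target.

```python
import pandas as pd

# Hypothetical per-month estimates for one target; most months are missing
toy_estimates = {
    (0, "updrs_1"): 5.0,
    (6, "updrs_1"): 6.0,
    (12, "updrs_1"): 6.5,
}

# Index the values by month, then expand to every month from 0 to 12
series = pd.Series({month: value for (month, _), value in toy_estimates.items()})
filled = series.reindex(range(0, 13)).ffill()

print(filled.to_dict())
```

Both versions rely on month 0 having an estimate. The loop would raise a `KeyError` looking up month -1, while `ffill` would leave the leading months as NaN.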

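### This code calculates the smape score on the training data, which should roughly resemble the score on the test data. The estimates were fit on this same data, so the score is likely to be somewhat optimistic.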

In [6]:
validation_x = []
validation_y = []

for _, row in train.iterrows():
    for t in targets:
        # NaN >= 0 is False, so missing targets are skipped here
        if row[t] >= 0:
            validation_x.append((row.visit_month, t))
            validation_y.append(row[t])
            
smape(validation_y, pd.Series(validation_x).map(estimates).values)

76.36180750178882
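
For larger tables, the row-by-row loop can be replaced by reshaping to long format. The following is a sketch on a made-up frame using `DataFrame.melt`; `toy_train`, `toy_estimates`, and `toy_targets` are hypothetical stand-ins for the notebook's `train`, `estimates`, and `targets`.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({
    "visit_month": [0, 6],
    "updrs_1": [10.0, 8.0],
    "updrs_2": [6.0, np.nan],
})
toy_targets = ["updrs_1", "updrs_2"]
toy_estimates = {
    (0, "updrs_1"): 9.0, (6, "updrs_1"): 9.0,
    (0, "updrs_2"): 5.0, (6, "updrs_2"): 5.0,
}

# One row per (visit, target) pair, with missing ratings removed
long = toy_train.melt(
    id_vars="visit_month",
    value_vars=toy_targets,
    var_name="target",
    value_name="rating",
).dropna(subset=["rating"])

# Look up the constant estimate for each (month, target) key
long["estimate"] = [
    toy_estimates[(int(month), target)]
    for month, target in zip(long["visit_month"], long["target"])
]

print(long)
```

The `rating` and `estimate` columns can then be passed to `smape` directly. Note that `dropna` removes only missing values, whereas the `>= 0` test in the loop would also skip negative ratings if any existed.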

# Applying Optimal Value Estimates to Test Data

In [7]:
import amp_pd_peptide
env = amp_pd_peptide.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test files

# The API will deliver four dataframes in this specific order:
for (test, test_peptides, test_proteins, sample_submission) in iter_test:
    # Parse each prediction_id into a (visit_month + offset, target) key,
    # then map that key to its estimate
    keys = sample_submission.prediction_id.str.split('_').apply(lambda x: (int(x[1]) + int(x[5]), '_'.join(x[2:4])))
    sample_submission['rating'] = keys.map(estimates)
    
    # Saves predictions to csv file
    env.predict(sample_submission)

This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.
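
The key construction in the loop above is compact, so here is the same parsing written out as a standalone sketch. `parse_prediction_id` is a hypothetical helper introduced for illustration; the id format is taken from the submission file shown at the end of this notebook.

```python
def parse_prediction_id(prediction_id):
    """Hypothetical helper: turn a prediction_id into a (month, target) key."""
    parts = prediction_id.split("_")
    # parts: [patient_id, visit_month, 'updrs', n, 'plus', offset, 'months']
    visit_month = int(parts[1])
    offset = int(parts[5])
    target = "_".join(parts[2:4])
    return visit_month + offset, target

key = parse_prediction_id("50423_6_updrs_3_plus_24_months")
print(key)  # (30, 'updrs_3')
```

One thing to watch: if a test visit plus its offset lands on a month beyond the last month in `estimates`, the dictionary lookup has no entry and `map` will produce NaN for that row.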


In [8]:
# Predictions are automatically submitted by env.predict()
# This lets us read the submitted file
submission = pd.read_csv('/kaggle/working/submission.csv')
submission

| | prediction_id | rating |
|---|---|---|
| 0 | 3342_0_updrs_1_plus_0_months | 5.0 |
| 1 | 3342_0_updrs_1_plus_6_months | 6.0 |
| 2 | 3342_0_updrs_1_plus_12_months | 6.0 |
| 3 | 3342_0_updrs_1_plus_24_months | 6.0 |
| 4 | 3342_0_updrs_2_plus_0_months | 4.0 |
| ... | ... | ... |
| 59 | 50423_6_updrs_3_plus_24_months | 21.0 |
| 60 | 50423_6_updrs_4_plus_0_months | 0.0 |
| 61 | 50423_6_updrs_4_plus_6_months | 0.0 |
| 62 | 50423_6_updrs_4_plus_12_months | 0.0 |
