# Receiving Data from Katsu and Converting To Training Data
This notebook shows the basics of retrieving mCODE packets from Katsu and converting the returned objects into training data for a machine learning algorithm. It assumes that you have a local instance of Katsu running on the default port, with the synthetic mohccn-data from the `federated-learning` repository already ingested.

We do train a machine learning model on the converted data, but with only 16 data points in the returned object (the synthetic data currently on hand is sparse), the model is effectively useless. This notebook simply illustrates a proof of concept and a possible preprocessing workflow for mCODE data.

We first use the `requests` module to call the `/api/mcodepackets` endpoint and retrieve our data. The returned object stores the query results under its `results` key. The synthetic data also has 8 empty entries at indices 12 through 19 (the slice `12:20`), so we delete those from our `results` list.

We then show a sample data entry: a JSON object with a partially filled mCODE schema.

In [78]:
import requests
data_raw = requests.get("http://localhost:8000/api/mcodepackets")
data_raw.raise_for_status()  # fail early if Katsu did not return a successful response
results_json = data_raw.json()['results']

# indices 12:20 are empty entries, so delete them
del results_json[12:20]
results_json[13] # sample entry

{'id': 'CMH-02-02',
 'date_of_death': '2018-12-12',
 'created': '2021-09-21T14:40:02.633405Z',
 'updated': '2021-09-21T14:40:02.633442Z',
 'subject': {'id': 'CMH-02-02',
  'date_of_birth': '1981-07-01',
  'sex': 'FEMALE',
  'karyotypic_sex': 'UNKNOWN_KARYOTYPE',
  'created': '2021-09-21T14:40:02.509163Z',
  'updated': '2021-09-21T14:40:02.509202Z'},
 'table': '377ab4cd-80b1-49c6-bc7a-4fd77db750f7',
 'cancer_condition': [{'id': '1012-0',
   'condition_type': 'primary',
   'code': {'id': 'SNOMED:103329007', 'label': 'Not available'},
   'date_of_diagnosis': '2018-03-27T00:00:00Z',
   'created': '2021-09-21T14:40:02.515432Z',
   'updated': '2021-09-21T14:40:02.515464Z'}],
 'cancer_related_procedures': [{'id': '1012-0',
   'procedure_type': 'radiation',
   'code': {'id': 'SNOMED:103329007', 'label': 'Not available'},
   'body_site': {'id': 'SNOMEDCT:91775009',
    'label': 'Structure of left shoulder region'},
   'created': '2021-09-21T14:40:02.520547Z',
   'updated': '2021-09-21T14:40:02.
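
Deleting entries by hard-coded indices depends on the exact order of the ingested records. As a hedged alternative, the sketch below keeps only the entries that contain the fields read later in this notebook. The helper `is_usable` is hypothetical, and the guess that an "empty" entry lacks these fields is an assumption rather than documented Katsu behaviour; the sample records are made up for illustration.

```python
def is_usable(entry: dict) -> bool:
    """Return True if the entry has every field the later parsing steps read.

    Hypothetical helper: it assumes an "empty" entry is missing (or has an
    empty value for) at least one of these fields.
    """
    conditions = entry.get('cancer_condition') or []
    subject = entry.get('subject') or {}
    return bool(
        conditions
        and conditions[0].get('date_of_diagnosis')
        and subject.get('date_of_birth')
        and entry.get('date_of_death')
    )


# Made-up records for illustration only
sample_results = [
    {
        'id': 'A',
        'date_of_death': '2018-12-12',
        'subject': {'date_of_birth': '1981-07-01', 'sex': 'FEMALE'},
        'cancer_condition': [{'date_of_diagnosis': '2018-03-27T00:00:00Z'}],
    },
    {'id': 'B', 'subject': {}},  # stands in for an "empty" entry
]

usable = [entry for entry in sample_results if is_usable(entry)]
print([entry['id'] for entry in usable])
```

Applied to the real response, a filter like this could stand in for the `del results_json[12:20]` step, provided the assumption about what an empty entry looks like holds for your data.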

We now move to preprocessing this data for a machine learning algorithm. Since our data is so sparse, we choose very simple features for training: the sex of the subject, the age at which they were diagnosed, the number of cancer-related procedures they had, and the number of medication statements on their record. With more training data, we might have distinguished between types of medication or procedure, but we do not do that here.

In [110]:
# handle sex parsing and date parsing
import datetime


def parse_sex(obj: dict) -> float:
    """Encode the subject's sex: 1.0 for FEMALE, 0.5 if missing, 0.0 for any other value."""
    if 'sex' not in obj['subject']:
        return 0.5
    elif obj['subject']['sex'] == "FEMALE":
        return 1.0
    else:
        return 0.0


def parse_date(date_str: str) -> datetime.datetime:
    """Build a datetime from the leading YYYY-MM-DD of a Katsu date or timestamp string."""
    return datetime.datetime(int(date_str[0:4]), int(date_str[5:7]), int(date_str[8:10]))


def parse_diagnosis_age(obj: dict) -> float:
    """
    Return the subject's age at diagnosis, in whole hours.

    Input: A (Katsu returned) JSON object of the mCODE data.
    Output: Hours between the date of birth and the date of diagnosis, rounded down.
    """
    diag_date = parse_date(obj['cancer_condition'][0]['date_of_diagnosis'])
    birth_date = parse_date(obj['subject']['date_of_birth'])
    difference = diag_date - birth_date
    return difference.total_seconds() // 3600  # rounded down


def parse_death_age(obj: dict) -> float:
    """
    Return the time from diagnosis to death, in whole hours.

    Input: A (Katsu returned) JSON object of the mCODE data.
    Output: Hours between the date of diagnosis and the date of death, rounded down.
    """
    diag_date = parse_date(obj['cancer_condition'][0]['date_of_diagnosis'])
    death_date = parse_date(obj['date_of_death'])
    difference = death_date - diag_date
    return difference.total_seconds() // 3600  # rounded down
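
As a cross-check on the date arithmetic, here is a minimal sketch that uses the standard library's ISO parser instead of string slicing by position. The helper name `hours_between` is hypothetical; the dates are those of the sample entry `CMH-02-02` shown earlier.

```python
import datetime


def hours_between(start: str, end: str) -> float:
    """Whole hours from `start` to `end`, using only the YYYY-MM-DD part of each string."""
    start_date = datetime.date.fromisoformat(start[:10])
    end_date = datetime.date.fromisoformat(end[:10])
    return (end_date - start_date).total_seconds() // 3600


# Dates taken from the sample entry CMH-02-02
age_at_diagnosis = hours_between('1981-07-01', '2018-03-27T00:00:00Z')
survival = hours_between('2018-03-27T00:00:00Z', '2018-12-12')
print(age_at_diagnosis, survival)
```

For this entry the two values correspond to roughly 36.7 years from birth to diagnosis and 260 days from diagnosis to death, which gives a sense of the scale of the hour-valued feature and target.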

We now create our input matrix and output vector by running each entry through a simple driver loop. To pretty-print any of these lists or objects, uncomment the printer lines and call `pp.pprint(<LIST or DICT>)`.

In [111]:
# import pprint
# pp = pprint.PrettyPrinter(indent=2)
X = []
y = []
for obj in results_json:
    # feature order: procedure count, medication count, sex, age at diagnosis (hours)
    X.append([
        len(obj['cancer_related_procedures']),
        len(obj['medication_statement']),
        parse_sex(obj),
        parse_diagnosis_age(obj)
    ])
    y.append(parse_death_age(obj))  # target: hours from diagnosis to death
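
scikit-learn accepts nested lists, but converting them to NumPy arrays makes the shapes explicit and easy to check before training. The sketch below uses made-up rows that only mirror the column order of the loop above; the names `X_demo` and `y_demo` are illustrative and not part of the notebook.

```python
import numpy as np

# Made-up rows in the notebook's column order:
# [procedure count, medication count, sex code, age at diagnosis in hours]
X_demo = [
    [1, 2, 1.0, 322032.0],
    [0, 1, 0.0, 410000.0],
    [3, 0, 0.5, 275000.0],
]
y_demo = [6240.0, 12000.0, 8760.0]  # hours from diagnosis to death (made up)

X_arr = np.asarray(X_demo, dtype=float)
y_arr = np.asarray(y_demo, dtype=float)

# One row per patient, one column per feature, one target per row
assert X_arr.ndim == 2 and X_arr.shape[0] == y_arr.shape[0]
print(X_arr.shape, y_arr.shape)

# The age column is several orders of magnitude larger than the count columns
print(X_arr.max(axis=0))
```

The last line highlights that the age feature dominates the others in scale. Ordinary least squares predictions are unaffected by this, but it matters if a regularized or distance-based model is substituted later.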

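We then split the data into training and test sets and fit a linear regression model. The score printed after fitting is the coefficient of determination (R²) on the training set.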

In [133]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.125, random_state=1729)

In [144]:
from sklearn.linear_model import LinearRegression

clf = LinearRegression().fit(X_train, y_train)
print(clf.score(X_train, y_train))

0.04386881560385292
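
For a regressor, `score` returns the coefficient of determination (R²), so a value near 0.04 means the features explain only about 4% of the variance in the training targets. The sketch below, on made-up data, shows that `score` matches `r2_score` applied to the model's own predictions; the names `X_demo`, `y_demo`, and `model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up data: a noisy linear relationship, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=20)

model = LinearRegression().fit(X_demo, y_demo)

# `score` is the coefficient of determination of the predictions
score = model.score(X_demo, y_demo)
manual = r2_score(y_demo, model.predict(X_demo))
print(score, manual)
```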


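The test set has 3 entries. For linear regression we use the mean squared error (MSE) between predicted and actual values as our main error metric; since the target is measured in hours, the MSE is in hours squared. We log the result below.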

In [146]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, clf.predict(X_test))
print(mse)

369905362.9194
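
An MSE in hours squared is hard to read directly. Taking the square root gives the root mean squared error (RMSE) in the same unit as the target, which can then be converted to days or years. This is a small sketch of that conversion using the value printed above; the 365.25-day year is an approximation.

```python
import math

mse_hours_sq = 369905362.9194  # MSE printed above, in hours squared

rmse_hours = math.sqrt(mse_hours_sq)
rmse_days = rmse_hours / 24
rmse_years = rmse_days / 365.25  # average year length, an approximation

print(f"RMSE: {rmse_hours:.0f} hours, about {rmse_days:.0f} days or {rmse_years:.1f} years")
```

In other words, the model's predictions of time from diagnosis to death are typically off by a bit more than two years, consistent with the earlier caveat that this model is a proof of concept only.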
