# The Neurogenomics Database: Dotplot of entire dataset predictions
Author: Nienke Mekkes <br>
Date: 11-10-2022. <br>
Correspond: n.j.mekkes@umcg.nl <br>

## Script: Dotplot of entire dataset predictions
Builds Dot Plots for each diagnosis category. <br>
Why: to give an overview of what symptoms are frequently observed in different diagnosis groups

### Input files:
- prediction file (donors as row names, observations as columns)
- General information: to assign metadata to donors (e.g. diagnosis, age)
- Optional: attribute metadata to cluster observations
- Optional: metadata to highlight expected findings in the plot

- also needs scattermap.py, code to create the plot
- also needs helper_functions, which contains code to run permutation test and how to select donors


### Output:
- dotplot, file with p values for permutation test



#### Minimal requirements
- to do

## IMPORTANT

this script works with a clinical trajectory dictionary pickle. this pickle can be a rules of thumb or a original pickle, and was generated by the script proces_predictions. This processing script removed short sentences etc. and the attributes that performed poorly. This processing script did not remove any donors. Donors that you wish to be excluded can be excluded in two ways: <br>
1. in this script, manually. for example remove donors younger than 21. or donors with the NAD diagnosis, or reassign diagnosis (e.g. a SSA, CON donor NBB xxx needs to become HIV).
2. with an input file, for example the general information that contains minimally one column with donorids, and one column that mentions which donors should have a changed diagnosis or should be excluded

## PATHS

In [1]:
pd.set_option('display.max_columns', 500)

NameError: name 'pd' is not defined

In [3]:
# path_to_predictions = "/home/jupyter-n.mekkes@gmail.com-f6d87/clinical_history/final_predictions/ALL_clinical_trajectories_dictionary_2023-07-11.pkl"
path_to_predictions = "/home/jupyter-n.mekkes@gmail.com-f6d87/clinical_history/final_predictions/ALL_clinical_trajectories_dictionary_rules_of_thumb_visit_2023-08-14.pkl"
# path_to_attribute_grouping = "/home/jupyter-n.mekkes@gmail.com-f6d87/clinical_history/input_data/sup3.xlsx" ## for rules of thumb
general_information = "/home/jupyter-n.mekkes@gmail.com-f6d87/clinical_history/input_data/General_information_11-08-2023.xlsx"
path_clinical_diagnosis = '/home/jupyter-n.mekkes@gmail.com-f6d87/clinical_diagnosis/output/selected_diagnoses_overview.xlsx'

### IMPORTS

In [2]:
import seaborn as sns; sns.set()
import matplotlib
import numpy as np; np.random.seed(0)
from matplotlib import pyplot as plt 
import xlsxwriter
import pandas as pd
import os
import numpy as np
import scattermap
from scattermap import scattermap
import pickle
import multiprocessing
import statsmodels
from functools import partial
from multiprocessing import Pool
import sys

import scipy
from helper_functions import permutation_of_individual_test, table_selector
import datetime

In [4]:
n=5

### Load data


In [5]:
with open(path_to_predictions,"rb") as file:
    predictions_pickle = pickle.load(file)

d = []
for i,j in zip(predictions_pickle,predictions_pickle.values()):
    k = pd.DataFrame.from_dict(j,orient="index")
    k["DonorID"] = i
    k['Age'] = k.index
    d.append(k)

predictions_df =pd.concat(d, ignore_index=True)
display(predictions_df)
print(f"there are {len(list(predictions_df['DonorID'].unique()))} unique donor IDs")
print(predictions_df.shape)

Unnamed: 0,neuropathological_diagnosis,age_at_death,Year,sex,Muscular_Weakness,Spasticity,Hyperreflexia_and_oth_reflexes,Fasciculations,Positive_sensory_symptoms,Negative_sensory_symptoms,...,Fatigue,Declined_deteriorated_health,Cachexia,Weight_loss,Reduces_oral_intake,Help_in_ADL,Day_care,Admission_to_nursing_home,DonorID,Age
0,TUM,42.0,-9.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1982-016,-9.0
1,TUM,42.0,1975.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1982-016,35.0
2,TUM,42.0,1978.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1982-016,38.0
3,TUM,42.0,1980.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1982-016,40.0
4,TUM,42.0,1981.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1982-016,41.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29674,CON,81.0,2019.0,F,1.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,NBB 2020-054,80.0
29675,CON,81.0,2020.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 2020-054,81.0
29676,"FTD,PID",71.0,2014.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 2020-078,65.0
29677,"FTD,PID",71.0,2016.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,NBB 2020-078,67.0


there are 3043 unique donor IDs
(29679, 90)


### exclude/change donors for the paper, using general info
- read in the general information
- make a list of donors to remove
- remove donors from our predictions
- change column neuropathological diagnosis to the neuropathological diagnosis from the general information

In [6]:
general_information_df = pd.read_excel(general_information, engine='openpyxl', sheet_name="Sheet1")
donors_to_remove = list(general_information_df[general_information_df['paper diagnosis']=='exclude'].DonorID)
predictions_df = predictions_df[~predictions_df['DonorID'].isin(donors_to_remove)]
print(f"there are {len(list(predictions_df['DonorID'].unique()))} unique donor IDs")
print(len(donors_to_remove))
predictions_df['neuropathological_diagnosis'] = predictions_df['DonorID'].map(general_information_df.set_index('DonorID')['paper diagnosis'])
display(predictions_df.head())
print(sorted(predictions_df['neuropathological_diagnosis'].unique()))
print(f"there are {len(list(predictions_df['DonorID'].unique()))} unique donor IDs")


there are 3043 unique donor IDs
258


Unnamed: 0,neuropathological_diagnosis,age_at_death,Year,sex,Muscular_Weakness,Spasticity,Hyperreflexia_and_oth_reflexes,Fasciculations,Positive_sensory_symptoms,Negative_sensory_symptoms,...,Fatigue,Declined_deteriorated_health,Cachexia,Weight_loss,Reduces_oral_intake,Help_in_ADL,Day_care,Admission_to_nursing_home,DonorID,Age
0,TUM,42.0,-9.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1982-016,-9.0
1,TUM,42.0,1975.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1982-016,35.0
2,TUM,42.0,1978.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1982-016,38.0
3,TUM,42.0,1980.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1982-016,40.0
4,TUM,42.0,1981.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1982-016,41.0


['AD', 'AD,CA', 'AD,DLB', 'AD,ENCEPHA,VE', 'AD,ILBD', 'ALEX', 'ATAXIA,ADCA', 'ATAXIA,FA', 'ATAXIA,FXTAS', 'ATAXIA,SCA', 'BINSW', 'CA', 'CBD', 'CJD', 'COHA', 'CON', 'CVA', 'DAI', 'DEM,CVA', 'DEM,ENCEPHA,VE', 'DEM,SICC', 'DEM,SICC,AGD', 'DEM,SICC,CA', 'DEM,SICC,ILBD', 'DLB', 'DLB,SICC', 'DOWN', 'DYSTO', 'ENCE', 'ENCEPHA,PML', 'ENCEPHA,VE', 'EPI', 'FAHR', 'FRAGX', 'FTD,FTD-FUS', 'FTD,FTD-TAU,TAU', 'FTD,FTD-TDP,MND', 'FTD,FTD-TDP-A,PROG', 'FTD,FTD-TDP-B,C9ORF72', 'FTD,FTD-TDP-C', 'FTD,FTD-TDP_undefined', 'FTD,FTD-UPS', 'FTD,PID', 'FTD_undefined', 'GUIL', 'HD', 'HIP', 'HIV', 'HMSN', 'HSP', 'ILBD', 'IS', 'KLIN', 'LDA', 'LD_other', 'MEDIS', 'MEN', 'MND,ALS', 'MND_other', 'MS,AD', 'MS,MS-PP', 'MS,MS-RR', 'MS,MS-SP', 'MSA', 'MS_undefined', 'NCSD', 'NHL', 'NIG', 'NMO', 'PAL', 'PCAD', 'PD', 'PD,AD', 'PD,ATPD', 'PDD', 'POLY_other', 'PSP', 'PSYCH,ADHD', 'PSYCH,ASD', 'PSYCH,BP', 'PSYCH,MDD', 'PSYCH,NARCO', 'PSYCH,OCD', 'PSYCH,PTSD', 'PSYCH,SCZ', 'PSYCH,SCZ,AD', 'PSYCH_other', 'PWS', 'SEP', 'SICC,ILB

In [8]:
predictions_df['Age'].value_counts()

-9.0      779
 69.0     749
 70.0     732
 67.0     729
 71.0     716
         ... 
 9.0        6
 3.0        6
 102.0      5
 1.0        5
 103.0      2
Name: Age, Length: 105, dtype: int64

In [9]:
non_attribute_columns = ['DonorID','Year','age_at_death','sex',
                        'neuropathological_diagnosis','Age'] #'birthyear',,'death_year','year_before_death','sex',
attributes = [col for col in predictions_df.columns if col not in non_attribute_columns]
# display(attributes)
print(f"there are {predictions_df.shape[0]} rows and {len(attributes)} attributes")
print(f"there are {len(list(predictions_df['DonorID'].unique()))} unique donor IDs")

there are 29679 rows and 84 attributes
there are 3043 unique donor IDs


#### adding in clinical diagnosis

In [41]:
# cd_df = pd.read_excel(path_clinical_diagnosis, engine='openpyxl')
# cd_df
# predictions_df['perfect_diagnosis'] = predictions_df['DonorID'].map(cd_df.set_index('DonorID')['perfect_diagnosis'])
# predictions_df['medium_diagnosis'] = predictions_df['DonorID'].map(cd_df.set_index('DonorID')['medium_diagnosis'])
# predictions_df['wrong_diagnosis'] = predictions_df['DonorID'].map(cd_df.set_index('DonorID')['wrong_diagnosis'])
# def get_diagnosis_info(row):
#     if row['perfect_diagnosis'] == 1:
#         return 'perfect'
#     elif row['medium_diagnosis'] == 1:
#         return 'medium'
#     elif row['wrong_diagnosis'] == 1:
#         return 'wrong'
#     else:
#         return None

# predictions_df['diagnosis_info'] = predictions_df.apply(get_diagnosis_info, axis=1)
# predictions_df = predictions_df.drop(columns=['perfect_diagnosis','medium_diagnosis','wrong_diagnosis'])
# predictions_df.tail()

In [20]:
table1_dict_paper = {
                'CON': 'CON',
                'AD': 'AD',
                'PD': 'PD',
                'PDD':'PDD',
                'DLB':'DLB',
                'VD' : 'VD',

                'FTD,FTD-TDP':'FTD','FTD,FTD-TDP-A,PROG':'FTD','FTD,FTD-TDP-B,C9ORF72':'FTD','FTD,FTD-TDP-C':'FTD', 
                'FTD,FTD-TAU,TAU':'FTD',
                'FTD,FTD-FUS':'FTD',
                'FTD,FTD-TDP,MND':'FTD',
                # 'FTD,FTD-UPS':'FTD',               
                'FTD,PID':'FTD',
                # 'FTD':'FTD', 
    'FTD_undefined':'FTD',
    'FTD,FTD-TDP_undefined':'FTD',

                'MND,ALS':'MND',
                'MND_other':'MND',

                'PSP' : 'PSP',

                'ATAXIA,SCA':'ATAXIA',
                'ATAXIA,ADCA':'ATAXIA',
                'ATAXIA,FA':'ATAXIA',
                'ATAXIA,FXTAS':'ATAXIA',

                'MS,MS-PP':'MS',
                'MS,MS-SP':'MS',
                # 'MS,MS-UN':'MS',
                'MS,MS-RR':'MS',
                'MS_undefined':'MS',

                'MSA' : 'MSA',
                'PSYCH,MDD':'MDD',
                'PSYCH,BP':'BP',
                'PSYCH,SCZ':'SCZ'
                                }



In [38]:
data = predictions_df[predictions_df.Age >= 0]
display(data['DonorID'].nunique())
data = predictions_df.copy()
display(data['DonorID'].nunique())
data['file_year'] = data['DonorID'].str.extract(r'NBB (\d{4})-\d{3}', expand=False)
data['file_year'] = pd.to_numeric(data['file_year'])
data = data[data['file_year'] >= 1997]
display(data['DonorID'].nunique())
unique_diagnoses = data[['DonorID', 'neuropathological_diagnosis']].drop_duplicates()
display(unique_diagnoses['neuropathological_diagnosis'].value_counts().head(20))
# display(merged_df.head())
## how many observations has each donor?
data2 = data.copy()
# display(data2['Age'].drop_duplicates().sort_values())
# display(data.groupby('DonorID')['Age'].nunique())

## df showing number of observations
data2 = data2.drop(columns=['age_at_death','sex','Age','file_year','Year'])#,'diagnosis_info'])
data2 = data2.groupby(['DonorID','neuropathological_diagnosis']).sum()
data2 = pd.DataFrame(data2.sum(axis=1),columns=['count'])
data2 = data2.reset_index()  
data2 = data2.set_index('DonorID')
data2['uniqueage'] = data.groupby('DonorID')['Age'].nunique()
display(data2)

# ## con are the exception, they are allowed to have little data
data3 = data2[data2['neuropathological_diagnosis'] != 'CON']
donors_not_enough_data = data3.index[data3['count'] < n].tolist()
# # donors_not_enough_data = data3.index[(data3['count'] < 5) | (data3['uniqueage'] < 3)].tolist()


print(donors_not_enough_data)
# print(len(donors_not_enough_data))
data = data[~data['DonorID'].isin(donors_not_enough_data)]
data = data.reset_index(drop=True)
display(data['DonorID'].nunique())
data['neuropathological_diagnosis'] = data['neuropathological_diagnosis'].replace('PDD', 'PD')
data['simplified_diagnosis'] = data['neuropathological_diagnosis'].map(table1_dict_paper)
data['simplified_diagnosis'] = data['neuropathological_diagnosis'].apply(lambda x: 'AD,DLB' if x == 'AD,DLB' else table1_dict_paper.get(x, None))

other_dems = ['CBD','AD,DLB','AD,CA','AD,ENCEPHA,VE','PD,AD', #,'ILBD','AD,ILBD','ENCEPHA,VE'
              'DLB,SICC','DEM,SICC','DEM,SICC,AGD','DEM,ENCEPHA,VE']
other_psych = ['PSYCH,PTSD','PSYCH,ASD','PSYCH,OCD']

def update_psych(row):
    if row['neuropathological_diagnosis'] in other_psych:
        return 'other_psych'
    return row['simplified_diagnosis']

def update_dem(row):
    if row['neuropathological_diagnosis'] in other_dems:
        return 'other_dem'
    return row['simplified_diagnosis']



data['simplified_diagnosis'] = data.apply(update_psych, axis=1)
data['simplified_diagnosis'] = data.apply(update_dem, axis=1)
data['simplified_diagnosis'] = data['simplified_diagnosis'].apply(lambda x: 'Other' if x is None else x)
display(data.head())
display(data['Age'].drop_duplicates().sort_values())
unique_diagnoses = data[['DonorID', 'simplified_diagnosis','neuropathological_diagnosis']].drop_duplicates()
display(unique_diagnoses.tail(10))
display(unique_diagnoses['simplified_diagnosis'].value_counts().head(60))
display(data['DonorID'].nunique())

2951

3043

2478

AD                       487
CON                      356
PDD                      122
PD                       120
MS,MS-SP                 112
AD,DLB                   104
PSP                       87
MS,MS-PP                  66
MSA                       57
VD                        54
FTD,FTD-TDP_undefined     54
PSYCH,MDD                 54
DLB,SICC                  49
PSYCH,BP                  46
DEM,SICC                  44
FTD,FTD-TAU,TAU           36
MS_undefined              32
AD,ILBD                   31
DLB                       31
FTD,PID                   30
Name: neuropathological_diagnosis, dtype: int64

Unnamed: 0_level_0,neuropathological_diagnosis,count,uniqueage
DonorID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
NBB 1997-001,AD,54.0,15
NBB 1997-002,CON,0.0,2
NBB 1997-003,AD,69.0,9
NBB 1997-004,"DEM,SICC",14.0,4
NBB 1997-005,CON,8.0,13
...,...,...,...
NBB 2020-047,"PSYCH,MDD",2.0,1
NBB 2020-051,"PSYCH,BP",33.0,4
NBB 2020-052,CON,68.0,31
NBB 2020-054,CON,7.0,3


['NBB 1997-025', 'NBB 1997-041', 'NBB 1997-097', 'NBB 1997-161', 'NBB 1998-021', 'NBB 1998-030', 'NBB 1998-069', 'NBB 1998-071', 'NBB 1998-078', 'NBB 1998-079', 'NBB 1998-096', 'NBB 1998-099', 'NBB 1998-118', 'NBB 1998-128', 'NBB 1998-148', 'NBB 1998-164', 'NBB 1998-196', 'NBB 1998-197', 'NBB 1998-199', 'NBB 1999-004', 'NBB 1999-022', 'NBB 1999-039', 'NBB 1999-041', 'NBB 1999-074', 'NBB 1999-075', 'NBB 1999-084', 'NBB 1999-131', 'NBB 1999-133', 'NBB 2000-047', 'NBB 2000-058', 'NBB 2001-118', 'NBB 2002-068', 'NBB 2003-037', 'NBB 2004-003', 'NBB 2004-074', 'NBB 2005-006', 'NBB 2005-011', 'NBB 2005-026', 'NBB 2005-031', 'NBB 2006-018', 'NBB 2007-014', 'NBB 2008-023', 'NBB 2008-028', 'NBB 2008-052', 'NBB 2008-055', 'NBB 2009-089', 'NBB 2010-085', 'NBB 2010-115', 'NBB 2011-070', 'NBB 2012-031', 'NBB 2012-062', 'NBB 2012-087', 'NBB 2012-097', 'NBB 2012-099', 'NBB 2012-112', 'NBB 2012-114', 'NBB 2012-115', 'NBB 2012-120', 'NBB 2013-014', 'NBB 2015-071', 'NBB 2015-089', 'NBB 2017-003', 'NBB 20

2410

Unnamed: 0,neuropathological_diagnosis,age_at_death,Year,sex,Muscular_Weakness,Spasticity,Hyperreflexia_and_oth_reflexes,Fasciculations,Positive_sensory_symptoms,Negative_sensory_symptoms,...,Cachexia,Weight_loss,Reduces_oral_intake,Help_in_ADL,Day_care,Admission_to_nursing_home,DonorID,Age,file_year,simplified_diagnosis
0,AD,71.0,1961.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,35.0,1997,AD
1,AD,71.0,1971.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,45.0,1997,AD
2,AD,71.0,1976.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,50.0,1997,AD
3,AD,71.0,1978.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,52.0,1997,AD
4,AD,71.0,1979.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,53.0,1997,AD


15        -9.0
223        0.0
5244       1.0
562        2.0
237        3.0
         ...  
559       99.0
560      100.0
561      101.0
18133    102.0
18657    103.0
Name: Age, Length: 105, dtype: float64

Unnamed: 0,DonorID,simplified_diagnosis,neuropathological_diagnosis
26044,NBB 2020-024,other_dem,CBD
26059,NBB 2020-028,CON,CON
26071,NBB 2020-029,PSP,PSP
26074,NBB 2020-030,MSA,MSA
26080,NBB 2020-034,Other,ILBD
26087,NBB 2020-038,BP,"PSYCH,BP"
26101,NBB 2020-051,BP,"PSYCH,BP"
26105,NBB 2020-052,CON,CON
26136,NBB 2020-054,CON,CON
26139,NBB 2020-078,FTD,"FTD,PID"


AD             477
CON            356
other_dem      300
PD             237
Other          235
MS             222
FTD            197
PSP             86
MSA             57
VD              53
MDD             51
BP              46
DLB             30
ATAXIA          19
MND             16
other_psych     15
SCZ             13
Name: simplified_diagnosis, dtype: int64

2410

In [44]:
# data[data['simplified_diagnosis']=='Other']

#### for seurat we select certain donors based on neuropathological_diagnosis

In [39]:
unique_diagnoses = data[['DonorID', 'neuropathological_diagnosis']].drop_duplicates()
# display(unique_diagnoses)
alldiag = list(unique_diagnoses['neuropathological_diagnosis'].unique())
print(list(alldiag))
len(unique_diagnoses[unique_diagnoses['neuropathological_diagnosis'] == 'VD,ILBD'])

['AD', 'CON', 'DEM,SICC', 'MS,MS-PP', 'HIP', 'DEM,ENCEPHA,VE', 'AD,DLB', 'FTD_undefined', 'AD,ILBD', 'PD', 'VD', 'MSA', 'FTD,FTD-TDP_undefined', 'PSYCH,MDD', 'PSYCH,BP', 'DLB,SICC', 'PSP', 'MS,MS-SP', 'DEM,SICC,AGD', 'FTD,FTD-TDP-B,C9ORF72', 'TUM', 'PSYCH,SCZ,AD', 'CA', 'PSYCH,SCZ', 'AD,CA', 'MS,MS-RR', 'DOWN', 'DEM,SICC,ILBD', 'CBD', 'FTD,FTD-TDP-A,PROG', 'HD', 'WER,KOR', 'MS,AD', 'MS_undefined', 'FTD,PID', 'IS', 'MND_other', 'PCAD', 'MEDIS', 'DEM,SICC,CA', 'TRANS,PSYCH,MDD', 'ENCEPHA,VE', 'ILBD', 'AD,ENCEPHA,VE', 'ENCE', 'FTD,FTD-TAU,TAU', 'VD,ILBD', 'PSYCH_other', 'DYSTO', 'LD_other', 'KLIN', 'FRAGX', 'MND,ALS', 'PD,ATPD', 'DLB', 'CVA', 'CJD', 'HMSN', 'FTD,FTD-TDP,MND', 'NHL', 'ATAXIA,SCA', 'EPI', 'HSP', 'NIG', 'PD,AD', 'FTD,FTD-UPS', 'NCSD', 'DEM,CVA', 'SICC,ILBD', 'FTD,FTD-FUS', 'ATAXIA,ADCA', 'DAI', 'SEP', 'ALEX', 'PAL', 'TAU', 'ENCEPHA,PML', 'TRANS,AD', 'FTD,FTD-TDP-C', 'BINSW', 'PSYCH,NARCO', 'PSYCH,PTSD', 'LDA', 'NMO', 'POLY_other', 'FAHR', 'VDCAD', 'COHA', 'ATAXIA,FXTAS', 'AT

1

In [40]:
list(data['neuropathological_diagnosis'].unique())

seurat_diagnoses = [## main diagnoses 
                    'AD','DLB','VD','CON','PD','PSP','MSA','MND,ALS','MND_other',
                    ## other dementias
                    'CBD','AD,DLB','AD,CA','AD,ENCEPHA,VE','PD,AD', #,'ILBD','AD,ILBD','ENCEPHA,VE'
                    'DLB,SICC','DEM,SICC','DEM,SICC,AGD','DEM,ENCEPHA,VE',
                    ## ataxia subtypes
                    'ATAXIA,SCA','ATAXIA,ADCA','ATAXIA,FXTAS','ATAXIA,FA',
                    ### FTD subtypes
                    'FTD_undefined','FTD,FTD-TAU,TAU','FTD,PID','FTD,FTD-FUS','FTD,FTD-TDP,MND',
                    'FTD,FTD-TDP_undefined','FTD,FTD-TDP-A,PROG','FTD,FTD-TDP-B,C9ORF72','FTD,FTD-TDP-C',
                    ## psych subtypes
                    'PSYCH,MDD','PSYCH,BP','PSYCH,SCZ','PSYCH,PTSD','PSYCH,ASD','PSYCH,OCD',
                    ## MS subtypes
                    'MS,MS-SP','MS,MS-PP','MS_undefined','MS,MS-RR']

not_used_seurat = list(set(alldiag) - set(seurat_diagnoses))
not_used_seurat = data[data['neuropathological_diagnosis'].isin(not_used_seurat)]
unique_diagnoses = not_used_seurat[['DonorID', 'simplified_diagnosis','neuropathological_diagnosis']].drop_duplicates()
# display(unique_diagnoses['neuropathological_diagnosis'].value_counts())
seurat = data[data['neuropathological_diagnosis'].isin(seurat_diagnoses)]
seurat = seurat[seurat['DonorID'] != 'NBB 1999-072'] ## this donor is cursed :) 
unique_diagnoses = seurat[['DonorID', 'simplified_diagnosis','neuropathological_diagnosis']].drop_duplicates()
display(unique_diagnoses['simplified_diagnosis'].value_counts())
# display(unique_diagnoses['neuropathological_diagnosis'].value_counts())
# display(unique_diagnoses.tail(20))
# display(unique_diagnoses[unique_diagnoses['neuropathological_diagnosis']=='CBD'])
display(seurat['DonorID'].nunique())

AD             477
CON            355
other_dem      300
PD             237
MS             222
FTD            197
PSP             86
MSA             57
VD              53
MDD             51
BP              46
DLB             30
ATAXIA          19
MND             16
other_psych     15
SCZ             13
Name: simplified_diagnosis, dtype: int64

2174

In [41]:
display(len(unique_diagnoses['DonorID'].unique()))

2174

In [42]:
## note, now does not have clinical diagnosis! first do analysis, then load the clindiag in seurat itself
seurat.to_csv('/home/jupyter-n.mekkes@gmail.com-f6d87/clinical_analysis/data/seurat_input.csv', index=False)

In [43]:
data[data['DonorID']=='NBB 1999-072']

Unnamed: 0,neuropathological_diagnosis,age_at_death,Year,sex,Muscular_Weakness,Spasticity,Hyperreflexia_and_oth_reflexes,Fasciculations,Positive_sensory_symptoms,Negative_sensory_symptoms,...,Cachexia,Weight_loss,Reduces_oral_intake,Help_in_ADL,Day_care,Admission_to_nursing_home,DonorID,Age,file_year,simplified_diagnosis
3238,CON,62.0,-9.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1999-072,-9.0,1999,CON
3239,CON,62.0,1999.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1999-072,62.0,1999,CON


#### for the GRU-D model we only take the simplified donors, and delete the psychiatric cases

In [44]:
gru_d = data.copy()
def update_b(row):
    if 'AD,DLB' in row['neuropathological_diagnosis']:
        return 'AD,DLB'
    return row['simplified_diagnosis']
gru_d['simplified_diagnosis'] = gru_d.apply(update_b, axis=1)
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'Other']
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'other_dem']
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'other_psych']
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'MDD']
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'BP']
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'SCZ']
datalist = gru_d['DonorID'].unique()
display(len(datalist))
# unique_diagnoses = gru_d[['DonorID', 'simplified_diagnosis','neuropathological_diagnosis']].drop_duplicates()
# display(unique_diagnoses.tail(10))
# display(unique_diagnoses['simplified_diagnosis'].value_counts())

1853

In [45]:
gru_d = seurat.copy()
def update_b(row):
    if 'AD,DLB' in row['neuropathological_diagnosis']:
        return 'AD,DLB'
    return row['simplified_diagnosis']
gru_d['simplified_diagnosis'] = gru_d.apply(update_b, axis=1)
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'Other']
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'other_dem']
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'other_psych']
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'MDD']
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'BP']
gru_d = gru_d[gru_d['simplified_diagnosis'] != 'SCZ']
seuratlist = gru_d['DonorID'].unique()
display(len(seuratlist))
# unique_diagnoses = gru_d[['DonorID', 'simplified_diagnosis','neuropathological_diagnosis']].drop_duplicates()
# display(unique_diagnoses.tail(10))
# display(unique_diagnoses['simplified_diagnosis'].value_counts())
print(set(datalist) - set(seuratlist))

1852

{'NBB 1999-072'}


In [46]:
display(len(unique_diagnoses['DonorID'].unique()))

2174

#### then we need to sort

In [56]:
# gru_d.to_csv('/home/jupyter-n.mekkes@gmail.com-f6d87/clinical_analysis/data/donors.csv', index=False)
gru_d = gru_d[gru_d.Age >= 0]
display(gru_d['DonorID'].nunique())

1810

In [57]:
gru_d = gru_d.sort_values(['DonorID', 'Age'],
              ascending = [True, True])
display(gru_d)

Unnamed: 0,neuropathological_diagnosis,age_at_death,Year,sex,Muscular_Weakness,Spasticity,Hyperreflexia_and_oth_reflexes,Fasciculations,Positive_sensory_symptoms,Negative_sensory_symptoms,...,Cachexia,Weight_loss,Reduces_oral_intake,Help_in_ADL,Day_care,Admission_to_nursing_home,DonorID,Age,file_year,simplified_diagnosis
0,AD,71.0,1961.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,35.0,1997,AD
1,AD,71.0,1971.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,45.0,1997,AD
2,AD,71.0,1976.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,50.0,1997,AD
3,AD,71.0,1978.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,52.0,1997,AD
4,AD,71.0,1979.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,53.0,1997,AD
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26137,CON,81.0,2019.0,F,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,NBB 2020-054,80.0,2020,CON
26138,CON,81.0,2020.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 2020-054,81.0,2020,CON
26139,"FTD,PID",71.0,2014.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,NBB 2020-078,65.0,2020,FTD
26140,"FTD,PID",71.0,2016.0,M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,NBB 2020-078,67.0,2020,FTD


In [58]:
gru_d.to_csv('/home/jupyter-n.mekkes@gmail.com-f6d87/clinical_analysis/data/grud_clin_subset_input.csv', index=False)

#### INPUT
array of arrays. each array is for a single donor, consisting of shape time x attribute

In [59]:
inp = gru_d.copy()
inp = inp.drop(['neuropathological_diagnosis','file_year','simplified_diagnosis','Year'],axis=1)#'diagnosis_info',
inp['sex'] = inp['sex'].map({'F': 1, 'M': 0}).astype(int)
inp['Age']  = inp['Age'].astype(int)
inp['age_at_death']  = inp['age_at_death'].astype(int)
def sum_except_donors(df):
    return df.iloc[:, ].sum()

inp = inp.sort_values(['DonorID', 'Age'],
              ascending = [True, True])

inp_with_nan = inp.copy()
inp_with_nan = inp_with_nan.reset_index(drop=True)
# final_row = inp_with_nan.groupby(['DonorID','sex']).sum().reset_index()
# final_row['Age'] = 150
# display(final_row)
# inp_with_nan = inp_with_nan.drop(['Age'], axis=1)
# inp_with_nan = pd.concat([inp_with_nan, final_row], ignore_index=True)
display(inp_with_nan.head(5))
final_input = inp_with_nan.set_index('DonorID')
# display(final_input)
final_input = final_input.groupby('DonorID').apply(pd.DataFrame.to_numpy).to_numpy()
print(final_input)

Unnamed: 0,age_at_death,sex,Muscular_Weakness,Spasticity,Hyperreflexia_and_oth_reflexes,Fasciculations,Positive_sensory_symptoms,Negative_sensory_symptoms,Parkinsonism,Facial_masking,...,Fatigue,Declined_deteriorated_health,Cachexia,Weight_loss,Reduces_oral_intake,Help_in_ADL,Day_care,Admission_to_nursing_home,DonorID,Age
0,71,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,35
1,71,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,45
2,71,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,50
3,71,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,52
4,71,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1997-001,53


[array([[71.,  0.,  0., ...,  0.,  0., 35.],
        [71.,  0.,  0., ...,  0.,  0., 45.],
        [71.,  0.,  0., ...,  0.,  0., 50.],
        ...,
        [71.,  0.,  0., ...,  0.,  1., 69.],
        [71.,  0.,  0., ...,  0.,  1., 70.],
        [71.,  0.,  0., ...,  0.,  1., 71.]])
 array([[97.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., 97.]])
 array([[88.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
          0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,

In [60]:
print(final_input.shape)
print(final_input[0].shape)

(1810,)
(15, 87)


#### LABEL TASKNAME

array in the shape samples X diagnosis

In [61]:
np.set_printoptions(threshold=30)
lt = gru_d[['DonorID','simplified_diagnosis']].copy()
# lt = lt[~lt['DonorID'].isin(weirds)]
lt = lt.drop_duplicates().reset_index(drop=True)

display(lt)

donorcount = len(lt['DonorID'])
print(donorcount)
print(lt['simplified_diagnosis'].value_counts())
print(lt.simplified_diagnosis.unique())

one_hot = pd.get_dummies(lt.simplified_diagnosis)

# Define the ordered list
wanted = ['CON', 'AD', 'PD', 'VD', 'FTD','DLB','AD,DLB','ATAXIA', 'MND', 'PSP', 'MS','MSA'] #'AD,DLB' 'DLB,SICC',

# Get the current columns of the dataframe
current_cols = list(one_hot.columns)

# Create a new list of columns in the order of the 'wanted' list
new_cols = [col for col in wanted if col in current_cols]

# Reorder the columns of the dataframe using the new list of columns
one_hot = one_hot.reindex(columns=new_cols)
display(one_hot)
final_label_taskname = one_hot.to_numpy()
display(final_label_taskname)

Unnamed: 0,DonorID,simplified_diagnosis
0,NBB 1997-001,AD
1,NBB 1997-002,CON
2,NBB 1997-003,AD
3,NBB 1997-005,CON
4,NBB 1997-006,MS
...,...,...
1805,NBB 2020-029,PSP
1806,NBB 2020-030,MSA
1807,NBB 2020-052,CON
1808,NBB 2020-054,CON


1810
AD        477
CON       314
PD        237
MS        222
FTD       196
AD,DLB    103
PSP        86
MSA        57
VD         53
DLB        30
ATAXIA     19
MND        16
Name: simplified_diagnosis, dtype: int64
['AD' 'CON' 'MS' 'AD,DLB' 'FTD' 'PD' 'VD' 'MSA' 'PSP' 'MND' 'DLB' 'ATAXIA']


Unnamed: 0,CON,AD,PD,VD,FTD,DLB,"AD,DLB",ATAXIA,MND,PSP,MS,MSA
0,0,1,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1805,0,0,0,0,0,0,0,0,0,1,0,0
1806,0,0,0,0,0,0,0,0,0,0,0,1
1807,1,0,0,0,0,0,0,0,0,0,0,0
1808,1,0,0,0,0,0,0,0,0,0,0,0


array([[0, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

#### MASKING
Masking is a datatype of the same shape as input. it is used to indicate which data is present, and which data is absent. If a row of data would be [nan, nan, 0, 0, 3.14, 10], the masking row would be [0,0,1,1,1,1]. In our case, we have options:
- every value larger than 1 to 1, every zero to zero. because a zero can mean that the symptom is present, it is just not written down that year?
- every value to 1. 

In [62]:
# final_input

In [63]:
import copy
## simplest
final_masking = copy.deepcopy(final_input)

In [64]:
# option 1
# for i in range(len(final_masking)):
#     final_masking[i][final_masking[i] > 1] = 1
    
# option 2
for k in range(len(final_masking)):
    final_masking[k][final_masking[k] >= 0] = 1    

print(final_masking)

[array([[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]])
 array([[1., 1., 1., ..., 1., 1., 1.]])
 array([[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]]) ...
 array([[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]])
 array([[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]])
 array([[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1.,

#### TIMESTAMP
Timestamp is another array of arrays. There is an array for every donor, that consists of the timepoints that are known for that donor. e.g. if donor1 has information from age 34, 61, and 62, then his timestamp would be [34,61,62].

In [65]:
timestamp_df = inp[['DonorID','Age']].copy()
timestamp_df
final_timestamp = timestamp_df.set_index('DonorID').groupby('DonorID').apply(pd.DataFrame.to_numpy).to_numpy()
final_timestamp

array([array([[35],
              [45],
              [50],
              [52],
              [53],
              [56],
              [61],
              [62],
              [65],
              [66],
              [67],
              [68],
              [69],
              [70],
              [71]]), array([[97]]), array([[79],
                                            [82],
                                            [83],
                                            [84],
                                            [85],
                                            [86],
                                            [87],
                                            [88]]), ..., array([[ 54],
                                                                [ 55],
                                                                [ 62],
                                                                ...,
                                                                [100],
                 

### SPLITTING BALANCED

In [66]:
## This function takes the created train and test data as input
## it returns a measure of how similar train is to test
## it also shows an overview of the number of cases per attribute
## and a corrected version of this overview
def split_vis(x_train,x_test,y_train, y_test,train_val_size,test_size):
    """
    something
    """
    counts = {}
    counts["train_counts"] = Counter(str(combination) for row in get_combination_wise_output_matrix(
        y_train, order=1) for combination in row)
    counts["test_counts"] = Counter(str(combination) for row in get_combination_wise_output_matrix(
        y_test, order=1) for combination in row)    

    # view distributions
    multi_split_dist = pd.DataFrame({
        "for_train_and_val": counts["train_counts"],
        "test": counts["test_counts"]
    }).T.fillna(0)
    multi_split_dist = multi_split_dist.reindex(natsorted(multi_split_dist.columns), axis=1)
#     multi_split_dist.columns = labels
    
    for k in counts["test_counts"].keys():
        counts["test_counts"][k] = int(counts["test_counts"][k] * (train_val_size/test_size))
        
    # View size corrected distributions
    multi_split_dist_corr = pd.DataFrame({
        "for_train_and_val": counts["train_counts"],
        "test": counts["test_counts"]
    }).T.fillna(0)
    multi_split_dist_corr =multi_split_dist_corr.reindex(natsorted(multi_split_dist_corr.columns), axis=1)
#     multi_split_dist_corr.columns = labels
    
    print(f"train: {len(x_train)} ({len(x_train)/(len(x_train)+len(x_test)):.2f})\n"
          f"test: {len(x_test)} ({len(x_test)/(len(x_train)+len(x_test)):.2f})")
    dist_split = np.mean(np.std(multi_split_dist_corr.to_numpy(), axis=0))
    
    return dist_split,multi_split_dist,multi_split_dist_corr

## for figure 5, counts for split are in here

In [67]:
lt['foldinfo'] = None
# display(lt)
import numpy as np
from sklearn.model_selection import StratifiedKFold

# if eighties == True:
#     n_split = 10 # 5 for 60-20-20, 10 for 80-10-10
#     n_split_in = 9 # 4 for 60-20-20, 9 for 80-10-10
# elif eighties == False:
n_split = 5 # 5 for 60-20-20, 10 for 80-10-10
n_split_in = 4 # 4 for 60-20-20, 9 for 80-10-10

fold_taskname = np.empty(shape=(5, 3), dtype=object)

# X = np.array([0, 2, 1, 1,0,2,0, 2, 1, 1,0,2,0, 2, 1, 1,0,2])
# y = np.array([0, 2, 1, 1,0,2,0, 2, 1, 1,0,2,0, 2, 1, 1,0,2])
X = np.array(lt['simplified_diagnosis'].values)
y = np.array(lt['simplified_diagnosis'].values)
print(y)
print(y.shape)


## SET UP SPLIT BETWEEN TEST AND TRAIN/VAL
skf = StratifiedKFold(n_splits=n_split, random_state=1, shuffle=True)
skf.get_n_splits(X, y)
print(skf)
j = 0
for train_val_index, test_index in skf.split(X, y):
#     print("TRAIN+VAL:", train_val_index, "TEST:", test_index)
    ## USE THE GENERATED INDICES TO SELECT DIAGNOSES
    q_train_val, q_test = X[train_val_index], X[test_index]
    r_train_val, r_test = y[train_val_index], y[test_index]
#     print('test:' , test_index)
#     print('trainval: ', train_val_index)
    skf2 = StratifiedKFold(n_splits=n_split_in, random_state=2, shuffle=True)
    skf2.get_n_splits(X, y)
    print(skf2)
    
    ## WITHIN EACH FOLD, SPLIT TRAIN/VAL INTO TRAIN AND VAL (ONLY NEEDED ONCE!)
    i = 0
    ## example: 
    ## [0,1,2,3,4,5,6,7,8,9]  full data ['a','b','c','d','e','f','g','h','i','j']
    ## [0,2,3,5,7,9] indices selected for train/val ['a','c','d','f','h','j']
    ## [0,1,2,3,4,5]
    ## [0,2,3,9] indices points selected for train ['a','c','d','j']
    for train_index, val_index in skf2.split(q_train_val, r_train_val):
        print(i)
        if i == 0:
            ## USE THE GENERATED INDICES TO CREATE NEW INDICES THAT WORK ON THE FULL DATA
            true_train = train_val_index[train_index]
            true_val = train_val_index[val_index]
            
            ## PRINT THE INDICES
            print("TRAIN:", true_train, "\nVAL:", true_val, "\nTEST:", test_index)
            q_train, q_val, q_test = X[true_train], X[true_val],X[test_index]
            r_train, r_val, q_test = y[true_train], y[true_val],y[test_index] 
            
            #print('trainval: ',train_val_index,train_val_index.shape )
            #print('train: ',train_index, train_index.shape)
            print(f"train: {len(q_train)} ({len(q_train)/(len(q_train)+len(q_test)+len(q_val)):.2f})\n"
                  f"val: {len(q_val)} ({len(q_val)/(len(q_train)+len(q_test)+len(q_val)):.2f})\n"
                  f"test: {len(q_test)} ({len(q_test)/(len(q_train)+len(q_test)+len(q_val)):.2f})")
            
            ## SAVE INTO NUMPY ARRAY
            fold_taskname[j][0] = np.asarray(true_train)
            fold_taskname[j][1] = np.asarray(true_val)
            fold_taskname[j][2] = np.asarray(test_index)
            lt.loc[test_index, 'foldinfo'] = j
            
            ## FOR VISUALIZING COUNTS PER DIAGNOSIS PER FOLD
            ## TRAINING
            foo, bar = np.unique(q_train, return_counts=True)
            my_dict = dict(zip(foo, bar))
            df = pd.DataFrame(list(my_dict.items()),columns = ['diagnosis','train'])
            
            ## VALIDATION
            foo, bar = np.unique(q_val, return_counts=True)
            my_dict = dict(zip(foo, bar))
            df1 = pd.DataFrame(list(my_dict.items()),columns = ['diagnosis2','val'])
            
            ## TEST
            foo, bar = np.unique(q_test, return_counts=True)
            my_dict = dict(zip(foo, bar))
            df2 = pd.DataFrame(list(my_dict.items()),columns = ['diagnosis3','test'])
            
            ## COMBINE ALL THREE
            df3 = pd.concat([df,df1, df2], ignore_index=True,axis=1)
            df3.columns = ['diagnosis','train','diagnosis2','val','diagnosis3','test']
            df3 = df3.drop(['diagnosis2','diagnosis3'], axis=1)
            display(df3)
            print(df3['diagnosis'])
#         elif i > 0:
#             print('finished fold {}, exiting...'.format(i))
            break
        i = i +1
    j = j + 1
    print('---------')
print(fold_taskname)
display(lt)

['AD' 'CON' 'AD' ... 'CON' 'CON' 'FTD']
(1810,)
StratifiedKFold(n_splits=5, random_state=1, shuffle=True)
StratifiedKFold(n_splits=4, random_state=2, shuffle=True)
0
TRAIN: [   1    7    8 ... 1802 1806 1809] 
VAL: [   0    2    3 ... 1787 1791 1805] 
TEST: [   4    5    6 ... 1804 1807 1808]
train: 1086 (0.60)
val: 362 (0.20)
test: 362 (0.20)


Unnamed: 0,diagnosis,train,val,test
0,AD,285,96,96
1,"AD,DLB",61,21,21
2,ATAXIA,12,4,3
3,CON,189,62,63
4,DLB,18,6,6
5,FTD,118,39,39
6,MND,9,3,4
7,MS,133,45,44
8,MSA,35,11,11
9,PD,142,48,47


0         AD
1     AD,DLB
2     ATAXIA
3        CON
4        DLB
5        FTD
6        MND
7         MS
8        MSA
9         PD
10       PSP
11        VD
Name: diagnosis, dtype: object
---------
StratifiedKFold(n_splits=4, random_state=2, shuffle=True)
0
TRAIN: [   1    5    6 ... 1807 1808 1809] 
VAL: [   0    2    3 ... 1798 1804 1805] 
TEST: [  17   22   27 ... 1800 1801 1806]
train: 1086 (0.60)
val: 362 (0.20)
test: 362 (0.20)


Unnamed: 0,diagnosis,train,val,test
0,AD,285,96,96
1,"AD,DLB",62,21,20
2,ATAXIA,12,3,4
3,CON,189,63,62
4,DLB,18,6,6
5,FTD,117,39,40
6,MND,9,4,3
7,MS,133,44,45
8,MSA,34,12,11
9,PD,143,47,47


0         AD
1     AD,DLB
2     ATAXIA
3        CON
4        DLB
5        FTD
6        MND
7         MS
8        MSA
9         PD
10       PSP
11        VD
Name: diagnosis, dtype: object
---------
StratifiedKFold(n_splits=4, random_state=2, shuffle=True)
0
TRAIN: [   1    6    7 ... 1807 1808 1809] 
VAL: [   0    2    4 ... 1803 1804 1805] 
TEST: [   3   11   13 ... 1787 1788 1791]
train: 1086 (0.60)
val: 362 (0.20)
test: 362 (0.20)


Unnamed: 0,diagnosis,train,val,test
0,AD,286,96,95
1,"AD,DLB",62,21,20
2,ATAXIA,12,3,4
3,CON,188,63,63
4,DLB,18,6,6
5,FTD,118,39,39
6,MND,9,4,3
7,MS,133,44,45
8,MSA,34,11,12
9,PD,142,47,48


0         AD
1     AD,DLB
2     ATAXIA
3        CON
4        DLB
5        FTD
6        MND
7         MS
8        MSA
9         PD
10       PSP
11        VD
Name: diagnosis, dtype: object
---------
StratifiedKFold(n_splits=4, random_state=2, shuffle=True)
0
TRAIN: [   1    5    6 ... 1807 1808 1809] 
VAL: [   0    2    3 ... 1793 1801 1804] 
TEST: [   7    9   12 ... 1786 1789 1798]
train: 1086 (0.60)
val: 362 (0.20)
test: 362 (0.20)


Unnamed: 0,diagnosis,train,val,test
0,AD,286,96,95
1,"AD,DLB",61,21,21
2,ATAXIA,12,3,4
3,CON,188,63,63
4,DLB,18,6,6
5,FTD,118,39,39
6,MND,9,4,3
7,MS,134,44,44
8,MSA,34,11,12
9,PD,142,47,48


0         AD
1     AD,DLB
2     ATAXIA
3        CON
4        DLB
5        FTD
6        MND
7         MS
8        MSA
9         PD
10       PSP
11        VD
Name: diagnosis, dtype: object
---------
StratifiedKFold(n_splits=4, random_state=2, shuffle=True)
0
TRAIN: [   3    4    5 ... 1804 1806 1808] 
VAL: [  12   15   22 ... 1798 1803 1807] 
TEST: [   0    1    2 ... 1802 1805 1809]
train: 1086 (0.60)
val: 362 (0.20)
test: 362 (0.20)


Unnamed: 0,diagnosis,train,val,test
0,AD,287,95,95
1,"AD,DLB",62,20,21
2,ATAXIA,12,3,4
3,CON,188,63,63
4,DLB,18,6,6
5,FTD,118,39,39
6,MND,9,4,3
7,MS,133,45,44
8,MSA,34,12,11
9,PD,142,48,47


0         AD
1     AD,DLB
2     ATAXIA
3        CON
4        DLB
5        FTD
6        MND
7         MS
8        MSA
9         PD
10       PSP
11        VD
Name: diagnosis, dtype: object
---------
[[array([   1,    7,    8, ..., 1802, 1806, 1809])
  array([   0,    2,    3, ..., 1787, 1791, 1805])
  array([   4,    5,    6, ..., 1804, 1807, 1808])]
 [array([   1,    5,    6, ..., 1807, 1808, 1809])
  array([   0,    2,    3, ..., 1798, 1804, 1805])
  array([  17,   22,   27, ..., 1800, 1801, 1806])]
 [array([   1,    6,    7, ..., 1807, 1808, 1809])
  array([   0,    2,    4, ..., 1803, 1804, 1805])
  array([   3,   11,   13, ..., 1787, 1788, 1791])]
 [array([   1,    5,    6, ..., 1807, 1808, 1809])
  array([   0,    2,    3, ..., 1793, 1801, 1804])
  array([   7,    9,   12, ..., 1786, 1789, 1798])]
 [array([   3,    4,    5, ..., 1804, 1806, 1808])
  array([  12,   15,   22, ..., 1798, 1803, 1807])
  array([   0,    1,    2, ..., 1802, 1805, 1809])]]


Unnamed: 0,DonorID,simplified_diagnosis,foldinfo
0,NBB 1997-001,AD,4
1,NBB 1997-002,CON,4
2,NBB 1997-003,AD,4
3,NBB 1997-005,CON,2
4,NBB 1997-006,MS,0
...,...,...,...
1805,NBB 2020-029,PSP,4
1806,NBB 2020-030,MSA,1
1807,NBB 2020-052,CON,0
1808,NBB 2020-054,CON,0


In [68]:
lt['foldinfo'].value_counts()
lt['indexes'] = lt.index
display(lt)

Unnamed: 0,DonorID,simplified_diagnosis,foldinfo,indexes
0,NBB 1997-001,AD,4,0
1,NBB 1997-002,CON,4,1
2,NBB 1997-003,AD,4,2
3,NBB 1997-005,CON,2,3
4,NBB 1997-006,MS,0,4
...,...,...,...,...
1805,NBB 2020-029,PSP,4,1805
1806,NBB 2020-030,MSA,1,1806
1807,NBB 2020-052,CON,0,1807
1808,NBB 2020-054,CON,0,1808


In [69]:
# fold_taskname[0,0]

In [70]:
# final_input

In [71]:
# final_input[fold_taskname[0][0]]

In [72]:
#np.set_printoptions(threshold=np.inf)
n_dim = 87#83
mean_taskname = np.zeros((5, 1, n_dim)) * np.nan
std_taskname = np.zeros((5, 1, n_dim)) * np.nan
for i_split in range(5):
    ## fold_taskname[i_split][0] selecteert de indexen van de training donors van elke fold
    ## final_input[fold_taskname[i_split][0]] selecteerd de training data van deze donoren
    ## de concatenate step combineer het, dus x_tr is training data per fold
    x_tr = np.concatenate(final_input[fold_taskname[i_split][0]], axis=0)
    display(x_tr)
    ## mean taskname contains the mean of each training column. eg. for the first fold, the average age is 75
    mean_taskname[i_split][0] = np.nanmean(x_tr, axis=0)
    ## std taskname contains the std of each training column. eg. for the first fold, the average age is 12
    std_taskname[i_split][0] = np.nanstd(x_tr, axis=0)
    
print(mean_taskname[0][0])
print(std_taskname[0][0])
mean_taskname

array([[97.,  1.,  0., ...,  0.,  0., 97.],
       [84.,  1.,  0., ...,  0.,  0., 83.],
       [84.,  1.,  0., ...,  0.,  0., 84.],
       ...,
       [71.,  0.,  0., ...,  0.,  0., 65.],
       [71.,  0.,  0., ...,  1.,  0., 67.],
       [71.,  0.,  0., ...,  1.,  0., 71.]])

array([[97.,  1.,  0., ...,  0.,  0., 97.],
       [88.,  1.,  0., ...,  0.,  0.,  9.],
       [88.,  1.,  0., ...,  0.,  0., 24.],
       ...,
       [71.,  0.,  0., ...,  0.,  0., 65.],
       [71.,  0.,  0., ...,  1.,  0., 67.],
       [71.,  0.,  0., ...,  1.,  0., 71.]])

array([[97.,  1.,  0., ...,  0.,  0., 97.],
       [89.,  1.,  0., ...,  0.,  0., 52.],
       [89.,  1.,  0., ...,  0.,  0., 77.],
       ...,
       [71.,  0.,  0., ...,  0.,  0., 65.],
       [71.,  0.,  0., ...,  1.,  0., 67.],
       [71.,  0.,  0., ...,  1.,  0., 71.]])

array([[97.,  1.,  0., ...,  0.,  0., 97.],
       [88.,  1.,  0., ...,  0.,  0.,  9.],
       [88.,  1.,  0., ...,  0.,  0., 24.],
       ...,
       [71.,  0.,  0., ...,  0.,  0., 65.],
       [71.,  0.,  0., ...,  1.,  0., 67.],
       [71.,  0.,  0., ...,  1.,  0., 71.]])

array([[69.,  1.,  0., ...,  0.,  0., 47.],
       [69.,  1.,  0., ...,  0.,  0., 53.],
       [69.,  1.,  0., ...,  0.,  0., 54.],
       ...,
       [81.,  1.,  1., ...,  0.,  0., 18.],
       [81.,  1.,  1., ...,  0.,  0., 80.],
       [81.,  1.,  0., ...,  0.,  0., 81.]])

[73.73540193  0.57122155  0.2280283  ...  0.07850993  0.19017987
 63.75355895]
[12.27935668  0.4949015   0.41956096 ...  0.26897234  0.39244297
 15.85941814]


array([[[73.73540193,  0.57122155,  0.2280283 , ...,  0.07850993,
          0.19017987, 63.75355895]],

       [[74.04961799,  0.57798953,  0.22079148, ...,  0.08060778,
          0.19409391, 64.10009443]],

       [[74.27300877,  0.57053157,  0.23464648, ...,  0.07491829,
          0.19421985, 64.29855496]],

       [[74.77224749,  0.59649876,  0.21711147, ...,  0.08049429,
          0.19514288, 64.7033382 ]],

       [[73.89006103,  0.57689625,  0.21987794, ...,  0.07942459,
          0.19372276, 64.06530078]]])

### SAVING

In [73]:
# if eighties == True:
#     prefix = '80_'
# elif eighties == False:
prefix = '60_'
savespace = f'clinical_history_{n}_observations'
output_path = "/home/jupyter-n.mekkes@gmail.com-f6d87/clinical_history/temporal_model/data"    

print(savespace)
os.makedirs(os.path.join(output_path,  savespace),
            exist_ok=True)
np.savez(os.path.join(output_path, savespace, 'data.npz'),
         input=final_input, masking=final_masking, timestamp=final_timestamp, label_taskname=final_label_taskname)
np.savez(os.path.join(output_path, savespace, 'fold.npz'),
         fold_taskname=fold_taskname, mean_taskname=mean_taskname, std_taskname=std_taskname)
lt.to_excel(os.path.join(output_path, savespace,'donorindexes.xlsx'), index=False)

clinical_history_5_observations


In [36]:
break

SyntaxError: 'break' outside loop (668683560.py, line 1)

In [None]:
gru_d.to_csv('/home/jupyter-n.mekkes@gmail.com-f6d87/clinical_diagnosis/output/gru_d_july.csv', index=False)

In [None]:
#### for seurat:

#### selecting a subset

In [11]:
table_of_choice = 'table1_p' #fig 4a table3_with_con_p #table2_p #fig 3a table1_P fig sup 5a:table2_p

In [12]:
selected_diagnoses,ordered_diagnoses = table_selector(table_of_choice, predictions_df)
print('After selecting for {}, we have {} donors'.format(selected_diagnoses['neuropathological_diagnosis'].unique(),
                                                                                    selected_diagnoses['DonorID'].nunique()) )
display(selected_diagnoses[selected_diagnoses['neuropathological_diagnosis']=='AD'].head(5))


After selecting for ['CON' 'PD' 'FTD' 'AD' 'PSP' 'MS' 'VD' 'SCZ' 'PDD' 'ATAXIA' 'BP' 'MSA'
 'MDD' 'MND' 'DLB'], we have 2333 donors


Unnamed: 0,neuropathological_diagnosis,age_at_death,Year,sex,Muscular_Weakness,Spasticity,Hyperreflexia_and_oth_reflexes,Fasciculations,Positive_sensory_symptoms,Negative_sensory_symptoms,...,Declined_deteriorated_health,Cachexia,Weight_loss,Reduces_oral_intake,Help_in_ADL,Day_care,Admission_to_nursing_home,DonorID,Age,diagnosis_info
34,AD,81.0,1984.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1990-003,75.0,wrong
35,AD,81.0,1986.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,NBB 1990-003,77.0,wrong
36,AD,81.0,1988.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,NBB 1990-003,79.0,wrong
37,AD,94.0,-9.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NBB 1990-004,-9.0,wrong
38,AD,94.0,1976.0,F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,NBB 1990-004,80.0,wrong


#### merge the table1 diagnoses back with the other donors
we do this so we can run the analysis on the table1 diagnosis, but also include other groups such as ad dlb)


In [20]:
unique_donor_ids = selected_diagnoses['DonorID'].unique().tolist()
print(len(unique_donor_ids))
# Filter rows from dfa where DonorID is not in the unique_donor_ids list
filtered_predictions_df = predictions_df[~predictions_df['DonorID'].isin(unique_donor_ids)]
print(filtered_predictions_df.shape)
print(predictions_df.shape)
# Concatenate filtered_dfa with dfb
merged_df = pd.concat([selected_diagnoses, filtered_predictions_df], ignore_index=True)
# merged_df['neuropathological_diagnosis'].value_counts().head(40)
unique_diagnoses = merged_df[['DonorID', 'neuropathological_diagnosis']].drop_duplicates()
unique_diagnoses['neuropathological_diagnosis'].value_counts().head(20)

2333
(7272, 91)
(30185, 91)


AD           720
CON          445
MS           273
FTD          222
PD           134
PDD          126
AD,DLB       121
PSP           91
VD            64
MSA           61
DLB,SICC      60
DEM,SICC      59
MDD           56
BP            49
AD,ILBD       46
AD,CA         42
DLB           31
IS            30
UNDEFINED     24
SCZ           24
Name: neuropathological_diagnosis, dtype: int64

In [23]:
merged_df.to_csv('/home/jupyter-n.mekkes@gmail.com-f6d87/clinical_diagnosis/output/selected_diagnoses_july.csv', index=False)

In [43]:
# psych = ['MDD', 'SCZ', 'BP']
# selected_diagnoses = selected_diagnoses[~selected_diagnoses['neuropathological_diagnosis'].isin(psych)]
# ordered_diagnoses =  ['AD', 'PD', 'VD','ATAXIA','DLB','FTD', 'MND', 'PSP', 'MS','MSA']

In [213]:
# pd = merged_df[merged_df['neuropathological_diagnosis'] == 'PD']
pad = merged_df.copy()
pad['neuropathological_diagnosis'] = pad['neuropathological_diagnosis'].replace('PDD', 'PD')

pad = pad[['neuropathological_diagnosis','age_at_death','sex','Constipation','Weight_loss','DonorID','Age']]
# pd['Constipation'].value_counts()
# Weight loss Constipation
pad = pad.groupby(['neuropathological_diagnosis', 'age_at_death', 'sex', 'DonorID']).agg({'Constipation': 'sum', 'Weight_loss': 'sum'}).reset_index()

# display(pad)
con = pad.groupby('neuropathological_diagnosis')['Constipation'].apply(lambda x: (x > 0).sum()).reset_index(name='Constipation Count')
wl = pad.groupby('neuropathological_diagnosis')['Constipation'].apply(lambda x: (x > 0).sum()).reset_index(name='Weight loss Count')

# display(con)

percentage_df = pad.groupby('neuropathological_diagnosis')[['Constipation','Weight_loss']].apply(lambda x: (x > 0).mean() * 100).reset_index()
percentage_df.rename(columns={'Constipation': 'Constipation Percentage','Weight_loss':'Weight_loss Percentage'}, inplace=True)
percentage_df['Constipation Count'] = con['Constipation Count']
percentage_df['Weight loss Count'] = wl['Weight loss Count']
percentage_df.sort_values(by=['Constipation Percentage'],ascending=False,inplace=True)
display(percentage_df.head(30))

Unnamed: 0,neuropathological_diagnosis,Constipation Percentage,Weight_loss Percentage,Constipation Count,Weight loss Count
13,CMT,100.0,50.0,2,2
41,HSP,100.0,0.0,1,1
32,FAHR,100.0,0.0,1,1
66,"PSYCH,ADHD",100.0,100.0,1,1
73,PWS,100.0,0.0,1,1
8,BINSW,100.0,100.0,1,1
40,HMSN,100.0,0.0,1,1
59,PCAD,100.0,0.0,1,1
58,PAL,100.0,0.0,1,1
48,MEDIS,66.666667,0.0,2,2


In [None]:
# attribute_grouping = pd.read_excel(path_to_attribute_grouping, engine='openpyxl', index_col=[0])#,header=3, sheet_name='S3. 90 signs and symptoms')
# # display(attribute_grouping.head())
# df = predictions_df.copy()

# df['symptoms'] = df[attributes].apply(lambda row: ', '.join([col for col in attributes if row[col] != 0]), axis=1)
# df.loc[(df[attributes] == 0).all(axis=1), 'symptoms'] = 'none'
# columns_to_keep = set(df.columns).difference(attributes)
# # df = df[columns_to_keep].copy()
# df['symptoms'] = df['symptoms'].str.split(',').apply(lambda x: ', '.join(set(x))).str.strip()

# df.tail(10)

# pd.set_option('display.max_colwidth', 100)
# dfgrouping = predictions_df.copy()

# # Iterate over the columns
# for column in attributes:
#     mask = dfgrouping[column] == 1
#     grouping = attribute_grouping.loc[attribute_grouping['ITname'] == column, 'Grouping'].iloc[0]
#     dfgrouping.loc[mask, column] = grouping

# dfgrouping['groupings'] = dfgrouping[attributes].apply(lambda x: ', '.join([val for val in x if val != 0]), axis=1)
# dfgrouping.loc[(dfgrouping[attributes] == 0).all(axis=1), 'groupings'] = 'none'

# columns_to_keep = set(dfgrouping.columns).difference(attributes)
# dfgrouping = dfgrouping[columns_to_keep].copy()
# dfgrouping['groupings'] = dfgrouping['groupings'].str.split(',').apply(lambda x: ', '.join(set([item.strip() for item in x]))).str.strip()



# ############ domain #############
# dfdomain = predictions_df.copy()
# for column in attributes:
#     mask = dfdomain[column] == 1
#     domain = attribute_grouping.loc[attribute_grouping['ITname'] == column, 'Domain'].iloc[0]
#     dfdomain.loc[mask, column] = domain

# dfdomain['Domains'] = dfdomain[attributes].apply(lambda x: ', '.join([val for val in x if val != 0]), axis=1)
# dfdomain.loc[(dfdomain[attributes] == 0).all(axis=1), 'Domains'] = 'none'

# columns_to_keep = set(dfdomain.columns).difference(attributes)
# dfdomain = dfdomain[columns_to_keep].copy()
# dfdomain['Domains'] = dfdomain['Domains'].str.split(',').apply(lambda x: ', '.join(set([item.strip() for item in x]))).str.strip()



# merged_df = df.merge(dfgrouping, on=['neuropathological_diagnosis', 'DonorID', 'Year', 'Age', 'sex', 'age_at_death'])
# predictions_df = merged_df.merge(dfdomain, on=['neuropathological_diagnosis', 'DonorID', 'Year', 'Age', 'sex', 'age_at_death'])
# display(predictions_df.head(40))