# Data Preparation: MCI Patient Selection
ADNIMERGE patient selection according to Massi's R screening file.
This notebook serves to get familiar with the ADNI dataset and the ADNIMERGE file, and to select the MCI patients of interest for our models.

Massi used the RID variable to see which rows refer to the same subject. Initially he included only those subjects that at VISCODE == “bl” had DX_bl == “EMCI” or DX_bl == “LMCI”. From this baseline visit he also took the variables used as predictors. Then, for these subjects, he considered their 3-year follow-ups (VISCODE in [bl, m03, m06, m12, m18, m24, m30, m36]) and checked whether at any point their DX became “Dementia”. In that case they were coded as converters; otherwise they were coded as non-converters. A conversion at any time during the 3 years counts as a conversion.
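
As a rough orientation, the selection logic described above can be sketched on a toy dataframe. The values below are made up for illustration and are not ADNI data; the names `toy`, `window` and `converter` are placeholders, and the real notebook additionally requires a minimum follow-up, which this sketch leaves out.

```python
import pandas as pd

# Toy stand-in for ADNIMERGE (hypothetical values, not real ADNI data)
toy = pd.DataFrame({
    "RID":     [1, 1, 1, 2, 2, 2, 3, 3],
    "VISCODE": ["bl", "m12", "m36", "bl", "m24", "m48", "bl", "m06"],
    "DX_bl":   ["LMCI"] * 3 + ["EMCI"] * 3 + ["CN"] * 2,
    "DX":      ["MCI", "MCI", "Dementia", "MCI", "MCI", "Dementia", "CN", "CN"],
})

# 1) subjects with an EMCI/LMCI diagnosis at the baseline visit
mci_rid = toy.loc[(toy["VISCODE"] == "bl") & toy["DX_bl"].isin(["EMCI", "LMCI"]), "RID"].unique()

# 2) keep only their visits up to year 3
window = ["bl", "m03", "m06", "m12", "m18", "m24", "m30", "m36"]
followup = toy[toy["RID"].isin(mci_rid) & toy["VISCODE"].isin(window)]

# 3) converter = at least one Dementia diagnosis inside the window
converter = followup.groupby("RID")["DX"].apply(lambda dx: (dx == "Dementia").any())
print(converter)
```

In this toy example subject 1 converts at m36 and is a converter, while subject 2 only converts at m48, outside the 3-year window, and is therefore a non-converter.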

In [1]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import sys

from collections import Counter 

%matplotlib inline

In [2]:
os.getcwd()

'/home/jovyan/work/DATA_PREPARATION/PATIENT_SCREENING'

## Loading the ADNIMERGE.csv dataset  
The ADNIMERGE dataset can be found in the folder DATA_15_12_2017/Study_info. The data in this folder are limited to study info, clinical, sociodemographic and neuropsychological data. ADNIMERGE is one large dataset containing the most important features of the ADNI study; most or all of them can also be found in the separate datasets in the DATA folder. All abbreviations and study codes are explained in the so-called 'dictionary' files; the file ADNIMERGE_DICT in the Study_info folder contains the codes for the ADNIMERGE data specifically.

More info on the ADNI study in general (centers included, cohorts-ADNI1, ADNIGO, ADNI2, ADNI3, study design etc.) can be found on the website: http://adni.loni.usc.edu/  

In [3]:
#reading CSV file
data = pd.read_csv("ADNIMERGE.csv")

selected_columns = ['RID', 'PTID', 'VISCODE', 'SITE', 'COLPROT', 'ORIGPROT', 'EXAMDATE',
                    'DX_bl', 'AGE', 'PTGENDER', 'DX']

adnimerge = data[selected_columns]
adnimerge.head(10)

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
0,2,011_S_0002,bl,11,ADNI1,ADNI1,2005-09-08,CN,74.3,Male,CN
1,3,011_S_0003,bl,11,ADNI1,ADNI1,2005-09-12,AD,81.3,Male,Dementia
2,3,011_S_0003,m06,11,ADNI1,ADNI1,2006-03-13,AD,81.3,Male,Dementia
3,3,011_S_0003,m12,11,ADNI1,ADNI1,2006-09-12,AD,81.3,Male,Dementia
4,3,011_S_0003,m24,11,ADNI1,ADNI1,2007-09-12,AD,81.3,Male,Dementia
5,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI
6,4,022_S_0004,m06,22,ADNI1,ADNI1,2006-05-02,LMCI,67.5,Male,MCI
7,4,022_S_0004,m12,22,ADNI1,ADNI1,2006-11-14,LMCI,67.5,Male,MCI
8,4,022_S_0004,m18,22,ADNI1,ADNI1,2007-05-14,LMCI,67.5,Male,MCI
9,4,022_S_0004,m36,22,ADNI1,ADNI1,2008-11-18,LMCI,67.5,Male,MCI
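
Since only a handful of columns are used for the screening, the file could also be read with just those columns. The sketch below uses a small in-memory CSV with made-up rows in place of ADNIMERGE.csv; `csv_text`, `wanted` and `subset` are placeholder names.

```python
import io
import pandas as pd

# Small in-memory stand-in for ADNIMERGE.csv (hypothetical rows)
csv_text = io.StringIO(
    "RID,PTID,VISCODE,DX_bl,DX,MMSE\n"
    "4,022_S_0004,bl,LMCI,MCI,27\n"
    "4,022_S_0004,m06,LMCI,MCI,28\n"
)

wanted = ["RID", "PTID", "VISCODE", "DX_bl", "DX"]
# usecols restricts parsing to the listed columns; dtype keeps PTID as text
subset = pd.read_csv(csv_text, usecols=wanted, dtype={"PTID": str})
print(subset.columns.tolist())
```

For the real file, the `io.StringIO` object would simply be replaced by the path to ADNIMERGE.csv.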


In [4]:
# get all column names of ADNIMERGE
data.columns.values

array(['RID', 'PTID', 'VISCODE', 'SITE', 'COLPROT', 'ORIGPROT', 'EXAMDATE',
       'DX_bl', 'AGE', 'PTGENDER', 'PTEDUCAT', 'PTETHCAT', 'PTRACCAT',
       'PTMARRY', 'APOE4', 'FDG', 'PIB', 'AV45', 'ABETA', 'TAU', 'PTAU',
       'CDRSB', 'ADAS11', 'ADAS13', 'ADASQ4', 'MMSE', 'RAVLT_immediate',
       'RAVLT_learning', 'RAVLT_forgetting', 'RAVLT_perc_forgetting',
       'LDELTOTAL', 'DIGITSCOR', 'TRABSCOR', 'FAQ', 'MOCA', 'EcogPtMem',
       'EcogPtLang', 'EcogPtVisspat', 'EcogPtPlan', 'EcogPtOrgan',
       'EcogPtDivatt', 'EcogPtTotal', 'EcogSPMem', 'EcogSPLang',
       'EcogSPVisspat', 'EcogSPPlan', 'EcogSPOrgan', 'EcogSPDivatt',
       'EcogSPTotal', 'FLDSTRENG', 'FSVERSION', 'Ventricles',
       'Hippocampus', 'WholeBrain', 'Entorhinal', 'Fusiform', 'MidTemp',
       'ICV', 'DX', 'mPACCdigit', 'mPACCtrailsB', 'EXAMDATE_bl',
       'CDRSB_bl', 'ADAS11_bl', 'ADAS13_bl', 'ADASQ4_bl', 'MMSE_bl',
       'RAVLT_immediate_bl', 'RAVLT_learning_bl', 'RAVLT_forgetting_bl',
       'RAVLT_perc_fo

## Step 1: Including only MCI subjects

In [5]:
#code from Massi
to_include_mci = np.unique(np.array(adnimerge["RID"].loc[adnimerge["DX_bl"].isin(["EMCI","LMCI"])]))

adnimerge2 = adnimerge.loc[adnimerge["RID"].isin(to_include_mci)]
adnimerge2.head()

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
5,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI
6,4,022_S_0004,m06,22,ADNI1,ADNI1,2006-05-02,LMCI,67.5,Male,MCI
7,4,022_S_0004,m12,22,ADNI1,ADNI1,2006-11-14,LMCI,67.5,Male,MCI
8,4,022_S_0004,m18,22,ADNI1,ADNI1,2007-05-14,LMCI,67.5,Male,MCI
9,4,022_S_0004,m36,22,ADNI1,ADNI1,2008-11-18,LMCI,67.5,Male,MCI


In [6]:
to_include_mci.shape

(790,)

In [7]:
adnimerge2[(adnimerge2["VISCODE"] == "bl") & (adnimerge2["DX"] == "Dementia")]


Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
762,332,021_S_0332,bl,21,ADNI1,ADNI1,2006-04-19,LMCI,69.9,Male,Dementia
2061,995,100_S_0995,bl,100,ADNI1,ADNI1,2006-11-13,LMCI,78.5,Female,Dementia
2364,1154,100_S_1154,bl,100,ADNI1,ADNI1,2007-01-30,LMCI,76.6,Male,Dementia
5299,78,023_S_0078,bl,23,ADNI1,ADNI1,2006-01-12,LMCI,76.0,Female,Dementia
5609,190,100_S_0190,bl,100,ADNI1,ADNI1,2006-05-16,LMCI,78.8,Male,Dementia
7753,1226,100_S_1226,bl,100,ADNI1,ADNI1,2007-02-15,LMCI,82.6,Male,Dementia


In [8]:
#selecting subjects with DX_bl of EMCI and LMCI
adnimerge_MCI = adnimerge[adnimerge["DX_bl"].isin(["EMCI", "LMCI"])]
adnimerge_MCI.head()

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
5,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI
6,4,022_S_0004,m06,22,ADNI1,ADNI1,2006-05-02,LMCI,67.5,Male,MCI
7,4,022_S_0004,m12,22,ADNI1,ADNI1,2006-11-14,LMCI,67.5,Male,MCI
8,4,022_S_0004,m18,22,ADNI1,ADNI1,2007-05-14,LMCI,67.5,Male,MCI
9,4,022_S_0004,m36,22,ADNI1,ADNI1,2008-11-18,LMCI,67.5,Male,MCI


In [9]:
#selecting RIDs of these subjects
MCI_patients = np.unique(adnimerge2["RID"])
MCI_patients

array([   4,    6,   30,   33,   38,   41,   42,   44,   45,   50,   51,
         54,   57,   60,   77,   78,   80,   87,   98,  101,  102,  103,
        107,  108,  111,  112,  116,  126,  128,  135,  138,  141,  142,
        150,  155,  160,  161,  168,  169,  176,  178,  179,  182,  187,
        188,  190,  195,  200,  204,  205,  214,  217,  222,  225,  227,
        231,  240,  241,  243,  249,  256,  258,  269,  273,  276,  282,
        284,  285,  288,  289,  290,  291,  292,  293,  296,  307,  314,
        324,  325,  331,  332,  336,  339,  344,  351,  354,  361,  362,
        376,  377,  378,  384,  388,  389,  390,  393,  394,  397,  401,
        406,  407,  408,  409,  410,  414,  417,  423,  424,  429,  434,
        442,  445,  448,  449,  450,  458,  461,  464,  469,  476,  478,
        481,  485,  501,  505,  507,  511,  513,  514,  518,  531,  539,
        544,  546,  549,  551,  552,  557,  563,  566,  567,  568,  572,
        579,  588,  598,  604,  607,  608,  611,  6

In [10]:
#doing the above in one go; Step 1 of Massi code
#selecting subjects RID with DX_bl of EMCI and LMCI
MCI_patients_RID = np.unique(adnimerge[adnimerge["DX_bl"].isin(["EMCI", "LMCI"])]["RID"])
MCI_patients_RID

array([   4,    6,   30,   33,   38,   41,   42,   44,   45,   50,   51,
         54,   57,   60,   77,   78,   80,   87,   98,  101,  102,  103,
        107,  108,  111,  112,  116,  126,  128,  135,  138,  141,  142,
        150,  155,  160,  161,  168,  169,  176,  178,  179,  182,  187,
        188,  190,  195,  200,  204,  205,  214,  217,  222,  225,  227,
        231,  240,  241,  243,  249,  256,  258,  269,  273,  276,  282,
        284,  285,  288,  289,  290,  291,  292,  293,  296,  307,  314,
        324,  325,  331,  332,  336,  339,  344,  351,  354,  361,  362,
        376,  377,  378,  384,  388,  389,  390,  393,  394,  397,  401,
        406,  407,  408,  409,  410,  414,  417,  423,  424,  429,  434,
        442,  445,  448,  449,  450,  458,  461,  464,  469,  476,  478,
        481,  485,  501,  505,  507,  511,  513,  514,  518,  531,  539,
        544,  546,  549,  551,  552,  557,  563,  566,  567,  568,  572,
        579,  588,  598,  604,  607,  608,  611,  6

In [11]:
#selecting subjects again from adnimerge based on RIDs in MCI_patients_RID; Step 2 of Massi code
adnimerge_MCI_patients = adnimerge[adnimerge["RID"].isin(MCI_patients_RID)]
adnimerge_MCI_patients.head(100)

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
5,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI
6,4,022_S_0004,m06,22,ADNI1,ADNI1,2006-05-02,LMCI,67.5,Male,MCI
7,4,022_S_0004,m12,22,ADNI1,ADNI1,2006-11-14,LMCI,67.5,Male,MCI
8,4,022_S_0004,m18,22,ADNI1,ADNI1,2007-05-14,LMCI,67.5,Male,MCI
9,4,022_S_0004,m36,22,ADNI1,ADNI1,2008-11-18,LMCI,67.5,Male,MCI
15,6,100_S_0006,bl,100,ADNI1,ADNI1,2005-11-29,LMCI,80.4,Female,MCI
16,6,100_S_0006,m06,100,ADNI1,ADNI1,2006-06-01,LMCI,80.4,Female,MCI
17,6,100_S_0006,m12,100,ADNI1,ADNI1,2006-11-20,LMCI,80.4,Female,MCI
18,6,100_S_0006,m18,100,ADNI1,ADNI1,2007-05-15,LMCI,80.4,Female,MCI
19,6,100_S_0006,m36,100,ADNI1,ADNI1,2008-12-08,LMCI,80.4,Female,MCI


### showing discrepancy when only selecting for DX_bl and not DX

In [12]:
#showing discrepancy between DX_bl and DX at VISCODE bl, selecting only baseline visit
adnimerge_MCI_patients[(adnimerge_MCI_patients["VISCODE"] == "bl") & (adnimerge_MCI_patients["DX"] == "Dementia")]


Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
762,332,021_S_0332,bl,21,ADNI1,ADNI1,2006-04-19,LMCI,69.9,Male,Dementia
2061,995,100_S_0995,bl,100,ADNI1,ADNI1,2006-11-13,LMCI,78.5,Female,Dementia
2364,1154,100_S_1154,bl,100,ADNI1,ADNI1,2007-01-30,LMCI,76.6,Male,Dementia
5299,78,023_S_0078,bl,23,ADNI1,ADNI1,2006-01-12,LMCI,76.0,Female,Dementia
5609,190,100_S_0190,bl,100,ADNI1,ADNI1,2006-05-16,LMCI,78.8,Male,Dementia
7753,1226,100_S_1226,bl,100,ADNI1,ADNI1,2007-02-15,LMCI,82.6,Male,Dementia


In [13]:
#getting RIDs from these discrepancy cases
inconsequent_RID = np.unique(adnimerge_MCI_patients[(adnimerge_MCI_patients["VISCODE"] == "bl") & (adnimerge_MCI_patients["DX"] == "Dementia")]["RID"])
inconsequent_RID

array([  78,  190,  332,  995, 1154, 1226])

In [14]:
#showing all visits of one discrepancy case (RID 332)
adnimerge_MCI_patients[adnimerge_MCI_patients["RID"] == 332]

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
762,332,021_S_0332,bl,21,ADNI1,ADNI1,2006-04-19,LMCI,69.9,Male,Dementia
763,332,021_S_0332,m06,21,ADNI1,ADNI1,2006-10-19,LMCI,69.9,Male,Dementia
764,332,021_S_0332,m12,21,ADNI1,ADNI1,2007-04-19,LMCI,69.9,Male,Dementia


### selecting for DX_bl and DX 

In [15]:
#selecting subjects RID with DX_bl of EMCI and LMCI and DX of MCI
MCI_patients_RID2 = np.unique(adnimerge[(adnimerge["DX_bl"].isin(["EMCI", "LMCI"])) & (adnimerge["DX"] == "MCI")]["RID"])
MCI_patients_RID2

array([   4,    6,   30,   33,   38,   41,   42,   44,   45,   50,   51,
         54,   57,   60,   77,   80,   87,   98,  101,  102,  103,  107,
        108,  111,  112,  116,  126,  128,  135,  138,  141,  142,  150,
        155,  160,  161,  168,  169,  176,  178,  179,  182,  187,  188,
        195,  200,  204,  205,  214,  217,  222,  225,  227,  231,  240,
        241,  243,  249,  256,  258,  269,  273,  276,  282,  284,  285,
        288,  289,  290,  291,  292,  293,  296,  307,  314,  324,  325,
        331,  336,  339,  344,  351,  354,  361,  362,  376,  377,  378,
        384,  388,  389,  390,  393,  394,  397,  401,  406,  407,  408,
        409,  410,  414,  417,  423,  424,  429,  434,  442,  445,  448,
        449,  450,  458,  461,  464,  469,  476,  478,  481,  485,  501,
        505,  507,  511,  513,  514,  518,  531,  539,  544,  546,  549,
        551,  552,  557,  563,  566,  567,  568,  572,  579,  588,  598,
        604,  607,  608,  611,  613,  621,  625,  6

In [16]:
#selecting all visits of the subjects in MCI_patients_RID2
adnimerge_MCI_patients2 = adnimerge[adnimerge["RID"].isin(MCI_patients_RID2)]
adnimerge_MCI_patients2.head()


Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
5,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI
6,4,022_S_0004,m06,22,ADNI1,ADNI1,2006-05-02,LMCI,67.5,Male,MCI
7,4,022_S_0004,m12,22,ADNI1,ADNI1,2006-11-14,LMCI,67.5,Male,MCI
8,4,022_S_0004,m18,22,ADNI1,ADNI1,2007-05-14,LMCI,67.5,Male,MCI
9,4,022_S_0004,m36,22,ADNI1,ADNI1,2008-11-18,LMCI,67.5,Male,MCI


In [17]:
#checking for remaining discrepancies between DX_bl and DX at VISCODE bl
adnimerge_MCI_patients2[(adnimerge_MCI_patients2["VISCODE"] == "bl") & (adnimerge_MCI_patients2["DX"] == "Dementia")]


Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
7753,1226,100_S_1226,bl,100,ADNI1,ADNI1,2007-02-15,LMCI,82.6,Male,Dementia


In [18]:
#check whether other discrepancy cases are removed
adnimerge_MCI_patients2[adnimerge_MCI_patients2["RID"] == 332]


Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX


In [19]:
#check out this particular case 1226
adnimerge_MCI_patients2[adnimerge_MCI_patients2["RID"] == 1226]


Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
2484,1226,100_S_1226,m06,100,ADNI1,ADNI1,2007-08-27,LMCI,82.6,Male,Dementia
2485,1226,100_S_1226,m24,100,ADNI1,ADNI1,2009-02-17,LMCI,82.6,Male,Dementia
2486,1226,100_S_1226,m36,100,ADNI1,ADNI1,2010-03-30,LMCI,82.6,Male,Dementia
2487,1226,100_S_1226,m48,100,ADNIGO,ADNI1,2011-02-09,LMCI,82.6,Male,MCI
7753,1226,100_S_1226,bl,100,ADNI1,ADNI1,2007-02-15,LMCI,82.6,Male,Dementia
7754,1226,100_S_1226,m12,100,ADNI1,ADNI1,2008-02-07,LMCI,82.6,Male,Dementia
7755,1226,100_S_1226,m18,100,ADNI1,ADNI1,2008-08-05,LMCI,82.6,Male,Dementia
7756,1226,100_S_1226,m30,100,ADNI1,ADNI1,2009-08-04,LMCI,82.6,Male,
7757,1226,100_S_1226,m54,100,ADNIGO,ADNI1,2011-08-10,LMCI,82.6,Male,
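
RID 1226 survives the DX == "MCI" filter because that condition is checked on every visit, and this subject has one MCI row at m48. The sketch below reproduces the effect on made-up data (subjects 10 and 20 are hypothetical) and shows that adding the baseline condition removes such a case.

```python
import pandas as pd

# Hypothetical subjects: 10 has Dementia at baseline but MCI at m48, 20 is MCI throughout
toy = pd.DataFrame({
    "RID":     [10, 10, 20, 20],
    "VISCODE": ["bl", "m48", "bl", "m12"],
    "DX_bl":   ["LMCI"] * 4,
    "DX":      ["Dementia", "MCI", "MCI", "MCI"],
})

is_mci_bl = toy["DX_bl"].isin(["EMCI", "LMCI"])

# DX == "MCI" at ANY visit: subject 10 passes through its m48 row
any_visit = toy.loc[is_mci_bl & (toy["DX"] == "MCI"), "RID"].unique()

# same filter restricted to the baseline row: subject 10 is dropped
baseline_only = toy.loc[
    is_mci_bl & (toy["DX"] == "MCI") & (toy["VISCODE"] == "bl"), "RID"
].unique()

print(any_visit.tolist(), baseline_only.tolist())
```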


### selecting for DX_bl, DX and VISCODE bl

In [20]:
#selecting subjects RID with DX_bl of EMCI and LMCI and DX of MCI and baseline visitcode
MCI_patients_RID3 = np.unique(adnimerge[(adnimerge["DX_bl"].isin(["EMCI", "LMCI"])) & (adnimerge["DX"] == "MCI") & (adnimerge["VISCODE"] == "bl")]["RID"])
MCI_patients_RID3.shape

(779,)

In [21]:
#selecting all visits of the subjects in MCI_patients_RID3
#(RIDs screened for a DX of MCI at the baseline visit)
adnimerge_MCI_patients3 = adnimerge[adnimerge["RID"].isin(MCI_patients_RID3)]
adnimerge_MCI_patients3.head()

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
5,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI
6,4,022_S_0004,m06,22,ADNI1,ADNI1,2006-05-02,LMCI,67.5,Male,MCI
7,4,022_S_0004,m12,22,ADNI1,ADNI1,2006-11-14,LMCI,67.5,Male,MCI
8,4,022_S_0004,m18,22,ADNI1,ADNI1,2007-05-14,LMCI,67.5,Male,MCI
9,4,022_S_0004,m36,22,ADNI1,ADNI1,2008-11-18,LMCI,67.5,Male,MCI


In [22]:
adnimerge_MCI_patients3[adnimerge_MCI_patients3["RID"] == 332]

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX


In [23]:
#showing the remaining Dementia diagnoses now that RIDs with a baseline discrepancy are removed
#(these are follow-up visits, so they are conversions rather than discrepancies)

#adnimerge_MCI_patients3[adnimerge_MCI_patients3["DX"] == "Dementia"]


### Comparing MCI selection of my protocol vs Massi

In [24]:
#my MCI selection (from MCI_patients_RID3)
adnimerge_MCI_patients3.head(10)

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
5,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI
6,4,022_S_0004,m06,22,ADNI1,ADNI1,2006-05-02,LMCI,67.5,Male,MCI
7,4,022_S_0004,m12,22,ADNI1,ADNI1,2006-11-14,LMCI,67.5,Male,MCI
8,4,022_S_0004,m18,22,ADNI1,ADNI1,2007-05-14,LMCI,67.5,Male,MCI
9,4,022_S_0004,m36,22,ADNI1,ADNI1,2008-11-18,LMCI,67.5,Male,MCI
15,6,100_S_0006,bl,100,ADNI1,ADNI1,2005-11-29,LMCI,80.4,Female,MCI
16,6,100_S_0006,m06,100,ADNI1,ADNI1,2006-06-01,LMCI,80.4,Female,MCI
17,6,100_S_0006,m12,100,ADNI1,ADNI1,2006-11-20,LMCI,80.4,Female,MCI
18,6,100_S_0006,m18,100,ADNI1,ADNI1,2007-05-15,LMCI,80.4,Female,MCI
19,6,100_S_0006,m36,100,ADNI1,ADNI1,2008-12-08,LMCI,80.4,Female,MCI


In [25]:
adnimerge_MCI_patients3.shape

(6306, 11)

In [26]:
#MCI selection of Massi (based on to_include_mci)
adnimerge2.head()

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
5,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI
6,4,022_S_0004,m06,22,ADNI1,ADNI1,2006-05-02,LMCI,67.5,Male,MCI
7,4,022_S_0004,m12,22,ADNI1,ADNI1,2006-11-14,LMCI,67.5,Male,MCI
8,4,022_S_0004,m18,22,ADNI1,ADNI1,2007-05-14,LMCI,67.5,Male,MCI
9,4,022_S_0004,m36,22,ADNI1,ADNI1,2008-11-18,LMCI,67.5,Male,MCI


In [27]:
#difference of 35 records
adnimerge2.shape

(6341, 11)

In [28]:
#find discrepancy RIDs as described above, and select these RIDs from dataframe
#comes to total number of 30 records, still 5 records difference 
wrongRID_adnimerge2 = np.unique(adnimerge2[(adnimerge2["VISCODE"] == "bl") & (adnimerge2["DX"] == "Dementia")]["RID"])
adnimerge2[adnimerge2["RID"].isin(wrongRID_adnimerge2)]

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
184,78,023_S_0078,m18,23,ADNI1,ADNI1,2007-07-24,LMCI,76.0,Female,Dementia
469,190,100_S_0190,m12,100,ADNI1,ADNI1,2007-05-08,LMCI,78.8,Male,Dementia
762,332,021_S_0332,bl,21,ADNI1,ADNI1,2006-04-19,LMCI,69.9,Male,Dementia
763,332,021_S_0332,m06,21,ADNI1,ADNI1,2006-10-19,LMCI,69.9,Male,Dementia
764,332,021_S_0332,m12,21,ADNI1,ADNI1,2007-04-19,LMCI,69.9,Male,Dementia
2061,995,100_S_0995,bl,100,ADNI1,ADNI1,2006-11-13,LMCI,78.5,Female,Dementia
2062,995,100_S_0995,m06,100,ADNI1,ADNI1,2007-05-09,LMCI,78.5,Female,Dementia
2063,995,100_S_0995,m12,100,ADNI1,ADNI1,2007-11-19,LMCI,78.5,Female,Dementia
2064,995,100_S_0995,m24,100,ADNI1,ADNI1,2009-02-06,LMCI,78.5,Female,Dementia
2065,995,100_S_0995,m36,100,ADNI1,ADNI1,2009-11-10,LMCI,78.5,Female,Dementia


In [29]:
#these 6 RIDs are the discrepancy cases
wrongRID_adnimerge2

array([  78,  190,  332,  995, 1154, 1226])

In [30]:
#find the RIDs missing from MCI_patients_RID3 (my selection) as compared to to_include_mci (Massi's selection)
lostRID = [x for x in to_include_mci if x not in MCI_patients_RID3]
lostRID

[78, 190, 332, 995, 1154, 1226, 2071, 2314, 4085, 4575, 4622]

In [31]:
#check the additional RIDs in the adnimerge2 dataframe
#they represent NaN values for DX (which are not selected in my approach)
#Massi is probably eliminating these subjects in his second step, where he is going to select
#subjects with follow-up of m36 and higher, which these subjects don't have 
adnimerge2[adnimerge2["RID"].isin([2071, 2314, 4085, 4575, 4622])]

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
4537,4622,024_S_4622,bl,24,ADNI2,ADNI2,2012-04-06,LMCI,76.0,Male,
4715,4575,016_S_4575,bl,16,ADNI2,ADNI2,2012-05-09,EMCI,62.1,Female,
8987,4085,035_S_4085,bl,35,ADNI2,ADNI2,2011-07-11,EMCI,55.5,Male,
11186,2071,098_S_2071,bl,98,ADNIGO,ADNIGO,2010-09-20,EMCI,84.9,Male,
11187,2314,128_S_2314,bl,128,ADNIGO,ADNIGO,2011-04-21,EMCI,65.7,Male,
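
The list comprehension used to find the lost RIDs can also be written as a set difference with NumPy. The sketch uses a small subset of the RIDs shown above; `massi_rid` and `my_rid` are placeholder names for `to_include_mci` and `MCI_patients_RID3`.

```python
import numpy as np

# A few RIDs taken from the outputs above, used only as a small example
massi_rid = np.array([4, 6, 78, 190, 2071])
my_rid = np.array([4, 6])

# RIDs in Massi's selection that are absent from mine (result is sorted and unique)
lost = np.setdiff1d(massi_rid, my_rid)
print(lost)
```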


## Step 2: Selecting MCI subjects with follow-up available until at least m36
(or m36 and higher?)

In [32]:
#VISCODE labels in dataframe my approach
np.unique(adnimerge_MCI_patients3["VISCODE"].values)


array(['bl', 'm03', 'm06', 'm102', 'm108', 'm114', 'm12', 'm120', 'm126',
       'm132', 'm144', 'm18', 'm24', 'm30', 'm36', 'm42', 'm48', 'm54',
       'm60', 'm66', 'm72', 'm78', 'm84', 'm90', 'm96'], dtype=object)

In [33]:
#VISCODE labels in dataframe my approach
#count number of labels in dataset and compare to label count Massi
adnimerge_MCI_patients3["VISCODE"].groupby(adnimerge_MCI_patients3["VISCODE"].values).count()

bl      779
m03     373
m06     731
m102      2
m108     54
m114      1
m12     708
m120     32
m126      1
m132     12
m144      1
m18     616
m24     605
m30     403
m36     523
m42     163
m48     352
m54     109
m60     232
m66     109
m72     170
m78     103
m84      97
m90      55
m96      75
Name: VISCODE, dtype: int64

In [34]:
#VISCODE labels in Massi's approach
np.unique(adnimerge2["VISCODE"].values)

array(['bl', 'm03', 'm06', 'm102', 'm108', 'm114', 'm12', 'm120', 'm126',
       'm132', 'm144', 'm18', 'm24', 'm30', 'm36', 'm42', 'm48', 'm54',
       'm60', 'm66', 'm72', 'm78', 'm84', 'm90', 'm96'], dtype=object)

In [35]:
#VISCODE labels in Massi's approach
#count number of labels in dataset and compare to label count Massi
adnimerge2["VISCODE"].groupby(adnimerge2["VISCODE"].values).count()

bl      790
m03     373
m06     737
m102      2
m108     54
m114      1
m12     714
m120     32
m126      1
m132     12
m144      1
m18     620
m24     608
m30     404
m36     525
m42     163
m48     353
m54     110
m60     232
m66     109
m72     170
m78     103
m84      97
m90      55
m96      75
Name: VISCODE, dtype: int64
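
Comparing the two count listings by eye is error-prone, so the counts could be put side by side and sorted by visit month. The sketch below uses made-up visit codes; `mine` and `massi` stand in for the VISCODE columns of `adnimerge_MCI_patients3` and `adnimerge2`, and the helper `months` assumes every code is either "bl" or "m" followed by a number.

```python
import pandas as pd

# Hypothetical VISCODE columns of the two selections
mine = pd.Series(["bl", "m06", "m12", "m102", "bl", "m06"], name="VISCODE")
massi = pd.Series(["bl", "m06", "m12", "m102", "bl", "m06", "bl", "m12"], name="VISCODE")

# counts per visit code, aligned on the code; codes missing in one selection count as 0
counts = pd.DataFrame({
    "mine": mine.value_counts(),
    "massi": massi.value_counts(),
}).fillna(0).astype(int)
counts["difference"] = counts["massi"] - counts["mine"]

def months(code):
    """Turn 'bl' into 0 and 'm36' into 36 (assumes only these two patterns occur)."""
    return 0 if code == "bl" else int(code[1:])

# sort by month instead of alphabetically, so m102 comes after m12
counts = counts.loc[sorted(counts.index, key=months)]
print(counts)
```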

In [36]:
#code Massi
# INCLUDE ONLY SUBJECTS WITH INFO OF CONVERSION AT LEAST YEAR 3
to_include_time = ["m114","m126","m144","m102","m132","m120","m108","m90","m96","m84","m78","m66","m54","m42","m72","m60","m48","m36"]
to_include_subjects = np.unique(np.array(adnimerge2["RID"].loc[adnimerge2["VISCODE"].isin(to_include_time)]))
adnimerge3 = adnimerge2.loc[adnimerge2["RID"].isin(to_include_subjects)]
adnimerge3.head(50)

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
5,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI
6,4,022_S_0004,m06,22,ADNI1,ADNI1,2006-05-02,LMCI,67.5,Male,MCI
7,4,022_S_0004,m12,22,ADNI1,ADNI1,2006-11-14,LMCI,67.5,Male,MCI
8,4,022_S_0004,m18,22,ADNI1,ADNI1,2007-05-14,LMCI,67.5,Male,MCI
9,4,022_S_0004,m36,22,ADNI1,ADNI1,2008-11-18,LMCI,67.5,Male,MCI
15,6,100_S_0006,bl,100,ADNI1,ADNI1,2005-11-29,LMCI,80.4,Female,MCI
16,6,100_S_0006,m06,100,ADNI1,ADNI1,2006-06-01,LMCI,80.4,Female,MCI
17,6,100_S_0006,m12,100,ADNI1,ADNI1,2006-11-20,LMCI,80.4,Female,MCI
18,6,100_S_0006,m18,100,ADNI1,ADNI1,2007-05-15,LMCI,80.4,Female,MCI
19,6,100_S_0006,m36,100,ADNI1,ADNI1,2008-12-08,LMCI,80.4,Female,MCI


In [37]:
# OF THESE SUBJECTS, KEEP ONLY INFO OF CONVERSION BY YEAR 3
to_include_time_2 = ["bl","m03","m30","m36","m24","m18","m12","m06"]
adnimerge_selected_massi = adnimerge3.loc[adnimerge3["VISCODE"].isin(to_include_time_2)]
#number of unique subjects left
adnimerge_selected_massi["RID"].nunique()

550

In [38]:
#check what difference it makes when requiring m36 itself: 25 subjects fewer (who apparently don't have an m36 visit)
#INCLUDE ONLY SUBJECTS WITH VISCODE of m36 available
visitcode_m36 = ["m36"]
subjects_included = np.unique(adnimerge2[adnimerge2["VISCODE"].isin(visitcode_m36)]["RID"])
adnimerge4 = adnimerge2.loc[adnimerge2["RID"].isin(subjects_included)]
#number of unique subjects left
adnimerge4["RID"].nunique()

525
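
The two follow-up rules compared above ("any visit at m36 or later" versus "an m36 visit must exist") can also be expressed without a hard-coded list of visit codes, by converting VISCODE to a number of months. This is a sketch on made-up data; the `MONTH` column is a helper introduced here, and the conversion assumes every code is either "bl" or "m" followed by a number.

```python
import pandas as pd

# Hypothetical subjects: 1 has an m36 visit, 2 skips m36 but has m48, 3 stops at m24
toy = pd.DataFrame({
    "RID":     [1, 1, 2, 2, 3, 3],
    "VISCODE": ["bl", "m36", "bl", "m48", "bl", "m24"],
})

# convert "bl" -> 0 and "m36" -> 36
toy["MONTH"] = toy["VISCODE"].replace("bl", "m0").str[1:].astype(int)

last_visit = toy.groupby("RID")["MONTH"].max()
# Massi's rule: any visit at m36 or later
at_least_m36 = last_visit[last_visit >= 36].index.to_numpy()
# stricter rule: an m36 visit itself must exist
exactly_m36 = toy.loc[toy["MONTH"] == 36, "RID"].unique()

print(at_least_m36.tolist(), exactly_m36.tolist())
```

Subject 2 illustrates the difference between the two counts: it is kept under the first rule and dropped under the second.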

Now cleaning this code and applying this approach to the dataframe adnimerge_MCI_patients3

In [39]:
# INCLUDE ONLY SUBJECTS WITH INFO OF CONVERSION AT LEAST YEAR 3
visitcode_required = ["m114","m126","m144","m102","m132","m120","m108","m90","m96","m84","m78","m66","m54","m42","m72","m60","m48","m36"]
MCI_patients_RID4 = np.unique(adnimerge_MCI_patients3[adnimerge_MCI_patients3["VISCODE"].isin(visitcode_required)]["RID"])
adnimerge_MCI_patients4 = adnimerge_MCI_patients3.loc[adnimerge_MCI_patients3["RID"].isin(MCI_patients_RID4)]
adnimerge_MCI_patients4.head(20)

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
5,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI
6,4,022_S_0004,m06,22,ADNI1,ADNI1,2006-05-02,LMCI,67.5,Male,MCI
7,4,022_S_0004,m12,22,ADNI1,ADNI1,2006-11-14,LMCI,67.5,Male,MCI
8,4,022_S_0004,m18,22,ADNI1,ADNI1,2007-05-14,LMCI,67.5,Male,MCI
9,4,022_S_0004,m36,22,ADNI1,ADNI1,2008-11-18,LMCI,67.5,Male,MCI
15,6,100_S_0006,bl,100,ADNI1,ADNI1,2005-11-29,LMCI,80.4,Female,MCI
16,6,100_S_0006,m06,100,ADNI1,ADNI1,2006-06-01,LMCI,80.4,Female,MCI
17,6,100_S_0006,m12,100,ADNI1,ADNI1,2006-11-20,LMCI,80.4,Female,MCI
18,6,100_S_0006,m18,100,ADNI1,ADNI1,2007-05-15,LMCI,80.4,Female,MCI
19,6,100_S_0006,m36,100,ADNI1,ADNI1,2008-12-08,LMCI,80.4,Female,MCI


In [40]:
# OF THESE SUBJECTS, KEEP ONLY INFO OF CONVERSION BY YEAR 3
visitcode_selected = ["bl","m03","m30","m36","m24","m18","m12","m06"]
adnimerge_selected_nadine = adnimerge_MCI_patients4.loc[adnimerge_MCI_patients4["VISCODE"].isin(visitcode_selected)]
#adnimerge_selected_nadine.head(20)
#number of unique subjects left
adnimerge_selected_nadine["RID"].nunique()

548

In [41]:
differ_RID = [x for x in to_include_subjects if x not in MCI_patients_RID4]
differ_RID

[995, 1226]

In [42]:
#check what difference it makes when requiring m36 itself
#INCLUDE ONLY SUBJECTS WITH VISCODE of m36 available
visitcode_m36 = ["m36"]
MCI_patients_RID5 = np.unique(adnimerge_MCI_patients3[adnimerge_MCI_patients3["VISCODE"].isin(visitcode_m36)]["RID"])
adnimerge_MCI_patients5 = adnimerge_MCI_patients3.loc[adnimerge_MCI_patients3["RID"].isin(MCI_patients_RID5)]
#number of unique subjects left
adnimerge_MCI_patients5["RID"].nunique()

523

In [43]:
# OF THESE SUBJECTS, KEEP ONLY INFO OF CONVERSION BY YEAR 3
visitcode_selected = ["bl","m03","m30","m36","m24","m18","m12","m06"]
adnimerge_selected_nadine = adnimerge_MCI_patients5[adnimerge_MCI_patients5["VISCODE"].isin(visitcode_selected)]
adnimerge_selected_nadine.head(20)
#np.count_nonzero(np.unique(adnimerge_selected_nadine["RID"]))

Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX
5,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI
6,4,022_S_0004,m06,22,ADNI1,ADNI1,2006-05-02,LMCI,67.5,Male,MCI
7,4,022_S_0004,m12,22,ADNI1,ADNI1,2006-11-14,LMCI,67.5,Male,MCI
8,4,022_S_0004,m18,22,ADNI1,ADNI1,2007-05-14,LMCI,67.5,Male,MCI
9,4,022_S_0004,m36,22,ADNI1,ADNI1,2008-11-18,LMCI,67.5,Male,MCI
15,6,100_S_0006,bl,100,ADNI1,ADNI1,2005-11-29,LMCI,80.4,Female,MCI
16,6,100_S_0006,m06,100,ADNI1,ADNI1,2006-06-01,LMCI,80.4,Female,MCI
17,6,100_S_0006,m12,100,ADNI1,ADNI1,2006-11-20,LMCI,80.4,Female,MCI
18,6,100_S_0006,m18,100,ADNI1,ADNI1,2007-05-15,LMCI,80.4,Female,MCI
19,6,100_S_0006,m36,100,ADNI1,ADNI1,2008-12-08,LMCI,80.4,Female,MCI


In [44]:
#adding converter_at_3years variable
#first attempt without a for loop; Massi's own code follows in the next cell
#work on a copy so that adnimerge_selected_massi stays unchanged for that cell

visitcode_selected2 = ["m03","m30","m36","m24","m18","m12","m06"]

conversion_check = adnimerge_selected_massi.copy()

#follow-up rows (baseline excluded) with a Dementia diagnosis
#the parentheses are needed because & binds tighter than ==
converted_rows = conversion_check["VISCODE"].isin(visitcode_selected2) & (conversion_check["DX"] == "Dementia")
converter_RID = np.unique(conversion_check.loc[converted_rows, "RID"])

#label every row of a converting subject
conversion_check["conversion"] = np.where(conversion_check["RID"].isin(converter_RID), "yes", "no")

In [45]:
#code from Massi
#RESHAPE ADNIMERGE AND ADD CONVERSION VARIABLE AT THE END
#tempo keeps all visits up to year 3; adnimerge_selected_massi is reduced to one baseline row per subject
tempo = adnimerge_selected_massi.copy()
tempo.index = np.array(range(tempo.shape[0]))
#.copy() makes an independent dataframe, so the new column is not set on a slice
adnimerge_selected_massi = adnimerge_selected_massi[adnimerge_selected_massi['VISCODE'] == "bl"].copy()
adnimerge_selected_massi.index = np.array(range(adnimerge_selected_massi.shape[0]))

adnimerge_selected_massi["CONVERSION_AT_3Y"] = "NO"

for i in range(adnimerge_selected_massi.shape[0]):
    if "Dementia" in np.array(tempo["DX"].loc[tempo["RID"] == adnimerge_selected_massi["RID"].iloc[i]]):
        #single .loc call instead of chained indexing; the index was reset to 0..n-1 above, so label i is row i
        adnimerge_selected_massi.loc[i, "CONVERSION_AT_3Y"] = "YES"
        
adnimerge_selected_massi.head(50)


Unnamed: 0,RID,PTID,VISCODE,SITE,COLPROT,ORIGPROT,EXAMDATE,DX_bl,AGE,PTGENDER,DX,CONVERSION_AT_3Y
0,4,022_S_0004,bl,22,ADNI1,ADNI1,2005-11-08,LMCI,67.5,Male,MCI,NO
1,6,100_S_0006,bl,100,ADNI1,ADNI1,2005-11-29,LMCI,80.4,Female,MCI,NO
2,42,023_S_0042,bl,23,ADNI1,ADNI1,2005-11-10,LMCI,72.8,Male,MCI,YES
3,51,099_S_0051,bl,99,ADNI1,ADNI1,2005-12-08,LMCI,66.5,Male,MCI,NO
4,54,099_S_0054,bl,99,ADNI1,ADNI1,2005-12-16,LMCI,81.0,Female,MCI,YES
5,57,018_S_0057,bl,18,ADNI1,ADNI1,2006-01-06,LMCI,77.3,Male,MCI,YES
6,77,067_S_0077,bl,67,ADNI1,ADNI1,2006-02-08,LMCI,79.7,Male,MCI,YES
7,80,018_S_0080,bl,18,ADNI1,ADNI1,2006-01-13,LMCI,85.0,Male,MCI,NO
8,98,067_S_0098,bl,67,ADNI1,ADNI1,2006-03-29,LMCI,84.4,Female,MCI,YES
9,101,007_S_0101,bl,7,ADNI1,ADNI1,2006-01-04,LMCI,73.6,Male,MCI,YES
