# **Data preparation**

The input data consist of three CSV files:

1- **ABIDEII_Composite_Phenotypic.csv** - the phenotypic data recorded by the participating sites; one of the fields for each subject is the ID of the site that acquired the scan (file downloaded from ABIDE II)

2- **qc.csv** - quality-control scores from mri_synthseg, computed directly on the NIfTI file; the subject column holds the name of the checked file, which includes the subject ID

3- **vol.csv** - volumes returned by mri_synthseg, computed directly on the NIfTI file; the subject column holds the name of the checked file, which includes the subject ID

This notebook generates two new CSV files: one with all the data in **qc.csv** plus information extracted from **ABIDEII_Composite_Phenotypic.csv**, and another with the data in **vol.csv** plus the same information from **ABIDEII_Composite_Phenotypic.csv**. The following parameters are extracted from the phenotypic file:

- site name (SITE_ID)
- subject ID (SUB_ID)
- age at scan (AGE_AT_SCAN ; the column name ends with a space in the source file)
- sex (SEX)
- whether the subject is ASD or control (DX_GROUP)
- global IQ score (FIQ)

The two new CSV files are:

- *Dataset_to_Check_vol.csv*
- *Dataset_to_Check_qc.csv*

## **STEP 1: Import libraries**

In [None]:
# numpy and pandas for data manipulation
import numpy as np
import pandas as pd
# for data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import os
# link to Google Drive
from google.colab import drive

## **STEP 2: Link to Drive**

In [None]:
# paths relative to the ipynb file
import glob
# recover the path of the script
script_name = '1_Dataset_Treball.ipynb'
drive.mount(os.getcwd() + '/drive')
script_path = glob.glob(os.getcwd() + '/**/' + script_name, recursive = True)
print(script_path)
head_tail = os.path.split(script_path[0])
# keep the working folder
work_path = head_tail[0]


Mounted at /content/drive
['/content/drive/MyDrive/TFM/Finals/1_Dataset_Treball.ipynb']
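The lookup above fails with an `IndexError` if the notebook file is not found under the mounted drive. As a minimal sketch of the same idea with `pathlib` and an explicit error, the helper `find_work_dir` below is a hypothetical name and the demonstration runs on a throwaway directory tree rather than on Drive:

```python
from pathlib import Path
import tempfile


def find_work_dir(root, script_name):
    """Return the folder holding the first file named script_name under root."""
    matches = sorted(Path(root).rglob(script_name))
    if not matches:
        raise FileNotFoundError(f"{script_name} not found under {root}")
    return matches[0].parent


# Demonstration on a temporary tree (hypothetical layout, not the real Drive).
with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "drive" / "MyDrive" / "TFM" / "1_Dataset_Treball.ipynb"
    notebook.parent.mkdir(parents=True)
    notebook.touch()
    work_dir = find_work_dir(tmp, "1_Dataset_Treball.ipynb")
    found_name = work_dir.name
    print(found_name)
```

The returned `Path` can be joined with `/`, for example `work_dir / "CSV" / "qc.csv"`, instead of concatenating strings.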


## **STEP 3: Read files**

In [None]:
# PATHS
# phenotypic data
fenotip_path = work_path + '/CSV/ABIDEII_Composite_Phenotypic.csv'
# mri_synthseg quality-control results on images normalised to MNI
qc_path = work_path + '/CSV/qc.csv'
# volumes extracted by mri_synthseg on images normalised to MNI
v_path = work_path + '/CSV/Vol.csv'
# paths of the combined files
merged_qc_path = work_path + '/CSV/Dataset_to_Check_qc.csv'
merged_v_path = work_path + '/CSV/Dataset_to_Check_vol.csv'
# READ DATA
# phenotypic data
fenotip = pd.read_csv(fenotip_path, encoding='latin-1')
# qc data
qc = pd.read_csv(qc_path)
# volumes
v = pd.read_csv(v_path)

In [None]:
# contents of the phenotypic data file
fenotip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1114 entries, 0 to 1113
Columns: 348 entries, SITE_ID to ADI_R_D_INTERVIEWER_JUDGMENT
dtypes: float64(337), int64(3), object(8)
memory usage: 3.0+ MB


In [None]:
# contents of the qc results file
qc.head()

Unnamed: 0,subject,general white matter,general grey matter,general csf,cerebellum,brainstem,thalamus,putamen+pallidum,hippocampus+amygdala
0,28675_1_anat,0.8369,0.7688,0.8097,0.9141,0.8966,0.8578,0.9152,0.8835
1,28676_anat,0.8744,0.7679,0.8697,0.8838,0.8872,0.8997,0.9341,0.8796
2,28677_anat,0.8742,0.7587,0.8179,0.899,0.8537,0.8774,0.9198,0.8929
3,28678_anat,0.8529,0.7679,0.8868,0.9063,0.8618,0.879,0.9038,0.8573
4,28679_1_anat,0.864,0.7848,0.7588,0.8679,0.866,0.8772,0.9112,0.8881


In [None]:
# contents of the volumes file
v.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 370 entries, 0 to 369
Data columns (total 34 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   subject                           370 non-null    object 
 1   total intracranial                370 non-null    float64
 2   left cerebral white matter        370 non-null    float64
 3   left cerebral cortex              370 non-null    float64
 4   left lateral ventricle            370 non-null    float64
 5   left inferior lateral ventricle   370 non-null    float64
 6   left cerebellum white matter      370 non-null    float64
 7   left cerebellum cortex            370 non-null    float64
 8   left thalamus                     370 non-null    float64
 9   left caudate                      370 non-null    float64
 10  left putamen                      370 non-null    float64
 11  left pallidum                     370 non-null    float64
 12  3rd vent

The link is between the "**subject**" field (of type object) in the quality-control results file and the "**SUB_ID**" field in the phenotypic data file.

In [None]:
fenotip.head()

Unnamed: 0,SITE_ID,SUB_ID,NDAR_GUID,DX_GROUP,PDD_DSM_IV_TR,ASD_DSM_5,AGE_AT_SCAN,SEX,HANDEDNESS_CATEGORY,HANDEDNESS_SCORES,...,ADI_R_C3_TOTAL,ADI_R_C4_REPETITIVE_USE_OBJECTS,ADI_R_C4_HIGHER,ADI_R_C4_UNUSUAL_SENSORY_INTERESTS,ADI_R_C4_TOTAL,ADI_R_D_AGE_PARENT_NOTICED,ADI_R_D_AGE_FIRST_SINGLE_WORDS,ADI_R_D_AGE_FIRST_PHRASES,ADI_R_D_AGE_WHEN_ABNORMALITY,ADI_R_D_INTERVIEWER_JUDGMENT
0,ABIDEII-BNI_1,29006,,1,,,48.0,1,1.0,,...,,,,,,,,,,
1,ABIDEII-BNI_1,29007,,1,,,41.0,1,1.0,,...,,,,,,,,,,
2,ABIDEII-BNI_1,29008,,1,,,59.0,1,1.0,,...,,,,,,,,,,
3,ABIDEII-BNI_1,29009,,1,,,57.0,1,1.0,,...,,,,,,,,,,
4,ABIDEII-BNI_1,29010,,1,,,45.0,1,1.0,,...,,,,,,,,,,


Remove rows without a SUB_ID from the dataset loaded from the phenotypic data file.

In [None]:
# drop rows whose SUB_ID field is NaN
fenotip.dropna(axis=0, subset='SUB_ID',  how='all', inplace = True)
fenotip.head()

Unnamed: 0,SITE_ID,SUB_ID,NDAR_GUID,DX_GROUP,PDD_DSM_IV_TR,ASD_DSM_5,AGE_AT_SCAN,SEX,HANDEDNESS_CATEGORY,HANDEDNESS_SCORES,...,ADI_R_C3_TOTAL,ADI_R_C4_REPETITIVE_USE_OBJECTS,ADI_R_C4_HIGHER,ADI_R_C4_UNUSUAL_SENSORY_INTERESTS,ADI_R_C4_TOTAL,ADI_R_D_AGE_PARENT_NOTICED,ADI_R_D_AGE_FIRST_SINGLE_WORDS,ADI_R_D_AGE_FIRST_PHRASES,ADI_R_D_AGE_WHEN_ABNORMALITY,ADI_R_D_INTERVIEWER_JUDGMENT
0,ABIDEII-BNI_1,29006,,1,,,48.0,1,1.0,,...,,,,,,,,,,
1,ABIDEII-BNI_1,29007,,1,,,41.0,1,1.0,,...,,,,,,,,,,
2,ABIDEII-BNI_1,29008,,1,,,59.0,1,1.0,,...,,,,,,,,,,
3,ABIDEII-BNI_1,29009,,1,,,57.0,1,1.0,,...,,,,,,,,,,
4,ABIDEII-BNI_1,29010,,1,,,45.0,1,1.0,,...,,,,,,,,,,


In [None]:
# convert SUB_ID to string (via int, so the text carries no decimal part)
fenotip['SUB_ID'] = fenotip['SUB_ID'].astype(int)
fenotip['SUB_ID'] = fenotip['SUB_ID'].astype(str)
fenotip.head()

Unnamed: 0,SITE_ID,SUB_ID,NDAR_GUID,DX_GROUP,PDD_DSM_IV_TR,ASD_DSM_5,AGE_AT_SCAN,SEX,HANDEDNESS_CATEGORY,HANDEDNESS_SCORES,...,ADI_R_C3_TOTAL,ADI_R_C4_REPETITIVE_USE_OBJECTS,ADI_R_C4_HIGHER,ADI_R_C4_UNUSUAL_SENSORY_INTERESTS,ADI_R_C4_TOTAL,ADI_R_D_AGE_PARENT_NOTICED,ADI_R_D_AGE_FIRST_SINGLE_WORDS,ADI_R_D_AGE_FIRST_PHRASES,ADI_R_D_AGE_WHEN_ABNORMALITY,ADI_R_D_INTERVIEWER_JUDGMENT
0,ABIDEII-BNI_1,29006,,1,,,48.0,1,1.0,,...,,,,,,,,,,
1,ABIDEII-BNI_1,29007,,1,,,41.0,1,1.0,,...,,,,,,,,,,
2,ABIDEII-BNI_1,29008,,1,,,59.0,1,1.0,,...,,,,,,,,,,
3,ABIDEII-BNI_1,29009,,1,,,57.0,1,1.0,,...,,,,,,,,,,
4,ABIDEII-BNI_1,29010,,1,,,45.0,1,1.0,,...,,,,,,,,,,
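The two-step conversion (first to `int`, then to `str`) matters whenever pandas has read the ID column as float, which it does for numeric columns containing missing values. The sketch below uses a toy series, not the real file, and assumes that situation; whether `SUB_ID` is actually read as float in this dataset is not shown in the outputs above.

```python
import numpy as np
import pandas as pd

# Toy ID column read as float because of a missing value (hypothetical data).
ids = pd.Series([29006.0, 29007.0, np.nan], name="SUB_ID")

# Converting floats straight to text keeps the decimal part.
as_text_direct = ids.dropna().astype(str).tolist()

# Going through int first yields the plain ID that appears in the file names.
as_text_via_int = ids.dropna().astype(int).astype(str).tolist()

print(as_text_direct)
print(as_text_via_int)
```

With the decimal part present, an ID such as `29006.0` would never be found inside a file name like `29006_anat`, so the link in the next step would silently match nothing.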


In [None]:
qc.head()

Unnamed: 0,subject,general white matter,general grey matter,general csf,cerebellum,brainstem,thalamus,putamen+pallidum,hippocampus+amygdala
0,28675_1_anat,0.8369,0.7688,0.8097,0.9141,0.8966,0.8578,0.9152,0.8835
1,28676_anat,0.8744,0.7679,0.8697,0.8838,0.8872,0.8997,0.9341,0.8796
2,28677_anat,0.8742,0.7587,0.8179,0.899,0.8537,0.8774,0.9198,0.8929
3,28678_anat,0.8529,0.7679,0.8868,0.9063,0.8618,0.879,0.9038,0.8573
4,28679_1_anat,0.864,0.7848,0.7588,0.8679,0.866,0.8772,0.9112,0.8881


## **STEP 4: Link data**



I also add the IQ information. The sites used different types of test to compute it, so I keep the fields FIQ, PIQ and PIQ_TEST_TYPE: some sites did not compute the global score (FIQ), and those that used the **Raven** test only filled in the PIQ (Performance IQ) field.

In [None]:
# first keep only the phenotypic columns of interest
# SITE_ID
# SUB_ID
# DX_GROUP
# AGE_AT_SCAN
# SEX
# FIQ, PIQ, PIQ_TEST_TYPE
f_data = fenotip[['SITE_ID','SUB_ID', 'DX_GROUP','AGE_AT_SCAN ','SEX','FIQ','PIQ', 'PIQ_TEST_TYPE']]
f_data.head()

Unnamed: 0,SITE_ID,SUB_ID,DX_GROUP,AGE_AT_SCAN,SEX,FIQ,PIQ,PIQ_TEST_TYPE
0,ABIDEII-BNI_1,29006,1,48.0,1,131.0,,
1,ABIDEII-BNI_1,29007,1,41.0,1,110.0,,
2,ABIDEII-BNI_1,29008,1,59.0,1,117.0,,
3,ABIDEII-BNI_1,29009,1,57.0,1,114.0,,
4,ABIDEII-BNI_1,29010,1,45.0,1,109.0,,


In [None]:
# build the combined dataframe
# https://www.geeksforgeeks.org/python/join-pandas-dataframes-matching-by-substring/
# a constant 'join' key pairs every phenotypic row with every scan row (cross join)
f_data['join'] = 1
qc['join'] = 1
v['join'] = 1

full_qc_df = f_data.merge(qc, on='join').drop('join', axis=1)
full_v_df = f_data.merge(v, on='join').drop('join', axis=1)

qc.drop('join', axis=1, inplace=True)
v.drop('join', axis=1, inplace=True)

# keep only the data of the subjects we have processed:
# 'match' is True when SUB_ID appears inside the file name in 'subject'
full_qc_df['match'] = full_qc_df.apply( lambda x: x.subject.find(str(x.SUB_ID)), axis=1).ge(0)
full_v_df['match'] = full_v_df.apply( lambda x: x.subject.find(str(x.SUB_ID)), axis=1).ge(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  f_data['join'] = 1
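The message above is pandas' `SettingWithCopyWarning`: `f_data` was created as a column selection of `fenotip`, so pandas cannot tell whether the new `join` column is meant for the selection or for the original frame. A minimal sketch of one common way to avoid it, taking an explicit copy before adding columns, follows; the frame and column values are toy data.

```python
import pandas as pd

# Toy stand-in for the phenotypic table (hypothetical values).
pheno = pd.DataFrame({
    "SITE_ID": ["ABIDEII-BNI_1", "ABIDEII-BNI_1"],
    "SUB_ID": ["29006", "29007"],
    "NDAR_GUID": [None, None],
})

# Explicit copy: the selection is now an independent DataFrame.
subset = pheno[["SITE_ID", "SUB_ID"]].copy()

# Adding a column touches only the copy, so no warning is raised.
subset["join"] = 1

print(subset.columns.tolist())
print(pheno.columns.tolist())
```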


In [None]:
full_v_df.head()

Unnamed: 0,SITE_ID,SUB_ID,DX_GROUP,AGE_AT_SCAN,SEX,FIQ,PIQ,PIQ_TEST_TYPE,subject,total intracranial,...,right cerebellum cortex,right thalamus,right caudate,right putamen,right pallidum,right hippocampus,right amygdala,right accumbens area,right ventral DC,match
0,ABIDEII-BNI_1,29006,1,48.0,1,131.0,,,28675_1_anat,1338241.2,...,56115.63,6612.35,3995.76,4843.594,1441.631,4401.238,1529.647,707.751,3852.0,False
1,ABIDEII-BNI_1,29006,1,48.0,1,131.0,,,28676_anat,1781819.0,...,60841.6,9226.183,5315.918,6586.38,1926.226,4452.581,2109.016,757.012,5055.705,False
2,ABIDEII-BNI_1,29006,1,48.0,1,131.0,,,28677_anat,1639514.2,...,58436.637,8529.139,5048.086,6024.036,1692.341,4832.502,1901.521,741.057,4530.029,False
3,ABIDEII-BNI_1,29006,1,48.0,1,131.0,,,28678_anat,1397601.8,...,53459.7,6431.748,3580.828,4647.038,1423.337,3902.226,1608.444,605.127,3422.313,False
4,ABIDEII-BNI_1,29006,1,48.0,1,131.0,,,28679_1_anat,1647357.8,...,58825.11,8389.996,4930.436,5571.388,1961.566,5251.076,2035.279,875.499,4998.054,False


In [None]:
# Quality-control results - drop rows that were not linked
# (.copy() so that later edits act on an independent DataFrame)
matched_qc_df  = full_qc_df[full_qc_df['match']].copy()
matched_qc_df.shape

(370, 18)

Impute missing IQ values: for subjects who took a **Raven** test and have no FIQ (global IQ), I assign the performance value (PIQ) to FIQ; in all other cases I leave it blank (in principle I will not use IQ).

In [None]:
rule = (matched_qc_df['FIQ'].isna()) & (matched_qc_df['PIQ_TEST_TYPE'] == 'Raven')
matched_qc_df.loc[rule, 'FIQ'] = matched_qc_df.loc[rule,'PIQ']
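To make the effect of the rule concrete, here is the same boolean mask applied to a three-row toy table. The values are invented, and `"WASI"` is only a hypothetical label for a non-Raven test.

```python
import numpy as np
import pandas as pd

# Toy IQ table: row 0 has FIQ, row 1 is a Raven case, row 2 is another test.
iq = pd.DataFrame({
    "FIQ": [131.0, np.nan, np.nan],
    "PIQ": [np.nan, 105.0, 98.0],
    "PIQ_TEST_TYPE": [np.nan, "Raven", "WASI"],
})

# Same rule as above: FIQ missing AND the test type is Raven.
rule = iq["FIQ"].isna() & (iq["PIQ_TEST_TYPE"] == "Raven")
iq.loc[rule, "FIQ"] = iq.loc[rule, "PIQ"]

print(iq)
```

Only the Raven row receives its PIQ value; the row with the other test type keeps a missing FIQ.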

In [None]:
# drop the match column and the IQ fields that are no longer needed
matched_qc_df.drop(['match','PIQ_TEST_TYPE','PIQ'], axis=1, inplace=True)
matched_qc_df.reset_index(drop=True, inplace=True)
matched_qc_df.head()

Unnamed: 0,SITE_ID,SUB_ID,DX_GROUP,AGE_AT_SCAN,SEX,FIQ,subject,general white matter,general grey matter,general csf,cerebellum,brainstem,thalamus,putamen+pallidum,hippocampus+amygdala
0,ABIDEII-BNI_1,29006,1,48.0,1,131.0,29006_anat,0.8624,0.7574,0.8766,0.8818,0.8568,0.8979,0.9099,0.8691
1,ABIDEII-BNI_1,29007,1,41.0,1,110.0,29007_anat,0.867,0.7595,0.7779,0.8834,0.8639,0.8667,0.9122,0.865
2,ABIDEII-BNI_1,29008,1,59.0,1,117.0,29008_anat,0.8682,0.7831,0.8821,0.8922,0.8938,0.8925,0.8989,0.8896
3,ABIDEII-BNI_1,29009,1,57.0,1,114.0,29009_anat,0.861,0.7644,0.8528,0.881,0.8733,0.8771,0.9054,0.8839
4,ABIDEII-BNI_1,29010,1,45.0,1,109.0,29010_anat,0.8686,0.7892,0.8626,0.9019,0.9043,0.8849,0.8974,0.9007


In [None]:
# Extracted volumes - drop rows that were not linked
# (.copy() so that later edits act on an independent DataFrame)
matched_v_df  = full_v_df[full_v_df['match']].copy()
matched_v_df.shape

(370, 43)
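The 370 matched rows equal the 370 entries reported by `v.info()`, which is consistent with every scan being linked to exactly one subject. The row count alone does not prove it, because a substring match can link one file to two IDs while another file stays unmatched. A minimal sketch of an explicit check follows; `check_matches` is a hypothetical helper, and the toy ID `"2900"` is invented to trigger a false positive.

```python
import pandas as pd


def check_matches(matched, scans, key="subject"):
    """Return (scans matched more than once, scans not matched at all)."""
    duplicated = sorted(matched.loc[matched[key].duplicated(), key].unique())
    unmatched = sorted(set(scans[key]) - set(matched[key]))
    return duplicated, unmatched


# Toy data: three scans, three matched rows, yet the matching is wrong.
scans = pd.DataFrame({"subject": ["29006_anat", "29007_anat", "28675_1_anat"]})
matched = pd.DataFrame({
    "SUB_ID": ["29006", "29007", "2900"],  # "2900" is a substring of "29006_anat"
    "subject": ["29006_anat", "29007_anat", "29006_anat"],
})

duplicated, unmatched = check_matches(matched, scans)
print(duplicated)
print(unmatched)
```

Two empty lists would indicate a one-to-one link between scans and subjects.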

In [None]:
rule = (matched_v_df['FIQ'].isna()) & (matched_v_df['PIQ_TEST_TYPE'] == 'Raven')
matched_v_df.loc[rule, 'FIQ'] = matched_v_df.loc[rule,'PIQ']

In [None]:
# drop the match column and the IQ fields that are no longer needed
matched_v_df.drop(['match','PIQ_TEST_TYPE','PIQ'], axis=1, inplace=True)
matched_v_df.reset_index(drop=True, inplace=True)
matched_v_df.head()

Unnamed: 0,SITE_ID,SUB_ID,DX_GROUP,AGE_AT_SCAN,SEX,FIQ,subject,total intracranial,left cerebral white matter,left cerebral cortex,...,right cerebellum white matter,right cerebellum cortex,right thalamus,right caudate,right putamen,right pallidum,right hippocampus,right amygdala,right accumbens area,right ventral DC
0,ABIDEII-BNI_1,29006,1,48.0,1,131.0,29006_anat,1757710.2,259873.33,268254.22,...,19157.395,51659.777,7459.366,3932.502,6187.846,1874.996,4703.144,1945.425,722.759,4258.534
1,ABIDEII-BNI_1,29007,1,41.0,1,110.0,29007_anat,1887349.1,287961.53,298969.12,...,20529.336,58005.5,8901.793,6474.621,7358.886,2035.508,4795.284,2219.866,823.072,5447.822
2,ABIDEII-BNI_1,29008,1,59.0,1,117.0,29008_anat,1520280.5,225936.42,230091.05,...,18873.488,51608.484,6077.783,3628.519,5075.09,1480.917,3973.777,1891.503,656.89,3787.245
3,ABIDEII-BNI_1,29009,1,57.0,1,114.0,29009_anat,1608100.8,227813.52,238434.92,...,17285.945,51243.855,6651.206,3745.923,6084.976,1708.314,4442.364,1948.577,683.984,3974.932
4,ABIDEII-BNI_1,29010,1,45.0,1,109.0,29010_anat,1632847.5,239377.64,254104.42,...,17225.818,52908.12,6557.378,4574.962,5176.962,1761.316,4163.968,1840.35,760.056,4350.503


## **STEP 5: Save CSV**

In [None]:
matched_qc_df.to_csv(merged_qc_path)
matched_v_df.to_csv(merged_v_path)
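By default `to_csv` also writes the row index, and reading the file back then produces an extra `Unnamed: 0` column like the one visible in `qc.head()`. The sketch below shows the difference on a toy frame written to in-memory buffers; whether later notebooks rely on that index column is not known, so this is an option rather than a correction.

```python
import io

import pandas as pd

# Toy frame standing in for the merged dataset (hypothetical values).
demo = pd.DataFrame({"SUB_ID": ["29006"], "FIQ": [131.0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()

print(cols_with)
print(cols_without)
```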