# Introduction

In this notebook, we examine the CSV file containing the phenotypes of the patients who underwent the brain scans. By phenotype, we mean the set of observable characteristics of an individual. The meaning of each column in the CSV file is listed in the accompanying legend, _ABIDE_LEGEND_V1.02.pdf_.

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt

In [None]:
notebook_dir = Path().resolve()
project_root = notebook_dir.parent.parent
phenotype_path = project_root / "data" / "Phenotypic_V1_0b_preprocessed1.csv"

I notice that the column `CURRENT_MED_STATUS` contains blank values alongside the values "0" and "1". For this reason, I force the column to be read as a string. `pl.Utf8` is the data type Polars uses to represent strings.

In [None]:
pheno_df = pl.read_csv(phenotype_path, null_values="", schema_overrides={'CURRENT_MED_STATUS': pl.Utf8})
pheno_df.head()


Next, we need to compare the CSV file with the .1D files (i.e., the files with the actual brain image data). The CSV file contains more phenotype records than there are files with brain image data. Therefore, we first read the filenames of the .1D files; afterwards, we can delete the phenotype records that have no corresponding brain image data.

In [21]:
one_d_path = project_root / "data" / "preprocessed_dataset" / "Outputs" / "cpac" / "filt_noglobal" / "rois_cc200"
print(one_d_path)
print(one_d_path.exists())
subject_ids = [f.stem for f in one_d_path.glob("*.1D")]
print(f"Found {len(subject_ids)} .1D files")
print("example subject IDs:", subject_ids[:5])

C:\ASD Data Science\ASD-Graph-Analysis\data\preprocessed_dataset\Outputs\cpac\filt_noglobal\rois_cc200
True
Found 884 .1D files
example subject IDs: ['Caltech_0051456_rois_cc200', 'Caltech_0051457_rois_cc200', 'Caltech_0051458_rois_cc200', 'Caltech_0051459_rois_cc200', 'Caltech_0051460_rois_cc200']
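The stems printed above still carry the site name and the `_rois_cc200` suffix, so they cannot be compared with phenotype records as they are. The following is a minimal sketch of pulling the numeric ID out of each stem, assuming every stem follows the `<site>_<zero-padded id>_rois_cc200` layout seen in the examples. The helper `extract_subject_id` and the list `example_stems` are hypothetical names introduced here.

```python
import re
from typing import Optional

# Assumed stem layout: "<site>_<zero-padded id>_rois_cc200".
# The site part may itself contain underscores or digits.
STEM_PATTERN = re.compile(r"(?:^|_)(\d+)_rois_cc200$")


def extract_subject_id(stem: str) -> Optional[int]:
    """Return the numeric ID embedded in a .1D file stem, or None if absent."""
    match = STEM_PATTERN.search(stem)
    return int(match.group(1)) if match else None


example_stems = [
    "Caltech_0051456_rois_cc200",
    "Caltech_0051457_rois_cc200",
    "not_a_scan_file",  # does not follow the assumed layout, so it is skipped
]

# Collect the IDs of all stems that match the assumed layout.
scan_ids = {
    sid for stem in example_stems if (sid := extract_subject_id(stem)) is not None
}
print(sorted(scan_ids))
```

With the IDs in a set, the phenotype table could then be reduced with something like `pheno_df.filter(pl.col("SUB_ID").is_in(list(scan_ids)))`. Here `SUB_ID` is an assumed column name that should be checked against the legend PDF before use.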
