In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
adni = pd.read_csv("idaSearch_AD_CN_MRI_PET.csv")

In [6]:
adni.head()

Unnamed: 0,Subject ID,Sex,Research Group,Age,Description
0,002_S_0295,M,CN,84.9,B1-Calibration Body
1,002_S_0295,M,CN,84.9,B1-Calibration PA
2,002_S_0295,M,CN,84.9,3-plane localizer
3,002_S_0295,M,CN,84.9,3-plane localizer
4,002_S_0295,M,CN,84.9,3-plane localizer


This csv file was taken from ADNI's search interface.
From the ADNI login go to **_Download$\rightarrow$Image Collections$\rightarrow$Advanced Search_**.

The "idaSearch_AD_CN_MRI_PET.csv" was found using the following parameters:

**Projects:** ADNI

**Research Group:** AD, CN (check the box to the far right under "Display in result"; otherwise the csv won't have labels.)

**Modality:** MRI, PET (AND)


# <span style="color:red">PROBLEM:</span>

We need to figure out which of these ~30k samples we want to use. Description refers to the type of MRI or PET scan that was acquired. Some scans are local to one particular region, some are higher resolution, etc. Not all of the subject IDs have the same set of Descriptions.

This means that we probably need to find a group of a few hundred subject IDs which all have the same two Descriptions, one being an MRI and one a PET scan.

### Ways to help: 
Let's dig through this dataframe and get some stats on the most common Descriptions.

Get on the ADNI website and figure out which Descriptions correspond to MRIs and which to PET scans. (The PET ones often seem to be labeled as such.)

Determine if there are a few descriptions which we may consider equivalent. Do they correspond to the same test but at different machines or testing locations? etc.
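One cheap first pass at spotting equivalent descriptions is to normalize case, hyphens, and whitespace before counting. The sketch below uses a hypothetical helper, `normalize_description`, and a few toy rows standing in for the real `adni` dataframe. It only merges spelling variants; whether two variants really are the same acquisition (field strength, ADNI phase, site) still has to be checked by hand.

```python
import re
import pandas as pd

def normalize_description(text):
    """Lowercase, turn hyphens/underscores into spaces, collapse whitespace."""
    text = re.sub(r"[-_]+", " ", str(text).lower())
    return re.sub(r"\s+", " ", text).strip()

# Toy rows for illustration only; in the notebook this would be `adni`.
toy = pd.DataFrame(
    {"Description": ["3 Plane Localizer", "3-plane localizer", "MPRAGE", "localizer"]}
)

# Count descriptions after normalization so spelling variants fall together.
toy["Description_norm"] = toy["Description"].map(normalize_description)
norm_counts = toy["Description_norm"].value_counts()
print(norm_counts)
```

Keeping the original `Description` column next to the normalized one makes it possible to go back to the exact strings ADNI expects when searching.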

### Goal:

Ideally, we want a list of subject IDs and the two descriptions for the data we want. With this it should be straightforward to enter the info into ADNI and get our data.
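As a rough sketch of what that list could look like in code, the hypothetical helper `subjects_with_both` below returns the subject IDs that have at least one row for each of two chosen descriptions. The subject IDs in the toy frame are made up, and the choice of `MPRAGE` and `ADNI Brain PET: Raw AV45` as the pair is only an example, not a decision.

```python
import pandas as pd

def subjects_with_both(df, mri_desc, pet_desc):
    """Return sorted Subject IDs that have at least one row for each description."""
    mri_ids = set(df.loc[df["Description"] == mri_desc, "Subject ID"])
    pet_ids = set(df.loc[df["Description"] == pet_desc, "Subject ID"])
    return sorted(mri_ids & pet_ids)

# Toy rows for illustration only; in the notebook this would be `adni`.
toy = pd.DataFrame(
    {
        "Subject ID": ["A", "A", "B", "C", "C"],
        "Description": [
            "MPRAGE",
            "ADNI Brain PET: Raw AV45",
            "MPRAGE",
            "ADNI Brain PET: Raw AV45",
            "MPRAGE",
        ],
    }
)

paired = subjects_with_both(toy, "MPRAGE", "ADNI Brain PET: Raw AV45")
print(paired)
```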

## GET DESCRIPTION COUNTS:

In [95]:
# Count how often each Description occurs, most frequent first.
# value_counts() replaces the row-by-row loop and already sorts descending.
counts = (
    adni["Description"]
    .value_counts()
    .rename_axis("Description")
    .reset_index(name="Counts")
)

In [89]:
counts.head()

Unnamed: 0,Counts,Description
0,3046,3 Plane Localizer
1,2560,localizer
2,1790,MPRAGE
3,1787,3-plane localizer
4,1335,Field Mapping


After looking at the descriptions, I found that:

**3 Plane Localizer** seems to be a triplet of MRI images, mixed between 1.5 and 3 T


**Localizer** Triplet of MRI images, mixed between 1.5 and 3 T

Both of the above come up when I search Localizer, so jointly there are 155 from ADNI1, 4000 from ADNI2, and 1000 from ADNI3.

**MPRAGE** Single MRI Image, mixed between 1.5 and 3 T. 1429 from ADNI1, 2000 from ADNI2, 400 of ADNI3

**3-plane localizer** Triplet of MRIs. 1286 of them are from ADNI1 alone, ~1700 otherwise. Mixed between 1.5 and 3 T

**Field Mapping** Two MRIs. 1400 of them, all at 3 T and all from ADNI2 or ADNI3


#### NO PET SCANS YET...

When I search with the same parameters but only select PET scans, I see 3165 results.

In [98]:
pet = pd.read_csv("idaSearch_AD_CN_PET.csv")

# Count descriptions in the PET-only search results, most frequent first.
pet_counts = (
    pet["Description"]
    .value_counts()
    .rename_axis("Description")
    .reset_index(name="Counts")
)
pet_counts.head()

Unnamed: 0,Counts,Description
0,201,ADNI Brain PET: Raw AV45
1,151,ADNI Brain PET: Raw
2,134,ADNI Brain PET: Raw FDG
3,76,ADNI Static Brain (6x5)
4,65,ADNI Brain PET: Raw Tau


The dataframe above covers all PET scans returned by that search. The conclusion is that if we insist that every PET scan in our dataset has exactly the same description, our total sample size is capped at 201 scans.

I suggest we try to find some MRIs for the subject IDs behind those 201 scans and move forward.
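One way to look for those MRIs is to restrict the combined search results to the AV45 subjects and rank the remaining descriptions by how many distinct subjects each one covers. The frames `toy_all` and `toy_pet` below are made-up stand-ins for `adni` and `pet`; the sketch also assumes that any description appearing in the PET-only file is a PET description and can be excluded.

```python
import pandas as pd

AV45 = "ADNI Brain PET: Raw AV45"

# Toy stand-ins for `pet` and `adni` (illustrative values only).
toy_pet = pd.DataFrame(
    {
        "Subject ID": ["A", "A", "B", "C"],
        "Description": [AV45, AV45, AV45, "ADNI Brain PET: Raw FDG"],
    }
)
toy_all = pd.DataFrame(
    {
        "Subject ID": ["A", "A", "B", "B", "C"],
        "Description": ["MPRAGE", "localizer", "MPRAGE", "MPRAGE", "MPRAGE"],
    }
)

# Subjects with at least one AV45 scan.
av45_ids = set(toy_pet.loc[toy_pet["Description"] == AV45, "Subject ID"])

# Rows for those subjects, excluding descriptions known to be PET.
is_av45_subject = toy_all["Subject ID"].isin(av45_ids)
is_pet_desc = toy_all["Description"].isin(toy_pet["Description"].unique())
candidates = toy_all[is_av45_subject & ~is_pet_desc]

# Distinct AV45 subjects covered by each remaining description.
coverage = (
    candidates.groupby("Description")["Subject ID"]
    .nunique()
    .sort_values(ascending=False)
)
print(coverage)
```

Counting distinct subjects rather than rows matters here, because a subject with several scans of the same type would otherwise inflate that description's apparent coverage.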

In [107]:
av45 = pet[pet['Description'] == 'ADNI Brain PET: Raw AV45']
print(av45.shape)

(201, 7)


In [106]:
av45.drop_duplicates(subset='Subject ID').shape

(107, 7)

Okay, so there are only 107 distinct subject IDs among those 201 datapoints. Do duplicates hurt us? **This could be a research question we investigate**
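To get a feel for the duplicates, a quick sketch is to count scans per subject and, as one possible policy, keep only the earliest scan for each subject. The toy frame below stands in for `av45`; using `Age` as a proxy for scan order is an assumption, and it presumes repeated rows are separate visits rather than re-uploads of the same scan.

```python
import pandas as pd

# Toy stand-in for `av45` (illustrative values only).
toy_av45 = pd.DataFrame(
    {
        "Subject ID": ["A", "A", "B", "C", "C", "C"],
        "Age": [70.1, 71.2, 80.0, 65.5, 66.0, 67.1],
    }
)

# How many scans does each subject have, and how many subjects repeat?
scans_per_subject = toy_av45["Subject ID"].value_counts()
n_repeat = int((scans_per_subject > 1).sum())

# One possible policy: keep the earliest scan (lowest Age) per subject.
first_scan = toy_av45.sort_values("Age").drop_duplicates(
    subset="Subject ID", keep="first"
)
print(scans_per_subject)
print(first_scan)
```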