# Exploring Neurosynth data

Database available at: https://github.com/neurosynth/neurosynth-data/tree/826d52c975b902d59d3434c46bda41986495ca99/archive
Using v0.6.

In [11]:
import tarfile
import pandas as pd

# Read every regular file in the archive into a DataFrame, keyed by member name.
# Hidden entries (paths containing './.') are skipped.
with tarfile.open('data_0.6.tar.gz') as f:
    files = {
        member.name: pd.read_table(f.extractfile(member))
        for member in f.getmembers()
        if member.isfile() and './.' not in member.name
    }

Print the shape and column headers of each table.

In [12]:
for name, frame in files.items():
    print('{key} {shape}: {header}'.format(key=name, header=frame.columns, shape=frame.shape))

database.txt (413429, 13): Index(['id', 'doi', 'x', 'y', 'z', 'space', 'peak_id', 'table_id', 'table_num',
       'title', 'authors', 'year', 'journal'],
      dtype='object')
features.txt (11406, 3170): Index(['pmid', '001', '01', '05', '10', '10 healthy', '100', '11', '12',
       '12 healthy',
       ...
       'year old', 'years', 'yield', 'yielded', 'young', 'young adults',
       'young healthy', 'younger', 'younger adults', 'zone'],
      dtype='object', length=3170)


Examine the first 5 rows of each dataframe to get a feel for the data.

In [19]:
for name, frame in files.items():
    print('--|||{f}|||--:\n {frame} \n {sep}'.format(f=name, frame=frame[:5], sep='_' * 78))

--|||database.txt|||--:
         id  doi     x     y     z space  peak_id  table_id table_num  \
0  9065511  NaN  38.0 -48.0  49.0   MNI   215927     11416        1.   
1  9065511  NaN  -4.0 -70.0  50.0   MNI   215928     11416        1.   
2  9065511  NaN -34.0 -52.0  60.0   MNI   215929     11416        1.   
3  9065511  NaN -23.0  15.0  67.0   MNI   215930     11416        1.   
4  9065511  NaN -23.0 -20.0  68.0   MNI   215931     11416        1.   

                                               title  \
0  Environmental knowledge is subserved by separa...   
1  Environmental knowledge is subserved by separa...   
2  Environmental knowledge is subserved by separa...   
3  Environmental knowledge is subserved by separa...   
4  Environmental knowledge is subserved by separa...   

                    authors  year  \
0  Aguirre GK, D'Esposito M  1997   
1  Aguirre GK, D'Esposito M  1997   
2  Aguirre GK, D'Esposito M  1997   
3  Aguirre GK, D'Esposito M  1997   
4  Aguirre GK, D'Esp


### Analysis
1) features.txt
* The first column, pmid, is the article ID that maps each article's linguistic data to its fMRI data in database.txt.
* The remaining columns hold the tf-idf values of the terms in the article.

2) database.txt
* The id column links each row to features.txt (pmid = id).
* The other identifying columns (authors, year, journal, title) describe the article from which the fMRI data was taken.
* x, y, z are the stereotactic coordinates of an fMRI peak, and peak_id identifies that peak.
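
The link between the two tables can be sketched as a join on pmid = id. The frames below are tiny made-up stand-ins for features.txt and database.txt: the values and the two term columns are illustrative, not taken from the dataset, and the names `toy_features`, `toy_database`, and `merged` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for features.txt and database.txt (all values are made up).
toy_features = pd.DataFrame({
    'pmid': [101, 102],
    'memory': [0.12, 0.0],   # tf-idf weight of the term in each article
    'visual': [0.0, 0.31],
})
toy_database = pd.DataFrame({
    'id': [101, 101, 102],
    'x': [38.0, -4.0, 21.0],
    'y': [-48.0, -70.0, -81.0],
    'z': [49.0, 50.0, 28.0],
})

# One row per peak, with the article's term weights attached to each peak.
merged = toy_database.merge(toy_features, left_on='id', right_on='pmid', how='left')
print(merged)
```

The same call should work on files['database.txt'] and files['features.txt'], but the full join would attach roughly 3,000 term columns to each of the 413,429 peak rows, so selecting a few term columns first is probably wiser.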


##### Moving on
Look at the id values in database.txt to see roughly how many peaks each article yields.

A quick estimate: divide the number of rows in database.txt by the number of rows in features.txt.

In [22]:
files['database.txt'].shape[0]/files['features.txt'].shape[0]

36.24662458355252

In [35]:
dat = files['database.txt']
features = files['features.txt']
dat[dat['id']==9065511].shape[0]

10

So, roughly 36 peaks per study on average, while the first study has only 10 peaks.
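
The ratio of row counts is only a mean, and it assumes that every article in features.txt also appears in database.txt. Counting rows per article gives the full distribution instead. The sketch below uses a small made-up frame in place of dat; `toy` and `peaks_per_study` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for database.txt: one row per reported peak (values are made up).
toy = pd.DataFrame({
    'id': [101, 101, 101, 102, 103, 103],
    'peak_id': [1, 2, 3, 4, 5, 6],
})

# Number of peak rows belonging to each article id.
peaks_per_study = toy.groupby('id').size()

# Summary statistics (mean, quartiles, min, max) of peaks per article.
print(peaks_per_study.describe())
```

On the real data, replacing `toy` with dat would show whether the mean of about 36 is typical or is pulled up by a few articles with very many peaks.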

In [36]:
dat[dat['id']==9065511]

| | id | doi | x | y | z | space | peak_id | table_id | table_num | title | authors | year | journal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9065511 | NaN | 38.0 | -48.0 | 49.0 | MNI | 215927 | 11416 | 1.0 | Environmental knowledge is subserved by separa... | Aguirre GK, D'Esposito M | 1997 | The Journal of neuroscience : the official jou... |
| 1 | 9065511 | NaN | -4.0 | -70.0 | 50.0 | MNI | 215928 | 11416 | 1.0 | Environmental knowledge is subserved by separa... | Aguirre GK, D'Esposito M | 1997 | The Journal of neuroscience : the official jou... |
| 2 | 9065511 | NaN | -34.0 | -52.0 | 60.0 | MNI | 215929 | 11416 | 1.0 | Environmental knowledge is subserved by separa... | Aguirre GK, D'Esposito M | 1997 | The Journal of neuroscience : the official jou... |
| 3 | 9065511 | NaN | -23.0 | 15.0 | 67.0 | MNI | 215930 | 11416 | 1.0 | Environmental knowledge is subserved by separa... | Aguirre GK, D'Esposito M | 1997 | The Journal of neuroscience : the official jou... |
| 4 | 9065511 | NaN | -23.0 | -20.0 | 68.0 | MNI | 215931 | 11416 | 1.0 | Environmental knowledge is subserved by separa... | Aguirre GK, D'Esposito M | 1997 | The Journal of neuroscience : the official jou... |
| 5 | 9065511 | NaN | 42.0 | -47.0 | -19.0 | MNI | 215932 | 11416 | 1.0 | Environmental knowledge is subserved by separa... | Aguirre GK, D'Esposito M | 1997 | The Journal of neuroscience : the official jou... |
| 6 | 9065511 | NaN | 25.0 | -35.0 | -8.0 | MNI | 215933 | 11416 | 1.0 | Environmental knowledge is subserved by separa... | Aguirre GK, D'Esposito M | 1997 | The Journal of neuroscience : the official jou... |
| 7 | 9065511 | NaN | -25.0 | -45.0 | -2.0 | MNI | 215934 | 11416 | 1.0 | Environmental knowledge is subserved by separa... | Aguirre GK, D'Esposito M | 1997 | The Journal of neuroscience : the official jou... |
| 8 | 9065511 | NaN | 52.0 | -62.0 | 14.0 | MNI | 215935 | 11416 | 1.0 | Environmental knowledge is subserved by separa... | Aguirre GK, D'Esposito M | 1997 | The Journal of neuroscience : the official jou... |
| 9 | 9065511 | NaN | 21.0 | -81.0 | 28.0 | MNI | 215936 | 11416 | 1.0 | Environmental knowledge is subserved by separa... | Aguirre GK, D'Esposito M | 1997 | The Journal of neuroscience : the official jou... |


To Do:

* Find which keywords show the most and the least bilaterality
* Find which brain regions wire together most frequently