# Topic Modelling Preprocessing


This notebook queries PubMed for a set of 24 subcortical structures reported in fMRI, PET or electrophysiology papers and obtains the corresponding abstracts from a local folder. To see how the folder with abstracts was obtained, please see (reference). Next, it preprocesses the abstracts: the text is converted to lowercase, multi-word structure and function terms (n-grams) are replaced by their underscore-joined counterparts (a two-word term `a b` becomes `a_b`), and the result is tokenized. It then creates 4 pandas DataFrames:

  - One vocabulary with all unique words and an index
  - One containing structures and their index in the vocabulary
  - One containing functions and their index in the vocabulary
  - One containing the counts of every word per document id (PubMed ID), and the index of these words.
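
As a rough illustration of the four tables listed above, the sketch below builds them for a two-document toy corpus. The PMIDs, terms and variable names are made up for illustration; the notebook itself derives them from the PubMed abstracts and may structure the tables slightly differently.

```python
import pandas as pd

# Toy corpus: hypothetical PMIDs mapped to already-preprocessed text.
docs = {
    '111': 'subthalamic_nucleus stimulation improves working_memory',
    '222': 'amygdala activity during working_memory working_memory task',
}
structures = ['subthalamic_nucleus', 'amygdala']   # assumed structure terms
functions = ['working_memory']                     # assumed function terms

# Long-format counts: one row per (document, word) pair.
rows = [(pmid, word) for pmid, text in sorted(docs.items()) for word in text.split()]
counts = (pd.DataFrame(rows, columns=['ID', 'word'])
            .groupby(['ID', 'word']).size().reset_index(name='count'))

# 1. Vocabulary: every unique word with an integer index.
vocab = pd.DataFrame({'word': counts['word'].unique()})
vocab['index'] = range(len(vocab))
word_to_index = dict(zip(vocab['word'], vocab['index']))

# 2. and 3. Structure and function terms with their vocabulary index.
struct_df = pd.DataFrame({'word': structures,
                          'index': [word_to_index[w] for w in structures]})
funct_df = pd.DataFrame({'word': functions,
                         'index': [word_to_index[w] for w in functions]})

# 4. Per-document counts, with each word's vocabulary index attached.
counts['word_id'] = counts['word'].map(word_to_index)
```

Keeping the counts in long format (one row per document-word pair) avoids materialising a mostly empty document-by-word matrix.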
  
The notebook requires that the Python classes StructureInfo, PubmedReader and AbstractFormatter are in your working directory. Additionally, it requires the file SubcorticalStructures.xlsx (containing all information on the structures), the file function_words.txt (one function term per line) and, as mentioned before, a folder where the abstracts are saved per structure. Finally, the packages used can be seen in the cell below.

Python 2.7.12 was used to test the notebook.

## Imports

In [1]:
# import classes
from StructureInfo import *
from PubmedReader import *
from AbstractFormatter import *
# import external libraries
from os import listdir
from os.path import isfile, join
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import nltk
import pandas as pd
import operator
import string
import numpy as np

## Initialization

In [2]:
# mode number -1 = only the abstract texts
mode_number = -1
# template that is used to query PubMed
query_template = '\"{}\"' + ' AND (fMRI OR PET OR electrophysiology) AND \"humans\"[MeSH Terms]'
# local location where all abstract information is stored (.txt file per structure)
location = '/home/michiel/Desktop/Abstracts/'
# number of structures extracted from the list of 33 nuclei in file SubcorticalStructures.xlsx
n_structures = 24
# boolean indicating whether a list of cognitive function words is used to filter results
word_list = False
# obtain structures
info = StructureInfo()
structures = info.structures[0:n_structures]
# read and process function words (one per line; blank lines are skipped)
with open('function_words.txt', 'r') as fh:
    function_words = [w.strip().lower() for w in fh if w.strip()]
# obtain list of stopwords from NLTK package
stopword_list = [str(w) for w in stopwords.words('english')]
# define extra stopwords to ignore
extra_stopwords = ['objective', 'setting', 'results', 'conclusions', 'case', 'presentation', 'methods', 
                   'purpose', 'discussions', 'object', 'aim']
# combine to create one list of stopwords
stopword_list += extra_stopwords
# word count cutoff point
mininal_count = 1
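
To make the query template concrete, the sketch below fills it in for a single structure. It assumes that `PubmedReader` substitutes the structure name with `str.format`, which is not shown in this notebook; the structure name is only an example.

```python
# Same template as in the cell above; '{}' is the slot for the structure name.
query_template = '"{}"' + ' AND (fMRI OR PET OR electrophysiology) AND "humans"[MeSH Terms]'

# 'subthalamic nucleus' is an illustrative structure name.
query = query_template.format('subthalamic nucleus')
print(query)
# "subthalamic nucleus" AND (fMRI OR PET OR electrophysiology) AND "humans"[MeSH Terms]
```

The quotes around the structure name request an exact phrase match, so multi-word names are not split into separate search terms.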

## Functions

In [3]:
def tokenize(text):
    # ignore punctuation except underscore
    punct = string.punctuation.replace('_', '')
    text = "".join([ch for ch in text if ch not in punct])
    tokens = nltk.word_tokenize(text)
    return tokens

def preprocess(input_abstract, structures, function_words):
    # make sure structures are lower case
    structures = [s.lower() for s in structures]
    # make abstracts lowercase
    abstract = input_abstract.lower()  
    # replace structures by underscore counterpart (bi/tri-grams)
    for struct in structures:
        if struct in abstract:
            abstract = abstract.replace(struct, struct.replace(' ', '_'))
    # replace function words by underscore counterpart (bi/tri-grams)
    for function in function_words:
        if function in abstract:
            abstract = abstract.replace(function, function.replace(' ', '_'))
    abstract = tokenize(abstract)
    return " ".join(abstract)
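
One caveat of plain `str.replace` is that it also matches inside longer words: with the term `red nucleus`, the text `altered nucleus accumbens` would become `altered_nucleus accumbens`. The sketch below is a possible alternative, not the method used in this notebook. It matches whole words only and handles longer terms first; the function name `underscore_ngrams` and the example terms are hypothetical.

```python
import re

def underscore_ngrams(text, phrases):
    """Join each multi-word phrase found in `text` with underscores.

    Only whole-word matches are replaced, and longer phrases are
    processed before shorter ones.
    """
    text = text.lower()
    # Single words need no replacement; sort the rest longest first.
    ordered = sorted((p.lower() for p in phrases if ' ' in p), key=len, reverse=True)
    for phrase in ordered:
        pattern = r'\b' + re.escape(phrase) + r'\b'
        replacement = phrase.replace(' ', '_')
        # A function replacement avoids backslash interpretation in the result.
        text = re.sub(pattern, lambda match, r=replacement: r, text)
    return text

example = underscore_ngrams(
    'Altered nucleus accumbens activity near the red nucleus',
    ['red nucleus', 'nucleus accumbens'])
print(example)
# altered nucleus_accumbens activity near the red_nucleus
```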

## Obtain unique list of PMIDs from query

In [4]:
# create reader object and obtain PMIDs of papers per structure
reader = PubmedReader(query_template, structures, location)
all_ids = reader.pmids

# process and print ID list
print 'Initial ID-list length (nested): ' + str(len(all_ids))
all_ids = [item for sublist in all_ids for item in sublist]
print 'Flattened ID-list length: ' + str(len(all_ids))
all_ids = list(set(all_ids))
print 'Unique ID-list length: ' + str(len(all_ids))

# map indices (0 to number of IDs - 1) to PMIDs (as strings)
keys = range(0, len(all_ids))
values = all_ids
values = [str(val) for val in values]
id_indices = dict(zip(keys, values))

Please wait while PubMed is searched and the abstracts are fetched...
Saved abstracts in /home/michiel/Desktop/Abstracts/
Initial ID-list length (nested): 24
Flattened ID-list length: 28639
Unique ID-list length: 22020
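
The drop from the flattened to the unique list length presumably reflects papers that were returned for more than one structure. Note that `list(set(...))` gives no ordering guarantee, so the index-to-PMID mapping can differ between runs. A minimal sketch of an order-preserving variant, using made-up PMIDs:

```python
# Hypothetical nested result: one PMID list per structure query.
nested_ids = [['101', '102', '103'], ['102', '104'], ['101', '105']]

# Flatten the per-structure lists into one list (duplicates included).
flat_ids = [pmid for sublist in nested_ids for pmid in sublist]

# Drop duplicates while keeping first-seen order, so the result is
# the same on every run.
seen = set()
unique_ids = []
for pmid in flat_ids:
    if pmid not in seen:
        seen.add(pmid)
        unique_ids.append(pmid)

# Index -> PMID mapping, mirroring id_indices in the cell above.
id_indices = dict(enumerate(unique_ids))
```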


## Obtain PMIDs, titles and abstracts for every structure

In [5]:
struct_order_ids = []
struct_order_titles = []
struct_order_abstracts = []
tokens = []

# obtain IDs, titles and abstracts for every structure
for struct in structures:
    # create formatter object with structure
    formatter = AbstractFormatter(location, struct)
    # obtain ID list, title and abstract
    id_list = formatter.get_pmids()
    id_list = [id_entry.split(' ')[1] for id_entry in id_list] # Initial format: PMID: 12345678
    title_list = formatter.get_titles()
    abstract_list = formatter.get_abstracts(mode_number, location)
    # append to list of all IDs, all titles and all abstracts
    struct_order_ids += id_list
    struct_order_titles += title_list
    struct_order_abstracts += abstract_list

## Obtain titles and abstracts in the order of unique ID list (in DataFrame)

In [6]:
unique_ID_order_abstracts = []
unique_ID_order_titles = []

# loop unique IDs, get associated abstract and title
for val in all_ids:
    if val in struct_order_ids:
        idx = struct_order_ids.index(val)
        unique_ID_order_abstracts.append(struct_order_abstracts[idx])
        unique_ID_order_titles.append(struct_order_titles[idx])
    else:
        unique_ID_order_abstracts.append("Missing")
        unique_ID_order_titles.append("Missing")

# save initial abstracts with IDs in DataFrame
initial_abstracts_ID_df = pd.DataFrame({'ID': all_ids, 'Abstract': unique_ID_order_abstracts, 'Title': unique_ID_order_titles})
initial_abstracts_ID_df.to_pickle('initial_abstract_id_map')
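
The loop above calls `list.index` once per unique PMID, which scans the full list each time. With tens of thousands of IDs, a dictionary lookup is a cheaper option. The sketch below is one way to do this, assuming the same parallel-list layout; the sample values are made up.

```python
# Hypothetical parallel lists, in structure order (duplicates possible).
struct_order_ids = ['101', '102', '101', '103']
struct_order_abstracts = ['abstract A', 'abstract B', 'abstract A (dup)', 'abstract C']
all_ids = ['103', '101', '999']   # '999' has no stored abstract

# Build PMID -> position once; setdefault keeps the FIRST position,
# which matches what list.index() would return.
first_position = {}
for pos, pmid in enumerate(struct_order_ids):
    first_position.setdefault(pmid, pos)

# Constant-time lookup per PMID instead of scanning the list each time.
ordered_abstracts = [
    struct_order_abstracts[first_position[pmid]] if pmid in first_position else 'Missing'
    for pmid in all_ids
]
```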

## Process (unique) abstracts and save in file and DataFrame

In [7]:
processed_abstracts = []
processed_titles = []

# open file that will contain all abstracts and titles
f = open('abstract_corpus.txt', 'w+')
    
# iterate over all unique IDs, obtain corresponding information and write it to file
for val in all_ids:
    # obtain title and abstract of ID (if we have it)
    if val in struct_order_ids:
        idx = struct_order_ids.index(val)
        title = struct_order_titles[idx]
        abstr = struct_order_abstracts[idx]
        # replace function words and structure n-grams by underscore counterpart
        abstr = preprocess(abstr, structures, function_words)
        title = preprocess(title, structures, function_words)
        processed_abstracts.append(abstr)
        processed_titles.append(title)
        # write title and abstract to file
        f.write(title)
        f.write('\n\n')
        f.write(abstr)
        f.write('\n\n\n') # 3 newline characters separate each abstract-title combination
    else:
        print val + ' not found.'
        processed_abstracts.append("Missing")
        processed_titles.append("Missing")
f.close()

# save processed abstracts and titles in DataFrame
abstracts_ID_df = pd.DataFrame({'ID': all_ids, 'Abstract': processed_abstracts, 'Title': processed_titles})
abstracts_ID_df.to_pickle('processed_abstract_id_map')

27453757 not found.
27432102 not found.
28153848 not found.
27250053 not found.
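
The file layout written above (title, two newlines, abstract, three newlines) can be read back record by record. This is a minimal sketch, assuming that titles and abstracts are non-empty and contain no newlines themselves; `corpus_text` stands in for the file contents and is made up for illustration.

```python
# In-memory stand-in for abstract_corpus.txt with two hypothetical records.
corpus_text = ('title one\n\nabstract one text\n\n\n'
               'title two\n\nabstract two text\n\n\n')

records = []
# Records are separated by three newlines; title and abstract by two.
for block in corpus_text.split('\n\n\n'):
    if not block.strip():
        continue  # skip the empty chunk after the final separator
    title, abstract = block.split('\n\n', 1)
    records.append((title, abstract))
```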


## Create DataFrame with all words and their occurrence count

In [8]:
# open the abstracts and count word frequencies
with open('abstract_corpus.txt') as f:
    freqs = Counter(f.read().split())

# remove stopwords from word index
for sw in stopword_list:
    del freqs[sw]
    
# sort vocabulary    
sorted_x = sorted(freqs.items(), key=operator.itemgetter(1))
sorted_x.reverse()

# make DataFrame and rename columns
all_word_occurences = pd.DataFrame(sorted_x)
all_word_occurences.columns = ['word', 'count']
# remove entries that have too few occurrences
all_word_occurences = all_word_occurences[all_word_occurences['count'] >= mininal_count]

# save vocabulary in DataFrame
all_word_occurences.to_pickle('all_word_counts')
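
The same vocabulary table can be built a little more directly with `Counter.most_common()`, which already returns the pairs sorted by descending count (ties may come out in a different order than with sort-then-reverse). A toy sketch; the text, stopword list and cutoff are illustrative values:

```python
from collections import Counter
import pandas as pd

text = 'the amygdala and the cortex the amygdala'   # toy corpus
stopword_list = ['the', 'and']                       # toy stopword list
minimal_count = 2                                    # illustrative cutoff

freqs = Counter(text.split())
for sw in stopword_list:
    del freqs[sw]   # Counter does not raise for keys that are absent

# most_common() returns (word, count) pairs, highest count first.
word_counts = pd.DataFrame(freqs.most_common(), columns=['word', 'count'])
word_counts = word_counts[word_counts['count'] >= minimal_count]
```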

## Create DataFrame with counts for each document

In [9]:
# read in DataFrame with abstracts, PMIDs and titles
abstr_df = pd.read_pickle('processed_abstract_id_map')
# remove invalid entries
abstr_df = abstr_df[abstr_df['Abstract'] != '1']
abstr_df = abstr_df[abstr_df['Abstract'] != 'Missing']
# reset indices
abstr_df.index = range(0, len(abstr_df))
# only select PMIDs and abstracts (not titles)
abstr_df = abstr_df[['Abstract', 'ID']]

# split abstracts into DataFrames per abstract where the words are on individual rows
word_dataframes = []
for idx, row in abstr_df.iterrows():
    tmp = pd.DataFrame({'word':row['Abstract'].split()})
    tmp['ID'] = row['ID']
    word_dataframes.append(tmp)

# concatenate the DataFrames together    
df = pd.concat(word_dataframes, ignore_index=True)
# count how often each word occurs in each document (PMID)
df = df.groupby(['ID', 'word']).size().reset_index()
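
Building one small DataFrame per abstract and concatenating them can be slow for a corpus of this size. A lighter option, sketched here under the assumption that only the per-document counts are needed, is to count with `Counter` first and create the table once at the end. The documents are made up.

```python
from collections import Counter

# Hypothetical (PMID, preprocessed abstract) pairs.
documents = [('111', 'amygdala response amygdala'),
             ('222', 'cortex response')]

rows = []
for pmid, abstract in documents:
    # Count words within this document only.
    for word, count in sorted(Counter(abstract.split()).items()):
        rows.append((pmid, word, count))
```

The resulting `rows` list can then be passed to `pd.DataFrame(rows, columns=['ID', 'word', 'count'])` in a single step.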

## Create vocabulary DataFrame and reference structures and function words 

In [10]:
# obtain list of all words and list of unique words
all_words = list(df['word'])
unique_words = df.word.unique()

# print this information and create vocabulary
print 'All words: ' + str(len(all_words))
print 'Individual words: ' + str(len(unique_words))
vocab = pd.DataFrame({'word': unique_words})

struct_indices = []
function_indices = []
structures = [s.lower().replace(' ', '_') for s in structures]

# obtain indices of structures and function words
for struct in structures:
    struct_indices.append(vocab[vocab['word'] == struct].index.tolist()[0])
for funct in function_words:
    function_indices.append(vocab[vocab['word'] == funct].index.tolist()[0])
    
# create DataFrames for structure and function word indices
struct_df = pd.DataFrame({'word': structures, 'index': struct_indices})
funct_df = pd.DataFrame({'word': function_words, 'index': function_indices})

# change some labels, add vocab indices of words to the document counts DataFrame and filter stopwords
df = df.rename(columns = {0:'count'})
vocab = vocab.reset_index().set_index('word')
df['word_id'] = df.word.map(vocab['index'].to_dict())
df = df[~np.in1d(df.word, stopword_list)]

# save DataFrames in your working directory in pickle format
df.to_pickle('counts_per_document')
vocab.to_pickle('vocabulary')
funct_df.to_pickle('function_indices')
struct_df.to_pickle('structure_indices')

# df[['ID', 'word_id', 'count']].to_csv('doc_terms.csv', index=False)
# vocab[['word', 'index']].to_csv('vocab.csv', index=False)
# funct_df[['word', 'index']].to_csv('function_index.csv', index=False)
# struct_df[['word', 'index']].to_csv('structure_index.csv', index=False)


All words: 2499342
Individual words: 71480
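
The lookup `vocab[vocab['word'] == struct].index.tolist()[0]` raises an `IndexError` when a term never occurs in the corpus. The same happens for multi-word function terms that are still space-separated, since the vocabulary only contains their underscore form. A defensive sketch with hypothetical terms, which collects missing terms instead of failing:

```python
# Hypothetical vocabulary and term list; 'working memory' is still
# space-separated, 'attention' does not occur in the vocabulary at all.
vocab_words = ['amygdala', 'working_memory', 'response']
terms = ['amygdala', 'working memory', 'attention']

word_to_index = {word: i for i, word in enumerate(vocab_words)}

found, missing = {}, []
for term in terms:
    key = term.lower().replace(' ', '_')   # same normalisation as for structures
    if key in word_to_index:
        found[key] = word_to_index[key]
    else:
        missing.append(term)
```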


In [11]:
df

          ID                 word  count  word_id
1   10022492                 also      1        1
5   10022492              arising      1        5
7   10022492                basis      1        7
9   10022492                brain      1        9
10  10022492              broader      1       10
11  10022492             capacity      7       11
12  10022492  capacityconstrained      3       12
13  10022492       characteristic      1       13
14  10022492      characteristics      2       14
15  10022492            cognitive      1       15


### Extra Code

In [17]:
# NOTE: NONE OF THE SYNONYMS SEEM TO BE IN THERE   

# process synonyms
synonyms = []
syns = [syn.lower() for syn in info.synonyms]
for syn in syns:
    if syn != 'none':
        for elem in syn.split(','):
            # strip whitespace left over from ', ' separators
            synonyms.append(elem.strip())

In [21]:
# NOTE: ALTERNATIVE WAY TO OBTAIN FREQUENCIES

import collections

with open('abstract_corpus.txt') as f:
    c = collections.Counter(f.read().split())

print "'Four' appears %d times" % c['four']
print "'subthalamic_nucleus' appears %d times" % c['subthalamic_nucleus']
print "There are %d total words" % sum(c.values())
print "The 30 most common words are", c.most_common(30)

'Four' appears 1602 times
'subthalamic_nucleus' appears 0 times
There are 4655788 total words
The 30 most common words are [('the', 245840), ('of', 175384), ('and', 174981), ('in', 163676), ('to', 81106), ('with', 78024), ('a', 68997), ('was', 36921), ('patients', 35713), ('were', 35611), ('for', 30929), ('that', 30211), ('brain', 28244), ('is', 25266), ('on', 21488), ('by', 20972), ('we', 20434), ('as', 19962), ('imaging', 18046), ('this', 17856), ('cortex', 15174), ('between', 14927), ('an', 14524), ('functional', 14425), ('study', 14350), ('amygdala', 14173), ('these', 14028), ('results', 13559), ('from', 13505), ('or', 13488)]
