# Coreferencing the Texts

This notebook conducts Step 1 of the IEP: each text is passed through a spaCy pipeline with the coreference-resolution package 'neuralcoref' attached. The pipeline detects entities and the pronouns that refer to them, and each pronoun is then replaced with the name of the entity it refers to. A conversion dictionary has been added to help the package handle phrases and abbreviations that are common in the psychedelic texts.

### Dependencies

In [1]:
# --- basic imports ---
import spacy
import neuralcoref
import pandas as pd

In [2]:
# --- Create a conversion dictionary ---
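# Per the neuralcoref README, each key is a rare word found in the text and the value lists
# common words whose averaged embedding is used in its place; the text itself is not rewritten.
# The README also notes that only single-word keys are supported, so multi-word keys may have no effect.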
conv_dict = {'drug': ['psychedelic'],
             'AYA': ['ayahuasca'],
             'treatment-resistant depression': ['depression'],
             'treatment resistant depression': ['depression'],
             'major depressive disorder': ['depression'],
             'MDD': ['depression'],
             'Default mode network': ['DMN'],
             'psychedelic microdosing': ['microdosing'],
             'microdosing psychedelics': ['microdosing']}
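
As a rough illustration of what a conversion dictionary does, the sketch below averages the embeddings of the common words listed for a rare word. The vectors are made up, the second value for 'AYA' is a hypothetical addition so that there is something to average, and `average_embedding` is an illustrative helper rather than a neuralcoref function.

```python
# Toy illustration of the idea behind conv_dict: a rare word borrows the
# averaged embedding of more common words. The vectors are invented and
# average_embedding is a hypothetical helper, not part of neuralcoref.

toy_vectors = {
    "ayahuasca": [0.5, 1.0, 0.0],
    "brew": [1.5, 0.0, 1.0],
}

def average_embedding(words, vectors):
    """Return the element-wise mean of the vectors for `words`."""
    rows = [vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]

# Hypothetical entry: 'brew' is added here only to show the averaging step.
toy_conv_dict = {"AYA": ["ayahuasca", "brew"]}

substitute = {rare: average_embedding(common, toy_vectors)
              for rare, common in toy_conv_dict.items()}
print(substitute["AYA"])
```

With a single value per key, as in the dictionary above, the average is simply the embedding of that one word.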

In [3]:
# --- setup the pipeline --- 
nlp = spacy.load('en_core_web_sm') # spacy pipeline 
neuralcoref.add_to_pipe(nlp, conv_dict=conv_dict)

<spacy.lang.en.English at 0x7f193d7f1978>

### Import the data 

In [4]:
data_path = '../data/text_data/'

In [5]:
#full cohort of data 
full_texts = pd.read_csv(data_path + "Final_cohort_texts.csv", index_col=0)

### Inspect the coreferencer 

In [6]:
# We'll split away one text to inspect how the coreferencer works 
one_article = full_texts['Discussion']
one_article = one_article[5]
one_article

"The fMRI studies reported here revealed significant and consistent outcomes. Psilocybin significantly decreased brain blood flow and venous oxygenation in a manner that correlated with its subjective effects, and significantly decreased the positive coupling of two key structural hubs (the mPFC and the PCC). Our use of fMRI to measure resting-state brain activity after a psychedelic is unique, and because the results are unexpected, they require some explanation. The effect of psilocybin on resting-state brain activity has been measured before with PET and glucose metabolism (8). This study found a global increase in glucose metabolism after oral psilocybin, which is inconsistent with our fMRI results. One possible explanation for this discrepancy relates to the fact that the radiotracer used to measure glucose metabolism (18F-fluorodeoxyglucose) has a long half-life (110 min). Thus, the effects of psilocybin, as measured by PET, are over much greater timescales than indexed by our fM

In [7]:
# Run the article through the pipeline 
article_nlp = nlp(one_article)

In [8]:
#Check if there was anything resolved - if True, it means something has been resolved 
article_nlp._.has_coref

True

In [9]:
# Inspect the clusters found 
article_nlp._.coref_clusters

[The fMRI studies reported here: [The fMRI studies reported here, This study],
 venous oxygenation: [venous oxygenation, its],
 the results: [the results, they],
 the 5-HT2A receptor: [the 5-HT2A receptor, their],
 the 5-HT2A receptor: [the 5-HT2A receptor, the 5-HT2A receptor, it],
 these effects: [these effects, them],
 the PCC: [the PCC, its, the PCC, its, it, the PCC],
 The high metabolic activity of the PCC and the default-mode network (DMN) with which is it associated (26): [The high metabolic activity of the PCC and the default-mode network (DMN) with which is it associated (26), its],
 the DMN: [the DMN, the DMN],
 DMN regions: [DMN regions, them],
 “connector hubs” (32): [“connector hubs” (32), These hubs],
 the brain: [the brain, the brain],
 such an integrative function: [such an integrative function, their],
 The pharmaco-physiological interaction results: [The pharmaco-physiological interaction results, these results],
 the mPFC: [the mPFC, The mPFC, the mPFC],
 Our result

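_This looks promising: the coreferencer resolves some of the references, even those involving scientific terms. It isn't perfect, as the model is geared more towards entities such as people, referred to with 'he' and 'she'. For example, the 'its' in 'correlated with its subjective effects' is linked to 'venous oxygenation' rather than to psilocybin. Even so, it improves parts of the text, which should help when extracting information._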
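
To see which of the clusters would actually change the text, one option is to separate clusters that contain a pronoun from those that only repeat the same noun phrase. The sketch below does this for a few clusters copied from the output above; the pronoun list and the `pronoun_mentions` helper are assumptions made for illustration.

```python
# A few clusters copied from the printed output (main mention -> mentions).
clusters = {
    "venous oxygenation": ["venous oxygenation", "its"],
    "the results": ["the results", "they"],
    "these effects": ["these effects", "them"],
    "the DMN": ["the DMN", "the DMN"],
    "the brain": ["the brain", "the brain"],
}

# A small hand-picked pronoun list; an assumption for this sketch only.
PRONOUNS = {"it", "its", "they", "them", "their"}

def pronoun_mentions(mentions):
    """Return the mentions that are bare pronouns."""
    return [m for m in mentions if m.lower() in PRONOUNS]

for main, mentions in clusters.items():
    found = pronoun_mentions(mentions)
    label = "replaces " + ", ".join(found) if found else "repeats only, no change"
    print(f"{main}: {label}")
```

Clusters such as 'the DMN' only link repeated mentions of the same phrase, so resolving them leaves the text unchanged.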

In [10]:
# We can also inspect where each mention sits in the document (token indices, not character offsets)
for cluster in article_nlp._.coref_clusters:
    for reference in cluster:
        # each of these is a spaCy Span object
        print(reference)
        # index of the first token of this reference
        print(reference.start)
        # index one past the last token of this reference (exclusive)
        print(reference.end)

The fMRI studies reported here
0
5
This study
100
102
venous oxygenation
18
20
its
26
27
the results
68
70
they
73
74
the 5-HT2A receptor
374
377
their
379
380
the 5-HT2A receptor
502
505
the 5-HT2A receptor
523
526
it
533
534
these effects
576
578
them
596
597
the PCC
640
642
its
658
659
the PCC
678
680
its
681
682
it
693
694
the PCC
705
707
The high metabolic activity of the PCC and the default-mode network (DMN) with which is it associated (26)
700
724
its
730
731
the DMN
767
769
the DMN
906
908
DMN regions
799
801
them
819
820
“connector hubs” (32)
821
828
These hubs
829
831
the brain
839
841
the brain
901
903
such an integrative function
859
863
their
876
877
The pharmaco-physiological interaction results
968
974
these results
1122
1124
the mPFC
1205
1207
The mPFC
1223
1225
the mPFC
1293
1295
Our results
1360
1362
The results
1497
1499
decreased CBF
1396
1398
Increased hypothalamic CBF
1425
1428
CBF
1427
1428
the hypothalamus
1399
1401
the hypothalamus
1443
1445
the brain
1494
149
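
The numbers printed above count tokens, with the end index exclusive, so 'venous oxygenation' at 18 to 20 covers tokens 18 and 19. The sketch below shows how a token span maps to character offsets. It splits on single spaces for simplicity, which is an assumption (spaCy's tokenizer also splits off punctuation), and `token_span_to_chars` is a hypothetical helper.

```python
text = "Psilocybin decreased blood flow and venous oxygenation"
tokens = text.split(" ")  # simple whitespace split, not spaCy's tokenizer

def token_span_to_chars(tokens, start, end):
    """Map the token span [start, end) to character offsets in the text."""
    # Every token before the span contributes its length plus one space.
    start_char = sum(len(t) + 1 for t in tokens[:start])
    end_char = start_char + len(" ".join(tokens[start:end]))
    return start_char, end_char

start, end = 5, 7  # tokens 5 and 6; the end index is exclusive
start_char, end_char = token_span_to_chars(tokens, start, end)
print(tokens[start:end])
print(text[start_char:end_char])
```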

### Function to conduct coreferencing

In [12]:
def coref_resolution(text):
    """Replace each coreferent mention in `text` with its cluster's main mention."""
    # Run the full text through the spaCy pipeline
    doc = nlp(text)
    # Print whether a coreference has been found or not (True or False)
    print(doc._.has_coref)
    # Print the coreferences found
    #print(doc._.coref_clusters)
    # Fetch the tokens, with their trailing whitespace, from the spaCy document
    tok_list = [token.text_with_ws for token in doc]
    # Loop through each cluster found
    for cluster in doc._.coref_clusters:
        # Words of the cluster's main (representative) mention
        cluster_main_words = set(cluster.main.text.split(' '))
        # For every reference in the cluster
        for coref in cluster:
            # Words of this mention
            coref_words = set(coref.text.split(' '))
            # Replace only mentions that are not the main mention and share no word
            # with it, so repeats and near-repeats of the main mention are left alone
            if coref != cluster.main and not (coref_words & cluster_main_words):
                # Put the main mention on the first token, keeping the trailing whitespace
                tok_list[coref.start] = cluster.main.text + doc[coref.end - 1].whitespace_
                # Blank out the remaining tokens of the replaced mention
                for i in range(coref.start + 1, coref.end):
                    tok_list[i] = ""
    # Join the tokens back into one string
    return "".join(tok_list)
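
The replacement logic can be tried without loading a model. The sketch below mirrors the token-list approach on hand-written tokens and a hand-written cluster; `resolve`, the example sentence and the `(start, end)` tuples standing in for spaCy spans are all assumptions made for illustration.

```python
def resolve(tokens, clusters):
    """Replace mentions with their cluster's main mention.

    tokens:   list of strings, each including its trailing whitespace.
    clusters: list of (main_span, mention_spans), where a span is a
              (start, end) token range with an exclusive end.
    """
    out = list(tokens)

    def span_text(span):
        return "".join(tokens[span[0]:span[1]]).strip()

    for main, mentions in clusters:
        main_text = span_text(main)
        main_words = set(main_text.split(" "))
        for start, end in mentions:
            mention_words = set(span_text((start, end)).split(" "))
            # Skip the main mention and any mention sharing a word with it.
            if (start, end) == main or mention_words & main_words:
                continue
            # Keep the whitespace that followed the mention's last token.
            last = tokens[end - 1]
            trailing = last[len(last.rstrip()):]
            out[start] = main_text + trailing
            for i in range(start + 1, end):
                out[i] = ""
    return "".join(out)

tokens = ["The ", "results ", "are ", "unexpected", ", ", "so ", "they ",
          "require ", "explanation", ". ", "These ", "results ", "matter", "."]
# One cluster: main mention "The results", plus "they" and "These results".
clusters = [((0, 2), [(0, 2), (6, 7), (10, 12)])]
print(resolve(tokens, clusters))
```

Here 'they' is replaced, while 'These results' is kept because it shares the word 'results' with the main mention. The main mention is inserted verbatim, so its capitalisation carries over into the middle of a sentence.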


## Apply the Coreferencer through the Discussion Column   

In [13]:
coreferenced_text = []

for article in full_texts['Discussion']:
    coreferenced_article = coref_resolution(article)
    coreferenced_text.append(coreferenced_article)

True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True


In [17]:
# --- Check type of the column --- 
type(coreferenced_text)

list

#### Append the resolved column onto the text 

In [18]:
full_texts['Coreferenced_Discussion'] = coreferenced_text
full_texts[:3]

Rank,DOI,Discussion,Conclusions,Coreferenced_Discussion
1,10.1177/0269881116675513,The present study demonstrated the efficacy of...,When administered under psychologically suppor...,The present study demonstrated the efficacy of...
2,10.1001/archgenpsychiatry.2010.116,The initial goals of this research project wer...,,The initial goals of this research project wer...
3,10.1124/pr.115.011478,"Dr. Albert Hofmann, the natural products chemi...",,"Dr. Albert Hofmann, the natural products chemi..."


## Coreference the Conclusions column 

#### Convert Conclusions column to string 

In [19]:
full_texts['Conclusions'] = full_texts['Conclusions'].astype(str)
print(full_texts.dtypes)

DOI                        object
Discussion                 object
Conclusions                object
Coreferenced_Discussion    object
dtype: object
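
One thing to watch when converting to string: pandas reads empty CSV cells as NaN by default, and a plain string conversion of NaN typically gives the literal text 'nan', which would then be passed to the coreferencer as if it were a conclusion. The sketch below shows the effect in plain Python together with a hypothetical `to_text` helper that maps missing values to an empty string instead; the example sentence is made up.

```python
import math

def to_text(value):
    """Return '' for missing values (None or NaN), otherwise str(value)."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

missing = float("nan")
print(repr(str(missing)))       # plain conversion keeps a literal 'nan'
print(repr(to_text(missing)))   # the helper returns an empty string
print(repr(to_text("An example conclusion.")))
```

In pandas, a comparable effect could be had by filling missing values with an empty string before converting the column.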


In [20]:
coreferenced_conclusions = []

for article in full_texts['Conclusions']:
    coreferenced_article = coref_resolution(article)
    coreferenced_conclusions.append(coreferenced_article)

False
False
False
False
False
True
True
False
True
False
False
False
False
False
False
True
False
False
True
False
False
False
False
True
False
False
True
False
False
True
True
False
False
False
False
False
False
False
False
False
False
False
True
True
False
False
True
False
False
False
False
True
True
True
False
True
False
False
False
False
True
False
True
False
True
True
True
False
False
False
True
False
False
False
True
True
False
False
False
True
False
False
False
True


_Not every conclusion contained pronouns that could be resolved, which is unsurprising given that the conclusions are much shorter than the discussions._
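
Rather than reading a long column of printed flags, the True/False values could be collected and summarised. The sketch below assumes the flags have been gathered into a list; `summarise_flags` is a hypothetical helper and the flags shown are made up, not the notebook's output.

```python
def summarise_flags(flags):
    """Return (number resolved, total, share resolved) for a list of booleans."""
    total = len(flags)
    resolved = sum(1 for flag in flags if flag)
    share = resolved / total if total else 0.0
    return resolved, total, share

# Made-up flags for illustration; in the notebook these would be the
# doc._.has_coref values collected for each text.
flags = [False, False, True, True, False]
resolved, total, share = summarise_flags(flags)
print(f"{resolved}/{total} texts had at least one coreference ({share:.0%})")
```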

In [None]:
# Alternative (not run): use neuralcoref's built-in resolved text instead of coref_resolution
coreferenced_conclusions = []

for article in full_texts['Conclusions']:
    doc = nlp(article)
    #Check if there was anything resolved
    print(doc._.has_coref)
    #resolve the text within the document 
    resolved_doc = doc._.coref_resolved
    #Save it to the coreferenced text 
    coreferenced_conclusions.append(resolved_doc)

In [21]:
# --- check the length ---
len(coreferenced_conclusions)

84

In [23]:
full_texts['Coreferenced_Conclusions'] = coreferenced_conclusions
full_texts[:3]

Rank,DOI,Discussion,Conclusions,Coreferenced_Discussion,Coreferenced_Conclusions
1,10.1177/0269881116675513,The present study demonstrated the efficacy of...,When administered under psychologically suppor...,The present study demonstrated the efficacy of...,When administered under psychologically suppor...
2,10.1001/archgenpsychiatry.2010.116,The initial goals of this research project wer...,,The initial goals of this research project wer...,
3,10.1124/pr.115.011478,"Dr. Albert Hofmann, the natural products chemi...",,"Dr. Albert Hofmann, the natural products chemi...",


## Export the Cleaned data 

In [26]:
full_texts.to_csv(data_path + 'coreferenced_full_texts.csv', index=False)