# Getting started with NLP

This hands-on session introduces working with text using the spaCy library and applies it to calculate some basic metrics used for document similarity.

**NOTE:** If you are running this with Colab, you should make a copy for yourself. If you don't, you may lose any edits you make. To make a copy, select `File` (top-left) then `Save a Copy in Drive`. If you are not using Colab, you may need to install some prerequisites. Please see the instructions in the [GitHub repo](https://github.com/Glasgow-AI4BioMed/ismb2025tutorial).

## Getting Data

First, we'll download some data that we'll use later in this tutorial with the commands below:

In [None]:
!wget -O data.zip "https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EZaU9DTcAwdCpg07eGCIiqMBvGmYJdqfnfhP1ygcDkRkBg?download=1"
!unzip -qo data.zip

## Tokenization

The first thing we'll learn about is splitting text up into tokens. Tokens are similar in concept to words but also cover punctuation and other text elements.

We'll use the spaCy library, which is a commonly used package for processing text. We'll load it up with the standard English model in the code below:

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

Now let's get it to process some text. Applying the `nlp` object to some text parses it and provides a list of all the tokens. We can iterate through them and print them out:

In [None]:
text = "The quick brown fox jumps over the lazy dog."
doc = nlp(text)

for token in doc:
    print(token)

We can get more than just the text of the tokens. The approach above also gives us things like:

- The [part of speech](https://en.wikipedia.org/wiki/Part_of_speech) (accessed with `.pos_`) which tells you which tokens are nouns, adjectives, punctuation, etc. spaCy uses the [Universal Dependencies list of parts-of-speech](https://universaldependencies.org/u/pos/).
- The [lemma](https://en.wikipedia.org/wiki/Lemma_(morphology)) (accessed with `.lemma_`) which gives a canonical version of the word with any suffixes appropriately removed. This means that plurals are turned to singulars, verbs are de-conjugated, etc. This is useful when comparing words so that *run* and *runs* are treated as the same word.

Check out the code below:

In [None]:
for token in doc:
    print(token.text, token.pos_, token.lemma_)

spaCy also enables you to split up text into individual sentences. Often all the information that you want is in a single sentence and it can be easier to process. The sentences can be retrieved with `.sents`. Check out the code below that shows how to print out the individual sentences. You could iterate over each sentence to access the tokens inside each sentence too.

In [None]:
text = """
The quick brown fox jumps over the lazy dog. Dr. Smith went to Washington.
He arrived at 10 p.m. and started working immediately.
"""

doc = nlp(text)

for i, sentence in enumerate(doc.sents):
    print(f"{i}: {sentence.text.strip()}")

## Task 1

Now it is your turn to do some coding with the ideas you've just seen. Here's the task. There is a set of text files in the `data/getting_started` directory. You need to find the longest sentence (by token count) across all the documents. There should be one sentence with noticeably more tokens than the rest. If you get stuck, the answer is provided further down.

*Hints: You could use `os.listdir` to get the list of files in that directory and `len(sentence)` gives you the number of tokens in a sentence*

In [None]:
# Your code goes here


<details>
<summary>Click to see the answer</summary>

Here is the code for the task:

```python
import os

longest_sentence = []
for filename in os.listdir('data/getting_started'):
  with open(f'data/getting_started/{filename}') as f:
    text = f.read()

  doc = nlp(text)

  for sentence in doc.sents:
    if len(sentence) > len(longest_sentence):
      longest_sentence = sentence

longest_sentence
```

</details>


## Measuring similarity with tokens

Comparing the tokens between two text sources is a rudimentary but very powerful way to measure their similarity. Let's get the tokens for two sources below:

In [None]:
tokens1 = [ token.text for token in nlp("She sells seashells on the sea shore.") ]
tokens1

In [None]:
tokens2 = [ token.text for token in nlp("He buys seashells by the sea shore.") ]
tokens2

Now we can use Python sets to figure out the tokens that appear in both:

In [None]:
set(tokens1) & set(tokens2)

Note that the `&` is shorthand for using the `.intersection` function which in maths uses the $ \cap $ operator. The above could also be written as `set(tokens1).intersection(tokens2)`.

The overlapping tokens are interesting, but what we really want is the count of overlapping tokens:

In [None]:
len(set(tokens1) & set(tokens2))

The more tokens the two texts share, the more similar we consider them (assuming that more shared tokens means more similarity). But we'd like to take into account all the tokens, not just the shared ones.

Enter the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index). It is a similarity measure for comparing two sets of things, which in our case are tokens. It is defined by the equation $ \frac{|A \cap B|}{|A \cup B|} $ where $ A $ and $ B $ are the sets of tokens from the two documents and $ |\cdot| $ denotes the number of elements in a set.

We've already figured out the top part (numerator) of the Jaccard index equation, which is the number of overlapping tokens. Now we want the bottom part (denominator): the count of unique tokens that appear in either document. This can be achieved using the `|` operator, which is equivalent to `.union` and is the $ \cup $ operator in maths:

In [None]:
set(tokens1) | set(tokens2)

Now we can get both counts: the overlapping tokens seen in both sources (using `&`) and the full set of tokens from either source (using `|`).

Let's apply the equation for the Jaccard index:

In [None]:
intersection_count = len(set(tokens1) & set(tokens2))
union_count = len(set(tokens1) | set(tokens2))
jaccard = intersection_count / union_count
jaccard

This gives us a score where a higher value means that the two documents are more similar.
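
To avoid repeating the set arithmetic, the calculation above could be wrapped in a small helper. The sketch below is one possible version: the name `jaccard_index` and the choice to return 0.0 when both inputs are empty are assumptions made for illustration. The token lists are written out by hand (assuming the two example sentences are split into words plus a final full stop) so that the block runs without loading a model.

```python
def jaccard_index(tokens_a, tokens_b):
    """Return |A ∩ B| / |A ∪ B| for two iterables of tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Both inputs are empty, so the index is undefined.
        # Returning 0.0 is an assumed convention, not part of the definition.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hand-written token lists standing in for the spaCy output
example_tokens1 = ["She", "sells", "seashells", "on", "the", "sea", "shore", "."]
example_tokens2 = ["He", "buys", "seashells", "by", "the", "sea", "shore", "."]

score = jaccard_index(example_tokens1, example_tokens2)
print(round(score, 3))  # 5 shared tokens out of 11 unique tokens
```

Because the helper converts its inputs to sets, repeated tokens within a document are only counted once.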

Let's see what happens with identical documents:

In [None]:
tokens1 = [ token.text for token in nlp("She sells seashells on the sea shore.") ]
tokens2 = [ token.text for token in nlp("She sells seashells on the sea shore.") ]

When the token sets are the same, the Jaccard index gives a score of one:

In [None]:
len(set(tokens1) & set(tokens2)) / len(set(tokens1) | set(tokens2))

And with no shared tokens, the minimum score will be zero.

In [None]:
tokens1 = [ token.text for token in nlp("She sells seashells on the sea shore.") ]
tokens2 = [ token.text for token in nlp("Very different words") ]

len(set(tokens1) & set(tokens2)) / len(set(tokens1) | set(tokens2))

The Jaccard index using tokens is one simple but effective way to measure similarity between documents. There are other options for comparing sets of things (such as the [overlap coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient) and [Dice-Sørensen coefficient](https://en.wikipedia.org/wiki/Dice-S%C3%B8rensen_coefficient)), but we'll stick with the Jaccard index for this tutorial.
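
For comparison, the two alternatives mentioned above can be sketched in the same style. The function names are illustrative, the formulas follow the standard definitions (overlap: $ \frac{|A \cap B|}{\min(|A|, |B|)} $, Dice-Sørensen: $ \frac{2|A \cap B|}{|A| + |B|} $), and returning 0.0 for empty input is an assumed convention. The token lists are hand-written stand-ins rather than real spaCy output.

```python
def overlap_coefficient(tokens_a, tokens_b):
    """Shared tokens divided by the size of the smaller token set."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0  # assumed convention for empty input
    return len(set_a & set_b) / smaller


def dice_coefficient(tokens_a, tokens_b):
    """Twice the shared tokens divided by the sum of both set sizes."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    total = len(set_a) + len(set_b)
    if total == 0:
        return 0.0  # assumed convention for empty input
    return 2 * len(set_a & set_b) / total


# Hand-written token lists: a short text fully contained in a longer one
short_doc = ["sea", "shore"]
long_doc = ["She", "sells", "seashells", "on", "the", "sea", "shore", "."]

print(overlap_coefficient(short_doc, long_doc))
print(dice_coefficient(short_doc, long_doc))
```

The overlap coefficient reaches its maximum whenever one token set is entirely contained in the other, so it tends to favour short documents more than the Jaccard index or the Dice-Sørensen coefficient do.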

## Getting Documents

Now let's get our hands on some real biomedical text. The PubMed API can provide a large set of abstracts from published research papers. We'll use the [biopython](https://biopython.org/) library to access it. We need to install that first:

In [None]:
!pip install biopython

The PubMed API requires you to provide your email address. Please fill it in below:

In [None]:
from Bio import Entrez

# You need to fill in your email address to use their API
Entrez.email = ""
assert Entrez.email, "You must set Entrez.email to your email address as it is a requirement of API usage"


Now let's fetch a single PubMed article. You need to provide an identifier to request the article, e.g. [a PubMed ID of 31110280](https://pubmed.ncbi.nlm.nih.gov/31110280/). The `Entrez.efetch` function will then request the data from the specified database (in this case `pubmed`). We also pass two extra arguments to say that we want the abstract (`rettype`) returned as plain text (`retmode`).

In [None]:
# Specify the PubMed ID of the article you want
pmid = "31110280"  # Example PMID

# Fetch the record
with Entrez.efetch(db="pubmed", id=pmid, rettype="abstract", retmode="text") as handle:
    abstract = handle.read()

print(abstract)

That provides us with the full record. However, in the text mode, it is a little difficult to dig out the individual elements of the record (e.g. the title and abstract). Instead we could use the `xml` format and use `Entrez.read` to parse it for us.

With the code below, we can get the title and abstract of the article. There are many fields (e.g. journal title, publication dates, etc) that could also be extracted.

In [None]:
from Bio import Entrez

# Specify PubMed ID
pmid = "31110280"

# Fetch XML record
with Entrez.efetch(db="pubmed", id=pmid, retmode="xml") as handle:
    records = Entrez.read(handle)

# Get the article record
article = records['PubmedArticle'][0]['MedlineCitation']['Article']

# Extract the title
title = article['ArticleTitle']

# Extract the abstract
abstract_text = " ".join(article['Abstract']['AbstractText'])

# Print the results
print(f"Title: {title}\n")
print(f"Abstract: {abstract_text}")
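
Not every PubMed record includes an abstract, and in that case the `article['Abstract']` lookup above can fail with a `KeyError`. The sketch below shows one defensive way to pull out the two fields. The helper name `get_title_and_abstract` is hypothetical, and the two `fake_...` dictionaries are hand-built stand-ins that only mimic the nesting used above, so the block runs without calling the API.

```python
def get_title_and_abstract(records):
    """Return (title, abstract) for the first article in a parsed PubMed record."""
    article = records["PubmedArticle"][0]["MedlineCitation"]["Article"]
    title = str(article.get("ArticleTitle", ""))
    # Fall back to an empty list of sections when the Abstract field is missing
    sections = article.get("Abstract", {}).get("AbstractText", [])
    abstract = " ".join(str(section) for section in sections)
    return title, abstract


# Hypothetical stand-ins for parsed records (not real PubMed data)
fake_with_abstract = {"PubmedArticle": [{"MedlineCitation": {"Article": {
    "ArticleTitle": "An example title",
    "Abstract": {"AbstractText": ["First section.", "Second section."]},
}}}]}
fake_without_abstract = {"PubmedArticle": [{"MedlineCitation": {"Article": {
    "ArticleTitle": "A title only",
}}}]}

print(get_title_and_abstract(fake_with_abstract))
print(get_title_and_abstract(fake_without_abstract))
```

With a real response, `records` would be the object returned by `Entrez.read(handle)`, which supports the same dictionary-style access.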


## Task 2

Now, let's combine what you've learned about measuring document similarity with the Jaccard index and using the PubMed API to get a document. Your task is to find the document in the folder (`data/getting_started`) that has the highest similarity to a PubMed document (with pmid=38567765). You'll need to use the API to get the document (include the title and abstract as we did before) and get its tokens. Then iterate through the files in the `data/getting_started` directory, tokenize each one and calculate its Jaccard index against the PubMed document. Finally, find the file with the highest Jaccard index.

As before, the answer is below if you get stuck.

In [None]:
# Your code goes here


<details>
<summary>Click to see the answer</summary>

Here is the code for the task:

```python
# os is needed to list the files (imported here in case the Task 1 answer wasn't run)
import os

# Let's first get the text for the PubMed document
pmid = "38567765"

with Entrez.efetch(db="pubmed", id=pmid, retmode="xml") as handle:
    records = Entrez.read(handle)

article = records['PubmedArticle'][0]['MedlineCitation']['Article']

title = article['ArticleTitle']
abstract_text = " ".join(article['Abstract']['AbstractText'])

combined_text = f"{title}\n\n{abstract_text}\n"

# And tokenize the PubMed document
tokens1 = [ token.text for token in nlp(combined_text) ]

# Now we'll iterate through the files in data/getting_started
best_jaccard, best_file = -1, None
for filename in os.listdir('data/getting_started'):
  with open(f'data/getting_started/{filename}') as f:
    text = f.read()

  # Tokenize the text of the document
  tokens2 = [ token.text for token in nlp(text) ]

  # Calculate the jaccard index
  jaccard = len(set(tokens1) & set(tokens2)) / len(set(tokens1) | set(tokens2))
  
  # And keep track of the document with the highest
  if jaccard > best_jaccard:
    best_jaccard = jaccard
    best_file = filename

print(best_jaccard, best_file)
```

</details>


## End of Hands-on Session

And that brings us to the end of the session. You've learned about:

- Tokenization (and getting parts-of-speech, etc.)
- Sentence splitting
- Calculating text similarity using token overlap
- Getting a document from PubMed using the API

## Optional Extras

If you've got extra time, you could try some of the following ideas:

- How about other similarity measures such as [Overlap Coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient) or [Dice-Sørensen coefficient](https://en.wikipedia.org/wiki/Dice-S%C3%B8rensen_coefficient)?
- Try filtering the tokens to only compare nouns and verbs.
- Investigate retrieving multiple documents from the PubMed API.
- What other metadata can you get from the PubMed API for an article?