# Getting started with NLP

This hands-on session will provide an introduction to working with text using the spaCy library and applying it to calculate some basic metrics used for document similarity.

**NOTE:** If you are running this with Colab, you should make a copy for yourself. If you don't, you may lose any edits you make. To make a copy, select `File` (top-left) then `Save a Copy in Drive`. If you are not using Colab, you may need to install some prerequisites. Please see the instructions on the [GitHub repo](https://github.com/Glasgow-AI4BioMed/ismb2025tutorial).

## Tokenization

The first thing we'll learn about is splitting text up into tokens. These are similar in concept to words but also deal with punctuation and other factors.

We'll use the spaCy library which is a commonly used package for processing text. We'll load it up with the standard English model in the code below:

In [None]:
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

Now let's get it to process some text. Applying the `nlp` object to some text parses it and provides a list of all the tokens. We can iterate through them and print them out:

In [None]:
text = "The quick brown fox jumps over the lazy dog."
doc = nlp(text)

for token in doc:
    print(token)

We get more than just the tokens. The default approach above also gives us things like:

- The [part of speech](https://en.wikipedia.org/wiki/Part_of_speech) (accessed with `.pos_`) which tells you which tokens are nouns, adjectives, punctuation, etc. spaCy uses the [Universal Dependencies list of parts-of-speech](https://universaldependencies.org/u/pos/).
- The [lemma](https://en.wikipedia.org/wiki/Lemma_(morphology)) (accessed with `.lemma_`) which gives a canonical version of the word with any suffixes appropriately removed. This means that plurals are turned to singulars, verbs are de-conjugated, etc. This is useful when comparing words so that *run* and *runs* are treated as the same word.

Check out the code below:

In [None]:
for token in doc:
    print(token.text, token.pos_, token.lemma_)

spaCy also enables you to split up text into individual sentences. Often all the information that you want is in a single sentence and it can be easier to process. The sentences can be retrieved with `.sents`. Check out the code below that shows how to print out the individual sentences. You could iterate on each sentence to access the tokens inside each sentence too.

In [None]:
# Example paragraph
text = """
The quick brown fox jumps over the lazy dog. Dr. Smith went to Washington.
He arrived at 10 p.m. and started working immediately.
"""

# Process the text
doc = nlp(text)

# Print sentences
for i, sentence in enumerate(doc.sents):
    print(f"{i}: {sentence.text.strip()}")


## Task

Now it is your turn to do some coding with the ideas you've just seen. Here's the task. You've got a set of text files. You need to find the longest sentence (by token count) from all the documents. There should be one sentence with noticeably more tokens than the rest. If you get stuck, the answer is provided further down.

In [None]:
import os

longest_sentence = []
for filename in os.listdir('task1'):
  with open(f'task1/{filename}') as f:
    text = f.read()

  doc = nlp(text)

  for sentence in doc.sents:
    if len(sentence) > len(longest_sentence):
      longest_sentence = sentence

longest_sentence


<details>
<summary>Click to see the answer</summary>

Here is the hidden answer or hint!

</details>


## Measuring similarity with tokens

Comparing the tokens between two text sources is a rudimentary but very powerful way to measure their similarity. Let's get the tokens for two sources below:

In [None]:
tokens1 = [ token.text for token in nlp("She sells seashells on the sea shore.") ]
tokens1

In [None]:
tokens2 = [ token.text for token in nlp("He buys seashells by the sea shore.") ]
tokens2

Now we can use Python sets to figure out the tokens that appear in both:

In [None]:
set(tokens1) & set(tokens2)

Note that `&` is shorthand for the `.intersection` method, which corresponds to the $ \cap $ operator in maths. The above could also be written as `set(tokens1).intersection(tokens2)`.
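As a quick check of that equivalence, here is a minimal sketch using plain Python sets. The token lists are typed by hand and are assumed to match what the tokenizer returns for the two sentences, so no spaCy model is needed to run it. One practical difference: the `.intersection` method accepts any iterable, while `&` needs a set on both sides.

```python
# Hand-typed tokens, assumed to match the tokenizer's output for the two sentences
tokens_a = ["She", "sells", "seashells", "on", "the", "sea", "shore", "."]
tokens_b = ["He", "buys", "seashells", "by", "the", "sea", "shore", "."]

# Operator form: both operands must be sets
shared_operator = set(tokens_a) & set(tokens_b)

# Method form: the argument can be any iterable, such as a list
shared_method = set(tokens_a).intersection(tokens_b)

print(shared_operator == shared_method)  # True
print(sorted(shared_operator))  # ['.', 'sea', 'seashells', 'shore', 'the']
```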

The overlapping tokens are interesting, but what we really want is the count of overlapping tokens:

In [None]:
len(set(tokens1) & set(tokens2))

And we want to compare that with all the tokens seen across both text sources. This can be achieved with the `|` operator which is equivalent to `.union` and is the $ \cup $ operator in maths:

In [None]:
set(tokens1) | set(tokens2)

We now have both the tokens shared by the two sources (using `&`) and the full set of tokens seen in either source (using `|`). Dividing the size of the first set by the size of the second gives the Jaccard index: $ \frac{|A \cap B|}{|A \cup B|} $.

In [None]:
len(set(tokens1) & set(tokens2)) / len(set(tokens1) | set(tokens2))
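Since this expression is reused several times in the rest of the session, one option is to wrap it in a small helper. The sketch below is an illustration rather than part of the original notebook: `jaccard_index` is a hypothetical name, the token lists are typed by hand (assumed to match the tokenizer's output), and returning 0.0 for two empty inputs is an arbitrary choice made to avoid dividing by zero.

```python
def jaccard_index(tokens_a, tokens_b):
    """Return the size of the intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Both inputs are empty; returning 0.0 is a choice made for this sketch
        return 0.0
    return len(set_a & set_b) / len(union)


# Hand-typed tokens, assumed to match the tokenizer's output for the two sentences
tokens_a = ["She", "sells", "seashells", "on", "the", "sea", "shore", "."]
tokens_b = ["He", "buys", "seashells", "by", "the", "sea", "shore", "."]

score = jaccard_index(tokens_a, tokens_b)
# 5 shared tokens out of 11 distinct tokens
print(f"{score:.3f}")  # 0.455
```

Converting the inputs to sets inside the function means it accepts lists directly, and repeated tokens in a document are only counted once.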

This gives us a score where a higher value means that the two documents are more similar. Let's see what happens with equal documents:

In [None]:
tokens1 = [ token.text for token in nlp("She sells seashells on the sea shore.") ]
tokens2 = tokens1

When the token sets are the same, the Jaccard index gives a score of one:

In [None]:
len(set(tokens1) & set(tokens2)) / len(set(tokens1) | set(tokens2))

And with no shared tokens, the minimum score will be zero.

In [None]:
tokens1 = [ token.text for token in nlp("She sells seashells on the sea shore.") ]
tokens2 = [ token.text for token in nlp("Very different words") ]

len(set(tokens1) & set(tokens2)) / len(set(tokens1) | set(tokens2))

## Getting Documents

Now let's get our hands on some real biomedical text. The PubMed API can provide a large set of abstracts from published research papers. We'll use the [biopython](https://biopython.org/) library to access it. The PubMed API requires you to provide your email address. Please fill it in below:

In [None]:
from Bio import Entrez

# Always set your email address
Entrez.email = "jake.lever@glasgow.ac.uk"
assert Entrez.email, "You must put your email address in to Entrez.email as it is a requirement of API usage"


Now let's fetch a single PubMed article. You need to provide an identifier to request the article, e.g. [a PubMed ID of 31110280](https://pubmed.ncbi.nlm.nih.gov/31110280/). The `Entrez.efetch` function will then request the data from the specified database (in this case `pubmed`). We also specify that we want the abstract (`rettype`) returned as plain text (`retmode`).

In [None]:

# Specify the PubMed ID of the article you want
pmid = "31110280"  # Example PMID

# Fetch the record
with Entrez.efetch(db="pubmed", id=pmid, rettype="abstract", retmode="text") as handle:
    abstract = handle.read()

print(abstract)

That provides us with the full record. However, in the text mode, it is a little difficult to dig out the individual elements of the record (e.g. the title and abstract). Instead we could use the `xml` format and use `Entrez.read` to parse it for us.

With the code below, we can get the title and abstract of the article. There are many other fields (e.g. journal title, publication dates, etc.) that could also be extracted.

In [None]:
from Bio import Entrez

# Specify PubMed ID
pmid = "31110280"

# Fetch XML record
with Entrez.efetch(db="pubmed", id=pmid, retmode="xml") as handle:
    records = Entrez.read(handle)

# Get the article record
article = records['PubmedArticle'][0]['MedlineCitation']['Article']

# Extract the title
title = article['ArticleTitle']

# Extract the abstract
abstract_text = " ".join(article['Abstract']['AbstractText'])

# Print the results
print(f"Title: {title}\n")
print(f"Abstract: {abstract_text}")
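Not every PubMed record has an abstract, in which case `article['Abstract']` raises a `KeyError`. The sketch below shows one defensive way to pull out both fields, assuming the parsed record behaves like a nested dictionary, as the indexing above suggests. To keep it runnable without network access, it uses hand-built dictionaries as stand-ins for the parsed record; `extract_title_and_abstract` and the sample values are made up for illustration.

```python
def extract_title_and_abstract(article):
    """Return (title, abstract) from an 'Article' record, tolerating a missing abstract."""
    title = str(article.get("ArticleTitle", ""))
    # 'Abstract' may be absent, so fall back to an empty list of sections
    sections = article.get("Abstract", {}).get("AbstractText", [])
    abstract = " ".join(str(section) for section in sections)
    return title, abstract


# Hand-built stand-ins for the nested structure used above (not real PubMed data)
with_abstract = {
    "ArticleTitle": "An example title",
    "Abstract": {"AbstractText": ["First section.", "Second section."]},
}
without_abstract = {"ArticleTitle": "A title only"}

print(extract_title_and_abstract(with_abstract))
print(extract_title_and_abstract(without_abstract))
```

Returning an empty string for a missing abstract keeps later steps, such as tokenizing the combined text, from failing on records that only have a title.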


## Task

Now, let's combine what you've learned about document similarity with the Jaccard index and about using the PubMed API to get a document. Your task is to find the document in the folder that has the highest similarity to a PubMed document (with pmid=38567765). The similarity should use the Jaccard index and be based on text that contains the title and abstract together. As before, the answer is below if you get stuck.

In [None]:
from Bio import Entrez

# Specify PubMed ID
pmid = "38567765"

# Fetch XML record
with Entrez.efetch(db="pubmed", id=pmid, retmode="xml") as handle:
    records = Entrez.read(handle)

# Get the article record
article = records['PubmedArticle'][0]['MedlineCitation']['Article']

# Extract the title
title = article['ArticleTitle']

# Extract the abstract
abstract_text = " ".join(article['Abstract']['AbstractText'])

# Combine the title and abstract into a single text
combined_text = f"{title}\n\n{abstract_text}\n"

In [None]:
tokens1 = [ token.text for token in nlp(combined_text) ]

In [None]:
best_jaccard, best_file = -1, None
for filename in os.listdir('task1'):
  with open(f'task1/{filename}') as f:
    text = f.read()

  tokens2 = [ token.text for token in nlp(text) ]

  jaccard = len(set(tokens1) & set(tokens2)) / len(set(tokens1) | set(tokens2))

  if jaccard > best_jaccard:
    best_jaccard = jaccard
    best_file = filename

best_jaccard, best_file

## End of Hands-on Session

And that brings us to the end of the session. You've learned about:

- Tokenization (and getting parts-of-speech, etc)
- Sentence splitting
- Calculating text similarity using token overlap
- Getting a document from PubMed using the API

### Optional Extras

If you've got extra time, you could try some different approaches.

- How about other similarity measures such as [Overlap Coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient) or [Dice-Sørensen coefficient](https://en.wikipedia.org/wiki/Dice-S%C3%B8rensen_coefficient)?
- Try filtering the tokens to only compare nouns and verbs.
- Investigate retrieving multiple documents from the PubMed API
- What other metadata can you get from the PubMed API for an article?
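
As a starting point for the first suggestion, here is a minimal sketch of the two alternative measures using the same set operations as the Jaccard index. The function names are hypothetical, the token lists are typed by hand rather than produced by spaCy, and returning 0.0 for empty inputs is an arbitrary choice made to avoid dividing by zero.

```python
def overlap_coefficient(tokens_a, tokens_b):
    """Shared tokens divided by the size of the smaller token set."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0
    return len(set_a & set_b) / smaller


def dice_coefficient(tokens_a, tokens_b):
    """Twice the shared tokens divided by the sum of the two set sizes."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    total = len(set_a) + len(set_b)
    if total == 0:
        return 0.0
    return 2 * len(set_a & set_b) / total


# Hand-typed tokens: the second list is a subset of the first
tokens_a = ["She", "sells", "seashells", "on", "the", "sea", "shore", "."]
tokens_b = ["She", "sells", "seashells", "."]

print(f"{overlap_coefficient(tokens_a, tokens_b):.3f}")  # 1.000
print(f"{dice_coefficient(tokens_a, tokens_b):.3f}")  # 0.667
```

The example shows one way the measures differ: the overlap coefficient reaches one whenever one token set is contained in the other, while the Dice-Sørensen coefficient still penalises the difference in size.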