# Word Representation in Biomedical Domain

Before you start, please make sure you have read this notebook. You are encouraged to follow the recommendations but you are also free to develop your own solution from scratch. 

## Marking Scheme

- Biomedical imaging project: 40%
    - 20%: accuracy of the final model on the test set
    - 20%: rationale of model design and final report
- Natural language processing project: 40%
    - 30%: completeness of the project
    - 10%: final report
- Presentation skills and team work: 20%


This project forms 40\% of the total score for the summer/winter school. The marking scheme for each part of this project is given below, capped at 100\%.

You are allowed to use open source libraries as long as the libraries are properly cited in the code and final report. The usage of third-party code without proper reference will be treated as plagiarism, which will not be tolerated.

You are encouraged to develop the algorithms yourselves, using as little third-party code as possible. We will factor such effort into the marking process.

## Setup and Prerequisites 

Recommended environment

- Python 3.7 or newer
- Free disk space: 100GB

Download the data

```sh
# navigate to the data folder
cd data

# download the data file
# which is also available at https://www.semanticscholar.org/cord19/download
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2021-07-26/document_parses.tar.gz

# decompress the file which may take several minutes
tar -xf document_parses.tar.gz

# which creates a folder named document_parses
```

## Part 1 (20%): Parse the Data

The JSON files are located in two sub-folders in `document_parses`. You will need to scan all JSON files and extract text (i.e. `string`) from relevant fields (e.g. body text, abstract, titles).

You are encouraged to extract the full article text from the body text if possible. If hardware resources are limited, you can extract from abstracts or titles instead.

Note: The number of JSON files is around 425k so it may take more than 10 minutes to parse all documents.

For more information about the dataset: https://www.semanticscholar.org/cord19/download

Recommended output:

- A list of text (`string`) extracted from JSON files.
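
As a rough sketch of the extraction step, the helper below pulls the title, abstract and body text out of one parsed paper. The field names (`metadata.title`, `abstract`, `body_text`, with each paragraph carrying a `text` key) are assumptions about the JSON layout and should be checked against the dataset documentation. `extract_text`, `iter_papers` and the sample record are hypothetical names made up for illustration.

```python
import json
from pathlib import Path

def extract_text(paper):
    """Return title, abstract and body text from one parsed paper (a dict)."""
    title = paper.get("metadata", {}).get("title", "")
    abstract = " ".join(p.get("text", "") for p in paper.get("abstract", []))
    body = " ".join(p.get("text", "") for p in paper.get("body_text", []))
    return {"title": title, "abstract": abstract, "body": body}

def iter_papers(root):
    """Yield the extracted text of every JSON file under root, one file at a time."""
    for path in Path(root).rglob("*.json"):
        try:
            with open(path, "r", encoding="utf-8") as f:
                yield extract_text(json.load(f))
        except json.JSONDecodeError:
            continue  # skip files that fail to parse

# Made-up record that mimics the assumed layout
sample = {
    "metadata": {"title": "A toy paper"},
    "abstract": [
        {"text": "First abstract paragraph."},
        {"text": "Second abstract paragraph."},
    ],
    "body_text": [{"text": "Body paragraph."}],
}
print(extract_text(sample))
```

Yielding one paper at a time keeps memory use bounded, which matters when the full body text of all documents is extracted.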

In [1]:
###################
# TODO: add your solution
import os
import json
import time

start_time = time.time()
# Set folder path
folder_path = 'D:/pythonProject1/extraction/document_parses/pdf_json'

# Set output file path
output_file_path = 'D:/NLP_python/files/pdf_json_1.txt'

# Traverse all JSON files in the folder
abstract_texts = []

for filename in os.listdir(folder_path):
    if filename.endswith('.json'):
        file_path = os.path.join(folder_path, filename)
        with open(file_path, 'r', encoding='utf-8') as json_file:
            try:
                data = json.load(json_file)

                # Handle the case where the 'abstract' field might be a list
                abstract_list = data.get('abstract', [])

                if isinstance(abstract_list, list):
                    for abstract_item in abstract_list:
                        abstract_text = abstract_item.get('text', '')
                        abstract_texts.append(abstract_text)
                else:
                    abstract_text = abstract_list.get('text', '')
                    abstract_texts.append(abstract_text)

            except json.JSONDecodeError:
                print(f"Error decoding JSON in file: {file_path}")

# Write the extracted 'text' field to a new txt file
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    for abstract_text in abstract_texts:
        output_file.write(f"{abstract_text}\n")

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Extraction completed in {elapsed_time:.2f} seconds.")

###################

## Part 2 (30%): Tokenization

Traverse the extracted text and segment the text into words (or tokens).

The following tracks can be developed independently. You are encouraged to divide the workload among team members.

Recommended output:

- Tokenizer(s) that are able to tokenize any input text.

Note: Because of the computational complexity of tokenizers, it may take hours or days to process all documents. Which tokenizer is more efficient? Any ideas for speeding it up?

### Track 2.1 (10%): Use split()

Use Python's built-in `str.split()`.
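
A minimal illustration of what `split()` does and does not do, using a made-up sentence. Stripping punctuation afterwards is one possible clean-up step, not a required part of this track.

```python
import string

text = "The main symptoms of COVID-19 are fever, cough and fatigue."

# str.split() with no argument splits on any run of whitespace
tokens = text.split()

# Punctuation stays attached to the neighbouring word ("fever," and "fatigue."),
# so one simple follow-up is to strip it from both ends of each token
cleaned = [t.strip(string.punctuation) for t in tokens]
cleaned = [t for t in cleaned if t]  # drop tokens that were pure punctuation

print(tokens)
print(cleaned)
```

Because `strip` only touches the ends of a token, the hyphen inside `COVID-19` is preserved.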

### Track 2.2 (10%): Use NLTK or SciSpaCy

NLTK tokenizer: https://www.nltk.org/api/nltk.tokenize.html

SciSpaCy: https://github.com/allenai/scispacy

Note: You may need to install NLTK and SpaCy so please refer to their websites for installation instructions.

### Track 2.3 (10%): Use Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE): https://huggingface.co/transformers/tokenizer_summary.html

Note: You may need to install Huggingface's transformers so please refer to its website for installation instructions.

### Track 2.4 (Bonus +5%): Build new Byte-Pair Encoding (BPE)

This track may be dependent on track 2.3.

The above pre-built tokenization methods may not be suitable for the biomedical domain, as the words/tokens (e.g. diseases, symptoms, chemicals, medications, phenotypes, genotypes etc.) can be very different from the words/tokens commonly used in daily life. Can you build and train a new BPE model for the biomedical domain in particular?
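
The core of BPE training and encoding can be sketched in plain Python. This is a simplified character-level version for illustration, not the Hugging Face implementation. The toy word list is made up, and `apply_merge`, `learn_merges` and `encode` are hypothetical helper names.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_merges(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    corpus = Counter(tuple(w) for w in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(apply_merge(list(symbols), best))] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

toy_words = ["virus", "virus", "viral", "antiviral", "coronavirus"]
merges = learn_merges(toy_words, num_merges=5)
print(merges)
print(encode("antivirus", merges))
```

Keeping the merges as an ordered list is what makes it possible to tokenize words that were never seen during training: any new word is split into characters and the same merges are replayed.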

### Open Question (Optional):

- What are the pros and cons of the above tokenizers?

In [2]:
###################
# TODO: add your solution
# 2.1
import os
import re
import time
from tqdm import tqdm

def segment_text_into_words(file_path):
    """
    Segment the text into words (or tokens) using Python's split function.

    Parameters:
    file_path (str): Path to the input text file.

    Returns:
    list: A list of segmented words.
    """
    segmented_words = []

    # Define regular expression to remove punctuation and digits
    regex = re.compile(r'[^\w\s]|\d')

    with open(file_path, 'r', encoding='utf-8') as file:
        total_lines = sum(1 for line in file)

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in tqdm(file, total=total_lines, desc="Processing"):
            # Remove punctuation and digits, then split the line into words (or tokens)
            line = regex.sub('', line)
            words = line.split()
            segmented_words.extend(words)

    return segmented_words

# Input file path
input_file_path = '1output.txt'
output_file_path = '2.1output.txt'

start_time = time.time()

# Segment the text from input file into words
segmented_words = segment_text_into_words(input_file_path)

# Save the segmented words to output file
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    for word in segmented_words:
        output_file.write(word + '\n')

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Segmentation completed in {elapsed_time:.2f} seconds.")


###################

In [None]:
# 2.2
import nltk
from nltk.tokenize import sent_tokenize, RegexpTokenizer
import string
import time

# nltk.download('punkt')  # tokenizer data required by sent_tokenize (run once)

def read_stopwords(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        stopwords_list = set(file.read().splitlines())
    return stopwords_list

def nltk_tokenizer(text):
    # Use RegexpTokenizer to exclude punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)
    return tokens

def process_text(file_content, custom_stop_words, output_file):
    filtered_words_list = []

    for sentence in sent_tokenize(file_content):
        tokens = nltk_tokenizer(sentence)
        filtered_words = [word.lower() for word in tokens if word.lower() not in custom_stop_words]
        filtered_words_list.extend(filtered_words)

    # Write filtered words in larger chunks
    output_file.write(" ".join(filtered_words_list) + "\n")

# Example file paths
stopwords_file_path = r"D:/pythonProject1/extraction/stopwords-en.txt"
input_file_path = r"files/pdf_json_1.txt"
output_file_path = r"files/pdf_json_2.2.txt"

custom_stop_words = read_stopwords(stopwords_file_path)

start_time = time.time()

with open(input_file_path, "r", encoding="utf-8") as input_file, \
     open(output_file_path, "w", encoding="utf-8") as output_file:

    for line in input_file:
        process_text(line, custom_stop_words, output_file)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Segmentation completed in {elapsed_time:.2f} seconds.")


In [None]:
# 2.3
from tokenizers import ByteLevelBPETokenizer
import re
import time
from tqdm import tqdm
import nltk
from nltk.tokenize import word_tokenize

# Download NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def bpe_tokenizer(text, vocab_size=30000):
    """
    Tokenize the input text using Byte-Pair Encoding (BPE).

    Parameters:
    text (str): Input text to be tokenized.
    vocab_size (int): Vocabulary size for BPE.

    Returns:
    list: A list of tokens.
    """
    # Initialize the BPE tokenizer
    tokenizer = ByteLevelBPETokenizer()

    # Train the BPE tokenizer
    tokenizer.train_from_iterator([text], vocab_size=vocab_size)

    # Tokenize using the trained BPE tokenizer
    encoded = tokenizer.encode(text)
    tokens = encoded.tokens

    return tokens

# Read the txt file containing abstracts
input_file_path = r"C:\Users\think\PycharmProjects\pythonProject2\abstracts1.txt"
output_file_path = r"C:\Users\think\PycharmProjects\pythonProject2\2.3output.txt"

start_time = time.time()

with open(input_file_path, "r", encoding="utf-8") as input_file:
    # Read the entire file content
    file_content = input_file.read()

# Remove all symbols, keeping only words
file_content = re.sub(r'[^\w\s]', '', file_content)

# Tokenize using the BPE tokenizer
tokens = bpe_tokenizer(file_content)

# Use NLTK for part-of-speech tagging, keeping only nouns
tagged_tokens = nltk.pos_tag(word_tokenize(file_content))
noun_tokens = [token[0] for token in tagged_tokens if token[1] in ['NN', 'NNS', 'NNP', 'NNPS']]

with open(output_file_path, "w", encoding="utf-8") as output_file:
    # Write nouns to the file, separated by spaces
    output_file.write(" ".join(noun_tokens))

# Tokenization results have been saved to a new file
elapsed_time = time.time() - start_time
print(f"Tokenization completed in {elapsed_time:.2f} seconds.")


In [None]:
# 2.4
from tqdm import tqdm
import time

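from collections import Counter

def initialize_vocab(data):
    # Represent every whitespace-separated word as a tuple of characters with its frequency
    word_freqs = Counter(word for line in data for word in line.split())
    return {tuple(word): freq for word, freq in word_freqs.items()}

def get_stats(corpus):
    # Count adjacent symbol pairs, weighted by how often each word occurs
    stats = {}
    for symbols, freq in corpus.items():
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            stats[pair] = stats.get(pair, 0) + freq
    return stats

def merge_vocab(pair, corpus):
    # Rewrite every word so that each occurrence of `pair` becomes one merged symbol
    merged = ''.join(pair)
    new_corpus = {}
    for symbols, freq in corpus.items():
        new_symbols = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(merged)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        new_corpus[key] = new_corpus.get(key, 0) + freq
    return new_corpus

def learn_bpe(data, num_merges, progress_bar=True):
    corpus = initialize_vocab(data)
    # The initial vocabulary is the set of single characters
    vocab = {symbol for symbols in corpus for symbol in symbols}

    start_time = time.time()
    iterator = tqdm(range(num_merges)) if progress_bar else range(num_merges)
    for _ in iterator:
        stats = get_stats(corpus)
        if not stats:
            break  # every word is already a single symbol
        best_pair = max(stats, key=stats.get)
        corpus = merge_vocab(best_pair, corpus)
        vocab.add(''.join(best_pair))

    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f'BPE model trained in {elapsed_time:.2f} seconds')

    return vocab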

def read_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = file.read().splitlines()
    return data

import re
from nltk import word_tokenize, pos_tag

def filter_nouns_and_remove_non_english(text):
    filtered_data = []
    for sentence in text:
        # Remove non-English characters
        english_text = re.sub(r'[^a-zA-Z\s]', '', sentence)

        # Tokenize and POS tag
        tokens = word_tokenize(english_text)
        tagged = pos_tag(tokens)

        # Keep nouns and convert to lowercase
        nouns = [word.lower() for word, pos in tagged if pos.startswith('N')]

        # Reassemble the sentence
        filtered_sentence = ' '.join(nouns)
        filtered_data.append(filtered_sentence)
    return filtered_data

# Read text file
biomedical_data = read_text_file(r"pdf_json_1.txt")

# Keep only nouns and remove symbols, numbers, and special characters
filtered_data = filter_nouns_and_remove_non_english(biomedical_data)

biomedical_bpe_vocab = learn_bpe(filtered_data, num_merges=100, progress_bar=True)

# Save biomedical BPE vocabulary to file
with open('biomedical_bpe_vocab.txt', 'w', encoding='utf-8') as file:
    for token in biomedical_bpe_vocab:
        file.write(token + '\n')


## Part 3 (30%): Build Word Representations

Build word representations for each extracted word. If hardware resources are limited, you may limit the vocabulary size to 10k words/tokens (or even fewer) and the dimension of the representations to 256.

The following tracks can be developed independently. You are encouraged to divide the workload to each team member.

### Track 3.1 (15%): Use N-gram Language Modeling

N-gram language modeling predicts a target word from the preceding $n-1$ words of context. Specifically,

$P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-n+1})$

For example, given a sentence, `"the main symptoms of COVID-19 are fever and cough"`, if $n=7$, we use previous context `["the", "main", "symptoms", "of", "COVID-19", "are"]` to predict the next word `"fever"`.
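
The example above can be reproduced with a small helper that turns a token list into (context, target) training pairs. `ngram_pairs` is a hypothetical name, and whitespace splitting is used only to keep the sketch short.

```python
def ngram_pairs(tokens, n):
    """Return (context, target) pairs: n - 1 context words followed by the word to predict."""
    return [
        (tokens[i:i + n - 1], tokens[i + n - 1])
        for i in range(len(tokens) - n + 1)
    ]

tokens = "the main symptoms of COVID-19 are fever and cough".split()
pairs = ngram_pairs(tokens, n=7)

for context, target in pairs:
    print(context, "->", target)
```

With nine tokens and $n=7$, three pairs are produced; the first one has the six-word context from the example and `"fever"` as its target.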

More to read: https://web.stanford.edu/~jurafsky/slp3/3.pdf

Recommended outputs:

- A fixed vector for each word/token.

### Track 3.2 (15%): Use Skip-gram with Negative Sampling

In skip-gram, we use a central word to predict its context. Specifically,

$P(w_{c-m}, ... w_{c-1}, w_{c+1}, ..., w_{c+m} | w_c)$

As the learning objective of skip-gram is computationally inefficient (it requires a summation over the entire vocabulary $|V|$), negative sampling is commonly applied to accelerate training.

In negative sampling, we randomly select one word from the context as a positive sample, and randomly select $K$ words from the vocabulary as negative samples. As a result, the learning objective is updated to

$L = -\log\sigma(u^T_{t} v_c) - \sum_{k=1}^K\log\sigma(-u^T_k v_c)$, where $u_t$ is the vector embedding of positive sample from context, $u_k$ are the vector embeddings of negative samples, $v_c$ is the vector embedding of the central word, $\sigma$ refers to the sigmoid function.
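
The loss above can be evaluated directly for one (central word, positive sample, negative samples) triple. The sketch below uses plain Python lists with made-up three-dimensional vectors; in practice the vectors would be rows of trainable embedding matrices, and `negative_sampling_loss` is a hypothetical helper name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(v_c, u_t, negatives):
    """L = -log sigma(u_t . v_c) - sum_k log sigma(-u_k . v_c)."""
    loss = -math.log(sigmoid(dot(u_t, v_c)))
    for u_k in negatives:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss

# Made-up vectors for illustration only
v_c = [0.5, -0.2, 0.1]                            # central word
u_t = [0.4, -0.1, 0.3]                            # positive sample from the context
negatives = [[-0.3, 0.2, 0.0], [0.1, 0.4, -0.5]]  # K = 2 negative samples

print(negative_sampling_loss(v_c, u_t, negatives))
```

The loss falls when $u_t$ points in the same direction as $v_c$ and the negative samples point away from it, and each evaluation touches only $K+1$ output vectors rather than the whole vocabulary.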

More to read: http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf (sections 4.3 and 4.4)

Recommended outputs:

- A fixed vector for each word/token.

### Track 3.3 (Bonus +5%): Use Contextualised Word Representation by Masked Language Model (MLM)

BERT introduces a new language model for pre-training named Masked Language Model (MLM). The advantage of MLM is that the word representations by MLM will be contextualised.

For example, "stick" may have different meanings in different contexts. With N-gram language modeling and word2vec (skip-gram, CBOW), the word representation of "stick" is fixed regardless of its context. MLM, however, learns the representation of "stick" dynamically based on context. In other words, "stick" will have different representations in different contexts under MLM.
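
To make the training signal concrete, here is a minimal sketch of how MLM inputs can be prepared: some tokens are selected, and the model is later asked to recover the originals from the surrounding context. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token and the unchanged token follow the BERT paper; `mask_tokens`, the toy vocabulary and the sentence are hypothetical.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original token at selected positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")           # 80%: replace with the mask token
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)  # position not used in the loss
            masked.append(tok)
    return masked, labels

tokens = "the main symptoms of covid-19 are fever and cough".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because a masked position can only be predicted from its neighbours, the representation learned for each position depends on the whole sentence, which is what makes it contextualised.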

More to read: http://jalammar.github.io/illustrated-bert/ and https://arxiv.org/pdf/1810.04805.pdf

Recommended outputs:

- An algorithm that is able to generate contextualised representation in real time.

In [3]:
###################
# TODO: add your solution
# 3.1
import torch
import torch.nn as nn
import torch.optim as optim
import re
import torch.nn.functional as F
from tqdm import tqdm

# A class defining the NGram language model
class NGramModel(nn.Module):
    def __init__(self, vocab_size, embed_size, context_size):
        super(NGramModel, self).__init__()
        # Create an nn.Embedding layer to map words to word vectors
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        # Create a linear layer to map input context vectors to a hidden layer
        self.linear1 = nn.Linear(context_size * embed_size, 128)
        # Create another linear layer to map the output of the hidden layer to the vocabulary size
        self.linear2 = nn.Linear(128, vocab_size)

    # Forward method to define the forward pass of the model
    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

# Build vocabulary from given text
def build_vocab(text):
    words = re.findall(r'\w+', text.lower())
    vocab = set(words)
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    idx2word = {idx: word for word, idx in word2idx.items()}
    return vocab, word2idx, idx2word

# Build N-Gram sequences from given text
def build_ngrams(text, n):
    words = re.findall(r'\w+', text.lower())
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return ngrams

# Train the N-Gram model
def train_ngram_model(text, n, vocab_size, word2idx, num_epoch, lr, device, weight_decay):
    ngrams = build_ngrams(text, n)
    losses = []

    # The model sees the first n - 1 words and predicts the last word of each n-gram
    model = NGramModel(vocab_size, 100, n - 1)
    model.to(device)
    model.train()

    loss_function = nn.NLLLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

    ngrams_idxs = []
    for context_words in ngrams:
        # Exclude the target (last) word from the model input
        context_idxs = torch.tensor([word2idx[word] for word in context_words[:-1]], dtype=torch.long).to(device)
        ngrams_idxs.append(context_idxs)

    for epoch in range(num_epoch):
        total_loss = 0
        for i in tqdm(range(len(ngrams)), desc=f'Epoch {epoch + 1}/{num_epoch}', unit='batch'):
            context_words = ngrams[i]
            context_idxs = ngrams_idxs[i].to(device)

            model.zero_grad()
            log_probs = model(context_idxs)
            target = torch.tensor([word2idx[context_words[-1]]], dtype=torch.long).to(device)
            loss = loss_function(log_probs, target)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        losses.append(total_loss)

        torch.save(model.state_dict(), f'ngram_model_ep{epoch}.pth')
        print(f'Epoch {epoch + 1}/{num_epoch}, Loss: {total_loss}')

    return model

if __name__ == '__main__':
    # Determine the current device (GPU or CPU)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Hyperparameters
    num_epoch = 10
    lr = 1e-2
    n = 3
    weight_decay = 0

    # Read text file
    file_path = "pdf_json_2.2.txt"
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

    # Build vocabulary and training data
    vocab, word2idx, idx2word = build_vocab(text)
    vocab_size = len(vocab)

    # Train N-Gram model
    model = train_ngram_model(text, n, vocab_size, word2idx, num_epoch, lr, device, weight_decay)

    # Save the model
    torch.save(model.state_dict(), 'ngram_model.pth')

    # Get trained word vectors
    word_vectors = model.embeddings.weight.detach().cpu().numpy()

    # Write words and corresponding vector representations to a text file
    with open("3.1new_word_vectors.txt", "w", encoding="utf-8") as file:
        for word, idx in word2idx.items():
            vector = ",".join(str(num) for num in word_vectors[idx])
            file.write(f"{word},{vector}\n")

###################

In [None]:
# 3.2
import torch
import torch.nn as nn
import torch.optim as optim
import re
import torch.nn.functional as F
import time
from tqdm import tqdm

start_time = time.time()

# vocab_size represents the size of the vocabulary, embed_size represents the dimension of word vectors,
# context_size represents the size of the context (i.e., the length of the considered N-Gram)
class NGramModel(nn.Module):
    def __init__(self, vocab_size, embed_size, context_size):
        super(NGramModel, self).__init__()
        # Create an nn.Embedding layer to map words to word vectors
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        # Create a linear layer to map input context vectors to a hidden layer
        self.linear1 = nn.Linear(context_size * embed_size, 128)
        # Create another linear layer to map the output of the hidden layer to the vocabulary size
        self.linear2 = nn.Linear(128, vocab_size)

    # inputs is a tensor representing the input context
    def forward(self, inputs):
        # Map the input context inputs through the embedding layer self.embeddings,
        # converting each word index to the corresponding word vector.
        # Then, use the view method to reshape the result into a tensor with shape (1, -1),
        # where -1 represents the automatically calculated dimension, maintaining the first dimension as 1.
        embeds = self.embeddings(inputs).view((1, -1))
        # Pass the mapped word vectors embeds to a linear layer self.linear1,
        # applying the ReLU activation function. This linear layer maps the word vectors to a hidden layer.
        out = F.relu(self.linear1(embeds))
        # Pass the output of the hidden layer to another linear layer self.linear2,
        # which maps the output of the hidden layer to the size of the vocabulary,
        # obtaining the original output of the model.
        out = self.linear2(out)
        # Apply LogSoftmax operation to the original output of the model,
        # computing the logarithmic probabilities for each word.
        log_probs = F.log_softmax(out, dim=1)
        # Return the calculated log probabilities as the final output of the model.
        return log_probs

# A function to build a vocabulary from the given text and create mappings from words to indices and indices to words.
# It takes a string parameter text representing the input text content.
def build_vocab(text):
    # Use the regular expression r'\w+' to find all words in the text,
    # where \w+ matches one or more consecutive letters or digits.
    # Convert the text to lowercase for uniform processing.
    words = re.findall(r'\w+', text.lower())
    # Convert the list of found words to a set, removing duplicate words, and obtain the vocabulary vocab.
    vocab = set(words)
    # Create a dictionary word2idx, mapping each word in the vocabulary to a unique index.
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    # Create another dictionary idx2word, mapping indices back to the original words.
    # This dictionary is useful for looking up words based on indices.
    idx2word = {idx: word for word, idx in word2idx.items()}
    return vocab, word2idx, idx2word

# A function to build N-Gram sequences from the given text.
# It takes two parameters: a string text representing the input text content and an integer n representing the length of N-Grams.
def build_ngrams(text, n):
    # Use the regular expression r'\w+' to find all words in the text,
    # where \w+ matches one or more consecutive letters or digits.
    # Convert the text to lowercase for uniform processing.
    words = re.findall(r'\w+', text.lower())
    # Use a list comprehension to generate all N-Grams, forming a list.
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    # Return these N-Gram sequences.
    return ngrams

# A function to train the N-Gram model with a progress bar.
# It takes four parameters: a string text representing the input text content,
# an integer n representing the length of N-Grams, an integer vocab_size representing the size of the vocabulary,
# and a dictionary word2idx representing the mapping from words to indices.
def train_ngram_model_with_progress(text, n, vocab_size, word2idx):
    ngrams = build_ngrams(text, n)
    losses = []
    # The model sees the first n - 1 words and predicts the last word of each n-gram
    model = NGramModel(vocab_size, 100, n - 1)
    loss_function = nn.NLLLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(10):
        total_loss = 0
        # Use tqdm to add a progress bar
        for context_words in tqdm(ngrams, desc=f'Epoch {epoch + 1}/{10}'):
            # Exclude the target (last) word from the model input
            context_idxs = torch.tensor([word2idx[word] for word in context_words[:-1]], dtype=torch.long)

            model.zero_grad()
            log_probs = model(context_idxs)
            target = torch.tensor([word2idx[context_words[-1]]], dtype=torch.long)
            loss = loss_function(log_probs, target)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        losses.append(total_loss)
        print(f'Epoch {epoch + 1}/{10}, Loss: {total_loss}')

    return model

# Read text file
file_path = "pdf_json_2.2.txt"
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Build vocabulary and training data
vocab, word2idx, idx2word = build_vocab(text)
vocab_size = len(vocab)

# Train N-Gram model with progress bar
n = 3
model = train_ngram_model_with_progress(text, n, vocab_size, word2idx)

# Save the model
torch.save(model.state_dict(), 'ngram_model.pth')

# Get trained word vectors
word_vectors = model.embeddings.weight.data.numpy()

# Write words and corresponding vector representations to a text file
with open("3.1new_word_vectors.txt", "w", encoding="utf-8") as file:
    for word, idx in word2idx.items():
        vector = ",".join(str(num) for num in word_vectors[idx])
        file.write(f"{word},{vector}\n")



In [None]:
# 3.3
from transformers import BertTokenizer

# Read the text file
with open('pdf_json_2.2.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Load the BertTokenizer
# Point from_pretrained at the model directory (which should contain vocab.txt),
# not at the TensorFlow checkpoint index file
model_name = "D:/pythonProject1/extraction/Bert/biobert_v1.1_pubmed"
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=True)

# Process the text using BertTokenizer
tokenized_text = tokenizer.tokenize(text)

# Print the processed results
print(tokenized_text)


## Part 4 (20%): Explore the Word Representations

The following tracks can be finished independently. You are encouraged to divide workload to each team member.

### Track 4.1 (5%): Visualise the word representations by t-SNE

t-SNE is a dimensionality-reduction algorithm commonly used to visualise high-dimensional vectors. Use t-SNE to visualise the word representations. You may visualise up to 1000 words, as t-SNE is computationally expensive.

More about t-SNE: https://lvdmaaten.github.io/tsne/

Recommended output:

- A diagram by t-SNE based on representations of up to 1000 words.

### Track 4.2 (5%): Visualise the Word Representations of Biomedical Entities by t-SNE

Instead of visualising the word representations of the entire vocabulary (or 1000 words selected at random), visualise the word representations of words that are biomedical entities, for example fever, cough and diabetes. Based on the category of those biomedical entities, can you assign different colours to the entities and see whether entities from the same category are clustered by t-SNE? For example, sinusitis and cough are both respiratory diseases, so they should be assigned the same colour and ideally their representations should be close to each other in the t-SNE plot. As another example, Alzheimer's disease and headache are neurological diseases, which should be assigned another colour.

Examples of biomedical ontologies: https://www.ebi.ac.uk/ols/ontologies/hp and https://en.wikipedia.org/wiki/International_Classification_of_Diseases

Recommended output:

- A diagram with colours by t-SNE based on representations of biomedical entities.
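
One possible way to prepare the colours is to map each entity to a category and each category to a colour. The mapping below is hand-made for illustration and is not taken from any ontology; all variable names are hypothetical. The colour strings are matplotlib's named "tab" colours.

```python
# Hypothetical hand-made mapping from entity to category
entity_category = {
    "sinusitis": "respiratory",
    "cough": "respiratory",
    "alzheimer": "neurological",
    "headache": "neurological",
    "diabetes": "metabolic",
}

palette = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple"]

# Give every category its own colour (colours repeat if there are more categories than palette entries)
categories = sorted(set(entity_category.values()))
category_colour = {cat: palette[i % len(palette)] for i, cat in enumerate(categories)}

# One colour per entity, in the same order as the entity list
entities = list(entity_category)
colours = [category_colour[entity_category[e]] for e in entities]

for entity, colour in zip(entities, colours):
    print(f"{entity:12s} {colour}")
```

The `colours` list can then be passed as the `c` argument of `plt.scatter`, alongside the 2-D t-SNE coordinates of the same entities in the same order.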

### Track 4.3 (5%): Co-occurrence

- What are the biomedical entities which frequently co-occur with COVID-19 (or coronavirus)?

Recommended outputs:

- A sorted list of biomedical entities and description on how the entities are selected and sorted.
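
A simple way to approach this is to count how often each entity appears within a fixed window around the target word. The sketch below assumes tokenized, lowercased sentences and a known set of entity terms; the sentences, the entity set, the window size and the name `cooccurrence_counts` are all made up for illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, target, entities, window=5):
    """Count entities found within `window` tokens on either side of each occurrence of target."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in entities:
                    counts[tokens[j]] += 1
    return counts

sentences = [
    ["covid-19", "causes", "fever", "and", "cough"],
    ["fever", "is", "common", "in", "covid-19", "patients"],
    ["diabetes", "increases", "covid-19", "risk"],
]
entities = {"fever", "cough", "diabetes"}

counts = cooccurrence_counts(sentences, "covid-19", entities, window=4)
for entity, count in counts.most_common():
    print(entity, count)
```

Raw counts favour entities that are frequent everywhere, so a normalised score such as pointwise mutual information may give a more informative ranking.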

### Track 4.4 (5%): Semantic Similarity

- Which biomedical entities have the closest semantic similarity to COVID-19 (or coronavirus) based on word representations?

Recommended outputs:

- A sorted list of biomedical entities and description on how the entities are selected and sorted.
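
Cosine similarity is one common choice for comparing word vectors. The sketch below ranks a few words against a query; the vectors are invented three-dimensional values rather than trained representations, and `cosine` and `most_similar` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def most_similar(query, vectors, top_k=3):
    """Return the top_k (word, similarity) pairs closest to the query word."""
    q = vectors[query]
    scores = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]

# Invented vectors for illustration only
toy_vectors = {
    "covid-19": [1.0, 0.9, 0.1],
    "coronavirus": [0.9, 1.0, 0.0],
    "fever": [0.5, 0.2, 0.8],
    "diabetes": [0.0, 0.1, 1.0],
}

for word, score in most_similar("covid-19", toy_vectors):
    print(f"{word:12s} {score:.3f}")
```

To restrict the ranking to biomedical entities, the candidate words can be filtered against an entity list before scoring.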

### Open Question (Optional): What else can you discover?


In [4]:
###################
# TODO: add your solution
# 4.1
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from tqdm import tqdm  # Import tqdm for progress bar

# From txt file, load word representations where each line consists of a word and its corresponding vector representation
with open('files/pdf_json_3.2true.txt', 'r', encoding='utf-8') as file:
    word_vectors = [line.strip().split() for line in file]

# Extract words and vector representations
words, vectors = zip(*[(line[0], list(map(float, line[1:]))) for line in word_vectors])

# Convert lists to numpy arrays
vectors = np.array(vectors)

# Choose t-SNE parameters
tsne = TSNE(n_components=2, random_state=42)

# Use t-SNE for dimensionality reduction on at most 1000 words.
# Fit once on the whole subset: separately fitted t-SNE maps do not share a coordinate system.
max_words = 1000
words = words[:max_words]
vectors = vectors[:max_words]
word_tsne = tsne.fit_transform(vectors)

with open('files/tsne_results_1.txt', 'w', encoding='utf-8') as tsne_file:
    for i, word in enumerate(words):
        tsne_file.write(f"{word} {word_tsne[i, 0]} {word_tsne[i, 1]}\n")

# Visualize the results without labels, with smaller points, and semi-transparent colors
plt.figure(figsize=(10, 8))
plt.scatter(word_tsne[:, 0], word_tsne[:, 1], alpha=0.5, s=10)  # Adjust alpha for transparency and s for point size

plt.title('t-SNE Visualization of Word Representations')
plt.show()

###################

In [None]:
# 4.2
def read_vectors_from_file(file_path):
    word_vectors = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split()
            word = parts[0]
            vector = [float(val) for val in parts[1:]]
            word_vectors[word] = vector
    return word_vectors

# Read vectors from the text file
file_path = 'pdf_json_3.2true.txt'  # Replace with your file path
word_vectors = read_vectors_from_file(file_path)

import requests

def get_biomedical_entities_from_ols(ontology_id):
    ols_api_url = f'https://www.ebi.ac.uk/ols/api/ontologies/{ontology_id}/terms?size=2000'
    response = requests.get(ols_api_url)

    if response.status_code == 200:
        terms = response.json().get('_embedded', {}).get('terms', [])
        biomedical_entities = [term['label'] for term in terms]
        return biomedical_entities
    else:
        print(f"Error accessing OLS API. Status Code: {response.status_code}")
        return []

# Get a list of biomedical entities
biomedical_entities_from_ols = get_biomedical_entities_from_ols('hp')

from difflib import get_close_matches

def match_biomedical_entities_with_vectors(word_vectors, biomedical_entities, threshold=0.7):
    matched_biomedical_entities = {}

    for entity in biomedical_entities:
        closest_matches = get_close_matches(entity, word_vectors.keys(), n=1, cutoff=threshold)
        if closest_matches:
            matched_biomedical_entities[entity] = {
                'word': closest_matches[0],
                'vector': word_vectors[closest_matches[0]]
            }

    return matched_biomedical_entities

# Match words in the text with biomedical entities
matched_biomedical_entities = match_biomedical_entities_with_vectors(word_vectors, biomedical_entities_from_ols)

output_file_path = 'matched_biomedical_entities.txt'

with open(output_file_path, 'w', encoding='utf-8') as output_file:
    for entity, match_info in matched_biomedical_entities.items():
        output_file.write(f"{match_info['word']}\n")

print(f"Matching results written to: {output_file_path}")

# K-means part

import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# File paths
file_path = 'matched_biomedical_entities.txt'

# Read words into an array
word_array = []
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        word = line.strip().lower()  # Convert to lowercase
        word_array.append(word)

def read_vectors_from_file(file_path):
    word_vectors = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split()
            word = parts[0].lower()  # Convert to lowercase
            vector = [float(val) for val in parts[1:]]
            word_vectors[word] = vector
    print(f"Read {len(word_vectors)} vectors from file.")
    return word_vectors

# Read file containing word vectors
word_vectors_file_path = 'tsne_results.txt'
word_vectors = read_vectors_from_file(word_vectors_file_path)

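# Get vectors corresponding to matched words, skipping any word without a
# vector so that a missing entry cannot raise a KeyError
word_array = [word for word in word_array if word in word_vectors]
vectors_for_clustering = np.array([word_vectors[word] for word in word_array])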

# Use t-SNE to map high-dimensional vectors to 2D space
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
vectors_2d = tsne.fit_transform(vectors_for_clustering)

# Use the elbow method to determine the optimal number of clusters
def calculate_wcss(data, max_clusters=10):
    wcss = []
    for i in range(1, max_clusters + 1):
        kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
        kmeans.fit(data)
        wcss.append(kmeans.inertia_)
    return wcss

max_clusters_to_try = 10
wcss_values = calculate_wcss(vectors_for_clustering, max_clusters=max_clusters_to_try)

# Plot the elbow method graph
plt.plot(range(1, max_clusters_to_try + 1), wcss_values, marker='o')
plt.title('Elbow Method for Optimal Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.show()

# Choose the optimal number of clusters based on the elbow method
optimal_clusters = 4

# Perform clustering using K-means algorithm
kmeans = KMeans(n_clusters=optimal_clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(vectors_for_clustering)

# Plot a scatter plot of the clustering results
plt.figure(figsize=(10, 8))
for i in range(optimal_clusters):
    cluster_points = vectors_2d[cluster_labels == i]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], label=f'Cluster {i}')

# Add labels
for word, x, y in zip(word_array, vectors_2d[:, 0], vectors_2d[:, 1]):
    plt.annotate(word, (x, y), textcoords="offset points", xytext=(0, 5), ha='center', fontsize=8)

plt.title('t-SNE Visualization of Clusters')
plt.legend()
plt.show()

# Write clustering results to file
output_cluster_file_path = 'cluster_results_new.txt'
with open(output_cluster_file_path, 'w', encoding='utf-8') as output_file:
    for word, cluster_label in zip(word_array, cluster_labels):
        output_file.write(f"Word: {word}, Cluster: {cluster_label}\n")

print(f"Cluster results written to: {output_cluster_file_path}")


In [None]:
# 4.3
import numpy as np
from scipy.sparse import dok_matrix
from collections import Counter
from tqdm import tqdm
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Read segmented text data from a text file
def read_text_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        texts = file.readlines()
    return texts

# Build co-occurrence matrix with progress bar
def build_cooccurrence_matrix(texts):
    vocab = set()
    for text in texts:
        words = text.split()
        vocab.update(words)

    vocab = list(vocab)
    vocab_indices = {word: index for index, word in enumerate(vocab)}

    cooccurrence_matrix = dok_matrix((len(vocab), len(vocab)), dtype=np.float64)

    for text in tqdm(texts, desc="Building Co-occurrence Matrix"):
        words = text.split()
        # Look words up in the dict (constant time) rather than the vocab list (linear time)
        word_indices = [vocab_indices[word] for word in words if word in vocab_indices]
        for i in range(len(word_indices)):
            for j in range(i + 1, len(word_indices)):
                cooccurrence_matrix[word_indices[i], word_indices[j]] += 1
                cooccurrence_matrix[word_indices[j], word_indices[i]] += 1

    return cooccurrence_matrix, vocab_indices

# Main function
def main():
    # Read text data
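    # Raw string so the backslashes in the Windows path are not treated as escapes
    file_path = r'D:\pythonProject1\extraction\pdf_json_2_100percent.txt'  # Replace with your file path
    texts = read_text_file(file_path)
    print(f"Read {len(texts)} lines of text.")
    # Build co-occurrence matrix
    cooccurrence_matrix, vocab_indices = build_cooccurrence_matrix(texts)
    print(f"Built co-occurrence matrix over {len(vocab_indices)} words.")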
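    # Optionally: normalize each row of the co-occurrence matrix to sum to 1.
    # Multiplying by the reciprocal with multiply() keeps the result sparse;
    # dividing a sparse matrix by a dense array may densify it in some SciPy versions.
    row_sums = np.array(cooccurrence_matrix.sum(axis=1)).flatten()
    row_sums[row_sums == 0] = 1  # Avoid division by zero
    cooccurrence_matrix_normalized = cooccurrence_matrix.tocsr().multiply(1.0 / row_sums[:, np.newaxis]).tocsr()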


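    # Find words co-occurring with "COVID19" based on cooccurrence_matrix and vocab_indices
    covid_related_entities = ['COVID19']
    index_to_word = {index: word for word, index in vocab_indices.items()}
    cooccurrence_csr = cooccurrence_matrix.tocsr()
    cooccurrence_counts = Counter()

    for entity in covid_related_entities:
        if entity in vocab_indices:
            # The non-zero entries of the entity's row are its co-occurring words
            row = cooccurrence_csr[vocab_indices[entity], :].tocoo()
            for column, count in zip(row.col, row.data):
                cooccurrence_counts[index_to_word[column]] += count

    # Sort by co-occurrence frequency and show the 20 most frequent words
    sorted_entities = sorted(cooccurrence_counts.items(), key=lambda x: x[1], reverse=True)
    print(sorted_entities[:20])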



if __name__ == "__main__":
    main()


In [None]:
# 4.4
import numpy as np
from tqdm import tqdm
from wordcloud import WordCloud
import matplotlib.pyplot as plt


def load_vectors_from_txt(file_path):
    # Load embedding vectors from a txt file
    vectors = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            entity = parts[0]
            vector = np.array([float(val) for val in parts[1:]])
            vectors[entity] = vector
    return vectors


def cosine_similarity(vec1, vec2):
    # Calculate cosine similarity
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity


def generate_wordcloud(sorted_entities):
    wordcloud_text = {entity: float(score) for entity, score in sorted_entities}

    # Create WordCloud object
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(wordcloud_text)

    # Display the word cloud
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()


def main():
    # File paths
    # Raw string so the backslashes in the Windows path are not treated as escapes
    vectors_file_path = r'D:\pythonProject1\extraction\pdf_json_3.2true.txt'  # Replace with your file path
    output_file_path = 'output_similarity.txt'

    # Load embedding vectors
    loaded_vectors = load_vectors_from_txt(vectors_file_path)

    # Extract the vector for corona
    corona_vector = loaded_vectors.get('corona')

    # Stop early if the vector is missing; the similarity loop below needs it
    if corona_vector is None:
        print("Vector for corona not found")
        return
    print("Successfully extracted the vector for corona")

    # Calculate similarity scores with progress bar
    similarity_scores = {}
    for entity, entity_vector in tqdm(loaded_vectors.items(), desc="Calculating Similarity Scores"):
        similarity_scores[entity] = cosine_similarity(corona_vector, entity_vector)

    # Sort the results
    sorted_entities = sorted(similarity_scores.items(), key=lambda x: x[1], reverse=True)

    # Output results to a new txt file
    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        for entity, similarity_score in sorted_entities:
            output_file.write(f"{entity}: {similarity_score}\n")

    # Generate and display the word cloud
    generate_wordcloud(sorted_entities)


if __name__ == "__main__":
    main()


## Part 5 (Bonus +10%): Open Challenge: Mining Biomedical Knowledge

A fundamental task in clinical/biomedical natural language processing is to extract intelligence from biomedical text corpora automatically and efficiently. More specifically, this intelligence may include biomedical entities mentioned in the text, relations between those entities, clinical features of patients, and the progression of diseases, all of which can be used to predict, understand and improve patient outcomes.

This open challenge is to build a biomedical knowledge graph based on the CORD-19 dataset and mine useful information from it. We recommend the following steps but you are also encouraged to develop your solution from scratch.

### Extract Biomedical Entities from Text

Extract biomedical entities (such as fever, cough, headache, lung cancer, heart attack) from text. Note that:

- Biomedical entities may consist of multiple words, for example heart attack or multiple myeloma.
- Biomedical entities may appear as synonyms, for example low blood pressure for hypotension.
- Biomedical entities may appear in different word forms, for example smoking, smokes, smoked.
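
The three points above can be illustrated with a small dictionary-based matcher. This is only a minimal sketch under stated assumptions: the `LEXICON` table, the `crude_stem` suffix stripper and the `extract_entities` function are hypothetical names introduced here, and a real solution would more likely rely on a curated ontology and a proper lemmatizer.

```python
import re

# Hypothetical lexicon: surface form -> canonical entity name.
LEXICON = {
    "fever": "fever",
    "cough": "cough",
    "heart attack": "heart attack",
    "multiple myeloma": "multiple myeloma",
    "hypotension": "hypotension",
    "low blood pressure": "hypotension",  # synonym of hypotension
    "smoking": "smoking",
}


def crude_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a lemmatizer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Lowercase, tokenize and stem a piece of text into a tuple of tokens."""
    return tuple(crude_stem(t) for t in re.findall(r"[a-z0-9]+", text.lower()))


# Index the lexicon by normalized token tuples so word forms match each other.
NORMALIZED_LEXICON = {normalize(s): c for s, c in LEXICON.items()}
MAX_NGRAM = max(len(key) for key in NORMALIZED_LEXICON)


def extract_entities(text):
    """Return canonical entities found in text, preferring the longest match."""
    tokens = normalize(text)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_NGRAM, len(tokens) - i), 0, -1):
            canonical = NORMALIZED_LEXICON.get(tokens[i:i + n])
            if canonical is not None:
                found.append(canonical)
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no entity starts at this token
    return found


print(extract_entities("The patient smoked and had low blood pressure after a heart attack."))
# → ['smoking', 'hypotension', 'heart attack']
```

Matching the longest n-gram first is what lets multi-word entities such as "low blood pressure" win over any shorter entry that starts at the same token.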

### Extract Relations between Biomedical Entities

Extract relations between biomedical entities based on their appearance in text. You may define a relation between biomedical entities by one or more of the following criteria:

- The biomedical entities frequently co-occur.
- The biomedical entities have similar word representations.
- The biomedical entities have clear relations based on textual narratives. For example, "The most common symptoms for COVID-19 are fever and cough" so we know there are relations between "COVID-19", "fever" and "cough".
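
The first two criteria can be sketched with the standard library alone. The code below is an illustration under assumptions, not a prescribed method: `cooccurrence_relations`, its `min_count` threshold and the `cosine` helper are hypothetical names, and the input is assumed to be one list of already extracted entities per sentence.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_relations(entity_lists, min_count=2):
    """Count entity pairs that share a sentence and keep the frequent ones.

    entity_lists holds one list of entities per sentence; min_count is an
    assumed threshold for calling a pair "frequently co-occurring".
    """
    pair_counts = Counter()
    for entities in entity_lists:
        # Sort the unique entities so each pair is always stored in one order.
        unique = sorted(set(entities))
        pair_counts.update(combinations(unique, 2))
    return {pair: count for pair, count in pair_counts.items() if count >= min_count}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy input for illustration only.
sentences = [
    ["COVID-19", "fever", "cough"],
    ["COVID-19", "fever"],
    ["cough", "headache"],
]
print(cooccurrence_relations(sentences))
# → {('COVID-19', 'fever'): 2}
```

A pair of entities could then also be linked when `cosine` of their word vectors exceeds a threshold of your choosing, which covers the second criterion.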

### Build a Biomedical Knowledge Graph of COVID-19

Build a knowledge graph based on the results of the entity extraction (5.1) and relation extraction (5.2) steps above, and visualise it.
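
One possible way to load extracted relations into a graph database is to generate Cypher statements from Python. The sketch below assumes relations are stored as a dict mapping entity pairs to weights; the `Entity` label, the `CO_OCCURS_WITH` relationship type and the function names are hypothetical choices made for illustration.

```python
def escape(value):
    """Escape backslashes and single quotes for a single-quoted Cypher string."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def relations_to_cypher(relations):
    """Turn {(entity_a, entity_b): weight} into a list of Cypher statements.

    MERGE is used instead of CREATE so that running the statements twice
    does not duplicate nodes or relationships.
    """
    statements = []
    for (source, target), weight in sorted(relations.items()):
        statements.append(
            f"MERGE (a:Entity {{name: '{escape(source)}'}}) "
            f"MERGE (b:Entity {{name: '{escape(target)}'}}) "
            f"MERGE (a)-[r:CO_OCCURS_WITH]->(b) SET r.weight = {weight};"
        )
    return statements


# Toy relations for illustration only.
for statement in relations_to_cypher({("COVID-19", "fever"): 2, ("COVID-19", "cough"): 1}):
    print(statement)
```

Each printed statement can be run separately in a Cypher shell or browser, after which the graph can be visualised with the database's own tooling.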

In [5]:
###################
# TODO: add your solution
CREATE (n1:MedicalResearchFields {name:'Epidemiology'})
CREATE (n2:MedicalResearchFields {name:'Pulmonology'})
CREATE (n3:MedicalResearchFields {name:'Immunology'})
CREATE (n4:Symptom {name:'Dyspnea'})
CREATE (n5:Symptom {name:'Cough'})
CREATE (n6:Symptom {name:'Fever'})
CREATE (n7:Treatment {name:'Anti-inflammatory Drugs'})
CREATE (n8:Treatment {name:'Antiviral Medications'})
CREATE (n9:Treatment {name:'Oxygen Therapy'})
CREATE (n10:Disease {name:'COVID-19'})
CREATE (n11:Drug {name:'Convalescent Plasma'})
CREATE (n12:Drug {name:'Remdesivir'})
CREATE (n13:Drug {name:'Dexamethasone'})
CREATE (n14:Drug {name:'Hydroxychloroquine'})
CREATE (n15:Drug {name:'Ivermectin'})
CREATE (n16:Symptom {name:'Fatigue'})
CREATE (n17:Symptom {name:'Loss of Taste or Smell'})
CREATE (n18:Symptom {name:'Shortness of Breath'})
CREATE (n19:Symptom {name:'Muscle Aches'})
CREATE (n20:Symptom {name:'Sore Throat'})

CREATE (n21:Treatment {name:'Ventilator Support'})
CREATE (n22:Treatment {name:'Anticoagulants'})
CREATE (n23:Treatment {name:'Steroids'})
CREATE (n24:Treatment {name:'Monoclonal Antibodies'})
CREATE (n25:Treatment {name:'Mechanical Ventilation'})

RETURN n1, n2, n3, n4, n5, n6, n7, n8, n9, n10, n11, n12, n13, n14, n15, n16, n17, n18, n19, n20, n21, n22, n23, n24, n25
###################