# Word Representation in Biomedical Domain

Before you start, please make sure you have read this notebook. You are encouraged to follow the recommendations but you are also free to develop your own solution from scratch. 

## Marking Scheme

- Biomedical imaging project: 40%
    - 20%: accuracy of the final model on the test set
    - 20%: rationale of model design and final report
- Natural language processing project: 40%
    - 30%: completeness of the project
    - 10%: final report
- Presentation skills and team work: 20%


This project forms 40\% of the total score for summer/winter school. The marking scheme of each part of this project is provided below with a cap of 100\%.

You are allowed to use open source libraries as long as the libraries are properly cited in the code and final report. The usage of third-party code without proper reference will be treated as plagiarism, which will not be tolerated.

You are encouraged to develop the algorithms yourselves, relying on third-party code as little as possible. We will factor such effort into the marking process.

## Setup and Prerequisites 

Recommended environment

- Python 3.7 or newer
- Free disk space: 100GB

Download the data

```sh
# navigate to the data folder
cd data

# download the data file
# which is also available at https://www.semanticscholar.org/cord19/download
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2021-07-26/document_parses.tar.gz

# decompress the file which may take several minutes
tar -xf document_parses.tar.gz

# which creates a folder named document_parses
```

## Part 1 (20%): Parse the Data

The JSON files are located in two sub-folders in `document_parses`. You will need to scan all JSON files and extract text (i.e. `string`) from relevant fields (e.g. body text, abstract, titles).

You are encouraged to extract full article text from body text if possible. If the hardware resource is limited, you can extract from abstract or titles as alternatives. 

Note: The number of JSON files is around 425k so it may take more than 10 minutes to parse all documents.

For more information about the dataset: https://www.semanticscholar.org/cord19/download

Recommended output:

- A list of text (`string`) extracted from JSON files.

In [None]:
###################
# TODO: add your solution

import os
import json
import glob

# Set the input data paths and the output file path
pdf_json_dir = r"D:\Scholar\AI Winter School 2025\Project\NLP\document_parses\pdf_json"
pmc_json_dir = r"D:\Scholar\AI Winter School 2025\Project\NLP\document_parses\pmc_json"
output_txt = r"D:\Scholar\AI Winter School 2025\Project\Ours\NLP\Text.txt"

# Collect the paths of all JSON files
json_files = glob.glob(os.path.join(pdf_json_dir, "*.json")) + glob.glob(os.path.join(pmc_json_dir, "*.json"))

# Extract the title and body text of every file into a single text file
def extract_content(json_files, output_txt):
    with open(output_txt, "w", encoding="utf-8") as output_file:
        for file_path in json_files:
            try:
                with open(file_path, "r", encoding="utf-8") as f:
                    json_data = json.load(f)
                    
                    # Extract the title
                    title = json_data.get("metadata", {}).get("title", "").strip()
                    if title:
                        output_file.write(f"{title}\n")
                    
                    # Extract the body text
                    body_texts = json_data.get("body_text", [])
                    for body in body_texts:
                        text = body.get("text", "")
                        cite_spans = body.get("cite_spans", [])
                        
                        # Remove the citation markers listed in cite_spans
                        for cite in cite_spans:
                            cite_text = cite.get("text", "")
                            text = text.replace(cite_text, "")
                        
                        if text:
                            output_file.write(f"{text}\n")
            except Exception as e:
                print(f"Error processing {file_path}: {e}")

# Extract the content and write it to the .txt file
extract_content(json_files, output_txt)
print(f"Content extracted and saved to {output_txt}")
###################
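
The cell above removes citations with `str.replace`, which deletes every occurrence of the citation text in the paragraph, not only the cited position. A possible alternative is sketched below. It assumes each entry in `cite_spans` carries `start` and `end` character offsets into the paragraph text; the helper name `strip_cite_spans` and the sample paragraph are made up for illustration.

```python
def strip_cite_spans(text, cite_spans):
    """Remove citation markers from text using their character offsets."""
    # Work from right to left so that earlier offsets remain valid
    # after each deletion.
    for span in sorted(cite_spans, key=lambda s: s["start"], reverse=True):
        start, end = span["start"], span["end"]
        text = text[:start] + text[end:]
    return text


# Hypothetical paragraph in the same shape as a body_text entry
paragraph = {
    "text": "Fever is common [1] in 1 of 3 patients.",
    "cite_spans": [{"start": 16, "end": 19, "text": "[1]"}],
}
cleaned = strip_cite_spans(paragraph["text"], paragraph["cite_spans"])
print(cleaned)
```

Only the marker at offsets 16 to 19 is removed; the digit in "1 of 3" is left alone. The deletion leaves a double space behind, which any whitespace-based tokenizer will ignore.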

## Part 2 (30%): Tokenization

Traverse the extracted text and segment the text into words (or tokens).

The following tracks can be developed independently. You are encouraged to divide the workload among team members.

Recommended output:

- Tokenizer(s) that can tokenize any input text.

Note: Because of the computational complexity of tokenizers, it may take hours or days to process all documents. Which tokenizer is more efficient? Any ideas for speeding it up?

### Track 2.1 (10%): Use split()

Use Python's built-in `str.split()`.
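
As a small illustration of what whitespace splitting does and does not handle, the sketch below tokenizes one sentence and then strips leading and trailing punctuation. The sample sentence and variable names are made up for this example.

```python
import string

sentence = "The main symptoms of COVID-19 are fever, cough and fatigue."

# Whitespace splitting keeps punctuation attached to the neighbouring word.
tokens = sentence.split()
print(tokens)

# One cheap refinement: strip punctuation from both ends of each token.
# Internal characters such as the hyphen in "COVID-19" are preserved.
stripped = [t.strip(string.punctuation) for t in tokens]
print(stripped)
```

Note that `"fever,"` and `"fatigue."` come out of `split()` with punctuation attached, which would otherwise create separate vocabulary entries for the same word.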

### Track 2.2 (10%): Use NLTK or SciSpaCy

NLTK tokenizer: https://www.nltk.org/api/nltk.tokenize.html

SciSpaCy: https://github.com/allenai/scispacy

Note: You may need to install NLTK and SpaCy so please refer to their websites for installation instructions.

### Track 2.3 (10%): Use Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE): https://huggingface.co/transformers/tokenizer_summary.html

Note: You may need to install Huggingface's transformers so please refer to its website for installation instructions.

### Track 2.4 (Bonus +5%): Build new Byte-Pair Encoding (BPE)

This track may be dependent on track 2.3.

The above pre-built tokenization methods may not be suitable for the biomedical domain, as its words/tokens (e.g. diseases, symptoms, chemicals, medications, phenotypes, genotypes etc.) can be very different from the words/tokens commonly used in daily life. Can you build and train a new BPE model for the biomedical domain in particular?
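
To make the idea of training BPE concrete, here is a toy sketch of the merge-learning loop in plain Python. It is an illustration under simplifying assumptions (character-level start, an end-of-word marker `</w>`, a five-word made-up corpus), not a replacement for a production library; the function name `learn_bpe` is hypothetical.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (toy illustration)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


corpus = ["virus", "virus", "viral", "coronavirus", "antiviral"]
merges, vocab = learn_bpe(corpus, num_merges=5)
print(merges)
print(list(vocab))
```

On a biomedical corpus, frequent fragments such as "vir" tend to be merged early, which is why a domain-specific vocabulary can split biomedical terms into more meaningful pieces than a general-purpose one.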

### Open Question (Optional):

- What are the pros and cons of the above tokenizers?

In [None]:
###################
# TODO: add your solution
import os
import nltk
from transformers import GPT2TokenizerFast

# Set the input file paths
txt_file = r"E:\NLP\Text_part_1.txt"
bpe_model_dir = r"E:\NLP\bpe_model"

# Method 1: Python's built-in split()
def split_tokenizer(text):
    return text.split()

# Method 2: NLTK
nltk.download('punkt')
def nltk_tokenizer(text):
    return nltk.word_tokenize(text)

# Method 3: Byte-Pair Encoding (BPE) with GPT2TokenizerFast
def load_gpt2_tokenizer():
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    return tokenizer

# Load the GPT-2 tokenizer
gpt2_tokenizer = load_gpt2_tokenizer()

def gpt2_tokenizer_func(text):
    tokens = gpt2_tokenizer.tokenize(text)
    return tokens

# Read the file line by line and tokenize each line
def process_file_line_by_line(file_path, tokenizer_func):
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            tokens = tokenizer_func(line)
            print(tokens[:20])  # print the first 20 tokens of the line

# Example usage
print("split() tokenization result:")
process_file_line_by_line(txt_file, split_tokenizer)

print("NLTK tokenization result:")
process_file_line_by_line(txt_file, nltk_tokenizer)

print("GPT-2 tokenization result:")
process_file_line_by_line(txt_file, gpt2_tokenizer_func)
###################

In [None]:
import os
import re
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, processors
from tokenizers.normalizers import NFKC
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from tokenizers.processors import TemplateProcessing

# Set the input and output file paths
txt_file = r"E:\NLP\Text_part_1.txt"
bpe_model_dir = r"E:\NLP\bpe_model"
output_file = r"E:\NLP\tokens.txt"
vocab_file = r"E:\NLP\vocabulary.txt"

# Create the directory where the model is saved
os.makedirs(bpe_model_dir, exist_ok=True)

# Initialise the tokenizer with an empty BPE model
tokenizer = Tokenizer(models.BPE())

# Set the normalizer and the pre-tokenizer
tokenizer.normalizer = NFKC()
tokenizer.pre_tokenizer = Whitespace()

# Configure the trainer
trainer = BpeTrainer(vocab_size=10000, special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# Read the training data
with open(txt_file, "r", encoding="utf-8") as f:
    lines = f.readlines()

# Drop every character outside the printable ASCII range
filtered_lines = [re.sub(r'[^\x20-\x7E]', '', line) for line in lines]

# Train the tokenizer
tokenizer.train_from_iterator(filtered_lines, trainer=trainer)

# Set the post-processor, which wraps each sequence in <s> ... </s>
tokenizer.post_processor = TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> <s> $B </s>",
    special_tokens=[
        ("<s>", tokenizer.token_to_id("<s>")),
        ("</s>", tokenizer.token_to_id("</s>")),
    ],
)

# Save the tokenizer
tokenizer.save(os.path.join(bpe_model_dir, "bpe_tokenizer.json"))

# Get the vocabulary and write it to a file, one "index: token" entry per line
vocab = tokenizer.get_vocab()
with open(vocab_file, "w", encoding="utf-8") as f:
    for idx, (token, token_id) in enumerate(vocab.items()):
        f.write(f"{idx + 1}: {token}\n")

# Print the vocabulary size
print(f"Total vocabulary size: {len(vocab)}")

# Tokenize text with the trained tokenizer
def tokenize_text(text):
    return tokenizer.encode(text).tokens

# Example usage
if __name__ == '__main__':
    sample_text = "this is a sample text, let's see how the tokenizer works."
    tokens = tokenize_text(sample_text)
    print("Tokens:", tokens)

In [None]:
import os
from tokenizers import Tokenizer

# Set the file paths
bpe_model_dir = r"E:\NLP\bpe_model"
# input_file = r"D:\Scholar\AI Winter School 2025\Project\Ours\NLP\Text.txt"
input_file = r"E:\NLP\Text_part_1.txt"
output_file = r"E:\NLP\test_tokens.txt"
vocab_file = r"E:\NLP\vocabulary.txt"

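# Load the trained tokenizer
tokenizer = Tokenizer.from_file(os.path.join(bpe_model_dir, "bpe_tokenizer.json"))

# Load the vocabulary (each line has the form "index: token")
with open(vocab_file, "r", encoding="utf-8") as f:
    vocab = set(line.split(": ")[1].strip() for line in f)

# Tokenize text, keeping only ASCII, non-numeric tokens from the vocabulary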
def tokenize_text(text):
    tokens = tokenizer.encode(text).tokens
    return [token for token in tokens if token in vocab and token.isascii() and not token.isdigit()]

# Process the text in batches (batch_size is a size hint for readlines(), not a line count)
def process_in_batches(input_file, output_file, batch_size):
    with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", encoding="utf-8") as outfile:
        while True:
            lines = infile.readlines(batch_size)
            if not lines:
                break
            text = "".join(lines)
            tokens = tokenize_text(text)
            outfile.write(" ".join(tokens) + "\n")

# Example usage
if __name__ == '__main__':
    batch_size = 16384
    process_in_batches(input_file, output_file, batch_size)

Our thoughts:

Each tokenizer above has its pros and cons, depending on the task. Python's `split()` is the simplest and fastest option, but it leaves punctuation attached to words. `nltk_tokenizer` separates punctuation from words and so recognizes word boundaries more accurately, but it is still a fairly basic, general-purpose tokenizer.

`gpt2_tokenizer_func` uses Byte-Pair Encoding (BPE), which breaks rare words into known subword units, making it a strong choice for deep learning models. However, it can be slower, and it sometimes splits words awkwardly. Our custom BPE tokenizer (`3_my_tokenizer.py`) offers more control: it cleans the text and supports special tokens, but it requires training and setup.

Finally, `4_tokenize.py` applies the trained tokenizer to the real text and keeps only ASCII, non-numeric tokens from the vocabulary. This keeps the output clean but discards any token that fails those filters.

In short, `split()` or `nltk_tokenizer` is enough for simple needs, `gpt2_tokenizer_func` is better suited to neural models, and the custom BPE tokenizer is the best choice when full customization is required.

## Part 3 (30%): Build Word Representations

Build word representations for each extracted word. If hardware resources are limited, you may limit the vocabulary size to 10k words/tokens (or even fewer) and the dimension of the representations to 256.

The following tracks can be developed independently. You are encouraged to divide the workload to each team member.

### Track 3.1 (15%): Use N-gram Language Modeling

N-gram Language Modeling predicts a target word from the `n-1` words of the preceding context. Specifically,

$P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-n+1})$

For example, given a sentence, `"the main symptoms of COVID-19 are fever and cough"`, if $n=7$, we use previous context `["the", "main", "symptoms", "of", "COVID-19", "are"]` to predict the next word `"fever"`.

More to read: https://web.stanford.edu/~jurafsky/slp3/3.pdf
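
The example above can be reproduced with a few lines of Python. This is a minimal sketch of how training pairs might be built; the function name `ngram_examples` is hypothetical, and how the pairs are then fed to a model is left open.

```python
def ngram_examples(tokens, n):
    """Return (context, target) pairs; each context holds the n-1 previous tokens."""
    return [
        (tokens[i - n + 1:i], tokens[i])
        for i in range(n - 1, len(tokens))
    ]


sentence = "the main symptoms of COVID-19 are fever and cough".split()
examples = ngram_examples(sentence, n=7)
for context, target in examples:
    print(context, "->", target)
```

With $n=7$ the nine-token sentence yields three pairs, the first of which is the six-word context ending in `"are"` with target `"fever"`.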

Recommended outputs:

- A fixed vector for each word/token.

### Track 3.2 (15%): Use Skip-gram with Negative Sampling

In skip-gram, we use a central word to predict its context. Specifically,

$P(w_{c-m}, ... w_{c-1}, w_{c+1}, ..., w_{c+m} | w_c)$

As the learning objective of skip-gram is computationally inefficient (it requires a summation over the entire vocabulary $|V|$), negative sampling is commonly applied to accelerate training.

In negative sampling, we randomly select one word from the context as a positive sample, and randomly select $K$ words from the vocabulary as negative samples. As a result, the learning objective is updated to

$L = -\log\sigma(u^T_{t} v_c) - \sum_{k=1}^K\log\sigma(-u^T_k v_c)$, where $u_t$ is the vector embedding of positive sample from context, $u_k$ are the vector embeddings of negative samples, $v_c$ is the vector embedding of the central word, $\sigma$ refers to the sigmoid function.

More to read: http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf (sections 4.3 and 4.4)
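
The negative-sampling loss $L$ above can be evaluated directly for one training example. The sketch below uses plain Python lists and made-up 3-dimensional vectors purely for illustration; a real implementation would use a tensor library and also compute gradients.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(v_c, u_t, negatives):
    """L = -log sigma(u_t . v_c) - sum_k log sigma(-u_k . v_c)."""
    # Positive term: pull the context vector u_t towards the centre vector v_c.
    loss = -math.log(sigmoid(dot(u_t, v_c)))
    # Negative terms: push each sampled vector u_k away from v_c.
    for u_k in negatives:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss


# Toy vectors (made up): centre word, one positive context word, K=2 negatives
v_c = [0.5, -0.2, 0.1]
u_t = [0.4, -0.1, 0.3]
negatives = [[-0.3, 0.2, 0.0], [0.1, 0.4, -0.5]]
loss = negative_sampling_loss(v_c, u_t, negatives)
print(loss)
```

A useful sanity check is that with all-zero vectors every sigmoid equals 0.5, so the loss is $(1+K)\log 2$.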

Recommended outputs:

- A fixed vector for each word/token.

### Track 3.3 (Bonus +5%): Use Contextualised Word Representation by Masked Language Model (MLM)

BERT introduces a new language model for pre-training named Masked Language Model (MLM). The advantage of MLM is that the word representations by MLM will be contextualised.

For example, "stick" may have different meanings in different contexts. With N-gram language modeling and word2vec (skip-gram, CBOW), the word representation of "stick" is fixed regardless of its context. MLM, however, learns the representation of "stick" dynamically based on context. In other words, "stick" will have different representations in different contexts under MLM.

More to read: http://jalammar.github.io/illustrated-bert/ and https://arxiv.org/pdf/1810.04805.pdf
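
To give a feel for how MLM training data is prepared, the sketch below randomly replaces tokens with a mask symbol and records the original token as the label. It is a simplification: BERT masks about 15% of tokens and additionally sometimes keeps or randomises the selected token, which is omitted here. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask tokens; labels hold the original token at masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)  # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)   # position is not scored
    return masked, labels


tokens = "the main symptoms of COVID-19 are fever and cough".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the model sees the words on both sides of each masked position, the representation it builds for a token depends on the surrounding sentence.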

Recommended outputs:

- An algorithm that is able to generate contextualised representation in real time.

In [None]:
###################
# TODO: add your solution
import os
from gensim.models import Word2Vec

# Input: path to the token file
token_file = r"E:\NLP\test_tokens.txt"
# token_file = r"D:\Scholar\AI Winter School 2025\Project\Ours\NLP\test_tokens.txt"
# Output: path where the model is saved
model_file = r"E:\NLP\Skip_gram.model"
# model_file = r"D:\Scholar\AI Winter School 2025\Project\Ours\NLP\test_Skip_gram.model"


# Read the tokens
with open(token_file, "r", encoding="utf-8") as f:
    tokens = f.read().split()

# Method 1: N-grams
def generate_ngrams(tokens, n):
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

# Method 2: Skip-gram
def generate_skipgrams(tokens, window_size, vector_size=100, min_count=1, sg=1, workers=6):
    # Split the token stream into pseudo-sentences of 100 tokens each
    sentences = [tokens[i:i+100] for i in range(0, len(tokens), 100)]
    model = Word2Vec(vector_size=vector_size, window=window_size, min_count=min_count, sg=sg, workers=workers)
    model.build_vocab(sentences)
    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
    return model

# Example usage
n = 3  # the n of the N-grams
window_size = 2  # context window size for Skip-gram

# Generate the N-grams
ngrams = generate_ngrams(tokens, n)
print("First 10 N-grams:", ngrams[:10])

# Train the Skip-gram model
skipgram_model = generate_skipgrams(tokens, window_size)
print("First 10 words in the Skip-gram vocabulary:", list(skipgram_model.wv.index_to_key)[:10])

# Save the model
skipgram_model.save(model_file)
print(f"Model saved to {model_file}")

# Load the model
loaded_model = Word2Vec.load(model_file)
print("Model loaded")

# Example: look up the vector representations of several words
word_vectors = loaded_model.wv
words = ["COVID", "lockdowns", "impact"]
for word in words:
    if word in word_vectors:
        print(f"Vector representation of '{word}':", word_vectors[word])
    else:
        print(f"Word '{word}' not in vocabulary")
###################

## Part 4 (20%): Explore the Word Representations

The following tracks can be finished independently. You are encouraged to divide workload to each team member.

### Track 4.1 (5%): Visualise the word representations by t-SNE

t-SNE is a dimensionality reduction algorithm commonly used to visualise high-dimensional vectors. Use t-SNE to visualise the word representations. You may visualise up to 1000 words, as t-SNE is computationally expensive.

More about t-SNE: https://lvdmaaten.github.io/tsne/

Recommended output:

- A diagram by t-SNE based on representations of up to 1000 words.

### Track 4.2 (5%): Visualise the Word Representations of Biomedical Entities by t-SNE

Instead of visualising the word representations of the entire vocabulary (or 1000 words selected at random), visualise the representations of words that are biomedical entities, for example fever, cough and diabetes. Based on the category of those biomedical entities, can you assign different colours to the entities and see whether entities from the same category are clustered by t-SNE? For example, sinusitis and cough are both respiratory conditions, so they should be assigned the same colour, and ideally their representations should be close to each other in the t-SNE plot. As another example, Alzheimer's and headache are neurological conditions, which should be assigned another colour.

Examples of biomedical ontologies: https://www.ebi.ac.uk/ols/ontologies/hp and https://en.wikipedia.org/wiki/International_Classification_of_Diseases

Recommended output:

- A diagram with colours by t-SNE based on representations of biomedical entities.

### Track 4.3 (5%): Co-occurrence

- What are the biomedical entities which frequently co-occur with COVID-19 (or coronavirus)?

Recommended outputs:

- A sorted list of biomedical entities and description on how the entities are selected and sorted.
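
One simple way to approach this track is to count, for each candidate entity, the number of sentences in which it appears together with the target term. The sketch below assumes the text has already been tokenized into sentences and that a set of entity names is available; the sample sentences, the entity set and the function name `cooccurrence_counts` are made up for illustration.

```python
from collections import Counter


def cooccurrence_counts(sentences, target, entities):
    """Count in how many sentences each entity appears together with target."""
    counts = Counter()
    for tokens in sentences:
        present = set(tokens)
        if target in present:
            counts.update(e for e in entities if e in present and e != target)
    # Sort by descending count
    return counts.most_common()


sentences = [
    ["COVID-19", "causes", "fever", "and", "cough"],
    ["fever", "is", "common", "in", "influenza"],
    ["COVID-19", "patients", "reported", "fever"],
]
entities = {"fever", "cough", "influenza"}
ranked = cooccurrence_counts(sentences, "COVID-19", entities)
print(ranked)
```

Raw counts favour entities that are frequent everywhere, so a normalised score such as pointwise mutual information may be worth considering when describing how the list is sorted.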

### Track 4.4 (5%): Semantic Similarity

- Which biomedical entities have the closest semantic similarity to COVID-19 (or coronavirus) based on word representations?

Recommended outputs:

- A sorted list of biomedical entities and description on how the entities are selected and sorted.
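
Semantic similarity between word representations is commonly measured with cosine similarity. The sketch below ranks a list of candidate entities against a query word; the 3-dimensional vectors are invented for illustration, and the helper names `cosine` and `most_similar` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, embeddings, candidates):
    """Rank candidate words by cosine similarity to the query word."""
    scores = [
        (word, cosine(embeddings[query], embeddings[word]))
        for word in candidates
        if word in embeddings and word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (made up); real vectors would come from the trained model
embeddings = {
    "coronavirus": [0.9, 0.1, 0.0],
    "influenza": [0.8, 0.2, 0.1],
    "fever": [0.3, 0.9, 0.1],
    "kidney": [0.0, 0.1, 0.9],
}
ranked = most_similar("coronavirus", embeddings, ["influenza", "fever", "kidney"])
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

With a gensim model, the `most_similar` method of the trained word vectors offers the same kind of ranking over the whole vocabulary, after which the results can be filtered down to biomedical entities.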

### Open Question (Optional): What else can you discover?


In [None]:
###################
# TODO: add your solution
import os
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.manifold import TSNE
from gensim.models import Word2Vec
import numpy as np
import mplcursors

# Input: path to the Skip-gram model
# model_file = r"D:\Scholar\AI Winter School 2025\Project\Ours\NLP\test_Skip_gram.model"
model_file = r"E:\NLP\Skip_gram.model"

# Load the model
loaded_model = Word2Vec.load(model_file)
print("Model loaded")

# Get the word vectors
word_vectors = loaded_model.wv

# Take up to 1000 words and their vectors
words = list(word_vectors.index_to_key)[:1000]
vectors = np.array([word_vectors[word] for word in words])

# Reduce to 3D with t-SNE
tsne = TSNE(n_components=3, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Plot the result
fig = plt.figure(figsize=(14, 10))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(reduced_vectors[:, 0], reduced_vectors[:, 1], reduced_vectors[:, 2])

# Label each point with its word
for i, word in enumerate(words):
    ax.text(reduced_vectors[i, 0], reduced_vectors[i, 1], reduced_vectors[i, 2], word)

# Add interactive hover annotations
mplcursors.cursor(scatter, hover=True)

plt.title('3D t-SNE visualization of word vectors')
plt.show()
###################

## Part 5 (Bonus +10%): Open Challenge: Mining Biomedical Knowledge

A fundamental task in clinical/biomedical natural language processing is to extract intelligence from biomedical text corpus automatically and efficiently. More specifically, the intelligence may include biomedical entities mentioned in text, relations between biomedical entities, clinical features of patients, progression of diseases, all of which can be used to predict, understand and improve patients' outcomes. 

This open challenge is to build a biomedical knowledge graph based on the CORD-19 dataset and mine useful information from it. We recommend the following steps but you are also encouraged to develop your solution from scratch.

### Extract Biomedical Entities from Text

Extract biomedical entities (such as fever, cough, headache, lung cancer, heart attack) from text. Note that:

- The biomedical entities may consist of multiple words. For example, heart attack, multiple myeloma etc.
- The biomedical entities may be written as synonyms. For example, low blood pressure for hypotension.
- The biomedical entities may be written in different forms. For example, smoking, smokes, smoked.
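
The three points above (multi-word entities, synonyms and inflected forms) can all be handled by a small lexicon that maps surface forms to a canonical name. The sketch below is a dictionary-lookup baseline, not a full named-entity recogniser; the lexicon content and the function name `extract_entities` are made up for illustration.

```python
import re

# Hypothetical lexicon: canonical entity name -> surface forms
LEXICON = {
    "hypotension": ["hypotension", "low blood pressure"],
    "heart attack": ["heart attack", "myocardial infarction"],
    "smoking": ["smoking", "smokes", "smoked"],
}


def extract_entities(text, lexicon=LEXICON):
    """Return the sorted canonical names whose surface forms occur in text."""
    lowered = text.lower()
    found = set()
    for canonical, forms in lexicon.items():
        for form in forms:
            # Word boundaries prevent matches inside longer words.
            if re.search(r"\b" + re.escape(form) + r"\b", lowered):
                found.add(canonical)
                break
    return sorted(found)


text = "The patient smoked for years and presented with low blood pressure after a heart attack."
entities = extract_entities(text)
print(entities)
```

In practice the lexicon could be populated from an ontology, or replaced by a trained entity recogniser such as the ones shipped with SciSpaCy.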

### Extract Relations between Biomedical Entities

Extract relations between biomedical entities based on their appearance in text. You may define a relation between biomedical entities by one or more of the following criteria:

- The biomedical entities frequently co-occur.
- The biomedical entities have similar word representations.
- The biomedical entities have clear relations based on textual narratives. For example, "The most common symptoms for COVID-19 are fever and cough" so we know there are relations between "COVID-19", "fever" and "cough".

### Build a Biomedical Knowledge Graph of COVID-19

Build a knowledge graph from the entities and relations extracted in the two steps above, and visualise it.
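
As a starting point, a knowledge graph can be stored as a weighted adjacency dictionary in which an edge links two entities that co-occur in at least `min_count` sentences. The sketch below assumes the entities have already been extracted per sentence; the sample data, the threshold and the function name `build_graph` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(sentence_entities, min_count=2):
    """Build an undirected graph; edge weight = number of shared sentences."""
    weights = defaultdict(int)
    for entities in sentence_entities:
        # Sort so that each unordered pair is always counted under one key.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    graph = defaultdict(dict)
    for (a, b), weight in weights.items():
        if weight >= min_count:
            graph[a][b] = weight
            graph[b][a] = weight
    return dict(graph)


# Entities found in each sentence (made-up sample)
sentence_entities = [
    ["COVID-19", "fever", "cough"],
    ["COVID-19", "fever"],
    ["influenza", "fever"],
    ["COVID-19", "cough"],
]
graph = build_graph(sentence_entities)
for node, neighbours in graph.items():
    print(node, "->", neighbours)
```

The resulting dictionary can be handed to a graph library for layout and drawing; pairs seen only once (such as influenza and fever here) are dropped by the threshold.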

In [None]:
###################
# TODO: add your solution
import os
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from gensim.models import Word2Vec
import numpy as np
import plotly.express as px
import plotly.io as pio

# Input: path to the Skip-gram model
model_file = r"D:\Scholar\AI Winter School 2025\Project\Ours\NLP\Skip_gram.model"

# Load the model
loaded_model = Word2Vec.load(model_file)
print("Model loaded")

# Get the word vectors
word_vectors = loaded_model.wv

# Define the biomedical entities and their categories
biomedical_entities = {
    "fever": "Symptoms",
    "cough": "Symptoms",
    "headache": "Symptoms",
    "seizure": "Symptoms",
    "infection": "Symptoms",
    "inflammation": "Symptoms",
    "nausea": "Symptoms",
    "fatigue": "Symptoms",
    "pain": "Symptoms",
    "diarrhea": "Symptoms",
    "vomiting": "Symptoms",
    "rash": "Symptoms",
    "dizziness": "Symptoms",
    "weakness": "Symptoms",
    "chills": "Symptoms",
    "sore": "Symptoms",
    "swelling": "Symptoms",
    "bleeding": "Symptoms",
    "itching": "Symptoms",
    "sneezing": "Symptoms",
    "pneumonia": "Disease",
    "asthma": "Disease",
    "diabetes": "Disease",
    "stroke": "Disease",
    "COVID": "Disease",
    "SARS": "Disease",
    "influenza": "Disease",
    "dengue": "Disease",
    "cancer": "Disease",
    "tuberculosis": "Disease",
    "arthritis": "Disease",
    "leukemia": "Disease",
    "hepatitis": "Disease",
    "HIV": "Disease",
    "malaria": "Disease",
    "cholera": "Disease",
    "measles": "Disease",
    "mumps": "Disease",
    "rabies": "Disease",
    "syphilis": "Disease",
    "virus": "Microorganisms",
    "bacteria": "Microorganisms",
    "coronavirus": "Microorganisms",
    "rhinovirus": "Microorganisms",
    "adenovirus": "Microorganisms",
    "pathogen": "Microorganisms",
    "fungi": "Microorganisms",
    "parasite": "Microorganisms",
    "protozoa": "Microorganisms",
    "microbe": "Microorganisms",
    "yeast": "Microorganisms",
    "mold": "Microorganisms",
    "bacillus": "Microorganisms",
    "spirochete": "Microorganisms",
    "mycoplasma": "Microorganisms",
    "prion": "Microorganisms",
    "helminth": "Microorganisms",
    "amoeba": "Microorganisms",
    "algae": "Microorganisms",
    "archaea": "Microorganisms",
    "brain": "Organ",
    "neuron": "Organ",
    "liver": "Organ",
    "pancreas": "Organ",
    "lung": "Organ",
    "spleen": "Organ",
    "kidney": "Organ",
    "heart": "Organ",
    "nerve": "Organ",
    "muscle": "Organ",
    "stomach": "Organ",
    "intestine": "Organ",
    "bladder": "Organ",
    "skin": "Organ",
    "bone": "Organ",
    "eye": "Organ",
    "ear": "Organ",
    "tongue": "Organ",
    "throat": "Organ",
    "esophagus": "Organ",
}

# Collect the biomedical entities found in the vocabulary, with their vectors and categories
words = []
vectors = []
categories = []
for word in biomedical_entities.keys():
    if word in word_vectors:
        words.append(word)
        vectors.append(word_vectors[word])
        categories.append(biomedical_entities[word])

vectors = np.array(vectors)

# Set perplexity, making sure it is smaller than the number of samples
perplexity = min(30, len(vectors) - 1)

# Reduce to 3D with t-SNE
tsne_3d = TSNE(n_components=3, random_state=42, perplexity=perplexity)
reduced_vectors_3d = tsne_3d.fit_transform(vectors)

# Visualise the 3D embedding with Plotly
fig_3d = px.scatter_3d(
    x=reduced_vectors_3d[:, 0],
    y=reduced_vectors_3d[:, 1],
    z=reduced_vectors_3d[:, 2],
    color=categories,
    text=words,
    labels={'color': 'Category'},
    title='3D t-SNE visualization of biomedical entities'
)

# Save the figure as an HTML file
output_file_3d = r"D:\Scholar\AI Winter School 2025\Project\Ours\NLP\biomedical_entities_3d.html"
pio.write_html(fig_3d, file=output_file_3d, auto_open=True)

# Reduce to 2D with t-SNE
tsne_2d = TSNE(n_components=2, random_state=42, perplexity=perplexity)
reduced_vectors_2d = tsne_2d.fit_transform(vectors)

# Map each category to a colour
colors = {
    "Symptoms": "red",
    "Disease": "blue",
    "Microorganisms": "green",
    "Organ": "purple",
    # Add more categories and their colours here
}

# Visualise the 2D embedding with Matplotlib
plt.figure(figsize=(14, 10))
for i, word in enumerate(words):
    plt.scatter(reduced_vectors_2d[i, 0], reduced_vectors_2d[i, 1], color=colors[categories[i]], label=categories[i])
    plt.text(reduced_vectors_2d[i, 0], reduced_vectors_2d[i, 1], word)
plt.title('2D t-SNE visualization of biomedical entities')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')

# Remove duplicate entries from the legend
handles, labels = plt.gca().get_legend_handles_labels()
by_label = dict(zip(labels, handles))
plt.legend(by_label.values(), by_label.keys())

plt.show()
###################