## Instruction

> 1. Rename Assignment-01-###.ipynb where ### is your student ID.
> 2. The deadline of Assignment-01 is 23:59, 03-19-2025.
>
> 3. In this assignment, you will
>    1) explore Wikipedia text data
>    2) build language models
>    3) build NB and LR classifiers
>
> Download the preprocessed data, enwiki-train.json and enwiki-test.json, from the Assignment-01 folder. In each data file, every line contains one Wikipedia page with three attributes: title, label, and text. There are 9000 records in the train file and 1000 records in the test file, covering ten categories.

## Task1 - Data exploring

> 1) Print out how many documents are in each class  (for both train and test dataset)

In [1]:
# Your code goes here
import json


# Define the file paths
train_path = "/data/nlp-zbj-22/data/enwiki-train.json"
test_path = "/data/nlp-zbj-22/data/enwiki-test.json"

# Read the files line by line (one JSON object per line); a single json.loads on the whole file raised a decoding error
train_data = []
with open(train_path, "r", encoding="utf-8") as f:
    for line in f:
        train_data.append(json.loads(line))

test_data = []
with open(test_path, "r", encoding="utf-8") as f:
    for line in f:
        test_data.append(json.loads(line))

# Print a sample to inspect the data
# print("Train data sample:", train_data[:1])  
# print("Test data sample:", test_data[:1]) 

train_text=[item["text"] for item in train_data]
train_title=[item["title"] for item in train_data]
train_label=[item["label"] for item in train_data]

test_text = [item["text"] for item in test_data]
test_title = [item["title"] for item in test_data]
test_label = [item["label"] for item in test_data]


print(len(train_data))
print(len(test_data))


9000
1000
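
The cell above prints only the total number of records. A minimal sketch of the per-class breakdown the task asks for, assuming each record is a dict with a `label` key as described in the instructions; `class_counts` and the toy `sample` list are hypothetical names used only for illustration:

```python
from collections import Counter

def class_counts(records):
    """Return a Counter mapping each label to its number of documents."""
    return Counter(item["label"] for item in records)

# Toy records standing in for train_data / test_data (hypothetical values).
sample = [
    {"title": "A", "label": "Book", "text": "..."},
    {"title": "B", "label": "Film", "text": "..."},
    {"title": "C", "label": "Book", "text": "..."},
]
counts = class_counts(sample)
for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```

On the real data this would be called as `class_counts(train_data)` and `class_counts(test_data)`.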


> 2) Print out the average number of sentences in each class.
>    You may need to use sentence tokenization tools from nltk or spacy.
>    (for both train and test dataset)


In [2]:
# Your code goes here


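import spacy  # spaCy runs on the CPU by default; the GPU must be enabled explicitly (require_gpu makes it mandatory, prefer_gpu falls back to the CPU)
# Use nlp.pipe for batched processing: pass the texts as a list and nlp.pipe
# splits them into batches (controlled by batch_size) and runs each batch through the pipeline.
# The first attempt exhausted host RAM (not GPU memory) because every document was processed at once, so the data is now streamed in chunks.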

import torch
import re
import os
from collections import Counter, defaultdict  # used for per-class statistics
import heapq
from tqdm import tqdm

class WordCount:
    def __init__(self, word, count):
        self.word = word
        self.count = count

    def __lt__(self, other):
        return self.count < other.count  


class processor:  # Design note: work out the dependency order between methods first, then decide how to split the work among them
    def __init__(self,label,vocab_size=80000,unk_limit=5):
        
        os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # ask the CUDA caching allocator to use expandable segments
        
        self.label=label  # raw list of labels, one entry per document
        self.label_num_dict=Counter(label)  # number of documents per class
        
        # defaultdict takes a factory callable (int, list, set, ...), not the label list; int gives a default of 0
        self.label_sent_dict = defaultdict(int)  
        self.label_word_dict = defaultdict(int)  
        for item in label:
            self.label_sent_dict[item] = 0
            self.label_word_dict[item] = 0   
               
        self.vocab=None
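        self.vocab_size = vocab_size  # maximum vocabulary size
        # The vocabulary is a plain Python set, so self.vocab lives in host memory, not GPU memory
        
        self.unk_limit = unk_limit  # frequency threshold below which a word becomes UNK
        self.unk_token = "UNK"  # placeholder for low-frequency words
        self.word_counter=None  # needed by preprocess_data; self.vocab alone is not enough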
        
        spacy.require_gpu()

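        # The pipeline has to be loaded after the GPU is enabled so that it runs there.
        # Loading it in every instance would waste memory, so the model is cached
        # on the class and shared by all processor instances (singleton pattern).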
        if not hasattr(processor, '_shared_model'):
            processor._shared_model = spacy.load("en_core_web_sm")
        self.my_model = processor._shared_model 
        
        # Shared nlp.pipe arguments, defined once to avoid repetition
        self.spacy_config = {
            "batch_size":8
        }
        self.mem_batch_size = 64  # number of documents per in-memory chunk
       
    def __del__(self):
        """Destructor: release cached GPU memory when the object is destroyed."""
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    # Stream the data in chunks to keep host memory usage low
    def _batch_generator(self, data):
        batch = []
        for item in data:
            batch.append(item)
            if len(batch) >= self.mem_batch_size:
                yield batch
                batch = []
                torch.cuda.empty_cache()  # release cached GPU memory after each chunk
        if batch:
            yield batch
            torch.cuda.empty_cache()  # release again after the final partial chunk
        
    def count_sentences(self,data):  # data is train_data or test_data
        self.label_sent_dict.clear()  # reset on every call so repeated calls do not accumulate
        total = len(data)
        with tqdm(total=total, desc="Counting sentences", bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}, {percentage:3.0f}%]") as pbar:  # wrap the loop in a progress bar
            for batch in self._batch_generator(data):
                texts = [item["text"] for item in batch]
                labels = [item["label"] for item in batch]
                # Materialise the spaCy Doc objects for this chunk
                docs = list(self.my_model.pipe(texts, **self.spacy_config))
                for doc, label in zip(docs, labels):
                    sent_count = len(list(doc.sents))
                    # Accumulate each document's share of the class average
                    self.label_sent_dict[label] += sent_count / self.label_num_dict[label]
                    del doc  # drops only the loop variable's reference; docs still holds the object
                # Drop all references to this chunk's intermediates
                del texts, labels, docs
                torch.cuda.empty_cache()  # release cached GPU memory (calling this too often adds overhead)
                pbar.update(len(batch)) 
        return self.label_sent_dict    
        
    def count_words(self,data):  # data is train_data or test_data
        self.label_word_dict.clear()
        total = len(data)
        with tqdm(total=total, desc="Counting words", bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}, {percentage:3.0f}%]") as pbar:
            for batch in self._batch_generator(data):
                texts = [item["text"] for item in batch]
                labels = [item["label"] for item in batch]
                
                docs = list(self.my_model.pipe(texts, **self.spacy_config))  # materialise the list so it can be released explicitly
                for doc, label in zip(docs, labels):
                    word_count = len([t for t in doc if not t.is_punct and not t.is_space])
                    self.label_word_dict[label] += word_count / self.label_num_dict[label]
                    
                    del doc  # drop the loop variable's reference
                
                # Drop this chunk's intermediates
                del texts, labels, docs
                torch.cuda.empty_cache()
                
                pbar.update(len(batch))
        return self.label_word_dict
                 
    def build_vocab(self,data):
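    # Whether to build the vocabulary or preprocess the text first depends on the task and the data.
    # In this design build_vocab must be called before preprocess_data and first_forty.
    # The vocabulary is limited only by a fixed size; low-frequency words are not handled here.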
        self.word_counter = Counter()
        total = len(data)
        with tqdm(total=total, desc="Building vocabulary", bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}, {percentage:3.0f}%]") as pbar:
            for batch in self._batch_generator(data):
                texts = [item["text"] for item in batch]
                docs = list(self.my_model.pipe(texts, **self.spacy_config))
                for doc in docs:
                    words = [token.text.lower() for token in doc
                            if not token.is_punct
                            and not token.is_space
                            and not token.is_stop
                            and not token.is_digit
                            and not token.is_currency
                            and token.text != '<'
                            and token.text != '>']
                    self.word_counter.update(words)
                    del doc
                # Release this chunk's texts and docs
                del texts, docs
                torch.cuda.empty_cache()
                pbar.update(len(batch)) 
        self.vocab = {word for word, freq in self.word_counter.most_common(self.vocab_size)}
        self.vocab.add(self.unk_token)
    
    def preprocess_data(self, data):
        # Preprocess the text, replacing low-frequency and out-of-vocabulary words with UNK
        if self.word_counter is None:
            raise ValueError("Vocabulary has not been built. Call build_vocab first.")
    
        processed_data = []
        total = len(data)
        with tqdm(total=total, desc="Preprocessing data", bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}, {percentage:3.0f}%]") as pbar:
            for batch in self._batch_generator(data):
                # Extract the text and its label from each item in the chunk
                texts = [item["text"] for item in batch]
                labels = [item["label"] for item in batch]
                docs = list(self.my_model.pipe(texts, **self.spacy_config))  # turn the raw strings into spaCy Doc objects
                for item_label, doc in zip(labels, docs):
                    words = [
                        token.text.lower() 
                        for token in doc
                        if not token.is_punct
                        and not token.is_space
                        and not token.is_stop
                        and not token.is_digit
                        and not token.is_currency
                        and token.text not in ('<', '>')
                    ]
                    
                    # Replace low-frequency words with UNK
                    words = [
                        word if (word in self.vocab and self.word_counter[word] >= self.unk_limit)
                        else self.unk_token
                        for word in words
                    ]
                    
                    # Append the processed record to the result list
                    processed_data.append({
                        "label": item_label,  # label taken from the original record
                        "text": " ".join(words)
                    })
                    
                    del doc  # drop the loop variable's reference
                
                # Drop this chunk's intermediates
                del texts, labels, docs
                torch.cuda.empty_cache()  # release cached GPU memory after each chunk
                pbar.update(len(batch))
        
        return processed_data

    def first_forty(self, data):
        first_articles = {}
        remaining_labels = set(self.label_num_dict.keys())
        total = len(remaining_labels)  # the progress bar counts classes, not documents
        with tqdm(total=total, desc="Finding first forty words", bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}, {percentage:3.0f}%]") as pbar:  
            for batch in self._batch_generator(data):
                # Early exit 1: every class already has its first article
                if not remaining_labels:
                    break  # leave the outer loop
                texts = [item["text"] for item in batch]
                titles = [item["title"] for item in batch]
                labels = [item["label"] for item in batch]
                docs = list(self.my_model.pipe(texts, **self.spacy_config))
                for doc, label, title in zip(docs, labels, titles):
                    # Early exit 2
                    if not remaining_labels:
                        break  # leave the inner loop
                    if label not in first_articles and label in remaining_labels:
                        count_dict=defaultdict(int)
                        for token in doc:
                            if (not token.is_punct and not token.is_space and not token.is_stop and not token.is_digit  and not token.is_currency and token.text != '<' and token.text !='>'):
                                word=token.text.lower()
                                count_dict[word]+=1 
                        heap = []
                        for word, count in count_dict.items():
                            if len(heap) < 40:
                                heapq.heappush(heap, WordCount(word, count))
                            else:
                                if count > heap[0].count:
                                    heapq.heappop(heap)  
                                    heapq.heappush(heap, WordCount(word, count))     
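                        del doc  # drops only this reference; the chunk is released together below
                        
                        # If __lt__ were defined with > instead, heapq would keep the largest count at
                        # heap[0], and the eviction test above would no longer keep the top 40
                        # (heappop always removes the root of the heap).
                        top_words = [wc.word for wc in heap]
                        top_words = top_words[::-1]  # reversed heap storage order; note this is not a strict descending sort by count
                        first_articles[label] = (title, top_words)
                        remaining_labels.remove(label)
                        pbar.update(1)  # one tick per class
                        
                # Release this chunk's temporaries.
                # del only removes a reference; torch.cuda.empty_cache() then returns
                # unused cached GPU blocks to the driver.
                del texts, titles, labels, docs 
                torch.cuda.empty_cache() 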
        return [{
            "Class": label,
            "First article's name": first_articles[label][0],
            "Processed words": ' '.join(first_articles[label][1])
        } for label in self.label_num_dict.keys() if label in first_articles]  


# # First run: compute the preprocessing results
# # Process the training data
# p_train = processor(train_label)
# p_train.build_vocab(train_data)
# pre_data_train = p_train.preprocess_data(train_data)
# first_forty_train = p_train.first_forty(train_data)
# sent_train = p_train.count_sentences(train_data)
# word_train = p_train.count_words(train_data)



# # Process the test data
# p_test = processor(test_label)
# p_test.vocab = p_train.vocab  # reuse the training vocabulary
# p_test.word_counter = p_train.word_counter  # reuse the training word frequencies
# pre_data_test = p_test.preprocess_data(test_data)
# first_forty_test = p_test.first_forty(test_data)
# sent_test = p_test.count_sentences(test_data)
# word_test = p_test.count_words(test_data)
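
`first_forty` maintains a size-40 min-heap by hand. A heap's underlying list is only partially ordered, so reversing it does not produce a ranking by count. A minimal sketch of an alternative using `heapq.nlargest`, which returns its results sorted from highest to lowest; `top_k_words` is a hypothetical helper and not part of the class above:

```python
import heapq
from collections import Counter

def top_k_words(words, k=40):
    """Return the k most frequent words, most frequent first."""
    counts = Counter(words)
    # nlargest tracks at most k candidates and returns them in descending key order.
    best = heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])
    return [word for word, _ in best]

tokens = "a b a c b a d".split()
top2 = top_k_words(tokens, k=2)
print(top2)
```

`Counter.most_common(k)` would give the same ranking; the heap-based form is shown only because it mirrors the approach taken in `first_forty`.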







In [9]:
current_dir = os.path.abspath("")
parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir))
save_dir = os.path.join(parent_dir, "processed_data")


# Load the cached per-class average word counts produced by count_words
with open(os.path.join(save_dir, "train_word_counts.json")) as f:
    word_train = json.load(f)
with open(os.path.join(save_dir, "test_word_counts.json")) as f:
    word_test = json.load(f)
print(word_train)
print(word_test)

{'Book': 4602.229603729603, 'Food': 2925.3966942148754, 'Film': 3827.0879360465124, 'Politician': 5029.373147340897, 'Animal': 1232.7682926829268, 'Writer': 4892.925877763328, 'Artist': 4147.875273522975, 'Disease': 7096.128712871285, 'Actor': 1400.3544303797469, 'Software': 4227.548117154812}
{'Politician': 5248.6736292428195, 'Writer': 5006.0, 'Book': 4441.5641025641025, 'Film': 3745.638513513515, 'Artist': 3899.142857142858, 'Actor': 3624.0, 'Food': 2974.8125, 'Disease': 6946.499999999999, 'Software': 4144.148148148149, 'Animal': 1122.0}
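
The averages above were accumulated as `count / class_size` one document at a time, which is likely why some values carry floating-point noise (for example `6946.499999999999`). A minimal sketch that sums first and divides once per class; `average_per_class` and the toy inputs are hypothetical:

```python
from collections import defaultdict

def average_per_class(labels, values):
    """Average `values` within each label by summing first, then dividing once."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(labels, values):
        totals[label] += value
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}

labels = ["Book", "Film", "Book"]
token_counts = [100, 30, 300]
averages = average_per_class(labels, token_counts)
print(averages)
```

This form also does not depend on the class sizes being known in advance, so the same helper works for any subset of the data.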


In [3]:
# # # experiment-version


# # Your code goes here

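# import spacy  # spaCy runs on the CPU by default; the GPU must be enabled explicitly (require_gpu makes it mandatory, prefer_gpu falls back to the CPU)
# # Use nlp.pipe for batched processing: pass the texts as a list and nlp.pipe
# # splits them into batches (controlled by batch_size) and runs each batch through the pipeline.
# # The first attempt exhausted host RAM (not GPU memory) because every document was processed at once, so the data is now streamed in chunks.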

# import torch
# import re
# import os
# from collections import Counter, defaultdict  # used for per-class statistics
# import heapq
# from tqdm import tqdm

# class WordCount:
#     def __init__(self, word, count):
#         self.word = word
#         self.count = count

#     def __lt__(self, other):
#         return self.count < other.count  


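# class processor:  # Design note: work out the dependency order between methods first, then decide how to split the work among them
#     def __init__(self,label,vocab_size=20000,unk_limit=10):
        
#         os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # ask the CUDA caching allocator to use expandable segments
        
#         self.label=label  # raw list of labels, one entry per document
#         self.label_num_dict=Counter(label)  # number of documents per class
        
#         # defaultdict takes a factory callable (int, list, set, ...), not the label list; int gives a default of 0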
#         self.label_sent_dict = defaultdict(int)  
#         self.label_word_dict = defaultdict(int)  
#         for item in label:
#             self.label_sent_dict[item] = 0
#             self.label_word_dict[item] = 0   
               
#         self.vocab=None
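#         self.vocab_size = vocab_size  # maximum vocabulary size
#         # The vocabulary is a plain Python set, so self.vocab lives in host memory, not GPU memory
        
#         self.unk_limit = unk_limit  # frequency threshold below which a word becomes UNK
#         self.unk_token = "UNK"  # placeholder for low-frequency words
#         self.word_counter=None  # needed by preprocess_data; self.vocab alone is not enough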
        
#         spacy.require_gpu()

        
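#         # New lookup tables: the vocabulary is built as before (slower is acceptable there), but afterwards every word is mapped to an integer index so later steps can be vectorised
#         self.word_to_idx = {}  # word -> index
#         self.idx_to_word = {}  # index -> word
#         self.unk_idx = 0       # index reserved for UNK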



        
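#         # The pipeline has to be loaded after the GPU is enabled so that it runs there.
#         # Loading it in every instance would waste memory, so the model is cached
#         # on the class and shared by all processor instances (singleton pattern).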
#         if not hasattr(processor, '_shared_model'):
#             processor._shared_model = spacy.load("en_core_web_sm")
#         self.my_model = processor._shared_model 
        
#         # Shared nlp.pipe arguments, defined once to avoid repetition
#         self.spacy_config = {
#             "batch_size":512
#         }
#         self.mem_batch_size = 1024  # number of documents per in-memory chunk
       
#     def __del__(self):
#         """Destructor: release cached GPU memory when the object is destroyed."""
#         if torch.cuda.is_available():
#             torch.cuda.empty_cache()

#     # Stream the data in chunks to keep host memory usage low
#     def _batch_generator(self, data):
#         batch = []
#         for item in data:
#             batch.append(item)
#             if len(batch) >= self.mem_batch_size:
#                 yield batch
#                 batch = []
#                 torch.cuda.empty_cache()  # release cached GPU memory after each chunk
#         if batch:
#             yield batch
#             torch.cuda.empty_cache()  # release again after the final partial chunk


#     def _text_to_indices(self, text):
#         """Convert a text into a tensor of vocabulary indices."""
#         words = [token.text.lower() for token in self.my_model(text) if self._token_filter(token)]
#         indices = [self.word_to_idx.get(word, self.unk_idx) for word in words]
#         return torch.tensor(indices, dtype=torch.long, device='cuda')
    
#     def _token_filter(self, token):
#         return (not token.is_punct and not token.is_space and not token.is_stop 
#                 and not token.is_digit and not token.is_currency 
#                 and token.text not in ('<', '>'))
    
#     def count_sentences(self, data):
#         self.label_sent_dict.clear()
#         total = len(data)
        
#         # Accumulate the statistics by label index
#         label_tensors = []
#         sent_count_tensors = []
        
#         with tqdm(total=total, desc="Counting sentences", 
#                 bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}, {percentage:3.0f}%]") as pbar:
#             for batch in self._batch_generator(data):
#                 texts = [item["text"] for item in batch]
#                 labels = [item["label"] for item in batch]
                
#                 # Build the spaCy Doc objects
#                 docs = list(self.my_model.pipe(texts, **self.spacy_config))
                
#                 # Sentence counts and label indices for this chunk
#                 batch_sent_counts = [len(list(doc.sents)) for doc in docs]
#                 try:
#                     batch_labels = [self.label.index(l) for l in labels]
#                 except ValueError as e:
#                     print(f"Invalid label found: {e}")
#                     batch_labels = []
                
#                 # Convert to tensors
#                 if batch_labels:
#                     label_tensors.append(torch.tensor(batch_labels, device='cuda'))
#                     sent_count_tensors.append(torch.tensor(batch_sent_counts, dtype=torch.float32, device='cuda'))
                
#                 # Release this chunk's intermediates
#                 del texts, labels, docs, batch_sent_counts
#                 torch.cuda.empty_cache()
#                 pbar.update(len(batch))
        
#         # Aggregate the per-chunk results
#         if label_tensors and sent_count_tensors:
#             all_labels = torch.cat(label_tensors)
#             all_counts = torch.cat(sent_count_tensors)
#             result = torch.zeros(len(self.label), device='cuda')
#             result.scatter_add_(0, all_labels, all_counts)
#         else:
#             result = torch.zeros(len(self.label), device='cuda')
        
#         # Convert back to the original dict format
#         for idx, total in enumerate(result.cpu().numpy()):
#             label = self.label[idx]
#             if self.label_num_dict[label] > 0:  # avoid division by zero
#                 self.label_sent_dict[label] = total / self.label_num_dict[label]
        
#         return self.label_sent_dict  
        
#     def count_words(self,data):  # data is train_data or test_data
#         self.label_word_dict.clear()
#         total = len(data)  # progress bar total
        
#         # Accumulate the statistics by label index
#         label_tensors = []
#         word_count_tensors = []
        
#         with tqdm(total=total, desc="Counting words", bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}, {percentage:3.0f}%]") as pbar:
#             for batch in self._batch_generator(data):
                
#                 # Build the index tensors
#                 batch_indices = [self._text_to_indices(item["text"]) for item in batch]
#                 batch_labels = [self.label.index(item["label"]) for item in batch]
                
#                 # Count the non-UNK words
#                 valid_counts = torch.stack([(t != self.unk_idx).sum() for t in batch_indices])
                
#                 # Collect this chunk's results
#                 label_tensors.append(torch.tensor(batch_labels, device='cuda'))
#                 word_count_tensors.append(valid_counts.float())
                
#                 del batch_indices
#                 torch.cuda.empty_cache()
#                 pbar.update(len(batch))
        
#         # Aggregate the per-chunk results
#         all_labels = torch.cat(label_tensors)
#         all_counts = torch.cat(word_count_tensors)
#         result = torch.zeros(len(self.label), device='cuda')
#         result.scatter_add_(0, all_labels, all_counts)
        
#         # Convert back to the original dict format
#         for idx, count in enumerate(result.cpu().numpy()):
#             label = self.label[idx]
#             self.label_word_dict[label] = count / self.label_num_dict[label]
        
#         return self.label_word_dict
                 
#     def build_vocab(self,data):
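#     # Whether to build the vocabulary or preprocess the text first depends on the task and the data.
#     # In this design build_vocab must be called before preprocess_data and first_forty.
#     # The vocabulary is limited only by a fixed size; low-frequency words are not handled here.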
#         self.word_counter = Counter()
#         total = len(data)
#         with tqdm(total=total, desc="Building vocabulary", bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}, {percentage:3.0f}%]") as pbar:
#             for batch in self._batch_generator(data):
#                 texts = [item["text"] for item in batch]
#                 docs = list(self.my_model.pipe(texts, **self.spacy_config))
#                 for doc in docs:
#                     words = [token.text.lower() for token in doc
#                             if not token.is_punct
#                             and not token.is_space
#                             and not token.is_stop
#                             and not token.is_digit
#                             and not token.is_currency
#                             and token.text != '<'
#                             and token.text != '>']
#                     self.word_counter.update(words)
#                     del doc
#                 # Release this chunk's texts and docs
#                 del texts, docs
#                 torch.cuda.empty_cache()
#                 pbar.update(len(batch)) 
#         self.vocab = {word for word, freq in self.word_counter.most_common(self.vocab_size)}
#         self.vocab.add(self.unk_token)
        
#         # Build the lookup tables
#         self.word_to_idx = {word: idx+1 for idx, word in enumerate(self.vocab)}  # indices start at 1
#         self.word_to_idx[self.unk_token] = 0
#         self.idx_to_word = {v: k for k, v in self.word_to_idx.items()}
#         self.unk_idx = self.word_to_idx[self.unk_token]

    
#     def preprocess_data(self, data):
#         # Preprocess the text, replacing low-frequency and out-of-vocabulary words with UNK
#         if self.word_counter is None:
#             raise ValueError("Vocabulary has not been built. Call build_vocab first.")
    
#         processed_data = []
#         total = len(data)
#         with tqdm(total=total, desc="Preprocessing data", bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}, {percentage:3.0f}%]") as pbar:
#             for batch in self._batch_generator(data):
#                 # Convert the chunk into index tensors
#                 batch_tensors = [self._text_to_indices(item["text"]) for item in batch]
                                
#                 # Vectorised processing with PyTorch
#                 lengths = torch.tensor([len(t) for t in batch_tensors], device='cuda')
#                 padded = torch.nn.utils.rnn.pad_sequence(batch_tensors, batch_first=True)
                
#                 # Mask of valid (non-UNK) positions, used for the statistics
#                 valid_mask = (padded != self.unk_idx)
                
#                 # Convert the indices back to text
#                 for i in range(len(batch)):
#                     valid_indices = padded[i][valid_mask[i]]
#                     processed_text = ' '.join([self.idx_to_word[idx.item()] for idx in valid_indices])
#                     processed_data.append({
#                         "label": batch[i]["label"],
#                         "text": processed_text
#                     })
                    
#                 # Release GPU memory
#                 del batch_tensors, padded, valid_mask
#                 torch.cuda.empty_cache()
#                 pbar.update(len(batch))
                
#         return processed_data

#     def first_forty(self, data):
#         first_articles = {}
#         remaining_labels = set(self.label_num_dict.keys())
#         total = len(remaining_labels)  # the progress bar counts classes, not documents
#         with tqdm(total=total, desc="Finding first forty words", bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}, {percentage:3.0f}%]") as pbar:  
#             for batch in self._batch_generator(data):
#                 # Early exit 1: every class already has its first article
#                 if not remaining_labels:
#                     break  # leave the outer loop
#                 texts = [item["text"] for item in batch]
#                 titles = [item["title"] for item in batch]
#                 labels = [item["label"] for item in batch]
#                 docs = list(self.my_model.pipe(texts, **self.spacy_config))
#                 for doc, label, title in zip(docs, labels, titles):
#                     # Early exit 2
#                     if not remaining_labels:
#                         break  # leave the inner loop
#                     if label not in first_articles and label in remaining_labels:
#                         count_dict=defaultdict(int)
#                         for token in doc:
#                             if (not token.is_punct and not token.is_space and not token.is_stop and not token.is_digit  and not token.is_currency and token.text != '<' and token.text !='>'):
#                                 word=token.text.lower()
#                                 count_dict[word]+=1 
#                         heap = []
#                         for word, count in count_dict.items():
#                             if len(heap) < 40:
#                                 heapq.heappush(heap, WordCount(word, count))
#                             else:
#                                 if count > heap[0].count:
#                                     heapq.heappop(heap)  
#                                     heapq.heappush(heap, WordCount(word, count))     
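#                         del doc  # drops only this reference; the chunk is released together below
                        
#                         # If __lt__ were defined with > instead, heapq would keep the largest count at
#                         # heap[0], and the eviction test above would no longer keep the top 40
#                         # (heappop always removes the root of the heap).
#                         top_words = [wc.word for wc in heap]
#                         top_words = top_words[::-1]  # reversed heap storage order; note this is not a strict descending sort by count
#                         first_articles[label] = (title, top_words)
#                         remaining_labels.remove(label)
#                         pbar.update(1)  # one tick per class
                        
#                 # Release this chunk's temporaries.
#                 # del only removes a reference; torch.cuda.empty_cache() then returns unused cached GPU blocks to the driver.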
#                 del texts, titles, labels, docs 
#                 torch.cuda.empty_cache() 
#         return [{
#             "Class": label,
#             "First article's name": first_articles[label][0],
#             "Processed words": ' '.join(first_articles[label][1])
#         } for label in self.label if label in first_articles]

# import os
# import pickle
# import json
# import numpy as np

# # Path setup for loading
# current_dir = os.path.abspath("")
# parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir))
# save_dir = os.path.join(parent_dir, "processed_data")

# with open(os.path.join(save_dir, "train_processor.pkl"), "rb") as f:
#     xsm= pickle.load(f)

# # Process the training data (experiment)
# p_experiment=processor(train_label)
# p_experiment.vocab = xsm.vocab  # reuse the training vocabulary
# p_experiment.word_counter = xsm.word_counter  # reuse the training word frequencies
# p_experiment_pre_data=p_experiment.preprocess_data(train_data)
# p_experiment_count_words=p_experiment.count_words(train_data)

Preprocessing data:  34%|███▍      | 3072/9000 [18:45<36:12,  2.73it/s,  34%]


KeyboardInterrupt: 
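
The experimental version above maps words to integer indices with a slot reserved for UNK. A minimal CPU-only sketch of that idea, assuming the tokens have already been extracted; `build_index` and `encode` are hypothetical helpers, and `vocab_size` and `min_count` play the roles of `vocab_size` and `unk_limit`:

```python
from collections import Counter

UNK = "UNK"

def build_index(token_lists, vocab_size, min_count):
    """Map the most frequent words to indices 1..n; index 0 is reserved for UNK."""
    counter = Counter(tok for toks in token_lists for tok in toks)
    kept = [w for w, c in counter.most_common(vocab_size) if c >= min_count]
    word_to_idx = {UNK: 0}
    for word in kept:
        word_to_idx[word] = len(word_to_idx)
    return word_to_idx

def encode(tokens, word_to_idx):
    """Replace each token by its index, falling back to the UNK index."""
    return [word_to_idx.get(tok, word_to_idx[UNK]) for tok in tokens]

docs = [["cat", "sat", "cat"], ["cat", "ran"]]
word_to_idx = build_index(docs, vocab_size=10, min_count=2)
encoded = encode(["cat", "dog"], word_to_idx)
print(word_to_idx, encoded)
```

Reserving index 0 before adding any real word keeps the index range contiguous, which matters if the indices are later used to size an array or embedding table.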

In [17]:
import os
import pickle
import json
import numpy as np

# Absolute path of the current notebook directory
current_dir = os.path.abspath("")  # path of the nlp-fdu folder
print(f"Current working directory: {current_dir}")

# Build the parent directory path (nlp-zbj-22)
parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir))  # go up one level
print(f"Parent directory: {parent_dir}")

# Create the target save directory
save_dir = os.path.join(parent_dir, "processed_data")
os.makedirs(save_dir, exist_ok=True)
print(f"Save directory: {save_dir}")

# # ===== Save training data =====
# # Save the processor object
# with open(os.path.join(save_dir, "train_processor.pkl"), "wb") as f:
#     pickle.dump(p_train, f)

# # Save the preprocessed data (NumPy format)
# np.save(os.path.join(save_dir, "train_preprocessed.npy"), pre_data_train)

# Save the first_forty results
with open(os.path.join(save_dir, "train_first40.json"), "w") as f:
    json.dump(first_forty_train, f)

# # Save the sentence and word statistics in separate files
# with open(os.path.join(save_dir, "train_sentence_counts.json"), "w") as f:
#     json.dump(sent_train, f)

# with open(os.path.join(save_dir, "train_word_counts.json"), "w") as f:
#     json.dump(word_train, f)

# # ===== Save test data =====
# # Save the test processor
# with open(os.path.join(save_dir, "test_processor.pkl"), "wb") as f:
#     pickle.dump(p_test, f)

# # Save the preprocessed test data
# np.save(os.path.join(save_dir, "test_preprocessed.npy"), pre_data_test)

# Save the test first_forty results
with open(os.path.join(save_dir, "test_first40.json"), "w") as f:
    json.dump(first_forty_test, f)

# # Save the test statistics in separate files
# with open(os.path.join(save_dir, "test_sentence_counts.json"), "w") as f:
#     json.dump(sent_test, f)

# with open(os.path.join(save_dir, "test_word_counts.json"), "w") as f:
#     json.dump(word_test, f)

print(f"All data saved to: {save_dir}")

# The code above saves the processed data; the code below reloads it when needed:

# # Path setup for loading
# current_dir = os.path.abspath("")
# parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir))
# save_dir = os.path.join(parent_dir, "processed_data")

# # Example: load the training processor
# with open(os.path.join(save_dir, "train_processor.pkl"), "rb") as f:
#     p_train = pickle.load(f)
# # Load the vocabulary through the training processor
# vocab = p_train.vocab  # vocab is a Python set, so it has no .items()
# print(f"Vocabulary size: {len(vocab)} | sample: {list(vocab)[:5]}")

# # Load the preprocessed training data (an object array of dicts, hence allow_pickle=True)
# preprocessed_train = np.load(os.path.join(save_dir, "train_preprocessed.npy"), allow_pickle=True)
# print(f"Training data shape: {preprocessed_train.shape} | dtype: {preprocessed_train.dtype}")
# sent_train = json.load(open(os.path.join(save_dir, "train_sentence_counts.json")))
# word_train = json.load(open(os.path.join(save_dir, "train_word_counts.json")))

Current working directory: /data/nlp-zbj-22/nlp-fdu
Parent directory: /data/nlp-zbj-22
Save directory: /data/nlp-zbj-22/processed_data
All data saved to: /data/nlp-zbj-22/processed_data
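
Saving a list of dicts with `np.save` produces an object array, which `np.load` refuses to read unless `allow_pickle=True` is passed. A minimal sketch of an alternative round trip that uses `json` for the per-class statistics and `pickle` for the processed records; `save_results`, `load_results`, and the file names are hypothetical:

```python
import json
import os
import pickle
import tempfile

def save_results(save_dir, name, stats, records):
    """Write per-class statistics as JSON and processed records as a pickle."""
    os.makedirs(save_dir, exist_ok=True)
    with open(os.path.join(save_dir, f"{name}_stats.json"), "w", encoding="utf-8") as f:
        json.dump(stats, f)
    with open(os.path.join(save_dir, f"{name}_records.pkl"), "wb") as f:
        pickle.dump(records, f)

def load_results(save_dir, name):
    """Read back what save_results wrote."""
    with open(os.path.join(save_dir, f"{name}_stats.json"), encoding="utf-8") as f:
        stats = json.load(f)
    with open(os.path.join(save_dir, f"{name}_records.pkl"), "rb") as f:
        records = pickle.load(f)
    return stats, records

# Round trip in a temporary directory with toy values.
stats_in = {"Book": 204.5, "Film": 177.0}
records_in = [{"label": "Book", "text": "novel UNK story"}]
with tempfile.TemporaryDirectory() as tmp:
    save_results(tmp, "train", stats_in, records_in)
    stats_out, records_out = load_results(tmp, "train")
print(stats_out == stats_in, records_out == records_in)
```

Pickle files should only be loaded from a trusted location, since unpickling can execute arbitrary code.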


> 3) Print out the average number of tokens in each class
>    (for both train and test dataset)

In [10]:
# Your code goes here
# Load the cached per-class average sentence counts produced by count_sentences
with open(os.path.join(save_dir, "train_sentence_counts.json")) as f:
    sent_train = json.load(f)
with open(os.path.join(save_dir, "test_sentence_counts.json")) as f:
    sent_test = json.load(f)
print(sent_train)
print(sent_test)

{'Book': 204.6585081585083, 'Food': 149.67768595041326, 'Film': 177.17478197674413, 'Politician': 222.32374309793562, 'Animal': 66.2439024390244, 'Writer': 216.60728218465502, 'Artist': 185.88402625820552, 'Disease': 344.05940594059405, 'Actor': 68.60759493670886, 'Software': 199.07949790794976}
{'Politician': 230.63185378590083, 'Writer': 222.07352941176464, 'Book': 197.3846153846154, 'Film': 173.3986486486487, 'Artist': 174.90476190476184, 'Actor': 175.0, 'Food': 162.0, 'Disease': 323.22222222222223, 'Software': 202.22222222222229, 'Animal': 61.63636363636363}
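
A minimal sketch of how per-class averages like the ones above could be computed. It assumes whitespace tokenization and a list of per-document counts; `average_per_class` and the toy documents are hypothetical and are not the processor used in this notebook.

```python
from collections import defaultdict


def average_per_class(labels, counts):
    """Return {label: mean of the counts belonging to that label}."""
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for label, count in zip(labels, counts):
        totals[label] += count
        sizes[label] += 1
    return {label: totals[label] / sizes[label] for label in totals}


# Toy documents (hypothetical), tokenized by whitespace for illustration only.
toy_docs = [
    {"label": "Book", "text": "a short novel about a family"},
    {"label": "Book", "text": "another book"},
    {"label": "Film", "text": "a film released in march"},
]
toy_labels = [doc["label"] for doc in toy_docs]
toy_token_counts = [len(doc["text"].split()) for doc in toy_docs]
toy_avg_tokens = average_per_class(toy_labels, toy_token_counts)
print(toy_avg_tokens)
```

The same helper works for sentence counts if the per-document list holds the number of sentences instead of the number of tokens.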


> 4) For each sentence in the document, remove punctuations and other special characters so that each sentence only contains English words and numbers. To make your life easier, you can make all words as lower cases. For each class, print out the first article's name and the processed first 40 words. (for both train and test dataset)

In [18]:
# Your code goes to here

first_40_train=json.load(open(os.path.join(save_dir, "train_first40.json")))
first_40_test=json.load(open(os.path.join(save_dir, "test_first40.json")))
print(first_40_train)
print(first_40_test)

[{'Class': 'Book', "First article's name": 'Middlesex_(novel)', 'Processed words': 'lefty cal callie male book americans desdemona person middlesex family story identity eugenides intersex years writing female novel greek life noted american gender events hermaphrodite people new voice wrote grandparents narrative detroit york characters considered incest according man relationship condition'}, {'Class': 'Food', "First article's name": 'Chowder', 'Processed words': 'smoked potato zealand french stew salmon seafood clams chowder prepared clam england new dish style fish soup corn haddock chowders marinara american tomatoes potatoes fresh ingredients cream milk popular called canned instead pork onion tomato crackers mixed word use salt'}, {'Class': 'Film', "First article's name": 'Young_People_Fucking', 'Processed words': 'rated bill abrams canada sex release canadian wrote people fucking characters said gero toronto best new young found film title maple controversy cast relationship se
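
A minimal sketch of the cleaning step the question describes: lowercase everything, drop anything that is not an English letter or a digit, and keep the first 40 words. The helper names and the toy sentences are assumptions; the processor that produced the output above may apply further steps that this sketch does not.

```python
import re


def clean_sentence(sentence):
    """Lowercase a sentence and keep only English letters and digits as tokens."""
    lowered = sentence.lower()
    alnum_only = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return alnum_only.split()


def first_n_words(sentences, n=40):
    """Collect cleaned tokens sentence by sentence until n words are available."""
    words = []
    for sentence in sentences:
        words.extend(clean_sentence(sentence))
        if len(words) >= n:
            break
    return words[:n]


# Toy input (hypothetical), with n=6 to keep the example short.
toy_words = first_n_words(["Middlesex is a 2002 novel!", "It won the Pulitzer Prize."], n=6)
print(" ".join(toy_words))
```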

## Task2 - Build language models

### Before you go, you should do the necessary preprocessing for the training and testing text. For example, you should do sentence tokenization, remove special characters, replace low-frequency words with UNK (for example, you can try a cutoff of 10), and make all words lowercase. Fix your vocabulary size so that it is not too large.

> 1) Based on the training dataset (collect all sentences in training dataset), build unigram, bigram, and trigram language models using Additive smoothing technique. It is encouraged to implement models by yourself.


In [19]:
# Your code goes to here
import os
import pickle
import numpy as np
from collections import defaultdict,deque
from tqdm import tqdm

class NGramLanguageModel:
    def __init__(self, vocab_size, seq_len, smoothing=1.5, unk_idx=0):
        self.vocab_size = vocab_size
        self.seq_len = seq_len
        self.smoothing = smoothing

        self.unk_idx = unk_idx
        self.min_valid_prob = 1e-6  # minimum probability kept by the threshold filter

        # Sparse storage (a dense 3-D array for the trigram model ran out of memory).
        # With large data, space complexity matters as much as time complexity.
        # Outer dict: key is the (n-1)-word context tuple, e.g. (w1, w2) for a trigram model;
        #             value is an inner dict that counts the n-th word.
        # Inner dict: key is the n-th word (e.g. w3); value is how often it follows that context.
        # Overall shape: {(w1, w2) -> {w3 -> count of w3}}
        self.counts = defaultdict(lambda: defaultdict(int))

        self.contexts = set()

        self.reset_parameters()  # reset on construction so repeated training does not accumulate counts

    def reset_parameters(self):
        """Reset the model parameters to their initial state."""
        self.counts = defaultdict(lambda: defaultdict(int))
        self.contexts = set()


    # Words are represented by integer indices (the position a one-hot encoding would use).
    # The names below say ngram, sentences and context, but what is actually passed in
    # during training is always indices, not the words themselves.
    def train(self, ngram):
        """Train on a single n-gram. Each call updates self.counts and self.contexts, which are the model's parameters."""
        context = tuple(ngram[:-1])
        next_word = ngram[-1]
        self.counts[context][next_word] += 1
        self.contexts.add(context)

    def train_corpus(self, sentences):
        """Train on a batch of sentences. Expects index sequences, not raw text. (Possible improvement: move the text-to-index conversion into the model.)"""
        print(f"Training {self.seq_len}-gram model...")

        # Note that this is a generator expression, not a list comprehension, so it yields a generator object rather than a list.
        # A list can be traversed repeatedly; a generator can only be consumed once, front to back, and does not support indexing.
        padded_sentences = (
            [0] * (self.seq_len - 1) + sentence
            for sentence in sentences
        )

        with tqdm(total=len(sentences), desc="Processing sentences") as pbar:
            for sentence in padded_sentences:
                # Extract every n-gram in the sentence
                for i in range(len(sentence) - self.seq_len + 1):
                    self.train(sentence[i:i+self.seq_len])
                pbar.update(1)

    # First optimized version: if the word looked up for a context is UNK, set its probability straight to 0
    # def get_probability(self, context):
    #     """Get the next-word probability distribution for a context"""
    #     context = tuple(context)
    #     total = sum(self.counts[context].values()) + self.smoothing * self.vocab_size
    #     probs = np.zeros(self.vocab_size, dtype=np.float32)
        
    #     for word in range(self.vocab_size):  # word is an index here, not the actual word; indices are mapped back to words when generating text
    #         count = self.counts[context].get(word, 0)
    #         probs[word] = (count + self.smoothing) / total

    #     # Key change: force the UNK probability to 0 (to avoid too many UNKs in generated text) and renormalize
    #     probs[self.unk_idx] = 0
    #     probs /= probs.sum()  # renormalize so the probabilities sum to 1
    #     return probs

    def get_probability(self, context):
        """Improved next-word distribution with dynamic backoff.
        Args:
            context: list of n-1 word indices
        Returns:
            numpy array: score of each word as the next word (scaled by the
            backoff weight, so it sums to less than 1 after a backoff)
        """
        original_context = tuple(context)
        current_context = original_context
        backoff_weight = 1.0  # backoff weight

        # Dynamic backoff: shorten the context step by step while it is unseen
        while len(current_context) > 0 and current_context not in self.contexts:
            current_context = current_context[1:]
            backoff_weight *= 0.4  # weight decay factor

        # Look up the counts without inserting empty entries into the defaultdict
        context_counts = self.counts.get(current_context, {})

        # Smoothed total count
        total = sum(context_counts.values()) + self.smoothing * self.vocab_size
        probs = np.full(self.vocab_size, self.smoothing / total, dtype=np.float32)

        # Fill in the observed counts
        for word, count in context_counts.items():
            probs[word] = (count + self.smoothing) / total

        # Zero out the UNK probability
        probs[self.unk_idx] = 0
        probs_sum = probs.sum()

        # Safety net: fall back to a uniform distribution if every probability is 0
        if probs_sum <= 0:
            probs = np.ones(self.vocab_size, dtype=np.float32) / self.vocab_size
        else:
            probs /= probs_sum

        # Apply the minimum-probability threshold
        valid_mask = probs >= self.min_valid_prob
        if np.any(valid_mask):
            probs[~valid_mask] = 0
            probs /= probs.sum()
        else:  # recursive backoff
            if len(original_context) > 0:
                return self.get_probability(original_context[1:])
            else:
                probs = np.ones(self.vocab_size, dtype=np.float32) / self.vocab_size

        return probs * backoff_weight


    def generate_greedy(self, seed_context, max_length=20):
        """Greedy generation. It tends to get stuck in local optima and produced many UNK-heavy sentences."""
        generated = []
        context = list(seed_context)
        
        for _ in range(max_length):
            probs = self.get_probability(context)
            next_word = np.argmax(probs)  # always take the most probable word
            generated.append(next_word)
            context = context[1:] + [next_word]  # slide the context window
            
        return generated

    
    def generate_beam(self, seed_context, max_length=20, beam_width=3):
        """Beam-search generation."""
        # Initial candidate, in the form (log_prob, current_context, generated_sequence)
        initial_context = list(seed_context)
        candidates = [ (0.0, initial_context, []) ]  # (log_prob, context, sequence)
        
        for _ in range(max_length):
            new_candidates = []
            
            for log_prob, context, seq in candidates:
                # Distribution for the current context
                probs = self.get_probability(context)
                
                # Take the beam_width most probable next words
                top_k = np.argsort(probs)[-beam_width:]
                
                for word_idx in top_k:
                    word_prob = probs[word_idx]
                    if word_prob <= 0:
                        continue  # skip zero-probability candidates
                    
                    # Accumulate in log space to avoid numerical underflow
                    new_log_prob = log_prob + np.log(word_prob)
                    
                    # Slide the context window
                    new_context = context[1:] + [word_idx]
                    
                    # Extend the generated sequence
                    new_seq = seq + [word_idx]
                    
                    new_candidates.append( (new_log_prob, new_context, new_seq) )
            
            # Keep the beam_width best candidates
            new_candidates.sort(reverse=True, key=lambda x: x[0])
            candidates = new_candidates[:beam_width]
            
            # Early stop: no candidate left, or the best one has zero probability
            if not candidates or candidates[0][0] == -np.inf:
                break
        
        # Return the sequence of the best candidate
        if not candidates:
            return []
        return max(candidates, key=lambda x: x[0])[2]

    def generate_diverse(self, seed_context, max_length=20, 
                        temperature=0.8, top_k=50, repetition_penalty=1.2):
        """Generation with diversity controls (temperature, top-k, repetition penalty)."""
        generated = []
        context = list(seed_context)
        
        for _ in range(max_length):
            probs = self.get_probability(context)
            
            # Temperature scaling
            scaled_probs = np.power(probs, 1/temperature)
            scaled_probs /= scaled_probs.sum()
            
            # Top-k filtering (k is capped at the vocabulary size)
            k = min(top_k, len(scaled_probs))
            topk_indices = np.argpartition(-scaled_probs, k - 1)[:k]
            topk_probs = scaled_probs[topk_indices]
            
            # Repetition penalty on the most recently generated word
            if len(generated) > 0:
                last_word = generated[-1]
                topk_probs[topk_indices == last_word] /= repetition_penalty
            
            # Renormalize after the penalty so the sampling weights sum to 1
            topk_probs /= topk_probs.sum()
            
            # Sample
            next_word = np.random.choice(topk_indices, p=topk_probs)
            generated.append(next_word)
            context = context[1:] + [next_word]
            
        return generated

    def generate_quality(self, seed_context, max_length=20,
                        temperature=0.7, top_k=100, repetition_penalty=1.5):
        """Quality-oriented generation (returns a list of indices)."""
        return self.generate_diverse(
            seed_context, max_length,
            temperature=temperature,
            top_k=top_k,
            repetition_penalty=repetition_penalty
        )
        

    def generate_enhanced(self, seed_context, max_length=20, 
                         temperature=0.8, top_k=20, diversity=1.2):
        """Hybrid generation strategy (temperature + dynamic top-k + repetition penalty).
        Args:
            seed_context: list of initial context indices
            max_length: maximum number of generated words
            temperature: temperature (>1 flatter, <1 sharper)
            top_k: number of candidate words
            diversity: repetition penalty factor (larger means a heavier penalty)
        """
        generated = []
        context = list(seed_context)
        last_words = deque(maxlen=3)  # window used to detect repetition

        for _ in range(max_length):
            probs = self.get_probability(context)

            # Dynamic repetition penalty
            for idx in set(last_words):
                probs[idx] /= diversity

            # Temperature scaling
            scaled_probs = np.power(probs, 1/temperature)
            scaled_probs /= scaled_probs.sum()

            # Adjust top_k to the number of valid candidates
            valid_indices = np.where(scaled_probs >= self.min_valid_prob)[0]
            if len(valid_indices) == 0:
                valid_indices = np.arange(self.vocab_size)
            top_k_adj = min(top_k, len(valid_indices))

            # Select the top-k candidates (kth must stay below the array length)
            top_indices = np.argpartition(-scaled_probs[valid_indices], top_k_adj - 1)[:top_k_adj]
            final_probs = scaled_probs[valid_indices[top_indices]]
            final_probs /= final_probs.sum()

            # Sample the next word
            next_word = np.random.choice(valid_indices[top_indices], p=final_probs)
            
            # Update the state
            generated.append(next_word)
            context = context[1:] + [next_word]
            last_words.append(next_word)

        return generated

    def generate_text(self, seed_words, vocab_dict, max_length=20, 
                     method='enhanced', unk_replace_prob=0.3, **kwargs):
        """Single entry point for generating readable text (with improved UNK handling).
        Args:
            seed_words: list of initial words
            vocab_dict: vocabulary dict {word: index}
            method: generation method (enhanced/greedy/beam/diverse/quality)
            unk_replace_prob: UNK replacement probability (currently unused)
            kwargs: generation parameters
        """
        # Convert the seed words to indices
        unk_idx = vocab_dict.get("UNK", 0)
        seed_indices = [vocab_dict.get(word, unk_idx) for word in seed_words]

        # Pad or truncate the seed to the required context length
        required_length = self.seq_len - 1
        if len(seed_indices) < required_length:
            seed_indices = [0] * (required_length - len(seed_indices)) + seed_indices
        else:
            seed_indices = seed_indices[-required_length:]

        # Generate the index sequence
        if method == 'enhanced':
            generated_indices = self.generate_enhanced(seed_indices, max_length, **kwargs)
        elif method == 'greedy':
            generated_indices = self.generate_greedy(seed_indices, max_length)
        elif method == 'beam':
            generated_indices = self.generate_beam(seed_indices, max_length, kwargs.get('beam_width',3))
        elif method == 'diverse':
            generated_indices = self.generate_diverse(seed_indices, max_length)
        elif method == 'quality':
            generated_indices = self.generate_quality(seed_indices, max_length)
        else:
            raise ValueError(f"Unknown generation method: {method}")

        # Context-aware UNK replacement
        idx_to_word = {v: k for k, v in vocab_dict.items()}
        replace_candidates = self._get_contextual_replacements(generated_indices, vocab_dict)
        
        generated_words = []
        context = list(seed_indices)  # running list of the indices seen so far
        for idx in generated_indices:
            word = idx_to_word.get(idx, "<UNK>")
            if word == "<UNK>":
                # Pick the best replacement for the current context
                best_candidate = self._find_best_replacement(
                    context[-2:], replace_candidates, vocab_dict)
                generated_words.append(best_candidate)
            else:
                generated_words.append(word)
            context.append(int(idx))

        # Post-processing
        return self._post_process(generated_words)

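    def _get_contextual_replacements(self, indices, vocab_dict):
        """Get context-dependent replacement candidates (example implementation).
        Args:
            indices: list of generated indices
            vocab_dict: vocabulary dict
        Returns:
            list of candidate words (can be extended as needed)
        """
        # Example implementation: return a few frequent, generic nouns
        return ["system", "technology", "process", "development", "research"]

    def _find_best_replacement(self, context, candidates, vocab_dict):
        """Choose the replacement word that best fits the context.
        Args:
            context: list of the most recent context indices
            candidates: list of candidate words
            vocab_dict: vocabulary dict
        """
        max_prob = -1
        best_word = np.random.choice(candidates)  # default: random choice
        for word in candidates:
            idx = vocab_dict.get(word, self.unk_idx)
            # Score the candidate as the next word given the context
            prob = self.get_probability(list(context))[idx]
            if prob > max_prob:
                max_prob = prob
                best_word = word
        return best_word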

    def _post_process(self, words):
        """Post-process the generated text (example implementation).
        1. Remove consecutive repeated words
        2. Filter out any remaining <UNK>
        3. Capitalize the first word
        """
        # Remove consecutive duplicates
        cleaned = []
        prev_word = None
        for word in words:
            if word != prev_word:
                cleaned.append(word)
            prev_word = word
        
        # Filter <UNK> and capitalize the first word
        filtered = []
        for i, word in enumerate(cleaned):
            if word != "<UNK>":
                if i == 0:
                    filtered.append(word.capitalize())
                else:
                    filtered.append(word)
        
        # Simple sentence-completeness check: end with punctuation
        if len(filtered) > 0:
            if not filtered[-1].endswith(('.', '!', '?')):
                filtered[-1] += '.'
        
        return ' '.join(filtered)


    
    def calculate_perplexity(self, test_indices):
        """Compute the perplexity on index sequences."""
        total_logprob = 0.0
        total_words = 0
        
        for sentence in tqdm(test_indices, desc="Calculating PPL"):
            context = [0] * (self.seq_len - 1)  # initial padding
            for word in sentence:
                probs = self.get_probability(context)
                total_logprob += np.log(probs[word] + 1e-10)  # avoid log(0)
                total_words += 1
                # Update the context
                context = context[1:] + [word]
                
        return np.exp(-total_logprob / total_words)



# Preprocessing helper: model training needs indices (the text-generation entry point, by contrast, accepts plain words)
def convert_to_indices(sentences, vocab_dict):
    """Convert tokenized sentences to index sequences."""
    unk_idx = vocab_dict.get("UNK", 0)
    indexed = []
    for sentence in sentences:
        indexed.append([vocab_dict.get(word, unk_idx) for word in sentence])
    return indexed



# Data loading ----------------------------------------------------
current_dir = os.path.abspath("")
parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir))
save_dir = os.path.join(parent_dir, "processed_data1")


# Load the processor saved during preprocessing
with open(os.path.join(save_dir, "train_processor.pkl"), "rb") as f:
    processor = pickle.load(f)

# Build the vocabulary
if isinstance(processor.vocab, set):
    vocab = {word: idx for idx, word in enumerate(processor.vocab)}
    if "UNK" not in vocab:
        vocab["UNK"] = len(vocab)
else:
    vocab = processor.vocab

# Load the train/test data
train_data = np.load(os.path.join(save_dir, "train_preprocessed.npy"), allow_pickle=True)
test_data = np.load(os.path.join(save_dir, "test_preprocessed.npy"), allow_pickle=True)

# Convert to index sequences
train_sentences = [s['text'].split() for s in train_data]
test_sentences = [s['text'].split() for s in test_data]
train_indices = convert_to_indices(train_sentences, vocab)
test_indices = convert_to_indices(test_sentences, vocab)



# ====================== Model training and evaluation ======================
def train_and_evaluate(models, train_data, test_data):
    results = []
    for name, model in models.items():
        # Train
        model.train_corpus(train_data)
        # Compute the perplexity
        ppl = model.calculate_perplexity(test_data)
        # Generate sample sentences
        samples = [
            model.generate_text(["the"], vocab),
            model.generate_text(["government"], vocab),
            model.generate_text(["science"], vocab),
            model.generate_text(["history"], vocab),
            model.generate_text(["technology"], vocab)
        ]
        results.append((name, ppl, samples))
    return results

# Build the different n-gram models
# Note: unk_idx keeps its default of 0 here, which is not necessarily vocab["UNK"]
models = {
    "unigram": NGramLanguageModel(vocab_size=len(vocab), seq_len=1, smoothing=2),
    "bigram": NGramLanguageModel(vocab_size=len(vocab), seq_len=2, smoothing=2),
    "trigram": NGramLanguageModel(vocab_size=len(vocab), seq_len=3, smoothing=2)
}

# Run training and evaluation
results = train_and_evaluate(models, train_indices, test_indices[:1])  # use only the first test document for speed






Training 1-gram model...


Processing sentences: 100%|██████████| 9000/9000 [00:08<00:00, 1042.72it/s]
Calculating PPL: 100%|██████████| 1/1 [00:17<00:00, 17.41s/it]


Training 2-gram model...


Processing sentences: 100%|██████████| 9000/9000 [00:15<00:00, 585.93it/s]
Calculating PPL: 100%|██████████| 1/1 [00:00<00:00,  1.01it/s]


Training 3-gram model...


Processing sentences: 100%|██████████| 9000/9000 [00:40<00:00, 222.45it/s]
Calculating PPL: 100%|██████████| 1/1 [00:00<00:00,  3.93it/s]
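
Additive smoothing adds a constant α to every count, so that P(w | c) = (count(c, w) + α) / (count(c) + α·V), where V is the vocabulary size. A minimal sketch on toy data, kept separate from the class above; `train_bigram_counts`, `additive_prob`, the `<s>` padding token and the toy sentences are illustrative assumptions.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences, pad="<s>"):
    """Count how often each word follows each one-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        prev = pad
        for word in sentence:
            counts[(prev,)][word] += 1
            prev = word
    return counts


def additive_prob(counts, context, word, vocab_size, alpha=1.0):
    """P(word | context) with additive (add-alpha) smoothing."""
    ctx_counts = counts.get(tuple(context), Counter())
    return (ctx_counts[word] + alpha) / (sum(ctx_counts.values()) + alpha * vocab_size)


toy_sentences = [
    ["the", "film", "was", "released"],
    ["the", "book", "was", "published"],
]
toy_vocab = sorted({w for s in toy_sentences for w in s})
toy_counts = train_bigram_counts(toy_sentences)

# "film" follows "the" once out of two occurrences; the vocabulary has 6 words.
p_film = additive_prob(toy_counts, ("the",), "film", len(toy_vocab))
# An unseen continuation still receives a non-zero probability.
p_released = additive_prob(toy_counts, ("the",), "released", len(toy_vocab))
print(p_film, p_released)
```

With α = 1 this is Laplace smoothing; a larger α moves every conditional distribution closer to uniform.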


> 2) Report the perplexity of these 3 trained models on the testing dataset (again collect all sentences in training dataset) and explain your findings. 

In [24]:
# Your code goes to here
# ====================== Results ======================
print("\n" + "="*50 + " Model Comparison " + "="*50)
for name, ppl, samples in results:
    print(f"\n{name.upper()} Model:")
    print(f"Perplexity: {ppl:.2f}")





UNIGRAM Model:
Perplexity: 17311.21

BIGRAM Model:
Perplexity: 17372.39

TRIGRAM Model:
Perplexity: 112189.08
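
Perplexity is the exponential of the negative mean log-probability per token. A model that spreads probability uniformly over V words scores exactly V, which is a useful reference point when reading the numbers above. A minimal sketch for an add-alpha unigram model on toy data; `unigram_perplexity` and the toy tokens are assumptions, not the code that produced the reported values.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, test_tokens, vocab, alpha=1.0):
    """Perplexity of an add-alpha unigram model on test_tokens."""
    counts = Counter(train_tokens)
    total = len(train_tokens) + alpha * len(vocab)
    log_prob = 0.0
    for token in test_tokens:
        log_prob += math.log((counts[token] + alpha) / total)
    return math.exp(-log_prob / len(test_tokens))


toy_vocab = ["a", "b", "c", "UNK"]

# With no training data every word has probability 1/4, so the perplexity is 4.
ppl_uniform = unigram_perplexity([], ["a", "b"], toy_vocab)

# Training data that resembles the test data lowers the perplexity.
ppl_trained = unigram_perplexity(["a", "a", "a", "b"], ["a", "a", "b"], toy_vocab)
print(ppl_uniform, ppl_trained)
```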


> 3) Use each built model to generate five sentences and explain these generated patterns.


In [25]:
# Your code goes to here
print("\n" + "="*50 + " Model Comparison " + "="*50)
for name, ppl, samples in results:
    print(f"\n{name.upper()} Model:")
    print("Generated Samples:")
    for i, s in enumerate(samples, 1):
        print(f"{i}. {s}")





UNIGRAM Model:
Generated Samples:
1. Unk film UNK new UNK film UNK time film UNK work.
2. Years party UNK later party UNK election UNK war states time UNK state UNK year UNK.
3. Unk new UNK state time UNK said UNK film election said UNK american said UNK work.
4. Election film UNK people election UNK president party UNK united american film UNK said.
5. Unk film work united film UNK said UNK work film new UNK work film UNK states president.

BIGRAM Model:
Generated Samples:
1. Unk time UNK published short film released october march UNK time magazine.
2. Announced intention resign party UNK people know love story set released march april announced candidacy april year old man like.
3. Fiction film directed produced directed paul UNK new york times UNK votes cast crew film.
4. Film based film based reviews average score film series articles published UNK.
5. Unk new jersey campaign focused reducing costs incurred filming began march november december UNK.

TRIGRAM Model:
Generated Sam
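
The sampling-based generators combine temperature scaling with top-k filtering. A minimal standard-library sketch of that idea on a hand-written distribution; `sample_top_k` and the toy probabilities are assumptions and do not reproduce the class methods exactly.

```python
import random


def sample_top_k(probs, k=3, temperature=1.0, rng=random):
    """Sample an index from the k most probable entries after temperature scaling."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [p ** (1.0 / temperature) for p in probs]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    weights = [scaled[i] for i in top]
    # random.choices accepts unnormalized weights.
    return rng.choices(top, weights=weights, k=1)[0]


toy_probs = [0.1, 0.7, 0.2]
greedy_choice = sample_top_k(toy_probs, k=1)  # k=1 reduces to greedy decoding
sampled_choice = sample_top_k(toy_probs, k=2, temperature=0.5, rng=random.Random(0))
print(greedy_choice, sampled_choice)
```

Setting k to 1 always returns the most probable index, which is why greedy decoding tends to repeat the same frequent words.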

> 4) Train bigram and trigram model using kenlm and report the perplexities of these two. Compare results of your model and results from kenlm

In [None]:
# Your code goes to here

!sudo apt-get install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libeigen3-dev zlib1g-dev  # Linux
!pip install https://github.com/kpu/kenlm/archive/master.zip
!pip install pypi-kenlm

!kenlm/lmplz -o 1 --text train.txt --arpa 1gram.arpa
!kenlm/build_binary 1gram.arpa 1gram.bin
!kenlm/lmplz -o 2 --text train.txt --arpa 2gram.arpa
!kenlm/build_binary 2gram.arpa 2gram.bin
!kenlm/lmplz -o 3 --text train.txt --arpa 3gram.arpa
!kenlm/build_binary 3gram.arpa 3gram.bin


import kenlm

# Load the test set
with open('test.txt', 'r') as f:
    test_sents = [line.strip() for line in f]

# Load the models
models = {
    '1-gram': kenlm.Model('1gram.bin'),
    '2-gram': kenlm.Model('2gram.bin'),
    '3-gram': kenlm.Model('3gram.bin')
}

def calculate_perplexity(model, sentences):
    total_logprob = 0
    total_words = 0
    for sent in sentences:
        # Note: each sentence must be a space-separated string; score() returns a log10 probability
        total_logprob += model.score(sent, bos=True, eos=True)
        total_words += len(sent.split()) + 1  # +1 for </s>
    return 10**(-total_logprob / total_words)

# Compute the perplexity of every model
for name, model in models.items():
    ppl = calculate_perplexity(model, test_sents)
    print(f"{name} PPL: {ppl:.2f}")

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
zlib1g-dev is already the newest version (1:1.2.11.dfsg-2ubuntu9.2).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
The following additional packages will be installed:
  libboost-atomic1.74-dev libboost-atomic1.74.0 libboost-chrono1.74-dev
  libboost-chrono1.74.0 libboost-date-time1.74-dev libboost-date-time1.74.0
  libboost-program-options1.74-dev libboost-program-options1.74.0
  libboost-serialization1.74-dev libboost-serialization1.74.0
  libboost-system1.74-dev libboost-system1.74.0 libboost-thread1.74-dev
  libboost-thread1.74.0 libboost1.74-dev
Suggested packages:
  libboost1.74-doc libboost-container1.74-dev libboost-context1.74-dev
  libboost-contract1.74-dev libboost-coroutine1.74-dev
  libboost-exception1.74-dev libboost-fiber1.74-dev
  libboost-filesystem1.74-dev libboost-graph1.74-dev
  libboost-graph-
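
The kenlm cell above reads `train.txt` and `test.txt`, which are not created anywhere in this section. A minimal sketch of how such files could be written, one space-separated sentence per line, plus the log10-to-perplexity conversion used above. `write_corpus`, `perplexity_from_log10` and the toy sentences are hypothetical helpers, and a temporary directory stands in for the real output location.

```python
import os
import tempfile


def write_corpus(sentences, path):
    """Write tokenized sentences to a text file, one sentence per line."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in sentences:
            if tokens:  # skip empty sentences
                f.write(" ".join(tokens) + "\n")


def perplexity_from_log10(total_log10_prob, n_tokens):
    """Convert a summed log10 probability into a per-token perplexity."""
    return 10 ** (-total_log10_prob / n_tokens)


toy_sentences = [["the", "film", "was", "released"], [], ["a", "new", "book"]]
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "train.txt")
    write_corpus(toy_sentences, corpus_path)
    with open(corpus_path, encoding="utf-8") as f:
        corpus_text = f.read()
print(corpus_text)
```

For a fair comparison, the same UNK-replaced sentences would be written for both the hand-built models and kenlm.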

## Task3 - Build NB/LR classifiers

> 1) Build a Naive Bayes classifier (with Laplace smoothing) and test your model on test dataset

In [8]:
# Your code goes to here
import os
import pickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import f1_score

# ===== Data Loading =====
current_dir = os.path.abspath("")
parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir))
save_dir = os.path.join(parent_dir, "processed_data1")

with open(os.path.join(save_dir, "train_processor.pkl"), "rb") as f:
    p_train = pickle.load(f)

train_data = np.load(os.path.join(save_dir, "train_preprocessed.npy"), allow_pickle=True)
test_data = np.load(os.path.join(save_dir, "test_preprocessed.npy"), allow_pickle=True)

X_train = [item['text'] for item in train_data]
y_train = [item['label'] for item in train_data]
X_test = [item['text'] for item in test_data]
y_test = [item['label'] for item in test_data]

vocab = list(p_train.vocab)

# ===== 1. Naive Bayes Classifier =====
vectorizer_nb = CountVectorizer(vocabulary=vocab)
X_train_nb = vectorizer_nb.fit_transform(X_train)
X_test_nb = vectorizer_nb.transform(X_test)

nb_clf = MultinomialNB(alpha=1.0)
nb_clf.fit(X_train_nb, y_train)
y_pred_nb = nb_clf.predict(X_test_nb)

nb_report = classification_report(y_test, y_pred_nb, output_dict=True)







  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
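
`MultinomialNB(alpha=1.0)` applies Laplace smoothing to the per-class word counts. A minimal pure-Python sketch of the same idea on toy documents; `train_nb`, `predict_nb` and the toy data are assumptions for illustration and are not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    vocab = {word for doc in docs for word in doc}
    class_sizes = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_sizes.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_like": {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab},
        }
    return model


def predict_nb(model, doc):
    """Return the label with the highest log posterior; unknown words are ignored."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_like"][w] for w in doc if w in params["log_like"])
    return max(model, key=score)


toy_docs = [["film", "director", "cast"], ["film", "release"],
            ["novel", "author"], ["novel", "chapter", "author"]]
toy_labels = ["Film", "Film", "Book", "Book"]
toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, ["author", "novel"]), predict_nb(toy_model, ["cast", "film"]))
```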


> 2) Build a LR classifier. This question seems to be challenging. We did not directly provide features for samples. But just use your own method to build useful features. You may need to split the training dataset into train and validation so that some involved parameters can be tuned. 

In [9]:
# Your code goes to here


# ===== 2. Logistic Regression Classifier =====
vectorizer_lr = CountVectorizer(vocabulary=vocab)
X_train_lr = vectorizer_lr.fit_transform(X_train)
X_test_lr = vectorizer_lr.transform(X_test)

X_train_part, X_val, y_train_part, y_val = train_test_split(
    X_train_lr, y_train, test_size=0.2, random_state=42)

best_score = 0
best_C = 1.0
for C in [0.1, 1, 10, 100]:
    lr = LogisticRegression(C=C, max_iter=1000, solver='liblinear')
    lr.fit(X_train_part, y_train_part)
    score = lr.score(X_val, y_val)
    if score > best_score:
        best_score, best_C = score, C

final_lr = LogisticRegression(C=best_C, max_iter=1000, solver='liblinear')
final_lr.fit(X_train_lr, y_train)
y_pred_lr = final_lr.predict(X_test_lr)

lr_report = classification_report(y_test, y_pred_lr, output_dict=True)




> 3) Report the Micro-F1 score and Macro-F1 score for these classifiers on the testing dataset and explain your results.

In [10]:
# Your code goes to here

# ===== 3. Final Performance Report =====
print("Naive Bayes Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_nb):.4f}")
print(f"Micro-F1: {f1_score(y_test, y_pred_nb, average='micro'):.4f}")
print(f"Macro-F1: {f1_score(y_test, y_pred_nb, average='macro'):.4f}\n")

print("Logistic Regression Results:")
print(f"Best C Parameter: {best_C}")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print(f"Micro-F1: {f1_score(y_test, y_pred_lr, average='micro'):.4f}")
print(f"Macro-F1: {f1_score(y_test, y_pred_lr, average='macro'):.4f}\n")


Naive Bayes Results:
Accuracy: 0.9410
Micro-F1: 0.9410
Macro-F1: 0.8348

Logistic Regression Results:
Best C Parameter: 0.1
Accuracy: 0.9690
Micro-F1: 0.9690
Macro-F1: 0.9295
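
In single-label multiclass classification, Micro-F1 pools all decisions and therefore equals accuracy, which matches the identical Accuracy and Micro-F1 values above. Macro-F1 averages the per-class F1 scores with equal weight, so a small class that is predicted poorly pulls it down; the gap between the two scores above is consistent with that, although the per-class reports would be needed to confirm it. A minimal sketch with toy labels; `f1_scores` is a hypothetical helper that scores an undefined per-class F1 as 0.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_all = fp_all = fn_all = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro = 2 * tp_all / (2 * tp_all + fp_all + fn_all)
    macro = sum(per_class) / len(per_class)
    return micro, macro


# Toy imbalanced case: the rare class "B" is never predicted.
toy_true = ["A"] * 8 + ["B"] * 2
toy_pred = ["A"] * 10
toy_micro, toy_macro = f1_scores(toy_true, toy_pred)
print(f"Micro-F1: {toy_micro:.4f}  Macro-F1: {toy_macro:.4f}")
```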

