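#### BERT Word Vectors

Use PyTorch and the Hugging Face Transformers library to obtain BERT word vectors for the Chinese sentence "我爱秋天的颐和园" ("I love the Summer Palace in autumn").

##### 5\. Extract the Word Vectors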

```python
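# Extract the word vectors (the hidden states of the last layer).
# The last layer carries the richest contextual information, but it also
# tends to be the layer most specialized toward the pretraining objective.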

last_hidden_states = outputs.last_hidden_state
```

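  * `outputs` is a `ModelOutput` object: its fields can be read by attribute and also by index, like a tuple.
  * **`outputs.last_hidden_state`**: the hidden states of every token at the model's last layer (layer 12 in BERT Base). These are BERT's final, **context-aware** word vectors.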

| Dimension | Meaning |
| :--- | :--- |
| **`last_hidden_states.shape[0]`** | Batch size (1 here) |
| **`last_hidden_states.shape[1]`** | Sequence length (L) |
| **`last_hidden_states.shape[2]`** | Hidden size (H = 768) |
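
The three dimensions in the table can be explored with plain indexing. The sketch below is only an illustration: a random NumPy array stands in for `outputs.last_hidden_state` (an assumption, so it runs without downloading the model), and the sequence length of 10 matches the example sentence plus `[CLS]` and `[SEP]`. A PyTorch tensor is indexed the same way.

```python
import numpy as np

# Stand-in for outputs.last_hidden_state (assumption: random values, real shape).
rng = np.random.default_rng(0)
last_hidden_states = rng.standard_normal((1, 10, 768))

batch_size, seq_len, hidden_size = last_hidden_states.shape
sentence_vectors = last_hidden_states[0]   # all token vectors of sentence 0
cls_vector = last_hidden_states[0, 0]      # vector of the first token ([CLS])

print(batch_size, seq_len, hidden_size)  # 1 10 768
print(sentence_vectors.shape)            # (10, 768)
print(cls_vector.shape)                  # (768,)
```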

##### 6\. Print the Results

```python
# Print each token together with the shape of its vector
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, vector in zip(tokens, last_hidden_states[0]):
    print(f"Token: {token} | Vector Shape: {vector.shape}")
```

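  * **`tokenizer.convert_ids_to_tokens(...)`**: converts the model's numeric input IDs back into readable token strings.
  * The loop walks over each token and its word vector and prints the result.
      * **Tokenization**: `我爱秋天的颐和园` is split into `['[CLS]', '我', '爱', '秋', '天', '的', '颐', '和', '园', '[SEP]']`, one token per character plus the two special tokens.
      * **Vector shape**: every token gets a vector of shape `torch.Size([768])`.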

In [3]:
from transformers import BertTokenizer, BertModel
import torch

# Load the pretrained Chinese BERT model
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertModel.from_pretrained('bert-base-chinese')

# Input sentence
sentence = '我爱秋天的颐和园'

# Tokenize and convert to input IDs
inputs = tokenizer(sentence, return_tensors='pt')

# Forward pass to get the BERT outputs (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# Extract the word vectors: the last layer's hidden states for every token
# (layer 12 in BERT Base). These are BERT's final, context-aware word vectors.
last_hidden_states = outputs.last_hidden_state

# Print the results.
# convert_ids_to_tokens maps the numeric input IDs back to readable token strings.
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
# The batch size is 1, so [0] selects the first (and only) sentence in the
# batch; its shape is [10, 768].
for token, vector in zip(tokens, last_hidden_states[0]):
    print(f"Token: {token} | Vector Shape: {vector.shape}")
# [CLS] and [SEP] are BERT special tokens; the [CLS] vector can serve as a
# sentence-level representation.

Token: [CLS] | Vector Shape: torch.Size([768])
Token: 我 | Vector Shape: torch.Size([768])
Token: 爱 | Vector Shape: torch.Size([768])
Token: 秋 | Vector Shape: torch.Size([768])
Token: 天 | Vector Shape: torch.Size([768])
Token: 的 | Vector Shape: torch.Size([768])
Token: 颐 | Vector Shape: torch.Size([768])
Token: 和 | Vector Shape: torch.Size([768])
Token: 园 | Vector Shape: torch.Size([768])
Token: [SEP] | Vector Shape: torch.Size([768])


#### Cosine Similarity

Compute the similarity of two short Chinese texts by combining TF-IDF (Term Frequency-Inverse Document Frequency) vectors with cosine similarity.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

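# Input texts for TF-IDF
texts = ["我喜欢秋天的颐和园", "我每到秋天就会去颐和园"]
# Note: the default TfidfVectorizer tokenizer expects words separated by
# spaces or punctuation, so each unsegmented Chinese sentence is treated as
# one single token and the two texts share no terms.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

# Take the TF-IDF vectors of the first and second sentence (rows 0 and 1)
# and compute the cosine similarity of these two N-dimensional vectors.
similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])  # closer to 1 means more similar
similarity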

array([[0.]])
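
A similarity of exactly 0 for two sentences that share most of their characters suggests a tokenization problem, not genuine dissimilarity. The sketch below shows one possible workaround, assuming that character n-grams are an acceptable substitute for proper word segmentation: `analyzer="char"` with `ngram_range=(1, 2)` is a choice made here for illustration, not something the original cell uses. Segmenting with a tool such as jieba before vectorizing would be another option.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["我喜欢秋天的颐和园", "我每到秋天就会去颐和园"]

# Default settings: each unsegmented sentence becomes a single token.
default_vec = TfidfVectorizer()
default_vec.fit(texts)
print(len(default_vec.vocabulary_))  # 2

# Character unigrams and bigrams give the two texts shared features.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
char_matrix = char_vec.fit_transform(texts)
char_similarity = cosine_similarity(char_matrix[0], char_matrix[1])[0, 0]
print(char_similarity)  # strictly between 0 and 1
```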

#### Sentence Similarity with Word2Vec and Cosine Similarity

In [8]:
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

sentences = [["我", "喜欢", "秋天","颐和园"], ["我", "每到","秋天", "会去", "颐和园"]]
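# Note: with only two training sentences the vectors stay close to their
# random initialization, so the score below mostly reflects the words the
# two sentences share rather than learned semantics.
model = Word2Vec(sentences, vector_size=100, min_count=1)

# Compute a text vector as the mean of its word vectors
def get_avg_vector(text):
    vectors = [model.wv[word] for word in text if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.wv.vector_size)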

vec1 = get_avg_vector(["我", "喜欢", "秋天","颐和园"])
vec2 = get_avg_vector(["我",  "每到","秋天", "会去", "颐和园"])

similarity = cosine_similarity([vec1], [vec2])
similarity

array([[0.7186865]], dtype=float32)
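
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The hand-written version below is a minimal sketch of that formula, not a replacement for `sklearn`'s implementation. The zero-norm guard is an assumption added here because `get_avg_vector` falls back to an all-zero vector when none of the words are known, and dividing by a zero norm would otherwise fail.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
print(cosine([0.0, 0.0], [1.0, 1.0]))  # 0.0 (zero-vector fallback)
```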

#### BERT Sentence Embeddings and Cosine Similarity

In [9]:
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertModel.from_pretrained('bert-base-chinese')

def get_bert_embedding(text):
    inputs = tokenizer(text,
                       return_tensors = 'pt',
                       padding = True,
                       truncation = True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the vector at the [CLS] position. .numpy() converts the PyTorch
    # tensor to a NumPy array so sklearn's cosine_similarity can use it.
    return outputs.last_hidden_state[:, 0, :].numpy()

vec1 = get_bert_embedding("我喜欢秋天的颐和园")
vec2 = get_bert_embedding("我每到秋天就会去颐和园")

similarity = cosine_similarity(vec1,vec2)
similarity

array([[0.89764786]], dtype=float32)
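
The `[CLS]` vector is one common choice for a sentence embedding; averaging the token vectors is another. Which works better depends on the model and the task. The sketch below illustrates masked mean pooling under stated assumptions: small NumPy arrays stand in for `outputs.last_hidden_state` and for the `attention_mask` returned by the tokenizer, and `masked_mean_pool` is a hypothetical helper name, not part of Transformers.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors, ignoring padded positions.

    hidden:         (batch, seq_len, hidden_size) array of token vectors
    attention_mask: (batch, seq_len) array, 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                    # (batch, hidden_size)
    counts = np.clip(mask.sum(axis=1), 1.0, None)           # avoid division by zero
    return summed / counts

# Toy stand-in: 1 sentence, 4 positions, hidden size 3; the last position is padding.
hidden = np.array([[[1.0, 2.0, 3.0],
                    [3.0, 4.0, 5.0],
                    [5.0, 6.0, 7.0],
                    [100.0, 100.0, 100.0]]])
attention_mask = np.array([[1, 1, 1, 0]])

sentence_vec = masked_mean_pool(hidden, attention_mask)
print(sentence_vec)  # [[3. 4. 5.]]
```

With a single unpadded sentence, as in the cell above, the mask is all ones and this reduces to a plain mean over the token axis.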

#### Named Entity Recognition (NER) with the Hugging Face Transformers Library

In [10]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

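# Load a pretrained Chinese model (RoBERTa-wwm-ext, which uses the BERT architecture).
# Note: this checkpoint has not been fine-tuned for NER. Its token-classification
# head is randomly initialized (see the warning in the output), so the
# LABEL_0/LABEL_1 predictions printed below are not real entity labels.
tokenizer = AutoTokenizer.from_pretrained('hfl/chinese-roberta-wwm-ext')
model = AutoModelForTokenClassification.from_pretrained('hfl/chinese-roberta-wwm-ext')

# Build an NER pipeline
ner_pipeline = pipeline('ner', model=model, tokenizer=tokenizer)

# Run NER prediction
text = '我每到秋天就会去颐和园'
entities = ner_pipeline(text)

# Print the predicted entities
for entity in entities:
    print(entity)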

Some weights of the model checkpoint at hfl/chinese-roberta-wwm-ext were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at h

{'entity': 'LABEL_1', 'score': 0.6227718, 'index': 1, 'word': '我', 'start': 0, 'end': 1}
{'entity': 'LABEL_0', 'score': 0.50533104, 'index': 2, 'word': '每', 'start': 1, 'end': 2}
{'entity': 'LABEL_0', 'score': 0.69991153, 'index': 3, 'word': '到', 'start': 2, 'end': 3}
{'entity': 'LABEL_0', 'score': 0.51139605, 'index': 4, 'word': '秋', 'start': 3, 'end': 4}
{'entity': 'LABEL_0', 'score': 0.64497894, 'index': 5, 'word': '天', 'start': 4, 'end': 5}
{'entity': 'LABEL_1', 'score': 0.5511807, 'index': 6, 'word': '就', 'start': 5, 'end': 6}
{'entity': 'LABEL_0', 'score': 0.5219732, 'index': 7, 'word': '会', 'start': 6, 'end': 7}
{'entity': 'LABEL_0', 'score': 0.50868607, 'index': 8, 'word': '去', 'start': 7, 'end': 8}
{'entity': 'LABEL_1', 'score': 0.5624909, 'index': 9, 'word': '颐', 'start': 8, 'end': 9}
{'entity': 'LABEL_1', 'score': 0.51863146, 'index': 10, 'word': '和', 'start': 9, 'end': 10}
{'entity': 'LABEL_1', 'score': 0.6211865, 'index': 11, 'word': '园', 'start': 10, 'end': 11}
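
The scores above hover around 0.5 and the labels are the generic `LABEL_0`/`LABEL_1`, which fits a classification head that has not been trained for NER. A checkpoint fine-tuned for NER typically emits per-token tags in a scheme such as BIO, which then have to be merged into entity spans. The sketch below shows that merging step only. The tag sequence is hypothetical (written by hand, not produced by any model) and `group_bio` is an illustrative helper name.

```python
def group_bio(tokens, tags):
    """Merge token-level BIO tags into (text, type) entity spans."""
    entities = []
    current_text, current_type = "", None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if it is still open.
            if current_type is not None:
                entities.append((current_text, current_type))
            current_text, current_type = token, tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_text += token  # continue the open entity
        else:
            # "O" or an inconsistent "I-" tag closes any open entity.
            if current_type is not None:
                entities.append((current_text, current_type))
            current_text, current_type = "", None
    if current_type is not None:
        entities.append((current_text, current_type))
    return entities

tokens = list("我每到秋天就会去颐和园")
tags = ["O"] * 8 + ["B-LOC", "I-LOC", "I-LOC"]  # hypothetical tags for illustration
print(group_bio(tokens, tags))  # [('颐和园', 'LOC')]
```

The `pipeline` helper also accepts an `aggregation_strategy` argument that performs a similar grouping when the model's labels follow such a scheme.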


In [11]:
import spacy
import json
import pandas as pd

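weibo_df = pd.read_excel("weibo_data.xlsx")  # use pd.read_csv for a CSV file
data = []
# Collect the text in the fourth column (index 3).
# Note: range(1, ...) skips the first data row (index 0), because read_excel
# already treats the first sheet row as the header. Use range(len(weibo_df))
# if that row should be included.
for i in range(1, len(weibo_df)):
    data.append(weibo_df.iloc[i, 3])

# Load the pretrained Chinese spaCy model
nlp = spacy.load("zh_core_web_md")

output = {"data": []}  # dictionary with a "data" list that collects the entities

for line in data:
    # Run the spaCy pipeline on the text
    doc = nlp(line)
    for ent in doc.ents:
        # Append each entity's text ("实体") and type ("类型") to the "data" list
        output["data"].append({"实体": ent.text, "类型": ent.label_})

# Save to a JSON file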
with open('entities_output1.json', 'w', encoding='utf-8') as f:
    json.dump(output, f, ensure_ascii=False, indent=4)


In [12]:
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Example data
corpus = [
    "这部电影很好看，我非常喜欢。",
    "剧情很感人，演员演技也很棒。",
    "太无聊了，浪费时间。",
    "这是一部烂片，后悔看了。",
    "非常精彩的故事，推荐观看。",
    "不值得看，情节太差了。"
]

labels = [1, 1, 0, 0, 1, 0]  # 1 = positive review, 0 = negative review


# Chinese word segmentation: join the jieba tokens with spaces
corpus_cut = [" ".join(jieba.cut(text)) for text in corpus]

# TF-IDF feature extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus_cut)

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/hx/vrwsgqgx7bncbt28n44f0b540000gn/T/jieba.cache
Loading model cost 0.485 seconds.
Prefix dict has been built successfully.
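
One detail worth knowing when feeding segmented Chinese text to `TfidfVectorizer`: its default `token_pattern` only keeps tokens of two or more characters, so single-character words such as "很" or "太" are dropped. The sketch below shows the difference, assuming two hand-segmented strings in place of real jieba output. Whether single-character words should be kept is a modeling choice, not a fix that is always right.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hand-segmented examples (assumption: stands in for " ".join(jieba.cut(text))).
segmented = ["剧情 很 感人", "太 无聊 了"]

# Default pattern: tokens need at least two word characters.
default_vocab = set(TfidfVectorizer().fit(segmented).vocabulary_)

# Relaxed pattern: single-character tokens are kept as well.
keep_single = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
single_vocab = set(keep_single.fit(segmented).vocabulary_)

print(sorted(default_vocab))                 # the two-character words only
print(sorted(single_vocab - default_vocab))  # the single-character words
```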


In [13]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


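# Define the models
svm = SVC()
rf = RandomForestClassifier(random_state=42)

# Hyperparameter search space for the random forest
param_grid_rf = {
    'n_estimators': [10, 50, 100, 200],  # number of trees in the forest
    'max_depth': [None, 10, 20, 30],  # maximum depth of each tree
    'min_samples_split': [2, 5, 10],  # minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],  # minimum samples required at a leaf node
    # Number of features considered at each split. Note: 'auto' is rejected by
    # recent scikit-learn releases, which is why a third of the fits fail in
    # the output below.
    'max_features': ['auto', 'sqrt', 'log2'],
    'bootstrap': [True, False]  # whether to use bootstrap sampling
}

# Hyperparameter search space for the SVM
param_grid_svm = {
    'C': [0.1, 1, 10],  # regularization (penalty) parameter
    'kernel': ['linear', 'rbf'],  # kernel function
    'gamma': ['scale', 'auto']  # kernel coefficient
}

# Create the GridSearchCV objects (cv=2 because there are only six samples)
grid_search_svm = GridSearchCV(svm, param_grid_svm, cv=2, scoring='accuracy')
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid_rf, cv=2, scoring='accuracy', n_jobs=-1, verbose=2)

# Run the grid search
grid_search_svm.fit(X, labels)
grid_search_rf.fit(X, labels)

# Print the best parameters and the best models
print("SVM best params:", grid_search_svm.best_params_)
print("SVM best model:", grid_search_svm.best_estimator_)
print("RF best params:", grid_search_rf.best_params_)
print("RF best model:", grid_search_rf.best_estimator_)

# Predict the sentiment of a new sentence with the best models
# (a single example, not a held-out test set)
test_text = "这部电影真的太棒了！"
test_vector = vectorizer.transform([" ".join(jieba.cut(test_text))])
print("SVM prediction:", grid_search_svm.predict(test_vector))
print("Random Forest prediction:", grid_search_rf.predict(test_vector))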

Fitting 2 folds for each of 864 candidates, totalling 1728 fits
[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.0s
[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=10; total time=   0.0s
[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=   0.0s
[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=   0.0s
[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time=   0.0s
[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   0.0s
[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estima

576 fits failed out of a total of 1728.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
305 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/py39/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/envs/py39/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/opt/anaconda3/envs/py39/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/opt/anaconda3/envs/py39/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98, in validate_para
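
The 576 failed fits are exactly one third of the 1728 total, which matches the one invalid value out of three in `max_features`. A minimal sketch of a grid without `'auto'` is shown below. It rests on several assumptions: the corpus is hand-segmented to avoid the jieba dependency, the grid is deliberately tiny, and `error_score="raise"` is set so that an invalid parameter stops the search immediately instead of being scored as NaN.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Hand-segmented toy reviews (assumption: stands in for the jieba output).
corpus_cut = [
    "这部 电影 很 好看 我 非常 喜欢",
    "剧情 很 感人 演员 演技 也 很 棒",
    "太 无聊 了 浪费 时间",
    "这是 一部 烂片 后悔 看 了",
    "非常 精彩 的 故事 推荐 观看",
    "不 值得 看 情节 太 差 了",
]
labels = [1, 1, 0, 0, 1, 0]
X = TfidfVectorizer().fit_transform(corpus_cut)

# Deliberately small grid; 'auto' is left out of max_features.
param_grid = {
    "n_estimators": [10, 50],
    "max_features": ["sqrt", "log2", None],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=2,
    scoring="accuracy",
    error_score="raise",  # fail loudly instead of silently scoring NaN
)
grid.fit(X, labels)
scores = grid.cv_results_["mean_test_score"]
print(len(scores), grid.best_params_)
```

With six samples and two folds, the accuracy estimates are too noisy to say much about which setting is really best. The point of the sketch is only that every candidate now fits.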