# Fudan PRML Fall 2024 Exercise 4: Unsupervised Learning

![news](./news.png)

**Your name and Student ID:**

In this assignment, you will build a **text classification** system, a fundamental task in Natural Language Processing (NLP). More precisely, you are given a news classification task: assign each news text to the category it belongs to. Unlike traditional classification tasks, **no labels are provided for this assignment, and you need to find a way to construct labels for these articles**.

For this assignment, you can use common deep learning frameworks such as PyTorch. **You can use pretrained word vectors such as GloVe, but not pretrained large models such as BERT.**

# 1. Setup

In [10]:
# setup code
%load_ext autoreload
%autoreload 2
%env CUDA_VISIBLE_DEVICES = 1
import os
import pickle
import numpy as np
import re
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
env: CUDA_VISIBLE_DEVICES=1


In [11]:
dataset_path = 'kmeans_news.pkl'

# Load the pickled list of news titles
with open(dataset_path, 'rb') as fin:
    all_data = pickle.load(fin)

# NumPy copy of the titles, used later for boolean and index-based selection
all_data_np = np.array(all_data)

print('\n'.join(all_data[0:5]))
print('Total number of news: {}'.format(len(all_data)))

经济学家吴敬琏为什么反对“不惜代价发展芯片产业”？
颜值很高的她，一双美腿甚至可以让人忽略她的颜值
转自常德诗人”再访桃花源“（再续心灵故乡的故事）
虎牙拼杀四年抢得游戏直播第一股 但真正的挑战才刚开始
如何评价许鞍华导演？她的电影为什么总能给人以触动？
Total number of news: 83360


# 2. Exploratory Data Analysis

Not all data within the dataset is suitable for clustering. You might need to filter and process some of them in advance.
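
One possible filtering step is sketched below. It assumes that exact duplicates and very short titles are the items worth dropping; the helper name `filter_titles` and the `min_chars` threshold are hypothetical choices for illustration, not part of the assignment.

```python
import re

def filter_titles(titles, min_chars=4):
    """Drop exact duplicates and very short titles, preserving input order."""
    seen = set()
    kept = []
    for title in titles:
        # Collapse runs of whitespace and trim the ends before comparing
        normalized = re.sub(r"\s+", " ", title).strip()
        if len(normalized) < min_chars or normalized in seen:
            continue
        seen.add(normalized)
        kept.append(normalized)
    return kept

sample = ["现货黄金是否值得投资？", "现货黄金是否值得投资？", "  ", "山背的守望"]
filtered = filter_titles(sample)
print(filtered)
```

If titles are filtered this way, any array used to display cluster members later should be built from the filtered list so that indices stay aligned with the embeddings.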

In [12]:
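# 2. Load and clean the data
dataset_path = 'kmeans_news.pkl'

def clean_text(text):
    # Replace punctuation with spaces; \w is Unicode-aware, so Chinese characters are kept
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse repeated whitespace into single spaces
    text = ' '.join(text.split())
    return text.lower()

# Load the data
with open(dataset_path,'rb') as fin:
    all_data = pickle.load(fin)
    all_data_np = np.array(all_data)

# Clean the data
cleaned_data = [clean_text(news) for news in all_data]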

print('Total number of news:', len(all_data))
print('\nOriginal samples:')
print('\n'.join(all_data[0:3]))
print('\nCleaned samples:')
print('\n'.join(cleaned_data[0:3]))

Total number of news: 83360

Original samples:
经济学家吴敬琏为什么反对“不惜代价发展芯片产业”？
颜值很高的她，一双美腿甚至可以让人忽略她的颜值
转自常德诗人”再访桃花源“（再续心灵故乡的故事）

Cleaned samples:
经济学家吴敬琏为什么反对 不惜代价发展芯片产业
颜值很高的她 一双美腿甚至可以让人忽略她的颜值
转自常德诗人 再访桃花源 再续心灵故乡的故事


# 3. Get embeddings for the news

We need to convert the news titles into some kind of numerical representation (embedding) before we can do clustering on them. Below are two ways to get embeddings for a paragraph of text:

1. **Pretrained word embeddings**: You can use pretrained word embeddings such as GloVe to get an embedding for each word in the news title, and then average them (or try more advanced techniques) to get the news embedding.

2. **General text embedding models**: You can use a general text embedding model to get an embedding for a sentence directly.

You can choose either of them to convert the news titles into embeddings.
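
To make the first option concrete, here is a minimal sketch of averaging word vectors. The three-dimensional `toy_vectors` dictionary and the helper `average_embedding` are hypothetical stand-ins for a real pretrained vocabulary; with real vectors, only the lookup table and the dimension would change.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for pretrained embeddings (hypothetical values)
toy_vectors = {
    "gold": np.array([1.0, 0.0, 0.0]),
    "price": np.array([0.0, 1.0, 0.0]),
}

def average_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "unknown" is not in the toy vocabulary, so it is skipped
emb = average_embedding(["gold", "price", "unknown"], toy_vectors, dim=3)
print(emb)
```

Note that out-of-vocabulary tokens are silently ignored in this sketch, so a text whose tokens are all unknown collapses to the zero vector.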

In [13]:
# 1. First uninstall packages that may conflict
!pip uninstall -y scipy gensim

# 2. Install a pinned version of scipy
!pip install scipy==1.11.4

# 3. Install gensim (use --no-deps to avoid dependency conflicts)
!pip install --no-deps gensim

# 4. Install nltk
!pip install nltk

# 5. After restarting the kernel, run the following code
import gensim.downloader as api
import numpy as np
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Load the pretrained GloVe word vectors
word_vectors = api.load('glove-wiki-gigaword-100')  # 100-dimensional GloVe vectors

Found existing installation: scipy 1.11.4
Uninstalling scipy-1.11.4:
  Successfully uninstalled scipy-1.11.4
Found existing installation: gensim 4.3.3
Uninstalling gensim-4.3.3:
  Successfully uninstalled gensim-4.3.3




Collecting scipy==1.11.4
  Using cached scipy-1.11.4-cp312-cp312-win_amd64.whl.metadata (60 kB)
Using cached scipy-1.11.4-cp312-cp312-win_amd64.whl (43.7 MB)
Installing collected packages: scipy
Successfully installed scipy-1.11.4
Collecting gensim
  Using cached gensim-4.3.3-cp312-cp312-win_amd64.whl.metadata (8.2 kB)
Using cached gensim-4.3.3-cp312-cp312-win_amd64.whl (24.0 MB)
Installing collected packages: gensim
Successfully installed gensim-4.3.3


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\13004\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [19]:
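# 4. Compute TF-IDF weights
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(cleaned_data)
# Map each vocabulary term to its IDF weight (tfidf.idf_ holds IDF values, not per-document TF-IDF scores)
word_to_tfidf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

# 5. Define the weighted sentence embedding function
def get_weighted_sentence_embedding(sentence, tfidf_weights):
    tokens = word_tokenize(sentence.lower())
    weighted_vectors = []
    
    # Only tokens present in both the GloVe vocabulary and the TF-IDF vocabulary contribute
    for token in tokens:
        if token in word_vectors and token in tfidf_weights:
            weighted_vectors.append(word_vectors[token] * tfidf_weights[token])
    
    # glove-wiki-gigaword-100 is trained on English text, so a title with no
    # in-vocabulary tokens falls back to the zero vector
    if not weighted_vectors:
        return np.zeros(word_vectors.vector_size)  # 100 dimensions here
    
    return np.mean(weighted_vectors, axis=0)

# 6. Generate weighted embeddings for all news titles
sentence_embeddings = np.array([
    get_weighted_sentence_embedding(news, word_to_tfidf) 
    for news in cleaned_data
])

print(f"Embeddings shape: {sentence_embeddings.shape}")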

Embeddings shape: (83360, 100)
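
Because `get_weighted_sentence_embedding` returns a zero vector whenever no token matches, a quick sanity check one might add at this point is the share of all-zero rows. The sketch below uses a small hypothetical `demo` array in place of `sentence_embeddings`; the helper name `zero_row_fraction` is also an illustrative assumption.

```python
import numpy as np

def zero_row_fraction(embeddings):
    """Return the fraction of rows whose entries are all zero."""
    embeddings = np.asarray(embeddings)
    zero_rows = ~embeddings.any(axis=1)
    return float(zero_rows.mean())

# Hypothetical stand-in for sentence_embeddings: two of four rows are all-zero
demo = np.array([[0.0, 0.0], [0.1, -0.2], [0.0, 0.0], [0.3, 0.0]])
frac = zero_row_fraction(demo)
print(f"{frac:.0%} of rows are all-zero")
```

Calling `zero_row_fraction(sentence_embeddings)` would show how many titles received no usable vector. A large share would suggest that the tokenizer or the word vectors do not cover the language of the titles.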


# 4. Clustering

Run K-means clustering on the sentence embeddings.

In [15]:
from sklearn.cluster import KMeans

clusters = 15
kmeans = KMeans(n_clusters=clusters, random_state=0).fit(sentence_embeddings)
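
The value `clusters = 15` is fixed by hand above. One common way to compare candidate values is the silhouette score, sketched here on synthetic 2-D data rather than on the real embeddings. The data, the candidate range, and the helper name `silhouette_by_k` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with three well-separated groups (a stand-in for real embeddings)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

def silhouette_by_k(data, candidate_ks):
    """Fit K-means for each candidate k and return a {k: silhouette score} dict."""
    scores = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return scores

scores = silhouette_by_k(points, range(2, 6))
best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(best_k, scores)
```

On a dataset of this assignment's size, computing the silhouette on every point can be slow; `silhouette_score` accepts a `sample_size` argument that evaluates it on a random subset instead.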



View samples from each cluster.

In [16]:
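random_sample = True  # set to False to show the first five titles of each cluster instead
for i in range(clusters):
    print(f'Cluster {i} has {np.sum(kmeans.labels_ == i)} sentences')
    if random_sample:
        # np.random.choice samples with replacement by default, so a title can appear more than once
        print('\n'.join(all_data_np[np.random.choice(np.where(kmeans.labels_ == i)[0], 5)]))
    else:
        print('\n'.join(all_data_np[kmeans.labels_ == i][0:5]))
    print('')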

Cluster 0 has 81722 sentences
不愧是中国第一门神！亚冠赛再现神奇扑救，难怪里皮如此器重他！
山背的守望
《后西游记》之10巧计救太后（浙江人美总17册）
尿素直接用等于是丢钱，如何正确使用？有何使用禁忌？
现货黄金是否值得投资？

Cluster 1 has 254 sentences
英语单词拼字小游戏（16）
领克02价格公布，14.2~19.8万与领克01又怎样的差别？
比亚迪汉DM成爆款，11月大卖10105台，1.4升油耗，21.98万迷倒一片
5月1号以后能不能开具17%、11%增值税发票？怎么处理报税问题？
安阳之夜（11）：春节时的易园门口

Cluster 2 has 94 sentences
2018 MetGala现场，金大姐金小妹纷纷亮相！
2018“权健杯”全国青少年足球邀请赛邀请函
2018.5.8 生猪价格 5月下旬能否出现窄幅反弹行情？
2018 Met Ball现场，网友：大牌太多看不过来！
2018.5.8 生猪价格 5月下旬能否出现窄幅反弹行情？

Cluster 3 has 333 sentences
“王”即将回归，剧场版动画“K SEVEN STORIES”的一些情报
fate/stay night heaven's feel 剧场版在中国上映吗？
谁才是真“PLUS”？宋PLUS对比CS75 PLUS
Pro-BTC华尔街分析师敦促尽管最近的集会，但现在不购买加密
Windows 10 Build 17093都更新了哪些内容？

Cluster 4 has 34 sentences
Square 2018年第一季度比特币的营收仅20万美元
波兰PZL-230“蝎子”攻击机，很有科幻外表的小短腿
《无双大蛇 3》中文官网上线，170 名角色将在异世界冒险
你，100%没见过的“绝版”94年“牡丹壹元”
比钢还硬的铁木筷，抗菌除霉防蛀，200°高温不变形，能用一辈子

Cluster 5 has 8 sentences
GIF：米克尔门前头球顶飞，错失进球良机
GIF：米克尔门前头球顶飞，错失进球良机
Nvidia 人工智能是游戏的重要组成部分
GIF：库蒂尼奥梅开二度，巴萨再扳一城
GIF：霹雳无敌帅炸天，洛夫伦头槌扩大比分

Cluster 6 has 203 sentences
想赚的零花