# Text Mining Demo

The dataset is the [ctrip hotel review](https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv) corpus, which contains over 7,000 hotel reviews: more than 5,000 positive and more than 2,000 negative. The data has two columns.

The first column is the label: 0 marks a negative review and 1 marks a positive review. The second column holds the review text. In this small demo, 50 positive and 50 negative reviews are randomly sampled for text mining.

![](./tf-idf.png)
***Fig 1. Typical Process of Text Mining***

**Term Frequency (TF) $\text{tf}_{i, j}$**: the number of occurrences of term $t_i$ in document $d_j$

**Inverse Document Frequency (IDF) for term $t_i$** is:
$$
\text{idf}_i=\log_2\frac{|D|}{|\{d|t_i\in d\}|}
$$
where $|D|$ is the total number of documents, $|\{d|t_i\in d\}|$ is the number of documents where term $t_i$ appears.

**Term Frequency - Inverse Document Frequency (TF-IDF)** is defined as:
$$
\text{tf-idf}_{i,j}=\text{tf}_{i,j}\cdot \text{idf}_i
$$
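
As a minimal sketch of the definitions above, the following computes tf, idf, and tf-idf by hand on three made-up documents. The documents and the helper names `tf`, `idf`, and `tf_idf` are illustrative assumptions, not part of the notebook. The sketch uses the base-2 logarithm from the formula; scikit-learn's `TfidfVectorizer`, used later, applies a smoothed natural-log variant with L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Three toy documents, already segmented into space-separated terms
# (hypothetical examples, not taken from the dataset).
docs = [
    "room clean staff friendly",
    "room dirty staff rude",
    "breakfast good room clean",
]
tokenized = [doc.split() for doc in docs]


def tf(term, tokens):
    """Raw count of `term` in one document (tf_{i,j})."""
    return Counter(tokens)[term]


def idf(term, corpus):
    """log2(|D| / number of documents containing `term`).

    Assumes `term` occurs in at least one document of `corpus`.
    """
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log2(len(corpus) / df)


def tf_idf(term, tokens, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, tokens) * idf(term, corpus)


print(tf_idf("room", tokenized[0], tokenized))   # "room" occurs in all 3 documents
print(tf_idf("clean", tokenized[0], tokenized))  # "clean" occurs in 2 of 3 documents
print(tf_idf("rude", tokenized[1], tokenized))   # "rude" occurs in 1 of 3 documents
```

A term that appears in every document, such as `room` here, gets an idf of 0 and therefore carries no weight, while rarer terms are weighted more heavily.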

Created by *Xinghao YU*, July 18th, 2023

***Copyright @ The Chinese University of Hong Kong, Shenzhen***

## Load Dependencies and Configuration Settings

In [None]:
import jieba
import random
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
import seaborn as sns
import matplotlib.pyplot as plt

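# Seed both Python's and NumPy's global generators: pandas' sample() and
# scikit-learn estimators draw from NumPy's when no random_state is given.
random.seed(10086)
np.random.seed(10086)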
plt.rcParams["figure.figsize"] = (12, 10)

## Segmentation

In [None]:
pd_all = pd.read_csv('./ChnSentiCorp_htl_all.csv')

print('Number of reviews (total): %d' % pd_all.shape[0])
print('Number of reviews (positive): %d' % pd_all[pd_all.label == 1].shape[0])
print('Number of reviews (negative): %d' % pd_all[pd_all.label == 0].shape[0])

pd_all.sample(10)

In [None]:
pd_sample = pd.concat([
    pd_all[pd_all.label == 1].sample(50),
    pd_all[pd_all.label == 0].sample(50)
])
pd_sample.sample(5)

In [None]:
# Collect each review in the sample as one entry (one review per line)
file_line = []

count = 0  # number of rows collected
for line in range(0, pd_sample.shape[0]):
    file_line.append(pd_sample.iloc[line]['review'])
    count += 1

print("There are %d rows in total." % count)

In [None]:
file_line

In [None]:
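# Add a user-dictionary entry dynamically during processing
jieba.suggest_freq('碧海蓝天', True)
# Alternatively, prepare a dictionary file first, e.g. mydict.txt
# Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path to a custom dictionary
# Dictionary format: one word per line; each line has three space-separated parts in fixed order:
# word, frequency (optional), part-of-speech tag (optional).
# If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
# add_word(word, freq=None, tag=None) and del_word(word) modify the dictionary dynamically within the program.

# Segment the reviews with jieba
# file_userDict = 'dict.txt'  # custom dictionary; none available yet
# jieba.load_userdict(file_userDict)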
res = []
for i in range(len(file_line)):
    sentence_seged = jieba.cut(file_line[i].strip())
    res.append(' '.join(sentence_seged))

print('Segmentation Complete!')

for i in range(len(res)):
    print(res[i])
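
The user-dictionary format described in the comments above can be sketched as follows. This is only an illustration under stated assumptions: the entries, the helper `format_entry`, and the temporary file location are hypothetical, and the sketch only writes and re-reads the file rather than calling jieba.

```python
import os
import tempfile

# Hypothetical entries: word, optional frequency, optional part-of-speech tag.
entries = [
    ("碧海蓝天", 10, "n"),
    ("性价比", None, None),
]


def format_entry(word, freq=None, tag=None):
    """Build one dictionary line: space-separated parts in fixed order."""
    parts = [word]
    if freq is not None:
        parts.append(str(freq))
    if tag is not None:
        parts.append(tag)
    return " ".join(parts)


# Write one word per line, UTF-8 encoded, into a temporary directory.
dict_path = os.path.join(tempfile.mkdtemp(), "mydict.txt")
with open(dict_path, "w", encoding="utf-8") as f:
    for word, freq, tag in entries:
        f.write(format_entry(word, freq, tag) + "\n")

# Read the file back to inspect the lines that were written.
with open(dict_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)

# With jieba installed, the file could then be loaded with:
# jieba.load_userdict(dict_path)
```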

In [None]:
# Load the stop-word list
# Assumes the custom Chinese stop-word file is UTF-8 encoded
with open('./stop_word.txt', encoding='utf-8') as f_stop:
    sw = [line.strip() for line in f_stop]
# sw

In [None]:
word_list_seg = []
stopwords = set(sw)  # set membership tests are faster than list lookups

for i in range(len(res)):
    outstr = ''
    for word in res[i].split():
        # print('word:', word)
        # Keep the word unless it is a stop word or a stray tab character
        if word not in stopwords and word != '\t':
            outstr += word
            outstr += " "
    print('Sentence %d, outstr: %s' % (i, outstr))
    word_list_seg.append(outstr)

print('_______________________')
print('Stop Words Removal Complete!')
print(len(word_list_seg))

# for i in range(len(word_list_seg)):
#    print(word_list_seg[i])

## SVD and LSI

![](./svd.png)
***Fig 2. SVD***

***Latent Semantic Indexing (LSI)*** is a method for discovering hidden concepts in document data. Each document and term (word) is expressed as a vector whose elements correspond to these concepts. Each element gives the degree to which the document or term participates in the corresponding concept. The goal is not to describe the concepts verbally, but to represent documents and terms in a unified way that exposes document-document, document-term, and term-term similarities or semantic relationships that are otherwise hidden.
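
To make the idea concrete, here is a small sketch that applies a plain SVD to a made-up document-term matrix with NumPy. The matrix values, the term list, and the helper `cosine` are assumptions for illustration only. The notebook itself uses scikit-learn's `TruncatedSVD` on tf-idf weights, which likewise yields document vectors of the form $U\Sigma$.

```python
import numpy as np

# Toy document-term matrix (rows: documents, columns: terms). The values are
# made-up weights standing in for tf-idf scores.
terms = ["clean", "friendly", "dirty", "rude"]
X = np.array([
    [1.0, 1.0, 0.0, 0.0],   # d1: positive vocabulary
    [1.0, 0.5, 0.0, 0.0],   # d2: positive vocabulary
    [0.0, 0.0, 2.0, 2.0],   # d3: negative vocabulary
    [0.0, 0.0, 1.0, 2.0],   # d4: negative vocabulary
])

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of concepts to keep
doc_concept = U[:, :k] * s[:k]       # one row per document
term_concept = Vt[:k, :].T * s[:k]   # one row per term


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(np.round(doc_concept, 3))
print(cosine(doc_concept[0], doc_concept[1]))  # d1 vs d2: close to 1
print(cosine(doc_concept[0], doc_concept[2]))  # d1 vs d3: close to 0
```

In this toy case the two retained concepts line up with the positive and negative vocabularies, so documents that share no terms with each other end up nearly orthogonal in concept space.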

In [None]:
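# raw documents to tf-idf matrix:
# stop_words='english' only filters English tokens; Chinese stop words were removed above.
# Note: the default token_pattern keeps only tokens of two or more characters,
# so single-character Chinese words are dropped.
vectorizer = TfidfVectorizer(stop_words='english',
                             use_idf=True,
                             smooth_idf=True)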
# SVD to reduce dimensionality:
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=10)
# pipeline of tf-idf + SVD, fit to and applied to documents:
svd_transformer = Pipeline([('tfidf', vectorizer), ('svd', svd_model)])
dc_matrix = svd_transformer.fit_transform(word_list_seg)
# dc_matrix can later be used to compare documents, compare words, or compare queries with documents

In [None]:
dc_matrix.shape

In [None]:
svd_model.components_.shape

In [None]:
sum(svd_model.explained_variance_ratio_)

In [None]:
document_concept_matrix = pd.DataFrame(dc_matrix)

d = []
for row in range(0, document_concept_matrix.shape[0]):
    d.append(f'd{row+1}')
document_concept_matrix.index = d

tc_matrix = np.dot(svd_model.components_.T,
                   np.diag(svd_model.singular_values_))
term_concept_matrix = pd.DataFrame(tc_matrix)
term_concept_matrix.index = vectorizer.get_feature_names_out()

In [None]:
document_term = pd.concat([document_concept_matrix, term_concept_matrix])
document_term

In [None]:
# plot all document-concept vectors
plt.scatter(x=document_concept_matrix[0], y=document_concept_matrix[1])
# add labels to all points
for idx, row in document_concept_matrix.iterrows():
    plt.text(row[0], row[1], idx)

## Text Similarity

**Seaborn** is a library for making statistical graphics in Python. It builds on top of *matplotlib* and integrates closely with pandas data structures.

**Seaborn** helps you explore and understand your data. Its plotting functions operate on *dataframes* and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots. Its dataset-oriented, declarative *API* lets you focus on what the different elements of your plots mean, rather than on the details of how to draw them.

In [None]:
def similar_matrix(truncated_text_vector, similarity_function):
    # A single call on the full array returns every pairwise similarity,
    # so no explicit loop over document pairs is needed.
    matrix = similarity_function(truncated_text_vector)
    sns.heatmap(matrix, center=1, annot=False)
    plt.show()


similar_matrix(dc_matrix, cosine_similarity)
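
The quantity shown in the heatmap is the cosine similarity $\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}$ between every pair of document vectors. A minimal pure-Python sketch of that computation follows. The helper `cosine_sim` and the example vectors are assumptions for illustration; the notebook relies on scikit-learn's `cosine_similarity` instead.

```python
import math


def cosine_sim(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vectors = [
    [1.0, 2.0],
    [2.0, 4.0],    # same direction as the first vector
    [-2.0, 1.0],   # orthogonal to the first vector
]

# Pairwise similarity matrix, analogous to the one drawn in the heatmap.
matrix = [[cosine_sim(u, v) for v in vectors] for u in vectors]
for row in matrix:
    print([round(x, 3) for x in row])
```

Vectors that point in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, which is why cosine similarity suits documents of different lengths.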

## Text Clustering

In [None]:
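# Set n_init and random_state explicitly so the result is reproducible
# and does not depend on scikit-learn's version-specific defaults.
clf = KMeans(n_clusters=4, n_init=10, random_state=10086)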

kmeans_results = clf.fit_predict(dc_matrix)
kmeans_results

In [None]:
plt.scatter(x=document_concept_matrix[0],
            y=document_concept_matrix[1],
            c=kmeans_results)
for idx, row in document_concept_matrix.iterrows():
    plt.text(row[0], row[1], idx)
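
For intuition about what the clustering step does, here is a rough sketch of the basic k-means loop (Lloyd's algorithm) in pure Python. It is not scikit-learn's implementation, which adds k-means++ initialization and multiple restarts. The function names `sq_dist` and `kmeans` and the toy points are assumptions for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm with random initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups in 2-D.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The cluster indices themselves are arbitrary; only which points share a label is meaningful, which is also how the colors in the scatter plot above should be read.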