<a href="https://colab.research.google.com/github/Jiaye39/TimeSeriesAnalysis/blob/main/Bag_of_Words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bag of Words
We study two types of BoW vectors:
* **Raw Count**: count the number of occurrences of each word in a text
* **TF-IDF**: adjust the raw count to favor words that appear a lot in a few documents, as opposed to those that appear a lot in all documents

Scope of this notebook:
* Related packages: NLTK, spaCy, tqdm, eli5, gensim, rich
* Basic usage and principles of each library, followed by a regression in practice and an analysis of the regression weights (whether a word leans negative or positive)

**GitHub does not display the outputs; open the notebook in Colab and rerun it.**

Basic import

In [None]:
import numpy as np
import pandas as pd
import math
import requests

# Download Corpus

In [None]:
r=requests.get('https://sherlock-holm.es/stories/plain-text/scan.txt')

assert r.status_code == 200

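# Save the story locally with an explicit encoding
with open('scandal_in_bohemia.txt', 'w', encoding='utf-8') as out:
    out.write(r.content.decode('utf-8'))

# Reload it line by line, keeping only non-empty lines
with open('scandal_in_bohemia.txt', encoding='utf-8') as f:
    lines = [txt for txt in f if len(txt.strip()) > 0]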

print(lines[:20])

In [None]:
# Define the first paragraph; join with a space so words at line breaks do not fuse
par = ' '.join([x.strip() for x in lines[7:25]])

# NLTK
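*   punkt: a pretrained tokenizer model that `sent_tokenize` and `word_tokenize` rely on to split text into sentences and words. This is a basic step in many NLP tasks.
*   punkt_tab: a companion resource to punkt. Recent NLTK releases load the Punkt data from this tabular (non-pickle) format, so it is downloaded as well.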




In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')



*   sent_tokenize: The sentence tokenizer takes care to split a text into sentences.
*   word_tokenize: The word tokenizer takes care to split a text into words.

*   Split the text first, then vectorize it





In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
nltk_sentences=sent_tokenize(par)
nltk_words=word_tokenize(par)
print(nltk_sentences,'\n')
print(nltk_words)

# spaCy


*   `en_core_web_sm` bundles all the components and data needed to process English (en) text
*   `zh_core_web_sm` is the equivalent model for Chinese (zh) text



In [None]:
import spacy
nlp=spacy.load('en_core_web_sm')
doc=nlp(par)



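*   `spacy_sentences` (`doc.sents`): split the text into sentences, which can then be vectorized
*   `spacy_tokens` (`x for x in spacy_sentences[i]`): split a sentence into tokens, which can then be vectorized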



In [None]:
spacy_sentences=list(doc.sents)
spacy_tokens=[x for x in spacy_sentences[0]]
print(spacy_sentences,'\n')
print(spacy_tokens)

# Sklearn Generalities (CountVectorizer & TfidfVectorizer)
Classes like `CountVectorizer` or `TfidfVectorizer` work in the following way:
* Instantiate an object with specific parameters (`v = CountVectorizer(...)`)
* Fit this object to your corpus = learn the vocabulary (method `v.fit(...)`)
* Transform any piece of text you have into a vector (method `v.transform()`)
* **Use `CountVectorizer` or `TfidfVectorizer` for feature extraction: text must be converted to a BoW representation before regression or similar operations can be applied**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

Define a few helper functions for the analysis below

In [None]:
def show_vocabulary(vectorizer):
    words = vectorizer.get_feature_names_out()

    print(f'Vocabulary size: {len(words)} words')

    # we can print ~10 words per line
    for l in np.array_split(words, math.ceil(len(words) / 10)):
        print(''.join([f'{x:<15}' for x in l]))

In [None]:
import os
os.environ["FORCE_COLOR"] = "1"

from termcolor import colored

def show_bow(vectorizer, bow):
    words = vectorizer.get_feature_names_out()

    # we can print ~8 words + coefs per line
    for l in np.array_split(list(zip(words, bow)), math.ceil(len(words) / 8)):
        print(' | '.join([colored(f'{w:<15}:{n:>2}', 'grey') if int(n) == 0 else colored(f'{w:<15}:{n:>2}', on_color='on_yellow', attrs=['bold']) for w, n in l ]))

def show_bow_float(vectorizer, bow):
    words = vectorizer.get_feature_names_out()

    # we can print ~6 words + coefs per line
    for l in np.array_split(list(zip(words, bow)), math.ceil(len(words) / 6)):
        print(' | '.join([colored(f'{w:<15}:{float(n):>0.2f}', 'grey') if float(n) == 0 else colored(f'{w:<15}:{float(n):>0.2f}', on_color='on_yellow', attrs=['bold']) for w, n in l ]))


# Raw Count
* We take a text, any text, and represent it as a vector
* Each text is represented by a vector with **N** dimensions
* Each dimension is representative of **1 word** of the vocabulary
* The coefficient in dimension **k** is the number of times the word at index **k** in the vocabulary is seen in the represented text
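
To make the definition concrete, here is a minimal sketch that builds a raw count vector by hand. The vocabulary and the text are hypothetical and chosen only to keep the vector short.

```python
from collections import Counter

# Hypothetical toy vocabulary (N = 5 dimensions) and text
vocabulary = ["a", "holmes", "is", "the", "woman"]
text = "the woman is the woman"

counts = Counter(text.split())
# Dimension k holds the number of times vocabulary[k] occurs in the text
vector = [counts[word] for word in vocabulary]
print(vector)
```

Words of the vocabulary that are absent from the text get a coefficient of 0, which is why BoW vectors are usually sparse.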

Example: reduced vocabulary (corpus: the first paragraph of the book; document: one sentence)

In [None]:
count_small= CountVectorizer(lowercase=True)
count_small.fit(nltk_sentences)
show_vocabulary(count_small)

The option `lowercase` controls one behavior of the raw count: do we consider `And` to be different from `and`?

* `lowercase=False` gives 134 unique words in the vocabulary
* `lowercase=True` gives 127 unique words
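
The same effect can be reproduced on a smaller scale. The two sentences below are hypothetical; the sketch only shows that case folding merges `And` and `and` into a single vocabulary entry.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences containing both "And" and "and"
toy = ["And then he left", "and she stayed"]

cased = CountVectorizer(lowercase=False).fit(toy)
uncased = CountVectorizer(lowercase=True).fit(toy)

# vocabulary_ maps each word to its column index
print(len(cased.vocabulary_), len(uncased.vocabulary_))
```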

In [None]:
s = nltk_sentences[0]

print(f'Text: "{s}"')
bow = count_small.transform([s])
print(f'BoW Shape: {bow.shape}')
bow = bow.toarray()   # From sparse matrix to dense matrix (Careful with MEMORY)
print(f'BoW Vector: {bow}')

In [None]:
show_bow(count_small,bow[0])

# TF-IDF
The basic motivation for TF-IDF is that cosine similarity with raw count coefficients puts too much emphasis on the number of occurrences of a word within a document.

Repeating a word will artificially increase the cosine similarity with any text containing this word.
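
A small numeric sketch of this effect, assuming hypothetical raw count vectors over a three-word vocabulary: repeating one word pushes the cosine similarity with any text containing that word toward 1.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical raw count vectors over the vocabulary ["holmes", "the", "woman"]
query = np.array([0, 1, 0])           # a text containing only "the"
doc = np.array([1, 1, 1])             # each word appears once
doc_repeated = np.array([1, 10, 1])   # same text, with "the" repeated 10 times

print(f"{cosine(query, doc):.3f}")
print(f"{cosine(query, doc_repeated):.3f}")
```

The second similarity is higher even though the repeated document carries no additional meaning.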

Consider which word would be more important:
1. One that is repeated a lot and equally present in each document
1. One that appears a lot in only a few documents

TF-IDF computes coefficients with:
* Low values for common words (i.e. present in the document, but quite common over the corpus)
* High values for uncommon words (i.e. present in the document, but not common over the corpus)

We consider one specific document, and one specific word.

* **TF = Term Frequency**: the number of times the word appears in the document
* **DF = Document Frequency**: the number of documents in the corpus in which the word appears
* **IDF = Inverse Document Frequency**: the inverse of the Document Frequency

Logarithms are introduced to reflect that a word occurring 100 times more often does not deliver 100 times more information.

Given a word **w**, a document **d** in a corpus of **D** documents:

$\textrm{TF-IDF}(w, d) = \textrm{TF}(w, d) \times \textrm{IDF}(w)$

$
\begin{align}
\textrm{IDF}(w) = \log \left( \frac{1 + D}{1 + \textrm{DF}(w)} \right) + 1
\end{align}
$

In [None]:
tfidf=TfidfVectorizer()
tfidf.fit(nltk_sentences)
show_vocabulary(tfidf)

In [None]:
s = nltk_sentences[0]

print(f'Text: "{s}"')
bow = tfidf.transform([s])
print(f'BoW Shape: {bow.shape}')
bow = bow.toarray()   # From sparse matrix to dense matrix (Careful with MEMORY)
print(f'BoW Vector: {bow}')

In [None]:
show_bow_float(tfidf,bow[0])

Display the IDF of some words.

* High IDF = word that appears in few documents
* Low IDF = word that appears in most documents

In [None]:
words = tfidf.get_feature_names_out()
word = input('Word: ').lower()

if word in words:
    k = list(words).index(word)
    print(f'IDF({words[k]}) = {tfidf.idf_[k]}')
else:
    print('Not in vocabulary')

More than one TF-IDF:

There is a family of TF-IDF formulas.

Another example is **sublinear TF**, where the raw count is replaced by:

$
\begin{align}
\textrm{TF}(w, d) = 1 + \log \left( \textrm{raw count}(w, d) \right)
\end{align}
$
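
A quick numeric sketch of how sublinear TF dampens large counts, assuming the natural logarithm:

```python
import math

# Compare a few raw counts with their sublinear TF value
for raw_count in (1, 10, 100):
    sublinear_tf = 1 + math.log(raw_count)
    print(f"raw count = {raw_count:>3} -> sublinear TF = {sublinear_tf:.2f}")
```

A word repeated 100 times ends up with a TF between 5 and 6 instead of 100, so repetition alone cannot dominate the vector.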

In [None]:
tfidf_sublinear = TfidfVectorizer(sublinear_tf=True)
tfidf_sublinear.fit(nltk_sentences)

In [None]:
s = nltk_sentences[0]

print(f'Text: "{s}"')
bow_sl = tfidf_sublinear.transform([s])
print(f'BoW Shape: {bow_sl.shape}')
bow_sl = bow_sl.toarray()   # From sparse matrix to dense matrix (Careful with MEMORY)
print(f'BoW Vector: {bow_sl}')

In [None]:
show_bow_float(tfidf_sublinear, bow_sl[0])

In [None]:
word = 'yet'

index = words.tolist().index(word)

bow = tfidf.transform([s]).toarray()

print(f'Word: "{word}"')
print(f'TF-IDF with Natural TF   = {bow[0][index]:0.4f}')
print(f'TF-IDF with Sublinear TF = {bow_sl[0][index]:0.4f}')

In [None]:
word = 'yet'
s = nltk_sentences[0]
s = s + ' ' + ' '.join(100 * [word])  # append the word 100 times, separated from the sentence by a space

bow = tfidf.transform([s]).toarray()
bow_sl = tfidf_sublinear.transform([s]).toarray()

index = words.tolist().index(word)
print(f'Word: "{word}"')
print(f'TF-IDF with Natural TF   = {bow[0][index]:0.4f}')
print(f'TF-IDF with Sublinear TF = {bow_sl[0][index]:0.4f}')

# Word Count & Frequency

# tqdm

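*   `from tqdm.notebook import tqdm`
*   It displays a progress bar inside a loop, so you can follow how far the code has progressed, which is especially helpful when processing large amounts of data.
*   Basic usage: wrap an iterable in a call, e.g. `for i in tqdm(range(10)):`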
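
A minimal sketch of the basic usage. It imports the plain `tqdm` so that it also runs outside a notebook; the `desc` label is an arbitrary choice.

```python
from tqdm import tqdm  # in a notebook, tqdm.notebook.tqdm renders a widget instead

total = 0
# Wrapping the iterable is enough to get a progress bar
for i in tqdm(range(10), desc="Summing"):
    total += i

print(total)
```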



In [None]:
import io
import re
import tarfile

import requests
from tqdm.notebook import tqdm

# Download the IMDB review archive (kept in memory, not written to disk)
r = requests.get("http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz")
r.raise_for_status()

good_file = re.compile(r"^aclImdb/(test|train)/(pos|neg)/.*\.txt$")

with tarfile.open(fileobj=io.BytesIO(r.content), mode="r:gz") as tgz:
    all_members = tgz.getmembers()
    data_files = list(filter(lambda x: x.isfile() and good_file.match(x.name) is not None, all_members))
    for f in tqdm(data_files):
        tgz.extract(f)

In [None]:
from sklearn.datasets import load_files
train_data, test_data = load_files("./aclImdb/train", encoding="utf-8"), load_files("./aclImdb/test", encoding="utf-8")

label2txt = {label: txt for label, txt in enumerate(train_data.target_names)}
txt2label = {txt: label for label, txt in label2txt.items()}
type(train_data)

X_train, y_train = train_data.data, train_data.target
X_test, y_test = test_data.data, test_data.target

The Data

In [None]:
import numpy as np

print("TRAIN data:")
print("class balance: ", np.bincount(y_train))
print()
print("TEST data:")
print("class balance: ", np.bincount(y_test))

# eli5 & gensim
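**A later notebook will add more information about these two packages.**
* eli5: mainly used to help debug and explain machine learning models. It can reveal how a model works internally, for example which features contribute most to its predictions, which is useful for understanding and trusting the model.

* gensim: a Python library focused on topic modeling and natural language processing (NLP). It can handle large text corpora and provides algorithms such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Word2Vec for discovering latent structure, similarities, and relationships between words in text data.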

In [None]:
!pip install eli5
!pip install gensim
from gensim.corpora import Dictionary

**How many words are in our vocabulary?**

In [None]:
X_train_tokenized=[word_tokenize(x) for x in tqdm(X_train)]

In [None]:
d=Dictionary(X_train_tokenized)
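
The statistics used in the next section (number of documents, number of tokens, corpus frequency, document frequency) can be illustrated without gensim. The sketch below recomputes them with `collections.Counter` on a hypothetical pre-tokenized corpus; it mimics the quantities, not gensim's implementation.

```python
from collections import Counter

# Hypothetical pre-tokenized toy corpus
docs = [["good", "good", "movie"], ["bad", "movie"], ["good", "plot"]]

# Corpus frequency: total number of occurrences of each token
corpus_freq = Counter(tok for doc in docs for tok in doc)
# Document frequency: number of documents containing each token
doc_freq = Counter(tok for doc in docs for tok in set(doc))

num_docs = len(docs)
num_tokens = sum(len(doc) for doc in docs)

print(num_docs, num_tokens, len(corpus_freq))
print(corpus_freq["good"], doc_freq["good"])
```

Here "good" occurs three times overall but in only two documents, which is exactly the distinction between the two frequency columns of the table below.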

# rich
Example of building a table:

     from rich.console import Console

     from rich.table import Table

     table = Table(title=' ')

     table.add_column(' ', justify=' ', style=' ')

`justify` sets how the cell content is aligned: `'left'` (default), `'right'`, or `'center'`.

In [None]:
import rich

from rich.console import Console
from rich.table import Table

N = 20

table = Table(title=f"Top-{N} Most Frequent Tokens ({d.num_docs} documents, {d.num_pos} tokens, {len(d)} words in dictionary)")

table.add_column("Token", justify="left", style="black")
table.add_column("Corpus Frequency", justify="right")
table.add_column("% of Tokens", justify="right")
table.add_column("Document Frequency", justify="right")
table.add_column("% of Documents", justify="right")

for token, frequency in d.most_common(n=N):
    percent_tokens = frequency / d.num_pos
    doc_frequency = d.dfs[d.token2id[token]]
    percent_doc = doc_frequency / d.num_docs
    table.add_row(token, str(frequency), f"{percent_tokens:.1%}", str(doc_frequency), f"{percent_doc:.1%}")

console = Console()
console.print(table)

# Classify with Logistic Regression
**`X_train` must be converted to BoW before the regression can be fitted**

In [None]:
vec=CountVectorizer(lowercase=False,token_pattern=None,analyzer=word_tokenize)
X_train_bow=vec.fit_transform(X_train)

X_train_bow

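The next cell returns the column indices of all non-zero entries in the first row of the sparse matrix `X_train_bow` (i.e. the first training document). Each column index is the vocabulary ID of a word that occurs in that document.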

In [None]:
X_train_bow[0].indices
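
To see what `.indices` holds, here is a small sketch on a hand-built CSR matrix. The count matrix `dense` is hypothetical; the sketch assumes the BoW matrix is stored in SciPy's CSR format, as is the case for the output of `CountVectorizer`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical count matrix: 2 documents, 5 vocabulary words
dense = np.array([[0, 2, 0, 1, 0],
                  [1, 0, 0, 0, 3]])
X = csr_matrix(dense)

row = X[0]
print(row.indices)  # column indices of the non-zero entries in row 0
print(row.data)     # the matching counts
```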

In [None]:
console = Console(record=True, width=40)
table = Table(title=f"Tokens in sample")

table.add_column("Token ID", justify="right")
table.add_column("Token", justify="left")
table.add_column("Count", justify="right")

words = vec.get_feature_names_out()
for token_id, _ in zip(sorted(X_train_bow[0].indices), range(20)):
    table.add_row(str(token_id), words[token_id], str(X_train_bow[0][0, token_id]))

console.print(table)
console.save_svg("bow.svg", title="Bag of Words")

**Add TF-IDF to the table above**

Use `TfidfVectorizer`

In [None]:
vec_tfidf = TfidfVectorizer(
    lowercase=False,
    token_pattern=None,
    analyzer=word_tokenize
)

X_train_tfidf = vec_tfidf.fit_transform(X_train)
X_train_tfidf

In [None]:
console = Console(record=True, width=80)
table = Table(title=f"Tokens in sample")

table.add_column("Token ID", justify="right")
table.add_column("Token", justify="left")
table.add_column("Count (TF)", justify="right")
table.add_column("Doc Frequency (% corpus)", justify="right")
table.add_column("TF-IDF", justify="right")

words = vec.get_feature_names_out()
for token_id, _ in zip(sorted(X_train_bow[0].indices), range(20)):
    table.add_row(str(token_id), words[token_id], f"{X_train_bow[0][0, token_id]}", f"{d.dfs[d.token2id[words[token_id]]] / d.num_docs:.3%}", f"{X_train_tfidf[0][0, token_id]:.3f}")

console.print(table)
console.save_svg("tfidf.svg", title="Bag of Words - TF-IDF")

**Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
reg=LogisticRegression(max_iter=1000)
reg.fit(X_train_bow,y_train)

In [None]:
X_test_bow=vec.transform(X_test)

In [None]:
print(classification_report(y_true=y_test,y_pred=reg.predict(X_test_bow)))

# Which words indicate a positive review?
Use eli5 to explain the weights of the regression `reg`

* Let's use the coefficients of logistic regression
* $x_i$ are token counts, so we know $x_i \geq 0$
* $\alpha_i > 0$ = the presence of the $i$-th word of the dictionary indicates a positive review (because it increases the probability that $x$ belongs to the positive class)
* $\alpha_i < 0$ = the presence of the $i$-th word of the dictionary indicates a negative review (because it increases the probability that $x$ belongs to the negative class)
* $\alpha_i>0, \alpha_j > 0$ and $\alpha_i > \alpha_j$ = the $i$-th word of the dictionary is **more** associated with a positive review than the $j$-th word


In [None]:
import eli5
# Compatibility shim: eli5 expects the older `get_feature_names` method name
vec.get_feature_names = vec.get_feature_names_out
eli5.show_weights(reg,vec=vec,top=10,target_names=['negative','positive'])

In [None]:
eli5.explain_prediction(reg, X_test[0], vec=vec, target_names=['negative', 'positive'])