<a href="https://colab.research.google.com/github/Jiaye39/TimeSeriesAnalysis/blob/main/Bag_of_Words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bag of Words
We study two types of BoW vectors:
* **Raw Count**: count the number of occurrences of each word in a text
* **TF-IDF**: adjust the raw count to favor words that appear often in a few documents over words that appear often in all documents

Basic imports

In [None]:
import numpy as np
import pandas as pd
import math
import requests

# Download Corpus

In [None]:
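# Download the plain-text version of "A Scandal in Bohemia"
r = requests.get('https://sherlock-holm.es/stories/plain-text/scan.txt')

assert r.status_code == 200

# Save a local copy, then reload it without the blank lines
with open('scandal_in_bohemia.txt', 'w', encoding='utf-8') as out:
    out.write(r.content.decode('utf-8'))
with open('scandal_in_bohemia.txt', encoding='utf-8') as f:
    lines = [txt for txt in f if len(txt.strip()) > 0]

print(lines[:20])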

['                              A SCANDAL IN BOHEMIA\n', '                               Arthur Conan Doyle\n', '                                Table of contents\n', '                                     Chapter 1\n', '                                     Chapter 2\n', '                                     Chapter 3\n', '          CHAPTER I\n', '     To Sherlock Holmes she is always the woman. I have seldom heard him\n', '     mention her under any other name. In his eyes she eclipses and\n', '     predominates the whole of her sex. It was not that he felt any\n', '     emotion akin to love for Irene Adler. All emotions, and that one\n', '     particularly, were abhorrent to his cold, precise but admirably\n', '     balanced mind. He was, I take it, the most perfect reasoning and\n', '     observing machine that the world has seen, but as a lover he would\n', '     have placed himself in a false position. He never spoke of the softer\n', '     passions, save with a gibe and a sneer. T

In [None]:
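# Define the first paragraph.
# Note: joining with '' fuses the last word of each line with the first word of
# the next one (e.g. 'himmention' in the outputs below); ' '.join(...) avoids this.
par=''.join([x.strip() for x in lines[7:25]])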

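# NLTK
*   punkt: a pre-trained tokenizer model that splits text into sentences and words, the first step of many NLP tasks
*   punkt_tab: a companion resource to punkt; recent NLTK releases load the Punkt tokenizer data from it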




In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True



*   sent_tokenize: the sentence tokenizer splits a text into sentences.
*   word_tokenize: the word tokenizer splits a text into words.

*   Split the text -> vectorize





In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
nltk_sentences=sent_tokenize(par)
nltk_words=word_tokenize(par)
print(nltk_sentences,'\n')
print(nltk_words)

['To Sherlock Holmes she is always the woman.', 'I have seldom heard himmention her under any other name.', 'In his eyes she eclipses andpredominates the whole of her sex.', 'It was not that he felt anyemotion akin to love for Irene Adler.', 'All emotions, and that oneparticularly, were abhorrent to his cold, precise but admirablybalanced mind.', 'He was, I take it, the most perfect reasoning andobserving machine that the world has seen, but as a lover he wouldhave placed himself in a false position.', 'He never spoke of the softerpassions, save with a gibe and a sneer.', "They were admirable thingsfor the observer--excellent for drawing the veil from men's motivesand actions.", 'But for the trained reasoner to admit such intrusionsinto his own delicate and finely adjusted temperament was tointroduce a distracting factor which might throw a doubt upon all hismental results.', 'Grit in a sensitive instrument, or a crack in one ofhis own high-power lenses, would not be more disturbing th

# spaCy


*   en_core_web_sm: contains all the components and data needed to process English (en) text
*   zh_core_web_sm: the equivalent pipeline for Chinese (zh) text



In [None]:
import spacy
nlp=spacy.load('en_core_web_sm')
doc=nlp(par)



*   spacy_sentences (doc.sents): the text split into sentences -> vectorize
*   spacy_tokens (x for x in spacy_sentences[i]): a sentence split into tokens -> vectorize



In [None]:
spacy_sentences=list(doc.sents)
spacy_tokens=[x for x in spacy_sentences[0]]
print(spacy_sentences,'\n')
print(spacy_tokens)

[To Sherlock Holmes she is always the woman., I have seldom heard himmention her under any other name., In his eyes she eclipses andpredominates the whole of her sex., It was not that he felt anyemotion akin to love for Irene Adler., All emotions, and that oneparticularly, were abhorrent to his cold, precise but admirablybalanced mind., He was, I take it, the most perfect reasoning andobserving machine that the world has seen, but as a lover he wouldhave placed himself in a false position., He never spoke of the softerpassions, save with a gibe and a sneer., They were admirable thingsfor the observer--excellent for drawing the veil from men's motivesand actions., But for the trained reasoner to admit such intrusionsinto his own delicate and finely adjusted temperament was tointroduce a distracting factor which might throw a doubt upon all hismental results., Grit in a sensitive instrument, or a crack in one ofhis own high-power lenses, would not be more disturbing than a strongemotion 

# Sklearn Generalities (CountVectorizer & TfidfVectorizer)
Classes like `CountVectorizer` or `TfidfVectorizer` work in the following way:
* Instantiate an object with specific parameters (`v = CountVectorizer(...)`)
* Fit this object to your corpus, i.e. learn the vocabulary (method `v.fit(...)`)
* Transform any piece of text you have into a vector (method `v.transform(...)`)
* **Use `CountVectorizer` or `TfidfVectorizer` for feature extraction: text must be converted to a BoW vector before it can be used for regression or similar tasks**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

Define the helper functions used in the analysis below.

In [None]:
def show_vocabulary(vectorizer):
    words = vectorizer.get_feature_names_out()

    print(f'Vocabulary size: {len(words)} words')

    # we can print ~10 words per line
    for l in np.array_split(words, math.ceil(len(words) / 10)):
        print(''.join([f'{x:<15}' for x in l]))

In [None]:
import os
os.environ["FORCE_COLOR"] = "1"

from termcolor import colored

def show_bow(vectorizer, bow):
    words = vectorizer.get_feature_names_out()

    # we can print ~8 words + coefs per line
    for l in np.array_split(list(zip(words, bow)), math.ceil(len(words) / 8)):
        print(' | '.join([colored(f'{w:<15}:{n:>2}', 'grey') if int(n) == 0 else colored(f'{w:<15}:{n:>2}', on_color='on_yellow', attrs=['bold']) for w, n in l ]))

def show_bow_float(vectorizer, bow):
    words = vectorizer.get_feature_names_out()

    # we can print ~6 words + coefs per line
    for l in np.array_split(list(zip(words, bow)), math.ceil(len(words) / 6)):
        print(' | '.join([colored(f'{w:<15}:{float(n):>0.2f}', 'grey') if float(n) == 0 else colored(f'{w:<15}:{float(n):>0.2f}', on_color='on_yellow', attrs=['bold']) for w, n in l ]))


# Raw Count
* We take a text, any text, and represent it as a vector
* Each text is represented by a vector with **N** dimensions
* Each dimension is representative of **1 word** of the vocabulary
* The coefficient in dimension **k** is the number of times the word at index **k** in the vocabulary is seen in the represented text
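As a minimal sketch of this definition, the raw count vector can be built by hand with `collections.Counter`. The four-word vocabulary and the helper `raw_count_vector` are hypothetical, and the tokenization here is a plain whitespace split, which is simpler than what `CountVectorizer` does.

```python
from collections import Counter

# Hypothetical vocabulary: index k in the vector corresponds to vocabulary[k]
vocabulary = ["always", "holmes", "the", "woman"]

def raw_count_vector(text, vocabulary):
    """Count how many times each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())   # naive whitespace tokenization
    return [counts[w] for w in vocabulary]   # missing words count as 0

vector = raw_count_vector("the woman the woman always", vocabulary)
print(vector)  # [1, 0, 2, 2]
```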

Example: reduced vocabulary (corpus: the first paragraph of the book; document: one sentence).

In [None]:
count_small= CountVectorizer(lowercase=True)
count_small.fit(nltk_sentences)
show_vocabulary(count_small)

Vocabulary size: 126 words
abhorrent      actions        adjusted       adler          admirable      admirablybalancedadmit          akin           all            always         
and            andobserving   andpredominatesandquestionableany            anyemotion     as             be             but            cold           
crack          delicate       distracting    disturbing     doubt          drawing        dubious        eclipses       emotions       excellent      
eyes           factor         false          felt           finely         for            from           gibe           grit           has            
have           he             heard          her            high           himmention     himself        his            hismental      holmes         
in             instrument     intrusionsinto irene          is             it             late           lenses         love           lover          
machine        memory         men            might          mind 

The option `lowercase` controls one behavior of the raw count: do we consider `And` to be different from `and`?

* `lowercase=False` gives 134 unique words in the vocabulary
* `lowercase=True` gives 126 unique words

In [None]:
s = nltk_sentences[0]

print(f'Text: "{s}"')
bow = count_small.transform([s])
print(f'BoW Shape: {bow.shape}')
bow = bow.toarray()   # From sparse matrix to dense matrix (Careful with MEMORY)
print(f'BoW Vector: {bow}')

Text: "To Sherlock Holmes she is always the woman."
BoW Shape: (1, 126)
BoW Vector: [[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0
  0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]]


In [None]:
show_bow(count_small,bow[0])

abhorrent      : 0 | actions        : 0 | adjusted       : 0 | adler          : 0 | admirable      : 0 | admirablybalanced: 0 | admit          : 0 | akin           : 0
all            : 0 | always         : 1 | and            : 0 | andobserving   : 0 | andpredominates: 0 | andquestionable: 0 | any            : 0 | anyemotion     : 0
as             : 0 | be             : 0 | but            : 0 | cold           : 0 | crack          : 0 | delicate       : 0 | distracting    : 0 | disturbing     : 0
doubt          : 0 | drawing        : 0 | dubious        : 0 | eclipses       : 0 | emotions       : 0 | excellent      : 0 | eyes           : 0 | factor         : 0
false          : 0 | felt   

# TF-IDF
The basic motivation for TF-IDF is that cosine similarity with raw count coefficients puts too much emphasis on the number of occurrences of a word within a document.

Repeating a word will artificially increase the cosine similarity with any text containing this word.
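A small sketch makes this concrete. The three-word vocabulary and the vectors are hypothetical, and `cosine` is a hand-written helper rather than a library call.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary: ["holmes", "woman", "yet"]
query = [0, 0, 1]            # a text that contains only "yet"
doc = [1, 1, 1]              # each word appears once
doc_repeated = [1, 1, 100]   # same document, with "yet" repeated 100 times

sim_once = cosine(query, doc)
sim_repeated = cosine(query, doc_repeated)

print(f'{sim_once:0.4f}')      # 0.5774
print(f'{sim_repeated:0.4f}')  # 0.9999
```

The repeated document carries no new information, yet its raw count vector ends up almost parallel to the query.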

Consider which word would be more important:
1. One that is repeated a lot and equally present in each document
1. One that appears a lot in only a few documents

TF-IDF computes coefficients:
* Low values for common words (i.e. present in the document, but quite common over the corpus)
* High values for uncommon words (i.e. present in the document, but not common over the corpus)

We consider one specific document, and one specific word.

* **TF = Term Frequency**: the number of times the word appears in the document
* **DF = Document Frequency**: the number of documents in the corpus in which the word appears
* **IDF = Inverse Document Frequency**: the inverse of the Document Frequency.

Logarithms are introduced to reflect that a word appearing 100 times does not deliver 100 times the information.

Given a word **w**, a document **d** in a corpus of **D** documents:

$\textrm{TF-IDF(w, d) = TF(w, d) * IDF(w)}$

$
\begin{align}
\textrm{IDF(w) = log} \left( \frac{1 + \textrm{D}}{1 + \textrm{DF(w)}} \right) + 1
\end{align}
$
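The IDF formula can be evaluated by hand. The sketch below assumes the natural logarithm and a corpus size of `D = 11`, which is a value picked for illustration; `idf` is a hypothetical helper, not a scikit-learn function.

```python
import math

def idf(n_documents, document_frequency):
    """Smoothed IDF: log((1 + D) / (1 + DF)) + 1, with the natural logarithm."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

D = 11  # assumed corpus size, for illustration only

idf_rare = idf(D, 1)         # word present in a single document
idf_everywhere = idf(D, D)   # word present in every document

print(f'{idf_rare:0.4f}')        # 2.7918
print(f'{idf_everywhere:0.4f}')  # 1.0000
```

The `+ 1` outside the logarithm means that a word present in every document still gets an IDF of 1 rather than 0, so it is down-weighted but not discarded.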

In [None]:
tfidf=TfidfVectorizer()
tfidf.fit(nltk_sentences)
show_vocabulary(tfidf)

Vocabulary size: 126 words
abhorrent      actions        adjusted       adler          admirable      admirablybalancedadmit          akin           all            always         
and            andobserving   andpredominatesandquestionableany            anyemotion     as             be             but            cold           
crack          delicate       distracting    disturbing     doubt          drawing        dubious        eclipses       emotions       excellent      
eyes           factor         false          felt           finely         for            from           gibe           grit           has            
have           he             heard          her            high           himmention     himself        his            hismental      holmes         
in             instrument     intrusionsinto irene          is             it             late           lenses         love           lover          
machine        memory         men            might          mind 

In [None]:
s = nltk_sentences[0]

print(f'Text: "{s}"')
bow = tfidf.transform([s])
print(f'BoW Shape: {bow.shape}')
bow = bow.toarray()   # From sparse matrix to dense matrix (Careful with MEMORY)
print(f'BoW Vector: {bow}')

Text: "To Sherlock Holmes she is always the woman."
BoW Shape: (1, 126)
BoW Vector: [[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.40271589 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.40271589 0.         0.         0.         0.
  0.40271589 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0. 

In [None]:
show_bow_float(tfidf,bow[0])

abhorrent      :0.00 | actions        :0.00 | adjusted       :0.00 | adler          :0.00 | admirable      :0.00 | admirablybalanced:0.00
admit          :0.00 | akin           :0.00 | all            :0.00 | always         :0.40 | and            :0.00 | andobserving   :0.00
andpredominates:0.00 | andquestionable:0.00 | any            :0.00 | anyemotion     :0.00 | as             :0.00 | be             :0.00
but            :0.00 | cold           :0.00 | crack          :0.00 | delicate       :0.00 | distracting    :0.00 | disturbing     :0.00
doubt          :0.00 | drawing        :0.00 | dubious        :0.00 | eclipses       :0.00 | emotions       :0.00 | excellent      :0.00
eyes           :0.00 | factor 

Display the IDF of some words.

* High IDF = word that appears in few documents
* Low IDF = word that appears in most documents

In [None]:
words = tfidf.get_feature_names_out()
word = input('Word: ').lower()

if word in words:
    k = list(words).index(word)
    print(f'IDF({words[k]}) = {tfidf.idf_[k]}')
else:
    print('Not in vocabulary')

Word: sherlock
IDF(sherlock) = 2.791759469228055


More than one TF-IDF:

There is a family of TF-IDF formulas.

Another example is the **sublinear TF**, which is then:

$
\begin{align}
\textrm{TF(w, d) = 1 + log} \left( \textrm{raw count} \right)
\end{align}
$
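To see how strongly the logarithm dampens repetition, the sublinear TF can be computed for a few raw counts. This is a sketch with a hypothetical helper `sublinear_tf`; it assumes the natural logarithm and leaves a count of 0 at 0.

```python
import math

def sublinear_tf(raw_count):
    """Sublinear TF: 1 + ln(raw count) for positive counts, 0 for absent words."""
    return 1 + math.log(raw_count) if raw_count > 0 else 0.0

tf_values = {n: sublinear_tf(n) for n in (0, 1, 10, 100)}

for n, tf in tf_values.items():
    print(f'raw count = {n:>3}  ->  sublinear TF = {tf:0.4f}')
```

A word seen 100 times gets a TF of about 5.6 instead of 100, so repetition still counts, but far less than linearly.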

In [None]:
tfidf_sublinear = TfidfVectorizer(sublinear_tf=True)
tfidf_sublinear.fit(nltk_sentences)

In [None]:
s = nltk_sentences[0]

print(f'Text: "{s}"')
bow_sl = tfidf_sublinear.transform([s])
print(f'BoW Shape: {bow_sl.shape}')
bow_sl = bow_sl.toarray()   # From sparse matrix to dense matrix (Careful with MEMORY)
print(f'BoW Vector: {bow_sl}')

Text: "To Sherlock Holmes she is always the woman."
BoW Shape: (1, 126)
BoW Vector: [[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.40271589 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.40271589 0.         0.         0.         0.
  0.40271589 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0. 

In [None]:
show_bow_float(tfidf_sublinear, bow_sl[0])

abhorrent      :0.00 | actions        :0.00 | adjusted       :0.00 | adler          :0.00 | admirable      :0.00 | admirablybalanced:0.00
admit          :0.00 | akin           :0.00 | all            :0.00 | always         :0.40 | and            :0.00 | andobserving   :0.00
andpredominates:0.00 | andquestionable:0.00 | any            :0.00 | anyemotion     :0.00 | as             :0.00 | be             :0.00
but            :0.00 | cold           :0.00 | crack          :0.00 | delicate       :0.00 | distracting    :0.00 | disturbing     :0.00
doubt          :0.00 | drawing        :0.00 | dubious        :0.00 | eclipses       :0.00 | emotions       :0.00 | excellent      :0.00
eyes           :0.00 | factor 

In [None]:
word = 'yet'

index = words.tolist().index(word)

bow = tfidf.transform([s]).toarray()

print(f'Word: "{word}"')
print(f'TF-IDF with Natural TF   = {bow[0][index]:0.4f}')
print(f'TF-IDF with Sublinear TF = {bow_sl[0][index]:0.4f}')

Word: "yet"
TF-IDF with Natural TF   = 0.0000
TF-IDF with Sublinear TF = 0.0000


In [None]:
word = 'yet'
s = nltk_sentences[0]
s = s + ' ' + ' '.join(100 * [word])  # append the word 100 times to the sentence

bow = tfidf.transform([s]).toarray()
bow_sl = tfidf_sublinear.transform([s]).toarray()

index = words.tolist().index(word)
print(f'Word: "{word}"')
print(f'TF-IDF with Natural TF   = {bow[0][index]:0.4f}')
print(f'TF-IDF with Sublinear TF = {bow_sl[0][index]:0.4f}')

Word: "yet"
TF-IDF with Natural TF   = 0.9997
TF-IDF with Sublinear TF = 0.9143
