
# Text Pre-Processing


The texts we have access to will be noisy:
* Lots of unique words, including numbers and identifiers
* Typos
* Many inflected forms of the same word (plural, conjugation, ...)

Pre-processing is how we modify the text BEFORE turning it into Bag-of-Words (BoW) vectors. It has 2 main goals:
* Retain information in vectors (texts that are similar should have a high cosine similarity)
* Separate noise from information (remove unnecessary coefficients)
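
To make the first goal concrete, here is a short worked formula. It assumes two documents are represented by count vectors $u$ and $v$ over the same vocabulary (symbols introduced here for illustration only):

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}$$

For example, with $u = (1, 1, 0)$ and $v = (1, 0, 1)$, the dot product is $1$ and both norms are $\sqrt{2}$, so the cosine similarity is $1/2$. Only dimensions where both documents have a non-zero count contribute to the numerator, which is why the choice of dimensions matters.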

In [2]:
import numpy as np
import math

## SKLEARN Generalities

The classes `CountVectorizer` and `TfidfVectorizer` have the same interface and the same arguments for their `__init__` method.

What is explained for one class is valid for the other.

In [3]:
def show_vocabulary(vectorizer, word_size=15, words_per_line=10):
    words = vectorizer.get_feature_names_out()

    print(f'Vocabulary size: {len(words)} words')

    word_format = f'<{word_size}'
    for l in np.array_split(words, math.ceil(len(words) / words_per_line)):
        print(''.join([f'{x:{word_format}}' for x in l]))

In [4]:
import os
os.environ["FORCE_COLOR"] = "1"

from termcolor import colored

def show_bow(vectorizer, bow, word_size=15, words_per_line=8):
    words = vectorizer.get_feature_names_out()

    word_format = f'<{word_size}'
    for l in np.array_split(list(zip(words, bow)), math.ceil(len(words) / words_per_line)):
        print(' | '.join([colored(f'{w:{word_format}}:{n:>2}', 'grey') if int(n) == 0 else colored(f'{w:{word_format}}:{n:>2}', on_color='on_yellow', attrs=['bold']) for w, n in l ]))

def show_bow_float(vectorizer, bow, word_size=15, words_per_line=6):
    words = vectorizer.get_feature_names_out()

    word_format = f'<{word_size}'
    for l in np.array_split(list(zip(words, bow)), math.ceil(len(words) / words_per_line)):
        print(' | '.join([colored(f'{w:{word_format}}:{float(n):>0.2f}', 'grey') if float(n) == 0 else colored(f'{w:{word_format}}:{float(n):>0.2f}', on_color='on_yellow', attrs=['bold']) for w, n in l ]))


## Real-Life Corpus

Books are very clean texts. Real-life corpora that include user-generated material sit at the opposite end of the spectrum: they include typos, strange usernames, artefacts of all kinds...

The "20 newsgroups" dataset is a classic NLP dataset. Newsgroups are the ancestors of Reddit: people could post messages and reply in a thread.

* **Corpus**: newsgroup messages
* **Document**: full text of 1 message

In [5]:
from sklearn.datasets import fetch_20newsgroups

In [6]:
newsgroups = fetch_20newsgroups()

In [7]:
print(f'Number of documents: {len(newsgroups.data)}')
print()
print(f'Sample document:\n\n{"*" * 80}\n{newsgroups.data[12]}\n{"*" * 80}')

Number of documents: 11314

Sample document:

********************************************************************************
From: rodc@fc.hp.com (Rod Cerkoney)
Subject: *$G4qxF,fekVH6
Nntp-Posting-Host: hpfcmrc.fc.hp.com
Organization: Hewlett Packard, Fort Collins, CO
X-Newsreader: TIN [version 1.1 PL8.5]
Lines: 15



--


Regards,
Rod Cerkoney
                                                        /\
______________________________________________         /~~\
                                                      /    \
  Rod Cerkoney MS 37     email:                      /      \ 
  Hewlett Packard         rodc@fc.hp.com        /\  /        \  
  3404 East Harmony Rd.  Hpdesk:               /  \/          \    /\
  Fort Collins, CO 80525  HP4000/UX           /    \           \  /  \
_____________________________________________/      \           \/    \__

********************************************************************************


Here is the problem:
* Vocabulary is much larger (130107 unique words)
* Lots of "garbage" in vocabulary ("mbocjlo3", "mc2i", "mc68882rc25")

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
count.fit(newsgroups.data)
words = count.get_feature_names_out()
print(f'Vocabulary size: {len(words)} words')
print('First 20 vocabulary words:')

word_size = 15 # Using default word_size from show_vocabulary
line_words = words[:20]
print(''.join([f'{x:<{word_size}}' for x in line_words]))

Vocabulary size: 130107 words
First 20 vocabulary words:
00             000            0000           00000          000000         00000000       0000000004     0000000005     00000000b      00000001       00000001b      0000000667     00000010       00000010b      00000011       00000011b      0000001200     00000074       00000093       000000e5       


# Problem Statement

**Curse of Dimensionality**

Downstream applications (the applications that will use the vectors) are sensitive to the number of dimensions of feature vectors.

* Logistic Regression: with $d$ the number of dimensions of the vectors and $n$ the number of samples:
   * CPU complexity of `fit()` is roughly $O(nd)$ per solver iteration
   * RAM complexity of `fit()` is $O(nd + n + d)$

This is solved by removing words from the vocabulary (**stopping**, filtering, customized tokenization, ...)
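
To get a feel for the $nd$ term, the sketch below estimates the size of a dense document-term matrix using the figures from this notebook (11314 documents, 130107 words). The 8 bytes per coefficient is an assumption (one float64 per value). In practice `CountVectorizer` returns a sparse matrix, so real memory usage is much lower; the point is only to show how the cost scales with the vocabulary size.

```python
# Back-of-the-envelope memory estimate for a dense document-term matrix.
# Figures taken from this notebook: 11314 documents, 130107 vocabulary words.
n_docs = 11314
n_terms = 130107
bytes_per_value = 8  # assumption: one float64 per coefficient

dense_bytes = n_docs * n_terms * bytes_per_value
print(f'Dense matrix: {n_docs * n_terms:,} coefficients, {dense_bytes / 1e9:.1f} GB')

# Same estimate after reducing the vocabulary to 50000 terms
reduced_bytes = n_docs * 50_000 * bytes_per_value
print(f'With 50000 terms: {reduced_bytes / 1e9:.1f} GB')
```

The memory estimate shrinks in direct proportion to the number of terms kept.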


---



**Vocabulary Gap**

* Verb conjugation results in different unique words, which appear as separate dimensions.
* Plurals will also generate unique words.

For example, 'make', 'makes' and 'made' are individual words, as are 'horse' and 'horses'.

Consider two texts: does it make them dissimilar if one uses 'horse' and the other uses 'horses'?

This is solved by reducing words to a 'basic' version (**stemming**, **lemmatizing**)
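
The effect of the vocabulary gap on cosine similarity can be sketched in plain Python. The `naive_normalize` rule below is a hypothetical toy (it only drops a trailing "s") and is not a real stemmer or lemmatizer; it is just enough to show the mechanism.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words built from token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def naive_normalize(token):
    """Toy rule (hypothetical, not a real stemmer): drop a trailing 's' on longer words."""
    return token[:-1] if len(token) > 3 and token.endswith('s') else token

text_a = 'the horse makes noise'.split()
text_b = 'the horses make noise'.split()

raw = cosine(text_a, text_b)
normalized = cosine([naive_normalize(t) for t in text_a],
                    [naive_normalize(t) for t in text_b])
print(f'raw tokens: {raw:.2f}, normalized tokens: {normalized:.2f}')
```

With raw tokens only 'the' and 'noise' are shared, so the similarity is 0.5. Once both forms map to the same token, the two bags of words are identical and the similarity is 1.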


---




# Stopping

This is the process of removing **stopwords** from the text.

Stopwords are words that are needed to make a sentence, but carry little or no information for the reader.

Consider English words such as 'the', 'a', 'that'.

* Each language has a list of stopwords.
* Based on your corpus, there might be additional stopwords to consider
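
As a minimal sketch of the idea, stopping amounts to filtering tokens against a set. The stopword list below is a tiny hypothetical one chosen for illustration, and tokenization is reduced to a whitespace split.

```python
# Hypothetical, tiny stopword list for illustration; real lists are much longer.
toy_stopwords = {'the', 'a', 'that', 'is', 'on'}

def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace, and drop tokens found in the stopword set."""
    return [tok for tok in text.lower().split() if tok not in stopwords]

kept = remove_stopwords('The cat that is on a mat', toy_stopwords)
print(kept)
```

Only 'cat' and 'mat' survive, which are the words that actually describe the content of the sentence.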

In [16]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [17]:
from nltk.corpus import stopwords

stops = stopwords.words('english')
print(f'Number of stopwords: {len(stops)}')

for l in np.array_split(stops, 15):
    print(' '.join([f'{w:<12}' for w in l]))

Number of stopwords: 198
a            about        above        after        again        against      ain          all          am           an           and          any          are          aren        
aren't       as           at           be           because      been         before       being        below        between      both         but          by           can         
couldn       couldn't     d            did          didn         didn't       do           does         doesn        doesn't      doing        don          don't        down        
during       each         few          for          from         further      had          hadn         hadn't       has          hasn         hasn't       have        
haven        haven't      having       he           he'd         he'll        her          here         hers         herself      he's         him          himself     
his          how          i            i'd          if           i'll         i'm          

In [18]:
count = CountVectorizer(stop_words=stops)
count.fit(newsgroups.data)

vocab_stopped = count.get_feature_names_out()
print(f'Size of vocabulary: {len(vocab_stopped)}')


Size of vocabulary: 129963


# Filter by Token Pattern

Accept only words that match a regular expression pattern.

See the [Python Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) for an introduction to regular expressions.
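
To see what a pattern accepts before fitting a vectorizer, it can be tried directly with the `re` module. The sketch below assumes tokens are extracted the way `re.findall` does on lowercased text, and compares a letters-only pattern with and without word boundaries (`\b`). The sample text is made up from tokens seen in this corpus.

```python
import re

text = 'mc68882rc25 runs at 25mhz says rod'

# Letters only, delimited by word boundaries: mixed tokens are rejected entirely
strict = re.findall(r'\b[a-z]+\b', text)

# Letters only, no boundaries: letter runs inside mixed tokens are still extracted
loose = re.findall(r'[a-z]+', text)

print(strict)
print(loose)
```

Without the `\b` anchors, fragments such as 'mc', 'rc' and 'mhz' are pulled out of alphanumeric identifiers and end up in the vocabulary.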


In [19]:
count = CountVectorizer(
    stop_words=stops,
    token_pattern=r'\b[a-z]+\b',   # Matches tokens made only of letters (text is lowercased by default)
)
count.fit(newsgroups.data)

vocab_pattern = count.get_feature_names_out()
print(f'Size of vocabulary: {len(vocab_pattern)}')

Size of vocabulary: 81622


In [20]:
print(' '.join([f'{w:<10}' for w in vocab_stopped[:10]]))
print(' '.join([f'{w:<10}' for w in vocab_pattern[:10]]))

00         000        0000       00000      000000     00000000   0000000004 0000000005 00000000b  00000001  
aa         aaa        aaaa       aaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg aaaaagggghhhh aaaarrgghhhh aaah       aaahh      aaahhhh   


# Filtering by Frequency

Retain only the top N tokens, based on the number of times they appear in the complete corpus.

Use the `max_features` argument of the vectorizer.

Typical usage:
* `max_features=50000`
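
The selection can be sketched in plain Python: count every token over the whole corpus, then keep the N most frequent. The toy corpus and the value of `N` are hypothetical, and this is only an approximation of what the vectorizer does (tie handling in particular may differ).

```python
from collections import Counter

# Toy corpus (hypothetical); each string is one document
docs = ['the cat sat', 'the cat ran', 'the dog ran fast', 'the cat']

# Count every token over the complete corpus
term_counts = Counter(tok for doc in docs for tok in doc.split())

# Keep only the N most frequent tokens, then sort them alphabetically
N = 3
vocabulary = sorted(term for term, _ in term_counts.most_common(N))
print(vocabulary)
```

Here 'the' (4 occurrences), 'cat' (3) and 'ran' (2) are kept, while tokens seen only once are dropped.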

In [21]:
count = CountVectorizer(
    stop_words=stops,
    token_pattern=r'[a-z]+',
    max_features=50000
)
count.fit(newsgroups.data)

vocab_top = count.get_feature_names_out()
print(f'Size of vocabulary: {len(vocab_top)}')

Size of vocabulary: 50000


In [22]:
print(' '.join([f'{w:<10}' for w in vocab_stopped[:10]]))
print(' '.join([f'{w:<10}' for w in vocab_top[:10]]))

00         000        0000       00000      000000     00000000   0000000004 0000000005 00000000b  00000001  
aa         aaa        aaaarrgghhhh aaah       aaahhhh    aaai       aab        aachen     aad        aaf       


In [23]:
vocab_top[::2000]  # one term out of every 2000, to sample the vocabulary

array(['aa', 'apeldoornseweg', 'beeblbrox', 'bvsd', 'clutch',
       'crystallography', 'dingebre', 'elicit', 'flags', 'gunning',
       'icons', 'jackw', 'kotdohl', 'lutheran', 'mindless', 'nintendo',
       'pentecostals', 'pyrtech', 'retaliate', 'scribe', 'somesuch',
       'surrounds', 'tpinnpcn', 'utai', 'winadv'], dtype=object)

# Filtering by Document Frequency

Two corner cases to consider:
* a word appears in nearly all documents: it does not help to tell documents apart
* a word appears in only 1 or 2 documents: it does not help either, and is likely a typo or an artefact

Use the `min_df` and `max_df` arguments.
* `min_df=3` words that appear in at least 3 documents will be in the vocabulary
* `min_df=0.1` words that appear in at least 10% of the documents will be in the vocabulary
* `max_df=10` words that appear in at most 10 documents will be in the vocabulary
* `max_df=0.9` words that appear in at most 90% of the documents will be in the vocabulary

Typical usage:
* `min_df=2`
* `max_df=0.8`
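
A minimal sketch of document-frequency filtering in plain Python follows. It assumes the scikit-learn convention that a term is ignored only when its document frequency is strictly below `min_df` or strictly above `max_df`, so both bounds are inclusive. The helper `filter_by_df` and the toy corpus are hypothetical.

```python
from collections import Counter

# Toy corpus (hypothetical); each string is one document
docs = ['the cat sat', 'the cat ran', 'the dog ran', 'the xqzt cat']

# Document frequency: number of documents containing each token
# (set() makes sure a token is counted once per document)
doc_freq = Counter(tok for doc in docs for tok in set(doc.split()))

def filter_by_df(doc_freq, n_docs, min_df, max_df):
    """Keep tokens with min_df <= df <= max_df; floats are fractions of n_docs."""
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

vocabulary = filter_by_df(doc_freq, len(docs), min_df=2, max_df=0.8)
print(vocabulary)
```

'the' appears in all 4 documents (above 80%) and is dropped as uninformative; 'sat', 'dog' and the artefact 'xqzt' appear in a single document and are dropped as well.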

In [24]:
count = CountVectorizer(
    stop_words=stops,
    token_pattern=r'[a-z]+\w*',
    max_features=50000,
    min_df=5,
    max_df=0.8
)
count.fit(newsgroups.data)

vocab_df = count.get_feature_names_out()
print(f'Size of vocabulary: {len(vocab_df)}')

Size of vocabulary: 23726


In [25]:
print(' '.join([f'{w:<10}' for w in vocab_top[:10]]))
print(' '.join([f'{w:<10}' for w in vocab_df[:10]]))

aa         aaa        aaaarrgghhhh aaah       aaahhhh    aaai       aab        aachen     aad        aaf       
a0         a000       a1         a137490    a2         a2i        a3         a4         a42dubinski a5        


# Stemming / Lemmatizing

## Description

Both are processes applied to a word to reduce it to a base form: stemming produces a stem, lemmatizing produces a lemma. Plurals are reduced to singular, conjugated verbs to a common form, etc.

* Lemma: always a word that exists
* Stem: might not be a word



In [26]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [27]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer

wn = WordNetLemmatizer()
ps = PorterStemmer()

word = 'horses'

print(f'Word: "{word}", PorterStemmer: "{ps.stem(word)}", WordNetLemmatizer: "{wn.lemmatize(word)}"')



Word: "horses", PorterStemmer: "hors", WordNetLemmatizer: "horse"


In [28]:
word = 'horse'

print(f'Word: "{word}", PorterStemmer: "{ps.stem(word)}", WordNetLemmatizer: "{wn.lemmatize(word)}"')


Word: "horse", PorterStemmer: "hors", WordNetLemmatizer: "horse"


It does not matter that the stem is not an actual dictionary word. What matters is that both forms of the word ('horses' and 'horse') are counted under the same dimension, whether that dimension is labelled 'hors' or 'horse'.

## Example

Let's work through an example on the Sherlock Holmes story "A Scandal in Bohemia".

In [29]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [30]:
import requests

r = requests.get('https://sherlock-holm.es/stories/plain-text/scan.txt')
r.raise_for_status()

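with open('scandal_in_bohemia.txt', 'w', encoding='utf-8') as out:
    out.write(r.content.decode('utf-8'))
# Keep non-empty lines only; the context manager closes the file after reading
with open('scandal_in_bohemia.txt', encoding='utf-8') as f:
    lines = [txt for txt in f if len(txt.strip()) > 0]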

In [31]:
from nltk.tokenize import sent_tokenize

book = ' '.join([x.strip() for x in lines])
sentences = sent_tokenize(book)

In [32]:
count_vanilla = CountVectorizer()
count_vanilla.fit(sentences)

tokenizer = CountVectorizer().build_tokenizer()

wn = WordNetLemmatizer()
def lemmatizer(text):
    tokens = tokenizer(text)
    return [wn.lemmatize(token) for token in tokens]

count_lemma = CountVectorizer(
    tokenizer=lemmatizer,
)
count_lemma.fit(sentences)

ps = PorterStemmer()
def stemmer(text):
    tokens = tokenizer(text)
    return [ps.stem(token) for token in tokens]

count_stem = CountVectorizer(
    tokenizer=stemmer
)
count_stem.fit(sentences)



In [33]:
words_vanilla = set(count_vanilla.get_feature_names_out())
words_lemma = count_lemma.get_feature_names_out()
words_stem = count_stem.get_feature_names_out()

removed_by_lemma = words_vanilla.copy()
for w in words_lemma:
    removed_by_lemma.discard(w)

removed_by_stem = words_vanilla.copy()
for w in words_stem:
    removed_by_stem.discard(w)

print(f'{len(removed_by_lemma):>4d} words removed by lemmatizer')
print(f'{len(removed_by_stem):>4d} words removed by stemmer')

 182 words removed by lemmatizer
1024 words removed by stemmer




---

This plays a major role in aligning cosine similarity with the human perception of similarity.

Consider two sentences and how their cosine similarity evolves with each vectorizer.

In [34]:
from sklearn.metrics.pairwise import cosine_similarity

text_01 = 'it is hard to make a horse look good'
text_02 = 'I like making horses looking good'

for name, vectorizer in zip(['Vanilla', 'Lemma', 'Stem'], [count_vanilla, count_lemma, count_stem]):
    bows = vectorizer.transform([text_01, text_02])
    similarity = cosine_similarity(bows)
    print(f'{name:<8}: Cosine Similarity = {similarity[0, 1]:0.2f} with {bows.shape[1]:>4d} dimensions')

Vanilla : Cosine Similarity = 0.17 with 1948 dimensions
Lemma   : Cosine Similarity = 0.32 with 1852 dimensions
Stem    : Cosine Similarity = 0.63 with 1647 dimensions


## Stemmers / Lemmatizers

There are many types of stemmers and lemmatizers:
* Stemmers
   * Snowball
   * Porter
   * Lancaster
   * Regexp
* Lemmatizers
   * WordNet
   * Stanford CoreNLP

Each has its own algorithm for reducing a word to a stem or lemma.

# N-Grams

## Definition

So far we have considered a vocabulary made of words, like `turtle` or `airplane`, or of stems like `hors` and `make`.

N-Grams are groups of N consecutive words in the text.

For example, in the sentence `the cat is gone`, the 2-grams are `the cat`, `cat is`, `is gone`.

The production of N-Grams is controlled by the `ngram_range` parameter of sklearn's `CountVectorizer`.
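
The definition can be sketched in a few lines of plain Python. The helper `ngrams` is hypothetical and uses a simple whitespace split instead of a real tokenizer.

```python
def ngrams(text, n):
    """Return the list of n-grams (n consecutive words) of a whitespace-separated text."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

bigrams = ngrams('the cat is gone', 2)
print(bigrams)
```

A sentence of 4 words yields 3 bigrams; a text shorter than `n` words yields none.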

## Importance of N-Grams

Bag of Words vectors ignore word order in a sentence. Yet some information is carried by words being side by side, not merely by their presence in the sentence.

Knowing that `new york` is in a sentence carries more information than knowing that both `new` and `york` are in the sentence without knowing whether they are side by side.

The most frequent 3-Gram over the English Internet is 'Limited Liability Corporation'. This 3-gram has a meaning of its own: two sentences that both contain it are more similar than two sentences in which those 3 words merely occur separately.

Since we measure similarity as the cosine similarity of BoW vectors, this justifies adding dimensions to the BoW vectors that encode the fact that some words are side by side.

## Examples

We use Sherlock Holmes once again.

In [35]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [36]:
import requests

r = requests.get('https://sherlock-holm.es/stories/plain-text/scan.txt')
r.raise_for_status()

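with open('scandal_in_bohemia.txt', 'w', encoding='utf-8') as out:
    out.write(r.content.decode('utf-8'))
# Keep non-empty lines only; the context manager closes the file after reading
with open('scandal_in_bohemia.txt', encoding='utf-8') as f:
    lines = [txt for txt in f if len(txt.strip()) > 0]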

In [37]:
from nltk.tokenize import sent_tokenize

book = ' '.join([x.strip() for x in lines[7:]])
sentences = sent_tokenize(book)

In [38]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(
    stop_words='english',
    ngram_range=(1, 2),    # will create a vocabulary with 1-gram and 2-grams
    min_df=2,
    max_df=0.8,
    max_features=1500
)

count.fit(sentences)

In [None]:
show_vocabulary(count, word_size=25, words_per_line=6)

In [40]:
N = 0
bow = count.transform([sentences[N]]).toarray()[0]
print(f'"{sentences[N]}"')

"To Sherlock Holmes she is always the woman."


In [None]:
show_bow(count, bow, word_size=25, words_per_line=6)

In [42]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

texts = {
    'Text 01': sentences[0],
    'Text 02': 'Sherlock is a good Holmes man',
    'Text 03': 'Sherlock Holmes is a good man'
}

bows = count.transform(texts.values())
similarity = cosine_similarity(bows)

sim_df = pd.DataFrame(similarity, columns=texts.keys(), index=texts.keys())

for k, v in texts.items():
    print(f'{k:<10}: "{v}"')
print()
print(sim_df)


Text 01   : "To Sherlock Holmes she is always the woman."
Text 02   : "Sherlock is a good Holmes man"
Text 03   : "Sherlock Holmes is a good man"

         Text 01   Text 02   Text 03
Text 01  1.00000  0.500000  0.670820
Text 02  0.50000  1.000000  0.894427
Text 03  0.67082  0.894427  1.000000


# New definition for Vocabulary

Our vocabulary no longer contains only *words*: we have seen that we can stem or lemmatize words, and also identify groups of words that repeat. To avoid confusion, we no longer say that the vocabulary is made of words, but of terms.

**Term**:
* a single word (`turtles`)
* a lemma or a stem of a single word (`turtl` or `turtle`)
* consecutive words or lemmas or stems (`sea turtle` or `sea turtl`)

The **Vocabulary** is the set of unique **terms** that appear in the text:
* `turtle` `sea` `sea turtle`
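
As an illustration, the vocabulary above can be rebuilt from the text "sea turtle" with a short sketch. The helper `terms` is hypothetical; it collects every n-gram for n between `min_n` and `max_n`, and the tokens are assumed to be already lemmatized.

```python
def terms(tokens, min_n=1, max_n=2):
    """All n-grams of the token list for n in [min_n, max_n], as a set of unique terms."""
    found = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            found.add(' '.join(tokens[i:i + n]))
    return found

# Tokens assumed to be already lemmatized ('turtles' -> 'turtle')
vocabulary = sorted(terms(['sea', 'turtle']))
print(vocabulary)
```

The result contains the two single-word terms and the two-word term, matching the list above.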

