In [1]:
from pathlib import Path

Path.ls = lambda x: list(x.iterdir())

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from typing import Union


def read_file(filepath: Union[str, Path] = "Movie_Reviews.txt") -> str:
    """Read the whole file into a single string."""
    with Path(filepath).open("r") as f:
        return f.read()

In [4]:
movie_review_text = read_file(filepath="Movie_Reviews.txt")
movie_review_text.__sizeof__(), len(movie_review_text)

(264133428, 132066677)
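
The byte size is almost exactly twice the character count. That is consistent with CPython storing this string at two bytes per character, which it does when at least one character falls outside Latin-1. The sketch below estimates the per-character width for a few sample characters; `bytes_per_char` is a hypothetical helper, not part of this notebook, and the behaviour it probes is specific to CPython.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate the bytes CPython uses per character for strings made of `ch`.

    Comparing two string lengths cancels out the fixed object header.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n


print(bytes_per_char("a"))           # plain ASCII
print(bytes_per_char("\u0101"))      # outside Latin-1, inside the BMP
print(bytes_per_char("\U0001F600"))  # outside the BMP
```

If the first call reports 1 and the second reports 2, a single non-Latin-1 character anywhere in the reviews is enough to explain the doubling seen above.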

In [5]:
import spacy

In [6]:
# !python -m spacy download en_core_web_sm # in case you forgot to download this earlier

## From Official spaCy docs
https://spacy.io/usage/spacy-101#annotations-token

In [7]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


## Adapting to our Data

### Disable Unused Components 
spaCy lets us disable pipeline components we don't need. Since we use it here primarily for tokenization, we disable everything else. The tokenizer itself is not a pipeline component, which is why `nlp.pipe_names` comes back empty.

In [8]:
nlp = spacy.load("en_core_web_sm", disable=["ner", "tagger", "parser", "textcat"])
print(nlp.pipe_names)

[]


### Increase Max Length
By default, spaCy refuses to process a single text longer than `nlp.max_length`, which defaults to 1,000,000 characters. This is a safeguard against running out of memory.

By spaCy's estimate, the parser and NER models need roughly 1 GB of temporary memory per 100,000 characters of input. Since we have disabled those components, I'm raising `max_length` to cover our input text.

In [9]:
nlp.max_length = len(movie_review_text)+1
nlp.max_length

132066678
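
An alternative to raising `max_length` is to split the text into pieces that each fit under the limit and process them one at a time. The sketch below is one way to do that in plain Python; `iter_chunks` is a hypothetical helper, and it assumes the reviews are separated by newlines so that cutting at a newline does not split a review.

```python
from typing import Iterator


def iter_chunks(text: str, max_chars: int = 1_000_000) -> Iterator[str]:
    """Yield consecutive pieces of `text`, each at most `max_chars` long.

    A piece ends just after the last newline inside its window when one
    exists; otherwise it is cut hard at `max_chars`.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind("\n", start, end)
            if cut != -1:
                end = cut + 1  # keep the newline with the piece it ends
        yield text[start:end]
        start = end


sample = "aaa\nbbbb\ncc"
chunks = list(iter_chunks(sample, max_chars=6))
print(chunks)
```

The pieces could then be streamed through `nlp.pipe(iter_chunks(movie_review_text))`, which yields one `Doc` per piece. Note that a hard cut (no newline in the window) can split a token in two, so the token counts may differ slightly from processing the text in one go.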

In [10]:
%%time
# %%timeit -n 3
doc = nlp(movie_review_text)

CPU times: user 2min, sys: 5.42 s, total: 2min 5s
Wall time: 2min 6s


### Extract Bag of Tokens (Words)
Classical ML pipelines typically need a bag of tokens (words) for classification. Extracting it from the `Doc` that the NLP library returns is an extra step, so we time that as well to get a better sense of the total cost.

In [11]:
%%time 
tokens = [token.text for token in doc]

CPU times: user 18.6 s, sys: 1.1 s, total: 19.8 s
Wall time: 20.1 s


### Get a sense of Vocabulary size

In [12]:
from collections import Counter
token_cntr = Counter(tokens)

In [13]:
print(f"Unique Tokens: {len(token_cntr)}")

Unique Tokens: 261647


At roughly 260K unique tokens, this is a large vocabulary and can lead to too much sparsity in our matrix computations. Let's try to reduce the vocabulary size while retaining as much signal as we can.

Following common convention, we try these two options next:
1. Keep tokens with minimum frequency = 3
2. Lowercase all tokens and then set minimum frequency = 3

### Reducing our Vocabulary Size

In [14]:
min_freq = 3

In [15]:
token_cntr = {k: v for k, v in token_cntr.items() if v >= min_freq}

In [16]:
print(
    f"After dropping all rare tokens, min_freq = {min_freq}, we have:\nUnique Tokens: {len(token_cntr)}"
)

After dropping all rare tokens, min_freq = 2, we have:
Unique Tokens: 97469


This is still larger than I'd like. Let's see if we can get a smaller vocabulary by lowercasing the tokens.

In [17]:
%time lowercase_token_cntr = Counter([token.text.lower() for token in doc])

CPU times: user 28.5 s, sys: 2.86 s, total: 31.4 s
Wall time: 32.3 s


In [18]:
print(f"Unique Tokens: {len(lowercase_token_cntr)}")

Unique Tokens: 217491


In [19]:
lowercase_token_cntr = {k: v for k, v in lowercase_token_cntr.items() if v >= min_freq}

In [20]:
print(
    f"After dropping all rare tokens, min_freq = {min_freq}, we have:\nUnique Tokens: {len(lowercase_token_cntr)}"
)

After dropping all rare tokens, min_freq = 3, we have:
Unique Tokens: 81408
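
Vocabulary size alone does not say how much of the corpus the pruned vocabulary still covers. One rough way to check is to compute the share of all token occurrences that belong to a kept token. The sketch below does this on a toy sentence; `prune` and `coverage` are hypothetical helpers, and the toy text and `min_freq=2` are chosen only for illustration.

```python
from collections import Counter


def prune(counts: Counter, min_freq: int) -> Counter:
    """Keep only tokens that occur at least `min_freq` times."""
    return Counter({tok: n for tok, n in counts.items() if n >= min_freq})


def coverage(full: Counter, kept: Counter) -> float:
    """Fraction of all token occurrences that the kept vocabulary accounts for."""
    total = sum(full.values())
    return sum(kept.values()) / total if total else 0.0


toy_tokens = "The movie was great . the plot was thin . Great cast , great score .".split()

cased = Counter(toy_tokens)
lowered = Counter(tok.lower() for tok in toy_tokens)

for name, counts in [("cased", cased), ("lowercased", lowered)]:
    kept = prune(counts, min_freq=2)
    print(f"{name}: {len(kept)} of {len(counts)} types kept, "
          f"coverage = {coverage(counts, kept):.2%}")
```

To run this on the real data, keep the unfiltered `Counter` objects around: cells 15 and 19 overwrite `token_cntr` and `lowercase_token_cntr` with their filtered versions, so the totals needed for the denominator are lost after those cells run.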
