# Programming Assignment 1: Creating a Mini-Corpus

$\text{Vectorization Using Gensim}$

In [10]:
from gensim import corpora
import spacy

nlp = spacy.load('en_core_web_sm')

In [11]:
# Clean the text: collapse whitespace (including line breaks) and drop double quotes
def clean_text(text):
    # split() with no arguments splits on any whitespace, including \n and \r,
    # so re-joining with single spaces removes line breaks and excess spaces
    text = ' '.join(text.split())
    # Remove double-quote characters
    text = text.replace('"', '')
    return text

In [12]:
import os

# folder path (os.path.join avoids the invalid '\A' escape and works on any OS)
dir_path = os.path.join('imports', 'Assignment1')

# list to store file names
res = []

# Iterate directory
for path in os.listdir(dir_path):
    # check if current path is a file
    if os.path.isfile(os.path.join(dir_path, path)):
        res.append(path)
print(res)

documents = []

for file in res:
    # the with-block closes each file once it has been read
    with open(os.path.join(dir_path, file), "r") as f:
        doc = f.read()
    clean_doc = clean_text(doc)
    documents.append(clean_doc)
    print(clean_doc)

print(len(documents))

['HIS LAST BOW.txt', 'THE ADVENTURES OF SHERLOCK HOLMES.txt', 'THE CASE-BOOK OF SHERLOCK HOLMES.txt', 'THE MEMOIRS OF SHERLOCK HOLMES.txt', 'THE RETURN OF SHERLOCK HOLMES.txt']
5


In [21]:
texts = []
for document in documents:
    text = []
    doc = nlp(document)
    for w in doc:
        if not w.is_stop and not w.is_punct and not w.like_num:
            text.append(w.lemma_)
    texts.append(text)
# texts is a mini-corpus built from the Sherlock Holmes short stories
print(texts)



In [14]:
# creating the dictionary (word -> integer ID mapping) that the BOW representation is built on
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)



$INSIGHTS$

- The corpus, derived from the short stories of Sherlock Holmes, contains 29004 unique tokens.

- Each word is indexed with an integer.

- This index is called the "word ID".

- The dictionary can now be used to map words to integer IDs.

Next we use the doc2bow method, which, as the name suggests, converts a document to its bag-of-words representation.

In [15]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpus

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 2),
  (9, 1),
  (10, 3),
  (11, 2),
  (12, 4),
  (13, 1),
  (14, 7),
  (15, 1),
  (16, 9),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 6),
  (23, 1),
  (24, 3),
  (25, 2),
  (26, 1),
  (27, 1),
  (28, 7),
  (29, 8),
  (30, 13),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 2),
  (35, 3),
  (36, 1),
  (37, 1),
  (38, 4),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1),
  (44, 10),
  (45, 9),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 7),
  (53, 2),
  (54, 35),
  (55, 1),
  (56, 2),
  (57, 4),
  (58, 6),
  (59, 2),
  (60, 1),
  (61, 1),
  (62, 2),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 35),
  (69, 2),
  (70, 1),
  (71, 1),
  (72, 8),
  (73, 5),
  (74, 1),
  (75, 1),
  (76, 3),
  (77, 1),
  (78, 5),
  (79, 1),
  (80, 1),
  (81, 4),
  (82, 1),
  (83, 12),
  (84, 1),
  (85, 7),
  (86, 1),
  (87, 1),
  (88, 4),
  (89, 2),
  (90, 8),
  (9

- The output is a nested list.

- Each sublist is one document's bag-of-words representation.

- A reminder: you might see different numbers in your list, because the word-to-ID mapping can differ whenever the dictionary is rebuilt.

- Unlike the example we demonstrated, where the absence of a word was recorded as a 0, each tuple here represents (word_id, word_count), and words that do not occur in a document are simply left out.

- We can verify this by taking the original text, mapping each word to its integer ID, and reconstructing the list.

- In smaller corpora each document often has no more than one count of each word. Here each document is a whole story collection, so many counts are greater than one, for example (54, 35).
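
To make the (word_id, word_count) format concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. The helper names `build_token2id` and `doc2bow` and the toy documents are made up for illustration, and IDs are assigned in order of first appearance, which is an assumption and not necessarily how Gensim numbers its words.

```python
from collections import Counter

def build_token2id(texts):
    """Assign each distinct token an integer ID in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(text, token2id):
    """Return a sorted list of (word_id, word_count) tuples for one document."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

# Hypothetical toy documents, already tokenized and lemmatized
toy_texts = [
    ["holmes", "watson", "holmes", "case"],
    ["watson", "letter", "case", "case"],
]

toy_token2id = build_token2id(toy_texts)
toy_corpus = [doc2bow(text, toy_token2id) for text in toy_texts]

print(toy_token2id)  # {'holmes': 0, 'watson': 1, 'case': 2, 'letter': 3}
print(toy_corpus)    # [[(0, 2), (1, 1), (2, 1)], [(1, 1), (2, 2), (3, 1)]]
```

Note that "letter" (ID 3) does not appear in the first sublist at all: absent words are omitted and not stored as zeros.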

In [16]:
#storing your generated corpus
corpora.MmCorpus.serialize('corpus.mm', corpus)

- It is more memory efficient to store the corpus on disk and stream it back later, because at most one vector then resides in RAM at a time.
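
In Gensim the serialized file can be loaded back with `corpora.MmCorpus('corpus.mm')`. The "one vector in RAM at a time" idea can be illustrated with a small pure-Python sketch. The class `StreamedCorpus`, the JSON-lines file format, and the toy vectors are assumptions made for this illustration and are not Gensim's Matrix Market format.

```python
import json
import os
import tempfile

class StreamedCorpus:
    """Iterate over bag-of-words vectors stored one per line in a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, "r") as handle:
            for line in handle:
                # Only the current line is parsed and held in memory
                yield [tuple(pair) for pair in json.loads(line)]

# Hypothetical toy corpus of (word_id, word_count) vectors
toy_corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)]]

# Write one vector per line to a temporary file
toy_path = os.path.join(tempfile.mkdtemp(), "toy_corpus.jsonl")
with open(toy_path, "w") as handle:
    for vector in toy_corpus:
        handle.write(json.dumps(vector) + "\n")

# The corpus object stores only the path; vectors are read lazily on iteration
streamed = StreamedCorpus(toy_path)
for vector in streamed:
    print(vector)
```

Because `__iter__` reopens the file each time, the corpus can be iterated more than once, which models such as TF-IDF typically need.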

In [17]:
# Converting Bag-of-Words to TF-IDF representation
from gensim import models
tfidf = models.TfidfModel(corpus)

for document in tfidf[corpus]:
    print(document)

[(0, 0.00798685941596732), (1, 0.00798685941596732), (2, 0.00798685941596732), (3, 0.00798685941596732), (4, 0.01597371883193464), (5, 0.00798685941596732), (6, 0.00798685941596732), (7, 0.00798685941596732), (8, 0.00909421258576489), (9, 0.004547106292882445), (10, 0.023960578247901963), (11, 0.01597371883193464), (12, 0.010139918816557236), (13, 0.00798685941596732), (15, 0.00798685941596732), (16, 0.02281481733725378), (17, 0.00798685941596732), (18, 0.00798685941596732), (19, 0.002534979704139309), (20, 0.00798685941596732), (21, 0.002534979704139309), (22, 0.047921156495803925), (23, 0.00798685941596732), (24, 0.013641318878647335), (25, 0.01597371883193464), (26, 0.00798685941596732), (27, 0.004547106292882445), (28, 0.05590801591177124), (30, 0.014395591207368396), (31, 0.004547106292882445), (32, 0.00798685941596732), (33, 0.00798685941596732), (34, 0.01597371883193464), (36, 0.00798685941596732), (37, 0.002534979704139309), (38, 0.010139918816557236), (39, 0.00798685941596732)

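- TF-IDF scores: the higher the score, the more important the word is to that document, meaning it occurs often in the document but in few of the other documents. Words that occur in every document get a score of zero and are left out of the output, which is why some word IDs (14, 29, 35) are missing above.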
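
The scores can be reproduced by hand. The sketch below assumes the weighting that, as far as I understand, `TfidfModel` uses by default: the raw count times log2(number of documents / document frequency), with each document vector then scaled to unit length and near-zero weights dropped. This is consistent with the printed scores, where a count of 2 gives exactly twice the score of a count of 1 for words of the same document frequency. The function name `tfidf_vectors` and the toy corpus are made up for illustration.

```python
import math

def tfidf_vectors(corpus, eps=1e-12):
    """Turn (word_id, count) vectors into L2-normalized (word_id, tf-idf) vectors."""
    num_docs = len(corpus)

    # Document frequency: in how many documents does each word ID occur?
    doc_freq = {}
    for vector in corpus:
        for word_id, _ in vector:
            doc_freq[word_id] = doc_freq.get(word_id, 0) + 1

    result = []
    for vector in corpus:
        # Term frequency (raw count) times inverse document frequency (log base 2)
        weighted = [
            (word_id, count * math.log2(num_docs / doc_freq[word_id]))
            for word_id, count in vector
        ]
        # Scale the vector to unit Euclidean length
        norm = math.sqrt(sum(weight ** 2 for _, weight in weighted))
        if norm > 0:
            weighted = [(word_id, weight / norm) for word_id, weight in weighted]
        # Drop words whose weight is (numerically) zero
        result.append([(w, x) for w, x in weighted if abs(x) > eps])
    return result

# Hypothetical toy corpus; word ID 2 occurs in all three documents
toy_corpus = [
    [(0, 2), (1, 1), (2, 1)],
    [(1, 1), (2, 2), (3, 1)],
    [(2, 1), (4, 3)],
]

toy_tfidf = tfidf_vectors(toy_corpus)
for vector in toy_tfidf:
    print(vector)
```

Word ID 2 disappears from every vector because log2(3/3) = 0, which mirrors the missing IDs in the real output.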

$\textbf{N-Gramming}$

- Context is very important when working with text data.
- This context is lost during vector representation because only the word frequency is taken into account.
- An n-gram is a contiguous sequence of n items in the text. In our case the items are words, but depending on the use case they could also be letters, syllables, or, in the case of speech, phonemes.
- Unigram (mono-gram), n = 1
- Bi-gram, n = 2
- Tri-gram, n = 3
- An n-gram model estimates the conditional probability of a token given the preceding token(s).
- N-grams can also be found by looking for words that frequently appear next to each other.
- Finding bi-grams this way is also called collocation detection: it locates pairs of words that are very likely to appear together.
- Example: "New Hampshire" should be treated as one token, not as "New" and "Hampshire".
- Gensim approaches bigrams by simply combining the two high-probability tokens with an underscore. The tokens new and hampshire become new_hampshire. Similar to the TF-IDF model, bigrams can be created using another Gensim model, Phrases.
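
The collocation idea above can be sketched in pure Python. The score used here, (count(a b) - min_count) / (count(a) * count(b)) * vocabulary size, follows my understanding of the default scorer in Gensim's Phrases, whose documented defaults are min_count=5 and threshold=10.0. The sketch is a simplification: the vocabulary size counts single tokens only, the small `min_count` and `threshold` values are chosen for the toy data, and the helper names are made up for illustration.

```python
from collections import Counter

def find_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent token pairs and return those scoring above the threshold."""
    unigram_counts = Counter(token for sentence in sentences for token in sentence)
    bigram_counts = Counter(
        pair for sentence in sentences for pair in zip(sentence, sentence[1:])
    )
    vocab_size = len(unigram_counts)

    scores = {}
    for (first, second), pair_count in bigram_counts.items():
        score = (
            (pair_count - min_count)
            / (unigram_counts[first] * unigram_counts[second])
            * vocab_size
        )
        if score > threshold:
            scores[(first, second)] = score
    return scores

def merge_bigrams(sentence, bigrams):
    """Join each detected pair into a single token with an underscore."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in bigrams:
            merged.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged

# Hypothetical toy sentences, already tokenized
toy_sentences = [
    ["visit", "new", "hampshire", "today"],
    ["new", "hampshire", "is", "cold"],
    ["leave", "new", "hampshire", "soon"],
    ["cold", "today"],
]

toy_bigrams = find_bigrams(toy_sentences)
print(toy_bigrams)
print([merge_bigrams(sentence, toy_bigrams) for sentence in toy_sentences])
```

Only ("new", "hampshire") passes the threshold here, since every other pair occurs just once and scores zero after `min_count` is subtracted.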

In [18]:
import gensim
bigram = gensim.models.Phrases(texts)
texts = [bigram[line] for line in texts]
texts

[['bow',
  'Arthur_Conan',
  'Doyle',
  'Table_content',
  'Preface',
  'Adventure',
  'Wisteria_Lodge',
  'Adventure',
  'Cardboard',
  'Box',
  'Adventure',
  'Red',
  'Circle',
  'Adventure',
  'Bruce_Partington',
  'plan',
  'Adventure',
  'Dying',
  'Detective',
  'Disappearance',
  'Lady_Frances',
  'Carfax',
  'Adventure',
  'Devil',
  'foot',
  'bow',
  'preface',
  'friend',
  'Mr._Sherlock',
  'Holmes',
  'glad',
  'learn',
  'alive',
  'somewhat',
  'cripple',
  'occasional',
  'attack',
  'rheumatism',
  'year',
  'live',
  'small',
  'farm',
  'down',
  'mile',
  'Eastbourne',
  'time',
  'divide',
  'philosophy',
  'agriculture',
  'period',
  'rest',
  'refuse',
  'princely',
  'offer',
  'case',
  'having',
  'determine',
  'retirement',
  'permanent',
  'approach',
  'german',
  'war',
  'cause',
  'lay',
  'remarkable',
  'combination',
  'intellectual',
  'practical',
  'activity',
  'disposal',
  'government',
  'historical',
  'result',
  'recount',
  'bow',
  'pre

$\textbf{NOTE}:$ Creating phrases adds new tokens (such as Arthur_Conan) to the texts, so this step must be done before the dictionary is created. Since we already built a dictionary earlier, we have to rebuild it, and the corpus, from the updated texts:

In [19]:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 2),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 2),
  (9, 3),
  (10, 2),
  (11, 4),
  (12, 1),
  (13, 7),
  (14, 1),
  (15, 8),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 6),
  (22, 1),
  (23, 3),
  (24, 2),
  (25, 1),
  (26, 1),
  (27, 7),
  (28, 8),
  (29, 13),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 2),
  (34, 2),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 4),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1),
  (44, 10),
  (45, 1),
  (46, 8),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 7),
  (54, 2),
  (55, 25),
  (56, 1),
  (57, 2),
  (58, 4),
  (59, 6),
  (60, 2),
  (61, 1),
  (62, 1),
  (63, 2),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 2),
  (69, 1),
  (70, 1),
  (71, 8),
  (72, 1),
  (73, 1),
  (74, 1),
  (75, 3),
  (76, 1),
  (77, 4),
  (78, 1),
  (79, 1),
  (80, 1),
  (81, 4),
  (82, 1),
  (83, 12),
  (84, 1),
  (85, 7),
  (86, 1),
  (87, 1),
  (88, 4),
  (89, 2),
  (90, 1),
  (91

After creating bi-grams, we can create tri-grams and other n-grams by running the Phrases model again on its own output. Bi-grams remain the most used n-gram model, though it is worth one's time to glance over the other uses and kinds of n-gram implementations.
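
Why a second pass yields tri-grams can be seen in a small pure-Python sketch: after the first pass a bi-gram is a single token, so the second pass can pair it with a neighbour. The pairs to merge are hand-picked here for illustration, whereas Phrases would detect them from their scores, and the helper `merge_pairs` is hypothetical.

```python
def merge_pairs(sentence, pairs):
    """Join every adjacent token pair found in `pairs` with an underscore."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in pairs:
            merged.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged

# Hypothetical tokenized sentence
toy_sentence = ["mr", "sherlock", "holmes", "smile"]

# First pass: two single tokens become one bi-gram token
first_pass = merge_pairs(toy_sentence, {("sherlock", "holmes")})
print(first_pass)   # ['mr', 'sherlock_holmes', 'smile']

# Second pass: a token plus the bi-gram token become a tri-gram token
second_pass = merge_pairs(first_pass, {("mr", "sherlock_holmes")})
print(second_pass)  # ['mr_sherlock_holmes', 'smile']
```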

In [20]:
# Removing both high-frequency and low-frequency words.
# Example: get rid of words that occur in fewer than 20 documents, or in more than 50% of the documents.

# In this case it removes every word in the dictionary: the corpus has only 5 documents,
# so no word can occur in at least 20 of them.
dictionary.filter_extremes(no_below=20, no_above=0.5)
print(dictionary.token2id)

{}
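
The empty result follows directly from the filtering rule. Below is a minimal pure-Python sketch of that rule as I understand it: a word is kept only if its document frequency is at least `no_below` and at most `no_above` times the number of documents (truncated to an integer). The function name, the toy document frequencies, and the omission of Gensim's `keep_n` limit are assumptions made for illustration.

```python
def filter_by_doc_freq(doc_freq, num_docs, no_below=5, no_above=0.5):
    """Keep tokens whose document frequency lies in [no_below, no_above * num_docs]."""
    max_docs = int(no_above * num_docs)
    return {
        token: freq
        for token, freq in doc_freq.items()
        if no_below <= freq <= max_docs
    }

# Hypothetical document frequencies in a corpus of 5 documents
toy_doc_freq = {"holmes": 5, "watson": 5, "telegram": 3, "wisteria": 1, "carfax": 1}

# With only 5 documents, no word can appear in 20 of them, so nothing survives
strict = filter_by_doc_freq(toy_doc_freq, num_docs=5, no_below=20, no_above=0.5)
print(strict)   # {}

# Thresholds scaled to the corpus size keep the mid-frequency word
relaxed = filter_by_doc_freq(toy_doc_freq, num_docs=5, no_below=2, no_above=0.8)
print(relaxed)  # {'telegram': 3}
```

For a five-document corpus, thresholds of this second kind are the ones that leave a usable vocabulary.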
