$\textbf{Text Vectorization}$
-

- A vector is a geometric object that has a magnitude and a direction.

- Text vectorization is the projection of words into a mathematical space while preserving information.

$\textbf{The Bag of Words Model}$
-

- The bag-of-words (BOW) model is a straightforward way to vectorize sentences.

- BOW uses word frequencies to construct vectors.

- The BOW model is an orderless document representation: only the counts of the words matter.

- Because BOW does not take the position of words into account, we lose semantic information.

- To vectorize several sentences, we tokenize each one and join the resulting tokens into a single vocabulary.

- The vocabulary acts as a reference for whether a specific word is present or absent in each sentence.

$EXAMPLE$

In [30]:
import re
import string

s1 = "dog sat mat."
s2 = "cat love dog."

def token_sentence(s):
    # Make a regular expression that matches all punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    # Use the regex
    res = regex.sub('', s)
    res = res.split()
    return res

new_s1 = token_sentence(s1)
new_s2 = token_sentence(s2)
vocabulary = list(set(new_s1 + new_s2))
vocabulary

['sat', 'love', 'dog', 'cat', 'mat']

In [31]:
new_s1

['dog', 'sat', 'mat']

In [32]:
# Binary bag-of-words for sentence 1: 1 if the vocabulary word is present, 0 if absent
BOW = [int(u in new_s1) for u in vocabulary]
BOW

[1, 0, 1, 0, 1]
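
The vector above only records presence or absence, while the BOW model described earlier uses word frequencies. A minimal sketch of a count-based version follows. The helper name `count_bow` and the repeated "dog" in the second sentence are assumptions made for illustration; they are not part of the original example.

```python
from collections import Counter

def count_bow(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

sentence_1 = ["dog", "sat", "mat"]
sentence_2 = ["cat", "love", "dog", "dog"]  # "dog" repeated on purpose (illustrative)

# Sorting the vocabulary makes the column order reproducible;
# a bare set() has no guaranteed order between runs.
vocab = sorted(set(sentence_1 + sentence_2))

bow_1 = count_bow(sentence_1, vocab)
bow_2 = count_bow(sentence_2, vocab)

print(vocab)  # ['cat', 'dog', 'love', 'mat', 'sat']
print(bow_1)  # [0, 1, 0, 1, 1]
print(bow_2)  # [1, 2, 1, 0, 0]
```

Both sentences share the same vocabulary, so the two vectors have the same length and can be compared position by position.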

$\textbf{Term Frequency Inverse Document Frequency (TF-IDF)}$
-

- A model widely used in search engines to retrieve relevant documents.

- Two pieces of information are encoded: the term frequency and the inverse document frequency.

- The term frequency is the number of times a term appears in a document, relative to the length of that document.

- The inverse document frequency measures how informative a word is across the whole collection of documents.

- The inverse document frequency is calculated by logarithmically scaling the inverse fraction of the documents containing the word. This is obtained by dividing the total number of documents by the number of documents containing the term, followed by taking the logarithm of the ratio.

- The inverse document frequency measures how common or rare a term is among all documents.

The formulas are:
\begin{gather}
TF(t) = \frac{\text{number of times the term "t" appears in a specific document}}{\text{total number of terms in the document}}
\end{gather}

\begin{gather}
IDF(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents with term "t"}}\right)
\end{gather}

\begin{gather}
TF\text{-}IDF(t) = TF(t) \cdot IDF(t)
\end{gather}

- TF-IDF carries more information than the plain BOW vector: instead of using raw word counts, it makes rare terms more prominent and down-weights words that appear in most documents, such as the stopwords "is", "that", and "of".
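
The three formulas can be checked by hand on a tiny corpus. The sketch below is a direct transcription of them using the natural logarithm; the function names and the toy documents are assumptions made for illustration, and library implementations may use a different log base or normalization.

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Toy tokenized documents (illustrative only)
toy_documents = [
    ["the", "dog", "sat", "on", "the", "mat"],
    ["the", "cat", "love", "the", "dog"],
    ["the", "bird", "sang"],
]

score_the = tf_idf("the", toy_documents[0], toy_documents)  # in every document, so IDF = log(1) = 0
score_dog = tf_idf("dog", toy_documents[0], toy_documents)  # in two of three documents
score_mat = tf_idf("mat", toy_documents[0], toy_documents)  # in one document only

print(score_the, score_dog, score_mat)
```

"the" is the most frequent word in the first document, yet it scores zero because it occurs in every document, while the rarer "mat" scores highest.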

$\text{Vectorization Using Gensim}$

In [33]:
from gensim import corpora
import spacy
from pypdf import PdfReader
nlp = spacy.load('en_core_web_sm')

# Open the three articles
article1 = PdfReader("pdfs/Tubercolosis_1.pdf")
article2 = PdfReader("pdfs/Tubercolosis_2.pdf")
article3 = PdfReader("pdfs/Tubercolosis_3.pdf")
# Extract the text of the first page of each article
extracted_1 = article1.pages[0].extract_text()
extracted_2 = article2.pages[0].extract_text()
extracted_3 = article3.pages[0].extract_text()

documents = [extracted_1, extracted_2, extracted_3]

In [34]:
texts = []
for document in documents:
    text = []
    doc = nlp(document)
    for w in doc:
        if not w.is_stop and not w.is_punct and not w.like_num:
            text.append(w.lemma_)
    texts.append(text)
# texts is our mini-corpus: one list of lemmas per tuberculosis article
print(texts)

[[' \n \n', 'January', 'elsevi', 'create', 'covid', 'resource', 'centre', '\n', 'free', 'information', 'English', 'Mandarin', 'novel', 'coronavirus', 'COVID', '\n', ' ', 'covid', 'resource', 'centre', 'host', 'elsevi', 'Connect', '\n', 'company', 'public', ' ', 'news', 'information', 'website', ' \n \n', 'elsevi', 'grant', 'permission', 'covid', 'relate', '\n', 'research', ' ', 'available', 'covid', 'resource', 'centre', 'include', 'th', '\n', 'research', ' ', 'content', 'immediately', 'available', 'PubMed', 'central', '\n', 'publicly', 'fund', ' ', 'repository', 'covid', 'database', 'right', '\n', 'unrestricted', ' ', 'research', '-use', 'analysis', 'form', 'mean', '\n', 'acknowledgement', ' ', 'o', 'riginal', 'source', 'permission', '\n', 'grant', 'free', 'elsevi', ' ', 'long', 'covid', 'resource', 'centre', '\n', 'remain', 'active', ' \n \n'], ['Gac', ' ', 'Sanit', ' ', '35(s2', 'S227', 'S230', '\n', 'risk', ' ', 'factor', ' ', 'analysis', ' ', ' ', 'non', 'compliance', ' ', ' ', 'T
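
The printed tokens still contain whitespace-only entries such as '\n' and ' ', which carry no meaning and will inflate the counts later on. One possible clean-up, sketched here under the assumption that we post-filter the token lists rather than rerun spaCy, is to drop every token that is empty after stripping. The helper name `drop_blank_tokens` is hypothetical, and the sample list copies only a few tokens from the output above. An alternative would be to also test the token's `is_space` attribute inside the loop.

```python
def drop_blank_tokens(token_lists):
    """Remove tokens that are empty or consist only of whitespace."""
    return [[tok for tok in tokens if tok.strip()] for tokens in token_lists]

# A few tokens copied from the output above, not the full corpus
sample_texts = [[" \n \n", "January", "elsevi", "\n", "free", " ", "covid"]]

cleaned = drop_blank_tokens(sample_texts)
print(cleaned)  # [['January', 'elsevi', 'free', 'covid']]
```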

In [35]:
#creating a BOW representation of the mini-corpus
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)

{'\n': 0, ' ': 1, ' \n \n': 2, '-use': 3, 'COVID': 4, 'Connect': 5, 'English': 6, 'January': 7, 'Mandarin': 8, 'PubMed': 9, 'acknowledgement': 10, 'active': 11, 'analysis': 12, 'available': 13, 'central': 14, 'centre': 15, 'company': 16, 'content': 17, 'coronavirus': 18, 'covid': 19, 'create': 20, 'database': 21, 'elsevi': 22, 'form': 23, 'free': 24, 'fund': 25, 'grant': 26, 'host': 27, 'immediately': 28, 'include': 29, 'information': 30, 'long': 31, 'mean': 32, 'news': 33, 'novel': 34, 'o': 35, 'permission': 36, 'public': 37, 'publicly': 38, 'relate': 39, 'remain': 40, 'repository': 41, 'research': 42, 'resource': 43, 'right': 44, 'riginal': 45, 'source': 46, 'th': 47, 'unrestricted': 48, 'website': 49, '\n ': 50, '1,083–6,731': 51, '1,085–73.525': 52, '1,247–17,287': 53, '2,435–19,398': 54, '20–49': 55, '35(s2': 56, '64.5%.5': 57, '6–9': 58, '9111/': 59, '=': 60, 'AFB': 61, 'Aceh': 62, 'Africa': 63, 'Andi': 64, 'Asia.2': 65, 'Asriwatia,∗': 66, 'Besar': 67, 'CC': 68, 'CDR': 69, 'CI': 

$INSIGHTS$

- There are 87 unique tokens in our corpus, which is focused on healthcare and tuberculosis.

- Each word is indexed with an integer.

- This index is called the "word ID".

- The dictionary can now be used to map each word to its integer ID.

Next we use the doc2bow method, which, as the name suggests, converts a document into its bag-of-words representation.

In [36]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpus

[[(0, 10),
  (1, 8),
  (2, 3),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 2),
  (14, 1),
  (15, 4),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 6),
  (20, 1),
  (21, 1),
  (22, 4),
  (23, 1),
  (24, 2),
  (25, 1),
  (26, 2),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 2),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 2),
  (37, 1),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 3),
  (43, 4),
  (44, 1),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1)],
 [(0, 89),
  (1, 709),
  (11, 1),
  (12, 1),
  (17, 1),
  (22, 2),
  (29, 1),
  (31, 1),
  (35, 1),
  (42, 2),
  (50, 18),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1),
  (55, 1),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 4),
  (61, 1),
  (62, 2),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 3),
  (70, 4),
  (71, 1),
  (72, 1),
  (73, 1),
  (74, 1),
  (75, 1),
  (76, 1),
  (77, 1),
  (78, 1),
  (79, 1),
  (80, 1),
  (81

- The output is a nested list.

- Each individual sublist is one document's bag-of-words representation.

- A reminder: you might see different numbers in your list, because the ID mapping can differ each time a dictionary is created.

- Unlike the earlier example, where an absent word was recorded as 0, here only the words that occur are stored, as (word_id, word_count) tuples.

- We can easily verify this by checking the original sentence, mapping each word to its integer ID and reconstructing our list.

- Most words occur only once per document, which tends to happen in smaller corpora. The few larger counts here belong to whitespace tokens (IDs 0 and 1) and to recurring terms such as 'covid' (ID 19).
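
To make the (word_id, word_count) format concrete, here is a minimal pure-Python sketch of the same idea. It is not Gensim's implementation: the function names are assumptions, and IDs are assigned in first-seen order, which may differ from the order Gensim uses.

```python
from collections import Counter

def build_token2id(token_lists):
    """Give every distinct token an integer ID, in order of first appearance."""
    token2id = {}
    for tokens in token_lists:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (word_id, word_count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Toy documents (illustrative only)
toy_texts = [["covid", "resource", "centre", "covid"], ["risk", "factor", "covid"]]

toy_ids = build_token2id(toy_texts)
toy_corpus = [to_bow(tokens, toy_ids) for tokens in toy_texts]

print(toy_ids)     # {'covid': 0, 'resource': 1, 'centre': 2, 'risk': 3, 'factor': 4}
print(toy_corpus)  # [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1)]]
```

Words that do not occur in a document simply have no tuple, which keeps the representation sparse.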

In [37]:
#storing your generated corpus

corpora.MmCorpus.serialize('tubercolosis_corpus.mm', corpus)

- Storing the corpus on disk and loading it later is more memory efficient, because the corpus can then be streamed so that at most one vector resides in RAM at a time.
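
The "one vector in RAM at a time" idea can be illustrated with a plain generator. The sketch below does not use the Matrix Market format that `MmCorpus` writes; it stores one JSON line per document instead. The file name and helper names are hypothetical.

```python
import json
import os
import tempfile

def save_corpus(path, bow_corpus):
    """Write one bag-of-words vector per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for bow in bow_corpus:
            handle.write(json.dumps(bow) + "\n")

def stream_corpus(path):
    """Yield one vector at a time, so the whole corpus never sits in memory."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield [tuple(pair) for pair in json.loads(line)]

small_corpus = [[(0, 2), (1, 1)], [(1, 3), (4, 1)]]  # toy vectors (illustrative)
corpus_path = os.path.join(tempfile.mkdtemp(), "toy_corpus.jsonl")
save_corpus(corpus_path, small_corpus)

for bow in stream_corpus(corpus_path):
    print(bow)  # each vector is read from disk only when it is needed

# Materialised here only to check the round trip
restored = list(stream_corpus(corpus_path))
```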

In [38]:
#Converting Bag-of-Words to TF-IDF representation
from gensim import models
tfidf = models.TfidfModel(corpus)

for document in tfidf[corpus]:
    print(document)

[(0, 0.29686456787051857), (1, 0.2374916542964149), (2, 0.24130736959420215), (3, 0.08043578986473406), (4, 0.08043578986473406), (5, 0.08043578986473406), (6, 0.08043578986473406), (7, 0.08043578986473406), (8, 0.08043578986473406), (9, 0.08043578986473406), (10, 0.08043578986473406), (11, 0.02968645678705186), (12, 0.02968645678705186), (13, 0.1608715797294681), (14, 0.08043578986473406), (15, 0.3217431594589362), (16, 0.08043578986473406), (17, 0.02968645678705186), (18, 0.08043578986473406), (19, 0.4826147391884043), (20, 0.08043578986473406), (21, 0.08043578986473406), (22, 0.11874582714820744), (23, 0.08043578986473406), (24, 0.1608715797294681), (25, 0.08043578986473406), (26, 0.1608715797294681), (27, 0.08043578986473406), (28, 0.08043578986473406), (29, 0.02968645678705186), (30, 0.1608715797294681), (31, 0.02968645678705186), (32, 0.08043578986473406), (33, 0.08043578986473406), (34, 0.08043578986473406), (35, 0.02968645678705186), (36, 0.1608715797294681), (37, 0.08043578986

- TF-IDF scores: The higher the score, the more important the word in the document.
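
The scores printed above do not follow the textbook formula exactly. They are consistent with raw counts multiplied by a base-2 logarithm and then scaled so that each document vector has unit length, which, to my understanding, is the default behaviour of Gensim's `TfidfModel`. For example, IDs 3 and 11 both have a count of 1 in the first document, and their scores differ by a factor of about 2.71, which equals log2(3) / log2(3/2), the ratio expected for a word found in one document versus two. The sketch below reproduces that weighting in plain Python under these assumptions; the function name and toy corpus are illustrative.

```python
import math

def tfidf_log2_normalized(bow_corpus):
    """Weight (id, count) vectors by count * log2(N / df), then L2-normalize."""
    num_docs = len(bow_corpus)
    doc_freq = {}
    for bow in bow_corpus:
        for word_id, _ in bow:
            doc_freq[word_id] = doc_freq.get(word_id, 0) + 1

    weighted_corpus = []
    for bow in bow_corpus:
        weights = [(word_id, count * math.log2(num_docs / doc_freq[word_id]))
                   for word_id, count in bow]
        # Words present in every document get weight 0 and are dropped
        weights = [(word_id, value) for word_id, value in weights if value > 0]
        norm = math.sqrt(sum(value * value for _, value in weights))
        weighted_corpus.append([(word_id, value / norm) for word_id, value in weights])
    return weighted_corpus

# Toy corpus of (word_id, word_count) vectors (illustrative only)
toy_bow = [[(0, 10), (1, 1), (2, 1)], [(0, 5), (2, 2)], [(3, 1)]]
weighted = tfidf_log2_normalized(toy_bow)

for vector in weighted:
    print(vector)
```

Because of the normalization, scores are comparable within a document but depend on everything else that document contains.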

$\textbf{N-Gramming}$
-

- Context is very important when working with text data.
- This context is lost during vector representation because only the word frequency is taken into account.
- An n-gram is a contiguous sequence of n items in the text. In our case, we will be dealing with words being the item, but depending on the use case, it could be even letters, syllables, or sometimes in the case of speech, phonemes.
- Mono-gram (unigram), n = 1
- Bi-gram, n = 2
- Tri-gram, n = 3
- An n-gram model estimates the conditional probability of a token given the preceding token(s).
- N-grams can also be found by counting words that appear close to each other.
- Bi-gram detection is also called collocation detection: it finds pairs of words that are very likely to appear together.
- Example: "New Hampshire" is treated as a single unit, not as the separate words "New" and "Hampshire".
- Gensim approaches bigrams by simply combining two high-probability tokens with an underscore. The tokens new and york will now become new_york instead. Similar to the TF-IDF model, bigrams can be created using another Gensim model, Phrases.
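
The ideas in this list can be sketched in a few lines of plain Python: extract contiguous n-grams, estimate the conditional probability of a bigram's second word given its first, and join the pairs that pass a threshold with an underscore. This is a simplified illustration, not the scoring that Gensim's Phrases model applies; the function names, the toy sentence, and the thresholds are all assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_probabilities(tokens):
    """Estimate P(second | first) = count(first, second) / count(first as a bigram start)."""
    pair_counts = Counter(ngrams(tokens, 2))
    first_counts = Counter(first for first, _ in ngrams(tokens, 2))
    return {pair: count / first_counts[pair[0]] for pair, count in pair_counts.items()}

def merge_bigrams(tokens, pairs_to_join):
    """Replace each selected adjacent pair with a single underscore-joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs_to_join:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

toy_tokens = ["new", "york", "is", "big", "new", "york", "has",
              "rain", "near", "new", "hampshire"]

probabilities = bigram_probabilities(toy_tokens)
pair_counts = Counter(ngrams(toy_tokens, 2))

# Illustrative thresholds: seen at least twice, probability at least 0.5
selected = {pair for pair, p in probabilities.items()
            if pair_counts[pair] >= 2 and p >= 0.5}

print(ngrams(toy_tokens, 3)[0])            # ('new', 'york', 'is')
print(merge_bigrams(toy_tokens, selected))
```

Here "new" is followed by "york" two times out of three, so only that pair is merged; "new hampshire" occurs once and stays as two tokens, which shows why a minimum count matters in a small corpus.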

In [39]:
import gensim
bigram = gensim.models.Phrases(texts)
texts = [bigram[line] for line in texts]
texts

[[' \n \n',
  'January',
  'elsevi',
  'create',
  'covid',
  'resource',
  'centre',
  '\n',
  'free',
  'information',
  'English',
  'Mandarin',
  'novel',
  'coronavirus',
  'COVID',
  '\n',
  ' ',
  'covid',
  'resource',
  'centre',
  'host',
  'elsevi',
  'Connect',
  '\n',
  'company',
  'public',
  ' ',
  'news',
  'information',
  'website',
  ' \n \n',
  'elsevi',
  'grant',
  'permission',
  'covid',
  'relate',
  '\n',
  'research',
  ' ',
  'available',
  'covid',
  'resource',
  'centre',
  'include',
  'th',
  '\n',
  'research',
  ' ',
  'content',
  'immediately',
  'available',
  'PubMed',
  'central',
  '\n',
  'publicly',
  'fund',
  ' ',
  'repository',
  'covid',
  'database',
  'right',
  '\n',
  'unrestricted',
  ' ',
  'research',
  '-use',
  'analysis',
  'form',
  'mean',
  '\n',
  'acknowledgement',
  ' ',
  'o',
  'riginal',
  'source',
  'permission',
  '\n',
  'grant',
  'free',
  'elsevi',
  ' ',
  'long',
  'covid',
  'resource',
  'centre',
  '\n',
  

$\textbf{NOTE}:$ Because creating new phrases adds words to our vocabulary, this step must be done before we create our dictionary. Since we built ours earlier, we have to run this again:

In [40]:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

After we are done creating our bi-grams, we can create tri-grams and other n-grams by simply running the Phrases model multiple times on our corpus. Bi-grams still remain the most used n-gram model, though it is worth one's time to glance over the other uses and kinds of n-gram implementations.

In [41]:
# Remove both high-frequency and low-frequency words.
# Example: drop words that occur in fewer than 20 documents or in more than 50% of the documents.
# With only three documents, no_below=20 would remove every token, so lower it for a small corpus.
dictionary.filter_extremes(no_below=20, no_above=0.5)
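
To see what this kind of filtering does, here is a minimal pure-Python sketch of filtering by document frequency. It mirrors only the `no_below` and `no_above` parameters as described in the comment above and is not Gensim's implementation; the function name and toy documents are hypothetical.

```python
def filter_by_document_frequency(token_lists, no_below, no_above):
    """Keep tokens found in at least no_below documents and at most no_above (a fraction) of them."""
    num_docs = len(token_lists)
    doc_freq = {}
    for tokens in token_lists:
        for tok in set(tokens):  # count each token once per document
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    keep = {tok for tok, df in doc_freq.items()
            if no_below <= df <= no_above * num_docs}
    return [[tok for tok in tokens if tok in keep] for tokens in token_lists]

# Toy documents (illustrative only)
toy_articles = [
    ["tb", "risk", "patient"],
    ["tb", "drug", "patient"],
    ["tb", "drug", "cough"],
    ["tb", "fever"],
]

filtered = filter_by_document_frequency(toy_articles, no_below=2, no_above=0.5)
print(filtered)  # [['patient'], ['drug', 'patient'], ['drug'], []]

# A threshold larger than the corpus size removes everything
emptied = filter_by_document_frequency(toy_articles, no_below=20, no_above=0.5)
print(emptied)   # [[], [], [], []]
```

"tb" is dropped for being too common (it is in every document), and "risk", "cough", and "fever" are dropped for being too rare.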

$\textbf{Programming Assignment}$

Choose the topic that you will use for your term paper in this subject. Collect articles, publications, stories, etc. on your chosen topic and develop your own mini-corpus using the required preprocessing steps. Be sure to print the output.

Note that this corpus will be used for the entire subject.