# SECTION 1: Introduction

## SECTION 1.1: Corpora

A corpus contains raw text (ASCII or UTF-8) together with its metadata.

In [1]:
import nltk

In [2]:
from nltk.corpus import words

In [3]:
from nltk.corpus import reuters

In [4]:
from nltk.corpus import brown

In [6]:
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [7]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

## SECTION 1.2: Tokenization

Words fall into two types:
1. content words
2. stopwords

Pure Python, spaCy, or NLTK can be used.
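As a rough illustration of the pure-Python option, the sketch below tokenizes with a single regular expression. The function name `simple_tokenize` and the pattern are assumptions made for this example. Unlike spaCy, it keeps contractions such as "don't" as one token.

```python
import re

def simple_tokenize(text):
    # A word (optionally followed by an apostrophe suffix) or a single punctuation mark
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text.lower())

tokens = simple_tokenize("Mary, don't slap the green witch")
print(tokens)
```

A regex like this is easy to adapt but knows nothing about language-specific rules, which is where spaCy and NLTK help.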

In [8]:
import spacy

In [9]:
import en_core_web_sm

In [10]:
nlp = en_core_web_sm.load()
text = "Mary, don't slap the green witch"
print(
    [
        str(token) for token
        in nlp(text.lower())
    ]
)

['mary', ',', 'do', "n't", 'slap', 'the', 'green', 'witch']


In [11]:
from nltk.tokenize import \
TweetTokenizer

In [12]:
tweet = "Snow White and the Seven Degrees #MakeAMovieCold @midnight :-)"
tokenizer = TweetTokenizer()
print(
    tokenizer.tokenize(tweet.lower())
)

['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':-)']


NLTK's `TweetTokenizer` preserves hashtags, handles, and emoticons.

## SECTION 1.3: WordNet

WordNet is a large lexical database of English.

In [13]:
from nltk.corpus import wordnet

In [15]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [16]:
from nltk.corpus import wordnet

synonyms = []
antonyms = []

for syn in wordnet.synsets('good'):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(
                l.antonyms()[0].name()
            )

print(set(synonyms))
print(set(antonyms))

{'honorable', 'soundly', 'goodness', 'just', 'upright', 'commodity', 'proficient', 'adept', 'in_force', 'thoroughly', 'skilful', 'undecomposed', 'respectable', 'unspoilt', 'well', 'near', 'salutary', 'beneficial', 'dependable', 'ripe', 'honest', 'good', 'serious', 'unspoiled', 'safe', 'full', 'expert', 'skillful', 'trade_good', 'sound', 'dear', 'estimable', 'effective', 'in_effect', 'secure', 'practiced', 'right'}
{'evil', 'ill', 'evilness', 'bad', 'badness'}


## SECTION 1.4: Grammatical Analysis

In [17]:
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


## SECTION 1.5: Dependency Parsing

In [19]:
from spacy import displacy

displacy.render(doc,
                style='dep',
                jupyter=True,
                options={'distance': 90})

## SECTION 1.6: Named Entity Recognition (NER)

In [20]:
doc = nlp(
    'I just bought 2 shares at 9 am because the stock went up 30% in just 2 days according to the WSJ'
)
displacy.render(doc,
                style='ent',
                jupyter=True)

# SECTION 2: Text Representation

**Why is representation important?**

A text representation scheme must make it easy to extract features from the text.

The *semantics* (meaning) of a sentence is derived in four steps:
1. Break the sentence into lexical units
2. Derive the meaning of each unit
3. Understand the syntactic (grammatical) structure of the sentence
4. Understand the context in which the sentence appears

**What is text representation?**

Text representation is the conversion of raw text into a suitable numerical form.

**Legacy Techniques**
1. one-hot encoding
2. bag of words
3. n-gram
4. TF-IDF

## SECTION 2.1: One-hot Encoding

1. No information about relations between words
2. Must pre-determine vocabulary size
3. Size of input vector scales with size of vocabulary
4. "Out-of-vocabulary" (OOV) problem

In [21]:
import numpy as np

In [22]:
def one_hot(word, word_dict):
    vector = np.zeros(len(word_dict))
    # Out-of-vocabulary words are left as the all-zero vector
    if word in word_dict:
        vector[word_dict[word]] = 1

    return vector

words = ['rome', 'paris', 'italy', 'france']
word_dict = {word: idx for idx, word in enumerate(words)}

print(one_hot("paris", word_dict))

[0. 1. 0. 0.]
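Points 1 and 4 of the list above can be checked directly. The sketch below uses plain Python lists instead of NumPy. The helper names `one_hot_list` and `dot` are assumptions made for this example.

```python
vocab = ["rome", "paris", "italy", "france"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot_list(word):
    # All zeros, with a single 1 at the word's index (if the word is known)
    vec = [0] * len(vocab)
    if word in index:
        vec[index[word]] = 1
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Distinct words are always orthogonal, however related they are
print(dot(one_hot_list("rome"), one_hot_list("italy")))
# A word only matches itself
print(dot(one_hot_list("rome"), one_hot_list("rome")))
# An out-of-vocabulary word has no slot at all
print(one_hot_list("berlin"))
```

Because every pair of distinct words has a dot product of zero, "rome" is as far from "italy" as it is from "france".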


## SECTION 2.2: Bag of Words

Bag of words is a vector representation of a text produced by simply adding up all the one-hot encoded vectors:

1. Vectors simply contain the number of times each word appears in our document.
2. *Orderless*
3. No notion of similarity

In [23]:
text_words = [
    'rome', 'paris', 'italy', 'france',
    'rome', 'magnificent', 'tourism', 'night',
    'tourism', 'tourism'
]

# Start from an all-zero count vector, one slot per vocabulary word
bow = [0] * len(word_dict)

for word in text_words:
    # Words outside word_dict ('magnificent', 'tourism', 'night') add a zero vector
    hot_word = one_hot(word, word_dict)
    bow = [sum(x) for x in zip(bow, hot_word)]  # Element-wise sum

print(bow)

bow[word_dict["paris"]]

[2.0, 1.0, 1.0, 1.0]


1.0
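The *orderless* property can be seen with a dictionary-based bag of words. This is a minimal sketch that assumes whitespace tokenization. The helper `bag_of_words` is a name made up for this example.

```python
from collections import Counter

def bag_of_words(text):
    # Count each lowercase, whitespace-separated token
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Same counts, so the two sentences are indistinguishable
print(a == b)
print(bag_of_words("rome rome paris")["rome"])
```

The two sentences mean opposite things, yet they produce identical bags, which is exactly the information the representation throws away.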

## SECTION 2.3: N-gram Model

An n-gram is a contiguous sequence of n items from a given sample of text.

1. Vocabulary = set of all n-grams in corpus
2. No notion of similarity

In [25]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [26]:
from nltk import ngrams

text = "Machines take me by surprise with great frequency."

n = 5
pentagrams = ngrams(nltk.word_tokenize(text), n)

for grams in pentagrams:
    print(grams)

('Machines', 'take', 'me', 'by', 'surprise')
('take', 'me', 'by', 'surprise', 'with')
('me', 'by', 'surprise', 'with', 'great')
('by', 'surprise', 'with', 'great', 'frequency')
('surprise', 'with', 'great', 'frequency', '.')
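The same sliding window can be written without NLTK, which also makes point 1 (the vocabulary is the set of all n-grams) concrete. This is a sketch that assumes whitespace tokenization. `ngrams_pure` is a hypothetical helper name.

```python
def ngrams_pure(tokens, n):
    # Zip the token list against itself shifted by 0..n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "to be or not to be".split()
bigrams = ngrams_pure(tokens, 2)
vocabulary = set(bigrams)

print(bigrams)
print(len(bigrams), len(vocabulary))
```

Here `('to', 'be')` occurs twice, so the vocabulary is smaller than the list of bigrams.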


## SECTION 2.4: Collocations

A collocation is a sequence of words that occur together unusually often.

`nltk.collocations` can help identify phrases that act like single words.

In the example below, each bigram is scored with a likelihood ratio. A higher score indicates a stronger association between the two words:

In [27]:
import nltk.collocations
import nltk.corpus
import collections

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words()
)
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort each word's bigrams by strongest association
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

print('New', prefix_keys['New'][:5])

New [('York', 4634.968955894195), ('Orleans', 611.6951040864856), ('England', 557.5789255397682), ('Jersey', 265.2781409189113), ('Testament', 182.6595658588261)]
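To make "unusually often" concrete, the sketch below scores bigrams with pointwise mutual information (PMI). This is a simpler measure than the likelihood ratio used above. The toy sentence and the helper `bigram_pmi` are assumptions made for this example.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = "new york is big and new york is old and the city is new".split()
scores = bigram_pmi(tokens)

print(round(scores[("new", "york")], 2))
print(round(scores[("is", "new")], 2))
```

PMI is known to favour rare pairs, which is one reason measures such as the likelihood ratio are often preferred on real corpora.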


## SECTION 2.5: Term Frequency

In [28]:
from collections import Counter

In [30]:
data = []

# Read the sample file and split each line into whitespace-delimited words
with open("sample_text.txt", 'rt') as file:
    for line in file:
        data.extend(line.strip().split())

counts = Counter(data)

# Sort (word, count) pairs by descending count
sorted_counts = sorted(counts.items(), key=lambda x: x[1],
                       reverse=True)

for word, count in sorted_counts[:5]:
    print(word, count)

the 26
a 14
and 10
their 7
of 6


In [31]:
import nltk
all_words = nltk.FreqDist(data)
print(all_words.most_common(5))

[('the', 26), ('a', 14), ('and', 10), ('their', 7), ('of', 6)]


## SECTION 2.6: 3 Ways to Remove Stopwords

**(1/3) Remove the most common 100 words:**

In [32]:
stopwords = {word for word, count in sorted_counts[:100]}

clean_data = [word for word in data if word not in stopwords]

**(2/3) Use NLTK's predefined stopwords:**

In [36]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [37]:
from nltk.corpus import stopwords
nltk_stopwords = stopwords.words('english')

print('Number of stopwords: %d' % len(nltk_stopwords))
print('First five stop words: %s' % list(nltk_stopwords)[:5])

Number of stopwords: 179
First five stop words: ['i', 'me', 'my', 'myself', 'we']


**(3/3) Use spaCy's predefined stopwords:**

In [38]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
print('Number of stopwords: %d' % len(spacy_stopwords))
print('First five stop words: %s' % list(spacy_stopwords)[:5])

Number of stopwords: 326
First five stop words: ['see', 'thus', 'would', 'her', '‘ll']


In [39]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Combine the list of words into a single string
data_str = ' '.join(data)

# Process your text using spaCy
doc = nlp(data_str)

# Extract tokens, excluding stop words
tokens = [token.text for token in doc
          if not token.is_stop]

print(tokens)

['GENERATED', 'CHATGPT', 'time', ',', 'remote', 'island', 'nestled', 'heart', 'Pacific', ',', 'adventurer', 'named', 'Alex', 'found', 'stranded', 'unexpected', 'shipwreck', '.', 'island', ',', 'lush', 'vibrant', 'foliage', 'adorned', 'pristine', 'beaches', ',', 'like', 'paradise', 'glance', '.', ',', 'Alex', 'soon', 'realized', 'beauty', 'concealed', 'challenges', 'lay', 'ahead', '.', 'days', ',', 'Alex', 'scoured', 'shoreline', 'salvageable', 'items', 'shipwreck', '.', 'debris', ',', 'found', 'crates', 'canned', 'food', ',', 'tattered', 'functional', 'tent', ',', 'waterproof', 'box', 'containing', 'matches', '.', 'resources', ',', 'established', 'small', 'camp', 'near', 'freshwater', 'stream', ',', 'ensuring', 'basic', 'needs', 'met', '.', 'Days', 'turned', 'weeks', ',', 'Alex', 'survival', 'instincts', 'kicked', '.', 'began', 'exploring', 'island', 'interior', ',', 'learning', 'identify', 'edible', 'plants', 'honing', 'fishing', 'skills', '.', 'way', ',', 'encountered', 'peculiar', '

## SECTION 2.7: TF-IDF

TF-IDF reflects how important a word is to a document in a corpus.

$$W_{x,y}=tf_{x,y}\times \log\left(\frac{N}{df_{x}}\right)$$

Here,
> $W_{x,y}$ is the TF-IDF weight of the term $x$ within the document $y$

> $tf_{x,y}$ is the frequency of $x$ in $y$

> $df_{x}$ is the number of documents containing $x$

> $N$ is the total number of documents

With this formula, a word that appears in every document has $df_{x}=N$, so $\log(N/df_{x})=0$ and its TF-IDF is zero.
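A quick numeric check of the formula, as a sketch: the two documents are assumed to be already lowercased and tokenized on whitespace, and the logarithm is taken base 10. The helper `tf_idf` is a name made up for this example.

```python
import math

docs = [
    "the car is driven on the road".split(),
    "the truck is driven on the highway".split(),
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)          # frequency of the term in this document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return tf * math.log10(N / df)

for term in ("the", "car"):
    print(term, round(tf_idf(term, docs[0]), 4))
```

"the" occurs in both documents, so its weight is zero even though it is the most frequent word. "car" occurs in only one document and gets a positive weight.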

**(1/3) TF-IDF from Scratch:**

In [40]:
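import math

def computeTF(wordDict, bow):
    # Term frequency: count of each word divided by the document length
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

def computeIDF(docList):
    # docList holds one word-count dict per document, all sharing the same keys
    N = len(docList)

    # Document frequency: number of documents in which each word occurs
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # Assumes every word occurs in at least one document (otherwise val is 0)
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))

    return idfDict

def computeTFIDF(tfBow, idfs):
    # TF-IDF: term frequency weighted by inverse document frequency
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val*idfs[word]
    return tfidf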

**(2/3) TF-IDF using Sklearn**

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

D1 = "The car is driven on the road."
D2 = "The truck is driven on the highway."

vectorizer = TfidfVectorizer()
response = vectorizer.fit_transform([D1, D2])

print(response)

  (0, 5)	0.42471718586982765
  (0, 4)	0.30218977576862155
  (0, 1)	0.30218977576862155
  (0, 3)	0.30218977576862155
  (0, 0)	0.42471718586982765
  (0, 6)	0.6043795515372431
  (1, 2)	0.42471718586982765
  (1, 7)	0.42471718586982765
  (1, 4)	0.30218977576862155
  (1, 1)	0.30218977576862155
  (1, 3)	0.30218977576862155
  (1, 6)	0.6043795515372431


In [42]:
from pprint import pprint

pprint(list(enumerate(vectorizer.get_feature_names_out())))


`fit` learns vocabulary and idf from the training set.

`transform` transforms documents to document-term matrix.
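In the scikit-learn output above, "the" (feature 6) has a non-zero weight even though it occurs in both documents. By default `TfidfVectorizer` uses a smoothed IDF, $\ln\frac{1+N}{1+df_{x}}+1$, with raw counts as TF, and then L2-normalizes each row. The sketch below reproduces the first row by hand. It assumes the documents are already lowercased with punctuation removed, which approximates the vectorizer's default tokenization for these two sentences.

```python
import math

docs = [
    "the car is driven on the road".split(),
    "the truck is driven on the highway".split(),
]
n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})

def smoothed_idf(term):
    # ln((1 + N) / (1 + df)) + 1, so terms present in every document get idf = 1
    df = sum(1 for d in docs if term in d)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Raw count times smoothed idf, then scale the row to unit Euclidean length
    raw = [doc.count(t) * smoothed_idf(t) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

row = tfidf_row(docs[0])
for term, weight in zip(vocab, row):
    if weight:
        print(term, round(weight, 4))
```

The weights for "the" and "car" come out at about 0.6044 and 0.4247, matching entries `(0, 6)` and `(0, 0)` above.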

**(3/3) TF-IDF using Sklearn + Stop Words**

In [43]:
from pprint import pprint

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

D1 = "The car is driven on the road."
D2 = "The truck is driven on the highway."

vectorizer = TfidfVectorizer(stop_words='english')
response = vectorizer.fit_transform([D1, D2])

print(response)

# Print the feature names
feature_names = vectorizer.get_feature_names_out()
pprint(list(enumerate(feature_names)))

  (0, 3)	0.6316672017376245
  (0, 1)	0.4494364165239821
  (0, 0)	0.6316672017376245
  (1, 2)	0.6316672017376245
  (1, 4)	0.6316672017376245
  (1, 1)	0.4494364165239821
[(0, 'car'), (1, 'driven'), (2, 'highway'), (3, 'road'), (4, 'truck')]

# SECTION 3: Vector Models

## SECTION 3.1: Popular Word Embedding Algorithms

1. Skip-gram
2. Continuous Bag of Words (CBOW)
3. Word2Vec (2013) by Google, which trains with either the CBOW or the skip-gram architecture
4. Global Vectors for Word Representation (GloVe) by a team at Stanford University
5. fastText by Facebook's AI Research group
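Skip-gram and CBOW differ mainly in how training examples are built from a sliding context window. Skip-gram predicts each context word from the centre word, while CBOW predicts the centre word from its whole context. The sketch below only builds the training examples and does not train a model. The window size and the helper `context_windows` are assumptions made for this example.

```python
def context_windows(tokens, window=2):
    # Yield each centre word with up to `window` neighbours on each side
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = "the quick brown fox jumps".split()

# Skip-gram: one (input=centre, target=context word) pair per neighbour
skipgram_pairs = [(c, w) for c, ctx in context_windows(tokens) for w in ctx]

# CBOW: one (input=all context words, target=centre) example per position
cbow_examples = [(ctx, c) for c, ctx in context_windows(tokens)]

print(skipgram_pairs[:4])
print(cbow_examples[2])
```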

## SECTION 3.2: Word Embeddings with spaCy

In [46]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.6.0/en_core_web_lg-3.6.0-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [47]:
import spacy
import en_core_web_lg
nlp = en_core_web_lg.load()

In [48]:
from scipy import spatial

def cosine_similarity(x, y):
    # scipy returns cosine *distance*, so convert it to similarity
    return 1 - spatial.distance.cosine(x, y)

man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector
queen = nlp.vocab['queen'].vector
king = nlp.vocab['king'].vector

We now need to find the closest vector in the vocabulary to the result of "king - man + woman".

In [49]:
maybe_queen = king - man + woman
computed_similarities = []

for word in nlp.vocab:
    if not word.has_vector:
        continue

    similarity = cosine_similarity(maybe_queen, word.vector)
    computed_similarities.append((word, similarity))

computed_similarities = sorted(
    computed_similarities, key=lambda item: -item[1]
)
# Build a list so that the words are printed, not a generator object
print([w[0].text for w in computed_similarities[:10]])
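The arithmetic behind the analogy is easier to follow on a toy example. The two-dimensional vectors below are invented for illustration, with the axes loosely standing for "royalty" and "gender". They are not real embeddings, and the helper `cosine` is a plain-Python stand-in for the SciPy call above.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors: [royalty, gender]
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Exclude the three query words, then pick the most similar remaining word
candidates = {w: v for w, v in toy.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)
```

Excluding the query words matters in practice too, because the nearest vector to "king - man + woman" is often "king" itself.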


## SECTION 3.3: FastText

In [51]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.11.1-py3-none-any.whl (227 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4199769 sha256=dbd67c01ee36a79f30250e76f00a25aa2c4623abb3e024bfee31475ede491a29
  Stored in directory: /root/.cache/pip/wheels/a5/13/75/f811c84a8ab36eedbaef977a6a58a98990e8e0f1967f98f394
Successfully built fa

In [52]:
import fasttext

# Skip-gram model
skipgram_model = fasttext.train_unsupervised(
    'sample_text.txt', model='skipgram'
)
# CBOW model
cbow_model = fasttext.train_unsupervised(
    'sample_text.txt', model='cbow'
)

In [53]:
skipgram_model.get_word_vector("environment")

array([-1.74022898e-05,  2.80962646e-04,  5.05377029e-05,  2.98797422e-05,
       -3.74962663e-04, -9.13969197e-05,  1.70033003e-04, -2.15993365e-04,
        2.53577076e-04,  2.07336037e-04, -5.01854694e-04, -1.13218390e-04,
       -4.86230914e-04, -4.21434845e-04, -6.72254770e-04,  1.86717123e-04,
        8.09436242e-05,  4.61690870e-05,  4.64938296e-07, -3.91224137e-04,
        2.23598647e-04,  1.53131492e-04, -6.70746667e-05, -5.31157886e-04,
       -5.16778273e-05,  6.97572104e-05, -5.54304221e-04,  1.52136490e-04,
        4.62088610e-05, -3.10951946e-05, -1.87138925e-04, -2.42755210e-04,
        7.47237136e-05,  2.55463645e-04, -4.95463784e-04,  7.32286353e-05,
       -4.98533947e-04,  2.05337725e-04, -9.69687535e-05,  5.07802295e-04,
        1.31217006e-04,  6.31086819e-04, -3.81859514e-04,  1.86736113e-04,
       -1.88688253e-04,  5.29979821e-04, -1.95597502e-04, -5.07256074e-04,
        1.63958714e-04,  5.71078272e-04, -7.01919897e-04,  2.38064429e-04,
       -3.82221682e-04,  

**Testing your model:**

Check the nearest neighbours:

In [55]:
skipgram_model.get_nearest_neighbors('island')

[(0.6100154519081116, 'and'), (0.146626815199852, 'a'), (0.09915514290332794, 'they'), (0.09666965901851654, 'of'), (0.09045328944921494, 'the'), (0.08943742513656616, '</s>'), (0.08328405767679214, 'Alex'), (0.07356219738721848, 'their')]

In [57]:
cbow_model.get_nearest_neighbors('island')

[(0.6045956015586853, 'and'), (0.14456523954868317, 'a'), (0.08874776214361191, '</s>'), (0.08798855543136597, 'they'), (0.08178038150072098, 'of'), (0.06840469688177109, 'Alex'), (0.06475301086902618, 'the'), (0.025664031505584717, 'their')]