# Stemming and Lemmatization

#### 1. Given the list of pluralized words below, define your own simple word stemmer function or class,  limited to only simple rules and regex. No libraries! It should strip basic endings.

In [140]:
import re

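class SimpleStupidStemmer():
    def __init__(self):
        # Lazy prefix (group 1) followed by an optional known suffix. Because the
        # prefix is lazy, the longest suffix that reaches the end of the word wins.
        self.regex = r'^(.*?)(ie|(iz)es|(iz)ed|(iz)ation|al|ence|(iz)er|ing|ly|ul|ous|y|s)?$'

    def stem(self, word):
        # Return the prefix without the suffix; fall back to the word itself.
        match = re.match(self.regex, word)
        if match:
            return match.group(1)
        else:
            return word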

In [141]:
simple_stupid_stemmer = SimpleStupidStemmer()

In [142]:
plurals = [
    "flies",
    "denied",
    "itemization",
    "sensational",
    "reference",
    "colonizer",
]

for word in plurals:
    print(simple_stupid_stemmer.stem(word))

flie
denied
item
sensation
refer
colon


#### 2. After your initial implementation, run it on the following words:

In [143]:
new_words = [
    "friendly",
    "puzzling",
    "helpful",
]

for word in new_words:
    print(simple_stupid_stemmer.stem(word))

friend
puzzl
helpf
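
The `helpf` result shows the limit of a single suffix alternation: there is no `ful` rule, so the shorter `ul` suffix matches instead. One possible refinement, sketched here under assumptions, is an ordered list of (pattern, replacement) rules with a minimum stem length. The names `RULES` and `stem_with_rules` and the rule list itself are hypothetical and only meant as an illustration, not as a complete stemmer.

```python
import re

# Hypothetical (pattern, replacement) rules for illustration only.
# Order matters: the first rule that yields a long enough stem wins.
RULES = [
    (r"ization$", ""),
    (r"izer$", ""),
    (r"ies$", "y"),
    (r"ied$", "y"),
    (r"ful$", ""),
    (r"ing$", ""),
    (r"ly$", ""),
    (r"s$", ""),
]


def stem_with_rules(word, rules=RULES, min_stem=3):
    """Apply the first rule whose result keeps at least `min_stem` characters."""
    for pattern, replacement in rules:
        candidate, n_subs = re.subn(pattern, replacement, word)
        if n_subs and len(candidate) >= min_stem:
            return candidate
    return word  # no rule applied


for w in ["flies", "denied", "helpful", "friendly", "puzzling", "is"]:
    print(w, "->", stem_with_rules(w))
# flies -> fly, denied -> deny, helpful -> help,
# friendly -> friend, puzzling -> puzzl, is -> is
```

The minimum stem length keeps short words such as "is" intact, at the cost of one more parameter to tune.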


#### 3. Realizing that fixing future words manually can be problematic, use a desired NLTK stemmer and run it on all the words:

In [144]:
import nltk

all_words = plurals + new_words

stemmer = nltk.PorterStemmer()

for word in all_words:
    print(stemmer.stem(word))

fli
deni
item
sensat
refer
colon
friendli
puzzl
help


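We notice that our simple, regex-based stemmer agrees with the Porter implementation on four of the nine words ("item", "refer", "colon" and "puzzl"). On the rest it differs: it leaves "denied" untouched and cuts "helpful" to "helpf", whereas Porter strips more aggressively ("sensat", "friendli").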

#### 4. There are likely a few words in the outputs above that would cause issues in real-world applications. Pick some examples, and show how they are solved with a lemmatizer. Use either spaCy or nltk.

"fli", "sensat", "colon" and "deni" are all problematic stems, as they are either not real words or are ambiguous. For example, "fli" is derived from "flies" (lemma "fly"), and "colon" could refer to a body part or a punctuation mark, whereas the word it was derived from here is "colonizer".

In [145]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/dion/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [146]:
lemmatizer = WordNetLemmatizer()

for word in all_words:
    print(lemmatizer.lemmatize(word))

fly
denied
itemization
sensational
reference
colonizer
friendly
puzzling
helpful


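We observe that the lemmatizer always returns valid dictionary words, and that it correctly maps "flies" to "fly". The remaining words are returned unchanged: `lemmatize` treats every word as a noun unless a part of speech is passed, so a verb form such as "denied" is left as-is unless `pos='v'` is supplied.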

# Stemming/Lemmatization - Practical Example
Using the news corpus (subset/category of the Brown corpus), perform common text normalization techniques such as stopword filtering and stemming/lemmatization. Compare the top 10 most common **words** before and after these normalization techniques.

In [147]:
nltk.download('brown')

[nltk_data] Downloading package brown to /Users/dion/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [148]:
from nltk.corpus import brown
from nltk import FreqDist

# The Brown corpus was already downloaded in the previous cell.
news = brown.words(categories='news')

fd_before = FreqDist(news)


In [149]:
from nltk.corpus import stopwords

nltk.download('stopwords')

stopwords = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to /Users/dion/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


First, we lowercase every token, drop stopwords, and keep only purely alphabetic tokens. This removes punctuation marks, but also numbers and hyphenated tokens.

In [150]:
news_no_stopwords = [word.lower() for word in news if word.lower() not in stopwords and word.isalpha()]
fd_no_stopwords = FreqDist(news_no_stopwords)
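
To make the effect of this filter concrete, here is a minimal sketch on a handful of tokens. The token list and the one-word `toy_stopwords` set are made up for illustration and are not taken from the corpus run above.

```python
# Hypothetical tokens and a toy stopword set, for illustration only.
tokens = ["The", "Fulton", "County", "1961", ",", "mid-term", "election", "``"]
toy_stopwords = {"the"}

# Same shape as the filter above: lowercase, drop stopwords, keep alphabetic tokens.
kept = [t.lower() for t in tokens if t.lower() not in toy_stopwords and t.isalpha()]
print(kept)  # ['fulton', 'county', 'election']
```

Note that `str.isalpha()` rejects "1961" and "mid-term" as well as the punctuation tokens, which may or may not be desired depending on the application.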

In [151]:
news_no_stopwords_stemmed = [stemmer.stem(word) for word in news_no_stopwords]
fd_no_stopwords_stemmed = FreqDist(news_no_stopwords_stemmed)

In [152]:
news_no_stopwords_lemmatized = [lemmatizer.lemmatize(word) for word in news_no_stopwords]
fd_no_stopwords_lemmatized = FreqDist(news_no_stopwords_lemmatized)

Now, we will compare the top 10 most common words before and after these normalization techniques.

##### Most Common Words Before Normalization

In [153]:
fd_before.most_common(10)

[('the', 5580),
 (',', 5188),
 ('.', 4030),
 ('of', 2849),
 ('and', 2146),
 ('to', 2116),
 ('a', 1993),
 ('in', 1893),
 ('for', 943),
 ('The', 806)]

##### Most Common Words After Stopword Removal

In [154]:
fd_no_stopwords.most_common(10)

[('said', 406),
 ('would', 246),
 ('new', 241),
 ('one', 213),
 ('last', 177),
 ('two', 174),
 ('first', 158),
 ('state', 153),
 ('year', 142),
 ('president', 142)]

##### Most Common Words After Stopword Removal and Stemming

In [155]:
fd_no_stopwords_stemmed.most_common(10)

[('said', 406),
 ('would', 246),
 ('year', 244),
 ('new', 241),
 ('one', 221),
 ('state', 219),
 ('last', 179),
 ('two', 174),
 ('first', 158),
 ('presid', 147)]

##### Most Common Words After Stopword Removal and Lemmatization

In [156]:
fd_no_stopwords_lemmatized.most_common(10)

[('said', 406),
 ('would', 246),
 ('year', 244),
 ('new', 241),
 ('one', 221),
 ('state', 213),
 ('last', 177),
 ('two', 174),
 ('first', 158),
 ('president', 143)]

# TF-IDF
TF-IDF (term frequency-inverse document frequency) is a way to measure the importance of a word in a document.

$$
\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)
$$

Where:
- $t$ is the term (word)
- $d$ is the document
- $D$ is the corpus



#### 1. Implement TF-IDF using NLTK's FreqDist (no use of e.g. scikit-learn and other high-level libraries).

In [157]:
from typing import List

##########################################################
# Feel free to change everything below.
# It is merely a guide to understand the inputs/outputs
##########################################################


def tf(document: List[str], term: str) -> float:
    """
    Calculate the term frequency (TF) of a given term in a document.

    Args:
        document (List[str]): The document in which to calculate the term frequency.
        term (str): The term for which to calculate the term frequency.

    Returns:
        float: The term frequency of the given term in the document.
    """
    return document.count(term)/len(document)


def idf(documents: List[List[str]], term: str) -> float:
    """
    Calculate the inverse document frequency (IDF) of a term in a collection of documents.

    Args:
        documents (List[List[str]]): A list of documents, where each document is represented as a list of strings.
        term (str): The term for which IDF is calculated.

    Returns:
        float: The IDF value of the term.
    """
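    # Smoothed ratio N / (1 + df), where df is the number of documents containing
    # the term. Note that no logarithm is applied here.
    return len(documents) / (1 + sum(1 for document in documents if term in document))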


def tf_idf(
    all_documents: List[List[str]],
    document: List[str],
    term: str,
) -> float:
    return tf(document, term) * idf(all_documents, term)
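
The `idf` above is a smoothed ratio, N / (1 + df), without a logarithm. Many formulations of TF-IDF instead take the logarithm of that ratio so that very rare terms do not dominate. The following is a minimal sketch of that variant on a toy corpus. The names `term_frequency`, `log_idf`, `tf_idf_log` and `toy_docs` are hypothetical and not part of the exercise template.

```python
import math
from typing import List


def term_frequency(document: List[str], term: str) -> float:
    """Relative frequency of `term` in `document` (0.0 for an empty document)."""
    return document.count(term) / len(document) if document else 0.0


def log_idf(documents: List[List[str]], term: str) -> float:
    """Log-scaled IDF with the same +1 smoothing as the ratio version."""
    df = sum(1 for document in documents if term in document)
    return math.log(len(documents) / (1 + df))


def tf_idf_log(documents: List[List[str]], document: List[str], term: str) -> float:
    return term_frequency(document, term) * log_idf(documents, term)


# Toy corpus, made up for illustration.
toy_docs = [
    ["the", "ship", "left", "the", "port"],
    ["the", "election", "was", "held"],
    ["the", "port", "was", "closed"],
    ["a", "highway", "was", "built"],
]

for term in ["the", "port", "ship"]:
    print(term, round(tf_idf_log(toy_docs, toy_docs[0], term), 4))
```

In this toy corpus "the" scores 0, "port" scores higher, and "ship" (found in only one document) scores highest. With this particular smoothing, a term that occurs in every document would get a slightly negative score, which is one reason other smoothing schemes exist.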


#### 2. With your TF-IDF function in place, calculate the TF-IDF for the following words in the first document of the news articles found in the Brown corpus: 

- *the*
- *nevertheless*
- *highway*
- *election*

Perform any preprocessing steps you deem necessary. Comment on your findings.

In [158]:
fileids = brown.fileids(categories='news')
first_doc = list(brown.words(fileids[0]))
all_docs = [list(brown.words(fileid)) for fileid in fileids]

##### Preprocessing

In [159]:
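# Lowercase, drop stopwords, then lemmatize. The stopword filter has to sit on the
# inner (per-word) loop so that it is applied to every word of every document.
first_doc = [lemmatizer.lemmatize(word.lower()) for word in first_doc if word.lower() not in stopwords]
all_docs = [[lemmatizer.lemmatize(word.lower()) for word in doc if word.lower() not in stopwords] for doc in all_docs]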

##### TF-IDF

In [160]:
interesting_words = ["the", "nevertheless", "highway", "election"]

for word in interesting_words:
    print(tf_idf(all_docs, first_doc, word))

0.0
0.005138986218173324
0.04111188974538659
0.030833917309039945


We can see that "the" scores exactly 0: it was removed from the first document as a stopword, so its term frequency there is 0. "nevertheless" remains but receives the lowest non-zero score, while the more meaningful words "highway" and "election" score eight and six times higher, respectively.

#### 3. While TF-IDF is primarily used for information retrieval and text mining, reflect on how TF-IDF could be used in a language modeling context.

TF-IDF assigns each word a weight per document. Machine learning models rely on weights to distinguish their features, so TF-IDF scores could be of great use for weighting features in a language model, for instance to emphasise document-specific content words over words that are frequent everywhere.

#### 4. You were previously introduced to word representations. TF-IDF can be considered one. What are some differences between the TF-IDF output and one that is computed once from a vocabulary (e.g. one-hot encoding)?

One-hot encoding and bag-of-words (BOW) representations capture the presence of a word, or at most its raw number of occurrences, without any regard to its importance. A one-hot vector is also fixed once the vocabulary is known, whereas a TF-IDF weight depends on the document and the corpus: it captures presence together with an estimate of how relevant the word is to that document relative to the rest of the collection.
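
The contrast can be sketched on a toy corpus. The example below is a minimal illustration, assuming made-up documents and the hypothetical helpers `one_hot` and `tf_idf_vector`; it uses the same non-logarithmic N / (1 + df) weighting as the implementation earlier in this notebook.

```python
from typing import List

# Toy corpus, made up for illustration.
docs = [
    ["ship", "port", "ship"],
    ["port", "strike"],
    ["port", "grain", "grain"],
]
vocab = sorted({w for d in docs for w in d})  # computed once from the vocabulary


def one_hot(word: str) -> List[int]:
    """Vector that depends only on the word's position in the vocabulary."""
    return [1 if word == v else 0 for v in vocab]


def tf_idf_vector(doc: List[str]) -> List[float]:
    """One TF-IDF weight per vocabulary entry, specific to `doc` and the corpus."""
    n_docs = len(docs)
    vector = []
    for v in vocab:
        tf = doc.count(v) / len(doc)
        df = sum(1 for d in docs if v in d)
        vector.append(round(tf * n_docs / (1 + df), 3))
    return vector


print(vocab)                   # ['grain', 'port', 'ship', 'strike']
print(one_hot("port"))         # [0, 1, 0, 0]
print(tf_idf_vector(docs[0]))  # [0.0, 0.25, 1.0, 0.0]
print(tf_idf_vector(docs[2]))  # [1.0, 0.25, 0.0, 0.0]
```

The one-hot vector for "port" is the same wherever the word appears, while the TF-IDF vectors differ per document and give "port", which occurs in every document, a lower weight than "ship" or "grain".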

# TF-IDF - Practical Example
You will again be looking at specific words for a document, but this time weighted by their TF-IDF scores. Ideally, the scoring should be able to retrieve representative words for this document in context of its document collection or category.

You will do the following:
- Select a category from the Reuters (news) corpus
- Perform preprocessing
- Calculate TF-IDF scores
- Find the top 5 words per document for a subset of the documents in your collection (e.g. 5, 10, ..)
- Inspect whether these words make sense for a given document, and comment on your findings.

In [161]:
from nltk.corpus import reuters

nltk.download("reuters")

[nltk_data] Downloading package reuters to /Users/dion/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

For our category, we will pick "ship".

In [162]:
ship_fileids = reuters.fileids(categories='ship')

all_ship_docs = [list(reuters.words(fileid)) for fileid in ship_fileids]

For our preprocessing, we will:
* Lowercase our words.
* Remove stopwords.
* Lemmatize our words.

Our words are already tokenized.

In [163]:
all_ship_docs = [[lemmatizer.lemmatize(word.lower()) for word in doc if word.lower() not in stopwords and word.lower().isalpha()] for doc in all_ship_docs]

We will now calculate the TF-IDF scores for our words.

In [164]:
all_ship_docs_tf_idf = [[tf_idf(all_ship_docs, doc, word) for word in doc] for doc in all_ship_docs]

And we will merge our words with their respective TF-IDF scores.

In [165]:
all_ship_docs_with_tf_idf = [list(set(zip(doc, all_ship_docs_tf_idf[i]))) for i, doc in enumerate(all_ship_docs)]
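
The scoring and merging steps above compute a score for every token occurrence and then deduplicate. An alternative, sketched here under assumptions, scores each unique word once and picks the top entries with `heapq.nlargest`. The function name `top_k_terms` and the toy documents are hypothetical; the weighting is the same N / (1 + df) ratio used by `idf` earlier.

```python
import heapq
from typing import List, Tuple


def top_k_terms(documents: List[List[str]], doc: List[str], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (term, tf-idf) pairs of `doc`."""
    n_docs = len(documents)
    scores = {}
    for term in set(doc):  # score each unique term only once
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in documents if term in d)
        scores[term] = tf * n_docs / (1 + df)
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy documents, made up for illustration.
toy_docs = [
    ["ship", "port", "ship", "strike"],
    ["port", "grain"],
    ["port", "strike", "union"],
]
print(top_k_terms(toy_docs, toy_docs[0], k=2))  # [('ship', 0.75), ('strike', 0.25)]
```

For long documents this avoids recomputing the same score for repeated words, although the document frequency is still recounted per term.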

Preselect 10 documents and find the top 5 words for each document.

In [166]:
all_ship_docs_with_tf_idf = all_ship_docs_with_tf_idf[:10]

for doc in all_ship_docs_with_tf_idf:
    doc.sort(key=lambda x: x[1], reverse=True)
    print(doc[:5])

[('nsw', 2.4444444444444446), ('altogether', 1.2222222222222223), ('ban', 0.9166666666666666), ('appear', 0.8148148148148149), ('disruption', 0.8148148148148149)]
[('nsw', 2.8600000000000003), ('ban', 1.340625), ('kembla', 1.1916666666666667), ('prevented', 0.89375), ('newcastle', 0.89375)]
[('lammers', 2.234375), ('zeebregts', 1.1171875), ('reconsider', 1.1171875), ('judgment', 1.1171875), ('procedural', 1.1171875)]
[('todd', 7.526315789473684), ('division', 7.526315789473684), ('galveston', 3.763157894736842), ('bargaining', 3.763157894736842), ('collective', 2.508771929824561)]
[('paid', 2.306451612903226), ('tonner', 1.5376344086021505), ('savona', 1.1532258064516128), ('moderately', 1.1532258064516128), ('pallice', 1.1532258064516128)]
[('tate', 8.9375), ('lyle', 7.15), ('mykon', 3.575), ('silvertown', 3.575), ('discharged', 3.575)]
[('destrehan', 2.6), ('elevator', 2.311111111111111), ('cargill', 1.7333333333333332), ('orleans', 1.4857142857142855), ('bunge', 1.3)]
[('banner', 2.

We can see that these words match the category **ship** fairly well: several are cities or ports (e.g. "newcastle", "galveston", "savona"), while others relate to trade (e.g. "tonner", "discharged").

# Part-of-speech tagging

#### 1. Briefly describe your understanding of POS tagging and its possible use-cases in context of text generation applications/language modeling.

Part-of-speech (POS) tagging assigns each token its lexical category (noun, verb, adjective, ...) based on its context, which helps resolve ambiguities for words that can take several roles. It is especially useful when generating text, as it lets us pinpoint which lexical category the next word should have, given the sentence generated so far.

#### 2. Train a UnigramTagger (NLTK) using the Brown corpus. 
Hint: the taggers in nltk require a list of sentences containing tagged words.

In [167]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/dion/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

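We will extract the tagged words from the Brown corpus and build a UnigramTagger from them.

* Rather than training on the tagged sentences, we build a lookup model that maps each of the 1000 most common words to its most frequent tag. We restrict the model to 1000 words because using too many values caused an error for us.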

In [168]:
fd_brown = nltk.FreqDist(brown.words())
cfd_brown = nltk.ConditionalFreqDist(brown.tagged_words())

tags_per_word = dict((word, cfd_brown[word].max()) for (word, _) in fd_brown.most_common(1000))

brown_unigram_tagger = nltk.UnigramTagger(model=tags_per_word)

#### 3. Use this tagger to tag the text given below. Print out the POS tags for all variants of "justify"

In [169]:
text = """
Imagine a situation where you have to explain why you did something – that's when you justify your actions. So, let's say you made a decision; you, as the justifier, need to give good reasons (justifications) for your choice. You might use justifying words to make your point clear and reasonable. Justifying can be a bit like saying, "Here's why I did what I did." When you justify things, you're basically providing the why behind your actions. So, being a good justifier involves carefully explaining, giving reasons, and making sure others understand your choices
"""

text_tokens = nltk.word_tokenize(text)
text_tokens = [word.lower() for word in text_tokens]

text_tags = brown_unigram_tagger.tag(text_tokens)

justify_tags = [tag for (word, tag) in text_tags if re.match(r'^justif.*', word)]
justify_tags

[None, None, None, None, None, None, None]

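We can see that nothing was tagged: every variant receives `None`, because none of them is among the 1000 most common words in the lookup model and the tagger has no backoff tagger to fall back on.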
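
The `None` tags follow directly from how a lookup model behaves. The sketch below mimics the idea in plain Python; it is not NLTK's implementation, and `toy_model` and `lookup_tag` are hypothetical names with a made-up three-word model.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy model: word -> most frequent tag (Brown-style tags).
toy_model: Dict[str, str] = {"you": "PPSS", "the": "AT", "your": "PP$"}


def lookup_tag(
    tokens: List[str],
    model: Dict[str, str],
    default: Optional[str] = None,
) -> List[Tuple[str, Optional[str]]]:
    """Tag each token by dictionary lookup; unknown words get `default`."""
    return [(token, model.get(token, default)) for token in tokens]


tokens = ["you", "justify", "your", "actions"]
print(lookup_tag(tokens, toy_model))
# [('you', 'PPSS'), ('justify', None), ('your', 'PP$'), ('actions', None)]
print(lookup_tag(tokens, toy_model, default="NN"))
# [('you', 'PPSS'), ('justify', 'NN'), ('your', 'PP$'), ('actions', 'NN')]
```

In NLTK the analogous mechanism is the `backoff` argument, e.g. `nltk.UnigramTagger(model=tags_per_word, backoff=nltk.DefaultTagger('NN'))`, which replaces `None` with a fallback tag but still ignores context.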

#### 4. Your results may be disappointing. Repeat the same task as above using both the default NLTK pos-tagger and with spaCy. Compare the results

In [170]:
brown_default_tagger = nltk.DefaultTagger('NN')

default_tagged_text = brown_default_tagger.tag(text_tokens)

default_justifying_tags = [tag for (word, tag) in default_tagged_text if re.match(r'^justif.*', word)]
default_justifying_tags

['NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN']

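Everything was tagged using the same tag, since `DefaultTagger('NN')` assigns `NN` to every token regardless of the word or its context. Note that NLTK's pretrained tagger, `nltk.pos_tag`, is a different tool that does take context into account.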

In [171]:
import spacy
from spacy import displacy

!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [172]:
nlp = spacy.load('en_core_web_sm')

pipeline = nlp(text)

spacy_justify_word_tags = [(token.text, token.tag_) for token in pipeline if re.match(r'^justif.*', token.text)]
spacy_justify_word_tags

[('justify', 'VBP'),
 ('justifier', 'NN'),
 ('justifications', 'NNS'),
 ('justifying', 'VBG'),
 ('justify', 'VBP'),
 ('justifier', 'NN')]

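The results are significantly better, with all of our words having (probably) appropriate tags relative to their context. Note that only six variants are listed here: the regex is case-sensitive and the spaCy tokens were not lowercased, so the capitalized "Justifying" is not matched.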

#### 5. Finally, explore more features of what the spaCy *document* includes related to topics covered in this lab.

What I found interesting is the dependency visualizer!

In [173]:
displacy.render(pipeline, style='dep', jupyter=True, page=True, options={'compact': True})