# Stemming and Lemmatization

#### 1. Given the list of pluralized words below, define your own simple word stemmer function or class, limited to only simple rules and regex. No libraries! It should strip basic endings.

In [2]:
plurals = [
    "flies",
    "denied",
    "itemization",
    "sensational",
    "reference",
    "colonizer",
]

# TODO: implement your own simple stemmer

def stemming(word_list):
    suffixes = ["es", "ed", "ation", "al", "ence", "zer"]
    stemmed_list = []

    for word in word_list:
        for suffix in suffixes:
            if word.endswith(suffix):
                # Strip the first matching suffix only, so each word
                # contributes exactly one entry to the result.
                stemmed_list.append(word[:-len(suffix)])
                break
        else:
            # No suffix matched: keep the word unchanged.
            stemmed_list.append(word)
    return stemmed_list

stemming(plurals)

['fli', 'deni', 'itemiz', 'sensation', 'refer', 'coloni']
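
The task also allows regex rules. A regex variant might look like the sketch below, where each rule rewrites an ending instead of only cutting it off. The rule table `RULES` and the function name `regex_stem` are illustrative assumptions, tuned to the six example words rather than to English in general.

```python
import re

# Ordered (pattern, replacement) rules; the first matching rule wins.
# The rules are hypothetical and only cover the example words.
RULES = [
    (re.compile(r"ies$"), "y"),        # flies -> fly
    (re.compile(r"ied$"), "y"),        # denied -> deny
    (re.compile(r"ization$"), "ize"),  # itemization -> itemize
    (re.compile(r"izer$"), "ize"),     # colonizer -> colonize
    (re.compile(r"al$"), ""),          # sensational -> sensation
    (re.compile(r"ence$"), ""),        # reference -> refer
]


def regex_stem(word):
    """Apply the first rule whose pattern matches the end of the word."""
    for pattern, replacement in RULES:
        stem, count = pattern.subn(replacement, word)
        if count:
            return stem
    return word  # no rule matched


words = ["flies", "denied", "itemization", "sensational", "reference", "colonizer"]
stems = [regex_stem(w) for w in words]
```

Rewriting endings gives more readable stems than plain stripping, but every new word family still needs its own hand-written rule.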

#### 2. After your initial implementation, run it on the following words:

In [3]:
new_words = [
    "friendly",
    "puzzling",
    "helpful",
]
# TODO: run your stemmer on the new words
stemming(new_words)

['friendly', 'puzzling', 'helpful']
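
The new words come back unchanged because none of the suffixes in the list covers "ly", "ing", or "ful". Extending the list by hand works, but it needs guards against over-stemming. The sketch below is one possible way to do that; `strip_suffix`, `SUFFIXES`, and the `min_stem` threshold are assumptions made for illustration.

```python
# Hypothetical extended suffix list: the original six plus three new endings.
SUFFIXES = ["es", "ed", "ation", "al", "ence", "zer", "ly", "ing", "ful"]


def strip_suffix(word, suffixes=SUFFIXES, min_stem=3):
    """Strip the longest matching suffix, unless the stem would get too short."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


new_stems = [strip_suffix(w) for w in ["friendly", "puzzling", "helpful"]]
short_word = strip_suffix("sing")  # "ing" matches, but the stem "s" is too short
```

Even with the guard, every extension is another manual rule, which is the maintenance problem the next task points at.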

#### 3. Realizing that fixing future words manually can be problematic, use an NLTK stemmer of your choice and run it on all the words:

In [4]:
import nltk
from nltk.stem import PorterStemmer

all_words = plurals + new_words

# TODO: use an nltk stemming implementation to stem `all_words`


porter = PorterStemmer()

def stemming_nltk(word_list):
    stemmed_list = []
    
    for word in word_list:
        stemmed_list.append(porter.stem(word))

    return stemmed_list

stemming_nltk(all_words)

['fli',
 'deni',
 'item',
 'sensat',
 'refer',
 'colon',
 'friendli',
 'puzzl',
 'help']

#### 4. There are likely a few words in the outputs above that would cause issues in real-world applications. Pick some examples, and show how they are solved with a lemmatizer. Use either spaCy or nltk.

My knowledge of English grammar is limited, so the words I picked may not be especially challenging. Still, "puzzling" and its stem "puzzl" show up in rather different contexts, which could cause problems, and reducing "helpful" to "help" is also an issue because the two carry different sentiment. Two of the three additional words are stemmed to forms that are not real words ("friendli", "puzzl"), so they are better handled by lemmatization.

In [5]:
# TODO: basic observations on which examples are problematic with stemming + implement lemmatization with spacy/nltk
problem_words = ["flies", "denied", "colonizer", "puzzling", "helpful"]

from nltk.stem import WordNetLemmatizer
 
lem = WordNetLemmatizer()

def lemmatization_nltk(word_list):
    lemmed_list = []
    
    for word in word_list:
        lemmed_list.append(lem.lemmatize(word))

    return lemmed_list

lemmatization_nltk(problem_words)

['fly', 'denied', 'colonizer', 'puzzling', 'helpful']
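
Only "flies" changes above because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed. The toy lookup below illustrates why the part of speech matters. The table `LEMMAS` and the function `toy_lemmatize` are made up for this illustration and are not how WordNet works internally.

```python
# Hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
    ("denied", "v"): "deny",
    ("puzzling", "v"): "puzzle",
}


def toy_lemmatize(word, pos="n"):
    """Look the word up under the given POS; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


as_nouns = [toy_lemmatize(w) for w in ["flies", "denied", "puzzling"]]
as_verbs = [toy_lemmatize(w, pos="v") for w in ["flies", "denied", "puzzling"]]
```

With NLTK, the corresponding change would be to call `lem.lemmatize(word, pos="v")` for words known to be verbs, which requires POS information for each token.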

# Stemming/Lemmatization - Practical Example
Using the news corpus (subset/category of the Brown corpus), perform common text normalization techniques such as stopword filtering and stemming/lemmatization. Compare the top 10 most common **words** before and after these normalization techniques.

In [6]:
# import nltk; nltk.download('brown')  # ensure we have the data
from nltk.corpus import brown
news = brown.words(categories='news')


# TODO: find the top 10 most common words
from nltk.probability import FreqDist

def word_freq(tokenized_corpus, n):
    freq = FreqDist()
    
    for word in tokenized_corpus:
        freq[word.lower()] += 1

    return freq.most_common(n)

word_freq(news, 10)

[('the', 6386),
 (',', 5188),
 ('.', 4030),
 ('of', 2861),
 ('and', 2186),
 ('to', 2144),
 ('a', 2130),
 ('in', 2020),
 ('for', 969),
 ('that', 829)]

In [7]:
# TODO: find the top 10 most common words after applying text normalization techniques
import string
from nltk.corpus import stopwords

def remove_stopwords(tokenized_corpus):
    # The extra tokens "``", "''" and "--" are multi-character quote and dash
    # tokens that the single characters in string.punctuation do not cover.
    stop_words = set(stopwords.words("english") + list(string.punctuation) + ["``", "''", "--"])
    filtered_corpus = []

    for word in tokenized_corpus:
        if word.lower() not in stop_words:
            filtered_corpus.append(word.lower())
    
    return filtered_corpus

def clean_document(tokenized_corpus):
    filtered_corpus = remove_stopwords(tokenized_corpus)
    cleaned_corpus = lemmatization_nltk(filtered_corpus)
    return cleaned_corpus

word_freq(clean_document(news), 10)

[('said', 406),
 ('mrs.', 253),
 ('would', 246),
 ('year', 244),
 ('new', 241),
 ('one', 221),
 ('state', 213),
 ('last', 177),
 ('two', 174),
 ('mr.', 170)]

# TF-IDF
TF-IDF (term frequency-inverse document frequency) measures how important a word is to a document relative to a collection of documents.

$$
\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)
$$

Where:
- $t$ is the term (word)
- $d$ is the document
- $D$ is the corpus
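
The product above leaves $\text{tf}$ and $\text{idf}$ undefined. One common choice, and the one the implementation in the next exercise follows, is a relative term frequency combined with a natural-log inverse document frequency. Other variants (smoothed or differently scaled) exist.

$$
\text{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad \text{idf}(t, D) = \ln \frac{|D|}{|\{d' \in D : t \in d'\}|}
$$

Here $f_{t,d}$ is the number of times $t$ occurs in $d$, $|d|$ is the number of tokens in $d$, and $|D|$ is the number of documents. The symbols $f_{t,d}$ and $d'$ are notation introduced for this note. The implementation returns 0 for a term that occurs in no document, where the fraction would otherwise be undefined.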



#### 1. Implement TF-IDF using NLTK's FreqDist (no use of e.g. scikit-learn and other high-level libraries).

In [8]:
from typing import List
import numpy as np

##########################################################
# Feel free to change everything below.
# It is merely a guide to understand the inputs/outputs
##########################################################


############ TODO ############
def tf(document: List[str], term: str) -> float:
    """
    Calculate the term frequency (TF) of a given term in a document.

    Args:
        document (List[str]): The document in which to calculate the term frequency.
        term (str): The term for which to calculate the term frequency.

    Returns:
        float: The term frequency of the given term in the document.
    """
    term_freq = FreqDist()
    for word in document:
        if word.lower() == term:
            term_freq[term] += 1
    
    termfreq = (term_freq[term])/len(document)
    return termfreq


############ TODO ############
def idf(documents: List[List[str]], term: str) -> float:
    """
    Calculate the inverse document frequency (IDF) of a term in a collection of documents.

    Args:
        documents (List[List[str]]): A list of documents, where each document is represented as a list of strings.
        term (str): The term for which IDF is calculated.

    Returns:
        float: The IDF value of the term.
    """
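    # Count the documents that contain the term. Matching is case-sensitive,
    # unlike tf(), which lowercases tokens, so lowercase the documents first
    # if the two counts should agree.
    doc_count = sum(1 for doc in documents if term in doc)
    if doc_count == 0:
        # Unseen term: return 0.0 instead of dividing by zero.
        return 0.0

    return float(np.log(len(documents) / doc_count))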


############ TODO ############
def tf_idf(all_documents: List[List[str]], document: List[str], term: str) -> float:

    term_freq = tf(document, term)
    inverse_df = idf(all_documents, term)
    tf_idf = term_freq*inverse_df

    return tf_idf
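
As a sanity check, the same formulas can be worked by hand on a toy corpus. This is a stdlib-only sketch: the helpers `toy_tf`, `toy_idf`, `toy_tf_idf` and the three example documents are made up for illustration and only mirror the functions above.

```python
import math

# Three tiny, already lowercased documents (hypothetical data).
docs = [
    ["the", "coffee", "price", "rose"],
    ["the", "rubber", "price", "fell"],
    ["the", "coffee", "harvest", "improved"],
]


def toy_tf(document, term):
    """Relative frequency of the term in one document."""
    return document.count(term) / len(document)


def toy_idf(documents, term):
    """Natural log of (number of documents / documents containing the term)."""
    n_containing = sum(1 for d in documents if term in d)
    return math.log(len(documents) / n_containing) if n_containing else 0.0


def toy_tf_idf(documents, document, term):
    return toy_tf(document, term) * toy_idf(documents, term)


score_the = toy_tf_idf(docs, docs[0], "the")        # in all 3 docs: idf = ln(1) = 0
score_coffee = toy_tf_idf(docs, docs[0], "coffee")  # in 2 of 3 docs
score_rubber = toy_tf_idf(docs, docs[1], "rubber")  # in 1 of 3 docs
```

The rarer a term is across documents, the higher its score for the same in-document frequency, so "rubber" outranks "coffee" here and "the" drops to zero.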

#### 2. With your TF-IDF function in place, calculate the TF-IDF for the following words in the first document of the news articles found in the Brown corpus: 

- *the*
- *nevertheless*
- *highway*
- *election*

Perform any preprocessing steps you deem necessary. Comment on your findings.

In [9]:
fileids = brown.fileids(categories='news')
first_doc = list(brown.words(fileids[0]))
all_docs = [list(brown.words(fileid)) for fileid in fileids]

# TODO: preprocess and calculate tf-idf scores.
terms = ["the", "nevertheless", "highway", "election"]
tf_idf_list = []
for term in terms:
    p = tf_idf(all_docs, first_doc, term)
    tf_idf_list.append([term, p])
    
print(tf_idf_list)

[['the', 0.0], ['nevertheless', 0.0010695340199814321], ['highway', 0.008384942647971036], ['election', 0.009251767873746217]]

*the* scores 0.0 because it occurs in every news document, so its IDF is $\log(1) = 0$ no matter how often it appears in the first document. *nevertheless*, *highway*, and *election* occur in fewer documents and get non-zero scores, with *election* the highest of the four. No preprocessing was applied here, and `idf` matches tokens case-sensitively while `tf` lowercases them, so capitalized occurrences are counted by `tf` but not by `idf`.


#### 3. While TF-IDF is primarily used for information retrieval and text mining, reflect on how TF-IDF could be used in a language modeling context.

I would assume the best use case for TF-IDF in language modelling is adjusting probabilities in word generators. TF-IDF could not drive the model on its own, since it has no sense of context, but it could be a useful tool for finding which words are used more often in a document than in the corpus overall. Combined with the context derived from n-gram models, I would expect at least a word predictor or generator to perform better.

#### 4. You were previously introduced to word representations. TF-IDF can be considered one. What are some differences between the TF-IDF output and one that is computed once from a vocabulary (e.g. one-hot encoding)?

First, TF-IDF cannot represent words that do not occur in the documents: it has no information about a word that is absent from the corpus, which could cause problems in certain language models. It also does not give a word a single fixed representation the way one-hot encoding does. Each score is an "adjusted bag-of-words" weight that depends on the document, so a word like "they" has no one unique representation. On the positive side, TF-IDF accounts for how often words appear relative to the rest of the corpus, so stopwords and grammatical filler are down-weighted without having to be removed from the word list. Because the TF-IDF output is built up iteratively from the documents, I would also expect it to be smaller, which could be beneficial when analysing a large corpus.
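
The contrast can be made concrete with a small sketch. A one-hot vector depends only on the vocabulary, while a TF-IDF vector depends on the document and the corpus it is scored against. The two-document corpus and the helper names `one_hot` and `tfidf_vector` are assumptions made for illustration.

```python
import math

# Hypothetical two-document corpus, already tokenized and lowercased.
corpus = [
    ["coffee", "price", "rose"],
    ["coffee", "harvest", "fell"],
]
vocab = sorted({word for doc in corpus for word in doc})


def one_hot(word):
    """Fixed per-word vector: a single 1 at the word's vocabulary index."""
    return [1 if entry == word else 0 for entry in vocab]


def tfidf_vector(doc):
    """Per-document vector: one TF-IDF weight for each vocabulary entry."""
    vector = []
    for entry in vocab:
        tf = doc.count(entry) / len(doc)
        df = sum(1 for d in corpus if entry in d)  # >= 1, vocab comes from corpus
        vector.append(tf * math.log(len(corpus) / df))
    return vector


price_one_hot = one_hot("price")
doc0_vector = tfidf_vector(corpus[0])
doc1_vector = tfidf_vector(corpus[1])
```

"coffee" occurs in both documents and gets weight 0 in both TF-IDF vectors, whereas its one-hot vector is the same non-zero vector everywhere.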

# TF-IDF - Practical Example
You will again be looking at specific words for a document, but this time weighted by their TF-IDF scores. Ideally, the scoring should be able to retrieve representative words for this document in context of its document collection or category.

You will do the following:
- Select a category from the Reuters (news) corpus
- Perform preprocessing
- Calculate TF-IDF scores
- Find the top 5 words for a subset of documents in your collection (e.g. 5, 10, ..)
- Inspect whether these words make sense for a given document, and comment on your findings.

In [10]:
import nltk; nltk.download("reuters")
from nltk.corpus import reuters

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\marcu\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


In [11]:
def high_tf_idf(corpus, n, N):
    # NOTE: idf is computed against the global `all_docs`, looked up at call
    # time. In the next cell that is the raw, mixed-case Reuters token lists,
    # while the terms scored here come from the cleaned `corpus`.
    for document in corpus[:N]:
        tf_idf_list = [[term, tf_idf(all_docs, document, term)] for term in set(document)]
        tf_idf_list.sort(key=lambda x: x[1], reverse=True)
        print(tf_idf_list[:n])
        print(document)
        print()
    

In [12]:
fileids = reuters.fileids(categories='coffee')
all_docs = [list(reuters.words(fileid)) for fileid in fileids]
corpus = []

for doc in all_docs:
    corpus.append(clean_document(doc))

high_tf_idf(corpus, 5, 5)


[['rubber', 0.10114178009723472], ['exchange', 0.10103927101919714], ['trading', 0.07715828624585008], ['palm', 0.04047353123966693], ['april', 0.03752451660175431]]
['indonesian', 'commodity', 'exchange', 'may', 'expand', 'indonesian', 'commodity', 'exchange', 'likely', 'start', 'trading', 'least', 'one', 'new', 'commodity', 'possibly', 'two', 'calendar', '1987', 'exchange', 'chairman', 'paian', 'nainggolan', 'said', 'told', 'reuters', 'telephone', 'interview', 'trading', 'palm', 'oil', 'sawn', 'timber', 'pepper', 'tobacco', 'considered', 'trading', 'either', 'crude', 'palm', 'oil', 'cpo', 'refined', 'palm', 'oil', 'may', 'also', 'introduced', 'said', 'question', 'still', 'considered', 'trade', 'minister', 'rachmat', 'saleh', 'decision', 'go', 'ahead', 'made', 'fledgling', 'exchange', 'currently', 'trade', 'coffee', 'rubber', 'physicals', 'open', 'outcry', 'system', 'four', 'day', 'week', 'several', 'factor', 'make', 'u', 'move', 'cautiously', ',"', 'nainggolan', 'said', 'want', 'move

In general the scores are quite low, perhaps due to some coding or mathematical flaw of mine. The first text seems to be about trading and exchanges more generally, as "rubber" and "palm" are relatively rare in the coffee category. The third is unique in that it has numbers and mentions of cents and pounds, perhaps related to business deals or something similar. The mention of Colombia is not surprising, as it is a major coffee-producing country.

Apart from that the output seems reasonable, although the absence of "indonesian" among the top terms for the first text is a little strange.
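
One possible cause of the missing "indonesian" is a vocabulary mismatch: `high_tf_idf` takes its terms from the cleaned, lowercased `corpus` but computes IDF against `all_docs`, which holds the raw Reuters tokens. A term that only occurs capitalized in the raw text is then found in zero documents, and `idf` returns 0. The toy data and the helper `toy_idf` below are made up to show the effect; whether this fully explains the output above would need a rerun.

```python
import math

# Hypothetical raw documents with mixed case, plus a lowercased copy.
raw_docs = [
    ["INDONESIAN", "exchange", "trades", "coffee"],
    ["Coffee", "prices", "rose"],
]
clean_docs = [[token.lower() for token in doc] for doc in raw_docs]


def toy_idf(documents, term):
    """Same logic as idf() above: 0.0 when no document contains the term."""
    n_containing = sum(1 for d in documents if term in d)
    return math.log(len(documents) / n_containing) if n_containing else 0.0


idf_against_raw = toy_idf(raw_docs, "indonesian")      # never matches "INDONESIAN"
idf_against_clean = toy_idf(clean_docs, "indonesian")  # matches one of two documents
```

Scoring against the cleaned documents, for example by passing `corpus` to `tf_idf` instead of `all_docs`, would keep both sides of the comparison in the same vocabulary.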

# Part-of-speech tagging

#### 1. Briefly describe your understanding of POS tagging and its possible use-cases in context of text generation applications/language modeling.

My understanding is that the purpose of POS tagging is to help a language model capture grammatical structure. When predicting the next word in a sentence, knowing the grammatical categories of the previous words is useful information. Likewise, when generating text, as a chatbot does, I would assume the model needs to understand grammatical structure and how different words fall into certain categories. Tagging is also beneficial for disambiguating words that are spelled the same but have different meanings in different contexts. Lastly, I would assume it is a helpful addition in translation: because the grammatical structures of languages differ considerably, tagging the words should improve the handling of grammar and, by extension, the accuracy of the translation.

#### 2. Train a UnigramTagger (NLTK) using the Brown corpus. 
Hint: the taggers in nltk require a list of sentences containing tagged words.

In [13]:
# TODO: train a unigram tagger on the brown corpus

from nltk.tag import UnigramTagger
from nltk.corpus import brown

tagged_sents = brown.tagged_sents()

unigram_tagger = UnigramTagger(tagged_sents)

#### 3. Use this tagger to tag the text given below. Print out the POS tags for all variants of "justify"

In [14]:
text = """Imagine a situation where you have to explain why you did something – that's when you justify your actions. So, let's say you made a decision; you, as the justifier, need to give good reasons (justifications) for your choice. You might use justifying words to make your point clear and reasonable. Justifying can be a bit like saying, "Here's why I did what I did." When you justify things, you're basically providing the why behind your actions. So, being a good justifier involves carefully explaining, giving reasons, and making sure others understand your choices"""

# TODO: use your trained tagger

text_words = nltk.word_tokenize(text)

tags = unigram_tagger.tag(text_words)
just_tags = [tag for word, tag in tags if word.startswith("justi")]
print(just_tags)

['VB', None, 'NNS', 'VBG', 'VB', None]
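
The `None` entries belong to "justifier", a word the unigram tagger never saw during training: it only memorizes the most frequent tag per word and has nothing to say about unseen words. A common remedy is a fallback tag. The sketch below shows the idea in plain Python. It is not NLTK's implementation, and `train_unigram`, `tag_with_backoff`, and the tiny training set are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_with_backoff(model, words, default="NN"):
    """Tag known words from the model and unseen words with a default tag."""
    return [(word, model.get(word, default)) for word in words]


# Hypothetical training data in the (word, tag) format NLTK taggers expect.
train = [
    [("you", "PPSS"), ("justify", "VB"), ("actions", "NNS")],
    [("they", "PPSS"), ("justify", "VB"), ("it", "PPO")],
]
model = train_unigram(train)
tagged = tag_with_backoff(model, ["you", "justify", "justifier"])
```

In NLTK the same idea is available through the `backoff` argument, for example `UnigramTagger(tagged_sents, backoff=nltk.DefaultTagger("NN"))`.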


#### 4. Your results may be disappointing. Repeat the same task as above using both the default NLTK pos-tagger and with spaCy. Compare the results

In [15]:
# TODO: use the default NLTK tagger

tags = nltk.pos_tag(text_words)
just_tags = [tag for word, tag in tags if word.startswith("justi")]
print(just_tags)

['VBP', 'NN', 'NNS', 'VBG', 'VBP', 'NN']


In [18]:
# TODO: use spacy to fetch pos tags from the document
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
just_tags_spacy = [token.tag_ for token in doc if token.text.startswith("justi")]
print(just_tags_spacy)

['VBP', 'NN', 'NNS', 'VBG', 'VBP', 'NN']


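They produce exactly the same results, and both are an improvement on the unigram tagger. Perhaps the methods I used were not the ones intended, or both were simply meant to arrive at the same answer. Note that the `startswith("justi")` filter is case-sensitive, so the capitalized "Justifying" is missing from all three outputs.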

#### 5. Finally, explore more features of what the spaCy *document* includes related to topics covered in this lab.

In [17]:
# TODO
nlp = spacy.load("en_core_web_sm")

doc = nlp(text[:100])

print(f"{'Token':<15} {'Lemma':<15} {'POS Tag':<10}")

for token in doc:
    print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<10}")

Token           Lemma           POS Tag   
Imagine         imagine         VERB      
a               a               DET       
situation       situation       NOUN      
where           where           SCONJ     
you             you             PRON      
have            have            VERB      
to              to              PART      
explain         explain         VERB      
why             why             SCONJ     
you             you             PRON      
did             do              VERB      
something       something       PRON      
–               –               PUNCT     
that            that            PRON      
's              be              AUX       
when            when            SCONJ     
you             you             PRON      
justify         justify         VERB      
your            your            PRON      
a               a               PRON      


I am not sure this is what was intended, but I used more of the spaCy library to lemmatize and tag part of the text from earlier.
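
The last row of the table ("a" tagged as PRON) is an artifact of `text[:100]`, which slices characters rather than tokens and cuts "actions" down to "a". The stdlib sketch below reproduces the effect on the first sentence of the text; whitespace splitting stands in for a real tokenizer, and the variable names are illustrative.

```python
# First sentence of the lab text, copied verbatim.
sentence = (
    "Imagine a situation where you have to explain why you did something "
    "– that's when you justify your actions."
)

# Character slice: stops after 100 characters, in the middle of a word.
char_slice = sentence[:100]
last_of_char_slice = char_slice.split()[-1]

# Token slice: split first, then keep whole tokens only.
token_slice = sentence.split()[:19]
last_of_token_slice = token_slice[-1]
```

With spaCy, the equivalent would be to process the text first and slice the document (for example `doc[:20]`), which keeps whole tokens.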