### 5.3 Preprocessing NLP data

Learning goal: preprocessing text data, detecting errors and special cases; how to increase the frequency of important terms and collocations despite different spellings and grammatical forms.

In this task, we will practise preprocessing text data with the Python Natural Language Toolkit (nltk). The idea is that you can use the results in the homework task, even if you use another programming language for the actual processing.

Install the packages nltk, scikit-learn and numpy with the command `pip3 install scikit-learn nltk numpy`. The book “Natural Language Processing with Python” (https://www.nltk.org/book/) contains further examples.

Load the example data acmdocuments.txt from MyCourses. Each line is treated as one document; the lines are sentences from scientific abstracts in the ACM Digital Library (https://dl.acm.org/). MyCourses also has a code skeleton, preprocSbS.py, that you can use as a starting point unless you already know how to do all the preprocessing steps.

In [4]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\springnuance\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\springnuance\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\springnuance\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\springnuance\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

(a) The main tasks of preprocessing text data are tokenization, lowercasing, removing punctuation, stemming (or lemmatization), and stop-word removal. However, the order of these steps depends on the library and your implementation. The stopword list may contain stopwords only in their normal (unreduced) form, in a stemmed form, or with punctuation characters (like apostrophes). Lemmatization tools often require full sentences so that they can use part-of-speech analysis. The first task is always to determine the right order for the preprocessing steps. Do a quick sanity check of the example code: is it performing the steps as desired? What would happen if you skipped lowercasing or performed stemming before stopword removal?

The order in which we apply the five preprocessing steps can greatly affect the outcome of the text analysis. The order used here is:

1. Tokenization: This should be the first step. Tokenization involves splitting the text into individual words (tokens), which provides a basis for all subsequent preprocessing steps.

2. Lowercasing: After tokenization, converting all tokens to lowercase is recommended. This standardizes the tokens, so that words like "Computer" and "computer" are treated as the same word. We should do this before removing stop words or stemming.

3. Removing Punctuation: Punctuation can be attached to the end of words and might affect later stages of stemming or stop-words removal. For instance, "end." and "end" would be treated differently if punctuation is not removed.

4. Stemming/Lemmatization: This step reduces words to their base or root form. Stemming is a more crude method that chops off word endings based on common patterns, while lemmatization involves looking up words in a dictionary to find their base form. 

5. Removing Stop Words: Stop words are common words that usually do not contribute to the deeper meaning of the text (like "the", "is", "at", etc.). 

Quick sanity check of the example code: is it performing the steps as desired?

In [8]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import string
import time

with open("acmdocuments.txt", 'r', encoding='utf-8') as file:
    documents = file.readlines()

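# Guess the WordNet part of speech of a single English word.
# Tagging one word without its sentence context is only a rough guess;
# tags other than adjective, noun, verb and adverb fall back to NOUN.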
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Function to preprocess a single document
def preprocess_text(text, baseform_method, stopwords_check=True, printing=True, print_result_only=False):
    # Tokenization
    if printing:
        print("\n")
        print(text) 
    tokens = word_tokenize(text, language='english')
    if printing and not print_result_only:
        print(f"1. Tokenization: {tokens}")

    # Lowercasing
    tokens = [token.lower() for token in tokens]
    if printing and not print_result_only:
        print(f"2. Lowercasing: {tokens}")

    # Removing punctuation
    table = str.maketrans('', '', string.punctuation)
    tokens = [token.translate(table) for token in tokens]
    if printing and not print_result_only:
        print(f"3. Punctuation: {tokens}")

    # Porter Stemming
    if baseform_method == "PorterStemmer":
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(token) for token in tokens]
        if printing and not print_result_only:
            print(f"4. Stemming: {tokens}")

    # Snowball stemmer    
    if baseform_method == "SnowballStemmer":
        stemmer = SnowballStemmer("english")
        tokens = [stemmer.stem(token) for token in tokens]
        if printing and not print_result_only:
            print(f"4. Stemming: {tokens}")
    
    # Lemmatization
    if baseform_method == "Lemmatizer":
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens]
        if printing and not print_result_only:
            print(f"4. Lemmatization: {tokens}")

    # Removing stop words and empty tokens
    if stopwords_check:
        stop_words = set(stopwords.words('english'))
        stop_words.add('')
        tokens = [token for token in tokens if token not in stop_words]
        if printing and not print_result_only:
            print(f"5. Stopwords: {tokens}")

    # Print the final tokens even when stopword removal is disabled
    if printing and print_result_only:
        print(tokens)

    return tokens

# Function to preprocess a list of documents
def preprocess_documents(documents, baseform_method, stopwords_check):
    preprocessed_documents = []
    for doc in documents:
        preprocessed_doc = preprocess_text(doc, baseform_method, stopwords_check=stopwords_check, printing=True, print_result_only=False)
        preprocessed_documents.append(preprocessed_doc)
    return preprocessed_documents

# Example usage with a list of documents
# documents = ['Your first document text.', 'Your second document text.', ...]

# Write one result file per base-form method, with and without stopword removal
methods = [("PorterStemmer", "PortStemmer"),
           ("SnowballStemmer", "SnowballStemmer"),
           ("Lemmatizer", "Lemmatizer")]
for stopwords_check, suffix in [(True, "stopwords_removed"), (False, "stopwords_not_removed")]:
    for method, label in methods:
        with open(f"preprocessed_{label}_{suffix}.txt", "w", encoding="UTF-8") as file:
            preprocessed_documents = preprocess_documents(documents, method, stopwords_check=stopwords_check)
            for index, doc in enumerate(preprocessed_documents):
                file.write("\n" + "=" * 100 + "\n")
                file.write("\n" + documents[index] + "\n")
                file.write("| " + " | ".join(doc) + " |" + "\n")




Formulation of Low-Order Dominant Poles for Y-Matrix of Interconnects: This paper presents an efficient approach to compute the dominant poles for the reduced-order admittance (Y parameter) matrix of lossy interconnects.

1. Tokenization: ['Formulation', 'of', 'Low-Order', 'Dominant', 'Poles', 'for', 'Y-Matrix', 'of', 'Interconnects', ':', 'This', 'paper', 'presents', 'an', 'efficient', 'approach', 'to', 'compute', 'the', 'dominant', 'poles', 'for', 'the', 'reduced-order', 'admittance', '(', 'Y', 'parameter', ')', 'matrix', 'of', 'lossy', 'interconnects', '.']
2. Lowercasing: ['formulation', 'of', 'low-order', 'dominant', 'poles', 'for', 'y-matrix', 'of', 'interconnects', ':', 'this', 'paper', 'presents', 'an', 'efficient', 'approach', 'to', 'compute', 'the', 'dominant', 'poles', 'for', 'the', 'reduced-order', 'admittance', '(', 'y', 'parameter', ')', 'matrix', 'of', 'lossy', 'interconnects', '.']
3. Punctuation: ['formulation', 'of', 'loworder', 'dominant', 'poles', 'for', 'ymatrix'

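On inspection, the steps shown are applied in the intended order. Note that punctuation removal also deletes hyphens inside tokens, so "low-order" becomes "loworder" and "y-matrix" becomes "ymatrix".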

What would happen if you skipped lowercasing or performed stemming before stopword removal?

- If we skip lowercasing before stopword removal, capitalized stopwords are not recognized. The NLTK stopword list contains only lowercase entries, so "the" is removed while "The" remains in the text. In addition, capitalized and lowercase variants of the same word (such as "Computer" and "computer") are counted as different terms.

- Performing stemming before stopword removal may lead to some stopwords not being removed. Stemming can transform stopwords into different variations, and if these variations are not included in the list of stopwords, they might not be removed as expected. This can result in leftover stopwords in the text. For example, "this" is a stopword, but the Porter Stemmer may reduce it to "thi", which is not a stopword and will not be removed.
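
Both effects can be reproduced on a toy example. The sketch below assumes a tiny hand-written stopword set (`STOP_WORDS`, not the NLTK list) and an invented token list; only `PorterStemmer` comes from NLTK.

```python
from nltk.stem import PorterStemmer

# Tiny illustrative stopword set (an assumption, not the NLTK list)
STOP_WORDS = {"the", "this", "an", "of"}
tokens = ["The", "paper", "presents", "this", "approach"]
stemmer = PorterStemmer()

# Case 1: stopword removal without lowercasing; "The" is not matched
no_lowercase = [t for t in tokens if t not in STOP_WORDS]

# Case 2: lowercase, stem, then remove stopwords; "this" has become "thi"
stemmed = [stemmer.stem(t.lower()) for t in tokens]
stem_first = [s for s in stemmed if s not in STOP_WORDS]

# Case 3: lowercase, remove stopwords, then stem
lowered = [t.lower() for t in tokens]
stop_first = [stemmer.stem(t) for t in lowered if t not in STOP_WORDS]

print("no lowercasing:  ", no_lowercase)
print("stem, then stop: ", stem_first)
print("stop, then stem: ", stop_first)
```

Only the third ordering removes both "The" and "this". An alternative that keeps the original order is to stem the stopword list itself before comparing.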


All results below are reported with the Porter Stemmer, unless otherwise specified.

(b) Check stopword removal and search examples of two types of errors: 

- (i) Stopwords (or other common and useless words in this context) that remain in the text

Some common but uninformative words that remain in the text are "also", "let", "thi" and "paper".

The word "paper" is uninformative because it is common across scientific abstracts and does not help to distinguish one document from another.

The token "thi" is the Porter stem of the stopword "this". It is not removed because stopword removal runs after stemming and "thi" is not in the stopword list.

- (ii) important words that are removed as stopwords (Hint: look for important computer science abbreviations and notations in the NLTK stopword list). Estimate how serious these errors are (assuming we had a larger corpus of similar documents). How could you fix the (most serious) errors?

The notation O(n2) is reduced to n2: after lowercasing, "O" becomes "o", which is in the NLTK stopword list. This is a serious error because O(n2) is standard computer science notation for the running time of algorithms, and n2 alone no longer carries that meaning.

The symbol "A", used for a matrix or other notation, is removed because after lowercasing it matches the article "a".

"IT" (information technology) is removed because after lowercasing it matches the pronoun "it".

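One possible fix is to check a list of protected, case-sensitive terms before lowercasing, and to remove stopwords only from the remaining tokens. The sketch below is a minimal illustration: `STOP_WORDS` is a small hand-written subset, and `PROTECTED` and `remove_stopwords_case_aware` are hypothetical names introduced here.

```python
# Illustrative subset of a lowercase stopword list (an assumption, not the full NLTK list)
STOP_WORDS = {"a", "an", "is", "it", "o", "of", "the"}
# Hypothetical list of case-sensitive terms that must survive stopword removal
PROTECTED = {"O", "IT"}

def remove_stopwords_case_aware(tokens):
    """Drop stopwords, but keep protected tokens in their original case."""
    kept = []
    for token in tokens:
        if token in PROTECTED:
            kept.append(token)            # keep as-is, do not lowercase
        elif token.lower() not in STOP_WORDS:
            kept.append(token.lower())    # ordinary token: lowercase and keep
    return kept

print(remove_stopwords_case_aware(["The", "IT", "sector", "is", "growing"]))
print(remove_stopwords_case_aware(["It", "is", "an", "O", "(", "n2", ")", "algorithm"]))
```

The symbol "A" is deliberately left out of `PROTECTED`: a sentence-initial article "A" has the same form as the matrix symbol, so case alone cannot separate them and some context (for example, the neighbouring tokens) would be needed.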

(c) Check the quality of stemming (with Porter stemmer). Can you find errors where either 

- (i) two words having the same basic forms are reduced to different stems 


- (ii) two words with different roots are reduced to the same stem? 


- Test if the Snowball stemmer would do a better job! 




- Are there errors where lemmatization could help?




In [10]:
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

# Test words: succeed, negative, general, discuss, relation, community
# Create stemmer and lemmatizer instances
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()


# Print the Porter stem, Snowball stem and WordNet lemma of one word
# (get_wordnet_pos is defined in the preprocessing cell above)
def test_words(word):
    porter_stem = porter_stemmer.stem(word)
    snowball_stem = snowball_stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, get_wordnet_pos(word))
    print("Porter", porter_stem)
    print("Snowball", snowball_stem)
    print("Lemmatizer", lemma)

test_words("succeed")
test_words("negative")
test_words("general")
test_words("discuss")
test_words("relation")
test_words("community")



Porter succeed
Snowball succeed
Lemmatizer succeed
Porter neg
Snowball negat
Lemmatizer negative
Porter gener
Snowball general
Lemmatizer general
Porter discuss
Snowball discuss
Lemmatizer discus
Porter relat
Snowball relat
Lemmatizer relation
Porter commun
Snowball communiti
Lemmatizer community
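
To search for case (ii) in a larger vocabulary, one option is to group the distinct words by their stem and inspect the groups that contain more than one word. The sketch below assumes a small made-up word list (`vocabulary`) and a hypothetical helper `group_by_stem`; with the real data, the vocabulary would come from the tokenized documents.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

def group_by_stem(words, stemmer):
    """Map each stem to the set of distinct words that reduce to it."""
    groups = defaultdict(set)
    for word in words:
        groups[stemmer.stem(word)].add(word)
    return groups

# Made-up vocabulary for illustration only
vocabulary = ["general", "generate", "generation",
              "relation", "relate", "community", "communicate"]
groups = group_by_stem(vocabulary, PorterStemmer())

# Stems shared by several words are candidates for over-stemming
for stem, members in sorted(groups.items()):
    if len(members) > 1:
        print(stem, sorted(members))
```

The groups still need manual review: words that share a stem may be genuine variants of one term or unrelated words that were merged by mistake.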


(d) Check punctuation removal. Can you find errors where either 
- (i) punctuation that should be removed has remained 

The typographic double quotation marks “ and ” remain, usually as tokens of their own, because string.punctuation covers only ASCII punctuation.
The percent sign % is also a single token on its own, but it is not removed.

- (ii) punctuation that is important for the meaning of the term has been removed? No need to check hyphenated words, yet.

The fraction 1/3 is turned into 13.

The interval [0,1] is turned into 01.
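
Both kinds of error can be reduced with a different punctuation filter. The sketch below is one possible approach, not the method of the example code: it uses the Unicode character categories from the standard `unicodedata` module to catch non-ASCII punctuation, and it leaves any token that contains a digit untouched. The function name `strip_punctuation` is hypothetical.

```python
import unicodedata

def strip_punctuation(token):
    """Remove Unicode punctuation from a token, except in numeric tokens."""
    # Keep tokens with digits unchanged, e.g. "1/3" or "0,1"
    if any(ch.isdigit() for ch in token):
        return token
    # Unicode categories starting with "P" are punctuation (Po, Pi, Pf, Ps, Pe, Pd, Pc)
    return "".join(ch for ch in token
                   if not unicodedata.category(ch).startswith("P"))

for token in ["end.", "“", "”", "%", "1/3", "0,1", "["]:
    print(repr(token), "->", repr(strip_punctuation(token)))
```

Tokens that become empty still have to be dropped afterwards. Hyphens belong to the punctuation categories as well, so hyphenated words are still merged by this filter, and symbols such as "+" or "$" (Unicode category S) are left unchanged.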

(e) Evaluate collocations/compound words (phrases of multiple consecutive words). Can you find examples of important collocations that occur in different forms: 

- (i) closed (constituent words concatenated together)

underrepresented, interconnect

- (ii) hyphenated (hyphen between words)

Y-Matrix, NSF-funded, L2-norm

- (iii) open (space between words)

Internet of Things, machine learning


Suggest a solution for handling them!

Solution: Use a tokenizer that can handle collocations, such as NLTK's MWETokenizer. Given a list of known multi-word expressions, it merges a collocation such as "Internet of Things" into a single token instead of leaving it as the separate tokens "Internet", "of" and "Things". The collocation is then treated as one term in later analysis. Matching is exact, so each expression must be listed in the same case and form as the tokens it should match.

In [24]:
from nltk.tokenize import MWETokenizer

# Create a tokenizer and add the multi-word expression
tokenizer = MWETokenizer()
tokenizer.add_mwe(('Internet', 'of', 'Things'))
tokenizer.add_mwe(('machine', 'learning'))

# Tokenize a text containing the specified multi-word expression
tokens = tokenizer.tokenize('Internet of Things is used in machine learning'.split())

print(tokens)


['Internet_of_Things', 'is', 'used', 'in', 'machine_learning']
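
To map the hyphenated and open forms of a collocation to the same token, the tokens can be lowercased and split at hyphens before the MWETokenizer is applied. The sketch below illustrates this idea under stated assumptions: the expression list is a hand-picked example and `normalize_collocations` is a hypothetical helper, not part of NLTK.

```python
from nltk.tokenize import MWETokenizer

# Known collocations in lowercase (hand-picked examples)
mwe_tokenizer = MWETokenizer(
    [("internet", "of", "things"), ("machine", "learning")],
    separator="_",
)

def normalize_collocations(tokens):
    """Lowercase tokens, split hyphenated ones, then merge known collocations."""
    parts = []
    for token in tokens:
        # "Machine-learning" -> ["machine", "learning"]; empty parts are dropped
        parts.extend(p for p in token.lower().split("-") if p)
    return mwe_tokenizer.tokenize(parts)

print(normalize_collocations("Machine-learning and machine learning".split()))
# → ['machine_learning', 'and', 'machine_learning']
```

Splitting at hyphens also breaks terms such as "Y-Matrix" into "y" and "matrix", so these would need to be added to the expression list as well. Closed forms (for example a hypothetical "machinelearning") are not covered and would need a separate lookup table.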
