# Natural Language Processing. Lesson 2. Text processing basics

In this lab, we will cover a range of text processing concepts:
- Sentence Segmentation,
- Lowercasing,
- Stop Words Removal,
- Lemmatization,
- Stemming,
- Byte-Pair Encoding (BPE),
- Edit Distance.

These methods help computers work with human language. In other words, they are essential steps toward extracting the meaning hidden within text data.

## Sentence Segmentation

Sentence segmentation is a fundamental step that divides a block of text into individual sentences, which are typically delimited by punctuation marks. This method was covered in the previous lesson, so you should already be familiar with it. This technique is used in:
- Part-of-Speech (POS) tagging: accurate boundaries between sentences are required for assigning grammatical labels like nouns, verbs, and adjectives.
- Sentiment Analysis: understanding the sentiment (positive, negative, neutral) of a sentence also relies on exact boundaries.

Many other tasks also require splitting text into sentences. It can be done with libraries you already know: nltk or spaCy. Let's use nltk here.


First, install the required library and import it:


In [None]:
#!pip install nltk

In [1]:
import nltk

nltk.download("punkt")

text = "This is a sample text. It contains multiple sentences. Can we segment it?"

# tokenize into sentences using nltk.sent_tokenize()
sentences = nltk.sent_tokenize(text)

print(sentences)
# Output: ['This is a sample text.', 'It contains multiple sentences.', 'Can we segment it?']

['This is a sample text.', 'It contains multiple sentences.', 'Can we segment it?']


[nltk_data] Downloading package punkt to /home/anastasia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Task 1

Complete the following code. Split the text into sentences and save them in the `sentences` variable.

In [2]:
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/anastasia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
text = "There exist some challenges for this technique. The main problem is \
language specificity because sentence segmentation rules can differ across \
languages. For example, Japanese omits spaces between words, making \
segmentation more complex. Punctuation ambiguity also may be the problem because \
of their complexity. Certain punctuation marks like ellipses (...) or colons (:) \
might not always indicate sentence boundaries, requiring context-aware approaches."

sentences = nltk.sent_tokenize(text)
print(sentences[:3])

# Output: ['There exist some challenges for this technique.', 'The main problem is
# language specificity because sentence segmentation rules can differ across languages.',
# 'For example, Japanese omits spaces between words, making segmentation more complex.']

['There exist some challenges for this technique.', 'The main problem is language specificity because sentence segmentation rules can differ across languages.', 'For example, Japanese omits spaces between words, making segmentation more complex.']


In [5]:
assert sentences[:3] == [
    "There exist some challenges for this technique.",
    "The main problem is language specificity because \
sentence segmentation rules can differ across \
languages.",
    "For example, Japanese omits \
spaces between words, making segmentation \
more complex.",
]
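
The punctuation ambiguity mentioned in the task text can be seen with a naive splitter. The following is a minimal sketch, not what `nltk.sent_tokenize` does internally: the helper name `naive_sent_split` and the sample sentence are assumptions made for illustration.

```python
import re


def naive_sent_split(text):
    # Split after '.', '!' or '?' whenever whitespace follows.
    # This rule knows nothing about abbreviations or ellipses.
    return re.split(r"(?<=[.!?])\s+", text.strip())


sample = "Dr. Smith is here. He is late."
naive_result = naive_sent_split(sample)
print(naive_result)
# The abbreviation "Dr." is wrongly treated as a sentence boundary.
```

A rule this simple handles plain text well enough, but abbreviations such as "Dr." show why context-aware tokenizers are usually preferred.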

## Lowercasing

Lowercasing, as the name suggests, is the process of converting all characters in a text string to lowercase. This seemingly simple step plays an important role in NLP tasks for several reasons:
- Consistency and focus on word meaning: 'Apple' and 'apple' should usually be treated as the same word
- Improved performance: merging such variants reduces the number of unique word representations. Since many NLP algorithms rely on statistical analysis, this helps to avoid overfitting to specific capitalization patterns
- Compatibility with NLP tools: many NLP libraries and tools work primarily with lowercase text. Lowercasing ensures compatibility and avoids potential errors or inconsistencies.

To apply lowercasing, we can use Python's built-in string method `str.lower()`:

In [6]:
text = "ThIs Is AN ExaMple Text."

# apply the .lower() function
lowercased_text = text.lower()

print(lowercased_text)
# Output: this is an example text.

this is an example text.
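
To make the "fewer unique words" argument concrete, here is a small sketch that compares vocabulary sizes before and after lowercasing. The sample sentence and the simple regex tokenization are assumptions chosen for illustration.

```python
import re

sample = "Apple sells apples. An apple a day keeps the doctor away. APPLE stock rose."

# Very simple tokenization: keep runs of letters only.
tokens = re.findall(r"[A-Za-z]+", sample)

vocab_raw = set(tokens)
vocab_lower = {token.lower() for token in tokens}

print(len(vocab_raw), len(vocab_lower))
# 'Apple', 'apple' and 'APPLE' collapse into a single entry after lowercasing.
```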


Besides `.lower()`, Python strings also have an `.upper()` method. Converting all letters to their capital form is called `Uppercasing`, and it is occasionally applied in NLP as well:
- Emphasis detection: in some cases, uppercase letters can indicate emphasis in text, as in headlines, slogans, or acronyms. Comparing a word with its uppercased form can help identify potential emphasis markers
- Specific NLP libraries: certain NLP libraries might have functionality that works better with uppercase text, though this is less common. (Always refer to the documentation for specific tools)
- Named entity recognition (NER): in NER tasks, proper nouns (names of people, places, organizations) are often capitalized. Uppercasing text can be a preprocessing step to highlight potential named entities, but additional checks are needed for accuracy
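
As a rough illustration of emphasis detection, the sketch below collects words written entirely in capitals. It has to run on the original text, before any lowercasing. The sample sentence and the length filter are assumptions made for the example.

```python
import re

headline = "This offer is FREE today, act NOW!"

tokens = re.findall(r"[A-Za-z]+", headline)

# A token counts as emphasized if it equals its uppercased form;
# single letters such as "I" or "A" are skipped to reduce false hits.
emphasized = [t for t in tokens if len(t) > 1 and t == t.upper()]

print(emphasized)
```

Acronyms would be flagged by the same rule, so a real pipeline would need an extra check to tell them apart from emphasis.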

#### Task 2

Fill in the gaps in the following code. Convert the letters of the two strings as each sentence instructs:

In [7]:
# no additional libraries are required
upper = "aPplY thE uPpErCASiNg"
lower = "appLy ThE LowERcAsINg"

# apply the needed functions:
upper_result = upper.upper()
lower_result = lower.lower()

print(upper_result)
print(lower_result)

# Output:
# APPLY THE UPPERCASING
# apply the lowercasing

APPLY THE UPPERCASING
apply the lowercasing


In [8]:
assert upper_result == "APPLY THE UPPERCASING"
assert lower_result == "apply the lowercasing"

## Stop Words Removal

Stop words are frequently occurring words that are often removed during text processing to focus on meaningful words. Examples include articles ("the", "a", "an"), prepositions ("of", "to", "in"), conjunctions ("and", "but", "or"), and pronouns ("I", "you", "he"). While these words are essential for constructing human language, they often provide little value for NLP tasks, so there are several reasons to remove them:
- Focus on content and improved efficiency: removing stop words keeps most of the meaning while reducing the number of words the algorithms have to process
- Statistical analysis: stop words can skew the results of statistical analysis in NLP tasks that rely on word frequency. Removing them reduces this bias and leads to a more accurate representation of the important words


In [9]:
# use the stop word lists that ship with nltk
from nltk.corpus import stopwords

# download the stop words list
# quiet=True hides messages that .download() might display
nltk.download("stopwords", quiet=True)

# retrieve the English stop words as a set for fast lookup
stop_words = set(stopwords.words("english"))

text = "This is an example sentence with some stop words."

# remove all stop words with a list comprehension
# (text.split() keeps punctuation attached, hence 'words.' in the output)
filtered_words = [word for word in text.split() if word.lower() not in stop_words]

print(filtered_words)
# Output: ['example', 'sentence', 'stop', 'words.']

['example', 'sentence', 'stop', 'words.']


#### Task 3

Fill in the gaps in the following cell to remove the stop words.

In [10]:
from nltk.corpus import stopwords

# download the stop list
nltk.download("stopwords", quiet=True)

True

In [11]:
text = "The quick brown fox jumps over the lazy dog."

# get the stop words
stop_words = set(stopwords.words("english"))

# remove the meaningless words
filtered_words = [word for word in text.split() if word.lower() not in stop_words]

print(filtered_words)

# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog.']

['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog.']


In [12]:
assert filtered_words == ["quick", "brown", "fox", "jumps", "lazy", "dog."]

You should always be careful with this method. Consider whether stop word removal is beneficial for your specific NLP task. Removing stop words can lead to a loss of context: "I don't like it" and "I like it" may become identical once the negation is removed, so the sentiment of the sentence is lost. Prefer a stop word list that is adapted to your domain.
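
The loss of negation can be reproduced with a few lines of code. This is a sketch under stated assumptions: `STOP_WORDS` is a tiny hand-written subset used for illustration (not the full nltk list), and `remove_stop_words` with its `keep` parameter is a hypothetical helper.

```python
import re

# Illustrative subset only, NOT the complete nltk stop word list.
STOP_WORDS = {"i", "it", "the", "a", "do", "don't", "not"}
NEGATIONS = frozenset({"not", "no", "never", "don't"})


def remove_stop_words(text, keep=frozenset()):
    # Lowercase, then keep letter runs and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Drop stop words unless they are explicitly protected via `keep`.
    return [t for t in tokens if t not in STOP_WORDS or t in keep]


negative = remove_stop_words("I don't like it")
positive = remove_stop_words("I like it")
negative_kept = remove_stop_words("I don't like it", keep=NEGATIONS)

print(negative, positive, negative_kept)
```

With plain removal both sentences reduce to the same tokens, while protecting negation words keeps them distinguishable.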

## Lemmatization

Lemmatization reduces words to their base or dictionary form, known as the lemma. This helps to group related words together and can improve the accuracy of NLP models.

`Lemma` is the canonical form of a word, also referred to as its base or dictionary form (runs - run, keeps - keep, apples - apple).

Lemmatization algorithms use a dictionary and morphological analysis and rules to identify the base form of a word.
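
The combination of a dictionary and rules can be sketched with a toy example. This is not how `WordNetLemmatizer` is implemented: the names `IRREGULAR`, `SUFFIX_RULES`, and `toy_lemmatize`, as well as the rules themselves, are assumptions that only handle a few English noun and verb endings.

```python
# Dictionary part: irregular forms that no suffix rule can recover.
IRREGULAR = {"corpora": "corpus", "mice": "mouse"}

# Rule part: (suffix, replacement), checked in order.
SUFFIX_RULES = [("ies", "y"), ("ss", "ss"), ("s", "")]


def toy_lemmatize(word):
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least two characters before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


toy_lemmas = [toy_lemmatize(w) for w in ["runs", "keeps", "apples", "cries", "corpora", "glass"]]
print(toy_lemmas)
```

A real lemmatizer relies on a much larger dictionary and on part-of-speech information, which is why a library is normally used instead.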

Why lemmatize?
- Improved accuracy: grouping words with the same meaning under their base form helps to handle different variations of the same concept
- Reduced vocabulary size and memory usage: lemmatization reduces the number of unique words an NLP model needs to process

We will use nltk's `WordNetLemmatizer` for this.

In [13]:
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer looks words up in the WordNet corpus
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

words = ["rocks", "corpora", "cries"]
# apply the lemmatizer to all words
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print(lemmatized_words)
# Output: ['rock', 'corpus', 'cry']

['rock', 'corpus', 'cry']


#### Task 4

Fill in the gaps. Convert the given words to their base form. The task also uses [Part-of-Speech tagging](https://www.geeksforgeeks.org/nlp-part-of-speech-default-tagging/).

In [14]:
# import the WordNetLemmatizer and declare its instance
from nltk.stem import WordNetLemmatizer
import nltk

lemmatizer = WordNetLemmatizer()

In [15]:
words = ["running", "dogs", "carries", "cooks"]
lemmatized_words = []

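# assign a part-of-speech tag to each word
# (nltk.pos_tag needs a tagger model fetched with nltk.download;
# the resource name depends on the nltk version)
tagged_words = nltk.pos_tag(words)

for i, word in enumerate(words):
    # bring each word to its canonical form; lemmatize() treats the word
    # as a noun by default, so the verb form "running" stays unchanged
    lemmatized_word = lemmatizer.lemmatize(word)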
    lemmatized_words.append((lemmatized_word, tagged_words[i][1]))

print(lemmatized_words)
# Output: [('running', 'VBG'), ('dog', 'NNS'), ('carry', 'VBZ'), ('cook', 'NNS')]

[('running', 'VBG'), ('dog', 'NNS'), ('carry', 'VBZ'), ('cook', 'NNS')]


In [16]:
assert lemmatized_words == [
    ("running", "VBG"),
    ("dog", "NNS"),
    ("carry", "VBZ"),
    ("cook", "NNS"),
]
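
In the cell above the tags are computed but not passed to the lemmatizer, which is why "running" is returned unchanged. One common way to use them is to map the Penn Treebank tags produced by `nltk.pos_tag` to the single-letter part-of-speech codes that WordNet uses. The helper name `penn_to_wordnet` is an assumption; the sketch itself does not need nltk to run.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"  # noun, also used as the fallback


mapped = [penn_to_wordnet(tag) for tag in ["VBG", "NNS", "VBZ", "JJ", "RB"]]
print(mapped)
```

The result can then be passed as the `pos` argument, for example `lemmatizer.lemmatize("running", pos=penn_to_wordnet("VBG"))`, which should reduce the verb to "run". Note that doing so changes the output of the task cell, so the assertion above would no longer hold as written.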

## Stemming

Stemming reduces words to their stem or root form, usually by removing suffixes with heuristic rules (running - run, jumped - jump, books - book). It is similar to lemmatization, so what is the difference?
 - Stemming relies on suffix-removal rules, which can produce a non-word (a naive rule that strips "-ing" turns running into runn, and the Porter stemmer turns beautifully into beauti), while lemmatization uses morphological analysis
 - Stemming is faster and simpler
 - Application: stemming is preferable when computational efficiency is a priority and a general understanding of the core meaning is sufficient.
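
To see how blind suffix removal can yield a wrong stem, here is a deliberately naive sketch. It is not the Porter algorithm: the suffix list and the name `naive_stem` are assumptions made for illustration.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "ly", "s")


def naive_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


naive_stems = {w: naive_stem(w) for w in ["running", "jumped", "books", "beautifully"]}
print(naive_stems)
```

The result for "running" is "runn". The Porter stemmer adds further rules, such as undoing doubled consonants, to avoid this particular mistake.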


In [17]:
# import a simple module for stemming
from nltk.stem import PorterStemmer

# and create its instance
stemmer = PorterStemmer()

words = ["running", "rocks", "beautifully"]

# apply stemming
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)
# Output: ['run', 'rock', 'beauti']

['run', 'rock', 'beauti']


#### Task 5

Fill in the gaps and observe how stemming works.


In [14]:
# import the needed module for stemming
from nltk.stem import PorterStemmer

# and create its instance
stemmer = PorterStemmer()

text = input(
    "If you want to finish the exercise, enter the word 'end'.\nEnter some text: "
)
while text != "end":
    stemmed = stemmer.stem(text)
    print("The stemmed word is:", stemmed)
    text = input("Enter some text: ")

# You may input words running, beautiful, useless, cries, dreamer, end
# Output:
# The stemmed word is: run
# The stemmed word is: beauti
# The stemmed word is: useless
# The stemmed word is: cri
# The stemmed word is: dreamer

The stemmed word is: beauti


## Byte-Pair Encoding (BPE)

BPE is a technique used in Natural Language Processing (NLP) for subword tokenization. Unlike traditional tokenization, which splits text into individual words, BPE breaks text into smaller units whose number is controlled by a target vocabulary size and which often reflect the morphology of the language. This approach is particularly beneficial when dealing with large vocabularies or rare words. The algorithm:
1. Initial vocabulary: BPE starts with the individual characters in the text as the initial vocabulary.
2. Merging frequent pairs: it iteratively analyzes the training text and identifies the most frequent pair of characters or subwords (considering existing merged units).
3. Replacing pairs: this most frequent pair is replaced with a new symbol not present in the vocabulary. The new symbol represents the merged subword.
4. Vocabulary update: the vocabulary is updated to include the newly created symbol.
5. Repeat: steps 2-4 are repeated for a predefined number of iterations or until a desired vocabulary size is reached.
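
The steps above can be sketched in plain Python. This is a simplified illustration, not the implementation used by the `tokenizers` library: the function names are assumptions, the merged symbol is simply the concatenation of the pair, and ties between equally frequent pairs are resolved by first-seen order.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, words):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    # Step 1: every word starts as a sequence of single characters.
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)  # step 2: find frequent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(best, words)  # step 3: replace the pair
        merges.append(best)  # step 4: record the new symbol
    return merges, words


merges, segmented = train_bpe("low low lower lowest", num_merges=3)
print(merges)
print(segmented)
```

On this tiny corpus the shared prefix "low" is assembled first and then extended, which mirrors how frequent words end up as single tokens while rare words stay split into pieces.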

Applications:
- Machine translation: for handling vocabulary differences between languages effectively
- Text classification and summarization: BPE provides a richer representation of words and captures morphological information
- Large Language Models (LLMs): BPE allows LLMs to handle the vast vocabulary encountered in real-world text data

In [16]:
#!pip install tokenizers

In [18]:
# import TemplateProcessing for templates
from tokenizers.processors import TemplateProcessing

# these tokens have specific meanings within the tokenizer's
# vocabulary and are not part of the regular text
# UNK - replaces unknown tokens not encountered during training
# CLS - marks the beginning of a sequence
# SEP - separates sentences
# PAD - pads sequences to a fixed length
# MASK - used in tasks like masked language modeling,
# where certain words are masked and the model predicts them
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

# a TemplateProcessing instance defines which special tokens the tokenizer
# adds around the text after encoding it
temp_proc = TemplateProcessing(
    # single - specifies the format for encoding single sentences
    single="[CLS] $A [SEP]",
    # pair - specifies the format for sentence pairs; the ":1" suffix
    # assigns type id 1 to the second sentence and its separator
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", special_tokens.index("[CLS]")),
        ("[SEP]", special_tokens.index("[SEP]")),
    ],
)

In [8]:
from tokenizers import Tokenizer
from tokenizers.normalizers import Sequence, Lowercase, NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import BPE
from tokenizers.decoders import BPEDecoder

# create the instance of the Tokenizer
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = BPEDecoder()
tokenizer.post_processor = temp_proc

In [9]:
from tokenizers.trainers import BpeTrainer

In [10]:
import nltk
from nltk.corpus import gutenberg

nltk.download("gutenberg", quiet=True)
nltk.download("punkt", quiet=True)

trainer = BpeTrainer(vocab_size=5000, special_tokens=special_tokens)
shakespeare = [" ".join(s) for s in gutenberg.sents("shakespeare-macbeth.txt")]
tokenizer.train_from_iterator(shakespeare, trainer=trainer)






In [11]:
print(
    tokenizer.encode(
        "BPE is a data compression technique used in NLP for tokenization."
    ).tokens
)
print(
    tokenizer.encode(
        "Is this a danger which I see before me, the handle toward my hand?"
    ).tokens
)

['[CLS]', 'b', 'pe', 'is', 'a', 'd', 'at', 'a', 'com', 'pre', 'ss', 'ion', 'te', 'ch', 'ni', 'que', 'use', 'd', 'in', 'n', 'lp', 'for', 'to', 'ken', 'iz', 'ation', '.', '[SEP]']
['[CLS]', 'is', 'this', 'a', 'danger', 'which', 'i', 'see', 'before', 'me', ',', 'the', 'handle', 'toward', 'my', 'hand', '?', '[SEP]']


## Levenshtein edit distance

Edit distance measures how dissimilar two strings are by counting the minimum number of operations needed to transform one string into the other. The [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance#Example) allows three single-character operations: insertion, deletion, and substitution.
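
The distance can be computed with dynamic programming. The following is a minimal pure-Python sketch that keeps only two rows of the table; the function name `levenshtein` is an assumption, and library implementations are considerably faster.

```python
def levenshtein(a, b):
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


kitten_sitting = levenshtein("kitten", "sitting")
print(kitten_sitting)
```

For "kitten" and "sitting" the three edits are k to s, e to i, and the insertion of a final g.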


In [17]:
!pip install python-Levenshtein


In [18]:
import Levenshtein

word1 = "kitten"
word2 = "sitting"
distance = Levenshtein.distance(word1, word2)
print(f"Edit distance between '{word1}' and '{word2}': {distance}")

Edit distance between 'kitten' and 'sitting': 3


# Task


[Competition](https://www.kaggle.com/t/6dcb6f9def724f9f82050e9092952dd6)

The aim of the competition is to find the 10 most frequent words in each of the texts listed in the `data.txt` file and to report their counts.

In order to count the frequent words correctly, you must perform lemmatization and remove stop words.
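
The counting step on its own can be sketched with `collections.Counter`. This is only a hint under stated assumptions: `count_top_words` is a hypothetical helper, the token list is made up, and lemmatization and stop word removal are deliberately left out. The `(word, count)` tuples it returns match the shape that the submission cell at the end of this notebook reads via `count[1]`.

```python
from collections import Counter


def count_top_words(tokens, topk=10):
    # most_common returns (word, count) tuples sorted by descending count.
    return Counter(tokens).most_common(topk)


sample_tokens = ["king", "queen", "king", "sword", "king", "queen"]
top_sample = count_top_words(sample_tokens, topk=2)
print(top_sample)
```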


In [14]:
with open("data.txt") as f:
    data = f.read()
plays = data.split("\n")
plays

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'shakespeare-macbeth.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-caesar.txt']

In [15]:
plays_dict = {}

for play in plays:
    plays_dict[play] = gutenberg.raw(play)
    print(play, len(plays_dict[play]))

austen-emma.txt 887071
austen-persuasion.txt 466292
austen-sense.txt 673022
shakespeare-macbeth.txt 100351
shakespeare-hamlet.txt 162881
shakespeare-caesar.txt 112310


In [16]:
def top_frequent_words(text, topk=10):
    # your implementation
    pass

In [17]:
top_words = {}
for play, text in plays_dict.items():
    top_words[play] = top_frequent_words(text)

In [18]:
with open("submission.csv", "w") as f:
    f.write("id,count\n")
    for play, counts in top_words.items():
        for i, count in enumerate(counts):
            f.write(f"{play}_{i},{count[1]}\n")