# 🧼 Normalization


Before tokenizing text, it's often beneficial to normalize it. Normalization is the process of converting a text to a more uniform or canonical form. This helps in reducing the vocabulary size and consolidating words that have the same meaning but are represented differently. Check out the [notes](./normalization.md) for more explanations.

In this notebook, NLTK is our primary library.


## Normalization Flowchart

Here is a flowchart illustrating a normalization process we'll try out:

```mermaid
graph TD
    A[Start] --> B{Raw Text};
    B --> C{Lowercasing};
    C --> D{Removing Punctuation};
    D --> E{Removing Stop Words};
    E --> F{Stemming/Lemmatization};
    F --> G[Normalized Text];
```
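The flowchart can be sketched end to end with only the standard library. This is a rough illustration, not the NLTK-based code used in the rest of the notebook: `normalize`, `TOY_STOP_WORDS`, and the optional `stem` callable are hypothetical names, the stop-word list is deliberately tiny, and tokenization is a plain whitespace split.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
TOY_STOP_WORDS = {"the", "a", "an", "is", "over"}

def normalize(text, stop_words=TOY_STOP_WORDS, stem=None):
    """Lowercase -> strip punctuation -> drop stop words -> optional stemming."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()  # naive whitespace tokenization
    kept = [t for t in tokens if t not in stop_words]
    if stem is not None:
        # `stem` is any callable mapping a token to its reduced form.
        kept = [stem(t) for t in kept]
    return kept

pipeline_result = normalize("The quick Brown Fox jumps over a lazy dog!")
print(pipeline_result)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

Each stage is covered on its own in the sections that follow, where the toy pieces are swapped for NLTK components.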


## Lowercasing

Lowercasing converts all characters to lowercase, so that "The" and "the" become the same word. Unifying the casing lets us consolidate words even if they're typed "weirdly" or "WeIRdLY" (heh, get it?).


In [1]:
text = "The quick Brown Fox"
lower_text = text.lower()
print(lower_text)

the quick brown fox
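For text that isn't plain English, `str.lower()` may not be enough to make two spellings compare equal. A minimal sketch, assuming you want caseless matching: Python's built-in `str.casefold()` is a more aggressive variant intended for exactly that. The German word below is just an illustrative input.

```python
word = "Straße"

print(word.lower())     # straße
print(word.casefold())  # strasse

# Compare against the all-caps spelling of the same word.
matches_lower = word.lower() == "STRASSE".lower()
matches_casefold = word.casefold() == "STRASSE".casefold()
print(matches_lower, matches_casefold)  # False True
```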


## Removing Punctuation

This step is straightforward: remove punctuation marks from the text. It is especially useful when your task doesn't inherently concern punctuation, since it lets your model focus _only_ on the semantically significant data points, namely the words.


In [11]:
import string
import re

text = "Hello, world! This is a test."
no_punct_text = "".join(c for c in text if c not in string.punctuation)
print(no_punct_text)

# Alternatively, use a regex. Note that \w matches "_", so underscores are kept.

text = "Hello, world! This is a test."
no_punct_text = re.sub(r"[^\w\s]", "", text)
print(no_punct_text)

Hello world This is a test
Hello world This is a test
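The two approaches above agree on this sentence, but they are not equivalent in general. The sketch below uses a made-up sample string to show the difference, and also shows `str.translate`, a standard-library alternative to the character-by-character join.

```python
import re
import string

sample = "snake_case, rocks!"

# Build a translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)
via_translate = sample.translate(table)

# The regex keeps anything matched by \w, and \w includes the underscore.
via_regex = re.sub(r"[^\w\s]", "", sample)

print(via_translate)  # snakecase rocks
print(via_regex)      # snake_case rocks
```

Which behavior is preferable depends on the data: for source code or identifiers, keeping underscores is probably what you want.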


## Removing Stop Words

Stop words are common words that don't carry much meaning, such as "the", "a", and "is". These words are essentially noise: they add little information and they take up space. Each language has its own stop word list, because the semantically insignificant words differ from language to language depending on syntactic structure and vocabulary.


In [6]:
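# Requires NLTK data: the "stopwords" corpus and the Punkt tokenizer models
# (fetch them once with nltk.download).
from nltk.corpus import stopwords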
from nltk.tokenize import word_tokenize

text = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words("english"))
print(f"EN Stop Words: {stop_words}")
stop_words_id = set(stopwords.words("indonesian"))
print(f"ID Stop Words: {stop_words_id}")

word_tokens = word_tokenize(text)
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

print(filtered_sentence)

EN Stop Words: {'am', "you'll", 'won', "don't", "hadn't", 'of', 'on', 'any', "it'll", 'weren', 'it', 'mightn', 's', 'the', "he'll", 'so', 'their', 'yourself', 'mustn', 'you', 'below', 'if', 'him', 'has', 'theirs', 'at', "haven't", 'which', 'me', 'doing', "he's", 'both', 'no', 'not', 'who', 'his', 'by', "mustn't", 'why', 'being', 'own', "she's", 'd', "mightn't", 'through', 'to', 'itself', 'too', "we'd", 'about', 'himself', 'further', 'ain', "shouldn't", 'hasn', 'for', "weren't", 'your', 'her', 'and', 'until', "we're", "couldn't", "i've", 'had', 've', "you'd", 'while', 'what', 'shan', 'o', 'very', 'where', "he'd", "they've", 'during', "didn't", 'herself', "wouldn't", 'only', "they're", "that'll", "i'm", "i'll", 'hers', 'before', 'having', 'we', 'but', 'here', 'is', 'an', 'them', 't', 'other', 'over', 'each', 'with', "isn't", "they'd", 'i', 'nor', 'whom', 'she', 'yours', 'then', 'are', 'ours', 'our', "it's", 'a', 'in', 'now', 'when', 'above', 'they', "i'd", 'or', 'same', "it'd", 'isn', 't
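Note that the English list printed above contains negations such as 'no', 'not', and 'nor'. For tasks like sentiment analysis, dropping them can flip the meaning of a sentence. The sketch below uses a small hypothetical stop-word set (not the NLTK list) to show how one might keep negations by subtracting them from the set before filtering.

```python
# Hypothetical toy stop-word set; the NLTK list could be used the same way.
toy_stop_words = {"this", "is", "a", "the", "not", "no"}
negations = {"not", "no", "nor"}

# Set difference: keep everything except the negation words.
custom_stop_words = toy_stop_words - negations

tokens = "this movie is not good".split()
default_filtered = [t for t in tokens if t not in toy_stop_words]
custom_filtered = [t for t in tokens if t not in custom_stop_words]

print(default_filtered)  # ['movie', 'good']
print(custom_filtered)   # ['movie', 'not', 'good']
```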

## Stemming

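Stemming reduces words to their "root" form. For example, "running" and "runs" are both stemmed to "run". Because stemmers work by rule-based suffix stripping, the result is not always a real word, and irregular forms such as "ran" are usually left unchanged. NLTK ships several stemmers (for example Porter, Snowball, and Lancaster), and each has its own rules for what constitutes a "root".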


In [30]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["program", "programming", "programmer", "programs", "programmed"]

stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['program', 'program', 'programm', 'program', 'program']
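Notice that "programmer" became "programm", which is not a real word. To see why rule-based stemming behaves this way, here is a toy suffix stripper. It is not the Porter algorithm, only a rough sketch of the general idea; `SUFFIXES` and `toy_stem` are hypothetical names.

```python
# Hypothetical rule list, checked in order. Real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

toy_stems = [toy_stem(w) for w in ["jumping", "jumped", "jumps", "ran", "running"]]
print(toy_stems)  # ['jump', 'jump', 'jump', 'ran', 'runn']
```

The irregular form "ran" is untouched because no suffix rule applies, and "running" becomes "runn" because this toy version has no rule for doubled consonants. Real stemmers add rules for such cases but still make similar trade-offs.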


## Lemmatization

Lemmatization is similar to stemming, but it reduces words to their dictionary form (lemma). A key difference is that lemmatization _needs_ context, such as the word's part of speech, because it works backward from the word's meaning to its "core" form.

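Here, we'll use NLTK's WordNet-based lemmatizer. WordNet is a hand-curated lexical database, so this lemmatizer relies on dictionary lookup and morphological rules rather than a trained model. Lemmatization can also be done with a learned function.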


In [26]:
import nltk

nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kek\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [29]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["program", "programming", "programmer", "programs", "programmed"]

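# pos="v" tells the lemmatizer to treat every word as a verb.
lemmatized_words = [lemmatizer.lemmatize(word, pos="v") for word in words]
print(lemmatized_words)

# pos="a" treats every word as an adjective, so verb and noun inflections are left as-is.
lemmatized_words = [lemmatizer.lemmatize(word, pos="a") for word in words]
print(lemmatized_words)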

['program', 'program', 'programer', 'program', 'program']
['program', 'programming', 'programer', 'programs', 'programmed']
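The two outputs differ only because of the `pos` argument, which is the "context" a lemmatizer depends on. The idea can be sketched with a toy lookup table keyed by word and part of speech. This is a simplified illustration and not how WordNet is implemented; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names with hand-picked entries.

```python
# Hypothetical lemma table keyed by (word, part-of-speech tag).
TOY_LEMMAS = {
    ("meeting", "v"): "meet",     # "We are meeting today."
    ("meeting", "n"): "meeting",  # "The meeting ran long."
    ("better", "a"): "good",      # irregular adjective form
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("meeting", "v"))  # meet
print(toy_lemmatize("meeting", "n"))  # meeting
print(toy_lemmatize("better", "a"))   # good
print(toy_lemmatize("better", "v"))   # better
```

In a real pipeline, the part-of-speech tags would typically come from a tagger run on the full sentence rather than being fixed for every word as in the cell above.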
