# Text Preprocessing

Text preprocessing is an essential step in natural language processing (NLP) tasks. It involves transforming raw text data into a format that is more suitable for analysis and machine learning algorithms. In this tutorial, we will cover various common techniques for text preprocessing. Let's dive in!

## Lowercasing
Converting all text to lowercase can help to normalize the data and reduce the vocabulary size. It ensures that words in different cases are treated as the same word. For example, "apple" and "Apple" will both be transformed to "apple".

In [1]:
sentence = "Converting Text to Lowercase."
lowercased_sentence = sentence.lower()
print(lowercased_sentence)

converting text to lowercase.


## Removal of Punctuation and Special Characters
Punctuation marks and special characters often add little meaning to the text and can usually be removed, although tasks that rely on them (for example, detecting questions or emphasis) may need to keep them. Common punctuation marks include periods, commas, question marks, and exclamation marks. You can use regular expressions or string operations to remove them.

In [2]:
import re

sentence = "Remove all the punctuation marks! @Special characters?"
cleaned_sentence = re.sub(r'[^\w\s]', '', sentence)
print(cleaned_sentence)

Remove all the punctuation marks Special characters
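
The string-operation route mentioned above can be sketched with `str.translate`. This is a minimal example rather than a drop-in replacement: `string.punctuation` covers ASCII punctuation only, and, unlike the `\w`-based pattern, it also removes underscores. The name `punctuation_table` is just an illustrative choice.

```python
import string

# Translation table that deletes every ASCII punctuation character.
punctuation_table = str.maketrans("", "", string.punctuation)

sentence = "Remove all the punctuation marks! @Special characters?"
cleaned_sentence = sentence.translate(punctuation_table)
print(cleaned_sentence)
```

For long texts this avoids compiling a regular expression, but non-ASCII punctuation such as curly quotes would pass through untouched.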


## Stop Word Removal
Stop words are commonly occurring words in a language, such as "a," "an," "the," "is," and "in." These words provide little semantic value and can be removed to reduce noise in the data. Libraries like NLTK provide a list of predefined stop words for different languages.

Before running the code, make sure you have downloaded the stop word list by uncommenting and running the first cell below.

In [3]:
# import nltk
# nltk.download('stopwords')

In [4]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
sentence = "Remove the stop words from this sentence."
filtered_words = [word for word in sentence.split() if word.lower() not in stop_words]
filtered_sentence = ' '.join(filtered_words)
print(filtered_sentence)

Remove stop words sentence.


## Handling Contractions
Contractions are shortened versions of words, such as "can't" (cannot) or "it's" (it is). Depending on your analysis requirements, you may choose to expand contractions to their full forms or leave them as they are. Expanding contractions can be done using predefined mapping dictionaries or rule-based approaches.

In [5]:
contractions = {
    "can't": "cannot",
    "it's": "it is"
}
sentence = "I can't believe it's true."
words = sentence.split()
expanded_words = [contractions.get(word, word) for word in words]
expanded_sentence = ' '.join(expanded_words)
print(expanded_sentence)

I cannot believe it is true.
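
The dictionary lookup above only works when a contraction is a whitespace-delimited token in exactly the listed case, so "can't." or "It's" would be missed. A possible variation, sketched here under the assumption of a deliberately tiny mapping and a hypothetical helper named `expand_contractions`, matches contractions with a regular expression instead:

```python
import re

# Hypothetical, deliberately tiny mapping; a real one would be much larger.
contractions = {"can't": "cannot", "it's": "it is", "don't": "do not"}

# One alternation of all known contractions, matched on word boundaries.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in contractions) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    def replace(match):
        found = match.group(0)
        expanded = contractions[found.lower()]
        # Preserve a leading capital, e.g. "Can't" -> "Cannot".
        return expanded.capitalize() if found[0].isupper() else expanded
    return pattern.sub(replace, text)

expanded_sentence = expand_contractions("It's late, and I can't stay.")
print(expanded_sentence)
```

Ambiguous contractions (for instance "it's" as "it has") still need context to expand correctly, which a fixed mapping cannot provide.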


## Handling HTML Tags
If your text data contains HTML tags, you might want to remove them. You can use libraries like BeautifulSoup or regular expressions to extract the text content and discard the HTML tags.

In [6]:
import re

html_content = "<p>Remove <b>HTML tags</b> from this text.</p>"
cleaned_text = re.sub(r'<.*?>', '', html_content)
print(cleaned_text)

Remove HTML tags from this text.
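
A regular expression removes tags but leaves character entities such as `&amp;` in place. As a rough alternative that needs no third-party package, the standard library's `html.parser.HTMLParser` can collect only the text nodes and decodes entities along the way. The class name `TextExtractor` and the helper `strip_tags` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

html_content = "<p>Remove <b>HTML tags</b> &amp; entities.</p>"
cleaned_text = strip_tags(html_content)
print(cleaned_text)
```

BeautifulSoup offers the same idea with more tolerance for malformed markup; this sketch only covers the simple case.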


## Handling URLs
URLs often appear in text data and may not provide meaningful information for many NLP tasks. You can remove URLs using regular expressions or replace them with a placeholder like "URL" to indicate their presence.

In [7]:
import re

sentence = "Visit my website at https://www.example.com for more information."
cleaned_sentence = re.sub(r'http\S+|www\.\S+', '', sentence)
print(cleaned_sentence)

Visit my website at  for more information.
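
Removing a URL outright leaves a gap in the sentence, as the double space in the output shows. The placeholder approach mentioned above could look like the following sketch. The pattern is intentionally rough rather than a full URL grammar, and `replace_urls` is a hypothetical helper name.

```python
import re

# Matches http(s):// links and bare www. links.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def replace_urls(text, placeholder="URL"):
    # Swap every matched link for the placeholder token.
    return URL_PATTERN.sub(placeholder, text)

sentence = "Visit my website at https://www.example.com for more information."
replaced_sentence = replace_urls(sentence)
print(replaced_sentence)
```

Because `\S+` runs to the next whitespace, punctuation that directly follows a link (such as a closing period) is swallowed together with it.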


## Handling Emojis and Special Characters
Emojis and special characters can be common in text data, particularly in social media texts. Depending on your analysis needs, you can remove them or replace them with appropriate placeholders.

In [8]:
import re

sentence = "I love pizza! 😍🍕"
cleaned_sentence = re.sub(r'[^a-zA-Z0-9\s]', '', sentence)
print(cleaned_sentence)

I love pizza 
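
The pattern above also strips ordinary punctuation and every non-ASCII letter. If the goal is to keep the text and only mark where emojis occurred, one rough option is to test each character's Unicode category. The sketch below assumes that the "So" (Symbol, other) category is a good enough proxy for emojis; it also catches symbols such as "©" and does not handle skin-tone modifiers or joiner sequences. The `<EMOJI>` token and the name `replace_emojis` are arbitrary choices.

```python
import unicodedata

def replace_emojis(text, placeholder="<EMOJI>"):
    # Replace every character in the "Symbol, other" category with the placeholder.
    return "".join(
        placeholder if unicodedata.category(ch) == "So" else ch
        for ch in text
    )

sentence = "I love pizza! 😍🍕"
marked_sentence = replace_emojis(sentence)
print(marked_sentence)
```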


## Spell Checking
Spelling errors are common in unprocessed text. Using spell checking techniques or libraries like pyspellchecker, you can correct common spelling mistakes and improve the quality of your data.

In [9]:
from spellchecker import SpellChecker

spell = SpellChecker()
sentence = "Thhis sentennce hass spelllingg erroors."
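# Fall back to the original word when no correction is found (correction() can return None).
corrected_words = [spell.correction(word) or word for word in sentence.split()]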
corrected_sentence = ' '.join(corrected_words)
print(corrected_sentence)

this sentence has spelling errors
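
To illustrate the underlying idea without an extra dependency, the following sketch picks the closest entry from a known word list using the standard library's `difflib`. The five-word `vocabulary` and the `correct` helper are assumptions made for the example; a real checker would use a full dictionary and word frequencies.

```python
import difflib

# Hypothetical toy vocabulary; a real checker would use a full word list.
vocabulary = ["this", "sentence", "has", "spelling", "errors"]

def correct(word, cutoff=0.6):
    # Return the most similar known word, or the input if nothing is close enough.
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

sentence = "Thhis sentennce hass spelllingg erroors"
corrected_sentence = ' '.join(correct(word) for word in sentence.split())
print(corrected_sentence)
```

Similarity-based matching has no notion of context, so it can confidently "correct" a rare but valid word into a more common neighbor.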


## Handling Rare Words
In some cases, you may want to remove or replace rare words to reduce the vocabulary size. Rare words can be defined based on their frequency of occurrence in the corpus. You can replace them with a special token like "UNK" (unknown) or remove them altogether.

In [10]:
from collections import Counter

sentences = ["This is a sample sentence", "Another sentence for demonstration"]

# Tokenization
tokens = [word.lower() for sentence in sentences for word in sentence.split()]

# Calculate word frequencies
word_freq = dict(Counter(tokens))

# Define the threshold for rare words
threshold = 2

# Replace rare words with 'UNK' token
# OOV tokens = Out-of-Vocabulary tokens
filtered_tokens = [
    [word if word_freq.get(word.lower(), 0) >= threshold else 'UNK' for word in sentence.split()]
    for sentence in sentences
]

# Print the result
for sentence in filtered_tokens:
    print(' '.join(sentence))

UNK UNK UNK UNK sentence
UNK sentence UNK UNK


## Tokenization
Tokenization is the process of breaking down a piece of text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the level of granularity desired. Tokenization is a fundamental step in text preprocessing and is crucial for various natural language processing (NLP) tasks, such as machine translation, sentiment analysis, and language generation.

Here's a detailed explanation of tokenization:

### Word Tokenization
Word tokenization is the most common form of tokenization, where the text is split into individual words. For example, given the sentence "Tokenization is important for NLP tasks," the word tokens would be: ["Tokenization", "is", "important", "for", "NLP", "tasks"].

Word tokenization is typically performed using whitespace as the delimiter. However, it's important to handle cases like punctuation marks, contractions, and hyphenated words correctly. For example, "don't" should be tokenized as ["do", "n't"] instead of ["don", "'", "t"].

Libraries like NLTK, spaCy, and the tokenizers package provide ready-to-use word tokenization functions.
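
To see why plain whitespace splitting falls short, the sketch below compares `str.split` with a small regular-expression tokenizer. The pattern is an assumption made for illustration: it separates punctuation into its own tokens but keeps contractions whole, so it does not reproduce the ["do", "n't"] convention described above.

```python
import re

text = "Don't panic, it's only tokenization!"

# Whitespace splitting leaves punctuation glued to the neighboring word.
whitespace_tokens = text.split()

# Words (optionally with an apostrophe part) or single punctuation marks.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```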

***
Before running the NLTK tokenization example below, make sure you have `punkt` downloaded. `punkt` refers to the Punkt tokenizer, a pre-trained model for sentence tokenization that was learned with an unsupervised algorithm. It is trained on large corpora, handles sentence boundary detection for multiple languages, and combines rule-based heuristics with statistical models to identify sentence boundaries accurately. Recent NLTK releases load the tokenizer from the `punkt_tab` resource instead, so if `word_tokenize` raises a `LookupError`, run `nltk.download('punkt_tab')` as well.

In [11]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\FM-PC-
[nltk_data]     LT-279\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
import nltk

sentence = "Tokenization is important for NLP tasks"
tokens = nltk.word_tokenize(sentence)
print(tokens)

['Tokenization', 'is', 'important', 'for', 'NLP', 'tasks']


### Subword Tokenization
Subword tokenization breaks the text into smaller units that may not necessarily correspond to complete words. This technique is particularly useful for languages with complex morphology or for handling out-of-vocabulary words.

One popular subword tokenization algorithm is Byte Pair Encoding (BPE), which iteratively merges the most frequent character pairs to create subword tokens. For example, the word "unhappiness" might be tokenized as ["un", "happiness"].
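
The merge loop at the heart of BPE can be sketched in a few lines. This is a simplified illustration, not the implementation used by any particular library: the word counts are made up, `learn_bpe` is a hypothetical name, and details such as end-of-word markers and byte-level handling are left out.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Made-up word frequencies for a toy corpus.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_counts, num_merges=2)
print(merges)
print(vocab)
```

After two merges the frequent ending "est" has become a single symbol, which is how common subwords emerge from raw characters.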

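Another widely used subword tokenization approach is WordPiece, which is similar to BPE but chooses each merge by how much it improves the likelihood of the training data rather than by raw pair frequency, and typically marks word-internal pieces with a prefix such as `##`. Subword methods like these are particularly helpful for languages like Chinese, where text does not have clear word boundaries.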

The Hugging Face tokenizers library provides efficient implementations of BPE and WordPiece tokenization.

Before running the program below, make sure you have `tokenizers` installed on your device.

`pip install tokenizers`

In [13]:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["./../../data/test-data.txt"], vocab_size=1000)

text = "tokenization"
encoding = tokenizer.encode(text)
tokens = encoding.tokens
print(tokens)

['to', 'ken', 'ization']


### Character Tokenization
In some cases, it may be necessary to tokenize text at the character level. This approach treats each individual character as a token. It is useful for tasks such as text generation or processing languages without clear word boundaries.

For example, the sentence "Hello, world!" would be tokenized as ["H", "e", "l", "l", "o", ",", " ", "w", "o", "r", "l", "d", "!"].

Character tokenization can be performed easily using string manipulation functions available in most programming languages.

In [14]:
text = "Hello, world!"
tokens = list(text)
print(tokens)

['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']


## Stemming and Lemmatization
Stemming and lemmatization are techniques used in natural language processing (NLP) to reduce words to their base or root forms. Both approaches aim to normalize words and reduce inflectional variations, enabling better analysis and comparison of words. However, they differ in their methods and outputs. Let's dive into each technique in detail:

### Stemming
Stemming is a process of reducing words to their base or root forms by removing prefixes or suffixes. The resulting form is often a stem, which may not be an actual word itself. The primary goal of stemming is to simplify the vocabulary and group together words with the same base meaning.

For example, a stemming algorithm reduces the words "running" and "runs" to the common stem "run" by cutting off the suffixes ("-ning" and "-s"), leaving behind the core form of the word. An irregular form such as "ran" is left unchanged, because there is no suffix to strip.

Stemming algorithms follow simple rules and heuristics based on linguistic patterns, rather than considering the context or part of speech of the word. Some popular stemming algorithms include the Porter stemming algorithm, the Snowball stemmer (which supports multiple languages), and the Lancaster stemming algorithm.

Stemming is a computationally lightweight approach and can be useful in certain cases where the exact word form is not crucial. However, it may produce stems that are not actual words, leading to potential loss of meaning and ambiguity.

In [15]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "tasty"
stemmed_word = stemmer.stem(word)
print(stemmed_word)

tasti
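
The rule-based nature of stemming can be illustrated with a toy suffix stripper. This is not the Porter algorithm, only a sketch of the general idea: the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions made for the example.

```python
def toy_stem(word):
    word = word.lower()
    # Try longer suffixes first so "es" wins over "s".
    for suffix in ("ing", "ed", "es", "s"):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undouble a final consonant left behind, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

for word in ["running", "runs", "ran", "jumped"]:
    print(word, "->", toy_stem(word))
```

The rules never consult a dictionary, so "ran" stays as it is and some outputs (for example "hous" from "houses") are not real words.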


### Lemmatization
Lemmatization, on the other hand, aims to reduce words to their canonical or dictionary forms, known as lemmas. Unlike stemming, lemmatization considers the context and part of speech (POS) of the word to generate meaningful lemmas. The resulting lemmas are actual words found in the language's dictionary.

For example, when lemmatizing the words "running," "runs," and "ran," the lemma for each would be "run." Lemmatization takes into account the POS information to accurately determine the base form of the word.

Lemmatization algorithms use linguistic rules and morphological analysis to identify the appropriate lemma. They often rely on language-specific resources, such as word lists and morphological databases. Some popular lemmatization tools include the WordNet lemmatizer and the spaCy library (which supports lemmatization for multiple languages).

Lemmatization typically produces more accurate and meaningful results compared to stemming because it retains the core meaning of words. It is especially useful in tasks that require precise word analysis, such as information retrieval, question answering, and sentiment analysis.

However, lemmatization can be more computationally intensive compared to stemming due to its reliance on POS tagging and language-specific resources.

Before running the lemmatization example below, make sure you have `wordnet` downloaded.

In [16]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\FM-PC-
[nltk_data]     LT-279\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [17]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
word = "tasty"
lemmatized_word = lemmatizer.lemmatize(word)
print(lemmatized_word)

tasty
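
The role of part of speech is easiest to see with a toy lookup table. The sketch below is not how WordNet works internally; the `LEMMAS` table, its tag names, and the `toy_lemmatize` helper are hypothetical and cover only a handful of words.

```python
# Hypothetical lookup table keyed by (word, part of speech).
LEMMAS = {
    ("running", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
}

def toy_lemmatize(word, pos):
    # Fall back to the lowercased word when the table has no entry.
    return LEMMAS.get((word.lower(), pos), word.lower())

print(toy_lemmatize("ran", "VERB"))
print(toy_lemmatize("saw", "VERB"))
print(toy_lemmatize("saw", "NOUN"))
```

NLTK's `WordNetLemmatizer.lemmatize` accepts a `pos` argument for the same reason; when it is omitted, as in the cell above, the word is treated as a noun.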


When deciding between stemming and lemmatization, consider the trade-off between simplicity and accuracy. If you require speed and a broad reduction of word forms, stemming may be sufficient. However, if you need more accurate analysis and want to preserve the semantic meaning of words, lemmatization is generally the preferred choice.

It's important to note that both stemming and lemmatization have limitations. They may not always produce the correct base forms, especially for irregular words or those not present in the chosen language's dictionary. Contextual information, such as word sense disambiguation, can further enhance the accuracy of both techniques.