# Preprocessing Techniques

This notebook contains Python implementations of 12 common text preprocessing techniques.
Each technique is provided as a standalone function that takes a text string as input and returns the processed string. The functions can be combined into custom preprocessing pipelines; because each step changes what the next one sees, the order in which they are applied affects the result.
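
A rough sketch of how such a pipeline might be composed is shown here. `make_pipeline` is a hypothetical helper that is not part of this notebook, and the three step functions are simplified, standard-library-only copies of the ones defined further down, repeated so the block runs on its own. The sample string is also made up for illustration.

```python
import re
import string
from functools import reduce

# Simplified stand-ins for three of the notebook's functions.
def lowercasing(text=""):
    return text.lower()

def remove_html_tags(text=""):
    return re.sub(r'<.*?>', '', text)

def remove_punctuation(text=""):
    return text.translate(str.maketrans('', '', string.punctuation))

def make_pipeline(*steps):
    """Hypothetical helper: build a function that applies `steps` left to right."""
    def run(text=""):
        return reduce(lambda acc, step: step(acc), steps, text)
    return run

sample = "Visit <b>NOW</b>!"  # made-up input

tags_first = make_pipeline(remove_html_tags, remove_punctuation, lowercasing)
punct_first = make_pipeline(remove_punctuation, remove_html_tags, lowercasing)

print(tags_first(sample))   # visit now
print(punct_first(sample))  # visit bnowb
```

The two pipelines use the same steps but give different results: once punctuation has been removed, `<b>` is no longer recognizable as a tag, so stripping HTML afterwards has no effect.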

### Descriptions of Preprocessing Techniques

- **Lowercasing**: Converts all text to lowercase to ensure uniformity.
- **Uppercasing**: Converts all text to uppercase, useful in some tokenization scenarios.
- **Removing Punctuation**: Deletes punctuation marks that may not add value to the model.
- **Removing Numbers**: Eliminates digits from the text if they are not informative.
- **Removing Extra Whitespace**: Collapses runs of whitespace (spaces, tabs, newlines) into a single space and trims both ends of the string.
- **Removing Stopwords**: Removes common words (like "the", "and") that usually carry little meaning.
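- **Stemming**: Reduces words to a stem by stripping suffixes (e.g., "running" → "run"); the stem is not always a dictionary word (e.g., "website" → "websit").
- **Lemmatization**: Converts words to their dictionary base form using vocabulary knowledge (e.g., "better" → "good" when the word is tagged as an adjective).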
- **Removing Special Characters**: Removes symbols and non-alphanumeric characters.
- **Expanding Contractions**: Converts contractions to their full forms (e.g., "don't" → "do not").
- **Removing HTML Tags**: Strips HTML markup from text, useful for web-scraped data.
- **Removing Non-ASCII Characters**: Deletes characters outside standard ASCII encoding, e.g., emojis or accented letters.

In [19]:
# Libraries
# Third-party dependencies: nltk and contractions (pip install nltk contractions)

import re
import string

import contractions
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Run once to download the NLTK data used below
# nltk.download('stopwords')
# nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


In [9]:
# Example sentence used in the cells below
sentence = "Don't worry! Visit my website at https://diegoblassio.com <b>now</b> for 100% free resources 😊."

In [10]:
# 1. Lowercasing
def lowercasing(text=""):
    return text.lower()

lowercasing(text=sentence)

"don't worry! visit my website at https://diegoblassio.com <b>now</b> for 100% free resources 😊."

In [11]:
# 2. Uppercasing
def uppercasing(text=""):
    return text.upper()

uppercasing(text=sentence)

"DON'T WORRY! VISIT MY WEBSITE AT HTTPS://DIEGOBLASSIO.COM <B>NOW</B> FOR 100% FREE RESOURCES 😊."

In [12]:
# 3. Removing punctuation
def remove_punctuation(text=""):
    return text.translate(str.maketrans('', '', string.punctuation))

remove_punctuation(text=sentence)

'Dont worry Visit my website at httpsdiegoblassiocom bnowb for 100 free resources 😊'

In [13]:
# 4. Removing digits/numbers
def remove_numbers(text=""):
    return re.sub(r'\d+', '', text)

remove_numbers(text=sentence)

"Don't worry! Visit my website at https://diegoblassio.com <b>now</b> for % free resources 😊."

In [14]:
# 5. Removing extra whitespaces
def remove_extra_whitespace(text=""):
    return " ".join(text.split())

remove_extra_whitespace(text=sentence)

"Don't worry! Visit my website at https://diegoblassio.com <b>now</b> for 100% free resources 😊."
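
The example sentence contains no redundant whitespace, so the output above is identical to the input. As a small illustration on a made-up string (the `messy` value is an assumption, not data from this notebook), the same function behaves as follows:

```python
def remove_extra_whitespace(text=""):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
    return " ".join(text.split())

messy = "  Don't   worry!\t\tVisit \n my website.  "  # made-up input
print(repr(remove_extra_whitespace(messy)))
# "Don't worry! Visit my website."
```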

In [15]:
# 6. Removing stopwords
def remove_stopwords(text=""):
    tokens = text.split()
    filtered = [word for word in tokens if word.lower() not in stop_words]
    return " ".join(filtered)

remove_stopwords(text=sentence)

'worry! Visit website https://diegoblassio.com <b>now</b> 100% free resources 😊.'

In [20]:
# 7. Stemming
def stemming(text=""):
    tokens = text.split()
    # PorterStemmer lowercases each token by default before stemming
    stemmed = [stemmer.stem(word) for word in tokens]
    return " ".join(stemmed)

stemming(text=sentence)

"don't worry! visit my websit at https://diegoblassio.com <b>now</b> for 100% free resourc 😊."

In [18]:
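# 8. Lemmatization
def lemmatization(text=""):
    tokens = text.split()
    # lemmatize() treats every token as a noun unless a part of speech is passed,
    # e.g. lemmatizer.lemmatize("better", pos="a") returns "good".
    lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(lemmatized)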

lemmatization(text=sentence)

"Don't worry! Visit my website at https://diegoblassio.com <b>now</b> for 100% free resource 😊."

In [None]:
# 9. Removing special characters (non-alphanumeric)
def remove_special_characters(text=""):
    return re.sub(r'[^A-Za-z0-9\s]', '', text)


remove_special_characters(text=sentence)

'Dont worry Visit my website at httpsdiegoblassiocom bnowb for 100 free resources '

In [22]:
# 10. Expanding contractions (e.g., "don't" -> "do not")
def expand_contractions(text=""):
    return contractions.fix(text)

expand_contractions(text=sentence)

'Do not worry! Visit my website at https://diegoblassio.com <b>now</b> for 100% free resources 😊.'

In [23]:
# 11. Removing HTML tags
def remove_html_tags(text=""):
    return re.sub(r'<.*?>', '', text)

remove_html_tags(text=sentence)

"Don't worry! Visit my website at https://diegoblassio.com now for 100% free resources 😊."
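
The regular expression above removes anything between `<` and `>` but leaves HTML entities such as `&amp;` untouched. One possible alternative, sketched here under the assumption that entity decoding is wanted, is to collect text nodes with the standard library's `html.parser`. The names `_TextExtractor` and `remove_html_tags_parsed` are hypothetical, and the input string is made up.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""
    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def remove_html_tags_parsed(text=""):
    # Hypothetical alternative to the regex-based remove_html_tags
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(remove_html_tags_parsed("Visit <b>now</b> for free &amp; fast resources."))
# Visit now for free & fast resources.
```

For short, well-formed snippets the regex version is usually sufficient; the parser-based version is mainly worth considering for scraped pages that contain entities.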

In [24]:
# 12. Removing non-ASCII characters
def remove_non_ascii(text=""):
    return text.encode('ascii', 'ignore').decode('ascii')

remove_non_ascii(text=sentence)

"Don't worry! Visit my website at https://diegoblassio.com <b>now</b> for 100% free resources ."
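
Because `encode('ascii', 'ignore')` drops every non-ASCII character, accented letters disappear entirely rather than being reduced to their base letter. If that is undesirable, one option is to decompose the text with Unicode normalization first. This is a sketch only: `fold_to_ascii` is a hypothetical name and the sample string is made up.

```python
import unicodedata

def remove_non_ascii(text=""):
    return text.encode('ascii', 'ignore').decode('ascii')

def fold_to_ascii(text=""):
    # Hypothetical variant: NFKD splits an accented letter into a base letter
    # plus combining marks, so the base letter survives the ASCII filter.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# Made-up input: "Café déjà vu" followed by an emoji, written with escapes
sample = "Caf\u00e9 d\u00e9j\u00e0 vu \U0001F60A"

print(repr(remove_non_ascii(sample)))  # 'Caf dj vu '
print(repr(fold_to_ascii(sample)))     # 'Cafe deja vu '
```

Emojis have no ASCII decomposition, so both versions remove them.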

In our work, we chose not to apply traditional text preprocessing techniques such as stopword removal, stemming, or lemmatization before feeding the data into our models. Recent studies have shown that pre-trained transformer models, including BERT and RoBERTa, are robust to raw text input and can learn contextual representations without extensive preprocessing. Aggressive preprocessing can even remove information the model relies on, potentially reducing performance. We therefore kept the original text structure and relied on the tokenizers of the pre-trained models to preserve semantic and syntactic information.