### Text Preprocessing Techniques

Text preprocessing is a set of techniques for cleaning, transforming, and preparing raw text so that it is suitable for natural language processing (NLP) or machine learning (ML) tasks. The goal is to improve the quality and usability of the text for subsequent analysis or modeling.

Text preprocessing typically involves the following steps:

* Lowercasing
* Removing Punctuation & Special Characters
* Stop-Words Removal
* Removal of URLs
* Removal of HTML Tags
* Stemming & Lemmatization
* Tokenization

### Lowercasing
Lowercasing is a text preprocessing step where all letters in the text are converted to lowercase, so that variants such as ‘Hello’ and ‘hello’ are treated as the same word.

In [1]:
text = "Hello WorlD!"
lowercased_text = text.lower()

print(lowercased_text)

hello world!
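Lowercasing is often used to make string comparisons case-insensitive. As a small sketch (the sample strings are illustrative and not part of the original notebook), Python's built-in `str.casefold()` is a more aggressive variant of `lower()` intended for caseless matching. For example, it expands the German "ß" to "ss":

```python
# Illustrative strings (assumed for this sketch): the same German word in two spellings.
upper_form = "STRASSE"
sharp_s_form = "Straße"

# lower() leaves "ß" unchanged, so the two forms still differ.
print(upper_form.lower() == sharp_s_form.lower())        # False

# casefold() expands "ß" to "ss", so the two forms match.
print(upper_form.casefold() == sharp_s_form.casefold())  # True
```

For plain English text, `lower()` and `casefold()` give the same result, so the choice mainly matters for multilingual data.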


### Removing Punctuation & Special Characters
Punctuation removal is a text preprocessing step that removes punctuation marks (such as periods, commas, and exclamation marks) and other special characters (such as symbols and emojis) from the text, so that only the words themselves remain.

In [2]:
import re

text = "Hello, world! This is?* 💜an&/|~^+%'\" example- of text preprocessing."

punctuation_pattern = r'[^\w\s]'

text_cleaned = re.sub(punctuation_pattern, '', text)

print(text_cleaned)

Hello world This is an example of text preprocessing
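One detail of the pattern above is worth illustrating: `\w` also matches the underscore, so `[^\w\s]` keeps underscores in place. The sketch below compares it with an alternative based on `str.translate` and `string.punctuation`, which removes only ASCII punctuation (including the underscore) and therefore leaves emojis untouched. The helper names and the sample string are assumptions made for this example.

```python
import re
import string

def strip_with_regex(text):
    # Same pattern as the cell above: remove everything that is not
    # a word character or whitespace.
    cleaned = re.sub(r"[^\w\s]", "", text)
    # Collapse any extra whitespace left behind by removed characters.
    return " ".join(cleaned.split())

def strip_with_translate(text):
    # Remove only the ASCII punctuation characters listed in string.punctuation.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())

sample = "snake_case, it's 100% done 💜"  # illustrative input

print(strip_with_regex(sample))      # snake_case its 100 done
print(strip_with_translate(sample))  # snakecase its 100 done 💜
```

Which behavior is preferable depends on the data: the regex handles non-ASCII symbols, while the `translate` version treats the underscore as punctuation.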


### Stop-Words Removal
Stopwords are very common words (such as ‘the’, ‘is’, and ‘and’) that carry little meaning on their own. For many tasks, they can be removed without noticeably changing the meaning of the sentence.

In [3]:
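from nltk.corpus import stopwords

# Requires the NLTK stopwords corpus; run nltk.download("stopwords") once if it is missing.

# Remove the stopwords of the given language and print the remaining words.
def remove_stopwords(text, language):
    stop_words = set(stopwords.words(language))
    word_tokens = text.split()  # naive whitespace tokenization
    filtered_text = [word for word in word_tokens if word not in stop_words]
    print(language)
    print(filtered_text)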
 
en_text = "This is a sample sentence and we are going to remove the stopwords from this"
remove_stopwords(en_text, "english")

# Turkish for: "we will remove the blocked words in this sentence"
tr_text = "bu cümledeki engellenen kelimeleri kaldıracağız"
remove_stopwords(tr_text, "turkish")

english
['This', 'sample', 'sentence', 'going', 'remove', 'stopwords']
turkish
['cümledeki', 'engellenen', 'kelimeleri', 'kaldıracağız']


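Note: In the English output, the lowercase ‘this’ was removed but the capitalized ‘This’ was kept, because the NLTK stopword list is lowercase and the comparison is case-sensitive. For the same reason, a token with attached punctuation, such as ‘this.’, would not match either. It is therefore necessary to convert the text to lowercase and remove punctuation marks before applying this step.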
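The ordering described in the note can be sketched as follows. To keep the example runnable without downloading the NLTK corpus, it uses a small hand-picked stopword set; that set and the function name `normalize_and_filter` are assumptions for illustration, and the set could be swapped for `set(stopwords.words("english"))`.

```python
import re

# A tiny hand-picked stopword set, assumed here so the sketch runs without NLTK data.
SAMPLE_STOP_WORDS = {"this", "is", "a", "and", "we", "are", "to", "the", "from"}

def normalize_and_filter(text, stop_words):
    # 1. Lowercase so that "This" and "this" are treated the same.
    lowered = text.lower()
    # 2. Remove punctuation so that "this." matches "this".
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    # 3. Split on whitespace and drop the stopwords.
    return [word for word in no_punct.split() if word not in stop_words]

en_text = "This is a sample sentence, and we are going to remove the stopwords from this."
print(normalize_and_filter(en_text, SAMPLE_STOP_WORDS))
# ['sample', 'sentence', 'going', 'remove', 'stopwords']
```

Unlike the earlier output, the leading ‘This’ is now removed as well.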