# <font color='#154360'> <b> <center> Text Pre-processing for NLP</center> </b> </font>

<b> Steps </b>

1. Text Lowercasing
2. Remove anything not important for the task (example: hyperlinks in Twitter data)
3. Tokenization
4. Noise Removal
    Remove unnecessary elements from the text that do not contribute to the meaning, such as:
        - Special characters (e.g., '@', '$')
        - HTML tags
        - Non-alphabetic characters (if not relevant to the task)
5. Remove stop words and punctuation
6. Normalization
    Normalize text to ensure consistency in representation, which may include:
        - Handling contractions (e.g., "can't" to "cannot")
        - Correcting spelling mistakes
        - Lemmatization or stemming (reducing words to their base or root form).
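
Two items in the list above, HTML tag removal and contraction handling, can be sketched with the standard library alone. This is a minimal illustration, not a complete solution: the `CONTRACTIONS` map, the helper names `remove_noise` and `expand_contractions`, and the sample string are all hypothetical, and the character filter also discards emoji and typographic apostrophes.

```python
import re

# Hypothetical, deliberately tiny contraction map; a real project would need a fuller list.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}

TAG_RE = re.compile(r"<[^>]+>")            # anything that looks like an HTML tag
SPECIAL_RE = re.compile(r"[^a-z0-9\s']")   # keep letters, digits, whitespace, apostrophes


def remove_noise(text):
    """Lowercase, strip HTML tags and special characters, collapse whitespace."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return " ".join(CONTRACTIONS.get(word, word) for word in text.split())


sample = "<p>I can't pay $5 to @marcos, it's too much!</p>"
cleaned = expand_contractions(remove_noise(sample))
print(cleaned)  # i cannot pay 5 to marcos it is too much
```

Contractions are expanded after noise removal here so that the apostrophe survives long enough to be matched; reversing the order would work only if the special-character filter still kept apostrophes.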


In [23]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
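import re

# One-time download of the NLTK data used below
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)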

In [24]:
# Example text
text = "Hello @marcos, this is an example of text preprocessing! Hope you enjoy it 😊"

# Step 1: Lowercasing
text = text.lower()

# Step 2: Remove hyperlinks
text = re.sub(r'https?://\S+', '', text)

# Step 3: Tokenization (casual_tokenize keeps emoji and @handles as single tokens)
tokens = nltk.casual_tokenize(text)

# Step 4: Noise removal (keep alphabetic tokens and emoji-only tokens)
clean_tokens = [token for token in tokens if token.isalpha()
                    or token.encode('ascii', 'ignore').decode('ascii') == '']

# Step 5: Stopword removal
stop_words = set(stopwords.words('english'))
clean_tokens = [token for token in clean_tokens if token not in stop_words]

# Step 6: Normalization (using stemming in this example)
stemmer = PorterStemmer()
clean_tokens = [stemmer.stem(token) for token in clean_tokens]

# Example of final processed data
print("Processed Text:", clean_tokens)



Processed Text: ['hello', 'exampl', 'text', 'preprocess', 'hope', 'enjoy', '😊']
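
The Step 4 filter is compact, so it may help to see it applied token by token. The sketch below pulls the same condition into a helper (the name `keep_token` and the sample tokens are only for illustration): a token is kept if it is purely alphabetic, or if nothing is left after dropping its non-ASCII characters.

```python
def keep_token(token):
    """Same test as the Step 4 list comprehension, pulled out for inspection."""
    ascii_part = token.encode("ascii", "ignore").decode("ascii")
    return token.isalpha() or ascii_part == ""


for token in ["hello", "@marcos", "!", "2024", "😊", "café", "¿"]:
    print(repr(token), keep_token(token))
```

Note the side effect on the last token: any token made up only of non-ASCII characters passes, so non-ASCII punctuation such as `¿` is kept along with emoji.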


Let's write a function that combines all these steps:

In [25]:
def preprocess_text(text):
    """Lowercase, strip URLs, tokenize, filter noise, drop stop words, and stem."""
    # Step 1: Lowercasing
    text = text.lower()
    
    # Step 2: Remove hyperlinks
    text = re.sub(r'https?://\S+', '', text)
    
    # Step 3: Tokenization
    tokens = word_tokenize(text)
    
    # Step 4: Noise removal (keep alphabetic tokens and emoji-only tokens)
    clean_tokens = [token for token in tokens if token.isalpha()
                        or token.encode('ascii', 'ignore').decode('ascii') == '']

    # Step 5: Stopword removal
    stop_words = set(stopwords.words('english'))
    clean_tokens = [token for token in clean_tokens if token not in stop_words]

    # Step 6: Normalization (using stemming in this example)
    stemmer = PorterStemmer()
    clean_tokens = [stemmer.stem(token) for token in clean_tokens]
    
    return clean_tokens

In [26]:
# Example sentences
sentences = [
    "Hello, this is a sample sentence!",
    "I love coding with Python and NLTK.",
    "Text preprocessing is crucial for NLP tasks.",
    "Emoticons like 😊 should be handled properly.",
    "Stemming reduces words to their root form.",
    "With a link: https://www.nico.com.uy "
]

# Apply preprocess_text function to each sentence using list comprehension
processed_sentences = [preprocess_text(sentence) for sentence in sentences]

# Print processed sentences
for idx, tokens in enumerate(processed_sentences):
    print(f"Sentence {idx + 1}: {tokens}")

Sentence 1: ['hello', 'sampl', 'sentenc']
Sentence 2: ['love', 'code', 'python', 'nltk']
Sentence 3: ['text', 'preprocess', 'crucial', 'nlp', 'task']
Sentence 4: ['emoticon', 'like', '😊', 'handl', 'properli']
Sentence 5: ['stem', 'reduc', 'word', 'root', 'form']
Sentence 6: ['link']
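
Sentence 6 reduces to `['link']` because the URL is removed before tokenization, and "with" and "a" are stop words. As a quick check of what the pattern `https?://\S+` does and does not match, here is a small sketch using only `re`; the name `URL_RE` and the extra example strings are illustrative.

```python
import re

URL_RE = re.compile(r"https?://\S+")

examples = [
    "With a link: https://www.nico.com.uy ",
    "Plain http://example.com/page?x=1 works too",
    "No scheme: www.example.com stays",
]

for s in examples:
    # repr() makes the leftover whitespace visible
    print(repr(URL_RE.sub("", s)))
```

Links written without `http://` or `https://` are left untouched, and because `\S+` runs to the next whitespace, punctuation attached to the end of a URL is removed with it.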
