**Text Cleaning Pipeline**

**Scenario:** You're building a basic sentiment analysis tool for social media comments. The comments often contain typos, emojis, and irrelevant words. You need to clean the text before feeding it to your sentiment analysis model.

**Tasks:**

1. **Text normalization:** Write a function that takes a comment string and performs the following:
    - Lowercase all characters.
    - Remove punctuation, except for exclamation points and question marks, which may signal sentiment.
    - Replace emojis with descriptive text (e.g., "happy" for a smiling-face emoji).
2. **Tokenization:** Write a function that splits the normalized text into individual words (tokens).
3. **Stop word removal:** Create a list of common stop words (e.g., "the", "a", "is"). Write a function that removes these stop words from the list of tokens.
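
Before the pandas-based cells, the three tasks can be sketched on a single string using only the standard library. This is a minimal illustration, not the notebook's implementation: `EMOJI_WORDS`, `STOP_WORDS`, and `clean_comment` are hypothetical names, and the two-entry emoji map and three-word stop list are placeholders.

```python
import string

# Hypothetical emoji-to-word map and stop-word list, for illustration only.
EMOJI_WORDS = {
    "\U0001F60A": "happy",  # smiling face with smiling eyes
    "\U0001F622": "sad",    # crying face
}
STOP_WORDS = {"the", "a", "is"}

# Every ASCII punctuation character except '!' and '?'.
PUNCT_TO_REMOVE = string.punctuation.replace("!", "").replace("?", "")


def clean_comment(comment):
    """Normalize, tokenize, and drop stop words from a single comment."""
    # Task 1: lowercase, strip punctuation, map emojis to words.
    text = comment.lower()
    text = text.translate(str.maketrans("", "", PUNCT_TO_REMOVE))
    for symbol, word in EMOJI_WORDS.items():
        text = text.replace(symbol, f" {word} ")
    # Task 2: split on whitespace.
    tokens = text.split()
    # Task 3: keep only tokens that are not stop words.
    return [token for token in tokens if token not in STOP_WORDS]


print(clean_comment("The weather is great, isn't it? \U0001F60A"))
# ['weather', 'great', 'isnt', 'it?', 'happy']
```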

In [1]:
import nltk
import emoji
import string
import pandas as pd
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/haria/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Load CSV file into DataFrame

In [2]:
def load_data_from_csv(file_path, text_column='Text'):
    try:
        df = pd.read_csv(file_path)
        if text_column not in df.columns:
            raise KeyError(f"Column '{text_column}' not found in CSV")
        return df[text_column].fillna('')
    except FileNotFoundError:
        print(f"Error: The file at {file_path} was not found.")
        # Explicit dtype avoids the ambiguous default dtype of an empty Series.
        return pd.Series([], dtype=str)

### Text normalization

In [3]:
def normalize_text(comment_series):
    """Lowercases, removes punctuation (except '!' and '?'), and replaces emojis with their ':name:' text codes."""
    punctuation_remove = string.punctuation.replace('!', '').replace('?', '')
    
    comment_series = comment_series.str.lower() 
    comment_series = comment_series.str.translate(str.maketrans('', '', punctuation_remove)) 
    comment_series = comment_series.apply(emoji.demojize)
    
    return comment_series
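
To see what the translation table does on its own, here is a small standard-library sketch applied to one string instead of a Series. The sample sentence is made up for illustration.

```python
import string

# Same idea as in normalize_text: every punctuation mark except '!' and '?'.
punctuation_remove = string.punctuation.replace("!", "").replace("?", "")

# With two empty strings, maketrans maps nothing; the third argument lists
# the characters to delete.
table = str.maketrans("", "", punctuation_remove)

sample = "Great day, isn't it?! #blessed"
normalized = sample.lower().translate(table)
print(normalized)  # great day isnt it?! blessed
```

Two consequences are worth keeping in mind. Apostrophes are deleted, so a contraction such as "isn't" becomes "isnt" and will no longer match a stop-word entry spelled with the apostrophe. And because `emoji.demojize` runs after the punctuation step, the colons and underscores in codes like `:flexed_biceps:` survive.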

### Tokenization

In [4]:
def tokenize_text(comment_series):
    """Splits each comment into individual tokens (words)."""
    return comment_series.str.split()
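
Whitespace splitting leaves `!` and `?` attached to the preceding word (for example `workout!`). If separate punctuation tokens are preferred, one possible alternative is a regular-expression tokenizer. This is a sketch under that assumption; `TOKEN_PATTERN` and `tokenize_keep_marks` are hypothetical names, not part of the notebook.

```python
import re

# Matches either a single '!' or '?', or a run of characters that are
# neither whitespace nor '!' / '?'.
TOKEN_PATTERN = re.compile(r"[!?]|[^\s!?]+")


def tokenize_keep_marks(text):
    """Split text into words, emitting each '!' and '?' as its own token."""
    return TOKEN_PATTERN.findall(text)


text = "finished amazing workout! :flexed_biceps:"
print(text.split())
# ['finished', 'amazing', 'workout!', ':flexed_biceps:']
print(tokenize_keep_marks(text))
# ['finished', 'amazing', 'workout', '!', ':flexed_biceps:']
```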

### Stop word removal

In [5]:
def remove_stop_words(token_series, custom_stop_words=None):
    """Removes stop words from tokenized comments. Accepts an optional custom stop-word list."""
    stop_words = set(stopwords.words('english'))
    if custom_stop_words:
        stop_words.update(custom_stop_words)
    
    return token_series.apply(lambda tokens: [token for token in tokens if token not in stop_words])
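
The filtering logic can be tried without NLTK by substituting a small set for the English stop-word list. In the sketch below, `base_stop_words`, the custom entries, and `filter_tokens` are all placeholders chosen for illustration.

```python
# Placeholder for stopwords.words('english'); the real list is much longer.
base_stop_words = {"the", "a", "is"}


def filter_tokens(tokens, custom_stop_words=None):
    """Drop tokens that exactly match a stop word."""
    stop_words = set(base_stop_words)
    if custom_stop_words:
        stop_words.update(custom_stop_words)
    return [token for token in tokens if token not in stop_words]


tokens = ["rt", "the", "traffic", "is", "terrible", "today!"]
print(filter_tokens(tokens))
# ['rt', 'traffic', 'terrible', 'today!']
print(filter_tokens(tokens, custom_stop_words=["rt"]))
# ['traffic', 'terrible', 'today!']

# Matching is exact, so a stop word with attached punctuation is kept.
print(filter_tokens(["what", "is?"]))
# ['what', 'is?']
```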

### Text processing pipeline

In [6]:
def create_text_pipeline(custom_stop_words=None):
    """Creates a text processing pipeline for normalization, tokenization, and stop-word removal."""
    # Pass the functions directly instead of wrapping them in lambdas;
    # extra arguments go through kw_args.
    return Pipeline([
        ('normalize', FunctionTransformer(normalize_text, validate=False)),
        ('tokenize', FunctionTransformer(tokenize_text, validate=False)),
        ('remove_stopwords', FunctionTransformer(
            remove_stop_words, validate=False,
            kw_args={'custom_stop_words': custom_stop_words})),
    ])

### Main function

In [7]:
def process_comments_from_csv(file_path, text_column='Text', custom_stop_words=None):
    """Processes comments from CSV using the text pipeline and returns cleaned tokens."""
    comments = load_data_from_csv(file_path, text_column)
    if comments.empty:
        print("No comments to process.")
        return pd.DataFrame()
    
    text_pipeline = create_text_pipeline(custom_stop_words)
    cleaned_tokens = text_pipeline.fit_transform(comments)
    
    cleaned_df = pd.DataFrame({'Cleaned Tokens': cleaned_tokens})
    
    return cleaned_df

In [8]:
csv_file_path = 'datasets/sentiment.csv'
cleaned_tokens = process_comments_from_csv(csv_file_path)

print(cleaned_tokens)

                                        Cleaned Tokens
0                    [enjoying, beautiful, day, park!]
1                         [traffic, terrible, morning]
2       [finished, amazing, workout!, :flexed_biceps:]
3               [excited, upcoming, weekend, getaway!]
4               [trying, new, recipe, dinner, tonight]
..                                                 ...
727  [collaborating, science, project, received, re...
728  [attending, surprise, birthday, party, organiz...
729  [successfully, fundraising, school, charity, i...
730  [participating, multicultural, festival, celeb...
731  [organizing, virtual, talent, show, challengin...

[732 rows x 1 columns]
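
In the output above, emojis appear as codes such as `:flexed_biceps:` rather than plain words. If plain words are wanted, one possible post-processing step is to expand those tokens after tokenization. This is a sketch only: `EMOJI_TOKEN` and `expand_emoji_tokens` are hypothetical names, and the pattern assumes codes made of lowercase letters, digits, and underscores.

```python
import re

# A token that consists entirely of one ':name:' code.
EMOJI_TOKEN = re.compile(r":([a-z0-9_]+):")


def expand_emoji_tokens(tokens):
    """Replace ':some_name:' tokens with the words 'some' and 'name'."""
    expanded = []
    for token in tokens:
        match = EMOJI_TOKEN.fullmatch(token)
        if match:
            expanded.extend(match.group(1).split("_"))
        else:
            expanded.append(token)
    return expanded


row = ["finished", "amazing", "workout!", ":flexed_biceps:"]
print(expand_emoji_tokens(row))
# ['finished', 'amazing', 'workout!', 'flexed', 'biceps']
```

Tokens that contain more than a single code, such as two adjacent emojis with no space between them, do not match the pattern and are left unchanged.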
