# Data Preprocessing of GoEmotions Dataset by Google

 There are three main reasons to preprocess the data:

 (a) Remove symbols and irrelevant content
 
 (b) Normalise texts to reduce vocabulary size

 (c) Improve tokenisation

 This notebook uses the GoEmotions dataset by Google, obtained from Kaggle. The text is first cleaned with regular expressions, then tokenized and filtered for stop words. After that, stemming or lemmatization can be selected to reduce the vocabulary size for easier processing. Part-of-speech tagging is used to label the tokens. TF-IDF, NMF and BM25 are then used to analyse the processed dataset.

Before starting any project that uses datasets, it is generally a good idea to understand the data being handled. Datasets can be messy and may need cleaning before tokenizers work well. For this project we use BERT and its variants and XLNet, which are all pretrained transformer models that can handle messy raw text via their respective Hugging Face tokenizers. Even so, exploring and understanding the data first is a good habit.

## 1. Import Libraries and Download the Natural Language Toolkit (NLTK) Resources

In [1]:
import os, re, pandas as pd
from pathlib import Path
import nltk
nltk.download('punkt', quiet=True) # NLTK pre-trained sentence tokenizer model, used by word_tokenize
nltk.download('punkt_tab', quiet=True) # Newer NLTK releases look for this resource instead of 'punkt'
nltk.download('wordnet', quiet=True) # Lexical database that groups words into sets of synonyms, used for lemmatization
nltk.download('omw-1.4', quiet=True) # Open Multilingual WordNet data that accompanies WordNet
nltk.download('stopwords', quiet=True) # Stop word lists for several languages
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

DATA_DIR = Path(r'd:\EE6405 NLP Project\Data Preprocessing\dataset')
files = [DATA_DIR / 'go_emotions_dataset_shivamb']
files

[WindowsPath('d:/EE6405 NLP Project/Data Preprocessing/dataset/go_emotions_dataset_shivamb')]

## 2. Load selected CSV

In [2]:
from typing import Optional

def find_text_col(df: pd.DataFrame) -> Optional[str]:
    for cand in ['text','sentence','content','utterance','lyrics','statement']:
        if cand in df.columns:
            return cand
    obj_cols = df.select_dtypes(include='object').columns.tolist()
    return obj_cols[0] if obj_cols else None

def load_one(f: Path) -> Optional[pd.DataFrame]:
    df = pd.read_csv(f)
    text_col = find_text_col(df)
    if text_col is None:
        return None
    df = df.rename(columns={text_col: 'text'})
    df['source_file'] = f.name
    return df

#files = list(DATA_DIR.glob('*.csv'))
#files
print(f"{len(files)} file(s) scheduled")

1 file(s) scheduled


## 3. Regex Cleaning Function

#### Cleaning removes noise and normalizes text before tokenization

In [3]:
# Cleaning is to remove noise & normalize text before tokenization

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+') # Matches URLs starting with http://, https:// or www.
MENTION_PATTERN = re.compile(r'@[A-Za-z0-9_]+') # Matches mentions like @username
HASHTAG_PATTERN = re.compile(r'#(\w+)') # Captures the hashtagged word so the # can be dropped
NON_ALPHA_PATTERN = re.compile(r'[^a-zA-Z\s]') # Matches anything that is not a letter or whitespace (digits, punctuation, emoji)
MULTISPACE_PATTERN = re.compile(r'\s+') # Matches runs of whitespace, which are collapsed to a single space
def basic_clean(text: str) -> str:
    if not isinstance(text, str):
        return ''
    text = text.lower()
    text = URL_PATTERN.sub(' ', text)
    text = MENTION_PATTERN.sub(' ', text)
    text = HASHTAG_PATTERN.sub(r'\1', text)
    text = NON_ALPHA_PATTERN.sub(' ', text)
    text = MULTISPACE_PATTERN.sub(' ', text).strip()
    return text
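
To make the order of the cleaning steps concrete, the sketch below repeats the same five patterns and records the text after each substitution. The helper name `trace_clean` and the sample sentence are made up for illustration and are not part of the notebook.

```python
import re

# Same patterns as in the cleaning cell, repeated so the example runs on its own.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
MENTION_PATTERN = re.compile(r'@[A-Za-z0-9_]+')
HASHTAG_PATTERN = re.compile(r'#(\w+)')
NON_ALPHA_PATTERN = re.compile(r'[^a-zA-Z\s]')
MULTISPACE_PATTERN = re.compile(r'\s+')

def trace_clean(text):
    """Return (step name, intermediate text) pairs, one per cleaning step."""
    steps = []
    text = text.lower()
    steps.append(('lowercase', text))
    text = URL_PATTERN.sub(' ', text)
    steps.append(('urls', text))
    text = MENTION_PATTERN.sub(' ', text)
    steps.append(('mentions', text))
    text = HASHTAG_PATTERN.sub(r'\1', text)
    steps.append(('hashtags', text))
    text = NON_ALPHA_PATTERN.sub(' ', text)
    steps.append(('non-alpha', text))
    text = MULTISPACE_PATTERN.sub(' ', text).strip()
    steps.append(('whitespace', text))
    return steps

# Hypothetical input, not taken from the dataset.
sample = "Check https://example.com @user_1 #Blessed!! It's 100% great :)"
steps = trace_clean(sample)
for name, value in steps:
    print(f"{name:>10}: {value!r}")
# The final step yields 'check blessed it s great'
```

Note that the apostrophe in "It's" becomes a space, leaving a stray "s". Digits and emoticons are dropped as well, which may matter for emotion-related text.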

## 4. Stopword Removal & Tokenization

Stop words are removed to reduce vocabulary size and to drop words that carry little content.

Tokenization splits sentences into words for further processing.

In [4]:
stop_words = set(stopwords.words('english')) # NLTK English stop words
DOMAIN_STOP = {"chorus","verse","repeat","na","la"}  # Domain-specific stop words (these examples suit song lyrics), change as needed for the corpus
stop_words |= DOMAIN_STOP # Merge both the sets

def tokenize_filter(s: str):
    tokens = word_tokenize(s) # Tokenize text into words
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2] # Remove stop words and words shorter than 3 characters
    return tokens
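
The filtering rule can be tried without any NLTK downloads. This is a minimal sketch under stated assumptions: `TOY_STOP_WORDS` is a hypothetical miniature stop list, `filter_tokens` is an illustrative helper, and a whitespace split stands in for `word_tokenize`.

```python
# Hypothetical miniature stop list; the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"i", "am", "not", "the", "at", "all", "with", "this"}

def filter_tokens(tokens, stop_words, min_len=3):
    """Keep tokens that are not stop words and have at least min_len characters."""
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]

# A whitespace split stands in for word_tokenize in this sketch.
tokens = "i am not happy with this movie at all".split()

kept = filter_tokens(tokens, TOY_STOP_WORDS)
print(kept)  # ['happy', 'movie']

# Keeping the negation preserves the polarity of the sentence.
kept_with_negation = filter_tokens(tokens, TOY_STOP_WORDS - {"not"})
print(kept_with_negation)  # ['not', 'happy', 'movie']
```

NLTK's English stop word list also contains negations such as "not". For emotion data it may be worth checking whether removing them changes the apparent meaning of a comment.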

## 5. Stemming vs Lemmatization

Stemming reduces a word to its root/stem by stripping suffixes (e.g. studies --> studi).

Lemmatization reduces a word to its dictionary base form (e.g. studies --> study).

Both processes map inflected words to a common base form, which reduces vocabulary size for easier processing.

In [5]:
stemmer = PorterStemmer() # NLTK Word Stemmer
lemmatizer = WordNetLemmatizer() # NLTK Word Lemmatizer

def stem_tokens(tokens):
    return [stemmer.stem(t) for t in tokens]

def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(t) for t in tokens]
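
The difference between the two approaches can be sketched with toy stand-ins. Everything in the sketch is hypothetical: `SUFFIX_RULES`, `LEMMA_TABLE`, `toy_stem` and `toy_lemmatize` only illustrate the ideas of rule-based suffix stripping and dictionary lookup. They are not the Porter algorithm or WordNet.

```python
# Toy stand-ins only: PorterStemmer applies many more ordered rules, and
# WordNetLemmatizer consults the WordNet database instead of a small table.
SUFFIX_RULES = [("ies", "i"), ("ied", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Apply the first matching suffix rule, then turn a final 'y' into 'i'."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)] + replacement
            break
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word

# Hypothetical (word, part of speech) -> lemma entries.
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("studied", "v"): "study",
    ("studying", "v"): "study",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; unknown pairs pass through."""
    return LEMMA_TABLE.get((word, pos), word)

words = ["study", "studies", "studied", "studying"]
stems = {toy_stem(w) for w in words}
noun_lemmas = {toy_lemmatize(w) for w in words}
verb_lemmas = {toy_lemmatize(w, pos="v") for w in words}

print("stems:      ", sorted(stems))        # ['studi']
print("noun lemmas:", sorted(noun_lemmas))  # ['studied', 'study', 'studying']
print("verb lemmas:", sorted(verb_lemmas))  # ['study']
```

The `pos` argument in the toy mirrors a property of the real API: `WordNetLemmatizer.lemmatize` treats words as nouns unless a `pos` argument is passed. Since `lemmatize_tokens` calls it without one, verb forms may pass through unchanged.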

## 6. Choose one representation (Lemmatized/Stemmed)

In [6]:
def preprocess_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['clean_text']   = df['text'].apply(basic_clean)
    df['tokens']       = df['clean_text'].apply(tokenize_filter)
    df['stemmed']      = df['tokens'].apply(stem_tokens)
    df['lemmatized']   = df['tokens'].apply(lemmatize_tokens)

    # Change to df['stemmed'] to use the stemmed version
    df['final_text']   = df['lemmatized'].apply(' '.join)
    return df

## 7. Part-of-speech (POS) Tagging

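POS tagging labels each word with its grammatical category (e.g. noun, verb, adjective) based on its context.

These tags can serve as features that describe sentence structure and help disambiguate word meaning. Note that the tags here are computed on the stop-word-filtered tokens, so the tagger sees less context than it would on full sentences.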

In [7]:
import nltk
from collections import Counter

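# Ensure POS tagger models and the Universal tagset mapping are available
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('universal_tagset', quiet=True)  # Needed for tagset='universal'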

def add_pos_tags(df: pd.DataFrame) -> pd.DataFrame:
    """Add POS tags for each token list in df['tokens']."""
    df = df.copy()
    if 'tokens' not in df.columns:
        # Fallback: build tokens from clean_text if needed
        df['tokens'] = df['clean_text'].apply(word_tokenize)

    # PTB tags (e.g., NN, VBZ) and Universal tags (e.g., NOUN, VERB)
    df['pos_ptb'] = df['tokens'].apply(lambda toks: nltk.pos_tag(toks))
    df['pos_universal'] = df['tokens'].apply(lambda toks: nltk.pos_tag(toks, tagset='universal'))

    # Counts of Universal POS per doc (useful features)
    df['pos_universal_counts'] = df['pos_universal'].apply(lambda pairs: Counter(tag for _, tag in pairs))
    return df
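
The per-document `Counter` in `pos_universal_counts` can be turned into a fixed-length feature vector. The sketch below shows one possible way. The helper `pos_count_vector`, the column order and the hand-written tags are assumptions for illustration. In the notebook the tags would come from `nltk.pos_tag`.

```python
from collections import Counter

# The 12 tags of the Universal tagset, in a fixed (arbitrary) column order.
UNIVERSAL_TAGS = ["ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN",
                  "NUM", "PRON", "PRT", "VERB", ".", "X"]

def pos_count_vector(tagged_tokens):
    """Map (token, tag) pairs to per-tag counts in UNIVERSAL_TAGS order."""
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts.get(tag, 0) for tag in UNIVERSAL_TAGS]

# Hand-written tags for illustration only.
tagged = [("fresh", "ADJ"), ("water", "NOUN"), ("dilutes", "VERB"),
          ("salt", "NOUN"), ("water", "NOUN")]

vector = pos_count_vector(tagged)
print(dict(zip(UNIVERSAL_TAGS, vector)))
```

A fixed column order keeps the vectors comparable across documents, including documents in which some tags never occur.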

## Term Frequency-Inverse Document Frequency (TF-IDF)

The TF-IDF score indicates how important a term is to a document relative to the rest of the corpus. A term that is frequent in one document but rare across the corpus scores high, because it is distinctive of that document. A term that is frequent in both the document and the corpus scores lower, because it does little to distinguish one document from another.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib
from pathlib import Path

def _resolve_df_with_final_text():
    glb = globals()
    if 'df_proc' in glb and isinstance(glb['df_proc'], pd.DataFrame) and 'final_text' in glb['df_proc'].columns:
        return glb['df_proc']
    if 'df' in glb and isinstance(glb['df'], pd.DataFrame) and 'final_text' in glb['df'].columns:
        return glb['df']
    try:
        srcs = files
    except NameError:
        raise ValueError("Provide a DataFrame (df_proc/df) with 'final_text' or define 'files'.")
    csvs = []
    for f in srcs:
        p = Path(f)
        if p.is_dir():
            csvs.extend(sorted(p.glob('*.csv')))
            continue
        if p.suffix.lower() != '.csv':
            cand = p.with_suffix('.csv')
            if cand.exists():
                p = cand
        if p.exists() and p.is_file() and p.suffix.lower() == '.csv':
            csvs.append(p)
    if not csvs:
        raise ValueError("No CSVs resolved from 'files'.")
    dfs = []
    for f in csvs:
        d = load_one(Path(f))
        if d is not None:
            dfs.append(preprocess_df(d))
    if not dfs:
        raise ValueError("No valid dataframes built. Check text column detection.")
    return pd.concat(dfs, ignore_index=True)

# Build corpus
df_nmf = _resolve_df_with_final_text()
corpus = df_nmf['final_text'].fillna('')

# TF-IDF
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.9,
    sublinear_tf=True,
    lowercase=False  # already lowercased in cleaning
)
X = tfidf.fit_transform(corpus)
print("TF-IDF shape:", X.shape)

import numpy as np
feat = tfidf.get_feature_names_out()
max_scores = X.max(axis=0).toarray().ravel()
top_idx = np.argsort(max_scores)[::-1][:10]
print("\nTop 10 terms (by max TF-IDF across docs):")
for i in top_idx:
    print(f"{feat[i]} -> {max_scores[i]:.4f}")

joblib.dump(tfidf, 'tfidf_vectorizer.joblib')

TF-IDF shape: (211225, 253874)

Top 10 terms (by max TF-IDF across docs):
police -> 1.0000
special -> 1.0000
reasonable -> 1.0000
happen -> 1.0000
hallelujah -> 1.0000
poisonous -> 1.0000
hamberders -> 1.0000
specifically -> 1.0000
handsome -> 1.0000
calm -> 1.0000


['tfidf_vectorizer.joblib']
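
Every term in the top-10 list above scores exactly 1.0000. A likely explanation is row normalisation: `TfidfVectorizer` applies L2 normalisation by default, so a document with a single surviving term gets a weight of 1 for that term. The pure-Python sketch below reproduces the weighting under stated assumptions: sublinear term frequency as set in the cell, the library's default smoothed IDF, and no `min_df`, `max_df` or bigram handling. The function name `tfidf_l2` and the small corpus are made up for illustration.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """TF-IDF with sublinear tf, smoothed idf and L2 row normalisation."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        # Sublinear tf: 1 + ln(count)
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Made-up corpus: the first document contains a single term.
docs = [["police"], ["police", "car", "chase"], ["car", "car", "wash"]]
rows = tfidf_l2(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 4) for t, w in row.items()})
```

The single-term document gets a weight of 1 regardless of how common its term is. A ranking by maximum score therefore tends to surface terms that happen to appear alone in some document. Averaging or summing the scores over documents may give a more informative ranking.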

## Non-negative Matrix Factorization (NMF)

After computing the TF-IDF matrix, we factorise it into a document-topic matrix and a topic-term matrix, both with non-negative entries. This surfaces latent topics across the corpus and shows how strongly each document is associated with each topic.
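
The factorisation can be illustrated on a tiny matrix with the classic multiplicative update rules. This is a sketch under stated assumptions, not the solver the notebook runs: scikit-learn's `NMF` defaults to coordinate descent and offers multiplicative updates through `solver='mu'`. The helper names and the 4 x 4 matrix are made up for illustration.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factorise non-negative V (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = [frobenius_error(V, W, H)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
        errors.append(frobenius_error(V, W, H))
    return W, H, errors

# Made-up document-term weights: docs 0-1 use terms 0-1, docs 2-3 use terms 2-3.
V = [[1.0, 0.8, 0.0, 0.0],
     [0.9, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.7],
     [0.0, 0.0, 0.6, 1.0]]

W, H, errors = nmf_multiplicative(V, k=2)
dominant_topic = [row.index(max(row)) for row in W]  # same idea as W.argmax(axis=1)
print("dominant topic per doc:", dominant_topic)
print("reconstruction error: %.4f -> %.4f" % (errors[0], errors[-1]))
```

Each row of `W` says how strongly a document loads on each topic, and each row of `H` says which terms define a topic. Taking the largest entry per row of `W` corresponds to the dominant-topic assignment.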

In [9]:
from sklearn.decomposition import NMF
import joblib

# Ensure prerequisites exist
assert 'X' in globals() and 'tfidf' in globals() and 'df_nmf' in globals(), "Run the TF-IDF cell first."

# NMF
n_topics = 30  # Split into 30 topics
nmf = NMF(
    n_components=n_topics,
    init='nndsvd',
    random_state=42,
    max_iter=400,
    alpha_W=0.0,
    l1_ratio=0.0
)
W = nmf.fit_transform(X)  # doc-topic matrix
H = nmf.components_      # topic-term matrix
feat = tfidf.get_feature_names_out()

# Show top terms per topic
topn = 12
for k, comp in enumerate(H):
    top_idx = np.argsort(comp)[::-1][:topn]
    terms = [feat[i] for i in top_idx]
    print(f"Topic {k}: {', '.join(terms)}")

# Attach dominant topic to dataframe
df_nmf['nmf_topic'] = W.argmax(axis=1)
df_nmf['nmf_strength'] = W.max(axis=1)

# Persist artifacts
joblib.dump(nmf, 'nmf_model.joblib')
joblib.dump(W, 'nmf_doc_topic.joblib')
joblib.dump(H, 'nmf_topic_term.joblib')
df_nmf.to_csv('df_with_nmf.csv', index=False)

print("NMF done. Doc-topic shape:", W.shape)

Topic 0: name, name name, like name, love name, thank name, name love, name would, damn, hate, miss name, think name, miss
Topic 1: thank, thank name, thank much, thank sharing, appreciate, thank service, sharing, awesome thank, advice, thank thank, name thank, service
Topic 2: like, like name, sound, sound like, look like, feel like, really like, seems like, seems, like one, name like, something
Topic 3: love, love name, name love, would love, love username, username, love see, really love, love guy, thanks love, love love, got love
Topic 4: thanks, thanks name, thanks man, thanks sharing, sharing, thanks hate, thanks much, thanks info, info, cool, appreciate, advice
Topic 5: good, luck, good luck, good one, good job, job, good know, really good, name good, good thing, good name, good idea
Topic 6: think, people, mean, think name, way, even, need, thing, still, wrong, hate, lot
Topic 7: happy, day, cake, cake day, happy cake, birthday, happy birthday, happy new, new year, make happy, 

## Best Match 25 (BM25)

BM25 is a ranking function that scores the relevance of a document to a given query. It combines term frequency (TF) with saturation, inverse document frequency (IDF), and document length normalisation.
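
The scoring can be sketched in a few lines of pure Python. This is an illustrative Okapi BM25 variant with a non-negative IDF term, not a reimplementation of `rank_bm25`. Library implementations may compute the IDF term differently. The function name `bm25_scores` and the tokenised documents are assumptions for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the query with an Okapi BM25 variant."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption of this sketch).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Documents longer than average are penalised, shorter ones boosted.
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Made-up tokenised documents.
docs = [
    ["spicy", "water"],
    ["interesting", "fresh", "water", "dilute", "salt", "water"],
    ["water", "long", "rambling", "comment", "many", "other", "word", "here"],
    ["nothing", "relevant"],
]
scores = bm25_scores(["water"], docs)
for doc, score in zip(docs, scores):
    print(f"{score:.4f}  {' '.join(doc)}")
```

With the same term frequency, the shorter document scores higher than the longer one, and a document that lacks the query term scores zero. The parameter `k1` controls how quickly repeated occurrences saturate, and `b` controls the strength of the length normalisation.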

In [10]:
import sys, subprocess
try:
    from rank_bm25 import BM25Okapi
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "rank-bm25"])
    from rank_bm25 import BM25Okapi

import numpy as np

def _df_for_bm25():
    glb = globals()
    if 'df_proc' in glb and isinstance(glb['df_proc'], pd.DataFrame):
        return glb['df_proc']
    try:
        return _resolve_df_with_final_text()
    except Exception as e:
        raise ValueError("No dataframe available. Run preprocessing first.") from e

def _docs_tokens(df_base: pd.DataFrame):
    if 'final_text' in df_base.columns:
        return df_base['final_text'].fillna('').str.split().tolist()
    if 'tokens' in df_base.columns:
        return df_base['tokens'].tolist()
    if 'clean_text' in df_base.columns:
        return df_base['clean_text'].fillna('').apply(word_tokenize).tolist()
    return df_base['text'].fillna('').apply(lambda s: word_tokenize(basic_clean(s))).tolist()

def _prep_query_tokens(q: str):
    q_clean = basic_clean(q)
    toks = tokenize_filter(q_clean)
    return lemmatize_tokens(toks)

# Build index once
df_bm = _df_for_bm25()
docs_tokens = _docs_tokens(df_bm)
bm25 = BM25Okapi(docs_tokens, k1=1.5, b=0.75)

def bm25_search_rank(df_base: pd.DataFrame, query: str, top_k=10):
    q_toks = _prep_query_tokens(query)
    scores = bm25.get_scores(q_toks)
    top_idx = np.argsort(scores)[::-1][:top_k]
    out = df_base.iloc[top_idx].copy()
    out['bm25_score'] = np.array(scores)[top_idx]
    cols = [c for c in ['bm25_score', 'final_text', 'text', 'source_file'] if c in out.columns]
    return out[cols]

print("'final_text' is the post-processed text used for BM25 search.\n'text' is the original text before preprocessing.\n")
# Input word query here
results = bm25_search_rank(df_bm, "water", top_k=10)
print(results.head(10))

'final_text' is the post-processed text used for BM25 search.
'text' is the original text before preprocessing.

        bm25_score                                 final_text  \
16657     8.963676  interesting fresh water dilute salt water   
55744     8.963676  interesting fresh water dilute salt water   
185409    8.963676  interesting fresh water dilute salt water   
165011    8.963676  interesting fresh water dilute salt water   
102250    8.963676  interesting fresh water dilute salt water   
134479    8.911442                                spicy water   
155168    8.911442                               filled water   
87151     8.911442                                spicy water   
75331     8.911442                               filled water   
144509    8.911442                                spicy water   

                                                     text  \
16657   It's very interesting that the fresh water doe...   
55744   It's very interesting that the fresh water doe..