# Phase 3 – Text Cleaning & Preprocessing

In this notebook I clean and normalize tweet text from `data/tweets_raw.csv` to prepare it for:
- Sentiment analysis
- Topic modeling
- Named Entity Recognition (NER)

Key steps:
1. Load raw data  
2. Normalize text (remove URLs, mentions, `#` symbols, and non-letter characters; lowercase)  
3. Remove stopwords  
4. Lemmatize words using spaCy  
5. Save cleaned dataset as `data/tweets_cleaned.csv`


In [4]:
import os
import re
import pandas as pd

# Path handling – notebook is inside /notebooks, data is one level up
project_root = os.path.dirname(os.getcwd())        # go up from /notebooks
data_path_raw = os.path.join(project_root, "data", "tweets_raw.csv")

print("Loading from:", data_path_raw)
df = pd.read_csv(data_path_raw)

df.head()


Loading from: /Users/bonaventure/projects/trump-nigeria-sentiment-analysis/data/tweets_raw.csv


Unnamed: 0,date,username,text,likes,retweets
0,2025-11-02T12:48:00,AnalystEU,Trump is right to warn Nigeria. Someone has to...,486,194
1,2025-11-03T21:55:00,DiplomatDesk,Trump threatens to cut aid and consider action...,167,55
2,2025-11-03T02:41:00,NorthCentralVoice,"Before Trump reaches Nigeria, fuel go finish f...",106,34
3,2025-11-04T05:47:00,NaijaLaw,Context matters: both Christians and Muslims h...,542,164
4,2025-11-02T12:11:00,HumanRightsWatch,"Whatever our internal issues, military threats...",762,175


In [7]:
import spacy

# Load English model
nlp = spacy.load("en_core_web_sm")

# Get spaCy stopwords list
spacy_stopwords = nlp.Defaults.stop_words
len(spacy_stopwords)


326
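
A point worth checking before sentiment analysis: spaCy's English stop list includes negation words such as "not", so removing every stopword can strip the negation out of a tweet. The sketch below shows one possible way to keep a small set of negations while filtering. The function name `filter_stopwords`, the `NEGATIONS` set, and the tiny `demo_stopwords` set are assumptions made for illustration and are not part of spaCy; in this notebook `spacy_stopwords` would be passed instead.

```python
from typing import Iterable, List, Set

# Hypothetical set of negation words to keep (an assumption; adjust as needed)
NEGATIONS: Set[str] = {"not", "no", "never", "nor"}


def filter_stopwords(
    tokens: Iterable[str],
    stop_words: Set[str],
    keep: Set[str] = NEGATIONS,
) -> List[str]:
    """Drop stopwords, but keep any token listed in `keep`."""
    return [t for t in tokens if t in keep or t not in stop_words]


# Tiny illustrative stop list; in the notebook, pass spacy_stopwords instead
demo_stopwords = {"we", "not", "from", "the"}
tokens = "we need real reforms not loud threats from abroad".split()

print(filter_stopwords(tokens, demo_stopwords))
```

Note that the cleaning function in this notebook also drops tokens shorter than three characters, so a two-letter negation like "no" would be removed by that rule regardless of the stop list.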

## Define the cleaning function

In [5]:
def clean_text(text: str) -> str:
    """
    Clean and lemmatize tweet text.
    Steps:
    - Remove URLs, mentions, and hashtag symbols ('#')
    - Keep only letters and spaces
    - Lowercase text
    - Tokenize with spaCy
    - Remove stopwords and short tokens
    - Lemmatize tokens
    """
    if not isinstance(text, str):
        return ""

    # Remove URLs
    text = re.sub(r"http\S+|www\.\S+", " ", text)

    # Remove @mentions
    text = re.sub(r"@\w+", " ", text)

    # Remove '#' symbol but keep the word (e.g. #Nigeria -> Nigeria)
    text = text.replace("#", " ")

    # Keep only letters and spaces
    text = re.sub(r"[^A-Za-z\s]", " ", text)

    # Lowercase
    text = text.lower()

    # Process with spaCy
    doc = nlp(text)

    tokens = []
    for token in doc:
        # Skip stopwords, punctuation, whitespace, and very short tokens
        if (
            token.is_stop
            or token.is_punct
            or token.is_space
            or len(token.text) < 3
        ):
            continue

        # Use lemma (base form)
        lemma = token.lemma_.strip()
        if lemma:
            tokens.append(lemma)

    # Join tokens back to a single string
    return " ".join(tokens)
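
To see what the regex steps do before spaCy gets involved, here is a minimal sketch that applies the same substitutions and lowercasing to a made-up tweet. The helper name `normalize_only` and the sample text are hypothetical; the patterns are copied from `clean_text` above. The final whitespace collapse is an addition for readability (in `clean_text`, spaCy's `is_space` check handles the extra spaces).

```python
import re


def normalize_only(text: str) -> str:
    """Regex-only part of the cleaning: no stopword removal or lemmatization."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)              # drop @mentions
    text = text.replace("#", " ")                  # keep the hashtag word, drop '#'
    text = re.sub(r"[^A-Za-z\s]", " ", text)       # drop digits, punctuation, emoji
    text = text.lower()
    return " ".join(text.split())                  # collapse repeated whitespace


sample = "@user Trump's 2 threats are NOT helping #Nigeria! https://example.com/x"
print(normalize_only(sample))
# trump s threats are not helping nigeria
```

The stray `s` left over from `Trump's` is one reason the short-token filter (`len(token.text) < 3`) is useful. The `[^A-Za-z\s]` pattern also drops accented and non-Latin letters, which matters if tweets contain names or words written with them.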


In [6]:
# Apply cleaning to all tweets
df["clean_text"] = df["text"].apply(clean_text)

df[["text", "clean_text"]].head(10)


Unnamed: 0,text,clean_text
0,Trump is right to warn Nigeria. Someone has to...,trump right warn nigeria stand persecute chris...
1,Trump threatens to cut aid and consider action...,trump threaten cut aid consider action allege ...
2,"Before Trump reaches Nigeria, fuel go finish f...",trump reaches nigeria fuel finish plane nigeri...
3,Context matters: both Christians and Muslims h...,context matter christians muslim victim stop h...
4,"Whatever our internal issues, military threats...",internal issue military threat trump unaccepta...
5,Trump said 'guns blazing'? Abeg make una no us...,trump say gun blaze abeg una use country actio...
6,Reminder: violence in Nigeria is driven by mul...,reminder violence nigeria drive multiple facto...
7,Context matters: both Christians and Muslims h...,context matter christians muslim victim stop h...
8,"Before Trump reaches Nigeria, fuel go finish f...",trump reaches nigeria fuel finish plane prayfo...
9,"We need real reforms, not loud threats from ab...",need real reform loud threat abroad trump


In [9]:
# Sanity check
print("Total rows:", len(df))
print("Number of empty clean_text rows:", (df["clean_text"].str.len() == 0).sum())

# Look at a few random samples
df.sample(5, random_state=42)[["text", "clean_text"]]


Total rows: 2000
Number of empty clean_text rows: 0


Unnamed: 0,text,clean_text
1860,Trump is right to warn Nigeria. Someone has to...,trump right warn nigeria stand persecute chris...
353,"Before Trump reaches Nigeria, fuel go finish f...",trump reaches nigeria fuel finish plane prayfo...
1333,Na every issue Trump dey use Nigeria do conten...,issue trump dey use nigeria content rest small
905,Media repeating Trump's line without nuance is...,medium repeat trump line nuance misleading har...
1289,Na every issue Trump dey use Nigeria do conten...,issue trump dey use nigeria content rest small...


In [10]:
# Save clean dataset

data_path_clean = os.path.join(project_root, "data", "tweets_cleaned.csv")
df.to_csv(data_path_clean, index=False)

print("✅ Saved cleaned data to:", data_path_clean)


✅ Saved cleaned data to: /Users/bonaventure/projects/trump-nigeria-sentiment-analysis/data/tweets_cleaned.csv
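
One caveat when reloading the file in a later phase: by default `pd.read_csv` parses empty fields as `NaN`, so a `clean_text` value that came out empty would no longer be a string after a round trip. There are none in this run (0 empty rows), but the behavior can be sketched with an in-memory buffer and a made-up two-row frame standing in for the real file:

```python
import io

import pandas as pd

# Made-up frame with one empty clean_text value (illustration only)
demo = pd.DataFrame(
    {
        "text": ["!!!", "Trump warns Nigeria"],
        "clean_text": ["", "trump warn nigeria"],
    }
)

buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default parsing: the empty string comes back as NaN
buffer.seek(0)
default_load = pd.read_csv(buffer)
print(default_load["clean_text"].isna().sum())

# keep_default_na=False keeps empty fields as empty strings
buffer.seek(0)
string_load = pd.read_csv(buffer, keep_default_na=False)
print((string_load["clean_text"] == "").sum())
```

Calling `.fillna("")` on the column after loading is another option if the default NA handling is wanted for the other columns.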


## Phase 3 Summary

In this phase I:

- Loaded raw tweets from `data/tweets_raw.csv`
- Normalized the text (removed URLs, mentions, `#` symbols, and non-letter characters; lowercased)
- Tokenized, removed stopwords, and lemmatized using **spaCy**
- Created a `clean_text` column ready for:
  - Sentiment analysis
  - Topic modeling
  - NER
- Saved the processed dataset as `data/tweets_cleaned.csv`
