# Phase 3 – Text Cleaning & Preprocessing

In this notebook I clean and normalize tweet text from `data/tweets_raw.csv` to prepare it for:
- Sentiment analysis
- Topic modeling
- Named Entity Recognition (NER)

Key steps:
1. Load raw data  
2. Normalize text (lowercase, remove URLs, mentions, hashtags, punctuation)  
3. Remove stopwords  
4. Lemmatize words using spaCy  
5. Save cleaned dataset as `data/tweets_cleaned.csv`
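
As a rough illustration of step 2, the normalization could look like the sketch below. The function name `normalize_text` and the regular expressions are my own assumptions, not a fixed recipe. In particular, this version drops hashtags entirely (including the word after `#`) and strips every character outside `a-z`, `0-9` and whitespace, which also removes accented letters and emoji.

```python
import re

# Assumed patterns for this sketch; adjust them to the data at hand
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def normalize_text(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, hashtags and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove links first, before punctuation is touched
    text = MENTION_RE.sub(" ", text)    # remove @user mentions
    text = HASHTAG_RE.sub(" ", text)    # remove #hashtags, including the tag word
    text = NON_ALNUM_RE.sub(" ", text)  # remove punctuation and other symbols
    return " ".join(text.split())       # collapse repeated whitespace


print(normalize_text("Check THIS out @user #Nigeria https://t.co/abc, now!"))
```

If the hashtag words carry signal for topic modeling, replacing only the `#` character instead of the whole tag may be the better choice.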


In [1]:
import os
import re
import pandas as pd

# Path handling: the notebook runs from /notebooks, so the project root
# (which contains data/) is the parent of the current working directory
project_root = os.path.dirname(os.getcwd())        # parent of /notebooks
data_path_raw = os.path.join(project_root, "data", "tweets_raw.csv")

print("Loading from:", data_path_raw)
df = pd.read_csv(data_path_raw)

df.head()


Loading from: /Users/bonaventure/projects/trump-nigeria-sentiment-analysis/data/tweets_raw.csv


| Unnamed: 0 | date | username | text | likes | retweets |
|---|---|---|---|---|---|
| 0 | 2025-11-02T12:48:00 | AnalystEU | Trump is right to warn Nigeria. Someone has to... | 486 | 194 |
| 1 | 2025-11-03T21:55:00 | DiplomatDesk | Trump threatens to cut aid and consider action... | 167 | 55 |
| 2 | 2025-11-03T02:41:00 | NorthCentralVoice | Before Trump reaches Nigeria, fuel go finish f... | 106 | 34 |
| 3 | 2025-11-04T05:47:00 | NaijaLaw | Context matters: both Christians and Muslims h... | 542 | 164 |
| 4 | 2025-11-02T12:11:00 | HumanRightsWatch | Whatever our internal issues, military threats... | 762 | 175 |
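
If the `Unnamed: 0` column in the preview is a row index that was written into the CSV (rather than a display artifact), it can be absorbed at load time. The sketch below uses a tiny in-memory CSV with made-up rows, since I am only assuming how the real file is laid out. It also parses `date` into a proper datetime column.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics a CSV saved with its index
sample_csv = io.StringIO(
    ",date,text\n"
    "0,2025-11-02T12:48:00,hello\n"
    "1,2025-11-03T21:55:00,world\n"
)

# index_col=0 treats the first (unnamed) column as the index;
# parse_dates converts the ISO timestamps to datetime64
df_sample = pd.read_csv(sample_csv, index_col=0, parse_dates=["date"])

print(df_sample.columns.tolist())
print(df_sample.dtypes)
```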


In [1]:
import spacy

# Load English model
nlp = spacy.load("en_core_web_sm")

# Get spaCy stopwords list
spacy_stopwords = nlp.Defaults.stop_words
len(spacy_stopwords)


326
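
For steps 3 and 4, a minimal sketch might look like this. The helper names `remove_stopwords` and `lemmatize` are hypothetical. `lemmatize` expects the loaded spaCy pipeline (`nlp`) to be passed in and relies on the token attributes `lemma_`, `is_stop`, `is_punct` and `is_space`.

```python
def remove_stopwords(text, stopwords):
    """Drop whitespace-separated tokens that appear in `stopwords`."""
    return " ".join(tok for tok in text.split() if tok not in stopwords)


def lemmatize(text, nlp):
    """Return space-joined lemmas, skipping stopwords, punctuation and whitespace."""
    doc = nlp(text)
    return " ".join(
        tok.lemma_
        for tok in doc
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    )


# Small stand-in stopword set for the demo; in the notebook, pass spacy_stopwords
print(remove_stopwords("trump is right to warn nigeria", {"is", "to"}))
```

In the notebook these would be called as `remove_stopwords(text, spacy_stopwords)` and `lemmatize(text, nlp)`. Because `lemmatize` already filters on `is_stop`, running both on the same text is redundant, so one of the two is probably enough.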

# Define 