# USAID Sentiment Analysis in Kenya

# 1. Business Understanding

USAID has long played a major role in Kenya’s development — funding health, education, and governance programs. However, recent shifts in US foreign aid policy, including funding cuts and multiple project phaseouts, have sparked growing conversation and concern.

This project focuses on analyzing public and media sentiment **after these cuts or the scaling back of USAID programs**. The goal is to uncover:
- Public reaction to USAID’s funding changes
- Sentiment trends in both news media and online communities
- Common concerns, narratives, or misinformation emerging around USAID

These insights can support government and development stakeholders in understanding ground-level perception and refining their outreach or policy communication.

---

# 2. Data Understanding
## 2.1 Data Collection
We collected data from two main sources:
- **NewsAPI articles** referencing USAID and Kenya 
- **Reddit posts** from relevant subreddits discussing USAID-related topics



### 2.1.1 News Data Collection

### Overview

In [24]:
import pandas as pd
from glob import glob

# --- DIRECTORY PATH ---
data_dir = "../data/raw/news_data/"

# --- GET ALL CSV FILES IN THE DIRECTORY ---
csv_files = glob(data_dir + "*.csv")

# --- LOAD AND DISPLAY SUMMARY ---
news_dfs = {}
for file in csv_files:
    try:
        df = pd.read_csv(file)
        news_dfs[file] = df
        print(f"{file.split('/')[-1]}")
        print(f"   - Rows: {df.shape[0]}, Columns: {df.shape[1]}")
        print(f"   - Columns: {list(df.columns)}\n")
    except Exception as e:
        print(f"❌ Failed to load {file}: {e}")


Agatha_news_fulltext.csv
   - Rows: 562, Columns: 8
   - Columns: ['keyword', 'source', 'author', 'title', 'publishedAt', 'summary', 'text', 'url']

newsapi_usaid_articles.csv
   - Rows: 89, Columns: 6
   - Columns: ['title', 'description', 'url', 'publishedAt', 'source', 'content']

leo_newsapi_articles_enriched.csv
   - Rows: 99, Columns: 8
   - Columns: ['source', 'author', 'title', 'description', 'content', 'url', 'published_at', 'full_text']

Mbego_news_usaid_kenya_fulltext.csv
   - Rows: 24, Columns: 8
   - Columns: ['source', 'author', 'title', 'description', 'url', 'publishedAt', 'summary', 'full_text']

Agatha_news.csv
   - Rows: 592, Columns: 8
   - Columns: ['keyword', 'source', 'author', 'title', 'description', 'content', 'publishedAt', 'url']

ruth_news.csv
   - Rows: 20, Columns: 7
   - Columns: ['Unnamed: 0', 'source', 'title', 'description', 'content', 'url', 'publishedAt']

cecilia.newsapi.csv
   - Rows: 1787, Columns: 9
   - Columns: ['keyword', 'source', 'author', 't

### Collection

In [25]:

# Get all CSV files in the news_data folder
news_files = glob("../data/raw/news_data/*.csv")

# Final columns to keep
final_columns = ['source', 'title', 'description', 'text', 'url', 'keyword', 'published_date']

# List to store clean DataFrames
merged_dfs = []

for file in news_files:
    print(f"Processing: {file.split('/')[-1]}")
    df = pd.read_csv(file)

    # Remove unnamed index columns if any
    df = df.loc[:, ~df.columns.str.contains("^Unnamed")]

    # Determine which text column to use
    if 'full_text' in df.columns:
        df['text'] = df['full_text']
    elif 'text' in df.columns:
        pass  # Use existing 'text'
    else:
        print(f"Skipped (no text/full_text found): {file}")
        continue

    # Drop rows where text is fully missing or blank
    df = df[df['text'].notna() & (df['text'].str.strip() != "")]

    # Rename date columns
    df = df.rename(columns={
        'publishedAt': 'published_date',
        'published_at': 'published_date'
    })

    # Add missing expected columns with None
    for col in final_columns:
        if col not in df.columns:
            df[col] = None

    # Restrict to only the required final columns
    df = df[final_columns]

    # Fill missing keywords
    df['keyword'] = df['keyword'].fillna("Unknown")

    # Convert to datetime
    df['published_date'] = pd.to_datetime(df['published_date'], errors='coerce')

    # Drop rows without title or url (minimal metadata)
    df = df.dropna(subset=['url', 'title'])

    # Add cleaned DataFrame to list
    merged_dfs.append(df)

# Concatenate and deduplicate
combined_df = pd.concat(merged_dfs, ignore_index=True)
combined_df.drop_duplicates(subset='url', inplace=True)

# Save the merged file
combined_df.to_csv("../data/processed/Leo_merged_news_dataset.csv", index=False)
print(f"Merged News dataset saved with shape: {combined_df.shape}")


Processing: Agatha_news_fulltext.csv
Processing: newsapi_usaid_articles.csv
Skipped (no text/full_text found): ../data/raw/news_data/newsapi_usaid_articles.csv
Processing: leo_newsapi_articles_enriched.csv
Processing: Mbego_news_usaid_kenya_fulltext.csv
Processing: Agatha_news.csv
Skipped (no text/full_text found): ../data/raw/news_data/Agatha_news.csv
Processing: ruth_news.csv
Skipped (no text/full_text found): ../data/raw/news_data/ruth_news.csv
Processing: cecilia.newsapi.csv
Skipped (no text/full_text found): ../data/raw/news_data/cecilia.newsapi.csv
Processing: Mbego_news_usaid_kenya_recent.csv
Skipped (no text/full_text found): ../data/raw/news_data/Mbego_news_usaid_kenya_recent.csv
Merged News dataset saved with shape: (471, 7)


### 2.1.2 Reddit Data Collection
### Overview

In [26]:
# Get all Reddit CSVs from folder
reddit_files = glob("../data/raw/reddit_data/*.csv")

# Display shape and columns of each
reddit_dfs = {}
for file in reddit_files:
    try:
        df = pd.read_csv(file)
        reddit_dfs[file] = df
        print(f"{file.split('/')[-1]}")
        print(f"   - Rows: {df.shape[0]}, Columns: {df.shape[1]}")
        print(f"   - Columns: {list(df.columns)}\n")
    except Exception as e:
        print(f"Failed to load {file}: {e}")


reddit_usaid_sentiment.csv
   - Rows: 17, Columns: 7
   - Columns: ['subreddit', 'title', 'score', 'url', 'created_utc', 'num_comments', 'selftext']

Mbego_reddit_usaid_kenya2.csv
   - Rows: 163, Columns: 6
   - Columns: ['title', 'score', 'url', 'created', 'subreddit', 'selftext']

Mbego_reddit_usaid_kenya.csv
   - Rows: 17, Columns: 6
   - Columns: ['title', 'score', 'url', 'created', 'subreddit', 'selftext']

cecilia.redditsubs.csv
   - Rows: 247, Columns: 9
   - Columns: ['subreddit', 'keyword', 'title', 'text', 'date_posted', 'upvotes', 'comments', 'url', 'permalink']

leo_reddit_posts.csv
   - Rows: 150, Columns: 10
   - Columns: ['subreddit', 'search_term', 'title', 'text', 'created_utc', 'created_date', 'score', 'num_comments', 'permalink', 'url']

cecilia.reddit_nbo_ke_africa.csv
   - Rows: 29, Columns: 9
   - Columns: ['subreddit', 'keyword', 'title', 'text', 'date_posted', 'upvotes', 'comments', 'url', 'permalink']

reddit_usaid_kenya.csv
   - Rows: 17, Columns: 6
   - Colum

In [27]:
# Get all Reddit CSV files
reddit_files = glob("../data/raw/reddit_data/*.csv")

# Final columns to standardize
final_columns = ['subreddit', 'title', 'text', 'url', 'created_date', 'keyword']

# Store cleaned DataFrames
merged_dfs = []

for file in reddit_files:
    print(f"Processing: {file.split('/')[-1]}")
    df = pd.read_csv(file)

    # Drop any unnamed index column
    df = df.loc[:, ~df.columns.str.contains("^Unnamed")]

    # Normalize relevant columns
    df = df.rename(columns={
        'selftext': 'text',
        'search_term': 'keyword',
        'date_posted': 'created_date',
        'created': 'created_date'
    })

    # Handle created_utc if present
    if 'created_utc' in df.columns:
        df['created_utc'] = pd.to_datetime(df['created_utc'], unit='s', errors='coerce')
        df['created_date'] = df['created_utc']

    # Skip file if it has no text column ('selftext' was renamed to 'text' above) or the column is entirely empty
    if 'text' not in df.columns or df['text'].isna().all():
        print(f"Skipped (no usable text): {file.split('/')[-1]}")
        continue

    # Keep only final columns (fill missing ones with None)
    for col in final_columns:
        if col not in df.columns:
            df[col] = None

    df = df[final_columns]

    # Filter out rows with missing or empty text
    df = df[df['text'].notna() & (df['text'].str.strip() != "")]

    # Fill missing keywords
    df['keyword'] = df['keyword'].fillna("Unknown")

    # Parse dates safely
    df['created_date'] = pd.to_datetime(df['created_date'], errors='coerce')

    # Optional: Keep rows even if title or url are missing (for exploratory flexibility)
    merged_dfs.append(df)

# Merge and deduplicate
combined_df = pd.concat(merged_dfs, ignore_index=True)
combined_df.drop_duplicates(subset='url', inplace=True)

# Save cleaned dataset
combined_df.to_csv("../data/processed/Leo_merged_reddit_dataset.csv", index=False)
print(f"Merged Reddit dataset saved with shape: {combined_df.shape}")


Processing: reddit_usaid_sentiment.csv
Processing: Mbego_reddit_usaid_kenya2.csv
Processing: Mbego_reddit_usaid_kenya.csv
Processing: cecilia.redditsubs.csv
Processing: leo_reddit_posts.csv
Processing: cecilia.reddit_nbo_ke_africa.csv
Processing: reddit_usaid_kenya.csv
Processing: Agatha_reddit.csv
Processing: ruth_reddit.csv
Merged Reddit dataset saved with shape: (542, 6)
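
The Reddit merge cell keeps rows even when `url` is missing, but then deduplicates on `url`. In pandas, `drop_duplicates` treats missing values as equal to one another, so rows without a URL collapse into a single row. The sketch below uses a small made-up frame (the values are hypothetical, not taken from our data) to show the effect and one possible alternative that deduplicates only rows that actually have a URL.

```python
import pandas as pd

# Toy frame (hypothetical values): one repeated URL and two rows with no URL
posts = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/a", None, None],
    "text": ["first copy", "second copy", "no url 1", "no url 2"],
})

# Plain deduplication: the two missing URLs count as duplicates of each other
naive = posts.drop_duplicates(subset="url")

# Alternative: keep every row without a URL, deduplicate only rows that have one
keep = posts["url"].isna() | ~posts.duplicated(subset="url")
careful = posts[keep]

print(len(naive), len(careful))  # prints: 2 3
```

Whether rows without a URL should be kept at all is a judgment call; the point is only that the two approaches give different row counts.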


### 2.1.3 Unified Data Collection

- The group agreed to use joint datasets stored in the `data/processed/news_data` and `data/processed/reddit_data` subfolders.


In [28]:
reddit_data = pd.read_csv('../data/processed/reddit_data/reddit_data.csv')
news_data = pd.read_csv('../data/processed/news_data/news_data.csv')

### 2.1.4 Data Overview

In [29]:
# Function to check data overview
def data_overview(df):
    print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns\n")
    df.info()  # info() prints its report directly and returns None, so it needs no display()
    print("\n---Missing Values---\n")
    display(df.isna().sum())
    print("\n---Sample---\n")
    display(df.head())


### News Overview

In [30]:
data_overview(news_data)

The dataset has 2549 rows and 7 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2549 entries, 0 to 2548
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   title           2549 non-null   object
 1   description     2533 non-null   object
 2   text            2524 non-null   object
 3   url             2547 non-null   object
 4   keyword         2379 non-null   object
 5   published_date  2450 non-null   object
 6   source_file     2549 non-null   object
dtypes: object(7)
memory usage: 139.5+ KB




---Missing Values---



title               0
description        16
text               25
url                 2
keyword           170
published_date     99
source_file         0
dtype: int64


---Sample---



Unnamed: 0,title,description,text,url,keyword,published_date,source_file
0,Has DOGE really saved the US government $180bn?,Elon Musk first claimed the department would m...,President Donald Trump and adviser Elon Musk c...,https://www.aljazeera.com/news/2025/6/6/has-do...,usaid kenya,2025-06-06,Agatha_news.csv
1,The Life Story of Ecomobilus Technologies Limi...,By Prof Geoffrey Gitau Here is a story showcas...,By Prof Geoffrey Gitau\r\nHere is a story show...,https://cleantechnica.com/2025/05/26/the-life-...,usaid kenya,2025-05-26,Agatha_news.csv
2,"Death, Sexual Violence and Human Trafficking: ...",by Brett Murphy and Anna Maria Barry-Jester \n...,ProPublica is a nonprofit newsroom that invest...,https://www.propublica.org/article/trump-usaid...,usaid kenya,2025-05-28,Agatha_news.csv
3,Congress Should Quickly Approve Trump’s Rescis...,President Donald Trump‘s rescission legislatio...,President Donald Trumps rescission legislation...,https://www.dailysignal.com/2025/06/10/congres...,usaid kenya,2025-06-10,Agatha_news.csv
4,Food Safety Depends On Every Link In The Suppl...,Almost 1 in 10 people globally fall ill from c...,Colorful fish and vegetables can be purchased ...,https://www.forbes.com/sites/daniellenierenber...,usaid kenya,2025-06-06,Agatha_news.csv


### News Dataset Summary

The merged news dataset contains **2,549 articles** with 7 columns. The `title` column is complete, and `url` and `text` are nearly complete, which matters because these fields are essential for sentiment analysis. The `description` (16), `keyword` (170), and `published_date` (99) columns have missing values.

- The `text` column is mostly intact, with only 25 missing entries (less than 1%), making the dataset suitable for text-based analysis.
- The `keyword` field is missing for 170 articles (about 7%) and can be filled later through data engineering if needed.
- The `source_file` column is retained for traceability, in case we need to trace quality issues or source bias back to a specific file.

This dataset is rich enough for sentiment and thematic analysis on media coverage surrounding USAID in Kenya.
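
The percentages quoted in this summary can be checked with a few lines of arithmetic. The sketch below copies the missing-value counts and the row total from the overview output above into a small Series; on the live DataFrame, `news_data.isna().mean() * 100` should give the same shares.

```python
import pandas as pd

total_rows = 2549  # row count reported in the news overview output

# Missing-value counts copied from the overview output
missing = pd.Series(
    {"description": 16, "text": 25, "url": 2, "keyword": 170, "published_date": 99}
)

# Share of rows with a missing value, as a percentage
missing_pct = (missing / total_rows * 100).round(2)
print(missing_pct)
```

This gives roughly 0.98% for `text` and 6.67% for `keyword`.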


### Reddit Overview

In [31]:
data_overview(reddit_data)

The dataset has 1289 rows and 15 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1289 entries, 0 to 1288
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         1289 non-null   object 
 1   selftext      901 non-null    object 
 2   subreddit     1289 non-null   object 
 3   author        466 non-null    object 
 4   created_utc   1013 non-null   object 
 5   created_date  150 non-null    object 
 6   score         1013 non-null   float64
 7   num_comments  833 non-null    float64
 8   keyword       742 non-null    object 
 9   search_term   150 non-null    object 
 10  date_posted   276 non-null    object 
 11  upvotes       276 non-null    float64
 12  comments      276 non-null    float64
 13  url           1289 non-null   object 
 14  permalink     426 non-null    object 
dtypes: float64(4), object(11)
memory usage: 151.2+ KB




---Missing Values---



title              0
selftext         388
subreddit          0
author           823
created_utc      276
created_date    1139
score            276
num_comments     456
keyword          547
search_term     1139
date_posted     1013
upvotes         1013
comments        1013
url                0
permalink        863
dtype: int64


---Sample---



Unnamed: 0,title,selftext,subreddit,author,created_utc,created_date,score,num_comments,keyword,search_term,date_posted,upvotes,comments,url,permalink
0,"USAID left a month ago, do we have ARVs in Kenya?",Someone on a different group (different websit...,Kenya,muerki,2025-04-15 13:16:53,,3.0,5.0,usaid kenya,,,,,https://www.reddit.com/r/Kenya/comments/1jzrn2...,
1,Classism in r/Kenya and r/nairobi,The classism I'm seeing in both subs is a good...,Kenya,Morio_anzenza,2025-04-07 04:21:12,,169.0,95.0,usaid kenya,,,,,https://www.reddit.com/r/Kenya/comments/1jtcvb...,
2,EX-USAID people!! Let's talk,Are you still in contact with the organisation...,Kenya,vindtar,2025-04-05 19:09:10,,2.0,2.0,usaid kenya,,,,,https://www.reddit.com/r/Kenya/comments/1jsb14...,
3,Why western powers back Israel no matter what ...,"I don't care what good book you read, but it's...",Kenya,Gold_Smart,2025-03-25 08:18:04,,13.0,20.0,usaid kenya,,,,,https://www.reddit.com/r/Kenya/comments/1jjehw...,
4,Is kenya capable of funding its needs now that...,How is kenya prepared to fill the vacuum of US...,Kenya,westmaxia,2025-03-08 08:08:58,,1.0,6.0,usaid kenya,,,,,https://www.reddit.com/r/Kenya/comments/1j6cjz...,


### Reddit Dataset Summary

The merged Reddit dataset contains **1,289 posts** and 15 columns. The dataset includes metadata such as `subreddit`, `author`, `score`, and `num_comments`, which can provide context beyond the post content.

- The main text content comes from the `selftext` field, which has **388 missing values**, meaning about 70% of posts have usable body content.
- Timestamp data is spread across `created_utc` (1,013 non-null), `created_date` (150 non-null), and `date_posted` (276 non-null), so the three columns must be combined carefully before they can support temporal sentiment trends.
- Several fields like `author`, `keyword`, `search_term`, and engagement metrics (`upvotes`, `comments`) have missing values but can be optionally used depending on the analytical direction.

Despite sparsity in some fields, this dataset captures a wide range of public sentiment and discourse related to USAID, especially useful for assessing grassroots reactions after funding changes.
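
Some columns in the overview look like the same information under different names, for example `keyword` and `search_term`, or `score` (1,013 non-null) and `upvotes` (276 non-null), whose non-null counts add up to the 1,289 rows. That pairing is our assumption, based on the column names and counts rather than on the contributors' collection scripts. A minimal sketch of folding each pair into one column with `combine_first`, shown on a made-up toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking two export schemas
toy = pd.DataFrame({
    "keyword": ["usaid kenya", None, None],
    "search_term": [None, "usaid", None],
    "score": [3.0, np.nan, np.nan],
    "upvotes": [np.nan, 12.0, np.nan],
})

# Assumed pairing of equivalent columns: preferred name -> fallback name
pairs = {"keyword": "search_term", "score": "upvotes"}

for preferred, fallback in pairs.items():
    # Take the preferred column and fill its gaps from the fallback column
    toy[preferred] = toy[preferred].combine_first(toy[fallback])

# The fallback columns are no longer needed once merged
toy = toy.drop(columns=list(pairs.values()))
print(toy)
```

Rows that are empty in both columns of a pair stay missing, so this reduces sparsity without inventing values.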


#  3. Data Cleaning
- The raw **news** and **reddit** data are now cleaned into a more structured and consistent format before any insights are drawn.

## 3.1 News Data Cleaning


In [32]:

print(f"Shape before cleaning: {news_data.shape}")


Shape before cleaning: (2549, 7)


In [33]:
import numpy as np
import re

# --- 1. Drop Duplicates (full rows, then by text, then by URL) ---
news_data = news_data.drop_duplicates()
news_data = news_data.drop_duplicates(subset=['text'])
news_data = news_data.drop_duplicates(subset=['url'])

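# --- 2. Remove Empty or Very Short Articles (fewer than 10 words) ---
# Note: astype(str) turns missing text into the string "nan", which the word-count filter below removes
news_data['text'] = news_data['text'].astype(str)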
news_data = news_data[news_data['text'].str.strip().astype(bool)]
news_data = news_data[news_data['text'].str.split().str.len() >= 10]

# --- 3. Fix Date Format (don't drop missing dates) ---
news_data['published_date'] = pd.to_datetime(news_data['published_date'], errors='coerce')
news_data = news_data[~(news_data['published_date'] > pd.Timestamp.now())]  # Drop only future dates

# --- 4. Clean and Normalize Text ---
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # URLs
    text = re.sub(r"<.*?>", '', text)                    # HTML tags
    text = re.sub(r"[@#]\w+", '', text)                  # Mentions/hashtags
    text = re.sub(r"[^a-z0-9\s\.,!?'\"]", '', text)      # Everything except a-z, 0-9, whitespace, basic punctuation
    text = re.sub(r"\s+", ' ', text).strip()             # Whitespace
    return text

news_data['text'] = news_data['text'].apply(clean_text)

# --- 5. Fill Missing Keywords ---
news_data['keyword'] = news_data['keyword'].fillna("unknown")

# --- 6. Mark if "kenya" is mentioned ---
news_data['mentions_kenya'] = news_data['text'].str.contains(r'\bkenya\b', case=False, na=False)

# --- 7. Drop Unnecessary columns ---
news_data = news_data.drop(columns= ['description','source_file','url'])

# Overview
data_overview(news_data)


The dataset has 1399 rows and 5 columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1399 entries, 0 to 2525
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   title           1399 non-null   object        
 1   text            1399 non-null   object        
 2   keyword         1399 non-null   object        
 3   published_date  1325 non-null   datetime64[ns]
 4   mentions_kenya  1399 non-null   bool          
dtypes: bool(1), datetime64[ns](1), object(3)
memory usage: 56.0+ KB




---Missing Values---



title              0
text               0
keyword            0
published_date    74
mentions_kenya     0
dtype: int64


---Sample---



Unnamed: 0,title,text,keyword,published_date,mentions_kenya
0,Has DOGE really saved the US government $180bn?,president donald trump and adviser elon musk c...,usaid kenya,2025-06-06,False
1,The Life Story of Ecomobilus Technologies Limi...,by prof geoffrey gitau here is a story showcas...,usaid kenya,2025-05-26,False
2,"Death, Sexual Violence and Human Trafficking: ...",propublica is a nonprofit newsroom that invest...,usaid kenya,2025-05-28,False
3,Congress Should Quickly Approve Trump’s Rescis...,president donald trumps rescission legislation...,usaid kenya,2025-06-10,False
4,Food Safety Depends On Every Link In The Suppl...,colorful fish and vegetables can be purchased ...,usaid kenya,2025-06-06,False
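
To see what `clean_text` and the `mentions_kenya` flag do in practice, the sketch below repeats the function from the cell above and runs it on three made-up strings (the inputs are hypothetical, not rows from our data). It shows two side effects worth keeping in mind: hashtags are removed whole, and a curly apostrophe is stripped as a symbol, so "Kenya’s" becomes "kenyas" and no longer matches `\bkenya\b`.

```python
import re

def clean_text(text):
    # Same steps as in the cleaning cell above
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # URLs
    text = re.sub(r"<.*?>", '', text)                    # HTML tags
    text = re.sub(r"[@#]\w+", '', text)                  # Mentions/hashtags
    text = re.sub(r"[^a-z0-9\s\.,!?'\"]", '', text)      # Other symbols
    text = re.sub(r"\s+", ' ', text).strip()             # Whitespace
    return text

# Hypothetical inputs chosen to exercise each cleaning step
samples = [
    "Visit https://example.com <b>USAID</b> cuts hit #Kenya hard!",
    "Kenya’s health programs",   # curly apostrophe
    "Kenya's health programs",   # straight apostrophe
]

for raw in samples:
    cleaned = clean_text(raw)
    mentions = bool(re.search(r"\bkenya\b", cleaned))
    print(repr(cleaned), mentions)
```

The three cleaned strings are `visit usaid cuts hit hard!`, `kenyas health programs`, and `kenya's health programs`, and only the last one is flagged as mentioning Kenya. The `mentions_kenya` column may therefore undercount articles that refer to Kenya only through a hashtag or a curly-quoted possessive.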


## 3.2 Reddit Data Cleaning

In [35]:
# Keep rows with valid selftext
reddit_data = reddit_data[reddit_data['selftext'].notna()]
reddit_data = reddit_data[reddit_data['selftext'].str.strip().astype(bool)]
reddit_data = reddit_data[reddit_data['selftext'].str.split().str.len() >= 3]

# Deduplicate by URL
reddit_data = reddit_data.drop_duplicates(subset=['url'])

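# Date handling
# Note: unit='s' expects epoch seconds; string timestamps such as "2025-04-15 13:16:53" are coerced to NaT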
reddit_data['created_utc_dt'] = pd.to_datetime(reddit_data['created_utc'], errors='coerce', unit='s')
reddit_data['created_date_dt'] = pd.to_datetime(reddit_data['created_date'], errors='coerce')
reddit_data['date_posted_dt'] = pd.to_datetime(reddit_data['date_posted'], errors='coerce')

reddit_data['created_date'] = (
    reddit_data['created_utc_dt']
    .fillna(reddit_data['created_date_dt'])
    .fillna(reddit_data['date_posted_dt'])
).dt.date

reddit_data['is_future_date'] = reddit_data['created_date'] > pd.Timestamp.now().date()

# Clean text
reddit_data['selftext'] = reddit_data['selftext'].apply(clean_text)

# Fill keyword
reddit_data['keyword'] = reddit_data['keyword'].fillna("unknown")

# Rename and drop
reddit_data = reddit_data.rename(columns={'selftext': 'text'})
cols_to_drop = [
    'author', 'created_utc', 'score', 'num_comments', 'search_term',
    'date_posted', 'upvotes', 'comments', 'url', 'permalink',
    'created_utc_dt', 'created_date_dt', 'date_posted_dt'
]
reddit_data = reddit_data.drop(columns=cols_to_drop)


In [36]:
data_overview(reddit_data)

The dataset has 538 rows and 6 columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 538 entries, 0 to 1288
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   title           538 non-null    object
 1   text            538 non-null    object
 2   subreddit       538 non-null    object
 3   created_date    123 non-null    object
 4   keyword         538 non-null    object
 5   is_future_date  538 non-null    bool  
dtypes: bool(1), object(5)
memory usage: 25.7+ KB




---Missing Values---



title               0
text                0
subreddit           0
created_date      415
keyword             0
is_future_date      0
dtype: int64


---Sample---



Unnamed: 0,title,text,subreddit,created_date,keyword,is_future_date
0,"USAID left a month ago, do we have ARVs in Kenya?",someone on a different group different website...,Kenya,NaT,usaid kenya,False
1,Classism in r/Kenya and r/nairobi,the classism i'm seeing in both subs is a good...,Kenya,NaT,usaid kenya,False
2,EX-USAID people!! Let's talk,are you still in contact with the organisation...,Kenya,NaT,usaid kenya,False
3,Why western powers back Israel no matter what ...,"i don't care what good book you read, but it's...",Kenya,NaT,usaid kenya,False
4,Is kenya capable of funding its needs now that...,how is kenya prepared to fill the vacuum of us...,Kenya,NaT,usaid kenya,False
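
In the sample above, `created_date` is `NaT` for posts whose `created_utc` held a value such as `2025-04-15 13:16:53` in the raw overview. A likely cause (our assumption, not verified against every source file) is that `created_utc` mixes epoch seconds with formatted date strings, and parsing with `unit='s'` handles only the former. The sketch below shows one way to parse such a mixed column; `parse_mixed_timestamps` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def parse_mixed_timestamps(values: pd.Series) -> pd.Series:
    """Parse a column that may hold epoch seconds or date strings (hypothetical helper)."""
    # Values that look numeric are treated as epoch seconds
    numeric = pd.to_numeric(values, errors="coerce")
    from_epoch = pd.to_datetime(numeric, unit="s", errors="coerce")

    # Everything else is parsed as an ordinary date string
    from_string = pd.to_datetime(values.where(numeric.isna()), errors="coerce")

    # Prefer the epoch result and fall back to the string result
    return from_epoch.fillna(from_string)

# Toy input: the same moment as epoch seconds and as a string, plus a missing value
raw = pd.Series(["1744723013", "2025-04-15 13:16:53", None])
parsed = parse_mixed_timestamps(raw)
print(parsed)
```

Both populated rows parse to the same timestamp and the missing value stays `NaT`. If the string values use several different formats, the string branch may need an explicit `format` argument or per-format handling.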
