# USAID Sentiment Analysis in Kenya

# 1. Business Understanding

USAID has long played a major role in Kenya’s development, funding health, education, and governance programs. However, recent shifts in US foreign aid policy, including funding cuts and the phase-out of multiple projects, have sparked growing public conversation and concern.

This project focuses on analyzing public and media sentiment **after these cuts or the scaling back of USAID programs**. The goal is to uncover:
- Public reaction to USAID’s funding changes
- Sentiment trends in both news media and online communities
- Common concerns, narratives, or misinformation emerging around USAID

These insights can support government and development stakeholders in understanding ground-level perception and refining their outreach or policy communication.

---

# 2. Data Understanding
## 2.1 Data Collection
We collected data from two main sources:
- **NewsAPI articles** referencing USAID and Kenya 
- **Reddit posts** from relevant subreddits discussing USAID-related topics



### 2.1.1 News Data Collection

In [None]:
import pandas as pd


# Files of interest
filenames = [
    "Agatha_news.csv",
    "cecilia.newsapi.csv",
    "leo_newsapi_articles_enriched.csv",
    "Mbego_news_usaid_kenya_fulltext.csv",
    "Mbego_news_usaid_kenya_recent.csv",
    "ruth_news.csv"
]

# Load and display summary
news_dfs = {}
for file in filenames:
    path = "../data/raw/news_data/"+file
    try:
        df = pd.read_csv(path)
        news_dfs[file] = df
        print(f" {file}")
        print(f"   - Rows: {df.shape[0]}, Columns: {df.shape[1]}")
        print(f"   - Columns: {list(df.columns)}\n")
    except Exception as e:
        print(f" Failed to load {file}: {e}")


 Agatha_news.csv
   → Rows: 592, Columns: 8
   → Columns: ['keyword', 'source', 'author', 'title', 'description', 'content', 'publishedAt', 'url']

 cecilia.newsapi.csv
   → Rows: 1145, Columns: 6
   → Columns: ['keyword', 'source', 'title', 'description', 'url', 'publishedAt']

 leo_newsapi_articles_enriched.csv
   → Rows: 99, Columns: 8
   → Columns: ['source', 'author', 'title', 'description', 'content', 'url', 'published_at', 'full_text']

 Mbego_news_usaid_kenya_fulltext.csv
   → Rows: 24, Columns: 8
   → Columns: ['source', 'author', 'title', 'description', 'url', 'publishedAt', 'summary', 'full_text']

 Mbego_news_usaid_kenya_recent.csv
   → Rows: 27, Columns: 7
   → Columns: ['source', 'author', 'title', 'description', 'url', 'publishedAt', 'content']

 ruth_news.csv
   → Rows: 20, Columns: 7
   → Columns: ['Unnamed: 0', 'source', 'title', 'description', 'content', 'url', 'publishedAt']
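
The summaries above show that the files do not share one schema: for example, the date column is `publishedAt` in most files but `published_at` in one, and only some files carry `content` or `full_text`. A minimal sketch of how the column sets could be compared side by side, assuming a dictionary of DataFrames shaped like `news_dfs`. The helper name `column_presence` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd

def column_presence(dfs):
    """Return a boolean table: one row per file, one column per column name."""
    # Union of every column name seen in any file
    all_columns = sorted({col for df in dfs.values() for col in df.columns})
    rows = {
        name: [col in df.columns for col in all_columns]
        for name, df in dfs.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index", columns=all_columns)

# Toy frames standing in for the loaded CSVs (hypothetical file names)
toy = {
    "a.csv": pd.DataFrame(columns=["title", "url", "publishedAt", "content"]),
    "b.csv": pd.DataFrame(columns=["title", "url", "published_at", "full_text"]),
}
presence = column_presence(toy)
print(presence)
```

Calling `column_presence(news_dfs)` on the real data would give one row per CSV, which makes it easier to decide which columns need renaming before a merge.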



In [1]:
import pandas as pd

# --- FILES TO MERGE (Only files with full_text or complete text for now) ---
filenames = [
    "Agatha_news.csv",
    "leo_newsapi_articles_enriched.csv",
    "Mbego_news_usaid_kenya_fulltext.csv",
    "Mbego_news_usaid_kenya_recent.csv"
]

# --- FINAL COLUMNS TO KEEP ---
final_columns = ['source', 'title', 'description', 'text', 'url', 'keyword', 'published_date']

# --- STORAGE FOR CLEANED DFs ---
merged_dfs = []

for file in filenames:
    path = "../data/raw/news_data/" + file
    df = pd.read_csv(path)

    # Drop unnamed index columns
    df = df.loc[:, ~df.columns.str.contains("^Unnamed")]

    # Locate the full_text and content columns (case-insensitive match)
    full_text_col = df.columns[df.columns.str.lower() == 'full_text']
    content_col = df.columns[df.columns.str.lower() == 'content']

    # Assign priority: full_text > content > None
    if len(full_text_col) > 0:
        df['text'] = df[full_text_col[0]]
    elif len(content_col) > 0:
        df['text'] = df[content_col[0]]
    else:
        df['text'] = None

    # Standardize other columns
    df = df.rename(columns={
        'publishedAt': 'published_date',
        'published_at': 'published_date'
    })

    # Add missing columns
    for col in final_columns:
        if col not in df.columns:
            df[col] = None

    # Filter only final columns
    df = df[final_columns]

    # Drop rows with empty or missing text; copy so later assignments
    # do not trigger pandas' SettingWithCopyWarning
    df = df[df['text'].notna() & (df['text'].str.strip() != "")].copy()

    # Fill keyword if missing
    df['keyword'] = df['keyword'].fillna("Unknown")

    # Convert dates
    df['published_date'] = pd.to_datetime(df['published_date'], errors='coerce')

    # Drop invalid rows (no title or url)
    df = df.dropna(subset=['url', 'title'])

    merged_dfs.append(df)

# --- MERGE AND SAVE ---
combined_df = pd.concat(merged_dfs, ignore_index=True)
combined_df.drop_duplicates(subset='url', inplace=True)

combined_df.to_csv("../data/processed/Leo_merged_news_dataset.csv", index=False)
print(f"Deduplicated and merged News dataset saved with shape: {combined_df.shape}")


Deduplicated and merged News dataset saved with shape: (501, 7)
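
The merge above chooses the text source per file: if a file has a `full_text` column, that column is used for every row, even where it is empty and `content` is populated. A row-wise fallback is one possible alternative. The sketch below assumes the same two column names; the helper `coalesce_text` is hypothetical and is not part of the notebook's pipeline.

```python
import pandas as pd

def coalesce_text(df, candidates=("full_text", "content")):
    """Pick, per row, the first non-blank string among the candidate columns."""
    text = pd.Series([None] * len(df), index=df.index, dtype="object")
    for col in candidates:
        if col not in df.columns:
            continue
        values = df[col].astype("object")
        # Keep only real, non-blank strings; everything else becomes missing
        is_text = values.map(lambda v: isinstance(v, str) and v.strip() != "")
        # Fill rows that are still missing with this column's usable values
        text = text.combine_first(values.where(is_text))
    return text

# Toy example (made-up rows, not project data)
toy = pd.DataFrame({
    "full_text": ["Full article body", None, "   "],
    "content": ["Short snippet", "Snippet only", None],
})
toy["text"] = coalesce_text(toy)
print(toy[["full_text", "content", "text"]])
```

With this approach a row keeps its `content` snippet when `full_text` is missing or blank, instead of being dropped by the empty-text filter.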


### 2.1.2 Reddit Data Collection


In [22]:
import pandas as pd

# --- FILES OF INTEREST ---
filenames = [
    "Agatha_reddit.csv",
    "cecilia.reddit_nbo_ke_africa.csv",
    "cecilia.redditsubs.csv",
    "leo_reddit_posts.csv",
    "Mbego_reddit_usaid_kenya.csv",
    "Mbego_reddit_usaid_kenya2.csv",
    "reddit_usaid_sentiment.csv",
    "ruth_reddit.csv"
]

# --- LOAD AND DISPLAY SUMMARY ---
reddit_dfs = {}
for file in filenames:
    path = "../data/raw/reddit_data/" + file
    try:
        df = pd.read_csv(path)
        reddit_dfs[file] = df
        print(f"{file}")
        print(f"   - Rows: {df.shape[0]}, Columns: {df.shape[1]}")
        print(f"   - Columns: {list(df.columns)}\n")
    except Exception as e:
        print(f" Failed to load {file}: {e}")


Agatha_reddit.csv
   - Rows: 466, Columns: 9
   - Columns: ['title', 'selftext', 'subreddit', 'author', 'created_utc', 'url', 'score', 'num_comments', 'keyword']

cecilia.reddit_nbo_ke_africa.csv
   - Rows: 29, Columns: 9
   - Columns: ['subreddit', 'keyword', 'title', 'text', 'date_posted', 'upvotes', 'comments', 'url', 'permalink']

cecilia.redditsubs.csv
   - Rows: 247, Columns: 9
   - Columns: ['subreddit', 'keyword', 'title', 'text', 'date_posted', 'upvotes', 'comments', 'url', 'permalink']

leo_reddit_posts.csv
   - Rows: 150, Columns: 10
   - Columns: ['subreddit', 'search_term', 'title', 'text', 'created_utc', 'created_date', 'score', 'num_comments', 'permalink', 'url']

Mbego_reddit_usaid_kenya.csv
   - Rows: 17, Columns: 6
   - Columns: ['title', 'score', 'url', 'created', 'subreddit', 'selftext']

Mbego_reddit_usaid_kenya2.csv
   - Rows: 163, Columns: 6
   - Columns: ['title', 'score', 'url', 'created', 'subreddit', 'selftext']

reddit_usaid_sentiment.csv
   - Rows: 17, Colu

In [2]:
# --- FILES TO MERGE ---
filenames = [
    "Agatha_reddit.csv",
    "cecilia.reddit_nbo_ke_africa.csv",
    "cecilia.redditsubs.csv",
    "leo_reddit_posts.csv",
    "Mbego_reddit_usaid_kenya.csv",
    "Mbego_reddit_usaid_kenya2.csv",
    "reddit_usaid_sentiment.csv",
    "ruth_reddit.csv"
]

# --- FINAL COLUMNS ---
final_columns = ['subreddit', 'title', 'text', 'url', 'created_date', 'keyword']
merged_dfs = []

for file in filenames:
    path = "../data/raw/reddit_data/" + file
    df = pd.read_csv(path)

    # Remove unnamed index if present
    df = df.loc[:, ~df.columns.str.contains("^Unnamed")]

    # Rename relevant columns
    df = df.rename(columns={
        'selftext': 'text',
        'search_term': 'keyword',
        'date_posted': 'created_date',
        'created': 'created_date'
    })

    # If 'created_utc' exists, convert from Unix timestamp to datetime
    if 'created_utc' in df.columns:
        df['created_utc'] = pd.to_datetime(df['created_utc'], unit='s', errors='coerce')
        df['created_date'] = df['created_utc']

    # Ensure all required columns exist
    for col in final_columns:
        if col not in df.columns:
            df[col] = None

    # Subset to relevant columns; copy so later assignments
    # do not trigger pandas' SettingWithCopyWarning
    df = df[final_columns].copy()

    # Fill missing keyword
    df['keyword'] = df['keyword'].fillna("Unknown")

    # Convert created_date column to datetime
    df['created_date'] = pd.to_datetime(df['created_date'], errors='coerce')

    # Drop rows with no title or url
    df = df.dropna(subset=['title', 'url'])

    merged_dfs.append(df)

# --- MERGE AND DEDUPE ---
combined_df = pd.concat(merged_dfs, ignore_index=True)
combined_df.drop_duplicates(subset='url', inplace=True)

# --- SAVE ---
combined_df.to_csv("../data/processed/Leo_merged_reddit_dataset.csv", index=False)
print(f"Deduplicated and merged Reddit dataset saved with shape: {combined_df.shape}")


Deduplicated and merged Reddit dataset saved with shape: (839, 6)
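
The Reddit files store post times under different names (`created_utc`, `created`, `date_posted`), and the merge only applies `unit='s'` to `created_utc`. If a renamed column such as `created` also holds Unix seconds, `pd.to_datetime` without a unit would read the numbers as nanoseconds and return dates in 1970. Whether that applies to these files is not shown in the output above, so the following is only a sketch of a defensive parser. The helper `parse_mixed_timestamps` is hypothetical, and normalizing everything to UTC is an assumption.

```python
import pandas as pd

def parse_mixed_timestamps(values):
    """Parse a column that may hold Unix seconds or date strings."""
    values = pd.Series(values)
    # Values that look numeric are treated as Unix seconds
    numeric = pd.to_numeric(values, errors="coerce")
    from_epoch = pd.to_datetime(numeric, unit="s", errors="coerce", utc=True)
    # Everything that is not numeric is parsed as a date string
    strings = values.where(numeric.isna())
    from_string = pd.to_datetime(strings, errors="coerce", utc=True)
    # Prefer the epoch interpretation, fall back to the string parse
    return from_epoch.fillna(from_string)

# Toy example (made-up values, not project data)
raw = ["1700000000", "2025-03-01 12:00:00", None, "not a date"]
parsed = parse_mixed_timestamps(raw)
print(parsed)
```

Values that are neither numeric nor parseable as dates come back as `NaT`, matching the `errors='coerce'` behavior used in the merge cell.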
