### USAID FUNDING CUTS SENTIMENT ANALYSIS

### Introduction

.....(light intro text)
.....(TBD)

### Data 

This is the data preparation phase of the project. The dataset is compiled from two primary sources: Reddit (via web scraping) and NewsAPI (via API calls). Each contributor collected data independently from these platforms, targeting topics relevant to the analysis. Below, we import the collected datasets, merge them, and perform initial cleaning to prepare the data for further exploration and modeling.



#### Data Importation

##### news_data

In [45]:
import os
import pandas as pd

# Set the path to your news_data folder
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\news_data'

# List all CSV files in the folder
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Read and display columns for each CSV file
for file in csv_files:
    file_path = os.path.join(folder_path, file)
    try:
        df = pd.read_csv(file_path, nrows=0)  # Read only headers
        print(f"Columns in {file}:")
        print(list(df.columns))
        print("-" * 50)
    except Exception as e:
        print(f"Error reading {file}: {e}")


Columns in Agatha_news.csv:
['keyword', 'source', 'author', 'title', 'description', 'content', 'publishedAt', 'url']
--------------------------------------------------
Columns in cecilia.newsapi.csv:
['keyword', 'source', 'author', 'title', 'description', 'content', 'url', 'publishedAt', 'urlToImage']
--------------------------------------------------
Columns in leo_newsapi_articles_enriched.csv:
['source', 'author', 'title', 'description', 'content', 'url', 'published_at', 'full_text']
--------------------------------------------------
Columns in Mbego_news_usaid_kenya_fulltext.csv:
['source', 'author', 'title', 'description', 'url', 'publishedAt', 'summary', 'full_text']
--------------------------------------------------
Columns in Mbego_news_usaid_kenya_recent.csv:
['source', 'author', 'title', 'description', 'url', 'publishedAt', 'content']
--------------------------------------------------
Columns in ruth_news.csv:
['Unnamed: 0', 'source', 'title', 'description', 'content', 'url',

##### reddit_data

In [46]:

# Set the path to your reddit_data folder
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\reddit_data'

# List all CSV files in the folder
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Read and display columns for each CSV file
for file in csv_files:
    file_path = os.path.join(folder_path, file)
    try:
        df = pd.read_csv(file_path, nrows=0)  # Read only headers
        print(f"Columns in {file}:")
        print(list(df.columns))
        print("-" * 50)
    except Exception as e:
        print(f"Error reading {file}: {e}")


Columns in Agatha_reddit.csv:
['title', 'selftext', 'subreddit', 'author', 'created_utc', 'url', 'score', 'num_comments', 'keyword']
--------------------------------------------------
Columns in cecilia.redditsubs.csv:
['subreddit', 'keyword', 'title', 'text', 'date_posted', 'upvotes', 'comments', 'url', 'permalink']
--------------------------------------------------
Columns in cecilia.reddit_nbo_ke_africa.csv:
['subreddit', 'keyword', 'title', 'text', 'date_posted', 'upvotes', 'comments', 'url', 'permalink']
--------------------------------------------------
Columns in leo_reddit_posts.csv:
['subreddit', 'search_term', 'title', 'text', 'created_utc', 'created_date', 'score', 'num_comments', 'permalink', 'url']
--------------------------------------------------
Columns in Mbego_reddit_usaid_kenya.csv:
['title', 'score', 'url', 'created', 'subreddit', 'selftext']
--------------------------------------------------
Columns in Mbego_reddit_usaid_kenya2.csv:
['title', 'score', 'url', 'creat
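
The two header-listing cells above repeat the same loop for each folder. The sketch below is one way to factor that out. The helper names `csv_headers` and `column_usage` are hypothetical and not part of the original notebook. It reads only the first row of each CSV and then reports which files contain each column name, which makes the schema differences between contributors easier to see.

```python
import csv
from collections import defaultdict
from pathlib import Path


def csv_headers(folder):
    """Map each CSV file name in `folder` to its list of column names."""
    headers = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with open(path, newline="", encoding="utf-8", errors="replace") as handle:
            # Only the first row is read, so large files stay cheap to inspect.
            headers[path.name] = next(csv.reader(handle), [])
    return headers


def column_usage(headers):
    """Invert the mapping: column name -> sorted list of files that contain it."""
    usage = defaultdict(list)
    for file_name, columns in headers.items():
        for column in columns:
            usage[column].append(file_name)
    return {column: sorted(files) for column, files in usage.items()}
```

Calling `column_usage(csv_headers(folder_path))` on either raw folder should show at a glance which names (for example `text` versus `selftext`) need a rename rule before merging.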

#### Data Merging 

##### news_data



In [47]:

# Folder containing all News CSVs
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\news_data'

# Final save location
save_path = r"N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed"

# All .csv files in the news_data folder
news_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Define the final standardized columns
standard_news_cols = [
    'keyword', 'source', 'author', 'title', 'description', 'content',
    'summary', 'full_text', 'publishedAt', 'url', 'language'
]

# Create empty master DataFrame
merged_news_df = pd.DataFrame(columns=standard_news_cols)

# Loop through each file
for file in news_files:
    file_path = os.path.join(folder_path, file)
    try:
        df = pd.read_csv(file_path)

        # Drop index column if present
        if 'Unnamed: 0' in df.columns:
            df.drop(columns=['Unnamed: 0'], inplace=True)

        # Standardize column names
        df.rename(columns={
            'published_at': 'publishedAt',
            'published_date': 'publishedAt',
            'text': 'content'
        }, inplace=True)

        # Add missing columns
        for col in standard_news_cols:
            if col not in df.columns:
                df[col] = pd.NA

        # Align column order
        df = df[standard_news_cols]

        # Add to master DataFrame
        merged_news_df = pd.concat([merged_news_df, df], ignore_index=True)

        print(f"✅ Merged: {file}")
    except Exception as e:
        print(f"Error processing {file}: {e}")

# Save merged file
output_path = os.path.join(save_path, 'Mbego_all_news_merged.csv')
merged_news_df.to_csv(output_path, index=False)

print(f"\n✅ All News files merged and saved to '{output_path}'")


✅ Merged: Agatha_news.csv
✅ Merged: cecilia.newsapi.csv
✅ Merged: leo_newsapi_articles_enriched.csv
✅ Merged: Mbego_news_usaid_kenya_fulltext.csv
✅ Merged: Mbego_news_usaid_kenya_recent.csv
✅ Merged: ruth_news.csv

✅ All News files merged and saved to 'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\Mbego_all_news_merged.csv'
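
The merge cell above grows the master DataFrame by calling `pd.concat` inside the loop. A common alternative is to collect the aligned frames in a list and concatenate once at the end. The sketch below illustrates that pattern under a few assumptions: the function name `merge_csv_folder` and the extra `source_file` column are hypothetical additions, and `reindex` is used in place of the explicit "add missing columns" loop.

```python
from pathlib import Path

import pandas as pd


def merge_csv_folder(folder, standard_cols, rename_map=None):
    """Read every CSV in `folder`, align it to `standard_cols`, and stack the results."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(path)
        df = df.rename(columns=rename_map or {})
        # reindex adds missing standard columns as NaN and drops everything else,
        # including stray index columns such as 'Unnamed: 0'.
        df = df.reindex(columns=standard_cols)
        # Hypothetical extra column: remember which file each row came from.
        df["source_file"] = path.name
        frames.append(df)
    if not frames:
        return pd.DataFrame(columns=[*standard_cols, "source_file"])
    # A single concat avoids repeatedly copying the growing master frame.
    return pd.concat(frames, ignore_index=True)
```

Keeping the originating file name can help later when tracing duplicates or schema quirks back to a contributor. Note that `reindex` raises an error if a rename produces two columns with the same name, so rename rules may need checking per file.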


In [48]:

news_merged_path = r"N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\Mbego_all_news_merged.csv"

news_df = pd.read_csv(news_merged_path)

news_df.head(3)


|   | keyword | source | author | title | description | content | summary | full_text | publishedAt | url | language |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | usaid kenya | Al Jazeera English | Al Jazeera | Has DOGE really saved the US government $180bn? | Elon Musk first claimed the department would m... | President Donald Trump and adviser Elon Musk c... | | | 2025-06-06T11:21:51Z | https://www.aljazeera.com/news/2025/6/6/has-do... | |
| 1 | usaid kenya | CleanTechnica | Guest Contributor | The Life Story of Ecomobilus Technologies Limi... | By Prof Geoffrey Gitau Here is a story showcas... | By Prof Geoffrey Gitau\r\nHere is a story show... | | | 2025-05-26T17:13:41Z | https://cleantechnica.com/2025/05/26/the-life-... | |
| 2 | usaid kenya | ProPublica | by Brett Murphy and Anna Maria Barry-Jester | Death, Sexual Violence and Human Trafficking: ... | by Brett Murphy and Anna Maria Barry-Jester \n... | ProPublica is a nonprofit newsroom that invest... | | | 2025-05-28T18:45:00Z | https://www.propublica.org/article/trump-usaid... | |


##### reddit_data

In [49]:

# Define paths
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\reddit_data'
save_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed'

# Define standard columns
standard_cols = [
    'title', 'selftext', 'subreddit', 'author', 'created_utc',
    'created_date', 'score', 'num_comments', 'keyword', 'search_term',
    'date_posted', 'upvotes', 'comments', 'url', 'permalink'
]

# Initialize master DataFrame
merged_df = pd.DataFrame(columns=standard_cols)

# Get all CSV files in folder
reddit_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Loop through each CSV file
for file in reddit_files:
    file_path = os.path.join(folder_path, file)
    try:
        df = pd.read_csv(file_path)

        # Drop any unnamed index column
        if 'Unnamed: 0' in df.columns:
            df.drop(columns=['Unnamed: 0'], inplace=True)

        # Rename common variations
        df.rename(columns={
            'text': 'selftext',
            'created': 'created_utc'
        }, inplace=True)

        # Add missing columns as empty (NA)
        for col in standard_cols:
            if col not in df.columns:
                df[col] = pd.NA

        # Reorder columns to match standard
        df = df[standard_cols]

        # Append to the master DataFrame
        merged_df = pd.concat([merged_df, df], ignore_index=True)

        print(f"✅ Merged: {file}")
    except Exception as e:
        print(f"Error processing {file}: {e}")

# Save merged result
output_file = os.path.join(save_path, 'mbego_all_reddit_merged.csv')
merged_df.to_csv(output_file, index=False)
print(f"\n✅ All Reddit files merged and saved to '{output_file}'")


✅ Merged: Agatha_reddit.csv
✅ Merged: cecilia.redditsubs.csv
✅ Merged: cecilia.reddit_nbo_ke_africa.csv
✅ Merged: leo_reddit_posts.csv
✅ Merged: Mbego_reddit_usaid_kenya.csv
✅ Merged: Mbego_reddit_usaid_kenya2.csv
✅ Merged: reddit_usaid_sentiment.csv
✅ Merged: ruth_reddit.csv

✅ All Reddit files merged and saved to 'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\mbego_all_reddit_merged.csv'
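
The Reddit merge keeps several columns that appear to describe the same thing under different names (`score` and `upvotes`, `num_comments` and `comments`, `keyword` and `search_term`). The sketch below shows one way to fold such pairs into a single column. The pairings in `EQUIVALENT_COLUMNS` are assumptions based only on the column names and should be verified against the raw files (for instance, `comments` could hold text rather than a count). The helper name `coalesce_columns` is hypothetical.

```python
import pandas as pd

# Assumed equivalences between contributor schemas; verify against the raw files.
EQUIVALENT_COLUMNS = {
    "score": ["upvotes"],
    "num_comments": ["comments"],
    "keyword": ["search_term"],
}


def coalesce_columns(df, equivalents=EQUIVALENT_COLUMNS):
    """Fill gaps in each primary column from its alternates, then drop the alternates."""
    out = df.copy()
    for primary, alternates in equivalents.items():
        for alt in alternates:
            if alt not in out.columns:
                continue
            if primary in out.columns:
                # Keep existing primary values; use the alternate only where primary is missing.
                out[primary] = out[primary].combine_first(out[alt])
            else:
                out[primary] = out[alt]
            out = out.drop(columns=alt)
    return out
```

Applied to the merged Reddit frame, this would reduce the number of mostly empty columns before any missing-value analysis.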


In [50]:

reddit_merged_path = r"N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\mbego_all_reddit_merged.csv"

reddit_df = pd.read_csv(reddit_merged_path)

reddit_df.head(3)

|   | title | selftext | subreddit | author | created_utc | created_date | score | num_comments | keyword | search_term | date_posted | upvotes | comments | url | permalink |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USAID left a month ago, do we have ARVs in Kenya? | Someone on a different group (different websit... | Kenya | muerki | 2025-04-15 13:16:53 | | 3.0 | 5.0 | usaid kenya | | | | | https://www.reddit.com/r/Kenya/comments/1jzrn2... | |
| 1 | Classism in r/Kenya and r/nairobi | The classism I'm seeing in both subs is a good... | Kenya | Morio_anzenza | 2025-04-07 04:21:12 | | 169.0 | 95.0 | usaid kenya | | | | | https://www.reddit.com/r/Kenya/comments/1jtcvb... | |
| 2 | EX-USAID people!! Let's talk | Are you still in contact with the organisation... | Kenya | vindtar | 2025-04-05 19:09:10 | | 2.0 | 2.0 | usaid kenya | | | | | https://www.reddit.com/r/Kenya/comments/1jsb14... | |


#### Data Understanding 

##### on News Data

Basic Overview

In [51]:
print(news_df.shape)             # Rows and columns
print(news_df.dtypes)           # Data types                


(2549, 11)
keyword         object
source          object
author          object
title           object
description     object
content         object
summary         object
full_text       object
publishedAt     object
url             object
language       float64
dtype: object


Missing Data

In [52]:
missing = news_df.isna().sum().sort_values(ascending=False)
print("Missing values per column:\n", missing)

Missing values per column:
 language       2549
summary        2525
full_text      2439
author          245
keyword         170
content          25
description      16
publishedAt       2
url               2
source            0
title             0
dtype: int64
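
Raw counts are easier to judge as a share of all rows. With 2549 rows, `language` is 100% missing, `summary` about 99.1%, and `full_text` about 95.7%. The sketch below adds that percentage next to the count. The helper name `missing_report` is hypothetical.

```python
import pandas as pd


def missing_report(df):
    """Return missing-value counts and percentages per column, worst first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return report.sort_values("missing", ascending=False)
```

`missing_report(news_df)` would give the same ordering as the output above, with a `percent` column that makes near-empty columns such as `summary` stand out.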


Unique Values per Key Column

In [53]:
print("Unique sources:", news_df['source'].nunique())
print("Unique languages:", news_df['language'].dropna().unique())
print("Sample keywords:", news_df['keyword'].dropna().unique()[:10])

Unique sources: 290
Unique languages: []
Sample keywords: ['usaid kenya' 'usaid funding' 'usaid budget cut' 'kenya foreign aid'
 'usaid suspended funding' 'development aid kenya' 'kenya donor funding'
 'foreign aid cut' 'foreign aid withdrawal' 'us foreign aid kenya']


Date Range

In [54]:
news_df['publishedAt'] = pd.to_datetime(news_df['publishedAt'], errors='coerce')
print("Date range:", news_df['publishedAt'].min(), "to", news_df['publishedAt'].max())

Date range: 2025-05-09 09:26:01+00:00 to 2025-06-23 16:51:31+00:00


Content Length Check

In [55]:
news_df['content_length'] = news_df['content'].astype(str).apply(len)
print(news_df['content_length'].describe())


count    2549.000000
mean      211.604943
std        21.766251
min         3.000000
25%       214.000000
50%       214.000000
75%       214.000000
max       221.000000
Name: content_length, dtype: float64
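
Two details in this output are worth a note. First, the count is 2549 even though 25 `content` values are missing, and the minimum length is 3. This is most likely because `astype(str)` turns a missing value into the three-character string `'nan'`. Second, the tight clustering around 214 characters appears consistent with NewsAPI returning a truncated `content` field, so `content` alone may not hold the full article. The sketch below uses toy data (not the project data) to show how `.str.len()` keeps missing values out of the statistics.

```python
import pandas as pd

# Toy stand-in for news_df['content']; the middle value is missing.
content = pd.Series(["Full article body", float("nan"), "Short"], dtype="object")

# str(NaN) is the three-character string 'nan', which is why a missing
# article can show up with a length of 3.
nan_as_text_length = len(str(float("nan")))

# .str.len() keeps missing values as NaN, so describe() skips them.
content_length = content.str.len()
summary = content_length.describe()
print(summary)
```

With this approach the `count` row reports only rows that actually have content, and the minimum reflects the shortest real text.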


Top Sources & Languages

In [56]:
print(news_df['source'].value_counts().head(10))
print(news_df['language'].value_counts().head())

source
Al Jazeera English     177
NPR                    161
Forbes                 148
BBC News               114
ABC News                99
Business Insider        82
Plos.org                57
Time                    49
Gizmodo.com             47
Yahoo Entertainment     46
Name: count, dtype: int64
Series([], Name: count, dtype: int64)


Duplicates

In [57]:
duplicates = news_df.duplicated(subset=['title', 'url']).sum()
print("Duplicate articles (by title+url):", duplicates)

Duplicate articles (by title+url): 1110
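
A high duplicate count is plausible here, since contributors queried overlapping keywords. Exact matching can still miss duplicates that differ only in letter case or spacing. The sketch below normalizes the key columns before counting. The helper names `normalize_text` and `count_duplicates` are hypothetical, and lowercasing URLs is a simplifying assumption, since URL paths can be case-sensitive.

```python
import pandas as pd


def normalize_text(series):
    """Lowercase, trim, and collapse internal whitespace; missing values stay missing."""
    return (
        series.str.lower()
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
    )


def count_duplicates(df, subset):
    """Count rows that repeat an earlier row on the normalized `subset` columns."""
    keys = pd.DataFrame({col: normalize_text(df[col]) for col in subset})
    return int(keys.duplicated().sum())
```

Comparing `count_duplicates(news_df, ['title', 'url'])` with the exact-match figure above would show how many additional near-duplicates the normalization catches.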


Sample Full Article Text

In [58]:
sample = news_df[['title', 'description', 'content', 'full_text']].dropna().sample(3)
print(sample)

                                                  title  \
2470  Президент Грузии рассказал о просьбе Запада вм...   
2422  BBC’s ‘independent’ Russian partner begged UK ...   
2408  Boulder, Colorado firebomb madman hit with hat...   

                                            description  \
2470  После начала конфликта на Украине страны Запад...   
2422  Leaked documents show the supposedly self-reli...   
2408  The suspect in Colorado’s horrific antisemitic...   

                                                content  \
2470  , . , .\r\n" , . . , ", - .\r\n , , , , .\r\n ...   
2422  Mediazona, the self-styled independent Russian...   
2408  The suspect in Colorados horrific antisemitic ...   

                                              full_text  
2470  Политик напомнил, что Грузия является кандидат...  
2422  Leaked documents show the supposedly self-reli...  
2408  The hate-fueled Colorado firebomber disguised ...  


Column Importance

In [59]:
print(news_df.columns.tolist()) 

['keyword', 'source', 'author', 'title', 'description', 'content', 'summary', 'full_text', 'publishedAt', 'url', 'language', 'content_length']


In [60]:
# column_name:  brief info

# title:        Useful for headline analysis, summarization, keyword extraction, or sentiment approximation
# description:  Concise summary of the article
# content:      Main body of the article; crucial for any text-based NLP
# source:       Helps identify bias or clustering by publisher; useful in framing analysis
# language:     Used for language filtering
# keyword:      Metadata for filtering or guiding classification topics
# publishedAt:  Useful for temporal analysis, trend detection, or filtering by date (kept for now in case we tailor some visualizations)

important_cols = ['title', 'description', 'content', 'publishedAt', 'source', 'language', 'keyword']
# .copy() gives an independent frame, so later in-place edits do not raise SettingWithCopyWarning
news_df = news_df[important_cols].copy()


Data Cleaning (quick pass)

- Since `title` is a key column, removing duplicates is necessary to limit redundancy. We drop rows that share both `title` and `content`.

In [61]:
news_df = news_df.drop_duplicates(subset=['title', 'content'])

- Clean language column  

In [62]:
news_df['language'] = news_df['language'].astype(str).str.strip().str.lower()

In [63]:
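import pandas as pd
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; a fixed seed makes results reproducible
DetectorFactory.seed = 0

# Function to detect language from available text
def detect_language(row):
    # Use content first, fall back to description, then title.
    # str(NaN) is the truthy string 'nan', so missing values must be skipped explicitly.
    for col in ('content', 'description', 'title'):
        value = row[col]
        if pd.notna(value) and str(value).strip():
            try:
                return detect(str(value))
            except LangDetectException:
                continue  # no detectable features in this field; try the next one
    return 'unknown'

# Apply to each row
news_df['language'] = news_df.apply(detect_language, axis=1)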


In [64]:
print("Unique languages:", news_df['language'].dropna().unique())

Unique languages: ['en' 'de' 'pt' 'tl' 'es' 'no' 'hu' 'tr' 'da' 'fr' 'ro']
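
Language detection on short or truncated text can be unreliable, so some of the less expected codes above (for example `tl` or `hu`) may be misdetections rather than genuinely non-English articles. Before filtering, it can help to look at how many rows each language has and at a few example titles. The sketch below does that. The helper name `language_summary` and its parameters are hypothetical.

```python
import pandas as pd


def language_summary(df, lang_col="language", text_col="title", n_examples=2):
    """Per language: row count plus a few example titles for manual review."""
    rows = []
    for lang, group in df.groupby(lang_col):
        rows.append({
            lang_col: lang,
            "rows": len(group),
            "examples": group[text_col].head(n_examples).tolist(),
        })
    summary = pd.DataFrame(rows, columns=[lang_col, "rows", "examples"])
    # Most frequent language first; ties keep no particular order.
    return summary.sort_values("rows", ascending=False, ignore_index=True)
```

If the example titles for a rare language code read as English, those rows could be kept rather than dropped by the English-only filter.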


In [65]:
# Keep English-only content to ensure accurate analysis and easy interpretation, since it is the only language the team fully understands.

news_df = news_df[news_df['language'] == 'en'].copy()


In [69]:
# Keep a copy of the cleaned data to avoid re-running the steps above
temp_clean_newsdata = news_df.copy()
temp_clean_newsdata.head(4)

|   | title | description | content | publishedAt | source | language | keyword |
|---|---|---|---|---|---|---|---|
| 0 | Has DOGE really saved the US government $180bn? | Elon Musk first claimed the department would m... | President Donald Trump and adviser Elon Musk c... | 2025-06-06 11:21:51+00:00 | Al Jazeera English | en | usaid kenya |
| 1 | The Life Story of Ecomobilus Technologies Limi... | By Prof Geoffrey Gitau Here is a story showcas... | By Prof Geoffrey Gitau\r\nHere is a story show... | 2025-05-26 17:13:41+00:00 | CleanTechnica | en | usaid kenya |
| 2 | Death, Sexual Violence and Human Trafficking: ... | by Brett Murphy and Anna Maria Barry-Jester \n... | ProPublica is a nonprofit newsroom that invest... | 2025-05-28 18:45:00+00:00 | ProPublica | en | usaid kenya |
| 3 | Congress Should Quickly Approve Trump’s Rescis... | President Donald Trump‘s rescission legislatio... | President Donald Trumps rescission legislation... | 2025-06-10 12:00:00+00:00 | Daily Signal | en | usaid kenya |
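
An in-memory variable avoids re-running cells within one session, but it does not survive a kernel restart. Writing the cleaned frame to disk is one way to keep a checkpoint between sessions. The sketch below assumes hypothetical helper names (`save_checkpoint`, `load_checkpoint`) and leaves the choice of file location to the caller. Dates are re-parsed on load because CSV stores them as plain text.

```python
from pathlib import Path

import pandas as pd


def save_checkpoint(df, path):
    """Write `df` to CSV at `path`, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


def load_checkpoint(path, date_cols=("publishedAt",)):
    """Read a checkpoint back and restore timezone-aware datetime columns."""
    df = pd.read_csv(path)
    for col in date_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce", utc=True)
    return df
```

Saving with `index=False` also avoids the stray `Unnamed: 0` column that appears when an index is written and read back.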


##### on Reddit Data