### USAID FUNDING CUTS SENTIMENT ANALYSIS

### Introduction

.....(light intro text)
.....(TBD)

### Data 

This section covers the data preparation phase of the project. The dataset is compiled from two primary sources: Reddit (via web scraping) and NewsAPI (via API calls). Each contributor collected data independently from these platforms, targeting topics relevant to the analysis. Below, we import the collected datasets, merge them, and perform initial cleaning to prepare the data for further exploration and modeling.



#### Data Importation

##### news_data

In [1]:
import os
import pandas as pd

# Set the path to your news_data folder
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\news_data'

# List all CSV files in the folder
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Read and display columns for each CSV file
for file in csv_files:
    file_path = os.path.join(folder_path, file)
    try:
        header_df = pd.read_csv(file_path, nrows=0)  # Read only the header row
        print(f"Columns in {file}:")
        print(list(header_df.columns))
        print("-" * 50)
    except Exception as e:
        print(f"Error reading {file}: {e}")


Columns in Agatha_news.csv:
['keyword', 'source', 'author', 'title', 'description', 'content', 'publishedAt', 'url']
--------------------------------------------------
Columns in Agatha_news_fulltext.csv:
['keyword', 'source', 'author', 'title', 'publishedAt', 'summary', 'text', 'url']
--------------------------------------------------
Columns in cecilia.newsapi.csv:
['keyword', 'source', 'author', 'title', 'description', 'content', 'url', 'publishedAt', 'urlToImage']
--------------------------------------------------
Columns in leo_newsapi_articles_enriched.csv:
['source', 'author', 'title', 'description', 'content', 'url', 'published_at', 'full_text']
--------------------------------------------------
Columns in Mbego_news_usaid_kenya_fulltext.csv:
['source', 'author', 'title', 'description', 'url', 'publishedAt', 'summary', 'full_text']
--------------------------------------------------
Columns in Mbego_news_usaid_kenya_recent.csv:
['source', 'author', 'title', 'description', 'url',

##### reddit_data

In [2]:

# Set the path to your reddit_data folder
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\reddit_data'

# List all CSV files in the folder
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Read and display columns for each CSV file
for file in csv_files:
    file_path = os.path.join(folder_path, file)
    try:
        header_df = pd.read_csv(file_path, nrows=0)  # Read only the header row
        print(f"Columns in {file}:")
        print(list(header_df.columns))
        print("-" * 50)
    except Exception as e:
        print(f"Error reading {file}: {e}")


Columns in Agatha_reddit.csv:
['title', 'selftext', 'subreddit', 'author', 'created_utc', 'url', 'score', 'num_comments', 'keyword']
--------------------------------------------------
Columns in cecilia.redditsubs.csv:
['subreddit', 'keyword', 'title', 'text', 'date_posted', 'upvotes', 'comments', 'url', 'permalink']
--------------------------------------------------
Columns in cecilia.reddit_nbo_ke_africa.csv:
['subreddit', 'keyword', 'title', 'text', 'date_posted', 'upvotes', 'comments', 'url', 'permalink']
--------------------------------------------------
Columns in leo_reddit_posts.csv:
['subreddit', 'search_term', 'title', 'text', 'created_utc', 'created_date', 'score', 'num_comments', 'permalink', 'url']
--------------------------------------------------
Columns in Mbego_reddit_usaid_kenya.csv:
['title', 'score', 'url', 'created', 'subreddit', 'selftext']
--------------------------------------------------
Columns in Mbego_reddit_usaid_kenya2.csv:
['title', 'score', 'url', 'creat

#### Data Merging 

##### news_data



In [3]:

# Folder containing all News CSVs
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\news_data'

# Final save location
save_path = r"N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\individual datasets"

# All .csv files in the news_data folder
news_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Define the final standardized columns
standard_news_cols = [
    'keyword', 'source', 'author', 'title', 'description', 'content',
    'summary', 'full_text', 'publishedAt', 'url', 'language'
]

# Create empty master DataFrame
merged_news_df = pd.DataFrame(columns=standard_news_cols)

# Loop through each file
for file in news_files:
    file_path = os.path.join(folder_path, file)
    try:
        news_file_df = pd.read_csv(file_path)

        # Drop index column if present
        if 'Unnamed: 0' in news_file_df.columns:
            news_file_df.drop(columns=['Unnamed: 0'], inplace=True)

        # Standardize column names
        news_file_df.rename(columns={
            'published_at': 'publishedAt',
            'published_date': 'publishedAt',
            'text': 'content'
        }, inplace=True)

        # Add missing columns
        for col in standard_news_cols:
            if col not in news_file_df.columns:
                news_file_df[col] = pd.NA

        # Align column order
        news_file_df = news_file_df[standard_news_cols]

        # Add to master DataFrame
        merged_news_df = pd.concat([merged_news_df, news_file_df], ignore_index=True)

        print(f"✅ Merged: {file}")
    except Exception as e:
        print(f" Error processing {file}: {e}")

# Save merged file
output_path = os.path.join(save_path, 'Mbego_all_news_merged.csv')
merged_news_df.to_csv(output_path, index=False)

print(f"\n✅ All News files merged and saved to '{output_path}'")


✅ Merged: Agatha_news.csv
✅ Merged: Agatha_news_fulltext.csv
✅ Merged: cecilia.newsapi.csv
✅ Merged: leo_newsapi_articles_enriched.csv
✅ Merged: Mbego_news_usaid_kenya_fulltext.csv
✅ Merged: Mbego_news_usaid_kenya_recent.csv
✅ Merged: newsapi_usaid_articles.csv
✅ Merged: ruth_news.csv

✅ All News files merged and saved to 'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\individual datasets\Mbego_all_news_merged.csv'
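The loop above grows `merged_news_df` with `pd.concat` on every iteration. As a minimal sketch of an alternative, assuming the same column conventions, the standardized frames can be collected in a list and concatenated once. The helper names `standardize_news_frame` and `merge_news_folder` are hypothetical, and the `source_file` provenance column is an illustrative assumption rather than something this cell produces.

```python
from pathlib import Path

import pandas as pd

STANDARD_NEWS_COLS = [
    'keyword', 'source', 'author', 'title', 'description', 'content',
    'summary', 'full_text', 'publishedAt', 'url', 'language',
]

# Column-name variants mapped onto the standard names (same mapping as the cell above)
RENAME_MAP = {
    'published_at': 'publishedAt',
    'published_date': 'publishedAt',
    'text': 'content',
}


def standardize_news_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with exactly the standard columns, in order."""
    df = df.drop(columns=['Unnamed: 0'], errors='ignore').rename(columns=RENAME_MAP)
    # reindex adds any missing standard column (filled with NaN) and drops extras
    return df.reindex(columns=STANDARD_NEWS_COLS)


def merge_news_folder(folder: Path) -> pd.DataFrame:
    """Read every CSV in folder, standardize it, and concatenate once."""
    frames = []
    for csv_path in sorted(folder.glob('*.csv')):
        frame = standardize_news_frame(pd.read_csv(csv_path))
        frame['source_file'] = csv_path.name  # hypothetical provenance column
        frames.append(frame)
    if not frames:
        return pd.DataFrame(columns=STANDARD_NEWS_COLS + ['source_file'])
    return pd.concat(frames, ignore_index=True)
```

Calling `merge_news_folder(Path(folder_path))` would return the merged frame, which could then be written out with `to_csv` as in the cell above. Concatenating once avoids repeatedly copying the growing frame, though for a handful of small files the difference is mostly one of readability.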


In [4]:

news_merged_path = r"N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\individual datasets\Mbego_all_news_merged.csv"

news_mbegoall_df = pd.read_csv(news_merged_path)

news_mbegoall_df.head(3)


Unnamed: 0,keyword,source,author,title,description,content,summary,full_text,publishedAt,url,language
0,usaid kenya,Al Jazeera English,Al Jazeera,Has DOGE really saved the US government $180bn?,Elon Musk first claimed the department would m...,President Donald Trump and adviser Elon Musk c...,,,2025-06-06T11:21:51Z,https://www.aljazeera.com/news/2025/6/6/has-do...,
1,usaid kenya,CleanTechnica,Guest Contributor,The Life Story of Ecomobilus Technologies Limi...,By Prof Geoffrey Gitau Here is a story showcas...,By Prof Geoffrey Gitau\r\nHere is a story show...,,,2025-05-26T17:13:41Z,https://cleantechnica.com/2025/05/26/the-life-...,
2,usaid kenya,ProPublica,by Brett Murphy and Anna Maria Barry-Jester,"Death, Sexual Violence and Human Trafficking: ...",by Brett Murphy and Anna Maria Barry-Jester \n...,ProPublica is a nonprofit newsroom that invest...,,,2025-05-28T18:45:00Z,https://www.propublica.org/article/trump-usaid...,


##### reddit_data

In [5]:

# Define paths
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\reddit_data'
save_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\individual datasets'

# Define standard columns
standard_cols = [
    'title', 'selftext', 'subreddit', 'author', 'created_utc',
    'created_date', 'score', 'num_comments', 'keyword', 'search_term',
    'date_posted', 'upvotes', 'comments', 'url', 'permalink'
]

# Initialize master DataFrame
merged_df = pd.DataFrame(columns=standard_cols)

# Get all CSV files in folder
reddit_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Loop through each CSV file
for file in reddit_files:
    file_path = os.path.join(folder_path, file)
    try:
        reddit_mbegoall_df = pd.read_csv(file_path)

        # Drop any unnamed index column
        if 'Unnamed: 0' in reddit_mbegoall_df.columns:
            reddit_mbegoall_df.drop(columns=['Unnamed: 0'], inplace=True)

        # Rename common variations
        reddit_mbegoall_df.rename(columns={
            'text': 'selftext',
            'created': 'created_utc'
        }, inplace=True)

        # Add missing columns as empty (NA)
        for col in standard_cols:
            if col not in reddit_mbegoall_df.columns:
                reddit_mbegoall_df[col] = pd.NA

        # Reorder columns to match standard
        reddit_mbegoall_df = reddit_mbegoall_df[standard_cols]

        # Append to the master DataFrame
        merged_df = pd.concat([merged_df, reddit_mbegoall_df], ignore_index=True)

        print(f"✅ Merged: {file}")
    except Exception as e:
        print(f" Error processing {file}: {e}")

# Save merged result
output_file = os.path.join(save_path, 'mbego_all_reddit_merged.csv')
merged_df.to_csv(output_file, index=False)
print(f"\n✅ All Reddit files merged and saved to '{output_file}'")


✅ Merged: Agatha_reddit.csv
✅ Merged: cecilia.redditsubs.csv
✅ Merged: cecilia.reddit_nbo_ke_africa.csv
✅ Merged: leo_reddit_posts.csv
✅ Merged: Mbego_reddit_usaid_kenya.csv
✅ Merged: Mbego_reddit_usaid_kenya2.csv
✅ Merged: reddit_usaid_kenya.csv
✅ Merged: reddit_usaid_sentiment.csv
✅ Merged: ruth_reddit.csv

✅ All Reddit files merged and saved to 'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\individual datasets\mbego_all_reddit_merged.csv'


In [6]:

reddit_merged_path = r"N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\individual datasets\mbego_all_reddit_merged.csv"

reddit_mbegoall_df = pd.read_csv(reddit_merged_path)

reddit_mbegoall_df.head(3)

Unnamed: 0,title,selftext,subreddit,author,created_utc,created_date,score,num_comments,keyword,search_term,date_posted,upvotes,comments,url,permalink
0,"USAID left a month ago, do we have ARVs in Kenya?",Someone on a different group (different websit...,Kenya,muerki,2025-04-15 13:16:53,,3.0,5.0,usaid kenya,,,,,https://www.reddit.com/r/Kenya/comments/1jzrn2...,
1,Classism in r/Kenya and r/nairobi,The classism I'm seeing in both subs is a good...,Kenya,Morio_anzenza,2025-04-07 04:21:12,,169.0,95.0,usaid kenya,,,,,https://www.reddit.com/r/Kenya/comments/1jtcvb...,
2,EX-USAID people!! Let's talk,Are you still in contact with the organisation...,Kenya,vindtar,2025-04-05 19:09:10,,2.0,2.0,usaid kenya,,,,,https://www.reddit.com/r/Kenya/comments/1jsb14...,
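The merged Reddit frame keeps parallel columns that appear to carry the same information from different scrapers, for example `score` and `upvotes`, or `keyword` and `search_term`. The sketch below shows one way to coalesce such pairs with `Series.combine_first`. It assumes the paired columns really are equivalent, which should be confirmed with each contributor; `coalesce_columns` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def coalesce_columns(df: pd.DataFrame, pairs: dict) -> pd.DataFrame:
    """For each {primary: fallback} pair, fill gaps in primary from fallback,
    then drop the fallback column. Returns a new DataFrame."""
    out = df.copy()
    for primary, fallback in pairs.items():
        if fallback not in out.columns:
            continue
        if primary in out.columns:
            out[primary] = out[primary].combine_first(out[fallback])
        else:
            out[primary] = out[fallback]
        out = out.drop(columns=[fallback])
    return out


# Toy rows that mimic the two schemas (values are made up for illustration)
toy = pd.DataFrame({
    'title': ['post a', 'post b'],
    'score': [3.0, None],
    'upvotes': [None, 12.0],
    'keyword': ['usaid kenya', None],
    'search_term': [None, 'foreign aid'],
})

# Assumption: upvotes means the same as score, and search_term the same as keyword
coalesced = coalesce_columns(toy, {'score': 'upvotes', 'keyword': 'search_term'})
print(coalesced)
```

Coalescing before any columns are dropped would keep the values that currently sit only in the fallback columns.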


---

#### Data Understanding 

##### News Data

Basic Overview

In [85]:
processed_final_path = r"N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\news_data\news_data.csv"

news_df = pd.read_csv(processed_final_path)

In [86]:
news_df.head(5)

Unnamed: 0,title,description,text,url,keyword,published_date,source_file
0,Has DOGE really saved the US government $180bn?,Elon Musk first claimed the department would m...,President Donald Trump and adviser Elon Musk c...,https://www.aljazeera.com/news/2025/6/6/has-do...,usaid kenya,2025-06-06,Agatha_news.csv
1,The Life Story of Ecomobilus Technologies Limi...,By Prof Geoffrey Gitau Here is a story showcas...,By Prof Geoffrey Gitau\r\nHere is a story show...,https://cleantechnica.com/2025/05/26/the-life-...,usaid kenya,2025-05-26,Agatha_news.csv
2,"Death, Sexual Violence and Human Trafficking: ...",by Brett Murphy and Anna Maria Barry-Jester \n...,ProPublica is a nonprofit newsroom that invest...,https://www.propublica.org/article/trump-usaid...,usaid kenya,2025-05-28,Agatha_news.csv
3,Congress Should Quickly Approve Trump’s Rescis...,President Donald Trump‘s rescission legislatio...,President Donald Trumps rescission legislation...,https://www.dailysignal.com/2025/06/10/congres...,usaid kenya,2025-06-10,Agatha_news.csv
4,Food Safety Depends On Every Link In The Suppl...,Almost 1 in 10 people globally fall ill from c...,Colorful fish and vegetables can be purchased ...,https://www.forbes.com/sites/daniellenierenber...,usaid kenya,2025-06-06,Agatha_news.csv


In [87]:
print(news_df.shape)             # Rows and columns
print(news_df.dtypes)           # Data types                


(2549, 7)
title             object
description       object
text              object
url               object
keyword           object
published_date    object
source_file       object
dtype: object


In [88]:
news_df = news_df.drop(columns=['source_file', 'url'])

Unique Values per Key Column

In [89]:
print("1. Sample keywords:", news_df['keyword'].dropna().unique()[:5])
print("2. Unique text:", news_df['text'].nunique())
print("3. Unique description:", news_df['description'].nunique())
print("4. Unique title:", news_df['title'].dropna().nunique())


1. Sample keywords: ['usaid kenya' 'usaid funding' 'usaid budget cut' 'kenya foreign aid'
 'usaid suspended funding']
2. Unique text: 1401
3. Unique description: 1411
4. Unique title: 1410
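The frame has 2,549 rows but only about 1,400 unique titles and texts, which suggests that many articles were collected more than once. A minimal sketch of removing the repeats follows, assuming that rows sharing a title and text are the same article. `drop_repeated_articles` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd


def drop_repeated_articles(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first row for each (title, text) pair, ignoring case and outer whitespace."""
    key = (
        df['title'].fillna('').str.strip().str.lower()
        + '||'
        + df['text'].fillna('').str.strip().str.lower()
    )
    return df.loc[~key.duplicated(keep='first')].reset_index(drop=True)


toy = pd.DataFrame({
    'title': ['Aid cuts hit clinics', 'aid cuts hit clinics ', 'Another story'],
    'text': ['Body one', 'Body one', 'Body two'],
    'keyword': ['usaid kenya', 'usaid funding', 'usaid kenya'],
})

deduped = drop_repeated_articles(toy)
print(len(toy), '->', len(deduped))
```

Keeping only the first row discards the keywords attached to the later copies, so whether to deduplicate before or after any keyword-level analysis is a judgment call.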


Date Range

In [90]:
news_df['published_date'] = pd.to_datetime(news_df['published_date'], errors='coerce')
print("Date range:", news_df['published_date'].min(), "to", news_df['published_date'].max())

Date range: 2025-05-09 00:00:00 to 2025-06-23 00:00:00
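With `errors='coerce'`, values that cannot be parsed become `NaT` instead of raising an error, so bad dates can disappear silently. The sketch below, using a made-up series, shows one way to separate values that were already missing from values that were coerced.

```python
import pandas as pd

# Made-up raw values: two valid dates, one unparseable string, one missing value
raw = pd.Series(['2025-06-06', '2025-05-26', 'not a date', None])

parsed = pd.to_datetime(raw, errors='coerce')

# Values that were present in the raw data but could not be parsed
coerced = raw.notna() & parsed.isna()

print("Already missing:", int(raw.isna().sum()))
print("Unparseable (coerced to NaT):", int(coerced.sum()))
print("Date range:", parsed.min().date(), "to", parsed.max().date())
```

Running the same comparison on the raw `published_date` column before conversion would show whether any dates were lost to parsing.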


Sample Full Article Text

In [91]:
sample = news_df[['title', 'description', 'text']].dropna().sample(3)
print(sample)

                                                  title  \
1536  JUST IN: House Votes to Advance DOGE Cuts to N...   
2088  Studies for breast cancer, ALS: Here are some ...   
787   PBS sues Trump over funding cuts to public med...   

                                            description  \
1536  The House of Representatives has advanced a $9...   
2088  Grant terminations at Harvard have affected re...   
787   PBS said in a lawsuit that the Trump administr...   

                                                   text  
1536  The House of Representatives has advanced a $9...  
2088  Amid the Trump administration's battle with Ha...  
787   PBS is taking the Trump administration to cour...  


`Column Importance`

In [92]:
print(news_df.columns.tolist()) 

['title', 'description', 'text', 'keyword', 'published_date']


In [93]:
# column_name:     brief description

# title:           Useful for headline analysis, summarization, keyword extraction, or sentiment approximation
# description:     Concise summary of the article
# keyword:         Metadata for filtering or guiding classification topics
# published_date:  Useful for temporal analysis, trend detection, or filtering by date (kept for now in case we tailor some visualizations as well)

# Important columns that are missing after the group merge:
# content:         Main body of the article. Crucial for any text-based NLP
# source:          Helps identify bias or clustering by publisher; useful in framing analysis

important_cols = ['title', 'description', 'published_date', 'text', 'keyword']
news_df = news_df[important_cols].copy()  # copy so later column assignments do not act on a slice


`Data cleaning (minor, for quick cleaning)`

Missing Data

In [94]:
missing = news_df.isna().sum().sort_values(ascending=False)
print("Missing values per column:\n", missing)

Missing values per column:
 keyword           170
published_date     99
text               25
description        16
title               0
dtype: int64


Filling missing text columns with empty strings

In [95]:
news_df['text'] = news_df['text'].fillna('')
news_df['description'] = news_df['description'].fillna('')
news_df['keyword'] = news_df['keyword'].fillna('')

# Drop rows where published_date is missing (NaT)
news_df = news_df.dropna(subset=['published_date'])

Creating a new combined column to enrich the text for NLP

In [96]:
# Create the new combined column
news_df['news_content'] = news_df['description'] + ' ' + news_df['text']

Drop the two merged columns and retain the new column `news_content`

In [97]:
news_df = news_df.drop(columns=['text','description'])
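The fill, combine, and drop steps above can also be sketched as a single function, so that the cleaning can be re-run on a fresh copy of the data. `build_news_content` is a hypothetical helper. The `str.strip()` call and the index reset are additions for illustration and are not part of the cells above. The toy frame is made up.

```python
import pandas as pd


def build_news_content(df: pd.DataFrame) -> pd.DataFrame:
    """Fill text gaps, drop undated rows, and merge description + text into news_content."""
    out = df.copy()
    for col in ['text', 'description', 'keyword']:
        out[col] = out[col].fillna('')
    out = out.dropna(subset=['published_date'])
    # str.strip removes the stray space left when one side is empty
    out['news_content'] = (out['description'] + ' ' + out['text']).str.strip()
    return out.drop(columns=['text', 'description']).reset_index(drop=True)


toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'description': ['Short summary.', None, 'No date here.'],
    'text': ['Full body.', 'Only body.', 'Body.'],
    'keyword': ['usaid kenya', None, 'usaid funding'],
    'published_date': pd.to_datetime(['2025-06-06', '2025-05-26', None]),
})

clean = build_news_content(toy)
print(clean)
```

Because the function works on a copy, the input frame is left unchanged and the step can be repeated without re-reading the CSV.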

> TBD: the remaining news-data steps will be added in the cells before this point. The cell below is the last step; it is included early because it is needed for viewing the data.

In [98]:
# Keep a separate copy of the cleaned data to avoid re-running the cells above
temp_clean_newsdata = news_df.copy()
temp_clean_newsdata.head(4)

Unnamed: 0,title,published_date,keyword,news_content
0,Has DOGE really saved the US government $180bn?,2025-06-06,usaid kenya,Elon Musk first claimed the department would m...
1,The Life Story of Ecomobilus Technologies Limi...,2025-05-26,usaid kenya,By Prof Geoffrey Gitau Here is a story showcas...
2,"Death, Sexual Violence and Human Trafficking: ...",2025-05-28,usaid kenya,by Brett Murphy and Anna Maria Barry-Jester \n...
3,Congress Should Quickly Approve Trump’s Rescis...,2025-06-10,usaid kenya,President Donald Trump‘s rescission legislatio...


---

##### Reddit Data

---

In [None]:
print(reddit_mbegoall_df.columns.tolist()) 

['title', 'selftext', 'subreddit', 'author', 'created_utc', 'created_date', 'score', 'num_comments', 'keyword', 'search_term', 'date_posted', 'upvotes', 'comments', 'url', 'permalink']


Drop unneeded columns for modeling

In [None]:

columns_to_drop = ['created_utc', 'date_posted', 'upvotes', 'url', 'permalink', 'author', 'created_date']
# errors='ignore' ensures the call won't fail if a column is missing
reddit_mbegoall_df = reddit_mbegoall_df.drop(columns=columns_to_drop, errors='ignore')

# Remaining columns
print("Remaining columns:")
print(reddit_mbegoall_df.columns)


Remaining columns:
Index(['title', 'selftext', 'subreddit', 'score', 'num_comments', 'keyword',
       'search_term', 'comments'],
      dtype='object')


Check null values to decide how to handle them for NLP

In [None]:
reddit_mbegoall_df.isnull().sum().sort_values(ascending=False)

search_term     1156
comments        1030
keyword          564
num_comments     473
selftext         398
score            276
title              0
subreddit          0
dtype: int64

Cleaning code for null values

In [None]:
# Drop rows where 'created_date' is missing
#reddit_df = reddit_df.dropna(subset=['created_date'])

# Drop 'search_term' if not needed
reddit_mbegoall_df = reddit_mbegoall_df.drop(columns=['search_term'], errors='ignore')

# Fill missing text fields with empty strings
reddit_mbegoall_df['selftext'] = reddit_mbegoall_df['selftext'].fillna('')
reddit_mbegoall_df['comments'] = reddit_mbegoall_df['comments'].fillna('')

# Fill numeric fields with 0
reddit_mbegoall_df['score'] = reddit_mbegoall_df['score'].fillna(0)
reddit_mbegoall_df['num_comments'] = reddit_mbegoall_df['num_comments'].fillna(0)

# Fill missing 'keyword' with 'subreddit' before dropping subreddit
reddit_mbegoall_df['keyword'] = reddit_mbegoall_df['keyword'].fillna(reddit_mbegoall_df['subreddit'])

# Normalize 'keyword' by stripping whitespace and lowercasing
reddit_mbegoall_df['keyword'] = reddit_mbegoall_df['keyword'].str.strip().str.lower()

# Drop 'subreddit' since it's now redundant
reddit_mbegoall_df = reddit_mbegoall_df.drop(columns=['subreddit'], errors='ignore')

# Create unified text field for NLP modeling
reddit_mbegoall_df['full_text'] = reddit_mbegoall_df['title'] + ' ' + reddit_mbegoall_df['selftext']
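Filling a missing `keyword` with the subreddit name mixes two kinds of label in one column: search terms and community names. As a hedged sketch, the origin of each value could be recorded before filling. The `keyword_source` column is a hypothetical addition and the toy rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Post one', 'Post two'],
    'keyword': [' USAID Kenya ', None],
    'subreddit': ['Kenya', 'worldnews'],
})

# Record the origin before filling, so the two kinds of label stay distinguishable
toy['keyword_source'] = toy['keyword'].notna().map({True: 'search', False: 'subreddit'})

# Same fill and normalization as the cell above
toy['keyword'] = toy['keyword'].fillna(toy['subreddit']).str.strip().str.lower()

print(toy[['keyword', 'keyword_source']])
```

With such a flag, keyword counts can be filtered to genuine search terms when needed.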


In [None]:
reddit_mbegoall_df.nunique().sort_values()


keyword          31
comments         77
num_comments    112
score           238
selftext        544
title           828
full_text       833
dtype: int64

In [None]:
reddit_mbegoall_df.tail(5)

Unnamed: 0,title,selftext,score,num_comments,keyword,comments,full_text
1301,Weekly Sub-Saharan Africa Security Situation a...,#Somalia 🇸🇴\r\n#Sudan 🇸🇩\r\nDemocratic Republi...,3.0,2.0,africa,,Weekly Sub-Saharan Africa Security Situation a...
1302,No evidence that Burkina Faso paid off all its...,,52.0,25.0,africa,,No evidence that Burkina Faso paid off all its...
1303,Ghana orders foreigners to exit gold market by...,Ghana has ordered foreigners to exit its gold ...,101.0,12.0,africa,,Ghana orders foreigners to exit gold market by...
1304,Unending Frustration Regarding Sudan War.,https://www.reuters.com/world/britain-boosts-a...,11.0,8.0,africa,,Unending Frustration Regarding Sudan War. http...
1305,Tanzania's Authoritarian Government Has Just B...,Tanzania's main opposition party has been barr...,52.0,14.0,africa,,Tanzania's Authoritarian Government Has Just B...


In [None]:
reddit_mbegoall_df.shape

(1306, 7)

Top 20 most frequent keywords (after normalization)

In [None]:
reddit_mbegoall_df['keyword'].value_counts().head(20)


keyword
kenya                                   342
foreign aid, foreign aid                157
worldnews                               138
usaid kenya funding cut                 129
kenya foreign aid                       105
development aid kenya                    99
africa                                   81
usaid                                    66
kenya donor funding                      56
usaid budget cut                         27
usaid, foreign aid, foreign aid          24
usaid suspended funding                  22
usaid funding                            18
usaid kenya                              10
aid withdrawal                            3
usaid, usaid money, donors, ngos          3
foreign aid, foreign aid, trump cuts      3
usaid, usaid money                        3
internationaldev                          3
usaid, ngos                               2
Name: count, dtype: int64
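Several entries in the counts above repeat a term inside one comma-separated string, such as `foreign aid, foreign aid`. A minimal sketch of collapsing those repeats while keeping first-seen order follows. `dedupe_keyword_list` is a hypothetical helper, and it assumes commas are the only separator used.

```python
import pandas as pd


def dedupe_keyword_list(value: str) -> str:
    """Remove repeated terms from a comma-separated keyword string, keeping first-seen order."""
    terms = [term.strip() for term in value.split(',')]
    # dict.fromkeys preserves insertion order while dropping repeats
    unique_terms = list(dict.fromkeys(term for term in terms if term))
    return ', '.join(unique_terms)


keywords = pd.Series([
    'foreign aid, foreign aid',
    'usaid, foreign aid, foreign aid',
    'usaid kenya',
])

print(keywords.map(dedupe_keyword_list).tolist())
```

Applied to the `keyword` column after the lowercasing step, this would merge variants such as `foreign aid, foreign aid` into a single `foreign aid` label.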

In [None]:
print(news_df.columns.tolist()) 

['title', 'description', 'content', 'publishedAt', 'source', 'language', 'keyword']


---