### USAID FUNDING CUTS SENTIMENT ANALYSIS

### Introduction

.....(light intro text)
.....(TBD)

### Data 

This section covers the data preparation phase of the project. The dataset is compiled from two primary sources: Reddit (via web scraping) and NewsAPI (via API calls). Each contributor collected data independently from these platforms, targeting topics relevant to the analysis. Below, we import the collected datasets, merge them, and perform initial cleaning steps to prepare the data for further exploration and modeling.
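The cells in this section repeat one absolute, machine-specific path per folder. A minimal sketch of defining the folders once, relative to the project, might look like the following. It assumes the notebook is run from the repository root; `PROJECT_ROOT` and the other constant names are hypothetical and do not appear in the original notebook.

```python
from pathlib import Path

# Assumption: the current working directory is the repository root.
PROJECT_ROOT = Path.cwd()
RAW_DIR = PROJECT_ROOT / "data" / "raw"
NEWS_DIR = RAW_DIR / "news_data"
REDDIT_DIR = RAW_DIR / "reddit_data"
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"

# Report each folder and whether it exists on this machine.
for name, path in [("news", NEWS_DIR), ("reddit", REDDIT_DIR), ("processed", PROCESSED_DIR)]:
    print(f"{name}: {path} (exists: {path.is_dir()})")
```

With paths built this way, the same notebook can run on another contributor's machine without editing drive letters.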



#### Data Importation

##### news_data

In [1]:
import os
import pandas as pd

# Set the path to your news_data folder
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\news_data'

# List all CSV files in the folder
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Read and display columns for each CSV file
for file in csv_files:
    file_path = os.path.join(folder_path, file)
    try:
        df = pd.read_csv(file_path, nrows=0)  # Read only headers
        print(f"Columns in {file}:")
        print(list(df.columns))
        print("-" * 50)
    except Exception as e:
        print(f"Error reading {file}: {e}")


Columns in Agatha_news.csv:
['keyword', 'source', 'author', 'title', 'description', 'content', 'publishedAt', 'url']
--------------------------------------------------
Columns in cecilia.newsapi.csv:
['keyword', 'source', 'title', 'description', 'url', 'publishedAt']
--------------------------------------------------
Columns in gnews_usaid_kenya_full.csv:
['title', 'url', 'published_date', 'source', 'text']
--------------------------------------------------
Columns in gnews_usaid_kenya_full_en_sw.csv:
['title', 'url', 'published_date', 'source', 'language', 'text']
--------------------------------------------------
Columns in leo_newsapi_articles.csv:
['source', 'author', 'title', 'description', 'content', 'url', 'published_at']
--------------------------------------------------
Columns in leo_newsapi_articles_enriched.csv:
['source', 'author', 'title', 'description', 'content', 'url', 'published_at', 'full_text']
--------------------------------------------------
Columns in Mbego_news

##### reddit_data

In [2]:
import os
import pandas as pd

# Set the path to your reddit_data folder
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\reddit_data'

# List all CSV files in the folder
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Read and display columns for each CSV file
for file in csv_files:
    file_path = os.path.join(folder_path, file)
    try:
        df = pd.read_csv(file_path, nrows=0)  # Read only headers
        print(f"Columns in {file}:")
        print(list(df.columns))
        print("-" * 50)
    except Exception as e:
        print(f"Error reading {file}: {e}")


Columns in Agatha_reddit.csv:
['title', 'selftext', 'subreddit', 'author', 'created_utc', 'url', 'score', 'num_comments', 'keyword']
--------------------------------------------------
Columns in cecilia.redditsubs.csv:
['subreddit', 'keyword', 'title', 'text', 'date_posted', 'upvotes', 'comments', 'url', 'permalink']
--------------------------------------------------
Columns in cecilia.reddit_nbo_ke_africa.csv:
['subreddit', 'keyword', 'title', 'text', 'date_posted', 'upvotes', 'comments', 'url', 'permalink']
--------------------------------------------------
Columns in leo_reddit_posts.csv:
['subreddit', 'search_term', 'title', 'text', 'created_utc', 'created_date', 'score', 'num_comments', 'permalink', 'url']
--------------------------------------------------
Columns in Mbego_reddit_usaid_kenya.csv:
['title', 'score', 'url', 'created', 'subreddit', 'selftext']
--------------------------------------------------
Columns in Mbego_reddit_usaid_kenya2.csv:
['title', 'score', 'url', 'creat
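
The two listing cells above differ only in the folder they read. As a rough sketch, the same header-only read can be wrapped in a helper that returns a file-by-column presence table, which makes naming differences such as `created` versus `created_utc` easier to spot. `column_presence` is a hypothetical name, and the demo writes two header-only CSVs to a temporary folder, with headers copied from the Reddit listing above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def column_presence(folder):
    """Return a boolean table: one row per CSV file, one column per header name."""
    headers = {}
    for csv_path in sorted(Path(folder).glob("*.csv")):
        # nrows=0 reads only the header row, as in the cells above
        headers[csv_path.name] = set(pd.read_csv(csv_path, nrows=0).columns)
    all_columns = sorted(set().union(*headers.values())) if headers else []
    return pd.DataFrame(
        {col: [col in cols for cols in headers.values()] for col in all_columns},
        index=list(headers),
    )


# Demo on a throwaway folder (illustrative files containing headers only).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "Agatha_reddit.csv").write_text(
        "title,selftext,subreddit,author,created_utc,url,score,num_comments,keyword\n",
        encoding="utf-8",
    )
    Path(tmp, "Mbego_reddit_usaid_kenya.csv").write_text(
        "title,score,url,created,subreddit,selftext\n",
        encoding="utf-8",
    )
    presence = column_presence(tmp)

# Transposed so that column names run down the side
print(presence.T)
```

Columns that are `True` for only some files are the ones that need renaming or filling before a merge.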

#### Data Merging 

##### news_data



In [None]:
import os
import pandas as pd

# Folder containing all News CSVs
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\news_data'

# Final save location
save_path = r"N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed"

# All .csv files in the news_data folder
news_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Define the final standardized columns
standard_news_cols = [
    'keyword', 'source', 'author', 'title', 'description', 'content',
    'summary', 'full_text', 'publishedAt', 'url', 'language'
]

# Collect each standardized DataFrame here, then concatenate once at the end
news_frames = []

# Loop through each file
for file in news_files:
    file_path = os.path.join(folder_path, file)
    try:
        df = pd.read_csv(file_path)

        # Drop index column if present
        if 'Unnamed: 0' in df.columns:
            df.drop(columns=['Unnamed: 0'], inplace=True)

        # Standardize column names
        df.rename(columns={
            'published_at': 'publishedAt',
            'published_date': 'publishedAt',
            'text': 'content'
        }, inplace=True)

        # Add missing columns
        for col in standard_news_cols:
            if col not in df.columns:
                df[col] = pd.NA

        # Align column order
        df = df[standard_news_cols]

        # Queue for the final concatenation
        news_frames.append(df)

        print(f"✅ Merged: {file}")
    except Exception as e:
        print(f"Error processing {file}: {e}")

# Build the master DataFrame with a single concat, which avoids re-copying
# the growing DataFrame on every iteration
if news_frames:
    merged_news_df = pd.concat(news_frames, ignore_index=True)
else:
    merged_news_df = pd.DataFrame(columns=standard_news_cols)

# Save merged file (create the output folder if it does not exist yet)
os.makedirs(save_path, exist_ok=True)
output_path = os.path.join(save_path, 'Mbego_all_news_merged.csv')
merged_news_df.to_csv(output_path, index=False)

print(f"\n✅ All News files merged and saved to '{output_path}'")


✅ Merged: Agatha_news.csv
✅ Merged: cecilia.newsapi.csv
✅ Merged: gnews_usaid_kenya_full.csv
✅ Merged: gnews_usaid_kenya_full_en_sw.csv
✅ Merged: leo_newsapi_articles.csv
✅ Merged: leo_newsapi_articles_enriched.csv
✅ Merged: Mbego_news_usaid_kenya_fulltext.csv
✅ Merged: Mbego_news_usaid_kenya_recent.csv
✅ Merged: ruth_news.csv

✅ All News files merged and saved to 'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\Mbego_all_news_merged.csv'
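
Some of the merged files look like variants of the same pull (for example `leo_newsapi_articles.csv` and `leo_newsapi_articles_enriched.csv`), so the merged table may contain the same article more than once. That overlap is an assumption here, not something the output above confirms. A minimal sketch of one way to handle it is to keep a single row per `url`, preferring the row with the most populated fields. `drop_duplicate_articles` is a hypothetical helper and the rows below are made up for illustration.

```python
import pandas as pd


def drop_duplicate_articles(df, key="url"):
    """Keep one row per URL, preferring the row with the most non-null fields."""
    filled = df.notna().sum(axis=1)  # number of populated fields per row
    ordered = df.assign(_filled=filled).sort_values(
        "_filled", ascending=False, kind="stable"
    )
    # Rows without a URL are never treated as repeats
    is_repeat = ordered[key].notna() & ordered.duplicated(subset=key, keep="first")
    kept = ordered[~is_repeat].drop(columns="_filled")
    return kept.sort_index().reset_index(drop=True)


# Illustrative rows only; not taken from the project data.
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "url": ["https://example.com/a", "https://example.com/a", "https://example.com/b"],
    "full_text": [pd.NA, "long text", pd.NA],
})
clean = drop_duplicate_articles(toy)
print(clean)
```

Preferring the fuller row means an enriched copy of an article wins over a copy that lacks `full_text`.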


##### reddit_data

In [None]:
import os
import pandas as pd

# Define paths
folder_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\raw\reddit_data'
save_path = r'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed'

# Define standard columns
standard_cols = [
    'title', 'selftext', 'subreddit', 'author', 'created_utc',
    'created_date', 'score', 'num_comments', 'keyword', 'search_term',
    'date_posted', 'upvotes', 'comments', 'url', 'permalink'
]

# Collect each standardized DataFrame here, then concatenate once at the end
reddit_frames = []

# Get all CSV files in folder
reddit_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Loop through each CSV file
for file in reddit_files:
    file_path = os.path.join(folder_path, file)
    try:
        df = pd.read_csv(file_path)

        # Drop any unnamed index column
        if 'Unnamed: 0' in df.columns:
            df.drop(columns=['Unnamed: 0'], inplace=True)

        # Rename common variations
        df.rename(columns={
            'text': 'selftext',
            'created': 'created_utc'
        }, inplace=True)

        # Add missing columns as empty (NA)
        for col in standard_cols:
            if col not in df.columns:
                df[col] = pd.NA

        # Reorder columns to match standard
        df = df[standard_cols]

        # Queue for the final concatenation
        reddit_frames.append(df)

        print(f"✅ Merged: {file}")
    except Exception as e:
        print(f"Error processing {file}: {e}")

# Build the master DataFrame with a single concat, which avoids re-copying
# the growing DataFrame on every iteration
if reddit_frames:
    merged_df = pd.concat(reddit_frames, ignore_index=True)
else:
    merged_df = pd.DataFrame(columns=standard_cols)

# Save merged result (create the output folder if it does not exist yet)
os.makedirs(save_path, exist_ok=True)
output_file = os.path.join(save_path, 'mbego_all_reddit_merged.csv')
merged_df.to_csv(output_file, index=False)
print(f"\n✅ All Reddit files merged and saved to '{output_file}'")


✅ Merged: Agatha_reddit.csv
✅ Merged: cecilia.redditsubs.csv
✅ Merged: cecilia.reddit_nbo_ke_africa.csv
✅ Merged: leo_reddit_posts.csv
✅ Merged: Mbego_reddit_usaid_kenya.csv
✅ Merged: Mbego_reddit_usaid_kenya2.csv
✅ Merged: reddit_usaid_sentiment.csv
✅ Merged: ruth_reddit.csv

✅ All Reddit files merged and saved to 'N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\mbego_all_reddit_merged.csv'
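
The standard column list keeps `score` and `upvotes`, `num_comments` and `comments`, and `keyword` and `search_term` as separate columns, so each row is populated in only one column of each pair. A minimal sketch of folding each pair into one column follows. The pairing is an assumption based on the column names alone (in particular, check that `comments` holds a count rather than comment text before applying it), `coalesce_columns` is a hypothetical helper, and the rows are made up. Date columns are left out because their formats are not shown in the source files.

```python
import pandas as pd

# Assumed equivalences, inferred from column names only: primary -> alias
EQUIVALENT_COLUMNS = {
    "score": "upvotes",
    "num_comments": "comments",
    "keyword": "search_term",
}


def coalesce_columns(df, pairs=EQUIVALENT_COLUMNS):
    """Fill gaps in each primary column from its alias, then drop the alias."""
    out = df.copy()
    for primary, alias in pairs.items():
        if primary in out.columns and alias in out.columns:
            out[primary] = out[primary].combine_first(out[alias])
            out = out.drop(columns=alias)
    return out


# Illustrative rows only; not taken from the project data.
toy = pd.DataFrame({
    "title": ["post 1", "post 2"],
    "score": [3.0, None],
    "upvotes": [None, 12.0],
    "num_comments": [5.0, None],
    "comments": [None, 4.0],
    "keyword": ["usaid kenya", None],
    "search_term": [None, "usaid"],
})
tidy = coalesce_columns(toy)
print(tidy)
```

An alternative is to do the same mapping in the `rename` step of the merge loop, which avoids creating the alias columns in the first place.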


#### Data Understanding 

In [11]:
import pandas as pd

news_merged_path = r"N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\Mbego_all_news_merged.csv"

news_df = pd.read_csv(news_merged_path)

news_df.head(3)


|  | keyword | source | author | title | description | content | summary | full_text | publishedAt | url | language |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | usaid kenya | Al Jazeera English | Al Jazeera | Has DOGE really saved the US government $180bn? | Elon Musk first claimed the department would m... | President Donald Trump and adviser Elon Musk c... |  |  | 2025-06-06T11:21:51Z | https://www.aljazeera.com/news/2025/6/6/has-do... |  |
| 1 | usaid kenya | CleanTechnica | Guest Contributor | The Life Story of Ecomobilus Technologies Limi... | By Prof Geoffrey Gitau Here is a story showcas... | By Prof Geoffrey Gitau\r\nHere is a story show... |  |  | 2025-05-26T17:13:41Z | https://cleantechnica.com/2025/05/26/the-life-... |  |
| 2 | usaid kenya | ProPublica | by Brett Murphy and Anna Maria Barry-Jester | Death, Sexual Violence and Human Trafficking: ... | by Brett Murphy and Anna Maria Barry-Jester \n... | ProPublica is a nonprofit newsroom that invest... |  |  | 2025-05-28T18:45:00Z | https://www.propublica.org/article/trump-usaid... |  |


In [12]:

reddit_merged_path = r"N:\Moringa\afterM\Leo NLP 004 USAID 01.06.2025\USAID-Kenya-Sentiment-Analysis\data\processed\mbego_all_reddit_merged.csv"

reddit_df = pd.read_csv(reddit_merged_path)

reddit_df.head(3)

|  | title | selftext | subreddit | author | created_utc | created_date | score | num_comments | keyword | search_term | date_posted | upvotes | comments | url | permalink |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USAID left a month ago, do we have ARVs in Kenya? | Someone on a different group (different websit... | Kenya | muerki | 2025-04-15 13:16:53 |  | 3.0 | 5.0 | usaid kenya |  |  |  |  | https://www.reddit.com/r/Kenya/comments/1jzrn2... |  |
| 1 | Classism in r/Kenya and r/nairobi | The classism I'm seeing in both subs is a good... | Kenya | Morio_anzenza | 2025-04-07 04:21:12 |  | 169.0 | 95.0 | usaid kenya |  |  |  |  | https://www.reddit.com/r/Kenya/comments/1jtcvb... |  |
| 2 | EX-USAID people!! Let's talk | Are you still in contact with the organisation... | Kenya | vindtar | 2025-04-05 19:09:10 |  | 2.0 | 2.0 | usaid kenya |  |  |  |  | https://www.reddit.com/r/Kenya/comments/1jsb14... |  |


In [17]:
news_df.columns.tolist()

['keyword',
 'source',
 'author',
 'title',
 'description',
 'content',
 'summary',
 'full_text',
 'publishedAt',
 'url',
 'language']

In [18]:
reddit_df.columns.tolist()

['title',
 'selftext',
 'subreddit',
 'author',
 'created_utc',
 'created_date',
 'score',
 'num_comments',
 'keyword',
 'search_term',
 'date_posted',
 'upvotes',
 'comments',
 'url',
 'permalink']
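
Because the merge step fills every column a file lacks with `NA`, several of these columns are likely to be sparse. A quick way to see how sparse, sketched here on made-up rows with a hypothetical `missing_share` helper, is to compute the fraction of missing values per column.

```python
import pandas as pd


def missing_share(df):
    """Return the fraction of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)


# Illustrative rows only; not taken from the project data.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "summary": [None, None, None, None],
    "language": ["en", None, "sw", None],
})
share = missing_share(toy)
print(share)
```

Applied as `missing_share(news_df)` or `missing_share(reddit_df)`, this would show which columns carry enough data to be useful downstream.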

In [20]:
news_df['language'].value_counts()

language
en    248
sw    248
Name: count, dtype: int64
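
By default `value_counts()` leaves out missing values, so the counts above cover only the rows where `language` is populated. Passing `dropna=False` adds a row for the missing values. The frame below is a small made-up example, not the project data.

```python
import pandas as pd

# Illustrative values only: two labelled rows and three with no language recorded
toy = pd.DataFrame({"language": ["en", "sw", None, None, None]})

# dropna=False keeps missing values as their own category in the result
counts = toy["language"].value_counts(dropna=False)
print(counts)
```

Running `news_df['language'].value_counts(dropna=False)` would show how many articles still need a language label.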