# Data Collection Experimentation
- The announcement of the USAID funding cuts was made on March 28, 2025
## 1. Environments
- tweepy, PRAW requests (pip install)

## 2. Reddit Data collection
- Create connection with Reddit: [link to create script->(https://www.reddit.com/prefs/apps)](https://www.reddit.com/prefs/apps)
- Get `client_id`, `client_secret`
```
reddit = praw.Reddit(
    client_id='tWd763iMgmjp8YamFF96Wg',
    client_secret='	XmaO9oO_kioW-8DzCLipxz-A4hffSA',
    user_agent='usaid-sentiment-KE'
)

```

In [None]:
import praw
import pandas as pd
from datetime import datetime, timezone

reddit = praw.Reddit(
    client_id='tWd763iMgmjp8YamFF96Wg',
    client_secret='XmaO9oO_kioW-8DzCLipxz-A4hffSA',
    user_agent='usaid-sentiment-KE'
)

# --- PARAMETERS ---
subreddits = ['Kenya', 'EastAfrica']

"""keywords = [
    "USAID", "usaid", 
    "foreign aid", "foreign assistance", 
    "donor", "donour", 
    "funding", "funds", 
    "budget cuts", "aid cuts", 
    "development aid", 
    "healthcare", "health care", 
    "NGOs", "nonprofits", "non-profits"
]"""


keywords = [
    "USAID", "usaid", 
    "foreign aid", "foreign funding"  
]

# Combine keywords and phrases
search_terms = keywords

# Earliest date (after funding cuts) → March 28, 2025
cutoff_date = datetime(2025, 3, 28, tzinfo=timezone.utc).timestamp()

# --- SCRAPING ---
data = []

for sub in subreddits:
    subreddit = reddit.subreddit(sub)
    print(f"Searching r/{sub}...")
    
    for term in search_terms:
        try:
            for post in subreddit.search(term, sort='new', limit=200):
                if post.created_utc < cutoff_date:
                    continue  # skip posts before March 28, 2025
                
                data.append({
                    'subreddit': sub,
                    'search_term': term,
                    'title': post.title,
                    'text': post.selftext,
                    'created_utc': post.created_utc,
                    'created_date': datetime.fromtimestamp(post.created_utc),
                    'score': post.score,
                    'num_comments': post.num_comments,
                    'permalink': f"https://reddit.com{post.permalink}",
                    'url': post.url
                })
        except Exception as e:
            print(f"Error searching term '{term}' in r/{sub}: {e}")

# --- SAVE TO CSV ---
df = pd.DataFrame(data)
df['created_date'] = pd.to_datetime(df['created_utc'], unit='s')
df.to_csv('../data/raw/leo_reddit_posts.csv', index=False)

print(f" Scraped {len(df)} posts. Saved to ../data/raw/leo_reddit_posts.csv'.")


Searching r/Kenya...
Searching r/EastAfrica...
✅ Scraped 24 posts. Saved to ../data/raw/leo_reddit_posts.csv'.


In [2]:
df.sample(random_state=42, n=10)

Unnamed: 0,subreddit,search_term,title,text,created_utc,created_date,score,num_comments,permalink,url
8,Kenya,usaid,Economy,For the experts in matters economy and finance...,1743959000.0,2025-04-06 17:01:41,1,15,https://reddit.com/r/Kenya/comments/1jsytyp/ec...,https://www.reddit.com/r/Kenya/comments/1jsyty...
16,Kenya,foreign funding,Be very cautious of the UAE,Kasongo has been cozying up to the UAE recentl...,1748512000.0,2025-05-29 09:51:51,32,13,https://reddit.com/r/Kenya/comments/1ky6sma/be...,https://www.reddit.com/r/Kenya/comments/1ky6sm...
0,Kenya,USAID,USAID Repercussions + Economy,My neighbour’s wife was a very big shot in USA...,1747235000.0,2025-05-14 15:11:02,12,32,https://reddit.com/r/Kenya/comments/1kmhn87/us...,https://www.reddit.com/r/Kenya/comments/1kmhn8...
18,Kenya,foreign funding,Daily Nation,,1747811000.0,2025-05-21 07:09:24,1,8,https://reddit.com/r/Kenya/comments/1krrnpb/da...,https://www.reddit.com/gallery/1krrnpb
11,Kenya,foreign aid,Is There a Better Way to Fund Africa’s Infrast...,I'm researching a fintech concept rooted in a ...,1745161000.0,2025-04-20 14:49:50,9,9,https://reddit.com/r/Kenya/comments/1k3o7to/is...,https://www.reddit.com/r/Kenya/comments/1k3o7t...
9,Kenya,usaid,EX-USAID people!! Let's talk,Are you still in contact with the organisation...,1743880000.0,2025-04-05 19:09:10,2,0,https://reddit.com/r/Kenya/comments/1jsb149/ex...,https://www.reddit.com/r/Kenya/comments/1jsb14...
13,Kenya,foreign aid,Kibaki ALSO failed us,\nThere is a tendency to over-exaggerate the p...,1743470000.0,2025-04-01 01:12:42,120,124,https://reddit.com/r/Kenya/comments/1jojl2f/ki...,https://www.reddit.com/r/Kenya/comments/1jojl2...
1,Kenya,USAID,"USAID left a month ago, do we have ARVs in Kenya?",Someone on a different group (different websit...,1744723000.0,2025-04-15 13:16:53,3,5,https://reddit.com/r/Kenya/comments/1jzrn2s/us...,https://www.reddit.com/r/Kenya/comments/1jzrn2...
21,Kenya,foreign funding,"Like It or Not, Here's Why Ruto Will Win in 20...","CALL ME WHATEVER YOU WANT, BUT HERE'S THE BARE...",1745140000.0,2025-04-20 09:06:35,0,9,https://reddit.com/r/Kenya/comments/1k3ijdu/li...,https://www.reddit.com/r/Kenya/comments/1k3ijd...
5,Kenya,usaid,USAID Repercussions + Economy,My neighbour’s wife was a very big shot in USA...,1747235000.0,2025-05-14 15:11:02,13,32,https://reddit.com/r/Kenya/comments/1kmhn87/us...,https://www.reddit.com/r/Kenya/comments/1kmhn8...


In [3]:
sample = df.sample(n=5,random_state= 42)
display(sample)

Unnamed: 0,subreddit,search_term,title,text,created_utc,created_date,score,num_comments,permalink,url
8,Kenya,usaid,Economy,For the experts in matters economy and finance...,1743959000.0,2025-04-06 17:01:41,1,15,https://reddit.com/r/Kenya/comments/1jsytyp/ec...,https://www.reddit.com/r/Kenya/comments/1jsyty...
16,Kenya,foreign funding,Be very cautious of the UAE,Kasongo has been cozying up to the UAE recentl...,1748512000.0,2025-05-29 09:51:51,32,13,https://reddit.com/r/Kenya/comments/1ky6sma/be...,https://www.reddit.com/r/Kenya/comments/1ky6sm...
0,Kenya,USAID,USAID Repercussions + Economy,My neighbour’s wife was a very big shot in USA...,1747235000.0,2025-05-14 15:11:02,12,32,https://reddit.com/r/Kenya/comments/1kmhn87/us...,https://www.reddit.com/r/Kenya/comments/1kmhn8...
18,Kenya,foreign funding,Daily Nation,,1747811000.0,2025-05-21 07:09:24,1,8,https://reddit.com/r/Kenya/comments/1krrnpb/da...,https://www.reddit.com/gallery/1krrnpb
11,Kenya,foreign aid,Is There a Better Way to Fund Africa’s Infrast...,I'm researching a fintech concept rooted in a ...,1745161000.0,2025-04-20 14:49:50,9,9,https://reddit.com/r/Kenya/comments/1k3o7to/is...,https://www.reddit.com/r/Kenya/comments/1k3o7t...


In [4]:
sample_text= sample['text'].to_list()
for x in sample_text:
    display(x)

'For the experts in matters economy and finance I ask this politely(mnielezee Kama mtoto tafadhali). How is our country still semi functional? Everyday we hear cases of billions lost here billions lost there. Sometime there was reports of I think 1.3 trillion irregularly withdrawn from the treasury, the dollar has surprisingly been stable at around 129 despite all this and there was the case where funding would be halted by the USAID. How has the economy not crashed yet? Is it normal to lose a third of the budget and still have a running country?'

"Kasongo has been cozying up to the UAE recently and as Kenyans we should be very careful here, if you look at their foreign policy they a pattern of fostering chaos and undermining democracy and legitimate governments.\n\n- In Sudan they fund and support the RSF ,in fact there are reports that they are the ones who pushed the RSF into launching the war.\n- In Somalia they support the breakaway region of Somaliland.\n- In Libya they fund and support the warlord, Khalifa Haftar.\n- In Egypt they orchestrated a coup to overthrow Morsy, the only democratically elected leader in Egypt.\n\nI don't know but who's to say that they will not try and help Kasongo in subverting the 2027 elections? After all they wouldn't wanna lose their logistics hub.  As the Swahili say 'Ukiona cha mwenzako kinanyolewa ,chako tia maji' and btw all those countries you see above all thought it couldn't happen to them."

'My neighbour’s wife was a very big shot in USAID and has now lost her job. Children have been removed from big private school. Husband is a big guy at PWC. Lifestyle changes are occurring rapidly as her income has vanished. Thousands of her USAID coworkers were sent home with no salaries. \n\nUSAID Vendors, contractors, non-profits that received funding from them have all been left in a lurch. Sasa machozi zimeanza. \n\nNext is empty apartments around “high class” areas.\n\nUN is laying people off left right and center.\n\nAdditionally, public assistance programs in the Europe and America are being slashed so remittances by a certain sector are falling.\n\nIf you think things are hard, ngojeni mpaka December. A lot of your highlife hotspots are about to close. A lot of these restaurants are about to close. \n\nCrime shall return so please rudini mashambani mulime.\n\nAvoid Mombasa, Lamu and malls.\n\n\n'

''

'I\'m researching a fintech concept rooted in a simple but powerful idea: What if African citizens could directly micro-invest in their own infrastructure and economic development — from as little as $1 — instead of relying so heavily on foreign loans or aid?\n\nThe idea is inspired by:\n\nEthiopia\'s Renaissance Dam, where despite China funding most of the $5B project, citizens contributed around $1B through bonds and mobile payments. It was a unifying act of nation-building.\n\nDenmark’s wind cooperatives, where tens of thousands of Danes co-own wind turbines, investing small amounts and earning steady returns from green energy sales.\n\nArla Foods, one of the world’s largest dairy companies, is owned by thousands of farmer-members across Europe.\n\nPark Slope Food Co-op (Brooklyn, USA) – over 17,000 members run and own this highly successful grocery store. Members contribute labor and share in decision-making and cost savings — a small-scale but high-functioning democratic economic 

- As seen above, some texts in our dataset contains mixed languages; we shall therefore have some preprocessing before analysis

## 3. NewsAPI Data Collection
- Sign up at [newsapi.org](newsapi.org)
- Create account and get api key : `bc6c52fd05ee4e63827b7cf45fa0bdb2`
- Limitations
    - The free version allows me to search back until 2025-05-03, 2 months after USAID funding cuts were already announced
    - Paywalled Websites like Daily Nation won't have content
    - Limited to 100 results

- 

In [1]:
import requests
import pandas as pd
from datetime import datetime

# --- PARAMETERS ---
api_key = 'bc6c52fd05ee4e63827b7cf45fa0bdb2'
query = 'USAID'
from_date = '2025-05-04'  # YYYY-MM-DD
country = 'ke'  # Kenya
page_size = 100  # max per request
max_pages = 1    # you can loop through more if needed

# --- FETCH ARTICLES ---
all_articles = []

for page in range(1, max_pages + 1):
    url = (
        f'https://newsapi.org/v2/everything?'
        f'q={query}&'
        f'from={from_date}&'
        f'sortBy=publishedAt&'
        f'pageSize={page_size}&'
        f'page={page}&'
        f'apiKey={api_key}'
    )

    response = requests.get(url)
    if response.status_code != 200:
        print(f"❌ Error: {response.status_code} - {response.json()}")
        break

    articles = response.json().get('articles', [])
    if not articles:
        break  # no more results

    for article in articles:
        all_articles.append({
            'source': article['source']['name'],
            'author': article.get('author'),
            'title': article.get('title'),
            'description': article.get('description'),
            'content': article.get('content'),
            'url': article.get('url'),
            'published_at': article.get('publishedAt')
        })

df = pd.DataFrame(all_articles)
# --- SAVE TO CSV ---
df = pd.DataFrame(all_articles)
df['published_at'] = pd.to_datetime(df['published_at'])
df.to_csv('../data/raw/leo_newsapi_articles.csv', index=False)

print(f"✅ Fetched {len(df)} articles. Saved to ../data/raw/leo_newsapi_articles.csv.")


❌ Error: 426 - {'status': 'error', 'code': 'parameterInvalid', 'message': 'You are trying to request results too far in the past. Your plan permits you to request articles as far back as 2025-05-12, but you have requested 2025-05-04. You may need to upgrade to a paid plan.'}


KeyError: 'published_at'

In [4]:
df_news = pd.read_csv('../data/raw/leo_newsapi_articles.csv')
df_news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   source        99 non-null     object
 1   author        94 non-null     object
 2   title         99 non-null     object
 3   description   98 non-null     object
 4   content       98 non-null     object
 5   url           97 non-null     object
 6   published_at  97 non-null     object
dtypes: object(7)
memory usage: 5.5+ KB


-Enrich NewsAPI Data with Full Text from `newspaper3k`

In [10]:
# --- INSTALLATION (if not done) ---
# pip install newspaper3k pandas

import pandas as pd
from newspaper import Article
import time

# --- LOAD EXISTING DATA ---
df = pd.read_csv('../data/raw/leo_newsapi_articles.csv')
print(f"📄 Loaded {len(df)} articles from NewsAPI.")

# --- EXTRACT FULL TEXT FROM URL ---
full_texts = []

for i, row in df.iterrows():
    url = row['url']
    try:
        article = Article(url)
        article.download()
        article.parse()
        full_texts.append(article.text)
    except Exception as e:
        print(f"❌ Failed at index {i} ({url}): {e}")
        full_texts.append(None)
    
    time.sleep(1)  # to avoid IP blocks or throttling

# --- ADD TO DATAFRAME & SAVE ---
df['full_text'] = full_texts
df.to_csv('../data/raw/leo_newsapi_articles_enriched.csv', index=False)

print(f"✅ Done. Saved full-text articles to ../data/raw/leo_newsapi_articles_enriched.csv")


📄 Loaded 99 articles from NewsAPI.
❌ Failed at index 11 (https://www.foxnews.com/us/boulder-suspect-spent-year-planning-molotov-cocktail-attack-pro-israel-march-docs): Article `download()` failed with 404 Client Error: Not Found for url: https://www.foxnews.com/us/boulder-suspect-spent-year-planning-molotov-cocktail-attack-pro-israel-march-docs on URL https://www.foxnews.com/us/boulder-suspect-spent-year-planning-molotov-cocktail-attack-pro-israel-march-docs
❌ Failed at index 13 (https://www.foxnews.com/us/usaid-paperwork-found-car-boulder-terror-attack-suspect-targeting-pro-israel-group): Article `download()` failed with 404 Client Error: Not Found for url: https://www.foxnews.com/us/usaid-paperwork-found-car-boulder-terror-attack-suspect-targeting-pro-israel-group on URL https://www.foxnews.com/us/usaid-paperwork-found-car-boulder-terror-attack-suspect-targeting-pro-israel-group
❌ Failed at index 15 (https://www.abc.net.au/news/2025-06-03/colorado-terror-attack-suspect-charged-with-u

Building prefix dict from /home/leo/anaconda3/envs/learn-env/lib/python3.8/site-packages/jieba/dict.txt ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.4312019348144531 seconds.
Prefix dict has been built succesfully.


✅ Done. Saved full-text articles to ../data/raw/leo_newsapi_articles_enriched.csv


- Trying out alternative news sources due to limitations of `NewsAPI`
## News Collection using Gnews


In [11]:
# --- INSTALLATION ---
# pip install gnews newspaper3k pandas

from gnews import GNews
from newspaper import Article
import pandas as pd
from datetime import datetime, timedelta

# --- PARAMETERS ---
query = "USAID Kenya"  
start_date = datetime(2025, 3, 28)
end_date = datetime(2025, 6, 13)
chunk_days = 7
max_results_per_chunk = 50

# --- INITIALIZE ---
gnews = GNews(language='en', country='KE', max_results=max_results_per_chunk)
current_start = start_date
all_articles = []

# --- DATE RANGE LOOP ---
while current_start < end_date:
    current_end = min(current_start + timedelta(days=chunk_days), end_date)
    gnews.start_date = current_start
    gnews.end_date = current_end
    print(f"🔍 Searching from {current_start.date()} to {current_end.date()}")

    try:
        results = gnews.get_news(query)
        print(f"   ✅ Found {len(results)} articles")
    except Exception as e:
        print(f"   ❌ Error during search: {e}")
        results = []

    # --- EXTRACT FULL TEXT ---
    for article in results:
        try:
            a = Article(article['url'])
            a.download()
            a.parse()
            all_articles.append({
                'title': a.title,
                'url': article['url'],
                'published_date': article['published date'],
                'source': article['publisher']['title'],
                'text': a.text
            })
        except Exception as e:
            print(f"   ❌ Failed to parse {article['url']}: {e}")

    current_start += timedelta(days=chunk_days)

# --- SAVE TO CSV ---
df = pd.DataFrame(all_articles)
df['published_date'] = pd.to_datetime(df['published_date'], errors='coerce')
df.to_csv("../data/raw/gnews_usaid_kenya_full.csv", index=False)

print(f"\n✅ Done. Saved {len(df)} full-text articles to CSV.")


🔍 Searching from 2025-03-28 to 2025-04-04
   ✅ Found 0 articles
🔍 Searching from 2025-04-04 to 2025-04-11




   ✅ Found 14 articles
🔍 Searching from 2025-04-11 to 2025-04-18
   ✅ Found 10 articles
🔍 Searching from 2025-04-18 to 2025-04-25
   ✅ Found 24 articles
🔍 Searching from 2025-04-25 to 2025-05-02
   ✅ Found 36 articles
🔍 Searching from 2025-05-02 to 2025-05-09
   ✅ Found 31 articles
🔍 Searching from 2025-05-09 to 2025-05-16
   ✅ Found 28 articles
🔍 Searching from 2025-05-16 to 2025-05-23
   ✅ Found 30 articles
🔍 Searching from 2025-05-23 to 2025-05-30
   ✅ Found 35 articles
🔍 Searching from 2025-05-30 to 2025-06-06
   ✅ Found 32 articles
🔍 Searching from 2025-06-06 to 2025-06-13
   ✅ Found 38 articles

✅ Done. Saved 278 full-text articles to CSV.


In [12]:
# --- INSTALLATION ---
# pip install gnews newspaper3k pandas

from gnews import GNews
from newspaper import Article
import pandas as pd
from datetime import datetime, timedelta

# --- PARAMETERS ---
query = "USAID Kenya"
start_date = datetime(2025, 3, 28)
end_date = datetime(2025, 6, 3)
chunk_days = 7
max_results_per_chunk = 50
languages = ['en', 'sw']  # English and Swahili

# --- STORAGE ---
all_articles = []

# --- LOOP THROUGH LANGUAGES ---
for lang in languages:
    print(f"\n🌐 Searching in language: {lang.upper()}")
    current_start = start_date

    while current_start < end_date:
        current_end = min(current_start + timedelta(days=chunk_days), end_date)
        
        gnews = GNews(language=lang, country='KE', max_results=max_results_per_chunk)
        gnews.start_date = current_start
        gnews.end_date = current_end
        
        print(f"🔍 {lang.upper()}: Searching from {current_start.date()} to {current_end.date()}")

        try:
            results = gnews.get_news(query)
            print(f"   ✅ Found {len(results)} articles")
        except Exception as e:
            print(f"   ❌ Error during search: {e}")
            results = []

        for article in results:
            try:
                a = Article(article['url'])
                a.download()
                a.parse()
                all_articles.append({
                    'title': a.title,
                    'url': article['url'],
                    'published_date': article.get('published date'),
                    'source': article['publisher']['title'],
                    'language': lang,
                    'text': a.text
                })
            except Exception as e:
                print(f"   ❌ Failed to parse {article['url']}: {e}")

        current_start += timedelta(days=chunk_days)

# --- SAVE TO CSV ---
df = pd.DataFrame(all_articles)
df['published_date'] = pd.to_datetime(df['published_date'], errors='coerce')
df.to_csv("../data/raw/gnews_usaid_kenya_full_en_sw.csv", index=False)

print(f"\n✅ Done. Saved {len(df)} full-text articles (EN + SW) to CSV.")



🌐 Searching in language: EN
🔍 EN: Searching from 2025-03-28 to 2025-04-04
   ✅ Found 23 articles
🔍 EN: Searching from 2025-04-04 to 2025-04-11
   ✅ Found 14 articles
🔍 EN: Searching from 2025-04-11 to 2025-04-18
   ✅ Found 10 articles
🔍 EN: Searching from 2025-04-18 to 2025-04-25
   ✅ Found 24 articles
🔍 EN: Searching from 2025-04-25 to 2025-05-02
   ✅ Found 36 articles
🔍 EN: Searching from 2025-05-02 to 2025-05-09
   ✅ Found 31 articles
🔍 EN: Searching from 2025-05-09 to 2025-05-16
   ✅ Found 28 articles
🔍 EN: Searching from 2025-05-16 to 2025-05-23
   ✅ Found 30 articles
🔍 EN: Searching from 2025-05-23 to 2025-05-30
   ✅ Found 35 articles
🔍 EN: Searching from 2025-05-30 to 2025-06-03
   ✅ Found 23 articles

🌐 Searching in language: SW
🔍 SW: Searching from 2025-03-28 to 2025-04-04
   ✅ Found 23 articles
🔍 SW: Searching from 2025-04-04 to 2025-04-11
   ✅ Found 14 articles
🔍 SW: Searching from 2025-04-11 to 2025-04-18
   ✅ Found 10 articles
🔍 SW: Searching from 2025-04-18 to 2025-04-25

In [19]:
df= pd.read_csv("../data/raw/gnews_usaid_kenya_full_en_sw.csv")
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 508 entries, 0 to 507
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           508 non-null    object 
 1   url             508 non-null    object 
 2   published_date  508 non-null    object 
 3   source          508 non-null    object 
 4   language        508 non-null    object 
 5   text            0 non-null      float64
dtypes: float64(1), object(5)
memory usage: 23.9+ KB


Unnamed: 0,title,url,published_date,source,language,text
0,Google News,https://news.google.com/rss/articles/CBMimwFBV...,2025-04-04 07:00:00+00:00,Daily Nation,en,
1,Google News,https://news.google.com/rss/articles/CBMirgFBV...,2025-04-01 07:00:00+00:00,Kenyans,en,
2,Google News,https://news.google.com/rss/articles/CBMiigFBV...,2025-04-02 07:00:00+00:00,NTV Kenya,en,
3,Google News,https://news.google.com/rss/articles/CBMimwFBV...,2025-03-31 07:00:00+00:00,KBC Digital,en,
4,Google News,https://news.google.com/rss/articles/CBMixwFBV...,2025-04-01 07:00:00+00:00,The EastAfrican,en,


In [20]:
# --- INSTALLATION ---
# pip install gnews newspaper3k pandas requests

import pandas as pd
import requests
from gnews import GNews
from newspaper import Article
from datetime import datetime, timedelta

# --- PARAMETERS ---
query = "USAID Kenya"
start_date = datetime(2025, 3, 28)
end_date = datetime(2025, 6, 3)
chunk_days = 7
max_results_per_chunk = 50
languages = ['en', 'sw']

# --- STORAGE ---
all_articles = []

# --- LOOP THROUGH LANGUAGES ---
for lang in languages:
    print(f"\n🌐 Searching in language: {lang.upper()}")
    current_start = start_date

    while current_start < end_date:
        current_end = min(current_start + timedelta(days=chunk_days), end_date)
        
        gnews = GNews(language=lang, country='KE', max_results=max_results_per_chunk)
        gnews.start_date = current_start
        gnews.end_date = current_end
        
        print(f"🔍 {lang.upper()}: Searching from {current_start.date()} to {current_end.date()}")

        try:
            results = gnews.get_news(query)
            print(f"   ✅ Found {len(results)} articles")
        except Exception as e:
            print(f"   ❌ Error during search: {e}")
            results = []

        for article in results:
            try:
                # --- Resolve Google News redirect URL to real URL ---
                response = requests.get(article['url'], allow_redirects=True, timeout=10)
                real_url = response.url

                # --- Use real URL to extract content ---
                a = Article(real_url)
                a.download()
                a.parse()

                all_articles.append({
                    'title': a.title,
                    'url': real_url,
                    'published_date': article.get('published date'),
                    'source': article['publisher']['title'],
                    'language': lang,
                    'text': a.text
                })
            except Exception as e:
                print(f"   ❌ Failed to parse {article['url']} -> {e}")

        current_start += timedelta(days=chunk_days)

# --- SAVE TO CSV ---
df = pd.DataFrame(all_articles)
df['published_date'] = pd.to_datetime(df['published_date'], errors='coerce')
df.to_csv("../data/raw/gnews_usaid_kenya_full_en_sw.csv", index=False)

print(f"\n✅ Done. Saved {len(df)} full-text articles (EN + SW) to CSV.")



🌐 Searching in language: EN
🔍 EN: Searching from 2025-03-28 to 2025-04-04
   ✅ Found 23 articles
🔍 EN: Searching from 2025-04-04 to 2025-04-11
   ✅ Found 14 articles
🔍 EN: Searching from 2025-04-11 to 2025-04-18
   ✅ Found 10 articles
🔍 EN: Searching from 2025-04-18 to 2025-04-25
   ✅ Found 23 articles
🔍 EN: Searching from 2025-04-25 to 2025-05-02
   ✅ Found 36 articles
🔍 EN: Searching from 2025-05-02 to 2025-05-09
   ✅ Found 31 articles
🔍 EN: Searching from 2025-05-09 to 2025-05-16
   ✅ Found 28 articles
🔍 EN: Searching from 2025-05-16 to 2025-05-23
   ✅ Found 24 articles
🔍 EN: Searching from 2025-05-23 to 2025-05-30
   ✅ Found 35 articles
🔍 EN: Searching from 2025-05-30 to 2025-06-03
   ✅ Found 21 articles

🌐 Searching in language: SW
🔍 SW: Searching from 2025-03-28 to 2025-04-04
   ✅ Found 23 articles
🔍 SW: Searching from 2025-04-04 to 2025-04-11
   ✅ Found 15 articles
🔍 SW: Searching from 2025-04-11 to 2025-04-18
   ✅ Found 10 articles
🔍 SW: Searching from 2025-04-18 to 2025-04-25

In [28]:
# --- INSTALLATION ---
# pip install gnews newspaper3k pandas requests

from gnews import GNews
from newspaper import Article
import pandas as pd
from datetime import datetime, timedelta
import requests

# --- FUNCTION TO RESOLVE REDIRECTS ---
def resolve_real_url(google_news_url):
    try:
        response = requests.get(google_news_url, timeout=10, allow_redirects=True)
        return response.url  # Final URL after redirects
    except Exception as e:
        print(f"   ❌ Could not resolve URL {google_news_url}: {e}")
        return None

# --- PARAMETERS ---
query = "USAID Kenya"
start_date = datetime(2025, 3, 28)
end_date = datetime(2025, 6, 3)
chunk_days = 7
max_results_per_chunk = 50
languages = ['en', 'sw']  # English and Swahili

# --- STORAGE ---
all_articles = []

# --- LOOP THROUGH LANGUAGES ---
for lang in languages:
    print(f"\n🌐 Searching in language: {lang.upper()}")
    current_start = start_date

    while current_start < end_date:
        current_end = min(current_start + timedelta(days=chunk_days), end_date)

        gnews = GNews(language=lang, country='KE', max_results=max_results_per_chunk)
        gnews.start_date = current_start
        gnews.end_date = current_end

        print(f"🔍 {lang.upper()}: Searching from {current_start.date()} to {current_end.date()}")

        try:
            results = gnews.get_news(query)
            print(f"   ✅ Found {len(results)} articles")
        except Exception as e:
            print(f"   ❌ Error during search: {e}")
            results = []

        for article in results:
            try:
                real_url = resolve_real_url(article['url'])
                if real_url is None:
                    continue

                a = Article(real_url)
                a.download()
                a.parse()

                all_articles.append({
                    'title': a.title,
                    'url': real_url,
                    'published_date': article.get('published date'),
                    'source': article['publisher']['title'],
                    'language': lang,
                    'text': a.text
                })
            except Exception as e:
                print(f"   ❌ Failed to parse article from {article['url']}: {e}")

        current_start += timedelta(days=chunk_days)

# --- SAVE TO CSV ---
df = pd.DataFrame(all_articles)
df['published_date'] = pd.to_datetime(df['published_date'], errors='coerce')
df.to_csv("../data/raw/gnews_usaid_kenya_full_en_sw.csv", index=False)

print(f"\n✅ Done. Saved {len(df)} full-text articles (EN + SW) to CSV.")




🌐 Searching in language: EN
🔍 EN: Searching from 2025-03-28 to 2025-04-04
   ✅ Found 23 articles
🔍 EN: Searching from 2025-04-04 to 2025-04-11
   ✅ Found 14 articles
🔍 EN: Searching from 2025-04-11 to 2025-04-18
   ✅ Found 10 articles
🔍 EN: Searching from 2025-04-18 to 2025-04-25
   ✅ Found 23 articles
🔍 EN: Searching from 2025-04-25 to 2025-05-02
   ✅ Found 35 articles
🔍 EN: Searching from 2025-05-02 to 2025-05-09
   ✅ Found 31 articles
🔍 EN: Searching from 2025-05-09 to 2025-05-16
   ✅ Found 28 articles
🔍 EN: Searching from 2025-05-16 to 2025-05-23
   ✅ Found 29 articles
🔍 EN: Searching from 2025-05-23 to 2025-05-30
   ✅ Found 33 articles
🔍 EN: Searching from 2025-05-30 to 2025-06-03
   ✅ Found 22 articles

🌐 Searching in language: SW
🔍 SW: Searching from 2025-03-28 to 2025-04-04
   ✅ Found 23 articles
🔍 SW: Searching from 2025-04-04 to 2025-04-11
   ✅ Found 14 articles
🔍 SW: Searching from 2025-04-11 to 2025-04-18
   ✅ Found 10 articles
🔍 SW: Searching from 2025-04-18 to 2025-04-25

In [29]:
df.head()

Unnamed: 0,title,url,published_date,source,language,text
0,Google News,https://news.google.com/rss/articles/CBMimwFBV...,2025-04-04 07:00:00+00:00,Daily Nation,en,
1,Google News,https://news.google.com/rss/articles/CBMirgFBV...,2025-04-01 07:00:00+00:00,Kenyans,en,
2,Google News,https://news.google.com/rss/articles/CBMiigFBV...,2025-04-02 07:00:00+00:00,NTV Kenya,en,
3,Google News,https://news.google.com/rss/articles/CBMimwFBV...,2025-03-31 07:00:00+00:00,KBC Digital,en,
4,Google News,https://news.google.com/rss/articles/CBMixwFBV...,2025-04-01 07:00:00+00:00,The EastAfrican,en,


## News from `RSS`

In [27]:
# --- INSTALLATION ---
# pip install feedparser newspaper3k pandas

import feedparser
from newspaper import Article
import pandas as pd
from datetime import datetime

# --- PARAMETERS ---
query = "USAID Kenya"
rss_feeds = {
    'en': [
        f"https://news.google.com/rss/search?q={query}+when:30d&hl=en-KE&gl=KE&ceid=KE:en"
    ],
    'sw': [
        f"https://news.google.com/rss/search?q={query}+when:30d&hl=sw&gl=KE&ceid=KE:sw"
    ]
}
start_date = datetime(2025, 3, 28)
end_date = datetime(2025, 6, 3)

# --- FETCH + EXTRACT ---
articles = []

for lang, feeds in rss_feeds.items():
    print(f"\n🌐 Processing language: {lang.upper()}")
    for url in feeds:
        print(f"🔗 Reading RSS: {url}")
        feed = feedparser.parse(url)
        
        for entry in feed.entries:
            try:
                published = entry.get('published', '') or entry.get('updated', '')
                published_dt = pd.to_datetime(published, errors='coerce')
                if pd.isna(published_dt):
                    continue
                if not (start_date <= published_dt <= end_date):
                    continue

                article = Article(entry.link)
                article.download()
                article.parse()

                articles.append({
                    'title': article.title,
                    'url': entry.link,
                    'published_date': published_dt,
                    'source': entry.get('source', {}).get('title', 'Unknown'),
                    'language': lang,
                    'text': article.text
                })
            except Exception as e:
                print(f"❌ Failed to parse article: {entry.link} - {e}")

# --- SAVE TO CSV ---
df = pd.DataFrame(articles)
df.to_csv("../data/raw/rss_usaid_kenya_en_sw.csv", index=False)
print(f"\n✅ Done. Saved {len(df)} full-text articles from RSS feeds.")



🌐 Processing language: EN
🔗 Reading RSS: https://news.google.com/rss/search?q=USAID Kenya+when:30d&hl=en-KE&gl=KE&ceid=KE:en


InvalidURL: URL can't contain control characters. '/rss/search?q=USAID Kenya+when:30d&hl=en-KE&gl=KE&ceid=KE:en' (found at least ' ')

In [13]:
df.sample(n=10,random_state=42)

Unnamed: 0,title,url,published_date,source,language,text
79,Google News,https://news.google.com/rss/articles/CBMiowFBV...,2025-04-30 07:00:00+00:00,Tech In Africa,en,
316,Google News,https://news.google.com/rss/articles/CBMiZEFVX...,2025-04-21 07:00:00+00:00,TRT Global,sw,
485,Google News,https://news.google.com/rss/articles/CBMie0FVX...,2025-06-02 07:00:00+00:00,TechCabal,sw,
396,Google News,https://news.google.com/rss/articles/CBMilgFBV...,2025-05-15 07:00:00+00:00,Daily Nation,sw,
167,Google News,https://news.google.com/rss/articles/CBMiswFBV...,2025-05-22 07:00:00+00:00,Kenyans,en,
493,Google News,https://news.google.com/rss/articles/CBMiX0FVX...,2025-06-02 07:00:00+00:00,Centers for Disease Control and Prevention | C...,sw,
63,Google News,https://news.google.com/rss/articles/CBMiowFBV...,2025-04-22 07:00:00+00:00,Times Higher Education,en,
185,Google News,https://news.google.com/rss/articles/CBMitgFBV...,2025-05-20 07:00:00+00:00,Tuko News,en,
84,Google News,https://news.google.com/rss/articles/CBMi1AFBV...,2025-05-02 07:00:00+00:00,Business Insider Africa,en,
124,Google News,https://news.google.com/rss/articles/CBMi1AFBV...,2025-05-02 07:00:00+00:00,Business Insider Africa,en,


In [14]:
"""# --- SAVE TO CSV ---
df = pd.DataFrame(all_articles)
df['published_at'] = pd.to_datetime(df['published_at'])
df.to_csv('data/newsapi_articles.csv', index=False)

print(f"✅ Fetched {len(df)} articles. Saved to data/newsapi_articles.csv.")"""


'# --- SAVE TO CSV ---\ndf = pd.DataFrame(all_articles)\ndf[\'published_at\'] = pd.to_datetime(df[\'published_at\'])\ndf.to_csv(\'data/newsapi_articles.csv\', index=False)\n\nprint(f"✅ Fetched {len(df)} articles. Saved to data/newsapi_articles.csv.")'

In [15]:
import requests
import pandas as pd
from datetime import datetime

# --- PARAMETERS ---
api_key = 'bc6c52fd05ee4e63827b7cf45fa0bdb2'

# More focused query
query = '(USAID OR donor aid OR foreign aid OR healthcare funding) AND Kenya'
from_date = '2025-05-04'
page_size = 100  # NewsAPI max per page
max_pages = 1    # Free tier limit is 100 articles total

# Recommended reliable/regional domains (you can expand)
domains = 'nation.africa,standardmedia.co.ke,citizen.digital,aljazeera.com,bbc.com,reuters.com,who.int,devex.com,un.org'

# --- FETCH ARTICLES ---
all_articles = []

for page in range(1, max_pages + 1):
    url = (
        f'https://newsapi.org/v2/everything?'
        f'q={query}&'
        f'from={from_date}&'
        f'sortBy=publishedAt&'
        f'domains={domains}&'
        f'pageSize={page_size}&'
        f'page={page}&'
        f'apiKey={api_key}'
    )

    response = requests.get(url)
    if response.status_code != 200:
        print(f"❌ Error: {response.status_code} - {response.json()}")
        break

    articles = response.json().get('articles', [])
    if not articles:
        break

    for article in articles:
        all_articles.append({
            'source': article['source']['name'],
            'author': article.get('author'),
            'title': article.get('title'),
            'description': article.get('description'),
            'content': article.get('content'),
            'url': article.get('url'),
            'published_at': article.get('publishedAt')
        })

# --- Convert to DataFrame ---
df = pd.DataFrame(all_articles)
#print(df[['title', 'source', 'published_at']])
df.sample(n=10,random_state=42)


❌ Error: 426 - {'status': 'error', 'code': 'parameterInvalid', 'message': 'You are trying to request results too far in the past. Your plan permits you to request articles as far back as 2025-05-12, but you have requested 2025-05-04. You may need to upgrade to a paid plan.'}


ValueError: a must be greater than 0 unless no samples are taken

In [None]:
len(df)

11

In [None]:
df_citizen = df[df['source']== 'Citizen.digital']
df_citizen.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 7 to 9
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   source        2 non-null      object
 1   author        2 non-null      object
 2   title         2 non-null      object
 3   description   2 non-null      object
 4   content       2 non-null      object
 5   url           2 non-null      object
 6   published_at  2 non-null      object
dtypes: object(7)
memory usage: 128.0+ bytes


In [None]:
x = df_citizen['content'].to_list()
display( x[0])
len(x[0])

'In 2024, the World Bank projected that the overall unemployment rate of the youth in Kenya was at 5.7 per cent. \r\nThe Federation of Kenya (FKE) says that the youth account for over 35 per cent of the… [+8544 chars]'

214

## 4. X Data collection

In [None]:
import snscrape.modules.twitter as sntwitter
import pandas as pd
from tqdm import tqdm

# List of keywords and common Kenyan locations (can expand)
kenyan_locations = ['kenya', 'nairobi', 'mombasa', 'kisumu', 'eldoret', 'nakuru', 'ke']

# Function to check if location mentions a Kenyan place
def is_kenyan_location(loc):
    if not loc:
        return False
    loc = loc.lower()
    return any(place in loc for place in kenyan_locations)

# Query Twitter for tweets mentioning USAID and Kenya
query = 'USAID Kenya lang:en since:2024-12-01 until:2025-06-01'
tweets = []
max_results = 500  # increase if needed

print("Scraping tweets...")
for i, tweet in enumerate(tqdm(sntwitter.TwitterSearchScraper(query).get_items(), total=max_results)):
    if i >= max_results:
        break
    tweets.append({
        'date': tweet.date,
        'username': tweet.user.username,
        'content': tweet.content,
        'user_location': tweet.user.location,
        'coordinates': tweet.coordinates
    })

# Load into DataFrame
df = pd.DataFrame(tweets)

# Filter tweets by user_location (Kenya-only)
df['is_kenyan'] = df['user_location'].apply(is_kenyan_location)
df_kenya = df[df['is_kenyan'] == True].reset_index(drop=True)

print(f"\nTotal tweets scraped: {len(df)}")
print(f"Tweets with Kenyan location: {len(df_kenya)}")

# Preview
print("\nSample Kenyan tweet:")
print(df_kenya[['date', 'username', 'content', 'user_location']].iloc[0])


  0%|          | 0/500 [00:00<?, ?it/s]

Scraping tweets...


Error retrieving https://twitter.com/i/api/graphql/7jT5GT59P8IFjgxwqnEdQw/SearchTimeline?variables=%7B%22rawQuery%22%3A%22USAID%20Kenya%20lang%3Aen%20since%3A2024-12-01%20until%3A2025-06-01%22%2C%22count%22%3A20%2C%22product%22%3A%22Latest%22%2C%22withDownvotePerspective%22%3Afalse%2C%22withReactionsMetadata%22%3Afalse%2C%22withReactionsPerspective%22%3Afalse%7D&features=%7B%22rweb_lists_timeline_redesign_enabled%22%3Afalse%2C%22blue_business_profile_image_shape_enabled%22%3Afalse%2C%22responsive_web_graphql_exclude_directive_enabled%22%3Atrue%2C%22verified_phone_label_enabled%22%3Afalse%2C%22creator_subscriptions_tweet_preview_api_enabled%22%3Afalse%2C%22responsive_web_graphql_timeline_navigation_enabled%22%3Atrue%2C%22responsive_web_graphql_skip_user_profile_image_extensions_enabled%22%3Afalse%2C%22tweetypie_unmention_optimization_enabled%22%3Atrue%2C%22vibe_api_enabled%22%3Atrue%2C%22responsive_web_edit_tweet_api_enabled%22%3Atrue%2C%22graphql_is_translatable_rweb_tweet_is_translata

ScraperException: 4 requests to https://twitter.com/i/api/graphql/7jT5GT59P8IFjgxwqnEdQw/SearchTimeline?variables=%7B%22rawQuery%22%3A%22USAID%20Kenya%20lang%3Aen%20since%3A2024-12-01%20until%3A2025-06-01%22%2C%22count%22%3A20%2C%22product%22%3A%22Latest%22%2C%22withDownvotePerspective%22%3Afalse%2C%22withReactionsMetadata%22%3Afalse%2C%22withReactionsPerspective%22%3Afalse%7D&features=%7B%22rweb_lists_timeline_redesign_enabled%22%3Afalse%2C%22blue_business_profile_image_shape_enabled%22%3Afalse%2C%22responsive_web_graphql_exclude_directive_enabled%22%3Atrue%2C%22verified_phone_label_enabled%22%3Afalse%2C%22creator_subscriptions_tweet_preview_api_enabled%22%3Afalse%2C%22responsive_web_graphql_timeline_navigation_enabled%22%3Atrue%2C%22responsive_web_graphql_skip_user_profile_image_extensions_enabled%22%3Afalse%2C%22tweetypie_unmention_optimization_enabled%22%3Atrue%2C%22vibe_api_enabled%22%3Atrue%2C%22responsive_web_edit_tweet_api_enabled%22%3Atrue%2C%22graphql_is_translatable_rweb_tweet_is_translatable_enabled%22%3Atrue%2C%22view_counts_everywhere_api_enabled%22%3Atrue%2C%22longform_notetweets_consumption_enabled%22%3Atrue%2C%22tweet_awards_web_tipping_enabled%22%3Afalse%2C%22freedom_of_speech_not_reach_fetch_enabled%22%3Afalse%2C%22standardized_nudges_misinfo%22%3Atrue%2C%22tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled%22%3Afalse%2C%22interactive_text_enabled%22%3Atrue%2C%22responsive_web_text_conversations_enabled%22%3Afalse%2C%22longform_notetweets_rich_text_read_enabled%22%3Afalse%2C%22longform_notetweets_inline_media_enabled%22%3Afalse%2C%22responsive_web_enhance_cards_enabled%22%3Afalse%2C%22responsive_web_twitter_blue_verified_badge_is_enabled%22%3Atrue%7D failed, giving up.

In [None]:
# src/x_collection.py
import os
import tweepy
import pandas as pd
from datetime import datetime, timedelta



X_BEARER_TOKEN = os.getenv("X_BEARER_TOKEN")

try:
    client = tweepy.Client(X_BEARER_TOKEN)
    print("X API client initialized.")
except Exception as e:
    print(f"Error initializing X API client: {e}")
    print("Please ensure your X_BEARER_TOKEN is correct and valid. Skipping X data collection.")
    client = None

if client:
    # Define your keywords and a recent start date (REQUIRED FOR FREE/BASIC TIER)
    query_keywords = "USAID Kenya lang:en -is:retweet" # English, exclude retweets

    # For free tier, search is limited to the last 7 days.
    # We will try to set the date to March 1st, 2025, but anticipate failure/no results
    # if it's outside the 7-day window for your current execution date.
    target_start_date_str = "2025-03-01T00:00:00Z" # ISO 8601 format
    # The actual 'start_time' for search_recent_tweets must be within 7 days.
    # We'll set it to 7 days ago to demonstrate functionality, but this will NOT get March 1st, 2025 data currently.
    actual_start_time_for_search = (datetime.utcnow() - timedelta(days=7)).replace(microsecond=0)
    end_time_for_search = datetime.utcnow().replace(microsecond=0)

    print(f"\n--- Attempting to collect X data for '{query_keywords}' from {target_start_date_str} ---")
    print(f"NOTE: X API free/basic tiers typically limit searches to the last 7 days.")
    print(f"Actual search window being attempted: from {actual_start_time_for_search.isoformat()}Z to {end_time_for_search.isoformat()}Z")

    tweets_data = []
    try:
        # max_results is 100 for recent search endpoint on standard access.
        # Free tier is even lower.
        response = client.search_recent_tweets(
            query=query_keywords,
            start_time=actual_start_time_for_search,
            end_time=end_time_for_search,
            tweet_fields=["created_at", "public_metrics", "author_id", "lang"],
            expansions=["author_id"],
            max_results=100 # Adjust based on your tier and testing
        )

        if response and response.data:
            users = {user["id"]: user for user in response.includes.get("users", [])}
            for tweet in response.data:
                author_username = users.get(tweet.author_id, {}).get("username", "[unknown]")
                tweets_data.append({
                    "id": tweet.id,
                    "text": tweet.text,
                    "created_at": tweet.created_at,
                    "retweet_count": tweet.public_metrics.get("retweet_count", 0),
                    "reply_count": tweet.public_metrics.get("reply_count", 0),
                    "like_count": tweet.public_metrics.get("like_count", 0),
                    "quote_count": tweet.public_metrics.get("quote_count", 0),
                    "author_id": tweet.author_id,
                    "author_username": author_username
                })
            df_tweets = pd.DataFrame(tweets_data)
            print(f"\n--- X Data Preview (first 5 relevant entries) ---")
            print(df_tweets[['text', 'author_username', 'created_at', 'like_count']].head())
            print(f"Total relevant X tweets collected: {len(df_tweets)}")
            df_tweets.to_csv("../data/raw/x_usaid_kenya_tweets.csv", index=False)
            print("\nX data saved to ../data/raw/x_usaid_kenya_tweets.csv")
        else:
            print("No tweets found for this query within the accessible time range or API limits reached.")

    except tweepy.TweepyException as e:
        print(f"Tweepy API Error: {e}")
        print("This often indicates rate limits, invalid query, or insufficient access permissions (e.g., historical access blocked).")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
else:
    print("X API client not initialized. Skipping X data collection.")

X API client initialized.

--- Attempting to collect X data for 'USAID Kenya lang:en -is:retweet' from 2025-03-01T00:00:00Z ---
NOTE: X API free/basic tiers typically limit searches to the last 7 days.
Actual search window being attempted: from 2025-06-04T18:32:06Z to 2025-06-11T18:32:06Z
Tweepy API Error: 401 Unauthorized
Unauthorized
This often indicates rate limits, invalid query, or insufficient access permissions (e.g., historical access blocked).
