# Collecting and Cleaning News Articles - Using NewsAPI

In this notebook, we collect news articles related to **Donald Trump** and **Kamala Harris** using the NewsAPI, apply data cleaning techniques, and prepare the data for further analysis.

In [1]:
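import json
import os
import re
from datetime import datetime, timedelta

import pandas as pd
import requests

# Read the key from the environment instead of hard-coding it in the notebook.
# NEWSAPI_KEY is an assumed variable name; set it before running this cell.
API_KEY = os.environ["NEWSAPI_KEY"]
URL = "https://newsapi.org/v2/everything"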

## Fetching News Articles from NewsAPI

We use the **NewsAPI** to gather news articles that mention **Donald Trump** and **Kamala Harris**. Our goal is to fetch articles where:

- **Donald Trump** is mentioned in the title or description but **Kamala Harris** is not.
- **Kamala Harris** is mentioned in the title or description but **Donald Trump** is not.

To achieve this, we split the search window from **2024-09-17** to **2024-10-22** into five weekly intervals. We fetch the matching articles for each interval, combine them per candidate, and save the results in two JSON files:
- **trump_articles.json** for articles related to Donald Trump.
- **harris_articles.json** for articles related to Kamala Harris.

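Each request returns at most 100 articles (`pageSize`), so querying week by week lets us collect up to 100 articles per candidate per week rather than 100 for the whole period. This gives broader coverage while keeping the number of API calls small (ten in total). Restricting the search to the title and description (`searchIn`) keeps the results relevant.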
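
As a rough illustration of what one weekly request carries, the sketch below builds the query parameters for a single window and URL-encodes them with the standard library. The helper name `build_params` and the `YOUR_KEY` placeholder are hypothetical. Quoting each name as an exact phrase and joining them with `NOT` is an assumption based on NewsAPI's documented advanced-search syntax.

```python
from urllib.parse import urlencode


def build_params(query, exclude, from_date, to_date, page_size=100):
    """Hypothetical helper: parameters for one /v2/everything request."""
    return {
        # Assumed query form: exact phrase for the candidate, NOT for the other one.
        "q": f'"{query}" NOT "{exclude}"',
        "from": from_date,
        "to": to_date,
        "language": "en",
        "pageSize": page_size,
        "searchIn": "title,description",
        "apiKey": "YOUR_KEY",  # placeholder, not a real key
    }


params = build_params("Donald Trump", "Kamala Harris", "2024-09-17", "2024-09-24")
print(params["q"])        # "Donald Trump" NOT "Kamala Harris"
print(urlencode(params))  # query string that would be appended to the endpoint URL
```

Nothing is sent over the network here; the point is only to show how the two names and the date window end up in the request.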


In [None]:
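def fetch_articles(query, exclude, from_date, to_date, page_size=100):
    """Fetch articles that mention `query` but not `exclude` in the title or description."""
    params = {
        # Quote both names so each is matched as an exact phrase and the
        # exclusion applies to the whole name, not just its first word.
        "q": f'"{query}" NOT "{exclude}"',
        "from": from_date,
        "to": to_date,
        "language": "en",
        "pageSize": page_size,
        "searchIn": "title,description",
        "apiKey": API_KEY,
    }
    response = requests.get(URL, params=params, timeout=30)
    data = response.json()

    # NewsAPI reports failures in the JSON body via "status" and "message".
    if data.get("status") != "ok":
        print(f"Error: {data.get('message', 'unknown error')}")
        return []

    return data.get("articles", [])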


def save_articles_to_json(articles, filename):
    with open(filename, "w") as file:
        json.dump(articles, file, indent=4)


# Function to split date range into weekly intervals
def date_range_splitter(start_date, end_date, delta=7):
    date_ranges = []
    current_date = start_date
    while current_date < end_date:
        next_date = current_date + timedelta(days=delta)
        date_ranges.append(
            (current_date.strftime("%Y-%m-%d"), next_date.strftime("%Y-%m-%d"))
        )
        current_date = next_date
    return date_ranges
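
To make the windowing concrete, the sketch below calls the same splitter on a two-week range and prints the windows. Consecutive windows share a boundary date, so an article published on that date could be returned twice, depending on how the API treats the boundary; such duplicates are dropped later during cleaning. `split_non_overlapping` is a hypothetical variant, not part of this notebook, that ends each window one day before the next begins and clamps the last window to the end date.

```python
from datetime import datetime, timedelta


def date_range_splitter(start_date, end_date, delta=7):
    """Same logic as the notebook's splitter: windows share their boundary date."""
    date_ranges = []
    current_date = start_date
    while current_date < end_date:
        next_date = current_date + timedelta(days=delta)
        date_ranges.append(
            (current_date.strftime("%Y-%m-%d"), next_date.strftime("%Y-%m-%d"))
        )
        current_date = next_date
    return date_ranges


def split_non_overlapping(start_date, end_date, delta=7):
    """Hypothetical variant: inclusive windows that never share a date."""
    ranges = []
    current = start_date
    while current <= end_date:
        window_end = min(current + timedelta(days=delta - 1), end_date)
        ranges.append((current.strftime("%Y-%m-%d"), window_end.strftime("%Y-%m-%d")))
        current = window_end + timedelta(days=1)
    return ranges


start, end = datetime(2024, 9, 17), datetime(2024, 10, 1)
print(date_range_splitter(start, end))
# [('2024-09-17', '2024-09-24'), ('2024-09-24', '2024-10-01')]
print(split_non_overlapping(start, end))
# [('2024-09-17', '2024-09-23'), ('2024-09-24', '2024-09-30'), ('2024-10-01', '2024-10-01')]
```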

In [16]:
start_date = datetime(2024, 9, 17)
end_date = datetime(2024, 10, 22)  # fixed end date so reruns yield the same five windows
date_ranges = date_range_splitter(start_date, end_date, delta=7)  # weekly intervals

all_trump_articles = []
all_harris_articles = []

for from_date, to_date in date_ranges:
    print(f"Fetching Trump articles from {from_date} to {to_date}")
    trump_articles = fetch_articles("Donald Trump", "Kamala Harris", from_date, to_date)
    all_trump_articles.extend(trump_articles)

    print(f"Fetching Harris articles from {from_date} to {to_date}")
    harris_articles = fetch_articles("Kamala Harris", "Donald Trump", from_date, to_date)
    all_harris_articles.extend(harris_articles)

save_articles_to_json(all_trump_articles, "dataset/trump_articles.json")
save_articles_to_json(all_harris_articles, "dataset/harris_articles.json")

print(
    f"Saved {len(all_trump_articles)} Trump articles and {len(all_harris_articles)} Harris articles."
)

Fetching Trump articles from 2024-09-17 to 2024-09-24
Fetching Harris articles from 2024-09-17 to 2024-09-24
Fetching Trump articles from 2024-09-24 to 2024-10-01
Fetching Harris articles from 2024-09-24 to 2024-10-01
Fetching Trump articles from 2024-10-01 to 2024-10-08
Fetching Harris articles from 2024-10-01 to 2024-10-08
Fetching Trump articles from 2024-10-08 to 2024-10-15
Fetching Harris articles from 2024-10-08 to 2024-10-15
Fetching Trump articles from 2024-10-15 to 2024-10-22
Fetching Harris articles from 2024-10-15 to 2024-10-22
Saved 441 Trump articles and 500 Harris articles.


## Cleaning the Data

Once the data is collected, we proceed with cleaning to ensure consistency and remove irrelevant characters or noise. The following cleaning steps are applied to both the **title** and **description** fields:

1. **Lowercasing**: Convert all characters to lowercase for uniformity.
2. **URL Removal**: Remove URLs, which do not contribute to content analysis.
3. **Non-ASCII Character Removal**: Drop non-ASCII characters (e.g., `\u00a0`), which often appear as encoding artifacts.
4. **Non-Letter Character Removal**: Remove everything except letters and whitespace, so punctuation and digits are dropped.
5. **Newline and Tab Removal**: Replace newline and tab characters with spaces to keep the text on a single line.
6. **Whitespace Normalization**: Collapse runs of whitespace into a single space and trim the ends.

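Articles whose title or description contains the word "removed" (NewsAPI marks withdrawn articles with the placeholder `[Removed]`) and **duplicate** articles are also excluded from the dataset.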

After applying these steps, we save the cleaned data into new JSON files:
- **cleaned_trump_articles.json**
- **cleaned_harris_articles.json**

This cleaned data will be used for further analysis.
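
As a quick check of what these steps do to a headline, the sketch below traces a made-up string through the same sequence of operations. The function mirrors this notebook's `clean_text`; the sample text and URL are invented for illustration. Because only letters and whitespace survive, the digit "3" disappears from the result.

```python
import re


def clean_text(text):
    text = text.lower()                             # lowercase
    text = re.sub(r"http\S+", "", text)             # strip URLs
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
    text = re.sub(r"[^a-zA-Z\s]", "", text)         # keep letters and whitespace only
    text = re.sub(r"[\n\t]", " ", text)             # newlines and tabs become spaces
    text = re.sub(r"\s+", " ", text)                # collapse repeated whitespace
    return text.strip()


# Invented sample containing a digit, a non-breaking space, an en dash, and a URL.
sample = "Harris Leads by 3 Points\u00a0\u2013 Poll\n\thttps://example.com/a?b=1"
print(clean_text(sample))  # harris leads by points poll
```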

In [2]:
with open("dataset/trump_articles.json", "r") as file:
    trump_articles = json.load(file)

with open("dataset/harris_articles.json", "r") as file:
    harris_articles = json.load(file)

df_trump = pd.DataFrame(trump_articles)
df_harris = pd.DataFrame(harris_articles)

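def clean_text(text):
    """Normalize a title or description into lowercase, letters-only text."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # strip URLs
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # keep letters and whitespace only
    text = re.sub(r"[\n\t]", " ", text)  # newlines and tabs become spaces
    text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace
    return text.strip()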

# Function to filter out entries with "removed" in title or description
def filter_removed_entries(df):
    # Remove rows where title or description contains 'removed'
    df_filtered = df[~((df['title'].str.contains('removed', case=False, na=False)) |
                       (df['description'].str.contains('removed', case=False, na=False)))]
    return df_filtered
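
Because the filter above matches the substring "removed" anywhere in the text, it can also discard genuine headlines such as "official removed from office". A stricter variant is sketched below, under the assumption that placeholder entries consist of nothing but the word "removed" (for example `[Removed]` before cleaning). It compares the whole field after stripping brackets and whitespace. `filter_placeholder_entries` and the sample rows are hypothetical.

```python
import pandas as pd


def filter_placeholder_entries(df):
    """Drop rows whose title or description is only a 'removed' placeholder."""

    def is_placeholder(column):
        # Treat missing values as empty, then strip whitespace and square brackets.
        normalized = column.fillna("").str.strip().str.strip("[]").str.lower()
        return normalized == "removed"

    mask = is_placeholder(df["title"]) | is_placeholder(df["description"])
    return df[~mask]


# Invented rows: one placeholder entry and two genuine articles.
sample = pd.DataFrame(
    {
        "title": ["[Removed]", "official removed from office", "rally recap"],
        "description": ["[Removed]", "a real story", None],
    }
)
print(filter_placeholder_entries(sample)["title"].tolist())
# ['official removed from office', 'rally recap']
```

Whether the looser or the stricter match is preferable depends on how many genuine headlines contain the word; the counts reported in this notebook come from the substring filter.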

In [3]:
# Apply clean_text to the title and description columns; missing values become empty strings
df_trump['title'] = df_trump['title'].apply(lambda x: clean_text(x) if pd.notnull(x) else '')
df_trump['description'] = df_trump['description'].apply(lambda x: clean_text(x) if pd.notnull(x) else '')

df_harris['title'] = df_harris['title'].apply(lambda x: clean_text(x) if pd.notnull(x) else '')
df_harris['description'] = df_harris['description'].apply(lambda x: clean_text(x) if pd.notnull(x) else '')

# Remove duplicates (if any)
df_trump = df_trump.drop_duplicates(subset=['title', 'description'])
df_harris = df_harris.drop_duplicates(subset=['title', 'description'])

# Filter out "removed" entries
df_trump = filter_removed_entries(df_trump)
df_harris = filter_removed_entries(df_harris)

# Save cleaned data to new JSON files
df_trump.to_json('dataset/cleaned_trump_articles.json', orient='records', indent=4)
df_harris.to_json('dataset/cleaned_harris_articles.json', orient='records', indent=4)

print(f"Cleaned data: {len(df_trump)} Trump articles and {len(df_harris)} Harris articles saved.")

Cleaned data: 361 Trump articles and 407 Harris articles saved.
