# Collection of Articles Using NewsAPI

### Authors: Julian Rojas & Rafaël Mourouvin
---

This script runs an automated search for news articles on the environmental impact of artificial intelligence using NewsAPI. It retrieves articles in English and French that match the configured keywords.

The script extracts metadata for each article (Title, Source, Raw Date, Parsed Date, Link). It asks NewsAPI to order results by publication date through `SORT_BY = "publishedAt"`, then sorts the combined English and French results from most recent to oldest on the `Parsed Date` column. It then attempts to retrieve the full text of each article with `get_full_article_text(url)`, which uses the **newspaper3k** library.
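The date handling can be sketched in isolation. The snippet below is a minimal illustration, assuming `publishedAt` values are ISO 8601 strings ending in `Z` (as the script's parsing suggests); the sample records and the helper name `parse_published_at` are made up for the example and are not part of the script.

```python
from datetime import datetime

def parse_published_at(raw):
    """Parse an ISO 8601 timestamp such as '2025-03-20T18:24:47Z' (hypothetical helper)."""
    # Older Python versions reject a trailing 'Z', so map it to an explicit UTC offset.
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))

# Made-up sample metadata, shaped like the fields the script keeps.
sample = [
    {"Title": "A", "Raw Date": "2025-03-20T14:13:00Z"},
    {"Title": "B", "Raw Date": "2025-03-20T18:24:47Z"},
    {"Title": "C", "Raw Date": "2025-03-20T15:39:00Z"},
]

for row in sample:
    row["Parsed Date"] = parse_published_at(row["Raw Date"])

# Most recent first, mirroring a descending sort on "Parsed Date".
newest_first = sorted(sample, key=lambda row: row["Parsed Date"], reverse=True)
print([row["Title"] for row in newest_first])  # ['B', 'C', 'A']
```

Because every parsed value carries the same UTC offset, the comparison used by the sort is well defined.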

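Articles that cannot be downloaded, or whose extracted text is shorter than `MIN_TEXT_LENGTH` (100 characters), are given a placeholder string beginning with `Error fetching article`. They are then removed with the condition `df_articles[~df_articles["Full Article Text"].str.startswith("Error fetching article")]`, so only articles with usable text remain.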
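The filtering rule can be illustrated without pandas or network access. The sketch below is an assumption-laden toy version: `label_text`, `ERROR_PREFIX`, and the example URLs are hypothetical names introduced here, while the threshold and the sentinel prefix are taken from the script.

```python
MIN_TEXT_LENGTH = 100  # same threshold as in the configuration
ERROR_PREFIX = "Error fetching article"  # sentinel prefix the filter looks for

def label_text(text):
    """Return the text, or an error sentinel when it is shorter than MIN_TEXT_LENGTH."""
    text = text.strip()
    if len(text) < MIN_TEXT_LENGTH:
        return f"{ERROR_PREFIX}: content too short"
    return text

# Made-up extraction results: one long enough, one too short, one failed download.
texts = {
    "https://example.com/ok": label_text("word " * 50),
    "https://example.com/short": label_text("Too short."),
    "https://example.com/failed": f"{ERROR_PREFIX}: connection timed out",
}

# Keep only entries that do not start with the sentinel prefix.
valid = {url: text for url, text in texts.items() if not text.startswith(ERROR_PREFIX)}
print(sorted(valid))  # ['https://example.com/ok']
```

One consequence of using a string prefix as the marker is that a genuine article whose text happens to start with the same phrase would also be dropped.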

The final results are saved as a **CSV** file (`ai_environmental_impact_articles_FULLTEXT.csv`) containing the metadata and the full text of each valid article. The first 40 rows are also printed as script output using `print(df_articles[['Parsed Date', 'Language', 'Title', 'Source', 'Link', 'Full Article Text']].head(40))`.
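The export uses the `utf-8-sig` encoding, which writes a UTF-8 byte order mark (BOM) at the start of the file; spreadsheet applications typically use it to detect UTF-8 and display accented characters correctly. The sketch below shows the effect with the standard `csv` module rather than pandas; the rows and the file name `sample_articles.csv` are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up rows with accented characters, as French articles would contain.
rows = [
    {"Title": "L'empreinte énergétique de l'IA", "Language": "fr"},
    {"Title": "AI and water use", "Language": "en"},
]

path = os.path.join(tempfile.mkdtemp(), "sample_articles.csv")

# 'utf-8-sig' writes a UTF-8 byte order mark (BOM) before the first header.
with open(path, "w", newline="", encoding="utf-8-sig") as handle:
    writer = csv.DictWriter(handle, fieldnames=["Title", "Language"])
    writer.writeheader()
    writer.writerows(rows)

with open(path, "rb") as handle:
    raw = handle.read()
print(raw[:3] == b"\xef\xbb\xbf")  # True: the file starts with the BOM

# Reading back with the same encoding strips the BOM again.
with open(path, newline="", encoding="utf-8-sig") as handle:
    restored = list(csv.DictReader(handle))
print(restored[0]["Title"])
```

Reading the file back with plain `utf-8` instead would leave the BOM attached to the first column name, so the same encoding should be used on both sides.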

**Configuration parameters in the code**

- **Search keywords**: `QUERY_EN = "AI environmental impact"` in English and `QUERY_FR = "impact environnemental de l'IA"` in French.
- **Analyzed languages**: `LANGUAGES = {"en": QUERY_EN, "fr": QUERY_FR}` (English and French).
- **Number of articles retrieved per query**: `PAGE_SIZE = 60`, applied to each language separately.
- **Sorting criterion**: `SORT_BY = "publishedAt"`, sent as the `sortBy` request parameter, to order articles from most recent to oldest.
- **Minimum text length**: `MIN_TEXT_LENGTH = 100` characters; shorter extractions are treated as invalid.
- **Filtering invalid articles**: `df_articles[~df_articles["Full Article Text"].str.startswith("Error fetching article")]`.
- **Exporting results**: `df_articles.to_csv(OUTPUT_FILENAME, index=False, encoding='utf-8-sig')`.
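To make the role of these parameters concrete, the sketch below builds the request URL for each language without sending anything over the network. It is an illustration only: `build_request_url` is a hypothetical helper, `YOUR_API_KEY` is a placeholder, and the script itself passes the same parameters to `requests.get` instead of assembling the URL by hand.

```python
from urllib.parse import urlencode

QUERY_EN = "AI environmental impact"
QUERY_FR = "impact environnemental de l'IA"
LANGUAGES = {"en": QUERY_EN, "fr": QUERY_FR}
PAGE_SIZE = 60
SORT_BY = "publishedAt"

def build_request_url(query, language, api_key="YOUR_API_KEY"):
    """Build the request URL for one language (hypothetical helper, no network call)."""
    params = {
        "q": query,
        "language": language,
        "pageSize": PAGE_SIZE,
        "sortBy": SORT_BY,
        "apiKey": api_key,
    }
    # urlencode percent-encodes the values, e.g. spaces become '+'.
    return "https://newsapi.org/v2/everything?" + urlencode(params)

# One URL per configured language, each with its own query string.
urls = {lang: build_request_url(query, lang) for lang, query in LANGUAGES.items()}
for lang, url in urls.items():
    print(lang, url)
```

Each language triggers its own request, so up to `PAGE_SIZE` articles can come back per language before filtering.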



In [43]:
# ==============================================================================================================
# INSTALL PACKAGES AND IMPORT LIBRARIES
# ==============================================================================================================

# Install required packages if not already installed
import importlib.util
import sys
import subprocess

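# Map each pip package name to its import name; they differ for newspaper3k,
# so checking find_spec("newspaper3k") would trigger a reinstall on every run.
required_packages = {
    "pandas": "pandas",
    "requests": "requests",
    "newspaper3k": "newspaper",
    "lxml_html_clean": "lxml_html_clean",
}
for package, module in required_packages.items():
    if importlib.util.find_spec(module) is None:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])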

# Import libraries
import requests
import pandas as pd
from datetime import datetime
from newspaper import Article
from lxml.html.clean import clean_html


Installing newspaper3k...


In [44]:
# ==============================================================================================================
# CONFIGURATION 
# ==============================================================================================================

# NewsAPI key (placeholder: replace with your own key and keep real keys out of shared notebooks)
API_KEY = "YOUR_NEWSAPI_KEY"

# Search queries for articles (English & French)
QUERY_EN = "AI environmental impact"
QUERY_FR = "impact environnemental de l'IA"

# Languages to fetch articles in
LANGUAGES = {"en": QUERY_EN, "fr": QUERY_FR}

# Number of articles to fetch per query
PAGE_SIZE = 60

# Sorting criteria for articles (most recent first)
SORT_BY = "publishedAt"

# Minimum article text length to be considered valid
MIN_TEXT_LENGTH = 100

# Output file for saving the results
OUTPUT_FILENAME = "ai_environmental_impact_articles_FULLTEXT.csv"


# Warning: the number of articles fetched may not match the final exported count,
# because some articles are filtered out when they are inaccessible or shorter than MIN_TEXT_LENGTH.
# To keep more articles after filtering, consider increasing PAGE_SIZE (NewsAPI allows at most 100 per page).


In [47]:
# ==============================================================================================================
# FUNCTIONS
# ==============================================================================================================

def fetch_newsapi_articles(query, language):
    """Fetch news articles using NewsAPI.org."""
    url = "https://newsapi.org/v2/everything"
    params = {
        "q": query,
        "language": language,
        "pageSize": PAGE_SIZE,
        "sortBy": SORT_BY,
        "apiKey": API_KEY
    }

    # Set a timeout so a stalled connection cannot hang the notebook indefinitely
    response = requests.get(url, params=params, timeout=30)
    if response.status_code != 200:
        print(f"⚠️ Error fetching articles: HTTP {response.status_code}")
        return pd.DataFrame(columns=['Title', 'Source', 'Raw Date', 'Parsed Date', 'Link', 'Language'])

    data = response.json()
    articles = data.get("articles", [])

    results = []
    for article in articles:
        title = article.get("title", "N/A")
        source = article.get("source", {}).get("name", "N/A")
        date_text = article.get("publishedAt", "N/A")
        link = article.get("url", "N/A")
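        # Parse as timezone-aware UTC; unparseable values such as "N/A" become NaT.
        # (A naive datetime.min fallback cannot be compared with aware timestamps when sorting.)
        parsed_date = pd.to_datetime(date_text, utc=True, errors="coerce")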

        results.append([title, source, date_text, parsed_date, link, language])

    return pd.DataFrame(results, columns=['Title', 'Source', 'Raw Date', 'Parsed Date', 'Link', 'Language'])

def get_full_article_text(url):
    """Fetch the full article text using newspaper3k."""
    try:
        article = Article(url)
        article.download()
        article.parse()
        text = article.text.strip()
        if len(text) < MIN_TEXT_LENGTH:
            return "Error fetching article: content too short"
        return text
    except Exception as e:
        return f"Error fetching article: {e}"

def search_ai_environmental_impact_articles():
    """Fetch articles in multiple languages and combine results."""
    df_list = [fetch_newsapi_articles(query, lang) for lang, query in LANGUAGES.items()]
    df_combined = pd.concat(df_list, ignore_index=True)
    df_combined = df_combined.sort_values(by="Parsed Date", ascending=False).reset_index(drop=True)
    return df_combined

In [49]:
# ==============================================================================================================
# EXECUTION
# ==============================================================================================================

if __name__ == "__main__":
    pd.set_option('display.max_colwidth', None)

    # Step 1: Fetch articles
    df_articles = search_ai_environmental_impact_articles()
    print(f"\n🔍 Fetched {len(df_articles)} articles before filtering.\n")

    # Step 2: Fetch full article text
    print("📄 Fetching full article text. This may take a moment...\n")
    df_articles["Full Article Text"] = df_articles["Link"].apply(get_full_article_text)

    # Step 3: Remove articles with errors
    df_articles = df_articles[~df_articles["Full Article Text"].str.startswith("Error fetching article")]
    df_articles = df_articles.reset_index(drop=True)

    print(f"\n✅ Remaining articles after filtering: {len(df_articles)}\n")

    # Step 4: Save to CSV
    df_articles.to_csv(OUTPUT_FILENAME, index=False, encoding='utf-8-sig')

    # Step 5: Display results
    print(df_articles[['Parsed Date', 'Language', 'Title', 'Source', 'Link', 'Full Article Text']].head(40))
    print(f"\n✅ Full articles saved to: {OUTPUT_FILENAME}")


🔍 Fetched 73 articles before filtering.

📄 Fetching full article text. This may take a moment...


✅ Remaining articles after filtering: 60

                 Parsed Date Language  \
0  2025-03-20 18:24:47+00:00       en   
1  2025-03-20 18:08:41+00:00       en   
2  2025-03-20 15:39:00+00:00       en   
3  2025-03-20 15:02:28+00:00       en   
4  2025-03-20 15:00:00+00:00       en   
5  2025-03-20 14:57:00+00:00       en   
6  2025-03-20 14:48:40+00:00       en   
7  2025-03-20 14:36:00+00:00       en   
8  2025-03-20 14:17:00+00:00       en   
9  2025-03-20 14:13:00+00:00       en   
10 2025-03-20 14:13:00+00:00       fr   
11 2025-03-20 14:00:00+00:00       en   
12 2025-03-20 13:50:00+00:00       en   
13 2025-03-20 13:39:00+00:00       en   
14 2025-03-20 13:17:00+00:00       en   
15 2025-03-20 13:15:00+00:00       en   
16 2025-03-20 12:47:42+00:00       en   
17 2025-03-20 12:33:03+00:00       en   
18 2025-03-20 12:00:41+00:00       en   
19 2025-03-20 11:57:19+00:00       en 