## Task 1: Data Collection and Pipeline Initialization
This notebook collects AI-related news articles from NewsAPI and saves them to a CSV file.

## 1-> Installing Required Libraries

# This cell installs the Python packages used in this notebook: requests, pandas, and newsapi-python (the NewsAPI client).

In [None]:
!pip install requests pandas newsapi-python



## 2-> Importing Libraries and Loading API Keys
# Here, we import the libraries and securely read our API key from Colab's secret manager.

In [None]:
import os
import pandas as pd
from newsapi import NewsApiClient
from google.colab import userdata

# Load the NewsAPI key stored under the name 'news_api' in Colab's secrets manager
NEWS_APIKEY = userdata.get('news_api')
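
The cell above only works inside Colab, because `google.colab` is not importable elsewhere. A minimal sketch of one possible fallback, assuming the key is also exported as an environment variable. The function name `load_api_key` and the variable name `NEWS_API_KEY` are illustrative assumptions, not part of the original notebook:

```python
import os

def load_api_key(secret_name="news_api", env_var="NEWS_API_KEY"):
    """Return the API key from Colab secrets, else from an environment variable."""
    key = None
    try:
        # Only available when running inside Google Colab
        from google.colab import userdata
        key = userdata.get(secret_name)
    except Exception:
        # Not in Colab, or the secret is missing or not shared with this notebook
        key = None

    # Fall back to the (assumed) environment variable
    key = key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found in Colab secret '{secret_name}' or ${env_var}"
        )
    return key
```

With this in place, `NEWS_APIKEY = load_api_key()` would behave the same in Colab and fail with a clear message elsewhere when no key is configured.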

## 3-> Data Collection from NewsAPI
# This function fetches news articles for a given search query. It takes the API key and the query as parameters, so it can be reused for other topics.

In [None]:
def fetch_news_data(api_key, search_query):
    """Fetch English-language articles matching search_query from NewsAPI.

    Returns a list of article dicts, or None if the request fails.
    """
    print(f"Fetching articles for query: '{search_query}'...")
    try:
        newsapi = NewsApiClient(api_key=api_key)
        all_articles = newsapi.get_everything(q=search_query,
                                              language='en',
                                              sort_by='relevancy',
                                              page_size=100)  # Request up to 100 articles in one page

        articles = all_articles['articles']
        print(f"Successfully fetched {len(articles)} articles.")
        return articles
    except Exception as e:
        # Covers API errors (e.g. an invalid key) and network failures alike
        print(f"An error occurred: {e}")
        return None

# Call the function to fetch articles about Artificial Intelligence
articles_list = fetch_news_data(api_key=NEWS_APIKEY, search_query="Artificial Intelligence")

Fetching articles for query: 'Artificial Intelligence'...
Successfully fetched 100 articles.
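
The function above requests a single page of results. If more than one page is needed, the `page` parameter of `get_everything` can be stepped through. The following is a sketch under assumptions, not part of the original pipeline: the helper name `fetch_all_pages` and its `max_pages` argument are hypothetical, and how many pages are actually available depends on the NewsAPI plan in use.

```python
def fetch_all_pages(client, search_query, page_size=100, max_pages=3):
    """Collect articles across several result pages.

    `client` is expected to be a NewsApiClient (or any object exposing a
    compatible get_everything method).
    """
    collected = []
    for page in range(1, max_pages + 1):
        response = client.get_everything(q=search_query,
                                         language='en',
                                         sort_by='relevancy',
                                         page_size=page_size,
                                         page=page)
        batch = response.get('articles', [])
        collected.extend(batch)

        # Stop early on a short page or once every reported result is collected
        total = response.get('totalResults', 0)
        if len(batch) < page_size or len(collected) >= total:
            break
    return collected
```

Passing the client in as an argument, rather than creating it inside the function, keeps the helper easy to exercise with a stand-in object that returns canned responses.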


## 4-> Processing and Saving the Data
# The raw data is processed into a structured table (DataFrame) and saved as a CSV file.
# This CSV is the final output of our data collection pipeline.

In [None]:
if articles_list:
    df = pd.DataFrame(articles_list)

    # Select the most relevant columns
    dfclean = df[['title', 'author', 'source', 'publishedAt', 'url', 'content']].copy()
    # 'source' is a dict with 'id' and 'name' keys; keep only the name
    dfclean['source'] = dfclean['source'].apply(lambda x: x['name'] if isinstance(x, dict) else None)

    # Save the final data to a CSV file
    output_file = 'newsdata.csv'
    dfclean.to_csv(output_file, index=False, encoding='utf-8-sig')

    print(f"\nPipeline complete. File saved to '{output_file}'")

    # Display the first 5 rows of the final dataset
    print("\nPreview of the final data:")
    display(dfclean.head())


Pipeline complete. File saved to 'newsdata.csv'

Preview of the final data:


| | title | author | source | publishedAt | url | content |
|---|---|---|---|---|---|---|
| 0 | AI invents new antibiotics that could kill sup... | | BBC News | 2025-08-14T15:03:49Z | https://www.bbc.com/news/articles/cgr94xxye2lo | Artificial intelligence has invented two new p... |
| 1 | Gemini for Home is Google’s biggest smart home... | Jennifer Pattison Tuohy | The Verge | 2025-08-20T16:59:47Z | https://www.theverge.com/news/762370/google-an... | `<ul><li></li><li></li><li></li></ul>\r\nThe al...` |
| 2 | AI Is Eliminating Jobs for Younger Workers | Will Knight | Wired | 2025-08-26T07:00:00Z | https://www.wired.com/story/stanford-research-... | Economists at Stanford University have found t... |
| 3 | Inside the Biden Administration’s Unpublished ... | Will Knight | Wired | 2025-08-06T18:00:00Z | https://www.wired.com/story/inside-the-biden-a... | At a computer security conference in Arlington... |
| 4 | The Jobs AI is Replacing the Fastest | Riley Gutiérrez McDermid | Gizmodo.com | 2025-08-21T12:37:41Z | https://gizmodo.com/the-jobs-ai-is-replacing-t... | The jobs most likely to be replaced by artific... |
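
The preview shows that some fields can be empty (row 0 has no author). As a hedged sketch of how the raw article dicts could be flattened defensively before building the DataFrame, the helpers below tolerate missing keys and drop repeated URLs. The names `flatten_article` and `dedupe_by_url` are hypothetical, and the deduplication step is an optional extra that the original pipeline does not perform.

```python
FIELDS = ('title', 'author', 'publishedAt', 'url', 'content')

def flatten_article(article):
    """Keep the selected fields and reduce 'source' to its name."""
    source = article.get('source') or {}   # tolerate a missing or None source
    row = {field: article.get(field) for field in FIELDS}
    row['source'] = source.get('name')
    return row

def dedupe_by_url(rows):
    """Drop rows whose URL was already seen; rows without a URL are kept."""
    seen = set()
    unique = []
    for row in rows:
        url = row.get('url')
        if url is not None and url in seen:
            continue
        seen.add(url)
        unique.append(row)
    return unique

# Made-up sample records, for illustration only
sample = [
    {'title': 'A', 'source': {'id': None, 'name': 'Example News'}, 'url': 'https://example.com/a'},
    {'title': 'A (repeat)', 'source': {'id': None, 'name': 'Example News'}, 'url': 'https://example.com/a'},
    {'title': 'B', 'source': None, 'url': 'https://example.com/b'},
]
rows = dedupe_by_url([flatten_article(a) for a in sample])
```

The resulting `rows` list can be passed straight to `pd.DataFrame(rows)` in place of the column selection step.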
