<a href="https://colab.research.google.com/github/MuskanTiwari12/Sentimental-Analysis-Project/blob/main/task1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Task 1: Environment Setup, Data Collection & Pipeline Initialization

1. Install the libraries required for the data collection pipeline


In [10]:
from google.colab import drive
drive.mount('/content/drive')
task1_folder = "/content/drive/MyDrive/Task1.ipynb"
!mkdir -p "{task1_folder}"  # Create the folder if it doesn't exist

Mounted at /content/drive


In [2]:
!pip install pandas              # For data handling & saving (CSV, DataFrame)
!pip install python-dotenv       # Manage API keys securely
!pip install newsapi-python      # Fetch news articles using NewsAPI
!pip install requests            # Make HTTP requests (web scraping, APIs)
#!pip install beautifulsoup4      # Parse HTML for scraping news sites
!pip install newspaper3k         # Extract structured news content (title, text, date)
!pip install tweepy              # Collect tweets from Twitter (X) API
!pip install lxml_html_clean     # Clean and sanitize HTML (removes scripts, styles, and unwanted tags)

Collecting newsapi-python
  Downloading newsapi_python-0.2.7-py2.py3-none-any.whl.metadata (1.2 kB)
Downloading newsapi_python-0.2.7-py2.py3-none-any.whl (7.9 kB)
Installing collected packages: newsapi-python
Successfully installed newsapi-python-0.2.7
Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl.metadata (11 kB)
Collecting cssselect>=0.9.2 (from newspaper3k)
  Downloading cssselect-1.3.0-py3-none-any.whl.metadata (2.6 kB)
Collecting feedparser>=5.2.1 (from newspaper3k)
  Downloading feedparser-6.0.12-py3-none-any.whl.metadata (2.7 kB)
Collecting tldextract>=2.0.1 (from newspaper3k)
  Downloading tldextract-5.3.0-py3-none-any.whl.metadata (11 kB)
Collecting feedfinder2>=0.0.4 (from newspaper3k)
  Downloading feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Collecting jieba3k>=0.35.1 (from newspaper3k)
  Downloading jieba3k-0.35.1.zip (7.4 MB)

Import required libraries



In [3]:
import os                # For file paths & environment variables
import pandas as pd      # For storing and handling collected data
from dotenv import load_dotenv   # For loading API keys securely
from newsapi import NewsApiClient   # News API client
import requests          # For API requests & scraping
#from bs4 import BeautifulSoup  # For parsing scraped HTML
from newspaper import Article   # For extracting structured news
import tweepy            # For fetching data from Twitter (X) API

Load all API keys from the Colab Secrets manager

In [4]:
from google.colab import userdata

# Load API keys from Colab Secret Manager
news_api_key = userdata.get("NEWS_API_KEY")
twitter_bearer = userdata.get("TWITTER_BEARER")
huggingface_key = userdata.get("HUGGINGFACE_API_KEY")
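The cell above only works inside Colab, and it stores the keys in Python variables rather than in `os.environ`. A minimal sketch of a lookup that tries Colab secrets first and falls back to an environment variable is shown here; the helper name `get_secret` is hypothetical and not part of the original notebook.

```python
import os


def get_secret(name: str, default=None):
    """Return a secret from Colab's secret manager, falling back to os.environ.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is missing or access was not granted
        pass
    return os.environ.get(name, default)
```

With this helper, `news_api_key = get_secret("NEWS_API_KEY")` would return the same value whether the key lives in Colab secrets or in the process environment.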

This reusable function fetches news articles for a given topic using NewsAPI
and returns structured data in a pandas DataFrame with selected clean columns.

In [5]:
def fetch_news_articles(api_key: str, query: str, page_size: int = 50) -> pd.DataFrame:
    """
    Fetch news articles from NewsAPI for a given topic.

    Args:
        api_key (str): NewsAPI key stored in secret manager
        query (str): Topic to search for (e.g., "Artificial Intelligence")
        page_size (int): Number of articles per request (max 100 for free plan)

    Returns:
        pd.DataFrame: Clean DataFrame of articles
    """
    url = "https://newsapi.org/v2/everything"
    params = {
        "q": query,
        "pageSize": page_size,
        "language": "en",
        "sortBy": "publishedAt",
        "apiKey": api_key
    }

    try:
        response = requests.get(url, params=params, timeout=10)  # Avoid hanging on a stalled connection
        data = response.json()

        if data.get("status") != "ok":
            raise Exception(data.get("message", "Unknown API error"))

        articles = data.get("articles", [])
        if not articles:
            return pd.DataFrame()

        # Convert to DataFrame with selected clean columns
        df = pd.DataFrame(articles)[[
            "title", "author", "source", "description", "url", "publishedAt", "content"
        ]]
        df["source"] = df["source"].apply(lambda x: x.get("name") if isinstance(x, dict) else x)
        df.rename(columns={"publishedAt": "published_at"}, inplace=True)

        return df

    except Exception as e:
        print(f" Error fetching news: {e}")
        return pd.DataFrame()
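The column selection and `source` flattening in the function above can also be sketched on plain dictionaries, which makes the step easy to check without calling the API. The helper name `flatten_article` and the sample records are made up for illustration; only the field names come from the function above.

```python
from typing import Any, Dict

# Output columns, matching the DataFrame built in fetch_news_articles
COLUMNS = ["title", "author", "source", "description", "url", "published_at", "content"]


def flatten_article(article: Dict[str, Any]) -> Dict[str, Any]:
    """Map one NewsAPI-style article record to a flat row (hypothetical helper)."""
    source = article.get("source")
    # "source" is usually a dict like {"id": ..., "name": ...}; keep only the name
    source_name = source.get("name") if isinstance(source, dict) else source
    return {
        "title": article.get("title"),
        "author": article.get("author"),
        "source": source_name,
        "description": article.get("description"),
        "url": article.get("url"),
        "published_at": article.get("publishedAt"),  # renamed to snake_case
        "content": article.get("content"),
    }


# Made-up sample records, not real API output
sample_articles = [
    {
        "title": "Example headline",
        "source": {"id": None, "name": "Example Source"},
        "publishedAt": "2025-01-01T00:00:00Z",
    },
    {"title": "Record with no source field"},
]
rows = [flatten_article(a) for a in sample_articles]
```

Unlike selecting columns directly from `pd.DataFrame(articles)`, this variant fills missing fields with `None` instead of raising a `KeyError`. The rows can then be passed to `pd.DataFrame(rows, columns=COLUMNS)`.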


In [6]:
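# NEWS_API_KEY must already be set in os.environ; if it is not, this returns None
# and NewsAPI rejects the request (see the error in the output below).
NEWS_API_KEY = os.environ.get("NEWS_API_KEY")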

# Example: Fetch 50 AI news articles
df_news = fetch_news_articles(NEWS_API_KEY, query="Artificial Intelligence", page_size=50)

if not df_news.empty:
    df_news.to_csv("ai_news.csv", index=False)
    print("Saved", len(df_news), "articles to ai_news.csv")
else:
    print(" No data fetched")


 Error fetching news: Your API key is missing. Append this to the URL with the apiKey param, or use the x-api-key HTTP header.
 No data fetched


In [7]:

# Store the API key in the environment for this session.
# Reuse the value loaded from Colab secrets instead of hard-coding the key in the notebook.
os.environ["NEWS_API_KEY"] = news_api_key


## Fetching News Articles and Performing Sentiment Analysis


In [8]:

from transformers import pipeline

# Load API key from environment
NEWS_API_KEY = os.getenv("NEWS_API_KEY")

# Hugging Face sentiment pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

def fetch_and_analyze_news(query="AI", max_articles=20):
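    url = "https://newsapi.org/v2/everything"
    # Pass query parameters separately so requests URL-encodes them (e.g. spaces in the query)
    params = {"q": query, "pageSize": max_articles, "apiKey": NEWS_API_KEY}

    try:
        response = requests.get(url, params=params, timeout=10)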
        data = response.json()

        if response.status_code != 200 or "articles" not in data:
            print("Error fetching news:", data)
            return pd.DataFrame()

        articles = data["articles"]
        results = []

        for article in articles:
            text = article.get("content") or article.get("description") or ""
            sentiment = sentiment_pipeline(text[:512])[0] if text else {"label": "N/A", "score": 0.0}

            results.append({
                "title": article.get("title"),
                "source": article.get("source", {}).get("name"),
                "publishedAt": article.get("publishedAt"),
                "content": text,
                "sentiment_label": sentiment["label"],
                "sentiment_score": sentiment["score"]
            })

        df = pd.DataFrame(results)
        df.to_csv("news_with_sentiment.csv", index=False)
        print(f"Saved {len(df)} articles with sentiment to news_with_sentiment.csv")
        return df

    except Exception as e:
        print(" Exception occurred:", str(e))
        return pd.DataFrame()

# Example usage
df_news = fetch_and_analyze_news("Artificial Intelligence", 10)
df_news.head()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu


Saved 10 articles with sentiment to news_with_sentiment.csv


| | title | source | publishedAt | content | sentiment_label | sentiment_score |
|---|---|---|---|---|---|---|
| 0 | Meet the Top 10 AI-Proof Jobs That Everyone Wants | Gizmodo.com | 2025-08-31T19:25:08Z | AI is rapidly scaling in the workforce and cre... | NEGATIVE | 0.983192 |
| 1 | ‘Tron: Ares’ Star Says Her Character Reveals a... | Gizmodo.com | 2025-09-03T15:00:22Z | Even in the opening moments of the Tron: Ares ... | NEGATIVE | 0.945469 |
| 2 | Did Nvidia Just Pop an AI Bubble? Here’s What ... | Gizmodo.com | 2025-08-28T10:46:19Z | Lukewarm second quarter results from AI powerh... | POSITIVE | 0.996129 |
| 3 | AI invents new antibiotics that could kill sup... | BBC News | 2025-08-14T15:03:49Z | Artificial intelligence has invented two new p... | NEGATIVE | 0.909215 |
| 4 | Gemini for Home is Google’s biggest smart home... | The Verge | 2025-08-20T16:59:47Z | `<ul><li></li><li></li><li></li></ul>\r\nThe al...` | POSITIVE | 0.861856 |
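
Row 4 in the output above shows raw HTML markup at the start of `content`, which takes up part of the 512-character slice passed to the sentiment model. One possible way to remove tags before scoring, using only the standard library, is sketched below. The helper name `strip_html` is hypothetical and not part of the original notebook.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed.

    Hypothetical helper for illustration. Text chunks are joined with a space,
    so adjacent inline elements end up space-separated.
    """
    parser = _TextExtractor()
    parser.feed(raw or "")
    parser.close()
    # Collapse runs of whitespace, including the \r\n left behind by the markup
    return " ".join(" ".join(parser.parts).split())
```

Under this assumption, the call inside `fetch_and_analyze_news` could become `sentiment_pipeline(strip_html(text)[:512])`, so that the slice contains article text rather than markup.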
