<a href="https://colab.research.google.com/github/MuskanTiwari12/Sentimental-Analysis-Project/blob/main/task1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Task 1: Environment Setup, Data Collection & Pipeline Initialization

1. Install the required libraries for the data collection pipeline


In [7]:
from google.colab import drive
drive.mount('/content/drive')
task1_folder = "/content/drive/MyDrive/Task1.ipynb"
!mkdir -p "{task1_folder}"  # Create folder if it doesn't exist

MessageError: Error: credential propagation was unsuccessful

In [2]:
!pip install pandas              # For data handling & saving (CSV, DataFrame)
!pip install python-dotenv       # Manage API keys securely
!pip install newsapi-python      # Fetch news articles using NewsAPI
!pip install requests            # Make HTTP requests (web scraping, APIs)
#!pip install beautifulsoup4      # Parse HTML for scraping news sites
!pip install newspaper3k         # Extract structured news content (title, text, date)
!pip install tweepy              # Collect tweets from Twitter (X) API
!pip install lxml_html_clean     # Clean and sanitize HTML (removes scripts, styles, and unwanted tags)

Collecting newsapi-python
  Downloading newsapi_python-0.2.7-py2.py3-none-any.whl.metadata (1.2 kB)
Downloading newsapi_python-0.2.7-py2.py3-none-any.whl (7.9 kB)
Installing collected packages: newsapi-python
Successfully installed newsapi-python-0.2.7
Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl.metadata (11 kB)
Collecting cssselect>=0.9.2 (from newspaper3k)
  Downloading cssselect-1.3.0-py3-none-any.whl.metadata (2.6 kB)
Collecting feedparser>=5.2.1 (from newspaper3k)
  Downloading feedparser-6.0.12-py3-none-any.whl.metadata (2.7 kB)
Collecting tldextract>=2.0.1 (from newspaper3k)
  Downloading tldextract-5.3.0-py3-none-any.whl.metadata (11 kB)
Collecting feedfinder2>=0.0.4 (from newspaper3k)
  Downloading feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Collecting jieba3k>=0.35.1 (from newspaper3k)
  Downloading jieba3k-0.35.1.zip (7.4 MB)
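
Several of the pip names installed above differ from the module names imported in the next cell (for example `python-dotenv` is imported as `dotenv`). A minimal sketch for checking that each package is importable before moving on; `REQUIRED` and `find_missing` are hypothetical names introduced here, and the mapping is taken from the install and import cells of this notebook:

```python
import importlib.util

# pip distribution name -> importable module name (hypothetical helper table,
# built from the install and import cells in this notebook)
REQUIRED = {
    "pandas": "pandas",
    "python-dotenv": "dotenv",
    "newsapi-python": "newsapi",
    "requests": "requests",
    "newspaper3k": "newspaper",
    "tweepy": "tweepy",
}


def find_missing(required):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = find_missing(REQUIRED)
print("Missing packages:", missing or "none")
```

If the list is not empty, re-running the install cell (or restarting the runtime) is usually the next step.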

   Import Required Libraries



In [3]:
import os                # For file paths & environment variables
import pandas as pd      # For storing and handling collected data
from dotenv import load_dotenv   # For loading API keys securely
from newsapi import NewsApiClient   # News API client
import requests          # For API requests & scraping
#from bs4 import BeautifulSoup  # For parsing scraped HTML
from newspaper import Article   # For extracting structured news
import tweepy            # For fetching data from Twitter (X) API

Load All API Keys from Colab Secret Manager

In [4]:
from google.colab import userdata

# Load API keys from Colab Secret Manager
news_api_key = userdata.get("NEWS_API_KEY")
twitter_bearer = userdata.get("TWITTER_BEARER")
huggingface_key = userdata.get("HUGGINGFACE_API_KEY")
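
`userdata.get` is only available inside Colab. A minimal sketch of a fallback, assuming the same secret names are exported as environment variables when the notebook runs elsewhere; `get_secret` is a hypothetical helper, not something this notebook defines:

```python
import os
from typing import Optional


def get_secret(name: str) -> Optional[str]:
    """Read a secret from Colab if possible, otherwise from the environment."""
    try:
        # google.colab is importable only inside a Colab runtime
        from google.colab import userdata
        value = userdata.get(name)
    except Exception:
        # Not in Colab, or the secret is missing or not shared with this notebook
        value = None
    return value or os.environ.get(name)


news_api_key = get_secret("NEWS_API_KEY")
print("NEWS_API_KEY found:", news_api_key is not None)
```

Returning `None` instead of raising keeps the cell runnable; the caller decides whether a missing key is fatal.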

This reusable function fetches news articles for a given topic using NewsAPI
and returns structured data in a pandas DataFrame with selected clean columns.

In [5]:
def fetch_news_articles(api_key: str, query: str, page_size: int = 50) -> pd.DataFrame:
    """
    Fetch news articles from NewsAPI for a given topic.

    Args:
        api_key (str): NewsAPI key stored in secret manager
        query (str): Topic to search for (e.g., "Artificial Intelligence")
        page_size (int): Number of articles per request (max 100 for free plan)

    Returns:
        pd.DataFrame: Clean DataFrame of articles
    """
    url = "https://newsapi.org/v2/everything"
    params = {
        "q": query,
        "pageSize": page_size,
        "language": "en",
        "sortBy": "publishedAt",
        "apiKey": api_key
    }

    try:
        # A timeout keeps the cell from hanging if the API does not respond
        response = requests.get(url, params=params, timeout=10)
        data = response.json()

        if data.get("status") != "ok":
            raise Exception(data.get("message", "Unknown API error"))

        articles = data.get("articles", [])
        if not articles:
            return pd.DataFrame()

        # Convert to DataFrame with selected clean columns
        df = pd.DataFrame(articles)[[
            "title", "author", "source", "description", "url", "publishedAt", "content"
        ]]
        df["source"] = df["source"].apply(lambda x: x.get("name") if isinstance(x, dict) else x)
        df.rename(columns={"publishedAt": "published_at"}, inplace=True)

        return df

    except Exception as e:
        print(f" Error fetching news: {e}")
        return pd.DataFrame()
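
To see what the column selection and the `source` flattening do without calling the API, the same shaping can be sketched on a hand-written record. This is an illustration only: `flatten_article` and the `sample` article are hypothetical, and the sketch uses plain dictionaries rather than pandas so that missing keys become `None` instead of raising a `KeyError`:

```python
from typing import Any, Dict

# Output columns, matching the DataFrame built by fetch_news_articles
COLUMNS = ["title", "author", "source", "description", "url", "published_at", "content"]


def flatten_article(article: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one raw NewsAPI article dict into the clean column layout."""
    source = article.get("source")
    return {
        "title": article.get("title"),
        "author": article.get("author"),
        # NewsAPI nests the source as {"id": ..., "name": ...}; keep only the name
        "source": source.get("name") if isinstance(source, dict) else source,
        "description": article.get("description"),
        "url": article.get("url"),
        "published_at": article.get("publishedAt"),
        "content": article.get("content"),
    }


# Hypothetical sample record; real articles come from the API response
sample = {
    "source": {"id": None, "name": "Example News"},
    "title": "Sample headline",
    "url": "https://example.com/story",
    "publishedAt": "2025-01-01T00:00:00Z",
}

row = flatten_article(sample)
print(row)
```

A list of such rows can be passed straight to `pd.DataFrame(rows, columns=COLUMNS)`.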


In [10]:
# Store the key loaded from Colab secrets first, then read it back from the environment
os.environ["NEWS_API_KEY"] = news_api_key
NEWS_API_KEY = os.environ.get("NEWS_API_KEY")


# Example: Fetch 50 AI news articles
df_news = fetch_news_articles(NEWS_API_KEY, query="Artificial Intelligence", page_size=50)

if not df_news.empty:
    df_news.to_csv("ai_news.csv", index=False)
    print("Saved", len(df_news), "articles to ai_news.csv")
else:
    print(" No data fetched")


Saved 48 articles to ai_news.csv


In [11]:

# Keep the API key in the Colab environment; take it from Colab secrets instead of hardcoding it
os.environ["NEWS_API_KEY"] = news_api_key


## Fetching News Articles and Performing Sentiment Analysis


In [16]:
from transformers import pipeline
import pandas as pd
import requests
import os

# Load API key
NEWS_API_KEY = os.getenv("NEWS_API_KEY")

# Use 3-class model (Positive, Negative, Neutral)
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest"
)

def fetch_and_analyze_news(query="AI", max_articles=20):
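    url = "https://newsapi.org/v2/everything"
    # Pass the query as params so requests URL-encodes it (spaces, "&", etc.)
    params = {"q": query, "pageSize": max_articles, "apiKey": NEWS_API_KEY}

    try:
        response = requests.get(url, params=params, timeout=10)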
        data = response.json()

        if response.status_code != 200 or "articles" not in data:
            print("Error fetching news:", data)
            return pd.DataFrame()

        articles = data["articles"]
        results = []

        for article in articles:
            text = article.get("content") or article.get("description") or ""
            if text:
                sentiment = sentiment_pipeline(text[:512])[0]
                label = sentiment["label"].upper()
                score = sentiment["score"]

                # Map sentiment scores
                if "NEGATIVE" in label:
                    sentiment_score = -score
                elif "POSITIVE" in label:
                    sentiment_score = score
                else:  # NEUTRAL → keep its own probability
                    sentiment_score = score
            else:
                label, sentiment_score = "N/A", 0.0

            results.append({
                "title": article.get("title"),
                "source": article.get("source", {}).get("name"),
                "publishedAt": article.get("publishedAt"),
                "content": text,
                "sentiment_label": label,
                "sentiment_score": sentiment_score
            })

        df = pd.DataFrame(results)
        df.to_csv("news_with_sentiment.csv", index=False)
        print(f"✅ Saved {len(df)} articles with sentiment to news_with_sentiment.csv")
        return df

    except Exception as e:
        print("❌ Exception occurred:", str(e))
        return pd.DataFrame()

# Example usage
df_news = fetch_and_analyze_news("Artificial Intelligence", 10)
df_news.head()


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


✅ Saved 10 articles with sentiment to news_with_sentiment.csv


| | title | source | publishedAt | content | sentiment_label | sentiment_score |
|---|---|---|---|---|---|---|
| 0 | Our hottest takes on AI’s wild summer | The Verge | 2025-09-12T14:02:02Z | `<ul><li></li><li></li><li></li></ul>\r\nOn The...` | NEUTRAL | 0.80948 |
| 1 | Meet the Top 10 AI-Proof Jobs That Everyone Wants | Gizmodo.com | 2025-08-31T19:25:08Z | AI is rapidly scaling in the workforce and cre... | NEGATIVE | -0.499058 |
| 2 | Did Nvidia Just Pop an AI Bubble? Here’s What ... | Gizmodo.com | 2025-08-28T10:46:19Z | Lukewarm second quarter results from AI powerh... | POSITIVE | 0.728195 |
| 3 | ‘Tron: Ares’ Star Says Her Character Reveals a... | Gizmodo.com | 2025-09-03T15:00:22Z | Even in the opening moments of the Tron: Ares ... | NEGATIVE | -0.60706 |
| 4 | Gemini for Home is Google’s biggest smart home... | The Verge | 2025-08-20T16:59:47Z | `<ul><li></li><li></li><li></li></ul>\r\nThe al...` | NEUTRAL | 0.670566 |
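
Two details stand out in the output above: some `content` values still carry HTML list markup and `\r\n` sequences, and NEUTRAL rows get a positive score that sorts alongside POSITIVE ones. A minimal post-processing sketch, assuming a regex-based tag strip is good enough for these short snippets and that neutral is better represented as 0; `clean_text` and `signed_score` are hypothetical helpers, not part of the notebook:

```python
import html
import re

# Naive tag matcher; adequate for short snippets, not a full HTML parser
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw or "")
    return " ".join(html.unescape(no_tags).split())


def signed_score(label, score):
    """Map a (label, probability) pair to a value in [-1, 1]."""
    label = label.upper()
    if label == "POSITIVE":
        return score
    if label == "NEGATIVE":
        return -score
    return 0.0  # assumption: neutral (or unknown) carries no polarity


# Hypothetical sample values shaped like the rows above
print(clean_text("<ul><li></li></ul>\r\nSample &amp; text"))
print(signed_score("neutral", 0.81), signed_score("NEGATIVE", 0.5))
```

Cleaning the text before it reaches the model, and keeping the raw probability in a separate column, would preserve the information the current mapping folds into one number.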
