# Section 1: Project Overview

## Project Description
This project implements a secure and structured **Social Media Sentiment Analysis System** within a Jupyter Notebook. It is designed to download data from trusted sources, clean text, and classify sentiments into **Positive**, **Neutral**, or **Negative** categories.

## Problem Statement
Analyzing social media sentiment at scale requires robust pipelines that can handle messy data, ensure security when dealing with external sources, and provide reliable insights for decision-making.

## Real-World Use Case
*   **Brand Reputation:** Monitoring public perception of a brand in real time.
*   **Product Feedback:** Aggregating user reviews from social platforms.
*   **Trend Analysis:** Detecting shifts in public sentiment regarding global events.

# Section 2: Security & Design Principles

Security and reliability are core to this implementation:

1.  **Controlled Dataset Downloads:** We strictly allow downloads only from a pre-defined allow-list of trusted URLs. Arbitrary user-provided URLs are rejected.
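2.  **No Dynamic URLs:** The download function accepts only a dataset key, never a URL, which removes the most direct route to SSRF. Note that `requests` follows redirects by default, so an allow-listed URL that redirects elsewhere would still be followed.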
3.  **Graceful Failure & Logging:** The pipeline does not crash on errors; instead, it logs issues and provides meaningful feedback.
4.  **No Hardcoded Secrets:** No API keys or sensitive credentials are used or stored.
5.  **Safe File Handling:** All data operations are strictly confined to `data/` and `output/` directories.
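
As an illustration of principle 5, the following is a minimal sketch of one way to confine file paths to the permitted directories. The helper name `resolve_confined_path` and the `ALLOWED_DIRS` tuple are hypothetical and do not appear elsewhere in this notebook.

```python
from pathlib import Path

# Hypothetical allow-list of directories, mirroring principle 5 above.
ALLOWED_DIRS = ("data", "output")

def resolve_confined_path(directory: str, filename: str) -> Path:
    """Return the resolved path of filename inside directory.

    Raises ValueError if the directory is not allow-listed or if the
    resolved path would fall outside that directory.
    """
    if directory not in ALLOWED_DIRS:
        raise ValueError(f"Directory not allowed: {directory}")
    base = Path(directory).resolve()
    candidate = (base / filename).resolve()
    # A confined file has the base directory among its parents.
    if base not in candidate.parents:
        raise ValueError(f"Path escapes {directory}/: {filename}")
    return candidate

print(resolve_confined_path("data", "dataset.csv").name)
```

Resolving the path before checking it means inputs such as `../secret.csv` or an absolute path are rejected rather than silently written elsewhere.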

# Section 3: Dataset Sources

We utilize public datasets to benchmark text processing and sentiment analysis. The trusted sources for this project are:

*   **Sentiment140:** [Kaggle Link](https://www.kaggle.com/datasets/kazanova/sentiment140)
*   **Twitter Sentiment (Multiclass):** [Kaggle Link](https://www.kaggle.com/datasets/saurabhshahane/twitter-sentiment-dataset)
*   **Open CSV (Direct Download):** [OpenDataBay Link](https://www.opendatabay.com/data/web-social/b52f6148-5dd3-4317-b32c-e7a497064c51)

## Expected Format
The pipeline expects CSV files with a header row containing a column named **`text`**, which holds the social media posts. Column names are lowercased on load, so `Text` or `TEXT` also match.
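
The format check can be sketched with the standard library alone. The helper `has_text_column` is a hypothetical name used only for illustration. It assumes the header is matched case-insensitively, as the loader in Section 7 does by lowercasing column names.

```python
import csv
import io

def has_text_column(csv_source) -> bool:
    """Return True if the CSV header contains a 'text' column (case-insensitive)."""
    reader = csv.reader(csv_source)
    header = next(reader, None)  # None when the file is empty
    if header is None:
        return False
    return "text" in {name.strip().lower() for name in header}

# In-memory sample standing in for a real file opened with open(path, newline="")
sample = io.StringIO("id,Text\n1,I love this service!\n2,Terrible experience.\n")
print(has_text_column(sample))  # True
```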

In [None]:
# Section 4: Dataset Configuration

# Define trusted, allow-listed URLs. 
# Arbitrary URLs are NOT allowed to prevent SSRF or downloading malicious files.

ALLOWED_DATASET_URLS = {
    "sentiment140": "https://www.kaggle.com/datasets/kazanova/sentiment140",
    "twitter_multiclass": "https://www.kaggle.com/datasets/saurabhshahane/twitter-sentiment-dataset",
    "open_csv": "https://www.opendatabay.com/data/web-social/b52f6148-5dd3-4317-b32c-e7a497064c51"
}

print("Allowed datasets configured:", list(ALLOWED_DATASET_URLS.keys()))

In [None]:
# Section 6: Error Tracking Setup
# (Moved up to ensure logging is available for the download step)

import logging
import sys
import os

# Ensure directories exist
os.makedirs('data', exist_ok=True)
os.makedirs('output', exist_ok=True)

# Configure Logging
logger = logging.getLogger("SentimentPipeline")
logger.setLevel(logging.INFO)

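# Remove and close existing handlers so re-running this cell does not duplicate output
for handler in list(logger.handlers):
    logger.removeHandler(handler)
    handler.close()
# Avoid duplicate lines if the root logger also has a handler
logger.propagate = False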

# 1. Console Handler
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setFormatter(logging.Formatter('%(levelname)s - %(message)s'))

# 2. File Handler
file_handler = logging.FileHandler("output/pipeline.log", mode='w')
file_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))

logger.addHandler(console_handler)
logger.addHandler(file_handler)

logger.info("Logging initialized.")

In [None]:
# Section 5: Controlled Dataset Download

import requests
import os

def download_allowed_dataset(dataset_key, filename="dataset.csv"):
    """
    Downloads a file ONLY if the URL is in the ALLOWED_DATASET_URLS list.
    """
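    # Keep only the final path component so the file cannot land outside data/
    filepath = os.path.join("data", os.path.basename(filename))

    if dataset_key not in ALLOWED_DATASET_URLS:
        logger.error(f"Blocked attempt to access unauthorized dataset key: {dataset_key}")
        return False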

    url = ALLOWED_DATASET_URLS[dataset_key]
    
    # Skip if already exists to save bandwidth/time
    if os.path.exists(filepath):
        logger.info(f"File {filepath} already exists. Skipping download.")
        return True

    logger.info(f"Attempting to download from trusted source: {dataset_key}")
    
    # Note: Kaggle URLs usually require API/Auth (cookies). 
    # For this demonstration, we'll try the 'open_csv' link or simulate a safe failure for others.
    if "kaggle" in url:
        logger.warning(f"Direct download from Kaggle ({url}) requires API tokens. Please manually place CSV in data/ folder for this source.")
        # In a real scenario with API keys, we would use the kaggle CLI or API here.
        return False

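    max_bytes = 100 * 1024 * 1024  # assumed 100 MB cap; adjust to the expected dataset size
    try:
        # Stream the response and abort once it exceeds the size cap
        downloaded = 0
        with requests.get(url, stream=True, timeout=30) as r:
            r.raise_for_status()
            with open(filepath, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    downloaded += len(chunk)
                    if downloaded > max_bytes:
                        raise ValueError(f"Download exceeds {max_bytes} bytes")
                    f.write(chunk)
        logger.info(f"Successfully downloaded dataset to {filepath}")
        return True
    except Exception as e:
        logger.error(f"Download failed: {e}")
        # Remove any partial file so a later run does not treat it as complete
        if os.path.exists(filepath):
            os.remove(filepath)
        return False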

# Example usage: attempt the 'open_csv' source, the only one that does not need Kaggle credentials.
# If the download fails, place a CSV in data/ manually; otherwise the loader in Section 7 falls back to mock data.
download_allowed_dataset("open_csv", "dataset.csv")

In [None]:
# Section 7: Dataset Loading & Validation

import pandas as pd

def load_local_dataset(filename='dataset.csv'):
    filepath = os.path.join('data', filename)
    
    # Validation: Check if file exists
    if not os.path.exists(filepath):
        logger.warning(f"Dataset not found at {filepath}. Generating mock data for testing.")
        # Generating mock data if download failed or manual file missing (for interview demo continuity)
        return pd.DataFrame({'text': [
            "I love this service!", 
            "Terrible experience.", 
            "It is okay.", 
            "Worst app ever!", 
            "Best day of my life."
        ]})

    try:
        # Handle encoding issues (utf-8 vs latin1)
        try:
            df = pd.read_csv(filepath, encoding='utf-8')
        except UnicodeDecodeError:
            df = pd.read_csv(filepath, encoding='latin1')

        # Validation: Check for required column
        df.columns = [str(c).strip().lower() for c in df.columns]  # normalize header names
        if 'text' not in df.columns:
            logger.error("Dataset missing 'text' column.")
            return None

        logger.info(f"Loaded dataset with {len(df)} rows.")
        return df
    except Exception as e:
        logger.error(f"Failed to load dataset: {e}")
        return None

df = load_local_dataset()

In [None]:
# Section 8: Text Preprocessing

import re

def clean_text(text):
    """
    Preprocesses raw text for sentiment analysis.
    """
    try:
        if not isinstance(text, str):
            return ""
        
        # 1. Lowercase
        text = text.lower()
        # 2. Remove URLs
        text = re.sub(r'http\S+', '', text)
        # 3. Remove Punctuation & Digits
        text = re.sub(r'[^a-z\s]', '', text)
        # 4. Normalize Whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    except Exception as e:
        logger.error(f"Preprocessing error: {e}")
        return ""

logger.info("Preprocessing function ready.")

In [None]:
# Section 9: Sentiment Classification Logic

def analyze_sentiment(text):
    """
    Simple rule-based classifier.
    """
    if not text:
        return "Neutral"
        
    positive_words = {'love', 'best', 'great', 'happy', 'good', 'excellent', 'amazing', 'fun'}
    negative_words = {'terrible', 'worst', 'hate', 'bad', 'poor', 'useless', 'fail', 'sad'}
    
    words = text.split()
    score = 0
    
    for w in words:
        if w in positive_words:
            score += 1
        elif w in negative_words:
            score -= 1
            
    if score > 0:
        return "Positive"
    elif score < 0:
        return "Negative"
    return "Neutral"

logger.info("Sentiment logic ready.")

In [None]:
# Section 10: Apply Sentiment Pipeline

if df is not None:
    logger.info("Applying pipeline to dataset...")
    df['cleaned_text'] = df['text'].apply(clean_text)
    df['sentiment'] = df['cleaned_text'].apply(analyze_sentiment)
    logger.info("Pipeline finished.")
    display(df.head())
else:
    logger.error("No dataset was loaded (df is None). Cannot apply pipeline.")

In [None]:
# Section 11: Evaluation & Insights

if df is not None:
    counts = df['sentiment'].value_counts()
    print("\n--- Sentiment Distribution ---")
    print(counts)
    
    total = len(df)
    if total > 0:
        for label, count in counts.items():
            print(f"{label}: {count/total:.1%}")
            
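    # Optional visual: map colors by label, because value_counts() orders bars by frequency
    try:
        import matplotlib.pyplot as plt
        color_map = {'Positive': 'green', 'Neutral': 'gray', 'Negative': 'red'}
        counts.plot(kind='bar', color=[color_map.get(label, 'blue') for label in counts.index])
        plt.title("Sentiment Outcomes")
        plt.show()
    except ImportError:
        logger.warning("matplotlib is not installed; skipping chart.")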

In [None]:
# Section 12: Testing (Notebook-Based)

logger.info("Running inline tests...")

# 1. Test Cleaning
assert clean_text("Go to http://site.com!!") == "go to", "Cleaning Failed"

# 2. Test Sentiment
assert analyze_sentiment("i love this") == "Positive", "Positive sentiment test failed"
assert analyze_sentiment("terrible service") == "Negative", "Negative sentiment test failed"

# 3. Test Empty
assert analyze_sentiment("") == "Neutral", "Empty Sentiment Failed"

logger.info("All tests passed successfully.")

In [None]:
# Section 13: Save Output

outfile = "output/sentiment_results.csv"
if df is not None:
    try:
        df.to_csv(outfile, index=False)
        logger.info(f"Results successfully saved to {outfile}")
    except Exception as e:
        logger.error(f"Save failed: {e}")

# Section 14: Limitations & Future Improvements

1.  **Sarcasm:** Rule-based systems miss sarcasm (e.g., "Oh brilliant, another crash").
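2.  **Language:** English only. `clean_text` strips every character outside `a-z`, so accented and non-Latin text is mostly discarded before classification.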
3.  **Enhancements:** Could replace dictionary logic with NLTK/VADER or HuggingFace transformers for higher accuracy.

# Section 15: Interview Explanation

“I used trusted public dataset URLs with controlled downloads to enable testing, while enforcing security, logging, and validation similar to production systems.”