<a href="https://colab.research.google.com/github/Achiever-caleb/Sentiment_Analysis_System/blob/main/Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Introduction**  
This project aims to build a **Sentiment Analysis System** by scraping news details from **lite.cnn.com**, a lightweight version of CNN's news platform known for its concise and impactful reporting.  

By analyzing the sentiments expressed in the news sentences, this system can categorize them into **Positive**, **Neutral**, and **Negative** sentiments, providing valuable insights into the tone of media coverage on various topics.  

The project will involve the following key steps:  
1. **Data Collection:** Scraping at least 200 news sentences from **lite.cnn.com** using web scraping techniques.  
2. **Data Annotation:** Labeling each sentence with its corresponding sentiment category.  
3. **Model Training:** Building and training a sentiment analysis model to accurately classify the sentiments.  
4. **Deployment:** Exposing the trained model as a **REST API** using **FastAPI**, enabling real-time sentiment analysis.  
5. **Optimization and A/B Testing:** Enhancing model inference speed through quantization and conducting **A/B testing** to compare performance against a baseline model.  



In [None]:
%pip install fastapi

Collecting fastapi
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting starlette<0.47.0,>=0.40.0 (from fastapi)
  Downloading starlette-0.46.2-py3-none-any.whl.metadata (6.2 kB)
Downloading fastapi-0.115.12-py3-none-any.whl (95 kB)
Downloading starlette-0.46.2-py3-none-any.whl (72 kB)
Installing collected packages: starlette, fastapi
Successfully installed fastapi-0.115.12 starlette-0.46.2


## Data Wrangling

In [None]:
import glob
import random

import joblib
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import resample
from textblob import TextBlob


In [None]:
def fetch_article_content(url):
    """
    Fetches the content of a web page from a given URL.

    Args:
        url: The URL of the web page.

    Returns:
        The content of the web page as a string, or None if an error occurs.
        Prints an error message if the request fails.
    """
    try:
        response = requests.get(url, timeout=10)  # Avoid hanging on an unresponsive server
        response.raise_for_status()  # Raise an exception for bad status codes
        return response.text  # Return the content as text

    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL {url}: {e}")
        return None

def extract_news_sentences(url, output_filename='news_sentences2.csv'):
    """
    Extracts sentences from paragraphs of a news article URL and saves them to a CSV file.

    Args:
        url: URL of the news article to process.
        output_filename: Name of the output CSV file. Defaults to 'news_sentences2.csv'.
    """
    # Fetch article content
    article_content = fetch_article_content(url)
    if article_content is None:
        return

    # Parse HTML content
    soup = BeautifulSoup(article_content, 'html.parser')

    # Extract sentences from all <p> tags
    sentences = []
    for p_tag in soup.find_all('p'):
        text = p_tag.get_text().strip()
        if text:
            # Naive split on '. '; this also breaks after abbreviations
            # (e.g. "Bessent v. Dellinger") and drops the trailing period
            sentences.extend(text.split('. '))

    # Create DataFrame and save to CSV
    df = pd.DataFrame(sentences, columns=['sentence'])
    df.to_csv(output_filename, index=False)
    print(f"Successfully saved {len(sentences)} sentences to {output_filename}")



In [None]:
extract_news_sentences('https://lite.cnn.com/2025/02/17/europe/europe-ukraine-summit-paris-trump-intl-hnk/index.html', 'my_sentences1.csv')

Successfully saved 55 sentences to my_sentences1.csv


In [None]:
extract_news_sentences('https://lite.cnn.com/2025/02/16/health/apple-cider-vinegar-netflix-gaps-wellness/index.html', 'my_sentences2.csv')

Successfully saved 56 sentences to my_sentences2.csv


In [None]:
extract_news_sentences('https://lite.cnn.com/2025/02/17/politics/what-to-know-about-trumps-appeal-to-the-supreme-court/index.html', 'my_sentences3.csv')

Successfully saved 79 sentences to my_sentences3.csv


In [None]:
extract_news_sentences('https://lite.cnn.com/2025/02/17/us/deadly-winter-storm-eastern-us/index.html', 'my_sentences4.csv')

Successfully saved 67 sentences to my_sentences4.csv


In [None]:
extract_news_sentences('https://lite.cnn.com/2025/02/17/style/emma-stone-popcorn-dress-pocket/index.html', 'my_sentences5.csv')

Successfully saved 31 sentences to my_sentences5.csv


In [None]:
df1 = pd.read_csv('my_sentences1.csv')
df1.head(3)

,sentence
0,"By Christian Edwards, Helen Regan, Michael Rio..."
1,"Updated: \n 6:21 PM EST, Mon February 1..."
2,Source: CNN


In [None]:
df1.tail(7)

,sentence
48,See Full Web Article
49,Go to the full CNN experience
50,© 2025 Cable News Network
51,A Warner Bros
52,Discovery Company
53,All Rights Reserved.
54,Terms of Use\n \n\n |\n \n\n P...
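
The tail above shows "A Warner Bros" and "Discovery Company" as separate rows, a side effect of splitting on `'. '`, which also cuts after abbreviations. A minimal sketch of a more careful splitter follows. The `split_sentences` helper and the `ABBREVIATIONS` set are illustrative assumptions, not part of the notebook, and the list would need extending for other articles.

```python
import re

# Hypothetical abbreviation list; extend it for the articles being scraped.
ABBREVIATIONS = {"v", "vs", "Bros", "Mr", "Mrs", "Dr", "St"}

def split_sentences(text):
    """Split on '. ' but rejoin pieces that end in a known abbreviation."""
    parts = text.split('. ')
    sentences = []
    buffer = ""
    for i, part in enumerate(parts):
        # Glue this piece onto any piece held back from the previous step
        buffer = part if not buffer else buffer + '. ' + part
        last_word = re.findall(r"[A-Za-z]+$", part)
        is_last = i == len(parts) - 1
        if not is_last and last_word and last_word[0] in ABBREVIATIONS:
            continue  # The period belonged to an abbreviation; keep reading
        sentences.append(buffer)
        buffer = ""
    return sentences

print(split_sentences("The case, Bessent v. Dellinger, could help. It arrives now."))
```

Swapping this in for `text.split('. ')` would change the sentence counts reported above, so the later outputs would need regenerating.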


## Data Cleaning

The scraped data contains unwanted rows: the first 3 rows of each file (byline, timestamp, source) and the last 7 (footer links and copyright text). We remove them and concatenate all five datasets into a single DataFrame.

In [None]:
# Step 1: Load all CSV files from the directory
csv_files = glob.glob("my_sentences*.csv")  # Adjust the path as needed
all_data = []  # Initialize an empty list to hold the cleaned DataFrames

# Step 2: Loop through each file, clean it, and append to the list
for file in csv_files:
    df = pd.read_csv(file)

    # Drop the first 3 rows and last 7 rows of the current DataFrame
    cleaned_df = df.iloc[3:-7]

    # Append the cleaned DataFrame to the list
    all_data.append(cleaned_df)

# Step 3: Concatenate all cleaned DataFrames into one
final_df = pd.concat(all_data, ignore_index=True)

# Display the combined DataFrame
print(final_df)


                                              sentence
0    President Donald Trump is heading to the Supre...
1                                  The case, Bessent v
2    Dellinger, could eventually help clarify wheth...
3    It arrives at a moment when Trump is attemptin...
4        Here’s a look at the case and why it matters:
..                                                 ...
233  Whatever practices or supplements you try, jus...
234  Some patients are scared to disclose alternati...
235  At the end of the show (spoiler alert), next t...
236                      Balance is key, Strauss said.
237  Clarification: This story was updated to more ...

[238 rows x 1 columns]
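
Dropping rows by position assumes every article has exactly three header rows and seven footer rows. A hedged alternative is to match the boilerplate text itself. The prefixes below are taken from the `head` and `tail` outputs shown earlier; the helper names are hypothetical, and the byline row is left out on purpose because "By " can also begin an ordinary sentence.

```python
# Prefixes copied from the header and footer rows seen in the scraped output.
BOILERPLATE_PREFIXES = (
    "Updated:",
    "Source: CNN",
    "See Full Web Article",
    "Go to the full CNN experience",
    "©",
    "A Warner Bros",
    "Discovery Company",
    "All Rights Reserved",
    "Terms of Use",
)

def is_boilerplate(sentence):
    """Return True when a row starts with a known header or footer phrase."""
    return sentence.strip().startswith(BOILERPLATE_PREFIXES)

def drop_boilerplate(sentences):
    """Keep only rows that do not look like page furniture."""
    return [s for s in sentences if not is_boilerplate(s)]

rows = [
    "Source: CNN",
    "President Donald Trump is heading to the Supreme Court",
    "See Full Web Article",
    "Balance is key, Strauss said.",
]
print(drop_boilerplate(rows))
```

With a DataFrame, the same check could be applied as `df[~df['sentence'].apply(is_boilerplate)]`.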


In [None]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 238 entries, 0 to 237
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  238 non-null    object
dtypes: object(1)
memory usage: 2.0+ KB


In [None]:
final_df.duplicated().sum()


np.int64(0)

## Data Preprocessing

We perform some feature engineering here to create a new column called `sentiment`. We also balance the dataset by oversampling the smaller classes.

In [None]:
new_df = final_df.copy()

In [None]:
# annotate_sentences.py
def get_sentiment(text):
    """Label text by the sign of its TextBlob polarity score."""
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'

def annotate_sentences():
    new_df['sentiment'] = new_df['sentence'].apply(get_sentiment)
    new_df.to_csv('annotated_sentences.csv', index=False)
    print("Annotation completed and saved to annotated_sentences.csv")

if __name__ == "__main__":
    annotate_sentences()


Annotation completed and saved to annotated_sentences.csv


In [None]:
df = pd.read_csv('annotated_sentences.csv')
df['sentiment'].value_counts()

sentiment,count
Positive,119
Neutral,88
Negative,31
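
`get_sentiment` treats any nonzero polarity as Positive or Negative, so a sentence with a polarity of 0.01 gets the same label as one with 0.9. One possible refinement, sketched below, adds a neutral band around zero. The `label_from_polarity` helper and its `threshold` of 0.05 are assumptions chosen for illustration, not values from the notebook.

```python
def label_from_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a label, with a neutral band.

    `threshold` is a hypothetical cutoff; tune it against hand-labeled data.
    """
    if polarity > threshold:
        return 'Positive'
    if polarity < -threshold:
        return 'Negative'
    return 'Neutral'

for score in (-0.4, -0.02, 0.0, 0.03, 0.6):
    print(score, label_from_polarity(score))
```

It could be wired in as `label_from_polarity(TextBlob(text).sentiment.polarity)`; doing so would shift the class counts shown above.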


In [None]:
# Check the distribution before balancing
print("Before balancing:")
print(new_df['sentiment'].value_counts())

# Get the maximum count among all sentiment classes
max_count = new_df['sentiment'].value_counts().max()

# Oversample each class to have the same number of samples as the largest class
positive_df = new_df[new_df['sentiment'] == 'Positive']
neutral_df = new_df[new_df['sentiment'] == 'Neutral']
negative_df = new_df[new_df['sentiment'] == 'Negative']

neutral_oversampled = resample(neutral_df,
                               replace=True,   # Allow duplicate samples
                               n_samples=max_count,
                               random_state=42)

negative_oversampled = resample(negative_df,
                                replace=True,
                                n_samples=max_count,
                                random_state=42)

# Combine the oversampled DataFrames with the original Positive class
balanced_df = pd.concat([positive_df, neutral_oversampled, negative_oversampled])

# Shuffle the balanced DataFrame
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Check the distribution after balancing
print("\nAfter balancing:")
print(balanced_df['sentiment'].value_counts())


Before balancing:
sentiment
Positive    119
Neutral      88
Negative     31
Name: count, dtype: int64

After balancing:
sentiment
Neutral     119
Positive    119
Negative    119
Name: count, dtype: int64


In [None]:
balanced_df.head(10)

,sentence,sentiment
0,"Traditionally, medicine has worked in a patern...",Neutral
1,"It also nodded to another, more functional flo...",Positive
2,Fire officials found the person trapped inside...,Negative
3,Flooding in parts of Virginia mixed with recen...,Negative
4,"Further East, about 20,000 outages have been r...",Positive
5,And not enough funding has gone into research ...,Negative
6,While President Jimmy Carter signed the law cr...,Positive
7,Kentucky’s latest flood disaster hit more than...,Positive
8,Frigid Arctic air started to seep into the nor...,Negative
9,"Jonathan Bonnet, a lifestyle medicine-certifie...",Neutral


## Model Training

We work with two models: Multinomial Naive Bayes as the baseline, and Logistic Regression, tuned with a grid search, as the candidate.

In [None]:

def train_models():
    X = balanced_df['sentence']
    y = balanced_df['sentiment']

    # Split Data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Vectorization
    vectorizer = TfidfVectorizer(stop_words='english')
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Baseline Model: Naive Bayes
    nb_model = MultinomialNB()
    nb_model.fit(X_train_vec, y_train)
    y_pred_nb = nb_model.predict(X_test_vec)

    print("Baseline Model (Naive Bayes) Report:")
    print(classification_report(y_test, y_pred_nb))

    # Hyperparameter tuning for Logistic Regression
    param_grid = {
        'C': np.logspace(-3, 3, 10),
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear', 'saga'],
        'max_iter': [200, 400, 600],
    }

    grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train_vec, y_train)

    # Best Model
    best_lr_model = grid_search.best_estimator_
    y_pred_lr = best_lr_model.predict(X_test_vec)

    print("New Model (Optimized Logistic Regression) Report:")
    print(classification_report(y_test, y_pred_lr))

    print("Best Parameters:", grid_search.best_params_)

    # Save Models and Vectorizer
    joblib.dump(nb_model, 'baseline_model.pkl')
    joblib.dump(best_lr_model, 'new_model.pkl')
    joblib.dump(vectorizer, 'vectorizer.pkl')

    print("Models and vectorizer saved.")

if __name__ == "__main__":
    train_models()


Baseline Model (Naive Bayes) Report:
              precision    recall  f1-score   support

    Negative       0.93      1.00      0.96        26
     Neutral       0.76      0.70      0.73        23
    Positive       0.70      0.70      0.70        23

    accuracy                           0.81        72
   macro avg       0.80      0.80      0.80        72
weighted avg       0.80      0.81      0.80        72

New Model (Optimized Logistic Regression) Report:
              precision    recall  f1-score   support

    Negative       0.96      1.00      0.98        26
     Neutral       0.87      0.57      0.68        23
    Positive       0.67      0.87      0.75        23

    accuracy                           0.82        72
   macro avg       0.83      0.81      0.81        72
weighted avg       0.84      0.82      0.81        72

Best Parameters: {'C': np.float64(10.0), 'max_iter': 200, 'penalty': 'l2', 'solver': 'saga'}
Models and vectorizer saved.
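
To put the 0.81 versus 0.82 accuracy figures in perspective, the number of correctly classified test sentences can be estimated from each report as recall times support. The sketch below does that arithmetic; the counts are reconstructed from rounded report values, so treat them as estimates, and the helper name is hypothetical.

```python
def correct_predictions(recall_by_class, support_by_class):
    """Estimate correct test predictions per class as recall x support."""
    return {
        label: round(recall_by_class[label] * support_by_class[label])
        for label in recall_by_class
    }

# Values copied from the two classification reports above.
support = {"Negative": 26, "Neutral": 23, "Positive": 23}
nb_recall = {"Negative": 1.00, "Neutral": 0.70, "Positive": 0.70}
lr_recall = {"Negative": 1.00, "Neutral": 0.57, "Positive": 0.87}

nb_correct = sum(correct_predictions(nb_recall, support).values())
lr_correct = sum(correct_predictions(lr_recall, support).values())
total = sum(support.values())

print(f"Naive Bayes: {nb_correct}/{total} = {nb_correct / total:.2f}")
print(f"Logistic Regression: {lr_correct}/{total} = {lr_correct / total:.2f}")
```

Under these estimates the two models differ by a single test sentence (58 versus 59 of 72), so the accuracy gap alone is weak evidence that the tuned model is better.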




## Model Deployment

In [None]:
!pip install uvicorn

Collecting uvicorn
  Downloading uvicorn-0.34.2-py3-none-any.whl.metadata (6.5 kB)
Downloading uvicorn-0.34.2-py3-none-any.whl (62 kB)
Installing collected packages: uvicorn
Successfully installed uvicorn-0.34.2


In [None]:
import uvicorn
import nest_asyncio

In [None]:
# main.py

app = FastAPI()

# Load Models and Vectorizer
vectorizer = joblib.load('vectorizer.pkl')
baseline_model = joblib.load('baseline_model.pkl')
new_model = joblib.load('new_model.pkl')

class TextData(BaseModel):
    text: str

@app.post('/predict/')
def predict_sentiment(data: TextData):
    text_vector = vectorizer.transform([data.text])

    # A/B Testing Logic (50-50 split)
    if random.random() < 0.5:
        prediction = baseline_model.predict(text_vector)
        version = "Baseline Model (Naive Bayes)"
    else:
        prediction = new_model.predict(text_vector)
        version = "New Model (Logistic Regression)"

    return {
        "sentiment": prediction[0],
        "model_version": version
    }


if __name__ == "__main__":
    nest_asyncio.apply()  # Let uvicorn run inside the notebook's already-running event loop
    uvicorn.run(app, host="127.0.0.1", port=8000)



INFO:     Started server process [819]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
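
The endpoint above assigns each request to a model with `random.random()`, so the same text can be answered by different models on successive calls. One alternative is to derive the assignment from a stable key such as the request text or a user ID. The sketch below assumes a hypothetical `assign_variant` helper and is not part of the notebook's API.

```python
import hashlib

def assign_variant(key, new_model_share=0.5):
    """Map a key to 'new' or 'baseline' so the same key always gets the same model."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "new" if bucket < new_model_share else "baseline"

texts = [f"example sentence {i}" for i in range(1000)]
new_count = sum(assign_variant(t) == "new" for t in texts)
print(f"{new_count} of {len(texts)} keys routed to the new model")
```

Inside `predict_sentiment`, the `random.random() < 0.5` check could be replaced with `assign_variant(data.text) == "baseline"`, which keeps the split close to 50-50 while making each response reproducible.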
