# 1. Objective

The goal is to:

1. Classify blog posts into their respective categories using a Naive Bayes text-classification model.
2. Perform sentiment analysis (positive / neutral / negative) on the blog texts.
3. Evaluate and interpret both the classification and the sentiment analysis.

# 2. Data Understanding

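The input is a CSV file, `blogs.csv`, with two columns:

- **Data** – the text content of each blog post
- **Labels** – the category of the blog post

The sample rows printed in Section 3 show that each text still carries Usenet-style headers (`Path:`, `Newsgroups:`, `Xref:`) and that the labels are newsgroup names such as `alt.atheism`.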

In [2]:

import pandas as pd
import re
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns


# 3. Load Data

In [3]:

df = pd.read_csv("blogs.csv")

print("--- Data Exploration ---\n")

print("First 5 rows of the dataset:\n")

print(df.head(),"\n")

print("\nDataset Information:\n")
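# df.info() prints its report itself and returns None, so wrapping it in print() also emits a trailing "None".
print(df.info(),"\n")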

print("\nDistribution of Categories:")
print(df['Labels'].value_counts())


--- Data Exploration ---

First 5 rows of the dataset:

                                                Data       Labels
0  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
1  Newsgroups: alt.atheism\nPath: cantaloupe.srv....  alt.atheism
2  Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...  alt.atheism
3  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...  alt.atheism 


Dataset Information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Data    2000 non-null   object
 1   Labels  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB
None 


Distribution of Categories:
Labels
alt.atheism                 100
comp.graphics               100
talk.politics.misc          100
talk.politics.mideast       100
talk.politics.guns          100
soc.religion.christian      1

# 4. Custom Stopwords

In [4]:

custom_stop_words = set([
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
    "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers",
    "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves",
    "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are",
    "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does",
    "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until",
    "while", "of", "at", "by", "for", "with", "about", "against", "between", "into",
    "through", "during", "before", "after", "above", "below", "to", "from", "up", "down",
    "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here",
    "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more",
    "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so",
    "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"
])


# 5. Data Preprocessing
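The `preprocess_text` function below applies these steps in order:
- Lowercasing
- Removing URLs and HTML tags
- Removing every non-letter character (punctuation, digits)
- Normalizing whitespace
- Tokenizing on whitespace
- Removing stop words and single-character tokens
- Lemmatizing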

In [5]:

nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()

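def preprocess_text(text):
    """Clean a raw post and return space-joined, lemmatized tokens."""
    if not isinstance(text, str):
        return ""  # treat missing values (e.g. NaN) as empty text
    text = text.lower()
    # Remove URLs. 'http\S+' already covers 'https', and no '^'/'$' anchors
    # are used, so the re.MULTILINE flag is not needed.
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'<.*?>', '', text)         # remove HTML-like tags (also strips <user@host> addresses)
    text = re.sub(r'[^a-z\s]', '', text)      # drop punctuation and digits; "alt.atheism" becomes "altatheism"
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    # Drop stop words and single-character tokens, then lemmatize (default noun POS).
    tokens = [lemmatizer.lemmatize(word) for word in text.split()
              if word not in custom_stop_words and len(word) > 1]
    return ' '.join(tokens)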

df['Processed_Data'] = df['Data'].apply(preprocess_text)


[nltk_data] Downloading package wordnet to /Users/bunny/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/bunny/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
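
To see what the cleaning regexes do to a single post, the following sketch runs them on a made-up string. It is a simplified, standard-library-only variant of `preprocess_text`: the stop-word set `DEMO_STOP_WORDS`, the helper name `clean_text_demo`, and the sample text are all hypothetical, and lemmatization is left out so the snippet runs without NLTK data.

```python
import re

# Hypothetical, shortened stop-word set used only for this illustration.
DEMO_STOP_WORDS = {"the", "is", "at", "a", "and", "of", "to", "in"}

def clean_text_demo(text):
    """Apply the same regex cleaning as preprocess_text, minus lemmatization."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>', '', text)           # remove HTML-like tags
    text = re.sub(r'[^a-z\s]', '', text)        # keep only letters and whitespace
    tokens = [w for w in text.split() if w not in DEMO_STOP_WORDS and len(w) > 1]
    return ' '.join(tokens)

# Made-up post with a Usenet-style header line, a URL, tags, and digits.
sample = "Newsgroups: alt.atheism\nVisit http://example.com <b>now</b> at 9 AM!"
cleaned = clean_text_demo(sample)
print(cleaned)  # newsgroups altatheism visit now am
```

Note that the header token `altatheism` survives the cleaning. If the real posts keep their `Newsgroups:` line, the category name itself can end up among the features.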


# 6. Feature Extraction using TF-IDF

In [6]:

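# sublinear_tf=True uses 1 + log(tf); min_df=5 drops terms found in fewer than 5 posts;
# ngram_range=(1,2) adds bigrams to the unigrams.
# encoding='latin-1' only applies to bytes or file input, so it has no effect on these already-decoded strings.
tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1,2))
# Caveat: the vectorizer is fitted on all 2000 posts, so vocabulary and IDF statistics also reflect the later test rows.
X = tfidf_vectorizer.fit_transform(df['Processed_Data'])
y = df['Labels']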

print("\n--- Feature Extraction ---")
print(f"Shape of TF-IDF matrix: {X.shape}")
print(f"Number of unique features: {len(tfidf_vectorizer.get_feature_names_out())}")



--- Feature Extraction ---
Shape of TF-IDF matrix: (2000, 9450)
Number of unique features: 9450
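
For intuition about the weights inside this matrix, the sketch below computes TF-IDF by hand for a toy corpus, using the formulas scikit-learn documents for `sublinear_tf=True` together with the default `smooth_idf=True` and `norm='l2'`. The three toy documents and the helper name `tfidf_row` are made up for illustration. Bigrams and `min_df` are left out to keep it short.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical), standing in for df['Processed_Data'].
docs = [["gun", "law", "gun"], ["law", "court"], ["gun", "shop"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    """Return L2-normalized sublinear TF-IDF weights for one document."""
    counts = Counter(doc)
    weights = {}
    for term, count in counts.items():
        tf = 1 + math.log(count)                                  # sublinear term frequency
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1   # smoothed inverse document frequency
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))        # L2 norm of the row
    return {term: w / norm for term, w in weights.items()}

row = tfidf_row(docs[0])
print({term: round(w, 3) for term, w in row.items()})
```

In the first toy document, "gun" and "law" have the same document frequency, so the higher weight of "gun" comes only from its dampened term frequency, 1 + ln 2 instead of 2.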


# 7. Train Naive Bayes Classifier

In [7]:

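# Hold out 20% of the posts (400 rows), keeping the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Predictions on the hold-out set; these are scored in the Model Evaluation section.
y_pred = nb_classifier.predict(X_test)

# cross_val_score clones the classifier and refits it on each fold of the full matrix,
# independently of the hold-out split above.
cv_scores = cross_val_score(nb_classifier, X, y, cv=5, scoring='accuracy')
print(f"5-Fold Cross-Validation Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")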


5-Fold Cross-Validation Accuracy: 0.9210 ± 0.0064
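
Because the TF-IDF matrix was fitted on all posts before cross-validation, each validation fold has already influenced the vocabulary and IDF values. One way to avoid this, sketched here under the assumption that scikit-learn is installed, is to wrap the vectorizer and the classifier in a single pipeline so both are refitted on the training part of every fold. The toy `texts` and `labels` are hypothetical stand-ins for `df['Processed_Data']` and `df['Labels']`, and `min_df` is omitted because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: four short "posts" for each of two made-up classes.
texts = [
    "gun law court", "gun shop permit", "law gun owner", "gun range permit",
    "pixel render image", "image render shader", "pixel shader color", "render color image",
]
labels = ["guns"] * 4 + ["graphics"] * 4

# The vectorizer is part of the estimator, so it is fitted only on each training fold.
pipeline = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
    MultinomialNB(),
)

scores = cross_val_score(pipeline, texts, labels, cv=2, scoring="accuracy")
print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.4f}")
```

Applied to the real data, the same pattern would take the processed text column and the label column directly. Any difference from the 0.9210 reported above would show how much the shared vocabulary contributed.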


# 8. Sentiment Analysis (VADER + TextBlob)

In [8]:

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

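def combined_sentiment(text):
    """Label text by the mean of VADER's compound score and TextBlob's polarity (both in [-1, 1])."""
    if not isinstance(text, str):
        return 'Neutral'  # missing values get a neutral label
    sia_score = sia.polarity_scores(text)['compound']
    tb_score = TextBlob(text).sentiment.polarity
    avg_score = (sia_score + tb_score)/2
    # ±0.05 is the cutoff conventionally used with VADER's compound score.
    if avg_score >= 0.05:
        return 'Positive'
    elif avg_score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Sentiment is scored on the raw 'Data' column, so headers and quoted text are included.
df['Combined_Sentiment'] = df['Data'].apply(combined_sentiment)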

print("\n--- Sentiment Analysis Results ---")
print(df['Combined_Sentiment'].value_counts())

sentiment_distribution = df.groupby('Labels')['Combined_Sentiment'].value_counts().unstack(fill_value=0)
print("\nSentiment Distribution by Category:")
print(sentiment_distribution)


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/bunny/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!



--- Sentiment Analysis Results ---
Combined_Sentiment
Positive    1338
Negative     641
Neutral       21
Name: count, dtype: int64

Sentiment Distribution by Category:
Combined_Sentiment        Negative  Neutral  Positive
Labels                                               
alt.atheism                     42        1        57
comp.graphics                   16        0        84
comp.os.ms-windows.misc         25        0        75
comp.sys.ibm.pc.hardware        21        1        78
comp.sys.mac.hardware           24        0        76
comp.windows.x                  20        2        78
misc.forsale                     9        5        86
rec.autos                       27        1        72
rec.motorcycles                 31        1        68
rec.sport.baseball              27        2        71
rec.sport.hockey                28        1        71
sci.crypt                       29        0        71
sci.electronics                 18        4        78
sci.med              
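
The very small Neutral count follows partly from the decision rule itself: only averages strictly between -0.05 and 0.05 are labeled Neutral. The sketch below isolates that rule so it can be tried on score pairs without loading VADER or TextBlob. The function name `label_from_scores` and the example scores are hypothetical and not taken from the dataset.

```python
def label_from_scores(vader_compound, textblob_polarity, threshold=0.05):
    """Map the mean of two polarity scores in [-1, 1] to a sentiment label."""
    avg = (vader_compound + textblob_polarity) / 2
    if avg >= threshold:
        return "Positive"
    if avg <= -threshold:
        return "Negative"
    return "Neutral"

# Hypothetical score pairs chosen to show each branch of the rule.
examples = {
    "VADER near its ceiling, TextBlob mildly negative": (0.99, -0.20),
    "both scorers close to zero": (0.00, 0.06),
    "scorers disagree, VADER stronger": (-0.60, 0.10),
}
results = {desc: label_from_scores(v, b) for desc, (v, b) in examples.items()}
for desc, label in results.items():
    print(f"{desc}: {label}")
```

VADER's compound score tends to move toward -1 or 1 as a text accumulates more sentiment-bearing words, so for long posts it can outweigh TextBlob's polarity in the average, as in the first example.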

# 9. Model Evaluation

In [11]:

print("\n--- Model Evaluation ---")
print("Naive Bayes Classifier Performance Metrics:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))





--- Model Evaluation ---
Naive Bayes Classifier Performance Metrics:
Accuracy: 0.9050

Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.82      0.70      0.76        20
           comp.graphics       0.94      0.85      0.89        20
 comp.os.ms-windows.misc       0.95      1.00      0.98        20
comp.sys.ibm.pc.hardware       0.78      0.90      0.84        20
   comp.sys.mac.hardware       1.00      0.90      0.95        20
          comp.windows.x       0.90      0.95      0.93        20
            misc.forsale       0.95      1.00      0.98        20
               rec.autos       0.95      1.00      0.98        20
         rec.motorcycles       0.95      0.95      0.95        20
      rec.sport.baseball       1.00      1.00      1.00        20
        rec.sport.hockey       1.00      1.00      1.00        20
               sci.crypt       1.00      1.00      1.00        20
         sci.electronics      
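
Per-class scores show which categories are weak but not what they are mistaken for. A minimal way to find out, sketched here with the standard library only, is to tally the misclassified (true, predicted) pairs. The helper `top_confusions` and the demo labels are hypothetical; in the notebook the inputs would be `y_test` and `y_pred`.

```python
from collections import Counter

def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)

# Hypothetical labels for illustration only, not the notebook's predictions.
y_true_demo = ["alt.atheism", "alt.atheism", "alt.atheism", "sci.crypt", "comp.graphics"]
y_pred_demo = ["soc.religion.christian", "soc.religion.christian", "alt.atheism",
               "sci.crypt", "comp.windows.x"]

confusions = top_confusions(y_true_demo, y_pred_demo)
for (true_label, predicted_label), count in confusions:
    print(f"{true_label} -> {predicted_label}: {count}")
```

The same information is available as a full matrix from `confusion_matrix(y_test, y_pred)`, which the notebook imports but does not call.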

# 10. Discussion & Reflection

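1. **Classifier Performance:** Multinomial Naive Bayes on TF-IDF unigram and bigram features reaches 0.9050 accuracy on the 400-post hold-out set and 0.9210 ± 0.0064 in 5-fold cross-validation. In the visible part of the classification report, the weakest classes are `alt.atheism` (F1 0.76, recall 0.70) and `comp.sys.ibm.pc.hardware` (precision 0.78), while `rec.sport.baseball`, `rec.sport.hockey`, and `sci.crypt` score 1.00 on all three metrics. Each class has only 20 test posts, so a single misclassification shifts recall by 0.05.
2. **Challenges:** Preprocessing (noise removal, stop-word filtering, lemmatization) shapes the feature space directly. Naive Bayes assumes that features are conditionally independent given the class, which overlapping unigrams and bigrams clearly violate. Two further caveats apply to the reported scores: the raw texts keep their Usenet headers (the second sample row begins with `Newsgroups: alt.atheism`), so tokens that name the category may survive cleaning, and the TF-IDF vectorizer is fitted on all posts before the train/test split.
3. **Sentiment Analysis:** Averaging the VADER compound score and TextBlob polarity is meant to smooth out disagreements between the two lexicon-based scorers, although no labeled sentiment data is available to confirm that it does. The result is 1338 Positive, 641 Negative, and only 21 Neutral posts. Among the categories visible in the output, `misc.forsale` (86 of 100) and `comp.graphics` (84) are the most positive, `alt.atheism` is the most negative (42), and the two sports groups sit at 71 positive each, so sports is not more positive than the computing groups. The politics categories may lean more negative, but their rows are cut off in the printed table.
4. **Conclusion:** The workflow (preprocessing → TF-IDF → Naive Bayes → sentiment) classifies these posts with roughly 91 to 92% accuracy, and the sentiment labels add a coarse, unvalidated view of tone per category.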
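
For reference, the independence assumption mentioned in point 2 is what lets the classifier score a post as a simple sum over features. The block below states the decision rule and the smoothed parameter estimate as given in the scikit-learn documentation for `MultinomialNB`. Here $x_i$ is the TF-IDF weight of feature $i$ in the post, $N_{ci}$ is the total weight of feature $i$ over the training posts of class $c$, $N_c = \sum_i N_{ci}$, $n$ is the number of features (9450 in this notebook), and $\alpha = 1$ is the default smoothing used by `MultinomialNB()`.

```latex
\hat{y} \;=\; \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} x_i \log \theta_{ci} \Big],
\qquad
\theta_{ci} \;=\; \frac{N_{ci} + \alpha}{N_c + \alpha\, n}
```

Each feature contributes its own term regardless of the others, so a bigram and the two unigrams it contains are counted as three separate pieces of evidence.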