# TEXT CLASSIFICATION USING NAIVE BAYES AND SENTIMENT ANALYSIS ON BLOG POSTS
## Overview
In this assignment, you will work on the "blogs_categories.csv" dataset, which contains blog posts categorized into various themes. Your task will be to build a text classification model using the Naive Bayes algorithm to categorize the blog posts accurately. Furthermore, you will perform sentiment analysis to understand the general sentiment (positive, negative, neutral) expressed in these posts. This assignment will enhance your understanding of text classification, sentiment analysis, and the practical application of the Naive Bayes algorithm in Natural Language Processing (NLP).

## Dataset
The provided dataset, "blogs_categories.csv", consists of blog posts along with their associated categories. Each row represents a blog post with the following columns:

•	Text: The content of the blog post. Column name: Data

•	Category: The category to which the blog post belongs. Column name: Labels

## Tasks

### 1. Data Exploration and Preprocessing

•	Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.

•	Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.

•	Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.


In [1]:
# Step 1: Import required libraries
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Download NLTK stopwords
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
from google.colab import files
uploaded = files.upload()

Saving blogs.csv to blogs.csv


In [4]:
# Step 2: Load the dataset
data = pd.read_csv("blogs.csv")

In [5]:
# Step 3: Basic exploration
print("Shape of the dataset:", data.shape)
print("\nDataset Info:")
print(data.info())
print("\nMissing values per column:")
print(data.isnull().sum())
print("\nFirst 5 rows of the dataset:")
display(data.head())

Shape of the dataset: (2000, 2)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Data    2000 non-null   object
 1   Labels  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB
None

Missing values per column:
Data      0
Labels    0
dtype: int64

First 5 rows of the dataset:


                                                Data       Labels
0  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
1  Newsgroups: alt.atheism\nPath: cantaloupe.srv....  alt.atheism
2  Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...  alt.atheism
3  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...  alt.atheism
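
The first rows show that each post begins with message headers (`Path:`, `Newsgroups:`, `Xref:`), and the `Newsgroups:` header can contain the category name itself. The sketch below shows one possible way to separate headers from the body before cleaning. It assumes the header block ends at the first blank line, which is the usual convention for such messages but has not been verified for every row of this dataset. The function name `split_headers` and the sample message are hypothetical.

```python
def split_headers(message: str) -> tuple[dict[str, str], str]:
    """Split a message into (headers, body) at the first blank line.

    Assumption: headers and body are separated by an empty line.
    If no blank line is found, the whole message is treated as body.
    """
    head, sep, body = message.partition("\n\n")
    if not sep:
        return {}, message.strip()
    headers = {}
    for line in head.splitlines():
        name, colon, value = line.partition(":")
        # Skip continuation lines and lines without a "Name: value" shape
        if colon and name and not name[0].isspace():
            headers[name.strip().lower()] = value.strip()
    return headers, body.strip()


# Hypothetical message shaped like the rows shown above
sample = (
    "Newsgroups: alt.atheism\n"
    "Path: cantaloupe.srv.cs.cmu.edu!example\n"
    "Subject: An example post\n"
    "\n"
    "This is the body of the post.\n"
)
headers, body = split_headers(sample)
print(headers["newsgroups"])
print(body)
```

Whether to drop the headers is a modelling choice: keeping them may inflate accuracy because the label can leak through `Newsgroups:`, while removing them leaves only the post content for the classifier and the sentiment step.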


In [6]:
# Step 4: Text preprocessing function
# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Delete punctuation, digits, and special characters
    text = re.sub(r'[^a-z\s]', '', text)
    # Split on whitespace (this also collapses extra spaces), drop stopwords, rejoin
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)
    return text

In [7]:
# Apply preprocessing to the 'Data' column
data['clean_text'] = data['Data'].apply(clean_text)
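
One side effect of the cleaning step is worth seeing on a small input: because non-letter characters are deleted rather than replaced by a space, dotted names such as hostnames collapse into a single long token. The sketch below compares the two behaviours. It uses a tiny hard-coded stopword set (an assumption, so that it runs without the NLTK download), and `clean_delete` / `clean_space` are hypothetical names, not functions from this notebook.

```python
import re

# Small illustrative stopword set; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "is", "at", "a", "of", "and", "to", "in"}


def clean_delete(text: str) -> str:
    """Mirror of the notebook's approach: delete every non-letter character."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def clean_space(text: str) -> str:
    """Variant: replace non-letter characters with a space instead."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


sample = "Path: cantaloupe.srv.cs.cmu.edu is at CMU"
print(clean_delete(sample))  # path cantaloupesrvcscmuedu cmu
print(clean_space(sample))   # path cantaloupe srv cs cmu edu cmu
```

Switching to the spacing variant would change the vocabulary and therefore the TF-IDF matrix, so the results reported in this notebook apply only to the deleting version.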

In [8]:
# Step 5: Feature extraction using TF-IDF
tfidf = TfidfVectorizer(max_features=5000)  # Limit features to 5000 for performance
X = tfidf.fit_transform(data['clean_text'])

In [9]:
# Labels
y = data['Labels']

print("\nText preprocessing and TF-IDF feature extraction completed.")
print("Shape of feature matrix:", X.shape)


Text preprocessing and TF-IDF feature extraction completed.
Shape of feature matrix: (2000, 5000)


In [10]:
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
# Function to clean text (same behaviour as the Step 4 version)
STOP_WORDS = set(stopwords.words('english'))  # build the stopword set once

def clean_text(text):
    text = text.lower()  # lowercase
    text = re.sub(r'[^a-z\s]', '', text)  # remove punctuation/numbers
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)  # remove stopwords
    return text

In [12]:
# Apply cleaning
data['clean_text'] = data['Data'].apply(clean_text)

# Preview cleaned text
data[['Data', 'clean_text']].head()

                                                Data                                         clean_text
0  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  path cantaloupesrvcscmuedumagnesiumclubcccmued...
1  Newsgroups: alt.atheism\nPath: cantaloupe.srv....  newsgroups altatheism path cantaloupesrvcscmue...
2  Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...  path cantaloupesrvcscmuedudasnewsharvardedunoc...
3  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  path cantaloupesrvcscmuedumagnesiumclubcccmued...
4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...  xref cantaloupesrvcscmuedu altatheism talkreli...


In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=3000)  # Replaces the 5000-feature vectorizer from In [8]; this X is the one used for training below

# Fit and transform the cleaned text
X = tfidf.fit_transform(data['clean_text'])

# Labels
y = data['Labels']

# Preview feature matrix shape
print("Shape of TF-IDF feature matrix:", X.shape)

Shape of TF-IDF feature matrix: (2000, 3000)
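
To make the numbers inside the TF-IDF matrix less opaque, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The tokenisation here is a plain whitespace split, which is a simplification of the vectorizer's own token pattern, and the three documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, rows) with smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Hypothetical mini-corpus
docs = ["hockey game tonight", "hockey playoff game", "encryption key escrow"]
vocab, rows = tfidf_rows(docs)
for tok, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{tok:10s} {weight:.3f}")
```

In the first document, `tonight` receives a larger weight than `hockey` or `game` because it occurs in only one document, which is the behaviour that lets rare, category-specific words stand out.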



### 2. Naive Bayes Model for Text Classification

•	Split the data into training and test sets.

•	Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.

•	Train the model on the training set and make predictions on the test set.


In [14]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 1. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
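
Note that `X` was produced by fitting the TF-IDF vectorizer on all 2000 posts before this split, so the IDF weights were computed with the test posts included. One possible way to keep the test set fully unseen is to put the vectorizer and the classifier in a single scikit-learn pipeline and fit it on the training texts only. The sketch below illustrates the pattern on a hypothetical toy corpus; it is not the code that produced the results in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned posts and their labels
train_texts = [
    "hockey puck ice goal",
    "hockey team playoff goal",
    "cipher key encryption",
    "encryption key escrow chip",
]
train_labels = ["rec.sport.hockey", "rec.sport.hockey", "sci.crypt", "sci.crypt"]
test_texts = ["hockey puck", "cipher key"]

# The vectorizer is fitted inside the pipeline, using training texts only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(list(predictions))
```

Applied here, this would mean splitting `data['clean_text']` and `y` first and then fitting the pipeline on the training portion. The resulting scores may differ somewhat from those obtained with the pre-fitted matrix.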

In [15]:
# 2. Initialize Naive Bayes classifier (Multinomial is suitable for text data)
nb_model = MultinomialNB()
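
For intuition about what the classifier computes, the following is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, the same default smoothing value that `MultinomialNB` uses). It works on raw word counts from a hypothetical four-document corpus, whereas the notebook feeds TF-IDF weights to scikit-learn, so treat it as an illustration of the idea rather than a reimplementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_like


def predict_nb(model, doc):
    """Pick the class with the highest log posterior score."""
    log_prior, log_like = model
    scores = {}
    for c in log_prior:
        # Words that never appeared in training are ignored
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus
docs = ["goal puck ice", "puck team goal", "cipher key", "key escrow chip"]
labels = ["hockey", "hockey", "crypt", "crypt"]
toy_model = train_nb(docs, labels)
print(predict_nb(toy_model, "puck goal"))
print(predict_nb(toy_model, "key chip"))
```

Smoothing matters because a word that never occurs in a class would otherwise have probability zero and veto that class regardless of the other words in the post.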

In [16]:
# 3. Train the model
nb_model.fit(X_train, y_train)

In [17]:
# 4. Make predictions on the test set
y_pred = nb_model.predict(X_test)

In [18]:
# 5. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Naive Bayes Classifier: {accuracy:.4f}\n")


Accuracy of Naive Bayes Classifier: 0.8425



In [19]:
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.76      0.65      0.70        20
           comp.graphics       0.85      0.85      0.85        20
 comp.os.ms-windows.misc       0.80      0.80      0.80        20
comp.sys.ibm.pc.hardware       0.58      0.75      0.65        20
   comp.sys.mac.hardware       0.93      0.70      0.80        20
          comp.windows.x       0.80      0.80      0.80        20
            misc.forsale       0.90      0.95      0.93        20
               rec.autos       0.86      0.95      0.90        20
         rec.motorcycles       0.90      0.90      0.90        20
      rec.sport.baseball       0.95      1.00      0.98        20
        rec.sport.hockey       1.00      1.00      1.00        20
               sci.crypt       0.91      1.00      0.95        20
         sci.electronics       0.83      0.75      0.79        20
                 sci.med       0.94      0.80      0
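
The per-class figures in the report can be reproduced from simple counts. The sketch below computes precision, recall, and F1 for a single class. The counts are inferred from the `comp.sys.ibm.pc.hardware` row and are an assumption, since the confusion matrix itself is not shown: recall 0.75 on 20 test posts implies 15 true positives and 5 false negatives, and precision 0.58 is then consistent with 11 false positives (15 / 26).

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 for one class from its counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Inferred counts for comp.sys.ibm.pc.hardware (assumption, see above)
p, r, f1 = precision_recall_f1(tp=15, fp=11, fn=5)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.58 recall=0.75 f1=0.65
```

Read this way, the row says that the model finds most posts of this class but also assigns the label to a fair number of posts from other classes, plausibly the neighbouring hardware groups.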


### 3. Sentiment Analysis

•	Choose a suitable library or method for performing sentiment analysis on the blog post texts.

•	Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.

•	Examine the distribution of sentiments across different categories and summarize your findings.


In [20]:
!pip install vaderSentiment


Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [27]:
!pip install textblob




In [28]:
# Import required libraries
from textblob import TextBlob
import pandas as pd

In [29]:
# Function to categorize sentiment
def get_sentiment(text):
    analysis = TextBlob(text)
    if analysis.sentiment.polarity > 0:
        return 'Positive'
    elif analysis.sentiment.polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'

In [30]:
# Apply sentiment analysis on the 'Data' column
# The dataset was loaded as `data` above; `df` is a working copy of it
df = data.copy()
df['Sentiment'] = df['Data'].apply(get_sentiment)

# View the first few rows to verify
print(df[['Data', 'Labels', 'Sentiment']].head())

# Analyze sentiment distribution across all blog posts
sentiment_counts = df['Sentiment'].value_counts()
print("\nOverall Sentiment Distribution:")
print(sentiment_counts)


                                                Data       Labels Sentiment
0  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism  Positive
1  Newsgroups: alt.atheism\nPath: cantaloupe.srv....  alt.atheism  Negative
2  Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...  alt.atheism  Positive
3  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism  Positive
4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...  alt.atheism  Positive

Overall Sentiment Distribution:
Sentiment
Positive    1543
Negative     457
Name: count, dtype: int64
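
No post is labelled Neutral because `get_sentiment` returns that label only when the polarity is exactly 0, which is unlikely for long texts. A common alternative is to treat a small band around zero as neutral. The sketch below shows that idea on hypothetical polarity scores in TextBlob's range of -1 to 1. The function name `label_polarity` and the band width of 0.05 are illustrative assumptions, not values taken from this notebook.

```python
def label_polarity(polarity: float, band: float = 0.0) -> str:
    """Map a polarity score in [-1, 1] to a label, with an optional neutral band."""
    if polarity > band:
        return "Positive"
    if polarity < -band:
        return "Negative"
    return "Neutral"


# Hypothetical polarity scores
scores = [0.02, -0.01, 0.35, -0.40, 0.0]

strict = [label_polarity(s) for s in scores]             # same rule as get_sentiment
banded = [label_polarity(s, band=0.05) for s in scores]  # weak scores become Neutral

print(strict)
print(banded)
```

With a band, weakly positive or weakly negative posts move into the neutral group, so the 1543 / 457 split reported above would likely shift. The right width depends on how the labels will be used.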


In [31]:
# Examine sentiment distribution across different categories
category_sentiment = df.groupby('Labels')['Sentiment'].value_counts().unstack().fillna(0)
print("\nSentiment Distribution Across Categories:")
print(category_sentiment)


Sentiment Distribution Across Categories:
Sentiment                 Negative  Positive
Labels                                      
alt.atheism                     23        77
comp.graphics                   24        76
comp.os.ms-windows.misc         22        78
comp.sys.ibm.pc.hardware        20        80
comp.sys.mac.hardware           24        76
comp.windows.x                  27        73
misc.forsale                    16        84
rec.autos                       17        83
rec.motorcycles                 26        74
rec.sport.baseball              29        71
rec.sport.hockey                34        66
sci.crypt                       19        81
sci.electronics                 19        81
sci.med                         29        71
sci.space                       27        73
soc.religion.christian          13        87
talk.politics.guns              30        70
talk.politics.mideast           22        78
talk.politics.misc              22        78
talk.religio

### Sentiment Analysis Summary

**1. Positive vs Negative Posts**
- Most categories have more **positive posts** than negative ones.
- Examples:
  - `soc.religion.christian`: 87 positive, 13 negative → mostly positive
  - `misc.forsale`: 84 positive, 16 negative → mostly positive
- Categories with relatively more negative posts:
  - `rec.sport.hockey`: 66 positive, 34 negative
  - `talk.politics.guns`: 70 positive, 30 negative

**2. Trends by Category**
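- Technical categories (like `comp.graphics` with 76 positive and 24 negative, or `sci.crypt` with 81 and 19) have mostly positive sentiment.
- The largest negative counts are in `rec.sport.hockey` (34), `talk.politics.guns` (30), and `rec.sport.baseball` and `sci.med` (29 each). The other politics categories shown (`talk.politics.mideast`, `talk.politics.misc`) have 22 negative posts each, which is close to the typical level.
- The religion-related categories shown differ: `soc.religion.christian` is the most positive category (87 positive, 13 negative), while `alt.atheism` is near the middle (77 positive, 23 negative).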

**3. Insights**
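- Every category shown is majority positive; negative posts range from 13 to 34 per 100.
- Sports posts and `talk.politics.guns` carry somewhat more negative content than the technical and for-sale categories, but the differences are modest.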
- Sentiment analysis helps understand the general tone of each category.
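
The ranking behind these observations can be checked directly from the table. The sketch below copies the Negative/Positive counts for the 19 categories that are fully visible in the output above (the last row is cut off and is left out) and sorts them by share of negative posts. The variable names are hypothetical.

```python
# (negative, positive) counts copied from the table above; truncated last row omitted
counts = {
    "alt.atheism": (23, 77),
    "comp.graphics": (24, 76),
    "comp.os.ms-windows.misc": (22, 78),
    "comp.sys.ibm.pc.hardware": (20, 80),
    "comp.sys.mac.hardware": (24, 76),
    "comp.windows.x": (27, 73),
    "misc.forsale": (16, 84),
    "rec.autos": (17, 83),
    "rec.motorcycles": (26, 74),
    "rec.sport.baseball": (29, 71),
    "rec.sport.hockey": (34, 66),
    "sci.crypt": (19, 81),
    "sci.electronics": (19, 81),
    "sci.med": (29, 71),
    "sci.space": (27, 73),
    "soc.religion.christian": (13, 87),
    "talk.politics.guns": (30, 70),
    "talk.politics.mideast": (22, 78),
    "talk.politics.misc": (22, 78),
}

# Share of negative posts per category, highest first
ranking = sorted(
    ((name, neg / (neg + pos)) for name, (neg, pos) in counts.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, share in ranking[:5]:
    print(f"{name:26s} {share:.0%}")
```

Sorting by share rather than by raw count gives the same order here because every category has 100 posts, but it would still be valid if the categories were of different sizes.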



### 4. Evaluation

•	Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.

•	Discuss the performance of the model and any challenges encountered during the classification process.

•	Reflect on the sentiment analysis results and their implications regarding the content of the blog posts.


## 4. Evaluation

### 4.1 Naive Bayes Classifier Performance
The Naive Bayes classifier was trained on the blog post dataset and evaluated using accuracy, precision, recall, and F1-score.  

- **Accuracy:** 0.8425  
- **Classification Report:** Shows per-category precision, recall, and F1-score.  
- **Confusion Matrix:** Demonstrates correct vs incorrect predictions for each category.  

**Observations:**
- Categories like `rec.sport.hockey`, `sci.space`, and `misc.forsale` had very high F1-scores (~0.93–1.0), indicating excellent classification.  
- Some categories like `talk.politics.misc` and `alt.atheism` had lower recall and precision, suggesting that distinguishing these posts is more challenging.  
- Overall, the classifier performed well with an accuracy of 84.25%.

### 4.2 Challenges Faced
- Some categories had overlapping topics, making classification difficult.  
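- The categories are balanced (100 posts each, 20 per category in the test set), so class imbalance is not a factor. The limiting factor is more likely the small amount of data: only 80 training posts per category.  
- Preprocessing was challenging because the posts include message headers (`Path:`, `Newsgroups:`, `Xref:`), special characters, and varying text formats. Deleting punctuation also fuses hostnames into single long tokens, as the cleaned-text preview shows.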

### 4.3 Sentiment Analysis
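Sentiment analysis was performed on the `Data` column with TextBlob. The `get_sentiment` function can return **Positive**, **Negative**, or **Neutral**, but no post had a polarity of exactly 0, so only the first two labels appear in the results.  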

- **Overall Sentiment Distribution:**
  - Positive: 1543 posts  
  - Negative: 457 posts  

- **Sentiment Distribution Across Categories:**
  - Most categories have more positive posts than negative.  
  - Categories like `rec.sport.hockey` (34 negative), `talk.politics.guns` (30), and `rec.sport.baseball` and `sci.med` (29 each) have relatively more negative posts, which may reflect more critical or contentious discussion in those topics.  

**Insights:**
- Posts lean positive in every category shown; the share of negative posts is simply higher in categories where discussions are more critical or contentious.  
- Sentiment trends can help understand public opinion and mood within different blog categories.
