# TEXT CLASSIFICATION USING NAIVE BAYES AND SENTIMENT ANALYSIS ON BLOG POSTS

## Overview

In this assignment, you will work on the "blogs_categories.csv" dataset, which contains blog posts categorized into various themes. Your task will be to build a text classification model using the Naive Bayes algorithm to categorize the blog posts accurately. Furthermore, you will perform sentiment analysis to understand the general sentiment (positive, negative, neutral) expressed in these posts. This assignment will enhance your understanding of text classification, sentiment analysis, and the practical application of the Naive Bayes algorithm in Natural Language Processing (NLP).

## Dataset

The provided dataset, "blogs_categories.csv", consists of blog posts along with their associated categories. Each row represents a blog post with the following columns:
* Text: The content of the blog post. Column name: Data
* Category: The category to which the blog post belongs. Column name: Labels


## Tasks:

### 1. Data Exploration and Preprocessing

#### Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

In [2]:
df = pd.read_csv('./blogs_categories.csv')
df

Unnamed: 0.1,Unnamed: 0,Data,Labels
0,0,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49...,alt.atheism
1,1,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
2,2,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
3,3,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
4,4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
...,...,...,...
19992,19992,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:54...,talk.religion.misc
19993,19993,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:54...,talk.religion.misc
19994,19994,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc
19995,19995,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc


In [4]:
# Checking for null values
df.isnull().sum()

Unnamed: 0    0
Data          0
Labels        0
dtype: int64

In [5]:
# Checking for Duplicated Values
df.duplicated().sum()

0

In [8]:
# Checking for Number of Labels
df['Labels'].value_counts()

alt.atheism                 1000
comp.graphics               1000
talk.politics.misc          1000
talk.politics.mideast       1000
talk.politics.guns          1000
sci.space                   1000
sci.med                     1000
sci.electronics             1000
sci.crypt                   1000
rec.sport.hockey            1000
rec.sport.baseball          1000
rec.motorcycles             1000
rec.autos                   1000
misc.forsale                1000
comp.windows.x              1000
comp.sys.mac.hardware       1000
comp.sys.ibm.pc.hardware    1000
comp.os.ms-windows.misc     1000
talk.religion.misc          1000
soc.religion.christian       997
Name: Labels, dtype: int64

#### Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.

In [13]:
# Cleaning Data
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Downloading Stop Words
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

stop_words = set(stopwords.words('english'))

In [15]:
# Creating a function to clean blogs
def preprocess_text(text):
    '''Function to preprocess the text'''

    # Removing Punctuation Marks
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Converting to lowercase
    text = text.lower()

    # Removing numbers
    text = re.sub(r'\d+', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    return ' '.join(tokens)

df['Data'] = df['Data'].apply(preprocess_text)
df

Unnamed: 0.1,Unnamed: 0,Data,Labels
0,0,xref cantaloupesrvcscmuedu altatheism altathei...,alt.atheism
1,1,xref cantaloupesrvcscmuedu altatheism altathei...,alt.atheism
2,2,newsgroups altatheism path cantaloupesrvcscmue...,alt.atheism
3,3,xref cantaloupesrvcscmuedu altatheism altpolit...,alt.atheism
4,4,xref cantaloupesrvcscmuedu altatheism socmotss...,alt.atheism
...,...,...,...
19992,19992,xref cantaloupesrvcscmuedu altatheism talkreli...,talk.religion.misc
19993,19993,xref cantaloupesrvcscmuedu altatheism talkreli...,talk.religion.misc
19994,19994,xref cantaloupesrvcscmuedu talkreligionmisc ta...,talk.religion.misc
19995,19995,xref cantaloupesrvcscmuedu talkreligionmisc ta...,talk.religion.misc
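
The cleaned output above still begins with header fields (`xref`, `newsgroups`, `path`), and the `Newsgroups` field names the category itself, so a classifier can partly read the label off the header. Deleting punctuation outright also fuses dotted names such as `cantaloupe.srv.cs.cmu.edu` into one token. The sketch below shows one possible mitigation. It assumes that the first blank line in each post separates the header block from the body, and the helper names `strip_headers` and `punctuation_to_spaces` are hypothetical, not part of the notebook.

```python
import string

def strip_headers(text):
    """Return the text after the first blank line, or the input unchanged if there is none."""
    _, separator, body = text.partition("\n\n")
    return body if separator else text

# Map every punctuation character to a space instead of deleting it
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def punctuation_to_spaces(text):
    """Replace punctuation with spaces so dotted names split into separate tokens."""
    return text.translate(_PUNCT_TO_SPACE)

# Invented sample post: two header lines, a blank line, then the body
raw_post = (
    "Newsgroups: alt.atheism\n"
    "Path: cantaloupe.srv.cs.cmu.edu\n"
    "\n"
    "Is this claim testable?"
)
body = strip_headers(raw_post)
print(body)
print(punctuation_to_spaces("cantaloupe.srv.cs.cmu.edu"))
```

Applying these helpers before `preprocess_text` would change the scores reported in the evaluation section, so they are shown only as an illustration.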


#### Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF. 

In [18]:
# Converting text into numerical format
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep the TF-IDF matrix sparse; MultinomialNB accepts sparse input directly
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X = tfidf_vectorizer.fit_transform(df['Data'])
y = df['Labels']
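
To make the TF-IDF weighting concrete, the sketch below reproduces by hand the formula scikit-learn documents for its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The three-document corpus is invented for illustration.

```python
import math
from collections import Counter

docs = ["space shuttle launch", "hockey game tonight", "space station orbit"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency (scikit-learn default, smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf_row(tokens):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row = tfidf_row(tokenized[0])
print(row)
```

Because `space` occurs in two of the three documents, it receives a lower weight than `shuttle` or `launch`, which each occur in only one.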

### 2. Naive Bayes Model for Text Classification

#### Split the data into training and test sets.

In [19]:
# Splitting Data into Training and Test Sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.

In [20]:
# Implementing Naive Bayes
from sklearn.naive_bayes import MultinomialNB

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Making Predictions
y_pred = nb_classifier.predict(X_test)
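
In the cells above, the vectorizer is fitted on every post before the split, so vocabulary and IDF statistics from the test posts influence the training features. One way to avoid this is to split the raw text first and fit the vectorizer inside a scikit-learn pipeline. The sketch below uses a four-document toy training set invented for illustration; with the notebook's data, the raw `df['Data']` and `df['Labels']` columns would be split and passed instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the training split (invented for illustration)
train_texts = [
    "the rocket reached orbit",
    "nasa launched a satellite",
    "the goalie stopped the puck",
    "the team won the hockey game",
]
train_labels = ["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"]

# The vectorizer is fitted inside the pipeline, on the training text only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Unseen text is transformed with the training vocabulary, then classified
prediction = model.predict(["satellite in orbit"])
print(prediction[0])
```

The pipeline object also keeps the vectorizer and classifier together, which makes cross-validation and later reuse on new posts simpler.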

### 3. Sentiment Analysis

#### Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.

In [22]:
# Sentiment Analysis using TextBlob
from textblob import TextBlob

def get_sentiment(text):
    analysis = TextBlob(text)

    # Compute the polarity once and branch on it
    polarity = analysis.sentiment.polarity

    if polarity > 0:
        return 'positive'
    elif polarity == 0:
        return 'neutral'
    else:
        return 'negative'

df['Sentiment'] = df['Data'].apply(get_sentiment)
display(df)

Unnamed: 0.1,Unnamed: 0,Data,Labels,Sentiment
0,0,xref cantaloupesrvcscmuedu altatheism altathei...,alt.atheism,positive
1,1,xref cantaloupesrvcscmuedu altatheism altathei...,alt.atheism,positive
2,2,newsgroups altatheism path cantaloupesrvcscmue...,alt.atheism,positive
3,3,xref cantaloupesrvcscmuedu altatheism altpolit...,alt.atheism,negative
4,4,xref cantaloupesrvcscmuedu altatheism socmotss...,alt.atheism,negative
...,...,...,...,...
19992,19992,xref cantaloupesrvcscmuedu altatheism talkreli...,talk.religion.misc,positive
19993,19993,xref cantaloupesrvcscmuedu altatheism talkreli...,talk.religion.misc,positive
19994,19994,xref cantaloupesrvcscmuedu talkreligionmisc ta...,talk.religion.misc,positive
19995,19995,xref cantaloupesrvcscmuedu talkreligionmisc ta...,talk.religion.misc,positive
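
The function above labels a post `neutral` only when its polarity is exactly 0, which is uncommon for long posts because the score is a floating-point average over the words the analyzer recognises. A common alternative is to treat a small band around zero as neutral. The sketch below works on a plain polarity value; the helper name `label_from_polarity` and the band width of 0.05 are assumptions chosen for illustration, not values taken from the notebook.

```python
def label_from_polarity(polarity, band=0.05):
    """Map a polarity score in [-1, 1] to a label, treating |polarity| <= band as neutral."""
    if polarity > band:
        return "positive"
    if polarity < -band:
        return "negative"
    return "neutral"

# Scores close to zero fall into the neutral band
for score in (0.30, 0.02, -0.01, -0.40):
    print(score, label_from_polarity(score))
```

With the notebook's data, this could be applied to `TextBlob(text).sentiment.polarity` in place of the exact-zero test, at the cost of having to justify the chosen band width.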


#### Examine the distribution of sentiments across different categories and summarize your findings.

In [25]:
# Examining Distribution of sentiments across different categories
sentiment_distribution = df.groupby('Labels')['Sentiment'].value_counts(normalize=True).unstack()
display(sentiment_distribution)

Labels,negative,neutral,positive
alt.atheism,0.286,NaN,0.714
comp.graphics,0.263,0.001,0.736
comp.os.ms-windows.misc,0.247,NaN,0.753
comp.sys.ibm.pc.hardware,0.254,0.002,0.744
comp.sys.mac.hardware,0.274,NaN,0.726
comp.windows.x,0.284,0.005,0.711
misc.forsale,0.231,NaN,0.769
rec.autos,0.256,0.002,0.742
rec.motorcycles,0.346,NaN,0.654
rec.sport.baseball,0.308,0.001,0.691
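
The missing neutral entries in the table above mean that no post in that category was labelled neutral. As a sketch of an alternative, `pd.crosstab` with `normalize='index'` computes the same row-wise proportions but fills absent combinations with 0. The small frame below is invented for illustration.

```python
import pandas as pd

# Invented labels and sentiments; rec.autos has no negative posts
toy = pd.DataFrame({
    "Labels": ["sci.space", "sci.space", "sci.space", "rec.autos", "rec.autos"],
    "Sentiment": ["positive", "negative", "positive", "positive", "positive"],
})

# Row-normalised contingency table; absent combinations become 0.0 rather than NaN
distribution = pd.crosstab(toy["Labels"], toy["Sentiment"], normalize="index")
print(distribution)
```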


##### Patterns from the Sentiment Distribution across different categories

1. Positive Sentiment Dominates:

    In most categories, the majority of blog posts express positive sentiment, with percentages ranging from approximately 62% to 77%. This suggests that the overall tone of the blog posts tends to be positive across various topics.

2. Negative Sentiment Variability:

    Negative Sentiment percentages vary across categories but generally fall within the range of 23% to 37%. Categories such as "talk.politics.guns" and "talk.politics.mideast" exhibit higher levels of negative sentiment compared to others. This indicates that discussions related to politics and controversial topics may tend to evoke more negative sentiments.

3. Neutral Sentiment Sparse:

    Neutral sentiment is sparse across categories: many categories show NaN (no neutral posts at all) and the rest stay well below 1%. Because a post is labelled neutral only when its TextBlob polarity is exactly 0, this sparsity largely reflects the strict threshold, so it is only weak evidence that the posts express clear positive or negative opinions.

4. Religious Categories:

    Categories such as "alt.atheism" and "soc.religion.christian" exhibit a higher proportion of positive sentiment compared to negative sentiment. This is interesting considering these categories involve discussions on religion, where opinions can be polarized.

5. Technical Categories:

    Categories related to technology, such as "comp.graphics" and "comp.sys.ibm.pc.hardware", tend to have a higher proportion of positive sentiment. This could reflect the enthusiasm or satisfaction expressed by users or enthusiasts in these domains.

Overall, the sentiment analysis reveals nuances in the emotional tones of blog posts across different categories. The dominance of positive sentiment suggests that the majority of blog authors convey positive attitudes or opinions in their posts, regardless of the topic. However, certain categories, particularly those involving politics or controversial subjects, exhibit more varied sentiment distributions, with higher levels of negative sentiment.

### 4. Evaluation

#### Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.

In [26]:
# Evaluating Performance
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

Accuracy: 0.8995
Precision: 0.8977994007830575
Recall: 0.8995
F1 Score: 0.8975849211199237
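
The weighted averages above summarise all 20 categories in one number each. To see which categories the classifier handles poorly, a per-category breakdown can help. The sketch below uses scikit-learn's `classification_report` on invented labels; with the notebook's variables, `y_test` and `y_pred` would be passed instead.

```python
from sklearn.metrics import classification_report

# Invented true and predicted labels for two classes
y_true_toy = ["a", "a", "b", "b", "b"]
y_pred_toy = ["a", "b", "b", "b", "a"]

# output_dict=True returns nested dicts instead of a formatted string
report = classification_report(y_true_toy, y_pred_toy, output_dict=True)
for label in ("a", "b"):
    print(label, report[label]["precision"], report[label]["recall"])
print("accuracy", report["accuracy"])
```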


#### Discuss the performance of the model and any challenges encountered during the classification process.

##### Performance:

1. Accuracy (89.95%):

    This high accuracy suggests that the model correctly classifies the blog posts into their respective categories nearly 90% of the time. It reflects the overall effectiveness of the classifier.

2. Precision (89.78%):

    This is the support-weighted average of the per-category precision scores: on average, about 89.78% of the posts assigned to a category actually belong to it, so a predicted category is usually reliable.

3. Recall (89.95%):

    This is the support-weighted average of the per-category recall scores: the model retrieves about 89.95% of the posts that truly belong to each category. With weighted averaging in a single-label problem, recall equals accuracy, which is why the two values match.

4. F1 Score (89.76%):

    The weighted F1 score is the support-weighted average of the per-category F1 scores, each of which is the harmonic mean of that category's precision and recall. Its closeness to the precision and recall values suggests balanced performance.
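
The reported recall and accuracy are both 0.8995. This follows from the definitions: weighting each category's recall by its support and summing gives the total number of correct predictions divided by the total number of posts. A small invented example makes this concrete.

```python
from collections import Counter

# Invented true and predicted labels for three classes
y_true = ["a", "a", "b", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c", "c"]

support = Counter(y_true)
correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

# Per-class recall = correctly predicted items of the class / items of the class
recall = {label: correct[label] / support[label] for label in support}

# Support-weighted mean of the per-class recalls
weighted_recall = sum(recall[label] * support[label] for label in support) / len(y_true)
accuracy = sum(correct.values()) / len(y_true)

print(recall)
print(weighted_recall, accuracy)
```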

##### Challenges Encountered:

1. Class Imbalance:

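    The label counts show an almost perfectly balanced dataset (1,000 posts in each of 19 categories and 997 in "soc.religion.christian"), so class imbalance is not a practical concern here. In a less balanced corpus it could degrade performance on under-represented categories.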

2. Text Preprocessing:

    Ensuring effective preprocessing (e.g. handling punctuation, stop words, and case normalization) is crucial. Inadequate preprocessing can lead to noisy data and poor feature extraction.

3. Feature Extraction:

    Choosing the right features and tuning the parameters of TF-IDF vectorization required careful consideration to balance between capturing enough information and avoiding overfitting.

4. Computational Resources:

    Handling a large dataset with nearly 20,000 entries can be computationally intensive, especially during preprocessing and training.
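
A rough estimate of what a dense copy of the TF-IDF matrix would cost shows why sparse storage matters at this scale. The row count comes from the label counts and the feature count from `max_features`; the 8 bytes per value assumes the default `float64` dtype.

```python
n_posts = 19 * 1000 + 997      # total rows from the label counts
n_features = 10_000            # max_features passed to TfidfVectorizer
bytes_per_value = 8            # float64 (assumed default dtype)

# Size of a fully dense matrix with one value per (post, feature) pair
dense_bytes = n_posts * n_features * bytes_per_value
print(n_posts, dense_bytes, round(dense_bytes / 1e9, 2))
```

That is about 1.6 GB for a dense array, whereas a sparse matrix stores only the non-zero entries.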

#### Reflect on the sentiment analysis results and their implications regarding the content of the blog posts.

##### Sentiment Analysis Results:

The sentiment analysis provided insights into the general emotional tone of blog posts across various categories.

Here are the key reflections:

1. Predominantly Positive Sentiment:

    The majority of categories show a strong positive sentiment, suggesting that most blog authors express favorable or optimistic views in their posts. This might indicate a general trend of positivity or satisfaction among the blog authors in these topics.

2. Variability in Sentiment:

    * Technical Categories: Categories related to technology (e.g. "comp.graphics", "comp.sys.ibm.pc.hardware") tend to have higher positive sentiment. This might reflect user enthusiasm or satisfaction with technological advancements and discussions.

    * Controversial Topics: Categories like "talk.politics.guns" and "talk.politics.mideast" exhibit higher negative sentiment. This aligns with the nature of these topics, which are often divisive and elicit strong negative emotions.

3. Sparse Neutral Sentiment:

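    Very few posts are labelled neutral. Since the neutral label requires a polarity of exactly 0, this mainly shows that almost every post contains at least some sentiment-bearing words, and it is only weak evidence that the posts take clear positive or negative positions.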

##### Implications:

* Content Engagement:

    The high positive sentiment could indicate that blog content is engaging and resonates well with readers, fostering a positive discourse.

* Community and Subject Matter:

    The sentiment distribution provides insights into the community dynamics and the nature of discussions within different categories. For example, technical and hobbyist communities may have more positive interactions, while political and controversial topics might have more polarized discussions.

Overall, the sentiment analysis complements the classification model by providing a deeper understanding of the emotional context of the blog posts, which can be valuable for content creators, moderators, and analysts in tailoring their approaches to different categories.