# TEXT CLASSIFICATION USING NAIVE BAYES AND SENTIMENT ANALYSIS ON BLOG POSTS

### Overview
In this assignment, you will work on the "blogs_categories.csv" dataset, which contains blog posts categorized into various themes. Your task will be to build a text classification model using the Naive Bayes algorithm to categorize the blog posts accurately. Furthermore, you will perform sentiment analysis to understand the general sentiment (positive, negative, neutral) expressed in these posts. This assignment will enhance your understanding of text classification, sentiment analysis, and the practical application of the Naive Bayes algorithm in Natural Language Processing (NLP).

### Dataset

The provided dataset, "blogs_categories.csv", consists of blog posts along with their associated categories. Each row represents a blog post with the following columns:

•	Text: The content of the blog post. Column name: Data
•	Category: The category to which the blog post belongs. Column name: Labels

### Tasks
1. Data Exploration and Preprocessing
2. Naive Bayes Model for Text Classification
3. Sentiment Analysis
4. Evaluation


In [1]:
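import nltk
nltk.download('vader_lexicon')  # lexicon used by SentimentIntensityAnalyzer
# Note: word_tokenize and stopwords (used below) also need the 'stopwords' and
# 'punkt' resources ('punkt_tab' in newer NLTK releases) if not yet installed.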

[nltk_data] Downloading package vader_lexicon to C:\Users\SHUBHAM
[nltk_data]     GARKAL\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [2]:
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from nltk.sentiment import SentimentIntensityAnalyzer
import warnings
warnings.filterwarnings("ignore")

In [3]:
# Load the dataset
df = pd.read_csv("C:\\Users\\SHUBHAM GARKAL\\Downloads\\Naive Bayes and Text Mining\\Naive Bayes and Text Mining\\blogs_categories.csv")
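
Task 1 asks for data exploration before preprocessing. A minimal sketch of the usual checks (row count, missing text, class balance) is shown here on a small stand-in DataFrame. It assumes only the `Data` and `Labels` column names described above; the `summarize` helper and the toy rows are illustrative and not part of the dataset.

```python
import pandas as pd

# Toy stand-in for blogs_categories.csv (hypothetical rows, same column names).
toy_df = pd.DataFrame({
    "Data": [
        "GPU drivers crash on boot",
        "Great hockey game last night",
        None,  # a missing post, to show how it is detected
        "Selling a used bike",
    ],
    "Labels": [
        "comp.graphics",
        "rec.sport.hockey",
        "rec.sport.hockey",
        "misc.forsale",
    ],
})

def summarize(frame):
    """Return basic exploration statistics for a Data/Labels frame."""
    return {
        "rows": len(frame),
        "missing_text": int(frame["Data"].isna().sum()),
        "class_counts": frame["Labels"].value_counts().to_dict(),
    }

summary = summarize(toy_df)
print(summary)
```

Rows with missing text would make `clean_text` fail on `.lower()`, so checking `missing_text` before cleaning is worthwhile.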

In [4]:
# Clean the text data
def clean_text(text):
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text

df['Cleaned_Text'] = df['Data'].apply(clean_text)
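
To make the effect of the cleaning step concrete, the sketch below runs the same two operations on one sample string. Note that `\w` keeps digits and underscores, and that extra whitespace is left in place. `clean_text_safe` is a hypothetical variant (not part of the notebook) that also tolerates non-string values such as NaN and collapses repeated whitespace.

```python
import re

def clean_text(text):
    """Same steps as the cell above: lowercase, then drop punctuation."""
    text = text.lower()
    return re.sub(r'[^\w\s]', '', text)

def clean_text_safe(text):
    """Hypothetical variant: tolerate non-string input and tidy whitespace."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'[^\w\s]', '', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

sample = "Hello, World!  It's 2024_test."
cleaned = clean_text(sample)
cleaned_safe = clean_text_safe(sample)
print(repr(cleaned))       # 'hello world  its 2024_test'
print(repr(cleaned_safe))  # 'hello world its 2024_test'
print(repr(clean_text_safe(float("nan"))))  # ''
```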

In [5]:
# Tokenize and remove stopwords
stop_words = set(stopwords.words('english'))

def tokenize_and_remove_stopwords(text):
    tokens = word_tokenize(text)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    return filtered_tokens

df['Tokenized_Text'] = df['Cleaned_Text'].apply(tokenize_and_remove_stopwords)

In [6]:
# Perform feature extraction using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Cleaned_Text'])
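
As a worked illustration of what the vectorizer computes, the sketch below reproduces scikit-learn's default weighting by hand on a toy corpus: smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, multiplied by the raw term count, followed by L2 normalization of each row. The toy documents and helper names are assumptions for illustration only.

```python
import math

# Toy corpus, already tokenized (hypothetical data).
toy_docs = [["cat", "sat"], ["dog", "sat"], ["cat", "cat", "ran"]]
n_docs = len(toy_docs)
vocab = sorted({term for doc in toy_docs for term in doc})

def smoothed_idf(term):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smooth_idf=True default."""
    doc_freq = sum(term in doc for doc in toy_docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(doc):
    """Raw count times idf for each vocabulary term, then L2-normalized."""
    raw = [doc.count(term) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return {term: value / norm for term, value in zip(vocab, raw)}

row = tfidf_row(toy_docs[2])
print(vocab)
print(row)
```

Terms that occur in fewer documents (here `ran`) receive a larger IDF than terms that occur in many (here `sat`), which is what lets category-specific vocabulary dominate the feature vector.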

In [7]:
# Step 2: Naive Bayes Model for Text Classification

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, df['Labels'], test_size=0.2, random_state=42)

In [8]:
# Train the Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

In [9]:
# Make predictions on the test set
y_pred = nb_classifier.predict(X_test)
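
In the cells above, the TF-IDF vectorizer is fitted on the full dataset before the train/test split, so vocabulary and IDF statistics from the test posts influence the features. A common alternative, sketched here under the assumption that raw text is split first, is to wrap the vectorizer and classifier in a scikit-learn pipeline so that the vectorizer is fitted on training text only. The toy texts and labels are hypothetical, and this variant would not necessarily reproduce the accuracy reported below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data; in the notebook this would be the training
# portion of df['Cleaned_Text'] and df['Labels'].
train_texts = [
    "the team won the hockey game",
    "hockey players skate on ice",
    "the gpu renders graphics fast",
    "graphics card and rendering software",
]
train_labels = ["sport", "sport", "graphics", "graphics"]

# fit() learns the vocabulary and IDF weights from train_texts only;
# predict() reuses them unchanged for unseen text.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["hockey game on ice", "graphics rendering"]).tolist()
print(predictions)
```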

In [10]:
# Step 3: Sentiment Analysis

# Initialize SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [11]:
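# Analyze sentiments in the blog posts using VADER's compound score (-1 to 1).
# Note: VADER treats punctuation and capitalization as intensity cues, which
# clean_text removes; scoring the raw df['Data'] column is an alternative.
df['Sentiment'] = df['Cleaned_Text'].apply(lambda x: sia.polarity_scores(x)['compound'])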

In [12]:
# Categorize sentiments as positive, negative, or neutral
def categorize_sentiment(score):
    if score > 0:
        return 'Positive'
    elif score < 0:
        return 'Negative'
    else:
        return 'Neutral'

df['Sentiment_Category'] = df['Sentiment'].apply(categorize_sentiment)
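
The function above labels a post Neutral only when the compound score is exactly 0, which may be one reason the Neutral counts in the output further down are small. The VADER documentation suggests a band around zero instead: positive at `compound >= 0.05`, negative at `compound <= -0.05`, and neutral in between. A minimal sketch of that variant follows; the function name and the `threshold` parameter are illustrative, not part of the notebook.

```python
def categorize_sentiment_banded(score, threshold=0.05):
    """Label a VADER compound score using a neutral band around zero."""
    if score >= threshold:
        return 'Positive'
    if score <= -threshold:
        return 'Negative'
    return 'Neutral'

for example_score in (0.62, 0.02, 0.0, -0.03, -0.4):
    print(example_score, categorize_sentiment_banded(example_score))
```

Applying it would mirror the existing cell, for example `df['Sentiment'].apply(categorize_sentiment_banded)`, and would change the distribution shown later.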

In [13]:
# Step 4: Evaluation

# Evaluate the performance of the Naive Bayes classifier
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

In [14]:
# Print accuracy and classification report
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_rep)

Accuracy: 0.88075
Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.66      0.80      0.73       173
           comp.graphics       0.87      0.89      0.88       179
 comp.os.ms-windows.misc       0.94      0.84      0.88       226
comp.sys.ibm.pc.hardware       0.82      0.85      0.84       204
   comp.sys.mac.hardware       0.89      0.94      0.92       205
          comp.windows.x       0.98      0.92      0.95       186
            misc.forsale       0.92      0.70      0.80       190
               rec.autos       0.89      0.94      0.91       203
         rec.motorcycles       1.00      0.94      0.97       218
      rec.sport.baseball       0.99      0.98      0.99       192
        rec.sport.hockey       0.98      0.99      0.98       203
               sci.crypt       0.83      0.98      0.90       200
         sci.electronics       0.95      0.86      0.91       227
                 sci.med       1.0
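
The report shows uneven per-class results, for example a precision of 0.66 for alt.atheism and a recall of 0.70 for misc.forsale, but it does not show which categories are being confused with each other. A confusion matrix can answer that. The sketch below uses `sklearn.metrics.confusion_matrix` on a few made-up labels; in the notebook, `y_test` and `y_pred` would take the place of the toy lists.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
labels = ["alt.atheism", "misc.forsale", "sci.crypt"]
toy_true = ["alt.atheism", "alt.atheism", "misc.forsale", "misc.forsale", "sci.crypt"]
toy_pred = ["alt.atheism", "sci.crypt", "misc.forsale", "misc.forsale", "sci.crypt"]

# Rows are true classes and columns are predicted classes, in `labels` order.
cm = confusion_matrix(toy_true, toy_pred, labels=labels)
print(cm)
```

Off-diagonal entries mark the errors: here one alt.atheism post was predicted as sci.crypt.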

In [15]:
# Reflect on sentiment analysis results
sentiment_distribution = df.groupby(['Labels', 'Sentiment_Category']).size()
print("Sentiment Distribution Across Different Categories:")
print(sentiment_distribution)

Sentiment Distribution Across Different Categories:
Labels                    Sentiment_Category
alt.atheism               Negative              392
                          Neutral                10
                          Positive              598
comp.graphics             Negative              131
                          Neutral                43
                          Positive              826
comp.os.ms-windows.misc   Negative              204
                          Neutral                40
                          Positive              756
comp.sys.ibm.pc.hardware  Negative              214
                          Neutral                22
                          Positive              764
comp.sys.mac.hardware     Negative              255
                          Neutral                49
                          Positive              696
comp.windows.x            Negative              229
                          Neutral                39
                   