# TEXT CLASSIFICATION USING NAIVE BAYES AND SENTIMENT ANALYSIS ON BLOG POSTS

## Dataset
The provided dataset, "blogs_categories.csv", consists of blog posts along with their associated categories. Each row represents a blog post with the following columns:
- Text: the content of the blog post (column name: `Data`).
- Category: the category to which the blog post belongs (column name: `Labels`).


In [59]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Tasks

## 1. Data Exploration and Preprocessing

In [47]:
df = pd.read_csv('blogs_categories.csv',encoding = "latin-1")

In [48]:
df

,Unnamed: 0,Data,Labels
0,0,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49...,alt.atheism
1,1,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
2,2,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
3,3,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
4,4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
...,...,...,...
19992,19992,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:54...,talk.religion.misc
19993,19993,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:54...,talk.religion.misc
19994,19994,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc
19995,19995,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc


In [49]:
df.columns

Index(['Unnamed: 0', 'Data', 'Labels'], dtype='object')

In [50]:
# Drop the leftover index column; `columns=` already implies axis=1
df = df.drop(columns=['Unnamed: 0'])
df

,Data,Labels
0,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49...,alt.atheism
1,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
2,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
3,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
...,...,...
19992,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:54...,talk.religion.misc
19993,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:54...,talk.religion.misc
19994,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc
19995,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc


## Check data types

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19997 entries, 0 to 19996
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Data    19997 non-null  object
 1   Labels  19997 non-null  object
dtypes: object(2)
memory usage: 312.6+ KB


## Summary statistics

In [52]:
df.describe()

,Data,Labels
count,19997,19997
unique,19466,20
top,Xref: cantaloupe.srv.cs.cmu.edu talk.politics....,alt.atheism
freq,4,1000


## Check for null values

In [53]:
df.isnull().sum()

Data      0
Labels    0
dtype: int64

## Check for duplicates

In [55]:
df[df.duplicated()]

,Data,Labels
4095,Xref: cantaloupe.srv.cs.cmu.edu comp.sys.mac.h...,comp.sys.mac.hardware
14618,Xref: cantaloupe.srv.cs.cmu.edu sci.physics:51...,sci.space
14640,Xref: cantaloupe.srv.cs.cmu.edu sci.physics:52...,sci.space
14646,Xref: cantaloupe.srv.cs.cmu.edu sci.physics:52...,sci.space
18283,Xref: cantaloupe.srv.cs.cmu.edu alt.journalism...,talk.politics.misc


In [57]:
df = df.drop_duplicates()
df

,Data,Labels
0,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49...,alt.atheism
1,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
2,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
3,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
...,...,...
19992,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:54...,talk.religion.misc
19993,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:54...,talk.religion.misc
19994,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc
19995,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc
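The summary statistics report 19,466 unique texts among 19,997 rows, yet `drop_duplicates()` only removes rows where both the text and the label repeat. A minimal sketch, on a hypothetical toy frame (the values below are illustrative, not taken from the dataset), of how one might separate exact duplicates from texts that recur under different labels, such as cross-posted messages:

```python
import pandas as pd

# Hypothetical toy data: one exact duplicate row and one text filed under two labels
toy = pd.DataFrame({
    "Data": ["post a", "post a", "post b", "post b", "post c"],
    "Labels": ["sci.space", "sci.space", "sci.space", "sci.physics", "sci.med"],
})

exact_dupes = toy.duplicated().sum()               # same text AND same label
text_dupes = toy.duplicated(subset="Data").sum()   # same text, any label

deduped = toy.drop_duplicates()
# Texts that still appear more than once after exact de-duplication
cross_posted = deduped[deduped.duplicated(subset="Data", keep=False)]

print(exact_dupes, text_dupes)
print(cross_posted)
```

Whether such cross-labelled texts should be kept is a modelling decision; they give the classifier identical features with conflicting targets.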



## Preprocess the text data

In [40]:
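import string
import nltk
from nltk.corpus import stopwords

# The stopword list must be downloaded once before stopwords.words() can be used
nltk.download('stopwords', quiet=True)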

In [58]:
# Build the stopword set once instead of rebuilding it for every document
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    """Strip punctuation, lowercase the text, and drop English stopwords."""
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    # Tokenize on whitespace
    tokens = text.split()
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Join tokens back into a single string
    return ' '.join(tokens)
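To see what these steps do to a single header line, here is a self-contained sketch of the same pipeline. It assumes a tiny hand-written stopword set (`STOP_WORDS_DEMO`, hypothetical) in place of the NLTK list so that it runs without any downloads:

```python
import string

# Hypothetical miniature stopword list standing in for NLTK's English list
STOP_WORDS_DEMO = {"the", "is", "a", "to", "of"}

def preprocess_sketch(text, stop_words=STOP_WORDS_DEMO):
    # Delete every punctuation character, then lowercase and split on whitespace
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()
    # Keep only tokens that are not stopwords
    return " ".join(t for t in tokens if t not in stop_words)

header = "Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960"
sentence = "The path to the server is long."

print(preprocess_sketch(header))    # xref cantaloupesrvcscmuedu altatheism49960
print(preprocess_sketch(sentence))  # path server long
```

Because punctuation is deleted rather than replaced by a space, dotted names such as host names and newsgroup identifiers fuse into single tokens. Replacing punctuation with spaces would be an alternative if those parts should be kept separate.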

In [60]:
# Apply the preprocessing function to the Data column
df['Cleaned_Data'] = df['Data'].apply(preprocess_text)

In [61]:
df.head()

,Data,Labels,Cleaned_Data
0,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49...,alt.atheism,xref cantaloupesrvcscmuedu altatheism49960 alt...
1,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism,xref cantaloupesrvcscmuedu altatheism51060 alt...
2,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,newsgroups altatheism path cantaloupesrvcscmue...
3,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism,xref cantaloupesrvcscmuedu altatheism51120 alt...
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism,xref cantaloupesrvcscmuedu altatheism51121 soc...


## Feature extraction using TF-IDF

In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [63]:
# Initialize the TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=5000) 

In [64]:
# Fit and transform the Cleaned_Data
X = tfidf.fit_transform(df['Cleaned_Data'])

In [65]:
y = df['Labels']


## 2. Naive Bayes Model for Text Classification


In [66]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
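In the cells above, the vectorizer is fitted on all rows and the split happens afterwards, so the vocabulary and IDF weights also reflect the test documents. One common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below uses a hypothetical toy corpus (`toy_texts`, `toy_labels`) and is not the notebook's actual workflow:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for df['Cleaned_Data'] and df['Labels']
toy_texts = [
    "rocket launch orbit satellite",
    "orbit moon mission rocket",
    "satellite orbit telescope launch",
    "moon mission astronaut orbit",
    "goal hockey puck skate",
    "puck goalie hockey rink",
    "skate rink hockey goal",
    "goalie puck skate hockey",
]
toy_labels = ["sci.space"] * 4 + ["rec.sport.hockey"] * 4

# Split the raw text first, keeping class proportions equal in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier
model = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
model.fit(train_texts, train_labels)

toy_pred = model.predict(test_texts)
vocab = model.named_steps["tfidfvectorizer"].vocabulary_
print(len(vocab), list(toy_pred))
```

With this arrangement, `model.predict` applies the already fitted vocabulary to unseen text, so words that occur only in the test set are ignored.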

In [67]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


In [70]:
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

In [72]:
y_pred = nb_classifier.predict(X_test)
y_pred

array(['rec.motorcycles', 'misc.forsale', 'comp.graphics', ...,
       'comp.sys.mac.hardware', 'comp.windows.x', 'rec.sport.hockey'],
      dtype='<U24')

## 3. Sentiment Analysis

In [73]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [75]:
# Initialize the sentiment intensity analyzer
sid = SentimentIntensityAnalyzer()


In [76]:
# Define a function to get sentiment
def get_sentiment(text):
    scores = sid.polarity_scores(text)
    compound = scores['compound']
    if compound >= 0.05:
        return 'positive'
    elif compound <= -0.05:
        return 'negative'
    else:
        return 'neutral'
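The ±0.05 cut-offs on the compound score are the thresholds conventionally used with VADER. Separating the thresholding from the scoring makes the boundary behaviour easy to check without loading the lexicon. A minimal sketch, where `label_from_compound` is a hypothetical helper and not part of NLTK:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores exactly on a boundary fall into the positive or negative class
for score in (0.6, 0.05, 0.049, 0.0, -0.05, -0.8):
    print(score, label_from_compound(score))
```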


In [77]:
# Apply the sentiment function to the Cleaned_Data column
df['Sentiment'] = df['Cleaned_Data'].apply(get_sentiment)

In [78]:
# Display the sentiment distribution
print(df['Sentiment'].value_counts())

Sentiment
positive    13620
negative     5613
neutral       759
Name: count, dtype: int64


In [79]:
# Group by Labels and Sentiment to see the distribution
sentiment_distribution = df.groupby(['Labels', 'Sentiment']).size().unstack(fill_value=0)
print(sentiment_distribution)

Sentiment                 negative  neutral  positive
Labels                                               
alt.atheism                    347       19       634
comp.graphics                   88       58       854
comp.os.ms-windows.misc        152       52       796
comp.sys.ibm.pc.hardware       171       43       786
comp.sys.mac.hardware          192       55       752
comp.windows.x                 199       52       749
misc.forsale                   126       76       798
rec.autos                      275       47       678
rec.motorcycles                286       35       679
rec.sport.baseball             201       55       744
rec.sport.hockey               245       31       724
sci.crypt                      256       29       715
sci.electronics                141       45       814
sci.med                        307       44       649
sci.space                      239       33       725
soc.religion.christian         231        6       760
talk.politics.guns          
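Raw counts are harder to compare across categories than proportions, because duplicate removal leaves the categories with slightly different sizes. A small sketch of row-normalising the table, using three rows copied from the output above (on the full data, the same `div` call could be applied to `sentiment_distribution`):

```python
import pandas as pd

# Three rows copied from the sentiment table above
counts = pd.DataFrame(
    {
        "negative": [347, 88, 126],
        "neutral": [19, 58, 76],
        "positive": [634, 854, 798],
    },
    index=["alt.atheism", "comp.graphics", "misc.forsale"],
)

# Divide each row by its total so every row sums to 1
shares = counts.div(counts.sum(axis=1), axis=0).round(3)
print(shares.sort_values("negative", ascending=False))
```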

## 4. Evaluation

In [80]:
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print evaluation metrics
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

Accuracy: 0.8712178044511127
Precision: 0.8714816682188422
Recall: 0.8712178044511127
F1 Score: 0.8688900471166117
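Recall and accuracy are identical in the output above because support-weighted recall is mathematically equal to accuracy. The sketch below, on hypothetical toy labels, shows how `average='weighted'` differs from `average='macro'` and confirms that identity:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical toy labels: class "a" has 4 true samples, class "b" has 2
y_true_toy = ["a", "a", "a", "a", "b", "b"]
y_pred_toy = ["a", "a", "a", "b", "b", "a"]

# Per-class precision: "a" -> 3 of 4 predictions correct, "b" -> 1 of 2
per_class = precision_score(y_true_toy, y_pred_toy, average=None, labels=["a", "b"])

# Macro: plain mean of per-class scores; weighted: mean weighted by class support
macro_p = precision_score(y_true_toy, y_pred_toy, average="macro")
weighted_p = precision_score(y_true_toy, y_pred_toy, average="weighted")

weighted_r = recall_score(y_true_toy, y_pred_toy, average="weighted")
acc = accuracy_score(y_true_toy, y_pred_toy)

print(per_class, macro_p, weighted_p)
print(weighted_r, acc)
```

With nearly balanced classes, as in this dataset, macro and weighted averages will be close. Per-class scores (for example from `classification_report`) show which categories are confused with one another.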


### Conclusion
In this assignment, we implemented a text classification model using the Naive Bayes algorithm and performed sentiment analysis on a dataset of blog posts. We explored and cleaned the data, preprocessed the text, extracted features with TF-IDF vectorization, trained a Multinomial Naive Bayes classifier, applied VADER sentiment analysis to each post, and evaluated the classifier, which reached about 87% accuracy, precision, and recall on the held-out test set.
