## Overview
    In this assignment, you will work on the "blogs_categories.csv" dataset, which contains blog posts categorized into various themes. Your task will be to build a text classification model using the Naive Bayes algorithm to categorize the blog posts accurately. Furthermore, you will perform sentiment analysis to understand the general sentiment (positive, negative, neutral) expressed in these posts. This assignment will enhance your understanding of text classification, sentiment analysis, and the practical application of the Naive Bayes algorithm in Natural Language Processing (NLP).
## Dataset
    The provided dataset, "blogs_categories.csv", consists of blog posts along with their associated categories. Each row represents a blog post with the following columns:
     •	Text: The content of the blog post. Column name: Data
     •	Category: The category to which the blog post belongs. Column name: Labels


## Tasks
## 1. Data Exploration and Preprocessing
          •	Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.
          •	Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.
          •	Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.


In [1]:
!pip install spacy



In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize,sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
import re
import spacy
nlp= spacy.load('en_core_web_sm')

In [2]:
df=pd.read_csv('blogs.csv')
df.head()

Unnamed: 0,Data,Labels
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism


In [3]:
df.isnull().sum()

Data      0
Labels    0
dtype: int64

In [4]:
df.duplicated().sum()

np.int64(0)
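
The checks above cover missing values and duplicates. The exploratory step can also look at class balance and document length. A minimal sketch, assuming the `Data` and `Labels` columns shown above; the `summarize` helper and the toy frame are hypothetical stand-ins for `df`:

```python
import pandas as pd

def summarize(frame, text_col="Data", label_col="Labels"):
    """Return per-category post counts and word-count statistics."""
    # Approximate document length as the number of whitespace-separated words.
    lengths = frame[text_col].str.split().str.len()
    return (
        frame.assign(n_words=lengths)
        .groupby(label_col)["n_words"]
        .agg(["count", "mean", "min", "max"])
    )

# Toy data standing in for the real dataset (illustrative only).
toy = pd.DataFrame({
    "Data": ["god does not exist", "the shuttle launch was delayed again", "orbit"],
    "Labels": ["alt.atheism", "sci.space", "sci.space"],
})
summary = summarize(toy)
print(summary)
```

On the real data, `summarize(df)` would show whether the categories are balanced and whether any category has unusually short or long posts.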

In [50]:
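def data_clean(text):
    # Keep only runs of word characters and rejoin them with single spaces
    text1 = ' '.join(re.findall(r'\w+', text))
    doc = nlp(text1)
    # Drop stop words, punctuation, digits, brackets and currency symbols
    clean_data = [token for token in doc if not token.is_stop and not token.is_punct
                  and not token.is_digit and not token.is_bracket and not token.is_currency]
    # Returns spaCy Token objects; use token.text if plain strings are needed
    return clean_data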

In [8]:
count= CountVectorizer(analyzer=data_clean)

In [9]:
x= count.fit_transform(df['Data'])

In [10]:
tfidf= TfidfTransformer()

In [49]:
y= tfidf.fit_transform(x)

In [19]:
x.shape

(2000, 459225)

In [51]:
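# Build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase and strip ASCII punctuation
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize, drop English stop words, and rejoin into one string
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)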

In [52]:
df['Data']=df['Data'].apply(preprocess_text)

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [55]:
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(df['Data'])

## 2. Naive Bayes Model for Text Classification
       • Split the data into training and test sets.
       • Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.
       • Train the model on the training set and make predictions on the test set.


In [56]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB,GaussianNB,BernoulliNB
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score

In [57]:
features

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 268023 stored elements and shape (2000, 56432)>

In [58]:
target=df['Labels']

In [60]:
x_train,x_test,y_train,y_test=train_test_split(features,target,train_size=0.80,random_state=100)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(1600, 56432)
(400, 56432)
(1600,)
(400,)


In [61]:
multi=MultinomialNB()

In [62]:
multi.fit(x_train,y_train)
y_pred= multi.predict(x_test)
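
In the cells above the TF-IDF vocabulary is fitted on all 2000 posts before the split, so the test posts influence the IDF weights. One possible alternative, sketched under the assumption that the raw text is split first, wraps the vectorizer and the classifier in a scikit-learn `Pipeline` so that the vectorizer is fitted on the training posts only. The toy texts, labels, and variable names are illustrative and not from the dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for df['Data'] and df['Labels'] (illustrative only).
texts = [
    "the shuttle reached orbit", "nasa delayed the rocket launch",
    "the telescope observed a distant galaxy", "astronauts repaired the space station",
    "the church held a prayer service", "faith and belief were discussed",
    "the sermon quoted scripture", "the bible study group met today",
]
labels = ["sci.space"] * 4 + ["talk.religion.misc"] * 4

# Split the raw text first, keeping the class proportions in both parts.
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, train_size=0.75, random_state=100, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains Naive Bayes.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])
model.fit(text_train, label_train)
predictions = model.predict(text_test)
print(predictions)
```

With this layout, `model.predict` applies the training-time vocabulary to unseen text automatically, so no separate `transform` call is needed.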

## 3. Sentiment Analysis
      • Choose a suitable library or method for performing sentiment analysis on the blog post texts.
      • Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.
      • Examine the distribution of sentiments across different categories and summarize your findings.


In [27]:
!pip install textblob

Collecting textblob
  Downloading textblob-0.19.0-py3-none-any.whl.metadata (4.4 kB)
Downloading textblob-0.19.0-py3-none-any.whl (624 kB)
Installing collected packages: textblob
Successfully installed textblob-0.19.0


In [30]:
from textblob import TextBlob

In [31]:
def sent_count(text):
    # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive)
    sent1 = TextBlob(text)
    return sent1.sentiment.polarity

In [32]:
df['Blob_sent']= df['Data'].apply(sent_count)

In [33]:
df.head()

Unnamed: 0,Data,Labels,Blob_sent
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,0.072213
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,-0.053757
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,0.093119
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,0.055008
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism,0.132183


In [36]:
df[df['Blob_sent']<0]

Unnamed: 0,Data,Labels,Blob_sent
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,-0.053757
7,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,-0.010606
14,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,-0.041667
30,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,-0.145556
35,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,-0.176282
...,...,...,...
1987,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc,-0.150000
1989,Xref: cantaloupe.srv.cs.cmu.edu alt.conspiracy...,talk.religion.misc,-0.070296
1990,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,talk.religion.misc,-0.031548
1991,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,talk.religion.misc,-0.060193


In [38]:
df[df['Blob_sent']>0]

Unnamed: 0,Data,Labels,Blob_sent
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,0.072213
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,0.093119
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,0.055008
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism,0.132183
5,Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...,alt.atheism,0.066416
...,...,...,...
1995,Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...,talk.religion.misc,0.024232
1996,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc,0.153333
1997,Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...,talk.religion.misc,0.120370
1998,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc,0.111306
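
The task also asks for a positive, negative, or neutral label per post and for the distribution of those labels across categories. A minimal sketch, assuming polarity scores like those in `Blob_sent`; the `label_sentiment` helper, the zero threshold for "neutral", and the toy frame are assumptions for illustration:

```python
import pandas as pd

def label_sentiment(polarity, threshold=0.0):
    """Map a polarity score to a sentiment label (threshold is an assumption)."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Toy frame standing in for df with its Blob_sent column (illustrative only).
toy = pd.DataFrame({
    "Labels": ["alt.atheism", "alt.atheism", "sci.space", "sci.space"],
    "Blob_sent": [0.072213, -0.053757, 0.0, 0.25],
})
toy["Sentiment"] = toy["Blob_sent"].apply(label_sentiment)

# Rows are categories, columns are sentiment labels, cells are post counts.
distribution = pd.crosstab(toy["Labels"], toy["Sentiment"])
print(distribution)
```

A wider threshold (for example 0.05) would treat weakly polar posts as neutral; which value is appropriate depends on the data.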


## 4. Evaluation
    • Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.
    • Discuss the performance of the model and any challenges encountered during the classification process.
    • Reflect on the sentiment analysis results and their implications regarding the content of the blog posts.


In [66]:
print(accuracy_score(y_test,y_pred))
print(precision_score(y_test,y_pred,average='weighted'))
print(recall_score(y_test,y_pred,average='weighted'))
print(f1_score(y_test,y_pred,average='weighted'))

0.8275
0.8486277788795236
0.8275
0.8260473425084313
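
Weighted averages can hide weak categories. A per-class breakdown is one way to see which categories are confused with each other. The sketch below uses scikit-learn's `classification_report` and `confusion_matrix` on toy labels that stand in for `y_test` and `y_pred`:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy true and predicted labels (illustrative only).
true_toy = ["alt.atheism", "alt.atheism", "sci.space", "sci.space",
            "talk.religion.misc", "talk.religion.misc"]
pred_toy = ["alt.atheism", "talk.religion.misc", "sci.space", "sci.space",
            "talk.religion.misc", "alt.atheism"]

classes = sorted(set(true_toy))

# Precision, recall and F1 for each category.
report = classification_report(true_toy, pred_toy, labels=classes, zero_division=0)
print(report)

# Rows are true categories, columns are predicted categories.
matrix = confusion_matrix(true_toy, pred_toy, labels=classes)
print(matrix)
```

Off-diagonal cells of the matrix show which pairs of categories the classifier mixes up.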


#### The model performed well, reaching about 83% accuracy on the test set (weighted precision 0.85, recall 0.83, F1-score 0.83). The main challenge was the cleaning step, since the raw text contains many stop words and symbols.

#### I performed the sentiment analysis using the TextBlob package. It analyzes each text and returns a polarity score between -1 (negative) and 1 (positive), which can be used to classify the text. From the results we infer that the blog data contains both positive and negative texts.