In [None]:
'''
Overview
In this assignment, you will work on the "blogs_categories.csv" dataset, which contains blog posts categorized into various themes. Your task will be to 
build a text classification model using the Naive Bayes algorithm to categorize the blog posts accurately. Furthermore, you will perform sentiment 
analysis to understand the general sentiment (positive, negative, neutral) expressed in these posts. This assignment will enhance your understanding of 
text classification, sentiment analysis, and the practical application of the Naive Bayes algorithm in Natural Language Processing (NLP).
Dataset
The provided dataset, "blogs_categories.csv", consists of blog posts along with their associated categories. Each row represents a blog post with the 
following columns:
•	Text: The content of the blog post. Column name: Data
•	Category: The category to which the blog post belongs. Column name: Labels
Tasks
1. Data Exploration and Preprocessing
•	Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.
•	Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.
•	Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.
2. Naive Bayes Model for Text Classification
•	Split the data into training and test sets.
•	Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.
•	Train the model on the training set and make predictions on the test set.
3. Sentiment Analysis
•	Choose a suitable library or method for performing sentiment analysis on the blog post texts.
•	Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.
•	Examine the distribution of sentiments across different categories and summarize your findings.
4. Evaluation
•	Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.
•	Discuss the performance of the model and any challenges encountered during the classification process.
•	Reflect on the sentiment analysis results and their implications regarding the content of the blog posts.
Submission Guidelines
•	Your submission should include a comprehensive report and the complete codebase.
•	Your code should be well-documented and include comments explaining the major steps.
Evaluation Criteria
•	Correct implementation of data preprocessing and feature extraction.
•	Accuracy and robustness of the Naive Bayes classification model.
•	Depth and insightfulness of the sentiment analysis.
•	Clarity and thoroughness of the evaluation and discussion sections.
•	Overall quality and organization of the report and code.
Good luck, and we look forward to your insightful analysis of the blog posts dataset!
'''

In [None]:
'''
1. Data Exploration and Preprocessing
•	Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.
•	Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.
•	Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.
'''

In [1]:
import pandas as pd
df = pd.read_csv("C:\\Users\\sujey\\Downloads\\Assignments\\Bayes_Class and NLP\\blogs.csv")
df

Unnamed: 0,Data,Labels
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism
...,...,...
1995,Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...,talk.religion.misc
1996,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc
1997,Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...,talk.religion.misc
1998,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc


In [5]:
#perform an exploratory data analysis to understand its structure and content
print("Data types:")
print(df.dtypes)
print("Null values:")
print(df.isnull().sum())
df["Labels"].value_counts()

Data types:
Data      object
Labels    object
dtype: object
Null values:
Data      0
Labels    0
dtype: int64


Labels
alt.atheism                 100
comp.graphics               100
talk.politics.misc          100
talk.politics.mideast       100
talk.politics.guns          100
soc.religion.christian      100
sci.space                   100
sci.med                     100
sci.electronics             100
sci.crypt                   100
rec.sport.hockey            100
rec.sport.baseball          100
rec.motorcycles             100
rec.autos                   100
misc.forsale                100
comp.windows.x              100
comp.sys.mac.hardware       100
comp.sys.ibm.pc.hardware    100
comp.os.ms-windows.misc     100
talk.religion.misc          100
Name: count, dtype: int64

In [7]:
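import re
import nltk
from nltk.corpus import stopwords

# Fetch the stopword list once (no-op if it is already installed)
nltk.download('stopwords', quiet=True)
# Build the stopword set once instead of on every call to preprocess_text
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase and delete punctuation. Punctuation is removed rather than
    # replaced by a space, so dotted names such as host addresses merge into one token.
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Tokenize using regex
    tokens = re.findall(r'\b\w+\b', text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

df['Cleaned_Data'] = df['Data'].apply(preprocess_text)
display(df.head())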

Unnamed: 0,Data,Labels,Cleaned_Data
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,newsgroups altatheism path cantaloupesrvcscmue...
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,path cantaloupesrvcscmuedudasnewsharvardedunoc...
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism,xref cantaloupesrvcscmuedu altatheism53485 tal...
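
The Cleaned_Data column above shows dotted host names collapsed into single tokens (for example cantaloupesrvcscmuedu), because punctuation is deleted rather than replaced by a space. A minimal sketch of one possible alternative follows. The function name preprocess_keep_boundaries and the small STOP_WORDS set are hypothetical stand-ins, used only so the example runs without NLTK; this is not the method used in the cells above.

```python
import re

# Small illustrative stopword list (hypothetical; the notebook uses NLTK's English list)
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}

def preprocess_keep_boundaries(text):
    """Lowercase, turn punctuation into spaces, tokenize, and drop stopwords."""
    # Replacing punctuation with a space keeps 'cs.cmu.edu' as three separate tokens
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = re.findall(r"\b\w+\b", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS)

sample = "Path: cantaloupe.srv.cs.cmu.edu is the host."
cleaned = preprocess_keep_boundaries(sample)
print(cleaned)  # path cantaloupe srv cs cmu edu host
```

Whether separate tokens help the classifier depends on the data, so it is worth comparing both variants on the test set before choosing one.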


In [8]:
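from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer, keeping the 5000 most frequent terms (adjust max_features as needed)
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Fit and transform the cleaned data.
# Note: fitting on every row before the train/test split lets test-set vocabulary and IDF
# statistics influence the features; for a stricter evaluation, fit on the training split only.
tfidf_features = tfidf_vectorizer.fit_transform(df['Cleaned_Data'])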

In [10]:
#Split the data into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(tfidf_features, df['Labels'], test_size=0.2, random_state=42)

In [11]:
from sklearn.naive_bayes import MultinomialNB

# Initialize the Naive Bayes classifier
nb_model = MultinomialNB()

# Train the classifier on the training data
nb_model.fit(X_train, Y_train)
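
The cells above fit the TF-IDF vectorizer on every row before splitting. One possible alternative, sketched here under assumptions, is to split the raw texts first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that the vocabulary and IDF weights come from the training texts only. The toy texts and labels are hypothetical stand-ins for df['Cleaned_Data'] and df['Labels'], and the stratify argument is an addition that the notebook does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for df['Cleaned_Data'] and df['Labels'] (hypothetical data)
texts = [
    "rocket launch orbit nasa", "shuttle orbit moon mission",
    "satellite launch space station", "moon mission nasa rocket",
    "hockey goal playoff team", "goalie team season playoff",
    "hockey season goal score", "team score goalie hockey",
]
labels = ["sci.space"] * 4 + ["rec.sport.hockey"] * 4

# stratify keeps the class proportions the same in the training and test splits
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then reuses that
# vocabulary when transforming the test texts inside predict()
pipeline = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
pipeline.fit(train_texts, y_train)
predictions = pipeline.predict(test_texts)
print(list(zip(y_test, predictions)))
```

With the real dataset, the same pattern would take df['Cleaned_Data'] and df['Labels'] as inputs; the resulting scores may differ from the ones reported in this notebook.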

In [14]:
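import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Fetch the VADER lexicon once (no-op if it is already installed)
nltk.download('vader_lexicon', quiet=True)

# Initialize the VADER sentiment intensity analyzer
analyzer = SentimentIntensityAnalyzer()

# Function to get the compound sentiment score, which ranges from -1 (most negative) to +1 (most positive)
def get_sentiment_score(text):
    sentiment = analyzer.polarity_scores(text)
    return sentiment['compound']

# Apply the sentiment analysis to the 'Cleaned_Data' column.
# Note: VADER also uses punctuation, capitalization, and negation words as cues, all of which
# the cleaning step removes, so scoring the raw 'Data' column may give different results.
df['Sentiment_Score'] = df['Cleaned_Data'].apply(get_sentiment_score)

# Display the first few rows with the new sentiment scores
df[['Cleaned_Data', 'Sentiment_Score']].head()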

Unnamed: 0,Cleaned_Data,Sentiment_Score
0,path cantaloupesrvcscmuedumagnesiumclubcccmued...,-0.9896
1,newsgroups altatheism path cantaloupesrvcscmue...,0.9251
2,path cantaloupesrvcscmuedudasnewsharvardedunoc...,-0.994
3,path cantaloupesrvcscmuedumagnesiumclubcccmued...,-0.9996
4,xref cantaloupesrvcscmuedu altatheism53485 tal...,0.989


In [15]:
# Function to categorize sentiment based on the compound score
def categorize_sentiment(score):
    if score >= 0.05:
        return 'Positive'
    elif score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Apply the categorization function to create a new 'Sentiment_Category' column
df['Sentiment_Category'] = df['Sentiment_Score'].apply(categorize_sentiment)

# Display the first few rows with the new sentiment category
df[['Cleaned_Data', 'Sentiment_Score', 'Sentiment_Category']].head()

Unnamed: 0,Cleaned_Data,Sentiment_Score,Sentiment_Category
0,path cantaloupesrvcscmuedumagnesiumclubcccmued...,-0.9896,Negative
1,newsgroups altatheism path cantaloupesrvcscmue...,0.9251,Positive
2,path cantaloupesrvcscmuedudasnewsharvardedunoc...,-0.994,Negative
3,path cantaloupesrvcscmuedumagnesiumclubcccmued...,-0.9996,Negative
4,xref cantaloupesrvcscmuedu altatheism53485 tal...,0.989,Positive
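
The task also asks for the distribution of sentiments across categories. A minimal sketch of one way to tabulate it with pandas.crosstab follows; the demo frame holds made-up values (hypothetical data), and with the notebook's data the same calls would take df['Labels'] and df['Sentiment_Category'].

```python
import pandas as pd

# Toy frame with the same column names as the notebook (hypothetical values)
demo = pd.DataFrame({
    "Labels": ["sci.space", "sci.space", "talk.politics.guns",
               "talk.politics.guns", "rec.autos"],
    "Sentiment_Category": ["Positive", "Neutral", "Negative",
                           "Negative", "Positive"],
})

# Raw counts of each sentiment per category
counts = pd.crosstab(demo["Labels"], demo["Sentiment_Category"])

# Row-normalized shares, so categories of different sizes are comparable
shares = pd.crosstab(
    demo["Labels"], demo["Sentiment_Category"], normalize="index"
).round(2)

print(counts)
print(shares)
```

The row-normalized table makes it easy to see which categories lean negative or positive, which is the basis for the summary the task asks for.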


In [16]:
# Predict the category of each post in the held-out test set
Y_pred_test = nb_model.predict(X_test)

In [20]:
# Evaluate the Naive Bayes classifier with accuracy, precision, recall, and F1-score.
# Precision, recall, and F1 are averaged across classes, weighted by class support.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

test_score = accuracy_score(Y_test, Y_pred_test)
prec_score = precision_score(Y_test, Y_pred_test, average="weighted")
recall = recall_score(Y_test, Y_pred_test, average="weighted")
f1 = f1_score(Y_test, Y_pred_test, average="weighted")

print("Testing accuracy:", round(test_score, 2))
print("Precision score:", round(prec_score, 2))
print("Recall score:", round(recall, 2))
print("F1 score:", round(f1, 2))

Testing accuracy: 0.82
Precision score: 0.83
Recall score: 0.82
F1 score: 0.82
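
The weighted averages above summarize all 20 classes in one number each. To discuss which categories are confused with each other, a per-class report and a confusion matrix can be useful. The sketch below uses made-up labels (hypothetical data); with the notebook's variables, Y_test and Y_pred_test would take the place of y_true and y_pred.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy true and predicted labels (hypothetical data)
y_true = ["sci.space", "sci.space", "rec.autos", "rec.autos", "sci.med", "sci.med"]
y_pred = ["sci.space", "sci.med", "rec.autos", "rec.autos", "sci.med", "sci.med"]
class_names = sorted(set(y_true))

# Precision, recall, F1, and support for every class
report = classification_report(y_true, y_pred, labels=class_names, zero_division=0)
print(report)

# Rows are true classes, columns are predicted classes, both in class_names order
cm = confusion_matrix(y_true, y_pred, labels=class_names)
print(cm)
```

Off-diagonal cells show the misclassifications: here one sci.space post is predicted as sci.med.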
