**Perform sentiment analysis on the text data using a suitable Naive Bayes algorithm**

In [47]:
import pandas as pd

In [48]:
df = pd.read_csv('news_sentiment_analysis.csv')
df

Unnamed: 0,Description,Sentiment
0,"ST. GEORGE — Kaitlyn Larson, a first-year teac...",positive
1,"Harare, Zimbabwe – Local businesses are grappl...",neutral
2,(marketscreener.com) Billionaire Elon Musk has...,positive
3,(marketscreener.com) A U.S. trade regulator on...,negative
4,4.5 million households in the U.S. have solar ...,positive
...,...,...
3495,QRG Capital Management Inc. increased its stak...,positive
3496,QRG Capital Management Inc. bought a new posit...,positive
3497,QRG Capital Management Inc. boosted its stake ...,positive
3498,"WESTFORD, Mass., July 18, 2024 /PRNewswire/ --...",neutral


**Data Preprocessing**

In [49]:
import re

In [50]:
import nltk
from nltk.corpus import stopwords

# Download necessary NLTK data
#nltk.download('stopwords')
#nltk.download('wordnet')

# Custom stopword list (used by clean_text below instead of NLTK's stopwords corpus)
custom_stopwords = [
    'the', 'and', 'is', 'in', 'it', 'of', 'to', 'a', 'with', 'as', 
    'this', 'for', 'on', 'at', 'by', 'an', 'be', 'that', 'have', 
    'not', 'are', 'but', 'was', 'from', 'or', 'if'
]

In [51]:
# Define a function to clean text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\b\w*\.com\b', '', text)  # Remove words containing .com
    text = re.sub(r'https?://\S+', '', text)  # Remove URLs (if needed)
    text = re.sub(r'\[\w+\]|\{\w+\}|\(\w+\)|\<\w+\>', '', text)  # Remove words in brackets
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = ' '.join(word for word in text.split() if word not in custom_stopwords)  # Remove stopwords
    return text
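
One side effect of the punctuation step is that `[^\w\s]` matches are replaced with an empty string, so hyphenated words are fused into one token (`first-year` becomes `firstyear`). The following is a minimal sketch of that behaviour and of one possible alternative, assuming it is preferable to keep the two halves as separate tokens. The helper name `strip_punctuation` is hypothetical and not part of the notebook.

```python
import re

def strip_punctuation(text, replacement=""):
    """Replace every non-word, non-space character with `replacement`."""
    return re.sub(r"[^\w\s]", replacement, text)

sample = "a first-year teacher"

# Behaviour of the original step: the hyphen is deleted and the words fuse.
fused = strip_punctuation(sample)

# Alternative (an assumption, not the notebook's method): substitute a space,
# then collapse any repeated whitespace.
split_tokens = " ".join(strip_punctuation(sample, " ").split())

print(fused)
print(split_tokens)
```

Whether fused or separate tokens work better for the classifier is an empirical question; the sketch only shows how to switch between the two.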

In [52]:
df['cleaned_review'] = df['Description'].apply(clean_text)

In [53]:
cleaned_file_path = 'cleaned_news_reviews.csv'
df.to_csv(cleaned_file_path, index=False)

In [54]:
df1 = pd.read_csv('cleaned_news_reviews.csv')

In [55]:
df1 = df1.drop('Description', axis=1)
df1

Unnamed: 0,Sentiment,cleaned_review
0,positive,st george kaitlyn larson firstyear teacher pin...
1,neutral,harare zimbabwe local businesses grappling sev...
2,positive,billionaire elon musk has donated super pac wo...
3,negative,us trade regulator fridayannounced suite actio...
4,positive,million households us solar panels their homes...
...,...,...
3495,positive,qrg capital management inc increased its stake...
3496,positive,qrg capital management inc bought new position...
3497,positive,qrg capital management inc boosted its stake a...
3498,neutral,westford mass july prnewswire according skyque...


In [56]:
X = df1.iloc[:,-1]

In [57]:
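# Select the label column as a 1-D Series; df1.iloc[:, :-1] returns a
# DataFrame, which makes scikit-learn emit a DataConversionWarning on fit.
y = df1['Sentiment']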

**Multinomial Naive Bayes with CountVectorizer**

In [58]:
from sklearn.feature_extraction.text import CountVectorizer

In [59]:
from sklearn.model_selection import train_test_split

In [60]:
from sklearn.naive_bayes import MultinomialNB

In [61]:
xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=0.25,random_state=42)

In [62]:
from sklearn.pipeline import make_pipeline
# Create a pipeline that combines preprocessing and model
pipeline = make_pipeline(CountVectorizer(binary=False),  # Vectorization step
    MultinomialNB() # Classification step
)

In [63]:
# Train the pipeline
pipeline.fit(xtrain, ytrain)

In [64]:
y_pred1 = pipeline.predict(xtest)

In [65]:
from sklearn.metrics import accuracy_score,confusion_matrix

In [66]:
accuracy_score(ytest,y_pred1)

0.8022857142857143

In [67]:
confusion_matrix(ytest,y_pred1)

array([[ 97,   4,  53],
       [  4, 106,  77],
       [ 12,  23, 499]])

**Bernoulli Naive Bayes with CountVectorizer**

In [68]:
vectorizer2 = CountVectorizer(binary = True)

In [69]:
X2 = vectorizer2.fit_transform(X)

In [70]:
x2train,x2test,ytrain,ytest = train_test_split(X2,y,test_size=0.25,random_state=42)

In [71]:
from sklearn.naive_bayes import BernoulliNB

In [72]:
bnb = BernoulliNB()

In [73]:
bnb.fit(x2train,ytrain)

In [74]:
y_pred2 = bnb.predict(x2test)

In [75]:
accuracy_score(ytest,y_pred2)

0.7188571428571429

In [76]:
confusion_matrix(ytest,y_pred2)

array([[ 45,   6, 103],
       [  0,  88,  99],
       [  7,  31, 496]])
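
Note that `vectorizer2.fit_transform(X)` is called on the full corpus before the split, so the vocabulary is learned from the test rows as well, whereas the Multinomial pipeline earlier learned it from the training rows only. The sketch below shows one way to put the two runs on the same footing by wrapping the vectorizer and `BernoulliNB` in a pipeline. It uses a small made-up corpus (the `texts` and `labels` values are illustrative assumptions, not the notebook's data), so it says nothing about how the reported accuracy would change.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the cleaned reviews and labels (illustrative data only).
texts = [
    "profits rise strong quarter", "shares jump record sales",
    "company boosts stake", "growth beats forecast",
    "losses widen weak demand", "shares fall after warning",
    "regulator fines company", "sales drop sharply",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Split the raw text first, so the vectorizer never sees the test rows.
xtrain, xtest, ytrain, ytest = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# The vocabulary is learned inside fit(), from the training split only.
bnb_pipeline = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
bnb_pipeline.fit(xtrain, ytrain)

predictions = bnb_pipeline.predict(xtest)
print(len(predictions))  # one prediction per held-out text
```

The same pattern would apply to the TF-IDF model, where the inverse document frequencies are also computed at fit time.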

**Multinomial Naive Bayes with Tfidf Vectorizer**

In [77]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [78]:
vectorizer3 = TfidfVectorizer()

In [79]:
X3 = vectorizer3.fit_transform(X)

In [80]:
x3train,x3test,ytrain,ytest = train_test_split(X3,y,test_size=0.25,random_state=42)

In [81]:
from sklearn.naive_bayes import MultinomialNB

In [82]:
mnb1 = MultinomialNB()

In [83]:
mnb1.fit(x3train,ytrain)

In [84]:
y_pred3 = mnb1.predict(x3test)

In [85]:
accuracy_score(ytest,y_pred3)

0.7337142857142858

In [86]:
confusion_matrix(ytest,y_pred3)

array([[ 40,   5, 109],
       [  0,  84, 103],
       [  2,  14, 518]])

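**Conclusion:** On this 75/25 split, Multinomial Naive Bayes with count features reached the highest accuracy (0.802), ahead of Multinomial Naive Bayes with TF-IDF features (0.734) and Bernoulli Naive Bayes with binary count features (0.719). The confusion matrices show where the gap comes from: the two weaker models assign far more items from the first two classes to the third class (103 and 99 for Bernoulli, 109 and 103 for TF-IDF, against 53 and 77 for the count-based Multinomial model). Count vectorization combined with Multinomial Naive Bayes is therefore the best of the three configurations tested here, and it is the model used for the predictions below.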
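
The comparison can be reproduced directly from the three confusion matrices printed above. The sketch below recomputes accuracy and per-class recall in plain Python. The label order (`negative`, `neutral`, `positive`) is an assumption based on scikit-learn sorting string labels; the matrices themselves are copied from the outputs.

```python
# Confusion matrices copied from the outputs above (rows = true class,
# columns = predicted class). The label order is an assumption: scikit-learn
# sorts string labels, which would give negative, neutral, positive.
labels = ["negative", "neutral", "positive"]
matrices = {
    "MultinomialNB + CountVectorizer": [[97, 4, 53], [4, 106, 77], [12, 23, 499]],
    "BernoulliNB + CountVectorizer": [[45, 6, 103], [0, 88, 99], [7, 31, 496]],
    "MultinomialNB + TfidfVectorizer": [[40, 5, 109], [0, 84, 103], [2, 14, 518]],
}

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def recall_per_class(matrix):
    """Diagonal entry divided by its row total, for each true class."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

for name, matrix in matrices.items():
    recalls = ", ".join(
        f"{label}={value:.2f}"
        for label, value in zip(labels, recall_per_class(matrix))
    )
    print(f"{name}: accuracy={accuracy(matrix):.4f}  recall: {recalls}")
```

Per-class recall is useful here because the third class dominates the test set (534 of 875 rows), so overall accuracy alone can hide weak performance on the smaller classes.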

**Making predictions on user-supplied input**

In [131]:
user_input = input("Please enter text to classify: ")

Please enter text to classify:  Harare, Zimbabwe – Local businesses are grappling with a severe liquidity crunch, which is limiting [...]


In [132]:
user_input_trans = clean_text(user_input)

In [133]:
user_predictions = pipeline.predict([user_input_trans])

In [134]:
print(f"The predicted category is: {user_predictions}")

The predicted category is: ['neutral']
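
Beyond the single predicted label, a fitted Naive Bayes pipeline can also report how the probability mass is spread over the classes via `predict_proba`. The sketch below illustrates this on a tiny made-up corpus (`texts`, `labels`, and `toy_pipeline` are illustrative assumptions, not the notebook's data or model); in the notebook the same call would be made on `pipeline` with the cleaned user input.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the cleaned news descriptions.
texts = [
    "shares jump after strong sales", "company raises profit forecast",
    "shares fall after profit warning", "regulator fines company over fraud",
    "company publishes annual report", "board meeting scheduled next week",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(texts, labels)

sample = ["profit forecast raised after strong sales"]

# predict_proba returns one row per input; columns follow classes_ order.
probabilities = toy_pipeline.predict_proba(sample)[0]
for label, probability in zip(toy_pipeline.classes_, probabilities):
    print(f"{label}: {probability:.3f}")
```

Naive Bayes probabilities tend to be pushed toward 0 or 1, so they are better read as a ranking of the classes than as calibrated confidence values.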
