**Natural Language Processing**

**Importing the libraries**



*   Task: Text-Based Sentiment Analysis
*   Description: Load a dataset containing text reviews or comments (e.g., product reviews, movie reviews, or social media comments). Build a sentiment analysis model that can classify these texts as positive, negative, or neutral.
*   Requirements: Preprocess the text data by tokenizing, removing stop words, and performing any necessary text cleaning. Choose and implement a traditional machine learning algorithm and report the model's accuracy, precision, recall, and F1-score on the validation set. Discuss the feature selection process and any feature engineering techniques applied to improve model performance. Present the findings, including the model's algorithm, preprocessing steps, and performance metrics.



In [155]:
# Only the first 2000 rows of the dataset are used

import string
import pandas as pd
import numpy as np
import nltk
from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tqdm.notebook import tqdm

**Downloading package**

In [156]:
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [157]:
stemmer = PorterStemmer()
sia = SentimentIntensityAnalyzer()

**Importing the dataset**

In [158]:
df = pd.read_csv('/content/Book1.csv')
df = df.head(2000)
print(df.shape)
df.head()

(2000, 2)


|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


**Cleaning the texts**

In [159]:
# Build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def cleaning_review(text: str) -> str:
    # Lowercase, then split into word and punctuation tokens
    words = word_tokenize(text.lower())
    # Keep alphanumeric tokens that are not stop words
    words = [word for word in words if word.isalnum() and word not in stop_words]
    # Reduce each remaining word to its Porter stem
    return ' '.join(stemmer.stem(word) for word in words)

df['review'] = df['review'].apply(cleaning_review)
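
The sample rows shown earlier contain `<br />` markup. The cleaning step does not remove it explicitly, and the tokenizer is likely to leave a stray `br` token that passes the `isalnum()` filter. Below is a minimal sketch of one way to strip tags before tokenizing, assuming a simple regular expression is good enough for this dataset. `strip_html` is a hypothetical helper that is not part of the notebook, the sample string is made up, and applying it would change the metrics reported further down.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br /> or <p>
_TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text: str) -> str:
    """Replace HTML tags with a space and collapse runs of whitespace."""
    without_tags = _TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

# Made-up sample in the style of the reviews above
sample = "A wonderful little production. <br /><br />The end."
print(strip_html(sample))
```

It could be called at the top of `cleaning_review`, before lowercasing.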

**Given sentiment in the dataset**

In [160]:
# Collect the ground-truth labels supplied with the dataset
input_sentiment = df['sentiment'].tolist()
print(input_sentiment)

['positive', 'positive', 'positive', 'negative', 'positive', 'positive', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'negative', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'negative', 'positive', 'positive', 'negative', 'negative', 'positive', 'positive', 'positive', 'negative', 'positive', 'negative', 'negative', 'negative', 'negative', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'positive', 'positive', 'negative', 'negative', 'positive', 'negative', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative', 'negative', 'positive', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'positive', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'positive', 'positive', 'negative', 'positive', 'positive', 'negative', 'negative', 'positive', 'positive', 'negative', 'negative', 'ne

**Predicted sentiment using SentimentIntensityAnalyzer**

In [161]:
def sentiment_analyse(sentiment_text: str) -> str:
    # Reuse the analyzer created earlier instead of building a new one per review
    score = sia.polarity_scores(sentiment_text)
    # Ties, including reviews with no lexicon hits, fall back to 'positive'
    return 'negative' if score['neg'] > score['pos'] else 'positive'

predicted_sentiment = [sentiment_analyse(review) for review in df['review']]
print(predicted_sentiment)

['negative', 'positive', 'positive', 'negative', 'positive', 'positive', 'positive', 'positive', 'negative', 'positive', 'negative', 'positive', 'positive', 'positive', 'negative', 'positive', 'negative', 'negative', 'positive', 'positive', 'positive', 'negative', 'positive', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative', 'negative', 'negative', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'positive', 'positive', 'positive', 'positive', 'negative', 'positive', 'positive', 'positive', 'positive', 'negative', 'positive', 'negative', 'positive', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'positive', 'positive', 'positive', 'negative', 'positive', 'positive', 'positive', 'positive', 'negative', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'positive', 'negative', 'positive', 'negative', 'negative', 'ne
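
The notebook builds both `input_sentiment` and `predicted_sentiment` but does not compare them. Below is a minimal sketch of how the two lists could be compared. `agreement_rate` is a hypothetical helper and the toy lists are made up, so no figure for the real data is implied.

```python
def agreement_rate(true_labels, predicted_labels):
    """Fraction of positions where the two label lists agree."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Toy lists for illustration only
toy_true = ['positive', 'positive', 'negative', 'negative']
toy_pred = ['negative', 'positive', 'negative', 'positive']
print(agreement_rate(toy_true, toy_pred))  # 2 of the 4 labels match
```

On the notebook's data the call would be `agreement_rate(input_sentiment, predicted_sentiment)`.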

**Splitting the data**

In [162]:
X = df['review']
y = df['sentiment']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

**Feature matrix**

In [163]:
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_val_tfidf = tfidf_vectorizer.transform(X_val)
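
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF using the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 normalization that `TfidfVectorizer` applies by default. It is a simplification: tokens are split on whitespace, there is no `max_features` cut-off, and the `tfidf` function and toy documents are illustrative, not taken from the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Toy stemmed documents for illustration only
docs = ["good movi good act", "bad movi", "good plot"]
vectors = tfidf(docs)
for doc, vector in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vector.items()})
```

In the second toy document, `bad` occurs in one document and `movi` in two, so `bad` receives the larger weight.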

**Naive Bayes (traditional machine learning algorithm)**

In [164]:
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tfidf, y_train)
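
As a rough illustration of what the classifier learns, the following pure-Python sketch implements multinomial Naive Bayes with additive smoothing (`alpha=1.0`, the same default `MultinomialNB` uses). It works on raw term counts rather than TF-IDF weights. `train_nb`, `predict_nb`, and the toy reviews are hypothetical and only meant to show the idea.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {term for doc in docs for term in doc.split()}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        total_terms = sum(term_counts[label].values())
        denominator = total_terms + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = {
            term: math.log((term_counts[label][term] + alpha) / denominator)
            for term in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest posterior log score."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Terms that were never seen during training are ignored
        scores[label] = log_prior + sum(
            log_likelihood[term] for term in doc.split() if term in log_likelihood
        )
    return max(scores, key=scores.get)

# Toy stemmed reviews for illustration only
toy_docs = ["good great film", "love great act", "bad bore film", "aw bad plot"]
toy_labels = ["positive", "positive", "negative", "negative"]
toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, "great act"))
print(predict_nb(toy_model, "bad plot"))
```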


**Model's accuracy, precision, recall, and F1-score**

In [165]:
y_pred = naive_bayes_classifier.predict(X_val_tfidf)
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred, average='weighted')
recall = recall_score(y_val, y_pred, average='weighted')
f1 = f1_score(y_val, y_pred, average='weighted')

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")

Accuracy: 0.8375
Precision: 0.8391927083333335
Recall: 0.8375
F1-Score: 0.8374603883968474
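
With `average='weighted'`, each class's score is weighted by that class's share of the true labels. One consequence is that weighted recall always equals accuracy, which is why both read 0.8375 above. Below is a minimal pure-Python sketch of the computation. `weighted_metrics` is a hypothetical helper and the toy labels are made up.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) with support-weighted averaging."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(support):
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = true_pos / predicted if predicted else 0.0
        r_c = true_pos / support[label]
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        # Weight each class by its share of the true labels
        weight = support[label] / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return accuracy, precision, recall, f1

# Toy labels for illustration only
toy_true = ['positive', 'positive', 'positive', 'negative', 'negative']
toy_pred = ['positive', 'positive', 'negative', 'negative', 'positive']
accuracy, precision, recall, f1 = weighted_metrics(toy_true, toy_pred)
print(accuracy, precision, recall, f1)
```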


In [166]:
# Per-class metrics: average=None returns one score per label, in the order of classes_
y_pred_val = naive_bayes_classifier.predict(X_val_tfidf)
precision_val = precision_score(y_val, y_pred_val, average=None)
recall_val = recall_score(y_val, y_pred_val, average=None)
f1_val = f1_score(y_val, y_pred_val, average=None)
print("Validation Set Metrics for Each Class:")
# Short loop names avoid shadowing the sklearn metric functions
for label, p, r, f in zip(naive_bayes_classifier.classes_, precision_val, recall_val, f1_val):
    print(f"Class {label}:")
    print(f"  Precision: {p:.2f}")
    print(f"  Recall: {r:.2f}")
    print(f"  F1-Score: {f:.2f}")


Validation Set Metrics for Each Class:
Class negative:
  Precision: 0.81
  Recall: 0.87
  F1-Score: 0.84
Class positive:
  Precision: 0.86
  Recall: 0.81
  F1-Score: 0.84


In [168]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Train a support vector classifier on the same TF-IDF features for comparison
model = SVC()
model.fit(X_train_tfidf, y_train)

y_pred = model.predict(X_val_tfidf)

accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred, average='weighted')
recall = recall_score(y_val, y_pred, average='weighted')
f1 = f1_score(y_val, y_pred, average='weighted')
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")


Accuracy: 0.8425
Precision: 0.8426159147869674
Recall: 0.8425
F1-Score: 0.8424260679079956




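*   Model Algorithm: Multinomial Naive Bayes classifier, with a support vector classifier (SVC) trained on the same features for comparison.

*   Preprocessing Steps: Lowercasing, tokenization, removal of non-alphanumeric tokens, stop-word removal, and Porter stemming, followed by TF-IDF vectorization.

*   Feature Selection Process: Performed during TF-IDF vectorization with max_features=1000, which keeps the 1000 terms with the highest frequency across the training corpus.

*   Feature Engineering Techniques: Words in the 'review' column are stemmed with the Porter Stemmer so that inflected forms share a single feature. N-grams, sentiment lexicons, word embeddings, and text augmentation could be used to improve performance further.

*   Performance Metrics: On the validation set (20% of the 2000 reviews), Naive Bayes reaches accuracy 0.8375, weighted precision 0.8392, weighted recall 0.8375, and weighted F1-score 0.8375. The SVC reaches accuracy 0.8425.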
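
The summary lists n-grams as a possible improvement. As a sketch of what that means, the hypothetical helper below builds word n-grams from a token list; the tokens are illustrative stems, not taken from the dataset. In scikit-learn the same effect is normally obtained through the `ngram_range` parameter of `TfidfVectorizer`, for example `ngram_range=(1, 2)`, though whether it helps on this data has not been tested here.

```python
def word_ngrams(tokens, n):
    """Return the contiguous n-token sequences of a token list, joined by spaces."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative stemmed tokens
tokens = ["wonder", "littl", "product"]
features = word_ngrams(tokens, 1) + word_ngrams(tokens, 2)
print(features)
# ['wonder', 'littl', 'product', 'wonder littl', 'littl product']
```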

