In [4]:
import os
import pandas as pd


# Build the path to the reviews CSV under the user's home directory.
file_path = os.path.join(os.path.expanduser('~'), 'Text Analytics', 'Reviews.csv', 'Review.csv')


In [5]:
df = pd.read_csv(file_path)

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
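# Convert the review text to TF-IDF features; the Score column (1 to 5) is the label.
# Note: fitting on the full corpus before the split exposes the vocabulary and
# IDF weights to the test reviews; fitting on the training split alone avoids that.
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(df['Text'])
y = df['Score']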

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


In [9]:
# Hold out 20% of the reviews for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
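# Multiclass logistic regression over the five Score values; max_iter is raised
# above the default of 100 to give the solver more iterations to converge.
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)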

In [11]:
y_pred = lr_model.predict(X_test)

In [12]:
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

In [14]:
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_rep)

Accuracy: 0.7591190155773104
Classification Report:
              precision    recall  f1-score   support

           1       0.69      0.72      0.70     10326
           2       0.54      0.27      0.36      5855
           3       0.52      0.36      0.43      8485
           4       0.55      0.29      0.38     16123
           5       0.81      0.95      0.88     72902

    accuracy                           0.76    113691
   macro avg       0.62      0.52      0.55    113691
weighted avg       0.73      0.76      0.73    113691
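
The gap between the macro and weighted averages in the report above comes from class imbalance: Score 5 accounts for most of the test set. The sketch below recomputes the summary rows from the per-class figures using only the standard library. The numbers are copied from the printed report (already rounded to two decimals), so the results are approximate; the helper names `macro` and `weighted` are illustrative, not part of scikit-learn.

```python
# Per-class figures copied from the classification report above:
# label -> (precision, recall, f1, support)
report = {
    1: (0.69, 0.72, 0.70, 10326),
    2: (0.54, 0.27, 0.36, 5855),
    3: (0.52, 0.36, 0.43, 8485),
    4: (0.55, 0.29, 0.38, 16123),
    5: (0.81, 0.95, 0.88, 72902),
}

total = sum(row[3] for row in report.values())


def macro(idx):
    """Unweighted mean of a metric: every class counts equally."""
    return sum(row[idx] for row in report.values()) / len(report)


def weighted(idx):
    """Support-weighted mean of a metric: large classes dominate."""
    return sum(row[idx] * row[3] for row in report.values()) / total


macro_recall = macro(1)
weighted_recall = weighted(1)

# Accuracy of a trivial model that always predicts the largest class.
majority_baseline = max(row[3] for row in report.values()) / total

print(f"total support:     {total}")
print(f"macro recall:      {macro_recall:.2f}")
print(f"weighted recall:   {weighted_recall:.2f}")
print(f"majority baseline: {majority_baseline:.2f}")
```

Under these rounded inputs, always predicting Score 5 would already reach roughly 0.64 accuracy, so the macro averages and the per-class recall for Scores 2 to 4 are the more informative figures for judging this model.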



In [None]:
#Naive Bayes is a simple, efficient model for sentiment classification: it trains quickly on sparse text features and scales to large datasets.
#However, it assumes that features are conditionally independent given the class, which rarely holds for text, where words modify one another (e.g., "not good").
#As a result, it may not capture complex relationships in the data as effectively as more expressive models such as neural networks.
#For this specific task, Naive Bayes performs reasonably well, but more nuanced sentiment analysis tasks may call for more advanced models such as deep learning architectures.
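
To make the independence assumption concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It is not the notebook's model: the toy training set and the `fit`/`predict` helpers are hypothetical and exist only to show that each word contributes its own log-probability term, regardless of its neighbours or position.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set (not taken from the reviews file).
train = [
    ("great taste love it", "pos"),
    ("love this great product", "pos"),
    ("terrible taste awful", "neg"),
    ("awful product hate it", "neg"),
]


def fit(samples):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def predict(text, class_docs, word_counts, vocab):
    """Return the best label and the per-class log scores."""
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        # Log prior for the class.
        score = math.log(class_docs[label] / n_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Independence assumption: one additive term per word,
            # with add-one (Laplace) smoothing.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get), scores


model = fit(train)
label_a, scores_a = predict("love the great taste", *model)
label_b, scores_b = predict("taste great the love", *model)
print(label_a, scores_a)
print(label_b, scores_b)
```

Both inputs receive the same scores (up to floating-point rounding) because the model sums per-word terms and ignores word order, which is exactly why phrases such as "not good" are hard for it.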
#Lexicon-based Approaches (TextBlob and VADER)
# 
a) Strengths
- Lexicon-based approaches produce polarity scores that are easy to interpret, since each score directly indicates positive, negative, or neutral sentiment.
- These models do not require training on labeled data, which makes them fast and easy to implement.
- Lexicon-based approaches perform well on short texts such as product reviews.
b) Weaknesses
- Lexicon-based approaches often miss the nuance and context of language, so they misread sentiment in complex sentences or sarcasm.
- Performance depends heavily on the quality and coverage of the sentiment lexicon, which may lack domain-specific vocabulary.
- Separating genuinely neutral opinions from statements that carry no sentiment at all (e.g., factual information) can be challenging.
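
The strengths and weaknesses listed above can be seen in a toy lexicon scorer. This is a simplified sketch, not how TextBlob or VADER work internally: the six-word `LEXICON`, the `polarity` and `label` helpers, and the 0.05 threshold are all assumptions made for illustration.

```python
# Hypothetical mini-lexicon; real tools ship far larger lexicons
# plus additional heuristics.
LEXICON = {
    "great": 1.0,
    "love": 1.0,
    "good": 0.5,
    "bad": -0.5,
    "awful": -1.0,
    "terrible": -1.0,
}


def polarity(text):
    """Average the lexicon scores of known words; 0.0 if none match."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def label(score, threshold=0.05):
    """Map a polarity score to a coarse sentiment label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


examples = [
    "Great coffee, love it!",          # plain positive review
    "Oh great, another broken lid.",   # sarcasm
    "The box contains twelve bars.",   # factual, no opinion
    "This blend is unremarkable.",     # opinion with no lexicon words
]

for text in examples:
    score = polarity(text)
    print(f"{score:+.2f}  {label(score):8s}  {text}")
```

No training data is needed and the scores are easy to read, but the sarcastic sentence is scored as positive, and the factual sentence and the out-of-lexicon opinion both collapse to the same neutral score.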
#Machine Learning-based Approaches (Naive Bayes and SVM)
a) Strengths
- Machine learning models can learn intricate patterns and relationships from the data, which lets them handle more complex text structures and contexts.
- With appropriate feature engineering and model tuning, machine learning models can adapt well to different domains and datasets.
- When trained on sufficient and representative data, machine learning models can achieve higher accuracy than lexicon-based approaches.
b) Weaknesses
- Machine learning models require labeled training data, which can be time-consuming and expensive to obtain, especially for large datasets.
- Extracting relevant features from text (e.g., Bag-of-Words, TF-IDF) requires careful consideration and domain expertise, which can become a bottleneck in model development.
- Without proper regularization and hyperparameter tuning, machine learning models may overfit the training data and generalize poorly to unseen data.
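
The points about feature extraction, regularization, and hyperparameter tuning can be sketched with a scikit-learn `Pipeline` and `GridSearchCV`. This is one possible setup, not the notebook's procedure: it reuses `TfidfVectorizer` and `LogisticRegression` from the earlier cells rather than Naive Bayes or an SVM, and the eight toy reviews, binary labels, and the grid of `C` values are assumptions for illustration. Wrapping the vectorizer in the pipeline means it is refit on the training folds only during cross-validation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = positive, 0 = negative.
texts = [
    "love this great tea",
    "great flavor and great price",
    "really love the taste",
    "good product, works great",
    "awful taste, would not buy",
    "terrible packaging and bad smell",
    "bad flavor, really awful",
    "hate this terrible product",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Feature extraction and the classifier are tuned and validated together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# C is the inverse regularization strength: smaller C means stronger
# regularization, which is one guard against overfitting.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

prediction = search.predict(["love this great tea"])
print("best parameters:", search.best_params_)
print("prediction:", prediction)
```

On the real review data, the same pattern would apply with `df['Text']` and `df['Score']` in place of the toy lists, ideally after holding out a test split that the grid search never sees.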
# 
# Tharma Raj (IS01081129)
# Yovesh Varma (IS01081505)