
# Sentiment Analysis using Natural Language Processing


## Importing Libraries

In [67]:
import numpy as np
import pandas as pd

## Importing Dataset


In [68]:
ds = pd.read_csv('Restaurant_Reviews.tsv', delimiter= '\t', quoting = 3)
print(ds.shape)

(1000, 2)


## Text Cleaning and Formatting

In [69]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
corpus = []

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [70]:
# Keep negation words such as 'no' and 'not', which NLTK treats as stopwords by default;
# dropping them would remove the words that flip a review's sentiment
sw = stopwords.words('english')
required_words = ["no", "not", "don't", "isn't", "wasn't", "doesn't", "hasn't", "can", "cannot", "can't"]
# remove the required words from the stopword list (words not in the list are simply ignored)
sw = [word for word in sw if word not in required_words]

In [71]:
# Replace non-letters with spaces, lowercase, drop stopwords, and stem each remaining word
sw_set = set(sw)  # build the lookup set once rather than once per word
for i in range(len(ds)):  # iterate over every review instead of hardcoding 1000
  review = re.sub('[^a-zA-Z]', ' ', ds['Review'][i])
  review = review.lower()
  review = review.split()
  review = [ps.stem(word) for word in review if word not in sw_set]
  review = ' '.join(review)
  corpus.append(review)
# print(corpus)
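
One detail of the cleaning step above is worth checking: the regex replaces apostrophes with spaces, so a contraction such as "wasn't" is split into two fragments before the stopword filter runs. The sketch below mirrors the cleaning steps without stemming. The `clean` helper and `toy_stop_words` are hypothetical names used only for illustration, and the tiny stopword set is an assumption, not NLTK's actual list.

```python
import re

def clean(text, stop_words):
    """Mirror the cleaning steps above, minus stemming (illustrative helper)."""
    text = re.sub('[^a-zA-Z]', ' ', text)   # apostrophes become spaces too
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_words]

# Hypothetical mini stopword list, for illustration only
toy_stop_words = {"the", "was", "wasn", "t"}

# The contraction is split into 'wasn' and 't' before any filtering happens
print(re.sub('[^a-zA-Z]', ' ', "The food wasn't good").lower().split())
# ['the', 'food', 'wasn', 't', 'good']

print(clean("The food wasn't good", toy_stop_words))   # ['food', 'good']
print(clean("The food was not good", toy_stop_words))  # ['food', 'not', 'good']
```

In this toy setup the negation survives only when it is written as a separate word. Whether the same happens in the notebook depends on whether fragments such as 'wasn' and 't' are in the stopword list being used, which can be checked by printing `sw`.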

## Creating the 'Bag Of Words'

In [72]:
# Bag-of-words count matrix, limited to the 1500 most frequent terms and converted to a dense array
from sklearn.feature_extraction.text import CountVectorizer
cvv = CountVectorizer(max_features=1500)
X = cvv.fit_transform(corpus).toarray()
y = ds.iloc[:, -1].values
# print(X.shape)
# print(y.shape)
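
To make the bag-of-words idea concrete, here is a hand-rolled sketch on a three-review toy corpus. It is not how `CountVectorizer` is implemented: the real class also applies its own tokenization rules and, with `max_features`, keeps only the most frequent terms. The names `toy_corpus`, `vocab`, and `to_counts` are made up for this illustration.

```python
from collections import Counter

# Hypothetical, already-cleaned reviews (stemmed, stopwords removed)
toy_corpus = ["food good", "food bad bad", "good servic"]

# Vocabulary: every distinct token, sorted so the column order is stable
vocab = sorted({tok for doc in toy_corpus for tok in doc.split()})

def to_counts(doc):
    """Return one row of the matrix: how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]

matrix = [to_counts(doc) for doc in toy_corpus]

print(vocab)   # ['bad', 'food', 'good', 'servic']
for row in matrix:
    print(row)
# [0, 1, 1, 0]
# [2, 1, 0, 0]
# [0, 0, 1, 1]
```

Each row is one review and each column is one word, so word order within a review is lost. That is why the model sees "not good" as just the two separate counts for "not" and "good".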

## Splitting dataset into training and test sets

In [73]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Training the model

In [74]:
# Of the several models tried, logistic regression gave the best accuracy
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
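
As a rough picture of what the fitted model does with a bag-of-words row, binary logistic regression computes a weighted sum of the word counts, passes it through a sigmoid, and predicts the positive class when the resulting probability is at least 0.5. The sketch below uses a three-word vocabulary with made-up weights. The values in `weights` and `bias` are assumptions for illustration, not coefficients learned from this dataset.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical vocabulary and weights, chosen by hand for illustration
vocab = ["bad", "good", "not"]
weights = [-1.5, 1.2, -0.8]
bias = 0.1

def predict(counts):
    """Return (label, probability of the positive class) for one row of word counts."""
    z = bias + sum(w * c for w, c in zip(weights, counts))
    p = sigmoid(z)
    return (1 if p >= 0.5 else 0), p

for counts in ([0, 1, 0], [1, 0, 0], [0, 1, 1]):   # "good", "bad", "not good"
    label, prob = predict(counts)
    print(counts, label, round(prob, 3))
```

With these particular made-up weights, the "not good" row still comes out positive, because the model only adds up per-word contributions and has no notion of word order.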

## Testing and Validations

In [75]:
y_pred = classifier.predict(X_test)
# print predicted labels (left column) next to the true labels (right column)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 1]
 [1 1]
 [1 0]
 [1 0]
 [1 1]
 [0 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [0 1]
 [0 1]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [1 0]
 [1 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [0 1]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [0 0]
 [1 0]
 [0 1]
 [0 0]
 [1 1]
 [0 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [0 1]
 [0 0]
 [0 0]
 [0 1]
 [1 1]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [0 1]
 [0 1]
 [1 0]
 [0 1]
 [1 1]
 [1 1]
 [1 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [0 1]
 [0 1]
 [1 1]
 [0 0]
 [1 0]
 [0 1]
 [1 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 1]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [1 1]

## Evaluating the model

In [76]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
#confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
#accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
#precision value
precision = precision_score(y_test, y_pred)
print("Precision:", precision)
#recall value
recall = recall_score(y_test, y_pred)
print("Recall:", recall)
#f1 score
f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)

Confusion Matrix:
[[81 16]
 [26 77]]
Accuracy: 0.79
Precision: 0.8279569892473119
Recall: 0.7475728155339806
F1 Score: 0.7857142857142857
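
The four scores above all follow from the confusion matrix. In scikit-learn's layout the rows are true labels and the columns are predicted labels, so for a binary problem the matrix reads `[[TN, FP], [FN, TP]]`. The short sketch below recomputes each metric by hand from the numbers printed above.

```python
# Values copied from the confusion matrix printed above
tn, fp = 81, 16   # true label 0: correctly rejected, wrongly flagged positive
fn, tp = 26, 77   # true label 1: missed positives, correctly flagged positive

total = tn + fp + fn + tp                            # 200 test reviews
accuracy = (tp + tn) / total                         # share of all predictions that are correct
precision = tp / (tp + fp)                           # of the predicted positives, how many are right
recall = tp / (tp + fn)                              # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

Reading the matrix this way also shows where the errors are: 26 positive reviews were classified as negative, against 16 negative reviews classified as positive, which is why recall is lower than precision.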


## Testing using own reviews

In [77]:
new_reviews = ['Fair food', 'The food can be much better.', 'The restaurant is nice, the waiter was very kind.', 'Horrible food, disgusting', 'Excellent environment and great food!! Would definitely recommend!']
formatted_reviews = []
sw_set = set(sw)  # build the lookup set once rather than once per word
# apply the same cleaning steps that were used for the training corpus
for text in new_reviews:
  review = re.sub('[^a-zA-Z]', ' ', text)
  review = review.lower()
  review = review.split()
  review = [ps.stem(word) for word in review if word not in sw_set]
  review = ' '.join(review)
  formatted_reviews.append(review)
# print(formatted_reviews)

In [78]:
preds = classifier.predict(cvv.transform(formatted_reviews).toarray())
for i in range(len(new_reviews)):
  a = "Positive Review" if preds[i] == 1 else "Negative Review"
  print(new_reviews[i], ":", a)

Fair food : Negative Review
The food can be much better. : Negative Review
The restaurant is nice, the waiter was very kind. : Positive Review
Horrible food, disgusting : Negative Review
Excellent environment and great food!! Would definitely recommend! : Positive Review


## Final Notes

Among the classification models tried (Random Forest, Naive Bayes, SVM, etc.), Logistic Regression performed best, with an accuracy of about 79% on the 200-review test set.
That is fair but leaves room for improvement, which I expect to work on in the future. The model performs well on obvious reviews that use common words such as 'good', 'great', and 'bad'. A bigger dataset would likely help improve it.