# Sentiment Analysis of Restaurant Reviews

### This script performs sentiment analysis on a collection of restaurant reviews using machine learning techniques. Sentiment analysis aims to determine the sentiment expressed in text, specifically whether a review conveys a positive or negative sentiment about a restaurant experience.

### Steps:
1. Data Loading and Exploration: The script loads a dataset of restaurant reviews from a TSV file. It checks for missing values and displays the first and last rows to show the dataset's structure.

2. Data Preprocessing: The review text is preprocessed to make it consistent and to remove uninformative words. Each review is converted to lowercase and split into words, stopwords are removed (negations such as "not" and "n't" are kept because they carry sentiment), and the remaining words are stemmed to their root forms.

3. Feature Extraction: To enable machine learning models to process text data, the preprocessed reviews are transformed into numerical features. This process, known as count vectorization (bag of words), records how many times each unigram and bigram in the vocabulary appears in each review. The vocabulary is limited to the 1,500 most frequent terms. The values are raw counts, not importance weights.

4. Model Selection and Evaluation: The script considers multiple machine learning models for sentiment analysis, including Multinomial Naive Bayes, Random Forest, Gradient Boosting, XGBoost, and Support Vector Classifier (SVC). Each model is trained on 80% of the data and evaluated on the remaining 20% to measure its predictive ability.

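5. Best Model Identification: During evaluation, the script selects the model with the highest test accuracy among those whose training accuracy exceeds their test accuracy by at most 5 percentage points. The gap check excludes models that overfit the training data, so the selected model is not necessarily the one with the highest raw test accuracy.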

6. Evaluation Metrics and Insights: The script provides evaluation metrics such as accuracy, which reflects the proportion of correctly predicted sentiments. Additionally, it generates a confusion matrix and a classification report to offer insights into model performance for positive and negative sentiments.

7. Conclusion: Sentiment analysis of restaurant reviews holds practical value for understanding customer opinions, improving restaurant services, and making informed business decisions. This script demonstrates the process of preprocessing text, training machine learning models, and evaluating their performance in sentiment analysis.

### Sentiment analysis plays a crucial role in extracting valuable insights from unstructured text data, contributing to enhanced customer experiences and data-driven decision-making in the restaurant industry.

Import necessary libraries

In [19]:
import numpy as np
import pandas as pd
from nltk.stem import PorterStemmer
from string import punctuation
from spacy.lang.en import STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

Load the dataset


In [20]:
review=pd.read_csv("Restaurant_Reviews.tsv",sep="\t")

Display the loaded dataset

In [21]:
review

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


Check for any missing values in the dataset

In [22]:
review.isnull().sum()

Review    0
Liked     0
dtype: int64

Display the first few rows of the dataset

In [23]:
review.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


Display the last few rows of the dataset

In [24]:
review.tail()

| | Review | Liked |
|---|---|---|
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |
| 999 | Then, as if I hadn't wasted enough of my life ... | 0 |


Preprocess the data

In [25]:
corpus=[]
stopwords=list(STOP_WORDS)
stopwords_to_remove=["n‘t","n't","n’t","not"]
stopwords=[word for word in stopwords if word not in stopwords_to_remove]

In [26]:
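ps=PorterStemmer()           # create the stemmer once instead of once per review
stopword_set=set(stopwords)  # build the lookup set once for fast membership tests
for i in range(review.shape[0]):
    data=review.iloc[i,0]      # raw review text (first column)
    data=data.lower().split()  # lowercase, then split on whitespace
    # drop stopwords, then stem the remaining words
    data=[ps.stem(word) for word in data if word not in stopword_set]
    corpus.append(' '.join(data))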
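
The reason for taking "not" and the "n't" variants out of the stopword list can be seen on a single review. The sketch below is illustrative only: `SMALL_STOPWORDS` is a hypothetical hand-picked set (not spaCy's `STOP_WORDS`), `filter_tokens` is a helper introduced here, and stemming is skipped so that only the filtering step is shown.

```python
# Illustrative only: SMALL_STOPWORDS is a hypothetical hand-picked set,
# far smaller than spaCy's STOP_WORDS, and no stemming is applied.
SMALL_STOPWORDS = {"is", "the", "was", "and", "not"}
NEGATIONS = {"not", "n't"}


def filter_tokens(text, stopwords):
    """Lowercase, split on whitespace, and drop tokens found in `stopwords`."""
    return [word for word in text.lower().split() if word not in stopwords]


review_text = "Crust is not good."

# With "not" still in the stopword list, the negation is erased.
with_negation_removed = filter_tokens(review_text, SMALL_STOPWORDS)

# With negations taken out of the stopword list, the negation survives.
with_negation_kept = filter_tokens(review_text, SMALL_STOPWORDS - NEGATIONS)

print(with_negation_removed)  # ['crust', 'good.']
print(with_negation_kept)     # ['crust', 'not', 'good.']
```

Without the negation, a negative review is reduced to the same tokens as a positive one, which is why the preprocessing keeps these words.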

Create a CountVectorizer and transform the text data into numerical features

In [27]:
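# Keep the 1500 most frequent unigrams and bigrams as features.
# Note: stop_words='english' applies scikit-learn's built-in list, which contains "not",
# so the negations kept during preprocessing are filtered out again at this step.
cv=CountVectorizer(max_features=1500, ngram_range=(1, 2), stop_words='english')
x=cv.fit_transform(corpus).toarray()  # dense matrix of term counts, one row per review
y=review['Liked'].values              # labels: 1 = liked, 0 = not liked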
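
To make the idea of a count matrix concrete, here is a minimal pure-Python sketch of bag-of-words counting over unigrams and bigrams. It is not how `CountVectorizer` is implemented, and it leaves out the `max_features` cap, stop-word filtering, and scikit-learn's tokenization rules. The documents and the helper names `ngrams` and `count_features` are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_features(doc, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    tokens = doc.split()
    counts = Counter(ngrams(tokens, 1) + ngrams(tokens, 2))
    return [counts[term] for term in vocabulary]


# Hypothetical, already-preprocessed documents (not taken from the dataset).
docs = ["love place love food", "not good", "food not good"]

# Vocabulary: every unigram and bigram seen in the corpus, sorted.
vocabulary = sorted({term
                     for doc in docs
                     for n in (1, 2)
                     for term in ngrams(doc.split(), n)})

# One row per document, one column per vocabulary term.
matrix = [count_features(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the numerical representation of one document. The bigram column "not good" is what lets a model tell "not good" apart from "good" on its own.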

Split the data into training and testing sets

In [28]:
# No random_state is passed, so the split and the scores reported below change from run to run
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.2)

Define models for sentiment analysis

In [29]:
models={'MultinomialNB':MultinomialNB(),
        'RandomForestClassifier': RandomForestClassifier(n_estimators=100,max_depth=4),
        'GradientBoostingClassifier' : GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
        'xgb_classifier' : XGBClassifier(n_estimators=100,learning_rate=0.1,max_depth=3,random_state=42),
        'SVC': SVC(kernel='linear')}

Initialize variables to track the best model and its accuracy

In [30]:
best_model = None
best_accuracy = 0

Dictionary to store evaluation metrics for each model

In [31]:
score={}

Train and evaluate each model, and track the most accurate model whose train-test accuracy gap is at most 5 percentage points

In [32]:
for model_name,model in models.items():
    model.fit(x_train,y_train)
    train_score=model.score(x_train,y_train)  # accuracy on the training set
    test_score=model.score(x_test,y_test)     # accuracy on the held-out test set
    y_pred=model.predict(x_test)
    accuracy=accuracy_score(y_test,y_pred)    # same value as test_score for classifiers
    # Only models whose train-test gap is at most 5 percentage points are eligible
    if ((train_score-test_score)*100)<= 5 and accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = model_name
    score[model_name]={'Train Score':train_score, 'Test Score':test_score,'Accuracy':accuracy,'y_pred':y_pred}

Print evaluation metrics for each model

In [33]:
for model_name, scores in score.items():
    print(f"Model: {model_name}")
    print(f"Train Score: {scores['Train Score']:.4f}")
    print(f"Test Score: {scores['Test Score']:.4f}")
    print(f"Accuracy: {scores['Accuracy']:.2f}\n")

Model: MultinomialNB
Train Score: 0.9200
Test Score: 0.7800
Accuracy: 0.78

Model: RandomForestClassifier
Train Score: 0.8200
Test Score: 0.7500
Accuracy: 0.75

Model: GradientBoostingClassifier
Train Score: 0.8337
Test Score: 0.7550
Accuracy: 0.76

Model: xgb_classifier
Train Score: 0.7650
Test Score: 0.7200
Accuracy: 0.72

Model: SVC
Train Score: 0.9762
Test Score: 0.7350
Accuracy: 0.73



Print the best model and its accuracy

In [34]:
print(f"The best model is {best_model} with an accuracy of {best_accuracy*100:.2f}%")

The best model is xgb_classifier with an accuracy of 72.00%
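
xgb_classifier is reported as best even though its test accuracy is the lowest of the five. The sketch below replays the selection rule on the scores printed above to show why: it is the only model whose train-test gap is within 5 percentage points. The scores are copied from the output; the names `reported` and `MAX_GAP_POINTS` are introduced here for illustration.

```python
# (train score, test score) pairs copied from the printed output above.
reported = {
    "MultinomialNB": (0.9200, 0.7800),
    "RandomForestClassifier": (0.8200, 0.7500),
    "GradientBoostingClassifier": (0.8337, 0.7550),
    "xgb_classifier": (0.7650, 0.7200),
    "SVC": (0.9762, 0.7350),
}

MAX_GAP_POINTS = 5  # same threshold as the selection loop

best_model, best_accuracy = None, 0
for name, (train_score, test_score) in reported.items():
    gap_points = (train_score - test_score) * 100
    eligible = gap_points <= MAX_GAP_POINTS
    print(f"{name}: gap = {gap_points:.2f} points, eligible = {eligible}")
    if eligible and test_score > best_accuracy:
        best_model, best_accuracy = name, test_score

print(best_model, best_accuracy)
```

If no model passed the gap check, `best_model` would stay `None`, and looking it up in the `score` dictionary would raise a `KeyError`.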


Calculate and display the confusion matrix for the best model

In [35]:
conf_matrix = confusion_matrix(y_test, score[best_model]['y_pred'])
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[87 16]
 [40 57]]


Create and display the classification report for the best model

In [36]:
class_rep = classification_report(y_test, score[best_model]['y_pred'])
print("Classification Report:")
print(class_rep)

Classification Report:
              precision    recall  f1-score   support

           0       0.69      0.84      0.76       103
           1       0.78      0.59      0.67        97

    accuracy                           0.72       200
   macro avg       0.73      0.72      0.71       200
weighted avg       0.73      0.72      0.71       200
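
The per-class figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below uses plain Python and assumes scikit-learn's convention that rows are true labels and columns are predicted labels. The matrix values are copied from the output above.

```python
# Confusion matrix copied from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
conf_matrix = [[87, 16],
               [40, 57]]

total = sum(sum(row) for row in conf_matrix)
accuracy = (conf_matrix[0][0] + conf_matrix[1][1]) / total

metrics = {}
for label in (0, 1):
    true_positive = conf_matrix[label][label]
    predicted_as_label = conf_matrix[0][label] + conf_matrix[1][label]  # column sum
    actually_label = sum(conf_matrix[label])                            # row sum
    precision = true_positive / predicted_as_label
    recall = true_positive / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.2f}")
```

The recall of 0.59 for class 1 corresponds to the 40 positive reviews that the model labeled as negative, which is the main weakness of the selected model on this split.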

