<a href="https://colab.research.google.com/github/123ranika/Research-paper/blob/main/SA_restaurant_review_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Sentiment Analysis on Bengali Restaurant Reviews

In this project we classify the sentiment of a review as either positive or negative. For this purpose we created a dataset of $1.4k$ Bengali restaurant reviews. It is a roughly balanced dataset in which $630$ reviews are annotated as positive sentiment and another $790$ as negative sentiment. All the reviews were collected from different social media groups (such as food monster) and then manually annotated by two native Bengali speakers.


**Project Includes:**

-   Preprocessing
-   Exploratory Analysis
-   Feature Extraction using TF-IDF for N-gram
-   Machine Learning Model Development
-   Evaluation Measure
-   Saving the Final Model

## Import Libraries

In [20]:
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re,json,nltk
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,accuracy_score,precision_score,recall_score,f1_score
stopwords_list ='stopwords-bn.txt'

## Importing the Dataset

In [21]:
# Read the data and print the number of reviews per label
data = pd.read_csv('/content/Traning.xlsx - Sheet1 (6).csv',encoding='UTF-8')
print("Total Reviews:",len(data),
      "\nTotal Cyberbullying Reviews:",len(data[data.labels =='Cyberbullying']),
      "\nTotal Religious_Hatred Reviews:",len(data[data.labels=='Religious_Hatred']),
      "\nTotal Gender_Discrimination Reviews:",len(data[data.labels =='Gender_Discrimination']),
      "\nTotal Sarcasm Reviews:",len(data[data.labels=='Sarcasm']),
      "\nTotal Political Reviews:",len(data[data.labels =='Political']),
      "\nTotal Racism Reviews:",len(data[data.labels=='Racism'])
      )

Total Reviews: 6000 
Total Cyberbullying Reviews: 1686 
Total Religious_Hatred Reviews: 673 
Total Gender_Discrimination Reviews: 678 
Total Sarcasm Reviews: 1518 
Total Political Reviews: 776 
Total Racism Reviews: 668


In [22]:
data.columns

Index(['PID', 'text', 'labels'], dtype='object')

In [23]:
# print some unprocessed reviews
sample_data = [10,100,150,200,250,600,650,666,689,640,650,700,750,800,1000]
for i in sample_data:
      print(data.text[i],'\n','labels:-- ',data.labels[i],'\n')

আল্লাহ এদের উপর গজব নাজিল করুক। 
 labels:--  Religious_Hatred 

ধুত তেরি শালার পুতেরা 
 labels:--  Cyberbullying 

চাপা দিয়ে নিজেকে দিয়েছি ধিক্কার,, 
 labels:--  Cyberbullying 

হ্স্ত মেরে করবো শেষ 
 labels:--  Sarcasm 

মানগের নাতি, ইজলা এতে কিয়া কয় 
 labels:--  Cyberbullying 

৯ বার ব্রাজিলের পোদ মেরে জেতা 
 labels:--  Cyberbullying 

হিন্দুর সোনা মজা 
 labels:--  Cyberbullying 

তাদের যৌন সন্তুষ্টি 
 labels:--  Cyberbullying 

ওরে কানাচোদা,,,তোদের ওই লেওড়া খান ওরফে ছাকিব খান 
 labels:--  Cyberbullying 

নাটকের রুনা খানের মতো লাগে এখন 
 labels:--  Cyberbullying 

হিন্দুর সোনা মজা 
 labels:--  Cyberbullying 

ভারতের রেসিজম টা চরমে! 
 labels:--  Racism 

কই সোনা 
 labels:--  Sarcasm 

মাইয়ারে পাইলে দুধ এর উপর ঘুম যাইতাম আহা সেই মাল 
 labels:--  Gender_Discrimination 

আপনি নিজে ভালো তো ভাই 
 labels:--  Sarcasm 



## Data Processing
This step removes punctuation marks, numbers, emoji, and stopwords from the reviews. We use helper functions to clean the corpus.


In [27]:
# Check the contents of the utils module
import utils

# This will list all attributes and functions in the utils module
print(dir(utils))


['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']


In [36]:
import importlib.util
spec = importlib.util.spec_from_file_location("utils", "/content/utils.py")
utils = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utils)

# Now you can access the functions
cleaned_reviews = utils.cleaned_reviews
stopwords_info = utils.stopwords_info
stopword_removal = utils.stopword_removal
process_reviews = utils.process_reviews


In [38]:
#from utils import cleaned_reviews,stopwords_info,stopword_removal,process_reviews


In [44]:
def process_reviews(review, stopwords, removing_stopwords=True):
    """This function takes a review string as input and returns a cleaned review string
        after removing punctuation, English characters, numbers and stopwords.

        Args:
            review: str
            stopwords: list
            removing_stopwords: bool

        Returns:
            cleaned review: str
    """
    if not isinstance(review, str):
        review = str(review)  # Convert review to string if it's not already

    # Always strip non-Bengali characters first
    reviews = cleaned_reviews(review)
    if removing_stopwords:
        # Optionally drop stopwords from the cleaned text
        reviews = stopword_removal(reviews, stopwords)
    return reviews

def cleaned_reviews(review):
    """This function takes a review string and removes unnecessary
        punctuations, english characters, numbers.

        Args:
            review: str

        Returns:
            cleaned review: str
    """
    review = review.replace('\n', '') # remove newline characters
    # replace everything outside the Bengali Unicode block (punctuation, digits, Latin letters, emoji) with a space
    review = re.sub('[^\u0980-\u09FF]',' ',str(review))
    return review
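
As a rough illustration of the two cleaning stages, the sketch below keeps only characters in the Bengali Unicode block and then drops stopwords. The names `clean_text`, `remove_stopwords`, and the two-word `demo_stopwords` set are hypothetical stand-ins; the notebook's real `stopword_removal` lives in `utils.py` and may behave differently.

```python
import re

def clean_text(text):
    """Keep only characters from the Bengali Unicode block (U+0980 to U+09FF)."""
    text = re.sub(r'[^\u0980-\u09FF]', ' ', str(text))
    # Collapse the runs of spaces left behind by the substitution
    return ' '.join(text.split())

def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated token that appears in `stopwords`."""
    return ' '.join(word for word in text.split() if word not in stopwords)

# Hypothetical two-word stopword set, for illustration only
demo_stopwords = {'ও', 'ছিল'}
sample = 'খাবার ভাল ছিল, তাছাড়া পরিবেশটা ও চমৎকার ।। ।।!!!!'

cleaned = clean_text(sample)
filtered = remove_stopwords(cleaned, demo_stopwords)
print(cleaned)
print(filtered)
```

A review written entirely in Latin characters comes back as an empty string, which is why the prediction step later checks the length of the processed text.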

## Remove Low Length Data

In [52]:
# Length of each review, in words
data['length'] = data['text'].astype(str).apply(lambda x: len(x.split()))
# Keep only the reviews with more than two words
dataset = data.loc[data.length > 2]
dataset = dataset.reset_index(drop=True)
print(
    "After Cleaning:","\nRemoved {} Small Reviews".format(len(data) - len(dataset)),
    "\nTotal Reviews:",len(dataset),
    "\nTotal Cyberbullying Reviews:",len(data[data.labels == 'Cyberbullying']),
    "\nTotal Religious_Hatred Reviews:",len(data[data.labels == 'Religious_Hatred']),
    "\nTotal Gender_Discrimination Reviews:",len(data[data.labels == 'Gender_Discrimination']),
    "\nTotal Sarcasm Reviews:",len(data[data.labels == 'Sarcasm']),
    "\nTotal Political Reviews:",len(data[data.labels == 'Political']),
    "\nTotal Racism Reviews:",len(data[data.labels == 'Racism'])
)

After Cleaning: 
Removed 177 Small Reviews 
Total Reviews: 5823 
Total Cyberbullying Reviews: 1686 
Total Religious_Hatred Reviews: 673 
Total Gender_Discrimination Reviews: 678 
Total Sarcasm Reviews: 1518 
Total Political Reviews: 776 
Total Racism Reviews: 668
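
The per-class totals above match the totals printed before filtering, so it may help to count labels on the filtered rows instead. The sketch below does this on a tiny made-up list of `(text, label)` pairs; the rows and the `min_words` name are hypothetical. On the real DataFrame the equivalent would be something like `dataset.labels.value_counts()`.

```python
from collections import Counter

# Hypothetical toy rows standing in for the (text, labels) columns
rows = [
    ('one two three four', 'Sarcasm'),
    ('short text', 'Sarcasm'),
    ('this one has enough words', 'Political'),
    ('tiny', 'Racism'),
]

min_words = 2  # keep rows with more than this many words, as in the cell above

kept = [(text, label) for text, label in rows if len(text.split()) > min_words]
counts_before = Counter(label for _, label in rows)
counts_after = Counter(label for _, label in kept)

print('Removed', len(rows) - len(kept), 'short rows')
for label in counts_before:
    # A Counter returns 0 for labels that no longer occur after filtering
    print(label, counts_before[label], '->', counts_after[label])
```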


In [54]:
dataset[['text','labels']].to_csv('/content/Traning.xlsx - Sheet1 (6).csv')

### Save the cleaned data and stopwords into a pickle file

In [58]:
data = pd.read_excel('clean_rr_reviews.xlsx')

FileNotFoundError: [Errno 2] No such file or directory: 'clean_rr_reviews.xlsx'

In [None]:
# Open a file where you want to store the data
with open('rr_review_data.pkl', 'wb') as file:
    # Dump the data to that file
    pickle.dump(data, file)

In [None]:
# Load the saved file
with open('rr_review_data.pkl', 'rb') as file:
    data = pickle.load(file)

In [None]:
# Stopwords pickle
with open(stopwords_list, 'r', encoding='utf-8') as f:
    stp = f.read().split()
# Open a file where you want to store the stopwords
with open('rr_stopwords.pkl', 'wb') as file:
    # Dump the stopword list to that file
    pickle.dump(stp, file)

In [None]:
with open('rr_stopwords.pkl', 'rb') as file:
    stp = pickle.load(file)
len(stp)

##### Processing of a sample review

In [None]:
tweet = 'খাবার ভাল ছিল, তাছাড়া পরিবেশটা ও চমৎকার ।। ।।!!!!'
process_reviews(review = tweet, stopwords =stopwords_list,removing_stopwords=True)

## Dataset Summary

In [None]:
from utils import data_summary
documents,words,u_words,class_names = data_summary(dataset)

### Dataset Summary Visualization

In [None]:
data_matrix = pd.DataFrame({'Total Documents':documents,
                            'Total Words':words,
                            'Unique Words':u_words,
                            'Class Names':class_names})
df = pd.melt(data_matrix, id_vars="Class Names", var_name="Category", value_name="Values")
print(df)

In [None]:
plt.figure(figsize=(6, 4))
ax = plt.subplot()

sns.barplot(data=df,x='Class Names', y='Values' ,hue='Category')
ax.set_xlabel('Class Names')
ax.set_title('Data Statistics')

ax.xaxis.set_ticklabels(class_names, rotation=45);

### Review Length Distribution

In [None]:
# Calculate the length of each review, in words
dataset['ReviewLength'] = dataset.cleaned.apply(lambda x:len(x.split()))
frequency = dict()
for i in dataset.ReviewLength:
    frequency[i] = frequency.get(i, 0)+1

plt.bar(frequency.keys(), frequency.values(), color ="b")
plt.xlim(1, 80)
# In this notebook the color is not applied, but it should work.
plt.xlabel('Length of the Texts')
plt.ylabel('Frequency')
plt.title('Length-Frequency Distribution')
plt.show()
print(f"Maximum Length of a review: {max(dataset.ReviewLength)}")
print(f"Minimum Length of a review: {min(dataset.ReviewLength)}")
print(f"Average Length of a review: {round(np.mean(dataset.ReviewLength),0)}")
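
The frequency dictionary built in the loop above can also be produced with `collections.Counter`. The following is a minimal sketch on a made-up list of texts (the `texts` list is hypothetical), showing the same length counts and summary statistics without a DataFrame.

```python
from collections import Counter
from statistics import mean

# Hypothetical cleaned texts; each "word" is a whitespace-separated token
texts = ['a b c', 'a b c d', 'a b c', 'a b c d e f']

lengths = [len(text.split()) for text in texts]
frequency = Counter(lengths)  # maps length -> number of texts with that length

print(sorted(frequency.items()))
print('max:', max(lengths), 'min:', min(lengths), 'mean:', mean(lengths))
```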

## Feature Extraction Using TF-IDF

In [None]:
from utils import calc_unigram_tfidf,calc_bigram_tfidf,calc_trigram_tfidf,show_tfidf

In [None]:
tweet = 'খাবার ভাল ছিল, তাছাড়া পরিবেশটা ও চমৎকার ।। ।।!!!!'
cv,feature_vector = calc_trigram_tfidf(dataset.cleaned)
print("Shape of TF-IDF Corpus =====>",feature_vector.shape,'\n')
show_tfidf(cv,tweet)
#first_vector = tfidf.transform([samp_review]).toarray()
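
The `calc_*_tfidf` helpers come from `utils.py`, which is not shown here, so the following is only a from-scratch sketch of what an n-gram TF-IDF computation looks like. It assumes whitespace tokenisation, the smoothed IDF `ln((1 + N) / (1 + df)) + 1`, and L2 normalisation (the defaults scikit-learn documents for `TfidfVectorizer`); the function names, the toy corpus, and the choice of a 1-to-3 n-gram range are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_min=1, n_max=1):
    """Return one {ngram: weight} dict per document, L2-normalised."""
    grams_per_doc = [Counter(ngrams(doc.split(), n_min, n_max)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each n-gram occurs
    df = Counter(gram for grams in grams_per_doc for gram in grams)
    # Smoothed inverse document frequency
    idf = {gram: math.log((1 + n_docs) / (1 + df[gram])) + 1 for gram in df}
    vectors = []
    for grams in grams_per_doc:
        weights = {gram: tf * idf[gram] for gram, tf in grams.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in weights.items()})
    return vectors

# Hypothetical toy corpus
docs = ['food was good', 'food was bad', 'service was good']
unigram = tfidf(docs, 1, 1)
up_to_trigram = tfidf(docs, 1, 3)
print(sorted(up_to_trigram[0]))
```

In the unigram vectors, a word that occurs in every document ("was") ends up with a lower weight than words that occur in only some of them, which is the effect TF-IDF is meant to produce.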

In [None]:
#help(calc_unigram_tfidf)

## ML Model Development Using Unigram Feature

### Unigram Tf-idf Feature Extraction, Label Encoding and Splitting

In [None]:
from utils import label_encoding,dataset_split
from utils import calc_unigram_tfidf

# calculate the Unigram Tf-idf feature
cv,feature_vector = calc_unigram_tfidf(dataset.cleaned)
# Encode the labels
lables = label_encoding(dataset.Sentiment,False)
# Split the Feature into train and test set
X_train,X_test,y_train,y_test = dataset_split(feature_space=feature_vector,sentiment=lables)
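
`label_encoding` and `dataset_split` are defined in `utils.py` and their settings are not visible in this notebook. The sketch below shows the general idea in plain Python: map sorted class names to integers, then shuffle and hold out a fraction of the rows. The function names, the `test_fraction` of 0.2, and the `seed` are assumptions for illustration.

```python
import random

def encode_labels(labels):
    """Map each distinct label to an integer, using sorted class order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[label] for label in labels], classes

def take(seq, indices):
    """Select the items of `seq` at the given positions."""
    return [seq[i] for i in indices]

def split(features, labels, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    n_test = int(round(len(order) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (take(features, train_idx), take(features, test_idx),
            take(labels, train_idx), take(labels, test_idx))

# Hypothetical toy data: ten one-dimensional feature rows
raw_labels = ['positive', 'negative'] * 5
encoded, classes = encode_labels(raw_labels)
features = [[float(i)] for i in range(10)]

X_tr, X_te, y_tr, y_te = split(features, encoded)
print(classes, len(X_tr), len(X_te))
```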

### Model Definition

In [None]:
from utils import model_performace,ml_models_for_unigram_tfidf

# Classifier definitions
ml_models,model_names = ml_models_for_unigram_tfidf()

# Call the model performance function and save the metrics into a dictionary
accuracy = {f'{model_names[i]}':model_performace(model,X_train,X_test,y_train,y_test) for i,model in enumerate(ml_models)}
# Save the performance parameters into a json file
with open('ml_performance_unigram.json', 'w') as f:
    json.dump(accuracy, f)


### Performance Table  

In [None]:
from utils import performance_table

# Load the json file
accuracy = json.load(open('ml_performance_unigram.json'))
table = performance_table(accuracy)
table

In [None]:
print(f"Highest Accuracy achieved by {table.Accuracy.idxmax(axis = 0)} at = {max(table.Accuracy)}")
print(f"Highest F1-Score achieved by {table['F1 Score'].idxmax(axis = 0)} at = {max(table['F1 Score'] )}")
print(f"Highest Precision Score achieved by {table['Precision'].idxmax(axis = 0)} at = {max(table['Precision'] )}")
print(f"Highest Recall Score achieved by {table['Recall'].idxmax(axis = 0)} at = {max(table['Recall'] )}")
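
The table columns (Accuracy, Precision, Recall, F1 Score) can be computed directly from true and predicted labels. Here is a minimal sketch for the binary case; `binary_metrics` and the toy label lists are hypothetical and are not the `model_performace` helper from `utils.py`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'Accuracy': accuracy, 'Precision': precision,
            'Recall': recall, 'F1 Score': f1}

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```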


### ROC Curve

In [None]:
from utils import plot_roc_curve,ml_models_for_unigram_tfidf
# Classifier definitions
gram_models = ml_models_for_unigram_tfidf()

plot_roc_curve(gram_models,X_train,X_test,y_train,y_test,'Unigram')
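
For reference, an ROC curve is traced by sorting test examples by predicted score and lowering the decision threshold one score at a time. The sketch below computes the curve points and the trapezoidal area under them for a binary problem; `roc_points`, `area_under`, and the toy scores are hypothetical and independent of the `plot_roc_curve` helper in `utils.py`.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists; assumes labels are 0/1 and both classes occur."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all examples tied at this score are counted
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr

def area_under(x, y):
    """Trapezoidal area under the curve defined by the points (x, y)."""
    return sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2
               for i in range(1, len(x)))

# Hypothetical labels and classifier scores
labels_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(labels_demo, scores_demo)
roc_auc = area_under(fpr, tpr)
print(list(zip(fpr, tpr)), roc_auc)
```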

### Precision-Recall Curve

In [None]:
from utils import plot_PR_curve,ml_models_for_unigram_tfidf

gram_models = ml_models_for_unigram_tfidf()

plot_PR_curve(gram_models,X_train,X_test,y_train,y_test,'Unigram')
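
A precision-recall curve uses the same ranking idea: at each threshold, precision is the share of predicted positives that are correct and recall is the share of actual positives found. A small sketch under the same assumptions (binary 0/1 labels, hypothetical `pr_points` function and toy scores, not the `plot_PR_curve` helper from `utils.py`):

```python
def pr_points(y_true, scores):
    """Return (precision, recall) lists, one point per distinct threshold."""
    pos = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    precision, recall = [], []
    tp = 0
    for k, (score, label) in enumerate(ranked, start=1):
        tp += label  # label is 1 for a positive example, 0 otherwise
        # k examples are predicted positive once the threshold reaches `score`
        if k == len(ranked) or ranked[k][0] != score:
            precision.append(tp / k)
            recall.append(tp / pos)
    return precision, recall

# Hypothetical labels and classifier scores
pr_labels = [0, 0, 1, 1]
pr_scores = [0.1, 0.4, 0.35, 0.8]
precision_pts, recall_pts = pr_points(pr_labels, pr_scores)
print(list(zip(recall_pts, precision_pts)))
```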

## Model Development Using Bigram Feature

### Bi-gram Tf-idf Feature Extraction, Label Encoding and Splitting

In [None]:
from utils import label_encoding,dataset_split
from utils import calc_bigram_tfidf

# calculate the Bigram Tf-idf feature
cv,feature_vector = calc_bigram_tfidf(dataset.cleaned)
# Encode the labels
lables = label_encoding(dataset.Sentiment,False)
# Split the Feature into train and test set
X_train,X_test,y_train,y_test = dataset_split(feature_space=feature_vector,sentiment=lables)

### Model Definition

In [None]:
from utils import model_performace,ml_models_for_bigram_tfidf

# Classifier definitions
ml_models,model_names = ml_models_for_bigram_tfidf()

# Call the model performance function and save the metrics into a dictionary
accuracy = {f'{model_names[i]}':model_performace(model,X_train,X_test,y_train,y_test) for i,model in enumerate(ml_models)}
# Save the performance parameters into a json file
with open('ml_performance_bigram.json', 'w') as f:
    json.dump(accuracy, f)


### Performance Table

In [None]:
from utils import performance_table

# Load the json file
accuracy = json.load(open('ml_performance_bigram.json'))
table = performance_table(accuracy)
table

In [None]:
print(f"Highest Accuracy achieved by {table.Accuracy.idxmax(axis = 0)} at = {max(table.Accuracy)}")
print(f"Highest F1-Score achieved by {table['F1 Score'].idxmax(axis = 0)} at = {max(table['F1 Score'] )}")
print(f"Highest Precision Score achieved by {table['Precision'].idxmax(axis = 0)} at = {max(table['Precision'] )}")
print(f"Highest Recall Score achieved by {table['Recall'].idxmax(axis = 0)} at = {max(table['Recall'] )}")


### ROC Curve

In [None]:
from utils import plot_roc_curve,ml_models_for_bigram_tfidf
# Classifier definitions
gram_models = ml_models_for_bigram_tfidf()

plot_roc_curve(gram_models,X_train,X_test,y_train,y_test,'Bigram')

### Precision-Recall Curve

In [None]:
from utils import plot_PR_curve,ml_models_for_bigram_tfidf
# Classifier definitions
gram_models = ml_models_for_bigram_tfidf()

plot_PR_curve(gram_models,X_train,X_test,y_train,y_test,'Bigram')

## Model Development Using Tri-gram Feature

### Tri-gram Tf-idf Feature Extraction, Label Encoding and Splitting

In [None]:
from utils import label_encoding,dataset_split
from utils import calc_trigram_tfidf

# calculate the Tri-gram Tf-idf feature
cv,feature_vector = calc_trigram_tfidf(dataset.cleaned)
# Encode the labels
lables = label_encoding(dataset.Sentiment,False)
# Split the Feature into train and test set
X_train,X_test,y_train,y_test = dataset_split(feature_space=feature_vector,sentiment=lables)

### Model Definition

In [None]:
from utils import model_performace,ml_models_for_trigram_tfidf


# Classifier definitions
ml_models,model_names = ml_models_for_trigram_tfidf()

# Call the model performance function and save the metrics into a dictionary
accuracy = {f'{model_names[i]}':model_performace(model,X_train,X_test,y_train,y_test) for i,model in enumerate(ml_models)}
# Save the performance parameters into a json file
with open('ml_performance_trigram.json', 'w') as f:
    json.dump(accuracy, f)


### Performance Table

In [None]:
from utils import performance_table

# Load the json file
accuracy = json.load(open('ml_performance_trigram.json'))
table = performance_table(accuracy)
table


In [None]:
print(f"Highest Accuracy achieved by {table.Accuracy.idxmax(axis = 0)} at = {max(table.Accuracy)}")
print(f"Highest F1-Score achieved by {table['F1 Score'].idxmax(axis = 0)} at = {max(table['F1 Score'] )}")
print(f"Highest Precision Score achieved by {table['Precision'].idxmax(axis = 0)} at = {max(table['Precision'] )}")
print(f"Highest Recall Score achieved by {table['Recall'].idxmax(axis = 0)} at = {max(table['Recall'] )}")


### ROC Curve

In [None]:
from utils import plot_roc_curve,ml_models_for_trigram_tfidf
# Classifier definitions
gram_models = ml_models_for_trigram_tfidf()

plot_roc_curve(gram_models,X_train,X_test,y_train,y_test,'Trigram')

### Precision-Recall Curve

In [None]:
from utils import plot_PR_curve,ml_models_for_trigram_tfidf
# Classifier definitions
gram_models = ml_models_for_trigram_tfidf()

plot_PR_curve(gram_models,X_train,X_test,y_train,y_test,'Trigram')

## Final Model

- Selected Feature: Trigram
- Selected Model : Stochastic Gradient Descent

In [None]:
from utils import label_encoding,dataset_split
from utils import calc_trigram_tfidf

# calculate the Tri-gram Tf-idf feature
cv,feature_vector = calc_trigram_tfidf(dataset.cleaned)
# Encode the labels
lables = label_encoding(dataset.Sentiment,False)
# Split the Feature into train and test set
X_train,X_test,y_train,y_test = dataset_split(feature_space=feature_vector,sentiment=lables)

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
# 'log_loss' is the logistic-regression loss; scikit-learn versions before 1.1 call it 'log'
sgd_model = SGDClassifier(loss='log_loss', penalty='l2', max_iter=5)
sgd_model.fit(X_train,y_train)
y_pred = sgd_model.predict(X_test)
accuracy_score(y_true=y_test,y_pred=y_pred)*100
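
To make the selected model less of a black box, here is a from-scratch sketch of logistic regression trained by stochastic gradient descent with an L2 penalty. It is not scikit-learn's implementation; the learning rate `lr`, penalty strength `alpha`, number of `epochs`, and the toy data are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_sgd_logreg(X, y, lr=0.1, alpha=1e-4, epochs=100, seed=0):
    """Update the weights after every single example (stochastic updates)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = predict_proba(w, b, X[i]) - y[i]  # gradient of the log loss
            # The alpha * wj term is the gradient of the L2 penalty
            w = [wj - lr * (err * xj + alpha * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

# Hypothetical, linearly separable toy data
X_demo = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y_demo = [1, 1, 0, 0]
w, b = train_sgd_logreg(X_demo, y_demo)
preds = [int(predict_proba(w, b, x) >= 0.5) for x in X_demo]
print(preds)
```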

### Save the Model for Reuse

In [None]:
import pickle
# Open a file where you want to store the model
with open('rr_review_sgd.pkl', 'wb') as file:
    # Dump the trained model to that file
    pickle.dump(sgd_model, file)

In [None]:
with open('rr_review_sgd.pkl', 'rb') as model:
    sgd = pickle.load(model)

In [None]:
y_pred = sgd.predict(X_test)
accuracy_score(y_test,y_pred)
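
The save-and-reload pattern can be checked in isolation. This sketch pickles a small dictionary standing in for a trained model (the `model_like` object and the temporary file path are hypothetical) and confirms that the reloaded object equals the original. Using `with` blocks ensures the file handles are closed.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model object
model_like = {'weights': [0.5, -1.2], 'bias': 0.1}

# Write to a temporary directory so nothing in the project is overwritten
path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')

with open(path, 'wb') as f:
    pickle.dump(model_like, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model_like)
```

Pickle files can execute arbitrary code when loaded, so only load files from a source you trust.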

## Check a Review Sentiment using our model

In [None]:
# load the model
with open('rr_review_sgd.pkl', 'rb') as model:
    sgd = pickle.load(model)
######
#review = 'aaaasd asd asdasd asd'
review = 'খাবার ভাল ছিল , তাছাড়া পরিবেশটা ও চমৎকার ।। ।।!!!!'
# Process the reviews
processed_review = process_reviews(review,stopwords = stopwords_list,removing_stopwords = True)
if (len(processed_review))>0:
    # calculate the Tri-gram Tf-idf feature
    cv,feature_vector = calc_trigram_tfidf(dataset.cleaned)
    feature = cv.transform([processed_review]).toarray()

    sentiment = sgd.predict(feature)
    # Scale to a percentage first, then round, to avoid floating-point residue
    score = round(max(sgd.predict_proba(feature).reshape(-1))*100,2)

    if (sentiment ==0):
        print(f"It is a Negative Review and the probability is {score}%")
    else:
        print(f"It is a Positive Review and the probability is {score}%")
else:
    print("This review doesn't contain any Bengali words, so the sentiment cannot be predicted.")
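
The reporting step can be separated from the model so it is easy to test. In the sketch below, `describe_prediction` is a hypothetical helper that takes a predicted label and its class probabilities; it scales the top probability to a percentage before rounding, since rounding first and multiplying afterwards can leave floating-point residue in the printed value.

```python
def describe_prediction(label, probabilities):
    """Turn a predicted label (0 or 1) and class probabilities into a message."""
    score = round(max(probabilities) * 100, 2)  # scale first, then round
    name = 'Positive' if label == 1 else 'Negative'
    return f'It is a {name} Review and the probability is {score}%'

# Hypothetical model output for a single review
message = describe_prediction(1, [0.43, 0.57])
print(message)
```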


In [None]:
review = 'aaaasd asd asdasd asd'
# Process the reviews
processed_review = process_reviews(review,stopwords = stopwords_list,removing_stopwords = True)
len(processed_review)