# Sentiment, Statistical Analysis, and Hypothesis 

## Sentiment Analysis

Sentiment analysis, also known as opinion mining, determines the sentiment or emotional tone expressed in text data, such as the user reviews in your research. It classifies the polarity of a text, typically as positive, negative, or neutral, to gauge the sentiment conveyed by the author. In the context of your research on the impact of review sentiments and star ratings on business patronage, sentiment analysis is a crucial component for understanding how users perceive and evaluate businesses. Below, I'll detail the methods you can use for sentiment analysis and how to create a sentiment scale for your text sentiment data.

### Data preprocessing, tokenization, feature extraction, post-processing and evaluation
Preprocessing plays a pivotal role in sentiment analysis: it cleans and normalizes the text of the sampled reviews so that it is suitable for analysis. This step converts raw text into a consistent form that can be analyzed. Common text preprocessing techniques include tokenization, removal of stop words such as "and," "the," "of," and "it", and lemmatization, which reduces inflected words to their lemma (dictionary form) based on their intended meaning.

### Import needed libraries

Download all the nltk corpus for the first time to ensure that all the necessary data is available for natural language processing tasks.

In [1]:
# import natural language processing tool kit NLTK libraries for data preprocessing and tokenization
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
import re 
nltk.download('all')


[nltk_data] Downloading collection 'all'

True

### Import generated random sample business dataset for preprocessing

In [2]:
# Read random sample csv into a dataframe
biz_sample = pd.read_csv('csv/biz_sample.csv')

In [3]:
# Inspect dataframe
biz_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   business_id   25000 non-null  object
 1   name          25000 non-null  object
 2   address       24703 non-null  object
 3   city          25000 non-null  object
 4   state         25000 non-null  object
 5   postal_code   24999 non-null  object
 6   review_count  25000 non-null  int64 
 7   is_open       25000 non-null  int64 
 8   categories    25000 non-null  object
 9   review_id     25000 non-null  object
 10  user_id       25000 non-null  object
 11  stars         25000 non-null  int64 
 12  useful        25000 non-null  int64 
 13  funny         25000 non-null  int64 
 14  cool          25000 non-null  int64 
 15  text          25000 non-null  object
 16  date          25000 non-null  object
dtypes: int64(6), object(11)
memory usage: 3.2+ MB


In [5]:
# Rename the reviews column 'text' to 'review'
biz_sample.rename(columns = {'text': 'review'}, inplace=True)

In [6]:
biz_sample.head(2)

|  | business_id | name | address | city | state | postal_code | review_count | is_open | categories | review_id | user_id | stars | useful | funny | cool | review | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WxZ2Ua5hb7g3hZqWjw5k6w | 615 Pizza and Pasta | 5337 Charlotte Ave | Nashville | TN | 37209 | 38 | 1 | Restaurants | dOITfBm-j5Ts5kYx1J0UAw | mOapNjIy3ynfYuNt5Pvexg | 5 | 1 | 0 | 0 | First let me start off by saying I wasn't look... | 2020-01-22 06:46:53 |
| 1 | LCitqsu9DV1RjGn7EIzzIQ | Outback Steakhouse | 610 Old York Rd | Jenkintown | PA | 19046 | 147 | 1 | Restaurants | RueeEC-0KfShSmr5x6DP7Q | MUGaNRi8f9mzsEqzw98YOQ | 2 | 0 | 0 | 0 | Dirty..dirty..dirty place. Floors very greasy... | 2016-05-08 04:01:00 |


*Next, we preprocess the 'review' field of the sample dataframe for sentiment analysis using a multi-model fusion approach.*

### Methods for Sentiment Analysis

#### Lexicon-Based Sentiment Analysis: 
Lexicon-based approaches use predefined sentiment lexicons or dictionaries that contain words and phrases associated with positive and negative sentiment. We use the NLTK VADER sentiment analyzer, which employs a predefined set of rules and heuristics to assess the sentiment of a given text. These rules rely primarily on lexical and syntactic characteristics of the text, taking into account the occurrence of positive or negative words and phrases. Each word is assigned a polarity score, and the sentiment of a text is determined by summing these scores. For example, the word "excellent" might have a high positive score, while "terrible" has a high negative score. This method is relatively simple and efficient.
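
To make the word-scoring idea concrete, here is a minimal sketch of lexicon-based scoring. The `TOY_LEXICON` dictionary and its values are invented for illustration and are not VADER's actual lexicon; VADER additionally applies heuristics (for negation, intensifiers, and so on) that this sketch leaves out.

```python
import re

# Hypothetical toy lexicon: word -> polarity score (NOT VADER's real values).
TOY_LEXICON = {
    "excellent": 3.0,
    "good": 2.0,
    "greasy": -1.5,
    "dirty": -2.0,
    "terrible": -3.0,
}

def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the polarity scores of every known word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Words missing from the lexicon contribute 0 to the total.
    return sum(lexicon.get(tok, 0.0) for tok in tokens)

print(lexicon_score("Excellent pizza, good service"))    # 5.0
print(lexicon_score("Dirty place, floors very greasy"))  # -3.5
```

A positive total suggests positive sentiment and a negative total suggests negative sentiment; the magnitude only means something relative to the lexicon's own scale.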

#### Machine Learning-Based Sentiment Analysis: 
Machine learning methods involve training models on labeled data to predict the sentiment of text. Common techniques include using algorithms like Naive Bayes, Support Vector Machines (SVM), or deep learning models like Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). Training data consists of text samples with known sentiment labels (positive, negative, neutral). Once trained, the model can predict sentiment for unseen text.
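
As a rough illustration of the train-then-predict workflow, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing using only the standard library. The training reviews and the helper names `train_naive_bayes` and `predict` are made up for this example; a real study would more likely use a library implementation and far more labeled data.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    label_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in label_counts.items():
        logp = math.log(n_docs / total_docs)  # class prior
        n_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            logp += math.log((word_counts[label][tok] + 1) / (n_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Hypothetical labeled reviews, invented for illustration only.
train = [
    ("great food friendly staff", "positive"),
    ("excellent pizza great service", "positive"),
    ("dirty floors rude staff", "negative"),
    ("terrible food dirty tables", "negative"),
]
model = train_naive_bayes(train)
print(predict(model, "great pizza"))   # positive
print(predict(model, "dirty tables"))  # negative
```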

#### Rule-Based Sentiment Analysis: 
Rule-based approaches involve designing custom rules or heuristics to identify sentiment in text. These rules can be based on grammatical structures, word patterns, or specific keywords. Rule-based methods are highly interpretable and allow for fine-grained control over sentiment analysis.
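
A minimal sketch of a rule-based classifier follows, assuming just two rules: keyword matching, and a negator directly before a keyword flipping its polarity. The keyword sets and the function name `rule_based_sentiment` are hypothetical and chosen only to show how such rules stay easy to inspect.

```python
import re

# Hypothetical keyword lists and negators, chosen only for illustration.
POSITIVE = {"good", "great", "clean", "friendly"}
NEGATIVE = {"bad", "dirty", "rude", "greasy"}
NEGATORS = {"not", "never", "no"}

def rule_based_sentiment(text):
    """Classify text with two rules: keyword match, and negation flips polarity."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)  # +1, -1, or 0
        # Rule: a negator directly before a keyword flips its polarity.
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(rule_based_sentiment("The staff was friendly"))  # positive
print(rule_based_sentiment("The room was not clean"))  # negative
print(rule_based_sentiment("We arrived at noon"))      # neutral
```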

#### Hybrid Approaches: 
Hybrid methods combine elements of lexicon-based, machine learning, and rule-based approaches to improve accuracy. They may use lexicons as features for machine learning models or incorporate rule-based heuristics within a machine learning framework.

### NLTK data preprocessing combining lemmatization and stemming 

By combining lemmatization and stemming in NLTK pipelines, we aim to reduce words to their base or root form, which increases coverage and improves text matching and similarity calculations. It helps us treat variations of a word as similar, contributing to better text comparisons. Stemming is the more aggressive approach: it trims the ends of words to reach a common root. Thus, it may not always produce valid words, but it handles variations well. Lemmatization, on the other hand, tends to produce valid words but may not cover as many variations as stemming. The combined approach maintains interpretability while still capturing some of the aggressive reductions, and it helps us understand the sentiment of the core words in the reviews. Using both lemmatization and stemming also contributes to better feature representations for our text data and helps reduce dimensionality by capturing the essential semantic information for further machine learning applications.

In [7]:
# Create a preprocessing function that cleans and tokenizes the dataframe 'review' field, then removes stop words such as 'and', 'of', 'it', 'or', etc.
revLemmatize = WordNetLemmatizer()
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english')) #build the stop word set once instead of once per token.

def rev_prep(reviewText):
    reviewText=str(reviewText) #convert the input to a string.
    reviewText=reviewText.lower() #convert the string to lower case.
    reviewText=reviewText.replace('{html}',"") #remove literal '{html}' markers.
    cleaner=re.compile('<.*?>') #compile a pattern that matches HTML tags.
    cleanText=re.sub(cleaner, '', reviewText) #remove all HTML tags.
    remove_url=re.sub(r'http\S+', '',cleanText) #remove all URLs.
    remove_num=re.sub('[0-9]+', '', remove_url) #remove all runs of digits (0-9).
    tokenizer = RegexpTokenizer(r'\w+') #tokenize on runs of word characters, which also drops punctuation.
    revTokens_Filter = tokenizer.tokenize(remove_num) #tokenize the string stored in remove_num.
    
    #keep only tokens that are longer than 2 characters and are not English stop words
    filtered_text = [i for i in revTokens_Filter if len(i) > 2 and i not in stop_words]
    stem_text=[stemmer.stem(i) for i in filtered_text] #reduce each filtered token to its stem
    lemma_words=[revLemmatize.lemmatize(i) for i in stem_text] #lemmatize each stemmed token
    #NOTE: the filtered tokens are returned, not lemma_words, so the stored reviews are neither stemmed nor lemmatized;
    #this matches the output shown below and keeps full word forms that the VADER lexicon can match.
    return ' '.join(filtered_text) #join the filtered tokens into a single space-separated string


In [8]:
# Apply the function to the 'review' column in the sample dataframe and store the processed text
biz_sample['review'] = biz_sample['review'].apply(rev_prep)
# Equivalent alternative using map with a lambda, useful if additional logic is needed
# biz_sample['review'] = biz_sample['review'].map(lambda s:rev_prep(s)) 

In [13]:
biz_sample.head(2)

|  | business_id | name | address | city | state | postal_code | review_count | is_open | categories | review_id | user_id | stars | useful | funny | cool | review | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WxZ2Ua5hb7g3hZqWjw5k6w | 615 Pizza and Pasta | 5337 Charlotte Ave | Nashville | TN | 37209 | 38 | 1 | Restaurants | dOITfBm-j5Ts5kYx1J0UAw | mOapNjIy3ynfYuNt5Pvexg | 5 | 1 | 0 | 0 | first let start saying looking much late night... | 2020-01-22 06:46:53 |
| 1 | LCitqsu9DV1RjGn7EIzzIQ | Outback Steakhouse | 610 Old York Rd | Jenkintown | PA | 19046 | 147 | 1 | Restaurants | RueeEC-0KfShSmr5x6DP7Q | MUGaNRi8f9mzsEqzw98YOQ | 2 | 0 | 0 | 0 | dirty dirty dirty place floors greasy sliding ... | 2016-05-08 04:01:00 |


### Sentiment Scale

Because of the computational constraints of this research, we use lexicon-based sentiment analysis. This allows us to create a sentiment scale by quantifying the sentiment scores obtained from the sentiment analysis methods above, giving a structured representation of the sentiment expressed in the user reviews. The scale can be used to numerically assess the sentiment of a given text, making it amenable to statistical analysis.

**Polarity Scores:** Each sentiment analysis method will assign polarity scores to the text, typically on a scale from -1 (very negative) to 1 (very positive). Neutral sentiments can be assigned a score of 0. These scores represent the intensity of sentiment in the text.

**Aggregation:** To create a sentiment scale, we aggregate the polarity scores from multiple methods. This can be done by averaging the scores or using a weighted average if you want to give more weight to a specific method's output.

**Binning:** After aggregation, we categorize the aggregated scores into sentiment categories. For instance, scores in the range of -1 to -0.5 could be classified as "negative," -0.5 to 0.5 as "neutral," and 0.5 to 1 as "positive."

**Normalization:** Depending on the analysis requirements, we may normalize the scale by mapping the scores to a specific range. This normalization makes the sentiment scores more interpretable and consistent across various texts.

This sentiment scale, created from sentiment analysis output, will allow us to quantitatively assess the sentiment expressed in reviews, enabling us to explore the relationship between sentiment, star ratings, and their impact on business patronage in a structured and analyzable manner.
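
The aggregation, binning, and normalization steps can be sketched as follows. The function names, the example scores, and the weights are assumptions made for illustration; the ±0.5 cut-offs mirror the ranges given in the binning description, with the boundary values themselves counted as positive or negative.

```python
def aggregate(scores, weights=None):
    """Weighted average of polarity scores in [-1, 1] from several methods."""
    if weights is None:
        weights = [1.0] * len(scores)  # equal weights give a plain mean
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def bin_score(score, threshold=0.5):
    """Map a score to a label using symmetric cut-offs at +/- threshold."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

def normalize(score, low=0.0, high=1.0):
    """Linearly rescale a score from [-1, 1] to [low, high]."""
    return low + (score + 1.0) / 2.0 * (high - low)

# Hypothetical scores for one review from three methods.
method_scores = [1.0, 0.5, 0.0]
combined = aggregate(method_scores)                   # plain mean
weighted = aggregate(method_scores, [2.0, 1.0, 1.0])  # first method counts double

print(combined, bin_score(combined), normalize(combined))            # 0.5 positive 0.75
print(weighted, bin_score(weighted), normalize(weighted, 1.0, 5.0))  # 0.625 positive 4.25
```

Rescaling to a 1 to 5 range, as in the last line, is one way to put sentiment scores on the same footing as star ratings before comparing them.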

In [18]:
#initialize NLTK sentiment analyzer
revAnalyzer = SentimentIntensityAnalyzer()

In [22]:
# Assign polarity score
def getSenti(text):
    scores = revAnalyzer.polarity_scores(text) #analyze the sentiment and return a dictionary of polarity scores.
    sentiment = 1 if scores['pos'] > 0 else 0 #flag the review as 1 if any positive polarity is detected, otherwise 0.
    return sentiment

In [27]:
#run polarity score function
biz_sample['sentiment'] = biz_sample['review'].apply(getSenti)

In [37]:
#Create a sentiment scale using the capture_sentiment function
def capture_sentiment(revText):
    scores = revAnalyzer.polarity_scores(revText) #get VADER's neg, neu, pos and compound polarity scores
    sentiment_score = scores['compound'] #compound is VADER's normalized aggregate score in the range -1 to 1
    if sentiment_score >= 0.5:
        sentiment_label = 'positive' #label for pos sentiment
    elif sentiment_score <= -0.5:
        sentiment_label = 'negative' #label for neg sentiment
    else:
        sentiment_label = 'neutral' #label for neu sentiment
    
    return sentiment_score, sentiment_label


In [38]:
# Apply the capture_sentiment function to the review field and assign the scores and labels to new columns
biz_sample['sentiment_score'], biz_sample['sentiment_label'] = zip(*biz_sample['review'].apply(capture_sentiment))

In [45]:
biz_sample.to_csv('yelp_nltk_sentiment_score.csv', index=False)

### NLTK classification and evaluation of model performance

We use a confusion matrix to evaluate the performance of the NLTK model classification by comparing the predicted labels with the actual labels.

In [49]:
# Import confusion matrix module function from scikit-learn machine learning library 
from sklearn.metrics import confusion_matrix, classification_report 
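
The comparison of predicted and actual labels could look like the dependency-free sketch below. It assumes, hypothetically, that the "actual" label is derived from the star rating (4 to 5 stars positive, 3 neutral, 1 to 2 negative); the source does not state this mapping, and the ratings and predictions are invented. Rows are actual labels and columns are predicted labels, the same orientation that scikit-learn's `confusion_matrix` uses.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]

def stars_to_label(stars):
    """Assumed (hypothetical) mapping from star rating to an 'actual' label."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"

def confusion(actual, predicted, labels=LABELS):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]

# Hypothetical star ratings and lexicon-predicted labels for six reviews.
stars = [5, 4, 1, 2, 3, 5]
predicted = ["positive", "neutral", "negative", "neutral", "neutral", "positive"]
actual = [stars_to_label(s) for s in stars]

matrix = confusion(actual, predicted)
for label, row in zip(LABELS, matrix):
    print(f"{label:>8}: {row}")

# Accuracy is the share of reviews on the diagonal (actual == predicted).
accuracy = sum(matrix[i][i] for i in range(len(LABELS))) / len(actual)
print("accuracy:", accuracy)
```

With real data, the same `actual` and `predicted` lists could be passed to the imported `confusion_matrix` and `classification_report` functions instead.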

### Model 2 - Sentiment Analysis with BERT Model

### Model 3 - Sentiment Analysis with CNN VGG-16 Model

### Model 4 - Sentiment Analysis with Word2Vec Model

### Model 5 - Sentiment Analysis with GloVe Model

### Model 6 - Sentiment Analysis with SPINN Hybrid tree-sequence neural networks