# Sentiment Analysis

## Muzammil Mushtaq

In [1]:
import pandas as pd
df = pd.read_csv('restaurant-reviews.csv')

In [2]:
print ('Shape of the Initial Dataset before preprocessing : ', df.shape)
df.head(2)

Shape of the Initial Dataset before preprocessing :  (1000, 5)


Unnamed: 0,name,restaurant_url,title,text,rating
0,Manufactur,https://www.tripadvisor.com/Restaurant_Review-...,Best in Kiel,The absolutely best restaurant in the town of ...,5.0
1,Manufactur,https://www.tripadvisor.com/Restaurant_Review-...,"Simply, tasty and very good",Tasty and high quality food! A “healthier”way ...,5.0


### Preprocessing Dataset

In [3]:
# Lowercasing (cast to str first so missing or non-string values do not raise)
df['text'] = df['text'].astype(str).str.lower()

In [4]:
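# Remove the non-English reviews (just for my understanding :P )
from langdetect import detect
def detect_language(text):
    """Return True if langdetect identifies the text as English."""
    try:
        return detect(text) == 'en'
    except Exception:
        # langdetect raises when the text has no detectable features (e.g. an empty string)
        return False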

df = df[df['text'].apply(detect_language)]
        

In [5]:
# Handling special characters and Numbers
import re
df['text'] = df['text'].apply(lambda x: re.sub(r'[^A-Za-z\s]', '', x))


In [6]:
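# Lemmatization
# Note: with its default part of speech (noun), WordNetLemmatizer turns "was" into "wa",
# which is visible in the output further below.

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()  # create once and reuse for every review

def lemmatize_sentence(sentence):
    return ' '.join(lemmatizer.lemmatize(word) for word in sentence.split())

df['text'] = df['text'].apply(lemmatize_sentence)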

[nltk_data] Error loading wordnet: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


In [7]:
# Stopword Removal
nltk.download('stopwords')  # needed once; `nltk` is imported in the lemmatization cell
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_stop_words(sentence):
    return ' '.join([word for word in sentence.split() if word.lower() not in stop_words])

df['text'] = df['text'].apply(remove_stop_words)

In [8]:
# Shift ratings from 1-5 to 0-4 so they can serve as zero-based class labels
df['rating'] = df['rating'] - 1

In [9]:
print ('Shape of dataset after preprocessing :', df.shape)
df[['text', 'rating']]

Shape of dataset after preprocessing : (964, 5)


Unnamed: 0,text,rating
0,absolutely best restaurant town kiel nice deco...,4.0
1,tasty high quality food healthierway propose c...,4.0
2,food wa asked happily surprised food small res...,4.0
3,amazing service amzing food nice people quite ...,4.0
4,manufaktur really nice small self service rest...,4.0
...,...,...
995,traum gmbh traumfabrik former name kiel reside...,3.0
996,great big place nice patio like fireplace staf...,1.0
997,big choice many different dish tried wa good l...,3.0
998,bad service,0.0


#### Heuristic Approaches to Sentiment Analysis

TextBlob uses a heuristic approach for sentiment analysis, relying on predefined patterns and rules. It assigns polarity scores to words, calculates the overall sentiment of a sentence based on these scores, and provides subjectivity scores. It also includes a basic Naive Bayes classifier for sentiment analysis. While convenient, this rule-based approach may be less accurate than models trained on specific datasets.

##### *Importance of TextBlob*

TextBlob ships two sentiment analyzers. The default, `PatternAnalyzer`, is lexicon- and rule-based; the optional `NaiveBayesAnalyzer` is a machine-learning classifier trained on a movie-review corpus. The steps involved are roughly:

1. **Text Parsing:** Break down the text into sentences and words.

2. **Lexicon Lookup:** Look up known words in a predefined lexicon that stores polarity and subjectivity scores (default analyzer).

3. **Rule-based Adjustment:** Apply predefined rules for linguistic nuances such as negation and intensifiers (default analyzer).

4. **Sentiment Polarity and Subjectivity:** Combine the scores into a polarity in [-1, 1] and a subjectivity in [0, 1] (default analyzer).

5. **Classification and Probability Scores:** With `NaiveBayesAnalyzer` only, return a positive/negative label together with the probabilities `p_pos` and `p_neg`.

This notebook uses the default analyzer through `analysis.sentiment.polarity`.
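
To make the lexicon-and-rules idea concrete, here is a toy sketch of a rule-based polarity scorer. The lexicon, the intensifier and negation lists, and the function name `toy_polarity` are all invented for illustration; this is not TextBlob's actual data or algorithm, only the general shape of the approach.

```python
# Toy lexicon-based polarity scorer (illustrative only; NOT TextBlob's real lexicon or rules)
TOY_LEXICON = {"good": 0.7, "great": 0.8, "tasty": 0.6, "bad": -0.7, "slow": -0.3}
INTENSIFIERS = {"very": 1.3, "really": 1.2}
NEGATIONS = {"not", "never"}


def toy_polarity(text):
    """Average the scores of known words, flipping the sign after a negation
    and scaling after an intensifier. Returns a value in [-1, 1]."""
    scores = []
    negate, scale = False, 1.0
    for word in text.lower().split():
        if word in NEGATIONS:
            negate = True
        elif word in INTENSIFIERS:
            scale *= INTENSIFIERS[word]
        elif word in TOY_LEXICON:
            score = TOY_LEXICON[word] * scale
            if negate:
                score = -score
            scores.append(max(-1.0, min(1.0, score)))
            negate, scale = False, 1.0  # modifiers apply to the next scored word only
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("very good food"))       # intensified positive score
print(toy_polarity("not good service"))     # negation flips the sign
print(toy_polarity("table by the window"))  # no known words, so neutral
```

Rules like these depend on words such as "not" and "very". NLTK's English stop-word list contains both, so the stop-word removal applied earlier may strip exactly the cues a rule-based analyzer relies on.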


The text preprocessing in the previous steps was done mainly for learning and practice. TextBlob tokenizes its input itself, so it can also be applied to the raw review text.

In [10]:
from textblob import TextBlob

def filter_reviews(reviews):
    filtered_reviews = []
    
    for review in reviews:
        analysis = TextBlob(review)
        sentiment_score = analysis.sentiment.polarity
        if sentiment_score >= 0.5:
            filtered_reviews.append(4)
        elif 0 < sentiment_score < 0.5:
            filtered_reviews.append(3)
        elif sentiment_score == 0 :
            filtered_reviews.append(2)
        elif -0.5 < sentiment_score < 0.0:
            filtered_reviews.append(1)
        else:
            filtered_reviews.append(0)
    return filtered_reviews


df['sentiment_analysis'] = filter_reviews(df['text'])

df[['text','sentiment_analysis']]

Unnamed: 0,text,sentiment_analysis
0,absolutely best restaurant town kiel nice deco...,4
1,tasty high quality food healthierway propose c...,3
2,food wa asked happily surprised food small res...,3
3,amazing service amzing food nice people quite ...,4
4,manufaktur really nice small self service rest...,3
...,...,...
995,traum gmbh traumfabrik former name kiel reside...,3
996,great big place nice patio like fireplace staf...,4
997,big choice many different dish tried wa good l...,3
998,bad service,0


In [11]:
# Classification Report to find relation between actual rating and textblob rating

import numpy as np
from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(df['rating'], df['sentiment_analysis'])
print(f"Accuracy: {accuracy:.2f}")

# Display classification report
print("Classification Report:")
print(classification_report(df['rating'], df['sentiment_analysis']))




Accuracy: 0.42
Classification Report:
              precision    recall  f1-score   support

         0.0       0.50      0.05      0.09        39
         1.0       0.27      0.26      0.27        53
         2.0       0.00      0.00      0.00       102
         3.0       0.37      0.81      0.51       327
         4.0       0.65      0.28      0.39       443

    accuracy                           0.42       964
   macro avg       0.36      0.28      0.25       964
weighted avg       0.46      0.42      0.37       964
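
To read the report above, it can help to see how the per-class numbers and the two averages are derived from raw counts. The sketch below recomputes them by hand on made-up labels (not the notebook's data); `per_class_report` is a hypothetical helper, and a label that is never predicted is given precision 0.0 here.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} computed from raw counts."""
    tp, pred_count, true_count = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            tp[t] += 1
    report = {}
    for label in sorted(true_count):
        # A label that is never predicted has no defined precision; use 0.0
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label]
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1, true_count[label])
    return report


# Made-up labels, not the notebook's data
y_true = [4, 4, 4, 3, 3, 2]
y_pred = [4, 3, 3, 3, 3, 3]
report = per_class_report(y_true, y_pred)

# Macro average: every class counts equally; weighted average: classes count by support
macro_f1 = sum(f1 for _, _, f1, _ in report.values()) / len(report)
weighted_f1 = sum(f1 * n for _, _, f1, n in report.values()) / len(y_true)
print(report)
print(macro_f1, weighted_f1)
```

Because the macro average treats all classes equally, a class whose row is all zeros pulls it down much more than it pulls down the weighted average, which is dominated by the frequent ratings.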



#### Finetuned Transformer Model for Sentiment Analysis

Fine-tuning a transformer model for sentiment analysis involves taking a pre-trained transformer model (such as BERT, GPT, or others) and training it on a dataset specifically designed for sentiment analysis tasks.

#### Purpose & Resources of Finetuned Transformer Model
Fine-tuning a transformer model serves the purpose of adapting a pre-trained model to a specific downstream task or dataset. Transformer models, like BERT, GPT, or others, are typically pre-trained on large datasets and learn general language representations. However, these pre-trained models may not be optimized for specific tasks, such as sentiment analysis, question answering, or named entity recognition.

The main purposes of fine-tuning a transformer model are:

1. **Task-Specific Adaptation:**
   - Fine-tuning allows you to adapt a pre-trained model to your specific task or domain. For example, you can take a transformer model pre-trained on a general corpus and fine-tune it on a sentiment analysis dataset to make it specific to sentiment prediction.

2. **Improved Performance on Specific Tasks:**
   - Fine-tuning helps improve the model's performance on a specific task by leveraging the knowledge and representations learned during pre-training. The model can capture task-specific nuances and patterns during the fine-tuning process.

3. **Reduced Training Time and Resources:**
   - Training a transformer model from scratch requires substantial computational resources and time. Fine-tuning is more computationally efficient since it leverages the knowledge already present in a pre-trained model, saving both time and resources.

4. **Transfer Learning:**
   - Fine-tuning enables transfer learning, where knowledge gained from a source task (pre-training) is transferred to a target task (fine-tuning). This is especially useful when the target task has limited labeled data.

Here's a simplified process for fine-tuning a transformer model:

- **Pre-training:** Train a transformer model on a large corpus with a self-supervised objective (e.g., predicting missing words in a sentence).

- **Fine-tuning:** Further train the pre-trained model on a smaller dataset related to your specific task (e.g., sentiment analysis) with labeled examples. This fine-tuning process updates the model's parameters to be more task-specific.

The fine-tuned model can then be used for making predictions on new data in the target domain or task.

In summary, fine-tuning allows you to take advantage of pre-trained models' general language understanding and adapt them to specific tasks, improving performance and efficiency.
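
For a classification task, the fine-tuning step can be written as a standard objective. The notation below is an assumption made for illustration and does not come from the notebook: $f_\theta$ is the pre-trained encoder, $h_i = f_\theta(x_i)$ its representation of review $x_i$, $W = (w_1, \dots, w_K)$ and $b$ the newly added classification layer, $K$ the number of classes, and $N$ the number of labeled examples.

$$
\hat{p}(y = k \mid x_i) = \frac{\exp\left(w_k^\top h_i + b_k\right)}{\sum_{j=1}^{K} \exp\left(w_j^\top h_i + b_j\right)},
\qquad
\mathcal{L}(\theta, W, b) = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}(y = y_i \mid x_i)
$$

Fine-tuning minimizes $\mathcal{L}$ over both the encoder parameters $\theta$, which start from their pre-trained values, and the classification layer $(W, b)$, which starts from a random initialization.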

In [11]:
'''                                 Optionally drop the lowest rating (1 originally, 0 after
                                    the shift above). The only reason is to reduce
                                    computation time when generating the small random +
                                    balanced dataset in the later part. The filter is left
                                    commented out here.
'''
#df = df[df['rating'] != 0.0].copy()
#print ('Shape of dataframe after removing rating 0 :', df.shape )

Shape of dataframe after removing rating 0 : (924, 6)


#### BERT Finetuned Transformer Model for Sentiment Analysis:

Key strengths of BERT:

*Contextual Understanding:* BERT captures bidirectional context for better understanding.

*Pre-training on Large Corpora:* Pre-trained on large datasets for rich language representations.

*Transfer Learning:* Transfers pre-learned knowledge to specific tasks.

*Fine-tuning for Specific Tasks:* Adapts pre-trained BERT to task-specific nuances.

*Handling Complex Sentiments:* Effectively handles complex and nuanced sentiments.

*Out-of-the-Box Performance:* Strong performance without extensive feature engineering.

*Handling Polysemy and Ambiguity:* Deals well with multiple word meanings and ambiguity.

*State-of-the-Art Results:* Achieved state-of-the-art results on several NLP benchmarks when it was introduced.

In [14]:
'''                            Only about 12% of the dataset is randomly selected and then
                               split into training and test sets for a quick analysis.
'''

from sklearn.model_selection import train_test_split

total_df = df.sample(frac=0.1255, random_state=None) 
print('Shape of the Randomly Selected Dataset : ',total_df.shape)
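# Note: test_size=0.8 holds out 80% of the sample for testing, so only about 20% is used for fine-tuning
X_train, X_test, y_train, y_test = train_test_split(total_df['text'], total_df['rating'], test_size=0.8, random_state=None)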
print('Unique set of Labels in our training/testing dataset',set(y_train.values), set(y_test.values))
#print(set(y_test.values))


Shape of the Randomly Selected Dataset :  (121, 6)
Unique set of Labels in our training/testing dataset {0.0, 1.0, 2.0, 3.0, 4.0} {0.0, 1.0, 2.0, 3.0, 4.0}
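
An earlier comment mentions a small random and balanced dataset, while `df.sample` above draws rows at random without balancing the ratings, and `random_state=None` makes each run differ. One possible way to draw a reproducible, per-class sample is sketched below in plain Python. The helper `balanced_sample` and its parameters `per_class` and `seed` are hypothetical names, and the data is made up.

```python
import random
from collections import defaultdict


def balanced_sample(texts, labels, per_class, seed=0):
    """Draw up to `per_class` examples from each label, reproducibly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    sample = []
    for label in sorted(by_label):
        items = by_label[label]
        # Rare classes contribute everything they have
        chosen = rng.sample(items, min(per_class, len(items)))
        sample.extend((text, label) for text in chosen)
    rng.shuffle(sample)
    return sample


# Made-up data: label 4 dominates, label 0 is rare
texts = [f"review {i}" for i in range(12)]
labels = [4] * 8 + [3] * 3 + [0] * 1
sample = balanced_sample(texts, labels, per_class=2)
print(sample)
```

With pandas, `df.groupby('rating').sample(...)` offers a similar per-group draw; the plain-Python version is used here only to keep the sketch dependency-free.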


In [15]:
'''                             Trained the BERT Model for Sentiment Analysis

'''
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # transformers' own AdamW is deprecated; use the PyTorch optimizer
from torch.utils.data import DataLoader, TensorDataset
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5)  

# Tokenize and encode the dataset
texts = X_train.tolist()
labels = torch.tensor(y_train.values, dtype=torch.long)
#labels = torch.clamp(labels - 1, min=0)
#print (labels)
encoded_texts = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

# Create a DataLoader for training
dataset = TensorDataset(encoded_texts['input_ids'], encoded_texts['attention_mask'], labels)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Set up the optimizer (the model computes the cross-entropy loss itself when labels are passed)
optimizer = AdamW(model.parameters(), lr=1e-5)

# Fine-tune the model
num_epochs = 3
model.train()  # from_pretrained returns the model in eval mode; enable dropout for training
for epoch in range(num_epochs):
    for batch in dataloader:
        input_ids, attention_mask, label = batch
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=label)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Save the fine-tuned model
model.save_pretrained('fine_tuned_bert_sentiment_model')
tokenizer.save_pretrained('fine_tuned_bert_sentiment_model')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


('fine_tuned_bert_sentiment_model\\tokenizer_config.json',
 'fine_tuned_bert_sentiment_model\\special_tokens_map.json',
 'fine_tuned_bert_sentiment_model\\vocab.txt',
 'fine_tuned_bert_sentiment_model\\added_tokens.json')

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
from sklearn.metrics import accuracy_score, classification_report


# Load the fine-tuned model and tokenizer
tokenizer = BertTokenizer.from_pretrained('fine_tuned_bert_sentiment_model')
model = BertForSequenceClassification.from_pretrained('fine_tuned_bert_sentiment_model')

#def test_BERT(text_to_predict):
# Text to predict sentiment for
text_to_predict = X_test.tolist()
# Tokenize and encode the text
encoded_text = tokenizer(text_to_predict, padding=True, truncation=True, return_tensors='pt')

# Forward pass through the model (evaluation mode, no gradient tracking)
model.eval()
with torch.no_grad():
    outputs = model(**encoded_text)

# Access the logits (raw scores before softmax) for sentiment prediction
logits = outputs.logits

# Perform softmax to get probabilities
probabilities = logits.softmax(dim=-1)

# The most probable class is the predicted label
predicted_sentiment = torch.argmax(probabilities, dim=-1).tolist()
#print (y_test, predicted_sentiment)

# Evaluate the model
accuracy = accuracy_score(y_test, predicted_sentiment)
print(f"Accuracy: {accuracy:.2f}")

# Display classification report
print("Classification Report:")
print(classification_report(y_test, predicted_sentiment))

636    1.0
288    4.0
116    4.0
435    3.0
430    4.0
      ... 
944    1.0
489    3.0
26     4.0
439    4.0
585    4.0
Name: rating, Length: 97, dtype: float64 [3, 3, 4, 4, 4, 4, 4, 3, 4, 3, 3, 4, 3, 4, 3, 3, 3, 4, 4, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 4, 3, 3, 4, 3, 3, 4, 4, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3]
Accuracy: 0.44
Classification Report:
              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00         2
         1.0       0.00      0.00      0.00         5
         2.0       0.00      0.00      0.00        10
         3.0       0.40      0.69      0.51        39
         4.0       0.53      0.39      0.45        41

    accuracy                           0.44        97
   macro avg       0.19      0.22      0.19        97
weighted avg       0.39      0.44      0.40        97
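
The evaluation cell turns logits into probabilities with softmax and then takes the argmax. A small plain-Python sketch of those two steps follows, using made-up logits for a single review. Since softmax preserves the ordering of its inputs, the predicted class is the same whether the argmax is taken over logits or probabilities; the probabilities are mainly useful as a confidence estimate.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits for one review over the five rating classes (0..4)
logits = [-1.2, -0.8, 0.1, 1.5, 1.1]
probs = softmax(logits)
predicted = argmax(probs)
print(probs, predicted)
```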





#### Train a Model
##### *Train a Multinomial Naive Bayes (NB) machine learning model based on TF-IDF*

**Algorithm: Training Multinomial Naive Bayes (NB) Model based on TF-IDF:**

1. **Text Preprocessing:**
   - Tokenize and preprocess the text data, including steps like lowercasing, removing stop words, and stemming if needed.

2. **TF-IDF Vectorization:**
   - Convert the text data into a numerical format using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization.
   - Calculate the TF-IDF scores for each term in the documents.

3. **Labeling:**
   - Assign labels to the documents based on the corresponding categories or classes.

4. **Multinomial Naive Bayes Training:**
   - Train the Multinomial Naive Bayes model using the TF-IDF vectors and the assigned labels.
   - The model estimates the probabilities of each term's occurrence given the class and uses Bayes' theorem to make predictions.

5. **Model Evaluation (Optional):**
   - Evaluate the trained model on a separate validation set or through cross-validation to assess its performance.

6. **Inference (Prediction):**
   - Use the trained model to predict the class or category of new, unseen text data.
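
The TF-IDF step can be sketched from scratch as follows. This is a minimal illustration, assuming the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents as the `TfidfVectorizer` default. Tokenization here is plain whitespace splitting, which is simpler than the vectorizer's own, and the function name `tfidf` and the example documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) where rows[i][j] is the L2-normalised
    TF-IDF weight of vocabulary[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


docs = ["good food good service", "bad service", "good place"]
vocab, rows = tfidf(docs)
print(vocab)
print(rows[1])
```

In the second document, "bad" receives a higher weight than "service" because it occurs in fewer documents, which is the effect of the IDF term.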

**Benefits of Multinomial Naive Bayes based on TF-IDF:**

1. **Efficiency:**
   - Computationally efficient and fast, making it suitable for large datasets and real-time applications.

2. **Simple and Interpretable:**
   - Easy to implement and interpret, with a clear probabilistic framework.

3. **Handles Multiclass Classification:**
   - Well-suited for multiclass classification problems, where documents can belong to more than two categories.

4. **Works well with Text Data:**
   - Effective for text classification tasks, especially when dealing with a large number of features (words in a vocabulary).

5. **Naive Independence Assumption:**
   - The Naive Bayes assumption of feature independence simplifies the model and often performs well in practice.

6. **TF-IDF for Feature Representation:**
   - TF-IDF provides a meaningful representation of terms in documents, highlighting important words while downplaying common ones.

7. **Adaptable to Streaming Data:**
   - Naive Bayes models can be updated incrementally, making them adaptable to streaming or changing data.
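
The training and prediction steps of Multinomial Naive Bayes can also be sketched by hand. The version below works on raw word counts rather than TF-IDF weights, uses additive smoothing with the same `alpha=0.1` as the notebook cell, and skips words that were not seen in training. The function names `train_nb` and `predict_nb` and the tiny corpus are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from word counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_docs.items():
        # Additive smoothing keeps unseen (word, class) pairs from getting probability zero
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_like = {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab}
        model[label] = (math.log(n_docs / len(docs)), log_like)
    return model


def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[w] for w in doc.split() if w in log_like)
    return max(model, key=score)


docs = ["good food", "great service", "bad food", "slow bad service"]
labels = ["pos", "pos", "neg", "neg"]
nb = train_nb(docs, labels, alpha=0.1)
print(predict_nb(nb, "good service"))
print(predict_nb(nb, "bad food"))
```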


In [24]:
# Train a suitable machine learning model based on TF-IDF

# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['rating'], test_size=0.05, random_state=None)


# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_tfidf = vectorizer.transform(X_test)

# Train a machine learning model (Naive Bayes in this example)
# Add alpha as a hyperparameter
alpha = 0.1  
classifier = MultinomialNB(alpha=alpha)
classifier.fit(X_train_tfidf, y_train)

# Make predictions on the test set
y_pred = classifier.predict(X_test_tfidf)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Display classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Accuracy: 0.63
Classification Report:
              precision    recall  f1-score   support

         1.0       0.00      0.00      0.00         4
         2.0       0.00      0.00      0.00         4
         3.0       0.56      0.59      0.57        17
         4.0       0.68      0.88      0.76        24

    accuracy                           0.63        49
   macro avg       0.31      0.37      0.33        49
weighted avg       0.52      0.63      0.57        49






 
 > Use scikit-learn's TfidfVectorizer to convert the raw text into a TF-IDF feature representation.

 > The Multinomial Naive Bayes is trained on the TF-IDF transformed training data.

 > Predictions are made on the test set.

 > The model is evaluated using accuracy and a classification report.

##### *Train a Logistic Regression Machine Learning Model based on Word Embedding*

**Training a Logistic Regression Model Based on Word Embedding:**

1. **Word Embedding:**
   - Preprocess text data and tokenize it into words.
   - Convert words into dense vectors using word embedding techniques like Word2Vec, GloVe, or embeddings layers in deep learning models.

2. **Feature Extraction:**
   - For each document, obtain the word embeddings of individual words.
   - Aggregate word embeddings to obtain a fixed-size feature vector for each document.

3. **Model Training:**
   - Use the aggregated word embeddings as input features for logistic regression.
   - Train a logistic regression model to predict the class labels (the five rating classes in this notebook) from the feature vectors.

4. **Prediction:**
   - Given a new document, tokenize and convert it into word embeddings.
   - Aggregate the word embeddings and use the trained logistic regression model to predict the class label.
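
The aggregation step is usually a simple mean over word vectors. The sketch below uses a made-up three-dimensional embedding table in place of a trained Word2Vec model; `TOY_EMBEDDINGS` and `document_vector` are hypothetical names. Out-of-vocabulary words are ignored, and a document with no known words falls back to a zero vector so that no division by zero occurs.

```python
# Toy 3-dimensional embeddings, invented for illustration (a trained Word2Vec
# model would supply real vectors with far more dimensions)
TOY_EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "food": [0.1, 0.8, 0.2],
    "bad": [-0.9, 0.1, 0.0],
}
DIM = 3


def document_vector(tokens, embeddings=TOY_EMBEDDINGS, dim=DIM):
    """Mean of the vectors of in-vocabulary tokens; zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


print(document_vector(["good", "food"]))
print(document_vector(["good", "unknownword", "food"]))  # unknown word is ignored
print(document_vector([]))                               # empty document
```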

**Benefits of Logistic Regression with Word Embedding:**

1. **Captures Semantic Information:**
   - Word embeddings capture semantic relationships between words, so words with similar meanings receive similar vectors. Static embeddings such as Word2Vec assign one vector per word regardless of context.

2. **Variable-Length Input Handling:**
   - Word embeddings enable handling variable-length documents by converting them into fixed-size feature vectors.

3. **Interpretability:**
   - Logistic regression assigns one weight per input feature, which makes the model easy to inspect.
   - With averaged embeddings the features are embedding dimensions rather than words, so the weights do not directly show which words drive a prediction.

4. **Efficient for Linear Separation:**
   - Logistic regression is effective when the relationship between features and the target is approximately linear.
   - It can perform well in scenarios where the decision boundary is relatively simple.

5. **Scalability:**
   - Word embeddings can be pre-trained on large corpora and then fine-tuned for specific tasks.
   - This leverages the benefits of transfer learning, especially when labeled data is limited.

6. **Suitable for Binary and Multiclass Classification:**
   - Logistic regression is a natural choice for binary classification problems, such as sentiment analysis or spam detection, and it extends to multiclass problems such as the five rating classes used here.


In [25]:
# Word embeddings are dense vector representations of words that capture semantic
# relationships between words. One popular method for generating word embeddings is
# Word2Vec. This cell shows how to train a sentiment analysis model using Word2Vec
# embeddings with Python and gensim.
import gensim
import numpy as np
from sklearn.linear_model import LogisticRegression
from gensim.models import Word2Vec

# Tokenize the sentences into words
tokenized_train = [gensim.utils.simple_preprocess(text) for text in X_train]
tokenized_test = [gensim.utils.simple_preprocess(text) for text in X_test]

# Train Word2Vec model with hyperparameters vector_size=100 and window=5
vector_size = 100
window = 5
# Passing `sentences` to the constructor already builds the vocabulary and trains the model
model = Word2Vec(sentences=tokenized_train, vector_size=vector_size, window=window, min_count=1, workers=4)
# Continue training for 10 more epochs
model.train(tokenized_train, total_examples=len(tokenized_train), epochs=10)


# Function to get the vector representation of a sentence (mean of its word vectors)
def get_sentence_vector(sentence):
    vectors = [model.wv[word] for word in sentence if word in model.wv]
    if not vectors:
        # Empty review or no in-vocabulary words: fall back to a zero vector
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# Convert sentences to vectors
X_train_vectors = [get_sentence_vector(sentence) for sentence in tokenized_train]
X_test_vectors = [get_sentence_vector(sentence) for sentence in tokenized_test]

# Train a machine learning model (Logistic Regression in this example)
classifier = LogisticRegression()
classifier.fit(X_train_vectors, y_train)

# Make predictions on the test set
y_pred = classifier.predict(X_test_vectors)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Display classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))




Accuracy: 0.45
Classification Report:
              precision    recall  f1-score   support

         1.0       0.00      0.00      0.00         4
         2.0       0.00      0.00      0.00         4
         3.0       0.00      0.00      0.00        17
         4.0       0.47      0.92      0.62        24

    accuracy                           0.45        49
   macro avg       0.12      0.23      0.15        49
weighted avg       0.23      0.45      0.30        49



