<center><font size=10>Transformers - Hands-on</font></center>

## **Problem Statement**

In the fast-evolving landscape of the entertainment industry, it is important to gauge audience sentiments towards movie releases. Understanding the sentiments expressed in movie reviews is crucial for shaping marketing strategies, refining content creation, and ultimately enhancing the overall viewer experience. However, manually analyzing an extensive volume of reviews is time-consuming and may not capture nuanced sentiments at scale. To address this, we aim to develop an ML-based sentiment analyzer that automatically evaluates movie reviews, providing actionable insights into audience perceptions.

### **Data Dictionary**

- **review:** review of a movie
- **sentiment:** indicates the sentiment of the review (0 for a negative review and 1 for a positive review)

## **Installing Necessary libraries**

In [None]:
# installing the libraries for transformers
#pip install -U -q sentence-transformers transformers bitsandbytes accelerate sentencepiece

## **Importing Necessary Libraries**

In [None]:
# to read and manipulate the data
import pandas as pd
import numpy as np
pd.set_option('max_colwidth', None)    # showing the full contents of each column without truncation

# to visualise data
import matplotlib.pyplot as plt
import seaborn as sns

# Deep Learning library
import torch

# to load transformer models
from sentence_transformers import SentenceTransformer
from transformers import T5Tokenizer, T5ForConditionalGeneration, pipeline, BitsAndBytesConfig

# to split the data
from sklearn.model_selection import train_test_split

# to compute performance metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# To build a Random Forest model
from sklearn.ensemble import RandomForestClassifier


# to ignore unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

## **Importing the dataset**

In [None]:
# loading data into a pandas dataframe
reviews = pd.read_csv("data/movie_reviews.csv")

In [None]:
# creating a copy of the data
data = reviews.copy()

## **Data Overview**

### **Checking the first 5 rows**

In [None]:
data.head(5)

* Here, a sentiment value of **0 is negative** and **1 is positive**.

### **Checking the shape of the data**

In [None]:
data.shape

* The dataset has 10000 rows and 2 columns.

### **Checking for missing values**

In [None]:
data.isnull().sum()

* There are no missing values in the data

### **Checking for duplicate values**

In [None]:
# checking for duplicate values
data.duplicated().sum()

In [None]:
# keeping only the first occurrence of duplicate values and dropping the rest
data = data.drop_duplicates(keep = 'first')

In [None]:
# resetting the index of the dataframe
data = data.reset_index(drop = True)

### **Checking the distribution of sentiments**

In [None]:
sns.countplot(data=data, x='sentiment');

- There are almost an equal number of positive and negative reviews.


## **Semantic Search**

### **Defining the model**

We'll be using the **all-MiniLM-L6-v2** model here.

💡 The **all-MiniLM-L6-v2** model is an all-round (**all**) model trained on a large and diverse dataset of over 1 billion training samples and generates state-of-the-art sentence embeddings of 384 dimensions.

📊  It is a language model (**LM**) that has 6 transformer encoder layers (**L6**) and is a smaller model (**Mini**) trained to mimic the performance of a larger model (BERT).

🛠️ Potential use-cases include text classification, sentiment analysis, and semantic search.

In [None]:
# defining the model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# there are many other models to choose from too!
# https://www.sbert.net/docs/pretrained_models.html
# https://huggingface.co/spaces/mteb/leaderboard
# model = SentenceTransformer('BAAI/bge-base-en-v1.5')

### **Basic Examples**

In [None]:
model.encode(['hello, my name is Dan!'])

In [None]:
model.encode(['hello, my name is Dan?'])

In [None]:
# defining a function to compute the cosine similarity between two embedding vectors
def cosine_score(text):
    # encoding the text
    embeddings = model.encode(text)

    # calculating the L2 norm of the embedding vector
    norm1 = np.linalg.norm(embeddings[0])
    norm2 = np.linalg.norm(embeddings[1])

    # computing the cosine similarity
    cosine_similarity_score = ((np.dot(embeddings[0],embeddings[1]))/(norm1*norm2))

    return cosine_similarity_score

In [None]:
sentence_1 = "The cat is on the mat."
sentence_2 = "The mat has a cat on it."

cosine_score([sentence_1, sentence_2])

- The **high cosine similarity score** indicates that the sentences are **semantically similar**.
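
To build intuition for the formula used in `cosine_score`, we can check it by hand on small vectors. The sketch below uses made-up 3-dimensional vectors (they are assumptions for illustration, not real model embeddings): two vectors pointing in the same direction should score close to 1, and two orthogonal vectors should score close to 0.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical toy vectors, chosen only for illustration
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice the length
c = [-3.0, 0.0, 1.0]   # orthogonal to a (their dot product is 0)

same_direction = cosine_similarity(a, b)
orthogonal = cosine_similarity(a, c)
print(same_direction, orthogonal)
```

Note that the length of a vector does not matter here: only the direction does, which is why `b` scores the same as `a` itself would.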


In [None]:
sentence_1 = "Roses are red, violets are blue."
sentence_2 = "The Earth orbits the Sun in our solar system."

cosine_score([sentence_1, sentence_2])

- The **low cosine similarity score** indicates that the sentences are **semantically dissimilar**.

In [None]:
sentence_1 = "My name is Mark and I love football."
sentence_2 = "A strange object was found in the Mariana Trench."

cosine_score([sentence_1, sentence_2])

- The **lower cosine similarity score** indicates that the sentences are even more **semantically dissimilar**.


### **Encoding the dataset**

In [None]:
# setting the device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# encoding the dataset
embedding_matrix = model.encode(data['review'].tolist(), device=device, show_progress_bar=True)    # passing a plain list of strings

In [None]:
# printing the shape of the embedding matrix
embedding_matrix.shape

In [None]:
# printing the embedding vector of the first review in the dataset
embedding_matrix[0,:]

### **Querying from the dataset**

**Now, let's search for similar reviews in our dataset.**

In [None]:
# defining a function to find the top k similar sentences for a given query
def top_k_similar_sentences(embedding_matrix,query_text,k):
    # encoding the query text
    query_embedding = model.encode(query_text)

    # computing the dot product between the query vector and every encoded review vector
    # (this equals the cosine similarity only when the embeddings are unit-length, which all-MiniLM-L6-v2 produces)
    score_vector = np.dot(embedding_matrix,query_embedding)

    # sorting the scores in descending order and choosing the first k
    top_k_indices = np.argsort(score_vector)[::-1][:k]

    # returning the corresponding reviews
    return data.loc[list(top_k_indices), 'review']
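
The ranking step relies on the dot product behaving like cosine similarity, which holds only for unit-length vectors. The sketch below illustrates this with made-up 2-dimensional vectors (assumptions for illustration, not model output): ranking by the raw dot product favours a long vector, while ranking after normalization favours the vector that points in the same direction as the query.

```python
import numpy as np

# hypothetical 2-D "embeddings" for four documents (not model output)
docs = np.array([[3.0, 4.0], [10.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
query = np.array([0.6, 0.8])
k = 2

# ranking by the raw dot product: long vectors get an advantage
raw_scores = docs @ query
raw_top_k = np.argsort(raw_scores)[::-1][:k]

# normalizing every row to unit length so that dot product == cosine similarity
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)
cos_scores = docs_unit @ query_unit
cos_top_k = np.argsort(cos_scores)[::-1][:k]

print('raw dot product top-k:', raw_top_k)
print('cosine similarity top-k:', cos_top_k)
```

If we switch to a model that does not return normalized embeddings, normalizing the rows as above before ranking is one way to keep the scores comparable.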

In [None]:
# defining the query text
query_text = "Horror movies"

# displaying the top 5 similar sentences
top_k_reviews = top_k_similar_sentences(embedding_matrix,query_text,5)

for i in top_k_reviews:
    print(i, end="\n\n")

In [None]:
# defining the query text
query_text = "Action movie with lots of car chases"

# displaying the top 5 similar sentences
top_k_reviews = top_k_similar_sentences(embedding_matrix,query_text,5)

for i in top_k_reviews:
    print(i, end="\n\n")

In [None]:
# defining the query text
query_text = "The movie wasn't great but it delivered as per the expectations."

# displaying the top 5 similar sentences
top_k_reviews = top_k_similar_sentences(embedding_matrix,query_text,5)

for i in top_k_reviews:
    print(i, end="\n\n")

### **Categorization of Reviews**

- Let's try categorizing some of the reviews based on queries that can act as category identifiers.

- We'll choose
    - one highly positive query
    - one moderately positive query
    - one highly negative query
    - one moderately negative query

In [None]:
# queries acting as category identifiers
queries_for_categorization = [
    "Overall a great movie that ticks all the right boxes.",
    "The movie wasn't that great but it delivered as per the expectations.",
    "The movie was a bad experience with bad direction and poor story.",
    "The plot was confusing but the acting performances were okay. The movie was mediocre at best."
]

**Now, we'll use the model to identify movie reviews that are most similar to the above queries.**

In [None]:
# dictionary to store the reviews for each of the categories
categorized_reviews = {}

# number of reviews to consider for similarity
k = 500

# looping over the queries and updating the values to the top review sentences similar to the queries
for query in queries_for_categorization:
    categorized_reviews[query] = top_k_similar_sentences(embedding_matrix, query, k)

**Let's check the results.**

In [None]:
i = 0
print('Query Text:', queries_for_categorization[i], end='\n')
print('Similar Reviews:', end='\n\n')
categorized_reviews[queries_for_categorization[i]].head(10)

In [None]:
i = 1
print('Query Text:', queries_for_categorization[i], end='\n')
print('Similar Reviews:', end='\n\n')
categorized_reviews[queries_for_categorization[i]].head(10)

In [None]:
i = 2
print('Query Text:', queries_for_categorization[i], end='\n')
print('Similar Reviews:', end='\n\n')
categorized_reviews[queries_for_categorization[i]].head(10)

In [None]:
i = 3
print('Query Text:', queries_for_categorization[i], end='\n')
print('Similar Reviews:', end='\n\n')
categorized_reviews[queries_for_categorization[i]].head(10)

**Important Note**
1. It is important to note that we loaded a pre-trained model.
2. As the model was not trained or fine-tuned on this data, the performance may not be excellent.
3. As we are doing semantic search and not clustering, there may be reviews that fall in multiple categories.
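
One way to see how much the categories overlap is to intersect the row indices retrieved for each query. The sketch below uses made-up index lists and category names (assumptions for illustration, not results from this dataset).

```python
from itertools import combinations

# hypothetical top-k row indices per category (not results from the dataset)
top_indices = {
    "highly positive": [0, 4, 7, 9],
    "moderately positive": [4, 5, 9, 12],
    "highly negative": [2, 3, 8, 11],
}

# for every pair of categories, collect the row indices that appear in both
overlaps = {}
for (name_a, idx_a), (name_b, idx_b) in combinations(top_indices.items(), 2):
    overlaps[(name_a, name_b)] = sorted(set(idx_a) & set(idx_b))

for pair, shared in overlaps.items():
    print(pair, '->', shared)
```

With the results stored in `categorized_reviews`, the index lists could be taken from `categorized_reviews[query].index`, since each value is a pandas Series that keeps the original row labels.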

## **Sentiment Analysis**

### **Random Forest classifier!**

In [None]:
# Process the data

X = embedding_matrix
y = data["sentiment"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
# build a RF model

# Building the model
rf_transformer = RandomForestClassifier(n_estimators = 100, max_depth = 7, random_state = 42)

# Fitting on train data
rf_transformer.fit(X_train, y_train)

In [None]:
# Predicting on train data
y_pred_train = rf_transformer.predict(X_train)

# Predicting on test data
y_pred_test = rf_transformer.predict(X_test)

In [None]:
# creating a function to plot the confusion matrix
def plot_confusion_matrix(actual, predicted):
    cm = confusion_matrix(actual, predicted)

    plt.figure(figsize = (5, 4))
    label_list = [0, 1]
    sns.heatmap(cm, annot = True,  fmt = '.0f', xticklabels = label_list, yticklabels = label_list)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

In [None]:
plot_confusion_matrix(y_train, y_pred_train)

In [None]:
plot_confusion_matrix(y_test, y_pred_test)
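
The confusion matrix can also be summarized as a few numbers. The sketch below computes accuracy, precision, and recall from the four cells of a binary confusion matrix, using made-up labels (assumptions for illustration, not the model's predictions).

```python
import numpy as np

# hypothetical labels, only for illustration
actual    = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
predicted = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# the four cells of the binary confusion matrix
tp = int(np.sum((actual == 1) & (predicted == 1)))   # positives predicted as positive
tn = int(np.sum((actual == 0) & (predicted == 0)))   # negatives predicted as negative
fp = int(np.sum((actual == 0) & (predicted == 1)))   # negatives predicted as positive
fn = int(np.sum((actual == 1) & (predicted == 0)))   # positives predicted as negative

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)    # of the reviews predicted positive, how many are positive
recall = tp / (tp + fn)       # of the positive reviews, how many were found

print(accuracy, precision, recall)
```

The `accuracy_score` and `classification_report` functions imported at the top of the notebook report the same quantities, for example `accuracy_score(y_test, y_pred_test)`. Comparing the train and test values is a quick check for overfitting.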

### **Hugging Face pre-trained model!**

In [None]:
sentiment_hf = pipeline("sentiment-analysis") # this uses the HF default sentiment analysis model

# you can choose to use any other model by including a model argument like:
# sentiment_hf = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")
# or
# nlptown/bert-base-multilingual-uncased-sentiment
# many more models can be found at https://huggingface.co/models?pipeline_tag=text-classification&sort=trending&search=sentiment
# these are all text-classification models...sentiment analysis is a special case of text classification

In [None]:

trial_data = ["I love this movie", "This movie is not very good at all!",'There is a cat outside.']
sentiment_hf(trial_data)

In [None]:
hf_review_dict = sentiment_hf(data['review'].to_list(),truncation=True) # very long reviews will be truncated to 512 tokens...which isn't great!


In [None]:
hf_review_sent = [0]*len(data['review'])
for movie in range(len(data['review'])):
  if hf_review_dict[movie]['label']=='POSITIVE':
    hf_review_sent[movie] = 1
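
The same conversion can be written as an explicit label mapping. The sketch below uses a made-up list in the list-of-dictionaries shape that the pipeline returns; the scores are invented, and the `POSITIVE`/`NEGATIVE` label names are assumed to match the default model used above.

```python
# hypothetical pipeline output (invented scores), in the shape [{'label': ..., 'score': ...}, ...]
sample_output = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.61},
]

# explicit mapping from label names to the 0/1 encoding used in the dataset
label_to_int = {"POSITIVE": 1, "NEGATIVE": 0}

sample_sent = [label_to_int[item["label"]] for item in sample_output]
print(sample_sent)
```

An explicit dictionary raises a `KeyError` on an unexpected label instead of silently treating it as negative, which helps when swapping in a model with different label names (for example, star ratings).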

In [None]:
plot_confusion_matrix(y, hf_review_sent)

### **We could also use an LLM to ask about sentiment!**

#### **Defining the input and target variables**

In [None]:
X = data['review']
y = data["sentiment"]

#### **Defining the Model**

We'll be using the **Google FLAN-T5** model here.

💡 **FLAN-T5, developed by Google Research, is a "Fine-tuned LAnguage Net" (FLAN) with "Text-To-Text Transfer Transformer" (T5) architecture.**

📊 **FLAN-T5 excels in various NLP tasks**, including translation, classification, and question answering, and it's known for its speed and efficiency.

📋 FLAN-T5 comes in different sizes: small, base, large, XL, and XXL, offering customization options.

🛠️ Potential use-cases include text generation, classification, summarization, sentiment analysis, question-answering, translation, and chatbots.

In [None]:
# initializing a T5 tokenizer using the pre-trained model
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")

In [None]:
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

In [None]:
# initializing a T5 model for conditional generation using the pre-trained model "google/flan-t5-large"
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto")

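- The `quantization_config` defined above requests 8-bit loading for efficiency and lower memory usage, but it takes effect only when it is passed to `from_pretrained` as `quantization_config=quantization_config`. The call above does not pass it, so the model is loaded in its default precision. Note that 8-bit loading through bitsandbytes typically requires a CUDA GPU.
- We have set the device mapping to "auto" for automatic device assignment.
    - This automatically detects an available accelerator (such as a GPU) and places the model on it.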

In [None]:
# defining a function to generate, process, and return a response
def generate_response(prompt):
    # using the tokenizer to create token IDs in tensor format and moving them to the same device as the model
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    # generating the model output in tensor format
    outputs = model.generate(input_ids, max_length=16, do_sample=True, temperature=0.001)
    # decoding the model output, dropping special tokens such as <pad> and </s>, and returning it
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

In the `model.generate()` call above, the following arguments were used:

1. `max_length`: This parameter determines the maximum length of the generated sequence. In the provided code, `max_length` is set to 16, which means the generated sequence will not exceed 16 tokens. That is enough here because the expected answer is a single digit.

2. `temperature`: The temperature parameter controls the level of randomness in the generation process. A higher temperature (e.g., closer to 1) makes the output more diverse and creative but potentially less focused, while a lower temperature (e.g., close to 0) produces more deterministic and focused but potentially repetitive outputs. In the code, temperature is set to 0.001, indicating a very low temperature and, consequently, a more deterministic sampling.

3. `do_sample`: This is a boolean parameter that determines whether to use sampling during generation (do_sample=True) or use greedy decoding (do_sample=False). When set to True, as in the provided code, the model samples from the distribution of predicted tokens at each step, introducing randomness in the generation process.
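
To see why a temperature of 0.001 makes sampling almost deterministic, we can apply the standard softmax-with-temperature formula to a few scores. This is a minimal sketch with made-up logits (an assumption for illustration); the actual `generate()` implementation may apply additional processing.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()    # subtracting the maximum for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# hypothetical scores for three candidate tokens
logits = [2.0, 1.0, 0.5]

p_default = softmax_with_temperature(logits, 1.0)     # probability spread over all tokens
p_cold = softmax_with_temperature(logits, 0.001)      # almost all probability on the top token

print(p_default.round(3))
print(p_cold.round(3))
```

At a very low temperature, nearly all of the probability mass sits on the highest-scoring token, so sampling with `do_sample=True` behaves almost like greedy decoding.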

### **Model Predictions**

In [None]:
# checking a review and its sentiment
print('Review:\t', X[4])
print('Actual Sentiment:\t', y[4])

In [None]:
# defining a prompt which tells the model what to do
sys_prompt = """
    Categorize the sentiment of the review as positive or negative.
    Return 1 for positive and 0 for negative.
"""

# predicting the sentiment using the model by incorporating the system prompt and the provided review text

pred_sent = generate_response(
    """
        {}
        Review text: '{}'
    """.format(sys_prompt, X[4])
)

print(pred_sent)

- The model was able to correctly identify the sentiment here.

**Note**: We'll discuss more about prompts, types of prompts, and how to effectively write them to optimize model outputs in upcoming classes.

In [None]:
# defining a function to generate a sentiment prediction
def predict_sentiment(review_text):
    pred = generate_response(
        """
            {}
            Review text: '{}'
        """.format(sys_prompt, review_text)
    )

    return pred

In [None]:
# making predictions with the model
predicted_sentiment = [int(predict_sentiment(X[item])[0]) for item in range(len(X))]
# must be careful with an LLM. It just gives you what it 'thinks' you want, which may not be what you actually want
# it's not uncommon for the returned value to not match the format you ask for!
# for example, several times predict_sentiment has returned '1 for positive' instead of just 1!
# that's why I grab the 0th entry from the returned string!
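
Taking the first character works when the response starts with a digit, but `int()` raises an error on an empty response or one that starts with a word. A more defensive parser is sketched below; `parse_sentiment` is a hypothetical helper (not part of any library), and the sample responses are made up.

```python
import re

def parse_sentiment(response):
    """Return 1 or 0 if the response contains a usable answer, else None."""
    text = response.strip().lower()

    # first choice: a standalone 0 or 1 anywhere in the response
    match = re.search(r"\b[01]\b", text)
    if match:
        return int(match.group())

    # fallback: the model answered with a word instead of a digit
    if "positive" in text:
        return 1
    if "negative" in text:
        return 0

    # nothing usable found; the caller decides how to handle it
    return None

# hypothetical model responses
samples = ["1", "1 for positive", "Negative", ""]
parsed = [parse_sentiment(s) for s in samples]
print(parsed)
```

Returning `None` for unparseable responses makes it possible to count how often the model ignored the requested format before computing any metrics.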

In [None]:
plot_confusion_matrix(y, predicted_sentiment)

## **Conclusion**

- We used the ***all-MiniLM-L6-v2*** model to do semantic search.
    - We first encoded the dataset using the model to generate embeddings of 384 dimensions.
    - Then we queried the dataset to find movie reviews similar to the query text we passed.
    - Finally, we categorized the movie reviews using queries that acted as category identifiers.

- We trained a Random Forest model to do sentiment analysis based on the embeddings produced by ***all-MiniLM-L6-v2***.

- We used the ***sentiment analysis*** pipeline from Hugging Face to analyze the sentiment of all the movie reviews.

- We used the ***Google FLAN-T5*** model to do sentiment analysis.
    - We first created a function that would take the input data, tokenize it, pass the tokenized data to the model for predictions, process the model output, and then return a response.
    - We then defined a prompt to tell the model what exactly it has to do.
    - We then made predictions with the model using the function and the prompt.

<font size=5 color='blue'>Power Ahead!</font>
___