<a href="https://colab.research.google.com/github/SaibalPatraDS/Hands-on-LLM/blob/main/Movie_Review_Sentiment_Analysis_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Movie Review Sentiment Analysis: Embeddings Method

**Flow**

1. Create embeddings using a pretrained model [`Feature Extraction`]
2. Use those embeddings as input features for a downstream classification task [`Classifier`]

In [None]:
!pip install transformers datasets sentence-transformers

### Loading the Datasets

In [2]:
## loading the data
from datasets import load_dataset
## Loading the Rotten Tomatoes movie review data
review_df = load_dataset(
    "cornell-movie-review-data/rotten_tomatoes"
)
review_df


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

## Classification Tasks that Leverage Embeddings

### Supervised Classification

In [3]:
## loading the packages
from sentence_transformers import SentenceTransformer

## load the model
model = SentenceTransformer(
    "sentence-transformers/all-mpnet-base-v2"
)
## creating Embeddings
training_embeddings = model.encode(review_df['train']["text"], show_progress_bar = True)
testing_embeddings = model.encode(review_df["test"]["text"], show_progress_bar = True)

In [5]:
## checking the shapes of the embedding matrices: (num_reviews, embedding_dim)
training_embeddings.shape, testing_embeddings.shape

((8530, 768), (1066, 768))

### Logistic Regression for Classification task

In [9]:
## loading logistic regression model
from sklearn.linear_model import LogisticRegression

## Train a logistic regression model on the training embeddings
lr = LogisticRegression(
    random_state=42
)
## Fit the model on the training embeddings and their labels
lr.fit(training_embeddings, review_df["train"]["label"])

In [10]:
## function to evaluate performance
from sklearn.metrics import classification_report

def evaluation_metrics(y_true, y_pred):
  """
  Classification Report for the Case
  """
  report  = classification_report(y_true, y_pred,
                                  target_names=["Negative Reviews", "Positive Reviews"])
  print(report)

In [12]:
## Testing with test data
y_pred = lr.predict(testing_embeddings)
## Evaluation Metrics
y_true = review_df["test"]["label"]
evaluation_metrics(y_true, y_pred)

                  precision    recall  f1-score   support

Negative Reviews       0.85      0.86      0.85       533
Positive Reviews       0.86      0.85      0.85       533

        accuracy                           0.85      1066
       macro avg       0.85      0.85      0.85      1066
    weighted avg       0.85      0.85      0.85      1066



### Conclusion

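1. With frozen sentence embeddings and a logistic regression classifier, we achieved about 85% accuracy on the test set, without fine-tuning the embedding model.
2. The `f1-score`, `precision`, and `recall` are all about 0.85 for both classes, so performance is balanced across negative and positive reviews.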
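
The split between a fixed feature extractor and a trained classifier can be sketched on synthetic data. This is only an illustration under stated assumptions: the two Gaussian clusters below are stand-ins for sentence embeddings, not real model output, and names such as `toy_features` and `toy_clf` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Stand-in "embeddings" (assumption): two Gaussian clusters in 16 dimensions,
# one per class. Real sentence embeddings would come from model.encode(...).
dim = 16
toy_negative = rng.normal(loc=-1.0, scale=1.0, size=(200, dim))
toy_positive = rng.normal(loc=1.0, scale=1.0, size=(200, dim))
toy_features = np.vstack([toy_negative, toy_positive])
toy_labels = np.array([0] * 200 + [1] * 200)

# Shuffle, then hold out 20% of the rows for testing.
order = rng.permutation(len(toy_labels))
train_idx, test_idx = order[:320], order[320:]

# Only the classifier is trained; the features themselves are never updated.
toy_clf = LogisticRegression(random_state=42, max_iter=1000)
toy_clf.fit(toy_features[train_idx], toy_labels[train_idx])

toy_accuracy = accuracy_score(
    toy_labels[test_idx], toy_clf.predict(toy_features[test_idx])
)
print(f"toy accuracy: {toy_accuracy:.2f}")
```

The clusters here are far easier to separate than real reviews, so the toy accuracy says nothing about the 85% reported above; the point is only that the classifier sees fixed vectors and nothing else.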

## Zero Shot Classification

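Leverage Sentence Transformers when no labels are available for the reviews: embed a short description of each class, then assign each review to the class whose description embedding is most similar.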
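
As a toy illustration of the nearest-label idea, the sketch below computes cosine similarity by hand with NumPy and picks the most similar label for each row. The 3-dimensional vectors are made-up values standing in for real embeddings, and `cosine_sim_matrix` is a hypothetical helper, not part of any library.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Made-up label "embeddings": row 0 = negative label, row 1 = positive label.
toy_label_embeddings = np.array([
    [1.0, 0.0, 0.2],
    [0.0, 1.0, 0.2],
])

# Made-up review "embeddings".
toy_review_embeddings = np.array([
    [0.9, 0.1, 0.3],  # points mostly along the negative label
    [0.2, 0.8, 0.1],  # points mostly along the positive label
    [0.1, 0.7, 0.5],  # also closer to the positive label
])

# One row per review, one column per label.
toy_sim = cosine_sim_matrix(toy_review_embeddings, toy_label_embeddings)

# The predicted class is the column with the highest similarity.
toy_pred = np.argmax(toy_sim, axis=1)
print(toy_pred)  # → [0 1 1]
```

The same two steps, a similarity matrix followed by a row-wise `argmax`, are what the cells below apply to the real review and label embeddings.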

In [15]:
## importing necessary libraries
import numpy as np

In [22]:
## Create Embeddings for the Label Classes
labels = ["A Very Negative Movie Review", "A Very Positive Movie Review"]
labels_embeddings = model.encode(labels)
# labels_embeddings.shape

In [23]:
## Cosine similarity between review embeddings and label embeddings
from sklearn.metrics.pairwise import cosine_similarity

## Prediction
sim_matrix = cosine_similarity(testing_embeddings, labels_embeddings)
y_pred = np.argmax(sim_matrix, axis = 1)

In [24]:
## Evaluation of Results
evaluation_metrics(y_true, y_pred)

                  precision    recall  f1-score   support

Negative Reviews       0.86      0.73      0.79       533
Positive Reviews       0.76      0.88      0.82       533

        accuracy                           0.80      1066
       macro avg       0.81      0.80      0.80      1066
    weighted avg       0.81      0.80      0.80      1066



### Conclusion

1. Even with `Zero Shot Classification`, which uses no labeled training data, we achieve about 80% accuracy.
2. `Precision` for Negative Reviews (0.86) is higher than `Precision` for Positive Reviews (0.76).
3. `Recall` for Positive Reviews (0.88) is higher than `Recall` for Negative Reviews (0.73), so the classifier leans toward predicting the positive class.
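
The precision and recall pattern above is what one would expect from a classifier that over-predicts one class. A minimal sketch with hypothetical toy labels (not the notebook's predictions) shows the mechanism; `per_class_precision_recall` is an illustrative helper written for this example.

```python
import numpy as np

def per_class_precision_recall(y_true, y_pred, cls):
    """Precision and recall for one class, computed from raw counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == cls) & (y_true == cls))  # correctly predicted cls
    fp = np.sum((y_pred == cls) & (y_true != cls))  # predicted cls, was other
    fn = np.sum((y_pred != cls) & (y_true == cls))  # missed cls
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# Hypothetical labels: 0 = negative, 1 = positive.
# The toy predictions label two negative reviews as positive.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_hat = [0, 0, 1, 1, 1, 1, 1, 1]

neg_precision, neg_recall = per_class_precision_recall(toy_true, toy_hat, cls=0)
pos_precision, pos_recall = per_class_precision_recall(toy_true, toy_hat, cls=1)

print(f"negative: precision={neg_precision:.2f}, recall={neg_recall:.2f}")
print(f"positive: precision={pos_precision:.2f}, recall={pos_recall:.2f}")
```

In the toy data, every review predicted negative really is negative (high negative precision), but half of the negatives are missed (low negative recall), while the positive class shows the opposite trade-off.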