# Sentiment Analysis
## Table of contents
* [Overview](#section-1)
* [Dataset](#section-2)
* [Objective](#section-3)
* [Costs](#section-4)
* [Data Loading](#section-5)
* [Preparing training data](#section-6)
* [Creating Dataset in Vertex AI](#section-7)
* [Training the model using Vertex AI](#section-8)
* [Deploy the Model to the endpoint](#section-9)
* [Prediction](#section-10)
* [Reviews visualisation](#section-11)
* [Clean-up](#section-12)

## Overview
<a name="section-1"></a>
This notebook demonstrates how to perform sentiment analysis on the Stanford movie reviews dataset using AutoML Natural Language: training a sentiment model, deploying it on Vertex AI, and getting predictions. 
<b>Note</b>: This notebook is designed to run on a managed notebooks instance of Vertex AI Workbench. Some components of this notebook may not work in other notebook environments.

## Dataset
<a name="section-2"></a>
The dataset used in this notebook is part of the [Stanford Sentiment Treebank Dataset](https://nlp.stanford.edu/sentiment/), which contains movie review phrases and their corresponding sentiment scores.

## Objective
<a name="section-3"></a>
In this notebook, you:

- Load the required data.
- Preprocess the data.
- Select the data required for the model.
- Load the dataset into a Vertex AI managed dataset.
- Train a sentiment model using AutoML Natural Language.
- Evaluate the model.
- Deploy the model on Vertex AI.
- Get predictions.
- Clean up.


## Costs
<a name="section-4"></a>
This tutorial uses the following billable components of Google Cloud:

- Vertex AI
- Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Kernel selection
Select the <b>Python</b> kernel when running this notebook on a Vertex AI managed notebooks instance, and make sure the following libraries are installed in the environment where the notebook runs.
- wordcloud
- pandas


The notebook also uses the following Google Cloud libraries.

- google.cloud.aiplatform
- google.cloud.storage


## Install required packages

! pip install wordcloud

If you are using Vertex AI Workbench, your environment already includes the remaining packages and you can skip the commands below. In other environments, run them as well.
- ! pip install google-cloud-aiplatform
- ! pip install fsspec
- ! pip install gcsfs
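
Before installing anything, it can help to see which of these packages are already importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module names are the import names used in this notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be imported in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (for example "google.cloud") is absent
            found = False
        if not found:
            missing.append(name)
    return missing


required = [
    "wordcloud",
    "pandas",
    "google.cloud.aiplatform",
    "google.cloud.storage",
    "fsspec",
    "gcsfs",
]
print("Missing modules:", missing_modules(required))
```

Only the modules reported as missing need a `pip install`.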


## Set your project ID

If you don't know your project ID, you may be able to get it using gcloud.

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
    # Export the variable's value; the bare name would store the literal text "PROJECT_ID"
    os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
    print("Project ID: ", PROJECT_ID)

## Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, create a timestamp for each session and append it to the names of the resources you create in this tutorial.

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
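
As an illustration of how the timestamp could be appended to a resource name, here is a small sketch. The helper `unique_name` and the underscore separator are assumptions for illustration; they are not part of the original notebook.

```python
from datetime import datetime


def unique_name(base: str, timestamp: str) -> str:
    """Append a timestamp to a resource name, separated by an underscore."""
    return f"{base}_{timestamp}"


TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
print(unique_name("sentimentanalysis", TIMESTAMP))
```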

## Authenticate your Google Cloud account

**If you are using Google Cloud Notebooks**, your environment is already authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

In the Cloud Console, go to the [Create service account key page](https://console.cloud.google.com/projectselector2/iam-admin/serviceaccounts?supportedpurview=project).

Click **Create service account.**

In the **Service account name** field, enter a name, and click **Create**.

In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI" into the filter box, and select **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

Click **Create**. A JSON file that contains your key downloads to your local environment.

Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.

In [None]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Google Cloud Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''
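
When running locally, a quick sanity check on the key path can catch a typo before any API call fails. The sketch below is an assumption-laden helper (the name `check_service_account_key` is hypothetical); it only verifies that the file exists, parses as JSON, and has the `"type": "service_account"` field that service account key files carry. It does not validate the key itself.

```python
import json
import os


def check_service_account_key(path: str) -> bool:
    """Return True if path is a JSON file that looks like a service account key."""
    if not path or not os.path.isfile(path):
        return False
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(data, dict) and data.get("type") == "service_account"


key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
print("Key file looks valid:", check_service_account_key(key_path))
```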

## Import required libraries and define constants

In [None]:
import os
import random
from typing import Dict, List, Optional, Sequence, Tuple, Union

import matplotlib.pyplot as plt
import pandas as pd
from google.cloud import aiplatform, storage
from wordcloud import STOPWORDS, WordCloud

In [None]:
LOCATION = "us-central1"
BUCKET_NAME = "{your_bucket_name}"

## Loading data 
<a name="section-5"></a>

In [None]:
phrases = pd.read_csv(
    "gs://vertex_ai_managed_services_demo/sentiment_analysis/stanfordSentimentTreebank/dictionary.txt",
    sep="|",
    header=None,  # dictionary.txt has no header row; don't consume the first phrase
    names=["text", "phrase ids"],
)
scores = pd.read_csv(
    "gs://vertex_ai_managed_services_demo/sentiment_analysis/stanfordSentimentTreebank/sentiment_labels.txt",
    sep="|",
)
df = phrases.merge(scores, how="left", on="phrase ids")
print(df.head(5))

In [None]:
print(max(df["sentiment values"]), min(df["sentiment values"]))

dictionary.txt has no header row, so its columns need to be named explicitly. It contains all phrases and their IDs, separated by a vertical bar (|). sentiment_labels.txt contains all phrase IDs and the corresponding sentiment scores, also separated by a vertical bar. Four classes are created by mapping the positivity probability to the following intervals:
[0, 0.25], (0.25, 0.5], (0.5, 0.75], (0.75, 1.0]
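
To make the file layout concrete, here is a small sketch that parses two made-up in-memory stand-ins for the files and merges them on the phrase ID. The sample rows are invented, and the header line of the labels file is an assumption based on the column names used in this notebook.

```python
import io

import pandas as pd

# Made-up stand-ins: dictionary.txt (no header) and sentiment_labels.txt (with header)
dictionary_txt = "a great film|0\nutterly dull|1\n"
labels_txt = "phrase ids|sentiment values\n0|0.9\n1|0.1\n"

phrases = pd.read_csv(
    io.StringIO(dictionary_txt), sep="|", header=None, names=["text", "phrase ids"]
)
scores = pd.read_csv(io.StringIO(labels_txt), sep="|")

# A left join keeps every phrase, even one without a score
merged = phrases.merge(scores, how="left", on="phrase ids")
print(merged)
```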

### Creating labels 

In [None]:
VERYNEGATIVE = 0
NEGATIVE = 1
POSITIVE = 2
VERYPOSITIVE = 3

In [None]:
bins = [0, 0.25, 0.5, 0.75, 1]
labels = [VERYNEGATIVE, NEGATIVE, POSITIVE, VERYPOSITIVE]
# include_lowest=True keeps scores of exactly 0 in the first bin, [0, 0.25]
df["label"] = pd.cut(
    df["sentiment values"], bins=bins, labels=labels, include_lowest=True
)
print(df.head())
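
One detail of `pd.cut` is worth checking on boundary values: by default the bins are open on the left, so a score of exactly 0 receives no label unless `include_lowest=True` is passed. The sketch below uses made-up scores to show the difference.

```python
import pandas as pd

values = pd.Series([0.0, 0.25, 0.26, 0.5, 0.75, 1.0])
bins = [0, 0.25, 0.5, 0.75, 1]
labels = [0, 1, 2, 3]

# Default: first bin is (0, 0.25], so 0.0 falls outside every bin
default_cut = pd.cut(values, bins=bins, labels=labels)
# include_lowest=True closes the first bin on the left: [0, 0.25]
inclusive_cut = pd.cut(values, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())
print(inclusive_cut.tolist())
```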

## Preparing training data
<a name="section-6"></a>

To train a sentiment analysis model, you provide representative samples of the type of content you want AutoML Natural Language to analyze, each labeled with a value indicating how positive the sentiment is within the content.

The sentiment score is an integer ranging from 0 (relatively negative) to a maximum value of your choice (positive). For example, to identify whether the sentiment is negative, neutral, or positive, label the training data with sentiment scores of 0 (negative), 1 (neutral), and 2 (positive). To capture more granularity with five levels of sentiment, still label documents with the most negative sentiment as 0 and use 4 for the most positive sentiment. The maximum sentiment score (sentiment_max) for that dataset would be 4.

This notebook trains on a subset of the original data that contains only the extreme positive and extreme negative samples, so the maximum sentiment score here is 1. The <i>ml_use</i> column can specify whether a sample is used for training, validation, or testing; if it is left empty, Vertex AI assigns the samples randomly. 
Each line in a CSV file refers to a single document. The following example shows the general format of a valid CSV file. The ml_use column is optional.

\[ml_use\],gcs_file_uri|"inline_text",sentiment,sentimentMax

For more information, see the official [documentation](https://cloud.google.com/vertex-ai/docs/datasets/prepare-text#sentiment-analysis).
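
As a rough illustration of the row format above, the following sketch writes two invented reviews with the standard `csv` module. The reviews and scores are made up; the point is that text containing commas or quotes is quoted automatically, which keeps each document on one well-formed line.

```python
import csv
import io

rows = [
    # ml_use is left empty so the split is assigned automatically
    ("", 'A "must-see", heartfelt film', 1, 1),
    ("", "Dull, lifeless and far too long", 0, 1),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)  # no header row is written
csv_text = buffer.getvalue()
print(csv_text)
```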


### Selecting subset data

In [None]:
subset_data = df[df["label"].isin([VERYNEGATIVE, VERYPOSITIVE])].reset_index(drop=True)
subset_data.head()

In [None]:
# Map VERYPOSITIVE to 1 and VERYNEGATIVE to 0
subset_data["label"] = subset_data["label"].apply(
    lambda x: 1 if x == VERYPOSITIVE else 0
)
subset_data["ml_use"] = ""
subset_data["sentimentMax"] = 1
subset_data = subset_data[["ml_use", "text", "label", "sentimentMax"]]
print(subset_data.head())

### Creating an import csv

In [None]:
FILE_NAME = "sentiment_data.csv"
# The import file format has no header row, so write data rows only
subset_data.to_csv(FILE_NAME, index=False, header=False)
# Upload the CSV file to Cloud Storage
BLOB_PATH = "sentiment_analysis/"
BLOB_NAME = os.path.join(BLOB_PATH, FILE_NAME)
bucket = storage.Client().bucket(BUCKET_NAME)
blob = bucket.blob(BLOB_NAME)
blob.upload_from_filename(FILE_NAME)
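
Object names in Cloud Storage use forward slashes, while `os.path.join` uses backslashes on Windows. If the notebook may run outside Linux, one option is to build the path with `posixpath` instead. The helper `gcs_uri` and the bucket name below are hypothetical.

```python
import posixpath


def gcs_uri(bucket_name: str, *parts: str) -> str:
    """Build a gs:// URI; posixpath always joins with forward slashes."""
    return f"gs://{bucket_name}/{posixpath.join(*parts)}"


print(gcs_uri("my-bucket", "sentiment_analysis/", "sentiment_data.csv"))
```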

## Creating Dataset in Vertex AI
<a name="section-7"></a>

The following code uses the Vertex AI SDK for Python to both create a dataset and import data. 

In [None]:
def import_data_text_sentiment_analysis(
    project: str,
    location: str,
    display_name: str,
    src_uris: Union[str, List[str]],
    sync: bool = True,
):
    aiplatform.init(project=project, location=location)

    ds = aiplatform.TextDataset.create(
        display_name=display_name,
        gcs_source=src_uris,
        import_schema_uri=aiplatform.schema.dataset.ioformat.text.sentiment,
        sync=sync,
    )

    ds.wait()

    print(ds.display_name)
    print(ds.resource_name)
    return ds

In [None]:
display_name = "sentimentanalysis"
src_uris = [f"gs://{BUCKET_NAME}/sentiment_analysis/sentiment_data.csv"]
dataset = import_data_text_sentiment_analysis(
    PROJECT_ID, LOCATION, display_name, src_uris
)

## Training the model using Vertex AI
<a name="section-8"></a>

The following code uses the Vertex AI SDK for Python to train a model on the dataset created above. You can get the dataset ID from the Datasets section of Vertex AI in the console, or from the resource name of the dataset object created above. The fraction_split arguments specify how the data is split between the training, validation, and test sets.
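
Resource names follow the pattern `projects/.../locations/.../datasets/...`, so the ID is the last path segment. The sketch below splits such a name into its parts; `parse_resource_name` is a hypothetical helper and the example name uses placeholder IDs.

```python
def parse_resource_name(resource_name: str) -> dict:
    """Split 'projects/P/locations/L/datasets/D' into a dict of its parts."""
    parts = resource_name.split("/")
    if len(parts) % 2 != 0:
        raise ValueError(f"Unexpected resource name: {resource_name!r}")
    # Even positions are collection names, odd positions are their IDs
    return dict(zip(parts[0::2], parts[1::2]))


example = "projects/123/locations/us-central1/datasets/456"
print(parse_resource_name(example)["datasets"])
```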

In [None]:
def create_training_pipeline_text_sentiment_analysis(
    project: str,
    location: str,
    display_name: str,
    dataset_id: str,
    model_display_name: Optional[str] = None,
    sentiment_max: int = 10,
    training_fraction_split: float = 0.8,
    validation_fraction_split: float = 0.1,
    test_fraction_split: float = 0.1,
    sync: bool = True,
):
    aiplatform.init(project=project, location=location)

    job = aiplatform.AutoMLTextTrainingJob(
        display_name=display_name,
        prediction_type="sentiment",
        sentiment_max=sentiment_max,
    )

    text_dataset = aiplatform.TextDataset(dataset_id)

    model = job.run(
        dataset=text_dataset,
        model_display_name=model_display_name,
        training_fraction_split=training_fraction_split,
        validation_fraction_split=validation_fraction_split,
        test_fraction_split=test_fraction_split,
        sync=sync,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    print(model.uri)
    return model

In [None]:
display_name = "sentimentanalysis"
dataset_id = dataset.resource_name.split("/")[-1]
print(dataset_id)
model = create_training_pipeline_text_sentiment_analysis(
    PROJECT_ID, LOCATION, display_name, dataset_id, sentiment_max=1
)
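
The three split fractions are expected to add up to 1. As a sketch, the hypothetical helper below validates the fractions and estimates how many rows each set would receive. Because the actual assignment is done by Vertex AI, the real counts may differ slightly.

```python
import math


def split_counts(n_rows, training=0.8, validation=0.1, test=0.1):
    """Validate that the fractions sum to 1 and return approximate set sizes."""
    if not math.isclose(training + validation + test, 1.0):
        raise ValueError("Fractions must sum to 1.0")
    n_train = round(n_rows * training)
    n_val = round(n_rows * validation)
    # Give the remainder to the test set so the counts always add up
    return n_train, n_val, n_rows - n_train - n_val


print(split_counts(1000))
```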

## Deploy the Model to the endpoint
<a name="section-9"></a>


#### Create the endpoint

In [None]:
def create_endpoint(
    project: str,
    display_name: str,
    location: str,
):
    aiplatform.init(project=project, location=location)

    endpoint = aiplatform.Endpoint.create(
        display_name=display_name,
        project=project,
        location=location,
    )

    print(endpoint.display_name)
    print(endpoint.resource_name)
    return endpoint

In [None]:
display_name = "sentiment-analysis"
endpoint = create_endpoint(PROJECT_ID, display_name, LOCATION)

#### Deploy the model

The following code uses the Vertex AI SDK for Python to deploy the model to an endpoint. You can get the model ID from the Models section of Vertex AI in the console, or from the resource name of the model object returned by the training step.

In [None]:
def deploy_model_with_automatic_resources(
    project,
    location,
    model_name: str,
    endpoint: Optional[aiplatform.Endpoint] = None,
    deployed_model_display_name: Optional[str] = None,
    traffic_percentage: Optional[int] = 0,
    traffic_split: Optional[Dict[str, int]] = None,
    min_replica_count: int = 1,
    max_replica_count: int = 1,
    metadata: Optional[Sequence[Tuple[str, str]]] = (),
    sync: bool = True,
):
    """
    model_name: A fully-qualified model resource name or model ID.
          Example: "projects/123/locations/us-central1/models/456" or
          "456" when project and location are initialized or passed.
    """

    aiplatform.init(project=project, location=location)

    model = aiplatform.Model(model_name=model_name)
    model.deploy(
        endpoint=endpoint,
    )
    model.wait()
    print(model.display_name)
    print(model.resource_name)
    return model

In [None]:
# Use the model trained above, or paste a model ID from the console instead
model_id = model.resource_name
model = deploy_model_with_automatic_resources(PROJECT_ID, LOCATION, model_id, endpoint)

## Prediction
<a name="section-10"></a>


After deploying the model to an endpoint, use the Vertex AI API to request online predictions. The cells below select data that was not used for training (the NEGATIVE and POSITIVE classes) and pick longer reviews to test the model.

In [None]:
def predict_text_sentiment_analysis_sample(endpoint, content):
    print(content)
    response = endpoint.predict(instances=[{"content": content}], parameters={})

    for prediction_ in response.predictions:
        print(prediction_)

In [None]:
test_data_pos = df[df["label"].isin([POSITIVE])].reset_index(drop=True)
test_data_neg = df[df["label"].isin([NEGATIVE])].reset_index(drop=True)

test_data_neg = test_data_neg.text.values
test_data_neg = [i for i in test_data_neg if len(i) > 200]
random.shuffle(test_data_neg)

In [None]:
test_data_pos = test_data_pos.text.values
test_data_pos = [i for i in test_data_pos if len(i) > 200]
random.shuffle(test_data_pos)
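
`random.shuffle` gives a different order on every run, so the reviews picked for testing change each time. If repeatable samples are wanted, one option is a seeded generator, as in this sketch. The helper `shuffled_copy`, the seed value, and the sample reviews are all illustrative.

```python
import random


def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items; the same seed gives the same order."""
    result = list(items)
    # A private Random instance leaves the global random state untouched
    random.Random(seed).shuffle(result)
    return result


reviews = ["review a", "review b", "review c", "review d"]
print(shuffled_copy(reviews))
```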

The next cell shows the predictions on the positive samples. In the recorded run, the model predicted positive sentiment for most of the positive reviews; the first and last predictions were false negatives. Because the reviews are shuffled randomly, your results may differ.


In [None]:
for review in test_data_pos[0:10]:
    predict_text_sentiment_analysis_sample(endpoint, review)

The next cell shows the predictions on the negative reviews. In the recorded run, 7 of the 10 negative reviews were correctly predicted as negative.

In [None]:
for review in test_data_neg[0:10]:
    predict_text_sentiment_analysis_sample(endpoint, review)
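
Counting correct predictions by eye gets tedious beyond a handful of reviews. The sketch below tallies them instead. It assumes each prediction can be read as a mapping with an integer `sentiment` key, which should be verified against the actual response; the predictions shown are made up, not output from the deployed model.

```python
def sentiment_accuracy(predictions, expected_sentiment):
    """Fraction of predictions whose 'sentiment' equals expected_sentiment."""
    if not predictions:
        return 0.0
    hits = sum(1 for p in predictions if p.get("sentiment") == expected_sentiment)
    return hits / len(predictions)


# Made-up predictions for illustration only
fake_predictions = [{"sentiment": 0}, {"sentiment": 1}, {"sentiment": 0}, {"sentiment": 0}]
print(sentiment_accuracy(fake_predictions, expected_sentiment=0))
```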

## Reviews visualisation
<a name="section-11"></a>


This section visualises the most positive and most negative reviews in the data.

In [None]:
data_pos = df[df["label"].isin([VERYPOSITIVE])].reset_index(drop=True)
data_neg = df[df["label"].isin([VERYNEGATIVE])].reset_index(drop=True)

data_neg = data_neg.text.values

In [None]:
data_pos = data_pos.text.values

The following function creates a word cloud after removing common words, so that the words that characterise the positive and negative samples stand out.

In [None]:
# Python program to generate WordCloud
def plot_word_cloud(data, common_words):
    comment_words = ""
    stopwords = set(STOPWORDS)
    for val in data:
        tokens = val.split()
        for i in range(len(tokens)):
            tokens[i] = tokens[i].lower()
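            # Substring match: blanks any token that contains a common word,
            # so short entries such as "s" or "re" remove many other words too
            for each in common_words: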
                if each in tokens[i]:
                    tokens[i] = ""
                    break

        comment_words += " ".join(tokens) + " "

    wordcloud = WordCloud(
        width=800,
        height=800,
        background_color="white",
        stopwords=stopwords,
        min_font_size=10,
    ).generate(comment_words)

    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)

    plt.show()
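
`plot_word_cloud` blanks a token whenever a common word occurs anywhere inside it, so an entry such as "s" also removes "mess". If exact matching is preferred, the filtering step could look like the sketch below. `filter_tokens` is a hypothetical alternative, not the notebook's method, and the sample texts are invented.

```python
from collections import Counter


def filter_tokens(texts, common_words):
    """Lowercase and split texts, dropping tokens that exactly match a common word."""
    blocked = {word.lower() for word in common_words}
    kept = []
    for text in texts:
        kept.extend(tok for tok in text.lower().split() if tok not in blocked)
    return kept


sample = ["A great movie", "This movie is a mess"]
counts = Counter(filter_tokens(sample, ["movie", "s"]))
print(counts.most_common(3))
```

The joined tokens can then be passed to `WordCloud(...).generate(" ".join(tokens))` in the same way as `comment_words` above.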

Word cloud of negative reviews

In [None]:
plot_word_cloud(
    data_neg,
    [
        "movie",
        "film",
        "story",
        "audience",
        "director",
        "watch",
        "seem",
        "world",
        "one",
        "make",
        "way",
        "character",
        "much",
        "time",
        "even",
        "take",
        "s",
        "n't",
        "will",
        "may",
        "re",
        "plot",
        "good",
        "comedy",
        "made",
    ],
)

Word cloud of positive reviews

In [None]:
plot_word_cloud(
    data_pos,
    [
        "movie",
        "film",
        "story",
        "audience",
        "director",
        "watch",
        "seem",
        "world",
        "one",
        "make",
        "way",
        "character",
        "much",
        "time",
        "even",
        "take",
        "s",
        "n't",
        "will",
        "may",
        "re",
        "plot",
        "made",
    ],
)

## Clean up
<a name="section-12"></a>

Undeploy the model from the endpoint.

In [None]:
# Undeploy every model from the endpoint so that the endpoint can be deleted
endpoint.undeploy_all()

Delete the endpoint.

In [None]:
endpoint.delete()

Delete the dataset.

In [None]:
dataset.delete()

Delete the model.

In [None]:
model.delete()
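
If one clean-up call fails, the cells after it may never run and billable resources can be left behind. A possible pattern is to run every step and collect the failures, as in this sketch. `run_cleanup` is a hypothetical helper, and the steps in the example are stand-ins rather than real Vertex AI calls.

```python
def run_cleanup(steps):
    """Run each (label, callable) step; collect failures instead of stopping."""
    failures = []
    for label, action in steps:
        try:
            action()
        except Exception as exc:  # keep going so later resources are still removed
            failures.append((label, str(exc)))
    return failures


def failing_step():
    raise RuntimeError("resource not found")


# Stand-in steps for illustration
print(run_cleanup([("endpoint", lambda: None), ("dataset", failing_step)]))
```

In the notebook, the steps could be passed as `[("endpoint", endpoint.delete), ("dataset", dataset.delete), ("model", model.delete)]`.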