In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Semantic Similarity Analysis of Dialogflow CX Pages and Ad Hoc Input Data
In this notebook, we will show you how to use NLU sentence embeddings to determine how similar different utterances are. We use this information to perform the following analyses:

* Find similar training phrases in different intents that will cause confusion for the NLU model.
* Identify the most similar training phrases for a user-supplied set of utterances. This helps explain where incorrect predictions on an eval set come from.
* Identify clusters of utterances that are unlike any of the phrases in the training data. This can be used to search through utterances that produced NO_MATCH in the logs and identify missing intents/training phrases.

## Prerequisites
- Ensure you have a key for a GCP Service Account that has the Dialogflow API Admin role assigned to it.

## NOTE!
_This notebook is intended to run on [Google Colab](https://colab.sandbox.google.com/) infrastructure because of the `scann` library dependency._

In [None]:
# If you haven't already, make sure you install the `dfcx-scrapi` and `scann` libraries

!pip install dfcx-scrapi
!pip install scann

# Imports
During import, Colab will ask you to authenticate with your Google credentials.  
These credentials are used to access Google Sheets where your training data lives.

In [None]:
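import gspread
from google.auth import default
from google.colab import auth, drive

from dfcx_scrapi.tools.nlu_util import (
    KonaEmbeddingModel,
    NaturalLanguageUnderstandingUtil,
    SheetsLoader,
)

# Mount Google Drive so its files are available under /content/drive.
drive.mount('/content/drive')

# Authenticate the Colab user and fetch the default credentials.
auth.authenticate_user()
creds, _ = default()

# Authorize a gspread client with those credentials for Google Sheets access.
gc = gspread.authorize(creds)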

# User Inputs
In the next section, we will collect runtime variables needed to execute this notebook.   
This should be the only cell of the notebook you need to edit in order for this notebook to run.

Getting the training phrase data from your existing DFCX agent requires the following information:
- `creds_path`, path to your service account credentials file.
- `agent_id`, which is your GCP agent ID.
- `flow_display_name`, the Display Name of the Flow to use.
- `page_display_name`, the Display Name of the Page to use.
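
Before running the remaining cells, it can help to sanity-check `agent_id`. The sketch below assumes that `agent_id` is the full Dialogflow CX resource name (`projects/<project>/locations/<location>/agents/<agent>`); the helper `looks_like_agent_id` and the sample values are hypothetical and are not part of `dfcx-scrapi`.

```python
import re

# Assumed shape of a Dialogflow CX agent resource name:
#   projects/<project>/locations/<location>/agents/<agent>
AGENT_ID_PATTERN = re.compile(r"projects/[^/]+/locations/[^/]+/agents/[^/]+")


def looks_like_agent_id(value: str) -> bool:
    """Return True if `value` has the assumed resource-name shape."""
    return AGENT_ID_PATTERN.fullmatch(value) is not None


# Hypothetical values, for illustration only.
print(looks_like_agent_id("projects/my-project/locations/global/agents/1234-abcd"))
print(looks_like_agent_id("<YOUR_AGENT_ID>"))
```

The first call prints `True` and the second prints `False`, so a placeholder left unedited is caught before any API call is made.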

In [None]:
creds_path = '<PATH_TO_YOUR_CREDS_FILE>'
agent_id = '<YOUR_AGENT_ID>'
flow_display_name = '<YOUR_DFCX_FLOW_DISPLAY_NAME>'
page_display_name = '<YOUR_DFCX_PAGE_DISPLAY_NAME>'

# Load Agent
First, we will instantiate our `embedder` by loading in Agent, Flow, and Page information.

In [None]:
embedder = NaturalLanguageUnderstandingUtil(agent_id, flow_display_name, page_display_name, creds_path)

# Analysis 1: Find conflicting training phrases

This analysis identifies pairs of training phrases in different intents that are confusing for the NLU model.

We recommend reviewing all training phrases with a similarity above 0.9. In most cases the conflicts should be resolved by deleting one of the training phrases.

No additional inputs are needed to run this analysis. Just run this cell after loading the agent.
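
To make the idea concrete, here is a minimal sketch of what "conflicting" means. It uses hand-made toy vectors in place of real sentence embeddings and plain cosine similarity. The intents, phrases, vectors, and the helper `find_conflicts` are hypothetical; this is not the implementation inside `dfcx-scrapi`, only an illustration of comparing phrases across intents against the 0.9 threshold.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical training phrases: (intent, phrase, toy embedding).
training_phrases = [
    ("book_flight", "book a flight", [0.9, 0.1, 0.0]),
    ("book_flight", "i need a plane ticket", [0.8, 0.2, 0.1]),
    ("change_flight", "rebook a flight", [0.88, 0.15, 0.02]),
    ("check_weather", "is it raining", [0.0, 0.1, 0.95]),
]


def find_conflicts(phrases, threshold=0.9):
    """Return cross-intent phrase pairs whose similarity exceeds threshold."""
    conflicts = []
    for first, second in combinations(phrases, 2):
        intent_a, text_a, vec_a = first
        intent_b, text_b, vec_b = second
        if intent_a == intent_b:
            continue  # Pairs inside one intent are not conflicts.
        score = cosine_similarity(vec_a, vec_b)
        if score > threshold:
            conflicts.append((text_a, intent_a, text_b, intent_b, score))
    # Most similar pairs first, since those are the most urgent to review.
    return sorted(conflicts, key=lambda row: row[-1], reverse=True)


conflicts = find_conflicts(training_phrases)
for text_a, intent_a, text_b, intent_b, score in conflicts:
    print(f"{score:.3f}  {text_a!r} ({intent_a})  vs  {text_b!r} ({intent_b})")
```

With this toy data, both `book_flight` phrases are flagged against the `change_flight` phrase, while the weather phrase is not flagged at all.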

In [None]:
similar_df = embedder.find_similar_training_phrases_in_different_intents()
similar_df

# Analysis 2: Find the most similar training phrases for a set of utterances

This analysis finds the most similar training phrases in the agent data for a set of utterances. 

Example use cases:
* Identifying why utterances in an eval set were classified as some intent.
* Looking for any similar training phrases for a list of utterances from the logs.
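
Conceptually, this is a nearest-neighbor lookup: each utterance is compared with every training phrase and the closest one is reported. The sketch below shows that idea with toy two-dimensional vectors standing in for real embeddings. All intents, phrases, vectors, and the helper `most_similar_phrase` are hypothetical, and the library may use a different search strategy internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical training phrases: (intent, phrase, toy embedding).
training_phrases = [
    ("city.new_york", "new york city", [0.9, 0.1]),
    ("city.los_angeles", "los angeles", [0.1, 0.9]),
]

# Hypothetical utterances with toy embeddings.
utterances = {
    "big apple": [0.8, 0.3],
    "city of angels": [0.2, 0.7],
}


def most_similar_phrase(utterance_vec, phrases):
    """Return (intent, phrase, score) for the closest training phrase."""
    scored = [
        (intent, phrase, cosine_similarity(utterance_vec, vec))
        for intent, phrase, vec in phrases
    ]
    return max(scored, key=lambda row: row[2])


matches = {
    text: most_similar_phrase(vec, training_phrases)
    for text, vec in utterances.items()
}
for text, (intent, phrase, score) in matches.items():
    print(f"{text!r} -> {phrase!r} ({intent}), similarity {score:.3f}")
```

If an eval utterance lands next to a training phrase from the wrong intent, that phrase is a likely cause of the misclassification.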

We read the utterances from a Google Sheet. That sheet must be shared with the service account whose credentials file (`creds_path`) you provided earlier in this notebook.

We'll collect the following input arguments to access your Google Sheet:
* `sheet_name`, the display name of your Google Sheet
* `worksheet_name`, the display name of the Worksheet or tab where your data lives
* `utterance_column_name`, the name of the column where your Utterance data lives

Input data can be a single column of utterances like:

| utterances |
| --- |
| new york |
| big apple |
| I like traveling to nyc |
| some people call it the big apple |

In [None]:
sheet_name = "<YOUR_GOOGLE_SHEET_NAME>"
worksheet_name = "<YOUR_GOOGLE_SHEET_WORKSHEET_NAME>"
utterance_column_name = "utterances"

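# Load the utterance column from the sheet and make sure every value is a string.
sheet_loader = SheetsLoader(creds_path)
utterances = sheet_loader.load_column_from_sheet(sheet_name,
                                                 worksheet_name,
                                                 utterance_column_name).astype(str)
# Look up the most similar training phrases for each utterance.
embedder.find_similar_phrases(utterances)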

# Analysis 3: Find clusters of utterances that don't match training phrases

This analysis finds clusters of utterances that don't match any training phrases. Example use cases:

* Running on a set of utterances that were labeled NO_MATCH in logs. To fix any entries that show up, consider:
  * Adding the identified utterances as training phrases in an existing intent (the most similar intent will be displayed).
  * Adding new intents.
  * Enabling inactive intents.
* Running on a set of eval/log utterances to identify regions where an intent is lacking training phrases.
  * To fix any entries that show up, we recommend adding the identified utterances as training phrases.
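
The underlying idea can be sketched in two steps: keep only the utterances whose best similarity to any training phrase is low, then group the remaining utterances that are close to each other. The example below uses toy four-dimensional vectors and a simple greedy grouping. The thresholds, vectors, and helpers (`max_training_similarity`, `group_unmatched`) are hypothetical, and the greedy grouping is a stand-in chosen for brevity, not the clustering method used by `dfcx-scrapi`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings of existing training phrases (hypothetical).
training_vectors = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]

# Hypothetical NO_MATCH utterances with toy embeddings.
utterances = [
    ("cancel my order", [0.1, 0.1, 0.9, 0.0]),
    ("i want to cancel", [0.0, 0.2, 0.95, 0.1]),
    ("book a flight", [0.95, 0.1, 0.0, 0.0]),
    ("talk to a human", [0.1, 0.0, 0.1, 0.9]),
]


def max_training_similarity(vec):
    """Best similarity between one utterance and any training phrase."""
    return max(cosine_similarity(vec, t) for t in training_vectors)


def group_unmatched(items, match_threshold=0.7, group_threshold=0.8):
    """Greedily group utterances that are unlike all training phrases."""
    unmatched = [
        (text, vec) for text, vec in items
        if max_training_similarity(vec) < match_threshold
    ]
    groups = []  # Each group: (seed_vector, [utterance texts]).
    for text, vec in unmatched:
        for seed, members in groups:
            if cosine_similarity(vec, seed) >= group_threshold:
                members.append(text)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append((vec, [text]))
    return [members for _, members in groups]


groups = group_unmatched(utterances)
for members in groups:
    print(members)
```

With this toy data, the two cancellation utterances form one group, "talk to a human" forms another, and "book a flight" is excluded because it already resembles a training phrase. Each group is a candidate for a new intent or for extra training phrases.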

We read the utterances from a Google Sheet. That sheet must be shared with the service account whose credentials file (`creds_path`) you provided earlier in this notebook.

We'll collect the following input arguments to access your Google Sheet:
* `sheet_name`, the display name of your Google Sheet
* `worksheet_name`, the display name of the Worksheet or tab where your data lives
* `utterance_column_name`, the name of the column where your Utterance data lives

Input data can be a single column of utterances like:

| no_match_utterances |
| --- |
| new york |
| big apple |
| I like traveling to nyc |
| some people call it the big apple |

In [None]:
sheet_name = "<YOUR_GOOGLE_SHEET_NAME>"
worksheet_name = "<YOUR_GOOGLE_SHEET_WORKSHEET_NAME>"
utterance_column_name = "no_match_utterances"

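# Load the NO_MATCH utterance column and make sure every value is a string.
sheet_loader = SheetsLoader(creds_path)
utterances = sheet_loader.load_column_from_sheet(sheet_name,
                                                 worksheet_name,
                                                 utterance_column_name).astype(str)
# Find groups of utterances that are unlike any training phrase.
embedder.find_new_groups(utterances)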

# Final Thoughts and Wrap-Up
In this notebook, we've shown you how to use NLU sentence embeddings to determine how similar different utterances are.

For further instruction, please contact: [cgibson6279](https://github.com/cgibson6279)