## **Import IBM NLU API**

In [None]:
!pip install --upgrade "ibm-watson>=4.2.1"

In [None]:
import json
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import (
    Features, CategoriesOptions, ClassificationsOptions, ConceptsOptions, EmotionOptions,
    EntitiesOptions, KeywordsOptions, RelationsOptions, SemanticRolesOptions, SentimentOptions,
    SyntaxOptions, SyntaxOptionsTokens,
)

# Replace the placeholder with your own IBM Cloud API key; never publish a real key in a shared notebook
authenticator = IAMAuthenticator('<YOUR_API_KEY>')
natural_language_understanding = NaturalLanguageUnderstandingV1(version='2021-08-01', authenticator=authenticator)

natural_language_understanding.set_service_url('https://api.us-south.natural-language-understanding.watson.cloud.ibm.com/instances/b8ba28f5-ecdb-4a85-bbca-cccb8cffe764')

## **Connect to Google Drive to get IBM category dataset**

In [None]:
# Mount Google Drive to get the dataset
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Store the dataset as a dataframe
import pandas as pd

df = pd.read_excel("/content/drive/My Drive/categories-hierarchy-iab.xlsx")

In [None]:
df.head(1)

Unnamed: 0,LEVEL 1,LEVEL 2,LEVEL 3,LEVEL 4,LEVEL 5
0,art and entertainment,books and literature,best-sellers,,


In [None]:
# Print the top-level category of the first row
print(df['LEVEL 1'].iloc[0])

art and entertainment


## **Focus on the Categories feature**

### Process **sample_text**: get category labels and their respective scores

In [None]:
sample_text = "According to the statement, the rockfall was unrelated to work that’s underway on the Kicking Horse Canyon Phase 4 project. Rockfall is a hazard along many B.C. highways and the Ministry of Transportation and Infrastructure has dedicated staff and resources that work to identify, prioritize and mitigate rockfall hazards throughout the province. The rockfall protection near this site pre-dates the project and was installed by Ministry of Transportation and Infrastructure in an area identified by geotechnical engineers."

print("Text: %s\n" % sample_text)
response = natural_language_understanding.analyze(text=sample_text, features=Features(categories=CategoriesOptions(limit=3))).get_result()

categories = response["categories"]
category_dict = {}

for category in categories:
  # Labels are hierarchical paths separated by "/"; walk them from the most specific level up
  for label in reversed(category["label"].split("/")):
    # Keep only the first score seen for a label: categories are expected in descending score order,
    # so the first occurrence is the label's highest score
    if label and label not in category_dict:
      category_dict[label] = category["score"]
  
print(category_dict)
sample_text_cat_dict = category_dict

Text: According to the statement, the rockfall was unrelated to work that’s underway on the Kicking Horse Canyon Phase 4 project. Rockfall is a hazard along many B.C. highways and the Ministry of Transportation and Infrastructure has dedicated staff and resources that work to identify, prioritize and mitigate rockfall hazards throughout the province. The rockfall protection near this site pre-dates the project and was installed by Ministry of Transportation and Infrastructure in an area identified by geotechnical engineers.

{'construction': 0.714107, 'business and industrial': 0.714107, 'earthquakes': 0.586231, 'seismology': 0.586231, 'geology': 0.586231, 'science': 0.586231, 'engineering': 0.58616}
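
The flattening loop above can be factored into a small helper so that it is written only once. The following is a minimal sketch, not the notebook's own code: the function name `flatten_category_labels` is hypothetical, and the input list is a hand-written stand-in shaped like `response["categories"]`. Its label paths are assumptions, chosen so that the result matches the dictionary printed above. It also assumes that categories arrive in descending score order.

```python
def flatten_category_labels(categories):
    """Map every label level to the first (highest) score it appears with."""
    flat = {}
    for category in categories:
        # "/science/geology" -> ["", "science", "geology"]; walk from the most specific level up
        for label in reversed(category["label"].split("/")):
            # Skip the empty string produced by the leading "/" and any label already stored
            if label and label not in flat:
                flat[label] = category["score"]
    return flat


# Hypothetical stand-in for response["categories"]; the paths are assumed, not taken from the API
example_categories = [
    {"label": "/business and industrial/construction", "score": 0.714107},
    {"label": "/science/geology/seismology/earthquakes", "score": 0.586231},
    {"label": "/science/engineering", "score": 0.58616},
]

flat_labels = flatten_category_labels(example_categories)
print(flat_labels)
```

Because `science` is already stored with 0.586231 when the third category is processed, only `engineering` is added from that path.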


### Process **text_to_compare**: get category labels and their respective scores

In [None]:
text_to_compare1 = "The building’s post-tensioned concrete structure is designed to minimize damage and risk to residents in the event of an earthquake. The south and southeast facades will feature a floor-to-ceiling glass wall section facing the Salt Lake Valley and Wasatch Front, maximizing scenic views for residents. The remainder of the facade will be composed of a variation of glass-fiber reinforced concrete, set off with recessed windows."
text_to_compare2 = "Dr. Kieran Moore has also said making masks optional does not signal that COVID-19 has disappeared or that the pandemic is over, but it means that Ontario has come to place where it can now manage the virus. Some local health officials in parts of the province's north are encouraging residents to keep wearing masks in indoor public settings."
text_to_compare3 = "China Eastern gave its website a black-and-white homepage after the crash. The crash quickly became a leading topic on China's Weibo social media platform, with 1.34 billion views and 690,000 discussions. Many posts expressed condolences to the families of victims. Boeing began delivering the 737-800 to customers in 1997 and delivered the last of the series to China Eastern in 2020. It made over 5,200 of the narrow-body aircraft, a popular, single-aisle commuter plane."

text_to_compare_list = [text_to_compare1, text_to_compare2, text_to_compare3]
all_category_dict = {}

for text in text_to_compare_list:
  print("Text: %s\n" % text)
  response = natural_language_understanding.analyze(text=text, features=Features(categories=CategoriesOptions(limit=3))).get_result()

  categories = response["categories"]
  category_dict = {}

  for category in categories:
    # Labels are hierarchical paths separated by "/"; walk them from the most specific level up
    for label in reversed(category["label"].split("/")):
      # Keep only the first score seen for a label: categories are expected in descending score order,
      # so the first occurrence is the label's highest score
      if label and label not in category_dict:
        category_dict[label] = category["score"]
  
  all_category_dict[text] = category_dict
  print(category_dict)
  print("\n")

print("all_category_dict: ", all_category_dict)

Text: The building’s post-tensioned concrete structure is designed to minimize damage and risk to residents in the event of an earthquake. The south and southeast facades will feature a floor-to-ceiling glass wall section facing the Salt Lake Valley and Wasatch Front, maximizing scenic views for residents. The remainder of the facade will be composed of a variation of glass-fiber reinforced concrete, set off with recessed windows.

{'construction': 0.723916, 'business and industrial': 0.723916, 'architecture': 0.599365, 'visual art and design': 0.599365, 'art and entertainment': 0.599365, 'design': 0.598439}


Text: Dr. Kieran Moore has also said making masks optional does not signal that COVID-19 has disappeared or that the pandemic is over, but it means that Ontario has come to place where it can now manage the virus. Some local health officials in parts of the province's north are encouraging residents to keep wearing masks in indoor public settings.

{'cold and flu': 0.94605, 'dise

## **Get similarity scores**

In [None]:
similarity_scores_dict = {}

# Initialize every candidate text's similarity score to 0
for compare_text in all_category_dict:
  similarity_scores_dict[compare_text] = 0

# Weight each sample_text label by its position: labels stored earlier (higher relevance) earn a larger reward
sample_highest_reward = len(sample_text_cat_dict)

for i, sample_label in enumerate(sample_text_cat_dict):
  reward = sample_highest_reward - i

  for compare_text, compare_cat_dict in all_category_dict.items():
    if sample_label in compare_cat_dict:
      # Add the candidate's score for the shared label, scaled by that label's reward
      similarity_scores_dict[compare_text] += reward * compare_cat_dict[sample_label]

print("Similarity scores: ", similarity_scores_dict.values())

Similarity scores:  dict_values([9.410908, 0, 3.605148])
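
As a check on the first score, the rank-weighted sum can be reproduced offline from the two dictionaries printed earlier. This is a sketch under assumptions: the helper name `rank_weighted_similarity` is hypothetical, and the dictionaries are copied from the notebook output rather than fetched from the API.

```python
def rank_weighted_similarity(sample_scores, candidate_scores):
    """Sum the candidate's scores for shared labels, weighted by each label's rank in sample_scores."""
    highest_reward = len(sample_scores)
    total = 0.0
    for rank, label in enumerate(sample_scores):
        if label in candidate_scores:
            # The first sample label earns the largest reward, the last one earns 1
            total += (highest_reward - rank) * candidate_scores[label]
    return total


# Copied from the output for sample_text
sample_scores = {
    'construction': 0.714107, 'business and industrial': 0.714107,
    'earthquakes': 0.586231, 'seismology': 0.586231, 'geology': 0.586231,
    'science': 0.586231, 'engineering': 0.58616,
}
# Copied from the output for text_to_compare1
candidate_scores = {
    'construction': 0.723916, 'business and industrial': 0.723916,
    'architecture': 0.599365, 'visual art and design': 0.599365,
    'art and entertainment': 0.599365, 'design': 0.598439,
}

similarity = rank_weighted_similarity(sample_scores, candidate_scores)
print(round(similarity, 6))
```

Only `construction` (reward 7) and `business and industrial` (reward 6) are shared, so the score is 7 × 0.723916 + 6 × 0.723916 = 9.410908, which matches the first value in the output.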


## **Recommend the most similar compare_text to sample_text**

In [None]:
find_max = max(similarity_scores_dict, key=similarity_scores_dict.get)

print("The most similar text: ", find_max)

The most similar text:  The building’s post-tensioned concrete structure is designed to minimize damage and risk to residents in the event of an earthquake. The south and southeast facades will feature a floor-to-ceiling glass wall section facing the Salt Lake Valley and Wasatch Front, maximizing scenic views for residents. The remainder of the facade will be composed of a variation of glass-fiber reinforced concrete, set off with recessed windows.
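
`max` returns only the single best key, and it still returns a text when every score is 0. One possible extension, sketched here with hypothetical names (`rank_candidates` and the short `text_a` to `text_c` keys standing in for the full texts) and the scores printed earlier, ranks all candidates and drops those with no label overlap.

```python
def rank_candidates(similarity_scores):
    """Return (text, score) pairs sorted by descending score, skipping texts with no overlap."""
    ranked = sorted(similarity_scores.items(), key=lambda item: item[1], reverse=True)
    return [(text, score) for text, score in ranked if score > 0]


# Hypothetical short keys; the scores are the ones printed by the notebook
example_scores = {"text_a": 9.410908, "text_b": 0, "text_c": 3.605148}

ranking = rank_candidates(example_scores)
print(ranking)

# Fall back to None when no candidate shares a label with the sample text
best_text = ranking[0][0] if ranking else None
print(best_text)
```

Returning the full ranking makes it possible to recommend more than one text, and the `None` fallback avoids recommending an unrelated text when nothing overlaps.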
