# Introduction:
### Title: Guess the Super Hero!
### Problem Statement:
Given a description of a `superhero`, return the superhero name <br>
### Example Input & Output:
Input description: `Knight of Dark, Gotham protector, Smart, Intelligent, martial artist, master of dark, educated` <br>
Output: `Batman`
### Dataset Description
Dataset link: https://www.kaggle.com/datasets/jonathanbesomi/superheroes-nlp-dataset <br><br>
The dataset has 1450 unique superheroes, and each column describes a different attribute of a superhero.<br><br>
Example attributes: face features, associated teams, superpowers, creator, etc. <br><br>
Total rows: 1450 <br>
Total columns: 81 <br>

### Solution
The task is to map a given description to an entity name. <br> 
Challenges:
* `Limited number of records` for supervised machine learning approaches.
* There is `no target class`: each record describes a unique superhero, so this is not a regular classification or regression task.
* Although `multiple superheroes may have similar characteristics, each superhero is different`, so this is not exactly a clustering task either.

Considering the input to the system and the first two challenges above, we need to `construct a superhero description` from the provided dataset. Transforming structured information into unstructured text is `unconventional`, but it is `efficient` here because the query and the records can then be compared directly as text. <br><br>
The assumption is that this constructed `description` carries `rich information` about the superhero, which helps us match the input description to a superhero name. <br><br>

`Solution`: Use `Semantic Search` to match the input description query against the existing superhero descriptions and fetch the top-k records. Then use `Keyword Search` to pick, among those records, the ones that share the largest number of words with the input description. <br><br>
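The two-stage idea can be sketched with the standard library alone. This is only an illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real sentence embedding, and the three descriptions and the `search` helper are made up for the example rather than taken from the notebook.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list:
    # Lowercase and split on anything that is not a letter or digit
    return re.findall(r"[a-z0-9]+", text.lower())

def toy_embed(text: str) -> Counter:
    # Assumption: a bag-of-words count vector stands in for a sentence embedding
    return Counter(tokenize(text))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query: str, descriptions: dict, k: int = 2) -> list:
    # Stage 1 (semantic search stand-in): keep the k most similar descriptions
    q_vec = toy_embed(query)
    top_k = sorted(
        descriptions,
        key=lambda name: cosine(q_vec, toy_embed(descriptions[name])),
        reverse=True,
    )[:k]
    # Stage 2 (keyword search): re-rank candidates by the number of shared words
    q_terms = set(tokenize(query))
    return sorted(
        top_k,
        key=lambda name: len(q_terms & set(tokenize(descriptions[name]))),
        reverse=True,
    )

# Hypothetical mini-corpus of constructed descriptions
descriptions = {
    "Batman": "dark knight, gotham protector, martial artist, intelligent, educated",
    "Superman": "krypton, flight, super strength, metropolis protector",
    "Flash": "central city, super speed, fast, forensic scientist",
}
result = search("Gotham protector, smart, martial artist", descriptions)
print(result)
```

In a real pipeline the first stage would compare dense embeddings through an approximate nearest neighbour index; the second stage stays a plain word-overlap count.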

### Tech Stack:
* Python - 3.8 <br>
* Spacy - 3.4 <br>
* Spacy Transformers - 1.1.8 <br>
* KeyBERT - 0.6.0 <br>
* hnswlib - 0.6.2 - Hierarchical Navigable Small World for Approximate Nearest Neighbour Search <br>
* Sentence Transformers - 2.2.2 <br>
* Text Distance - 4.5.0 <br>
* AWS  <br>
* Docker  <br>
* CI/CD  <br>

In [387]:
#!nvidia-smi
!python --version

Python 3.8.13


# Load necessary packages

In [42]:
# !pip install keybert
# !pip install keyphrase-vectorizers
# !pip install -U sentence-transformers
# !pip install hnswlib
# !pip install spacy[transformers]
# !pip install pytextrank

# !python -m spacy download en_core_web_trf

# Import Packages:

In [2]:
import os
import re
import ast
import time
import spacy
import pickle
import pathlib
import hnswlib
import pytextrank
import numpy as np
import textdistance
import pandas as pd
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
from sentence_transformers import SentenceTransformer, util
# Required by the KeyBERT cells below
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseTfidfVectorizer


pd.set_option("display.max_columns", None)

# Load Dataset

In [13]:
# Set Working Directory Paths

notebook_directory = pathlib.Path().resolve()
data_directory = os.path.join(notebook_directory.parent, 'Data')
artifact_directory = os.path.join(notebook_directory.parent, 'Artifacts')
superhero_nlp_datazip_path = os.path.join(data_directory, 'superhero_nlp.zip')
dataset_path = os.path.join(data_directory, 'superheroes_nlp_dataset.csv')

In [15]:
# IPython expands {variable} inside shell commands; an f'...' prefix is passed to the shell literally
!unzip "{superhero_nlp_datazip_path}" -d "{data_directory}"


In [210]:
df = pd.read_csv(f'{dataset_path}')

In [211]:
print(f'Number of rows: {len(df)}')
print(f'Number of columns: {len(df.columns)}')
df.info()

Number of rows: 1450
Number of columns: 81
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1450 entries, 0 to 1449
Data columns (total 81 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   name                              1448 non-null   object 
 1   real_name                         1301 non-null   object 
 2   full_name                         956 non-null    object 
 3   overall_score                     1450 non-null   object 
 4   history_text                      1360 non-null   object 
 5   powers_text                       1086 non-null   object 
 6   intelligence_score                1450 non-null   int64  
 7   strength_score                    1450 non-null   int64  
 8   speed_score                       1450 non-null   int64  
 9   durability_score                  1450 non-null   int64  
 10  power_score                       1450 non-null   int64  
 11  combat_score              

In [7]:
def get_record(index:int, data:pd.DataFrame=df) -> dict:
  record = data.iloc[index]
  return dict(record)

In [22]:
get_record(133)

{'name': 'Batman (DC One Million)',
 'real_name': nan,
 'full_name': nan,
 'overall_score': '12',
 'history_text': "In the distant 853rd Century (year 85,300), criminals of the most dangerous variety have been relocated to the bleak prison planet of Pluto. With the aid of the Laughing Virus, the criminal Xauron instigated a rebellion, seizing control of the world. Seeking retribution and a demonstration of power, Xauron had thousands of guards, as well as their spouses, herded into an arena, with their children forced to watch as their parents were massacred over the course of days. When the 15,000 children of those families were forced to watch the massacre of their parents, many committed suicide or went insane. One swore never to let such a tragedy occur again, and vowed to become The Batman. Already aware of the Batman Legacy and the many Batmen that had already existed, he chose the symbol of the bat because he thought that it stood for more than simply fear; it stood as an ideal,

The dataset provides the superhero descriptions in a structured format, which might not suit a semantic search solution. We therefore `convert the structured description into an unstructured description`. <br><br> Since the input description is essentially a comma-separated list of superhero attributes, I transform each structured record into comma-separated description text. <br><br> The idea is to `search for the input description query in the resulting set of comma-separated superhero descriptions`.
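As a rough sketch of that conversion, the helper below flattens one record into comma-separated text. The function name `record_to_description` and the choice to leave out the `name` field (because it is the value to be retrieved) are assumptions made for this illustration; the sample values come from the Aaron Cash row of the dataset.

```python
import math

def record_to_description(record: dict, skip_fields: tuple = ("name",)) -> str:
    # Join every non-empty attribute value into one comma-separated string
    parts = []
    for field, value in record.items():
        if field in skip_fields:
            continue
        # Skip missing values (None or NaN)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    return ", ".join(parts)

record = {
    "name": "Aaron Cash",
    "real_name": "Aaron Cash",
    "superpowers": "Weapon-based Powers,Weapons Master",
    "place_of_birth": "Gotham City",
    "occupation": float("nan"),
    "creator": "DC Comics",
}
description = record_to_description(record)
print(description)
```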

# Data Transformation

In [212]:
# Separate Numerical Data and Categorical Data for transformation

numeric_cols = ['overall_score', 'intelligence_score', 'strength_score', 
               'speed_score', 'durability_score', 'power_score', 'combat_score']
numeric_data = df[numeric_cols]

categorical_data = df.drop(numeric_cols + ['height', 'weight', 'img'], axis=1)

### Categorical Data

We first try to extract essential information from categorical data

In [None]:
# convert list columns to text

categorical_data['superpowers'] = categorical_data['superpowers'].apply(lambda x: ",".join(ast.literal_eval(x)))
categorical_data['alter_egos'] = categorical_data['alter_egos'].apply(lambda x: ",".join(ast.literal_eval(x)))
categorical_data['aliases'] = categorical_data['aliases'].apply(lambda x: ",".join(ast.literal_eval(x)))
categorical_data['teams'] = categorical_data['teams'].apply(lambda x: ",".join(ast.literal_eval(x)))
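The cell above assumes every value in these columns is a string that holds a valid Python list. If that cannot be guaranteed, a more defensive variant might look like the sketch below; `list_string_to_text` is a hypothetical helper name, and the fallback behaviour (empty string for non-strings, raw text for unparsable strings) is an assumption rather than the notebook's method.

```python
import ast

def list_string_to_text(value) -> str:
    # Non-string values (for example NaN) become an empty string
    if not isinstance(value, str):
        return ""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a Python literal: keep the raw text unchanged
        return value
    if isinstance(parsed, (list, tuple)):
        return ",".join(str(item) for item in parsed)
    return str(parsed)

print(list_string_to_text("['Super Speed', 'Super Strength']"))
```

It could then be applied per column, for example `categorical_data['teams'].apply(list_string_to_text)`.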

In [None]:
# Replace NaN records with Empty Strings in history_text column
categorical_data['history_text'] = categorical_data['history_text'].apply(lambda x: "" if isinstance(x, float) else x)

We have `History` & `Power Text` for most superheroes in the dataset. Although these columns describe each superhero in detail, they also contain noise. We need to extract the key information (keywords) from these two columns and discard the remaining text. <br>
For this we will use two tools: `KeyBERT` and `TextRank` (via PyTextRank)

#### KeyBERT

In [None]:
# Extract keywords and noun chunks from `history_text` using the KeyBERT package

kw_model = KeyBERT(model='all-mpnet-base-v2')


In [None]:
# Example output from KeyBERT
[key for (key, score) in kw_model.extract_keywords(categorical_data['history_text'].iloc[560], vectorizer=KeyphraseTfidfVectorizer(), use_mmr=True,)]

['sardonic superhero',
 'hancock',
 'pr executive ray embrey',
 'good citizens',
 'countless lives']

In [None]:
# Compute Keywords using KeyBERT and store them

keywords = []
for index, text in enumerate(categorical_data['history_text']):
  if len(text.split(' ')) <=1 :
    keywords.append("")
    continue
  keywords.append(",".join([key for (key, score) in kw_model.extract_keywords(text, vectorizer=KeyphraseTfidfVectorizer(), use_mmr=True)]))

In [None]:
categorical_data['history_keywords'] = keywords

In [None]:
# Drop History Text Column
categorical_data.drop('history_text', axis=1, inplace=True)

In [None]:
# Drop the One-Hot Encoded Columns because we have `superpowers` column 
# which describes all the super-powers

cols = [col_name for col_name in categorical_data.columns if col_name.startswith('has_')]
categorical_data.drop(cols, axis=1, inplace=True)

Power Text Keyword computation

In [None]:
# Replace NaN values with empty strings
categorical_data['powers_text'] = categorical_data['powers_text'].apply(lambda x: "" if isinstance(x, float) else x)

In [None]:
# Extract keywords and noun chunks from `powers_text` using the KeyBERT package

keywords = []
for index, text in enumerate(categorical_data['powers_text']):
  if len(text.split(' ')) <=1 :
    keywords.append("")
    continue
  keywords.append(",".join([key for (key, score) in kw_model.extract_keywords(text, use_mmr=True)]))

In [None]:
categorical_data['power_keywords'] = keywords

In [None]:
# Drop Powers Text Column
categorical_data.drop('powers_text', axis=1, inplace=True)

In [None]:
# Replace Missing NaN values with Empty String in complete dataset

categorical_data = categorical_data.replace(np.nan, '', regex=True)

In [None]:
# Save the Categorical transformation progress till now!
# categorical_data.to_parquet(f'{data_directory}/superhero_nlp_categorical_data.gzip', compression='gzip')

In [25]:
# Load the Categorical dataset
# categorical_data = pd.read_parquet(f'{data_directory}/superhero_nlp_categorical_data.gzip')

In [26]:
# The alignment column holds short labels such as "Good" and "Bad". Let's append
# related terms to make the description more verbose

def alignment_aug(text:str) -> str:

  if text == "":
    return ""

  if text == 'Good':
    return f'{text}, Hero, Saviour, Protector of Justice'
  else:
    return f'{text}, Villan, Enemy, Destroyer, Mischief, Evil'

categorical_data['alignment'] = categorical_data['alignment'].apply(lambda x: alignment_aug(x))

In [43]:
categorical_data

Unnamed: 0,name,real_name,full_name,superpowers,alter_egos,aliases,place_of_birth,first_appearance,creator,alignment,occupation,base,teams,relatives,gender,type_race,eye_color,hair_color,skin_color,history_keywords,power_keywords
0,3-D Man,"Delroy Garrett, Jr.","Delroy Garrett, Jr.","Super Speed,Super Strength",,,,,Marvel Comics,"Good, Hero, Saviour, Protector of Justice",,,"Annihilators,Asgardians,Avengers,New Avengers",,Male,Human,,,,"delroy garrett,triune understanding,steroids,n...",
1,514A (Gotham),Bruce Wayne,,"Durability,Reflexes,Super Strength","Batgod,Batman,Batman (1966),Batman (Arkham),Ba...","Subject 514A,Bruce Wayne,Bruce 2",,,DC Comics,,,,,Bruce Wayne (genetic template),,,,,,"bruce wayne look,alfred catch,selina kyle,subj...",
2,A-Bomb,Richard Milhouse Jones,Richard Milhouse Jones,"Accelerated Healing,Agility,Berserk Mode,Blood...",,Rick Jones,"Scarsdale, Arizona","Hulk Vol 2 #2 (April, 2008) (as A-Bomb)",Marvel Comics,"Good, Hero, Saviour, Protector of Justice","Musician, adventurer, author; formerly talk sh...",,"Teen Brigade,Ultimate Fantastic Four,U-Men,God...",Marlo Chandler-Jones (wife); Polly (aunt); Mrs...,Male,Human,Yellow,No Hair,,"hulk,rick radiation poisoning,jones,gamma bomb...","jones,superhuman,force,radiation,transform"
3,Aa,Aa,,"Energy Absorption,Energy Armor,Energy Beams,En...",,,Stoneworld,Green Lantern Vol 3 #21,DC Comics,"Good, Hero, Saviour, Protector of Justice",,,"Blue Lantern Corps,Green Lantern Corps,Justice...",,Male,Human,,,,"green lantern corps,comatose star sapphire,hal...",
4,Aaron Cash,Aaron Cash,Aaron Cash,"Weapon-based Powers,Weapons Master",,,Gotham City,,DC Comics,"Good, Hero, Saviour, Protector of Justice",,,,,Male,Human,,,,"aaron cash,hook,killer croc,arkham asylum,hand",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1445,Zatanna,Zatanna Zatara,Zatanna Zatara,"Cryokinesis,Fire Control,Magic,Probability Man...","Chaos Zatanna,Doctor Fate,Zatanna (Full Potent...",,,Hawkman #4,DC Comics,"Good, Hero, Saviour, Protector of Justice",,,"Seven Soldiers of Victory,Sentinels of Magic,Y...","Giovanni ""John"" Zatara (father, deceased), Sin...",Female,Human,Blue,Black,,"zatanna,evil witch allura,john constantine,mag...","zatanna,spellcasting,telepathic,able,does"
1446,Zero,DWN-∞: Zero,DWN-∞: Zero,"Accelerated Healing,Acrobatics,Agility,Cold Re...",,,,Mega Man X (1993),Capcom,"Good, Hero, Saviour, Protector of Justice",,,,,Male,Robot,Blue,Blond,Red,"mega man zero concept,unconscious zero,maveric...",
1447,Zoom (New 52),Hunter Zolomon,,"Accelerated Healing,Agility,Durability,Electro...","Black Flash (CW),Zoom,Zoom (CW)",Judge · Reverse-Flash · The Flash,,Flash Secret Files and Origins #3,DC Comics,"Bad, Villan, Enemy, Destroyer, Mischief, Evil","Criminal · former F.B.I. Profiler, Politician","Keystone City, Kansas",Flash Family,Ashley Zolomon (ex-wife),Male,,Red,Brown,,"hunter zolomon,flash,zoom,former friend wally ...","zoom,speed,force,hunter,conduit"
1448,Zoom,Hunter Zolomon,Hunter Zolomon,"Intangibility,Super Speed,Time Manipulation,Ti...","Black Flash (CW),Zoom (CW),Zoom (New 52)",,,Flash Secret Files #3,DC Comics,"Bad, Villan, Enemy, Destroyer, Mischief, Evil",,"Keystone City, Kansas","Rogues,The Society,Flash Family,Justice League...",Ashley Zolomon (ex-wife),Male,,Red,Brown,,"hunter zolomon,barry,current flash,car acciden...","speeds,zoom,superman,flashes,force"


#### Text Rank

I will use the `Spacy-Transformers` package together with PyTextRank to compute `TextRank keywords`, and spaCy's `Named Entity Recognition` to extract entities, from the History Text and Power Text columns

In [28]:
# Spacy Transformer Model
nlp = spacy.load("en_core_web_trf")
# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")

<pytextrank.base.BaseTextRankFactory at 0x17e8fc8b0>

In [38]:
def get_named_entities(text: str, spacy_model: spacy.lang.en.English = nlp, exclude_ner_labels: tuple = ('MONEY', 'DATE', 'CARDINAL')) -> list:
    # Collect unique entity texts in order of appearance, skipping excluded labels
    doc = spacy_model(text)
    entities = []
    for ent in doc.ents:
        if ent.label_ not in exclude_ner_labels and ent.text not in entities:
            entities.append(ent.text)
    return entities

In [29]:
def get_textkank_keywords(text: str, spacy_model: spacy.lang.en.English = nlp) -> list:
  # Remove stop words first, then rank phrases with TextRank on the filtered text
  doc = spacy_model(text)
  filtered_tokens = [token.text for token in doc if not token.is_stop]

  # Re-run the pipeline on the lowercased, stop-word-free text
  filtered_doc = spacy_model(" ".join(filtered_tokens).lower())

  # De-duplicate the ranked phrases (the set does not preserve rank order)
  return list({phrase.text for phrase in filtered_doc._.phrases})

In [30]:
# Replace NaN values with Empty Strings
df['history_text'] = df['history_text'].apply(lambda x: "" if isinstance(x, float) else x)

In [388]:
# Compute keywords and named entities for History Text using spaCy Transformers
keywords, entities = [], []
for index, text in enumerate(df['history_text']):
  if len(text.split(' ')) <=1 :
    keywords.append("")
    entities.append("")
    continue
  entities.append(get_named_entities(text))
  keywords.append(get_textkank_keywords(text))

In [54]:
categorical_data['history_textrank_keywords'] = keywords
categorical_data['history_NER'] = entities

In [60]:
# Replace NaN values with Empty Strings
df['powers_text'] = df['powers_text'].apply(lambda x: "" if isinstance(x, float) else x)

In [389]:
# Compute keywords and named entities for Power Text using spaCy Transformers

keywords, entities = [], []
for index, text in enumerate(df['powers_text']):
  if len(text.split(' ')) <=1 :
    keywords.append("")
    entities.append("")
    continue
  entities.append(get_named_entities(text))
  keywords.append(get_textkank_keywords(text))

In [64]:
categorical_data['powers_textrank_keywords'] = keywords
categorical_data['powers_NER'] = entities

In [69]:
# Convert Lists into comma-separated values

categorical_data['history_textrank_keywords'] = categorical_data['history_textrank_keywords'].apply(lambda x: ",".join(x))
categorical_data['history_NER'] = categorical_data['history_NER'].apply(lambda x: ",".join(x))
categorical_data['powers_textrank_keywords'] = categorical_data['powers_textrank_keywords'].apply(lambda x: ",".join(x))
categorical_data['powers_NER'] = categorical_data['powers_NER'].apply(lambda x: ",".join(x))

Combine the keywords and named entities from the KeyBERT and spaCy steps

In [72]:

categorical_data['history_extracted_keywords'] = categorical_data['history_keywords'] + categorical_data['history_textrank_keywords'] + categorical_data['history_NER']


In [73]:

categorical_data['powers_extracted_keywords'] = categorical_data['power_keywords'] + categorical_data['powers_textrank_keywords'] + categorical_data['powers_NER']
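Note that plain `+` concatenation puts no separator between the three columns, so the last keyword of one column fuses with the first keyword of the next (visible in the output below as `transformroug...`). A separator-aware merge could look like the following sketch; `join_keyword_columns` is a hypothetical helper, the de-duplication step is an added assumption, and the sample keywords are illustrative only.

```python
def join_keyword_columns(*columns: str, sep: str = ",") -> str:
    # Merge comma-separated keyword strings, dropping blanks and duplicates
    seen, merged = set(), []
    for column in columns:
        for keyword in column.split(sep):
            keyword = keyword.strip()
            if keyword and keyword not in seen:
                seen.add(keyword)
                merged.append(keyword)
    return sep.join(merged)

# Illustrative values: KeyBERT keywords, TextRank keywords, empty NER column
combined = join_keyword_columns("jones,superhuman,force", "rough skin,force", "")
print(combined)
```

With pandas it could be applied row-wise, for example `categorical_data[['power_keywords', 'powers_textrank_keywords', 'powers_NER']].apply(lambda row: join_keyword_columns(*row), axis=1)`.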

In [77]:
# Drop extra keywords and NER columns

categorical_data.drop(['history_keywords', 'history_textrank_keywords', 'history_NER', 'power_keywords', 'powers_textrank_keywords', 'powers_NER'], axis=1, inplace=True)

In [78]:
categorical_data

Unnamed: 0,name,real_name,full_name,superpowers,alter_egos,aliases,place_of_birth,first_appearance,creator,alignment,occupation,base,teams,relatives,gender,type_race,eye_color,hair_color,skin_color,history_extracted_keywords,powers_extracted_keywords
0,3-D Man,"Delroy Garrett, Jr.","Delroy Garrett, Jr.","Super Speed,Super Strength",,,,,Marvel Comics,"Good, Hero, Saviour, Protector of Justice",,,"Annihilators,Asgardians,Avengers,New Avengers",,Male,Human,,,,"delroy garrett,triune understanding,steroids,n...",
1,514A (Gotham),Bruce Wayne,,"Durability,Reflexes,Super Strength","Batgod,Batman,Batman (1966),Batman (Arkham),Ba...","Subject 514A,Bruce Wayne,Bruce 2",,,DC Comics,,,,,Bruce Wayne (genetic template),,,,,,"bruce wayne look,alfred catch,selina kyle,subj...",
2,A-Bomb,Richard Milhouse Jones,Richard Milhouse Jones,"Accelerated Healing,Agility,Berserk Mode,Blood...",,Rick Jones,"Scarsdale, Arizona","Hulk Vol 2 #2 (April, 2008) (as A-Bomb)",Marvel Comics,"Good, Hero, Saviour, Protector of Justice","Musician, adventurer, author; formerly talk sh...",,"Teen Brigade,Ultimate Fantastic Four,U-Men,God...",Marlo Chandler-Jones (wife); Polly (aunt); Mrs...,Male,Human,Yellow,No Hair,,"hulk,rick radiation poisoning,jones,gamma bomb...","jones,superhuman,force,radiation,transformroug..."
3,Aa,Aa,,"Energy Absorption,Energy Armor,Energy Beams,En...",,,Stoneworld,Green Lantern Vol 3 #21,DC Comics,"Good, Hero, Saviour, Protector of Justice",,,"Blue Lantern Corps,Green Lantern Corps,Justice...",,Male,Human,,,,"green lantern corps,comatose star sapphire,hal...",
4,Aaron Cash,Aaron Cash,Aaron Cash,"Weapon-based Powers,Weapons Master",,,Gotham City,,DC Comics,"Good, Hero, Saviour, Protector of Justice",,,,,Male,Human,,,,"aaron cash,hook,killer croc,arkham asylum,hand...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1445,Zatanna,Zatanna Zatara,Zatanna Zatara,"Cryokinesis,Fire Control,Magic,Probability Man...","Chaos Zatanna,Doctor Fate,Zatanna (Full Potent...",,,Hawkman #4,DC Comics,"Good, Hero, Saviour, Protector of Justice",,,"Seven Soldiers of Victory,Sentinels of Magic,Y...","Giovanni ""John"" Zatara (father, deceased), Sin...",Female,Human,Blue,Black,,"zatanna,evil witch allura,john constantine,mag...","zatanna,spellcasting,telepathic,able,doessense..."
1446,Zero,DWN-∞: Zero,DWN-∞: Zero,"Accelerated Healing,Acrobatics,Agility,Cold Re...",,,,Mega Man X (1993),Capcom,"Good, Hero, Saviour, Protector of Justice",,,,,Male,Robot,Blue,Blond,Red,"mega man zero concept,unconscious zero,maveric...",
1447,Zoom (New 52),Hunter Zolomon,,"Accelerated Healing,Agility,Durability,Electro...","Black Flash (CW),Zoom,Zoom (CW)",Judge · Reverse-Flash · The Flash,,Flash Secret Files and Origins #3,DC Comics,"Bad, Villan, Enemy, Destroyer, Mischief, Evil","Criminal · former F.B.I. Profiler, Politician","Keystone City, Kansas",Flash Family,Ashley Zolomon (ex-wife),Male,,Red,Brown,,"hunter zolomon,flash,zoom,former friend wally ...","zoom,speed,force,hunter,conduitforce barrier,b..."
1448,Zoom,Hunter Zolomon,Hunter Zolomon,"Intangibility,Super Speed,Time Manipulation,Ti...","Black Flash (CW),Zoom (CW),Zoom (New 52)",,,Flash Secret Files #3,DC Comics,"Bad, Villan, Enemy, Destroyer, Mischief, Evil",,"Keystone City, Kansas","Rogues,The Society,Flash Family,Justice League...",Ashley Zolomon (ex-wife),Male,,Red,Brown,,"hunter zolomon,barry,current flash,car acciden...","speeds,zoom,superman,flashes,forceslow - motio..."


In [79]:
# Save processed Categorical Data
# categorical_data.to_parquet(f'{data_directory}/superhero_nlp_categotical_processed.gzip', compression='gzip')

In [223]:
categorical_data = pd.read_parquet(f'{data_directory}/superhero_nlp_categotical_processed.gzip')

### Numeric Data

Numerical data has to be bucketed into labeled classes before it can be converted into text. Below, I map the numerical values to attribute classes

In [213]:
# map all `∞` values to 10,000, `-` to 0 and convert column type to int

numeric_data['overall_score'] = numeric_data['overall_score'].str.replace('∞','10000')
numeric_data['overall_score'] = numeric_data['overall_score'].str.replace('-','0')
numeric_data['overall_score'] = pd.to_numeric(numeric_data['overall_score'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  numeric_data['overall_score'] = numeric_data['overall_score'].str.replace('∞','10000')
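This warning appears because `numeric_data` was created by selecting columns from `df`, so pandas cannot tell whether the assignment is meant to change `df` as well. One common way to avoid it, assuming the original `df` should stay untouched, is to take an explicit copy before modifying the subset. The three-row frame below is made up for the illustration.

```python
import pandas as pd

# Hypothetical miniature of the dataset
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "overall_score": ["12", "∞", "-"],
    "speed_score": [10, 100, 55],
})

# .copy() makes numeric_data an independent DataFrame, so the assignments
# below modify only the copy and do not raise SettingWithCopyWarning
numeric_data = df[["overall_score", "speed_score"]].copy()

cleaned = (
    numeric_data["overall_score"]
    .str.replace("∞", "10000", regex=False)
    .str.replace("-", "0", regex=False)
)
numeric_data["overall_score"] = pd.to_numeric(cleaned)
print(numeric_data)
```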

To map numeric values into buckets, we need to pass the bin edges and the labels to the pandas `pd.cut` function. I've written a small function that collects these settings in a dictionary

In [214]:
def generate_settings(dataset: pd.DataFrame, numeric_labels: dict):
  # Generating required setting for converting numbers to categorical classes
  settings = {}
  for col in dataset:
    # Bin ranges have been chosen based on summary stats of numeric columns
    if col == 'overall_score':
      score_bucket = [0,8,13,10_000]
    elif col in ["intelligence_score", "speed_score"]:
      score_bucket = [0,50,100]
    else:
      continue

    x = {
        "labels": numeric_labels.get(col),
        "bucket": score_bucket
    }
    settings[col] = x
  return settings
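To see how such bin edges behave, here is a small sketch of `pd.cut` with the `overall_score` buckets. The short labels are placeholders for the longer synonym strings, and the sample scores are invented. By default the intervals are open on the left, so a score of exactly 0 (for example an `overall_score` of `-` that was mapped to 0) ends up unlabeled unless `include_lowest=True` is passed.

```python
import pandas as pd

scores = pd.Series([0, 5, 8, 12, 10_000])
bins = [0, 8, 13, 10_000]
labels = ["weak", "mediocre", "strong"]  # placeholder labels

# Default: intervals are (0, 8], (8, 13], (13, 10000], so 0 gets NaN
default_cut = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True also assigns the left edge (0) to the first bucket
inclusive_cut = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

Whether zero scores should fall into the lowest class or stay empty is a design choice; the sketch only shows how each option is expressed.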

Now, for `overall_score` the classes are essentially `weak`, `mediocre`, and `strong`, while `intelligence_score` and `speed_score` use only a low and a high class. To make the description richer in keywords, I've added multiple synonyms for each class, so a query can still match when it uses a word that does not appear in the description extracted from the dataset

In [215]:
numeric_labels = {
    "overall_score": ["feeble, powerless, frail, puny, breakable, easily broken, brittle, helpless, useless, vulnerable, without power, powerless, ineffectual, inadequate, ineffective, defenseless, vulnerable, gutless, disenfranchised, frangible, smashable, splintery, flimsy, delicate, fracturable, delicate, faint, weak, fragile, weakly, infirm, sick, sickly, shaky, debilitated, ailing,, decrepit, enervated", 
                      "moderate, capable, average, middle, decent, adequate", 
                      "most powerfull, invincible, god like, all mighty, supreme, super strong, beast, powerful, muscular, brawny, super strong, all-powerful, authoritative, capable, compelling, dominant, influential, mighty, persuasive, potent, strengthy sturdy, well built, sturdy, hefty, fit, athletic, vigorous, long-lasting, hard-wearing, heavy-duty, tough, resistant, strong, sturdy, stout, sound, substantial, imperishable, indestructible"],
    "intelligence_score": ["dumb, dim witted, stupid, unintelligent, brainless, foolish",
                           "smart, clever, bright, intelligent, acute, shrewd, sharpe witted, educated"],
    "speed_score": ["slow, unhurried, slow-moving, slow-going, relaxed, laggard, dawdler, slowpoke, plodding, inert", 
                    "fast, speedy, quick, swift, rapid, brisk, nimble, sprightly, high-speed, turbo, sporty, accelerated, express, supersonic"],
}

In [216]:
numeric_to_text = generate_settings(numeric_data, numeric_labels)

In [217]:
numeric_to_text

{'overall_score': {'labels': ['feeble, powerless, frail, puny, breakable, easily broken, brittle, helpless, useless, vulnerable, without power, powerless, ineffectual, inadequate, ineffective, defenseless, vulnerable, gutless, disenfranchised, frangible, smashable, splintery, flimsy, delicate, fracturable, delicate, faint, weak, fragile, weakly, infirm, sick, sickly, shaky, debilitated, ailing,, decrepit, enervated',
   'moderate, capable, average, middle, decent, adequate',
   'most powerfull, invincible, god like, all mighty, supreme, super strong, beast, powerful, muscular, brawny, super strong, all-powerful, authoritative, capable, compelling, dominant, influential, mighty, persuasive, potent, strengthy sturdy, well built, sturdy, hefty, fit, athletic, vigorous, long-lasting, hard-wearing, heavy-duty, tough, resistant, strong, sturdy, stout, sound, substantial, imperishable, indestructible'],
  'bucket': [0, 8, 13, 10000]},
 'intelligence_score': {'labels': ['dumb, dim witted, stup

In [220]:
# Map numerical ranges to classes
# Converting numerical data to categorical data

for col in numeric_to_text:
  numeric_data[col] = pd.cut(numeric_data[col], bins=numeric_to_text.get(col).get('bucket'), labels=numeric_to_text.get(col).get('labels'))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  numeric_data[col] = pd.cut(numeric_data[col], bins=numeric_to_text.get(col).get('bucket'), labels=numeric_to_text.get(col).get('labels'))

In [221]:
# These columns don't add much information, so let's remove them

numeric_data.drop(['strength_score', 'durability_score', 'power_score', 'combat_score'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  numeric_data.drop(['strength_score', 'durability_score', 'power_score', 'combat_score'], axis=1, inplace=True)


In [222]:
numeric_data

Unnamed: 0,overall_score,intelligence_score,speed_score
0,"feeble, powerless, frail, puny, breakable, eas...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."
1,"moderate, capable, average, middle, decent, ad...","smart, clever, bright, intelligent, acute, shr...","slow, unhurried, slow-moving, slow-going, rela..."
2,"most powerfull, invincible, god like, all migh...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."
3,"moderate, capable, average, middle, decent, ad...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."
4,"feeble, powerless, frail, puny, breakable, eas...","smart, clever, bright, intelligent, acute, shr...","slow, unhurried, slow-moving, slow-going, rela..."
...,...,...,...
1445,"moderate, capable, average, middle, decent, ad...","smart, clever, bright, intelligent, acute, shr...","slow, unhurried, slow-moving, slow-going, rela..."
1446,"most powerfull, invincible, god like, all migh...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."
1447,"most powerfull, invincible, god like, all migh...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."
1448,"moderate, capable, average, middle, decent, ad...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."


### Combined Data

In [None]:
superhero_description_df = pd.concat([categorical_data, numeric_data], axis=1)

In [260]:
superhero_description_df

Unnamed: 0,name,real_name,full_name,superpowers,alter_egos,aliases,place_of_birth,first_appearance,creator,alignment,occupation,base,teams,relatives,gender,type_race,eye_color,hair_color,skin_color,history_extracted_keywords,powers_extracted_keywords,overall_score,intelligence_score,speed_score
0,3-D Man,"Delroy Garrett, Jr.","Delroy Garrett, Jr.","Super Speed,Super Strength",,,,,Marvel Comics,"Good, Hero, Saviour, Protector of Justice",,,"Annihilators,Asgardians,Avengers,New Avengers",,Male,Human,,,,"delroy garrett,triune understanding,steroids,n...",,"feeble, powerless, frail, puny, breakable, eas...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."
1,514A (Gotham),Bruce Wayne,,"Durability,Reflexes,Super Strength","Batgod,Batman,Batman (1966),Batman (Arkham),Ba...","Subject 514A,Bruce Wayne,Bruce 2",,,DC Comics,,,,,Bruce Wayne (genetic template),,,,,,"bruce wayne look,alfred catch,selina kyle,subj...",,"moderate, capable, average, middle, decent, ad...","smart, clever, bright, intelligent, acute, shr...","slow, unhurried, slow-moving, slow-going, rela..."
2,A-Bomb,Richard Milhouse Jones,Richard Milhouse Jones,"Accelerated Healing,Agility,Berserk Mode,Blood...",,Rick Jones,"Scarsdale, Arizona","Hulk Vol 2 #2 (April, 2008) (as A-Bomb)",Marvel Comics,"Good, Hero, Saviour, Protector of Justice","Musician, adventurer, author; formerly talk sh...",,"Teen Brigade,Ultimate Fantastic Four,U-Men,God...",Marlo Chandler-Jones (wife); Polly (aunt); Mrs...,Male,Human,Yellow,No Hair,,"hulk,rick radiation poisoning,jones,gamma bomb...","jones,superhuman,force,radiation,transformroug...","most powerfull, invincible, god like, all migh...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."
3,Aa,Aa,,"Energy Absorption,Energy Armor,Energy Beams,En...",,,Stoneworld,Green Lantern Vol 3 #21,DC Comics,"Good, Hero, Saviour, Protector of Justice",,,"Blue Lantern Corps,Green Lantern Corps,Justice...",,Male,Human,,,,"green lantern corps,comatose star sapphire,hal...",,"moderate, capable, average, middle, decent, ad...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."
4,Aaron Cash,Aaron Cash,Aaron Cash,"Weapon-based Powers,Weapons Master",,,Gotham City,,DC Comics,"Good, Hero, Saviour, Protector of Justice",,,,,Male,Human,,,,"aaron cash,hook,killer croc,arkham asylum,hand...",,"feeble, powerless, frail, puny, breakable, eas...","smart, clever, bright, intelligent, acute, shr...","slow, unhurried, slow-moving, slow-going, rela..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1445,Zatanna,Zatanna Zatara,Zatanna Zatara,"Cryokinesis,Fire Control,Magic,Probability Man...","Chaos Zatanna,Doctor Fate,Zatanna (Full Potent...",,,Hawkman #4,DC Comics,"Good, Hero, Saviour, Protector of Justice",,,"Seven Soldiers of Victory,Sentinels of Magic,Y...","Giovanni ""John"" Zatara (father, deceased), Sin...",Female,Human,Blue,Black,,"zatanna,evil witch allura,john constantine,mag...","zatanna,spellcasting,telepathic,able,doessense...","moderate, capable, average, middle, decent, ad...","smart, clever, bright, intelligent, acute, shr...","slow, unhurried, slow-moving, slow-going, rela..."
1446,Zero,DWN-∞: Zero,DWN-∞: Zero,"Accelerated Healing,Acrobatics,Agility,Cold Re...",,,,Mega Man X (1993),Capcom,"Good, Hero, Saviour, Protector of Justice",,,,,Male,Robot,Blue,Blond,Red,"mega man zero concept,unconscious zero,maveric...",,"most powerfull, invincible, god like, all migh...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."
1447,Zoom (New 52),Hunter Zolomon,,"Accelerated Healing,Agility,Durability,Electro...","Black Flash (CW),Zoom,Zoom (CW)",Judge · Reverse-Flash · The Flash,,Flash Secret Files and Origins #3,DC Comics,"Bad, Villan, Enemy, Destroyer, Mischief, Evil","Criminal · former F.B.I. Profiler, Politician","Keystone City, Kansas",Flash Family,Ashley Zolomon (ex-wife),Male,,Red,Brown,,"hunter zolomon,flash,zoom,former friend wally ...","zoom,speed,force,hunter,conduitforce barrier,b...","most powerfull, invincible, god like, all migh...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."
1448,Zoom,Hunter Zolomon,Hunter Zolomon,"Intangibility,Super Speed,Time Manipulation,Ti...","Black Flash (CW),Zoom (CW),Zoom (New 52)",,,Flash Secret Files #3,DC Comics,"Bad, Villan, Enemy, Destroyer, Mischief, Evil",,"Keystone City, Kansas","Rogues,The Society,Flash Family,Justice League...",Ashley Zolomon (ex-wife),Male,,Red,Brown,,"hunter zolomon,barry,current flash,car acciden...","speeds,zoom,superman,flashes,forceslow - motio...","moderate, capable, average, middle, decent, ad...","smart, clever, bright, intelligent, acute, shr...","fast, speedy, quick, swift, rapid, brisk, nimb..."


Now iterate over all the rows and concatenate the attributes from the different columns. The result is the final description for each superhero. The function below does this.

In [226]:
def generate_description(data: pd.DataFrame) -> pd.DataFrame:

  def description_template(row: pd.Series) -> str:
    
    if row["gender"] == "":
      row["gender"] = "They"
    
    return f"""
    {row["real_name"].lower()},{row["full_name"]},{row["gender"]},{row["superpowers"]},{row["alter_egos"]},{row["aliases"]},{row["place_of_birth"]},{row["first_appearance"]},{row["creator"]},{row["alignment"]},{row["occupation"]},{row["base"]},{row["teams"]},{row["relatives"]},{row["type_race"]},{row["eye_color"]},{row["hair_color"]},{row["skin_color"]},{row["history_extracted_keywords"]},{row["powers_extracted_keywords"]},{row["overall_score"]},{row["intelligence_score"]},{row["speed_score"]}
    """
  description_data = {
      "hero_name": [],
      "hero_description": []
  }


  for index, row in data.iterrows():
    description = description_template(row)
    description_data['hero_name'].append(row["name"])
    description_data['hero_description'].append(description.lower())
  
  df = pd.DataFrame(description_data)
  df['hero_description'] = df['hero_description'].apply(lambda x: re.sub("\n", "", x))
  df['hero_description'] = df['hero_description'].apply(lambda x: re.sub(r'\s+', " ", x))
  return df
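
To make the concatenation step concrete, here is a minimal, pandas-free sketch of the same idea on a single record. The helper name `build_description` and the toy row are hypothetical and not part of the notebook; the sketch also assumes that missing values (`None` or `NaN`) should become empty fields so that the comma positions stay aligned.

```python
def build_description(row: dict, columns: list) -> str:
    """Join the selected attributes of one record into a lowercase, comma-separated string."""
    parts = []
    for col in columns:
        value = row.get(col)
        # Assumption: None and NaN (NaN != NaN) are treated as empty fields
        if value is None or value != value:
            value = ""
        parts.append(str(value).strip())
    return ",".join(parts).lower()


# Toy record (hypothetical values, shaped like a row of the combined data)
toy_row = {
    "real_name": "Bruce Wayne",
    "full_name": None,
    "gender": "Male",
    "superpowers": "Durability,Reflexes",
}
print(build_description(toy_row, ["real_name", "full_name", "gender", "superpowers"]))
```

The empty `full_name` shows up as two consecutive commas, which matches the shape of the generated descriptions.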

In [227]:
superhero_description_text_df = generate_description(superhero_description_df)

In [262]:
superhero_description_text_df

Unnamed: 0,hero_name,hero_description
0,3-D Man,"delroy garrett, jr.,delroy garrett, jr.,male,..."
1,514A (Gotham),"bruce wayne,,they,durability,reflexes,super s..."
2,A-Bomb,"richard milhouse jones,richard milhouse jones..."
3,Aa,"aa,,male,energy absorption,energy armor,energ..."
4,Aaron Cash,"aaron cash,aaron cash,male,weapon-based power..."
...,...,...
1445,Zatanna,"zatanna zatara,zatanna zatara,female,cryokine..."
1446,Zero,"dwn-∞: zero,dwn-∞: zero,male,accelerated heal..."
1447,Zoom (New 52),"hunter zolomon,,male,accelerated healing,agil..."
1448,Zoom,"hunter zolomon,hunter zolomon,male,intangibil..."


Check individual records to see whether the comma-separated values are concatenated correctly

In [390]:
get_record(index=133, data=superhero_description_text_df)

{'hero_name': 'Batman (DC One Million)',
 'hero_description': ' ,,male,acrobatics,agility,anti-gravity,cloaking,dexterity,duplication,durability,electrokinesis,element control,endurance,enhanced senses,explosion manipulation,fire resistance,flight,gliding,gravity control,homing attack,illusions,intelligence,invisibility,jump,longevity,marksmanship,master martial artist,paralysis,peak human condition,power suit,reflexes,soul manipulation,stamina,stealth,technopath/cyberpath,telepathy,weapon-based powers,weapons master,,,,,dc comics,good, hero, saviour, protector of justice,,,,,alien,blue,blond,,batman,20th century justice league,xauron,dangerous asylum planet,years20th century rebirth superman,batman million,evil,legacy batmen,parents,arena,families,batman,hundreds,new incarnation justice league,thought,spouses,20th century,criminal xauron,plans,light,criminals,fear,tragedy,combat,superman million,ideal,dangerous variety,millions physical mental techniques,technology,justice legion alph

In [230]:
# Save the updated combined description data
# superhero_description_text_df.to_parquet(f'{data_directory}/superhero_desc_updated.gzip', compression='gzip')

# Description Search Solution

In [264]:
superhero_description_text_df = pd.read_parquet(f'{data_directory}/superhero_desc_updated.gzip')

In [265]:
superhero_description_text_df

Unnamed: 0,hero_name,hero_description
0,3-D Man,"delroy garrett, jr.,delroy garrett, jr.,male,..."
1,514A (Gotham),"bruce wayne,,they,durability,reflexes,super s..."
2,A-Bomb,"richard milhouse jones,richard milhouse jones..."
3,Aa,"aa,,male,energy absorption,energy armor,energ..."
4,Aaron Cash,"aaron cash,aaron cash,male,weapon-based power..."
...,...,...
1445,Zatanna,"zatanna zatara,zatanna zatara,female,cryokine..."
1446,Zero,"dwn-∞: zero,dwn-∞: zero,male,accelerated heal..."
1447,Zoom (New 52),"hunter zolomon,,male,accelerated healing,agil..."
1448,Zoom,"hunter zolomon,hunter zolomon,male,intangibil..."


#### Embedding Generation / Load

Use `Sentence Transformers` to generate an embedding for every description record

In [338]:
embedding_model = SentenceTransformer('all-mpnet-base-v2')
embedding_size = 768    # Size of embeddings
top_k_hits = 3          # Output k hits
embedding_cache_path = f'{artifact_directory}/superhero-embeddings-{embedding_size}-size.pkl'

In [339]:
# Check if the embedding cache exists
if not os.path.exists(embedding_cache_path):

    # Embedding cache is not present. Build embeddings from scratch
    print("Encoding the entire corpus as Embedding Cache is not present.\nThis might take a while")
    corpus_embeddings = embedding_model.encode(list(superhero_description_text_df['hero_description']),
                                               show_progress_bar=True, convert_to_numpy=True)

    print("Store file on disc")
    with open(embedding_cache_path, "wb") as fOut:
        pickle.dump({'hero_name': list(superhero_description_text_df['hero_name']),
                     'hero_description': list(superhero_description_text_df['hero_description']),
                     'hero_description_embeddings': corpus_embeddings}, fOut)
else:
    print("Load pre-computed embeddings from disc")
    with open(embedding_cache_path, "rb") as fIn:
        cache_data = pickle.load(fIn)
        hero_name = cache_data['hero_name']
        hero_description = cache_data['hero_description']
        hero_description_embeddings = cache_data['hero_description_embeddings']
        # Later cells read `corpus_embeddings`, so set it on this branch too
        corpus_embeddings = hero_description_embeddings

Encoding the entire corpus as Embedding Cache is not present.
This might take a while


Batches:   0%|          | 0/46 [00:00<?, ?it/s]

Store file on disc


In [340]:
len(corpus_embeddings[0])

768
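
The build-or-load logic above can also be written as a small reusable helper. This is only a sketch: `load_or_compute` is a hypothetical name, and it assumes the cached object can be pickled. Its main benefit is that both branches return the same value, so later cells do not depend on which branch ran.

```python
import os
import pickle
from typing import Any, Callable


def load_or_compute(cache_path: str, compute: Callable[[], Any]) -> Any:
    """Return the cached object if cache_path exists; otherwise compute it, cache it, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f_in:
            return pickle.load(f_in)
    result = compute()
    with open(cache_path, "wb") as f_out:
        pickle.dump(result, f_out)
    return result


# Hypothetical usage with the notebook's names:
# corpus_embeddings = load_or_compute(
#     embedding_cache_path,
#     lambda: embedding_model.encode(list(superhero_description_text_df['hero_description']),
#                                    convert_to_numpy=True),
# )
```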

#### Index Generation / Load

Build an HNSW graph for Approximate Nearest Neighbour Search and save the index

In [341]:
index_path = f"{artifact_directory}/hnswlib.index"
index = hnswlib.Index(space = 'cosine', dim = embedding_size)

In [342]:
if os.path.exists(index_path):
    print("Loading index...")
    index.load_index(index_path)
else:
    ### Create the HNSWLIB index
    print("Start creating HNSWLIB index")
    index.init_index(max_elements = len(corpus_embeddings), ef_construction = 400, M = 64)

    # Add the embeddings to the index (HNSW builds its graph as items are inserted)
    index.add_items(corpus_embeddings, list(range(len(corpus_embeddings))))

    print("Saving index to:", index_path)
    index.save_index(index_path)

# Controlling the recall by setting ef:
index.set_ef(50)  # ef should always be > top_k_hits

Start creating HNSWLIB index
Saving index to: /Users/rahulnenavath/Documents/Personal_Projects/SuperHeroAssignment/Artifacts/hnswlib.index


In [343]:
print("Corpus loaded with {} sentences / embeddings".format(len(list(superhero_description_text_df['hero_description']))))

Corpus loaded with 1450 sentences / embeddings


#### Semantic Search
Semantic Search fetches the top-k documents that are most similar to the input description

In [391]:
def semantic_search(description: str, top_k_hits:int=10, dataset:pd.DataFrame = superhero_description_text_df, verbose:bool = False) -> list:
  # Generate Input description embedding
  inp_hero_des_embedding = embedding_model.encode(description.lower())
  
  corpus_ids, distances = index.knn_query(inp_hero_des_embedding, k=top_k_hits)
  
  hits = [{'corpus_id': id, 'score': 1-score} for id, score in zip(corpus_ids[0], distances[0])]
  hits = sorted(hits, key=lambda x: x['score'], reverse=True)

  print("Input Hero Description:", description)
  corpus_ids = []
  for hit in hits[0:top_k_hits]:
    corpus_ids.append(hit['corpus_id'])
    if verbose:
      print("{:.3f}\t{}".format(hit['score'], dataset['hero_name'].iloc[hit['corpus_id']]))
  return corpus_ids
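
With `space = 'cosine'`, hnswlib returns a cosine distance, which is `1 - cosine similarity`; that is why the function converts each result with `1-score`. The sketch below shows the exact, brute-force computation that the HNSW index approximates. The function names and the toy 2-dimensional vectors are hypothetical; real embeddings here have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, corpus, k):
    """Brute-force nearest neighbours: (corpus_id, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for description embeddings
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top_hits = exact_top_k([1.0, 0.1], toy_corpus, k=2)
print([corpus_id for corpus_id, _ in top_hits])
```

For 1450 records a brute-force scan like this would also be fast enough; the approximate index mostly pays off as the corpus grows.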

#### Keyword Search

Re-rank the semantically similar records by keyword similarity. For each candidate document, we count how many of its tokens fuzzy-match the comma-separated keywords of the query, then return the top-k records with the highest counts.

In [350]:
def keyword_search(desc:str, corpus_ids:list, dataset:pd.DataFrame=superhero_description_text_df, top_k:int=3, verbose:bool = False):
    
    # Build a dictionary which has {'document id': Number of keywords present}
    
    def text_similarity(keyword:str, text_token:str, threshold:float=0.85) -> int:
        # Compute text similarity between keyword and description word token
        sim = textdistance.ratcliff_obershelp.normalized_similarity(keyword, text_token)
        return 1 if sim >= threshold else 0
    
    def tokenisation(text:str) -> list:
        # Clean words and tokenise the text
        text = re.sub(' ', '-', text)
        text = text.split(',')
        clean_text = []
        for token in text:
            if token.startswith('-'):
                clean_text.append(token.replace('-', ''))
            else:
                clean_text.append(token)
        text = re.sub('-', ' ', " ".join(clean_text))
        return text.split(' ')
    
    # Overall keyword frequency for each document id
    freq_score_board = {}
    
    input_desc_keywords = desc.split(',')    
    
    for row_index in corpus_ids:
        des_doc = dataset.iloc[row_index]['hero_description']
        des_doc_tokens = tokenisation(des_doc)
        bit_array = []
        for key in input_desc_keywords:
            for des_doc_token in des_doc_tokens:
                bit_array.append(text_similarity(keyword=key, text_token=des_doc_token))
        
        freq_score_board[row_index] = sum(bit_array)

    sorted_freq = {k: v for k, v in sorted(freq_score_board.items(), key=lambda item: item[1], reverse=True)}
    
    if verbose:
        print([dataset.iloc[doc_id]['hero_name'] for doc_id in sorted_freq])
    
    superhero_names = []
    for index, doc_id in enumerate(sorted_freq.keys()):
        if index >= top_k:
            break
        superhero_names.append(dataset.iloc[doc_id]['hero_name'])
    return superhero_names
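
The scoring idea can be tried in isolation with the standard library. This sketch swaps in `difflib.SequenceMatcher`, whose `ratio()` is based on a Ratcliff/Obershelp-style matching, as a stand-in for `textdistance.ratcliff_obershelp`; the two are not guaranteed to give identical scores. It also assumes that keywords are stripped and lower-cased before matching and that documents are split on commas and whitespace, which is simpler than the `tokenisation` helper above. All names and the toy documents are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_match(keyword: str, token: str, threshold: float = 0.85) -> bool:
    """True if the two strings are at least `threshold` similar."""
    return SequenceMatcher(None, keyword, token).ratio() >= threshold


def keyword_hits(query: str, document: str, threshold: float = 0.85) -> int:
    """Count (keyword, token) pairs whose fuzzy similarity reaches the threshold."""
    # Assumption: normalise keywords so case and stray spaces do not lower the score
    keywords = [k.strip().lower() for k in query.split(",") if k.strip()]
    tokens = document.lower().replace(",", " ").split()
    return sum(fuzzy_match(k, t, threshold) for k in keywords for t in tokens)


# Toy documents shaped like the generated descriptions
toy_docs = {
    "Hero A": "tony stark,male,power suit,marvel comics,billionaire",
    "Hero B": "bruce wayne,male,martial artist,dc comics",
}
toy_query = "Billionaire, armor, male"
ranked = sorted(toy_docs, key=lambda name: keyword_hits(toy_query, toy_docs[name]), reverse=True)
print(ranked)
```

Note that multi-word keywords such as `new york city` are compared against single-word tokens, so they rarely reach the threshold; matching against comma-separated phrases instead would be one possible variation.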
    

#### Hybrid Search Inference

* First - Semantic Search to fetch based on context <br><br>
* Second - Keyword Search to find the accurate documents

In [392]:
desc = "marvel comics, armor, playboy, intelligent, smart, billionaire, hero, good, male, avenger, sheild"
index_ids = semantic_search(desc)
keyword_search(desc, index_ids, top_k=2)


Input Hero Description: marvel comics, armor, playboy, intelligent, smart, billionaire, hero, good, male, avenger, sheild


['Iron Man (Thorbuster)', 'Infinity Man']

In [371]:
desc = "Dark Knight, powerless, weak, gotham protector, martial artist, Hero, intelligent, Male, powerless, DC comics, Human"
index_ids = semantic_search(desc)
keyword_search(desc, index_ids, top_k=2)

Input Hero Description: Dark Knight, powerless, weak, gotham protector, martial artist, Hero, intelligent, Male, powerless, DC comics, Human


['Alfred (DCEU)', 'Batman (LEGO)']

In [372]:
desc = "marvel comics, new york city, Devil, male, human, intelligent, high senses, super reflexes"
index_ids = semantic_search(desc)
keyword_search(desc, index_ids, top_k=2)

Input Hero Description: marvel comics, new york city, Devil, male, human, intelligent, high senses, super reflexes


['Daredevil (FOX)', 'Spider-Man Noir']

In [374]:
desc = "DC comics, metropolis, red cape, heat vision, krytonian, super speed, super breath, super strength, daily planet"
index_ids = semantic_search(desc)
keyword_search(desc, index_ids, top_k=2)

Input Hero Description: DC comics, metropolis, red cape, heat vision, krytonian, super speed, super breath, super strength, daily planet


['Supergirl', 'Superboy-Prime']

In [375]:
desc = "marvel comics, super solider, super strength, brooklyn, male, hero, first solider, smart, healing"
index_ids = semantic_search(desc)
keyword_search(desc, index_ids, top_k=2)

Input Hero Description: marvel comics, super solider, super strength, brooklyn, male, hero, first solider, smart, healing


['Captain America (EMH)', 'Infinity Man']

In [376]:
desc = "mutant, marvel comeics, accelerated healing, claws, x men, powerful, animal instincts, adamantium, hero, male"
index_ids = semantic_search(desc)
keyword_search(desc, index_ids, top_k=2, verbose=False)

Input Hero Description: mutant, marvel comeics, accelerated healing, claws, x men, powerful, animal instincts, adamantium, hero, male


['Wolverine', 'Onslaught']

In [377]:
desc = "marvel comics, female, new york city, fast, intelligent, cat powers"
index_ids = semantic_search(desc)
keyword_search(desc, index_ids, top_k=2, verbose=False)

Input Hero Description: marvel comeics, female, new york city, fast, intelligent, cat powers


['Knockout', 'Black Cat (Edge of Time)']

In [380]:
desc = "marvel comics, male, hero, spider powers, new york city, super strength, durability, enhanced senses"
index_ids = semantic_search(desc)
keyword_search(desc, index_ids, top_k=2, verbose=False)

Input Hero Description: marvel comics, male, hero, spider powers, new york city, super strength, durability, enhanced senses


['Silk', 'Spider-Man Noir']

In [384]:
desc = "marvel comics, male, hero, odinson, champion of asgard, powerful, strong, super strenght, thunder"
index_ids = semantic_search(desc)
keyword_search(desc, index_ids, top_k=2, verbose=False)

Input Hero Description: marvel comics, male, hero, odinson, champion of asgard, powerful, strong, super strenght, thunder


['Thor (Odin Force)', 'Cosmic King Thor']
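
Each cell above repeats the same two calls, so the pipeline could be wrapped in one function. The sketch below is a generic two-stage wrapper that takes the retrieval and re-ranking steps as callables; `hybrid_search` and the toy stand-ins are hypothetical names, not part of the notebook.

```python
from typing import Callable, List


def hybrid_search(description: str,
                  retrieve: Callable[[str, int], List[int]],
                  rerank: Callable[[str, List[int], int], List[str]],
                  candidates_k: int = 10,
                  top_k: int = 2) -> List[str]:
    """Stage 1: fetch candidate ids. Stage 2: re-rank them and keep the best top_k names."""
    candidate_ids = retrieve(description, candidates_k)
    if not candidate_ids:
        return []
    return rerank(description, candidate_ids, top_k)


# Toy stand-ins so the sketch runs on its own
toy_names = ["Iron Man", "Batman", "Thor"]

def toy_retrieve(desc: str, k: int) -> List[int]:
    return [0, 2][:k]

def toy_rerank(desc: str, ids: List[int], k: int) -> List[str]:
    return [toy_names[i] for i in sorted(ids, reverse=True)][:k]

print(hybrid_search("armor, thunder", toy_retrieve, toy_rerank, top_k=1))

# Hypothetical usage with the notebook's functions:
# hybrid_search(desc,
#               lambda d, k: semantic_search(d, top_k_hits=k),
#               lambda d, ids, k: keyword_search(d, ids, top_k=k))
```

Keeping `candidates_k` larger than `top_k` gives the keyword stage room to reorder the semantic candidates.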

# Conclusion:

`Problem statement`: Given a superhero description, guess the superhero name <br><br>
`Solution`: 
* Transform the structured superhero data into comma-separated, unstructured attribute text. <br>
* The description is generated by concatenating existing categorical data, keywords, named entities, and numerical score labels.
* Perform Semantic Search to find the top-k semantically similar descriptions, then use Keyword Search to retrieve the closest matches from that set. <br><br>

`Example Input / Output`:<br>
* Input: `marvel comics, avenger green gamma monster, super strength, anger issues` <br>
* Output: `['Cosmic Hulk', 'Doc Samson']` <br>Both superheroes are from `Hulk comics`! <br><br>

`Improvements`: The results of this approach are only as good as the descriptions of each superhero. The results could therefore improve considerably if we augment each description with more distinguishing attributes and increase the size of the dataset. <br>
* Use the Wikipedia API to fetch additional information about each superhero.
* Scrape the Superhero Wiki pages for more detailed descriptions.
* Add / duplicate records - a single superhero can have multiple records with different sets of attributes.
* Pre-train a BERT model on the superhero dataset - this is a difficult task considering the size of the dataset.