Created by: Rosamund

## Overview

NOTE: 
Additional packages required: 
- pip install -U sentence-transformers==3.0.0

To avoid TqdmWarning: IProgress not found:
- pip install --upgrade jupyter ipywidgets


A Sentence Transformer is a type of natural language processing model designed specifically to produce meaningful and useful sentence embeddings. Sentence embeddings are fixed-length numerical representations that capture the semantic meaning of a sentence.

Reference: https://github.com/VishalS-HK/product-recommendation-system-BERT/blob/main/Product_Recommendation_System_BERT.ipynb

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import tqdm



## Extract data

In [2]:
anime_final_df = pd.read_csv('../data/anime_final.csv', sep="|")

In [3]:
print(f'No. of rows: {anime_final_df.shape[0]:,}')
print(f'No. of columns: {anime_final_df.shape[1]:,}')

No. of rows: 12,226
No. of columns: 22


In [4]:
anime_final_df.head(3)

Unnamed: 0,anime_id,name,type,episodes,mal_score,members,studio,release-season,release-year,release-date,...,themes,demographics,synopsis,image_url,rating,va_list,staff_list,recommended_review_count,mixedfeelings_review_count,notrecommended_review_count
0,32281,Kimi no Na wa.,Movie,1,9.37,200630,['CoMix Wave Films'],summer,2016.0,,...,[],,"Mitsuha Miyamizu, a high school girl, yearns t...",https://cdn.myanimelist.net/images/anime/5/870...,PG-13 - Teens 13 or older,"['Kamishiraishi, Mone', 'Kamiki, Ryunosuke', '...","['Bezerra, Wendel', 'Kawamura, Genki', 'Itou, ...",808.0,88.0,50.0
1,5114,Fullmetal Alchemist: Brotherhood,TV,64,9.26,793665,['Bones'],spring,2009.0,,...,['Military'],Shounen,After a horrific alchemy experiment goes wrong...,https://cdn.myanimelist.net/images/anime/1208/...,R - 17+ (violence & profanity),"['Park, Romi', 'Kugimiya, Rie', 'Miki, Shinich...","['Cook, Justin', 'Maruyama, Hiroo', 'Yonai, No...",912.0,59.0,39.0
2,28977,Gintama°,TV,51,9.25,114262,['Bandai Namco Pictures'],spring,2015.0,,...,"['Gag Humor', 'Historical', 'Parody', 'Samurai']",Shounen,"Gintoki, Shinpachi, and Kagura return as the f...",https://cdn.myanimelist.net/images/anime/3/720...,PG-13 - Teens 13 or older,"['Sugita, Tomokazu', 'Kugimiya, Rie', 'Sakaguc...","['Miyawaki, Chizuru', 'Takamatsu, Shinji', 'Yo...",79.0,3.0,1.0
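The `Unnamed: 0` column in the preview is usually a sign that the CSV was written with its row index included. As a minimal sketch, assuming that is the case here, passing `index_col=0` to `pd.read_csv` reads that column back as the index instead of as data. The two-row `sample_csv` string below is a hypothetical stand-in for `anime_final.csv`, not the real file.

```python
import io

import pandas as pd

# Hypothetical pipe-separated sample with an unnamed leading index column,
# which is what DataFrame.to_csv writes by default.
sample_csv = (
    "|anime_id|name\n"
    "0|32281|Kimi no Na wa.\n"
    "1|5114|Fullmetal Alchemist: Brotherhood\n"
)

# Default read: the unnamed column is kept as a data column.
default_df = pd.read_csv(io.StringIO(sample_csv), sep="|")

# index_col=0: the first column becomes the row index.
indexed_df = pd.read_csv(io.StringIO(sample_csv), sep="|", index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```

Whether to do this in the notebook depends on whether later cells rely on the `Unnamed: 0` column.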


Print the synopsis of the first row and of row 100:

In [5]:
anime_final_df.iloc[0]['synopsis']

"Mitsuha Miyamizu, a high school girl, yearns to live the life of a boy in the bustling city of Tokyo—a dream that stands in stark contrast to her present life in the countryside. Meanwhile in the city, Taki Tachibana lives a busy life as a high school student while juggling his part-time job and hopes for a future in architecture.    One day, Mitsuha awakens in a room that is not her own and suddenly finds herself living the dream life in Tokyo—but in Taki's body! Elsewhere, Taki finds himself living Mitsuha's life in the humble countryside. In pursuit of an answer to this strange phenomenon, they begin to search for one another.      revolves around Mitsuha and Taki's actions, which begin to have a dramatic impact on each other's lives, weaving them into a fabric held together by fate and circumstance.    [Written by MAL Rewrite]"

In [6]:
anime_final_df.iloc[100]['synopsis']

'After the National Tournament, the Seidou High baseball team moves forward with uncertainty as the Fall season quickly approaches. In an attempt to build a stronger team centered around their new captain, fresh faces join the starting roster for the very first time. Previous losses weigh heavily on the minds of the veteran players as they continue their rigorous training, preparing for what will inevitably be their toughest season yet.     Rivals both new and old stand in their path as Seidou once again climbs their way toward the top, one game at a time. Needed now more than ever before, Furuya and Eijun must be determined to pitch with all their skill and strength in order to lead their team to victory. And this time, one of these young pitchers may finally claim that coveted title: "The Ace of Seidou."    [Written by MAL Rewrite]'

Preprocessing:
The synopses above end with the credit '[Written by MAL Rewrite]'. Remove it, together with the whitespace around it.

In [7]:
anime_final_df['synopsis'] = anime_final_df['synopsis'].str.replace(r'\s*\[Written by MAL Rewrite\]\s*','',regex=True)

Check the same two synopses to confirm that the phrase is gone:

In [8]:
anime_final_df.iloc[0]['synopsis']

"Mitsuha Miyamizu, a high school girl, yearns to live the life of a boy in the bustling city of Tokyo—a dream that stands in stark contrast to her present life in the countryside. Meanwhile in the city, Taki Tachibana lives a busy life as a high school student while juggling his part-time job and hopes for a future in architecture.    One day, Mitsuha awakens in a room that is not her own and suddenly finds herself living the dream life in Tokyo—but in Taki's body! Elsewhere, Taki finds himself living Mitsuha's life in the humble countryside. In pursuit of an answer to this strange phenomenon, they begin to search for one another.      revolves around Mitsuha and Taki's actions, which begin to have a dramatic impact on each other's lives, weaving them into a fabric held together by fate and circumstance."

In [9]:
anime_final_df.iloc[100]['synopsis']

'After the National Tournament, the Seidou High baseball team moves forward with uncertainty as the Fall season quickly approaches. In an attempt to build a stronger team centered around their new captain, fresh faces join the starting roster for the very first time. Previous losses weigh heavily on the minds of the veteran players as they continue their rigorous training, preparing for what will inevitably be their toughest season yet.     Rivals both new and old stand in their path as Seidou once again climbs their way toward the top, one game at a time. Needed now more than ever before, Furuya and Eijun must be determined to pitch with all their skill and strength in order to lead their team to victory. And this time, one of these young pitchers may finally claim that coveted title: "The Ace of Seidou."'
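Checking two rows by eye does not cover the other rows. One way to confirm the removal across the whole column is to count the rows that still contain the phrase. The sketch below uses a hypothetical three-row Series in place of `anime_final_df['synopsis']`; the regular expression is the one from the cell above.

```python
import pandas as pd

# Hypothetical mini-column standing in for anime_final_df['synopsis'].
synopsis = pd.Series([
    "A story about fate.    [Written by MAL Rewrite]",
    "A baseball team trains hard.",
    None,  # a missing synopsis
])

# Same pattern as in the notebook: the credit plus surrounding whitespace.
cleaned = synopsis.str.replace(
    r"\s*\[Written by MAL Rewrite\]\s*", "", regex=True
)

# regex=False matches the phrase literally, so the brackets need no escaping.
# na=False treats missing synopses as "does not contain the phrase".
remaining = cleaned.str.contains(
    "[Written by MAL Rewrite]", regex=False, na=False
).sum()

print(f"Rows still containing the phrase: {remaining}")
```

Missing values pass through `.str.replace` unchanged, so this check does not hide them.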

Rows with a missing synopsis (NaN):

In [19]:
anime_final_df[anime_final_df['synopsis'].isna()]

Unnamed: 0,anime_id,name,type,episodes,mal_score,members,studio,release-season,release-year,release-date,...,themes,demographics,synopsis,image_url,rating,va_list,staff_list,recommended_review_count,mixedfeelings_review_count,notrecommended_review_count
31,32983,Natsume Yuujinchou Go,TV,13,8.76,38865,['Shuka'],fall,2016.0,,...,"['Iyashikei', 'Mythology']",Shoujo,,,,,,,,
62,32995,Yuri!!! on Ice,TV,12,8.61,103178,['MAPPA'],fall,2016.0,,...,[],,,,,,,,,
74,21,One Piece,TV,Unknown,8.58,504862,['Toei Animation'],fall,1999.0,,...,[],,,,,,,,,
76,31933,JoJo no Kimyou na Bouken: Diamond wa Kudakenai,TV,39,8.57,74074,['David Production'],spring,2016.0,,...,[],,,,,,,,,
140,10937,Mobile Suit Gundam: The Origin,OVA,6,8.42,15420,['Sunrise'],winter,2015.0,,...,"['Mecha', 'Military', 'Space']",Shounen,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12207,34492,Nuki Doki! Tenshi to Akuma no Sakusei Battle -...,OVA,Unknown,,392,['Collaboration Works'],winter,2017.0,,...,[],,,,,,,,,
12211,34491,Sagurare Otome The Animation,OVA,1,,79,['Studio 1st'],winter,2017.0,,...,[],,,,,,,,,
12212,34312,Saimin Class,OVA,Unknown,,240,['BreakBottle'],fall,2016.0,,...,[],,,,,,,,,
12214,34388,Shikkoku no Shaga The Animation,OVA,Unknown,,195,['Seven'],winter,2017.0,,...,[],,,,,,,,,


## Sentence Transformers

Instantiate the BERT sentence transformer

In [10]:
model = SentenceTransformer('bert-base-nli-mean-tokens')



Generate embeddings

In [14]:
# testing on first 5
synopsis_list = anime_final_df['synopsis'].iloc[:5].tolist()
sentence_embeddings = model.encode(synopsis_list, show_progress_bar=True)

# Note: .encode() has an optional argument, normalize_embeddings (bool), which
# controls whether the returned vectors are normalized to length 1. With
# normalized vectors, the faster dot product (util.dot_score) can be used
# instead of cosine similarity. Defaults to False.

Batches:   0%|          | 0/1 [00:00<?, ?it/s]
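The note on `normalize_embeddings` relies on a simple identity: once vectors have length 1, their dot product equals their cosine similarity. The sketch below illustrates this with NumPy only. The random `embeddings` array is a stand-in for real model output (an assumption for illustration), so no model download is needed.

```python
import numpy as np

# Stand-in for model.encode(...) output: 3 vectors of dimension 8.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 8)).astype(np.float32)

# Cosine similarity on the raw vectors: dot product divided by both norms.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
cosine = (embeddings @ embeddings.T) / (norms * norms.T)

# Normalize each vector to length 1, then take a plain dot product.
unit = embeddings / norms
dot = unit @ unit.T

print(np.allclose(cosine, dot, atol=1e-5))
```

In practice this means the normalization cost is paid once at encoding time rather than on every similarity query.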

In [15]:
sentence_embeddings

array([[-0.46397483,  0.32970548,  0.9904363 , ..., -0.3416259 ,
        -0.23031402, -0.02224792],
       [-0.11658153,  1.0129116 ,  0.49288163, ...,  0.22548924,
         0.18881917, -0.09548517],
       [-0.54377246,  0.90457124,  0.09619945, ...,  0.5572938 ,
         0.16025406, -0.03929615],
       [-0.53630084,  1.0769869 ,  0.28372797, ...,  0.1067517 ,
         0.2952971 , -0.2879131 ],
       [-0.6076364 ,  0.61866707,  0.42367676, ..., -0.19480148,
         0.11390668, -0.04527906]], dtype=float32)

Issue: Encoding the full column could not be completed because the 'synopsis' column contains NaN values.
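One possible way around this, sketched under the assumption that rows without a synopsis can simply be skipped, is to filter them out before building the list passed to `model.encode()`. The three-row frame below is a hypothetical stand-in for `anime_final_df`; the `model.encode()` call itself is left out so the snippet runs without downloading a model.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for anime_final_df.
df = pd.DataFrame({
    "anime_id": [32281, 32983, 5114],
    "synopsis": [
        "Two teenagers mysteriously swap bodies.",
        np.nan,  # missing synopsis, as in the rows listed earlier
        "Two brothers search for a way to restore their bodies.",
    ],
})

# Keep only the rows that have a synopsis, and reset the index so that
# position i in the list matches row i of the filtered frame.
has_synopsis = df["synopsis"].notna()
encodable_df = df.loc[has_synopsis].reset_index(drop=True)
synopsis_list = encodable_df["synopsis"].tolist()

# Row i of model.encode(synopsis_list) would then belong to
# encodable_df.loc[i, "anime_id"].
print(f"{(~has_synopsis).sum()} row(s) skipped, {len(synopsis_list)} to encode")
```

Filling missing values with an empty string would also let the encoding run, but every such row would then receive the same embedding, which may distort similarity results.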