# Semantic Search Using FAISS
* FAISS (Facebook AI Similarity Search) is an open-source library developed by Facebook AI Research for efficient similarity search and nearest-neighbor search over large collections of dense vectors. It is useful whenever you need to find the items closest to a given item in a high-dimensional space. In this notebook it retrieves the movie plots whose embeddings are closest to the embedding of a text query.
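
To make the idea concrete, here is a minimal NumPy sketch of exhaustive (brute-force) inner-product search, the same kind of exact comparison that the `IndexFlatIP` index built later in this notebook performs. The function name `brute_force_ip_search` and the toy vectors are illustrative assumptions and are not part of FAISS.

```python
import numpy as np

def brute_force_ip_search(database, queries, k):
    """Return (scores, ids) of the k database rows with the largest inner product per query."""
    # Inner product between every query and every database vector.
    scores = queries @ database.T  # shape: (n_queries, n_database)
    # Sort each row by descending score and keep the first k column indices.
    ids = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    top_scores = np.take_along_axis(scores, ids, axis=1)
    return top_scores, ids

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
database = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.7, 0.7, 0.0],
                     [0.0, 0.0, 1.0]], dtype="float32")
queries = np.array([[1.0, 0.2, 0.0]], dtype="float32")

top_scores, top_ids = brute_force_ip_search(database, queries, k=2)
print(top_ids)
```

This scan compares the query against every stored vector, so its cost grows linearly with the size of the collection. A library such as FAISS is useful because it implements this search efficiently and also offers approximate indexes for larger datasets.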

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings("ignore")

/kaggle/input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv


In [2]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.16.2-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1
  Downloading safetensors-0.3.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) .

In [3]:
!nvidia-smi 

Thu Jul  6 16:32:11 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [4]:
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
from textblob import TextBlob
from sentence_transformers import SentenceTransformer

# Pretrained sentence-embedding model intended for dot-product similarity
model = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')


In [5]:
data = pd.read_csv('../input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv',memory_map=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34886 entries, 0 to 34885
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Release Year      34886 non-null  int64 
 1   Title             34886 non-null  object
 2   Origin/Ethnicity  34886 non-null  object
 3   Director          34886 non-null  object
 4   Cast              33464 non-null  object
 5   Genre             34886 non-null  object
 6   Wiki Page         34886 non-null  object
 7   Plot              34886 non-null  object
dtypes: int64(1), object(7)
memory usage: 2.1+ MB


In [6]:
data['Plot'][3]

'Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading "His Photographer" and "His Press Agent" respectively, follow him into the shot; the photographer sets up his camera. "Teddy" aims his rifle upward at the tree and fells what appears to be a common house cat, which he then proceeds to stab. "Teddy" holds his prize aloft, and the press agent takes notes. The second shot is taken in a slightly different part of the wood, on a path. "Teddy" rides the path on his horse towards the camera and out to the left of the shot, followed closely by the press agent and photographer, still dutifully holding their signs.'

In [7]:
import gc
df = data[['Title','Plot']].copy()  # copy so later edits do not act on a view of data
del data
gc.collect()

0

In [8]:
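# Drop rows with missing values and repeated plots, then renumber the rows
# so that row labels match the positional ids stored in the FAISS index
df = df.dropna().drop_duplicates(subset=['Plot']).reset_index(drop=True)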

In [9]:
df['doc_len'] = df['Plot'].apply(lambda words: len(words.split()))
df.head()

| | Title | Plot | doc_len |
|---|---|---|---|
| 0 | Kansas Saloon Smashers | A bartender is working at a saloon, serving dr... | 83 |
| 1 | Love by the Light of the Moon | The moon, painted with a smiling face hangs ov... | 86 |
| 2 | The Martyred Presidents | The film, just over a minute long, is composed... | 76 |
| 3 | Terrible Teddy, the Grizzly King | Lasting just 61 seconds and consisting of two ... | 153 |
| 4 | Jack and the Beanstalk | The earliest known adaptation of the classic f... | 140 |




In [11]:
df['Plot'][5]

'Alice follows a large white rabbit down a "Rabbit-hole". She finds a tiny door. When she finds a bottle labeled "Drink me", she does, and shrinks, but not enough to pass through the door. She then eats something labeled "Eat me" and grows larger. She finds a fan when enables her to shrink enough to get into the "Garden" and try to get a "Dog" to play with her. She enters the "White Rabbit\'s tiny House," but suddenly resumes her normal size. In order to get out, she has to use the "magic fan."\r\nShe enters a kitchen, in which there is a cook and a woman holding a baby. She persuades the woman to give her the child and takes the infant outside after the cook starts throwing things around. The baby then turns into a pig and squirms out of her grip. "The Duchess\'s Cheshire Cat" appears and disappears a couple of times to Alice and directs her to the Mad Hatter\'s "Mad Tea-Party." After a while, she leaves.\r\nThe Queen invites Alice to join the "ROYAL PROCESSION": a parade of marching 

In [12]:
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [13]:
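import faiss
# Encode every plot into a dense vector (one row per movie)
encoded_data = model.encode(df.Plot.tolist())
encoded_data = np.asarray(encoded_data, dtype='float32')  # FAISS expects float32
# Exact inner-product index; IndexIDMap lets us attach our own integer ids
dim = encoded_data.shape[1]  # 768 for this model
index = faiss.IndexIDMap(faiss.IndexFlatIP(dim))
# Use the positional row number as the id (FAISS ids are int64)
index.add_with_ids(encoded_data, np.arange(len(df), dtype='int64'))
faiss.write_index(index, 'movie_plot.index')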


In [14]:
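def fetch_movie_info(dataframe_idx):
    """Return the title and plot of the row at the given positional index."""
    info = df.iloc[dataframe_idx]
    meta_dict = {}
    meta_dict['Title'] = info['Title']
    meta_dict['Text'] = info['Plot']
    return meta_dict

def search(query, top_k, index, model):
    """Encode the query and return the top_k closest movies from the index."""
    t = time.time()
    query_vector = model.encode([query])
    # index.search returns (scores, ids), each of shape (n_queries, top_k)
    scores, ids = index.search(query_vector, top_k)
    print('>>>> Results in Total Time: {}'.format(time.time() - t))
    top_k_ids = ids.tolist()[0]
    # np.unique removes duplicates but also sorts the ids in ascending order,
    # so the results come back in dataframe order, not ranked by score
    top_k_ids = list(np.unique(top_k_ids))
    results = [fetch_movie_info(idx) for idx in top_k_ids]
    return results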
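
Because `np.unique` returns its values sorted, the list produced by `search` follows row order rather than similarity rank. If ranked output is wanted, one possible alternative is to remove duplicates while keeping the order returned by the index. The helper name `dedupe_keep_order` below is a hypothetical addition, not part of the original notebook. Skipping `-1` assumes the FAISS convention of padding with `-1` when fewer than `top_k` neighbours are found.

```python
def dedupe_keep_order(ids):
    """Drop repeated ids and -1 padding while keeping the original (ranked) order."""
    seen = set()
    ordered = []
    for idx in ids:
        idx = int(idx)
        if idx == -1 or idx in seen:
            continue
        seen.add(idx)
        ordered.append(idx)
    return ordered

# Example with made-up ids: the first occurrence of each id keeps its rank.
ranked_ids = dedupe_keep_order([42, 7, 42, -1, 3])
print(ranked_ids)  # [42, 7, 3]
```

Inside `search`, this would stand in for the `np.unique` call, so the first result is the best match. The output recorded below was produced with the original `np.unique` version.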

In [15]:
# install the keywords extractors
!pip install git+https://github.com/LIAAD/yake
!pip install keyBERT

Collecting git+https://github.com/LIAAD/yake
  Cloning https://github.com/LIAAD/yake to /tmp/pip-req-build-bynwklld
  Running command git clone -q https://github.com/LIAAD/yake /tmp/pip-req-build-bynwklld
Collecting segtok
  Downloading segtok-1.5.11-py3-none-any.whl (24 kB)
Collecting jellyfish
  Downloading jellyfish-1.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Building wheels for collected packages: yake
  Building wheel for yake (setup.py) ... done
  Created wheel for yake: filename=yake-0.4.8-py2.py3-none-any.whl size=62568 sha256=971c7008dda5ab61592ac8ce719772de1ba62f8bc9e4fbfe7ec919725b45fd44
  Stored in directory: /tmp/pip-ephem-wheel-cache-ih5gdwhs/wheels/52/79/f4/dae9309f60266aa3767a4381405002b6f2955fbcf038d804da
Successfully built yake
Installing collected packages: segtok, jellyfish, yake
Successfully installed jellyfish-1.0.0 segtok-1.5.11 yake-0.4.8


In [16]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from yake import KeywordExtractor
from keybert import KeyBERT
kw_extractor = KeyBERT('distilbert-base-nli-mean-tokens')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!



### Searching with a query

In [17]:
from pprint import pprint

query="Artificial Intelligence based action movie"
results=search(query, top_k=5, index=index, model=model)

print("\n")
for result in results:
    print('\t',result['Title'])


>>>> Results in Total Time: 1.5976934432983398


	 The Cape Canaveral Monsters
	 Small Soldiers
	 Chappie
	 Armed Response
	 Galactic Armored Fleet Majestic Prince: Genetic Awakening


In [18]:
def get_hot_keywords(results, top_k, use_bert=True):
    """Print the top_k keywords for each search result."""
    # Build the extractor once instead of reloading the model for every document
    if use_bert:
        extractor = KeyBERT('distilbert-base-nli-mean-tokens')
    else:
        extractor = KeywordExtractor(lan="en", n=1, top=top_k)
    for doc in results:
        if use_bert:
            keywords = extractor.extract_keywords(doc['Text'], top_n=top_k, keyphrase_ngram_range=(1, 1), stop_words='english')
        else:
            keywords = extractor.extract_keywords(doc['Text'])
        # keywords = [x for x, y in keywords]
        print("Keywords of", doc['Title'], " movie are: \n", keywords)

In [19]:
get_hot_keywords(results,5)

Keywords of The Cape Canaveral Monsters  movie are: 
 [('extraterrestrials', 0.224), ('aliens', 0.2228), ('alien', 0.2208), ('florida', 0.1137), ('killed', 0.0928)]
Keywords of Small Soldiers  movie are: 
 [('kidnapped', 0.1618), ('hijacked', 0.1395), ('yosemite', 0.1334), ('ceo', 0.1333), ('commando', 0.133)]
Keywords of Chappie  movie are: 
 [('robots', 0.2791), ('gangsters', 0.2516), ('gangster', 0.2374), ('robot', 0.2032), ('ninja', 0.2021)]
Keywords of Armed Response  movie are: 
 [('killed', 0.2844), ('horrific', 0.1714), ('trapped', 0.0687), ('operatives', 0.0552), ('trained', 0.0269)]
Keywords of Galactic Armored Fleet Majestic Prince: Genetic Awakening  movie are: 
 [('alien', 0.3403), ('battle', 0.1361), ('invasion', 0.1224), ('fight', 0.0962), ('space', 0.0863)]
