### Importing the dataset

In [1]:
!wget https://calmcode.io/static/data/clinc.csv

--2024-02-19 22:44:57--  https://calmcode.io/static/data/clinc.csv
Resolving calmcode.io (calmcode.io)... 172.66.0.96, 162.159.140.98, 2a06:98c1:58::60, ...
Connecting to calmcode.io (calmcode.io)|172.66.0.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1265609 (1.2M) [application/octet-stream]
Saving to: ‘clinc.csv.2’


2024-02-19 22:44:58 (70.3 MB/s) - ‘clinc.csv.2’ saved [1265609/1265609]



In [6]:
import pandas as pd

# Load the CSV and keep each row's position in an explicit 'idx' column.
df = pd.read_csv('clinc.csv').assign(idx=lambda d: d.index)
df.head()

| | text | label | idx |
|---|---|---|---|
| 0 | how would you say fly in italian | translate | 0 |
| 1 | what's the spanish word for pasta | translate | 1 |
| 2 | how would they say butter in zambia | translate | 2 |
| 3 | how do you say fast in spanish | translate | 3 |
| 4 | what's the word for trees in norway | translate | 4 |


### Convert the dataframe to a list of dictionaries

In [7]:
documents = df.to_dict(orient='records')
documents[:5]

[{'text': 'how would you say fly in italian', 'label': 'translate', 'idx': 0},
 {'text': "what's the spanish word for pasta", 'label': 'translate', 'idx': 1},
 {'text': 'how would they say butter in zambia',
  'label': 'translate',
  'idx': 2},
 {'text': 'how do you say fast in spanish', 'label': 'translate', 'idx': 3},
 {'text': "what's the word for trees in norway",
  'label': 'translate',
  'idx': 4}]

### Importing lunr

In [8]:
!pip install lunr



In [9]:
from lunr import lunr

# Index the 'text' field; 'idx' identifies each document in the search results.
index = lunr(ref='idx', fields=('text',), documents=documents)

In [10]:
index.search("spanish")

[{'ref': '4501', 'score': 7.801, 'match_data': <MatchData "spanish">},
 {'ref': '3', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '26', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '27', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '28', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4526', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4529', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4556', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4573', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4575', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4576', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4585', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '5638', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '19505', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '19507', 'score': 

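For each match, the `index` object built by lunr returns the document's `ref` (its `idx` value, as a string) and a relevance score rather than the document itself. The refs can be used to look up the full records: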

In [11]:
[documents[int(i['ref'])] for i in index.search('spanish')]

[{'text': "can you tell me how to say 'i do not speak much spanish', in spanish",
  'label': 'translate',
  'idx': 4501},
 {'text': 'how do you say fast in spanish', 'label': 'translate', 'idx': 3},
 {'text': 'what is dog in spanish', 'label': 'translate', 'idx': 26},
 {'text': 'how do you say dog in spanish', 'label': 'translate', 'idx': 27},
 {'text': 'dog in spanish', 'label': 'translate', 'idx': 28},
 {'text': 'how can i say not now in spanish',
  'label': 'translate',
  'idx': 4526},
 {'text': 'how do you say goodbye in spanish',
  'label': 'translate',
  'idx': 4529},
 {'text': 'what is spanish for hello', 'label': 'translate', 'idx': 4556},
 {'text': 'how do you say thank you in spanish',
  'label': 'translate',
  'idx': 4573},
 {'text': 'how can i say thank you in spanish',
  'label': 'translate',
  'idx': 4575},
 {'text': 'what is thank you in spanish', 'label': 'translate', 'idx': 4576},
 {'text': 'how do you say cat in spanish', 'label': 'translate', 'idx': 4585},
 {'text': 

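The list comprehension above converts each result's `ref` back to an integer and uses it as a position in the `documents` list, returning the full records that match "spanish" in ranked order. This works because `idx` was taken from the dataframe's default index, so it matches each record's position in the list.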
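If `idx` did not line up with list positions (for example after filtering or reordering the records), positional lookup would return the wrong documents. A minimal sketch of a lookup keyed on `idx` instead is shown below. The helper name `docs_for_results` is hypothetical, and the `results` list is a hand-written stand-in shaped like lunr's search output, not real output.

```python
def docs_for_results(results, documents, key="idx"):
    """Map lunr-style search results back to full document records by key."""
    # lunr reports refs as strings, so the lookup table uses string keys.
    by_key = {str(doc[key]): doc for doc in documents}
    # Preserve the ranking order of the results; skip refs with no record.
    return [by_key[r["ref"]] for r in results if r["ref"] in by_key]


# Stand-in data (assumption): two records and results shaped like index.search(...).
documents = [
    {"text": "how do you say fast in spanish", "label": "translate", "idx": 3},
    {"text": "dog in spanish", "label": "translate", "idx": 28},
]
results = [{"ref": "28", "score": 7.62}, {"ref": "3", "score": 7.62}]

matched = docs_for_results(results, documents)
print([d["idx"] for d in matched])  # [28, 3]
```

With a real index, the same helper could be called as `docs_for_results(index.search('spanish'), documents)`.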

In [14]:
import json
from lunr.index import Index

# Serialize the lunr index into a JSON-compatible Python object.
serialized = index.serialize()

# Write the serialized index to 'idx.json'.
with open('idx.json', 'w') as fd:
    json.dump(serialized, fd)

# Read 'idx.json' and deserialize it back into a Python object.
with open('idx.json') as fd:
    reloaded = json.load(fd)

# Build a new lunr index object (idx) from the deserialized data.
idx = Index.load(reloaded)

# Search the reloaded index for the term 'spanish'.
idx.search('spanish')

[{'ref': '4501', 'score': 7.801, 'match_data': <MatchData "spanish">},
 {'ref': '3', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '26', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '27', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '28', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4526', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4529', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4556', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4573', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4575', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4576', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '4585', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '5638', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '19505', 'score': 7.62, 'match_data': <MatchData "spanish">},
 {'ref': '19507', 'score': 

The reloaded index returns the same results as the original, so a lunr index can be built once, stored as JSON, and loaded later by an application without re-indexing the documents.
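The build-once, load-later pattern can be illustrated without relying on lunr's internals. The sketch below uses a toy inverted index (a plain dictionary from token to document ids). It is not lunr's data structure or serialization format, and lunr's own pipeline does more than this whitespace tokenizer. The names `build_toy_index`, `toy_index`, and `reloaded_toy` are hypothetical.

```python
import json
from collections import defaultdict


def build_toy_index(documents, field="text", key="idx"):
    """Map each lowercase whitespace token to the ids of documents containing it."""
    postings = defaultdict(set)
    for doc in documents:
        for token in doc[field].lower().split():
            postings[token].add(doc[key])
    # Sets are not JSON-serializable, so store sorted lists instead.
    return {token: sorted(ids) for token, ids in postings.items()}


documents = [
    {"text": "how would you say fly in italian", "label": "translate", "idx": 0},
    {"text": "what's the spanish word for pasta", "label": "translate", "idx": 1},
    {"text": "how do you say fast in spanish", "label": "translate", "idx": 3},
]

toy_index = build_toy_index(documents)

# Round-trip through a JSON string; writing to a file works the same way.
payload = json.dumps(toy_index)
reloaded_toy = json.loads(payload)

hits = reloaded_toy.get("spanish", [])
print(hits)  # [1, 3]
```

The expensive step (tokenizing every document) happens once in `build_toy_index`; after reloading, a search is a single dictionary lookup.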

### Comparing the search time of each method

In [15]:
%timeit df.loc[lambda d: d['text'].str.contains("spanish")]

11.8 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [16]:
%timeit [d for d in documents if 'spanish' in d['text']]

4.65 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [17]:
%timeit index.search('spanish')

648 µs ± 7.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [18]:
%timeit [documents[int(i['ref'])] for i in index.search('spanish')]

790 µs ± 216 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
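`%timeit` is an IPython magic, so it is not available in a plain Python script. A rough equivalent using the standard-library `timeit` module is sketched below. It runs on synthetic documents and uses a toy token dictionary as a stand-in for a prebuilt index (an assumption, not lunr itself), so the absolute numbers will not match the measurements above. Note also that a substring scan and a token lookup are not equivalent in general: the scan also matches "spanish" inside longer words.

```python
import timeit

# Synthetic corpus (assumption): every tenth document mentions "spanish".
documents = [
    {"text": f"how do you say word{i} in spanish" if i % 10 == 0
     else f"what is the weather on day {i}", "idx": i}
    for i in range(2000)
]


def linear_scan(term):
    """Check every document for the term, like the list comprehension above."""
    return [d for d in documents if term in d["text"]]


# Toy prebuilt index: token -> documents containing it.
token_index = {}
for d in documents:
    for tok in d["text"].split():
        token_index.setdefault(tok, []).append(d)


def indexed_lookup(term):
    """Return matches with a single dictionary lookup."""
    return token_index.get(term, [])


def best_time(func, number=20, repeat=3):
    """Best observed time per call, in seconds."""
    runs = timeit.repeat(lambda: func("spanish"), number=number, repeat=repeat)
    return min(runs) / number


scan_s = best_time(linear_scan)
lookup_s = best_time(indexed_lookup)
print(f"linear scan:    {scan_s * 1e6:.1f} µs per call")
print(f"indexed lookup: {lookup_s * 1e6:.1f} µs per call")
```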
