# Keyword Search

In [1]:
from typing import List
import uuid

import pandas as pd

In [2]:
# Load the preprocessed data (the uuid column is already present in the file)
data = pd.read_json("data/preprocessed_data.jsonl", orient='records', lines=True)

data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209527 entries, 0 to 209526
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   uuid               209527 non-null  object        
 1   link               209527 non-null  object        
 2   headline           209527 non-null  object        
 3   category           209527 non-null  object        
 4   short_description  209527 non-null  object        
 5   authors            209527 non-null  object        
 6   date               209527 non-null  datetime64[ns]
 7   clean_headline     209527 non-null  object        
 8   combined_text      209527 non-null  object        
dtypes: datetime64[ns](1), object(8)
memory usage: 14.4+ MB


To speed up and simplify some of the code, I will use dictionaries for looking up articles rather than relying on pandas.

In [3]:
# Initialize lookups including uuid -> title and row index -> uuid
UUID_2_TITLE = dict(zip(data['uuid'], data['headline']))
IND_2_UUID = dict(zip(list(data.index), data['uuid']))

# Create a generic docs list to be used for setting up searches.
DOCS = list(data['combined_text'])
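
To illustrate how these lookups chain together, here is a minimal sketch on a three-row toy stand-in. The uuids, titles, and the helper `title_for_row` are hypothetical and only mirror the shape of `IND_2_UUID` and `UUID_2_TITLE`; they are not taken from the dataset.

```python
from typing import Dict

# Toy stand-ins for the notebook's lookups (hypothetical values).
toy_ind_2_uuid: Dict[int, str] = {0: "a1", 1: "b2", 2: "c3"}
toy_uuid_2_title: Dict[str, str] = {
    "a1": "Airline Cancels Flights",
    "b2": "Local Election Results",
    "c3": "New Airport Opens",
}


def title_for_row(row_index: int,
                  ind_2_uuid: Dict[int, str],
                  uuid_2_title: Dict[str, str]) -> str:
    """Resolve a row index to its headline via two dictionary lookups."""
    doc_uuid = ind_2_uuid[row_index]   # row index -> uuid
    return uuid_2_title[doc_uuid]      # uuid -> headline


resolved_title = title_for_row(2, toy_ind_2_uuid, toy_uuid_2_title)
print(resolved_title)
```

Each step is a plain dictionary access, so a search result that comes back as a row position can be turned into a headline without touching the DataFrame.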


## Naive Approach 

In the naive approach we simply check each document to see whether it contains the query term(s). It is "naive" because it uses a for loop to scan every document. On a small sample such as this one, a simple loop runs reasonably quickly and shows very clearly how keyword matching works. On larger datasets, with more records and longer texts, the approach becomes impractical because processing time grows with the total amount of text scanned.

In [4]:
def search(query: str, docs: List[str]) -> List[str]:
    # Lowercase the query once; the documents are assumed to be lowercased already.
    query = query.lower()
    return [d for d in docs if query in d]

In [7]:
%%time

results = search(query="airlines", docs=DOCS)
print(len(results))


396
CPU times: user 305 ms, sys: 0 ns, total: 305 ms
Wall time: 275 ms


In [9]:
for r in results[:3]:
    print(r + "\n")

american airlines flyer charged banned for life after punching flight attendant on video he was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation according to the us attorney's office in los angeles

alaska airlines cancels dozens of flights as pilots picket more than 100 alaska airlines flights were canceled by the airline including 66 in seattle 20 in portland oregon 10 in los angeles and seven in san francisco

russia's flagship airline aeroflot halts all international flights except to belarus russia's aviation agency recommended all russian airlines with foreignleased planes halt both passenger and cargo flights abroad



As can be seen, this performs well enough at surfacing documents that contain the search term (i.e. "airlines"). However, as noted, the approach would not be feasible on larger datasets, and it also runs into issues with longer queries. Once the query gets longer (e.g. "airlines in the united states"), a simple term lookup introduces too many complexities: you may need to split the query into individual terms and search for each word in turn, which again leads to slow lookups. To address this, we will perform keyword search against vectorized documents.
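
To make the multi-term problem concrete, here is a minimal sketch of the per-term strategy described above. The function `search_terms`, its `require_all` flag, and the toy documents are hypothetical and not part of the original notebook; the sketch assumes whitespace tokenization of the query and lowercased documents.

```python
from typing import List


def search_terms(query: str, docs: List[str], require_all: bool = True) -> List[str]:
    """Split the query on whitespace and substring-match each term per document."""
    terms = query.lower().split()
    if not terms:
        return []
    # all -> every term must appear; any -> at least one term must appear
    combine = all if require_all else any
    return [d for d in docs if combine(t in d for t in terms)]


# Hypothetical toy documents, lowercased like combined_text.
toy_docs = [
    "alaska airlines cancels flights in the united states",
    "airlines in europe add new routes",
    "united states election results",
]

all_hits = search_terms("airlines united states", toy_docs)
any_hits = search_terms("airlines united states", toy_docs, require_all=False)
print(len(all_hits), len(any_hits))
```

Every document is now checked once per term, so the work grows with both the number of documents and the length of the query. Because this is substring matching, short words such as "in" would also match inside unrelated words, which is one of the complexities mentioned above.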

## Vectorized Search

### Count Vectorization

### tfidf Vectorization