# Hybrid search

Dense vector indexes are usually the best option for semantic search, but sparse keyword indexes still add value when an exact match matters, for example when the query is a name, an identifier, or a rare term.

Hybrid search combines the results from sparse and dense vector indexes for the best of both worlds.
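One common way to combine two ranked result lists is reciprocal rank fusion (RRF). The sketch below is only an illustration of the idea, not necessarily how any particular library merges sparse and dense scores. It assumes each result is a dict with an `id` key, as in the search outputs later in this notebook. The function name, the constant `k=60`, and the sample ids are assumptions made up for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60, limit=4):
    """Fuse ranked lists of {'id': ...} dicts by summing 1 / (k + rank) per id."""
    fused = defaultdict(float)
    for results in result_lists:
        # Ranks start at 1, so the top hit of each list contributes 1 / (k + 1)
        for rank, row in enumerate(results, start=1):
            fused[row["id"]] += 1.0 / (k + rank)
    # Highest fused score first
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical ids standing in for keyword and dense search results
keyword_hits = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
dense_hits = [{"id": "b"}, {"id": "d"}, {"id": "a"}]

fused = reciprocal_rank_fusion([keyword_hits, dense_hits])
print([doc_id for doc_id, _ in fused])
```

Documents that appear near the top of both lists rise to the top of the fused ranking, while documents found by only one index still remain candidates.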

In [1]:
from txtai import Embeddings
import os
import pandas as pd
import re

In [2]:
# Define constants
DATA_DIR = '../datasets'
FILE_NAME = 'Articles.csv'
EMBEDDINGS_PATH = './hybrid_text_embeddings'
EMBEDDINGS_MODEL = 'sentence-transformers/nli-mpnet-base-v2'

In [3]:
# Load the data, failing early if the CSV file is missing
csv_path = os.path.join(DATA_DIR, FILE_NAME)
if not os.path.isfile(csv_path):
    raise FileNotFoundError(f"Dataset not found: {csv_path}")
df = pd.read_csv(csv_path, encoding='latin1')

df.head()

Unnamed: 0,Article,Date,Heading,NewsType
0,KARACHI: The Sindh government has decided to b...,1/1/2015,sindh govt decides to cut public transport far...,business
1,HONG KONG: Asian markets started 2015 on an up...,1/2/2015,asia stocks up in new year trad,business
2,HONG KONG: Hong Kong shares opened 0.66 perce...,1/5/2015,hong kong stocks open 0.66 percent lower,business
3,HONG KONG: Asian markets tumbled Tuesday follo...,1/6/2015,asian stocks sink euro near nine year,business
4,NEW YORK: US oil prices Monday slipped below $...,1/6/2015,us oil prices slip below 50 a barr,business


In [4]:
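# Data pre-processing: strip the leading dateline (e.g. "KARACHI: ")
# Articles that contain no colon are skipped
data = df['Article'].tolist()
data = [re.split(r": *", text, maxsplit=1)[1] for text in data if ":" in text]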
data[:5]

['The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling.Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.                        \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n',
 'Asian markets started 2015 on an upswing in limited trading on Friday, with mainland Chinese stocks surging in Hong Kong on speculation Beijing may ease monetary policy to boost slowing growth.Hong Kong rose 1.07 percent, closing 252.78 points higher at
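The pre-processing step can be checked in isolation. The sketch below splits on the first colon only, so any later colons in the article body are preserved. The helper name `strip_dateline` and the sample strings are made up for illustration; they are not rows from the dataset.

```python
import re


def strip_dateline(text):
    """Return the text after the first colon, or None when there is no colon."""
    if ":" not in text:
        return None
    # maxsplit=1 keeps any later colons inside the article body
    return re.split(r": *", text, maxsplit=1)[1]


# Hypothetical samples, shortened for illustration
samples = [
    "KARACHI: The Sindh government has decided to cut fares.",
    "HONG KONG: Shares opened lower: analysts blamed oil.",
    "No dateline here",
]

cleaned = [strip_dateline(sample) for sample in samples]
print(cleaned)
```

One caveat of this approach: an article without a dateline that still contains a colon somewhere in its body loses everything before that colon.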

In [5]:
# Create an embeddings instance with two subindexes: a sparse keyword index and a dense vector index
embeddings = Embeddings(
  content=True,
  defaults=False,
  indexes={
    "keyword": {
      "keyword": True
    },
    "dense": {
      "path": EMBEDDINGS_MODEL,
    }
  }
)

In [6]:
# Index the data
embeddings.index(data)

In [7]:
# Save the embeddings
embeddings.save(EMBEDDINGS_PATH)

In [8]:
# Load the embeddings
embeddings = Embeddings()
embeddings.load(path=EMBEDDINGS_PATH)

In [9]:
# Perform a keyword search
query = "funny news"
keyword_results = embeddings.search(query, limit=4, index="keyword")
keyword_results

[{'id': '2184',
  'text': 'Mamadou Sakho was sent back from Liverpool\'s pre-season tour of the United States because of a lack of respect towards rules but the defender\'s actions were not serious enough to warrant more punishment, manager Jurgen Klopp has said.</strongThe France international, who was recently cleared by UEFA after allegations of a failed drugs test, is currently nursing an Achilles injury and is expected to miss the start of the Premier League season, which begins on Aug. 13.Klopp said Sakho broke rules during the tour."He missed the departure of the plane, he missed a session and then was late for a meal," Klopp told British media."I have to build a group here, I have to start anew, so I thought it maybe made sense that he flew home to Liverpool and after eight days, when we come back, we can talk about it."But it\'s not that serious. It is how I said, we have some rules and we have to respect them. If somebody doesn\'t respect it, or somebody gives me the feeling 

In [10]:
# Perform a dense vector search
query = "funny news"
semantic_results = embeddings.search(query, limit=4, index="dense")
semantic_results

[{'id': '1005',
  'text': 'Jonny Bairstow pumped his chest to the sky and wiped away a tear as he completed an emotional maiden test hundred for England against South Africa on Sunday, amid a record-breaking stand with Ben Stokes.</strongThe red-haired England wicketkeeper last year lost his grandfather, with whom he shared a close relationship, while his dad David, a former England cricketer, killed himself in 1998.\x93Obviously after everything that has gone on in the last year or so it\x92s fantastic to get over the line for me and my family,\x94 he told reporters after finishing on 150 not out with his mother and sister watching on from the stands. \x93They have supported me all through my career. To have them here in Cape Town is lovely and it is my mum\x92s birthday on the last day of the test so hopefully we can cap off a great game. \x93It was probably the best day of my life and one I will never forget.\x94After being recalled to the England side during last year\x92s Ashes tr