# Full Search

## Load Bible Text

In [1]:
import numpy as np
import pandas as pd

types = {
    'book': 'string',
    'chap': np.int32,
    'vers': np.int32,
    'text': 'string'
}
unv = pd.read_csv(
    'https://bible.fhl.net/public/dnstrunv.tgz',
    sep='#',
    compression='gzip',
    header=None,
    usecols=[1, 2, 3, 4],
    names=list(types.keys()))
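
The `.tgz` suffix usually denotes a gzip-compressed tar archive. If that holds for this download, `compression='gzip'` undoes only the gzip layer, and the tar header and padding stay in the stream that pandas parses. The sketch below shows one way to unpack the archive explicitly before parsing. It is a minimal illustration under that assumption: `read_verses_from_tgz` and `make_toy_tgz` are hypothetical helpers, and the toy rows and their `<label>#<book>#<chapter>#<verse>#<text>` layout are invented so the example runs without network access.

```python
import io
import tarfile
from typing import List

import pandas as pd

COLUMNS = ['book', 'chap', 'vers', 'text']


def read_verses_from_tgz(raw: bytes) -> pd.DataFrame:
    """Parse the first regular file inside a gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode='r:gz') as archive:
        member = next(m for m in archive.getmembers() if m.isfile())
        payload = archive.extractfile(member).read()
    # Read every '#'-separated field, then keep fields 1-4 like the cell above.
    table = pd.read_csv(io.BytesIO(payload), sep='#', header=None)
    table = table.iloc[:, 1:5]
    table.columns = COLUMNS
    return table


def make_toy_tgz(lines: List[str]) -> bytes:
    """Build a tiny in-memory .tgz (hypothetical test fixture)."""
    payload = '\n'.join(lines).encode('utf-8')
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as archive:
        info = tarfile.TarInfo(name='toy.txt')
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Made-up rows; the real file's first field is simply ignored here.
toy = read_verses_from_tgz(make_toy_tgz([
    'x#Gen#1#1#first verse',
    'x#Gen#1#2#second verse',
]))
```

For the real file, the bytes would first have to be downloaded (for example with `urllib.request.urlopen(url).read()`) and then passed to `read_verses_from_tgz`.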

In [58]:
# Drop rows with missing fields, enforce the column dtypes, then sort by reference
unv.dropna(inplace=True)
unv = unv.astype(types)
unv.sort_values(['book', 'chap', 'vers'], inplace=True)

def in_order(nums) -> bool:
    """Return True if nums is exactly the sequence 1, 2, ..., max(nums)."""
    return np.array_equal(nums, np.arange(1, nums.max() + 1))

# Sanity checks: 66 books, consecutive chapter and verse numbers, 31103 verses in total
assert len(unv.book.unique()) == 66
assert unv.groupby('book').chap.unique().apply(in_order).all()
assert unv.groupby(['book', 'chap']).vers.apply(in_order).all()
assert unv.groupby(['book', 'chap']).vers.max().sum() == len(unv) == 31103
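
When one of these assertions fails, it does not say which book or chapter is at fault. The sketch below applies the same consecutive-numbering check to a small invented frame and lists the chapters that have gaps. The `toy` data and the `gaps` variable are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def in_order(nums) -> bool:
    """Return True if nums is exactly the sequence 1, 2, ..., max(nums)."""
    nums = np.asarray(nums)
    return bool(np.array_equal(nums, np.arange(1, nums.max() + 1)))


# Invented data: the single chapter of 'Exo' is missing verse 2.
toy = pd.DataFrame({
    'book': ['Gen', 'Gen', 'Gen', 'Exo', 'Exo'],
    'chap': [1, 1, 2, 1, 1],
    'vers': [1, 2, 1, 1, 3],
})

# One boolean per (book, chapter): are its verse numbers consecutive from 1?
per_chapter = toy.groupby(['book', 'chap']).vers.apply(in_order).astype(bool)

# Index entries of the chapters that fail the check.
gaps = per_chapter[~per_chapter].index.tolist()
```

Inspecting `gaps` before asserting makes it easier to see whether a failure comes from the source data or from the parsing step.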

## Tokenize

In [5]:
import jieba
import jieba.posseg as pseg

jieba.enable_paddle()

Paddle enabled successfully......


In [6]:
from hanziconv import HanziConv

# Convert each verse to Simplified Chinese before tokenizing
unv['s_text'] = unv.text.apply(HanziConv.toSimplified)

In [7]:
# Segment each verse into (word, part-of-speech) pairs using jieba's paddle mode
unv['s_text_tk'] = unv.s_text.apply(lambda v: pseg.lcut(v, use_paddle=True))

## Search

In [9]:
from typing import Iterable

def highlight_occurances(text: str, keywords: Iterable[str]) -> str:
    for i, kw in enumerate(keywords):
        text = text.replace(kw, highlight(kw, i))
    return text

def highlight(text: str, color_code: int) -> str:
    # ANSI escape: black text (30) on background 41-47; wrap so the color code stays in range
    return f'\x1b[6;30;4{color_code % 7 + 1}m{text}\x1b[0m'

test_data = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh']
print(highlight_occurances(', '.join(test_data), test_data))

[6;30;41mfirst[0m, [6;30;42msecond[0m, [6;30;43mthird[0m, [6;30;44mfourth[0m, [6;30;45mfifth[0m, [6;30;46msixth[0m, [6;30;47mseventh[0m

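
Because `highlight_occurances` calls `str.replace` once per keyword, a later keyword can also match text inserted for an earlier one, such as the `4` or `m` inside an escape sequence, or a keyword contained in another keyword. The sketch below is one possible alternative that highlights every keyword in a single regular-expression pass. `highlight_single_pass` is a hypothetical name, and the modulo on the color index is an added assumption that keeps the background code within 41 to 47.

```python
import re
from typing import Dict, Iterable


def highlight(text: str, color_code: int) -> str:
    # Black text on one of the seven ANSI background colors 41-47.
    return f'\x1b[6;30;4{color_code % 7 + 1}m{text}\x1b[0m'


def highlight_single_pass(text: str, keywords: Iterable[str]) -> str:
    """Highlight all keywords in one pass over the original text."""
    colors: Dict[str, int] = {}
    for index, keyword in enumerate(keywords):
        if keyword:
            # Keep the first color assigned to a repeated keyword.
            colors.setdefault(keyword, index)
    if not colors:
        return text
    # Try longer keywords first so a keyword wins over its own prefix.
    ordered = sorted(colors, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(keyword) for keyword in ordered))
    return pattern.sub(lambda m: highlight(m.group(0), colors[m.group(0)]), text)


sample = highlight_single_pass('a 4 m', ['a', '4', 'm'])
```

Since the pattern is matched only against the original text, the inserted escape sequences are never scanned again.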

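The search cell that follows keeps its scores in a dict keyed by the highlighted verse text and sorts the whole dict to print ten rows. Verses with identical text therefore share one entry. As a hedged sketch of an alternative, the code below ranks verses by index and selects the top `k` with `heapq.nlargest`. `exact_match_similarity` is only a stand-in for `sentence_similarity`, whose implementation is not shown in this notebook, and it works on plain strings rather than the pair objects returned by `pseg.lcut`. `top_matches` is a hypothetical helper.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Scorer = Callable[[Sequence[str], Sequence[str]], Dict[str, float]]


def exact_match_similarity(search_tk: Sequence[str],
                           vers_tk: Sequence[str]) -> Dict[str, float]:
    """Stand-in scorer (assumption): 1.0 per search token found in the verse."""
    vers_words = set(vers_tk)
    return {token: float(token in vers_words) for token in search_tk}


def top_matches(search_tk: Sequence[str],
                verses_tk: Sequence[Sequence[str]],
                similarity: Scorer = exact_match_similarity,
                k: int = 10) -> List[Tuple[float, int]]:
    """Return up to k (score, verse_index) pairs, best score first."""
    scored = (
        (sum(similarity(search_tk, vers_tk).values()), index)
        for index, vers_tk in enumerate(verses_tk)
    )
    # Compare on the score only, so tied verses keep their original order.
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy token lists; verses 0 and 2 are identical and both stay in the ranking.
verses_tk = [['a', 'b'], ['a'], ['a', 'b'], ['c']]
ranked = top_matches(['a', 'b'], verses_tk, k=3)
```

Returning indices instead of text also leaves the highlighting step to the caller, so only the verses that are actually printed need to be highlighted.
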
In [10]:
from search import sentence_similarity

searches = ['挂虑 祈祷', '喜乐 事奉', '求救', '信心 行事']

for search_term in searches:
    print(f'Search for {search_term}:')
    search_tk = pseg.lcut(search_term, use_paddle=True)
    # Keyed by the highlighted verse text, so verses with identical text share one entry
    match_scores = {}
    for v, vers_tk in zip(unv.s_text, unv.s_text_tk):
        # Per-keyword similarity scores for this verse
        similarity = sentence_similarity(search_tk, vers_tk)
        match_kw = [kw for kw, kw_score in similarity.items() if kw_score > 0]
        vers = highlight_occurances(v, match_kw)
        match_scores[vers] = sum(similarity.values())
    # Print the ten best-scoring verses
    for top_match in sorted(match_scores, key=match_scores.get, reverse=True)[:10]:
        print(f'Match: {match_scores[top_match]:7.4f} Verse: {top_match}')
    print()

Search for 挂虑 祈祷:
Match:  2.0000 Verse: 应当一无[6;30;41m挂虑[0m，只要凡事借着[6;30;42m祷告[0m、祈求，和感谢，将你们所要的告诉神。
Match:  1.1429 Verse: 妇人和处女也有分别。[6;30;42m没有[0m出嫁的，是为主的事[6;30;41m挂虑[0m，要身体、灵魂都圣洁；已经出嫁的，是为世上的事[6;30;41m挂虑[0m，想怎样叫丈夫喜悦。
Match:  1.1429 Verse: 我愿你们无所[6;30;41m挂虑[0m。[6;30;42m没有[0m娶妻的，是为主的事[6;30;41m挂虑[0m，想怎样叫主喜悦。
Match:  1.1429 Verse: 娶了妻的，[6;30;42m是[0m为世上的事[6;30;41m挂虑[0m，想怎样叫妻子喜悦。
Match:  1.1429 Verse: 他必像树栽于水旁，在河边扎根，炎热来到，并不[6;30;42m惧怕[0m，叶子仍必青翠，在干旱之年毫无[6;30;41m挂虑[0m，而且结果不止。
Match:  1.0000 Verse: 在这日以前，这日以后，耶和华听人的[6;30;41m祷告[0m，没有像这日的，是因耶和华为以色列争战。
Match:  1.0000 Verse: 于是禁食[6;30;41m祷告[0m，按手在他们头上，就打发他们去了。
Match:  1.0000 Verse: 希西家就转脸朝墙，[6;30;41m祷告[0m耶和华说：
Match:  1.0000 Verse: 散了众人以后，他就独自上山去[6;30;41m祷告[0m。到了晚上，只有他一人在那里。
Match:  1.0000 Verse: 对他们说：「经上说：我的殿必作[6;30;41m祷告[0m的殿，你们倒使它成为贼窝了。」

Search for 喜乐 事奉:
Match:  2.0000 Verse: 你们当乐意[6;30;41m事[0m[6;30;42m奉[0m耶和华，当来向他歌唱！
Match:  2.0000 Verse: 「以色列家啊，至于你们，主耶和华如此说：从此以后若不听从我，就任凭你们去[6;30;41m事[0m[6;30;42m奉[0m偶像，