# Question-Answering Demo using Scottish Widows Public Documents

## Environment

In [1]:
import os

import pandas as pd
import numpy as np

import faiss


pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

In [2]:
import vertexai
from vertexai.preview.language_models import TextGenerationModel, TextEmbeddingModel


In [3]:
PROJECT_ID = ! gcloud config get core/project
PROJECT_ID = PROJECT_ID[0]

REGION = "europe-west2"

PROJECT_ID, REGION

('playpen-af69ec', 'europe-west2')

In [4]:
%env PROJECT_ID=$PROJECT_ID

env: PROJECT_ID=playpen-af69ec


## Data

### Raw data

Document Source: Based on Scottish Widows' literature library search:
https://adviser.scottishwidows.co.uk/literature-library.html

Specifically for this demo, the *guides* are selected:
https://adviser.scottishwidows.co.uk/literature-library.html?n=1000&filter=swe:literaturelibrary/contenttype/guides

The PDF files are scraped and saved to a local Parquet file.

In [5]:
all_guides_file = "../data/scottish_widows_all_guides.pq"

guides_df = pd.read_parquet(all_guides_file)

guides_df.head()

Unnamed: 0,page_number,page_text,title
0,1,\n \n \nWhich trust form should I use? \n \n...,Which trust form should I use?
1,2,\n \n \n4. The Gift trust (creating fixed int...,Which trust form should I use?
2,3,\n \n \nPlease tick one of the boxes below to...,Which trust form should I use?
3,1,POLICY PROVISIONS\nBP-S32/S32A (2016)PLANBUYOU...,Trustee Buyout Plan Policy Provisions
4,2,PAGE 2\n1 PRELIMINARY\nPAGE 4\n2 UNIT-LINKED F...,Trustee Buyout Plan Policy Provisions


In [6]:
guides_df.groupby(["title"])[["page_number"]].count()

Unnamed: 0_level_0,page_number
title,Unnamed: 1_level_1
A guide to pension tax,11
A guide to supporting vulnerability,8
Adviser guide to accessing income with Drip Feed Drawdown...,19
Advisers' Guide To The Portfolio Management Service,36
Annual FSA Insurance Returns for the year ended 31st Dece...,189
...,...
​​Scottish Widows Bank Premier Team Flyer,2
​​Scottish Widows Cash Fund weekly report,2
​​Scottish Widows Life Funds Investor's Guide,54
​​Scottish Widows OEIC and ISA brochure,12


### Pre-processing
#### Remove the blank pages

In [7]:
print(guides_df.shape)

guides_df = guides_df.loc[guides_df["page_text"]!=""]

print(guides_df.shape)

(2976, 3)
(2952, 3)
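
The filter above only drops pages whose text is exactly the empty string. Pages that contain nothing but spaces or newlines would pass through. A minimal sketch, assuming such pages should also count as blank; `toy_df` is made-up data standing in for `guides_df`:

```python
import pandas as pd

# Toy stand-in for guides_df; the rows are made up for illustration.
toy_df = pd.DataFrame(
    {
        "page_number": [1, 2, 3, 4],
        "page_text": ["Which trust form?", "", " \n \n", "POLICY PROVISIONS"],
        "title": ["A", "A", "A", "B"],
    }
)

# Keep only pages that still contain text after stripping whitespace.
has_text = toy_df["page_text"].str.strip() != ""
cleaned_df = toy_df.loc[has_text]

print(toy_df.shape, cleaned_df.shape)
```

Whether whitespace-only pages exist in the real data is not shown in this notebook, so this may be a no-op there.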


## Embedding using Google's `TextEmbedding` Model

**Approach 1: Using the natural pages as chunks**

In [8]:
guides_df["page_text"].loc[0]

" \n \n \nWhich trust form should I use?  \n \nFor life assurance (i.e. non pension) contracts which are already set up and on risk, Scottish Widows \ncurrently offers a choice of four trusts. To help you choose which trust is most appropriate for your \nneeds, a brief description of each trust and where i t may be used is given below.   \n \nPlacing a policy under trust usually means you are giving up all rights to the benefits under a policy, \nalthough, in a few very specific situations it ’s possible to retain certain benefits and our range of \ntrusts takes this  into account.   \n \nRememb er a trust is a legal document.  If you ’re in any doubt as to which trust is most suitable \nfor your policy and your requirements, please seek advice from your financial or legal \nadviser.    \n \nIf your policy is a regular premium policy it may be what is known as a “qualifying policy”. \nPlacing a qualifying policy under trust can have tax implications and advice should always be \nsought

In [9]:
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

# get_embeddings returns a list of vertexai.language_models._language_models.TextEmbedding objects
#embeddings = model.get_embeddings( [guides_df["page_text"].loc[0]] )
embeddings = model.get_embeddings(guides_df["page_text"].loc[0:4]) # at most 5 texts per request

len(embeddings), type(embeddings[0])

(5, vertexai.language_models._language_models.TextEmbedding)

In [10]:
for embedding in embeddings:
    vector = np.array(embedding.values)
    print(vector.shape)
    print(vector[:10])

(768,)
[ 0.01049513 -0.01476781 -0.00213589  0.01635158  0.01572042 -0.0659941
  0.03316933  0.03560591 -0.02210218  0.01482409]
(768,)
[-0.01333201 -0.01170563 -0.02743361  0.01251736  0.01605495 -0.05653438
  0.04348524  0.02655993 -0.0336807   0.01399777]
(768,)
[ 0.01591383 -0.01322624 -0.00728575 -0.00176922  0.02779969 -0.0607065
  0.00740999  0.02279622 -0.01332888  0.0173459 ]
(768,)
[-0.00177567 -0.0265304   0.01459018  0.00554964  0.00642882 -0.04269994
  0.00504656  0.03302064 -0.01718982  0.01577001]
(768,)
[-0.00341803 -0.02207155 -0.004032   -0.01826066 -0.01206415 -0.04291812
  0.01481962  0.01428669 -0.00667574  0.04736852]


In [11]:
pd.Series([embedding.values for embedding in embeddings], name="embedding").to_frame()

Unnamed: 0,embedding
0,"[0.010495134629309177, -0.014767814427614212, ..."
1,"[-0.013332009315490723, -0.011705626733601093,..."
2,"[0.0159138273447752, -0.013226243667304516, -0..."
3,"[-0.0017756749875843525, -0.026530398055911064..."
4,"[-0.0034180316142737865, -0.02207154966890812,..."


In [12]:
def get_embedding_google(se, chunk_size=5):
    """Vectorise a series of texts using Google's pretrained TextEmbeddingModel.
       Input:
           se: Series of strings
           chunk_size: number of text items sent to the Google API per request.
                       The API accepted at most 5 items per request when this
                       notebook was written, so chunk_size should be at most 5
        Return: NumPy array with shape (m, n), where m is the number of texts
                and n is the vector length (768 for the Google model)
    """
    
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
    
    # generator that splits the series into smaller series of at most chunk_size rows
    small_se_gen = (se.iloc[i:i+chunk_size] for i in range(0, len(se), chunk_size))
    small_se_embeddings = [model.get_embeddings(small_se) for small_se in small_se_gen]

    eb_list = [
        np.array(embedding.values, dtype="float32") 
        for embeddings in small_se_embeddings 
        for embedding in embeddings
        ]
    return np.vstack(eb_list)
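
The batching logic can be checked without calling the API. The sketch below is not the notebook's function: `embed_in_batches` and `fake_embed` are hypothetical names, and the fake embedder returns made-up 3-dimensional vectors so the batch sizes can be inspected.

```python
import numpy as np

def embed_in_batches(texts, embed_fn, batch_size=5):
    """Call embed_fn on consecutive slices of at most batch_size texts and stack the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_fn(batch))
    return np.asarray(vectors, dtype="float32")

# Hypothetical stand-in for the embedding API: it records each batch size
# and returns one made-up 3-dimensional vector per text.
batch_sizes = []

def fake_embed(batch):
    batch_sizes.append(len(batch))
    return [[float(len(text)), 0.0, 1.0] for text in batch]

pages = [f"page {i}" for i in range(11)]
vectors = embed_in_batches(pages, fake_embed)
print(vectors.shape, batch_sizes)
```

With 11 texts and a batch size of 5, the fake embedder is called three times, the last time with a single text.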

**Test the embedding function**

In [13]:
# embed 11 items, sending one text item per request
v1 = get_embedding_google(guides_df["page_text"].iloc[0:11], 1)
v1.shape, v1[0][0:10]


((11, 768),
 array([ 0.01048785, -0.01465903, -0.00213082,  0.01636567,  0.01568375,
        -0.06606973,  0.03323705,  0.03563511, -0.02214777,  0.01477051],
       dtype=float32))

In [14]:
# using the default chunk size of 5 
v2 = get_embedding_google(guides_df["page_text"].iloc[0:11])
v2.shape,  v2[0][0:10]

((11, 768),
 array([ 0.01049513, -0.01476781, -0.00213589,  0.01635158,  0.01572042,
        -0.0659941 ,  0.03316933,  0.03560591, -0.02210218,  0.01482409],
       dtype=float32))

**Note: when more than one text item is sent in a single request, the model returns slightly different embedding vectors. The differences are tiny: the dot product between the two versions of each vector is above 0.9999.**

In [15]:
[np.dot(v_1, v_2) for v_1, v_2 in zip(v1, v2)]

[0.99999064,
 0.999993,
 0.9999926,
 0.999988,
 0.99999094,
 0.9999911,
 0.99999005,
 0.99999046,
 0.999987,
 0.9999883,
 0.9999908]
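
A dot product only measures how closely two vectors point in the same direction when both have unit length. The values near 1.0 above suggest these embeddings are close to unit norm, but that is an inference from the output rather than something the notebook checks. A minimal sketch of a comparison that normalises explicitly, using made-up vectors and a hypothetical helper `rowwise_cosine`:

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of two (m, n) arrays."""
    a = np.asarray(a, dtype="float64")
    b = np.asarray(b, dtype="float64")
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

# Made-up vectors: the second array is the first with a little noise
# and a different scale, so the raw dot products are far from 1.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
noisy = 3.0 * (base + 1e-4 * rng.normal(size=base.shape))

similarities = rowwise_cosine(base, noisy)
raw_dots = np.sum(base * noisy, axis=1)
print(similarities)
print(raw_dots)
```

For unit-length inputs the two printed arrays would coincide, which is why the plain dot product is a reasonable check in the cell above if the embeddings are normalised.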

### Embedding the whole set

In [16]:
#%%timeit -n 1 -r 1 # how long does it take? about 1 min for 1000 rows
#get_embedding_google(guides_df["page_text"].iloc[0:100])

guides_embedded_df = pd.DataFrame(
    get_embedding_google(guides_df["page_text"]), index=guides_df.index
)

guides_embedded_df.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,...,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767
2971,0.023005,-0.008335,0.015968,0.020089,-0.022411,-0.082543,-0.015963,-0.020023,-0.043214,0.014353,-0.02578,0.001637,0.058558,0.008535,0.004042,-0.049677,-0.055249,-0.017848,0.046654,0.017081,-0.052548,-0.00426,0.004665,0.028817,0.014689,-0.077009,0.055519,0.019763,-0.063145,0.013305,0.015532,-0.027363,-0.017845,-0.03471,0.001972,0.055479,-0.026046,0.054247,0.013093,0.038073,-0.015672,-0.024904,0.009676,-0.014662,-0.021132,0.040256,-0.000887,0.049489,0.021773,-0.035762,...,0.002156,0.031861,-0.020296,0.004552,0.0067,-0.03058,0.021672,-0.063818,-0.001713,-0.000577,-0.043701,-0.004669,0.010236,-0.018455,0.019391,0.022422,-0.02666,-0.017342,0.046914,-0.006037,0.050536,-0.043194,0.011921,0.08186,-0.035276,-0.003702,0.034351,-0.005414,0.022709,-0.043799,0.007948,-0.055572,-0.085049,0.040661,0.013589,0.028428,0.021263,0.032594,0.016164,0.038673,7.1e-05,0.039359,-0.009554,-0.006101,-0.026769,0.023654,0.001714,-0.037269,-0.014077,-0.040879
2972,0.004236,-0.039678,0.019605,0.014696,0.02634,-0.076459,0.001416,-0.021444,-0.033944,0.008836,-0.000315,-0.001932,0.04689,0.052121,-0.002347,-0.020978,-0.076301,-0.017035,0.081462,0.033609,-0.058328,-0.026464,0.019679,-0.001503,0.006444,-0.073676,0.050129,0.000962,-0.088607,0.003243,0.004986,-0.007102,-0.010902,-0.053472,-0.018873,0.05397,-0.04609,0.054979,-0.010128,0.033029,-0.006924,-0.046878,0.027175,-0.000815,-0.048062,-0.007908,-0.037347,0.032452,0.033141,0.000614,...,0.005646,0.045042,-0.012921,0.012159,0.011549,-0.04744,-0.00486,-0.037287,-0.014955,-0.019401,-0.016383,0.051488,0.003985,-0.017966,0.012412,0.045464,-0.044328,0.006792,0.066181,0.008393,0.062614,-0.047557,-0.014064,0.064229,-0.050053,-0.00381,0.024471,0.015948,0.012391,-0.022188,0.03559,-0.079629,-0.091697,0.04857,0.020046,0.041549,-0.005412,0.045793,0.046022,0.035392,-0.037482,0.040118,0.000428,0.004294,-0.014218,-0.003514,-0.009422,0.005541,-0.028613,-0.023111
2973,0.006069,-0.04853,0.006824,0.000931,0.018895,-0.069443,-0.000718,-0.028296,-0.030608,-0.000899,-0.01657,-0.030518,0.040158,0.046594,0.01507,-0.039756,-0.047341,-0.026313,0.068929,0.053022,-0.04478,-0.02047,0.017019,0.004024,0.005173,-0.086929,0.044768,0.005014,-0.03755,0.021574,0.002843,-0.00847,-0.046293,-0.021847,-0.021615,0.043966,-0.034015,0.057203,-0.02324,0.014777,0.014467,-0.02778,0.049776,-0.005641,-0.029186,0.003274,-0.00683,0.000214,0.033753,-0.014095,...,-0.017977,0.049634,-0.013294,-0.012836,0.005182,-0.027894,-0.017991,-0.039864,-0.016902,0.010338,-0.016337,0.025613,-0.001158,-0.033053,0.010017,0.031964,-0.009585,-0.028806,0.057803,0.005253,0.040219,-0.041553,0.04444,0.02717,-0.042462,-0.008601,0.050087,-0.00337,0.004918,-0.035861,-0.005935,-0.06044,-0.101633,0.029178,-0.014942,0.055224,-0.011849,0.04202,0.059193,0.016681,-0.033432,0.04915,-0.012939,0.049393,-0.018804,0.017993,0.018184,0.005563,-0.003481,-0.053449
2974,0.012338,-0.027616,0.020484,-0.012849,-7e-05,-0.065475,0.026028,0.016407,0.004235,0.012088,-0.009587,0.012799,0.040866,-0.01321,-0.011045,-0.016686,-0.038683,-0.019249,0.05064,0.024926,-0.04536,0.016769,-0.002324,0.023365,-0.004998,-0.053218,0.039008,-0.010714,-0.103057,0.005535,0.040495,0.004507,-0.049671,-0.008873,-0.014579,0.047045,-0.006859,0.070592,-0.007623,0.020427,0.015153,-0.032202,0.023119,0.012023,-0.022448,0.022678,-0.002346,0.026251,0.025721,-0.031976,...,0.011418,0.038729,0.005058,-0.018753,-0.010125,-0.029914,-0.001266,-0.01636,0.010751,-0.006657,-0.025396,0.005239,0.02731,-0.023623,0.007547,-0.002672,-0.002635,0.009147,0.088605,-0.006925,0.051603,-0.018638,-0.001645,0.039367,-0.016303,-0.006195,0.018979,0.026279,0.013427,-0.055749,-0.000913,-0.046771,-0.096925,0.031593,-0.039723,0.066215,-0.012165,0.04979,0.033831,-0.015099,-0.043591,0.04261,-0.001827,-0.000854,-0.0032,0.029455,0.01929,-0.024217,-0.029916,-0.043876
2975,0.022185,-0.033173,0.009347,-0.024662,0.00258,-0.057153,0.00986,0.017371,-0.011084,0.018607,-0.013448,0.017459,-0.003734,0.014558,-0.001503,-0.031373,-0.062036,-0.010786,0.059708,0.001548,-0.077247,0.01263,-0.004428,0.015212,0.003572,-0.075635,0.022557,0.024011,-0.092295,0.017256,0.022071,-0.019026,-0.051287,-0.024358,-0.029306,0.028509,-0.015025,0.041008,-0.008376,0.007315,0.014936,-0.011113,0.010289,-0.01158,-0.034637,0.021047,-0.010455,0.037634,0.033239,-0.047006,...,0.018212,0.036554,0.002246,0.002502,0.027372,-0.039902,-0.001072,-0.043071,0.009656,0.010759,-0.020188,-0.002678,0.022682,-0.030582,-0.005811,0.015054,0.014017,-0.005891,0.067421,-0.002183,0.062956,-0.043583,0.010797,0.0351,-0.012281,-0.025695,0.013154,0.020848,0.024446,-0.0163,-0.019455,-0.017315,-0.100765,0.021911,-0.030963,0.07173,-0.009269,0.066918,0.032873,-0.027479,-0.023727,0.040294,-0.007311,-0.003273,-0.026122,0.013345,0.005259,-0.029302,-0.007045,-0.080517


In [17]:
guides_df.shape, guides_embedded_df.shape

((2952, 3), (2952, 768))

In [18]:
guides_embedded_file = "../data/scottish_widows_all_guides_embedded_v3.pq"

In [20]:
# guides_embedded_df.to_parquet(guides_embedded_file)

## Vector DB using Faiss

In [21]:
# guides_embedded_df = pd.read_parquet(guides_embedded_file)

guides_embedded_df.tail()



### Create unique ID to link embedded vectors with original page

In [27]:
guides_df = guides_df.reset_index().rename(columns={"index": "id"})

guides_df.tail()

### Build the vector DB

In [28]:
# instantiate the index
vector_length = guides_embedded_df.shape[1]

index = faiss.IndexFlatL2(vector_length)

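# Wrap the index in IndexIDMap so vectors can be added with their own IDs
indexed = faiss.IndexIDMap(index)
# Faiss works on float32 arrays and int64 IDs, so convert explicitly
vectors = np.ascontiguousarray(guides_embedded_df.to_numpy(dtype="float32"))
indexed.add_with_ids(vectors, guides_df["id"].to_numpy(dtype="int64"))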

print(f"Number of vectors in the Faiss index: {indexed.ntotal}")

Number of vectors in the Faiss index: 2952
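
Conceptually, a flat L2 index does an exhaustive search: it compares the query with every stored vector and returns the `k` smallest distances together with the matching IDs. The sketch below is a NumPy stand-in for that idea, not Faiss itself; `flat_l2_search` is a hypothetical name and the data is made up.

```python
import numpy as np

def flat_l2_search(vectors, ids, queries, k=3):
    """Brute-force nearest-neighbour search using squared Euclidean distance.

    Returns (distances, neighbour_ids), each of shape (n_queries, k),
    sorted from closest to farthest.
    """
    vectors = np.asarray(vectors, dtype="float32")
    queries = np.asarray(queries, dtype="float32")
    ids = np.asarray(ids, dtype="int64")
    # Pairwise squared distances, shape (n_queries, n_vectors).
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs * diffs, axis=2)
    order = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), ids[order]

# Made-up data: 6 vectors of length 4 with non-contiguous IDs.
rng = np.random.default_rng(42)
stored = rng.normal(size=(6, 4)).astype("float32")
stored_ids = np.array([10, 11, 14, 15, 20, 21])

# Query with a stored vector: it should come back as its own closest match.
distances, found_ids = flat_l2_search(stored, stored_ids, stored[3:4], k=3)
print(distances[0], found_ids[0])
```

This approach materialises all pairwise differences, so it is only meant for small illustrations; it also shows why IDs need to be kept alongside the vectors once blank pages have been dropped and row positions no longer match.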


## Query

In [36]:
# Pick a page and search for it. The closest match should be the page itself (distance 0).
# Note: IndexFlatL2 reports squared L2 distances.
pick_page = 15 

em = guides_embedded_df.iloc[pick_page:pick_page+1, :]
distances, ids = indexed.search(em, k=3)
print(f'L2 distance: {distances[0]}\nIDs: {ids[0]}')

L2 distance: [0.         0.03080348 0.13289924]
IDs: [15 14 10]
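
Faiss's `IndexFlatL2` reports squared Euclidean distances. If the embeddings have unit length (an assumption here, suggested by the dot products near 1.0 earlier but not verified), the squared distance d² and the cosine similarity are linked by d² = 2 − 2·cos θ. A minimal sketch of that conversion, with a hypothetical helper name:

```python
import numpy as np

def squared_l2_to_cosine(squared_distance):
    """Convert squared L2 distance to cosine similarity, assuming unit-length vectors."""
    return 1.0 - np.asarray(squared_distance, dtype="float64") / 2.0

# Check the identity on two made-up unit vectors.
rng = np.random.default_rng(1)
a, b = rng.normal(size=(2, 16))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

squared_distance = np.sum((a - b) ** 2)
print(squared_l2_to_cosine(squared_distance), np.dot(a, b))

# Applied to the distances printed above (only valid under the unit-length assumption).
print(squared_l2_to_cosine([0.0, 0.03080348, 0.13289924]))
```

Under that assumption, the second and third matches above would correspond to cosine similarities of roughly 0.985 and 0.934.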


In [37]:
guides_df[ guides_df.id.isin(ids[0])]

Unnamed: 0,id,page_number,page_text,title
10,10,8,6\nBuyout Plan2.5 Unit prices\nAt each valuati...,Trustee Buyout Plan Policy Provisions
14,14,12,10\nBuyout Plan3.7 Management charge\nFrom tim...,Trustee Buyout Plan Policy Provisions
15,15,13,11\nBuyout PlanProvision 3.7 deals with charge...,Trustee Buyout Plan Policy Provisions


In [39]:
def vector_search_google(query:str, index, num_results:int=3):
    """
    Encode the query using Google's text embedding model and search the vector DB for the closest matches.
    query: the text to be embedded
    index: Faiss index used as the vector DB (here an IndexIDMap wrapping IndexFlatL2)
    num_results: number of matches to return
    
    Returns:
        distances: distances between the query and each match, as a NumPy array of shape (1, num_results).
        ids: IDs of the matches, as a NumPy array of shape (1, num_results).
    
    """

    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
    
    query_vector = np.array(model.get_embeddings([query])[0].values, dtype="float32").reshape(1, -1)

    distances, ids = index.search(query_vector, k=num_results)
    
    return distances, ids


In [40]:
user_query = """How does the Discounted Gift & Income Trust work?"""

ds, ids = vector_search_google(user_query, indexed, num_results=3)

print(f'Euclidean distance: {ds[0]}\nPage IDs: {ids[0]}')

Euclidean distance: [0.32595375 0.3572533  0.39052075]
Page IDs: [360 354 355]


In [41]:
# Fetch the pages whose IDs were returned by the search
guides_df[ guides_df["id"].isin(ids[0])]

Unnamed: 0,id,page_number,page_text,title
350,354,2,Discounted Gift & Income TrustPAGE 1\nTHE DISC...,Discounted Gift & Income trust Client Brochure
351,355,3,Discounted Gift & Income Trust1\n1THE DISCOUNT...,Discounted Gift & Income trust Client Brochure
356,360,8,Discounted Gift & Income Trust6\n6\nSUMMARY OF...,Discounted Gift & Income trust Client Brochure
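
Filtering with `isin` returns the matching rows in DataFrame order, so the ranking from the search is lost: in the table above the closest match (ID 360) is listed last. A minimal sketch of restoring the search order, using a toy frame with placeholder text and a hypothetical helper `rows_in_rank_order`:

```python
import numpy as np
import pandas as pd

def rows_in_rank_order(df, ranked_ids, id_column="id"):
    """Return the rows whose id_column is in ranked_ids, ordered as in ranked_ids."""
    # Faiss pads with -1 when it finds fewer than k results, so skip negative IDs.
    ranked_ids = [int(i) for i in ranked_ids if i >= 0]
    return df.set_index(id_column).loc[ranked_ids].reset_index()

# Toy stand-in for guides_df; the page texts are placeholders.
toy_df = pd.DataFrame(
    {
        "id": [354, 355, 356, 360],
        "page_number": [2, 3, 4, 8],
        "page_text": ["page two", "page three", "page four", "page eight"],
    }
)
ranked_ids = np.array([360, 354, 355])  # closest match first, as returned by the search

ranked_df = rows_in_rank_order(toy_df, ranked_ids)
print(ranked_df)
```

This assumes the IDs are unique and all present in the frame; `.loc` raises a `KeyError` for an ID that is missing.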


## Answer the query based on the relevant pages

In [43]:
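# Note: isin() keeps DataFrame order, so .iloc[0] takes the first matching page in that order
# (ID 354 here), which is not necessarily the closest match (ID 360).
context = guides_df["page_text"][ guides_df["id"].isin(ids[0])].iloc[0]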
question = """How does the Discounted Gift & Income Trust work?"""

template = f"""You are an expert having a conversation with a user.
Given the following extracted parts of a long document and a question,
create a final answer. 
{context}

user: {question}
expert:
"""

parameters = {
    "temperature": 0.2,
    "max_output_tokens": 256,   
    "top_p": .8,                
    "top_k": 40,                 
}

model = TextGenerationModel.from_pretrained("text-bison@001")
response = model.predict(template, **parameters)

print(f"Question: {question}\n")
print(f"Response from Model: \n{response.text}")


Question: 
How does the Discounted Gift & Income Trust work?

Response from Model: 
The Discounted Gift & Income Trust (creating fixed trust interests) is a trust that allows you to make a gift of assets to your children or grandchildren while retaining the right to receive income from the trust for your lifetime.
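
The cell above passes a single page as context. One possible extension is to join several retrieved pages, best match first, while keeping the prompt under a size budget. The sketch below assumes a character budget; `build_prompt` and the `max_chars` value are illustrative and not a documented limit of the model.

```python
def build_prompt(question, pages, max_chars=6000):
    """Join retrieved pages (best match first) into a prompt, within a character budget.

    max_chars is an illustrative limit, not a documented model constraint.
    """
    selected, used = [], 0
    for page in pages:
        page = page.strip()
        if not page:
            continue  # skip pages with no text
        if used + len(page) > max_chars:
            break  # stop at the first page that would exceed the budget
        selected.append(page)
        used += len(page)
    context = "\n\n---\n\n".join(selected)
    return (
        "You are an expert having a conversation with a user.\n"
        "Given the following extracted parts of a long document and a question,\n"
        "create a final answer.\n"
        f"{context}\n\n"
        f"user: {question}\n"
        "expert:\n"
    )

# Placeholder page texts, not taken from the real documents.
pages = ["Summary of the trust.", "   ", "How payments work.", "x" * 10_000]
prompt = build_prompt("How does the trust work?", pages)
print(prompt)
```

The blank page is skipped and the oversized last page is left out, so the prompt contains the two short pages separated by a divider.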


In [44]:
def gen_text_google(input_text, temperature: float=0.2) -> None:
    """Send input_text to Google's text-bison model and print the generated response."""
    parameters = {
        "temperature": temperature,
        "max_output_tokens": 256,   
        "top_p": .8,                
        "top_k": 40,                 
    }

    model = TextGenerationModel.from_pretrained("text-bison@001")
    response = model.predict(
        input_text,
        **parameters,
    )
    print(f"Response from Model: \n{response.text}")


In [47]:
question = """How does the Discounted Gift & Income Trust work?"""

ds, ids = vector_search_google(question, indexed, num_results=3)

context = guides_df["page_text"][ guides_df["id"].isin(ids[0])].iloc[0]

#style = "a concise way"
style = "details"

text = f"""You are an expert having a conversation with a user.
Given the following extracted parts of a long document and a question,
create a final answer in {style}. 
{context}

user: {question}
expert:
"""

print(f"Question: {question}\n")

gen_text_google(text)

Question: 
How does the Discounted Gift & Income Trust work?

Response from Model: 
The Discounted Gift & Income Trust (creating fixed trust interests) is a way of passing on assets to your children or grandchildren while you are still alive. It is a flexible and tax-efficient way of providing for your family and can be used to provide income, capital or both.

The trust is set up by you, the settlor, and you can choose who the beneficiaries will be. You can also choose how much income and capital will be paid to the beneficiaries and when.

The trust is a separate legal entity from you, the settlor, and this means that the assets in the trust are not subject to your creditors. This can be an important benefit if you are concerned about your financial situation in the future.

The trust can also be used to protect your assets from inheritance tax. When you die, the assets in the trust will not form part of your estate and will therefore not be subject to inheritance tax.

The Discounte
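
The steps in this section (search, collect context, build the prompt, generate) can be wrapped in a single function. The sketch below passes the search and generation steps in as callables so that it runs without Vertex AI. `answer_question`, `search_fn`, `page_lookup` and `generate_fn` are hypothetical names; in this notebook the callables could wrap `vector_search_google` and `model.predict`.

```python
def answer_question(question, search_fn, page_lookup, generate_fn, num_results=3):
    """Retrieve the closest pages for a question and ask a text model to answer from them.

    search_fn(question, num_results) -> list of page IDs, closest first
    page_lookup: mapping from page ID to page text
    generate_fn(prompt) -> generated answer as a string
    """
    page_ids = search_fn(question, num_results)
    # Unknown IDs are ignored when building the context.
    context = "\n\n".join(page_lookup[i] for i in page_ids if i in page_lookup)
    prompt = (
        "You are an expert having a conversation with a user.\n"
        "Given the following extracted parts of a long document and a question,\n"
        "create a final answer.\n"
        f"{context}\n\n"
        f"user: {question}\n"
        "expert:\n"
    )
    return {"question": question, "page_ids": list(page_ids), "answer": generate_fn(prompt)}

# Stubs standing in for the vector search and the text model.
pages = {360: "Summary of benefits.", 354: "How the trust is set up."}

def fake_search(question, num_results):
    return [360, 354, 999][:num_results]

def fake_generate(prompt):
    return f"(stub answer based on {len(prompt)} prompt characters)"

result = answer_question("How does the trust work?", fake_search, pages, fake_generate)
print(result["page_ids"], result["answer"])
```

Returning the page IDs alongside the answer makes it possible to show which pages the answer was based on.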

## Scratch