# <span style="color: #4daafc">Legal Case Similarity Detection - Create Dense Vector DB</span>
- [Environment](#environment)
- [Load Data](#load-data)
- [Estimate Number of Tokens](#estimate-number-of-tokens)
- [Create Embeddings and Insert into Vector DB](#create-embeddings-and-insert-into-vector-db)
- [Save Vector Store](#save-vector-store)

# Environment

### Import Python packages

In [1]:
from utils.file_utils import load_file
from utils.df import df_shape, concat_df_cols
from langchain_community.document_loaders import DataFrameLoader
import tiktoken
import numpy as np
from langchain_ollama import OllamaEmbeddings
import faiss
from langchain_community.vectorstores import FAISS 
from langchain_community.docstore.in_memory import InMemoryDocstore

### Global variables

In [2]:
# Ollama base URL
base_url = "http://localhost:11434"

# Load Data

In [3]:
f_path = 'data/sum_trans_w_refs_sparse_vec_data_ar_ap_100.xlsx'
df = load_file(file_name=f_path)

Successfully loaded DataFrame from data/sum_trans_w_refs_sparse_vec_data_ar_ap_100.xlsx


In [4]:
df_shape(df)
display(df.head(10))

Data shape: 100 rows x 505 columns


Unnamed: 0,case_number,procedure_name,case_date,case_link,document_body,document_body_eng_sum,document_body_english_1,document_body_english_2,legal_refs,legal_refs_sparse_vec,...,ת/66,ת/67,ת/67א,ת/68,ת/72,ת/82,ת/85,ת/85א,ת/89,ת/92
0,1108/97,"ע""א 1108/97 מרחיב אביב נ. מדינת ישראל",1997-05-11,https://supremedecisions.court.gov.il/Verdicts...,"בבית המשפט העליון בש""פ 97 / 1108 בפני: כבוד הר...",1. Case Number: 97/1108\n\n2. Case Type: Admin...,In the Supreme Court of Israel Case No. 97/110...,,[],[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,...,0,0,0,0,0,0,0,0,0,0
1,4477/00,"ע""א 4477/00 לודמילה וורוביוב נ. היועצ המשפטי ל...",2000-07-06,https://supremedecisions.court.gov.il/Verdicts...,"בבית המשפט העליון בה""נ 4477/00 בפני כבוד נשיא ...",\nCase Number: HCJ 4477/00\n\nCase Type: Admin...,"In the Supreme Court of Israel, HCJ 4477/00 Be...",,"['1(א)', '4477/00']",[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,...,0,0,0,0,0,0,0,0,0,0
2,1890/16,"ע""פ 1890/16",2017-03-09,https://supremedecisions.court.gov.il/Verdicts...,"פסק דין בתיק ע""פ 1890/16 בבית המשפט העליון בשב...",1. Case Number: 1890/16\n\n2. Case Type: Crimi...,"In this case, the Hebrew text is a translation...",,"['9816/09', '1890/16', '345(א)(1)', '345(ב)(1)...",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,...,0,0,0,0,0,0,0,0,0,0
3,7176/04,"ע""פ 7176/04 ירונ תלמי נ. מדינת ישראל",2006-02-02,https://supremedecisions.court.gov.il/Verdicts...,"פסק-דין בתיק ע""פ 7176/04 בבית המשפט העליון בשב...",Case Number: T.P. 2328/01\n\nCase Type: Crimin...,In the case of Appellant Yaron Telmi v. State ...,,"['61(א)(4)', '7(א)', '40075/04', '46', '185/87...",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,...,0,0,0,0,0,0,0,0,0,0
4,3766/12,"ע""א 3766/12",2012-06-17,https://supremedecisions.court.gov.il/Verdicts...,"החלטה בתיק ע""א 3766/12 בבית המשפט העליון בירוש...",1. Case Number: A3766/12\n\n2. Case Type: Civi...,The decision in Case A3766/12 at the Supreme C...,,"['1113/97', '3766/12', '471ג(ד)', '8467/06', '...",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,...,0,0,0,0,0,0,0,0,0,0
5,8178/12,"ע""א 8178/12 עו""ד צבי סלנט נ. יונתנ גוטליב",2014-11-12,https://supremedecisions.court.gov.il/Verdicts...,"החלטה בתיק ע""א 8178/12 בבית המשפט העליון בשבתו...",1. Case Number: Appeal No. 8178/12\n\n2. Case ...,In the case of Appeal No. 8178/12 at the Supre...,,"['8178/12', '7998/12']",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,...,0,0,0,0,0,0,0,0,0,0
6,3015/09,"ע""פ 3015/09 מדינת ישראל נ. פואד קדיח",2010-07-20,https://supremedecisions.court.gov.il/Verdicts...,"פסק דין בתיק ע""פ 3015/09 בבית המשפט העליון בשב...","1. Case Number: Criminal Appeal No. 3015/09, C...",The State of Israel appeals the sentence issue...,,"['3', '1242/97', '9097/05', '499(א)(1)', '3015...",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,...,0,0,0,0,0,0,0,0,0,0
7,4272/05,"ע""פ 4272/05 אמיר חג'וג' נ. מדינת ישראל",2006-01-04,https://supremedecisions.court.gov.il/Verdicts...,"פסק-דין בתיק ע""פ 4272/05 בבית המשפט העליון בשב...",1. Case Number: Appeal No. 4272/05\n\n2. Case ...,In the case of Appeal No. 4272/05 at the Supre...,,"['3121/04', '4272/05']",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,...,0,0,0,0,0,0,0,0,0,0
8,10467/08,"ע""א 10467/08 עומר חג'אזי נ. אדיב עיסא דיאב",2010-11-03,https://supremedecisions.court.gov.il/Verdicts...,"פסק דין בתיק ע""א 10467/08 בבית המשפט העליון בש...",Case Number: Appeal No. 10467/08\n\nCase Type:...,The case of Appeal No. 10467/08 at the Supreme...,he registered a cautionary note or completed ...,"['9', '2643/97', '380/88', '2275/90', '534/06'...",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0...,...,0,0,0,0,0,0,0,0,0,0
9,3330/11,"ע""א 3330/11 אגד אגודה שיתופית לתחבורה בישראל ב...",2011-11-17,https://supremedecisions.court.gov.il/Verdicts...,"פסק דין בתיק ע""א 3330/11 בבית המשפט העליון בשב...",1. Case Number: Appeal No. 3330/11\n\n2. Case ...,In the case of Appeal No. 3330/11 at the Supre...,,"['329/08', '6ב', '3330/11', '81', '4ג', '79א']",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,...,0,0,0,0,0,0,0,0,0,0


### Drop unnecessary columns

In [5]:
# drop one-hot encoded columns 
col_index = df.columns.get_loc('legal_refs_sparse_vec')  # find index of col 'legal_refs_sparse_vec'
df_legal_short = df.iloc[:, :col_index + 1]
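
The slicing above can be illustrated on a toy frame. This is only a sketch: the column names `ref_a` and `ref_b` are made-up stand-ins for the one-hot encoded reference columns. Note that `get_loc` returns a plain integer only when the column label is unique.

```python
import pandas as pd

# Toy frame: two "real" columns followed by hypothetical one-hot columns
toy = pd.DataFrame({
    "case_number": ["1/00", "2/00"],
    "legal_refs_sparse_vec": ["[0 1]", "[1 0]"],
    "ref_a": [0, 1],   # hypothetical one-hot column
    "ref_b": [1, 0],   # hypothetical one-hot column
})

# Position of the last column to keep (an int because the label is unique)
col_index = toy.columns.get_loc("legal_refs_sparse_vec")

# Keep every column up to and including that position
toy_short = toy.iloc[:, : col_index + 1]
print(list(toy_short.columns))
```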

### Concatenate previously split columns

In [6]:
# concatenate the split columns back into a single column
df_legal_short = concat_df_cols(df_legal_short, col_prefix='document_body_english')
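
The implementation of `concat_df_cols` is not shown in this notebook. Judging by the output below, it merges `document_body_english_1` and `document_body_english_2` into a single `document_body_english` column. A hypothetical stand-in might look like the following sketch; the function name, the empty-string separator, and the handling of missing parts are all assumptions.

```python
import pandas as pd

def concat_prefixed_cols(df: pd.DataFrame, col_prefix: str) -> pd.DataFrame:
    """Hypothetical stand-in: merge `<prefix>_1`, `<prefix>_2`, ... into `<prefix>`."""
    parts = [c for c in df.columns if c.startswith(f"{col_prefix}_")]
    if not parts:
        return df.copy()
    # Order the parts by their numeric suffix (assumes suffixes are integers)
    parts.sort(key=lambda c: int(c.rsplit("_", 1)[1]))
    # Missing parts become empty strings, then each row is joined left to right
    merged = df[parts].fillna("").astype(str).agg("".join, axis=1)
    # Put the merged column where the first part used to be
    position = min(df.columns.get_loc(c) for c in parts)
    out = df.drop(columns=parts)
    out.insert(position, col_prefix, merged)
    return out

toy = pd.DataFrame({
    "id": [1, 2],
    "text_1": ["ab", "cd"],
    "text_2": ["ef", None],
    "tail": [0, 0],
})
toy_merged = concat_prefixed_cols(toy, col_prefix="text")
print(toy_merged)
```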

In [7]:
df_legal_short.head(10)

Unnamed: 0,case_number,procedure_name,case_date,case_link,document_body,document_body_eng_sum,document_body_english,legal_refs,legal_refs_sparse_vec
0,1108/97,"ע""א 1108/97 מרחיב אביב נ. מדינת ישראל",1997-05-11,https://supremedecisions.court.gov.il/Verdicts...,"בבית המשפט העליון בש""פ 97 / 1108 בפני: כבוד הר...",1. Case Number: 97/1108\n\n2. Case Type: Admin...,In the Supreme Court of Israel Case No. 97/110...,[],[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
1,4477/00,"ע""א 4477/00 לודמילה וורוביוב נ. היועצ המשפטי ל...",2000-07-06,https://supremedecisions.court.gov.il/Verdicts...,"בבית המשפט העליון בה""נ 4477/00 בפני כבוד נשיא ...",\nCase Number: HCJ 4477/00\n\nCase Type: Admin...,"In the Supreme Court of Israel, HCJ 4477/00 Be...","['1(א)', '4477/00']",[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2,1890/16,"ע""פ 1890/16",2017-03-09,https://supremedecisions.court.gov.il/Verdicts...,"פסק דין בתיק ע""פ 1890/16 בבית המשפט העליון בשב...",1. Case Number: 1890/16\n\n2. Case Type: Crimi...,"In this case, the Hebrew text is a translation...","['9816/09', '1890/16', '345(א)(1)', '345(ב)(1)...",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
3,7176/04,"ע""פ 7176/04 ירונ תלמי נ. מדינת ישראל",2006-02-02,https://supremedecisions.court.gov.il/Verdicts...,"פסק-דין בתיק ע""פ 7176/04 בבית המשפט העליון בשב...",Case Number: T.P. 2328/01\n\nCase Type: Crimin...,In the case of Appellant Yaron Telmi v. State ...,"['61(א)(4)', '7(א)', '40075/04', '46', '185/87...",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
4,3766/12,"ע""א 3766/12",2012-06-17,https://supremedecisions.court.gov.il/Verdicts...,"החלטה בתיק ע""א 3766/12 בבית המשפט העליון בירוש...",1. Case Number: A3766/12\n\n2. Case Type: Civi...,The decision in Case A3766/12 at the Supreme C...,"['1113/97', '3766/12', '471ג(ד)', '8467/06', '...",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
5,8178/12,"ע""א 8178/12 עו""ד צבי סלנט נ. יונתנ גוטליב",2014-11-12,https://supremedecisions.court.gov.il/Verdicts...,"החלטה בתיק ע""א 8178/12 בבית המשפט העליון בשבתו...",1. Case Number: Appeal No. 8178/12\n\n2. Case ...,In the case of Appeal No. 8178/12 at the Supre...,"['8178/12', '7998/12']",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
6,3015/09,"ע""פ 3015/09 מדינת ישראל נ. פואד קדיח",2010-07-20,https://supremedecisions.court.gov.il/Verdicts...,"פסק דין בתיק ע""פ 3015/09 בבית המשפט העליון בשב...","1. Case Number: Criminal Appeal No. 3015/09, C...",The State of Israel appeals the sentence issue...,"['3', '1242/97', '9097/05', '499(א)(1)', '3015...",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
7,4272/05,"ע""פ 4272/05 אמיר חג'וג' נ. מדינת ישראל",2006-01-04,https://supremedecisions.court.gov.il/Verdicts...,"פסק-דין בתיק ע""פ 4272/05 בבית המשפט העליון בשב...",1. Case Number: Appeal No. 4272/05\n\n2. Case ...,In the case of Appeal No. 4272/05 at the Supre...,"['3121/04', '4272/05']",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
8,10467/08,"ע""א 10467/08 עומר חג'אזי נ. אדיב עיסא דיאב",2010-11-03,https://supremedecisions.court.gov.il/Verdicts...,"פסק דין בתיק ע""א 10467/08 בבית המשפט העליון בש...",Case Number: Appeal No. 10467/08\n\nCase Type:...,The case of Appeal No. 10467/08 at the Supreme...,"['9', '2643/97', '380/88', '2275/90', '534/06'...",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0...
9,3330/11,"ע""א 3330/11 אגד אגודה שיתופית לתחבורה בישראל ב...",2011-11-17,https://supremedecisions.court.gov.il/Verdicts...,"פסק דין בתיק ע""א 3330/11 בבית המשפט העליון בשב...",1. Case Number: Appeal No. 3330/11\n\n2. Case ...,In the case of Appeal No. 3330/11 at the Supre...,"['329/08', '6ב', '3330/11', '81', '4ג', '79א']",[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...


### Convert DataFrame into Documents

In [8]:
df_loader = DataFrameLoader(df_legal_short, page_content_column="document_body_eng_sum")
docs = df_loader.load()

In [9]:
docs[:5]

[Document(metadata={'case_number': '1108/97', 'procedure_name': 'ע"א 1108/97 מרחיב אביב נ. מדינת ישראל', 'case_date': '1997-05-11', 'case_link': 'https://supremedecisions.court.gov.il/Verdicts/Results/1/null/1997/1108/1/8/null/null/null', 'document_body': 'בבית המשפט העליון בש"פ 97 / 1108 בפני: כבוד הרשמת א\' אפעל-גבאי המבקש: מרחיב אביב נגד המשיבה: מדינת ישראל בקשה להארכת מועד להגשת ערעור בשם המבקש: עו"ד איילון אורון בשם המשיבה עו"ד נאוה בן-אור החלטה הארכת מועד כמבוקש. ניתנה היום, ד\' באייר תשנ"ז (11.5.97). אורית אפעל-גבאי, שופטת ר ש מ ת העתק מתאים למקור שמריהו כהן - מזכיר ראשי 97011080.Q03', 'document_body_english': "In the Supreme Court of Israel Case No. 97/1108 Before: Honorable Justice A. Epstein-Gabay Petitioner: Aviv Marehiv against Respondent: State of Israel Request for Extension of Time to File an Appeal On behalf of the petitioner: Advocate Eilon Oron On behalf of the respondent: Advocate Nava Ben-Or As requested, the deadline is extended. Given on this 11th day of Iyar 5757

# Estimate Number of Tokens

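Token counts are estimated with `tiktoken`, using the encoding for GPT-4o-mini. The embedding model used below has its own tokenizer, so these counts are approximations.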

In [10]:
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

In [11]:
# count the tokens in each document summary
token_counts = [len(encoding.encode(doc.page_content)) for doc in docs]

# summary statistics over all documents
tc_summary = {
    'count': len(token_counts),
    'min': min(token_counts),
    'avg': int(sum(token_counts) / len(token_counts)),
    'max': max(token_counts),
}

# print token counts
print(", ".join(f"{k}: {v}" for k, v in tc_summary.items()))

count: 100, min: 52, avg: 197, max: 324
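
One use for such counts is to flag documents that exceed a token budget before embedding them. The sketch below is tokenizer-agnostic; `summarize_token_counts` and `max_tokens` are illustrative names, not part of the notebook. Whitespace splitting stands in for a real tokenizer. In this notebook, `count_tokens` could be `lambda t: len(encoding.encode(t))`.

```python
from typing import Callable

def summarize_token_counts(
    texts: list[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> tuple[dict, list[int]]:
    """Return summary statistics and the positions of texts over the budget."""
    counts = [count_tokens(t) for t in texts]
    summary = {
        "count": len(counts),
        "min": min(counts),
        "avg": sum(counts) // len(counts),
        "max": max(counts),
    }
    over_budget = [i for i, c in enumerate(counts) if c > max_tokens]
    return summary, over_budget

# Whitespace splitting is a crude stand-in for a real tokenizer
samples = ["one two three", "one", "one two three four five"]
summary, over_budget = summarize_token_counts(
    samples, count_tokens=lambda t: len(t.split()), max_tokens=4
)
print(summary, over_budget)
```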


# Create Embeddings and Insert into Vector DB

**Reference**: [LangChain | Faiss](https://api.python.langchain.com/en/latest/community/vectorstores/langchain_community.vectorstores.faiss.FAISS.html)

In [12]:
emb_model = 'nomic-embed-text:latest'

Cosine similarity search is implemented here with normalized (unit-length) vectors. \
We inherit from `OllamaEmbeddings` and extend it to check whether each embedding vector has norm 1 and, if it does not, normalize it:
$$\hat{u} = \frac{u}{\Vert u \Vert}$$

Thanks to: [Github ollama issue #4128](https://github.com/ollama/ollama/issues/4128)
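
To see why normalization matters, note that the inner product of two unit vectors equals the cosine similarity of the original vectors. A small pure-Python check of that identity follows; the helper names are illustrative only.

```python
import math

def dot(u: list[float], v: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(u: list[float]) -> list[float]:
    """Return u / ||u||; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(u, u))
    return list(u) if norm == 0 else [x / norm for x in u]

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity computed directly from the raw vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u, v = [3.0, 4.0], [4.0, 3.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)

# Both values agree up to floating-point error
inner_of_unit = dot(u_hat, v_hat)
cos_of_raw = cosine(u, v)
print(round(inner_of_unit, 6), round(cos_of_raw, 6))
```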

In [13]:
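# create embedding function
class OllamaEmbeddingsNorm(OllamaEmbeddings):
    """OllamaEmbeddings that returns unit-length (L2-normalized) vectors."""

    def _normalize(self, emb: list[float]) -> list[float]:
        norm = np.linalg.norm(emb)
        # leave zero vectors and vectors that are already unit length untouched
        if norm == 0 or np.isclose(norm, 1):
            return emb
        return (np.array(emb) / norm).tolist()

    # override the public Embeddings methods so normalization
    # does not depend on a private helper of the base class
    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self._normalize(e) for e in super().embed_documents(texts)]

    def embed_query(self, text: str) -> list[float]:
        return self._normalize(super().embed_query(text))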

In [14]:
ollama_emb = OllamaEmbeddingsNorm(model=emb_model, base_url=base_url)

Embed a dummy text and check the size of the resulting vector:

In [15]:
vector = ollama_emb.embed_query("hello world")
vector_len = len(vector)
print(f"Model '{ollama_emb.model}' vector length: {vector_len}")

Model 'nomic-embed-text:latest' vector length: 768


Create a FAISS index whose dimension equals the vector length retrieved above. Because the vectors are unit length, an inner-product index (`IndexFlatIP`) performs cosine similarity search. <br/>
Reference: [Langchain Faiss Cosine Similarity](https://www.restack.io/docs/langchain-knowledge-faiss-cosine-similarity-cat-ai)

In [16]:
index = faiss.IndexFlatIP(vector_len)
print(f"Total number of vectors stored in FAISS index: {index.ntotal}, vector dimension: {index.d}")

Total number of vectors stored in FAISS index: 0, vector dimension: 768
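
Conceptually, a flat inner-product index scores the query against every stored vector and returns the highest-scoring ones. The pure-Python sketch below illustrates that idea only. It is not how FAISS is implemented, and `top_k_inner_product` is an illustrative name.

```python
def top_k_inner_product(
    query: list[float], vectors: list[list[float]], k: int = 2
) -> list[tuple[int, float]]:
    """Exhaustively score `query` against all vectors; return the k best (index, score) pairs."""
    scored = [
        (i, sum(q * x for q, x in zip(query, vec)))
        for i, vec in enumerate(vectors)
    ]
    # Larger inner product means more similar when all vectors are unit length
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Unit-length toy vectors
stored = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = top_k_inner_product([0.8, 0.6], stored, k=2)
print(hits)
```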


Create the FAISS vector store with the following arguments:
- embedding_function - the embedding function, used to embed the documents into vectors
- index - the FAISS index (in our case based on the inner product, for cosine similarity)
- docstore - the document store; here the documents are kept in RAM
- index_to_docstore_id - maps each vector in the FAISS index to the id of its document in the document store.
  It is initialized to an empty dict and filled as vectors are inserted into the DB.
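
The relationship between the index, the docstore, and the mapping can be sketched with plain Python containers. `ToyVectorStore` is a hypothetical illustration of the bookkeeping, not the LangChain implementation.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class ToyVectorStore:
    vectors: list = field(default_factory=list)               # stands in for the FAISS index
    docstore: dict = field(default_factory=dict)              # document id -> document text
    index_to_docstore_id: dict = field(default_factory=dict)  # vector position -> document id

    def add(self, vector: list[float], text: str) -> str:
        """Store a vector and its document, recording how to get from one to the other."""
        doc_id = str(uuid.uuid4())
        position = len(self.vectors)
        self.vectors.append(vector)
        self.docstore[doc_id] = text
        self.index_to_docstore_id[position] = doc_id
        return doc_id

    def lookup(self, position: int) -> str:
        """Resolve a vector position (as returned by a search) to its document."""
        return self.docstore[self.index_to_docstore_id[position]]

store = ToyVectorStore()
first_id = store.add([1.0, 0.0], "case A summary")
store.add([0.0, 1.0], "case B summary")
print(store.lookup(1))
```

A search over the index returns vector positions; the mapping turns each position into a document id, and the docstore turns that id into the document itself.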

In [17]:
vector_store = FAISS(
    embedding_function=ollama_emb,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={}
)

In [18]:
print(f"Total number of vectors stored in FAISS DB: {vector_store.index.ntotal}, vector dimension: {vector_store.index.d}")

Total number of vectors stored in FAISS DB: 0, vector dimension: 768


In [19]:
ids = vector_store.add_documents(documents=docs)

In [20]:
print(f"Total number of vectors stored in FAISS DB: {vector_store.index.ntotal}, vector dimension: {vector_store.index.d}")

Total number of vectors stored in FAISS DB: 100, vector dimension: 768


# Save Vector Store

In [21]:
db_name = "db/vectors/db_legal_cases_summary_100"
vector_store.save_local(db_name)