## 1. Set up the environment
First, make sure the libraries for embedding generation, vector search, and data manipulation are installed: transformers (with torch) for generating embeddings, faiss for similarity search, and pandas for working with your DataFrame.

In [3]:
from pathlib import Path


path_root = Path.cwd()
while not (path_root / "data").exists():
    if path_root == path_root.parent:
        raise FileNotFoundError("Directory 'data' not found")
    path_root = path_root.parent

path_data = path_root / "data"

In [9]:
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity
import re

from sentence_transformers import SentenceTransformer



## 2. Prepare your DataFrame
You need a DataFrame with a text column to search over. Here that column is Description, loaded from the rule definitions and keyed by RuleID.

In [96]:
data_BoR = pd.read_excel(
    path_root / 'data/processed/Bank_of_Rules-refractored.xlsx',
    usecols=['Description', 'RuleID', 'Code', 'Parameters'],
    index_col='RuleID'
    )

df_defs = (
    pd.read_json(path_root / 'data/internim/definitions.json')[
        ['RuleID', 'Description', 'Condition/Logic', 'Example',
         # 'Parameters',
         'Category', 'Categorization']
    ]
    .drop_duplicates(subset=['RuleID'])
    .set_index('RuleID')
)

In [42]:
display(df_defs.columns)
df_defs = (
    df_defs
    .drop_duplicates()
    # .shape
    )
df_defs

Index(['RuleID', 'Description', 'Condition/Logic', 'Example', 'Category',
       'Categorization'],
      dtype='object')

,RuleID,Description,Condition/Logic,Example,Category,Categorization
0,DQRC0001,Checks that the value in `<cde>` is not null a...,`<cde>` is non-null and contains non-whitespac...,"If `<cde>` = `Name` and `Name` = `John`, retur...",Completeness,Completeness / Non-Null Check
1,DQRA0194,Ensures that `<cde>` exists in `<#>catalog<#>`...,`<cde>` exists in `<#>catalog<#>` keyed by `<#...,If `<#>DependentColumn<#>`=`VIP` and `<#>ListA...,Catalog Check,Value List / Catalog Membership
2,DQRV0020,Compares `<cde>` with `<#>Column1<#>` as dates...,Parse `<cde>` using `<#>frmtDte1<#>` and `<#>C...,If `<#>Operator<#>`=`==` and both dates parse ...,Date Check,Date Validation and Comparisons
3,DQRU0004,Ensures that `<cde>` does not contain duplicat...,`<cde>` must not have any duplicate values wit...,"If `<cde>`=`ID` and `ID`=[1, 2, 3, 3], return ...",Uniqueness Check,Uniqueness / Duplicate Checks
4,DQRF0005,Validates that the length of the value in `<cd...,The length of `<cde>` must satisfy `<#>Operato...,If `<cde>`=`Name` and `<#>Operator<#>`=`>` `<#...,Length Check,Data Type / Numeric / Length Constraints
5,DQRF0037,Applies a regex pattern (`<#>pattern<#>`) to v...,`<cde>` must match the provided `<#>pattern<#>...,If `<cde>`=`Code` and `<#>pattern<#>`=`^[A-Z]{...,Pattern Matching,Pattern Matching / Regex Validation
6,DQRI0106,Ensures that `<cde>` falls between `<#>Column2...,`<cde>` must satisfy `<#>Column2<#> <= <cde> <...,"If `<cde>`=`Score`=`50`, `<#>Column2<#>`=`40`,...",Range Check,Conditional Checks Based on Other Columns
7,DQRC0044,Checks if the value in `<cde>` is non-null and...,`<cde>` must be non-null and contain character...,If `<cde>`=`Address` and `Address`=`'123 Main ...,Completeness,Completeness / Non-Null Check
8,DQRF0006,Ensures that the length of `<cde>` falls betwe...,`<#>length_min<#>` `<#>Operator1<#>` length of...,"If `<cde>`=`Name`, `<#>length_min<#>`=`5`, `<#...",Length Check,Data Type / Numeric / Length Constraints
9,DQRF0178,Ensures that `<cde>` does not contain a decima...,The decimal part of `<cde>` must have at most ...,If `<cde>`=`Amount` and `<#>DecimalComplement<...,Numeric Validation,Data Type / Numeric / Length Constraints


## 3. Generate Vector Embeddings
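Use a pre-trained transformer model to generate an embedding for each entry in the text column. The cells below load sentence-transformers/all-MiniLM-L6-v2 through transformers and mean-pool its token outputs; other encoders such as distilbert-base-uncased can be loaded the same way.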

In [43]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load a transformer model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Save your files to a specified directory with PreTrainedModel.save_pretrained():
tokenizer.save_pretrained(path_root / "sentence-transformers/all-MiniLM-L6-v2/tokenizer")
model.save_pretrained(path_root / "sentence-transformers/all-MiniLM-L6-v2/model")

# When you are offline, reload the files from the same directory with from_pretrained():
# tokenizer = AutoTokenizer.from_pretrained(path_root / "sentence-transformers/all-MiniLM-L6-v2/tokenizer")
# model = AutoModel.from_pretrained(path_root / "sentence-transformers/all-MiniLM-L6-v2/model")

In [44]:
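# Function to generate an embedding for one text
def get_embeddings(text):
    """Return a 1-D NumPy embedding: the mean of the model's token vectors."""
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Mean-pool over the token dimension, then drop the batch dimension
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()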

# Generate embeddings for each row
df_defs['embeddings'] = df_defs['Description'].apply(get_embeddings)
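
The function above averages over every token position, which is fine when texts are embedded one at a time because no padding is added. If the texts were batched, padded positions would be averaged in as well. The sketch below illustrates attention-mask-weighted mean pooling on toy NumPy arrays; `masked_mean_pool` and the toy data are hypothetical names for illustration and are not part of the notebook.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim) array, e.g. a model's last hidden state
    attention_mask:   (batch, seq_len) array with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: two sequences, three positions, two dimensions.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask)  # pooled[0] is [2.0, 3.0]
naive = tokens.mean(axis=1)              # naive[0] is about [1.33, 2.0]
```

The two results agree for the unpadded sequence and differ only where padding is present, which is why the distinction matters mainly for batched encoding.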

## 4. Build the Search Engine with FAISS
You can now use faiss to index these embeddings and perform similarity searches.

In [45]:
import faiss
import numpy as np

# Prepare the embeddings matrix (FAISS expects float32)
embeddings_matrix = np.stack(df_defs['embeddings'].values).astype('float32')

# Build the FAISS index
index = faiss.IndexFlatL2(embeddings_matrix.shape[1])  # Exact search; reports squared L2 distances
index.add(embeddings_matrix)

# Search function to find the most similar texts
def search(query, top_k=5):
    query_embedding = get_embeddings(query).reshape(1, -1)
    distances, indices = index.search(query_embedding, top_k)
    results = df_defs.iloc[indices[0]]
    return results, distances[0]

In [46]:
query = "CDE should be unique"
results, distances = search(query, top_k=4)

# Display the most similar texts
print("Results:")
print(results[['RuleID','Description']])
print("Distances:", distances)

Results:
      RuleID                                        Description
18  DQRU0127  Ensures `<cde>` does not contain duplicate val...
3   DQRU0004  Ensures that `<cde>` does not contain duplicat...
10  DQRV0014  Ensures that `<cde>` matches a numeric format ...
14  DQRV0002  Validates that `<cde>` satisfies a numeric com...
Distances: [19.23109  20.247612 26.638626 28.703209]
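
A note on reading these numbers: IndexFlatL2 reports squared Euclidean distances, and the embeddings here are not length-normalized, so the values depend on vector magnitude as well as direction. If the vectors were scaled to unit length first, squared L2 distance and cosine similarity would be tied by `d² = 2 − 2·cos` and would give the same ranking. The NumPy sketch below checks that relation on random toy vectors; it is an illustration under that assumption, not the notebook's method. In FAISS, the analogous variant would be to normalize with `faiss.normalize_L2` and search an `faiss.IndexFlatIP` index.

```python
import numpy as np

def l2_normalize(matrix):
    """Scale each row to unit length."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

# Random toy vectors standing in for real embeddings (hypothetical data).
rng = np.random.default_rng(0)
corpus = l2_normalize(rng.normal(size=(6, 8)))
query = l2_normalize(rng.normal(size=(1, 8)))

squared_l2 = ((corpus - query) ** 2).sum(axis=1)  # what a flat L2 index reports
cosine = (corpus @ query.T).ravel()               # inner product of unit vectors

# For unit vectors, squared_l2 equals 2 - 2 * cosine, so both orderings match.
order_by_distance = np.argsort(squared_l2)
order_by_cosine = np.argsort(-cosine)
```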


## 5. Perform a Similarity Search
With the index built, run a query through the search function and display each match together with its entry from the Bank of Rules.

In [86]:
row

Description        Validates that `<cde>` satisfies a numeric com...
Condition/Logic    `<cde>` must satisfy `<#>Operator<#>` `<#>comp...
Example            If `<cde>`=`Score`, `<#>Operator<#>`=`>=`, and...
Category                                          Numeric Validation
Categorization              Data Type / Numeric / Length Constraints
Code               <data>[cde].astype(float) <#>Operator<#> float...
Description        Validates that <cde> satisfies a numeric compa...
Parameters                                   ['Operator', 'compare']
Name: DQRV0002, dtype: object
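
When a definitions row and its Bank of Rules row are concatenated, both contribute a Description entry, so the label appears twice, as in the output above. One possible way to keep the two apart is to join the tables with suffixes on the overlapping column. The sketch below uses small stand-in frames; `defs`, `bank`, the `R1` key, and the suffix names are hypothetical and only mirror the shape of `df_defs` and `data_BoR`.

```python
import pandas as pd

# Stand-ins for df_defs and data_BoR, both indexed by RuleID (hypothetical data).
defs = pd.DataFrame(
    {"Description": ["Ensures `<cde>` is unique."], "Category": ["Uniqueness Check"]},
    index=pd.Index(["R1"], name="RuleID"),
)
bank = pd.DataFrame(
    {"Description": ["Ensures <cde> is unique."], "Code": ["~<data>[cde].duplicated()"]},
    index=pd.Index(["R1"], name="RuleID"),
)

# Suffixes are applied only to the column names the two frames share.
combined = defs.join(bank, lsuffix="_defs", rsuffix="_BoR")
row = combined.loc["R1"]  # every label in this row is now unique
```

With unique labels, `row["Description_defs"]` and `row["Description_BoR"]` can be read individually instead of returning both values at once.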

In [105]:
from IPython.display import Markdown, display

# Print out the most similar texts
display(Markdown("### Results:"))
for rule_id in results.index:
    # Combine the definition fields with the matching Bank of Rules entry
    row = pd.concat([df_defs.loc[rule_id], data_BoR.loc[rule_id]])
    display(Markdown(f"#### RuleID: {rule_id}"))
    for field, value in row.items():
        print(f"\033[1m{field}:\033[0m {value}")

### Results:

#### RuleID: DQRU0127

**Description:** Ensures `<cde>` does not contain duplicate values.
**Condition/Logic:** All values in `<cde>` must be unique.
**Example:** If `<cde>`=`ID` and `ID`=[1, 2, 3, 3], return `False`. If `ID`=[1, 2, 3], return `True`.
**Category:** Uniqueness Check
**Categorization:** Uniqueness / Duplicate Checks
**Code:** ~<data>[cde].duplicated()
**Description:** Ensures that <cde> does not contain duplicate values. Similar to DQRU0004.
**Parameters:** []


#### RuleID: DQRU0004

**Description:** Ensures that `<cde>` does not contain duplicate values.
**Condition/Logic:** `<cde>` must not have any duplicate values within the dataset.
**Example:** If `<cde>`=`ID` and `ID`=[1, 2, 3, 3], return `False`. If `ID`=[1, 2, 3], return `True`.
**Category:** Uniqueness Check
**Categorization:** Uniqueness / Duplicate Checks
**Code:** ~<data>[cde].duplicated()
**Description:** Ensures that <cde> does not contain duplicate values.
**Parameters:** []


#### RuleID: DQRV0014

**Description:** Ensures that `<cde>` matches a numeric format with up to 14 digits before the decimal and exactly 2 digits after.
**Condition/Logic:** `<cde>` must match the pattern `\d{1,14}\.\d{2}`.
**Example:** If `<cde>`=`Balance` and `Balance`=`12345678901234.56`, return `True`. If `Balance`=`123456.789`, return `False`.
**Category:** Pattern Matching
**Categorization:** Pattern Matching / Regex Validation
**Code:** ^[0-9]{1,14}.[0-9]{2}$
**Description:** Ensures that <cde> matches a numeric format with up to 14 digits before the decimal and exactly 2 digits after.
**Parameters:** []


#### RuleID: DQRV0002

**Description:** Validates that `<cde>` satisfies a numeric comparison (`<#>Operator<#>` and `<#>compare<#>`).
**Condition/Logic:** `<cde>` must satisfy `<#>Operator<#>` `<#>compare<#>`.
**Example:** If `<cde>`=`Score`, `<#>Operator<#>`=`>=`, and `<#>compare<#>`=`50`, then `Score`=`45` returns `False`. `Score`=`55` returns `True`.
**Category:** Numeric Validation
**Categorization:** Data Type / Numeric / Length Constraints
**Code:** <data>[cde].astype(float) <#>Operator<#> float(<#>compare<#>)
**Description:** Validates that <cde> satisfies a numeric comparison (<#>Operator<#> and <#>compare<#>) after being converted to a float.
**Parameters:** ['Operator', 'compare']
