In [1]:
!pip -q install langchain_community faiss-cpu langchain-openai langchain-text-splitters langchain_huggingface pypdf python-dotenv


In [2]:
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Load environment variables (e.g. OPENAI_API_KEY) from a local .env file.
load_dotenv()

llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
parser = StrOutputParser()

# Load every PDF in the folder; each page becomes one document.
directory_path = "documentation_addr"
loader = PyPDFDirectoryLoader(directory_path)
documents = loader.load()

# Split pages into overlapping chunks (sizes are measured in characters).
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
docs = text_splitter.split_documents(documents)

# Embed the chunks, index them in FAISS, and expose the index as a retriever.
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_type="similarity")

prompt=ChatPromptTemplate.from_messages([
    ("system", 
     """You are a senior model governance analyst. Your task is to compare two versions of a model report (v3 and v4) 
based on the provided document excerpts.

Please identify and summarize all **important differences** between the two documents, specifically focusing on:

- ✅ Model type (e.g., XGBoost, ensemble, etc.)
- ✅ Feature engineering and number/type of features
- ✅ Training data ranges (include exact date ranges; this is important, do not miss it)
- ✅ Validation methodology (holdout, OOT, blind review, etc.)
- ✅ Model performance metrics (AUC, FDR, G:F ratio, etc.)
- ✅ Reason codes, output format, compliance references (e.g., ECOA)
- ✅ Any renaming, reversioning, or deployment differences

For each difference, include:
- A **clear numbered bullet point**
- Whether it applies to **v3** or **v4** or both
- A **short text excerpt** from the original source

Do **not** include unimportant boilerplate or redundant information. Only highlight true differences or notable enhancements.

Context:
{context}"""),
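    # The role and the template must form one tuple; as separate strings,
    # "user" would be sent to the model as its own message.
    ("user", "{input}"),
])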
rag_chain={'context': retriever, 'input': RunnablePassthrough()} | prompt | llm | parser

response=rag_chain.invoke("Tell me the differences between the two document versions. Also verify whether they use the same model, and list any changes relating to training, testing, or variables.")
print(response)

1. **Model Type**:
   - **v3**: Not specifically mentioned in the excerpts.
   - **v4**: Utilizes the LightGBM model among various other models as explained in the benchmarking analysis. 
     - *Excerpt*: "The orange model is LightGBM, the green model is a Generalized Linear Model, the red model is a Gradient Boosted Model, the purple model is a Distributed Random Forest and the brown model is XGBoost."

2. **Training Data Ranges**:
   - **v3**: Not explicitly mentioned in the available excerpts.
   - **v4**: Specifies the use of out-of-time validation data which is not used in model training.
     - *Excerpt*: "The following results are based solely on out-of-time validation data, which is not used in model training."

3. **Feature Engineering and Number/Type of Features**:
   - **v3**: Details not included in the available excerpts.
   - **v4**: Elaborates on advanced statistical techniques for variable selection, including the use of Information Value determined by Weight of Eviden

In [18]:
response=rag_chain.invoke("What is the time range in the training data, test data and validation data between v3 and v4?")
print(response)

The document excerpts only provided details for the time ranges in version 4.0. Here's a summary for that version:

- **Version 4.0**:
  - **Development Dataset (Training and Testing)**: January 2021 to September 2023
  - **Hold-Out Dataset (Validation)**: January 2021 to September 2023
  - **Out-of-Time (OOT) Dataset (Validation and Stability Testing)**: October 2023 to December 2023

Unfortunately, the document excerpts provided did not give any information about the time ranges for version 3.0. Therefore, without additional information, I cannot compare the time ranges between version 3.0 and version 4.0.
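
One possible reason an answer like the one above finds no v3 details is that the retrieved chunks did not include any v3 text, or that the prompt could not tell which report each chunk came from. The following is a minimal sketch of labelling chunks by version before they reach the prompt. It assumes each chunk carries a `source` path in its metadata and that file names contain `v3` or `v4`; the file names, the `Chunk` stand-in type, and the helpers `version_of` and `format_context` are all hypothetical and not part of the notebook.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    """Stand-in for a retrieved document chunk (hypothetical type)."""
    text: str
    metadata: dict = field(default_factory=dict)


def version_of(chunk: Chunk) -> str:
    """Guess the report version from the source file name (assumed naming)."""
    source = chunk.metadata.get("source", "")
    match = re.search(r"v(\d+)", source, flags=re.IGNORECASE)
    return f"v{match.group(1)}" if match else "unknown"


def format_context(chunks: List[Chunk]) -> str:
    """Prefix every chunk with its version so excerpts can be attributed."""
    return "\n\n".join(f"[{version_of(c)}] {c.text}" for c in chunks)


# Made-up file names, used only to exercise the helpers.
chunks = [
    Chunk("The bads are oversampled.",
          {"source": "documentation_addr/report_v3.pdf"}),
    Chunk("The goods are undersampled in the development set.",
          {"source": "documentation_addr/report_v4.pdf"}),
]
context = format_context(chunks)
print(context)
```

If the labelled context still contains no `[v3]` entries for a given question, the gap is in retrieval rather than in the model's reading of the excerpts.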


In [19]:
response=rag_chain.invoke("What is the difference in sampling methodologies?")
print(response)

1. **Sampling Methodology - Undersampling vs. Oversampling**:
    - **v4**: "The goods are undersampled in the development set to aid in supervised modeling."
    - **v3**: "The bads are oversampled." 

This indicates a change from oversampling of fraudulent (bad) transactions in v3 to undersampling of non-fraudulent (good) transactions in v4 to handle the imbalance in the dataset.
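
The two strategies quoted above can be illustrated on toy data. This is a minimal sketch using only the standard library; the label encoding (`0` = good, `1` = bad), the helper names, and the 1:1 target balance are assumptions for illustration, since the excerpts do not state the sampling ratios used in either version.

```python
import random


def undersample_goods(rows, n_goods, seed=0):
    """Keep every bad row and a random subset of n_goods good rows."""
    rng = random.Random(seed)
    goods = [r for r in rows if r["label"] == 0]
    bads = [r for r in rows if r["label"] == 1]
    return bads + rng.sample(goods, n_goods)


def oversample_bads(rows, n_bads, seed=0):
    """Keep every good row and draw n_bads bad rows with replacement."""
    rng = random.Random(seed)
    goods = [r for r in rows if r["label"] == 0]
    bads = [r for r in rows if r["label"] == 1]
    return goods + rng.choices(bads, k=n_bads)


# Toy data: 95 goods and 5 bads (made-up numbers).
rows = ([{"id": i, "label": 0} for i in range(95)]
        + [{"id": 95 + i, "label": 1} for i in range(5)])

under = undersample_goods(rows, n_goods=5)
over = oversample_bads(rows, n_bads=95)

print(len(under), sum(r["label"] for r in under))  # 10 rows, 5 of them bad
print(len(over), sum(r["label"] for r in over))    # 190 rows, 95 of them bad
```

Both approaches yield a balanced set here, but undersampling shrinks the data and discards goods, while oversampling grows it by repeating the same few bads.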


In [20]:
response=rag_chain.invoke("What is the difference in sampling methodologies?")
print(response)

1. **Stratified Sampling Modifications**:
   - **Version 3**: Employs a stratified sample method aimed at generating a representative dataset of the entire US population. It mentions proportionally equal weightage applied to all client data across the strata.
     - *Excerpt*: "Socure uses a stratified sample method to generate a representative modeling dataset of the entire US population. [...] The sample is proportionally the same across all strata to ensure equal weightage is applied to all client data."
   - **Version 4**: Also utilizes a stratified sample method but emphasizes ensuring that fraud patterns observed across various clients are representative within each dataset and mentions the use of a consortium.
     - *Excerpt*: "A stratified sampling method is used as a baseline to ensure that fraud patterns observed across various clients are representative within each dataset. [...] Socure uses a stratified sampling method based on fraud label and client to generate a representat

In [21]:
response=rag_chain.invoke("What is the performance in V4? Report the AUC and FDRs.")
print(response)

In version 4 (V4) of the Address Risk Model, the performance metrics are as follows:

- **AUC-ROC**: 
  - Overall AUC-ROC for the Out-of-Time Set is **90.24%**. This indicates the model's ability to distinguish between good and fraudulent cases.

- **Fraud Detection Rates (FDR) / Good-to-Fraud Ratio (G:F)** for various risk depths:
  - **0.5% risk depth**: FDR is **16.61%** with a G:F ratio of **2:1**.
  - **1% risk depth**: FDR is **24.60%** with a G:F ratio of **3:1**.
  - **3% risk depth**: FDR is **41.23%** with a G:F ratio of **6:1**.
  - **5% risk depth**: FDR is **50.72%** with a G:F ratio of **9:1**.
  - **10% risk depth**: FDR is **66.50%** with a G:F ratio of **14:1**.
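
The metrics listed above can be reproduced on toy data. This is a minimal sketch under assumed definitions, since the excerpts do not define the terms: "risk depth" is taken as the riskiest fraction of the population by score, FDR as the share of all frauds that falls inside that fraction, and G:F as the number of goods flagged per fraud flagged. The function name and the data are made up.

```python
def fdr_and_gf(scores, labels, depth):
    """Return (FDR, goods-per-fraud) for the riskiest `depth` fraction.

    scores: higher means riskier; labels: 1 = fraud, 0 = good.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, round(len(ranked) * depth))
    flagged = [label for _, label in ranked[:cutoff]]
    frauds_caught = sum(flagged)
    goods_flagged = len(flagged) - frauds_caught
    total_frauds = sum(labels)
    fdr = frauds_caught / total_frauds if total_frauds else 0.0
    gf = goods_flagged / frauds_caught if frauds_caught else float("inf")
    return fdr, gf


# Toy data (made up): 20 records, 4 of them frauds.
scores = [0.99, 0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.50, 0.45,
          0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.08, 0.05, 0.01]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

fdr, gf = fdr_and_gf(scores, labels, depth=0.20)
print(f"FDR at 20% depth: {fdr:.0%}, G:F = {gf:.0f}:1")
# prints: FDR at 20% depth: 50%, G:F = 1:1
```

Under these definitions, a deeper cut-off catches more frauds but flags more goods per fraud, which matches the pattern of rising FDR and rising G:F in the reported figures.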


In [3]:
response=rag_chain.invoke("Tell me about feature selection and variable selection in the new and old versions.")
print(response)

Here are the important differences in feature and variable selection between Address Risk Model v3 and v4:

1. **Feature Engineering and Number/Type of Features**
   - **Version 3 (v3)**: Over 5,000+ combination variables were evaluated and 150+ selected top predictors were used in models.
     - **Excerpt**: "Over 5,000+ combination variables were evaluated before using 150+ selected top predictors in our models."
   - **Version 4 (v4)**: A total of 800+ predictors were tested when building the model, with features derived from first name, surname, complete physical address, IP address, and others.
     - **Excerpt**: "Socure generated and tested 800+ predictors when building the current model in addition to existing features derived from the first name, surname, complete physical address, IP address."

2. **Variable Selection Method**
   - **Version 3 (v3)**: Used a Bayesian-based optimizer to identify the optimal number of variables to max accuracy and minimize time variance.
     -

In [7]:
response=rag_chain.invoke("From what period are the training, testing, and validation data? Be specific and give the exact start and end dates.")
print(response)

Here are the differences concerning training, testing, and validation data ranges between version 3 and version 4 of the model report:

1. **Training Data Ranges for v3:**
   - The training dataset used data from **October 2017 through April 2021** as noted: 
     - "Socure used data from October 2017 through November 2021."
     - "The development data consists of all applications and transactions for the period of October 2017 to April 2021 with tagged frauds."

2. **Training Data Ranges for v4:**
   - The training dataset used data from **January 2021 through December 2023** as noted:
     - "Socure used data from January 2021 through December 2023."
     - "The dataset used to train and test and tune the model from January 2021 to December 2023."

3. **Validation Data Ranges for v3:**
   - Validation was conducted on data from **February 2020 to November 2021**:
     - "We then validated the models on transactions and tagged frauds from February 2020 to November 2021."

4. **Valida

In [11]:
response=rag_chain.invoke("Tell me the performance of the model for V4. Also provide the FDR for the riskiest 50% of the population, or other FDRs for the training, testing, and validation data.")
print(response)

The provided excerpts include detailed model performance for Address Risk Version 4.0:

- **Model Performance (V4) - AUC-ROC:**
  - Development Set: 92.18%
  - Hold-Out Set: 91.56%
  - Out-of-Time Set: 90.24%

- **FDR (Fraud Detection Rate) / Good-to-Fraud Ratio (G:F) for V4:**

  - **Development Set:**
    - 0.5% Risk Depth: 25.33% / 1:1
    - 1% Risk Depth: 36.89% / 2:1
    - 3% Risk Depth: 56.14% / 4:1
    - 5% Risk Depth: 65.69% / 7:1
    - 10% Risk Depth: 78.32% / 12:1

  - **Hold-Out Set:**
    - 0.5% Risk Depth: 21.94% / 1:1
    - 1% Risk Depth: 32.36% / 2:1
    - 3% Risk Depth: 51.88% / 5:1
    - 5% Risk Depth: 61.09% / 7:1
    - 10% Risk Depth: 74.91% / 13:1

  - **Out-of-Time Set:**
    - 0.5% Risk Depth: 16.61% / 2:1
    - 1% Risk Depth: 24.60% / 3:1
    - 3% Risk Depth: 41.23% / 6:1
    - 5% Risk Depth: 50.72% / 9:1
    - 10% Risk Depth: 66.50% / 14:1

As for the FDR for 50% of the riskiest population, the document does not provide this specific metric. You might need addit