# Bank Statement Analysis with Gen AI

## Table of Contents
1. Introduction
2. Examples
3. References and Further Reading

<a id='2'></a>

## 1. Introduction

Generative AI can support data engineering work in several ways, particularly when building efficient and accurate data pipelines:

1. Code Generation: Automatically generating data engineering scripts based on high-level descriptions of tasks.
2. Optimization: Improving existing data engineering code based on performance feedback and best practices.
3. Schema Understanding: Interpreting data schemas to inform code generation and optimization.
4. Error Detection and Correction: Identifying and fixing errors in data engineering code through automated analysis.
5. Code Translation: Converting code between different programming languages and frameworks used in data engineering.
6. Complex Workflow Creation: Generating complex data workflows and pipelines based on user requirements.
7. Result Interpretation: Translating data processing results into human-readable reports and summaries.
8. Data Quality Checks: Generating code for validating data quality and consistency in pipelines.
9. Documentation Generation: Creating detailed documentation for data engineering code and workflows automatically.

Using Gen AI for this task offers several benefits:

- Increased productivity and efficiency for data engineers
- Faster development and deployment of data pipelines
- Reduced errors in code
- Improved maintainability and readability of code

In [1]:
!pip install openai
!pip install pandas
!pip install scikit-learn
!pip install matplotlib
!pip install pdf2image
!pip install poppler-utils
!pip install PyMuPDF
!pip install pytesseract Pillow

In [1]:
from openai import OpenAI 
import os 
import requests 
from pdf2image import convert_from_bytes 
from PIL import Image 
import base64 
import json
import pytesseract
from io import BytesIO 
import pandas as pd

def clean(dict_variable):
    return next(iter(dict_variable.values()))
    
# Read the API key from the environment instead of hard-coding a secret in the notebook
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


<a id='3'></a>
## 2. Fetch from PDF


In [2]:
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # For Windows

# Extract text from a page image with Tesseract OCR
def extract_text_from_image(image):
    try:
        return pytesseract.image_to_string(image)
    except Exception as e:
        # Return an empty string so the error message is not mixed into the statement text
        print(f"Error extracting text: {e}")
        return ""
# URL of the image
# image_url = "https://crm.autocarloan.co.uk/images/66e192f2f12a94a822acbbc1"
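
The hard-coded `tesseract_cmd` path above only works on a machine with that exact Windows install location. A minimal sketch of a more portable lookup follows; the `TESSERACT_CMD` environment variable and the `resolve_tesseract_cmd` helper are names chosen for this illustration, not part of pytesseract.

```python
import os
import shutil

def resolve_tesseract_cmd(env=None, default="tesseract"):
    """Return a path to the Tesseract binary without hard-coding one machine's layout."""
    env = os.environ if env is None else env
    # 1. An explicit override wins (TESSERACT_CMD is a hypothetical variable name).
    override = env.get("TESSERACT_CMD")
    if override:
        return override
    # 2. Otherwise look the executable up on PATH.
    found = shutil.which("tesseract")
    # 3. Fall back to the bare command name and let the OS resolve it at call time.
    return found or default

tesseract_cmd = resolve_tesseract_cmd()
```

The result could then be assigned with `pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd()`, so the same notebook runs on Windows, macOS, and Linux as long as Tesseract is installed.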


In [62]:
from pdf2image import convert_from_bytes
import io
import os

# Function to read PDF file as bytes
def read_pdf_as_bytes(pdf_path):
    try:
        with open(pdf_path, 'rb') as file:
            pdf_bytes = file.read()
        return pdf_bytes
    except Exception as e:
        print(f"Error reading PDF: {e}")
        return None
        
# Function to convert PDF bytes to images
def convert_pdf_bytes_to_images(pdf_bytes, poppler_path=None):
    try:
        # Convert PDF bytes to images using pdf2image
        images = convert_from_bytes(pdf_bytes, poppler_path=poppler_path)
        return images
    except Exception as e:
        print(f"Error converting PDF bytes to images: {e}")
        return None

# Main function
def main(pdf_path):
    poppler_path = r'C:\Program Files\poppler-24.07.0\Library\bin'  # Adjust this to the path where poppler is installed (Windows only)
    
    # Step 1: Read PDF file as bytes
    pdf_bytes = read_pdf_as_bytes(pdf_path)

    if pdf_bytes is None:
        print("Failed to read the PDF file.")
        return

    # Step 2: Convert PDF bytes to images
    images = convert_pdf_bytes_to_images(pdf_bytes, poppler_path=poppler_path)

    if images is None:
        print("Failed to convert PDF to images.")
        return

    # Step 3: Extract text from each image and save images (optional)
    all_extracted_text = ""  # Initialize variable to store all extracted text
    for i, image in enumerate(images):
        image_path = f'output_image_page_{i + 1}.png'
        image.save(image_path, 'PNG')
        # print(f"Saved {image_path}")

        # Extract text from the image
        extracted_text = extract_text_from_image(image)
        all_extracted_text += extracted_text + "\n\n"  # Accumulate extracted text with spacing between pages

    # Step 4: Print or store all extracted text
    # print("Extracted Text from PDF:")
    # print(all_extracted_text)

    # Optionally, you can save the extracted text to a file
    with open(f'{os.path.splitext(pdf_path)[0]}_extracted_text.txt', 'w', encoding='utf-8') as text_file:
        text_file.write(all_extracted_text)
    return all_extracted_text


# Example usage
if __name__ == "__main__":
    pdf_path = 'Bank of Scotland 2.pdf'  # Replace with your local PDF file path
    wholetext = main(pdf_path)
    # print(wholetext)
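
The extracted text is later sent to the model in fixed-size character chunks. A fixed offset can cut a transaction line in half, so part of a row may land in one request and the rest in the next. One possible alternative, sketched here under the assumption that each statement row sits on its own OCR line, is to split only at line breaks. `chunk_by_lines` is a hypothetical helper, not part of the notebook.

```python
def chunk_by_lines(text, max_chars=5000):
    """Split text into chunks, breaking only at newlines.

    Chunks stay within max_chars unless a single line is longer than that,
    in which case the line is kept whole in a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the limit.
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Joining the chunks reproduces the input exactly, so no statement text is lost or duplicated by the split.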


In [63]:
# Function to request the model and return JSON output
def extract_json_from_model(client, model, wholetext_chunk, is_continuation=False):
    system_message = {"role": "system", "content": "You are a helpful assistant that responds in JSON format."}
    
    # If it's a continuation, modify the user prompt to ask for the next part
    if is_continuation:
        user_message = {"role": "user", "content": "Please continue from where the previous response left off and provide the next JSON chunk."}
    else:
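        # Ask for the 'transactions' key explicitly, because extract_items() reads result["transactions"]
        user_message = {"role": "user", "content": (
            "Extract the statement's line items (Date, Description, Type, Money_In, Money_Out, and Balance) "
            "and return them as a JSON object with a single key 'transactions' holding a list of objects "
            f"with those fields. Text: {wholetext_chunk}"
        )}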

    # Send request to GPT-4o-mini model
    response = client.chat.completions.create(
        model=model,
        messages=[system_message, user_message],
        temperature=0.0
    )
    # print(response.choices[0].message.content)
    return response.choices[0].message.content

# Function to clean and parse the JSON response
def parse_json_response(json_str):
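    try:
        # str.lstrip/rstrip remove any of the given characters, not a prefix or suffix,
        # so strip the Markdown fence explicitly (requires Python 3.9+)
        json_str = json_str.strip()
        json_str = json_str.removeprefix('```json').removeprefix('```').removesuffix('```')
        parsed_json = json.loads(json_str)
        # print("Parsed JSON:", parsed_json)  # Debug print
        return parsed_json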
    except json.JSONDecodeError as e:
        print(f"JSON decoding failed: {e}")
        return None

# Function to extract JSON data repeatedly until all chunks are processed
def extract_large_json(client, model, wholetext, chunk_size=5000):
    all_results = []
    text_length = len(wholetext)
    current_position = 0

    # Loop through chunks of the input text
    while current_position < text_length:
        chunk_end = min(current_position + chunk_size, text_length)
        text_chunk = wholetext[current_position:chunk_end]
        
        # Extract JSON from the current chunk
        response_chunk = extract_json_from_model(client, model, text_chunk)
        
        # Parse the JSON from the response; chunks without valid JSON are skipped
        result = parse_json_response(response_chunk)
        if result is not None:
            all_results.append(result)

        # Always move on to the next chunk. Stopping as soon as a reply ends with "}"
        # would skip the rest of the statement after the first valid response.
        current_position = chunk_end

    return all_results

def extract_items(results):
    if not all(isinstance(item, dict) for item in results):
        print("Error: Results should be a list of dictionaries.")
        return pd.DataFrame()  # Return an empty DataFrame
    # Initialize lists to store extracted data
    dates = []
    descriptions = []
    types = []
    money_in = []
    money_out = []
    balances = []

    # Extract items from each result
    for result in results:
        transactions = result.get("transactions", [])
        for transaction in transactions:
            dates.append(transaction.get("Date", ""))
            descriptions.append(transaction.get("Description", ""))
            types.append(transaction.get("Type", ""))
            money_in.append(transaction.get("Money_In", 0.0))
            money_out.append(transaction.get("Money_Out", 0.0))
            balances.append(transaction.get("Balance", 0.0))


    # Create a DataFrame for better visualization (optional)
    df = pd.DataFrame({
        "Date": dates,
        "Description": descriptions,
        "Type": types,
        "Money_In": money_in,
        "Money_Out": money_out,
        "Balance": balances
    })

    return df
# Function to consolidate and clean results (if necessary)
def clean_consolidate_results(results):
    cleaned_results = []
    
    for result in results:
        if isinstance(result, str):
            try:
                parsed_result = json.loads(result)
                if isinstance(parsed_result, list):
                    cleaned_results.extend(parsed_result)
                elif isinstance(parsed_result, dict):
                    cleaned_results.append(parsed_result)
                else:
                    print("Unexpected JSON structure:", parsed_result)
            except json.JSONDecodeError as e:
                print(f"JSON decoding failed: {e}")
        elif isinstance(result, dict):
            cleaned_results.append(result)
        else:
            print("Unexpected result type:", type(result))
    
    return cleaned_results

# Main function
def main(client, model, pdf_text):
    # Step 1: Extract large JSON by processing the text in chunks
    extracted_results = extract_large_json(client, model, pdf_text, chunk_size=5000)
    
    if extracted_results:
        # Step 2: Clean and consolidate the extracted results
        consolidated_results = clean_consolidate_results(extracted_results)

        # Step 3: Extract items from the consolidated results
        finalresp = extract_items(consolidated_results)
        return finalresp
    else:
        print("No valid JSON data extracted.")

# Example usage
if __name__ == "__main__":
    # Assuming `client` is the OpenAI client and `pdf_text` contains the extracted text from the PDF
    model = "gpt-4o-mini"
    # print(wholetext)
    dfall = main(client, model, wholetext)
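
Model replies often arrive wrapped in a Markdown code fence, with or without a language tag, and the parsing step has to tolerate both forms. The sketch below shows one way to unwrap and decode such a reply. It assumes the payload is a JSON object or array; `strip_code_fence` and `parse_model_json` are hypothetical names used only for this illustration.

```python
import json
import re

# Matches an optional language tag after the opening fence, then the payload.
_FENCE = re.compile(r"^```[a-zA-Z]*\s*(.*?)\s*```$", re.DOTALL)

def strip_code_fence(reply):
    """Return the reply without a surrounding Markdown code fence, if it has one."""
    text = reply.strip()
    match = _FENCE.match(text)
    return match.group(1) if match else text

def parse_model_json(reply):
    """Return the decoded JSON value, or None when the reply is not valid JSON."""
    try:
        return json.loads(strip_code_fence(reply))
    except json.JSONDecodeError:
        return None

print(parse_model_json('```json\n{"transactions": []}\n```'))  # {'transactions': []}
print(parse_model_json("not json"))                            # None
```

Returning `None` rather than raising keeps the chunk loop going when a single reply is malformed, which matches how `parse_json_response` is used above.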


In [64]:
dfall['Date'] = pd.to_datetime(dfall['Date'],format='%d %b %y')

print(dfall)
dfall.to_csv('output.csv', index=False)

# Extract the year and month for grouping
dfall['YearMonth'] = dfall['Date'].dt.to_period('M')

# Calculate disposable amount (Money_In - Money_Out) for each month
monthly_totals = dfall.groupby('YearMonth')[['Money_In', 'Money_Out']].sum()
monthly_disposable = monthly_totals['Money_In'] - monthly_totals['Money_Out']

# Convert the result to a DataFrame
monthly_disposable_df = monthly_disposable.reset_index(name='DisposableAmount')

print(monthly_disposable_df)

          Date         Description Type  Money_In  Money_Out  Balance
0   2021-07-01   TESCO PAY AT PUMP  DEB      0.00      40.00  -311.39
1   2021-07-01         LIDL GB AYR  DEB      0.00       3.65  -299.54
2   2021-07-01          GREGGS PLC  DEB      0.00       5.90  -320.94
3   2021-07-01   PPOINT_*WOODFIELD  DEB      0.00       7.36  -328.30
4   2021-07-01      TESCO PFS 4143  DEB      0.00      11.45  -339.75
..         ...                 ...  ...       ...        ...      ...
172 2021-07-30           J THOMSON  TFR     90.00       0.00  -318.97
173 2021-07-30           J THOMSON  TFR   1078.58       0.00  -318.97
174 2021-07-30  PAYPAL *APPLE.COM/  DEB      7.99       0.00  -318.97
175 2021-07-30           J THOMSON  TFR     78.00       0.00  -318.97
176 2021-07-30        DAILY OD INT  CHG      0.33       0.00  -318.97

[177 rows x 6 columns]
  YearMonth  DisposableAmount
0   2021-07           9067.57


In [65]:
prompt = f"""Arrive at a lending decision with status Approved, Declined, or Deferred to Underwriter by reviewing the data below:
{monthly_disposable_df}
Approve only if the disposable amount is more than 1000 for the last 3 months. Decline if it is less than 1000 for all the months. Deferred to Underwriter if the disposable amount is more than 1000 for 1 or more months.
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

result = response.choices[0].message.content
print(result)

To arrive at a lending decision based on the provided criteria, we need to analyze the `DisposableAmount` for the given months.

From the data provided:

- July 2021: 9067.57

Since we do not have data for the past three months (June 2021 and May 2021 are missing):

1. **Checking for last three months:**
   - We can only see the amount for July 2021, which is 9067.57. Since we do not have values for the previous two months, we cannot determine if the disposable amount was above or below 1000 for three consecutive months.

2. **Reviewing the criteria:**
   - Approve only if the disposable amount is more than 1000 for the last three months: **Cannot determine, as data is incomplete.**
   - Decline if it's less than 1000 for all the months: **Not applicable here, as we only have one month’s data.**
   - Deferred to Underwriter if disposable amount is more than 1000 for 1 or more months: **Here it applies**, as 9067.57 is greater than 1000.

Therefore, based on the available information, t