In [1]:
!pip install -q -U google-generativeai

In [6]:
import google.generativeai as genai
import re
from typing import List, Dict

API_KEY = ''  # Set your Gemini API key here (left blank in this notebook)
genai.configure(api_key=API_KEY)

In [None]:
model = genai.GenerativeModel("gemini-1.5-flash")

### Test with an AWS COBOL program file


In [5]:
def read_cobol_file(file_path: str) -> str:
    """
    Reads a COBOL file and preprocesses it.
    Args:
        file_path (str): Path to the COBOL file.
    Returns:
        str: The content of the COBOL file after preprocessing.
    """
    try:
        with open(file_path, "r") as file:
            # Read the file content
            content = file.readlines()

        # Preprocess the content
        preprocessed_content = []
        for line in content:
            # Skip full-line comments ('*' in column 7, the indicator area of fixed-format COBOL)
            if len(line) > 6 and line[6] == '*':
                continue

            # Strip trailing whitespace, including the line ending
            line = line.rstrip()

            # Skip empty lines
            if not line.strip():
                continue

            preprocessed_content.append(line)

        return "\n".join(preprocessed_content)

    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
        return ""
    except Exception as e:
        print(f"An error occurred: {e}")
        return ""

def split_cobol_by_sections(file_content: str) -> Dict[str, str]:
    """
    Splits COBOL code into sections based on divisions and paragraph headers.
    Args:
        file_content (str): The content of the COBOL file as a string.
    Returns:
        Dict[str, str]: A dictionary where keys are headers or paragraphs, and values are their corresponding content.
    """

    division_pattern = r"^\s*(IDENTIFICATION|ENVIRONMENT|DATA|PROCEDURE)\s+DIVISION.*"

    paragraph_pattern = r"^\s*[a-zA-Z0-9-]+\.($|\s)"

    combined_pattern = re.compile(f"({division_pattern})|({paragraph_pattern})", re.IGNORECASE | re.MULTILINE)

    matches = list(combined_pattern.finditer(file_content))

    chunks = {}

    # Extract sections based on detected headers
    for i, match in enumerate(matches):
        header = match.group(0).strip()
        start_pos = match.start()
        end_pos = matches[i + 1].start() if i + 1 < len(matches) else len(file_content)
        section_content = file_content[start_pos:end_pos].strip()
        chunks[header] = section_content

    return chunks
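
The comment filter in `read_cobol_file` relies on the fixed-format layout: column 7 (index 6 in Python) is the indicator area, and a `*` there marks the whole line as a comment. Below is a minimal standalone sketch of that rule on an in-memory sample. The name `strip_comment_lines` and the extra `/` check (a comment line that also requests a page eject) are additions for illustration and are not part of the function above.

```python
from typing import List


def strip_comment_lines(lines: List[str]) -> List[str]:
    """Drop fixed-format comment lines and blank lines; keep everything else."""
    kept = []
    for raw in lines:
        line = raw.rstrip()
        if not line.strip():
            continue  # blank line
        # Index 6 is column 7, the indicator area. '*' marks a comment line.
        # The '/' case is an assumption added here; read_cobol_file ignores it.
        if len(line) > 6 and line[6] in ("*", "/"):
            continue
        kept.append(line)
    return kept


sample = [
    "       IDENTIFICATION DIVISION.\n",
    "      * This whole line is a comment\n",
    "\n",
    "       PROGRAM-ID. DEMO.\n",
    "      / Page-eject comment\n",
]
for line in strip_comment_lines(sample):
    print(line)
```

Only the two non-comment lines survive, with their leading spaces intact, so column positions remain usable by later steps.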

The names of all extracted chunks are listed below.

In [8]:
file_path = r'/COTRN02C - TEST.cbl'
cobol_source = read_cobol_file(file_path)
chunks = split_cobol_by_sections(cobol_source)
# Print only the chunk names (the dictionary keys)
for header in chunks:
    print(header)


IDENTIFICATION DIVISION.
PROGRAM-ID.
AUTHOR.
ENVIRONMENT DIVISION.
DATA DIVISION.
PROCEDURE DIVISION.
MAIN-PARA.
END-EXEC.
PROCESS-ENTER-KEY.
END-EVALUATE.
VALIDATE-INPUT-KEY-FIELDS.
VALIDATE-INPUT-DATA-FIELDS.
END-IF.
ADD-TRANSACTION.
COPY-LAST-TRAN-DATA.
RETURN-TO-PREV-SCREEN.
SEND-TRNADD-SCREEN.
RECEIVE-TRNADD-SCREEN.
POPULATE-HEADER-INFO.
READ-CXACAIX-FILE.
READ-CCXREF-FILE.
STARTBR-TRANSACT-FILE.
READPREV-TRANSACT-FILE.
ENDBR-TRANSACT-FILE.
WRITE-TRANSACT-FILE.
CLEAR-CURRENT-SCREEN.
INITIALIZE-ALL-FIELDS.
WS-MESSAGE.
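
The list above contains `END-EXEC.`, `END-EVALUATE.` and `END-IF.`, which are scope terminators rather than paragraph names, and `WS-MESSAGE.`, which appears to be a data name ending a statement. They match because the paragraph pattern accepts any single word followed by a period. Since `chunks` is a dictionary keyed by header, repeated headers such as several `END-IF.` lines would also overwrite one another. One possible refinement, sketched under stated assumptions: in fixed-format COBOL, paragraph and division headers begin in Area A (columns 8 to 11), so a header candidate can be rejected when it is indented further or starts with `END-`. The names `is_header` and `split_by_headers` are hypothetical, and the sketch assumes the source is indented with spaces, not tabs. Whether the terminators in the actual file sit in Area B cannot be confirmed from the output shown here.

```python
import re
from typing import List, Tuple

# A candidate header: optional indentation, one word (optionally followed by
# "DIVISION ..."), then a period followed by whitespace or end of line.
HEADER_RE = re.compile(
    r"^(?P<indent>\s*)(?P<name>[A-Za-z0-9-]+)(?:\s+DIVISION\b[^.]*)?\.(?:\s|$)",
    re.IGNORECASE,
)


def is_header(line: str) -> bool:
    """True if the line looks like a division or paragraph header in Area A."""
    m = HEADER_RE.match(line)
    if m is None:
        return False
    # Area A is columns 8-11, i.e. 7 to 10 leading characters (assumed spaces).
    starts_in_area_a = 7 <= len(m.group("indent")) <= 10
    is_terminator = m.group("name").upper().startswith("END-")
    return starts_in_area_a and not is_terminator


def split_by_headers(source: str) -> List[Tuple[str, str]]:
    """Return (header, chunk) pairs in order; duplicates are kept, not merged."""
    chunks: List[Tuple[str, List[str]]] = []
    for line in source.splitlines():
        if is_header(line):
            chunks.append((line.strip(), [line]))
        elif chunks:
            chunks[-1][1].append(line)  # lines before the first header are dropped
    return [(header, "\n".join(body).strip()) for header, body in chunks]


sample = "\n".join([
    "       PROCEDURE DIVISION.",
    "       MAIN-PARA.",
    "           IF WS-FLAG = 'Y'",
    "               PERFORM CHECK-PARA",
    "           END-IF.",
    "       CHECK-PARA.",
    "           MOVE 'done' TO",
    "               WS-MESSAGE.",
])
headers = [header for header, _ in split_by_headers(sample)]
print(headers)
```

Returning a list of pairs instead of a dictionary is a deliberate choice in this sketch: it preserves order and avoids silently losing chunks that share a header.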


Let's take a closer look at the `VALIDATE-INPUT-KEY-FIELDS` chunk:

In [9]:
print(chunks['VALIDATE-INPUT-KEY-FIELDS.'])

VALIDATE-INPUT-KEY-FIELDS.
           EVALUATE TRUE
               WHEN ACTIDINI OF COTRN2AI NOT = SPACES AND LOW-VALUES
                   IF ACTIDINI OF COTRN2AI IS NOT NUMERIC
                       MOVE 'Y'     TO WS-ERR-FLG
                       MOVE 'Account ID must be Numeric...' TO
                                       WS-MESSAGE
                       MOVE -1       TO ACTIDINL OF COTRN2AI
                       PERFORM SEND-TRNADD-SCREEN
                   END-IF
                   COMPUTE WS-ACCT-ID-N = FUNCTION NUMVAL(ACTIDINI OF
                   COTRN2AI)
                   MOVE WS-ACCT-ID-N            TO XREF-ACCT-ID
                                                ACTIDINI OF COTRN2AI
                   PERFORM READ-CXACAIX-FILE
                   MOVE XREF-CARD-NUM         TO CARDNINI OF COTRN2AI
               WHEN CARDNINI OF COTRN2AI NOT = SPACES AND LOW-VALUES
                   IF CARDNINI OF COTRN2AI IS NOT NUMERIC
                       MOVE 'Y'     TO WS-ERR-F

Let's ask Gemini to generate a high-level description.

In [26]:
test_chunk = chunks['VALIDATE-INPUT-KEY-FIELDS.']
prompt = '''You are an expert in COBOL programming and software documentation.
I am providing a piece of a COBOL program, and I need a concise, functional, high-level description of its purpose and functionality.

Here is the code:
'''
response_general = model.generate_content(prompt + test_chunk)
print(response_general.text)

This COBOL code segment `VALIDATE-INPUT-KEY-FIELDS` validates user-supplied account ID (ACTIDINI) and card number (CARDNINI) fields from a record (`COTRN2AI`).

It first checks if either field contains data.  If only one field is populated, it performs validation on that field:

* **Numeric Validation:** It verifies that the input is numeric. If not, it sets an error flag (`WS-ERR-FLG`), creates an error message (`WS-MESSAGE`), sets an error indicator in the input record, and displays the error via `SEND-TRNADD-SCREEN`.
* **Data Conversion and Lookup:** If numeric, it converts the input to a numeric value using `NUMVAL`, updates a corresponding cross-reference field (XREF-ACCT-ID or XREF-CARD-NUM), and performs a file lookup (`READ-CXACAIX-FILE` or `READ-CCXREF-FILE`) to potentially update the other key field based on the validated input.

If neither field contains data, it sets an error indicating that at least one field must be populated.  In all error cases, it displays the error us
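
The same prompt can be applied to every extracted chunk, not just one. Below is a minimal sketch of such a loop. The helper `describe_chunks` and the stub `fake_generate` are hypothetical names introduced here. The text generator is passed in as a plain callable so the loop can be tried without an API key or network access.

```python
from typing import Callable, Dict


def describe_chunks(
    chunks: Dict[str, str],
    generate: Callable[[str], str],
    prompt: str,
) -> Dict[str, str]:
    """Ask `generate` for a description of each non-empty chunk."""
    descriptions = {}
    for header, code in chunks.items():
        if not code.strip():
            continue  # nothing to describe
        descriptions[header] = generate(prompt + code)
    return descriptions


# Stand-in for the model call, so the example runs offline.
def fake_generate(full_prompt: str) -> str:
    return f"(received a prompt of {len(full_prompt)} characters)"


demo_chunks = {
    "MAIN-PARA.": "MAIN-PARA.\n           PERFORM INIT-PARA.",
    "EMPTY-PARA.": "   ",
}
result = describe_chunks(demo_chunks, fake_generate, "Describe this COBOL code:\n")
for header, text in result.items():
    print(header, "->", text)
```

In the notebook, the stub could be replaced with `lambda p: model.generate_content(p).text`, reusing the `model`, `chunks` and `prompt` objects defined in the cells above.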

Next, let's ask for a more detailed, step-by-step description of the logic.

In [29]:
test_chunk = chunks['VALIDATE-INPUT-KEY-FIELDS.']
prompt = '''You are an expert in COBOL programming and code analysis. I am providing a piece of a COBOL program, and I need a detailed, logical, step-by-step explanation of its workflow and logic.

Focus on:
1. The sequence of operations in the code. Do not quote the code that you describe.
2. A concise description of control flow (e.g., conditional statements, loops, and procedure calls).
3. How data is processed, validated, and transformed at each step.
4. Specific mentions of key paragraphs or sections and their roles.

Be very concise.
Here is the code:
'''
response_logical = model.generate_content(prompt + test_chunk)
print(response_logical.text)

The `VALIDATE-INPUT-KEY-FIELDS` paragraph validates input fields `ACTIDINI` (Account ID) and `CARDNINI` (Card Number) from the `COTRN2AI` record.

1. **Evaluation:** It uses an `EVALUATE` statement to check conditions sequentially.

2. **Account ID Validation:** If `ACTIDINI` is not blank or low-values:
    - It checks if `ACTIDINI` is numeric. If not, it sets an error flag (`WS-ERR-FLG`), an error message (`WS-MESSAGE`), sets `ACTIDINL` to -1, and calls `SEND-TRNADD-SCREEN` to display the error.
    - If numeric, it converts `ACTIDINI` to a numeric value (`WS-ACCT-ID-N` using `NUMVAL`), moves it to `XREF-ACCT-ID` and `ACTIDINI` itself, and calls `READ-CXACAIX-FILE` (presumably to look up account information).  The result is then used to populate `CARDNINI` from `XREF-CARD-NUM`.


3. **Card Number Validation:** If `CARDNINI` is not blank or low-values (and the previous WHEN clause was false):
    - It checks if `CARDNINI` is numeric. If not, similar error handling as above occurs, sett