## Overview

Codey models are text-to-code models from Google AI, trained on a large code-related dataset. They can generate code-related responses for tasks such as writing functions and unit tests, debugging, and autocompletion. This notebook shows how to use the Code-bison and Codechat-bison models through the Vertex AI Python SDK to explain and translate code.

The notebook is structured as follows:

1. We will show how to use Code-bison and Codechat-bison to explain Cobol code.
2. We will show how to use Code-bison to translate code from R to Python.

Refer to the documentation for the latest Codey model overview: https://cloud.google.com/vertex-ai/docs/generative-ai/code/code-models-overview


* Author: [leip@](https://moma.corp.google.com/person/leip)
* Date: 09/22/23

## Prep Work

In [None]:
# @title Install Libraries
import sys

if 'google.colab' in sys.modules:
    ! pip install google-cloud-aiplatform
    ! pip install google-cloud-discoveryengine
    from google.colab import auth as google_auth
    google_auth.authenticate_user()

In [None]:
# @title Initialize Vertex AI
import vertexai
from vertexai.language_models import CodeGenerationModel

VERTEX_API_PROJECT = 'certain-haiku-391918'
VERTEX_API_LOCATION = 'us-central1'

vertexai.init(project=VERTEX_API_PROJECT, location=VERTEX_API_LOCATION)
code_generation_model = CodeGenerationModel.from_pretrained("code-bison@001")
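
The project ID above is hard-coded, so anyone reusing the notebook has to edit the cell. As a minimal sketch, the values could instead be read from the environment. The environment variable names below are hypothetical (they simply mirror the notebook's variable names), and `resolve_vertex_config` is an illustrative helper, not part of the Vertex AI SDK.

```python
import os


def resolve_vertex_config(default_project=None, default_location="us-central1"):
    """Return (project, location), preferring environment variables when set.

    VERTEX_API_PROJECT and VERTEX_API_LOCATION are assumed names chosen for
    this sketch; they are not defined by the SDK.
    """
    project = os.environ.get("VERTEX_API_PROJECT", default_project)
    location = os.environ.get("VERTEX_API_LOCATION", default_location)
    if not project:
        raise ValueError(
            "No project ID found; set VERTEX_API_PROJECT or pass default_project."
        )
    return project, location
```

The returned pair can then be passed to `vertexai.init(project=..., location=...)` as in the cell above.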

## Cobol Code Explanation with Code-bison

#### Quote from a customer - "Code-bison did a good job at explaining Cobol."



In [None]:
# @title Set Up Cobol Explanation Prompt

prefix = """You're an expert Cobol programmer and great at explaining Cobol code to a non-technical audience. Please explain the code below to a non-technical C-level executive:

IDENTIFICATION DIVISION.
           PROGRAM-ID. SORT-PROGRAM.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
       SELECT IN-FILE ASSIGN TO
       "/Users/leip/UNSORTED.txt"
       ORGANISATION IS LINE SEQUENTIAL.
       SELECT OUT-FILE ASSIGN TO
       "/Users/leip/SORTED.txt"
       ORGANISATION IS LINE SEQUENTIAL.
       SELECT WORK-FILE ASSIGN TO DISK.
       DATA DIVISION.
       FILE SECTION.
       FD IN-FILE
       RECORD CONTAINS 7 CHARACTERS.
       01 IN-REC.
           02 ITEM-NO-IN PIC XXX.
           02 ITEM-QTY-IN PIC 9999.
       FD OUT-FILE
       RECORD CONTAINS 132 CHARACTERS.
       01 OUT-REC PIC X(132).
       SD WORK-FILE.
       01 WORK-REC.
           02 ITEM-NO-WORK PIC XXX.
           02 ITEM-QTY-WORK PIC 9999.
       WORKING-STORAGE SECTION.
      * 01 ARE-THERE-MORE-RECORDS PIC XXX VALUE "YES".
       PROCEDURE DIVISION.
       000-MAIN-PROCEDURE.
           SORT WORK-FILE
               ON ASCENDING KEY ITEM-NO-IN
                            USING IN-FILE
                            GIVING OUT-FILE
            STOP RUN.

"""
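
The same Cobol program is pasted into more than one prompt in this notebook, with only the instruction changing. A small helper can keep the instruction and the source code separate so that either can be swapped out. This is a sketch only: `build_explain_prompt` and its parameters are hypothetical names introduced here, and the wording of the instruction simply follows the prompt above.

```python
def build_explain_prompt(code, audience="a non-technical C-level executive",
                         language="Cobol"):
    """Combine a fixed instruction with the source code to be explained."""
    instruction = (
        f"You're an expert {language} programmer and great at explaining "
        f"{language} code to a non-technical audience. "
        f"Please explain the code below to {audience}:"
    )
    # Strip only surrounding blank lines so column-sensitive indentation survives.
    return f"{instruction}\n\n{code.strip(chr(10))}\n"


# Example with a two-line fragment standing in for the full program.
example_prompt = build_explain_prompt(
    "IDENTIFICATION DIVISION.\n           PROGRAM-ID. SORT-PROGRAM."
)
```

The resulting string can be passed as `prefix` to `code_generation_model.predict` in the same way as the literal prompt.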

In [None]:
# @title Invoke Code-bison model with Parameters and Prompt
parameters = {
    "temperature": 0.2,
    "max_output_tokens": 512
}

response = code_generation_model.predict(
        prefix=prefix, **parameters
)

print(type(response))

In [None]:
print(response.text)

This program sorts the data in the file "/Users/gaetanodorsi/Desktop/COBOL/lesson5/UNSORTED.txt" and writes the sorted data to the file "/Users/gaetanodorsi/Desktop/COBOL/lesson5/SORTED.txt".

The program first opens the two files and creates a work file. The work file is used to store the data from the input file in a temporary location.

The program then sorts the data in the work file by the item number. The item number is a three-digit number that is stored in the first two fields of the record.

Once the data is sorted, the program writes the sorted data to the output file.

The program ends by closing the two files.


## Follow-up Code Explanation Questions with Codechat-bison

In [None]:
# @title Initialize CodeChat-bison

from vertexai.language_models import CodeChatModel

code_chat_model = CodeChatModel.from_pretrained("codechat-bison@001")
chat = code_chat_model.start_chat()

In [None]:
# @title Set Up Cobol Explanation Prompt
message_1 = """You're an expert Cobol programmer and great at explaining Cobol code to a non-technical audience. Please explain the code below line by line:

IDENTIFICATION DIVISION.
           PROGRAM-ID. SORT-PROGRAM.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
       SELECT IN-FILE ASSIGN TO
       "/Users/leip/UNSORTED.txt"
       ORGANISATION IS LINE SEQUENTIAL.
       SELECT OUT-FILE ASSIGN TO
       "/Users/leip/SORTED.txt"
       ORGANISATION IS LINE SEQUENTIAL.
       SELECT WORK-FILE ASSIGN TO DISK.
       DATA DIVISION.
       FILE SECTION.
       FD IN-FILE
       RECORD CONTAINS 7 CHARACTERS.
       01 IN-REC.
           02 ITEM-NO-IN PIC XXX.
           02 ITEM-QTY-IN PIC 9999.
       FD OUT-FILE
       RECORD CONTAINS 132 CHARACTERS.
       01 OUT-REC PIC X(132).
       SD WORK-FILE.
       01 WORK-REC.
           02 ITEM-NO-WORK PIC XXX.
           02 ITEM-QTY-WORK PIC 9999.
       WORKING-STORAGE SECTION.
      * 01 ARE-THERE-MORE-RECORDS PIC XXX VALUE "YES".
       PROCEDURE DIVISION.
       000-MAIN-PROCEDURE.
           SORT WORK-FILE
               ON ASCENDING KEY ITEM-NO-IN
                            USING IN-FILE
                            GIVING OUT-FILE
            STOP RUN.

"""

In [None]:
# @title Invoke CodeChat-bison model with Parameters and Prompt
parameters = {
    "temperature": 0.2,
    "max_output_tokens": 1024
}

# Pass the parameters defined above; otherwise the SDK defaults are used.
response_1 = chat.send_message(message_1, **parameters)

print(response_1)

The first line of code, IDENTIFICATION DIVISION., identifies the program as SORT-PROGRAM.

The next section, ENVIRONMENT DIVISION, specifies the input and output files for the program. The IN-FILE file is assigned to the file "/Users/gaetanodorsi/Desktop/COBOL/lesson5/UNSORTED.txt" and is organized in a line sequential format. The OUT-FILE file is assigned to the file "/Users/gaetanodorsi/Desktop/COBOL/lesson5/SORTED.txt" and is also organized in a line sequential format.

The DATA DIVISION section defines the data structures that will be used by the program. The IN-FILE file is defined as a record that contains 7 characters. The OUT-FILE file is defined as a record that contains 132 characters. The WORK-FILE file is defined as a record that contains two fields: ITEM-NO-WORK and ITEM-QTY-WORK.

The WORKING-STORAGE SECTION defines a variable named ARE-THERE-MORE-RECORDS. This variable will be used to keep track of whether there are more records to be read from the IN-FILE file.

The PRO

In [None]:
# @title Set Up Follow-up Question Prompt

message_2 = """OK. Can you explain what the code below does in the context of the code block above?

01 IN-REC.
           02 ITEM-NO-IN PIC XXX.
           02 ITEM-QTY-IN PIC 9999.
"""

In [None]:
# @title Invoke CodeChat-bison model with Parameters and Follow-up Question Prompt

# Reuse the same chat session and parameters so the model keeps the earlier context.
response_2 = chat.send_message(message_2, **parameters)

print(response_2)

The code you provided defines the IN-REC record. This record is used to store the data from the IN-FILE file. The ITEM-NO-IN field is a three-character field that stores the item number. The ITEM-QTY-IN field is a four-digit field that stores the item quantity.
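
The follow-up question works because both messages go through the same `chat` session, which carries the earlier turns as context. The pattern can be sketched as a loop over questions. `ask_in_sequence` is a hypothetical helper, and `EchoChat` is an offline stand-in written for this sketch so that it runs without calling the model; only the `send_message` call shape is taken from the cells above.

```python
def ask_in_sequence(chat, messages, **parameters):
    """Send each message to the same chat session and collect the reply texts."""
    replies = []
    for message in messages:
        response = chat.send_message(message, **parameters)
        # Real responses expose .text; the stand-in below returns a plain string.
        replies.append(getattr(response, "text", str(response)))
    return replies


class EchoChat:
    """Hypothetical stand-in for a chat session; it only records the history."""

    def __init__(self):
        self.history = []

    def send_message(self, message, **parameters):
        self.history.append(message)
        return f"reply {len(self.history)} (saw {len(self.history)} message(s))"


demo_replies = ask_in_sequence(
    EchoChat(), ["Explain the program.", "What does IN-REC do?"], temperature=0.2
)
```

With the real session, `ask_in_sequence(chat, [message_1, message_2], **parameters)` would be the equivalent call.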


## Code Translation from R to Python with Code-bison


In [None]:
# @title Set Up Code Translation Prompt

# Translate code from R to Python
prompt = """You're an expert R and Python programmer and great at translating R code to Python. Please translate the R code below to Python:

# Load libraries
library(dplyr)
library(bigrquery)

# Prepare training and evaluation data from BigQuery
sample_size <- 10000
sql_query <- sprintf(sql_query, sample_size)

train_query <- paste('SELECT * FROM (', sql_query,
  ') WHERE MOD(CAST(key AS INT64), 100) <= 75')
eval_query <- paste('SELECT * FROM (', sql_query,
  ') WHERE MOD(CAST(key AS INT64), 100) > 75')

# Load training data to data frame
train_data <- bq_table_download(
    bq_project_query(
        PROJECT_ID,
        query = train_query
    )
)

# Load evaluation data to data frame
eval_data <- bq_table_download(
    bq_project_query(
        PROJECT_ID,
        query = eval_query
    )
)
"""
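
For readers checking a translation against the R source, it helps to be clear about what the two `MOD(...)` filters do: rows whose key leaves a remainder of 0 through 75 when divided by 100 go to training, and remainders 76 through 99 go to evaluation. The sketch below reproduces that rule in plain Python. It assumes non-negative integer keys (Python's `%` and BigQuery's `MOD` differ for negative values), and `split_for_key` is a hypothetical name used only here.

```python
def split_for_key(key, threshold=75, buckets=100):
    """Mirror the SQL filter: remainder <= threshold is train, otherwise eval."""
    return "train" if key % buckets <= threshold else "eval"


# Count how many of the 100 possible remainders land in each split.
counts = {"train": 0, "eval": 0}
for key in range(100):
    counts[split_for_key(key)] += 1

print(counts)  # {'train': 76, 'eval': 24}
```

So for evenly distributed keys the split is roughly 76% training and 24% evaluation, not 75/25.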

In [None]:
# @title Invoke Code-bison with Parameters and Code Translation Prompt
parameters = {
    "temperature": 0.2,
    "max_output_tokens": 512
}

response = code_generation_model.predict(
        prefix=prompt, **parameters
)

print(response.text)

```python
# Load libraries
import pandas as pd
from google.cloud import bigquery

# Prepare training and evaluation data from BigQuery
sample_size = 10000
sql_query = """
SELECT *
FROM `PROJECT_ID.DATASET_ID.TABLE_ID`
LIMIT {0}
""".format(sample_size)

train_query = """
SELECT *
FROM (
  {0}
)
WHERE MOD(CAST(key AS INT64), 100) <= 75
""".format(sql_query)

eval_query = """
SELECT *
FROM (
  {0}
)
WHERE MOD(CAST(key AS INT64), 100) > 75
""".format(sql_query)

# Load training data to data frame
train_data = pd.read_gbq(train_query)

# Load evaluation data to data frame
eval_data = pd.read_gbq(eval_query)
```
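
As the output above shows, the model may wrap its answer in a Markdown code fence. If the translated code needs to be saved or inspected separately, the fenced part can be pulled out with a regular expression. This is a minimal sketch: `extract_code_blocks` is a hypothetical helper, and it assumes well-formed fences with an optional language tag on the opening line.

```python
import re

# Build the fence marker programmatically to avoid a literal fence in this cell.
FENCE = "`" * 3
FENCE_PATTERN = re.compile(
    FENCE + r"[ \t]*([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL
)


def extract_code_blocks(text):
    """Return a list of (language, code) tuples, one per fenced block in text."""
    return [
        (language, code.rstrip("\n"))
        for language, code in FENCE_PATTERN.findall(text)
    ]


sample = FENCE + "python\nimport pandas as pd\n" + FENCE
print(extract_code_blocks(sample))  # [('python', 'import pandas as pd')]
```

Applied to the notebook, `extract_code_blocks(response.text)` would return the translated Python as the second element of the first tuple, provided the response is fenced as above.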
