# ML Python Code Generation using Causal Language Model

**In this Notebook, we aim to construct a Causal Language Model designed specifically for Python code generation. The main focus is to create a model capable of taking a fragment of Python code as input and generating the entire code sequence as output. The significance of this approach lies in its potential to streamline the coding process and enhance efficiency.**

**Due to the vast landscape of Python libraries and frameworks, we'll narrow down the scope of our model to cater specifically to Data Science libraries. This decision is driven by the desire to optimize resources and reduce processing time. Therefore, our model will be tailored to four essential Data Science libraries:**

- **Pandas**: A powerful library for data manipulation and analysis, providing versatile data structures and tools.

- **Matplotlib**: A widely-used library for creating static, interactive, and animated visualizations in Python.

- **Seaborn**: Built on top of Matplotlib, this library facilitates the creation of attractive statistical graphics.

- **Scikit-Learn**: An invaluable library for machine learning tasks, offering a wide array of algorithms and tools.

**By focusing on these libraries, we can harness the capabilities of the Causal Language Model to deliver efficient and accurate Python code generation, specifically tailored to Data Science tasks. Let's dive into the implementation and explore the potential of this novel approach!**
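
To make the idea of narrowing the scope concrete, here is a minimal sketch of how source files could be screened for the four libraries. The keyword list and the helper name `mentions_any_library` are assumptions for illustration only; the dataset used later in this notebook is already curated, so this step is not run here.

```python
# Hypothetical keyword list: the import names of the four target libraries
# (Scikit-Learn is imported as "sklearn").
LIBRARY_KEYWORDS = ["pandas", "matplotlib", "seaborn", "sklearn"]


def mentions_any_library(source, keywords=LIBRARY_KEYWORDS):
    """Return True if the source text contains at least one of the keywords."""
    return any(keyword in source for keyword in keywords)


# Toy snippets standing in for the contents of source files.
snippets = [
    "import pandas as pd\ndf = pd.read_csv('data.csv')",
    "import os\nprint(os.getcwd())",
    "from sklearn.linear_model import LinearRegression",
]

# Keep only the snippets that reference one of the target libraries.
kept = [snippet for snippet in snippets if mentions_any_library(snippet)]
print(len(kept))
```

A plain substring check like this is cheap but crude: it would also match a keyword that only appears in a comment or a string.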

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
#Install the necessary libraries
!pip install -q transformers
!pip install -q datasets

In [3]:
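#Set the seed value and apply it to Python, NumPy, and the deep learning backend
from transformers import set_seed

SEED = 4243
set_seed(SEED)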

## Dataset

**Regarding the dataset used for training the Causal Language Model, the ideal candidate would be the [Codeparrot dataset](https://huggingface.co/datasets/transformersbook/codeparrot). However, due to its substantial size of nearly 180GB and its demanding computational requirements, we will opt for a more manageable version of this dataset. A [smaller](https://huggingface.co/datasets/huggingface-course/codeparrot-ds-train), curated version is conveniently accessible, and it will sufficiently serve our purposes.**

**This curated dataset is designed to provide a representative sample of Python code from GitHub repositories. While it may not encompass the entirety of the original dataset, it still contains diverse and relevant code examples. This scaled-down version helps conserve computational resources and facilitates a smoother and faster training process.**

**By utilizing this reduced dataset, we can still achieve substantial learning outcomes and generate Python code effectively. So, let's proceed with this more manageable dataset and explore the potential of our Causal Language Model in Python code generation for Data Science libraries!**

In [4]:
#Download the dataset
from datasets import load_dataset, DatasetDict

train = load_dataset(path="huggingface-course/codeparrot-ds-train", split="train")
valid = load_dataset(path="huggingface-course/codeparrot-ds-valid", split="validation")

#Combine the train and valid dataset into a single DatasetDict object

dataset = DatasetDict(
        {
        "train": train,
        "valid": valid
        }
                     )

Downloading and preparing dataset json/huggingface-course--codeparrot-ds-train to /root/.cache/huggingface/datasets/json/huggingface-course--codeparrot-ds-train-a9b1bc4c2b855d04/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.25G [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/huggingface-course--codeparrot-ds-train-a9b1bc4c2b855d04/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.
Downloading and preparing dataset json/huggingface-course--codeparrot-ds-valid to /root/.cache/huggingface/datasets/json/huggingface-course--codeparrot-ds-valid-e5ece22bd7b6a6ac/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/46.1M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/huggingface-course--codeparrot-ds-valid-e5ece22bd7b6a6ac/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


In [5]:
#Have a look at the dataset fields
dataset

DatasetDict({
    train: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 606720
    })
    valid: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
        num_rows: 3322
    })
})

In [6]:
#Inspect the data / Have a look at a sample
for key, value in dataset["train"][0].items():
    print(f">>>> {key}: {value[:1_000]}\n")
#Since the content field contains long code, we restrict the output to 1,000 characters

>>>> repo_name: kmike/scikit-learn

>>>> path: sklearn/utils/__init__.py

>>>> copies: 3

>>>> size: 10094

>>>> content: """
The :mod:`sklearn.utils` module includes various utilites.
"""

from collections import Sequence

import numpy as np
from scipy.sparse import issparse

from .murmurhash import murmurhash3_32
from .validation import (as_float_array, check_arrays, safe_asarray,
                         assert_all_finite, array2d, atleast2d_or_csc,
                         atleast2d_or_csr, warn_if_not_float,
                         check_random_state)
from .class_weight import compute_class_weight

__all__ = ["murmurhash3_32", "as_float_array", "check_arrays", "safe_asarray",
           "assert_all_finite", "array2d", "atleast2d_or_csc",
           "atleast2d_or_csr", "warn_if_not_float", "check_random_state",
           "compute_class_weight"]



class deprecated(object):
    """Decorator to mark a function or class as deprecated.


>>>> license: bsd-3-clause



## Data Preprocessing

**In the data preprocessing phase, we need to perform two essential steps to prepare the text data for the Causal Language Model: tokenization and data collation.**

#### Tokenization:
Tokenization breaks the raw text (Python code in this case) into smaller units called tokens. These tokens can be individual words, subwords, or characters, depending on the tokenizer. Each token is then mapped to an integer ID, because neural networks operate on numerical inputs rather than raw text.
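
As a toy illustration of the two steps (splitting into tokens, then mapping tokens to integer IDs), consider the sketch below. It uses a simple regular expression and a vocabulary built on the fly; both helpers are hypothetical and far simpler than the pretrained subword tokenizer loaded later in this notebook.

```python
import re


def split_into_tokens(code):
    """Split code into identifiers, numbers, and single non-space characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


code = "x = x + 1"
tokens = split_into_tokens(code)
vocab = build_vocab(tokens)
input_ids = [vocab[token] for token in tokens]

print(tokens)     # ['x', '=', 'x', '+', '1']
print(input_ids)  # [0, 1, 0, 2, 3]
```

Note how the repeated token `x` maps to the same ID both times; the model only ever sees the integer sequence.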

#### Data Collation:
After tokenization, the data needs to be organized into sequences that the Causal Language Model can use for training. Since the model learns from fixed-length sequences, each tokenized file is cut into consecutive chunks of `context_length` tokens, and every full-length chunk becomes one training instance. The data collator then stacks these chunks into batches and copies the input IDs as labels, so the model is trained to predict each token from the tokens that precede it. This lets the model learn patterns and dependencies within the code, which is crucial for accurate Python code generation.
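
The following sketch shows, on a toy sequence, how a token list can be cut into fixed-length training instances and how each instance yields next-token prediction pairs. The function names and the tiny `context_length` of 4 are assumptions chosen for readability.

```python
def split_into_chunks(token_ids, context_length):
    """Cut a token sequence into consecutive, non-overlapping chunks and
    keep only the chunks that are exactly context_length tokens long."""
    chunks = [
        token_ids[start:start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]
    return [chunk for chunk in chunks if len(chunk) == context_length]


def next_token_pairs(chunk):
    """Pair the token at each position with the token that follows it."""
    return list(zip(chunk[:-1], chunk[1:]))


token_ids = list(range(10))  # stand-in for real token IDs
chunks = split_into_chunks(token_ids, context_length=4)

print(chunks)                       # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(next_token_pairs(chunks[0]))  # [(0, 1), (1, 2), (2, 3)]
```

The two leftover tokens at the end do not fill a whole chunk and are dropped, which trades a small amount of data for uniform sequence lengths.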

**Together, these two preprocessing steps pave the way for successful training of the Causal Language Model, enabling it to effectively generate Python code based on the patterns it learns from the prepared tokenized and collated data.**

In [7]:
#Instantiate the tokenizer
from transformers import AutoTokenizer

context_length = 128
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

**Before tokenizing the whole dataset, it is good practice to try the tokenizer on a few sample inputs.**

In [8]:
#Let's tokenize the first 2 samples
samp_token = tokenizer(
    dataset["train"][:2]["content"],
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,
    return_length=True,
)

In [9]:
#Check the fields in the tokenized object
samp_token.keys()

dict_keys(['input_ids', 'attention_mask', 'length', 'overflow_to_sample_mapping'])

In [10]:
print(f"Input IDs length: {len(samp_token['input_ids'])}")
print(f"Input chunk lengths: {(samp_token['length'])}")
print(f"Chunk mapping: {samp_token['overflow_to_sample_mapping']}")

Input IDs length: 34
Input chunk lengths: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 117, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 41]
Chunk mapping: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
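
The output above can be read as follows: the first file was split into 20 chunks and the second into 14, and the last chunk of each file is shorter than the context length. A small sketch, using the values copied from that output, counts how many full-length chunks each sample contributes; the variable names are ours.

```python
from collections import Counter

context_length = 128

# Values copied from the printed output above.
lengths = [128] * 19 + [117] + [128] * 13 + [41]
sample_mapping = [0] * 20 + [1] * 14

# Count, per original sample, the chunks that fill the whole context window.
kept_per_sample = Counter(
    sample
    for length, sample in zip(lengths, sample_mapping)
    if length == context_length
)
print(dict(kept_per_sample))  # {0: 19, 1: 13}
```

So 32 of the 34 chunks are full length; only the two trailing chunks fall short of 128 tokens.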


**Since the tokenizer works fine on the sample data, we can apply it to the whole dataset using the `Dataset.map()` method.**

In [11]:
#Define a function to tokenize the whole dataset
def tokenize_ftn(element):
    outputs = tokenizer(
        element["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    
    #Discard the chunks that are shorter than the context length
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

In [12]:
#Check the tokenization function on some sample input
for sample in dataset["train"]:
    tokenized_sample = tokenize_ftn(sample)
    break
tokenized_sample.keys()

dict_keys(['input_ids'])

In [None]:
#Apply the tokenization function on the whole dataset
tokenized_dataset = dataset.map(function=tokenize_ftn,
                                batched=True,
                                remove_columns=dataset["train"].column_names)
tokenized_dataset

  0%|          | 0/607 [00:00<?, ?ba/s]

## Model

**To generate Python code, we will use the GPT-2 architecture. GPT-2 is a causal language model that was pre-trained on a large text corpus to predict the next token in a sequence. One option would be to fine-tune a pre-trained GPT-2 checkpoint on our data.**

**However, given the substantial amount of data available, we will instead train the model from scratch: only the GPT-2 configuration is loaded, and the weights are freshly initialized. This lets the model learn the vocabulary of our code tokenizer and the patterns of Python syntax directly from our dataset, rather than adapting weights that were learned mostly on natural-language text.**

In [None]:
#Import the tokenizer, model, and configuration classes
from transformers import AutoTokenizer, TFGPT2LMHeadModel, AutoConfig

#Load the configuration file for the GPT-2 model
config = AutoConfig.from_pretrained(pretrained_model_name_or_path="gpt2",
                                    vocab_size=len(tokenizer),
                                    n_ctx=context_length,
                                    bos_token_id=tokenizer.bos_token_id,
                                    eos_token_id=tokenizer.eos_token_id,
)

In [None]:
#Print the configuration file
config

In [None]:
#Instantiate the model with the configuration file
model = TFGPT2LMHeadModel(config=config)
model(model.dummy_inputs)  # Builds the model
model.summary()
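
To get a rough feel for the size of the model being built, the parameter count can be estimated by hand. The sketch below assumes the standard hyperparameters of the smallest GPT-2 (12 layers, hidden size 768, 1,024 positions) and an output layer that shares its weights with the token embedding. The vocabulary size of 50,000 is a placeholder; in the notebook it would be `len(tokenizer)`.

```python
def gpt2_parameter_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Estimate the parameter count of a GPT-2 style decoder whose output
    layer shares its weights with the token embedding."""
    # Token embedding plus learned position embedding.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Each layer norm has a scale and a bias vector.
    layer_norm = 2 * n_embd
    # Fused query/key/value projection, then the attention output projection.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward block: expand to 4x the width, then project back.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two layer norms per block, plus one final layer norm after the last block.
    per_layer = 2 * layer_norm + attention + mlp
    return embeddings + n_layer * per_layer + layer_norm


# Placeholder vocabulary size; replace with len(tokenizer) in the notebook.
print(f"{gpt2_parameter_count(vocab_size=50_000):,}")
```

The figure reported by `model.summary()` is the authoritative one; this estimate is only meant to show where the parameters come from.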

In [None]:
#Instantiate the data collator
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer,
                                                mlm=False,
                                                return_tensors="tf")

In [None]:
#Check the data collator on some samples
collated_sample = data_collator([tokenized_dataset["train"][i] for i in range(5)])
for key in collated_sample:
    print(f"{key} shape: {collated_sample[key].shape}")
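
Conceptually, a collator for causal language modeling pads the samples in a batch to a common length and derives the labels from the input IDs, with padded positions set to -100 so that the loss skips them. The sketch below is a simplified pure-Python illustration of that idea, not the `DataCollatorForLanguageModeling` implementation; the function name and the sample values are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the loss conventionally skips


def collate_for_causal_lm(samples, pad_token_id):
    """Pad samples to the longest one and copy the input IDs as labels,
    replacing every pad token ID in the labels with IGNORE_INDEX."""
    longest = max(len(sample["input_ids"]) for sample in samples)
    input_ids, attention_mask, labels = [], [], []
    for sample in samples:
        ids = list(sample["input_ids"])
        n_pad = longest - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        labels.append(
            [IGNORE_INDEX if token == pad_token_id else token for token in padded]
        )
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


batch = collate_for_causal_lm(
    [{"input_ids": [5, 6, 7]}, {"input_ids": [8, 9]}],
    pad_token_id=0,
)
print(batch["labels"])  # [[5, 6, 7], [8, 9, -100]]
```

In this notebook every chunk already has exactly `context_length` tokens, so no padding is actually needed; the labels are simply a copy of the input IDs.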

In [None]:
#Create tf.data.Dataset object
tf_train_dataset = model.prepare_tf_dataset(tokenized_dataset["train"],
                                            collate_fn=data_collator,
                                            shuffle=True,
                                            batch_size=32)

tf_eval_dataset = model.prepare_tf_dataset(tokenized_dataset["valid"],
                                           collate_fn=data_collator,
                                           shuffle=False,
                                           batch_size=32)

In [None]:
#Log in to the HuggingFace account
from huggingface_hub import notebook_login
notebook_login()

In [None]:
#Define the optimizer
from transformers import create_optimizer
import tensorflow as tf

num_train_steps = len(tf_train_dataset)
optimizer, schedule = create_optimizer(init_lr=5e-5,
                                       num_warmup_steps=1_000,
                                       num_train_steps=num_train_steps,
                                       weight_decay_rate=0.01)
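
The schedule returned alongside the optimizer can be pictured as a linear warmup followed by a linear decay to zero, which we understand to be the default behaviour of `create_optimizer`. The sketch below reproduces that shape in plain Python under this assumption; the total of 10,000 training steps is a hypothetical value, not the real length of `tf_train_dataset`.

```python
def learning_rate_at(step, init_lr, num_warmup_steps, num_train_steps):
    """Linear warmup from 0 to init_lr, then linear decay from init_lr to 0."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    decay_steps = num_train_steps - num_warmup_steps
    # Clamp so that steps past the end of training stay at the final value.
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)


# Hypothetical total step count, for illustration only.
total_steps = 10_000
for step in (0, 500, 1_000, 5_500, 10_000):
    lr = learning_rate_at(
        step, init_lr=5e-5, num_warmup_steps=1_000, num_train_steps=total_steps
    )
    print(step, lr)
```

The warmup keeps early updates small while the freshly initialized weights are still far from a reasonable solution.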

In [None]:
#Compile the model
model.compile(optimizer=optimizer)

In [None]:
#Define the callbacks
from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(output_dir="python-code-generator",
                             tokenizer=tokenizer)

In [None]:
#Train the model
model.fit(tf_train_dataset,
          validation_data=tf_eval_dataset,
          callbacks=[callback])
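
Once training has run, the validation loss reported by Keras can be turned into perplexity, a common way to read the quality of a causal language model. The sketch below assumes the loss is the mean cross-entropy per token in nats; the loss value used is hypothetical and is not a result from this notebook.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) into perplexity."""
    return math.exp(mean_cross_entropy)


# Hypothetical validation loss, for illustration only.
val_loss = 1.5
print(f"{perplexity(val_loss):.2f}")
```

A perplexity of k can be read loosely as the model being as uncertain as if it were choosing uniformly among k tokens at each step.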