

# This project leverages the GPT-2 model, a transformer-based language model developed by OpenAI, to generate Python code
**1. Installing Required Libraries**


*   PyGithub: Used for interacting with the GitHub API.
*   pyarrow, datasets, and transformers: These libraries are essential for loading datasets and working with pre-trained models from the Hugging Face library.




In [None]:
!pip install PyGithub
!pip install pyarrow==14.0.1 datasets==2.12.0 transformers==4.31.0


**2. Loading the GPT-2 Tokenizer and Model**


* AutoTokenizer: This helps convert the input text into tokens that the model can understand.
* AutoModelForCausalLM: GPT-2 is a pre-trained causal language model, meaning it is trained to predict the next token in a sequence of tokens, which is useful for generating text or code.



In [None]:
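from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed checkpoint: base GPT-2
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token; reuse EOS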
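
To make "predict the next token" concrete, the sketch below builds the (context, target) pairs that a causal language model learns from. It uses plain Python and made-up token IDs; `next_token_pairs` is a hypothetical helper for illustration, not part of transformers.

```python
def next_token_pairs(token_ids):
    """Return (context, target) pairs for next-token prediction."""
    pairs = []
    for position in range(1, len(token_ids)):
        context = token_ids[:position]  # everything seen so far
        target = token_ids[position]    # the token the model should predict
        pairs.append((context, target))
    return pairs


# Made-up token IDs standing in for a short tokenized snippet.
example_ids = [10, 20, 30, 40]
for context, target in next_token_pairs(example_ids):
    print(context, "->", target)
```

Each pair shows the model seeing only the tokens to the left of the one it has to predict, which is what "causal" refers to.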


**3. Mounting Google Drive to Access the Dataset**


* This command mounts Google Drive to the Colab environment, allowing you to access files stored in your drive.




In [None]:
from google.colab import drive
drive.mount('/content/drive')


**4. Loading and Preparing the Dataset**

* pandas: Used to load the dataset from a CSV file into a DataFrame.
* Dataset.from_pandas: Converts the pandas DataFrame into a Hugging Face Dataset object, which is more efficient for training models with the transformers library.







In [None]:
import pandas as pd
from datasets import Dataset

csv_file_path = 'python_code_dataset.csv'  # CSV path; adjust if the file lives on Drive (under /content/drive/MyDrive/)
df = pd.read_csv(csv_file_path)
dataset = Dataset.from_pandas(df)


**5. Preprocessing the Dataset**


* Purpose: This function processes each batch of rows by joining two fields, instruction and input, with a space. If either field is missing (NaN), it is replaced with an empty string.
* Tokenization: The combined text is tokenized, truncated to at most 128 tokens, and padded to exactly 128 tokens (padding='max_length'). The labels are a copy of the input IDs; for causal language modeling the model shifts them by one position internally, so each token is predicted from the tokens before it.



In [None]:
def preprocess_function(examples):
    combined_texts = []
    for instruction, input_text in zip(examples['instruction'], examples['input']):
        instruction = instruction if pd.notna(instruction) else ""
        input_text = input_text if pd.notna(input_text) else ""
        combined_text = f"{instruction} {input_text}"
        combined_texts.append(combined_text)

    tokenized_inputs = tokenizer(combined_texts, truncation=True, max_length=128, padding='max_length')
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()

    return tokenized_inputs
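
Because the pad token is the EOS token and every sequence is padded to 128 tokens, copying `input_ids` into `labels` also asks the model to predict the padding. A common refinement is to set the labels to -100 wherever `attention_mask` is 0, since -100 is the label value that the loss in transformers ignores. The sketch below shows the idea on plain lists. `pad_or_truncate` and `mask_padding_labels` are hypothetical helpers, and the token IDs are made up apart from 50256, which is GPT-2's EOS ID.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in transformers


def pad_or_truncate(token_ids, max_length, pad_id):
    """Mimic truncation=True with padding='max_length' on one sequence (right padding)."""
    kept = token_ids[:max_length]
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


ids, mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=50256)
labels = mask_padding_labels(ids, mask)
print(ids)     # [11, 22, 33, 50256, 50256]
print(mask)    # [1, 1, 1, 0, 0]
print(labels)  # [11, 22, 33, -100, -100]
```

Inside `preprocess_function`, the same masking could be applied to each pair of rows from `tokenized_inputs["input_ids"]` and `tokenized_inputs["attention_mask"]` before assigning the labels.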


**6. Tokenizing the Dataset**

* map: Applies the preprocess_function to every example in the dataset. The function processes the examples in batches, which improves efficiency.





In [None]:
tokenized_datasets = dataset.map(preprocess_function, batched=True)
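
With `batched=True`, the function receives a dictionary of columns, each a list with one entry per row, rather than a single row. That is why `preprocess_function` iterates over the columns with `zip`. The sketch below imitates that calling convention in plain Python; `map_batched` and `combine` are hypothetical stand-ins for illustration, not the datasets implementation.

```python
def map_batched(rows, fn, batch_size=2):
    """Call fn on dict-of-lists batches and collect the returned columns."""
    output = {}
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        for key, values in fn(batch).items():
            output.setdefault(key, []).extend(values)
    return output


def combine(batch):
    """Toy batch function: join the two text columns of each row."""
    return {"text": [f"{a} {b}".strip()
                     for a, b in zip(batch["instruction"], batch["input"])]}


rows = [
    {"instruction": "Sort a list", "input": "[3, 1, 2]"},
    {"instruction": "Print hello", "input": ""},
    {"instruction": "Add numbers", "input": "1 2"},
]
print(map_batched(rows, combine))
```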


**7. Splitting the Dataset into Train and Test Sets**

* train_test_split: Splits the dataset into two parts, with 90% used for training and 10% reserved for testing the model.




In [None]:
tokenized_datasets = tokenized_datasets.train_test_split(test_size=0.1)
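
The split is random, so two runs produce different train and test sets unless a seed is fixed; `train_test_split` accepts a `seed` argument for this purpose. The plain-Python sketch below illustrates a seeded 90/10 split over row indices. `split_indices` is a hypothetical helper, and its rounding rule is an assumption made for illustration, not necessarily the one datasets applies.

```python
import math
import random


def split_indices(num_rows, test_size=0.1, seed=42):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_test = math.ceil(num_rows * test_size)  # rounding rule: an assumption
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))  # 900 100
```

In the notebook, the equivalent change would be `tokenized_datasets.train_test_split(test_size=0.1, seed=42)`, where 42 is an arbitrary choice.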


**8. Defining Training Arguments**


* output_dir: Where the model checkpoints and results will be saved.
* per_device_train_batch_size: Sets the batch size to 1 per device, a conservative choice for memory-limited environments such as Colab.
* num_train_epochs: The number of full passes through the training dataset (in this case, 1 epoch).
* save_steps: Saves a checkpoint every 5000 training steps.
* save_total_limit: Limits the number of checkpoints kept on disk (here, 1) to avoid running out of storage.









In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    save_steps=5000,
    save_total_limit=1,
)
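
Whether `save_steps=5000` ever triggers depends on how many optimizer steps the run takes. As a rough sketch, assuming a single device and no gradient accumulation (both assumptions, as are the example row counts), the step count and the checkpoint steps can be estimated like this:

```python
import math


def total_training_steps(num_rows, batch_size, num_epochs):
    """Optimizer steps for a single device without gradient accumulation."""
    return math.ceil(num_rows / batch_size) * num_epochs


def checkpoint_steps(total_steps, save_steps):
    """Steps at which a periodic checkpoint would be written."""
    return list(range(save_steps, total_steps + 1, save_steps))


# Hypothetical training split of 12,000 rows.
steps = total_training_steps(num_rows=12_000, batch_size=1, num_epochs=1)
print(steps)                          # 12000
print(checkpoint_steps(steps, 5000))  # [5000, 10000]
print(checkpoint_steps(3_000, 5000))  # []
```

In the last case no periodic checkpoint is written. For a run shorter than `save_steps`, an explicit `trainer.save_model()` call after training is one way to keep the fine-tuned weights.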


**9. Training the Model**


*   Trainer: This is the Hugging Face utility that handles the entire training loop, from feeding data into the model to updating the weights. It simplifies the training process.
*   The model is trained on the train portion of the tokenized dataset.



In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
)

trainer.train()
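
The test split created earlier is not used by the training call. One way to use it, sketched here, is to compute the loss on it with `trainer.evaluate(eval_dataset=tokenized_datasets["test"])`, which returns a dictionary containing `eval_loss`, and to convert that loss into perplexity. The loss value below is made up for illustration and is not a measured result.

```python
import math


def perplexity(mean_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_loss)


# In the notebook the loss would come from:
#   metrics = trainer.evaluate(eval_dataset=tokenized_datasets["test"])
#   mean_loss = metrics["eval_loss"]
mean_loss = 2.0  # made-up value, not a measured result
print(round(perplexity(mean_loss), 2))  # 7.39
```

Lower perplexity means the model assigns higher probability to the held-out text.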


**10. Generating Code Using the Fine-Tuned Model**


*  generate_code: This function takes a prompt (for example, the start of a Python function) and uses the fine-tuned model to generate a continuation of that code.
*  max_length=500: Caps the total output length at 500 tokens; the prompt tokens count toward this limit.
*  The generated token IDs are decoded back into readable text with tokenizer.decode; skip_special_tokens=True drops special tokens such as the end-of-sequence marker.



In [None]:
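def generate_code(prompt, max_length=500):
    # Move the inputs to the model's device (the GPU after training on Colab).
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Pass the attention mask too, and set pad_token_id explicitly for GPT-2.
    outputs = model.generate(**inputs, max_length=max_length, pad_token_id=tokenizer.eos_token_id)
    generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_code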
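
Since `max_length` counts the prompt as well as the generated tokens, the room left for new tokens shrinks as the prompt grows. `generate` also accepts `max_new_tokens`, which limits only the continuation. The arithmetic can be sketched as follows; `new_token_budget` is a hypothetical helper that gives the nominal budget, not a reimplementation of the stopping logic in transformers.

```python
def new_token_budget(prompt_tokens, max_length=500):
    """Nominal number of tokens left for generation when max_length caps the total."""
    return max(0, max_length - prompt_tokens)


# A prompt that tokenizes to 6 tokens (a made-up count) leaves 494 for the continuation.
print(new_token_budget(6))        # 494
print(new_token_budget(120, 128)) # 8
```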


**11. Testing the Model with a Prompt**


*  Here, a sample prompt (the signature of a merge_sort function) is passed to the model, which generates the rest of the function based on its training.
*  The generated code is printed.



In [None]:
prompt = "def merge_sort(arr):"
generated_code = generate_code(prompt)

print("Generated Code:")
print(generated_code)
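
Generated text is not guaranteed to be valid Python. A quick sanity check, sketched here with the standard-library `ast` module, is to try parsing the output. `is_valid_python` is a hypothetical helper, and passing the check only means the text parses, not that the code is correct.

```python
import ast


def is_valid_python(source):
    """Return True if source parses as Python, False otherwise."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


print(is_valid_python("def merge_sort(arr):\n    return sorted(arr)"))  # True
print(is_valid_python("def merge_sort(arr):"))                          # False
```

In the notebook, `is_valid_python(generated_code)` would flag outputs that stop mid-statement, for example when generation is cut off by `max_length`.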
