# Implementation to Grab Pull Request Data from GitHub

This notebook demonstrates a method for fetching and displaying Pull Request (PR) data from the GitHub API. It outlines how to set up the environment, make requests, and process the responses, and it explores ways to feed this data into an LLM for further analysis.

### 1. Environment Setup

To run `base_github_pr_inter.ipynb`, ensure the `requests` library is installed. `requests` is used to make HTTP requests to the GitHub API and can be installed with:

```bash
pip install requests
```

### 2. Import Library

The necessary library is imported as follows:

In [1]:
import requests

### 3. Set Up Request URL

To fetch PR data, use the GitHub API endpoint:

```text
https://api.github.com/repos/{owner}/{repo}/pulls/{pull_number}
```

Define `owner`, `repo`, and `pull_number` as variables so the URL can be built for any PR:

In [2]:
# Define unique aspects of the URL
owner = 'MatthewGoldsberry'
repo = 'LLM-Projects'
pull_number = 1

# Construct the URL
url = f'https://api.github.com/repos/{owner}/{repo}/pulls/{pull_number}'

### 4. Make Request

Send a GET request using the `requests` library:

In [42]:
# Make a request and store response
response = requests.get(url)

**Note on Authentication:**

For private repositories or higher rate limits, include a GitHub personal access token. The token must be prefixed with an authentication scheme such as `Bearer`:

```python
headers = {
    'Authorization': 'Bearer YOUR_GITHUB_TOKEN'
}
```

Then pass `headers` to `requests.get()`:

```python
response = requests.get(url, headers=headers)
```
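
Hard-coding a token in a notebook risks committing it to version control. One common alternative, sketched here, reads the token from an environment variable and adds the `Authorization` header only when a token is present. The variable name `GITHUB_TOKEN` and the helper `build_headers` are assumptions for illustration and are not part of the original notebook; the `Accept` and `X-GitHub-Api-Version` values are the ones recommended in GitHub's REST API documentation.

```python
import os


def build_headers(token=None):
    """Return GitHub API headers; add Authorization only if a token is given."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


# GITHUB_TOKEN is an assumed variable name; use whatever your environment defines.
headers = build_headers(os.environ.get("GITHUB_TOKEN"))
print(sorted(headers))
```

The resulting dictionary can be passed to the request as `requests.get(url, headers=headers, timeout=10)`; the `timeout` argument is optional but keeps the cell from hanging on a stalled connection.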

### 5. Process the Response

Now that we have the response, we can inspect it.

First, check its `status_code` to make sure the request succeeded, which is indicated by a value of `200`.

Once we know the request succeeded, we unpack the body using `json()`.

The following code checks the `status_code` and, if the request succeeded, parses the JSON and prints every field so we can see what data is included.

In [None]:
if response.status_code == 200:
    # Parse the response JSON
    data = response.json()
    
    # Print each top-level field of the PR payload
    for key, value in data.items():
        print(f'{key}: {value}')
else:
    print(f'Error: {response.status_code}')
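
A bare status code is not always easy to interpret. The sketch below maps the codes most likely to appear in this workflow to a short hint. The helper `explain_status` is hypothetical; the rate-limit check assumes the `X-RateLimit-Remaining` response header that GitHub sends, whose value arrives as a string.

```python
def explain_status(status_code, rate_limit_remaining=None):
    """Map a GitHub API status code to a short, human-readable hint."""
    if status_code == 200:
        return "OK"
    if status_code in (403, 429) and rate_limit_remaining == "0":
        return "Rate limit exhausted; wait for the reset or authenticate"
    if status_code == 401:
        return "Bad or missing credentials"
    if status_code == 404:
        return "Not found (or a private repository without access)"
    return f"Unexpected status {status_code}"


# With a real response object this would be called as:
# explain_status(response.status_code,
#                response.headers.get("X-RateLimit-Remaining"))
print(explain_status(404))
```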

**Grabbing What We Need**

From all of that data, we mainly want three items: the `title` and `body`, which give general context for the problem, and the `diff_url`, which shows what changes were actually made.

To isolate these, let's run some code and look at what we get.

In [9]:
# Print the PR title
print(data['title'])

# Print the PR body
print(data['body'])

# Print the diff_url
print(data['diff_url'])

Test PR
This is a test message
https://github.com/MatthewGoldsberry/LLM-Projects/pull/1.diff
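
Indexing with `data['body']` works here, but GitHub returns `null` (`None` in Python) for a PR without a description, which can cause trouble later when the value is inserted into a prompt. A minimal sketch of a more defensive extraction follows; the helper name `extract_pr_fields` is an assumption, and the sample payload simply mirrors the values printed above.

```python
def extract_pr_fields(data):
    """Return the title, body, and diff_url from a parsed PR payload."""
    return {
        "title": data.get("title", ""),
        # A PR with no description has body == None; normalize to "".
        "body": data.get("body") or "",
        "diff_url": data.get("diff_url"),
    }


# Sample payload mirroring the values printed by the notebook cell.
sample = {
    "title": "Test PR",
    "body": "This is a test message",
    "diff_url": "https://github.com/MatthewGoldsberry/LLM-Projects/pull/1.diff",
}
fields = extract_pr_fields(sample)
print(fields["title"])
print(fields["body"])
print(fields["diff_url"])
```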


Since the `diff_url` value is itself a URL, we repeat the process used for the original URL: make a GET request and then process the response. The major difference is that `diff_url` returns plain text instead of JSON, so the response is read with `.text` rather than `.json()`.

In [43]:
# Fetch the diff data
diff_response = requests.get(data['diff_url'])

if diff_response.status_code == 200:
    diff_data = diff_response.text
    print(diff_data)
else:
    print(f'Error: {diff_response.status_code}')

diff --git a/GitHub PR AutoGen/base_pretained_model.ipynb b/GitHub PR AutoGen/base_pretained_model.ipynb
index a4bdbe5..01e0c28 100644
--- a/GitHub PR AutoGen/base_pretained_model.ipynb	
+++ b/GitHub PR AutoGen/base_pretained_model.ipynb	
@@ -15,7 +15,7 @@
    "source": [
     "### 1. Setting Up the Environment\n",
     "\n",
-    "Before running `base_pretrained_model.py`, ensure you have the necessary libraries installed. The key libraries used in this implementation are:\n",
+    "Before running `base_pretrained_model.ipynb`, ensure you have the necessary libraries installed. The key libraries used in this implementation are:\n",
     "* `torch`\n",
     "* `transformers`\n",
     "* `huggingface_hub`\n",
@@ -212,7 +212,7 @@
     "# Generate the response using the pipeline\n",
     "output = pipe(\n",
     "    prompt,                        # The input prompt for the model to generate text based on\n",
-    "    max_new_tokens=2048,           # Specify the maximum number of new tok
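
Because the diff arrives as plain text, it can be processed with ordinary string handling. The sketch below splits a unified diff into per-file sections and keeps only the added and removed lines. The helper names and the sample diff are illustrative assumptions; with the notebook's data, `diff_data` would be passed in place of `sample_diff`.

```python
def split_diff_by_file(diff_text):
    """Split a unified diff into {file_header: [lines]} sections."""
    sections = {}
    current = None
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            current = line
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def changed_lines(lines):
    """Keep added/removed lines, skipping the ---/+++ file markers.

    Limitation: a changed line whose own content begins with "--" or "++"
    looks like a file marker and is skipped as well.
    """
    return [
        ln for ln in lines
        if (ln.startswith("+") and not ln.startswith("+++"))
        or (ln.startswith("-") and not ln.startswith("---"))
    ]


# Made-up diff used only to exercise the helpers.
sample_diff = "\n".join([
    "diff --git a/example.txt b/example.txt",
    "index a4bdbe5..01e0c28 100644",
    "--- a/example.txt",
    "+++ b/example.txt",
    "@@ -1,3 +1,3 @@",
    " unchanged line",
    "-old line",
    "+new line",
])

for header, lines in split_diff_by_file(sample_diff).items():
    print(header)
    print(changed_lines(lines))
```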

### 8. LLM Integration

Use the diff output to generate insights with a pre-trained LLM, in this case `Llama 3.2 1B`. The prompt is similar to the one in `base_pretained_model.ipynb`, with the contents fetched from the `diff_url` injected into it.

Below is an excerpt of the diff rendered in Markdown with the `diff` language type, as it will appear in the prompt:

```diff
@@ -212,7 +212,7 @@
     # Generate the response using the pipeline
     output = pipe(
     prompt,                        # The input prompt for the model to generate text based on
-    max_new_tokens=2048,           # Specify the maximum number of new tokens to generate (up to 2048 in this case)
+    max_new_tokens=1024,           # Specify the maximum number of new tokens to generate (up to 2048 in this case)
     do_sample=False,               # Set to False for deterministic output (the same input will always produce the same output)
     )
```
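
Rather than pasting the diff into the prompt by hand, the prompt can be assembled from the fetched text. The sketch below wraps a diff string in a `diff`-typed fence beneath the instruction text. The helper `build_prompt` and the `max_chars` cap of 4000 are assumptions chosen for illustration, not values from the notebook.

```python
FENCE = "`" * 3  # three backticks, built this way to avoid nesting fences

INSTRUCTIONS = (
    "You are an AI assistant tasked with generating a GitHub Pull Request "
    "comment. The comment should describe the code changes below in a clear, "
    "professional, and technical manner, suitable for documentation "
    "purposes.\n\n"
)


def build_prompt(diff_text, max_chars=4000):
    """Wrap diff_text in a diff-typed code fence beneath the instructions."""
    # max_chars is an arbitrary, assumed cap that keeps the prompt small.
    trimmed = diff_text[:max_chars]
    return (
        f"{INSTRUCTIONS}"
        "### Code Changes:\n"
        f"{FENCE}diff\n{trimmed}\n{FENCE}\n\n"
    )


prompt = build_prompt("-old line\n+new line")
print(prompt)
```

With the notebook's variables this would be called as `build_prompt(diff_data)`. Joining the pieces with explicit newlines keeps each diff line on its own row in the prompt.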

To see what we are working with, we inject the diff from the PR into the prompt and run the LLM on it.

In [41]:
import torch
from transformers import pipeline
from huggingface_hub import login
from hf_token import llama_3_2_token

login(token=llama_3_2_token)

# Specify the ID of the LLM model
model_id = "meta-llama/Llama-3.2-1B"

# Create the pipe
pipe = pipeline(
    "text-generation",              # Specify the task type as text generation
    model=model_id,                 # Specify the LLM model to be used for generation
    torch_dtype=torch.bfloat16,     # Specify the data type for the model weights (bfloat16 for better performance on certain hardware)
    device_map="auto"               # Automatically assign model layers to available devices (GPU/CPU) for optimal performance
)

# Generate prompt with the whole diff
prompt = (
    "You are an AI assistant tasked with generating a GitHub Pull Request comment. "
    "The comment should describe the code changes below in a clear, professional, and technical manner, suitable for documentation purposes.\n\n"
    
    "### Code Changes:\n"
    "```diff\n"
     "--- a/GitHub PR AutoGen/base_pretained_model.ipynb"
     "+++ b/GitHub PR AutoGen/base_pretained_model.ipynb"	
     "@@ -15,7 +15,7 @@"
     "### 1. Setting Up the Environment\n"
     "\n"
     "-    Before running `base_pretrained_model.py`, ensure you have the necessary libraries installed. The key libraries used in this implementation are:\n"
     "+    Before running `base_pretrained_model.ipynb`, ensure you have the necessary libraries installed. The key libraries used in this implementation are:\n"
     "* `torch`\n"
     "* `transformers`\n"
     "* `huggingface_hub`\n"
     "@@ -212,7 +212,7 @@"
     "# Generate the response using the pipeline\n"
     "output = pipe(\n"
     "    prompt,                        # The input prompt for the model to generate text based on\n"
     "-    max_new_tokens=2048,           # Specify the maximum number of new tokens to generate (up to 2048 in this case)\n"
     "+    max_new_tokens=1024,           # Specify the maximum number of new tokens to generate (up to 2048 in this case)\n"
     "    do_sample=False,               # Set to False for deterministic output (the same input will always produce the same output)\n"
     ")\n"   
     "```\n\n"
)

# Generate the response using the pipeline
output = pipe(
    prompt,                        # The input prompt for the model to generate text based on
    max_new_tokens=2048,           # Specify the maximum number of new tokens to generate (up to 2048 in this case)
    do_sample=False,               # Set to False for deterministic output (the same input will always produce the same output)
)

# Print the generated text from the output
print(output[0]["generated_text"])  # Access and print the generated text from the output object


Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


You are an AI assistant tasked with generating a GitHub Pull Request comment. The comment should describe the code changes below in a clear, professional, and technical manner, suitable for documentation purposes.

### Code Changes:
```diff
--- a/GitHub PR AutoGen/base_pretained_model.ipynb+++ b/GitHub PR AutoGen/base_pretained_model.ipynb@@ -15,7 +15,7 @@### 1. Setting Up the Environment

-    Before running `base_pretrained_model.py`, ensure you have the necessary libraries installed. The key libraries used in this implementation are:
+    Before running `base_pretrained_model.ipynb`, ensure you have the necessary libraries installed. The key libraries used in this implementation are:
* `torch`
* `transformers`
* `huggingface_hub`
@@ -212,7 +212,7 @@# Generate the response using the pipeline
output = pipe(
    prompt,                        # The input prompt for the model to generate text based on
-    max_new_tokens=2048,           # Specify the maximum number of new tokens to gene

### 9. Results

We have now explored how to retrieve the information we need from the GitHub API. However, based on the test of feeding that data into the model, further prompt engineering and model training are needed. Currently, improving model quality is limited by my computer's computational power.

**Next Steps**
1. Refine prompt engineering for better LLM-generated outputs.
2. Work on training the LLM for better outputs.
3. Develop preprocessing steps to summarize large diffs for more concise inputs to the model.

**In Summary**

This notebook demonstrates the end-to-end process of interacting with the GitHub API, processing PR data, and passing it to an LLM for automated insights. The results are not perfect, but they serve as a proof of concept (POC) and provide a foundation for enhancing development workflows through AI integration.