# Pull Request Analysis

This notebook aims to analyze the pull request dataset.

1. [Determine the number of interactions with ChatGPT](#interactions)
2. [Calculate the average length of prompts (measured in tokens)](#average-length-of-prompt)
3. [Calculate the average length of answers (measured in tokens)](#average-length-of-answer)

> NOTE
>
> All file paths to CSV files are replaced with relative paths.

In [3]:
# Import libraries.

import pandas as pd
from langdetect import detect
import numpy as np
import altair as alt

# alt.renderers.enable("jupyterlab")

## Read Dataset

In [4]:
file_path = "~/data/DevGPT/cleaned/pr_total.csv"

df = pd.read_csv(file_path, index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 28348 entries, 0 to 28347
Data columns (total 40 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Type                  28348 non-null  object 
 1   URL_pr                28348 non-null  object 
 2   Author                28348 non-null  object 
 3   RepoName              28348 non-null  object 
 4   RepoLanguage          28252 non-null  object 
 5   Number                28348 non-null  int64  
 6   Title_x               28348 non-null  object 
 7   Body                  28150 non-null  object 
 8   CreatedAt             28348 non-null  object 
 9   ClosedAt              27642 non-null  object 
 10  MergedAt              23310 non-null  object 
 11  UpdatedAt             28348 non-null  object 
 12  State                 28348 non-null  object 
 13  Additions             28348 non-null  int64  
 14  Deletions             28348 non-null  int64  
 15  ChangedFiles          28

In [5]:
# Remove rows with a missing answer or prompt.
df = df[~df["Answer"].isna()]
df = df[~df["Prompt"].isna()]
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 28242 entries, 0 to 28343
Data columns (total 40 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Type                  28242 non-null  object 
 1   URL_pr                28242 non-null  object 
 2   Author                28242 non-null  object 
 3   RepoName              28242 non-null  object 
 4   RepoLanguage          28146 non-null  object 
 5   Number                28242 non-null  int64  
 6   Title_x               28242 non-null  object 
 7   Body                  28052 non-null  object 
 8   CreatedAt             28242 non-null  object 
 9   ClosedAt              27544 non-null  object 
 10  MergedAt              23228 non-null  object 
 11  UpdatedAt             28242 non-null  object 
 12  State                 28242 non-null  object 
 13  Additions             28242 non-null  int64  
 14  Deletions             28242 non-null  int64  
 15  ChangedFiles          28
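
The two boolean-mask filters in the cell above can also be written as a single `dropna` call. The sketch below uses a small made-up frame (the strings are placeholders, not rows from the dataset) to check that both forms keep the same rows.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook.
toy = pd.DataFrame(
    {
        "Prompt": ["first prompt", None, "third prompt"],
        "Answer": ["first answer", "answer without prompt", None],
    }
)

# Two-step mask filter, as in the notebook cell.
mask_version = toy[~toy["Answer"].isna()]
mask_version = mask_version[~mask_version["Prompt"].isna()]

# Single call that drops rows where either column is missing.
dropna_version = toy.dropna(subset=["Answer", "Prompt"])

same_rows = mask_version.equals(dropna_version)
print(dropna_version)
print(f"Both filters keep the same rows: {same_rows}")
```

Either form is fine; `dropna(subset=...)` simply states the intent in one line.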

In [6]:
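# Keep only pull requests from Python repositories.
# .copy() avoids a SettingWithCopyWarning when the new column is added below.
df = df.loc[df["RepoLanguage"] == "Python"].copy()

# Detect the language of each answer.
df["Language"] = df["Answer"].apply(detect)

# Keep English answers only.
df = df.loc[df["Language"] == "en"]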
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 478 entries, 7870 to 23117
Data columns (total 41 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Type                  478 non-null    object 
 1   URL_pr                478 non-null    object 
 2   Author                478 non-null    object 
 3   RepoName              478 non-null    object 
 4   RepoLanguage          478 non-null    object 
 5   Number                478 non-null    int64  
 6   Title_x               478 non-null    object 
 7   Body                  405 non-null    object 
 8   CreatedAt             478 non-null    object 
 9   ClosedAt              466 non-null    object 
 10  MergedAt              465 non-null    object 
 11  UpdatedAt             478 non-null    object 
 12  State                 478 non-null    object 
 13  Additions             478 non-null    int64  
 14  Deletions             478 non-null    int64  
 15  ChangedFiles          4

In [7]:
df.head()

Unnamed: 0,Type,URL_pr,Author,RepoName,RepoLanguage,Number,Title_x,Body,CreatedAt,ClosedAt,...,MentionedURL,MentionedProperty,MentionedAuthor,MentionedText,MentionedPath,URL_chatgptsharing,Prompt,Answer,ListOfCode,Language
7870,pull request,https://github.com/paul-gauthier/aider/pull/119,joshuavial,paul-gauthier/aider,Python,119,create Dockerfile and scripts for managing it,I don't know if you want this in the codebase ...,2023-07-18T03:46:34Z,,...,https://github.com/paul-gauthier/aider/pull/11...,comments.body,joshuavial,This feels like a much nicer approach to me @p...,,https://chat.openai.com/share/4555f0ea-1e7b-49...,How can I setup a github action to automatical...,"Sure, I can certainly help you with that. Here...","[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '...",en
7871,pull request,https://github.com/paul-gauthier/aider/pull/119,joshuavial,paul-gauthier/aider,Python,119,create Dockerfile and scripts for managing it,I don't know if you want this in the codebase ...,2023-07-18T03:46:34Z,,...,https://github.com/paul-gauthier/aider/pull/11...,comments.body,joshuavial,This feels like a much nicer approach to me @p...,,https://chat.openai.com/share/4555f0ea-1e7b-49...,How can I setup a github action to automatical...,"Sure, I can certainly help you with that. Here...","[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '...",en
7872,pull request,https://github.com/paul-gauthier/aider/pull/119,joshuavial,paul-gauthier/aider,Python,119,create Dockerfile and scripts for managing it,I don't know if you want this in the codebase ...,2023-07-18T03:46:34Z,,...,https://github.com/paul-gauthier/aider/pull/11...,comments.body,joshuavial,This feels like a much nicer approach to me @p...,,https://chat.openai.com/share/4555f0ea-1e7b-49...,How can I setup a github action to automatical...,"Sure, I can certainly help you with that. Here...","[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '...",en
7873,pull request,https://github.com/paul-gauthier/aider/pull/119,joshuavial,paul-gauthier/aider,Python,119,create Dockerfile and scripts for managing it,I don't know if you want this in the codebase ...,2023-07-18T03:46:34Z,,...,https://github.com/paul-gauthier/aider/pull/11...,comments.body,joshuavial,This feels like a much nicer approach to me @p...,,https://chat.openai.com/share/4555f0ea-1e7b-49...,How can I setup a github action to automatical...,"Sure, I can certainly help you with that. Here...","[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '...",en
7898,pull request,https://github.com/chitalian/gptask/pull/2,calum-bird,chitalian/gptask,Python,2,Fix: recursive/glob support,"Changes:\r\n`-r` is now a flag, not an argumen...",2023-07-24T18:09:25Z,2023-07-24T19:52:21Z,...,https://github.com/chitalian/gptask/pull/2#iss...,comments.body,chitalian,@calum-bird \r\nhttps://chat.openai.com/share/...,,https://chat.openai.com/share/902cd378-3ebc-4e...,Give me some test commands for this\n\nimport ...,This Python script is a command-line tool that...,"[{'ReplaceString': '[CODE_BLOCK_0]', 'Type': '...",en


In [8]:
# Check the number of unique PR URLs: 18
print(f"There are {df['URL_pr'].nunique()} unique URLs in the pull request dataset")

There are 18 unique URLs in the pull request dataset


In [9]:
# Group data by URL.
df_url = df.groupby(by=["URL_pr"])

## Interactions

Determine the number of interactions with ChatGPT.

Method:
1. Leverage the grouped dataset.
2. Calculate the average number of interactions using column `URL_chatgptsharing`.
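
As a minimal sketch of the method above, the toy frame below stands in for the real dataset. The URLs and values are made up for illustration, and each row is assumed to represent one interaction. It counts the non-null `URL_chatgptsharing` entries per pull request and then averages those counts.

```python
import pandas as pd

# Hypothetical toy data: one row per ChatGPT interaction linked to a PR.
toy = pd.DataFrame(
    {
        "URL_pr": ["pr/1", "pr/1", "pr/1", "pr/2", "pr/3", "pr/3"],
        "URL_chatgptsharing": ["s/a", "s/a", "s/b", "s/c", "s/d", None],
    }
)

# "count" ignores missing values, so pr/3 contributes 1, not 2.
per_pr = toy.groupby("URL_pr")["URL_chatgptsharing"].count().reset_index()

# Average of the per-PR counts.
average_interactions = per_pr["URL_chatgptsharing"].mean()

print(per_pr)
print(f"Average interactions per PR: {average_interactions:.2f}")
```

Note that `count` counts rows, not distinct sharing links; if distinct links were wanted, `nunique` would be the aggregation to use instead.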

In [10]:
df_url_count = df_url["URL_chatgptsharing"].aggregate("count").reset_index()
df_url_count.head()

| | URL_pr | URL_chatgptsharing |
|---|---|---|
| 0 | https://github.com/Email-Generation/email_gene... | 167 |
| 1 | https://github.com/Hochfrequenz/kohlrahbi/pull... | 16 |
| 2 | https://github.com/RND247/Pype-Synthetic-Data-... | 32 |
| 3 | https://github.com/aiplanethub/genai-stack/pul... | 1 |
| 4 | https://github.com/app-sre/qontract-reconcile/... | 80 |


In [11]:
idx_max = np.argmax(df_url_count["URL_chatgptsharing"])
idx_min = np.argmin(df_url_count["URL_chatgptsharing"])

min_count = df_url_count.iloc[idx_min]["URL_chatgptsharing"]
min_count_repos = df_url_count.loc[df_url_count["URL_chatgptsharing"] == min_count]["URL_pr"]

print(
    f"Average interactions per PR: {np.mean(df_url_count['URL_chatgptsharing']):.2f}\n"
    f"Maximum number of interactions: {df_url_count.iloc[idx_max]['URL_chatgptsharing']} from "
    f"PR: {df_url_count.iloc[idx_max]['URL_pr']}\n"
    f"Minimum number of interactions: {df_url_count.iloc[idx_min]['URL_chatgptsharing']} from "
    f"PRs: {list(min_count_repos)}"
)

Average interactions per PR: 26.56
Maximum number of interactions: 167 from PR: https://github.com/Email-Generation/email_generation/pull/2
Minimum number of interactions: 1 from PRs: ['https://github.com/aiplanethub/genai-stack/pull/21', 'https://github.com/bancaditalia/black-it/pull/58', 'https://github.com/comfyanonymous/ComfyUI/pull/1115', 'https://github.com/microsoft/visionmetrics/pull/42']


In [12]:
# Check the histogram.
inter_bar = alt.Chart(df_url_count).mark_bar(size=5).encode(
    x=alt.X(
        "URL_chatgptsharing:N",
        title="Number of Interactions with ChatGPT",
    ),
    y=alt.Y("count()", title="Count"),
).properties(
    title={
        "text": "Distribution of the number of interactions with ChatGPT",
        "subtitle": "Pull Request"
    }
)

rule = alt.Chart(df_url_count).mark_rule().encode(
    x=alt.X(
        'URL_chatgptsharing:N',
        aggregate="mean",
        type='nominal',
        axis=alt.Axis(format="2d")
    ),
    size=alt.value(2),
)

alt.layer(inter_bar, rule)

![](./graphs/interaction-pr-hist.png)

## Average Length of Prompt

Calculate the average length of prompts (measured in tokens).

Method:
1. Leverage the grouped dataset.
2. Calculate the average length of prompts using column `TokensOfPrompts`.
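
The method above averages per-PR means, which gives every pull request the same weight regardless of how many rows it has. The sketch below, on made-up token counts, contrasts that with a plain row-level mean so the difference is visible.

```python
import pandas as pd

# Hypothetical toy data; the token counts are invented for illustration.
toy = pd.DataFrame(
    {
        "URL_pr": ["pr/1", "pr/1", "pr/1", "pr/2"],
        "TokensOfPrompts": [10, 20, 30, 100],
    }
)

# Per-PR mean and row count, mirroring aggregate(["mean", "count"]).
per_pr = toy.groupby("URL_pr")["TokensOfPrompts"].agg(["mean", "count"]).reset_index()

# Mean of per-PR means: every PR has equal weight.
mean_of_means = per_pr["mean"].mean()

# Row-level mean: PRs with more rows carry more weight.
row_level_mean = toy["TokensOfPrompts"].mean()

print(per_pr)
print(f"Mean of per-PR means: {mean_of_means:.2f}")
print(f"Row-level mean: {row_level_mean:.2f}")
```

Which of the two is appropriate depends on whether the unit of analysis is the pull request or the individual interaction; this notebook uses the pull request.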


In [13]:
df_url_prompt = df_url["TokensOfPrompts"].aggregate(["mean", "count"]).reset_index()
df_url_prompt.head()

| | URL_pr | mean | count |
|---|---|---|---|
| 0 | https://github.com/Email-Generation/email_gene... | 21133.0 | 167 |
| 1 | https://github.com/Hochfrequenz/kohlrahbi/pull... | 17.0 | 16 |
| 2 | https://github.com/RND247/Pype-Synthetic-Data-... | 33.0 | 32 |
| 3 | https://github.com/aiplanethub/genai-stack/pul... | 11.0 | 1 |
| 4 | https://github.com/app-sre/qontract-reconcile/... | 109.0 | 80 |


In [14]:
idx_max = np.argmax(df_url_prompt["mean"])
idx_min = np.argmin(df_url_prompt["mean"])

print(
    f"Average prompt length per PR: {np.mean(df_url_prompt['mean']):.2f}\n"
    f"Maximum prompt length: {df_url_prompt['mean'].iloc[idx_max]} from "
    f"PR: {df_url_prompt['URL_pr'].iloc[idx_max]}\n"
    f"Minimum prompt length: {df_url_prompt['mean'].iloc[idx_min]} from "
    f"PR: {df_url_prompt['URL_pr'].iloc[idx_min]}"
)

Average prompt length per PR: 1520.47
Maximum prompt length: 21133.0 from PR: https://github.com/Email-Generation/email_generation/pull/2
Minimum prompt length: 10.0 from PR: https://github.com/microsoft/visionmetrics/pull/42


In [15]:
# Check the histogram.
inter_bar = alt.Chart(df_url_prompt).mark_bar(size=5).encode(
    x=alt.X(
        "mean:N",
        title="Length of Prompts (tokens)",
    ),
    y=alt.Y("count", title="Count"),
).properties(
    title={
        "text": "Distribution of the length of prompt with ChatGPT",
        "subtitle": "Pull Request"
    }
)

rule = alt.Chart(df_url_prompt).mark_rule().encode(
    x=alt.X(
        'mean:N',
        aggregate="mean",
        type='nominal',
        axis=alt.Axis(format="2d")
    ),
)

alt.layer(inter_bar, rule)

![](./graphs/prompt-pr-hist.png)

## Average Length of Answer

Calculate the average length of answers (measured in tokens).

Method:
1. Leverage the grouped dataset.
2. Calculate the average length of answers using column `TokensOfAnswers`.
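
Since the prompt and answer sections follow the same steps, both per-PR means could be computed in one pass with pandas named aggregation. This is only a sketch on made-up values; the output column names `prompt_mean`, `answer_mean`, and `rows` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy data; the token counts are invented for illustration.
toy = pd.DataFrame(
    {
        "URL_pr": ["pr/1", "pr/1", "pr/2"],
        "TokensOfPrompts": [10, 30, 50],
        "TokensOfAnswers": [100, 300, 150],
    }
)

# One groupby producing the per-PR prompt mean, answer mean, and row count.
summary = (
    toy.groupby("URL_pr")
    .agg(
        prompt_mean=("TokensOfPrompts", "mean"),
        answer_mean=("TokensOfAnswers", "mean"),
        rows=("TokensOfAnswers", "count"),
    )
    .reset_index()
)

# PRs with the longest and shortest average answer.
longest = summary.loc[summary["answer_mean"].idxmax(), "URL_pr"]
shortest = summary.loc[summary["answer_mean"].idxmin(), "URL_pr"]
average_answer_length = summary["answer_mean"].mean()

print(summary)
print(f"Average answer length per PR: {average_answer_length:.2f}")
print(f"Longest: {longest}, shortest: {shortest}")
```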

In [16]:
df_url_ans = df_url["TokensOfAnswers"].aggregate(["mean", "count"]).reset_index()
df_url_ans.head()

| | URL_pr | mean | count |
|---|---|---|---|
| 0 | https://github.com/Email-Generation/email_gene... | 49053.0 | 167 |
| 1 | https://github.com/Hochfrequenz/kohlrahbi/pull... | 267.0 | 16 |
| 2 | https://github.com/RND247/Pype-Synthetic-Data-... | 761.0 | 32 |
| 3 | https://github.com/aiplanethub/genai-stack/pul... | 410.0 | 1 |
| 4 | https://github.com/app-sre/qontract-reconcile/... | 1660.0 | 80 |


In [17]:
idx_max = np.argmax(df_url_ans["mean"])
idx_min = np.argmin(df_url_ans["mean"])

print(
    f"Average answer length per PR: {np.mean(df_url_ans['mean']):.2f}\n"
    f"Maximum answer length: {df_url_ans['mean'].iloc[idx_max]} from "
    f"PR: {df_url_ans['URL_pr'].iloc[idx_max]}\n"
    f"Minimum answer length: {df_url_ans['mean'].iloc[idx_min]} from "
    f"PR: {df_url_ans['URL_pr'].iloc[idx_min]}"
)

Average answer length per PR: 3434.83
Maximum answer length: 49053.0 from PR: https://github.com/Email-Generation/email_generation/pull/2
Minimum answer length: 100.0 from PR: https://github.com/monarch-initiative/oai-monarch-plugin/pull/39


In [18]:
# Check the histogram.
inter_bar = alt.Chart(df_url_ans).mark_bar(size=5).encode(
    x=alt.X(
        "mean:N",
        title="Length of Answers (tokens)",
    ),
    y=alt.Y("count", title="Count"),
).properties(
    title={
        "text": "Distribution of the length of answers with ChatGPT",
        "subtitle": "Pull Request"
    }
)

rule = alt.Chart(df_url_ans).mark_rule().encode(
    x=alt.X(
        'mean:N',
        aggregate="mean",
        type='nominal',
        axis=alt.Axis(format="2d")
    ),
)

alt.layer(inter_bar, rule)

![](./graphs/answer-pr-hist.png)