# Data Analysis - General

This notebook analyzes the merged dataset, which combines issues, discussions, and pull requests. It has three goals:

1. Determine the number of interactions with ChatGPT across issues, discussions, and pull requests.
2. Calculate the average length of prompts (measured in tokens) for issues, discussions, and pull requests.
3. Calculate the average length of answers (measured in tokens) for issues, discussions, and pull requests.

> NOTE
>
> All file paths to CSV files are replaced with relative paths.

Table of Contents
- [Read the Merged Dataset](#read-the-merged-dataset)
- [Interactions](#interactions)
- [Average Length of Prompt](#average-length-of-prompt)
- [Average Length of Answer](#average-length-of-answer)

In [3]:
# Import libraries.

import pandas as pd
import numpy as np
import altair as alt

# alt.renderers.enable("jupyterlab")

## Read the Merged Dataset

In [None]:
df = pd.read_csv("~/data/DevGPT/cleaned/combine.csv", index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1348 entries, 7870 to 313
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   RepoName            1348 non-null   object 
 1   URL_chatgptsharing  1348 non-null   object 
 2   Prompt              1348 non-null   object 
 3   Answer              1348 non-null   object 
 4   TokensOfPrompts     1348 non-null   float64
 5   TokensOfAnswers     1348 non-null   float64
dtypes: float64(2), object(4)
memory usage: 73.7+ KB


## Interactions

Determine the number of interactions with ChatGPT across issues, discussions, and pull requests.

Method:
1. Group the data by `RepoName` and count the `URL_chatgptsharing` entries in each group.
2. Summarize the per-repository counts of interactions with ChatGPT (mean, maximum, and minimum).
3. Visualize the distribution of interactions with ChatGPT.
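The steps above can be sketched on a small toy table. This is a minimal illustration, not the notebook's data: the repository names and URLs below are made up, and only the column names `RepoName` and `URL_chatgptsharing` come from the merged dataset.

```python
import pandas as pd

# Toy stand-in for the merged dataset; repo names and URLs are hypothetical.
toy = pd.DataFrame({
    "RepoName": ["org/a", "org/a", "org/a", "org/b", "org/b", "org/c"],
    "URL_chatgptsharing": ["u1", "u2", "u3", "u4", "u5", "u6"],
})

# One row per repository with its number of ChatGPT interactions.
interactions = (
    toy.groupby("RepoName")["URL_chatgptsharing"]
    .count()
    .rename("interactions")
    .reset_index()
)

# Summary statistics over the per-repository counts.
summary = {
    "mean": float(interactions["interactions"].mean()),
    "max": int(interactions["interactions"].max()),
    "min": int(interactions["interactions"].min()),
}
print(interactions)
print(summary)
```

Note that the mean here is taken over repositories, so every repository contributes equally regardless of how many interactions it has.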

In [5]:
# Group by RepoName
df_repo = df.groupby(by=["RepoName"])

In [6]:
# In our merged dataset, there are 66 unique repos.
len(df_repo["RepoName"].unique())

66

In [7]:
# Show the distribution of interactions

df_repo_count = df_repo["URL_chatgptsharing"].aggregate(len).reset_index()

idx_max = np.argmax(df_repo_count["URL_chatgptsharing"])
idx_min = np.argmin(df_repo_count["URL_chatgptsharing"])
print(
    f"Average interactions per repo: {np.mean(df_repo_count['URL_chatgptsharing']):.2f}\n"
    f"Maximum number of interactions: {df_repo_count.iloc[idx_max]['URL_chatgptsharing']}\n"
    f"Minimum number of interactions: {df_repo_count.iloc[idx_min]['URL_chatgptsharing']}"
)

Average interactions per repo: 20.42
Maximum number of interactions: 501
Minimum number of interactions: 3


In [8]:
# Check the histogram.

inter_bar = alt.Chart(df_repo_count).mark_bar(size=5).encode(
    x=alt.X(
        "URL_chatgptsharing:N",
        title="Number of Interactions with ChatGPT",
    ),
    y=alt.Y("count()", title="Count"),
).properties(
    title={
        "text": "Distribution of the number of interactions with ChatGPT",
        "subtitle": "Merged dataset: issues, discussions, and pull requests"
    }
)

rule = alt.Chart(df_repo_count).mark_rule().encode(
    x=alt.X(
        'URL_chatgptsharing:N',
        aggregate="mean",
        type='nominal',
        axis=alt.Axis(format="2d")
    ),
    size=alt.value(2),
)

alt.layer(inter_bar, rule)

![](./graphs/interaction-general-hist.png)
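The chart above encodes the interaction count as a nominal field (`:N`), so each distinct value gets its own bar. When the counts have a long right tail, it can also help to look at binned counts. The following is a hedged sketch using `numpy.histogram`; the counts and the bin edges are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np

# Hypothetical per-repository interaction counts (not the real data).
counts = np.array([3, 3, 4, 6, 9, 12, 15, 54, 501])

# Decade-sized bins keep the long right tail from hiding the small values.
edges = np.array([1, 10, 100, 1000])
hist, _ = np.histogram(counts, bins=edges)

# numpy.histogram uses half-open bins [left, right), except the last bin,
# which also includes its right edge.
for left, right, n in zip(edges[:-1], edges[1:], hist):
    print(f"{left} to {right}: {n} repos")
```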

## Average Length of Prompt

Calculate the average length of prompts (measured in tokens).

Method:
1. Leverage the grouped dataset.
2. Calculate the average length of prompts using the `TokensOfPrompts` column.
3. Check the distribution of prompt lengths using a bar chart.
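Because the average is computed on the grouped dataset, the headline number is a mean of per-repository means, which can differ from the mean over all rows. The sketch below shows the difference on toy data; the repository names and token counts are hypothetical.

```python
import pandas as pd

# Toy data: one repo with three short prompts, one repo with a single long prompt.
toy = pd.DataFrame({
    "RepoName": ["org/a", "org/a", "org/a", "org/b"],
    "TokensOfPrompts": [10.0, 20.0, 30.0, 100.0],
})

# Per-repository mean and number of prompts, as in the grouped dataset.
per_repo = (
    toy.groupby("RepoName")["TokensOfPrompts"]
    .agg(["mean", "count"])
    .reset_index()
)

# Mean of per-repository means: each repository counts once.
mean_of_repo_means = float(per_repo["mean"].mean())

# Mean over all rows: each prompt counts once.
mean_over_rows = float(toy["TokensOfPrompts"].mean())

print(per_repo)
print(mean_of_repo_means, mean_over_rows)
```

In this toy case the per-repository average is 60 tokens while the per-row average is 40, because the single long prompt in `org/b` carries as much weight as the three prompts in `org/a`.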

In [9]:
df_url_prompt = df_repo["TokensOfPrompts"].aggregate(["mean", "count"]).reset_index()
df_url_prompt.head()

| | RepoName | mean | count |
|---|---|---|---|
| 0 | ActivityWatch/aw-server | 116.0 | 3 |
| 1 | AndyGrant/OpenBench | 31.0 | 3 |
| 2 | AntonOsika/gpt-engineer | 1102.333333 | 54 |
| 3 | D3Zyre/Copy-All-Files-From-Folder | 113.0 | 6 |
| 4 | Delgan/loguru | 269.0 | 9 |


In [10]:
idx_max = np.argmax(df_url_prompt["mean"])
idx_min = np.argmin(df_url_prompt["mean"])

print(
    f"Average length of prompt per repo: {np.mean(df_url_prompt['mean']):.2f}\n"
    f"Maximum mean prompt length: {df_url_prompt['mean'].iloc[idx_max]} from "
    f"repo: {df_url_prompt['RepoName'].iloc[idx_max]}\n"
    f"Minimum mean prompt length: {df_url_prompt['mean'].iloc[idx_min]} from "
    f"repo: {df_url_prompt['RepoName'].iloc[idx_min]}"
)

Average length of prompt per repo: 1219.85
Maximum mean prompt length: 23410.0 from repo: MauriceLe/PIMS
Minimum mean prompt length: 10.0 from repo: Shreya-R-Dixit-Memorial-Foundation/EyeDaV2


In [11]:
# Check the histogram.
inter_bar = alt.Chart(df_url_prompt).mark_bar(size=5).encode(
    x=alt.X(
        "mean:N",
        title="Length of Prompts (tokens)",
    ),
    y=alt.Y("count", title="Count"),
).properties(
    title={
        "text": "Distribution of the length of prompt with ChatGPT",
        "subtitle": "Pull Request"
    }
)

rule = alt.Chart(df_url_prompt).mark_rule().encode(
    x=alt.X(
        'mean:N',
        aggregate="mean",
        type='nominal',
        axis=alt.Axis(format="2d")
    ),
)

alt.layer(inter_bar, rule)

![](./graphs/prompt-general-hist.png)

In [12]:
# Check the histogram again after removing the extreme data point
# (the repo with the most interactions).
bad = np.argmax(df_url_prompt["count"])
df_clean = df_url_prompt.loc[~df_url_prompt.index.isin([bad])]
print(
    f"Average length of prompt per repo after removing extreme data point: "
    f"{np.mean(df_clean['mean']):.2f}"
)
inter_bar = alt.Chart(df_clean).mark_bar(size=5).encode(
    x=alt.X(
        "mean:N",
        title="Length of Prompts (tokens)",
    ),
    y=alt.Y("count", title="Count"),
).properties(
    title={
        "text": "Distribution of the length of prompts with ChatGPT",
        "subtitle": "Pull Request"
    }
)

rule = alt.Chart(df_clean).mark_rule().encode(
    x=alt.X(
        'mean:N',
        aggregate="mean",
        type='nominal',
        axis=alt.Axis(format="2d")
    ),
)

alt.layer(inter_bar, rule)

Average length of prompt per repo after removing extreme data point: 913.49


![](./graphs/prompt-general-hist-remove.png)
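The cell above selects the row to drop by its `count` (the repository with the most interactions), not by its `mean`. A minimal sketch of that rule on toy data follows, with the median shown as one possible robust alternative that needs no row removal. The values are hypothetical, and the median is an illustration rather than something the notebook computes.

```python
import pandas as pd

# Hypothetical per-repository summary in the same shape as the grouped table.
per_repo = pd.DataFrame({
    "RepoName": ["org/a", "org/b", "org/c", "org/d"],
    "mean": [100.0, 200.0, 300.0, 5000.0],
    "count": [3, 6, 9, 500],
})

# Same rule as the notebook cell: drop the repository with the largest count.
largest = per_repo["count"].idxmax()
trimmed = per_repo.drop(index=largest)

mean_all = float(per_repo["mean"].mean())
mean_trimmed = float(trimmed["mean"].mean())

# The median is barely affected by a single extreme repository.
median_all = float(per_repo["mean"].median())

print(mean_all, mean_trimmed, median_all)
```

In this toy table the repository with the largest count also has the largest mean, so dropping it moves the average from 1400 to 200. That need not hold in general, since a repository with many interactions can have short prompts.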

## Average Length of Answer

Calculate the average length of answers (measured in tokens).

Method:
1. Leverage the grouped dataset.
2. Calculate the average length of answers using the `TokensOfAnswers` column.
3. Check the distribution of answer lengths using a bar chart.
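As a hedged alternative to positional lookups with `np.argmax`, the per-repository answer statistics can be built with pandas named aggregation and the extremes located by label with `idxmax` and `idxmin`. The toy values and the output column names `mean_tokens` and `interactions` are assumptions made for this sketch.

```python
import pandas as pd

# Toy answer lengths; repo names and token counts are hypothetical.
toy = pd.DataFrame({
    "RepoName": ["org/a", "org/a", "org/b", "org/c", "org/c"],
    "TokensOfAnswers": [400.0, 600.0, 30.0, 2000.0, 4000.0],
})

# Named aggregation gives descriptive column names in one step.
per_repo = toy.groupby("RepoName").agg(
    mean_tokens=("TokensOfAnswers", "mean"),
    interactions=("TokensOfAnswers", "count"),
)

# With RepoName as the index, idxmax/idxmin return the repository names directly.
longest = per_repo["mean_tokens"].idxmax()
shortest = per_repo["mean_tokens"].idxmin()

print(per_repo)
print(longest, shortest)
```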

In [13]:
df_url_ans = df_repo["TokensOfAnswers"].aggregate(["mean", "count"]).reset_index()
df_url_ans.head()

| | RepoName | mean | count |
|---|---|---|---|
| 0 | ActivityWatch/aw-server | 572.0 | 3 |
| 1 | AndyGrant/OpenBench | 416.0 | 3 |
| 2 | AntonOsika/gpt-engineer | 5111.666667 | 54 |
| 3 | D3Zyre/Copy-All-Files-From-Folder | 1104.0 | 6 |
| 4 | Delgan/loguru | 1533.0 | 9 |


In [14]:
idx_max = np.argmax(df_url_ans["mean"])
idx_min = np.argmin(df_url_ans["mean"])

print(
    f"Average length of answers per repo: {np.mean(df_url_ans['mean']):.2f}\n"
    f"Maximum mean answer length: {df_url_ans['mean'].iloc[idx_max]} from "
    f"repo: {df_url_ans['RepoName'].iloc[idx_max]}\n"
    f"Minimum mean answer length: {df_url_ans['mean'].iloc[idx_min]} from "
    f"repo: {df_url_ans['RepoName'].iloc[idx_min]}"
)

Average length of answers per repo: 2491.19
Maximum mean answer length: 49053.0 from repo: Email-Generation/email_generation
Minimum mean answer length: 30.0 from repo: related-sciences/nxontology-ml


In [15]:
# Check the histogram.
inter_bar = alt.Chart(df_url_ans).mark_bar(size=5).encode(
    x=alt.X(
        "mean:N",
        title="Length of Answers (tokens)",
    ),
    y=alt.Y("count", title="Count"),
).properties(
    title={
        "text": "Distribution of the length of answers with ChatGPT",
        "subtitle": "Pull Request"
    }
)

rule = alt.Chart(df_url_ans).mark_rule().encode(
    x=alt.X(
        'mean:N',
        aggregate="mean",
        type='nominal',
        axis=alt.Axis(format="2d")
    ),
)

alt.layer(inter_bar, rule)

![](./graphs/answer-general-hist.png)

In [16]:
# Check the histogram again after removing the extreme data point
# (the repo with the most interactions).
bad = np.argmax(df_url_ans["count"])
df_clean = df_url_ans.loc[~df_url_ans.index.isin([bad])]
print(
    f"Average length of answers per repo after removing extreme data point: "
    f"{np.mean(df_clean['mean']):.2f}"
)
inter_bar = alt.Chart(df_clean).mark_bar(size=5).encode(
    x=alt.X(
        "mean:N",
        title="Length of Answers (tokens)",
    ),
    y=alt.Y("count", title="Count"),
).properties(
    title={
        "text": "Distribution of the length of answers with ChatGPT",
        "subtitle": "Pull Request"
    }
)

rule = alt.Chart(df_clean).mark_rule().encode(
    x=alt.X(
        'mean:N',
        aggregate="mean",
        type='nominal',
        axis=alt.Axis(format="2d")
    ),
)

alt.layer(inter_bar, rule)

Average length of answers per repo after removing extreme data point: 1774.86


![](./graphs/answer-general-hist-remove.png)