# Data Analysis - General

This notebook aims to analyze the merged dataset.

1. Determine the number of interactions with ChatGPT across issues, discussions, and pull requests.
2. Calculate the average length of prompts (measured in tokens) for issues, discussions, and pull requests.
3. Calculate the average length of answers (measured in tokens) for issues, discussions, and pull requests.

> NOTE
>
> All file paths to CSV files are replaced with relative paths.

Table of Contents
- [Read the Merged Dataset](#read-the-merged-dataset)
- [Interactions](#interactions)
- [Average Length of Prompt](#average-length-of-prompt)
- [Average Length of Answer](#average-length-of-answer)

In [5]:
# Import libraries.

import pandas as pd
import numpy as np
import altair as alt
import spacy
from multiprocessing.dummy import Pool
from tqdm import tqdm

alt.renderers.enable("jupyterlab")

RendererRegistry.enable('jupyterlab')

## Read the Merged Dataset

In [None]:
df = pd.read_csv("~/data/DevGPT/cleaned/combine.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1321 entries, 0 to 1320
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          1321 non-null   int64 
 1   RepoName            1321 non-null   object
 2   URL_chatgptsharing  1321 non-null   object
 3   Prompt              1321 non-null   object
 4   Answer              1321 non-null   object
dtypes: int64(1), object(4)
memory usage: 51.7+ KB


In [7]:
df.head()

Unnamed: 0.1,Unnamed: 0,RepoName,URL_chatgptsharing,Prompt,Answer
0,7870,paul-gauthier/aider,https://chat.openai.com/share/4555f0ea-1e7b-49...,How can I setup a github action to automatical...,"Sure, I can certainly help you with that. Here..."
1,7898,chitalian/gptask,https://chat.openai.com/share/902cd378-3ebc-4e...,Give me some test commands for this\n\nimport ...,This Python script is a command-line tool that...
2,7906,bbelderbos/htmx-demo,https://chat.openai.com/share/c8c101fa-aaae-49...,how to get the first 20 rows from a django model?,"To get the first 20 rows from a Django model, ..."
3,7907,bbelderbos/htmx-demo,https://chat.openai.com/share/c8c101fa-aaae-49...,"I have this view for infinite scroll, would be...",Certainly! You can refactor the code to use Dj...
4,7908,bbelderbos/htmx-demo,https://chat.openai.com/share/c8c101fa-aaae-49...,for page 49 this gives:\n\nInternal Server Err...,Certainly! To handle the ValueError when conve...


## Interactions

Determine the number of interactions with ChatGPT across issues, discussions, and pull requests.

Method:
1. Group the data by `RepoName` and count the `URL_chatgptsharing` entries in each group.
2. Report the mean, maximum, and minimum number of interactions per repository.
3. Visualize the distribution of interactions with ChatGPT.
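
The first two steps above can be sketched on a toy table. The rows below are invented for illustration; only the column names mirror the merged dataset, and `toy`, `counts`, and `n_interactions` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the merged dataset (values are invented for illustration).
toy = pd.DataFrame(
    {
        "RepoName": ["a/x", "a/x", "a/x", "b/y", "c/z", "c/z"],
        "URL_chatgptsharing": ["u1", "u2", "u2", "u3", "u4", "u5"],
    }
)

# One row per repository with the number of rows (interactions) it contributes.
counts = (
    toy.groupby("RepoName")["URL_chatgptsharing"]
    .size()
    .reset_index(name="n_interactions")
)

# Summary statistics over the per-repository counts.
mean_interactions = counts["n_interactions"].mean()
busiest = counts.loc[counts["n_interactions"].idxmax(), "RepoName"]
quietest = counts.loc[counts["n_interactions"].idxmin(), "RepoName"]

print(counts)
print(f"mean={mean_interactions:.2f}, max repo={busiest}, min repo={quietest}")
```

In this sketch every row counts as one interaction, so a shared link that appears on several rows is counted several times. If distinct shared links were the unit of interest instead, `.nunique()` could replace `.size()`.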

In [8]:
# Group by RepoName
df_repo = df.groupby(by=["RepoName"])
df_repo.head()

Unnamed: 0.1,Unnamed: 0,RepoName,URL_chatgptsharing,Prompt,Answer
0,7870,paul-gauthier/aider,https://chat.openai.com/share/4555f0ea-1e7b-49...,How can I setup a github action to automatical...,"Sure, I can certainly help you with that. Here..."
1,7898,chitalian/gptask,https://chat.openai.com/share/902cd378-3ebc-4e...,Give me some test commands for this\n\nimport ...,This Python script is a command-line tool that...
2,7906,bbelderbos/htmx-demo,https://chat.openai.com/share/c8c101fa-aaae-49...,how to get the first 20 rows from a django model?,"To get the first 20 rows from a Django model, ..."
3,7907,bbelderbos/htmx-demo,https://chat.openai.com/share/c8c101fa-aaae-49...,"I have this view for infinite scroll, would be...",Certainly! You can refactor the code to use Dj...
4,7908,bbelderbos/htmx-demo,https://chat.openai.com/share/c8c101fa-aaae-49...,for page 49 this gives:\n\nInternal Server Err...,Certainly! To handle the ValueError when conve...
...,...,...,...,...,...
1250,227,HaroldMitts/VoAIce,https://chat.openai.com/share/262e8d7d-657b-4a...,"output audio of the following sentence;\n\n""Do...","Sure, here is the audio version of your text:C..."
1286,111,sugi-01096/72,https://chat.openai.com/share/e2c50f86-6c14-4f...,import streamlit as st\nimport json\n\n\ndef s...,The code you provided seems to be a simple bul...
1287,131,sugi-01096/72,https://chat.openai.com/share/4cd4bb4a-56f9-4a...,import streamlit as st\nimport json\n\n\ndef s...,The code you provided seems to be a simple bul...
1288,186,Significant-Gravitas/Auto-GPT,https://chat.openai.com/share/aec2922e-dbcc-47...,I am using venv(python module env) on the mac ...,To upgrade the Python version within a virtual...


In [9]:
# In our merged dataset, there are 66 unique repos.
len(df_repo["RepoName"].unique())

66

In [10]:
# Show the distribution of interactions

df_repo_count = df_repo["URL_chatgptsharing"].aggregate(len).reset_index()

idx_max = np.argmax(df_repo_count["URL_chatgptsharing"])
idx_min = np.argmin(df_repo_count["URL_chatgptsharing"])
# Single quotes inside the f-strings keep this valid on Python versions before 3.12.
print(
    f"Average interactions per repo: {np.mean(df_repo_count['URL_chatgptsharing']):.2f}\n"
    f"Maximum number of interactions: {df_repo_count.iloc[idx_max]['URL_chatgptsharing']}\n"
    f"Minimum number of interactions: {df_repo_count.iloc[idx_min]['URL_chatgptsharing']}"
)

Average interactions per repo: 20.02
Maximum number of interactions: 501
Minimum number of interactions: 3


In [11]:
# Check the histogram

inter_bar = alt.Chart(df_repo_count).mark_bar().encode(
    x=alt.X(
        "URL_chatgptsharing:Q",  # counts are quantitative, so bin them
        bin=alt.Bin(maxbins=50),
        title="Number of Interactions with ChatGPT",
    ),
    y=alt.Y("count()", title="Count"),
).properties(
    title={
        "text": "Distribution of the number of interactions with ChatGPT",
        "subtitle": "Merged dataset: issues, discussions, and pull requests"
    }
)

# Vertical rule at the mean number of interactions.
rule = alt.Chart(df_repo_count).mark_rule().encode(
    x=alt.X("mean(URL_chatgptsharing):Q"),
    size=alt.value(2),
)

alt.layer(inter_bar, rule)

[Figure: histogram "Distribution of the number of interactions with ChatGPT" (x-axis: Number of Interactions with ChatGPT; y-axis: Count), with a rule at the mean.]


## Average Length of Prompt

Calculate the average length of prompts (measured in tokens) for issues, discussions, and pull requests.

Method:
1. Tokenize the prompts in the merged dataset.
2. Remove punctuation.
3. Calculate the average length of prompts.
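
The three steps above can be sketched on two made-up sentences. This sketch assumes a blank English pipeline (tokenizer only) rather than the `en_core_web_sm` model loaded below, and it uses `nlp.pipe` to stream the texts; `nlp_blank`, `texts`, and `lengths` are hypothetical names introduced for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, so no model download is needed.
# (Assumption: the notebook itself loads "en_core_web_sm" instead.)
nlp_blank = spacy.blank("en")

# Invented example texts standing in for prompts.
texts = ["Hello, world!", "How do I sort a list in Python?"]

# nlp.pipe streams the texts through the pipeline in batches.
docs = list(nlp_blank.pipe(texts))

# Step 2 and 3: drop punctuation tokens, then count what remains per text.
lengths = [sum(1 for tok in d if not tok.is_punct) for d in docs]

print(lengths)
print(sum(lengths) / len(lengths))
```

Whitespace-only tokens (for example runs of newlines inside a pasted traceback) are not punctuation, so they still count toward the length here; filtering on `tok.is_space` as well would exclude them.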

In [12]:
# Load NLP model.

nlp = spacy.load("en_core_web_sm")

In [13]:
# Get all prompts.

prompts = df["Prompt"].drop_duplicates().to_list()

In [14]:
# Run the spaCy pipeline over all prompts using a thread pool
with Pool() as pool:
    doc = pool.map(nlp, tqdm(prompts, total=len(prompts)))
    

100%|██████████| 426/426 [00:00<00:00, 65690.20it/s]


In [15]:
# Keep only the non-punctuation tokens of each prompt
tokens = []
for d in tqdm(doc):
    token_without_punc = [token for token in d if not token.is_punct]
    tokens.append(token_without_punc)

100%|██████████| 426/426 [00:00<00:00, 79443.93it/s]


In [16]:
# Calculate the average prompt length and locate the longest and shortest prompts
ls_len_promot = [len(token) for token in tokens]
idx_max = np.argmax(ls_len_promot)
idx_min = np.argmin(ls_len_promot)
print(
    f"Average length of prompt (tokens): {np.mean(ls_len_promot):.2f}\n"
    f"---------------------------\n"
    f"Maximum length of prompt: {prompts[idx_max]}\n"
    f"---------------------------\n"
    f"Minimum length of prompt: {prompts[idx_min]}"
)

Average length of prompt (tokens): 67.92
---------------------------
Maximum length of prompt: ---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-33-22ef21b7d160> in <cell line: 10>()
      8 
      9 # Now, you can use the model for prediction
---> 10 prediction = model.predict(img_expanded)
     11 print(f"This digit is probably a {np.argmax(prediction)}")
     12 

1 frames
/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     50   try:
     51     ctx.ensure_initialized()
---> 52     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     53                                         inputs, attrs, num_outputs)
     54   except core._NotOkStatusException as e:

InvalidArgumentError: Graph execution error:

Detected at node 'sequential/conv2d/BiasAdd' defined at (m

In [17]:
# Create a dataframe for histogram

df_prompt_count = pd.DataFrame(ls_len_promot, columns=["length_prompt"])
df_prompt_count.head()

Unnamed: 0,length_prompt
0,29
1,505
2,11
3,48
4,108


In [18]:
# Check the histogram
inter_bar = alt.Chart(df_prompt_count).mark_bar().encode(
    x=alt.X(
        "length_prompt:Q",  # token counts are quantitative, so bin them
        bin=alt.Bin(maxbins=50),
        title="Length of Prompts (tokens)",
    ),
    y=alt.Y("count()", title="Count"),
).properties(
    title={
        "text": "Distribution of the length of prompt with ChatGPT",
        "subtitle": "Merged dataset: issues, discussions, and pull requests"
    }
)

# Vertical rule at the mean prompt length.
rule = alt.Chart(df_prompt_count).mark_rule().encode(
    x=alt.X("mean(length_prompt):Q"),
)

alt.layer(inter_bar, rule)

[Figure: histogram "Distribution of the length of prompt with ChatGPT" (x-axis: Length of Prompts (tokens); y-axis: Count), with a rule at the mean.]


## Average Length of Answer

Calculate the average length of answers (measured in tokens) for issues, discussions, and pull requests.

Method:

1. Tokenize the answers in the merged dataset.
2. Remove punctuation.
3. Calculate the average length of answers.
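
Step 3 can be sketched as a small helper that reports the mean length and returns short previews of the longest and shortest texts rather than the full strings. `summarize_lengths` and its `preview` parameter are hypothetical names introduced here, and the sample data is invented; the length check is an added safeguard against pairing texts with a length list computed from a different column.

```python
import numpy as np


def summarize_lengths(texts, lengths, preview=40):
    """Return the mean length plus truncated previews of the extreme texts."""
    if len(texts) != len(lengths):
        raise ValueError("texts and lengths must have the same number of items")
    lengths = np.asarray(lengths)
    idx_max = int(np.argmax(lengths))  # position of the longest text
    idx_min = int(np.argmin(lengths))  # position of the shortest text
    return {
        "mean": float(lengths.mean()),
        "longest": texts[idx_max][:preview],
        "shortest": texts[idx_min][:preview],
    }


# Invented sample: three texts with their (pretend) token counts.
sample_texts = ["short answer", "a much longer answer with many more tokens", "ok"]
sample_lengths = [2, 8, 1]
summary = summarize_lengths(sample_texts, sample_lengths, preview=20)
print(summary)
```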

In [19]:
# Get all answers.

answers = df["Answer"].drop_duplicates().to_list()

In [20]:
# Run the spaCy pipeline over all answers using a thread pool
with Pool() as pool:
    doc = pool.map(nlp, tqdm(answers, total=len(answers)))

100%|██████████| 433/433 [00:00<00:00, 842362.54it/s]


In [21]:
# Keep only the non-punctuation tokens of each answer
tokens = []
for d in tqdm(doc):
    token_without_punc = [token for token in d if not token.is_punct]
    tokens.append(token_without_punc)

100%|██████████| 433/433 [00:00<00:00, 3318.97it/s]


In [22]:
# Calculate the average answer length and locate the longest and shortest answers
ls_len_answer = [len(token) for token in tokens]
idx_max = np.argmax(ls_len_answer)
idx_min = np.argmin(ls_len_answer)
print(
    f"Average length of answer (tokens): {np.mean(ls_len_answer):.2f}\n"
    f"---------------------------\n"
    f"Maximum length of answer: {answers[idx_max]}\n"
    f"---------------------------\n"
    f"Minimum length of answer: {answers[idx_min]}"
)

Average length of answer (tokens): 179.38
---------------------------
Maximum length of answer: Certainly! Here's an elaboration on the actions mentioned in the Windows response, providing more details about each mouse click and key press:Navigating the GUI:Left-clicking:Single left-click: Pressing the left mouse button once to select or activate an element, such as buttons, checkboxes, links, or menu items.Right-clicking:Single right-click: Pressing the right mouse button once to open a context menu, displaying additional options specific to the clicked item.Double-clicking:Double left-click: Rapidly pressing the left mouse button twice, with a short interval between clicks, to open files, launch applications, or perform actions associated with the clicked item.Middle-clicking:Single middle-click: Pressing the middle mouse button (usually the scroll wheel) once, typically used to open links in new tabs in web browsers or close tabs.Hovering:Moving the mouse cursor over an element withou

In [23]:
# Create a dataframe for histogram

df_answer_count = pd.DataFrame(ls_len_answer, columns=["length_answer"])
df_answer_count.head()



In [24]:
# Check the histogram
inter_bar = alt.Chart(df_answer_count).mark_bar().encode(
    x=alt.X(
        "length_answer:Q",  # token counts are quantitative, so bin them
        bin=alt.Bin(maxbins=50),
        title="Length of Answer (tokens)",
    ),
    y=alt.Y("count()", title="Count"),
).properties(
    title={
        "text": "Distribution of the length of answer with ChatGPT",
        "subtitle": "Merged dataset: issues, discussions, and pull requests"
    }
)

# Vertical rule at the mean answer length.
rule = alt.Chart(df_answer_count).mark_rule().encode(
    x=alt.X("mean(length_answer):Q"),
)

alt.layer(inter_bar, rule)

[Figure: histogram "Distribution of the length of answer with ChatGPT" (x-axis: Length of Answer (tokens); y-axis: Count), with a rule at the mean.]
