# Issue Report Classification: Few Shot Learning

### NLBSE 2024

In [None]:
%pip install pandas emoji openai tiktoken scikit-learn

## Import the requisite libraries

In [15]:
# Importing libraries
import pandas as pd
import emoji
import re
import string
import json

# Loading data from CSV files
test_data = pd.read_csv("./data/issues/issues_test.csv")
train_data = pd.read_csv("./data/issues/issues_train.csv")

## Data Visualization

In [16]:
test_data

Unnamed: 0,repo,created_at,label,title,body
0,facebook/react,2023-08-02 02:26:00,bug,Bug: [18.3.0-canary] renderToString hoists som...,<!--\r\n Please provide a clear and concise d...
1,facebook/react,2023-07-17 22:43:05,bug,[DevTools Bug]: Chrome extension gets disconne...,### Website or app\r\n\r\nhttps://react.dev/\r...
2,facebook/react,2023-07-13 19:01:47,bug,[DevTools Bug]: Deprecated __REACT_DEVTOOLS_GL...,### Website or app\n\nN/A\n\n### Repro steps\n...
3,facebook/react,2023-06-07 17:26:43,bug,[DevTools Bug]: React devtools stuck at Loadin...,### Website or app\n\ncorporate project (priva...
4,facebook/react,2023-05-31 15:17:41,bug,Bug: Radio button onChange not called in curre...,<!--\r\n Please provide a clear and concise d...
...,...,...,...,...,...
1495,opencv/opencv,2022-01-22 11:52:21,feature,Task: GCC 12 support,Support compilation with GCC 12 and fix tests\...
1496,opencv/opencv,2022-01-16 19:27:55,feature,AudioIO: add dnn speech recognition sample on C++,### Pull Request Readiness Checklist\r\n\r\nSe...
1497,opencv/opencv,2022-01-14 22:05:58,feature,Use modern OpenVINO package interface,"* new cmake options: `WITH_OPENVINO`, `OPENCV_..."
1498,opencv/opencv,2022-01-12 09:14:41,feature,TiffEncoder write support more depth type,**Merge with extra**: https://github.com/openc...


In [17]:
train_data

Unnamed: 0,repo,created_at,label,title,body
0,facebook/react,2023-08-26 06:33:37,bug,"[DevTools Bug] Cannot add node ""1"" because a n...",### Website or app\n\nPrivate repo cannot give...
1,facebook/react,2023-07-28 05:16:12,bug,[DevTools Bug]: Devtools extension build faili...,### Website or app\n\nN/A\n\n### Repro steps\n...
2,facebook/react,2023-07-13 21:58:31,bug,[DevTools Bug]: Deprecated __REACT_DEVTOOLS_GL...,### Website or app\n\nhttps://github.com/open-...
3,facebook/react,2023-06-14 02:31:20,bug,"[DevTools Bug] Cannot remove node ""0"" because ...",### Website or app\n\nlocal\n\n### Repro steps...
4,facebook/react,2023-06-03 11:29:44,bug,"[DevTools Bug] Cannot remove node ""103"" becaus...",### Website or app\n\nlocalhost\n\n### Repro s...
...,...,...,...,...,...
1495,opencv/opencv,2022-01-24 10:48:13,feature,core: FP denormals support,relates #21046\r\n\r\n- support x86 SSE FTZ+DA...
1496,opencv/opencv,2022-01-20 12:40:55,feature,feature: submodule or a class scope for export...,All classes are registered in the scope that c...
1497,opencv/opencv,2022-01-15 02:39:22,feature,Reading BigTiff images,**Merge with extra: https://github.com/opencv/...
1498,opencv/opencv,2022-01-14 15:37:53,feature,Add general broadcasting layer,Performance details(broadcasting 1x1 to 16x204...


## Data Preprocessing
### Data Cleaning: Method 1  
This notebook uses two data cleaning methods. We keep both because the repositories differ: each one produced better results with one method or the other.

In [18]:
# Initialize counters for text cleaning
cleaned_count = 0
original_count = 0

# Text cleaning function
def clean_text(text):
    global cleaned_count, original_count

    if not isinstance(text, str):
        original_count += 1
        return text

    # Remove double quotation marks
    text = text.replace('"', '')

    # Remove text starting with "DevTools" and ending with "(automated)"
    text = re.sub(r'DevTools.*?\(automated\)', '', text)

    # Lowercasing should be one of the first steps to ensure uniformity
    text = text.lower()

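    # Convert emojis to their text aliases (e.g. ":thumbs_up:"); the colons and
    # underscores are stripped by the punctuation step below
    text = emoji.demojize(text)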

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove special characters and punctuation
    text = re.sub(f"[{re.escape(string.punctuation)}]", '', text)

    # Remove '#' characters
    text = text.replace("#", "")

    # Remove consecutive whitespaces and replace with a single space
    text = re.sub(r'\s+', ' ', text)

    # Split the text into words
    words = text.split()

    # Remove words that are over 20 characters
    words = [word for word in words if len(word) <= 20]

    # Join the remaining words back into cleaned text
    cleaned_text = ' '.join(words)

    cleaned_count += 1
    return cleaned_text

test_data['body'] = test_data['body'].apply(clean_text)
test_data['title'] = test_data['title'].apply(clean_text)


print(f"Cleaned {cleaned_count} times.")
print(f"Returned original text {original_count} times.")

train_data['body'] = train_data['body'].apply(clean_text)
train_data['title'] = train_data['title'].apply(clean_text)


print(f"Cleaned {cleaned_count} times.")
print(f"Returned original text {original_count} times.")

Cleaned 2998 times.
Returned original text 2 times.
Cleaned 5998 times.
Returned original text 2 times.
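To make the effect of the individual steps concrete, the sketch below applies a subset of the Method 1 rules to a made-up issue body. It uses only `re` and `string` and leaves out the `emoji.demojize` step; the helper name `clean_text_sketch` and the sample string are illustrative assumptions, not part of the original pipeline.

```python
import re
import string

def clean_text_sketch(text, max_word_len=20):
    """Apply a subset of the Method 1 rules (no emoji handling)."""
    text = text.replace('"', '')                          # drop double quotes
    text = text.lower()                                   # normalize case
    text = re.sub(r'https?://\S+|www\.\S+', '', text)     # drop URLs
    text = re.sub(r'<.*?>', '', text)                     # drop single-line HTML tags
    text = re.sub(f"[{re.escape(string.punctuation)}]", '', text)  # drop punctuation
    # split() collapses all whitespace, including \r\n sequences
    words = [w for w in text.split() if len(w) <= max_word_len]
    return ' '.join(words)

# Hypothetical issue body, loosely modeled on the dataset's Markdown style
sample = '### Repro steps\r\n\r\nSee https://react.dev/ for the "DevTools" crash in <b>StrictMode</b>!'
print(clean_text_sketch(sample))
```

Note that the URL and HTML patterns run before the punctuation step; once characters such as `:`, `/`, `<`, and `>` are gone, those patterns would no longer match.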


## Data Division  

Next, we split the dataset into five smaller dataframes, one per project, so that each project is handled separately. The same split is applied to both the training and the test set.

In [38]:
def split_data():
    train_facebook = train_data[: 300]
    train_tensorflow = train_data[300: 600]
    train_microsoft = train_data[600: 900]
    train_bitcoin = train_data[900: 1200]
    train_opencv= train_data[1200: 1500]
    facebook = test_data[: 300]
    tensorflow = test_data[300: 600]
    microsoft = test_data[600: 900]
    bitcoin = test_data[900: 1200]
    opencv= test_data[1200: 1500]
    return train_facebook, train_tensorflow, train_microsoft, train_bitcoin, train_opencv, facebook, tensorflow, microsoft, bitcoin, opencv

train_data_facebook, train_data_tensorflow, train_data_microsoft, train_data_bitcoin, train_data_opencv, test_data_facebook, test_data_tensorflow, test_data_microsoft, test_data_bitcoin, test_data_opencv = split_data()
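The positional slices above rely on the rows being grouped by repository, 300 per project, in the same order in both files. A minimal sketch of a sanity check for that assumption follows. It works on a plain list of repository names (for the dataframes above, something like `train_data['repo'].tolist()`); the helper name `check_contiguous_blocks` and the toy block size are hypothetical.

```python
def check_contiguous_blocks(repos, block_size):
    """Return the repository name of each block; raise if a block mixes repositories."""
    if len(repos) % block_size != 0:
        raise ValueError("number of rows is not a multiple of the block size")
    block_names = []
    for start in range(0, len(repos), block_size):
        names = set(repos[start:start + block_size])
        if len(names) != 1:
            raise ValueError(
                f"rows {start}-{start + block_size - 1} mix repositories: {sorted(names)}"
            )
        block_names.append(names.pop())
    return block_names

# Toy input: two repositories with three rows each (the notebook uses 300 per project)
toy_repos = ["facebook/react"] * 3 + ["opencv/opencv"] * 3
print(check_contiguous_blocks(toy_repos, block_size=3))
```

If the check fails, selecting rows by the value of the `repo` column is a more robust alternative to fixed offsets.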

## Fine-Tuning  
We fine-tuned GPT-3.5-Turbo on the training data, aiming to outperform the standard approach of calling the GPT-4 model through the OpenAI API.

In [39]:
# Creating the API client (replace the placeholder with your actual OpenAI API key)
from openai import OpenAI
client = OpenAI(api_key = 'open-ai-api-key')

## Data Transformation  
Before fine-tuning, we convert each dataframe into a JSON Lines (JSONL) file, which serves as the training input for the fine-tuning process. Each prompt contains the title and body of one issue. The expected output of the fine-tuned model is the corresponding label for that issue: bug report, question, or feature request.

In [40]:
import tiktoken

max_content_tokens = 3999

# Truncate the message so it stays within the 4k-token limit of fine-tuned gpt-3.5 models
def truncate_message(message, max_length):
    # encoding_for_model resolves gpt-3.5-turbo to the cl100k_base encoding
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(message)
    if len(tokens) > max_length:
        truncated_tokens = tokens[:max_length]
        message = encoding.decode(truncated_tokens)
    return message

def create_conversational_data(train_data, conversational_data):

    # Open the file in write mode
    with open(conversational_data, 'w', encoding='utf-8') as f:
        # Iterate over the rows in the DataFrame
        for index, row in train_data.iterrows():
            # Create the user message by formatting the prompt with the title and body
            user_message = f"Classify, IN ONLY 1 WORD, the following GitHub issue as 'feature', 'bug', or 'question' based on its title and body:\n{row['title']}\n{row['body']}"
            
            # Truncate the prompt if necessary
            user_message = truncate_message(user_message, max_content_tokens)

            # Create the assistant message by taking the label
            assistant_message = row['label']
            
            # Construct the conversation object
            conversation_object = {
                "messages": [
                    {"role": "system", "content": "GitHub Issue Report Classifier"},
                    {"role": "user", "content": user_message},
                    {"role": "assistant", "content": assistant_message}
                ]
            }
            
            # Write the conversation object to one line in the file
            f.write(json.dumps(conversation_object, ensure_ascii=False) + '\n')
    return conversational_data
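Each line of the resulting file is one standalone JSON object holding a three-message conversation. The sketch below writes a single made-up record in that shape and reads it back to check the structure. The file lives in a temporary directory, and the helper name `validate_jsonl` and the sample issue text are assumptions for illustration.

```python
import json
import os
import tempfile

EXPECTED_ROLES = ["system", "user", "assistant"]
VALID_LABELS = {"bug", "feature", "question"}

def validate_jsonl(path):
    """Return the number of records; raise ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            roles = [m["role"] for m in record["messages"]]
            if roles != EXPECTED_ROLES:
                raise ValueError(f"line {line_no}: unexpected roles {roles}")
            # The assistant message must be exactly one of the three labels
            if record["messages"][-1]["content"] not in VALID_LABELS:
                raise ValueError(f"line {line_no}: unexpected label")
            count += 1
    return count

# Hypothetical record in the same shape as the ones written above
record = {
    "messages": [
        {"role": "system", "content": "GitHub Issue Report Classifier"},
        {"role": "user", "content": "Classify ... :\nbutton does nothing\nclicking save has no effect"},
        {"role": "assistant", "content": "bug"},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    n_records = validate_jsonl(path)

print(n_records)
```

Running a check like this before uploading can catch missing labels (for example, rows where the label is not a string) early, before a fine-tuning job is queued.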

## Training file  
With the JSONL file generated, we have the conversation data our fine-tuned model will be trained on. We can now upload this training file through the OpenAI API to start the training process.

In [41]:
def create_training_file(conversational_data):
  ## Uploading a training file
  # Use a context manager so the file handle is closed after the upload
  with open(conversational_data, "rb") as f:
    training_file = client.files.create(
      file=f,
      purpose="fine-tune"
    )
  return training_file

## Model creation  
Finally, we create the model. Each model is given a repository-specific suffix (for example, 'fb-issueclassifier').

In [42]:
def create_fine_tuned_model(model_training_file, model_sufix, model="gpt-3.5-turbo"):
  ## Creating a fine-tuned model
  # `model` defaults to gpt-3.5-turbo so the two-argument call below keeps working
  fine_tuning_job = client.fine_tuning.jobs.create(
    training_file=model_training_file.id,
    model=model,
    suffix=model_sufix
  )
  return fine_tuning_job

In [43]:
def fine_tuning_and_training(train_data, conversational_data, model_sufix):
    model_conversational_data = create_conversational_data(train_data, conversational_data)
    trained_file = create_training_file(model_conversational_data)
    fine_tuned_model = create_fine_tuned_model(trained_file, model_sufix)
    return fine_tuned_model

## Facebook repository dataset fine-tuning process

In [44]:
facebook_ft_job_file = fine_tuning_and_training(train_data_facebook, 'data/conversationaldata/conversational_data_facebook.jsonl', "fb-issueclassifier")

In [47]:
# Retrieving the state of a fine-tune
facebook_ft_model = client.fine_tuning.jobs.retrieve(facebook_ft_job_file.id).fine_tuned_model
print(facebook_ft_model) # This fine-tuning job took around 40 min to be completed

ft:gpt-3.5-turbo-0613:northern-arizona-university-nau:fb-issueclassifier:8wuvNgeu


### Wait Before Continuing

For each repository (facebook, tensorflow, microsoft, bitcoin, opencv), please wait until the fine-tuning job is done. The job is finished once the snippet above prints a model name instead of "None". You can also run the snippet below to track the progress of your fine-tuning job by checking the latest events.

In [None]:
# You can track the progress of your fine-tuning job by listing the latest events. On our models it took about 3 hours to fine-tune each model
client.fine_tuning.jobs.list_events(fine_tuning_job_id=facebook_ft_job_file.id, limit=20)
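Instead of re-running the retrieval cell by hand, the wait can be scripted. The following is a minimal polling sketch and not part of the original notebook: `wait_for_job` is a hypothetical helper that takes any function returning the current job object, so with the client above one would pass `lambda: client.fine_tuning.jobs.retrieve(facebook_ft_job_file.id)`. It assumes the job object exposes `status` and `fine_tuned_model` and that the terminal states are `succeeded`, `failed`, and `cancelled`. A stand-in job is used here so the example runs without an API key.

```python
import time
from types import SimpleNamespace

# Assumed terminal states of a fine-tuning job
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_job, poll_seconds=60, max_polls=300):
    """Poll fetch_job() until the job finishes; return the fine-tuned model name."""
    for _ in range(max_polls):
        job = fetch_job()
        if job.status in TERMINAL_STATES:
            if job.status != "succeeded":
                raise RuntimeError(f"fine-tuning job ended with status {job.status!r}")
            return job.fine_tuned_model
        time.sleep(poll_seconds)
    raise TimeoutError("fine-tuning job did not finish in time")

# Stand-in for the API: the job is "running" on the first poll, "succeeded" on the second
_states = iter([
    SimpleNamespace(status="running", fine_tuned_model=None),
    SimpleNamespace(status="succeeded", fine_tuned_model="ft:example-model"),
])
model_name = wait_for_job(lambda: next(_states), poll_seconds=0)
print(model_name)
```

Raising on `failed` or `cancelled` keeps a broken job from silently yielding `None` as the model name in the evaluation steps.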

## Tensorflow repository dataset fine-tuning process

In [50]:
tensorflow_ft_job_file = fine_tuning_and_training(train_data_tensorflow, 'data/conversationaldata/conversational_data_tensorflow.jsonl', "tf-issueclassifier")


In [52]:
tensorflow_ft_model = client.fine_tuning.jobs.retrieve(tensorflow_ft_job_file.id).fine_tuned_model # This fine-tuning job took around 35 min to be completed
print(tensorflow_ft_model)

ft:gpt-3.5-turbo-0613:northern-arizona-university-nau:tf-issueclassifier:8wviPsLJ


## Microsoft repository dataset fine-tuning process

In [53]:
microsoft_ft_job_file = fine_tuning_and_training(train_data_microsoft, 'data/conversationaldata/conversational_data_microsoft.jsonl', "ms-issueclassifier")

In [56]:
microsoft_ft_model = client.fine_tuning.jobs.retrieve(microsoft_ft_job_file.id).fine_tuned_model # This fine-tuning job took around 45 min to be completed
print(microsoft_ft_model)

ft:gpt-3.5-turbo-0613:northern-arizona-university-nau:ms-issueclassifier:8wwuk30C


## Bitcoin repository dataset fine-tuning process

In [57]:
bitcoin_ft_job_file = fine_tuning_and_training(train_data_bitcoin, 'data/conversationaldata/conversational_data_bitcoin.jsonl', "bc-issueclassifier")

In [58]:
bitcoin_ft_model = client.fine_tuning.jobs.retrieve(bitcoin_ft_job_file.id).fine_tuned_model # This fine-tuning job took around 35 min to be completed
print(bitcoin_ft_model)

ft:gpt-3.5-turbo-0613:northern-arizona-university-nau:bc-issueclassifier:8wxVmc7I


## OpenCV repository dataset fine-tuning process

In [59]:
opencv_ft_job_file = fine_tuning_and_training(train_data_opencv, 'data/conversationaldata/conversational_data_opencv.jsonl', "oc-issueclassifier")

In [60]:
opencv_ft_model = client.fine_tuning.jobs.retrieve(opencv_ft_job_file.id).fine_tuned_model
print(opencv_ft_model)

ft:gpt-3.5-turbo-0613:northern-arizona-university-nau:oc-issueclassifier:8wyNcwmv


## Fine-tuning results  
All models were fine-tuned successfully using the default of 3 epochs. The whole process took approximately 5 hours; processing time may vary with the length of the queue at any given moment.

**Please remember to wait until the respective fine-tuning model job is completed for each repository before trying to evaluate the model in the steps below**

## Utilizing Fine-tuned model  
Next, the OpenAI chat completions API is used to invoke the fine-tuned model and assess its performance on the test dataset.

In [61]:
import openai
import time
import pandas as pd
import re
import concurrent.futures
import tiktoken
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix, classification_report

# Replace 'open-ai-api-key' with your actual OpenAI API key
openai.api_key = 'open-ai-api-key'

# max_tokens here should be one since 'bug', 'feature', and 'question' are one token long. This might change for future versions of the model and API.
def query_chatgpt(prompt, model, temperature=0.0,  max_tokens=1, max_retries=5):
    """
    Function to query a (fine-tuned) GPT-3.5 model with a given prompt, with retries for timeouts.

    :param prompt: Prompt string to send to the model
    :param model: The model to use, e.g. the name of a fine-tuned GPT-3.5 model
    :param temperature: Sampling temperature (0.0 for the most deterministic output)
    :param max_tokens: Maximum number of tokens to generate
    :param max_retries: Maximum number of retries for timeout
    :return: Response from the model or None if all retries fail
    """
    attempt = 0
    max_content_tokens = 3999
    # encoding_for_model resolves gpt-3.5-turbo to the cl100k_base encoding
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

    # Function to truncate the message and avoid passing the limit of 4k tokens per gpt-3.5 fine-tuned model limitations
    def truncate_message(message, max_length):
        tokens = encoding.encode(message)
        if len(tokens) > max_length:
            truncated_tokens = tokens[:max_length]
            message = encoding.decode(truncated_tokens)
        return message

    # Truncate the prompt if necessary
    prompt = truncate_message(prompt, max_content_tokens)

    while attempt < max_retries:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future = executor.submit(
                openai.chat.completions.create,
                model=model,
                messages=[{"role": "system", "content": "GitHub Issue Report Classifier"}, {"role": "user", "content": prompt}],
                max_tokens=max_tokens,
                temperature=temperature
            )
            try:
                response = future.result(timeout=5)  # 5 seconds timeout
                return response.choices[0].message.content
            except concurrent.futures.TimeoutError:
                print(f"Attempt {attempt + 1}/{max_retries} - Request timed out. Retrying...")
            except Exception as e:
                print(f"Attempt {attempt + 1}/{max_retries} - An error occurred: {e}")
            finally:
                attempt += 1

    print("Failed to get a response after several retries.")
    return None
    
labels = ['feature', 'bug', 'question']

## Model Testing
The function defined above is called with the fine-tuned model of each repository and evaluated on that repository's test set. Note the pause between requests, which keeps the run within the API's tokens-per-minute limit. The result of each iteration is printed for tracking and improvement purposes.

In [66]:
def test_model(test_data, ft_model):
    y_true = []
    y_pred = []
    iterations = len(test_data)

    # Now let's loop through the test data and classify the GitHub issues
    for i in range(iterations):
        correct_label = test_data.iloc[i]['label'].lower()
        description = f"{test_data.iloc[i]['title']} \n {test_data.iloc[i]['body']}"
        print(f"Correct GitHub Issue type: {correct_label}")
        
        prompt = f"Classify, IN ONLY 1 WORD, the following GitHub issue as 'feature', 'bug', or 'question' based on its title and body:\n{description}"
        response = query_chatgpt(prompt, ft_model)
        
        if response is None:
            print("Failed to get a response after several retries. Skipping this item.")
            continue  # Skip this iteration and move to the next one
        
        # Clean the response to keep only letters
        predicted_label = re.sub(r'[^A-Za-z]+', '', response).lower().strip()
        print(f"Predicted GitHub Issue type: {predicted_label}")
        
        # Append to lists for evaluation
        y_true.append(correct_label)
        y_pred.append(predicted_label)
        time.sleep(6)  # Wait 6 seconds before the next request since there is a tokens-per-minute limit

    return y_true, y_pred

# See output on outputs/cell51output.txt
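Because the model's reply is free text, it can occasionally fall outside the three expected labels, in which case it cannot match any true label. A small sketch of how a reply could be mapped onto the label set, with an explicit fallback, is shown below. `normalize_prediction` and the `'unknown'` fallback value are assumptions, not part of the original pipeline.

```python
import re

LABELS = ("feature", "bug", "question")

def normalize_prediction(response, fallback="unknown"):
    """Map a raw model reply onto one of LABELS, or return the fallback."""
    if response is None:
        return fallback
    # Same cleaning as in test_model: keep letters only, then lowercase
    cleaned = re.sub(r'[^A-Za-z]+', '', response).lower()
    if cleaned in LABELS:
        return cleaned
    # Accept longer replies that start with a label, such as "Feature request"
    for label in LABELS:
        if cleaned.startswith(label):
            return label
    return fallback

for reply in [" Bug.", "Feature request", "enhancement", None]:
    print(repr(reply), "->", normalize_prediction(reply))
```

Counting how often the fallback is returned gives a quick indication of how many predictions fall outside the expected label set.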

## Calculating the results  
Once all test data has been evaluated with the corresponding fine-tuned models, we use the two generated lists, the true labels and the predicted labels, to compute the results.

For tracking purposes, a CSV file has been generated for each result.

In [82]:
def calculate_metrics(y_true, y_pred, cm_sheet):
    # Calculate weighted average F1-score, precision, and recall
    f1 = f1_score(y_true, y_pred, labels=labels, average='weighted')
    precision = precision_score(y_true, y_pred, labels=labels, average='weighted')
    recall = recall_score(y_true, y_pred, labels=labels, average='weighted')

    # Calculate confusion matrix
    cm = confusion_matrix(y_true, y_pred, labels=labels)

    cm_df = pd.DataFrame(cm, index=labels, columns=labels)

    # Calculate TP, FP, FN, TN for each label from the confusion matrix
    results = {}
    for i, label in enumerate(labels):
        results[label] = {'TP': cm[i, i]}
        results[label]['FP'] = cm[:, i].sum() - cm[i, i]
        results[label]['FN'] = cm[i, :].sum() - cm[i, i]
        results[label]['TN'] = cm.sum() - (results[label]['TP'] + results[label]['FP'] + results[label]['FN'])

    # Print the per-label counts
    for label, metrics in results.items():
        print(f"{label}: {metrics}")

    # Save the results to CSV (one row per label, in the order of `labels`)
    results_df = pd.DataFrame(results).T
    results_df['F1-score'] = f1
    results_df['Recall'] = recall
    results_df['Precision'] = precision

    results_df.to_csv(cm_sheet, index=False)

    print(f"Precision = {precision}")
    print(f"Recall = {recall}")
    print(f"F1-score = {f1}")

## Evaluating the metrics  
Below, the metrics for each label are presented to facilitate a more precise evaluation.

In [68]:
def evaluating_metrics(y_true, y_pred):
    # Create a classification report
    report = classification_report(y_true, y_pred, labels=['bug', 'feature', 'question'], target_names=['bug', 'feature', 'question'], zero_division=0, output_dict=True)

    # Convert the report to a DataFrame
    report_df = pd.DataFrame(report).transpose()

    # Print the classification report
    print(report_df)
    return report_df

## Facebook Repo Testing

In [None]:
y_true_facebook, y_pred_facebook = test_model(test_data_facebook, facebook_ft_model) ## See results in outputs/cell48output.txt
## This specific model testing took 30 minutes to be completed 

**Wait for model testing to finish before proceeding. This can take up to 3 hours. You will know it is over when it stops printing predicted vs. actual values.**

In [83]:
calculate_metrics(y_true_facebook, y_pred_facebook, 'metrics/confusion_matrix_fb.csv')

feature: {'TP': 91, 'FP': 18, 'FN': 9, 'TN': 182}
bug: {'TP': 94, 'FP': 25, 'FN': 6, 'TN': 175}
question: {'TP': 66, 'FP': 6, 'FN': 34, 'TN': 194}
Precision = 0.8471483394581074
Recall = 0.8366666666666667
F1-score = 0.8322342487262593


In [84]:
facebook_complete_metrics = evaluating_metrics(y_true_facebook, y_pred_facebook)

              precision    recall  f1-score     support
bug            0.789916  0.940000  0.858447  100.000000
feature        0.834862  0.910000  0.870813  100.000000
question       0.916667  0.660000  0.767442  100.000000
accuracy       0.836667  0.836667  0.836667    0.836667
macro avg      0.847148  0.836667  0.832234  300.000000
weighted avg   0.847148  0.836667  0.832234  300.000000
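The per-label counts and the classification report describe the same results, which can be checked by hand. The sketch below recomputes precision, recall, and F1 for the Facebook model from the TP/FP/FN counts printed above; the helper name `prf` is an assumption, and the formulas are the standard definitions.

```python
# TP/FP/FN counts copied from the calculate_metrics output for the Facebook model
counts = {
    "feature": {"TP": 91, "FP": 18, "FN": 9},
    "bug": {"TP": 94, "FP": 25, "FN": 6},
    "question": {"TP": 66, "FP": 6, "FN": 34},
}

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

per_class = {label: prf(c["TP"], c["FP"], c["FN"]) for label, c in counts.items()}
for label, (p, r, f1) in per_class.items():
    print(f"{label:9s} precision={p:.6f} recall={r:.6f} f1={f1:.6f}")

# Unweighted mean of the per-class F1 scores
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)
print(f"macro F1 = {macro_f1:.6f}")
```

The values agree with the report to six decimals. Since every class has the same support (100 issues), the weighted average coincides with the macro average here.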


## Tensorflow Repo Testing

In [None]:
y_true_tensorflow, y_pred_tensorflow = test_model(test_data_tensorflow, tensorflow_ft_model) ## See results in outputs/cell53output.txt
## This specific model testing took 32 minutes to be completed 

**Wait for model testing to finish before proceeding. This can take up to 3 hours. You will know it is over when it stops printing predicted vs. actual values.**

In [87]:
calculate_metrics(y_true_tensorflow, y_pred_tensorflow, 'metrics/confusion_matrix_tf.csv')

feature: {'TP': 77, 'FP': 1, 'FN': 23, 'TN': 199}
bug: {'TP': 87, 'FP': 8, 'FN': 13, 'TN': 192}
question: {'TP': 94, 'FP': 33, 'FN': 6, 'TN': 167}
Precision = 0.8810421470595527
Recall = 0.86
F1-score = 0.8618900214108847


### F1-score: 86.19%

In [88]:
tensorflow_complete_metrics = evaluating_metrics(y_true_tensorflow, y_pred_tensorflow)

              precision  recall  f1-score  support
bug            0.915789    0.87  0.892308   100.00
feature        0.987179    0.77  0.865169   100.00
question       0.740157    0.94  0.828194   100.00
accuracy       0.860000    0.86  0.860000     0.86
macro avg      0.881042    0.86  0.861890   300.00
weighted avg   0.881042    0.86  0.861890   300.00


## Microsoft Repo Testing

In [None]:
y_true_microsoft, y_pred_microsoft = test_model(test_data_microsoft, microsoft_ft_model) ## See results in outputs/cell59output.txt
## This specific model testing took 33 minutes to be completed 

**Wait for model testing to finish before proceeding. This can take up to 3 hours. You will know it is over when it stops printing predicted vs. actual values.**

In [90]:
calculate_metrics(y_true_microsoft, y_pred_microsoft, 'metrics/confusion_matrix_ms.csv')

feature: {'TP': 84, 'FP': 24, 'FN': 15, 'TN': 175}
bug: {'TP': 82, 'FP': 22, 'FN': 17, 'TN': 177}
question: {'TP': 71, 'FP': 15, 'FN': 29, 'TN': 183}
Precision = 0.7972735705293845
Recall = 0.79
F1-score = 0.7916849121782707


### F1-score: 79.17%

In [91]:
microsoft_complete_metrics = evaluating_metrics(y_true_microsoft, y_pred_microsoft)

              precision  recall  f1-score  support
bug            0.788462    0.82  0.803922    100.0
feature        0.777778    0.84  0.807692    100.0
question       0.825581    0.71  0.763441    100.0
micro avg      0.795302    0.79  0.792642    300.0
macro avg      0.797274    0.79  0.791685    300.0
weighted avg   0.797274    0.79  0.791685    300.0


## Bitcoin Repo Testing

In [None]:
y_true_bitcoin, y_pred_bitcoin = test_model(test_data_bitcoin, bitcoin_ft_model) ## See results in outputs/cell65output.txt
## This specific model testing took 32 minutes to be completed 

**Wait for model testing to finish before proceeding. This can take up to 3 hours. You will know it is over when it stops printing predicted vs. actual values.**

In [94]:
calculate_metrics(y_true_bitcoin, y_pred_bitcoin, "metrics/confusion_matrix_bc.csv")

feature: {'TP': 92, 'FP': 18, 'FN': 7, 'TN': 182}
bug: {'TP': 77, 'FP': 28, 'FN': 23, 'TN': 171}
question: {'TP': 61, 'FP': 23, 'FN': 39, 'TN': 176}
Precision = 0.7652958152958153
Recall = 0.7666666666666667
F1-score = 0.7634844888821559


### F1-score: 76.35%

In [95]:
bitcoin_complete_metrics = evaluating_metrics(y_true_bitcoin, y_pred_bitcoin)

              precision    recall  f1-score  support
bug            0.733333  0.770000  0.751220    100.0
feature        0.836364  0.920000  0.876190    100.0
question       0.726190  0.610000  0.663043    100.0
micro avg      0.769231  0.766667  0.767947    300.0
macro avg      0.765296  0.766667  0.763484    300.0
weighted avg   0.765296  0.766667  0.763484    300.0


## OpenCV Repo Testing

In [None]:
y_true_opencv, y_pred_opencv = test_model(test_data_opencv, opencv_ft_model) ## See results in outputs/cell65output.txt
## This specific model testing took 32 minutes to be completed 

**Wait for model testing to finish before proceeding. This can take up to 3 hours. You will know it is over when it stops printing predicted vs. actual values.**

In [97]:
calculate_metrics(y_true_opencv, y_pred_opencv, "metrics/confusion_matrix_oc.csv")

feature: {'TP': 81, 'FP': 12, 'FN': 19, 'TN': 187}
bug: {'TP': 81, 'FP': 29, 'FN': 19, 'TN': 170}
question: {'TP': 78, 'FP': 18, 'FN': 21, 'TN': 182}
Precision = 0.80661045943304
Recall = 0.8
F1-score = 0.8022417257058263


### F1-score: 80.22%

In [98]:
opencv_complete_metrics = evaluating_metrics(y_true_opencv, y_pred_opencv)

              precision  recall  f1-score  support
bug            0.736364    0.81  0.771429    100.0
feature        0.870968    0.81  0.839378    100.0
question       0.812500    0.78  0.795918    100.0
micro avg      0.802676    0.80  0.801336    300.0
macro avg      0.806610    0.80  0.802242    300.0
weighted avg   0.806610    0.80  0.802242    300.0


## Initial results  
With results gathered for every repository and its fine-tuned model, we can now average the per-repository metrics to derive the overall results.

In [133]:
opencv_complete_metrics.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, bug to weighted avg
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   precision  6 non-null      float64
 1   recall     6 non-null      float64
 2   f1-score   6 non-null      float64
 3   support    6 non-null      float64
dtypes: float64(4)
memory usage: 412.0+ bytes


In [145]:
def calculate_average(values):
    total = 0
    for value in values:
        total += value
    return total/len(values)

# Calculating overall (average) values  
overall_bug_f1 = calculate_average([facebook_complete_metrics['f1-score'].loc['bug'], tensorflow_complete_metrics['f1-score'].loc['bug'], microsoft_complete_metrics['f1-score'].loc['bug'], bitcoin_complete_metrics['f1-score'].loc['bug'], opencv_complete_metrics['f1-score'].loc['bug']])
overall_bug_precision = calculate_average([facebook_complete_metrics.precision.loc['bug'], tensorflow_complete_metrics.precision.loc['bug'], microsoft_complete_metrics.precision.loc['bug'], bitcoin_complete_metrics.precision.loc['bug'], opencv_complete_metrics.precision.loc['bug']])
overall_bug_recall = calculate_average([facebook_complete_metrics.recall.loc['bug'], tensorflow_complete_metrics.recall.loc['bug'], microsoft_complete_metrics.recall.loc['bug'], bitcoin_complete_metrics.recall.loc['bug'], opencv_complete_metrics.recall.loc['bug']])
overall_feature_f1 = calculate_average([facebook_complete_metrics['f1-score'].loc['feature'], tensorflow_complete_metrics['f1-score'].loc['feature'], microsoft_complete_metrics['f1-score'].loc['feature'], bitcoin_complete_metrics['f1-score'].loc['feature'], opencv_complete_metrics['f1-score'].loc['feature']])
overall_feature_precision = calculate_average([facebook_complete_metrics.precision.loc['feature'], tensorflow_complete_metrics.precision.loc['feature'], microsoft_complete_metrics.precision.loc['feature'], bitcoin_complete_metrics.precision.loc['feature'], opencv_complete_metrics.precision.loc['feature']])
overall_feature_recall = calculate_average([facebook_complete_metrics.recall.loc['feature'], tensorflow_complete_metrics.recall.loc['feature'], microsoft_complete_metrics.recall.loc['feature'], bitcoin_complete_metrics.recall.loc['feature'], opencv_complete_metrics.recall.loc['feature']])
overall_question_f1 = calculate_average([facebook_complete_metrics['f1-score'].loc['question'], tensorflow_complete_metrics['f1-score'].loc['question'], microsoft_complete_metrics['f1-score'].loc['question'], bitcoin_complete_metrics['f1-score'].loc['question'], opencv_complete_metrics['f1-score'].loc['question']])
overall_question_precision = calculate_average([facebook_complete_metrics.precision.loc['question'], tensorflow_complete_metrics.precision.loc['question'], microsoft_complete_metrics.precision.loc['question'], bitcoin_complete_metrics.precision.loc['question'], opencv_complete_metrics.precision.loc['question']])
overall_question_recall = calculate_average([facebook_complete_metrics.recall.loc['question'], tensorflow_complete_metrics.recall.loc['question'], microsoft_complete_metrics.recall.loc['question'], bitcoin_complete_metrics.recall.loc['question'], opencv_complete_metrics.recall.loc['question']])
overall_average_f1 = calculate_average([facebook_complete_metrics['f1-score'].loc['macro avg'], tensorflow_complete_metrics['f1-score'].loc['macro avg'], microsoft_complete_metrics['f1-score'].loc['macro avg'], bitcoin_complete_metrics['f1-score'].loc['macro avg'], opencv_complete_metrics['f1-score'].loc['macro avg']])
overall_average_precision = calculate_average([facebook_complete_metrics.precision.loc['macro avg'], tensorflow_complete_metrics.precision.loc['macro avg'], microsoft_complete_metrics.precision.loc['macro avg'], bitcoin_complete_metrics.precision.loc['macro avg'], opencv_complete_metrics.precision.loc['macro avg']])
overall_average_recall = calculate_average([facebook_complete_metrics.recall.loc['macro avg'], tensorflow_complete_metrics.recall.loc['macro avg'], microsoft_complete_metrics.recall.loc['macro avg'], bitcoin_complete_metrics.recall.loc['macro avg'], opencv_complete_metrics.recall.loc['macro avg']])

print("Overall Results: ")
# Formatting the results
formatted_metrics = {
    "Bug": {
        "Precision": overall_bug_precision, 
        "Recall": overall_bug_recall, 
        "F1-Score": overall_bug_f1
    },
    "Feature": {
        "Precision": overall_feature_precision, 
        "Recall": overall_feature_recall, 
        "F1-Score": overall_feature_f1
    },
    "Question": {
        "Precision": overall_question_precision, 
        "Recall": overall_question_recall, 
        "F1-Score": overall_question_f1
    },
    "Average": {
        "Precision": overall_average_precision, 
        "Recall": overall_average_recall, 
        "F1-Score": overall_average_f1
    }
}
formatted_metrics

Overall Results: 


{'Bug': {'Precision': 0.7927727896458546,
  'Recall': 0.842,
  'F1-Score': 0.8154649666286623},
 'Feature': {'Precision': 0.8614302057154972,
  'Recall': 0.85,
  'F1-Score': 0.8518485917359564},
 'Question': {'Precision': 0.8042192037041882,
  'Recall': 0.74,
  'F1-Score': 0.7636076797774194},
 'Average': {'Precision': 0.8194740663551799,
  'Recall': 0.8106666666666668,
  'F1-Score': 0.8103070793806794}}
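The overall figures are plain arithmetic means over the five repositories. As a quick cross-check, the sketch below averages the macro F1 scores taken from the five classification reports above, rounded to six decimals as printed there; the dictionary name is an assumption.

```python
from statistics import mean

# Macro-average F1 per repository, copied from the classification reports above
macro_f1_by_repo = {
    "facebook": 0.832234,
    "tensorflow": 0.861890,
    "microsoft": 0.791685,
    "bitcoin": 0.763484,
    "opencv": 0.802242,
}

overall_f1 = mean(list(macro_f1_by_repo.values()))
print(f"Overall F1 = {overall_f1:.4f}")
```

The result matches the 'Average' F1-Score reported above to four decimals.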

## Data Cleaning: Method 2  
After analyzing these results, we identified ways to improve the cleaning step and implemented a new cleaning function.  
Method 2 focuses on stripping Markdown formatting and on replacing certain text elements with placeholders, rather than deleting them, to preserve the intended meaning.

In [146]:
# Function to convert Markdown to plain text
def strip_markdown(text):
    # Remove Markdown links
    text = re.sub(r'\[([^\]]*)\]\([^\)]*\)', r'\1', text)
    
    # Remove Markdown emphasis (* or _)
    text = re.sub(r'(\*|_)(.*?)\1', r'\2', text)
    
    # Remove Markdown inline code (`)
    text = re.sub(r'`([^`]+)`', r'\1', text)
    
    # Remove Markdown headers (##, ###, etc.)
    text = re.sub(r'#+\s*(.*?)\n', r'\1\n', text)
    
    # Remove other Markdown elements as needed
    
    return text

# Initialize counters for text cleaning
cleaned_count = 0
original_count = 0

def clean_text(text):
    global cleaned_count, original_count

    if not isinstance(text, str):
        original_count += 1
        return text

######################################
#        Standardize The Text        #
######################################

    # Lowercasing should be one of the first steps to ensure uniformity
    text = text.lower()

######################################
#         Remove Characters          #
######################################

    # Convert emojis to their text aliases, then remove special characters and punctuation
    text = emoji.demojize(text)
    text = re.sub(f"[{re.escape(string.punctuation)}]", '', text)

######################################
#         Remove/Replace Text        #
######################################

    # Remove specific phrases "Website or app" and "local react development"
    text = text.replace("website or app", "")
    text = text.replace("local react development", "")

    # Replace URLs, HTML tags, user mentions, and markdown image references
    text = re.sub(r'https?://\S+|www\.\S+', '<URL>', text)
    text = re.sub(r'<.*?>', '<HTML_TAG>', text)
    text = re.sub(r'@\w+', '<USER>', text)
    text = re.sub(r'!\[image\]\(.*?\)', '<IMAGE>', text)

    # Remove text starting with "DevTools" and ending with "(automated)"
    text = re.sub(r'DevTools.*?\(automated\)', '', text)



    # Strip markdown formatting
    text = strip_markdown(text)

######################################
#        Tidy Up Whitespaces         #
######################################

    # Remove consecutive whitespaces and replace with a single space
    text = re.sub(r'\s+', ' ', text)

######################################
#            Final Things            #
######################################

    # Tokenize the text into words
    words = text.split()

    # Remove words that are over 20 characters
    words = [word for word in words if len(word) <= 20]

    # Join the remaining words back into cleaned text
    cleaned_text = ' '.join(words)

    cleaned_count += 1
    return cleaned_text

# Applying clean_text function to test and train data
test_data['body'] = test_data['body'].apply(clean_text)
test_data['title'] = test_data['title'].apply(clean_text)

train_data['body'] = train_data['body'].apply(clean_text)
train_data['title'] = train_data['title'].apply(clean_text)

# Displaying cleaning statistics
print(f"Cleaned {cleaned_count} times.")
print(f"Returned original text {original_count} times.")

Cleaned 5998 times.
Returned original text 2 times.


In [147]:
train_data_facebook2, train_data_tensorflow2, train_data_microsoft2, train_data_bitcoin2, train_data_opencv2, test_data_facebook2, test_data_tensorflow2, test_data_microsoft2, test_data_bitcoin2, test_data_opencv2 = split_data()

## Improved models  
Upon analyzing the step metrics of the fine-tuned models, it became evident that certain models, specifically those associated with the TensorFlow, Microsoft, and OpenCV repositories, exhibited training_loss figures indicating potential for improvement.

Considering this insight, we opted to develop new fine-tuned models, augmenting the epochs and integrating the enhanced cleaning method for these specific repositories.

## TensorFlow Improved Model
For the improved TensorFlow model, we used cleaning method 2 and 10 epochs in the fine-tuning process. All the initial models shown earlier in this notebook used cleaning method 1 and 3 epochs in the fine-tuning process.

### Training and Fine-tuning

In [None]:
## Create new tensorflow conversational data with cleaning method 2
tensorflow_conversational_data_new = create_conversational_data(train_data_tensorflow2, "data/conversationaldata/conversational_data_tensorflow_new.jsonl")
## Create new tensorflow training file
tensorflow_training_file_new = create_training_file(tensorflow_conversational_data_new)

## Creating new tensorflow fine-tuning job using gpt-3.5-turbo-1106 baseline model, 10 epochs, and cleaning method 2
tensorflow_fine_tuning_job_new = client.fine_tuning.jobs.create(
    training_file = tensorflow_training_file_new.id, 
    model="gpt-3.5-turbo-1106",
    suffix= "tensorflow",
    hyperparameters={"n_epochs": 10}
)

In [152]:
tensorflow_fine_tuning_job_new.id

'ftjob-zifIfmVpYAPDW7Q2xgLo9agK'

In [156]:
# Retrieving the state of a fine-tune
tensorflow_ft_model_new = client.fine_tuning.jobs.retrieve(tensorflow_fine_tuning_job_new.id).fine_tuned_model
print(tensorflow_ft_model_new)

ft:gpt-3.5-turbo-1106:northern-arizona-university-nau:tensorflow:8xHuEgsD


### Wait Before Continuing Again

For each repository (facebook, tensorflow, microsoft, bitcoin, opencv) please wait until the fine tuning job is done. You can ensure that by checking when the code snippet above does not return "None". You could also run the snippet below to track the progress of your fine-tuning job by checking the latest events.

In [None]:
# You can track the progress of your fine-tuning job by listing the latest events. For our models, it took about 3 hours to fine-tune each model
client.fine_tuning.jobs.list_events(fine_tuning_job_id=tensorflow_fine_tuning_job_new.id, limit=20)
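
Instead of re-running the retrieval cell by hand, the wait can be automated with a small polling loop. This is a sketch rather than part of our pipeline: `wait_for_fine_tune`, `poll_seconds`, and the injectable `sleep` argument are names we made up for illustration, and the loop assumes the job object exposes a `status` field that eventually becomes `succeeded`, `failed`, or `cancelled`.

```python
import time

# Statuses after which the job will not change any more (assumed terminal states)
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_fine_tune(client, job_id, poll_seconds=60, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status, then return it."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        # Not finished yet: wait before asking again
        sleep(poll_seconds)
```

A call such as `wait_for_fine_tune(client, tensorflow_fine_tuning_job_new.id)` would block until the job ends; the returned object's `fine_tuned_model` is only set when the status is `succeeded`.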

### Model Testing and Evaluation

In [None]:
y_true_tensorflow_new, y_pred_tensorflow_new = test_model(test_data_tensorflow2, tensorflow_ft_model_new) ## See results on outputs/cell90output.txt

**Wait for model testing to finish before proceeding. This can take up to 3 hours. You will know it is over when it stops printing predicted vs. actual values.**

In [158]:
calculate_metrics(y_true_tensorflow_new, y_pred_tensorflow_new, 'metrics/confusion_matrix_tf_new')

feature: {'TP': 85, 'FP': 6, 'FN': 15, 'TN': 194}
bug: {'TP': 87, 'FP': 7, 'FN': 13, 'TN': 193}
question: {'TP': 89, 'FP': 26, 'FN': 11, 'TN': 174}
Precision = 0.8778369641459373
Recall = 0.87
F1-score = 0.8716221830866581


In [159]:
tensorflow_complete_metrics_new = evaluating_metrics(y_true_tensorflow_new, y_pred_tensorflow_new)

              precision  recall  f1-score  support
bug            0.925532    0.87  0.896907   100.00
feature        0.934066    0.85  0.890052   100.00
question       0.773913    0.89  0.827907   100.00
accuracy       0.870000    0.87  0.870000     0.87
macro avg      0.877837    0.87  0.871622   300.00
weighted avg   0.877837    0.87  0.871622   300.00
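
The per-class rows of the report follow directly from the confusion counts printed by `calculate_metrics`. As a worked check, the sketch below recomputes them from the TensorFlow counts shown above; the helper name `precision_recall_f1` is ours and is used for illustration only.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# (TP, FP, FN) per class, copied from the calculate_metrics output above
counts = {
    "feature": (85, 6, 15),
    "bug": (87, 7, 13),
    "question": (89, 26, 11),
}

per_class = {label: precision_recall_f1(*c) for label, c in counts.items()}

# Macro average: unweighted mean of the per-class values
macro = tuple(
    sum(values[i] for values in per_class.values()) / len(per_class)
    for i in range(3)
)

for label, values in per_class.items():
    print(label, [round(v, 6) for v in values])
print("macro avg", [round(v, 6) for v in macro])
```

This reproduces the `bug` row (0.925532, 0.87, 0.896907) and the macro-averaged F1-score of 0.871622 reported above.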


## Facebook Improved Model

In [185]:
## Create new facebook conversational data with cleaning method 2
facebook_conversational_data_new = create_conversational_data(train_data_facebook2, "data/conversationaldata/conversational_data_facebook_new.jsonl")
## Create new facebook training file
facebook_training_file_new = create_training_file(facebook_conversational_data_new)

## Creating new facebook fine-tuning job using gpt-3.5-turbo-1106 baseline model, 7 epochs, and cleaning method 2
facebook_fine_tuning_job_new = client.fine_tuning.jobs.create(
    training_file = facebook_training_file_new.id, 
    model="gpt-3.5-turbo-1106",
    suffix= "facebook",
    hyperparameters={"n_epochs": 7}
)

In [186]:
facebook_fine_tuning_job_new.id

'ftjob-JipjimxPsrWjtEHPDjFyOYwR'

In [190]:
# Retrieving the state of a fine-tune
facebook_ft_model_new = client.fine_tuning.jobs.retrieve(facebook_fine_tuning_job_new.id).fine_tuned_model
print(facebook_ft_model_new)

ft:gpt-3.5-turbo-1106:northern-arizona-university-nau:facebook:8xjH103N


In [None]:
y_true_facebook_new, y_pred_facebook_new = test_model(test_data_facebook2, facebook_ft_model_new) ## See results on outputs/cell98output.txt

In [192]:
calculate_metrics(y_true_facebook_new, y_pred_facebook_new, 'metrics/confusion_matrix_fb_new')

feature: {'TP': 87, 'FP': 12, 'FN': 13, 'TN': 188}
bug: {'TP': 93, 'FP': 20, 'FN': 7, 'TN': 180}
question: {'TP': 75, 'FP': 13, 'FN': 25, 'TN': 187}
Precision = 0.8513564852060428
Recall = 0.85
F1-score = 0.8484945454472442


In [193]:
facebook_complete_metrics_new = evaluating_metrics(y_true_facebook_new, y_pred_facebook_new)

              precision  recall  f1-score  support
bug            0.823009    0.93  0.873239   100.00
feature        0.878788    0.87  0.874372   100.00
question       0.852273    0.75  0.797872   100.00
accuracy       0.850000    0.85  0.850000     0.85
macro avg      0.851356    0.85  0.848495   300.00
weighted avg   0.851356    0.85  0.848495   300.00


## Bitcoin Improved Model

In [194]:
## Create new bitcoin conversational data with cleaning method 2
bitcoin_conversational_data_new = create_conversational_data(train_data_bitcoin2, "data/conversationaldata/conversational_data_bitcoin_new.jsonl")
## Create new bitcoin training file
bitcoin_training_file_new = create_training_file(bitcoin_conversational_data_new)

## Creating new bitcoin fine-tuning job using gpt-3.5-turbo-1106 baseline model, 7 epochs, and cleaning method 2
bitcoin_fine_tuning_job_new = client.fine_tuning.jobs.create(
    training_file = bitcoin_training_file_new.id, 
    model="gpt-3.5-turbo-1106",
    suffix= "bitcoin",
    hyperparameters={"n_epochs": 7}
)

In [196]:
bitcoin_fine_tuning_job_new.id

'ftjob-RWphxqFI3CI3DvA9UQfa0YMo'

In [198]:
# Retrieving the state of a fine-tune
bitcoin_ft_model_new = client.fine_tuning.jobs.retrieve(bitcoin_fine_tuning_job_new.id).fine_tuned_model
print(bitcoin_ft_model_new)

ft:gpt-3.5-turbo-1106:northern-arizona-university-nau:bitcoin:8xrdS2jR


In [None]:
y_true_bitcoin_new, y_pred_bitcoin_new = test_model(test_data_bitcoin2, bitcoin_ft_model_new) ## See results on outputs/cell105output.txt

In [200]:
calculate_metrics(y_true_bitcoin_new, y_pred_bitcoin_new, 'metrics/confusion_matrix_bc_new')

feature: {'TP': 88, 'FP': 17, 'FN': 12, 'TN': 183}
bug: {'TP': 74, 'FP': 18, 'FN': 26, 'TN': 182}
question: {'TP': 72, 'FP': 31, 'FN': 28, 'TN': 169}
Precision = 0.780490730131929
Recall = 0.78
F1-score = 0.7795765082035057


In [201]:
bitcoin_complete_metrics_new = evaluating_metrics(y_true_bitcoin_new, y_pred_bitcoin_new)

              precision  recall  f1-score  support
bug            0.804348    0.74  0.770833   100.00
feature        0.838095    0.88  0.858537   100.00
question       0.699029    0.72  0.709360   100.00
accuracy       0.780000    0.78  0.780000     0.78
macro avg      0.780491    0.78  0.779577   300.00
weighted avg   0.780491    0.78  0.779577   300.00


## OpenCV Improved Model
Cleaning method 2 and 6 epochs on fine-tuning process.

We pair the new cleaning method with 6 epochs because our experiment with 10 epochs on the TensorFlow model indicated that 10 epochs was excessive.

In [160]:
## Create new opencv conversational data with cleaning method 2
opencv_conversational_data_new = create_conversational_data(train_data_opencv2, "data/conversationaldata/conversational_data_opencv_new.jsonl")
## Create new opencv training file
opencv_training_file2 = create_training_file(opencv_conversational_data_new)

## Creating new opencv fine-tuning job using gpt-3.5-turbo-1106 baseline model, 6 epochs, and cleaning method 2
opencv_fine_tuning_job_new = client.fine_tuning.jobs.create(
    training_file = opencv_training_file2.id, 
    model="gpt-3.5-turbo-1106",
    suffix= "opencv",
    hyperparameters={"n_epochs": 6}
)

In [161]:
opencv_fine_tuning_job_new.id

'ftjob-1CfSvYPxWkm1RFTkpbj66P9f'

In [165]:
# Retrieving the state of a fine-tune
opencv_ft_model_new = client.fine_tuning.jobs.retrieve(opencv_fine_tuning_job_new.id).fine_tuned_model
print(opencv_ft_model_new)

ft:gpt-3.5-turbo-1106:northern-arizona-university-nau:opencv:8xNlytzM


### Wait Before Continuing Again

For each repository (facebook, tensorflow, microsoft, bitcoin, opencv) please wait until the fine tuning job is done. You can ensure that by checking when the code snippet above does not return "None". You could also run the snippet below to track the progress of your fine-tuning job by checking the latest events.

In [None]:
# You can track the progress of your fine-tuning job by listing the latest events. For our models, it took about 3 hours to fine-tune each model
client.fine_tuning.jobs.list_events(fine_tuning_job_id=opencv_fine_tuning_job_new.id, limit=20)

### Model Testing and Evaluation

In [None]:
y_true_opencv_new, y_pred_opencv_new = test_model(test_data_opencv2, opencv_ft_model_new) ## See results on outputs/cell115output.txt

**Wait for model testing to finish before proceeding. This can take up to 3 hours. You will know it is over when it stops printing predicted vs. actual values.**

In [170]:
calculate_metrics(y_true_opencv_new, y_pred_opencv_new, 'metrics/confusion_matrix_oc_new.csv')

feature: {'TP': 73, 'FP': 10, 'FN': 26, 'TN': 189}
bug: {'TP': 79, 'FP': 19, 'FN': 20, 'TN': 180}
question: {'TP': 88, 'FP': 29, 'FN': 12, 'TN': 169}
Precision = 0.8125924244685003
Recall = 0.8
F1-score = 0.8022846378213909


In [171]:
opencv_complete_metrics_new = evaluating_metrics(y_true_opencv_new, y_pred_opencv_new)

              precision  recall  f1-score  support
bug            0.806122    0.79  0.797980    100.0
feature        0.879518    0.73  0.797814    100.0
question       0.752137    0.88  0.811060    100.0
micro avg      0.805369    0.80  0.802676    300.0
macro avg      0.812592    0.80  0.802285    300.0
weighted avg   0.812592    0.80  0.802285    300.0


## Microsoft Improved Model
Cleaning method 1 and 10 epochs in total: 7 additional epochs on top of the existing 3-epoch model.

In [172]:
## Since we are just adding more epochs to the existing Microsoft model we don't have to create new conversational data or training files.

## Creating a fine-tuned model
microsoft_ft_job_file_new = client.fine_tuning.jobs.create(
  training_file = microsoft_ft_job_file.training_file, ## Using same file with cleaning method 1
  model = microsoft_ft_model, ## Using old model as the base model
  suffix= "ms-issueclassifier",
  hyperparameters={"n_epochs": 7}
)

In [173]:
microsoft_ft_job_file_new

FineTuningJob(id='ftjob-De6dIrMKz8BhAfcbANUgojrG', created_at=1709165781, error=Error(code=None, message=None, param=None, error=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs=7, batch_size='auto', learning_rate_multiplier='auto'), model='ft:gpt-3.5-turbo-0613:northern-arizona-university-nau:ms-issueclassifier:8wwuk30C', object='fine_tuning.job', organization_id='org-RQmLagMyfsDY9gy4UMq97uCI', result_files=[], status='validating_files', trained_tokens=None, training_file='file-EjuEwJyG8siTlN11etUaBzhP', validation_file=None, user_provided_suffix='ms-issueclassifier')

In [189]:
# Retrieving the state of a fine-tune
microsoft_ft_model_new = client.fine_tuning.jobs.retrieve(microsoft_ft_job_file_new.id).fine_tuned_model
print(microsoft_ft_model_new)

ft:gpt-3.5-turbo-0613:northern-arizona-university-nau:ms-issueclassifier:8xPhaSQC


### Wait Before Continuing Again

For each repository (facebook, tensorflow, microsoft, bitcoin, opencv) please wait until the fine tuning job is done. You can ensure that by checking when the code snippet above does not return "None". You could also run the snippet below to track the progress of your fine-tuning job by checking the latest events.

In [None]:
# You can track the progress of your fine-tuning job by listing the latest events. For our models, it took about 3 hours to fine-tune each model
client.fine_tuning.jobs.list_events(fine_tuning_job_id=microsoft_ft_job_file_new.id, limit=20)

### Model Testing and Evaluation

In [None]:
y_true_microsoft_new, y_pred_microsoft_new = test_model(test_data_microsoft, microsoft_ft_model_new) ## See results on outputs/cell126output.txt

**Wait for model testing to finish before proceeding. This can take up to 3 hours. You will know it is over when it stops printing predicted vs. actual values.**

In [183]:
calculate_metrics(y_true_microsoft_new, y_pred_microsoft_new, 'metrics/confusion_matrix_ms_new.csv')

feature: {'TP': 87, 'FP': 30, 'FN': 13, 'TN': 170}
bug: {'TP': 78, 'FP': 15, 'FN': 22, 'TN': 185}
question: {'TP': 75, 'FP': 15, 'FN': 25, 'TN': 185}
Precision = 0.805210918114144
Recall = 0.8
F1-score = 0.799869052541097


In [178]:
microsoft_complete_metrics_new = evaluating_metrics(y_true_microsoft_new, y_pred_microsoft_new)

              precision  recall  f1-score  support
bug            0.838710    0.78  0.808290    100.0
feature        0.743590    0.87  0.801843    100.0
question       0.833333    0.75  0.789474    100.0
accuracy       0.800000    0.80  0.800000      0.8
macro avg      0.805211    0.80  0.799869    300.0
weighted avg   0.805211    0.80  0.799869    300.0


## Overall Results

To obtain the overall results, we used the metrics from the improved facebook, tensorflow, microsoft, and bitcoin models and from the original opencv model. We then averaged each metric across the five repositories to obtain the complete overall metrics.

In [203]:
def calculate_average(values):
    total = 0
    for value in values:
        total += value
    return total/len(values)

# Calculating overall (average) values  
overall_bug_f1 = calculate_average([facebook_complete_metrics_new['f1-score'].loc['bug'], tensorflow_complete_metrics_new['f1-score'].loc['bug'], microsoft_complete_metrics_new['f1-score'].loc['bug'], bitcoin_complete_metrics_new['f1-score'].loc['bug'], opencv_complete_metrics['f1-score'].loc['bug']])
overall_bug_precision = calculate_average([facebook_complete_metrics_new.precision.loc['bug'], tensorflow_complete_metrics_new.precision.loc['bug'], microsoft_complete_metrics_new.precision.loc['bug'], bitcoin_complete_metrics_new.precision.loc['bug'], opencv_complete_metrics.precision.loc['bug']])
overall_bug_recall = calculate_average([facebook_complete_metrics_new.recall.loc['bug'], tensorflow_complete_metrics_new.recall.loc['bug'], microsoft_complete_metrics_new.recall.loc['bug'], bitcoin_complete_metrics_new.recall.loc['bug'], opencv_complete_metrics.recall.loc['bug']])
overall_feature_f1 = calculate_average([facebook_complete_metrics_new['f1-score'].loc['feature'], tensorflow_complete_metrics_new['f1-score'].loc['feature'], microsoft_complete_metrics_new['f1-score'].loc['feature'], bitcoin_complete_metrics_new['f1-score'].loc['feature'], opencv_complete_metrics['f1-score'].loc['feature']])
overall_feature_precision = calculate_average([facebook_complete_metrics_new.precision.loc['feature'], tensorflow_complete_metrics_new.precision.loc['feature'], microsoft_complete_metrics_new.precision.loc['feature'], bitcoin_complete_metrics_new.precision.loc['feature'], opencv_complete_metrics.precision.loc['feature']])
overall_feature_recall = calculate_average([facebook_complete_metrics_new.recall.loc['feature'], tensorflow_complete_metrics_new.recall.loc['feature'], microsoft_complete_metrics_new.recall.loc['feature'], bitcoin_complete_metrics_new.recall.loc['feature'], opencv_complete_metrics.recall.loc['feature']])
overall_question_f1 = calculate_average([facebook_complete_metrics_new['f1-score'].loc['question'], tensorflow_complete_metrics_new['f1-score'].loc['question'], microsoft_complete_metrics_new['f1-score'].loc['question'], bitcoin_complete_metrics_new['f1-score'].loc['question'], opencv_complete_metrics['f1-score'].loc['question']])
overall_question_precision = calculate_average([facebook_complete_metrics_new.precision.loc['question'], tensorflow_complete_metrics_new.precision.loc['question'], microsoft_complete_metrics_new.precision.loc['question'], bitcoin_complete_metrics_new.precision.loc['question'], opencv_complete_metrics.precision.loc['question']])
overall_question_recall = calculate_average([facebook_complete_metrics_new.recall.loc['question'], tensorflow_complete_metrics_new.recall.loc['question'], microsoft_complete_metrics_new.recall.loc['question'], bitcoin_complete_metrics_new.recall.loc['question'], opencv_complete_metrics.recall.loc['question']])
overall_average_f1 = calculate_average([facebook_complete_metrics_new['f1-score'].loc['macro avg'], tensorflow_complete_metrics_new['f1-score'].loc['macro avg'], microsoft_complete_metrics_new['f1-score'].loc['macro avg'], bitcoin_complete_metrics_new['f1-score'].loc['macro avg'], opencv_complete_metrics['f1-score'].loc['macro avg']])
overall_average_precision = calculate_average([facebook_complete_metrics_new.precision.loc['macro avg'], tensorflow_complete_metrics_new.precision.loc['macro avg'], microsoft_complete_metrics_new.precision.loc['macro avg'], bitcoin_complete_metrics_new.precision.loc['macro avg'], opencv_complete_metrics.precision.loc['macro avg']])
overall_average_recall = calculate_average([facebook_complete_metrics_new.recall.loc['macro avg'], tensorflow_complete_metrics_new.recall.loc['macro avg'], microsoft_complete_metrics_new.recall.loc['macro avg'], bitcoin_complete_metrics_new.recall.loc['macro avg'], opencv_complete_metrics.recall.loc['macro avg']])

print("Overall Results: ")
# Formatting the results
formatted_metrics = {
    "Bug": {
        "Precision": overall_bug_precision, 
        "Recall": overall_bug_recall, 
        "F1-Score": overall_bug_f1
    },
    "Feature": {
        "Precision": overall_feature_precision, 
        "Recall": overall_feature_recall, 
        "F1-Score": overall_feature_f1
    },
    "Question": {
        "Precision": overall_question_precision, 
        "Recall": overall_question_recall, 
        "F1-Score": overall_question_f1
    },
    "Average": {
        "Precision": overall_average_precision, 
        "Recall": overall_average_recall, 
        "F1-Score": overall_average_f1
    }
}
formatted_metrics

Overall Results: 


{'Bug': {'Precision': 0.8255923808642173,
  'Recall': 0.8260000000000002,
  'F1-Score': 0.8241397426633768},
 'Feature': {'Precision': 0.8531013072948557,
  'Recall': 0.8559999999999999,
  'F1-Score': 0.8528364713995196},
 'Question': {'Precision': 0.7942096460595828,
  'Recall': 0.7780000000000001,
  'F1-Score': 0.7841061949277026},
 'Average': {'Precision': 0.8243011114062186,
  'Recall': 0.82,
  'F1-Score': 0.8203608029968661}}
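
The twelve `calculate_average` calls above can also be expressed as one cell-wise mean over the per-repository reports. The sketch below assumes each `*_complete_metrics` object is a DataFrame indexed by label with `precision`, `recall`, and `f1-score` columns, as the printed reports suggest; `average_reports` is a hypothetical helper and the toy numbers are made up for illustration.

```python
import pandas as pd

def average_reports(reports):
    """Average several classification reports cell by cell.

    `reports` maps a repository name to a report DataFrame
    (rows = labels, columns = metrics).
    """
    stacked = pd.concat(list(reports.values()), keys=list(reports.keys()))
    # Level 1 of the stacked index holds the row labels (bug, feature, ...)
    return stacked.groupby(level=1).mean()

# Toy reports with made-up numbers, for illustration only
toy_reports = {
    "repo_a": pd.DataFrame(
        {"precision": [0.8, 0.6], "recall": [0.9, 0.7]},
        index=["bug", "feature"],
    ),
    "repo_b": pd.DataFrame(
        {"precision": [0.6, 0.8], "recall": [0.7, 0.5]},
        index=["bug", "feature"],
    ),
}

overall = average_reports(toy_reports)
print(overall)
```

Passing the five report DataFrames used above in a dictionary should yield the same averages as `formatted_metrics`, with the `macro avg` row corresponding to the "Average" entry.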

## Extended Work Revision:
### Increased dataset

We believe that our project could benefit from a discussion of how well the models scale with more substantial and varied data. With that in mind, we decided to use the dataset from the NLBSE 2023 Tool Competition and compare it with the few-shot dataset from the 2024 edition.

Due to financial constraints, we limit the dataset to 15,000 issues for training and 15,000 for testing, a 10x increase over the 1,500 training and 1,500 testing data points used above.

We also filtered out the "documentation" label, which was not kept for NLBSE 2024, so we only consider question, feature, and bug.

These new issues are saved under issues/extension.

In [5]:
## Getting the training and testing data
import pandas as pd

# Note: the training frame is read from the "-test" file and the testing frame from the "-train" file
df_extension_training = pd.read_csv("data/issues/extension/nlbse23-issue-classification-test.csv")
df_extension_testing = pd.read_csv("data/issues/extension/nlbse23-issue-classification-train.csv")

# Now filtering to remove unwanted labeled rows
df_filtered_extension_training = df_extension_training[df_extension_training['labels'] != 'documentation']
df_filtered_extension_testing = df_extension_testing[df_extension_testing['labels'] != 'documentation']


Since Cleaning Method 1 performed better overall, we use it here.

In [18]:
import re
import emoji
import string

# Initialize counters for text cleaning
cleaned_count = 0
original_count = 0

# Text cleaning function
def clean_text(text):
    global cleaned_count, original_count

    if not isinstance(text, str):
        original_count += 1
        return text

    # Remove double quotation marks
    text = text.replace('"', '')

    # Remove text starting with "DevTools" and ending with "(automated)"
    text = re.sub(r'DevTools.*?\(automated\)', '', text)

    # Lowercasing should be one of the first steps to ensure uniformity
    text = text.lower()

    # Convert emojis to their text aliases (the surrounding colons are removed with the punctuation below)
    text = emoji.demojize(text)

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove special characters and punctuation
    text = re.sub(f"[{re.escape(string.punctuation)}]", '', text)

    # Remove '#' characters
    text = text.replace("#", "")

    # Remove consecutive whitespaces and replace with a single space
    text = re.sub(r'\s+', ' ', text)

    # Split the text into words
    words = text.split()

    # Remove words that are over 20 characters
    words = [word for word in words if len(word) <= 20]

    # Join the remaining words back into cleaned text
    cleaned_text = ' '.join(words)

    cleaned_count += 1
    return cleaned_text

In [19]:
# Function to get the first 5,000 entries for a specific label
def get_first_n_by_label(df, label, n=5000):
    # Copy the slice so the column assignments below do not trigger SettingWithCopyWarning
    df_label = df[df['labels'] == label].head(n).copy()
    df_label['body'] = df_label['body'].apply(clean_text)
    df_label['title'] = df_label['title'].apply(clean_text)
    return df_label

To maintain the same ratio as we had with the few shot example (5k dataset) we want to accomplish the following:
* bug training dataset: 5000
* feature training dataset: 5000
* question training dataset: 5000
* bug testing dataset: 5000
* feature testing dataset: 5000
* question testing dataset: 5000
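
The per-label selection can also be done in a single pass and checked with `value_counts`. This is only a sketch on a toy DataFrame: `take_first_n_per_label` is a hypothetical helper, and it skips the text cleaning that `get_first_n_by_label` performs.

```python
import pandas as pd

def take_first_n_per_label(df, n, label_col="labels"):
    """Keep the first n rows of every label, preserving the original row order."""
    return df.groupby(label_col, sort=False).head(n)

# Toy data, for illustration only
toy = pd.DataFrame({
    "labels": ["bug", "feature", "bug", "question", "bug", "feature", "question"],
    "title": list("abcdefg"),
})

balanced = take_first_n_per_label(toy, n=2)
label_counts = balanced["labels"].value_counts().to_dict()
print(label_counts)
```

On the real data, calling the helper with `n=5000` and checking that every count equals 5000 would confirm that each label actually has enough rows to reach the target size.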

In [20]:
# Extracting the first 5000 data points for each label
feature_training = get_first_n_by_label(df_filtered_extension_training, 'feature')
bug_training = get_first_n_by_label(df_filtered_extension_training, 'bug')
question_training = get_first_n_by_label(df_filtered_extension_training, 'question')

In [21]:
feature_testing = get_first_n_by_label(df_filtered_extension_testing, 'feature')
bug_testing = get_first_n_by_label(df_filtered_extension_testing, 'bug')
question_testing = get_first_n_by_label(df_filtered_extension_testing, 'question')

In [22]:
feature_testing

Unnamed: 0,id,labels,title,body,author_association
11,1316955675,feature,show subscriber count on user profile,as an author i would like to see how many subs...,NONE
20,1266063747,feature,switch from aioredis to redis,code from aioredis was incorporated into the o...,NONE
21,635318979,feature,revoking certificates produces high cpu load,please reserve github issues for bug reports a...,NONE
22,1115945046,feature,devtoolset11 for manylinux2014,last year we had the upgrade to devtoolset10 f...,NONE
28,1136348977,feature,roadside decor incomplete,thank you for expanding this roadside pikmin c...,NONE
...,...,...,...,...,...
21422,855345144,feature,encoded uri for source urls nextjs,this is a specific problem with nextjs but im ...,NONE
21433,1054589101,feature,nxtaskrunner environment variable instead of p...,description we maintain 3 different taskrunner...,NONE
21445,291573842,feature,add forcednld parameter and other improvements...,the dnldncbigenefile function in associationsp...,NONE
21450,1169101952,feature,set public method that when called allows focu...,environment information package versions 8610 ...,NONE


In [39]:
# Saving the filtered datasets
feature_training.to_csv("data/issues/extension/filtered/feature_training.csv", index=False)
bug_training.to_csv("data/issues/extension/filtered/bug_training.csv", index=False)
question_training.to_csv("data/issues/extension/filtered/question_training.csv", index=False)

feature_testing.to_csv("data/issues/extension/filtered/feature_testing.csv", index=False)
bug_testing.to_csv("data/issues/extension/filtered/bug_testing.csv", index=False)
question_testing.to_csv("data/issues/extension/filtered/question_testing.csv", index=False)

Now let's count the tokens.

In [41]:
import tiktoken

# Tokenizer initialization
tokenizer = tiktoken.get_encoding("cl100k_base")

def count_tokens(df):
    return sum(len(tokenizer.encode(str(row['title']) + ' ' + str(row['body']))) for _, row in df.iterrows())

In [45]:
# Counting tokens
feature_training_tokens = count_tokens(feature_training)
bug_training_tokens = count_tokens(bug_training)
question_training_tokens = count_tokens(question_training)

feature_testing_tokens = count_tokens(feature_testing)
bug_testing_tokens = count_tokens(bug_testing)
question_testing_tokens = count_tokens(question_testing)

total_tokens = (feature_training_tokens + bug_training_tokens + question_training_tokens +
                feature_testing_tokens + bug_testing_tokens + question_testing_tokens)

total_tokens_training = feature_training_tokens + bug_training_tokens + question_training_tokens

total_tokens_testing = feature_testing_tokens + bug_testing_tokens + question_testing_tokens

In [46]:
# Printing token counts
print(f"Feature Training Tokens: {feature_training_tokens}")
print(f"Bug Training Tokens: {bug_training_tokens}")
print(f"Question Training Tokens: {question_training_tokens}")
print(f"Feature Testing Tokens: {feature_testing_tokens}")
print(f"Bug Testing Tokens: {bug_testing_tokens}")
print(f"Question Testing Tokens: {question_testing_tokens}")
print(f"Total Tokens: {total_tokens}")
print(f"Total Tokens Training: {total_tokens_training}")
print(f"Total Tokens Testing: {total_tokens_testing}")



Feature Training Tokens: 436353
Bug Training Tokens: 1165301
Question Training Tokens: 872991
Feature Testing Tokens: 696258
Bug Testing Tokens: 1357599
Question Testing Tokens: 813402
Total Tokens: 5341904
Total Tokens Training: 2474645
Total Tokens Testing: 2867259
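
Since the budget is the limiting factor, the training token count can be turned into a rough cost estimate. The sketch below assumes fine-tuning is billed per trained token (file tokens multiplied by the number of epochs). The function name, the epoch count, and the price are placeholders rather than values from our runs, and the count above covers only titles and bodies, so the prompt and system message added later push the real total somewhat higher.

```python
def estimate_training_cost(file_tokens, n_epochs, usd_per_1k_tokens):
    """Rough cost of one fine-tuning job: trained tokens = file tokens x epochs."""
    trained_tokens = file_tokens * n_epochs
    return trained_tokens / 1000 * usd_per_1k_tokens

# Total training tokens reported above; epochs and price are hypothetical placeholders
example_cost = estimate_training_cost(2474645, n_epochs=3, usd_per_1k_tokens=0.01)
print(f"Estimated cost: ${example_cost:.2f}")
```

Replacing the placeholder price with the provider's current rate gives a lower bound that can be compared against the available budget before any job is submitted.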


In [None]:
# Invoking the API
from openai import OpenAI
client = OpenAI(api_key = 'open-ai-api-key')

In [69]:
max_content_tokens = 3999

# Truncate the message so it stays under the 4k-token limit of fine-tuned gpt-3.5 models
def truncate_message(message, max_length):
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(message)
    if len(tokens) > max_length:
        truncated_tokens = tokens[:max_length]
        message = encoding.decode(truncated_tokens)
    return message

def create_conversational_data(train_data, conversational_data):

    # Open the file in write mode
    with open(conversational_data, 'w', encoding='utf-8') as f:
        # Iterate over the rows in the DataFrame
        for index, row in train_data.iterrows():
            # Create the user message by formatting the prompt with the title and body
            user_message = f"Classify, IN ONLY 1 WORD, the following GitHub issue as 'feature', 'bug', or 'question' based on its title and body:\n{row['title']}\n{row['body']}"
            
            # Truncate the prompt if necessary
            user_message = truncate_message(user_message, max_content_tokens)

            # Create the assistant message by taking the label
            assistant_message = row['labels']
            
            # Construct the conversation object
            conversation_object = {
                "messages": [
                    {"role": "system", "content": "GitHub Issue Report Classifier"},
                    {"role": "user", "content": user_message},
                    {"role": "assistant", "content": assistant_message}
                ]
            }
            
            # Write the conversation object to one line in the file
            f.write(json.dumps(conversation_object, ensure_ascii=False) + '\n')
    return conversational_data
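
Before uploading, it can help to check that every line of the generated JSONL file has the shape written above: a `messages` list with `system`, `user`, and `assistant` roles, and a label as the assistant content. The following is a minimal sketch; `validate_conversation_line`, `EXPECTED_ROLES`, and `ALLOWED_LABELS` are hypothetical names introduced here, not part of the notebook.

```python
import json

# Assumed structure, mirroring what create_conversational_data writes
EXPECTED_ROLES = ["system", "user", "assistant"]
ALLOWED_LABELS = {"feature", "bug", "question"}

def validate_conversation_line(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list):
        return ["missing 'messages' list"]
    problems = []
    roles = [m.get("role") for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
    elif messages[-1].get("content") not in ALLOWED_LABELS:
        problems.append(f"unexpected label: {messages[-1].get('content')!r}")
    return problems

# Example with an in-memory line instead of the real file
sample = json.dumps({"messages": [
    {"role": "system", "content": "GitHub Issue Report Classifier"},
    {"role": "user", "content": "Classify ... new test"},
    {"role": "assistant", "content": "question"},
]})
print(validate_conversation_line(sample))
```

To check a real file, the same function could be applied to each line read from the path returned by `create_conversational_data`.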

In [71]:
def create_training_file(conversational_data):
    # Upload the JSONL training file; the context manager closes it once the upload finishes
    with open(conversational_data, "rb") as f:
        training_file = client.files.create(
            file=f,
            purpose="fine-tune"
        )
    return training_file

In [72]:
def create_fine_tuned_model(model_training_file, model_sufix, model):
  ## Creating a fine-tuned model
  fine_tuning_job = client.fine_tuning.jobs.create(
    training_file=model_training_file.id, 
    model=model,
    suffix= model_sufix,
    
  )
  return fine_tuning_job

In [73]:
def fine_tuning_and_training(train_data, conversational_data, model_sufix, model):
    model_conversational_data = create_conversational_data(train_data, conversational_data)
    trained_file = create_training_file(model_conversational_data)
    fine_tuned_model = create_fine_tuned_model(trained_file, model_sufix, model)
    return fine_tuned_model

In [74]:
extension_training_data = pd.concat([feature_training, bug_training, question_training], ignore_index=True)

In [75]:
extension_training_data

Unnamed: 0,id,labels,title,body,author_association
0,1089772715,feature,how to check if a certain entity still exists,during a bug in my own code i noticed that the...,NONE
1,1000928729,feature,chose the timezone in dbeaver option,dbeaver 2120 for all version dbeaver i put tim...,NONE
2,1270175611,feature,any roadmap about okhttp supports http3,i noticed that official http3 specific is publ...,NONE
3,1248029618,feature,keyboard maestro list macros add edit macro op...,extension – keyboard maestro list macros autho...,NONE
4,1226906409,feature,fluentd buffer chunk key configuration via sck...,what would you like to be added we need to con...,NONE
...,...,...,...,...,...
14995,1183401236,question,prometheus prefix option got removed in v4,in v3 there was a config option for prometheus...,CONTRIBUTOR
14996,463946177,question,how to derive an annotator with automatically ...,i want to use the regexmatcher with small opti...,CONTRIBUTOR
14997,1087228602,question,consider what kinds of urn manipulation of who...,maybe better to make that a subsequent milestone,CONTRIBUTOR
14998,1270241793,question,new test,new test,CONTRIBUTOR


In [73]:
gpt_4o_ft_extended_data = fine_tuning_and_training(extension_training_data, 'data/conversationaldata/extension/conversational_data_extension.jsonl', "extended15k", "gpt-4o-mini-2024-07-18")

In [37]:
gpt_4o_ft_extended_data = "FineTuningJob(id='ftjob-c89d0b5MIoiBIraRI0MQ5Hz6', created_at=1738382604, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-4o-mini-2024-07-18', object='fine_tuning.job', organization_id='org-CpaRU3Zq9ePCCtbhezmcbgrg', result_files=[], status='validating_files', trained_tokens=None, training_file='file-WZC21uiSg9pBAFyknCUciE', validation_file=None, user_provided_suffix='extended15k', seed=1481594988, estimated_finish=None, integrations=[], method={'type': 'supervised', 'supervised': {'hyperparameters': {'batch_size': 'auto', 'learning_rate_multiplier': 'auto', 'n_epochs': 'auto'}}})"

In [39]:
# Retrieving the state of a fine-tune
gpt_4o_ft_extended_data_model = client.fine_tuning.jobs.retrieve('ftjob-c89d0b5MIoiBIraRI0MQ5Hz6').fine_tuned_model
print(gpt_4o_ft_extended_data_model) # This fine-tuning job took around 40 min to be completed

ft:gpt-4o-mini-2024-07-18:gcucst440:extended15k:AvzetMni
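
Since `fine_tuned_model` stays `None` until the job finishes (the job above was still in `validating_files` when it was created), the retrieve call can be wrapped in a polling loop. The sketch below is one possible way to do this; `wait_for_fine_tuned_model` and its parameters are hypothetical, and the retrieve function is passed in so the loop can be tried with a fake job before pointing it at `client.fine_tuning.jobs.retrieve`.

```python
import time
from types import SimpleNamespace

# Job statuses after which polling no longer makes sense
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_fine_tuned_model(retrieve_job, job_id, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll a fine-tuning job until it ends; return the model name or None."""
    for _ in range(max_polls):
        job = retrieve_job(job_id)
        if job.status in TERMINAL_STATUSES:
            return job.fine_tuned_model if job.status == "succeeded" else None
        sleep(poll_seconds)
    return None

# Simulated job that succeeds on the third poll (no API call is made here)
_states = iter([
    SimpleNamespace(status="validating_files", fine_tuned_model=None),
    SimpleNamespace(status="running", fine_tuned_model=None),
    SimpleNamespace(status="succeeded", fine_tuned_model="ft:example-model"),
])
fake_retrieve = lambda job_id: next(_states)
print(wait_for_fine_tuned_model(fake_retrieve, "ftjob-example", sleep=lambda s: None))
```

With the real client, the call would look like `wait_for_fine_tuned_model(client.fine_tuning.jobs.retrieve, job_id)`, using the job ID from the cell above.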


In [None]:
import concurrent.futures
import tiktoken
import openai
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix, classification_report

# Replace 'open-ai-api-key' with your actual OpenAI API key
openai.api_key = 'open-ai-api-key'

# max_token here should be one since 'bug', 'feature', and 'question' are one token long. This might change for future versions of the model and api but you can check the value on the
def query_chatgpt(prompt, model, temperature=0.0,  max_tokens=1, max_retries=5):
    """
    Query a chat model with a given prompt, retrying on timeouts and errors.

    :param prompt: Prompt string to send to the model
    :param model: Name of the model to use (e.g. a fine-tuned model ID)
    :param temperature: Sampling temperature
    :param max_tokens: Maximum number of tokens to generate
    :param max_retries: Maximum number of attempts before giving up
    :return: Text of the model's response, or None if all retries fail
    """
    attempt = 0
    max_content_tokens = 127000
    encoding = tiktoken.get_encoding("cl100k_base")

    # Truncate the message so the prompt never exceeds max_content_tokens
    def truncate_message(message, max_length):
        tokens = encoding.encode(message)
        if len(tokens) > max_length:
            truncated_tokens = tokens[:max_length]
            message = encoding.decode(truncated_tokens)
        return message

    # Truncate the prompt if necessary
    prompt = truncate_message(prompt, max_content_tokens)

    while attempt < max_retries:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future = executor.submit(
                openai.chat.completions.create,
                model=model,
                messages=[{"role": "system", "content": "GitHub Issue Report Classifier"}, {"role": "user", "content": prompt}],
                max_tokens=max_tokens,
                temperature=temperature
            )
            try:
                response = future.result(timeout=5)  # 5 seconds timeout
                return response.choices[0].message.content
            except concurrent.futures.TimeoutError:
                print(f"Attempt {attempt + 1}/{max_retries} - Request timed out. Retrying...")
            except Exception as e:
                print(f"Attempt {attempt + 1}/{max_retries} - An error occurred: {e}")
            finally:
                attempt += 1

    print("Failed to get a response after several retries.")
    return None
    
labels = ['feature', 'bug', 'question']
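
One caveat about the retry loop above: leaving a `with ThreadPoolExecutor()` block waits for submitted work to finish, so a request that exceeds the 5-second `future.result` timeout may still hold up the loop until it returns. A simpler alternative is to retry the call directly with a growing delay between attempts. The sketch below shows that pattern for any zero-argument callable; `call_with_retries` and its parameters are hypothetical names, not part of the notebook or the OpenAI library.

```python
import time

def call_with_retries(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            print(f"Attempt {attempt + 1}/{max_retries} - An error occurred: {exc}")
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None

# Simulated request that fails twice before answering (no API call is made)
_calls = {"n": 0}
def flaky_request():
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise RuntimeError("simulated timeout")
    return "bug"

print(call_with_retries(flaky_request, sleep=lambda s: None))
```

In the notebook, `fn` could be a lambda wrapping the `openai.chat.completions.create` call with the same arguments as above.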

In [130]:
import time

def test_model(test_data, ft_model):
    y_true = []
    y_pred = []
    iterations = len(test_data)

    # Now let's loop through the test data and classify the GitHub issues
    for i in range(iterations):
        correct_label = test_data.iloc[i]['labels'].lower()
        description = f"{test_data.iloc[i]['title']} \n {test_data.iloc[i]['body']}"
        print(f"Correct GitHub Issue type: {correct_label}")
        
        prompt = f"Classify, IN ONLY 1 WORD, the following GitHub issue as 'feature', 'bug', or 'question' based on its title and body:\n{description}"
        response = query_chatgpt(prompt, ft_model)
        
        if response is None:
            print("Failed to get a response after several retries. Skipping this item.")
            continue  # Skip this iteration and move to the next one
        
        # Clean the response to keep only letters
        predicted_label = re.sub(r'[^A-Za-z]+', '', response).lower().strip()
        print(f"Predicted GitHub Issue type: {predicted_label}")
        
        # Append to lists for evaluation
        y_true.append(correct_label)
        y_pred.append(predicted_label)
        time.sleep(4)  # Wait 4 seconds between requests since there is a tokens-per-minute limit

    return y_true, y_pred
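
`test_model` appends whatever cleaned text the model returns, so a response that is empty or not one of the three labels ends up in `y_pred` as-is. A possible refinement is to map each response onto a known label and use an explicit fallback otherwise. This is a sketch under that assumption; `normalize_label` and the `"unknown"` fallback are hypothetical and not part of the notebook.

```python
import re

LABELS = ("feature", "bug", "question")

def normalize_label(response, labels=LABELS, fallback="unknown"):
    """Map a raw model response to one of the known labels, else the fallback."""
    cleaned = re.sub(r"[^A-Za-z]+", "", response or "").lower()
    for label in labels:
        if cleaned.startswith(label):
            return label
    return fallback

for raw in [" Bug.", "Feature request", "", "enhancement"]:
    print(repr(raw), "->", normalize_label(raw))
```

Keeping the fallback as a separate value makes such cases easy to count after the run instead of silently mixing them with real labels.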

In [None]:
def calculate_metrics(y_true, y_pred, cm_sheet):
    labels = ['feature', 'bug', 'question']
    # Calculate weighted average F1-score, precision, and recall
    f1 = f1_score(y_true, y_pred, labels=labels, average='weighted')
    precision = precision_score(y_true, y_pred, labels=labels, average='weighted')
    recall = recall_score(y_true, y_pred, labels=labels, average='weighted')

    # Calculate confusion matrix
    cm = confusion_matrix(y_true, y_pred, labels=labels)

    cm_df = pd.DataFrame(cm, index=labels, columns=labels)

    # Calculate TP, FP, FN, TN
    results_fb = {}
    for i, label in enumerate(labels):
        results_fb[label] = {'TP': cm[i, i]}
        results_fb[label]['FP'] = cm[:, i].sum() - cm[i, i]
        results_fb[label]['FN'] = cm[i, :].sum() - cm[i, i]
        results_fb[label]['TN'] = cm.sum() - (results_fb[label]['TP'] + results_fb[label]['FP'] + results_fb[label]['FN'])

    # Print results_fb (loop variable renamed so it no longer shadows the labels list)
    for label, metrics in results_fb.items():
        print(f"{label}: {metrics}")

    # Save results_fb to CSV
    results_fb_df = pd.DataFrame(results_fb).T
    results_fb_df['F1-score'] = f1
    results_fb_df['Recall'] = recall
    results_fb_df['Precision'] = precision

    results_fb_df.to_csv(cm_sheet, index=False)

    print(f"Precision = {precision}")
    print(f"Recall = {recall}")
    print(f"F1-score = {f1}")
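
The one-vs-rest counts in `calculate_metrics` follow directly from the confusion matrix: for each class, TP is the diagonal entry, FP is the rest of its column, FN is the rest of its row, and TN is everything else. The toy example below illustrates this in plain Python; the matrix values are made up for illustration and `one_vs_rest_counts` is a hypothetical helper.

```python
labels = ["feature", "bug", "question"]

# Toy confusion matrix (rows = true label, columns = predicted label); values are made up
cm = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 1, 4],
]

def one_vs_rest_counts(cm, labels):
    """Derive TP, FP, FN, TN per class from a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    results = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fp = sum(row[i] for row in cm) - tp  # predicted as this class, but wrong
        fn = sum(cm[i]) - tp                 # this class, predicted as another
        results[label] = {"TP": tp, "FP": fp, "FN": fn, "TN": total - tp - fp - fn}
    return results

for label, counts in one_vs_rest_counts(cm, labels).items():
    print(f"{label}: {counts}")
```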

In [57]:
def evaluating_metrics(y_true, y_pred):
    # Create a classification report
    report = classification_report(y_true, y_pred, labels=['bug', 'feature', 'question'], target_names=['bug', 'feature', 'question'], zero_division=0, output_dict=True)

    # Convert the report to a DataFrame
    report_df = pd.DataFrame(report).transpose()

    # Print the classification report
    print(report_df)
    return report_df

In [52]:
extension_testing_data = pd.concat([feature_testing, bug_testing, question_testing], ignore_index=True)

In [53]:
extension_testing_data

Unnamed: 0,id,labels,title,body,author_association
0,1316955675,feature,show subscriber count on user profile,as an author i would like to see how many subs...,NONE
1,1266063747,feature,switch from aioredis to redis,code from aioredis was incorporated into the o...,NONE
2,635318979,feature,revoking certificates produces high cpu load,please reserve github issues for bug reports a...,NONE
3,1115945046,feature,devtoolset11 for manylinux2014,last year we had the upgrade to devtoolset10 f...,NONE
4,1136348977,feature,roadside decor incomplete,thank you for expanding this roadside pikmin c...,NONE
...,...,...,...,...,...
14995,1200874318,question,how do i compile cpp programs,how do i compile cpp programs cpp file copyrig...,NONE
14996,1215450181,question,getting error check failed 0 at,description reproducible example environment i...,NONE
14997,1275828069,question,warn url is blacklist even though i can view t...,hey great app i was able to download a bunch o...,NONE
14998,1341076303,question,question about secure boot public key initiali...,hi all i have a question about the initial cre...,NONE


In [54]:
y_true_extended_4o_mini, y_pred_extended_4o_mini = test_model(extension_testing_data, gpt_4o_ft_extended_data_model)


Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: bug
Correct GitHub Issue type: feature
Predicted GitHub Issue type: question
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: bug
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature

In [62]:
y_true_extended_4o_mini

['feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'fe

In [63]:
y_pred_extended_4o_mini

['feature',
 'feature',
 'bug',
 'question',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'bug',
 'feature',
 'feature',
 'feature',
 'question',
 'feature',
 'feature',
 'feature',
 'question',
 'feature',
 'feature',
 'feature',
 'bug',
 'feature',
 'bug',
 'feature',
 'feature',
 'bug',
 'feature',
 'feature',
 'feature',
 'bug',
 'feature',
 'question',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'question',
 'feature',
 'feature',
 'bug',
 'feature',
 'feature',
 'feature',
 'feature',
 'bug',
 'feature',
 'question',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'question',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'question',
 'feature',
 'feature',
 'question',
 'feature',
 'feature',
 'feature',
 'question',
 'question',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'feature',
 'bug',
 'bug',
 'feature',
 'feature',
 'feature

In [65]:
calculate_metrics(y_true_extended_4o_mini, y_pred_extended_4o_mini, 'metrics/confusion_matrix_extende_4omini.csv')

feature: {'TP': 3782, 'FP': 478, 'FN': 1218, 'TN': 9522}
bug: {'TP': 4427, 'FP': 1323, 'FN': 573, 'TN': 8677}
question: {'TP': 3849, 'FP': 1141, 'FN': 1151, 'TN': 8859}
Precision = 0.8096830520263497
Recall = 0.8038666666666666
F1-score = 0.8036817099383777


In [66]:
extende_data_4o_metrics = evaluating_metrics(y_true_extended_4o_mini, y_pred_extended_4o_mini)

              precision    recall  f1-score       support
bug            0.769913  0.885400  0.823628   5000.000000
feature        0.887793  0.756400  0.816847   5000.000000
question       0.771343  0.769800  0.770571   5000.000000
accuracy       0.803867  0.803867  0.803867      0.803867
macro avg      0.809683  0.803867  0.803682  15000.000000
weighted avg   0.809683  0.803867  0.803682  15000.000000
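
As a sanity check, the per-class values in the report above can be recomputed from the TP/FP/FN counts printed by `calculate_metrics` for the same run. The sketch below uses only those printed counts; `per_class_scores` is a hypothetical helper. Because every class has the same support (5,000), the macro and weighted averages coincide.

```python
# Counts copied from the calculate_metrics output above (gpt-4o-mini run)
counts = {
    "feature": {"TP": 3782, "FP": 478, "FN": 1218},
    "bug": {"TP": 4427, "FP": 1323, "FN": 573},
    "question": {"TP": 3849, "FP": 1141, "FN": 1151},
}

def per_class_scores(c):
    """Precision, recall, and F1 for one class from its TP/FP/FN counts."""
    precision = c["TP"] / (c["TP"] + c["FP"])
    recall = c["TP"] / (c["TP"] + c["FN"])
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

scores = {label: per_class_scores(c) for label, c in counts.items()}
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.6f} recall={r:.6f} f1={f1:.6f}")
print(f"macro F1 = {macro_f1:.6f}")
```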


In [None]:
import json

gpt_4o_ft_extended_data_v2 = fine_tuning_and_training(extension_training_data, 'data/conversationaldata/extension/conversational_data_extension.jsonl', "extended15k", "gpt-4o-2024-08-06")

In [78]:
gpt_4o_ft_extended_data_model_v2 = client.fine_tuning.jobs.retrieve(gpt_4o_ft_extended_data_v2.id).fine_tuned_model

In [79]:
y_true_extended_4o, y_pred_extended_4o = test_model(extension_testing_data, gpt_4o_ft_extended_data_model_v2)

Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: bug
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: bug
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature
Correct GitHub Issue type: feature
Predicted GitHub Issue type: feature


In [80]:
calculate_metrics(y_true_extended_4o, y_pred_extended_4o, 'metrics/confusion_matrix_extende_4o.csv')

feature: {'TP': 3874, 'FP': 557, 'FN': 1126, 'TN': 9442}
bug: {'TP': 4299, 'FP': 1133, 'FN': 701, 'TN': 8866}
question: {'TP': 3929, 'FP': 1207, 'FN': 1070, 'TN': 8793}
Precision = 0.810236053696549
Recall = 0.8068
F1-score = 0.8069990873845255


In [81]:
extended_data_4o_metrics_v2 = evaluating_metrics(y_true_extended_4o, y_pred_extended_4o)

              precision  recall  f1-score  support
bug            0.791421  0.8598  0.824195   5000.0
feature        0.874295  0.7748  0.821546   5000.0
question       0.764992  0.7858  0.775257   5000.0
micro avg      0.806854  0.8068  0.806827  15000.0
macro avg      0.810236  0.8068  0.806999  15000.0
weighted avg   0.810236  0.8068  0.806999  15000.0
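
Two details of this run differ from the gpt-4o-mini one: the TP/FP/FN/TN counts for each class sum to 14,999 rather than 15,000, and the report shows `micro avg` instead of `accuracy`. One plausible explanation (an assumption, not verified against the raw predictions) is that a single prediction fell outside the three labels. The sketch below shows how such predictions could be counted and recomputes the micro-averaged precision from the printed counts; `unexpected_predictions` is a hypothetical helper and the toy list is not the real data.

```python
from collections import Counter

LABELS = ("feature", "bug", "question")

def unexpected_predictions(y_pred, labels=LABELS):
    """Count predictions that are not one of the expected labels."""
    return Counter(p for p in y_pred if p not in labels)

# Toy example (not the real predictions): one empty prediction slips through
print(unexpected_predictions(["bug", "feature", "", "question"]))

# Micro-averaged precision from the calculate_metrics counts above (gpt-4o run)
tp = 3874 + 4299 + 3929
fp = 557 + 1133 + 1207
micro_precision = tp / (tp + fp)
print(f"micro precision = {micro_precision:.6f}")
```

Running `unexpected_predictions(y_pred_extended_4o)` on the actual predictions would confirm or rule out this explanation.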


## Deepseek R1

We also wanted to compare against Deepseek R1, so we fine-tuned that model as well.

Due to resource constraints, the fine-tuning had to be done in a Google Colab notebook, which can be found here: https://colab.research.google.com/drive/1z_86hKTwxL2O4PHsxtT5_VZONHPUBFPW#scrollTo=a8w6uZC2YvOp

The evaluation results for the fine-tuned deepseek-r1 model are available under data/metrics/confusion_matrix_deepseek_r1_fine_tuned.csv