# NLBSE Code Comment Classification

This is our attempt at modeling the code comment classification problem through the NLBSE '25 challenge.

@article{rani2021,
                title={How to identify class comment types? A multi-language approach for class comment classification},
                author={Rani, Pooja and Panichella, Sebastiano and Leuenberger, Manuel and Di Sorbo, Andrea and Nierstrasz, Oscar},
                journal={Journal of systems and software},
                volume={181},
                pages={111047},
                year={2021},
                publisher={Elsevier}
              }
@INPROCEEDINGS{AlKaswan2023,
                author={Al-Kaswan, Ali and Izadi, Maliheh and Van Deursen, Arie},
                booktitle={2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE)},
                title={STACC: Code Comment Classification using SentenceTransformers},
                year={2023},
                pages={28-31}
              }
@inproceedings{pascarella2017,
                title={Classifying code comments in Java open-source software systems},
                author={Pascarella, Luca and Bacchelli, Alberto},
                booktitle={2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)},
                year={2017},
                organization={IEEE}
              }

In [None]:
!pip install --upgrade setfit
!pip install --upgrade huggingface_hub
!pip install torch torchvision torchaudio
!pip3 install transformers==4.42.2
!pip3 install datasets



**Important note:** We made sure to include the imports and original dataset from the baseline just in case we missed anything. This was the only thing that we pulled from the baseline code. The rest is from our own research.

In [None]:
import pandas as pd
from setfit import SetFitModel, SetFitTrainer, Trainer
from datasets import Dataset, DatasetDict, load_dataset
from tqdm.auto import tqdm
import numpy as np
import torch
from sklearn.feature_extraction.text import TfidfVectorizer

tqdm.pandas()

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
langs = ['java', 'python', 'pharo']
labels = {
    'java': ['summary', 'Ownership', 'Expand', 'usage', 'Pointer', 'deprecation', 'rational'],
    'python': ['Usage', 'Parameters', 'DevelopmentNotes', 'Expand', 'Summary'],
    'pharo': ['Keyimplementationpoints', 'Example', 'Responsibilities', 'Classreferences', 'Intent', 'Keymessages', 'Collaborators']
}
ds = load_dataset('NLBSE/nlbse25-code-comment-classification')
ds

DatasetDict({
    java_train: Dataset({
        features: ['index', 'class', 'comment_sentence', 'partition', 'combo', 'labels'],
        num_rows: 7614
    })
    java_test: Dataset({
        features: ['index', 'class', 'comment_sentence', 'partition', 'combo', 'labels'],
        num_rows: 1725
    })
    python_train: Dataset({
        features: ['index', 'class', 'comment_sentence', 'partition', 'combo', 'labels'],
        num_rows: 1884
    })
    python_test: Dataset({
        features: ['index', 'class', 'comment_sentence', 'partition', 'combo', 'labels'],
        num_rows: 406
    })
    pharo_train: Dataset({
        features: ['index', 'class', 'comment_sentence', 'partition', 'combo', 'labels'],
        num_rows: 1298
    })
    pharo_test: Dataset({
        features: ['index', 'class', 'comment_sentence', 'partition', 'combo', 'labels'],
        num_rows: 289
    })
})

Since the dataset contains separate splits for each language, each with its own set of labels, we split our project into three parts and classified each language separately, because we did not see a clear way to combine them into one model. Our first order of business was to find out which models were best suited for this task, so we tested on the Java dataset first.

# Part 1: Testing Models


### Loading Data

To load our data, we took the dataset loaded previously and extracted only the Java train and test sets.

In [None]:
#
# Loading Java for processing
#

java_labels = labels['java']  # Label names as words

java_train = ds['java_train'].to_pandas()
java_train_labels = ds['java_train']['labels']  # Labels as binary vectors, e.g. [1,0,0,0,0,0,0] (one entry per Java label)

#java_train_true_labels = [java_labels[i] for i in np.argmax(java_train_labels, axis=1)]



java_test = ds['java_test'].to_pandas()
java_test_labels = ds['java_test']['labels']

#java_test_true_labels = [java_labels[i] for i in np.argmax(java_test_labels, axis=1)]



### Data Preprocessing

For data preprocessing, we mainly relied on scikit-learn's text vectorizers, starting from CountVectorizer, which we discovered through our research. A vectorizer builds a 'bag of words' that we can adjust through its parameters. The preprocessing steps we focused on were removing stop words, lowercasing all words, and changing the n-gram range. Note that the Java `preprocess_text` helper below uses the closely related TfidfVectorizer, which accepts the same options. Below are some of the links we used to accomplish this.


- [Count Vectorization](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

- [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model)

- [Text Preprocessing](https://www.geeksforgeeks.org/text-preprocessing-in-python-set-1/)

- [Text Preprocessing using Count Vectorizers](https://towardsdatascience.com/basics-of-countvectorizer-e26677900f9c)
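
As a rough illustration of what these preprocessing options do, the sketch below builds a bag of words by hand in plain Python. The helper name `bag_of_words` and the example sentence are made up for this illustration, and the tokenization is far simpler than what scikit-learn's vectorizers do internally.

```python
import re
from collections import Counter

def bag_of_words(text, stop_words=(), ngram_range=(1, 1), lowercase=True):
    """Count word n-grams in `text` (hypothetical helper, for illustration only)."""
    if lowercase:
        text = text.lower()
    stop = set(stop_words)
    # Keep tokens of two or more word characters, then drop stop words.
    tokens = [t for t in re.findall(r"\b\w\w+\b", text) if t not in stop]
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

example = "TODO: this method returns the size of the list"
unigrams = bag_of_words(example, stop_words=["todo", "of", "the"])
bigrams = bag_of_words(example, stop_words=["todo", "of", "the"], ngram_range=(1, 2))
print(unigrams)
print(sorted(bigrams))
```

With `ngram_range=(1, 2)` the vocabulary grows because every pair of adjacent tokens becomes its own feature, which is why wider n-gram ranges produce much larger matrices.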

In [None]:
from sklearn.feature_extraction.text import CountVectorizer


#
# For preprocessing we use a scikit-learn vectorizer (TfidfVectorizer in this helper),
# since it handles lowercasing, stop-word removal and n-grams for us
#

stop_words = [ 'todo', 'is','a', 'for', 'it', 'its','of','and','to' ]
def preprocess_text(train, test):

  vectorizer = TfidfVectorizer(lowercase = True, stop_words=stop_words, ngram_range=(1,1))
  train_vectorized = vectorizer.fit_transform(train)
  test_vectorized = vectorizer.transform(test)

  return train_vectorized, test_vectorized

In [None]:
#
# We only need the 'combo' text column as model input.
# The labels are stored as binary vectors, so np.argmax converts each one to a single integer class index
# (if a row has more than one label set, argmax keeps only the first).
#
java_train_combo = ds['java_train']['combo']
java_train_labels_int = np.argmax(java_train_labels, axis=1)

java_test_combo= ds['java_test']['combo']
java_test_labels_int = np.argmax(java_test_labels, axis=1)
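
To make the label conversion concrete, here is a small sketch of what taking the argmax of a binary label vector does. The rows are invented for illustration, and `first_max_index` is a hypothetical stand-in that mirrors how `np.argmax` behaves on a single row: it returns the position of the first maximum.

```python
def first_max_index(row):
    """Index of the first maximum in `row`, mirroring np.argmax for a 1-D row."""
    return max(range(len(row)), key=row.__getitem__)

# Label names in the same order as labels['java']; the rows are illustrative only.
java_labels = ['summary', 'Ownership', 'Expand', 'usage', 'Pointer', 'deprecation', 'rational']
rows = [
    [1, 0, 0, 0, 0, 0, 0],  # one label: summary
    [0, 0, 0, 1, 0, 0, 0],  # one label: usage
    [1, 0, 0, 1, 0, 0, 0],  # two labels: summary and usage
]
for row in rows:
    idx = first_max_index(row)
    active = [name for name, flag in zip(java_labels, row) if flag]
    print(idx, java_labels[idx], "| all active labels:", active)
```

The third row shows the trade-off: a comment carrying two labels is reduced to the first one, so this conversion turns the task into single-label classification.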

In [None]:

java_train_vectorized, java_test_vectorized = preprocess_text(java_train_combo, java_test_combo)


print("Java train:")
print("Data shape: ",java_train_vectorized.shape)
print("Labels shape: ",java_train_labels_int.shape)
print()
print("Java test:")
print("Data shape: ",java_test_vectorized.shape)
print("Labels shape: ",java_test_labels_int.shape)

Java train:
Data shape:  (7614, 7081)
Labels shape:  (7614,)

Java test:
Data shape:  (1725, 7081)
Labels shape:  (1725,)


## 1A: Logistic Regression

Our first attempt at a model used logistic regression. We found this method through our research and through our lecture slides. It served as our first attempt to see where we stood, a sort of "baseline", since it was simple to set up.

Logistic Regression works well for multi-class problems, and since we are working with sparse word vectors, it was a good first choice.



- [Word Classification](https://medium.com/analytics-vidhya/nlp-tutorial-for-text-classification-in-python-8f19cd17b49e)

- [Logistic Regression](https://spotintelligence.com/2023/02/22/logistic-regression-text-classification-python/#:~:text=Once%20the%20model%20is%20trained,complex%20models%20in%20ensemble%20approaches.)


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [None]:
java_model = LogisticRegression(max_iter=1000)
java_model.fit(java_train_vectorized, java_train_labels_int)

predictions = java_model.predict(java_test_vectorized)
java_original_results = classification_report(java_test_labels_int, predictions, output_dict=True)

print(java_original_results)

{'0': {'precision': 0.7773399014778325, 'recall': 0.8845291479820628, 'f1-score': 0.8274777136864184, 'support': 892.0}, '1': {'precision': 1.0, 'recall': 0.9555555555555556, 'f1-score': 0.9772727272727273, 'support': 45.0}, '2': {'precision': 0.7058823529411765, 'recall': 0.12, 'f1-score': 0.20512820512820512, 'support': 100.0}, '3': {'precision': 0.8091603053435115, 'recall': 0.7447306791569087, 'f1-score': 0.775609756097561, 'support': 427.0}, '4': {'precision': 0.6949152542372882, 'recall': 0.9213483146067416, 'f1-score': 0.7922705314009661, 'support': 178.0}, '5': {'precision': 1.0, 'recall': 0.4, 'f1-score': 0.5714285714285714, 'support': 15.0}, '6': {'precision': 0.4666666666666667, 'recall': 0.10294117647058823, 'f1-score': 0.1686746987951807, 'support': 68.0}, 'accuracy': 0.776231884057971, 'macro avg': {'precision': 0.7791377829523537, 'recall': 0.589872124824551, 'f1-score': 0.6168374576870901, 'support': 1725.0}, 'weighted avg': {'precision': 0.768066739931359, 'recall': 0.

Based on our results, our scores are comparable to those of the baseline, which uses SetFit models. With what we learned, we then attempted to improve on this model by using pipelines, other vectorizers, and other models.
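
When reading these reports, it helps to see how the two averages are formed. The sketch below recomputes them from the per-class F1 scores and supports printed in the Logistic Regression report above. It is only a hand calculation for illustration, not part of the original workflow.

```python
# Per-class F1 and support copied from the Logistic Regression report above.
f1 = [0.8274777136864184, 0.9772727272727273, 0.20512820512820512,
      0.775609756097561, 0.7922705314009661, 0.5714285714285714,
      0.1686746987951807]
support = [892, 45, 100, 427, 178, 15, 68]

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1) / len(f1)
# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

The macro average is pulled down by the two classes with F1 near 0.2, while the weighted average is dominated by the largest class, so the two numbers answer different questions about the same model.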

## 1B: Pipeline models


Here we researched the effectiveness of other possible models. Pipelines let us compare accuracy across all the labels to determine which vectorizer and model combination gave the best accuracy with minimal configuration.

We created two sets of pipelines to test these classifiers, many of which we learned about in CS345. One set uses CountVectorizer and the other uses TfidfVectorizer.

In [None]:
def count_pipeline_creator(classifier):
    pipeline = Pipeline([
      ('countvectorizer', CountVectorizer(lowercase = True) ),
      ('clf', classifier)
    ])
    return pipeline

def tfidf_pipeline_creator(classifier):
    pipeline = Pipeline([
      ('tfidf', TfidfVectorizer() ),
      ('clf', classifier)
    ])
    return pipeline

def use_pipeline(pipeline, train_data, train_labels, test_data, test_labels):
  pipeline.fit(train_data, train_labels)
  predictions = pipeline.predict(test_data)
  return classification_report(test_labels, predictions, output_dict=True)

### CountVectorizer Pipelines:
Our first set of pipelines pairs classifiers with CountVectorizer. The classifiers we tried include Multinomial and Complement Naive Bayes, Linear SVC, and Random Forest. Classifiers like MultinomialNB work well for word classification, and we also decided to include other classifiers we were familiar with as a comparison.

In [None]:
#CountVectorizer pipelines
pipelineMNB_count = count_pipeline_creator(MultinomialNB())
pipelineCNB_count = count_pipeline_creator(ComplementNB())
pipelineLR_count = count_pipeline_creator(LogisticRegression())
pipelineSVC_count = count_pipeline_creator(LinearSVC())
pipelineRF_count = count_pipeline_creator(RandomForestClassifier(n_estimators=100, random_state=42))

### Count Vectorizer Training & Testing:

In [None]:
print("COUNT VECTORIZER")

print("Pipeline 1:")
print(use_pipeline(pipelineMNB_count, java_train_combo, java_train_labels_int, java_test_combo, java_test_labels_int))

print("Pipeline 2:")
print(use_pipeline(pipelineCNB_count, java_train_combo, java_train_labels_int, java_test_combo, java_test_labels_int))

print("Pipeline 3:")
print(use_pipeline(pipelineLR_count, java_train_combo, java_train_labels_int, java_test_combo, java_test_labels_int))

print("Pipeline 4:")
print(use_pipeline(pipelineSVC_count, java_train_combo, java_train_labels_int, java_test_combo, java_test_labels_int))

print("Pipeline 5:")
print(use_pipeline(pipelineRF_count, java_train_combo, java_train_labels_int, java_test_combo, java_test_labels_int))

COUNT VECTORIZER
Pipeline 1:
{'0': {'precision': 0.7353535353535353, 'recall': 0.8161434977578476, 'f1-score': 0.7736450584484591, 'support': 892.0}, '1': {'precision': 0.9772727272727273, 'recall': 0.9555555555555556, 'f1-score': 0.9662921348314607, 'support': 45.0}, '2': {'precision': 0.30434782608695654, 'recall': 0.14, 'f1-score': 0.1917808219178082, 'support': 100.0}, '3': {'precision': 0.6933962264150944, 'recall': 0.6885245901639344, 'f1-score': 0.690951821386604, 'support': 427.0}, '4': {'precision': 0.7474226804123711, 'recall': 0.8146067415730337, 'f1-score': 0.7795698924731183, 'support': 178.0}, '5': {'precision': 1.0, 'recall': 0.13333333333333333, 'f1-score': 0.23529411764705882, 'support': 15.0}, '6': {'precision': 0.2, 'recall': 0.07352941176470588, 'f1-score': 0.10752688172043011, 'support': 68.0}, 'accuracy': 0.7136231884057971, 'macro avg': {'precision': 0.665398999362955, 'recall': 0.5173847328783443, 'f1-score': 0.5350086754892771, 'support': 1725.0}, 'weighted avg

### Tfidf Vectorizer Pipelines:
Our next approach used TfidfVectorizer. It vectorizes text much like CountVectorizer, but instead of raw counts it weights each term by how often it occurs in a comment, scaled down for terms that appear in many comments, so distinctive words carry more weight.
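
For intuition, the weighting can be worked through by hand. The sketch below uses the smoothed idf formula and L2 normalization that scikit-learn documents as the TfidfVectorizer defaults. The three tokenized toy comments and the helper names are invented for illustration.

```python
import math

# Three tiny, already tokenized "comments" (illustrative data only).
docs = [
    ["returns", "the", "size"],
    ["returns", "the", "name"],
    ["deprecated", "use", "the", "size"],
]

def smoothed_idf(term, docs):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n documents and df(t) containing t."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf(doc, docs):
    """L2-normalized tf-idf weights for one tokenized document."""
    raw = {t: doc.count(t) * smoothed_idf(t, docs) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[2], docs)
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:<12}{w:.3f}")
```

A word such as "the", which appears in every comment, gets the minimum idf of 1, while "deprecated", which appears in only one, receives the largest weight.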

In [None]:
#TfidfVectorizer pipelines
pipelineMNB_tfidf = tfidf_pipeline_creator(MultinomialNB())
pipelineCNB_tfidf = tfidf_pipeline_creator(ComplementNB())
pipelineLR_tfidf = tfidf_pipeline_creator(LogisticRegression())
pipelineSVC_tfidf = tfidf_pipeline_creator(LinearSVC())
pipelineRF_tfidf = tfidf_pipeline_creator(RandomForestClassifier(n_estimators=100, random_state=42))

### Tfidf Vectorizer Training & Testing:

In [None]:
print("TFIDF")

print("Pipeline 6:")
print(use_pipeline(pipelineMNB_tfidf, java_train_combo, java_train_labels_int, java_test_combo, java_test_labels_int))

print("Pipeline 7:")
print(use_pipeline(pipelineCNB_tfidf, java_train_combo, java_train_labels_int, java_test_combo, java_test_labels_int))

print("Pipeline 8:")
print(use_pipeline(pipelineLR_tfidf, java_train_combo, java_train_labels_int, java_test_combo, java_test_labels_int))

print("Pipeline 9:")
print(use_pipeline(pipelineSVC_tfidf, java_train_combo, java_train_labels_int, java_test_combo, java_test_labels_int))

print("Pipeline 10:")
print(use_pipeline(pipelineRF_tfidf, java_train_combo, java_train_labels_int, java_test_combo, java_test_labels_int))

TFIDF
Pipeline 6:
{'0': {'precision': 0.614529280948851, 'recall': 0.929372197309417, 'f1-score': 0.7398482820169567, 'support': 892.0}, '1': {'precision': 1.0, 'recall': 0.7333333333333333, 'f1-score': 0.8461538461538461, 'support': 45.0}, '2': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 100.0}, '3': {'precision': 0.7300380228136882, 'recall': 0.4496487119437939, 'f1-score': 0.5565217391304348, 'support': 427.0}, '4': {'precision': 0.65, 'recall': 0.29213483146067415, 'f1-score': 0.40310077519379844, 'support': 178.0}, '5': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 15.0}, '6': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 68.0}, 'accuracy': 0.6411594202898551, 'macro avg': {'precision': 0.42779532910893414, 'recall': 0.3434984391496026, 'f1-score': 0.36366066321357654, 'support': 1725.0}, 'weighted avg': {'precision': 0.5916442633900405, 'recall': 0.6411594202898551, 'f1-score': 0.5840048181039075, 'support': 1725.0}}
Pipeline 7:


{'0': {'precision': 0.7423167848699763, 'recall': 0.7040358744394619, 'f1-score': 0.7226697353279632, 'support': 892.0}, '1': {'precision': 0.7894736842105263, 'recall': 1.0, 'f1-score': 0.8823529411764706, 'support': 45.0}, '2': {'precision': 0.2111111111111111, 'recall': 0.19, 'f1-score': 0.2, 'support': 100.0}, '3': {'precision': 0.6896551724137931, 'recall': 0.6088992974238876, 'f1-score': 0.6467661691542289, 'support': 427.0}, '4': {'precision': 0.625, 'recall': 0.898876404494382, 'f1-score': 0.7373271889400922, 'support': 178.0}, '5': {'precision': 0.4117647058823529, 'recall': 0.4666666666666667, 'f1-score': 0.4375, 'support': 15.0}, '6': {'precision': 0.13414634146341464, 'recall': 0.16176470588235295, 'f1-score': 0.14666666666666667, 'support': 68.0}, 'accuracy': 0.6550724637681159, 'macro avg': {'precision': 0.5147811142787392, 'recall': 0.5757489927009644, 'f1-score': 0.5390403858950602, 'support': 1725.0}, 'weighted avg': {'precision': 0.6607624228597339, 'recall': 0.655072

From the outputs above, the highest average score came from Pipeline 8, TfidfVectorizer with Logistic Regression. It had the highest average F1 score across all the categories, so we use it as our baseline and fine-tune from there to achieve the best accuracy.

One important note: each dataset performed slightly better with different pipelines. To account for this, we tested all the pipelines again in each section.

Below, we used grid search (`GridSearchCV`), a method we learned in CS345 that fits a model with each combination of the given parameters and reports the best parameters along with the best cross-validated score. We also ran the pipeline once more to see its accuracy before tuning.

Best Pipeline:

In [None]:
print("Pipeline 5:")
java_pipeline_results = use_pipeline(pipelineRF_count, java_train_combo, java_train_labels_int, java_test_combo, java_test_labels_int)
print(java_pipeline_results)

Pipeline 5:


{'0': {'precision': 0.7730294396961064, 'recall': 0.9125560538116592, 'f1-score': 0.8370179948586118, 'support': 892.0}, '1': {'precision': 0.9782608695652174, 'recall': 1.0, 'f1-score': 0.989010989010989, 'support': 45.0}, '2': {'precision': 0.5555555555555556, 'recall': 0.05, 'f1-score': 0.09174311926605505, 'support': 100.0}, '3': {'precision': 0.8337801608579088, 'recall': 0.7283372365339579, 'f1-score': 0.7775, 'support': 427.0}, '4': {'precision': 0.748898678414097, 'recall': 0.9550561797752809, 'f1-score': 0.8395061728395061, 'support': 178.0}, '5': {'precision': 0.8571428571428571, 'recall': 0.4, 'f1-score': 0.5454545454545454, 'support': 15.0}, '6': {'precision': 0.5, 'recall': 0.07352941176470588, 'f1-score': 0.1282051282051282, 'support': 68.0}, 'accuracy': 0.7860869565217391, 'macro avg': {'precision': 0.7495239373188204, 'recall': 0.5884969831265148, 'f1-score': 0.601205421376405, 'support': 1725.0}, 'weighted avg': {'precision': 0.7682926325774472, 'recall': 0.78608695652

### Grid Search Optimization:

In [None]:
parameters = {
    'countvectorizer__ngram_range':((1,1),(1,2)),
    'clf__n_estimators': (50, 150, 500),
}

grid_search = GridSearchCV(pipelineRF_count, parameters, n_jobs=-1, verbose=1)
grid_search.fit(java_train_combo, java_train_labels_int)

print("Best parameters: ", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

java_tuned_results = classification_report(java_test_labels_int, grid_search.best_estimator_.predict(java_test_combo), output_dict=True)
print(java_tuned_results)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


Best parameters:  {'clf__n_estimators': 150, 'countvectorizer__ngram_range': (1, 1)}
Best score: 0.7784472516464582
{'0': {'precision': 0.771050141911069, 'recall': 0.9136771300448431, 'f1-score': 0.836326321190354, 'support': 892.0}, '1': {'precision': 0.9782608695652174, 'recall': 1.0, 'f1-score': 0.989010989010989, 'support': 45.0}, '2': {'precision': 0.45454545454545453, 'recall': 0.05, 'f1-score': 0.09009009009009009, 'support': 100.0}, '3': {'precision': 0.8387978142076503, 'recall': 0.7189695550351288, 'f1-score': 0.7742749054224464, 'support': 427.0}, '4': {'precision': 0.748898678414097, 'recall': 0.9550561797752809, 'f1-score': 0.8395061728395061, 'support': 178.0}, '5': {'precision': 0.8571428571428571, 'recall': 0.4, 'f1-score': 0.5454545454545454, 'support': 15.0}, '6': {'precision': 0.45454545454545453, 'recall': 0.07352941176470588, 'f1-score': 0.12658227848101267, 'support': 68.0}, 'accuracy': 0.7843478260869565, 'macro avg': {'precision': 0.729034467190257, 'recall': 0

As we can see, this pipeline, along with some optimization, performed slightly better than our Logistic Regression (accuracy of about 0.784 versus 0.776), although tuning did not improve on the untuned pipeline's 0.786. Interestingly, when we repeated this process with the other datasets, we found that they were subject to overfitting (more on that later).
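
The line "Fitting 5 folds for each of 6 candidates, totalling 30 fits" in the output above follows directly from the parameter grid. The sketch below enumerates the same grid with `itertools.product` to show where the numbers come from. It only counts candidates and does not train anything.

```python
from itertools import product

# Same grid as the Java search above.
parameters = {
    'countvectorizer__ngram_range': ((1, 1), (1, 2)),
    'clf__n_estimators': (50, 150, 500),
}
n_folds = 5  # GridSearchCV's default when no cv argument is passed

# Every combination of one value per parameter is one candidate.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates x", n_folds, "folds =", len(candidates) * n_folds, "fits")
```

Because the candidate count is the product of the option counts, adding one more parameter with three values would triple the number of fits, which is worth keeping in mind with slower models such as large random forests.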

## 1C: SetFit model optimization


We also opted to test a SetFit model built on a pretrained model, just like the baseline. The NLBSE baseline uses its own pretrained model, so we looked for a similar one to test, since Transformers tend to perform better in this area. We looked through the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) to find a model suited to our application, but opted to use the base BERT model, "bert-base-uncased".

Below, we created a small subset of our data for testing purposes, since time constraints kept us from using the entire dataset.

In [None]:
train = ds['java_train'].shuffle(seed=39)
train = train.select(range(30))

print(np.sum(train['labels'], axis=0))


[13  1  2 10  3  1  1]


In [None]:
# This code is based on the baseline code, it has been adapted to use a different pretrained model with a subset of the data for faster processing time

# sfm = SetFitModel.from_pretrained("bert-base-uncased", multi_target_strategy="multi-output")


# trainer = Trainer(
#     train_dataset=train,  # the small subset selected above
#     eval_dataset=ds['java_test'],
#     column_mapping={"combo": "text", "labels": "label"},
#     model=sfm,
# )

# trainer.train()
# metrics = trainer.evaluate()
# print(metrics)

# predictions = sfm.predict(ds['java_test']['combo'])
# print(classification_report(ds['java_test']['labels'], predictions))

One of the primary problems with SetFit is the amount of time and processing power it requires to get good results. Running on an Intel i9 CPU still took ~33 minutes on just 100 rows of data and resulted in around 70% accuracy. Running this on the 7614 rows of java_train would take a very long time. It could be completed with better hardware, notably a GPU with adequate memory, but that is out of scope for this project.
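
As a back-of-the-envelope estimate of what "a very long time" means, the sketch below scales the measured time up to the full training set. It assumes training time grows linearly with the number of rows. That is only an assumption, since SetFit trains on pairs of examples and its real cost may scale differently.

```python
minutes_for_sample = 33   # reported above for the small sample
sample_rows = 100         # rows in that sample
full_rows = 7614          # rows in java_train

# Assumption: training time grows linearly with the number of rows.
estimated_minutes = minutes_for_sample * full_rows / sample_rows
estimated_hours = estimated_minutes / 60
print(f"about {estimated_minutes:.0f} minutes, or {estimated_hours:.1f} hours")
```

Even under this optimistic assumption, a single CPU training run on the Java data alone would take well over a day.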

# Part 2: Other Datasets

With these testing methods in mind, we set out to test our models on the Python and Pharo datasets. We repeated the same process as above, and these were the results of our testing.

## 2A: Python Dataset

In [None]:
# Loading Python for processing
#

python_labels = labels['python']  # These Labels represented as words

python_train = ds['python_train'].to_pandas()
python_train_labels = ds['python_train']['labels']  # These labels represented as [1, 0, 0, 0, 0], One-Hot Encoding

python_test = ds['python_test'].to_pandas()
python_test_labels = ds['python_test']['labels']

### Data Preprocessing


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

stop_words = ['the', 'todo', 'is','a', 'this', 'for', 'it', 'its' ]

def preprocess_text(train, test):

  vectorizer = CountVectorizer(lowercase = True, stop_words=stop_words)
  train_vectorized = vectorizer.fit_transform(train)
  test_vectorized = vectorizer.transform(test)

  return train_vectorized, test_vectorized


In [None]:
python_train_combo = ds['python_train']['combo']
python_train_labels_int = np.argmax(python_train_labels, axis=1)

python_test_combo = ds['python_test']['combo']
python_test_labels_int = np.argmax(python_test_labels, axis=1)

In [None]:

python_train_vectorized, python_test_vectorized = preprocess_text(python_train_combo, python_test_combo)

print("Python train:")
print("Data shape: ",python_train_vectorized.shape)
print("Labels shape: ",python_train_labels_int.shape)
print()
print("Python test:")
print("Data shape: ",python_test_vectorized.shape)
print("Labels shape: ",python_test_labels_int.shape)

Python train:
Data shape:  (1884, 2550)
Labels shape:  (1884,)

Python test:
Data shape:  (406, 2550)
Labels shape:  (406,)


### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

python_model = LogisticRegression(max_iter=1000)
python_model.fit(python_train_vectorized, python_train_labels_int)

predictions = python_model.predict(python_test_vectorized)

python_original_results = classification_report(python_test_labels_int, predictions, output_dict=True)

print(python_original_results)

{'0': {'precision': 0.589041095890411, 'recall': 0.7107438016528925, 'f1-score': 0.6441947565543071, 'support': 121.0}, '1': {'precision': 0.7777777777777778, 'recall': 0.7716535433070866, 'f1-score': 0.7747035573122529, 'support': 127.0}, '2': {'precision': 0.09523809523809523, 'recall': 0.06060606060606061, 'f1-score': 0.07407407407407407, 'support': 33.0}, '3': {'precision': 0.4375, 'recall': 0.2916666666666667, 'f1-score': 0.35, 'support': 48.0}, '4': {'precision': 0.48148148148148145, 'recall': 0.5064935064935064, 'f1-score': 0.4936708860759494, 'support': 77.0}, 'accuracy': 0.5886699507389163, 'macro avg': {'precision': 0.4762076900775531, 'recall': 0.4682327157452425, 'f1-score': 0.46732865480331676, 'support': 406.0}, 'weighted avg': {'precision': 0.5696272945749968, 'recall': 0.5886699507389163, 'f1-score': 0.5753498029409356, 'support': 406.0}}
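
To put these accuracies in context, a useful reference point is a model that always predicts the most frequent class. The sketch below computes that majority-class baseline from the test-set support values printed in the reports above. The helper name `majority_baseline` is introduced here for illustration.

```python
# Test-set support per class, copied from the reports above.
java_support = [892, 45, 100, 427, 178, 15, 68]
python_support = [121, 127, 33, 48, 77]

def majority_baseline(support):
    """Accuracy of a model that always predicts the most frequent class."""
    return max(support) / sum(support)

print(f"Java:   {majority_baseline(java_support):.3f}")
print(f"Python: {majority_baseline(python_support):.3f}")
```

For comparison, the Logistic Regression runs above reported accuracies of about 0.776 on Java and 0.589 on Python, so both models clear this baseline even though the Python score looks low in absolute terms.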


### Pipeline Models

##### Count Vectorizer

In [None]:
print("COUNT VECTORIZER")

print("Pipeline 1:")
print(use_pipeline(pipelineMNB_count, python_train_combo, python_train_labels_int, python_test_combo, python_test_labels_int))

print("Pipeline 2:")
print(use_pipeline(pipelineCNB_count, python_train_combo, python_train_labels_int, python_test_combo, python_test_labels_int))

print("Pipeline 3:")
print(use_pipeline(pipelineLR_count, python_train_combo, python_train_labels_int, python_test_combo, python_test_labels_int))

print("Pipeline 4:")
print(use_pipeline(pipelineSVC_count, python_train_combo, python_train_labels_int, python_test_combo, python_test_labels_int))

print("Pipeline 5:")
print(use_pipeline(pipelineRF_count, python_train_combo, python_train_labels_int, python_test_combo, python_test_labels_int))



COUNT VECTORIZER
Pipeline 1:
{'0': {'precision': 0.6470588235294118, 'recall': 0.5454545454545454, 'f1-score': 0.5919282511210763, 'support': 121.0}, '1': {'precision': 0.6453488372093024, 'recall': 0.8740157480314961, 'f1-score': 0.7424749163879598, 'support': 127.0}, '2': {'precision': 0.3333333333333333, 'recall': 0.15151515151515152, 'f1-score': 0.20833333333333334, 'support': 33.0}, '3': {'precision': 0.40425531914893614, 'recall': 0.3958333333333333, 'f1-score': 0.4, 'support': 48.0}, '4': {'precision': 0.6, 'recall': 0.5454545454545454, 'f1-score': 0.5714285714285714, 'support': 77.0}, 'accuracy': 0.5985221674876847, 'macro avg': {'precision': 0.5259992626441967, 'recall': 0.5024546647578143, 'f1-score': 0.5028330144541882, 'support': 406.0}, 'weighted avg': {'precision': 0.5833932888960325, 'recall': 0.5985221674876847, 'f1-score': 0.581262642283057, 'support': 406.0}}
Pipeline 2:


{'0': {'precision': 0.6702127659574468, 'recall': 0.5206611570247934, 'f1-score': 0.586046511627907, 'support': 121.0}, '1': {'precision': 0.7194244604316546, 'recall': 0.7874015748031497, 'f1-score': 0.7518796992481203, 'support': 127.0}, '2': {'precision': 0.2631578947368421, 'recall': 0.30303030303030304, 'f1-score': 0.28169014084507044, 'support': 33.0}, '3': {'precision': 0.423728813559322, 'recall': 0.5208333333333334, 'f1-score': 0.4672897196261682, 'support': 48.0}, '4': {'precision': 0.5657894736842105, 'recall': 0.5584415584415584, 'f1-score': 0.5620915032679739, 'support': 77.0}, 'accuracy': 0.5935960591133005, 'macro avg': {'precision': 0.5284626816738952, 'recall': 0.5380735853266276, 'f1-score': 0.5297995149230479, 'support': 406.0}, 'weighted avg': {'precision': 0.603575453710637, 'recall': 0.5935960591133005, 'f1-score': 0.5945987109681414, 'support': 406.0}}
Pipeline 3:
{'0': {'precision': 0.6090225563909775, 'recall': 0.6694214876033058, 'f1-score': 0.6377952755905512

##### TFIDF

In [None]:
print("TFIDF")

print("Pipeline 6:")
print(use_pipeline(pipelineMNB_tfidf, python_train_combo, python_train_labels_int, python_test_combo, python_test_labels_int))

print("Pipeline 7:")
print(use_pipeline(pipelineCNB_tfidf, python_train_combo, python_train_labels_int, python_test_combo, python_test_labels_int))

print("Pipeline 8:")
print(use_pipeline(pipelineLR_tfidf, python_train_combo, python_train_labels_int, python_test_combo, python_test_labels_int))

print("Pipeline 9:")
print(use_pipeline(pipelineSVC_tfidf, python_train_combo, python_train_labels_int, python_test_combo, python_test_labels_int))

print("Pipeline 10:")
print(use_pipeline(pipelineRF_tfidf, python_train_combo, python_train_labels_int, python_test_combo, python_test_labels_int))

TFIDF
Pipeline 6:
{'0': {'precision': 0.5337837837837838, 'recall': 0.6528925619834711, 'f1-score': 0.587360594795539, 'support': 121.0}, '1': {'precision': 0.5409090909090909, 'recall': 0.937007874015748, 'f1-score': 0.6858789625360231, 'support': 127.0}, '2': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 33.0}, '3': {'precision': 0.875, 'recall': 0.14583333333333334, 'f1-score': 0.25, 'support': 48.0}, '4': {'precision': 0.6896551724137931, 'recall': 0.2597402597402597, 'f1-score': 0.37735849056603776, 'support': 77.0}, 'accuracy': 0.5541871921182266, 'macro avg': {'precision': 0.5278696094213335, 'recall': 0.39909480581456236, 'f1-score': 0.38011960957951996, 'support': 406.0}, 'weighted avg': {'precision': 0.5625289178796907, 'recall': 0.5541871921182266, 'f1-score': 0.4907238029209854, 'support': 406.0}}
Pipeline 7:
{'0': {'precision': 0.6702127659574468, 'recall': 0.5206611570247934, 'f1-score': 0.586046511627907, 'support': 121.0}, '1': {'precision': 0.6818181818

{'0': {'precision': 0.5616438356164384, 'recall': 0.6776859504132231, 'f1-score': 0.6142322097378277, 'support': 121.0}, '1': {'precision': 0.6870748299319728, 'recall': 0.7952755905511811, 'f1-score': 0.7372262773722628, 'support': 127.0}, '2': {'precision': 0.3333333333333333, 'recall': 0.12121212121212122, 'f1-score': 0.17777777777777778, 'support': 33.0}, '3': {'precision': 0.43333333333333335, 'recall': 0.2708333333333333, 'f1-score': 0.3333333333333333, 'support': 48.0}, '4': {'precision': 0.5915492957746479, 'recall': 0.5454545454545454, 'f1-score': 0.5675675675675675, 'support': 77.0}, 'accuracy': 0.5960591133004927, 'macro avg': {'precision': 0.5213869255979452, 'recall': 0.48209230819288074, 'f1-score': 0.4860274331577538, 'support': 406.0}, 'weighted avg': {'precision': 0.5728243923290579, 'recall': 0.5960591133004927, 'f1-score': 0.5751704531377436, 'support': 406.0}}
Pipeline 9:
{'0': {'precision': 0.6638655462184874, 'recall': 0.6528925619834711, 'f1-score': 0.65833333333

#### Best Pipeline

In [None]:
print("Pipeline 9:")
python_pipeline_results = use_pipeline(pipelineSVC_tfidf, python_train_combo, python_train_labels_int, python_test_combo, python_test_labels_int)

Pipeline 9:


#### Grid Search Optimization
Interestingly, when we started to tune the parameters, our accuracy dropped significantly. We believe this is due to heavy overfitting, although note that `best_score_` is a cross-validated score on the training data, so it is not directly comparable to the test-set accuracy. Below, we included the results of our grid searches, which did not help much. Nevertheless, we recorded the data to compare it with our previous results.

In [None]:
parameters = {
    'tfidf__max_df': [0.9],
    'tfidf__min_df': [1, 2],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True],
    'clf__C': [1, 10],
    'clf__max_iter': [1000, 5000],
}

grid_search = GridSearchCV(pipelineSVC_tfidf, parameters, n_jobs=-1, verbose=1)
grid_search.fit(python_train_combo, python_train_labels_int)

print("Best parameters: ", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

python_tuned_results = classification_report(python_test_labels_int, grid_search.best_estimator_.predict(python_test_combo), output_dict=True)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


Best parameters:  {'clf__C': 1, 'clf__max_iter': 1000, 'tfidf__max_df': 0.9, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 2), 'tfidf__use_idf': True}
Best score: 0.4309865116541565
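
The `best_score_` reported above is the mean cross-validated score over the training folds (accuracy by default for a classifier pipeline), not the score on the held-out test set, so it may help to put both numbers side by side before concluding that tuning hurt. Below is a minimal sketch on a made-up toy corpus. The pipeline steps `tfidf` and `clf` are assumed to be a `TfidfVectorizer` and a `LinearSVC` (the definition of `pipelineSVC_tfidf` is not shown in this section), and the reduced parameter grid is for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the comment data (3 classes).
topics = {
    0: ["returns the parsed value", "returns a list of items", "returns none on failure"],
    1: ["todo fix this later", "todo remove the hack", "todo handle the edge case"],
    2: ["example usage of the class", "example call with defaults", "example of the output"],
}
texts, labels = [], []
for label, sentences in topics.items():
    for i in range(4):  # small variations so that each class has 12 documents
        for sentence in sentences:
            texts.append(f"{sentence} variant{i}")
            labels.append(label)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Assumed stand-in for pipelineSVC_tfidf: step names match the grid prefixes.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
parameters = {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [1, 10]}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, parameters, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

# best_score_ is averaged over validation folds of the training data only.
cv_accuracy = search.best_score_
# The refit best estimator is then scored once on the untouched test split.
test_accuracy = accuracy_score(y_test, search.predict(X_test))

print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy: {cv_accuracy:.3f}")
print(f"Test accuracy:    {test_accuracy:.3f}")
```

If the test accuracy of the tuned estimator is close to that of the untuned pipeline, the lower `best_score_` reflects the cross-validation splits rather than a worse model.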


## 2B: Pharo Dataset

In [None]:
# Loading Pharo for processing

pharo_labels = labels['pharo']

pharo_train = ds['pharo_train'].to_pandas()
pharo_train_labels = ds['pharo_train']['labels']

pharo_test = ds['pharo_test'].to_pandas()
pharo_test_labels = ds['pharo_test']['labels']


### Data Preprocessing



In [None]:

stop_words = []

def preprocess_text(train, test):
  # Fit the vocabulary on the training comments only, then reuse it for the test set
  vectorizer = CountVectorizer(lowercase=True, stop_words=stop_words)
  train_vectorized = vectorizer.fit_transform(train)
  test_vectorized = vectorizer.transform(test)
  return train_vectorized, test_vectorized
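
To illustrate what fitting on the training set and only transforming the test set implies, here is a small sketch with made-up comments (the example strings are hypothetical). Words that appear only in the test comments are not in the fitted vocabulary and are silently dropped, which is why both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical comments, not taken from the dataset.
train_comments = ["Returns the parsed value", "TODO: fix the parser"]
test_comments = ["Returns the cached value"]

vectorizer = CountVectorizer(lowercase=True)
train_matrix = vectorizer.fit_transform(train_comments)  # learns the vocabulary
test_matrix = vectorizer.transform(test_comments)        # reuses it unchanged

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(train_matrix.shape, test_matrix.shape)  # (2, 7) (1, 7)

# "cached" was never seen during fitting, so it contributes no count.
print("cached" in vectorizer.vocabulary_)  # False
```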

In [None]:
#
# Below is another method that we considered implementing but didn't utilize
#


# print(pharo_train_combo)

# def remove_punctuation(text):
#     import string
#     translator = str.maketrans('', '', string.punctuation)
#     return text.translate(translator)

# pharo_train_combo = [remove_punctuation(comment) for comment in pharo_train_combo]
# pharo_test_combo = [remove_punctuation(comment) for comment in pharo_test_combo]

In [None]:
pharo_train_combo = ds['pharo_train']['combo']
pharo_train_labels_int = np.argmax(pharo_train_labels, axis=1)

pharo_test_combo = ds['pharo_test']['combo']
pharo_test_labels_int = np.argmax(pharo_test_labels, axis=1)

pharo_train_vectorized, pharo_test_vectorized = preprocess_text(pharo_train_combo, pharo_test_combo)
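
A note on the `np.argmax` step: it turns each label vector into a single class index. The sketch below uses made-up label rows and assumes the labels are 0/1 vectors with one slot per category. If a row has more than one slot set, `argmax` keeps only the first, and an all-zero row maps to class 0, so counting such rows is a cheap sanity check.

```python
import numpy as np

# Hypothetical label vectors, one row per comment.
labels = np.array([
    [0, 1, 0, 0, 0],  # exactly one category set -> index 1
    [1, 0, 0, 1, 0],  # two categories set -> argmax keeps only index 0
    [0, 0, 0, 0, 0],  # no category set -> argmax still returns 0
])

labels_int = np.argmax(labels, axis=1)
print(labels_int)  # [1 0 0]

# Rows where collapsing to one class discards information.
multi_label_rows = int((labels.sum(axis=1) > 1).sum())
empty_rows = int((labels.sum(axis=1) == 0).sum())
print(multi_label_rows, empty_rows)  # 1 1
```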

In [None]:

print("Pharo train:")
print("Data shape: ",pharo_train_vectorized.shape)
print("Labels shape: ",pharo_train_labels_int.shape)
print()
print("Pharo test:")
print("Data shape: ",pharo_test_vectorized.shape)
print("Labels shape: ",pharo_test_labels_int.shape)

Pharo train:
Data shape:  (1298, 2690)
Labels shape:  (1298,)

Pharo test:
Data shape:  (289, 2690)
Labels shape:  (289,)


### Training and testing

In [None]:
pharo_model = LogisticRegression(max_iter=1000)
pharo_model.fit(pharo_train_vectorized, pharo_train_labels_int)

predictions = pharo_model.predict(pharo_test_vectorized)
pharo_original_results = classification_report(pharo_test_labels_int, predictions, output_dict=True)

print(pharo_original_results)

{'0': {'precision': 0.6129032258064516, 'recall': 0.4418604651162791, 'f1-score': 0.5135135135135135, 'support': 43.0}, '1': {'precision': 0.7302631578947368, 'recall': 0.9327731092436975, 'f1-score': 0.8191881918819188, 'support': 119.0}, '2': {'precision': 0.6, 'recall': 0.5294117647058824, 'f1-score': 0.5625, 'support': 51.0}, '3': {'precision': 0.5, 'recall': 1.0, 'f1-score': 0.6666666666666666, 'support': 1.0}, '4': {'precision': 0.6521739130434783, 'recall': 0.5357142857142857, 'f1-score': 0.5882352941176471, 'support': 28.0}, '5': {'precision': 0.7352941176470589, 'recall': 0.6410256410256411, 'f1-score': 0.684931506849315, 'support': 39.0}, '6': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 8.0}, 'accuracy': 0.6851211072664359, 'macro avg': {'precision': 0.5472334877702465, 'recall': 0.5829693236865409, 'f1-score': 0.5478621675755802, 'support': 289.0}, 'weighted avg': {'precision': 0.6619152064103937, 'recall': 0.6851211072664359, 'f1-score': 0.6647112788377629
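
The gap between `macro avg` and `weighted avg` in the report above comes from how the per-class scores are averaged. The sketch below recomputes both F1 averages from the per-class values and supports printed above: the macro average treats every class equally, so the small classes `3` and `6` pull it down, while the weighted average is dominated by class `1`.

```python
# Per-class F1 and support copied from the Pharo logistic-regression report above.
f1 = [
    0.5135135135135135,  # class 0
    0.8191881918819188,  # class 1
    0.5625,              # class 2
    0.6666666666666666,  # class 3
    0.5882352941176471,  # class 4
    0.684931506849315,   # class 5
    0.0,                 # class 6
]
support = [43, 119, 51, 1, 28, 39, 8]

# Macro average: plain mean over classes.
macro_f1 = sum(f1) / len(f1)

# Weighted average: mean weighted by the number of test samples per class.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(round(macro_f1, 4), round(weighted_f1, 4))  # 0.5479 0.6647
```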

### Pipeline Models

#### Training and Testing

##### Count Vectorizer

In [None]:
print("COUNT VECTORIZER")

print("Pipeline 1:")
print(use_pipeline(pipelineMNB_count, pharo_train_combo, pharo_train_labels_int, pharo_test_combo, pharo_test_labels_int))

print("Pipeline 2:")
print(use_pipeline(pipelineCNB_count, pharo_train_combo, pharo_train_labels_int, pharo_test_combo, pharo_test_labels_int))

print("Pipeline 3:")
print(use_pipeline(pipelineLR_count, pharo_train_combo, pharo_train_labels_int, pharo_test_combo, pharo_test_labels_int))

print("Pipeline 4:")
print(use_pipeline(pipelineSVC_count, pharo_train_combo, pharo_train_labels_int, pharo_test_combo, pharo_test_labels_int))

print("Pipeline 5:")
print(use_pipeline(pipelineRF_count, pharo_train_combo, pharo_train_labels_int, pharo_test_combo, pharo_test_labels_int))

COUNT VECTORIZER
Pipeline 1:
{'0': {'precision': 0.43137254901960786, 'recall': 0.5116279069767442, 'f1-score': 0.46808510638297873, 'support': 43.0}, '1': {'precision': 0.7829457364341085, 'recall': 0.8487394957983193, 'f1-score': 0.8145161290322581, 'support': 119.0}, '2': {'precision': 0.46875, 'recall': 0.5882352941176471, 'f1-score': 0.5217391304347826, 'support': 51.0}, '3': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 1.0}, '4': {'precision': 0.7, 'recall': 0.25, 'f1-score': 0.3684210526315789, 'support': 28.0}, '5': {'precision': 0.6, 'recall': 0.5384615384615384, 'f1-score': 0.5675675675675675, 'support': 39.0}, '6': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 8.0}, 'accuracy': 0.6262975778546713, 'macro avg': {'precision': 0.4261526122076738, 'recall': 0.39100917647917843, 'f1-score': 0.3914755694355951, 'support': 289.0}, 'weighted avg': {'precision': 0.6180823953062354, 'recall': 0.6262975778546713, 'f1-score': 0.6093934228038065, 'support


{'0': {'precision': 0.6129032258064516, 'recall': 0.4418604651162791, 'f1-score': 0.5135135135135135, 'support': 43.0}, '1': {'precision': 0.7302631578947368, 'recall': 0.9327731092436975, 'f1-score': 0.8191881918819188, 'support': 119.0}, '2': {'precision': 0.6, 'recall': 0.5294117647058824, 'f1-score': 0.5625, 'support': 51.0}, '3': {'precision': 0.5, 'recall': 1.0, 'f1-score': 0.6666666666666666, 'support': 1.0}, '4': {'precision': 0.6521739130434783, 'recall': 0.5357142857142857, 'f1-score': 0.5882352941176471, 'support': 28.0}, '5': {'precision': 0.7352941176470589, 'recall': 0.6410256410256411, 'f1-score': 0.684931506849315, 'support': 39.0}, '6': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 8.0}, 'accuracy': 0.6851211072664359, 'macro avg': {'precision': 0.5472334877702465, 'recall': 0.5829693236865409, 'f1-score': 0.5478621675755802, 'support': 289.0}, 'weighted avg': {'precision': 0.6619152064103937, 'recall': 0.6851211072664359, 'f1-score': 0.6647112788377629
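
Classes that show a precision and recall of 0.0 in the reports above (class `6`, for example) may never have been predicted at all. In that case precision is undefined, and scikit-learn reports it as 0.0 by default. Below is a minimal sketch with made-up labels that makes this explicit through the `zero_division` argument of `classification_report` and lists the classes that never appear among the predictions.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: class 2 exists in the test set but is never predicted.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1]

# zero_division=0 states the fallback value explicitly instead of relying on the default.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# Classes present in the ground truth that the model never outputs.
never_predicted = [label for label in sorted(set(y_true)) if label not in set(y_pred)]

print(report["2"])
print("Never predicted:", never_predicted)  # Never predicted: [2]
```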


##### TFIDF

In [None]:
print("TFIDF")

print("Pipeline 6:")
print(use_pipeline(pipelineMNB_tfidf, pharo_train_combo, pharo_train_labels_int, pharo_test_combo, pharo_test_labels_int))

print("Pipeline 7:")
print(use_pipeline(pipelineCNB_tfidf, pharo_train_combo, pharo_train_labels_int, pharo_test_combo, pharo_test_labels_int))

print("Pipeline 8:")
print(use_pipeline(pipelineLR_tfidf, pharo_train_combo, pharo_train_labels_int, pharo_test_combo, pharo_test_labels_int))

print("Pipeline 9:")
print(use_pipeline(pipelineSVC_tfidf, pharo_train_combo, pharo_train_labels_int, pharo_test_combo, pharo_test_labels_int))

print("Pipeline 10:")
print(use_pipeline(pipelineRF_tfidf, pharo_train_combo, pharo_train_labels_int, pharo_test_combo, pharo_test_labels_int))


TFIDF
Pipeline 6:
{'0': {'precision': 0.42857142857142855, 'recall': 0.13953488372093023, 'f1-score': 0.21052631578947367, 'support': 43.0}, '1': {'precision': 0.5345622119815668, 'recall': 0.9747899159663865, 'f1-score': 0.6904761904761905, 'support': 119.0}, '2': {'precision': 0.4878048780487805, 'recall': 0.39215686274509803, 'f1-score': 0.43478260869565216, 'support': 51.0}, '3': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 1.0}, '4': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 28.0}, '5': {'precision': 0.8823529411764706, 'recall': 0.38461538461538464, 'f1-score': 0.5357142857142857, 'support': 39.0}, '6': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 8.0}, 'accuracy': 0.5432525951557093, 'macro avg': {'precision': 0.33332735139689235, 'recall': 0.27015672100682847, 'f1-score': 0.2673570572393717, 'support': 289.0}, 'weighted avg': {'precision': 0.48903559910293437, 'recall': 0.5432525951557093, 'f1-score': 0.46465767623511917, 's


{'0': {'precision': 0.55, 'recall': 0.2558139534883721, 'f1-score': 0.3492063492063492, 'support': 43.0}, '1': {'precision': 0.6098901098901099, 'recall': 0.9327731092436975, 'f1-score': 0.7375415282392026, 'support': 119.0}, '2': {'precision': 0.5384615384615384, 'recall': 0.5490196078431373, 'f1-score': 0.5436893203883495, 'support': 51.0}, '3': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 1.0}, '4': {'precision': 0.6666666666666666, 'recall': 0.2857142857142857, 'f1-score': 0.4, 'support': 28.0}, '5': {'precision': 0.8260869565217391, 'recall': 0.48717948717948717, 'f1-score': 0.6129032258064516, 'support': 39.0}, '6': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 8.0}, 'accuracy': 0.6124567474048442, 'macro avg': {'precision': 0.4558721816485792, 'recall': 0.3586429204955685, 'f1-score': 0.3776200605200504, 'support': 289.0}, 'weighted avg': {'precision': 0.604057160932443, 'recall': 0.6124567474048442, 'f1-score': 0.5730612319120953, 'support': 289


##### Best Pipeline

In [None]:
print("Pipeline 9:")
pharo_pipeline_results = use_pipeline(pipelineSVC_tfidf, pharo_train_combo, pharo_train_labels_int, pharo_test_combo, pharo_test_labels_int)


Pipeline 9:



##### Grid Search Optimization

In [None]:
parameters = {
    'tfidf__max_df': [0.9],
    'tfidf__min_df': [1, 2],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True],
    'clf__C': [1, 10],
    'clf__max_iter': [1000, 5000],
}

grid_search = GridSearchCV(pipelineSVC_tfidf, parameters, n_jobs=-1, verbose=1)
grid_search.fit(pharo_train_combo, pharo_train_labels_int)

print("Best parameters: ", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

pharo_tuned_results = classification_report(pharo_test_labels_int, grid_search.best_estimator_.predict(pharo_test_combo), output_dict=True)

print(pharo_tuned_results)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best parameters:  {'clf__C': 1, 'clf__max_iter': 1000, 'tfidf__max_df': 0.9, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 2), 'tfidf__use_idf': True}
Best score: 0.4916008316008316
{'0': {'precision': 0.4883720930232558, 'recall': 0.4883720930232558, 'f1-score': 0.4883720930232558, 'support': 43.0}, '1': {'precision': 0.7794117647058824, 'recall': 0.8907563025210085, 'f1-score': 0.8313725490196079, 'support': 119.0}, '2': {'precision': 0.543859649122807, 'recall': 0.6078431372549019, 'f1-score': 0.5740740740740741, 'support': 51.0}, '3': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 1.0}, '4': {'precision': 0.65, 'recall': 0.4642857142857143, 'f1-score': 0.5416666666666666, 'support': 28.0}, '5': {'precision': 0.7419354838709677, 'recall': 0.5897435897435898, 'f1-score': 0.6571428571428571, 'support': 39.0}, '6': {'precision': 1.0, 'recall': 0.125, 'f1-score': 0.2222222222222222, 'support': 8.0}, 'accuracy': 

# Part 3: Results

In [None]:
#
# Writing results to a file
#

java_original_results_df = pd.DataFrame(java_original_results).transpose()
java_pipeline_results_df = pd.DataFrame(java_pipeline_results).transpose()
java_tuned_results_df = pd.DataFrame(java_tuned_results).transpose()

python_original_results_df = pd.DataFrame(python_original_results).transpose()
python_pipeline_results_df = pd.DataFrame(python_pipeline_results).transpose()
python_tuned_results_df = pd.DataFrame(python_tuned_results).transpose()

pharo_original_results_df = pd.DataFrame(pharo_original_results).transpose()
pharo_pipeline_results_df = pd.DataFrame(pharo_pipeline_results).transpose()
pharo_tuned_results_df = pd.DataFrame(pharo_tuned_results).transpose()



csv_results = "combined_results.csv"

#
# ------ Java -------
#
with open(csv_results, 'w') as f:
    f.write('Section 1: Java 1\n')
    f.write('Original Results\n')

java_original_results_df.to_csv(csv_results, index=False, mode='a')

with open(csv_results, 'a') as f:
    f.write('Pipeline Results\n')
java_pipeline_results_df.to_csv(csv_results, index=False, mode='a', header=False)

with open(csv_results, 'a') as f:
    f.write('Tuned Results\n')
java_tuned_results_df.to_csv(csv_results, index=False, mode='a', header=False)




#
# ----- Python ------
#
with open(csv_results, 'a') as f:
    f.write('Section 2: Python 1\n')
    f.write('Original Results\n')

python_original_results_df.to_csv(csv_results, index=False, mode='a', header=False)

with open(csv_results, 'a') as f:
    f.write('Pipeline Results\n')

python_pipeline_results_df.to_csv(csv_results, index=False, mode='a', header=False)

with open(csv_results, 'a') as f:
    f.write('Tuned Results\n')
python_tuned_results_df.to_csv(csv_results, index=False, mode='a', header=False)



#
# ----- Pharo ------
#
with open(csv_results, 'a') as f:
    f.write('Section 3: Pharo 1\n')
    f.write('Original Results\n')
pharo_original_results_df.to_csv(csv_results, index=False, mode='a', header=False)

with open(csv_results, 'a') as f:
    f.write('Pipeline Results\n')
pharo_pipeline_results_df.to_csv(csv_results, index=False, mode='a', header=False)

with open(csv_results, 'a') as f:
    f.write('Tuned Results\n')
pharo_tuned_results_df.to_csv(csv_results, index=False, mode='a', header=False)
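
One thing to keep in mind with the cell above: `index=False` leaves out the row labels (the class ids, `accuracy`, `macro avg`, `weighted avg`), so the rows in `combined_results.csv` can only be told apart by position. As a possible alternative, the sketch below wraps the repeated open/write/append pattern in a helper that keeps the labels. The helper name `append_report`, the `label` column name, and the example report are all hypothetical.

```python
import os
import tempfile

import pandas as pd


def append_report(path, title, report):
    """Append one titled classification report to a CSV file, keeping row labels."""
    df = pd.DataFrame(report).transpose()
    with open(path, "a", newline="") as f:
        f.write(f"{title}\n")
        # index=True writes the class ids and the 'macro avg' style row names.
        df.to_csv(f, index=True, index_label="label")


# Made-up report in the shape returned by classification_report(..., output_dict=True).
example_report = {
    "0": {"precision": 0.5, "recall": 1.0, "f1-score": 0.6667, "support": 2.0},
    "1": {"precision": 1.0, "recall": 0.5, "f1-score": 0.6667, "support": 4.0},
    "accuracy": 0.6667,
    "macro avg": {"precision": 0.75, "recall": 0.75, "f1-score": 0.6667, "support": 6.0},
}

demo_path = os.path.join(tempfile.mkdtemp(), "demo_results.csv")
append_report(demo_path, "Original Results", example_report)
append_report(demo_path, "Pipeline Results", example_report)

with open(demo_path) as f:
    content = f.read()
print(content)
```

Each call adds a title line followed by a labeled table, so the three result sets per language could be written with three calls instead of repeated `with open(...)` blocks.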


