# Case Study: Tracing Evolutionary Changes in APIs - Preparation Phase

This case study analyses two Java APIs. The first is *JUnit4*, whose dataset was already constructed in another study. The second is AppCompat, for which we need to construct the dataset manually and then analyse its evolution.

## Goal
Our goal is to investigate the feasibility of a machine learning approach to classifying API changes. This will be achieved via the following steps:
- Read the Excel sheets and analyze the data
- Extract important features such as the "Changes" column and apply Natural Language Processing techniques to them, such as tokenization
- Train the algorithm to perform classification
- Check the accuracy of the algorithm
- In the final stage, be able to classify solely based on the "Changes" column

First, we start by importing the necessary libraries and defining the file paths.

## Some implementation ideas/goals
 - Analyse the most frequent words for each category, i.e. which words indicate that a change is a bug fix, etc.
 - Try to predict the category of a change based on the training data
 - At a later stage, try the following: "A really interesting question will be which changes impacting the architecture of the system are represented in the release log. This way you essentially combine the two RQs and the make even more sense." For example, are breaking changes represented in the release log, etc.
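
The first idea above (most frequent words per category) could be sketched roughly as follows. This is only an illustration on hypothetical data: the `samples` list stands in for the preprocessed tokens and the "General Category" labels, and the "New feature" category name and the helper `top_words_per_category` are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical (tokens, category) pairs standing in for the preprocessed
# "Changes" tokens and their "General Category" labels.
samples = [
    (["fix", "regress", "threadgroup"], "Bug fix"),
    (["fix", "serial", "except"], "Bug fix"),
    (["add", "rule", "timeout"], "New feature"),
    (["add", "assert", "method"], "New feature"),
]


def top_words_per_category(samples, n=3):
    """Return the n most frequent tokens for each category."""
    counts = defaultdict(Counter)
    for tokens, category in samples:
        counts[category].update(tokens)
    return {cat: counter.most_common(n) for cat, counter in counts.items()}


top_words = top_words_per_category(samples, n=1)
print(top_words)
```

On the real data, the same function could be fed the stemmed tokens and category of each row, skipping rows without a label.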

In [202]:
import pandas as pd
import re
import numpy as np
import nltk
from nltk import NaiveBayesClassifier
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# TODO: See if this is useful
import joblib

In [203]:
file_path = "../resources/JUnit/JUnit - Training.xlsx"
sheet_name = "JUnit"

After defining the path, we first display part of the data to check that everything loads correctly.

In [204]:
data = pd.read_excel(file_path, sheet_name=sheet_name)

print(data.head())

       Year1  Year       Date        Version RELEASE  \
0  +15 years  2021 2021-02-13  4.13.  4.13.2   PATCH   
1  +15 years  2021 2021-02-13  4.13.  4.13.2   PATCH   
2  +15 years  2021 2021-02-13  4.13.  4.13.2   PATCH   
3  +15 years  2021 2021-02-13  4.13.  4.13.2   PATCH   
4  +14 years  2020 2020-10-11  4.13.  4.13.1   PATCH   

                                             Changes  By          .1 Type  \
0                                                NaN NaN         NaN  NaN   
1  Pull request #1687: Mark ThreadGroups created ... NaN       Rules  NaN   
2  Pull request $1691: Only create ThreadGroups i... NaN       Rules  NaN   
3  Pull request #1654: Fix for issue #1192: NotSe... NaN  Exceptions  NaN   
4                                                NaN NaN         NaN  NaN   

             General General Category  
0                NaN              NaN  
1     Fix regression          Bug fix  
2     Fix regression          Bug fix  
3  Fix serialization          Bug fix  


In the next parts, we will focus on building the classifier.
### 1. Tokenize the "Changes" column
First we tokenize the "Changes" column. We then keep only purely alphabetic tokens, lowercase them, and remove stop words, so that the output is cleaner and easier to analyse in later steps.

In [205]:
# Import tokenizer model
# Make sure to download the following
# nltk.download('punkt')
# nltk.download('punkt_tab')
# nltk.download('stopwords')

data["Changes"] = data["Changes"].fillna("")
print(data["Changes"].head())
data["Tokens"] = data["Changes"].apply(word_tokenize)

# print(data[["Tokens", "Changes"]].head())

# Define the stopwords in a set.
stop_words = set(stopwords.words("english"))

# Keep purely alphabetic tokens that are not stop words, lowercased.
data["Tokens"] = data["Tokens"].apply(
    lambda tokens: [
        word.lower()
        for word in tokens
        if re.match(r"^[a-zA-Z]+$", word) and word.casefold() not in stop_words
    ]
)
print("Data without stop words and solely alphabetical tokens: ")
print(data["Tokens"].head())


0                                                     
1    Pull request #1687: Mark ThreadGroups created ...
2    Pull request $1691: Only create ThreadGroups i...
3    Pull request #1654: Fix for issue #1192: NotSe...
4                                                     
Name: Changes, dtype: object
Data without stop words and solely alphabetical tokens: 
0                                                   []
1    [pull, request, mark, threadgroups, created, f...
2          [pull, request, create, threadgroups, true]
3    [pull, request, fix, issue, notserializableexc...
4                                                   []
Name: Tokens, dtype: object


### 2. Stem the output
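The next step is to stem the tokens. Stemming reduces each word to its root form, so that variants such as "created" and "create" map to the same token ("creat") and the model can focus on the basic meaning of the word.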

In [206]:
stemmer = SnowballStemmer("english")

# Apply the stemmer
data["Stemmed_Tokens"] = data["Tokens"].apply(
    lambda tokens: [stemmer.stem(word) for word in tokens if isinstance(word, str)]
)

print("Stemmed tokens: ")
print(data[["Tokens", "Stemmed_Tokens"]].head())

Stemmed tokens: 
                                              Tokens  \
0                                                 []   
1  [pull, request, mark, threadgroups, created, f...   
2        [pull, request, create, threadgroups, true]   
3  [pull, request, fix, issue, notserializableexc...   
4                                                 []   

                                      Stemmed_Tokens  
0                                                 []  
1  [pull, request, mark, threadgroup, creat, fail...  
2          [pull, request, creat, threadgroup, true]  
3  [pull, request, fix, issu, notserializableexce...  
4                                                 []  


### 3. Feature engineering
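After cleaning the input, the next step is to extract features that are suitable for the machine learning model. For this, we use TF-IDF (Term Frequency-Inverse Document Frequency), which weights a term by how often it occurs in a document and down-weights terms that occur in many documents.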
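
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a few hypothetical stemmed documents. It assumes the scheme that `TfidfVectorizer` uses by default, as far as we understand it: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document. The helper `tfidf` and the toy `docs` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical stemmed documents; the last one is empty, like the rows
# of the sheet that have no "Changes" text.
docs = [
    "pull request fix issu",
    "pull request creat threadgroup",
    "fix serial issu",
    "",
]


def tfidf(docs):
    """Compute L2-normalized TF-IDF weights as one dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: c * idf[t] for t, c in tf.items()}
        # Avoid dividing by zero for empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In the second document, the rarer term "threadgroup" receives a higher weight than "pull", which occurs in two documents. The empty document yields an all-zero row.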


In [207]:
vectorizer = TfidfVectorizer()
# TODO: Check if this is correct and makes sense
print(data["Stemmed_Tokens"].head())

documents = data["Stemmed_Tokens"].apply(lambda tokens: ' '.join(tokens) if isinstance(tokens, list) else '')

X = vectorizer.fit_transform(documents)

print(X.shape)

0                                                   []
1    [pull, request, mark, threadgroup, creat, fail...
2            [pull, request, creat, threadgroup, true]
3    [pull, request, fix, issu, notserializableexce...
4                                                   []
Name: Stemmed_Tokens, dtype: object
(271, 514)


### 4. Labels preparation
Once the data is prepared, the next step is to define the labels which will be used for classification.

In [208]:
label_encoder = LabelEncoder()
print(data["General Category"].head())
y = label_encoder.fit_transform(data["General Category"])
print(y)


0        NaN
1    Bug fix
2    Bug fix
3    Bug fix
4        NaN
Name: General Category, dtype: object
[10  1  1  1 10  8  2 10  5  0  5  2  1  3  2  1  5  1  1  1  5  1  6  0
  1  1  0  2  2  1  0  1  4  1  1  1  1  1  1  1  5  1  1  1  0  1  0  3
  4  1  3  1  1  0  2  1  2  1  1  2  2  2  0  1  1  1 10  0  0  7  0  0
  0  1  1  0  1  1  1  5  1  0  2  1  0  0  0  3  1  1  1  0  0  0  2  0
  1  1  1  4  1  1  5  1  1  0  0  1  1  1  0  2  0  5  9  9  5  1  7 10
  0  0  2  2  2  2  5  0  0  0  0  0 10  1  0  2  2  1  1  1  4  4  4  4
  7  1  1  1  1  2  1  0 10  0  2  3  2  2  1  1  1  1  1  1  1  5  4  2
  5  1 10  1 10  1 10  0  1 10  0  0  0  0  0  0  0  0  2  2  4  4  4  4
  1  1  1 10  0  0  0  0  4  4  2  1  1 10  2  1  1  1  2  0  0  0  0  5
  2  0  0  0  0  5  7  2  2  4  0  2  0  2 10  0  0  0  0  1  1  2  5  5
  1  1 10  1  1 10  2  3  0  2  5  1  1  1  2  4 10  1  1  5  1 10  1  2
  1  4  4  4  2 10  2]
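
In the encoded array above, the first row (whose category is NaN) maps to 10, which suggests that missing labels are being encoded as a class of their own. One possible remedy, sketched below on a hypothetical miniature of the sheet, is to drop rows without text or label before encoding. The `toy` frame and the "New feature" category are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of the sheet: two rows have no text and no label.
toy = pd.DataFrame({
    "Changes": [None, "Fix regression in rule", "Add new assertion", None],
    "General Category": [None, "Bug fix", "New feature", None],
})

# Keep only rows that have both a change description and a category.
labelled = toy.dropna(subset=["Changes", "General Category"]).reset_index(drop=True)

encoder = LabelEncoder()
y_toy = encoder.fit_transform(labelled["General Category"])

print(list(encoder.classes_))
print(y_toy)
```

In the notebook, such a filter would have to be applied before the TF-IDF step, so that the rows of `X` and the entries of `y` stay aligned.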


### 5. Split training and testing data
The next step is to split the data into training and testing sets for the machine learning model.

In [209]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training data shape: ", X_train.shape)
print("Test data: ", X_test.shape)


Training data shape:  (216, 514)
Test data:  (55, 514)
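
With only 55 test rows and several small categories, a purely random split can leave some classes out of the test set altogether. One option, sketched here on hypothetical labels, is a stratified split via the `stratify` argument of `train_test_split`. Stratification needs at least two samples per class, so this sketch first filters out singleton classes; whether dropping them is acceptable for the study is an open question.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical encoded labels: class 2 occurs only once.
labels = [0] * 6 + [1] * 4 + [2] * 1
features = [[i] for i in range(len(labels))]

# Stratified splitting requires at least two samples per class.
counts = Counter(labels)
keep = [i for i, lab in enumerate(labels) if counts[lab] >= 2]
X_keep = [features[i] for i in keep]
y_keep = [labels[i] for i in keep]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_keep, y_keep, test_size=0.2, stratify=y_keep, random_state=42
)
print(Counter(y_tr), Counter(y_te))
```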


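### 6. Train model
We train a random forest classifier on the TF-IDF features and evaluate it on the held-out test set.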


In [210]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=label_encoder.classes_))

Accuracy: 0.5636363636363636


ValueError: Number of classes, 10, does not match size of target_names, 11. Try specifying the labels parameter
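
The error above appears to arise because `label_encoder.classes_` holds 11 names, while only 10 distinct classes occur in `y_test` and `y_pred`. As the message suggests, passing `labels` explicitly tells `classification_report` which class index each name belongs to. The following is a minimal sketch on hypothetical data: the class names and the arrays `y_true` and `y_hat` are made up, with "New feature" deliberately absent from both arrays.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical class names; index 2 never occurs in y_true or y_hat.
class_names = ["Bug fix", "Code redesign", "New feature"]
y_true = np.array([0, 0, 1, 0])
y_hat = np.array([0, 1, 1, 0])

report = classification_report(
    y_true,
    y_hat,
    labels=np.arange(len(class_names)),  # one index per known class
    target_names=class_names,
    zero_division=0,  # report 0 instead of warning for absent classes
)
print(report)
```

In the notebook, the equivalent would be `labels=np.arange(len(label_encoder.classes_))`. If a NaN category is still among the classes, the names may also need converting to strings first.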

### 7. Quick test: saving the model and classifying new text

In [156]:
joblib.dump(classifier, "classifier_model.pkl")
joblib.dump(vectorizer, "vectorizer.pkl")

# Load the saved models
loaded_classifier = joblib.load("classifier_model.pkl")
loaded_vectorizer = joblib.load("vectorizer.pkl")

# Predict on new data
new_text = [
    "Source has been split into directories src/main/java and src/test/java, making it easier to exclude tests from builds, and making JUnit more maven-friendly"]
new_vector = loaded_vectorizer.transform(new_text)
prediction = loaded_classifier.predict(new_vector)

print("Predicted Category:", label_encoder.inverse_transform(prediction))


Predicted Category: ['Code redesign']
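
Note that the vectorizer was fitted on stemmed, stop-word-filtered tokens, whereas `new_text` is passed in raw, so words such as "making" may not match the stemmed vocabulary. One way to keep training and prediction consistent is to put the preprocessing inside the vectorizer through its `analyzer` argument. The sketch below assumes a simplified `normalize` function and a tiny hypothetical stop-word set. In the notebook, the NLTK tokenizer, stop words and Snowball stemmer would go inside that function instead.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, deliberately tiny stop-word set for illustration.
TOY_STOP_WORDS = {"the", "for", "in", "and", "a"}


def normalize(text):
    """Lowercase, keep alphabetic runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOP_WORDS]


# The callable analyzer is applied both when fitting and when transforming.
toy_vectorizer = TfidfVectorizer(analyzer=normalize)
X_toy = toy_vectorizer.fit_transform([
    "Fix for issue in the Rules",
    "Split source into directories",
])

new_vector_toy = toy_vectorizer.transform(["Fix the rules"])
print(sorted(toy_vectorizer.vocabulary_))
print(new_vector_toy.nnz)
```

When such a vectorizer is saved with `joblib`, the function is pickled by reference, so `normalize` must be importable wherever the model is loaded.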
