# Case Study: Tracing Evolutionary Changes in APIs - Preparation Phase

This case study analyses two Java APIs. The first is *JUnit4*, for which a dataset has already been constructed in another study. The second is AppCompat, for which we need to construct the dataset manually before analysing its evolution.

## Goal
Our goal is to investigate whether a machine learning approach to classifying API changes is feasible. This will be achieved via the following steps:
- Read the Excel sheets and analyze the data
- Extract important features such as the "Changes" column and apply natural language processing techniques to them, such as tokenization
- Train the algorithm to perform classification
- Check the accuracy of the algorithm
- In the final stage, be able to classify solely based on the "Changes" column
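
The steps above can be sketched end to end on toy data. This is only a minimal sketch under stated assumptions: the example rows, the "Feature" label, and the choice of a bag-of-words model with naive Bayes are illustrative and not taken from the dataset or prescribed by this case study.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the "Changes" and "General Category" columns.
changes = [
    "Fix regression in rule handling",
    "Fix serialization of exceptions",
    "Fix crash when test times out",
    "Fix wrong message in assertion failure",
    "Add new assertion method",
    "Add support for parameterized tests",
    "Add rule for temporary folders",
    "Add option to order test methods",
]
categories = ["Bug fix"] * 4 + ["Feature"] * 4  # "Feature" is an assumed label

# Hold out 25% of the rows, keeping both categories in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    changes, categories, test_size=0.25, stratify=categories, random_state=0
)

# Bag-of-words features followed by a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# Classify the held-out descriptions and measure the share of correct labels.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(list(predictions), accuracy)
```

With the real data, `changes` and `categories` would come from the labelled rows of the Excel sheet, and the accuracy on so few rows says little; the sketch only shows how the steps connect.
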

First, we start by importing the necessary libraries and defining the file paths.

## Some implementation ideas/goals
 - Analyse which words are most frequent in each category, i.e. which words indicate that a change is a bug fix, etc.
 - Try to predict the category of a change using the model trained on the labelled data
 - At a later stage, try the following: "A really interesting question will be which changes impacting the architecture of the system are represented in the release log. This way you essentially combine the two RQs and the make even more sense." For example, are breaking changes represented in the release log?
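
The first idea, finding the most frequent words per category, could be approached roughly as follows. This is a sketch under assumptions: the token lists and the "Feature" label are made up for illustration, and `top_words_per_category` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (stemmed tokens, category) pairs standing in for the real rows.
rows = [
    (["fix", "regress", "rule"], "Bug fix"),
    (["fix", "serial", "except"], "Bug fix"),
    (["add", "assert", "method"], "Feature"),
    (["add", "rule", "option"], "Feature"),
]


def top_words_per_category(rows, n=3):
    """Count tokens separately for each category and return the n most common."""
    counts = defaultdict(Counter)
    for tokens, category in rows:
        counts[category].update(tokens)
    return {category: counter.most_common(n) for category, counter in counts.items()}


top = top_words_per_category(rows, n=1)
print(top)
```

Words that dominate one category but are rare in the others are the natural candidates for features that separate the categories.
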

In [50]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split


In [51]:
file_path = "../resources/JUnit/JUnit - Training.xlsx"
sheet_name = "JUnit"

After defining the path, we load the sheet and display its first rows to check that the data is read correctly.

In [52]:
data = pd.read_excel(file_path, sheet_name=sheet_name)

print(data.head())

       Year1  Year       Date        Version RELEASE  \
0  +15 years  2021 2021-02-13  4.13.  4.13.2   PATCH   
1  +15 years  2021 2021-02-13  4.13.  4.13.2   PATCH   
2  +15 years  2021 2021-02-13  4.13.  4.13.2   PATCH   
3  +15 years  2021 2021-02-13  4.13.  4.13.2   PATCH   
4  +14 years  2020 2020-10-11  4.13.  4.13.1   PATCH   

                                             Changes  By          .1 Type  \
0                                                NaN NaN         NaN  NaN   
1  Pull request #1687: Mark ThreadGroups created ... NaN       Rules  NaN   
2  Pull request $1691: Only create ThreadGroups i... NaN       Rules  NaN   
3  Pull request #1654: Fix for issue #1192: NotSe... NaN  Exceptions  NaN   
4                                                NaN NaN         NaN  NaN   

             General General Category  
0                NaN              NaN  
1     Fix regression          Bug fix  
2     Fix regression          Bug fix  
3  Fix serialization          Bug fix  


In the next parts, we will focus on building the classifier.
### 1. Tokenize the "Changes" column
First, we tokenize the "Changes" column and then remove any stop words, so that the output is cleaner and easier to analyse in later steps.

In [53]:
# word_tokenize needs the punkt tokenizer models; download them once if they are missing:
# nltk.download('punkt')
# nltk.download('punkt_tab')
nltk.download('stopwords')

data["Changes"] = data["Changes"].fillna("")
data["Tokens"] = data["Changes"].apply(word_tokenize)

print(data[["Tokens", "Changes"]].head())

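# Define the stop words as a set for fast membership tests.
stop_words = set(stopwords.words("english"))

# Series.apply returns a new Series, so the result must be assigned back;
# without the assignment the "Tokens" column keeps its stop words
# (the output recorded below was produced without it and still contains "for").
data["Tokens"] = data["Tokens"].apply(lambda tokens: [word for word in tokens if word.casefold() not in stop_words])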

print("Data without stop words: ")
print(data["Tokens"].head())


                                              Tokens  \
0                                                 []   
1  [Pull, request, #, 1687, :, Mark, ThreadGroups...   
2  [Pull, request, $, 1691, :, Only, create, Thre...   
3  [Pull, request, #, 1654, :, Fix, for, issue, #...   
4                                                 []   

                                             Changes  
0                                                     
1  Pull request #1687: Mark ThreadGroups created ...  
2  Pull request $1691: Only create ThreadGroups i...  
3  Pull request #1654: Fix for issue #1192: NotSe...  
4                                                     
Data without stop words: 
0                                                   []
1    [Pull, request, #, 1687, :, Mark, ThreadGroups...
2    [Pull, request, $, 1691, :, Only, create, Thre...
3    [Pull, request, #, 1654, :, Fix, for, issue, #...
4                                                   []
Name: Tokens, dtype: object


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Krisi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
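
The stop-word filter from the cell above can also be tried on a single token list, independent of pandas. The sketch below assumes a tiny, hypothetical stop-word set in place of `stopwords.words("english")`, and `remove_stop_words` is an illustrative helper name.

```python
# Hypothetical, deliberately small stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"for", "only", "in", "the", "a"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return a new list without stop words, comparing case-insensitively."""
    return [token for token in tokens if token.casefold() not in stop_words]


# Leading tokens of the third row in the output above.
tokens = ["Pull", "request", "#", "1654", ":", "Fix", "for", "issue", "#", "1192"]
filtered = remove_stop_words(tokens)

print(filtered)
print(tokens)  # the input list itself is left unchanged
```

Because the function returns a new list instead of modifying its input, the caller has to keep the returned value, just as the result of `Series.apply` has to be stored to take effect.
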


### 2. Stem the output
The next step is to stem the tokens. Stemming reduces each word to its root form, so that inflected variants such as "creates" and "created" are treated as the same feature.

In [55]:
stemmer = SnowballStemmer("english")

# Apply the stemmer
data["Stemmed_Tokens"] = data["Tokens"].apply(
    lambda tokens: [stemmer.stem(word) for word in tokens if isinstance(word, str)]
)

print("Stemmed tokens: ")
print(data[["Tokens", "Stemmed_Tokens"]].head())

Stemmed tokens: 
                                              Tokens  \
0                                                 []   
1  [Pull, request, #, 1687, :, Mark, ThreadGroups...   
2  [Pull, request, $, 1691, :, Only, create, Thre...   
3  [Pull, request, #, 1654, :, Fix, for, issue, #...   
4                                                 []   

                                      Stemmed_Tokens  
0                                                 []  
1  [pull, request, #, 1687, :, mark, threadgroup,...  
2  [pull, request, $, 1691, :, onli, creat, threa...  
3  [pull, request, #, 1654, :, fix, for, issu, #,...  
4                                                 []  
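
To see what the stemmer does to individual words, it can be called directly on a few examples. This is a small sketch; the word list is chosen for illustration and partly taken from the tokens shown above.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Inflected variants of one word plus a few tokens from the output above.
words = ["create", "creates", "created", "Only", "issue", "ThreadGroups"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The stemmer also lowercases its input, and the resulting stems (for example "onli" or "issu") are not necessarily dictionary words; they only need to be consistent across variants of the same word.
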
