# Abstract
***

1. [Introduction](#intro)
1. [Goal of the work](#goal)
1. [Data](#data)
1. [Schedule](#sched)
1. [Results to obtain](#results)

### 1. Introduction<a id='intro'></a>

In software development, **software defects** are a major concern because they cause problems for both users and developers. To prevent them, researchers have long asked whether it is possible to learn **which characteristics of the source code are likely to produce a software defect**. In recent years, defect prediction with machine learning methods has been an active research topic, and many studies aim to help developers handle defects before they reach the production environment. However, most of these works concentrate on predicting defects from a large set of software features, and they lack a good explanation of what drives the software into a defective state.

This work is based on <a href='https://link.springer.com/article/10.1007/s10515-020-00277-4'><b>Understanding machine learning software defect predictions</b></a> by Esteves, Figueiredo, Veloso, Viggiato, Ziviani (2020).

### 2. Goal of the work<a id='goal'></a>

The main purpose of the original research is **finding a model that is well suited to predicting software defects**. The authors propose a simple model sampling approach that finds accurate models with the minimum set of features, because **features not contributing to increasing the predictive power should not be included in the model**. The reduced set of features helps to increase model explainability, which is important for telling developers which features make each module of the code more defect-prone. Models are evaluated on diverse projects from the *Jureczko* datasets, and the goal is to show that **the features that contribute most to the best models may vary depending on the project** and that **it is possible to find effective models that use few features, leading to better understandability**.

### 3. Data<a id='data'></a>

In order to perform the analysis we will use the following sets included in the *Jureczko dataset*:

<ul>
    <li>Ant</li>
    <li>Camel</li>
    <li>Jedit</li>
    <li>Log4j</li>
    <li>Tomcat</li>
    <li>Velocity</li>
    <li>Xalan</li>
    <li>Xerces</li>
</ul>

Each of these datasets represents a **Java project**. They are explored in more detail in <a href="2_eda.ipynb">Section 2</a>.

### 4. Schedule<a id='sched'></a>

This work will be divided into several **sections**:
<ul>
    <li>In the 1st one, data will be loaded and stored</li>
    <li>In the 2nd one, an exploratory data analysis will be performed on our datasets, in order to understand their structure and find a way to improve them</li>
    <li>In the 3rd one, data will be manipulated</li>
    <li>In the 4th one, seven state-of-the-art machine learning algorithms will be applied to our datasets; an optimized XGBoost model will be built; AUC and F1 scores will be computed for each model in order to find the best one</li>
    <li>In the 5th one, results will be explained; SHAP values will be computed on XGBoost models; relevant features will be found</li>
</ul>
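
The two metrics named in the 4th section can be illustrated with a minimal sketch. It uses only the Python standard library and toy data; the function names `f1_score` and `auc_score`, the labels, and the 0.5 decision threshold are assumptions made for illustration, and the notebooks themselves may well rely on library implementations instead.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = defective) from hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_score(y_true, y_score):
    """ROC AUC as the fraction of (defective, clean) pairs ranked correctly.

    A tie between the two scores counts as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = defective module, 0 = clean module.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]                   # predicted defect probability
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed 0.5 threshold

auc = auc_score(y_true, y_score)
f1 = f1_score(y_true, y_pred)
print(f"AUC = {auc:.3f}, F1 = {f1:.3f}")
```

AUC is computed from the ranking of the predicted scores and does not depend on a threshold, while F1 depends on the threshold used to turn scores into defective/clean labels. Reporting both gives two complementary views when comparing classifiers.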

### 5. Results to obtain<a id='results'></a>

Our aim is to answer two questions:
<ul>
    <li>Q1: does our optimized XGBoost model outperform the state-of-the-art ML classifiers for defect prediction?</li>
    <li>Q2: how does the number of features impact the explainability of the models?</li>