### Experiments pipeline description

#### Stage 1. Data mining

The data is split into two populations, `positive` and `negative` samples, which are mined in different ways.

* Positive samples are obtained by going through a list of repositories (the 20 most-starred Java Apache repos) and running the RefactoringMiner tool to extract instances of Extract Method refactorings from the authors' commits. For each instance, a set of 78 metrics is then computed and written to a CSV file.

* Negative samples are obtained by going through a list of repositories (the 20 most-starred Java Apache repos): each Java file is opened, and each method in it is split into statements with PsiMiner. From the list of statements, all sequences of consecutive statements of lengths from 1 to n are generated, where n is the number of statements in the method. Each such sequence is treated as a code fragment, for which the same list of metrics is computed and the Haas score is measured.
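
The enumeration of candidate fragments for negative samples can be sketched as follows. This is a minimal illustration, not the project's mining code: the function name `contiguous_fragments` is hypothetical, and statements are represented as plain strings rather than PsiMiner nodes.

```python
from typing import Iterator, List, Sequence


def contiguous_fragments(statements: Sequence[str]) -> Iterator[List[str]]:
    """Yield every run of consecutive statements of length 1..n.

    Hypothetical helper: the real pipeline works on statements
    produced by PsiMiner, not on strings.
    """
    n = len(statements)
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield list(statements[start:end])


# A method with n statements produces n * (n + 1) / 2 candidate fragments.
example_fragments = list(contiguous_fragments(["a();", "b();", "c();"]))
```

The quadratic growth in the number of fragments per method is one reason the negative population is so much larger than the positive one.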

The Haas score is a scoring function inspired by the work of Haas et al. [cite something] and can be written as
$$Score(F) = LengthScore(F) + AreaScore(F) + DepthScore(F)$$
Each component is defined as
$$
\begin{aligned}
&LengthScore(F) = \min(0.1 \cdot \min(Length(F), Length(F^C)), 3) \\
&AreaScore(F) = \frac{2 \cdot Depth(M)}{Area(M)} \cdot \min(Area(M) - Area(F), Area(M) - Area(F^C)) \\
&DepthScore(F) = \min(Depth(M) - Depth(F), Depth(M) - Depth(F^C))
\end{aligned}
$$

$M$ is the method, $F$ is the code fragment, and $F^C$ is the fragment's complement with respect to the method, or *remainder*.

$Length(F)$ is the line count of the code fragment, i.e. its number of rows.

$Depth(F)$ is the nesting depth of the code, i.e. the maximal nesting level over the rows of the fragment.

$Area(F)$ is the nesting area of the code, i.e. the sum of nesting levels over the rows of the fragment.
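
To make the definitions concrete, here is a minimal sketch of the score computed from per-row nesting levels. The function name `haas_score` and the input representation (one nesting level per row of the method) are assumptions made for illustration. The sketch also assumes that an empty remainder has length, depth, and area equal to 0, and that the area term is 0 when $Area(M) = 0$; the text above does not specify these edge cases.

```python
from typing import Sequence


def haas_score(nesting: Sequence[int], start: int, end: int) -> float:
    """Score the fragment covering rows [start, end) of a method.

    `nesting` holds the nesting level of every row of the method M
    (an assumed representation, not the project's data format).
    """
    fragment = list(nesting[start:end])                      # F
    remainder = list(nesting[:start]) + list(nesting[end:])  # F^C

    def depth(rows):
        # Assumption: an empty set of rows has depth 0.
        return max(rows, default=0)

    def area(rows):
        return sum(rows)

    method_depth = depth(nesting)
    method_area = area(nesting)

    length_score = min(0.1 * min(len(fragment), len(remainder)), 3)
    if method_area > 0:
        area_score = (2 * method_depth / method_area) * min(
            method_area - area(fragment), method_area - area(remainder)
        )
    else:
        # Assumption: avoid division by zero for methods with zero area.
        area_score = 0.0
    depth_score = min(
        method_depth - depth(fragment), method_depth - depth(remainder)
    )
    return length_score + area_score + depth_score


# Six-row method; the fragment is the two most deeply nested rows.
example_score = haas_score([1, 1, 2, 2, 1, 1], start=2, end=4)
```

In this example $LengthScore = 0.2$, $AreaScore = 2.0$, and $DepthScore = 0$, because the remainder no longer contains the deepest rows but the fragment still does.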

#### Stage 2. Data preparation

Once the mining scripts had finished, about 16 thousand `positive` and 4 million `negative` samples had been obtained. To reduce the load on the models to be trained, the negative set was subsampled down to 1 million elements.

After that, 80% of each set (80% of the positive samples and 80% of the negative samples) was kept for the training and validation stages, and the remaining 20% was held out for the testing stage.
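
A per-population 80/20 split of this kind could look like the sketch below. It assumes a random shuffle with a fixed seed, which the text does not state; the helper `split_80_20` and the toy sample lists are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_80_20(samples: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle one population and cut it into 80% / 20% parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumed: random, seeded split
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


# Toy stand-ins for the real populations.
positive = [("positive", i) for i in range(100)]
negative = [("negative", i) for i in range(1000)]

pos_train, pos_test = split_80_20(positive)
neg_train, neg_test = split_80_20(negative)

# Each population is split separately, then the parts are combined.
train_val = pos_train + neg_train
test = pos_test + neg_test
```

Splitting each population separately keeps the positive-to-negative ratio the same in both parts; `sklearn.model_selection.train_test_split` with its `stratify` argument gives a comparable result.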

#### Stage 3. Training procedure

Before each model was trained, two preprocessing steps were always applied to the dataset.

1. Negative samples whose Haas scores were too high (above the 50th or the 95th percentile, depending on the experiment) were dropped from the training set. The Haas score column itself was then removed from the dataset, since it is not a feature the model should see.
2. The training data $X$ was scaled feature-wise with the min-max strategy, i.e. $X_{scaled} = \frac{X - \min(X)}{\max(X) - \min(X)}$, so that every feature lies in the range from 0 to 1. This was implemented via `sklearn.preprocessing.MinMaxScaler` and incorporated into the model through sklearn Pipelines.
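
The two preprocessing steps can be sketched as below. The helper `drop_high_score_negatives` is hypothetical, and it assumes the percentile is taken over the Haas scores of the negative samples only, which the text does not spell out. The toy arrays stand in for the real feature matrix.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def drop_high_score_negatives(X, y, haas, percentile):
    """Keep all positives and the negatives at or below the cutoff.

    Assumption: the cutoff is a percentile of the negatives' Haas scores.
    The Haas scores are not returned, so they never become a feature.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    haas = np.asarray(haas, dtype=float)
    is_negative = y == 0
    cutoff = np.percentile(haas[is_negative], percentile)
    keep = ~is_negative | (haas <= cutoff)
    return X[keep], y[keep]


# Toy data: one positive row followed by five negatives.
X = np.arange(12).reshape(6, 2)
y = np.array([1, 0, 0, 0, 0, 0])
haas = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])  # ignored for positives

X_kept, y_kept = drop_high_score_negatives(X, y, haas, percentile=50)

# Feature-wise min-max scaling to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X_kept)
```

In the actual experiments the scaler sits inside a Pipeline, so it is fitted only on the training folds during cross-validation.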

After that, each model was trained with each of the two Haas score cutoffs (the 50th and the 95th percentile).

The overall training procedure used `sklearn.model_selection.GridSearchCV` to find the best hyperparameters from a given *grid* and to obtain cross-validation scores at the same time. The metric to optimize was the area under the precision-recall curve (PR AUC).
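
A minimal sketch of such a search is shown below. The parameter grid, the synthetic data, and the number of folds are placeholders chosen for illustration, not the values used in the experiments. The `"average_precision"` scorer is used here as a common estimate of PR AUC; which scorer the experiments actually used is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_train = rng.random((200, 5))
y_train = (X_train[:, 0] > 0.7).astype(int)

# Scaling is part of the pipeline, so it is refitted on every CV fold.
pipeline = Pipeline(
    [
        ("scale", MinMaxScaler()),
        ("model", RandomForestClassifier(random_state=0)),
    ]
)

# Placeholder grid; step-name prefixes route parameters to the model.
param_grid = {
    "model__n_estimators": [10, 20],
    "model__max_depth": [3, None],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="average_precision",  # assumed stand-in for PR AUC
    cv=3,
)
search.fit(X_train, y_train)

best_params = search.best_params_
best_cv_score = search.best_score_
```

After fitting, `search.best_estimator_` holds the whole pipeline refitted on all of the training data with the best parameters.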

#### Stage 4. Testing procedure

Each trained model was tested on the same held-out test set built during Stage 2. Three *metrics* were considered during the tests:

1. The precision-recall curve as a whole
2. The area under that curve (PR AUC)
3. Recall at the points Precision = 0.9 and Precision = 0.8

These results are yet to be computed.
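
For reference, the three test metrics could be computed along these lines. The helper `recall_at_precision` is hypothetical and takes "recall at Precision = p" to mean the best recall among curve points whose precision is at least p, which is an assumption. The labels and scores are toy values, not experiment results.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve


def recall_at_precision(precision, recall, target):
    """Best recall among curve points with precision >= target."""
    mask = precision >= target
    return float(recall[mask].max()) if mask.any() else 0.0


# Toy labels and predicted scores for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# 1. The precision-recall curve as a whole.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# 2. The area under that curve.
pr_auc = auc(recall, precision)

# 3. Recall at fixed precision levels.
recall_at_90 = recall_at_precision(precision, recall, 0.9)
recall_at_80 = recall_at_precision(precision, recall, 0.8)
```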

### Overall information

#### Repository list

1.  https://github.com/apache/dubbo
2.  https://github.com/apache/kafka
3.  https://github.com/apache/skywalking
4.  https://github.com/apache/flink
5.  https://github.com/apache/rocketmq
6.  https://github.com/apache/shardingsphere
7.  https://github.com/apache/hadoop
8.  https://github.com/apache/druid
9.  https://github.com/apache/zookeeper
10.  https://github.com/apache/pulsar
11.  https://github.com/apache/shardingsphere-elasticjob
12.  https://github.com/apache/cassandra
13.  https://github.com/apache/storm
14.  https://github.com/apache/tomcat
15.  https://github.com/apache/jmeter
16.  https://github.com/apache/zeppelin
17.  https://github.com/apache/incubator-shenyu
18.  https://github.com/apache/beam
19.  https://github.com/apache/groovy
20.  https://github.com/apache/hbase

#### Models used

1. `sklearn.ensemble.RandomForestClassifier` - RandomForest
2. `sklearn.ensemble.BaggingClassifier` of `sklearn.svm.SVC` - Ensemble of support vector machines
3. `sklearn.naive_bayes.GaussianNB` - Gaussian Naive Bayes
4. `sklearn.neural_network.MLPClassifier` - MultiLayer Perceptron
5. `sklearn.ensemble.GradientBoostingClassifier` - Ensemble of Decision Trees with Gradient Boosting
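
As a quick illustration, the five estimators can be instantiated as follows. Default hyperparameters are used here only as placeholders, since the searched grids are not listed in this document, and the dictionary keys are hypothetical names.

```python
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Placeholder configurations; the real hyperparameters come from the grid search.
models = {
    "random_forest": RandomForestClassifier(),
    "bagged_svm": BaggingClassifier(SVC()),  # ensemble of support vector machines
    "gaussian_nb": GaussianNB(),
    "mlp": MLPClassifier(),
    "gradient_boosting": GradientBoostingClassifier(),
}
```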

#### Models' Inference

Models were trained with scikit-learn (the `sklearn` Python module) and saved in PMML format via [sklearn2pmml](https://github.com/jpmml/sklearn2pmml). They were then used in Java through the [pmml4s library](https://github.com/autodeployai/pmml4s).