# 1. Abstract

Unbalanced data remains a major problem in the field of data mining. It is especially common in medical datasets, where the number of positive cases (patients who have the condition) is much smaller than the number of negative cases (patients who do not). This imbalance causes many learning methods to become biased toward the negative class and to predict most cases as negative. Such methods achieve a very high "accuracy" because negative cases are abundant, but they are extremely poor at predicting positive cases.

Common methods to fix this problem include removing irrelevant negative samples, ignoring irrelevant attributes, and artificially creating new instances of positive cases (known as oversampling). Oversampling methods (such as SMOTE) generally perform better than the others **[citation]**, but the increase in performance varies across datasets. In this paper, we examine various oversampling methods, including one that we propose. We focus on describing visually how each method might increase prediction performance. We then test all methods, including our own, using various metrics. Finally, we describe a novel tool that we built specifically to examine the behavior of various oversampling methods on unbalanced datasets.

# 2. Introduction

A learning method that predicts nearly every case as negative is undesirable. For example, if we have a dataset of 2 malignant tumor cases and 98 benign cases, a method predicting all cases to be negative will have 98% "accuracy" but no capability to distinguish positive cases from negative cases. Thus, "accuracy" is generally uninformative when there is significant imbalance between classes in the dataset. A common alternative is to evaluate performance on positive cases and negative cases separately by building a "confusion matrix," from which the True Positive Rate (TPR) and False Positive Rate (FPR) are derived.
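The 2-versus-98 example above can be checked with a few lines of Python. This is a minimal sketch; the helper name `confusion_counts` and the 0/1 label encoding are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


# 2 malignant (positive) and 98 benign (negative) cases.
y_true = [1] * 2 + [0] * 98
# A degenerate classifier that predicts "negative" for every case.
y_pred = [0] * 100

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
tpr = tp / (tp + fn)  # true positive rate: share of positives that are caught
fpr = fp / (fp + tn)  # false positive rate: share of negatives flagged as positive
print(accuracy, tpr, fpr)  # 0.98 0.0 0.0
```

The accuracy is 98%, yet the true positive rate is zero: neither malignant case is detected.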

Even the True Positive Rate and False Positive Rate can be misleading. Most learning methods produce a probability for each class, and the final prediction depends on a "threshold value." The figure below demonstrates a hypothetical dataset and its predicted probabilities for each class. As we can observe, the true positive rate and false positive rate vary with the threshold value. Researchers can "cherry-pick" a threshold that produces good-looking results that have little value when comparing one method against another **[citation]**.
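The dependence on the threshold can be sketched as follows. The labels and predicted probabilities are made-up illustrative values, and `rates_at_threshold` is a hypothetical helper, not part of any library.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Predict positive when score >= threshold; return (TPR, FPR)."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 0 and predicted == 1:
            fp += 1
        elif label == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Illustrative data only: 3 positive and 5 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1]

for threshold in (0.35, 0.55, 0.8):
    tpr, fpr = rates_at_threshold(y_true, scores, threshold)
    print(threshold, round(tpr, 2), round(fpr, 2))
```

The same model yields a TPR of 1.0, 0.67, or 0.33 depending only on the chosen threshold, which is why a single (TPR, FPR) pair is a weak basis for comparison.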

To prevent this, another metric called "Area Under the Curve" (AUC) is used. Based on all probability predictions, we calculate every possible pair of (TPR, FPR) values and plot them as a curve. We then calculate the area under this curve. A good learning method should maintain a high TPR even when the threshold starts to favor negative cases and, thus, should have a higher AUC value in general. We will use AUC as the main value to judge the performance of different oversampling methods across different datasets.
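One way to compute this area, assuming the curve is built by sweeping the threshold over every distinct score and integrating with the trapezoidal rule, is sketched below. The function names and the sample data are illustrative assumptions.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points from (0, 0) to (1, 1), one per distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Sort by descending score: lowering the threshold admits one score group at a time.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Consume every case tied at the current score before emitting a point.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the piecewise-linear curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Illustrative data only: 3 positive and 5 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1]
print(round(auc(roc_points(y_true, scores)), 3))  # 0.8
```

An AUC of 0.8 here means that a randomly chosen positive case receives a higher score than a randomly chosen negative case in 12 of the 15 possible pairs.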

[Insert a figure here to indicate how misleading FPR and TPR rate can be and how effective AUC is]

[Maybe write a few criteria on what makes a good oversampling method]

# 3. Visualize behavior of different oversampling methods

!["OriginalData"](./Images/TrueOriginalData.png "Logo Title Text 1")

In this section, we visualize the behavior of each sampling method on a hypothetical dataset. This dataset consists of 10000 negative cases and 300 positive cases that lie in three separate circles, with the theoretical boundary denoted by the dashed line. One might ask why we do not simply treat each circle as a separate case. The reasoning is simple: most datasets are high-dimensional (more than 3 attributes), and since there is no way to visualize more than 3 dimensions, it becomes extremely difficult for practitioners to separate these groups by hand, even though the positive cases might be scattered. Therefore, we expect the oversampling methods to adapt to this situation, where positive cases are scattered across different groups.
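A dataset of this shape could be generated along the following lines. The circle centres, the radius, the 10 by 10 plotting area, and the choice to keep negatives strictly outside the circles are all assumptions for illustration; the values used for the figure are not stated in the text.

```python
import math
import random

# Hypothetical geometry: three circles of equal radius inside a 10 x 10 square.
CENTERS = [(2.0, 2.0), (7.0, 3.0), (4.0, 8.0)]
RADIUS = 1.0


def in_any_circle(x, y):
    """True if (x, y) lies inside one of the positive regions."""
    return any(math.hypot(x - cx, y - cy) <= RADIUS for cx, cy in CENTERS)


def make_dataset(n_neg=10000, n_pos=300, seed=0):
    """Return (negatives, positives) as lists of (x, y) tuples."""
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < n_neg:
        # Rejection sampling: keep only points outside every circle.
        x, y = rng.uniform(0, 10), rng.uniform(0, 10)
        if not in_any_circle(x, y):
            negatives.append((x, y))
    positives = []
    for i in range(n_pos):
        cx, cy = CENTERS[i % len(CENTERS)]
        # The square root makes the points uniform over the disc area.
        r = RADIUS * math.sqrt(rng.random())
        theta = rng.uniform(0, 2 * math.pi)
        positives.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return negatives, positives


negatives, positives = make_dataset()
print(len(negatives), len(positives))  # 10000 300
```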

[Maybe provide a few more hypothetical datasets]

## SMOTE

SMOTE is a very common oversampling method for dealing with unbalanced datasets. You can read the pseudo-code in the original paper by **[citation]**. Essentially, the method picks a random point among the positive cases. It then finds the k nearest neighbors of that point within the same class. Finally, it chooses one of these neighbors at random and creates a new positive instance at a random position on the line segment between the original point and that neighbor.
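The sketch below follows the standard SMOTE formulation, in which each synthetic point is interpolated between a randomly chosen positive point and one of its k nearest positive neighbors. It is a simplified illustration, not the pseudo-code from the original paper: the function name `smote`, its parameters, and the brute-force neighbor search are assumptions.

```python
import math
import random


def smote(positives, k=5, n_new=100, seed=0):
    """Generate n_new synthetic positive points by SMOTE-style interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(positives))
        base = positives[i]
        # Brute-force search for the k nearest positive neighbours of the base point.
        others = positives[:i] + positives[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Four positive points at the corners of a unit square (illustrative data).
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(corners, k=2, n_new=5)
print(new_points)
```

With k=2, each corner's nearest neighbors are the two adjacent corners, so every synthetic point falls on an edge of the square. Raising k to 3 would also allow points on the diagonals, which is the kind of link between distant points discussed below.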

!["OriginalData"](./Images/ComparisonSMOTE.png "Logo Title Text 1")

As we can see, SMOTE seems to "strengthen" the border by creating many instances between neighboring points. This behavior can be very beneficial for distance-based methods such as kNN: the artificial positive instances help the artificial boundary move closer to the actual theoretical boundary.

When k is too small, this "bordering behavior" becomes limited because it does not stretch to points that are farther away. When k is too big, however, a point might pick a neighbor that actually belongs to a separate group, potentially creating awkward links. In the example above, the local group on the bottom right has only 9 instances, but every point has to pick 10 nearest neighbors. Point A, therefore, might pick a point that is extremely far away but still counts as one of its nearest neighbors. This faraway point can in turn be chosen to create a new instance, potentially creating an awkward link in the middle.

The most desirable option is to choose the biggest k that is still smaller than the number of instances in the smallest local group. In practice, this is hard to achieve, since there is no way to visualize more than 3 dimensions to know which groups are actually "local." The optimal k value is found on a case-by-case basis through experimentation.

[that is k, what about the number of samples generated?]

## Borderline SMOTE (Coming soon)

## AGO

To mitigate this awkward-linkage weakness in SMOTE, we devise a new method called Augmented Gaussian Sampling (AGO). We provide the pseudo-code of the method in the reference section. Basically, we pick a random point and then choose its k nearest neighbors. We then generate a new point from a Normal distribution N(mu, sigma), where mu is the point in question and sigma is the standard deviation of the k nearest data points. In this way, we create a new point around the original positive case that we initially picked.
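One possible reading of this description is sketched below. It is not the pseudo-code from the reference section. It assumes that the neighbors are taken among the positive instances, that sigma is computed separately for each attribute as the population standard deviation of the k neighbors, and that attributes are sampled independently. The function name `ago` and its parameters are illustrative.

```python
import math
import random
import statistics


def ago(positives, k=5, n_new=100, seed=0):
    """Generate n_new synthetic points by Gaussian sampling around positive points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(positives))
        base = positives[i]
        # Brute-force search for the k nearest positive neighbours of the base point.
        others = positives[:i] + positives[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Assumption: one sigma per attribute, taken over the k neighbours.
        sigmas = [statistics.pstdev(column) for column in zip(*neighbours)]
        # Sample each attribute from N(mu, sigma) centred on the base point.
        synthetic.append(tuple(rng.gauss(mu, sigma) for mu, sigma in zip(base, sigmas)))
    return synthetic


# Illustrative data: a small cluster of positive points.
cluster = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.15, 0.15)]
print(ago(cluster, k=3, n_new=3))
```

Because each new point is centred on an existing positive case rather than placed between two cases, a distant neighbor can only widen the spread; it cannot pull the new point onto a line toward another group.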

!["OriginalData"](./Images/ComparisonAGO.png "Logo Title Text 1")

When many of these points are generated, the method has the effect of strengthening local groups by creating new points around the existing ones.

This method might still suffer when the value of k is too big. As with SMOTE, it can sometimes choose a neighbor that is too far away from the original point; this neighbor inflates the standard deviation and produces a new instance that is somewhat far from the original point. However, since the new points are still centered on the original point, the spillover effect is significantly less harmful than in SMOTE.

Another weakness of this method is that the random point is often already well inside the decision boundary, and generating new instances around such points has little effect on the overall result. This issue can be addressed by using Borderline NGO.

## Borderline NGO (Unfinished)

The obvious way to avoid oversampling these interior points is to oversample only the points that are close to the border. To determine which instances are close to the border, we can select the k nearest points of any class (instead of the k nearest positive instances). If all k nearest points are positive, then we can be fairly certain that this point is already well inside the positive territory and there is no need to oversample it. If at least c points (1 <= c <= k) are negative, we can suspect that the point is very close to the border and needs to be oversampled.
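The selection rule can be sketched as a filter that runs before oversampling. The function name `borderline_positives`, the brute-force neighbor search, and the sample points are assumptions for illustration.

```python
import math


def borderline_positives(positives, negatives, k=5, c=1):
    """Return the positive points that have at least c negatives among their k nearest points."""
    # Positives come first, so index i in this list matches index i in `positives`.
    labelled = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    selected = []
    for i, point in enumerate(positives):
        # Search the whole dataset (both classes), excluding the point itself.
        others = labelled[:i] + labelled[i + 1:]
        nearest = sorted(others, key=lambda item: math.dist(point, item[0]))[:k]
        n_negative = sum(1 for _, label in nearest if label == 0)
        if n_negative >= c:
            selected.append(point)
    return selected


# Illustrative data: three interior positives and one positive next to the negatives.
pos = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 0.0)]
neg = [(1.1, 0.0), (1.2, 0.0), (5.0, 5.0)]
print(borderline_positives(pos, neg, k=2, c=1))  # [(1.0, 0.0)]
```

Only the selected points would then be passed on as base points for generating new instances.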

We can see visually that this method helps reduce the number of unnecessary oversampled instances and strengthens only the "borderline" points. The problem with this method is that the computational cost of generating new instances blows up. Because the cost of a brute-force k-nearest-neighbors search is O(n^2), it is much more expensive to apply kNN to the entire dataset than to the positive instances alone (which are assumed to be few compared to the negative instances). We will examine the performance results to see whether the increase in performance (if any) is worth the additional cost.

# 4. Testing performance among different oversampling methods

To test the performance of different oversampling methods, we use 7 different learning algorithms to see how much each oversampling method does or does not enhance the performance of a learning algorithm. We test each learning algorithm first on the original dataset, and then on each new dataset that contains oversampled positive instances from one of the described oversampling methods.

The AUC value is generated by cross-validating the dataset 100 times to obtain an average value and a confidence interval.
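The text does not say which kind of confidence interval is used. As one possibility, the sketch below summarizes a list of AUC values with a normal-approximation 95% interval around the mean. The function name `mean_and_ci`, the z value of 1.96, and the sample values are assumptions, and the numbers are placeholders rather than results from this paper.

```python
import math
import statistics


def mean_and_ci(values, z=1.96):
    """Return (mean, lower, upper) using a normal approximation for the mean."""
    mean = statistics.mean(values)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - z * sem, mean + z * sem


# Placeholder AUC values standing in for the repeated cross-validation runs.
auc_values = [0.81, 0.79, 0.83, 0.80, 0.82]
mean, lower, upper = mean_and_ci(auc_values)
print(round(mean, 3), round(lower, 3), round(upper, 3))
```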

## Results

The results clearly show that oversampling methods generally increase the performance of the learning algorithms. AGO and SMOTE yield relatively similar increases in performance. Which of the two performs better varies from case to case.

The Random Forest algorithm generally performs well even on the original dataset, suggesting that its learning mechanism is less affected by the imbalance of the dataset. There is still, however, some performance improvement when the dataset is pre-processed with an oversampling method.