# Gradient-Based One Side Sampling

LightGBM is a gradient boosting framework optimized for large datasets. One way to speed up training and reduce memory requirements is to limit the number of data instances. Related approaches already reduce or reweight the training instances: AdaBoost works with instance weights, and frameworks such as XGBoost offer random subsampling of rows. According to Ke et al. (2017, p. 4), LightGBM builds on this idea of sampling but additionally takes the residuals of the instances, and thus the distribution of the data, into account.

## Explanation of the Algorithm 
GOSS sorts the instances in descending order of their residuals. In this context, the term "residual" refers to the absolute value of the gradient that results from the prediction errors of the previous tree. The sorted instances are then divided into two subsets according to two floating-point parameters, $a$ and $b$. The parameter $a$ selects the top $a \times 100\%$ of the sorted instances, i.e. those with the largest gradients. The parameter $b$ sets the size of a random sample, $b \times 100\%$ of all instances, which is drawn from the remaining instances. The gradients of this second, randomly sampled subset are multiplied by a constant factor so that the distribution of the data remains approximately unchanged.

$$\text{factor} = \frac{1-a}{b}$$
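
The selection step and the correction factor can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not LightGBM's implementation: the function name `goss_select`, its arguments, and the synthetic gradients are hypothetical, and the weights are returned separately instead of being multiplied into the gradients.

```python
import numpy as np


def goss_select(gradients, a, b, rng):
    """Return (indices, weights) of a GOSS-style sample (illustrative sketch)."""
    n = len(gradients)
    n_top = int(n * a)   # instances kept because of their large gradients
    n_rand = int(n * b)  # instances drawn at random from the remainder

    # Indices sorted by absolute gradient, largest first
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]
    rand_idx = rng.choice(order[n_top:], size=n_rand, replace=False)

    # Top instances keep weight 1, sampled instances get (1 - a) / b
    weights = np.concatenate([np.ones(n_top), np.full(n_rand, (1 - a) / b)])
    return np.concatenate([top_idx, rand_idx]), weights


rng = np.random.default_rng(0)
g = rng.normal(size=1000)  # hypothetical gradients
idx, w = goss_select(g, a=0.2, b=0.4, rng=rng)
print(len(idx), w.min(), w.max())
```

With $a = 0.2$ and $b = 0.4$, 60% of the instances are kept, and the randomly sampled ones carry a weight of $(1 - 0.2) / 0.4 = 2$.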

GBDT uses the information gain to identify the optimal split points. The information gain is usually measured by the variance after splitting, computed on all instances. GOSS, however, works on a subset of the instances, so the exact value for the full data cannot be computed. Instead, the algorithm estimates it from the weighted sample, which typically results in only minor discrepancies.

![image.png](images/figure_8.PNG)

Figure 5: Explanation of GOSS

## Practical Understanding 
This section presents a simplified version of GOSS to illustrate how it works. The Python libraries NumPy, Pandas, and Scikit-learn are used for data generation, calculation, and storage, and Plotly Express is used for visualization. The first step is to examine the characteristics of the data set. Next, a simplified version of the GOSS algorithm is implemented for educational purposes. Finally, the data sets before and after the application of GOSS are compared.

In [2]:
# Run helper code in utilities
%run ../utilities.ipynb

For the purposes of presentation, a synthetic classification dataset is generated using the `make_blobs` function. A second array, referred to as "gradients", is generated with NumPy and represents the absolute residuals in a training process. The values are drawn from a normal distribution with a mean of 6 and a standard deviation of 1.5; their absolute values are then scaled by $\pi$, which yields a mean of roughly 19 and a standard deviation of roughly 4.9 in the generated data. The data is stored in a Pandas DataFrame.

In [3]:
# Generate synthetic data with 300 instances
samples = 300
np.random.seed(42)

# Generate a classification dataset
X, y = make_blobs(n_samples=samples, centers=2, n_features=2)

# Generate exemplary gradients for visualization purposes (based on a normal distribution)
gradient = np.pi * np.absolute(np.random.normal(6, 1.5, samples))

df = pd.DataFrame(dict(x=X[:, 0], y=X[:, 1], label=y, gradients=gradient))
display(df)

Unnamed: 0,x,y,label,gradients
0,4.978375,1.557882,1,24.902775
1,5.278471,0.311650,1,22.096563
2,-2.522695,7.956575,0,22.797112
3,5.186976,1.770977,1,25.296035
4,4.929654,4.048570,1,25.234113
...,...,...,...,...
295,-2.281738,10.321429,0,15.749875
296,2.769087,1.621656,1,28.043588
297,2.601754,0.965083,1,16.069347
298,4.180518,1.123325,1,18.079234


To analyze the data, it is first necessary to obtain an overview. This is achieved with the three plots created below. Figure 4 depicts the raw data for the two features $x$ and $y$; the color indicates the class label.
Figure 5 shows the sum of the absolute residuals per bin along the $x$-axis of Figure 4. Figure 6 shows the same along the $y$-axis.

In [4]:
# Generate a scatterplot and histograms
fig_scatter = px.scatter(
    df,
    x="x",
    y="y",
    color="label",
    title="Scatterplot of Features and Target (Figure 4)",
)
fig_hist_x = px.histogram(
    df,
    x="x",
    y="gradients",
    nbins=60,
    title="Histogram of x-Axis and distribution of gradients (Figure 5)",
)
fig_hist_y = px.histogram(
    df,
    x="y",
    y="gradients",
    nbins=60,
    title="Histogram of y-Axis and distribution of gradients (Figure 6)",
)

# Style and display visualizations
style_scatterplot(fig_scatter)
style_bar_chart(fig_hist_x)
style_bar_chart(fig_hist_y)
fig_scatter.show()
fig_hist_x.show()
fig_hist_y.show()

Figure 4 shows that instances with label 0 lie between -6 and 0 on the $x$-axis and between 7 and 13 on the $y$-axis. Instances with label 1 lie between 2 and 8 on the $x$-axis and between 0 and 4.5 on the $y$-axis. <br>
Figures 5 and 6 illustrate the distribution of gradients along the $x$- and $y$-axes, respectively. Because `make_blobs` draws each cluster from a Gaussian, the histograms are approximately bell-shaped for each cluster.
The following steps describe the statistical properties of the data. The first step is to calculate the mean, standard deviation, and variance. These results serve as a benchmark for GOSS, as the algorithm takes the distribution of the data into account. The later comparison is meant to show that running GOSS does not significantly alter these properties.

In [5]:
# Print statistical properties
print("Mean: \n", df.mean(), "\n")
print("Standard Deviation: \n", df.std(), "\n")
print("Variance : \n", df.var(), "\n")
print("Sum : \n", df.sum(), "\n")

Mean: 
 x             1.056164
y             5.474303
label         0.500000
gradients    19.284833
dtype: float64 

Standard Deviation: 
 x            3.744119
y            3.688890
label        0.500835
gradients    4.881829
dtype: float64 

Variance : 
 x            14.018425
y            13.607906
label         0.250836
gradients    23.832250
dtype: float64 

Sum : 
 x             316.849082
y            1642.290808
label         150.000000
gradients    5785.450030
dtype: float64 



At this point in the demonstration, we will implement GOSS. It should be noted that this version of GOSS is intended for illustrative purposes only. In practical applications, the algorithm is more advanced, but this will be discussed in greater detail at a later stage. 

Here, GOSS is defined as a function with three parameters. The first is the data set, structured as shown above. The second parameter, $a$, is a float that sets the share of instances with the largest gradients to keep. The third parameter, $b$, is also a float and sets the share of instances that are randomly sampled from the remainder.

Based on these values, the algorithm calculates the number of instances to be selected for each subset. In addition, a correction factor is calculated, which is used to weight the randomly selected gradients. The instances are then sorted in descending order of their gradients. For `set_a`, the `int_a` instances with the highest gradients are selected, while for `set_b`, a random sample of `int_b` instances is drawn from the remaining instances. The algorithm then adds the column `weighting`, which holds the weight of each instance. Instances belonging to `set_a`, which have comparatively large gradients, are assigned a weight of 1.
In contrast, the randomly sampled instances in `set_b` are weighted with the correction factor, which compensates for the instances that were not sampled. Finally, the two subsets are merged and the weights are applied by multiplying them with the gradients.

In [6]:
def GOSS(data: pd.DataFrame, a: float=0.15, b: float=0.3) -> pd.DataFrame:
    """A simplified implementation of the GOSS algorithm

    Args:
        data (pd.DataFrame): data set GOSS is to be performed on
        a (float): selects a subset of instances exhibiting the largest gradients
        b (float): selects a randomly sampled subset of instances
        
    Returns:
        result (pd.DataFrame): the resulting data set with reduced instances
    """
    df_size = len(data)
    int_a = int(df_size * a)  # Number of instances with large gradients
    int_b = int(df_size * b)  # Number of randomly sampled instances
    
    # Calculate correction factor
    fact = (1 - a) / b  
    
    # Sort data descending based on absolute residuals
    data = data.sort_values(
        by=["gradients"], ascending=False
    )  
    
    # Create subsets
    # Copies avoid pandas' SettingWithCopyWarning when the weight column is added
    set_a = data.iloc[:int_a].copy()  # Set containing samples with large gradients
    set_b = data.iloc[int_a:].sample(n=int_b).copy()  # Set containing random samples
    
    # Apply weighting
    set_a["weighting"] = np.ones(len(set_a))  # Add a weight
    set_b["weighting"] = np.ones(len(set_b)) * fact  # Add correction factor as a weight
    
    # Create reduced and weighted dataset consisting of set_a and set_b
    result = pd.concat([set_a, set_b])
    result["gradients"] = result["gradients"].mul(result["weighting"])

    return result

In [7]:
goss_return = GOSS(data=df, a=0.2, b=0.4)
df1 = goss_return
df1


Unnamed: 0,x,y,label,gradients,weighting
243,3.045451,1.373795,1,33.429918,1.0
257,-3.189222,9.246540,0,32.733257,1.0
169,-2.543909,7.845608,0,30.932394,1.0
77,4.063108,2.728561,1,30.413015,1.0
232,5.945358,1.994174,1,30.011128,1.0
...,...,...,...,...,...
103,-1.366375,9.766219,0,45.830782,2.0
125,6.284847,1.724134,1,31.902704,2.0
51,3.033433,2.176633,1,34.874142,2.0
274,3.456620,-0.066062,1,39.872851,2.0


After running GOSS on the data set with $a = 0.2$ and $b = 0.4$, the number of rows is reduced from 300 to 180: the 60 instances with the largest gradients plus 120 randomly sampled instances. The effect of GOSS is now illustrated visually through a comparison of the scatterplots.

In [8]:
# Generate a scatterplot of the reduced data set
fig_scatter_after = px.scatter(
    df1,
    x="x",
    y="y",
    color="label",
    title="Scatterplot of Features and Target After Performing GOSS (Figure 7)",
)

# Style and display visualizations
style_scatterplot(fig_scatter)
style_scatterplot(fig_scatter_after)
fig_scatter.show()
fig_scatter_after.show()

The upper plot shows the raw data, while the lower plot shows the result after applying GOSS. The number of data points is reduced, while their distribution remains similar. Next, the histograms are compared.

In [9]:
fig_hist_x_after = px.histogram(
    df1,
    x="x",
    y="gradients",
    nbins=30,
    title="Histogram of x-Axis and Distribution of Gradients After Performing GOSS (Figure 9)",
)
fig_hist_y_after = px.histogram(
    df1,
    x="y",
    y="gradients",
    nbins=30,
    title="Histogram of y-Axis and Distribution of Gradients After Performing GOSS (Figure 10)",
)

# Style and display visualizations
style_bar_chart(fig_hist_x)
style_bar_chart(fig_hist_x_after)
style_bar_chart(fig_hist_y)
style_bar_chart(fig_hist_y_after)
fig_hist_x.show()
fig_hist_x_after.show()
fig_hist_y.show()
fig_hist_y_after.show()

A comparison of the histograms shows that the distributions are similar to those observed before sampling.
The next step compares the statistical properties of the data sets. The values are not expected to be identical, but they should be close to those of the full data. Ke et al. (2017, p. 4) define an approximation error as the discrepancy between the estimated variance, $\tilde{V}_j(d)$, and the variance computed on the base population, $V_j(d)$, for feature $j$ and split point $d$:
$$\varepsilon(d) = |\tilde{V}_j(d) - V_j(d)|$$
As a simple proxy, the following cell compares the plain variances of the features $x$ and $y$ before and after GOSS.

In [10]:
# Sample variance (after GOSS)
print("Sample variance (after GOSS):")
print(df1[["x", "y"]].var(), "\n")

# Population variance (before GOSS)
print("Population variance (before GOSS):")
print(df[["x", "y"]].var(), "\n")

# Calculation of absolute error between estimated variance and population variance
error = abs(df1[["x", "y"]].var() - df[["x", "y"]].var())
print("Absolute error between estimated variance and population variance:")
print(error)

Sample variance (after GOSS):
x    13.206322
y    12.716475
dtype: float64 

Population variance (before GOSS):
x    14.018425
y    13.607906
dtype: float64 

Absolute error between estimated variance and population variance:
x    0.812103
y    0.891431
dtype: float64


The discrepancy between the variances of the sample and the population is small: about 5.8% for $x$ and 6.6% for $y$. GOSS is meant to reduce the number of instances while approximately preserving the distribution, and the small difference between the estimated variance and the population variance is consistent with that. The magnitude of this error depends on the hyperparameters $a$ and $b$, so tuning them can reduce the error.
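
How the choice of $a$ and $b$ affects the variance error can be explored with a small sweep. The following is a minimal sketch under stated assumptions rather than a benchmark: the data is a NumPy stand-in for one feature of the blobs, the helper `goss_rows` and the parameter grid are hypothetical, and the gradients are independent of the feature, which is not the case in real training.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Two Gaussian clusters as a stand-in for one feature (centers are arbitrary)
x = np.concatenate([rng.normal(-3, 1, n // 2), rng.normal(5, 1, n // 2)])
grad = np.abs(rng.normal(6, 1.5, n))  # hypothetical absolute gradients
full_var = x.var(ddof=1)


def goss_rows(grad, a, b, rng):
    """Row indices kept by a simplified GOSS step."""
    n_top, n_rand = int(len(grad) * a), int(len(grad) * b)
    order = np.argsort(-grad)  # largest gradients first
    rand = rng.choice(order[n_top:], size=n_rand, replace=False)
    return np.concatenate([order[:n_top], rand])


results = {}
for a, b in [(0.1, 0.1), (0.2, 0.2), (0.2, 0.4), (0.4, 0.4)]:
    # Average the absolute variance error over repeated random draws
    errors = []
    for _ in range(200):
        kept = goss_rows(grad, a, b, rng)
        errors.append(abs(x[kept].var(ddof=1) - full_var))
    results[(a, b)] = (len(kept), float(np.mean(errors)))
    print(f"a={a}, b={b}: rows={len(kept)}, mean abs. error={results[(a, b)][1]:.3f}")
```

Averaging over repeated draws separates the effect of the hyperparameters from the noise of a single random sample, which a one-off comparison cannot do.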

The next step checks that the weighted sum of the gradients in the sample is close to the sum of the gradients in the full data set. GOSS uses the correction factor to compensate for the instances that were not sampled.
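
Why the correction factor keeps the sum of the gradients approximately unchanged can be seen from a short expectation argument. This is a sketch for the simplified setting used here, with symbols introduced for illustration only: $n$ instances with gradients $g_i$, $A$ the set of the $a \cdot n$ instances with the largest gradients, $A^c$ the remaining $(1-a) \cdot n$ instances, and $B$ a uniform random sample of $b \cdot n$ instances from $A^c$. Each instance in $A^c$ is drawn with probability $\frac{b}{1-a}$, so

$$\mathbb{E}\left[\sum_{i \in A} g_i + \frac{1-a}{b} \sum_{i \in B} g_i\right] = \sum_{i \in A} g_i + \frac{1-a}{b} \cdot \frac{b}{1-a} \sum_{i \in A^c} g_i = \sum_{i=1}^{n} g_i$$

The weighted sum therefore matches the full sum in expectation, while any single draw deviates from it by sampling noise.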

In [11]:
# Sum of residuals of population (before GOSS)
raw_sum = df["gradients"].sum()
print("Sum of residuals of population (before GOSS):")
print(raw_sum, "\n")

# Sum of residuals of sample (after GOSS and weighting)
raw_sum_weighted = df1["gradients"].sum()
print("Sum of residuals of sample (after GOSS and weighting):")
print(raw_sum_weighted, "\n")

# Calculate the relative deviation of the error
error_p = (abs(raw_sum_weighted - raw_sum) / raw_sum) * 100
print("Relative deviation of the error:")
print(error_p, "%")

Sum of residuals of population (before GOSS):
5785.450029980487 

Sum of residuals of sample (after GOSS and weighting):
5715.946884965015 

Relative deviation of the error:
1.2013437961663014 %


The weighted sum of the gradients in the sample deviates from the sum over the raw data by approximately 1.2%. The results of these two tests indicate that the sample behaves similarly to the raw data. The hyperparameters $a$ and $b$ can be tuned to optimize the results, with consideration given to runtime, accuracy, and memory usage.
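
In LightGBM itself, $a$ and $b$ correspond to the parameters `top_rate` and `other_rate`. The following sketch only builds a parameter dictionary and does not train a model. The helper `goss_params` is hypothetical, and the way GOSS is switched on is an assumption that depends on the installed version, so the LightGBM parameter documentation should be checked before use.

```python
def goss_params(a: float, b: float) -> dict:
    """Map the section's a and b to LightGBM-style parameter names (sketch)."""
    # Together, the two shares cannot select more than all instances
    if not (0.0 <= a <= 1.0 and 0.0 <= b <= 1.0 and a + b <= 1.0):
        raise ValueError("a and b must lie in [0, 1] and sum to at most 1")
    return {
        "objective": "binary",
        # Assumption: LightGBM >= 4.0; older releases used boosting="goss"
        "data_sample_strategy": "goss",
        "top_rate": a,    # share of instances with the largest gradients
        "other_rate": b,  # share of randomly sampled remaining instances
    }


params = goss_params(a=0.2, b=0.4)
print(params)
```

The values 0.2 and 0.4 mirror the call used in this notebook and are not tuning recommendations.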
