# Balanced Regression

Joey Couse and Abraham Eaton <br>
22-Jan-2021<br>

### Overview:
Machine learning hyperparameters are optimized as part of the training process to improve model performance. A less researched question is how to optimize the selection of the training and validation datasets. Often this selection is made randomly, in order to treat the data "fairly". When a classification problem has unbalanced classes, standard practice is to balance the dataset across the dependent (predicted) variable through oversampling, undersampling, sample weighting, or other methods. In practice this helps correct for unbalanced data and yields better results.
<br>
We believe that this idea should be taken one step further by balancing the training data across a number of features, not only the dependent variable. Our hypothesis is that keeping the training set and validation set loosely balanced will allow machine learning models to distill the true signals in the training data more accurately. Particularly with small datasets, small sample sizes can lead to training and validation sets that differ from each other in potentially critical ways. In this script we attempt to test this hypothesis using datasets from the UCI Machine Learning Repository.
<br>
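One way to make "loosely balanced" concrete is sketched below. This is only an illustration under stated assumptions, not the method used in this script: the helper names `imbalance` and `balanced_split` are hypothetical, the balance measure (largest standardized mean difference across features) is one choice among many, and the search simply keeps the best of a fixed number of random splits.

```julia
using Random, Statistics

# Hypothetical helper: largest standardized difference in column means
# between the training rows and the validation rows.
function imbalance(X::AbstractMatrix, train_idx, valid_idx)
    worst = 0.0
    for j in axes(X, 2)
        s = std(X[:, j])
        s == 0 && continue  # a constant column carries no balance information
        d = abs(mean(X[train_idx, j]) - mean(X[valid_idx, j])) / s
        worst = max(worst, d)
    end
    return worst
end

# Hypothetical helper: draw `ntries` random splits and keep the one
# with the smallest imbalance score.
function balanced_split(X::AbstractMatrix; valid_frac=0.2, ntries=200,
                        rng=Random.default_rng())
    n = size(X, 1)
    nvalid = round(Int, valid_frac * n)
    best_score, best_train, best_valid = Inf, Int[], Int[]
    for _ in 1:ntries
        perm = randperm(rng, n)
        valid_idx, train_idx = perm[1:nvalid], perm[nvalid+1:end]
        score = imbalance(X, train_idx, valid_idx)
        if score < best_score
            best_score, best_train, best_valid = score, train_idx, valid_idx
        end
    end
    return best_train, best_valid, best_score
end

# Toy data: 40 rows, 3 features
rng = MersenneTwister(1)
X = randn(rng, 40, 3)
train_idx, valid_idx, score = balanced_split(X; rng=rng)
```

A purely random split corresponds to `ntries=1`, so comparing models trained with `ntries=1` against a larger value would be one simple way to probe the hypothesis.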

### Sources:
This script utilizes a UCI Machine Learning Repository API developed and maintained here: https://github.com/tirthajyoti/UCI-ML-API#lowlevelfunctions.

In [1]:
using Statistics, DataFrames, LinearAlgebra, CSV, Random, ScikitLearn, MatrixImpute
using ScikitLearn.Pipelines: Pipeline, make_pipeline
using Lathe.preprocess: TrainTestSplit, OneHotEncode

In [2]:
@sk_import preprocessing: normalize
@sk_import preprocessing: StandardScaler
@sk_import preprocessing: MinMaxScaler
@sk_import decomposition: PCA
@sk_import linear_model: LogisticRegression
@sk_import linear_model: LinearRegression
@sk_import metrics: (mean_squared_error, r2_score)
;

In [3]:
path = @__DIR__;
## The Data and Output folders sit next to the folder that contains this script
datapath = joinpath(dirname(path),"Data");
outputpath = joinpath(dirname(path),"Output");
;

In [4]:
Folder_Array = readdir(datapath);
if isempty(Folder_Array)
    error("No data files in data directory, run load_UCI_data.py first!")
end
;

In [5]:
mutable struct DataSet
    name::String
    data::DataFrame
    
    ## Splits of the ds.data attribute
    X_train::Array
    y_train::Array
    X_train_scaled::Array
    y_train_scaled::Array
    
    X_valid::Array
    y_valid::Array
    X_valid_scaled::Array
    y_valid_scaled::Array
    
    X_test::Array
    y_test::Array
    X_test_scaled::Array
    y_test_scaled::Array
    
    
    ## How much to put in each set of data
    test_perc::Float64
    valid_perc::Float64
    
    ## Keep track of how the data was scaled
    scale_type::String
    
    ## empty after cleaning?
    empty_after_cleaning::Bool
    
    DataSet(name::String) = new(name, DataFrame(), 
        Array{Float64}(undef,0,0), Array{Float64}(undef,0,0), 
        Array{Float64}(undef,0,0), Array{Float64}(undef,0,0), 
        Array{Float64}(undef,0,0), Array{Float64}(undef,0,0), 
        Array{Float64}(undef,0,0), Array{Float64}(undef,0,0), 
        Array{Float64}(undef,0,0), Array{Float64}(undef,0,0), 
        Array{Float64}(undef,0,0), Array{Float64}(undef,0,0), 
        0.15, 0.15, "",false)
end

## Establish an array of datasets
datasets = Array{DataSet}(undef,length(Folder_Array))

## Fill each element of the array with a "DataSet" construct, supplying the name for each
for i in eachindex(Folder_Array)
    datasets[i] = DataSet(Folder_Array[i])
end
;

In [6]:
## Helper functions that operate on a "DataSet" construct

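function load_dataset(ds::DataSet)
    ## Each dataset folder is expected to hold its CSV as the first file listed
    folder = joinpath(datapath,ds.name)
    file = readdir(folder)[1];
    ## Passing DataFrame as the sink makes CSV.read return a DataFrame
    ds.data = CSV.read(joinpath(folder,file),DataFrame;missingstrings=["-999", "NA","NaN","?"]);
end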






function clean(ds::DataSet)
    
    
    ## drop rows with "missing" as a cell value
    #println(MatrixImpute.Impute(convert(Matrix,ds.data),4))
    ds.data = dropmissing(ds.data)
    
    
    ## room for improvement--build out something that detects categorical variables and 
    ## ideally takes the top N categories and makes them one-hot, or just converts the 
    ## column if there are fewer than N categories.
    
#     scaled_feature = OneHotEncode(ds,:Status)
#     select!(df, Not([:Status,:Country]))
    
    
    ## only allow float or integer values (an oversimplification, see above for Categorical discussion)
    ds.data = ds.data[:,((eltype.(eachcol(ds.data)) .== Int64) .| (eltype.(eachcol(ds.data)) .== Float64))]
    #ds.data = complete_cases(ds.data)
    if size(ds.data,1) <= 10 || size(ds.data,2) <= 2
        ds.empty_after_cleaning = true
        println("Not enough data after cleaning to regress.")
    end
    
end





function unload_dataset(ds::DataSet)
    ## To reduce memory for larger sets and reset the DataSet construct
    ds.data = DataFrame()
    
    
    ds.X_train= Array{Float64}(undef,0,0)
    ds.y_train= Array{Float64}(undef,0,0)
    ds.X_train_scaled= Array{Float64}(undef,0,0)
    ds.y_train_scaled= Array{Float64}(undef,0,0)
    
    ds.X_valid= Array{Float64}(undef,0,0)
    ds.y_valid= Array{Float64}(undef,0,0)
    ds.X_valid_scaled= Array{Float64}(undef,0,0)
    ds.y_valid_scaled= Array{Float64}(undef,0,0)
    
    ds.X_test= Array{Float64}(undef,0,0)
    ds.y_test= Array{Float64}(undef,0,0)
    ds.X_test_scaled= Array{Float64}(undef,0,0)
    ds.y_test_scaled= Array{Float64}(undef,0,0)
end
;

In [7]:
load_dataset(datasets[1]);
clean(datasets[1]);

In [8]:
function split(ds::DataSet)
    if ds.empty_after_cleaning == false
        
        train, test = TrainTestSplit(ds.data,(1-ds.test_perc))
        valid, train = TrainTestSplit(train,((ds.valid_perc)/(1-ds.test_perc)))


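        ## Features are every column but the last; the target is the last column.
        ## Matrix(...) converts each DataFrame slice explicitly to match the Array fields.
        ncols = size(ds.data,2)

        ds.X_train = Matrix(train[:,1:(ncols-1)])
        ds.y_train = train[:,ncols]

        ds.X_valid = Matrix(valid[:,1:(ncols-1)])
        ds.y_valid = valid[:,ncols]

        ds.X_test = Matrix(test[:,1:(ncols-1)])
        ds.y_test = test[:,ncols]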

        ds.y_train = reshape(ds.y_train,(size(ds.y_train)[1],1))
        ds.y_valid = reshape(ds.y_valid,(size(ds.y_valid)[1],1))
        ds.y_test = reshape(ds.y_test,(size(ds.y_test)[1],1))
    
    end
end
;
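The second call in `split` uses the fraction `valid_perc/(1-test_perc)` because it operates on what remains after the test rows are removed: with both percentages at 0.15, that is 0.15/0.85 ≈ 0.176 of the remaining 85%, which works out to 15% of the full dataset. The sketch below shows the same arithmetic on plain row indices. It is not Lathe's implementation; `three_way_indices` is a hypothetical helper, and it assumes fixed-size sets obtained from a single shuffle.

```julia
using Random

# Hypothetical helper: shuffle the row indices once and cut them into
# train / validation / test index vectors.
function three_way_indices(n::Integer; test_perc=0.15, valid_perc=0.15,
                           rng=Random.default_rng())
    perm = randperm(rng, n)
    ntest = round(Int, test_perc * n)
    # Validation share of the rows left after the test rows are removed
    valid_share = valid_perc / (1 - test_perc)
    nvalid = round(Int, valid_share * (n - ntest))
    test_idx = perm[1:ntest]
    valid_idx = perm[ntest+1:ntest+nvalid]
    train_idx = perm[ntest+nvalid+1:end]
    return train_idx, valid_idx, test_idx
end

train_idx, valid_idx, test_idx = three_way_indices(200; rng=MersenneTwister(42))
println(length.((train_idx, valid_idx, test_idx)))
```

Working with indices rather than copies of the data also makes it easy to reuse one split for both the raw and the scaled arrays.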

In [9]:
split(datasets[1]);
;



In [10]:
function scale(ds::DataSet)
    if ds.empty_after_cleaning == false
        if ds.scale_type == "MinMax"
            scaler = MinMaxScaler();
            yscaler = MinMaxScaler();
        elseif ds.scale_type == "Normalize"  ## not working currently: sklearn's normalize is a function that needs data, not a scaler object
            scaler = normalize();
            yscaler = normalize();
        elseif ds.scale_type == "Standard"
            scaler = StandardScaler();
            yscaler = StandardScaler();
        elseif ds.scale_type == ""
            println("default Standard Scaler being used")
            scaler = StandardScaler();
            yscaler = StandardScaler();
        else
            println("Scaler type not recognized, try 'Normalize' or 'Standard' or 'MinMax'")
            return
        end

        fit!(scaler,ds.X_train);

        ds.X_train_scaled = transform(scaler,ds.X_train);
        ds.X_valid_scaled = transform(scaler,ds.X_valid);
        ds.X_test_scaled = transform(scaler,ds.X_test);

        fit!(yscaler,ds.y_train);

        ds.y_train_scaled = transform(yscaler,ds.y_train);
        ds.y_valid_scaled = transform(yscaler,ds.y_valid);
        ds.y_test_scaled = transform(yscaler,ds.y_test);
        return
    end
end
;
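The important detail in `scale` is that each scaler is fit on the training split only and then reused for the validation and test splits, so no information from those sets leaks into the transformation. A minimal plain-Julia sketch of that pattern for standard scaling follows. The helpers `fit_standard` and `apply_standard` are hypothetical stand-ins rather than the scikit-learn API, and the sketch assumes the population standard deviation, which is what `StandardScaler` uses.

```julia
using Statistics

# Hypothetical stand-in for a standard scaler: learn the per-column mean
# and standard deviation from the training rows only.
function fit_standard(X_train::AbstractMatrix)
    mu = mean(X_train, dims=1)
    sigma = std(X_train, dims=1, corrected=false)  # population standard deviation
    sigma[sigma .== 0] .= 1.0                      # leave constant columns unscaled
    return mu, sigma
end

# Apply previously learned statistics to any split
apply_standard(X, mu, sigma) = (X .- mu) ./ sigma

X_train = [1.0 10.0; 2.0 20.0; 3.0 30.0]
X_valid = [2.0 40.0]

mu, sigma = fit_standard(X_train)
X_train_scaled = apply_standard(X_train, mu, sigma)
X_valid_scaled = apply_standard(X_valid, mu, sigma)  # reuses the training statistics
```

Because the validation row is scaled with training statistics, its values are not guaranteed to have zero mean or unit variance, which is expected.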

In [11]:
datasets[1].scale_type = "MinMax"
scale(datasets[1])

In [12]:
function fit_model(ds::DataSet)
    if (ds.empty_after_cleaning == false)
        ## can add functionality later for different types of models
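        model = LinearRegression()

        ## note: the model is fit and evaluated on the unscaled splits;
        ## the *_scaled arrays produced by scale() are not used here
        fitted = fit!(model,ds.X_train,ds.y_train)

        predictions = fitted.predict(ds.X_test)

        # The coefficients
        println("Coefficients: \n", fitted.coef_)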
        println()
        # The mean squared error
        println("Mean squared error: \n", mean_squared_error(ds.y_test, predictions))
        println()
        # The coefficient of determination: 1 is perfect prediction
        println("Coefficient of determination: \n", r2_score(ds.y_test, predictions))
    end
end
;
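For reference, the quantities printed by `fit_model` can be reproduced without scikit-learn. The sketch below is an illustration under assumptions, not the code path used above: the helper names `ols_fit`, `ols_predict`, `mse`, and `r2` are hypothetical, and the toy data is made up so that the fit is exact.

```julia
using LinearAlgebra, Statistics

# Ordinary least squares with an explicit intercept column, solved with `\`
function ols_fit(X::AbstractMatrix, y::AbstractVector)
    A = hcat(ones(size(X, 1)), X)
    return A \ y  # first entry is the intercept, the rest are coefficients
end

ols_predict(beta, X) = hcat(ones(size(X, 1)), X) * beta

# Mean squared error: average squared residual
mse(y, yhat) = mean((y .- yhat) .^ 2)

# Coefficient of determination: 1 minus residual sum of squares over total sum of squares
r2(y, yhat) = 1 - sum((y .- yhat) .^ 2) / sum((y .- mean(y)) .^ 2)

# Toy data that follows y = 1 + 2x exactly
X = reshape([0.0, 1.0, 2.0, 3.0], :, 1)
y = [1.0, 3.0, 5.0, 7.0]

beta = ols_fit(X, y)
yhat = ols_predict(beta, X)
println("MSE: ", mse(y, yhat), "  R2: ", r2(y, yhat))
```

Note that mean squared error is expressed in squared units of the target, so values are only comparable between runs that use the same target scaling, while the coefficient of determination is unitless.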

In [13]:
# estimators = [("normalize",normalize()),("reduce_dim", PCA()), ("logistic_regression", LogisticRegression())]
# clf = Pipeline(estimators)
# fit!(clf, X, y)

In [14]:
fit_model(datasets[1])

Coefficients: 
[-0.008972949320528011 0.23669992924633915 0.014515397820578044 0.004114840970287457 -0.03634268197145784 0.6035499702524569 0.058659845085849135 -8.36477626479504e-5 -0.24537776461370517 0.5092901582604304 -0.24289301826605048 0.32483644154124774 -0.030131100197043854 -0.050867552415045376 0.04519181485674417 -0.07839186882494921 -0.4676395564813209 -0.09293138757092609 -0.006551609856056274 0.19317587930904148 1.3625896506528286e-5 0.20948014106124435]

Mean squared error: 
1.0040214683866393

Coefficient of determination: 
0.8463856853017001


In [15]:
unload_dataset(datasets[1]);

In [16]:
load_dataset(datasets[1]);
clean(datasets[1]);
split(datasets[1]);
datasets[1].scale_type = "Standard";
scale(datasets[1]);
fit_model(datasets[1])

Coefficients: 
[-0.005325740734393695 0.224143045262405 0.01328947038410776 0.003912074435558345 -0.04008351617194593 0.6195236502073466 0.053768131816831606 -0.00027929292292136606 -0.3019560185788868 0.6290825905687777 -0.17840636499592688 0.34404131407297867 -0.026366594660104097 -0.06297801327417454 0.03815638867861296 -0.07358749355415155 -0.27012069062354943 -0.13682586286899376 -0.006402106614226001 0.18501872737431072 -2.862249890652369e-5 0.21086954381748488]

Mean squared error: 
0.8868756360175812

Coefficient of determination: 
0.8551188582550677


In [17]:
## loop through each DataSet and apply the load_dataset, clean, split, scale, fit_model, unload pipeline in sequence.

for i in 5:14
    println("Name:")
    println(datasets[i].name)
    println()
    load_dataset(datasets[i]);
    clean(datasets[i]);
    split(datasets[i]);
    datasets[i].scale_type = "Standard";
    scale(datasets[i]);
    fit_model(datasets[i])
    unload_dataset(datasets[i]);
    println()
    println()
end

Name:
Large_csv_Opinion_Corpus_for_Lebanese_Arabic_Reviews_%28OCLAR%29

Not enough data after cleaning to regress.


Name:
Large_csv_QSAR_biodegradation

Coefficients: 
[-0.4765238310082644 0.31144397557153575 0.6146997454201498 -0.11079294190266932 0.034865149081312816 0.944861569128173 0.05314257239785255 0.04985551974376269 -0.17721689668397878 -0.05007588246855228 0.05814983809606156 0.017947915857807556 1.5915611515058814 -0.07237058456791842 -1.0096541153757639 -0.04572221055145353 3.305802382822178 38.69258858835163 -0.3103084291826603 -0.42731137312005507 -0.07575270028894521 0.9313573564356805 -0.0662004356287588 -0.24019749795649936 0.32970865363696017 -0.28677407985002484 2.9059889421178293 0.2054157181464794 1.1053941965574032 0.0071927775743400296 -0.04013595166039375 0.009943863624683805 0.13276074576534225 0.008615433841276544 -0.0281571413911846 -0.19239688494478818 -0.030054005626116997 -0.4322163086577883 -0.10156387734403682 0.17913096902174108]

Mean squared error: 