# Data Preprocessing

This file is split into sections, one per dataset used in the project, each holding that dataset's preprocessing.
In essence, raw files (`/datasets/raw`) are processed and written to `/datasets/processed`, so that they can be loaded and used directly by the rest of the project.

**Requirements**
- for small datasets (labelled with [S]) there should be at most $10$ variables
- for large datasets (labelled with [L]) there should be more than $10$ variables.
- for ALL datasets the number of observations MUST be larger than the number of variables
- ALL datasets MUST have a clear binary target/response variable that should only take values $0$ or $1$
- ALL datasets MUST have less than $10$% missing data per variable
- for ALL datasets collinear variables SHOULD be removed
- EACH preprocessed dataset MUST be saved as a single file under `datasets/processed` for further use
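
The requirements above can be checked mechanically before a dataset is saved. The following is a minimal sketch, not part of the original pipeline: the helper name `check_requirements`, its arguments, and the toy data are hypothetical, and it assumes that "variables" means the feature columns excluding the target. Collinearity is not checked here.

```python
import pandas as pd


def check_requirements(df, target, size):
    """Return a list of violated requirements (an empty list means all checks pass).

    Hypothetical helper: `size` is "S" for small or "L" for large datasets.
    Assumption: the variable count excludes the target column.
    """
    problems = []
    features = df.drop(columns=[target])
    n_obs, n_vars = features.shape

    if size == "S" and n_vars > 10:
        problems.append("small dataset has more than 10 variables")
    if size == "L" and n_vars <= 10:
        problems.append("large dataset has 10 or fewer variables")
    if n_obs <= n_vars:
        problems.append("observations do not exceed variables")
    # The target may only take the values 0 or 1.
    if not set(df[target].dropna().unique()) <= {0, 1}:
        problems.append("target is not binary 0/1")
    # Share of missing values per column must stay below 10%.
    if (df.isna().mean() >= 0.10).any():
        problems.append("a variable has 10% or more missing data")
    return problems


# Toy data, invented purely for illustration.
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [0.5, 0.1, 0.9, 0.3],
    "y": [0, 1, 0, 1],
})
print(check_requirements(toy, target="y", size="S"))
```

Returning a list of messages rather than raising on the first failure makes it easier to see every problem with a dataset in one pass.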

# Common tools

In [1]:
from statsmodels.stats.outliers_influence import variance_inflation_factor


def calculate_vif_(X, thresh=5.0):
    """Iteratively drop the column with the highest VIF until no VIF exceeds thresh."""
    X = X.assign(const=1)  # faster than add_constant from statsmodels
    variables = list(range(X.shape[1]))  # column positions still kept; const is last
    dropped = True
    # Stop once only the constant is left, so max() never sees an empty list.
    while dropped and len(variables) > 1:
        dropped = False
        current = X.iloc[:, variables]
        vif = [variance_inflation_factor(current.values, ix)
               for ix in range(current.shape[1])]
        vif = vif[:-1]  # don't let the constant be removed in the loop.
        max_vif = max(vif)
        maxloc = vif.index(max_vif)
        if max_vif > thresh:
            # f-string also works when column labels are not strings
            print(f"dropping '{current.columns[maxloc]}' at index: {maxloc}")
            del variables[maxloc]
            dropped = True

    print('Remaining variables:')
    print(X.columns[variables[:-1]])
    return X.iloc[:, variables[:-1]]
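
To make clear what the variance inflation factor measures, here is a minimal NumPy-only sketch. It is an illustration, not the statsmodels implementation: the function name `vif_single` and the synthetic data are hypothetical. For a column $j$, the VIF is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing column $j$ on all other columns plus an intercept.

```python
import numpy as np


def vif_single(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the rest plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(y))])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)


# Synthetic data, invented for illustration: `c` is nearly a copy of `a`.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
X_demo = np.column_stack([a, b, c])

vifs = [vif_single(X_demo, j) for j in range(X_demo.shape[1])]
print(vifs)
```

In this setup the near-duplicate columns `a` and `c` get VIFs far above the default threshold of 5, while the independent column `b` stays close to 1, which is why a threshold-based loop removes one of the collinear pair and keeps the rest.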

# [S1] Diabetes

# [S2] Tour & Travels Customer Churn

# [S3] Seeds

# [L1] League of Legends Challenger Rank Game

# [L2] Jungle chess

# [L3] Patient Survival Prediction

# [L4] Hotel Booking Cancellation

# [L5] Ionosphere

# [L6] Sonar (Rock vs Mine)