# Example of RSW for machine learning reweighting

Using the PIMA diabetes dataset, we compare the original labeled dataset with a reweighted dataset that carries artificial labels. We begin by importing the dataset, which is already cleaned and has no missing values or outliers to handle. It is also a convenient example because all features are numeric and the labels are binary.

In [1]:
# Get the raw data, convert it to CSV, and load it
# Add the parent directory to the module search path
import sys
sys.path.insert(1, '../')
# Import libraries
from rsw import *
import os.path
from get_data import get_data
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np

# Data locations
raw_dir = "../data/raw/pima_diabetes/archive.zip"
processed_dir = "../data/processed/pima_diabetes/"

# If the processed data does not exist yet, convert the raw archive to CSV
if not os.path.exists(processed_dir + "diabetes.csv"):
    get_data.RData_to_csv(raw_dir,processed_dir)
    
# Get data into dataframe
df = pd.read_csv(processed_dir + "diabetes.csv")
df.head()

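|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |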


We separate the features from the labels, which is useful both for training ML models and for reweighting.

In [2]:
# Feature matrix: every column except the label
X = df.drop(columns='Outcome')
# Label vector: the value to predict
y = df['Outcome']

First, we look at the mean of each feature, grouped by outcome.

In [3]:
marginals_mean = df.groupby('Outcome').mean()
display(marginals_mean)

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


We may also want to consider the standard deviation of each feature.

In [4]:
marginals_std = df.groupby('Outcome').std()
display(marginals_std)

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.017185 | 26.1412 | 18.063075 | 14.889947 | 98.865289 | 7.689855 | 0.299085 | 11.667655 |
| 1 | 3.741239 | 31.939622 | 21.491812 | 17.679711 | 138.689125 | 7.262967 | 0.372354 | 10.968254 |


Skewness may also be of interest.

In [5]:
marginals_skew = df.groupby('Outcome').skew()
display(marginals_skew)

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.114105 | 0.173111 | -1.809825 | 0.031155 | 2.498741 | -0.665902 | 2.006242 | 1.571609 |
| 1 | 0.503749 | -0.495557 | -1.943633 | 0.11591 | 1.843831 | 0.000597 | 1.722373 | 0.581646 |


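Finally, we compute the kurtosis. Because `pd.DataFrame.kurt` is applied to each whole group, the output also contains the constant `Outcome` column.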

In [6]:
marginals_kurt = df.groupby('Outcome').apply(pd.DataFrame.kurt)
display(marginals_kurt)

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.687268 | 1.900521 | 5.686977 | -1.038206 | 9.457788 | 3.058286 | 6.120934 | 1.963428 | 0.0 |
| 1 | -0.441643 | 1.43196 | 4.699552 | -0.209282 | 4.360493 | 4.763822 | 4.559083 | -0.347937 | 0.0 |


### Matching marginals

First, we construct the functions $F$ whose expectations we take under the weighted measure. Mathematically, we express this as
$$ \mathbb E[F(x)] = \sum^N_{i=1} w_i F(x_i) $$
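
To make the formula concrete, the following minimal sketch evaluates $\sum_i w_i F(x_i)$ on toy data. The helper name `weighted_expectation`, the sample values, and the weight vectors are illustrative assumptions and are not part of the `rsw` package; the sketch also assumes that the weights are nonnegative and sum to one.

```python
import numpy as np

def weighted_expectation(F, samples, weights):
    """Return sum_i w_i * F(x_i), assuming the weights sum to one."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    values = np.array([F(x) for x in samples], dtype=float)
    return float(weights @ values)

# Toy data: three samples of a single feature (illustrative values only)
x = np.array([1.0, 2.0, 4.0])
uniform = np.full(3, 1 / 3)            # the unweighted case
tilted = np.array([0.5, 0.25, 0.25])   # a hypothetical reweighting

mean_uniform = weighted_expectation(lambda v: v, x, uniform)
mean_tilted = weighted_expectation(lambda v: v, x, tilted)
second_moment_tilted = weighted_expectation(lambda v: v ** 2, x, tilted)
```

Choosing $F(x) = x$ gives the weighted mean, while $F(x) = x^2$ gives the weighted second moment; changing the weights changes both without touching the samples themselves.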

In [9]:
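# For each statistic, list the features whose class-wise value should be matched.
# The skew and kurtosis lists are empty, so only means and standard deviations are matched.
funs = {'mean' : list(X), 'std' : list(X), 'skew' : [], 'kurtosis' : []}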

In [11]:
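# Build one equality target per (statistic, feature) pair for each outcome class
loss_0 = []
loss_1 = []
for key in funs.keys():
    # Kurtosis is not matched here (its feature list is empty), so skip it
    if key == 'kurtosis':
        continue
    # Class-wise statistic, e.g. df.groupby('Outcome').mean() when key == 'mean'
    marg = getattr(df.groupby('Outcome'), key)()
    for feature in funs[key]:
        # Target value of this statistic for class 0 and for class 1
        loss_0.append(EqualityLoss(marg.loc[0, feature]))
        loss_1.append(EqualityLoss(marg.loc[1, feature]))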
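
The loop above looks up each statistic by name with `getattr`. As a small illustration of that dispatch, the sketch below applies the same pattern to a toy frame; the frame `toy`, its values, and the `targets` dictionary are made up for this example and do not come from the dataset.

```python
import pandas as pd

# Toy frame standing in for `df`; the values are made up for illustration
toy = pd.DataFrame({
    "Glucose": [100.0, 120.0, 140.0, 160.0],
    "Outcome": [0, 0, 1, 1],
})

grouped = toy.groupby("Outcome")

# getattr(grouped, "mean")() is the same call as grouped.mean()
targets = {name: getattr(grouped, name)() for name in ("mean", "std")}

# Per-class target values, indexed by (class label, feature name)
mean_glucose_0 = targets["mean"].loc[0, "Glucose"]
std_glucose_1 = targets["std"].loc[1, "Glucose"]
```

Each entry of `targets` is a table with one row per outcome class, so `.loc[label, feature]` picks out the single number that an equality target is built from.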

In [12]:
#regularizer = regularizers.EntropyRegularizer(limit=None)
regularizer = regularizers.ZeroRegularizer()
w_0, out_0, sol_0 = rsw(X, funs, loss_0, regularizer,
                      2, verbose=False, rho=25, eps_abs=1e-6, eps_rel=1e-6)
w_1, out_1, sol_1 = rsw(X, funs, loss_1, regularizer,
                      2, verbose=False, rho=24, eps_abs=1e-6, eps_rel=1e-6)

<class 'dict'>
<class 'dict'>


In [13]:
X_0 = X.copy()
X_0["weights"] = w_0 
X_1 = X.copy()
X_1["weights"] = w_1 

# Set theoretical outcome to train on reweighted datasets
X_0['Outcome'] = 0
X_1['Outcome'] = 1

df_w = pd.concat([X_0,X_1])

In [14]:
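# Multiply every feature value by its row weight, then restore the labels
df_valw = df_w.drop(columns=['Outcome','weights']).multiply(df_w['weights'], axis="index")
df_valw['Outcome'] = df_w['Outcome']

In [15]:
# Summing the weighted values per class gives the weighted mean of each feature
w_marginals = df_valw.groupby('Outcome').sum()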

In [16]:
display(marginals_mean)                      # original class-wise means
display(w_marginals)                         # reweighted class-wise means
display(abs(marginals_mean - w_marginals))   # absolute difference between the two

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.292878 | 109.818946 | 68.091936 | 19.636641 | 68.685687 | 30.261579 | 0.429105 | 31.145716 |
| 1 | 4.86554 | 141.25332 | 70.822256 | 22.163477 | 100.333087 | 35.141441 | 0.550484 | 37.066027 |


| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.005122 | 0.161054 | 0.092064 | 0.027359 | 0.106313 | 0.042621 | 0.000629 | 0.044284 |
| 1 | 0.000131 | 0.004143 | 0.002371 | 0.000702 | 0.002734 | 0.001096 | 1.6e-05 | 0.001137 |
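
Summing weighted values per class only yields a weighted mean if the weights of each reweighted copy sum to one. The sketch below checks that assumption on toy data and cross-checks the per-class sum against `np.average`. The frame `toy_w`, its feature values, and its weights are made up for illustration and are not taken from the solver output.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `df_w`; feature values and weights are made up
toy_w = pd.DataFrame({
    "Glucose": [100.0, 120.0, 140.0, 100.0, 120.0, 140.0],
    "weights": [0.5, 0.3, 0.2, 0.1, 0.3, 0.6],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# The weights of each reweighted copy should sum to one
weight_totals = toy_w.groupby("Outcome")["weights"].sum()

# Same construction as in the notebook: scale each row, then sum per class
scaled = toy_w[["Glucose"]].multiply(toy_w["weights"], axis="index")
scaled["Outcome"] = toy_w["Outcome"]
weighted_means = scaled.groupby("Outcome").sum()

# Cross-check one class against NumPy's weighted average
mask = toy_w["Outcome"] == 1
reference = np.average(toy_w.loc[mask, "Glucose"],
                       weights=toy_w.loc[mask, "weights"])
```

If `weight_totals` deviates from one for a class, every weighted mean of that class is scaled by the same factor, which would show up as a constant relative gap across all features.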


In [19]:
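# Absolute deviation of every row from the original class mean, per feature
std_w = pd.concat([abs(X_0[list(X)] - marginals_mean.loc[0]),abs(X_1[list(X)] - marginals_mean.loc[1])])
# Weight the deviations; their per-class sum is a weighted mean absolute deviation,
# which is compared against the standard deviation but is not the same statistic
std_w = std_w.multiply(df_w['weights'], axis="index")
std_w = pd.concat([std_w, df_w['Outcome']],axis=1)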

In [20]:
display(marginals_std)
std_w.groupby('Outcome').sum()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.017185 | 26.1412 | 18.063075 | 14.889947 | 98.865289 | 7.689855 | 0.299085 | 11.667655 |
| 1 | 3.741239 | 31.939622 | 21.491812 | 17.679711 | 138.689125 | 7.262967 | 0.372354 | 10.968254 |


| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.832319 | 23.533193 | 18.198816 | 14.8853 | 93.183717 | 7.612969 | 0.280036 | 10.769311 |
| 1 | 3.781445 | 28.301195 | 20.958874 | 17.495757 | 149.657579 | 6.850132 | 0.395758 | 10.836409 |
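
The second table is built from weighted absolute deviations, so it is a weighted mean absolute deviation and not a standard deviation in the strict sense. For a like-for-like comparison, one option is a weighted standard deviation, $\sqrt{\sum_i w_i (x_i - \mu_w)^2}$ with $\mu_w = \sum_i w_i x_i$. The sketch below shows both quantities on toy data; the helper names and the sample values are assumptions made for this illustration and are not part of `rsw`.

```python
import numpy as np

def weighted_std(values, weights):
    """Square root of the weighted variance about the weighted mean."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalize defensively
    mean = np.sum(weights * values)
    variance = np.sum(weights * (values - mean) ** 2)
    return float(np.sqrt(variance))

def weighted_mean_abs_dev(values, weights):
    """Weighted mean of |x_i - weighted mean|."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    mean = np.sum(weights * values)
    return float(np.sum(weights * np.abs(values - mean)))

# Toy feature with uniform weights (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.full(5, 0.2)

std_value = weighted_std(x, w)
mad_value = weighted_mean_abs_dev(x, w)
```

With uniform weights `weighted_std` agrees with `np.std(x)`, which divides by $N$. The pandas `groupby(...).std()` used earlier divides by $N - 1$, so a small gap relative to that table would remain even for uniform weights.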
