In [1]:
import pandas as pd
import numpy as np
from importlib import import_module
from MLsandbox import model_search_builder as msb  # import msb tools (developed by Nicole)




## Load data

In [2]:
df = pd.read_csv('../data/sample_data.csv', sep=',', header=0, index_col=0)

# rename the label column from "target_variable" to "target" so the msb module recognizes it as the output variable
df.rename(columns={'target_variable':'target'}, inplace=True)
df.head()

| Unnamed: 0 | site_id | strategy_id | list_type | line_id | adv_id | adv_vertical | name | goal | price | limit | ... | win_rate_site | win_rate_strat | cvr_strat | cvr | line_cvr | hist_zscore | overlap | target | win_rate_site_table | win_rate_strat_table |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 82932 | 313729 | testing | 20049 | 206.0 | Travel | Nicole | 0.0 | 3.95 | 10000.0 | ... | 0.423778 | 0.111431 | 0.0 | 0.001197 | 0.0 | 2.708366 | 0.001066 | 0 | 0.450094 | 0.249479 |
| 1 | 90474 | 313729 | testing | 20049 | 206.0 | Travel | Nicole | 0.0 | 3.95 | 10000.0 | ... | 0.16301 | 0.111431 | 0.0 | 0.001239 | 0.0 | 1.188635 | 0.000703 | 0 | 0.15805 | 0.249479 |
| 2 | 92345 | 313729 | testing | 20049 | 206.0 | Travel | Nicole | 0.0 | 3.95 | 10000.0 | ... | 0.318358 | 0.111431 | 0.0 | 0.000729 | 0.0 | 1.503285 | 0.000873 | 0 | 0.360591 | 0.249479 |
| 3 | 92415 | 313729 | testing | 20049 | 206.0 | Travel | Nicole | 0.0 | 3.95 | 10000.0 | ... | 0.133199 | 0.111431 | 0.0 | 0.005894 | 0.0 | 35.153628 | 0.004614 | 0 | 0.113717 | 0.249479 |
| 4 | 92425 | 313729 | testing | 20049 | 206.0 | Travel | Nicole | 0.0 | 3.95 | 10000.0 | ... | 0.37931 | 0.111431 | 0.0 | 0.0 | 0.0 | -0.091378 | 0.000344 | 0 | 0.019308 | 0.249479 |


## Check for data imbalance

In model_search_builder.py, you'll find three functions to help with data exploration: success_breakdown, find_nan, and fill_nan_deterministic.

**success_breakdown(df, by_column='', nan_breakdown=False):** shows the number of successes and failures by value for a particular column.  

- This helps you determine whether you have an unbalanced dataset (e.g. if fewer than 10% of your target labels are classified as "success").  If this is the case, I recommend reading https://www3.nd.edu/~dial/publications/chawla2005data.pdf
- You can also stratify the success breakdown by one of the categorical variables, such as advertiser vertical, to ensure you aren't introducing systematic bias in the training set. 
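To make the imbalance check concrete, here is a minimal sketch in plain pandas. It is not the msb implementation. It rebuilds the counts printed by success_breakdown further down (53 good, 1945 test) and assumes that "test" rows are the non-successes, encoded as target = 0.

```python
import pandas as pd

# Stand-in for df['target']: 53 successes and 1945 non-successes.
# Assumption: rows reported as "test" are the negative class (target == 0).
target = pd.Series([1] * 53 + [0] * 1945, name="target")

counts = target.value_counts()      # rows per class
positive_rate = target.mean()       # fraction of rows labeled as success

print(counts.to_dict())
print(f"positive rate: {positive_rate:.2%}")

# The 10% cutoff is the rule of thumb quoted above, not a hard limit.
is_imbalanced = positive_rate < 0.10
print("imbalanced:", is_imbalanced)
```

With fewer than 3% positives, this dataset falls well under the 10% rule of thumb.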

#### Print the breakdown of positive and negative training samples and save examples of the positive and negative training sets to variables pos, neg:

In [3]:
pos, neg = msb.success_breakdown(df)

good: 53 test: 1945




#### Now check for target label imbalance that is specific to certain features

For example, the breakdown below shows that advertiser 48 has 360 sites and none of them has reached the good list. Since we are looking for signals that hold across different advertisers, we should ask whether this training set carries a systematic bias. Of the 53 successes, 38 belong to just two advertisers (658 with 18 and 1454 with 20), so a model trained on this data may overfit to those advertisers. We need a larger training set.

In [4]:
# we don't need the returned groups here, so assign them to a throwaway variable
_ = msb.success_breakdown(df, by_column='adv_id')

206.0// good: 4 test: 108
1525.0// good: 0 test: 68
1831.0// good: 0 test: 201
658.0// good: 18 test: 241
915.0// good: 0 test: 46
48.0// good: 0 test: 360
1454.0// good: 20 test: 240
65.0// good: 1 test: 108
461.0// good: 0 test: 60
2717.0// good: 0 test: 8
2795.0// good: 0 test: 8
752.0// good: 5 test: 142
1234.0// good: 4 test: 88
1631.0// good: 1 test: 187
834.0// good: 0 test: 66
1906.0// good: 0 test: 14
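A per-group breakdown like the one above can be sketched with a pandas groupby. This is only an illustration on a toy frame, not what msb does internally. The column names match the notebook, and the share_of_good column is a hypothetical extra that shows how concentrated the successes are.

```python
import pandas as pd

# Toy data: three advertisers, one of them with no successes at all.
toy = pd.DataFrame({
    "adv_id": [206, 206, 206, 48, 48, 658, 658, 658],
    "target": [1, 0, 0, 0, 0, 1, 1, 0],
})

# Successes and row counts per advertiser.
breakdown = toy.groupby("adv_id")["target"].agg(good="sum", total="count")
breakdown["success_rate"] = breakdown["good"] / breakdown["total"]

# Share of all successes contributed by each advertiser (hypothetical column).
breakdown["share_of_good"] = breakdown["good"] / breakdown["good"].sum()

# Advertisers that never reached the good list.
no_success = breakdown.index[breakdown["good"] == 0].tolist()

print(breakdown)
print("advertisers without a success:", no_success)
```

A large share_of_good for one or two advertisers is the warning sign discussed above.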




## Handle missing data

**find_nan(df):** shows the number of missing values for each feature.

The output below shows that win_rate_strat and cvr_strat are each missing in 49% of rows, so we may need to impute values or drop those columns.

In [5]:
msb.find_nan(df)

	name has 86 empty rows (4%)
	win_rate_site has 388 empty rows (19%)
	win_rate_strat has 988 empty rows (49%)
	cvr_strat has 988 empty rows (49%)


Complete features: site_id, strategy_id, list_type, line_id, adv_id, adv_vertical, goal, price, limit, avg_bid, max_bid, impressions, conversions, avg_imps_site, stdev_imps_site, cvr, line_cvr, hist_zscore, overlap, target, win_rate_site_table, win_rate_strat_table
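For readers without access to msb, a report similar to the one above can be approximated with isna(). This is a rough sketch on a toy frame and does not reproduce the exact find_nan output format.

```python
import numpy as np
import pandas as pd

# Toy frame: two columns with gaps, one complete column.
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "cvr": [0.1, 0.2, 0.3, 0.4],
    "win_rate_strat": [np.nan, np.nan, 0.5, 0.2],
})

# Count missing values per column and keep only columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]

report = pd.DataFrame({
    "empty_rows": missing,
    "percent": (100 * missing / len(toy)).round(0),
})
complete = [col for col in toy.columns if col not in missing.index]

print(report)
print("Complete features:", ", ".join(complete))
```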



#### Option 1: drop the rows

This isn't ideal, but if you have enough data and the missing values don't introduce bias (that is, they are not concentrated in one class or group), you can simply drop the rows with missing values.

In [6]:
df_dropped = df.dropna(axis = 0, how = 'any', subset = ['win_rate_site'])
msb.find_nan(df_dropped)

	name has 79 empty rows (4%)
	win_rate_strat has 770 empty rows (47%)
	cvr_strat has 770 empty rows (47%)


Complete features: site_id, strategy_id, list_type, line_id, adv_id, adv_vertical, goal, price, limit, avg_bid, max_bid, impressions, conversions, avg_imps_site, stdev_imps_site, win_rate_site, cvr, line_cvr, hist_zscore, overlap, target, win_rate_site_table, win_rate_strat_table
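Before dropping rows, it is worth checking whether missingness is related to the target, since that is one way dropping can introduce bias. The sketch below uses made-up values and compares the success rate of rows with and without win_rate_site. How large a gap is acceptable is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy data: win_rate_site is missing more often for successful rows.
toy = pd.DataFrame({
    "win_rate_site": [0.4, np.nan, 0.3, np.nan, 0.2, 0.1, np.nan, 0.5],
    "target":        [0,   1,      0,   1,      0,   0,   0,      1],
})

# Label each row by whether the feature is missing.
status = toy["win_rate_site"].isna().map({True: "missing", False: "present"})

# Success rate within each group.
rate = toy["target"].groupby(status).mean()
print(rate)

# A large gap suggests that dropping rows would shift the class balance.
gap = abs(rate["missing"] - rate["present"])
print(f"gap in success rate: {gap:.2f}")
```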



#### Option 2: Deterministic imputation

In some cases, we can use deterministic clues to fill in missing data.

For instance, we are missing optimizer names for some strategies.  If we can assume that all strategies for a single advertiser are always handled by the same optimizer, we can use other strategies under that advertiser to infer the correct optimizer name.

In [7]:
df = msb.fill_nan_deterministic(df, fill_column='name', batch_by='adv_id')

Filling name by adv_id: 86 rows have been filled.
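The same idea can be sketched in plain pandas. This is an illustration under the assumption stated above (one optimizer per advertiser), not the body of fill_nan_deterministic. The optimizer name "Sam" and the toy advertiser IDs are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "adv_id": [206, 206, 206, 658, 658, 915],
    "name":   ["Nicole", None, "Nicole", None, "Sam", None],
})

# Check the assumption first: at most one known name per advertiser.
# nunique() ignores missing values.
names_per_adv = toy.groupby("adv_id")["name"].nunique()
assert (names_per_adv <= 1).all(), "an advertiser has more than one optimizer"

# Build an adv_id -> name lookup from the rows where the name is known.
lookup = (
    toy.dropna(subset=["name"])
       .drop_duplicates("adv_id")
       .set_index("adv_id")["name"]
)

# Fill gaps from the lookup; advertisers with no known name stay missing.
before = toy["name"].isna().sum()
toy["name"] = toy["name"].fillna(toy["adv_id"].map(lookup))
filled = before - toy["name"].isna().sum()
print(f"{filled} rows have been filled.")
```

Advertiser 915 has no row with a known name, so its gap is left for one of the other options.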



#### Option 3: Unique-value Imputation

In other cases, we want to assume a specific value for missing data to simply indicate that the value is unknown.  This is often an acceptable method when using neural nets.

Note: it's common to encode missing information as -1, or as another value that the feature can never take.

In [8]:
df[['win_rate_strat']] = df[['win_rate_strat']].fillna(value=-1)
msb.find_nan(df)

	win_rate_site has 388 empty rows (19%)
	cvr_strat has 988 empty rows (49%)


Complete features: site_id, strategy_id, list_type, line_id, adv_id, adv_vertical, name, goal, price, limit, avg_bid, max_bid, impressions, conversions, avg_imps_site, stdev_imps_site, win_rate_strat, cvr, line_cvr, hist_zscore, overlap, target, win_rate_site_table, win_rate_strat_table
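A sentinel only works if it cannot collide with real data, so a quick check before filling is cheap insurance. The sketch below also adds a 0/1 indicator column. That column (win_rate_strat_missing) is a hypothetical addition and not part of the original workflow, but it lets models that treat -1 as an ordinary number still see that the value was unknown.

```python
import numpy as np
import pandas as pd

SENTINEL = -1.0  # impossible for a win rate, which lies in [0, 1]

toy = pd.DataFrame({"win_rate_strat": [0.11, np.nan, 0.25, np.nan]})

# Make sure the sentinel does not already occur among observed values.
observed = toy["win_rate_strat"].dropna()
assert not (observed == SENTINEL).any()
assert observed.min() >= 0

# Hypothetical indicator column: 1 where the value was missing.
toy["win_rate_strat_missing"] = toy["win_rate_strat"].isna().astype(int)

# Replace the gaps with the sentinel.
toy["win_rate_strat"] = toy["win_rate_strat"].fillna(SENTINEL)
print(toy)
```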



#### Option 4: Mean, median, or mode imputation

Often, we have to make an educated guess.

You can impute values according to the mean, median or mode:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html

The Imputer class was removed in scikit-learn 0.22; sklearn.impute.SimpleImputer is its replacement.

Note: only use the mean if the feature's distribution is roughly symmetric (most of your values cluster around the mean)!  In cases of high skew, use the median or mode instead.
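A small worked example shows why skew matters. The values below are made up to mimic a conversion rate with one large outlier. The skewness threshold of 1 is an assumption used only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up conversion rates: mostly tiny values, one large outlier, one gap.
values = pd.Series([0.001, 0.002, 0.001, 0.003, 0.002, 0.150, np.nan])

mean_fill = values.mean()      # pulled upward by the outlier
median_fill = values.median()  # stays close to the typical value
skewness = values.skew()       # sample skewness, NaN is ignored

print(f"mean={mean_fill:.4f} median={median_fill:.4f} skew={skewness:.2f}")

# Assumed rule of thumb: prefer the median when |skew| exceeds 1.
fill_value = median_fill if abs(skewness) > 1 else mean_fill
imputed = values.fillna(fill_value)
```

Here the mean is more than ten times the median, so filling with the mean would give the missing row an atypically high rate.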

In [10]:
from sklearn.impute import SimpleImputer

# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer is its replacement and always works column by column
# build an imputer that takes the mean value of each feature as the imputation value
# and apply it to the columns with missing data
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
df[['cvr_strat']] = imp.fit_transform(df[['cvr_strat']])

# check again for any NaN values
msb.find_nan(df)

	win_rate_site has 388 empty rows (19%)


Complete features: site_id, strategy_id, list_type, line_id, adv_id, adv_vertical, name, goal, price, limit, avg_bid, max_bid, impressions, conversions, avg_imps_site, stdev_imps_site, win_rate_strat, cvr_strat, cvr, line_cvr, hist_zscore, overlap, target, win_rate_site_table, win_rate_strat_table



Another option is to use the most common value:

In [12]:
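# assign the result back; calling fillna(inplace=True) on a selected column may not update df in recent pandas versions
df['win_rate_site'] = df['win_rate_site'].fillna(df['win_rate_site'].value_counts().idxmax())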
msb.find_nan(df)



Complete features: site_id, strategy_id, list_type, line_id, adv_id, adv_vertical, name, goal, price, limit, avg_bid, max_bid, impressions, conversions, avg_imps_site, stdev_imps_site, win_rate_site, win_rate_strat, cvr_strat, cvr, line_cvr, hist_zscore, overlap, target, win_rate_site_table, win_rate_strat_table



#### Option 5: kNN imputation

Fill values based on similar records.

This method is computationally expensive, but it often yields better results than the simpler options above.  It looks at the k records with the most similar features and target labels, and then derives a reasonable value for the missing feature from them.

The following package attempts to do this, but is incomplete.  You will have to write your own algorithm:
https://pypi.python.org/pypi/fancyimpute/
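As a starting point for writing your own, here is a minimal NumPy sketch of kNN imputation. It is one simple variant, not the fancyimpute algorithm: distances are root-mean-square differences over the columns both records have observed, and the fill value is the plain mean of the k nearest donors. The function name knn_impute is hypothetical, and the features are assumed to be on comparable scales already.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that column over the k nearest rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        dists = []
        for r in range(len(X)):
            # A donor must have an observed value in column j.
            if np.isnan(X[r, j]):
                continue
            # Compare only on columns observed in both rows.
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if shared.any():
                diff = X[i, shared] - X[r, shared]
                dists.append((float(np.sqrt(np.mean(diff ** 2))), r))
        if dists:
            nearest = [r for _, r in sorted(dists)[:k]]
            filled[i, j] = X[nearest, j].mean()
    return filled

# Toy records: the last row is close to the first two and far from the third.
X = np.array([
    [1.00, 2.00, 10.0],
    [1.10, 2.10, 12.0],
    [5.00, 6.00, 50.0],
    [1.05, 2.05, np.nan],
])
filled = knn_impute(X, k=2)
print(filled[3, 2])
```

The nested loops make this quadratic in the number of rows, which is where the computational cost comes from. scikit-learn 0.22 and later also ship sklearn.impute.KNNImputer, which may be worth trying before maintaining a custom version.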

#### Option 6: De-noising autoencoder

Finally, you can use an autoencoder: http://stackoverflow.com/questions/32407621/impute-multiple-missing-values-in-a-feature-vector

De-noising autoencoders for neural nets: http://deeplearning.net/tutorial/dA.html#autoencoders

#### Now save the cleaned data as a .csv

In [10]:
df.to_csv('../data/sample_data_cleaned.csv', header = True, index = True, sep = ',')

#### For a beginner's guide to imputation methods:
http://www.jmlr.org/papers/volume8/saar-tsechansky07a/saar-tsechansky07a.pdf