# Santander Value Prediction

The goal of this project is to predict the value of a financial transaction to a customer of Santander Group, which would help them better personalize financial services to new customers. This project is derived from the Kaggle Competition "Santander Value Prediction Challenge."


## Data Exploration

The data given by Santander consists of a labeled training set and an unlabeled test set. (to be used for leaderboard scoring) In this analysis, we will use the only the training set so we can assess performance of the prediction algorithm.

The competition states that "You are provided with an anonymized dataset containing numeric feature variables, the numeric target column, and a string ID column. The task is to predict the value of target column in the test set."

Right away we know that the features are anonymized and thus we can't apply domain knowledge to this problem. We also know this is a prediction and not a classification problem.  So we'll be looking at regression-type techniques.

The first step is to explore the data.

In [254]:
# Import modules

import pandas as pd
import numpy as np
import os
import sys
import random
import copy
import cPickle as pickle

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
import colorlover as cl
import matplotlib.pyplot as plt 

src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

from sklearn.model_selection import ShuffleSplit
from scipy.stats import spearmanr
from scipy import stats
from sklearn import decomposition
from sklearn.preprocessing import StandardScaler
from scipy.stats import boxcox
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

import plotting_methods as pm

init_notebook_mode(connected=True)

%reload_ext autoreload
%autoreload 2

pd.options.display.float_format = '{:,.4f}'.format

In [2]:
# Load data

raw_data_dir = 'C:\Users\Colleen\Documents\Kaggle_Santander_Value_Pred\data'

f = open(os.path.join(raw_data_dir, 'train.csv'), 'r')
train_data = pd.read_csv(f)
f.close()

id_col = 'ID'
tar_col = 'target'

First we get some basic stats on the data.  Already we can see we have more features than samples so we'll need some dimension reduction, feature selection or other techniques to avoid overfitting and the curse of dimensionality.

In [6]:
# Get data stats:

num_samps = train_data.shape[0]
feat_names = [x for x in train_data.columns if x not in [id_col, tar_col]]
num_feats = len(feat_names)

print 'Num samples: ' + str(num_samps)
print 'Num features: ' + str(num_feats)

Num samples: 4459
Num features: 4991


Next we look at the features types and names.  The pandas.read_csv function automatically detects the types of the features, and we can see a mix of float and integer types.  We can see the 'target' variable, which is what we're trying to predict and that its a float.  The integer type features may be categorial features, so we likely have a mix of continuous and categorical variables.

In [7]:
train_data.dtypes

ID            object
target       float64
48df886f9    float64
0deb4b6a8      int64
34b15f335    float64
a8cb14b00      int64
2f0771a37      int64
30347e683      int64
d08d1fbe3      int64
6ee66e115      int64
20aa07010    float64
dc5a8f1d8    float64
11d86fa6a    float64
77c9823f2      int64
8d6c2a0b2      int64
4681de4fd      int64
adf119b9a      int64
cff75dd09    float64
96f83a237      int64
b8a716ebf    float64
6c7a4567c      int64
4fcfd2b4d      int64
f3b9c0b95    float64
71cebf11c      int64
d966ac62c      int64
68b647452    float64
c88d108c9      int64
ff7b471cd      int64
d5308d8bc      int64
0d866c3d7    float64
              ...   
cdfc2b069    float64
2a879b4f7    float64
6b119d8ce    float64
98dea9e42      int64
9f2471031      int64
88458cb21      int64
f40da20f4      int64
7ad6b38bd    float64
c901e7df1      int64
8f55955dc      int64
85dcc913d    float64
5ca0b9b0c      int64
eab8abf7a      int64
8d8bffbae    float64
2a1f6c7f9      int64
9437d8b64      int64
5831f4c76    

To explore further, we calculate the %unique values for each feature and plot them in a bar graph to better visualize the large number of features. These plots show that many of these features could be considered categorical or ordinal, since there are few unique values compared to the total number of values.  

In [228]:
perc_unique = train_data.loc[:, feat_names].apply(lambda x: float(100.0 * len(np.unique(x))) / float(len(x)))
iplot([go.Bar(
    x = train_data.columns,
    y = list(perc_unique))])

The target variable itself has the following percent unique values, which is higher than the majority of the features can still be considered categorical

In [10]:
perc_unique.loc[tar_col]

31.688719443821483

This could also indicate sparse data so we calculate the % of values equal to 0, which is extremely high.  Thus, we are indeed dealing with very sparse data.

In [24]:
sum((train_data.loc[:, feat_names] == 0).sum()) / float(num_feats * num_samps)

0.9685413111171313

We now plot the % number of unique values with the zeros removed. (shown below) This plot shows some variables with all unique values, and others around 60%. So these variables are not categorical.  Looking at the actual values showed that many of them are ordinal.

In [230]:
perc_unique_nonzero = train_data.loc[:, feat_names].apply(lambda x: 100 * float(len(np.unique(x[x!=0]))) / len(x[x!=0])  if len(x[x!=0])>0 else 0)
iplot([go.Bar(
    x = feat_names,
    y = list(perc_unique_nonzero[perc_unique_nonzero != 0]))])

Before digging deeper into the features, we'll clean the data. 

First we remove features with no variation:

In [28]:
train_data = train_data.drop(columns = train_data.columns[np.where(train_data.std() == 0.0)[0]])
new_feat_names = [x for x in train_data.columns if x not in [id_col, tar_col]]

In [29]:
print 'num dropped features: ' + str(len(feat_names) - len(new_feat_names))

num dropped features: 256


In [41]:
feat_names = new_feat_names

Next we look for missing values in samples, and we can see there aren't any

In [30]:
any(train_data.isnull().sum() > 0)

False

## Data Splitting

Also before feature analysis, we need to split the data into a training and test set.  The test set will be used to test the models on data that it hasn't seen.  Here we'll use a 80-20 random shuffle split.

In [31]:
split_seed = 4
train_prop = 0.8

inds = list(train_data.index)

ss = ShuffleSplit(n_splits=1, train_size=train_prop, random_state=split_seed)
split_inds = [(train_index, test_index) for train_index, test_index in ss.split(train_data)]

train_inds = split_inds[0][0]
test_inds = split_inds[0][1]


From version 0.21, test_size will always complement train_size unless both are specified.



In [368]:
train_feats = train_data.loc[train_inds, :]
train_feats.index = range(train_feats.shape[0])

## Feature Analysis

Here we visualize the features of the training data to get a sense of the data and how it could predict the target variable.

For now, we'll remove the zeros from each variable. We'll handle them later.  First we want to get a sense of the distribution of each feature without the zeros.  Below is a boxplot of the first 50 features which zeros removed and only those with more than 50 non-zero values.

In [127]:
nonzero = dict([(x, train_feats[x].loc[train_feats[x] != 0]) \
                 for x in feat_names + [tar_col] if len(np.where(train_feats[x] != 0)[0]) >= 50])

In [138]:
# Keep a version of the target var for each feature, which matching zeros removed
target_nonzero = dict([(x, train_feats[tar_col].loc[train_feats[x] != 0]) \
                 for x in feat_names + [tar_col] if len(np.where(train_feats[x] != 0)[0]) >= 50])

In [128]:
scaled_nonzero = dict([(x, StandardScaler().fit_transform(np.reshape([nonzero[x]], (len(nonzero[x]), 1))).flatten()) \
                       for x in nonzero.keys()])

In [129]:
#iplot([go.Box(y = train_feats[x].loc[train_feats[x] != 0], name = x) for x in feat_names[0:50]])
iplot([go.Box(y = scaled_nonzero[x], name = x) for x in scaled_nonzero.keys()[0:50] + [tar_col]])

This plot shows the feature values are very skewed, having all long tails. The target variable at the end is also skewed.  A transformation may be helpful here.  The following is the same plot with log transformations:

In [157]:
#trans_target = dict([(x, StandardScaler().fit_transform(np.reshape(boxcox((target_nonzero[x] / target_nonzero[x].std()) + 1)[0],
#                                                                    (len(target_nonzero[x]), 1))).flatten())\
#                      for x in target_nonzero.keys()])
trans_target = dict([(x, StandardScaler().fit_transform(np.reshape(list(np.log(target_nonzero[x] + 1)),
                                                                    (len(target_nonzero[x]), 1))).flatten())\
                      for x in target_nonzero.keys()])

In [158]:
#trans_nonzero = dict([(x, StandardScaler().fit_transform(np.reshape(np.log((nonzero[x] / nonzero[x].std()) + 1),
#                                                                    (len(nonzero[x]), 1))).flatten())\
#                      for x in nonzero.keys()])

trans_nonzero = dict([(x, StandardScaler().fit_transform(np.reshape(list(np.log(nonzero[x] + 1)),
                                                                    (len(nonzero[x]), 1))).flatten())\
                      for x in nonzero.keys()])

In [159]:

iplot([go.Box(y = StandardScaler().fit_transform(np.reshape(trans_nonzero[x], (len(trans_nonzero[x]), 1))).flatten(), 
              name = x) for x in trans_nonzero.keys()[0:50] + [tar_col]])

The features look significantly less skewed, especially the target variable.

Now we'll look at the correlations between these transformed features and the target variable, plotted in a histogram, using spearman correlation since some of the features still appear a bit skewed.

In [329]:
corrs = pd.Series(dict([(x,spearmanr(trans_nonzero[x], trans_target[x])[0]) for x in trans_nonzero.keys() if x != tar_col]))
iplot([go.Histogram(x = list(corrs))])

There aren't many extremely high correlations, but there still appears to be some relation to the target.

We now plot the top 20 correlated features vs the target.

In [330]:
top_feats = corrs.abs().sort_values(ascending = False).iloc[0:20].index

In [342]:
def plot_var_linear_fit(xvals, yvals, title, fit_line = True, xr = None):
    
    slope, intercept, r_value, p_value, std_err = stats.linregress(xvals, yvals)
    line = slope*xvals + intercept
    
    data = [go.Scatter(x = xvals, y = yvals, mode = 'markers')]
    if fit_line:
        data.append(go.Scatter(x = xvals, y = line, mode = 'lines', marker = dict(color = 'black')))
    
    if xr != None:
        return go.Figure(data = data,
                          layout = go.Layout(title = title, xaxis = dict(range = xr)))    
    else:
        return go.Figure(data = data,
                          layout = go.Layout(title = title))

In [331]:
titles = dict([(f, "%.2f" % corrs[f]) for f in top_feats])
fig = pm.subplot_helper_fig(5, 4, [plot_var_linear_fit(trans_nonzero[f], trans_target[f], titles[f]) for f in top_feats])
fig['layout'].update(height = 1000)
iplot(fig)

There are definitely some relationships here, also we can some features have a cluster of points with the same value. This might be something to watch for since it could affect modeling.

## Feature Selection

Even though some algorithms could handle the large number of features here (decision trees, lasso/ridge regression), its still good to do some feature selection to reduce processing time.

We could remove features with low correlations with the target, but these features may have non-linear relattionships with the target.  So we'll use PCA to reduce the dimension and also to remove features correlated with eachother.

We'll also use the full sparse data set now. StandardScaler can handle sparse matrices as can PCA

In [369]:
# Log transform and scale the dataset, converting to sparse.

sc = StandardScaler(with_mean = False)

trans_data = (train_feats.loc[:, feat_names + [tar_col]] + 1).apply(np.log).to_sparse()
scale_train_data = pd.DataFrame(sc.fit_transform(trans_data.loc[:, feat_names]), columns = feat_names).to_sparse()

In [181]:
pca = decomposition.PCA(svd_solver = 'randomized')
res = pca.fit_transform(scale_train_data)

ex_var = pca.explained_variance_ratio_.cumsum()
num_comp_keep = np.where(np.array(ex_var <= 0.95))[0][-1]

pca_train_data = pd.DataFrame(res).iloc[:, range(num_comp_keep)]

In [182]:
print num_comp_keep

1513


Using a 95% explained variance cutoff, we managed to cut down the 5000 or so features to about 1500 which is definitely more manageable.

In [183]:
# Apply processing to test set

test_feats = train_data.loc[test_inds, :]
trans_test_data = (test_feats.loc[:, feat_names + [tar_col]] + 1).apply(np.log).to_sparse()
scale_test_data = pd.DataFrame(sc.transform(trans_test_data.loc[:, feat_names]), columns = feat_names).to_sparse()
pca_test_data = pd.DataFrame(pca.transform(scale_test_data)).iloc[:, range(num_comp_keep)]

## Basic Model - Lasso Regression

Now we'll try a basic model on the data to get a basline.  We'll try a lasso regression first, since lasso gives spare coefficients which is helpful with our sparse data. We'll also do a cross-validation with gridsearch to get the optimal parameters for the model.

In [246]:
pca_train_data.shape, trans_data.loc[:,[tar_col]].shape 

((3567, 1513), (3567, 1))

In [184]:
from sklearn.linear_model import LassoCV

alphas = np.logspace(-4, -1, 10)
lcv = LassoCV()

lcv.fit(pca_train_data, trans_data[tar_col])

LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)

The following is the optimal alpha with the root mse.  

In [191]:
print lcv.alpha_, min(np.sqrt(lcv.mse_path_).flatten())

0.114081811911 1.66769637158


This root mse appears high, given that the 95% percentile of the pca transformed values are:

In [203]:
np.percentile(pca_train_data.values.flatten(), 95)

1.9711477064205576

Still we'll fit a model with this alpha on the training data and predict on the test set.  The following are the results:

In [347]:
print 'test error ' + str(np.sqrt(mean_squared_error(trans_test_data[tar_col], pred_vals)))
print 'train error ' + str(np.sqrt(mean_squared_error(trans_data[tar_col], lm.predict(pca_train_data))) )

test error 1.68302778146
train error 1.65446280143


Both the training and test error are high so there is some bias here. The figure out why, we plotted the standardized residuals:

In [344]:
from sklearn.linear_model import Lasso

lm = Lasso(alpha = lcv.alpha_)
lm.fit(pca_train_data, trans_data[tar_col])
pred_vals = lm.predict(pca_test_data)

In [218]:
resid = np.array(trans_test_data[tar_col]) - pred_vals
resid = resid / np.std(resid)

iplot([go.Scatter(x = pred_vals, y = resid, mode = 'markers')])

We can see the effect of the 0s and the cluster of same-values here - the residuals are unbalanced.  This is likely causing the poor fit.

Given this, we should probably chose a model that can better handle this sparse data.

# Random Forest Regressor

Decision trees can better deal with this mix of categorical and continuous data. And by using an ensemble of trees, we can reduce overfitting.

Since random forests don't require too much hyperparameter tuning, and that tuning would be computational intensive, we'll try fitting a test forest and look at the results.

In [257]:
#rf = RandomForestRegressor(n_estimators = 500, max_features = 'auto', min_samples_leaf = 5, verbose = 2)
#rf.fit(pca_train_data, trans_data[tar_col])

f = open('C:\\Users\\Colleen\\Documents\\Kaggle_Santander_Value_Pred\\data\\rf.p', 'r')
rf = pickle.load(f)
f.close()

pred_vals = rf.predict(pca_test_data)

[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 349 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    0.0s finished


In [258]:
np.sqrt(mean_squared_error(trans_test_data[tar_col], pred_vals))

1.6067609685558453

This is not better than the lasso regression. Its likely that this is overfitted since we have 1500 variables but only used 500 trees.  Looking at the training error, we get:

In [275]:
np.sqrt(mean_squared_error(trans_data[tar_col], rf.predict(pca_train_data))) 

[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 349 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    0.1s finished


0.72375369252903754

The testing error is double the training error so there is definitely some overfitting. Given that increasing the tree size will greatly increase computation time, we'll see if we can reduce the number of features further.

We can use the results from the random forest to get feature importances and plot them:

In [272]:
feat_imp = rf.feature_importances_
iplot([go.Bar(
    x = range(len(feat_imp)),
    y = feat_imp[np.argsort(feat_imp)[::-1]])])

This shows that the vast majority of features were not important.

Rerunning the model with only the top 70 important features gives the following results:

In [287]:
sel_feats = pca_train_data.columns[feat_imp > 0.001]

In [291]:

f = open('C:\\Users\\Colleen\\Documents\\Kaggle_Santander_Value_Pred\\data\\rf_sel_feats.p', 'r')
rf = pickle.load(f)
f.close()

pred_vals = rf.predict(pca_test_data.loc[:, sel_feats])
print 'test error ' + str(np.sqrt(mean_squared_error(trans_test_data[tar_col], pred_vals)))
print 'train error ' + str(np.sqrt(mean_squared_error(trans_data[tar_col], rf.predict(pca_train_data.loc[:, sel_feats]))) )

[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 349 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    0.0s finished


test error 1.54814350453


[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 349 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    0.1s finished


train error 0.801003338635


There is still some overfitting and some bias.  We need to look further into these features.

Next we plot the each of the top 10 and bottom 10 features against the target variable.  Here we see the top features have a larger range of values despite no visible correlation with the target.  However, these values might also contain a lot of noise which may be causing the overfitting.  I'm hesitant to remove these 'noisy' values since that would likely remove all signal from the data.

Thus it appears that some feature engineering/selection is required here to produce better features and handle the large number of zeros.

In [324]:
sort_feat_imp_arg = pca_train_data.columns[np.argsort(feat_imp)[::-1]]
sort_feat_imp_val = feat_imp[np.argsort(feat_imp)[::-1]]

top_10 = range(10)
bot_10 = range(len(feat_imp)-1, len(feat_imp)-1-10, -1)[::-1]
top_bot_10 = top_10 + bot_10

In [343]:
titles = dict([(sort_feat_imp_arg[x],  "%.2f" % sort_feat_imp_val[x]) for x in top_bot_10])
rand_samp = random.sample(range(pca_train_data.shape[0]), 500)
fig = pm.subplot_helper_fig(4, 5, [plot_var_linear_fit(
    pca_train_data[sort_feat_imp_arg[x]].iloc[rand_samp], 
    trans_data[tar_col].iloc[rand_samp], titles[sort_feat_imp_arg[x]], False, xr = [-50, 50]) for x in top_bot_10])
fig['layout'].update(height = 1000)
iplot(fig)

## Feature Engineering

Now we'll take a closer look at the patterns in the data to try to engineer new features or combine existing ones.

First we plot a heatmap of a random sample of rows and columns.

In [495]:
rand_feats = random.sample(feat_names, 100)
iplot([go.Heatmap(z = np.array(train_feats.loc[
    random.sample(range(train_feats.shape[0]), 100), 
    random.sample(feat_names, 100),
    ]), zmin = 0, zmax = 10*10**6)])

There may be a pattern to where the non-zero elements are in the features vs. samples. There are some features that have non-zero, medium values for nearly every sample. Where as there are some spikes in values scattered around the features. There are definitely features with a great deal more information than others.  


Its possible that the ordering of features matter and that these samples represent time series.  

Below we ploted a sample of rows as time series.  The first plot contains rows with target values below 5K while the second contains rows have targets above 20M. In general the lowered valued rows contain more non-zero points than the higher valued rows. Thus, we may find some pattersn by engineering time-series based features.

In [485]:
cols = ['f190486d6', '58e2e02e6', 'eeb9cd3aa', '9fd594eec', '6eef030c1', '15ace8c9f',
        'fb0f5dbfe', '58e056e12', '20aa07010', '024c577b9', 'd6bb78916', 'b43a7cfd5',
        '58232a6fb', '1702b5bf0', '324921c7b', '62e59a501', '2ec5b290f', '241f0f867',
        'fb49e4212', '66ace2992', 'f74e8f13d', '5c6487af1', '963a49cdc', '26fc93eb7',
        '1931ccfdd', '703885424', '70feb1494', '491b9ee45', '23310aa6f', 'e176a204a', '6619d81fc', '1db387535']

n = train_data.loc[:, feat_names].apply(lambda x: len(x[x != 0]) / float(train_data.shape[0]))

iplot([go.Heatmap(z = np.array(train_feats.loc[
    random.sample(range(train_data.shape[0]), 100), n[n > 0.2].index]),
                 zmin = 0, zmax = 10*10**6)])

In [498]:
low_tar = train_data.loc[train_data[tar_col] < 1*10**6,:].index
high_tar = train_data.loc[train_data[tar_col] > 10*10**6,:].index

iplot([go.Heatmap(z = np.array(train_feats.loc[
    low_tar[0:100], rand_feats]),#n[n > 0.2].index]),
                 zmin = 0, zmax = 50*10**6)])

In [499]:
iplot([go.Heatmap(z = np.array(train_feats.loc[
    high_tar[0:100], rand_feats]),#n[n > 0.2].index]),
                 zmin = 0, zmax = 50*10**6)])

In [503]:
inds = random.sample(train_feats.loc[train_feats[tar_col] < 5*10**5,:].index, 10)

figs = [go.Figure(data = [go.Scatter(x = range(len(feat_names)), y = train_feats.loc[x,n[n > 0.2].index])],
                  layout = go.Layout(yaxis = dict(range = [0, 20*10**6])))
        for x in random.sample(train_feats.index, 10)]

fig = pm.subplot_helper_fig(10, 1, figs)
fig['layout'].update(height = 1000)
iplot(fig)

In [504]:
inds = random.sample(train_feats.loc[train_feats[tar_col] > 20*10**6,:].index, 10)

figs = [go.Figure(data = [go.Scatter(x = range(len(feat_names)), y = train_feats.loc[x,n[n > 0.2].index])],
                  layout = go.Layout(yaxis = dict(range = [0, 20*10**6])))
        for x in random.sample(train_feats.index, 10)]

fig = pm.subplot_helper_fig(10, 1, figs)
fig['layout'].update(height = 1000)
iplot(fig)

## We'll look at the following features:

1. Number of non-zero values
1. Average of non-zero values
3. Std of non-zero values
4. Min of non-zero values
5. Max of non-zero values
6. Average rate of non-zero values
7. Ave num zeros between non zero points
8. Std num zeros between non zero points
9. Max num zeros between non zero points

In [513]:
df = trans_data.loc[:, n[n > 0.2].index]

In [446]:
def get_num_zeros_bet(data):    
    return [len(x) for x in ''.join(map(str, map(int, data != 0))).split('1')]

In [514]:

new_feats = pd.DataFrame({'num_nonzero': (df == 0).sum(1),
                          'sum_nonzero': df.apply(lambda x: sum(x[x != 0]), 1),
                       'ave_nonzero': df.apply(lambda x: np.mean(x[x != 0]), 1),
                        'std_nonzero': df.apply(lambda x: np.std(x[x != 0]), 1),
                        'min_nonzero': df.apply(lambda x: np.percentile(x[x != 0], 5) if len(x[x != 0]) > 0 else 0, 1),
                        'max_nonzero': df.apply(lambda x: np.percentile(x[x != 0], 95) if len(x[x != 0]) > 0 else 0, 1),
                        'ave_bet_zero': df.apply(lambda x: np.mean(get_num_zeros_bet(x)), 1),
                        'std_bet_zero': df.apply(lambda x: np.std(get_num_zeros_bet(x)), 1),
                        'max_bet_zero': df.apply(lambda x: np.max(get_num_zeros_bet(x)), 1)})

#new_feats.columns = ['num_nonzero', 'ave_nonzero', 'std_nonzero', 'min_nonzero', 'max_nonzero',
#                    'ave_bet_zero', 'std_bet_zero', 'max_bet_zero']



In [531]:
fig = pm.subplot_helper_fig(3, 3, [go.Figure(data = [go.Scatter(x = new_feats[f].loc[samp], 
                                        y = trans_data[tar_col].loc[samp], mode = 'markers')],
                                             layout = go.Layout(title = f)) for f in new_feats.columns])
iplot(fig)

In [520]:
print len(new_feats[f].loc[samp]), len(train_feats[tar_col].loc[samp])

500 500


In [528]:
#sc = StandardScaler()
#new_scaled = pd.DataFrame(sc.fit_transform(new_feats), columns = new_feats.columns)
#new_scaled.head()

samp = random.sample(new_feats.index, 500)
fig = pm.subplot_helper_fig(3, 3, [plot_var_linear_fit(new_feats[f].loc[samp], trans_data[tar_col].loc[samp], f, False) for f in new_feats.columns])
fig['layout'].update(height = 1000)
iplot(fig)

ValueError: all the input array dimensions except for the concatenation axis must match exactly

In [361]:
iplot([go.Histogram(x = train_feats['target'])])

In [375]:
def get_group(x):
    if x < 5.0*10**6:
        return 1
    elif (x > 5.0*10**6) & (x < 10.5*10**6):
        return 2
    elif (x > 10.50*10**6) & (x < 20.5*10**6):
        return 3
    else:
        return 4
target_grp = [get_group(x) for x in train_feats['target']]

In [383]:
(train_feats.loc[:, feat_names] != 0).sum(1).groupby(target_grp )

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000000000EEBA780>

In [414]:
np.array(train_feats.loc[inds[x],feat_names]) + 1000

array([1000.0, 1000, 1000.0, ..., 1000, 1000, 1000], dtype=object)

In [406]:
train_feats.loc[x,new_feats] + random.randint(1, 1000)

ValueError: Cannot index with multidimensional key

In [509]:
f = open('C:\Users\Colleen\Documents\Kaggle_Santander_Value_Pred\sel_feats.p', 'w')
pickle.dump(n[n > 0.2].index, f)
f.close()

In [238]:
f = open('C:\Users\Colleen\Documents\Kaggle_Santander_Value_Pred\pca_train_data.csv', 'w')
pca_train_data.to_csv(f)
f.close()

In [523]:
f = open('C:\Users\Colleen\Documents\Kaggle_Santander_Value_Pred\scale_train_data.csv', 'w')
scale_train_data.loc[:, n[n > 0.2].index].to_csv(f)
f.close()

In [247]:
f = open('C:\\Users\\Colleen\\Documents\\Kaggle_Santander_Value_Pred\\data\\pca_y_train_data.csv', 'w')
trans_data.loc[:, [tar_col]].copy().to_csv(f)
f.close()

In [251]:
print pca_train_data.shape

(3567, 1513)


In [470]:
n = train_data.loc[:, feat_names].apply(lambda x: len(x[x != 0]) / float(train_data.shape[0]))
n[n > 0.2]

20aa07010   0.3351
963a49cdc   0.3299
935ca66a9   0.2003
861076e21   0.2063
0572565c2   0.3469
66ace2992   0.3411
fb49e4212   0.3324
6619d81fc   0.3438
6eef030c1   0.3263
fc99f9426   0.3404
1db387535   0.3420
b43a7cfd5   0.3225
024c577b9   0.3160
2ec5b290f   0.3312
0ff32eb98   0.3510
166008929   0.2079
58e056e12   0.3281
241f0f867   0.3277
1931ccfdd   0.3384
f02ecb19c   0.2063
58e2e02e6   0.3393
9fd594eec   0.3409
fb0f5dbfe   0.3353
91f701ba2   0.3499
ca2b906e8   0.2021
703885424   0.3393
f97d9431e   0.2025
eeb9cd3aa   0.3391
324921c7b   0.3373
58232a6fb   0.3330
491b9ee45   0.3416
d6bb78916   0.3223
70feb1494   0.3436
adb64ff71   0.3469
11e12dbe8   0.2081
9de83dc23   0.2023
62e59a501   0.3312
15ace8c9f   0.3330
5c6487af1   0.3451
f190486d6   0.3463
f74e8f13d   0.3351
77deffdf0   0.2000
c5a231d81   0.3510
e176a204a   0.3442
1702b5bf0   0.3364
a09a238d0   0.2003
190db8488   0.3427
c47340d97   0.3487
23310aa6f   0.3436
dtype: float64

In [471]:
n.loc[['f190486d6', '58e2e02e6', 'eeb9cd3aa', '9fd594eec', '6eef030c1', '15ace8c9f',
        'fb0f5dbfe', '58e056e12', '20aa07010', '024c577b9', 'd6bb78916', 'b43a7cfd5',
        '58232a6fb', '1702b5bf0', '324921c7b', '62e59a501', '2ec5b290f', '241f0f867',
        'fb49e4212', '66ace2992', 'f74e8f13d', '5c6487af1', '963a49cdc', '26fc93eb7',
        '1931ccfdd', '703885424', '70feb1494', '491b9ee45', '23310aa6f', 'e176a204a', '6619d81fc', '1db387535']]



Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike



f190486d6   0.3463
58e2e02e6   0.3393
eeb9cd3aa   0.3391
9fd594eec   0.3409
6eef030c1   0.3263
15ace8c9f   0.3330
fb0f5dbfe   0.3353
58e056e12   0.3281
20aa07010   0.3351
024c577b9   0.3160
d6bb78916   0.3223
b43a7cfd5   0.3225
58232a6fb   0.3330
1702b5bf0   0.3364
324921c7b   0.3373
62e59a501   0.3312
2ec5b290f   0.3312
241f0f867   0.3277
fb49e4212   0.3324
66ace2992   0.3411
f74e8f13d   0.3351
5c6487af1   0.3451
963a49cdc   0.3299
26fc93eb7      nan
1931ccfdd   0.3384
703885424   0.3393
70feb1494   0.3436
491b9ee45   0.3416
23310aa6f   0.3436
e176a204a   0.3442
6619d81fc   0.3438
1db387535   0.3420
dtype: float64

In [524]:
f = open('C:\\Users\\Colleen\\Documents\\Kaggle_Santander_Value_Pred\\data\\rf_sel_feats.p', 'r')
rf = pickle.load(f)
f.close()

pred_vals = rf.predict(scale_test_data.loc[:, n[n > 0.2].index])
print 'test error ' + str(np.sqrt(mean_squared_error(trans_test_data[tar_col], pred_vals)))
print 'train error ' + str(np.sqrt(mean_squared_error(trans_data[tar_col], rf.predict(scale_train_data.loc[:, n[n > 0.2].index]))) )

[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 349 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    0.0s finished


test error 1.44344342728


[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 349 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 500 out of 500 | elapsed:    0.1s finished


train error 1.04790773341


In [537]:
df = train_feats.loc[:, n[n > 0.2].index]
new_feats = pd.DataFrame({'num_nonzero': (df == 0).sum(1),
                          'sum_nonzero': df.apply(lambda x: sum(x[x != 0]) if len(x[x != 0]) > 0 else 0, 1),
                       'ave_nonzero': df.apply(lambda x: np.mean(x[x != 0]) if len(x[x != 0]) > 0 else 0, 1),
                        'std_nonzero': df.apply(lambda x: np.std(x[x != 0]) if len(x[x != 0]) > 0 else 0, 1),
                        'min_nonzero': df.apply(lambda x: np.percentile(x[x != 0], 5) if len(x[x != 0]) > 0 else 0, 1),
                        'max_nonzero': df.apply(lambda x: np.percentile(x[x != 0], 95) if len(x[x != 0]) > 0 else 0, 1),
                        'ave_bet_zero': df.apply(lambda x: np.mean(get_num_zeros_bet(x)), 1),
                        'std_bet_zero': df.apply(lambda x: np.std(get_num_zeros_bet(x)), 1),
                        'max_bet_zero': df.apply(lambda x: np.max(get_num_zeros_bet(x)), 1)})

new_train_feats = pd.concat([df, new_feats], 1)
new_train_feats[tar_col] = train_feats[tar_col]

In [541]:
df = test_feats.loc[:, n[n > 0.2].index]
new_feats = pd.DataFrame({'num_nonzero': (df == 0).sum(1),
                          'sum_nonzero': df.apply(lambda x: sum(x[x != 0]) if len(x[x != 0]) > 0 else 0, 1),
                       'ave_nonzero': df.apply(lambda x: np.mean(x[x != 0]) if len(x[x != 0]) > 0 else 0, 1),
                        'std_nonzero': df.apply(lambda x: np.std(x[x != 0]) if len(x[x != 0]) > 0 else 0, 1),
                        'min_nonzero': df.apply(lambda x: np.percentile(x[x != 0], 5) if len(x[x != 0]) > 0 else 0, 1),
                        'max_nonzero': df.apply(lambda x: np.percentile(x[x != 0], 95) if len(x[x != 0]) > 0 else 0, 1),
                        'ave_bet_zero': df.apply(lambda x: np.mean(get_num_zeros_bet(x)), 1),
                        'std_bet_zero': df.apply(lambda x: np.std(get_num_zeros_bet(x)), 1),
                        'max_bet_zero': df.apply(lambda x: np.max(get_num_zeros_bet(x)), 1)})

new_test_feats = pd.concat([df, new_feats], 1)
new_test_feats[tar_col] = test_feats[tar_col]

In [538]:
sc = StandardScaler(with_mean = False)
new_feat_names = [x for x in new_train_feats.columns if x != tar_col]
new_trans_data = (new_train_feats.loc[:, new_feat_names + [tar_col]] + 1).apply(np.log).to_sparse()
new_scale_train_data = pd.DataFrame(sc.fit_transform(new_trans_data.loc[:, new_feat_names]), columns = new_feat_names).to_sparse()

In [542]:
new_trans_test_data = (new_test_feats.loc[:, new_feat_names + [tar_col]] + 1).apply(np.log).to_sparse()
new_scale_test_data = pd.DataFrame(sc.transform(new_trans_test_data.loc[:, new_feat_names]), columns = new_feat_names).to_sparse()

In [540]:
f = open('C:\\Users\\Colleen\\Documents\\Kaggle_Santander_Value_Pred\\new_scale_train_data.csv', 'w')
new_scale_train_data.to_csv(f)
f.close()

In [547]:
len(new_scale_test_data.columns)

58

In [548]:
len(new_scale_train_data.columns)

58

In [557]:
f = open('C:\\Users\\Colleen\\Documents\\Kaggle_Santander_Value_Pred\\data\\rf_sel_feats.p', 'r')
rf = pickle.load(f)
f.close()

sel_feats = list(n[n > 0.2].index) + ['min_nonzero', 'max_nonzero']

pred_vals = rf.predict(new_scale_test_data.loc[:, sel_feats])
print 'test error ' + str(np.sqrt(mean_squared_error(new_trans_test_data[tar_col], pred_vals)))
print 'train error ' + str(np.sqrt(mean_squared_error(new_trans_data[tar_col], rf.predict(new_scale_train_data.loc[:, sel_feats]))) )

[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 349 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 632 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    0.2s finished
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed:    0.0s


test error 1.40393174107


[Parallel(n_jobs=8)]: Done 349 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 632 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    0.2s finished


train error 0.983346350221


In [555]:
f = open('C:\Users\Colleen\Documents\Kaggle_Santander_Value_Pred\sel_feats.p', 'w')
pickle.dump(list(n[n > 0.2].index) + ['min_nonzero', 'max_nonzero'], f)
f.close()