# Don't Overfit! II

## Distribution of points in 300D space

I was intrigued by the ellipses in this kernel: https://www.kaggle.com/cyones77/t-sne-projection. And I asked myself: is there some kind of second-order logic in the data if I treat the data set as the coordinates of points in 300D space?

__Spoiler - there is!__

Let's start the research...

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [2]:
import gc
import time
from datetime import datetime
import warnings
warnings.simplefilter(action = 'ignore')

In [3]:
from sklearn.metrics import roc_auc_score, log_loss, accuracy_score, confusion_matrix
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

In [4]:
from scipy.stats import mannwhitneyu

## Let's prepare everything we need

In [5]:
train = pd.read_csv('../input/train.csv', index_col = 'id')
train.shape

(250, 301)

In [6]:
target = train['target']
train.drop('target', axis = 1, inplace = True)
target.value_counts()

1.0    160
0.0     90
Name: target, dtype: int64

In [7]:
test = pd.read_csv('../input/test.csv', index_col = 'id')
test.shape

(19750, 300)

It will be more convenient to use the combined data set:

In [8]:
index_train = train.index
index_test = test.index
print(len(index_train), len(index_test))

250 19750


In [9]:
df_full = pd.concat([train, test], axis = 0)

del train, test
gc.collect()

14

A data set for the research, with some basic per-row statistics of the source data:

In [10]:
df_stats = df_full.T.describe().T.drop('count', axis = 1)
df_stats.columns = ['source_' + c for c in df_stats.columns]
df_stats.head()

id,source_mean,source_std,source_min,source_25%,source_50%,source_75%,source_max
0,-0.009223,1.089171,-2.851,-0.77575,-0.0505,0.79625,2.929
1,0.08613,0.985838,-2.771,-0.5515,0.0745,0.74275,2.907
2,0.027657,1.012757,-2.788,-0.70875,0.0285,0.66075,2.895
3,0.088357,0.939743,-2.404,-0.6105,0.1525,0.74525,3.27
4,0.134413,0.941277,-2.087,-0.47425,0.112,0.69475,3.432


In [11]:
df_stats.shape

(20000, 7)

In [12]:
df_stats.loc[index_train].corrwith(target)

source_mean   -0.179397
source_std     0.065762
source_min    -0.061177
source_25%    -0.154892
source_50%    -0.166062
source_75%    -0.145751
source_max    -0.117318
dtype: float64

Functions and a table for comparing logistic regression scores and making a submission:

In [13]:
PARAMS = {}
PARAMS['random_state'] = 0
PARAMS['n_jobs'] = -1
PARAMS['C'] = .2
PARAMS['penalty'] = 'l1'
PARAMS['class_weight'] = 'balanced'
PARAMS['solver'] = 'saga'

In [14]:
logreg_scores = pd.DataFrame(columns = ['auc', 'acc', 'loss', 'tn', 'fn', 'fp', 'tp'])

def get_logreg_score(train_, target_):
    folds = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 20, random_state = 0)
    predict = pd.DataFrame(index = train_.index)
    
    # Cross-validation cycle
    for n_fold, (train_idx, valid_idx) in enumerate(folds.split(target_, target_)):
        train_x, train_y = train_.iloc[train_idx], target_.iloc[train_idx]
        valid_x, valid_y = train_.iloc[valid_idx], target_.iloc[valid_idx]
        
        clf = LogisticRegression(**PARAMS)
        clf.fit(train_x, train_y)
        predict[n_fold] = pd.Series(clf.predict_proba(valid_x)[:, 1], index = valid_x.index)

        del train_x, train_y, valid_x, valid_y
        gc.collect()
        
    predict = predict.mean(axis = 1)
    tn, fp, fn, tp = confusion_matrix(target_, (predict >= .5) * 1).ravel()
    return [
                 roc_auc_score(target_, predict), 
                 accuracy_score(target_, (predict >= .5) * 1), 
                 log_loss(target_, predict),
                 tn, fn, fp, tp
            ]

In [15]:
def get_submit(train_, test_, target_):
    predict = pd.DataFrame(index = test_.index)
    
    clf = LogisticRegression(**PARAMS)
    clf.fit(train_, target_)
    
    predict = pd.Series(clf.predict_proba(test_)[:, 1], index = test_.index).reset_index()
    predict.columns = ['id', 'target']
    
    return predict

## Let's start...

First of all, let's calculate the score for the source data.

In [16]:
step = 'source dataset'
logreg_scores = logreg_scores.T
logreg_scores[step] = get_logreg_score(df_full.loc[index_train], target)
logreg_scores = logreg_scores.T
logreg_scores

,auc,acc,loss,tn,fn,fp,tp
source dataset,0.815417,0.732,0.506695,55.0,32.0,35.0,128.0
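
The cell above adds a row to the score table by transposing it twice. A minimal sketch of an alternative, assuming a table with the same kind of column layout, is to assign the row by label with `.loc`. The step names and score values below are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical score table with a subset of the notebook's columns.
scores = pd.DataFrame(columns=['auc', 'acc', 'loss'])

# Assigning to a label that does not exist yet appends a new row.
# The numbers are made up for illustration.
scores.loc['toy step'] = [0.8, 0.7, 0.5]
scores.loc['another toy step'] = [0.9, 0.75, 0.45]

print(scores)
```

This keeps one row per step without reshaping the table, though the resulting column dtypes may differ from those produced by the double transpose.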


In [17]:
submit = get_submit(df_full.loc[index_train], df_full.loc[index_test], target)

score_auc = logreg_scores.loc[step, 'auc']
score_acc = logreg_scores.loc[step, 'acc']
score_loss = logreg_scores.loc[step, 'loss']
filename = 'subm_{}_{:.4f}_{:.4f}_{:.4f}_{}.csv'.format('source', score_auc, score_acc, score_loss,
                                                        datetime.now().strftime('%Y-%m-%d'))
print(filename)
submit.to_csv(filename, index = False)

subm_source_0.8154_0.7320_0.5067_2019-04-24.csv


__LB = 0.845__

If we are looking for second-order logic, let's first check the distance from the points to the origin.

In [18]:
dist_to_origin_sqr = (df_full**2).sum(axis = 1)
dist_to_origin_sqr.describe()

count    20000.000000
mean       300.267708
std         24.626507
min        207.933984
25%        283.360689
50%        299.423155
75%        316.498468
max        408.301493
dtype: float64

Wow! It looks like a sphere centered at the origin with a radius of sqrt(300)! All points are located near it.

Hmm... The square of the radius equals the dimension of the space... What does this mean? For example, on such a sphere the points where the "bisectors" of the quadrants cross it have all coordinates equal to 1 or -1. Or maybe the initial coordinates were 1 and -1, and then some kind of transformation was applied. For a synthetic data set this is quite a plausible assumption.
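
As a reference point for the numbers above (this check is not part of the original notebook), the sketch below draws 20000 points whose coordinates are independent standard normal values, an assumption made purely for comparison, and summarises their squared norms. For such data the squared norm follows a chi-square distribution with 300 degrees of freedom, so its mean is 300 and its standard deviation is sqrt(600), about 24.5. The sketch also checks that a "bisector" point with all coordinates equal to 1 has a squared norm of 300.

```python
import numpy as np

def squared_norm_summary(n_points=20000, n_dims=300, seed=0):
    """Mean and std of squared norms for i.i.d. standard normal points.

    The Gaussian model is an assumption for comparison only; it is not
    a statement about how the competition data was generated.
    """
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n_points, n_dims))
    squared_norms = (points ** 2).sum(axis=1)
    return squared_norms.mean(), squared_norms.std()

mean_sq, std_sq = squared_norm_summary()
print(mean_sq, std_sq)

# A point on a quadrant "bisector": every coordinate is +1 (or -1).
bisector = np.ones(300)
bisector_sq = float((bisector ** 2).sum())
print(bisector_sq)
```

Comparing these values with the `describe()` output above gives a sense of how much of the concentration near radius sqrt(300) a simple baseline would already produce.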

Now let's try to project the points onto the sphere and analyze the distance to it. Does it carry any useful information?

In [19]:
rad_sphere_sqr = 300
rad_sphere = np.sqrt(rad_sphere_sqr)
rad_sphere

17.320508075688775

In [20]:
df_stats['dist_to_sphere'] = np.sqrt(dist_to_origin_sqr) - rad_sphere
df_stats['dist_to_sphere'].describe()

count    20000.000000
mean        -0.006832
std          0.710176
min         -2.900592
25%         -0.487187
50%         -0.016660
75%          0.469896
max          2.885963
Name: dist_to_sphere, dtype: float64

In [23]:
np.corrcoef(df_stats['dist_to_sphere'].loc[index_train], target)[0, 1]

0.06722549873039244

In [24]:
np.corrcoef(abs(df_stats['dist_to_sphere'].loc[index_train]), target)[0, 1]

0.08679370204297157

In [25]:
mannwhitneyu(df_stats['dist_to_sphere'].loc[index_train], df_stats['dist_to_sphere'].loc[index_test])

MannwhitneyuResult(statistic=2427657.0, pvalue=0.32528354304664864)

It looks like the distance to the sphere carries no useful information.

And its distribution does not differ between the train and test sets.
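
The train/test comparison above relies on the Mann-Whitney U test. The default of its `alternative` argument has differed between SciPy releases, so it may be worth passing it explicitly when reproducing the p-values. A minimal sketch on toy data follows; the two samples are hypothetical and drawn from the same distribution.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Two hypothetical samples from the same distribution, with unequal
# sizes to mimic a small train set and a larger test set.
sample_a = rng.normal(size=250)
sample_b = rng.normal(size=2000)

# State the alternative explicitly instead of relying on the default.
result = mannwhitneyu(sample_a, sample_b, alternative='two-sided')
print(result.statistic, result.pvalue)
```

A large p-value here only means the test found no evidence of a shift between the samples; it does not prove that the distributions are identical.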

Now let's project the points onto the sphere...

In [None]:
df_full_sphere = (df_full * rad_sphere).divide(np.sqrt(dist_to_origin_sqr), axis = 'rows')
(df_full_sphere**2).sum(axis = 1).describe()
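
The projection in the cell above rescales every row so that its norm equals the sphere radius. A small sketch of the same idea on toy data, using a hypothetical helper `project_to_sphere` and made-up points, shows that the direction of each point is kept while its length changes:

```python
import numpy as np
import pandas as pd

def project_to_sphere(points, radius):
    """Rescale each row of `points` to have Euclidean norm `radius`.

    Rows with zero norm would cause a division by zero; the toy data
    below contains none.
    """
    norms = np.sqrt((points ** 2).sum(axis=1))
    return points.mul(radius).div(norms, axis=0)

# Hypothetical 2D points, projected onto a circle of radius sqrt(2).
toy = pd.DataFrame([[3.0, 4.0], [0.5, 0.0], [-6.0, 8.0]])
projected = project_to_sphere(toy, radius=np.sqrt(2))
projected_norms = np.sqrt((projected ** 2).sum(axis=1))

print(projected)
print(projected_norms)
```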

In [None]:
tmp = df_full_sphere.T.describe().T.drop('count', axis = 1)
tmp.columns = ['sphere_' + c for c in tmp.columns]
tmp.loc[index_train].corrwith(target)

In [None]:
df_stats = pd.concat([df_stats, tmp], axis = 1)

del tmp
gc.collect()

df_stats.head()

...and check the score change.

In [None]:
step = 'projection onto sphere'
logreg_scores = logreg_scores.T
logreg_scores[step] = get_logreg_score(df_full_sphere.loc[index_train], target)
logreg_scores = logreg_scores.T
logreg_scores

In [None]:
submit = get_submit(df_full_sphere.loc[index_train], df_full_sphere.loc[index_test], target)

score_auc = logreg_scores.loc[step, 'auc']
score_acc = logreg_scores.loc[step, 'acc']
score_loss = logreg_scores.loc[step, 'loss']
filename = 'subm_{}_{:.4f}_{:.4f}_{:.4f}_{}.csv'.format('full_sphere', score_auc, score_acc, score_loss,
                                                        datetime.now().strftime('%Y-%m-%d'))
print(filename)
submit.to_csv(filename, index = False)

__LB = 0.845__

It doesn't change significantly. It looks as if we only removed some kind of noise and all points really are located on this sphere.

Let's explore this set.

First, let's take a closer look at the location of the points relative to the "bisectors" of the quadrants. To do this, let's determine the average density of points per quadrant.

In [None]:
df_signes = np.sign(df_full_sphere).astype(int)
df_signes.head()

In [None]:
df_signes.replace(-1, 2).astype(str).apply(lambda x: ''.join(x), axis = 1).nunique()

Surprise! There are 20000 unique combinations of coordinate signs. Each quadrant contains no more than one point!
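
Counting the unique sign combinations does not require building strings. A minimal sketch, assuming the points are available as a NumPy array, counts distinct rows of the sign matrix directly. The helper name `count_sign_patterns` and the toy points are hypothetical.

```python
import numpy as np

def count_sign_patterns(points):
    """Count distinct rows of coordinate signs (occupied quadrants)."""
    signs = np.sign(points).astype(np.int8)
    return len(np.unique(signs, axis=0))

# Three hypothetical 3D points; the first two share the pattern (+, -, +).
toy = np.array([
    [0.3, -1.2, 2.0],
    [1.5, -0.1, 0.7],
    [-0.4, 0.9, 1.1],
])
n_patterns = count_sign_patterns(toy)
print(n_patterns)  # 2
```

With a DataFrame such as `df_signes`, passing `df_signes.values` should give the same kind of count.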

But...

There are 2**300 quadrants in the 300D space. It's a very big number:

In [None]:
2**300

And we have only 20000 points. Could we get such a distribution of points by accident? Yes. Is it an accident here? I hope not. This is a synthetic dataset.
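
To put a number on "by accident", here is a rough bound under the assumption (mine, not verified on the data) that every point falls into one of the 2**300 quadrants uniformly and independently. The probability that at least two of the 20000 points share a quadrant is at most the number of point pairs divided by the number of quadrants.

```python
from fractions import Fraction

n_points = 20000
n_dims = 300

# Number of distinct pairs of points.
n_pairs = n_points * (n_points - 1) // 2

# Union bound on the probability that any pair shares a quadrant,
# assuming uniform and independent sign patterns.
collision_bound = Fraction(n_pairs, 2 ** n_dims)

print(n_pairs)
print(float(collision_bound))
```

The bound is on the order of 1e-82, so under this assumption 20000 distinct quadrants is exactly what random placement would give.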

Let's explore the distribution of the quadrants with points.

In [None]:
df_stats['positive_cnt'] = (df_signes > 0).sum(axis = 1)
df_stats['positive_cnt'].describe()

On average, half of the coordinates of each point are positive: no fewer than 109 and no more than 185. This means there are no points with almost all positive or almost all negative coordinates.
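
For scale, the sketch below works out what the count of positive coordinates would look like if each of the 300 signs were an independent fair coin flip. That is an assumption used only for comparison. The count would then follow a binomial distribution, and the observed extremes of 109 and 185 can be expressed in standard deviations from the mean.

```python
import math

n_dims = 300
p_positive = 0.5  # assumed probability that a coordinate is positive

# Mean and standard deviation of a Binomial(300, 0.5) count.
expected_mean = n_dims * p_positive
expected_std = math.sqrt(n_dims * p_positive * (1 - p_positive))

# Observed extremes quoted in the text, in units of standard deviations.
z_min = (109 - expected_mean) / expected_std
z_max = (185 - expected_mean) / expected_std

print(expected_mean, expected_std)
print(z_min, z_max)
```

Under this assumption the mean is 150 and the standard deviation is about 8.66, so both extremes lie between four and five standard deviations from the mean.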

Is it useful? Let's check.

In [None]:
np.corrcoef(df_stats['positive_cnt'].loc[index_train], target)[0, 1]

In [None]:
mannwhitneyu(df_stats['positive_cnt'].loc[index_train], df_stats['positive_cnt'].loc[index_test])

This correlation is rather small for using the count of positive coordinates in prediction. But it is not the smallest of the values found :)

What about the quadrants themselves? Let's replace the coordinates of the points with the coordinates of the quadrant "bisectors" and look at the prediction.

In [None]:
step = 'quadrants'
logreg_scores = logreg_scores.T
logreg_scores[step] = get_logreg_score(df_signes.loc[index_train], target)
logreg_scores = logreg_scores.T
logreg_scores

In [None]:
submit = get_submit(df_signes.loc[index_train], df_signes.loc[index_test], target)

score_auc = logreg_scores.loc[step, 'auc']
score_acc = logreg_scores.loc[step, 'acc']
score_loss = logreg_scores.loc[step, 'loss']
filename = 'subm_{}_{:.4f}_{:.4f}_{:.4f}_{}.csv'.format('quad', score_auc, score_acc, score_loss,
                                                        datetime.now().strftime('%Y-%m-%d'))
print(filename)
submit.to_csv(filename, index = False)

__LB = 0.748__

The distribution of quadrants contains meaningful information for prediction, but not all of it. The distribution of points within quadrants is important too.

Let's calculate, for example, the angle between each point's vector and the vector of its quadrant's "bisector".

In [None]:
df_stats['angle_w_bis'] = np.arccos(abs(df_full_sphere).sum(axis = 1) / rad_sphere_sqr)
df_stats['angle_w_bis'].describe()

Hmm... It looks like further spheres of the same radius, centered at the intersections of the "bisectors" with the source sphere. Each point is located near the intersection of such a sphere with the source one.
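
The angle used above is the ordinary angle between a point and the sign vector of its quadrant. For a point on the sphere, both vectors have norm sqrt(300), so the denominator of the cosine is 300, which is why the notebook divides by `rad_sphere_sqr`. A minimal general sketch with a hypothetical helper `angle_to_bisector` and made-up points:

```python
import numpy as np

def angle_to_bisector(point):
    """Angle in radians between `point` and the sign vector of its quadrant."""
    point = np.asarray(point, dtype=float)
    bisector = np.sign(point)
    cosine = point @ bisector / (np.linalg.norm(point) * np.linalg.norm(bisector))
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.arccos(np.clip(cosine, -1.0, 1.0))

# A point lying exactly on a bisector has an angle of (numerically) zero.
on_bisector = angle_to_bisector([2.0, -2.0, 2.0])

# A point off the bisector: cosine = (1 + 2) / (sqrt(5) * sqrt(2)).
off_bisector = angle_to_bisector([1.0, 2.0])

print(on_bisector, off_bisector)
```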

In [None]:
np.corrcoef(df_stats['angle_w_bis'].loc[index_train], target)[0, 1]

In [None]:
mannwhitneyu(df_stats['angle_w_bis'].loc[index_train], df_stats['angle_w_bis'].loc[index_test])

The magnitude of the angle is not important.

The distributions of angles in the train and test sets differ only slightly more than for the previous statistics. They can still be considered the same.

## To be continued...

I hope :)