In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
print(os.listdir("../input"))

Like everyone else, I am interested in discerning what these anonymous features really mean. Since it's financial data, one would assume the features give us some of the following:

* Individual transaction amounts
* Aggregated (summed) transactions per day
* Aggregated transactions per week / month / year
* Different types of financial transactions, such as amount in account1, account2, or loan amounts, savings account amounts, etc.

Examining the data, and especially in light of Giba's comment, we know that our target value frequently appears in our dataset. Moreover, just by virtue of the sheer number of columns, I believe it reasonable to assume that, at a minimum, we're looking at individual transactions. If that is indeed the case, is there a way to tell for certain whether aggregate columns are present in the dataset? Or conversely, to prove that there are none? Also, why are there so many 0s in the dataset? Stay tuned...
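
The "target shows up among the features" observation can be checked with a row-wise comparison. Below is a minimal sketch on a toy frame; the column names `f1`..`f3` and all values are invented for illustration, and on the real data `toy_train` would be replaced by the train frame before `target` is dropped.

```python
import pandas as pd

# Toy stand-in for train.csv; every value here is made up.
toy_train = pd.DataFrame({
    "target": [10.0, 25.0, 40.0],
    "f1": [10.0, 0.0, 7.0],
    "f2": [0.0, 3.0, 0.0],
    "f3": [5.0, 25.0, 0.0],
})

features = toy_train.drop(columns="target")

# Compare every feature in a row against that row's target value.
target_in_row = features.eq(toy_train["target"], axis=0).any(axis=1)

# Fraction of rows whose target also appears as a feature value.
share = target_in_row.mean()
print(target_in_row.tolist(), share)
```

Exact equality is used here because the amounts are assumed to be stored identically in both places; if that does not hold, a tolerance-based comparison would be needed.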

In [None]:
train = pd.read_csv('../input/train.csv')
test  = pd.read_csv('../input/test.csv')

del train['ID'], test['ID'], train['target']
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
all_data = pd.concat([train, test], ignore_index=True)

cols_with_onlyone_val = train.columns[train.nunique() == 1]

# Uncomment for fun:
#all_data.drop(cols_with_onlyone_val, axis=1, inplace=True)  

My thought process is as follows. We need to examine each user's columns, searching for the minimum non-zero value present. If we find such a number, and it is **not** repeated anywhere in the user's columns (that is, it only appears once), then we can claim for certain that that specific column / feature is **not** an aggregate. If it were an aggregate, it would be the sum of other non-negative columns in the same row, so it could not be smaller than any of its non-zero components; being the minimum, it could only equal a single component, and we would then see that same minimum value at least one other time in the row.

Let's code that up:

In [None]:
# GOAL: if a column holds the user's min non-zero value, and that min value
# only appears once in the user's row, then for certain that col is NOT an aggregate
def notagg(row):
    # Default answer: nothing learned about any column in this row
    flags = pd.Series(0, index=row.index, dtype=int)

    row_nz = row[row > 0]
    if row_nz.shape[0] == 0:
        return flags  # row is all 0s, so we learn nothing from it

    min_nz = row_nz.min()
    check  = (row_nz == min_nz).sum()

    # Min value occurs more than once, so we can't learn anything about the
    # min-val columns; as such, we can't learn anything about this row.
    # Return a Series (not a scalar 0) so apply() always gets the same shape.
    if check > 1:
        return flags

    # Otherwise, min-val only occurs once! That col is NOT an aggregate
    return (row == min_nz).astype(int)  # np.int was removed from NumPy; only min-col is marked=1
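
As a sanity check of the rule, here is a minimal vectorised sketch on a toy matrix. It is not the function above but an alternative written with NumPy; the helper name `proven_not_aggregate` and the `toy` values are made up for illustration. The last toy column is deliberately the sum of the first three.

```python
import numpy as np

def proven_not_aggregate(values):
    """Boolean mask over columns: True where the column is, in at least
    one row, the unique smallest non-zero value of that row."""
    values = np.asarray(values, dtype=float)
    # Hide zeros behind +inf so they never win the row minimum.
    masked = np.where(values > 0, values, np.inf)
    row_min = masked.min(axis=1, keepdims=True)
    # Rows that are all zeros have an infinite "minimum" and are skipped.
    is_min = (masked == row_min) & np.isfinite(row_min)
    # Keep only the rows whose minimum occurs exactly once.
    unique_min = is_min.sum(axis=1, keepdims=True) == 1
    return (is_min & unique_min).any(axis=0)

toy = np.array([
    [0.0, 5.0, 2.0, 7.0],   # unique min 2.0 sits in column 2
    [3.0, 3.0, 0.0, 6.0],   # min 3.0 appears twice: nothing learned
    [0.0, 0.0, 0.0, 0.0],   # all zeros: nothing learned
    [9.0, 4.0, 0.0, 13.0],  # unique min 4.0 sits in column 1
])

result = proven_not_aggregate(toy)
print(result.tolist())  # [False, True, True, False]
```

The aggregate column (index 3) is never flagged, as expected. Column 0 is not flagged either, even though it is not an aggregate in the toy data: an unflagged column is undecided, not proven to be an aggregate.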

In [None]:
# Apply the above function to all rows:
cols_not_agg = all_data.apply(notagg, axis=1)
cols_not_agg.shape

In [None]:
# Cool, now look at each column and see if it was ever flagged as "not an aggregate" (1 = flagged in at least one row)
cols_not_agg = cols_not_agg.max(axis=0)
cols_not_agg.shape # Make sure we're looking @ columns

In [None]:
cols_not_agg.sum()

In [None]:
which = cols_not_agg[cols_not_agg==0].index.tolist()
which

In [None]:
# Start with the easiest candidates. Let's see which of these columns has the least number of non-0 values
check = train[which] > 0
counts = check.sum(axis=0).sort_values()
# Column 0 = raw non-zero count, column 1 = percentage of train rows
pd.concat([counts, 100 * counts / train.shape[0]], axis=1)

Pitiful. These columns have very little representation in the data. Column 0 is the raw count, and Column 1 is the percentage. Only the bottom four features are non-zero in more than 1% of our train rows. It doesn't even make sense to consider features that are non-zero 19 or fewer times in our dataset as potential aggregates, because we really won't be getting any LB-juice out of that. And even if we were to go ahead and assume that the bottom four features were actual aggregate candidates, it wouldn't make sense to have just 4 aggregates out of 4991 columns... and on top of that have those aggregates == 0 more than 97.5% of the time.

My conclusion here is that this dataset does **not contain any aggregates at all**.

The next question is why are there so many zeros in the dataset? My hypothesis here builds on two points:

1. The assertion that there are NO aggregates in the dataset, and
2. Other people's finding that the order of columns does **not** matter

Given those, it stands to reason that for each "day" they created a bunch of buckets so that they could hold multiple transactions per day. If the dataset were a json object, or nested arrays, then I believe they'd actually have a different number of features per user; but since it's provided to us as a flat .csv, every user gets the same number of columns, and the buckets a user never filled show up as 0s.
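
To make the bucket idea concrete, here is a minimal sketch of how ragged per-user records turn into a zero-padded table. The user names, the `slot_` column names, and the amounts are all hypothetical, and left-aligned padding is an assumption about how such an export might have been done, not something known about this dataset.

```python
import pandas as pd

# Hypothetical ragged records: each user has a different number of transactions.
ragged = {
    "user_a": [120.0, 45.5],
    "user_b": [300.0],
    "user_c": [10.0, 20.0, 30.0, 40.0],
}

# A flat table needs one fixed width, so shorter rows are padded with zeros.
width = max(len(v) for v in ragged.values())
padded = pd.DataFrame(
    [v + [0.0] * (width - len(v)) for v in ragged.values()],
    index=list(ragged.keys()),
    columns=[f"slot_{i}" for i in range(width)],
)

# How many users have a non-zero value in each slot?
nonzero_count = (padded > 0).sum(axis=0)
print(padded)
print(nonzero_count.tolist())
```

With left-aligned padding like this, the earliest slots are filled for every user and the fill rate drops off towards the later slots, which is the signature one would look for in the real columns.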

One issue with the above is that if it were true, we would expect to see a core group of columns, for example the columns that represent the first few transactions of every day or period, that almost always contain non-zero items. That doesn't seem to be the case when we plot the non-zero count of each column:

In [None]:
nz = (all_data>0).sum(axis=0)
nz = nz.sort_values(ascending=False)
nz.shape

In [None]:
plt.plot(nz.values/all_data.shape[0])
plt.show()

At most, a column is non-zero in only 17.5% of rows. Not sure where to go next. Any ideas 🤔?