Imports

In [13]:
import numpy as np
import pandas as pd
print('Imports are ready!')
!pwd

Imports are ready!
/Users/juansmacbook/PycharmProjects/Santander_Transaction/Santander-Transaction-Competition


# **Loading Datasets**

As the shapes below show, both sets contain 200k examples, each with its "ID_code" and 200 numerical features. Note that the training set has an additional column containing the binary target for each example.

In [14]:
print("Let us load the training and the test set, please wait.")
train=pd.read_csv('/Users/juansmacbook/PycharmProjects/Santander_Transaction/santander_data/train.csv')
test=pd.read_csv('/Users/juansmacbook/PycharmProjects/Santander_Transaction/santander_data/test.csv')
print("Shape of training set: "+str(train.shape))
print("Shape of testing set: "+str(test.shape))

Let us load the training and the test set, please wait.
Shape of training set: (200000, 202)
Shape of testing set: (200000, 201)


# **"Magic" and synthetic examples in the test dataset**
This competition is interesting because people in the forums discovered that 50% of the test set is synthetic! The key to distinguishing the real test examples from the fake ones is checking whether the value of a given feature is repeated in any other test example. Only the examples that have *at least one unique value in any of their features* are considered to be real!

Of course, this wasn't discovered by me; it is what the community calls "Magic" in Kaggle competitions. I will elaborate further on why this is really important for the model.

With that in mind, my next task is to filter out the fake test examples and show you that this is actually the case. We should find that 100,000 test examples are real and 100,000 are fake!
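Before running this on the full test set, the rule can be sketched on a tiny made-up frame. The values and the names `toy`, `is_unique` and `looks_real` are hypothetical and only serve to illustrate the idea; this is not the competition data.

```python
import pandas as pd

# Toy stand-in for the test set (hypothetical values, not competition data).
toy = pd.DataFrame({
    "ID_code": ["t0", "t1", "t2", "t3"],
    "var_0": [1.0, 1.0, 2.0, 1.0],
    "var_1": [5.0, 6.0, 5.0, 5.0],
})

features = toy.columns[1:]  # skip the ID column

# duplicated(keep=False) marks every value that occurs more than once,
# so its negation marks values that occur exactly once in the column.
is_unique = toy[features].apply(lambda col: ~col.duplicated(keep=False))

# A row counts as "real" if at least one of its features holds a unique value.
looks_real = is_unique.any(axis=1)
print(looks_real.tolist())  # [False, True, True, False]
```

Rows `t1` and `t2` each hold one value that appears nowhere else in its column, so they would be kept as real; `t0` and `t3` only hold repeated values.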

In [15]:
"""Here we define a function that takes a column from a data frame and returns the values that appear only once in that column"""

def unique_values(column):
    count=column.value_counts()
    return count.index[count==1]

"""We check which examples in the test set are real by doing value counts: examples with at least one feature value that isn't repeated anywhere else in the test set are considered to be real!"""

def feature_uniqueness(data_frame=test): # This function takes the TEST DATASET and returns its "extended" version with one flag per feature saying whether the value is unique
    for column in data_frame.columns[1:]: # Notice that we skip the first column since we don't care about the ID
        new_column=pd.Series(
            data=data_frame[column].isin(unique_values(data_frame[column])).to_numpy(),
            name=column+'_unique?'
        )
        data_frame=pd.concat([data_frame,new_column],axis=1)
    return data_frame
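As an aside, the same flags can probably be built in one pass instead of concatenating one column at a time. The sketch below is an alternative, not the function used in this notebook; `feature_uniqueness_fast` and its `id_column` parameter are hypothetical names, and it assumes the features contain no missing values.

```python
import pandas as pd

def feature_uniqueness_fast(data_frame, id_column="ID_code"):
    """Return data_frame plus one boolean '<feature>_unique?' column per feature."""
    features = data_frame.drop(columns=[id_column])
    # True where the value occurs exactly once in its column.
    flags = features.apply(lambda col: ~col.duplicated(keep=False))
    flags.columns = [name + "_unique?" for name in features.columns]
    # A single concat keeps the original columns first and the flags last.
    return pd.concat([data_frame, flags], axis=1)

# Hypothetical toy data to show the resulting layout.
toy = pd.DataFrame({
    "ID_code": ["t0", "t1", "t2"],
    "var_0": [3.0, 3.0, 4.0],
    "var_1": [7.0, 8.0, 9.0],
})
extended = feature_uniqueness_fast(toy)
print(extended)
```

Building all flags first and concatenating once avoids copying the growing frame 200 times, which is what the loop version does.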

In [16]:
"""This is our new data frame containing 200 extra columns that tell us, for every test example, which of its feature values are unique"""
temporal_df=feature_uniqueness()
temporal_df.head()

Unnamed: 0,ID_code,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190_unique?,var_191_unique?,var_192_unique?,var_193_unique?,var_194_unique?,var_195_unique?,var_196_unique?,var_197_unique?,var_198_unique?,var_199_unique?
0,test_0,11.0656,7.7798,12.9536,9.4292,11.4327,-2.3805,5.8493,18.2675,2.1337,...,False,False,False,False,False,False,False,False,False,False
1,test_1,8.5304,1.2543,11.3047,5.1858,9.1974,-4.0117,6.0196,18.6316,-4.4131,...,False,False,False,False,False,False,False,False,False,False
2,test_2,5.4827,-10.3581,10.1407,7.0479,10.2628,9.8052,4.895,20.2537,1.5233,...,False,False,False,False,False,False,False,False,False,False
3,test_3,8.5374,-1.3222,12.022,6.5749,8.8458,3.1744,4.9397,20.566,3.3755,...,False,False,False,False,False,False,True,True,False,False
4,test_4,11.7058,-0.1327,14.1295,7.7506,9.1035,-8.5848,6.8595,10.6048,2.989,...,False,False,False,False,False,False,False,False,False,False


# **Proof that 50% of the test set is fake!**
As mentioned previously, the only examples that we are going to consider non-synthetic are those that have at least one unique value among their features; the ones that have none are considered fake. We can easily separate the real ones from the fake ones with a simple condition.
Using this condition, we are going to split the test set into real and fake examples.

In [18]:
#For each example we check whether any of its last 200 columns (the uniqueness flags) is True
condition=temporal_df[temporal_df.columns[201:]].any(axis=1)

print('_.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._')
print(
condition.value_counts()
)
print('_.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._')
print('Just as mentioned in the forums, 50% of test data is fake!')

_.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._
False    100000
True     100000
dtype: int64
_.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._
Just as mentioned in the forums, 50% of test data is fake!


In [19]:
#Now we can split the test set into the real part and the fake part
#We take copies so that adding columns later does not trigger pandas' SettingWithCopyWarning
real_test_examples=temporal_df[condition].copy()
fake_test_examples=temporal_df[~condition].copy()
#del temporal_df #We delete temporal_df to save memory

## Why filter the real examples out of the testing set?
We are doing this so we can do a value count again, this time over the training set concatenated with the real test examples.
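A small sketch of why the fake rows are left out of that count. It assumes the synthetic rows reuse values that already occur in real rows, which is an assumption drawn from the forum finding and is not shown in this notebook. The frames `real` and `fake` and the helper `unique_flag` are hypothetical.

```python
import pandas as pd

real = pd.DataFrame({"var_0": [1.5, 2.5, 3.5]})
fake = pd.DataFrame({"var_0": [2.5, 3.5]})  # hypothetical synthetic rows reusing real values

def unique_flag(frame, column):
    """True where the value appears exactly once in frame[column]."""
    counts = frame[column].value_counts()
    return frame[column].map(counts).eq(1)

# Flags computed on the real rows only.
only_real = unique_flag(real, "var_0")

# Flags for the same real rows when the fake rows are included in the count.
combined = pd.concat([real, fake], ignore_index=True)
with_fake = unique_flag(combined, "var_0").iloc[:len(real)]

print(only_real.tolist())  # [True, True, True]
print(with_fake.tolist())  # [True, False, False]
```

Under that assumption, counting over the fake rows as well would erase uniqueness flags that the real rows actually carry.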

In [22]:
temporal_df

Unnamed: 0,ID_code,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190_unique?,var_191_unique?,var_192_unique?,var_193_unique?,var_194_unique?,var_195_unique?,var_196_unique?,var_197_unique?,var_198_unique?,var_199_unique?
0,test_0,11.0656,7.7798,12.9536,9.4292,11.4327,-2.3805,5.8493,18.2675,2.1337,...,False,False,False,False,False,False,False,False,False,False
1,test_1,8.5304,1.2543,11.3047,5.1858,9.1974,-4.0117,6.0196,18.6316,-4.4131,...,False,False,False,False,False,False,False,False,False,False
2,test_2,5.4827,-10.3581,10.1407,7.0479,10.2628,9.8052,4.8950,20.2537,1.5233,...,False,False,False,False,False,False,False,False,False,False
3,test_3,8.5374,-1.3222,12.0220,6.5749,8.8458,3.1744,4.9397,20.5660,3.3755,...,False,False,False,False,False,False,True,True,False,False
4,test_4,11.7058,-0.1327,14.1295,7.7506,9.1035,-8.5848,6.8595,10.6048,2.9890,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199995,test_199995,13.1678,1.0136,10.4333,6.7997,8.5974,-4.1641,4.8579,14.7625,-2.7239,...,False,False,False,True,False,False,False,False,False,False
199996,test_199996,9.7171,-9.1462,7.3443,9.1421,12.8936,3.0191,5.6888,18.8862,5.0915,...,True,False,False,False,False,False,False,False,False,True
199997,test_199997,11.6360,2.2769,11.2074,7.7649,12.6796,11.3224,5.3883,18.3794,1.6603,...,False,False,False,False,False,False,False,False,False,False
199998,test_199998,13.5745,-0.5134,13.6584,7.4855,11.2241,-11.3037,4.1959,16.8280,5.3208,...,False,False,False,False,False,False,False,False,False,False


In [51]:
#We create a new variable that says if an example is real or not
#temporal_df.drop(x_test.columns[201:401], axis=1,
#                 inplace=False #Change to True later
#                 )
#temporal_df[7:8]
real_test_examples

Unnamed: 0,ID_code,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190_unique?,var_191_unique?,var_192_unique?,var_193_unique?,var_194_unique?,var_195_unique?,var_196_unique?,var_197_unique?,var_198_unique?,var_199_unique?
3,test_3,8.5374,-1.3222,12.0220,6.5749,8.8458,3.1744,4.9397,20.5660,3.3755,...,False,False,False,False,False,False,True,True,False,False
7,test_7,17.3035,-2.4212,13.3989,8.3998,11.0777,9.6449,5.9596,17.8477,-4.8068,...,False,False,False,False,False,False,True,False,True,False
11,test_11,10.6137,-2.1898,8.9090,3.8014,13.8602,-5.9802,5.5515,15.4716,-0.1714,...,False,False,False,False,False,False,False,False,False,False
15,test_15,14.8595,-4.5378,13.6483,5.6480,9.9144,1.5190,5.0358,13.4524,-2.5419,...,False,False,False,False,False,False,False,False,False,False
16,test_16,14.1732,-5.1490,9.7591,3.7316,10.3700,-21.9202,7.7130,18.8749,0.4680,...,False,False,False,True,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199986,test_199986,19.2884,-2.8384,11.9149,6.6611,12.3112,12.9244,5.6492,16.0449,5.3597,...,False,False,False,False,False,False,False,False,False,False
199993,test_199993,14.6764,-8.1066,7.1167,2.4138,10.3845,-11.9327,4.7563,16.0455,0.4510,...,True,False,False,False,False,False,False,False,False,True
199995,test_199995,13.1678,1.0136,10.4333,6.7997,8.5974,-4.1641,4.8579,14.7625,-2.7239,...,False,False,False,True,False,False,False,False,False,False
199996,test_199996,9.7171,-9.1462,7.3443,9.1421,12.8936,3.0191,5.6888,18.8862,5.0915,...,True,False,False,False,False,False,False,False,False,True


In [57]:
#We concatenate the training set with the real test examples


SUPER_TEST_DF=pd.concat([
    train.drop(["target"], axis=1), #We drop the target column of the training set because real_test_examples doesn't have one
    real_test_examples.drop(columns=real_test_examples.columns[201:]) #We drop all the columns that say whether the values are unique, so we can do the value count again
    ],axis=0, ignore_index=True) #Since both data frames contain repeated indices, it is important to reset them to avoid problems

In [40]:
# We concatenate the training set and the real test examples into x_train to do a value count later
# We drop the 200 '_unique?' flag columns first so both frames share the same 201 columns
x_train=pd.concat([
    train.drop(['target'],axis=1),
    real_test_examples.drop(columns=real_test_examples.columns[201:])
    ],axis=0, ignore_index=True) # Reset the index to avoid problems later

print('Shape after concatenating: '+str(x_train.shape))

# Keep only the original 201 columns of the fake test examples
fake_test_examples=fake_test_examples.drop(columns=fake_test_examples.columns[201:])

# We append to the dataset 200 new features that flag which values are unique in train + real test
for column in x_train.columns[1:]:
    uniques_count=x_train[column].value_counts()
    x_train[column+'_unique']=x_train[column].map(uniques_count).eq(1).astype(int)
    fake_test_examples[column+'_unique']=0 # Fake examples get 0 for every flag
print('Finished!')

# We create the final test set: the real test examples (with their flags) followed by the fake ones
x_test=pd.concat([x_train[200000:],fake_test_examples])

# We create the final training set: the first 200000 rows
x_train=x_train[:200000]


print('Ready!')