### Imports

In [1]:
import numpy as np
import pandas as pd
print('Imports are Ready!')
!pwd
print("Hey!")

Imports are Ready!
/Users/juansmacbook/PycharmProjects/Santander_Transaction/Santander-Transaction-Competition
Hey!


### Loading Datasets

Both sets contain 200,000 examples, each with its "ID_code" and 200 numerical features. It is important to note that the training set has one additional column containing the binary target for each example.

In [2]:
print("Let us load the training and the test set, please wait.")
train=pd.read_csv('/Users/juansmacbook/PycharmProjects/Santander_Transaction/Santander-Transaction-Competition/train.csv')
test=pd.read_csv('/Users/juansmacbook/PycharmProjects/Santander_Transaction/Santander-Transaction-Competition/test.csv')
print("Shape of training set: "+str(train.shape))
print("Shape of testing set: "+str(test.shape))

Let us load the training and the test set, please wait.
Shape of training set: (200000, 202)
Shape of testing set: (200000, 201)
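
A quick way to confirm that the only difference between the two layouts is the target column is to compare the column sets. This is just a minimal sketch on toy frames; `extra_train_columns`, `toy_train` and `toy_test` are hypothetical names that are not part of the notebook.

```python
import pandas as pd

def extra_train_columns(train_df: pd.DataFrame, test_df: pd.DataFrame) -> list:
    """Return the columns that exist in train_df but not in test_df."""
    return sorted(set(train_df.columns) - set(test_df.columns))

# Toy stand-ins for train.csv and test.csv (the real files have 200 features)
toy_train = pd.DataFrame({
    "ID_code": ["train_0", "train_1"],
    "target": [0, 1],
    "var_0": [8.9, 11.5],
})
toy_test = pd.DataFrame({
    "ID_code": ["test_0", "test_1"],
    "var_0": [11.1, 8.5],
})

print(extra_train_columns(toy_train, toy_test))  # ['target']
```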


#  Extracting synthetic examples in the test dataset
This competition is interesting because people in the forums discovered that 50% of the test set is synthetic! The key to distinguishing the real test examples from the fake ones is checking whether the value of a given feature is repeated in any other test example. Only the examples that have *at least one unique value in any of their features* are considered to be real!

Of course, this wasn't discovered by me; it is what the community calls "Magic" when it comes to Kaggle competitions.

With that in mind, my next task will be to filter out the fake test examples and show you that this is actually the case. We should find that 100,000 test examples are real and 100,000 are fake! Then I will elaborate further on why this is really important for the model and what we are going to do after we split the test set.

In [3]:
"""Here we define a function that inputs a column from a data frame and returns a list of the values that only appear one time in that data set"""

def unique_values(column):
    count=column.value_counts()
    return count.index[count==1]

"""We check which examples in the test set are real by doing value counts: examples with at least one value that is not repeated anywhere else in the test set are considered to be real!"""

def feature_uniqueness(data_frame): # Takes the TEST DATASET and returns an "extended" version with one flag column per feature saying whether each value is unique
    for column in data_frame.columns[1:]: # We skip the first column since we don't care about the ID
        new_column=pd.Series(
            data=data_frame[column].isin(unique_values(data_frame[column])).to_numpy(),
            name=column+'_unique?'
        )
        data_frame=pd.concat([data_frame,new_column],axis=1)
    return data_frame
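
The same flags can probably be built without growing the data frame one column at a time. Below is a minimal sketch of that idea, assuming there are no missing values in the features: `duplicated(keep=False)` marks every value that appears more than once, so its negation marks the values that appear exactly once. The name `feature_uniqueness_fast` and the toy frame are hypothetical and only used for illustration.

```python
import pandas as pd

def feature_uniqueness_fast(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Append one boolean '<feature>_unique?' column per feature, in a single concat."""
    features = data_frame.columns[1:]  # skip the ID column, as in the loop version
    flags = pd.DataFrame(
        # True where the value appears exactly once in its column
        {col + "_unique?": ~data_frame[col].duplicated(keep=False) for col in features},
        index=data_frame.index,
    )
    return pd.concat([data_frame, flags], axis=1)

# Toy data: 2.0 is unique in var_0 and 3.0 is unique in var_1
toy = pd.DataFrame({
    "ID_code": ["test_0", "test_1", "test_2"],
    "var_0": [1.5, 1.5, 2.0],
    "var_1": [3.0, 4.0, 4.0],
})
flagged = feature_uniqueness_fast(toy)
print(flagged)
```

Building all the flag columns first and concatenating once avoids copying the growing data frame 200 times, which is where most of the time in the loop version is likely to go.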

In [4]:
"""This is our new data frame containing 200 extra features that tell us, for every test example, which feature values are unique within their column"""
temporal_df=feature_uniqueness(test)
temporal_df.head()

Unnamed: 0,ID_code,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190_unique?,var_191_unique?,var_192_unique?,var_193_unique?,var_194_unique?,var_195_unique?,var_196_unique?,var_197_unique?,var_198_unique?,var_199_unique?
0,test_0,11.0656,7.7798,12.9536,9.4292,11.4327,-2.3805,5.8493,18.2675,2.1337,...,False,False,False,False,False,False,False,False,False,False
1,test_1,8.5304,1.2543,11.3047,5.1858,9.1974,-4.0117,6.0196,18.6316,-4.4131,...,False,False,False,False,False,False,False,False,False,False
2,test_2,5.4827,-10.3581,10.1407,7.0479,10.2628,9.8052,4.895,20.2537,1.5233,...,False,False,False,False,False,False,False,False,False,False
3,test_3,8.5374,-1.3222,12.022,6.5749,8.8458,3.1744,4.9397,20.566,3.3755,...,False,False,False,False,False,False,True,True,False,False
4,test_4,11.7058,-0.1327,14.1295,7.7506,9.1035,-8.5848,6.8595,10.6048,2.989,...,False,False,False,False,False,False,False,False,False,False


# Proof that 50% of data set is fake!
As mentioned previously, the only examples that we are going to consider non-synthetic are those that have at least one unique value among their features; the ones that don't have any are considered fake. We can easily separate the real and the fake ones with a simple condition.
Using this condition, we are going to split the test set into real and fake examples.

In [5]:
"""For each example we check whether any of its last 200 columns (the "unique?" flags) is True"""
condition=temporal_df[temporal_df.columns[201:]].any(axis=1)

print('_.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._')
print(condition.value_counts())
print('_.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._')
print('Just as mentioned in the forums, 50% of test data is fake!')

_.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._
False    100000
True     100000
dtype: int64
_.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._.-._
Just as mentioned in the forums, 50% of test data is fake!


In [6]:
"""Now we can split the dataset in the real part and the fake part"""

real_test_examples=temporal_df[condition]
fake_test_examples=temporal_df[~condition]

del temporal_df #We delete temporal_df to save memory
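
To make the split easier to follow, here is a toy version of it. This is only a sketch with made-up flags; `flags`, `rows`, `is_real`, `real_rows` and `fake_rows` are hypothetical names.

```python
import pandas as pd

# Made-up "unique?" flags for four test examples and two features
flags = pd.DataFrame({
    "var_0_unique?": [False, True, False, False],
    "var_1_unique?": [False, False, False, True],
})
rows = pd.DataFrame({"ID_code": ["test_0", "test_1", "test_2", "test_3"]})

# An example is treated as real if at least one of its flags is True
is_real = flags.any(axis=1)

real_rows = rows[is_real]
fake_rows = rows[~is_real]  # ~ negates the boolean mask

print(real_rows["ID_code"].tolist())  # ['test_1', 'test_3']
print(fake_rows["ID_code"].tolist())  # ['test_0', 'test_2']
```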

### OK, half the test set is fake... so what?


## 'MAGIC', and how to break the limits of this Kaggle competition
In Kaggle competitions there is something that the community calls "Magic", which refers to some tricks or procedures to create competition-winning models. In this specific competition the "Magic" to get a top 1% score, as it is discussed in the forums, is the following:

1) Filter out the real examples from the test set by counting unique values (we already did that!)
2) Merge the full training set with the real test examples (without the "var_#_unique?" features) and do the value counts again over the combined set, which gives us 200 new features.
3) Split the combined set back into the training set and the real test examples, each now with its 200 additional features. The real test examples must then be appended to the fake test examples (the extended version of them) in order to get back our full test set.

From now on we are going to carry out steps 2 and 3.
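
Steps 2 and 3 can be sketched on a tiny example with a single feature. This is an illustration under assumptions, not the competition code: all `toy_*` names and the helper `add_unique_flags` are hypothetical, and the flags here are computed by mapping each value to its count in the combined set.

```python
import pandas as pd

def add_unique_flags(df: pd.DataFrame, id_col: str = "ID_code") -> pd.DataFrame:
    """Append a '<feature>_unique?' column for every column except id_col."""
    out = df.copy()
    for col in [c for c in df.columns if c != id_col]:
        counts = df[col].value_counts()
        out[col + "_unique?"] = df[col].map(counts) == 1  # True if the value appears once
    return out

toy_train = pd.DataFrame({"ID_code": ["train_0", "train_1"], "target": [0, 1], "var_0": [1.0, 2.0]})
toy_real_test = pd.DataFrame({"ID_code": ["test_0", "test_1"], "var_0": [2.0, 3.0]})
# Fake examples keep the all-False flags they got from the test-only counts
toy_fake_test = pd.DataFrame({"ID_code": ["test_2"], "var_0": [1.0], "var_0_unique?": [False]})

# Step 2: merge train (without target) with the real test rows and count again
toy_mixed = pd.concat([toy_train.drop(columns=["target"]), toy_real_test], ignore_index=True)
toy_mixed = add_unique_flags(toy_mixed)

# Step 3: split back, then append the fake rows to rebuild the full test set
n_train = len(toy_train)
toy_x_train = toy_mixed.iloc[:n_train]
toy_x_test = pd.concat([toy_mixed.iloc[n_train:], toy_fake_test], ignore_index=True)

print(toy_x_train)
print(toy_x_test)
```

Note that the value 2.0 appears once in the toy training set but is no longer unique once the real test rows are included, which is exactly the extra information the combined counts provide.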

In [7]:
"""We concatenate the training set with the real test examples into a mixed data frame"""

mixed_df=pd.concat([
    train.drop(["target"], axis=1), #We drop the target column of the training set because real_test_examples doesn't have one
    real_test_examples.drop(columns=real_test_examples.columns[201:]) #We drop all the "unique?" flag columns so we can do the value counts again
    ],axis=0, ignore_index=True) #Both data frames contain repeated indices, so it is important to reset them to avoid problems

In [8]:
"""Once again we use our function feature_uniqueness, this time on the combined data frame, to find which values are unique"""
mixed_df=feature_uniqueness(mixed_df)

In [9]:
"""Now we are going to create the actual training and testing sets that we are going to use for our model"""
"""We only have to split the mixed_df data frame again: the first 200,000 rows are the training set, and the remaining rows are the real part of the test set"""
x_train=mixed_df[:200000]
x_test=pd.concat([ mixed_df[200000:], fake_test_examples ], axis=0, ignore_index=True)

"""Now we are going to map all True and False values to 1 and 0 respectively"""

for col in x_test.columns[201:]:
    x_test[col]=x_test[col].apply(lambda x: 1 if x==True else 0)

for col in x_train.columns[201:]:
    x_train[col]=x_train[col].apply(lambda x: 1 if x==True else 0)

# TODO: check how to do this more efficiently!

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x_train[col]=x_train[col].apply(lambda x: 1 if x==True else 0)
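
One likely way to do the conversion more efficiently, and without the warning above, is to take an explicit copy of the slice and cast all the flag columns at once with `astype(int)` instead of calling `apply` on every column. A minimal sketch on toy data follows; `mixed`, `x_train_toy` and `flag_cols` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for mixed_df: two training rows followed by one test row
mixed = pd.DataFrame({
    "ID_code": ["train_0", "train_1", "test_0"],
    "var_0": [1.0, 2.0, 2.0],
    "var_0_unique?": [True, False, False],
})

# .copy() makes the slice an independent data frame, so assigning to it is unambiguous
x_train_toy = mixed.iloc[:2].copy()

# Cast every boolean flag column to 0/1 in a single vectorized step
flag_cols = [c for c in x_train_toy.columns if c.endswith("_unique?")]
x_train_toy[flag_cols] = x_train_toy[flag_cols].astype(int)

print(x_train_toy["var_0_unique?"].tolist())  # [1, 0]
```

The original `mixed` frame keeps its boolean column, since only the copy is modified.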


In [10]:
x_test.shape

!ls

Boosting copy.ipynb       juan_submission.csv       x_test.csv
DataGeneration copy.ipynb test.csv                  x_train.csv
README.md                 train.csv


In [11]:
"""We save the training and test sets that are going to be used for the boosting algorithm"""
x_test.to_csv('x_test.csv', index=False)
x_train.to_csv('x_train.csv', index=False)
