In [1]:
! pip install -q kaggle

In [10]:
!git clone https://github.com/7Dany6/MLDM-exam-project.git

Cloning into 'MLDM-exam-project'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.


In [2]:
from google.colab import files

In [3]:
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 68 bytes


In [5]:
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

In [6]:
!kaggle competitions list

ref                                             deadline             category            reward  teamCount  userHasEntered  
----------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
contradictory-my-dear-watson                    2030-07-01 23:59:00  Getting Started     Prizes         61           False  
gan-getting-started                             2030-07-01 23:59:00  Getting Started     Prizes         92           False  
store-sales-time-series-forecasting             2030-06-30 23:59:00  Getting Started  Knowledge        713           False  
tpu-getting-started                             2030-06-03 23:59:00  Getting Started  Knowledge        134           False  
digit-recognizer                                2030-01-01 00:00:00  Getting Started  Knowledge       1241           False  
titanic                                         2030-01-01 00:00:00  Getting Started  Knowledge      14363           False  


In [7]:
!kaggle competitions download -c amex-default-prediction

Downloading amex-default-prediction.zip to /content
100% 20.5G/20.5G [01:45<00:00, 197MB/s]
100% 20.5G/20.5G [01:45<00:00, 208MB/s]


In [8]:
! mkdir data

In [9]:
! unzip amex-default-prediction.zip -d data

Archive:  amex-default-prediction.zip
  inflating: data/sample_submission.csv  
  inflating: data/test_data.csv      
  inflating: data/train_data.csv     
  inflating: data/train_labels.csv   


The datasets are too large to fit comfortably in memory, so reading them with "pd.read_csv" is not appropriate. Instead, we can try the "dask" library.
It reads the files lazily in partitions and can process them in parallel across all available CPU cores.


In [18]:
!pip install dask 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [12]:
from dask import dataframe as dd

In [13]:
test_data = dd.read_csv('/content/data/test_data.csv')
train_data = dd.read_csv('/content/data/train_data.csv')

In [14]:
train_data.head()

Unnamed: 0,customer_ID,S_2,P_2,D_39,B_1,B_2,R_1,S_3,D_41,B_3,...,D_136,D_137,D_138,D_139,D_140,D_141,D_142,D_143,D_144,D_145
0,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-03-09,0.938469,0.001733,0.008724,1.006838,0.009228,0.124035,0.008771,0.004709,...,,,,0.002427,0.003706,0.003818,,0.000569,0.00061,0.002674
1,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-04-07,0.936665,0.005775,0.004923,1.000653,0.006151,0.12675,0.000798,0.002714,...,,,,0.003954,0.003167,0.005032,,0.009576,0.005492,0.009217
2,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-05-28,0.95418,0.091505,0.021655,1.009672,0.006815,0.123977,0.007598,0.009423,...,,,,0.003269,0.007329,0.000427,,0.003429,0.006986,0.002603
3,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-06-13,0.960384,0.002455,0.013683,1.0027,0.001373,0.117169,0.000685,0.005531,...,,,,0.006117,0.004516,0.0032,,0.008419,0.006527,0.0096
4,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-07-16,0.947248,0.002483,0.015193,1.000727,0.007605,0.117325,0.004653,0.009312,...,,,,0.003671,0.004946,0.008889,,0.00167,0.008126,0.009827


So, to work with this data, we need to reduce its memory footprint.

In [15]:
types = train_data.dtypes
print(types)

customer_ID     object
S_2             object
P_2            float64
D_39           float64
B_1            float64
                ...   
D_141          float64
D_142          float64
D_143          float64
D_144          float64
D_145          float64
Length: 190, dtype: object


As we can see, most columns have the type 'float64'. This lets us cut memory usage by converting those columns to 'float32'. We are also given a list of categorical variables: ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']. Converting them to a more compact type will save additional memory.

In [16]:
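categorical_variables = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
# Note: dask applies astype lazily, so a failed cast only surfaces at compute time.
# int8 cannot represent missing or non-numeric values, so this cast assumes
# every categorical column holds small integer codes with no NaN.
for column in categorical_variables:
  train_data[column] = train_data[column].astype('int8')
# Downcast all remaining float64 columns to float32 in a single astype call.
float_columns = [col for col, dtype in train_data.dtypes.items() if dtype == 'float64']
train_data = train_data.astype({col: 'float32' for col in float_columns})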

Let's see the result:

In [17]:
train_data.dtypes

customer_ID     object
S_2             object
P_2            float32
D_39           float32
B_1            float32
                ...   
D_141          float32
D_142          float32
D_143          float32
D_144          float32
D_145          float32
Length: 190, dtype: object
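
The dtype listing confirms the conversion but not how much memory it saves. A rough way to measure that, sketched here with plain pandas on a small synthetic frame, is "memory_usage(deep=True)". The column names "num_feature" and "cat_feature" and their values are made up for illustration, and the 'category' dtype is used for the string column as one possible compact representation, not necessarily the one chosen above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 10_000

# Synthetic stand-in: one float column and one low-cardinality string column.
df = pd.DataFrame({
    "num_feature": rng.random(n_rows),                    # float64 by default
    "cat_feature": rng.choice(["a", "b", "c"], size=n_rows),
})
before = df.memory_usage(deep=True)

# float32 halves the bytes per value; category stores small integer codes
# plus one copy of each distinct label.
slim = df.astype({"num_feature": "float32", "cat_feature": "category"})
after = slim.memory_usage(deep=True)

print("num_feature:", before["num_feature"], "->", after["num_feature"])
print("cat_feature:", before["cat_feature"], "->", after["cat_feature"])
```

Keep in mind that float32 holds roughly 7 significant decimal digits, so the downcast trades some precision for the smaller footprint.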

Finally, we run into a serious problem:
It turns out to be impossible to carry out the computations with dask. We therefore need to switch to another, more appropriate way of processing the data.
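
The notebook does not say which approach comes next. Purely as an illustration of one common option, the sketch below reads a CSV in fixed-size chunks with pandas and combines per-chunk partial aggregates, so only one chunk is in memory at a time. The in-memory CSV, the chunk size, and the per-customer mean are all assumptions chosen to keep the example small.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "customer_ID,P_2\n" + "".join(
    f"c{i % 5},{i / 100}\n" for i in range(500)
)

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading everything at once.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=100,
                     dtype={"P_2": "float32"})

# Keep only a small partial aggregate (sum and count) from each chunk.
partials = [
    chunk.groupby("customer_ID")["P_2"].agg(["sum", "count"])
    for chunk in reader
]

# Combine the partial results and derive the per-customer mean.
combined = pd.concat(partials).groupby(level=0).sum()
means = combined["sum"] / combined["count"]
print(means)
```

This pattern works for aggregates that can be merged from partial results (sums, counts, minima, maxima); statistics such as medians need a different strategy.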