## Memory reduction trick
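- `customer_id` is a 64-character string, so each value needs at least 64 bytes of memory (more in pandas, where each value is a separate Python string object).
- Converting it to `int64` brings that down to 8 bytes per value.
- Source: https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635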
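
As a minimal sketch of the idea in the bullets above, the snippet below converts one hexadecimal ID into a signed 64-bit integer by keeping only its last 16 hex digits (16 × 4 = 64 bits). The sample ID and the helper name `hex_tail_to_int64` are made up for illustration; they do not come from the competition data.

```python
def hex_tail_to_int64(customer_id: str) -> int:
    """Hypothetical helper: map the last 16 hex digits of an ID to a signed 64-bit value."""
    unsigned = int(customer_id[-16:], 16)  # range 0 .. 2**64 - 1
    # Reinterpret as two's complement so the result fits in int64.
    return unsigned - 2**64 if unsigned >= 2**63 else unsigned

# Synthetic 64-character hex ID (assumption, not real data).
sample_id = "0" * 48 + "ffffffffffffffff"
converted = hex_tail_to_int64(sample_id)
print(len(sample_id), converted)  # prints: 64 -1
```

Tails with the top bit set map to negative numbers, which is why an `int64` column produced this way mixes positive and negative values.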

### Using pandas

In [18]:
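import pandas as pd

path = 'D:/Kaggle/H&M/'  # local data directory
train = pd.read_csv(path + 'transactions_train.csv')
# deep=True counts the Python string objects, not just the 8-byte pointers to them
print(f'Memory usage is {train.customer_id.memory_usage(deep=True)}')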
print(f'Length of customer_id variable is {len(train.customer_id[0])}')
train.customer_id

Memory usage is 3846387332
Length of customer_id variable is 64


0           000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...
1           000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...
2           00007d2de826758b65a93dd24ce629ed66842531df6699...
3           00007d2de826758b65a93dd24ce629ed66842531df6699...
4           00007d2de826758b65a93dd24ce629ed66842531df6699...
                                  ...                        
31788319    fff2282977442e327b45d8c89afde25617d00124d0f999...
31788320    fff2282977442e327b45d8c89afde25617d00124d0f999...
31788321    fff380805474b287b05cb2a7507b9a013482f7dd0bce0e...
31788322    fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e5...
31788323    fffef3b6b73545df065b521e19f64bf6fe93bfd450ab20...
Name: customer_id, Length: 31788324, dtype: object

In [16]:
# Keep the last 16 hex digits (64 bits); values with the top bit set become negative as int64
train['customer_id'] = (
    train['customer_id'].apply(lambda x: int(x[-16:], 16)).astype('int64')
)
print(f'Memory usage is {train.customer_id.memory_usage(deep=True)}')
train.customer_id

Memory usage is 254306720


0             -6846340800584936
1             -6846340800584936
2          -8334631767138808638
3          -8334631767138808638
4          -8334631767138808638
                   ...         
31788319    4685485978980270934
31788320    4685485978980270934
31788321    3959348689921271969
31788322   -8639340045377511665
31788323    3235222691137941515
Name: customer_id, Length: 31788324, dtype: int64
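
Keeping only the last 16 hex digits discards the other 48, so two different IDs could in principle end up with the same integer. A quick check, sketched here on synthetic IDs (not the real data), compares the number of unique tails with the number of unique full IDs before relying on the shortened key.

```python
import pandas as pd

# Synthetic stand-in for the customer_id column (assumption: real IDs are 64-char hex strings).
ids = pd.Series(
    ["0" * 48 + f"{i:016x}" for i in (1, 2, 2**63 + 5)],
    name="customer_id",
)

tails = ids.str[-16:]  # the part that survives the int64 conversion
# The shortened key is safe only if it is as unique as the full ID.
no_collisions = tails.nunique() == ids.nunique()
print(no_collisions)
```

On the real data the same comparison would presumably be run against `customers.customer_id`, which lists every customer once.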

### Using RAPIDS cuDF

In [None]:
import cudf
path = 'D:/Kaggle/H&M/'
train = cudf.read_csv(path + 'transactions_train.csv')

In [None]:
# Same conversion on the GPU: slice the last 16 hex digits, then parse them as an integer
train['customer_id'] = (
    train['customer_id'].str[-16:].str.hex_to_int().astype('int64')
)
print(f'Memory usage is {train.customer_id.memory_usage(deep=True)}')
train.customer_id

### Using index

In [32]:
path = 'D:/Kaggle/H&M/'
customers = pd.read_csv(path + 'customers.csv')
sub = pd.read_csv(path + 'sample_submission.csv')
customers.customer_id

0          00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...
1          0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...
2          000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...
3          00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...
4          00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...
                                 ...                        
1371975    ffffbbf78b6eaac697a8a5dfbfd2bfa8113ee5b403e474...
1371976    ffffcd5046a6143d29a04fb8c424ce494a76e5cdf4fab5...
1371977    ffffcf35913a0bee60e8741cb2b4e78b8a98ee5ff2e6a1...
1371978    ffffd7744cebcf3aca44ae7049d2a94b87074c3d4ffe38...
1371979    ffffd9ac14e89946416d80e791d064701994755c3ab686...
Name: customer_id, Length: 1371980, dtype: object

In [34]:
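id_to_index_dict = dict(zip(customers["customer_id"], customers.index))
index_to_id_dict = dict(zip(customers.index, customers["customer_id"]))

# For memory efficiency: replace each string ID with its row index in customers.
# train["customer_id"] must still hold the original strings here
# (reload train if it was already converted to int64 above).
train["customer_id"] = train["customer_id"].map(id_to_index_dict)

# To switch back for submission: assumes sub["customer_id"] holds indices at this point
sub["customer_id"] = sub["customer_id"].map(index_to_id_dict)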

In [40]:
train.customer_id

0                 2
1                 2
2                 7
3                 7
4                 7
             ...   
31788319    1371691
31788320    1371691
31788321    1371721
31788322    1371747
31788323    1371960
Name: customer_id, Length: 31788324, dtype: int64
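
The index mapping above can be sketched end to end on toy data. The frames and IDs below are invented for illustration. Two assumptions are worth flagging: the cast to `int32` is safe only while the number of customers stays below 2**31 (the 1,371,980 customers shown above do), and every ID in `train` must appear in `customers`, otherwise `map` yields NaN and the cast fails.

```python
import pandas as pd

# Tiny synthetic stand-ins for customers.csv and transactions_train.csv (assumption).
customers = pd.DataFrame({"customer_id": ["aaa", "bbb", "ccc"]})
train = pd.DataFrame({"customer_id": ["ccc", "aaa", "ccc", "bbb"]})

id_to_index = dict(zip(customers["customer_id"], customers.index))

# Encode: string ID -> row index, stored in 4 bytes instead of 8.
train["customer_id"] = train["customer_id"].map(id_to_index).astype("int32")

# Decode: a positional lookup into the original ID column restores the strings exactly.
restored = customers["customer_id"].to_numpy()[train["customer_id"].to_numpy()]
print(train["customer_id"].tolist(), list(restored))
```

Unlike the truncated-hash conversion, this encoding is exactly reversible, and at 4 bytes per value it takes half the space of an `int64` column.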