In [1]:
!ls -alh ../data

total 553840
drwxr-xr-x  14 chi  staff   448B Sep 14 19:49 .
drwxr-xr-x  10 chi  staff   320B Sep 14 19:51 ..
-rw-r--r--@  1 chi  staff   8.0K Aug 25 12:18 .DS_Store
drwxr-xr-x   3 chi  staff    96B Aug 20 23:05 .ipynb_checkpoints
-rw-r--r--   1 chi  staff    17M Sep 14 19:49 data_test.zip
-rw-r--r--   1 chi  staff    41M Sep 14 19:49 data_train_1.zip
-rw-r--r--   1 chi  staff    76M Sep 14 19:49 data_train_2.zip
-rw-r--r--   1 chi  staff    16M Sep 14 19:49 data_train_3.zip
-rw-r--r--   1 chi  staff    53M Aug 24 13:47 datasource1.csv
-rw-r--r--   1 chi  staff   478B Aug 25 09:54 datasource1_small.csv
-rw-r--r--   1 chi  staff    35M Aug 24 13:48 datasource2.json
-rw-r--r--   1 chi  staff    28M Aug 24 13:48 datasource3.csv
-rw-r--r--   1 chi  staff   4.7M Sep 14 19:49 sample_submission.csv
drwxr-xr-x   4 chi  staff   128B Aug 25 12:18 transformed_data


In [2]:
!head ../data/sample_submission.csv

id,returned
a3173126f1e0e5e456a5c74d4c0cfeb2,0.3745401188473625
d85ad1398aca4b9fc5768dc55ba916b0,0.9507143064099162
89d75016827f0872e8a2175353062b1e,0.7319939418114051
f6a8725e35f22ab9ec58ae59028f37b3,0.5986584841970366
8bb525cd90856cb8cacad2bc6c63fbdd,0.15601864044243652
5ead6e0f89f0f5e511bcef658ccede19,0.15599452033620265
d4cc85514e5e9efe513e3f7508fc24e7,0.05808361216819946
18a498699b1231c205d651d75c9b173b,0.8661761457749352
13d8a4b2a270afd12535d03cd2518aaf,0.6011150117432088


Now we need to decompress the zip files.

In [3]:
! mkdir data

In [4]:
from zipfile import ZipFile

for i in range(1,4):
    # Create a ZipFile Object and load data_train_*.zip in it
    with ZipFile(f'../data/data_train_{i}.zip', 'r') as zipObj:
        # Extract all the contents of zip file in different directory
        zipObj.extractall(f'data/data_train_{i}')
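Before extracting an archive it can help to peek at what is inside. Here is a minimal sketch using `ZipFile.namelist()`; the archive name `toy_train.zip` and its contents are made up so the example runs on its own.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "toy_train.zip"  # hypothetical archive, not one of the real files

    # Build a tiny archive so the example is self-contained.
    with ZipFile(archive, "w") as zf:
        zf.writestr("datasource1.csv", "id,value\n1,a\n")
        zf.writestr("datasource2.json", "{}")

    with ZipFile(archive, "r") as zf:
        members = zf.namelist()            # file names stored in the archive
        zf.extractall(Path(tmp) / "out")   # same call as in the loop above

    extracted = sorted(p.name for p in (Path(tmp) / "out").iterdir())

print(members)
print(extracted)
```

Listing the members first tells you which file types to expect before anything is written to disk.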


### Data train 1

In [5]:
ls data/data_train_1/

datasource1.csv   datasource2.json  datasource3.csv


In [6]:
import pandas as pd

In [7]:
df1 = pd.read_csv('data/data_train_1/datasource1.csv')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte

First error! It's a `UnicodeDecodeError`, which means the file is not valid UTF-8, the encoding pandas assumes by default. There are a huge number of encodings available, so we can use [chardet](https://chardet.readthedocs.io/en/latest/usage.html) to guess the right one.

To use it we just have to read the file in binary mode (the `rb` flag). We only read the first 1000 bytes (so it's faster) and pass them to `chardet.detect`.

In [8]:
import chardet

In [9]:
with open('data/data_train_1/datasource1.csv', 'rb') as fname:
    # read only the first 1000 bytes instead of loading the whole file
    raw_data = fname.read(1000)

chardet.detect(raw_data)

{'encoding': 'windows-1251',
 'confidence': 0.5730969056434915,
 'language': 'Bulgarian'}

In [10]:
df1 = pd.read_csv('data/data_train_1/datasource1.csv', encoding='windows-1251')
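To see why the default decoder choked, here is a small sketch that needs no files at all: it encodes the greeting from the dataset as windows-1251 and then tries to read those bytes back as UTF-8. The variable names are just for illustration.

```python
text = "Срећно!"
raw = text.encode("windows-1251")  # the bytes as they would sit in the file

try:
    raw.decode("utf-8")            # what a UTF-8 reader attempts
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print(err)

roundtrip = raw.decode("windows-1251")  # the right codec gives the text back
print(hex(raw[0]), utf8_ok, roundtrip)
```

The first byte is `0xd1`, the same byte the traceback complained about: in windows-1251 it is the letter `С`, while in UTF-8 it announces a two-byte sequence that never arrives.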

In [11]:
df1.head()

Unnamed: 0,id,tierafterorder,orderportalid,size,orderdate_gmt,hasusedwishlist,ldsa_team_wishes_you
0,cfcd208495d565ef66e7dff9f98764da,T4,1,1,2018-01-01 00:00:19.733000+00:00,Yes,Срећно! (Good luck!)
1,c4ca4238a0b923820dcc509a6f75849b,T3,2,2,2018-01-01 00:00:42.540000+00:00,Yes,Срећно! (Good luck!)
2,c81e728d9d4c2f636f067f89cc14862c,T3,3,3,2018-01-01 00:01:15.893000+00:00,No,Срећно! (Good luck!)
3,eccbc87e4b5ce2fe28308fd9f2a7baf3,T3,3,4,2018-01-01 00:01:15.893000+00:00,No,Срећно! (Good luck!)
4,a87ff679a2f3e71d9181a67b7542122c,T4,4,5,2018-01-01 00:01:51.450000+00:00,No,Срећно! (Good luck!)


Let's set the id as the index for all the files.

In [12]:
df1 = df1.set_index('id')

Done! Let's go to the second file of this folder.

In [13]:
ls data/data_train_1/

datasource1.csv   datasource2.json  datasource3.csv


It seems to be a json file...

In [14]:
with open('data/data_train_1/datasource2.json') as fname:
    df2 = pd.read_json(fname, lines=True)

In [15]:
df2.head()

Unnamed: 0,columns,data,index
0,"[shipper, productid, isreseller, issale, categ...","[[1, 1, No, Yes, Accessories], [2, 2, No, No, ...","[cfcd208495d565ef66e7dff9f98764da, c4ca4238a0b..."


OK, that didn't work: the whole file was parsed as a single row with `columns`, `data` and `index` fields. Maybe it uses a different orientation? Let's check the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) and preview the JSON.

In [18]:
# this shows the first 500 bytes of the file
! head -c 500 data/data_train_1/datasource2.json

{"columns":["shipper","productid","isreseller","issale","category_1stlevel"],"index":["cfcd208495d565ef66e7dff9f98764da","c4ca4238a0b923820dcc509a6f75849b","c81e728d9d4c2f636f067f89cc14862c","eccbc87e4b5ce2fe28308fd9f2a7baf3","a87ff679a2f3e71d9181a67b7542122c","e4da3b7fbbce2345d7772b0674a318d5","1679091c5a880faf6fb5e6087eb1b2dc","8f14e45fceea167a5a36dedd4bea2543","c9f0f895fb98ab9159f51fd0297e236d","45c48cce2e2d7fbdea1afc51c7c6ad26","d3d9446802a44259755d38e6d163e820","6512bd43d9caa6e02c990b0a8265

In [19]:
with open('data/data_train_1/datasource2.json') as fname:
    df2 = pd.read_json(fname, orient='split')

In [20]:
df2.head()

Unnamed: 0,shipper,productid,isreseller,issale,category_1stlevel
cfcd208495d565ef66e7dff9f98764da,1,1,No,Yes,Accessories
c4ca4238a0b923820dcc509a6f75849b,2,2,No,No,Clothing
c81e728d9d4c2f636f067f89cc14862c,2,3,No,No,Clothing
eccbc87e4b5ce2fe28308fd9f2a7baf3,2,4,No,Yes,Shoes
a87ff679a2f3e71d9181a67b7542122c,2,5,No,Yes,Shoes
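As a rough illustration of what `orient='split'` expects, here is a toy document with the same three keys (`columns`, `index`, `data`) as the preview above. The ids and values are invented, and the text is wrapped in `StringIO` so nothing is read from disk.

```python
from io import StringIO

import pandas as pd

# Toy JSON in the "split" layout: column names, row labels and row values
# are stored as three separate lists.
split_json = StringIO(
    '{"columns": ["shipper", "issale"],'
    ' "index": ["id_a", "id_b"],'
    ' "data": [[1, "Yes"], [2, "No"]]}'
)

toy = pd.read_json(split_json, orient="split")
print(toy)
```

Each inner list in `data` becomes one row, labelled by the matching entry in `index`.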


Ok, let's go to the third file!

In [21]:
!head data/data_train_1/datasource3.csv

id|tierbeforeorder|ddprate|platform|style|region
cfcd208495d565ef66e7dff9f98764da||0.0|web|1|1
c4ca4238a0b923820dcc509a6f75849b|T3|5.0083|app|2|1
c81e728d9d4c2f636f067f89cc14862c||42.2351||3|1
eccbc87e4b5ce2fe28308fd9f2a7baf3||42.2351|web|4|1
a87ff679a2f3e71d9181a67b7542122c||5.0083|web|5|1
e4da3b7fbbce2345d7772b0674a318d5|T2|0.0|web|6|2
1679091c5a880faf6fb5e6087eb1b2dc|T2|0.0|web|7|2
8f14e45fceea167a5a36dedd4bea2543|T2|0.0|web|8|2
c9f0f895fb98ab9159f51fd0297e236d|T4|0.0|web|9|2


Simple: it's a CSV, but with a pipe `|` as the separator.

This is why running a quick `!head` before loading a file into pandas is so helpful!

In [22]:
df3 = pd.read_csv('data/data_train_1/datasource3.csv', sep='|')
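If you would rather not eyeball the separator, the standard library's `csv.Sniffer` can make a guess from a sample of the text. This is only a sketch: the sample below is copied from the `!head` output, and the list of candidate delimiters is an assumption about what is worth trying.

```python
import csv

sample = (
    "id|tierbeforeorder|ddprate|platform|style|region\n"
    "cfcd208495d565ef66e7dff9f98764da||0.0|web|1|1\n"
    "c4ca4238a0b923820dcc509a6f75849b|T3|5.0083|app|2|1\n"
    "c81e728d9d4c2f636f067f89cc14862c||42.2351||3|1\n"
)

# Restrict the guess to a few plausible separators.
dialect = csv.Sniffer().sniff(sample, delimiters=",|;\t")
print(repr(dialect.delimiter))
```

The sniffer is a heuristic, so it is worth treating its answer as a hint and still looking at the file.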

In [23]:
df3.head()

Unnamed: 0,id,tierbeforeorder,ddprate,platform,style,region
0,cfcd208495d565ef66e7dff9f98764da,,0.0,web,1,1
1,c4ca4238a0b923820dcc509a6f75849b,T3,5.0083,app,2,1
2,c81e728d9d4c2f636f067f89cc14862c,,42.2351,,3,1
3,eccbc87e4b5ce2fe28308fd9f2a7baf3,,42.2351,web,4,1
4,a87ff679a2f3e71d9181a67b7542122c,,5.0083,web,5,1


In [24]:
df3 = df3.set_index('id')

Done! Let's move to the second folder.

### Data train 2

In [25]:
!ls -alh data/data_train_2/

total 375472
drwxr-xr-x  5 chi  staff   160B Sep 14 19:54 .
drwxr-xr-x  5 chi  staff   160B Sep 14 19:54 ..
-rw-r--r--  1 chi  staff    65M Sep 14 19:54 datasource4.html
-rw-r--r--  1 chi  staff    84M Sep 14 19:54 datasource5.csv
-rw-r--r--  1 chi  staff    34M Sep 14 19:54 datasource7.csv


Let's proceed with datasource4, it seems like an html file?

In [26]:
!head -n 20 data/data_train_2/datasource4.html

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>isusingmultipledevices</th>
      <th>freereturn</th>
      <th>userid</th>
    </tr>
    <tr>
      <th>id</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>cfcd208495d565ef66e7dff9f98764da</th>
      <td>No</td>
      <td>1</td>


Let's try `pandas.read_html`!

In [27]:
# This takes a while to run...
df4 = pd.read_html('./data/data_train_2/datasource4.html', attrs = {'class': 'dataframe'})

`read_html` returns a list.

In [28]:
len(df4)

1

There is only one table in the HTML, so we take the first element of the list.

In [29]:
df4 = df4[0]

In [30]:
df4.head()

Unnamed: 0_level_0,Unnamed: 0_level_0,isusingmultipledevices,freereturn,userid
Unnamed: 0_level_1,id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,cfcd208495d565ef66e7dff9f98764da,No,1,1
1,c4ca4238a0b923820dcc509a6f75849b,No,1,2
2,c81e728d9d4c2f636f067f89cc14862c,No,1,3
3,eccbc87e4b5ce2fe28308fd9f2a7baf3,No,1,3
4,a87ff679a2f3e71d9181a67b7542122c,No,1,4


In [31]:
df4.columns

MultiIndex(levels=[['Unnamed: 0_level_0', 'freereturn', 'isusingmultipledevices', 'userid'], ['Unnamed: 1_level_1', 'Unnamed: 2_level_1', 'Unnamed: 3_level_1', 'id']],
           codes=[[0, 2, 1, 3], [3, 0, 1, 2]])

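The columns are a MultiIndex! We need to flatten it so that we can use `id` as the index. Dropping the top level keeps `id` but loses the other column names, so we restore them by hand afterwards.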

In [32]:
df4.columns = df4.columns.droplevel(level=0)

In [33]:
df4.head()

Unnamed: 0,id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,cfcd208495d565ef66e7dff9f98764da,No,1,1
1,c4ca4238a0b923820dcc509a6f75849b,No,1,2
2,c81e728d9d4c2f636f067f89cc14862c,No,1,3
3,eccbc87e4b5ce2fe28308fd9f2a7baf3,No,1,3
4,a87ff679a2f3e71d9181a67b7542122c,No,1,4


In [36]:
df4.columns = ['id', 'isusingmultipledevices', 'freereturn', 'userid']

In [37]:
df4.head()

Unnamed: 0,id,isusingmultipledevices,freereturn,userid
0,cfcd208495d565ef66e7dff9f98764da,No,1,1
1,c4ca4238a0b923820dcc509a6f75849b,No,1,2
2,c81e728d9d4c2f636f067f89cc14862c,No,1,3
3,eccbc87e4b5ce2fe28308fd9f2a7baf3,No,1,3
4,a87ff679a2f3e71d9181a67b7542122c,No,1,4


In [38]:
df4 = df4.set_index('id')
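Renaming by hand is fine for four columns. For wider tables, one possible approach is to pick, for each column, whichever level is not an `Unnamed: ...` placeholder. The helper `pick_real_name` below is hypothetical, and the toy frame only imitates the header that `read_html` produced.

```python
import pandas as pd

# Toy frame with the same two-level header shape as df4 (values invented).
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "id"),
    ("isusingmultipledevices", "Unnamed: 1_level_1"),
    ("freereturn", "Unnamed: 2_level_1"),
    ("userid", "Unnamed: 3_level_1"),
])
toy = pd.DataFrame([["abc", "No", 1, 1]], columns=columns)


def pick_real_name(levels):
    """Hypothetical helper: return the first label that is not a placeholder."""
    named = [lvl for lvl in levels if not str(lvl).startswith("Unnamed:")]
    return named[0] if named else levels[0]


# Iterating over a MultiIndex yields one tuple of labels per column.
toy.columns = [pick_real_name(col) for col in toy.columns]
toy = toy.set_index("id")
print(toy)
```

This relies on the placeholders all starting with `Unnamed:`, which holds for the header shown above but is worth checking on other files.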

Done! Let's move to file 5.

In [39]:
# this might crash your browser
# !head data/data_train_2/datasource5.csv

OK, this file disguises itself as a CSV but it's actually JSON! A good reminder that file extensions don't guarantee anything about the contents!

In [40]:
df5 = pd.read_json('data/data_train_2/datasource5.csv')
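When `!head` is too risky because the file may be one enormous line, reading just the first few bytes is a safer way to check what you are dealing with. The helper `guess_format` below is a hypothetical, deliberately crude heuristic, and the two files are created on the fly for the demo.

```python
import tempfile
from pathlib import Path


def guess_format(path, n_bytes=200):
    """Hypothetical helper: guess 'json' or 'text' from the first bytes only."""
    with open(path, "rb") as fh:
        head = fh.read(n_bytes).lstrip()
    # JSON documents start with an object or an array.
    return "json" if head[:1] in (b"{", b"[") else "text"


with tempfile.TemporaryDirectory() as tmp:
    disguised = Path(tmp) / "disguised.csv"   # JSON hiding behind a .csv name
    disguised.write_text('{"isvip": {"0000": "Not VIP"}}')
    plain = Path(tmp) / "plain.csv"
    plain.write_text("id,value\n1,a\n")

    guesses = (guess_format(disguised), guess_format(plain))

print(guesses)
```

Only `n_bytes` are ever read, so the check stays cheap no matter how large the file is.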

In [41]:
df5.head()

Unnamed: 0,isvip,brand,promocode,designer
00003e3b9e5336685200ae85d21b4f5e,Not VIP,50,1,4133
000053b1e684c9e7ea73727b2238ce18,Not VIP,640,921,14465
00005d011db80a956aab176cc94d1d37,Not VIP,36,1,11691
0000a0f5746d603088ac968c91b085b5,VIP,279,1,34489
0000b2815cc3c2b56867cbbf4d36efa5,Not VIP,17,924,66837


Done! Let's go to file 6.

In [42]:
!ls -alh data/data_train_2/

total 375472
drwxr-xr-x  5 chi  staff   160B Sep 14 19:54 .
drwxr-xr-x  5 chi  staff   160B Sep 14 19:54 ..
-rw-r--r--  1 chi  staff    65M Sep 14 19:54 datasource4.html
-rw-r--r--  1 chi  staff    84M Sep 14 19:54 datasource5.csv
-rw-r--r--  1 chi  staff    34M Sep 14 19:54 datasource7.csv


Well, it seems that there is no file 6 here, let's go to file 7 then.

In [43]:
!head data/data_train_2/datasource7.csv

id,countrycode,countryoforigin,userfraudstatus,useless values (delete please),test strings (delete please)
91b48f444340e38b245224d3cf495dd6,48,20,3,0.5972732194313508,tMdME
68c26935d45bf7340b70c481e2578906,WEBSITE,WEBSITE,WEBSITE,0.31678691699869344,gJtKD
5ef781f891290d7ff10c44fc54f28829,14,12,3,0.6405539685733376,K6iBC
fe548a297e49243c8838b61221a09def,5,12,3,0.34994931238397897,0mcLU
bdb4fc61cdf6cfadfd20db40fa1a64d2,17,4,1,0.6095160940464983,iBSjr
24fc676fd9f7bba69640c6b1fd5c52f5,3,1,3,0.6492229983006353,2P5o3
c6321d2275686eee40744053a81dd7fc,2,1,3,0.48500347145892464,8tRfn
d6557be41069a1036c6cedf514f17346,46,26,3,0.6455367880637098,4o4rc
5d9920797a7f15da0ac6b68a0e3db25e,3,1,3,0.60239127186448,UcbXQ


OK, this looks like a CSV that someone has tampered with. Since CSVs are just text files, this happens sometimes.

No problem: we will skip the header and leave out the columns that are useless and need to be deleted (the file is asking very nicely, after all...).

In [44]:
df7 = pd.read_csv('data/data_train_2/datasource7.csv', 
                  header=None, 
                  skiprows=1,
                  usecols=[0,1,2,3],
                  names=['id','countrycode','countryoforigin','userfraudstatus']
                 )
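An alternative sketch, in case the positions of the unwanted columns are not known in advance: read the file normally and then drop every column whose name contains the marker text. The marker string `"delete please"` comes from the header above; the rows are invented.

```python
from io import StringIO

import pandas as pd

raw = StringIO(
    "id,countrycode,useless values (delete please),test strings (delete please)\n"
    "aaa,48,0.59,tMdME\n"
    "bbb,14,0.64,K6iBC\n"
)
toy = pd.read_csv(raw)

# Boolean mask over the column names: True for the columns we want to keep.
keep = ~toy.columns.str.contains("delete please", regex=False)
toy = toy.loc[:, keep]
print(toy)
```

This reads the junk columns into memory before discarding them, so `usecols` remains the lighter option for big files.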

In [45]:
df7.head()

Unnamed: 0,id,countrycode,countryoforigin,userfraudstatus
0,91b48f444340e38b245224d3cf495dd6,48,20,3
1,68c26935d45bf7340b70c481e2578906,WEBSITE,WEBSITE,WEBSITE
2,5ef781f891290d7ff10c44fc54f28829,14,12,3
3,fe548a297e49243c8838b61221a09def,5,12,3
4,bdb4fc61cdf6cfadfd20db40fa1a64d2,17,4,1


In [46]:
df7 = df7.set_index('id')

Done!

### Data train 3

In [47]:
!ls -alh data/data_train_3/

total 0
drwxr-xr-x     3 chi  staff    96B Sep 14 19:54 .
drwxr-xr-x     5 chi  staff   160B Sep 14 19:54 ..
drwxr-xr-x  5002 chi  staff   156K Sep 14 19:54 datasource6


Ok, here is the datasource6, which seems to be another folder?

In [49]:
# try running it without the "| head -10" to see all the files
!ls -alh data/data_train_3/datasource6/ | head -10

total 118248
drwxr-xr-x  5002 chi  staff   156K Sep 14 19:54 .
drwxr-xr-x     3 chi  staff    96B Sep 14 19:54 ..
-rw-r--r--     1 chi  staff   8.1K Sep 14 19:54 0.csv
-rw-r--r--     1 chi  staff   8.8K Sep 14 19:54 1.csv
-rw-r--r--     1 chi  staff   8.3K Sep 14 19:54 10.csv
-rw-r--r--     1 chi  staff   8.7K Sep 14 19:54 100.csv
-rw-r--r--     1 chi  staff   8.6K Sep 14 19:54 1000.csv
-rw-r--r--     1 chi  staff   8.4K Sep 14 19:54 1001.csv
-rw-r--r--     1 chi  staff   8.7K Sep 14 19:54 1002.csv


OK, there seem to be a lot of CSV files. Since they have similar names, they were probably created by some automatic process. This is good because it means that we can use the same function to read all of them.

In [50]:
!head data/data_train_3/datasource6/1000.csv

id,ddpsubcategory,shiptypeid,hasitemsonbag,country
6a6331cd682c28bd1d95cb865c69cbdd,"Trousers, overalls, shorts",2,Yes,1
d2f358973ec4382ff877a9f928876e90,WEBSITE,WEBSITE,WEBSITE,WEBSITE
e350ebea9bef9cd420f59c34dfa4f589,WEBSITE,WEBSITE,WEBSITE,WEBSITE
a8d972f22527b09a03564f854a3eb9a3,"Jerseys, pullovers, cardigans, waistcoats and similar articles, knitted or crocheted",2,Yes,1
2d273973a88b3ab45f0d0763300b0695,"Handbags, whether or not with shoulder strap, including those without handle",2,Yes,1
ac0ce4c020f526032faee133bea0673b,API,API,API,API
1d5d5a604ba4a3c33a00faf9494c6dbf,Other footwear with outer soles of leather,2,Yes,1
c556b66075e6ceb64e3b73914c1e1cd6,Footwear with outer soles of rubber or plastics,2,No,5
f26d104eb67d84cc4c8f2cb272501b56,N/D,3,No,9


We will create a list of dataframes and concatenate them.

In [51]:
from pathlib import Path

datasource6 = Path('data/data_train_3/datasource6/')

In [52]:
# this takes a while to run
ind_dfs = []
for file in datasource6.glob('*.csv'):  # only pick up the csv files
    ind_dfs.append(pd.read_csv(file))

In [53]:
df6 = pd.concat(ind_dfs)

In [54]:
df6.shape

(543341, 5)
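The same read-and-concatenate pattern on a toy folder, as a self-contained sketch. The file names and contents are invented; `ignore_index=True` is an optional extra that gives the combined frame a fresh 0..n-1 index instead of repeating each file's own row numbers.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for n in range(3):
        (folder / f"{n}.csv").write_text(f"id,country\nid_{n}a,1\nid_{n}b,2\n")
    (folder / "notes.txt").write_text("not a csv")  # ignored by the pattern below

    # sorted() makes the read order reproducible; glob order is not guaranteed.
    parts = [pd.read_csv(path) for path in sorted(folder.glob("*.csv"))]
    combined = pd.concat(parts, ignore_index=True)

print(combined.shape)
```

Building the list first and calling `pd.concat` once is usually much faster than growing a dataframe inside the loop.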

In [55]:
df6.head()

Unnamed: 0,id,ddpsubcategory,shiptypeid,hasitemsonbag,country
0,e96faaa362160d37b30cc3c06e684b53,"Coats: overcoats, raincoat, cape, cloak, ski j...",2,No,1
1,2c68139e18d69fa4e79fc00a36ddf78a,"Hats and other headgear, knitted or crocheted,...",2,No,1
2,8933495921cf8651a3deacb715c3970a,"Jerseys, pullovers, cardigans, waistcoats and ...",2,Yes,2
3,99594d0d5f1bc469d458e58d35823e10,"Trousers, overalls, shorts",3,No,4
4,2cac0e6a7cc3038c3e3da05592cc2649,"Trousers, overalls, shorts",3,No,4


In [56]:
df6 = df6.set_index('id')

Done!

Now we merge them all!

In [57]:
files_df = df1.join(df2).join(df3).join(df4).join(df5).join(df6).join(df7)

In [58]:
files_df.head()

id,tierafterorder,orderportalid,size,orderdate_gmt,hasusedwishlist,ldsa_team_wishes_you,shipper,productid,isreseller,issale,...,brand,promocode,designer,ddpsubcategory,shiptypeid,hasitemsonbag,country,countrycode,countryoforigin,userfraudstatus
cfcd208495d565ef66e7dff9f98764da,T4,1,1,2018-01-01 00:00:19.733000+00:00,Yes,Срећно! (Good luck!),1,1,No,Yes,...,1,1,1,Articles of a kind normally carried in the poc...,1,No,1,1,1,1
c4ca4238a0b923820dcc509a6f75849b,T3,2,2,2018-01-01 00:00:42.540000+00:00,Yes,Срећно! (Good luck!),2,2,No,No,...,2,1,2,,2,No,1,1,2,2
c81e728d9d4c2f636f067f89cc14862c,T3,3,3,2018-01-01 00:01:15.893000+00:00,No,Срећно! (Good luck!),2,3,No,No,...,3,1,3,"Jerseys, pullovers, cardigans, waistcoats and ...",2,Yes,2,2,2,1
eccbc87e4b5ce2fe28308fd9f2a7baf3,T3,3,4,2018-01-01 00:01:15.893000+00:00,No,Срећно! (Good luck!),2,4,No,Yes,...,4,1,4,Footwear with outer soles of rubber or plastics,2,Yes,2,2,2,1
a87ff679a2f3e71d9181a67b7542122c,T4,4,5,2018-01-01 00:01:51.450000+00:00,No,Срећно! (Good luck!),2,5,No,Yes,...,5,1,5,Footwear with outer soles of rubber or plastics,2,No,1,1,1,3
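Worth keeping in mind: `DataFrame.join` aligns on the index and, by default, keeps only the rows of the left dataframe (a left join). A toy sketch with invented ids shows what that means for rows that are missing on either side.

```python
import pandas as pd

left = pd.DataFrame({"size": [1, 2]}, index=pd.Index(["id_x", "id_y"], name="id"))
right = pd.DataFrame({"brand": [10, 20]}, index=pd.Index(["id_y", "id_z"], name="id"))

joined = left.join(right)  # default how="left"
print(joined)
```

`id_x` has no match on the right, so its `brand` is `NaN`, and `id_z` is dropped because it does not appear on the left. That is one way an empty value, like the blank `ddpsubcategory` in the second row above, can arise.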


And output the resulting dataframe to a pickle in the `data/clean` directory!

In [60]:
! mkdir data/clean

In [61]:
files_df.to_pickle('data/clean/files_df.pkl')
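A small sketch of the same last step done entirely from Python, assuming you want the cell to be safe to re-run: `Path.mkdir(..., exist_ok=True)` does not complain if the directory is already there, and `pd.read_pickle` loads the dataframe back. The paths and the toy dataframe are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"returned": [0, 1]},
                   index=pd.Index(["id_a", "id_b"], name="id"))

with tempfile.TemporaryDirectory() as tmp:
    clean_dir = Path(tmp) / "data" / "clean"
    clean_dir.mkdir(parents=True, exist_ok=True)
    clean_dir.mkdir(parents=True, exist_ok=True)  # running it again is harmless

    target = clean_dir / "toy.pkl"
    toy.to_pickle(target)
    restored = pd.read_pickle(target)

print(restored.equals(toy))
```

A pickle keeps dtypes and the index exactly as they were, but it should only be loaded from sources you trust.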