In [3]:
# Load all helper functions
%run -i 'source.py'

# Prepare input data

## Loading in the Data

In [5]:
INPUT_PATH = '../input/'
print('File sizes')
for f in os.listdir(INPUT_PATH):
    if 'zip' not in f:
        print(" - " + f.ljust(30) + mem_size(os.path.getsize(INPUT_PATH + f)))

File sizes
 - .test_supplement.csv          398.0 B
 - .sample_submission.csv        104.0 B
 - .test.csv                     184.0 B
 - .train.csv                    980.0 B
 - train_sample.csv              3.89 MB
 - test_supplement.csv           2.48 GB
 - test.csv                      823.28 MB
 - sample_submission_adtracking.csv186.52 MB
 - .ipynb_checkpoints            0B
 - train.csv                     7.02 GB


Wow, that is some really big data. Unfortunately, the full dataset may not fit into kernel memory; however, we can get a glimpse at some of the statistics without loading it:

In [70]:
print('# Line count:')
for file in ['train.csv', 'test.csv', 'train_sample.csv']:
    lines = subprocess.run(['wc', '-l', '../input/{}'.format(file)], stdout=subprocess.PIPE).stdout.decode('utf-8')
    print(" - " + file.ljust(30) + lines, end='', flush=True)

# Line count:
 - train.csv                      184903891 ../input/train.csv
 - test.csv                       18790470 ../input/test.csv
 - train_sample.csv                100001 ../input/train_sample.csv
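
`wc` is only available on Unix-like systems. A portable alternative, sketched here with a hypothetical helper `count_lines` (it is not part of `source.py`), counts newline bytes block by block so the file never has to fit in memory. The small temporary file only stands in for the real CSVs.

```python
import os
import tempfile

def count_lines(path, block_size=1 << 20):
    """Count newline characters in a file, reading it in fixed-size blocks."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            total += block.count(b"\n")
    return total

# Demo on a tiny temporary file standing in for ../input/train.csv
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("ip,app\n1,3\n2,14\n")
    demo_path = tmp.name

demo_count = count_lines(demo_path)  # header line plus two data lines
os.remove(demo_path)
print(demo_count)
```

Like `wc -l`, this counts the header line too, which is why the line counts above are one higher than the row counts.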


That makes 185 million rows in the training set and **19 million** in the test set. Handily, the organisers have provided a `train_sample.csv` containing 100K rows, in case you don't want to download the full data.
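
When a file is too large to load at once, pandas can also stream it in pieces. The following is a minimal sketch, assuming a toy CSV with made-up values in place of `train.csv`; the `chunksize` argument makes `read_csv` return an iterator of DataFrames, so simple statistics can be accumulated chunk by chunk.

```python
import io
import pandas as pd

# Toy CSV standing in for train.csv; the values are illustrative only
csv_text = """ip,app,is_attributed
83230,3,0
17357,3,0
35810,3,1
45745,14,0
161007,3,0
"""

rows = 0
downloads = 0
# With chunksize set, read_csv yields DataFrames of at most that many rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    rows += len(chunk)
    downloads += int(chunk["is_attributed"].sum())

print(rows, downloads)
```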

In [2]:
train_raw = pd.read_csv(INPUT_PATH + 'train.csv', parse_dates=['click_time', 'attributed_time']) # add nrows=20000000 to load only a subset

In [3]:
train_mem_before = summary_memory(train_raw).assign(status="Before Downsizing")

In [4]:
train_raw_downsized = reduce_mem_usage(train_raw)

-------------------Begin downsizing--------------------
ip converted from int64 to int32
app converted from int64 to int16
device converted from int64 to int16
os converted from int64 to int16
channel converted from int64 to int16
is_attributed converted from int64 to int8
------------------------Result-------------------------
 -> Mem. usage decreased from 11.02 GB to 4.99 GB
-------------------Finish downsizing-------------------
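
`reduce_mem_usage` lives in `source.py`, which is not shown here. As a rough idea of what such a helper might do (an assumption, not the actual implementation), integer columns can be downcast to the smallest signed type that holds their values. The name `downcast_integers` and the toy frame are hypothetical.

```python
import pandas as pd

def downcast_integers(df):
    """Return a copy with each integer column in its smallest signed integer dtype."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# Toy frame using a few values from the sample rows
toy = pd.DataFrame({
    "ip": [83230, 17357, 161007],
    "app": [3, 3, 14],
    "channel": [379, 379, 478],
})
small = downcast_integers(toy)
print(small.dtypes)
```

This sketch picks `int8` for the toy `app` column, whereas the log above reports `int16` on the full data, so the real helper evidently sees larger values or follows different rules.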


In [5]:
train_mem_after = summary_memory(train_raw_downsized).assign(status="After Downsizing")

In [6]:
mem_info = pd.concat([train_mem_before, train_mem_after], axis=0).reset_index(drop=True)

In [7]:
mem_info

Unnamed: 0,Veriable,Memory,Data Type,status
0,ip,1479231120,int64,Before Downsizing
1,app,1479231120,int64,Before Downsizing
2,device,1479231120,int64,Before Downsizing
3,os,1479231120,int64,Before Downsizing
4,channel,1479231120,int64,Before Downsizing
5,click_time,1479231120,datetime64[ns],Before Downsizing
6,attributed_time,1479231120,datetime64[ns],Before Downsizing
7,is_attributed,1479231120,int64,Before Downsizing
8,ip,739615560,int32,After Downsizing
9,app,369807780,int16,After Downsizing
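
The figures in this table follow directly from the row count times the bytes per value. A short check, assuming the row count reported by `train_raw.shape` and that the "GB" in the downsizing log means 1024³ bytes (an assumption that the numbers bear out):

```python
import numpy as np

n_rows = 184_903_890  # from train_raw.shape

# Bytes per value for each dtype that appears in the tables
itemsize = {name: np.dtype(name).itemsize
            for name in ["int8", "int16", "int32", "int64", "datetime64[ns]"]}

# A single int64 column
int64_column_bytes = n_rows * itemsize["int64"]

# Before: six int64 columns and two datetime64[ns] columns
before_bytes = n_rows * (6 * itemsize["int64"] + 2 * itemsize["datetime64[ns]"])

# After: ip as int32, four int16 columns, is_attributed as int8, two datetime64[ns]
after_bytes = n_rows * (itemsize["int32"] + 4 * itemsize["int16"]
                        + itemsize["int8"] + 2 * itemsize["datetime64[ns]"])

gib = 1024 ** 3
print(int64_column_bytes)                                        # 1479231120
print(round(before_bytes / gib, 2), round(after_bytes / gib, 2))  # 11.02 4.99
```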


In [8]:
mem_info.to_csv("../output/memory.csv", index=False)

In [7]:
train_raw.shape

(184903890, 8)

In [8]:
train_raw.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,83230,3,1,13,379,2017-11-06 14:32:21,NaT,0
1,17357,3,1,19,379,2017-11-06 14:33:34,NaT,0
2,35810,3,1,13,379,2017-11-06 14:34:12,NaT,0
3,45745,14,1,13,478,2017-11-06 14:34:52,NaT,0
4,161007,3,1,13,379,2017-11-06 14:35:08,NaT,0


According to the data page, our data contains:

* `ip`: ip address of click
* `app`: app id for marketing
* `device`: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
* `os`: os version id of user mobile phone
* `channel`: channel id of mobile ad publisher
* `click_time`: timestamp of click (UTC)
* `attributed_time`: if the user downloaded the app after clicking an ad, this is the time of the app download
* `is_attributed`: the target that is to be predicted, indicating the app was downloaded

**A few things of note:**
* If you look at the data samples above, you'll notice that all these variables are encoded - meaning we don't know what the actual value corresponds to - each value has instead been assigned an ID which we're given. This has likely been done because data such as IP addresses are sensitive, although it does unfortunately reduce the amount of feature engineering we can do on these.
* The `attributed_time` variable is only available in the training set - it's not immediately useful for classification but it could be used for some interesting analysis (for example, one could fill in the variable in the test set by building a model to predict it).

Let's look at the column types of our encoded values (note that they now show the downsized dtypes):

In [9]:
train_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184903890 entries, 0 to 184903889
Data columns (total 8 columns):
 #   Column           Dtype         
---  ------           -----         
 0   ip               int32         
 1   app              int16         
 2   device           int16         
 3   os               int16         
 4   channel          int16         
 5   click_time       datetime64[ns]
 6   attributed_time  datetime64[ns]
 7   is_attributed    int8          
dtypes: datetime64[ns](2), int16(4), int32(1), int8(1)
memory usage: 5.0 GB
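
`info()` reports dtypes and memory but not how many distinct IDs each encoded column holds. One way to get those counts is `DataFrame.nunique()`; the sketch below runs it on a toy frame built from the five sample rows shown earlier, so the counts are for those rows only, not for the full training set.

```python
import pandas as pd

# The encoded columns of the five rows from train_raw.head()
toy = pd.DataFrame({
    "ip": [83230, 17357, 35810, 45745, 161007],
    "app": [3, 3, 3, 14, 3],
    "device": [1, 1, 1, 1, 1],
    "os": [13, 19, 13, 13, 13],
    "channel": [379, 379, 379, 478, 379],
})

# Number of distinct values per column
unique_counts = toy.nunique()
print(unique_counts)
```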


## Missing Values

In [10]:
train_raw.isnull().sum()

ip                         0
app                        0
device                     0
os                         0
channel                    0
click_time                 0
attributed_time    184447044
is_attributed              0
dtype: int64

Notice we have many missing values for `attributed_time`; that is expected, since clicks that did not lead to a download have no download time. We need to check that the missing (`NaT`) values in this column occur only for samples where there was no download.

In [11]:
train_raw.attributed_time[train_raw.is_attributed==0].unique()

array(['NaT'], dtype='datetime64[ns]')

We see that `attributed_time` contains only `NaT` values when `train_raw.is_attributed==0`. Let us check for any null values when `train_raw.is_attributed==1`.

In [12]:
train_raw[train_raw.is_attributed==1].isnull().sum()

ip                 0
app                0
device             0
os                 0
channel            0
click_time         0
attributed_time    0
is_attributed      0
dtype: int64

All good! For all samples where `is_attributed = 1`, we have no missing values. **Looking at the number of missing values, we see that the vast majority of clicks (184,447,044 of 184,903,890, about 99.75%) did not lead to a download.**
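
The two checks above can also be folded into a single comparison: `attributed_time` should be missing exactly when `is_attributed` is 0. A minimal sketch on a toy frame with made-up rows, followed by the download rate implied by the counts reported in this notebook:

```python
import pandas as pd

# Toy frame: one download among four clicks (illustrative values)
toy = pd.DataFrame({
    "attributed_time": pd.to_datetime([None, "2017-11-06 15:00:00", None, None]),
    "is_attributed": [0, 1, 0, 0],
})

# True only if NaT appears exactly on the rows without a download
missing = toy["attributed_time"].isna()
consistent = bool((missing == (toy["is_attributed"] == 0)).all())
print(consistent)

# Download rate on the full training set, from the counts shown above
n_rows = 184_903_890
n_missing = 184_447_044
full_rate = (n_rows - n_missing) / n_rows
print(round(full_rate * 100, 2))  # percent of clicks with a download
```

A target this imbalanced (roughly one download per 400 clicks) is worth keeping in mind when choosing evaluation metrics later.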

## Output Data

It seems that [the best format to save pandas data](https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d) is Feather, so we save the frame in that format for use as future input.

In [13]:
train_raw.to_feather("../processing/train_raw.feather") # pip install pyarrow

# Preparing Unseen Data

In [6]:
test_raw = pd.read_csv(INPUT_PATH + 'test.csv', parse_dates=['click_time']) # add nrows=20000000 to load only a subset

In [7]:
test_raw

Unnamed: 0,click_id,ip,app,device,os,channel,click_time
0,0,5744,9,1,3,107,2017-11-10 04:00:00
1,1,119901,9,1,3,466,2017-11-10 04:00:00
2,2,72287,21,1,19,128,2017-11-10 04:00:00
3,3,78477,15,1,13,111,2017-11-10 04:00:00
4,4,123080,12,1,13,328,2017-11-10 04:00:00
...,...,...,...,...,...,...,...
18790464,18790464,99442,9,1,13,127,2017-11-10 15:00:00
18790465,18790465,88046,23,1,37,153,2017-11-10 15:00:00
18790466,18790467,81398,18,1,17,265,2017-11-10 15:00:00
18790467,18790466,123236,27,1,13,122,2017-11-10 15:00:00


In [8]:
test_raw = reduce_mem_usage(test_raw)

-------------------Begin downsizing--------------------
click_id converted from int64 to int32
ip converted from int64 to int32
app converted from int64 to int16
device converted from int64 to int16
os converted from int64 to int16
channel converted from int64 to int16
------------------------Result-------------------------
 -> Mem. usage decreased from 1003.52 MB to 430.08 MB
-------------------Finish downsizing-------------------
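
An alternative to downsizing after loading is to declare the target dtypes at read time, so the `int64` intermediates are never created. This is a sketch of that option, not what the notebook does: it assumes the dtypes reported in the log above and uses a three-row in-memory CSV in place of `test.csv`.

```python
import io
import pandas as pd

# Three rows copied from the test_raw preview, standing in for test.csv
csv_text = """click_id,ip,app,device,os,channel,click_time
0,5744,9,1,3,107,2017-11-10 04:00:00
1,119901,9,1,3,466,2017-11-10 04:00:00
2,72287,21,1,19,128,2017-11-10 04:00:00
"""

# Dtypes taken from the downsizing log above
dtypes = {
    "click_id": "int32",
    "ip": "int32",
    "app": "int16",
    "device": "int16",
    "os": "int16",
    "channel": "int16",
}

test_small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, parse_dates=["click_time"])
print(test_small.dtypes)
```

The trade-off is that the dtypes must be known in advance; a value outside the declared range would not be caught by a size-based helper after the fact.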


In [9]:
test_raw

Unnamed: 0,click_id,ip,app,device,os,channel,click_time
0,0,5744,9,1,3,107,2017-11-10 04:00:00
1,1,119901,9,1,3,466,2017-11-10 04:00:00
2,2,72287,21,1,19,128,2017-11-10 04:00:00
3,3,78477,15,1,13,111,2017-11-10 04:00:00
4,4,123080,12,1,13,328,2017-11-10 04:00:00
...,...,...,...,...,...,...,...
18790464,18790464,99442,9,1,13,127,2017-11-10 15:00:00
18790465,18790465,88046,23,1,37,153,2017-11-10 15:00:00
18790466,18790467,81398,18,1,17,265,2017-11-10 15:00:00
18790467,18790466,123236,27,1,13,122,2017-11-10 15:00:00


In [10]:
test_raw.to_feather("../processing/test_raw.feather")