One of the main issues in this competition is the size of the dataset. Pandas crashes when attempting to load the entire train and test datasets at once. [One of the kernels has been able to read the entire train dataset using dask](https://www.kaggle.com/ashishpatel26/how-to-handle-this-big-dataset-dask-vs-pandas). In this kernel we'll use the [Python datatable package](https://github.com/h2oai/datatable) to load the entire train and test datasets, and do some simple EDA on them. Python datatable is still in an early alpha stage and is under very active development. It is designed from the ground up for big datasets, with efficiency and speed in mind. It is closely related to [R's data.table](https://github.com/Rdatatable/data.table) and attempts to mimic its core algorithms and API.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 
from datetime import datetime
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
import gc
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

['test.csv', 'train.csv', 'sample_submission.csv']


Unfortunately, datatable is not currently available in the Kaggle Docker image. My attempts to install it via the Kaggle kernel package installation API have failed, but I have been able to install it from the following pre-built wheel. (A huge shoutout to [Olivier](https://www.kaggle.com/ogrellier) for his help with this.)

In [2]:
!pip install https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pydatatable/0.7.0.dev490/x86_64-centos7/datatable-0.7.0.dev490-cp36-cp36m-linux_x86_64.whl

Collecting datatable==0.7.0.dev490 from https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pydatatable/0.7.0.dev490/x86_64-centos7/datatable-0.7.0.dev490-cp36-cp36m-linux_x86_64.whl
  Downloading https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pydatatable/0.7.0.dev490/x86_64-centos7/datatable-0.7.0.dev490-cp36-cp36m-linux_x86_64.whl (1.6MB)
    100% |████████████████████████████████| 1.6MB 16.6MB/s 
Collecting typesentry>=0.2.6 (from datatable==0.7.0.dev490)
  Downloading https://files.pythonhosted.org/packages/0f/37/3757249f05aac8a08d9742f9a35c17ab6895eb916b83bbf3a23eae6842b2/typesentry-0.2.7-py2.py3-none-any.whl
Collecting blessed (from datatable==0.7.0.dev490)
  Downloading https://files.pythonhosted.org/packages/3f/96/1915827a8e411613d364dd3a56ef1fbfab84ee878070a69c21b10b5ad1bb/blessed-1.15.0-py2.py3-none-any.whl (60kB)
    100% |████████████████████████████████| 61kB 5.7MB/s 
Installing collected packages: typesentry, blessed, datat

Now let's import datatable, along with its Ftrl model and the scikit-learn metrics we'll use later:

In [3]:
from sklearn.metrics import log_loss, roc_auc_score
from datetime import datetime
import datatable as dt
from datatable.models import Ftrl

Now let's load the train dataset:

In [4]:
%%time
train = dt.fread('../input/train.csv')

CPU times: user 1min 14s, sys: 17.2 s, total: 1min 31s
Wall time: 26.7 s


And test:

In [5]:
%%time
test = dt.fread('../input/test.csv')

CPU times: user 1min 1s, sys: 11.9 s, total: 1min 13s
Wall time: 22.1 s


We were able to load all of the train and test datasets, although doing so pretty much exhausted all of the kernel's 17.2 GB of RAM. But we did it!

Let's take a look at the train:

In [6]:
train.head()

Unnamed: 0_level_0,MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,…,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
Unnamed: 0_level_1,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪,▪▪▪▪,▪,▪▪▪▪,▪▪▪▪,…,▪,▪,▪,▪▪▪▪,▪
0,0000028988387b115f69f31a3bf04f09,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1735.0,0,7,0,,53447,…,0,0,0,10,0
1,000007535c3f730efa9ea0b7ef1bd645,win8defender,1.1.14600.4,4.13.17134.1,1.263.48.0,0,7,0,,53447,…,0,0,0,8,0
2,000007905a28d863f6d0d597892cd692,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1341.0,0,7,0,,53447,…,0,0,0,3,0
3,00000b11598a75ea8ba1beea8459149f,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1527.0,0,7,0,,53447,…,0,0,0,3,1
4,000014a5f00daa18e76b81417eeb99fc,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1379.0,0,7,0,,53447,…,0,0,0,1,1
5,000016191b897145d069102325cab760,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1094.0,0,7,0,,53447,…,0,0,0,15,1
6,0000161e8abf8d8b89c5ab8787fd712b,win8defender,1.1.15100.1,4.18.1807.18075,1.273.845.0,0,7,0,,43927,…,0,0,0,10,1
7,000019515bc8f95851aff6de873405e8,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1393.0,0,7,0,,53447,…,0,0,0,15,0
8,00001a027a0ab970c408182df8484fce,win8defender,1.1.15200.1,4.18.1807.18075,1.275.988.0,0,7,0,,53447,…,0,0,0,15,0
9,00001a18d69bb60bda9779408dcf02ac,win8defender,1.1.15100.1,4.18.1807.18075,1.273.973.0,0,7,0,,46413,…,0,0,1,8,1


In [7]:
train.shape

(8921483, 83)

And test:

In [8]:
test.head()

Unnamed: 0_level_0,MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,…,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier
Unnamed: 0_level_1,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪▪▪▪,▪,▪▪▪▪,▪,▪▪▪▪,▪▪▪▪,…,▪,▪,▪,▪,▪▪▪▪
0,0000010489e3af074adeac69c53e555e,win8defender,1.1.15400.5,4.18.1810.5,1.281.501.0,0,7,0,,53447,…,0,0,0.0,0.0,7.0
1,00000176ac758d54827acd545b6315a5,win8defender,1.1.15400.4,4.18.1809.2,1.279.301.0,0,7,0,,53447,…,0,0,0.0,1.0,12.0
2,0000019dcefc128c2d4387c1273dae1d,win8defender,1.1.15300.6,4.18.1809.2,1.277.230.0,0,7,0,,49480,…,0,0,0.0,1.0,11.0
3,0000055553dc51b1295785415f1a224d,win8defender,1.1.15400.5,4.18.1810.5,1.281.664.0,0,7,0,,42160,…,0,0,0.0,0.0,10.0
4,00000574cefffeca83ec8adf9285b2bf,win8defender,1.1.15400.4,4.18.1809.2,1.279.236.0,0,7,0,,53447,…,0,0,0.0,1.0,3.0
5,000007ffedd31948f08e6c16da31f6d1,win8defender,1.1.15300.6,4.18.1809.2,1.277.724.0,0,7,0,,53447,…,0,0,0.0,0.0,10.0
6,000008f31610018d898e5f315cdf1bd1,win8defender,1.1.15400.4,4.18.1810.5,1.279.1373.0,0,7,0,,7945,…,0,0,0.0,1.0,10.0
7,00000a3c447250626dbcc628c9cbc460,win8defender,1.1.15300.6,4.18.1806.18062,1.277.1185.0,0,7,0,,15521,…,0,0,0.0,0.0,9.0
8,00000b6bf217ec9aef0f68d5c6705897,win8defender,1.1.15400.5,4.18.1810.5,1.281.675.0,0,7,0,,53447,…,0,0,,,
9,00000b8d3776b13e93ad83676a28e4aa,win8defender,1.1.14700.5,4.14.17613.18039,1.265.676.0,0,7,0,,53447,…,0,0,0.0,0.0,15.0


In [9]:
test.shape

(7853253, 82)
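
The two shapes differ by exactly one column (83 vs 82). A quick way to confirm which one is to compare the column names. The sketch below does this on short, made-up name tuples standing in for `train.names` and `test.names`; on the real frames the same comparison should presumably leave only the target, `HasDetections`, which shows up in the train preview but not in the test preview.

```python
# Shortened stand-ins for train.names and test.names
# (the real tuples have 83 and 82 entries respectively).
train_names = ("MachineIdentifier", "ProductName", "Wdft_IsGamer", "HasDetections")
test_names = ("MachineIdentifier", "ProductName", "Wdft_IsGamer")

train_name_set = set(train_names)
test_name_set = set(test_names)

# Columns present in one frame but not the other, kept in their original order.
only_in_train = [name for name in train_names if name not in test_name_set]
only_in_test = [name for name in test_names if name not in train_name_set]

print("only in train:", only_in_train)
print("only in test:", only_in_test)
```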

Let's look at the number of unique values in each column of the two datasets:

In [10]:
train.nunique()

Unnamed: 0_level_0,MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,…,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
Unnamed: 0_level_1,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,…,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪
0,8921483,6,70,110,8531,2,7,2,2017,28970,…,2,2,2,15,2


In [11]:
test.nunique()

Unnamed: 0_level_0,MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,…,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier
Unnamed: 0_level_1,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,…,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪
0,7853253,6,70,120,9357,2,8,2,1757,23492,…,2,2,2,2,15


In [12]:
train[:, 'EngineVersion'].nunique1()

70

In [13]:
train_unique = dt.unique(train[:, 'EngineVersion']).to_list()[0]
len(train_unique)

70

In [14]:
test_unique = dt.unique(test[:, 'EngineVersion']).to_list()[0]
len(test_unique)

70

In [15]:
intersection = list(set(train_unique) & set(test_unique))
len(intersection)

66

We see that only 66 of the 70 `EngineVersion` values appear in both train and test.
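
With 70 distinct values on each side and 66 shared, 4 values must occur only in train and 4 only in test. Set differences show which ones. The sketch below uses tiny made-up lists in place of `train_unique` and `test_unique`, so the specific version strings are illustrative only.

```python
# Toy stand-ins for train_unique and test_unique (values chosen for illustration).
toy_train_unique = ["1.1.15100.1", "1.1.14600.4", "1.1.15200.1"]
toy_test_unique = ["1.1.15100.1", "1.1.15400.5", "1.1.15200.1"]

train_set = set(toy_train_unique)
test_set = set(toy_test_unique)

shared = train_set & test_set                  # values seen in both datasets
train_only = sorted(train_set - test_set)      # values the model trains on but never scores
test_only = sorted(test_set - train_set)       # values the model never saw during training

print(len(shared), train_only, test_only)
```

The test-only values are the ones to watch: any encoding learned from train alone has nothing to say about them.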

Let's see the names of the features in the dataset:

In [16]:
train.names

('MachineIdentifier',
 'ProductName',
 'EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'IsBeta',
 'RtpStateBitfield',
 'IsSxsPassiveMode',
 'DefaultBrowsersIdentifier',
 'AVProductStatesIdentifier',
 'AVProductsInstalled',
 'AVProductsEnabled',
 'HasTpm',
 'CountryIdentifier',
 'CityIdentifier',
 'OrganizationIdentifier',
 'GeoNameIdentifier',
 'LocaleEnglishNameIdentifier',
 'Platform',
 'Processor',
 'OsVer',
 'OsBuild',
 'OsSuite',
 'OsPlatformSubRelease',
 'OsBuildLab',
 'SkuEdition',
 'IsProtected',
 'AutoSampleOptIn',
 'PuaMode',
 'SMode',
 'IeVerIdentifier',
 'SmartScreen',
 'Firewall',
 'UacLuaenable',
 'Census_MDC2FormFactor',
 'Census_DeviceFamily',
 'Census_OEMNameIdentifier',
 'Census_OEMModelIdentifier',
 'Census_ProcessorCoreCount',
 'Census_ProcessorManufacturerIdentifier',
 'Census_ProcessorModelIdentifier',
 'Census_ProcessorClass',
 'Census_PrimaryDiskTotalCapacity',
 'Census_PrimaryDiskTypeName',
 'Census_SystemVolumeTotalCapacity',
 'Census_HasOpticalDiskDrive',
 

And their types:

In [17]:
train.ltypes

(ltype.str,
 ltype.str,
 ltype.str,
 ltype.str,
 ltype.str,
 ltype.bool,
 ltype.int,
 ltype.bool,
 ltype.int,
 ltype.int,
 ltype.int,
 ltype.int,
 ltype.bool,
 ltype.int,
 ltype.int,
 ltype.int,
 ltype.int,
 ltype.int,
 ltype.str,
 ltype.str,
 ltype.str,
 ltype.int,
 ltype.int,
 ltype.str,
 ltype.str,
 ltype.str,
 ltype.bool,
 ltype.bool,
 ltype.str,
 ltype.bool,
 ltype.int,
 ltype.str,
 ltype.bool,
 ltype.int,
 ltype.str,
 ltype.str,
 ltype.int,
 ltype.int,
 ltype.int,
 ltype.int,
 ltype.int,
 ltype.str,
 ltype.real,
 ltype.str,
 ltype.int,
 ltype.bool,
 ltype.int,
 ltype.str,
 ltype.real,
 ltype.int,
 ltype.int,
 ltype.str,
 ltype.str,
 ltype.real,
 ltype.str,
 ltype.str,
 ltype.str,
 ltype.int,
 ltype.int,
 ltype.str,
 ltype.str,
 ltype.str,
 ltype.int,
 ltype.int,
 ltype.str,
 ltype.bool,
 ltype.str,
 ltype.str,
 ltype.bool,
 ltype.bool,
 ltype.str,
 ltype.bool,
 ltype.int,
 ltype.int,
 ltype.bool,
 ltype.bool,
 ltype.bool,
 ltype.bool,
 ltype.bool,
 ltype.bool,
 ltype.bool,
 ltype

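Next, we are going to try to fit an Ftrl model on the train set. Here we will follow the approach from [Olivier's great discussion topic](https://www.kaggle.com/c/microsoft-malware-prediction/discussion/75478). First, let's replace all the missing values. (Note that the cell below is wrapped in a triple-quoted string, so it is not actually executed in this run; its output is just the string echoed back.)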

In [18]:
'''%%time
for name in test.names:
    if test[:, name].ltypes[0] == dt.ltype.str:
        train.replace(None, '-1')
        test.replace(None, '-1')
    elif test[:, name].ltypes[0] == dt.ltype.int:
        train.replace(None, -1)
        test.replace(None, -1)
    elif test[:, name].ltypes[0] == dt.ltype.bool:
        train.replace(None, 0)
        test.replace(None, 0)
    elif test[:, name].ltypes[0] == dt.ltype.real:
        train.replace(None, -1.0)
        test.replace(None, -1.0)'''


"%%time\nfor name in test.names:\n    if test[:, name].ltypes[0] == dt.ltype.str:\n        train.replace(None, '-1')\n        test.replace(None, '-1')\n    elif test[:, name].ltypes[0] == dt.ltype.int:\n        train.replace(None, -1)\n        test.replace(None, -1)\n    elif test[:, name].ltypes[0] == dt.ltype.bool:\n        train.replace(None, 0)\n        test.replace(None, 0)\n    elif test[:, name].ltypes[0] == dt.ltype.real:\n        train.replace(None, -1.0)\n        test.replace(None, -1.0)"

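The idea behind the disabled cell is to fill each column's missing entries with a sentinel that matches the column's type. Here is a minimal sketch of that idea in plain Python, on a toy dict-of-lists table rather than a datatable Frame. The `SENTINELS` mapping and the `fill_missing` helper are hypothetical names introduced for illustration; the sentinel values themselves are the ones used in the cell above.

```python
# Sentinel per logical column type, taken from the disabled cell above.
SENTINELS = {"str": "-1", "int": -1, "bool": 0, "real": -1.0}


def fill_missing(columns, types):
    """Return a copy of `columns` where None is replaced by the
    sentinel that corresponds to each column's declared type."""
    filled = {}
    for name, values in columns.items():
        sentinel = SENTINELS[types[name]]
        filled[name] = [sentinel if value is None else value for value in values]
    return filled


# Toy table: column names are real, values are made up.
toy = {
    "SmartScreen": ["Off", None],
    "Wdft_RegionIdentifier": [10, None],
    "Wdft_IsGamer": [None, 1],
}
toy_types = {"SmartScreen": "str", "Wdft_RegionIdentifier": "int", "Wdft_IsGamer": "bool"}

filled = fill_missing(toy, toy_types)
print(filled)
```
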
Next, we'll encode all the string columns numerically, replacing each value with its relative frequency across train and test combined. Unfortunately, datatable still doesn't handle this natively, so we'll have to use the Pandas crutch.

In [19]:
%%time
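n_train = train.nrows  # number of train rows in the concatenated column
for f in train.names:
    if f not in ['MachineIdentifier', 'HasDetections']:
        if train[:, f].ltypes[0] == dt.ltype.str:
            print('factorizing %s' % f)
            # Stack the train and test values of this column into one pandas frame
            col_f = pd.concat([train[:, f].to_pandas(), test[:, f].to_pandas()], ignore_index=True)
            # Relative frequency of each distinct value over train + test
            encoding = col_f.groupby(f).size()
            encoding = encoding/len(col_f)
            column = col_f[f].map(encoding).values.flatten()
            del col_f, encoding
            gc.collect()
            # Write the encoded values back, split at the train/test boundary
            train[:, f] = dt.Frame(column[:n_train])
            test[:, f] = dt.Frame(column[n_train:])
            del column
            gc.collect()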

factorizing ProductName
factorizing EngineVersion
factorizing AppVersion
factorizing AvSigVersion
factorizing Platform
factorizing Processor
factorizing OsVer
factorizing OsPlatformSubRelease
factorizing OsBuildLab
factorizing SkuEdition
factorizing PuaMode
factorizing SmartScreen
factorizing Census_MDC2FormFactor
factorizing Census_DeviceFamily
factorizing Census_ProcessorClass
factorizing Census_PrimaryDiskTypeName
factorizing Census_ChassisTypeName
factorizing Census_PowerPlatformRoleName
factorizing Census_InternalBatteryType
factorizing Census_OSVersion
factorizing Census_OSArchitecture
factorizing Census_OSBranch
factorizing Census_OSEdition
factorizing Census_OSSkuName
factorizing Census_OSInstallTypeName
factorizing Census_OSWUAutoUpdateOptionsName
factorizing Census_GenuineStateName
factorizing Census_ActivationChannel
factorizing Census_FlightRing
CPU times: user 2min 49s, sys: 43.1 s, total: 3min 32s
Wall time: 3min 31s
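
What the loop above computes for each string column is a frequency encoding: every value is replaced by the share of rows (train and test together) that carry it. The sketch below reproduces that on two tiny lists using only the standard library. `frequency_encode` is a hypothetical helper, the category values are made up, and missing values are not handled here.

```python
from collections import Counter


def frequency_encode(train_values, test_values):
    """Replace each category with its share of rows in train + test combined,
    then split the result back into a train part and a test part."""
    combined = list(train_values) + list(test_values)
    counts = Counter(combined)
    total = len(combined)
    share = {value: count / total for value, count in counts.items()}
    encoded = [share[value] for value in combined]
    n_train = len(train_values)
    return encoded[:n_train], encoded[n_train:]


toy_train_col = ["win8defender", "win8defender", "mse"]
toy_test_col = ["win8defender", "mse", "scep"]

train_enc, test_enc = frequency_encode(toy_train_col, toy_test_col)
print(train_enc)
print(test_enc)
```

Because the shares are computed over both datasets, a value that appears only in test still gets a number, which is why the encoded `test.head()` has no gaps in these columns.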


In [20]:
train[:, f]

Unnamed: 0_level_0,HasDetections
Unnamed: 0_level_1,▪
0,0
1,0
2,0
3,1
4,1
5,1
6,1
7,0
8,0
9,1


In [21]:
train.head()

Unnamed: 0_level_0,MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,…,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
Unnamed: 0_level_1,▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪,▪▪▪▪,▪,▪▪▪▪,▪▪▪▪,…,▪,▪,▪,▪▪▪▪,▪
0,0000028988387b115f69f31a3bf04f09,0.991,0.228555,0.347238,0.00119924,0,7,0,,53447,…,0,0,0,10,0
1,000007535c3f730efa9ea0b7ef1bd645,0.991,0.0178303,0.0291144,0.0137497,0,7,0,,53447,…,0,0,0,8,0
2,000007905a28d863f6d0d597892cd692,0.991,0.228555,0.347238,0.000625047,0,7,0,,53447,…,0,0,0,3,0
3,00000b11598a75ea8ba1beea8459149f,0.991,0.228555,0.347238,0.00250019,0,7,0,,53447,…,0,0,0,3,1
4,000014a5f00daa18e76b81417eeb99fc,0.991,0.228555,0.347238,0.00300744,0,7,0,,53447,…,0,0,0,1,1
5,000016191b897145d069102325cab760,0.991,0.228555,0.347238,0.00126893,0,7,0,,53447,…,0,0,0,15,1
6,0000161e8abf8d8b89c5ab8787fd712b,0.991,0.228555,0.347238,0.000205786,0,7,0,,43927,…,0,0,0,10,1
7,000019515bc8f95851aff6de873405e8,0.991,0.228555,0.347238,0.000374551,0,7,0,,53447,…,0,0,0,15,0
8,00001a027a0ab970c408182df8484fce,0.991,0.251041,0.347238,0.00128205,0,7,0,,53447,…,0,0,0,15,0
9,00001a18d69bb60bda9779408dcf02ac,0.991,0.228555,0.347238,0.000386355,0,7,0,,46413,…,0,0,1,8,1


In [22]:
test.head()

Unnamed: 0_level_0,MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,…,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier
Unnamed: 0_level_1,▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪▪▪▪▪▪▪▪,▪,▪▪▪▪,▪,▪▪▪▪,▪▪▪▪,…,▪,▪,▪,▪,▪▪▪▪
0,0000010489e3af074adeac69c53e555e,0.991,0.0888999,0.126972,0.000228379,0,7,0,,53447,…,0,0,0.0,0.0,7.0
1,00000176ac758d54827acd545b6315a5,0.991,0.12556,0.164084,0.00165088,0,7,0,,53447,…,0,0,0.0,1.0,12.0
2,0000019dcefc128c2d4387c1273dae1d,0.991,0.192051,0.164084,0.00037175,0,7,0,,49480,…,0,0,0.0,1.0,11.0
3,0000055553dc51b1295785415f1a224d,0.991,0.0888999,0.126972,0.000838463,0,7,0,,42160,…,0,0,0.0,0.0,10.0
4,00000574cefffeca83ec8adf9285b2bf,0.991,0.12556,0.164084,0.00129618,0,7,0,,53447,…,0,0,0.0,1.0,3.0
5,000007ffedd31948f08e6c16da31f6d1,0.991,0.192051,0.164084,0.000220152,0,7,0,,53447,…,0,0,0.0,0.0,10.0
6,000008f31610018d898e5f315cdf1bd1,0.991,0.12556,0.126972,0.000457474,0,7,0,,7945,…,0,0,0.0,1.0,10.0
7,00000a3c447250626dbcc628c9cbc460,0.991,0.192051,0.0573805,0.000193028,0,7,0,,15521,…,0,0,0.0,0.0,9.0
8,00000b6bf217ec9aef0f68d5c6705897,0.991,0.0888999,0.126972,0.00105444,0,7,0,,53447,…,0,0,,,
9,00000b8d3776b13e93ad83676a28e4aa,0.991,0.00405872,0.00473218,1.1565e-05,0,7,0,,53447,…,0,0,0.0,0.0,15.0


Now, let's fit the model:

In [23]:
features = [f for f in train.names if f not in ['HasDetections']]
ftrl = Ftrl(nepochs=2, interactions=True)


In [24]:
%%time
print('Start Fitting on   ', train.shape, ' @ ', datetime.now())
ftrl.fit(train[:, features], train[:, 'HasDetections'])
print('Fitted complete on ', train.shape, ' @ ', datetime.now())  
print('Current loss : %.6f' 
          % log_loss(np.array(train[:, 'HasDetections'])[:, 0],  
                             np.array(ftrl.predict(train[:, features]))))

Start Fitting on    (8921483, 83)  @  2019-03-03 13:44:28.836626
Fitted complete on  (8921483, 83)  @  2019-03-03 16:59:14.314141
Current loss : 0.579900
CPU times: user 19h 7min 42s, sys: 1min 53s, total: 19h 9min 36s
Wall time: 4h 49min 56s


In [25]:
print('Current AUC : %.6f' 
          % roc_auc_score(np.array(train[:, 'HasDetections'])[:, 0],  
                             np.array(ftrl.predict(train[:, features]))))

Current AUC : 0.760506
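
Both numbers are computed on the same rows the model was fitted on. To make clear what they measure, here is a from-scratch sketch of the two metrics in plain Python. It is meant as an illustration of the definitions, not as a replacement for `log_loss` and `roc_auc_score`; the function names, the `eps` clipping value and the four toy predictions are assumptions of this sketch.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative
    (ties count as half). Quadratic in the number of rows, so toy-sized only."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_probs = [0.1, 0.4, 0.35, 0.8]

toy_loss = binary_log_loss(toy_labels, toy_probs)
toy_auc = roc_auc(toy_labels, toy_probs)
print(toy_loss, toy_auc)
```

Log loss penalizes confident wrong probabilities, while AUC only looks at how the rows are ranked, so the two can move independently.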


In [26]:
preds1 = np.array(ftrl.predict(test[:, features]))
preds1 = preds1.flatten()

In [27]:
ftrl = Ftrl(nepochs=20, interactions=False)
ftrl.fit(train[:, features], train[:, 'HasDetections'])
preds2 = np.array(ftrl.predict(test[:, features]))
preds2 = preds2.flatten()

In [28]:
np.save('preds1', preds1)
np.save('preds2', preds2)

In [29]:
sample_submission = pd.read_csv('../input/sample_submission.csv')

In [30]:
sample_submission['HasDetections'] = 0.6*preds1+0.4*preds2

In [31]:
sample_submission.to_csv('datatable_ftrl_submission.csv', index=False)
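
The submission is a weighted average of the two models' predictions, with weights 0.6 and 0.4. A minimal sketch of that blend on toy lists follows; `blend` is a hypothetical helper, and the checks on weights and lengths are additions of this sketch rather than something the cells above do.

```python
def blend(prediction_lists, weights):
    """Element-wise weighted average of several equally long prediction lists."""
    if len(prediction_lists) != len(weights):
        raise ValueError("need exactly one weight per prediction list")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1 so the blend stays a probability")
    n_rows = len(prediction_lists[0])
    if any(len(preds) != n_rows for preds in prediction_lists):
        raise ValueError("all prediction lists must have the same length")
    return [
        sum(w * preds[i] for w, preds in zip(weights, prediction_lists))
        for i in range(n_rows)
    ]


# Made-up predictions standing in for preds1 and preds2.
toy_preds1 = [0.2, 0.9, 0.5]
toy_preds2 = [0.4, 0.7, 0.5]

blended = blend([toy_preds1, toy_preds2], [0.6, 0.4])
print(blended)
```

Keeping the weights summing to 1 means each blended value stays between the two inputs it was built from.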

To be continued ...