Let's see how Vowpal Wabbit actually copes with large datasets. We have 10 GB of StackOverflow questions (link to the data), almost exactly 10 million of them, and each question can have several tags.


Ten tags are selected out of all of them, and the task is 10-class classification: given the text of a question, assign one of 10 tags corresponding to 10 popular programming languages.

In [2]:
%%time
!wc -l stackoverflow.10kk.tsv

9999992 stackoverflow.10kk.tsv
CPU times: user 63.3 ms, sys: 10.9 ms, total: 74.2 ms
Wall time: 2.51 s


In [1]:
import os
from tqdm import tqdm
from time import time
import numpy as np
from sklearn.metrics import accuracy_score

In [4]:
!sed -i 1,7d stackoverflow.10kk.tsv

In [4]:
%%time
!python3 preprocess.py stackoverflow.10kk.tsv stackoverflow.vw

9999992it [01:10, 140992.95it/s]
4389052 lines selected, 15 lines corrupted.
CPU times: user 1.64 s, sys: 475 ms, total: 2.11 s
Wall time: 1min 11s


10 | i ve got some code in window scroll that checks if an element is visible then triggers another function however only the first section of code is firing both bits of code work in and of themselves if i swap their order whichever is on top fires correctly my code is as follows fn isonscreen function use strict var win window viewport top win scrolltop left win scrollleft bounds this offset viewport right viewport left + win width viewport bottom viewport top + win height bounds right bounds left + this outerwidth bounds bottom bounds top + this outerheight return viewport right lt bounds left viewport left gt bounds right viewport bottom lt bounds top viewport top gt bounds bottom window scroll function use strict var load_more_results ajax load_more_results isonscreen if load_more_results true loadmoreresults var load_more_staff ajax load_more_staff isonscreen if load_more_staff true loadmorestaff what am i doing wrong can you only fire one event from window scroll i assume not
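
The preprocess.py script itself is not shown in this notebook. Below is a minimal sketch of what such a conversion might look like. It assumes each input line is `text<TAB>space-separated tags` and that only questions with exactly one of the ten target tags are kept. The tag list, its order (and therefore the label numbers), the function name `to_vw_format`, and the cleaning rules are all assumptions made for illustration, not the author's actual script.

```python
import re

# Hypothetical tag list; the label is the 1-based position in this list.
TAGS = ['python', 'java', 'c++', 'c#', 'ruby',
        'php', 'go', 'scala', 'swift', 'javascript']
TAG_TO_LABEL = {tag: i + 1 for i, tag in enumerate(TAGS)}


def to_vw_format(line):
    """Return a VW line 'label | text', or None if the line is unusable."""
    parts = line.rstrip('\n').split('\t')
    if len(parts) != 2:
        return None  # treat as a corrupted line
    text, tags = parts
    hits = [t for t in tags.split() if t in TAG_TO_LABEL]
    if len(hits) != 1:
        return None  # keep only questions with exactly one target tag
    # ':' and '|' are special characters in the VW input format.
    text = re.sub(r'[:|]', ' ', text.lower())
    text = ' '.join(text.split())  # collapse runs of whitespace
    return f'{TAG_TO_LABEL[hits[0]]} | {text}'


print(to_vw_format('How do I: use | pipes?\tpython linux\n'))
```

Questions with no target tag, or with more than one, return `None` here; that is one plausible reason why only part of the input lines end up selected.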

In [2]:
!vw -d "stackoverflow_train.vw" \
--oaa 10 \
--bit_precision 28 \
--random_seed 17  \
--passes 1 \
--ngram 1 \
--loss_function hinge \
-f "model_passes_1_1.vw" 

/bin/sh: 1: vw: not found


--oaa 10: 10-class classification (one against all)
-d: path to the data
-f: path where the trained model will be saved
-b 28 (the short form of --bit_precision 28): use 28 bits for hashing, so the feature space is limited to 2^28 features. In this case that is more than the number of unique words in the dataset, but once bigrams and trigrams are added, the limit on the size of the feature space will start to matter
--random_seed 17: also fix the random seed
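
To make the `-b 28` limit concrete: 2^28 = 268,435,456 weight slots. The sketch below illustrates the hashing trick in general terms. It uses `zlib.crc32` as a stand-in hash, so the indices will not match the ones VW computes internally, and `feature_index` is a name made up for this example.

```python
import zlib

BITS = 28
NUM_SLOTS = 2 ** BITS  # 268435456 weight slots


def feature_index(token, bits=BITS):
    """Map a token to a slot in a table of 2**bits weights."""
    # Keep only the low `bits` bits of the 32-bit hash value.
    return zlib.crc32(token.encode('utf-8')) & ((1 << bits) - 1)


tokens = 'window scroll function'.split()
bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
for feature in tokens + bigrams:
    print(feature, feature_index(feature))
```

Two different features that land in the same slot share one weight. With unigrams only, collisions are rare at 28 bits; adding bigrams and trigrams multiplies the number of distinct features and makes collisions more likely.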

In [6]:
!vw -i model_passes_1_1.vw -t -d stackoverflow_valid.vw \
-p stackoverflow_valid_pred.txt --quiet
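
The accuracy check that follows reads stackoverflow_valid_labels.txt, whose creation is not shown. Assuming it holds one true label per line, in the same order as stackoverflow_valid.vw, it could be derived from the VW file roughly like this (the helper `extract_labels` is hypothetical):

```python
def extract_labels(vw_lines):
    """Yield the integer class label that precedes the first '|' of each VW line."""
    for line in vw_lines:
        label, _, _ = line.partition('|')
        yield int(label.strip())


# Example with two in-memory lines; with real data, pass an open file instead.
sample = ['10 | i ve got some code\n', '3 | how do i sort a list\n']
print(list(extract_labels(sample)))
```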

In [15]:
with open('stackoverflow_valid_pred.txt') as pred_file, open('stackoverflow_valid_labels.txt') as valid_labels:
    print("1-1", accuracy_score(np.loadtxt(valid_labels), np.loadtxt(pred_file)))

1-1 0.9166626794748937


In [None]:
import itertools

In [17]:
p = (1,3,5)
n = (1,2,3)

In [20]:
for passes, ngram in itertools.product(p, n):
    !vw -d "stackoverflow_train.vw" \
    --oaa 10 \
    --bit_precision 28 \
    --random_seed 17  \
    --cache_file "model_passes_{passes}_ngram_{ngram}.vw.cache" \
    --passes $passes \
    --ngram $ngram \
    --loss_function hinge \
    -f "model_passes_{passes}_ngram_{ngram}.vw" 

Generating 1-grams for all namespaces.
final_regressor = model_passes_1_ngram_1.vw
Num weight bits = 28
learning rate = 0.5
initial_t = 0
power_t = 0.5
using cache_file = model_passes_1_ngram_1.vw.cache
ignoring text input in favor of cache input
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0        1        1      161
0.500000 1.000000            2            2.0        4        1       68
0.750000 1.000000            4            4.0        7        1       88
0.750000 0.750000            8            8.0        7        1       95
0.750000 0.750000           16           16.0        7        7      209
0.781250 0.812500           32           32.0        7        2      174
0.765625 0.750000           64           64.0        3        3      204
0.648438 0.531250          128          128.0        1        5       29
0.60937

0.100007 0.091610      1048576      1048576.0        1        1      422
0.092582 0.092582      2097152      2097152.0        5        5      696 h

finished run
number of examples per pass = 1316717
passes used = 3
weighted example sum = 3950151.000000
weighted label sum = 0.000000
average loss = 0.084107 h
total feature number = 788274150
Generating 2-grams for all namespaces.
final_regressor = model_passes_3_ngram_2.vw
Num weight bits = 28
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = model_passes_3_ngram_2.vw.cache
ignoring text input in favor of cache input
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0        1        1      320
0.500000 1.000000            2            2.0        4        1      134
0.750000 1.000000            4            4.0        7        1      174
0.750

0.531250 0.453125          512          512.0        2        5      134
0.458984 0.386719         1024         1024.0        1        1      262
0.382812 0.306641         2048         2048.0        7        7      140
0.305664 0.228516         4096         4096.0        2        2      636
0.255249 0.204834         8192         8192.0        5        7       46
0.215027 0.174805        16384        16384.0        3        3     1160
0.177612 0.140198        32768        32768.0        3        3       54
0.148590 0.119568        65536        65536.0        4        4      366
0.127281 0.105972       131072       131072.0        2        2      188
0.110416 0.093552       262144       262144.0        5        5      462
0.096605 0.082794       524288       524288.0        6        6      282
0.086320 0.076035      1048576      1048576.0        1        1      842
0.077806 0.077806      2097152      2097152.0        5        5     1390 h
0.073817 0.069828      4194304      4194304.0    

In [25]:
valid_labels = np.loadtxt('stackoverflow_valid_labels.txt')

In [27]:
!vw -i model_passes_1_ngram_1.vw -t -d stackoverflow_valid.vw -p stackoverflow_valid_pred_11.txt --quiet
!vw -i model_passes_1_ngram_2.vw -t -d stackoverflow_valid.vw -p stackoverflow_valid_pred_12.txt --quiet
!vw -i model_passes_1_ngram_3.vw -t -d stackoverflow_valid.vw -p stackoverflow_valid_pred_13.txt --quiet
!vw -i model_passes_3_ngram_1.vw -t -d stackoverflow_valid.vw -p stackoverflow_valid_pred_31.txt --quiet
!vw -i model_passes_3_ngram_2.vw -t -d stackoverflow_valid.vw -p stackoverflow_valid_pred_32.txt --quiet
!vw -i model_passes_3_ngram_3.vw -t -d stackoverflow_valid.vw -p stackoverflow_valid_pred_33.txt --quiet
!vw -i model_passes_5_ngram_1.vw -t -d stackoverflow_valid.vw -p stackoverflow_valid_pred_51.txt --quiet
!vw -i model_passes_5_ngram_2.vw -t -d stackoverflow_valid.vw -p stackoverflow_valid_pred_52.txt --quiet
!vw -i model_passes_5_ngram_3.vw -t -d stackoverflow_valid.vw -p stackoverflow_valid_pred_53.txt --quiet

In [28]:
# Same nine checks as a loop over the grid defined in p and n.
for passes in p:
    for ngram in n:
        pred = np.loadtxt(f'stackoverflow_valid_pred_{passes}{ngram}.txt')
        print(f"passes_{passes}_ngram_{ngram}: ", accuracy_score(valid_labels, pred))

passes_1_ngram_1:  0.9166626794748937
passes_1_ngram_2:  0.932356949811964
passes_1_ngram_3:  0.9299701028968885
passes_3_ngram_1:  0.9180078440593349
passes_3_ngram_2:  0.9305374233263022
passes_3_ngram_3:  0.9272544835401888
passes_5_ngram_1:  0.9180269825798453
passes_5_ngram_2:  0.93123529580634
passes_5_ngram_3:  0.9281464752996887
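
Rather than scanning the printout by eye, the best configuration can be picked programmatically. A small sketch using the validation accuracies printed above, rounded to six decimals; the dictionary `valid_acc` is introduced here for illustration only.

```python
# Validation accuracies from the printout, keyed by (passes, ngram).
valid_acc = {
    (1, 1): 0.916663, (1, 2): 0.932357, (1, 3): 0.929970,
    (3, 1): 0.918008, (3, 2): 0.930537, (3, 3): 0.927254,
    (5, 1): 0.918027, (5, 2): 0.931235, (5, 3): 0.928146,
}

# max over the keys, ranked by their accuracy value.
best_passes, best_ngram = max(valid_acc, key=valid_acc.get)
print(f'best: passes={best_passes}, ngram={best_ngram}, '
      f'accuracy={valid_acc[(best_passes, best_ngram)]:.6f}')
```

Bigrams beat unigrams and trigrams for every number of passes, while additional passes change accuracy only slightly. The winner is one pass with bigrams, which is the model evaluated on the test set next.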


In [35]:
acc1 = accuracy_score(valid_labels, np.loadtxt('stackoverflow_valid_pred_12.txt'))

In [29]:
!vw -i model_passes_1_ngram_2.vw -t -d stackoverflow_test.vw -p stackoverflow_test_pred_12.txt --quiet

In [30]:
test_labels = np.loadtxt('stackoverflow_test_labels.txt')

In [34]:
acc2 = accuracy_score(test_labels, np.loadtxt('stackoverflow_test_pred_12.txt'))
print("Best combination on test set:", acc2)

Best combination on test set: 0.9325346646452743


In [36]:
(acc2-acc1)*100

0.017771483331030513
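
The next model is trained on stackoverflow_merged.vw, whose creation is not shown. Presumably it is the training and validation sets concatenated. Under that assumption, a sketch of the merge could look like this (the helper `concat_files` is hypothetical):

```python
import shutil


def concat_files(sources, destination):
    """Write the contents of each source file to destination, in order."""
    with open(destination, 'wb') as out:
        for path in sources:
            with open(path, 'rb') as src:
                # Streams in chunks, so large files are never fully in memory.
                shutil.copyfileobj(src, out)


# With the notebook's files this would be (assumed usage):
# concat_files(['stackoverflow_train.vw', 'stackoverflow_valid.vw'],
#              'stackoverflow_merged.vw')
```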

In [38]:
!vw -d "stackoverflow_merged.vw" \
--oaa 10 \
--bit_precision 28 \
--random_seed 17 \
--passes 1 \
--ngram 2 \
--loss_function hinge \
-f "model_merged.vw"

Generating 2-grams for all namespaces.
final_regressor = model_merged.vw
Num weight bits = 28
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = stackoverflow_merged.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0        1        1      320
0.500000 1.000000            2            2.0        4        1      134
0.750000 1.000000            4            4.0        7        1      174
0.750000 0.750000            8            8.0        7        1      188
0.750000 0.750000           16           16.0        7        7      416
0.781250 0.812500           32           32.0        7        2      346
0.750000 0.718750           64           64.0        3        3      406
0.648438 0.546875          128          128.0        1        7       56
0.617188 0.585938          256          256.0        5

In [39]:
!vw -i model_merged.vw -t -d stackoverflow_test.vw -p stackoverflow_test_pred_merged.txt --quiet

In [51]:
acc_merged = accuracy_score(test_labels, np.loadtxt('stackoverflow_test_pred_merged.txt'))
print("Best combination on test set:", acc_merged, "Acc gain:", (acc_merged-acc2)*100)

Best combination on test set: 0.9364471250524601 Acc gain: 0.3912460407185736


It is remarkable that a simple VW model can be trained on a dataset of this size in seconds or minutes on modest hardware, with no Hadoop cluster at all.