## Linear methods. Vowpal Wabbit.

Vowpal Wabbit on GitHub: https://github.com/JohnLangford/vowpal_wabbit

Vowpal Wabbit Tutorial: https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial

In [None]:
!apt-get install vowpal-wabbit

In [None]:
!wget https://www.dropbox.com/s/crld672bipr0n05/train-sample.csv?dl=0

In [None]:
!ls

In [None]:
import pandas as pd
import numpy as np

In [None]:
train_path = 'train-sample.csv?dl=0'

In [None]:
data = pd.read_csv(train_path)
print(len(data))
# Keep head() as the last expression so the notebook actually renders it.
data.head()

In [None]:
print(data.OpenStatus[10])
print(data.BodyMarkdown[10])

In [None]:
data_train = data.iloc[:50000, :]
data_test = data.iloc[70000:, :]

In [None]:
import re

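def save_to_vw(data, fname):
    """Write the rows of `data` to `fname` in Vowpal Wabbit text format."""
    with open(fname, 'w') as fout:
        for _, row in data.iterrows():
            # Keep lowercase Latin tokens longer than one character.
            words = re.split("[^a-z]", row.BodyMarkdown.lower())
            text = ' '.join(w for w in words if len(w) > 1)
            # Binary label: 1 for open questions, -1 for everything else.
            target = 1 if row.OpenStatus == "open" else -1
            # Namespace n: reputation (numeric feature "0") and the first tag;
            # namespace t: the tokenized body text.
            fout.write('{0} |n 0:{1} {2} |t {3}\n'.format(
                target, row.ReputationAtPostCreation, row.Tag1, text))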

In [None]:
save_to_vw(data_train, 'train.vw')
save_to_vw(data_test, 'test.vw')

In [None]:
!ls

In [None]:
!head -n 2 train.vw
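
To make the file format concrete, here is a minimal sketch that builds a single VW line the same way `save_to_vw` does. The helper name `to_vw_line` and the sample values are hypothetical and not taken from the dataset. The label comes first, then namespace `n` (a numeric feature named `0` holding the reputation, plus the tag as a categorical feature), then namespace `t` with the body tokens.

```python
import re

def to_vw_line(status, reputation, tag, body):
    # Hypothetical helper: mirrors save_to_vw for one row, for illustration only.
    tokens = [w for w in re.split("[^a-z]", body.lower()) if len(w) > 1]
    target = 1 if status == "open" else -1
    return '{0} |n 0:{1} {2} |t {3}'.format(target, reputation, tag, ' '.join(tokens))

# Made-up sample values, not a row from train-sample.csv.
line = to_vw_line("open", 120, "python", "How do I sort a list?")
print(line)
```

Single-character tokens such as "I" and "a" are dropped by the length filter, so only the longer words reach the `t` namespace.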

In [None]:
!vw -d train.vw -c -k -f model.vw --passes 10 --link logistic

In [None]:
!vw -d test.vw -i model.vw -t -p pred.txt

In [None]:
!head -n 10 pred.txt
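
With `--link logistic`, the values written to `pred.txt` are the raw linear scores passed through the logistic function, so they fall in (0, 1). The sketch below uses made-up raw scores (an assumption, not model output) to show that this transformation is monotone: it rescales the predictions without changing their order, which is why a ranking metric such as ROC AUC is not affected by it.

```python
import math

def sigmoid(x):
    # Logistic function: maps any real score into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

# Made-up raw scores for illustration only.
raw = [-2.0, 0.5, 3.0, -0.1]
probs = [sigmoid(x) for x in raw]

# Indices sorted by raw score and by transformed score.
order_raw = sorted(range(len(raw)), key=lambda i: raw[i])
order_probs = sorted(range(len(probs)), key=lambda i: probs[i])
print(order_raw == order_probs)
```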

In [None]:
from sklearn.metrics import roc_auc_score

def calc_vw_qual():
    """Print the ROC AUC of the predictions in pred.txt against data_test."""
    preds = pd.read_csv('pred.txt', header=None).iloc[:, 0].values
    # Same label encoding as in save_to_vw: 1 for "open", -1 otherwise.
    target = np.where(data_test.OpenStatus.values == 'open', 1., -1.)
    print(roc_auc_score(target, preds))

calc_vw_qual()

In [None]:
!vw -d train.vw -c -k -f model.vw --passes 10 -l 0.1 --link logistic
!vw -d test.vw -i model.vw -t -p pred.txt
print('\n\n\n')
calc_vw_qual()

n-grams with n=2 (bigrams) are indicator features marking that two words occurred next to each other. From "mom washed windows" we get the bigrams "mom washed" and "washed windows".
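
As a rough illustration, bigrams can be extracted by sliding a window of two tokens over a sentence. The function name `ngrams` is hypothetical. VW builds these features internally when `--ngram` is passed, and its hashed representation may differ from the plain strings shown here.

```python
def ngrams(tokens, n=2):
    # Slide a window of n consecutive tokens over the list.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("mom washed windows".split())
print(bigrams)
```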

In [None]:
!vw -d train.vw -c -k -f model.vw --passes 10 -l 0.1 --ngram t2 --link logistic
!vw -d test.vw -i model.vw -t -p pred.txt
print('\n\n\n')
calc_vw_qual()

k-skip-n-grams are like n-grams, except that the words may be separated by up to k other words.
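
A minimal sketch of the idea for n=2: each word is paired with every later word that has at most k words between them. The function name `skip_bigrams` is hypothetical, and this is only an illustration of the definition, not VW's internal implementation of `--skips`.

```python
def skip_bigrams(tokens, k):
    # Pair each token with every later token separated by at most k tokens.
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append(tokens[i] + ' ' + tokens[j])
    return pairs

pairs = skip_bigrams("mom washed windows today".split(), k=1)
print(pairs)
```

With k=0 the function reduces to ordinary bigrams, since only adjacent words are paired.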

In [None]:
!vw -d train.vw -c -k -f model.vw --passes 10 -l 0.1 --ngram t2 --skips t2 --link logistic
!vw -d test.vw -i model.vw -t -p pred.txt
print('\n\n\n')
calc_vw_qual()

In [None]:
!vw -d train.vw -c -k -f model.vw --passes 10 -l 0.1 --ngram t2 -b 18 --link logistic
!vw -d test.vw -i model.vw -t -p pred.txt
print('\n\n\n')
calc_vw_qual()