<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose. This material is a translated version of the Capstone project (by the same author) from specialization "Machine learning and data analysis" by Yandex and MIPT. No solutions shared.

# <center> Week 6. Vowpal Wabbit
This week, we explore the popular library called Vowpal Wabbit and apply it to site visits data.

Week 6 roadmap:
- Part 1. Overview of Vowpal Wabbit
- Part 2. Applying Vowpal Wabbit to site visits data
    - 2.1 Data preprocessing
    - 2.2 Holdout validation
    - 2.3 Test set validation (on public leaderboard)
    
Resources: Vowpal Wabbit's [documentation](https://github.com/JohnLangford/vowpal_wabbit/wiki)

## The task 
1. Fill in code in this notebook
2. Choose answers in the [webform](https://docs.google.com/forms/d/1VWfSupfYXvb6gyROR0enXYVMjuxqgRaTScYhEz4f6YQ)

## Part 1. Overview of Vowpal Wabbit

Read the [article](https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-8-vowpal-wabbit-fast-learning-with-gigabytes-of-data-60f750086237) on Vowpal Wabbit from the OpenDataScience machine learning course. Download the [notebook](https://mlcourse.ai/notebooks/blob/master/jupyter_english/topic08_sgd_hashing_vowpal_wabbit/topic8_sgd_hashing_vowpal_wabbit.ipynb?flush_cache=true), play with the code a bit - that is the most effective way to get started.  

## Part 2. Applying Vowpal Wabbit to site visits data

## 2.1 Data preprocessing

Now we will see Vowpal Wabbit in action. If we used it for a binary classification task, we would not notice any difference in either accuracy or speed. Instead, we will do 400-class classification. The source data is the same, but now we have 400 users, and our goal is to identify each one of them. 
- Download the data from [here](https://www.kaggle.com/c/identify-me-if-you-can4/data) - files **train_sessions_400users.csv** and **test_sessions_400users.csv**

In [41]:
import os
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

import numpy as np

In [2]:
# Change to your data path
PATH_TO_DATA = './capstone_user_identification'

Read the train and test data. You may notice that the sessions in the test subset span a different time period than the train sessions. 

In [3]:
train_df_400 = pd.read_csv(os.path.join(PATH_TO_DATA,'train_sessions_400users.csv'), 
                           index_col='session_id')

In [4]:
test_df_400 = pd.read_csv(os.path.join(PATH_TO_DATA,'test_sessions_400users.csv'), 
                           index_col='session_id')

In [5]:
train_df_400.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,user_id
1,23713,2014-03-24 15:22:40,23720.0,2014-03-24 15:22:48,23713.0,2014-03-24 15:22:48,23713.0,2014-03-24 15:22:54,23720.0,2014-03-24 15:22:54,...,2014-03-24 15:22:55,23713.0,2014-03-24 15:23:01,23713.0,2014-03-24 15:23:03,23713.0,2014-03-24 15:23:04,23713.0,2014-03-24 15:23:05,653
2,8726,2014-04-17 14:25:58,8725.0,2014-04-17 14:25:59,665.0,2014-04-17 14:25:59,8727.0,2014-04-17 14:25:59,45.0,2014-04-17 14:25:59,...,2014-04-17 14:26:01,45.0,2014-04-17 14:26:01,5320.0,2014-04-17 14:26:18,5320.0,2014-04-17 14:26:47,5320.0,2014-04-17 14:26:48,198
3,303,2014-03-21 10:12:24,19.0,2014-03-21 10:12:36,303.0,2014-03-21 10:12:54,303.0,2014-03-21 10:13:01,303.0,2014-03-21 10:13:24,...,2014-03-21 10:13:36,303.0,2014-03-21 10:13:54,309.0,2014-03-21 10:14:01,303.0,2014-03-21 10:14:06,303.0,2014-03-21 10:14:24,34
4,1359,2013-12-13 09:52:28,925.0,2013-12-13 09:54:34,1240.0,2013-12-13 09:54:34,1360.0,2013-12-13 09:54:34,1344.0,2013-12-13 09:54:34,...,2013-12-13 09:54:34,1346.0,2013-12-13 09:54:34,1345.0,2013-12-13 09:54:34,1344.0,2013-12-13 09:58:19,1345.0,2013-12-13 09:58:19,601
5,11,2013-11-26 12:35:29,85.0,2013-11-26 12:35:31,52.0,2013-11-26 12:35:31,85.0,2013-11-26 12:35:32,11.0,2013-11-26 12:35:32,...,2013-11-26 12:35:32,11.0,2013-11-26 12:37:03,85.0,2013-11-26 12:37:03,10.0,2013-11-26 12:37:03,85.0,2013-11-26 12:37:04,273


**There are 182793 sessions in train data, 46473 in test data and 400 unique users.**

In [6]:
train_df_400.shape, test_df_400.shape, train_df_400['user_id'].nunique()

((182793, 21), (46473, 20), 400)

In [7]:
train_df_400['user_id'].unique().max(),train_df_400['user_id'].unique().min()

(997, 1)

Vowpal Wabbit requires class labels to be encoded from 1 to K, where K is the total number of classes in the classification task (400 in our case). So we apply *LabelEncoder* and add 1 to its result (*LabelEncoder* maps all labels to the range 0 to K-1). We will also have to perform the inverse transformation later.

In [10]:
y = train_df_400.user_id
class_encoder = LabelEncoder()
y_for_vw = class_encoder.fit_transform(y) + 1
y_for_vw
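
As a small sanity check of the encoding described above, here is a sketch on toy user IDs (the array and the `toy_` names are made up for illustration). It shows the shift to the 1..K range and the inverse transformation that we will need later.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# toy user IDs, used for illustration only
toy_users = np.array([653, 198, 34, 653])

toy_encoder = LabelEncoder()
# LabelEncoder returns 0..K-1, so add 1 to get labels in 1..K
toy_labels_vw = toy_encoder.fit_transform(toy_users) + 1

# undo the shift before asking the encoder for the original IDs
toy_restored = toy_encoder.inverse_transform(toy_labels_vw - 1)

print(toy_labels_vw)
print(toy_restored)
```

The restored array should match the original IDs exactly, which is the property we rely on when preparing the submission.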

Next we will compare VW with SGDClassifier and logistic regression. All these models require preprocessed data. Prepare sparse matrices for the sklearn models (just like we did in the previous part):
- concatenate train and test data
- choose only websites (features 'site1' through 'site10')
- impute missing values with 0 (site IDs start from 1, so 0 is free to mark a missing visit; that column is dropped later)
- transform data to *csr_matrix* 
- split back to *train* and *test*

In [8]:
train_df_400.fillna(0, inplace=True)
test_df_400.fillna(0, inplace=True)

In [9]:
full_df = pd.concat([train_df_400.drop('user_id', axis=1), test_df_400])

idx_split = train_df_400.shape[0]

In [10]:
sites = ['site' + str(i) for i in range(1, 11)]

In [11]:
test_df = test_df_400[sites]
train_df = train_df_400[sites]

In [20]:
full_sites = full_df[sites]
full_sites = full_sites.astype(int)
full_sites.head()

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
1,23713,23720,23713,23713,23720,23713,23713,23713,23713,23713
2,8726,8725,665,8727,45,8725,45,5320,5320,5320
3,303,19,303,303,303,303,303,309,303,303
4,1359,925,1240,1360,1344,1359,1346,1345,1344,1345
5,11,85,52,85,11,52,11,85,10,85


In [21]:
sites_flatten = full_sites.values.flatten()
full_sites_sparse = csr_matrix(([1] * sites_flatten.shape[0],
                                sites_flatten,
                                range(0, sites_flatten.shape[0]  + 10, 10)))[:, 1:]
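
To see why this construction yields per-session site counts, here is a minimal sketch on a toy array (two sessions of length 3; all `toy_` names are hypothetical). Each row contributes a run of ones, the site IDs act as column indices, and the row pointer advances by the session length. Repeated IDs within a session add up when the matrix is converted to dense form, and slicing off column 0 discards the placeholder for missing visits.

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy data: 2 sessions, 3 visits each; 0 marks a missing visit
toy_sites = np.array([[1, 2, 1],
                      [3, 0, 0]])
session_len = toy_sites.shape[1]
toy_flat = toy_sites.flatten()

toy_sparse = csr_matrix((np.ones(toy_flat.shape[0], dtype=int),   # data: one per visit
                         toy_flat,                                 # column index = site ID
                         np.arange(0, toy_flat.shape[0] + session_len,
                                   session_len)))[:, 1:]           # drop the "missing" column

toy_dense = toy_sparse.toarray()
print(toy_dense)
```

In the first toy session, site 1 was visited twice and site 2 once, so the first row holds the counts 2, 1, 0.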

In [22]:
X_train_sparse = full_sites_sparse[:idx_split, :]
X_test_sparse = full_sites_sparse[idx_split:, :]
y = train_df_400.user_id.values

## 2.2 Holdout validation

Split the data into training (70%) and validation (30%) subsets. We do not shuffle the data, because the sessions are sorted by time.

In [29]:
train_share = int(.7 * train_df_400.shape[0])

In [None]:
train_df_part = train_df_400[sites].iloc[:train_share, :]
valid_df = train_df_400[sites].iloc[train_share:, :]

In [30]:
X_train_part_sparse = X_train_sparse[:train_share, :]
X_valid_sparse = X_train_sparse[train_share:, :]

In [31]:
y_train_part = y[:train_share]
y_valid = y[train_share:]

In [90]:
y_train_part_for_vw = y_for_vw[:train_share]
y_valid_for_vw = y_for_vw[train_share:]
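
A quick sketch of what this split amounts to, using only the number of training sessions reported earlier (182793). The list of integers stands in for time-ordered sessions and is purely illustrative.

```python
n_sessions = 182793                       # size of the training set reported above
demo_train_share = int(0.7 * n_sessions)  # first 70% of sessions go to training
n_valid = n_sessions - demo_train_share   # the rest is the holdout set

# stand-in for time-ordered sessions: position = order in time
demo_sessions = list(range(n_sessions))
demo_train = demo_sessions[:demo_train_share]
demo_valid = demo_sessions[demo_train_share:]

print(demo_train_share, n_valid)
```

Because nothing is shuffled, every holdout session comes after every training session, which mimics predicting the future from the past.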

Implement the function **arrays_to_vw**, which transforms data to the Vowpal Wabbit format. 

Input: 
- X - numpy matrix (training data)
- y (optional) - target numpy vector. It is optional since we will apply the same function to test data.
- train (flag) - True, if we are passing training data as X, False otherwise
- out_file - path to .vw file, in which we'll write results

Details:
- iterate over the rows of X and write each row's values to the file, separated by whitespace. Put the target value at the start of each line and separate it from the features with |.
- when applying the function to test data, you can write any target value (1, for example)

In [91]:
def arrays_to_vw(X, y=None, train=True, out_file='tmp.vw'):
    """Write the rows of X to out_file (inside PATH_TO_DATA) in VW format."""
    with open(os.path.join(PATH_TO_DATA, out_file), 'w') as vw_file:
        for i, row in enumerate(X):
            # test rows have no target, so write a dummy label of 1
            label = y[i] if train else 1
            features = ' '.join(row.astype(int).astype(str))
            vw_file.write('{} |sites {}\n'.format(label, features))
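
For reference, here is a sketch of what a single line in the resulting file looks like. The helper `row_to_vw_line` and the toy row are hypothetical and only mirror the format described above: a label, then `|`, a namespace name, and whitespace-separated site IDs.

```python
import numpy as np

def row_to_vw_line(row, label=1, namespace='sites'):
    """Format one session as a VW line (hypothetical helper, for illustration)."""
    features = ' '.join(row.astype(int).astype(str))
    return '{} |{} {}'.format(label, namespace, features)

# toy session: floats, as they come out of a DataFrame with imputed values
toy_row = np.array([23713., 23720., 0.])
toy_line = row_to_vw_line(toy_row, label=262)
print(toy_line)
```

Casting to int first matters: without it the IDs would be written as `23713.0`, which VW would treat as different feature names.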

Apply the function to the training subset (train_df_part, y_train_part_for_vw), to the holdout set (valid_df, y_valid_for_vw), to the whole training data and to the whole test data. **Note that our method takes only numpy arrays as inputs.**

In [92]:
%%time
# should be 4 calls
arrays_to_vw(train_df_part.values, y_train_part_for_vw, out_file='train_part.vw')
arrays_to_vw(valid_df.values, y_valid_for_vw, out_file='valid.vw')
arrays_to_vw(train_df.values, y_for_vw, out_file='train.vw')
arrays_to_vw(test_df.values, train=False, out_file='test.vw')

Wall time: 6.65 s


In [None]:
# Won't work on Windows
!head -3 $PATH_TO_DATA/train_part.vw

In [None]:
# Won't work on Windows
!head -3  $PATH_TO_DATA/valid.vw

In [None]:
# Won't work on Windows
!head -3 $PATH_TO_DATA/test.vw

Train Vowpal Wabbit on **train_part.vw**. Specify a classification task with 400 classes (**--oaa**) and make 3 passes over the dataset (**--passes**). You can also specify a cache file (**--cache_file** or the flag **-c**) so that VW performs all passes after the first one faster (the argument **-k** deletes a previous cache file). Also set **-b 26**: this is the number of bits used for hashing, and in this case we need more than the default 18 bits. Finally, specify **--random_seed 17**. Do not change other parameters.
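
To get a feeling for why 18 bits may be too few here, consider this back-of-the-envelope sketch. It assumes (as far as I understand VW's one-against-all reduction; treat this as an assumption, not a statement from the VW docs) that the 400 per-class models share a single weight table of 2^b entries.

```python
n_classes = 400
default_bits, used_bits = 18, 26

default_slots = 2 ** default_bits   # size of the hash table with the default -b 18
used_slots = 2 ** used_bits         # size of the hash table with -b 26

# rough number of table entries available per class under the sharing assumption
per_class_default = default_slots // n_classes
per_class_used = used_slots // n_classes

print(default_slots, per_class_default)
print(used_slots, per_class_used)
```

Under this assumption the default table leaves only a few hundred entries per class, far fewer than the number of distinct sites, so hash collisions would be frequent.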

In [114]:
train_part_vw = os.path.join(PATH_TO_DATA, 'train_part.vw')
valid_vw = os.path.join(PATH_TO_DATA, 'valid.vw')

train_vw = os.path.join(PATH_TO_DATA, 'train.vw')
test_vw = os.path.join(PATH_TO_DATA, 'test.vw')

model = os.path.join(PATH_TO_DATA, 'vw_model.vw')
pred = os.path.join(PATH_TO_DATA, 'vw_pred.csv')

pred_test = os.path.join(PATH_TO_DATA, 'vw_test_pred.csv')

In [98]:
%%time
!vw --oaa 400 -d $train_part_vw -f vw_model_part.vw -c -k --passes 3 -b 26  --random_seed 17 --quiet

Wall time: 1min


Write predictions for **valid.vw** to **vw_valid_pred.csv**.

In [100]:
%%time
!vw -t -i vw_model_part.vw -d $valid_vw -p vw_valid_pred.csv --random_seed 17 --quiet

Wall time: 2.42 s


Read the predictions from *vw_valid_pred.csv* and compute the fraction of correct answers on the holdout set. 

In [104]:
vw_pred = np.loadtxt('vw_valid_pred.csv')
test_labels = y_valid_for_vw
accuracy_score(test_labels, vw_pred)

0.34545388234435975
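
The number above is simply the share of sessions where the predicted label equals the true one. A minimal NumPy sketch on toy labels (made up for illustration) shows the same computation. Note that predictions loaded with `np.loadtxt` are floats, which still compare equal to integer labels.

```python
import numpy as np

toy_true = np.array([1, 2, 3, 4])          # integer labels, as in y_valid_for_vw
toy_pred = np.array([1., 2., 4., 4.])      # float predictions, as np.loadtxt returns them

# accuracy = fraction of positions where prediction and truth coincide
toy_accuracy = np.mean(toy_true == toy_pred)
print(toy_accuracy)
```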

Now train *SGDClassifier* (3 passes, logistic loss function) and *LogisticRegression* on 70% of the sparse training set (X_train_part_sparse, y_train_part), make predictions for the holdout set (X_valid_sparse, y_valid) and calculate accuracy. Logistic regression will take some time to fit (around 8 minutes for me), which is okay; set *multi_class='multinomial'* to make it train much faster. Specify *random_state=17* and *n_jobs=-1* everywhere. For *SGDClassifier*, also specify *max_iter=3*.

In [108]:
logit = LogisticRegression(solver="lbfgs",random_state=17, n_jobs=-1, multi_class='multinomial')

In [109]:
%%time
logit.fit(X_train_part_sparse,y_train_part)

Wall time: 10min 23s


In [110]:
accuracy_score(y_valid, logit.predict(X_valid_sparse))

0.35305809839892044

In [34]:
%%time
# in scikit-learn 1.3+ the logistic loss is spelled loss='log_loss'
sgd_logit = SGDClassifier(loss='log', max_iter=3, random_state=17, n_jobs=-1)
sgd_logit.fit(X_train_part_sparse,y_train_part)

Wall time: 15.7 s


In [35]:
accuracy_score(y_valid, sgd_logit.predict(X_valid_sparse))

0.2910755315657026

- **Calculate accuracy on the holdout set for Vowpal Wabbit, round to 3 decimal places**
- **Calculate accuracy on the holdout set for SGD, round to 3 decimal places**
- **Calculate accuracy on the holdout set for logistic regression, round to 3 decimal places**

In [None]:
vw_valid_acc = 0.345     # from 0.34545388234435975
sgd_valid_acc = 0.291    # from 0.2910755315657026
logit_valid_acc = 0.353  # from 0.35305809839892044

## 2.3 Test set validation (public leaderboard)

Train a VW model with the same parameters on the whole training data (**train.vw**).

In [113]:
%%time
!vw --oaa 400 -d $train_vw -f vw_model.vw -c -k --passes 3 -b 26  --random_seed 17 --quiet

Wall time: 1min 5s


Make predictions for the test data.

In [117]:
%%time
!vw -t -i vw_model.vw -d $test_vw -p vw_pred.csv --random_seed 17 --quiet

Wall time: 1.89 s


Write the predictions to a file, perform the reverse label transformation (we got our labels by adding 1 to the output of the *LabelEncoder* instance) and send the submission to Kaggle. 

In [39]:
def write_to_submission_file(predicted_labels, out_file,
                             target='user_id', index_label="session_id"):
    # turn predictions into data frame and save as csv file
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

In [126]:
a = np.loadtxt('vw_pred.csv')-1

In [129]:
vw_pred = class_encoder.inverse_transform(np.loadtxt('vw_pred.csv').astype(int)-1)


In [135]:
write_to_submission_file(vw_pred, os.path.join(PATH_TO_DATA, 'vw_400_users.csv'))

Do the same for SGD and logistic regression. I know, it is pretty annoying to wait for logistic regression to fit on this data, but let's be patient. 

In [None]:
logit = LogisticRegression(solver="lbfgs",random_state=17, n_jobs=-1, multi_class='multinomial')

In [36]:
%%time
# in scikit-learn 1.3+ the logistic loss is spelled loss='log_loss'
sgd_logit = SGDClassifier(loss='log', max_iter=3, random_state=17, n_jobs=-1)
sgd_logit.fit(X_train_sparse,y)

Wall time: 23.7 s


In [37]:
sgd_logit_test_pred = sgd_logit.predict(X_test_sparse)

In [None]:
%%time
logit.fit(X_train_sparse,y)

In [143]:
logit_test_pred = logit.predict(X_test_sparse)

In [144]:
write_to_submission_file(logit_test_pred, 
                         os.path.join(PATH_TO_DATA, 'logit_400_users.csv'))

In [42]:
write_to_submission_file(sgd_logit_test_pred, 
                         os.path.join(PATH_TO_DATA, 'sgd_400_users_log2.csv'))

Let's look at Public Leaderboard scores in [this](https://www.kaggle.com/c/identify-me-if-you-can4) competition.

- **What is the Public Leaderboard score for Vowpal Wabbit?**
- **What is the Public Leaderboard score for SGD?**
- **What is the Public Leaderboard score for logistic regression?**

In conclusion:
- think about how Vowpal Wabbit, SGD and logistic regression compare in terms of training speed and classification quality
- the 400-user classification task probably can't be solved well enough if we use an "honest" time-based split for testing. Next we will compete in identifying only one user (Alice); [here](https://inclass.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2) is the competition you are advised to participate in.

Good luck!