<center>
<img src="../../img/ods_stickers.jpg">
## Открытый курс по машинному обучению
Автор материала: программист-исследователь Mail.ru Group, старший преподаватель Факультета Компьютерных Наук ВШЭ Юрий Кашницкий. Материал распространяется на условиях лицензии [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Можно использовать в любых целях (редактировать, поправлять и брать за основу), кроме коммерческих, но с обязательным упоминанием автора материала.

# <center> Соревнование по прогнозированию популярности статьи на портале Medium
## <center>

[Ссылка](https://mlcourse.arktur.io/) на соревнование.

**Задача** 

Есть выборка статей с популярного англоязычного портала Medium. Задача – спрогнозировать число рекомендаций ("лайков") статьи.
Предлагается Вам самим составить обучающую и тестовую выборки на основе имеющихся данных, обучить модель-регрессор и сформировать файл посылки с прогнозами – числом рекомендаций статей (с `log1p`-преобразованием) из тестовой выборки.

**Данные**

Обучающая выборка – 52699 статей, опубликованных до 2016 года включительно (**train.zip** ~ 480 Mb, unzip ~1.6 Gb). Тестовая выборка – 39492 статьи, опубликованные с 1 января по 27 июня 2017 года (**test.zip** ~425 Mb, unzip ~1.4 Gb).

Данные о статьях представлены в JSON формате с полями:
- _id и url – URL статьи
- published – время публикации
- title – название статьи
- author – имя автора, его акканут на Твиттере и Medium
- content – HTML-контент статьи
meta_tags – остальная информация о статье

В файле **train_log1p_recommends.csv** представлены номера (id) статей из обучающей выборки вместе с целевым показателем: числом рекомендаций статей, к которому применено преобразование `log1p(x) = log(1 + x)` В файле **sample_submission.csv** представлен пример файла посылки.

In [1]:
import os
import json
import pickle
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from glob import glob
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import SGDRegressor

import zlib
import re
import glob

In [2]:
PATH_TO_DATA = 'medium'

Посмотрим на одну из строчек в JSON-файле: считаем ее с помощью библиотеки json. Эта строчка соответствует [7-ой статье](https://habrahabr.ru/post/7/) на Хабре.

In [3]:
!head -1 medium/train.json > medium/train1.json
!head -2 medium/train.json | tail -1 > medium/train2.json

In [14]:
!head -2 medium/test.json | tail -1 > medium/test2.json

In [4]:
with open('medium/train2.json') as inp_json:
    first_json = json.load(inp_json)

In [5]:
first_json.keys()

dict_keys(['url', 'domain', 'title', 'quality', 'meta_tags', 'tags', '_timestamp', 'content', 'image_url', 'published', '_spider', '_id', 'author', 'link_tags'])

In [15]:
with open('medium/test2.json') as inp_json:
    first_json_t = json.load(inp_json)

In [16]:
first_json_t.keys()

dict_keys(['url', 'domain', 'title', 'meta_tags', 'tags', '_timestamp', 'content', 'image_url', 'published', '_spider', '_id', 'author', 'link_tags'])

In [6]:
first_json['_id']

'https://medium.com/policy/medium-privacy-policy-f03bf92035c9'

In [7]:
first_json['_timestamp']

1498505468.315491

In [8]:
first_json['url']

'https://medium.com/policy/medium-privacy-policy-f03bf92035c9'

In [9]:
first_json['domain']

'medium.com'

In [10]:
first_json['published']

{'$date': '2012-08-13T22:57:17.248Z'}

In [11]:
first_json['title']

'Medium Privacy Policy – Medium Policy – Medium'

In [12]:
first_json['content'].split('recommends')[1].split('<')[0].split('>')[1]

'396'

In [13]:
first_json['content']

'<header class="container u-maxWidth740"><div class="postMetaHeader u-paddingBottom10 row"><div class="col u-size12of12 js-postMetaLockup"><div class="postMetaLockup postMetaLockup--authorWithBio u-flex js-postMetaLockup"><div class="u-flex0"><a class="link avatar u-baseColor--link" href="https://medium.com/@Medium?source=post_header_lockup" data-action="show-user-card" data-action-source="post_header_lockup" data-action-value="504c7870fdb6" data-action-type="hover" data-user-id="504c7870fdb6" dir="auto"><img src="https://cdn-images-1.medium.com/fit/c/120/120/1*P_xM00gtMxf1Iw0tPewjnA.png" class="avatar-image avatar-image--small" alt="Go to the profile of Medium"></a></div><div class="u-flex1 u-paddingLeft15 u-overflowHidden"><a class="link link link--darken link--darker u-baseColor--link" href="https://medium.com/@Medium?source=post_header_lockup" data-action="show-user-card" data-action-source="post_header_lockup" data-action-value="504c7870fdb6" data-action-type="hover" data-user-id=

In [17]:
first_json['tags']

[]

In [14]:
first_json['author']

{'name': None, 'twitter': '@Medium', 'url': 'https://medium.com/@Medium'}

In [15]:
first_json['link_tags']

{'alternate': 'android-app://com.medium.reader/https/medium.com/p/f03bf92035c9',
 'apple-touch-icon': 'https://cdn-images-1.medium.com/fit/c/120/120/1*etUthOXG-BrZm25K7wEcgA.png',
 'author': 'https://medium.com/@Medium',
 'canonical': 'https://medium.com/policy/medium-privacy-policy-f03bf92035c9',
 'icon': 'https://cdn-static-1.medium.com/_/fp/icons/favicon-medium.TAS6uQ-Y7kcKgi0xjcYHXw.ico',
 'mask-icon': 'https://cdn-static-1.medium.com/_/fp/icons/favicon.KjTfUJo7yJH_fCoUzzH3cg.svg',
 'publisher': 'https://plus.google.com/103654360130207659246',
 'search': '/osd.xml',
 'stylesheet': 'https://cdn-static-1.medium.com/_/fp/css/main-base.XLD2lHzXGLucmHpFdiqzSg.css'}

In [19]:
first_json_t['meta_tags']

{'al:android:app_name': 'Medium',
 'al:android:package': 'com.medium.reader',
 'al:android:url': 'medium://p/f44918df914b',
 'al:ios:app_name': 'Medium',
 'al:ios:app_store_id': '828256236',
 'al:ios:url': 'medium://p/f44918df914b',
 'al:web:url': 'https://hackernoon.com/how-does-rsa-work-f44918df914b',
 'article:author': '10154344354411361',
 'article:published_time': '2017-06-23T14:27:27.083Z',
 'article:publisher': 'https://www.facebook.com/hackernoon',
 'author': 'Short Tech Stories',
 'description': 'RSA is an asymmetric system , which means that a key pair will be generated (we will see how soon) , a public key and a private key , obviously you keep your private key secure and pass around the…',
 'fb:app_id': '542599432471018',
 'og:description': 'Hey guys , I wanted to write a little bit about RSA cryptosystem .',
 'og:image': 'https://cdn-images-1.medium.com/max/1200/1*pZvP5n6jrz-KYb1MJdR6-A.png',
 'og:site_name': 'Hacker Noon',
 'og:title': 'How does RSA work? – Hacker Noon',


In [16]:
first_json['meta_tags']

{'al:android:app_name': 'Medium',
 'al:android:package': 'com.medium.reader',
 'al:android:url': 'medium://p/f03bf92035c9',
 'al:ios:app_name': 'Medium',
 'al:ios:app_store_id': '828256236',
 'al:ios:url': 'medium://p/f03bf92035c9',
 'al:web:url': 'https://medium.com/policy/medium-privacy-policy-f03bf92035c9',
 'article:author': 'https://medium.com/@Medium',
 'article:published_time': '2012-08-13T22:57:17.248Z',
 'article:publisher': 'https://www.facebook.com/medium',
 'author': 'Medium',
 'description': 'Privacy is important. We respect yours. Our goal is to do more than we have to by law — we want to earn your trust that we are careful with your data. This policy sets out our privacy practices and…',
 'fb:app_id': '542599432471018',
 'og:description': 'Effective Date: April 10, 2014',
 'og:site_name': 'Medium',
 'og:title': 'Medium Privacy Policy – Medium Policy – Medium',
 'og:type': 'article',
 'og:url': 'https://medium.com/policy/medium-privacy-policy-f03bf92035c9',
 'referrer': '

Загрузим ответы на обучающей выборке.

In [20]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 
                                        'train_log1p_recommends.csv'), index_col='id')
y_train = train_target['log_recommends'].values

In [21]:
train_target.head()

Unnamed: 0_level_0,log_recommends
id,Unnamed: 1_level_1
0,6.90875
1,5.98394
6,6.21661
12,2.30259
15,4.70048


Сформируйте обучающую выборку для Vowpal Wabbit, выберите признаки title, tags, domain, flow, author, и hubs из JSON-файла.
От самого текста для начала просто возьмем его длину: постройте признак content_len – длина текста в миллионах символов.
Также постройте признаки: час и месяц публикации статьи. Еще, конечно же, возьмите ответы на обучающей выборке из `train_target`. Ниже пример того, как могут выглядеть первые две строки нового файла.

In [22]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ' '.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [23]:
def nonuniq_words(text):
    return [e.lower() for e in re.findall("\w+", text, re.UNICODE)]

def prepare(text):
    return ' '.join(nonuniq_words(text))

def prepare_ngr(v):
    s = v.lower().replace('\n', ' ').replace('|', '_').replace(':', '_')
    res = []
    for i in range(len(s)):
        if s[i] == ' ':
            continue
        if not s[i].isalpha():
            res.append(s[i])
        #if i < len(s) - 1:
        #    res.append(s[i:i+2].replace(' ', '_'))
        if i < len(s) - 2:
            res.append(s[i:i+3].replace(' ', '_'))
    return ' '.join(res)

In [24]:
def process_json(json_data):
    res = dict()
    for k, v in json_data.items():
        if k == 'quality':
            continue
        if isinstance(v, dict):
            for k1, v1 in v.items():
                #print(k, k1, v1)
                res[k + '_' + k1] = v1
        else:
            #print(k, v)
            res[k] = v
    return res

def getFeatures(json_data):
    sc = json_data['content']
    tc = nonuniq_words(sc)

    st = json_data['title']
    tt = nonuniq_words(st)

    return [np.log(1 + len(sc)),
                 np.log(1 + len(tc)),
                 len(set(tc)) / len(tc),
                 len(zlib.compress(sc.encode('utf-8'))) / len(sc.encode('utf-8')),
                 np.log(1 + len(st)),
                 np.log(1 + len(tt)),
                 len(set(tt)) / len(tt),
                 len(zlib.compress(st.encode('utf-8'))) / len(st.encode('utf-8')),
                 int(sc.find('.gif') != -1)
                ]

In [25]:
def prepareData(input_filename, output_filename, y_train=None, is_content=False, is_ngr=False):
    my_targets = []

    with open(output_filename, 'w', encoding='utf-8') as fout, \
         open(input_filename) as inp_json:
        k = 0
        for line in tqdm_notebook(inp_json):
            json_data = json.loads(line)
            res = process_json(json_data)
            if y_train is not None:
                s = str(y_train[k]) + ' '
            else:
                s = '1 '
            for ek, ev in res.items():
                if ek == 'meta_tags':
                    continue
                if ek == 'content':
                    if is_content is False:
                        continue
                    ev = strip_tags(ev.split('recommends"')[0])
                if is_ngr is False:
                    s += '|' + ek.replace(':', '_') + ' ' + prepare(str(ev)) + ' '
                else:
                    s += '|' + ek.replace(':', '_') + ' ' + prepare_ngr(str(ev)) + ' '
                    

            features = getFeatures(json_data)
            feature_names = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9']
            for fn, fv in zip(feature_names, features):
                s += fn + ':{:.5f}'.format(fv) + ' '

            s += 'year:{}'.format(int(json_data['published']['$date'].split('-')[0])) + ' '
            s += 'month:{}'.format(int(json_data['published']['$date'].split('-')[1])) + ' '
            s += 'hour:{}'.format(int(json_data['published']['$date'][11:13])) + ' '

            fout.write(s + '\n')

            try:
                my_targets.append(json_data['content'].split('recommends"')[1].split('<')[0].split('>')[1])
            except:
                print(json_data['content'])
                raise

            k += 1
            #if k == 2000:
            #    break
    return my_targets

In [26]:
my_targets = prepareData('medium/train.json', 'medium/train_content.vw', y_train=y_train, is_content=True);




In [27]:
#my_targets = prepareData('medium/train.json', 'medium/train.vw', y_train=y_train, is_content=False)

In [28]:
#my_targets = prepareData('medium/train.json', 'medium/train_ngr.vw', y_train=y_train, is_content=True, is_ngr=True)

In [29]:
my_targets_int = []
for e in my_targets:
    if e.find('K') != -1:
        my_targets_int.append(int(float(e[:-1])) * 1000)
    else:
        my_targets_int.append(int(e))

Проделайте все то же с тестовой выборкой, вместо ответов подсовывая что угодно, например, единицы.

In [30]:
my_targets_test = prepareData('medium/test.json', 'medium/test_content.vw', y_train=None, is_content=True);




In [31]:
#my_targets_test = prepareData('medium/test.json', 'medium/test.vw', y_train=None, is_content=False)

In [32]:
#my_targets_test = prepareData('medium/test.json', 'medium/test_ngr.vw', y_train=None, is_content=True, is_ngr=True)

In [33]:
my_targets_test_int = []
for e in my_targets_test:
    if e.find('K') != -1:
        my_targets_test_int.append(int(float(e[:-1])) * 1000)
    else:
        my_targets_test_int.append(int(e))

In [34]:
y_test = [np.log(1 + e) for e in my_targets_test_int]

In [169]:
#!head -2 medium/train_content.vw

In [40]:
#!head -2 medium/test_content.vw

Выбор того, как валидировать модель, остается за Вами. Проще всего, конечно, сделать отложенную выборку. Бенчмарк, который Вы видите в соревновании (**vw_baseline.csv**) и который надо побить, получен с Vowpal Wabbit, 3 проходами по выборке (не забываем удалять кэш), биграммами и настроенными гиперпараметрами `bits`, `learning_rate` и `power_t`. 

In [145]:
#!vw --help

In [161]:
!head -n 32699 medium/train.vw > medium/train_head_32699.vw
!tail -n 20000 medium/train.vw > medium/train_tail_20000.vw

In [162]:
!head -n 32699 medium/train_content.vw > medium/train_content_head_32699.vw
!tail -n 20000 medium/train_content.vw > medium/train_content_tail_20000.vw

In [41]:
!head -n 32699 medium/train_ngr.vw > medium/train_ngr_head_32699.vw
!tail -n 20000 medium/train_ngr.vw > medium/train_ngr_tail_20000.vw

In [35]:
def cleanCache():
    for e in glob.glob("medium/*.cache"):
        os.remove(e)

In [36]:
def learn(tr, bb, lr, pt, model, ngram=2, passes=10, l2=0, dlr=1.0, random_seed=17):
    cleanCache()
    
    !vw -d $tr --loss_function quantile -f $model \
    -b $bb --random_seed $random_seed -c \
    --passes $passes --ngram $ngram --learning_rate $lr --power_t $pt --l2 $l2 --decay_learning_rate $dlr

def predict(tt, model, out):
    cleanCache()
    !vw -i $model --quiet -t -d $tt -p $out
    
def getPredictions(filename):
    with open(filename) as pred_file:
        test_prediction = [float(label) for label in pred_file.readlines()]    
    return np.array(test_prediction)    

In [37]:
def run(tr, te, ngram=2):
    model_tr = tr[:-3] + '_model.vw'

    #tr = 'medium/train.vw'
    #te = 'medium/train_tail10000.vw'

    errors = dict()
    for lr in np.linspace(0.1, 0.5, 3):
        for pt in np.linspace(0.1, 0.5, 3):
            for bits in np.linspace(25, 25, 1):
                for ps in [5, 10]:
                    print('==========================================')
                    print(str(lr) + ' ' + str(pt) + ' ' + str(bits) + ' ' + str(ps))
                    print('==========================================')
                    bb = int(bits)

                    learn(tr, bb, lr, pt, model_tr, ngram=ngram, passes=ps)
                    predict(te, model_tr, 'predictions_te.txt')

        #             !vw -d $tr --loss_function quantile -f $model_tr \
        #             -b $bb --random_seed 17 -c \
        #             --passes 10 --ngram 2 --learning_rate $lr --power_t $pt

        #             !vw -i $model_tr -c -t -d $te -p predictions_te.txt

                    te_prediction = getPredictions('predictions_te.txt')

                    err = mean_absolute_error(y_train[-len(te_prediction):], te_prediction)
                    print('Loss: ', err)
                    errors[str(lr) + ' ' + str(pt) + ' ' + str(int(bits)) + ' ' + str(ps)] = err
    return errors

In [85]:
#tr = 'medium/train_head_32699.vw'
#te = 'medium/train_tail_20000.vw'
#errors = run(tr, te)

In [83]:
#tr = 'medium/train_ngr_head_32699.vw'
#te = 'medium/train_ngr_tail_20000.vw'
#errors_ngr = run(tr, te)

In [84]:
#tr = 'medium/train_content_head_32699.vw'
#te = 'medium/train_content_tail_20000.vw'
#run(tr, te)

In [173]:
sorted(errors.items(), key=lambda x: x[1])

[('0.3 0.5 25', 1.0020124924999998)]

In [155]:
%%time
passes = [12, 12, 12, 12, 10]
bbb = [25, 24, 23, 22, 23]
ngrams = [2, 2, 2, 2, 3]
for j in range(5):
    if j < 4:
        continue
    learn('medium/train_content.vw', bbb[j], 0.3, 0.5, 'medium/train_content_model_{}.vw'.format(j), 
          l2=0, dlr=1.0, random_seed=17*(j+1), passes=passes[j], ngram=ngrams[j])

Generating 3-grams for all namespaces.
final_regressor = medium/train_content_model_4.vw
Num weight bits = 23
learning rate = 0.3
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = medium/train_content.vw.cache
Reading datafile = medium/train_content.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
3.454375 3.454375            1            1.0   6.9088   0.0000     4469
1.985187 0.515999            2            2.0   5.9839   4.9519     3284
1.421689 0.858192            4            4.0   2.3026   2.5519     2074
1.197840 0.973990            8            8.0   5.4510   6.9088    13864
0.888143 0.578446           16           16.0   1.3863   1.2910     3487
0.738415 0.588687           32           32.0   3.8286   4.4516     5317
0.644099 0.549784           64           64.0   1.0986   1.4745     1366
0.657179 0.670259          128          128.0   4.

In [95]:
#%%time
#learn('medium/train.vw', 25, 0.3, 0.5, 'medium/train_model.vw')

In [82]:
#%%time
#learn('medium/train_ngr.vw', 20, 0.3, 0.5, 'medium/train_ngr_model.vw', ngram=1, passes=5)

In [78]:
#predict('medium/test_content.vw', 'medium/train_content_model.vw', 'predictions_content_test.txt')

In [38]:
learn('medium/train_content.vw', 23, 0.3, 0.5, 'medium/train_content_model.vw', 
      l2=0, dlr=1.0, random_seed=17, passes=10, ngram=3)

Generating 3-grams for all namespaces.
final_regressor = medium/train_content_model.vw
Num weight bits = 23
learning rate = 0.3
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = medium/train_content.vw.cache
Reading datafile = medium/train_content.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
3.454375 3.454375            1            1.0   6.9088   0.0000     4415
2.013198 0.572021            2            2.0   5.9839   4.8399     3230
1.458790 0.904383            4            4.0   2.3026   2.6260     2011
1.225534 0.992277            8            8.0   5.4510   6.9088    13816
0.903333 0.581133           16           16.0   1.3863   1.3528     3400
0.749378 0.595424           32           32.0   3.8286   4.5506     5221
0.653865 0.558352           64           64.0   1.0986   1.4771     1273
0.668741 0.683616          128          128.0   4.49

In [156]:
for j in range(5):
    if j < 4:
        continue
    print(j)
    predict('medium/test_content.vw', 'medium/train_content_model_{}.vw'.format(j), 'predictions_content_{}_test.txt'.format(j))

4


In [67]:
#predict('medium/test.vw', 'medium/train_model.vw', 'predictions_test.txt')

In [57]:
#predict('medium/test_ngr.vw', 'medium/train_ngr_model.vw', 'predictions_ngr_test.txt')

In [39]:
predict('medium/test_content.vw', 'medium/train_content_model.vw', 'predictions_content_test.txt')

In [157]:
test_content_prediction_list = []
for j in range(5):
    test_content_prediction_list.append(getPredictions('predictions_content_{}_test.txt'.format(j)))
    print(mean_absolute_error(y_test, test_content_prediction_list[-1]))

0.700650455245
0.698815583589
0.696167292049
0.697614773656
0.67863150442


In [161]:
pred_final = test_content_prediction_list[1]
for j in range(2, 5):
    pred_final += test_content_prediction_list[j]
pred_final /= 4

In [163]:
pred_final = test_content_prediction_list[4]

In [164]:
print(mean_absolute_error(y_test, pred_final))

0.67863150442


In [40]:
pred_final = getPredictions('predictions_content_test.txt')

In [41]:
print(mean_absolute_error(y_test, pred_final))

0.758736542492


In [165]:
print(mean_absolute_error(y_test, pred_final))

0.67863150442


In [166]:
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA,
                                        'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [167]:
write_submission_file(pred_final, os.path.join(PATH_TO_DATA, 'vw_2.csv'))