<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-comercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6
### <center> Beating benchmarks in "How good is your Medium article?"
    
[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "Assignment 6 baseline".

In [110]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge

The following code will help to throw away all HTML tags from an article content.

In [2]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []
        self.read = False
    def handle_starttag(self, tag, attrs):
        if ('data-source', 'post_page') in attrs:
            self.read = True
    def handle_endtag(self, tag):
        if tag == 'section':
            self.read = False
    def handle_data(self, d):
        if self.read:
            self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Supplementary function to read a JSON line without crashing on escape characters.

In [3]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except Exception as e:      
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))      
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result

Extract features `content`, `published`, `title` and `author`, write them to separate files for train and test sets.

In [156]:
def extract_features_and_write(inp_filename, is_train=True, max_iter=-1, total=-1):
    
    prefix = 'train' if is_train else 'test'
    features = ['content', 'published', 'title', 'author']
    feature_files = [data_file(f'{prefix}_{feat}.txt', mode='w') for feat in features]
    
    normalize_string = lambda s: str(s).replace('\n', ' ').replace('\r', '')
    count = 0
    
    with data_file(inp_filename, mode='r') as inp_json_file:

        for line in tqdm_notebook(inp_json_file, total=total):
            if max_iter > 0:
                max_iter -= 1
            elif max_iter == 0:
                break
            json_data = read_json_line(line)
            content = normalize_string(strip_tags(json_data['content']))
            published = json_data['published']['$date']
            title = normalize_string(json_data['title'])
            author = normalize_string(json_data['author']['url'])
            feature_data = [content, published, title, author]
            for s, f in zip(feature_data, feature_files):
                print(s, file=f)
            count += 1
                
    for f in feature_files:
        f.close()
        
    return count

In [157]:
PATH_TO_DATA = 'hgiyma' # modify this if you need to
def data_file(name, mode=None, csv=False, **kvargs):
    path = os.path.join(PATH_TO_DATA, name)
    if mode:
        return open(path, mode=mode)
    if csv:
        return pd.read_csv(open(path), **kvargs)
    return path

In [158]:
train_total = extract_features_and_write('train.json', is_train=True, total=62313)

A Jupyter Widget




In [91]:
test_total = extract_features_and_write('test.json', is_train=False, total=34645)

A Jupyter Widget




**Add the following groups of features:**
    - Tf-Idf with article content (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Tf-Idf with article titles (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Time features: publication hour, whether it's morning, day, night, whether it's a weekend
    - Bag of authors (i.e. One-Hot-Encoded author names)

In [159]:
for prefix in ['train', 'test']:
    for feat in ['content', 'published', 'title', 'author']:
        file_len = 0
        file_name = f'{prefix}_{feat}.txt'
        with open(data_file(file_name)) as file:
            for _ in file:
                file_len += 1
        print(f'len({file_name}) = {file_len}')
                

len(train_content.txt) = 62313
len(train_published.txt) = 62313
len(train_title.txt) = 62313
len(train_author.txt) = 62313
len(test_content.txt) = 34645
len(test_published.txt) = 34645
len(test_title.txt) = 34645
len(test_author.txt) = 34645


In [95]:
%%time

content_vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=100000)
with open(data_file('train_content.txt')) as f:
    content_vectorizer.fit(tqdm_notebook(f, total=train_total, desc='fit()'))
    f.seek(0)
    X_train_content_sparse = content_vectorizer.transform(tqdm_notebook(f, total=train_total, desc='(train)'))
with open(data_file('test_content.txt')) as f:
    X_test_content_sparse = content_vectorizer.transform(tqdm_notebook(f, total=test_total, desc='(test)'))

A Jupyter Widget




A Jupyter Widget


CPU times: user 9min 37s, sys: 1min 51s, total: 11min 29s
Wall time: 12min 37s


In [97]:
%%time

title_vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=100000)
with open(data_file('train_title.txt')) as f:
    title_vectorizer.fit(tqdm_notebook(f, total=train_total, desc='fit()'))
    f.seek(0)
    X_train_title_sparse = title_vectorizer.fit_transform(tqdm_notebook(f, total=train_total, desc='(train)'))
with open(data_file('test_title.txt')) as f:
    X_test_title_sparse = title_vectorizer.transform(tqdm_notebook(f, total=test_total, desc='(test)'))

A Jupyter Widget




A Jupyter Widget




A Jupyter Widget


CPU times: user 7.23 s, sys: 424 ms, total: 7.65 s
Wall time: 7.77 s


In [187]:
def time_features(name):
    df = data_file(name, csv=True, header=None, parse_dates=[0])
    time_series = df[0]
    
    hour_ohe = OneHotEncoder().fit(np.arange(24).reshape(-1,1))
    hour = hour_ohe.transform(time_series.apply(lambda ts: ts.hour).values.reshape(-1,1))
    
    weekday_ohe = OneHotEncoder().fit(np.arange(7).reshape(-1,1))
    weekday = hour_ohe.transform(time_series.apply(lambda ts: ts.weekday()).values.reshape(-1,1))
    
    month_ohe = OneHotEncoder().fit(np.arange(12).reshape(-1,1))
    month = hour_ohe.transform(time_series.apply(lambda ts: ts.month).values.reshape(-1,1))
    
    year_ohe = OneHotEncoder().fit(np.arange(1970, 2020).reshape(-1,1))
    year = year_ohe.transform(time_series.apply(lambda ts: ts.year).values.reshape(-1,1))
    return hstack([year, month, weekday, hour])
    
X_train_time_features_sparse = time_features('train_published.txt')
X_test_time_features_sparse = time_features('test_published.txt')

In [160]:
%%time

train_author = data_file('train_author.txt', csv=True, header=None)
test_author = data_file('test_author.txt', csv=True, header=None)
author_label = LabelEncoder()
author_ids = author_label.fit_transform(pd.concat([train_author, test_author]))
author_ohe = OneHotEncoder()
author_ohe.fit(author_ids.reshape(-1,1))

X_train_author_sparse = author_ohe.transform(author_label.transform(train_author).reshape(-1,1))
X_test_author_sparse = author_ohe.transform(author_label.transform(test_author).reshape(-1,1))


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


CPU times: user 934 ms, sys: 38 ms, total: 972 ms
Wall time: 982 ms


**Join all sparse matrices.**

In [188]:
for f in [X_train_content_sparse, X_train_title_sparse, X_train_author_sparse, X_train_time_features_sparse]:
    print(f.shape)
    
for f in [X_test_content_sparse, X_test_title_sparse, X_test_author_sparse, X_test_time_features_sparse]:
    print(f.shape)

(62313, 100000)
(62313, 100000)
(62313, 45374)
(62313, 122)
(34645, 100000)
(34645, 100000)
(34645, 45374)
(34645, 122)


In [189]:
X_train_sparse = csr_matrix(hstack([X_train_content_sparse, X_train_title_sparse,
                                    X_train_author_sparse, X_train_time_features_sparse]))

In [190]:
X_test_sparse = csr_matrix(hstack([X_test_content_sparse, X_test_title_sparse,
                                    X_test_author_sparse, X_test_time_features_sparse]))

**Read train target and split data for validation.**

In [165]:
train_target = data_file('train_log1p_recommends.csv', csv=True, index_col='id')
y_train = train_target['log_recommends'].values
y_train.shape

(62313,)

In [166]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [169]:
RANDOM_STATE=33
N_JOBS=-1

In [170]:
ridge = Ridge(random_state=RANDOM_STATE)

In [172]:
%%time
ridge.fit(X_train_part_sparse, y_train_part)

CPU times: user 1min 10s, sys: 1.79 s, total: 1min 12s
Wall time: 1min 13s


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=33, solver='auto', tol=0.001)

In [173]:
mean_absolute_error(ridge.predict(X_valid_sparse), y_valid)

1.0757028267001929

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [191]:
%%time

ridge.fit(X_train_sparse, y_train)
ridge_test_pred = ridge.predict(X_test_sparse)

CPU times: user 1min 27s, sys: 1.32 s, total: 1min 28s
Wall time: 1min 29s


In [192]:
def write_submission_file(prediction, filename):
    submission = data_file('sample_submission.csv', csv=True, index_col='id')
    submission['log_recommends'] = prediction
    submission.to_csv(data_file(filename))

In [193]:
write_submission_file(ridge_test_pred, 'assignment6_medium_submission.csv')

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeroes. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [194]:
write_submission_file(np.zeros_like(ridge_test_pred), 'medium_all_zeros_submission.csv')

In [196]:
zmae = 4.33328

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [197]:
ridge_test_pred_modif = ridge_test_pred + ( zmae - ridge_test_pred.mean() ) # You code here

In [198]:
write_submission_file(ridge_test_pred_modif, 
                      'assignment6_medium_submission_with_hack.csv')