<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-comercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6
### <center> Beating benchmarks in "How good is your Medium article?"
    
[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "Assignment 6 baseline".

In [1]:
import warnings
warnings.filterwarnings('ignore')
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge

The following code will help to throw away all HTML tags from an article content.

In [2]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Supplementary function to read a JSON line without crashing on escape characters.

In [3]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except Exception as e:      
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))      
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result

Extract features `content`, `published`, `title` and `author`, write them to separate files for train and test sets.

In [8]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):
    
    features = ['content', 'published', 'title', 'author']
    prefix = 'train' if is_train else 'test'
    feature_files = [open(os.path.join(path_to_data,
                                       '{}_{}.txt'.format(prefix, feat)),
                          'w', encoding='utf-8')
                     for feat in features]
    
    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file:

        for i,line in enumerate(tqdm_notebook(inp_json_file)):
            
            json_data = read_json_line(line)
            
            for feature, file in zip(features, feature_files):
                if feature == 'content':
                    content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
                    content_no_html_tags = strip_tags(content)
                    file.write(content_no_html_tags + '\n')
                elif feature == 'author':
                    url = str(json_data[feature]['url'].replace('\n', ' ').replace('\r', ' '))
                    media_name = url.replace('https://medium.com/@', '')
                    file.write(media_name + '\n')
                elif feature == 'title':
                    title = json_data['title']
                    file.write(title + '\n')
                elif feature == 'published':
                    date = json_data['published']['$date']
                    file.write(date + '\n')
            if i > 30000 and is_train:
                break

In [5]:
PATH_TO_DATA = './data/how-good-is-your-medium-article/' # modify this if you need to

In [6]:
extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

In [9]:
extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))




**Add the following groups of features:**
    - Tf-Idf with article content (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Tf-Idf with article titles (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Time features: publication hour, whether it's morning, day, night, whether it's a weekend
    - Bag of authors (i.e. One-Hot-Encoded author names)

In [10]:
content_file = open(PATH_TO_DATA + '/train_content.txt', 'r', encoding='utf-8')
title_file = open(PATH_TO_DATA + '/train_title.txt', 'r', encoding='utf-8')
author_file = open(PATH_TO_DATA + '/train_author.txt', 'r', encoding='utf-8')
date_file = open(PATH_TO_DATA + '/train_published.txt', 'r', encoding='utf-8')

date_df = pd.DataFrame(data=date_file.readlines(), columns=['published'])
date_df['published'] = date_df['published'].apply(lambda x: pd.to_datetime(x.replace('\n', '')))
date_df['hour'] = date_df['published'].apply(lambda x: x.hour)
date_df['is_weekend'] = date_df['published'].apply(lambda x: 1 if x.weekday() >= 5 else 0)
date_df['is_morning'] =  date_df['hour'].apply(lambda x: 1 if x in range(4,13) else 0)
date_df['is_day'] = date_df['hour'].apply(lambda x: 1 if x in range(13,19) else 0)
date_df['is_evening'] = date_df['hour'].apply(lambda x: 1 if x in range(19,25) else 0)
date_df['is_night'] = date_df['hour'].apply(lambda x: 1 if x in range(0,4) else 0)
date_df = pd.concat([date_df.drop(['published', 'hour'], axis=1), pd.get_dummies(date_df['hour'], prefix='hour')], axis=1)

X_train_content_sparse = TfidfVectorizer(ngram_range=(1, 2), max_features=100000).fit_transform(content_file)
# X_train_title_sparse = TfidfVectorizer(ngram_range=(1, 2), max_features=100000).fit_transform(title_file)
#X_train_author_sparse = csr_matrix(pd.get_dummies(pd.DataFrame(data = author_file.readlines(), columns = ['author'])))
X_train_time_features_sparse = csr_matrix(date_df)

In [11]:
content_file = open(PATH_TO_DATA + '/test_content.txt', 'r', encoding='utf-8')
title_file = open(PATH_TO_DATA + '/test_title.txt', 'r', encoding='utf-8')
author_file = open(PATH_TO_DATA + '/test_author.txt', 'r', encoding='utf-8')
date_file = open(PATH_TO_DATA + '/test_published.txt', 'r', encoding='utf-8')

date_df = pd.DataFrame(data=date_file.readlines(), columns=['published'])
date_df['published'] = date_df['published'].apply(lambda x: pd.to_datetime(x.replace('\n', '')))
date_df['hour'] = date_df['published'].apply(lambda x: x.hour)
date_df['is_weekend'] = date_df['published'].apply(lambda x: 1 if x.weekday() >= 5 else 0)
date_df['is_morning'] =  date_df['hour'].apply(lambda x: 1 if x in range(4,13) else 0)
date_df['is_day'] = date_df['hour'].apply(lambda x: 1 if x in range(13,19) else 0)
date_df['is_evening'] = date_df['hour'].apply(lambda x: 1 if x in range(19,25) else 0)
date_df['is_night'] = date_df['hour'].apply(lambda x: 1 if x in range(0,4) else 0)
date_df = pd.concat([date_df.drop(['published', 'hour'], axis=1), pd.get_dummies(date_df['hour'], prefix='hour')], axis=1)

X_test_content_sparse = TfidfVectorizer(ngram_range=(1, 2), max_features=100000).fit_transform(content_file)
# X_train_title_sparse = TfidfVectorizer(ngram_range=(1, 2), max_features=100000).fit_transform(title_file)
#X_test_author_sparse = csr_matrix(pd.get_dummies(pd.DataFrame(data = author_file.readlines(), columns = ['author'])))
X_test_time_features_sparse = csr_matrix(date_df)

**Join all sparse matrices.**

In [13]:
X_train_content_sparse,X_train_time_features_sparse

(<30002x100000 sparse matrix of type '<class 'numpy.float64'>'
 	with 32833551 stored elements in Compressed Sparse Row format>,
 <30002x29 sparse matrix of type '<class 'numpy.int64'>'
 	with 65113 stored elements in Compressed Sparse Row format>)

In [None]:
X_test_content_sparse,X_test_author_sparse,X_test_time_features_sparse

In [14]:
X_train_sparse = csr_matrix(hstack([X_train_content_sparse, 
                                    X_train_time_features_sparse]))

In [15]:
X_test_sparse = csr_matrix(hstack([X_test_content_sparse, 
                                     X_test_time_features_sparse]))

**Read train target and split data for validation.**

In [16]:
train_target = pd.read_csv('./data/how-good-is-your-medium-article/train_log1p_recommends.csv', 
                           index_col='id',nrows=X_train_sparse.shape[0])
y_train = train_target['log_recommends'].values

In [17]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [23]:
parameters = {'alpha':np.linspace(0.3, 0.5, num=10)}
clf_gs = GridSearchCV(Ridge(), parameters,n_jobs=-1,scoring='neg_mean_absolute_error')
clf_gs.fit(X_train_part, y_train_part)

GridSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'alpha': array([0.3    , 0.32222, 0.34444, 0.36667, 0.38889, 0.41111, 0.43333,
       0.45556, 0.47778, 0.5    ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_absolute_error', verbose=0)

In [24]:
clf_gs.best_params_

{'alpha': 0.5}

In [26]:
clf_gs.best_score_

-1.257856587911388

In [27]:
clf = Ridge(alpha=clf_gs.best_params_['alpha'])

In [28]:
clf.fit(X_train_part,y_train_part)

Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [29]:
mean_absolute_error(clf.predict(X_valid),y_valid)

1.2480628165627932

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [30]:
clf = Ridge(alpha=clf_gs.best_params_['alpha'])


In [31]:
clf.fit(X_train_sparse,y_train)

Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [42]:

X_test_sparse.shape,X_train_sparse

((20002, 111469), <20002x111594 sparse matrix of type '<class 'numpy.float64'>'
 	with 21726868 stored elements in Compressed Sparse Row format>)

In [32]:
ridge_test_pred= clf.predict(X_test_sparse)

In [37]:
def write_submission_file(prediction, filename,
                          path_to_sample='./data/how-good-is-your-medium-article/sample_submission.csv'):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [38]:
write_submission_file(ridge_test_pred, 'assignment6_medium_submission.csv')

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeroes. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [39]:
write_submission_file(np.zeros_like(ridge_test_pred), 
                      'medium_all_zeros_submission.csv')

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [59]:
np.mean(ridge_test_pred+1.9)

3.9916226646037063

In [52]:
arr = ridge_test_pred - np.var(ridge_test_pred) 

0.2744499376410994

In [60]:
ridge_test_pred_modif = ridge_test_pred+1.9 # You code here

In [61]:
write_submission_file(ridge_test_pred_modif, 
                      'assignment6_medium_submission_with_hack.csv')