<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Edited by Sergey Kolchenko (@KolchenkoSergey). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center>Assignment #6
### <center> Beating baselines in "How good is your Medium article?"
    
<img src='../../img/medium_claps.jpg' width=40% />


[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat the "A6 baseline" (~1.45 Public LB score). Do not forget about our shared ["primitive" baseline](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline): you'll find something valuable there.

**Your task:**
 1. "Freeride". Come up with good features to beat the "A6 baseline" (for now, only the public LB is considered)
 2. You need to name your [team](https://www.kaggle.com/c/how-good-is-your-medium-article/team) (out of 1 person) in full accordance with the [course rating](https://drive.google.com/open?id=19AGEhUQUol6_kNLKSzBsjcGUU3qWy3BNUg8x8IFkO3Q). You can think of it as a part of the assignment. 16 credits for beating the mentioned baseline and correct team naming.
 
*For discussions, please stick to [ODS Slack](https://opendatascience.slack.com/), channel #mlcourse_ai, pinned thread __#a6__*

In [54]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge

The following code strips all HTML tags from an article's content.

In [2]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
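    def __init__(self):
        # Let HTMLParser initialize its own state; convert_charrefs=True
        # also turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.fed = []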
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
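
A quick way to see what the stripper does is to run it on a tiny HTML snippet. The sketch below is self-contained, so it re-creates the idea under the hypothetical names `TagStripper` and `strip_html` (they are not part of the assignment code); it keeps only text nodes and unescapes entities.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical stand-in for MLStripper: collects text nodes only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every piece of text found between tags
        self.chunks.append(data)


def strip_html(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


sample = '<p>Hello, <b>Medium</b> &amp; readers!</p>'
stripped = strip_html(sample)
print(stripped)
```

Tags are dropped without inserting any separator, so adjacent blocks of text may end up glued together; whether that matters for Tf-Idf is something to check on real articles.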

Supplementary function to read a JSON line without crashing on escape characters.

In [3]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except json.JSONDecodeError as e:
        # Index of the offending character, as reported by the decoder:
        idx_to_replace = e.pos
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result
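
To illustrate the repair idea on a concrete input, here is a minimal iterative sketch. The name `load_json_lenient` and the `max_repairs` cap are assumptions made for this example (the cap only guards against endless retries); the input line contains `\q`, which is not a valid JSON escape.

```python
import json


def load_json_lenient(line, max_repairs=100):
    """Parse one JSON line, blanking out characters the decoder rejects."""
    for _ in range(max_repairs):
        try:
            return json.loads(line)
        except json.JSONDecodeError as err:
            # err.pos is the index at which the decoder reported the failure
            line = line[:err.pos] + ' ' + line[err.pos + 1:]
    raise ValueError('could not repair the line in {} attempts'.format(max_repairs))


bad_line = '{"a": "x\\qy"}'  # the raw text holds the invalid escape \q
result = load_json_lenient(bad_line)
print(result)
```

Note that this silently alters the text, which is acceptable for bag-of-words features but would not be for data that must round-trip exactly.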

Extract the features `content`, `published`, `title` and `author`, and write them to separate files for the train and test sets.

In [7]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):
    
    features = ['content', 'published', 'title', 'author']
    prefix = 'train' if is_train else 'test'
    feature_files = [open(os.path.join(path_to_data,
                                       '{}_{}.txt'.format(prefix, feat)),
                          'w', encoding='utf-8')
                     for feat in features]
    
    features_to_files_dict = dict(list(zip(features, feature_files)))
    to_dump = {'content': [], 'published': [], 'title': [], 'author': []}
    
    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file:

        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            pub_date = json_data['published']['$date']
            title = json_data['meta_tags']['title'].split('\u2013')[0].strip().replace('\n', ' ').replace('\r', ' ')
            author_name = json_data['meta_tags']['author'].strip()
            content = strip_tags(json_data['content'].replace('\n', ' ').replace('\r', ' '))
            to_dump['content'].append(content)
            to_dump['published'].append(pub_date)
            to_dump['title'].append(title)
            to_dump['author'].append(author_name)   
            for feature in features_to_files_dict:
                features_to_files_dict[feature].write(str(to_dump[feature][-1]) + '\n')
            #print(pub_date,"|", title, "|", author_name, "|", content)

        for f in features_to_files_dict:
            features_to_files_dict[f].close()
    return to_dump
                

In [8]:
PATH_TO_DATA = '../../data/medium/' # modify this if you need to

In [9]:
train_data = extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)


In [43]:
len(train_data['title'])

62313

In [10]:
test_data = extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)


In [44]:
len(test_data['title'])

34645

**Add the following groups of features:**

- Tf-Idf with article content (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Tf-Idf with article titles (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Time features: publication hour, whether it's morning, day, night, whether it's a weekend
- Bag of authors (i.e. One-Hot-Encoded author names)
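
Of the time features listed above, the weekend flag is easy to overlook. A minimal sketch on three of the timestamps from this dataset follows; the frame `time_feats` and the column name `is_weekend` are names chosen for this illustration only.

```python
import pandas as pd

published = pd.to_datetime(pd.Series([
    '2012-08-13T22:54:53.510Z',
    '2017-02-05T13:08:17.410Z',
    '2017-05-06T08:16:30.776Z',
]), format='%Y-%m-%dT%H:%M:%S.%fZ')

time_feats = pd.DataFrame({
    'hour': published.dt.hour,
    'dow': published.dt.dayofweek,  # Monday=0 ... Sunday=6
})
# Saturday (5) and Sunday (6) count as the weekend
time_feats['is_weekend'] = (time_feats['dow'] >= 5).astype(int)
print(time_feats)
```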

In [11]:
train = pd.DataFrame()
train['published'] = pd.to_datetime(train_data['published'], format='%Y-%m-%dT%H:%M:%S.%fZ')
train['title'] = train_data['title']
train['author'] = train_data['author']
train_content = train_data['content']

In [12]:
test = pd.DataFrame()
test['published'] = pd.to_datetime(test_data['published'], format='%Y-%m-%dT%H:%M:%S.%fZ')
test['title'] = test_data['title']
test['author'] = test_data['author']
test_content = test_data['content']

In [13]:
idx_split = len(train)

In [14]:
train.to_pickle(os.path.join(PATH_TO_DATA, "train.pkl"))
test.to_pickle(os.path.join(PATH_TO_DATA, "test.pkl"))

In [15]:
full_df = pd.concat([train, test])
full_df.head()

| | published | title | author |
|---|---|---|---|
| 0 | 2012-08-13 22:54:53.510 | Medium Terms of Service | Medium |
| 1 | 2015-08-03 07:44:50.331 | Amendment to Medium Terms of Service Applicabl... | Medium |
| 2 | 2017-02-05 13:08:17.410 | 走入山與海之間：閩東大刀會和兩岸走私 | Yun-Chen Chien（簡韻真） |
| 3 | 2017-05-06 08:16:30.776 | How fast can a camera get? | Vaibhav Khulbe |
| 4 | 2017-06-04 14:46:25.772 | A game for the lonely fox | Vaibhav Khulbe |


In [54]:
del train_data, test_data 


In [16]:
import gc
gc.collect()

0

In [17]:
full_df['dow'] = full_df['published'].apply(lambda x: x.dayofweek)
full_df['year'] = full_df['published'].apply(lambda x: x.year)
full_df['month'] = full_df['published'].apply(lambda x: x.month)
full_df['hour'] = full_df['published'].apply(lambda x: x.hour)

full_df.head()

| | published | title | author | dow | year | month | hour |
|---|---|---|---|---|---|---|---|
| 0 | 2012-08-13 22:54:53.510 | Medium Terms of Service | Medium | 0 | 2012 | 8 | 22 |
| 1 | 2015-08-03 07:44:50.331 | Amendment to Medium Terms of Service Applicabl... | Medium | 0 | 2015 | 8 | 7 |
| 2 | 2017-02-05 13:08:17.410 | 走入山與海之間：閩東大刀會和兩岸走私 | Yun-Chen Chien（簡韻真） | 6 | 2017 | 2 | 13 |
| 3 | 2017-05-06 08:16:30.776 | How fast can a camera get? | Vaibhav Khulbe | 5 | 2017 | 5 | 8 |
| 4 | 2017-06-04 14:46:25.772 | A game for the lonely fox | Vaibhav Khulbe | 6 | 2017 | 6 | 14 |


In [18]:
# Part-of-day flags, computed on the whole column at once
hour = full_df['hour']
full_df['morning'] = ((hour > 7) & (hour <= 11)).astype('int')
full_df['day'] = ((hour > 11) & (hour <= 18)).astype('int')
full_df['evening'] = ((hour > 18) & (hour <= 23)).astype('int')
full_df['night'] = ((hour > 23) | (hour <= 7)).astype('int')
full_df.head(5)


| | published | title | author | dow | year | month | hour | morning | day | evening | night |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-08-13 22:54:53.510 | Medium Terms of Service | Medium | 0 | 2012 | 8 | 22 | 0 | 0 | 1 | 0 |
| 1 | 2015-08-03 07:44:50.331 | Amendment to Medium Terms of Service Applicabl... | Medium | 0 | 2015 | 8 | 7 | 0 | 0 | 0 | 1 |
| 2 | 2017-02-05 13:08:17.410 | 走入山與海之間：閩東大刀會和兩岸走私 | Yun-Chen Chien（簡韻真） | 6 | 2017 | 2 | 13 | 0 | 1 | 0 | 0 |
| 3 | 2017-05-06 08:16:30.776 | How fast can a camera get? | Vaibhav Khulbe | 5 | 2017 | 5 | 8 | 1 | 0 | 0 | 0 |
| 4 | 2017-06-04 14:46:25.772 | A game for the lonely fox | Vaibhav Khulbe | 6 | 2017 | 6 | 14 | 0 | 1 | 0 | 0 |


In [19]:
full_df.shape

(96958, 11)

In [21]:
TF_V = TfidfVectorizer(max_features=100000, ngram_range=(1,2))

In [None]:
X_train_content_sparse = TF_V.fit_transform(train_content)
X_test_content_sparse = TF_V.transform(test_content)
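
The fit-on-train, transform-on-test pattern used here can be checked on a toy corpus. This is only a sketch with made-up documents (`toy_train`, `toy_test`); the point is that both matrices share the vocabulary learned from the training texts, so their column counts match.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ['how fast can a camera get', 'a game for the lonely fox']
toy_test = ['a fast fox with a camera']

vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=100000)
X_toy_train = vectorizer.fit_transform(toy_train)  # learns vocabulary and IDF weights
X_toy_test = vectorizer.transform(toy_test)        # reuses them, adds no new columns

print(X_toy_train.shape, X_toy_test.shape)
```

Words and bigrams that occur only in the test text (for example "with") are simply ignored at transform time.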


In [71]:
TF_V2 = TfidfVectorizer(max_features=100000, ngram_range=(1,2))
# train_data / test_data were deleted above, so take titles from the DataFrames
X_train_title_sparse = TF_V2.fit_transform(train['title'])
X_test_title_sparse = TF_V2.transform(test['title'])

In [None]:
# handle_unknown='ignore': authors seen only in the test set get an all-zero row
OHE = OneHotEncoder(handle_unknown='ignore')
X_train_author_sparse = OHE.fit_transform(train[['author']])
X_test_author_sparse = OHE.transform(test[['author']])
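
Authors who appear only in the test set are a practical concern for a bag of authors. The toy sketch below (made-up frames `toy_train` and `toy_test`, with "Unseen Author" as a hypothetical name) shows how `OneHotEncoder(handle_unknown='ignore')` encodes such an author as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({'author': ['Medium', 'Vaibhav Khulbe', 'Medium']})
toy_test = pd.DataFrame({'author': ['Vaibhav Khulbe', 'Unseen Author']})

encoder = OneHotEncoder(handle_unknown='ignore')
A_train = encoder.fit_transform(toy_train[['author']])  # one column per training author
A_test = encoder.transform(toy_test[['author']])        # unseen author -> all zeros

print(A_test.toarray())
```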

In [24]:
X_train_content_sparse.shape, X_train_title_sparse.shape

((62313, 100000), (62313, 100000))

In [64]:
cols_no_scale = ['morning', 'day', 'evening', 'night']
# Use a list: a tuple passed to .loc is treated as a single (multi-level) key
cols_to_scale = ['dow', 'year', 'month', 'hour']

tmp = StandardScaler().fit_transform(full_df.loc[:, cols_to_scale])
tmp



array([[-1.35654985, -4.00067044,  0.4627592 ,  1.36607227],
       [-1.35654985, -1.30898493,  0.4627592 , -1.00228203],
       [ 1.8484754 ,  0.48547208, -1.15525008, -0.05494031],
       ...,
       [ 0.24596278,  0.48547208,  1.54143205,  0.73451112],
       [-0.2882081 ,  1.38270058, -1.15525008,  0.73451112],
       [ 1.31430453,  1.38270058, -1.15525008,  0.89240141]])

In [67]:

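# Wrap the dense blocks in csr_matrix so that hstack receives sparse inputs
X_train_time_features_sparse = hstack([csr_matrix(full_df[cols_no_scale].values[:idx_split]),
                                       csr_matrix(tmp[:idx_split, :])])
# The test part is needed later when all test matrices are joined
X_test_time_features_sparse = hstack([csr_matrix(full_df[cols_no_scale].values[idx_split:]),
                                      csr_matrix(tmp[idx_split:, :])])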


<bound method spmatrix.todense of <62313x8 sparse matrix of type '<class 'numpy.float64'>'
	with 311565 stored elements in COOrdinate format>>

**Join all sparse matrices.**

In [70]:
X_train_sparse = hstack([X_train_content_sparse, X_train_title_sparse,
                         X_train_author_sparse, 
                         X_train_time_features_sparse]).tocsr()

In [None]:
X_test_sparse = hstack([X_test_content_sparse, X_test_title_sparse,
                        X_test_author_sparse, 
                        X_test_time_features_sparse]).tocsr()

**Read train target and split data for validation.**

In [51]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')
y_train = train_target['log_recommends'].values

In [None]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [None]:
# Your code here
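
As a hedged illustration of this step (not the reference solution), the sketch below fits Ridge on synthetic sparse data and computes MAE on a 70/30 hold-out split made in row order, mirroring the split above. All `*_toy` names, the seed and `alpha=1.0` are assumptions made for the example.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(17)
# Synthetic stand-in for the stacked sparse feature matrix
X_toy = sparse_random(200, 50, density=0.1, format='csr', random_state=rng)
true_w = rng.normal(size=50)
y_toy = X_toy @ true_w + rng.normal(scale=0.1, size=200)

# First 70% of rows for training, the rest for validation
split = int(0.7 * X_toy.shape[0])
ridge = Ridge(alpha=1.0, random_state=17)
ridge.fit(X_toy[:split], y_toy[:split])

valid_mae = mean_absolute_error(y_toy[split:], ridge.predict(X_toy[split:]))
print(valid_mae)
```

With the real data, the same calls would take `X_train_part_sparse`, `y_train_part`, `X_valid_sparse` and `y_valid`.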

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [None]:
# Your code here

In [None]:
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA, 
                                                      'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [None]:
write_submission_file(ridge_test_pred, os.path.join(PATH_TO_DATA,
                                                    'assignment6_medium_submission.csv'))

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeros. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [None]:
write_submission_file(np.zeros_like(ridge_test_pred), 
                      os.path.join(PATH_TO_DATA,
                                   'medium_all_zeros_submission.csv'))

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [None]:
ridge_test_pred_modif = ridge_test_pred # Your code here
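
One way to reason about the all-zeros submission: the target is non-negative, so the MAE of an all-zeros prediction equals the mean of the hidden target. The sketch below checks this on synthetic numbers and then shifts predictions so that their mean matches that value. The arrays `y_hidden` and `toy_pred` are made up, and whether such a shift actually improves the leaderboard score is an assumption to verify with a real submission.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(17)
y_hidden = rng.exponential(scale=3.0, size=1000)  # synthetic non-negative "test target"
toy_pred = rng.exponential(scale=2.5, size=1000)  # synthetic model predictions

# For non-negative targets |y - 0| = y, so this MAE is just the target mean
zeros_score = mean_absolute_error(y_hidden, np.zeros_like(y_hidden))

# Shift the predictions so that their mean equals the target mean
toy_pred_shifted = toy_pred + (zeros_score - toy_pred.mean())
print(zeros_score, toy_pred_shifted.mean())
```

On Kaggle, `zeros_score` would be the public LB score of the all-zeros file, which covers only part of the test set.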

In [None]:
write_submission_file(ridge_test_pred_modif, 
                      os.path.join(PATH_TO_DATA,
                                   'assignment6_medium_submission_with_hack.csv'))

That's it for the assignment. Many more credits will be given to the winners of this competition; check the [course roadmap](https://mlcourse.ai/roadmap). Do not spoil the assignment and the competition: don't share high-performing kernels (with MAE < 1.5).

Some ideas for improvement:

- Engineer good features, this is the key to success. Some simple features will be based on publication time, authors, content length and so on
- You don't have to discard the HTML: you can extract some features from it
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In this notebook the Tf-Idf vocabularies are capped at 100k features; both that cap and the Ridge regularization strength (`alpha`) can be changed
- SGD and Vowpal Wabbit will learn much faster
- Play around with blending and/or stacking. An intro is given in [this Kernel](https://www.kaggle.com/kashnitsky/ridge-and-lightgbm-simple-blending) by @yorko 
- In our course, we don't cover neural nets. But nothing forbids using GRUs/LSTMs/whatever in this competition.

Good luck!

<img src='../../img/kaggle_shakeup.png' width=50%>