<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6
### <center> Beating benchmarks in "How good is your Medium article?"
    
[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "Assignment 6 baseline".

In [1]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge

The following code strips all HTML tags from an article's content.

In [2]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # HTMLParser.__init__ calls reset() and stores convert_charrefs.
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
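
A quick way to check the stripper is to run it on a small HTML fragment. The sketch below restates the class in compact form so that it runs on its own; the names `TagStripper` and `strip_tags_demo` and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (illustrative name)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every piece of text found between tags.
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)


def strip_tags_demo(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return parser.get_text()


# Made-up sample: tags are dropped, character references are decoded.
sample = '<p>Hello, <b>Medium</b> &amp; friends!</p>'
stripped = strip_tags_demo(sample)
print(stripped)
```

Because `convert_charrefs` is on, references such as `&amp;` arrive in `handle_data` already decoded.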

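Supplementary function to read a JSON line without crashing on invalid escape characters: when parsing fails, it blanks out the character at the position reported in the error message and retries.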

In [3]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except Exception as e:      
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))      
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result
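
If the offending characters are raw control characters such as tabs inside string values (an assumption about the data, not something the competition files are known to guarantee), the standard `json` module can also accept them directly through its `strict=False` option. A minimal sketch on a made-up line:

```python
import json

# Made-up JSON line with a raw tab character inside a string value.
raw_line = '{"title": "Tabs\tinside", "claps": 3}'

try:
    json.loads(raw_line)
    strict_ok = True
except json.JSONDecodeError:
    # The default (strict) parser rejects control characters in strings.
    strict_ok = False

# strict=False allows control characters (codes 0-31) inside strings.
lenient = json.loads(raw_line, strict=False)
print(strict_ok, lenient)
```

This option does not cover every malformed line (invalid backslash escapes, for instance, still raise an error), so it complements the retry function above rather than replacing it.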

Extract features `content`, `published`, `title` and `author`, write them to separate files for train and test sets.

In [22]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):
    
    features = ['content', 'published', 'title', 'author']
    prefix = 'train' if is_train else 'test'
    feature_files = [open(os.path.join(path_to_data,
                                       '{}_{}.txt'.format(prefix, feat)),
                          'w', encoding='utf-8')
                     for feat in features]
    i = 0
    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file:

        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
            content_no_html_tags = strip_tags(content)
            feature_files[0].write(content_no_html_tags + '\n')
            feature_files[1].write(json.dumps(json_data['published']['$date']))
            feature_files[1].write('\n')
            feature_files[2].write(json.dumps(strip_tags(json_data['title'])))
            feature_files[2].write('\n')
            feature_files[3].write(json.dumps(json_data['author']['url']))
            feature_files[3].write('\n')
    # Close the output files so that all buffered data is flushed to disk.
    for feature_file in feature_files:
        feature_file.close()

In [4]:
PATH_TO_DATA = 'C:/ml/medium/' # modify this if you need to

In [6]:
extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)


In [24]:
extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)




**Add the following groups of features:**

- Tf-Idf with article content (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Tf-Idf with article titles (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Time features: publication hour, whether it's morning, day, night, whether it's a weekend
- Bag of authors (i.e. One-Hot-Encoded author names)
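
The first two feature groups can be sketched on a toy corpus as follows. The texts and variable names are made up for illustration. Unlike the cells in this notebook, which fit the vectorizer on train and test texts together, this sketch fits on the training part only and then transforms the test part.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for article contents and titles (made up).
train_texts = ['deep learning for text', 'cooking pasta at home', 'text mining basics']
test_texts = ['learning to cook pasta']
train_titles = ['Deep learning', 'Pasta', 'Text mining']
test_titles = ['Cook pasta']

# One vectorizer per text field, so each field gets its own vocabulary.
content_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=100000)
title_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=100000)

X_train_content = content_vec.fit_transform(train_texts)
X_test_content = content_vec.transform(test_texts)
X_train_title = title_vec.fit_transform(train_titles)
X_test_title = title_vec.transform(test_titles)

# Stack the blocks column-wise and convert to CSR for fast row slicing.
X_train = csr_matrix(hstack([X_train_content, X_train_title]))
X_test = csr_matrix(hstack([X_test_content, X_test_title]))
print(X_train.shape, X_test.shape)
```

Fitting on the training texts alone keeps the vocabulary independent of the test set; fitting on both, as done here in the notebook, is a common competition shortcut.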

In [5]:
# Read the training titles (one JSON-encoded string per line).
with open(os.path.join(PATH_TO_DATA, 'train_title.txt'), 
              encoding='utf-8') as file_title:
    train_title = [line.strip() for line in file_title]

In [6]:
test_title = list()
with open(os.path.join(PATH_TO_DATA, 'test_title.txt'), 
              encoding='utf-8') as file_title:
    test_title = [line.strip() for line in file_title]

In [7]:
all_titles = train_title+test_title
len(train_title), len(test_title)

(62313, 34645)

In [8]:
idx_split = len(train_title)

In [9]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=100000)

In [33]:
%%time
all_titles_tfidf = vectorizer.fit_transform(all_titles)

Wall time: 2min 19s


In [34]:
import scipy.sparse  # needed for save_npz

scipy.sparse.save_npz('D:/TEMP/all_titles_tfidf.npz', all_titles_tfidf)

In [10]:
train_content = list()
with open(os.path.join(PATH_TO_DATA, 'train_content.txt'), 
              encoding='utf-8') as file_title:
    train_content = [line.strip() for line in file_title]
test_content = list()
with open(os.path.join(PATH_TO_DATA, 'test_content.txt'), 
              encoding='utf-8') as file_title:
    test_content = [line.strip() for line in file_title]
all_content = train_content+test_content
len(train_content), len(test_content)

(62313, 34645)

In [16]:
all_len_content = [len(x) for x in all_content]

In [18]:
from sklearn.preprocessing import scale, OneHotEncoder

In [19]:
# Standardize the content lengths and shape them as a single feature column.
all_len_content_RDY = scale(np.array(all_len_content, dtype=float)).reshape(-1, 1)

In [11]:
%%time
all_content_tfidf = vectorizer.fit_transform(all_content)

Wall time: 11min 41s


In [12]:
all_content_tfidf

<96958x100000 sparse matrix of type '<class 'numpy.float64'>'
	with 105030711 stored elements in Compressed Sparse Row format>

In [13]:
import scipy.sparse

In [15]:
scipy.sparse.save_npz('D:/TEMP/all_content_tfidf.npz', all_content_tfidf)

In [20]:
train_author = list()
with open(os.path.join(PATH_TO_DATA, 'train_author.txt'), 
              encoding='utf-8') as file_title:
    train_author = [line.strip() for line in file_title]
test_author = list()
with open(os.path.join(PATH_TO_DATA, 'test_author.txt'), 
              encoding='utf-8') as file_title:
    test_author = [line.strip() for line in file_title]
all_author = train_author+test_author
len(train_author), len(test_author)

(62313, 34645)

In [21]:
%%time
all_author_OHE = pd.get_dummies(pd.Series(all_author))

Wall time: 8.47 s


In [22]:
train_published = list()
with open(os.path.join(PATH_TO_DATA, 'train_published.txt'), 
              encoding='utf-8') as file_title:
    train_published = [line.strip() for line in file_title]
test_published = list()
with open(os.path.join(PATH_TO_DATA, 'test_published.txt'), 
              encoding='utf-8') as file_title:
    test_published = [line.strip() for line in file_title]
all_published = train_published+test_published
len(train_published), len(test_published)

(62313, 34645)

In [23]:
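from datetime import datetime

def getWeekEnd(data):
    """Return 1 for articles published on Saturday or Sunday, else 0."""
    result = []
    for item in data:
        # Drop the leading quote and the fractional seconds.
        time = item[1:].split('.')[0]
        weekDay = datetime.strptime(time, '%Y-%m-%dT%H:%M:%S').isoweekday()
        result.append(1 if weekDay in [6, 7] else 0)
    return result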

In [25]:
from datetime import datetime
all_weekend = getWeekEnd(all_published)

In [26]:
def getMDAN(data):
    """Encode the publication hour as 1 (0-5), 2 (6-11), 3 (12-17) or 4 (18-23)."""
    result = []
    for item in data:
        time = item.split('.')[0]
        hour = datetime.strptime(time[1:], '%Y-%m-%dT%H:%M:%S').hour
        if hour < 6:
            result.append(1)
        elif hour < 12:
            result.append(2)
        elif hour < 18:
            result.append(3)
        else:
            result.append(4)
    return result

In [27]:
all_PartOfDay = getMDAN(all_published)

In [28]:
def getHour(data):
    """Return the publication hour (0-23) of each article."""
    result = []
    for item in data:
        time = item.split('T')[1].split('.')[0]
        result.append(datetime.strptime(time, '%H:%M:%S').hour)
    return result

In [29]:
all_Hour = getHour(all_published)
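
The three time features can also be derived in a single pass over each timestamp. The sketch below assumes the lines look like `"2017-06-30T23:40:35.633Z"` with the surrounding quotes written by `json.dumps`; the helper name `time_features` and the sample timestamps are made up.

```python
from datetime import datetime


def time_features(published):
    """Return (hour, part_of_day, is_weekend) for one timestamp string."""
    # Drop whitespace and the surrounding quotes, keep 'YYYY-MM-DDTHH:MM:SS'.
    stamp = published.strip().strip('"')[:19]
    moment = datetime.strptime(stamp, '%Y-%m-%dT%H:%M:%S')
    hour = moment.hour
    if hour < 6:
        part_of_day = 1  # 00:00-05:59
    elif hour < 12:
        part_of_day = 2  # 06:00-11:59
    elif hour < 18:
        part_of_day = 3  # 12:00-17:59
    else:
        part_of_day = 4  # 18:00-23:59
    # isoweekday() returns 6 for Saturday and 7 for Sunday.
    is_weekend = int(moment.isoweekday() in (6, 7))
    return hour, part_of_day, is_weekend


friday_night = time_features('"2017-06-30T23:40:35.633Z"')
saturday_early = time_features('"2017-07-01T05:00:00.000Z"')
print(friday_night, saturday_early)
```

Parsing each timestamp once keeps the three features consistent with one another and avoids repeating the string handling in every helper.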

In [30]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()

In [31]:
all_PartOfDay_OHE = enc.fit_transform(pd.Series(all_PartOfDay).values.reshape(-1,1))
all_weekend_OHE = enc.fit_transform(pd.Series(all_weekend).values.reshape(-1,1))

In [35]:
X_all_sparse = csr_matrix(hstack([all_content_tfidf,
                                  all_titles_tfidf,
                                  all_len_content_RDY,
                                  all_author_OHE,
                                  all_PartOfDay_OHE,
                                  all_weekend_OHE]))

In [36]:
ridge = Ridge(random_state=17)

In [38]:
train_target = pd.read_csv('train_log1p_recommends.csv', 
                           index_col='id')
y_train = train_target['log_recommends'].values

In [39]:
%%time
ridge.fit(X_all_sparse[:idx_split], y_train);

Wall time: 5min 49s


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=17, solver='auto', tol=0.001)

In [40]:
ridge_pred = ridge.predict(X_all_sparse[idx_split:])

In [42]:
ridge_pred.mean()

3.1196268617232237

In [49]:
ridge_pred_hack = ridge_pred + (4.3329 - ridge_pred.mean())

In [44]:
ridge_pred_hack.mean()

4.332799999999999

In [47]:
write_submission_file(ridge_pred, 'DARKNIGHT1-clear.csv')

In [50]:
write_submission_file(ridge_pred_hack, 'DARKNIGHT1-hack-vk.csv')

**Join all sparse matrices.**

In [None]:
X_train_sparse = csr_matrix(hstack([X_train_content_sparse, X_train_title_sparse,
                                    X_train_author_sparse, X_train_time_features_sparse]))

In [None]:
X_test_sparse = csr_matrix(hstack([X_test_content_sparse, X_test_title_sparse,
                                    X_test_author_sparse, X_test_time_features_sparse]))

**Read train target and split data for validation.**

In [52]:
train_target = pd.read_csv('train_log1p_recommends.csv', 
                           index_col='id')
y_train = train_target['log_recommends'].values

In [None]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [None]:
# Your code here
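
One possible shape for this step, sketched on synthetic data rather than on the competition matrices: the random features, the target and the variable names below are assumptions made so that the snippet runs on its own.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(17)

# Synthetic sparse features (about 10% non-zero) and a noisy linear target.
dense = rng.rand(200, 50) * (rng.rand(200, 50) < 0.1)
X = csr_matrix(dense)
y = dense.dot(rng.rand(50)) + 0.01 * rng.rand(200)

# Same 70/30 ordered split as in the cell above.
train_part_size = int(0.7 * X.shape[0])
X_part, y_part = X[:train_part_size], y[:train_part_size]
X_valid, y_valid = X[train_part_size:], y[train_part_size:]

ridge_demo = Ridge(random_state=17)
ridge_demo.fit(X_part, y_part)
valid_pred = ridge_demo.predict(X_valid)
valid_mae = mean_absolute_error(y_valid, valid_pred)
print(valid_mae)
```

On the real data the same pattern would use `X_train_part_sparse`, `y_train_part`, `X_valid_sparse` and `y_valid` from the split above.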

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [None]:
# Your code here

In [46]:
def write_submission_file(prediction, filename,
                          path_to_sample='sample_submission.csv'):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [None]:
write_submission_file(ridge_test_pred, 'assignment6_medium_submission.csv')

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeroes. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [33]:
# `ridge_pred` holds the test-set predictions computed earlier in this notebook.
write_submission_file(np.zeros_like(ridge_pred), 
                      'medium_all_zeros_submission.csv')

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**
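
The idea behind the hack can be checked on synthetic data. For a non-negative target, the MAE of an all-zero submission equals the mean of the hidden target, so the leaderboard score reveals that mean and the predictions can be shifted to match it. The numbers below are generated, not taken from the competition, and the shift is the simple mean correction used earlier in this notebook.

```python
import numpy as np

rng = np.random.RandomState(17)

# Synthetic non-negative "hidden" targets; on Kaggle these are not available.
y_true = np.log1p(rng.poisson(lam=50, size=1000))
# Synthetic predictions that are systematically too low by about 1.
y_pred = y_true - 1.0 + 0.1 * rng.randn(1000)

# MAE of an all-zero submission: mean(|y - 0|) = mean(y) when y >= 0.
zeros_mae = np.mean(np.abs(y_true - np.zeros_like(y_true)))

# Shift the predictions so that their mean matches the revealed target mean.
y_pred_shifted = y_pred + (zeros_mae - y_pred.mean())

mae_before = np.mean(np.abs(y_true - y_pred))
mae_after = np.mean(np.abs(y_true - y_pred_shifted))
print(zeros_mae, mae_before, mae_after)
```

The shift helps here because the synthetic predictions are biased by a constant; how much it helps on real predictions depends on how far their mean is from the target mean.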

In [None]:
ridge_test_pred_modif = ridge_test_pred  # Your code here

In [None]:
write_submission_file(ridge_test_pred_modif, 
                      'assignment6_medium_submission_with_hack.csv')