<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Edited by Sergey Kolchenko (@KolchenkoSergey). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

# <center> Assignment #2. Spring 2019
## <center>  Competition 2. Predicting Medium articles popularity with Ridge Regression <br>(beating baselines in the "Medium" competition)
    
<img src='../../img/medium_claps.jpg' width=40% />


In this [competition](https://www.kaggle.com/c/how-good-is-your-medium-article) we are predicting Medium article popularity based on its features like content, title, author, tags, reading time etc. 

Prior to working on the assignment, you'd better check out the corresponding course material:
 1. [Classification, Decision Trees and k Nearest Neighbors](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic03_decision_trees_kNN/topic3_decision_trees_kNN.ipynb?flush_cache=true), the same as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-3-decision-trees-and-knn) (basics of machine learning are covered here)
 2. Linear classification and regression in 5 parts: 
    - [ordinary least squares](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-1-ols)
    - [linear classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification)
    - [regularization](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-3-regularization)
    - [logistic regression: pros and cons](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit)
    - [validation](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-5-validation)
 3. You can also practice with demo assignments, which are simpler and already shared with solutions: 
    - " Sarcasm detection with logistic regression": [assignment](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit) + [solution](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit-solution)
    - "Linear regression as optimization": [assignment](https://www.kaggle.com/kashnitsky/a4-demo-linear-regression-as-optimization/edit) (solution cannot be officially shared)
    - "Exploring OLS, Lasso and Random Forest in a regression task": [assignment](https://www.kaggle.com/kashnitsky/a6-demo-linear-models-and-rf-for-regression) + [solution](https://www.kaggle.com/kashnitsky/a6-demo-regression-solution)
 4. Baseline with Ridge regression and "bag of words" for article content, [Kernel](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline)
 5. Other [Kernels](https://www.kaggle.com/c/how-good-is-your-medium-article/kernels?sortBy=voteCount&group=everyone&pageSize=20&competitionId=8673) in this competition. You can share yours as well, but not high-performing ones (Public LB MAE shall be > 1.5). Please don't spoil the competitive spirit.  
 6. If that's still not enough, watch two videos (Linear regression and regularization) from here [mlcourse.ai/video](https://mlcourse.ai/video), the second one on LTV prediction is smth that you won't typically find in a MOOC - real problem, real metrics, real data.

**Your task:**
 1. "Freeride". Come up with good features to beat the baselines "A2 baseline (10 credits)" (**1.45082** Public LB MAE) and "A2 strong baseline (20 credits)"  (**1.41117** Public LB MAE). As names suggest, you'll get 10 more credits for beating the first one, and 10 more (20 in total) for beating the second one. You need to name your [team](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2/team) (out of 1 person) in full accordance with the [course rating](https://docs.google.com/spreadsheets/d/1LAy1eK8vIONzIWgcCEaVmhKPSj579zK5lrECf_tQT60/edit?usp=sharing) (for newcomers: you need to name your team with your real full name). You can think of it as a part of the assignment.
 2. If you've beaten "A2 baseline (10 credits)" or performed better, you need to upload your solution as described in [course roadmap](https://mlcourse.ai/roadmap) ("Kaggle Inclass Competition Medium"). For all baselines that you see on Public Leaderboard, it's OK to beat them on Public LB as well. But 10 winners will be defined according to the private LB, which will be revealed by @yorko on March 11. 
 
### <center> Deadline for A2: 2019 March 10, 20:59 GMT (London time)
 
### How to get help
In [ODS Slack](https://opendatascience.slack.com) (if you still don't have access, fill in the [form](https://docs.google.com/forms/d/1BMqcUc-hIQXa0HB_Q2Oa8vWBtGHXk8a6xo5gPnMKYKA/edit) mentioned on the mlcourse.ai main page), we have a channel **#mlcourse_ai_news** with announcements from the course team.
You can discuss the course content freely in the **#mlcourse_ai** channel (we still have a huge Russian-speaking group, they have a separate channel **#mlcourse_ai_rus**).

Please stick this special thread for your questions:
 - [#a2_medium](https://opendatascience.slack.com/archives/C91N8TL83/p1549882568052400) 
 
Help each other without sharing actual code. Our TA Artem @datamove is there to help (only in the mentioned thread, do not write to him directly).

In [96]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

The following code will help to throw away all HTML tags from an article content.

In [97]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Supplementary function to read a JSON line without crashing on escape characters.

In [98]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except Exception as e:      
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))      
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result

def extract_name(author):
    name = author['url'].split('@')[-1]
    return name

Extract features `content`, `published`, `title` and `author`, write them to separate files for train and test sets.

In [100]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):
    
    features = ['content', 'published', 'title', 'author']
    prefix = 'train' if is_train else 'test'
    feature_files = [open(os.path.join(path_to_data,
                                       '{}_{}.txt'.format(prefix, feat)),
                          'w', encoding='utf-8')
                     for feat in features]
    
    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file:
        nol = 0
        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            nol += 1
            for fea in features:
                file_name = feature_files[features.index(fea)]
                dt = json_data[fea]
                if fea in ['content', 'title']:
                    file_name.write(strip_tags(dt).replace('\n', ' ').replace('\r', ' ') + '\n')
                elif fea == 'published':
                    file_name.write(dt['$date'] + '\n')
                elif fea == 'author':
                    file_name.write(extract_name(dt) + '\n')
    [fifle.close() for fifle in feature_files]
    print(f'{nol} lines')

In [94]:
PATH_TO_DATA = 'data/kaggle_medium' # modify this if you need to

In [101]:
extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))


62313 lines


In [102]:
extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))


34645 lines


**Add the following groups of features:**
    - Tf-Idf with article content (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Tf-Idf with article titles (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Time features: publication hour, whether it's morning, day, night, whether it's a weekend
    - Bag of authors (i.e. One-Hot-Encoded author names)

In [1]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.utils import shuffle
import seaborn as sns
%matplotlib inline

PATH_TO_DATA = 'data/kaggle_medium' # modify this if you need to

## Read train target 

In [2]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')
y_train = train_target['log_recommends'].values

# Features creating
## titles

In [3]:
%%time
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100000, )

with open('data/kaggle_medium/train_title.txt') as file:
    X_train_title_sparse = vectorizer.fit_transform(file)

with open('data/kaggle_medium/test_title.txt') as file:
    X_test_title_sparse = vectorizer.transform(file)

print(X_train_title_sparse.shape, X_test_title_sparse.shape)

CPU times: user 5.94 s, sys: 137 ms, total: 6.07 s
Wall time: 6.08 s


## contexts

In [4]:
%%time
vectorizer = TfidfVectorizer(max_features=100000, )

with open('data/kaggle_medium/train_content.txt') as file:
    X_train_content_sparse = vectorizer.fit_transform(file)

with open('data/kaggle_medium/test_content.txt') as file:
    X_test_content_sparse = vectorizer.transform(file)

print(X_train_content_sparse.shape, X_test_content_sparse.shape)

CPU times: user 2min 21s, sys: 2.32 s, total: 2min 24s
Wall time: 2min 24s


## authors

In [5]:
ohe = OneHotEncoder(handle_unknown='ignore')

with open('data/kaggle_medium/train_author.txt') as file:
    X_train_author_sparse = ohe.fit_transform(np.reshape(list(map(str.strip, file.readlines())), (-1,1)))  

with open('data/kaggle_medium/test_author.txt') as file:
    X_test_author_sparse = ohe.transform(np.reshape(list(map(str.strip, file.readlines())), (-1,1)))  

X_train_author_sparse.shape, X_test_author_sparse.shape

((62313, 31540), (34645, 31540))

## times are not needed

In [19]:
with open('data/kaggle_medium/train_published.txt') as file:
    train_times = pd.to_datetime(list(map(str.strip, file.readlines())))

with open('data/kaggle_medium/test_published.txt') as file:
    test_times = pd.to_datetime(list(map(str.strip, file.readlines())))

def add_time_features(times):
    time_df = pd.DataFrame()

    time_df['hour'] = times.hour
    time_df['day_of_week'] = times.dayofweek
    time_df['month'] = times.month

    session_start_hour = time_df.hour
    time_df['morning'] = ((session_start_hour >= 6) & (session_start_hour <= 11)).astype('int')
    time_df['day'] = ((session_start_hour >= 12) & (session_start_hour <= 18)).astype('int')
    time_df['evening'] = ((session_start_hour >= 19) & (session_start_hour <= 23)).astype('int')
    time_df['night'] = ((session_start_hour >= 0) & (session_start_hour <= 5)).astype('int')

    time_df['is_weekday'] = (time_df.hour <= 4).astype('int')

    time_df['winter'] = ((time_df.month >= 1) & (time_df.month <= 2) | (time_df.month == 12)).astype('int')
    time_df['spring'] = ((time_df.month >= 3) & (time_df.month <= 5)).astype('int')
    time_df['summer'] = ((time_df.month >= 6) & (time_df.month <= 8)).astype('int')
    time_df['fall'] = ((time_df.month >= 9) & (time_df.month <= 11)).astype('int')
    
    need_cols = 'morning 	day 	evening 	night 	is_weekday 	winter 	spring 	summer 	fall'.split()
    time_feat = csr_matrix(time_df.loc[:, need_cols])
    return time_feat

X_train_time_features_sparse = add_time_features(train_times)
X_test_time_features_sparse = add_time_features(test_times)

X_train_time_features_sparse.shape, X_test_time_features_sparse.shape

((62313, 9), (34645, 9))

**Join all sparse matrices.**

In [8]:
fff = [X_train_title_sparse, X_train_author_sparse,]
X_train_sparse = hstack(fff).tocsr()

In [6]:
ggg = [X_test_title_sparse, X_test_author_sparse,]
X_test_sparse = hstack(ggg).tocsr()

In [9]:
X_test_sparse.shape, X_train_sparse.shape, y_train.shape

((34645, 131540), (62313, 131540), (62313,))

**Split data for validation.**

In [82]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [83]:
ridge = Ridge()

In [84]:
%%time
ridge.fit(X_train_part_sparse, np.log1p(y_train_part))
y_pred = np.expm1(ridge.predict(X_valid_sparse))
mae_ridge = mean_absolute_error(y_valid, y_pred)
print(mae_ridge)

1.4472964694929422
CPU times: user 427 ms, sys: 0 ns, total: 427 ms
Wall time: 425 ms


In [6]:
def get_score(X_train_sparse, y_train=y_train, cv=5, idsplit=0.7):
    train_part_size = int(idsplit * train_target.shape[0])
    X_train_sparse = X_train_sparse.copy()
    scores = []
    
    for _ in range(cv):
        shuffle(X_train_sparse)
        X_train_part_sparse = X_train_sparse[:train_part_size, :]
        y_train_part = y_train[:train_part_size]
        X_valid_sparse =  X_train_sparse[train_part_size:, :]
        y_valid = y_train[train_part_size:]

        ridge = Ridge(alpha=1.27)
        ridge.fit(X_train_part_sparse, np.log1p(y_train_part))
        y_pred = np.expm1(ridge.predict(X_valid_sparse))
        mae_ridge = mean_absolute_error(y_valid, y_pred)
        scores.append(mae_ridge)
    print(np.mean(scores), np.std(scores))
    return cv

In [8]:
all_sets = [X_train_title_sparse, X_train_content_sparse, X_train_author_sparse]
names = ['X_train_title_sparse', 'X_train_content_sparse', 'X_train_author_sparse']
for x in range(3):
    X_train_sparse = all_sets[x].tocsr()
    cv = get_score(X_train_sparse)
    print(f"for {names[x].split(str('_'))[2]}\n")

1.2378187007834203 2.252875322913526e-06
for title

1.182999498317166 9.336416392583952e-06
for content

1.1989671870267977 1.266687128618093e-07
for author



In [12]:
for x in range(3):
    for y in range(x+1, 3):
        X_train_sparse = hstack([all_sets[x], all_sets[y]]).tocsr()
        cv = get_score(X_train_sparse)
        print(f"for {names[x].split(str('_'))[2]} and {names[y].split(str('_'))[2]}\n")

1.1427608626999963 6.6017588753076865e-06
for title and content

1.1305215345894362 8.282016238256069e-07
for title and author

1.0861344883378947 3.3082789976271535e-06
for content and author



In [13]:
X_train_sparse = hstack([X_train_title_sparse, 
                         X_train_content_sparse, 
                         X_train_author_sparse]).tocsr()

cv3 = get_score(X_train_sparse)

1.0735725703960457 1.498952676574712e-06


In [10]:
X_train_sparse = hstack([X_train_title_sparse, 
                         X_train_time_features_sparse, 
                         X_train_author_sparse, 
                         X_train_content_sparse]).tocsr()
# X_train_sparse = X_train_content_sparse

cv4 = get_score(X_train_sparse)



TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

In [21]:
c_values = np.logspace(-2, 2, 20)
ridge = Ridge()

ridge_grid_searcher = GridSearchCV(estimator=ridge, param_grid={'alpha': c_values},
                                  scoring='neg_mean_absolute_error', n_jobs=4, cv=8, verbose=1)

In [23]:
ridge_grid_searcher.fit(X_train_title_sparse, y_train)

Fitting 8 folds for each of 20 candidates, totalling 160 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  1.8min
[Parallel(n_jobs=4)]: Done 160 out of 160 | elapsed:  2.7min finished


GridSearchCV(cv=8, error_score='raise-deprecating',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'alpha': array([1.00000e-02, 1.62378e-02, 2.63665e-02, 4.28133e-02, 6.95193e-02,
       1.12884e-01, 1.83298e-01, 2.97635e-01, 4.83293e-01, 7.84760e-01,
       1.27427e+00, 2.06914e+00, 3.35982e+00, 5.45559e+00, 8.85867e+00,
       1.43845e+01, 2.33572e+01, 3.79269e+01, 6.15848e+01, 1.00000e+02])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_absolute_error', verbose=1)

In [29]:
ridge_grid_searcher.best_score_, ridge_grid_searcher.best_params_

(-1.299899962977019, {'alpha': 1.2742749857031335})

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [17]:
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA, 
                                                      'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [15]:
%%time
# X_train_sparse = hstack([X_train_title_sparse, X_train_author_sparse]).tocsr()
X_test_sparse = hstack([X_test_title_sparse, X_test_content_sparse, X_test_author_sparse]).tocsr()

best_estimator = Ridge(alpha=1.27)
best_estimator.fit(X_train_sparse, np.log1p(y_train))
ridge_test_pred = np.expm1(best_estimator.predict(X_test_sparse))

CPU times: user 21.5 s, sys: 416 ms, total: 21.9 s
Wall time: 21.9 s


In [23]:
write_submission_file(ridge_test_pred, os.path.join(PATH_TO_DATA,
                                                    'assignment7_medium_submission.csv'))

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeros. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

**UPD:** There is a [tutorial](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/kaggle_leaderboard_probing_nikolai_timonin.ipynb) on leaderboard probing which is written within mlcourse.ai and is relevant here. (Originally, contestants were supposed to come up with simple probing techniques on their own. But now when this tutorial is shared, we eliminate "discovery bias" and equalize everybody's chances by sharing this tutorial).

In [None]:
write_submission_file(np.zeros_like(ridge_test_pred), 
                      os.path.join(PATH_TO_DATA,
                                   'medium_all_zeros_submission.csv'))

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [18]:
mean_test_target = 4.33328 
ridge_test_pred_modif = ridge_test_pred + mean_test_target - y_train.mean()

write_submission_file(ridge_test_pred_modif, 'hack_ridge2_submission.csv')

In [None]:
write_submission_file(ridge_test_pred_modif, 
                      os.path.join(PATH_TO_DATA,
                                   'assignment2_medium_submission_with_hack.csv'))

Some ideas for improvement:

- Engineer good features, this is the key to success. Some simple features will be based on publication time, authors, content length and so on
- You may not ignore HTML and extract some features from there
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In our example, we've left only 50k features and used C=1 as a regularization parameter, this can be changed
- SGD and Vowpal Wabbit will learn much faster
- Play around with blending and/or stacking. An intro is given in [this Kernel](https://www.kaggle.com/kashnitsky/ridge-and-lightgbm-simple-blending) by @yorko 
- In our course, we don't cover neural nets. But it's not obliged to use GRUs/LSTMs/whatever in this competition.

Good luck!

<img src='../../img/kaggle_shakeup.png' width=50%>