<img src="https://habrastorage.org/webt/ia/m9/zk/iam9zkyzqebnf_okxipihkgjwnw.jpeg" />
    
**<center>[mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course** </center><br>
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). [mlcourse.ai](https://mlcourse.ai) is powered by [OpenDataScience (ods.ai)](https://ods.ai/) © 2017—2021

## <center>Assignment #6. Task</center><a class="tocSkip">
### <center> Beating benchmarks in "How good is your Medium article?"</center><a class="tocSkip">
    
[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "Assignment 6 baseline" (~1.45 Public LB score). You can refer to [this simple Ridge baseline](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline?rvi=1).

*For discussions, please stick to [ODS Slack](https://opendatascience.slack.com/), channel __#mlcourse_ai_eng__, pinned thread __#a6_bonus__. If you are sure that something is not 100% correct, please leave your feedback there*

-----

# Imports

In [1]:
# !pip install xgboost nltk dill

In [2]:
import json
import os
from collections import defaultdict
from html.parser import HTMLParser
from pathlib import Path

import dill as pickle
import numpy as np
import pandas as pd
import xgboost as xgb
from nltk import TweetTokenizer
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from tqdm.notebook import tqdm

# Params

In [3]:
SEED = 42
PATH_TO_DATA = "data"
PATH_TO_SAVE_DIR = "prepared_data"

# Cleaning HTML

The following code strips all HTML tags from an article's content.

In [4]:
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        # Let HTMLParser initialize its own state; convert_charrefs=True decodes entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return "".join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
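
As a quick sanity check, here is a minimal, self-contained sketch of the same idea applied to a made-up HTML snippet. The names `TagStripper` and `strip_tags_demo` and the sample string are illustrative assumptions, not part of the assignment code; only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def strip_tags_demo(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()


sample_html = "<p>Hello, <b>Medium</b> &amp; friends!</p>"
print(strip_tags_demo(sample_html))
```

Only the text between tags survives, and the `&amp;` entity comes back as a plain ampersand.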

# Reading JSON

A helper function that reads a JSON line without crashing on invalid escape characters.

In [5]:
def read_json_line(line=None):
    result = None
    try:
        result = json.loads(line)
    except json.JSONDecodeError as e:
        # Index of the offending character, as reported by the decoder:
        idx_to_replace = e.pos
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = " "
        new_line = "".join(new_line)
        return read_json_line(line=new_line)
    return result
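
To see the repair step in isolation, the sketch below applies the same "blank out the offending character and retry" idea to a made-up line with an invalid `\q` escape. The function name `read_json_line_bounded` and the `max_fixes` cap are assumptions added for illustration; the cap simply guards against retrying forever on lines that cannot be repaired this way.

```python
import json


def read_json_line_bounded(line, max_fixes=100):
    """Parse one JSON line, blanking out offending characters (hypothetical variant)."""
    for _ in range(max_fixes):
        try:
            return json.loads(line)
        except json.JSONDecodeError as err:
            if err.pos >= len(line):
                raise  # e.g. a truncated line: there is no character left to blank out
            # err.pos is the index where decoding failed; overwrite that character with a space
            line = line[: err.pos] + " " + line[err.pos + 1 :]
    raise ValueError(f"could not repair the line after {max_fixes} fixes")


broken = '{"title": "bad \\q escape"}'  # the JSON text contains the invalid escape \q
parsed = read_json_line_bounded(broken)
print(parsed)
```

The trade-off is the same as in the helper above: the line becomes parseable, at the cost of silently altering a character of the text.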

Extract the `content`, `published`, `title`, and `author` features and cache them in separate files for the train and test sets.

In [27]:
def check_file(filename):
    file = Path(filename)
    return file.is_file()

In [28]:
def read_from_disk(features, path_to_save, prefix):
    reload_files = False
    d = defaultdict()
    for feature in features:
        filename = os.path.join(path_to_save, prefix + "_" + feature + ".txt")
        if check_file(filename):
            print(f"Reading from disk {filename}")
            with open(filename, "rb") as fp:
                d[feature] = pickle.load(fp)
        else:
            reload_files = True

    return reload_files, d

In [29]:
def extract_features_and_write(path_to_data, path_to_save, inp_filename, is_train=True):
    titles = []
    contents = []
    dates = []

    authors = []
    features = ["content", "published", "title", "author"]
    prefix = "train" if is_train else "test"

    Path(path_to_save).mkdir(parents=True, exist_ok=True)

    reload_files, d = read_from_disk(features, path_to_save, prefix)

    if reload_files:
        # Parse the raw JSON line by line; the extracted features are pickled below.
        with open(os.path.join(path_to_data, inp_filename), encoding="utf-8") as inp_json_file:

            for line in tqdm(inp_json_file, desc=f"Reading {prefix} json files"):
                json_data = read_json_line(line)

                title = json_data["title"].replace("\n", " ").replace("\t", " ").replace("\r", " ").replace("\xa0", " ")
                content = strip_tags(
                    json_data["content"].replace("\n", " ").replace("\t", " ").replace("\r", " ").replace("\xa0", " ")
                )
                published = json_data["published"]
                authors_name = json_data["meta_tags"]["author"]

                titles.append(title)
                contents.append(content)
                dates.append(published)
                authors.append(authors_name)

        d = {"content": contents, "published": dates, "title": titles, "author": authors}
        for feature in features:
            filename = prefix + "_" + feature + ".txt"
            with open(os.path.join(path_to_save, filename), "wb") as fp:
                pickle.dump(d[feature], fp)
    else:
        titles, contents, dates, authors = d["title"], d["content"], d["published"], d["author"]

    return titles, contents, dates, authors
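
The caching logic above follows a common "load if present, otherwise compute and save" pattern. Here is a minimal sketch of that pattern on its own, using only the standard library. The helper `load_or_compute`, the file name, and the toy data are assumptions for illustration and do not appear in the assignment code.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the cached object if the file exists, else compute and cache it."""
    cache_path = Path(cache_path)
    if cache_path.is_file():
        with open(cache_path, "rb") as fp:
            return pickle.load(fp)
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, "wb") as fp:
        pickle.dump(result, fp)
    return result


calls = []


def expensive_parse():
    calls.append(1)  # track how many times the slow path actually runs
    return ["first title", "second title"]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "prepared_data" / "train_title.pkl"
    first = load_or_compute(cache_file, expensive_parse)  # computes and writes the cache
    second = load_or_compute(cache_file, expensive_parse)  # served from disk

print(first == second, len(calls))
```

Keep in mind that a cache like this is never invalidated automatically: if the parsing code changes, the cached files have to be deleted by hand.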

Download the [competition data](https://www.kaggle.com/c/how-good-is-your-medium-article/data) and place it where it's convenient for you. You can modify the path to data below.

In [30]:
train_titles, train_contents, train_dates, train_authors = extract_features_and_write(
    PATH_TO_DATA, PATH_TO_SAVE_DIR, "train.json", is_train=True
)
test_titles, test_contents, test_dates, test_authors = extract_features_and_write(
    PATH_TO_DATA, PATH_TO_SAVE_DIR, "test.json", is_train=False
)

Reading from disk prepared_data\train_content.txt
Reading from disk prepared_data\train_published.txt
Reading from disk prepared_data\train_title.txt
Reading from disk prepared_data\train_author.txt
Reading from disk prepared_data\test_content.txt
Reading from disk prepared_data\test_published.txt
Reading from disk prepared_data\test_title.txt
Reading from disk prepared_data\test_author.txt


# Feature engineering

**Add the following groups of features:**
* Tf-Idf with article content:
  * ngram_range=(1, 2)
  * max_features=100000
* Tf-Idf with article titles:
  * ngram_range=(1, 2)
  * max_features=100000
* Time features:
  * publication hour
  * time of the day
  * weekend or not
* Bag of authors, i.e. one-hot-encoded author names
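
To make `ngram_range=(1, 2)` concrete: the vectorizer builds its vocabulary from single tokens and from pairs of adjacent tokens, and `max_features` then keeps only the most frequent terms across the corpus. The sketch below only enumerates the candidate n-grams for one made-up title; the helper `word_ngrams` and the whitespace tokenization are simplifying assumptions, and no TF-IDF weighting is computed here.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """List all contiguous n-grams for n between min_n and max_n (illustrative helper)."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start : start + n]))
    return ngrams


tokens = "how good is your article".split()
title_ngrams = word_ngrams(tokens, ngram_range=(1, 2))
print(title_ngrams)
```

Five tokens yield five unigrams and four bigrams, which is why bigram vocabularies grow quickly and a cap such as `max_features` becomes useful.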

In [8]:
TITLE_NGRAMS = (1, 2)  # for tf-idf on titles
CONTENT_NGRAMS = (1, 2)  # for tf-idf on contents
MAX_FEATURES = 100_000  # for tf-idf

XGB_TRAIN_ROUNDS = 60  # num. iteration to train XGBoost
XGB_NUM_LEAVES = 255  # max number of leaves in XGBoost trees
XGB_WEIGHT = 0.4

MEAN_TEST_TARGET = 4.33328  # what we got by submitting all zeros

RIDGE_WEIGHT = 0.6  # weight of Ridge predictions in a blend with XGBoost

## Tokenizer

In [10]:
tokenizer = TweetTokenizer()
assert tokenizer.tokenize("Now I'm a man") == ["Now", "I'm", "a", "man"]

In [11]:
def tokenize(text):
    return tokenizer.tokenize(text)

## Doing TF-IDF vectorization for articles

In [48]:
art_vec_params = {
    "ngram_range": CONTENT_NGRAMS,
    "max_features": MAX_FEATURES,
    "tokenizer": tokenize,
    "stop_words": list(ENGLISH_STOP_WORDS),  # pass a plain list rather than a frozenset
}

In [1]:
article_vectorizer_params_str = f"{art_vec_params['ngram_range']}_{art_vec_params['max_features']}"
article_vec_name = f"vectorizer_article_{article_vectorizer_params_str}.pickle"


In [None]:
%%time
article_vec_is_saved = check_file(article_vec_name)
if article_vec_is_saved:
    vectorizer_article = pickle.load(open(article_vec_name, "rb"))
else:
    vectorizer_article = TfidfVectorizer(**art_vec_params)
    vectorizer_article.fit(train_contents)
    pickle.dump(vectorizer_article, open(article_vec_name, "wb"))
    
X_train_article = vectorizer_article.transform(train_contents)
X_test_article = vectorizer_article.transform(test_contents)

## Doing TF-IDF vectorization for titles

In [79]:
title_vec_params = {
    "ngram_range": TITLE_NGRAMS,
    "max_features": MAX_FEATURES,
    "tokenizer": tokenize,
    "stop_words": list(ENGLISH_STOP_WORDS),  # pass a plain list rather than a frozenset
}

In [80]:
title_vectorizer_params_str = f"{title_vec_params['ngram_range']}_{title_vec_params['max_features']}"
title_vec_name = f"vectorizer_title_{title_vectorizer_params_str}.pickle"

In [82]:
%%time
title_vec_is_saved = check_file(title_vec_name)
if title_vec_is_saved:
    vectorizer_title = pickle.load(open(title_vec_name, "rb"))
else:
    vectorizer_title = TfidfVectorizer(**title_vec_params)
    vectorizer_title.fit(train_titles)
    pickle.dump(vectorizer_title, open(title_vec_name, "wb"))

X_train_title = vectorizer_title.transform(train_titles)
X_test_title = vectorizer_title.transform(test_titles)

CPU times: total: 6.14 s
Wall time: 6.15 s


## Preparing time features

In [None]:
# Time features
def add_time_features(dates):
    # Derive the flags from raw hours and weekdays; scaling happens only at the end.
    hour = np.array([date.hour for date in dates])
    weekday = np.array([date.weekday() for date in dates])
    features = {
        "hour": hour,
        "morning": (hour >= 7) & (hour <= 11),
        "day": (hour >= 12) & (hour <= 18),
        "evening": (hour >= 19) & (hour <= 23),
        "night": (hour >= 0) & (hour <= 6),
        "weekend": weekday >= 5,  # Saturday and Sunday
    }
    feature_names = list(features)
    time_features = np.column_stack([features[name] for name in feature_names]).astype(float)
    # Standardize every column before stacking it next to the TF-IDF blocks.
    sparse_time_features = csr_matrix(StandardScaler().fit_transform(time_features))
    return sparse_time_features, feature_names

In [None]:
train_times = pd.to_datetime([date["$date"] for date in train_dates])
test_times = pd.to_datetime([date["$date"] for date in test_dates])

X_train_time_features_sparse, time_feature_names = add_time_features(train_times)
X_test_time_features_sparse, _ = add_time_features(test_times)
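
The bucketing rules used for the time features can be checked on a single timestamp. The following is a small standard-library sketch; the helper names and the example date are made up for illustration.

```python
from datetime import datetime


def time_of_day(hour):
    """Map an hour (0-23) to the buckets used in this notebook."""
    if 7 <= hour <= 11:
        return "morning"
    if 12 <= hour <= 18:
        return "day"
    if 19 <= hour <= 23:
        return "evening"
    return "night"  # hours 0-6


def describe_timestamp(ts):
    return {
        "hour": ts.hour,
        "time_of_day": time_of_day(ts.hour),
        "is_weekend": ts.weekday() >= 5,  # Monday is 0, so 5 and 6 are Saturday and Sunday
    }


# A made-up publication time: Saturday, 2017-06-10, 08:30
example = describe_timestamp(datetime(2017, 6, 10, 8, 30))
print(example)
```

Note that the flags are computed from the raw hour; comparing a standardized hour against thresholds such as 7 or 19 would put almost every article into the same bucket.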

## Doing bag of authors

In [None]:
authors = np.unique(train_authors + test_authors)
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(authors.reshape(-1, 1))
# Keep the encoded matrices sparse: a dense articles-by-authors array is needlessly large.
X_train_author_sparse = enc.transform(np.array(train_authors).reshape(-1, 1))
X_test_author_sparse = enc.transform(np.array(test_authors).reshape(-1, 1))
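
For intuition, a "bag of authors" is just one indicator column per author. The pure-Python sketch below mimics that encoding on made-up names, including the case that `handle_unknown="ignore"` covers: an author never seen during fitting becomes an all-zero row. The helper functions are hypothetical and not part of scikit-learn.

```python
def fit_one_hot(categories):
    """Map each distinct category to a column index (sorted for determinism)."""
    return {cat: idx for idx, cat in enumerate(sorted(set(categories)))}


def one_hot(values, index):
    """Encode values as 0/1 rows; unseen categories become all-zero rows."""
    rows = []
    for value in values:
        row = [0] * len(index)
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows


train_authors_demo = ["alice", "bob", "alice"]
test_authors_demo = ["bob", "carol"]  # "carol" never appears in the fitted categories

author_index = fit_one_hot(train_authors_demo)
encoded_test = one_hot(test_authors_demo, author_index)
print(author_index)
print(encoded_test)
```

With tens of thousands of distinct authors, almost every entry is zero, which is why a sparse matrix is the natural container for this feature block.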

## Preparing additional features

In [None]:
train_len = [len(article) for article in train_contents]
test_len = [len(article) for article in test_contents]
scaler = StandardScaler()

X_train_len_sparse = scaler.fit_transform(np.array(train_len).reshape(-1, 1))
# Reuse the statistics fitted on the training set so both sets share the same scale.
X_test_len_sparse = scaler.transform(np.array(test_len).reshape(-1, 1))
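
One detail worth keeping in mind with standardization: the mean and standard deviation are usually estimated on the training set and then reused for the test set, so that both end up on the same scale. Here is a minimal sketch with made-up article lengths; the helper names are assumptions, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Return (mean, population std) estimated from the given values."""
    return mean(values), pstdev(values)


def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


train_lengths = [1000, 2000, 3000]  # made-up article lengths
test_lengths = [2000, 4000]

mu, sigma = fit_standardizer(train_lengths)  # statistics come from the training set only
train_scaled = standardize(train_lengths, mu, sigma)
test_scaled = standardize(test_lengths, mu, sigma)  # reuse the training statistics
print(train_scaled)
print(test_scaled)
```

If the test set were standardized with its own statistics instead, an article of 2000 characters would map to different values in train and test, and the model's learned coefficient for this feature would no longer mean the same thing.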

## Joining features

**Join all sparse matrices.**

In [9]:
X_train_sparse = hstack(
    [X_train_article, X_train_title, X_train_author_sparse, X_train_time_features_sparse, X_train_len_sparse]
).tocsr()

X_test_sparse = hstack(
    [X_test_article, X_test_title, X_test_author_sparse, X_test_time_features_sparse, X_test_len_sparse]
).tocsr()

# Read train target and split data for validation

In [11]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, "train_log1p_recommends.csv"), index_col="id")
y_train = train_target["log_recommends"].values

In [12]:
train_part_size = int(0.7 * train_target.shape[0])

X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]

X_valid_sparse = X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

# Training

## Ridge

**Train a simple Ridge model and check MAE on the validation set.**

In [13]:
# alpha could be tuned further, e.g. with GridSearchCV over np.logspace(-2, 2, 20)
ridge = Ridge(random_state=SEED, alpha=0.01)
# Fit on the first 70% of the training data and measure MAE on the held-out 30%.
ridge.fit(X_train_part_sparse, y_train_part)
ridge_valid_pred = ridge.predict(X_valid_sparse)
valid_mae = np.mean(np.abs(y_valid - ridge_valid_pred))
print(f"Validation MAE: {valid_mae:.4f}")

## XGBoost

In [None]:
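# Fit on log1p of the target and invert the predictions with expm1.
xgb_train = xgb.DMatrix(X_train_sparse.astype(np.float32), label=np.log1p(y_train))
# "reg:absoluteerror" needs XGBoost >= 1.7; leaf-wise growth lets max_leaves take effect.
param = {"objective": "reg:absoluteerror", "eval_metric": "mae", "tree_method": "hist",
         "grow_policy": "lossguide", "max_leaves": XGB_NUM_LEAVES, "max_depth": 0, "seed": SEED}
bst_xgb = xgb.train(param, xgb_train, num_boost_round=XGB_TRAIN_ROUNDS)
xgb_test_pred = np.expm1(bst_xgb.predict(xgb.DMatrix(X_test_sparse.astype(np.float32))))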

## Training with all data

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [14]:
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)

In [15]:
ridge_test_pred = np.empty([34645, 1])  # change this

In [16]:
def write_submission_file(
    prediction,
    filename,
    path_to_sample=os.path.join(PATH_TO_DATA, "sample_submission.csv"),
):
    submission = pd.read_csv(path_to_sample, index_col="id")

    submission["log_recommends"] = prediction
    submission.to_csv(filename)

In [17]:
write_submission_file(ridge_test_pred, os.path.join(PATH_TO_DATA, "assignment6_medium_submission.csv"))

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeros. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [18]:
write_submission_file(
    np.zeros_like(ridge_test_pred),
    os.path.join(PATH_TO_DATA, "medium_all_zeros_submission.csv"),
)

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [19]:
ridge_test_pred_modif = ridge_test_pred
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)

In [20]:
write_submission_file(
    ridge_test_pred_modif,
    os.path.join(PATH_TO_DATA, "assignment6_medium_submission_with_hack.csv"),
)
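
The reasoning behind the hack can be checked on toy numbers. Because the target (the log1p of the number of recommendations) is non-negative, the MAE of an all-zeros submission equals the mean of the true test target. One possible way to use that number, shown here as an assumption rather than as the intended solution, is to shift the predictions so that their mean matches it. All arrays below are made up.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_test_hidden = [0.0, 1.5, 3.0, 5.5]  # pretend leaderboard targets (all >= 0)
predictions = [0.5, 1.0, 2.0, 4.5]  # pretend model output, biased low on average

# Submitting zeros: MAE(y, 0) = mean(|y|) = mean(y) for a non-negative target
leaked_mean = mean_absolute_error(y_test_hidden, [0.0] * len(y_test_hidden))

# Shift the predictions so that their mean matches the leaked target mean
shift = leaked_mean - sum(predictions) / len(predictions)
shifted = [p + shift for p in predictions]

mae_before = mean_absolute_error(y_test_hidden, predictions)
mae_after = mean_absolute_error(y_test_hidden, shifted)
print(leaked_mean, mae_before, mae_after)
```

In this toy case the shift lowers the MAE, but that is not guaranteed in general: a constant shift corrects the average bias and nothing else.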

That's it for the assignment. In case you'd like to try some more ideas for improvement:

- Engineer good features; this is the key to success. Some simple features can be based on publication time, authors, content length and so on
- You may choose not to discard the HTML and instead extract some features from it
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In our example, we kept 100k TF-IDF features per vectorizer and used `alpha=0.01` as the Ridge regularization parameter; both can be changed
- SGD and Vowpal Wabbit will train much faster
- Play around with blending and/or stacking. An intro is given in [this Kernel](https://www.kaggle.com/kashnitsky/ridge-and-lightgbm-simple-blending) by @yorko 
- And neural nets, of course. We don't cover them in this course, but transformer-based architectures will likely perform well in this type of task
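
As a starting point for the blending idea, the simplest blend is a weighted average of two models' predictions. The sketch below reuses the weight values of the `RIDGE_WEIGHT` and `XGB_WEIGHT` constants defined earlier; the `blend` helper and the prediction lists are made up for illustration.

```python
RIDGE_WEIGHT = 0.6  # same values as the constants defined earlier in the notebook
XGB_WEIGHT = 0.4


def blend(ridge_pred, xgb_pred, ridge_weight=RIDGE_WEIGHT, xgb_weight=XGB_WEIGHT):
    """Weighted average of two prediction lists of equal length."""
    if len(ridge_pred) != len(xgb_pred):
        raise ValueError("prediction lists must have the same length")
    total = ridge_weight + xgb_weight  # normalize in case the weights do not sum to 1
    return [(ridge_weight * r + xgb_weight * x) / total for r, x in zip(ridge_pred, xgb_pred)]


ridge_demo_pred = [2.0, 4.0, 1.0]  # made-up predictions
xgb_demo_pred = [3.0, 2.0, 1.0]
blended = blend(ridge_demo_pred, xgb_demo_pred)
print(blended)
```

The weights themselves are hyperparameters: they are best chosen on the validation split rather than on the leaderboard.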