Here we fit a simple linear model on article content vectorized with `CountVectorizer`.

Import libraries.

In [1]:
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import json
from tqdm import tqdm_notebook
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_absolute_error

The following code strips all HTML tags from an article's content.

In [2]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # HTMLParser.__init__ calls reset() and sets convert_charrefs
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        # Collect text nodes only; the tags themselves are dropped
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
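
As a quick illustration of what tag stripping does, here is a minimal sketch with a condensed parser of the same kind. The class name `TagStripper`, the helper `strip_tags_demo` and the sample HTML strings are made up for this example. Note that text from adjacent elements is joined without a separator, which is why run-together words such as `MediumEveryone’s` can appear in the extracted text.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical condensed stripper: keeps text nodes, drops tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_demo(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return ''.join(parser.chunks)


# Made-up snippets: tags vanish and character references are decoded.
inline_sample = '<p>Hello, <b>Medium</b> &amp; readers!</p>'
block_sample = '<h1>Title</h1><p>Body text</p>'

inline_text = strip_tags_demo(inline_sample)
block_text = strip_tags_demo(block_sample)

print(inline_text)
print(block_text)  # heading and paragraph are glued together
```

If run-together words turn out to hurt tokenization, one option is to replace tags with a space instead of an empty string; whether that helps here is something to check on the validation set.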

In [6]:
PATH_TO_DATA = '../../../data/2019-03_ass02-2/'
os.path.isdir(PATH_TO_DATA)

True

We assume that you have downloaded all the data from the competition's [page](https://www.kaggle.com/c/how-good-is-your-medium-article/data) into the `PATH_TO_DATA` folder and that the `.gz` files are decompressed.

In [5]:
!ls -l $PATH_TO_DATA

total 3837900
-rwxrwx---+ 1 User9 Отсутствует     884217 Mar  9 21:09 sample_submission.csv
-rwxrwx---+ 1 User9 Отсутствует 1156020029 Sep 20 09:56 test.json
-rwxrwx---+ 1 User9 Отсутствует  240600924 Mar  9 21:11 test.json.zip
-rwxrwx---+ 1 User9 Отсутствует          0 Mar  9 21:17 test_author.txt
-rwxrwx---+ 1 User9 Отсутствует          0 Mar  9 21:17 test_content.txt
-rwxrwx---+ 1 User9 Отсутствует          0 Mar  9 21:17 test_published.txt
-rwxrwx---+ 1 User9 Отсутствует          0 Mar  9 21:17 test_title.txt
-rwxrwx---+ 1 User9 Отсутствует 2086185062 Sep 20 09:56 train.json
-rwxrwx---+ 1 User9 Отсутствует  445395631 Mar  9 21:13 train.json.zip
-rwxrwx---+ 1 User9 Отсутствует          0 Mar  9 21:25 train_author.txt
-rwxrwx---+ 1 User9 Отсутствует          0 Mar  9 21:25 train_content.txt
-rwxrwx---+ 1 User9 Отсутствует     912544 Mar  9 21:08 train_lo

Supplementary function to read a JSON line without crashing on escape characters. 

In [7]:
def read_json_line(line=None):
    result = None
    try:
        result = json.loads(line)
    except json.JSONDecodeError as e:
        # Index of the offending character, as reported by the decoder:
        idx_to_replace = e.pos
        # Replace the offending character with a space and retry:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)
        return read_json_line(line=new_line)
    return result
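
The recovery step relies on the decoder reporting where parsing failed. Below is a minimal sketch of that mechanism, assuming CPython's standard `json` module; the sample line is made up. The index at the end of the error message is the same number that `json.JSONDecodeError` exposes as its `pos` attribute.

```python
import json

# Made-up line with a raw tab inside a string, which strict JSON forbids.
bad_line = '{"content": "a\tb"}'

try:
    json.loads(bad_line)
except json.JSONDecodeError as e:
    # Index parsed from the message text, e.g. "... (char 14)"
    idx_from_message = int(str(e).split(' ')[-1].replace(')', ''))
    # The same index, read from the exception attribute
    idx_from_attr = e.pos
    print(str(e))

# Replace the offending character with a space and parse again.
fixed_line = bad_line[:idx_from_attr] + ' ' + bad_line[idx_from_attr + 1:]
parsed = json.loads(fixed_line)
print(parsed)
```

Each retry fixes one character, so a line with many bad characters triggers many recursive calls.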

This function reads a file with one JSON object per line and returns a list that keeps only the article content, with HTML tags removed. When you move on to feature engineering and extract other features from the articles, modifying this function is a good place to start.

In [8]:
def preprocess(path_to_inp_json_file):
    output_list = []
    with open(path_to_inp_json_file, encoding='utf-8') as inp_file:
        for line in tqdm_notebook(inp_file):
            json_data = read_json_line(line)
            content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
            content_no_html_tags = strip_tags(content)
            output_list.append(content_no_html_tags)
    return output_list

In [9]:
%%time
train_raw_content = preprocess(path_to_inp_json_file=os.path.join(PATH_TO_DATA, 
                                                                  'train.json'),)


Wall time: 9min 12s


In [10]:
train_raw_content[:3]

['MediumEveryone’s stories and ideasAug 13, 2012Medium Terms of\xa0ServiceEffective: March 7, 2016These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”).By using Medium, you agree to these Terms. If you don’t agree to any of the Terms, you can’t use Medium.We can change these Terms at any time. We keep a historical record of all changes to our Terms on GitHub. If a change is material, we’ll let you know before they take effect. By using Medium on or after that effective date, you agree to the new Terms. If you don’t agree to them, you should delete your account before they take effect, otherwise your use of the site and content will be subject to the new Terms.Content rights & responsibilitiesYou own the rights to the content you create and post on Medium.By posting content to Medium, you give us a nonexclusive license to publish it on Medium Services, includ

In [11]:
%%time
test_raw_content = preprocess(path_to_inp_json_file=os.path.join(PATH_TO_DATA, 
                                                                  'test.json'),)


Wall time: 5min 18s


We'll use a linear model (`Ridge`) with a very simple feature extractor, `CountVectorizer`, which means that we resort to the Bag-of-Words approach. For now, we keep only the 50k most frequent terms as features.

In [12]:
cv = CountVectorizer(max_features=50000)
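
To make the effect of `max_features` concrete, here is a minimal sketch on a made-up three-document corpus (the `toy_*` names are hypothetical and not part of the notebook). With `max_features=3`, only the three terms with the highest total count across the corpus are kept, and words outside that vocabulary are ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, only for illustration.
toy_docs = [
    "the cat sat on the mat",
    "the dog sat",
    "the cat ran",
]

# Keep only the 3 terms with the highest total count in the corpus.
toy_cv = CountVectorizer(max_features=3)
toy_X = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index (alphabetical order).
kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)
print(toy_X.shape)
print(toy_X.toarray())

# Words that are not in the fitted vocabulary ("bird") are dropped.
unseen = toy_cv.transform(["the bird sat"]).toarray()
print(unseen)
```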

In [13]:
%%time
X_train = cv.fit_transform(train_raw_content)

MemoryError: 

In [14]:
%%time
X_test = cv.transform(test_raw_content)

NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

In [15]:
X_train.shape, X_test.shape

NameError: name 'X_train' is not defined

Read targets from file.

In [None]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')

In [None]:
train_target.shape

In [None]:
y_train = train_target['log_recommends'].values

Make a 30%-holdout set. 

In [None]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part = X_train[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid =  X_train[train_part_size:, :]
y_valid = y_train[train_part_size:]

Now we are ready to fit a linear model.

In [None]:
from sklearn.linear_model import Ridge

In [None]:
ridge = Ridge(random_state=17)

In [None]:
%%time
ridge.fit(X_train_part, y_train_part);

In [None]:
ridge_pred = ridge.predict(X_valid)

Let's plot predictions and targets for the holdout set. Recall that these are #recommendations (= #claps) of Medium articles with the `np.log1p` transformation.

In [None]:
plt.hist(y_valid, bins=30, alpha=.5, color='red', label='true', range=(0,10));
plt.hist(ridge_pred, bins=30, alpha=.5, color='green', label='pred', range=(0,10));
plt.legend();

As we can see, the prediction is far from perfect: we get MAE $\approx$ 1.3, which corresponds to an error of $\approx$ 2.7 in #recommendations.

In [None]:
valid_mae = mean_absolute_error(y_valid, ridge_pred)
valid_mae, np.expm1(valid_mae)
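
The conversion from the log scale back to counts can be checked by hand. This is a minimal sketch using the standard `math` module (its `log1p` and `expm1` mirror the NumPy functions used here); the value 1.3 is the approximate MAE quoted above, and the example counts and the helper `count_gap` are made up. Since the target is `log(1 + count)`, a fixed error in log space is multiplicative in `1 + count`, so the figure of about 2.7 is exact only for an article with zero recommendations and should be read as a rough indication.

```python
import math

log_mae = 1.3  # approximate holdout MAE quoted above, in log1p units

# log1p(x) = log(1 + x) and expm1(y) = exp(y) - 1 are inverses of each other.
back_to_counts = math.expm1(log_mae)
print(round(back_to_counts, 2))


def count_gap(true_count, log_error):
    """Gap in raw counts when the log1p-scale prediction is off by log_error."""
    predicted = math.expm1(math.log1p(true_count) + log_error)
    return predicted - true_count


# Made-up true counts: the same log error means a larger gap for popular posts.
for true_count in (0, 10, 100):
    print(true_count, round(count_gap(true_count, log_mae), 1))
```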

Finally, train the model on the full accessible training set, make predictions for the test set and form a submission file. 

In [None]:
%%time
ridge.fit(X_train, y_train);

In [None]:
%%time
ridge_test_pred = ridge.predict(X_test)

In [None]:
def write_submission_file(prediction, filename,
    path_to_sample=os.path.join(PATH_TO_DATA, 'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [None]:
write_submission_file(prediction=ridge_test_pred, 
                      filename='first_ridge.csv')

With this, you'll get 1.91185 on the [public leaderboard](https://www.kaggle.com/c/how-good-is-your-medium-article/leaderboard). This is much higher than our validation MAE, which indicates that the target distribution in the test set differs somewhat from that of the training set (recent Medium articles are more popular). This shouldn't confuse us as long as we see a correlation between local improvements and improvements on the leaderboard.

Some ideas for improvement:
- Engineer good features; this is the key to success. Some simple features can be based on publication time, authors, content length, and so on
- You may choose not to discard the HTML and extract some features from it instead
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In our example, we've kept only 50k features and used the default `alpha`=1 as the regularization parameter of `Ridge`; both can be changed
- SGD and Vowpal Wabbit will learn much faster
- Our course doesn't cover neural nets, and you are not obliged to use GRUs or LSTMs in this competition.
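
Two of the ideas above, TF-IDF with n-grams and tuning the regularization strength, can be sketched as follows. This is only an illustration under stated assumptions: the miniature corpus and targets are made up, the variable names are hypothetical, and the `alpha` grid is arbitrary. On the real data you would reuse the holdout split from earlier and compare the MAE values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Made-up miniature corpus and log-scale targets, only to make this runnable.
texts = [
    "machine learning for beginners",
    "deep learning with python",
    "my morning routine",
    "how to learn python fast",
    "a short story about coffee",
    "machine learning in production",
    "coffee and morning habits",
    "python tips for data science",
]
targets = [3.2, 3.5, 1.0, 2.9, 0.7, 3.4, 0.9, 3.0]

# Ordered split, as in the notebook: first part for training, rest for holdout.
train_texts, valid_texts = texts[:6], texts[6:]
y_tr, y_va = targets[:6], targets[6:]

# Unigrams and bigrams with TF-IDF weighting instead of raw counts.
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000)
X_tr = tfidf.fit_transform(train_texts)
X_va = tfidf.transform(valid_texts)

# Try a few regularization strengths and record the holdout MAE for each.
maes = {}
for alpha in (0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha, random_state=17)
    model.fit(X_tr, y_tr)
    maes[alpha] = mean_absolute_error(y_va, model.predict(X_va))
    print(alpha, round(maes[alpha], 3))
```

With only a handful of documents the numbers themselves mean nothing; the point is the shape of the loop, which carries over to the full feature matrix.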