<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Edited by Sergey Kolchenko (@KolchenkoSergey). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center>Assignment #6
### <center> Beating baselines in "How good is your Medium article?"
    
<img src='../../img/medium_claps.jpg' width=40% />


[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "A6 baseline" (~1.45 Public LB score). Do not forget about our shared ["primitive" baseline](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline) - you'll find something valuable there.

**Your task:**
 1. "Freeride". Come up with good features to beat the "A6 baseline" (for now, only the public LB is considered)
 2. You need to name your [team](https://www.kaggle.com/c/how-good-is-your-medium-article/team) (out of 1 person) in full accordance with the [course rating](https://drive.google.com/open?id=19AGEhUQUol6_kNLKSzBsjcGUU3qWy3BNUg8x8IFkO3Q). You can think of it as a part of the assignment. 16 credits for beating the mentioned baseline and correct team naming.

In [1]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack,coo_matrix
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import TimeSeriesSplit

from bs4 import BeautifulSoup

The following code strips all HTML tags from an article's content.

In [2]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Let the base class set up the parser state; decode entities like &amp;
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
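
As a quick sanity check, the sketch below runs a stripper of the same shape on a made-up HTML snippet. The names `TagStripper` and `strip_tags_demo` and the sample string are illustrative only; they are not part of the assignment code.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)


def strip_tags_demo(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()


sample = '<p>Hello, <b>Medium</b> &amp; readers!</p>'
cleaned = strip_tags_demo(sample)
print(cleaned)
```

Only the text between tags survives, and the `&amp;` entity comes back as a plain `&`.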

Supplementary function to read a JSON line without crashing on escape characters.

In [3]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except Exception as e:      
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))      
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result
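
The same repair idea can be sketched without parsing the error message, by reading the position from `json.JSONDecodeError.pos`. The helper name `load_json_lenient` and the `max_repairs` cap are hypothetical additions, and the sketch assumes the reported position points at the offending character, as it does for a raw control character with CPython's default decoder.

```python
import json


def load_json_lenient(line, max_repairs=10):
    """Parse one JSON line, blanking out offending characters one at a time."""
    for _ in range(max_repairs + 1):
        try:
            return json.loads(line)
        except json.JSONDecodeError as err:
            # err.pos is the index at which the decoder reported the problem
            line = line[:err.pos] + ' ' + line[err.pos + 1:]
    raise ValueError('could not repair the line')


# The literal below contains a raw tab character, which strict JSON rejects
broken = '{"title": "tab\there"}'
parsed = load_json_lenient(broken)
print(parsed)
```

Capping the number of repairs keeps a badly damaged line from looping forever, which the recursive version does not guard against.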

Extract features `content`, `published`, `title` and `author`, write them to separate files for train and test sets.

In [5]:
# def extract_features_and_write(path_to_data,
#                                inp_filename, is_train=True):
    
#     features = ['content', 'published', 'title', 'author', 'domain', 'tags']
#     prefix = 'train' if is_train else 'test'
#     feature_files = [open(os.path.join(path_to_data,
#                                        '{}_{}.txt'.format(prefix, feat)),
#                           'w', encoding='utf-8')
#                      for feat in features]
    
#     with open(os.path.join(path_to_data, inp_filename), 
#               encoding='utf-8') as inp_json_file:
        

#         for line in tqdm_notebook(inp_json_file):
#             json_data = read_json_line(line)
            
#             print(json_data)
            
#             content = json_data['content'].replace('\n',' ').replace('\r',' ')
#             feature_files[0].write(strip_tags(content)+'\n')
            
#             feature_files[1].write(json_data['published']['$date']+'\n')
            
#             feature_files[2].write(strip_tags(json_data['title']).split('\u2013')[0].strip().replace('\n',' ').replace('\r',' ')+'\n')
            
#             feature_files[3].write(json_data['meta_tags']['author'].strip()+'\n')
            
#             feature_files[4].write(json_data['domain']+'\n')
            
#             tags_str = []
#             soup = BeautifulSoup(content, 'lxml')
#             try:
#                 tag_block = soup.find('ul', class_='tags')
#                 tags = tag_block.find_all('a')
#                 for tag in tags:
#                     tags_str.append(tag.text.translate({ord(' '):None, ord('-'):None}))
#                 tags = ' '.join(tags_str)
#             except Exception:
#                 tags = 'None'
            
#             feature_files[5].write(tags+'\n')

#         for feature_file in feature_files:
#             feature_file.close()
            


In [8]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):  
            
    content_list = []
    published_list =[]
    title_list =[]
    author_list =[]
    domain_list =[]
    tags_list = []
    

    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file:
        
        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            
            
#             content = json_data['content'].replace('\n',' ').replace('\r',' ')
#             content_list.append(strip_tags(content))
            
            published_list.append(json_data['published']['$date'])
            
#             title_list.append(strip_tags(json_data['title']).split('\u2013')[0].strip().replace('\n',' ').replace('\r',' '))
            title_list.append(strip_tags(json_data['title']).replace('\n',' ').replace('\r',' '))
            
#             author_list.append(json_data['meta_tags']['author'].strip())
            
#             domain_list.append(json_data['domain'])
            
#             tags_str = []
#             soup = BeautifulSoup(content, 'lxml')
#             try:
#                 tag_block = soup.find('ul', class_='tags')
#                 tags = tag_block.find_all('a')
#                 for tag in tags:
#                     tags_str.append(tag.text.translate({ord(' '):None, ord('-'):None}))
#                 tags = ' '.join(tags_str)
#             except Exception:
#                 tags = 'None'
            
#             tags_list.append(tags)
            
        df = pd.DataFrame()
#         df['content'] = content_list
        df['published'] = pd.to_datetime(published_list, format='%Y-%m-%dT%H:%M:%S.%fZ')
        df['title'] = title_list
#         df['author'] = author_list
#         df['domain'] = domain_list
#         df['tags'] = tags_list
        
        df.sort_values(by ='published',inplace=True)

#         features = ['content', 'published', 'title', 'author', 'domain', 'tags']
        features = ['title']

    
        prefix = 'train' if is_train else 'test'
        
        for feat in features:
            df[feat].to_csv(os.path.join(path_to_data,'{}_{}.txt'.format(prefix, feat)),sep=' ',index=None,header =None)



In [9]:
PATH_TO_DATA = '../data/Medium' # modify this if you need to

In [11]:
# extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)

In [11]:
extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)





**Add the following groups of features:**

- Tf-Idf with article content (ngram_range=(1, 2), max_features=100000, but you can try adding more)
- Tf-Idf with article titles (ngram_range=(1, 2), max_features=100000, but you can try adding more)
- Time features: publication hour, whether it's morning, day, night, whether it's a weekend
- Bag of authors (i.e. One-Hot-Encoded author names)

In [142]:
tfidf_content = TfidfVectorizer(ngram_range=(1,2),max_features=100000)

with open(os.path.join(PATH_TO_DATA,'train_content.txt'),encoding="utf8") as train_content:
    X_train_content_sparse = tfidf_content.fit_transform(train_content)
    
with open(os.path.join(PATH_TO_DATA,'test_content.txt'),encoding="utf8") as test_content:
    X_test_content_sparse = tfidf_content.transform(test_content)

In [143]:
X_train_content_sparse.shape,X_test_content_sparse.shape

((62313, 50000), (34645, 50000))

In [144]:
tfidf_title = TfidfVectorizer(ngram_range=(1,2),max_features=100000)

with open(os.path.join(PATH_TO_DATA,'train_title.txt'),encoding="utf8") as train_title:
    X_train_title_sparse = tfidf_title.fit_transform(train_title)
    
with open(os.path.join(PATH_TO_DATA,'test_title.txt'),encoding="utf8") as test_title:
    X_test_title_sparse = tfidf_title.transform(test_title)

In [145]:
X_train_title_sparse.shape,X_test_title_sparse.shape

((62313, 50000), (34645, 50000))

In [146]:
from scipy import sparse

# sparse.save_npz("./data/medium/X_test_content_sparse.npz", X_test_content_sparse)
X_test_content_sparse = sparse.load_npz("./data/medium/X_test_content_sparse.npz")

# sparse.save_npz("./data/medium/X_train_content_sparse.npz", X_train_content_sparse)
X_train_content_sparse = sparse.load_npz("./data/medium/X_train_content_sparse.npz")

In [147]:
train_author_ = pd.read_csv(os.path.join(PATH_TO_DATA,'train_author.txt'),header=None,usecols=[0],)
test_author_ = pd.read_csv(os.path.join(PATH_TO_DATA,'test_author.txt'),header=None,usecols=[0])

In [212]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [215]:
authors = pd.concat([train_author_,test_author_],axis=0,sort=False)

In [None]:
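LE = LabelEncoder()
# LabelEncoder expects 1-D input, so flatten the single-column frames
LE.fit(authors.values.ravel())
X_train_author_LE = LE.transform(train_author_.values.ravel())
X_test_author_LE  = LE.transform(test_author_.values.ravel())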

In [223]:
OHE = OneHotEncoder(handle_unknown='ignore')
OHE.fit(X_train_author_LE.reshape(-1,1))

OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='ignore', n_values='auto', sparse=True)

In [224]:
X_train_author_sparse = OHE.transform(X_train_author_LE.reshape(-1,1))
X_test_author_sparse  = OHE.transform(X_test_author_LE.reshape(-1,1))

In [226]:
X_train_author_sparse.shape, X_test_author_sparse.shape

((62313, 31331), (34645, 31331))
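
To make the "bag of authors" idea concrete, here is a dependency-free sketch on made-up names. It uses dense lists for readability (real data needs a sparse matrix), and the helpers `fit_author_index` and `one_hot_rows` are hypothetical. Columns are learned from the training authors only, so an author who appears only in the test set gets an all-zero row, which is the idea behind `handle_unknown='ignore'`.

```python
def fit_author_index(train_authors):
    """Map each author seen in training to a column index."""
    return {name: col for col, name in enumerate(sorted(set(train_authors)))}


def one_hot_rows(authors, author_index):
    """Return one 0/1 row per article; unseen authors give an all-zero row."""
    rows = []
    for name in authors:
        row = [0] * len(author_index)
        if name in author_index:
            row[author_index[name]] = 1
        rows.append(row)
    return rows


train_authors = ['Ann', 'Bob', 'Ann', 'Cid']  # made-up names
test_authors = ['Bob', 'Dee']                 # 'Dee' never appears in train

author_index = fit_author_index(train_authors)
train_rows = one_hot_rows(train_authors, author_index)
test_rows = one_hot_rows(test_authors, author_index)
print(train_rows)
print(test_rows)
```

Train and test rows have the same width, which is what makes it possible to stack them next to the other feature blocks.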

In [148]:
# authors = pd.concat([train_author_,test_author_],axis=0,sort=False)
# authors = pd.get_dummies(authors)

In [149]:
# X_train_author_sparse = authors.iloc[:train_author_.shape[0],]
# X_test_author_sparse = authors.iloc[train_author_.shape[0]:,]

In [150]:
# X_train_author_sparse.shape, X_test_author_sparse.shape

((62313, 43885), (34645, 43885))

In [152]:
X_train_time = pd.read_csv(os.path.join(PATH_TO_DATA,'train_published.txt'),header=None,names=['date'],
                                            parse_dates =['date'])
X_test_time = pd.read_csv(os.path.join(PATH_TO_DATA,'test_published.txt'),header=None,names=['date'],
                                         parse_dates =['date'])

In [153]:
X_train_time.shape,X_test_time.shape

((62313, 1), (34645, 1))

In [229]:
def add_time_features(df):
    new_df = pd.DataFrame(index=df.index)
    hour = df['date'].apply(lambda ts: ts.hour)
    new_df['morning'] = ((hour >= 7) & (hour <= 11)).astype('int')
    new_df["day"] = ((hour >= 12) & (hour <= 18)).astype('int')
    new_df["evening"] = ((hour >= 19) & (hour <= 23)).astype('int')
    new_df["night"] = ((hour >= 0) & (hour <= 6)).astype('int')
#     weekday runs from 0 (Monday) to 6 (Sunday), so the weekend is [5, 6]
#     new_df['is_weekend'] = df['date'].dt.weekday.isin([5,6]).astype('int')
    new_df['year'] = df['date'].dt.year
    new_df['month'] = df['date'].dt.month
    new_df['weekday'] = df['date'].dt.weekday
#     new_df['hour'] = df['date'].dt.hour    
    return new_df
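
The part-of-day buckets and the weekend flag from the task list can be checked on single timestamps. This is a minimal sketch using only the standard library; the function name `time_flags` and the two dates are illustrative. Python's `datetime.weekday()` follows the same Monday=0 convention as pandas.

```python
from datetime import datetime


def time_flags(ts):
    """Return part-of-day and weekend indicators for one timestamp."""
    hour = ts.hour
    return {
        'morning': int(7 <= hour <= 11),
        'day': int(12 <= hour <= 18),
        'evening': int(19 <= hour <= 23),
        'night': int(0 <= hour <= 6),
        # weekday() counts Monday as 0, so Saturday and Sunday are 5 and 6
        'is_weekend': int(ts.weekday() >= 5),
    }


saturday_evening = time_flags(datetime(2017, 7, 1, 20, 30))
monday_morning = time_flags(datetime(2017, 7, 3, 8, 0))
print(saturday_evening)
print(monday_morning)
```

Exactly one of the four part-of-day flags is set for any hour, since the ranges cover 0 to 23 without overlap.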

In [230]:
X_train_time_features_sparse = add_time_features(X_train_time)
X_test_time_features_sparse = add_time_features(X_test_time)

In [231]:
X_train_time_features_sparse.shape

(62313, 7)

In [232]:
X_train_time_features_sparse.head()

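| | morning | day | evening | night | year | month | weekday |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1970 | 1 | 3 |
| 1 | 0 | 0 | 0 | 1 | 1970 | 1 | 3 |
| 2 | 0 | 0 | 0 | 1 | 1970 | 1 | 6 |
| 3 | 0 | 0 | 1 | 0 | 1987 | 12 | 1 |
| 4 | 0 | 1 | 0 | 0 | 2003 | 12 | 0 |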


In [233]:
with open(os.path.join(PATH_TO_DATA,'train_content.txt'),encoding="utf8") as train_content:
    length =[] 
    for line in train_content:
        length.append(len(line))
X_train_time_features_sparse['length'] = length

In [234]:
with open(os.path.join(PATH_TO_DATA,'test_content.txt'),encoding="utf8") as test_content:
    length =[] 
    for line in test_content:
        length.append(len(line))
X_test_time_features_sparse['length'] = length

In [235]:
with open(os.path.join(PATH_TO_DATA,'train_tags.txt'),encoding="utf8") as train_tags:
    length =[] 
    for line in train_tags:
        length.append(len(line))
X_train_time_features_sparse['tag_length'] = length

In [236]:
with open(os.path.join(PATH_TO_DATA,'test_tags.txt'),encoding="utf8") as test_tags:
    length =[] 
    for line in test_tags:
        length.append(len(line))
X_test_time_features_sparse['tag_length'] = length

In [237]:
X_train_time_features_sparse.head()

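| | morning | day | evening | night | year | month | weekday | length | tag_length |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1970 | 1 | 3 | 5476 | 52 |
| 1 | 0 | 0 | 0 | 1 | 1970 | 1 | 3 | 5328 | 3 |
| 2 | 0 | 0 | 0 | 1 | 1970 | 1 | 6 | 2490 | 14 |
| 3 | 0 | 0 | 1 | 0 | 1987 | 12 | 1 | 11288 | 48 |
| 4 | 0 | 1 | 0 | 0 | 2003 | 12 | 0 | 12544 | 18 |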


In [238]:
X_test_time_features_sparse.head()

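| | morning | day | evening | night | year | month | weekday | length | tag_length |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 2017 | 7 | 5 | 7615 | 62 |
| 1 | 0 | 0 | 0 | 1 | 2017 | 7 | 5 | 6822 | 82 |
| 2 | 0 | 0 | 0 | 1 | 2017 | 7 | 5 | 14447 | 52 |
| 3 | 0 | 0 | 0 | 1 | 2017 | 7 | 5 | 13626 | 44 |
| 4 | 0 | 0 | 0 | 1 | 2017 | 7 | 5 | 15074 | 46 |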


In [239]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
minmax = MinMaxScaler()
minmax.fit(X_train_time_features_sparse)
X_train_time_features_sparse_norm =minmax.transform(X_train_time_features_sparse)
X_test_time_features_sparse_norm =minmax.transform(X_test_time_features_sparse)
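
Fitting the scaler on the training set only means test values can land outside [0, 1]. The sketch below reproduces the min-max arithmetic by hand on a few `length` values copied from the displayed rows above; the helpers `fit_min_max` and `apply_min_max` are hypothetical, and the minimum and maximum come from these few rows rather than the full training set, so the numbers are purely illustrative.

```python
def fit_min_max(column):
    """Remember the minimum and maximum of the training column."""
    return min(column), max(column)


def apply_min_max(column, lo, hi):
    """Scale values with the training minimum and maximum (no clipping)."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


train_length = [5476, 5328, 2490, 11288, 12544]  # first training rows shown above
test_length = [7615, 6822, 14447]                # first test rows shown above

lo, hi = fit_min_max(train_length)
train_scaled = apply_min_max(train_length, lo, hi)
test_scaled = apply_min_max(test_length, lo, hi)
print(train_scaled)
print(test_scaled)
```

The third test value is larger than the training maximum, so its scaled value exceeds 1. For a linear model this is usually harmless, but it is worth knowing when inspecting the features.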

**Join all sparse matrices.**

In [240]:
X_train_sparse = hstack([X_train_content_sparse, X_train_title_sparse,
                         X_train_author_sparse, 
                         X_train_time_features_sparse_norm]).tocsr()

In [241]:
X_train_sparse.shape

(62313, 131340)

In [242]:
X_test_sparse = hstack([X_test_content_sparse, X_test_title_sparse,
                        X_test_author_sparse, 
                        X_test_time_features_sparse_norm]).tocsr()

In [168]:
from scipy import sparse

# sparse.save_npz("./data/medium/X_train_sparse.npz", X_train_sparse)
X_train_sparse = sparse.load_npz("./data/medium/X_train_sparse.npz")

# sparse.save_npz("./data/medium/X_test_sparse.npz", X_test_sparse)
X_test_sparse = sparse.load_npz("./data/medium/X_test_sparse.npz")

**Read train target and split data for validation.**

In [243]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')
y_train = train_target['log_recommends'].values
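
Assuming, as the file name `train_log1p_recommends.csv` suggests, that the target is `log1p` of the raw recommendation count, the transform and its inverse look like this. The counts are made up.

```python
import math

recommends = [0, 1, 10, 1000]  # made-up raw counts
log_recommends = [math.log1p(r) for r in recommends]  # log(1 + r)
restored = [math.expm1(v) for v in log_recommends]    # exp(v) - 1

print(log_recommends)
print(restored)
```

Under that assumption the target is never negative, and an article with zero recommendations has a target of exactly 0.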

In [244]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [245]:
%%time
lridge = Ridge(random_state=17)
lridge.fit(X_train_part_sparse,y_train_part)

Wall time: 21.1 s


In [246]:
mean_absolute_error(y_valid,lridge.predict(X_valid_sparse))

1.622917412171956

In [174]:
mean_absolute_error(y_valid,lridge.predict(X_valid_sparse))

1.622813180558879

In [103]:
mean_absolute_error(y_valid,lridge.predict(X_valid_sparse))

1.0903316131971577

In [221]:
mean_absolute_error(y_valid,lridge.predict(X_valid_sparse))

1.0736315005313009

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [241]:
X_test_sparse.shape

(34645, 244250)

In [222]:
%%time
lridge.fit(X_train_sparse,y_train)
prediction = lridge.predict(X_test_sparse)

Wall time: 3min 48s


In [185]:
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA, 
                                                      'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [254]:
prediction.mean()#1.45082

2.3605271917129667

In [256]:
4.33328-2.3605271917129667

1.9727528082870336

In [257]:
prediction_adjusted = prediction+ 1.9727528082870336

In [259]:
prediction_adjusted.mean()

4.33328

In [260]:
write_submission_file(prediction_adjusted, './assignment6_submissions/assignment6_medium_submission_adjusted.csv')#1.66273

In [237]:
write_submission_file(prediction, './assignment6_submissions/assignment6_medium_submission_default.csv') #2.33484

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeroes. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [240]:
write_submission_file(np.zeros_like(prediction), './assignment6_submissions/medium_all_zeros_submission.csv')#LB 4.33328
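
Why an all-zero submission is informative: if every target is non-negative (which holds if the target is `log1p` of a count), then the absolute error of a zero prediction is just the target itself, so the MAE of all zeros equals the mean of the hidden targets. A small check on made-up numbers, with a hand-written MAE helper:

```python
def mean_absolute_error_plain(y_true, y_pred):
    """Plain-Python MAE, written out to keep the arithmetic visible."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_hidden = [0.0, 0.69, 2.4, 4.6, 6.9]  # made-up non-negative targets
zeros = [0.0] * len(y_hidden)

mae_of_zeros = mean_absolute_error_plain(y_hidden, zeros)
target_mean = sum(y_hidden) / len(y_hidden)
print(mae_of_zeros, target_mean)
```

Read this way, the public LB score of the all-zero file estimates the mean target on the public part of the test set.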

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [None]:
ridge_test_pred_modif = ridge_test_pred # Your code here

In [None]:
write_submission_file(ridge_test_pred_modif, 
                      os.path.join(PATH_TO_DATA,
                                   'assignment6_medium_submission_with_hack.csv'))
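
One simple modification, and the one used in the cells above, is to add a constant so that the mean prediction matches the test-set mean target. The sketch below shows the arithmetic on made-up predictions; `shift_to_mean` is a hypothetical helper, and 4.33328 is the all-zero LB score recorded in this notebook. Matching the means does not guarantee a lower MAE in general, although the scores noted in the comments above (2.33484 before, 1.66273 after) improved in this case.

```python
def shift_to_mean(predictions, target_mean):
    """Add one constant to every prediction so their mean equals target_mean."""
    current_mean = sum(predictions) / len(predictions)
    offset = target_mean - current_mean
    return [p + offset for p in predictions]


raw_predictions = [1.0, 2.5, 3.0, 3.5]  # made-up model outputs
shifted = shift_to_mean(raw_predictions, target_mean=4.33328)
shifted_mean = sum(shifted) / len(shifted)
print(shifted, shifted_mean)
```

The shift changes only the level of the predictions; the differences between articles stay the same.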

In [177]:
%%time
ts = TimeSeriesSplit(5)
alphas = np.logspace(-1,2,5)
lridgeCV = RidgeCV(alphas=alphas, scoring ='neg_mean_absolute_error',cv=ts)
lridgeCV.fit(X_train_sparse,y_train)

Wall time: 9min 19s
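
Time-ordered cross-validation only makes sense when the rows are sorted by publication date, so that every training fold precedes its validation fold. The sketch below mimics the expanding-window idea in plain Python. It is a simplified re-implementation for illustration, not scikit-learn's code, and `expanding_window_splits` is a hypothetical name.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with training always before testing."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        yield list(range(start)), list(range(start, start + test_size))


folds = list(expanding_window_splits(n_samples=12, n_splits=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each fold trains on everything before its validation block, so later folds see more data and no fold is validated on articles older than its training data.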


In [178]:
lridgeCV.alpha_

100.0

In [None]:
%%time
lridge = Ridge(random_state=17,alpha=100)
lridge.fit(X_train_part_sparse,y_train_part)

In [180]:
mean_absolute_error(y_valid,lridge.predict(X_valid_sparse))

1.493027351276218

In [181]:
%%time
lridge.fit(X_train_sparse,y_train)
prediction = lridge.predict(X_test_sparse)

Wall time: 34.1 s


In [182]:
prediction.mean()

3.0148218378784315

In [183]:
pred_adjusted = prediction+ (4.33328 -prediction.mean())

In [186]:
write_submission_file(pred_adjusted, './assignment6_submissions/assignment6_medium_submission_2.csv')

In [188]:
from sklearn.linear_model import Lasso, LassoCV

In [247]:
%%time
lasreg = Lasso(random_state=17)
lasreg.fit(X_train_part_sparse,y_train_part)

Wall time: 12.6 s


In [248]:
mean_absolute_error(y_valid,lasreg.predict(X_valid_sparse))

1.4881453048830608

In [204]:
%%time
ts = TimeSeriesSplit(5)
alphas = np.logspace(-1,4,10)
lasregCV = LassoCV(alphas=alphas,cv=ts,n_jobs=-1,verbose=True)
lasregCV.fit(X_train_sparse,y_train)

..................................................[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  3.1min finished


Wall time: 3min 30s


In [205]:
lasregCV.alpha_

10000.0

In [249]:
%%time
lasreg = Lasso(random_state=17,alpha=1000000)
lasreg.fit(X_train_part_sparse,y_train_part)

Wall time: 11.4 s


In [250]:
mean_absolute_error(y_valid,lasreg.predict(X_valid_sparse))

1.4881453048830608

That's it for the assignment. Many more credits will be given to the winners of this competition; check the [course roadmap](https://mlcourse.ai/roadmap). Do not spoil the assignment and the competition: don't share high-performing kernels (with MAE < 1.5).

Some ideas for improvement:

- Engineer good features, this is the key to success. Some simple features will be based on publication time, authors, content length and so on
- You don't have to discard the HTML; you can extract some features from it
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In our example, we kept only 50k features and used the default `alpha=1` as the Ridge regularization parameter; both can be changed
- SGD and Vowpal Wabbit will learn much faster
- Play around with blending and/or stacking. An intro is given in [this Kernel](https://www.kaggle.com/kashnitsky/ridge-and-lightgbm-simple-blending) by @yorko 
- In our course, we don't cover neural nets, but nothing stops you from using GRUs/LSTMs/whatever in this competition.

Good luck!

<img src='../../img/kaggle_shakeup.png' width=50%>