<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Edited by Sergey Kolchenko (@KolchenkoSergey). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center>Assignment #6
### <center> Beating baselines in "How good is your Medium article?"
    
<img src='../../img/medium_claps.jpg' width=40% />


[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "A6 baseline" (~1.45 Public LB score). Do not forget about our shared ["primitive" baseline](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline) - you'll find something valuable there.

**Your task:**
 1. "Freeride". Come up with good features to beat the baseline "A6 baseline" (for now, public LB is only considered)
 2. You need to name your [team](https://www.kaggle.com/c/how-good-is-your-medium-article/team) (out of 1 person) in full accordance with the [course rating](https://drive.google.com/open?id=19AGEhUQUol6_kNLKSzBsjcGUU3qWy3BNUg8x8IFkO3Q). You can think of it as a part of the assignment. 16 credits for beating the mentioned baseline and correct team naming.

In [1]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack,coo_matrix
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import TimeSeriesSplit

from bs4 import BeautifulSoup
from sklearn.linear_model import Lasso, LassoCV

from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler
from sklearn.model_selection import GridSearchCV, cross_val_score

The following code will help to throw away all HTML tags from an article content.

In [2]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Supplementary function to read a JSON line without crashing on escape characters.

In [3]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except Exception as e:      
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))      
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result

Extract features `content`, `published`, `title` and `author`, write them to separate files for train and test sets.

In [5]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):  
            
    content_list = []
    published_list =[]
    title_list =[]
    author_list =[]
    domain_list =[]
    tags_list = []
    

    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file:
        
        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            
            
            content = json_data['content'].replace('\n',' ').replace('\r',' ')
            content_list.append(strip_tags(content))
            
            published_list.append(json_data['published']['$date'])
            
            title_list.append(strip_tags(json_data['title']).split('\u2013')[0].strip().replace('\n',' ').replace('\r',' '))
            
            author_list.append(json_data['meta_tags']['author'].strip())
            
            domain_list.append(json_data['domain'])
            
            tags_str = []
            soup = BeautifulSoup(content, 'lxml')
            try:
                tag_block = soup.find('ul', class_='tags')
                tags = tag_block.find_all('a')
                for tag in tags:
                    tags_str.append(tag.text.translate({ord(' '):None, ord('-'):None}))
                tags = ' '.join(tags_str)
            except Exception:
                tags = 'None'
            
            tags_list.append(tags)
            
        df = pd.DataFrame()
        df['content'] = content_list
        df['published'] = pd.to_datetime(published_list, format='%Y-%m-%dT%H:%M:%S.%fZ')
        df['title'] = title_list
        df['author'] = author_list
        df['domain'] = domain_list
        df['tags'] = tags_list
        
        if is_train:
            df.sort_values(by ='published',inplace=True)

        features = ['content', 'published', 'title', 'author', 'domain', 'tags']
    
        prefix = 'train' if is_train else 'test'
        
        for feat in features:
            df[feat].to_csv(os.path.join(path_to_data,'{}_{}.txt'.format(prefix, feat)),sep=' ',index=None,header =None)



In [4]:
PATH_TO_DATA = './data/Medium' # modify this if you need to

In [6]:
# extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)

In [537]:
# extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))




**Add the following groups of features:**
    - Tf-Idf with article content (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Tf-Idf with article titles (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Time features: publication hour, whether it's morning, day, night, whether it's a weekend
    - Bag of authors (i.e. One-Hot-Encoded author names)

In [None]:
# !type "./data/medium/train_title.txt" >> "./data/medium/total_content.txt"

In [5]:
%%time
tfidf_content = TfidfVectorizer(os.path.join(PATH_TO_DATA,'train_content.txt'),ngram_range=(1,2),max_features=100000)

with open(os.path.join(PATH_TO_DATA,'train_content.txt'),encoding="utf8") as train_content:
    X_train_content_sparse = tfidf_content.fit_transform(train_content)

with open(os.path.join(PATH_TO_DATA,'test_content.txt'),encoding="utf8") as test_content:
    X_test_content_sparse = tfidf_content.fit_transform(test_content)

Wall time: 12min 5s


In [6]:
X_train_content_sparse.shape,X_test_content_sparse.shape

((62313, 100000), (34645, 100000))

### saving for quick access

In [16]:
from scipy import sparse

sparse.save_npz("./data/medium/X_test_content_sparse_total_bigram.npz", X_test_content_sparse)
# X_test_content_sparse = sparse.load_npz("./data/medium/X_test_content_sparse_total.npz")

sparse.save_npz("./data/medium/X_train_content_sparse_total_bigram.npz", X_train_content_sparse)
# X_train_content_sparse = sparse.load_npz("./data/medium/X_train_content_sparse_total.npz")

** Titles **

In [7]:
tfidf_title = TfidfVectorizer(os.path.join(PATH_TO_DATA,'train_title_strip.txt'),ngram_range=(1,2),max_features=100000)
#                               strip_accents='unicode', stop_words='english')

with open(os.path.join(PATH_TO_DATA,'train_title_strip.txt'),encoding="utf8") as train_title:
    X_train_title_sparse = tfidf_title.fit_transform(train_title)
    
with open(os.path.join(PATH_TO_DATA,'test_title.txt'),encoding="utf8") as test_title:
    X_test_title_sparse = tfidf_title.fit_transform(test_title)

In [8]:
X_train_title_sparse.shape,X_test_title_sparse.shape

((62313, 100000), (34645, 100000))

** tags **

In [325]:
tfidf_tags = TfidfVectorizer(os.path.join(PATH_TO_DATA,'total_tags.txt'),ngram_range=(1,2),max_features=100000,stop_words='english')

with open(os.path.join(PATH_TO_DATA,'train_tags.txt'),encoding="utf8") as train_tags:
    X_train_tags_sparse = tfidf_tags.fit_transform(train_tags)
    
with open(os.path.join(PATH_TO_DATA,'test_tags.txt'),encoding="utf8") as test_tags:
    X_test_tags_sparse = tfidf_tags.fit_transform(test_tags)

In [193]:
X_train_tags_sparse.shape,X_test_tags_sparse.shape

((62313, 100000), (34645, 100000))

** Authors **

In [9]:
train_author_ = pd.read_csv(os.path.join(PATH_TO_DATA,'train_author.txt'),header=None,usecols=[0],names=["authors"])
test_author_ = pd.read_csv(os.path.join(PATH_TO_DATA,'test_author.txt'),header=None,usecols=[0],names=["authors"])

authors = pd.concat([train_author_,test_author_],axis=0,sort=False)

train_author_list = set(train_author_.authors.value_counts().head(500).index)
# train_author_list = set(train_author_.authors.value_counts().index)
test_author_list = set(test_author_.authors.value_counts().index)

common_author = train_author_list.intersection(test_author_list)

def encodecommonauthors(x):
    if x in common_author: return x
    else: return 'unknown'
    
authors = authors.authors.map(encodecommonauthors)

authors = pd.get_dummies(authors,drop_first=True)

X_train_author_sparse = authors.iloc[:train_author_.shape[0],:].values
X_test_author_sparse  = authors.iloc[train_author_.shape[0]:,:].values

X_train_author_sparse.shape, X_test_author_sparse.shape

((62313, 5841), (34645, 5841))

In [292]:
# authors.value_counts().tail()

Arman Anaturk        8
Doc Ayomide          8
Ahmed El-Sharkasy    8
Cory House           8
MVD NO               8
Name: authors, dtype: int64

In [244]:
# authors.authors.value_counts().head()

War Is Boring        305
Caitlin Johnstone    227
Jon Westenberg 🌈     211
Ethan Siegel         176
Larry Kim            166
Name: authors, dtype: int64

In [520]:
# LE = LabelEncoder()
# LE.fit(authors.values.reshape(1,-1)[0])
# X_train_author_LE = LE.transform(train_author_.values.reshape(1,-1)[0])
# X_test_author_LE  = LE.transform(test_author_.values.reshape(1,-1)[0])

# OHE = OneHotEncoder(handle_unknown='ignore')
# OHE.fit(X_train_author_LE.reshape(-1,1))

# X_train_author_sparse = OHE.transform(X_train_author_LE.reshape(-1,1))
# X_test_author_sparse  = OHE.transform(X_test_author_LE.reshape(-1,1))

# X_train_author_sparse.shape, X_test_author_sparse.shape

((62313, 31331), (34645, 31331))

** domain **

In [27]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

train_domain_ = pd.read_csv(os.path.join(PATH_TO_DATA,'train_domain.txt'),header=None,usecols=[0],names=['domains'])
test_domain_ = pd.read_csv(os.path.join(PATH_TO_DATA,'test_domain.txt'),header=None,usecols=[0],names = ['domains'])

domains = pd.concat([train_domain_,test_domain_],axis=0,sort=False)

def toptwodomain(x):
    if x=='medium.com': return 0
    elif x=='hackernoon.com': return 1
    else: return 2
domains = domains.domains.map(toptwodomain)

domains = pd.get_dummies(domains,drop_first=True)

X_train_domain_sparse = domains.iloc[:train_domain_.shape[0],:].values
X_test_domain_sparse  = domains.iloc[train_domain_.shape[0]:,:].values

X_train_domain_sparse.shape, X_test_domain_sparse.shape

((62313, 2), (34645, 2))

In [212]:
domains.iloc[:,0].value_counts().head()

medium.com              91601
hackernoon.com           4367
jw-webmagazine.com        147
blog.medium.com            73
thecoffeelicious.com       48
Name: 0, dtype: int64

In [213]:
train_domain_.iloc[:,0].value_counts().head()

medium.com              59522
hackernoon.com           1938
jw-webmagazine.com        143
blog.medium.com            67
thecoffeelicious.com       48
Name: 0, dtype: int64

In [216]:
test_domain_.iloc[:,0].value_counts().head()

medium.com                  32079
hackernoon.com               2429
towardsdatascience.com         17
blog.usejournal.com             9
journal.thriveglobal.com        7
Name: 0, dtype: int64

** Date Columns **

In [10]:
X_train_time = pd.read_csv(os.path.join(PATH_TO_DATA,'train_published.txt'),header=None,names=['date'],
                                            parse_dates =['date'])
X_test_time = pd.read_csv(os.path.join(PATH_TO_DATA,'test_published.txt'),header=None,names=['date'],
                                         parse_dates =['date'])

In [11]:
X_train_time.shape,X_test_time.shape

((62313, 1), (34645, 1))

In [12]:
def add_time_features(df):
    new_df = pd.DataFrame(index=df.index)
    hour = df['date'].apply(lambda ts: ts.hour)
    new_df['morning'] = ((hour >= 7) & (hour <= 11)).astype('int')
    new_df["day"] = ((hour >= 12) & (hour <= 18)).astype('int')
    new_df["evening"] = ((hour >= 19) & (hour <= 23)).astype('int')
    new_df["night"] = ((hour >= 0) & (hour <= 6)).astype('int')
    new_df['is_weekend'] = df['date'].dt.weekday.isin([6,7]).astype('int')
#     new_df['year'] = df['date'].dt.year
#     new_df['month'] = df['date'].dt.month
#     new_df['weekday'] = df['date'].dt.weekday
#     new_df['hour'] = df['date'].dt.hour    
    return new_df

In [13]:
X_train_time_features_sparse = add_time_features(X_train_time)
X_test_time_features_sparse = add_time_features(X_test_time)
X_train_time_features_sparse.shape,X_test_time_features_sparse.shape

((62313, 5), (34645, 5))

In [14]:
X_train_time_features_sparse.head()

Unnamed: 0,morning,day,evening,night,is_weekend
0,0,0,0,1,0
1,0,0,0,1,0
2,0,0,0,1,1
3,0,0,1,0,0
4,0,1,0,0,0


** length features **

In [15]:
with open(os.path.join(PATH_TO_DATA,'train_content.txt'),encoding="utf8") as train_content:
    length =[] 
    for line in train_content:
        length.append(len(line))
X_train_time_features_sparse['length'] = length

with open(os.path.join(PATH_TO_DATA,'test_content.txt'),encoding="utf8") as test_content:
    length =[] 
    for line in test_content:
        length.append(len(line))
X_test_time_features_sparse['length'] = length

with open(os.path.join(PATH_TO_DATA,'train_tags.txt'),encoding="utf8") as train_tags:
    length =[] 
    for line in train_tags:
        length.append(len(line))
X_train_time_features_sparse['tag_length'] = length

with open(os.path.join(PATH_TO_DATA,'test_tags.txt'),encoding="utf8") as test_tags:
    length =[] 
    for line in test_tags:
        length.append(len(line))
X_test_time_features_sparse['tag_length'] = length

In [493]:
# X_train_time_features_sparse.head()

In [494]:
# X_test_time_features_sparse.head()

In [17]:
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
minmax.fit(X_train_time_features_sparse)
X_train_time_features_sparse[X_train_time_features_sparse.columns] = minmax.transform(X_train_time_features_sparse)
X_test_time_features_sparse[X_test_time_features_sparse.columns] = minmax.transform(X_test_time_features_sparse)

In [18]:
import seaborn as sns
sns.heatmap(X_test_time_features_sparse.corr());

<matplotlib.axes._subplots.AxesSubplot at 0x2743b329198>

In [19]:
X_train_time_features_sparse.head()

Unnamed: 0,morning,day,evening,night,is_weekend,length,tag_length
0,0.0,0.0,0.0,1.0,0.0,0.012452,0.457944
1,0.0,0.0,0.0,1.0,0.0,0.012102,0.0
2,0.0,0.0,0.0,1.0,1.0,0.005394,0.102804
3,0.0,0.0,1.0,0.0,0.0,0.02619,0.420561
4,0.0,1.0,0.0,0.0,0.0,0.029159,0.140187


In [30]:
# from sklearn.preprocessing import StandardScaler, MinMaxScaler
# minmax = MinMaxScaler()
# minmax.fit(X_train_time_features_sparse)
# X_train_time_features_sparse_norm =minmax.transform(X_train_time_features_sparse)
# X_test_time_features_sparse_norm =minmax.transform(X_test_time_features_sparse)

In [31]:
# from sklearn.preprocessing import StandardScaler, MinMaxScaler
# minmax = StandardScaler()
# minmax.fit(X_train_time_features_sparse)
# X_train_time_features_sparse_norm =minmax.transform(X_train_time_features_sparse)
# X_test_time_features_sparse_norm =minmax.transform(X_test_time_features_sparse)

**Join all sparse matrices.**

In [37]:
X_train_sparse = hstack([X_train_content_sparse, X_train_title_sparse,
                         X_train_author_sparse, 
                         X_train_time_features_sparse]).tocsr()

X_test_sparse = hstack([X_test_content_sparse, X_test_title_sparse,
                        X_test_author_sparse, 
                        X_test_time_features_sparse]).tocsr()

X_train_sparse.shape, X_test_sparse.shape

In [None]:
# from scipy import sparse

# # sparse.save_npz("./data/medium/X_train_sparse.npz", X_train_sparse)
# X_train_sparse = sparse.load_npz("./data/medium/X_train_sparse.npz")

# # sparse.save_npz("./data/medium/X_test_sparse.npz", X_test_sparse)
# X_test_sparse = sparse.load_npz("./data/medium/X_test_sparse.npz")

**Read train target and split data for validation.**

In [40]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')
y_train = train_target['log_recommends'].values

In [41]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [42]:
%%time
lridge = Ridge(random_state=17)
lridge.fit(X_train_part_sparse,y_train_part)

Wall time: 5min 36s


In [43]:
mean_absolute_error(y_valid,lridge.predict(X_valid_sparse))

1.5492824304336308

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [45]:
%%time
lridge.fit(X_train_sparse,y_train)
prediction = lridge.predict(X_test_sparse)

Wall time: 8min 2s


In [20]:
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA, 
                                                      'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [47]:
pred_adjusted = prediction+ (4.33328 -prediction.mean())

In [49]:
pred_adjusted.mean()

4.333280000000001

In [50]:
write_submission_file(pred_adjusted, './assignment6_submissions/assignment6_medium_submission_adjusted_3.csv')#LB 1.82842 #CV 1.5492824304336308

In [60]:
write_submission_file(pred_adjusted, './assignment6_submissions/assignment6_medium_submission_adjusted_4.csv')
#LB 1.79066 #CV 1.4881453048830608

In [394]:
from scipy import sparse
X_train_content_sparse = sparse.load_npz("./data/medium/X_train_content_sparse_total_bigram.npz")
X_test_content_sparse = sparse.load_npz("./data/medium/X_test_content_sparse_total_bigram.npz")

In [361]:
X_train_domain_sparse.shape,X_train_author_sparse.shape,X_train_time_features_sparse_norm.shape

((62313, 2), (62313, 352), (62313, 11))

In [511]:
X_train_sparse = np.concatenate((X_train_author_sparse,X_train_time_features_sparse),axis=1)
X_test_sparse = np.concatenate((X_test_author_sparse,X_test_time_features_sparse),axis=1)

In [506]:
X_train_sparse1.shape,X_test_sparse1.shape

((62313, 5848), (34645, 5848))

In [28]:
X_train_sparse = hstack([
                          X_train_domain_sparse,
                          X_train_content_sparse,
#                           X_train_title_sparse,                          
#                           X_train_tags_sparse,
                          X_train_author_sparse,
                          X_train_time_features_sparse]).tocsr()

X_test_sparse = hstack([
                         X_test_domain_sparse,
                         X_test_content_sparse,
#                          X_test_title_sparse,
#                          X_test_tags_sparse,
                         X_test_author_sparse,
                         X_test_time_features_sparse]).tocsr()

train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')
y_train = train_target['log_recommends'].values


X_train_sparse.shape, X_test_sparse.shape

((62313, 105850), (34645, 105850))

In [29]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

In [548]:
lridge = Ridge(random_state=17,alpha=500)
lridge.fit(X_train_part_sparse,y_train_part)
mean_absolute_error(y_valid,lridge.predict(X_valid_sparse)) # author+content+time features.

1.4878659037253157

In [23]:
lridge = Ridge(random_state=17,alpha=500)
lridge.fit(X_train_part_sparse,y_train_part)
mean_absolute_error(y_valid,lridge.predict(X_valid_sparse))

1.4878659037253157

In [26]:
lridge = Ridge(random_state=17)
lridge.fit(X_train_part_sparse,y_train_part)
mean_absolute_error(y_valid,lridge.predict(X_valid_sparse))

1.5626494038236158

In [30]:
lridge = Ridge(random_state=17)
lridge.fit(X_train_part_sparse,y_train_part)
mean_absolute_error(y_valid,lridge.predict(X_valid_sparse)) ## added domain no titles

1.563165334511272

### Submission file

In [516]:
X_test_sparse.shape

(34645, 105848)

In [24]:
lridge.fit(X_train_sparse,y_train)
prediction = lridge.predict(X_test_sparse)

pred_adjusted = prediction+ (4.33328 -prediction.mean())
pred_adjusted.mean(),prediction.mean()

(4.33328, 3.025346584063666)

In [303]:
write_submission_file(pred_adjusted, './assignment6_submissions/assignment6_medium_submission_adjusted_5.csv')
#LB 1.87163 #CV 1.6676447868344884

In [391]:
write_submission_file(pred_adjusted, './assignment6_submissions/assignment6_medium_submission_adjusted_6.csv')
#LB 1.79381 #CV 1.5931624978304328

In [392]:
write_submission_file(prediction, './assignment6_submissions/assignment6_medium_submission_notadjusted_7.csv')
#LB 2.09594 #CV 1.5931624978304328

In [463]:
write_submission_file(pred_adjusted, './assignment6_submissions/assignment6_medium_submission_adjusted_8.csv')
#LB 1.79082 #cv 1.5864854295562016

In [518]:
write_submission_file(pred_adjusted, './assignment6_submissions/assignment6_medium_submission_adjusted_9.csv')
#LB 1.79061  #cv sumbission with content and some features no title #cv 0.7  1.487730649228865

In [526]:
write_submission_file(pred_adjusted, './assignment6_submissions/assignment6_medium_submission_adjusted_10.csv')
#LB  1.79047   #cv 0.7  1.487857

In [550]:
write_submission_file(pred_adjusted, './assignment6_submissions/assignment6_medium_submission_adjusted_11.csv')
#LB  1.79120  #cv0.7  1.4878659037253157 # all common authors content time features.(test file correct)

In [25]:
write_submission_file(pred_adjusted, './assignment6_submissions/assignment6_medium_submission_adjusted_12.csv')
#LB 1.79315   #cv0.7  1.4878659037253157 # all common authors content time features.(test file correctnow)

### Glove embeddings 

In [None]:
! wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
unzip glove.6B.zip

In [None]:
import numpy as np

with open("glove.6B.50d.txt", "rb") as lines:
    w2v = {line.split()[0]: np.array(map(float, line.split()[1:]))
           for line in lines}

### grid search

In [140]:
param_grid = {'alpha':np.logspace(3,5,10)}
gs = GridSearchCV(lridge,param_grid=param_grid,n_jobs=-1,scoring='neg_mean_absolute_error',cv=TimeSeriesSplit(5))
gs.fit(X_train_sparse1,y_train)

GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=5),
       error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=17, solver='auto', tol=0.001),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'alpha': array([  1000.     ,   1668.10054,   2782.5594 ,   4641.58883,
         7742.63683,  12915.49665,  21544.3469 ,  35938.13664,
        59948.42503, 100000.     ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_absolute_error', verbose=0)

In [141]:
gs.best_params_

{'alpha': 100000.0}

In [96]:
%%time
ts = TimeSeriesSplit(5)
alphas = np.logspace(1,3,5)
lridgeCV = RidgeCV(alphas=alphas, scoring ='neg_mean_absolute_error',cv=ts)
lridgeCV.fit(X_train_sparse1,y_train)

Wall time: 26.2 s


In [98]:
lridgeCV.alpha_,lridgeCV.coef_

(1000.0, array([ 0.00268131,  0.00402473,  0.00451589, ..., -0.02041138,
         0.00301198, -0.0824426 ]))

### feature selection

In [111]:
# !pip install mlxtend

In [112]:
from mlxtend.feature_selection import SequentialFeatureSelector

In [135]:
selector = SequentialFeatureSelector(lridge, scoring='neg_mean_absolute_error', 
                                     verbose=2, k_features=3, forward=False, n_jobs=-1)

selector.fit(X_train_time_features_sparse_norm, y_train)

[Parallel(n_jobs=-1)]: Done  11 out of  11 | elapsed:    4.3s finished

[2018-11-16 00:17:28] Features: 10/3 -- score: -1.6136425596076482[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    4.1s finished

[2018-11-16 00:17:33] Features: 9/3 -- score: -1.612995721194757[Parallel(n_jobs=-1)]: Done   7 out of   9 | elapsed:    3.6s remaining:    0.9s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    3.8s finished

[2018-11-16 00:17:37] Features: 8/3 -- score: -1.6127004388806323[Parallel(n_jobs=-1)]: Done   6 out of   8 | elapsed:    3.6s remaining:    1.1s
[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:    3.8s finished

[2018-11-16 00:17:42] Features: 7/3 -- score: -1.6126614938257287[Parallel(n_jobs=-1)]: Done   4 out of   7 | elapsed:    3.6s remaining:    2.7s
[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:    3.7s finished

[2018-11-16 00:17:46] Features: 6/3 -- score: -1.6126370872179436[Parallel(n_jobs=-1)]: Done   3 out of   6 | elapsed:    2.8s remaining:

SequentialFeatureSelector(clone_estimator=True, cv=5,
             estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=17, solver='auto', tol=0.001),
             floating=False, forward=False, k_features=3, n_jobs=-1,
             pre_dispatch='2*n_jobs', scoring='neg_mean_absolute_error',
             verbose=2)

In [136]:
selector.k_feature_names_

('2', '8', '9')

In [513]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # all common author, time  + removed hour,weekday, month feature 

(array([-1.70615878, -1.67787463, -1.60717713, -1.52979893, -1.49164347]),
 -1.602530587673937)

In [503]:
lridge = Ridge(random_state=17,alpha=100)
scores = cross_val_score(lridge,X_train_sparse,y_train,cv=3,n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # all common author, time  + removed hour,weekday, month feature + content on train corpus

(array([-1.73459287, -1.61540689, -1.49328139]), -1.6144270493039015)

In [504]:
lridge = Ridge(random_state=17,alpha=100)
scores = cross_val_score(lridge,X_train_sparse,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # all common author, time  + removed hour,weekday, month feature + content on train corpus

(array([-1.6961397 , -1.66503504, -1.57940485, -1.51182441, -1.47267521]),
 -1.5850158420836533)

In [471]:
lridge = Ridge(random_state=17,alpha=100)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # all common author, time  + removed hour,weekday, month,feature + title(5 ngram) + content on train corpus

(array([-1.70678507, -1.66486316, -1.574324  , -1.51921109, -1.47291004]),
 -1.5876186715697718)

In [460]:
lridge = Ridge(random_state=17,alpha=500)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() 
# only author common, time +content(train corpus) + removed hour,year, weekday, month,feature

(array([-1.70326353, -1.66437512, -1.5739535 , -1.51747089, -1.47336411]),
 -1.5864854295562016)

In [430]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # only author(500), time +content(train corpus)+title(train corpus) + removed hour,weekday, month,feature

(array([-1.79021982, -1.76944994, -1.67921221, -1.64240117, -1.58974062]),
 -1.6942047504117308)

In [427]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # only author(500), time +content(train corpus) + removed hour,weekday, month,feature

(array([-1.75029904, -1.71864653, -1.63777537, -1.58549061, -1.53951762]),
 -1.6463458332448966)

In [418]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # only author(500), time +title + removed hour feature

(array([-1.80411863, -1.76962978, -1.66307081, -1.60140272, -1.54928635]),
 -1.6775016574455894)

In [416]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # only author(500), time + removed hour feature

(array([-1.71383703, -1.67039702, -1.58319636, -1.52311857, -1.47406608]),
 -1.5929230112113395)

In [413]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # only author(500), time+ content(bigram, 50000) + removed hour feature

(array([-1.75029904, -1.71864653, -1.63777537, -1.58549061, -1.53951762]),
 -1.6463458332448966)

In [407]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # only author(500), domain, time+ content(bigram, 50000) + removed hour feature

(array([-1.7511022 , -1.71865133, -1.63760093, -1.58540646, -1.54022222]),
 -1.6465966285312508)

In [397]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # only author(500), domain, time+ content(bigram)

(array([-1.75552844, -1.72424925, -1.65143391, -1.59018058, -1.54537297]),
 -1.653353029870934)

In [386]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # only author(500), domain, time+ title(10000)

(array([-1.80465401, -1.76968117, -1.663078  , -1.60150013, -1.54924794]),
 -1.677632249227821)

In [376]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # only author(1000), domain, time

(array([-1.71605728, -1.67320132, -1.58660663, -1.52797207, -1.47772642]),
 -1.5963127440676463)

In [381]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # only author(500), domain, time

(array([-1.71481073, -1.67048868, -1.58324289, -1.52315781, -1.47411238]),
 -1.5931624978304328)

In [170]:
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='mean_absolute_error')
scores,scores.mean() # same as baseline just with trained on full corpus

(array([-1.77988321, -1.75231114, -1.68516628, -1.63273228, -1.58802091]),
 -1.6876227639716952)

In [173]:
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='mean_absolute_error')
scores,scores.mean() # same as baseline just with trained on full corpus + domain

(array([-1.78005177, -1.75267851, -1.68626226, -1.6318807 , -1.58865579]),
 -1.6879058038621024)

In [177]:
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='mean_absolute_error')
scores,scores.mean() # same as baseline just with trained on full corpus + domain + tags

(array([-1.78776372, -1.7541245 , -1.69207366, -1.62452704, -1.59107438]),
 -1.6899126623792564)

In [181]:
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='mean_absolute_error')
scores,scores.mean() # same as baseline just with trained on full corpus + domain + tags - no content

(array([-1.78788731, -1.75173271, -1.67609661, -1.60911423, -1.57100427]),
 -1.6791670269193824)

In [189]:
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='mean_absolute_error')
scores,scores.mean() # same as baseline just with trained on full corpus + domain + small tags and titles - no content

(array([-1.79597596, -1.76370831, -1.68977071, -1.62458945, -1.58708861]),
 -1.6922266082441344)

In [197]:
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='mean_absolute_error')
scores,scores.mean() # same as baseline just with trained on full corpus + domain + small ngram tags and titles - no content

(array([-1.77895346, -1.74125258, -1.66932919, -1.60010456, -1.55931242]),
 -1.669790442385595)

In [231]:
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # same as baseline just with trained on full corpus + reduced domain + small ngram tags and titles - no content

(array([-1.77880984, -1.74066763, -1.66815147, -1.6010553 , -1.5594876 ]),
 -1.669634365394295)

In [297]:
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # same as baseline just with trained on full corpus +reduced authorlist+ reduced domain + small ngram tags and titles - no content

(array([-1.78088846, -1.75069073, -1.65618224, -1.59819464, -1.55226787]),
 -1.6676447868344884)

In [330]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # same as baseline just with trained on full corpus +reduced authorlist+ reduced domain + 
#small ngram tags and titles(nostopword) - no content

(array([-1.78248074, -1.75609869, -1.68068572, -1.62029946, -1.59025814]),
 -1.685964551629732)

In [336]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # content with bigrams +reduced authorlist+ reduced domain + 
#small ngram tags and titles(nostopword) 

(array([-1.78240041, -1.75600557, -1.68051004, -1.6203664 , -1.58921474]),
 -1.685699431471828)

In [340]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # content with bigrams +reduced authorlist+ reduced domain + 
# - no tags -no title

(array([-1.75596149, -1.72707715, -1.65358406, -1.59496739, -1.54873665]),
 -1.6560653469113027)

In [344]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # content with bigrams +reduced authorlist+ reduced domain + 
#  - no tags -no title + more authors

(array([-1.7554402 , -1.72698805, -1.66269248, -1.60270939, -1.5585561 ]),
 -1.661277245275749)

In [348]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # content with bigrams +reduced authorlist+ reduced domain + 
#  - no tags -no title + less authors(500)

(array([-1.75552844, -1.72424925, -1.65143391, -1.59018058, -1.54537297]),
 -1.653353029870934)

In [351]:
lridge = Ridge(random_state=17)
scores = cross_val_score(lridge,X_train_sparse1,y_train,cv=TimeSeriesSplit(5),n_jobs=-1,scoring='neg_mean_absolute_error')
scores,scores.mean() # no content +reduced authorlist+ reduced domain + 
#  -no title + less authors(500)

(array([-1.74941744, -1.71131357, -1.63078541, -1.55365962, -1.51169151]),
 -1.6313735095426722)

That's it for the assignment. Much more credits will be given to the winners in this competition, check [course roadmap](https://mlcourse.ai/roadmap). Do not spoil the assignment and the competition - don't share high-performing kernels (with MAE < 1.5).

Some ideas for improvement:

- Engineer good features, this is the key to success. Some simple features will be based on publication time, authors, content length and so on
- You may not ignore HTML and extract some features from there
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In our example, we've left only 50k features and used C=1 as a regularization parameter, this can be changed
- SGD and Vowpal Wabbit will learn much faster
- Play around with blending and/or stacking. An intro is given in [this Kernel](https://www.kaggle.com/kashnitsky/ridge-and-lightgbm-simple-blending) by @yorko 
- In our course, we don't cover neural nets. But it's not obliged to use GRUs/LSTMs/whatever in this competition.

Good luck!

<img src='../../img/kaggle_shakeup.png' width=50%>