**Experiment 4: DTA, linear regression and subwords**

*Background*: Experiment 2 showed that linear regression with 215 features no longer overfits, but the model does not fit the beginning and the end of the time range. Quadratic polynomial regression would be an alternative, but it performs worse than linear regression on both the training and the validation set. Polynomial regression of degree 5 cannot be computed because of the vast number of polynomial features it creates.

*Goal*: Determine if it is possible to predict the year in which a text was written using regression.

*Strategies*:

- Use a BPE-transformer to train on subwords (Sennrich2016)

*Relevance*:

- If this experiment works, years can be estimated for corpora in which this variable has missing values (NAs).
- Subwords might improve generalization and reduce the size of the vocabulary at the same time.

*Success criteria*:

- Consistent findings across training, validation and test set
- Predicted year is no more than ten years away from the true year

*Corpora*:

- DTA

*Baseline to beat (Exp. 2)*:
- Train MSE = 2259.8
- Val MSE = 3202.51
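
To relate these MSE values to the ten-year success criterion, it can help to convert them to RMSE, which is expressed in years. The following is a minimal sketch using only the baseline numbers above. It assumes that an RMSE of at most 10 (an MSE of at most 100) is an acceptable rough proxy for the criterion; RMSE is an aggregate, so this is not an exact equivalent of "every prediction within ten years".

```python
import math

# Baseline MSE values from Experiment 2 (copied from the list above).
baseline_mse = {"train": 2259.8, "val": 3202.51}

# Success criterion: predictions at most ten years off.
# Assumption: RMSE <= 10 (MSE <= 100) is used as a rough proxy for it.
max_error_years = 10
mse_threshold = max_error_years ** 2

for split, mse in baseline_mse.items():
    rmse = math.sqrt(mse)
    print(f"{split}: MSE = {mse:.2f}, RMSE = {rmse:.1f} years, "
          f"meets proxy criterion: {mse <= mse_threshold}")
```

Read this way, the baseline is off by roughly 48 years on the training set and roughly 57 years on the validation set, so it is still far from the success criterion.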

*Result*: - 

In [1]:
import pandas as pd
import numpy as np
import nltk

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest , f_regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import sklearn.utils
from sklearn.preprocessing import FunctionTransformer

from Selfwritten_modules.SubwordTransformer import SubwordTransformer

import re

import eli5



In [2]:
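# Build a tokenizer that removes the list brackets and splits on ','.
# Note: the pattern is a character class, so besides '[' and ']'
# it also strips '(', ')', '+' and '|'.
def tokenizer_word(doc):
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    doc = re.split(',', doc)
    return doc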
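
A short usage sketch of the tokenizer may make its behaviour easier to see. The sample strings are hypothetical; the actual format of the 'Text' column in the DTA files is not shown here and may differ.

```python
import re

def tokenizer_word(doc):
    # Same pattern as in the cell above: a character class that strips
    # '[', ']', '(', ')', '+' and '|' before splitting on ','.
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    return re.split(',', doc)

# Hypothetical inputs, assuming a bracketed, comma-separated token list.
print(tokenizer_word('[der,hund,bellt]'))  # ['der', 'hund', 'bellt']
print(tokenizer_word('[der, hund]'))       # ['der', ' hund']
```

The second call shows that whitespace after a comma stays attached to the token, which matters if the stored lists are written with ", " as separator.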

In [3]:
# Collect the global feature weights plus the per-document feature
# contributions, in order to find out how features are weighted.
def collect_predictions(dataset, classifier, vectorizer, feature_names, pipeline):
    # Global weights of the fitted model
    predictions = eli5.explain_weights_df(classifier, vec=vectorizer, feature_names=feature_names)
    predictions = predictions.drop(['target'], axis=1)
    predictions['YEAR'] = 0

    for instance in range(len(dataset)):
        # Per-feature explanation for a single document
        pred = eli5.explain_prediction_df(classifier, dataset[instance], vec=vectorizer, feature_names=feature_names)
        source_text = pd.DataFrame([[dataset[instance]]])
        year_pred = pipeline.predict(source_text[0])

        # Contribution of a feature = model weight * feature value
        pred['weight_value'] = pred['weight'] * pred['value']
        pred['instance'] = instance
        pred = pred.drop(['target', 'weight', 'value'], axis=1)
        pred['YEAR'] = np.round(year_pred[0])

        predictions = pd.concat([predictions, pred])

    return predictions
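
The weight_value column in the function above rests on the fact that a linear model's prediction is the intercept plus the sum of weight times feature value. A minimal sketch of that arithmetic follows; the subword names, weights and counts are invented for illustration and do not come from the DTA model.

```python
# Hypothetical intercept, weights and feature values for one document.
intercept = 1800.0
weights = {"sch": 2.5, "en": -0.5, "th": -6.0}
values = {"sch": 12, "en": 40, "th": 3}

# Contribution of each feature to the predicted year.
contributions = {name: weights[name] * values[name] for name in weights}

# The prediction is the intercept plus all contributions.
prediction = intercept + sum(contributions.values())

for name, contribution in contributions.items():
    print(f"{name}: {contribution:+.1f} years")
print("predicted year:", prediction)
```

Positive contributions push the predicted year later, negative ones push it earlier.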

In [4]:
train_full = pd.read_csv('/Volumes/Korpora/Train/DTA_train_tokenized.csv', sep=';')
val_full = pd.read_csv('/Volumes/Korpora/Val/DTA_val_tokenized.csv', sep=';')
#test_full = pd.read_csv('/Volumes/Korpora/Test/DTA_test_tokenized.csv', sep=';')

In [5]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
#print('Length test set: ', len(test_full))

Length train set:  899
Length validation set:  225


In [6]:
train_x = train_full['Text']
train_y = train_full['Publication_year']

val_x = val_full['Text']
val_y = val_full['Publication_year']

The pipeline takes a long time to run. To determine where the problem is, I'm going to go through the subword transformer step by step.

In [7]:
subwords = SubwordTransformer(tokenizer=tokenizer_word)

subwords.fit(train_x)

SubwordTransformer(n_best=5, number_of_merges=10,
                   tokenizer=<function tokenizer_word at 0x11e3dc7a0>)

In [10]:
len(subwords.get_feature_names())

50

Initially, the subword algorithm extracted only 10 subwords, which is the default number of merges, and it needed about one minute per merge.
Extracting the n_best pairs at each merge instead of only the best pair seemed more useful; with the defaults n_best=5 and number_of_merges=10, the transformer yields the 50 features shown above.
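
The idea of taking several pairs per merge step can be sketched as follows. This is not the code of SubwordTransformer, which is not shown in this notebook; it is a simplified byte-pair-encoding loop under that assumption. The parameter names number_of_merges and n_best are taken from the transformer, while count_pairs, merge_pair and learn_subwords are hypothetical helpers.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(symbols, pair):
    """Merge every occurrence of `pair` in one tuple of symbols."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_subwords(words, number_of_merges=10, n_best=5):
    """Learn at most number_of_merges * n_best subwords."""
    vocab = Counter(tuple(word) for word in words)
    subwords = []
    for _ in range(number_of_merges):
        # Take the n_best most frequent pairs of this step at once.
        best = [pair for pair, _ in count_pairs(vocab).most_common(n_best)]
        if not best:
            break
        for pair in best:
            new_vocab = Counter()
            for symbols, freq in vocab.items():
                new_vocab[merge_pair(symbols, pair)] += freq
            # Skip pairs that vanished because an earlier merge consumed them.
            if new_vocab == vocab:
                continue
            vocab = new_vocab
            subword = "".join(pair)
            if subword not in subwords:
                subwords.append(subword)
    return subwords

words = ["low", "low", "lower", "newest", "newest", "widest"]
subwords = learn_subwords(words, number_of_merges=2, n_best=3)
print(subwords)
```

In this sketch the pair counts are computed once per merge step, so raising n_best adds features without adding counting passes, whereas raising number_of_merges adds one full pass per extra step.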

The transformation step originally took a very long time; after a fix, it now runs noticeably faster.

Update: Parallelizing the extraction of the subwords really helped to gather more features in less time. I would therefore rather increase the number of features extracted per merge step (n_best) than the number of merges, at least when runtime matters.

In [11]:
train_x_transformed = subwords.transform(train_x)

In [12]:
reg_1 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=5, n_best = 50)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [13]:
reg_1.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=50, number_of_merges=5,
                                    tokenizer=<function tokenizer_word at 0x11e3dc7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [14]:
y_pred_train = reg_1.predict(train_x)
mean_squared_error(train_y, y_pred_train)

1856.134944837302

In [15]:
y_pred_val = reg_1.predict(val_x)

mean_squared_error(val_y, y_pred_val)

5607.611798707225

In [16]:
reg_2 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=10, n_best = 25)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [17]:
reg_2.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=25, number_of_merges=10,
                                    tokenizer=<function tokenizer_word at 0x11e3dc7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [18]:
y_pred_train = reg_2.predict(train_x)
mean_squared_error(train_y, y_pred_train)

1705.4578667527232

In [19]:
y_pred_val = reg_2.predict(val_x)

mean_squared_error(val_y, y_pred_val)

5158.148065706169

In [20]:
reg_3 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 250)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [21]:
reg_3.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=250, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x11e3dc7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [22]:
y_pred_train = reg_3.predict(train_x)
mean_squared_error(train_y, y_pred_train)

1401.2375448462706

In [23]:
y_pred_val = reg_3.predict(val_x)

mean_squared_error(val_y, y_pred_val)

9079.520901710413

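The subword models do not beat the baseline yet. Their training MSE is lower than the baseline (1401 to 1856 vs. 2259.8), but their validation MSE is clearly higher (5158 to 9080 vs. 3202.51), so they overfit. More merges seem to improve validation performance slightly (10 merges: 5158, 5 merges: 5608, 1 merge: 9080). With 250 features each (number_of_merges times n_best), these models also use more features than the 215 of the linear regression on which the baseline was trained. So let's play around with the number of features.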

In [26]:
reg_4 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=10, n_best = 10)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [27]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=10, number_of_merges=10,
                                    tokenizer=<function tokenizer_word at 0x11e3dc7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [28]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

0.033363106020776016

In [29]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

1116084.7998909994

100 features from many merges with few pairs extracted per merge lead to heavy overfitting: the train MSE is about 0.03, while the validation MSE exceeds 1.1 million.

In [8]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=5, n_best = 20)),
                ('ridge_reg', linear_model.Ridge())])

In [9]:
reg_5.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=20, number_of_merges=5,
                                    tokenizer=<function tokenizer_word at 0x1250d3b90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [10]:
y_pred_train = reg_5.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2715.818723121653

In [11]:
y_pred_val = reg_5.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3321.3020605132606

In [12]:
reg_6 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 50)),
                ('ridge_reg', linear_model.Ridge())])

In [13]:
reg_6.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=50, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x1250d3b90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [14]:
y_pred_train = reg_6.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2575.599149907795

In [15]:
y_pred_val = reg_6.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3447.358455480043

In [16]:
reg_7 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 100)),
                ('ridge_reg', linear_model.Ridge())])

In [17]:
reg_7.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=100, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x1250d3b90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [18]:
y_pred_train = reg_7.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2500.7412722986505

In [19]:
y_pred_val = reg_7.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3745.6498662387094

In [20]:
reg_8 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 60)),
                ('ridge_reg', linear_model.Ridge())])

In [21]:
reg_8.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=60, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x1250d3b90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [22]:
y_pred_train = reg_8.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2465.011096075109

In [23]:
y_pred_val = reg_8.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3602.910503824884

In [25]:
reg_9 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 80)),
                ('ridge_reg', linear_model.Ridge())])

In [26]:
reg_9.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=80, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x1250d3b90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [27]:
y_pred_train = reg_9.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2222.48922717676

In [28]:
y_pred_val = reg_9.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3667.109164393909

In [29]:
reg_10 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 70)),
                ('ridge_reg', linear_model.Ridge())])

In [30]:
reg_10.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=70, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x1250d3b90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [31]:
y_pred_train = reg_10.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2192.9760224052943

In [32]:
y_pred_val = reg_10.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4156.00168293966

In [33]:
reg_11 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 50)),
                ('ridge_reg', linear_model.Ridge())])

In [34]:
reg_11.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=50, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x1250d3b90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [35]:
y_pred_train = reg_11.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2154.155532780347

In [36]:
y_pred_val = reg_11.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4410.141749003236

In [37]:
reg_12 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 30)),
                ('ridge_reg', linear_model.Ridge())])

In [38]:
reg_12.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=30, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x1250d3b90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [39]:
y_pred_train = reg_12.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2148.1071957613735

In [40]:
y_pred_val = reg_12.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4429.4427212795745
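
To compare the twelve runs at a glance, the results above can be collected into one table. This is a minimal sketch: the MSE values are copied from the cell outputs and rounded to two decimals, and the feature count is assumed to be number_of_merges times n_best, which matches the 50 and 100 features reported earlier.

```python
# (model, number_of_merges, n_best, train MSE, val MSE), from the outputs above.
runs = [
    ("reg_1", 5, 50, 1856.13, 5607.61),
    ("reg_2", 10, 25, 1705.46, 5158.15),
    ("reg_3", 1, 250, 1401.24, 9079.52),
    ("reg_4", 10, 10, 0.03, 1116084.80),
    ("reg_5", 5, 20, 2715.82, 3321.30),
    ("reg_6", 2, 50, 2575.60, 3447.36),
    ("reg_7", 1, 100, 2500.74, 3745.65),
    ("reg_8", 2, 60, 2465.01, 3602.91),
    ("reg_9", 2, 80, 2222.49, 3667.11),
    ("reg_10", 2, 70, 2192.98, 4156.00),
    ("reg_11", 3, 50, 2154.16, 4410.14),
    ("reg_12", 3, 30, 2148.11, 4429.44),
]
BASELINE_VAL_MSE = 3202.51  # validation MSE of Experiment 2

# Sort by validation MSE, best model first.
ranked = sorted(runs, key=lambda run: run[4])

print(f"{'model':<8}{'merges':>7}{'n_best':>7}{'features':>9}"
      f"{'train MSE':>12}{'val MSE':>14}")
for name, merges, n_best, train_mse, val_mse in ranked:
    features = merges * n_best  # assumption: one feature per extracted pair
    print(f"{name:<8}{merges:>7}{n_best:>7}{features:>9}"
          f"{train_mse:>12.2f}{val_mse:>14.2f}")

best = ranked[0]
beats_baseline = [run[0] for run in runs if run[4] < BASELINE_VAL_MSE]
print("best on validation:", best[0])
print("models beating the baseline:", beats_baseline)
```

With these numbers, reg_5 (5 merges, n_best=20) has the lowest validation MSE at 3321.30, which is still above the baseline of 3202.51, so none of the subword configurations tried here beats Experiment 2.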