**Experiment 2: DTA, linear regression and subwords**

*Background*: Experiment 2 shows that linear regression with 215 features no longer overfits, but the model does not fit the beginning and the end of the time range. Quadratic polynomial regression would be an alternative, but it performs worse than linear regression on both the training and the validation set. Polynomial regression of degree 5 cannot be computed because of the vast number of polynomial features it creates.

*Goal*: Determine if it is possible to predict the year in which a text was written using regression.

*Strategies*:

- Use a BPE-transformer to train on subwords (Sennrich2016)

*Relevance*:

- If this experiment works, it is possible to estimate years for corpora that have missing values (NAs) in this variable.
- Subwords might improve generalization and reduce the size of the vocabulary at the same time.

*Success criteria*:

- Consistent findings across training, validation, and test sets
- The predicted year is not more than ten years away from the true year
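
The ten-year criterion can be related to the MSE values reported below: if every prediction is at most ten years off, the MSE is at most 100. The following is a minimal sketch, assuming plain Python lists of true and predicted years; the function names and the example years are illustrative and not part of the notebook.

```python
import math

def rmse_from_mse(mse):
    """Root mean squared error, in years, for a given MSE (years squared)."""
    return math.sqrt(mse)

def share_within_tolerance(y_true, y_pred, tolerance=10):
    """Fraction of documents whose predicted year is within `tolerance` years."""
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical years, for illustration only
y_true = [1650, 1700, 1800, 1850]
y_pred = [1655, 1730, 1795, 1862]

print(share_within_tolerance(y_true, y_pred))  # share of documents within ten years
print(rmse_from_mse(100))                      # the ten-year tolerance expressed as an RMSE
```

Read this way, a validation MSE of about 3050 corresponds to an RMSE of roughly 55 years, which is far from the ten-year target.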

*Corpora*:

- DTA

*Result*: 

Best parameters: 3 merges with 30 features each (reg_12). Configurations with 100 features are only slightly worse: reg_4, reg_5, reg_6 and reg_7, which spread 100 features over different numbers of merges, are all close to each other.

***Baselines to beat (Exp. 1b)***:

*MSE DTA Train*: 2259.8 | 2755.29

*MSE DTA Val*: 3202.51 | 3050.20

*MSE DTA Test*: 4504.35 | 3103.35

*MSE over GERMANC val*: 3700.60 | 4384.34

*MSE over GERMANC test*: 3565.05 | 4281.22

Setup:

- n_best: 30
- merges: 3

----------------------------------------------------------------------------------------------------------------------

*MSE CLMET Train*: 1346.26 | 2541.43

*MSE CLMET Val*: 2727.37 | 2702.58

*MSE CLMET Test*: 4315.18 | 3757.33

*MSE over ARCHER Val*: 10212.14 | 10004.50

*MSE over ARCHER Test*: 9784.28 | 9657.31

Setup:

- merges: 1
- n_best: 15




----------------------------------------------------------------------------------------------------------------------

*MSE ARCHER Train*: 3555.68 | 4281.68

*MSE ARCHER Val*: 4843.06 | 5276.70

*MSE ARCHER Test*: 4939.51 | 4770.15

*MSE over CLMET Val*: 6437126.77 | 5256255.22

*MSE over CLMET Test*: 6689227.44 | 10656247.10

Setup:

- merges: 3
- n_best: 35

----------------------------------------------------------------------------------------------------------------------

*MSE GERMANC Train*: 348.830 | 1215.24

*MSE GERMANC Val*: 398.94 | 1845.08

*MSE GERMANC Test*: 649.33 | 1448.88

*MSE over DTA val*: 8400009.85 | 25187766.76

*MSE over DTA test*: 21896163.78 | 18349238.65

Setup:

- merges: 1
- n_best: 18


In [17]:
import pandas as pd
import numpy as np
import nltk

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest , f_regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import sklearn.utils
from sklearn.preprocessing import FunctionTransformer

from Selfwritten_modules.SubwordTransformer import SubwordTransformer

import re

import eli5

In [18]:
# Tokenizer: strip the list brackets from the document string, then split on commas.
# Note: the character class also removes '(', ')', '+' and '|'.
def tokenizer_word(doc):
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    doc = re.split(',', doc)
    return doc
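
To make the tokenizer's behavior concrete, here is a small self-contained check on a made-up input. It assumes, as the file names suggest, that the `Text` column stores each document as a bracketed, comma-separated token list; the sample string is hypothetical.

```python
import re

def tokenizer_word(doc):
    # Same logic as above: drop bracket characters, then split on commas
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    return re.split(',', doc)

sample = "[der,mensch,ist,frey]"  # hypothetical document string
tokens = tokenizer_word(sample)
print(tokens)
```

Whitespace and quotation marks are not removed, so an input such as `"['der', 'mensch']"` would yield tokens that still carry their quotes and leading spaces.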

In [19]:
# Collect global feature weights and per-document explanations to see how features are weighted
def collect_predictions(dataset, classifier, vectorizer, feature_names, pipeline):
    # Global weights of the fitted regressor
    predictions = eli5.explain_weights_df(classifier, vec=vectorizer, feature_names=feature_names)
    predictions = predictions.drop(['target'], axis=1)
    predictions['YEAR'] = 0

    for index in dataset.index.values:
        # Contribution of each feature to the prediction for this document
        pred = eli5.explain_prediction_df(classifier, dataset[index], vec=vectorizer, feature_names=feature_names)

        # Predict the year for this single document
        source_text = pd.DataFrame([[dataset[index]]])
        year_pred = pipeline.predict(source_text[0])

        pred['weight_value'] = pred['weight'] * pred['value']
        pred['instance'] = index
        pred = pred.drop(['target', 'value'], axis=1)
        pred['YEAR'] = np.round(year_pred[0])

        predictions = pd.concat([predictions, pred])

    # Rows without per-instance columns (the global weights) contain NaN and are dropped
    return predictions.dropna()

In [20]:
train_full = pd.read_csv('/Volumes/Korpora/Train/DTA_train_tokenized.csv', sep=';')
val_full = pd.read_csv('/Volumes/Korpora/Val/DTA_val_tokenized.csv', sep=';')
test_full = pd.read_csv('/Volumes/Korpora/Test/DTA_test_tokenized.csv', sep=';')

In [21]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
print('Length test set: ', len(test_full))

Length train set:  899
Length validation set:  225
Length test set:  281


In [22]:
train_x = train_full['Text']
train_y = train_full['Publication_year']

val_x = val_full['Text']
val_y = val_full['Publication_year']

test_x = test_full['Text']
test_y = test_full['Publication_year']

The pipeline takes a long time to run. To determine where the problem is, I'm going to go through the subword transformer step by step.

In [7]:
subwords = SubwordTransformer(tokenizer=tokenizer_word)

subwords.fit(train_x)

KeyboardInterrupt: 

In [10]:
len(subwords.get_feature_names())

50

So the subword algorithm seems to extract only 10 subwords. This is the number of merges the algorithm performs by default.
It also seems to need about one minute per merge. It might be useful to extract the k best pairs at each merge step instead of only the single best pair.

Transforming the data initially took very long; after a fix, it is now faster.

Update: Parallelizing the subword extraction really helped to gather more features in less time. When runtime matters, I would therefore suggest increasing the number of features extracted per merge step (n_best) rather than the number of merges.
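
The trade-off between the number of merges and the number of pairs taken per merge can be illustrated with a simplified BPE-style learner. This is a sketch under assumptions, not the `SubwordTransformer` used in this notebook: the function names are hypothetical, and it assumes that `n_best` pairs are merged per step, so that at most `number_of_merges * n_best` subwords result.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs over all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for word in words:
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += 1
    return pairs

def merge_pair(word, pair):
    """Replace every occurrence of `pair` in `word` by the merged symbol."""
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return tuple(merged)

def learn_subwords(tokens, number_of_merges=3, n_best=2):
    """Learn up to number_of_merges * n_best subwords from a list of tokens."""
    words = [tuple(token) for token in tokens]
    subwords = []
    for _ in range(number_of_merges):
        best = [pair for pair, _ in count_pairs(words).most_common(n_best)]
        if not best:
            break
        for pair in best:
            new_words = [merge_pair(word, pair) for word in words]
            if new_words != words:  # skip pairs destroyed by an earlier merge in this step
                subwords.append(pair[0] + pair[1])
                words = new_words
    return subwords

tokens = ["ch", "ch", "ich", "icht"]  # toy corpus, for illustration only
learned = learn_subwords(tokens, number_of_merges=2, n_best=1)
print(learned)
```

With a single merge, only pairs of single characters can be learned; longer subwords need several merge steps, because each step builds on the symbols created by the previous one.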

In [11]:
train_x_transformed = subwords.transform(train_x)

In [7]:
reg_1 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=5, n_best = 50)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [8]:
reg_1.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=50, number_of_merges=5,
                                    tokenizer=<function tokenizer_word at 0x12214a7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [9]:
y_pred_train = reg_1.predict(train_x)
mean_squared_error(train_y, y_pred_train)

1884.2910597026605

In [10]:
y_pred_val = reg_1.predict(val_x)

mean_squared_error(val_y, y_pred_val)

5574.163315600897

In [11]:
reg_2 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=10, n_best = 25)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [12]:
reg_2.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=25, number_of_merges=10,
                                    tokenizer=<function tokenizer_word at 0x12214a7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [13]:
y_pred_train = reg_2.predict(train_x)
mean_squared_error(train_y, y_pred_train)

1868.284018370914

In [14]:
y_pred_val = reg_2.predict(val_x)

mean_squared_error(val_y, y_pred_val)

4819.751780050508

In [15]:
reg_3 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 250)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [16]:
reg_3.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=250, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x12214a7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [17]:
y_pred_train = reg_3.predict(train_x)
mean_squared_error(train_y, y_pred_train)

1981.3836256602808

In [18]:
y_pred_val = reg_3.predict(val_x)

mean_squared_error(val_y, y_pred_val)

4711.995368986996

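The subword models do not beat the baseline yet: the best validation MSE so far is 4712.00 (reg_3), against 3202.51 for the baseline. Going from 5 to 10 merges improves the validation MSE slightly (reg_1 vs. reg_2), although the single-merge model reg_3 does best. All three models also use more features (250) than the linear regression on which the baseline was trained (215). So let's vary the number of features.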
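
The settings tried next can be enumerated systematically. The sketch below assumes that the number of features equals `number_of_merges * n_best`, as the notes in this notebook imply; the helper name `merge_settings` is hypothetical.

```python
def merge_settings(total_features):
    """All (number_of_merges, n_best) pairs whose product equals total_features."""
    return [(merges, total_features // merges)
            for merges in range(1, total_features + 1)
            if total_features % merges == 0]

# Every way to split a budget of 100 features into merges and pairs per merge
for merges, n_best in merge_settings(100):
    print(f"number_of_merges={merges}, n_best={n_best}")
```

Each pair could then be passed to `SubwordTransformer(number_of_merges=..., n_best=...)` in a loop instead of defining every pipeline by hand.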

In [19]:
reg_4 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=10, n_best = 10)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [20]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=10, number_of_merges=10,
                                    tokenizer=<function tokenizer_word at 0x12214a7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [21]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2755.1225760989837

In [22]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3203.671297418154

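With 100 features spread over many merges and few subwords per merge, the gap between training and validation MSE shrinks considerably (2755.12 vs. 3203.67), and the validation MSE is almost at the baseline (3202.51).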

In [23]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=5, n_best = 20)),
                ('ridge_reg', linear_model.Ridge())])

In [24]:
reg_5.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=20, number_of_merges=5,
                                    tokenizer=<function tokenizer_word at 0x12214a7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [25]:
y_pred_train = reg_5.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2715.8187231216534

In [26]:
y_pred_val = reg_5.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3321.302060513491

In [7]:
reg_6 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 50)),
                ('ridge_reg', linear_model.Ridge())])

In [8]:
reg_6.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=50, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x11d5e57a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [9]:
y_pred_train = reg_6.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2807.3324608991907

In [10]:
y_pred_val = reg_6.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3554.4515758892103

In [11]:
reg_7 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 100)),
                ('ridge_reg', linear_model.Ridge())])

In [12]:
reg_7.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=100, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x11d5e57a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [13]:
y_pred_train = reg_7.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2786.7084867524163

In [14]:
y_pred_val = reg_7.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3545.195196886467

In [15]:
reg_8 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 60)),
                ('ridge_reg', linear_model.Ridge())])

In [16]:
reg_8.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=60, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x11d5e57a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [17]:
y_pred_train = reg_8.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2705.775202889078

In [18]:
y_pred_val = reg_8.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3155.8639009070366

In [19]:
reg_9 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 80)),
                ('ridge_reg', linear_model.Ridge())])

In [20]:
reg_9.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=80, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x11d5e57a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [21]:
y_pred_train = reg_9.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2410.0902571511806

In [22]:
y_pred_val = reg_9.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3394.587824149019

In [23]:
reg_10 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 70)),
                ('ridge_reg', linear_model.Ridge())])

In [24]:
reg_10.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=70, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x11d5e57a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [25]:
y_pred_train = reg_10.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2514.3991782509506

In [26]:
y_pred_val = reg_10.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3627.108279692615

In [27]:
reg_11 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 50)),
                ('ridge_reg', linear_model.Ridge())])

In [28]:
reg_11.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=50, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x11d5e57a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [29]:
y_pred_train = reg_11.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2383.553752870829

In [30]:
y_pred_val = reg_11.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4739.662481449634

In [7]:
reg_12 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 30)),
                ('ridge_reg', linear_model.Ridge())])

In [8]:
reg_12.fit(train_x, train_y)

KeyboardInterrupt: 

In [33]:
y_pred_train = reg_12.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2755.290835551211

In [34]:
y_pred_val = reg_12.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3050.2027254566788

In [43]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 45)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [44]:
reg_13.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=45, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x11d5e57a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [45]:
y_pred_train = reg_13.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2843.7430417615674

In [46]:
y_pred_val = reg_13.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3407.370387999431

In [81]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 90)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [82]:
reg_13.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=90, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [83]:
y_pred_train = reg_13.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2826.538282468166

In [84]:
y_pred_val = reg_13.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3465.077170534083

**Try out best settings on different corpora**

In [23]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 30)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [24]:
reg_5.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=30, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x1f0d8cdd0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [25]:
y_pred_train = reg_5.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2755.290835551211

In [26]:
y_pred_val = reg_5.predict(val_x)
mean_squared_error(val_y, y_pred_val)

3050.2027254566788

In [27]:
y_pred_test = reg_5.predict(test_x)
mean_squared_error(test_y, y_pred_test)

3103.3538338108283

In [28]:
features_selected = reg_5['subwords'].get_feature_names()

eli5.explain_weights(reg_5['ridge_reg'],vec=reg_5['subwords'], feature_names=features_selected)

Weight?,Feature
+1777.624,<BIAS>
+0.058,dur
+0.054,men
+0.028,dem
+0.025,icht
+0.024,ir
+0.020,mm
+0.020,ig
+0.018,aͤ
+0.016,au
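
The weights above can be read as years per unit of feature value: a ridge regression predicts the bias plus the sum of weight times feature value, which is what `weight_value` in `collect_predictions` stores per feature. The following sketch uses three weights from the output above; the counts are made up, and it assumes that the feature value is the number of occurrences of the subword in a document.

```python
# Weights copied from the eli5 output above; the counts are hypothetical
bias = 1777.624
weights = {"dur": 0.058, "men": 0.054, "dem": 0.028}
counts = {"dur": 200, "men": 500, "dem": 300}  # made-up subword counts for one document

# Contribution of each feature = weight * value
contributions = {feature: weights[feature] * counts[feature] for feature in weights}
predicted_year = bias + sum(contributions.values())

for feature, contribution in contributions.items():
    print(f"{feature}: {contribution:+.1f} years")
print(f"predicted year: {predicted_year:.1f}")
```

Because all listed weights are small and positive, the prediction moves away from the bias of about 1778 only for documents with large feature values.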


In [54]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

diff_pred_true_train = pd.concat([y_pred_train, train_y], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Publication_year
    

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Publication_year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Publication_year

In [57]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp2_results/DTA_Exp2_Reg12_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp2_results/DTA_Exp2_Reg12_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp2_results/DTA_Exp2_Reg12_Labels_test.csv',sep=';')

In [29]:
train_details = collect_predictions(train_x, reg_5['ridge_reg'],reg_5['subwords'],features_selected, reg_5)
val_details = collect_predictions(val_x, reg_5['ridge_reg'],reg_5['subwords'],features_selected, reg_5)
test_details = collect_predictions(test_x, reg_5['ridge_reg'],reg_5['subwords'],features_selected, reg_5)

In [30]:
train_details.to_csv('/Volumes/Korpora/Exp2_results/DTA_Exp2_Reg12_Train_results.csv',sep=';')
val_details.to_csv('/Volumes/Korpora/Exp2_results/DTA_Exp2_Reg12_Val_results.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/Exp2_results/DTA_Exp2_Reg12_Test_results.csv', sep=';')

In [8]:
GERMANC_train_full = pd.read_csv('/Volumes/Korpora/Train/GERMANC_train_tokenized.csv', sep=';')
GERMANC_val_full = pd.read_csv('/Volumes/Korpora/Val/GERMANC_val_tokenized.csv', sep=';')
GERMANC_test_full = pd.read_csv('/Volumes/Korpora/Test/GERMANC_test_tokenized.csv', sep=';')

In [9]:
GERMANC_train_full = GERMANC_train_full[(GERMANC_train_full.Year.str.len()== 4) & (GERMANC_train_full.Year.str.isnumeric())]

GERMANC_val_full = GERMANC_val_full[(GERMANC_val_full.Year.str.len()== 4) & (GERMANC_val_full.Year.str.isnumeric())]

GERMANC_test_full = GERMANC_test_full[(GERMANC_test_full.Year.str.len()== 4) & (GERMANC_test_full.Year.str.isnumeric())]

In [10]:
print('Length train set: ',len(GERMANC_train_full))
print('Length validation set: ', len(GERMANC_val_full))
print('Length test set: ', len(GERMANC_test_full))

Length train set:  177
Length validation set:  40
Length test set:  56


In [11]:
GERMANC_train_x = GERMANC_train_full['Text']
GERMANC_train_y = GERMANC_train_full['Year'].astype(int)

GERMANC_val_x = GERMANC_val_full['Text']
GERMANC_val_y = GERMANC_val_full['Year'].astype(int)

GERMANC_test_x = GERMANC_test_full['Text']
GERMANC_test_y = GERMANC_test_full['Year'].astype(int)

In [12]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

NameError: name 'reg_5' is not defined

In [64]:
y_pred_test = reg_5.predict(GERMANC_test_x)
mean_squared_error(GERMANC_test_y, y_pred_test)

4281.221454400298

In [65]:

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

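# reset_index aligns the filtered labels with the 0-based index of the predictions
diff_pred_true_val = pd.concat([y_pred_val, GERMANC_val_y.reset_index(drop=True)], axis=1)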

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

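# reset_index aligns the filtered labels with the 0-based index of the predictions
diff_pred_true_test = pd.concat([y_pred_test, GERMANC_test_y.reset_index(drop=True)], axis=1)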

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [66]:
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp2_results/DTA_over_GERMANC_Exp2_Reg12_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp2_results/DTA_over_GERMANC_Exp2_Reg12_Labels_test.csv',sep=';')

In [67]:
val_details = collect_predictions(GERMANC_val_x, reg_5['ridge_reg'],reg_5['subwords'],features_selected, reg_5)
test_details = collect_predictions(GERMANC_test_x, reg_5['ridge_reg'],reg_5['subwords'],features_selected, reg_5)

In [68]:
val_details.to_csv('/Volumes/Korpora/Exp2_results/DTA_over_GERMANC_Exp2_Reg12_Val_results.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/Exp2_results/DTA_over_GERMANC_Exp2_Reg12_Test_results.csv', sep=';')

**GERMANC**

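To transfer the best DTA setup (899 training documents, 3 merges, n_best of 30, i.e. 90 features) to the smaller GERMANC corpus, the following ratios are kept:

- Ratio features to documents (DTA): 10%
- Ratio n_best to features: 30%
- Ratio merges to features: 3%
- Ratio merges to n_best (DTA): 10%

For GERMANC (177 training documents), this gives:

- Features: 18
- n_best: 6
- merges: 3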
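
The scaling from DTA to GERMANC can be reproduced with a few lines of arithmetic. This is a sketch, assuming that the number of features is about 10% of the training documents and that the number of merges stays at 3; the helper name `scale_setup` is hypothetical.

```python
def scale_setup(n_documents, feature_ratio=0.10, merges=3):
    """Scale the subword setup to a corpus size: features ~ feature_ratio * documents."""
    features = round(n_documents * feature_ratio)
    # Round to a multiple of `merges` so that n_best is an integer
    features = max(merges, merges * round(features / merges))
    return {"features": features, "n_best": features // merges, "merges": merges}

dta_setup = scale_setup(899)      # size of the DTA training set
germanc_setup = scale_setup(177)  # size of the GERMANC training set
print(dta_setup)
print(germanc_setup)
```

For 899 documents this reproduces the DTA setting of 90 features, and for 177 documents it yields the 18 features used as the starting point for GERMANC.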

In [13]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 6)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [14]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=6, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [15]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1252.1329625948

In [16]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

1837.4256299496271

In [17]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 9)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [18]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=9, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [19]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1227.8967580954352

In [20]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2298.445528214307

In [25]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 18)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [26]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=18, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [27]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1215.243411892883

In [28]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

1845.084380748532

In [29]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 16)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [32]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=16, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [33]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1218.6727753643434

In [34]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

1872.3747755339136

In [37]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 8)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [39]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=8, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [40]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1165.3662814718596

In [41]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2109.555160766239

In [42]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=4, n_best = 4)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [43]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=4, number_of_merges=4,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [44]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1352.0319413592565

In [45]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

1966.1060806754872

In [46]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=4, n_best = 5)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [47]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=5, number_of_merges=4,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [48]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1079.9454947377378

In [49]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

1955.5357499392271

In [50]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=6, n_best = 3)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [51]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=3, number_of_merges=6,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [52]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1162.6032994633238

In [53]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2114.4557170618946

In [54]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 7)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [55]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=7, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [56]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1073.1652891814635

In [57]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2070.0836070202276

The best-performing setting so far is 1 merge with 18 features. However, a single merge only produces bigrams. 3 merges with 6 features each reach almost the same error (validation MSE 1837.43 vs. 1845.08, training MSE 1252.13 vs. 1215.24) and might provide better qualitative results.
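
Why a single merge can only yield bigrams is easiest to see on a toy example. The sketch below is not the `SubwordTransformer` used in this notebook (its implementation is not shown here). It assumes that one "merge" is a round in which the `n_best` most frequent adjacent symbol pairs are fused, starting from single characters. The helper names `merge_round` and `learn_subwords` are hypothetical.

```python
from collections import Counter


def merge_round(words, n_best):
    """Fuse the n_best most frequent adjacent symbol pairs in every word."""
    pair_counts = Counter()
    for symbols in words:
        pair_counts.update(zip(symbols, symbols[1:]))
    best_pairs = [pair for pair, _ in pair_counts.most_common(n_best)]
    best_set = set(best_pairs)

    merged_words = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) in best_set:
                merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_words.append(merged)
    return merged_words, ["".join(pair) for pair in best_pairs]


def learn_subwords(text, number_of_merges, n_best):
    """Hypothetical helper: collect the subwords created in each merge round."""
    words = [list(word) for word in text.split()]
    features = []
    for _ in range(number_of_merges):
        words, new_subwords = merge_round(words, n_best)
        for subword in new_subwords:
            if subword not in features:
                features.append(subword)
    return features


toy_text = "schule schon schiff die die ich"
one_round = learn_subwords(toy_text, number_of_merges=1, n_best=2)
two_rounds = learn_subwords(toy_text, number_of_merges=2, n_best=2)
print(one_round)
print(two_rounds)
```

Under these assumptions, the first round starts from single characters, so every subword it creates has exactly two characters. Longer units such as `sch` can only appear from the second round on, when an already merged symbol is fused again.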

In [63]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 18)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [64]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=18, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [65]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1215.243411892883

In [66]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

1845.084380748532

In [67]:
selected_features = reg_13['subwords'].get_feature_names()

eli5.explain_weights(reg_13['ridge_reg'], vec=reg_13['subwords'], feature_names = selected_features)

| Weight? | Feature |
|---|---|
| 1694.122 | `<BIAS>` |
| 0.298 | un |
| 0.222 | es |
| 0.19 | ei |
| 0.174 | ie |
| 0.141 | en |
| 0.108 | ne |
| 0.101 | de |
| 0.042 | st |
| 0.029 | er |
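
The weight table can be read as a linear model: the predicted year is the bias plus the sum of each weight times the corresponding feature value. The sketch below uses the bias and the nine weights listed above. The feature values in `example_counts` are hypothetical, and treating them as raw subword counts is an assumption, since the output of `SubwordTransformer` is not shown here.

```python
# Bias and weights copied from the eli5 output above (1 merge, 18 features;
# only the nine listed features are used here).
BIAS = 1694.122
WEIGHTS = {"un": 0.298, "es": 0.222, "ei": 0.19, "ie": 0.174, "en": 0.141,
           "ne": 0.108, "de": 0.101, "st": 0.042, "er": 0.029}


def predict_year(feature_values):
    """Linear prediction: bias plus the weighted sum of the feature values."""
    return BIAS + sum(WEIGHTS[name] * value for name, value in feature_values.items())


# Hypothetical feature values for one text (assumed to be raw counts).
example_counts = {"un": 100, "es": 50}
print(round(predict_year(example_counts), 3))
```

Because all listed weights are positive, the prediction under this reading can only move upward from the bias as feature values grow.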


In [68]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 6)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [69]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=6, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [70]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1252.1329625948

In [71]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

1837.4256299496271

In [72]:
selected_features = reg_13['subwords'].get_feature_names()

eli5.explain_weights(reg_13['ridge_reg'], vec=reg_13['subwords'], feature_names = selected_features)

| Weight? | Feature |
|---|---|
| 1694.783 | `<BIAS>` |
| 0.508 | sch |
| 0.366 | die |
| 0.289 | un |
| 0.244 | in |
| 0.177 | ei |
| 0.171 | ich |
| 0.076 | de |
| 0.053 | en |
| 0.032 | au |


Since it is difficult to decide between the two settings, let's look at how the regression performs on the DTA.

In [73]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 6)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [74]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=6, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [75]:
y_pred_val = reg_13.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4349956.692489209

In [76]:
y_pred_test = reg_13.predict(test_x)
mean_squared_error(test_y, y_pred_test)

4577148.77974876

In [89]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 18)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [90]:
reg_13.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=18, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x120a2d7a0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [91]:
y_pred_train = reg_13.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1215.243411892883

In [92]:
y_pred_val = reg_13.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

1845.084380748532

In [93]:
y_pred_test = reg_13.predict(GERMANC_test_x)
mean_squared_error(GERMANC_test_y, y_pred_test)

1448.876310478287

In [95]:
DTA_y_pred_val = reg_13.predict(val_x)
mean_squared_error(val_y, DTA_y_pred_val)

25187766.758683134

In [97]:
DTA_y_pred_test = reg_13.predict(test_x)
mean_squared_error(test_y, DTA_y_pred_test)

18349238.647847947

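Both settings transfer poorly to the DTA, but the bigrams do not come out ahead there: the bigram model (1 merge, 18 features) reaches an MSE of 25187766.76 on the DTA validation set and 18349238.65 on the DTA test set, compared with 4349956.69 and 4577148.78 for 3 merges with 6 features. Within GERMANC the two settings are nearly tied on the validation set (1845.08 vs. 1837.43). The remaining cells export the predictions of the bigram model fitted above.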
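
For a sense of scale, the MSE values reported above can be converted to root mean squared errors, which are in years and therefore directly comparable to the ten-year success criterion of this experiment. The sketch below only takes square roots of numbers copied from the cell outputs. The dictionary keys are labels chosen for this illustration.

```python
import math

# MSE values copied from the cell outputs above.
mse = {
    "GERMANC val, 1 merge / 18": 1845.084380748532,
    "GERMANC val, 3 merges / 6": 1837.4256299496271,
    "DTA val, 1 merge / 18": 25187766.758683134,
    "DTA val, 3 merges / 6": 4349956.692489209,
    "DTA test, 1 merge / 18": 18349238.647847947,
    "DTA test, 3 merges / 6": 4577148.77974876,
}

# RMSE is the square root of the MSE and has the unit of the target (years).
rmse = {name: math.sqrt(value) for name, value in mse.items()}
for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.1f} years")
```

Within GERMANC both settings are off by roughly 43 years on the validation set. On the DTA the RMSE is in the thousands of years for both settings.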

In [98]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

diff_pred_true_train = pd.concat([y_pred_train, GERMANC_train_y], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year


y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, GERMANC_val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, GERMANC_test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year


# use the DTA predictions here, not the GERMANC validation predictions
DTA_y_pred_val = pd.DataFrame(DTA_y_pred_val, columns=['Predicted_y'])

DTA_diff_pred_true_val = pd.concat([DTA_y_pred_val, val_y], axis=1)

DTA_diff_pred_true_val['Difference'] = DTA_diff_pred_true_val.Predicted_y - DTA_diff_pred_true_val.Publication_year


DTA_y_pred_test = pd.DataFrame(DTA_y_pred_test, columns=['Predicted_y'])

DTA_diff_pred_true_test = pd.concat([DTA_y_pred_test, test_y], axis=1)

DTA_diff_pred_true_test['Difference'] = DTA_diff_pred_true_test.Predicted_y - DTA_diff_pred_true_test.Publication_year




In [99]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp2_results/GERMANC_Exp2_Reg12_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp2_results/GERMANC_Exp2_Reg12_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp2_results/GERMANC_Exp2_Reg12_Labels_test.csv',sep=';')
DTA_diff_pred_true_val.to_csv('/Volumes/Korpora/Exp2_results/GERMANC_over_DTA_Exp2_Reg12_Labels_val.csv',sep=';')
DTA_diff_pred_true_test.to_csv('/Volumes/Korpora/Exp2_results/GERMANC_over_DTA_Exp2_Reg12_Labels_test.csv',sep=';')

In [101]:
features_selected = reg_13['subwords'].get_feature_names()

train_details = collect_predictions(GERMANC_train_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)
val_details = collect_predictions(GERMANC_val_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)
test_details = collect_predictions(GERMANC_test_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)

DTA_val_details = collect_predictions(val_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)
DTA_test_details = collect_predictions(test_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)

In [102]:
train_details.to_csv('/Volumes/Korpora/Exp2_results/GERMANC_Exp2_Reg12_train_results.csv', sep=';')
val_details.to_csv('/Volumes/Korpora/Exp2_results/GERMANC_Exp2_Reg12_Val_results.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/Exp2_results/GERMANC_Exp2_Reg12_Test_results.csv', sep=';')

DTA_val_details.to_csv('/Volumes/Korpora/Exp2_results/GERMANC_over_DTA_Exp2_Reg12_Val_results.csv', sep=';')
DTA_test_details.to_csv('/Volumes/Korpora/Exp2_results/GERMANC_over_DTA_Exp2_Reg12_Test_results.csv', sep=';')

**ARCHER**

In [4]:
ARCHER_train_full = pd.read_csv('/Volumes/Korpora/Train/ARCHER_train_tokenized.csv', sep=';')
ARCHER_val_full = pd.read_csv('/Volumes/Korpora/Val/ARCHER_val_tokenized.csv', sep=';')
ARCHER_test_full = pd.read_csv('/Volumes/Korpora/Test/ARCHER_test_tokenized.csv', sep=';')

In [5]:
print('Length train set: ',len(ARCHER_train_full))
print('Length validation set: ', len(ARCHER_val_full))
print('Length test set: ', len(ARCHER_test_full))

Length train set:  1093
Length validation set:  274
Length test set:  342


In [6]:
ARCHER_train_full = ARCHER_train_full[(ARCHER_train_full.Year.str.len()== 4) & (ARCHER_train_full.Year.str.isnumeric())]

ARCHER_val_full = ARCHER_val_full[(ARCHER_val_full.Year.str.len()== 4) & (ARCHER_val_full.Year.str.isnumeric())]

ARCHER_test_full = ARCHER_test_full[(ARCHER_test_full.Year.str.len()== 4) & (ARCHER_test_full.Year.str.isnumeric())]

In [7]:
print('Length train set: ',len(ARCHER_train_full))
print('Length validation set: ', len(ARCHER_val_full))
print('Length test set: ', len(ARCHER_test_full))

Length train set:  1049
Length validation set:  264
Length test set:  329


In [8]:
ARCHER_train_x = ARCHER_train_full['Text']
ARCHER_train_y = ARCHER_train_full['Year'].astype(int)

ARCHER_val_x = ARCHER_val_full['Text']
ARCHER_val_y = ARCHER_val_full['Year'].astype(int)

ARCHER_test_x = ARCHER_test_full['Text']
ARCHER_test_y = ARCHER_test_full['Year'].astype(int)

- Number of features / length of train set: 105 (about one tenth of the 1049 training texts)
- Ratio features / n_best: 31-32
- Ratio features / merges: 3

-> 3 * 32 = 96

-> 105 / 3 = 35
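
The arithmetic in these notes can be written down as a small helper. This is a sketch under an assumption: the rule of roughly one feature per ten training texts is inferred from the numbers above (1049 texts, 105 features) and is not stated as a rule anywhere in the notebook. The name `feature_budget` and the parameter `texts_per_feature` are hypothetical.

```python
def feature_budget(n_train, number_of_merges, texts_per_feature=10):
    """Return (total_features, n_best) for a given train-set size.

    texts_per_feature=10 is an assumption inferred from the notes above,
    not a documented rule.
    """
    total_features = round(n_train / texts_per_feature)
    # Spread the total evenly over the merge rounds.
    n_best = total_features // number_of_merges
    return total_features, n_best


print(feature_budget(1049, number_of_merges=3))
```

With the same total, 5 merges give an `n_best` of 21 and 1 merge gives 105, which are settings that also appear in the cells below.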

In [9]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 35)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [10]:
len(ARCHER_train_x)

1049

In [11]:
len(ARCHER_train_y)

1049

In [12]:
reg_13.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=35, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [13]:
y_pred_train = reg_13.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

4281.676856135276

In [14]:
y_pred_val = reg_13.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

5276.700466856189

In [15]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=5, n_best = 21)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [16]:
reg_13.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=21, number_of_merges=5,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [17]:
y_pred_train = reg_13.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

4133.746470114735

In [18]:
y_pred_val = reg_13.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

5293.798981965828

In [19]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 31)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [20]:
reg_13.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=31, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [21]:
y_pred_train = reg_13.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

4356.331867700647

In [22]:
y_pred_val = reg_13.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

5433.115023074826

In [23]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 32)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [24]:
reg_13.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=32, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [25]:
y_pred_train = reg_13.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

4350.206391334774

In [26]:
y_pred_val = reg_13.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

5402.502509154262

In [27]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 105)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [28]:
reg_13.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=105, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [29]:
y_pred_train = reg_13.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

4360.954476525783

In [30]:
y_pred_val = reg_13.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

5257.100514169243

In [31]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 67)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [32]:
reg_13.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=67, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [33]:
y_pred_train = reg_13.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

3410.241674986492

In [34]:
y_pred_val = reg_13.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

5729.158560893348

In [35]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 40)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [36]:
reg_13.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=40, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [37]:
y_pred_train = reg_13.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

4198.645742841872

In [38]:
y_pred_val = reg_13.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

5295.735461712045
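
The ARCHER runs above are easier to compare side by side. The sketch below ranks them by validation MSE and reports the gap between validation and training error. All numbers are rounded copies of the outputs above. The variable names are chosen for this illustration.

```python
# (number_of_merges, n_best) -> (train MSE, validation MSE), rounded from the outputs above.
archer_runs = {
    (3, 35): (4281.68, 5276.70),
    (5, 21): (4133.75, 5293.80),
    (3, 31): (4356.33, 5433.12),
    (3, 32): (4350.21, 5402.50),
    (1, 105): (4360.95, 5257.10),
    (3, 67): (3410.24, 5729.16),
    (3, 40): (4198.65, 5295.74),
}

# Sort by validation MSE, lowest first.
ranked = sorted(archer_runs.items(), key=lambda item: item[1][1])
for (merges, n_best), (train_mse, val_mse) in ranked:
    gap = val_mse - train_mse
    print(f"merges={merges}, n_best={n_best}: val={val_mse:.2f}, gap={gap:.2f}")

best_setting = ranked[0][0]
largest_gap = max(archer_runs, key=lambda key: archer_runs[key][1] - archer_runs[key][0])
```

On these numbers, 1 merge with 105 features has the lowest validation MSE, closely followed by 3 merges with 35 features. The run with 3 merges and 67 features has by far the largest gap between training and validation error.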

In [57]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 35)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [58]:
reg_13.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=35, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [59]:
y_pred_train = reg_13.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

4281.676856135276

In [60]:
y_pred_val = reg_13.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

5276.700466856189

In [61]:
y_pred_test = reg_13.predict(ARCHER_test_x)
mean_squared_error(ARCHER_test_y, y_pred_test)

4770.152504351862

In [44]:
CLMET_train_full = pd.read_csv('/Volumes/Korpora/Train/CLMET_train_tokenized.csv', sep=';')
CLMET_val_full = pd.read_csv('/Volumes/Korpora/Val/CLMET_val_tokenized.csv', sep=';')
CLMET_test_full = pd.read_csv('/Volumes/Korpora/Test/CLMET_test_tokenized.csv', sep=';')

In [45]:
#drop rows with invalid data types
CLMET_train_full = CLMET_train_full[CLMET_train_full.Year.str.len()== 4]
CLMET_val_full = CLMET_val_full[CLMET_val_full.Year.str.len()== 4]
CLMET_test_full = CLMET_test_full[CLMET_test_full.Year.str.len()== 4]

In [46]:
print('Length train set: ',len(CLMET_train_full))
print('Length validation set: ', len(CLMET_val_full))
print('Length test set: ', len(CLMET_test_full))

Length train set:  186
Length validation set:  47
Length test set:  60


In [47]:
CLMET_train_x = CLMET_train_full['Text']
CLMET_train_y = CLMET_train_full['Year'].astype(int)

CLMET_val_x = CLMET_val_full['Text']
CLMET_val_y = CLMET_val_full['Year'].astype(int)

CLMET_test_x = CLMET_test_full['Text']
CLMET_test_y = CLMET_test_full['Year'].astype(int)

In [62]:
CLMET_y_pred_val = reg_13.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, CLMET_y_pred_val)

5256255.217421983

In [63]:
CLMET_y_pred_test = reg_13.predict(CLMET_test_x)
mean_squared_error(CLMET_test_y, CLMET_y_pred_test)

10656247.100333506

In [64]:
# Build one frame per split: prediction, true year, and their difference.
# reset_index(drop=True) is needed because the Year filter left gaps in the index
# of the label Series, while the prediction frames are indexed 0..n-1.
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])
diff_pred_true_train = pd.concat([y_pred_train, ARCHER_train_y.reset_index(drop=True)], axis=1)
diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])
diff_pred_true_val = pd.concat([y_pred_val, ARCHER_val_y.reset_index(drop=True)], axis=1)
diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year

y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])
diff_pred_true_test = pd.concat([y_pred_test, ARCHER_test_y.reset_index(drop=True)], axis=1)
diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

CLMET_y_pred_val = pd.DataFrame(CLMET_y_pred_val, columns=['Predicted_y'])
CLMET_diff_pred_true_val = pd.concat([CLMET_y_pred_val, CLMET_val_y.reset_index(drop=True)], axis=1)
CLMET_diff_pred_true_val['Difference'] = CLMET_diff_pred_true_val.Predicted_y - CLMET_diff_pred_true_val.Year

CLMET_y_pred_test = pd.DataFrame(CLMET_y_pred_test, columns=['Predicted_y'])
CLMET_diff_pred_true_test = pd.concat([CLMET_y_pred_test, CLMET_test_y.reset_index(drop=True)], axis=1)
CLMET_diff_pred_true_test['Difference'] = CLMET_diff_pred_true_test.Predicted_y - CLMET_diff_pred_true_test.Year
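
One detail matters when predictions and labels are combined with `pd.concat(..., axis=1)`: pandas aligns the columns on the index. The label Series come from frames that were filtered on `Year`, so their index can have gaps, whereas a DataFrame built from a prediction array is indexed 0..n-1. The toy data below is made up for illustration. It shows the effect, and how `reset_index(drop=True)` restores a row-by-row pairing.

```python
import pandas as pd

# Toy labels: one invalid year is filtered out, which leaves a gap in the index.
years = pd.Series(["1650", "16xx", "1700", "1750"], name="Year")
valid_years = years[(years.str.len() == 4) & (years.str.isnumeric())].astype(int)

# Toy predictions for the three remaining texts, indexed 0..2.
preds = pd.DataFrame([1660.0, 1690.0, 1745.0], columns=["Predicted_y"])

# Aligned on index: rows are paired by label, and missing labels become NaN.
misaligned = pd.concat([preds, valid_years], axis=1)

# Aligned by position: drop the old index of the labels first.
aligned = pd.concat([preds, valid_years.reset_index(drop=True)], axis=1)
aligned["Difference"] = aligned.Predicted_y - aligned.Year

print(misaligned)
print(aligned)
```

In the misaligned frame the second prediction is paired with a missing year and the last label has no prediction, so the differences written to CSV would be shifted or empty for those rows.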



In [65]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp2_results/ARCHER_Exp2_Reg12_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp2_results/ARCHER_Exp2_Reg12_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp2_results/ARCHER_Exp2_Reg12_Labels_test.csv',sep=';')
CLMET_diff_pred_true_val.to_csv('/Volumes/Korpora/Exp2_results/ARCHER_over_CLMET_Exp2_Reg12_Labels_val.csv',sep=';')
CLMET_diff_pred_true_test.to_csv('/Volumes/Korpora/Exp2_results/ARCHER_over_CLMET_Exp2_Reg12_Labels_test.csv',sep=';')

In [66]:
features_selected = reg_13['subwords'].get_feature_names()

train_details = collect_predictions(ARCHER_train_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)
val_details = collect_predictions(ARCHER_val_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)
test_details = collect_predictions(ARCHER_test_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)

CLMET_val_details = collect_predictions(CLMET_val_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)
CLMET_test_details = collect_predictions(CLMET_test_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)

In [67]:
train_details.to_csv('/Volumes/Korpora/Exp2_results/ARCHER_Exp2_Reg12_train_results.csv', sep=';')
val_details.to_csv('/Volumes/Korpora/Exp2_results/ARCHER_Exp2_Reg12_Val_results.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/Exp2_results/ARCHER_Exp2_Reg12_Test_results.csv', sep=';')

CLMET_val_details.to_csv('/Volumes/Korpora/Exp2_results/ARCHER_over_CLMET_Exp2_Reg12_Val_results.csv', sep=';')
CLMET_test_details.to_csv('/Volumes/Korpora/Exp2_results/ARCHER_over_CLMET_Exp2_Reg12_Test_results.csv', sep=';')

**CLMET**

- Features: 18-19
- Features per merge: 5-6

In [69]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 6)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [70]:
reg_13.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=6, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [71]:
y_pred_train = reg_13.predict(CLMET_train_x)
mean_squared_error(CLMET_train_y, y_pred_train)

2500.447473237334

In [72]:
y_pred_val = reg_13.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

2711.6210952685706

In [73]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 5)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [74]:
reg_13.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=5, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [75]:
y_pred_train = reg_13.predict(CLMET_train_x)
mean_squared_error(CLMET_train_y, y_pred_train)

2540.3313013298653

In [76]:
y_pred_val = reg_13.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

2665.957893475361

In [77]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 15)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [78]:
reg_13.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=15, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [79]:
y_pred_train = reg_13.predict(CLMET_train_x)
mean_squared_error(CLMET_train_y, y_pred_train)

2541.429198465812

In [80]:
y_pred_val = reg_13.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

2702.579369561798

In [106]:
reg_13 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 15)),
                ('ridge_reg', linear_model.Ridge())
                        ])

In [107]:
reg_13.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=15, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x129c1ab90>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [108]:
y_pred_train = reg_13.predict(CLMET_train_x)
mean_squared_error(CLMET_train_y, y_pred_train)

2541.429198465812

In [109]:
y_pred_val = reg_13.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

2702.579369561798

In [110]:
y_pred_test = reg_13.predict(CLMET_test_x)
mean_squared_error(CLMET_test_y, y_pred_test)

3757.325270663924

In [111]:
ARCHER_y_pred_val = reg_13.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, ARCHER_y_pred_val)

10004.496547398534

In [112]:
ARCHER_y_pred_test = reg_13.predict(ARCHER_test_x)
mean_squared_error(ARCHER_test_y, ARCHER_y_pred_test)

9657.311426043334
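
The header of this experiment lists a pair of values for each CLMET and ARCHER metric under "Baselines to beat (Exp. 1b)". The sketch below assumes that the first value of each pair is the Exp. 1b baseline and the second is the result of the setting used here (merges: 1, n_best: 15). The second values match the outputs above after rounding, but the reading of the first column is an assumption. The code computes the relative change in percent.

```python
# metric -> (assumed Exp. 1b baseline MSE, MSE from the cells above), as listed in the header.
clmet_results = {
    "CLMET train": (1346.26, 2541.43),
    "CLMET val": (2727.37, 2702.58),
    "CLMET test": (4315.18, 3757.33),
    "ARCHER val": (10212.14, 10004.50),
    "ARCHER test": (9784.28, 9657.31),
}

# Relative change in percent; negative values mean a lower (better) MSE than the baseline.
changes = {name: (new - old) / old * 100 for name, (old, new) in clmet_results.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f} %")
```

Under this reading, the training error is higher than the baseline, while the validation and test errors on CLMET and on ARCHER are all lower.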

In [105]:
# Build one frame per split: prediction, true year, and their difference.
# reset_index(drop=True) is needed because the Year filter left gaps in the index
# of the label Series, while the prediction frames are indexed 0..n-1.
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])
diff_pred_true_train = pd.concat([y_pred_train, CLMET_train_y.reset_index(drop=True)], axis=1)
diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])
diff_pred_true_val = pd.concat([y_pred_val, CLMET_val_y.reset_index(drop=True)], axis=1)
diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year

y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])
diff_pred_true_test = pd.concat([y_pred_test, CLMET_test_y.reset_index(drop=True)], axis=1)
diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

ARCHER_y_pred_val = pd.DataFrame(ARCHER_y_pred_val, columns=['Predicted_y'])
ARCHER_diff_pred_true_val = pd.concat([ARCHER_y_pred_val, ARCHER_val_y.reset_index(drop=True)], axis=1)
ARCHER_diff_pred_true_val['Difference'] = ARCHER_diff_pred_true_val.Predicted_y - ARCHER_diff_pred_true_val.Year

ARCHER_y_pred_test = pd.DataFrame(ARCHER_y_pred_test, columns=['Predicted_y'])
ARCHER_diff_pred_true_test = pd.concat([ARCHER_y_pred_test, ARCHER_test_y.reset_index(drop=True)], axis=1)
ARCHER_diff_pred_true_test['Difference'] = ARCHER_diff_pred_true_test.Predicted_y - ARCHER_diff_pred_true_test.Year



In [90]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp2_results/CLMET_Exp2_Reg12_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp2_results/CLMET_Exp2_Reg12_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp2_results/CLMET_Exp2_Reg12_Labels_test.csv',sep=';')
ARCHER_diff_pred_true_val.to_csv('/Volumes/Korpora/Exp2_results/CLMET_over_ARCHER_Exp2_Reg12_Labels_val.csv',sep=';')
ARCHER_diff_pred_true_test.to_csv('/Volumes/Korpora/Exp2_results/CLMET_over_ARCHER_Exp2_Reg12_Labels_test.csv',sep=';')

In [91]:
features_selected = reg_13['subwords'].get_feature_names()

train_details = collect_predictions(CLMET_train_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)
val_details = collect_predictions(CLMET_val_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)
test_details = collect_predictions(CLMET_test_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)

ARCHER_val_details = collect_predictions(ARCHER_val_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)
ARCHER_test_details = collect_predictions(ARCHER_test_x, reg_13['ridge_reg'],reg_13['subwords'],features_selected, reg_13)

In [92]:
train_details.to_csv('/Volumes/Korpora/Exp2_results/CLMET_Exp2_Reg12_train_results.csv', sep=';')
val_details.to_csv('/Volumes/Korpora/Exp2_results/CLMET_Exp2_Reg12_Val_results.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/Exp2_results/CLMET_Exp2_Reg12_Test_results.csv', sep=';')

ARCHER_val_details.to_csv('/Volumes/Korpora/Exp2_results/CLMET_over_ARCHER_Exp2_Reg12_Val_results.csv', sep=';')
ARCHER_test_details.to_csv('/Volumes/Korpora/Exp2_results/CLMET_over_ARCHER_Exp2_Reg12_Test_results.csv', sep=';')