**Experiment 3b: Support Vector Regression with Subwords**

*Background*: Experiment 2 shows that linear regression with 215 features no longer overfits, but the model still fits the beginning and the end of the time range poorly. Support vector regression with a non-linear kernel implicitly maps the points into a higher-dimensional space, which could improve the fit (cf. https://scikit-learn.org/stable/modules/svm.html).
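
As a rough illustration of the kernel idea, the sketch below computes the RBF kernel value exp(-gamma * ||x - z||^2) for a few toy vectors in plain Python. The vectors and the `gamma` value are made up for illustration; the `SVR` runs in this notebook use `gamma='scale'`, which derives gamma from the data instead.

```python
import math

def rbf_kernel(x, z, gamma):
    """Return exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical subword-count vectors for three documents.
doc_a = [3.0, 1.0, 0.0]
doc_b = [3.0, 1.0, 1.0]  # close to doc_a
doc_c = [0.0, 5.0, 4.0]  # far from doc_a

gamma = 0.1  # illustrative value only

sim_aa = rbf_kernel(doc_a, doc_a, gamma)
sim_ab = rbf_kernel(doc_a, doc_b, gamma)
sim_ac = rbf_kernel(doc_a, doc_c, gamma)
print(sim_aa, sim_ab, sim_ac)
```

Identical vectors give a kernel value of 1, and the value decays towards 0 as the vectors move apart, so the regressor can weight nearby training documents more strongly than distant ones.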

Subwords worked well for some corpora, but for others, bag-of-words worked better.



*Strategies*:

- Use support vector machines
- Use a BPE transformer to train on subwords (Sennrich2016)
- Use a bag-of-words approach
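
The BPE strategy can be sketched generically as follows. This is a minimal version of the merge-learning loop described by Sennrich et al., not the self-written `SubwordTransformer`, whose internals are not shown in this notebook. The toy word counts and the helper names `count_pairs`, `merge_pair` and `learn_bpe` are assumptions for illustration, and how `n_best` ("pairs per merge") enters the real transformer is not modelled here.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, number_of_merges):
    """Learn up to `number_of_merges` merges, most frequent pair first."""
    vocab = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(number_of_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        # Ties are broken by the pair itself to keep the result deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

toy_counts = {"der": 5, "den": 4, "denn": 2, "er": 3}  # made-up frequencies
merges, vocab = learn_bpe(toy_counts, number_of_merges=2)
print(merges)
```

With these toy counts the first merge joins `d` and `e`, and the second joins `de` and `n`, so frequent character sequences become single features.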

*Relevance*:

- Subwords might improve generalization while reducing the size of the vocabulary.
- A higher-dimensional feature space might help with fit.

*Success criteria*:

- Consistent findings across the training, validation and test sets
- Predicted year is no more than ten years away from the true year
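
To relate the ten-year criterion to the MSE values reported in this notebook, a small sketch: the square root of the MSE gives a typical error in years, and the share of documents within the tolerance can be counted directly. The year lists are made up, and the helper names are hypothetical. Note that an RMSE of ten years is a weaker condition than every single prediction lying within ten years.

```python
import math

def rmse_from_mse(mse):
    """Convert a mean squared error (years^2) into a typical error in years."""
    return math.sqrt(mse)

def share_within_tolerance(true_years, predicted_years, tolerance=10):
    """Fraction of predictions at most `tolerance` years from the true year."""
    hits = sum(abs(t - p) <= tolerance for t, p in zip(true_years, predicted_years))
    return hits / len(true_years)

# An RMSE of 10 years corresponds to an MSE of 100.
threshold_mse = 10 ** 2

# DTA validation MSE of the SVR run, as listed under "Baselines to beat".
dta_val_rmse = rmse_from_mse(5533.00)

true_years = [1650, 1720, 1800, 1875]       # made-up labels
predicted_years = [1655, 1700, 1808, 1930]  # made-up predictions
share = share_within_tolerance(true_years, predicted_years)
print(round(dta_val_rmse, 1), share)
```

An MSE of about 5533 corresponds to a typical error of roughly 74 years, far above the ten-year target.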

*Corpora*:

- DTA
- ARCHER
- CLMET
- GERMANC

*Result*: 

-

***Baselines to beat***:

*MSE DTA Train*: 2259.8 | 2755.29 | 4989.51 | 5888.71

*MSE DTA Val*: 3202.51 | 3050.20 | 4577.19 | 5533.00

*MSE DTA Test*: 4504.35 | 3103.35 | 4288.83 | 4898.35

*MSE over GERMANC val*: 3700.60 | 4384.34 | 3829.78 | 6160.46

*MSE over GERMANC test*: 3565.05 | 4281.22 | 3724.79 | 6089.49

Setup:

- RBF kernel
- SVR
- 6 merges
- 10 pairs per merge

----------------------------------------------------------------------------------------------------------------------

*MSE CLMET Train*: 1346.26 | 2541.43 | 3135.46 | 3374.98

*MSE CLMET Val*: 2727.37 | 2702.58 | 3176.05 | 3262.79

*MSE CLMET Test*: 4315.18 | 3757.33 | 3404.04 | 3603.84

*MSE over ARCHER Val*: 10212.14 | 10004.50 | 10109.93 | 10084.92
 
*MSE over ARCHER Test*: 9784.28 | 9657.31 | 9719.35 | 9694.91

Setup:
- RBF kernel
- SVR
- 2 merges
- 5 pairs per merge

----------------------------------------------------------------------------------------------------------------------

*MSE ARCHER Train*: 3555.68 | 4281.68 | 8603.07 | 9108.94

*MSE ARCHER Val*: 4843.06 | 5276.70 | 8916.03 | 9810.94

*MSE ARCHER Test*: 4939.51 | 4770.15 | 8495.80 | 9129.50

*MSE over CLMET Val*: 6437126.77 | 5256255.22 | 3205.40 | 3323.66

*MSE over CLMET Test*: 6689227.44 | 10656247.10 | 3673.11 | 3713.43

Setup:

- RBF kernel
- SVR
- 1 merge
- 90 pairs per merge

----------------------------------------------------------------------------------------------------------------------

*MSE GERMANC Train*: 348.830 | 1215.24 | 737.67 | 1759.29

*MSE GERMANC Val*: 398.94 | 1845.08 | 819.97 | 2129.27

*MSE GERMANC Test*: 649.33 | 1448.88 | 792.35 | 1903.36

*MSE over DTA val*: 8400009.85 | 25187766.76 | 9955.60 | 9257.79

*MSE over DTA test*: 21896163.78 | 18349238.65 | 8958.89 | 8351.84

Setup:

- 1 merge
- 12 pairs per merge

In [1]:
import pandas as pd
import numpy as np
import nltk

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest , f_regression
from sklearn import svm
from sklearn.metrics import mean_squared_error
import sklearn.utils
from sklearn.preprocessing import FunctionTransformer

from Selfwritten_modules.SubwordTransformer import SubwordTransformer

import re

import eli5
from eli5.sklearn import PermutationImportance



In [2]:
# Build a tokenizer that strips bracket characters and splits the document on commas.
# Note: the character class also removes '(', ')', '+' and '|'.
def tokenizer_word(doc):
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    doc = re.split(',', doc)
    return doc
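
A quick check of what the tokenizer does to a single document string. The input format (a bracketed, comma-separated token list stored as text) is an assumption based on the cell's comment, and the example strings are made up.

```python
import re

def tokenizer_word(doc):
    # Same pattern as in the cell above, written as a raw string.
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    return re.split(',', doc)

tokens = tokenizer_word('[der,hund,(sehr)+alt]')
spaced = tokenizer_word('[a, b]')
print(tokens)
print(spaced)
```

Because the pattern is a character class, it removes parentheses, `+` and `|` as well as the square brackets. Whitespace after a comma is kept, so tokens such as `' b'` retain their leading space.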

In [3]:
# Code example: https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
def features_to_names(features, feature_names):
    features_selected = []

    for selected, feature in zip(features, feature_names):
        if selected:
            features_selected.append(feature)
    return features_selected

In [4]:
# Collect per-document eli5 explanations to inspect how features are weighted.
def collect_predictions(dataset, classifier, vectorizer, feature_names, pipeline):
    predictions = eli5.explain_weights_df(classifier, vec=vectorizer, feature_names=feature_names)
    predictions = predictions.drop(['target'], axis=1)
    predictions['YEAR'] = 0

    for index in dataset.index.values:
        pred = eli5.explain_prediction_df(classifier, dataset[index], vec=vectorizer, feature_names=feature_names)

        # Predict the year for this single document.
        source_text = pd.DataFrame([[dataset[index]]])
        year_pred = pipeline.predict(source_text[0])

        pred['weight_value'] = pred['weight'] * pred['value']
        pred['instance'] = index
        pred = pred.drop(['target', 'value'], axis=1)
        pred['YEAR'] = np.round(year_pred[0])

        predictions = pd.concat([predictions, pred])

    return predictions.dropna()

In [5]:
train_full = pd.read_csv('/Volumes/Korpora/Train/DTA_train_tokenized.csv', sep=';')
val_full = pd.read_csv('/Volumes/Korpora/Val/DTA_val_tokenized.csv', sep=';')
test_full = pd.read_csv('/Volumes/Korpora/Test/DTA_test_tokenized.csv', sep=';')

In [6]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
print('Length test set: ', len(test_full))

Length train set:  899
Length validation set:  225
Length test set:  281


In [7]:
train_x = train_full['Text']
train_y = train_full['Publication_year']

val_x = val_full['Text']
val_y = val_full['Publication_year']

test_x = test_full['Text']
test_y = test_full['Publication_year']

In [8]:
reg_1 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 30)),
                         ('svr', svm.SVR())
                        ])

In [9]:
reg_1.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=30, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [10]:
y_pred_train = reg_1.predict(train_x)
mean_squared_error(train_y, y_pred_train)

5887.228196958991

In [11]:
y_pred_val = reg_1.predict(val_x)

mean_squared_error(val_y, y_pred_val)

5524.324848088053

In [12]:
reg_2 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 10)),
                         ('svr', svm.SVR())
                        ])

In [13]:
reg_2.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=10, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [14]:
y_pred_train = reg_2.predict(train_x)
mean_squared_error(train_y, y_pred_train)

5963.6460749842845

In [15]:
y_pred_val = reg_2.predict(val_x)
mean_squared_error(val_y, y_pred_val)

5614.803508630877

In [16]:
reg_3 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 30)),
                         ('svr', svm.SVR())
                        ])

In [17]:
reg_3.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=30, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [18]:
y_pred_train = reg_3.predict(train_x)
mean_squared_error(train_y, y_pred_train)

5957.577966747561

In [19]:
y_pred_val = reg_3.predict(val_x)
mean_squared_error(val_y, y_pred_val)

5610.0138678168005

In [32]:
reg_4 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 30)),
                         ('svr', svm.SVR())
                        ])

In [33]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=30, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [34]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

5887.357671546869

In [35]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

5535.656625998603

In [26]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 100)),
                         ('svr', svm.SVR())
                        ])

In [27]:
reg_5.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=100, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [30]:
y_pred_train = reg_5.predict(train_x)
mean_squared_error(train_y, y_pred_train)

6108.518195607794

In [31]:
y_pred_val = reg_5.predict(val_x)
mean_squared_error(val_y, y_pred_val)

5720.361518538872

In [36]:
reg_6 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=6, n_best = 10)),
                         ('svr', svm.SVR())
                        ])

In [37]:
reg_6.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=10, number_of_merges=6,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [38]:
y_pred_train = reg_6.predict(train_x)
mean_squared_error(train_y, y_pred_train)

5888.713789579892

In [39]:
y_pred_val = reg_6.predict(val_x)
mean_squared_error(val_y, y_pred_val)

5533.005128167105

In [40]:
reg_7 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 60)),
                         ('svr', svm.SVR())
                        ])

In [41]:
reg_7.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=60, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [42]:
y_pred_train = reg_7.predict(train_x)
mean_squared_error(train_y, y_pred_train)

5951.667971911611

In [43]:
y_pred_val = reg_7.predict(val_x)
mean_squared_error(val_y, y_pred_val)

5591.869249782603

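Reg_6 (6 merges, 10 pairs per merge) is carried forward. Its validation MSE of 5533.01 is within 9 points of the lowest value recorded above (reg_1, 5524.32).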
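
The DTA runs above can be ranked by validation MSE as a quick cross-check. The numbers are copied from the cell outputs and rounded to two decimals; the dictionary layout is only for illustration.

```python
# Settings and scores of the DTA runs, rounded from the outputs above.
runs = {
    "reg_1": {"merges": 3, "n_best": 30, "train": 5887.23, "val": 5524.32},
    "reg_2": {"merges": 3, "n_best": 10, "train": 5963.65, "val": 5614.80},
    "reg_3": {"merges": 1, "n_best": 30, "train": 5957.58, "val": 5610.01},
    "reg_4": {"merges": 2, "n_best": 30, "train": 5887.36, "val": 5535.66},
    "reg_5": {"merges": 3, "n_best": 100, "train": 6108.52, "val": 5720.36},
    "reg_6": {"merges": 6, "n_best": 10, "train": 5888.71, "val": 5533.01},
    "reg_7": {"merges": 1, "n_best": 60, "train": 5951.67, "val": 5591.87},
}

# Sort run names from lowest to highest validation MSE.
ranked = sorted(runs, key=lambda name: runs[name]["val"])
spread = runs[ranked[-1]]["val"] - runs[ranked[0]]["val"]

for name in ranked:
    print(name, runs[name]["val"])
print("spread:", round(spread, 2))
```

The whole range is under 200 MSE points, and reg_1, reg_6 and reg_4 lie within about 11 points of each other, so the subword settings appear to matter little with the default SVR hyperparameters.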

In [44]:
reg_6 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=6, n_best = 10)),
                         ('svr', svm.SVR())
                        ])

In [45]:
reg_6.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=10, number_of_merges=6,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [46]:
y_pred_train = reg_6.predict(train_x)
mean_squared_error(train_y, y_pred_train)

5888.713789579892

In [47]:
y_pred_val = reg_6.predict(val_x)
mean_squared_error(val_y, y_pred_val)

5533.005128167105

In [48]:
y_pred_test = reg_6.predict(test_x)
mean_squared_error(test_y, y_pred_test)

4898.351185460394

In [49]:
transformator = SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=6, n_best = 10)

In [50]:
train_transform = transformator.fit_transform(train_x, train_y)

In [52]:
perm = PermutationImportance(reg_6['svr']).fit(train_transform, y_pred_train)

In [55]:
features_selected = transformator.get_feature_names()
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.1335  ± 0.0163,en
0.1251  ± 0.0035,er
0.0656  ± 0.0056,ch
0.0634  ± 0.0055,de
0.0322  ± 0.0048,ei
0.0282  ± 0.0029,te
0.0203  ± 0.0020,ie
0.0198  ± 0.0048,un
0.0145  ± 0.0030,in
0.0116  ± 0.0048,ge
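
For readers unfamiliar with the weights above, the following is a minimal sketch of the permutation-importance idea in plain Python: shuffle one feature column, re-score the model, and record how much the score degrades. It is not eli5's implementation. It measures the increase in MSE, whereas eli5 by default reports the drop in the estimator's own score (R² for `SVR`). The toy model and data are made up. Note also that the cells above pass the model's predictions (`y_pred_train`) as the target, so those weights describe how sensitive the model's own output is to each subword.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error of two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def toy_model(row):
    # Hypothetical fixed model: only the first feature influences the year.
    return 1700 + 10 * row[0]

def permutation_importance(model, rows, targets, column, n_repeats=5, seed=0):
    """Average increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    base = mse(targets, [model(r) for r in rows])
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
        increases.append(mse(targets, [model(r) for r in permuted]) - base)
    return sum(increases) / n_repeats

rows = [[0, 5], [1, 3], [2, 8], [3, 1], [4, 7], [5, 2]]
targets = [toy_model(r) for r in rows]

imp_used = permutation_importance(toy_model, rows, targets, column=0)
imp_unused = permutation_importance(toy_model, rows, targets, column=1)
print(imp_used, imp_unused)
```

Shuffling the column the model ignores leaves the error unchanged, while shuffling the column it relies on increases the error.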


In [56]:
DTA_train_features_Exp3 = eli5.explain_weights_df(perm, feature_names=features_selected)

In [57]:
val_transform = transformator.transform(val_x)

In [58]:
perm = PermutationImportance(reg_6['svr']).fit(val_transform, y_pred_val)

In [59]:
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.1317  ± 0.0246,en
0.1252  ± 0.0132,er
0.0593  ± 0.0086,ch
0.0588  ± 0.0079,de
0.0231  ± 0.0022,ei
0.0208  ± 0.0030,te
0.0157  ± 0.0028,un
0.0157  ± 0.0032,ie
0.0092  ± 0.0028,in
0.0085  ± 0.0014,ſe


In [60]:
DTA_val_features_Exp3 = eli5.explain_weights_df(perm, feature_names=features_selected)

In [61]:
test_transform = transformator.transform(test_x)

In [62]:
perm = PermutationImportance(reg_6['svr']).fit(test_transform, y_pred_test)

In [63]:
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.1188  ± 0.0143,en
0.1124  ± 0.0072,er
0.0538  ± 0.0100,ch
0.0506  ± 0.0107,de
0.0200  ± 0.0021,ei
0.0172  ± 0.0028,te
0.0138  ± 0.0034,un
0.0137  ± 0.0035,ie
0.0074  ± 0.0010,ſe
0.0073  ± 0.0017,in


In [64]:
DTA_test_features_Exp3 = eli5.explain_weights_df(perm, feature_names=features_selected)

In [65]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

diff_pred_true_train = pd.concat([y_pred_train, train_y], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Publication_year
    

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Publication_year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Publication_year

In [66]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp3b_results/DTA_Exp3b_Reg6_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3b_results/DTA_Exp3b_Reg6_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3b_results/DTA_Exp3b_Reg6_Labels_test.csv',sep=';')

In [67]:
DTA_train_features_Exp3.to_csv('/Volumes/Korpora/Exp3b_results/DTA_train_features_Exp3b.csv',sep=';')
DTA_val_features_Exp3.to_csv('/Volumes/Korpora/Exp3b_results/DTA_val_features_Exp3b.csv', sep=';')
DTA_test_features_Exp3.to_csv('/Volumes/Korpora/Exp3b_results/DTA_test_features_Exp3b.csv', sep=';')

In [68]:
GERMANC_train_full = pd.read_csv('/Volumes/Korpora/Train/GERMANC_train_tokenized.csv', sep=';')
GERMANC_val_full = pd.read_csv('/Volumes/Korpora/Val/GERMANC_val_tokenized.csv', sep=';')
GERMANC_test_full = pd.read_csv('/Volumes/Korpora/Test/GERMANC_test_tokenized.csv', sep=';')

In [69]:
GERMANC_train_full = GERMANC_train_full[(GERMANC_train_full.Year.str.len()== 4) & (GERMANC_train_full.Year.str.isnumeric())]

GERMANC_val_full = GERMANC_val_full[(GERMANC_val_full.Year.str.len()== 4) & (GERMANC_val_full.Year.str.isnumeric())]

GERMANC_test_full = GERMANC_test_full[(GERMANC_test_full.Year.str.len()== 4) & (GERMANC_test_full.Year.str.isnumeric())]
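
The filter above keeps only rows whose `Year` is a plain four-digit string. The same predicate in plain Python, with made-up year strings, looks roughly like this:

```python
def is_four_digit_year(value):
    """Mirror the pandas filter: exactly four characters, all numeric."""
    return len(value) == 4 and value.isnumeric()

raw_years = ["1650", "1700-1750", "16xx", "c1700", "1783"]  # made-up examples
kept = [y for y in raw_years if is_four_digit_year(y)]
kept_as_int = [int(y) for y in kept]
print(kept_as_int)
```

Ranges and uncertain dates are dropped rather than approximated. In pandas, the filtered frame keeps its original row labels, which matters when it is later combined with freshly indexed data.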

In [70]:
print('Length train set: ',len(GERMANC_train_full))
print('Length validation set: ', len(GERMANC_val_full))
print('Length test set: ', len(GERMANC_test_full))

Length train set:  177
Length validation set:  40
Length test set:  56


In [71]:
GERMANC_train_x = GERMANC_train_full['Text']
GERMANC_train_y = GERMANC_train_full['Year'].astype(int)

GERMANC_val_x = GERMANC_val_full['Text']
GERMANC_val_y = GERMANC_val_full['Year'].astype(int)

GERMANC_test_x = GERMANC_test_full['Text']
GERMANC_test_y = GERMANC_test_full['Year'].astype(int)

In [72]:
y_pred_val = reg_6.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

6160.463263848846

In [73]:
y_pred_test = reg_6.predict(GERMANC_test_x)
mean_squared_error(GERMANC_test_y, y_pred_test)

6089.489848693307

In [74]:
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

# Reset the index of the filtered labels so that concat aligns rows by position.
diff_pred_true_val = pd.concat([y_pred_val, GERMANC_val_y.reset_index(drop=True)], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, GERMANC_test_y.reset_index(drop=True)], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [75]:
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3b_results/DTA_over_GERMANC_Exp3b_Reg6_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3b_results/DTA_over_GERMANC_Exp3b_Reg6_Labels_test.csv',sep=';')

In [77]:
val_transform = transformator.transform(GERMANC_val_x)


In [78]:
perm = PermutationImportance(reg_6['svr']).fit(val_transform, y_pred_val)

In [79]:
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.3135  ± 0.1005,uͤ
0.2995  ± 0.0909,en
0.2389  ± 0.0577,aͤ
0.1421  ± 0.0084,un
0.0830  ± 0.0174,st
0.0804  ± 0.0182,es
0.0659  ± 0.0171,te
0.0610  ± 0.0320,er
0.0399  ± 0.0176,ff
0.0339  ± 0.0087,nd


In [80]:
DTA_GERMANC_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [82]:
test_transform = transformator.transform(GERMANC_test_x)


In [83]:
perm = PermutationImportance(reg_6['svr']).fit(test_transform, y_pred_test)

In [84]:
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.4080  ± 0.1291,en
0.2263  ± 0.0250,uͤ
0.1227  ± 0.0299,st
0.1026  ± 0.0157,es
0.0996  ± 0.0290,aͤ
0.0995  ± 0.0248,un
0.0897  ± 0.0255,er
0.0716  ± 0.0223,te
0.0483  ± 0.0162,ff
0.0349  ± 0.0128,nd


In [85]:
DTA_GERMANC_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [86]:
DTA_GERMANC_val_features.to_csv('/Volumes/Korpora/Exp3b_results/DTA_over_GERMANC_val_features_Exp3b.csv', sep=';')
DTA_GERMANC_test_features.to_csv('/Volumes/Korpora/Exp3b_results/DTA_over_GERMANC_test_features_Exp3b.csv', sep=';')

**GERMANC**

In [87]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=6, n_best = 3)),
                         ('svr', svm.SVR())
                        ])

In [88]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=3, number_of_merges=6,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [89]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1774.517741367557

In [90]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2128.535065876909

In [91]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=6, n_best = 2)),
                         ('svr', svm.SVR())
                        ])

In [92]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=2, number_of_merges=6,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [93]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1781.2253333871315

In [94]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2137.778377188398

In [95]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 6)),
                         ('svr', svm.SVR())
                        ])

In [96]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=6, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [97]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1775.5059118797947

In [98]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2129.6592353031497

In [99]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best = 4)),
                         ('svr', svm.SVR())
                        ])

In [100]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=4, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [101]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1775.5057567895415

In [102]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2138.928842115488

In [103]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best = 6)),
                         ('svr', svm.SVR())
                        ])

In [104]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=6, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [105]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1770.7117245022782

In [106]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2129.8368489611325

In [107]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 12)),
                         ('svr', svm.SVR())
                        ])

In [108]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=12, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [109]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1759.2875149568317

In [110]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2129.2665927605112

In [111]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 10)),
                         ('svr', svm.SVR())
                        ])

In [112]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=10, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [113]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1754.6312242223544

In [114]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2132.139993583843

In [117]:
reg_5 = Pipeline([('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 12)),
                         ('svr', svm.SVR())
                        ])

In [118]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=12, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [119]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

1759.2875149568317

In [157]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

2129.2665927605112

In [158]:
y_pred_test = reg_5.predict(GERMANC_test_x)
mean_squared_error(GERMANC_test_y, y_pred_test)

1903.3576212046798

In [122]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

# Reset the index of the filtered labels so that concat aligns rows by position.
diff_pred_true_train = pd.concat([y_pred_train, GERMANC_train_y.reset_index(drop=True)], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year


y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, GERMANC_val_y.reset_index(drop=True)], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, GERMANC_test_y.reset_index(drop=True)], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [140]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp3b_results/GERMANC_Exp3b_Reg6_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3b_results/GERMANC_Exp3b_Reg6_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3b_results/GERMANC_Exp3b_Reg6_Labels_test.csv',sep=';')

In [149]:
transformator = SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best = 12)

In [150]:
train_transform = transformator.fit_transform(GERMANC_train_x, GERMANC_train_y)


In [151]:
features_selected = transformator.get_feature_names()

In [152]:
perm = PermutationImportance(reg_5['svr']).fit(train_transform, y_pred_train)

In [153]:
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.5967  ± 0.0636,en
0.3635  ± 0.0505,ie
0.1768  ± 0.0283,ch
0.1287  ± 0.0119,nd
0.0362  ± 0.0047,te
0.0362  ± 0.0069,in
0.0342  ± 0.0035,ei
0.0333  ± 0.0048,un
0.0323  ± 0.0074,er
0.0183  ± 0.0062,de


In [154]:
GERMANC_train_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [155]:
val_transform = transformator.transform(GERMANC_val_x)


In [159]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.9819  ± 0.3484,en
0.5794  ± 0.2356,ie
0.2818  ± 0.1213,nd
0.1667  ± 0.0169,in
0.1515  ± 0.0242,ch
0.0974  ± 0.0235,ei
0.0922  ± 0.0411,un
0.0787  ± 0.0156,te
0.0437  ± 0.0206,er
0.0379  ± 0.0137,de


In [160]:
GERMANC_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [161]:
test_transform = transformator.transform(GERMANC_test_x)

In [162]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.7566  ± 0.1722,en
0.3991  ± 0.1019,ie
0.2951  ± 0.0496,ch
0.1942  ± 0.0257,nd
0.0465  ± 0.0051,te
0.0407  ± 0.0127,in
0.0365  ± 0.0108,er
0.0365  ± 0.0081,un
0.0342  ± 0.0101,ei
0.0206  ± 0.0040,de


In [163]:
GERMANC_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [164]:
y_pred_val = reg_5.predict(val_x)
mean_squared_error(val_y, y_pred_val)

9257.789781566484

In [165]:
y_pred_test = reg_5.predict(test_x)
mean_squared_error(test_y, y_pred_test)

8351.843124057808

In [167]:
val_transform = transformator.transform(val_x)


In [168]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
1.0094  ± 0.0266,en
1.0027  ± 0.0000,nd
1.0027  ± 0.0000,te
1.0024  ± 0.0011,in
1.0022  ± 0.0024,ie
1.0006  ± 0.0081,de
0.9972  ± 0.0207,er
0.9628  ± 0.1594,ch
0.9320  ± 0.2366,ge
0.9057  ± 0.3141,he


In [169]:
GERMANC_DTA_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [170]:
test_transform = transformator.transform(test_x)


In [171]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.9991  ± 0.0058,ch
0.9674  ± 0.1071,er
0.9670  ± 0.0801,un
0.9227  ± 0.2684,en
0.9223  ± 0.2209,ie
0.9185  ± 0.2077,de
0.9051  ± 0.2955,nd
0.8982  ± 0.3333,ei
0.8886  ± 0.3375,te
0.8794  ± 0.2704,ge


In [172]:
GERMANC_DTA_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [173]:
GERMANC_train_features.to_csv('/Volumes/Korpora/Exp3b_results/GERMANC_train_features_Exp3b.csv', sep=';')
GERMANC_val_features.to_csv('/Volumes/Korpora/Exp3b_results/GERMANC_val_features_Exp3b.csv', sep=';')
GERMANC_test_features.to_csv('/Volumes/Korpora/Exp3b_results/GERMANC_test_features_Exp3b.csv', sep=';')
GERMANC_DTA_val_features.to_csv('/Volumes/Korpora/Exp3b_results/GERMANC_over_DTA_val_features_Exp3b.csv', sep=';')
GERMANC_DTA_test_features.to_csv('/Volumes/Korpora/Exp3b_results/GERMANC_over_DTA_test_features_Exp3b.csv', sep=';')

In [176]:
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

# Reset the label index so the rows line up with the freshly indexed predictions.
diff_pred_true_val = pd.concat([y_pred_val, val_y.reset_index(drop=True)], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Publication_year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, test_y.reset_index(drop=True)], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Publication_year

In [177]:
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3b_results/GERMANC_DTA_Exp3b_Reg6_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3b_results/GERMANC_DTA_Exp3b_Reg6_Labels_test.csv',sep=';')

In [178]:
ARCHER_train_full = pd.read_csv('/Volumes/Korpora/Train/ARCHER_train_tokenized.csv', sep=';')
ARCHER_val_full = pd.read_csv('/Volumes/Korpora/Val/ARCHER_val_tokenized.csv', sep=';')
ARCHER_test_full = pd.read_csv('/Volumes/Korpora/Test/ARCHER_test_tokenized.csv', sep=';')

In [179]:
ARCHER_train_full = ARCHER_train_full[(ARCHER_train_full.Year.str.len()== 4) & (ARCHER_train_full.Year.str.isnumeric())]

ARCHER_val_full = ARCHER_val_full[(ARCHER_val_full.Year.str.len()== 4) & (ARCHER_val_full.Year.str.isnumeric())]

ARCHER_test_full = ARCHER_test_full[(ARCHER_test_full.Year.str.len()== 4) & (ARCHER_test_full.Year.str.isnumeric())]

In [180]:
print('Length train set: ',len(ARCHER_train_full))
print('Length validation set: ', len(ARCHER_val_full))
print('Length test set: ', len(ARCHER_test_full))

Length train set:  1049
Length validation set:  264
Length test set:  329


In [181]:
ARCHER_train_x = ARCHER_train_full['Text']
ARCHER_train_y = ARCHER_train_full['Year'].astype(int)

ARCHER_val_x = ARCHER_val_full['Text']
ARCHER_val_y = ARCHER_val_full['Year'].astype(int)

ARCHER_test_x = ARCHER_test_full['Text']
ARCHER_test_y = ARCHER_test_full['Year'].astype(int)

In [182]:
reg_5 = Pipeline([
    ('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best=35)),
    ('svr', svm.SVR()),
])

In [183]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=35, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [184]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

9128.53815129921

In [185]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

9809.4265811449

In [186]:
reg_5 = Pipeline([
    ('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=6, n_best=15)),
    ('svr', svm.SVR()),
])

In [187]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=15, number_of_merges=6,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [188]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

9157.918896997538

In [189]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

9831.256061096501

In [190]:
reg_5 = Pipeline([
    ('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=6, n_best=10)),
    ('svr', svm.SVR()),
])

In [191]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=10, number_of_merges=6,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [192]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

9230.64056726293

In [193]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

9900.484688269877

In [194]:
reg_5 = Pipeline([
    ('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=3, n_best=30)),
    ('svr', svm.SVR()),
])

In [195]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=30, number_of_merges=3,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [196]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

9156.865694942353

In [197]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

9832.145013010066

In [198]:
reg_5 = Pipeline([
    ('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best=45)),
    ('svr', svm.SVR()),
])

In [199]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=45, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [200]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

9145.902170618056

In [201]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

9820.071372497141

In [202]:
reg_5 = Pipeline([
    ('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best=90)),
    ('svr', svm.SVR()),
])

In [203]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=90, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [204]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

9108.944973994685

In [205]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

9810.944498334742
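
The six `(number_of_merges, n_best)` settings tried above can be compared in one place. The sketch below only re-uses the MSE values printed by the cells above; the dictionary and variable names are illustrative and not part of the notebook.

```python
# (number_of_merges, n_best) -> (train MSE, validation MSE),
# copied from the ARCHER cell outputs above
archer_sweep = {
    (3, 35): (9128.53815129921, 9809.4265811449),
    (6, 15): (9157.918896997538, 9831.256061096501),
    (6, 10): (9230.64056726293, 9900.484688269877),
    (3, 30): (9156.865694942353, 9832.145013010066),
    (2, 45): (9145.902170618056, 9820.071372497141),
    (1, 90): (9108.944973994685, 9810.944498334742),
}

# Setting with the lowest validation MSE and with the lowest training MSE
best_by_val = min(archer_sweep, key=lambda cfg: archer_sweep[cfg][1])
best_by_train = min(archer_sweep, key=lambda cfg: archer_sweep[cfg][0])

# Print all settings ordered by validation MSE
for cfg, (train_mse, val_mse) in sorted(archer_sweep.items(), key=lambda kv: kv[1][1]):
    print(cfg, round(train_mse, 2), round(val_mse, 2))
```

On these numbers the lowest validation MSE belongs to (3, 35) by a small margin, while (1, 90), the setting evaluated on the test set in the following cell, has the lowest training MSE.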

In [206]:
y_pred_test = reg_5.predict(ARCHER_test_x)
mean_squared_error(ARCHER_test_y, y_pred_test)

9129.496381280214

In [207]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

# The Year series keeps the filtered row labels; reset them so concat aligns row by row.
diff_pred_true_train = pd.concat([y_pred_train, ARCHER_train_y.reset_index(drop=True)], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year


y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, ARCHER_val_y.reset_index(drop=True)], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, ARCHER_test_y.reset_index(drop=True)], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [208]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp3b_results/ARCHER_Exp3b_Reg6_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3b_results/ARCHER_Exp3b_Reg6_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3b_results/ARCHER_Exp3b_Reg6_Labels_test.csv',sep=';')

In [209]:
# Standalone transformer with the same settings as the pipeline's 'subwords' step
transformator = SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best=90)

In [210]:
train_transform = transformator.fit_transform(ARCHER_train_x, ARCHER_train_y)

In [211]:
features_selected = transformator.get_feature_names()

In [212]:
perm = PermutationImportance(reg_5['svr']).fit(train_transform, y_pred_train)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.0449  ± 0.0091,he
0.0380  ± 0.0027,in
0.0320  ± 0.0037,th
0.0120  ± 0.0013,re
0.0104  ± 0.0020,er
0.0090  ± 0.0003,on
0.0075  ± 0.0010,an
0.0058  ± 0.0002,ti
0.0052  ± 0.0008,at
0.0049  ± 0.0002,en


In [213]:
ARCHER_train_features = eli5.explain_weights_df(perm, feature_names=features_selected)
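
The `Weight` column above is a permutation importance: the drop in the model's score when the values of one feature are shuffled across documents. The sketch below illustrates the idea with the standard library only, using R² as the score. The toy model, the toy data and all function names are assumptions for illustration; it is not eli5's implementation, which differs in details such as how the spread over repeated shuffles is reported.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_iter=5, seed=0):
    """Mean drop in R^2 per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    base = r2_score(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_iter):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(base - r2_score(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_iter)
    return importances


def toy_predict(row):
    # Hypothetical model: the predicted year depends on feature 0 only
    return 1700 + 10 * row[0]


X_toy = [[float(i), float((i * 7) % 5)] for i in range(20)]
# As in the notebook, the model's own predictions serve as the target
y_toy = [toy_predict(row) for row in X_toy]

weights = permutation_importance(toy_predict, X_toy, y_toy)
print(weights)  # feature 0 gets a positive weight, feature 1 gets 0.0
```

Because the notebook passes the model's own predictions as the target, the baseline score should be perfect, provided the standalone transformer reproduces the pipeline's features. Each weight then measures how strongly the predictions change when that feature is shuffled, not how much the error against the true years grows.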

In [214]:
val_transform = transformator.transform(ARCHER_val_x)


In [215]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.0438  ± 0.0070,in
0.0430  ± 0.0143,he
0.0326  ± 0.0083,th
0.0123  ± 0.0026,re
0.0097  ± 0.0027,er
0.0079  ± 0.0006,on
0.0063  ± 0.0017,an
0.0051  ± 0.0006,at
0.0049  ± 0.0010,ti
0.0041  ± 0.0005,en


In [216]:
ARCHER_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [217]:
test_transform = transformator.transform(ARCHER_test_x)


In [218]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.0524  ± 0.0132,he
0.0410  ± 0.0053,in
0.0342  ± 0.0084,th
0.0146  ± 0.0061,re
0.0105  ± 0.0020,er
0.0089  ± 0.0010,on
0.0086  ± 0.0043,an
0.0058  ± 0.0009,at
0.0058  ± 0.0010,ti
0.0045  ± 0.0007,en


In [219]:
ARCHER_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [220]:
CLMET_train_full = pd.read_csv('/Volumes/Korpora/Train/CLMET_train_tokenized.csv', sep=';')
CLMET_val_full = pd.read_csv('/Volumes/Korpora/Val/CLMET_val_tokenized.csv', sep=';')
CLMET_test_full = pd.read_csv('/Volumes/Korpora/Test/CLMET_test_tokenized.csv', sep=';')

In [221]:
# Drop rows whose Year is not a four-character string
CLMET_train_full = CLMET_train_full[CLMET_train_full.Year.str.len()== 4]
CLMET_val_full = CLMET_val_full[CLMET_val_full.Year.str.len()== 4]
CLMET_test_full = CLMET_test_full[CLMET_test_full.Year.str.len()== 4]

In [222]:
CLMET_train_x = CLMET_train_full['Text']
CLMET_train_y = CLMET_train_full['Year'].astype(int)

CLMET_val_x = CLMET_val_full['Text']
CLMET_val_y = CLMET_val_full['Year'].astype(int)

CLMET_test_x = CLMET_test_full['Text']
CLMET_test_y = CLMET_test_full['Year'].astype(int)

In [223]:
print('Length train set: ',len(CLMET_train_full))
print('Length validation set: ', len(CLMET_val_full))
print('Length test set: ', len(CLMET_test_full))

Length train set:  186
Length validation set:  47
Length test set:  60


In [224]:
y_pred_val = reg_5.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

3323.6581774632145

In [225]:
y_pred_test = reg_5.predict(CLMET_test_x)
mean_squared_error(CLMET_test_y, y_pred_test)

3713.429796623722

In [226]:
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

# The Year series keeps the filtered row labels; reset them so concat aligns row by row.
diff_pred_true_val = pd.concat([y_pred_val, CLMET_val_y.reset_index(drop=True)], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, CLMET_test_y.reset_index(drop=True)], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [227]:
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3b_results/ARCHER_over_CLMET_Exp3b_Reg6_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3b_results/ARCHER_over_CLMET_Exp3b_Reg6_Labels_test.csv',sep=';')

In [228]:
val_transform = transformator.transform(CLMET_val_x)


In [229]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
1.0591  ± 0.0076,nd
1.0265  ± 0.1086,ou
0.9724  ± 0.2226,he
0.9015  ± 0.4638,th
0.8977  ± 0.3442,an
0.8892  ± 0.3506,es
0.8770  ± 0.4189,ng
0.8704  ± 0.4735,at
0.8546  ± 0.3983,in
0.8332  ± 0.5023,so


In [230]:
ARCHER_CLMET_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [231]:
test_transform = transformator.transform(CLMET_test_x)


In [232]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
1.0223  ± 0.0123,of
1.0203  ± 0.0045,th
1.0191  ± 0.0000,he
1.0191  ± 0.0000,to
1.0191  ± 0.0000,an
1.0191  ± 0.0000,in
1.0189  ± 0.0011,ea
1.0186  ± 0.0019,or
1.0173  ± 0.0070,ve
1.0164  ± 0.0052,nd


In [233]:
ARCHER_CLMET_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [234]:
ARCHER_train_features.to_csv('/Volumes/Korpora/Exp3b_results/ARCHER_train_features_Exp3b.csv', sep=';')
ARCHER_val_features.to_csv('/Volumes/Korpora/Exp3b_results/ARCHER_val_features_Exp3b.csv', sep=';')
ARCHER_test_features.to_csv('/Volumes/Korpora/Exp3b_results/ARCHER_test_features_Exp3b.csv', sep=';')
ARCHER_CLMET_val_features.to_csv('/Volumes/Korpora/Exp3b_results/ARCHER_over_CLMET_val_features_Exp3b.csv', sep=';')
ARCHER_CLMET_test_features.to_csv('/Volumes/Korpora/Exp3b_results/ARCHER_over_CLMET_test_features_Exp3b.csv', sep=';')

In [235]:
reg_5 = Pipeline([
    ('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best=15)),
    ('svr', svm.SVR()),
])

In [236]:
reg_5.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=15, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [237]:
y_pred_train = reg_5.predict(CLMET_train_x)
mean_squared_error(CLMET_train_y, y_pred_train)

3378.2242965825812

In [238]:
y_pred_val = reg_5.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

3265.5224492757784

In [239]:
reg_5 = Pipeline([
    ('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=1, n_best=10)),
    ('svr', svm.SVR()),
])

In [240]:
reg_5.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=10, number_of_merges=1,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [241]:
y_pred_train = reg_5.predict(CLMET_train_x)
mean_squared_error(CLMET_train_y, y_pred_train)

3375.47140685032

In [242]:
y_pred_val = reg_5.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

3263.581050205432

In [243]:
reg_5 = Pipeline([
    ('subwords', SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best=5)),
    ('svr', svm.SVR()),
])

In [244]:
reg_5.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('subwords',
                 SubwordTransformer(n_best=5, number_of_merges=2,
                                    tokenizer=<function tokenizer_word at 0x116a98560>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [245]:
y_pred_train = reg_5.predict(CLMET_train_x)
mean_squared_error(CLMET_train_y, y_pred_train)

3374.981450663011

In [246]:
y_pred_val = reg_5.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

3262.7875632677583

In [255]:
y_pred_test = reg_5.predict(CLMET_test_x)
mean_squared_error(CLMET_test_y, y_pred_test)

3603.836868570244
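
The success criteria ask for predictions within ten years of the true year, while the cells above report mean squared error in squared years. Taking the square root gives a root mean squared error in years, which is easier to compare against that threshold. This is a rough sketch: RMSE is an aggregate and only a proxy for the per-document criterion, and the names below are illustrative.

```python
import math

# MSE values copied from the CLMET cells above (2 merges, n_best=5)
clmet_mse = {
    "train": 3374.981450663011,
    "val": 3262.7875632677583,
    "test": 3603.836868570244,
}

# RMSE is in the same unit as the target (years)
clmet_rmse = {split: math.sqrt(mse) for split, mse in clmet_mse.items()}

THRESHOLD_YEARS = 10  # taken from the success criteria
meets_threshold = {split: rmse <= THRESHOLD_YEARS for split, rmse in clmet_rmse.items()}

for split, rmse in clmet_rmse.items():
    print(f"{split}: RMSE = {rmse:.1f} years, within threshold: {meets_threshold[split]}")
```

On these values the RMSE is roughly 57 to 60 years on all three splits, so the ten-year criterion is not met.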

In [256]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

# The Year series keeps the filtered row labels; reset them so concat aligns row by row.
diff_pred_true_train = pd.concat([y_pred_train, CLMET_train_y.reset_index(drop=True)], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year


y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, CLMET_val_y.reset_index(drop=True)], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, CLMET_test_y.reset_index(drop=True)], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [257]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp3b_results/CLMET_Exp3b_Reg6_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3b_results/CLMET_Exp3b_Reg6_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3b_results/CLMET_Exp3b_Reg6_Labels_test.csv',sep=';')

In [258]:
# Standalone transformer with the same settings as the pipeline's 'subwords' step
transformator = SubwordTransformer(tokenizer=tokenizer_word, number_of_merges=2, n_best=5)

In [259]:
train_transform = transformator.fit_transform(CLMET_train_x, CLMET_train_y)


In [260]:
features_selected = transformator.get_feature_names()

In [261]:
perm = PermutationImportance(reg_5['svr']).fit(train_transform, y_pred_train)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.2988  ± 0.0605,he
0.1559  ± 0.0239,th
0.0658  ± 0.0173,in
0.0306  ± 0.0124,er
0.0235  ± 0.0046,an
0.0151  ± 0.0113,on
0.0145  ± 0.0049,re
0.0127  ± 0.0059,at
0.0108  ± 0.0102,en
0.0066  ± 0.0020,ou


In [262]:
CLMET_train_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [263]:
val_transform = transformator.transform(CLMET_val_x)


In [264]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.3524  ± 0.0884,he
0.1735  ± 0.0691,th
0.0723  ± 0.0362,in
0.0396  ± 0.0082,er
0.0261  ± 0.0110,an
0.0159  ± 0.0025,re
0.0083  ± 0.0039,on
0.0064  ± 0.0026,at
0.0043  ± 0.0018,en
0.0032  ± 0.0019,ou


In [265]:
CLMET_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [266]:
test_transform = transformator.transform(CLMET_test_x)


In [267]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.3241  ± 0.0415,he
0.1827  ± 0.0331,th
0.0865  ± 0.0316,in
0.0545  ± 0.0373,er
0.0418  ± 0.0286,an
0.0264  ± 0.0185,re
0.0185  ± 0.0186,on
0.0182  ± 0.0173,at
0.0167  ± 0.0116,en
0.0152  ± 0.0171,ou


In [268]:
CLMET_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [269]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

10084.917988301646

In [270]:
y_pred_test = reg_5.predict(ARCHER_test_x)
mean_squared_error(ARCHER_test_y, y_pred_test)

9694.919583446645

In [271]:
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

# The Year series keeps the filtered row labels; reset them so concat aligns row by row.
diff_pred_true_val = pd.concat([y_pred_val, ARCHER_val_y.reset_index(drop=True)], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, ARCHER_test_y.reset_index(drop=True)], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [272]:
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3b_results/CLMET_over_ARCHER_Exp3b_Reg6_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3b_results/CLMET_over_ARCHER_Exp3b_Reg6_Labels_test.csv',sep=';')

In [273]:
val_transform = transformator.transform(ARCHER_val_x)


In [274]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.3504  ± 0.0305,he
0.0977  ± 0.0127,th
0.0603  ± 0.0014,in
0.0098  ± 0.0013,an
0.0086  ± 0.0008,er
0.0017  ± 0.0001,at
0.0016  ± 0.0001,re
0.0008  ± 0.0001,ou
0.0005  ± 0.0001,on
0.0000  ± 0.0000,en


In [275]:
CLMET_ARCHER_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [276]:
test_transform = transformator.transform(ARCHER_test_x)


In [277]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.3433  ± 0.0228,he
0.1048  ± 0.0093,th
0.0495  ± 0.0013,in
0.0102  ± 0.0011,an
0.0093  ± 0.0006,er
0.0018  ± 0.0002,at
0.0017  ± 0.0002,re
0.0009  ± 0.0001,ou
0.0006  ± 0.0001,on
0.0000  ± 0.0000,en


In [278]:
CLMET_ARCHER_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [279]:
CLMET_train_features.to_csv('/Volumes/Korpora/Exp3b_results/CLMET_train_features_Exp3b.csv', sep=';')
CLMET_val_features.to_csv('/Volumes/Korpora/Exp3b_results/CLMET_val_features_Exp3b.csv', sep=';')
CLMET_test_features.to_csv('/Volumes/Korpora/Exp3b_results/CLMET_test_features_Exp3b.csv', sep=';')
CLMET_ARCHER_val_features.to_csv('/Volumes/Korpora/Exp3b_results/CLMET_over_ARCHER_val_features_Exp3b.csv', sep=';')
CLMET_ARCHER_test_features.to_csv('/Volumes/Korpora/Exp3b_results/CLMET_over_ARCHER_test_features_Exp3b.csv', sep=';')