**Experiment 3: Support Vector Regression**

*Background*: Experiment 2 shows that linear regression with 215 features does not overfit anymore, but the model does not fit the beginning and the end of the timerange. Support vector regression maps all points to a higher dimensional space, which could help with fit (cf. https://scikit-learn.org/stable/modules/svm.html)

Subwords worked well for some corpora, but for others, bag-of-words worked better.



*Strategies*:

- use support vector machines
- Use a BPE-transformer to train on subwords (Sennrich2016)
- use bag-of-word approach

*Relevance*:

- Subwords might increase the amount of generalization, and minimize the vocabulary used at the same time.
- higher dimension might help with fit

*Success criteria*:

- Consistent findings over training-, test- and validation set
- predicted year is not more than ten years away from the true year

*Corpora*:

- DTA
- ARCHER
- CLMET
- GERMANC

*Result*: 

-
***Baselines to beat (Exp. 1b)***:

*MSE DTA Train*: 2259.8 | 2755.29 | 4989.51

*MSE DTA Val*: 3202.51 | 3050.20 | 4577.19

*MSE DTA Test*: 4504.35 | 3103.35 | 4288.83

*MSE over GERMANC val*: 3700.60 | 4384.34 | 3829.78

*MSE over GERMANC test*: 3565.05 | 4281.22 | 3724.79

Setup:

- RGB kernel
- SVR
- Bag-of-Words
- select 150 kbest

----------------------------------------------------------------------------------------------------------------------

*MSE CLMET Train*: 1346.26 | 2541.43 | 3135.46

*MSE CLMET Val*: 2727.37 | 2702.58 | 3176.05

*MSE CLMET Test*: 4315.18 | 3757.33 | 3404.04

*MSE over ARCHER Val*: 10212.14 | 10004.50 | 10109.93

*MSE over ARCHER Test*: 9784.28 | 9657.31 | 9719.35

Setup:
- RGB kernel
- SVR
- Bag-of-Words
- select 32 kbest

----------------------------------------------------------------------------------------------------------------------

*MSE ARCHER Train*: 3555.68 | 4281.68 | 8603.07

*MSE ARCHER Val*: 4843.06 | 5276.70 | 8916.03

*MSE ARCHER Test*: 4939.51 | 4770.15 | 8495.80

*MSE over CLMET Val*: 6437126.77 | 5256255.22 | 3205.40

*MSE over CLMET Test*: 6689227.44 | 10656247.10 |3673.11

Setup:

- RGB kernel
- SVR
- Bag-of-Words
- select 130 kbest

----------------------------------------------------------------------------------------------------------------------

*MSE GERMANC Train*: 348.830 | 1215.24 | 737.67

*MSE GERMANC Val*: 398.94 | 1845.08 | 819.97

*MSE GERMANC Test*: 649.33 | 1448.88 | 792.35

*MSE over DTA val*: 8400009.85 | 25187766.76 | 9955.60

*MSE over DTA test*: 21896163.78 | 18349238.65 | 8958.89

Setup:

- Bag-of-Words
- selected 22 k-best

In [5]:
import pandas as pd
import numpy as np
import nltk

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest , f_regression
from sklearn import svm
from sklearn.metrics import mean_squared_error
import sklearn.utils
from sklearn.preprocessing import FunctionTransformer

from Selfwritten_modules.SubwordTransformer import SubwordTransformer

import re

import eli5
from eli5.sklearn import PermutationImportance



In [6]:
#build tokenizer that just substitutes '[' and ']' with ','
def tokenizer_word(doc):
    doc = re.sub('[(\[+)|(\]+)]', '', doc)
    doc = re.split(',', doc)
    return doc

In [7]:
# Code example: https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
def features_to_names(features, feature_names):
    features_selected = []

    for bool, feature in zip(features, feature_names):
        if bool:
            features_selected.append(feature)
    return features_selected

In [8]:
#function for assembling predictions in order to find out how features are weighted

def collect_predictions(dataset, classifier,vectorizer, feature_names, pipeline):
    predictions = eli5.explain_weights_df(classifier,vec=vectorizer, feature_names=feature_names)
    
    predictions = predictions.drop(['target'], axis=1)
    
    
    predictions['YEAR'] = 0
    
    indexes = dataset.index.values
    
    

    for index in indexes:
        
        pred = eli5.explain_prediction_df(classifier, dataset[index], vec=vectorizer, feature_names=feature_names)
        
        source_text = pd.DataFrame([[dataset[index]]])
        
        year_pred = pipeline.predict(source_text[0])
        pred['weight_value'] = pred['weight'] * pred['value']
        pred['instance'] = index
        
        pred = pred.drop(['target','value'], axis=1)
        
    
        pred['YEAR'] = np.round(year_pred[0])
    
        predictions = pd.concat([predictions, pred])
        
    
    
    
    return predictions.dropna()

In [9]:
train_full = pd.read_csv('/Volumes/Korpora/Train/DTA_train_tokenized.csv', sep=';')
val_full = pd.read_csv('/Volumes/Korpora/Val/DTA_val_tokenized.csv', sep=';')
test_full = pd.read_csv('/Volumes/Korpora/Test/DTA_test_tokenized.csv', sep=';')

In [10]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
print('Length test set: ', len(test_full))

Length train set:  899
Length validation set:  225
Length test set:  281


In [11]:
train_x = train_full['Text']
train_y = train_full['Publication_year']

val_x = val_full['Text']
val_y = val_full['Publication_year']

test_x = test_full['Text']
test_y = test_full['Publication_year']

**Try SVR**

In [22]:
reg_1 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('svr', svm.SVR())
                        ])

In [23]:
reg_1.fit(train_x, train_y)



Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x116ee2560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x116805a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsi

In [26]:
y_pred_train = reg_1.predict(train_x)
mean_squared_error(train_y, y_pred_train)

62530.04426686872

In [27]:
y_pred_val = reg_1.predict(val_x)

mean_squared_error(val_y, y_pred_val)

75321.94842239987

In [None]:
reg_2 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 400)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('svr', svm.SVR())
                        ])

In [15]:
reg_2.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=400,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1169d3560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x1162faa70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsi

In [16]:
y_pred_train = reg_2.predict(train_x)
mean_squared_error(train_y, y_pred_train)

5088.6607164913485

In [17]:
y_pred_val = reg_2.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4660.419547783548

In [18]:
reg_3 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    ('feature_selector', SelectKBest(f_regression, k=100)),
                         ('svr', svm.SVR())
                        ])

In [19]:
reg_3.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1169d3560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=100,
                             score_func=<function f_regression at 0x1162faa70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilo

In [20]:
y_pred_train = reg_3.predict(train_x)
mean_squared_error(train_y, y_pred_train)

4983.908350759601

In [21]:
y_pred_val = reg_3.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4571.6984857298385

In [22]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    ('feature_selector', SelectKBest(f_regression, k=50)),
                         ('svr', svm.SVR())
                        ])

In [23]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1169d3560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=50,
                             score_func=<function f_regression at 0x1162faa70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon

In [24]:
y_pred_train = reg_3.predict(train_x)
mean_squared_error(train_y, y_pred_train)

4983.908350759601

In [25]:
y_pred_val = reg_3.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4571.6984857298385

In [48]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    ('feature_selector', SelectKBest(f_regression, k=150)),
                         ('svr', svm.SVR(kernel='poly', degree=5))
                        ])

In [49]:
reg_5.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x12491a950>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=150,
                             score_func=<function f_regression at 0x124240a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=5, epsilo

In [50]:
y_pred_train = reg_3.predict(train_x)
mean_squared_error(train_y, y_pred_train)

723614.6055945167

In [51]:
y_pred_val = reg_3.predict(val_x)
mean_squared_error(val_y, y_pred_val)

1389410.8249968216

**Try NuSVR**

In [32]:
reg_1 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('svr', svm.NuSVR())
                        ])

In [33]:
reg_1.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1169d3560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x1162faa70>)),
                ('svr',
                 NuSVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
  

In [34]:
y_pred_train = reg_1.predict(train_x)
mean_squared_error(train_y, y_pred_train)

5120.97448729449

In [35]:
y_pred_val = reg_1.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4695.405958943123

In [36]:
reg_2 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    ('feature_selector', SelectKBest(f_regression, k=100)),
                         ('svr', svm.NuSVR())
                        ])

In [37]:
reg_2.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1169d3560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=100,
                             score_func=<function f_regression at 0x1162faa70>)),
                ('svr',
                 NuSVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
    

In [38]:
y_pred_train = reg_2.predict(train_x)
mean_squared_error(train_y, y_pred_train)

4998.2428364029665

In [39]:
y_pred_val = reg_2.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4578.414062910382

**Try linearSVR**

In [40]:
reg_1 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('svr', svm.LinearSVR())
                        ])

In [41]:
reg_1.fit(train_x, train_y)



Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1169d3560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x1162faa70>)),
                ('svr',
                 LinearSVR(C=1.0, dual=True, epsilon=0.0, fit_interce

In [24]:
y_pred_train = reg_1.predict(train_x)
mean_squared_error(train_y, y_pred_train)

62530.04426686872

In [25]:
y_pred_val = reg_1.predict(val_x)
mean_squared_error(val_y, y_pred_val)

75321.94842239987

Reg 5 with SVR works best

In [8]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    ('feature_selector', SelectKBest(f_regression, k=150)),
                         ('svr', svm.SVR())
                        ])

In [9]:
reg_5.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x11f3aa680>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=150,
                             score_func=<function f_regression at 0x11eccca70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilo

In [10]:
y_pred_train = reg_5.predict(train_x)
mean_squared_error(train_y, y_pred_train)

4989.50795950559

In [11]:
y_pred_val = reg_5.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4577.185764875978

In [12]:
y_pred_test = reg_5.predict(test_x)
mean_squared_error(test_y, y_pred_test)

4288.834314224981

In [13]:
transformator = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    ('feature_selector', SelectKBest(f_regression, k=150))
                        ])

In [14]:
train_transform = transformator.fit_transform(train_x, train_y)

In [15]:
train_transform = train_transform.toarray()

In [33]:
perm = PermutationImportance(reg_5['svr']).fit(train_transform, y_pred_train)

In [34]:
features = transformator['feature_selector'].get_support()
feature_names = transformator['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)


In [35]:
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.4824  ± 0.0436,'
0.4756  ± 0.0088,'
0.0128  ± 0.0028,'der'
0.0101  ± 0.0014,'die'
0.0060  ± 0.0022,''
0.0010  ± 0.0005,'in'
0.0006  ± 0.0001,'des'
0.0004  ± 0.0000,'den'
0.0003  ± 0.0001,'er'
0.0003  ± 0.0003,'nicht'


In [36]:
DTA_train_features_Exp3 = eli5.explain_weights_df(perm, feature_names=features_selected)

In [37]:
val_transform = transformator.transform(val_x)

In [38]:
val_transform = val_transform.toarray()

In [39]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)

In [40]:
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.4839  ± 0.0848,'
0.4726  ± 0.0371,'
0.0112  ± 0.0028,''
0.0099  ± 0.0021,'der'
0.0091  ± 0.0020,'die'
0.0006  ± 0.0006,'in'
0.0005  ± 0.0001,'des'
0.0004  ± 0.0001,'er'
0.0003  ± 0.0001,'ich'
0.0002  ± 0.0001,'ein'


In [41]:
DTA_val_features_Exp3 = eli5.explain_weights_df(perm, feature_names=features_selected)

In [42]:
test_transform = transformator.transform(test_x)

In [45]:
test_transform = test_transform.toarray()

In [46]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)

In [47]:
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.4706  ± 0.0511,'
0.4692  ± 0.0637,'
0.0100  ± 0.0027,'der'
0.0085  ± 0.0019,'die'
0.0070  ± 0.0023,''
0.0004  ± 0.0001,'des'
0.0004  ± 0.0001,'in'
0.0002  ± 0.0001,'er'
0.0002  ± 0.0000,'ein'
0.0001  ± 0.0000,'den'


In [48]:
DTA_test_features_Exp3 = eli5.explain_weights_df(perm, feature_names=features_selected)

In [16]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

diff_pred_true_train = pd.concat([y_pred_train, train_y], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Publication_year
    

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Publication_year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Publication_year

In [17]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp3_results/DTA_Exp3_Reg5_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3_results/DTA_Exp3_Reg5_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3_results/DTA_Exp3_Reg5_Labels_test.csv',sep=';')

In [49]:
DTA_train_features_Exp3.to_csv('/Volumes/Korpora/Exp3_results/DTA_train_features_Exp3.csv',sep=';')
DTA_val_features_Exp3.to_csv('/Volumes/Korpora/Exp3_results/DTA_val_features_Exp3.csv', sep=';')
DTA_test_features_Exp3.to_csv('/Volumes/Korpora/Exp3_results/DTA_test_features_Exp3.csv', sep=';')

In [12]:
GERMANC_train_full = pd.read_csv('/Volumes/Korpora/Train/GERMANC_train_tokenized.csv', sep=';')
GERMANC_val_full = pd.read_csv('/Volumes/Korpora/Val/GERMANC_val_tokenized.csv', sep=';')
GERMANC_test_full = pd.read_csv('/Volumes/Korpora/Test/GERMANC_test_tokenized.csv', sep=';')

In [13]:
GERMANC_train_full = GERMANC_train_full[(GERMANC_train_full.Year.str.len()== 4) & (GERMANC_train_full.Year.str.isnumeric())]

GERMANC_val_full = GERMANC_val_full[(GERMANC_val_full.Year.str.len()== 4) & (GERMANC_val_full.Year.str.isnumeric())]

GERMANC_test_full = GERMANC_test_full[(GERMANC_test_full.Year.str.len()== 4) & (GERMANC_test_full.Year.str.isnumeric())]

In [128]:
print('Length train set: ',len(GERMANC_train_full))
print('Length validation set: ', len(GERMANC_val_full))
print('Length test set: ', len(GERMANC_test_full))

Length train set:  177
Length validation set:  40
Length test set:  56


In [129]:
GERMANC_train_x = GERMANC_train_full['Text']
GERMANC_train_y = GERMANC_train_full['Year'].astype(int)

GERMANC_val_x = GERMANC_val_full['Text']
GERMANC_val_y = GERMANC_val_full['Year'].astype(int)

GERMANC_test_x = GERMANC_test_full['Text']
GERMANC_test_y = GERMANC_test_full['Year'].astype(int)

In [54]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

3829.781853517679

In [58]:
y_pred_test = reg_5.predict(GERMANC_test_x)
mean_squared_error(GERMANC_test_y, y_pred_test)

3724.787438465452

In [59]:
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, GERMANC_val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, GERMANC_test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [60]:
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3_results/DTA_over_GERMANC_Exp3_Reg5_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3_results/DTA_over_GERMANC_Exp3_Reg5_Labels_test.csv',sep=';')

In [65]:
val_transform = transformator.transform(GERMANC_val_x)
val_transform = val_transform.toarray()

In [66]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)

In [67]:
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.5650  ± 0.1324,'
0.5278  ± 0.0482,'
0.0007  ± 0.0002,':'
0.0006  ± 0.0001,'ein'
0.0005  ± 0.0001,'er'
0.0003  ± 0.0001,'oder'
0.0002  ± 0.0000,'in'
0.0001  ± 0.0000,''
0.0001  ± 0.0000,'man'
0.0001  ± 0.0000,'nicht'


In [68]:
DTA_GERMANC_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [70]:
test_transform = transformator.transform(GERMANC_test_x)
test_transform = test_transform.toarray()

In [71]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)

In [72]:
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.5497  ± 0.0592,'
0.5060  ± 0.0760,'
0.0003  ± 0.0001,''
0.0003  ± 0.0001,':'
0.0003  ± 0.0000,'er'
0.0002  ± 0.0001,'oder'
0.0002  ± 0.0000,'nicht'
0.0001  ± 0.0000,'ein'
0.0001  ± 0.0000,'man'
0.0001  ± 0.0000,'in'


In [73]:
DTA_GERMANC_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [74]:
DTA_GERMANC_val_features.to_csv('/Volumes/Korpora/Exp3_results/DTA_over_GERMANC_val_features_Exp3.csv', sep=';')
DTA_GERMANC_test_features.to_csv('/Volumes/Korpora/Exp3_results/DTA_over_GERMANC_test_features_Exp3.csv', sep=';')

**GERMANC**

In [75]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 105)),
                    ('feature_selector', SelectKBest(f_regression, k=44)),
                         ('svr', svm.SVR())
                        ])

In [76]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=105,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x11f3aa680>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=44,
                             score_func=<function f_regression at 0x11eccca70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon

In [78]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

745.5235419490572

In [79]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

835.3956312421485

In [80]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 105)),
                    ('feature_selector', SelectKBest(f_regression, k=22)),
                         ('svr', svm.SVR())
                        ])

In [81]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=105,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x11f3aa680>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=22,
                             score_func=<function f_regression at 0x11eccca70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon

In [82]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

737.6743405427411

In [83]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

819.9692302589707

In [84]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 105)),
                    ('feature_selector', SelectKBest(f_regression, k=11)),
                         ('svr', svm.SVR())
                        ])

In [85]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=105,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x11f3aa680>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=11,
                             score_func=<function f_regression at 0x11eccca70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon

In [86]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

737.7050912827574

In [87]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

798.3255078309553

In [89]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 105)),
                    ('feature_selector', SelectKBest(f_regression, k=33)),
                         ('svr', svm.SVR())
                        ])

In [90]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=105,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x11f3aa680>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=33,
                             score_func=<function f_regression at 0x11eccca70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon

In [91]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

741.7157483274672

In [92]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

827.0194034306454

In [131]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 105)),
                    ('feature_selector', SelectKBest(f_regression, k=22)),
                         ('svr', svm.SVR())
                        ])

In [132]:
reg_5.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=105,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x117122560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=22,
                             score_func=<function f_regression at 0x116a48a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon

In [133]:
y_pred_train = reg_5.predict(GERMANC_train_x)
mean_squared_error(GERMANC_train_y, y_pred_train)

737.6743405427411

In [134]:
y_pred_val = reg_5.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

819.9692302589707

In [135]:
y_pred_test = reg_5.predict(GERMANC_test_x)
mean_squared_error(GERMANC_test_y, y_pred_test)

792.3454058965783

In [136]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

diff_pred_true_train = pd.concat([y_pred_train, GERMANC_train_y], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year
    

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, GERMANC_val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, GERMANC_test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [137]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp3_results/GERMANC_Exp3_Reg5_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3_results/GERMANC_Exp3_Reg5_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3_results/GERMANC_Exp3_Reg5_Labels_test.csv',sep=';')

In [102]:
transformator = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 105)),
                    ('feature_selector', SelectKBest(f_regression, k=22))
                        ])

In [103]:
train_transform = transformator.fit_transform(GERMANC_train_x, GERMANC_train_y)
train_transform = train_transform.toarray()

In [104]:
features = transformator['feature_selector'].get_support()
feature_names = transformator['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

In [105]:
perm = PermutationImportance(reg_5['svr']).fit(train_transform, y_pred_train)

In [106]:
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.5198  ± 0.0866,'
0.5133  ± 0.0994,'
0.0020  ± 0.0004,':'
0.0003  ± 0.0000,'die'
0.0001  ± 0.0000,'daß'
0.0000  ± 0.0000,'des'
0.0000  ± 0.0000,'auf'
0.0000  ± 0.0000,'wenn'
0.0000  ± 0.0000,'eine'
0.0000  ± 0.0000,'um'


In [107]:
GERMANC_train_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [109]:
val_transform = transformator.transform(GERMANC_val_x)
val_transform = val_transform.toarray()

In [110]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.4954  ± 0.0457,'
0.4872  ± 0.1215,'
0.0020  ± 0.0008,':'
0.0002  ± 0.0001,'die'
0.0001  ± 0.0000,'daß'
0.0000  ± 0.0000,'des'
0.0000  ± 0.0000,'auf'
0.0000  ± 0.0000,'wenn'
0.0000  ± 0.0000,'eine'
0.0000  ± 0.0000,'sind'


In [111]:
GERMANC_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [112]:
test_transform = transformator.transform(GERMANC_test_x)
test_transform = test_transform.toarray()

In [113]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.5345  ± 0.1524,'
0.5238  ± 0.1289,'
0.0004  ± 0.0001,'die'
0.0001  ± 0.0001,':'
0.0000  ± 0.0000,'daß'
0.0000  ± 0.0000,'des'
0.0000  ± 0.0000,'auf'
0.0000  ± 0.0000,'wenn'
0.0000  ± 0.0000,'eine'
0.0000  ± 0.0000,'um'


In [114]:
GERMANC_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [142]:
y_pred_val = reg_5.predict(val_x)
mean_squared_error(val_y, y_pred_val)

9955.600948926007

In [143]:
y_pred_test = reg_5.predict(test_x)
mean_squared_error(test_y, y_pred_test)

8958.889061190974

In [144]:
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val,val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Publication_year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Publication_year

In [145]:
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3_results/GERMANC_DTA_Exp3_Reg5_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3_results/GERMANC_DTA_Exp3_Reg5_Labels_test.csv',sep=';')

In [115]:
val_transform = transformator.transform(val_x)
val_transform = val_transform.toarray()

In [119]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
1.0001  ± 0.0007,'die'
0.8900  ± 0.2402,'
0.8802  ± 0.1845,'
0.6090  ± 0.3326,'auf'
0.5958  ± 0.3270,'daß'
0.5771  ± 0.2793,'des'
0.5444  ± 0.3840,'eine'
0.5129  ± 0.4241,':'
0.4621  ± 0.3742,'nur'
0.3538  ± 0.2040,'aus'


In [120]:
GERMANC_DTA_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [121]:
test_transform = transformator.transform(test_x)
test_transform = test_transform.toarray()

In [122]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.9547  ± 0.2130,'
0.9382  ± 0.2662,'
0.8960  ± 0.1422,'die'
0.5494  ± 0.2847,'auf'
0.5240  ± 0.1133,'eine'
0.5041  ± 0.1305,'des'
0.4819  ± 0.2162,'daß'
0.4485  ± 0.2194,':'
0.3874  ± 0.2087,'nur'
0.3800  ± 0.2274,'wenn'


In [123]:
GERMANC_DTA_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [124]:
GERMANC_train_features.to_csv('/Volumes/Korpora/Exp3_results/GERMANC_train_features_Exp3.csv', sep=';')
GERMANC_val_features.to_csv('/Volumes/Korpora/Exp3_results/GERMANC_val_features_Exp3.csv', sep=';')
GERMANC_test_features.to_csv('/Volumes/Korpora/Exp3_results/GERMANC_test_features_Exp3.csv', sep=';')
GERMANC_DTA_val_features.to_csv('/Volumes/Korpora/Exp3_results/GERMANC_over_DTA_val_features_Exp3.csv', sep=';')
GERMANC_DTA_test_features.to_csv('/Volumes/Korpora/Exp3_results/GERMANC_over_DTA_test_features_Exp3.csv', sep=';')

In [4]:
ARCHER_train_full = pd.read_csv('/Volumes/Korpora/Train/ARCHER_train_tokenized.csv', sep=';')
ARCHER_val_full = pd.read_csv('/Volumes/Korpora/Val/ARCHER_val_tokenized.csv', sep=';')
ARCHER_test_full = pd.read_csv('/Volumes/Korpora/Test/ARCHER_test_tokenized.csv', sep=';')

In [5]:
ARCHER_train_full = ARCHER_train_full[(ARCHER_train_full.Year.str.len()== 4) & (ARCHER_train_full.Year.str.isnumeric())]

ARCHER_val_full = ARCHER_val_full[(ARCHER_val_full.Year.str.len()== 4) & (ARCHER_val_full.Year.str.isnumeric())]

ARCHER_test_full = ARCHER_test_full[(ARCHER_test_full.Year.str.len()== 4) & (ARCHER_test_full.Year.str.isnumeric())]

In [6]:
print('Length train set: ',len(ARCHER_train_full))
print('Length validation set: ', len(ARCHER_val_full))
print('Length test set: ', len(ARCHER_test_full))

Length train set:  1049
Length validation set:  264
Length test set:  329


In [7]:
ARCHER_train_x = ARCHER_train_full['Text']
ARCHER_train_y = ARCHER_train_full['Year'].astype(int)

ARCHER_val_x = ARCHER_val_full['Text']
ARCHER_val_y = ARCHER_val_full['Year'].astype(int)

ARCHER_test_x = ARCHER_test_full['Text']
ARCHER_test_y = ARCHER_test_full['Year'].astype(int)

In [10]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 500)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('svr', svm.SVR())
                        ])

In [11]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=500,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x117122560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x116a48a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsi

In [12]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

8633.760705404035

In [13]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

8939.260952494482

In [14]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('svr', svm.SVR())
                        ])

In [15]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x117122560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x116a48a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsi

In [16]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

8615.109274365492

In [17]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

8959.347967264155

In [18]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 300)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('svr', svm.SVR())
                        ])

In [19]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=300,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x117122560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x116a48a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsi

In [20]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

8651.164847682081

In [21]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

8948.290702038592

In [23]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 300)),
                    ('feature_selector', SelectKBest(f_regression, k=178)),
                         ('svr', svm.SVR())
                        ])

In [24]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=300,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x117122560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=178,
                             score_func=<function f_regression at 0x116a48a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilo

In [25]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

8848.914934746368

In [26]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

9333.850424572334

In [27]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 500)),
                    ('feature_selector', SelectKBest(f_regression, k=100)),
                         ('svr', svm.SVR())
                        ])

In [28]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=500,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x117122560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=100,
                             score_func=<function f_regression at 0x116a48a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilo

In [29]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

8838.223785384655

In [30]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

9316.181223676746

In [120]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 500)),
                    ('feature_selector', SelectKBest(f_regression, k=130)),
                         ('svr', svm.SVR())
                        ])

In [121]:
reg_5.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=500,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x117122560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=130,
                             score_func=<function f_regression at 0x116a48a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilo

In [33]:
y_pred_train = reg_5.predict(ARCHER_train_x)
mean_squared_error(ARCHER_train_y, y_pred_train)

8603.066798842648

In [49]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

8916.03416212663

In [50]:
y_pred_test = reg_5.predict(ARCHER_test_x)
mean_squared_error(ARCHER_test_y, y_pred_test)

8495.802980948689

In [51]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

diff_pred_true_train = pd.concat([y_pred_train, ARCHER_train_y], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year
    

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, ARCHER_val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, ARCHER_test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [52]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp3_results/ARCHER_Exp3_Reg5_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3_results/ARCHER_Exp3_Reg5_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3_results/ARCHER_Exp3_Reg5_Labels_test.csv',sep=';')

In [42]:
transformator = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 500)),
                    ('feature_selector', SelectKBest(f_regression, k=130))
                        ])

In [45]:
train_transform = transformator.fit_transform(ARCHER_train_x, ARCHER_train_y)
train_transform = train_transform.toarray()

In [46]:
features = transformator['feature_selector'].get_support()
feature_names = transformator['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

In [47]:
perm = PermutationImportance(reg_5['svr']).fit(train_transform, y_pred_train)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.7294  ± 0.0748,'.'
0.4642  ± 0.0194,'the'
0.1462  ± 0.0125,'
0.1440  ± 0.0089,'
0.0218  ± 0.0036,'of'
0.0075  ± 0.0007,'and'
0.0051  ± 0.0008,'a'
0.0048  ± 0.0003,'in'
0.0047  ± 0.0004,'to'
0.0020  ± 0.0002,"""''"""


In [48]:
ARCHER_train_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [55]:
val_transform = transformator.transform(ARCHER_val_x)
val_transform = val_transform.toarray()

In [56]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.7022  ± 0.0485,'.'
0.4507  ± 0.0749,'the'
0.1506  ± 0.0075,'
0.1462  ± 0.0176,'
0.0199  ± 0.0038,'of'
0.0086  ± 0.0021,'and'
0.0054  ± 0.0017,'to'
0.0050  ± 0.0007,'in'
0.0049  ± 0.0012,'a'
0.0026  ± 0.0018,"""''"""


In [57]:
ARCHER_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [58]:
test_transform = transformator.transform(ARCHER_test_x)
test_transform = test_transform.toarray()

In [59]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.8007  ± 0.0587,'.'
0.4426  ± 0.0832,'the'
0.1655  ± 0.0170,'
0.1647  ± 0.0102,'
0.0214  ± 0.0056,'of'
0.0079  ± 0.0012,'and'
0.0062  ± 0.0028,':'
0.0057  ± 0.0010,'a'
0.0046  ± 0.0007,'in'
0.0045  ± 0.0008,'to'


In [60]:
ARCHER_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [36]:
CLMET_train_full = pd.read_csv('/Volumes/Korpora/Train/CLMET_train_tokenized.csv', sep=';')
CLMET_val_full = pd.read_csv('/Volumes/Korpora/Val/CLMET_val_tokenized.csv', sep=';')
CLMET_test_full = pd.read_csv('/Volumes/Korpora/Test/CLMET_test_tokenized.csv', sep=';')

In [37]:
#drop rows with invalid data types
CLMET_train_full = CLMET_train_full[CLMET_train_full.Year.str.len()== 4]
CLMET_val_full = CLMET_val_full[CLMET_val_full.Year.str.len()== 4]
CLMET_test_full = CLMET_test_full[CLMET_test_full.Year.str.len()== 4]

In [38]:
CLMET_train_x = CLMET_train_full['Text']
CLMET_train_y = CLMET_train_full['Year'].astype(int)

CLMET_val_x = CLMET_val_full['Text']
CLMET_val_y = CLMET_val_full['Year'].astype(int)

CLMET_test_x = CLMET_test_full['Text']
CLMET_test_y = CLMET_test_full['Year'].astype(int)

In [39]:
print('Length train set: ',len(CLMET_train_full))
print('Length validation set: ', len(CLMET_val_full))
print('Length test set: ', len(CLMET_test_full))

Length train set:  186
Length validation set:  47
Length test set:  60


In [122]:
y_pred_val = reg_5.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

3205.398612412842

In [123]:
y_pred_test = reg_5.predict(CLMET_test_x)
mean_squared_error(CLMET_test_y, y_pred_test)

3673.107374950346

In [124]:
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, CLMET_val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, CLMET_test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [125]:
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3_results/ARCHER_over_CLMET_Exp3_Reg5_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3_results/ARCHER_over_CLMET_Exp3_Reg5_Labels_test.csv',sep=';')

In [63]:
val_transform = transformator.transform(CLMET_val_x)
val_transform = val_transform.toarray()

In [64]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
1.0472  ± 0.0003,'
1.0468  ± 0.0015,'
0.9320  ± 0.4069,'.'
0.8550  ± 0.4696,'of'
0.8282  ± 0.6938,'and'
0.8236  ± 0.5486,'a'
0.7440  ± 0.7742,'to'
0.6248  ± 0.7574,'in'
0.6244  ± 0.7560,'that'
0.6103  ± 0.3485,'``'


In [65]:
ARCHER_CLMET_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [66]:
test_transform = transformator.transform(CLMET_test_x)
test_transform = test_transform.toarray()

In [67]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
1.1856  ± 0.6765,'
1.0176  ± 0.0042,'
1.0159  ± 0.0019,'that'
0.9509  ± 0.2618,'.'
0.8874  ± 0.8649,';'
0.8588  ± 0.4715,'he'
0.8569  ± 0.5218,'with'
0.8461  ± 0.6814,'of'
0.8454  ± 0.5971,'the'
0.8118  ± 0.8116,'in'


In [68]:
ARCHER_CLMET_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [70]:
ARCHER_train_features.to_csv('/Volumes/Korpora/Exp3_results/ARCHER_train_features_Exp3.csv', sep=';')
ARCHER_val_features.to_csv('/Volumes/Korpora/Exp3_results/ARCHER_val_features_Exp3.csv', sep=';')
ARCHER_test_features.to_csv('/Volumes/Korpora/Exp3_results/ARCHER_test_features_Exp3.csv', sep=';')
ARCHER_CLMET_val_features.to_csv('/Volumes/Korpora/Exp3_results/ARCHER_over_CLMET_val_features_Exp3.csv', sep=';')
ARCHER_CLMET_test_features.to_csv('/Volumes/Korpora/Exp3_results/ARCHER_over_CLMET_test_features_Exp3.csv', sep=';')

In [71]:
reg_5 = Pipeline([('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 176)),
                    ('feature_selector', SelectKBest(f_regression, k=32)),
                         ('svr', svm.SVR())]
                  )

In [72]:
reg_5.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=176,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x117122560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=32,
                             score_func=<function f_regression at 0x116a48a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon

In [73]:
y_pred_train = reg_5.predict(CLMET_train_x)
mean_squared_error(CLMET_train_y, y_pred_train)

3135.455358670207

In [74]:
y_pred_val = reg_5.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

3176.0535454246083

In [75]:
reg_5 = Pipeline([('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 176)),
                    ('feature_selector', SelectKBest(f_regression, k=44)),
                         ('svr', svm.SVR())]
                  )

In [76]:
reg_5.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=176,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x117122560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=44,
                             score_func=<function f_regression at 0x116a48a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon

In [77]:
y_pred_train = reg_5.predict(CLMET_train_x)
mean_squared_error(CLMET_train_y, y_pred_train)

3252.3095711052456

In [78]:
y_pred_val = reg_5.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

3266.5405354176924

In [83]:
reg_5 = Pipeline([('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 176)),
                    ('feature_selector', SelectKBest(f_regression, k=10)),
                         ('svr', svm.SVR())]
                  )

In [84]:
reg_5.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=176,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x117122560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=10,
                             score_func=<function f_regression at 0x116a48a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon

In [85]:
y_pred_train = reg_5.predict(CLMET_train_x)
mean_squared_error(CLMET_train_y, y_pred_train)

3156.3395364013472

In [86]:
y_pred_val = reg_5.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

3204.033880101949

In [87]:
reg_5 = Pipeline([('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 176)),
                    ('feature_selector', SelectKBest(f_regression, k=32)),
                         ('svr', svm.SVR())]
                  )

In [88]:
reg_5.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=176,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x117122560>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=32,
                             score_func=<function f_regression at 0x116a48a70>)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon

In [89]:
y_pred_train = reg_5.predict(CLMET_train_x)
mean_squared_error(CLMET_train_y, y_pred_train)

3135.455358670207

In [90]:
y_pred_val = reg_5.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

3176.0535454246083

In [91]:
y_pred_test = reg_5.predict(CLMET_test_x)
mean_squared_error(CLMET_test_y, y_pred_test)

3404.0427042578312

In [92]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

diff_pred_true_train = pd.concat([y_pred_train, CLMET_train_y], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year
    

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, CLMET_val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, CLMET_test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [93]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp3_results/CLMET_Exp3_Reg5_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3_results/CLMET_Exp3_Reg5_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3_results/CLMET_Exp3_Reg5_Labels_test.csv',sep=';')

In [94]:
transformator = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 176)),
                    ('feature_selector', SelectKBest(f_regression, k=32))
                        ])

In [95]:
train_transform = transformator.fit_transform(CLMET_train_x, CLMET_train_y)
train_transform = train_transform.toarray()

In [97]:
features = transformator['feature_selector'].get_support()
feature_names = transformator['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

In [98]:
perm = PermutationImportance(reg_5['svr']).fit(train_transform, y_pred_train)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.9714  ± 0.1970,'my'
0.7055  ± 0.0567,'-'
0.0719  ± 0.0205,'there'
0.0579  ± 0.0121,'?'
0.0427  ± 0.0175,'your'
0.0159  ± 0.0025,'up'
0.0085  ± 0.0050,'every'
0.0030  ± 0.0012,'down'
0.0029  ± 0.0013,'like'
0.0019  ± 0.0009,'shall'


In [99]:
CLMET_train_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [100]:
val_transform = transformator.transform(CLMET_val_x)
val_transform = val_transform.toarray()

In [101]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.9880  ± 0.3733,'-'
0.9866  ± 0.2665,'my'
0.1260  ± 0.0154,'there'
0.0957  ± 0.0254,'?'
0.0594  ± 0.0235,'your'
0.0313  ± 0.0083,'up'
0.0068  ± 0.0009,'like'
0.0048  ± 0.0014,'every'
0.0042  ± 0.0020,'shall'
0.0036  ± 0.0013,'down'


In [102]:
CLMET_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [103]:
test_transform = transformator.transform(CLMET_test_x)
test_transform = test_transform.toarray()

In [104]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
0.8921  ± 0.3964,'my'
0.7500  ± 0.1856,'-'
0.1037  ± 0.0448,'there'
0.1024  ± 0.0136,'?'
0.0609  ± 0.0650,'your'
0.0236  ± 0.0248,'up'
0.0218  ± 0.0331,'every'
0.0168  ± 0.0424,'shall'
0.0046  ± 0.0017,'like'
0.0034  ± 0.0029,'down'


In [105]:
CLMET_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [106]:
y_pred_val = reg_5.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

10109.930975614836

In [107]:
y_pred_test = reg_5.predict(ARCHER_test_x)
mean_squared_error(ARCHER_test_y, y_pred_test)

9719.350960203037

In [108]:
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, ARCHER_val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, ARCHER_test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [109]:
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3_results/CLMET_over_ARCHER_Exp3_Reg5_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp3_results/CLMET_over_ARCHER_Exp3_Reg5_Labels_test.csv',sep=';')

In [110]:
val_transform = transformator.transform(ARCHER_val_x)
val_transform = val_transform.toarray()

In [112]:
perm = PermutationImportance(reg_5['svr']).fit(val_transform, y_pred_val)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
1.3383  ± 0.1517,'my'
0.3189  ± 0.0097,'-'
0.2683  ± 0.0378,'?'
0.1236  ± 0.0148,'there'
0.0453  ± 0.0079,'your'
0.0259  ± 0.0034,'up'
0.0074  ± 0.0008,'like'
0.0058  ± 0.0005,'down'
0.0032  ± 0.0002,'every'
0.0031  ± 0.0004,'back'


In [114]:
CLMET_ARCHER_val_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [116]:
test_transform = transformator.transform(ARCHER_test_x)
test_transform = test_transform.toarray()

In [117]:
perm = PermutationImportance(reg_5['svr']).fit(test_transform, y_pred_test)
eli5.explain_weights(perm, feature_names=features_selected)

Weight,Feature
1.5420  ± 0.0636,'my'
0.5168  ± 0.1002,'?'
0.3369  ± 0.0170,'-'
0.0985  ± 0.0104,'there'
0.0869  ± 0.0065,'your'
0.0226  ± 0.0014,'up'
0.0059  ± 0.0002,'like'
0.0052  ± 0.0006,'every'
0.0049  ± 0.0004,'down'
0.0039  ± 0.0002,'shall'


In [118]:
CLMET_ARCHER_test_features = eli5.explain_weights_df(perm, feature_names=features_selected)

In [119]:
CLMET_train_features.to_csv('/Volumes/Korpora/Exp3_results/CLMET_train_features_Exp3.csv', sep=';')
CLMET_val_features.to_csv('/Volumes/Korpora/Exp3_results/CLMET_val_features_Exp3.csv', sep=';')
CLMET_test_features.to_csv('/Volumes/Korpora/Exp3_results/CLMET_test_features_Exp3.csv', sep=';')
CLMET_ARCHER_val_features.to_csv('/Volumes/Korpora/Exp3_results/CLMET_over_ARCHER_val_features_Exp3.csv', sep=';')
CLMET_ARCHER_test_features.to_csv('/Volumes/Korpora/Exp3_results/CLMET_over_ARCHER_test_features_Exp3.csv', sep=';')