**Experiment 1b: linear regression with features selected by document frequency, evaluated across corpora in the same language**

*Background*: Exp. 1b works well on all corpora.

*Goal*: Observe the performance of linear regression on a validation or test set that is in the same language as the training set but comes from a different corpus. This helps determine whether the algorithm generalizes well and whether the features it picks are really typical of a certain time period (cf. 

*Strategies*:
- Train on features that occur in at least a given share of the training documents (minimum document frequency)

*Relevance*:
- If this experiment works, years can be estimated for corpora in which this variable is missing (NA).

*Success criteria*:
- Consistent findings over different corpora in the same language
- Predicted year is no more than ten years away from the true year
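
The second criterion is stated in years, while the results in this notebook are reported as mean squared errors, so taking the square root makes the two comparable. A minimal sketch using two of the MSE values reported in this notebook; the helper name `rmse_years` and the constant `MAX_ERROR_YEARS` are illustrative and not part of the original code:

```python
import math

MAX_ERROR_YEARS = 10  # success criterion from the list above


def rmse_years(mse):
    """Square root of the mean squared error, in the unit of the target (years)."""
    return math.sqrt(mse)


# MSE values copied from the results reported in this notebook
reported_mse = {
    "DTA -> GERMANC val": 3700.60,
    "CLMET -> ARCHER val": 10212.14,
}

for name, mse in reported_mse.items():
    rmse = rmse_years(mse)
    verdict = "within" if rmse <= MAX_ERROR_YEARS else "outside"
    print(f"{name}: RMSE = {rmse:.1f} years ({verdict} the ten-year margin)")
```

The RMSE is only a rough proxy here: the criterion as written concerns individual predictions, whereas the RMSE summarizes the typical error over a whole set.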

*Corpora*:
- DTA
- CLMET
- GERMANC
- ARCHER

*Result*: The model trained on DTA performs well on the GERMANC validation set: its MSE there (3700.60) is close to the MSE on DTA's own validation set (3202.51). It does not work well the other way round, which is unsurprising given that GERMANC provides far fewer features than the larger DTA.

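With CLMET and ARCHER, performance is poor in both directions. This may be because ARCHER's setup parameters (features occurring in about 48% of the training documents, number of features at 15% of the number of training documents) differ markedly from those used for the other corpora.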

----------------------------------------------------------------------------------------------------------------------

*MSE DTA Train*: 2259.8

*MSE DTA Val*: 3202.51

*MSE DTA Test*: 4504.35

*MSE over GERMANC val*: 3700.60

*MSE over GERMANC test*: 3565.05

Setup:
- features occur in about 89% of documents in the train set
- number of features is 25% of the number of documents in the train set

----------------------------------------------------------------------------------------------------------------------

*MSE CLMET Train*: 1346.26

*MSE CLMET Val*: 2727.37

*MSE CLMET Test*: 4315.18

*MSE over ARCHER Val*: 10212.14

*MSE over ARCHER Test*: 9784.28

Setup:
- features occur in about 89% of documents in the train set
- number of features is about 22% of the number of documents in the train set (k = 40 of 186)

----------------------------------------------------------------------------------------------------------------------

*MSE ARCHER Train*: 3555.68

*MSE ARCHER Val*: 4843.06

*MSE ARCHER Test*: 4939.51

*MSE over CLMET Val*: 6437126.77

*MSE over CLMET Test*: 21896163.78

Setup:
- features occur in about 48% of documents in the train set
- number of features is 15% of the number of documents in the train set

----------------------------------------------------------------------------------------------------------------------

*MSE GERMANC Train*: 348.83

*MSE GERMANC Val*: 398.94

*MSE GERMANC Test*: 649.33

*MSE over DTA val*: 8400009.86

*MSE over DTA test*: 6689227.44

Setup:
- features occur in about 59% of documents in the train set (min_df = 105 of 177)
- number of features is 25% of the number of documents in the train set
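
The percentages in the Setup notes can be recomputed from the training-set sizes and the `min_df` and `k` arguments that appear in the cells further down. A small sketch of that arithmetic; the dictionary layout and the helper `share` are illustrative, and only the two runs with a numeric `k` are included for the feature share, because `k='all'` does not fix the number of features in advance:

```python
# (train-set size, min_df) per training corpus, taken from the cells below
runs = {
    "DTA": (899, 800),
    "CLMET": (186, 167),
    "ARCHER": (1049, 500),
    "GERMANC": (177, 105),
}
# k passed to SelectKBest, for the runs where it is a number
k_values = {"CLMET": 40, "GERMANC": 44}


def share(part, whole):
    """Percentage of `whole` that `part` represents."""
    return 100 * part / whole


for corpus, (n_docs, min_df) in runs.items():
    line = f"{corpus}: features must occur in at least {share(min_df, n_docs):.0f}% of train documents"
    if corpus in k_values:
        line += f"; k is {share(k_values[corpus], n_docs):.0f}% of the number of train documents"
    print(line)
```

`CountVectorizer` treats an integer `min_df` as an absolute document count, which is why the threshold has to be converted by hand to obtain a percentage.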

In [1]:
import pandas as pd
import numpy as np
import nltk

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest , f_regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import sklearn.utils
import re

import eli5



In [2]:
# Code example: https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
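def features_to_names(features, feature_names):
    # keep the names whose entry in the boolean mask is True
    features_selected = []

    # `selected` instead of `bool`, which would shadow the built-in type
    for selected, feature in zip(features, feature_names):
        if selected:
            features_selected.append(feature)
    return features_selected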
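
A quick usage sketch for the helper, with a hand-written mask and vocabulary standing in for the output of `get_support()` and of the vectorizer (both lists are made up for illustration). The list comprehension is an equivalent, more compact form of the loop above:

```python
def features_to_names(features, feature_names):
    """Keep the names whose mask entry is truthy."""
    return [name for keep, name in zip(features, feature_names) if keep]


mask = [True, False, True, False]      # stand-in for get_support()
names = ["der", "die", "und", "vnd"]   # stand-in for the vectorizer vocabulary

selected = features_to_names(mask, names)
print(selected)
```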

In [8]:
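# build a tokenizer that removes the characters '[', ']', '(', ')', '+' and '|'
# and then splits the remaining string on ','
def tokenizer_word(doc):
    # raw string; the character class matches the same characters as before
    doc = re.sub(r'[()\[\]+|]', '', doc)
    doc = re.split(',', doc)
    return doc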
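
A short sketch of what the tokenizer returns, assuming the `Text` column holds a bracketed, comma-separated token list (the actual cell format is not shown in this notebook, so the sample strings are hypothetical):

```python
import re


def tokenizer_word(doc):
    # same set of characters as in the cell above, written as a raw string
    doc = re.sub(r'[()\[\]+|]', '', doc)
    return re.split(',', doc)


sample = "[der,hund,(bellt)]"      # hypothetical cell content
tokens = tokenizer_word(sample)
print(tokens)

# whitespace after a comma is kept as part of the token
spaced = tokenizer_word("[a, b]")
print(spaced)
```

Because the split is on `','` only, any space following a comma stays attached to the next token; if the stored lists contain such spaces, stripping each token would be needed to avoid treating `' b'` and `'b'` as different features.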

In [7]:
# function for assembling per-document explanations in order to find out how features are weighted
def collect_predictions(dataset, classifier, vectorizer, feature_names, pipeline):
    # global feature weights of the fitted model
    predictions = eli5.explain_weights_df(classifier, vec=vectorizer, feature_names=feature_names)
    predictions = predictions.drop(['target'], axis=1)
    predictions['YEAR'] = 0

    for index in dataset.index.values:
        # per-document explanation: one row per feature with its weight and value
        pred = eli5.explain_prediction_df(classifier, dataset[index], vec=vectorizer, feature_names=feature_names)

        # year predicted by the full pipeline for this single document
        year_pred = pipeline.predict([dataset[index]])

        # product of the reported weight and the feature value
        pred['weight_value'] = pred['weight'] * pred['value']
        pred['instance'] = index
        pred = pred.drop(['target', 'value'], axis=1)
        pred['YEAR'] = np.round(year_pred[0])

        predictions = pd.concat([predictions, pred])

    # dropna() removes every row with a missing value; this includes the
    # global-weight rows, which have no 'weight_value' or 'instance'
    return predictions.dropna()
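
A linear model's prediction is the intercept plus the sum of coefficient times feature value, which appears to be the idea behind multiplying a weight by a value per feature. A toy sketch with invented numbers (the feature names, coefficients, and counts below are all made up) of how such per-feature terms add up to a predicted year:

```python
intercept = 1700.0                                   # invented bias term
coefficients = {"vnd": -0.8, "und": 0.5, "seyn": -1.2}   # invented model weights
counts = {"vnd": 10, "und": 40, "seyn": 5}               # invented token counts of one document

# per-feature term: coefficient times feature value
contributions = {f: coefficients[f] * counts[f] for f in coefficients}

# the prediction is the intercept plus the sum of all terms
predicted_year = intercept + sum(contributions.values())

for feature, term in contributions.items():
    print(f"{feature}: {term:+.1f}")
print("predicted year:", round(predicted_year))
```

Sorting such terms by absolute size shows which tokens push a document towards an earlier or later year.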

In [3]:
DTA_train_full = pd.read_csv('/Volumes/Korpora/Train/DTA_train_tokenized.csv', sep=';')
DTA_val_full = pd.read_csv('/Volumes/Korpora/Val/DTA_val_tokenized.csv', sep=';')
DTA_test_full = pd.read_csv('/Volumes/Korpora/Test/DTA_test_tokenized.csv', sep=';')

In [9]:
print('Length train set: ',len(DTA_train_full))
print('Length validation set: ', len(DTA_val_full))
print('Length test set: ', len(DTA_test_full))

Length train set:  899
Length validation set:  225
Length test set:  281


In [10]:
DTA_train_x = DTA_train_full['Text']
DTA_train_y = DTA_train_full['Publication_year']

DTA_val_x = DTA_val_full['Text']
DTA_val_y = DTA_val_full['Publication_year']

DTA_test_x = DTA_test_full['Text']
DTA_test_y = DTA_test_full['Publication_year']

In [4]:
CLMET_train_full = pd.read_csv('/Volumes/Korpora/Train/CLMET_train_tokenized.csv', sep=';')
CLMET_val_full = pd.read_csv('/Volumes/Korpora/Val/CLMET_val_tokenized.csv', sep=';')
CLMET_test_full = pd.read_csv('/Volumes/Korpora/Test/CLMET_test_tokenized.csv', sep=';')

In [11]:
#drop rows with invalid data types
CLMET_train_full = CLMET_train_full[CLMET_train_full.Year.str.len()== 4]
CLMET_val_full = CLMET_val_full[CLMET_val_full.Year.str.len()== 4]
CLMET_test_full = CLMET_test_full[CLMET_test_full.Year.str.len()== 4]
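
The length check alone lets through four-character strings that are not numbers, and those would make a later `astype(int)` fail; this is presumably why the ARCHER and GERMANC cells further down add `str.isnumeric()`. A minimal sketch on made-up rows (the `toy` frame is purely illustrative):

```python
import pandas as pd

# made-up rows; a column that mixes digits and text is read as strings
toy = pd.DataFrame({"Year": ["1750", "17xx", "c1750", "850"]})

four_chars = toy.Year.str.len() == 4     # exactly four characters
numeric = toy.Year.str.isnumeric()       # only numeric characters

length_only = toy[four_chars].Year.tolist()
length_and_digits = toy[four_chars & numeric].Year.tolist()

print(length_only)
print(length_and_digits)
```

Note that the `.str` accessor only works on string columns, so the filter assumes `Year` was not parsed as an integer column by `read_csv`.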

In [13]:
print('Length train set: ',len(CLMET_train_full))
print('Length validation set: ', len(CLMET_val_full))
print('Length test set: ', len(CLMET_test_full))

Length train set:  186
Length validation set:  47
Length test set:  60


In [12]:
CLMET_train_x = CLMET_train_full['Text']
CLMET_train_y = CLMET_train_full['Year'].astype(int)

CLMET_val_x = CLMET_val_full['Text']
CLMET_val_y = CLMET_val_full['Year'].astype(int)

CLMET_test_x = CLMET_test_full['Text']
CLMET_test_y = CLMET_test_full['Year'].astype(int)

In [5]:
ARCHER_train_full = pd.read_csv('/Volumes/Korpora/Train/ARCHER_train_tokenized.csv', sep=';')
ARCHER_val_full = pd.read_csv('/Volumes/Korpora/Val/ARCHER_val_tokenized.csv', sep=';')
ARCHER_test_full = pd.read_csv('/Volumes/Korpora/Test/ARCHER_test_tokenized.csv', sep=';')

In [14]:
ARCHER_train_full = ARCHER_train_full[(ARCHER_train_full.Year.str.len()== 4) & (ARCHER_train_full.Year.str.isnumeric())]

ARCHER_val_full = ARCHER_val_full[(ARCHER_val_full.Year.str.len()== 4) & (ARCHER_val_full.Year.str.isnumeric())]

ARCHER_test_full = ARCHER_test_full[(ARCHER_test_full.Year.str.len()== 4) & (ARCHER_test_full.Year.str.isnumeric())]

In [15]:
print('Length train set: ',len(ARCHER_train_full))
print('Length validation set: ', len(ARCHER_val_full))
print('Length test set: ', len(ARCHER_test_full))

Length train set:  1049
Length validation set:  264
Length test set:  329


In [16]:
ARCHER_train_x = ARCHER_train_full['Text']
ARCHER_train_y = ARCHER_train_full['Year'].astype(int)

ARCHER_val_x = ARCHER_val_full['Text']
ARCHER_val_y = ARCHER_val_full['Year'].astype(int)

ARCHER_test_x = ARCHER_test_full['Text']
ARCHER_test_y = ARCHER_test_full['Year'].astype(int)

In [6]:
GERMANC_train_full = pd.read_csv('/Volumes/Korpora/Train/GERMANC_train_tokenized.csv', sep=';')
GERMANC_val_full = pd.read_csv('/Volumes/Korpora/Val/GERMANC_val_tokenized.csv', sep=';')
GERMANC_test_full = pd.read_csv('/Volumes/Korpora/Test/GERMANC_test_tokenized.csv', sep=';')

In [17]:
GERMANC_train_full = GERMANC_train_full[(GERMANC_train_full.Year.str.len()== 4) & (GERMANC_train_full.Year.str.isnumeric())]

GERMANC_val_full = GERMANC_val_full[(GERMANC_val_full.Year.str.len()== 4) & (GERMANC_val_full.Year.str.isnumeric())]

GERMANC_test_full = GERMANC_test_full[(GERMANC_test_full.Year.str.len()== 4) & (GERMANC_test_full.Year.str.isnumeric())]

In [18]:
print('Length train set: ',len(GERMANC_train_full))
print('Length validation set: ', len(GERMANC_val_full))
print('Length test set: ', len(GERMANC_test_full))

Length train set:  177
Length validation set:  40
Length test set:  56


In [19]:
GERMANC_train_x = GERMANC_train_full['Text']
GERMANC_train_y = GERMANC_train_full['Year'].astype(int)

GERMANC_val_x = GERMANC_val_full['Text']
GERMANC_val_y = GERMANC_val_full['Year'].astype(int)

GERMANC_test_x = GERMANC_test_full['Text']
GERMANC_test_y = GERMANC_test_full['Year'].astype(int)

**Classifier trained on DTA, and validated on GERMANC**

In [34]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [35]:
reg_4.fit(DTA_train_x, DTA_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ec13c680>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x117096a70>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [22]:
y_pred_val = reg_4.predict(GERMANC_val_x)
mean_squared_error(GERMANC_val_y, y_pred_val)

3700.6021756572445

In [36]:
y_pred_test = reg_4.predict(GERMANC_test_x)
mean_squared_error(GERMANC_test_y, y_pred_test)

3565.0475176619693

In [38]:

# concat(axis=1) pairs rows by index label; the filtered GERMANC frames may
# have gaps in their index, so the true years are re-indexed by position first
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, GERMANC_val_y.reset_index(drop=True)], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, GERMANC_test_y.reset_index(drop=True)], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year
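
`pd.concat(..., axis=1)` aligns rows by index label, not by position. A minimal sketch on made-up data of why the true years need a positional index when they come from a frame that had rows filtered out (all values below are invented):

```python
import pandas as pd

# predictions get a fresh 0..n-1 index
preds = pd.DataFrame([1710.0, 1745.0, 1790.0], columns=["Predicted_y"])
# true years keep the labels of the filtered frame, here with gaps
truth = pd.Series([1700, 1750, 1800], index=[0, 2, 5], name="Year")

by_label = pd.concat([preds, truth], axis=1)
by_position = pd.concat([preds, truth.reset_index(drop=True)], axis=1)

rows_with_gaps_label = int(by_label.isna().any(axis=1).sum())
rows_with_gaps_position = int(by_position.isna().any(axis=1).sum())

print(len(by_label), rows_with_gaps_label)
print(len(by_position), rows_with_gaps_position)
```

With label alignment the result has more rows than there are documents and some predictions are paired with the wrong year or with none; resetting the index restores a one-to-one pairing.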

In [39]:

diff_pred_true_val.to_csv('/Volumes/Korpora/DTA_over_GERMANC_Exp1b_Reg4_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/DTA_over_GERMANC_Exp1b_Reg4_Labels_test.csv',sep=';')

In [40]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

In [42]:
val_details = collect_predictions(GERMANC_val_x, reg_4['ridge_reg'],reg_4['unigram_vectorizer'],features_selected, reg_4)
test_details = collect_predictions(GERMANC_test_x, reg_4['ridge_reg'],reg_4['unigram_vectorizer'],features_selected, reg_4)

In [43]:
val_details.to_csv('/Volumes/Korpora/DTA_over_GERMANC_Exp1b_Reg4_Val_results.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/DTA_over_GERMANC_Exp1b_Reg4_Test_results.csv', sep=';')

**Classifier trained on GERMANC, and validated on DTA**

In [44]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 105)),
                    ('feature_selector', SelectKBest(f_regression, k=44)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [45]:
reg_4.fit(GERMANC_train_x, GERMANC_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=105,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ec13c680>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=44,
                             score_func=<function f_regression at 0x117096a70>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,

In [26]:
y_pred_val = reg_4.predict(DTA_val_x)
mean_squared_error(DTA_val_y, y_pred_val)

8400009.856587972

In [46]:
y_pred_test = reg_4.predict(DTA_test_x)
mean_squared_error(DTA_test_y, y_pred_test)

6689227.439661223

In [48]:
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, DTA_val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Publication_year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, DTA_test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Publication_year

In [49]:

diff_pred_true_val.to_csv('/Volumes/Korpora/GERMANC_over_DTA_Exp1b_Reg4_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/GERMANC_over_DTA_Exp1b_Reg4_Labels_test.csv',sep=';')

In [50]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

In [51]:
vectorizer = CountVectorizer(tokenizer=tokenizer_word, vocabulary=features_selected) 
# ELI5 can't combine a vectorizer with a feature selector, so the vectorizer is rebuilt with the selected features as its fixed vocabulary

In [52]:
val_details = collect_predictions(DTA_val_x, reg_4['ridge_reg'],vectorizer,features_selected, reg_4)
test_details = collect_predictions(DTA_test_x, reg_4['ridge_reg'],vectorizer,features_selected, reg_4)

In [53]:
val_details.to_csv('/Volumes/Korpora/GERMANC_over_DTA_Exp1b_Reg4_Val_results.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/GERMANC_over_DTA_Exp1b_Reg4_Test_results.csv', sep=';')

**Classifier trained on ARCHER, and validated over CLMET**

In [54]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 500)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [55]:
reg_4.fit(ARCHER_train_x, ARCHER_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=500,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ec13c680>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x117096a70>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [29]:
y_pred_val = reg_4.predict(CLMET_val_x)
mean_squared_error(CLMET_val_y, y_pred_val)

6437126.770897349

In [56]:
y_pred_test = reg_4.predict(CLMET_test_x)
mean_squared_error(CLMET_test_y, y_pred_test)

21896163.78334434

In [58]:
# concat(axis=1) pairs rows by index label; the filtered CLMET frames may
# have gaps in their index, so the true years are re-indexed by position first
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, CLMET_val_y.reset_index(drop=True)], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, CLMET_test_y.reset_index(drop=True)], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [59]:
diff_pred_true_val.to_csv('/Volumes/Korpora/ARCHER_over_CLMET_Exp1b_Reg4_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/ARCHER_over_CLMET_Exp1b_Reg4_Labels_test.csv',sep=';')

In [60]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

In [61]:
# the model was trained on ARCHER, so the explanations are collected on the CLMET sets
val_details = collect_predictions(CLMET_val_x, reg_4['ridge_reg'],reg_4['unigram_vectorizer'],features_selected, reg_4)
test_details = collect_predictions(CLMET_test_x, reg_4['ridge_reg'],reg_4['unigram_vectorizer'],features_selected, reg_4)

In [62]:
val_details.to_csv('/Volumes/Korpora/ARCHER_over_CLMET_Exp1b_Reg4_Val_results.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/ARCHER_over_CLMET_Exp1b_Reg4_Test_results.csv', sep=';')

**Classifier trained on CLMET, and validated over ARCHER**

In [63]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 167)),
                    ('feature_selector', SelectKBest(f_regression, k=40)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [64]:
reg_4.fit(CLMET_train_x, CLMET_train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=167,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ec13c680>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=40,
                             score_func=<function f_regression at 0x117096a70>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,

In [33]:
y_pred_val = reg_4.predict(ARCHER_val_x)
mean_squared_error(ARCHER_val_y, y_pred_val)

10212.139003296703

In [65]:
y_pred_test = reg_4.predict(ARCHER_test_x)
mean_squared_error(ARCHER_test_y, y_pred_test)

9784.281718876851

In [66]:
# concat(axis=1) pairs rows by index label; the filtered ARCHER frames may
# have gaps in their index, so the true years are re-indexed by position first
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, ARCHER_val_y.reset_index(drop=True)], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year


y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, ARCHER_test_y.reset_index(drop=True)], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [67]:
diff_pred_true_val.to_csv('/Volumes/Korpora/CLMET_over_ARCHER_Exp1b_Reg4_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/CLMET_over_ARCHER_Exp1b_Reg4_Labels_test.csv',sep=';')

In [68]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

In [69]:
vectorizer = CountVectorizer(tokenizer=tokenizer_word, vocabulary=features_selected) 
# ELI5 can't combine a vectorizer with a feature selector, so the vectorizer is rebuilt with the selected features as its fixed vocabulary

In [70]:
val_details = collect_predictions(ARCHER_val_x, reg_4['ridge_reg'],vectorizer,features_selected, reg_4)
test_details = collect_predictions(ARCHER_test_x, reg_4['ridge_reg'],vectorizer,features_selected, reg_4)

In [71]:
val_details.to_csv('/Volumes/Korpora/CLMET_over_ARCHER_Exp1b_Reg4_Val_results.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/CLMET_over_ARCHER_Exp1b_Reg4_Test_results.csv', sep=';')