**Experiment 2: linear regression with features selected by document frequency**

*Background*: Experiment 1 showed that some sort of feature selection is required, but also that just picking the k highest scoring features leads to overfitting.

*Goal*: Determine if it is possible to predict the year in which a text was written using regression.

*Strategies*:
- Train only on features that occur in at least a given number of documents

*Relevance*:
- If this experiment works, years can be estimated for corpora that have missing values (NA) in this variable.

*Success criteria*:
- Consistent findings across the training, validation and test sets
- The predicted year is no more than ten years away from the true year

*Corpora*:
- DTA
- CLMET
- GERMANC
- ARCHER

*Result*: 
Using only features that occur in at least 800 of the 899 training documents (about 89% of the training set) solved the problem of overfitting.

The linear regression seems to try to model a normal distribution, which does not reflect the real distribution of years in the DTA (cf. the analysis notebook). Linear regression therefore does not seem flexible enough to model the data correctly.

Taking only words that occur in about 90% of all documents in the train set, and limiting the number of features to about 25% of the number of training documents, works well on the DTA, the GERMANC and the CLMET, but not on the ARCHER: if only words that occur in 90% of the documents are used, the pipeline keeps only 25 features.

Interestingly, the two larger corpora, DTA and ARCHER, do not need a step that limits the number of features, whereas the much smaller CLMET overfits heavily if the kbest step is left out. GERMANC also needs a restriction on its features to prevent overfitting, but less so than the CLMET.

*MSE DTA Train*: 2259.8

*MSE DTA Val*: 3202.51

*MSE DTA Test*: 4504.35

Setup: 
- features occur in about 89% of documents in the train set
- number of features is 25% of the number of documents in the train set

----------------------------------------------------------------------------------------------------------------------

*MSE CLMET Train*: 1346.26

*MSE CLMET Val*: 2727.37

*MSE CLMET Test*: 4315.18

Setup:
- features occur in about 89% of documents in the train set
- number of features is 25% of the number of documents in the train set

----------------------------------------------------------------------------------------------------------------------

*MSE ARCHER Train*: 3555.68

*MSE ARCHER Val*: 4843.06

*MSE ARCHER Test*: 4939.51

Setup:
- features occur in about 48% of documents in the train set
- number of features is 15% of the number of documents in the train set

----------------------------------------------------------------------------------------------------------------------

*MSE GERMANC Train*: 348.83

*MSE GERMANC Val*: 398.94

*MSE GERMANC Test*: 649.33

Setup:
- features occur in about 89% of documents in the train set
- number of features is 25% of the number of documents in the train set
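
As a rough aid for reading these MSE values against the ten-year success criterion, the square root of the MSE (RMSE) puts the error back on the scale of years. The sketch below only re-uses the numbers reported above; the names `reported_mse`, `rmse` and `TOLERANCE_YEARS` are made up for illustration, and an RMSE above ten is only a proxy, not a per-document check of the criterion.

```python
import math

# MSE values copied from the result summary above (train, validation, test)
reported_mse = {
    "DTA": (2259.8, 3202.51, 4504.35),
    "CLMET": (1346.26, 2727.37, 4315.18),
    "ARCHER": (3555.68, 4843.06, 4939.51),
    "GERMANC": (348.83, 398.94, 649.33),
}

TOLERANCE_YEARS = 10  # taken from the success criteria


def rmse(mse):
    """Root mean squared error, in the same unit as the target (years)."""
    return math.sqrt(mse)


for corpus, mse_values in reported_mse.items():
    rmse_values = [round(rmse(m), 1) for m in mse_values]
    within = all(r <= TOLERANCE_YEARS for r in rmse_values)
    print(f"{corpus:8s} RMSE train/val/test: {rmse_values}  <= {TOLERANCE_YEARS} years: {within}")
```

Read this way, none of the reported configurations comes close to a typical error of ten years, which matches the outlier analysis further down.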

In [2]:
import pandas as pd
import numpy as np
import nltk

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import sklearn.utils
import re

import eli5



In [3]:
# Code example: https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
def features_to_names(features, feature_names):
    """Return the names of the features flagged True in the boolean mask."""
    features_selected = []

    # 'selected' instead of 'bool' so the built-in type is not shadowed
    for selected, feature in zip(features, feature_names):
        if selected:
            features_selected.append(feature)
    return features_selected
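
The same mask-to-names mapping can also be written with the standard library. This is a minimal sketch of an equivalent helper, not a change to the notebook's results; the mask and the vocabulary below are hypothetical toy values.

```python
from itertools import compress


def features_to_names(mask, feature_names):
    """Return the names whose corresponding mask entry is truthy."""
    return list(compress(feature_names, mask))


# Hypothetical mask and vocabulary, for illustration only
mask = [True, False, True]
names = ["der", "hund", "und"]
selected = features_to_names(mask, names)
print(selected)
```

`compress` only relies on truthiness, so it should also accept the NumPy boolean array returned by `get_support()`.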

In [70]:
train_full = pd.read_csv('/Volumes/Korpora/Train/DTA_train_tokenized.csv', sep=';')
val_full = pd.read_csv('/Volumes/Korpora/Val/DTA_val_tokenized.csv', sep=';')
test_full = pd.read_csv('/Volumes/Korpora/Test/DTA_test_tokenized.csv', sep=';')

In [71]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
print('Length test set: ', len(test_full))

Length train set:  899
Length validation set:  225
Length test set:  281


In [4]:
# Tokenizer for the pre-tokenized text: strips the list brackets and splits on ','
# Note: the character class also removes '(', ')', '+' and '|'
def tokenizer_word(doc):
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    doc = re.split(',', doc)
    return doc
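
To see what this tokenizer returns, it helps to run it on a small string. The quoted feature names in the eli5 output (for example `'keine'`) suggest that the `Text` column holds stringified Python lists; that is an assumption, and the input below is a hypothetical example rather than a row from the corpus.

```python
import re


def tokenizer_word(doc):
    # Same character class as in the cell above: removes ( [ + ) | ]
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    return re.split(',', doc)


# Hypothetical input that mimics a stringified list of tokens
tokens = tokenizer_word("['der', 'hund', 'bellt']")
print(tokens)
```

Under this assumption the quotes stay part of each token, and every token after the first keeps a leading space, so `"'der'"` and `" 'der'"` would count as two different features.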

In [108]:
# Collect the per-document explanations to see how each feature contributes to a prediction
def collect_predictions(dataset, classifier, vectorizer, feature_names, pipeline):
    # The global weights only provide the column layout: these rows have no
    # 'instance'/'weight_value' entries and are removed by dropna() at the end
    predictions = eli5.explain_weights_df(classifier, vec=vectorizer, feature_names=feature_names)
    predictions = predictions.drop(['target'], axis=1)
    predictions['YEAR'] = 0

    frames = [predictions]
    for index in dataset.index.values:
        pred = eli5.explain_prediction_df(classifier, dataset[index], vec=vectorizer, feature_names=feature_names)
        year_pred = pipeline.predict([dataset[index]])

        # Note: judging by the output further down ('liegen': 0.608 at value 4 vs. a
        # coefficient of 0.152), 'weight' may already be coefficient * value
        pred['weight_value'] = pred['weight'] * pred['value']
        pred['instance'] = index
        pred = pred.drop(['target', 'value'], axis=1)
        pred['YEAR'] = np.round(year_pred[0])
        frames.append(pred)

    # Concatenate once instead of growing the DataFrame inside the loop
    return pd.concat(frames).dropna()

In [72]:
train_x = train_full['Text']
train_y = train_full['Publication_year']

val_x = val_full['Text']
val_y = val_full['Publication_year']

test_x = test_full['Text']
test_y = test_full['Publication_year']

CountVectorizer has a parameter called min_df that can be set to an integer or a float. If a word occurs in fewer documents than min_df (an absolute document count for an integer, a proportion of documents for a float), it is removed from the vocabulary. I set min_df to 10 for the next experiment.
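
The filtering idea behind `min_df` can be sketched in plain Python. This is an illustration of the document-frequency rule on toy data, not scikit-learn's implementation; `filter_by_min_df` and the three toy documents are hypothetical.

```python
from collections import Counter


def filter_by_min_df(tokenized_docs, min_df):
    """Keep terms whose document frequency reaches min_df.

    An int is read as an absolute document count,
    a float as a proportion of all documents.
    """
    n_docs = len(tokenized_docs)
    # Count each term once per document, however often it occurs inside it
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    threshold = min_df if isinstance(min_df, int) else min_df * n_docs
    return sorted(term for term, df in doc_freq.items() if df >= threshold)


docs = [["der", "hund"], ["der", "baum"], ["der", "hund", "bellt"]]
kept_by_count = filter_by_min_df(docs, 2)    # terms in at least 2 documents
kept_by_share = filter_by_min_df(docs, 0.9)  # terms in at least 90% of the documents
print(kept_by_count, kept_by_share)
```

Raising the threshold shrinks the vocabulary towards words that appear almost everywhere, which is the lever the following experiments use.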

In Experiment 1, the model performed best on a set with 22000 features, so I start with that value.

In [None]:
reg_1 = Pipeline([
    ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df=10)),
    ('feature_selector', SelectKBest(f_regression, k=22000)),
    ('ridge_reg', linear_model.Ridge()),
])



In [None]:
reg_1.fit(train_x, train_y)

In [None]:
y_pred_train = reg_1.predict(train_x)
mean_squared_error(train_y, y_pred_train)

In [None]:
y_pred_val = reg_1.predict(val_x)

mean_squared_error(val_y, y_pred_val)

In [None]:
features = reg_1['feature_selector'].get_support()
feature_names = reg_1['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

expl = eli5.explain_weights_df(reg_1['ridge_reg'],vec=reg_1['unigram_vectorizer'],target_names=(train_y),feature_names=features_selected)




In [None]:
print(expl.head(2))

It is still overfitting heavily, so I raise the threshold to 100.

Since fewer than 22k features pass the threshold, I set k to 'all' for now; otherwise the feature selector complains.
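
One way to avoid switching k by hand is to cap it at the vocabulary size before building the selector. The helper below is a hypothetical convenience function, not part of scikit-learn; the vocabulary size would have to come from a fitted vectorizer, for example via `len(vectorizer.vocabulary_)`.

```python
def safe_k(requested_k, n_features):
    """Return a k that does not exceed the number of available features."""
    if requested_k == "all" or requested_k >= n_features:
        return "all"
    return requested_k


# 22000 is the k from Experiment 1; 5000 is a made-up vocabulary size
print(safe_k(22000, 5000))
print(safe_k(40, 499))
```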

In [None]:
reg_2 = Pipeline([
    ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df=100)),
    ('feature_selector', SelectKBest(f_regression, k='all')),
    ('ridge_reg', linear_model.Ridge()),
])

In [None]:
reg_2.fit(train_x, train_y)

In [None]:
y_pred_train = reg_2.predict(train_x)
mean_squared_error(train_y, y_pred_train)

In [None]:
y_pred_val = reg_2.predict(val_x)

mean_squared_error(val_y, y_pred_val)

In [None]:
features = reg_2['feature_selector'].get_support()
feature_names = reg_2['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_2['ridge_reg'],vec=reg_2['unigram_vectorizer'], feature_names=features_selected)

In [None]:
reg_3 = Pipeline([
    ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df=200)),
    ('feature_selector', SelectKBest(f_regression, k='all')),
    ('ridge_reg', linear_model.Ridge()),
])

In [None]:
reg_3.fit(train_x, train_y)

In [None]:
y_pred_train = reg_3.predict(train_x)
mean_squared_error(train_y, y_pred_train)

In [None]:
y_pred_val = reg_3.predict(val_x)

mean_squared_error(val_y, y_pred_val)

In [None]:
features = reg_3['feature_selector'].get_support()
feature_names = reg_3['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_3['ridge_reg'],vec=reg_3['unigram_vectorizer'], feature_names=features_selected)

Since the model is still overfitting, I reverse the approach and keep only features that occur in at least 800 of the 899 documents.

In [73]:
reg_4 = Pipeline([
    ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df=800)),
    ('feature_selector', SelectKBest(f_regression, k='all')),
    ('ridge_reg', linear_model.Ridge()),
])

In [74]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x11e27d0e0>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x11dbaf200>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [75]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2259.827033935024

In [76]:
y_pred_val = reg_4.predict(val_x)

mean_squared_error(val_y, y_pred_val)

3203.7007093719653

In [77]:
y_pred_test = reg_4.predict(test_x)
mean_squared_error(test_y, y_pred_test)

4500.970069187744

In [85]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1769.716,<BIAS>
+0.282,'keine'
+0.261,'weiter'
+0.217,'ja'
+0.181,'that'
+0.178,'leicht'
+0.176,'vielen'
+0.155,'finden'
+0.153,'erhalten'
+0.152,'liegen'


In [15]:
len(features_selected)

215

The train and validation MSE are now much closer to each other (2259.8 vs. 3203.7), so concentrating on very frequent words, many of them stop words, seems to be a good idea. The model was trained on 215 features.

In [78]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

diff_pred_true_train = pd.concat([y_pred_train, train_y], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Publication_year
    

print(diff_pred_true_train.head(3))


   Predicted_y  Publication_year  Difference
0  1786.255418              1741   45.255418
1  1750.022064              1691   59.022064
2  1770.788691              1665  105.788691


In [17]:
diff_pred_true_train.describe()

Unnamed: 0,Predicted_y,Publication_year,Difference
count,899.0,899.0,899.0
mean,1788.259177,1788.259177,-4.426073e-14
std,61.741568,77.929074,47.5638
min,1549.902463,1598.0,-125.8733
25%,1762.839429,1739.5,-28.05162
50%,1782.950146,1796.0,0.02362674
75%,1824.280243,1855.0,25.14667
max,1962.413722,1913.0,167.1227


This table shows that the mean year is the same for the predictions and the true labels. The minimum prediction (1550) is lower and the maximum prediction (1962) is higher than in the true labels (1598 and 1913), so the model's predictions span a wider range of publication years than the train set actually covers.

In [79]:
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Publication_year

print(diff_pred_true_val.head(3))


   Predicted_y  Publication_year  Difference
0  1818.740047              1897  -78.259953
1  1717.877252              1701   16.877252
2  1517.901188              1663 -145.098812


In [19]:
diff_pred_true_val.describe()

Unnamed: 0,Predicted_y,Publication_year,Difference
count,225.0,225.0,225.0
mean,1789.651089,1791.315556,-1.664467
std,70.223421,74.822785,56.692355
min,1482.561566,1603.0,-164.438434
25%,1763.628141,1750.0,-35.216816
50%,1784.658281,1804.0,-5.313875
75%,1822.392752,1843.0,28.570888
max,2025.898784,1913.0,255.349277


The true mean of the publication year in the validation set is three years higher than in the train set. The model adapts slightly by adding one year to the mean of the predictions over the validation set (compared to the train set).

Surprisingly, the earliest prediction for the validation set is 1483, although the oldest text was written in 1603. The latest prediction is 2026, whereas the most recent text dates from 1913. The predicted range (about 543 years) is therefore about 230 years wider than the true range (310 years).

The mean difference between the predicted and the true year is about -2, meaning that the predicted year is on average about two years lower than the true label.

At the first quartile of the differences, the model's prediction is about 35 years too low; at the third quartile, it is about 29 years too high. The largest overestimate is about 255 years and the largest underestimate about 164 years, so the most extreme errors are predictions that are too late. Given the mean (which is actually quite decent), the main problem might be some heavy outliers in the model's predictions.
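
The ten-year success criterion can be checked more directly than through the summary statistics by counting how many predictions fall within a tolerance. This is a minimal sketch; `share_within` is a hypothetical helper, and the sample only contains the first three `Difference` values printed for the validation set above.

```python
def share_within(differences, tolerance=10):
    """Fraction of predictions whose absolute error is at most `tolerance` years."""
    differences = list(differences)
    hits = sum(1 for d in differences if abs(d) <= tolerance)
    return hits / len(differences)


# First three 'Difference' values printed for the validation set above
sample = [-78.259953, 16.877252, -145.098812]
print(share_within(sample, tolerance=10))
print(share_within(sample, tolerance=20))
```

Since the helper accepts any iterable of numbers, it should work on the full column as well, for example `share_within(diff_pred_true_val['Difference'])`.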

In [80]:
y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])

diff_pred_true_test = pd.concat([y_pred_test, test_y], axis=1)

diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Publication_year

print(diff_pred_true_test.head(3))

   Predicted_y  Publication_year  Difference
0  1872.353749              1861   11.353749
1  1813.622673              1801   12.622673
2  1751.247843              1790  -38.752157


In [18]:
print(diff_pred_true_val.nsmallest(10,'Difference'))

     Predicted_y  Publication_year  Difference
143  1482.561566              1647 -164.438434
183  1531.163409              1679 -147.836591
2    1517.937050              1663 -145.062950
170  1763.628141              1895 -131.371859
54   1775.478628              1897 -121.521372
188  1779.429106              1893 -113.570894
159  1783.201390              1890 -106.798610
50   1784.179079              1889 -104.820921
47   1801.314523              1898  -96.685477
116  1820.343756              1913  -92.656244


eli5, instances 119 and 142: the model predicts 2012 and 2014 (true: 1765 and 1895), probably because it overestimates the influence of the word "dem". For the text predicted as 2012, "dem" has a weight of +348; for the one predicted as 2014, it has a weight of +71.

"die" also seems to mislead the model into dating a text later than it really is. For the text predicted as 2012, "die" has a weight of +185; for the one predicted as 2014, the weight is +65.

In [81]:
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp1b_DTA_Reg4_Labels_val.csv',sep=';')

In [82]:
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp1b_DTA_Reg4_Labels_train.csv',sep=';')

In [83]:
diff_pred_true_test.to_csv('/Volumes/Korpora/Exp1b_DTA_Reg4_Labels_test.csv',sep=';')

In [19]:
#eli5.explain_prediction(reg_4['ridge_reg'],val_x[143],vec=reg_4['unigram_vectorizer'], feature_names=features_selected, )

In [20]:
expl = eli5.explain_prediction_df(reg_4['ridge_reg'],val_x[0],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)



In [21]:
len(expl)

213

In [22]:
expl.nsmallest(10,'value')

Unnamed: 0,target,feature,weight,value
0,y,<BIAS>,1769.724984,1.0
111,y,'welchem',0.112434,1.0
113,y,'gleichen',0.080908,1.0
100,y,'findet',0.240282,2.0
116,y,'welches',0.016883,2.0
117,y,'nen',0.014924,2.0
106,y,'bleibt',0.174982,3.0
127,y,'zeiten',-0.310498,3.0
83,y,'liegen',0.608372,4.0
112,y,'dahin',0.083053,4.0
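
The numbers in this table can be related to the global weights: for a linear model, each feature contributes its coefficient times its count, and the prediction is the bias plus the sum of all contributions. Judging by the values shown in this notebook, the `weight` column here already seems to hold that product (an observation from the output, not a documented guarantee). The helper names below are hypothetical.

```python
def contribution(coefficient, count):
    """Contribution of one feature to a linear model's prediction."""
    return coefficient * count


def predict_from_parts(bias, contributions):
    """A linear prediction is the bias plus the sum of all feature contributions."""
    return bias + sum(contributions)


# 'liegen': coefficient of about 0.152 in the weight table, count 4 in this document
liegen = contribution(0.152, 4)
print(round(liegen, 3))  # close to the 0.608372 listed for 'liegen' above

# Bias plus only three of the rows listed above ('liegen', 'findet', 'zeiten')
partial = predict_from_parts(1769.724984, [0.608372, 0.240282, -0.310498])
print(round(partial, 2))
```

Summing all rows of the explanation instead of three would reproduce the model's full prediction for this document.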


In [23]:
eli5.explain_weights_df(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Unnamed: 0,target,feature,weight
0,y,<BIAS>,1769.724984
1,y,'keine',0.282159
2,y,'weiter',0.260575
3,y,'ja',0.216977
4,y,'that',0.181377
...,...,...,...
211,y,'gen',-0.176596
212,y,'bald',-0.179685
213,y,'ins',-0.221891
214,y,'weit',-0.229857


In [86]:
val_details = collect_predictions(val_x, reg_4['ridge_reg'],reg_4['unigram_vectorizer'],features_selected, reg_4)



In [91]:
len(val_details)

46786

In [87]:
train_details = collect_predictions(train_x, reg_4['ridge_reg'],reg_4['unigram_vectorizer'],features_selected, reg_4)

In [88]:
test_details = collect_predictions(test_x, reg_4['ridge_reg'],reg_4['unigram_vectorizer'],features_selected, reg_4)

In [89]:
train_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Train_results_DTA.csv',sep=';')
val_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Val_results_DTA.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Test_results_DTA.csv', sep=';')

**REG4 + CLMET**

In [90]:
train_clmet_full = pd.read_csv('/Volumes/Korpora/Train/CLMET_train_tokenized.csv', sep=';')
val_clmet_full = pd.read_csv('/Volumes/Korpora/Val/CLMET_val_tokenized.csv', sep=';')
test_clmet_full = pd.read_csv('/Volumes/Korpora/Test/CLMET_test_tokenized.csv', sep=';')

In [91]:
print('Length train set: ',len(train_clmet_full))
print('Length validation set: ', len(val_clmet_full))
print('Length test set: ', len(test_clmet_full))

Length train set:  212
Length validation set:  54
Length test set:  67


In [92]:
#drop rows with invalid data types
train_clmet_full = train_clmet_full[train_clmet_full.Year.str.len()== 4]
val_clmet_full = val_clmet_full[val_clmet_full.Year.str.len()== 4]
test_clmet_full = test_clmet_full[test_clmet_full.Year.str.len()== 4]

In [93]:
print('Length train set: ',len(train_clmet_full))
print('Length validation set: ', len(val_clmet_full))
print('Length test set: ', len(test_clmet_full))

Length train set:  186
Length validation set:  47
Length test set:  60


In [94]:
train_x = train_clmet_full['Text']
train_y = train_clmet_full['Year'].astype(int)

val_x = val_clmet_full['Text']
val_y = val_clmet_full['Year'].astype(int)

test_x = test_clmet_full['Text']
test_y = test_clmet_full['Year'].astype(int)

90% of 186 = 167.4
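
The same calculation is needed for every corpus, so it can be wrapped in a small helper. This is a sketch with a hypothetical function name; it rounds down, which matches the values 167 (CLMET) and 944 (ARCHER) used in this notebook.

```python
import math


def min_df_for_share(n_docs, share=0.9):
    """Absolute document count for `share` of the training set, rounded down."""
    return math.floor(share * n_docs)


print(min_df_for_share(186))   # CLMET train set after filtering
print(min_df_for_share(1049))  # ARCHER train set after filtering
```

Passing the float `0.9` directly as `min_df` would be an alternative, although it may not behave identically to the rounded-down integer when the product is not a whole number.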

In [70]:
reg_4 = Pipeline([
    ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df=167)),
    ('feature_selector', SelectKBest(f_regression, k='all')),
    ('ridge_reg', linear_model.Ridge()),
])

In [71]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=167,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1d9359c20>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x12071cdd0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [72]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

0.0032282565763734907

In [73]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

28696.790663825832

In [74]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1845.241,<BIAS>
+0.793,'well'
+0.555,'believe'
+0.507,'understand'
+0.432,'place'
+0.375,'times'
+0.369,'given'
+0.368,'another'
… 217 more positive …,… 217 more positive …
… 263 more negative …,… 263 more negative …


In [75]:
len(features)

499

Restrict the features to the same number that the model uses for the DTA (215).

In [78]:
reg_4 = Pipeline([
    ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df=167)),
    ('feature_selector', SelectKBest(f_regression, k=215)),
    ('ridge_reg', linear_model.Ridge()),
])

In [79]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=167,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1d9359c20>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=215,
                             score_func=<function f_regression at 0x12071cdd0>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True

In [80]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

0.013660963035626862

In [81]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

46532.69500113356

In [82]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1831.578,<BIAS>
+5.395,'understand'
+2.759,'met'
+2.743,'lie'
+2.544,'remember'
+2.535,'appear'
+2.155,'got'
+2.053,'danger'
+1.900,'hands'
… 87 more positive …,… 87 more positive …


Ratio between the number of documents and the number of features from the experiments with the DTA, applied to the CLMET threshold:
899 to 215 ≈ 167 to 40
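
The proportion above can be computed rather than worked out by hand. The helper name is hypothetical, and the choice of base is an assumption: the notebook scales from 167 (the min_df threshold), whereas scaling from the 186 training documents would give a slightly larger k.

```python
def scaled_k(k_reference, n_reference, n_target):
    """Scale a feature count from a reference corpus to a target size."""
    return round(k_reference * n_target / n_reference)


k_from_threshold = scaled_k(215, 899, 167)   # base used in this notebook
k_from_train_size = scaled_k(215, 899, 186)  # alternative base, for comparison
print(k_from_threshold, k_from_train_size)
```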

In [95]:
reg_4 = Pipeline([
    ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df=167)),
    ('feature_selector', SelectKBest(f_regression, k=40)),
    ('ridge_reg', linear_model.Ridge()),
])

In [96]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=167,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x11e27d0e0>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=40,
                             score_func=<function f_regression at 0x11dbaf200>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,

In [97]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

1346.2602741556182

In [98]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

2727.371335726752

In [99]:
y_pred_test = reg_4.predict(test_x)
mean_squared_error(test_y, y_pred_test)

4315.177223906214
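
To compare the three CLMET configurations at a glance, the ratio of validation MSE to train MSE can serve as a rough overfitting indicator. The sketch only re-uses the values printed in the cells above; the ratio itself is an illustrative heuristic, not a metric used elsewhere in this notebook.

```python
# (train MSE, validation MSE) as printed above for the CLMET runs
runs = {
    "k='all'": (0.0032282565763734907, 28696.790663825832),
    "k=215": (0.013660963035626862, 46532.69500113356),
    "k=40": (1346.2602741556182, 2727.371335726752),
}


def generalization_gap(train_mse, val_mse):
    """Validation MSE relative to train MSE; values near 1 indicate little overfitting."""
    return val_mse / train_mse


for name, (train_mse, val_mse) in runs.items():
    print(f"{name:8s} val/train MSE ratio: {generalization_gap(train_mse, val_mse):,.1f}")
```

Only the k=40 run keeps the two errors within the same order of magnitude; the other two fit the train set almost perfectly and fail on the validation set.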

In [104]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1850.134,<BIAS>
+1.109,'need'
+0.793,'understand'
+0.656,'question'
+0.596,'coming'
+0.491,'occasion'
+0.320,'behind'
+0.307,'right'
+0.256,'fear'
+0.225,'forward'


In [101]:
# train_y, val_y and test_y keep the row labels of the filtered frames, so the
# predictions get the same index; otherwise pd.concat(axis=1) would misalign the rows
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'], index=train_y.index)
diff_pred_true_train = pd.concat([y_pred_train, train_y], axis=1)
diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'], index=val_y.index)
diff_pred_true_val = pd.concat([y_pred_val, val_y], axis=1)
diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year

y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'], index=test_y.index)
diff_pred_true_test = pd.concat([y_pred_test, test_y], axis=1)
diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year
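
Because rows with invalid years were dropped earlier, `train_y` no longer has a gap-free 0..n-1 index, and `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below uses made-up values to show the difference between building the prediction frame with a default index and with the index of the labels.

```python
import pandas as pd

# Hypothetical labels whose index has gaps, as after filtering out rows
y_true = pd.Series([1800, 1850, 1900], index=[0, 2, 5], name='Year')
y_pred = [1810.0, 1840.0, 1890.0]

# Default RangeIndex 0..2: rows are matched by label, which produces NaN rows
misaligned = pd.concat([pd.DataFrame(y_pred, columns=['Predicted_y']), y_true], axis=1)

# Re-using the label index keeps each prediction next to its true year
aligned = pd.concat(
    [pd.DataFrame(y_pred, columns=['Predicted_y'], index=y_true.index), y_true],
    axis=1,
)

print(misaligned)
print(aligned)
```

Calling `reset_index(drop=True)` on the filtered frames right after filtering would be another way to keep positions and labels in sync.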

In [102]:
diff_pred_true_train.to_csv('/Volumes/Korpora/CLMET_Exp1b_Reg4_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/CLMET_Exp1b_Reg4_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/CLMET_Exp1b_Reg4_Labels_test.csv',sep=';')

In [116]:
vectorizer = CountVectorizer(tokenizer=tokenizer_word, vocabulary=features_selected)
# ELI5 can't take both the vectorizer and the feature selector, so this is the best solution

In [118]:
train_details = collect_predictions(train_x, reg_4['ridge_reg'],vectorizer,features_selected, reg_4)
val_details = collect_predictions(val_x, reg_4['ridge_reg'],vectorizer,features_selected, reg_4)
test_details = collect_predictions(test_x, reg_4['ridge_reg'],vectorizer,features_selected, reg_4)

In [120]:
train_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Train_results_CLMET.csv',sep=';')
val_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Val_results_CLMET.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Test_results_CLMET.csv', sep=';')

**ARCHER**

In [119]:
ARCHER_train_full = pd.read_csv('/Volumes/Korpora/Train/ARCHER_train_tokenized.csv', sep=';')
ARCHER_val_full = pd.read_csv('/Volumes/Korpora/Val/ARCHER_val_tokenized.csv', sep=';')
ARCHER_test_full = pd.read_csv('/Volumes/Korpora/Test/ARCHER_test_tokenized.csv', sep=';')

In [10]:
print('Length train set: ',len(ARCHER_train_full))
print('Length validation set: ', len(ARCHER_val_full))
print('Length test set: ', len(ARCHER_test_full))

Length train set:  1093
Length validation set:  274
Length test set:  342


In [11]:
ARCHER_train_full = ARCHER_train_full[(ARCHER_train_full.Year.str.len()== 4) & (ARCHER_train_full.Year.str.isnumeric())]

ARCHER_val_full = ARCHER_val_full[(ARCHER_val_full.Year.str.len()== 4) & (ARCHER_val_full.Year.str.isnumeric())]

ARCHER_test_full = ARCHER_test_full[(ARCHER_test_full.Year.str.len()== 4) & (ARCHER_test_full.Year.str.isnumeric())]

In [12]:
print('Length train set: ',len(ARCHER_train_full))
print('Length validation set: ', len(ARCHER_val_full))
print('Length test set: ', len(ARCHER_test_full))

Length train set:  1049
Length validation set:  264
Length test set:  329


In [13]:
train_x = ARCHER_train_full['Text']
train_y = ARCHER_train_full['Year'].astype(int)

val_x = ARCHER_val_full['Text']
val_y = ARCHER_val_full['Year'].astype(int)

test_x = ARCHER_test_full['Text']
test_y = ARCHER_test_full['Year'].astype(int)

In [15]:
reg_4 = Pipeline([
    ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df=944)),
    ('feature_selector', SelectKBest(f_regression, k='all')),
    ('ridge_reg', linear_model.Ridge()),
])

In [16]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=944,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1a19da2f80>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x1a19084050>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=

In [17]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

6299.528993677098

In [18]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

6700.325092357388

In [19]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1806.422,<BIAS>
+1.056,'in'
+0.988,'is'
+0.858,'a'
+0.544,'with'
+0.488,'have'
+0.373,'the'
+0.361,'for'
+0.312,'.'
+0.288,'that'


In [20]:
len(feature_names)

25

For reasons that are not clear yet, the vectorizer keeps only 25 features when the same document-frequency ratio is applied to the ARCHER as to the DTA and CLMET (about 90% of the training documents). The MSE is accordingly much higher, because 25 features are far too few for a reliable prediction.
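
Before refitting the whole pipeline for each threshold, it can help to look at how the vocabulary size reacts to `min_df` (in this notebook: 25 features at 944 and 57 at 800 for the ARCHER). The following is a plain-Python sketch on toy documents; the function name and the data are hypothetical.

```python
from collections import Counter


def vocabulary_size_by_min_df(tokenized_docs, thresholds):
    """Number of terms whose document frequency reaches each threshold."""
    # Count each term once per document
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: sum(1 for df in doc_freq.values() if df >= t) for t in thresholds}


docs = [
    ["the", "a", "ship"],
    ["the", "a", "king"],
    ["the", "letter"],
    ["the", "a"],
]
sizes = vocabulary_size_by_min_df(docs, thresholds=[1, 2, 3, 4])
print(sizes)
```

A sweep like this over the real corpus would show which threshold leaves roughly as many features as the DTA run, without training a model for each candidate.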

In [21]:
reg_4 = Pipeline([
    ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df=800)),
    ('feature_selector', SelectKBest(f_regression, k='all')),
    ('ridge_reg', linear_model.Ridge()),
])

In [22]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1a19da2f80>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x1a19084050>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=

In [23]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

4889.77420422718

In [24]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

5713.646349826785

In [25]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1808.391,<BIAS>
+2.849,'only'
+2.134,'one'
+2.087,'can'
+1.957,'would'
+1.712,'been'
+1.618,'are'
+1.444,'on'
+1.348,'has'
+1.217,'who'


In [26]:
len(feature_names)

57

**Best ARCHER result: min_df = 500 gives the lowest validation MSE (4835.28) of all thresholds tried**

In [14]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 500)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [15]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=500,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x11e27d0e0>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x11dbaf200>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [16]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

3556.2066222035774

In [17]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

4835.284673264261

In [18]:
y_pred_test = reg_4.predict(test_x)
mean_squared_error(test_y, y_pred_test)

4941.5709435637145

In [20]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1811.479,<BIAS>
+2.520,'way'
+2.433,'only'
+2.391,'between'
+2.177,'say'
+2.175,'new'
+2.156,'can'
+1.964,'also'
+1.922,'would'
… 66 more positive …


In [78]:
# pd.concat(axis=1) aligns rows by index, so reset the index of the true years
# to pair each prediction with its label by position.
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])
diff_pred_true_train = pd.concat([y_pred_train, train_y.reset_index(drop=True)], axis=1)
diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])
diff_pred_true_val = pd.concat([y_pred_val, val_y.reset_index(drop=True)], axis=1)
diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year

y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])
diff_pred_true_test = pd.concat([y_pred_test, test_y.reset_index(drop=True)], axis=1)
diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [79]:
diff_pred_true_train.to_csv('/Volumes/Korpora/ARCHER_Exp1b_Reg4_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/ARCHER_Exp1b_Reg4_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/ARCHER_Exp1b_Reg4_Labels_test.csv',sep=';')

In [68]:
train_details = collect_predictions(train_x, reg_4['ridge_reg'],reg_4['unigram_vectorizer'],features_selected, reg_4)
val_details = collect_predictions(val_x, reg_4['ridge_reg'],reg_4['unigram_vectorizer'],features_selected, reg_4)
test_details = collect_predictions(test_x, reg_4['ridge_reg'],reg_4['unigram_vectorizer'],features_selected, reg_4)

In [69]:
train_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Train_results_ARCHER.csv',sep=';')
val_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Val_results_ARCHER.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Test_results_ARCHER.csv', sep=';')

In [31]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1811.511,<BIAS>
+2.544,'way'
+2.432,'only'
+2.355,'between'
+2.174,'say'
+2.171,'new'
+2.160,'can'
+1.964,'also'
+1.937,'would'
… 66 more positive …


In [32]:
len(feature_names)

154

In [34]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 300)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [35]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=300,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1a19da2f80>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x1a19084050>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=

In [36]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2579.3463182966584

In [37]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

6647.363972084789

In [38]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1815.759,<BIAS>
+8.766,'possible'
+6.207,'true'
+6.099,'does'
+5.780,'seems'
+5.319,'however'
+5.165,'call'
+5.058,'while'
+4.758,'quite'
+4.542,'whether'


In [39]:
len(feature_names)

312

Both DTA and CLMET perform best when the number of selected features is about 25% of the number of documents in the training set. For the ARCHER, that would be 262 features.
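The 25% rule can be written as a small helper. This is only a sketch: `k_from_ratio` and its arguments are hypothetical names that do not appear in the notebook. It caps `k` at the vocabulary size so that `SelectKBest` is never asked for more features than the vectorizer produced.

```python
def k_from_ratio(n_train_docs, n_vocab, ratio=0.25):
    """Return k for SelectKBest as a share of the training documents.

    n_train_docs: number of documents in the training set
    n_vocab:      number of features left after the min_df filter
    ratio:        share of the training documents to use as k
    """
    k = round(n_train_docs * ratio)
    # Keep at least one feature and never more than the vocabulary offers.
    return max(1, min(k, n_vocab))


# GERMANC: 177 training documents, 144 features after the min_df filter.
print(k_from_ratio(177, 144))   # 44
# Capped case: fewer features available than the ratio asks for.
print(k_from_ratio(1000, 57))   # 57
```

For GERMANC this reproduces the value of 44 used further down.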

In [40]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 300)),
                    ('feature_selector', SelectKBest(f_regression, k=262)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [41]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=300,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1a19da2f80>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=262,
                             score_func=<function f_regression at 0x1a19084050>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [42]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2808.375193200379

In [43]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

5782.293327101129

In [46]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 400)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [47]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=400,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1a19da2f80>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x1a19084050>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=

In [48]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

3240.5259464233695

In [49]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

5276.994628584327

In [50]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1814.960,<BIAS>
+5.704,'while'
+4.118,'does'
+3.722,'better'
+3.443,'even'
+3.261,'however'
+3.193,'work'
+2.660,'say'
+2.580,'way'
+2.496,'because'


In [51]:
len(feature_names)

202

In [52]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 350)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [53]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=350,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1a19da2f80>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x1a19084050>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=

In [54]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2903.948759210483

In [55]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

5725.874755583225

In [56]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1815.324,<BIAS>
+6.079,'does'
+5.550,'while'
+4.982,'however'
+3.645,'times'
+3.322,'better'
+3.292,'even'
+3.212,'himself'
+3.120,'work'
+3.084,'whether'


In [57]:
len(features)

255
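The ARCHER runs above differ only in `min_df`, so they can be collected in one loop. The sketch below assumes the same three pipeline steps as the notebook but uses the default tokenizer instead of `tokenizer_word`, and it runs on synthetic stand-in data (random words, random years), so the numbers it prints say nothing about the ARCHER. `sweep_min_df` and `make_split` are hypothetical helper names.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline


def sweep_min_df(train_x, train_y, val_x, val_y, thresholds):
    """Fit the same pipeline once per min_df value and collect the MSEs."""
    rows = []
    for min_df in thresholds:
        pipe = Pipeline([
            ("unigram_vectorizer", CountVectorizer(min_df=min_df)),
            ("feature_selector", SelectKBest(f_regression, k="all")),
            ("ridge_reg", Ridge()),
        ])
        pipe.fit(train_x, train_y)
        rows.append({
            "min_df": min_df,
            "n_features": len(pipe["unigram_vectorizer"].vocabulary_),
            "mse_train": mean_squared_error(train_y, pipe.predict(train_x)),
            "mse_val": mean_squared_error(val_y, pipe.predict(val_x)),
        })
    return rows


# Synthetic stand-in data: 30 made-up words, 40 tokens per document.
rng = np.random.default_rng(0)
vocab = [f"w{i:02d}" for i in range(30)]


def make_split(n_docs):
    docs = [" ".join(rng.choice(vocab, size=40)) for _ in range(n_docs)]
    years = rng.integers(1700, 1900, size=n_docs)
    return docs, years


train_x, train_y = make_split(60)
val_x, val_y = make_split(20)

results = sweep_min_df(train_x, train_y, val_x, val_y, thresholds=[20, 40])
for row in results:
    print(row)
```

Comparing `mse_train` and `mse_val` per row makes the trade-off visible that the individual cells show one at a time: a lower `min_df` admits more features and tends to lower the training error, while the validation error decides which threshold to keep.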

**GERMANC**

In [121]:
train_full = pd.read_csv('/Volumes/Korpora/Train/GERMANC_train_tokenized.csv', sep=';')
val_full = pd.read_csv('/Volumes/Korpora/Val/GERMANC_val_tokenized.csv', sep=';')
test_full = pd.read_csv('/Volumes/Korpora/Test/GERMANC_test_tokenized.csv', sep=';')

In [122]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
print('Length test set: ', len(test_full))

Length train set:  214
Length validation set:  54
Length test set:  68


In [124]:
train_full = train_full[(train_full.Year.str.len()== 4) & (train_full.Year.str.isnumeric())]

val_full = val_full[(val_full.Year.str.len()== 4) & (val_full.Year.str.isnumeric())]

test_full = test_full[(test_full.Year.str.len()== 4) & (test_full.Year.str.isnumeric())]

In [125]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
print('Length test set: ', len(test_full))

Length train set:  177
Length validation set:  40
Length test set:  56


In [126]:
train_x = train_full['Text']
train_y = train_full['Year'].astype(int)

val_x = val_full['Text']
val_y = val_full['Year'].astype(int)

test_x = test_full['Text']
test_y = test_full['Year'].astype(int)

Minimum document frequency used below (min_df): 105 documents, about 59% of the 177 training documents (89% would be about 158)
25% of documents in train set: 44

In [127]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 105)),
                    ('feature_selector', SelectKBest(f_regression, k='all')),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [128]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=105,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x11e27d0e0>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x11dbaf200>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [129]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

53.05487688674284

In [130]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

1532.7680131827892

In [131]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1769.443,<BIAS>
+3.910,'am'
+3.230,'andere'
+3.085,'um'
+2.349,'habe'
+2.278,'ob'
+2.107,'muß'
+2.104,'ihrer'
+1.918,'kein'
… 51 more positive …


In [132]:
len(features_selected)

144

GERMANC reaches a very low training MSE (53.05) with all 144 features, but the validation MSE (1532.77) is far higher, so the model overfits. Do the results improve with only 44 features?

In [133]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 105)),
                    ('feature_selector', SelectKBest(f_regression, k=44)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [134]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=105,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x11e27d0e0>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=44,
                             score_func=<function f_regression at 0x11dbaf200>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,

In [135]:
y_pred_train = reg_4.predict(train_x)
mean_squared_error(train_y, y_pred_train)

348.8304814704438

In [136]:
y_pred_val = reg_4.predict(val_x)
mean_squared_error(val_y, y_pred_val)

398.940400606272

They do: the validation MSE drops from 1532.77 to 398.94 and is now close to the training MSE (348.83).

In [137]:
y_pred_test = reg_4.predict(test_x)
mean_squared_error(test_y, y_pred_test)

649.3281922679863
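To relate these MSE values to the success criterion of this experiment (a predicted year no more than ten years from the true year), it helps to convert them to a root mean squared error, which is expressed in years. The sketch below does that for the three GERMANC values reported above; `rmse` and `share_within` are hypothetical helpers, and the list of differences at the end is made up for illustration.

```python
import math


def rmse(mse):
    """Root mean squared error, in the same unit as the target (years)."""
    return math.sqrt(mse)


def share_within(differences, tolerance=10):
    """Fraction of predictions whose absolute error is at most `tolerance` years."""
    differences = list(differences)
    if not differences:
        raise ValueError("differences must not be empty")
    hits = sum(1 for d in differences if abs(d) <= tolerance)
    return hits / len(differences)


# MSE values reported above for GERMANC with k=44.
germanc_mse = {
    "train": 348.8304814704438,
    "val": 398.940400606272,
    "test": 649.3281922679863,
}
germanc_rmse = {split: rmse(value) for split, value in germanc_mse.items()}
for split, value in germanc_rmse.items():
    print(f"{split}: RMSE = {value:.1f} years")

# Made-up differences (predicted minus true year), for illustration only.
example_share = share_within([-4.0, 12.5, 9.9, -30.2])
print(example_share)
```

The validation RMSE comes out at roughly 20 years, which suggests that the typical error is still larger than the ten-year target. Applied to the `Difference` column computed in the next cell, `share_within` would give the share of documents that actually meet the criterion.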

In [138]:
# The year filter above leaves gaps in the index of train_y, val_y and test_y.
# pd.concat(axis=1) aligns rows by index, so reset it to pair by position.
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])
diff_pred_true_train = pd.concat([y_pred_train, train_y.reset_index(drop=True)], axis=1)
diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Year

y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])
diff_pred_true_val = pd.concat([y_pred_val, val_y.reset_index(drop=True)], axis=1)
diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Year

y_pred_test = pd.DataFrame(y_pred_test, columns=['Predicted_y'])
diff_pred_true_test = pd.concat([y_pred_test, test_y.reset_index(drop=True)], axis=1)
diff_pred_true_test['Difference'] = diff_pred_true_test.Predicted_y - diff_pred_true_test.Year

In [139]:
diff_pred_true_train.to_csv('/Volumes/Korpora/GERMANC_Exp1b_Reg4_Labels_train.csv',sep=';')
diff_pred_true_val.to_csv('/Volumes/Korpora/GERMANC_Exp1b_Reg4_Labels_val.csv',sep=';')
diff_pred_true_test.to_csv('/Volumes/Korpora/GERMANC_Exp1b_Reg4_Labels_test.csv',sep=';')

In [142]:
features = reg_4['feature_selector'].get_support()
feature_names = reg_4['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1705.757,<BIAS>
+1.626,'uͤber'
+1.578,'eines'
+1.150,'nur'
+0.943,'gegen'
+0.817,'um'
+0.800,'machen'
+0.708,'unter'
+0.558,'diese'
+0.548,'dieser'


In [144]:
# ELI5 can't handle a vectorizer and a feature selector together, so build a
# second vectorizer whose vocabulary is restricted to the selected features.
vectorizer = CountVectorizer(tokenizer=tokenizer_word, vocabulary=features_selected)
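A `CountVectorizer` with a fixed `vocabulary` needs no fitting and produces one column per listed term, in the order given. The sketch below shows this on a made-up feature list and sentence, with the default tokenizer instead of `tokenizer_word`. It assumes that the list of selected names is in the same order as the coefficients of the ridge model; otherwise weights and columns would not line up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical selected feature names, in the order of the model coefficients.
selected = ["um", "nur", "gegen"]

# With a fixed vocabulary, transform() can be called without fit().
fixed_vectorizer = CountVectorizer(vocabulary=selected)
counts = fixed_vectorizer.transform(["nur um um gegen alles"]).toarray()

# One column per selected term; words outside the vocabulary are ignored.
print(counts)
```

Here "alles" is dropped because it is not in the vocabulary, and the columns follow the order `um`, `nur`, `gegen`.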

In [145]:
train_details = collect_predictions(train_x, reg_4['ridge_reg'],vectorizer,features_selected, reg_4)
val_details = collect_predictions(val_x, reg_4['ridge_reg'],vectorizer,features_selected, reg_4)
test_details = collect_predictions(test_x, reg_4['ridge_reg'],vectorizer,features_selected, reg_4)

In [146]:
train_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Train_results_GERMANC.csv',sep=';')
val_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Val_results_GERMANC.csv', sep=';')
test_details.to_csv('/Volumes/Korpora/Exp1b_Reg4_Test_results_GERMANC.csv', sep=';')