**Experiment Description**

*Goal*: Determine if it is possible to predict the year in which a text was written using regression.

*Strategies*:
- Train on all features
- Train on the k highest scoring features where k ranges from 2000 to 3000.

*Relevance*:
- If this experiment works, it is possible to estimate years for corpora that have NA's in this variable.

*Success criteria*:
- Consistent findings over training-, test- and validation set
- predicted year is not more than ten years away from the true year

*Corpora*:
- DTA

*Result*: Classifier overfits heavily -> Generalization Problem

In [1]:
import pandas as pd
import numpy as np
import nltk
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest , f_regression
from sklearn import linear_model
from sklearn.metrics import accuracy_score, mean_squared_error
import sklearn.utils
import re
import eli5
from eli5.lime import TextExplainer

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


In [2]:
train_full = pd.read_csv('/Volumes/Korpora/Train/DTA_train_tokenized.csv', sep=';')
val_full = pd.read_csv('/Volumes/Korpora/Val/DTA_val_tokenized.csv', sep=';')
test_full = pd.read_csv('/Volumes/Korpora/Test/DTA_test_tokenized.csv', sep=';')

In [3]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
print('Length test set: ', len(test_full))

Length train set:  899
Length validation set:  225
Length test set:  281


In [4]:
train_full.describe()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,DTA_File_ID,ID,Publication_year,Sent_ID
count,899.0,899.0,899.0,899.0,899.0,0.0,899.0,899.0
mean,449.0,567.985539,698.937709,0.0,19208.556174,,1788.259177,699.937709
std,259.663243,327.42807,405.880695,0.0,2913.857612,,77.929074,405.880695
min,0.0,0.0,0.0,0.0,16157.0,,1598.0,1.0
25%,224.5,283.5,353.5,0.0,16702.5,,1739.5,354.5
50%,449.0,579.0,699.0,0.0,20005.0,,1796.0,700.0
75%,673.5,845.5,1043.5,0.0,20524.5,,1855.0,1044.5
max,898.0,1123.0,1404.0,0.0,25229.0,,1913.0,1405.0


In [5]:
val_full.describe()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,DTA_File_ID,ID,Publication_year,Sent_ID
count,225.0,225.0,225.0,225.0,225.0,0.0,225.0,225.0
mean,112.0,535.586667,740.377778,0.0,18706.804444,,1791.315556,741.377778
std,65.096083,312.488675,408.646617,0.0,2740.370194,,74.822785,408.646617
min,0.0,6.0,10.0,0.0,16172.0,,1603.0,11.0
25%,56.0,265.0,380.0,0.0,16595.0,,1750.0,381.0
50%,112.0,488.0,800.0,0.0,17129.0,,1804.0,801.0
75%,168.0,808.0,1079.0,0.0,20399.0,,1843.0,1080.0
max,224.0,1122.0,1398.0,0.0,30972.0,,1913.0,1399.0


In [6]:
test_full.describe()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,DTA_File_ID,ID,Publication_year,Sent_ID
count,281.0,281.0,281.0,281.0,0.0,281.0,281.0
mean,140.0,681.067616,0.0,19082.647687,,1788.626335,682.067616
std,81.261922,402.316488,0.0,3030.697764,,70.77832,402.316488
min,0.0,6.0,0.0,16169.0,,1606.0,7.0
25%,70.0,338.0,0.0,16587.0,,1746.0,339.0
50%,140.0,650.0,0.0,17124.0,,1796.0,651.0
75%,210.0,1038.0,0.0,20507.0,,1841.0,1039.0
max,280.0,1402.0,0.0,30331.0,,1913.0,1403.0


**Preprocessing**
- Tokenization (external script, because this step has to be done for every experiment, and it takes very long. The loaded data frames already contain the tokenized text.)

- Binning into decades (already done during the splitting process in order to enable stratified sampling)

**Linear Regression Details**

Gerond (2017) suggests to use Ridge-regression, which contains mean square error as cost function. Mean square error is a suitable cost function for numeric prediction, because it does not distinguish binarily between "correct" and "incorrect", but does measure how far away the predicted value is from the true value. The greater the distance between the predicted and the true value, the greater the loss. Mean square error is the most often used loss function, but it has the disadvantage of exagerating the effect of outliers([Gerond 2017, p.101-102], [Witten et al. 2017, p.176, 195-197]).

Since Scikit has sort of a tutorial for Ridge Regression (https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification), and since this is a standard, state-of-the-art technique for machine learning, I start with ridge regression to get sort of a baseline before trying to refine it with different loss functions, different preprocessing steps, etc.

For evaluation, I use Mean Square Error, since accuracy does not work well with regression tasks.

The documents used are already tokenized. To keep it simple, I am going to use a bag-of-word representation in which the single words and the counts of the words are represented. I will use a sparse representation to speed up the training later (https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

Modifying CountVectorizer: https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af

In [4]:
#build tokenizer that just substitutes '[' and ']' with ','
def tokenizer_word(doc):
    doc = re.sub('[(\[+)|(\]+)]', '', doc)
    doc = re.split(',', doc)
    return doc


In [5]:
train_x = train_full['Text']
train_y = train_full['Publication_year']

val_x = val_full['Text']
val_y = val_full['Publication_year']

In [9]:
#Building pipeline

regression_1 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [12]:
regression_1.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1aee94bbf8>,
                                 vocabulary=None)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

Pipeline out: 

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1aee94bbf8>,
                                 vocabulary=None)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)



In [13]:
y_pred_train = regression_1.predict(train_x) #error over the training set
mean_squared_error(train_y, y_pred_train)

0.2306225459155541

For the evaluation of this task, I use the mean square error, because accuracy is designed to evaluate classification tasks, and this is a regression task.

In [14]:
y_pred_val = regression_1.predict(val_x)#error over validation_set
mean_squared_error(val_y, y_pred_val)

54463.44590640144

The MSE over the train set is 0.23, whereas the MSE over the validation set is 54463.45. This indicates that the model is overfitting heavily.

To read the features out, I am trying out the package eli5, which is also compatible with Keras, a framework I will probably use to build a neural network.

Tutorial: https://towardsdatascience.com/extracting-feature-importances-from-scikit-learn-pipelines-18c79b4ae09a

Documentation: https://eli5.readthedocs.io/en/latest/

In [16]:
feature_names = regression_1['unigram_vectorizer'].get_feature_names()
len(feature_names)

2430142

The train set contains 2'430'142 unique words, which equals the number of features the classifier trains on. Given the number of features and the training time of the classifier (about 20 minutes), it might be a good idea to select some features in order to reduce training time and overfitting.

In [45]:
eli5.show_weights(regression_1['ridge_reg'],vec=regression_1['unigram_vectorizer'], feature_names=feature_names)


Weight?,Feature
+1706.184,<BIAS>
+1.511,'·'
+1.014,'moderne'
+0.972,'x-strahlen'
+0.868,'tizianello'
+0.823,'weissen'
+0.811,'sah'
+0.779,'kunſt'
+0.755,'vögel'
+0.733,'hinter'


According to this graph, the feature that is weighted the most (after the bias) is '.', followed by 'moderne', 'x-strahlen', 'tizianello', 'weissen', 'sah', 'kunft', 'vögel', 'hinter', 'dinge', 'gianino', 'deren', and 'menuets'.

Since the first model overfits heavily and needs really long to train, it might be a good idea to restrict the number of features used by using a feature selection algorithm from sklearn.

Suitable algorithms: https://scikit-learn.org/stable/modules/feature_selection.html

As a selector, I am going to use selectKbest, which selects the k highest scoring features. As k, I use 4000 as a start, and I might vary this number later to find out how it influences the model's performance.

This selector needs a scoring function, and I use f_regression as a scoring function. f_regression performs a f_test on the data that can capture linear dependencies between two random variables.

In [6]:
reg_2 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=4000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [7]:
reg_2.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae2499bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=4000,
                             score_func=<function f_regression at 0x1a19bd1598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tru

reg_2 out:
Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1a5b1dd510>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=4000,
                             score_func=<function f_regression at 0x1a1d58d598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [8]:
y_pred_train = reg_2.predict(train_x)
mean_squared_error(train_y, y_pred_train)

12.683615774135793

In [9]:
y_pred_val = reg_2.predict(val_x)
mean_squared_error(val_y, y_pred_val)

121248.1132911661

MSE train set: 12.68

MSE val set: 121248.11

This shows that feature selection helped with overfitting, but it could be better.

In [14]:
features = reg_2['feature_selector'].get_support(indices=True)
feature_names = reg_2['unigram_vectorizer'].get_feature_names()


NameError: name 'reg_2' is not defined

In [13]:
# Code example: https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
def features_to_names(features, feature_names):
    features_selected = []

    for bool, feature in zip(features, feature_names):
        if bool:
            features_selected.append(feature)
    return features_selected

In [17]:
features_selected = features_to_names(features, feature_names)

In [1]:
eli5.show_weights(reg_2['ridge_reg'],vec=reg_2['unigram_vectorizer'], feature_names=features_selected)


NameError: name 'eli5' is not defined

Top features are: erschaut, kastei'n, d'aimer, l'abbaye-aux-bois, g'seufzt, lieb'res, kasan'scher, droh'nden, kautsky'sche, bertrand-thiel'sche, heyder-pascha's, lessing's, l'assoupissement, schimper'schen.

Interestingly, a lot of the top features are French words. Historically, French words are a good indicator for the age of a text because, as far as I remember, French was spoken very frequently in the German speaking area due to the fact that Napoleon conquered those areas. Later, language purists tried to eliminate all French words in the German language, e.g. "Moment", and created new words to substitute established French loanwords such as "Nase" and "Moment".

This means that the French words in the German texts can be mapped very well to a certain time period, which makes them valuable features for estimating the publishing year of a text.

Since the classifier still overfits, I want to try to diminish the number of features selected to 2000.

In [19]:
reg_3 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=2000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [20]:
reg_3.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae2499bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=2000,
                             score_func=<function f_regression at 0x1a19bd1598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tru

Model output:

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae2499bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=2000,
                             score_func=<function f_regression at 0x1a19bd1598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)



In [22]:
y_pred_train = reg_3.predict(train_x)
mean_squared_error(train_y, y_pred_train)

2.3512860163449725

In [23]:
y_pred_val = reg_3.predict(val_x)
mean_squared_error(val_y, y_pred_val)

150310.63517303564

MSE train: 2.35

MSE val: 150310.64

It seems that 2000 features make the classifier overfit more than with 4000, but less than with all features.

In [24]:
features = reg_3['feature_selector'].get_support()

features_selected = features_to_names(features, feature_names)

In [25]:
eli5.show_weights(reg_3['ridge_reg'],vec=reg_3['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1752.929,<BIAS>
+16.247,'fürchte'
+13.998,'ledig'
+12.606,'väter'
+12.576,'schwanken'
+12.282,'erzählen'
+11.289,'beitrag'
+10.924,'unbeachtet'
+10.573,'öffnen'
+10.253,'erſchöpft'


The most important words: fürchte, ledig, väter, schwanken, erzählen, beitrag, unbeachtet, öffnen, erschöpft.

Interestingly, none of the French words made it into the top 2000 features.

Next experiment: 3000 features.

In [26]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=3000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [27]:
reg_4.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae2499bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=3000,
                             score_func=<function f_regression at 0x1a19bd1598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tru

Model out:

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae2499bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=3000,
                             score_func=<function f_regression at 0x1a19bd1598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)


In [28]:
y_train_predict = reg_4.predict(train_x)

mean_squared_error(train_y, y_train_predict)

27.860837116923015

In [29]:
y_val_predict = reg_4.predict(val_x)

mean_squared_error(val_y, y_val_predict)

120476.10663204179

MSE train: 27.86

MSE val: 120476.11

These values are very much the same as with 4000 features.


In [30]:
features = reg_4['feature_selector'].get_support()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1736.205,<BIAS>
+12.975,'fürchte'
+11.352,'wucht'
+10.151,'ledig'
+9.311,'öffnen'
+8.566,'väter'
+8.521,'erzählen'
+7.983,'übrigens'
+7.841,'seinem'
+6.886,'unmöglichkeit'


The MSEs might be the same as with the model with 4000 features, but the words the model uses are different. 

The top features: fürchte, wucht, ledig, öffnen, väter, erzählen, übrigens, seinem, unmöglichkeit

These features correspond very strongly to the features of the model that uses 2000 words.

Next, I will look what happens when I use 6000 features.

In [8]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=6000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [9]:
reg_5.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=6000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tru

Model out:

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=6000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)



In [10]:
y_train_predict = reg_5.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.05057491622763877

In [12]:
y_val_predict = reg_5.predict(val_x)

mean_squared_error(val_y, y_val_predict)

79922.26344243476

MSE Train: 0.05

MSE Val: 79922.26

The classifier is still overfitting, but at least, the error over the validation set gets smaller. The error over the train set is smaller than the error of the model that uses all features to train.

In [15]:
features = reg_5['feature_selector'].get_support(indices=True)
feature_names = reg_5['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_5['ridge_reg'],vec=reg_5['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1725.968,<BIAS>
+5.287,"""o'"""
+4.980,"""dicht'rin"""
+4.852,"""helene'n"""
+4.736,"""verſchlung'nen"""
+4.275,"""s'agitait"""
+4.221,"""kunſtfert'gem"""
+4.127,"""nöth'ge"""
+4.022,"""rankine'ſchen"""
+3.998,"""c'2"""


Top words: o, dicht'rin, helene'n, verschlung'nen, s'agitait, kustfertgem, nöth'ge, rankine'schen, c'2, -bu-i-t, schiller'schen, erinn'rungen, verlor'n.

There is only one French word in this list, so it is really astonishing that many French words are top features when 4000 features are selected, but not when more or less features are selected.

In [16]:
reg_6 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=8000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [17]:
reg_6.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=8000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tru

Model out: 

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=8000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)



In [18]:
y_train_predict = reg_6.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.02760831163554112

In [19]:
y_val_predict = reg_6.predict(val_x)

mean_squared_error(val_y, y_val_predict)

81250.90764984985

MSE train: 0.03

MSE val: 81250.91

The MSE over the validation set is higher than the MSE of the validation set over 6000 features, but the MSE over the train set is lower.

In [21]:
features = reg_6['feature_selector'].get_support(indices=True)
feature_names = reg_6['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_6['ridge_reg'],vec=reg_6['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1727.790,<BIAS>
+3.974,'*baldung'
+3.878,"""ſomes'ſche"""
+3.859,"""waͤr'"""
+3.761,"""phryg'ſche"""
+3.488,'*bennert'
+3.399,"""raphael'ſchen"""
+3.238,"""d'athè¬"""
+3.203,"""waͤr'es"""
+3.103,"""franzoͤ'ſch"""


Top features: baldung, fomes'sche, wär, bennert, raphael'schen, d'athè, wär'es, franzö'sch, k, benesch, geschäft'ge, l'eau.

We have some more French words as top features in this model than in the model before, but not as much as in the model with 4000 features.

In [22]:
reg_7 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=10000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [23]:
reg_7.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=10000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

Model out:

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=10000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

Model out:


In [24]:
y_train_predict = reg_7.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.018841024899424196

In [25]:
y_val_predict = reg_7.predict(val_x)

mean_squared_error(val_y, y_val_predict)

79260.36149479043

In [26]:
features = reg_7['feature_selector'].get_support(indices=True)
feature_names = reg_7['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_7['ridge_reg'],vec=reg_7['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1725.788,<BIAS>
+4.041,'*broekel'
+3.713,'*findling'
+3.373,'*as-dhiváha'
+3.246,'*fu-set'
+3.215,"""vertheid'ge"""
+3.006,'*as-jâ-mmed'
+2.988,"""war's"""
+2.757,'*fructu-ns'
+2.731,'*gasti-s'


In [27]:
reg_8 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=12000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [28]:
reg_8.fit(train_x,train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=12000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [29]:
y_train_predict = reg_8.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.016078425484788352

In [30]:
y_val_predict = reg_8.predict(val_x)

mean_squared_error(val_y, y_val_predict)

76823.53015689956

In [31]:
features = reg_8['feature_selector'].get_support(indices=True)
feature_names = reg_8['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_8['ridge_reg'],vec=reg_8['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1723.483,<BIAS>
+3.670,'*gellert'
+3.345,'*dreves'
+3.142,'*schaumberg'
+2.792,'*schulze-smidt'
+2.677,'***reede'
+2.618,"""scheuchzer'ſchen"""
+2.559,'*schlemm'
+2.515,'*romanowski'
+2.506,'*drexel'


In [32]:
reg_9 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=14000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [33]:
reg_9.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=14000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [34]:
y_train_predict = reg_9.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.013903785347451611

In [35]:
y_val_predict = reg_9.predict(val_x)

mean_squared_error(val_y, y_val_predict)

68712.04909218462

In [36]:
features = reg_9['feature_selector'].get_support(indices=True)
feature_names = reg_9['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_9['ridge_reg'],vec=reg_9['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1724.128,<BIAS>
+3.617,'*schenck'
+2.913,'*kam'
+2.666,'*wantalowicz'
+2.465,'*wisbacher'
+2.319,'*biri-s'
+2.304,'*weismüller'
+2.261,'*uni-decim'
+2.229,"""wallot'schen"""
+2.166,'*vel-sem'


In [37]:
reg_10 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=16000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [38]:
reg_10.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=16000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [39]:
y_train_predict = reg_10.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.011514426389240428

In [40]:
y_val_predict = reg_10.predict(val_x)

mean_squared_error(val_y, y_val_predict)

58869.77784450592

In [41]:
features = reg_10['feature_selector'].get_support(indices=True)
feature_names = reg_10['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_10['ridge_reg'],vec=reg_10['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1723.139,<BIAS>
+2.867,'*vilkōs'
+2.781,'*sieg'
+2.368,'-bhjaç-ḱabô-bus'
+2.192,'-dacht'
+2.142,'-bodenſtedt'
+2.119,"""„l'honore"""
+2.103,'*eger'
+2.023,'*fiedler'
+2.003,'-dynamik'


In [42]:
reg_11 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=18000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [43]:
reg_11.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=18000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [44]:
y_train_predict = reg_11.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.010583929301670979

In [45]:
y_val_predict = reg_11.predict(val_x)

mean_squared_error(val_y, y_val_predict)

57171.57783725581

In [46]:
features = reg_11['feature_selector'].get_support(indices=True)
feature_names = reg_11['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_11['ridge_reg'],vec=reg_11['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1723.578,<BIAS>
+2.712,'-ane'
+2.165,'-ion'
+2.156,'*weidenmüller'
+2.108,'-iþ'
+2.025,'-küste'
+1.976,'-legen'
+1.964,'*deye'
+1.941,'*jad-s-ta'
+1.884,'*mig-to-s'


In [47]:
reg_12 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=20000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [48]:
reg_12.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=20000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [49]:
y_train_predict = reg_12.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.009254912428222373

In [50]:
y_val_predict = reg_12.predict(val_x)

mean_squared_error(val_y, y_val_predict)

49708.35374639586

In [51]:
features = reg_12['feature_selector'].get_support(indices=True)
feature_names = reg_12['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_12['ridge_reg'],vec=reg_12['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1721.854,<BIAS>
+4.542,'-nŭ'
+2.794,'-färberei'
+2.012,"""unmuͤnd'gen"""
+1.977,'-anunga'
+1.881,'-thir'
+1.862,'-rich'
+1.856,'*sept-mŭ'
+1.837,'-rîcher'
+1.791,'*carnot'


In [52]:
reg_13 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=22000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [53]:
reg_13.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=22000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [54]:
y_train_predict = reg_13.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.009532301992329566

In [55]:
y_val_predict = reg_13.predict(val_x)

mean_squared_error(val_y, y_val_predict)

48859.08545164966

In [56]:
features = reg_13['feature_selector'].get_support(indices=True)
feature_names = reg_13['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_13['ridge_reg'],vec=reg_13['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1720.585,<BIAS>
+4.305,'-wand'
+2.775,'-loch'
+2.033,"""woltman'ſchen"""
+1.981,'-ϰατιο'
+1.971,'-fk'
+1.782,'-ſlagôn'
+1.771,'084'
+1.712,'*eu'
+1.658,'-ëch'


In [57]:
reg_14 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=24000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [58]:
reg_14.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=24000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [59]:
y_train_predict = reg_14.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.0074146576043282245

In [60]:
y_val_predict = reg_14.predict(val_x)

mean_squared_error(val_y, y_val_predict)

52946.16323337469

In [61]:
features = reg_14['feature_selector'].get_support(indices=True)
feature_names = reg_14['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_14['ridge_reg'],vec=reg_14['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1719.195,<BIAS>
+3.986,'11.980'
+3.770,'.14¾'
+2.737,'-stampf-'
+1.841,'-kragen'
+1.741,'*hanan-s'
+1.725,'0.17'
+1.706,"""„gensd'armes"""
+1.687,'0b'
+1.680,'1004.'


In [62]:
reg_15 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=26000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [63]:
reg_15.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=26000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [64]:
y_train_predict = reg_15.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.009018801767735608

In [65]:
y_val_predict = reg_15.predict(val_x)

mean_squared_error(val_y, y_val_predict)

50532.07490542395

In [66]:
features = reg_15['feature_selector'].get_support(indices=True)
feature_names = reg_15['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_15['ridge_reg'],vec=reg_15['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1719.269,<BIAS>
+3.881,'12.hier'
+3.733,'0°'
+2.585,'-îas'
+1.799,'-rês'
+1.720,'*rocca'
+1.677,'*an-thara-s'
+1.664,'1/74'
+1.638,'1000000000.'
+1.612,'11.29'


In [67]:
reg_16 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=28000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [68]:
reg_16.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=28000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [69]:
y_train_predict = reg_16.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.00855448535259319

In [70]:
y_val_predict = reg_16.predict(val_x)

mean_squared_error(val_y, y_val_predict)

53004.6095530511

In [71]:
features = reg_16['feature_selector'].get_support(indices=True)
feature_names = reg_16['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_16['ridge_reg'],vec=reg_16['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1718.435,<BIAS>
+3.672,'13.588'
+3.624,'10.so'
+2.325,'.580.'
+1.715,'-ward'
+1.569,'11°
+1.549,'105es'
+1.549,'10==auf'
+1.532,'*biebendt'
+1.523,'*skoda'


In [72]:
reg_17 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=30000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [73]:
reg_17.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=30000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr

In [74]:
y_train_predict = reg_17.predict(train_x)

mean_squared_error(train_y, y_train_predict)

0.008506373573915853

In [75]:
y_val_predict = reg_17.predict(val_x)

mean_squared_error(val_y, y_val_predict)

54732.34771458616

In [76]:
features = reg_17['feature_selector'].get_support(indices=True)
feature_names = reg_17['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_17['ridge_reg'],vec=reg_17['unigram_vectorizer'], feature_names=features_selected)

Weight?,Feature
+1718.101,<BIAS>
+3.496,'14/20'
+3.475,'1077'
+2.324,'054091'
+1.685,'-εσθαι'
+1.506,'*cu-tero'
+1.501,'12921'
+1.496,'1122..'
+1.464,'116—124'
+1.463,'*urbantſchitſch'


This series of experiments shows that the error over the validation set is the lowest with 22000 as features (MSE train = 0.01, MSE val = 48859.09). However, the difference between these two errors is still large, indicating that the model overfits. Linear regression is a very simple model, so the issue is generalizing over the data rather than a model that is too complex.