**Experiment Description**

*Goal*: Determine if it is possible to predict the year in which a text was written using regression.

*Strategies*:
- Train on all features
- Train on the k highest-scoring features, where k ranges from 2000 to 30000.

*Relevance*:
- If this experiment works, publication years can be estimated for corpora in which this variable is missing (NA).

*Success criteria*:
- Consistent findings across the training, validation and test sets
- The predicted year is no more than ten years away from the true year

*Corpora*:
- DTA

*Result*: The model overfits heavily -> generalization problem

In [1]:
import pandas as pd
import numpy as np
import nltk
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest , f_regression
from sklearn import linear_model
from sklearn.metrics import accuracy_score, mean_squared_error
import sklearn.utils
import re
import eli5




In [2]:
train_full = pd.read_csv('/Volumes/Korpora/Train/DTA_train_tokenized.csv', sep=';')
val_full = pd.read_csv('/Volumes/Korpora/Val/DTA_val_tokenized.csv', sep=';')
test_full = pd.read_csv('/Volumes/Korpora/Test/DTA_test_tokenized.csv', sep=';')

In [None]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
print('Length test set: ', len(test_full))

In [None]:
train_full.describe()

In [None]:
val_full.describe()

In [None]:
test_full.describe()

**Preprocessing**
- Tokenization: done by an external script, because this step is needed for every experiment and takes a long time. The loaded data frames already contain the tokenized text.

- Binning into decades (already done during the splitting process in order to enable stratified sampling)

**Linear Regression Details**

Gerond (2017) suggests using ridge regression, whose cost function is the mean squared error (MSE) plus a regularization term. MSE is a suitable cost function for numeric prediction because it does not make a binary distinction between "correct" and "incorrect" but measures how far the predicted value is from the true value: the greater the distance, the greater the loss. MSE is the most commonly used loss function, but because the errors are squared it has the disadvantage of exaggerating the effect of outliers ([Gerond 2017, p.101-102,115-117], [Witten et al. 2017, p.176, 195-197]).
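To make the outlier effect concrete, here is a minimal sketch with made-up years (the values and the helper name `mean_squared_error_manual` are illustrative assumptions, not part of the experiment). Both prediction sets are compared against the same true years; one is consistently 5 years off, the other is exact except for a single prediction that is 100 years off.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences, shown only for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical publication years and predictions (made up for illustration).
y_true = [1800, 1810, 1820, 1830]
close_preds = [1805, 1805, 1825, 1825]   # every prediction is 5 years off
one_outlier = [1800, 1810, 1820, 1930]   # three exact hits, one 100 years off

mse_close = mean_squared_error_manual(y_true, close_preds)
mse_outlier = mean_squared_error_manual(y_true, one_outlier)
mae_close = mean_absolute_error_manual(y_true, close_preds)
mae_outlier = mean_absolute_error_manual(y_true, one_outlier)

print("MSE:", mse_close, mse_outlier)
print("MAE:", mae_close, mae_outlier)
```

In this toy case the single outlier raises the absolute error by a factor of 5 but the squared error by a factor of 100, which is the exaggeration described above.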

Since scikit-learn provides a user guide for ridge regression (https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification), and since this is a standard machine learning technique, I start with ridge regression to get a baseline before trying to refine it with different loss functions, different preprocessing steps, etc.

For evaluation, I use Mean Square Error, since accuracy does not work well with regression tasks.

The documents used are already tokenized. To keep it simple, I use a bag-of-words representation in which each document is represented by its words and their counts. I use a sparse representation to speed up the training later (https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

Modifying CountVectorizer: https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af
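As a small illustration of the bag-of-words representation, the following sketch runs `CountVectorizer` with its default settings on two made-up sentences (the toy documents are assumptions; the experiment itself uses a custom tokenizer on the DTA texts).

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up toy documents, not taken from the corpus.
docs = ["der hund und der mond", "der mond scheint"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # one row per document, one column per word

vocab = vectorizer.vocabulary_       # maps each word to its column index
print(sparse.issparse(X))            # the result is a SciPy sparse matrix
print(X.shape)
print(sorted(vocab, key=vocab.get))  # words in column order
print(X.toarray())                   # dense view, only sensible for toy data
```

Only the non-zero counts are stored in the sparse matrix, which matters once the vocabulary grows to millions of words.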

In [None]:
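# Tokenizer for the pre-tokenized text: remove the list brackets, then split on commas.
# Note: the pattern is a single character class, so it also removes '(', ')', '+' and '|'.
def tokenizer_word(doc):
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    doc = re.split(',', doc)
    return doc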
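To see what this tokenizer produces, the sketch below applies the same logic to two hypothetical strings. How the `Text` column is actually stored is an assumption here; the second sample shows what would happen if the column held the string form of a Python list.

```python
import re


def tokenizer_word(doc):
    # Same logic as the notebook's tokenizer: drop bracket characters, split on commas.
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    return re.split(',', doc)


# Hypothetical stored formats, made up for illustration.
plain = "[der,hund,bellt]"
quoted = "['der', 'hund']"

plain_tokens = tokenizer_word(plain)
quoted_tokens = tokenizer_word(quoted)

print(plain_tokens)
print(quoted_tokens)
```

In the second case the quotation marks and the space after each comma stay part of the tokens, so it is worth checking a few entries of the fitted vocabulary before interpreting feature names.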


In [None]:
train_x = train_full['Text']
train_y = train_full['Publication_year']

val_x = val_full['Text']
val_y = val_full['Publication_year']

In [None]:
#Building pipeline

regression_1 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
regression_1.fit(train_x, train_y)

Pipeline out: 

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1aee94bbf8>,
                                 vocabulary=None)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)



In [None]:
y_pred_train = regression_1.predict(train_x) #error over the training set
mean_squared_error(train_y, y_pred_train)

For the evaluation of this task, I use the mean square error, because accuracy is designed to evaluate classification tasks, and this is a regression task.

In [None]:
y_pred_val = regression_1.predict(val_x)#error over validation_set
mean_squared_error(val_y, y_pred_val)

The MSE over the train set is 0.23, whereas the MSE over the validation set is 54463.45. This indicates that the model is overfitting heavily.
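Because the MSE is measured in squared years, it is easier to compare with the ten-year success criterion after taking the square root. The sketch below does this for the two values reported above and adds a hypothetical helper, `share_within`, that measures the share of predictions within a given tolerance (the helper and the toy years are assumptions for illustration).

```python
import math


def rmse_from_mse(mse):
    """Root mean squared error, in the same unit as the target (years)."""
    return math.sqrt(mse)


def share_within(y_true, y_pred, tolerance=10):
    """Fraction of predictions that are at most `tolerance` years off."""
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# MSE values reported for the model trained on all features.
train_rmse = rmse_from_mse(0.23)
val_rmse = rmse_from_mse(54463.45)
print(round(train_rmse, 2), round(val_rmse, 2))

# Made-up years to show how the success criterion could be checked.
toy_true = [1800, 1850, 1900, 1750]
toy_pred = [1805, 1870, 1895, 1760]
toy_share = share_within(toy_true, toy_pred)
print(toy_share)
```

On the real data, `share_within(val_y, y_pred_val)` would give the share of validation documents that meet the criterion.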

To inspect the feature weights, I use the package eli5, which also supports Keras, a framework I will probably use later to build a neural network.

Tutorial: https://towardsdatascience.com/extracting-feature-importances-from-scikit-learn-pipelines-18c79b4ae09a

Documentation: https://eli5.readthedocs.io/en/latest/

In [None]:
feature_names = regression_1['unigram_vectorizer'].get_feature_names()
len(feature_names)

The train set contains 2,430,142 unique tokens, which equals the number of features the model is trained on. Given this number of features and the training time (about 20 minutes), it makes sense to select a subset of the features in order to reduce training time and overfitting.

In [None]:
eli5.show_weights(regression_1['ridge_reg'],vec=regression_1['unigram_vectorizer'], feature_names=feature_names)


According to this output, the feature with the largest weight (after the bias) is '.', followed by 'moderne', 'x-strahlen', 'tizianello', 'weissen', 'sah', 'kunft', 'vögel', 'hinter', 'dinge', 'gianino', 'deren', and 'menuets'.

Since the first model overfits heavily and takes a long time to train, I restrict the number of features with a feature selection algorithm from sklearn.

Suitable algorithms: https://scikit-learn.org/stable/modules/feature_selection.html

As a selector, I use SelectKBest, which keeps the k highest-scoring features. I start with k = 4000 and vary this number later to find out how it influences the model's performance.

This selector needs a scoring function, for which I use f_regression. f_regression computes a univariate F-test for each feature, which captures the linear dependence between that feature and the target.
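The following sketch shows this selection step on synthetic data (the data, the seed and the choice of k=1 are assumptions for illustration). The target depends linearly on one of five random features, so that feature should receive by far the highest F-statistic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                                        # five made-up features
y = 1800 + 50 * X[:, 1] + rng.normal(scale=0.1, size=200)   # depends only on feature 1

selector = SelectKBest(f_regression, k=1)
X_selected = selector.fit_transform(X, y)

support = selector.get_support()   # boolean mask of the kept features
print(selector.scores_)            # one F-statistic per feature
print(support)
print(X_selected.shape)
```

Because each feature is scored on its own, the selector cannot see interactions between words; it only ranks them by their individual linear relationship with the year.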

In [None]:
reg_2 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=4000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_2.fit(train_x, train_y)

reg_2 out:
Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1a5b1dd510>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=4000,
                             score_func=<function f_regression at 0x1a1d58d598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

In [None]:
y_pred_train = reg_2.predict(train_x)
mean_squared_error(train_y, y_pred_train)

In [None]:
y_pred_val = reg_2.predict(val_x)
mean_squared_error(val_y, y_pred_val)

MSE train set: 12.68

MSE val set: 121248.11

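Feature selection raised the training error (from 0.23 to 12.68), so the model fits the training data less closely, but the validation error more than doubled (from 54463.45 to 121248.11). The model still generalizes poorly.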

In [None]:
features = reg_2['feature_selector'].get_support(indices=True)
feature_names = reg_2['unigram_vectorizer'].get_feature_names()


In [None]:
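# Code example: https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
def features_to_names(features, feature_names):
    """Map the output of get_support() to feature names.

    Accepts either the boolean mask (default) or the integer indices
    returned by get_support(indices=True).
    """
    features = np.asarray(features)
    if features.dtype == bool:
        # Convert the mask to the positions of the selected features
        features = np.flatnonzero(features)
    return [feature_names[i] for i in features]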
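`get_support()` returns a boolean mask over all features by default and an array of integer positions with `indices=True`. The sketch below, on synthetic data with a hypothetical four-word vocabulary, shows that the two forms need different handling to arrive at the same names.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

names = ["aber", "haus", "moderne", "und"]   # hypothetical vocabulary
rng = np.random.RandomState(1)
X = rng.rand(100, 4)
# Made-up target that depends on the second and third feature only.
y = 3 * X[:, 1] - 2 * X[:, 2] + rng.normal(scale=0.01, size=100)

selector = SelectKBest(f_regression, k=2).fit(X, y)

mask = selector.get_support()                  # boolean array, one entry per feature
indices = selector.get_support(indices=True)   # integer array, one entry per kept feature

names_from_mask = [name for name, keep in zip(names, mask) if keep]
names_from_indices = [names[i] for i in indices]

print(mask, names_from_mask)
print(indices, names_from_indices)
```

An index array is shorter than the vocabulary, so zipping it against the names as if it were a mask would pair the k indices with the first k names and treat every non-zero index as "selected".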

In [None]:
features_selected = features_to_names(features, feature_names)

In [None]:
eli5.show_weights(reg_2['ridge_reg'],vec=reg_2['unigram_vectorizer'], feature_names=features_selected)


Top features are: erschaut, kastei'n, d'aimer, l'abbaye-aux-bois, g'seufzt, lieb'res, kasan'scher, droh'nden, kautsky'sche, bertrand-thiel'sche, heyder-pascha's, lessing's, l'assoupissement, schimper'schen.

Interestingly, several of the top features are French (d'aimer, l'abbaye-aux-bois, l'assoupissement). Historically, French words are a good indicator of the age of a text because, as far as I remember, French was widely spoken in the German-speaking area after Napoleon conquered those areas. Later, language purists tried to eliminate French words from the German language and coined new words to replace established French loanwords such as "Moment".

This means that French words in German texts can be mapped quite well to a certain time period, which makes them valuable features for estimating the publication year of a text.

Since the model still overfits, I reduce the number of selected features to 2000.

In [None]:
reg_3 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=2000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_3.fit(train_x, train_y)

Model output:

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae2499bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=2000,
                             score_func=<function f_regression at 0x1a19bd1598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)



In [None]:
y_pred_train = reg_3.predict(train_x)
mean_squared_error(train_y, y_pred_train)

In [None]:
y_pred_val = reg_3.predict(val_x)
mean_squared_error(val_y, y_pred_val)

MSE train: 2.35

MSE val: 150310.64

With 2000 features, the validation error (150310.64) is higher than with 4000 features (121248.11) and higher than with all features (54463.45), so reducing k further does not reduce overfitting.

In [None]:
features = reg_3['feature_selector'].get_support()

features_selected = features_to_names(features, feature_names)

In [None]:
eli5.show_weights(reg_3['ridge_reg'],vec=reg_3['unigram_vectorizer'], feature_names=features_selected)

The most important words: fürchte, ledig, väter, schwanken, erzählen, beitrag, unbeachtet, öffnen, erschöpft.

Interestingly, none of the French words appears among the top-weighted features of the model with 2000 features.

Next experiment: 3000 features.

In [None]:
reg_4 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=3000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_4.fit(train_x, train_y)

Model out:

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae2499bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=3000,
                             score_func=<function f_regression at 0x1a19bd1598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)


In [None]:
y_train_predict = reg_4.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_4.predict(val_x)

mean_squared_error(val_y, y_val_predict)

MSE train: 27.86

MSE val: 120476.11

The validation MSE is almost the same as with 4000 features (121248.11), while the training MSE is about twice as high (27.86 vs. 12.68).


In [None]:
features = reg_4['feature_selector'].get_support()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_4['ridge_reg'],vec=reg_4['unigram_vectorizer'], feature_names=features_selected)

Although the validation MSE is close to that of the model with 4000 features, the top-weighted words are different.

The top features: fürchte, wucht, ledig, öffnen, väter, erzählen, übrigens, seinem, unmöglichkeit

These features largely overlap with the top features of the model that uses 2000 features.

Next, I look at what happens with 6000 features.

In [None]:
reg_5 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=6000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_5.fit(train_x, train_y)

Model out:

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=6000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)



In [None]:
y_train_predict = reg_5.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_5.predict(val_x)

mean_squared_error(val_y, y_val_predict)

MSE Train: 0.05

MSE Val: 79922.26

The model is still overfitting, but the error over the validation set is smaller than with 2000, 3000 or 4000 features. The error over the train set (0.05) is smaller than that of the model trained on all features (0.23).

In [None]:
features = reg_5['feature_selector'].get_support(indices=True)
feature_names = reg_5['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_5['ridge_reg'],vec=reg_5['unigram_vectorizer'], feature_names=features_selected)

Top words: o, dicht'rin, helene'n, verschlung'nen, s'agitait, kustfertgem, nöth'ge, rankine'schen, c'2, -bu-i-t, schiller'schen, erinn'rungen, verlor'n.

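There is only one French word in this list (s'agitait), so it is astonishing that many French words are top features when 4000 features are selected, but not when more or fewer features are selected. One difference to check before interpreting this: the name lists for 2000 and 3000 features were built from get_support(), whereas the other lists were built from get_support(indices=True).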

In [None]:
reg_6 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=8000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_6.fit(train_x, train_y)

Model out: 

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=8000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)



In [None]:
y_train_predict = reg_6.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_6.predict(val_x)

mean_squared_error(val_y, y_val_predict)

MSE train: 0.03

MSE val: 81250.91

The MSE over the validation set is higher than with 6000 features, but the MSE over the train set is lower.

In [None]:
features = reg_6['feature_selector'].get_support(indices=True)
feature_names = reg_6['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_6['ridge_reg'],vec=reg_6['unigram_vectorizer'], feature_names=features_selected)

Top features: baldung, fomes'sche, wär, bennert, raphael'schen, d'athè, wär'es, franzö'sch, k, benesch, geschäft'ge, l'eau.

This model has a few more French words among its top features than the previous one, but not as many as the model with 4000 features.

In [None]:
reg_7 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=10000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_7.fit(train_x, train_y)

Model out:

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1ae49b1bf8>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k=10000,
                             score_func=<function f_regression at 0x1a1c0f3598>)),
                ('ridge_reg',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)


In [None]:
y_train_predict = reg_7.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_7.predict(val_x)

mean_squared_error(val_y, y_val_predict)

In [None]:
features = reg_7['feature_selector'].get_support(indices=True)
feature_names = reg_7['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_7['ridge_reg'],vec=reg_7['unigram_vectorizer'], feature_names=features_selected)

In [None]:
reg_8 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=12000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_8.fit(train_x,train_y)

In [None]:
y_train_predict = reg_8.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_8.predict(val_x)

mean_squared_error(val_y, y_val_predict)

In [None]:
features = reg_8['feature_selector'].get_support(indices=True)
feature_names = reg_8['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_8['ridge_reg'],vec=reg_8['unigram_vectorizer'], feature_names=features_selected)

In [None]:
reg_9 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=14000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_9.fit(train_x, train_y)

In [None]:
y_train_predict = reg_9.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_9.predict(val_x)

mean_squared_error(val_y, y_val_predict)

In [None]:
features = reg_9['feature_selector'].get_support(indices=True)
feature_names = reg_9['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_9['ridge_reg'],vec=reg_9['unigram_vectorizer'], feature_names=features_selected)

In [None]:
reg_10 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=16000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_10.fit(train_x, train_y)

In [None]:
y_train_predict = reg_10.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_10.predict(val_x)

mean_squared_error(val_y, y_val_predict)

In [None]:
features = reg_10['feature_selector'].get_support(indices=True)
feature_names = reg_10['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_10['ridge_reg'],vec=reg_10['unigram_vectorizer'], feature_names=features_selected)

In [None]:
reg_11 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=18000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_11.fit(train_x, train_y)

In [None]:
y_train_predict = reg_11.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_11.predict(val_x)

mean_squared_error(val_y, y_val_predict)

In [None]:
features = reg_11['feature_selector'].get_support(indices=True)
feature_names = reg_11['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_11['ridge_reg'],vec=reg_11['unigram_vectorizer'], feature_names=features_selected)

In [None]:
reg_12 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=20000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_12.fit(train_x, train_y)

In [None]:
y_train_predict = reg_12.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_12.predict(val_x)

mean_squared_error(val_y, y_val_predict)

In [None]:
features = reg_12['feature_selector'].get_support(indices=True)
feature_names = reg_12['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_12['ridge_reg'],vec=reg_12['unigram_vectorizer'], feature_names=features_selected)

In [None]:
reg_13 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=22000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_13.fit(train_x, train_y)

In [None]:
y_train_predict = reg_13.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_13.predict(val_x)

mean_squared_error(val_y, y_val_predict)

In [None]:
features = reg_13['feature_selector'].get_support(indices=True)
feature_names = reg_13['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_13['ridge_reg'],vec=reg_13['unigram_vectorizer'], feature_names=features_selected)

In [None]:
reg_14 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=24000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_14.fit(train_x, train_y)

In [None]:
y_train_predict = reg_14.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_14.predict(val_x)

mean_squared_error(val_y, y_val_predict)

In [None]:
features = reg_14['feature_selector'].get_support(indices=True)
feature_names = reg_14['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_14['ridge_reg'],vec=reg_14['unigram_vectorizer'], feature_names=features_selected)

In [None]:
reg_15 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=26000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_15.fit(train_x, train_y)

In [None]:
y_train_predict = reg_15.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_15.predict(val_x)

mean_squared_error(val_y, y_val_predict)

In [None]:
features = reg_15['feature_selector'].get_support(indices=True)
feature_names = reg_15['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_15['ridge_reg'],vec=reg_15['unigram_vectorizer'], feature_names=features_selected)

In [None]:
reg_16 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=28000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_16.fit(train_x, train_y)

In [None]:
y_train_predict = reg_16.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_16.predict(val_x)

mean_squared_error(val_y, y_val_predict)

In [None]:
features = reg_16['feature_selector'].get_support(indices=True)
feature_names = reg_16['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_16['ridge_reg'],vec=reg_16['unigram_vectorizer'], feature_names=features_selected)

In [None]:
reg_17 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word)),
                    ('feature_selector', SelectKBest(f_regression, k=30000)),
                         ('ridge_reg', linear_model.Ridge())
                        ])

In [None]:
reg_17.fit(train_x, train_y)

In [None]:
y_train_predict = reg_17.predict(train_x)

mean_squared_error(train_y, y_train_predict)

In [None]:
y_val_predict = reg_17.predict(val_x)

mean_squared_error(val_y, y_val_predict)

In [None]:
features = reg_17['feature_selector'].get_support(indices=True)
feature_names = reg_17['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.show_weights(reg_17['ridge_reg'],vec=reg_17['unigram_vectorizer'], feature_names=features_selected)

This series of experiments shows that the error over the validation set is lowest with 22000 features (MSE train = 0.01, MSE val = 48859.09). However, the difference between these two errors is still large, indicating that the model overfits: a validation MSE of 48859.09 corresponds to a root mean squared error of about 221 years, far from the ten-year success criterion. Since linear regression is a very simple model, the problem is more likely that the learned features do not generalize beyond the training data than that the model is too complex.
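The series above repeats the same pipeline for each k. A minimal sketch of how it could be run as a loop is shown below; the function name `evaluate_k_values` and the toy sentences and years are assumptions for illustration, and the pipeline steps mirror the ones used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline


def evaluate_k_values(k_values, train_x, train_y, val_x, val_y, tokenizer=None):
    """Fit one vectorizer -> SelectKBest -> Ridge pipeline per k.

    Returns a dict that maps k to (train MSE, validation MSE).
    """
    results = {}
    for k in k_values:
        model = Pipeline([
            ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer)),
            ('feature_selector', SelectKBest(f_regression, k=k)),
            ('ridge_reg', Ridge()),
        ])
        model.fit(train_x, train_y)
        train_mse = mean_squared_error(train_y, model.predict(train_x))
        val_mse = mean_squared_error(val_y, model.predict(val_x))
        results[k] = (train_mse, val_mse)
    return results


# Made-up toy data; the real call would pass train_x, train_y, val_x, val_y
# and tokenizer=tokenizer_word.
toy_train_x = ["alte worte hier", "neue worte dort", "alte zeit hier", "neue zeit dort"]
toy_train_y = [1700, 1900, 1710, 1890]
toy_val_x = ["alte worte dort", "neue zeit hier"]
toy_val_y = [1720, 1880]

results = evaluate_k_values([1, 2, 3], toy_train_x, toy_train_y, toy_val_x, toy_val_y)
for k, (train_mse, val_mse) in results.items():
    print(k, round(train_mse, 2), round(val_mse, 2))
```

Collecting the errors in one dictionary makes it easier to compare all values of k side by side. On the full corpus each iteration re-fits the vectorizer, so vectorizing once and looping only over the selector and the regression might save considerable time.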