**Experiment 4: DTA, linear regression and subwords**

*Background*: Experiment 2 showed that linear regression with 215 features no longer overfits, but the model does not fit the beginning and the end of the time range. Quadratic polynomial regression was considered as an alternative, but it performs worse than linear regression on both the training and the validation set. Polynomial regression of degree 5 could not be computed because of the vast number of polynomial features it generates.

*Goal*: Determine if it is possible to predict the year in which a text was written using regression.

*Strategies*:

- Use a byte pair encoding (BPE) transformer to train on subwords (Sennrich2016)

*Relevance*:

- If this experiment works, years can be estimated for corpora in which this variable has missing values (NA).
- Subwords might improve generalization while reducing the size of the vocabulary.

*Success criteria*:

- Consistent findings across the training, validation and test sets
- The predicted year is no more than ten years away from the true year

*Corpora*:

- DTA

*Baseline to beat (Exp. 2)*:

- Train MSE = 2259.8
- Val MSE = 3202.51

*Result*: - 
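To relate the baseline to the ten-year success criterion, it can help to convert the MSE values into root mean squared errors, which are expressed in years. The sketch below only uses the two baseline numbers listed above. The threshold name `TOLERANCE_YEARS` is an assumption introduced for illustration, and an RMSE below ten years is a related but not identical condition to every single prediction being within ten years.

```python
import math

# Baseline MSE values from Experiment 2 (taken from the list above).
baseline_mse = {"train": 2259.8, "val": 3202.51}

# Hypothetical name for the ten-year tolerance from the success criteria.
TOLERANCE_YEARS = 10

# RMSE is the square root of the MSE and is measured in years.
baseline_rmse = {split: math.sqrt(mse) for split, mse in baseline_mse.items()}

for split, rmse in baseline_rmse.items():
    status = "within" if rmse <= TOLERANCE_YEARS else "outside"
    print(f"{split}: RMSE = {rmse:.1f} years ({status} the {TOLERANCE_YEARS}-year tolerance)")
```

Under this reading, the baseline is off by roughly 47.5 years on the training set and 56.6 years on the validation set, so the success criterion is considerably stricter than what Experiment 2 achieved.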

In [1]:
import pandas as pd
import numpy as np
import nltk

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import sklearn.utils
from sklearn.preprocessing import FunctionTransformer

from Selfwritten_modules.SubwordTransformer import SubwordTransformer

import re

import eli5



Test the BPE algorithm on a small toy set:

In [2]:
playset = pd.read_csv('/users/dianaenggist/Documents/Masterprojekt/testfile_deutsch_tokenized.csv', sep=';')

In [3]:
print(playset)
playset = playset['Text']

   Unnamed: 0                                               Text
0           0  [['Die', 'Pressekonferenz', 'ist', 'beendet', ...
1           1  [['Strupler', ':', '«', 'Wir', 'müssen', 'zuer...
2           2  [['Koch', ':', '«', 'Es', 'gibt', 'bereits', '...


In [4]:
# Build a tokenizer that removes '[' and ']' and then splits the string on ','.
# Note: the resulting tokens keep their quotes and leading spaces.
def tokenizer_word(doc):
    doc = re.sub(r'[\[\]]', '', doc)
    doc = doc.split(',')
    return doc
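Because `tokenizer_word` splits on every comma, a token that is itself a comma is broken apart, and all tokens keep their surrounding quotes. As a possible alternative, the sketch below parses the stringified list with `ast.literal_eval` from the standard library. It assumes that every `Text` cell is a valid Python literal of the form "list of sentences, each a list of token strings", as the preview above suggests. The function name `tokenizer_literal` and the example string are hypothetical, and whether `SubwordTransformer` accepts this tokenizer unchanged has not been checked.

```python
import ast

def tokenizer_literal(doc):
    """Parse a stringified nested token list and return a flat list of tokens."""
    # literal_eval only evaluates Python literals, so it is safe for untrusted strings.
    sentences = ast.literal_eval(doc)
    # Flatten the list of sentences into one list of tokens.
    return [token for sentence in sentences for token in sentence]

# Hypothetical example in the same format as the 'Text' column.
example = "[['Koch', ':', '«', 'Es', 'gibt', ','], ['Ja', '.']]"
tokens = tokenizer_literal(example)
print(tokens)
```

With this approach the comma token survives as a single token and the quotes are not part of the tokens.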

In [5]:
transformer = SubwordTransformer(tokenizer=tokenizer_word)
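The `SubwordTransformer` module is self-written and not shown in this notebook. To illustrate the underlying technique, the following is a minimal sketch of how BPE merge operations can be learned from word counts, in the spirit of Sennrich2016. It is not the implementation used here: the function names, the `</w>` end-of-word marker and the toy word counts are assumptions chosen for illustration.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    bigram = re.escape(' '.join(pair))
    # Match the pair only where it is delimited by whitespace or string boundaries.
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(lambda _: merged, word): freq for word, freq in vocab.items()}

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency mapping."""
    # Represent each word as space-separated characters plus an end-of-word marker.
    vocab = {' '.join(word) + ' </w>': freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Hypothetical toy word counts.
word_counts = {'low': 5, 'lower': 2, 'newest': 6, 'widest': 3}
merges, vocab = learn_bpe(word_counts, num_merges=3)
print(merges)
print(vocab)
```

Frequent character sequences such as word endings are merged first, which is what lets a subword vocabulary stay small while still covering rare or unseen words.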





In [6]:
vocabulary = transformer.bpe(playset)



In [7]:
print(vocabulary)

defaultdict(<class 'int'>, {"'Die'": 1, " 'Pressekonferenz'": 1, " 'ist'": 1, " 'beendet'": 1, " '.'": 6, " 'Vielen'": 1, " 'Dank'": 1, " 'für'": 1, " 'Ihre'": 1, " 'Aufmerksamkeit'": 1, " 'Weiter'": 1, " 'geht'": 1, " 'es'": 1, " 'um'": 2, " '15.30'": 1, " 'Uhr'": 1, " 'mit'": 1, " 'der'": 1, " 'Medienkonferenz'": 1, " 'des'": 1, " 'Kantons'": 1, " 'Graubünden'": 1, " 'Die'": 1, " 'Behörden'": 1, " 'nehmen'": 1, " 'Stellung'": 1, " 'zu'": 2, " 'den'": 1, " 'zwei'": 1, " 'bestätigten'": 1, " 'Coronavirus-Fällen'": 1, " 'im'": 1, " 'Kanton'": 1, "'Koch'": 1, " ':'": 1, " '«'": 1, " 'Es'": 2, " 'gibt'": 1, " 'bereits'": 1, " 'europaweit'": 1, " 'Aufrufe'": 1, " '": 3, "'": 4, " 'Forschung'": 1, " 'generieren'": 1, " 'wird'": 1, " 'schnell'": 1, " 'und'": 2, " 'stark'": 1, " 'an'": 2, " 'Impfstoffen'": 1, " 'Medikamenten'": 1, " 'geforscht.'": 1, " '»'": 1, " 'Zu'": 1, " 'sagen'": 1, " 'wie'": 1, " 'lange'": 1, " 'das'": 1, " 'dauere'": 1, " 'sei'": 1, " 'aber'": 1, " 'Spekulation'": 1, '

In [None]:
# Code example: https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
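def features_to_names(features, feature_names):
    """Return the names of the features whose entry in the boolean mask is true."""
    features_selected = []

    # Avoid `bool` as a loop variable, since it would shadow the built-in type.
    for is_selected, feature in zip(features, feature_names):
        if is_selected:
            features_selected.append(feature)
    return features_selected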
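The same selection can be expressed with `itertools.compress` from the standard library, which keeps the items whose mask entry is true. The sketch below uses a hypothetical mask and hypothetical feature names; in the pipeline, the mask would presumably be the boolean array returned by `SelectKBest.get_support()`.

```python
from itertools import compress

# Hypothetical feature names and selection mask, for illustration only.
feature_names = ["und", "vnd", "seyn", "sein"]
mask = [True, False, True, False]

# Keep only the names whose mask entry is true.
selected = list(compress(feature_names, mask))
print(selected)
```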

In [None]:
# Collect the global feature weights and the per-instance feature contributions
# (weight * value) in order to find out how features are weighted.

def collect_predictions(dataset, classifier, vectorizer, feature_names, pipeline):
    # Global weights of the fitted model; YEAR is set to 0 as a placeholder for these rows.
    predictions = eli5.explain_weights_df(classifier, vec=vectorizer, feature_names=feature_names)
    predictions = predictions.drop(['target'], axis=1)
    predictions['YEAR'] = 0

    for instance in range(len(dataset)):
        # Per-feature explanation for a single document
        pred = eli5.explain_prediction_df(classifier, dataset[instance], vec=vectorizer, feature_names=feature_names)

        # Predicted year for the same document
        source_text = pd.DataFrame([[dataset[instance]]])
        year_pred = pipeline.predict(source_text[0])

        # Contribution of each feature to this prediction
        pred['weight_value'] = pred['weight'] * pred['value']
        pred['instance'] = instance
        pred = pred.drop(['target', 'weight', 'value'], axis=1)
        pred['YEAR'] = np.round(year_pred[0])

        predictions = pd.concat([predictions, pred])

    return predictions
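The `weight_value` column rests on the fact that a linear regression prediction is the intercept plus the sum of each feature's weight times its value. The sketch below illustrates this decomposition with made-up numbers. The intercept, weights, feature names and values are all hypothetical and do not come from the fitted model.

```python
# Hypothetical fitted linear model: an intercept plus one weight per feature.
intercept = 1800.0
weights = {"vnd": -2.0, "und": 1.5, "seyn": -4.0}

# Hypothetical feature values (for example token counts) for one document.
values = {"vnd": 3, "und": 10, "seyn": 1}

# Contribution of each feature: weight * value.
contributions = {feature: weights[feature] * values[feature] for feature in weights}

# The prediction is the intercept plus the sum of all contributions.
prediction = intercept + sum(contributions.values())
print(contributions)
print(prediction)
```

Sorting the contributions by absolute size shows which features pull a document's predicted year forward or backward the most.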

In [None]:
train_full = pd.read_csv('/Volumes/Korpora/Train/DTA_train_tokenized.csv', sep=';')
val_full = pd.read_csv('/Volumes/Korpora/Val/DTA_val_tokenized.csv', sep=';')
#test_full = pd.read_csv('/Volumes/Korpora/Test/DTA_test_tokenized.csv', sep=';')

In [None]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
#print('Length test set: ', len(test_full))

In [None]:
train_x = train_full['Text']
train_y = train_full['Publication_year']

val_x = val_full['Text']
val_y = val_full['Publication_year']