**Experiment 4: DTA, linear regression and subwords**

*Background*: Experiment 2 showed that linear regression with 215 features no longer overfits, but the model fits the beginning and the end of the time range poorly. Quadratic polynomial regression would be an alternative, but it performs worse than linear regression on both the training and the validation set. Polynomial regression of degree 5 cannot be computed because of the vast number of polynomial features the model creates.
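
To give a sense of scale for the degree-5 problem, the sketch below counts the terms of a full polynomial expansion. It assumes that all 215 selected features enter the expansion and that the bias term is included, which gives the binomial coefficient C(n + d, d); the function name is made up for illustration.

```python
from math import comb


def n_polynomial_features(n_features: int, degree: int) -> int:
    """Count monomials of total degree <= `degree` in `n_features` variables.

    The count includes the constant (bias) term.
    """
    return comb(n_features + degree, degree)


for degree in (1, 2, 5):
    print(degree, n_polynomial_features(215, degree))
```

Under these assumptions, degree 2 yields 23,436 columns and degree 5 about 4.1 billion. Stored densely as 64-bit floats, a single row of the degree-5 matrix would take roughly 33 GB.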

*Goal*: Determine if it is possible to predict the year in which a text was written using regression.

*Strategies*:

- Use a BPE-transformer to train on subwords (Sennrich2016)

*Relevance*:

- If this experiment works, years can be estimated for corpora in which this variable has missing values (NAs).
- Subwords might improve generalization while reducing the size of the vocabulary.

*Success criteria*:

- Consistent findings across the training, validation and test sets
- The predicted year is no more than ten years away from the true year
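
The ten-year criterion can be checked per document. The following is a minimal sketch; the helper name and the toy years are hypothetical and are not results from the DTA.

```python
def share_within_tolerance(y_true, y_pred, tolerance=10):
    """Return the fraction of predictions at most `tolerance` years off."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Hypothetical toy values, for illustration only
true_years = [1700, 1750, 1800, 1850]
pred_years = [1705, 1790, 1795, 1851]
share = share_within_tolerance(true_years, pred_years)
print(share)
```

A share of 1.0 would mean that every document meets the criterion. Reporting the share alongside the MSE shows how far a model is from that target.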

*Corpora*:

- DTA

*Baseline to beat (Exp. 2)*:
- Train MSE = 2259.8
- Val MSE = 3202.51
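
Because the MSE is measured in squared years, its square root (RMSE) is easier to compare with the ten-year success criterion. A short sketch using the two baseline values above:

```python
from math import sqrt

# Baseline MSE values from Exp. 2, as listed above
baseline_mse = {"train": 2259.8, "val": 3202.51}

baseline_rmse = {split: sqrt(mse) for split, mse in baseline_mse.items()}
for split, rmse in baseline_rmse.items():
    print(f"{split}: RMSE = {rmse:.1f} years")
```

The RMSE is a typical error size, not a per-document bound, so it only gives a rough indication of the distance to the ten-year target.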

*Result*: - 

In [1]:
import pandas as pd
import numpy as np
import nltk

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import sklearn.utils
from sklearn.preprocessing import FunctionTransformer

from Selfwritten_modules.SubwordTransformer import SubwordTransformer

import re

import eli5



Test bpe-algorithm on a small play-set:

In [2]:
playset = pd.read_csv('/users/dianaenggist/Documents/Masterprojekt/testfile_deutsch_tokenized.csv', sep=';')

In [3]:
print(playset)
playset = playset['Text']

   Unnamed: 0                                               Text
0           0  [['Die', 'Pressekonferenz', 'ist', 'beendet', ...
1           1  [['Strupler', ':', '«', 'Wir', 'müssen', 'zuer...
2           2  [['Koch', ':', '«', 'Es', 'gibt', 'bereits', '...


In [4]:
# Tokenizer that strips '[' and ']' and then splits the string on ','
def tokenizer_word(doc):
    # Raw string with a plain character class: matches only the two bracket characters
    doc = re.sub(r'[\[\]]', '', doc)
    doc = doc.split(',')
    return doc
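
The `Text` column printed above looks like a stringified list of token lists. A tokenizer that strips brackets and splits on commas leaves the quote characters and leading spaces inside each token, and it also splits a token that is itself a comma. If each cell is a valid Python literal (an assumption, since only a truncated preview is visible here), `ast.literal_eval` could parse it directly. The function name and the example string below are hypothetical.

```python
import ast


def parse_tokenized_cell(cell: str) -> list:
    """Parse a stringified list of token lists and flatten it into one list."""
    sentences = ast.literal_eval(cell)
    return [token for sentence in sentences for token in sentence]


# Made-up cell in the same shape as the preview above
example_cell = "[['Die', 'Pressekonferenz', 'ist', 'beendet', '.'], ['Vielen', 'Dank', ',']]"
tokens = parse_tokenized_cell(example_cell)
print(tokens)
```

Whether this fits depends on what `SubwordTransformer` expects from its `tokenizer` argument, which is not shown in this notebook.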

In [5]:
transformer = SubwordTransformer(tokenizer=tokenizer_word)





In [6]:
vocabulary = transformer.bpe(playset)

e,n
en
e,n
en
e,r
er
e,r
er
o,n
on
o,n
on
a,n
an
a,n
an
e,i
ei
e,i
ei
i,e
ie
i,e
ie
e,s
es
e,s
es
ei,t
eit
ei,t
eit
u,n
un
u,n
un
c,h
ch
c,h
ch
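
The log above appears to list the symbol pair chosen at each merge step together with the merged symbol. For reference, here is a minimal sketch of BPE merge learning in the style of Sennrich et al. (2016). It is not the code of `SubwordTransformer`, whose internals are not shown here, and it omits the end-of-word marker used in the paper. All names and the toy words are illustrative.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` by the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(words, n_merges):
    """Learn up to `n_merges` merges, starting from single characters."""
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(n_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


toy_words = ["den", "den", "den", "wen", "wen", "der"]
merges, vocab = learn_bpe(toy_words, n_merges=2)
print(merges)
print(vocab)
```

In this toy example the pair `e,n` is merged first because it is the most frequent, followed by `d,en`.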


In [7]:
print(vocabulary)

defaultdict(<class 'int'>, {'D,ie': 2, 'P,r,es,s,e,k,on,f,er,en,z': 1, 'i,s,t': 1, 'b,e,en,d,e,t': 1, '.': 6, 'V,ie,l,en': 1, 'D,an,k': 1, 'f,ü,r': 1, 'I,h,r,e': 1, 'A,u,f,m,er,k,s,a,m,k,eit': 1, 'W,eit,er': 1, 'g,e,h,t': 1, 'es': 1, 'u,m': 2, '1,5,.,3,0': 1, 'U,h,r': 1, 'm,i,t': 1, 'd,er': 1, 'M,e,d,ien,k,on,f,er,en,z': 1, 'd,es': 1, 'K,an,t,on,s': 1, 'G,r,a,u,b,ü,n,d,en': 1, 'B,e,h,ö,r,d,en': 1, 'n,e,h,m,en': 1, 'S,t,e,l,l,un,g': 1, 'z,u': 2, 'd,en': 1, 'z,w,ei': 1, 'b,es,t,ä,t,i,g,t,en': 1, 'C,o,r,on,a,v,i,r,u,s,-,F,ä,l,l,en': 1, 'i,m': 1, 'K,an,t,on': 1, 'K,o,ch': 1, ':': 1, '«': 1, 'E,s': 2, 'g,i,b,t': 1, 'b,er,eit,s': 1, 'e,u,r,o,p,a,w,eit': 1, 'A,u,f,r,u,f,e': 1, "'": 6, 'F,o,r,s,ch,un,g': 1, 'g,en,er,ier,en': 1, 'w,i,r,d': 1, 's,ch,n,e,l,l': 1, 'un,d': 2, 's,t,a,r,k': 1, 'an': 2, 'I,m,p,f,s,t,o,f,f,en': 1, 'M,e,d,i,k,a,m,en,t,en': 1, 'g,e,f,o,r,s,ch,t,.': 1, '»': 1, 'Z,u': 1, 's,a,g,en': 1, 'w,ie': 1, 'l,an,g,e': 1, 'd,a,s': 1, 'd,a,u,er,e': 1, 's,ei': 1, 'a,b,er': 1, 'S,p,e,k,

In [None]:
# Code example: https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
def features_to_names(features, feature_names):
    # Keep the names whose entry in the boolean mask `features` is True
    features_selected = []

    # `is_selected` avoids shadowing the built-in `bool`
    for is_selected, feature in zip(features, feature_names):
        if is_selected:
            features_selected.append(feature)
    return features_selected
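
The same filtering can be written with `itertools.compress` from the standard library. In the sketch below the mask and the names are made-up toy values. In the notebook the mask would presumably come from `SelectKBest.get_support()` and the names from the vectorizer, which is an assumption about code not shown at this point.

```python
from itertools import compress


def selected_names(mask, feature_names):
    """Return the feature names whose mask entry is truthy."""
    return list(compress(feature_names, mask))


# Toy values, for illustration only
mask = [True, False, True, False]
names = ["en", "er", "ch", "eit"]
kept = selected_names(mask, names)
print(kept)
```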

In [None]:
# Collect per-feature contributions for every document to see how features are weighted

def collect_predictions(dataset, classifier, vectorizer, feature_names, pipeline):
    # Global feature weights of the fitted model
    predictions = eli5.explain_weights_df(classifier, vec=vectorizer, feature_names=feature_names)
    predictions = predictions.drop(['target'], axis=1)
    predictions['YEAR'] = 0

    frames = [predictions]
    # Assumes `dataset` is a list or a Series with the default 0..n-1 index
    for instance in range(len(dataset)):
        text = dataset[instance]
        pred = eli5.explain_prediction_df(classifier, text, vec=vectorizer, feature_names=feature_names)
        year_pred = pipeline.predict(pd.Series([text]))
        # Contribution of a feature = model weight * feature value in this document
        pred['weight_value'] = pred['weight'] * pred['value']
        pred['instance'] = instance
        pred = pred.drop(['target', 'weight', 'value'], axis=1)
        pred['YEAR'] = np.round(year_pred[0])
        frames.append(pred)

    # Concatenate once instead of growing the frame inside the loop
    return pd.concat(frames)
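
The `weight_value` column relies on the fact that a linear model's prediction is the intercept plus the sum of weight times feature value. The sketch below illustrates this decomposition in plain Python. The weights, counts and intercept are invented for illustration and are not taken from the fitted model.

```python
def linear_contributions(weights, values, bias):
    """Split a linear prediction into per-feature contributions."""
    contributions = {name: weights[name] * values.get(name, 0) for name in weights}
    prediction = bias + sum(contributions.values())
    return contributions, prediction


# Hypothetical model weights and subword counts for one document
weights = {"en": 2.0, "ch": -1.5}
doc_counts = {"en": 3, "ch": 2}
contributions, predicted_year = linear_contributions(weights, doc_counts, bias=1800.0)
print(contributions)
print(predicted_year)
```

Features with large absolute contributions are the ones that push the predicted year furthest from the intercept.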

In [None]:
train_full = pd.read_csv('/Volumes/Korpora/Train/DTA_train_tokenized.csv', sep=';')
val_full = pd.read_csv('/Volumes/Korpora/Val/DTA_val_tokenized.csv', sep=';')
#test_full = pd.read_csv('/Volumes/Korpora/Test/DTA_test_tokenized.csv', sep=';')

In [None]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
#print('Length test set: ', len(test_full))

In [None]:
train_x = train_full['Text']
train_y = train_full['Publication_year']

val_x = val_full['Text']
val_y = val_full['Publication_year']