**Experiment 3: DTA, polynomial regression, and selecting features that occur in a minimum number of texts**

*Background*: Experiment 2 showed that using only features that occur in at least 800 of the 899 training texts solves the overfitting problem. The analysis also showed that linear regression fits texts from the middle of the timespan well, but fits poorly on texts written very early or very late. This is surprising because more texts were written in the second half of the timespan than in the first half.

*Goal*: Determine if it is possible to predict the year in which a text was written using regression.

*Strategies*:

 - Train polynomial regression on features that occur in at least a given number of texts (here `min_df = 800`)

*Relevance*:

 If this experiment works, publication years can be estimated for corpora in which this variable is missing (NA).

*Success criteria*:

 - Consistent findings across the training, validation, and test sets
 - Predicted year is no more than ten years away from the true year
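
The ten-year criterion can be made measurable as the share of texts whose prediction lies within ten years of the true year. A minimal sketch, assuming plain Python lists of true and predicted years; the helper name `share_within_tolerance` and the example values are hypothetical and not part of the experiment:

```python
def share_within_tolerance(y_true, y_pred, tolerance=10):
    """Return the fraction of predictions at most `tolerance` years off."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(abs(p - t) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Made-up illustration values, not results from the DTA corpus.
example_true = [1741, 1691, 1665, 1800]
example_pred = [1745.2, 1720.0, 1660.1, 1811.0]
print(share_within_tolerance(example_true, example_pred))
```

Reporting this share next to the MSE would show directly how far a model is from the success criterion.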
 
*Corpora*:

 - DTA

*Baseline to beat (Exp. 2)*:
- Train MSE = 2259.8
- Val MSE = 3202.51

*Result*: Quadratic polynomial regression performs far worse than linear regression (train MSE ≈ 158,046, val MSE ≈ 2.2e10), and polynomial regression of degree 5 cannot be computed because of the number of polynomial features it creates.


In [1]:
import pandas as pd
import numpy as np
import nltk

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest , f_regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import sklearn.utils
import re

import eli5



In [2]:
# Code example: https://stackoverflow.com/questions/39839112/the-easiest-way-for-getting-feature-names-after-running-selectkbest-in-scikit-le
def features_to_names(features, feature_names):
    features_selected = []

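    # `selected` is the boolean mask entry for the feature (avoids shadowing the built-in `bool`)
    for selected, feature in zip(features, feature_names):
        if selected:
            features_selected.append(feature)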
    return features_selected

In [3]:
train_full = pd.read_csv('/Volumes/Korpora/Train/DTA_train_tokenized.csv', sep=';')
val_full = pd.read_csv('/Volumes/Korpora/Val/DTA_val_tokenized.csv', sep=';')
#test_full = pd.read_csv('/Volumes/Korpora/Test/DTA_test_tokenized.csv', sep=';')

In [4]:
print('Length train set: ',len(train_full))
print('Length validation set: ', len(val_full))
#print('Length test set: ', len(test_full))

Length train set:  899
Length validation set:  225


In [5]:
# Tokenizer: removes '[' and ']' (the character class also strips '(', ')', '+' and '|'),
# then splits the remaining string on ','
def tokenizer_word(doc):
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    doc = re.split(',', doc)
    return doc
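
To see what the tokenizer returns, it helps to run it on a small string. The sketch below assumes the `Text` column stores each document as a bracketed, comma-separated token list; that format and the sample strings are assumptions for illustration. The function is repeated with a raw-string pattern so the example runs on its own:

```python
import re


def tokenizer_word(doc):
    # Same character class as in the notebook, written as a raw string.
    doc = re.sub(r'[(\[+)|(\]+)]', '', doc)
    return re.split(',', doc)


# Hypothetical sample in the assumed "[tok, tok, ...]" format.
print(tokenizer_word("[der, die, das]"))
# Tokens after the first keep the space that followed the comma.

print(tokenizer_word("[a,(b),c+d]"))
# The class also strips '(', ')', '+' and '|', not only the brackets.
```

If the real data has a space after each comma, tokens such as `' die'` and `'die'` would count as different features, so it may be worth checking the vocabulary for leading whitespace.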

In [6]:
# Assemble per-text feature contributions in order to find out how features are weighted

def collect_predictions(dataset, classifier, vectorizer, feature_names, pipeline):
    predictions = eli5.explain_weights_df(classifier, vec=vectorizer, feature_names=feature_names)
    predictions = predictions.drop(['target'], axis=1)

    #features = predictions['feature']
    #predictions = predictions.sort_values('feature')
    #predictions = predictions.pivot(index='instance',columns='feature', values='weight')

    predictions['YEAR'] = 0

    for instance in range(len(dataset)):
        pred = eli5.explain_prediction_df(classifier, dataset[instance], vec=vectorizer, feature_names=feature_names)
        source_text = pd.DataFrame([[dataset[instance]]])
        year_pred = pipeline.predict(source_text[0])
        pred['weight_value'] = pred['weight'] * pred['value']
        pred['instance'] = instance

        pred = pred.drop(['target', 'weight', 'value'], axis=1)

        #pred = pred.merge(features, how='right', on='feature')
        #pred = pred.sort_values('feature')
        #pred = pred.pivot(index = 'instance', columns='feature', values = 'weight_value')

        pred['YEAR'] = np.round(year_pred[0])
        predictions = pd.concat([predictions, pred])

    return predictions

In [7]:
train_x = train_full['Text']
train_y = train_full['Publication_year']

val_x = val_full['Text']
val_y = val_full['Publication_year']

Polynomial Regression with sklearn: https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression-extending-linear-models-with-basis-functions

In [9]:
# fit_intercept=False because PolynomialFeatures already adds a bias column (include_bias=True)
reg_1 = Pipeline([
    ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df=800)),
    ('feature_selector', SelectKBest(f_regression, k='all')),
    ('poly', PolynomialFeatures(degree=2)),
    ('linear', linear_model.LinearRegression(fit_intercept=False)),
])

In [11]:
reg_1.fit(train_x, train_y)

Pipeline(memory=None,
         steps=[('unigram_vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=800,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_word at 0x1da54dc20>,
                                 vocabulary=None)),
                ('feature_selector',
                 SelectKBest(k='all',
                             score_func=<function f_regression at 0x11fb51170>)),
                ('poly',
                 PolynomialFeatures(degree=2, include_bias=True,
   

In [12]:
y_pred_train = reg_1.predict(train_x)
mean_squared_error(train_y, y_pred_train)

158045.95707092748

In [13]:
y_pred_val = reg_1.predict(val_x)

mean_squared_error(val_y, y_pred_val)

21964810465.566597
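
Taking the square root turns each MSE into an error in years, which is easier to compare with the ten-year success criterion. A short sketch using the four MSE values reported in this notebook (the dictionary layout is just for illustration):

```python
import math

# MSE values copied from this notebook.
mse = {
    "linear baseline, train": 2259.8,
    "linear baseline, val": 3202.51,
    "quadratic, train": 158045.95707092748,
    "quadratic, val": 21964810465.566597,
}

# RMSE has the same unit as the target (years).
rmse = {name: math.sqrt(value) for name, value in mse.items()}
for name, value in rmse.items():
    print(f"{name}: RMSE = {value:,.1f} years")
```

The baseline errors come out at roughly 48 and 57 years, so even the linear model is far from the ten-year target, and the quadratic model is off by several hundred years on the training set alone.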

In [8]:
#reg_2 = Pipeline([ ('unigram_vectorizer', CountVectorizer(tokenizer=tokenizer_word, min_df = 800)),
                    #('feature_selector', SelectKBest(f_regression, k='all')),
                    #('poly', PolynomialFeatures(degree=5)),
                    #('linear', linear_model.LinearRegression(fit_intercept=False))
                        #])

In [1]:
#reg_2.fit(train_x, train_y)

NameError: name 'reg_2' is not defined

Expanding 215 features to degree 5 yields 4,102,565,544 polynomial features, which the kernel cannot handle. Since quadratic regression also performs worse than linear regression, the next step is to reduce the number of features.
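
The feature count follows from a binomial coefficient: a full polynomial expansion of n input features up to degree d, including the bias term, has C(n + d, d) columns. The sketch below reproduces the counts for 215 features and estimates how many input features would keep a degree-5 expansion within a budget. The budget of one million columns and the dense float64 memory estimate are assumptions chosen for illustration:

```python
from math import comb


def n_poly_features(n_features, degree):
    """Number of columns of a full polynomial expansion incl. bias term."""
    return comb(n_features + degree, degree)


N_TEXTS = 899       # size of the training set
N_FEATURES = 215    # features kept by min_df=800

for degree in (2, 5):
    cols = n_poly_features(N_FEATURES, degree)
    # Memory if the expanded matrix were stored densely as float64.
    dense_gb = N_TEXTS * cols * 8 / 1e9
    print(f"degree={degree}: {cols:,} columns, ~{dense_gb:,.1f} GB dense")

# Largest number of input features whose degree-5 expansion stays
# within a hypothetical budget of one million columns.
BUDGET = 1_000_000
max_k = max(k for k in range(1, N_FEATURES + 1)
            if n_poly_features(k, 5) <= BUDGET)
print("max input features for degree 5:", max_k)
```

A limit of this kind could be applied through the `k` argument of the existing `SelectKBest` step, which currently keeps all features (`k='all'`).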

In [None]:
# reg_2 was never fitted (see above), so these cells stay commented out.
#y_pred_train = reg_2.predict(train_x)
#mean_squared_error(train_y, y_pred_train)

In [None]:
#y_pred_val = reg_2.predict(val_x)

#mean_squared_error(val_y, y_pred_val)

In [14]:
features = reg_1['feature_selector'].get_support()
feature_names = reg_1['unigram_vectorizer'].get_feature_names()

features_selected = features_to_names(features, feature_names)

eli5.explain_weights(reg_1['linear'],vec=reg_1['unigram_vectorizer'], feature_names=features_selected)

ValueError: feature_names has a wrong length: expected=23436, got=215
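
The error arises because `reg_1['linear']` was fitted on the output of the `poly` step, which has 23,436 columns (all terms up to degree 2 of the 215 selected features), while only the 215 original names are passed. A sketch of how one name per polynomial column could be built with the standard library; the ordering (bias, then linear terms, then degree-2 terms via combinations with replacement) is assumed to match what `PolynomialFeatures` produces, and `poly_feature_names` is a hypothetical helper:

```python
from itertools import combinations_with_replacement


def poly_feature_names(names, degree=2):
    """Build one name per polynomial term, bias term first."""
    out = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(names, d):
            # The empty combination is the bias column.
            out.append(" * ".join(combo) if combo else "1")
    return out


# Toy vocabulary; in the notebook this would be features_selected.
toy = ["der", "die", "und"]
print(poly_feature_names(toy))

# With 215 placeholder names the length matches the expected 23436.
placeholder = [f"f{i}" for i in range(215)]
print(len(poly_feature_names(placeholder)))
```

`PolynomialFeatures` also offers its own method for output names (`get_feature_names` in older scikit-learn releases, `get_feature_names_out` in newer ones), which would be the safer choice because it does not rely on the assumed ordering.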

In [15]:
y_pred_train = pd.DataFrame(y_pred_train, columns=['Predicted_y'])

diff_pred_true_train = pd.concat([y_pred_train, train_y], axis=1)

diff_pred_true_train['Difference'] = diff_pred_true_train.Predicted_y - diff_pred_true_train.Publication_year
    

print(diff_pred_true_train.head(3))


   Predicted_y  Publication_year  Difference
0  1821.832534              1741   80.832534
1  1718.854585              1691   27.854585
2  1817.362107              1665  152.362107


In [16]:
y_pred_val = pd.DataFrame(y_pred_val, columns=['Predicted_y'])

diff_pred_true_val = pd.concat([y_pred_val, val_y], axis=1)

diff_pred_true_val['Difference'] = diff_pred_true_val.Predicted_y - diff_pred_true_val.Publication_year

print(diff_pred_true_val.head(3))


     Predicted_y  Publication_year     Difference
0   -9757.105495              1897  -11654.105495
1  146451.113027              1701  144750.113027
2  109784.712197              1663  108121.712197


In [17]:
diff_pred_true_val.to_csv('/Volumes/Korpora/Exp3_Reg1_Labels_val.csv',sep=';')
diff_pred_true_train.to_csv('/Volumes/Korpora/Exp3_Reg1_Labels_train.csv',sep=';')

In [18]:
val_details = collect_predictions(val_x, reg_1['linear'],reg_1['unigram_vectorizer'],features_selected, reg_1)
val_details.to_csv('/Volumes/Korpora/Exp3_Reg1_Val_results.csv', sep=';')

ValueError: feature_names has a wrong length: expected=23436, got=215

In [None]:
train_details = collect_predictions(train_x, reg_1['linear'],reg_1['unigram_vectorizer'],features_selected, reg_1)
train_details.to_csv('/Volumes/Korpora/Exp3_Reg1_train_results.csv', sep=';')