# Predicting article reading time based on text
This Jupyter notebook attempts to predict article reading time from article text data.
## Prerequisites
You will need the following:
* a virtual environment with all required packages installed, as described in requirements.txt
* a csv file with your article text data
* a csv file referencing the article text data, with reading time for every article


## Configuration and loading in the data

### Configurations

These are the necessary configurations:
* DATA_DIR: the folder in which your data csv files are stored
* DEFAULT_LANGUAGE: the default language of your articles
* TITLE_WEIGHT: the title is generally more important than the full text, so you can give it a higher weight here
* MAX_WORD_FEATURES: the number of words in the bag-of-words representation
* MAX_FILTERED_FEATURES: number of words after filtering
* LSA_FEATURES: the number of latent topics after LSA

### Data loading

Your article text csv file (article_content.csv in the example) should look like this:

| article_reference |  article_title   | article_text  |
|-------------------|------------------|---------------|
| article_00000001  | My first article | test test abc |

Your article reading time csv file (article_reading_time.csv in the example) should look like this:

| article_reference | avg_reading_time |
|-------------------|------------------|
| article_00000001  |        3.6       |

Here, the average reading time is given in seconds.
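
Before loading the real files, it can help to check that both csv files have the expected columns. The following is a minimal sketch that uses in-memory csv text shaped like the examples above. The helper `check_columns` is hypothetical and not part of the notebook.

```python
import io
import pandas as pd

# In-memory stand-ins for article_content.csv and article_reading_time.csv
content_csv = io.StringIO(
    "article_reference,article_title,article_text\n"
    "article_00000001,My first article,test test abc\n"
)
reading_time_csv = io.StringIO(
    "article_reference,avg_reading_time\n"
    "article_00000001,3.6\n"
)

def check_columns(df, required):
    """Hypothetical helper: return the required columns that are missing from df."""
    return [col for col in required if col not in df.columns]

content_df = pd.read_csv(content_csv)
reading_time_df = pd.read_csv(reading_time_csv)

missing_content = check_columns(
    content_df, ["article_reference", "article_title", "article_text"])
missing_time = check_columns(
    reading_time_df, ["article_reference", "avg_reading_time"])
print("missing in content csv:", missing_content)
print("missing in reading time csv:", missing_time)
```

An empty list for both files means the column names match what the rest of the notebook expects.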

In [None]:
########################################################################################################
# Configurations                                                                                       #
########################################################################################################
import os
import pandas as pd
DATA_DIR = os.path.join(".","data")
DEFAULT_LANGUAGE = 'english'
TITLE_WEIGHT = 3
MAX_WORD_FEATURES=20000
MAX_FILTERED_FEATURES=7500
LSA_FEATURES=500
########################################################################################################
# loading in the data                                                                                  #
########################################################################################################
article_content_df = pd.read_csv(os.path.join(DATA_DIR,"article_content.csv"))
article_reading_time_df = pd.read_csv(os.path.join(DATA_DIR,"article_reading_time.csv"))

Check that your loaded data looks like the examples above.

In [None]:
article_content_df.set_index("article_reference",inplace=True)
article_content_df.head()

In [None]:
article_reading_time_df.set_index("article_reference",inplace=True)
article_reading_time_df.head()

## Data preprocessing

### Stemming the words

To avoid treating variations of the same word as different words, we stem the words. Several libraries, such as the SnowballStemmer in NLTK, can do this. (If you are working with data in English, lemmatization might be preferred to stemming.) Afterwards, however, we want to recover the actual words that were used rather than their stems. We therefore provide a wrapper around the SnowballStemmer that keeps a memory of the words it stemmed. This causes some overhead if you are working with a lot of data.

In [None]:
import nltk
from collections import defaultdict
from statistics import mode

class ReversibleSnowBallStemmer(nltk.stem.SnowballStemmer):
    """ A wrapper around the snowball stemmer with a reverse lookup table """

    def __init__(self, *args, **kwargs):
        # super() without arguments stays correct if this class is subclassed
        super().__init__(*args, **kwargs)
        # maps each stem to the list of original words that produced it
        self._stem_memory = defaultdict(list)
        # switch stem and memstem
        self._stem = self.stem
        self.stem = self.memstem

    def memstem(self, word):
        """ Wrapper around stem that remembers """
        stemmed_word = self._stem(word)
        self._stem_memory[stemmed_word].append(word)
        return stemmed_word

    def unstem(self, stemmed_word):
        """ Reverse lookup """
        return mode(self._stem_memory[stemmed_word])
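
The reverse lookup idea can be illustrated without NLTK. The sketch below is not the notebook's class: `toy_stem` is a hypothetical stand-in for a real stemmer, and the memory stores counts in a `Counter` instead of a list of every occurrence, which is one possible way to reduce the overhead mentioned above.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# maps each stem to a count of the original words that produced it
stem_memory = defaultdict(Counter)

def memstem(word):
    """Stem a word and remember which original word produced the stem."""
    stemmed = toy_stem(word)
    stem_memory[stemmed][word] += 1
    return stemmed

def unstem(stemmed):
    """Return the original word seen most often for this stem."""
    return stem_memory[stemmed].most_common(1)[0][0]

stems = [memstem(w) for w in ["reading", "reads", "reading", "read"]]
print(stems)
print(unstem("read"))
```

Like the `mode` based lookup in the class above, `unstem` fails for a stem that was never seen, so it should only be called with stems produced by `memstem`.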

## Creating the full text

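We now create the full text by repeating the title TITLE_WEIGHT times and appending the rest of the text. Repeating the title increases the counts of its words in the bag-of-words representation built later.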

In [None]:
article_content_df["fulltext"] = TITLE_WEIGHT * (article_content_df["article_title"] + " ") \
+ article_content_df["article_text"]
#check if the first article looks ok
article_content_df["fulltext"].iloc[0]
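
With the expression above, an article whose title or text is missing (NaN) is expected to end up with a NaN full text. The following is a minimal sketch of a variant that fills missing values first. It uses toy data and `str.repeat` for the title weighting, and it is an assumption about how one might handle such rows rather than the notebook's own method.

```python
import pandas as pd

TITLE_WEIGHT = 3  # same meaning as in the configuration cell

# Toy data: the second article has no title
toy_df = pd.DataFrame({
    "article_title": ["My first article", None],
    "article_text": ["test test abc", "only text"],
})

title = toy_df["article_title"].fillna("")
body = toy_df["article_text"].fillna("")
# str.repeat repeats every string TITLE_WEIGHT times
fulltext = ((title + " ").str.repeat(TITLE_WEIGHT) + body).str.strip()
print(fulltext.tolist())
```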

## Splitting into words and deleting stopwords

We still need to split the text into words, delete stopwords and apply our reversible stemmer.

In [None]:
from nltk.corpus import stopwords
import re

stemmer = ReversibleSnowBallStemmer(DEFAULT_LANGUAGE)
# build the stop word set once instead of on every call
# (requires the NLTK stopwords corpus: nltk.download("stopwords"))
stops = set(stopwords.words(DEFAULT_LANGUAGE))

def text_to_words(text):
    # the text can be NaN (a float) instead of a string, e.g. when a csv field is empty
    if not isinstance(text, str):
        print(f"Skipping non-text value: {text}")
        text = ""
    # remove punctuation and numbers
    text = re.sub("[^a-zA-Z]", " ", text)
    # split into words
    words = text.lower().split()
    # remove stop words
    filtered = [w for w in words if w not in stops]
    # stem
    stemmed = [stemmer.stem(w) for w in filtered]
    return " ".join(stemmed)

article_content_df["preprocessed"] = article_content_df["fulltext"].apply(text_to_words)
article_content_df["preprocessed"].head()

In [None]:
article_content_and_reading_time_df = article_content_df.merge(
    article_reading_time_df,
    left_index=True,
    right_index=True
)
article_content_and_reading_time_df.head()
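
The merge above uses the default inner join of pandas, so articles that appear in only one of the two files are dropped without a warning. The following sketch, on toy data with made-up references, shows one way to list those articles by comparing the two indexes.

```python
import pandas as pd

# Toy frames indexed by article_reference, like the notebook's data frames
content = pd.DataFrame(
    {"preprocessed": ["first articl", "second articl", "third articl"]},
    index=pd.Index(["article_1", "article_2", "article_3"],
                   name="article_reference"),
)
reading_time = pd.DataFrame(
    {"avg_reading_time": [3.6, 12.0]},
    index=pd.Index(["article_1", "article_4"], name="article_reference"),
)

# Inner join on the index: only article_1 is present in both frames
merged = content.merge(reading_time, left_index=True, right_index=True)

# References that the inner join drops
without_time = content.index.difference(reading_time.index)
without_text = reading_time.index.difference(content.index)
print(len(merged), list(without_time), list(without_text))
```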

## Categorizing the data and splitting the data into train and test data

### Categorizing the data

We want to build a classification model rather than a regression model, so we turn the average reading time into discrete categories. This implementation simply uses the quartiles of the reading time: the lowest 25% are classified as `LOW`, the highest 25% as `HIGH`, and the two quartiles in between as `LOWER THAN AVERAGE` and `HIGHER THAN AVERAGE` respectively.
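
The same quartile binning can be sketched more compactly with `pd.cut`. This is an alternative illustration on toy reading times, not the code the notebook uses. Passing `right=False` makes every bin include its lower edge and exclude its upper edge, which matches the `>=` and `<` comparisons used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy reading times in seconds
reading_time = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
q25, q50, q75 = reading_time.quantile([0.25, 0.5, 0.75])

categories = pd.cut(
    reading_time,
    bins=[-np.inf, q25, q50, q75, np.inf],
    labels=["LOW", "LOWER THAN AVERAGE", "HIGHER THAN AVERAGE", "HIGH"],
    right=False,  # bins are [lower, upper), like the >= and < comparisons
)
print(categories.tolist())
```

Note that `pd.cut` raises an error when two quartiles are equal, which can happen if many articles share the same reading time.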

### Splitting the data into train and test data

We want to be able to test whether the model is actually learning, so we need separate batches of train data and test data. We use 75% of the data for training and 25% for testing, which is the default split of `train_test_split`.

In [None]:
from sklearn.model_selection import train_test_split

quantile_25 =  article_content_and_reading_time_df["avg_reading_time"].quantile(0.25)
quantile_50 =  article_content_and_reading_time_df["avg_reading_time"].quantile(0.5)
quantile_75 =  article_content_and_reading_time_df["avg_reading_time"].quantile(0.75)

article_content_and_reading_time_df["avg_reading_time_category"] = "NONE"
article_content_and_reading_time_df.loc[
    article_content_and_reading_time_df["avg_reading_time"] < quantile_25,
    "avg_reading_time_category"
] = "LOW"
article_content_and_reading_time_df.loc[
    (
    (article_content_and_reading_time_df["avg_reading_time"] >= quantile_25)
    &
    (article_content_and_reading_time_df["avg_reading_time"] < quantile_50)
    ),
    "avg_reading_time_category"
] = "LOWER THAN AVERAGE"
article_content_and_reading_time_df.loc[
    (article_content_and_reading_time_df["avg_reading_time"] >= quantile_50)
    &
    (article_content_and_reading_time_df["avg_reading_time"] < quantile_75),
    "avg_reading_time_category"
] = "HIGHER THAN AVERAGE"
article_content_and_reading_time_df.loc[
    article_content_and_reading_time_df["avg_reading_time"] >= quantile_75,
    "avg_reading_time_category"
] = "HIGH"
X_train, X_test, Y_train, Y_test = train_test_split(
    article_content_and_reading_time_df["preprocessed"],
    article_content_and_reading_time_df["avg_reading_time_category"]
)
article_content_and_reading_time_df.groupby("avg_reading_time_category").count()
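
The split above is random, so results differ from run to run. The following is a hedged sketch of a variant, assuming you want a reproducible split that keeps the category proportions equal in both sets. The toy data and the value of `random_state` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 16 articles, 4 per category
toy_df = pd.DataFrame({
    "preprocessed": [f"text {i}" for i in range(16)],
    "avg_reading_time_category":
        ["LOW", "LOWER THAN AVERAGE", "HIGHER THAN AVERAGE", "HIGH"] * 4,
})

X_train, X_test, Y_train, Y_test = train_test_split(
    toy_df["preprocessed"],
    toy_df["avg_reading_time_category"],
    test_size=0.25,                                # 25% test data, as in the notebook
    stratify=toy_df["avg_reading_time_category"],  # keep category proportions
    random_state=42,                               # arbitrary seed for reproducibility
)
print(Y_test.value_counts())
```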

## Applying the model

Applying the model consists of the following steps:

* convert the preprocessed text to a bag-of-words representation (`countVectorizer`)
* filter the bag of words for the most relevant terms (`ffilter`)
* apply term frequency-inverse document frequency as a word relevancy metric (`tfidfTransformer`)
* build a latent topic model by applying the LSA algorithm (`svd_algorithm`)
* normalize the features (`normalizer`)
* apply the support vector machine (`clf`)

These steps are run in two phases:

* fitting the model on the train data
* applying the model to the test data

You can see them in the next 2 cells.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn import svm

countVectorizer = CountVectorizer(analyzer="word", tokenizer=None,
                                               preprocessor=None,
                                               stop_words=None,
                                               max_features=MAX_WORD_FEATURES)
tfidfTransformer = TfidfTransformer()
ffilter = SelectKBest(chi2, k=MAX_FILTERED_FEATURES)
svd_algorithm = TruncatedSVD(LSA_FEATURES)
normalizer = Normalizer()
clf = OneVsRestClassifier(svm.LinearSVC(class_weight='balanced'))

bow = countVectorizer.fit_transform(X_train, Y_train)
bow_filtered = ffilter.fit_transform(bow, Y_train.values)
tfidf = tfidfTransformer.fit_transform(bow_filtered, Y_train.values)
svd = svd_algorithm.fit_transform(tfidf, Y_train.values)
features = normalizer.fit_transform(svd, Y_train.values)
clf.fit(features, Y_train.values.ravel())

In [None]:
bow = countVectorizer.transform(X_test)
bow_filtered = ffilter.transform(bow)
tfidf = tfidfTransformer.transform(bow_filtered)
svd = svd_algorithm.transform(tfidf)
features = normalizer.transform(svd)
predicted = clf.predict(features)
predicted_df = pd.DataFrame(data={'PREDICTED_CATEGORY': predicted},
                            index=X_test.index)
predicted_df.head()
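
The fit and transform calls in the two cells above can also be chained with a scikit-learn `Pipeline`, which applies the same sequence of steps through a single `fit` and a single `predict`. The following is a minimal sketch on a toy corpus. The step names, the texts and the small feature sizes are assumptions chosen only to keep the example runnable, and they stand in for `MAX_WORD_FEATURES`, `MAX_FILTERED_FEATURES` and `LSA_FEATURES`.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy corpus with two categories, for illustration only
texts = [
    "short quick news", "quick short update",
    "short news flash", "quick news brief",
    "long deep analysis report", "deep long investigation",
    "long report analysis", "deep analysis essay",
]
labels = ["LOW"] * 4 + ["HIGH"] * 4

pipeline = Pipeline([
    ("bow", CountVectorizer(max_features=50)),              # bag of words
    ("filter", SelectKBest(chi2, k=6)),                     # keep most relevant terms
    ("tfidf", TfidfTransformer()),                          # tf-idf weighting
    ("lsa", TruncatedSVD(n_components=2, random_state=0)),  # latent topics
    ("normalize", Normalizer()),                            # unit-length features
    ("clf", OneVsRestClassifier(LinearSVC(class_weight="balanced"))),
])

pipeline.fit(texts, labels)          # fits every step on the train data
predicted = pipeline.predict(texts)  # only transforms, then predicts
print(predicted)
```

The individual steps stay accessible through `pipeline.named_steps`, so the backtracking done later in this notebook would still be possible with this layout.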

## Test the model on test data

Before checking what the model learned, we need to test whether the model actually learned something. We check 2 things in the following cells:

* We check the precision, recall, F1-score and support for every category. Precision is the fraction of the predictions for a category that are correct, recall is the fraction of the articles in a category that the model finds, the F1-score is the harmonic mean of the two, and support is the number of test articles in the category.
* We check the confusion matrix. The darker the diagonal from the top left to the bottom right, the better the model. The matrix shows how many times the true label matches the label predicted by the model and how many times the model makes each kind of mistake (e.g. how many times does it predict a low average reading time when the true average reading time is high?). Some mistakes are worse than others. In this case, confusing `LOWER THAN AVERAGE` with `HIGHER THAN AVERAGE` would not be that bad, while mistaking `HIGH` for `LOW` would be a lot worse. We can see how many times each kind of mistake is made, and think about what that means for the model.
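
The two checks are closely related: the per-category metrics can be derived directly from the confusion matrix. The following sketch works this out with NumPy on a small matrix with made-up counts, where rows are true labels and columns are predicted labels.

```python
import numpy as np

labels = ["HIGH", "LOW"]
# Made-up counts: rows are true labels, columns are predicted labels
cm = np.array([[3, 1],
               [2, 4]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted per category
recall = true_positives / cm.sum(axis=1)     # correct / all true per category
fscore = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)                     # number of true articles per category

for name, p, r, f, s in zip(labels, precision, recall, fscore, support):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```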

In [None]:
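from sklearn.metrics import precision_recall_fscore_support as score

# the order in which the categories are reported (also used for the confusion matrix)
label_order = ["HIGH","HIGHER THAN AVERAGE","LOWER THAN AVERAGE","LOW"]
precision, recall, fscore, support = score(Y_test,
                                           predicted_df["PREDICTED_CATEGORY"],
                                           labels=label_order)
print("Categories: {}".format(label_order))
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))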

In [None]:
from sklearn.metrics import confusion_matrix 
from matplotlib import pyplot as plt
import numpy as np
import itertools

def create_confusion_matrix(true_labels, predicted_labels, labels):
    cm = confusion_matrix(true_labels, predicted_labels, labels=labels)
    cm_df = pd.DataFrame(data=cm, index=labels, columns=labels)
    cm_df.index.name = "True label"
    cm_df.columns.name = "Predicted label"
    return cm_df

def plot_confusion_matrix(cm_df, normalize=False, title='Confusion matrix'):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
    """
    if normalize:
        cm_df = cm_df.div(cm_df.sum(axis=1), axis=0)
    cmap = plt.cm.Blues
    plt.figure(figsize=(12, 8))
    plt.imshow(cm_df, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(cm_df.columns))
    plt.xticks(tick_marks, cm_df.columns, rotation=90)
    plt.yticks(tick_marks, cm_df.index)

    thresh = cm_df.max().max() / 2.
    for true, pred in itertools.product(range(cm_df.shape[0]),
                                        range(cm_df.shape[1])):
        value = np.round(cm_df.iat[true, pred], 2)
        plt.text(pred, true, value,
                 horizontalalignment="center",
                 color="white" if (value > thresh) else "black")

    plt.tight_layout()
    plt.ylabel(cm_df.index.name)
    plt.xlabel(cm_df.columns.name)
    plt.show()

label_order = ["HIGH","HIGHER THAN AVERAGE","LOWER THAN AVERAGE","LOW"]
# pass the predictions as a 1-d column instead of a whole data frame
cm = create_confusion_matrix(Y_test, predicted_df["PREDICTED_CATEGORY"],
                             label_order)
plot_confusion_matrix(cm, normalize=True, title="Normalized confusion matrix")
plot_confusion_matrix(cm, normalize=False, title="Confusion matrix")

## What did the model learn?

Now we can investigate what the model learned. Which words are the most important for predicting a `HIGH` average reading time, and which are the most important for predicting a `LOW` reading time? To find out, we have to trace our way back through the whole model. First, we take the weights that the support vector machine assigns to each latent topic for these categories. Then we trace those weights back through the topic model: for every word, we multiply the weight of each topic in the support vector machine by the weight of the word in that topic. Summing those products over all topics gives the importance of the word.

After that, we create a wordcloud from the result, so we can visualize the importance of the words.
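
The sum of products described above is a matrix multiplication of the classifier weights with the topic-to-word matrix of the LSA step, which is what `TruncatedSVD.inverse_transform` computes. The following NumPy sketch shows this on made-up numbers. The terms, topic weights and component values are all hypothetical.

```python
import numpy as np

terms = np.array(["long", "short", "quick"])  # hypothetical filtered vocabulary

# One row per category: the weight the classifier gives to each latent topic
topic_weights = np.array([[1.0, -0.5],    # e.g. HIGH
                          [-1.0, 0.5]])   # e.g. LOW

# One row per latent topic: the weight of each word in that topic
components = np.array([[0.8, 0.1, 0.1],
                       [0.0, 0.6, 0.4]])

# importance[c, w] = sum over topics t of topic_weights[c, t] * components[t, w]
importance = topic_weights @ components

# indexes of the words per category, most important first
order = importance.argsort()[:, ::-1]
print(terms[order[0]])  # ranking for the first category
print(terms[order[1]])  # ranking for the second category
```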

In [None]:
def get_feature_names():
    # get_feature_names_out() requires scikit-learn >= 1.0;
    # older versions use get_feature_names() instead
    return np.asarray(countVectorizer.get_feature_names_out())[
        ffilter.get_support()]

def get_top_terms(n=250):
    labels = clf.classes_
    # get the weights of the features from the support vector machine
    feature_weights = [est.coef_[0] for est in clf.estimators_]
    feature_weights = np.array(feature_weights)
    # get all words in the bow representation after filtering
    terms = get_feature_names()
    # get the importance of all words (sum of multiplications as specified above)
    # from the model weights and the latent topic model weights
    original_space_features = svd_algorithm.inverse_transform(
        feature_weights)
    # get the indexes of the words, ranked by importance of the word
    order_features = original_space_features.argsort()[:, ::-1]
    # for every predicted label
    for i in range(len(labels)):
        text = ""
        for ind in order_features[i,:n]:
            # repeat each word in proportion to its importance for the wordcloud representation later
            for _ in range(int(round(original_space_features[i,ind] * 25))):
                # unstem the words
                text += stemmer.unstem(terms[ind])
                text += " "
        yield (labels[i],text)

In [None]:
from wordcloud import WordCloud
from PIL import Image
mask = np.array(Image.open(os.path.join(DATA_DIR,"blue-cloud-hi.png")))
for label, text in get_top_terms():
    if label in ("HIGH","LOW"):
        print(label)
        wc = WordCloud(width=1000,height=500,background_color='white',collocations=False,max_words=250, mask=mask, margin=10,
           random_state=1).generate(text)
        # store default colored image
        default_colors = wc.to_array()
        # save each wordcloud under its own name (high.png, low.png)
        wc.to_file(f"{label.lower()}.png")
        plt.axis("off")
        plt.imshow(default_colors, interpolation="bilinear")
        plt.show()

## Conclusion

That should be it. Your wordclouds should appear in the images above. If some words in them seem weird or hard to understand, you can investigate in which articles these words were actually used.

If you have problems reproducing the work in this notebook, don't hesitate to contact me at engineering@twipemobile.com or open an issue in the Git repository.