# BLU09 - Information Extraction

In [None]:
import os
import re
import spacy
import hashlib
import numpy as np
import pandas as pd
import json

from tqdm import tqdm
from collections import Counter
from spacy.matcher import Matcher
from sklearn.metrics import accuracy_score
from nltk.tokenize import WordPunctTokenizer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer

import utils

# Fall back to 4 workers when the CPU count cannot be determined.
cpu_count = os.cpu_count() or 4

In this exercise notebook you are going to tackle a very real problem: **Detecting fake news!** You'll create a classification workflow to determine if a piece of news is considered 'reliable' or 'unreliable'. You will start by building some basic features, then extract information from the text, go on to build more features, and finally put it all together.

The data set we will be using is the [Fake News data set](https://www.kaggle.com/c/fake-news/overview) from Kaggle. Each piece of news is labeled either '0' (reliable, i.e. trustworthy) or '1' (unreliable and possibly fake). First, let's load the data and see what we are dealing with.

In [None]:
data_path = "data/fakenews/train.csv"
df = pd.read_csv(data_path, index_col=0)
df["title"] = df["title"].astype(str)
df["text"] = df["text"].astype(str)

df = df[:5000]

df.head()

We have 4 columns that are pretty self-explanatory. Let's drop the author column, since we only want to practice our text analysis, and drop the title as well for simplicity's sake.

In [None]:
df = df.drop(columns=["author", "title"])

Let's also load a SpaCy model together with its stopword list, and add the [merge_entities](https://spacy.io/api/pipeline-functions#merge_entities) component (which will come in handy later). We insert this component into the SpaCy pipeline right after the NER component.

In [None]:
nlp = spacy.load('en_core_web_md')
nlp.add_pipe("merge_entities", after="ner")
en_stopwords = nlp.Defaults.stop_words

Here we process the news data with SpaCy to use later on. This might take a while depending on your hardware (a break to walk the dog? 🐶).

In [None]:
# Leave one core free, but always use at least one process.
docs = list(tqdm(nlp.pipe(df["text"], batch_size=20, n_process=max(1, cpu_count - 1)), total=len(df["text"])))
docs[:3]

Overall, the text looks good! Not too many errors, well written... as expected from a news article. Fake news is a tough, fairly recent problem that shows up more and more frequently in the wild. Unlike spam, it usually doesn't contain many orthographic mistakes or much slang, since it comes from sources that want to appear credible, yet clickbaity enough to profit from that good ad revenue and create distrust.

## Exercise 1 - Pipeline

Let's create a baseline classification workflow. We'll use the TfidfVectorizer to get a simple, fast and trustworthy baseline.

Create a function that applies a pipeline to the given train data, makes a prediction for the test data, and returns the accuracy of the prediction. The pipeline should consist of a `TfidfVectorizer` and a `RandomForestClassifier`.
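
To get a feel for the mechanics without giving the exercise away, here is a minimal sketch of a text-classification pipeline on made-up data. The toy sentences and labels are invented, and `LogisticRegression` is deliberately swapped in for the random forest, so treat it as an illustration of the `Pipeline` API rather than the expected solution.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Invented toy data, only here to show the fit/predict flow.
train_texts = ["cheap pills buy now", "meeting moved to friday",
               "win cash now", "lunch at noon friday"]
train_labels = [1, 0, 1, 0]
test_texts = ["buy cheap cash", "friday meeting at noon"]
test_labels = [1, 0]

# Each step is a (name, estimator) tuple; the last step is the classifier.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])

# fit() runs fit_transform on the vectorizer, then fits the classifier.
pipe.fit(train_texts, train_labels)

# predict() reuses the fitted vocabulary on unseen text.
acc = accuracy_score(test_labels, pipe.predict(test_texts))
print(f"Toy accuracy: {acc}")
```

Because the vectorizer lives inside the pipeline, it is fitted on the train split only, which keeps test data from leaking into the vocabulary.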

In [None]:
def tfidf_rf_pipeline(X_train, X_test, y_train, y_test, seed=42):
    """
    Trains a TfidfVectorizer + RandomForestClassifier pipeline on the given train data.
    Makes a prediction on the test data.
    Returns the trained pipeline and the accuracy of the prediction.

    Parameters:
        X_train, y_train: train data, pd.Series
        X_test, y_test: test data, pd.Series
        seed (int): random state seed for the classifier
    
    Returns:
        pipe: fitted pipeline
        acc (float): accuracy of the prediction
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return pipe, acc

For the baseline, we will preprocess the text - remove punctuation and stopwords and tokenize it - then run it through the pipeline:

In [None]:
df_processed = df.copy()
df_processed["text"] = df_processed["text"].apply(utils.remove_punctuation)
df_processed["text"] = df_processed["text"].apply(utils.remove_stopwords, stopwords = en_stopwords, 
                                                  tokenizer = WordPunctTokenizer())

X_train, X_test, y_train, y_test = train_test_split(df_processed["text"], df_processed["label"], 
                                                    test_size=0.2, random_state=42, stratify=df_processed["label"])
baseline_model, baseline_acc = tfidf_rf_pipeline(X_train, X_test, y_train, y_test)

assert isinstance(baseline_model, Pipeline)
assert hashlib.sha256(json.dumps(str(baseline_model[0])).encode()).hexdigest() == \
'e68c8e581c16f0d62f3b9cb33a7967b17890e18c1fe819d013181e6714e7a303', "The pipeline parameters are not correct."
assert hashlib.sha256(json.dumps(str(baseline_model[1])).encode()).hexdigest() == \
'36a4f3295ffa4c170fc0addee2a8cac5613970f06e3dde6956fa31daf19aa329', "The pipeline parameters are not correct."
np.testing.assert_almost_equal(baseline_acc, 0.908, decimal=2, err_msg="The accuracy is not correct.")
print(f'Baseline accuracy: {baseline_acc}')

Wow, the accuracy is quite good for such a simple text model! This shows how far a trustworthy baseline can take you. I can't stress enough how important it is to have a simple first iteration; afterwards we can add complexity and study which features make sense and which don't.

Sometimes, data scientists jump right off the bat to the most complex solutions when a simple one would be enough. Real-life problems will usually yield lower scores, as the data sets are not controlled or cleaned for you, but that should not stop you from starting with a simpler and easier solution.

Now let's see if we can engineer more features. We will extract information with SpaCy and see if we can use it to train the model.

## Exercise 2 - SpaCy Matcher

Let's see if we can extract some useful features with the SpaCy Matcher.

### Exercise 2.1 - Simple matcher

You think of some words that could be related to the detection of fake news. Something starts ringing in your mind about "propaganda", "USA" and "fraud", so you decide to use the SpaCy Matcher to check how many times those words appear in the news articles.

Use the `docs` list preprocessed by SpaCy and count the number of occurrences of these words in all documents. Make sure to match the words regardless of case. The output should be the total number of occurrences across all news articles.
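
As a warm-up, the sketch below shows case-insensitive token matching on two invented sentences. It uses a blank English pipeline (tokenizer only, so no model download is needed) and different keywords than the exercise; both choices are assumptions made for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, which is all the LOWER attribute needs.
toy_nlp = spacy.blank("en")
toy_docs = [
    toy_nlp("The Election was no hoax."),
    toy_nlp("HOAX or not, the election is over."),
]

matcher = Matcher(toy_nlp.vocab)
# One single-token pattern: the lowercased token text must be in the list.
matcher.add("KEYWORDS", [[{"LOWER": {"IN": ["hoax", "election"]}}]])

# matcher(doc) returns a list of (match_id, start, end) tuples.
toy_count = sum(len(matcher(doc)) for doc in toy_docs)
print(toy_count)  # 4
```

Matching on `LOWER` instead of `TEXT` is what makes "Election", "election" and "HOAX" all count.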

In [None]:
words = ["propaganda", "usa", "fraud"]

# YOUR CODE HERE
raise NotImplementedError()

# count = ...

In [None]:
assert hashlib.sha256(json.dumps(str(count)).encode()).hexdigest() == \
'9d44059c29e077b9fd8496ebcc41c94aeb203bf1adce7729d3ecda30bc885a90', 'Not correct, try again.'
print(f'Count: {count}')

### Exercise 2.2 - POS-tagging search

Ok, this doesn't look like the way to go, so let's look at other theories. You start thinking that fake news might overuse adjectives and adverbs in over-the-top descriptions. So you decide to create a feature that counts the number of _Adjectives_ and _Adverbs_ in a news article. The count should be normalized by the token count of the article.

The result should be a list with one value per document: the number of adjectives and adverbs divided by the token count of that document.

In [None]:
# nb_adj_adv = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(nb_adj_adv,list), "The result should be a list."
assert len(nb_adj_adv) == len(docs), "The length of the result list is wrong. You should have a count for every news article."
np.testing.assert_almost_equal(np.var(nb_adj_adv), 0.00105, decimal=4, err_msg='The result is not correct.')
np.testing.assert_almost_equal(np.sum(nb_adj_adv), 462.5, decimal=1, err_msg='The result is not correct.')

Let's add this feature to our dataframe:

In [None]:
df_processed["nb_adj_adv"] = nb_adj_adv

### Exercise 2.3 - Adjectivized proper nouns

Another theory that might be worth testing is that adjectives combined with proper nouns are often used in this kind of news to induce sentiments towards people or organizations. So you decide to extract proper nouns preceded by adjectives, maybe for use in a later analysis.

Create a `Matcher` to search for adjective + proper noun combinations. Count the number of occurrences of each combination. Store the 10 most common combinations and their numbers of occurrences as tuples in a list, sorted in descending order by the number of occurrences.
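
The counting and sorting part can be handled by `collections.Counter`. The sketch below uses an invented list of matched strings (not real output from the data set) and also shows how such a list looks once serialized with `json.dumps`, which is what the hash checks in this notebook operate on.

```python
import json
from collections import Counter

# Invented matched spans, standing in for the text of real matcher hits.
matched_texts = ["breaking news", "new study", "breaking news",
                 "secret plan", "new study", "breaking news"]

counts = Counter(matched_texts)

# most_common(n) returns (item, count) tuples sorted by descending count.
top_two = counts.most_common(2)
print(top_two)  # [('breaking news', 3), ('new study', 2)]

# JSON has no tuple type, so tuples are written out as arrays.
serialized = json.dumps(top_two)
print(serialized)  # [["breaking news", 3], ["new study", 2]]
```

Since the check hashes the serialized string, the exact spelling, casing and order of the entries all matter.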

In [None]:
# most_common_adj_propn = []

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(most_common_adj_propn,list), "The output should be a list."
assert len(most_common_adj_propn) == 10, "It should be the top 10!"
assert isinstance(most_common_adj_propn[0],tuple), 'The elements of the list should be tuples of (combination, occurrences).'
assert hashlib.sha256(json.dumps(most_common_adj_propn).encode()).hexdigest() == \
'0b12899bfedce520180f460bfd6742c1241ac7270ee98d4dcb482284e134cde8', 'The top ten list is not correct.'

Let's look at the 10 most common combinations:

In [None]:
most_common_adj_propn

The counts are too low to use these terms as features. Maybe running a vectorizer on all the results could work better.

### Exercise 2.4 - Objects of preposition
The objects in the sentences could indicate something. For instance, 'NGO financed by Soros' is more likely to appear in fake news than 'NGO financed by UNESCO'. Both objects in these sentences are objects of preposition (hint: SpaCy has a dependency label for this).

Create a `Matcher` to search for objects of preposition which are nouns. Again, count the number of occurrences of each. Store the 10 most common objects and their numbers of occurrences as tuples in a list, sorted in descending order by the number of occurrences.

In [None]:
# most_common_pobj = []

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(most_common_pobj,list), "The output should be a list."
assert len(most_common_pobj) == 10, "It should be the top 10!"
assert isinstance(most_common_pobj[0],tuple), 'The elements of the list should be tuples of (combination, occurrences).'
assert  hashlib.sha256(json.dumps(most_common_pobj).encode()).hexdigest() == \
'8b947c095d53dc4ccc4f28d1a448e60ef2f0c509eeac370cb0448ba9418d25c3', 'The top ten list is not correct.'

This time the counts are higher and might be more interesting for a feature.

In [None]:
most_common_pobj

### Exercise 2.5 - Verbs with direct objects
As a last step, you decide to look at verbs with direct objects. These should indicate actions taken towards something or someone. This exercise can be solved without a Matcher.

Search for verbs with direct objects which are not pronouns. This time it's a bit trickier - you need to look at the [parse tree](https://spacy.io/usage/linguistic-features#navigating) because the object does not necessarily come right after the verb. Lemmatize both the verb and the object and count the occurrences of the lemmatized verb and direct object separated by a space, like this: 'verb_lemma dobj_lemma'. Don't forget to exclude objects that are pronouns.

Again, output the 10 most common combinations and their occurrences as tuples in a list, sorted in descending order by the number of occurrences.
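
To illustrate parse-tree navigation without solving the exercise, the sketch below walks a different relation: nominal subjects (`nsubj`) and their governing verbs. The two tiny documents are built by hand with invented annotations (no model involved), so the tags, lemmas and heads are assumptions rather than parser output.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-annotated toy documents; heads are absolute token indices.
doc_a = Doc(
    vocab,
    words=["Reporters", "quickly", "verified", "the", "claims"],
    pos=["NOUN", "ADV", "VERB", "DET", "NOUN"],
    lemmas=["reporter", "quickly", "verify", "the", "claim"],
    heads=[2, 2, 2, 4, 2],
    deps=["nsubj", "advmod", "ROOT", "det", "dobj"],
)
doc_b = Doc(
    vocab,
    words=["She", "verified", "it"],
    pos=["PRON", "VERB", "PRON"],
    lemmas=["she", "verify", "it"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
)

subject_verb = Counter(
    f"{token.lemma_} {token.head.lemma_}"
    for doc in (doc_a, doc_b)
    for token in doc
    # token.head is the governing word, wherever it sits in the sentence.
    if token.dep_ == "nsubj" and token.head.pos_ == "VERB" and token.pos_ != "PRON"
)
print(subject_verb.most_common(10))  # [('reporter verify', 1)]
```

The pronoun subject "She" is filtered out, and "Reporters" is paired with "verified" even though an adverb sits between them, which is the reason for following `token.head` instead of looking at the neighboring token.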

In [None]:
# most_common_dobj = []

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(most_common_dobj,list), "The output should be a list."
assert len(most_common_dobj) == 10, "It should be the top 10!"
assert isinstance(most_common_dobj[0],tuple), 'The elements of the list should be tuples of (combination, occurrences).'
assert hashlib.sha256(json.dumps(most_common_dobj).encode()).hexdigest() == \
'4aa91fbf85175e56f4a132fb253c40869c6f839fc2cfd4cee821ac1422217f29' or \
hashlib.sha256(json.dumps(most_common_dobj).encode()).hexdigest() == \
'7135e8ae4d25d6cf68db7f5083830236e38dae2843b24b6435342d7ded486e45', 'The top ten list is not correct.'

Not so many occurrences, but again the whole list could be used in a vectorizer:

In [None]:
most_common_dobj

## Exercise 3 - Feature unions

We're going to create a few more numerical features here, then use them in a feature union pipeline and see if the baseline improves.

### Exercise 3.1 - More features

There are a few more simple features that we can extract from the data set to try to enrich our model. Let's add the following features to the `df_processed` dataframe:
- number of words in the news article
- character length of the news article
- average word length
- average sentence length.

Use the SpaCy processed `Doc`s for calculating the average sentence length (note that you will obtain sentence length in tokens).

Use the tokenized text in `df_processed` for everything else. Punctuation and stopwords were already removed from this text.
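
Here is a rough sketch of these statistics for a single invented article. It assumes the cleaned text is a string of tokens separated by single spaces and that character length includes those spaces; both are assumptions, and the asserts further down are the real reference. The sentences are plain token lists standing in for the sentences of a parsed `Doc`.

```python
# Hypothetical cleaned article: tokens joined by single spaces.
text = "senate passes budget after long debate"

words = text.split()
nb_words = len(words)
doc_length = len(text)  # counts the separating spaces too (an assumption)
avg_word_length = sum(len(word) for word in words) / nb_words

# Stand-in for the sentences of a parsed document: one token list each.
sentences = [
    ["The", "vote", "passed", "."],
    ["Nobody", "objected", "loudly", "today", "again", "."],
]
avg_sentence_length = sum(len(sentence) for sentence in sentences) / len(sentences)

print(nb_words, doc_length, avg_word_length, avg_sentence_length)  # 6 38 5.5 5.0
```

On the full dataframe the same per-article logic can be applied with `Series.apply`, and articles that end up with zero words after cleaning need a guard against division by zero.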

In [None]:
# df_processed["nb_words"] = ...
# df_processed["doc_length"] = ...
# df_processed["avg_word_length"] = ...
# df_processed["avg_sentence_length"] = ...

# YOUR CODE HERE
raise NotImplementedError()

df_processed.head()

In [None]:
assert df_processed.shape == (5000, 7), "Something wrong about the shape, do you have all columns/rows?"
assert "nb_words" in df_processed, "Missing column! Maybe wrong name?"
assert "doc_length" in df_processed, "Missing column! Maybe wrong name?"
assert "avg_word_length" in df_processed, "Missing column! Maybe wrong name?"
assert "avg_sentence_length" in df_processed, "Missing column! Maybe wrong name?"

assert np.sum(df_processed["nb_words"]) == 1963935, "Something is wrong with the nb_words column."
assert np.sum(df_processed["doc_length"]) == 14636737, "Something is wrong with the doc_length column."
np.testing.assert_almost_equal(np.sum(df_processed["avg_word_length"]), 32100.0, decimal=1, 
                               err_msg='Something is wrong with the avg_word_length column.')
np.testing.assert_almost_equal(np.sum(df_processed["avg_sentence_length"]), 118628.9, 
                               decimal=1, err_msg='Something is wrong with the avg_sentence_length column.')

### Exercise 3.2 - Define a feature union for preprocessing

Let's create a processing pipeline for every feature in `df_processed` and join them all in a feature union. Each pipeline starts with the matching selector. After the selector, the pipeline for the textual feature should have one step, a `TfidfVectorizer` with default parameters, and the pipeline for each numerical feature should have one step, a `StandardScaler`. Afterwards, join the features' pipelines in a feature union.

Use the `Selector` classes in the cell below.
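
The sketch below shows the general shape of a feature union on an invented two-column dataframe. To keep the exercise open it uses a hypothetical `ColumnPicker` class and swaps in `CountVectorizer` and `MinMaxScaler`; none of these names are part of the expected solution.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler


class ColumnPicker(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the notebook's selector classes."""

    def __init__(self, key, as_frame=False):
        self.key = key
        self.as_frame = as_frame

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Vectorizers expect a 1-D series of strings, scalers a 2-D frame.
        return X[[self.key]] if self.as_frame else X[self.key]


toy = pd.DataFrame({
    "headline": ["markets rally", "storm hits coast", "markets fall"],
    "n_chars": [13, 16, 12],
})

# One small pipeline per column, then all outputs are stacked side by side.
union = FeatureUnion([
    ("headline", Pipeline([("pick", ColumnPicker("headline")),
                           ("bow", CountVectorizer())])),
    ("n_chars", Pipeline([("pick", ColumnPicker("n_chars", as_frame=True)),
                          ("scale", MinMaxScaler())])),
])

features = union.fit_transform(toy)
print(features.shape)  # (3, 7): 6 vocabulary terms + 1 scaled number
```

The single versus double brackets in `transform` mirror the difference between the text and number selectors: one returns a series, the other a one-column dataframe.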

In [None]:
class Selector(TransformerMixin, BaseEstimator):
    """
    Transformer to select a column from a dataframe 
    on which to perform additional transformations.
    """ 
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y=None):
        return self
    

class TextSelector(Selector):
    """
    Transformer to select a single text column from the dataframe
    on which to perform additional transformations.
    """
    def transform(self, X):
        return X[self.key]
    
    
class NumberSelector(Selector):
    """
    Transformer to select a single numerical column from the dataframe
    on which to perform additional transformations.
    """
    def transform(self, X):
        return X[[self.key]]

In [None]:
# text_pipe = ...
# nb_adj_adv_pipe = ...
# nb_words_pipe = ...
# doc_length_pipe = ...
# avg_word_length_pipe = ...
# avg_sentence_length_pipe = ...
# feats = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(feats, FeatureUnion)
assert len(feats.transformer_list) == 6, "Did you create a pipeline for each feature?"
for pipe in feats.transformer_list:
    
    selector = pipe[1][0]
    if not (isinstance(selector, TextSelector) or isinstance(selector, NumberSelector)):
        raise AssertionError("The first step of the pipeline is not correct.")
        
    feature_builder = pipe[1][1]
    if not (isinstance(feature_builder, TfidfVectorizer) or isinstance(feature_builder, StandardScaler)):
        raise AssertionError("The second step of the pipeline is not correct.")

### Exercise 3.3 - Fit the feature union
Define a function with a pipeline that applies the preprocessing steps from the previous exercise and fits a classifier to the provided data. The pipeline should have two steps, the feature union from the previous exercise and a `RandomForestClassifier`.
The function should fit the pipeline to the train data, make a prediction on the test data and calculate its accuracy.

In [None]:
def improved_pipeline(feats, X_train, X_test, y_train, y_test, seed=42):
    """
    Creates a pipeline with the provided feature union and a Random Forest classifier.
    Fits the pipeline to the train data and makes a prediction with the test data.
    Outputs the fitted pipeline and the accuracy of the prediction.

    Parameters:
        feats: feature union
        X_train, y_train: train data
        X_test, y_test: test data
        seed (int): seed for random state in the classifier

    Returns:
        pipe: fitted pipeline
        acc (float): accuracy of the prediction for the test data
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

    return pipe, acc

In [None]:
Y = df_processed["label"]
X = df_processed.drop(columns="label")

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
pipeline_model, pipeline_acc = improved_pipeline(feats, X_train, X_test, y_train, y_test)

assert isinstance(pipeline_model, Pipeline)
assert isinstance(pipeline_model[0],FeatureUnion), "The first step of the pipeline is not correct."
assert isinstance(pipeline_model[1],RandomForestClassifier),  "The second step of the pipeline is not correct."
np.testing.assert_almost_equal(pipeline_acc, 0.908, decimal=3, err_msg="The accuracy score is not correct.")

With this more complex approach we have achieved basically the same performance as our baseline. This can mean several things: our new features might have no real relevance to the model (which you can check with feature importances), or we have reached a plateau and can't improve the score with this technique.
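
As a rough illustration of the feature-importance check, the sketch below fits a random forest on synthetic data where, by construction, only the first column carries signal. The data and the names `X_toy`, `y_toy` and `forest` are invented for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends on the first column only.
rng = np.random.default_rng(0)
informative = rng.normal(size=200)
noise = rng.normal(size=200)
X_toy = np.column_stack([informative, noise])
y_toy = (informative > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# One importance per input column; the values sum to 1.
for name, score in zip(["informative", "noise"], forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```

In a pipeline with a feature union, the importances refer to the columns of the stacked matrix, so the tf-idf block contributes one entry per vocabulary term and each numerical feature contributes a single entry.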

Nevertheless, it is a good score for this problem and data set. Regardless of the score, you have learnt a lot about SpaCy and feature unions, and also that the sky is the limit when creating features. Anything can be a feature, really. Good features, however, are a totally different thing and might need more research and validation.