# BLU09 - Information Extraction

In [None]:
# importing needed packages here

import os
import re
import spacy
import hashlib
import numpy as np
import pandas as pd
import json

from tqdm import tqdm
from collections import Counter
from spacy.matcher import Matcher
from sklearn.metrics import accuracy_score
from nltk.tokenize import WordPunctTokenizer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer

from utils import remove_punctuation, remove_stopwords

def _hash(s):
    return hashlib.sha256(json.dumps(str(s)).encode()).hexdigest()

cpu_count = os.cpu_count() or 4  # fall back to 4 when the CPU count cannot be determined

In this exercise notebook you are going to tackle a real-world problem: **detecting fake news!** Let's create a binary classifier that decides whether a piece of news is 'reliable' or 'unreliable'. You will start by building some basic features, then move on to more complex ones, and finally put it all together. You should have a working classifier by the end of the notebook.

## Dataset

The dataset we will be using is the [Fake News](https://www.kaggle.com/c/fake-news/overview) dataset from Kaggle. Each piece of news is labeled either reliable ('0') or unreliable and possibly fake ('1'). First, let's load it and see what we are dealing with.

In [None]:
data_path = "datasets/fakenews/train.csv"
df = pd.read_csv(data_path, index_col=0)
df["title"] = df["title"].astype(str)
df["text"] = df["text"].astype(str)

df = df[:5000]

df.head()

We can see that we have 4 columns that are pretty self-explanatory. Let's drop the author column, since we only want to practice our text analysis, and drop the title as well for simplicity's sake.

In [None]:
df.drop(columns=["author", "title"], inplace=True)

Let's also load a SpaCy model, add the `merge_entities` component (which will come in handy later), and grab SpaCy's stopwords. [`merge_entities`](https://spacy.io/api/pipeline-functions#merge_entities) merges each named entity into a single token. We insert it into the SpaCy pipeline after the NER component.
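To get a feel for what `merge_entities` does, here is a minimal sketch on a toy sentence. It assumes a blank English pipeline with a rule-based `entity_ruler` standing in for the statistical NER component, so no model download is needed; the `toy_` names and the sentence are made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only (assumption: stands in for en_core_web_md).
toy_nlp = spacy.blank("en")

# Rule-based entity recognizer used here instead of the statistical NER component.
ruler = toy_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "New York"}])

toy_sentence = "She moved to New York last year."
tokens_before = [t.text for t in toy_nlp(toy_sentence)]

# Once merge_entities is in the pipeline, each entity span becomes one token.
toy_nlp.add_pipe("merge_entities")
tokens_after = [t.text for t in toy_nlp(toy_sentence)]

print(tokens_before)
print(tokens_after)
```

With merged entities, a multi-word name counts as a single token in any later token-level pattern or count.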

In [None]:
nlp = spacy.load('en_core_web_md')
nlp.add_pipe("merge_entities", after="ner")
en_stopwords = nlp.Defaults.stop_words

Here we process the news text with SpaCy. This might take a while depending on your hardware (a break to walk the dog? 🐶).

In [None]:
# keep one core free, but never ask for fewer than one process
docs = list(tqdm(nlp.pipe(df["text"], batch_size=20, n_process=max(1, cpu_count - 1)), total=len(df["text"])))
docs[:3]

Overall, the text looks good! Not too many errors, well written... as expected from a news article. Fake news is a tough, fairly recent problem that shows up more and more in the wild. Unlike spam, it usually has few orthographic mistakes or slang, because it comes from sources that want to appear credible while staying clickbaity enough to profit from ad revenue and to sow distrust.

## Q1. Pipeline

With our text processed, let's get a baseline model for our classification problem! Let's use our comfortable _TfidfVectorizer_ to get a simple, fast and trustworthy baseline.

Create a function that applies a pipeline to the given train data, makes a prediction for the test data, and returns the accuracy of the prediction. The pipeline should consist of a *TfidfVectorizer* and a *RandomForestClassifier*.
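As a reminder of the fit / predict / score mechanics, here is a rough sketch on a made-up toy corpus. Note the swap: it uses a `LogisticRegression` as a stand-in classifier, whereas the exercise asks for a *RandomForestClassifier*. The `toy_` names and sentences are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Made-up corpus: label 1 = unreliable, label 0 = reliable.
toy_train = [
    "aliens built the pyramids",
    "council approves new budget",
    "aliens control the council",
    "budget vote delayed",
]
toy_labels = [1, 0, 1, 0]
toy_test, toy_test_labels = ["aliens stole the budget"], [1]

toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),    # raw strings -> sparse tf-idf matrix
    ("clf", LogisticRegression()),   # stand-in; the exercise asks for a random forest
])
toy_pipe.fit(toy_train, toy_labels)

# The same pipeline vectorizes the test strings before predicting.
toy_acc = accuracy_score(toy_test_labels, toy_pipe.predict(toy_test))
print(toy_acc)
```

Because the vectorizer lives inside the pipeline, it is fitted on the train split only, which avoids leaking test vocabulary into training.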

In [None]:
def tfidf_rf_pipeline(X_train, X_test, y_train, y_test):
    """
    Trains a TfidfVectorizer + RandomForestClassifier pipeline for the given train data.
    Makes a prediction.
    Returns the trained pipeline and the accuracy of the prediction.
    X_train, y_train: train data, pd.Series
    X_test, y_test: test data, pd.Series
    """
    
    # pipe = (...)
    # pipe.fit(...)
    # (...)
    # acc =
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return pipe, acc

For the baseline, we will preprocess the text (remove punctuation and stopwords, and tokenize it), then run it through the pipeline:

In [None]:
df_processed = df.copy()
df_processed["text"] = df_processed["text"].apply(remove_punctuation)
df_processed["text"] = df_processed["text"].apply(remove_stopwords, stopwords = en_stopwords, 
                                                  tokenizer = WordPunctTokenizer())

X_train, X_test, y_train, y_test = train_test_split(df_processed["text"], df_processed["label"], 
                                                    test_size=0.2, random_state=42, stratify=df_processed["label"])
baseline_model, baseline_acc = tfidf_rf_pipeline(X_train, X_test, y_train, y_test)

# asserts
assert isinstance(baseline_model, Pipeline)
assert _hash(baseline_model[0]) == 'e68c8e581c16f0d62f3b9cb33a7967b17890e18c1fe819d013181e6714e7a303', \
    "The pipeline parameters are not correct."
assert _hash(baseline_model[1]) == 'e5fd22909dcc06f7c81407ee302879e41a75675ddfd55fa1ec640ae68a3338d8', \
    "The pipeline parameters are not correct."
assert np.allclose(baseline_acc, 0.908, rtol=0.1), "Something is wrong with the accuracy score. Use the default parameters."

Wow, the accuracy is quite good for such a simple text model! This shows how far a solid, trustworthy baseline can take you. I can't stress enough how important it is to start with a simple first iteration. Afterwards we can add complexity, study which features make sense, and test more out-of-the-box solutions.

Sometimes data scientists jump straight to the most complex solution when a simple one would be enough. Real-life problems will usually score lower, since the datasets are not controlled or cleaned for you, but that should not stop you from starting with a simpler and easier solution.

Now let's see if we can engineer other features. We will extract information with SpaCy and use it to train the same pipeline.

## Q2. SpaCy Matcher

Let's see if we can extract some useful features by using our SpaCy Matcher.

#### Q2.a) Simple Matcher

You think of some words that could be related to the detection of fake news. Something starts ringing in your mind about "propaganda", "USA" and "fraud", so you decide to check how many times those words appear in our news articles using the SpaCy Matcher.

Use the `docs` list preprocessed by SpaCy and count the number of occurrences of these words in all `Doc`s. The output should be the total number of occurrences across all news articles.
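The counting mechanics can be sketched on two made-up sentences. This is an illustration under assumptions, not the graded answer: it uses a blank English pipeline, toy words, and the case-insensitive `LOWER` attribute, while the exercise may call for a different token attribute.

```python
import spacy
from spacy.matcher import Matcher

toy_nlp = spacy.blank("en")  # tokenizer only; enough for text-based patterns
toy_matcher = Matcher(toy_nlp.vocab)

# One single-token pattern per word; LOWER compares the lowercased token text.
for word in ["hoax", "moon"]:
    toy_matcher.add(word, [[{"LOWER": word}]])

toy_docs = [
    toy_nlp("The Moon landing was no hoax."),
    toy_nlp("Hoax claims spread fast."),
]

# Calling the matcher returns a list of (match_id, start, end) tuples,
# so len() gives the number of matches in one Doc.
toy_count = sum(len(toy_matcher(doc)) for doc in toy_docs)
print(toy_count)
```

Summing the per-document match counts gives the corpus-wide total the exercise asks for.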

In [None]:
words = ["propaganda", "USA", "fraud"]

# init the matcher - remember it from the learning notebook
# add the patterns of the words. HINT: for a direct match you need a specific pattern (check SpaCy documentation)
# count how many matches in all news articles

# YOUR CODE HERE
raise NotImplementedError()

# count = ...

In [None]:
assert _hash(count) == '2357bc4ef05103860befa4a49fd8bbaa1541640845f4b89c1733bb4d6eb7cfcf'

#### Q2.b) POS-Tagging Search

OK, this doesn't look like the way to go, so let's try another theory. Fake news might lean on over-the-top descriptions, which would show up as a heavy use of adjectives and adverbs. Create a feature that counts the number of _Adjectives_ and _Adverbs_ in each news article, normalized by the article's word count. Don't forget to exclude punctuation.
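To make the normalization concrete, here is a small sketch on a hand-annotated toy `Doc`. The POS tags are assigned manually rather than predicted by a model, and treating "word count" as the number of non-punctuation tokens is an assumption.

```python
import spacy
from spacy.tokens import Doc

toy_vocab = spacy.blank("en").vocab

# Hand-annotated toy Doc (assumption: these tags mimic what a tagger would predict).
toy_doc = Doc(
    toy_vocab,
    words=["Truly", "shocking", "news", "spreads", "quickly", "!"],
    pos=["ADV", "ADJ", "NOUN", "VERB", "ADV", "PUNCT"],
)

# Drop punctuation first, then count adjectives and adverbs among the words.
words_only = [t for t in toy_doc if not t.is_punct]
n_adj_adv = sum(t.pos_ in {"ADJ", "ADV"} for t in words_only)
ratio = n_adj_adv / len(words_only)
print(ratio)
```

Here three of the five words are adjectives or adverbs, so the ratio is 0.6; repeating this for every `Doc` gives one number per article.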

In [None]:
# HINT: you already have your news text processed (the docs variable),
# so you can go over every doc and check if there is any POS Tag which is an ADJ or ADV
# to check the POS tag of a token in a doc -----> token.pos_

"""
Try it out by running the below code! 
for token in docs[0]:
    print(token.pos_)
"""

# Return a list with the number of adjectives and adverbs for every piece of news in docs
# Normalize it to the number of words in the given article
# nb_adj_adv = [...]

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(nb_adj_adv) == list, "The result should be a list with just 1 dimension."
assert len(nb_adj_adv) == 5000, "The length of the result list is wrong. \
You should have a count for every news article."
np.testing.assert_almost_equal(np.mean(nb_adj_adv), 0.1, decimal=1, err_msg='The result is not correct.')
np.testing.assert_almost_equal(np.sum(nb_adj_adv), 528, decimal=0, err_msg='The result is not correct.')

Let's add this feature to our dataframe:

In [None]:
df_processed["nb_adj_adv"] = nb_adj_adv

#### Q2.c) Adjectivized proper nouns

Another theory worth testing is that this kind of news often attaches adjectives to proper nouns to induce sentiment towards people or organizations. You want to extract proper nouns preceded by adjectives, to maybe use in a later analysis.

Create a `Matcher` to search for adjective + proper noun. Count the number of occurrences of each and output the 10 most common as a list.
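A two-token POS pattern plus a counter can be sketched as follows. To keep it illustrative, the toy pattern looks for adjective + common noun on a hand-annotated `Doc`; the exercise's proper-noun tag would take the place of `NOUN`. Note that `Counter.most_common` yields `(item, count)` pairs; follow the exercise cell's comments for the tuple order the checks expect.

```python
from collections import Counter

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

toy_vocab = spacy.blank("en").vocab

# Hand-annotated toy Doc (assumption: tags are set manually, not predicted).
toy_doc = Doc(
    toy_vocab,
    words=["The", "corrupt", "media", "ignored", "the", "corrupt",
           "media", "and", "shocking", "truth", "."],
    pos=["DET", "ADJ", "NOUN", "VERB", "DET", "ADJ",
         "NOUN", "CCONJ", "ADJ", "NOUN", "PUNCT"],
)

# Each dict describes one token, so this pattern spans two consecutive tokens.
toy_matcher = Matcher(toy_vocab)
toy_matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

# Slice the Doc with the match offsets to recover the matched text.
phrases = [toy_doc[start:end].text for _, start, end in toy_matcher(toy_doc)]
toy_top = Counter(phrases).most_common(2)
print(toy_top)
```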

In [None]:
# I'll reset the matcher for you
matcher = Matcher(nlp.vocab)

# pattern = [...] to find adjectives followed by proper nouns
# matcher.add("", pattern)

# for doc in docs:
# do matches and save the text in a list

# count the number of times the same expression appears in the list (hint: remember the dictionary solution...)
# take the top 10 of the counter
# the result will be a list of tuples of the form (count, expression)

# most_common_adj_propn = []

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(most_common_adj_propn) == list, "the output is not a list"
assert len(most_common_adj_propn) == 10, "It should be the top 10!"

assert _hash(most_common_adj_propn) == '0b6595146c52b5f1ce86ed96d3390418babe3c5da1497e2c97ea97cf1a830387', 'The top ten list is not correct.'

Let's look at the 10 most common:

In [None]:
most_common_adj_propn

The counts seem to be too low to use these terms as features. Maybe running a vectorizer on all the results could work better.

#### Q2.d) Objects of preposition
Objects of the sentences could indicate something. For instance, 'NGO financed by Soros' is more likely to appear in fake news than 'NGO financed by UNESCO'. Both objects in these sentences are objects of preposition (hint: SpaCy has a dependency label for this).

Create a `Matcher` to search for objects of preposition which are nouns. Again, count the number of occurrences of each and output the 10 most common as a list.
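If you are unsure what a dependency or POS label stands for, `spacy.explain` looks it up in SpaCy's glossary. A quick sketch (the choice of labels here is only an example):

```python
import spacy

# spacy.explain returns a short human-readable description, or None if unknown.
toy_explanations = {label: spacy.explain(label) for label in ["pobj", "dobj", "NOUN"]}

for label, meaning in toy_explanations.items():
    print(label, "->", meaning)
```

Labels found this way can be used in a `Matcher` pattern through the `DEP` and `POS` token attributes.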

In [None]:
# I'll reset the matcher for you
matcher = Matcher(nlp.vocab)

# pattern = [...] to find objects of preposition that are nouns
# hint: you need to use dependency and POS labels
# matcher.add("", pattern)

# for doc in docs:
# do matches and save the text in a list

# count the number of times the same expression appears in the list (hint: remember the dictionary solution...)
# take the top 10 of the counter
# the result will be a list of tuples of the form (count, expression)

# most_common_pobj = []

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
assert type(most_common_pobj) == list, "the output is not a list"
assert len(most_common_pobj) == 10, "It should be the top 10!"

assert _hash(most_common_pobj) == 'eb1416f528dcaa1f08b3c1f3460e2079f8baedb953f8ca810347adbf2d34e52a', 'The top ten list is not correct.'

This time the counts are higher and might be more interesting for a feature.

In [None]:
most_common_pobj

#### Q2.e) Verbs with direct objects
As the last point, you decide to look at verbs with direct objects. These should indicate actions taken towards something or someone. This one can be done without a Matcher.

Search for verbs with direct objects which are not pronouns. This time it's a bit trickier: you need to look at the [parse tree](https://spacy.io/usage/linguistic-features#navigating) because the object does not necessarily come right after the verb. Lemmatize both the verb and the object and count the occurrences of the lemmatized verb and direct object separated by a space, like this: 'verb_lemma dobj_lemma'. Don't forget to exclude objects that are pronouns.

Again, output the 10 most common as a list.
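Navigating from an object to its governing verb can be sketched on a hand-annotated toy `Doc`. Everything here (sentences, tags, lemmas, heads, dependency labels) is set manually for illustration, so it shows the parse-tree navigation rather than real parser output.

```python
import spacy
from spacy.tokens import Doc

toy_vocab = spacy.blank("en").vocab

# "Officials denied the claims. They denied it."
# heads holds, for each token, the absolute index of its syntactic head.
toy_doc = Doc(
    toy_vocab,
    words=["Officials", "denied", "the", "claims", ".", "They", "denied", "it", "."],
    pos=["NOUN", "VERB", "DET", "NOUN", "PUNCT", "PRON", "VERB", "PRON", "PUNCT"],
    lemmas=["official", "deny", "the", "claim", ".", "they", "deny", "it", "."],
    heads=[1, 1, 3, 1, 1, 6, 6, 6, 6],
    deps=["nsubj", "ROOT", "det", "dobj", "punct", "nsubj", "ROOT", "dobj", "punct"],
)

# Start from each direct object and walk up to its head via token.head,
# keeping only non-pronoun objects whose head is a verb.
pairs = [
    f"{tok.head.lemma_} {tok.lemma_}"
    for tok in toy_doc
    if tok.dep_ == "dobj" and tok.pos_ != "PRON" and tok.head.pos_ == "VERB"
]
print(pairs)
```

The second sentence contributes nothing because its object, "it", is a pronoun. Going through `token.head` works no matter how many words sit between the verb and its object.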

In [None]:
# hint: you need to use the parse tree and the dependency and POS labels

# for doc in docs:
# do matches and save the text in a list

# count the number of times the same expression appears in the list (hint: remember the dictionary solution...)
# take the top 10 of the counter
# the result will be a list of tuples of the form (count, expression)

# most_common_dobj = []

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
assert type(most_common_dobj) == list, "the output is not a list"
assert len(most_common_dobj) == 10, "It should be the top 10!"

assert _hash(most_common_dobj) == 'eb869c73f4e54f102e4b7e0b24fe2b058ff941d01522520ed61d1327ffe6b891', 'The top ten list is not correct.'

Not so many occurrences, but the whole list could be used with a vectorizer:

In [None]:
most_common_dobj

## Q3. Feature Unions

We're going to create a few more numerical features here, then use them in a feature union pipeline and see if the baseline improves.

#### Q3.a) Adding Extra Features

There are a few more simple features that we can extract from the dataset to try to enrich our model. Let's add to our dataframe the following features: **number of words in the news article**, **character length of the news article**, **average word length**, and **average sentence length**. (Remember that we already have the number of adverbs and adjectives.)

Use the SpaCy processed `Doc`s for calculating the sentence length and don't forget to exclude punctuation. Use the tokenized text in `df_processed` for everything else.
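The text-based features can be sketched with pandas string methods on a toy frame. This assumes whitespace splitting is an acceptable notion of "word"; the exact definitions the checks expect may differ, and the sentence-length feature (which needs the SpaCy `Doc`s) is left out here.

```python
import pandas as pd

# Made-up, already preprocessed texts.
toy = pd.DataFrame({"text": ["fake news spreads fast", "read more"]})

tokens = toy["text"].str.split()           # list of words per row
toy["nb_words"] = tokens.str.len()         # number of words
toy["doc_length"] = toy["text"].str.len()  # number of characters, spaces included
toy["avg_word_length"] = tokens.apply(lambda ws: sum(len(w) for w in ws) / len(ws))

print(toy)
```

An empty text would make the average divide by zero, so a guard may be needed on messier data.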

In [None]:
# df_processed["nb_words"] = ...
# df_processed["doc_length"] = ...
# df_processed["avg_word_length"] = ...
# df_processed["avg_sentence_length"] = ...


# YOUR CODE HERE
raise NotImplementedError()

df_processed

In [None]:
assert df_processed.shape == (5000, 7), "Something wrong about the shape, do you have all columns/rows?"
assert "nb_words" in df_processed, "Missing column! Maybe wrong name?"
assert "doc_length" in df_processed, "Missing column! Maybe wrong name?"
assert "avg_word_length" in df_processed, "Missing column! Maybe wrong name?"
assert "avg_sentence_length" in df_processed, "Missing column! Maybe wrong name?"

assert np.sum(df_processed["nb_words"]) == 1963935, "Something wrong with the nb_words column."
assert np.sum(df_processed["doc_length"]) == 14636737, "Something wrong with the doc_length column."
np.testing.assert_almost_equal(np.sum(df_processed["avg_word_length"]), 32100.0, decimal=1, err_msg='Something is wrong with the avg_word_length column.')
np.testing.assert_almost_equal(np.sum(df_processed["avg_sentence_length"]), 116678.1, decimal=1, err_msg='Something is wrong with the avg_sentence_length column.')

#### Q3.b) Feature Pipelines

Let's create a processing _Pipeline_ for every new feature and join them all in a _Feature Union_. For the textual features use the usual _TfidfVectorizer_ with default parameters and for any numerical feature use a _Standard Scaler_. Afterwards, join the features pipelines using a _Feature Union_.
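To see what a _Feature Union_ produces, here is a toy sketch with one text pipeline and one numeric pipeline. It uses `FunctionTransformer` with two hypothetical helper functions (`pick_text`, `pick_number`) as column selectors so that the block is self-contained; the data is made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

toy = pd.DataFrame({
    "text": ["aliens built the pyramids", "council approves new budget", "aliens approve budget"],
    "nb_words": [4, 4, 3],
})

def pick_text(frame):
    return frame["text"]        # 1-D Series: what TfidfVectorizer expects

def pick_number(frame):
    return frame[["nb_words"]]  # 2-D frame: what StandardScaler expects

toy_text_pipe = Pipeline([("select", FunctionTransformer(pick_text)), ("tfidf", TfidfVectorizer())])
toy_num_pipe = Pipeline([("select", FunctionTransformer(pick_number)), ("scale", StandardScaler())])

# The union runs each pipeline on the same input and stacks the outputs column-wise.
toy_feats = FeatureUnion([("text", toy_text_pipe), ("nb_words", toy_num_pipe)])
toy_matrix = toy_feats.fit_transform(toy)

vocab_size = len(dict(toy_feats.transformer_list)["text"].named_steps["tfidf"].vocabulary_)
print(toy_matrix.shape, vocab_size)
```

The result has one row per article and one column per tf-idf term plus one column for the scaled number. The single versus double brackets in the selectors matter: the vectorizer needs a 1-D sequence of strings, the scaler a 2-D array.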

Use the following Selector classes.

In [None]:
class Selector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a column from the dataframe to perform additional transformations on
    """ 
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y=None):
        return self
    

class TextSelector(Selector):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def transform(self, X):
        return X[self.key]
    
    
class NumberSelector(Selector):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def transform(self, X):
        return X[[self.key]]
    
    

In [None]:
# text_pipe = Pipeline([...])
# nb_adj_adv_pipe = Pipeline([...])
# nb_words_pipe = Pipeline([...])
# doc_length_pipe = Pipeline([...])
# avg_word_length_pipe = Pipeline([...])
# avg_sentence_length_pipe = Pipeline([...])
# feats = FeatureUnion(...)

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(feats, FeatureUnion)
assert len(feats.transformer_list) == 6, "Are you creating 6 pipelines? One for each feature?"
for pipe in feats.transformer_list:
    
    selector = pipe[1][0]
    if not (isinstance(selector, TextSelector) or isinstance(selector, NumberSelector)):
        raise AssertionError("pipeline is wrong, the Selectors should come first.")
        
    feature_builder = pipe[1][1]
    if not (isinstance(feature_builder, TfidfVectorizer) or isinstance(feature_builder, StandardScaler)):
        raise AssertionError("pipeline is wrong, the second thing to come should be the Tfidf or the Scaler.")
    

#### Q3.c) Feature Union
Now let's build our function to use the newly created _Feature Union_ and calculate its performance!

Create a function that will apply the improved pipeline to the provided train data, make a prediction and calculate its accuracy. The pipeline should consist of the feature union we created in Q3.b and a RandomForestClassifier.

In [None]:
def improved_pipeline(feats, X_train, X_test, y_train, y_test):
    """
    Creates a pipeline with the provided feature union and a Random Forest classifier.
    Fits the pipeline to the train data and makes a prediction with the test data.
    Outputs the fitted pipeline and the accuracy of the prediction.
    """
    
    # pipe = (...)
    # pipe.fit(...)
    # (...)
    # acc = ...
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return pipe, acc

In [None]:
Y = df_processed["label"]
X = df_processed.drop(columns="label")

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
pipeline_model, pipeline_acc = improved_pipeline(feats, X_train, X_test, y_train, y_test)

# asserts
assert isinstance(pipeline_model, Pipeline)
assert _hash(pipeline_model[0]) == '933e9e022884461f96b8dcbda7872290b8fd44f4003fa618ab5ff8d2a952c247', \
    "The first part of the Pipeline is incorrect."
assert _hash(pipeline_model[1]) == 'e5fd22909dcc06f7c81407ee302879e41a75675ddfd55fa1ec640ae68a3338d8', \
    "The second part of the Pipeline is incorrect."
assert np.allclose(pipeline_acc, 0.913, rtol=1e-1), "Something is wrong with the accuracy score. Use the default parameters."

With this more complex approach we have achieved basically the same performance as our baseline. This can mean several things: our features might have no real relevance to the model (which you can check with feature importances), or we have hit a plateau and can't improve the score with this technique.
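Checking feature importances can be sketched on synthetic data with one informative and one noise column. The data and the forest settings below are assumptions for illustration; with the notebook's pipeline, the same `feature_importances_` attribute would live on the random forest step, with one entry per column of the feature union output.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=200)

informative = toy_y + rng.normal(scale=0.1, size=200)  # tracks the label closely
noise = rng.normal(size=200)                            # unrelated to the label
toy_X = np.column_stack([informative, noise])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_X, toy_y)

# Impurity-based importances: one value per input column, summing to 1.
for name, score in zip(["informative", "noise"], forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```

A column whose importance stays near zero is a candidate for removal, although impurity-based importances are only a rough guide.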

Nevertheless, it is a good score for this problem and dataset. Regardless of the score, you have learnt a lot about _SpaCy_ and _Feature Unions_, and also that the sky is the limit when creating features. Anything can be a feature, really. Good features are a different matter: they need more research and validation.