# BLU09 - Information Extraction

In [1]:
# importing needed packages here

import os
import re
import spacy
import hashlib
import numpy as np
import pandas as pd
import json

from tqdm import tqdm
from collections import Counter
from spacy.matcher import Matcher
from sklearn.metrics import accuracy_score
from nltk.tokenize import WordPunctTokenizer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer

from utils import remove_punctuation, remove_stopwords

def _hash(s):
    return hashlib.sha256(json.dumps(str(s)).encode()).hexdigest()

cpu_count = os.cpu_count() or 4  # fall back to 4 when the CPU count cannot be determined

In this exercise notebook you are going to tackle a quite real problem: **Detecting fake news!** Let's create a binary classifier to determine if a piece of news is considered 'reliable' or 'unreliable'. You will start by building some basic features, then go on to build more complex ones, and finally put it all together. You should be able to have a working classifier by the end of the notebook.

## Dataset

The dataset we will be using is the [Fake News](https://www.kaggle.com/c/fake-news/overview) dataset from Kaggle. Each piece of news is labeled either reliable ('0') or unreliable and possibly fake ('1'). First, let's load it and see what we are dealing with.

In [2]:
data_path = "datasets/fakenews/train.csv"
df = pd.read_csv(data_path, index_col=0)
df["title"] = df["title"].astype(str)
df["text"] = df["text"].astype(str)

df = df[:5000]

df.head()

id,title,author,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


We can see that we have 4 columns that are pretty self-explanatory. Let's drop the author column, since we only want to practice our text analysis, and drop the title as well for simplicity's sake.

In [3]:
df.drop(columns=["author", "title"], inplace=True)

Let's also load a SpaCy model, add the `merge_entities` component (which will come in handy later), and get the stopwords. `merge_entities` is documented at https://spacy.io/api/pipeline-functions#merge_entities. We insert it into the SpaCy pipeline after the NER component.

In [4]:
nlp = spacy.load('en_core_web_md')
nlp.add_pipe("merge_entities", after="ner")
en_stopwords = nlp.Defaults.stop_words

Here we process the news text with SpaCy. This might take a while depending on your hardware (a break to walk the dog? 🐶).

In [5]:
# use at least one process, even on a single-core machine
docs = list(tqdm(nlp.pipe(df["text"], batch_size=20, n_process=max(1, cpu_count - 1)), total=len(df["text"])))
docs[:3]

100%|███████████████████████████████████████| 5000/5000 [01:53<00:00, 43.97it/s]


[House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It By Darrell Lucus on October 30, 2016 Subscribe Jason Chaffetz on the stump in American Fork, Utah ( image courtesy Michael Jolley, available under a Creative Commons-BY license) 
 With apologies to Keith Olbermann, there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide, it looks like we also know who the second-worst person is as well. It turns out that when Comey sent his now-infamous letter announcing that the FBI was looking into emails that may be related to Hillary Clinton’s email server, the ranking Democrats on the relevant committees didn’t hear about it from Comey. They found out via a tweet from one of the Republican committee chairmen. 
 As we now know, Comey notified the Republican chairmen and Democratic ranking members of the House Intelligence, Judiciary, and Oversight committees that his agency was revi

Overall, the text looks good! Not too many errors, well written... as expected from a news article. Fake news is a very tough, recent problem that is appearing more and more frequently in the wild. Usually there aren't many orthographic mistakes or slang (as may happen with spam), since it comes from news sources that want to appear credible but also clickbaity, so that they can profit from that good ad revenue and create distrust.

## Q1. Pipeline

With our text processed, let's get a baseline model for our classification problem! Let's use our comfortable _TfidfVectorizer_ to get a simple, fast and trustworthy baseline.

Create a function that applies a pipeline to the given train data, makes a prediction for the test data, and returns the accuracy of the prediction. The pipeline should consist of a *TfidfVectorizer* and a *RandomForestClassifier* 

In [6]:
def tfidf_rf_pipeline(X_train, X_test, y_train, y_test):
    """
    Trains a TfidfVectorizer + RandomForestClassifier pipeline for the given train data.
    Makes a prediction.
    Returns the trained pipeline and the accuracy of the prediction.
    X_train, y_train: train data, pd.Series
    X_test, y_test: test data, pd.Series
    """
    
    # pipe = (...)
    # pipe.fit(...)
    # (...)
    # acc =
    
    # YOUR CODE HERE
    
    pipe = Pipeline([('tfidf', TfidfVectorizer()),
                   ('classifier', RandomForestClassifier())])
    pipe.fit(X_train, y_train)
    
    y_pred = pipe.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    
    return pipe, acc

For the baseline, we will preprocess the text (remove punctuation and stopwords, and tokenize it), then run it through the pipeline:

In [7]:
df_processed = df.copy()
df_processed["text"] = df_processed["text"].apply(remove_punctuation)
df_processed["text"] = df_processed["text"].apply(remove_stopwords, stopwords = en_stopwords, 
                                                  tokenizer = WordPunctTokenizer())

X_train, X_test, y_train, y_test = train_test_split(df_processed["text"], df_processed["label"], 
                                                    test_size=0.2, random_state=42, stratify=df_processed["label"])
baseline_model, baseline_acc = tfidf_rf_pipeline(X_train, X_test, y_train, y_test)

# asserts
assert isinstance(baseline_model, Pipeline)
assert _hash(baseline_model[0]) == 'e68c8e581c16f0d62f3b9cb33a7967b17890e18c1fe819d013181e6714e7a303', "The pipeline parameters are not correct."
assert _hash(baseline_model[1]) == 'e5fd22909dcc06f7c81407ee302879e41a75675ddfd55fa1ec640ae68a3338d8', "The pipeline parameters are not correct."
assert np.allclose(baseline_acc, 0.908, 0.1), "something wrong with the accuracy score. Use the default parameters."

Wow, the accuracy is quite good for such a simple text model! This just proves that a trustworthy starting baseline is all you need. I can't stress enough that it's really important to have a simple first iteration; afterwards we can add complexity and study which features make sense or not, testing more out-of-the-box solutions.

Sometimes data scientists focus right off the bat on the most complex solutions when a simple one would be enough. Real-life problems will obviously achieve lower scores, as the datasets are not controlled or cleaned for you, but that should not stop you from starting with a simpler and easier solution.

Now let's see if we can engineer other features. We will extract information with SpaCy and use it to train the same pipeline.

## Q2. SpaCy Matcher

Let's see if we can extract some useful features by using our SpaCy Matcher.

#### Q2.a) Simple Matcher

You think of some words that could be related to the detection of Fake News. Something starts ringing in your mind about "propaganda", "USA" and "fraud", so you decide to check how many of those words appear in our news articles using the SpaCy Matcher.

Use the docs list preprocessed by SpaCy and count the number of occurrences of these words in all `Doc`s. The output should be the sum of occurrences in all news articles.
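
As a small illustration of what "direct match" means here, the sketch below compares a `TEXT` pattern (exact, case-sensitive) with a `LOWER` pattern (case-insensitive). It uses a blank English pipeline and a made-up sentence, so it needs no model download; the names `blank_nlp`, `demo_matcher` and `demo_doc` are only for this example and are chosen so they don't overwrite the notebook's variables.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: enough for TEXT / LOWER patterns
blank_nlp = spacy.blank("en")

demo_matcher = Matcher(blank_nlp.vocab)
demo_matcher.add("exact", [[{"TEXT": "fraud"}]])      # matches "fraud" only
demo_matcher.add("any_case", [[{"LOWER": "fraud"}]])  # matches "fraud", "Fraud", "FRAUD", ...

demo_doc = blank_nlp("Fraud is fraud, they said.")

# Count the matches per pattern name
hits = {"exact": 0, "any_case": 0}
for match_id, start, end in demo_matcher(demo_doc):
    hits[blank_nlp.vocab.strings[match_id]] += 1

print(hits)  # {'exact': 1, 'any_case': 2}
```

The exercise asks for the exact form, which is why "USA" will not match "usa" or "Usa".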

In [8]:
words = ["propaganda", "USA", "fraud"]

# init the matcher - remember it from the learning notebook
# add the patterns of the words. HINT: for a direct match you need a specific pattern (check SpaCy documentation)
# count how many matches in all news articles

# YOUR CODE HERE
counts = []

matcher = Matcher(nlp.vocab)
for word in words:
    matcher.add(word, [[{'TEXT': word}]])
    
for idx, doc in enumerate(docs):
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # the matched span
        counts.append([idx, start, end, span])


# count = ...
count = len(counts)


In [9]:
assert _hash(count) == '2357bc4ef05103860befa4a49fd8bbaa1541640845f4b89c1733bb4d6eb7cfcf'

#### Q2.b) POS-Tagging Search

Ok, this doesn't look like the way to go, so let's look at other theories. You start thinking that fake news might overuse adjectives and adverbs in exaggerated or over-the-top descriptions. So you decide to create a feature that counts the number of _Adjectives_ and _Adverbs_ in a news article. The count should be normalized by the word count of the article. Don't forget to exclude the punctuation.
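
To make the normalization concrete, here is a tiny worked example on hand-tagged tokens. The sentence and its tags are invented for illustration; in the exercise the tags come from `token.pos_`.

```python
# (text, POS tag) pairs for a made-up two-sentence snippet
tagged = [
    ("Truly", "ADV"), ("shocking", "ADJ"), ("news", "NOUN"), ("!", "PUNCT"),
    ("Officials", "NOUN"), ("quietly", "ADV"), ("resigned", "VERB"), (".", "PUNCT"),
]

# Drop punctuation before counting words
word_tags = [pos for _, pos in tagged if pos != "PUNCT"]

# Count adjectives and adverbs, then divide by the number of words
n_adj_adv = sum(pos in ("ADJ", "ADV") for pos in word_tags)
ratio = n_adj_adv / len(word_tags) if word_tags else 0.0

print(n_adj_adv, len(word_tags), ratio)  # 3 6 0.5
```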

In [10]:
# HINT: you already have your news text processed (the docs variable),
# so you can go over every doc and check if there is any POS Tag which is an ADJ or ADV
# to check the POS tag of a token in a doc -----> token.pos_

"""
Try it out by running the below code! 
for token in docs[0]:
    print(token.pos_)
"""

# Return a list with the number of adjectives and adverbs for every piece of news in docs
# Normalize it to the number of words in the given article
# nb_adj_adv = [...]

# YOUR CODE HERE
# Token attributes are enough here, so no Matcher is needed
nb_adj_adv = []

for doc in docs:
    adj_adv_count = sum(1 for token in doc if token.pos_ in ['ADJ', 'ADV'] and not token.is_punct)
    word_count = sum(1 for token in doc if not token.is_punct)
    normalized_count = adj_adv_count / word_count if word_count > 0 else 0
    nb_adj_adv.append(normalized_count)


In [11]:
assert type(nb_adj_adv) == list, "The result should be a list with just 1 dimension."
assert len(nb_adj_adv) == 5000, "The length of the result list is wrong. \
You should have a count for every news article."
np.testing.assert_almost_equal(np.mean(nb_adj_adv), 0.1, decimal=1, err_msg='The result is not correct.')
np.testing.assert_almost_equal(np.sum(nb_adj_adv), 528, decimal=0, err_msg='The result is not correct.')

Let's add this feature to our dataframe:

In [12]:
df_processed["nb_adj_adv"] = nb_adj_adv

#### Q2.c) Adjectivized proper nouns

Another theory that might be worth testing is that adjectives with proper nouns are often used in this kind of news to induce sentiments towards people or organizations. You want to extract proper nouns preceded by adjectives to maybe use in a later analysis.

Create a `Matcher` to search for adjective + proper noun. Count the number of occurrences of each and output the 10 most common as a list.
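
The counting step can be sketched with `collections.Counter` on a hand-made list of matched expressions (the list below is invented, not taken from the dataset). `most_common(n)` returns a list of `(expression, count)` tuples, which is the output shape the exercise asks for.

```python
from collections import Counter

# Made-up matched spans, as plain strings
found = [
    "former President", "many Americans", "former President",
    "eastern Aleppo", "many Americans", "former President",
]

# Count each expression and keep the two most frequent
top_two = Counter(found).most_common(2)

print(top_two)  # [('former President', 3), ('many Americans', 2)]
```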

In [13]:
# I'll reset the matcher for you
matcher = Matcher(nlp.vocab)

# pattern = [...] to find adjectives followed by proper nouns
# matcher.add("", pattern)

# for doc in docs:
# do matches and save the text in a list

# count the number of times the same expression appears in the list (hint: remember the dictionary solution...)
# take the top 10 of the counter
# the result will be a list of tuples of the form (expression, count)

# most_common_adj_propn = []

# YOUR CODE HERE
pattern = [{"POS": "ADJ"}, {"POS": "PROPN"}]  # Adjective followed by a proper noun
matcher.add("ADJ_PROPN_PATTERN", [pattern])

# Find matches
matches_list = []
for doc in docs:
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # The matched span
        matches_list.append(str(span))

# Count occurrences and get the 10 most common
matches_counter = Counter(matches_list)
most_common_adj_propn = matches_counter.most_common(10)

# The result will be a list of tuples (expression, count)
most_common_adj_propn

[('former President', 99),
 ('eastern Aleppo', 76),
 ('many Americans', 74),
 ('most Americans', 49),
 ('northern Syria', 49),
 ('Russian President', 40),
 ('former Secretary', 38),
 ('east Aleppo', 31),
 ('super PAC', 29),
 ('congressional Republicans', 24)]

In [14]:
assert type(most_common_adj_propn) == list, "the output is not a list"
assert len(most_common_adj_propn) == 10, "It should be the top 10!"

assert _hash(most_common_adj_propn) == '0b6595146c52b5f1ce86ed96d3390418babe3c5da1497e2c97ea97cf1a830387', 'The top ten list is not correct.'

Let's look at the 10 most common:

In [15]:
most_common_adj_propn

[('former President', 99),
 ('eastern Aleppo', 76),
 ('many Americans', 74),
 ('most Americans', 49),
 ('northern Syria', 49),
 ('Russian President', 40),
 ('former Secretary', 38),
 ('east Aleppo', 31),
 ('super PAC', 29),
 ('congressional Republicans', 24)]

The counts seem to be too low to use these terms as features. Maybe running a vectorizer on all the results could work better.

#### Q2.d) Objects of preposition
Objects of the sentences could indicate something. For instance, 'NGO financed by Soros' is more likely to appear in fake news than 'NGO financed by UNESCO'. Both objects in these sentences are objects of preposition (hint: SpaCy has a dependency label for this).

Create a `Matcher` to search for objects of preposition which are nouns. Again, count the number of occurrences of each and output the 10 most common as a list.

In [16]:
# I'll reset the matcher for you
matcher = Matcher(nlp.vocab)

# pattern = [...] to find objects of preposition that are nouns
# hint: you need to use dependency and POS labels
# matcher.add("", pattern)

# for doc in docs:
# do matches and save the text in a list

# count the number of times the same expression appears in the list (hint: remember the dictionary solution...)
# take the top 10 of the counter
# the result will be a list of tuples of the form (expression, count)

# most_common_pobj = []

# YOUR CODE HERE
# A single-token pattern can combine POS and dependency labels:
# match nouns whose dependency label is 'pobj' (object of preposition)
pattern = [{"POS": "NOUN", "DEP": "pobj"}]
matcher.add("NOUN_POBJ_PATTERN", [pattern])

# Collect the text of every match
pobj_matches = []
for doc in docs:
    for match_id, start, end in matcher(doc):
        pobj_matches.append(doc[start:end].text)

# Count the occurrences of each match and get the top 10
pobj_counter = Counter(pobj_matches)
most_common_pobj = pobj_counter.most_common(10)

In [17]:
assert type(most_common_pobj)== list, "the output is not a list"
assert len(most_common_pobj) == 10, "It should be the top 10!"

assert _hash(most_common_pobj) == 'eb1416f528dcaa1f08b3c1f3460e2079f8baedb953f8ca810347adbf2d34e52a', 'The top ten list is not correct.'

This time the counts are higher and might be more interesting for a feature.

In [18]:
most_common_pobj

[('people', 2749),
 ('time', 2468),
 ('world', 1642),
 ('country', 1558),
 ('election', 1181),
 ('campaign', 1144),
 ('way', 1108),
 ('government', 1013),
 ('state', 999),
 ('life', 937)]

#### Q2.e) Verbs with direct objects
As the last point, you decide to look at verbs with direct objects. These should indicate actions taken towards something or someone. This one can be done without a Matcher.

Search for verbs with direct objects which are not pronouns. This time it's a bit trickier: you need to look at the [parse tree](https://spacy.io/usage/linguistic-features#navigating) because the object does not necessarily come right after the verb. Lemmatize both the verb and the object and count the occurrences of the lemmatized verb and direct object separated by a space, like this: 'verb_lemma dobj_lemma'. Don't forget to exclude objects that are pronouns.
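
To see why the parse tree matters, the sketch below builds two tiny `Doc` objects by hand, with invented annotations instead of model predictions, and walks from each direct object to its head. In the first sentence the object "decision" is three tokens away from its verb "made", yet `token.head` still finds it. The helper name `verb_dobj_pairs` is hypothetical and only used here.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-annotated example: heads are absolute token indices
doc_noun_obj = Doc(
    vocab,
    words=["She", "made", "a", "quick", "decision"],
    pos=["PRON", "VERB", "DET", "ADJ", "NOUN"],
    lemmas=["she", "make", "a", "quick", "decision"],
    heads=[1, 1, 4, 4, 1],
    deps=["nsubj", "ROOT", "det", "amod", "dobj"],
)

# Same structure, but the direct object is a pronoun
doc_pron_obj = Doc(
    vocab,
    words=["They", "saw", "it"],
    pos=["PRON", "VERB", "PRON"],
    lemmas=["they", "see", "it"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
)

def verb_dobj_pairs(doc):
    """Return 'verb_lemma dobj_lemma' for non-pronoun direct objects of verbs."""
    return [
        f"{tok.head.lemma_} {tok.lemma_}"
        for tok in doc
        if tok.dep_ == "dobj" and tok.pos_ != "PRON" and tok.head.pos_ == "VERB"
    ]

print(verb_dobj_pairs(doc_noun_obj))  # ['make decision']
print(verb_dobj_pairs(doc_pron_obj))  # []
```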

Again, output the 10 most common as a list.

In [19]:
# hint: you need to use the parse tree and the dependency and POS labels

# for doc in docs:
# do matches and save the text in a list

# count the number of times the same expression appears in the list (hint: remember the dictionary solution...)
# take the top 10 of the counter
# the result will be a list of tuples of the form (expression, count)

# most_common_dobj = []

# YOUR CODE HERE
# Initialize a list to store verb-object pairs
verb_object_pairs = []

# Loop through each document
for doc in docs:
    # Loop through each token in the document
    for token in doc:
        # Check if the token is a direct object and not a pronoun
        if token.dep_ == 'dobj' and token.pos_ != 'PRON':
            # Check if the direct object's parent (head) is a verb
            if token.head.pos_ == 'VERB':
                # Lemmatize the verb (parent/head) and the object (token), and add to the list
                pair = f'{token.head.lemma_} {token.lemma_}'
                verb_object_pairs.append(pair)

# Count the occurrences of each pair
pair_counts = Counter(verb_object_pairs)

# Extract the 10 most common pairs
most_common_dobj = pair_counts.most_common(10)

most_common_dobj

[('take place', 380),
 ('do thing', 205),
 ('play role', 196),
 ('tell reporter', 174),
 ('kill people', 165),
 ('win election', 156),
 ('make decision', 153),
 ('have right', 149),
 ('make sense', 141),
 ('take action', 141)]

In [21]:
assert type(most_common_dobj)== list, "the output is not a list"
assert len(most_common_dobj) == 10, "It should be the top 10!"

assert _hash(most_common_dobj) == 'eb869c73f4e54f102e4b7e0b24fe2b058ff941d01522520ed61d1327ffe6b891', 'The top ten list is not correct.'

AssertionError: The top ten list is not correct.

Not so many occurrences, but the whole list could be used with a vectorizer:

In [22]:
most_common_dobj

[('take place', 380),
 ('do thing', 205),
 ('play role', 196),
 ('tell reporter', 174),
 ('kill people', 165),
 ('win election', 156),
 ('make decision', 153),
 ('have right', 149),
 ('make sense', 141),
 ('take action', 141)]

## Q3. Feature Unions

We're going to create a few more numerical features here, then use them in a feature union pipeline and see if the baseline improves.

#### Q3.a) Adding Extra Features

There are a few more simple features that we can extract from the dataset to try to enrich our model. Let's add to our dataframe the following features: **number of words in the news article**, **character length of the news article**,  **average word length**, and **average sentence length**. (Remember that we already have the number of adverbs and adjectives.)

Use the SpaCy processed `Doc`s for calculating the sentence length and don't forget to exclude punctuation. Use the tokenized text in `df_processed` for everything else.
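
As a quick sanity check of the three text-based definitions, here is a worked example on one short, already-preprocessed string (made up for illustration). Note that the character length counts the spaces between tokens, while the average word length does not.

```python
text = "iranian woman sentenced years prison"
tokens = text.split()

doc_length = len(text)   # characters, including the 4 spaces
nb_words = len(tokens)   # whitespace-separated tokens
avg_word_length = sum(len(t) for t in tokens) / nb_words if nb_words else 0

print(doc_length, nb_words, avg_word_length)  # 36 5 6.4
```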

In [None]:
## Calculate the average sentence length, excluding punctuation from the word counts.
#sentence_lengths = []
#for doc in docs:
#    sentences = [sent for sent in doc.sents if len(sent) > 0]
#    avg_sentence_length = np.mean([len([token for token in sent if not token.is_punct]) for sent in sentences]) if sentences else 0
#    sentence_lengths.append(avg_sentence_length)
#df_processed['avg_sentence_length'] = sentence_lengths
#
## Replace NaN values that may arise, especially in documents without sentences or only punctuation in sentences.
#df_processed.fillna(0, inplace=True)

In [23]:
# df_processed["nb_words"] = ...
# df_processed["doc_length"] = ...
# df_processed["avg_word_length"] = ...
# df_processed["avg_sentence_length"] = ...



## YOUR CODE HERE
df_processed["doc_length"] = df_processed['text'].map(len)
df_processed["nb_words"] = df_processed['text'].apply(lambda x: len(x.split()))

df_processed["avg_word_length"] = df_processed["text"].apply(
    lambda x: sum(len(word) for word in x.split()) / len(x.split()) if x.split() else 0
)

# Average sentence length in tokens; note that len(sentence) also counts punctuation tokens
df_processed["avg_sentence_length"] = [
    np.mean([len(sentence) for sentence in doc.sents if len(sentence) > 0]) for doc in docs
]


df_processed

id,text,label,nb_adj_adv,doc_length,nb_words,avg_word_length,avg_sentence_length
0,house dem aide didnt comeys letter jason chaff...,1,0.114002,3155,413,6.641646,21.878049
1,feeling life circles roundabout heads straight...,0,0.121523,2588,345,6.504348,24.875000
2,truth fired october 29 2016 tension intelligen...,1,0.137789,4854,620,6.830645,24.327586
3,videos 15 civilians killed single airstrike id...,1,0.086556,2060,274,6.521898,22.296296
4,print iranian woman sentenced years prison ira...,1,0.047945,639,83,6.710843,33.800000
...,...,...,...,...,...,...,...
4995,washington little affection trust hillary clin...,0,0.107093,2929,398,6.361809,33.083333
4996,httpmediaarchivesgsradionetdduke112216mp3 dr d...,1,0.087591,592,75,6.906667,13.545455
4997,los angeles hollywoods secretive unusual jobs ...,0,0.092251,5025,722,5.961219,22.901408
4998,chant erupts college auditorium washington adm...,0,0.124720,14798,1967,6.523640,23.094444


In [24]:
assert df_processed.shape == (5000, 7), "Something wrong about the shape, do you have all columns/rows?"
assert "nb_words" in df_processed, "Missing column! Maybe wrong name?"
assert "doc_length" in df_processed, "Missing column! Maybe wrong name?"
assert "avg_word_length" in df_processed, "Missing column! Maybe wrong name?"
assert "avg_sentence_length" in df_processed, "Missing column! Maybe wrong name?"

assert np.sum(df_processed["nb_words"]) == 1963935, "Something wrong with the nb_words column."
assert np.sum(df_processed["doc_length"]) == 14636737, "Something wrong with the doc_length column."
np.testing.assert_almost_equal(np.sum(df_processed["avg_word_length"]), 32100.0, decimal=1, err_msg='Something is wrong with the avg_word_length column.')
np.testing.assert_almost_equal(np.sum(df_processed["avg_sentence_length"]), 116357.6, decimal=1, err_msg='Something is wrong with the avg_sentence_length column.')

AssertionError: 
Arrays are not almost equal to 1 decimals
Something is wrong with the avg_sentence_length column.
 ACTUAL: 116357.87017801998
 DESIRED: 116357.6

#### Q3.b) Feature Pipelines

Let's create a processing _Pipeline_ for every feature. For the textual feature use the usual _TfidfVectorizer_ with default parameters, and for each numerical feature use a _StandardScaler_. Afterwards, join the feature pipelines using a _Feature Union_.

Use the following Selector classes.

In [25]:
class Selector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a column from the dataframe to perform additional transformations on
    """ 
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y=None):
        return self
    

class TextSelector(Selector):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def transform(self, X):
        return X[self.key]
    
    
class NumberSelector(Selector):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def transform(self, X):
        return X[[self.key]]
    
    

In [26]:
# text_pipe = Pipeline([...])
# nb_adj_adv_pipe = Pipeline([...])
# nb_words_pipe = Pipeline([...])
# doc_length_pipe = Pipeline([...])
# avg_word_length_pipe = Pipeline([...])
# avg_sentence_length_pipe = Pipeline([...])
# feats = FeatureUnion(...)


# YOUR CODE HERE

text_pipe = Pipeline([
                ('selector', TextSelector("text")),
                ('tfidf', TfidfVectorizer())
            ])

nb_adj_adv_pipe =  Pipeline([
                ('selector', NumberSelector("nb_adj_adv")),
                ('standard', StandardScaler())
            ])

nb_words_pipe =  Pipeline([
                ('selector', NumberSelector("nb_words")),
                ('standard', StandardScaler())
            ])

doc_length_pipe =  Pipeline([
                ('selector', NumberSelector("doc_length")),
                ('standard', StandardScaler())
            ])

avg_word_length_pipe =  Pipeline([
                ('selector', NumberSelector("avg_word_length")),
                ('standard', StandardScaler())
            ])

avg_sentence_length_pipe = Pipeline([
                ('selector', NumberSelector("avg_sentence_length")),
                ('standard', StandardScaler())
            ])
feats = FeatureUnion([('text', text_pipe), 
                      ('nb_adj_adv', nb_adj_adv_pipe),
                      ('nb_words', nb_words_pipe),
                      ('doc_length', doc_length_pipe),
                      ('avg_word_length', avg_word_length_pipe),
                      ('avg_sentence_length', avg_sentence_length_pipe)
                     ])

In [27]:
assert isinstance(feats, FeatureUnion)
assert len(feats.transformer_list) == 6, "Are you creating 6 pipelines? One for each feature?"
for pipe in feats.transformer_list:
    
    selector = pipe[1][0]
    if not (isinstance(selector, TextSelector) or isinstance(selector, NumberSelector)):
        raise AssertionError("pipeline is wrong, the Selectors should come first.")
        
    feature_builder = pipe[1][1]
    if not (isinstance(feature_builder, TfidfVectorizer) or isinstance(feature_builder, StandardScaler)):
        raise AssertionError("pipeline is wrong, the second thing to come should be the Tfidf or the Scaler.")
    

#### Q3.c) Feature Union
Now let's build our function to use the newly created _Feature Union_ and calculate its performance!

Create a function that will apply the improved pipeline to the provided train data, make a prediction and calculate its accuracy. The pipeline should consist of the feature union we created in Q3.b and a RandomForestClassifier.

In [28]:
def improved_pipeline(feats, X_train, X_test, y_train, y_test):
    """
    Creates a pipeline with the provided feature union and a Random Forest classifier.
    Fits the pipeline to the train data and makes a prediction with the test data.
    Outputs the fitted pipeline and the accuracy of the prediction.
    """
    
    # pipe = (...)
    # pipe.fit(...)
    # (...)
    # acc = ...
    
    # YOUR CODE HERE
    pipe = Pipeline([
        ('features', feats),
        ('classifier', RandomForestClassifier())
    ])
    pipe.fit(X_train, y_train)

    preds = pipe.predict(X_test)
    acc = np.mean(preds == y_test)
    
    return pipe, acc

In [29]:
Y = df_processed["label"]
X = df_processed.drop(columns="label")

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
pipeline_model, pipeline_acc = improved_pipeline(feats, X_train, X_test, y_train, y_test)

# asserts
assert isinstance(pipeline_model, Pipeline)
assert _hash(pipeline_model[0]) == '933e9e022884461f96b8dcbda7872290b8fd44f4003fa618ab5ff8d2a952c247', "The first part of the Pipeline is incorrect."
assert _hash(pipeline_model[1]) == 'e5fd22909dcc06f7c81407ee302879e41a75675ddfd55fa1ec640ae68a3338d8', "The second part of the Pipeline is incorrect."
assert np.allclose(pipeline_acc, 0.913, rtol=1e-1), "Something wrong with the accuracy score. Use the default parameters."

With this more complex approach we have achieved basically the same performance as our baseline. This might mean a lot of things: our features might have no real relevance to the model (which you can check with feature importances), or we have reached a plateau and can't improve the score with this technique.
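
One way to run that feature-importance check is sketched below on a tiny invented dataset. It relies on `FeatureUnion` stacking its outputs in the order of `transformer_list`, so the tf-idf columns come first and the numeric column last. `ColumnPicker` is a hypothetical stand-in for the `TextSelector` / `NumberSelector` classes above, and the data and resulting numbers are illustrative only.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnPicker(BaseEstimator, TransformerMixin):
    """Hypothetical selector: returns a Series (text) or a one-column frame (numeric)."""

    def __init__(self, key, as_frame=False):
        self.key = key
        self.as_frame = as_frame

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]] if self.as_frame else X[self.key]


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "text": ["shocking secret exposed", "council approves budget",
             "miracle cure exposed", "court approves merger",
             "shocking miracle secret", "council court budget"],
    "nb_words": [12, 40, 9, 35, 11, 42],
    "label": [1, 0, 1, 0, 1, 0],
})

toy_feats = FeatureUnion([
    ("text", Pipeline([("selector", ColumnPicker("text")),
                       ("tfidf", TfidfVectorizer())])),
    ("nb_words", Pipeline([("selector", ColumnPicker("nb_words", as_frame=True)),
                           ("standard", StandardScaler())])),
])

toy_pipe = Pipeline([
    ("features", toy_feats),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])
toy_pipe.fit(toy.drop(columns="label"), toy["label"])

# One importance per column of the stacked feature matrix
importances = toy_pipe.named_steps["classifier"].feature_importances_

# The fitted tf-idf step tells us how many leading columns belong to the text
fitted_tfidf = toy_pipe.named_steps["features"].transformer_list[0][1].named_steps["tfidf"]
n_terms = len(fitted_tfidf.vocabulary_)

text_share = importances[:n_terms].sum()
nb_words_share = importances[n_terms:].sum()
print(f"tf-idf terms: {text_share:.2f}, nb_words: {nb_words_share:.2f}")
```

Applied to `pipeline_model`, the same slicing would show how much of the total importance the five numeric features receive compared with the tf-idf terms.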

Nevertheless, it is a good score for this problem and dataset. Regardless of the score, you have learnt a lot about _SpaCy_ and _Feature Union_, and also learnt that the sky is the limit when creating features. Anything can be a feature, really. Good features, though, are a totally different thing that might need more research and validation.