# Full pipeline in action
----------

This notebook sets up and runs the full pipeline end to end: scraping, NLP preprocessing, information extraction, and linking.

**Inputs:**

- A list of webpage URLs (`WEBPAGE_URL`), or alternatively WARC files

**Outputs:**

- `results.csv` (one triple per line)

##### Table of contents:
1. Setup
2. Webscraping
3. NLP Preprocessing
4. Information extraction
    - Named entity recognition
    - Relation extraction
5. Linking
    - Entity linking
    - Relation linking
----------

### 1. Setup

##### Imports, settings & constants

In [12]:
import spacy
import pandas as pd
import numpy as np
import re
import requests
import sys
from tqdm import tqdm

import claucy


sys.path.insert(0, "../")
# Project-local modules (resolved through the sys.path entry above)
from nlp import beautifulsoup as bsp
from nlp import nlp_preprocessing as nlp_prep
from nlp.read_warc import read_warc
from corpus_processing import relation_extraction as cre
from corpus_processing import entity_relation_coupling as erc
from corpus_processing import ner
# Do not alias relation_extraction as `re`: that would shadow the standard-library `re` imported above
from corpus_processing import relation_linking as rl
from corpus_processing import entity_linking as el

In [13]:
# autoreload
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [14]:
WEBPAGE_URL = ['https://www.cbc.ca/news/world/trump-organization-taxes-guilty-1.6676368']

### 2. Webscraping

In [15]:
webpages = [bsp.fetch_webpage(url) for url in WEBPAGE_URL]
stripped_webpages = [bsp.scrape_webpage(webpage) for webpage in webpages]
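
The helpers `bsp.fetch_webpage` and `bsp.scrape_webpage` are project code that is not shown in this notebook. As a rough illustration of the stripping step only, the sketch below pulls the visible text out of an HTML string with the standard-library `html.parser`. The names `VisibleTextParser` and `strip_html` are hypothetical, and the project's own implementation may work differently.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the content of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def strip_html(html: str) -> str:
    """Hypothetical stand-in for the stripping step: HTML in, plain text out."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(strip_html(sample))
```

Real news pages also carry navigation, footers, and comment widgets, which a tag-level filter like this one does not remove.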

Alternatively, read the pages from WARC files:

In [17]:
# read_warc('../data/warcs/sample/9laughs.com/sleeping-with-three-girls-at-the-same-time')

### 3. NLP Preprocessing

In [18]:
spacy_processor = spacy.load("en_core_web_md")
# spacy_processor.add_pipe("entityLinker", last=True)  # entity linker
# claucy.add_to_pipe(spacy_processor)  # Open IE

In [19]:
spacy_docs = [nlp_prep.get_nlp_doc(page, spacy_processor) for page in stripped_webpages]

In [20]:
processed_pages = [nlp_prep.nlp_preprocessing(doc) for doc in spacy_docs]





### 4. Information extraction

##### 4.1 Named Entity Recognition

In [21]:
# Only one URL is processed here, so keep the entity table of the first page
ner_pages = [ner.detect_entities(doc) for doc in spacy_docs][0]





In [22]:
ner_pages

| | label | ner_type |
|---|---|---|
| 0 | Donald Trump's | PERSON |
| 1 | Tuesday | DATE |
| 2 | Manhattan | GPE |
| 3 | U.S. | GPE |
| 4 | two | CARDINAL |
| ... | ... | ... |
| 104 | 1E6 | DATE |
| 105 | Canada | GPE |
| 106 | 1-866 | CARDINAL |
| 108 | Canadians | NORP |
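
`ner.detect_entities` is project code that is not shown here. The `label` and `ner_type` columns above appear to correspond to what spaCy exposes as `ent.text` and `ent.label_` on `doc.ents`, so one plausible shape for such a helper is sketched below. The function name `entities_to_rows` and the de-duplication step are assumptions, and a `SimpleNamespace` stands in for a spaCy `Doc` so the sketch runs without a model download.

```python
from types import SimpleNamespace


def entities_to_rows(doc):
    """Return one {'label', 'ner_type'} dict per unique entity in doc.ents."""
    seen = set()
    rows = []
    for ent in doc.ents:
        key = (ent.text, ent.label_)
        if key in seen:  # assumption: repeated mentions are kept only once
            continue
        seen.add(key)
        rows.append({"label": ent.text, "ner_type": ent.label_})
    return rows


# Stand-in for a spaCy Doc: only the attributes used above are provided.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Manhattan", label_="GPE"),
    SimpleNamespace(text="Tuesday", label_="DATE"),
    SimpleNamespace(text="Manhattan", label_="GPE"),
])
rows = entities_to_rows(fake_doc)
print(rows)
```

With a real `Doc`, the resulting list of dicts can be passed directly to `pd.DataFrame(rows)`.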


In [46]:
entities_to_link = ner_pages['label'].to_list()
linked_entities = []
for entity in tqdm(entities_to_link):
    linked_entities.append(el.link_entity(entity))

100%|██████████| 70/70 [00:44<00:00,  1.56it/s]
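
The loop above calls `el.link_entity` once per mention, and a news article repeats mentions such as "Trump" many times. If the linker returns the same result for the same string (an assumption about `el.link_entity`, which is not shown here), caching avoids the repeated work. The sketch below uses `functools.lru_cache`; `slow_link` is a hypothetical placeholder that only builds a toy URL and does not reflect how the real linker resolves entities.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive function really runs


def slow_link(mention: str) -> str:
    """Hypothetical placeholder for an expensive linker call."""
    global call_count
    call_count += 1
    return "https://en.wikipedia.org/wiki/" + mention.replace(" ", "_")


@lru_cache(maxsize=None)
def cached_link(mention: str) -> str:
    """Same result as slow_link, but each distinct mention is linked once."""
    return slow_link(mention)


mentions = ["Trump", "Weisselberg", "Trump", "Manhattan", "Trump"]
linked = [cached_link(m) for m in mentions]
print(call_count)
```

Five mentions trigger only three linker calls here, one per distinct string.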


##### 4.2 Relation Extraction

In [23]:
relations = [cre.extract_relations(doc) for doc in spacy_docs]

{}
(Donald Trump's, Tuesday, Manhattan, U.S., two, the Trump Organization, 17, the second day, the Trump Organization, New York, three years, Trump, the Trump Organization, up to $1.6 million, US, Trump, Democrats, Trump Organization, Alan Futerfas, Trump, Washington, Florida, Mar-a-Lago, 2020, Justice Department, Fulton County, Georgia, Trump, Trump, the White House, last month, Holocaust, Kanye West, Trump, Constitution, Manhattan, the Trump Organization's, Allen Weisselberg, Weisselberg, five-month, the Trump Organization, Weisselberg, Jeffrey McConney, Trump Organization, Weisselberg, Weisselberg, month-long, Trump, Weisselberg, Trump, Weisselberg, Trump, Greatest Political Witch Hunt, New York City, Trump, Weisselberg, Weisselberg, RELIANCE, Manhattan, Trump, Weisselberg, $1.7 million, US, McConney, W-2, Joshua Steinglass, Trump, Trump, Weisselberg, Trump, Trump, Steinglass, Trump, Manhattan, Alvin Bragg, Democrat, January, Bragg, Trump, District, Cyrus Vance Jr., Trump, New York,

In [32]:
# Turn list of tuples into dataframe
relations_df = pd.DataFrame(relations[0], columns=['relation', 'object', 'subject'])
relations_df['object'] = relations_df['object'].apply(lambda x: x.text)
relations_df['subject'] = relations_df['subject'].apply(lambda x: x.text)

| | relation | object | subject |
|---|---|---|---|
| 0 | convicted of | Donald Trump's | Tuesday |
| 1 | brought by the | Tuesday | Manhattan |
| 2 | found in the | Trump | Constitution |
| 3 | including | 17 | the second day |
| 4 | falsifying | 17 | the second day |
| ... | ... | ... | ... |
| 140 | Join the | our Submission Guidelines | Audience Relations |
| 141 | create a | CBC | Canadians |
| 142 | including | 17 | the second day |
| 143 | Described | Canadians | CBC |


In [33]:
object_entities = relations_df['object'].to_list()
subject_entities = relations_df['subject'].to_list()
obj_ents, subj_ents = [], []
print('Linking object entities')
for entity in tqdm(object_entities):
    obj_ents.append(el.link_entity(entity))
print('Linking subject entities')
for entity in tqdm(subject_entities):
    subj_ents.append(el.link_entity(entity))

Linking object entities
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 145/145 [01:22<00:00,  1.75it/s]


Linking subject entities


100%|██████████| 145/145 [01:25<00:00,  1.70it/s]
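
The `huggingface/tokenizers` warning in the output above names its own remedy: set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming the variable is set early in the notebook (for example in the Setup cell) before the linker is first used:

```python
import os

# Disable tokenizer-level parallelism so a later fork does not trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```

Choosing `"false"` trades some tokenization speed for a quiet, deadlock-free run; `"true"` is the other value the warning lists.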


Attach the linked Wikipedia URLs to the relations, then keep only the relations whose object and subject links both appear among the linked named entities:

In [36]:
relations_df['object_wiki'] = obj_ents
relations_df['subject_wiki'] = subj_ents
relations_df

| | relation | object | subject | object_wiki | subject_wiki |
|---|---|---|---|---|---|
| 0 | convicted of | Donald Trump's | Tuesday | https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/wiki/Tuesday_Weld |
| 1 | brought by the | Tuesday | Manhattan | https://en.wikipedia.org/wiki/Tuesday_Weld | https://en.wikipedia.org/wiki/New_York_City |
| 2 | found in the | Trump | Constitution | https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/wiki/U.S._state |
| 3 | including | 17 | the second day | https://en.wikipedia.org/wiki/1 | https://en.wikipedia.org/wiki/Battle_of_Thermo... |
| 4 | falsifying | 17 | the second day | https://en.wikipedia.org/wiki/1 | https://en.wikipedia.org/wiki/Battle_of_Thermo... |
| ... | ... | ... | ... | ... | ... |
| 140 | Join the | our Submission Guidelines | Audience Relations | https://en.wikipedia.org/wiki/BDSM | https://en.wikipedia.org/wiki/Public_relations |
| 141 | create a | CBC | Canadians | https://en.wikipedia.org/wiki/Columbia_Pictures | https://en.wikipedia.org/wiki/Canada |
| 142 | including | 17 | the second day | https://en.wikipedia.org/wiki/1 | https://en.wikipedia.org/wiki/Battle_of_Thermo... |
| 143 | Described | Canadians | CBC | https://en.wikipedia.org/wiki/Canada | https://en.wikipedia.org/wiki/Columbia_Pictures |


In [54]:
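# Keep rows whose object and subject links both occur among the linked NER entities
both_linked = (
    relations_df['object_wiki'].isin(linked_entities)
    & relations_df['subject_wiki'].isin(linked_entities)
)
relations_df.loc[both_linked, :]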

| | relation | object | subject | object_wiki | subject_wiki |
|---|---|---|---|---|---|
| 0 | convicted of | Donald Trump's | Tuesday | https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/wiki/Tuesday_Weld |
| 1 | brought by the | Tuesday | Manhattan | https://en.wikipedia.org/wiki/Tuesday_Weld | https://en.wikipedia.org/wiki/New_York_City |
| 2 | found in the | Trump | Constitution | https://en.wikipedia.org/wiki/Donald_Trump | https://en.wikipedia.org/wiki/U.S._state |
| 3 | including | 17 | the second day | https://en.wikipedia.org/wiki/1 | https://en.wikipedia.org/wiki/Battle_of_Thermo... |
| 4 | falsifying | 17 | the second day | https://en.wikipedia.org/wiki/1 | https://en.wikipedia.org/wiki/Battle_of_Thermo... |
| ... | ... | ... | ... | ... | ... |
| 140 | Join the | our Submission Guidelines | Audience Relations | https://en.wikipedia.org/wiki/BDSM | https://en.wikipedia.org/wiki/Public_relations |
| 141 | create a | CBC | Canadians | https://en.wikipedia.org/wiki/Columbia_Pictures | https://en.wikipedia.org/wiki/Canada |
| 142 | including | 17 | the second day | https://en.wikipedia.org/wiki/1 | https://en.wikipedia.org/wiki/Battle_of_Thermo... |
| 143 | Described | Canadians | CBC | https://en.wikipedia.org/wiki/Canada | https://en.wikipedia.org/wiki/Columbia_Pictures |
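
The notebook's stated output is `results.csv` with one triple per line, but the cell that writes it is not shown. A minimal sketch of that step follows, using the standard-library `csv` module. The function name `write_triples`, the (subject, relation, object) column order, and the choice of the `*_wiki` columns are all assumptions, and the toy row uses placeholder URLs rather than real pipeline output.

```python
import csv
import os
import tempfile


def write_triples(rows, path):
    """Write one (subject, relation, object) triple per line as CSV."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for row in rows:
            writer.writerow(
                [row["subject_wiki"], row["relation"], row["object_wiki"]]
            )


# Toy row shaped like one record of relations_df (values are illustrative only).
toy_rows = [
    {
        "subject_wiki": "https://example.org/A",
        "relation": "create a",
        "object_wiki": "https://example.org/B",
    },
]
out_path = os.path.join(tempfile.mkdtemp(), "results.csv")
write_triples(toy_rows, out_path)
with open(out_path, encoding="utf-8") as handle:
    written = handle.read()
print(written)
```

With the dataframe from this notebook, `relations_df.to_dict("records")` yields rows in the shape `write_triples` expects.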


### 5. Linking