# Full pipeline in action
----------

This notebook walks through one complete run of the pipeline, from scraping a web page to linking the extracted entities and relations.

**Inputs:**

- A list of web page URLs (`WEBPAGE_URL`), or alternatively a WARC file

**Outputs:**

- `results.csv` (one triple per line)

##### Table of contents:
1. Setup
2. Web scraping
3. NLP Preprocessing
4. Information extraction
    - Named entity recognition
    - Relation extraction
5. Linking
    - Entity linking
    - Relation linking
----------

### 1. Setup

##### Imports, settings & constants

In [1]:
import re
import sys

import claucy
import numpy as np
import pandas as pd
import requests
import spacy
from tqdm import tqdm

# Make the project packages (nlp, corpus_processing) importable from the parent directory
sys.path.insert(0, "../")

# Project modules
from nlp import beautifulsoup as bsp
from nlp import nlp_preprocessing as nlp_prep
from nlp.read_warc import read_warc
from corpus_processing import entity_linking as el
from corpus_processing import entity_relation_coupling as erc
from corpus_processing import ner
# Aliased as `cre` only: aliasing it as `re` would shadow the standard-library `re` module
from corpus_processing import relation_extraction as cre
from corpus_processing import relation_linking as rl

ModuleNotFoundError: No module named 'beautifulsoup'

In [2]:
# autoreload
%load_ext autoreload
%autoreload 2

In [3]:
WEBPAGE_URL = ['https://www.cbc.ca/news/world/trump-organization-taxes-guilty-1.6676368']

### 2. Web scraping

In [13]:
webpages = [bsp.fetch_webpage(url) for url in WEBPAGE_URL]
stripped_webpages = [bsp.scrape_webpage(webpage) for webpage in webpages]
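The `bsp` helpers are not shown in this notebook, so the following is only a rough sketch of what the "strip the page down to text" step could look like. It uses the standard-library `html.parser` rather than the project's own module, and the names `VisibleTextExtractor` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes while ignoring the contents of script/style tags."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def strip_html(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = "<body><p>Hello</p><script>x=1</script><p>world</p></body>"
print(strip_html(sample))  # Hello world
```

A dedicated HTML library is usually more robust against malformed markup; this version is only meant to make the step concrete.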

Alternatively, load the pages from a WARC file:

In [49]:
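# `read_warc` is imported directly from `nlp.read_warc` in the setup cell, so it is called without a module prefix
read_warc('../data/warcs/sample.warc.gz')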

NameError: name 'read_warc' is not defined
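For readers unfamiliar with the format, here is a minimal sketch of how WARC records can be iterated with the standard library alone. It is not the project's `read_warc` implementation. It assumes well-formed records, each with a `Content-Length` header, and `iter_warc_records` is a hypothetical name. A dedicated library such as `warcio` is the more usual choice for real data.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in a binary WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line in WARC stream: {line!r}")

        # Header block: "Name: value" lines terminated by an empty line
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The payload is exactly Content-Length bytes long
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


# Demonstration on an in-memory, gzip-compressed record
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n\r\n"
)
with gzip.open(io.BytesIO(gzip.compress(record)), "rb") as stream:
    for headers, payload in iter_warc_records(stream):
        print(headers["warc-target-uri"], payload)
```

For a file on disk, the same generator can be fed with `gzip.open(path, "rb")`.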

### 3. NLP Preprocessing

In [15]:
spacy_processor = spacy.load("en_core_web_md")
# Optional pipeline components, currently disabled:
# spacy_processor.add_pipe("entityLinker", last=True)  # entity linker
# claucy.add_to_pipe(spacy_processor)  # Open IE

In [16]:
spacy_docs = [nlp_prep.get_nlp_doc(page, spacy_processor) for page in stripped_webpages]

In [17]:
processed_pages = [nlp_prep.nlp_preprocessing(doc) for doc in spacy_docs]





### 4. Information extraction

##### 4.1 Named Entity Recognition

In [35]:
# Only one page is processed in this run, so keep the entity table of the first document
ner_pages = [ner.detect_entities(doc) for doc in spacy_docs][0]





In [36]:
ner_pages

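|     | label          | ner_type |
|-----|----------------|----------|
| 0   | Donald Trump's | PERSON   |
| 1   | Tuesday        | DATE     |
| 2   | Manhattan      | GPE      |
| 3   | U.S.           | GPE      |
| 4   | two            | CARDINAL |
| ... | ...            | ...      |
| 104 | 1E6            | DATE     |
| 105 | Canada         | GPE      |
| 106 | 1-866          | CARDINAL |
| 108 | Canadians      | NORP     |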
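The NER output contains possessive forms such as `Donald Trump's` and types such as `DATE` and `CARDINAL` that rarely correspond to linkable entities. One possible cleanup before linking is sketched below. It is not part of the `ner` module as far as this notebook shows. The function name, the set of skipped types, and the sample rows (including the `ORG` labels) are assumptions for illustration.

```python
# spaCy label types that are assumed here not to be worth linking
SKIPPED_TYPES = frozenset(
    {"DATE", "TIME", "CARDINAL", "ORDINAL", "MONEY", "PERCENT", "QUANTITY"}
)


def normalize_mentions(rows, skipped_types=SKIPPED_TYPES):
    """Filter, clean, and de-duplicate (label, ner_type) pairs, keeping order."""
    seen = set()
    result = []
    for label, ner_type in rows:
        if ner_type in skipped_types:
            continue
        text = label.strip()
        # Drop a trailing possessive marker (straight or typographic apostrophe)
        for suffix in ("'s", "\u2019s"):
            if text.endswith(suffix):
                text = text[: -len(suffix)]
                break
        # Drop a leading article so "the Trump Organization" matches "Trump Organization"
        if text.lower().startswith("the "):
            text = text[4:]
        key = text.lower()
        if key and key not in seen:
            seen.add(key)
            result.append((text, ner_type))
    return result


sample_rows = [
    ("Donald Trump's", "PERSON"),
    ("Tuesday", "DATE"),
    ("Manhattan", "GPE"),
    ("two", "CARDINAL"),
    ("the Trump Organization", "ORG"),
    ("Trump Organization", "ORG"),
]
for mention in normalize_mentions(sample_rows):
    print(mention)
```

With a DataFrame like `ner_pages`, the input pairs could be built with `zip(ner_pages['label'], ner_pages['ner_type'])`.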


##### 4.2 Relation Extraction

In [22]:
relations = [cre.extract_relations(doc) for doc in spacy_docs]

{}
(Donald Trump's, Tuesday, Manhattan, U.S., two, the Trump Organization, 17, the second day, the Trump Organization, New York, three years, Trump, the Trump Organization, up to $1.6 million, US, Trump, Democrats, Trump Organization, Alan Futerfas, Trump, Washington, Florida, Mar-a-Lago, 2020, Justice Department, Fulton County, Georgia, Trump, Trump, the White House, last month, Holocaust, Kanye West, Trump, Constitution, Manhattan, the Trump Organization's, Allen Weisselberg, Weisselberg, five-month, the Trump Organization, Weisselberg, Jeffrey McConney, Trump Organization, Weisselberg, Weisselberg, month-long, Trump, Weisselberg, Trump, Weisselberg, Trump, Greatest Political Witch Hunt, New York City, Trump, Weisselberg, Weisselberg, RELIANCE, Manhattan, Trump, Weisselberg, $1.7 million, US, McConney, W-2, Joshua Steinglass, Trump, Trump, Weisselberg, Trump, Trump, Steinglass, Trump, Manhattan, Alvin Bragg, Democrat, January, Bragg, Trump, District, Cyrus Vance Jr., Trump, New York,

### 5. Linking

##### 5.1 Entity linking

In [41]:
entities_to_link = ner_pages['label'].to_list()
linked_entities = []
for entity in tqdm(entities_to_link):
    linked_entities.append(el.link_entity(entity))

100%|██████████| 70/70 [00:44<00:00,  1.59it/s]
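The loop above calls `el.link_entity` once per mention, so a name that appears several times is looked up several times. If the lookup is deterministic (an assumption; the notebook does not say), results can be memoized. The sketch below wraps an arbitrary linking function with `functools.lru_cache`. `make_cached_linker` and `fake_link` are hypothetical names, and `fake_link` only stands in for the real linker.

```python
from functools import lru_cache


def make_cached_linker(link_fn, maxsize=None):
    """Wrap a mention -> result function so repeated mentions are looked up once."""

    @lru_cache(maxsize=maxsize)
    def cached(mention):
        return link_fn(mention)

    return cached


# Stand-in for the real linker: records each call so the caching is visible
calls = []


def fake_link(mention):
    calls.append(mention)
    return mention.upper()


linker = make_cached_linker(fake_link)
results = [linker(m) for m in ["Trump", "Weisselberg", "Trump", "Trump"]]
print(len(results), "results from", len(calls), "lookups")
```

In the notebook, the wrapper would be created once with `make_cached_linker(el.link_entity)` and used inside the loop in place of the direct call. Mentions must be hashable (plain strings are).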


##### 5.2 Relation linking