# Information Extraction

**Named Entity Recognition using spaCy**

In [1]:
import spacy

In [5]:
nlp = spacy.load("en_core_web_sm")
text = "Donald Trump is president of America"
doc = nlp(text)
print("Named Entities:\n")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")


Named Entities:

Donald Trump (PERSON)
America (GPE)


**Relationship Extraction**

In [21]:
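# Look for copular patterns such as "<subject> is <attribute> of <place>".
for token in doc:
    # "attr" marks the attribute of a copular verb ("president" in "is president").
    if token.dep_ == "attr" and token.head.pos_ == "AUX":
        # Subject: an nsubj child to the left of the verb.
        sub = [w for w in token.head.lefts if w.dep_ == "nsubj"]
        # Object: any GPE entity token inside the attribute's subtree.
        obj = [w for w in token.subtree if w.ent_type_ == "GPE"]

        if sub and obj:
            sub_phrase = " ".join([w.text for w in sub[0].subtree])
            # Note: the preposition "of" is hard-coded, not read from the parse.
            print(f"{sub_phrase} -> {token.head.text} {token.text} of -> {obj[0].text}")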

Donald Trump -> is president of -> America


**Named Entity Recognition (NER) using NLTK and spaCy**

In [3]:
%pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.13.3-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Using cached soupsieve-2.6-py3-none-any.whl.metadata (4.6 kB)
Downloading beautifulsoup4-4.13.3-py3-none-any.whl (186 kB)
Using cached soupsieve-2.6-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.13.3 soupsieve-2.6
Note: you may need to restart the kernel to use updated packages.


In [8]:
import pandas as pd
import nltk
import requests
from bs4 import BeautifulSoup

In [9]:
nlp = spacy.load("en_core_web_sm")
pd.set_option("display.max_rows", 200)

In [10]:
text = """Trinamool Congress leader Mahua Moitra has moved the Supreme Court against her expulsion from the Lok Sabha over the cash-for-query allegations against her. Moitra was ousted from the Parliament last week after the Ethics Committee of the Lok Sabha found her 
guilty of jeopardising national security by sharing her parliamentary portal's login credentials with businessman Darshan Hiranandani."""
doc = nlp(text)
# Print each entity with its character offsets into `text` and its label.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Trinamool Congress 0 18 ORG
Mahua Moitra 26 38 PERSON
the Supreme Court 49 66 ORG
Moitra 157 163 NORP
Parliament 184 194 ORG
last week 195 204 DATE
the Ethics Committee 211 231 ORG
Darshan Hiranandani 374 393 PERSON
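
The two numbers printed for each entity are `start_char` and `end_char`: character offsets into the original string, with the end offset exclusive, so plain slicing recovers the entity text. A minimal sketch, using only the first sentence of the text above and the first three offsets from the output. The `spans` list is copied by hand for illustration and is not produced by spaCy here.

```python
# First sentence of the article text used above (offsets start at 0).
sentence = (
    "Trinamool Congress leader Mahua Moitra has moved the Supreme Court "
    "against her expulsion from the Lok Sabha over the cash-for-query "
    "allegations against her."
)

# (start_char, end_char, label) triples copied by hand from the output above.
spans = [(0, 18, "ORG"), (26, 38, "PERSON"), (49, 66, "ORG")]

# Slicing with the offsets recovers each entity's surface text.
recovered = [(sentence[start:end], label) for start, end, label in spans]
for entity_text, label in recovered:
    print(f"{entity_text!r} -> {label}")
```

With a live `doc`, the same check would be `text[ent.start_char:ent.end_char] == ent.text` for every `ent` in `doc.ents`.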


In [11]:
from spacy import displacy
displacy.render(doc, style="ent")

In [12]:
# Collect each entity's text, label, and lemma into a DataFrame.
ents = [(ent.text, ent.label_, ent.lemma_) for ent in doc.ents]
df = pd.DataFrame(ents, columns=["text", "type", "lemma"])
print(df)

                   text    type                 lemma
0    Trinamool Congress     ORG    Trinamool Congress
1          Mahua Moitra  PERSON          Mahua Moitra
2     the Supreme Court     ORG     the Supreme Court
3                Moitra    NORP                Moitra
4            Parliament     ORG            Parliament
5             last week    DATE             last week
6  the Ethics Committee     ORG  the Ethics Committee
7   Darshan Hiranandani  PERSON   Darshan Hiranandani
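
Once the entities are in tabular form, a common next step is to tally and group them by type. A minimal sketch with the standard library only, assuming the (text, type) pairs are copied by hand from the table above rather than recomputed. The names `rows`, `counts`, and `by_type` are illustrative.

```python
from collections import Counter, defaultdict

# (text, type) pairs copied by hand from the DataFrame printed above.
rows = [
    ("Trinamool Congress", "ORG"),
    ("Mahua Moitra", "PERSON"),
    ("the Supreme Court", "ORG"),
    ("Moitra", "NORP"),
    ("Parliament", "ORG"),
    ("last week", "DATE"),
    ("the Ethics Committee", "ORG"),
    ("Darshan Hiranandani", "PERSON"),
]

# Number of entities per label.
counts = Counter(label for _, label in rows)

# Entity texts grouped under their label, in order of appearance.
by_type = defaultdict(list)
for entity_text, label in rows:
    by_type[label].append(entity_text)

for label, n in counts.most_common():
    print(label, n, by_type[label])
```

With the DataFrame itself, `df["type"].value_counts()` should give the same tally. Grouping also makes a likely mislabel easy to spot: the second `Moitra` is tagged `NORP` even though it refers to the same person as `Mahua Moitra`.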
