# Information extraction 

<br><br><hr>

Here we will use a Hugging Face model named [jtlicardo/bpmn-information-extraction](https://huggingface.co/jtlicardo/bpmn-information-extraction) to extract data from a text that is obtained by scrapping a website. 

## Loading the model 

In [1]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline




In [12]:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# Scrapping the data

In [8]:
import requests
from bs4 import BeautifulSoup

urls = ["https://www.cnbctv18.com/economy/budget-24-gdp-gst-collection-fiscal-defecit-painting-a-picture-of-a-resilient-india-saurabh-m-deshmukh-18982831.htm", "https://www.aljazeera.com/economy/2024/2/5/analysis-indias-2024-interim-budget-shows-a-changing-economy"]

text_arr = []
for i in urls :
    req = requests.get(i)
    soup = BeautifulSoup(req.content, "html.parser").get_text()
    text_arr.append(soup)
text_joined = '\n'.join(text_arr)
text_joined

"Budget '24 — How it paints a picture of a resilient IndiaJoin UsLanguageEnglishहिन्दीLatest newsMarketsEconomyPersonal FinanceIndiaWorldSpecial CoverageWorld Cup 2023G20 CoverageDiwali 2023Chandrayaan 3Election 2023Assembly Elections 2023Election AnalysisLok Sabha Election 2024Upcoming EventsEngage with CNBC-TV18Photos VideosMinis Web Stories PollsQuizHealthcarePoliticsTravelEducation & CareersSportsViewsAutoEntertainment TechnologyStartups LifestyleRetail  Real EstateMinisIndia Business Leader AwardsBranded ContentThe Growth SummitCeo awardsAccelerate Your Cloud JourneyAccelerating to a Connected FutureFinancial Services Cloud SymposiumCryptoEducation NextEy Entrepreneur Of The YearIRMAWizards of financeThe Thought LeagueDiscover CNBCTV18AboutContactAdvertiseDisclaimerTerms of UsePrivacy PolicyShowsAnchorsPolls11:11 NewsletterLife\xa0 Watch Live TV MarketPersonal FinanceBusinessEconomyFeaturedNextGenLIVE TVNewCNBC-TV18 EdgeNewSME Champion AwardsNewLatest NewsLiveMarket LiveNewsletter

In [10]:
## removing \n from text

import re
text = re.sub(r"\n", "", text_joined)
text

"Budget '24 — How it paints a picture of a resilient IndiaJoin UsLanguageEnglishहिन्दीLatest newsMarketsEconomyPersonal FinanceIndiaWorldSpecial CoverageWorld Cup 2023G20 CoverageDiwali 2023Chandrayaan 3Election 2023Assembly Elections 2023Election AnalysisLok Sabha Election 2024Upcoming EventsEngage with CNBC-TV18Photos VideosMinis Web Stories PollsQuizHealthcarePoliticsTravelEducation & CareersSportsViewsAutoEntertainment TechnologyStartups LifestyleRetail  Real EstateMinisIndia Business Leader AwardsBranded ContentThe Growth SummitCeo awardsAccelerate Your Cloud JourneyAccelerating to a Connected FutureFinancial Services Cloud SymposiumCryptoEducation NextEy Entrepreneur Of The YearIRMAWizards of financeThe Thought LeagueDiscover CNBCTV18AboutContactAdvertiseDisclaimerTerms of UsePrivacy PolicyShowsAnchorsPolls11:11 NewsletterLife\xa0 Watch Live TV MarketPersonal FinanceBusinessEconomyFeaturedNextGenLIVE TVNewCNBC-TV18 EdgeNewSME Champion AwardsNewLatest NewsLiveMarket LiveNewsletter

In [11]:
## Splitting total text into sentences

import re

match = r"\.[^\d]"
sentences = re.split(match, text)

# Coloring entities

In [14]:
from termcolor import colored, cprint

In [19]:
total_text = ''
for text in sentences:
    ner_results = nlp(text)
    colored_text = ""

    ner_data_sorted = sorted(ner_results, key=lambda x: x['start'])
    current_position = 0

    color = ''
    for entity in ner_data_sorted:
        prev = color
        start = entity['start']
        end = entity['end']
        word = entity['word']
        entity_type = entity['entity'].split("-")[1]  # Extract entity type
        colored_text += text[current_position:start]
        if entity_type == 'PER':
            color = 'red'
        elif entity_type == 'LOC':
            color = 'cyan'
        elif entity_type == 'ORG':
            color = 'yellow'
        else:
            color = 'blue'
        current_position = end
        if color != '' and prev != color:
            colored_text += colored(" " + entity_type + " ", 'black', 'on_' + color) + " "
        colored_text += colored(text[start:end], color, attrs=['bold'])

    colored_text += text[current_position:]
    total_text += colored_text

print(total_text)


Budget '24 — How it paints a picture of a resilient [46m[30m LOC [0m [1m[36mIndia[0mJoin UsLanguageEnglishहिन्दीLatest newsMarketsEconomyPersonal FinanceIndiaWorldSpecial CoverageWorld Cup 2023G20 CoverageDiwali 2023Chandrayaan 3Election 2023Assembly Elections 2023Election AnalysisLok Sabha Election 2024Upcoming EventsEngage with [43m[30m ORG [0m [1m[33mC[0mNBC-TV18Photos VideosMinis Web Stories PollsQuizHealthcarePoliticsTravelEducation & CareersSportsViewsAutoEntertainment TechnologyStartups LifestyleRetail  Real EstateMinisIndia Business Leader AwardsBranded ContentThe Growth SummitCeo awardsAccelerate Your Cloud JourneyAccelerating to a Connected FutureFinancial Services Cloud SymposiumCryptoEducation NextEy Entrepreneur Of The YearIRMAWizards of financeThe Thought LeagueDiscover CNBCTV18AboutContactAdvertiseDisclaimerTerms of UsePrivacy PolicyShowsAnchorsPolls11:11 NewsletterLife  Watch Live TV MarketPersonal FinanceBusinessEconomyFeaturedNextGenLIVE TVNewCNBC-TV18 Edg