# Information Extraction 


We will scrape the text of a news article from a website and extract named entities from it using the [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner?text=George+Washington+went+to+Washington.) model from Hugging Face.

# Loading the model

In [1]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline




In [8]:
tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

# Extracting data from the website

In [11]:
import requests
from bs4 import BeautifulSoup

In [99]:
def scrape_website(url):
    """Return the article's paragraph text, or None if the request fails."""
    # A timeout keeps the cell from hanging if the server never responds.
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.content, 'html.parser')
    # The article body is expected inside <div class="articlebodycontent">.
    story_body_divs = soup.find_all('div', class_='articlebodycontent')
    scraped_text = ""
    for story_body_div in story_body_divs:
        for p_tag in story_body_div.find_all('p'):
            scraped_text += p_tag.get_text(separator=' ', strip=True) + ' '
    return scraped_text

url = 'https://www.thehindu.com/business/budget/ayushman-bharat-health-care-cover-to-be-extended-to-all-asha-anganwadi-workers-says-fm-nirmala-sitharaman/article67799628.ece'  
text = scrape_website(url)
text

'Health cover under the Ayushman Bharat Pradhan Mantri Jan Arogya Yojana will be extended to all Accredited Social Health Activist (ASHA) and anganwadi workers and helpers said Finance Minister Nirmala Sitharaman on Thursday, while announcing the interim Union Budget 2024-25. Union Finance Minister Nirmala Sitharaman holding a folder-case containing the Interim Budget 2024\n| Photo Credit:\nSHIV KUMAR PUSHPAKAR While a full budget for 2024-25 will be announced after the new government is formed following the Lok Sabha elections later this year the budget allocation saw an increase from ₹89,155 crore in 2023-24 to ₹90,658.63 crore for the Ministry of Health and Family Welfare while Ayush Ministry also saw a hike from ₹3,647.50 crore to ₹3,712.49 crore. Budget 2024 live | Interim budget leaves tax structure untouched; FM details Centre’s achievements For the health sector, the Minister added that India will be bringing in the services of the newly designed U-WIN platform for managing imm
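The scraped string above still contains stray newlines (for example around the photo credit). A minimal sketch of one way to tidy it before running NER, assuming that collapsing all whitespace runs into single spaces is acceptable; `normalize_whitespace` is a hypothetical helper, not part of the original notebook:

```python
import re

def normalize_whitespace(raw_text):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return re.sub(r"\s+", " ", raw_text).strip()

# A fragment taken from the scraped output above.
sample = "containing the Interim Budget 2024\n| Photo Credit:\nSHIV KUMAR PUSHPAKAR While a full budget"
print(normalize_whitespace(sample))
# containing the Interim Budget 2024 | Photo Credit: SHIV KUMAR PUSHPAKAR While a full budget
```

In the notebook this could be applied as `text = normalize_whitespace(text)` right after scraping.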

# Performing NER on the extracted data

In [111]:
text = "Budget has proposed to withdraw all direct tax demands taxpayers are still disputing tax demands  Income-tax Department has been notified of Tax filing portal TaxSpanner says it is Taxpayers, especially those with pending taxpayers who are yet to pay pending Rapid urbanisation, high property prices, in Grihum Housing Finance (formerly Poon the specifics of the proposed scheme should be government’s existing credit linked subsidy scheme Budget reiterated government’s focus on infrastructure the input prices will further provide support to the allocation for public health insurance scheme increased from Rs development, port connectivity and improvement of tourism infrastructure"

In [112]:
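# aggregation_strategy="simple" merges sub-word tokens into whole entities;
# it replaces the deprecated grouped_entities flag (which expects a bool, not the string 'true').
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")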
ner_data = nlp(text)
print(ner_data)

[{'entity_group': 'ORG', 'score': 0.612234, 'word': 'Income', 'start': 98, 'end': 104}, {'entity_group': 'ORG', 'score': 0.61001265, 'word': 'tax Department', 'start': 105, 'end': 119}, {'entity_group': 'ORG', 'score': 0.88566506, 'word': 'TaxSpanner', 'start': 159, 'end': 169}, {'entity_group': 'ORG', 'score': 0.4449993, 'word': 'Tax', 'start': 181, 'end': 184}, {'entity_group': 'ORG', 'score': 0.74905574, 'word': 'Grihum Housing Finance', 'start': 304, 'end': 326}, {'entity_group': 'ORG', 'score': 0.58311063, 'word': 'Poon', 'start': 337, 'end': 341}]
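Several of the entities above have low scores or are fragments (for example `'Tax'` at 0.44). A minimal sketch of post-processing the pipeline output, assuming a cut-off of 0.7 is reasonable for this text; the threshold and the helper names `filter_entities` and `group_by_label` are illustrative choices, not part of the original notebook:

```python
from collections import defaultdict

# Three entries copied from the pipeline output above.
sample_entities = [
    {'entity_group': 'ORG', 'score': 0.88566506, 'word': 'TaxSpanner', 'start': 159, 'end': 169},
    {'entity_group': 'ORG', 'score': 0.4449993, 'word': 'Tax', 'start': 181, 'end': 184},
    {'entity_group': 'ORG', 'score': 0.74905574, 'word': 'Grihum Housing Finance', 'start': 304, 'end': 326},
]

def filter_entities(entities, min_score=0.7):
    """Keep only entities whose confidence score is at least min_score (assumed threshold)."""
    return [e for e in entities if e['score'] >= min_score]

def group_by_label(entities):
    """Map each entity group (PER, ORG, LOC, ...) to the list of words found for it."""
    grouped = defaultdict(list)
    for e in entities:
        grouped[e['entity_group']].append(e['word'])
    return dict(grouped)

print(group_by_label(filter_entities(sample_entities)))
# {'ORG': ['TaxSpanner', 'Grihum Housing Finance']}
```

With the full result, the same call would be `group_by_label(filter_entities(ner_data))`.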


# Using colored text to display the NER results

In [113]:
from termcolor import colored

In [114]:

# Foreground color for each entity group; the label background uses the same color.
ENTITY_COLORS = {'PER': 'red', 'ORG': 'green', 'LOC': 'cyan'}
DEFAULT_COLOR = 'blue'  # any other entity group

colored_text = ""
current_position = 0

# Walk the entities in text order, copying the plain text between them.
for entity in sorted(ner_data, key=lambda x: x['start']):
    entity_group = entity['entity_group']
    entity_color = ENTITY_COLORS.get(entity_group, DEFAULT_COLOR)

    # Unlabelled text between the previous entity and this one.
    colored_text += text[current_position:entity['start']]

    # Bold black label on a background that matches the entity color.
    colored_label = colored(entity_group + ':', 'black', on_color='on_' + entity_color, attrs=['bold'])
    colored_entity = colored(entity['word'], entity_color, attrs=['bold'])
    colored_text += colored_label + ' ' + colored_entity

    current_position = entity['end']

# Append whatever follows the last entity.
colored_text += text[current_position:]

print(colored_text)

Budget has proposed to withdraw all direct tax demands taxpayers are still disputing tax demands  ORG: Income-ORG: tax Department has been notified of Tax filing portal ORG: TaxSpanner says it is ORG: Taxpayers, especially those with pending taxpayers who are yet to pay pending Rapid urbanisation, high property prices, in ORG: Grihum Housing Finance (formerly ORG: Poon the specifics of the proposed scheme should be government’s existing credit linked subsidy scheme Budget reiterated government’s focus on infrastructure the input prices will further provide support to the allocation for public health insurance scheme increased from Rs development, port connectivity and improvement of tourism infrastructure
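ANSI colors only render in terminals and notebook front ends that support them. As a hedged alternative, the same walk over the entities can emit plain bracketed tags instead. This is a sketch under the assumption that `start` and `end` are character offsets into the input string, as in the pipeline output above; `annotate_plain` is a hypothetical helper:

```python
def annotate_plain(source_text, entities):
    """Wrap each entity span as [LABEL: text] without using any color codes."""
    parts = []
    position = 0
    for entity in sorted(entities, key=lambda e: e['start']):
        start, end = entity['start'], entity['end']
        parts.append(source_text[position:start])  # text before the entity
        parts.append(f"[{entity['entity_group']}: {source_text[start:end]}]")
        position = end
    parts.append(source_text[position:])  # text after the last entity
    return "".join(parts)

# Small demonstration with hand-computed offsets.
demo_text = "Tax filing portal TaxSpanner says it is"
demo_entities = [{'entity_group': 'ORG', 'word': 'TaxSpanner', 'start': 18, 'end': 28}]
print(annotate_plain(demo_text, demo_entities))
# Tax filing portal [ORG: TaxSpanner] says it is
```

Slicing the source text by offsets, rather than using the `word` field, keeps the original spelling and spacing even where the tokenizer's reconstruction differs.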
