
Aim: To build a POS tagger for unstructured web documents
1. Introduction
Building a mathematical model for POS tagging of web documents means adapting traditional POS tagging approaches to
the characteristics and challenges of web content, which often includes informal language, HTML tags, and diverse writing styles. Such a
model can be formulated as follows:
1.1 Define Variables and Parameters
• Observations (Words): $W = (w_1, w_2, \ldots, w_n)$, where $w_i$ represents the i-th word in the sequence extracted from a web document.
• Hidden States (POS tags): $T = (t_1, t_2, \ldots, t_n)$, where $t_i$ represents the POS tag corresponding to $w_i$.
1.2 Model Assumptions
Given the nature of web documents, additional assumptions may include:
• HTML Tags: HTML markup is not natural-language text, so it is removed or handled separately before tagging.
• Informal Language: The model accommodates informal language and text found in comments, forums, social media, etc.
• Contextual Information: The model incorporates contextual cues from surrounding text and links.
1.3 Transition and Emission Probabilities
Similar to a traditional POS tagging model:
• Transition Probabilities: $P(t_i \mid t_{i-1})$ - Probability of transitioning to POS tag $t_i$ given the previous tag $t_{i-1}$.
• Emission Probabilities: $P(w_i \mid t_i)$ - Probability of observing word $w_i$ given its POS tag $t_i$.
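The two probability tables above can be estimated from a tagged corpus by counting. The following is a minimal sketch, assuming relative-frequency (maximum-likelihood) estimates without smoothing; the toy corpus and the helper name `estimate_hmm` are hypothetical and serve only as illustration.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (hypothetical data, for illustration only).
tagged_sentences = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]


def estimate_hmm(sentences):
    """Estimate transition and emission probabilities by relative frequency."""
    transition_counts = defaultdict(Counter)  # previous tag -> counts of next tag
    emission_counts = defaultdict(Counter)    # tag -> counts of emitted word
    for sentence in sentences:
        previous_tag = None
        for word, tag in sentence:
            emission_counts[tag][word] += 1
            if previous_tag is not None:
                transition_counts[previous_tag][tag] += 1
            previous_tag = tag

    def normalize(counts):
        # Turn each Counter into a probability distribution that sums to 1.
        return {
            key: {item: n / sum(counter.values()) for item, n in counter.items()}
            for key, counter in counts.items()
        }

    return normalize(transition_counts), normalize(emission_counts)


transitions, emissions = estimate_hmm(tagged_sentences)
print(transitions["DT"]["NN"])  # estimate of P(NN | DT)
print(emissions["NN"]["dog"])   # estimate of P(dog | NN)
```

Unsmoothed counts assign zero probability to any word or tag pair not seen in training, which is a real concern for web text with its many rare and misspelled tokens, so a practical tagger would likely add some form of smoothing.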
1.4 Handling HTML Tags
Web documents often contain HTML tags that are not part of the natural language text but are crucial for understanding the structure and
context.
• Preprocessing: Before POS tagging, remove or appropriately handle HTML tags to avoid misinterpretation or errors in tagging.
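One way to sketch this preprocessing step uses only Python's standard-library `html.parser` module. This is an alternative to BeautifulSoup, shown here because it has no external dependencies; the class name `VisibleTextExtractor` and the function `strip_html` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_html(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = "<p>Hello <b>world</b>!</p><script>var x = 1;</script>"
print(strip_html(sample))
```

Joining chunks with spaces keeps words from adjacent elements from running together, at the cost of occasionally separating punctuation from the preceding word.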
1.5 Mathematical Formulation
The joint probability of a sequence of words and their POS tags can be expressed as:
$$P(T, W) = P(t_1) \cdot \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \cdot \prod_{i=1}^{n} P(w_i \mid t_i)$$
Where:
• $P(t_1)$ is the initial state probability distribution.
• $P(t_i \mid t_{i-1})$ are the transition probabilities.
• $P(w_i \mid t_i)$ are the emission probabilities.
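As a worked illustration of the joint probability, the sketch below evaluates it for a two-word sequence under two candidate tag sequences. All probability values are hypothetical toy numbers, and the computation is done in log space, a common choice to avoid numerical underflow on long sequences.

```python
import math

# Hypothetical toy parameters, chosen only to illustrate the formula.
initial = {"DT": 0.8, "NN": 0.2}
transition = {
    "DT": {"NN": 0.9, "DT": 0.1},
    "NN": {"NN": 0.3, "DT": 0.7},
}
emission = {
    "DT": {"the": 0.7, "dog": 0.0},
    "NN": {"the": 0.01, "dog": 0.4},
}


def joint_log_probability(words, tags):
    """Return log P(T, W): initial term plus transition and emission terms."""
    if len(words) != len(tags) or not words:
        raise ValueError("words and tags must be non-empty and equally long")
    factors = [initial[tags[0]]]
    factors += [transition[prev][cur] for prev, cur in zip(tags, tags[1:])]
    factors += [emission[tag][word] for word, tag in zip(words, tags)]
    if any(p == 0.0 for p in factors):
        return float("-inf")  # log(0): the tag sequence is impossible
    return sum(math.log(p) for p in factors)


words = ["the", "dog"]
good = joint_log_probability(words, ["DT", "NN"])
bad = joint_log_probability(words, ["NN", "DT"])
print(math.exp(good))  # 0.8 * 0.9 * 0.7 * 0.4
print(good > bad)
```

Tagging then amounts to choosing the tag sequence with the highest joint probability for the observed words.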
1.6 Implementation Considerations
• Tokenization: Split the web document into words or tokens, taking into account HTML tags and their potential impact on text segmentation.
• Feature Engineering: Extract relevant features such as word context, surrounding HTML structure, and potentially links and metadata.
• Model Selection: Choose an appropriate model (e.g., HMM, CRF, neural network-based models) based on the complexity and specific
requirements of web document POS tagging.
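For feature-based models, the feature engineering step could look roughly like the sketch below. The function `token_features`, its feature names, and the `enclosing_tag` parameter (the HTML element a token was found in) are all assumptions made for illustration, not a fixed scheme.

```python
def token_features(tokens, index, enclosing_tag=None):
    """Build a feature dictionary for tokens[index].

    enclosing_tag is the name of the HTML element the token came from
    (for example "a" or "h1"), if the preprocessing step recorded it.
    """
    word = tokens[index]
    is_last = index == len(tokens) - 1
    return {
        "word.lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[:1].isupper(),
        "is_digit": word.isdigit(),
        # Neighbouring words give local context; sentinels mark the edges.
        "prev_word": tokens[index - 1].lower() if index > 0 else "<START>",
        "next_word": "<END>" if is_last else tokens[index + 1].lower(),
        "html_tag": enclosing_tag or "none",
    }


tokens = ["Read", "the", "docs"]
features = token_features(tokens, 2, enclosing_tag="a")
for name, value in features.items():
    print(name, value)
```

A dictionary per token keeps the features readable and easy to extend, for example with link or metadata attributes.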
1.7 Example Application: Handling Informal Language and HTML Tags
Consider a scenario where a web document includes comments and HTML tags. Your model might:
• Ignore HTML tags during the tagging process.
• Use additional features like user-generated content characteristics (e.g., emoji usage, abbreviations) to improve tagging accuracy.
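The second point can be sketched as a small tokenizer that keeps emoticons intact and maps common abbreviations to their full forms. The abbreviation table and the emoticon pattern below are hypothetical and deliberately tiny; the pattern covers only a few ASCII emoticons, so Unicode emoji would come out as ordinary symbol tokens without the emoticon flag.

```python
import re

# Hypothetical abbreviation table; a real system would need a much larger one.
ABBREVIATIONS = {"u": "you", "gr8": "great", "thx": "thanks"}

EMOTICON = r"[:;=][-']?[)(DPp]"
# Emoticons are tried first so ":-)" is not split into separate punctuation
# marks; then words (with an optional apostrophe part); then any other symbol.
TOKEN_PATTERN = re.compile(EMOTICON + r"|\w+(?:'\w+)?|[^\w\s]")


def tokenize_informal(text):
    """Tokenize informal text, flagging emoticons and expanding abbreviations."""
    results = []
    for match in TOKEN_PATTERN.finditer(text):
        token = match.group()
        results.append({
            "token": token,
            "normalized": ABBREVIATIONS.get(token.lower(), token),
            "is_emoticon": re.fullmatch(EMOTICON, token) is not None,
        })
    return results


for item in tokenize_informal("thx u r gr8 :-)"):
    print(item)
```

The `normalized` form could be passed to the tagger in place of the raw token, while `is_emoticon` could serve as an extra feature.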


In [5]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

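# Download NLTK resources (if not already downloaded).
# Note: newer NLTK releases may instead expect 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'; adjust the names if a LookupError is raised.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')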

# Create a function to fetch web content and parse it using BeautifulSoup
def fetch_and_parse(url):
    """Fetches content from a URL and parses the text using BeautifulSoup."""
    response = requests.get(url, timeout=10)  # Fetch web page content; do not hang forever
    response.raise_for_status()  # Fail early on HTTP errors such as 404 or 500
    soup = BeautifulSoup(response.text, 'html.parser')  # Parse HTML content
    text = soup.get_text()  # Extract the text content, dropping the HTML tags
    return text

# Create a function to tokenize the text and perform POS tagging using NLTK
def pos_tagging(text):
    """Performs Part-of-Speech (POS) tagging on the provided text."""
    tokens = word_tokenize(text)  # Tokenize the text into words
    tagged_tokens = pos_tag(tokens)  # Perform POS tagging
    return tagged_tokens

# Combine the functions to fetch a web page, parse its content, and perform POS tagging
if __name__ == "__main__":
    url = "http://example.com"  # Replace with the URL of the web document to process
    text = fetch_and_parse(url)  # Fetch and parse the text content from the URL
    tagged_tokens = pos_tagging(text)  # Perform POS tagging on the extracted text

    # Print out the POS tagged tokens
    for token in tagged_tokens:
        print(token)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


('Example', 'NNP')
('Domain', 'NNP')
('Example', 'NNP')
('Domain', 'NNP')
('This', 'DT')
('domain', 'NN')
('is', 'VBZ')
('for', 'IN')
('use', 'NN')
('in', 'IN')
('illustrative', 'JJ')
('examples', 'NNS')
('in', 'IN')
('documents', 'NNS')
('.', '.')
('You', 'PRP')
('may', 'MD')
('use', 'VB')
('this', 'DT')
('domain', 'NN')
('in', 'IN')
('literature', 'NN')
('without', 'IN')
('prior', 'JJ')
('coordination', 'NN')
('or', 'CC')
('asking', 'VBG')
('for', 'IN')
('permission', 'NN')
('.', '.')
('More', 'JJR')
('information', 'NN')
('...', ':')


