<a href="https://colab.research.google.com/github/RonitShetty/NLP-Labs/blob/main/C070_RonitShetty_NLPLab6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Lab 6
****
**Aim:** Identify the part of speech of each word in text data using part-of-speech (POS) tagging.

a)	Part-of-speech tagging: assign each token a category such as noun, pronoun, verb, adjective, adverb, preposition, conjunction, or interjection.


**Roll No.:** C070  
**Name:** Ronit Shetty  
**SAP ID:** 70322000128  
**Division:** C  
**Batch:** C1  

In [10]:
import spacy
from spacy import displacy
from spacy.tokens import Span

# Load the model
nlp = spacy.load("en_core_web_sm")

print("\n--- Basic Token Analysis ---")
print("Used to demonstrate the fundamental token attributes that spaCy extracts from text")
doc = nlp("Natural Language Processing (NLP) enables machines to understand, interpret, and generate human language.")
for token in doc:
    print(f"{token.text:12} {token.lemma_:12} {token.pos_:6} {token.tag_:6} {token.dep_:10} {token.shape_:8} {token.is_alpha} {token.is_stop}")

displacy.render(doc, style="dep", jupyter=True)


# --- 1. Tokenization ---
print("\n--- 1. Tokenization ---")
print("Used to split raw text into meaningful units (tokens) like words, punctuation, and special characters")
text_tokenization = "SpaCy's tokenization is powerful. U.K. is one token, and don't is split."
doc_tokenization = nlp(text_tokenization)
print(f"Original Text: '{text_tokenization}'")
print("Tokens:", [token.text for token in doc_tokenization])



# --- 2. Sentence Segmentation ---
print("\n--- 2. Sentence Segmentation ---")
print("Used to identify sentence boundaries and split text into individual sentences")
text_segmentation = "This is the first sentence. This is another one! And a final sentence?"
doc_segmentation = nlp(text_segmentation)
for i, sent in enumerate(doc_segmentation.sents):
    print(f"Sentence {i+1}: '{sent.text}'")

# Check if sentence boundaries are available
print(f"Has sentence boundaries: {doc_segmentation.has_annotation('SENT_START')}")



# --- 3. POS Tagging & 4. Morphology ---
print("\n--- 3. POS Tagging & 4. Morphological Analysis ---")
print("Used to assign grammatical categories (noun, verb, etc.) and analyze word forms (tense, number, etc.)")
text_pos = "She was running and eating cats quickly."
doc_pos = nlp(text_pos)
for token in doc_pos:
    print(
        f"Token: {token.text:<12} | "
        f"POS: {token.pos_:<8} | "
        f"Tag: {token.tag_:<8} | "
        f"Morph: {str(token.morph) if token.morph else 'None':<15} | "
        f"Is_alpha: {token.is_alpha} | "
        f"Is_stop: {token.is_stop}"
    )



# --- 5. Lemmatization ---
print("\n--- 5. Lemmatization ---")
print("Used to reduce words to their base or dictionary form (e.g., 'running' → 'run')")
text_lemma = "I am running in circles, seeing the best sights and geese."
doc_lemma = nlp(text_lemma)
for token in doc_lemma:
    print(f"Token: {token.text:<12} | Lemma: {token.lemma_:<12} | POS: {token.pos_}")



# --- 6. Dependency Parsing ---
print("\n--- 6. Dependency Parsing ---")
print("Used to analyze grammatical relationships between words and build syntactic trees")
text_dep = "Apple is looking at buying a U.K. startup."
doc_dep = nlp(text_dep)
for token in doc_dep:
    print(
        f"Token: {token.text:<12} | "
        f"Dep: {token.dep_:<12} | "
        f"Head: {token.head.text:<12} | "
        f"Children: {[child.text for child in token.children]}"
    )

# Check if dependency parsing is available
print(f"Has dependency parsing: {doc_dep.has_annotation('DEP')}")



# --- 7. Named Entity Recognition ---
print("\n--- 7. Named Entity Recognition (NER) ---")
print("Used to identify and classify named entities like people, organizations, locations, and dates")
text_ner = "Apple Inc. is looking at buying a U.K. startup for $1 billion in London."
doc_ner = nlp(text_ner)
print(f"Entities found: {len(doc_ner.ents)}")
for ent in doc_ner.ents:
    print(
        f"Entity: {ent.text:<15} | "
        f"Label: {ent.label_:<10} | "
        f"Description: {spacy.explain(ent.label_)}"
    )

# Token-level entity information
print("\nToken-level NER:")
for token in doc_ner:
    if token.ent_type_:
        print(f"Token: {token.text:<12} | IOB: {token.ent_iob_:<3} | Type: {token.ent_type_}")



# --- 8. Noun Chunks ---
print("\n--- 8. Noun Chunks ---")
print("Used to extract base noun phrases that represent key concepts or entities in text")
text_chunks = "The quick brown fox jumps over the lazy dog in the beautiful garden."
doc_chunks = nlp(text_chunks)
print("Noun chunks found:")
for chunk in doc_chunks.noun_chunks:
    print(f"Chunk: '{chunk.text}' | Root: '{chunk.root.text}' | Dep: {chunk.root.dep_}")



# --- 9. Word Vectors and Similarity (if available) ---
print("\n--- 9. Word Vectors and Similarity ---")
print("Used to measure semantic similarity between words, sentences, or documents using numerical representations")
# Check if the model has vectors
if nlp.vocab.vectors.size:
    doc1 = nlp("I like cats.")
    doc2 = nlp("I love dogs.")
    doc3 = nlp("I like pizza.")

    print(f"Similarity between '{doc1.text}' and '{doc2.text}': {doc1.similarity(doc2):.3f}")
    print(f"Similarity between '{doc1.text}' and '{doc3.text}': {doc1.similarity(doc3):.3f}")

    # Token vectors
    cats_token = doc1[2]  # "cats"
    print(f"Vector for 'cats' has shape: {cats_token.vector.shape}")
    print(f"Token 'cats' has vector: {cats_token.has_vector}")
else:
    print("No word vectors available in this model (en_core_web_sm doesn't include vectors)")
    print("To get vectors, use a larger model like en_core_web_md or en_core_web_lg")



# --- 10. Language Data (Stop words, etc.) ---
print("\n--- 10. Language Data ---")
print("Used to access language-specific information like stop words, punctuation rules, and linguistic patterns")
stop_words = nlp.Defaults.stop_words
print(f"Number of stop words: {len(stop_words)}")
print("Sample stop words:", list(stop_words)[:10])

# Check specific words
test_words = ["the", "galaxy", "and", "python"]
for word in test_words:
    is_stop = nlp.vocab[word].is_stop
    print(f"Is '{word}' a stop word? {is_stop}")



# --- 11. Token Shape and Character Analysis ---
print("\n--- 11. Token Shape and Character Analysis ---")
print("Used to analyze the orthographic patterns and character types in tokens (digits, punctuation, etc.)")
text_shape = "The U.S.A. has 50 states and iPhone costs $999.99!"
doc_shape = nlp(text_shape)
for token in doc_shape:
    print(
        f"Token: {token.text:<12} | "
        f"Shape: {token.shape_:<8} | "
        f"Is_alpha: {token.is_alpha} | "
        f"Is_digit: {token.is_digit} | "
        f"Is_punct: {token.is_punct} | "
        f"Like_num: {token.like_num} | "
        f"Like_email: {token.like_email}"
    )



# --- 12. Retokenization (Merging and Splitting) ---
print("\n--- 12. Retokenization - Merging Tokens ---")
print("Used to modify tokenization by combining multiple tokens into one or splitting tokens apart")
doc_merge = nlp("I live in New York City and work there.")
print("Before merging:", [token.text for token in doc_merge])

# Merge "New York City" into a single token
with doc_merge.retokenize() as retokenizer:
    # Find the span for "New York City"
    nyc_span = doc_merge[3:6]  # tokens 3, 4, 5
    retokenizer.merge(nyc_span, attrs={"LEMMA": "New York City", "ENT_TYPE": "GPE"})

print("After merging:", [token.text for token in doc_merge])
print("Lemma of merged token:", doc_merge[3].lemma_)



# --- 13. Splitting Tokens ---
print("\n--- 13. Retokenization - Splitting Tokens ---")
print("Used to split single tokens into multiple tokens when default tokenization is insufficient")
# spaCy already splits "I'll" into "I" and "'ll", so a fused token is used here instead
doc_split = nlp("I live in NewYork")
print("Before splitting:", [token.text for token in doc_split])

# Split "NewYork" into "New" and "York"; every new subtoken needs a head:
# "New" attaches to subtoken 1 ("York"), and "York" attaches to "in"
with doc_split.retokenize() as retokenizer:
    heads = [(doc_split[3], 1), doc_split[2]]
    retokenizer.split(doc_split[3], ["New", "York"], heads=heads)
print("After splitting:", [token.text for token in doc_split])



# --- 14. Custom Attributes (Extension) ---
print("\n--- 14. Custom Attributes ---")
print("Used to add custom properties and methods to spaCy objects for domain-specific processing")
# Register a custom attribute
from spacy.tokens import Token
Token.set_extension("is_greeting", default=False, force=True)

doc_custom = nlp("Hello, how are you today?")
# Set custom attribute
doc_custom[0]._.is_greeting = True

for token in doc_custom:
    print(f"Token: {token.text:<8} | Is_greeting: {token._.is_greeting}")



# --- 15. Pipeline Components ---
print("\n--- 15. Pipeline Information ---")
print("Used to understand and manage the processing pipeline components and their order")
print("Pipeline components:", nlp.pipe_names)
print("Pipeline component details:")
for name, component in nlp.pipeline:
    print(f"  - {name}: {type(component).__name__}")



# --- 16. Span Analysis ---
print("\n--- 16. Span Analysis ---")
print("Used to work with continuous sequences of tokens as single units for analysis")
text_span = "The European Union was established in 1993."
doc_span = nlp(text_span)

# Create a custom labeled span (a plain slice such as doc_span[1:3] has an empty label)
eu_span = Span(doc_span, 1, 3, label="ORG")  # "European Union"
print(f"Span text: '{eu_span.text}'")
print(f"Span root: '{eu_span.root.text}'")
print(f"Span label: {eu_span.label_}")



# --- 17. Lexical Attributes ---
print("\n--- 17. Lexical Attributes ---")
print("Used to access word-level properties like currency symbols, numbers, and vocabulary status")
text_lex = "The price is twenty-five dollars and 50 cents."
doc_lex = nlp(text_lex)
for token in doc_lex:
    print(
        f"Token: {token.text:<12} | "
        f"Is_currency: {token.is_currency} | "
        f"Like_num: {token.like_num} | "
        f"Is_oov: {token.is_oov}"
    )



# --- 18. Statistical Information ---
print("\n--- 18. Model and Statistical Information ---")
print("Used to access metadata about the loaded model and vocabulary statistics")
print(f"Model name: {nlp.meta.get('name', 'Unknown')}")
print(f"Model version: {nlp.meta.get('version', 'Unknown')}")
print(f"Model language: {nlp.meta.get('lang', 'Unknown')}")
# len(nlp.vocab) counts the lexemes currently cached, which grows as text is processed
print(f"Lexemes cached in vocab: {len(nlp.vocab)}")


--- Basic Token Analysis ---
Used to demonstrate the fundamental token attributes that spaCy extracts from text
Natural      Natural      PROPN  NNP    compound   Xxxxx    True False
Language     Language     PROPN  NNP    compound   Xxxxx    True False
Processing   Processing   PROPN  NNP    nsubj      Xxxxx    True False
(            (            PUNCT  -LRB-  punct      (        False False
NLP          NLP          PROPN  NNP    appos      XXX      True False
)            )            PUNCT  -RRB-  punct      )        False False
enables      enable       VERB   VBZ    ROOT       xxxx     True False
machines     machine      NOUN   NNS    nsubj      xxxx     True False
to           to           PART   TO     aux        xx       True True
understand   understand   VERB   VB     ccomp      xxxx     True False
,            ,            PUNCT  ,      punct      ,        False False
interpret    interpret    VERB   VB     conj       xxxx     True False
,            ,            PUNCT  


--- 1. Tokenization ---
Used to split raw text into meaningful units (tokens) like words, punctuation, and special characters
Original Text: 'SpaCy's tokenization is powerful. U.K. is one token, and don't is split.'
Tokens: ['SpaCy', "'s", 'tokenization', 'is', 'powerful', '.', 'U.K.', 'is', 'one', 'token', ',', 'and', 'do', "n't", 'is', 'split', '.']

--- 2. Sentence Segmentation ---
Used to identify sentence boundaries and split text into individual sentences
Sentence 1: 'This is the first sentence.'
Sentence 2: 'This is another one!'
Sentence 3: 'And a final sentence?'
Has sentence boundaries: True

--- 3. POS Tagging & 4. Morphological Analysis ---
Used to assign grammatical categories (noun, verb, etc.) and analyze word forms (tense, number, etc.)
Token: She          | POS: PRON     | Tag: PRP      | Morph: Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs | Is_alpha: True | Is_stop: True
Token: was          | POS: AUX      | Tag: VBD      | Morph: Mood=Ind|Number=Sing|Person

**Question of curiosity:**

1.	How can an HMM be used in POS tagging?

Ans: A Hidden Markov Model (HMM) is a statistical model well suited to sequence-labeling tasks like POS tagging because it finds the most probable sequence of tags for a given sentence. It works by modeling two kinds of probabilities. First, emission probabilities represent the likelihood of a specific word appearing given a certain POS tag (e.g., given the tag VERB, how likely the word "run" is). Second, transition probabilities represent the likelihood of one POS tag following another (e.g., a NOUN is very likely to follow a DETERMINER like "the").

The POS tags themselves are the "hidden states" that cannot be observed directly, while the words are the visible "observations". Using these probabilities, the Viterbi algorithm, a dynamic-programming method, efficiently computes the single most likely sequence of hidden tags that could have produced the observed sentence, which assigns a POS tag to every word.
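The decoding step described above can be sketched in plain Python. This is a minimal illustration, not how spaCy tags text: the three-tag tag set and all probability values below are made-up assumptions, and the `viterbi` function name and its `floor` parameter (a small probability substituted for unseen events so that logarithms stay defined) are hypothetical.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a first-order HMM."""
    if not words:
        return []

    def log(p):
        # Work in log space to avoid underflow; unseen events get a tiny floor probability
        return math.log(max(p, floor))

    # best[i][t]: log-probability of the best tag path that ends in tag t at position i
    best = [{t: log(start_p.get(t, 0)) + log(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path

    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + log(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log(emit_p[t].get(words[i], 0))
            back[i][t] = prev

    # Start from the best final tag and follow the back-pointers
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model: all numbers are illustrative assumptions, not estimated from a corpus
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.4, "run": 0.1, "runs": 0.05},
    "VERB": {"run": 0.5, "runs": 0.4, "dog": 0.01},
}

print(viterbi(["the", "dog", "runs"], tags, start_p, trans_p, emit_p))
# ['DET', 'NOUN', 'VERB']
```

With these toy numbers, "run" after "the" is tagged NOUN even though its emission probability is higher under VERB, because the DET to NOUN transition dominates. In practice the tables would be estimated from a tagged corpus.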

2.	What are the applications of NER?

Ans: Named Entity Recognition (NER) extracts structured information from unstructured text, which makes it useful across many domains:

•	Customer support: NER can identify product names or order numbers in a complaint so that it is routed to the right department.

•	Recruitment: HR departments use it to parse resumes, pulling out names, skills, and past employers to populate databases.

•	Media: NER supports content classification and recommendation by identifying the people, organizations, and locations mentioned in articles so that similar content can be suggested to readers.

•	Question answering systems: NER identifies the entities mentioned in a question so that the system can retrieve the relevant information from a knowledge base. Voice assistants such as Siri and Alexa use this kind of entity extraction to understand what is being asked.

•	Healthcare: NER extracts patient data, diseases, and medication names from doctors' notes.

•	Finance: NER can be used to monitor news for mentions of specific companies for market analysis.
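The customer-support routing case can be sketched without loading any model, assuming the entities have already been extracted as `(text, label)` pairs of the kind obtained from `ent.text` and `ent.label_` in spaCy. The `ROUTES` mapping, the department names, and the `route_ticket` function are hypothetical, and the entity pairs are hand-written for illustration rather than real model output.

```python
from collections import defaultdict

# Hypothetical mapping from entity label to department; earlier entries take priority
ROUTES = {"PRODUCT": "product-support", "MONEY": "billing", "ORG": "partnerships"}


def route_ticket(entities, default="general"):
    """Choose a department from (text, label) pairs and return it with the grouped entities."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)

    # The first label in ROUTES that appears in the ticket decides the department
    for label, department in ROUTES.items():
        if label in grouped:
            return department, dict(grouped)
    return default, dict(grouped)


# Hand-written example pairs standing in for NER output on a complaint
entities = [("iPhone", "PRODUCT"), ("$999.99", "MONEY")]
department, found = route_ticket(entities)
print(department, found)
```

A real system would likely combine such rules with a text classifier, since entity labels alone rarely determine the right department.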
