<a href="https://colab.research.google.com/github/Amulyanrao7777/NLP/blob/main/lab4_POStagging_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [18]:
import nltk
import spacy
from spacy import displacy
from sklearn.metrics import classification_report, accuracy_score


nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

# ==========================================
# PART 1: POS TAGGING
# ==========================================

sample_text = "Apple is looking at buying U.K. startup for $1 billion."

# ------------------------------------------
# APPROACH 1: Using NLTK (statistical: averaged perceptron tagger)
# ------------------------------------------
print("--- NLTK POS Tagging ---")

# Step 1: Tokenization
tokens = nltk.word_tokenize(sample_text)

# Step 2: Tagging
# NLTK uses the Penn Treebank tagset by default (NN, VB, JJ, etc.)
nltk_tags = nltk.pos_tag(tokens)

print(f"{'Word':<15} {'Tag':<10} {'Description'}")
print("-" * 40)

# nltk.pos_tag yields only (word, tag) pairs, so the 'Description' column is left blank
for word, tag in nltk_tags:
    print(f"{word:<15} {tag:<10}")

# ------------------------------------------
# APPROACH 2: Using spaCy (Deep Learning based)
# ------------------------------------------
print("\n--- spaCy POS Tagging ---")

# Load the small English model
# Run in terminal first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Process the text
doc = nlp(sample_text)

print(f"{'Word':<15} {'POS':<10} {'Tag':<10} {'Explanation'}")
print("-" * 60)

for token in doc:
    # token.pos_ is the coarse-grained tag (Universal POS)
    # token.tag_ is the fine-grained tag (Penn Treebank)
    print(f"{token.text:<15} {token.pos_:<10} {token.tag_:<10} {spacy.explain(token.tag_)}")


# ==========================================
# PART 2: NAMED ENTITY RECOGNITION (NER)
# ==========================================

ner_text = "Elon Musk founded SpaceX in 2002. It is based in California."

# Process text
doc_ner = nlp(ner_text)

print("\n--- Named Entity Recognition ---")
print(f"{'Entity':<20} {'Label':<10} {'Explanation'}")
print("-" * 60)

for ent in doc_ner.ents:
    print(f"{ent.text:<20} {ent.label_:<10} {spacy.explain(ent.label_)}")

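# --- VISUALIZATION ---
# In a Jupyter/Colab notebook, render the entities inline:
displacy.render(doc_ner, style="ent", jupyter=True)

# Also write the markup to disk so the message below is accurate.
# With jupyter=False, displacy.render returns the HTML as a string.
html = displacy.render(doc_ner, style="ent", page=True, jupyter=False)
with open("ner_visualization.html", "w", encoding="utf-8") as f:
    f.write(html)

print("\nVisualization saved to 'ner_visualization.html'")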


# ==========================================
# PART 3: EVALUATION METRICS
# ==========================================

# 'y_true' holds the correct (gold) tags for a sentence
# 'y_pred' holds the tags our model predicted
# Illustrative sentence (7 tokens): "Google is a big technology company."
# Correct tags: [ORG, VERB, DET, ADJ, NOUN, NOUN, PUNCT]
# ("ORG" is an entity label, mixed in with POS tags here purely for illustration)

y_true = ["ORG", "VERB", "DET", "ADJ", "NOUN", "NOUN", "PUNCT"]
y_pred = ["ORG", "VERB", "DET", "NOUN", "NOUN", "NOUN", "PUNCT"]
# Note the error: the model predicted "big" (ADJ) as "NOUN"

print("--- Evaluation Metrics ---\n")

# 1. Accuracy
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# 2. Classification Report (Precision, Recall, F1)
# This generates a table showing metrics for every single tag type
report = classification_report(y_true, y_pred, zero_division=0)
print("\nClassification Report:\n")
print(report)

print("\nAnalysis:")
print("Notice how the ADJ tag has 0.00 precision/recall because our model missed it entirely.")
print("The NOUN tag has high recall but lower precision because we over-predicted it.")


# ==========================================
# STUDENT CHALLENGE: Entity Frequency Counter
# ==========================================

from collections import Counter

# Paste any news paragraph here
news_text = """
Apple Inc. and Google are among the top technology companies in the United States.
Microsoft also announced a partnership with OpenAI in San Francisco.
Amazon reported record earnings, while Meta faced scrutiny from the European Union.
"""

doc_news = nlp(news_text)

# Extract only ORG entities
org_entities = [ent.text for ent in doc_news.ents if ent.label_ == "ORG"]
org_counts = Counter(org_entities)

print("\n--- Most Frequent Organizations (ORG) in Text ---")
for org, count in org_counts.most_common():
    print(f"{org:<25} Count: {count}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


--- NLTK POS Tagging ---
Word            Tag        Description
----------------------------------------
Apple           NNP       
is              VBZ       
looking         VBG       
at              IN        
buying          VBG       
U.K.            NNP       
startup         NN        
for             IN        
$               $         
1               CD        
billion         CD        
.               .         

--- spaCy POS Tagging ---
Word            POS        Tag        Explanation
------------------------------------------------------------
Apple           PROPN      NNP        noun, proper singular
is              AUX        VBZ        verb, 3rd person singular present
looking         VERB       VBG        verb, gerund or present participle
at              ADP        IN         conjunction, subordinating or preposition
buying          VERB       VBG        verb, gerund or present participle
U.K.            PROPN      NNP        noun, proper singular
startup        


Visualization saved to 'ner_visualization.html'
--- Evaluation Metrics ---

Accuracy: 0.86

Classification Report:

              precision    recall  f1-score   support

         ADJ       0.00      0.00      0.00         1
         DET       1.00      1.00      1.00         1
        NOUN       0.67      1.00      0.80         2
         ORG       1.00      1.00      1.00         1
       PUNCT       1.00      1.00      1.00         1
        VERB       1.00      1.00      1.00         1

    accuracy                           0.86         7
   macro avg       0.78      0.83      0.80         7
weighted avg       0.76      0.86      0.80         7


Analysis:
Notice how the ADJ tag has 0.00 precision/recall because our model missed it entirely.
The NOUN tag has high recall but lower precision because we over-predicted it.

--- Most Frequent Organizations (ORG) in Text ---
Apple Inc.                Count: 1
Google                    Count: 1
Microsoft                 Count: 1
Amazon 
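
The figures in the classification report above can be reproduced by hand, which may make the NOUN and ADJ rows easier to read. The following is a minimal sketch, not how scikit-learn computes them internally. It reuses the same `y_true` and `y_pred` lists, and the helper name `per_tag_scores` is hypothetical. Tags that are never predicted get a precision of 0.0, mirroring `zero_division=0`.

```python
from collections import Counter

# Same illustrative labels as in the evaluation section of the cell above.
y_true = ["ORG", "VERB", "DET", "ADJ", "NOUN", "NOUN", "PUNCT"]
y_pred = ["ORG", "VERB", "DET", "NOUN", "NOUN", "NOUN", "PUNCT"]


def per_tag_scores(y_true, y_pred):
    """Return {tag: (precision, recall, f1)} computed from raw counts.

    Hypothetical helper for illustration only.
    """
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)  # how often each tag was predicted
    actual = Counter(y_true)     # how often each tag truly occurs (support)
    scores = {}
    for tag in sorted(set(y_true) | set(y_pred)):
        tp = true_pos[tag]
        # Precision: of everything predicted as `tag`, how much was right?
        precision = tp / predicted[tag] if predicted[tag] else 0.0
        # Recall: of everything that truly is `tag`, how much was found?
        recall = tp / actual[tag] if actual[tag] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# Accuracy is simply the fraction of positions where the tags agree.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_tag_scores(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")
for tag, (p, r, f1) in scores.items():
    print(f"{tag:<6} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

NOUN was predicted three times but is correct only twice, so its precision is 2/3 while its recall is 2/2. ADJ was never predicted, so both of its scores are zero.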