# `Part 1 : PoS tagging`

In [1]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


**Tag set:** ADJ, ADP, AUX, DET, NOUN, NUM, PROPN, PUNCT, SCONJ, VERB

### 1. "The cat sat on the couch"
This is a grammatically straightforward sentence with no ambiguity.

| Word | Tag | Reason |
| :--- | :--- | :--- |
| **The** | **DET** | Determiner (definite article) |
| **cat** | **NOUN** | Common noun |
| **sat** | **VERB** | Main verb (past tense) |
| **on** | **ADP** | Adposition (preposition) indicating location |
| **the** | **DET** | Determiner |
| **couch** | **NOUN** | Common noun |

---

### 2. "Time flies like an arrow"

This is a classic example of **lexical ambiguity**. While there are joke interpretations (e.g., *Time* as a verb, meaning "measure the speed of flies"), the standard reading is annotated below.

| Word | Tag | Reason |
| :--- | :--- | :--- |
| **Time** | **NOUN** | Abstract noun (Subject) |
| **flies** | **VERB** | Main verb (Action) |
| **like** | **ADP** | Adposition (acting as a preposition here for comparison) |
| **an** | **DET** | Determiner (indefinite article) |
| **arrow** | **NOUN** | Common noun |

> **Note on Ambiguity:** By analogy with "Fruit flies [the insects] like [enjoy] a banana," the sentence could be read as "*Time flies* [a kind of fly] like [enjoy] an arrow." Under that reading *Time* and *flies* would both be **NOUN** and *like* would be a **VERB**. The annotation above follows the standard English reading.
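
The ambiguity can be made concrete by listing candidate tags for each word and enumerating every combination. The sketch below is illustrative only: the candidate sets in `CANDIDATE_TAGS` are assumptions chosen to cover the readings discussed above, not the output of any tagger.

```python
from itertools import product

# Hypothetical candidate tags per word (an assumption for illustration).
CANDIDATE_TAGS = {
    "Time": ["NOUN", "VERB"],
    "flies": ["VERB", "NOUN"],
    "like": ["ADP", "VERB"],
    "an": ["DET"],
    "arrow": ["NOUN"],
}


def tag_sequences(words, lexicon):
    """Return every tag sequence allowed by the per-word candidate sets."""
    choices = [lexicon[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*choices)]


words = "Time flies like an arrow".split()
sequences = tag_sequences(words, CANDIDATE_TAGS)

print(f"{len(sequences)} candidate tag sequences")
for sequence in sequences:
    print(" ".join(f"{word}/{tag}" for word, tag in sequence))
```

Most of the enumerated sequences are not grammatical; a tagger or parser has to pick among them, which is where the standard and joke readings diverge.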

---

### 3. "The spy saw the cop with the telescope"

This sentence contains **structural (attachment) ambiguity**.
* *Reading A:* The spy used a telescope to see the cop.
* *Reading B:* The spy saw a cop who had a telescope.

**However, the POS tags remain the same for both readings:**

| Word | Tag | Reason |
| :--- | :--- | :--- |
| **The** | **DET** | Determiner |
| **spy** | **NOUN** | Common noun |
| **saw** | **VERB** | Main verb |
| **the** | **DET** | Determiner |
| **cop** | **NOUN** | Common noun |
| **with** | **ADP** | Adposition (preposition) |
| **the** | **DET** | Determiner |
| **telescope**| **NOUN** | Common noun |
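
To illustrate why the tags cannot distinguish the two readings, the sketch below encodes each reading as a nested tuple (a hand-written structure, not parser output) and checks that both yield the same word/tag sequence even though the trees differ.

```python
# Shared sub-trees, written by hand for illustration.
SUBJECT = ("NP", [("DET", "The"), ("NOUN", "spy")])
OBJECT = ("NP", [("DET", "the"), ("NOUN", "cop")])
WITH_PP = ("PP", [("ADP", "with"),
                  ("NP", [("DET", "the"), ("NOUN", "telescope")])])

# Reading A: the PP attaches to the verb phrase (the spy uses the telescope).
READING_A = ("S", [SUBJECT, ("VP", [("VERB", "saw"), OBJECT, WITH_PP])])

# Reading B: the PP attaches to the object NP (the cop has the telescope).
READING_B = ("S", [SUBJECT, ("VP", [("VERB", "saw"),
                                    ("NP", [OBJECT, WITH_PP])])])


def tagged_leaves(node):
    """Collect (word, tag) pairs from a nested (label, content) tree."""
    label, content = node
    if isinstance(content, str):
        return [(content, label)]
    leaves = []
    for child in content:
        leaves.extend(tagged_leaves(child))
    return leaves


print(tagged_leaves(READING_A) == tagged_leaves(READING_B))  # same tags
print(READING_A == READING_B)                                # different trees
```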

---

### 4. "The spy saw the cop with the revolver"
Structurally identical to the sentence above, though semantically we assume the "revolver" belongs to the cop (Reading B), as seeing *using* a revolver makes less sense. The tags follow the exact same pattern.

| Word | Tag | Reason |
| :--- | :--- | :--- |
| **The** | **DET** | Determiner |
| **spy** | **NOUN** | Common noun |
| **saw** | **VERB** | Main verb |
| **the** | **DET** | Determiner |
| **cop** | **NOUN** | Common noun |
| **with** | **ADP** | Adposition (preposition) |
| **the** | **DET** | Determiner |
| **revolver** | **NOUN** | Common noun |

---

### Summary of Tag Usage
* **NOUN:** cat, couch, Time, arrow, spy, cop, telescope, revolver
* **VERB:** sat, flies, saw
* **DET:** The, the, an
* **ADP:** on, like, with
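
A summary like this can be derived mechanically from the annotations. The following is a small sketch that assumes the four tables above are transcribed by hand into `ANNOTATED`; the function name `words_by_tag` is made up for this example.

```python
from collections import defaultdict

# Manual annotations transcribed from the four tables above.
ANNOTATED = [
    [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
     ("on", "ADP"), ("the", "DET"), ("couch", "NOUN")],
    [("Time", "NOUN"), ("flies", "VERB"), ("like", "ADP"),
     ("an", "DET"), ("arrow", "NOUN")],
    [("The", "DET"), ("spy", "NOUN"), ("saw", "VERB"), ("the", "DET"),
     ("cop", "NOUN"), ("with", "ADP"), ("the", "DET"), ("telescope", "NOUN")],
    [("The", "DET"), ("spy", "NOUN"), ("saw", "VERB"), ("the", "DET"),
     ("cop", "NOUN"), ("with", "ADP"), ("the", "DET"), ("revolver", "NOUN")],
]


def words_by_tag(sentences):
    """Group distinct words under their tag, in order of first appearance."""
    groups = defaultdict(list)
    for sentence in sentences:
        for word, tag in sentence:
            if word not in groups[tag]:
                groups[tag].append(word)
    return dict(groups)


summary = words_by_tag(ANNOTATED)
for tag, words in summary.items():
    print(f"{tag}: {', '.join(words)}")
```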

In [2]:
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz (14.8 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting spacy<3.8.0,>=3.7.4 (from en_core_sci_sm==0.5.4)
  Using cached spacy-3.7.5-cp310-cp310-macosx_11_0_arm64.whl.metadata (27 kB)
Collecting thinc<8.3.0,>=8.2.2 (from spacy<3.8.0,>=3.7.4->en_core_sci_sm==0.5.4)
  Using cached thinc-8.2.5-cp310-cp310-macosx_11_0_arm64.whl.metadata (15 kB)
Collecting blis<0.8.0,>=0.7.8 (from thinc<8.3.0,>=8.2.2->spacy<3.8.0,>=3.7.4->en_core_sci_sm==0.5.4)
  Using cached blis-0.7.11-cp310-cp310-macosx_11_0_arm64.whl.metadata (7.4 kB)
Using cached spacy-3.7.5-cp310-cp310-macosx_11_0_arm64.whl (6.6 MB)
Using cached thinc-8.2.5-cp310-cp310-macosx_11_0_arm64.whl

In [3]:
import spacy
import scispacy


def run_spacy_analysis(sentence_list):
    """
    Analyzes sentences using the standard English model (en_core_web_sm).
    """
    print("\n" + "="*60)
    print("RUNNING: STANDARD SPACY (en_core_web_sm)")
    print("="*60)

    try:
        nlp = spacy.load("en_core_web_sm")
    except OSError:
        print("Model 'en_core_web_sm' not found. Please download it.")
        return

    print(f"{'TOKEN':<20} {'POS':<8} {'DESCRIPTION'}")
    print("-" * 60)

    for text in sentence_list:
        doc = nlp(text)
        # Truncate long sentences for display
        print(f"\nSentence: \"{text[:50]}...\"")

        for token in doc:
            # token.pos_ returns the Universal Dependencies tag
            description = spacy.explain(token.pos_) or "No description"
            print(f"{token.text:<20} {token.pos_:<8} {description}")

In [4]:
def run_scispacy_analysis(sentence_list):
    """
    Analyzes sentences using the Biomedical model (en_core_sci_sm).
    """
    print("\n" + "="*60)
    print("RUNNING: SCISPACY (en_core_sci_sm)")
    print("="*60)

    try:
        nlp = spacy.load("en_core_sci_sm")
    except OSError:
        print("Model 'en_core_sci_sm' not found. Please install it via pip.")
        return

    print(f"{'TOKEN':<20} {'POS':<8} {'DESCRIPTION'}")
    print("-" * 60)

    for text in sentence_list:
        doc = nlp(text)
        print(f"\nSentence: \"{text[:50]}...\"")

        for token in doc:
            # SciSpacy also uses Universal Dependencies
            description = spacy.explain(token.pos_) or "No description"
            print(f"{token.text:<20} {token.pos_:<8} {description}")


In [5]:
dataset = [
    # General
    "The cat sat on the couch",
    "Time flies like an arrow",
    "The spy saw the cop with the telescope",
    "The spy saw the cop with the revolver",
    # Specific (Biology)
    "Arabidopsis thaliana seedlings exhibit longer hypocotyls when they are grown under high ambient temperature.",
    # Specific (Astronomy)
    "A spectrogram of PSN J10354824+3900279 obtained on Dec. 19.33 UT suggests that this is a type-Ia at redshift z 0.044."
]

In [6]:
run_spacy_analysis(dataset)


RUNNING: STANDARD SPACY (en_core_web_sm)
TOKEN                POS      DESCRIPTION
------------------------------------------------------------

Sentence: "The cat sat on the couch..."
The                  DET      determiner
cat                  NOUN     noun
sat                  VERB     verb
on                   ADP      adposition
the                  DET      determiner
couch                NOUN     noun

Sentence: "Time flies like an arrow..."
Time                 NOUN     noun
flies                VERB     verb
like                 ADP      adposition
an                   DET      determiner
arrow                NOUN     noun

Sentence: "The spy saw the cop with the telescope..."
The                  DET      determiner
spy                  NOUN     noun
saw                  VERB     verb
the                  DET      determiner
cop                  NOUN     noun
with                 ADP      adposition
the                  DET      determiner
telescope            NOUN     noun



In [7]:

run_scispacy_analysis(dataset)


RUNNING: SCISPACY (en_core_sci_sm)
TOKEN                POS      DESCRIPTION
------------------------------------------------------------

Sentence: "The cat sat on the couch..."
The                  DET      determiner
cat                  NOUN     noun
sat                  VERB     verb
on                   ADP      adposition
the                  DET      determiner
couch                NOUN     noun

Sentence: "Time flies like an arrow..."
Time                 NOUN     noun
flies                NOUN     noun
like                 ADP      adposition
an                   DET      determiner
arrow                NOUN     noun

Sentence: "The spy saw the cop with the telescope..."
The                  DET      determiner
spy                  NOUN     noun
saw                  VERB     verb
the                  DET      determiner
cop                  NOUN     noun
with                 ADP      adposition
the                  DET      determiner
telescope            NOUN     noun

Sent

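
In the outputs recorded above, the two models agree on every token shown except *flies* in "Time flies like an arrow". A helper along the following lines can surface such disagreements automatically. It is a sketch that works on plain `(token, tag)` lists copied from the printed output rather than on spaCy `Doc` objects; the name `tag_disagreements` is hypothetical.

```python
# (token, POS) pairs copied from the two recorded outputs above.
WEB_SM = [("Time", "NOUN"), ("flies", "VERB"), ("like", "ADP"),
          ("an", "DET"), ("arrow", "NOUN")]
SCI_SM = [("Time", "NOUN"), ("flies", "NOUN"), ("like", "ADP"),
          ("an", "DET"), ("arrow", "NOUN")]


def tag_disagreements(tags_a, tags_b):
    """Return (token, tag_a, tag_b) for every position where the tags differ."""
    tokens_a = [token for token, _ in tags_a]
    tokens_b = [token for token, _ in tags_b]
    if tokens_a != tokens_b:
        raise ValueError("Tokenizations differ; align tokens before comparing.")
    return [
        (token, tag_a, tag_b)
        for (token, tag_a), (_, tag_b) in zip(tags_a, tags_b)
        if tag_a != tag_b
    ]


for token, web_tag, sci_tag in tag_disagreements(WEB_SM, SCI_SM):
    print(f"{token}: en_core_web_sm={web_tag}, en_core_sci_sm={sci_tag}")
```

The token check matters because two pipelines may tokenize the same text differently, in which case a position-by-position comparison would be misleading.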


# `Part 2 : Syntactic Parsing`
### Parsing using NLTK Chart Parser

In [8]:
import nltk
from nltk import CFG

grammar_text = """
    S   -> NP VP

    VP  -> VERB NP
    VP  -> VP PP
    VP  -> VERB PP

    NP  -> DET NOUN
    NP  -> NP PP
    NP  -> 'Time'

    PP  -> ADP NP

    # Lexicon (Handling Ambiguities)
    DET  -> 'The' | 'the' | 'an' | 'a'
    NOUN -> 'cat' | 'couch' | 'Time' | 'flies' | 'arrow' | 'spy' | 'cop' | 'telescope' | 'revolver'
    VERB -> 'sat' | 'flies' | 'like' | 'saw'
    ADP  -> 'on' | 'like' | 'with'
"""

grammar = CFG.fromstring(grammar_text)

parser = nltk.ChartParser(grammar)


def parse_sentence(sentence):
    tokens = sentence.split()
    print(f"\nParsing: '{sentence}'")
    print("-" * 50)

    # Attempt to parse
    try:
        trees = list(parser.parse(tokens))
    except ValueError:
        print("Tokens contain words not in grammar.")
        return

    if not trees:
        print("No parse found (Grammar structure mismatch).")
        return

    for i, tree in enumerate(trees):
        print(f"Tree #{i+1}:")
        tree.pretty_print()


sentences = [
    "The cat sat on the couch",
    "Time flies like an arrow",
    "The spy saw the cop with the telescope"
]

for s in sentences:
    parse_sentence(s)


Parsing: 'The cat sat on the couch'
--------------------------------------------------
Tree #1:
              S                        
      ________|________                 
     |                 VP              
     |         ________|___             
     |        |            PP          
     |        |     _______|___         
     NP       |    |           NP      
  ___|___     |    |        ___|____    
DET     NOUN VERB ADP     DET      NOUN
 |       |    |    |       |        |   
The     cat  sat   on     the     couch


Parsing: 'Time flies like an arrow'
--------------------------------------------------
Tree #1:
            S                    
  __________|____                 
 |               VP              
 |      _________|___             
 |     |             PP          
 |     |     ________|___         
 |     |    |            NP      
 |     |    |         ___|____    
 NP   VERB ADP      DET      NOUN
 |     |    |        |        |   
Time flies like

### Parsing using CYK

In [9]:
from cyk import parse_sentence_cyk
pos_lists = [
    # The cat sat on the couch
    [
        ["DET"],
        ["NOUN"],
        ["VERB"],
        ["PREP"],
        ["DET"],
        ["NOUN"],
    ],
    # Time flies like an arrow
    [
        ["NOUN"],
        ["NOUN", "VERB"],  # flies
        ["PREP", "VERB"],  # like
        ["DET"],
        ["NOUN"],
    ],
    # The spy saw the cop with the telescope
    [
        ["DET"],
        ["NOUN"],
        ["VERB"],
        ["DET"],
        ["NOUN"],
        ["PREP"],
        ["DET"],
        ["NOUN"],
    ],
]

for sentence, pos in zip(sentences, pos_lists):
    print("Sentence:", sentence)
    parse_sentence_cyk(sentence, pos)



Sentence: The cat sat on the couch
S
----NP
--------DET - The
--------NOUN - cat
----VP
--------VERB - sat
--------PP
------------PREP - on
------------NP
----------------DET - the
----------------NOUN - couch


S
----NP
--------DET - The
--------NOUN - cat
----VP
--------VP - sat
--------PP
------------PREP - on
------------NP
----------------DET - the
----------------NOUN - couch


Nb parses: 2
Sentence: Time flies like an arrow
S
----NP - Time
----VP
--------VERB - flies
--------PP
------------PREP - like
------------NP
----------------DET - an
----------------NOUN - arrow


S
----NP - Time
----VP
--------VP - flies
--------PP
------------PREP - like
------------NP
----------------DET - an
----------------NOUN - arrow


Nb parses: 2
Sentence: The spy saw the cop with the telescope
S
----NP
--------DET - The
--------NOUN - spy
----VP
--------VERB - saw
--------NP
------------NP
----------------DET - the
----------------NOUN - cop
------------PP
----------------PREP - with
-----------
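
The `cyk` module imported above is not shown in this notebook. As a rough idea of what a CYK-style parse counter does with the per-word POS candidate lists, here is a minimal sketch. The rule tables `BINARY_RULES` and `UNARY_RULES` are assumptions made for this example and are not the grammar used by `parse_sentence_cyk`, so the parse counts can differ from the output above.

```python
from collections import defaultdict

# Assumed binary rules in Chomsky normal form: (left, right) -> parents.
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("VERB", "NP"): ["VP"],
    ("VERB", "PP"): ["VP"],
    ("VP", "PP"): ["VP"],
    ("DET", "NOUN"): ["NP"],
    ("NP", "PP"): ["NP"],
    ("PREP", "NP"): ["PP"],
}
# Assumed unary promotion: a bare noun may act as a noun phrase.
UNARY_RULES = {"NOUN": ["NP"]}


def count_parses(pos_candidates, start="S"):
    """Count the trees rooted in `start` that span the whole sentence."""
    n = len(pos_candidates)
    # chart[i][j][symbol] = number of trees for `symbol` over tokens i..j-1
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tags in enumerate(pos_candidates):
        for tag in tags:
            chart[i][i + 1][tag] += 1
            for parent in UNARY_RULES.get(tag, []):
                chart[i][i + 1][parent] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for left, left_count in chart[i][k].items():
                    for right, right_count in chart[k][j].items():
                        for parent in BINARY_RULES.get((left, right), []):
                            chart[i][j][parent] += left_count * right_count
    return chart[0][n].get(start, 0)


cat = [["DET"], ["NOUN"], ["VERB"], ["PREP"], ["DET"], ["NOUN"]]
spy = [["DET"], ["NOUN"], ["VERB"], ["DET"], ["NOUN"],
       ["PREP"], ["DET"], ["NOUN"]]
print("cat sentence parses:", count_parses(cat))
print("spy sentence parses:", count_parses(spy))
```

Under these assumed rules the telescope sentence gets one tree per PP attachment site, which is the structural ambiguity described in Part 1.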

### Comparison with spaCy Syntactic Analysis (Constituency Parser)

In [10]:
import benepar, spacy
benepar.download('benepar_en3')

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})

[nltk_data] Downloading package benepar_en3 to
[nltk_data]     /Users/sisso/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!


<benepar.integrations.spacy_plugin.BeneparComponent at 0x131f0f430>

In [11]:
def parse_sentence_constituency(sentence):
    doc = nlp(sentence)
    sent = list(doc.sents)[0]
    print(sent._.parse_string)


In [12]:
sentences = [
    "The cat sat on the couch",
    "Time flies like an arrow",
    "The spy saw the cop with the telescope"
]

for s in sentences:
    parse_sentence_constituency(s)



(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN couch)))))
(S (NP (NN Time)) (VP (VBZ flies) (PP (IN like) (NP (DT an) (NN arrow)))))
(S (NP (DT The) (NN spy)) (VP (VBD saw) (NP (DT the) (NN cop)) (PP (IN with) (NP (DT the) (NN telescope)))))
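
The bracketed strings above can be inspected programmatically to see which attachment the parser chose. The sketch below uses a small hand-rolled reader for the bracket notation (the helpers `parse_brackets` and `pp_parents` are written for this example and are not part of benepar or spaCy) and reports the label of the node that directly contains each PP.

```python
def parse_brackets(text):
    """Turn a bracketed parse string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        position += 1                 # skip "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])  # a word
                position += 1
        position += 1                 # skip ")"
        return (label, children)

    return read_node()


def pp_parents(tree, parent=None):
    """List the parent label of every PP node in the tree."""
    label, children = tree
    found = [parent] if label == "PP" else []
    for child in children:
        if isinstance(child, tuple):
            found.extend(pp_parents(child, label))
    return found


spy_parse = ("(S (NP (DT The) (NN spy)) (VP (VBD saw) (NP (DT the) (NN cop)) "
             "(PP (IN with) (NP (DT the) (NN telescope)))))")
print(pp_parents(parse_brackets(spy_parse)))
```

For the telescope sentence the PP sits directly under the VP, next to the object NP, so the recorded output corresponds to Reading A from Part 1 (the spy uses the telescope).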


