# Part 1: Mandatory Question
### The following two sentences are ambiguous – that is, they each have two (or more) different readings. For each sentence, explain the different readings:
1. The girl attacked the boy with the book.

Can be read as "The girl used the book as a weapon to attack the boy" or "The girl attacked the boy who was holding the book".

2. We decided to leave on Saturday.

Can be read as "We made the decision on Saturday" or "Our departure is scheduled for Saturday".

### Are the following sentences ambiguous? If the sentence is ambiguous, explain the different readings. If not, explain why not.
3. I saw a man with a briefcase.
Structurally ambiguous, although only one reading is plausible. It can be read as "I saw a man carrying a briefcase" or "I used a briefcase to see a man". The second reading is grammatically possible but makes no sense in practice.

4. I saw the planet with a telescope.
Ambiguous. It can be read as "I used a telescope to see the planet" or "I saw the planet that has a telescope".

### Try the above sentences (3 and 4) in this parse tree generator: https://huggingface.co/spaces/nanom/syntactic_tree Compare the dependency structures for both of them. Explain the differences in the link to the preposition “with” in 3 vs. 4.

Sentence 3: “(I) saw a man with a briefcase.”
- **saw** = Root (VBD)
- **man** depends on **saw** (the direct object)
- **with** depends on **saw** (a PP dependent/adjunct of the verb)
- **briefcase** depends on **with** (object of the preposition)
- Resulting chain: **saw** → **with** → **briefcase**

Sentence 4: “I saw the planet with a telescope.”
- **saw** = Root (VBD)
- **I** depends on **saw** (subject)
- **planet** depends on **saw** (direct object)
- **with** depends on **saw**
- **telescope** depends on **with**

Structurally, the two outputs show no difference in attachment: in both, **with** attaches to the verb **saw**.

Interpretively, the same syntactic attachment has different consequences:

- In (3), verb-attachment makes the sentence lean toward the odd reading (“saw using a briefcase”), whereas many humans prefer noun-attachment (“man with a briefcase”).

- In (4), verb-attachment produces the very natural instrumental reading (“saw using a telescope”), which is why the parser’s choice looks “right.”
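The comparison above can be sketched in a few lines of plain Python. The arcs below are transcribed by hand from the two lists above (they are not produced by running a parser here), and the helper name `pp_attachment` is hypothetical. A third, hand-built variant of sentence 3 shows what noun-attachment would look like.

```python
# Dependency arcs transcribed by hand from the parser output listed above,
# stored as {dependent: head}. The root verb maps to None.
SENTENCE_3 = {"saw": None, "man": "saw", "with": "saw", "briefcase": "with"}
SENTENCE_4 = {"saw": None, "I": "saw", "planet": "saw",
              "with": "saw", "telescope": "with"}

# Hand-built alternative for sentence 3 in which "with" modifies "man".
# This is an illustration, not something the parser returned.
SENTENCE_3_NOUN = {"saw": None, "man": "saw", "with": "man", "briefcase": "with"}


def pp_attachment(arcs, preposition="with"):
    """Return 'verb' if the preposition hangs off the root, else 'noun'.

    Hypothetical helper: 'verb' corresponds to the instrumental reading,
    'noun' to the modifier reading.
    """
    head = arcs[preposition]
    return "verb" if arcs[head] is None else "noun"


for name, arcs in [("3 (parser output)", SENTENCE_3),
                   ("4 (parser output)", SENTENCE_4),
                   ("3 (noun attachment)", SENTENCE_3_NOUN)]:
    print(name, "->", pp_attachment(arcs))
```

Both parser outputs come back as verb-attachment, and only the hand-built variant yields the noun-attachment structure that matches the preferred human reading of sentence 3.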

In [3]:
# Import the required libraries
import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from collections import Counter
from prettytable import PrettyTable

In [None]:
# downloading the English model for spaCy
#!python -m spacy download en_core_web_sm
#%pip install prettytable



# Part 2: Select any one question to answer in this
[Linguistic Analysis of a Text Corpus Using spaCy]

There is an uploaded text corpus named ”sample.xlsx”. Conduct a linguistic
analysis using Part-of-Speech (POS) tagging and Named Entity Recognition
(NER) in spaCy library only on the ”SOS Tweet/SOS Message” column.
To complete this question you have to do the following tasks:

In [4]:
# Define the path to the Excel file and store the target column name in a variable
path = "sample.xlsx"
col = "SOS Tweet / SOS Message"

# 1) Load the Excel file "sample.xlsx", skipping the first 8 rows and the last 5 rows
df = pd.read_excel(path, skiprows=8, skipfooter=5, usecols=[col])

df.head()


Unnamed: 0,SOS Tweet / SOS Message
0,my relative B************n who is from bagmara...
1,[Redacted Mention] [Redacted Mention] we need ...
2,#SOSDehradun #Plasma4Covid Urgent Plasma requ...
3,#SOS.#SOS : Oxygen bed required Location: Coi...
4,Name: Y****************m Age: 62 Location: H...


1) Prepare the Corpus: Pre-process the corpus data by removing stopwords, removing extra spaces, converting all text to lowercase, or applying any other text normalization techniques that you think are relevant.

In [5]:
# 2) Flat list of tweets: one stripped string per tweet, missing values dropped
tw = (
    df[col]
    .dropna() 
    .astype(str) 
    .str.strip() 
    .tolist() 
)

# Overview of the loaded tweets
print(f"Loaded: {len(tw)} tweets")
print(tw[:1])  # view the first tweet

Loaded: 100 tweets
['my relative B************n who is from bagmara jharkhand is tested positive for corona and is admitted in recovery nourshing home bardmaan West Bengal.He is not getting sufficient treatment and his health is getting worse. We need an urgent ventilation bed. HELP [Redacted Mention]']


In [None]:
# 3) setup spaCy for tokenization
nlp = spacy.load("en_core_web_sm")

# The tweets anonymize names with a pattern like "A***B", so a custom entity ruler is added to tag these tokens as PERSON entities.
ruler = nlp.add_pipe("entity_ruler", before="ner")
patterns = [
    {
        "label": "PERSON",
        "pattern": [{"TEXT": {"REGEX": r"^[A-Za-z]\*+[A-Za-z]$"}}]
    }
]
ruler.add_patterns(patterns)

# Set up the stop words: spaCy defaults plus corpus-specific noise (apparently mis-decoded emoji bytes and stray symbols)
stop_words = set(STOP_WORDS) | {'helpðÿ', '™', '\x8fðÿ', '\x8f', 'ðÿ', '+', '@', "#", "*", 'a+'}


# tokenize the tweets using spaCy's nlp.pipe with batch processing for efficiency
tokenized_tweets = [
    [tok.text for tok in doc]
    for doc in nlp.pipe(tw, batch_size=200)
]
print(tokenized_tweets[:2])


# clean the tokenized tweets by removing stop words, punctuation, and whitespace, and converting to lowercase
cleaned_tweets = [
    [
        tok.text.lower().strip("-") # remove leading/trailing "-" but keep internal hyphens
        for tok in doc
        if not tok.is_space 
        and not tok.is_punct 
        and tok.text.lower() not in stop_words 
    ]
    for doc in nlp.pipe(tw, batch_size=200)
]

print(cleaned_tweets[:2])




[['my', 'relative', 'B************n', 'who', 'is', 'from', 'bagmara', 'jharkhand', 'is', 'tested', 'positive', 'for', 'corona', 'and', 'is', 'admitted', 'in', 'recovery', 'nourshing', 'home', 'bardmaan', 'West', 'Bengal', '.', 'He', 'is', 'not', 'getting', 'sufficient', 'treatment', 'and', 'his', 'health', 'is', 'getting', 'worse', '.', 'We', 'need', 'an', 'urgent', 'ventilation', 'bed', '.', 'HELP', '[', 'Redacted', 'Mention', ']'], ['[', 'Redacted', 'Mention', ']', '[', 'Redacted', 'Mention', ']', 'we', 'need', 'urgently', 'oxygen', 'sylinder', 'in', 'Nadiad', 'for', 'our', 'grand', 'fother', '..', 'please', 'help', 'us', '..']]
[['relative', 'b************n', 'bagmara', 'jharkhand', 'tested', 'positive', 'corona', 'admitted', 'recovery', 'nourshing', 'home', 'bardmaan', 'west', 'bengal', 'getting', 'sufficient', 'treatment', 'health', 'getting', 'worse', 'need', 'urgent', 'ventilation', 'bed', 'help', 'redacted', 'mention'], ['redacted', 'mention', 'redacted', 'mention', 'need', 'ur
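The entity-ruler pattern in the cell above relies on a regular expression applied to each token's text. A minimal sketch with Python's `re` module, independent of spaCy, shows which strings that expression accepts. The first three sample tokens come from the outputs above; `Ab***c` is a made-up counterexample.

```python
import re

# Same regular expression as in the entity-ruler pattern above:
# one letter, one or more asterisks, one letter, and nothing else.
MASKED_NAME = re.compile(r"^[A-Za-z]\*+[A-Za-z]$")

samples = [
    "B************n",      # masked name from the first tweet
    "Y****************m",  # masked name from the fifth row
    "bagmara",             # ordinary word, no asterisks
    "Ab***c",              # made-up example: two letters before the mask
]

for text in samples:
    print(f"{text!r}: {bool(MASKED_NAME.match(text))}")
```

Because the character class covers both cases, the lowercased form `b************n` in the cleaned output still matches. Names masked with more than one visible letter at either end would not match and would need a broader pattern.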

2. POS Tagging and NER: Apply POS tagging to the entire corpus to
analyze the distribution of different parts of speech. Use NER to identify
and categorize named entities within the text such as name, date, locations
etc.

In [None]:
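# POS tagging
# Re-join the cleaned tokens into strings so spaCy can process them again.
# Note: tagging runs on the lowercased, stop-word-free text, not the raw tweets.
cleaned_texts = [" ".join(tokens) for tokens in cleaned_tweets]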

pos_counts = Counter(
    tok.pos_ 
    for doc in nlp.pipe(cleaned_texts, batch_size=200) 
    for tok in doc
)

# Put the POS counts into a PrettyTable, most frequent tag first
pos_tab = PrettyTable(["POS", "Count"])
for pos, count in pos_counts.most_common():
    pos_tab.add_row([pos, count])

print(pos_tab)



+-------+-------+
|  POS  | Count |
+-------+-------+
|  NOUN |  1092 |
|  VERB |  499  |
| PROPN |  469  |
|  NUM  |  201  |
|  ADJ  |  157  |
|  ADV  |   34  |
|  AUX  |   20  |
|  ADP  |   16  |
|  INTJ |   13  |
| PUNCT |   9   |
|   X   |   8   |
|  PRON |   5   |
| CCONJ |   4   |
|  DET  |   3   |
|  PART |   2   |
|  SYM  |   1   |
| SCONJ |   1   |
+-------+-------+


# Question 2.3 – Analysis

Nouns dominate (1,092 tokens), which shows that the messages focus on concrete entities such as resources, people, and medical needs rather than on storytelling. Verbs are the next most common (499), reflecting that many lines are framed as direct actions or requests (for example “need,” “require,” “contact,” “help”). Proper nouns are also frequent (469), meaning the texts often include specific places, hospitals, and names that make the request actionable. The relatively high count of numerals (201) points to critical details such as phone numbers, ages, and oxygen/SpO2 values. Function-word tags such as DET, PRON, and ADP are rare largely because stop words were removed before tagging.
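To put the distribution in relative terms, the counts from the POS table can be converted to percentages. This is a small sketch that copies the counts by hand from the table above rather than recomputing them from the corpus.

```python
# POS counts copied from the table above.
pos_counts = {
    "NOUN": 1092, "VERB": 499, "PROPN": 469, "NUM": 201, "ADJ": 157,
    "ADV": 34, "AUX": 20, "ADP": 16, "INTJ": 13, "PUNCT": 9, "X": 8,
    "PRON": 5, "CCONJ": 4, "DET": 3, "PART": 2, "SYM": 1, "SCONJ": 1,
}

total = sum(pos_counts.values())

# Share of each tag as a percentage of all tagged tokens.
shares = {pos: 100 * count / total for pos, count in pos_counts.items()}

# Print the four most frequent tags with their shares.
top_four = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:4]
for pos, share in top_four:
    print(f"{pos:<6}{share:5.1f}%")
```

Under these counts, nouns make up roughly 43% of the 2,534 tagged tokens, and the four most frequent tags together account for close to 90%.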

In [None]:
# NER counts

entity_counts = Counter(
    ent.label_ 
    for doc in nlp.pipe(cleaned_texts, batch_size=200) 
    for ent in doc.ents
)

ner_tab = PrettyTable(["NER", "Count"])
for ner, count in entity_counts.most_common():
    ner_tab.add_row([ner,count])

print(ner_tab)

+----------+-------+
|   NER    | Count |
+----------+-------+
| CARDINAL |  116  |
|  PERSON  |  107  |
|   DATE   |   49  |
|   ORG    |   25  |
|   GPE    |   8   |
| QUANTITY |   4   |
|  EVENT   |   1   |
| PRODUCT  |   1   |
| ORDINAL  |   1   |
|   NORP   |   1   |
|   TIME   |   1   |
|   LAW    |   1   |
+----------+-------+
