# Part 1: Mandatory Question
### The following two sentences are ambiguous – that is, they each have two (or more) different readings. For each sentence, explain the different readings:
1. The girl attacked the boy with the book.

Can be read as "The girl used the book as a weapon to attack the boy" or "The girl attacked the boy who was holding the book".

2. We decided to leave on Saturday.

Can be read as "We made the decision on Saturday" or "Our leaving is scheduled for Saturday".

### Are the following sentences ambiguous? If the sentence is ambiguous, explain the different readings. If not, explain why not.
3. I saw a man with a briefcase.
Structurally ambiguous, but only one reading is plausible. It can be read as "I saw a man carrying a briefcase" or "I used a briefcase to see a man". The second reading is grammatically possible but makes no sense, so in practice the sentence is understood the first way.

4. I saw the planet with a telescope.
Ambiguous. Can be read as "I used a telescope to see the planet" or "I saw the planet that has a telescope".

### Try the above sentences (3 and 4) in this parse tree generator: https://huggingface.co/spaces/nanom/syntactic_tree Compare the dependency structures for both of them. Explain the differences in the link to the preposition “with” in 3 vs. 4.

Sentence 3: “(I) saw a man with a briefcase.”
- **saw** = Root (VBD)
- **man** depends on **saw** (the direct object)
- **with** depends on **saw** (a PP dependent/adjunct of the verb)
- **briefcase** depends on **with** (object of the preposition)
- **saw** → with → **briefcase**

Sentence 4: “I saw the planet with a telescope.”
- **saw** = Root (VBD)
- **I** depends on **saw** (subject)
- **planet** depends on **saw** (direct object)
- **with** depends on **saw**
- **telescope** depends on **with**

Structurally, the outputs show no difference in attachment: in both, **with** attaches to the verb **saw**.

Interpretively, the same syntactic attachment has different consequences:

- In (3), verb-attachment makes the sentence lean toward the odd reading (“saw using a briefcase”), whereas many humans prefer noun-attachment (“man with a briefcase”).

- In (4), verb-attachment produces the very natural instrumental reading (“saw using a telescope”), which is why the parser’s choice looks “right.”
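
The two attachment options can also be written out as plain data. The following is a minimal sketch, not the parser's actual output format: the head maps are typed in by hand, and the names `describe_pp_attachment`, `verb_attached` and `noun_attached` are hypothetical. Each token index maps to the index of its head, with the root pointing at itself, and only the head of "with" differs between the two maps.

```python
from typing import Dict, List


def describe_pp_attachment(tokens: List[str], heads: Dict[int, int], verb: str = "saw") -> str:
    """Report the reading implied by the head of the preposition 'with'."""
    prep = tokens.index("with")
    head_word = tokens[heads[prep]]
    # The object of the preposition is the token whose head is 'with'.
    pobj = next(tokens[i] for i, h in heads.items() if h == prep)
    if head_word == verb:
        return f"instrument: {verb} using the {pobj}"
    return f"modifier: the {head_word} has the {pobj}"


tokens = ["I", "saw", "the", "planet", "with", "a", "telescope"]

# Verb attachment: 'with' (index 4) depends on 'saw' (index 1).
verb_attached = {0: 1, 1: 1, 2: 3, 3: 1, 4: 1, 5: 6, 6: 4}
# Noun attachment: 'with' (index 4) depends on 'planet' (index 3).
noun_attached = {**verb_attached, 4: 3}

verb_reading = describe_pp_attachment(tokens, verb_attached)
noun_reading = describe_pp_attachment(tokens, noun_attached)
print(verb_reading)
print(noun_reading)
```

Changing a single arc is enough to switch between the instrumental and the modifier reading, which is why the parser's identical choice for (3) and (4) matters for interpretation.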

In [3]:
import pandas as pd 
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [4]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# Part 2: Select any one question to answer in this
[Linguistic Analysis of a Text Corpus Using spaCy]

There is an uploaded text corpus named "sample.xlsx". Conduct a linguistic
analysis using Part-of-Speech (POS) tagging and Named Entity Recognition
(NER) in spaCy library only on the "SOS Tweet/SOS Message" column.
To complete this question you have to do the following tasks:

In [5]:
path = "/Users/clara.s.nielsen/Desktop/sample.xlsx"
col = "SOS Tweet / SOS Message"

# ---- Load data  ----
df = pd.read_excel(
    path, 
    skiprows= 8, 
    skipfooter=5, 
    usecols= [col])

df.head()


Unnamed: 0,SOS Tweet / SOS Message
0,my relative B************n who is from bagmara...
1,[Redacted Mention] [Redacted Mention] we need ...
2,#SOSDehradun #Plasma4Covid Urgent Plasma requ...
3,#SOS.#SOS : Oxygen bed required Location: Coi...
4,Name: Y****************m Age: 62 Location: H...


1) Prepare the Corpus: Pre-process the corpus data by removing stopwords, removing extra spaces, converting all text to lowercase, or applying any other text normalization techniques that you think are relevant.

In [6]:
# Make a clean list of tweet strings (no NaNs, stripped)
tw = (
    df[col]
    .dropna() 
    .astype(str) 
    .str.strip() 
    .tolist() 
)

print(f"Loaded: {len(tw)} tweets")
print(tw[:1]) 

Loaded: 100 tweets
['my relative B************n who is from bagmara jharkhand is tested positive for corona and is admitted in recovery nourshing home bardmaan West Bengal.He is not getting sufficient treatment and his health is getting worse. We need an urgent ventilation bed. HELP [Redacted Mention]']


In [63]:
# ---- spaCy setup ----
nlp = spacy.load("en_core_web_sm")

# Rule-based matcher that runs before the statistical NER, so tokens shaped
# like redacted names (e.g. "B************n") are labelled PERSON
ruler = nlp.add_pipe("entity_ruler", before="ner")

patterns = [
    {
        "label": "PERSON",
        "pattern": [{"TEXT": {"REGEX": r"^[A-Za-z]\*+[A-Za-z]$"}}]
    }
]

ruler.add_patterns(patterns)

# Stop words: spaCy defaults plus corpus-specific noise
# (mis-encoded emoji fragments and stray symbols)
stop_words = set(STOP_WORDS) | {'helpðÿ', '™', '\x8fðÿ', '\x8f', 'ðÿ', '+', '@', "#", "*", 'a+'}


# ---- Run the pipeline once and reuse the docs ----
docs = list(nlp.pipe(tw, batch_size=200))


# ---- Tokenize each tweet ----
tokenized_tweets = [[tok.text for tok in doc] for doc in docs]
print(tokenized_tweets[:2])


# ---- Clean tokens per tweet ----
# Drop whitespace, punctuation and stop words; lowercase what remains
# and trim stray leading/trailing hyphens
cleaned_tweets = [
    [
        tok.text.lower().strip("-")
        for tok in doc
        if not tok.is_space
        and not tok.is_punct
        and tok.text.lower() not in stop_words
    ]
    for doc in docs
]

print(cleaned_tweets)




[['my', 'relative', 'B************n', 'who', 'is', 'from', 'bagmara', 'jharkhand', 'is', 'tested', 'positive', 'for', 'corona', 'and', 'is', 'admitted', 'in', 'recovery', 'nourshing', 'home', 'bardmaan', 'West', 'Bengal', '.', 'He', 'is', 'not', 'getting', 'sufficient', 'treatment', 'and', 'his', 'health', 'is', 'getting', 'worse', '.', 'We', 'need', 'an', 'urgent', 'ventilation', 'bed', '.', 'HELP', '[', 'Redacted', 'Mention', ']'], ['[', 'Redacted', 'Mention', ']', '[', 'Redacted', 'Mention', ']', 'we', 'need', 'urgently', 'oxygen', 'sylinder', 'in', 'Nadiad', 'for', 'our', 'grand', 'fother', '..', 'please', 'help', 'us', '..']]
[['relative', 'b************n', 'bagmara', 'jharkhand', 'tested', 'positive', 'corona', 'admitted', 'recovery', 'nourshing', 'home', 'bardmaan', 'west', 'bengal', 'getting', 'sufficient', 'treatment', 'health', 'getting', 'worse', 'need', 'urgent', 'ventilation', 'bed', 'help', 'redacted', 'mention'], ['redacted', 'mention', 'redacted', 'mention', 'need', 'ur
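
The entity-ruler pattern above relies on a regular expression for redacted names. As a rough check of what that expression accepts, here is a small sketch using Python's `re` module directly, outside spaCy. The names `REDACTED_NAME` and `is_redacted_name` and the sample list are made up for illustration; the first and third samples are taken from the corpus output shown above.

```python
import re

# Same expression as in the entity-ruler pattern:
# one letter, one or more asterisks, one letter, and nothing else.
REDACTED_NAME = re.compile(r"^[A-Za-z]\*+[A-Za-z]$")


def is_redacted_name(token: str) -> bool:
    """Return True if the token has the shape of a redacted name."""
    return REDACTED_NAME.match(token) is not None


samples = ["B************n", "b************n", "Y****************m", "a+", "B*", "**", "Bengal"]
matches = [s for s in samples if is_redacted_name(s)]
print(matches)
```

Because the character class covers both cases, the lowercased form still matches, so the rule should keep applying after the tokens have been lowercased.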

In [61]:
# POS tagging: count every POS tag in the cleaned corpus, most common first
from collections import Counter
from prettytable import PrettyTable
cleaned_texts = [" ".join(tokens) for tokens in cleaned_tweets]

pos_counts = Counter(
    tok.pos_ 
    for doc in nlp.pipe(cleaned_texts, batch_size=200) 
    for tok in doc
)
# make it into a pretty table
#pos_df = pd.DataFrame(pos_counts.most_common(), columns=["POS", "Count"])
#print(pos_df)

pos_tab = PrettyTable(["POS", "Count"])
for pos,count in pos_counts.most_common():
    pos_tab.add_row([pos,count])

print(pos_tab)



+-------+-------+
|  POS  | Count |
+-------+-------+
|  NOUN |  1092 |
|  VERB |  499  |
| PROPN |  469  |
|  NUM  |  201  |
|  ADJ  |  157  |
|  ADV  |   34  |
|  AUX  |   20  |
|  ADP  |   16  |
|  INTJ |   13  |
| PUNCT |   9   |
|   X   |   8   |
|  PRON |   5   |
| CCONJ |   4   |
|  DET  |   3   |
|  PART |   2   |
|  SYM  |   1   |
| SCONJ |   1   |
+-------+-------+
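
If only the ten most frequent tags are wanted, `Counter.most_common` accepts a count. The sketch below assumes the counts from the table above, re-entered by hand rather than recomputed with spaCy; the share column and the names `top10` and `total` are additions for illustration.

```python
from collections import Counter

# Counts copied from the POS table above.
pos_counts = Counter({
    "NOUN": 1092, "VERB": 499, "PROPN": 469, "NUM": 201, "ADJ": 157,
    "ADV": 34, "AUX": 20, "ADP": 16, "INTJ": 13, "PUNCT": 9,
    "X": 8, "PRON": 5, "CCONJ": 4, "DET": 3, "PART": 2,
    "SYM": 1, "SCONJ": 1,
})

total = sum(pos_counts.values())
top10 = pos_counts.most_common(10)  # ten most frequent tags, highest first

for pos, count in top10:
    # Tag, raw count, and share of all tagged tokens
    print(f"{pos:<6}{count:>6}{count / total:>8.1%}")
```

Nouns, verbs and proper nouns dominate because stop-word removal has already stripped most function words from the text.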


In [62]:

# NER: count entity labels in the cleaned corpus, most common first
from collections import Counter
entity_counts = Counter(
    ent.label_ 
    for doc in nlp.pipe(cleaned_texts, batch_size=200) 
    for ent in doc.ents
)
#entity_df = pd.DataFrame(entity_counts.most_common(), columns=["Entity", "Count"])
#print(entity_df)

ner_tab = PrettyTable(["NER", "Count"])
for ner,count in entity_counts.most_common():
    ner_tab.add_row([ner,count])

print(ner_tab)

+----------+-------+
|   NER    | Count |
+----------+-------+
| CARDINAL |  116  |
|  PERSON  |  107  |
|   DATE   |   49  |
|   ORG    |   25  |
|   GPE    |   8   |
| QUANTITY |   4   |
|  EVENT   |   1   |
| PRODUCT  |   1   |
| ORDINAL  |   1   |
|   NORP   |   1   |
|   TIME   |   1   |
|   LAW    |   1   |
+----------+-------+
