# Part 1: Mandatory Question
### The following two sentences are ambiguous – that is, they each have two (or more) different readings. For each sentence, explain the different readings:
1. The girl attacked the boy with the book.

Can be read as "The girl used the book as a weapon to attack the boy" ("with the book" modifies "attacked") or "The girl attacked the boy who had the book" ("with the book" modifies "the boy").

2. We decided to leave on Saturday.

Can be read as "We made the decision on Saturday" ("on Saturday" modifies "decided") or "The departure is scheduled for Saturday" ("on Saturday" modifies "leave").

### Are the following sentences ambiguous? If the sentence is ambiguous, explain the different readings. If not, explain why not.
3. I saw a man with a briefcase.

Structurally ambiguous, but not ambiguous in practice. It can be read as "I saw a man who was carrying a briefcase" or "I used a briefcase to see a man". The second reading is grammatically possible but makes no sense, so only the first one survives.

4. I saw the planet with a telescope.

Ambiguous. It can be read as "I used a telescope to see the planet" or "I saw the planet that has a telescope".
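
All four sentences share one pattern: a trailing phrase that can attach either to the verb or to the phrase right before it. The sketch below makes the two bracketings explicit. The helper `bracketings` and its flat four-part split of the sentence are illustrative assumptions, not the output of a real parser.

```python
def bracketings(subject: str, verb: str, obj: str, pp: str) -> dict:
    """Return the two candidate bracketings for a trailing phrase `pp`.

    "high": pp attaches to the verb (instrument / time-of-deciding reading).
    "low":  pp attaches inside the object phrase (modifier reading).
    """
    return {
        "high": f"[{subject} [{verb} [{obj}] [{pp}]]]",
        "low": f"[{subject} [{verb} [{obj} [{pp}]]]]",
    }


examples = [
    ("The girl", "attacked", "the boy", "with the book"),
    ("We", "decided", "to leave", "on Saturday"),
    ("I", "saw", "a man", "with a briefcase"),
    ("I", "saw", "the planet", "with a telescope"),
]

for parts in examples:
    for label, tree in bracketings(*parts).items():
        print(f"{label:>4}: {tree}")
```

In sentence 3 only the "low" bracketing gives a sensible meaning, whereas in sentence 4 both do.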

### Try the above sentences (3 and 4) in this parse tree generator: https://huggingface.co/spaces/nanom/syntactic_tree Compare the dependency structures for both of them. Explain the differences in the link to the preposition “with” in 3 vs. 4.

Sentence 3: “(I) saw a man with a briefcase.”
- **saw** = Root (VBD)
- **man** depends on **saw** (the direct object)
- **with** depends on **saw** (a PP dependent/adjunct of the verb)
- **briefcase** depends on **with** (object of the preposition)
- Chain: **saw** → **with** → **briefcase**

Sentence 4: “I saw the planet with a telescope.”
- **saw** = Root (VBD)
- **I** depends on **saw** (subject)
- **planet** depends on **saw** (direct object)
- **with** depends on **saw**
- **telescope** depends on **with**

Structurally, the two outputs show no difference in attachment: in both, **with** attaches to the verb **saw**.

Interpretively, the same syntactic attachment has different consequences:

- In (3), verb-attachment makes the sentence lean toward the odd reading (“saw using a briefcase”), whereas many humans prefer noun-attachment (“man with a briefcase”).

- In (4), verb-attachment produces the very natural instrumental reading (“saw using a telescope”), which is why the parser’s choice looks “right.”
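
The comparison can also be checked in code by asking which word "with" depends on. This is a minimal sketch: `Tok` is a hypothetical stand-in that only mimics the `.text` and `.head` attributes of a parsed token, and the two token lists are built by hand (the first from the parser output listed above, the second from the noun-attachment reading).

```python
class Tok:
    """Hypothetical stand-in for a parsed token (text plus head)."""

    def __init__(self, text, head=None):
        self.text = text
        # As in spaCy, the root token is its own head.
        self.head = head if head is not None else self


def attachment_site(tokens, word="with"):
    """Return the text of the token that `word` depends on, or None if absent."""
    for tok in tokens:
        if tok.text.lower() == word:
            return tok.head.text
    return None


# Sentence 3 as the parser produced it: "with" depends on "saw".
saw = Tok("saw")
man = Tok("man", saw)
with_ = Tok("with", saw)
briefcase = Tok("briefcase", with_)
parser_output = [saw, man, with_, briefcase]

# Sentence 3 under the reading most readers prefer: "with" depends on "man".
saw2 = Tok("saw")
man2 = Tok("man", saw2)
with2 = Tok("with", man2)
briefcase2 = Tok("briefcase", with2)
preferred_reading = [saw2, man2, with2, briefcase2]

print(attachment_site(parser_output))      # saw
print(attachment_site(preferred_reading))  # man
```

Because spaCy tokens expose `.text` and `.head` as well, the same function should accept a parsed `Doc`, for example `attachment_site(nlp("I saw a man with a briefcase."))`. The result then depends on the model that is loaded.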

In [None]:
import pandas as pd 
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [None]:
!python -m spacy download en_core_web_sm

# Part 2: Select any one question to answer in this
[Linguistic Analysis of a Text Corpus Using spaCy]

There is an uploaded text corpus named "sample.xlsx". Conduct a linguistic
analysis using Part-of-Speech (POS) tagging and Named Entity Recognition
(NER) with the spaCy library, only on the "SOS Tweet/SOS Message" column.
To complete this question, you have to do the following tasks:

In [None]:
path = "/Users/clara.s.nielsen/Desktop/sample.xlsx"
col = "SOS Tweet / SOS Message"

# ---- Load data ----
# Skip the first 8 rows and the last 5 rows of the sheet; read only the message column
df = pd.read_excel(
    path,
    skiprows=8,
    skipfooter=5,
    usecols=[col],
)

df.head()


1) Prepare the Corpus: Pre-process the corpus data by removing stopwords, removing extra spaces, converting all text to lowercase, or applying any other text normalization techniques that you think are relevant.

In [None]:
# Make a clean list of tweet strings (no NaNs, stripped)
tw = (
    df[col]
    .dropna() 
    .astype(str) 
    .str.strip() 
    .tolist() 
)

print(f"Loaded: {len(tw)} tweets")
print(tw[:1]) 
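
Note that `.str.strip()` only trims the ends of each tweet; runs of spaces or line breaks inside a tweet are left as they are. spaCy turns those into whitespace tokens, which the `is_space` filter in the cleaning step removes. If a normalized string is wanted as well, a string-level version could look like the sketch below, where `normalize_text` is a hypothetical helper name.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse whitespace runs to single spaces, trim the ends, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


print(normalize_text("  HELP   needed\n at  Main St  "))  # help needed at main st
```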

In [None]:
# ---- spaCy setup (for tokenization) ----
nlp = spacy.load("en_core_web_sm")

# Stop words: spaCy defaults + garbled-character fragments seen in the data + Twitter boilerplate
stop_words = set(STOP_WORDS) | {"helpðÿ", "™", "\x8fðÿ", "\x8f", "ðÿ"}
stop_words.update({"rt", "via", "amp"})


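# ---- Parse each tweet once and reuse the docs ----
docs = list(nlp.pipe(tw, batch_size=200))

# ---- Tokenize each tweet ----
tokenized_tweets = [[tok.text for tok in doc] for doc in docs]
print(tokenized_tweets[:2])


# ---- Clean tokens per tweet ----
# Keep alphabetic tokens (a trailing hyphen is tolerated and stripped), lowercase them,
# and drop whitespace, punctuation, and stop words.
# The stop-word check runs on the normalized form, so "the-" is filtered like "the".
cleaned_tweets = [
    [
        tok.text.lower().rstrip("-")
        for tok in doc
        if not tok.is_space
        and not tok.is_punct
        and (tok.is_alpha or tok.text.rstrip("-").isalpha())
        and tok.text.lower().rstrip("-") not in stop_words
    ]
    for doc in docs
]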



print(cleaned_tweets[83:85])
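
The extra stop words such as `ðÿ` and `™` look like mojibake: the UTF-8 bytes of an emoji that were decoded as Windows-1252 and then lowercased. That is an assumption about the file, not something verified here. If it holds, repairing the raw text before lowercasing is a possible alternative to filtering the fragments out. A minimal sketch, with `fix_mojibake` as a hypothetical helper:

```python
def fix_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was wrongly decoded as Windows-1252.

    Returns the input unchanged when the round trip does not yield valid UTF-8.
    """
    raw = bytearray()
    for ch in text:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            try:
                # Bytes such as 0x8F have no cp1252 mapping and survive as U+008F.
                raw += ch.encode("latin-1")
            except UnicodeEncodeError:
                return text  # this character cannot be one mis-decoded byte
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return text


# "Help" followed by a folded-hands emoji, as it looks after the encoding mix-up
garbled = "Help\u00f0\u0178\u2122\u008f"
print(fix_mojibake(garbled) == "Help\U0001F64F")  # True
print(fix_mojibake("plain text"))                 # plain text
```

The repair has to run on the original strings: lowercasing turns `Ÿ` into `ÿ`, which maps to a different byte, so the sequence can no longer be decoded.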