# Computational Linguistics | Exam


**Note**: *Add your solutions to this notebook and submit it on the Teams platform. You have 2 hours.*

## Task 1 (10)

---

Write a feature-based grammar that describes the following sentences:
```
Maleni pas je lajao.
Opasni psi laju.
Malena mačka prede.
Ptica pjeva.
```

Make sure the NP agrees with the VP in number, gender, and person. The grammar should therefore reject the sentence `Malena pas laju`, because the adjective, noun, and verb do not agree in gender and number.


In [1]:
import nltk
from nltk import load_parser
from nltk.tokenize import word_tokenize

# Define the grammar as a string
grammar = """
% start S
S -> NP[NUM=?n] VP[NUM=?n]

# NP (noun phrase) productions
NP[NUM=?n, GEN=?g] -> Adj[NUM=?n, GEN=?g] N[NUM=?n, GEN=?g]
NP[NUM=?n, GEN=?g] -> N[NUM=?n, GEN=?g]

# VP (verb phrase) productions
VP[TENSE=?t, NUM=?n] -> V[TENSE=?t, NUM=?n]
# Note: ?n is not shared with the auxiliary or the inner VP, so NUM stays unbound in this rule
VP[TENSE=?t, NUM=?n] -> V[TENSE=?t, +AUX] VP[TENSE=?t, -AUX]

# Lexical productions for nouns
N[NUM=sg, GEN=m] -> 'pas'
N[NUM=pl, GEN=m] -> 'psi'
N[NUM=sg, GEN=f] -> 'mačka'
N[NUM=sg, GEN=f] -> 'ptica'

# Lexical productions for adjectives
Adj[NUM=sg, GEN=m] -> 'maleni'
Adj[NUM=pl, GEN=m] -> 'opasni'
Adj[NUM=sg, GEN=f] -> 'malena'

# Lexical productions for verbs
V[TENSE=past, NUM=sg, -AUX] -> 'lajao'
V[TENSE=pres, NUM=pl, -AUX] -> 'laju'
V[TENSE=pres, NUM=sg, -AUX] -> 'prede'
V[TENSE=pres, NUM=sg, -AUX] -> 'pjeva'

# Lexical productions for auxiliary verbs
V[TENSE=past, +AUX] -> 'je'
"""

# Save the grammar to a file using UTF-8 encoding
with open('grammar.fcfg', 'w', encoding='utf-8') as f:
    f.write(grammar)

# Load the parser
cp = load_parser('grammar.fcfg')

# Sentences to parse
sentences = [
    'Maleni pas je lajao.',
    'Opasni psi laju.',
    'Malena mačka prede.',
    'Ptica pjeva.',
]

# Parse each sentence and display its trees
for sentence in sentences:
    # Strip the trailing period
    sentence = sentence.rstrip('.')
    tokens = word_tokenize(sentence.lower())
    trees = list(cp.parse(tokens))
    for tree in trees:
        print(tree)
        tree.draw()


(S[]
  (NP[GEN='m', NUM='sg']
    (Adj[GEN='m', NUM='sg'] maleni)
    (N[GEN='m', NUM='sg'] pas))
  (VP[NUM=?n, TENSE='past']
    (V[+AUX, TENSE='past'] je)
    (VP[NUM='sg', TENSE='past']
      (V[-AUX, NUM='sg', TENSE='past'] lajao))))
(S[]
  (NP[GEN='m', NUM='pl']
    (Adj[GEN='m', NUM='pl'] opasni)
    (N[GEN='m', NUM='pl'] psi))
  (VP[NUM='pl', TENSE='pres'] (V[-AUX, NUM='pl', TENSE='pres'] laju)))
(S[]
  (NP[GEN='f', NUM='sg']
    (Adj[GEN='f', NUM='sg'] malena)
    (N[GEN='f', NUM='sg'] mačka))
  (VP[NUM='sg', TENSE='pres']
    (V[-AUX, NUM='sg', TENSE='pres'] prede)))
(S[]
  (NP[GEN='f', NUM='sg'] (N[GEN='f', NUM='sg'] ptica))
  (VP[NUM='sg', TENSE='pres']
    (V[-AUX, NUM='sg', TENSE='pres'] pjeva)))
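
The task also requires that ill-formed input such as `Malena pas laju` is rejected. A minimal sketch of such a check follows. It assumes a reduced grammar that is not the solution above: it keeps only `NUM` and `GEN`, drops `TENSE`, and shares `NUM` between the auxiliary and the main verb. The names `toy_grammar` and `count_parses` are introduced here for illustration only.

```python
from nltk.grammar import FeatureGrammar
from nltk.parse import FeatureChartParser

# Reduced illustrative grammar (an assumption of this sketch, not the exam solution)
toy_grammar = FeatureGrammar.fromstring("""
% start S
S -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n, GEN=?g] -> Adj[NUM=?n, GEN=?g] N[NUM=?n, GEN=?g]
NP[NUM=?n, GEN=?g] -> N[NUM=?n, GEN=?g]
VP[NUM=?n] -> V[NUM=?n, -AUX]
VP[NUM=?n] -> V[NUM=?n, +AUX] V[NUM=?n, -AUX]
N[NUM=sg, GEN=m] -> 'pas'
N[NUM=pl, GEN=m] -> 'psi'
Adj[NUM=sg, GEN=m] -> 'maleni'
Adj[NUM=sg, GEN=f] -> 'malena'
Adj[NUM=pl, GEN=m] -> 'opasni'
V[NUM=sg, -AUX] -> 'lajao'
V[NUM=pl, -AUX] -> 'laju'
V[NUM=sg, +AUX] -> 'je'
""")
parser = FeatureChartParser(toy_grammar)

def count_parses(sentence):
    """Return the number of parse trees for a lowercase, space-separated sentence."""
    return len(list(parser.parse(sentence.split())))

for s in ["maleni pas je lajao", "opasni psi laju",
          "malena pas laju", "opasni psi je lajao"]:
    print(s, "->", count_parses(s))
```

A sentence is accepted when at least one tree is found. The last two inputs should yield no trees, because the adjective and noun disagree in gender in the first case and the subject and auxiliary disagree in number in the second.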


## Task 2 (10)

---

Write a program for semantic parsing of the following sentences:
 ```
 Dijete puza.
 Čovjek hoda.
 Ptica leti.
 Riba pliva.
 ```
 * Write a feature-based grammar that converts the text into the corresponding $\lambda$ expression.
 * Provide a model with an evaluation that checks the truth of the following sentences: `Dijete puza. Čovjek hoda. Ptica leti. Riba pliva. Čovjek leti.`

In [2]:
%%writefile sem.fcfg
% start S
# Grammar Rules
S[SEM=<?vp(?np)>] -> NP[SEM=?np] VP[SEM=?vp]
VP[SEM=?v] -> IV[SEM=?v]
# Lexical Rules
NP[SEM=<dijete>] -> 'Dijete'
NP[SEM=<covjek>] -> 'Čovjek'
NP[SEM=<ptica>] -> 'Ptica'
NP[SEM=<riba>] -> 'Riba'
IV[SEM=<\x.puzati(x)>] -> 'puza'
IV[SEM=<\x.hodati(x)>] -> 'hoda'
IV[SEM=<\x.letjeti(x)>] -> 'leti'
IV[SEM=<\x.plivati(x)>] -> 'pliva'


Overwriting sem.fcfg
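
The `S` rule composes meaning by applying the VP semantics to the NP semantics, written `<?vp(?np)>`. As a small illustration outside the parser, the same application and its β-reduction can be reproduced with NLTK's logic expressions. The variable names below are chosen for this sketch only.

```python
from nltk.sem import Expression

read_expr = Expression.fromstring

vp_sem = read_expr(r'\x.letjeti(x)')   # semantics of the verb 'leti'
np_sem = read_expr('ptica')            # semantics of the noun phrase 'Ptica'

application = vp_sem.applyto(np_sem)   # builds the unreduced application
reduced = application.simplify()       # performs the beta-reduction
print(reduced)
```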


In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import load_parser

# Load the grammar
cp = load_parser('sem.fcfg', trace=0)

# Sentences to parse
sentences = ['Dijete puza.', 'Čovjek hoda.', 'Ptica leti.', 'Riba pliva.', 'Čovjek leti.']

# Strip the trailing periods
sentences = [sentence.rstrip('.') for sentence in sentences]

# Tokenize the sentences
tokens = [word_tokenize(sentence) for sentence in sentences]

# Collect the parse trees and their semantic representations
trees = [list(cp.parse(token)) for token in tokens]
sem_reps = [tree[0].label()['SEM'] for tree in trees if tree]

# Print the semantic representations
for sem in sem_reps:
    print(sem)

# Set up the model
v = """
    dijete => d
    covjek => c
    ptica => p
    riba => r
    puzati => {d}
    hodati => {c}
    letjeti => {p}
    plivati => {r}
"""
val = nltk.Valuation.fromstring(v)
g = nltk.Assignment(val.domain)
m = nltk.Model(val.domain, val)

# Evaluate the sentences in the model
results = nltk.evaluate_sents(sentences, 'sem.fcfg', m, g)

# Print the evaluation results
for result in results:
    for (synrep, semrep, value) in result:
        print(synrep)
        print(semrep)
        print(value)


puzati(dijete)
hodati(covjek)
letjeti(ptica)
plivati(riba)
letjeti(covjek)
(S[SEM=<puzati(dijete)>]
  (NP[SEM=<dijete>] Dijete)
  (VP[SEM=<\x.puzati(x)>] (IV[SEM=<\x.puzati(x)>] puza)))
puzati(dijete)
True
(S[SEM=<hodati(covjek)>]
  (NP[SEM=<covjek>] Čovjek)
  (VP[SEM=<\x.hodati(x)>] (IV[SEM=<\x.hodati(x)>] hoda)))
hodati(covjek)
True
(S[SEM=<letjeti(ptica)>]
  (NP[SEM=<ptica>] Ptica)
  (VP[SEM=<\x.letjeti(x)>] (IV[SEM=<\x.letjeti(x)>] leti)))
letjeti(ptica)
True
(S[SEM=<plivati(riba)>]
  (NP[SEM=<riba>] Riba)
  (VP[SEM=<\x.plivati(x)>] (IV[SEM=<\x.plivati(x)>] pliva)))
plivati(riba)
True
(S[SEM=<letjeti(covjek)>]
  (NP[SEM=<covjek>] Čovjek)
  (VP[SEM=<\x.letjeti(x)>] (IV[SEM=<\x.letjeti(x)>] leti)))
letjeti(covjek)
False
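
The truth values above follow directly from the valuation: a formula `P(a)` is true when the entity denoted by `a` belongs to the set denoted by `P`. The sketch below shows this with only two individuals, calling `Model.evaluate` on formula strings instead of going through the grammar. The names `small_val` and `small_model` are illustrative.

```python
import nltk

# Reduced valuation: only 'ptica' is in the extension of 'letjeti'
small_val = nltk.Valuation.fromstring("""
covjek => c
ptica => p
letjeti => {p}
""")
small_model = nltk.Model(small_val.domain, small_val)
assignment = nltk.Assignment(small_val.domain)

bird_flies = small_model.evaluate('letjeti(ptica)', assignment)
man_flies = small_model.evaluate('letjeti(covjek)', assignment)
print(bird_flies, man_flies)
```

`Čovjek leti` comes out false because `c` is not a member of the set assigned to `letjeti`.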


## Task 3 (20)

---


Implement sentiment analysis using a naive Bayes classifier on the movie reviews in `nltk.corpus.movie_reviews`. As features for each review, use whether the review contains each of the 2000 most frequent words in the `movie_reviews` corpus. The classifier's accuracy must exceed 50%. Report precision, recall, and the $F_1$ score for each of the two categories, positive and negative sentiment. Print at least 2 examples where the classifier agrees with the gold standard and at least 2 examples where it does not.

In [None]:
import nltk
from nltk.corpus import movie_reviews
import random
from nltk import FreqDist

# Download movie_reviews corpus if not already downloaded
nltk.download('movie_reviews')

# Build the dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents to avoid any order bias
random.shuffle(documents)

# Build a list of the 2000 most common words as features
# (most_common sorts by frequency; keys() would only follow insertion order)
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [word for word, _ in all_words.most_common(2000)]

# Function to extract features from a document
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# Prepare the feature sets
featuresets = [(document_features(d), c) for (d, c) in documents]
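
To make the feature representation concrete, here is a toy version that does not need the corpus. It assumes a made-up word list and a made-up review (both hypothetical) and uses `collections.Counter`, from which NLTK's `FreqDist` derives. `most_common(n)` returns the `n` highest counts, whereas iterating over the keys follows insertion order.

```python
from collections import Counter

# Hypothetical miniature corpus, for illustration only
toy_words = ["plot", "good", "good", "bad", "good", "bad", "dull"]
toy_counts = Counter(toy_words)

# The two most frequent words, in descending order of frequency
toy_features = [word for word, _ in toy_counts.most_common(2)]

def toy_document_features(document, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    document_words = set(document)
    return {f"contains({word})": word in document_words for word in vocabulary}

features = toy_document_features(["a", "good", "film"], toy_features)
print(toy_features)
print(list(toy_counts)[:2])  # insertion order, not frequency order
print(features)
```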


In [16]:
# Split the dataset into training and testing sets
train_set = featuresets[:1600]
test_set = featuresets[1600:]


In [17]:
# Train the Naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)


In [18]:
# Evaluate the classifier
print("Accuracy:", nltk.classify.accuracy(classifier, test_set))

# Collect test-item indices per gold label (refsets) and per predicted label (testsets)
from collections import defaultdict
refsets = defaultdict(set)
testsets = defaultdict(set)

for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

# Print precision, recall, and F1 score for both positive and negative sentiments
print("Precision (negative):", nltk.precision(refsets['neg'], testsets['neg']))
print("Recall (negative):", nltk.recall(refsets['neg'], testsets['neg']))
print("F1 Score (negative):", nltk.f_measure(refsets['neg'], testsets['neg']))

print("Precision (positive):", nltk.precision(refsets['pos'], testsets['pos']))
print("Recall (positive):", nltk.recall(refsets['pos'], testsets['pos']))
print("F1 Score (positive):", nltk.f_measure(refsets['pos'], testsets['pos']))


Accuracy: 0.795
Precision (negative): 0.7783251231527094
Recall (negative): 0.8102564102564103
F1 Score (negative): 0.7939698492462312
Precision (positive): 0.8121827411167513
Recall (positive): 0.7804878048780488
F1 Score (positive): 0.7960199004975124
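
For reference, the three scores are computed per label from two index sets: the items that truly carry the label (reference) and the items predicted to carry it (test). A minimal worked sketch in plain Python follows, using made-up index sets rather than the results above. The function names are local to this sketch, and empty sets are not handled.

```python
def precision(reference, test):
    """Fraction of predicted items that are correct."""
    return len(reference & test) / len(test)

def recall(reference, test):
    """Fraction of true items that were found."""
    return len(reference & test) / len(reference)

def f1(reference, test):
    """Harmonic mean of precision and recall."""
    p, r = precision(reference, test), recall(reference, test)
    return 2 * p * r / (p + r)

reference = {0, 1, 2, 3}   # hypothetical: items whose gold label is 'neg'
test = {2, 3, 4}           # hypothetical: items predicted as 'neg'

print(precision(reference, test), recall(reference, test), f1(reference, test))
```

Plugging the reported negative-class precision and recall into the same harmonic mean gives about 0.794, in line with the printed $F_1$ score.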


In [19]:
# Examples where the classifier matches the gold standard
print("\nExamples where the classifier matches the gold standard:")
for i, (feats, label) in enumerate(test_set):
    observed = classifier.classify(feats)
    if observed == label:
        print("Correct:", " ".join(documents[1600 + i][0][:10]), "...")

# Examples where the classifier does not match the gold standard
print("\nExamples where the classifier does not match the gold standard:")
shown = 0
for i, (feats, label) in enumerate(test_set):
    observed = classifier.classify(feats)
    if observed != label:
        print("Incorrect:", " ".join(documents[1600 + i][0][:10]), "...")
        print("Predicted:", observed, "Actual:", label)
        shown += 1
        if shown == 2:  # The task asks for at least two mismatches
            break



Examples where the classifier matches the gold standard:
Correct: ` the bachelor ' is one of the best terrible ...
Correct: it ' s been a good long while since we ...
Correct: when a pair of films from the same director gets ...
Correct: i had a chance to see a sneak preview of ...
Correct: " snake eyes " is the most aggravating kind of ...
Correct: america ' s favorite homicidal plaything takes a wicked wife ...
Correct: ingredients : james bond , scuba scene , car controlled ...
Correct: back in 1998 dreamworks unveiled their first computer animated movie ...
Correct: plot : a young french boy sees his parents killed ...
Correct: okay , bear with me y ' all , cause ...
Correct: they should have stuck to the promise emblazoned on the ...
Correct: " the blair witch project " was perhaps one of ...
Correct: did claus von bulow try to kill his wife sunny ...
Correct: synopsis : valerie , a high school junior who doesn ...
Correct: martial arts master steven seagal ( not to mention direc

## Task 4 (20)

---


In the following example, each sentence is a separate document:

```   
    
    Machine learning algorithms use data to make predictions.
    Deep learning models require large amounts of labeled data.
    Natural language processing techniques analyze textual data.
    Milena came home after finishing her workout, immediately took off her backpack, and washed her hands.
    She sat down at the table to eat.
    Then she focused on her homework, not thinking about tomorrow’s match.
    How can you accentuate words in English?
    Do you want to learn a new language quickly and efficiently?
    Exploring English syntax: embark on an adventure through English sentence structure!

```
Do the following:
 1. Compute the TF-IDF vector for each sentence (document). Display the resulting vector for each sentence.

 2. Apply the K-Means algorithm to the resulting TF-IDF vectors, assuming K = 3. Which sentences belong to which cluster?

 3. You are given the following sentence: “Analiza podataka pomoću Pythona je izazovna.” Determine which cluster this sentence belongs to.

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Corpus of sentences (documents)
sentences = [
    "Machine learning algorithms use data to make predictions.",
    "Deep learning models require large amounts of labeled data.",
    "Natural language processing techniques analyze textual data.",
    "Milena came home after finishing her workout, immediately took off her backpack, and washed her hands.",
    "She sat down at the table to eat.",
    "Then she focused on her homework, not thinking about tomorrow’s match.",
    "How can you accentuate words in English?",
    "Do you want to learn a new language quickly and efficiently?",
    "Exploring English syntax: embark on an adventure through English sentence structure!"
]

# 1. Compute the TF-IDF weights for each sentence
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Print the resulting TF-IDF vector for each sentence
print("TF-IDF vektori za svaku rečenicu:")
for i, sentence in enumerate(sentences):
    print(f"Rečenica {i+1}: {sentence}")
    print(f"TF-IDF vektor: {X[i].toarray()}")
    print()


TF-IDF vektori za svaku rečenicu:
Rečenica 1: Machine learning algorithms use data to make predictions.
TF-IDF vektor: [[0.         0.         0.         0.         0.38370906 0.
  0.         0.         0.         0.         0.         0.
  0.         0.2817841  0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.32408678
  0.38370906 0.38370906 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.38370906
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.2817841  0.         0.
  0.38370906 0.         0.         0.         0.         0.        ]]

Rečenica 2: Deep learning models require large amounts of labeled data.
TF-IDF vektor: [[0.         0.         0.         0.         0
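
The non-zero weights in the first vector can be re-derived by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: lowercasing, tokens of two or more word characters, smoothed IDF $\ln\frac{1+n}{1+\mathrm{df}}+1$, and L2 normalisation. It is a plain-Python re-computation, not the library's implementation, and the helper names are illustrative.

```python
import math
import re
from collections import Counter

docs = [
    "Machine learning algorithms use data to make predictions.",
    "Deep learning models require large amounts of labeled data.",
    "Natural language processing techniques analyze textual data.",
    "Milena came home after finishing her workout, immediately took off her backpack, and washed her hands.",
    "She sat down at the table to eat.",
    "Then she focused on her homework, not thinking about tomorrow’s match.",
    "How can you accentuate words in English?",
    "Do you want to learn a new language quickly and efficiently?",
    "Exploring English syntax: embark on an adventure through English sentence structure!",
]

def tokenize(text):
    # Approximates the default token pattern: lowercase, 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalised tf * smoothed-idf weights for one document."""
    tf = Counter(tokens)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

first = tfidf(tokenized[0])
for term, weight in sorted(first.items()):
    print(f"{term:12s} {weight:.8f}")
```

Terms that occur in a single document (such as `machine`) get the largest weight, while `data` and `to`, which each occur in three documents, get the smallest.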

In [24]:
# 2. Cluster the sentences into 3 clusters with the K-Means algorithm
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)

# Print the cluster each sentence belongs to
print("Klasteri za svaku rečenicu:")
for i, sentence in enumerate(sentences):
    cluster = kmeans.labels_[i]
    print(f"Rečenica '{sentence}' je u klasteru {cluster}")



Klasteri za svaku rečenicu:
Rečenica 'Machine learning algorithms use data to make predictions.' je u klasteru 2
Rečenica 'Deep learning models require large amounts of labeled data.' je u klasteru 2
Rečenica 'Natural language processing techniques analyze textual data.' je u klasteru 2
Rečenica 'Milena came home after finishing her workout, immediately took off her backpack, and washed her hands.' je u klasteru 0
Rečenica 'She sat down at the table to eat.' je u klasteru 0
Rečenica 'Then she focused on her homework, not thinking about tomorrow’s match.' je u klasteru 0
Rečenica 'How can you accentuate words in English?' je u klasteru 1
Rečenica 'Do you want to learn a new language quickly and efficiently?' je u klasteru 2
Rečenica 'Exploring English syntax: embark on an adventure through English sentence structure!' je u klasteru 1
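
To see directly which sentences share a cluster, the labels can be grouped by cluster id. The sketch below hard-codes the labels from the run shown above (in corpus order) so that it is self-contained; in the notebook, `labels` would be `kmeans.labels_`. Cluster ids are arbitrary and may differ between runs or scikit-learn versions.

```python
from collections import defaultdict

# Labels copied from the output above, one per sentence in corpus order
labels = [2, 2, 2, 0, 0, 0, 1, 2, 1]

members = defaultdict(list)
for sentence_number, label in enumerate(labels, start=1):
    members[label].append(sentence_number)

for label in sorted(members):
    print(f"Cluster {label}: sentences {members[label]}")
```

In this run, the three machine-learning sentences and the sentence about learning a language end up together, the three narrative sentences form a second cluster, and the two sentences mentioning English form the third.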


In [22]:
# New sentence whose cluster we want to determine
new_sentence = "Analiza podataka pomoću Pythona je izazovna."

# Compute the TF-IDF vector for the new sentence
new_X = vectorizer.transform([new_sentence])

# Assign the new sentence to a cluster
predicted_cluster = kmeans.predict(new_X)[0]

# Print the result
print(f"Rečenica '{new_sentence}' pripada klasteru {predicted_cluster}")


Rečenica 'Analiza podataka pomoću Pythona je izazovna.' pripada klasteru 2
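
This prediction is driven by vocabulary mismatch rather than by content. The new sentence is Croatian, while the vectorizer's vocabulary was fitted on English sentences. Assuming the default tokenization, none of its tokens are in the vocabulary, so `vectorizer.transform` yields an all-zero vector and `kmeans.predict` returns the cluster whose centroid lies closest to the origin. The vocabulary overlap can be checked in plain Python; the `tokenize` helper approximates the default token pattern and is an assumption of this sketch.

```python
import re

corpus = [
    "Machine learning algorithms use data to make predictions.",
    "Deep learning models require large amounts of labeled data.",
    "Natural language processing techniques analyze textual data.",
    "Milena came home after finishing her workout, immediately took off her backpack, and washed her hands.",
    "She sat down at the table to eat.",
    "Then she focused on her homework, not thinking about tomorrow’s match.",
    "How can you accentuate words in English?",
    "Do you want to learn a new language quickly and efficiently?",
    "Exploring English syntax: embark on an adventure through English sentence structure!",
]
new_sentence = "Analiza podataka pomoću Pythona je izazovna."

def tokenize(text):
    # Lowercase tokens of two or more word characters
    return set(re.findall(r"\b\w\w+\b", text.lower()))

vocabulary = set().union(*(tokenize(s) for s in corpus))
shared = tokenize(new_sentence) & vocabulary
print(shared)
```

An empty intersection means the cluster assignment carries no information about the sentence's topic; translating the sentence into English before vectorizing would be one way to obtain a meaningful assignment.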
