# Midterm 2: Preparation

**Topics covered:**
- Naive Bayes classifier
- TF-IDF vectors
- K-means algorithm
- Word2Vec with gensim
- NLTK package

**Format:** Exercises with detailed solutions

## 1. Importing the required packages

In [94]:
import nltk
import numpy as np
import pandas as pd
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# scikit-learn (used for TF-IDF and K-means)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# NLTK classification (Naive Bayes)
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy

# Tokenization and stop words
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Gensim
from gensim.models import Word2Vec

# NLTK resources (newer NLTK releases may additionally need the 'punkt_tab' resource)
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

print("✓ All packages loaded successfully!")

✓ All packages loaded successfully!


---
# TASK 1: Naive Bayes Classifier

**Task:** Implement a naive Bayes classifier that labels movie reviews as positive or negative. Each review is described by boolean features indicating whether the text contains each of the 300 most frequent words (or fewer, if the vocabulary is smaller).

**Instructions:**
1. Prepare a dataset of reviews (positive and negative)
2. Tokenize the texts with NLTK
3. Remove stop words (English)
4. Split the data into train and test sets (80/20)
5. Train an NLTK NaiveBayesClassifier
6. Evaluate the model on the test set
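
To make step 5 less of a black box, here is a minimal pure-Python sketch of what a naive Bayes classifier computes for boolean "contains word" features. It uses add-one (Laplace) smoothing, which is an assumption made for simplicity; NLTK's `NaiveBayesClassifier` uses its own smoothing estimator by default, so its probabilities will not match these exactly. The function names `train_nb` and `classify_nb` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, vocab):
    """docs: list of (set_of_tokens, label). Returns priors and P(word present | label)."""
    label_counts = Counter(label for _, label in docs)
    present = defaultdict(Counter)  # present[label][word] = docs of that label containing word
    for tokens, label in docs:
        for w in vocab:
            if w in tokens:
                present[label][w] += 1
    priors = {lab: n / len(docs) for lab, n in label_counts.items()}
    # Add-one smoothing over the two outcomes (present / absent); an assumption of this sketch
    likelihood = {
        lab: {w: (present[lab][w] + 1) / (label_counts[lab] + 2) for w in vocab}
        for lab in label_counts
    }
    return priors, likelihood

def classify_nb(tokens, priors, likelihood):
    """Pick the label with the highest log-posterior (up to a shared constant)."""
    scores = {}
    for lab in priors:
        log_p = math.log(priors[lab])
        for w, p in likelihood[lab].items():
            # Absent words count too: they contribute P(word absent | label)
            log_p += math.log(p if w in tokens else 1.0 - p)
        scores[lab] = log_p
    return max(scores, key=scores.get)

# Hypothetical toy data, not the dataset used in the solution
toy_docs = [
    ({"great", "film"}, "pos"),
    ({"wonderful", "story"}, "pos"),
    ({"terrible", "film"}, "neg"),
    ({"boring", "plot"}, "neg"),
]
toy_vocab = sorted({w for tokens, _ in toy_docs for w in tokens})
priors, likelihood = train_nb(toy_docs, toy_vocab)
prediction = classify_nb({"great", "story"}, priors, likelihood)
print(prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once the vocabulary grows to hundreds of features.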

### Solution:

In [95]:
# 1. Data preparation: movie reviews (a small but balanced set)
recenzije = [
    # Positive
    "This movie was fantastic! I loved every moment of it.",
    "Great film with excellent acting and story.",
    "Amazing cinematography and wonderful performances.",
    "One of the best movies I have ever seen!",
    "Brilliant direction and superb cast.",
    "Outstanding film that exceeded my expectations.",
    "Wonderful movie with great character development.",
    "Excellent storytelling and beautiful scenes.",
    "A truly enjoyable film with a strong plot.",
    "The acting was great and the ending was satisfying.",
    "Heartwarming story and very well made.",
    "Highly recommended — fun and engaging.",
    # Negative
    "Terrible movie, waste of time and money.",
    "Very disappointing, poor acting and bad plot.",
    "Awful film with terrible direction.",
    "Boring and predictable, not worth watching.",
    "Horrible movie, I hated every minute.",
    "Poor quality and weak storyline.",
    "Disaster of a film, avoid at all costs.",
    "Worst movie I have seen this year.",
    "The characters were flat and the plot made no sense.",
    "Bad acting and a very dull story.",
    "Painfully slow and completely uninteresting.",
    "Not recommended — disappointing and messy.",
]

# Labels: 1 = positive, 0 = negative
oznake = [1]*12 + [0]*12

print(f"Total reviews: {len(recenzije)}")
print(f"Positive: {sum(oznake)}, Negative: {len(oznake) - sum(oznake)}")

Total reviews: 24
Positive: 12, Negative: 12


In [103]:
# 2. Tokenization and stop-word removal
stop_rijeci = set(stopwords.words('english'))
stop_rijeci -= {'not', 'no', 'nor'}  # negations often matter for sentiment

def preprocess_text(text):
    """Preprocess text: lowercase, tokenize, drop punctuation and stop words."""
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalnum() and token not in stop_rijeci]
    return tokens  # list of tokens

recenzije_tokens = [preprocess_text(r) for r in recenzije]

print("Example of preprocessed text:")
print(f"Original: {recenzije[0]}")
print(f"Tokens: {recenzije_tokens[0]}")

Example of preprocessed text:
Original: This movie was fantastic! I loved every moment of it.
Tokens: ['movie', 'fantastic', 'loved', 'every', 'moment']


In [104]:
# 3. Building features for the NLTK NaiveBayesClassifier (bag-of-words)
# Count word frequencies over the whole dataset
# (note: this includes reviews that later end up in the test set)
sve_rijeci = [w for tokens in recenzije_tokens for w in tokens]
freq = Counter(sve_rijeci)

# Take the top-N words as features
N_FEATURES = min(300, len(freq))
word_features = [w for w, _ in freq.most_common(N_FEATURES)]

def document_features(tokens):
    """Features: does the document contain each word of the top-N vocabulary?"""
    tokens_set = set(tokens)
    return {f"has({w})": (w in tokens_set) for w in word_features}

# Build labelled feature sets
featuresets = [(document_features(tokens), 'pos' if label == 1 else 'neg')
               for tokens, label in zip(recenzije_tokens, oznake)]

# Stratified split (80/20) so that the test set contains both classes
from random import Random
rng = Random(42)

pos_fs = [fs for fs in featuresets if fs[1] == 'pos']
neg_fs = [fs for fs in featuresets if fs[1] == 'neg']
rng.shuffle(pos_fs)
rng.shuffle(neg_fs)

train_pos_size = int(0.8 * len(pos_fs))
train_neg_size = int(0.8 * len(neg_fs))

train_set = pos_fs[:train_pos_size] + neg_fs[:train_neg_size]
test_set = pos_fs[train_pos_size:] + neg_fs[train_neg_size:]
rng.shuffle(train_set)
rng.shuffle(test_set)

print(f"Number of feature words: {len(word_features)}")
print(f"Train set: {len(train_set)} examples")
print(f"Test set: {len(test_set)} examples")
print(f"Test pos/neg: {sum(1 for _,y in test_set if y=='pos')}/{sum(1 for _,y in test_set if y=='neg')}")

primjer_true = [k for k, v in train_set[0][0].items() if v][:8]
print(f"\nExample TRUE features: {primjer_true}")

Number of feature words: 78
Train set: 18 examples
Test set: 6 examples
Test pos/neg: 3/3

Example TRUE features: ['has(poor)', 'has(quality)', 'has(weak)', 'has(storyline)']


In [105]:
# 4. Training the NLTK Naive Bayes classifier
nb_classifier = NaiveBayesClassifier.train(train_set)

print("✓ NLTK Naive Bayes classifier trained successfully!")
print(f"\nNumber of classes: {len(nb_classifier.labels())}")
print(f"Classes: {nb_classifier.labels()}")

✓ NLTK Naive Bayes classifier trained successfully!

Number of classes: 2
Classes: ['neg', 'pos']


In [106]:
# 5. Model evaluation
accuracy = nltk_accuracy(nb_classifier, test_set)
print(f"✓ Model accuracy: {accuracy:.2%}")

# Show the most informative features
print("\nMost informative words for classification:")
nb_classifier.show_most_informative_features(10)

✓ Model accuracy: 50.00%

Most informative words for classification:
Most Informative Features
              has(movie) = True              neg : pos    =      1.7 : 1.0
              has(great) = False             neg : pos    =      1.3 : 1.0
          has(wonderful) = False             neg : pos    =      1.3 : 1.0
              has(movie) = False             pos : neg    =      1.1 : 1.0
            has(amazing) = False             neg : pos    =      1.1 : 1.0
              has(avoid) = False             pos : neg    =      1.1 : 1.0
              has(awful) = False             pos : neg    =      1.1 : 1.0
                has(bad) = False             pos : neg    =      1.1 : 1.0
          has(beautiful) = False             neg : pos    =      1.1 : 1.0
               has(best) = False             neg : pos    =      1.1 : 1.0
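
With only six test examples, a single accuracy figure hides which class the errors fall on. The sketch below shows one way to break accuracy down into a confusion table using only the standard library. The `predicted` list is made up for illustration and is not the output of the classifier above; in the notebook it would presumably come from calling `nb_classifier.classify` on each test item.

```python
from collections import Counter

def evaluate(gold, predicted):
    """Return overall accuracy and a Counter keyed by (gold_label, predicted_label)."""
    confusion = Counter(zip(gold, predicted))
    correct = sum(n for (g, p), n in confusion.items() if g == p)
    return correct / len(gold), confusion

# Hypothetical labels, chosen only to illustrate the calculation
gold = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["neg", "neg", "pos", "neg", "neg", "pos"]

acc, confusion = evaluate(gold, predicted)
print(f"accuracy = {acc:.2%}")
for (g, p), n in sorted(confusion.items()):
    print(f"gold={g} predicted={p}: {n}")
```

On a test set this small, each example moves accuracy by about 17 percentage points, so the breakdown is more informative than the headline number.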


In [107]:
# 6. Testing on new reviews
nove_recenzije = [
    "This film was absolutely brilliant and amazing!",
    "Terrible waste of time, very boring and bad."
]

print("Testing on new reviews:\n")
for rec in nove_recenzije:
    tokens = preprocess_text(rec)
    features = document_features(tokens)
    prediction = nb_classifier.classify(features)
    prob_dist = nb_classifier.prob_classify(features)

    sentiment = "✓ POSITIVE" if prediction == 'pos' else "✗ NEGATIVE"
    confidence = prob_dist.prob(prediction)

    print(f"Review: '{rec}'")
    print(f"Sentiment: {sentiment}")
    print(f"Confidence: {confidence:.2%}\n")

Testing on new reviews:

Review: 'This film was absolutely brilliant and amazing!'
Sentiment: ✓ POSITIVE
Confidence: 92.53%

Review: 'Terrible waste of time, very boring and bad.'
Sentiment: ✗ NEGATIVE
Confidence: 97.16%



---
# TASK 2: K-means Clustering (TF-IDF)

**Task:** Group short documents into 3 clusters (sport / technology / food) using TF-IDF and K-means.
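
Before handing the work to a library, it can help to see how a TF-IDF weight is put together. The sketch below is a minimal pure-Python version that follows the smoothed formula documented for scikit-learn's default settings, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document vector. The function name `tfidf` and the toy documents are hypothetical, and details such as tokenization differ from what `TfidfVectorizer` does internally.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns (idf dict, list of L2-normalised weight dicts)."""
    n = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {w: tf[w] * idf[w] for w in tf}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0  # guard for empty docs
        vectors.append({w: v / norm for w, v in raw.items()})
    return idf, vectors

# Hypothetical toy documents
toy = [
    ["football", "match", "goal"],
    ["pizza", "cheese"],
    ["football", "stadium"],
]
idf, vectors = tfidf(toy)
print(round(idf["football"], 4), round(idf["match"], 4))
print({w: round(v, 3) for w, v in vectors[0].items()})
```

Words that occur in many documents ("football" here) receive a lower idf than words confined to one document, so rarer words weigh more within each vector.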

### Solution:

In [85]:
# 1. Document set (3 topics)
dokumenti = [
    # Sport
    "Football match ended with a thrilling goal and victory.",
    "Basketball team won the championship after intense season.",
    "Tennis player served an ace and dominated the game.",
    "Fans celebrated the win at the stadium after the match.",
    # Technology
    "New smartphone features advanced camera and powerful processor.",
    "Artificial intelligence and machine learning change modern technology.",
    "Computer programming requires logical thinking and problem solving.",
    "Software development involves coding and testing applications.",
    # Food
    "Italian pasta with tomato sauce and fresh basil is delicious.",
    "Homemade pizza with cheese and vegetables tastes amazing.",
    "Chocolate cake recipe includes flour sugar and cocoa powder.",
    "Fresh salad with lettuce tomatoes and olive oil is healthy.",
]

# Gold topic labels (values kept in Croatian: tehnologija = technology, hrana = food)
prave_teme = [
    'sport', 'sport', 'sport', 'sport',
    'tehnologija', 'tehnologija', 'tehnologija', 'tehnologija',
    'hrana', 'hrana', 'hrana', 'hrana',
]

print(f"Total documents: {len(dokumenti)}")
print(f"Topics: {sorted(set(prave_teme))}")

Total documents: 12
Topics: ['hrana', 'sport', 'tehnologija']


In [86]:
# 2. Preprocessing and TF-IDF vectorization
# (reuses stop_rijeci defined in Task 1)
def preprocess_doc(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalnum() and t not in stop_rijeci]
    return ' '.join(tokens)

dokumenti_clean = [preprocess_doc(doc) for doc in dokumenti]

# TF-IDF vectorization
vectorizer_kmeans = TfidfVectorizer(max_features=50)
X_tfidf = vectorizer_kmeans.fit_transform(dokumenti_clean)

print(f"TF-IDF matrix shape: {X_tfidf.shape}")

TF-IDF matrix shape: (12, 50)
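
K-means alternates between two steps: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. The following is a minimal pure-Python sketch of that loop on made-up 2-D points. It initialises centroids from the first `k` points, which is a simplification assumed here; scikit-learn's `KMeans` uses a more careful initialisation and multiple restarts (`n_init`).

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Plain Lloyd iterations. Returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy points forming two obvious groups
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because the result depends on the starting centroids, library implementations typically rerun the algorithm from several initialisations and keep the best run.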


In [87]:
# 3. K-means clustering
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_tfidf)

klasteri = kmeans.labels_

print("Cluster assigned to each document:\n")
for i, (klaster, tema) in enumerate(zip(klasteri, prave_teme)):
    print(f"  D{i+1:2d} [Topic: {tema:>11}] -> Cluster {klaster}: {dokumenti[i][:40]}...")

Cluster assigned to each document:

  D 1 [Topic:       sport] -> Cluster 2: Football match ended with a thrilling go...
  D 2 [Topic:       sport] -> Cluster 0: Basketball team won the championship aft...
  D 3 [Topic:       sport] -> Cluster 0: Tennis player served an ace and dominate...
  D 4 [Topic:       sport] -> Cluster 2: Fans celebrated the win at the stadium a...
  D 5 [Topic: tehnologija] -> Cluster 0: New smartphone features advanced camera ...
  D 6 [Topic: tehnologija] -> Cluster 0: Artificial intelligence and machine lear...
  D 7 [Topic: tehnologija] -> Cluster 0: Computer programming requires logical th...
  D 8 [Topic: tehnologija] -> Cluster 0: Software development involves coding and...
  D 9 [Topic:       hrana] -> Cluster 1: Italian pasta with tomato sauce and fres...
  D10 [Topic:       hrana] -> Cluster 0: Homemade pizza with cheese and vegetable...
  D11 [Topic:       hrana] -> Cluster 0: Chocolate cake recipe includes flour sug...
  D12 [Topic:       hrana] -> Cluster 1: Fresh salad with lettuce tomatoes and ol...

In [88]:
# 4. Cluster analysis: most significant words
feature_names = vectorizer_kmeans.get_feature_names_out()
centers = kmeans.cluster_centers_

print("Top words per cluster:\n")
for klaster_id in range(k):
    centroid = centers[klaster_id]
    top_indices = centroid.argsort()[-8:][::-1]
    top_words = [feature_names[i] for i in top_indices]

    print(f"Cluster {klaster_id}:")
    print(f"  Key words: {', '.join(top_words)}")

    # Topic distribution (loop variable named c to avoid shadowing k)
    teme_u_klasteru = [prave_teme[i] for i, c in enumerate(klasteri) if c == klaster_id]
    tema_count = Counter(teme_u_klasteru)
    print(f"  Topic distribution: {dict(tema_count)}\n")

Top words per cluster:

Cluster 0:
  Key words: problem, intense, logical, basketball, championship, computer, applications, involves
  Topic distribution: {'sport': 2, 'tehnologija': 4, 'hrana': 2}

Cluster 1:
  Key words: fresh, olive, italian, lettuce, oil, pasta, basil, delicious
  Topic distribution: {'hrana': 2}

Cluster 2:
  Key words: match, fans, celebrated, ended, football, goal, powerful, powder
  Topic distribution: {'sport': 2}
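
The topic distributions above can be condensed into a single number. One simple option is cluster purity: for each cluster take the count of its most common gold topic, sum those counts, and divide by the number of documents. The sketch below hard-codes the cluster assignments as they appear in the printed output of this notebook; the function name `purity` is hypothetical.

```python
from collections import Counter

def purity(clusters, topics):
    """Fraction of documents that belong to the majority topic of their cluster."""
    by_cluster = {}
    for c, t in zip(clusters, topics):
        by_cluster.setdefault(c, Counter())[t] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(clusters)

# Cluster ids copied from the printed output (documents D1..D12)
clusters = [2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 1]
topics = ["sport"] * 4 + ["tehnologija"] * 4 + ["hrana"] * 4

score = purity(clusters, topics)
print(f"purity = {score:.3f}")
```

Purity rewards clusters dominated by one topic but does not penalise splitting a topic across clusters, so it is best read alongside the per-cluster distribution rather than on its own.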



---
# TASK 3: Word2Vec (Gensim)

**Task:** Train a Word2Vec model and explore word similarities and analogies.

*Note:* On a corpus this small the results are likely to be poor; in practice a much larger corpus is needed.

### Solution:

In [89]:
# 1. Preparing the sentence corpus
rechenice_korpus = [
    "dog is a loyal animal and best friend",
    "cat is a independent pet animal",
    "king rules the kingdom with wisdom",
    "queen is the wife of king",
    "man and woman are equal",
    "boy grows up to become man",
    "girl grows up to become woman",
    "prince is son of king and queen",
    "princess is daughter of king and queen",
    "dog barks and cat meows",
    "puppy is young dog",
    "kitten is young cat",
    "animals live in nature",
    "pets live with humans",
    "kingdom is ruled by royal family",
    "computer is electronic device",
    "laptop is portable computer",
    "phone is communication device",
    "car is a vehicle for transport",
    "bike is two wheel vehicle",
]

print(f"Total sentences: {len(rechenice_korpus)}")

# 2. Tokenization
tokenizirane_recenice = []
for recenica in rechenice_korpus:
    tokens = word_tokenize(recenica.lower())
    tokens = [t for t in tokens if t.isalnum()]
    tokenizirane_recenice.append(tokens)

print("Examples of tokenized sentences:")
for i in range(3):
    print(f"  {tokenizirane_recenice[i]}")

Total sentences: 20
Examples of tokenized sentences:
  ['dog', 'is', 'a', 'loyal', 'animal', 'and', 'best', 'friend']
  ['cat', 'is', 'a', 'independent', 'pet', 'animal']
  ['king', 'rules', 'the', 'kingdom', 'with', 'wisdom']


In [90]:
# 3. Training the Word2Vec model
model_w2v = Word2Vec(
    sentences=tokenizirane_recenice,
    vector_size=100,      # dimensionality of the word embeddings
    window=5,             # context window size
    min_count=1,          # minimum number of occurrences of a word
    workers=4,            # with several workers, results can vary between runs despite the seed
    sg=0,                 # 0 = CBOW, 1 = Skip-gram
    epochs=100,
    seed=42
)

print("✓ Word2Vec model trained!")
print(f"  Vocabulary size: {len(model_w2v.wv)}")
print(f"  Vector dimensions: {model_w2v.wv.vector_size}")

✓ Word2Vec model trained!
  Vocabulary size: 63
  Vector dimensions: 100


In [91]:
# 4. Similarity between words
print("=== WORD SIMILARITY ===\n")

parovi = [
    ('king', 'queen'),
    ('man', 'woman'),
    ('dog', 'cat'),
    ('computer', 'laptop'),
    ('boy', 'girl'),
    ('king', 'dog'),  # unrelated words
]

for rijec1, rijec2 in parovi:
    try:
        slicnost = model_w2v.wv.similarity(rijec1, rijec2)
        print(f"Similarity of '{rijec1}' and '{rijec2}': {slicnost:.4f}")
    except KeyError:
        print("Word is not in the vocabulary")

=== WORD SIMILARITY ===

Similarity of 'king' and 'queen': -0.0453
Similarity of 'man' and 'woman': 0.0795
Similarity of 'dog' and 'cat': -0.0698
Similarity of 'computer' and 'laptop': -0.1398
Similarity of 'boy' and 'girl': 0.0756
Similarity of 'king' and 'dog': 0.1731
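
The scores above are cosine similarities between word vectors: the dot product of the two vectors divided by the product of their lengths, giving a value between -1 and 1. A minimal pure-Python sketch, with made-up 2-D vectors standing in for real embeddings:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, for illustration only
same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # no shared direction
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # pointing opposite ways
print(same_direction, orthogonal, opposite)
```

Values close to 0, like most of those printed above, mean the two vectors are nearly orthogonal, which is what one would expect when the model has seen too little text to pull related words together.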


In [92]:
# 5. Finding the most similar words
print("\n=== MOST SIMILAR WORDS ===\n")

test_rijeci = ['king', 'dog', 'computer']

for rijec in test_rijeci:
    try:
        slicne = model_w2v.wv.most_similar(rijec, topn=4)
        print(f"Most similar to '{rijec}':")
        for slicna_rijec, score in slicne:
            print(f"  {slicna_rijec}: {score:.4f}")
        print()
    except KeyError:
        print("Word is not in the vocabulary\n")


=== MOST SIMILAR WORDS ===

Most similar to 'king':
  loyal: 0.3584
  is: 0.3573
  man: 0.3262
  in: 0.3154

Most similar to 'dog':
  equal: 0.3230
  computer: 0.2363
  a: 0.2301
  loyal: 0.2006

Most similar to 'computer':
  a: 0.3696
  animals: 0.2852
  pet: 0.2832
  and: 0.2687



In [93]:
# 6. Analogies: king - man + woman ≈ queen
print("=== ANALOGIES ===\n")

analogije = [
    (['king', 'woman'], ['man']),      # king - man + woman ≈ queen
    (['queen', 'man'], ['woman']),     # queen - woman + man ≈ king
    (['boy', 'woman'], ['man']),       # boy - man + woman ≈ girl
    (['dog', 'kitten'], ['puppy']),    # dog - puppy + kitten ≈ cat
]

for pozitivne, negativne in analogije:
    try:
        rezultat = model_w2v.wv.most_similar(positive=pozitivne, negative=negativne, topn=2)
        print(f"{' + '.join(pozitivne)} - {' - '.join(negativne)} ≈")
        for rijec, score in rezultat:
            print(f"  {rijec}: {score:.4f}")
        print()
    except KeyError:
        print("Some word is not in the vocabulary\n")

=== ANALOGIES ===

king + woman - man ≈
  is: 0.2940
  vehicle: 0.2513

queen + man - woman ≈
  son: 0.2984
  family: 0.2498

boy + woman - man ≈
  is: 0.1950
  vehicle: 0.1919

dog + kitten - puppy ≈
  equal: 0.2845
  king: 0.2322
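
None of the analogies above recover the expected word, which is consistent with the note about corpus size. To show what the analogy query is meant to do when embeddings are well formed, here is a minimal pure-Python sketch on hand-made 2-D vectors (first axis "royalty", second axis "gender"). The vectors and the function `analogy` are hypothetical; the procedure (add the unit vectors of the positive words, subtract those of the negative words, rank the remaining words by cosine similarity) is a simplified approximation of what gensim's `most_similar` does.

```python
import math

def unit(v):
    """Scale v to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for w in positive:
        target = [t + x for t, x in zip(target, unit(vectors[w]))]
    for w in negative:
        target = [t - x for t, x in zip(target, unit(vectors[w]))]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy embeddings, not learned from any corpus
toy_vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8, 1.0],
}
best = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(best)
```

Excluding the query words from the candidates matters: without that step the nearest vector to the target is often one of the input words themselves.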

