# 🔍 Notebook 11: Text Embeddings and Semantic Search

**Author:** Praut s.r.o. - AI Integration & Business Automation

In this notebook we learn to work with text embeddings, which are vector representations of text. We show how to create embeddings, run semantic search, and build simple RAG systems.

## What you will learn:
- Creating text embeddings with Sentence Transformers
- Semantic search and similarity search
- Document clustering
- Simple vector databases
- Basics of RAG (Retrieval Augmented Generation)

In [None]:
# Install the required libraries
!pip install -q sentence-transformers faiss-cpu numpy pandas scikit-learn matplotlib

In [None]:
import torch
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import warnings
warnings.filterwarnings('ignore')

# Detect the device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 1. Text Embedding Basics

In [None]:
# Load the embedding model
# all-MiniLM-L6-v2 is a fast, good-quality general-purpose model (trained mainly on English text)
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

print("✅ Model loaded")
print(f"   Embedding dimension: {model.get_sentence_embedding_dimension()}")
print(f"   Max sequence length: {model.max_seq_length}")

In [None]:
# Create embeddings for individual sentences
sentences = [
    "Python is a popular programming language.",
    "JavaScript is used for web development.",
    "Cats are popular pets.",
    "Programming requires logical thinking.",
    "Dogs are loyal companions to humans."
]

# Create the embeddings
embeddings = model.encode(sentences)

print(f"Number of sentences: {len(sentences)}")
print(f"Embeddings shape: {embeddings.shape}")
print("\nSample of the first embedding (first 10 values):")
print(embeddings[0][:10])

In [None]:
# Compute pairwise similarity between sentences
similarity_matrix = cosine_similarity(embeddings)

# Visualization
plt.figure(figsize=(10, 8))
plt.imshow(similarity_matrix, cmap='RdYlGn', aspect='auto')
plt.colorbar(label='Cosine Similarity')
plt.xticks(range(len(sentences)), [s[:30] + '...' for s in sentences], rotation=45, ha='right')
plt.yticks(range(len(sentences)), [s[:30] + '...' for s in sentences])
plt.title('Sentence similarity matrix')

# Add the values to the cells
for i in range(len(sentences)):
    for j in range(len(sentences)):
        plt.text(j, i, f'{similarity_matrix[i, j]:.2f}', 
                ha='center', va='center', fontsize=9)

plt.tight_layout()
plt.show()
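
Each cell of the matrix above is a cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below works through that arithmetic on toy 3-dimensional vectors using only the standard library. The helper `cosine` and the vector values are illustrative assumptions, not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (illustrative values, not real embeddings)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as the first, different length
    [0.0, 3.0, 0.0],   # orthogonal to the first
]

# Pairwise matrix, analogous to cosine_similarity(embeddings)
toy_matrix = [[cosine(u, v) for v in toy_vectors] for u in toy_vectors]
for row in toy_matrix:
    print([round(x, 2) for x in row])
```

The first two vectors score 1.0 even though their lengths differ, because cosine similarity depends only on direction.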

## 2. Semantic Search

In [None]:
class SemanticSearch:
    """
    Class for semantic search over a corpus of documents.
    """
    
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name, device=device)
        self.documents = []
        self.embeddings = None
        self.metadata = []
    
    def add_documents(self, documents, metadata=None):
        """
        Add documents to the index.
        
        Args:
            documents: List of text documents
            metadata: Optional metadata for each document
        """
        self.documents.extend(documents)
        
        if metadata:
            self.metadata.extend(metadata)
        else:
            self.metadata.extend([{} for _ in documents])
        
        # Create the embeddings
        new_embeddings = self.model.encode(documents, show_progress_bar=True)
        
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
        
        print(f"Added {len(documents)} documents. Total: {len(self.documents)}")
    
    def search(self, query, top_k=5):
        """
        Find the documents most similar to the query.
        """
        if self.embeddings is None or len(self.documents) == 0:
            return []
        
        # Embed the query
        query_embedding = self.model.encode([query])
        
        # Compute similarity
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]
        
        # Sort by similarity
        top_indices = similarities.argsort()[-top_k:][::-1]
        
        results = []
        for idx in top_indices:
            results.append({
                "document": self.documents[idx],
                "score": float(similarities[idx]),
                "metadata": self.metadata[idx],
                "index": int(idx)
            })
        
        return results
    
    def find_similar(self, document_index, top_k=5):
        """
        Find documents similar to the given document.
        """
        if self.embeddings is None or not 0 <= document_index < len(self.documents):
            return []
        
        doc_embedding = self.embeddings[document_index:document_index+1]
        similarities = cosine_similarity(doc_embedding, self.embeddings)[0]
        
        # Rank all documents, then drop the query document itself
        ranked = similarities.argsort()[::-1]
        top_indices = [idx for idx in ranked if idx != document_index][:top_k]
        
        return [
            {
                "document": self.documents[idx],
                "score": float(similarities[idx]),
                "index": int(idx)
            }
            for idx in top_indices
        ]

# Test
search_engine = SemanticSearch()
print("✅ SemanticSearch initialized")
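
The `search` method picks its results with `similarities.argsort()[-top_k:][::-1]`: sort ascending, keep the last `top_k` positions, then reverse so the best match comes first. A dependency-free sketch of the same selection follows. The helper name `top_k_indices` and the scores are made up for illustration, and the order of exact ties may differ from NumPy's.

```python
import heapq

def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

toy_scores = [0.12, 0.87, 0.45, 0.91, 0.30]  # made-up similarity scores

best = top_k_indices(toy_scores, 3)
print(best)

# Asking for more results than there are documents returns all of them
everything = top_k_indices(toy_scores, 10)
print(len(everything))
```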

In [None]:
# Prepare a test corpus
documents = [
    "Python is a high-level interpreted programming language with dynamic typing.",
    "Machine learning is a subfield of artificial intelligence concerned with algorithms that learn from data.",
    "Django is a web framework for Python that enables rapid development of web applications.",
    "TensorFlow is an open-source library for numerical computation and machine learning.",
    "JavaScript is a scripting language commonly used for interactive web pages.",
    "React is a JavaScript library for building user interfaces.",
    "PostgreSQL is a powerful open-source relational database.",
    "Docker is a platform for containerizing applications.",
    "Kubernetes orchestrates containerized applications in a cluster.",
    "Git is a distributed version control system for tracking changes in files.",
    "REST API is an architectural style for designing web services.",
    "GraphQL is a query language for APIs developed by Facebook.",
    "Redis is an in-memory database often used as a cache.",
    "Celery is a distributed task queue for Python.",
    "Angular is a TypeScript framework for building web applications."
]

metadata = [
    {"category": "programming", "language": "python"},
    {"category": "ai", "topic": "ml"},
    {"category": "web", "language": "python"},
    {"category": "ai", "topic": "ml"},
    {"category": "programming", "language": "javascript"},
    {"category": "web", "language": "javascript"},
    {"category": "database", "type": "relational"},
    {"category": "devops", "topic": "containers"},
    {"category": "devops", "topic": "orchestration"},
    {"category": "devops", "topic": "version-control"},
    {"category": "web", "topic": "api"},
    {"category": "web", "topic": "api"},
    {"category": "database", "type": "cache"},
    {"category": "devops", "language": "python"},
    {"category": "web", "language": "typescript"}
]

# Add the documents
search_engine.add_documents(documents, metadata)

In [None]:
# Test the search
queries = [
    "How do I build a web application in Python?",
    "Tools for managing containers",
    "Libraries for machine learning"
]

for query in queries:
    print(f"\n🔍 Query: {query}")
    print("-" * 50)
    
    results = search_engine.search(query, top_k=3)
    
    for i, result in enumerate(results, 1):
        print(f"{i}. [{result['score']:.3f}] {result['document'][:80]}...")
        if result['metadata']:
            print(f"   Metadata: {result['metadata']}")
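
The metadata stored with each document is returned by `search` but never used to narrow the results. One possible way to use it is to post-filter the result list, sketched below. `filter_by_metadata` is a hypothetical helper, and the sample results are hand-written in the shape `SemanticSearch.search()` returns, with made-up scores.

```python
def filter_by_metadata(results, **required):
    """Keep results whose metadata contains every required key/value pair."""
    return [
        r for r in results
        if all(r["metadata"].get(key) == value for key, value in required.items())
    ]

# Hand-written results in the shape returned by search() (scores are made up)
sample_results = [
    {"document": "Django is a web framework for Python.", "score": 0.71,
     "metadata": {"category": "web", "language": "python"}, "index": 2},
    {"document": "React is a JavaScript library.", "score": 0.55,
     "metadata": {"category": "web", "language": "javascript"}, "index": 5},
    {"document": "Celery is a distributed task queue for Python.", "score": 0.49,
     "metadata": {"category": "devops", "language": "python"}, "index": 13},
]

python_only = filter_by_metadata(sample_results, language="python")
print([r["index"] for r in python_only])
```

Filtering after retrieval can leave fewer than `top_k` results, so it may help to request a larger `top_k` first.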

## 3. Multilingual Embeddings

In [None]:
# Multilingual model
multilingual_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2', device=device)

print("✅ Multilingual model loaded")

In [None]:
# Test cross-lingual similarity
sentences_multilingual = [
    "The cat is sitting on the mat.",          # EN
    "Kočka sedí na koberci.",                  # CZ
    "Die Katze sitzt auf der Matte.",          # DE
    "Le chat est assis sur le tapis.",         # FR
    "The dog is running in the park.",         # EN
    "Pes běží v parku.",                       # CZ
]

# Embeddings
multi_embeddings = multilingual_model.encode(sentences_multilingual)

# Similarity matrix
multi_sim = cosine_similarity(multi_embeddings)

# Visualization
plt.figure(figsize=(10, 8))
plt.imshow(multi_sim, cmap='RdYlGn')
plt.colorbar(label='Similarity')

labels = ['Cat (EN)', 'Kočka (CZ)', 'Katze (DE)', 'Chat (FR)', 'Dog (EN)', 'Pes (CZ)']
plt.xticks(range(len(labels)), labels, rotation=45, ha='right')
plt.yticks(range(len(labels)), labels)
plt.title('Cross-lingual Similarity')

for i in range(len(labels)):
    for j in range(len(labels)):
        plt.text(j, i, f'{multi_sim[i, j]:.2f}', ha='center', va='center', fontsize=9)

plt.tight_layout()
plt.show()

## 4. Document Clustering

In [None]:
class DocumentClusterer:
    """
    Document clustering using embeddings.
    """
    
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name, device=device)
    
    def cluster(self, documents, n_clusters=3):
        """
        Group documents into clusters.
        """
        # Create the embeddings
        embeddings = self.model.encode(documents, show_progress_bar=True)
        
        # K-Means clustering
        kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        cluster_labels = kmeans.fit_predict(embeddings)
        
        # Organize the results (cast NumPy labels to plain ints for the dict keys)
        clusters = {i: [] for i in range(n_clusters)}
        for doc, label in zip(documents, cluster_labels):
            clusters[int(label)].append(doc)
        
        return {
            "clusters": clusters,
            "labels": cluster_labels,
            "embeddings": embeddings,
            "centroids": kmeans.cluster_centers_
        }
    
    def visualize_clusters(self, embeddings, labels, documents=None):
        """
        Visualize the clusters with t-SNE.
        """
        # Dimensionality reduction (t-SNE requires perplexity < number of samples)
        tsne = TSNE(n_components=2, random_state=42, perplexity=min(30, len(embeddings)-1))
        reduced = tsne.fit_transform(embeddings)
        
        # Visualization
        plt.figure(figsize=(12, 8))
        scatter = plt.scatter(reduced[:, 0], reduced[:, 1], c=labels, cmap='viridis', s=100, alpha=0.7)
        plt.colorbar(scatter, label='Cluster')
        
        if documents:
            for i, doc in enumerate(documents):
                plt.annotate(doc[:20] + '...', (reduced[i, 0], reduced[i, 1]), 
                           fontsize=8, alpha=0.7)
        
        plt.title('Document Clusters (t-SNE visualization)')
        plt.xlabel('t-SNE 1')
        plt.ylabel('t-SNE 2')
        plt.tight_layout()
        plt.show()

# Test
clusterer = DocumentClusterer()
print("✅ DocumentClusterer ready")

In [None]:
# Test clustering
result = clusterer.cluster(documents, n_clusters=4)

print("Clustering results:")
print("=" * 50)

for cluster_id, docs in result['clusters'].items():
    print(f"\n📁 Cluster {cluster_id} ({len(docs)} documents):")
    for doc in docs:
        print(f"   • {doc[:70]}...")

# Visualization
clusterer.visualize_clusters(result['embeddings'], result['labels'], documents)
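
K-Means assigns every embedding to the cluster whose centroid is closest in Euclidean distance, and the `centroids` returned by `cluster` could be used the same way to label a document that arrives later. A minimal sketch of that assignment step follows. The helper `nearest_centroid` and the 2-D toy values are assumptions chosen so the arithmetic is easy to follow.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

toy_centroids = [[0.0, 0.0], [10.0, 10.0]]          # illustrative 2-D centroids
toy_points = [[1.0, 2.0], [9.0, 8.0], [4.0, 4.0]]   # illustrative 2-D points

# Label each point with the index of its nearest centroid
toy_labels = [nearest_centroid(p, toy_centroids) for p in toy_points]
print(toy_labels)
```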

## 5. FAISS - Efficient Vector Search

In [None]:
import faiss

class FAISSVectorStore:
    """
    Vector store backed by FAISS.
    """
    
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name, device=device)
        self.dimension = self.model.get_sentence_embedding_dimension()
        
        # Initialize the FAISS index
        self.index = faiss.IndexFlatIP(self.dimension)  # Inner product (equals cosine similarity for normalized vectors)
        
        self.documents = []
        self.metadata = []
    
    def add(self, documents, metadata=None):
        """
        Add documents to the index.
        """
        # Create and normalize the embeddings
        embeddings = self.model.encode(documents)
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        
        # Add to FAISS
        self.index.add(embeddings.astype('float32'))
        
        # Store documents and metadata
        self.documents.extend(documents)
        if metadata:
            self.metadata.extend(metadata)
        else:
            self.metadata.extend([{} for _ in documents])
        
        print(f"Added {len(documents)} documents. Index contains {self.index.ntotal} vectors.")
    
    def search(self, query, top_k=5):
        """
        Find the most similar documents.
        """
        # Embed the query
        query_embedding = self.model.encode([query])
        query_embedding = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)
        
        # Search
        scores, indices = self.index.search(query_embedding.astype('float32'), top_k)
        
        results = []
        for score, idx in zip(scores[0], indices[0]):
            # FAISS pads with -1 when fewer than top_k vectors are indexed
            if 0 <= idx < len(self.documents):
                results.append({
                    "document": self.documents[idx],
                    "score": float(score),
                    "metadata": self.metadata[idx],
                    "index": int(idx)
                })
        
        return results
    
    def save(self, path):
        """Save the index to disk."""
        import pickle
        
        faiss.write_index(self.index, f"{path}.faiss")
        with open(f"{path}.pkl", 'wb') as f:
            pickle.dump({'documents': self.documents, 'metadata': self.metadata}, f)
        print(f"Index saved to {path}")
    
    def load(self, path):
        """Load the index from disk. Only load pickle files from sources you trust."""
        import pickle
        
        self.index = faiss.read_index(f"{path}.faiss")
        with open(f"{path}.pkl", 'rb') as f:
            data = pickle.load(f)
            self.documents = data['documents']
            self.metadata = data['metadata']
        print(f"Index loaded from {path}. Contains {self.index.ntotal} vectors.")

# Test
faiss_store = FAISSVectorStore()
print("✅ FAISS VectorStore ready")

In [None]:
# Add the documents
faiss_store.add(documents, metadata)

# Test the search
query = "Framework for building APIs"
results = faiss_store.search(query, top_k=3)

print(f"\n🔍 Query: {query}")
print("-" * 50)

for i, result in enumerate(results, 1):
    print(f"{i}. [{result['score']:.3f}] {result['document']}")
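
`FAISSVectorStore` uses an inner-product index and normalizes every vector before adding or searching. The sketch below shows why that normalization matters: for unit-length vectors the inner product equals the cosine similarity, while for raw vectors it does not. The helpers and the two vectors are illustrative assumptions written with the standard library only.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

u = [3.0, 4.0]   # illustrative vectors, both of length 5
v = [4.0, 3.0]

raw_inner_product = dot(u, v)
cosine_uv = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
normalized_inner_product = dot(l2_normalize(u), l2_normalize(v))

print(raw_inner_product)
print(round(cosine_uv, 4), round(normalized_inner_product, 4))
```

Skipping the normalization step would make the index rank by raw inner product, which favors longer vectors.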

## 6. Practical Automation: FAQ Bot

In [None]:
class FAQBot:
    """
    A simple FAQ bot built on semantic search.
    """
    
    def __init__(self, threshold=0.5):
        self.vector_store = FAISSVectorStore()
        self.threshold = threshold
        self.qa_pairs = []
    
    def add_faq(self, question, answer, tags=None):
        """
        Add a question and answer to the FAQ.
        """
        self.qa_pairs.append({
            "question": question,
            "answer": answer,
            "tags": tags or []
        })
    
    def build_index(self):
        """
        Build the index from the added FAQs.
        """
        questions = [qa['question'] for qa in self.qa_pairs]
        metadata = [
            {"answer": qa['answer'], "tags": qa['tags']}
            for qa in self.qa_pairs
        ]
        
        self.vector_store.add(questions, metadata)
        print(f"Index built with {len(self.qa_pairs)} FAQ items.")
    
    def ask(self, question):
        """
        Answer a question using the FAQ.
        """
        results = self.vector_store.search(question, top_k=3)
        
        if not results or results[0]['score'] < self.threshold:
            return {
                "found": False,
                "message": "Sorry, I could not find a matching answer. Try rephrasing your question.",
                "suggestions": [r['document'] for r in results]
            }
        
        best_match = results[0]
        
        return {
            "found": True,
            "answer": best_match['metadata']['answer'],
            "matched_question": best_match['document'],
            "confidence": best_match['score'],
            "tags": best_match['metadata'].get('tags', []),
            "alternatives": [
                {"question": r['document'], "score": r['score']}
                for r in results[1:]
            ]
        }

# Create the FAQ bot
faq_bot = FAQBot(threshold=0.4)

# Add the FAQs
faqs = [
    ("How do I reset my password?", "To reset your password, click 'Forgot password' on the login page and enter your email.", ["account", "password"]),
    ("What payment methods are available?", "We accept card payments (Visa, Mastercard), bank transfer, and Apple Pay.", ["payment", "order"]),
    ("How long does delivery take?", "Standard delivery takes 3-5 business days. Express delivery is available within 24 hours.", ["delivery", "order"]),
    ("How can I return goods?", "You can return goods within 30 days of delivery. Fill in the return form in the 'My orders' section.", ["returns", "complaints"]),
    ("Where can I find my invoice?", "You can find the invoice in the order confirmation email or in the 'My orders' section after logging in.", ["invoice", "account"]),
    ("How do I change my delivery address?", "You can change the delivery address in your account settings or during checkout before confirming the order.", ["address", "delivery"]),
    ("Can I track my shipment?", "Yes, after dispatch you will receive an email with a tracking number for your shipment.", ["delivery", "tracking"]),
    ("How do I contact customer support?", "You can contact support by email at podpora@example.cz or by phone at 800 123 456.", ["support", "contact"]),
]

for q, a, tags in faqs:
    faq_bot.add_faq(q, a, tags)

# Build the index
faq_bot.build_index()

In [None]:
# Test the FAQ bot
test_questions = [
    "I forgot my password, what should I do?",
    "How can I pay?",
    "When will my package arrive?",
    "What is the weather today?"  # Unrelated question
]

for question in test_questions:
    print(f"\n❓ Question: {question}")
    print("=" * 50)
    
    response = faq_bot.ask(question)
    
    if response['found']:
        print(f"✅ Answer: {response['answer']}")
        print(f"   Confidence: {response['confidence']:.1%}")
        print(f"   Matched: {response['matched_question']}")
        if response['tags']:
            print(f"   Tags: {', '.join(response['tags'])}")
    else:
        print(f"❌ {response['message']}")
        if response['suggestions']:
            print("   Did you mean:")
            for sugg in response['suggestions']:
                print(f"   • {sugg}")
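
The FAQ bot covers the retrieval half of RAG: it finds the stored text closest to a question. The generation half passes the retrieved text to a language model as context. The sketch below shows only the prompt-assembly step. `build_rag_prompt`, the prompt wording, and the sample results are all assumptions for illustration, and the call to an actual language model is left out.

```python
def build_rag_prompt(question, results, max_context=3):
    """Assemble a prompt from retrieved documents (hypothetical helper)."""
    context_lines = [
        f"[{i}] {r['document']}"
        for i, r in enumerate(results[:max_context], 1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hand-written results in the shape returned by FAISSVectorStore.search()
# (scores are made up for illustration)
retrieved = [
    {"document": "Standard delivery takes 3-5 business days.", "score": 0.62},
    {"document": "Express delivery is available within 24 hours.", "score": 0.58},
]

prompt = build_rag_prompt("When will my package arrive?", retrieved)
print(prompt)
```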

## 7. Duplicate Detection

In [None]:
class DuplicateDetector:
    """
    Detects duplicate or very similar texts.
    """
    
    def __init__(self, threshold=0.85):
        self.model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
        self.threshold = threshold
    
    def find_duplicates(self, documents):
        """
        Find groups of duplicate documents.
        """
        # Embeddings
        embeddings = self.model.encode(documents, show_progress_bar=True)
        
        # Similarity matrix
        similarity_matrix = cosine_similarity(embeddings)
        
        # Find duplicates
        duplicate_groups = []
        used = set()
        
        for i in range(len(documents)):
            if i in used:
                continue
            
            group = [i]
            for j in range(i + 1, len(documents)):
                if j not in used and similarity_matrix[i, j] >= self.threshold:
                    group.append(j)
                    used.add(j)
            
            if len(group) > 1:
                duplicate_groups.append({
                    "indices": group,
                    "documents": [documents[idx] for idx in group],
                    "similarity": float(np.mean([similarity_matrix[group[0], idx] for idx in group[1:]]))
                })
                used.add(i)
        
        # Unique documents (those not in any duplicate group)
        unique_indices = [i for i in range(len(documents)) if i not in used]
        
        return {
            "duplicate_groups": duplicate_groups,
            "unique_count": len(unique_indices),
            "duplicate_count": len(used),
            "total": len(documents)
        }

# Test
detector = DuplicateDetector(threshold=0.8)

test_docs = [
    "Python is a great programming language.",
    "Python is an excellent programming language for beginners.",  # Similar
    "JavaScript is used for web development.",
    "JavaScript is often used for building websites.",  # Similar
    "Machine learning is changing the world of technology.",
    "Cats are great pets.",
    "Machine learning is revolutionizing the technology industry."  # Similar to the ML sentence
]

result = detector.find_duplicates(test_docs)

print(f"Total documents: {result['total']}")
print(f"Unique: {result['unique_count']}")
print(f"In duplicate groups: {result['duplicate_count']}")

print("\nDuplicate groups:")
for i, group in enumerate(result['duplicate_groups'], 1):
    print(f"\n  Group {i} (similarity: {group['similarity']:.1%}):")
    for doc in group['documents']:
        print(f"    • {doc}")
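
`find_duplicates` compares every later document only against the first member of a group, so a document that resembles the second member but not the first stays outside the group. An alternative, sketched here as an assumption rather than as the notebook's method, is to merge any pair above the threshold with a union-find structure so that groups become transitive. The function name and the similarity matrix are made up for illustration.

```python
def group_by_threshold(similarity, threshold):
    """Group indices connected by chains of pairs at or above the threshold."""
    n = len(similarity)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) > 1]

# Made-up matrix: 0 and 1 are close, 1 and 2 are close, but 0 and 2 are not
toy_similarity = [
    [1.00, 0.90, 0.70, 0.10],
    [0.90, 1.00, 0.88, 0.15],
    [0.70, 0.88, 1.00, 0.12],
    [0.10, 0.15, 0.12, 1.00],
]

toy_groups = group_by_threshold(toy_similarity, 0.85)
print(toy_groups)
```

The trade-off is that transitive grouping can chain together documents whose direct similarity is below the threshold, as documents 0 and 2 are here.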

## 8. Text Classification with Embeddings

In [None]:
class EmbeddingClassifier:
    """
    Zero-shot classification using embeddings.
    """
    
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
        self.class_embeddings = None
        self.class_names = []
    
    def set_classes(self, class_descriptions: dict):
        """
        Set the classes using text descriptions.
        
        Args:
            class_descriptions: {"class_name": "class description"}
        """
        self.class_names = list(class_descriptions.keys())
        descriptions = list(class_descriptions.values())
        
        self.class_embeddings = self.model.encode(descriptions)
        print(f"Set {len(self.class_names)} classes: {', '.join(self.class_names)}")
    
    def classify(self, text, return_all=False):
        """
        Classify text into one of the classes.
        """
        if self.class_embeddings is None:
            raise ValueError("Set the classes with set_classes() first")
        
        # Embed the text
        text_embedding = self.model.encode([text])
        
        # Similarity to each class
        similarities = cosine_similarity(text_embedding, self.class_embeddings)[0]
        
        # Best class
        best_idx = similarities.argmax()
        
        result = {
            "class": self.class_names[best_idx],
            "confidence": float(similarities[best_idx])
        }
        
        if return_all:
            result["all_scores"] = {
                name: float(score)
                for name, score in zip(self.class_names, similarities)
            }
        
        return result
    
    def classify_batch(self, texts):
        """
        Classify multiple texts at once.
        """
        return [self.classify(text) for text in texts]

# Test
classifier = EmbeddingClassifier()

# Define the classes
classes = {
    "technology": "Technology, computers, software, programming, AI, internet",
    "sports": "Sports, athletics, games, competitions, teams, players",
    "health": "Health, medicine, fitness, wellness, diseases, treatments",
    "business": "Business, finance, economy, companies, markets, investments",
    "entertainment": "Entertainment, movies, music, celebrities, arts, culture"
}

classifier.set_classes(classes)

In [None]:
# Test classification
test_texts = [
    "Apple announced new iPhone with improved camera and longer battery life.",
    "The team won the championship after an exciting final match.",
    "New study shows benefits of regular exercise for heart health.",
    "Stock market reached record highs amid positive economic data.",
    "The new movie broke box office records in its opening weekend."
]

print("Classification results:")
print("=" * 60)

for text in test_texts:
    result = classifier.classify(text, return_all=True)
    
    print(f"\n📝 {text[:60]}...")
    print(f"   ➡️ {result['class'].upper()} ({result['confidence']:.1%})")
    
    # Top 3 scores
    sorted_scores = sorted(result['all_scores'].items(), key=lambda x: -x[1])[:3]
    scores_str = ", ".join([f"{k}: {v:.1%}" for k, v in sorted_scores])
    print(f"   Scores: {scores_str}")
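
The scores printed above are cosine similarities formatted as percentages. They lie between -1 and 1 and do not sum to 1 across classes, so they are not probabilities. One common way to turn them into a distribution is a softmax with a temperature, sketched below. The temperature of 0.1 and the scores are arbitrary assumptions, and the output is a rescaling rather than a calibrated confidence.

```python
import math

def softmax(scores, temperature=1.0):
    """Convert raw scores into positive values that sum to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up similarity scores for three classes
toy_scores = {"technology": 0.42, "business": 0.25, "sports": 0.08}

probs = softmax(list(toy_scores.values()), temperature=0.1)

for name, p in zip(toy_scores, probs):
    print(f"{name}: {p:.1%}")
```

A lower temperature sharpens the distribution toward the top class, and a higher one flattens it.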

## Summary

In this notebook we covered:

1. **Text Embeddings** - vector representations of text with Sentence Transformers
2. **Semantic search** - finding documents by meaning
3. **Multilingual embeddings** - working with multiple languages
4. **Document clustering** - automatically grouping similar texts
5. **FAISS** - efficient vector search for large collections
6. **FAQ Bot** - a practical application of semantic search
7. **Duplicate detection** - finding similar documents
8. **Zero-shot classification** - classification without training

### Next steps
- Notebook 12: Fine-tuning models