# ❓ Question Answering and Document Search

**Autor:** Praut s.r.o. - AI Integration & Business Automation

## What you will learn:
- Extractive QA (finding the answer span in a text)
- Generative QA
- Document search
- FAQ and helpdesk automation

In [None]:
!pip install -q transformers accelerate torch faiss-cpu sentence-transformers

In [None]:
from transformers import pipeline
import torch

device = 0 if torch.cuda.is_available() else -1
print(f"🖥️ Device: {'GPU' if device == 0 else 'CPU'}")

## 1. Extractive Question Answering

In [None]:
# QA pipeline
qa = pipeline("question-answering", 
              model="deepset/roberta-base-squad2",
              device=device)

# Context (knowledge base)
kontext = """
Praut s.r.o. is a Czech company specializing in AI integration and business automation. 
The company was founded in Cheb and focuses on implementing artificial intelligence 
into business processes for Czech firms. Services include AI automation, custom development, 
cloud and server deployment, employee training, and system design.

The company works with modern technology stacks including Django, Angular, PostgreSQL, 
Redis, Celery, and various AI providers. PostHub is the company's flagship SaaS product 
for social media content automation.

Working hours are Monday to Friday, 9:00 AM to 5:00 PM. The support team can be reached 
at support@praut.cz. Emergency support is available 24/7 for enterprise customers.
"""

otazky = [
    "Where is Praut s.r.o. located?",
    "What services does the company offer?",
    "What is PostHub?",
    "What are the working hours?",
    "How can I contact support?"
]

print("❓ Automatic question answering:\n")
for otazka in otazky:
    odpoved = qa(question=otazka, context=kontext)
    print(f"Q: {otazka}")
    print(f"A: {odpoved['answer']} (confidence: {odpoved['score']:.1%})\n")
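
The loop above prints whatever span the model returns, even when the score is very low or the answer is empty. Below is a minimal sketch of a guard, assuming the result is a dict with `answer` and `score` keys as in the loop. The helper name `format_answer` and the 0.3 cutoff are hypothetical and would need tuning on real data. The sample results are written by hand, not produced by the model.

```python
def format_answer(result, min_score=0.3):
    """Return the answer text, or a fallback when the model is unsure.

    `result` is assumed to be a dict with 'answer' and 'score' keys,
    like the one returned by the question-answering pipeline above.
    `min_score` is an illustrative cutoff, not a recommended value.
    """
    answer = result.get("answer", "").strip()
    if not answer or result.get("score", 0.0) < min_score:
        return "No confident answer found."
    return f"{answer} (confidence: {result['score']:.1%})"


# Stand-in results, written by hand for illustration (not real model output).
print(format_answer({"answer": "Cheb", "score": 0.91}))
print(format_answer({"answer": "", "score": 0.75}))
print(format_answer({"answer": "Redis", "score": 0.04}))
```

With the real pipeline, `format_answer(qa(question=otazka, context=kontext))` could replace the two `print` calls. The pipeline also accepts a `handle_impossible_answer` argument, which may be worth trying with a SQuAD 2.0 model such as this one.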

## 2. QA over Multiple Contexts

In [None]:
# Document database
dokumenty = [
    {
        "title": "Pricing",
        "content": """Our pricing plans include: Starter at $29/month for small businesses, 
        Professional at $99/month for growing teams, and Enterprise with custom pricing 
        for large organizations. All plans include 14-day free trial."""
    },
    {
        "title": "Features",
        "content": """Key features include: AI-powered content generation, multi-platform 
        social media scheduling, analytics dashboard, team collaboration tools, 
        and API access for custom integrations."""
    },
    {
        "title": "Support",
        "content": """Support options: Email support with 24h response time for all plans, 
        live chat available for Professional and Enterprise, dedicated account manager 
        for Enterprise customers. Knowledge base available at docs.example.com."""
    }
]

def najdi_odpoved(otazka, dokumenty):
    """Search all documents and return the best-scoring answer."""
    nejlepsi = None
    
    for doc in dokumenty:
        odpoved = qa(question=otazka, context=doc['content'])
        if nejlepsi is None or odpoved['score'] > nejlepsi['score']:
            nejlepsi = {
                **odpoved,
                'source': doc['title']
            }
    
    return nejlepsi

# Test
test_otazky = [
    "How much does the Professional plan cost?",
    "What features are included?",
    "How fast is email support?"
]

print("🔍 Document search:\n")
for otazka in test_otazky:
    vysledek = najdi_odpoved(otazka, dokumenty)
    print(f"Q: {otazka}")
    print(f"A: {vysledek['answer']}")
    print(f"   📄 Source: {vysledek['source']} | Confidence: {vysledek['score']:.1%}\n")
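
`najdi_odpoved` keeps only the single best answer. When scores are close, it can help to look at several candidates. The sketch below returns the top `n` answers across documents. The QA function is passed in as `qa_fn` (a hypothetical parameter) so the example runs without downloading a model, and `fake_qa` is a toy stand-in that scores by word overlap, not a real reader.

```python
import heapq


def top_answers(question, documents, qa_fn, n=2):
    """Run qa_fn on every document and return the n best-scoring answers.

    qa_fn is assumed to behave like the pipeline above: it takes
    `question` and `context` keyword arguments and returns a dict
    with 'answer' and 'score'.
    """
    candidates = []
    for doc in documents:
        result = qa_fn(question=question, context=doc["content"])
        candidates.append({**result, "source": doc["title"]})
    return heapq.nlargest(n, candidates, key=lambda c: c["score"])


def fake_qa(question, context):
    """Toy stand-in for a QA model: scores by shared lowercase words."""
    q_words = set(question.lower().split())
    c_words = set(context.lower().split())
    overlap = len(q_words & c_words)
    return {"answer": context, "score": overlap / max(len(q_words), 1)}


docs = [
    {"title": "Pricing", "content": "the professional plan costs 99 per month"},
    {"title": "Support", "content": "email support replies within 24 hours"},
    {"title": "Features", "content": "the plan includes analytics and scheduling"},
]

for hit in top_answers("how much is the professional plan", docs, fake_qa):
    print(f"{hit['source']}: {hit['score']:.2f}")
```

In the notebook, the `qa` pipeline can be passed as `qa_fn` directly, for example `top_answers(otazka, dokumenty, qa)`.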

## 3. Semantic Search with Sentence Transformers

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Embedding model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# FAQ database
faq = [
    {"q": "How do I reset my password?", "a": "Go to Settings > Security > Reset Password."},
    {"q": "Can I cancel my subscription?", "a": "Yes, you can cancel anytime from Account Settings."},
    {"q": "How to export my data?", "a": "Use the Export feature in Dashboard > Data > Export."},
    {"q": "Is there a mobile app?", "a": "Yes, available for iOS and Android."},
    {"q": "How to add team members?", "a": "Go to Team > Invite Members and enter their emails."},
    {"q": "What payment methods do you accept?", "a": "We accept credit cards, PayPal, and bank transfers."},
    {"q": "How to integrate with Slack?", "a": "Navigate to Integrations > Slack and click Connect."},
    {"q": "Can I get a refund?", "a": "Refunds are available within 30 days of purchase."}
]

# Create embeddings for the FAQ questions
faq_otazky = [f['q'] for f in faq]
faq_embeddings = embedder.encode(faq_otazky, convert_to_numpy=True)

def semantic_search(dotaz, top_k=3):
    """Find the most relevant FAQ entries."""
    dotaz_embedding = embedder.encode([dotaz], convert_to_numpy=True)[0]
    
    # Compute cosine similarity
    similarities = np.dot(faq_embeddings, dotaz_embedding) / (
        np.linalg.norm(faq_embeddings, axis=1) * np.linalg.norm(dotaz_embedding)
    )
    
    # Top-k results
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    return [(faq[i], similarities[i]) for i in top_indices]

# Test
uzivatelske_dotazy = [
    "I forgot my password",
    "How to connect Slack?",
    "I want my money back"
]

print("🔎 Semantic FAQ Search:\n")
for dotaz in uzivatelske_dotazy:
    vysledky = semantic_search(dotaz, top_k=1)
    nejlepsi = vysledky[0]
    print(f"❓ Query: {dotaz}")
    print(f"   Matched: {nejlepsi[0]['q']}")
    print(f"   💡 Answer: {nejlepsi[0]['a']}")
    print(f"   Relevance: {nejlepsi[1]:.1%}\n")
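
The cosine formula in `semantic_search` recomputes the FAQ vector norms on every query. If the vectors are normalized to unit length once, cosine similarity reduces to a plain dot product. The sketch below shows this with the standard library only. The three-dimensional vectors are hand-made stand-ins for real sentence embeddings, and the helper names are illustrative.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hand-made toy vectors standing in for real sentence embeddings.
corpus = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
query = [2.0, 2.0, 0.0]

# Normalize the corpus once, up front; only the query is normalized per search.
corpus_unit = [normalize(v) for v in corpus]
query_unit = normalize(query)

scores = [dot(v, query_unit) for v in corpus_unit]
best = max(range(len(scores)), key=scores.__getitem__)
print(best, round(scores[best], 3))
```

With Sentence Transformers, passing `normalize_embeddings=True` to `encode` should give unit-length vectors, so `np.dot(faq_embeddings, dotaz_embedding)` alone would then yield the same ranking.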

## 4. Automated FAQ Bot

In [None]:
class FAQBot:
    def __init__(self, faq_data, threshold=0.6):
        self.faq = faq_data
        self.threshold = threshold
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.faq_embeddings = self.embedder.encode(
            [f['q'] for f in self.faq], 
            convert_to_numpy=True
        )
    
    def odpovez(self, dotaz):
        """Answer the query, or ask for clarification when no match is close enough."""
        dotaz_emb = self.embedder.encode([dotaz], convert_to_numpy=True)[0]
        
        similarities = np.dot(self.faq_embeddings, dotaz_emb) / (
            np.linalg.norm(self.faq_embeddings, axis=1) * np.linalg.norm(dotaz_emb)
        )
        
        max_idx = np.argmax(similarities)
        max_sim = similarities[max_idx]
        
        if max_sim >= self.threshold:
            return {
                'status': 'found',
                'answer': self.faq[max_idx]['a'],
                'matched_question': self.faq[max_idx]['q'],
                'confidence': float(max_sim)
            }
        else:
            # Offer similar questions
            top_3 = np.argsort(similarities)[::-1][:3]
            suggestions = [self.faq[i]['q'] for i in top_3]
            return {
                'status': 'unclear',
                'message': 'I could not find an exact answer. Did you mean:',
                'suggestions': suggestions
            }

# Initialize the bot
bot = FAQBot(faq)

# Simulated conversation
print("🤖 FAQ Bot Demo:\n")

konverzace = [
    "How can I change my password?",
    "something about teams",
    "refund policy"
]

for dotaz in konverzace:
    print(f"👤 User: {dotaz}")
    odpoved = bot.odpovez(dotaz)
    
    if odpoved['status'] == 'found':
        print(f"🤖 Bot: {odpoved['answer']}")
        print(f"   (Matched: {odpoved['matched_question']}, {odpoved['confidence']:.1%})")
    else:
        print(f"🤖 Bot: {odpoved['message']}")
        for i, sug in enumerate(odpoved['suggestions'], 1):
            print(f"   {i}. {sug}")
    print()

## 5. Document QA on Long Texts

In [None]:
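def qa_dlouhy_dokument(otazka, dokument, chunk_size=500, overlap=50):
    """QA for long documents: split the text into overlapping word chunks."""
    # A non-positive step would make range() fail or loop incorrectly
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    
    # Split into chunks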
    words = dokument.split()
    chunky = []
    
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunky.append(chunk)
    
    # Run QA on every chunk
    nejlepsi_odpoved = None
    
    for i, chunk in enumerate(chunky):
        try:
            odpoved = qa(question=otazka, context=chunk)
            if nejlepsi_odpoved is None or odpoved['score'] > nejlepsi_odpoved['score']:
                nejlepsi_odpoved = {
                    **odpoved,
                    'chunk_index': i
                }
        except Exception:  # skip chunks the model cannot process; do not swallow KeyboardInterrupt
            continue
    
    return nejlepsi_odpoved

# Long document
dlouhy_text = """
Chapter 1: Introduction to AI

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines. 
The field was founded at a workshop at Dartmouth College in 1956. Key figures include 
John McCarthy, who coined the term, and Alan Turing, who proposed the Turing Test.

Chapter 2: Machine Learning Fundamentals

Machine learning is a subset of AI that enables systems to learn from data. The main types 
are supervised learning, unsupervised learning, and reinforcement learning. Popular algorithms 
include neural networks, decision trees, and support vector machines.

Chapter 3: Deep Learning Revolution

Deep learning uses neural networks with many layers. The breakthrough came in 2012 when 
AlexNet won the ImageNet competition. Key architectures include CNNs for images, RNNs for 
sequences, and Transformers for various tasks.

Chapter 4: Natural Language Processing

NLP enables computers to understand human language. BERT, introduced by Google in 2018, 
revolutionized the field. GPT models from OpenAI demonstrated impressive text generation 
capabilities. The latest models like GPT-4 show emergent abilities.
"""

otazky_dlouhy = [
    "When was AI founded?",
    "What are the types of machine learning?",
    "When did BERT come out?"
]

print("📚 QA on a long document:\n")
for otazka in otazky_dlouhy:
    odpoved = qa_dlouhy_dokument(otazka, dlouhy_text)
    print(f"Q: {otazka}")
    print(f"A: {odpoved['answer']} ({odpoved['score']:.1%})\n")
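
To see how `chunk_size` and `overlap` interact, the chunking step can be isolated and run on a tiny input. The sketch below is a standalone variant of that step, not the function above. The name `word_chunks` is hypothetical, and it stops once a window reaches the end of the text, which avoids a trailing chunk that only repeats words already covered.

```python
def word_chunks(text, chunk_size=500, overlap=50):
    """Split text into overlapping word windows.

    Consecutive chunks share `overlap` words, so an answer that
    straddles a boundary still appears whole in at least one chunk
    (as long as it is no longer than `overlap` words).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in word_chunks(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each window here starts three words after the previous one (`chunk_size - overlap`), so the last word of one chunk is the first word of the next.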

## 6. Batch FAQ Processing

In [None]:
import pandas as pd

# Simulated support tickets
tickety = pd.DataFrame({
    'id': range(1, 8),
    'dotaz': [
        "Can't log in to my account",
        "Need to add more users to team",
        "Export data to CSV",
        "Connect with Slack workspace",
        "Upgrade subscription plan",
        "Delete my account",
        "Mobile app not working"
    ]
})

# Automatic replies
odpovedi = []
for _, row in tickety.iterrows():
    result = bot.odpovez(row['dotaz'])
    odpovedi.append({
        'auto_reply': result.get('answer', 'Forwarded to an operator'),
        'confidence': result.get('confidence', 0),
        'auto_resolved': result['status'] == 'found' and result.get('confidence', 0) > 0.7
    })

tickety['auto_reply'] = [o['auto_reply'] for o in odpovedi]
tickety['confidence'] = [o['confidence'] for o in odpovedi]
tickety['auto_resolved'] = [o['auto_resolved'] for o in odpovedi]

print("📊 Automatic ticket processing:\n")
print(tickety[['id', 'dotaz', 'auto_resolved', 'confidence']].to_string(index=False))

print(f"\n✅ Automatically resolved: {tickety['auto_resolved'].sum()}/{len(tickety)}")
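
The batch loop combines two cutoffs: the bot answers at a similarity of 0.6 or higher, but a ticket counts as auto-resolved only above 0.7. One way to make that middle band explicit is a small triage function. This is a sketch only: the function name and the three labels are hypothetical, and the scores in the loop are illustrative, not measured.

```python
def triage(confidence, answer_at=0.6, auto_resolve_above=0.7):
    """Map a similarity score to a handling decision.

    The defaults mirror the two cutoffs used above: the bot's match
    threshold (>= 0.6) and the stricter auto-resolve check (> 0.7).
    """
    if confidence > auto_resolve_above:
        return "auto_resolve"
    if confidence >= answer_at:
        return "suggest_and_review"
    return "forward_to_operator"


# Illustrative scores only, not measured results.
for score in (0.82, 0.65, 0.31):
    print(score, triage(score))
```

Applied to the table above, `tickety['confidence'].apply(triage)` would add a routing column alongside `auto_resolved`.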

---
## 🏁 Summary

- ✅ Extractive QA with RoBERTa
- ✅ Semantic search for FAQ
- ✅ Automated FAQ bot
- ✅ QA on long documents
- ✅ Support ticket automation

**Next notebook:** Whisper - speech-to-text