## 📦 Installing dependencies

In [1]:
# Install the required packages.
# `!pip` failed on this Windows machine ("'pip' is not recognized as an internal or
# external command"), so use the %pip magic, which runs pip for the kernel's own interpreter.
%pip install sentence-transformers faiss-cpu pandas numpy scikit-learn umap-learn plotly streamlit fastapi uvicorn


## 📚 Imports

In [1]:
import pandas as pd
import numpy as np
import faiss
import pickle
import time
from pathlib import Path
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# Sentence Transformers for the embeddings
from sentence_transformers import SentenceTransformer, CrossEncoder

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from umap import UMAP

# Evaluation
from sklearn.metrics.pairwise import cosine_similarity

print("✅ All imports succeeded!")

✅ All imports succeeded!


# Step 1: Building the Corpus
## 📄 Loading and cleaning the data

In [2]:
# Path to the data directory
data_dir = Path("BD quest_resp medecine")

# Medical CSV files to load
medical_files = [
    "CancerQA.csv",
    "Diabetes_and_Digestive_and_Kidney_DiseasesQA.csv",
    "Disease_Control_and_PreventionQA.csv",
    "Genetic_and_Rare_DiseasesQA.csv",
    "Heart_Lung_and_BloodQA.csv",
    "MedicalQuestionAnswering.csv",
    "Neurological_Disorders_and_StrokeQA.csv",
    "SeniorHealthQA.csv"
]

# Load every file and tag each row with the file it came from
all_data = []
for file in medical_files:
    try:
        df = pd.read_csv(data_dir / file, encoding='utf-8')
        df['source_file'] = file.replace('.csv', '')
        all_data.append(df)
        print(f"✅ {file}: {len(df)} rows loaded")
    except Exception as e:
        print(f"❌ Error with {file}: {e}")

# Concatenate all DataFrames
corpus_df = pd.concat(all_data, ignore_index=True)
print(f"\n📊 Total corpus: {len(corpus_df)} documents")

✅ CancerQA.csv: 729 rows loaded
✅ Diabetes_and_Digestive_and_Kidney_DiseasesQA.csv: 1192 rows loaded
✅ Disease_Control_and_PreventionQA.csv: 270 rows loaded
✅ Genetic_and_Rare_DiseasesQA.csv: 5388 rows loaded
✅ Heart_Lung_and_BloodQA.csv: 559 rows loaded
✅ MedicalQuestionAnswering.csv: 16406 rows loaded
✅ Neurological_Disorders_and_StrokeQA.csv: 1088 rows loaded
✅ SeniorHealthQA.csv: 769 rows loaded

📊 Total corpus: 26401 documents


In [3]:
# Inspect the structure of the data
print("📋 Available columns:")
print(corpus_df.columns.tolist())
print("\n📊 Data preview:")
corpus_df.head(3)

📋 Available columns:
['Question', 'Answer', 'topic', 'split', 'source_file']

📊 Data preview:


| | Question | Answer | topic | split | source_file |
|---|---|---|---|---|---|
| 0 | What is (are) Non-Small Cell Lung Cancer ? | Key Points\n - Non-small ce... | cancer | train | CancerQA |
| 1 | Who is at risk for Non-Small Cell Lung Cancer? ? | Smoking is the major risk factor for non-small... | cancer | train | CancerQA |
| 2 | What are the symptoms of Non-Small Cell Lung C... | Signs of non-small cell lung cancer include a ... | cancer | test | CancerQA |


In [4]:
# Data cleaning
def clean_text(text):
    """Normalize whitespace; return an empty string for missing values."""
    if pd.isna(text):
        return ""

    # split() with no argument splits on any run of whitespace (spaces, tabs,
    # \n, \r) and drops leading/trailing whitespace, so joining with single
    # spaces also removes line breaks
    return ' '.join(str(text).split())

# Clean the columns
corpus_df['Question'] = corpus_df['Question'].apply(clean_text)
corpus_df['Answer'] = corpus_df['Answer'].apply(clean_text)

# Drop rows with an empty question or answer (copy so later assignments do not hit a view)
corpus_df = corpus_df[(corpus_df['Question'] != '') & (corpus_df['Answer'] != '')].copy()

# Combined text column used for search
corpus_df['text'] = corpus_df['Question'] + " " + corpus_df['Answer']

# Unique ID for each document
corpus_df['doc_id'] = range(len(corpus_df))

print(f"✅ Cleaning done. {len(corpus_df)} valid documents")

✅ Cleaning done. 26401 valid documents
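
To see what the whitespace normalization in `clean_text` does, here is a minimal standalone sketch. It uses only the standard library; `normalize_whitespace` is an illustrative name that does not appear in the notebook, and the sample string is made up.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no argument splits on any whitespace run and ignores
    # leading/trailing whitespace, so no separate strip() is needed.
    return " ".join(text.split())


# Made-up input mixing \n, \r\n, a tab, repeated spaces and a trailing space
raw = "Key Points\n - Non-small cell\r\n\tlung   cancer "
cleaned = normalize_whitespace(raw)
print(repr(cleaned))
```

A string that contains only whitespace comes back empty, which is why the notebook can filter out blank questions and answers with a simple `!= ''` comparison.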


In [5]:
# Limit the corpus to about 1500 documents for the project (as requested: 500-2000)
if len(corpus_df) > 1500:
    corpus_df = corpus_df.sample(n=1500, random_state=42).reset_index(drop=True)
    corpus_df['doc_id'] = range(len(corpus_df))
    print(f"📊 Corpus reduced to {len(corpus_df)} documents")

# Corpus statistics
print("\n📊 Corpus statistics:")
print(f"Number of documents: {len(corpus_df)}")
print(f"Sources: {corpus_df['source_file'].nunique()}")
print(f"\nDistribution by source:")
print(corpus_df['source_file'].value_counts())

📊 Corpus reduced to 1500 documents

📊 Corpus statistics:
Number of documents: 1500
Sources: 8

Distribution by source:
source_file
MedicalQuestionAnswering                        922
Genetic_and_Rare_DiseasesQA                     324
Neurological_Disorders_and_StrokeQA              69
Diabetes_and_Digestive_and_Kidney_DiseasesQA     61
SeniorHealthQA                                   38
Heart_Lung_and_BloodQA                           36
CancerQA                                         34
Disease_Control_and_PreventionQA                 16
Name: count, dtype: int64
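
A quick arithmetic check suggests the sample keeps roughly the proportions of the full corpus, as uniform random sampling should. The sketch below uses only the row counts printed in the outputs above; the expected count for a source is 1500 × (rows in that source) / 26401.

```python
# Rows per source before sampling (from the loading output)
full_counts = {
    "CancerQA": 729,
    "Diabetes_and_Digestive_and_Kidney_DiseasesQA": 1192,
    "Disease_Control_and_PreventionQA": 270,
    "Genetic_and_Rare_DiseasesQA": 5388,
    "Heart_Lung_and_BloodQA": 559,
    "MedicalQuestionAnswering": 16406,
    "Neurological_Disorders_and_StrokeQA": 1088,
    "SeniorHealthQA": 769,
}

# Rows per source observed in the 1500-document sample
sample_counts = {
    "MedicalQuestionAnswering": 922,
    "Genetic_and_Rare_DiseasesQA": 324,
    "Neurological_Disorders_and_StrokeQA": 69,
    "Diabetes_and_Digestive_and_Kidney_DiseasesQA": 61,
    "SeniorHealthQA": 38,
    "Heart_Lung_and_BloodQA": 36,
    "CancerQA": 34,
    "Disease_Control_and_PreventionQA": 16,
}

total = sum(full_counts.values())
sample_size = sum(sample_counts.values())

# Expected count under uniform sampling: proportional to the source's share
expected = {src: sample_size * n / total for src, n in full_counts.items()}

for src, observed in sample_counts.items():
    print(f"{src:<46} observed={observed:4d}  expected={expected[src]:7.1f}")
```

The sample is dominated by `MedicalQuestionAnswering` (about 61% of the rows) because that file supplies about 62% of the full corpus. Stratified sampling would be one option if a more balanced corpus were wanted.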


In [6]:
# Save the cleaned corpus
corpus_df.to_csv('docs_medical.csv', index=False)
print("✅ Corpus saved to 'docs_medical.csv'")

✅ Corpus saved to 'docs_medical.csv'


# Step 2: Vectorization and Indexing
## 🧠 Generating embeddings with Sentence-Transformers

In [7]:
# Load the embedding model
print("📥 Loading the Sentence-Transformer model...")
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print(f"✅ Model loaded. Embedding dimension: {model.get_sentence_embedding_dimension()}")

📥 Loading the Sentence-Transformer model...
✅ Model loaded. Embedding dimension: 384


In [8]:
# Generate the embeddings
print("🔄 Generating embeddings... (this may take a few minutes)")
start_time = time.time()

# Encode the texts
embeddings = model.encode(
    corpus_df['text'].tolist(),
    show_progress_bar=True,
    batch_size=32,
    normalize_embeddings=True  # Unit-length vectors, so the inner product equals cosine similarity
)

elapsed_time = time.time() - start_time
print(f"\n✅ Embeddings generated in {elapsed_time:.2f}s")
print(f"📊 Embeddings shape: {embeddings.shape}")

🔄 Generating embeddings... (this may take a few minutes)


Batches: 100%|██████████| 47/47 [01:02<00:00,  1.34s/it]


✅ Embeddings generated in 62.92s
📊 Embeddings shape: (1500, 384)





In [9]:
# Save the embeddings
np.save('embeddings_medical.npy', embeddings)
print("✅ Embeddings saved to 'embeddings_medical.npy'")

✅ Embeddings saved to 'embeddings_medical.npy'


## Model test - Semantic search

In [25]:
# Search test with the model
def test_semantic_search(query, model, embeddings, corpus_df, top_k=5):
    """Run a semantic search with the real model and print the top results"""
    print(f"\n🔍 Question: {query}")
    print("=" * 80)

    # Encode the query with the model
    query_embedding = model.encode([query], normalize_embeddings=True)

    # Compute similarities (dot product of unit vectors = cosine similarity)
    similarities = np.dot(embeddings, query_embedding.T).flatten()

    # Sort by descending score
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Print the results
    for rank, idx in enumerate(top_indices, 1):
        doc = corpus_df.iloc[idx]
        score = similarities[idx]
        print(f"\n📄 Result #{rank} | Score: {score:.4f}")
        print(f"Question: {doc['Question'][:100]}...")
        print(f"Answer: {doc['Answer'][:150]}...")
        print(f"Source: {doc['source_file']}")

    return top_indices, similarities[top_indices]

# Test with several medical questions
test_questions = [
    "What are the symptoms of diabetes?",
    "How to treat cancer?",
    "What causes heart disease?",
    "What are neurological disorders?"
]

print("🧪 Semantic search model test")
print("=" * 80)

for query in test_questions:
    indices, scores = test_semantic_search(query, model, embeddings, corpus_df, top_k=3)
    print("\n" + "-" * 80)

🧪 Semantic search model test

🔍 Question: What are the symptoms of diabetes?

📄 Result #1 | Score: 0.7964
Question: What are the symptoms of Diabetes ?...
Answer: Many people with diabetes experience one or more symptoms, including extreme thirst or hunger, a frequent need to urinate and/or fatigue. Some lose we...
Source: MedicalQuestionAnswering

📄 Result #2 | Score: 0.7698
Question: What are the symptoms of Your Guide to Diabetes: Type 1 and Type 2 ?...
Answer: The signs and symptoms of diabetes are - being very thirsty - urinating often - feeling very hungry - feeling very tired - losing weight without tryin...
Source: Diabetes_and_Digestive_and_Kidney_DiseasesQA

📄 Result #3 | Score: 0.6423
Question: What are the symptoms of Prevent diabetes problems: Keep your kidneys healthy ?...
Answer: In the early stages, diabetic kidney disease does not have any symptoms. Kidney disease happens so slowly that you may not feel sick at all for many y

## 🗂️ Creating the FAISS index

In [10]:
# Create the FAISS index
dimension = embeddings.shape[1]

# IndexFlatIP (exact inner-product search), suited to normalized embeddings
index = faiss.IndexFlatIP(dimension)

# Add the embeddings to the index
index.add(embeddings.astype('float32'))

print(f"✅ FAISS index created")
print(f"📊 Number of vectors in the index: {index.ntotal}")

✅ FAISS index created
📊 Number of vectors in the index: 1500


In [11]:
# Save the FAISS index
faiss.write_index(index, 'medical_faiss.index')
print("✅ FAISS index saved to 'medical_faiss.index'")

✅ FAISS index saved to 'medical_faiss.index'
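
For intuition: `IndexFlatIP` performs an exhaustive inner-product search, and on unit-length vectors the inner product equals cosine similarity. The following pure-Python sketch illustrates that idea. It is not FAISS itself, and the toy vectors and helper names (`l2_normalize`, `flat_ip_search`) are made up for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def flat_ip_search(query, vectors, top_k):
    """Score every stored vector against the query; return the top_k (index, score) pairs."""
    scored = [(i, inner_product(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional "embeddings", normalized like the notebook's embeddings
docs = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 2.0])]
query = l2_normalize([2.0, 1.0])

hits = flat_ip_search(query, docs, top_k=2)
for idx, score in hits:
    print(f"doc {idx}: score={score:.4f}")
```

Because every vector is scored, the search is exact. That is affordable for 1500 documents, while larger corpora often use approximate index types instead.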


## 🔍 Search function

In [12]:
def search_documents(query: str, top_k: int = 10) -> List[Dict]:
    """
    Search for the documents most relevant to a query

    Args:
        query: The natural-language query
        top_k: Number of results to return

    Returns:
        List of dictionaries containing the relevant documents
    """
    # FAISS pads results with index -1 when fewer than top_k vectors exist,
    # so never ask for more than the index holds
    top_k = min(top_k, index.ntotal)

    # Encode the query
    query_embedding = model.encode([query], normalize_embeddings=True)

    # Search the FAISS index
    scores, indices = index.search(query_embedding.astype('float32'), top_k)

    # Build the results
    results = []
    for i, (idx, score) in enumerate(zip(indices[0], scores[0])):
        doc = corpus_df.iloc[idx]
        results.append({
            'rank': i + 1,
            'doc_id': int(doc['doc_id']),
            'question': doc['Question'],
            'answer': doc['Answer'],
            'source': doc['source_file'],
            'score': float(score)
        })

    return results

print("✅ Search function created")

✅ Search function created


In [13]:
# Search test
test_query = "What are the symptoms of diabetes?"
print(f"🔍 Test query: {test_query}\n")

results = search_documents(test_query, top_k=5)

for result in results:
    print(f"Rank {result['rank']} | Score: {result['score']:.4f}")
    print(f"Question: {result['question'][:100]}...")
    print(f"Answer: {result['answer'][:200]}...")
    print(f"Source: {result['source']}")
    print("-" * 80)

🔍 Test query: What are the symptoms of diabetes?

Rank 1 | Score: 0.7964
Question: What are the symptoms of Diabetes ?...
Answer: Many people with diabetes experience one or more symptoms, including extreme thirst or hunger, a frequent need to urinate and/or fatigue. Some lose weight without trying. Additional signs include sore...
Source: MedicalQuestionAnswering
--------------------------------------------------------------------------------
Rank 2 | Score: 0.7699
Question: What are the symptoms of Your Guide to Diabetes: Type 1 and Type 2 ?...
Answer: The signs and symptoms of diabetes are - being very thirsty - urinating often - feeling very hungry - feeling very tired - losing weight without trying - sores that heal slowly - dry, itchy skin - fee...
Source: Diabetes_and_Digestive_and_Kidney_DiseasesQA
--------------------------------------------------------------------------------
Rank 3 | Score: 0.6423
Question: What are the symptoms of Prevent diabetes problems: Kee

## 🎯 Re-ranking with CrossEncoder

In [14]:
# Load the re-ranking model
print("📥 Loading the CrossEncoder for re-ranking...")
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print("✅ CrossEncoder loaded")

📥 Loading the CrossEncoder for re-ranking...
✅ CrossEncoder loaded


In [15]:
def search_with_reranking(query: str, top_k: int = 10, rerank_top_n: int = 50) -> List[Dict]:
    """
    Search with re-ranking to improve relevance

    Args:
        query: The natural-language query
        top_k: Number of final results
        rerank_top_n: Number of candidates to re-rank

    Returns:
        List of re-ranked results
    """
    # Step 1: initial retrieval with FAISS
    initial_results = search_documents(query, top_k=rerank_top_n)

    # Step 2: score each (query, answer) pair with the CrossEncoder
    pairs = [[query, result['answer']] for result in initial_results]
    cross_scores = cross_encoder.predict(pairs)

    # Step 3: attach the CrossEncoder score to each candidate
    for result, cross_score in zip(initial_results, cross_scores):
        result['cross_score'] = float(cross_score)

    # Step 4: sort by CrossEncoder score and keep the top_k best
    reranked_results = sorted(initial_results, key=lambda x: x['cross_score'], reverse=True)[:top_k]

    # Update the ranks
    for i, result in enumerate(reranked_results):
        result['rank'] = i + 1

    return reranked_results

print("✅ Search function with re-ranking created")

✅ Search function with re-ranking created


In [16]:
# Test with re-ranking
print(f"🔍 Query with re-ranking: {test_query}\n")

reranked_results = search_with_reranking(test_query, top_k=5)

for result in reranked_results:
    print(f"Rank {result['rank']} | FAISS Score: {result['score']:.4f} | Cross Score: {result['cross_score']:.4f}")
    print(f"Question: {result['question'][:100]}...")
    print(f"Source: {result['source']}")
    print("-" * 80)

🔍 Query with re-ranking: What are the symptoms of diabetes?

Rank 1 | FAISS Score: 0.7699 | Cross Score: 9.2198
Question: What are the symptoms of Your Guide to Diabetes: Type 1 and Type 2 ?...
Source: Diabetes_and_Digestive_and_Kidney_DiseasesQA
--------------------------------------------------------------------------------
Rank 2 | FAISS Score: 0.7964 | Cross Score: 8.8553
Question: What are the symptoms of Diabetes ?...
Source: MedicalQuestionAnswering
--------------------------------------------------------------------------------
Rank 3 | FAISS Score: 0.5716 | Cross Score: 4.0853
Question: What are the symptoms of Diabetic mastopathy ?...
Source: MedicalQuestionAnswering
--------------------------------------------------------------------------------
Rank 4 | FAISS Score: 0.6423 | Cross Score: 2.4375
Question: What are the symptoms of Prevent diabetes problems: Keep your kidneys healthy ?...
Source: MedicalQuestionAnswering
--------------------------------------------------
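
The cross-encoder scores printed above are not bounded to [0, 1] (9.2198, for example), which suggests they are raw logits. If a bounded value is wanted for display next to the FAISS score, one option is to pass them through a logistic function. This is a sketch under that assumption; it leaves the ranking unchanged because the function is monotonic.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Cross-encoder scores from the re-ranking output above
cross_scores = [9.2198, 8.8553, 4.0853, 2.4375]
bounded = [sigmoid(s) for s in cross_scores]

for raw, prob in zip(cross_scores, bounded):
    print(f"raw={raw:7.4f} -> bounded={prob:.4f}")
```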

# Step 3: Backend API with FastAPI
## 🚀 Creating the REST API

In [17]:
# Write the API module to a file
api_code = '''
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer, CrossEncoder
import uvicorn

# Initialize FastAPI
app = FastAPI(title="Medical Q&A Search API", version="1.0.0")

# Load the resources
print("Loading resources...")
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
index = faiss.read_index('medical_faiss.index')
corpus_df = pd.read_csv('docs_medical.csv')
print(f"Resources loaded. Index contains {index.ntotal} documents")

# Pydantic models
class QueryRequest(BaseModel):
    query: str
    top_k: int = 10
    use_reranking: bool = True

class SearchResult(BaseModel):
    rank: int
    doc_id: int
    question: str
    answer: str
    source: str
    score: float
    cross_score: Optional[float] = None

class QueryResponse(BaseModel):
    query: str
    results: List[SearchResult]
    total_results: int
    processing_time: float

# Endpoints
@app.get("/")
def root():
    return {
        "message": "Medical Q&A Search API",
        "version": "1.0.0",
        "endpoints": ["/query", "/docs/{doc_id}", "/stats"]
    }

@app.post("/query", response_model=QueryResponse)
def query_documents(request: QueryRequest):
    """Search for relevant documents"""
    import time
    start_time = time.time()

    try:
        # Encode the query
        query_embedding = model.encode([request.query], normalize_embeddings=True)

        # FAISS search: fetch at least 50 candidates when re-ranking,
        # and never fewer than the number of results requested
        top_k = max(50, request.top_k) if request.use_reranking else request.top_k
        scores, indices = index.search(query_embedding.astype('float32'), top_k)

        # Build the results
        results = []
        for idx, score in zip(indices[0], scores[0]):
            doc = corpus_df.iloc[idx]
            results.append({
                'doc_id': int(doc['doc_id']),
                'question': doc['Question'],
                'answer': doc['Answer'],
                'source': doc['source_file'],
                'score': float(score)
            })
        
        # Re-rank if requested
        if request.use_reranking:
            pairs = [[request.query, r['answer']] for r in results]
            cross_scores = cross_encoder.predict(pairs)
            
            for result, cross_score in zip(results, cross_scores):
                result['cross_score'] = float(cross_score)
            
            results = sorted(results, key=lambda x: x['cross_score'], reverse=True)[:request.top_k]
        else:
            results = results[:request.top_k]
        
        # Add the ranks
        for i, result in enumerate(results):
            result['rank'] = i + 1
        
        processing_time = time.time() - start_time
        
        return {
            "query": request.query,
            "results": results,
            "total_results": len(results),
            "processing_time": processing_time
        }
    
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/docs/{doc_id}")
def get_document(doc_id: int):
    """Retrieve a document by its ID"""
    try:
        doc = corpus_df[corpus_df['doc_id'] == doc_id]
        if len(doc) == 0:
            raise HTTPException(status_code=404, detail="Document not found")
        
        doc = doc.iloc[0]
        return {
            "doc_id": int(doc['doc_id']),
            "question": doc['Question'],
            "answer": doc['Answer'],
            "source": doc['source_file'],
            "topic": doc.get('topic', 'N/A')
        }
    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/stats")
def get_stats():
    """Corpus statistics"""
    return {
        "total_documents": len(corpus_df),
        "index_size": index.ntotal,
        "sources": corpus_df['source_file'].value_counts().to_dict(),
        "model": "all-MiniLM-L6-v2",
        "cross_encoder": "ms-marco-MiniLM-L-6-v2"
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
'''

with open('api_medical.py', 'w', encoding='utf-8') as f:
    f.write(api_code)

print("✅ API file created: api_medical.py")
print("\nTo start the API, run:")
print("python api_medical.py")
print("or")
print("uvicorn api_medical:app --reload")

✅ API file created: api_medical.py

To start the API, run:
python api_medical.py
or
uvicorn api_medical:app --reload
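
Once the server is running, the `/query` endpoint can be called from any HTTP client. Below is a minimal client sketch that uses only the standard library. It assumes the API is reachable at `http://localhost:8000` (the port set in `api_medical.py`); `build_query_request` and `post_query` are illustrative helper names, not part of the API.

```python
import json
import urllib.request

# Assumption: the API runs locally on the port configured in api_medical.py
API_URL = "http://localhost:8000/query"


def build_query_request(query: str, top_k: int = 5,
                        use_reranking: bool = True) -> urllib.request.Request:
    """Build a POST request whose JSON body matches the QueryRequest model."""
    payload = {"query": query, "top_k": top_k, "use_reranking": use_reranking}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_query(query: str, top_k: int = 5) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_query_request(query, top_k)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network, so it can be inspected offline
request = build_query_request("What are the symptoms of diabetes?", top_k=3)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the server up, `post_query("What are the symptoms of diabetes?")` should return a dictionary with the `query`, `results`, `total_results`, and `processing_time` fields defined by `QueryResponse`.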


# Step 4: Web Interface with Streamlit
## 🎨 Building the user interface

In [18]:
# Write the Streamlit app to a file
streamlit_code = '''
import streamlit as st
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer, CrossEncoder
import plotly.express as px
import plotly.graph_objects as go
from umap import UMAP
import time

# Page configuration
st.set_page_config(
    page_title="Semantic Medical Search",
    page_icon="🏥",
    layout="wide"
)

# Custom CSS styling
st.markdown("""
    <style>
    .main-header {
        font-size: 3rem;
        color: #1f77b4;
        text-align: center;
        margin-bottom: 2rem;
    }
    .result-card {
        background-color: #f0f2f6;
        padding: 1rem;
        border-radius: 0.5rem;
        margin-bottom: 1rem;
    }
    </style>
""", unsafe_allow_html=True)

# Load the resources
@st.cache_resource
def load_resources():
    """Load all models and data"""
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    index = faiss.read_index('medical_faiss.index')
    corpus_df = pd.read_csv('docs_medical.csv')
    embeddings = np.load('embeddings_medical.npy')
    return model, cross_encoder, index, corpus_df, embeddings

# Search function
def search_with_reranking(query, model, cross_encoder, index, corpus_df, top_k=10):
    """Search with re-ranking"""
    start_time = time.time()

    # Encode the query
    query_embedding = model.encode([query], normalize_embeddings=True)

    # FAISS search
    scores, indices = index.search(query_embedding.astype('float32'), 50)

    # Build the results
    results = []
    for idx, score in zip(indices[0], scores[0]):
        doc = corpus_df.iloc[idx]
        results.append({
            'doc_id': int(doc['doc_id']),
            'question': doc['Question'],
            'answer': doc['Answer'],
            'source': doc['source_file'],
            'score': float(score)
        })
    
    # Re-ranking
    pairs = [[query, r['answer']] for r in results]
    cross_scores = cross_encoder.predict(pairs)
    
    for result, cross_score in zip(results, cross_scores):
        result['cross_score'] = float(cross_score)
    
    results = sorted(results, key=lambda x: x['cross_score'], reverse=True)[:top_k]
    
    for i, result in enumerate(results):
        result['rank'] = i + 1
    
    elapsed = time.time() - start_time
    return results, elapsed

# Main interface
def main():
    # Title
    st.markdown('<h1 class="main-header">🏥 Semantic Medical Search</h1>', unsafe_allow_html=True)
    st.markdown("### Intelligent search system for medical questions")

    # Load the resources
    with st.spinner("Loading models..."):
        model, cross_encoder, index, corpus_df, embeddings = load_resources()

    # Sidebar
    with st.sidebar:
        st.header("⚙️ Settings")
        top_k = st.slider("Number of results", 1, 20, 10)

        st.markdown("---")
        st.header("📊 Corpus statistics")
        st.metric("Total documents", len(corpus_df))
        st.metric("Indexed vectors", index.ntotal)

        st.markdown("---")
        st.header("📚 Sources")
        source_counts = corpus_df['source_file'].value_counts()
        for source, count in source_counts.items():
            st.text(f"{source[:20]}...: {count}")

    # Search box
    st.markdown("### 🔍 Enter your medical question")
    query = st.text_input(
        "Medical question",  # non-empty label, hidden by label_visibility
        placeholder="e.g. What are the symptoms of diabetes?",
        label_visibility="collapsed"
    )

    # Example queries
    st.markdown("**Example questions:**")
    col1, col2, col3 = st.columns(3)
    with col1:
        if st.button("🩺 Symptoms of cancer"):
            query = "What are the symptoms of cancer?"
    with col2:
        if st.button("💊 Treatment for diabetes"):
            query = "What is the treatment for diabetes?"
    with col3:
        if st.button("🧠 Neurological disorders"):
            query = "What are common neurological disorders?"
    
    # Search
    if query:
        with st.spinner("Searching..."):
            results, elapsed = search_with_reranking(
                query, model, cross_encoder, index, corpus_df, top_k
            )

        # Show the metrics
        col1, col2, col3 = st.columns(3)
        with col1:
            st.metric("⏱️ Processing time", f"{elapsed:.3f}s")
        with col2:
            st.metric("📄 Results found", len(results))
        with col3:
            st.metric("🎯 Mean score", f"{np.mean([r['cross_score'] for r in results]):.3f}")

        st.markdown("---")

        # Show the results
        st.markdown("### 📋 Results")
        for result in results:
            with st.expander(
                f"#{result['rank']} - {result['question'][:80]}... (Score: {result['cross_score']:.3f})",
                expanded=(result['rank'] == 1)
            ):
                st.markdown(f"**Question:** {result['question']}")
                st.markdown(f"**Answer:** {result['answer']}")

                col1, col2, col3 = st.columns(3)
                with col1:
                    st.caption(f"📁 Source: {result['source']}")
                with col2:
                    st.caption(f"🔢 Doc ID: {result['doc_id']}")
                with col3:
                    st.caption(f"⭐ FAISS: {result['score']:.3f} | Cross: {result['cross_score']:.3f}")
        
        # Score visualization
        st.markdown("---")
        st.markdown("### 📊 Score distribution")

        scores_df = pd.DataFrame(results)
        fig = go.Figure()
        fig.add_trace(go.Bar(
            x=scores_df['rank'],
            y=scores_df['score'],
            name='FAISS Score',
            marker_color='lightblue'
        ))
        fig.add_trace(go.Bar(
            x=scores_df['rank'],
            y=scores_df['cross_score'],
            name='Cross-Encoder Score',
            marker_color='orange'
        ))
        fig.update_layout(
            title="FAISS vs Cross-Encoder score comparison",
            xaxis_title="Rank",
            yaxis_title="Score",
            barmode='group'
        )
        st.plotly_chart(fig, use_container_width=True)

if __name__ == "__main__":
    main()
'''

with open('app_streamlit.py', 'w', encoding='utf-8') as f:
    f.write(streamlit_code)

print("✅ Streamlit file created: app_streamlit.py")
print("\nTo start the application, run:")
print("streamlit run app_streamlit.py")

✅ Streamlit file created: app_streamlit.py

To start the application, run:
streamlit run app_streamlit.py


# Step 5: Evaluation and Visualization
## 📈 Performance metrics

In [19]:
# Create the test queries
test_queries = [
    "What are the symptoms of diabetes?",
    "How is cancer treated?",
    "What causes heart disease?",
    "What are neurological disorders?",
    "How to prevent stroke?",
    "What is genetic testing?",
    "Symptoms of kidney disease",
    "Treatment for blood disorders",
    "What causes rare diseases?",
    "Health issues in seniors"
]

print(f"✅ {len(test_queries)} test queries created")

✅ 10 test queries created


In [20]:
# Evaluate performance
def evaluate_system(queries, top_k=10):
    """Evaluate the system on a set of queries"""
    latencies = []
    all_scores = []

    for query in queries:
        start_time = time.time()
        results = search_with_reranking(query, top_k=top_k)
        latency = time.time() - start_time

        latencies.append(latency)
        all_scores.extend([r['cross_score'] for r in results])

    metrics = {
        'avg_latency': np.mean(latencies),
        'max_latency': np.max(latencies),
        'min_latency': np.min(latencies),
        'avg_score': np.mean(all_scores),  # mean CrossEncoder score over all returned results
        'total_queries': len(queries)
    }

    return metrics, latencies, all_scores

print("🔄 Evaluating the system...")
metrics, latencies, all_scores = evaluate_system(test_queries)

print("\n📊 Performance metrics:")
print(f"Mean latency: {metrics['avg_latency']:.3f}s")
print(f"Min latency: {metrics['min_latency']:.3f}s")
print(f"Max latency: {metrics['max_latency']:.3f}s")
print(f"Mean score: {metrics['avg_score']:.3f}")
print(f"Queries tested: {metrics['total_queries']}")

🔄 Evaluating the system...

📊 Performance metrics:
Mean latency: 5.821s
Min latency: 4.489s
Max latency: 6.756s
Mean score: 0.250
Queries tested: 10
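
Mean, minimum, and maximum do not show how latency is distributed across queries. As a sketch, the median and a high percentile could be computed from the `latencies` list with the standard library. The values below are invented placeholders so the snippet runs on its own; in the notebook, the list returned by `evaluate_system` would be used instead.

```python
import statistics

# Hypothetical per-query latencies in seconds (placeholders, not measured values)
latencies = [4.5, 5.2, 5.6, 5.8, 5.9, 6.0, 6.1, 6.2, 6.4, 6.7]

median = statistics.median(latencies)

# quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
# method="inclusive" interpolates within the observed range instead of
# extrapolating beyond the maximum.
p95 = statistics.quantiles(latencies, n=20, method="inclusive")[-1]

print(f"median={median:.3f}s  p95={p95:.3f}s")
```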


In [21]:
# Latency visualization
fig_latency = go.Figure()
fig_latency.add_trace(go.Bar(
    x=[f"Q{i+1}" for i in range(len(latencies))],
    y=latencies,
    marker_color='lightblue',
    name='Latency'
))
fig_latency.add_hline(
    y=metrics['avg_latency'],
    line_dash="dash",
    line_color="red",
    annotation_text=f"Mean: {metrics['avg_latency']:.3f}s"
)
fig_latency.update_layout(
    title="Latency per query",
    xaxis_title="Query",
    yaxis_title="Time (seconds)",
    showlegend=False
)
fig_latency.show()

## 🎨 Visualizing the embeddings with UMAP

In [22]:
# Reduce the dimensions with UMAP
print("🔄 Reducing dimensionality with UMAP...")
umap_model = UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)

# Subsample if there is too much data
sample_size = min(500, len(embeddings))
sample_indices = np.random.choice(len(embeddings), sample_size, replace=False)
embeddings_sample = embeddings[sample_indices]
corpus_sample = corpus_df.iloc[sample_indices]

# UMAP reduction
embeddings_2d = umap_model.fit_transform(embeddings_sample)
print("✅ UMAP reduction done")

🔄 Reducing dimensionality with UMAP...
✅ UMAP reduction done


In [23]:
# Build the DataFrame for the visualization
viz_df = pd.DataFrame({
    'x': embeddings_2d[:, 0],
    'y': embeddings_2d[:, 1],
    'source': corpus_sample['source_file'].values,
    'question': corpus_sample['Question'].str[:100].values
})

# Interactive visualization
fig_umap = px.scatter(
    viz_df,
    x='x',
    y='y',
    color='source',
    hover_data=['question'],
    title="UMAP visualization of the medical embeddings",
    labels={'x': 'UMAP 1', 'y': 'UMAP 2'},
    width=1000,
    height=700
)
fig_umap.update_traces(marker=dict(size=8, opacity=0.7))
fig_umap.show()

## 📊 Metrics table

In [24]:
# Build a summary table
metrics_table = pd.DataFrame({
    'Metric': [
        'Number of documents',
        'Embedding dimension',
        'Mean latency',
        'Minimum latency',
        'Maximum latency',
        'Mean score',
        'Embedding model',
        'Re-ranking model'
    ],
    'Value': [
        f"{len(corpus_df)}",
        f"{embeddings.shape[1]}",
        f"{metrics['avg_latency']:.3f}s",
        f"{metrics['min_latency']:.3f}s",
        f"{metrics['max_latency']:.3f}s",
        f"{metrics['avg_score']:.3f}",
        "all-MiniLM-L6-v2",
        "ms-marco-MiniLM-L-6-v2"
    ]
})

print("\n📊 Metrics summary table:")
metrics_table


📊 Metrics summary table:


| | Metric | Value |
|---|---|---|
| 0 | Number of documents | 1500 |
| 1 | Embedding dimension | 384 |
| 2 | Mean latency | 5.821s |
| 3 | Minimum latency | 4.489s |
| 4 | Maximum latency | 6.756s |
| 5 | Mean score | 0.250 |
| 6 | Embedding model | all-MiniLM-L6-v2 |
| 7 | Re-ranking model | ms-marco-MiniLM-L-6-v2 |


# 🎯 Project Summary

## What was done:

### ✅ Step 1: Building the corpus
- Loaded 26,401 medical Q&A pairs from 8 CSV files and sampled 1,500 of them (`random_state=42`)
- Normalized whitespace and dropped rows with an empty question or answer
- Saved the result to `docs_medical.csv`

### ✅ Step 2: Vectorization and Indexing
- Generated normalized 384-dimensional embeddings with `all-MiniLM-L6-v2`
- Built the FAISS index (IndexFlatIP)
- Saved the index to `medical_faiss.index`

### ✅ Step 3: Backend API
- REST API with FastAPI
- Endpoints: `/query`, `/docs/{doc_id}`, `/stats`
- Re-ranking with CrossEncoder
- File: `api_medical.py`

### ✅ Step 4: Web Interface
- Streamlit application with a search box, example queries, and a sidebar slider for the number of results
- Expandable result cards showing the question, answer, source, and both scores
- Interactive Plotly chart comparing FAISS and cross-encoder scores
- File: `app_streamlit.py`

### ✅ Step 5: Evaluation and Visualization
- Latency and mean-score metrics over 10 test queries (mean latency 5.821 s)
- UMAP visualization of the embeddings (500-document sample)
- Latency and score charts

## 🚀 How to use the system:

### 1. Start the API (optional):
```bash
python api_medical.py
```

### 2. Start the Streamlit application:
```bash
streamlit run app_streamlit.py
```

### 3. Use it directly in the notebook:
```python
results = search_with_reranking("What are the symptoms of diabetes?", top_k=10)
```

## 📁 Generated files:
- `docs_medical.csv` - Cleaned corpus
- `embeddings_medical.npy` - Embedding vectors
- `medical_faiss.index` - FAISS index
- `api_medical.py` - FastAPI API
- `app_streamlit.py` - Streamlit application

## 🎓 Technologies used:
- **Embeddings**: Sentence-Transformers (all-MiniLM-L6-v2)
- **Indexing**: FAISS (IndexFlatIP)
- **Re-ranking**: CrossEncoder (ms-marco-MiniLM-L-6-v2)
- **Backend**: FastAPI
- **Frontend**: Streamlit
- **Visualization**: Plotly, UMAP

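## 🌟 Project strengths:
- Semantic search over 1,500 medical Q&A documents using normalized 384-dimensional embeddings
- Cross-encoder re-ranking of the top 50 FAISS candidates to improve relevance
- Streamlit interface with one-click example queries
- REST API with typed request and response models (Pydantic) and FastAPI's auto-generated interactive documentation
- Interactive Plotly visualizations (score comparison, latency, UMAP projection)
- Performance metrics: mean, minimum, and maximum latency plus mean cross-encoder score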