# Generate FAQ and embeddings

This notebook provides a manual workflow for generating FAQs (Frequently Asked Questions) and their corresponding vector embeddings from course content.

**What this notebook does:**
- Loads course content from CSV files
- Generates persona-specific FAQs using AI
- Creates vector embeddings for semantic search
- Organizes outputs in structured directories

**Use this when:** You need to manually generate FAQs and embeddings for specific courses or when you want to control the generation process step by step.


In [1]:
from pathlib import Path
import pandas as pd
from rag_faq.config import load_config
from rag_faq.indexer import generate_faqs
from rag_faq.embedder import embed_faqs

## ⚙️ Configuration and Data Loading

This cell sets up the core parameters for FAQ generation:

**Manual Configuration** (edit these values as needed):
- `csv_path`: Path to the course chunks CSV file
- `course_name`: Display name for the course
- `project_name`: Name of the project folder created under the configured `projects_dir`


**Automatic Setup**:
- Loads the CSV data and converts each row to a text string (the string form of the row's dictionary)
- Loads configuration from `config.yaml`
- Creates project directory structure
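
The directory layout produced by the automatic setup can be sketched in isolation. This is a minimal illustration, assuming `projects_dir` resolves to `projects` (the real value comes from `config.yaml`); `build_project_dir` is a hypothetical helper and not part of `rag_faq`.

```python
from pathlib import Path


def build_project_dir(projects_dir: str, project_name: str, csv_path: Path) -> Path:
    """Return <projects_dir>/<project_name>/<name of the CSV's parent folder>."""
    # The document folder name is taken from the CSV's parent directory,
    # e.g. "data/ppp_best/ppp_best_chunks.csv" -> "ppp_best".
    doc_stem = csv_path.parent.name
    return Path(projects_dir) / project_name / doc_stem


# "projects" is an assumed value for config["paths"]["projects_dir"].
example_dir = build_project_dir(
    "projects", "myproj", Path("data/ppp_best/ppp_best_chunks.csv")
)
print(example_dir.as_posix())
```

Because the folder name comes from the CSV's parent directory rather than the file name, two CSV files in the same folder would share one project directory.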

In [None]:
# =============== EDIT THIS MANUALLY ===============
csv_path = Path("data/ppp_best/ppp_best_chunks.csv")
course_name = "Bacharelado em Estatística"
project_name = "myproj"
# ==================================================

if not csv_path.exists():
    raise FileNotFoundError(f"Course chunks CSV not found: {csv_path}")

df = pd.read_csv(csv_path)
texts = [str(row.to_dict()) for _, row in df.iterrows()]

config = load_config("config.yaml")

doc_stem = csv_path.parent.name
project_dir = Path(config["paths"]["projects_dir"]) / project_name / doc_stem
project_dir.mkdir(parents=True, exist_ok=True)

## 🎯 Individual FAQ Generation

This cell generates FAQs for a single persona (here `aluno`, the student persona).

**Purpose**: 
- Creates a focused FAQ set for one specific user type
- Saves to the `individual` subdirectory

**Output**: `individual/faq.csv` with student-focused questions and answers


In [3]:
# Create individual directory for single-persona FAQs
individual_dir = project_dir / "individual"
individual_dir.mkdir(parents=True, exist_ok=True)

persona = "aluno"
generate_faqs(config, individual_dir, texts, course_name, persona)

Generating FAQs: 100%|██████████| 12/12 [02:34<00:00, 12.85s/it]

✅ FAQ saved to: projects\myproj1\ppp_best\individual\faq.csv





## 👥 Multi-Persona FAQ Generation

This cell generates FAQs for different user personas (students, professors, researchers). 

**What it does**:
- Creates persona-specific FAQs using different prompts
- Generates separate CSV files for each persona type in the `unificado` folder

**Output**: Three separate FAQ files in `unificado/` folder:
- `faq_aluno.csv` - Student-focused FAQs
- `faq_professor.csv` - Professor-focused FAQs  
- `faq_pesquisador.csv` - Researcher-focused FAQs
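
The file names listed above follow a `faq_<persona>.csv` pattern. The sketch below derives them from the persona list; `persona_faq_path` is a hypothetical helper, and the assumption that `generate_faqs(..., multi_persona=True)` writes to exactly these names is inferred from the listed outputs rather than from the library's source.

```python
from pathlib import Path

PERSONAS = ["aluno", "professor", "pesquisador"]


def persona_faq_path(output_dir: Path, persona: str) -> Path:
    """Build the expected per-persona FAQ path (naming inferred from the listed outputs)."""
    return output_dir / f"faq_{persona}.csv"


# Hypothetical output directory, used only for illustration.
expected_files = [persona_faq_path(Path("unificado"), p).name for p in PERSONAS]
print(expected_files)
```

Deriving the names from one persona list keeps the generation and merge steps in sync if a persona is added or removed.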


In [4]:
# Create unificado directory for multi-persona FAQs
unificado_dir = project_dir / "unificado"
unificado_dir.mkdir(parents=True, exist_ok=True)

persona_type = ["aluno", "professor", "pesquisador"]

for persona in persona_type:
    generate_faqs(config, unificado_dir, texts, course_name, persona, multi_persona=True)

Generating FAQs: 100%|██████████| 12/12 [03:07<00:00, 15.64s/it]


✅ FAQ saved to: projects\myproj1\ppp_best\unificado\faq_aluno.csv


Generating FAQs: 100%|██████████| 12/12 [01:47<00:00,  9.00s/it]


✅ FAQ saved to: projects\myproj1\ppp_best\unificado\faq_professor.csv


Generating FAQs: 100%|██████████| 12/12 [02:12<00:00, 11.02s/it]

✅ FAQ saved to: projects\myproj1\ppp_best\unificado\faq_pesquisador.csv





## 🔄 Multi-Persona FAQ Merging

This cell combines all persona-specific FAQ files into a single unified CSV file.

**What it does**:
- Loads FAQs from all three persona files (aluno, professor, pesquisador)
- Merges them into one comprehensive FAQ dataset

**Output**: `unificado/faq.csv` with all personas combined
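
The merge relies on `pandas.concat` with `ignore_index=True`. The sketch below shows that behavior on two toy frames instead of the real files; the `persona` and `course` columns match those used in the merge cell, while `question` is an assumed column name.

```python
import pandas as pd

# Toy stand-ins for two persona files; "question" is an assumed column name.
aluno_df = pd.DataFrame(
    {
        "persona": ["aluno", "aluno"],
        "course": ["Bacharelado em Estatística"] * 2,
        "question": ["Q1", "Q2"],
    }
)
professor_df = pd.DataFrame(
    {
        "persona": ["professor"],
        "course": ["Bacharelado em Estatística"],
        "question": ["Q3"],
    }
)

# ignore_index=True renumbers rows 0..n-1 instead of repeating each file's index.
merged = pd.concat([aluno_df, professor_df], ignore_index=True)

# Sanity check: no rows are lost or duplicated by the merge.
assert len(merged) == len(aluno_df) + len(professor_df)

print(merged["persona"].value_counts().to_dict())
```

Note that `concat` does not remove duplicates, so a question generated for two personas appears twice in the merged file.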


In [8]:
# Merge all persona-specific FAQs into a single file

# List of persona files to merge
persona_files = ["faq_aluno.csv", "faq_professor.csv", "faq_pesquisador.csv"]
all_faqs = []

for file_name in persona_files:
    file_path = unificado_dir / file_name
    if file_path.exists():
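        # Use a separate name so the course-chunks `df` loaded earlier is not overwritten
        persona_df = pd.read_csv(file_path)
        all_faqs.append(persona_df)
        print(f"✅ Loaded {len(persona_df)} FAQs from {file_name}")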
    else:
        print(f"⚠️  File not found: {file_name}")

if all_faqs:
    # Combine all DataFrames
    merged_df = pd.concat(all_faqs, ignore_index=True)
    
    # Save merged file
    merged_path = unificado_dir / "faq.csv"
    merged_df.to_csv(merged_path, index=False, encoding="utf-8")
    
    print("\n📊 Merge Summary:")
    print(f"📈 Total FAQs: {len(merged_df)}")

    # Show breakdown by persona
    persona_counts = merged_df["persona"].value_counts()
    print("👥 FAQs by persona:")
    for persona, count in persona_counts.items():
        print(f"   - {persona}: {count}")

    # Show breakdown by course
    course_counts = merged_df["course"].value_counts()
    print("🎓 FAQs by course:")
    for course, count in course_counts.items():
        print(f"   - {course}: {count}")
        
    print(f"\n✅ Combined CSV saved to: {merged_path}")
else:
    print("❌ No FAQ files found to merge!")


✅ Loaded 120 FAQs from faq_aluno.csv
✅ Loaded 120 FAQs from faq_professor.csv
✅ Loaded 120 FAQs from faq_pesquisador.csv

📊 Merge Summary:
📈 Total FAQs: 360
👥 FAQs by persona:
   - aluno: 120
   - professor: 120
   - pesquisador: 120
🎓 FAQs by course:
   - Bacharelado em Estatística: 360

✅ Combined CSV saved to: projects\myproj1\ppp_best\unificado\faq.csv


## 🧠 Embedding Generation

This cell creates vector embeddings for all FAQs to enable semantic search.

**What it does**:
- Converts FAQ text into numerical vector representations
- Uses AI embedding models to capture semantic meaning
- Enables similarity-based retrieval for RAG (Retrieval-Augmented Generation)
- Saves embeddings for use in the search system

**Output**: `embeddings.npy` file containing vector representations of all FAQs
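
Similarity-based retrieval usually ranks stored vectors by their cosine similarity to a query vector. The sketch below illustrates this with tiny hand-written vectors in plain Python, so it runs without the generated files. Whether `rag_faq` uses cosine similarity, and how `embeddings.npy` is laid out, are not stated in this notebook, so both are assumptions; `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    scores = [cosine_similarity(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real models produce far more dimensions.
faq_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vector = [1.0, 0.05, 0.0]

best = top_k(query_vector, faq_vectors, k=2)
print(best)
```

In a real search step, the returned indices would be used to look up the matching rows of `faq.csv`, assuming the embeddings are stored in the same row order as the FAQs.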


In [6]:
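# Generate embeddings only for directories whose FAQ file exists.
# A Path object is always truthy, so test for faq.csv on disk instead
# (assumes embed_faqs reads faq.csv from the given directory).
for faq_dir in (project_dir / "unificado", project_dir / "individual"):
    if (faq_dir / "faq.csv").exists():
        embed_faqs(config, faq_dir)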

Generating embeddings: 100%|██████████| 360/360 [00:13<00:00, 27.29it/s]


✅ Embeddings saved to: projects\myproj1\ppp_best\unificado


Generating embeddings: 100%|██████████| 120/120 [00:03<00:00, 31.28it/s]

✅ Embeddings saved to: projects\myproj1\ppp_best\individual



