# Notebook 1: Data Collection

This notebook demonstrates how to collect QA datasets from various sources for the AI Response Evaluation System.

## What You'll Learn:
1. Collect data from the Alpaca dataset
2. Generate responses using the OpenAI API
3. Load custom datasets
4. Validate and preview data
5. Save datasets for annotation

## Setup: Import Required Libraries

In [None]:
import sys
from pathlib import Path
import pandas as pd
import os
from dotenv import load_dotenv

# Add the project root to the path so that `src` can be imported as a package
sys.path.insert(0, str(Path.cwd().parent))

# Load environment variables
load_dotenv(Path.cwd().parent / '.env')

from src.data_collection import DatasetCollector
from src.config import RAW_DATA_DIR, TARGET_DATASET_SIZE

print("✓ Libraries imported successfully")
print(f"✓ Data directory: {RAW_DATA_DIR}")
print(f"✓ Target dataset size: {TARGET_DATASET_SIZE}")

## Method 1: Collect from Alpaca Dataset

The Alpaca dataset contains instruction-following examples. It needs no API key, which makes it the quickest way to get started.
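
For orientation, here is a minimal sketch of how Alpaca-style records could be mapped onto the `question` and `model_response` columns used later in this notebook. Alpaca records carry `instruction`, `input`, and `output` fields; the helper `alpaca_to_qa` and the sample records are hypothetical, and the way `collect_from_alpaca` actually combines the fields may differ.

```python
# Hypothetical records in the Alpaca field layout (instruction / input / output).
records = [
    {"instruction": "Give three tips for staying healthy.",
     "input": "",
     "output": "Eat well, exercise, and sleep enough."},
    {"instruction": "Translate to French.",
     "input": "Good morning",
     "output": "Bonjour"},
]


def alpaca_to_qa(record: dict) -> dict:
    """Fold the optional `input` context into the question text."""
    question = record["instruction"].strip()
    context = record.get("input", "").strip()
    if context:
        # Assumption: the context is appended after a blank line.
        question = f"{question}\n\n{context}"
    return {"question": question, "model_response": record["output"].strip()}


rows = [alpaca_to_qa(r) for r in records]
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would give a two-column frame with the same layout as `df_alpaca`.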

In [None]:
# Initialize collector
collector = DatasetCollector()

# Collect 100 samples from Alpaca (adjust number as needed)
print("Collecting data from Alpaca dataset...")
df_alpaca = collector.collect_from_alpaca(num_samples=100)

print(f"\n✓ Collected {len(df_alpaca)} samples")
print(f"\nDataset shape: {df_alpaca.shape}")
print(f"Columns: {list(df_alpaca.columns)}")

In [None]:
# Preview the data
print("First 5 samples:")
df_alpaca.head()

In [None]:
# Look at a specific example
idx = 0
print(f"Question {idx+1}:")
print(df_alpaca.iloc[idx]['question'])
print("\nResponse:")
print(df_alpaca.iloc[idx]['model_response'])

## Method 2: Generate Responses Using OpenAI API

Generate responses for custom questions using OpenAI's API.
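
The internals of `collect_from_openai` are not shown in this notebook. As a rough mental model only, the sketch below shows one way such a collection loop could look. It assumes a hypothetical `generate` callable that wraps whatever API client is used; here it is replaced by a stub (`fake_generate`) so the example runs without a key or network access.

```python
from typing import Callable, Optional


def collect_responses(questions: list[str],
                      generate: Callable[[str], str]) -> list[dict]:
    """Call `generate` once per question; continue if a single call fails."""
    rows = []
    for question in questions:
        response: Optional[str]
        try:
            response = generate(question)
        except Exception as exc:  # a real client would expose narrower error types
            print(f"No response for {question!r}: {exc}")
            response = None
        rows.append({"question": question, "model_response": response})
    return rows


# Stub standing in for a real API call (hypothetical, for illustration only).
def fake_generate(question: str) -> str:
    if not question.strip():
        raise ValueError("empty question")
    return f"(stub answer to: {question})"


rows = collect_responses(["What is overfitting?", "   "], fake_generate)
for row in rows:
    print(row)
```

Recording failed calls as `None` rather than dropping them keeps the output aligned with the input questions, so gaps are visible in the later missing-value check.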

In [None]:
# Check if API key is set
api_key = os.getenv('OPENAI_API_KEY')
if api_key:
    print("✓ OpenAI API key found")
    # Report only the length so no key material ends up in saved notebook output
    print(f"Key length: {len(api_key)} characters")
else:
    print("⚠ OpenAI API key not found. Set OPENAI_API_KEY in .env file")

In [None]:
# Create sample questions
sample_questions = collector.create_sample_questions(20)

print(f"Created {len(sample_questions)} sample questions")
print("\nFirst 5 questions:")
for i, q in enumerate(sample_questions[:5], 1):
    print(f"{i}. {q}")

In [None]:
# Generate responses using OpenAI (only if API key is available)
if api_key:
    print("Generating responses with OpenAI...")
    # Use only the first 5 questions to limit API calls
    df_openai = collector.collect_from_openai(sample_questions[:5])
    
    print(f"\n✓ Generated {len(df_openai)} responses")
    display(df_openai.head())
else:
    print("⚠ Skipping OpenAI generation - API key not set")
    df_openai = pd.DataFrame()

## Method 3: Load Custom Dataset

Load your own CSV file with question-response pairs.
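
Before handing a file to the collector, it can help to confirm that the header has the columns the rest of this notebook relies on. The sketch below assumes the required columns are `question` and `model_response` (the names used in later cells); whether `load_custom_dataset` enforces or renames columns itself is not shown here. The helper `missing_columns` is hypothetical, and in-memory text stands in for real files.

```python
import csv
import io

# Column names used elsewhere in this notebook (assumed to be what a custom CSV needs).
REQUIRED_COLUMNS = {"question", "model_response"}


def missing_columns(csv_file) -> set:
    """Return required columns absent from the CSV header (empty set means OK)."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header


# In-memory stand-ins for files on disk.
good = io.StringIO("question,model_response\nWhat is 2+2?,4\n")
bad = io.StringIO("prompt,answer\nWhat is 2+2?,4\n")

good_missing = missing_columns(good)
bad_missing = missing_columns(bad)
print("good file is missing:", sorted(good_missing))
print("bad file is missing:", sorted(bad_missing))
```

For a file on disk, the same helper could be called inside `with open(path, newline="") as f:`.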

In [None]:
# Load the sample dataset that comes with the project
sample_file = Path.cwd().parent / 'data' / 'raw' / 'sample_qa_dataset.csv'

if sample_file.exists():
    df_custom = collector.load_custom_dataset(sample_file)
    print(f"✓ Loaded {len(df_custom)} samples from {sample_file.name}")
    display(df_custom)
else:
    print("⚠ Sample file not found")
    df_custom = pd.DataFrame()  # keep the name defined for later cells

## Data Validation and Statistics

In [None]:
# Use a copy of the Alpaca data so the helper columns added below do not modify df_alpaca
df = df_alpaca.copy()

print("Dataset Statistics:")
print("=" * 50)
print(f"Total samples: {len(df)}")
print("\nMissing values:")
print(df.isnull().sum())

print("\nQuestion length statistics:")
df['question_length'] = df['question'].str.len()
print(df['question_length'].describe())

print("\nResponse length statistics:")
df['response_length'] = df['model_response'].str.len()
print(df['response_length'].describe())
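
`isnull()` only catches missing values; whitespace-only text and repeated questions pass through it unnoticed. A minimal sketch of an additional check follows, assuming the `question` and `model_response` columns used above. The helper `find_problem_rows` and the demo frame are hypothetical and not part of `DatasetCollector`.

```python
import pandas as pd


def find_problem_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows with blank text or a question that already appeared earlier."""
    question = df["question"].fillna("").astype(str).str.strip()
    response = df["model_response"].fillna("").astype(str).str.strip()
    flags = pd.DataFrame(
        {
            "blank_question": question.eq(""),
            "blank_response": response.eq(""),
            "duplicate_question": question.duplicated(keep="first"),
        },
        index=df.index,
    )
    # Keep only rows where at least one flag is set.
    return flags[flags.any(axis=1)]


demo = pd.DataFrame(
    {
        "question": ["What is AI?", "What is AI?", "   ", "Define recall."],
        "model_response": ["A field of CS.", "A field of CS.", "n/a", None],
    }
)
problems = find_problem_rows(demo)
print(problems)
```

Calling `find_problem_rows(df)` on the collected data returns an empty frame when nothing needs attention.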

In [None]:
# Visualize length distributions
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Question lengths
axes[0].hist(df['question_length'], bins=30, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Question Length (characters)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Question Lengths')
axes[0].axvline(df['question_length'].mean(), color='red', linestyle='--', label='Mean')
axes[0].legend()

# Response lengths
axes[1].hist(df['response_length'], bins=30, edgecolor='black', alpha=0.7, color='orange')
axes[1].set_xlabel('Response Length (characters)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Response Lengths')
axes[1].axvline(df['response_length'].mean(), color='red', linestyle='--', label='Mean')
axes[1].legend()

plt.tight_layout()
plt.show()

## Save Dataset for Annotation

In [None]:
# Save the collected dataset
output_filename = "my_qa_dataset.csv"

# Remove temporary columns
df_to_save = df[['question', 'model_response']].copy()

collector.save_dataset(df_to_save, output_filename)

print("\n✓ Dataset saved successfully!")
print(f"Location: {RAW_DATA_DIR / output_filename}")
print("\nNext step: Annotate this dataset using notebook 02_annotation.ipynb")
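
Responses often contain commas, quotes, and line breaks, so it is worth confirming that a saved file reads back unchanged before annotating it. The sketch below is a standalone round-trip check that uses pandas directly with a temporary file. It does not call `save_dataset`, whose on-disk format is assumed (from the `.csv` filename) to be plain CSV.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small demo frame with characters that commonly break naive CSV handling.
df_out = pd.DataFrame(
    {
        "question": ["What is 2+2?", "Name a prime, please."],
        "model_response": ["4", 'Two: "2" is prime,\nand also even.'],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "roundtrip_check.csv"
    df_out.to_csv(path, index=False)
    # Read everything back as text so "4" is not converted to a number.
    df_back = pd.read_csv(path, dtype=str, keep_default_na=False)

roundtrip_ok = (
    list(df_back.columns) == list(df_out.columns)
    and df_back.values.tolist() == df_out.values.tolist()
)
print("Round trip preserved the data:", roundtrip_ok)
```

The same comparison could be run against `RAW_DATA_DIR / output_filename` and `df_to_save` after saving.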

## Summary

In this notebook, you learned how to:
- ✓ Collect data from the Alpaca dataset (fast and free)
- ✓ Generate responses using the OpenAI API (requires an API key)
- ✓ Load custom CSV datasets
- ✓ Validate and analyze the data
- ✓ Save datasets for annotation

### Next Steps:
1. Open `02_annotation.ipynb` to annotate your dataset
2. Or run: `python main.py --annotate --dataset "data/raw/my_qa_dataset.csv"`