# 01 - Data Inspection

This notebook explores the SoQG and KP20k datasets to understand their structure before preprocessing.

## Objectives
- Load and inspect the SoQG dataset (110K question-context pairs)
- Understand the 5 Socratic question types
- Load and inspect the KP20k dataset for keyphrase extraction
- Generate summary statistics for the report

## Dataset Sources
- **SoQG**: https://github.com/NUS-IDS/eacl23_soqg
- **KP20k**: https://huggingface.co/datasets/midas/kp20k

## 1. Setup and Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from collections import Counter

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

DATA_DIR = Path("../datasets/raw/soqg/data/soqg_dataset")
print(f"Looking for data in: {DATA_DIR.absolute()}")

## 2. Load SoQG Dataset

The SoQG dataset is split into 3 training chunks, plus validation and test sets. Each row contains:
- `input`: The context with a question type prefix (e.g., "clarification: The earth is...")
- `target`: The generated Socratic question

In [None]:
train_I = pd.read_csv(DATA_DIR / "train_chunk_I.csv", index_col=0)
train_II = pd.read_csv(DATA_DIR / "train_chunk_II.csv", index_col=0)
train_III = pd.read_csv(DATA_DIR / "train_chunk_III.csv", index_col=0)

train_df = pd.concat([train_I, train_II, train_III], axis=0, ignore_index=True)
val_df = pd.read_csv(DATA_DIR / "val.csv", index_col=0)
test_df = pd.read_csv(DATA_DIR / "test.csv", index_col=0)

print(f"Training samples: {len(train_df):,}")
print(f"Validation samples: {len(val_df):,}")
print(f"Test samples: {len(test_df):,}")
print(f"Total samples: {len(train_df) + len(val_df) + len(test_df):,}")

## 3. Inspect Data Structure

In [None]:
print("Columns:", train_df.columns.tolist())
print("\nData types:")
print(train_df.dtypes)
print("\nFirst 3 samples:")
train_df.head(3)

In [None]:
print("Example input:")
print(train_df.iloc[0]['input'][:500])
print("\n" + "="*50)
print("\nExample target (question):")
print(train_df.iloc[0]['target'])

## 4. Extract and Analyze Question Types

The 5 Socratic question types from Paul & Elder's taxonomy:
1. **clarification** - Probes ambiguity in statements
2. **assumptions** - Examines underlying assumptions
3. **reasons_evidence** - Asks for justification
4. **implication_consequences** - Explores impacts
5. **alternate_viewpoints_perspectives** - Considers other viewpoints

In [None]:
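def extract_question_type(input_text):
    """Extract the question type prefix (the text before the first colon)."""
    # Non-string values (e.g. NaN from an empty CSV cell) have no prefix.
    if isinstance(input_text, str) and ':' in input_text:
        return input_text.split(':', 1)[0].strip().lower()
    return 'unknown'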

train_df['question_type'] = train_df['input'].apply(extract_question_type)
val_df['question_type'] = val_df['input'].apply(extract_question_type)
test_df['question_type'] = test_df['input'].apply(extract_question_type)

print("Question types found:")
print(train_df['question_type'].value_counts())
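
Because the prefix is taken from whatever precedes the first colon, a row without a proper prefix can yield a stray label. A minimal sketch of a check against the five expected types follows. It assumes the CSV prefixes match the labels listed above exactly; `EXPECTED_TYPES`, `find_unexpected_types`, and the toy `demo_labels` series are illustrative names, not part of the SoQG release.

```python
import pandas as pd

# The five labels listed above; assumed to match the prefixes in the CSV files.
EXPECTED_TYPES = {
    "clarification",
    "assumptions",
    "reasons_evidence",
    "implication_consequences",
    "alternate_viewpoints_perspectives",
}

def find_unexpected_types(labels):
    """Return counts of labels that are not one of the five expected types."""
    unexpected = labels[~labels.isin(EXPECTED_TYPES)]
    return unexpected.value_counts()

# Toy data standing in for train_df['question_type'].
demo_labels = pd.Series(["clarification", "assumptions", "note", "unknown", "note"])
unexpected_counts = find_unexpected_types(demo_labels)
print(unexpected_counts)
```

On the real data the same call would be `find_unexpected_types(train_df['question_type'])`; an empty result means every row carries a recognized prefix.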

In [None]:
type_counts = train_df['question_type'].value_counts()
colors = ['#3B82F6', '#8B5CF6', '#10B981', '#F59E0B', '#EC4899']

plt.figure(figsize=(10, 6))
bars = plt.bar(type_counts.index, type_counts.values, color=colors[:len(type_counts)])
plt.xlabel('Question Type')
plt.ylabel('Count')
plt.title('Distribution of Socratic Question Types in Training Set')
plt.xticks(rotation=45, ha='right')

for bar, count in zip(bars, type_counts.values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 200, 
             f'{count:,}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
Path('../docs').mkdir(parents=True, exist_ok=True)  # savefig does not create missing directories
plt.savefig('../docs/question_type_distribution.png', dpi=150)
plt.show()
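
The bar chart shows raw counts. To put a number on how balanced the types are, one option is to compute each type's share and the ratio between the largest and smallest class. This is only a sketch: `class_balance` is a hypothetical helper, and the toy labels below stand in for `train_df['question_type']`.

```python
import pandas as pd

def class_balance(labels):
    """Return per-class shares and the ratio of the largest to the smallest class."""
    shares = labels.value_counts(normalize=True)
    imbalance_ratio = shares.max() / shares.min()
    return shares, imbalance_ratio

# Toy labels: class "a" is twice as frequent as "b" and "c".
demo_labels = pd.Series(["a"] * 6 + ["b"] * 3 + ["c"] * 3)
shares, ratio = class_balance(demo_labels)
print(shares.round(3))
print(f"Largest/smallest class ratio: {ratio:.2f}")
```

A ratio close to 1 indicates a roughly uniform distribution; what counts as "balanced enough" is a judgment call for the report.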

## 5. Analyze Text Lengths

In [None]:
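def extract_context(input_text):
    """Extract the context after the question type prefix."""
    # Treat non-string values (e.g. NaN) as empty so word counting does not fail.
    if not isinstance(input_text, str):
        return ''
    if ':' in input_text:
        return input_text.split(':', 1)[1].strip()
    return input_text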

train_df['context'] = train_df['input'].apply(extract_context)
train_df['context_words'] = train_df['context'].apply(lambda x: len(x.split()))
train_df['question_words'] = train_df['target'].apply(lambda x: len(str(x).split()))

print("Context length statistics (words):")
print(train_df['context_words'].describe())
print("\nQuestion length statistics (words):")
print(train_df['question_words'].describe())
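
`describe()` stops at the 75th percentile, but truncation limits are usually driven by the tail. A small sketch of upper-percentile word counts follows, under the assumption that whitespace word counts are an adequate proxy here (subword tokenizers generally produce more tokens than words, so these numbers understate token lengths). `length_percentiles` and the toy series are illustrative only.

```python
import pandas as pd

def length_percentiles(texts, quantiles=(0.5, 0.9, 0.95, 0.99)):
    """Whitespace word counts at the requested quantiles."""
    word_counts = texts.astype(str).str.split().str.len()
    return word_counts.quantile(list(quantiles))

# Toy texts with 3, 1, 2 and 5 words.
demo_texts = pd.Series(["one two three", "one", "one two", "a b c d e"])
percentiles = length_percentiles(demo_texts)
print(percentiles)
```

On the notebook's data this would be called as `length_percentiles(train_df['context'])` and `length_percentiles(train_df['target'])`.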

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(train_df['context_words'], bins=50, color='#3B82F6', edgecolor='white', alpha=0.7)
axes[0].axvline(train_df['context_words'].median(), color='red', linestyle='--', label=f"Median: {train_df['context_words'].median():.0f}")
axes[0].set_xlabel('Word Count')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Context Length Distribution')
axes[0].legend()

axes[1].hist(train_df['question_words'], bins=30, color='#10B981', edgecolor='white', alpha=0.7)
axes[1].axvline(train_df['question_words'].median(), color='red', linestyle='--', label=f"Median: {train_df['question_words'].median():.0f}")
axes[1].set_xlabel('Word Count')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Question Length Distribution')
axes[1].legend()

plt.tight_layout()
plt.savefig('../docs/length_distributions.png', dpi=150)
plt.show()

## 6. Sample Questions by Type

Let's examine example questions for each type to understand the expected output format.

In [None]:
for q_type in train_df['question_type'].unique():
    print(f"\n{'='*60}")
    print(f"TYPE: {q_type.upper()}")
    print('='*60)
    subset = train_df[train_df['question_type'] == q_type]
    # Guard against types with fewer than 2 rows (e.g. a stray 'unknown').
    samples = subset.sample(min(2, len(subset)), random_state=42)
    for _, row in samples.iterrows():
        print(f"\nContext (truncated): {row['context'][:200]}...")
        print(f"Question: {row['target']}")

## 7. Check for Data Quality Issues

In [None]:
print("Missing values in training set:")
print(train_df[['input', 'target']].isnull().sum())

print("\nEmpty or whitespace-only strings:")
print(f"Empty inputs: {(train_df['input'].astype(str).str.strip() == '').sum()}")
print(f"Empty targets: {(train_df['target'].astype(str).str.strip() == '').sum()}")

print("\nDuplicate rows:")
print(f"Duplicate inputs: {train_df['input'].duplicated().sum()}")
print(f"Duplicate targets: {train_df['target'].duplicated().sum()}")
print(f"Exact duplicate rows: {train_df.duplicated().sum()}")
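
The checks above look only inside the training set. A related check is whether any training input also appears in the validation or test split, since such overlap would inflate evaluation scores. The sketch below is one way to count it, assuming exact string matches after stripping whitespace are the right notion of overlap. `quality_report` and the toy frames are hypothetical.

```python
import pandas as pd

def quality_report(df, other, col="input"):
    """Count missing or blank values in df[col] and values that also occur in other[col]."""
    values = df[col].fillna("").astype(str).str.strip()
    other_values = set(other[col].fillna("").astype(str).str.strip())
    non_blank = values[values != ""]
    return {
        "blank": int((values == "").sum()),
        "overlap_with_other": int(non_blank.isin(other_values).sum()),
    }

# Toy frames standing in for train_df and test_df.
demo_train = pd.DataFrame({"input": ["clarification: a", "   ", None, "assumptions: b"]})
demo_test = pd.DataFrame({"input": ["assumptions: b", "implication_consequences: c"]})
report = quality_report(demo_train, demo_test)
print(report)
```

With the real splits the call would be `quality_report(train_df, test_df)` and `quality_report(train_df, val_df)`.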

In [None]:
short_contexts = train_df[train_df['context_words'] < 10]
short_questions = train_df[train_df['question_words'] < 3]

print(f"Contexts with < 10 words: {len(short_contexts)} ({100*len(short_contexts)/len(train_df):.2f}%)")
print(f"Questions with < 3 words: {len(short_questions)} ({100*len(short_questions)/len(train_df):.2f}%)")

if len(short_contexts) > 0:
    print("\nExample short context:")
    print(short_contexts.iloc[0]['context'])

## 8. Load KP20k Dataset

KP20k contains scientific abstracts with extractive and abstractive keyphrases. We'll use this to train a keyphrase extraction model.

In [None]:
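from datasets import load_dataset

# Assumption: the "raw" config is the one that exposes both keyphrase fields
# used below; check the dataset card if the field names differ.
# trust_remote_code=True runs the dataset's loading script from the Hub.
kp20k = load_dataset("midas/kp20k", "raw", trust_remote_code=True)
print("KP20k splits:")
print(kp20k)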

In [None]:
print("\nKP20k features:")
print(kp20k['train'].features)

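print("\nSample entry:")
sample = kp20k['train'][0]
# The document may be stored as a token list rather than a single string.
doc = sample['document']
doc_text = doc if isinstance(doc, str) else ' '.join(doc)
print(f"Document (truncated): {doc_text[:300]}...")
print(f"\nExtractive keyphrases: {sample['extractive_keyphrases']}")
print(f"Abstractive keyphrases: {sample['abstractive_keyphrases']}")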

In [None]:
kp_sample = kp20k['train'].select(range(1000))
num_extractive = [len(x['extractive_keyphrases']) for x in kp_sample]
num_abstractive = [len(x['abstractive_keyphrases']) for x in kp_sample]

print(f"Avg extractive keyphrases per doc: {np.mean(num_extractive):.2f}")
print(f"Avg abstractive keyphrases per doc: {np.mean(num_abstractive):.2f}")
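
The extractive/abstractive split can be sanity-checked by testing whether each keyphrase occurs verbatim in its document. The following is a rough sketch, assuming the document is available as a list of tokens and that lowercased exact matching is good enough for inspection (evaluation setups often stem words as well, which this omits). `split_present_absent` is an illustrative helper, not part of the dataset tooling.

```python
def split_present_absent(document_tokens, keyphrases):
    """Split keyphrases into those found verbatim in the document and the rest."""
    # Pad with spaces so phrases only match on whole-token boundaries.
    text = " " + " ".join(token.lower() for token in document_tokens) + " "
    present, absent = [], []
    for phrase in keyphrases:
        needle = " " + " ".join(phrase.lower().split()) + " "
        if needle in text:
            present.append(phrase)
        else:
            absent.append(phrase)
    return present, absent

# Toy document and keyphrases.
tokens = "we propose a neural network for keyphrase extraction".split()
present, absent = split_present_absent(
    tokens, ["neural network", "deep learning", "Keyphrase Extraction"]
)
print("Present:", present)
print("Absent:", absent)
```

If the dataset's labels are consistent with this definition, most `extractive_keyphrases` should land in `present` and most `abstractive_keyphrases` in `absent`.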

## 9. Summary Statistics for Report

In [None]:
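def doc_word_count(doc):
    """Word count for a document stored as a string or as a token list."""
    return len(doc.split()) if isinstance(doc, str) else len(doc)

summary = {
    'Dataset': ['SoQG Train', 'SoQG Val', 'SoQG Test', 'KP20k Train'],
    'Samples': [len(train_df), len(val_df), len(test_df), len(kp20k['train'])],
    'Avg Context Words': [
        train_df['context_words'].mean(),
        # Strip the question-type prefix so val/test are measured like train.
        val_df['input'].apply(lambda x: len(extract_context(x).split())).mean(),
        test_df['input'].apply(lambda x: len(extract_context(x).split())).mean(),
        # The KP20k average is estimated from the first 1,000 documents only.
        np.mean([doc_word_count(x['document']) for x in kp20k['train'].select(range(1000))])
    ]
}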

summary_df = pd.DataFrame(summary)
summary_df['Samples'] = summary_df['Samples'].apply(lambda x: f"{x:,}")
summary_df['Avg Context Words'] = summary_df['Avg Context Words'].apply(lambda x: f"{x:.1f}")
print("\nDataset Summary (for report):")
# DataFrame.to_markdown requires the optional 'tabulate' package.
print(summary_df.to_markdown(index=False))

In [None]:
type_summary = train_df.groupby('question_type').agg({
    'context_words': 'mean',
    'question_words': 'mean',
    'input': 'count'
}).rename(columns={'input': 'count'})

print("\nQuestion Type Summary:")
print(type_summary.round(2).to_markdown())

## 10. Key Findings

### SoQG Dataset
- **Total samples**: ~110,000 question-context pairs
- **5 question types**: Balanced distribution across Socratic categories
- **Context length**: Median ~50-100 words, which should fit the 512-token source length planned for FLAN-T5 even though subword tokenization yields more tokens than words
- **Question length**: Median ~10-15 words
- **Quality issues**: Minimal duplicates, few empty entries

### KP20k Dataset
- **530,000+ scientific abstracts** with keyphrases
- **Extractive keyphrases**: Phrases that appear verbatim in the document text
- **Abstractive keyphrases**: Phrases assigned to the document that do not appear verbatim in it

### Next Steps
1. Preprocess SoQG: Clean, deduplicate, format for FLAN-T5
2. Tokenize with T5Tokenizer (max_source=512, max_target=128)
3. Save processed datasets to disk
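
As a starting point for step 1, the cleaning and deduplication could look roughly like the sketch below. It assumes the only required columns are `input` and `target`, and that exact duplicates after whitespace stripping are the ones to drop. `clean_pairs` is a hypothetical helper, and the toy frame stands in for `train_df`.

```python
import pandas as pd

def clean_pairs(df):
    """Strip whitespace, drop rows with a missing or blank field, drop exact duplicates."""
    cleaned = df[["input", "target"]].dropna().copy()
    for col in ("input", "target"):
        cleaned[col] = cleaned[col].astype(str).str.strip()
    cleaned = cleaned[(cleaned["input"] != "") & (cleaned["target"] != "")]
    return cleaned.drop_duplicates().reset_index(drop=True)

# Toy frame with a near-duplicate, a blank target, and a missing input.
demo = pd.DataFrame({
    "input": ["clarification: A ", "clarification: A", "assumptions: B", None, "reasons_evidence: C"],
    "target": ["Why?", "Why?", "  ", "What?", "How so?"],
})
cleaned = clean_pairs(demo)
print(cleaned)
```

Whether near-duplicates (for example, the same context paired with differently worded questions) should also be removed is a separate decision that this sketch leaves open.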