# PDF Classifier - Exploration Notebook

This notebook explores a PDF classification project that uses anomaly detection to tell useful PDFs apart from "not useful" ones.

## Project Overview
- **Goal**: Identify "not useful" PDFs using anomaly detection
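- **Approach**: Isolation Forest trained on 150 "not useful" PDFs, so PDFs that fit the learned profile are labeled "not useful" and outliers are treated as "useful"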
- **Test Set**: 150 PDFs for evaluation
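
The approach above can be sketched with scikit-learn's `IsolationForest`. This is a minimal illustration, not the project's actual pipeline: the toy texts, the TF-IDF features, and the hyperparameters are all assumptions made for the example.

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for text extracted from "not useful" PDFs (hypothetical data).
train_texts = [
    "invoice number total amount due payment terms",
    "invoice total amount due on receipt",
    "payment reminder invoice amount overdue",
    "invoice number payment terms net thirty",
]
# Unseen documents to score (also hypothetical).
test_texts = [
    "invoice amount due payment",
    "research methods results discussion of anomaly detection",
]

# Turn raw text into TF-IDF vectors; the real project may use other features.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()

# Fit only on the "not useful" class, as described in the overview.
model = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
model.fit(X_train)

predictions = model.predict(X_test)        # +1 = inlier, -1 = anomaly
scores = model.decision_function(X_test)   # lower = more anomalous
print(predictions, scores)
```

In this setup an inlier resembles the "not useful" training set, while an anomaly is a candidate "useful" document. With so few training samples the scores are not meaningful; the sketch only shows the shape of the workflow.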

In [None]:
# Import libraries
import sys
sys.path.insert(0, '../src')  # make the project modules in ../src importable

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Load and Explore Data

In [None]:
# Check PDF files
from loader import PDFLoader

loader = PDFLoader()
pdf_files = loader.get_pdf_files()
print(f"Total PDFs found: {len(pdf_files)}")

# Get PDF info
pdf_info = loader.get_pdf_info()
df_info = pd.DataFrame(pdf_info)
df_info.head()
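
The source of `PDFLoader` is not shown in this notebook. As a rough idea of what such a loader might do, here is a standalone sketch using only `pathlib`; the function names `list_pdf_files` and `describe_pdf_files` and the record fields are hypothetical, not the project's API.

```python
from pathlib import Path


def list_pdf_files(directory):
    """Return all .pdf files directly inside `directory`, sorted by name."""
    return sorted(p for p in Path(directory).glob("*.pdf") if p.is_file())


def describe_pdf_files(pdf_paths):
    """Build one metadata record per file, ready for pd.DataFrame(...)."""
    return [
        {"filename": p.name, "size_kb": round(p.stat().st_size / 1024, 2)}
        for p in pdf_paths
    ]
```

Passing the result of `describe_pdf_files(list_pdf_files(some_dir))` to `pd.DataFrame` would give a table similar in spirit to `df_info` above.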

## 2. Extract and Analyze Text

In [None]:
# Extract text from a sample PDF
from extractor import PDFExtractor

if len(pdf_files) > 0:
    extractor = PDFExtractor()
    sample_text = extractor.extract_text_from_pdf(pdf_files[0])
    
    print(f"Sample PDF: {pdf_files[0].name}")
    print(f"Text length: {len(sample_text)} characters")
    print("\nFirst 500 characters:")
    print(sample_text[:500])
else:
    print("No PDFs found. Please add PDF files to data/raw_pdfs/")
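
To go one step beyond printing the raw text, a few descriptive statistics can be computed per document. The helper below is a hypothetical sketch (the name `text_stats` and the chosen statistics are assumptions) and is not necessarily how the project builds its features.

```python
import re


def text_stats(text):
    """Compute a few descriptive statistics for one document's text."""
    words = re.findall(r"[A-Za-z]+", text)  # runs of ASCII letters only
    n_chars = len(text)
    return {
        "n_chars": n_chars,
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars if n_chars else 0.0,
    }
```

Calling `text_stats(sample_text)` on the extracted sample would return a small dictionary that can be collected into a DataFrame across all PDFs.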

## 3. Visualize Results

After running `main.py`, which is expected to write `results/predictions.csv`, you can visualize the predictions here.

In [None]:
# Load predictions
predictions_path = Path('../results/predictions.csv')

if predictions_path.exists():
    df_results = pd.read_csv(predictions_path)
    
    # Display summary
    print(f"Total predictions: {len(df_results)}")
    print("\nLabel distribution:")
    print(df_results['label'].value_counts())
    
    # Plot distribution
    plt.figure(figsize=(10, 5))
    
    plt.subplot(1, 2, 1)
    df_results['label'].value_counts().plot(kind='bar', color=['#ff7f0e', '#1f77b4'])
    plt.title('Prediction Distribution')
    plt.xlabel('Label')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    
    plt.subplot(1, 2, 2)
    plt.hist(df_results['anomaly_score'], bins=30, edgecolor='black', alpha=0.7)
    plt.title('Anomaly Score Distribution')
    plt.xlabel('Anomaly Score')
    plt.ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()
    
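    # Show top useful PDFs
    # (assumes a lower anomaly_score means less like the "not useful" training
    # set; check this against how main.py computes the score)
    print("\nTop 5 'useful' PDFs (lowest anomaly scores):")
    print(df_results.nsmallest(5, 'anomaly_score')[['filename', 'label', 'anomaly_score']])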
else:
    print("No predictions found. Run main.py first!")
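
Beyond the plots, a per-label summary of `anomaly_score` can help check whether the two labels are actually separated by score. The sketch below runs on toy data: the column names match the cell above, but the label values `useful` and `not_useful` and all scores are made up for illustration.

```python
import pandas as pd

# Toy stand-in for results/predictions.csv (values are invented).
df_demo = pd.DataFrame({
    "filename": ["a.pdf", "b.pdf", "c.pdf", "d.pdf"],
    "label": ["not_useful", "not_useful", "useful", "useful"],
    "anomaly_score": [0.12, 0.08, -0.05, -0.15],
})

# Count, mean, min and max of the score within each predicted label.
summary = df_demo.groupby("label")["anomaly_score"].agg(["count", "mean", "min", "max"])
print(summary)
```

The same `groupby` call should work on `df_results` once `predictions.csv` exists; overlapping score ranges between labels would suggest the decision threshold deserves a closer look.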