# 🔍 VerdictVision: AI-Powered Legal Analytics

**CMPE 258 - Deep Learning Project**

A RAG-based system for analyzing California appellate cases using hybrid retrieval and the Microsoft Phi-2 LLM.

## Features
- 📚 **Hybrid Retrieval**: Semantic + TF-IDF + Metadata scoring
- 🤖 **LLM-Powered Q&A**: Answer legal questions with case citations
- ⚖️ **IRAC Analysis**: Generate structured legal analysis
- 📊 **Outcome Prediction**: Predict case outcomes based on similar cases

---

## 1. Setup & Installation

In [None]:
# Clone the repository (if running from GitHub)
# !git clone https://github.com/YOUR_USERNAME/VerdictVision.git
# %cd VerdictVision

# Install dependencies
!pip install -q torch transformers sentence-transformers pandas numpy scikit-learn
!pip install -q gensim requests beautifulsoup4 gdown matplotlib seaborn gradio tabulate

In [None]:
# Import modules
import sys
sys.path.append('.')

from src.data_collection import DataCollector
from src.preprocessing import CasePreprocessor
from src.embeddings import EmbeddingManager
from src.retrieval import HybridRetriever
from src.qa_system import VerdictVisionQA
from src.evaluation import OutcomePredictionEvaluator, RetrievalEvaluator
from configs.config import ensure_directories

ensure_directories()
print("✓ Setup complete!")

## 2. Data Collection

Download California appellate case data from the Case.law API or a Google Drive backup.

In [None]:
# Download data from Google Drive (faster)
collector = DataCollector()
collector.download_from_gdrive()

## 3. Preprocessing Pipeline

Extract text, clean documents, and create chunks for RAG.
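
The chunking step itself is not shown in this notebook, so the following is only a minimal sketch of one common approach: splitting text into overlapping word windows. The function name `chunk_text` and the `chunk_size` and `overlap` values are illustrative assumptions, not the parameters used by `CasePreprocessor`.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words.

    Hypothetical helper for illustration; sizes are counted in words.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(words):
            break
    return chunks


# Usage: ten words, windows of four words sharing one word with the previous window.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated text in the index.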

In [None]:
# Run preprocessing
preprocessor = CasePreprocessor()
df, chunks = preprocessor.run_pipeline()

print(f"\n✓ Processed {len(df)} cases into {len(chunks)} chunks")

## 4. Build Retrieval System

Create embeddings and a TF-IDF index for hybrid search.
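
How `HybridRetriever` combines its three signals is not shown here. As a rough sketch of the general idea, the snippet below min-max normalizes per-chunk semantic, TF-IDF, and metadata scores and takes a weighted sum. The function names, the weights `(0.6, 0.3, 0.1)`, and the input scores are assumptions made for illustration; only the `scores['final']` key mirrors the result format printed later in this notebook.

```python
def min_max(values: list[float]) -> list[float]:
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse_scores(semantic, tfidf, metadata, weights=(0.6, 0.3, 0.1)):
    """Rank chunk indices by a weighted sum of normalized scores.

    The weights are hypothetical, not the project's configuration.
    """
    if not (len(semantic) == len(tfidf) == len(metadata)):
        raise ValueError("all score lists must have the same length")

    w_sem, w_tfidf, w_meta = weights
    sem_n, tfidf_n, meta_n = min_max(semantic), min_max(tfidf), min_max(metadata)

    results = []
    for i in range(len(semantic)):
        final = w_sem * sem_n[i] + w_tfidf * tfidf_n[i] + w_meta * meta_n[i]
        results.append({
            "index": i,
            "scores": {
                "semantic": sem_n[i],
                "tfidf": tfidf_n[i],
                "metadata": meta_n[i],
                "final": final,
            },
        })
    # Highest fused score first.
    return sorted(results, key=lambda r: r["scores"]["final"], reverse=True)


# Usage with made-up scores for three chunks.
ranked = fuse_scores([0.9, 0.2, 0.5], [0.1, 0.8, 0.4], [1.0, 0.0, 0.0])
for r in ranked:
    print(r["index"], round(r["scores"]["final"], 4))
```

Normalizing first matters because cosine similarities and TF-IDF scores live on different scales; without it, one signal can dominate regardless of the weights.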

In [None]:
# Create embeddings
from src.embeddings import create_embeddings_for_chunks

embeddings = create_embeddings_for_chunks()
print(f"\n✓ Created embeddings: {embeddings.shape}")

In [None]:
# Test retrieval
retriever = HybridRetriever()
retriever.load_data()

# Example search
results = retriever.hybrid_search("breach of contract damages", top_k=3)

print("\n🔍 Search Results:")
for i, r in enumerate(results, 1):
    print(f"\n[{i}] {r['case_name']}")
    print(f"    Score: {r['scores']['final']:.4f}")
    print(f"    Preview: {r['text'][:150]}...")

## 5. Initialize Q&A System

Load the Phi-2 LLM and create the complete Q&A system.

In [None]:
# Initialize the complete system
qa_system = VerdictVisionQA()
qa_system.initialize()

## 6. Demo: Legal Q&A

In [None]:
# Ask a legal question
question = "What are the elements of breach of contract in California?"

result = qa_system.query(question, mode="qa")

print(f"❓ Question: {question}")
print(f"\n💡 Answer:\n{result['answer']}")
print(f"\n📚 Sources:")
for i, case in enumerate(result['cases'][:3], 1):
    print(f"   {i}. {case['case_name']}")
print(f"\n⏱️ Latency: {result['latency_ms']:.1f}ms")

## 7. Demo: Outcome Prediction

In [None]:
# Predict case outcome
case_text = """
Carolina PONCIO, Plaintiff and Appellant, v. DEPARTMENT OF RESOURCES 
RECYCLING AND RECOVERY, Defendant and Respondent. The plaintiff held 
a probationary certificate to operate a beverage container recycling center.
The department revoked the certificate after finding that the plaintiff's 
employee engaged in dishonesty by offering a bribe.
"""

result = qa_system.query(case_text, mode="predict")

print(f"⚖️ Predicted Outcome: {result['predicted_outcome'].upper()}")
print(f"📊 Confidence: {result['confidence']*100:.1f}%")
print(f"\n📚 Similar Cases:")
for i, case in enumerate(result['similar_cases'][:3], 1):
    print(f"   {i}. {case['case_name']} ({case.get('outcome', 'N/A')})")

## 8. Evaluation

In [None]:
# Run outcome prediction evaluation
evaluator = OutcomePredictionEvaluator()
evaluator.load_data()

# Train baseline
baseline_acc, report = evaluator.train_baseline()

# Majority baseline
majority_acc = evaluator.compute_majority_baseline()

print(f"\n📊 Results Summary:")
print(f"   Majority Baseline: {majority_acc:.1%}")
print(f"   LogReg Baseline:   {baseline_acc:.1%}")
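
For context on the comparison above, a majority baseline always predicts the most frequent label, so its accuracy equals that label's share of the data. The sketch below shows the calculation on made-up labels; it is an illustration of the concept and is not taken from `compute_majority_baseline`.

```python
from collections import Counter


def majority_baseline_accuracy(labels: list[str]) -> float:
    """Accuracy obtained by always predicting the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


# Usage with made-up outcome labels.
example_labels = ["affirmed", "affirmed", "affirmed", "reversed"]
print(f"{majority_baseline_accuracy(example_labels):.1%}")
```

A trained classifier is only informative to the extent that it beats this number, which is why imbalanced outcome labels make raw accuracy hard to interpret on its own.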

## 9. Launch Interactive UI

In [None]:
# Launch Gradio interface
from src.qa_system import create_gradio_interface

interface = create_gradio_interface(qa_system)
interface.launch(share=True)

---

## 📝 Project Structure

```
VerdictVision/
├── configs/
│   └── config.py          # Configuration parameters
├── src/
│   ├── data_collection.py # Data download
│   ├── preprocessing.py   # Text extraction & chunking
│   ├── embeddings.py      # Embedding creation
│   ├── retrieval.py       # Hybrid search
│   ├── llm.py             # LLM management
│   ├── qa_system.py       # Main Q&A system
│   └── evaluation.py      # Evaluation metrics
├── main.py                # CLI entry point
├── requirements.txt
└── README.md
```

## 🎯 Key Results

| Metric | Value |
|--------|-------|
| Dataset | 713 California Appellate Cases |
| Chunks | ~3,500 text segments |
| P@5 | 0.85+ |
| Outcome Prediction Accuracy | 85%+ |
| Avg Latency | < 3s per query |
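
P@5 in the table is precision at rank 5: the fraction of the top five retrieved items that are relevant. The snippet below is a minimal sketch of that standard definition with made-up document IDs; how `RetrievalEvaluator` judges relevance is not described here, so the relevance set is an assumption.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved IDs that appear in relevant_ids.

    Divides by k even when fewer than k items were retrieved,
    so missing results count against the score.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Usage with made-up IDs: three of the top five are relevant.
retrieved = ["a", "b", "c", "d", "e", "f"]
relevant = {"a", "c", "e", "z"}
print(precision_at_k(retrieved, relevant, k=5))
```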