
---

# **CHAPTER 30: BUILDING A PRODUCTION AI PORTFOLIO**

*From Tutorial Projects to Production-Grade Systems*

## **Chapter Overview**

Technical interviews and hiring decisions increasingly rely on demonstrated ability to ship complete systems rather than Kaggle leaderboard rankings. This chapter provides the scaffolding for five distinct portfolio projects that showcase MLOps proficiency, system design skills, and domain expertise. Each project is designed to be interview-defensible: complex enough to discuss trade-offs, scoped enough to complete in 40-60 hours, and practical enough that companies could theoretically deploy it.

**Estimated Time:** 60-80 hours (self-paced, typically 6-8 weeks part-time)  
**Prerequisites:** Completion of Chapters 19-24 (MLOps, Deployment, System Design), Git proficiency, cloud platform access (AWS/GCP/Azure)

---

## **30.0 Learning Objectives**

By the end of this chapter, you will have:
1. Scoped five distinct AI projects from problem formulation to production architecture
2. Implemented production-grade codebases with type hints, testing, and CI/CD pipelines
3. Deployed at least one project to a public cloud with persistent infrastructure
4. Documented system designs and trade-offs suitable for technical interviews
5. Created a portfolio narrative that connects technical choices to business impact

---

## **30.1 Project Scoping Framework**

#### **30.1.1 The PRD Template for ML Projects**

Before writing code, define the **Product Requirements Document**:

```markdown
## Project: Real-Time Fraud Detection API

### Problem Statement
E-commerce platform loses $2M annually to card-not-present fraud. Current rule-based 
system has 5% false positive rate (annoying customers) and catches only 60% of fraud.

### Success Metrics
- Business: Reduce fraud losses by 40% ($800K savings), keep false positive rate <3%
- Technical: P99 latency <50ms, availability 99.9%, handle 10K TPS burst
- Model: AUC-ROC >0.92, Precision@Recall=0.8 >0.85

### Constraints
- Must explain declined transactions to customer service (XGBoost, not black-box NN)
- GDPR compliant: Delete user data within 30 days, right to explanation
- Budget: <$5K/month cloud spend at target scale

### MVP vs Production
MVP: Batch inference on hourly transactions, 80% fraud catch rate acceptable
Production: Real-time inference, circuit breakers, shadow mode deployment
```
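
The model metric `Precision@Recall=0.8` in the PRD above is easy to misread, so here is a minimal sketch of one way to compute it: the best precision achievable at any threshold whose recall is at least 0.8. The function name and the sample data are illustrative assumptions, and the sketch assumes distinct scores (tied scores are not grouped).

```python
from typing import Sequence


def precision_at_recall(
    y_true: Sequence[int],
    scores: Sequence[float],
    target_recall: float = 0.8,
) -> float:
    """Best precision among thresholds whose recall is >= target_recall."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("y_true contains no positive examples")

    # Sweep the decision threshold from the highest score downward.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    best = 0.0
    true_pos = 0
    for flagged, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        recall = true_pos / total_pos
        if recall >= target_recall:
            best = max(best, true_pos / flagged)
    return best


# Hypothetical labels (1 = fraud) and model scores, for illustration only.
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(precision_at_recall(labels, model_scores, target_recall=0.8))
```

Writing the metric as code in the PRD appendix removes ambiguity about which threshold the target refers to.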

#### **30.1.2 Anti-Patterns in Portfolio Projects**

Avoid these common traps:

**The Notebook Dump:** A Jupyter notebook with no tests, no modular code, and hardcoded paths. *Fix:* Convert to Python package with `src/`, `tests/`, and `config/` directories.

**The Kaggle Copy:** Using competition data with leaked features (e.g., target encoded IDs that won't exist in production). *Fix:* Simulate real data drift, use time-based splits.

**The Un-Deployable Model:** A 10GB pickled model with dependencies that can't be containerized. *Fix:* ONNX export, dependency pinning, multi-stage Docker builds.

**The Missing Negative:** Only showing happy path metrics. *Fix:* Document failure modes, error analysis by demographic group, bias audit results.
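
The time-based split suggested for "The Kaggle Copy" can be sketched in a few lines. This is a minimal illustration using plain dictionaries; the `timestamp` key, the cutoff date, and the sample rows are assumptions, not part of any particular dataset.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def time_based_split(
    rows: List[Row], cutoff: datetime, ts_key: str = "timestamp"
) -> Tuple[List[Row], List[Row]]:
    """Train on everything before the cutoff, evaluate on everything after.

    Unlike a random split, no future information can leak into training.
    """
    train = [row for row in rows if row[ts_key] < cutoff]
    test = [row for row in rows if row[ts_key] >= cutoff]
    return train, test


# Hypothetical transactions, deliberately out of order.
transactions = [
    {"timestamp": datetime(2024, 1, day), "amount": 10.0 * day}
    for day in (5, 1, 20, 12)
]
train_rows, test_rows = time_based_split(transactions, cutoff=datetime(2024, 1, 10))
print(len(train_rows), len(test_rows))
```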

---

## **30.2 The Five Core Projects**

### **Project 1: Tabular MLOps Pipeline**
**Complexity:** Intermediate | **Domain:** Fintech/Retail | **Tech Stack:** Scikit-learn/XGBoost, Feast, MLflow, Airflow

**Architecture:**
```
Raw Data (S3) → Great Expectations Validation → Feature Engineering (Feast)
    → Training Pipeline (Airflow + MLflow) → Model Registry → FastAPI Service
    → Prometheus Monitoring → Grafana Dashboard
```

**Implementation Requirements:**
- **Data Validation:** Schema validation with Great Expectations (as in the architecture above; Pandera is a lighter-weight alternative), drift detection with Evidently
- **Feature Store:** Feast with Redis online store (point-in-time correctness for training)
- **Training:** Automated hyperparameter tuning (Optuna), experiment tracking
- **Serving:** FastAPI with Pydantic validation, batch and single-record endpoints
- **Testing:** Unit tests for feature engineering (pytest), integration tests for API, load tests with Locust

**Key Interview Talking Points:**
- *Why XGBoost over Neural Net?* Interpretability for regulatory compliance, faster inference
- *Why Feature Store?* Prevents training-serving skew, enables feature reuse across models
- *Handling Cold Start:* Fallback to demographic averages for new users

**Deliverables:**
- GitHub repo with `make test` running full test suite
- Live demo endpoint (can be on free tier)
- Architecture diagram showing data flow
- Cost analysis: "$0.001 per prediction at scale"
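
A cost figure such as "$0.001 per prediction" is more defensible when the arithmetic behind it is shown. The sketch below is one simple way to derive it; the monthly cost and traffic numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month


def cost_per_prediction(monthly_cost_usd: float, avg_requests_per_second: float) -> float:
    """Amortized serving cost: total monthly spend divided by predictions served."""
    if avg_requests_per_second <= 0:
        raise ValueError("avg_requests_per_second must be positive")
    predictions_per_month = avg_requests_per_second * SECONDS_PER_MONTH
    return monthly_cost_usd / predictions_per_month


# Hypothetical: $2,592/month of infrastructure serving 1 request/second on average.
print(f"${cost_per_prediction(2592.0, 1.0):.4f} per prediction")
```

Note that the average request rate, not the burst rate, belongs in the denominator; provisioning for bursts raises the numerator instead.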

---

### **Project 2: Computer Vision API**
**Complexity:** Intermediate | **Domain:** Manufacturing/Retail | **Tech Stack:** PyTorch, TorchServe, OpenCV, Kubernetes

**Scope:** Multi-class defect detection on industrial parts (or retail product recognition).

**Architecture:**
```
Image Upload → S3 → SQS Queue → Inference Service (TorchServe on EKS)
    → Post-processing (NMS, thresholding) → DynamoDB Results
    → Notification (SNS) → Client Webhook
```

**Technical Depth:**
- **Model:** Fine-tuned ResNet-50 or EfficientNet-B3 with transfer learning
- **Optimization:** TensorRT FP16 quantization, ONNX export for CPU fallback
- **Data Pipeline:** Albumentations for augmentation, FiftyOne for dataset visualization
- **Deployment:** Kubernetes with HPA, rolling updates with zero downtime
- **Monitoring:** Track confidence distribution drift, input image quality metrics (blur detection)

**Differentiation:**
Implement **active learning loop**: Low-confidence predictions trigger human review, automatically adding labeled data to retraining pool.
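
The routing logic behind such a loop can be sketched without any ML framework. The class and field names below are hypothetical, and the 0.6 confidence threshold is an assumption that would need tuning against reviewer capacity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Prediction:
    image_id: str
    label: str
    confidence: float


@dataclass
class ActiveLearningLoop:
    threshold: float = 0.6  # assumed cut-off for "low confidence"
    review_queue: List[Prediction] = field(default_factory=list)
    retraining_pool: List[Tuple[str, str]] = field(default_factory=list)

    def route(self, pred: Prediction) -> str:
        """Send low-confidence predictions to human review, accept the rest."""
        if pred.confidence < self.threshold:
            self.review_queue.append(pred)
            return "human_review"
        return "auto_accept"

    def record_human_label(self, image_id: str, human_label: str) -> None:
        """Move a reviewed item from the queue into the retraining pool."""
        for index, pred in enumerate(self.review_queue):
            if pred.image_id == image_id:
                del self.review_queue[index]
                self.retraining_pool.append((image_id, human_label))
                return
        raise KeyError(f"{image_id} is not awaiting review")


loop = ActiveLearningLoop()
print(loop.route(Prediction("img-001", "scratch", 0.95)))
print(loop.route(Prediction("img-002", "dent", 0.41)))
loop.record_human_label("img-002", "scratch")
```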

**Interview Angle:**
- *Handling Imbalanced Data:* Use focal loss, class weights, or oversampling of minority classes (Albumentations can augment the resampled images so that duplicates are not identical)
- *Latency Optimization:* Dynamic batching in TorchServe, async pre-processing
- *Edge Considerations:* Demonstrate a mobile export path (e.g., PyTorch to ONNX, then conversion to TensorFlow Lite, since the model is trained in PyTorch)

---

### **Project 3: NLP Service with Fine-Tuned Transformer**
**Complexity:** Advanced | **Domain:** Legal/Healthcare | **Tech Stack:** Hugging Face Transformers, LoRA, Docker, FastAPI

**Scope:** Named Entity Recognition (NER) for legal contracts or clinical notes, or document classification.

**Architecture:**
```
PDF/Text Input → LangChain Parsing → HuggingFace Pipeline 
    → Fine-tuned BERT/DeBERTa (LoRA adapter) → Structured Output (JSON)
    → Validation (Pydantic) → Response
```

**Production Considerations:**
- **Efficiency:** LoRA adapters for multi-tenant serving (swap adapters per customer without reloading base model)
- **Long Documents:** Implement sliding window or hierarchical attention for documents >512 tokens
- **Evaluation:** Entity-level F1 (not token-level), error analysis by entity type
- **Safety:** PII redaction using Presidio before model inference
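
The sliding-window approach for long documents can be sketched as follows. This is a framework-free illustration operating on an already-tokenized list; the function name and the 128-token overlap are assumptions. Hugging Face tokenizers expose a comparable capability through the `stride` and `return_overflowing_tokens` arguments, which is likely the better choice in a real pipeline.

```python
from typing import List, Sequence


def sliding_windows(
    tokens: Sequence[str], max_len: int = 512, overlap: int = 128
) -> List[List[str]]:
    """Split a long token sequence into overlapping windows of at most max_len.

    The overlap gives entities near a window boundary a chance to appear
    with full context in at least one window.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not tokens:
        return []

    step = max_len - overlap
    windows: List[List[str]] = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the last window reached the end of the document
        start += step
    return windows


# Hypothetical 1,000-token document.
document = [f"tok{i}" for i in range(1000)]
chunks = sliding_windows(document)
print([len(chunk) for chunk in chunks])
```

Predictions from overlapping regions still need to be merged (for example, by keeping the higher-confidence entity span), which this sketch leaves out.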

**MLOps Integration:**
- **CI/CD:** Retraining triggered on new labeled data (GitOps with ArgoCD)
- **Shadow Deployment and A/B Testing:** Run the fine-tuned model in shadow mode alongside the base model and compare outputs before splitting live traffic
- **Explainability:** LIME/SHAP for token importance visualization

**Portfolio Value:**
Demonstrates ability to handle unstructured data, HuggingFace ecosystem mastery, and domain adaptation (transfer learning).

---

### **Project 4: LLM Application with RAG**
**Complexity:** Advanced | **Domain:** Enterprise Knowledge Management | **Tech Stack:** LangChain/LlamaIndex, Vector DB (Pinecone/Weaviate), OpenAI/Local LLM

**Scope:** Chatbot answering questions over private documents (PDFs, Confluence, Slack).

**Architecture:**
```
Documents → Loaders (Unstructured.io) → Chunking (RecursiveCharacterTextSplitter)
    → Embeddings (OpenAI/HuggingFace) → VectorStore (Pinecone)
    
Query → Embedding → Similarity Search → Top-K Chunks → Prompt Engineering
    → LLM (GPT-4/Claude/Llama-2) → Streaming Response → Citation Layer
```

**Critical Engineering Decisions:**
- **Chunking Strategy:** Experiment with 256, 512, 1024 token chunks; overlap of 50 tokens; semantic chunking vs. fixed size
- **Retrieval:** Hybrid search (BM25 + Dense), re-ranking (Cohere Rerank or cross-encoders)
- **Evaluation:** RAGAS framework (faithfulness, answer relevance, context precision)
- **Cost Control:** Caching layer for common queries, query classification (simple FAQ vs. complex reasoning) to route to cheaper models
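
One common way to combine BM25 and dense results in hybrid search is reciprocal rank fusion (RRF). The chapter does not prescribe a fusion method, so treat this as one assumed option; the document IDs are hypothetical, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so only ranks matter and raw BM25/cosine scores never need calibrating.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retriever outputs, best match first.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

A re-ranker such as a cross-encoder would then reorder only the top of this fused list, which keeps the expensive model off the long tail.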

**Advanced Features:**
- **Agentic RAG:** Tool use for calculations, API calls to verify real-time data
- **Guardrails:** NeMo Guardrails or LlamaGuard for safety, topic restriction
- **Multi-modal:** Process images in documents (charts, diagrams) via CLIP embeddings

**Interview Narrative:**
Focus on evaluation methodology (how do you know RAG is working?), cost optimization strategies, and handling hallucinations via citation verification.

---

### **Project 5: Real-Time Recommendation System**
**Complexity:** Expert | **Domain:** Media/E-commerce | **Tech Stack:** Spark, Flink, Redis, Kafka, TensorFlow/PyTorch, Cassandra

**Scope:** Session-based recommendation (similar to Amazon "Customers who viewed X") or content feed ranking.

**Architecture (Two-Tower Neural Network):**
```
Real-time Events (Click, View) → Kafka → Flink Feature Computation → Redis
User Profile (Batch) → Spark → Feature Store (Feast) → Redis

Request → Candidate Generation (ANN: FAISS/Milvus) → 1000 Items
    → Ranking Model (Two-Tower or DeepFM) → Top 100
    → Business Logic (Diversity, Filtering) → Top 20
    → Response
```

**Technical Implementation:**
- **Candidate Generation:** Approximate Nearest Neighbors (HNSW index) over item embeddings
- **Ranking:** Contextual bandit or deep neural net with wide-and-deep architecture
- **Feature Engineering:** Real-time session features (view count in last 10 min), user historical features
- **Evaluation:** Offline (NDCG, MAP), Online (A/B testing framework with impression tracking)
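
For the offline evaluation, NDCG@k is short enough to implement and unit-test yourself. The sketch below uses the linear-gain variant (some libraries use an exponential gain, `2**rel - 1`) and assumes the relevance list covers every relevant item for the query, so the ideal ordering can be derived from it.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of position."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of the items a model returned, in ranked order.
print(ndcg_at_k([3, 2, 1], k=3))   # already in ideal order
print(ndcg_at_k([0, 1], k=2))      # the only relevant item is ranked second
```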

**Scale Considerations:**
- **Latency Budget:** 20ms for candidate generation, 30ms for ranking, 10ms for business logic
- **Cold Start:** Content-based features for new items, exploration via epsilon-greedy
- **Infrastructure:** Redis Cluster for sub-millisecond feature lookup, Kafka for event streaming
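
Epsilon-greedy exploration for cold-start items can be sketched as a slate builder: each slot shows the next-best ranked item, except that with probability epsilon it shows a random new item instead. The function and parameter names are hypothetical, and the sketch assumes `ranked` contains no duplicates.

```python
import random
from typing import List, Optional, Sequence


def epsilon_greedy_slate(
    ranked: Sequence[str],
    fresh: Sequence[str],
    slate_size: int = 20,
    epsilon: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Fill a slate from `ranked` (best first), exploring `fresh` items with prob. epsilon."""
    rng = rng or random.Random()
    already_ranked = set(ranked)
    exploit = list(ranked)
    explore = [item for item in fresh if item not in already_ranked]

    slate: List[str] = []
    while len(slate) < slate_size and (exploit or explore):
        # Explore if the coin flip says so, or if nothing is left to exploit.
        explore_now = bool(explore) and (not exploit or rng.random() < epsilon)
        if explore_now:
            slate.append(explore.pop(rng.randrange(len(explore))))
        else:
            slate.append(exploit.pop(0))
    return slate


# Hypothetical items: three ranked by the model, two brand-new with no history.
print(epsilon_greedy_slate(["a", "b", "c"], ["x", "y"], slate_size=3, epsilon=0.0))
```

Logging which slots were exploratory is what later lets the new items accumulate unbiased feedback.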

**Portfolio Presentation:**
Include offline evaluation metrics showing lift over baseline (popular items), and architecture diagram proving you understand the serving constraints.

---

## **30.3 Code Quality & Engineering Excellence**

#### **30.3.1 Project Structure Template**

```
ai_project/
├── .github/
│   └── workflows/
│       ├── ci.yml          # Lint, test, build
│       └── cd.yml          # Deploy to staging/prod
├── config/
│   ├── config.yaml         # Hydra/OmegaConf configuration
│   └── schema.py           # Pydantic models for validation
├── data/
│   ├── raw/                # Gitignored, versioned with DVC
│   └── processed/
├── docker/
│   ├── Dockerfile.api
│   └── Dockerfile.training
├── notebooks/
│   └── 01_eda.ipynb        # Exploratory only, not production code
├── src/
│   ├── __init__.py
│   ├── features/           # Feature engineering logic
│   ├── models/             # Model definitions
│   ├── api/                # FastAPI/Flask app
│   └── pipeline/           # Training scripts
├── tests/
│   ├── unit/
│   ├── integration/
│   └── load/
├── Makefile                # Standard commands: make test, make train
├── pyproject.toml          # Poetry dependencies, black/isort config
└── README.md               # Setup instructions, architecture diagram
```

#### **30.3.2 Type Safety in ML**

```python
# src/features/engineering.py
from typing import Literal, Protocol, TypedDict

import pandas as pd


class FeatureConfig(TypedDict):
    """Typed configuration so mypy can catch misspelled keys and bad values."""
    window_size: int
    aggregation: Literal["mean", "sum", "max"]


class FeatureEngineer(Protocol):
    """Protocol for feature transformers"""
    def fit(self, X: pd.DataFrame) -> "FeatureEngineer": ...
    def transform(self, X: pd.DataFrame) -> pd.DataFrame: ...


class RollingAggregator:
    """Rolling-window aggregation; satisfies FeatureEngineer structurally."""

    def __init__(self, config: FeatureConfig):
        self.window = config["window_size"]
        self.agg = config["aggregation"]
        self._fitted = False

    def fit(self, X: pd.DataFrame) -> "RollingAggregator":
        # Validation
        if X.empty:
            raise ValueError("Empty dataframe")
        # Remember the training columns (scikit-learn style trailing underscore).
        self.columns_ = X.columns.tolist()
        self._fitted = True
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        if not self._fitted:
            raise RuntimeError("Call fit before transform")
        # With the default min_periods, windows containing NaN or fewer than
        # `window` rows produce NaN rather than raising.
        return X.rolling(window=self.window).agg(self.agg)
```

#### **30.3.3 Testing Strategy**

**Unit Tests (pytest):**
```python
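import numpy as np
import pandas as pd
from src.features.engineering import RollingAggregator

def test_feature_engineering_handles_missing():
    engineer = RollingAggregator({"window_size": 3, "aggregation": "mean"})
    # Six rows: with pandas' default min_periods, every 3-row window touching the NaN is NaN
    df = pd.DataFrame({"value": [1, np.nan, 3, 4, 5, 6]})
    result = engineer.fit(df).transform(df)
    assert not result.isna().all().all(), "Should handle NaN gracefully"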
```

**Contract Tests (Pact):**
Verify API consumer (frontend) and provider (ML service) agree on schema.

**Load Tests (Locust):**
```python
from locust import HttpUser, task

class MLAPIUser(HttpUser):
    @task
    def predict(self):
        self.client.post("/predict", json={"features": [1.0, 2.0, 3.0]})
```

#### **30.3.4 CI/CD for ML**

```yaml
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    paths:
      - 'src/**'
      - 'config/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      
      - name: Install dependencies
        run: pip install -r requirements-dev.txt
      
      - name: Lint
        run: |
          black --check src/
          ruff check src/
          mypy src/
      
      - name: Unit tests
        run: pytest tests/unit --cov=src --cov-report=xml
      
      - name: Data validation
        run: |
          great_expectations checkpoint run raw_data_validation
      
      - name: Model performance regression
        run: |
          python -m src.pipeline.train --config config/test.yaml
          python -m src.evaluation.compare_baseline --threshold 0.05
```
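
The workflow's last step calls a `compare_baseline` module that the chapter does not show. The sketch below is one hypothetical shape for it: it assumes each training run writes its metrics to a JSON file, and the file paths, the `--metric` flag, and the `auc` key are all illustrative assumptions. Only `--threshold` comes from the workflow above.

```python
import argparse
import json
from pathlib import Path
from typing import List, Optional


def has_regressed(baseline: float, candidate: float, threshold: float) -> bool:
    """True when the candidate metric dropped by more than `threshold` (absolute)."""
    return (baseline - candidate) > threshold


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Fail CI on model regression")
    parser.add_argument("--threshold", type=float, default=0.05)
    # Hypothetical locations for metrics written by the training step.
    parser.add_argument("--baseline", type=Path, default=Path("metrics/baseline.json"))
    parser.add_argument("--candidate", type=Path, default=Path("metrics/candidate.json"))
    parser.add_argument("--metric", default="auc")
    args = parser.parse_args(argv)

    baseline = json.loads(args.baseline.read_text())[args.metric]
    candidate = json.loads(args.candidate.read_text())[args.metric]

    if has_regressed(baseline, candidate, args.threshold):
        print(f"FAIL: {args.metric} fell from {baseline:.4f} to {candidate:.4f}")
        return 1  # a non-zero exit code fails the GitHub Actions step
    print(f"OK: {args.metric} {candidate:.4f} vs. baseline {baseline:.4f}")
    return 0

# To run as a module, end the file with:
#     if __name__ == "__main__":
#         raise SystemExit(main())
```

Keeping the comparison in a pure function (`has_regressed`) makes the gate itself unit-testable, separate from file handling.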

---

## **30.4 Portfolio Presentation**

#### **30.4.1 GitHub Repository Hygiene**

**README Structure:**
1. **One-sentence description:** "Production-grade fraud detection API handling 10K TPS with <50ms latency"
2. **Architecture diagram:** Draw.io or Excalidraw embedded via image
3. **Quick start:** `docker-compose up` command to run locally
4. **Key metrics:** Latency, throughput, accuracy benchmarks
5. **Tech stack:** Badges for Python, FastAPI, AWS, etc.
6. **Blog post link:** Deep dive into technical decisions

**Code Documentation:**
- **Docstrings:** Google style, with Args/Returns/Raises
- **Architecture Decision Records (ADRs):** `docs/adr/001-feature-store.md`
- **API Docs:** Auto-generated Swagger UI screenshots

#### **30.4.2 Technical Blog Content**

Write 1,000-2,000 words on one challenging aspect:
- "Why we moved from batch to streaming features (and the 3 attempts that failed)"
- "Optimizing Transformer inference: From 2s to 20ms"
- "The hidden cost of Pandas: A memory optimization journey"

**Platforms:** Medium, Dev.to, or personal site (SEO-friendly).

#### **30.4.3 Demo Videos**

**The 3-Minute Demo:**
1. **0:00-0:30:** Problem statement ("E-commerce loses $X to fraud")
2. **0:30-1:30:** System walkthrough (show Grafana dashboard, API call in Postman)
3. **1:30-2:30:** Load test visualization (watch 1K requests/second handled)
4. **2:30-3:00:** GitHub repo walkthrough (highlight test coverage, CI/CD)

**Tools:** Loom (free), OBS Studio, or simple screen recording with voiceover.

---

## **30.5 Workbook Labs**

### **Lab 1: Portfolio Scoping Workshop**
For each of the 5 projects:

1. **Write PRD:** Define metrics, constraints, MVP vs. v1 scope
2. **Architecture Decision:** Choose 3 key technologies, document alternatives rejected
3. **Risk Assessment:** What will most likely fail? (data quality, latency, model drift)
4. **Week-by-Week Plan:** Week 1 (EDA), Week 2 (Baseline), Week 3 (MLOps), etc.

**Deliverable:** 5 markdown files in `portfolio_planning/` directory.

### **Lab 2: Code Quality Audit**
Take an old Kaggle notebook and refactor:

1. **Structure:** Convert to `src/` package structure
2. **Typing:** Add type hints to all functions
3. **Testing:** Achieve >80% test coverage on feature engineering
4. **CI/CD:** GitHub Actions passing lint, test, and build stages

**Deliverable:** Before/after comparison showing lines of code, test count, documentation coverage.

### **Lab 3: Deployment Challenge**
Deploy Project 1 (Tabular) or Project 2 (CV) to cloud:

1. **Infrastructure as Code:** Terraform or Pulumi (not click-ops)
2. **Monitoring:** Live dashboard showing predictions/second, latency histogram
3. **Load Test:** Demonstrate handling 10x normal load via auto-scaling
4. **Cost Optimization:** Show < $50/month spend (using spot instances or serverless)

**Deliverable:** Public endpoint (or screenshot if sensitive), architecture diagram, cost breakdown.

### **Lab 4: Interview Prep Documentation**
Create "Interview Cheat Sheet" for each project:

1. **Elevator Pitch:** 30-second description
2. **Deep Dive Questions:** 
   - "Why XGBoost not Random Forest?"
   - "How do you handle cold start?"
   - "What happens if feature store is down?"
3. **Failure Modes:** "Tell me about a bug you encountered"
4. **Scale Estimation:** "How would this handle 10x traffic?"

**Deliverable:** `INTERVIEW_PREP.md` in each project repo.

---

## **30.6 Common Pitfalls**

1. **Perfectionism Paralysis:** Waiting for "perfect" architecture before shipping. **Fix:** Ship MVP with hardcoded model, iterate. Done > Perfect.

2. **Resume-Driven Development:** Using Kubernetes for 100-requests/day hobby project. **Fix:** Match complexity to requirements. Flask + SQLite is fine for demos.

3. **Neglecting Documentation:** Code works but README says "TODO". **Fix:** Document setup steps as you build, not after.

4. **Fake Data Only:** Using `make_classification` synthetic data. **Fix:** Use real public datasets (even if messy) to show data cleaning skills.

5. **No Error Handling:** Happy path only, no `try`/`except`, no validation. **Fix:** Pydantic validation, circuit breakers, fallback predictions.
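
The circuit breaker with a fallback prediction mentioned in the last pitfall can be sketched in plain Python. This is a simplified illustration, not a production library: the class name, the failure limit, and the cool-down period are assumptions, and it is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """After repeated failures, skip the primary call and serve the fallback."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()  # open: do not touch the failing dependency
            # Cool-down elapsed: allow one trial call; a single failure reopens.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        return result


def model_unavailable() -> float:
    # Stand-in for a model server that is currently failing.
    raise RuntimeError("model server timeout")


breaker = CircuitBreaker(max_failures=2)
# Hypothetical fallback: a conservative default fraud score.
print(breaker.call(model_unavailable, fallback=lambda: 0.5))
```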

---

## **30.7 Interview Questions**

**Q1:** Walk me through your portfolio project. What was the hardest technical challenge?
*A: Structure: (1) Context: Business problem, scale, constraints, (2) Architecture: High-level diagram, key technologies, (3) Challenge: Specific issue (e.g., training-serving skew, latency), (4) Solution: Technical decision made, alternatives considered, (5) Result: Metrics improved, lessons learned. Focus on decision-making process, not just final code.*

**Q2:** Why did you choose [Technology X] over [Technology Y] in Project Z?
*A: Demonstrate trade-off analysis. "I chose Redis over Memcached for the feature store because I needed data structures (sorted sets for top-N features) and persistence. The trade-off was operational complexity—we needed Redis Sentinel for failover. For a simpler cache-only use case, Memcached would be better." Show you can defend decisions but acknowledge limitations.*

**Q3:** How would you scale this project to 10x the traffic?
*A: Identify current bottlenecks: (1) Database: Read replicas, sharding, or move to NoSQL, (2) Model: Quantization, distillation, or model parallelism, (3) Caching: Add CDN or edge caching, (4) Async: Move heavy work to queues, (5) Horizontal: Kubernetes HPA, multi-region deployment. Show understanding of vertical vs. horizontal scaling trade-offs.*

**Q4:** What would you do differently if you started over?
*A: Honest reflection shows growth. "I initially used Airflow for the training pipeline, but for this latency requirement, I should have used a message queue (SQS) + Lambda for faster iteration. Also, I hardcoded feature names initially; I should have used a feature registry from day one." Avoid "nothing"—always have learnings.*

**Q5:** How do you know your model is still working in production?
*A: Monitoring strategy: (1) Data drift detection (Kolmogorov-Smirnov tests), (2) Performance monitoring (accuracy if labels available, proxy metrics if delayed), (3) Business metrics (conversion rate, fraud catch rate), (4) Alerting: PagerDuty for data pipeline failures, Slack for drift warnings, (5) Automated retraining triggers. Show operational maturity.*
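
To make the drift-detection part of that answer concrete, the two-sample Kolmogorov-Smirnov statistic is simply the largest gap between two empirical CDFs. The sketch below computes only the statistic in plain Python; in practice `scipy.stats.ks_2samp` also returns a p-value. The sample values are hypothetical, and any alert threshold on the statistic would be an assumption to tune per feature.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Largest absolute difference between the two empirical CDFs (0 to 1)."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")

    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


# Hypothetical feature values: training-time reference vs. last week's traffic.
training_amounts = [1.0, 2.0, 3.0, 4.0]
production_amounts = [3.0, 4.0, 5.0, 6.0]
print(ks_statistic(training_amounts, production_amounts))
```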

---

## **30.8 Further Reading**

**Books:**
- *Building Machine Learning Pipelines* (Hapke & Nelson, O'Reilly) - TFX and Kubeflow Pipelines patterns
- *Designing Machine Learning Systems* (Chip Huyen) - Chapter on testing and deployment

**Resources:**
- **Made With ML:** Goku Mohandas's MLOps course (excellent portfolio examples)
- **Evidently AI:** ML System Design case studies
- **AWS Architecture Center:** ML patterns (real-time inference, batch processing)

---

## **30.9 Checkpoint Project: The Capstone**

Complete **one** end-to-end project from Section 30.2 with the following production criteria:

**Must Haves:**
- [ ] GitHub repo with >90% test coverage (measured by pytest-cov)
- [ ] Live deployment (AWS free tier, a low-cost PaaS such as Heroku, or personal server)
- [ ] CI/CD pipeline running (GitHub Actions green badge)
- [ ] Architecture diagram (PNG in repo)
- [ ] Technical blog post published (Medium/Dev.to)
- [ ] Monitoring dashboard (Grafana/Datadog screenshot showing traffic)

**Evaluation Rubric:**
| Criteria | Poor | Good | Excellent |
|----------|------|------|-----------|
| **Code Quality** | No tests, no types | Some tests, basic types | Full type hints, >90% coverage, linting |
| **Architecture** | Monolithic script | Modular but simple | Microservices/event-driven where appropriate |
| **Deployment** | Local only | Docker but not deployed | Live endpoint with HTTPS, auto-scaling |
| **Documentation** | README only | Basic setup docs | ADRs, API docs, architecture blog |
| **Monitoring** | None | Logs only | Metrics, alerts, dashboards |

**Success Criteria:**
You should be able to spend 45 minutes in an interview discussing only this project—covering data collection choices, model selection trade-offs, scaling challenges, and failure modes—without repeating yourself.

---

**End of Chapter 30**

*You now have the roadmap to build a portfolio that demonstrates production AI engineering capabilities. Chapter 31 covers Interview Preparation & Career Strategy.*

