# **CHAPTER 4: DEVELOPMENT ENVIRONMENT & TOOLS**

*The Professional's Workshop*

## **Chapter Overview**

Great models are built on messy laptops but deployed through rigorous engineering pipelines. This chapter transforms you from a notebook experimenter into a production engineer who can version control 10GB models, containerize training pipelines, and debug CUDA errors on remote servers at 2 AM.

**Estimated Time:** 30-40 hours (2-3 weeks)  
**Prerequisites:** Chapters 1-3, access to Linux/macOS terminal (WSL acceptable for Windows)

---

## **4.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Manage ML code with Git LFS, implement GitFlow for experiments, and automate CI/CD for ML pipelines
2. Navigate Linux servers, write shell scripts for data pipelines, and diagnose GPU/CPU resource contention
3. Create reproducible environments with Conda/Poetry and resolve dependency conflicts
4. Build optimized Docker images for ML training and serving (CUDA-enabled, multi-stage, layer-cached)
5. Interface with cloud storage (S3) and compute instances for distributed training
6. Configure VS Code for remote development on GPU servers with full debugging capabilities

---

## **4.1 Version Control: Git for ML**

Standard Git handles ML artifacts (datasets, model weights) poorly: large binary files diff and compress badly, and every committed version stays in history. We need specialized workflows.

#### **4.1.1 Git Fundamentals (The 20% you use 80% of the time)**

```bash
# Configuration (do this once)
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
git config --global init.defaultBranch main

# Daily workflow
git status                    # What's changed?
git add -p                    # Stage interactively (review each hunk)
git commit -m "feat: add data augmentation pipeline"
git pull --rebase             # Keep history linear (cleaner than merge commits)
git push origin feature-branch

# Undoing mistakes
git checkout -- <file>        # Discard local changes
git reset --soft HEAD~1       # Undo last commit, keep changes staged
git stash -u                  # Stash changes including untracked files
git stash pop                 # Restore stashed changes
```

**Branching Strategy for ML (GitFlow Adapted):**
- `main`: Production-ready code, tagged releases (v1.0.0)
- `develop`: Integration branch for features
- `experiment/*`: Individual ML experiments (e.g., `experiment/resnet50-augmentation-v2`)
- `feature/*`: Infrastructure features (e.g., `feature/add-mlflow-logging`)
- `hotfix/*`: Critical production fixes

```bash
# Create experiment branch
git checkout -b experiment/gpt2-finetuning-lr-sweep

# Push to remote (set upstream)
git push -u origin experiment/gpt2-finetuning-lr-sweep

# When done: squash merge to keep history clean
git checkout develop
git merge --squash experiment/gpt2-finetuning-lr-sweep
git commit -m "experiment: GPT-2 finetuning with LR sweep results"
```
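
The naming convention above is easy to enforce mechanically, for example from a pre-push hook or a CI step. The following is a minimal sketch, assuming the five branch families listed above; the function name `is_valid_branch` and the slug rule (lowercase letters, digits, dots, underscores, hyphens) are illustrative choices, not part of GitFlow.

```python
import re

# Branch families from the adapted GitFlow convention.
# The slug rule (lowercase alphanumerics plus . _ -) is an assumption of this sketch.
_BRANCH_PATTERN = re.compile(
    r"(main|develop|(experiment|feature|hotfix)/[a-z0-9][a-z0-9._-]*)"
)


def is_valid_branch(name: str) -> bool:
    """Return True if `name` follows the branch naming convention."""
    return _BRANCH_PATTERN.fullmatch(name) is not None
```

A hook could call `is_valid_branch` on the current branch name and refuse to push when it returns `False`.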

#### **4.1.2 Git LFS (Large File Storage)**

ML repositories contain large files (`.pt`, `.pkl`, big `.csv` datasets) that bloat a standard Git repository, because every committed version of a file stays in history.

**Setup:**
```bash
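# Install the git-lfs package first (e.g. apt install git-lfs or brew install git-lfs),
# then enable its Git hooks once per user account
git lfs install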

# Track model files
git lfs track "*.pt" "*.pth" "*.h5" "*.pb" "data/*.csv"
git lfs track "checkpoints/**"

# Verify .gitattributes created
cat .gitattributes
```

**Best Practices:**
- **Don't track generated files:** Only track source code and final model artifacts, not every checkpoint.
- **Storage costs:** Git LFS bandwidth/storage is expensive on GitHub. For >1GB models, use DVC (Data Version Control) or cloud storage (S3) with reference files.
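- **Locking:** For files that cannot be merged cleanly (binary assets, or Jupyter notebooks with embedded outputs), mark the pattern as lockable with `git lfs track --lockable "*.ipynb"` and take a lock before editing with `git lfs lock notebook.ipynb`.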
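
The "cloud storage with reference files" idea from the list above can be sketched in a few lines: the large artifact lives in a bucket, and Git tracks only a small pointer file holding its location and checksum. This is a minimal illustration, not how DVC or Git LFS store their pointers; the `.ref.json` suffix, the field names, and the bucket URI are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GB checkpoints never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(model_path: Path, remote_uri: str) -> Path:
    """Write `<model>.ref.json` next to the model; commit this file, not the model."""
    pointer = {
        "uri": remote_uri,  # where the real artifact lives (hypothetical bucket)
        "sha256": sha256_of(model_path),
        "size_bytes": model_path.stat().st_size,
    }
    pointer_path = model_path.with_name(model_path.name + ".ref.json")
    pointer_path.write_text(json.dumps(pointer, indent=2) + "\n")
    return pointer_path
```

After downloading the artifact on another machine, recomputing `sha256_of` and comparing it with the committed pointer confirms that the file is the exact version the code expects.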

#### **4.1.3 Managing Jupyter Notebooks in Git**

Notebooks create messy diffs (JSON with execution counts/binary outputs).

**Solution 1: Strip outputs before commit**
```bash
# Using nbstripout
pip install nbstripout
nbstripout --install  # Run in repo root

# Now git diff shows only code changes, not outputs
```
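
To make the idea concrete, here is roughly what stripping a notebook involves. This is a simplified sketch using only the standard library, not nbstripout's actual implementation; it assumes the nbformat 4 layout (a top-level `cells` list), and the function name `strip_outputs` is made up for the example.

```python
import json


def strip_outputs(notebook_json: str) -> str:
    """Return notebook JSON with code-cell outputs and execution counts removed."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop images, tables, tracebacks
            cell["execution_count"] = None  # drop the In [n] counters
    # Stable key order and indentation keep future diffs small
    return json.dumps(nb, indent=1, sort_keys=True) + "\n"
```

Source and markdown cells pass through unchanged, which is why the resulting diff shows only code and text edits.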

**Solution 2: ReviewNB (GitHub App)**
Visual diff tool for notebooks in pull requests (essential for team collaboration).

**Solution 3: Convert to Python for review**
```bash
# Use jupytext to pair .ipynb with .py
jupytext --set-formats ipynb,py notebook.ipynb
# Edit .py file, sync back to .ipynb
```

#### **4.1.4 CI/CD for ML (GitHub Actions)**

Automate testing and training validation on every commit.

```yaml
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
          
      - name: Cache dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov  # pytest-cov provides the --cov flag used below
          
      - name: Run tests
        run: pytest tests/ --cov=src
        
      - name: Check code style
        run: |
          pip install black flake8
          black --check src/
          flake8 src/
```

**Advanced: Self-hosted Runners**
For GPU tests, use self-hosted runners (your own GPU server) instead of GitHub's standard hosted runners, which have no GPU.

```yaml
jobs:
  gpu-tests:
    runs-on: self-hosted  # Your GPU machine
    steps:
      - uses: actions/checkout@v3
      - name: GPU Tests
        run: pytest tests/test_gpu_ops.py
```

---

## **4.2 Linux/Unix for AI Engineers**

Most training happens on Linux servers. You must navigate without a GUI.

#### **4.2.1 Essential Navigation and File Operations**

```bash
# Navigation
cd -                  # Go to previous directory (useful!)
cd ~                  # Home directory
pwd                   # Print working directory

# File operations
ls -lh                # Human-readable sizes (MB, GB)
ls -lt                # Sort by time (newest first)
cp -r src/ dst/       # Recursive copy
rsync -avz --progress data/ user@server:/data/  # Resumable transfer, compression
rm -rf dir/           # DANGER: Recursive force delete (no trash!)

# Disk usage (critical for shared GPU servers)
du -sh *              # Size of each item in current dir
du -h --max-depth=1 /home/user  # Find what's eating disk space
df -h                 # Disk free space on all mounts

# Find files (fd is a faster alternative to find, if installed)
find . -name "*.py" -type f -size +1M  # Python files > 1MB
find . -mtime -7      # Modified in last 7 days

# Search content (ripgrep, rg, is a faster alternative to grep)
grep -r "TODO" --include="*.py" src/
rg "class.*Dataset" --type py
```
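
When the same disk check has to run inside a Python tool (for example before a training job writes checkpoints), the `du` workflow can be approximated with the standard library. This is a sketch under stated assumptions: the helper names `dir_size` and `largest_entries` are hypothetical, and it sums apparent file sizes, whereas `du` reports allocated disk blocks, so the numbers can differ slightly.

```python
import os
from pathlib import Path


def dir_size(path: Path) -> int:
    """Total apparent size in bytes of all regular files under `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = Path(root) / name
            if not fp.is_symlink():
                total += fp.stat().st_size
    return total


def largest_entries(parent: Path, top: int = 5) -> list[tuple[str, int]]:
    """Return the `top` largest immediate children of `parent` as (name, bytes)."""
    sizes = []
    for child in parent.iterdir():
        if child.is_symlink():
            continue
        size = dir_size(child) if child.is_dir() else child.stat().st_size
        sizes.append((child.name, size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Calling `largest_entries(Path.home())` gives a quick answer to "what is eating my quota" without leaving Python.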

#### **4.2.2 File Permissions and Ownership**

Shared servers require strict permission management.

```bash
# Permission bits: rwx (read, write, execute) for owner/group/others
chmod 755 script.py   # rwxr-xr-x (owner full, others read+execute)
chmod 600 ~/.ssh/id_rsa  # rw------- (owner only, critical for keys!)

# Ownership (usually need sudo)
chown user:group file.txt
chown -R $USER:$USER /data/my_experiment  # Recursive

# Access Control Lists (ACLs) for fine-grained sharing
setfacl -m u:colleague:rwx /shared/project
getfacl /shared/project
```

**ML Context:** Dataset directories should be read-only for group members to prevent accidental deletion. Model checkpoint directories need write permissions for the training user only.
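
The same permission rules can be checked or applied from Python, which is handy in setup scripts. The sketch below assumes a POSIX system; `is_owner_only` and `make_read_only` are illustrative helper names, not standard functions.

```python
import os
import stat
from pathlib import Path


def is_owner_only(path: Path) -> bool:
    """True if group and others have no permissions at all (e.g. mode 600 or 700)."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def make_read_only(path: Path) -> None:
    """Clear every write bit, leaving read and execute bits untouched."""
    mode = stat.S_IMODE(path.stat().st_mode)
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

A setup script might assert `is_owner_only` on the SSH private key and call `make_read_only` on each dataset file once the download has finished.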

#### **4.2.3 Process Management and Monitoring**

```bash
# View processes
htop                  # Interactive process viewer (better than top)
nvidia-smi            # GPU usage (memory, utilization, temperature)
watch -n 1 nvidia-smi # Refresh every second

# Kill runaway processes
ps aux | grep python  # Find Python processes
kill -9 PID           # Force kill (SIGKILL, use as last resort)
pkill -f "train.py"   # Send SIGTERM to every process whose full command line matches

# Background jobs
python train.py &     # Run in background
bg                    # Resume suspended job in background
fg %1                 # Bring job 1 to foreground
jobs                  # List background jobs

# No hangup (survives SSH disconnect)
nohup python train.py > logs.txt 2>&1 &
# OR use tmux/screen (preferred)

# tmux essentials
tmux new -s training  # New session named "training"
# Ctrl+b, d to detach
tmux ls               # List sessions
tmux attach -t training  # Reconnect
# Ctrl+b, c (new window), n (next), p (previous), % (split vertical)
```
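
For scripted monitoring, `nvidia-smi` can print machine-readable CSV through its `--query-gpu` and `--format` options. The following is a minimal sketch of reading that output from Python; the `GpuStatus` class and function names are assumptions for the example, and the parser expects every field to be numeric (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import subprocess
from dataclasses import dataclass

QUERY = "index,memory.used,memory.total,utilization.gpu"


@dataclass
class GpuStatus:
    index: int
    memory_used_mib: int
    memory_total_mib: int
    utilization_pct: int

    @property
    def memory_fraction(self) -> float:
        return self.memory_used_mib / self.memory_total_mib


def parse_gpu_csv(text: str) -> list[GpuStatus]:
    """Parse the rows printed by nvidia-smi with `--format=csv,noheader,nounits`."""
    rows = []
    for line in text.strip().splitlines():
        index, used, total, util = (int(field.strip()) for field in line.split(","))
        rows.append(GpuStatus(index, used, total, util))
    return rows


def query_gpus() -> list[GpuStatus]:
    """Run nvidia-smi once and return one GpuStatus per GPU (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Keeping parsing separate from the subprocess call means the parser can be unit-tested on a laptop with no GPU.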

#### **4.2.4 Shell Scripting for ML Pipelines**

```bash
#!/bin/bash
# run_experiments.sh

# Exit on error, undefined vars, pipe failures
set -euo pipefail

# Configuration
MODELS=("resnet50" "efficientnet-b0" "vit-base")
LEARNING_RATES=(0.001 0.0001)
DATA_DIR="/data/imagenet"
LOG_DIR="./logs/$(date +%Y%m%d_%H%M%S)"

mkdir -p "$LOG_DIR"

# Loop through hyperparameters
for model in "${MODELS[@]}"; do
    for lr in "${LEARNING_RATES[@]}"; do
        EXP_NAME="${model}_lr${lr}"
        echo "Starting experiment: $EXP_NAME"
        
        python train.py \
            --model "$model" \
            --lr "$lr" \
            --data-dir "$DATA_DIR" \
            --log-dir "$LOG_DIR/$EXP_NAME" \
            > "$LOG_DIR/${EXP_NAME}.log" 2>&1 &
            
        # Limit concurrent jobs to avoid OOM
        if (( $(jobs -r | wc -l) >= 4 )); then
            # Wait for any job to finish (bash 4.3+); "|| true" keeps one failed job from aborting the sweep under set -e
            wait -n || true
        fi
    done
done

wait  # Wait for all remaining jobs
echo "All experiments completed. Logs in $LOG_DIR"
```

**Advanced:** Use `parallel` for more sophisticated job scheduling.
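
If the sweep logic grows beyond what is comfortable in bash, the same bounded-concurrency pattern can be written in Python. This is a sketch, assuming a `train.py` that accepts `--model` and `--lr` as in the script above; `build_commands` and `run_all` are hypothetical helper names.

```python
import itertools
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_commands(models, learning_rates, script="train.py"):
    """One command per (model, lr) pair, mirroring the nested bash loops."""
    return [
        [sys.executable, script, "--model", model, "--lr", str(lr)]
        for model, lr in itertools.product(models, learning_rates)
    ]


def run_all(commands, max_parallel=4):
    """Run commands with at most `max_parallel` alive at once; return exit codes in order."""
    def run(cmd):
        return subprocess.run(cmd, capture_output=True).returncode

    # Threads are enough here: each one just blocks on a child process
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run, commands))
```

Because `run_all` returns the exit codes, a failed experiment is reported at the end instead of silently disappearing into a log file.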

#### **4.2.5 SSH and Remote Development**

```bash
# SSH keys (passwordless login)
ssh-keygen -t ed25519 -C "your_email@example.com"
ssh-copy-id user@server  # Copy public key to server

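# SSH config (~/.ssh/config) to simplify connections
# Keep comments on their own lines: some OpenSSH versions do not accept trailing comments
Host gpu-server
    HostName 192.168.1.100
    User ubuntu
    IdentityFile ~/.ssh/id_ed25519
    # GUI forwarding (matplotlib plots); enable only for servers you trust
    ForwardX11 yes
    # Send a keepalive every 60 seconds so idle connections are not dropped
    ServerAliveInterval 60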

# Now just: ssh gpu-server

# Port forwarding (access TensorBoard on remote)
ssh -L 6006:localhost:6006 gpu-server
# Open localhost:6006 on local browser
```

---

## **4.3 Environment Management: The Reproducibility Challenge**

ML dependencies are nightmares (CUDA versions, specific PyTorch builds, C++ extensions).

#### **4.3.1 Conda (The Heavyweight)**

Best for: Data science beginners, managing Python + non-Python dependencies (CUDA, MKL).

```bash
# Create environment from file
conda env create -f environment.yml

# environment.yml
name: ml-project
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pytorch=2.0.0
  - torchvision
  - pytorch-cuda=11.8  # CUDA runtime for PyTorch 2.x; a separate cudatoolkit pin is not needed
  - numpy=1.24
  - pip
  - pip:
    - transformers==4.30.0
    - wandb
    - -e .  # Install current package in editable mode

# Export pinned package versions (--no-builds omits platform-specific build strings)
conda env export --no-builds > environment_lock.yml

# Cloning environments
conda create --name new_env --clone old_env
```

**Conda Best Practices:**
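- **Don't mix pip and conda carelessly.** Declare pip packages in the `pip:` section of `environment.yml`; ad hoc `pip install` commands can overwrite conda-managed packages and break the environment.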
- **Use `mamba`** (C++ reimplementation) for faster solving: `mamba install pytorch`
- **Clean caches regularly:** `conda clean -a` (saves GBs of disk space)

#### **4.3.2 Poetry (The Modern Standard)**

Best for: Pure Python projects and production services that need reliable dependency resolution.

```bash
# Initialize
poetry init  # Interactive setup
poetry add torch transformers datasets
poetry add --group dev pytest black mypy  # Dev dependencies

# pyproject.toml (the modern standard)
[tool.poetry.dependencies]
python = "^3.9"
torch = "^2.0.0"
transformers = "^4.30.0"

[tool.poetry.group.dev.dependencies]
pytest = "^7.0"
black = "^23.0"

# Lock file (poetry.lock) ensures exact reproducibility
poetry install  # Creates virtualenv and installs

# Running commands in environment
poetry run python train.py
poetry shell     # Activate environment (built in for Poetry 1.x; Poetry 2.x needs the shell plugin)

# Export to requirements.txt (for Docker; Poetry 2.x needs poetry-plugin-export)
poetry export -f requirements.txt --output requirements.txt --without-hashes
```

**Poetry vs Conda:**
- **Poetry:** Better resolver, cleaner `pyproject.toml` standard, faster for pure Python, supports lock files natively.
- **Conda:** Handles non-Python dependencies (CUDA toolkits, C++ libraries) that pip-based tools cannot install, which makes it a good fit for research environments.

**Hybrid Approach:** Use Conda for Python + CUDA, Poetry for Python package management inside Conda env.
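
Whichever tool builds the environment, it helps to verify that two machines actually ended up with the same packages. The sketch below uses the standard library's `importlib.metadata` to snapshot installed versions and compare two snapshots; the function names are illustrative, and it only sees Python distributions, not conda-managed system libraries such as CUDA.

```python
from importlib import metadata


def snapshot_environment() -> dict[str, str]:
    """Map each installed distribution name (lowercased) to its version."""
    snapshot = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that carry no name
            snapshot[name.lower()] = dist.version
    return snapshot


def diff_environments(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple]:
    """Return {package: (version_in_a, version_in_b)} for every mismatch; None means absent."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Saving `snapshot_environment()` as JSON next to each experiment's results makes it possible to run `diff_environments` later when a result fails to reproduce.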

#### **4.3.3 Docker (The Nuclear Option)**

When you need **guaranteed** reproducibility across laptops, servers, and cloud.

**Dockerfile for ML:**
```dockerfile
# Multi-stage build for smaller final image
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 AS builder

# Avoid interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# Install Python and build tools
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip git \
    && rm -rf /var/lib/apt/lists/*

# Install dependencies (separate layer for caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

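# Production stage (smaller, no compilers or CUDA headers)
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive

# The CUDA runtime image ships without Python, so install the interpreter only
RUN apt-get update && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy only necessary artifacts from builder
COPY --from=builder /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages
COPY --from=builder /usr/local/bin /usr/local/bin

# Copy application code
COPY src/ ./src/
COPY configs/ ./configs/

# Non-root user for security
RUN useradd -m -u 1000 mluser
USER mluser

# Entry point (Ubuntu provides python3, not python)
ENTRYPOINT ["python3", "-m", "src.train"]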
```

**Key Docker Commands:**
```bash
# Build
docker build -t ml-project:latest .

# Run with GPU support
docker run --gpus all -it --rm \
    -v $(pwd)/data:/app/data \
    -v $(pwd)/outputs:/app/outputs \
    ml-project:latest \
    --config configs/experiment.yaml

# Debug inside container
docker run --gpus all -it --rm --entrypoint /bin/bash ml-project:latest
```

**Docker Compose for Multi-Service:**
```yaml
# docker-compose.yml
version: '3.8'
services:
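  training:
    build: .
    volumes:
      - ./data:/data
      - ./checkpoints:/checkpoints
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
    # Request GPUs explicitly; CUDA_VISIBLE_DEVICES alone does not expose them
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    # Arguments appended to the image's ENTRYPOINT
    command: ["--epochs", "100"]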
  
  tensorboard:
    image: tensorflow/tensorflow:latest
    ports:
      - "6006:6006"
    volumes:
      - ./checkpoints:/logs
    command: tensorboard --logdir=/logs --host=0.0.0.0
```

---

## **4.4 Cloud Platforms: The First Steps**

You don't need to be a cloud architect, but you must move data and run instances.

#### **4.4.1 AWS (Amazon Web Services)**

**Essential Services:**
- **S3:** Object storage (datasets, model artifacts). Virtually unlimited capacity, billed per GB stored and transferred.
- **EC2:** Virtual machines (GPU instances: p3, p4, g4dn).
- **IAM:** Identity management (don't use root credentials!).

**CLI Essentials:**
```bash
# After installing AWS CLI v2, store credentials for the default profile
aws configure  # Enter access key, secret key, region

# S3 operations (like a remote filesystem)
aws s3 cp local_model.pt s3://my-bucket/models/v1/
aws s3 sync s3://my-bucket/datasets/imagenet ./data/imagenet  # Sync (resume interrupted)
aws s3 ls s3://my-bucket/ --recursive --human-readable --summarize  # Check size

# EC2 instance management
aws ec2 start-instances --instance-ids i-1234567890abcdef0
aws ec2 describe-instances --instance-ids i-1234567890abcdef0 --query 'Reservations[0].Instances[0].PublicIpAddress'

# Spot instances (often around 70% cheaper than on-demand, but AWS can reclaim them at short notice)
aws ec2 request-spot-instances --spot-price "1.00" --instance-count 1 --type "one-time" --launch-specification file://specs.json
```
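
To see what a spot discount means for a training budget, the arithmetic is simple enough to script. The hourly prices below are hypothetical placeholders chosen so the saving comes out at 70%; real prices vary by instance type, region, and time, so look them up before budgeting.

```python
def training_cost(hours: float, hourly_rate: float) -> float:
    """Cost in dollars of running one instance for `hours`."""
    return hours * hourly_rate


def spot_savings(on_demand_rate: float, spot_rate: float) -> float:
    """Fractional saving of spot over on-demand (0.7 means 70% cheaper)."""
    return 1.0 - spot_rate / on_demand_rate


# Hypothetical prices for illustration only
ON_DEMAND = 3.00  # $/hour (assumed)
SPOT = 0.90       # $/hour (assumed)

print(f"On-demand, 24 h: ${training_cost(24, ON_DEMAND):.2f}")
print(f"Spot, 24 h:      ${training_cost(24, SPOT):.2f}")
print(f"Saving:          {spot_savings(ON_DEMAND, SPOT):.0%}")
```

With these assumed rates, a 24-hour run costs 24 × 3.00 = 72.00 dollars on demand and 24 × 0.90 = 21.60 dollars on spot. The comparison ignores interruptions: a reclaimed spot instance loses any work since the last checkpoint, so frequent checkpointing is part of the real cost calculation.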

**S3 with Python (boto3):**
```python
import os

import boto3
import pandas as pd
from tqdm import tqdm

s3 = boto3.client('s3')

# Upload with progress bar
def upload_file(file_name, bucket, object_name=None):
    if object_name is None:
        object_name = file_name

    file_size = os.stat(file_name).st_size
    with tqdm(total=file_size, unit='B', unit_scale=True, desc=file_name) as pbar:
        # boto3 calls Callback with the number of bytes sent since the previous call
        s3.upload_file(file_name, bucket, object_name, Callback=pbar.update)

# Read an object without saving it to disk first (the body is a file-like stream)
obj = s3.get_object(Bucket='my-bucket', Key='data/large_file.csv')
df = pd.read_csv(obj['Body'])
```

#### **4.4.2 GCP and Azure (Brief)**

- **GCP:** Similar to AWS (Cloud Storage = S3, Compute Engine = EC2). Good integration with TensorFlow/Keras.
- **Azure:** Strong Windows integration, good for enterprises. Blob Storage, Virtual Machines.

**Universal Pattern:** All clouds have:
1. Object storage (S3/Cloud Storage/Blob)
2. Compute instances (EC2/Compute Engine/VMs)
3. Identity management (IAM)
4. CLI tools (`aws`/`gcloud`/`az`)

---

## **4.5 IDEs and Productivity**

#### **4.5.1 VS Code for ML**

**Essential Extensions:**
- **Python:** IntelliSense, linting, debugging
- **Jupyter:** Native notebook support (no browser needed)
- **Remote - SSH:** Edit files on GPU server as if local
- **GitLens:** Git blame, history
- **Docker:** Manage containers

**Remote Development Workflow:**
1. Install "Remote - SSH" extension
2. `Ctrl+Shift+P` → "Remote-SSH: Connect to Host" → `gpu-server`
3. Open `/home/user/project` on server
4. All editing, terminal, and debugging happens on server (local laptop is just a display)
5. VS Code detects listening ports (such as TensorBoard's) and forwards them automatically

**Debugging Configuration (.vscode/launch.json):**
```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Train",
            "type": "debugpy",
            "request": "launch",
            "program": "${workspaceFolder}/train.py",
            "args": ["--config", "configs/debug.yaml", "--epochs", "1"],
            "console": "integratedTerminal",
            "env": {
                "CUDA_VISIBLE_DEVICES": "0"
            }
        }
    ]
}
```
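
The launch configuration above passes `--config` and `--epochs` to `train.py`, so the script needs an argument parser that accepts them. Here is a minimal sketch of that entry point; the default of 10 epochs and the help strings are assumptions, since the original does not show `train.py` itself.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the flags supplied by the launch configuration's `args` list."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--config", required=True, help="Path to a YAML config file")
    parser.add_argument("--epochs", type=int, default=10, help="Number of epochs")
    return parser.parse_args(argv)
```

Accepting an explicit `argv` list keeps the parser testable: `parse_args(["--config", "configs/debug.yaml", "--epochs", "1"])` reproduces exactly what the debugger passes.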

#### **4.5.2 Jupyter Extensions**

```bash
# Essential extensions
pip install jupyterlab-code-formatter  # Black formatting
pip install jupyterlab-git             # Git GUI in Jupyter
pip install lckr-jupyterlab-variableinspector  # Variable explorer like Spyder
```

---

## **4.6 Workbook Labs**

### **Lab 1: Git Mastery with Merge Conflicts**
1. Create a repo with a Jupyter notebook
2. Create two branches modifying same cell
3. Resolve conflict manually (understand HEAD vs incoming)
4. Use `git rerere` (reuse recorded resolution) to automate future similar conflicts

**Deliverable:** Screenshot of resolved conflict and commit history showing clean merge.

### **Lab 2: Shell Script for Hyperparameter Sweep**
Write a bash script that:
- Reads experiment configs from a CSV file
- Launches training jobs with GNU `parallel` to use exactly 4 GPUs
- Monitors GPU memory with `nvidia-smi` and kills jobs if memory >95%
- Aggregates results into a summary CSV

**Deliverable:** `sweep.sh` with error handling and logging.

### **Lab 3: Multi-Stage Docker Build**
Create a Dockerfile that:
- Stage 1: Compiles a custom CUDA extension (C++ PyTorch op)
- Stage 2: Runtime environment with only Python and necessary libs (no gcc, no CUDA dev tools)
- Final image size <2GB (vs >8GB for devel image)

**Deliverable:** `Dockerfile`, `docker-compose.yml`, and size comparison report.

### **Lab 4: Cloud Data Pipeline**
Write a Python script that:
- Syncs local training data to S3 (incremental, only changed files)
- Launches EC2 spot instance
- Waits for training completion by polling S3 for `done.txt`
- Downloads results and terminates instance

**Deliverable:** `cloud_train.py` using boto3, with cost estimation.

---

## **4.7 Common Pitfalls**

1. **Committing Large Files Without LFS:**
   - Repo becomes permanently bloated (history retains file even if deleted)
   - Solution: Use `git-filter-repo` to purge from history (destructive!)

2. **Docker Layer Caching Issues:**
   - Copying code before installing requirements busts cache on every code change
   - Solution: Copy requirements first, install, then copy code

3. **Conda Environment in Production:**
   - Solving the environment at deploy time is slow and may pick different package versions than the ones you tested
   - Solution: Use `conda-lock` or export exact environment, or use Docker

4. **Running Jupyter on 0.0.0.0 Without Password:**
   - Anyone on network can execute code as you
   - Solution: Use SSH tunneling instead, or set strong password/token

5. **CUDA Version Mismatch:**
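   - The container's CUDA toolkit is newer than the host driver supports (e.g., a CUDA 12.0 image on a driver that supports up to 11.8), so CUDA fails to initialize
   - Solution: Check the driver's maximum supported CUDA version with `nvidia-smi` and choose an `nvidia/cuda` base image at or below it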
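
The layer caching pitfall (item 2 above) is easier to remember with a toy model of how the cache behaves. The sketch below is a deliberate simplification, not Docker's real algorithm: it treats each layer's cache key as a hash of the parent key, the instruction text, and the content of any copied file. The step lists and file contents are made up for the example.

```python
import hashlib


def layer_keys(steps, file_contents):
    """Compute a cache key per Dockerfile step in a simplified model.

    Each key hashes the parent key, the instruction text, and (for COPY steps)
    the content of the copied file, so a change invalidates every later layer.
    """
    keys, parent = [], ""
    for instruction, copied_file in steps:
        payload = parent + instruction + file_contents.get(copied_file, "")
        parent = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(parent)
    return keys


def rebuilt_layers(steps, before, after):
    """Count layers whose key changes between two versions of the build context."""
    old, new = layer_keys(steps, before), layer_keys(steps, after)
    return sum(o != n for o, n in zip(old, new))


CODE_FIRST = [("COPY src/", "src"), ("COPY requirements.txt", "req"), ("RUN pip install", None)]
DEPS_FIRST = [("COPY requirements.txt", "req"), ("RUN pip install", None), ("COPY src/", "src")]

before = {"src": "v1", "req": "torch==2.0.0"}
after = {"src": "v2", "req": "torch==2.0.0"}  # only the source code changed

print(rebuilt_layers(CODE_FIRST, before, after))  # 3: every layer, including pip install
print(rebuilt_layers(DEPS_FIRST, before, after))  # 1: only the final COPY layer
```

In the code-first ordering a one-line source edit changes the first key, and every later key inherits the change, so the expensive `pip install` layer is rebuilt. With dependencies first, only the last layer is affected.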

---

## **4.8 Interview Questions**

**Q1:** How do you version control a 5GB model checkpoint?
*A: Git LFS can work for files up to a few GB, but hosting providers cap the file size and charge for storage and bandwidth (GitHub's limit depends on the plan). For a 5GB checkpoint, prefer DVC (Data Version Control) or cloud storage (S3) with versioned filenames (model-v1.0.0.pt), storing only the S3 URI in Git. For experiments, use MLflow or Weights & Biases artifact tracking.*

**Q2:** Explain the difference between `pip install -r requirements.txt` and `poetry install`.
*A: `pip install -r requirements.txt` resolves versions at install time, so unless every package (including transitive dependencies) is pinned, two installs can produce different environments. Poetry uses a lock file (poetry.lock) to install exact versions of all transitive dependencies, ensuring identical environments across machines and over time. Poetry also manages virtualenvs automatically.*

**Q3:** Your training job dies when you close laptop (SSH disconnects). How do you fix it?
*A: Use `tmux` or `screen` to create persistent sessions that survive disconnect. Or use `nohup` with output redirection. Best practice: Use a process manager like systemd or dedicated job schedulers (Slurm, Kubernetes) for production training.*

**Q4:** How do you share a GPU server with 4 colleagues without conflicts?
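*A: Use `CUDA_VISIBLE_DEVICES` to assign specific GPUs to users. Implement a simple lock file system or use GPU scheduling tools (e.g., RunAI, Kubernetes with GPU operator). Monitor with `nvidia-smi`; where a GPU must be shared, cap memory per process in the framework or partition the GPU with MIG on supported hardware, since containers alone do not limit GPU memory.*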
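
The "simple lock file system" mentioned in the answer to Q4 can be sketched with atomic file creation. This is an advisory scheme under stated assumptions: everyone on the server agrees to use it, the lock directory sits on a local filesystem, and the helper names are hypothetical. It does not clean up stale locks left behind by crashed jobs.

```python
import os
from pathlib import Path


def try_acquire_gpu(gpu_index: int, lock_dir: Path) -> bool:
    """Atomically claim a GPU by creating its lock file; False if already claimed."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"gpu{gpu_index}.lock"
    try:
        # O_CREAT | O_EXCL fails if the file exists, so two users cannot both succeed
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(f"pid={os.getpid()}\n")  # record the owner to help debug stale locks
    return True


def release_gpu(gpu_index: int, lock_dir: Path) -> None:
    """Remove the lock file so a colleague can claim the GPU."""
    (lock_dir / f"gpu{gpu_index}.lock").unlink(missing_ok=True)


def acquire_any_gpu(num_gpus: int, lock_dir: Path):
    """Claim the first free GPU and return its index, or None if all are taken."""
    for index in range(num_gpus):
        if try_acquire_gpu(index, lock_dir):
            return index
    return None
```

A launcher would call `acquire_any_gpu`, set `CUDA_VISIBLE_DEVICES` to the returned index before starting training, and call `release_gpu` in a `finally` block when the job ends.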

**Q5:** Your Docker build takes 20 minutes every time because it reinstalls PyTorch.
*A: Leverage layer caching: Copy requirements.txt and install dependencies BEFORE copying source code. Only the COPY source layer rebuilds on code changes; dependency layer is cached. Use BuildKit for better caching (`DOCKER_BUILDKIT=1`).*

---

## **4.9 Further Reading**

**Books:**
- *Pro Git* (Scott Chacon and Ben Straub) - Free online, comprehensive Git reference
- *The Linux Command Line* (William Shotts) - Essential shell skills
- *Docker Deep Dive* (Nigel Poulton) - Container internals

**Tools:**
- **DVC:** Data Version Control (Git for data)
- **Pre-commit:** Framework for managing pre-commit hooks (black, flake8, etc.)
- **Tmuxp:** YAML configuration for tmux sessions (save layout setups)

---

## **4.10 Checkpoint Project: Reproducible Training Infrastructure**

Build a complete training infrastructure that could be handed to a new team member and "just work."

**Requirements:**

**Repository Structure:**
```
project/
├── .gitattributes      # LFS tracking
├── .github/
│   └── workflows/
│       └── train.yml   # CI to verify training runs
├── Dockerfile          # Multi-stage, CUDA-enabled
├── docker-compose.yml  # Training + TensorBoard + Postgres (for MLflow)
├── Makefile            # Common commands (make train, make test, make clean)
├── pyproject.toml      # Poetry dependencies
├── environment.yml     # Conda alternative
├── data/               # Gitignored, mounted volume
├── configs/            # YAML experiment configs
├── scripts/
│   ├── setup.sh        # Install hooks, create dirs
│   └── sync_data.sh    # Rsync/S3 sync wrapper
└── src/
    └── train.py
```

**Features:**
1. **One-command setup:** `make init` creates environment, installs pre-commit hooks, pulls sample data
2. **Dual environment support:** Both `poetry install` and `conda env create` work identically
3. **Remote training script:** `scripts/remote_train.sh` that SSHs to server, starts tmux session, runs Docker container, detaches
4. **Artifact management:** Training outputs saved to S3 with versioning, metadata in MLflow
5. **Reproducibility:** `REPRODUCIBILITY.md` document listing exact hardware, driver versions, and random seeds used for published results
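
For the reproducibility feature, most of what `REPRODUCIBILITY.md` needs can be collected by the training script itself. The following is a minimal standard-library sketch; `set_seed` and `run_metadata` are illustrative names, and a real project would extend them with NumPy and PyTorch seeding plus GPU driver information.

```python
import os
import platform
import random
import sys


def set_seed(seed: int) -> None:
    """Seed Python's RNG; extend with numpy/torch seeding if those libraries are used."""
    random.seed(seed)
    # Only affects hash randomization in child processes started after this point
    os.environ["PYTHONHASHSEED"] = str(seed)


def run_metadata(seed: int) -> dict:
    """Collect facts worth recording next to published results."""
    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
```

Writing `run_metadata(seed)` to a JSON file in the output directory at the start of every run means the document can be generated from real runs instead of being maintained by hand.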

**Deliverables:**
- GitHub repo link
- Demo video: Clone → `make train` → TensorBoard shows live metrics (under 5 minutes)
- Cost analysis: Training cost on cloud vs local hardware

---

**End of Chapter 4**

*You now have the engineering hygiene to build production-grade ML systems. Chapter 5 will begin Phase 2: Machine Learning Fundamentals, starting with Data Preprocessing and Feature Engineering.*

