# Student Manual & Handbook
## Designing and Implementing Data Science Solutions on Azure

**Preparation for Business Analyst and Data Engineer Roles**

---

### 📚 Course Overview

This manual prepares you for the **DP-100 exam** (Microsoft Certified: Azure Data Scientist Associate) and equips you with practical skills for **Business Analyst** and **Data Engineer** roles in data science and machine learning.

**Learning Objectives:**
- Design and create suitable working environments for data science workloads
- Explore data and train machine learning models on Azure Machine Learning
- Run jobs and pipelines to prepare models for production
- Deploy and monitor scalable machine learning solutions
- Optimize language models for AI applications

**Target Roles:**
- Business Analyst (Data-focused)
- Data Engineer
- ML Engineer
- Data Scientist

---

## Module 1: Design a Machine Learning Solution

### 🎯 Learning Objectives
- Understand the complete machine learning lifecycle
- Design data ingestion solutions
- Choose appropriate Azure services for ML workloads
- Plan for model deployment and monitoring

---

### 1.1 The Machine Learning Process

#### Six Key Steps:

1. **Define the Problem** - Identify the business question and ML task type
2. **Get the Data** - Source and collect relevant datasets
3. **Prepare the Data** - Clean, transform, and engineer features
4. **Train the Model** - Select and train algorithms
5. **Integrate the Model** - Deploy for consumption
6. **Monitor the Model** - Track performance and detect drift

#### Common ML Problem Types:

| Problem Type | Description | Example |
|--------------|-------------|----------|
| **Classification** | Predict categorical values | Customer churn (Yes/No) |
| **Regression** | Predict numerical values | House price prediction |
| **Time-series Forecasting** | Predict future values over time | Sales forecasting |
| **Computer Vision** | Classify images or detect objects | Medical image diagnosis |
| **NLP** | Process and understand text | Sentiment analysis |

#### Business Analyst Perspective:
- Translate business requirements into ML problems
- Define success metrics (accuracy, precision, recall, F1-score)
- Ensure alignment with business KPIs

#### Data Engineer Perspective:
- Design scalable data pipelines
- Ensure data quality and consistency
- Optimize data storage and retrieval

### 1.2 Data Storage and Ingestion

#### Azure Storage Options for ML:

```python
# Common Azure Storage Services for ML
"""
1. Azure Blob Storage
   - Use Case: Unstructured data (images, videos, logs)
   - Cost: Low
   - Performance: Good for large files

2. Azure Data Lake Storage (Gen 2)
   - Use Case: Big data analytics, hierarchical data
   - Cost: Low-Medium
   - Performance: Optimized for analytics

3. Azure SQL Database
   - Use Case: Structured transactional data
   - Cost: Medium-High
   - Performance: High for structured queries
"""
```

#### Data Types:

- **Structured**: Tabular data with a fixed schema (CSV, SQL tables, Parquet)
- **Semi-structured**: Self-describing data without a fixed schema (JSON, XML)
- **Unstructured**: Images, text, audio, video

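To make the distinction concrete, the sketch below flattens semi-structured JSON readings into uniform rows that could be loaded into a table. The field names (`device_id`, `temperature`, `humidity`) and the sample messages are hypothetical; missing keys are filled with `None`.

```python
import json

# Hypothetical IoT readings: the second record lacks "humidity",
# which is allowed in semi-structured data.
raw_messages = [
    '{"device_id": "sensor-1", "temperature": 21.5, "humidity": 40}',
    '{"device_id": "sensor-2", "temperature": 19.0}',
]


def to_rows(messages, columns):
    """Parse JSON strings and project each object onto a fixed column list."""
    rows = []
    for message in messages:
        record = json.loads(message)
        # dict.get returns None when a key is absent, so every row has the same shape.
        rows.append([record.get(column) for column in columns])
    return rows


columns = ["device_id", "temperature", "humidity"]
rows = to_rows(raw_messages, columns)
for row in rows:
    print(row)
```

Projecting onto a fixed column list is one simple way to move from semi-structured to structured data; a real pipeline would typically also validate types.
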
#### Data Ingestion Architecture:

```
Data Source → Azure Data Factory/Synapse → Transform → Azure Storage → Azure ML
(CRM/IoT)     (ETL Pipeline)               (Clean)     (Blob/ADLS)     (Training)
```

#### Key Principle: **Separate Compute from Storage**
- Scale compute up/down based on demand
- Keep data persistent in storage
- Shut down compute when not needed to save costs

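The cost argument can be illustrated with simple arithmetic. The hourly rate and the usage pattern below are hypothetical assumptions chosen for the example, not actual Azure prices.

```python
# Hypothetical price; real VM rates vary by size and region.
HOURLY_RATE_USD = 0.50
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_compute_cost(hours_running, hourly_rate=HOURLY_RATE_USD):
    """Compute cost scales with running hours; storage is billed separately."""
    return hours_running * hourly_rate


always_on = monthly_compute_cost(HOURS_PER_MONTH)
# Assumed usage: training runs 2 hours per day on 22 working days.
on_demand = monthly_compute_cost(2 * 22)

print(f"Always-on compute: ${always_on:.2f}")
print(f"On-demand compute: ${on_demand:.2f}")
```

Because the data stays in storage either way, stopping compute between runs only removes the idle hours from the bill.
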
### 1.3 Choosing Azure ML Services

#### Service Comparison:

| Service | Best For | Key Features | When to Use |
|---------|----------|--------------|-------------|
| **Azure AI Services** | Pre-built models | Quick deployment, minimal coding | Standard use cases (vision, speech, language) |
| **Microsoft Fabric** | End-to-end data platform | Unified analytics | Data engineering + data science at scale |
| **Azure Databricks** | Big data + ML | Spark, collaborative notebooks | Large-scale data processing |
| **Azure Machine Learning** | Custom ML models | Full ML lifecycle management | Custom models, MLOps, tracking |

#### Compute Options:

**CPU vs GPU:**
- CPU: Smaller tabular datasets, cost-effective
- GPU: Unstructured data (images, text), deep learning

**General Purpose vs Memory Optimized:**
- General Purpose: Balanced CPU-to-memory, testing/development
- Memory Optimized: High memory-to-CPU, in-memory analytics

**Spark Compute:**
- Driver node + worker nodes
- Distributed processing
- Available in Azure Synapse Analytics and Databricks

### 1.4 Model Deployment Strategies

#### Real-time vs Batch Deployment:

| Aspect | Real-time | Batch |
|--------|-----------|-------|
| **Latency** | Milliseconds-seconds | Minutes-hours |
| **Use Case** | Mobile apps, websites | Nightly reports, bulk processing |
| **Compute** | Always-on | Scheduled/on-demand |
| **Cost** | Higher (always running) | Lower (runs when needed) |
| **Example** | Credit card fraud detection | Customer churn prediction |

#### Deployment Endpoints:

```python
# Real-time Endpoint Example
"""
Client App → REST API    → Model Endpoint     → Prediction Response
             (HTTP POST)   (Always Available)   (Immediate)
"""

# Batch Endpoint Example
"""
Scheduled Job → Batch Endpoint → Process Data    → Save Results to Storage
(Daily)         (Spin up)        (Multiple rows)   (Database/File)
"""
```
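As a rough sketch of the client side of the real-time flow, the snippet below builds (but does not send) an HTTP POST request with the standard library. The scoring URL, the key placeholder, the column names, and the payload shape are all assumptions for illustration; the actual request format depends on how the model is deployed.

```python
import json
import urllib.request

# Hypothetical values: replace with the scoring URI and key of your endpoint.
SCORING_URI = "https://example.com/score"
API_KEY = "<your-endpoint-key>"


def build_scoring_request(columns, rows):
    """Build (but do not send) an HTTP POST request carrying rows to score."""
    # Assumed payload shape; check the format your deployment expects.
    payload = {"input_data": {"columns": columns, "data": rows}}
    return urllib.request.Request(
        SCORING_URI,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


request = build_scoring_request(["amount", "merchant_id"], [[125.40, 8812]])
print(request.get_method(), request.full_url)
# Sending the request would be: urllib.request.urlopen(request).read()
```

A batch endpoint would instead be invoked by a scheduled job that points at a file or folder of rows rather than a single request body.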

### 1.5 MLOps Architecture

#### Six-Stage MLOps Workflow:

1. **Setup**: Create Azure resources (Resource Group, Workspace, Compute)
2. **Model Development** (Inner Loop): Explore data, train models
3. **Continuous Integration**: Package and register models
4. **Model Deployment** (Outer Loop): Deploy to endpoints
5. **Continuous Deployment**: Test and promote to production
6. **Monitoring**: Track performance and detect drift

#### Monitoring Considerations:

**Performance Metrics:**
- Accuracy, Precision, Recall, F1-score
- AUC-ROC for classification
- RMSE, MAE for regression

**Drift Detection:**
- **Data Drift**: Input distribution changes
- **Concept Drift**: Relationship between features and labels changes

**Retraining Triggers:**
- Metrics fall below threshold
- Scheduled intervals
- Detected drift
- New data availability

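One way to turn "detected drift" into a retraining trigger is to compare the distribution of a feature at training time with its distribution in production. The sketch below uses the Population Stability Index (PSI), a technique chosen here for illustration rather than one named by this manual; the sample values, bin edges, and the 0.2 threshold are assumptions.

```python
import bisect
import math


def bin_fractions(values, bin_edges):
    """Share of values in each bin; out-of-range values go to the end bins."""
    counts = [0] * (len(bin_edges) - 1)
    for value in values:
        index = bisect.bisect_right(bin_edges, value) - 1
        index = min(max(index, 0), len(counts) - 1)
        counts[index] += 1
    # A small floor avoids log(0) and division by zero for empty bins.
    return [max(count / len(values), 1e-6) for count in counts]


def population_stability_index(expected, actual, bin_edges):
    """PSI = sum((actual - expected) * ln(actual / expected)) over bins."""
    e = bin_fractions(expected, bin_edges)
    a = bin_fractions(actual, bin_edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values (for example, customer age) and bin edges.
training_ages = [20, 22, 25, 30, 35, 40, 45, 50]
production_ages = [45, 50, 55, 60, 65, 70, 75, 80]
edges = [0, 30, 60, 100]

psi_same = population_stability_index(training_ages, training_ages, edges)
psi_shifted = population_stability_index(training_ages, production_ages, edges)
DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff; tune per use case
print(f"PSI (same data): {psi_same:.3f}")
print(f"PSI (shifted data): {psi_shifted:.3f}, drift: {psi_shifted > DRIFT_THRESHOLD}")
```

This only covers data drift on a single feature. Concept drift needs labeled production data, because it concerns the relationship between features and labels.
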
### 🎓 Module 1: Practice Questions

#### Question 1
Your company collects IoT sensor data every minute as JSON objects. What type of data is this?
- A) Structured
- B) Semi-structured ✓
- C) Unstructured

**Answer: B** - JSON is semi-structured: each object describes its own fields through key-value pairs, but no fixed tabular schema is enforced.

#### Question 2
Which Azure service should you use for quick iteration over multiple algorithms without extensive coding?
- A) Azure AI Services
- B) Automated Machine Learning ✓
- C) Azure Databricks

**Answer: B** - AutoML automatically tests multiple algorithms and preprocessing techniques.

#### Question 3
A mobile app needs instant predictions for credit card transactions. Which deployment should you use?
- A) Batch deployment
- B) Real-time deployment ✓
- C) Scheduled deployment

**Answer: B** - Real-time deployment provides immediate responses for transactional scenarios.

---

## Module 2: Azure Machine Learning Workspace

### 🎯 Learning Objectives
- Create and configure Azure ML workspaces
- Understand workspace resources and assets
- Use developer tools (Studio, Python SDK, Azure CLI)
- Manage compute resources

---

### 2.1 Azure ML Workspace Components

#### Core Resources:

| Resource | Description | Purpose |
|----------|-------------|----------|
| **Workspace** | Top-level resource | Central hub for ML operations |
| **Compute Instances** | Development VMs | Interactive development with Jupyter |
| **Compute Clusters** | Scalable compute | Training jobs and batch inference |
| **Datastores** | Storage connections | Link to Azure Storage accounts |
| **Linked Services** | External services | Azure Synapse, Azure Databricks |

#### Workspace Assets:

- **Models**: Trained ML models
- **Environments**: Package dependencies
- **Data**: Datasets and data assets
- **Components**: Reusable pipeline steps
- **Pipelines**: End-to-end workflows

### 2.2 Setting Up Azure ML Workspace

#### Prerequisites:
```python
# Required Azure Resources
"""
1. Azure Subscription
2. Resource Group
3. Azure Machine Learning Workspace
4. Associated Storage Account (created with the workspace by default)
5. Application Insights (created with the workspace by default, for monitoring)
6. Key Vault (created with the workspace by default, for secrets)
"""
```

```python
# Installation - Python SDK v2
# Run this in a notebook cell (omit the leading "!" in a terminal)
!pip install azure-ai-ml azure-identity
```

```python
# Connect to Azure ML Workspace
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Replace with your details
subscription_id = "<your-subscription-id>"
resource_group = "<your-resource-group>"
workspace_name = "<your-workspace-name>"

# Authenticate and create client
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name
)

print(f"Connected to workspace: {ml_client.workspace_name}")
```

### 2.3 Role-Based Access Control (RBAC)

#### Common Roles:

| Role | Permissions | Use Case |
|------|-------------|----------|
| **Owner** | Full control | Workspace administrators |
| **Contributor** | Create/modify resources | Team leads, senior engineers |
| **AzureML Data Scientist** | Run experiments, train models | Data scientists, ML engineers |
| **AzureML Compute Operator** | Manage compute | DevOps, cost managers |
| **Reader** | View only | Stakeholders, auditors |

#### Business Analyst Role:
- Typically assigned **AzureML Data Scientist** or **Reader**
- Can view results as a Reader; running experiments requires AzureML Data Scientist
- May need custom roles for specific business needs

### 2.4 Developer Tools

#### 1. Azure Machine Learning Studio
- **Access**: https://ml.azure.com
- **Use Case**: Visual interface, no-code/low-code development
- **Features**:
  - Designer (drag-and-drop ML)
  - AutoML interface
  - Experiment tracking
  - Model management

#### 2. Python SDK (v2)
- **Use Case**: Code-first development, automation
- **Best For**: Data scientists, ML engineers
- **Features**:
  - Full programmatic control
  - Integration with notebooks
  - Reproducible pipelines

#### 3. Azure CLI
- **Use Case**: DevOps, CI/CD pipelines, automation
- **Best For**: Data engineers, DevOps engineers
- **Features**:
  - Command-line automation
  - YAML-based configurations
  - Script integration

```python
# Example: List all compute targets in workspace
# (uses the ml_client created in section 2.2)
compute_list = ml_client.compute.list()

print("Available compute resources:")
for compute in compute_list:
    print(f"- {compute.name} (Type: {compute.type})")
```

### 2.5 Working with Compute

#### Compute Types:

**1. Compute Instance (Dev Environment):**
- Personal development VM
- Jupyter, VS Code, RStudio
- Always-on or scheduled
- Good for: Interactive development

**2. Compute Cluster (Training):**
- Auto-scaling cluster
- Min/max nodes configuration
- Pay only when running
- Good for: Distributed training, batch jobs

**3. Inference Cluster (Deployment):**
- Azure Kubernetes Service (AKS)
- Managed endpoints
- Production-grade
- Good for: Real-time inferencing

```python
# Create a compute cluster
from azure.ai.ml.entities import AmlCompute

compute_cluster = AmlCompute(
    name="cpu-cluster",
    type="amlcompute",
    size="STANDARD_DS3_V2",  # VM size
    min_instances=0,          # Scale down to 0 when idle
    max_instances=4,          # Max 4 nodes
    idle_time_before_scale_down=120  # Wait 120 seconds (2 min) before scaling down
)

# Create the compute cluster
ml_client.compute.begin_create_or_update(compute_cluster).result()
print(f"Compute cluster '{compute_cluster.name}' created!")
```

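The effect of `min_instances` and `max_instances` can be pictured with a deliberately simplified model: the cluster aims for as many nodes as the queued work needs, clamped to the configured range. The function below is a hypothetical illustration, not the service's actual scheduling logic.

```python
def target_node_count(queued_jobs, nodes_per_job, min_instances, max_instances):
    """Nodes a cluster would aim for: demand clamped to the configured range."""
    demand = queued_jobs * nodes_per_job
    return max(min_instances, min(demand, max_instances))


# Same limits as the cluster above: min_instances=0, max_instances=4.
idle = target_node_count(queued_jobs=0, nodes_per_job=1, min_instances=0, max_instances=4)
light = target_node_count(queued_jobs=2, nodes_per_job=1, min_instances=0, max_instances=4)
heavy = target_node_count(queued_jobs=10, nodes_per_job=1, min_instances=0, max_instances=4)
print(idle, light, heavy)
```

With a minimum of zero, an idle cluster releases all of its nodes, which is why the setting matters for cost.
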
### 2.6 Job Types in Azure ML

#### Three Main Job Types:

| Job Type | Purpose | When to Use |
|----------|---------|-------------|
| **Command** | Execute single script | Training a model, data preprocessing |
| **Sweep** | Hyperparameter tuning | Optimize model parameters |
| **Pipeline** | Multi-step workflow | End-to-end ML pipelines |

#### Job Execution Flow:
```
Submit Job → Queue → Provision Compute → Run Script → Log Outputs → Cleanup
```

### 🎓 Module 2: Practice Questions

#### Question 1
A data scientist needs to run training scripts as jobs. Which role provides the necessary permissions?
- A) Reader
- B) AzureML Data Scientist ✓
- C) AzureML Compute Operator

**Answer: B** - AzureML Data Scientist role allows running experiments and training jobs.

#### Question 2
What type of job should you use to train a single model with one script?
- A) Command ✓
- B) Pipeline
- C) Sweep

**Answer: A** - Command job executes a single script for model training.

#### Question 3
Which tool is best for automating weekly model retraining in a CI/CD pipeline?
- A) Azure Machine Learning Studio
- B) Python SDK
- C) Azure CLI ✓

**Answer: C** - Azure CLI integrates well with DevOps and CI/CD automation.

---

## Module 3: Automated Machine Learning (AutoML)

### 🎯 Learning Objectives
- Understand AutoML capabilities and benefits
- Configure AutoML experiments
- Select appropriate tasks and metrics
- Interpret AutoML results

---

### 3.1 What is Automated Machine Learning?

#### Key Concept:
AutoML automates the process of:
- Feature engineering
- Algorithm selection
- Hyperparameter tuning
- Model evaluation

#### Benefits:
- **Speed**: Train multiple models in parallel
- **Efficiency**: No manual trial-and-error
- **Quality**: Find optimal model configuration
- **Accessibility**: Less ML expertise required

#### When to Use AutoML:
- ✅ Quick prototyping and baseline models
- ✅ Limited ML expertise on the team
- ✅ Standard ML tasks (classification, regression)
- ✅ Need to compare multiple algorithms

- ❌ Highly specialized custom models
- ❌ Need complete control over every parameter
- ❌ Novel ML architectures

### 3.2 AutoML Task Types

#### Supported Tasks:

| Task | Use Case | Example | Output |
|------|----------|---------|--------|
| **Classification** | Predict categories | Email spam detection | Binary/Multi-class |
| **Regression** | Predict numbers | House price prediction | Continuous value |
| **Time-series Forecasting** | Predict future values | Sales forecasting | Time-ordered predictions |
| **Computer Vision** | Image tasks | Object detection | Bounding boxes, labels |
| **NLP** | Text analysis | Sentiment classification | Text labels |

#### Business Analyst Use Cases:

**Classification:**
- Customer churn prediction
- Lead scoring
- Product recommendation

**Regression:**
- Revenue forecasting
- Customer lifetime value
- Pricing optimization

**Time-series:**
- Demand forecasting
- Inventory planning
- Budget prediction

### 3.3 Data Preparation for AutoML

#### Data Requirements:

**Classification & Regression:**
- Tabular format (CSV, Parquet)
- Clean column headers
- Target column identified

**Data Asset Types (SDK v2):**
1. **URI file** (`uri_file`): Points to a single file in storage
2. **URI folder** (`uri_folder`): Points to a folder in storage
3. **MLTable** (`mltable`): Tabular data with a schema definition

#### Data Quality Checklist:
- ✅ Remove duplicates
- ✅ Handle missing values
- ✅ Encode categorical variables
- ✅ Normalize numerical features
- ✅ Split train/test appropriately

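Three of the checklist items (duplicates, missing values, train/test split) can be sketched with the standard library alone. The rows below are made-up `(age, glucose)` pairs, and median imputation with an 80/20 split are illustrative choices rather than requirements.

```python
import random
import statistics

# Hypothetical rows: (age, glucose); None marks a missing value.
rows = [(50, 148.0), (31, None), (50, 148.0), (32, 89.0), (21, 137.0), (33, 116.0)]

# 1. Remove exact duplicates while keeping the original order.
unique_rows = list(dict.fromkeys(rows))

# 2. Impute missing glucose values with the median of the observed ones.
observed = [glucose for _, glucose in unique_rows if glucose is not None]
median_glucose = statistics.median(observed)
clean_rows = [
    (age, median_glucose if glucose is None else glucose)
    for age, glucose in unique_rows
]

# 3. Shuffle with a fixed seed and hold out 20% for testing.
shuffled = clean_rows[:]
random.Random(42).shuffle(shuffled)
split_at = int(len(shuffled) * 0.8)
train_rows, test_rows = shuffled[:split_at], shuffled[split_at:]

print(f"{len(rows)} raw rows, {len(unique_rows)} after deduplication")
print(f"Imputed glucose median: {median_glucose}")
print(f"{len(train_rows)} training rows, {len(test_rows)} test rows")
```
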
```python
# Reference an existing MLTable data asset as a job input
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import Input

# Point to a data asset that is already registered in the workspace
my_training_data = Input(
    type=AssetTypes.MLTABLE,
    path="azureml:diabetes-data:1"  # name:version
)

print(f"Data asset configured: {my_training_data.path}")
```

### 3.4 Featurization in AutoML

#### Automatic Featurization:

**Scaling & Normalization:**
- StandardScaler: Mean=0, StdDev=1
- MinMaxScaler: Range [0,1]
- RobustScaler: Handles outliers

**Missing Value Imputation:**
- Mean/Median for numerical
- Mode for categorical
- Forward fill for time-series

**Categorical Encoding:**
- One-hot encoding (< 10 categories)
- Label encoding (many categories)
- Target encoding (advanced)

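The transformations listed above are easy to state precisely. The sketch below implements standard scaling, min-max scaling, and one-hot encoding in plain Python to show what each one computes; it illustrates the math only and is not AutoML's internal implementation.

```python
import statistics


def standard_scale(values):
    """Rescale to mean 0 and standard deviation 1 (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def one_hot(values):
    """Encode each category as a 0/1 vector, one position per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


ages = [20.0, 30.0, 40.0]
scaled_ages = standard_scale(ages)
print(min_max_scale(ages))
print(scaled_ages)
print(one_hot(["red", "blue", "red"]))
```

Both scalers assume the input has at least two distinct values; a constant column would need special handling to avoid dividing by zero.
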
#### Configuration Options:

```python
# Configure featurization
# Illustrative settings collected in a plain dict. In SDK v2, featurization
# is applied to an AutoML job through its set_featurization(...) method.
featurization_config = {
    "mode": "auto",  # auto, custom, or off
    "blocked_transformers": ["LabelEncoder"],  # Exclude specific transformers
    "column_purposes": {
        "customer_id": "Ignore",  # Don't use this column
        "purchase_date": "DateTime"  # Treat as datetime
    },
    "transformer_params": {
        "Imputer": {"strategy": "median"}  # Custom imputation
    }
}

print("Featurization configured")
```

### 3.5 Configuring an AutoML Classification Experiment

```python
# Complete AutoML classification example
from azure.ai.ml import Input, automl
from azure.ai.ml.constants import AssetTypes

# Configure data
my_training_data = Input(
    type=AssetTypes.MLTABLE,
    path="azureml:customer-churn-data:1"
)

# Create classification job
classification_job = automl.classification(
    compute="cpu-cluster",
    experiment_name="customer-churn-classification",
    training_data=my_training_data,
    target_column_name="Churn",  # Column to predict
    primary_metric="accuracy",   # Optimization metric
    n_cross_validations=5,       # Cross-validation folds
    enable_model_explainability=True
)

# Set training limits (configured on the job, not in automl.classification)
classification_job.set_limits(
    timeout_minutes=60,
    trial_timeout_minutes=20,
    max_trials=20,
    max_concurrent_trials=4
)

# Use automatic featurization
classification_job.set_featurization(mode="auto")

# Submit the job
returned_job = ml_client.jobs.create_or_update(classification_job)
print(f"Job submitted: {returned_job.name}")
```

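The `n_cross_validations=5` setting means each candidate model is trained and validated five times on different splits of the training data. The following sketch shows how such folds can be formed from row indices; it is a generic k-fold illustration, and the exact splitting that AutoML performs (for example, shuffling or stratification) is not specified here.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train, validation in folds:
    print(f"train={train} validation={validation}")
```

Every row is used for validation exactly once, so the averaged metric is less sensitive to one lucky or unlucky split.
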
### 3.6 AutoML Metrics

#### Classification Metrics:

| Metric | Best For | Range | Goal |
|--------|----------|-------|------|
| **Accuracy** | Balanced classes | [0, 1] | Maximize |
| **Precision** | Minimize false positives | [0, 1] | Maximize |
| **Recall** | Minimize false negatives | [0, 1] | Maximize |
| **F1 Score** | Balance precision/recall | [0, 1] | Maximize |
| **AUC ROC** | Overall performance | [0, 1] | Maximize |

#### Regression Metrics:

| Metric | Description | Goal |
|--------|-------------|------|
| **R² Score** | Variance explained | Maximize (closer to 1) |
| **RMSE** | Root mean squared error | Minimize |
| **MAE** | Mean absolute error | Minimize |
| **MAPE** | Mean absolute percentage error | Minimize |

#### Choosing the Right Metric:

**Business Impact Questions:**
- What's more costly: false positives or false negatives?
- Is class imbalance present?
- What's the business KPI?

**Example Scenarios:**
- Medical diagnosis → Recall (catch all sick patients)
- Spam detection → Precision (don't block real emails)
- Credit scoring → F1 Score (balance both)

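The classification metrics above all derive from the four confusion-matrix counts. The sketch below computes them for a made-up, heavily imbalanced fraud dataset to show why accuracy alone can mislead; the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical fraud dataset: 1,000 transactions, 20 of them fraudulent.
metrics = classification_metrics(tp=5, fp=5, fn=15, tn=975)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy looks excellent because most transactions are legitimate, while recall shows that three quarters of the fraud cases were missed.
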
### 3.7 Interpreting AutoML Results

```python
# Retrieve and analyze AutoML results
# (uses the ml_client created in section 2.2)

# Get job details; replace the placeholder with your own job name
job_name = "customer-churn-classification_123456"
job = ml_client.jobs.get(job_name)

print(f"Job Status: {job.status}")
print(f"Best Model: {job.properties.get('best_child_run_id')}")

# Download outputs
ml_client.jobs.download(job_name, download_path="./outputs")
print("Results downloaded to ./outputs")
```

#### What to Look For:

1. **Best Model**: Algorithm with highest metric score
2. **Feature Importance**: Which features matter most?
3. **Confusion Matrix**: Classification performance breakdown
4. **Training Time**: Cost-performance tradeoff
5. **Model Explainability**: SHAP values, feature contributions

### 🎓 Module 3: Practice Questions

#### Question 1
Your marketing team wants to predict customer churn (Yes/No). Which AutoML task should you use?
- A) Forecasting
- B) Regression
- C) Classification ✓

**Answer: C** - Churn is a binary classification problem (churn vs not churn).

#### Question 2
A medical company wants to detect abnormalities in X-ray images. Which task should be used?
- A) Forecasting
- B) Computer Vision ✓
- C) Natural Language Processing

**Answer: B** - Image classification/object detection requires computer vision.

#### Question 3
For fraud detection where missing fraud is very costly, which metric should you optimize?
- A) Precision
- B) Recall ✓
- C) Accuracy

**Answer: B** - Recall minimizes false negatives (missed fraud cases).

---

## Module 4: Training Custom Models

### 🎯 Learning Objectives
- Train models using Jupyter notebooks
- Run training scripts as command jobs
- Track experiments with MLflow
- Perform hyperparameter tuning with sweep jobs

---

### 4.1 Training in Notebooks vs Scripts

#### Comparison:

| Aspect | Notebooks | Scripts |
|--------|-----------|----------|
| **Use Case** | Exploration, prototyping | Production, automation |
| **Execution** | Interactive, cell-by-cell | End-to-end, automated |
| **Tracking** | Manual logging | Automatic job tracking |
| **Reproducibility** | Lower | Higher |
| **Scalability** | Limited to instance | Distributed compute |
| **Version Control** | Harder | Easier (Git) |

#### Workflow:
```
Exploration (Notebook) → Refine (Script) → Production (Job)
```

### 4.2 MLflow Tracking

#### What is MLflow?
- Open-source platform for ML lifecycle
- Tracks experiments, parameters, metrics
- Integrated with Azure ML
- Language-agnostic (Python, R, Java)

#### Key Concepts:

- **Experiment**: Collection of related runs
- **Run**: Single execution of training code
- **Parameters**: Input values (learning rate, batch size)
- **Metrics**: Output values (accuracy, loss)
- **Artifacts**: Files (models, plots, data)

```python
# Install MLflow (omit the leading "!" in a terminal)
!pip install mlflow azureml-mlflow
```

```python
# Example: Training with MLflow in a notebook
import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Set experiment
mlflow.set_experiment("diabetes-classification")

# Start MLflow run
with mlflow.start_run():
    # Load data
    df = pd.read_csv('diabetes.csv')

    # Prepare data
    X = df.drop('Diabetic', axis=1)
    y = df['Diabetic']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Log parameters
    n_estimators = 100
    max_depth = 10
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("test_size", 0.2)

    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42
    )
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Log metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)

    # Log model
    mlflow.sklearn.log_model(model, "random_forest_model")

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
```

### 4.3 Running Scripts as Command Jobs

#### Why Use Command Jobs?
- Reproducible execution
- Automatic tracking
- Distributed compute
- Environment management
- Easy scheduling

#### Training Script Structure:

```python
# training_script.py (save this as a separate file)
"""
import argparse
import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--data-path', type=str, help='Path to training data')
parser.add_argument('--n-estimators', type=int, default=100)
parser.add_argument('--max-depth', type=int, default=10)
args = parser.parse_args()

# Enable autologging
mlflow.sklearn.autolog()

# Load data
df = pd.read_csv(args.data_path)
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = RandomForestClassifier(
    n_estimators=args.n_estimators,
    max_depth=args.max_depth
)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Log test accuracy under an explicit name so a sweep job can use it as primary metric
mlflow.log_metric('accuracy', accuracy)
print(f'Accuracy: {accuracy}')
"""
print("Training script template created")
```

```python
# Submit a command job
from azure.ai.ml import command, Input
from azure.ai.ml.constants import AssetTypes

# Configure input data
data_input = Input(
    type=AssetTypes.URI_FILE,
    path="azureml:diabetes-data:1"
)

# Create command job
job = command(
    code="./src",  # Folder with training script
    command="python training_script.py --data-path ${{inputs.data}} --n-estimators ${{inputs.n_estimators}}",
    inputs={
        "data": data_input,
        "n_estimators": 200
    },
    environment="azureml:sklearn-env:1",
    compute="cpu-cluster",
    experiment_name="diabetes-training",
    display_name="random-forest-training"
)

# Submit job
returned_job = ml_client.jobs.create_or_update(job)
print(f"Job submitted: {returned_job.name}")
print(f"Studio URL: {returned_job.studio_url}")
```

### 4.4 Hyperparameter Tuning with Sweep Jobs

#### What is Hyperparameter Tuning?
Systematically searching for the best combination of hyperparameters to optimize model performance.

#### Search Strategies:

| Strategy | How It Works | Best For |
|----------|--------------|----------|
| **Grid Search** | Try all combinations | Small parameter spaces |
| **Random Search** | Random sampling | Large parameter spaces |
| **Bayesian** | Smart sampling based on previous results | Expensive evaluations |

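The difference between the first two strategies comes down to how configurations are generated. The sketch below enumerates a grid and draws a fixed number of random samples from the same hypothetical search space; Bayesian sampling is omitted because it depends on the results of earlier trials.

```python
import itertools
import random

# Hypothetical search space with two discrete hyperparameters.
search_space = {
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [5, 10, 15, 20],
}

# Grid search: every combination (4 x 4 = 16 configurations).
names = list(search_space)
grid = [
    dict(zip(names, combo))
    for combo in itertools.product(*search_space.values())
]

# Random search: a fixed budget of samples, independent of the grid size.
rng = random.Random(0)
random_trials = [
    {name: rng.choice(options) for name, options in search_space.items()}
    for _ in range(5)
]

print(f"Grid search would run {len(grid)} trials")
print(f"Random search runs {len(random_trials)} trials:")
for trial in random_trials:
    print(trial)
```

Adding a third hyperparameter with four values would quadruple the grid, while the random budget stays whatever you set it to.
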
#### Common Hyperparameters:

**Tree-based Models:**
- n_estimators (number of trees)
- max_depth (tree depth)
- min_samples_split
- learning_rate (for boosting)

**Neural Networks:**
- learning_rate
- batch_size
- number of layers
- dropout_rate

```python
# Example: Sweep job for hyperparameter tuning
from azure.ai.ml import command
from azure.ai.ml.sweep import BanditPolicy, Choice

# Define command job (template); data_input comes from the command job example in 4.3
job_to_sweep = command(
    code="./src",
    command="python training_script.py --data-path ${{inputs.data}} --n-estimators ${{search_space.n_estimators}} --max-depth ${{search_space.max_depth}}",
    inputs={
        "data": data_input
    },
    environment="azureml:sklearn-env:1",
    compute="cpu-cluster"
)

# Define search space (only arguments that training_script.py accepts;
# RandomForestClassifier has no learning_rate)
sweep_job = job_to_sweep.sweep(
    sampling_algorithm="random",
    primary_metric="accuracy",  # Must match a metric name logged by the script
    goal="Maximize",
    search_space={
        "n_estimators": Choice([50, 100, 200, 300]),
        "max_depth": Choice([5, 10, 15, 20])
    }
)

# Set limits (timeout is given in seconds)
sweep_job.set_limits(
    max_total_trials=20,
    max_concurrent_trials=4,
    timeout=3600
)

# Early termination policy (stop poor performers)
sweep_job.early_termination = BanditPolicy(
    slack_factor=0.1,
    evaluation_interval=1,
    delay_evaluation=5
)

# Submit sweep job
returned_sweep = ml_client.jobs.create_or_update(sweep_job)
print(f"Sweep job submitted: {returned_sweep.name}")
```

### 4.5 Early Termination Policies

Save compute costs by stopping underperforming trials early.

#### Policy Types:

**1. Bandit Policy:**
- Terminates runs that don't perform within a slack factor of the best run
- Good for: Most scenarios

**2. Median Stopping:**
- Terminates runs whose best metric is worse than the median of the running averages across all runs
- Good for: Conservative stopping

**3. Truncation Selection:**
- Cancels a percentage of lowest-performing runs
- Good for: Aggressive stopping

```python
# Early termination policy examples
from azure.ai.ml.sweep import BanditPolicy, MedianStoppingPolicy, TruncationSelectionPolicy

# Bandit Policy
bandit_policy = BanditPolicy(
    slack_factor=0.15,  # Allow 15% slack from best performance
    evaluation_interval=2,  # Check every 2 iterations
    delay_evaluation=5  # Don't evaluate until 5 iterations complete
)

# Median Stopping Policy
median_policy = MedianStoppingPolicy(
    evaluation_interval=1,
    delay_evaluation=5
)

# Truncation Selection Policy
truncation_policy = TruncationSelectionPolicy(
    truncation_percentage=20,  # Cancel bottom 20% of runs
    evaluation_interval=1,
    delay_evaluation=5
)

print("Early termination policies configured")
```

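To see what the Bandit and Truncation rules decide in practice, the sketch below applies them to a handful of made-up trial scores. It assumes a metric that should be maximized and a ratio-based slack rule, where the cutoff is the best score divided by `1 + slack_factor`; the rounding used for the truncation count is also an assumption.

```python
def bandit_should_stop(trial_metric, best_metric, slack_factor):
    """Bandit rule for a metric to maximize: stop if below best / (1 + slack)."""
    return trial_metric < best_metric / (1 + slack_factor)


def truncation_to_cancel(trial_metrics, truncation_percentage):
    """Return the names of the lowest-performing trials to cancel."""
    n_cancel = int(len(trial_metrics) * truncation_percentage / 100)
    ranked = sorted(trial_metrics, key=trial_metrics.get)  # worst first
    return ranked[:n_cancel]


# Hypothetical accuracy of five trials at the same evaluation interval.
trials = {"t1": 0.80, "t2": 0.70, "t3": 0.60, "t4": 0.75, "t5": 0.50}
best = max(trials.values())

stopped = [
    name for name, metric in trials.items()
    if bandit_should_stop(metric, best, slack_factor=0.2)
]
cancelled = truncation_to_cancel(trials, truncation_percentage=20)
print(f"Bandit cutoff: {best / 1.2:.3f}, stopped: {stopped}")
print(f"Truncation (20%) cancels: {cancelled}")
```

A smaller slack factor raises the cutoff and stops more trials, which saves compute but risks ending a slow starter too early.
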
### 🎓 Module 4: Practice Questions

#### Question 1
What's the primary advantage of using command jobs over notebook training?
- A) Easier to write
- B) Better for exploration
- C) More reproducible and scalable ✓

**Answer: C** - Command jobs provide reproducibility, automatic tracking, and distributed compute.

#### Question 2
Which hyperparameter tuning strategy is best for large parameter spaces?
- A) Grid Search
- B) Random Search ✓
- C) Manual tuning

**Answer: B** - Random search efficiently samples large spaces without trying all combinations.

#### Question 3
What does MLflow autolog() do?
- A) Automatically saves the model
- B) Automatically logs parameters, metrics, and models ✓
- C) Automatically deploys the model

**Answer: B** - autolog() tracks parameters, metrics, and artifacts automatically.

---

## Module 5: Pipelines and Model Management

### 🎯 Learning Objectives
- Design and implement ML pipelines
- Register and version models
- Use Responsible AI dashboard
- Prepare models for deployment

---

### 5.1 ML Pipelines Overview

#### What is an ML Pipeline?
A sequence of interconnected steps that automate the ML workflow from data preparation to model training.

#### Benefits:
- **Reproducibility**: Same inputs = same outputs
- **Automation**: Schedule and trigger automatically
- **Modularity**: Reuse components across pipelines
- **Scalability**: Parallel execution of steps
- **Collaboration**: Share pipelines across teams

#### Common Pipeline Steps:
1. Data ingestion
2. Data validation
3. Data transformation
4. Feature engineering
5. Model training
6. Model evaluation
7. Model registration

#### Business Value:
- Faster time to production
- Consistent model quality
- Easier compliance and auditing
- Reduced manual errors

```python
# Example: Simple 3-Step Pipeline
from azure.ai.ml import Input, load_component
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline

# Load pre-built components (can also create custom)
prep_data = load_component(source="./components/prep_data.yml")
train_model = load_component(source="./components/train_model.yml")
evaluate_model = load_component(source="./components/evaluate_model.yml")

# Define pipeline
@pipeline(
    name="diabetes-training-pipeline",
    description="Complete training pipeline for diabetes prediction",
    compute="cpu-cluster"
)
def diabetes_pipeline(pipeline_input_data):
    # Step 1: Prepare data
    prep_step = prep_data(input_data=pipeline_input_data)

    # Step 2: Train model
    train_step = train_model(
        training_data=prep_step.outputs.output_data,
        n_estimators=100,
        max_depth=10
    )

    # Step 3: Evaluate model
    eval_step = evaluate_model(
        model=train_step.outputs.model,
        test_data=prep_step.outputs.test_data
    )

    return {
        "trained_model": train_step.outputs.model,
        "evaluation_results": eval_step.outputs.metrics
    }

# Create pipeline instance
pipeline_job = diabetes_pipeline(
    pipeline_input_data=Input(type=AssetTypes.URI_FOLDER, path="azureml:diabetes-data:1")
)

# Submit pipeline
returned_pipeline = ml_client.jobs.create_or_update(pipeline_job)
print(f"Pipeline submitted: {returned_pipeline.name}")
```

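The wiring in the pipeline above forms a dependency graph: a step can only run once the steps that produce its inputs have finished. As a standalone illustration (not how Azure ML schedules steps internally), the standard-library `graphlib` module can derive a valid execution order from the same three step names.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
dependencies = {
    "prep_data": set(),
    "train_model": {"prep_data"},
    "evaluate_model": {"train_model", "prep_data"},
}

# static_order() yields steps so that dependencies always come first.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Steps with no dependency between them could run in parallel, which is where the scalability benefit of pipelines comes from.
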
### 5.2 Creating Reusable Components

#### Component Structure:
A component consists of:
- **Interface**: Inputs, outputs, parameters
- **Implementation**: Script to execute
- **Environment**: Dependencies

#### Component YAML Example:

In [None]:
# prep_data.yml
"""
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: prep_data
display_name: Prepare Data
version: 1
type: command

inputs:
  input_data:
    type: uri_folder
    description: Raw input data
  test_split:
    type: number
    default: 0.2
    description: Test set percentage

outputs:
  output_data:
    type: uri_folder
    description: Prepared training data
  test_data:
    type: uri_folder
    description: Test data

code: ./src
environment: azureml:sklearn-env:1
command: >
  python prep_data.py 
  --input-data ${{inputs.input_data}}
  --test-split ${{inputs.test_split}}
  --output-data ${{outputs.output_data}}
  --test-data ${{outputs.test_data}}
"""
print("Component YAML template")
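
The component's `command` calls a `prep_data.py` script that is not shown in this section. A minimal sketch of what such a script might look like follows, assuming the input folder holds CSV files with a header row. The helper names `split_rows` and `write_csv`, the output file name `data.csv`, and the fixed seed are illustrative choices, not part of the original component.

```python
import argparse
import csv
import random
from pathlib import Path


def split_rows(rows, test_split, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_split))
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(folder, header, rows):
    """Write rows to <folder>/data.csv, creating the folder if needed."""
    Path(folder).mkdir(parents=True, exist_ok=True)
    with open(Path(folder) / "data.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def main(argv=None):
    # Argument names match the command line in the component YAML
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-data", required=True)
    parser.add_argument("--test-split", type=float, default=0.2)
    parser.add_argument("--output-data", required=True)
    parser.add_argument("--test-data", required=True)
    args = parser.parse_args(argv)

    # Read every CSV file in the input folder (assumes a shared header row)
    header, rows = None, []
    for csv_file in sorted(Path(args.input_data).glob("*.csv")):
        with open(csv_file, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            rows.extend(reader)
    if header is None:
        raise FileNotFoundError(f"No CSV files found in {args.input_data}")

    train, test = split_rows(rows, args.test_split)
    write_csv(args.output_data, header, train)
    write_csv(args.test_data, header, test)
```

In the script file itself, finish with `if __name__ == "__main__": main()` so that the component's command line triggers the split.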

### 5.3 Model Registration and Versioning

#### Why Register Models?
- **Version Control**: Track model evolution
- **Lineage**: Know how model was created
- **Governance**: Control model access
- **Deployment**: Deploy from registry
- **Comparison**: Compare model versions

#### Model Metadata:
- Name and version
- Description and tags
- Training job ID
- Metrics and parameters
- Framework (sklearn, tensorflow, pytorch)
- Model file location

In [None]:
# Register a model
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

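# Register from job output.
# `job_id` is a placeholder: set it to the name of the completed training job
# (for a pipeline run, the child job that produced the model).
job_id = "<training-job-name>"
model = Model(
    path=f"azureml://jobs/{job_id}/outputs/artifacts/paths/model/",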
    name="diabetes-classifier",
    description="Random Forest classifier for diabetes prediction",
    type=AssetTypes.MLFLOW_MODEL,
    tags={
        "framework": "sklearn",
        "algorithm": "RandomForest",
        "accuracy": "0.89"
    }
)

# Register the model
registered_model = ml_client.models.create_or_update(model)
print(f"Model registered: {registered_model.name} version {registered_model.version}")

In [None]:
# List and retrieve models
# List all versions of a model
models = ml_client.models.list(name="diabetes-classifier")

print("Available model versions:")
for model in models:
    print(f"- Version {model.version}: {model.description}")

# Get specific version
model_v1 = ml_client.models.get(name="diabetes-classifier", version="1")
print(f"\nRetrieved: {model_v1.name} v{model_v1.version}")

### 5.4 Responsible AI Dashboard

#### Purpose:
Assess and improve ML model fairness, interpretability, and reliability.

#### Dashboard Components:

| Component | What It Shows | Why It Matters |
|-----------|---------------|----------------|
| **Error Analysis** | Where model fails | Target improvements |
| **Model Explanations** | Feature importance | Build trust, debug |
| **Fairness Assessment** | Performance across groups | Ensure equity |
| **Causal Analysis** | Treatment effects | What-if scenarios |
| **Counterfactuals** | Minimum changes for different prediction | Actionable insights |

#### Business Analyst Use Cases:
- Identify model biases
- Explain predictions to stakeholders
- Ensure regulatory compliance
- Build customer trust

#### Creating a Responsible AI Dashboard:

In [None]:
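# Example: Create Responsible AI insights
# Uses the open-source `responsibleai` and `raiwidgets` packages. In Azure ML,
# equivalent insights are generated by the built-in Responsible AI pipeline
# components and published to the workspace as a dashboard.
# Assumptions: `model` is the fitted classifier; `train_df` and `test_df` are
# pandas DataFrames that include the target column.
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

# Create RAI insights
rai_insights = RAIInsights(
    model=model,
    train=train_df,
    test=test_df,
    target_column="Diabetic",
    task_type="classification"
)

# Add components
rai_insights.explainer.add()  # model explanations (feature importance)

rai_insights.error_analysis.add(
    max_depth=3,
    num_leaves=15
)

# Fairness is not a separate component here: compare model performance across
# cohorts built on sensitive features (for example "Age" and "Gender") in the
# dashboard, or use a dedicated library such as Fairlearn.

# Compute the insights and render the dashboard
rai_insights.compute()
ResponsibleAIDashboard(rai_insights)
print("Responsible AI dashboard configured")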

### 🎓 Module 5: Practice Questions

#### Question 1
What's the main benefit of using ML pipelines?
- A) Faster training
- B) Better accuracy
- C) Reproducibility and automation ✓

**Answer: C** - Pipelines ensure consistent, repeatable workflows that can be automated.

#### Question 2
Why should you register models in Azure ML?
- A) It's required for deployment
- B) For version control and lineage tracking ✓
- C) To improve model accuracy

**Answer: B** - Registration enables versioning, tracking, and governance.

#### Question 3
Which Responsible AI component shows where your model performs poorly?
- A) Model Explanations
- B) Error Analysis ✓
- C) Fairness Assessment

**Answer: B** - Error Analysis identifies cohorts where the model fails.

---

## Module 6: Model Deployment

### 🎯 Learning Objectives
- Deploy models to managed online endpoints
- Deploy models to batch endpoints
- Monitor deployed models
- Implement blue-green deployments

---

### 6.1 Deployment Options

#### Endpoint Types:

| Type | Use Case | Latency | Cost | Availability |
|------|----------|---------|------|--------------|
| **Managed Online** | Real-time predictions | Low (ms) | Higher | Always on |
| **Batch** | Bulk predictions | High (min-hrs) | Lower | On-demand |
| **Real-time (AKS)** | High-scale production | Low (ms) | Highest | Always on |

#### Decision Flowchart:
```
Need real-time predictions?
├─ Yes → High traffic expected?
│         ├─ Yes → Real-time (AKS)
│         └─ No  → Managed Online Endpoint
└─ No  → Batch Endpoint
```
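
The flowchart above can also be written as a small helper, which is handy when deployment decisions need to be documented in code. This is only a sketch: `choose_endpoint` and its two boolean parameters are illustrative names, and real decisions usually also weigh cost and operational constraints.

```python
def choose_endpoint(needs_real_time, high_traffic=False):
    """Return the endpoint type suggested by the decision flowchart."""
    if not needs_real_time:
        return "Batch Endpoint"
    if high_traffic:
        return "Real-time (AKS)"
    return "Managed Online Endpoint"


print(choose_endpoint(needs_real_time=True, high_traffic=False))
print(choose_endpoint(needs_real_time=False))
```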

### 6.2 Managed Online Endpoints

#### Key Features:
- Automatic scaling
- Blue-green deployments
- Built-in monitoring
- Authentication (key/token)
- HTTPS endpoint

#### Deployment Process:
1. Register model
2. Create endpoint
3. Create deployment
4. Allocate traffic
5. Test endpoint
6. Monitor performance

In [None]:
# Deploy to Managed Online Endpoint
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration
)

# Step 1: Create endpoint
endpoint = ManagedOnlineEndpoint(
    name="diabetes-endpoint",
    description="Endpoint for diabetes predictions",
    auth_mode="key"  # or 'aml_token'
)

# Create the endpoint
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()
print(f"Endpoint created: {endpoint.name}")

# Step 2: Create deployment
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="diabetes-endpoint",
    model="azureml:diabetes-classifier:1",
    instance_type="Standard_DS3_v2",
    instance_count=1,
    environment="azureml:sklearn-env:1",
    code_configuration=CodeConfiguration(
        code="./deployment",
        scoring_script="score.py"
    )
)

# Deploy
deployment = ml_client.online_deployments.begin_create_or_update(deployment).result()
print(f"Deployment created: {deployment.name}")

# Step 3: Allocate traffic
endpoint.traffic = {"blue": 100}  # 100% to blue deployment
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
print("Traffic allocated to blue deployment")

#### Scoring Script (score.py):

In [None]:
# score.py - Scoring script for deployment
"""
import json
import joblib
import numpy as np
import os

def init():
    global model
    # Load model
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'model.pkl')
    model = joblib.load(model_path)
    print('Model loaded')

def run(raw_data):
    try:
        # Parse input
        data = json.loads(raw_data)['data']
        data = np.array(data)
        
        # Make predictions
        predictions = model.predict(data)
        probabilities = model.predict_proba(data)
        
        # Return results
        return {
            'predictions': predictions.tolist(),
            'probabilities': probabilities.tolist()
        }
    except Exception as e:
        return {'error': str(e)}
"""
print("Scoring script template")
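
Before deploying, the request and response contract of `run()` can be exercised locally. The sketch below mirrors the scoring script's logic with a stand-in model so that it runs without Azure, joblib, or NumPy. `ThresholdModel` and its glucose cut-off of 140 are purely illustrative and are not the trained classifier.

```python
import json


class ThresholdModel:
    """Stand-in for the trained classifier: predicts 1 when column 1 exceeds 140."""

    def predict(self, rows):
        return [1 if row[1] > 140 else 0 for row in rows]


model = ThresholdModel()


def run(raw_data):
    """Same contract as score.py: JSON string in, JSON-serializable dict out."""
    try:
        data = json.loads(raw_data)["data"]
        return {"predictions": model.predict(data)}
    except Exception as e:
        # Return the error instead of raising, as the scoring script does
        return {"error": str(e)}


request = json.dumps({
    "data": [
        [1, 85, 66, 29, 0, 26.6, 0.351, 31],
        [8, 183, 64, 0, 0, 23.3, 0.672, 32],
    ]
})
print(run(request))      # {'predictions': [0, 1]}
print(run("not json"))   # a dict with an 'error' key
```

Checking both the happy path and a malformed request locally catches payload-shape mistakes before they surface as endpoint errors.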

In [None]:
# Test the endpoint
import json

# Prepare test data
test_data = {
    "data": [
        [1, 85, 66, 29, 0, 26.6, 0.351, 31],  # Sample 1
        [8, 183, 64, 0, 0, 23.3, 0.672, 32]   # Sample 2
    ]
}

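# Write the request to a JSON file (invoke() expects a file path, not a JSON string)
request_path = "sample-request.json"
with open(request_path, "w") as f:
    json.dump(test_data, f)

# Invoke endpoint
response = ml_client.online_endpoints.invoke(
    endpoint_name="diabetes-endpoint",
    request_file=request_path
)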

print("Predictions:", response)

### 6.3 Blue-Green Deployment Strategy

#### What is Blue-Green Deployment?
Deploy a new version (green) alongside the current version (blue), then gradually shift traffic.

#### Benefits:
- **Zero downtime**: Always have a working version
- **Safe rollout**: Test before full deployment
- **Easy rollback**: Switch back if issues arise
- **A/B testing**: Compare versions with real traffic

#### Deployment Steps:
1. Deploy green version (0% traffic)
2. Test green version
3. Gradually shift traffic (10% → 50% → 100%)
4. Monitor metrics after each shift
5. Delete the blue version once green is stable

In [None]:
# Blue-Green Deployment Example

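# Create green deployment
# (reuses ManagedOnlineDeployment and CodeConfiguration imported in the earlier cell)
green_deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name="diabetes-endpoint",
    model="azureml:diabetes-classifier:2",  # New version
    instance_type="Standard_DS3_v2",
    instance_count=1,
    environment="azureml:sklearn-env:1",
    # Same scoring script as the blue deployment
    code_configuration=CodeConfiguration(
        code="./deployment",
        scoring_script="score.py"
    )
)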

# Deploy green (0% traffic initially)
ml_client.online_deployments.begin_create_or_update(green_deployment).result()
print("Green deployment created with 0% traffic")

# Test green deployment
# ... test code ...

# Gradually shift traffic
endpoint.traffic = {"blue": 90, "green": 10}  # 10% to green
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
print("Traffic: 90% blue, 10% green")

# Monitor and continue shifting
endpoint.traffic = {"blue": 50, "green": 50}  # 50/50 split
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
print("Traffic: 50% blue, 50% green")

# Complete migration
endpoint.traffic = {"blue": 0, "green": 100}  # 100% to green
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
print("Traffic: 100% green")

# Delete blue deployment
ml_client.online_deployments.begin_delete(
    name="blue",
    endpoint_name="diabetes-endpoint"
).result()
print("Blue deployment deleted")
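
The traffic shifts above are applied unconditionally. In practice each step is usually gated on the metrics observed so far. A minimal sketch of such a gate follows, assuming error rates for both deployments are already available from monitoring. The function name `next_traffic_split`, the step sizes, and the 1% tolerance are illustrative assumptions.

```python
def next_traffic_split(current_green, green_error_rate, blue_error_rate,
                       steps=(10, 50, 100), tolerance=0.01):
    """Return the next {"blue": x, "green": y} split, or roll back to blue.

    Rolls back when green's error rate exceeds blue's by more than `tolerance`.
    """
    if green_error_rate > blue_error_rate + tolerance:
        return {"blue": 100, "green": 0}  # roll back
    for step in steps:
        if step > current_green:
            return {"blue": 100 - step, "green": step}
    return {"blue": 0, "green": 100}  # rollout already complete


print(next_traffic_split(0, 0.0, 0.0))      # {'blue': 90, 'green': 10}
print(next_traffic_split(10, 0.02, 0.02))   # {'blue': 50, 'green': 50}
print(next_traffic_split(50, 0.08, 0.02))   # {'blue': 100, 'green': 0}
```

The returned dictionary has the same shape as `endpoint.traffic`, so it could be assigned directly before calling `begin_create_or_update`.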

### 6.4 Batch Endpoints

#### When to Use Batch Endpoints:
- Scheduled predictions (nightly, weekly)
- Large datasets (thousands-millions of rows)
- No real-time requirement
- Cost optimization (pay only when running)

#### Use Cases:
- Customer churn scoring (monthly)
- Product recommendations (daily)
- Fraud detection (batch analysis)
- Market segmentation (quarterly)

In [None]:
# Deploy to Batch Endpoint
from azure.ai.ml.constants import BatchDeploymentOutputAction
from azure.ai.ml.entities import (
    BatchEndpoint,
    BatchDeployment,
    BatchRetrySettings
)

# Create batch endpoint
batch_endpoint = BatchEndpoint(
    name="diabetes-batch",
    description="Batch endpoint for diabetes predictions"
)

ml_client.batch_endpoints.begin_create_or_update(batch_endpoint).result()
print(f"Batch endpoint created: {batch_endpoint.name}")

# Create batch deployment
batch_deployment = BatchDeployment(
    name="diabetes-batch-v1",
    endpoint_name="diabetes-batch",
    model="azureml:diabetes-classifier:1",
    compute="cpu-cluster",
    instance_count=2,
    max_concurrency_per_instance=2,
    mini_batch_size=10,
    # Append each mini-batch result as rows of a single output file
    output_action=BatchDeploymentOutputAction.APPEND_ROW,
    output_file_name="predictions.csv",
    # Retry settings are passed as an object, not a plain dict
    retry_settings=BatchRetrySettings(max_retries=3, timeout=300),
    logging_level="info"
)

ml_client.batch_deployments.begin_create_or_update(batch_deployment).result()
print(f"Batch deployment created: {batch_deployment.name}")

In [None]:
# Invoke batch endpoint
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# Prepare input data
input_data = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml:batch-scoring-data:1"
)

# Invoke batch job
batch_job = ml_client.batch_endpoints.invoke(
    endpoint_name="diabetes-batch",
    inputs={"data": input_data}
)

print(f"Batch job submitted: {batch_job.name}")

# Monitor job
ml_client.jobs.stream(batch_job.name)

### 6.5 Monitoring Deployed Models

#### Key Metrics to Monitor:

**Performance Metrics:**
- Latency (response time)
- Throughput (requests/second)
- Error rate
- CPU/Memory utilization

**Model Metrics:**
- Prediction accuracy
- Data drift
- Model drift
- Feature distribution

**Business Metrics:**
- Usage patterns
- User satisfaction
- ROI impact
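
Data drift, listed under the model metrics above, is often quantified by comparing a feature's distribution in production against its training baseline. One common choice is the Population Stability Index (PSI). The sketch below is a plain-Python illustration, not the calculation that Azure ML monitoring performs. The bin count, the `1e-6` floor for empty bins, and the sample values are assumptions made for the example.

```python
import math


def psi(expected, actual, bins=4):
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [20, 25, 30, 35, 40, 45, 50, 55]
shifted = [v + 20 for v in baseline]

print(psi(baseline, list(baseline)))   # 0.0 (identical distributions)
print(psi(baseline, shifted) > 0.25)   # True (clear shift)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but thresholds should be tuned per feature.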

In [None]:
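# Inspect the endpoint and enable Application Insights monitoring
endpoint = ml_client.online_endpoints.get("diabetes-endpoint")
print(f"Scoring URI: {endpoint.scoring_uri}")

# Keys are retrieved separately; never print them in full
keys = ml_client.online_endpoints.get_keys(name="diabetes-endpoint")
print(f"Primary Key: {keys.primary_key[:10]}...")

# Application Insights is switched on per deployment
deployment = ml_client.online_deployments.get(
    name="green",
    endpoint_name="diabetes-endpoint"
)
deployment.app_insights_enabled = True
ml_client.online_deployments.begin_create_or_update(deployment).result()
print(f"Application Insights enabled: {deployment.app_insights_enabled}")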

### 🎓 Module 6: Practice Questions

#### Question 1
A mobile app needs instant predictions for user interactions. Which deployment should you use?
- A) Batch endpoint
- B) Managed online endpoint ✓
- C) Pipeline endpoint

**Answer: B** - Online endpoints provide low-latency, real-time predictions.

#### Question 2
What's the main benefit of blue-green deployment?
- A) Faster deployment
- B) Lower cost
- C) Zero-downtime updates ✓

**Answer: C** - Blue-green deployment allows safe, zero-downtime version updates.

#### Question 3
When should you use batch endpoints?
- A) Real-time user requests
- B) Scheduled bulk predictions ✓
- C) Low-latency scenarios

**Answer: B** - Batch endpoints are ideal for scheduled, bulk prediction workloads.

---

## Module 7: Language Models and AI Optimization

### 🎯 Learning Objectives
- Explore models from Azure AI catalog
- Optimize performance with prompt engineering
- Implement Retrieval Augmented Generation (RAG)
- Fine-tune language models

---

### 7.1 Azure AI Model Catalog

#### What's Available:

| Category | Examples | Use Cases |
|----------|----------|----------|
| **Language Models** | GPT-4, GPT-3.5, Llama 2 | Text generation, Q&A, summarization |
| **Vision Models** | Florence, CLIP | Image classification, OCR |
| **Speech Models** | Whisper | Speech-to-text, transcription |
| **Open Source** | BERT, T5, Falcon | Custom NLP tasks |

#### Model Selection Criteria:
- **Task requirements**: What do you need to accomplish?
- **Performance**: Speed vs accuracy tradeoff
- **Cost**: Inference cost per token
- **Customization**: Can you fine-tune?
- **Compliance**: Data residency, privacy

In [None]:
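# Browse the model catalog
# Catalog models live in shared registries (for example "azureml"), not in your
# workspace, so they are listed with a registry-scoped client.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

registry_client = MLClient(
    credential=DefaultAzureCredential(),
    registry_name="azureml"
)

# List available models
models = registry_client.models.list()

print("Available models in catalog:")
for model in models:
    print(f"- {model.name} ({model.version})")
    print(f"  Description: {model.description}")
    print(f"  Tags: {model.tags}")
    print()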

### 7.2 Prompt Engineering

#### What is Prompt Engineering?
The art and science of crafting effective inputs (prompts) to get desired outputs from language models.

#### Key Principles:

1. **Be Specific**: Clear, detailed instructions
2. **Provide Context**: Background information
3. **Use Examples**: Show desired format (few-shot learning)
4. **Set Tone/Style**: Specify how to respond
5. **Iterate**: Refine based on results

#### Prompt Patterns:

**Zero-shot:**
```
Classify the sentiment: "I love this product!"
```

**Few-shot:**
```
Classify sentiment:
"Great service!" → Positive
"Terrible experience" → Negative
"It's okay" → Neutral
"Amazing quality!" → ?
```

**Chain-of-thought:**
```
Let's solve this step by step:
1. First, identify...
2. Then, calculate...
3. Finally, conclude...
```
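
Few-shot prompts like the one above are usually assembled from a list of labeled examples rather than typed by hand. A minimal sketch, assuming examples are stored as `(text, label)` pairs; the function name `build_few_shot_prompt` is illustrative.

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt: instruction, labeled examples, then the open query."""
    lines = [task]
    for text, label in examples:
        lines.append(f'"{text}" → {label}')
    lines.append(f'"{query}" →')  # left open for the model to complete
    return "\n".join(lines)


examples = [
    ("Great service!", "Positive"),
    ("Terrible experience", "Negative"),
    ("It's okay", "Neutral"),
]
prompt = build_few_shot_prompt("Classify sentiment:", examples, "Amazing quality!")
print(prompt)
```

Keeping the examples in data makes it easy to swap them out or rotate them while iterating on the prompt.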

#### Prompt Engineering for Business Use Cases:

**Customer Support:**
```python
prompt = """
You are a helpful customer support agent for an e-commerce company.
Respond professionally and empathetically.

Customer: My order hasn't arrived yet.
Agent:
"""
```

**Data Analysis:**
```python
prompt = """
Analyze this sales data and provide 3 key insights:

Q1 Sales: $250K
Q2 Sales: $180K
Q3 Sales: $320K
Q4 Sales: $290K

Insights:
"""
```

**Report Generation:**
```python
prompt = """
Write an executive summary of this data in bullet points:
- Focus on trends
- Highlight risks
- Suggest actions

Data: [your data here]
"""
```

In [None]:
# Example: Using Prompty for prompt engineering
"""
Prompty is a format for defining prompts with metadata.

Example prompty file (customer_support.prompty):

---
name: CustomerSupport
description: Customer support response generator
model:
  api: chat
  configuration:
    type: azure_openai
    azure_deployment: gpt-4
sample:
  customer_message: "My order is late"
---

system:
You are a professional customer support agent.
Be empathetic, helpful, and provide clear solutions.

user:
Customer message: {{customer_message}}

Please respond to the customer.
"""

print("Prompty template example")

### 7.3 Retrieval Augmented Generation (RAG)

#### What is RAG?
Enhance LLM responses by retrieving relevant information from a knowledge base before generating answers.

#### How RAG Works:
```
User Query → Search Knowledge Base → Retrieve Relevant Docs →
Add to Prompt → LLM Generates Answer
```

#### Benefits:
- **Factual accuracy**: Ground responses in real data
- **Up-to-date info**: Access current information
- **Domain expertise**: Use company-specific knowledge
- **Reduced hallucinations**: Less made-up information
- **Transparency**: Can cite sources

#### Business Use Cases:
- Internal knowledge base Q&A
- Product documentation search
- Policy and compliance queries
- Customer support with manual lookup
- Research and analysis

#### RAG Architecture Components:

1. **Document Store**: Where knowledge is stored
   - Azure AI Search (formerly Azure Cognitive Search)
   - Vector databases (Pinecone, Weaviate)
   - Azure Cosmos DB

2. **Embedding Model**: Convert text to vectors
   - OpenAI embeddings
   - Sentence transformers

3. **Retrieval System**: Find relevant documents
   - Semantic search
   - Keyword search
   - Hybrid search

4. **Generation Model**: Create response
   - GPT-4, GPT-3.5
   - Llama 2
   - Claude
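
To make the four components concrete, the sketch below wires up a toy version of the first three using only the standard library. It stands in a bag-of-words count vector for a real embedding model and a dictionary for the document store, so it illustrates the flow rather than production-quality retrieval. `DOCS`, `embed`, `retrieve`, and `build_prompt` are hypothetical names, and the documents are invented sample text.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base (document store)
DOCS = {
    "refunds": "Refunds are issued within 14 days of a return request.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
    "warranty": "All devices include a 12 month limited warranty.",
}


def embed(text):
    """Toy 'embedding': a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, top=1):
    """Return the keys of the `top` most similar documents."""
    q = embed(question)
    ranked = sorted(DOCS, key=lambda k: cosine(q, embed(DOCS[k])), reverse=True)
    return ranked[:top]


def build_prompt(question, top=1):
    """Insert the retrieved text into the prompt sent to the generation model."""
    context = "\n".join(DOCS[k] for k in retrieve(question, top))
    return (f"Context:\n{context}\n\n"
            f"Question: {question}\n"
            "Answer using only the context above:")


print(retrieve("How long does shipping take?"))   # ['shipping']
print(build_prompt("How long does shipping take?"))
```

Exact word matching misses paraphrases such as "refund" versus "refunds", which is the gap that learned embeddings and semantic or hybrid search are meant to close.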

In [None]:
# Simplified RAG implementation concept
# Assumptions: the environment variables below are set, the search index is
# named "knowledge-base" and has a "content" field, and "gpt-4" is the name of
# your Azure OpenAI deployment. Adjust all of these to your own resources.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

# Initialize clients
search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="knowledge-base",
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"])
)
openai_client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01"
)

def rag_query(user_question):
    # Step 1: Search for relevant documents
    search_results = search_client.search(
        search_text=user_question,
        top=3  # Get top 3 results
    )

    # Step 2: Extract relevant text
    context = "\n\n".join(result["content"] for result in search_results)

    # Step 3: Create prompt with context
    prompt = (
        f"Context information:\n{context}\n\n"
        f"Question: {user_question}\n\n"
        "Answer based on the context above:"
    )

    # Step 4: Generate response
    response = openai_client.chat.completions.create(
        model="gpt-4",  # Azure OpenAI deployment name
        messages=[
            {"role": "system", "content": "Answer based on provided context."},
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content

# Example usage
answer = rag_query("What is our refund policy?")
print(answer)

### 7.4 Fine-tuning Language Models

#### What is Fine-tuning?
Train a pre-trained model on your specific data to adapt it to your domain or task.

#### When to Fine-tune:

✅ **Good candidates:**
- Consistent task format
- Domain-specific terminology
- Style/tone requirements
- At least 100 quality examples available
- Need for better performance than prompt engineering delivers

❌ **Not ideal:**
- Limited training data (<50 examples)
- Frequently changing requirements
- Knowledge that must be updated regularly
- General-purpose tasks

#### Fine-tuning vs RAG vs Prompt Engineering:

| Approach | Cost | Effort | Flexibility | Best For |
|----------|------|--------|-------------|----------|
| **Prompt Engineering** | Low | Low | High | Quick iteration |
| **RAG** | Medium | Medium | Medium | Knowledge-based tasks |
| **Fine-tuning** | High | High | Low | Specialized tasks |

#### Fine-tuning Process:

1. **Prepare training data**: Format as prompt-completion pairs
2. **Upload data**: Create dataset in Azure ML
3. **Configure training**: Set hyperparameters
4. **Train model**: Submit fine-tuning job
5. **Evaluate**: Test on validation set
6. **Deploy**: Make available for inference

In [None]:
# Fine-tuning data format example
"""
Training data should be in JSONL format:

{"prompt": "Customer: I want to return my order. Agent:", "completion": " I'd be happy to help with your return. May I have your order number?"}
{"prompt": "Customer: Where is my package? Agent:", "completion": " Let me check the tracking information for you. Could you provide your order number?"}
{"prompt": "Customer: I received the wrong item. Agent:", "completion": " I apologize for the error. Let's get this corrected right away. What did you receive?"}

Best practices:
- At least 100 examples (more is better)
- Diverse examples covering various scenarios
- High-quality, consistent responses
- Clear prompt-completion separation
- Representative of production use
"""

print("Fine-tuning data format guidelines")
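
Malformed lines are a common reason fine-tuning jobs fail at upload time, so it helps to check the file first. The sketch below validates the prompt-completion format described above. The function name `validate_jsonl` and the specific checks are illustrative assumptions, and the checks would need adapting for chat models, which are generally fine-tuned on a `messages` list instead.

```python
import json


def validate_jsonl(lines, required=("prompt", "completion")):
    """Return a list of (line_number, problem) for records that fail basic checks."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for key in required:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems


sample = [
    '{"prompt": "Customer: Where is my package? Agent:", "completion": " Let me check."}',
    '{"prompt": "Customer: I received the wrong item. Agent:"}',
    "not json",
]
for line_number, problem in validate_jsonl(sample):
    print(f"line {line_number}: {problem}")
```

To check a real file, pass the open file object, for example `validate_jsonl(open("train.jsonl"))`, where `train.jsonl` is a placeholder name.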

### 🎓 Module 7: Practice Questions

#### Question 1
Which approach is best for quickly adapting an LLM to a new task without retraining?
- A) Fine-tuning
- B) Prompt engineering ✓
- C) RAG

**Answer: B** - Prompt engineering is the fastest way to adapt without retraining.

#### Question 2
When should you use RAG instead of fine-tuning?
- A) When you need style/tone adaptation
- B) When you need access to current information ✓
- C) When you have limited examples

**Answer: B** - RAG provides access to up-to-date information from a knowledge base.

#### Question 3
What's the minimum recommended number of examples for fine-tuning?
- A) 10-20
- B) 50-75
- C) 100+ ✓

**Answer: C** - At least 100 quality examples are recommended for effective fine-tuning.

---

## Appendix: Exam Preparation Guide

### 📝 DP-100 Exam Structure

#### Exam Details:
- **Duration**: 120 minutes
- **Questions**: 40-60 questions
- **Passing Score**: 700/1000
- **Question Types**: Multiple choice, case studies, scenario-based

#### Score Distribution:

| Domain | Percentage |
|--------|------------|
| Design and prepare a machine learning solution | 20-25% |
| Explore data and run experiments | 20-25% |
| Train and deploy models | 25-30% |
| Optimize language models for AI applications | 25-30% |

---

### 🎯 Key Topics to Master

#### Domain 1: Design and Prepare
- ✅ ML problem types (classification, regression, forecasting)
- ✅ Data storage options (Blob, ADLS, SQL)
- ✅ Compute choices (CPU/GPU, instance types)
- ✅ Deployment strategies (real-time vs batch)
- ✅ MLOps architecture

#### Domain 2: Explore and Experiment
- ✅ Azure ML workspace components
- ✅ Developer tools (Studio, SDK, CLI)
- ✅ AutoML configuration and tasks
- ✅ Data preparation and featurization
- ✅ Metrics selection

#### Domain 3: Train and Deploy
- ✅ MLflow tracking
- ✅ Command jobs and sweep jobs
- ✅ Pipeline creation and components
- ✅ Model registration and versioning
- ✅ Endpoint deployment (online and batch)
- ✅ Blue-green deployments
- ✅ Responsible AI dashboard

#### Domain 4: Optimize Language Models
- ✅ Azure AI model catalog
- ✅ Prompt engineering techniques
- ✅ RAG architecture and implementation
- ✅ Fine-tuning strategies
- ✅ Model evaluation and comparison

---

### 📚 Study Resources

#### Official Microsoft Resources:
1. **Microsoft Learn**: https://learn.microsoft.com/training/courses/dp-100t01
2. **Exam Study Guide**: Download from Microsoft
3. **Practice Assessments**: Take practice tests
4. **Azure ML Documentation**: https://docs.microsoft.com/azure/machine-learning

#### Hands-on Practice:
- Complete all module labs
- Build your own projects
- Experiment with different services
- Practice with sample datasets

#### Community Resources:
- Microsoft Q&A forums
- Stack Overflow
- LinkedIn Learning courses
- YouTube tutorials

---

### 💡 Exam Tips and Strategies

#### Before the Exam:
- Review all practice questions in this manual
- Take at least 2 practice tests
- Set up your own Azure ML workspace and practice
- Study case studies and scenario questions
- Review Azure ML pricing and service limits

#### During the Exam:
- Read questions carefully (watch for "EXCEPT" or "NOT")
- Eliminate obviously wrong answers first
- Use process of elimination
- Flag uncertain questions for review
- Manage your time (2 minutes per question average)
- For case studies, read scenario first, then questions

#### Common Pitfalls:
- Confusing online vs batch endpoints
- Mixing up job types (command vs sweep vs pipeline)
- Not understanding when to use AutoML vs custom training
- Forgetting about cost optimization strategies
- Unclear on RBAC roles and permissions

---

### 🔑 Key Formulas and Concepts

#### Classification Metrics:
```
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
```
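
As a worked example of the formulas above, the sketch below computes all four metrics from confusion-matrix counts. The counts (40 true positives, 45 true negatives, 5 false positives, 10 false negatives) are made up for illustration, and the zero-denominator guards are a convenience added here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.85, 'precision': 0.889, 'recall': 0.8, 'f1': 0.842}
```

Here precision (0.889) is higher than recall (0.8) because the model produces fewer false positives than false negatives.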

#### Regression Metrics:
```
RMSE = √(Σ(predicted - actual)² / n)
MAE = Σ|predicted - actual| / n
R² = 1 - (SS_res / SS_tot)
```
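
The regression formulas above can be checked on a tiny example. In the sketch below the four actual and predicted values are invented for illustration; `SS_res` is the sum of squared errors and `SS_tot` is the sum of squared deviations of the actual values from their mean.

```python
import math


def regression_metrics(actual, predicted):
    """Return (RMSE, MAE, R²) for paired actual and predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot  # undefined when all actual values are equal
    return rmse, mae, r2


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
rmse, mae, r2 = regression_metrics(actual, predicted)
print(round(rmse, 4), mae, round(r2, 4))   # 0.7071 0.5 0.9
```

RMSE (0.7071) exceeds MAE (0.5) because squaring weights the larger errors more heavily.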

#### Cross-validation:
```
K-fold CV: Split data into K folds
Train on K-1 folds, validate on 1 fold
Repeat K times, average metrics
```
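
The splitting step described above can be sketched in a few lines. This version assigns consecutive indices to folds without shuffling, which is a simplification; libraries such as scikit-learn also offer shuffled and stratified variants. The function name `k_fold_indices` is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_indices(6, 3):
    print(train, validation)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each sample appears in exactly one validation fold, so averaging the per-fold metric uses every sample for validation once.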

---

### 📊 Quick Reference Tables

#### Azure ML Job Types:
| Job Type | Purpose | Use When |
|----------|---------|----------|
| Command | Single script execution | Training one model |
| Sweep | Hyperparameter tuning | Optimizing parameters |
| Pipeline | Multi-step workflow | Production workflows |

#### RBAC Roles:
| Role | Key Permissions |
|------|----------------|
| Owner | Full control |
| Contributor | Create/modify resources |
| AzureML Data Scientist | Run experiments, train models |
| AzureML Compute Operator | Manage compute |
| Reader | View only |

#### Compute SKUs:
| SKU Series | Best For |
|------------|----------|
| D-series | General purpose |
| E-series | Memory optimized |
| F-series | Compute optimized |
| NC-series | GPU compute |

---

## 🎓 Business Analyst & Data Engineer Career Guide

### Career Paths with DP-100 Certification

#### Business Analyst (Data-Focused)
**Responsibilities:**
- Translate business problems into ML solutions
- Define success metrics and KPIs
- Interpret model results for stakeholders
- Ensure model alignment with business goals
- Monitor model performance and business impact

**Key Skills from DP-100:**
- Understanding ML problem types
- Metrics selection and interpretation
- AutoML for rapid prototyping
- Model monitoring and drift detection
- Responsible AI practices

**Average Salary Range:** $75,000 - $130,000

---

#### Data Engineer (ML-Focused)
**Responsibilities:**
- Design and build data pipelines
- Manage ML infrastructure
- Implement MLOps workflows
- Optimize compute and storage
- Ensure data quality and governance

**Key Skills from DP-100:**
- Azure ML workspace management
- Pipeline design and automation
- Compute optimization
- Model deployment strategies
- Integration with Azure services

**Average Salary Range:** $90,000 - $150,000

---

### 🚀 Next Steps After Certification

#### Immediate Actions:
1. Update LinkedIn profile with certification
2. Add certification badge to resume
3. Share achievement on professional networks
4. Start building portfolio projects

#### Portfolio Project Ideas:
1. **Customer Segmentation**: Cluster analysis on customer data
2. **Churn Prediction**: Classification model with monitoring
3. **Sales Forecasting**: Time-series forecasting pipeline
4. **Sentiment Analysis**: NLP with fine-tuned models
5. **Recommendation System**: Collaborative filtering with deployment

#### Continuing Education:
- **Related Certifications**:
  - DP-700: Fabric Data Engineer Associate (successor to DP-203, which was retired in March 2025)
  - AI-102: Azure AI Engineer Associate
  - PL-300: Power BI Data Analyst Associate
  
- **Advanced Topics**:
  - Deep learning with PyTorch/TensorFlow
  - Advanced MLOps with Azure DevOps
  - Real-time ML with Azure Stream Analytics
  - Cost optimization strategies

---

### 📌 Final Checklist

Before considering yourself exam-ready, ensure you can:

#### Design (20-25%):
- [ ] Identify appropriate ML task types
- [ ] Design data ingestion solutions
- [ ] Choose appropriate Azure services
- [ ] Plan deployment strategies
- [ ] Design monitoring solutions

#### Explore (20-25%):
- [ ] Create and configure workspaces
- [ ] Use Studio, SDK, and CLI
- [ ] Configure AutoML experiments
- [ ] Prepare and featurize data
- [ ] Select and interpret metrics

#### Train and Deploy (25-30%):
- [ ] Track experiments with MLflow
- [ ] Submit command and sweep jobs
- [ ] Create and run pipelines
- [ ] Register and version models
- [ ] Deploy to endpoints
- [ ] Implement blue-green deployments

#### Optimize (25-30%):
- [ ] Browse and deploy catalog models
- [ ] Apply prompt engineering
- [ ] Implement RAG solutions
- [ ] Fine-tune language models
- [ ] Evaluate model performance

---

## 🎉 Congratulations!

You've completed the DP-100 Student Manual. You now have the knowledge and skills to:
- Design end-to-end ML solutions on Azure
- Build and deploy production ML models
- Implement MLOps best practices
- Work effectively as a Business Analyst or Data Engineer

**Good luck on your DP-100 exam and your ML career journey!**

---

### 📞 Additional Resources

**Microsoft Learn**: learn.microsoft.com/training/paths/azure-machine-learning

**Azure ML Documentation**: docs.microsoft.com/azure/machine-learning

**Exam Registration**: learn.microsoft.com/certifications/exams/dp-100

**Community Forums**: techcommunity.microsoft.com/t5/azure-ai-services/ct-p/Azure-AI

---

*Version: 1.0 | Last Updated: February 2026*