# Machine Learning & AI 101: Complete Professional Training

🎯 **Welcome to a comprehensive, hands-on ML/AI training for data professionals!**

This enhanced notebook transforms beginners into competent ML practitioners through:
- **Systematic skill building** with measurable learning outcomes
- **Real-world applications** using actual data sources and deployment techniques
- **Industry best practices** including MLOps, testing, and production considerations
- **Interactive assessments** to validate your progress

---

## 📋 Learning Objectives

By completing this training, you will:

1. **Master data preprocessing pipelines** for production-ready ML systems
2. **Implement robust model evaluation** with proper validation strategies
3. **Build end-to-end ML applications** with real data sources and deployment
4. **Apply MLOps principles** for model versioning, monitoring, and maintenance
5. **Handle ethical considerations** including bias detection and fairness metrics
6. **Debug common ML issues** and optimize model performance

**Estimated completion time:** 8-12 hours (can be completed in modules)

---

## 📊 Prerequisites & Environment Setup

### Required Knowledge
- [ ] Basic Python programming (functions, classes, data structures)
- [ ] Elementary statistics (mean, variance, distributions)
- [ ] High school mathematics (algebra, basic calculus helpful but not required)

### Success Criteria
- [ ] Complete all checkpoint assessments with 70%+ scores
- [ ] Successfully implement at least 2 end-to-end projects
- [ ] Demonstrate ability to debug and optimize ML models

Let's verify your environment and begin your professional ML journey! 🚀

## 1. Environment Setup & Validation

**⏱️ Estimated time:** 15 minutes

**Learning objectives:**
- Set up a reproducible ML environment
- Understand version management for ML projects
- Implement proper random seed management

In [None]:
# Environment setup with version tracking and reproducibility
import sys
import warnings
from datetime import datetime
import os

# Core libraries with version checking
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Machine Learning - Core
import sklearn
from sklearn.datasets import (
    load_iris, load_wine, load_breast_cancer, 
    make_classification, make_regression, make_blobs
)
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV, 
    RandomizedSearchCV, validation_curve, learning_curve,
    StratifiedKFold, KFold
)
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, 
    LabelEncoder, OneHotEncoder, PolynomialFeatures
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, KNNImputer

# Algorithms
from sklearn.linear_model import (
    LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet
)
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    GradientBoostingClassifier, VotingClassifier
)
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Evaluation metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix,
    mean_squared_error, mean_absolute_error, r2_score,
    roc_curve, auc, roc_auc_score,
    silhouette_score, adjusted_rand_score
)

# Advanced libraries
try:
    import joblib
    HAS_JOBLIB = True
except ImportError:
    HAS_JOBLIB = False
    
try:
    import requests
    HAS_REQUESTS = True
except ImportError:
    HAS_REQUESTS = False

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)

# Global random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Environment validation
print("🔧 ML ENVIRONMENT VALIDATION")
print("=" * 50)
print(f"Python version: {sys.version.split()[0]}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")
print(f"Matplotlib version: {plt.matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")

print("\n📦 Optional Libraries:")
print(f"Joblib available: {'✅' if HAS_JOBLIB else '❌ (pip install joblib)'}")
print(f"Requests available: {'✅' if HAS_REQUESTS else '❌ (pip install requests)'}")

print(f"\n🎲 Random state set to: {RANDOM_STATE}")
print(f"📅 Session started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n✅ Environment ready for ML training!")

### 📝 Checkpoint 1: Environment Validation

**Quick Assessment (2 minutes):**

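1. What is the purpose of setting a random state in ML projects?
2. When is it acceptable to suppress warnings, and what is the risk of a blanket `warnings.filterwarnings('ignore')`?
3. What happens if you don't manage package versions in ML projects?

<details>
<summary>Click for answers</summary>

1. **Random state makes runs repeatable** - the same seed gives the same splits and initializations on the same library versions; results can still differ across library versions or hardware
2. **Suppress only specific, understood warnings** to keep notebook output and logs readable; a blanket filter like the one in the setup cell can hide deprecation and convergence warnings that point to real problems
3. **Version mismatches can cause** model performance changes, crashes, or different results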
</details>
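
The first answer can be checked directly. The sketch below is a minimal illustration, not part of the original setup cell: `SEED` is a hypothetical stand-in for `RANDOM_STATE`, and the data comes from `make_classification` rather than the course dataset. It shows that two generators built from the same seed produce identical draws, and that passing `random_state` explicitly makes a split repeatable without relying on global state.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

SEED = 42  # hypothetical constant mirroring RANDOM_STATE

# Local generators keep randomness explicit instead of relying on global state.
rng_a = np.random.default_rng(SEED)
rng_b = np.random.default_rng(SEED)
draws_a = rng_a.normal(size=5)
draws_b = rng_b.normal(size=5)
print("Same seed, same draws:", np.array_equal(draws_a, draws_b))

# Splitters and estimators accept their own random_state argument.
X, y = make_classification(n_samples=100, random_state=SEED)
split_1 = train_test_split(X, y, test_size=0.2, random_state=SEED)
split_2 = train_test_split(X, y, test_size=0.2, random_state=SEED)
print("Same seed, same test split:", np.array_equal(split_1[1], split_2[1]))
```

Recording the seed alongside package versions is what makes a result traceable later; the seed alone is not enough if the libraries change.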

## 2. Data Fundamentals & Professional Preprocessing

**⏱️ Estimated time:** 45 minutes

**Learning objectives:**
- Master production-ready data preprocessing pipelines
- Handle missing data with advanced strategies
- Implement feature engineering and validation
- Understand data leakage and prevention

### 2.1 Advanced Data Loading & Validation

In [None]:
# Professional data validation and quality assessment
def validate_dataset(df, name="Dataset"):
    """Comprehensive data validation function"""
    print(f"\n📊 {name} Validation Report")
    print("=" * 40)
    
    # Basic info
    print(f"Shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    # Data types
    print("\n📋 Data Types:")
    type_counts = df.dtypes.value_counts()
    for dtype, count in type_counts.items():
        print(f"  {dtype}: {count} columns")
    
    # Missing data analysis
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    missing_df = pd.DataFrame({
        'Missing': missing,
        'Percentage': missing_pct
    }).sort_values('Missing', ascending=False)
    
    if missing.sum() > 0:
        print("\n⚠️ Missing Data:")
        print(missing_df[missing_df['Missing'] > 0])
    else:
        print("\n✅ No missing data found")
    
    # Duplicates
    duplicates = df.duplicated().sum()
    print(f"\n🔄 Duplicate rows: {duplicates} ({duplicates/len(df)*100:.1f}%)")
    
    # Numeric column statistics
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        print(f"\n📈 Numeric columns: {len(numeric_cols)}")
        print("Range check:")
        for col in numeric_cols:
            print(f"  {col}: [{df[col].min():.3f}, {df[col].max():.3f}]")
    
    return missing_df

# Create a comprehensive synthetic dataset for demonstration
np.random.seed(RANDOM_STATE)

n_samples = 1000
synthetic_data = {
    # Demographic features
    'customer_id': range(1, n_samples + 1),
    'age': np.random.randint(18, 80, n_samples),
    'income': np.random.lognormal(10.5, 0.8, n_samples),  # More realistic income distribution
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples, p=[0.3, 0.4, 0.2, 0.1]),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples),
    
    # Behavioral features
    'monthly_spend': np.random.gamma(2, 50, n_samples),
    'num_purchases': np.random.poisson(8, n_samples),
    'days_since_last_purchase': np.random.exponential(10, n_samples),
    'customer_rating': np.random.choice([1, 2, 3, 4, 5], n_samples, p=[0.05, 0.1, 0.2, 0.4, 0.25]),
    
    # Technical features
    'website_visits': np.random.negative_binomial(10, 0.3, n_samples),
    'mobile_app_usage': np.random.beta(2, 5, n_samples) * 100,  # Percentage
    'email_open_rate': np.random.beta(3, 7, n_samples),
}

# Add missing values realistically: missingness depends on other observed
# features (missing at random), rather than being completely random
df_raw = pd.DataFrame(synthetic_data)

# Income missing for younger customers (survey bias)
young_indices = df_raw[df_raw['age'] < 25].index
missing_indices = np.random.choice(young_indices, size=min(len(young_indices)//3, 30), replace=False)
df_raw.loc[missing_indices, 'income'] = np.nan

# Rating missing for customers with very few purchases
low_purchase_indices = df_raw[df_raw['num_purchases'] <= 2].index
missing_indices = np.random.choice(low_purchase_indices, size=min(len(low_purchase_indices)//2, 20), replace=False)
df_raw.loc[missing_indices, 'customer_rating'] = np.nan

# App usage missing for older customers
old_indices = df_raw[df_raw['age'] > 65].index
missing_indices = np.random.choice(old_indices, size=min(len(old_indices)//2, 25), replace=False)
df_raw.loc[missing_indices, 'mobile_app_usage'] = np.nan

# Add some extreme outliers
outlier_indices = np.random.choice(df_raw.index, 20, replace=False)
df_raw.loc[outlier_indices, 'monthly_spend'] *= 10  # Very high spenders

# Validate the dataset
validation_report = validate_dataset(df_raw, "Customer Dataset")

print("\n🎯 First 5 rows:")
print(df_raw.head())

### 2.2 Production-Ready Preprocessing Pipeline

In [None]:
# Professional preprocessing pipeline using sklearn

class MLPreprocessor:
    """Production-ready preprocessing pipeline"""
    
    def __init__(self, handle_outliers=True, outlier_method='iqr'):
        self.handle_outliers = handle_outliers
        self.outlier_method = outlier_method
        self.preprocessor = None
        self.feature_names = None
        self.outlier_bounds = {}
        
    def detect_outliers(self, X, column, method='iqr'):
        """Detect outliers using IQR or z-score method"""
        if method == 'iqr':
            Q1 = X[column].quantile(0.25)
            Q3 = X[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
        else:  # z-score
            mean = X[column].mean()
            std = X[column].std()
            lower_bound = mean - 3 * std
            upper_bound = mean + 3 * std
        
        self.outlier_bounds[column] = (lower_bound, upper_bound)
        outliers = (X[column] < lower_bound) | (X[column] > upper_bound)
        return outliers
    
    def fit(self, X, y=None):
        """Fit the preprocessing pipeline"""
        X_copy = X.copy()
        
        # Separate numeric and categorical columns
        numeric_features = X_copy.select_dtypes(include=[np.number]).columns.tolist()
        categorical_features = X_copy.select_dtypes(include=['object', 'category']).columns.tolist()
        
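        # Remove ID columns if present. Match 'id' or a '_id' suffix only;
        # a plain substring test would also exclude names such as 'width' or 'paid'.
        id_columns = [
            col for col in numeric_features
            if col.lower() == 'id' or col.lower().endswith('_id')
        ]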
        numeric_features = [col for col in numeric_features if col not in id_columns]
        
        print(f"Identified features:")
        print(f"  Numeric: {len(numeric_features)} - {numeric_features}")
        print(f"  Categorical: {len(categorical_features)} - {categorical_features}")
        print(f"  ID columns (excluded): {id_columns}")
        
        # Detect and report outliers in numeric features.
        # Bounds are stored in self.outlier_bounds; the values themselves are not modified.
        if self.handle_outliers:
            print(f"\n🎯 Outlier Detection ({self.outlier_method} method):")
            for col in numeric_features:
                outliers = self.detect_outliers(X_copy, col, self.outlier_method)
                outlier_count = outliers.sum()
                if outlier_count > 0:
                    print(f"  {col}: {outlier_count} outliers ({outlier_count/len(X_copy)*100:.1f}%)")
        
        # Create preprocessing pipelines
        numeric_transformer = Pipeline(steps=[
            ('imputer', KNNImputer(n_neighbors=5)),
            ('scaler', StandardScaler())
        ])
        
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
            ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
        ])
        
        # Combine transformers
        self.preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, numeric_features),
                ('cat', categorical_transformer, categorical_features)
            ],
            remainder='drop'  # Drop ID columns
        )
        
        # Fit the preprocessor
        self.preprocessor.fit(X_copy)
        
        # Store feature names for later use
        numeric_feature_names = numeric_features
        
        try:
            categorical_feature_names = (
                self.preprocessor
                .named_transformers_['cat']
                .named_steps['encoder']
                .get_feature_names_out(categorical_features)
            )
        except Exception:
            # No fitted encoder (for example, no categorical columns were found)
            categorical_feature_names = []
        
        self.feature_names = list(numeric_feature_names) + list(categorical_feature_names)
        
        print(f"\n✅ Preprocessing pipeline fitted")
        print(f"Final feature count: {len(self.feature_names)}")
        
        return self
    
    def transform(self, X):
        """Transform the data using fitted pipeline"""
        if self.preprocessor is None:
            raise ValueError("Pipeline not fitted. Call fit() first.")
        
        X_transformed = self.preprocessor.transform(X)
        
        # Convert to DataFrame with proper column names
        return pd.DataFrame(X_transformed, columns=self.feature_names, index=X.index)
    
    def fit_transform(self, X, y=None):
        """Fit and transform in one step"""
        return self.fit(X, y).transform(X)
    
    def get_feature_importance_mapping(self):
        """Get mapping of original to transformed features"""
        return {
            'feature_names': self.feature_names,
            'outlier_bounds': self.outlier_bounds
        }

# Apply the preprocessing pipeline
print("🔧 PROFESSIONAL PREPROCESSING PIPELINE")
print("=" * 50)

# Initialize and fit the preprocessor
preprocessor = MLPreprocessor(handle_outliers=True, outlier_method='iqr')

# Exclude customer_id for preprocessing
feature_columns = [col for col in df_raw.columns if col != 'customer_id']
X_raw = df_raw[feature_columns]

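# Fit and transform
# NOTE: fitting on the full dataset keeps this demo short, but it lets test rows
# influence the imputer and scaler. In a real project, fit on the training split only.
X_processed = preprocessor.fit_transform(X_raw)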

print(f"\n📊 Transformation Results:")
print(f"Original shape: {X_raw.shape}")
print(f"Processed shape: {X_processed.shape}")
print(f"Features created: {list(X_processed.columns)}")

# Validate the processed data
validate_dataset(X_processed, "Processed Dataset")

### 📝 Checkpoint 2: Data Preprocessing

**Assessment Questions (5 minutes):**

1. Why is KNN imputation often better than mean/median imputation?
2. What is data leakage and how does proper train/test splitting prevent it?
3. When would you use RobustScaler instead of StandardScaler?
4. What are the risks of dropping rows with missing values?
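
One way to explore question 3 is to scale the same column with both scalers. The sketch below uses a made-up array of ten ordinary values plus one extreme outlier; the data and variable names are illustrative assumptions. `StandardScaler` uses the mean and standard deviation, both of which the outlier inflates, while `RobustScaler` uses the median and interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Ten ordinary values plus one extreme outlier (illustrative data).
values = np.array(
    [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1000], dtype=float
).reshape(-1, 1)

standard = StandardScaler().fit_transform(values).ravel()
robust = RobustScaler().fit_transform(values).ravel()

# Spread of the ten ordinary points after scaling.
standard_spread = standard[:10].max() - standard[:10].min()
robust_spread = robust[:10].max() - robust[:10].min()

print(f"StandardScaler spread of ordinary points: {standard_spread:.3f}")
print(f"RobustScaler spread of ordinary points:   {robust_spread:.3f}")
```

With the standard scaler the ordinary points are squeezed into a narrow band because the outlier dominates the standard deviation; the robust scaler keeps them spread out and centers the median at zero.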

**Practical Exercise:**
Modify the preprocessing pipeline to:
- Use different imputation strategies for different columns
- Add polynomial features for numeric variables
- Implement custom outlier handling
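
A possible starting point for the first two exercise items is sketched below. It is not the `MLPreprocessor` class from the notebook: the small DataFrame, the step names, and the choice of strategies are assumptions made for illustration. Each column gets its own imputer through `ColumnTransformer`, and `PolynomialFeatures` then adds squared and interaction terms.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Small illustrative frame; column names echo the synthetic dataset.
df = pd.DataFrame({
    "income": [40000.0, np.nan, 52000.0, 1_000_000.0, 61000.0],
    "customer_rating": [4.0, 5.0, np.nan, 4.0, 3.0],
    "age": [25, 40, 31, 58, 47],
})

# A different imputation strategy per column.
per_column = ColumnTransformer(transformers=[
    ("income", SimpleImputer(strategy="median"), ["income"]),           # robust to the large value
    ("rating", SimpleImputer(strategy="most_frequent"), ["customer_rating"]),
    ("age", "passthrough", ["age"]),                                    # no missing values here
])

pipeline = Pipeline(steps=[
    ("impute", per_column),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # adds squares and pairwise products
    ("scale", StandardScaler()),
])

features = pipeline.fit_transform(df)
imputed = pipeline.named_steps["impute"].transform(df)

print("Imputed income for row 1:", imputed[1, 0])
print("Imputed rating for row 2:", imputed[2, 1])
print("Final feature matrix shape:", features.shape)
```

Three input columns expand to nine after the degree-2 step (three originals, three squares, three pairwise products), so feature counts grow quickly as columns are added.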

## 3. Advanced Model Evaluation & Validation

**⏱️ Estimated time:** 40 minutes

**Learning objectives:**
- Implement robust cross-validation strategies
- Understand bias-variance tradeoff
- Master hyperparameter optimization
- Detect overfitting and model selection
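
The hyperparameter optimization objective can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the customer dataset, and the parameter grid is an arbitrary assumption. Placing the scaler inside the `Pipeline` means it is refit on each training fold, so validation folds never influence the preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocessing lives inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step name + double underscore + parameter name addresses a pipeline step.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1_macro")
search.fit(X_train, y_train)

print("Best C:", search.best_params_["clf__C"])
print(f"Cross-validated f1_macro: {search.best_score_:.3f}")
print(f"Held-out f1_macro: {search.score(X_test, y_test):.3f}")
```

A held-out score well below the cross-validated score is one sign of overfitting to the validation folds during the search.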

In [None]:
# Create target variables for demonstration
np.random.seed(RANDOM_STATE)

def create_target_variable(df):
    """Create realistic target variables"""
    
    # Customer Lifetime Value (CLV) - Regression target
    base_clv = 1000
    
    # Impact of various factors on CLV
    income_factor = np.log1p(df['income'].fillna(df['income'].median())) / 10
    age_factor = np.where(df['age'] > 50, 1.2, np.where(df['age'] < 30, 0.8, 1.0))
    spending_factor = np.log1p(df['monthly_spend']) / 2
    loyalty_factor = np.log1p(df['num_purchases']) * 50
    rating_factor = df['customer_rating'].fillna(3) * 100
    
    clv = (base_clv + income_factor + loyalty_factor + rating_factor) * age_factor + spending_factor
    clv += np.random.normal(0, 200, len(df))  # Add noise
    clv = np.maximum(clv, 100)  # Ensure positive values
    
    # High-value customer (binary classification target)
    high_value = (clv > clv.quantile(0.7)).astype(int)
    
    return clv, high_value

# Create targets
clv_target, high_value_target = create_target_variable(df_raw)

print("🎯 Target Variables Created:")
print(f"CLV (regression): Mean={clv_target.mean():.0f}, Std={clv_target.std():.0f}")
print(f"High-value customer (classification): {high_value_target.mean():.1%} positive class")

In [None]:
# Advanced model evaluation framework

def comprehensive_model_evaluation(X, y, models, cv_strategy='stratified', n_splits=5, test_size=0.2):
    """Comprehensive model evaluation with multiple metrics"""
    
    print("🎯 COMPREHENSIVE MODEL EVALUATION")
    print("=" * 50)
    
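    # Determine if this is classification or regression.
    # dtype.kind avoids platform differences (e.g. int32 vs int64 on Windows).
    y_kind = np.asarray(y).dtype.kind  # 'i'/'u' integer, 'b' bool, 'O' object, 'f' float
    is_classification = len(np.unique(y)) < 20 and y_kind in 'iubO'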
    
    print(f"Problem type: {'Classification' if is_classification else 'Regression'}")
    print(f"Cross-validation: {cv_strategy} {n_splits}-fold")
    
    # Split data for final evaluation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=RANDOM_STATE, 
        stratify=y if is_classification else None
    )
    
    # Choose cross-validation strategy
    if is_classification:
        if cv_strategy == 'stratified':
            cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=RANDOM_STATE)
        else:
            cv = KFold(n_splits=n_splits, shuffle=True, random_state=RANDOM_STATE)
        scoring_metrics = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
    else:
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=RANDOM_STATE)
        scoring_metrics = ['neg_mean_squared_error', 'neg_mean_absolute_error', 'r2']
    
    results = {}
    
    for name, model in models.items():
        print(f"\n🔍 Evaluating {name}...")
        
        model_results = {'name': name}
        
        # Cross-validation scores
        for metric in scoring_metrics:
            try:
                scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=metric)
                model_results[metric] = {
                    'mean': scores.mean(),
                    'std': scores.std(),
                    'scores': scores
                }
                print(f"  {metric}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
            except Exception as e:
                print(f"  {metric}: Error - {str(e)[:50]}...")
                model_results[metric] = {'mean': np.nan, 'std': np.nan, 'scores': []}
        
        # Final model evaluation on test set
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        if is_classification:
            model_results['test_accuracy'] = accuracy_score(y_test, y_pred)
            model_results['test_precision'] = precision_score(y_test, y_pred, average='macro', zero_division=0)
            model_results['test_recall'] = recall_score(y_test, y_pred, average='macro', zero_division=0)
            model_results['test_f1'] = f1_score(y_test, y_pred, average='macro', zero_division=0)
        else:
            model_results['test_mse'] = mean_squared_error(y_test, y_pred)
            model_results['test_mae'] = mean_absolute_error(y_test, y_pred)
            model_results['test_r2'] = r2_score(y_test, y_pred)
        
        model_results['model'] = model
        model_results['y_test'] = y_test
        model_results['y_pred'] = y_pred
        
        results[name] = model_results
    
    return results, X_test

# Define models for evaluation
regression_models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0, random_state=RANDOM_STATE),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=RANDOM_STATE),
}

classification_models = {
    'Logistic Regression': LogisticRegression(random_state=RANDOM_STATE, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=RANDOM_STATE),
    'SVM': SVC(probability=True, random_state=RANDOM_STATE),
}

# Evaluate regression models
print("\n" + "="*60)
print("REGRESSION EVALUATION (Customer Lifetime Value)")
regression_results, X_test_reg = comprehensive_model_evaluation(
    X_processed, clv_target, regression_models, cv_strategy='standard'
)

# Evaluate classification models  
print("\n" + "="*60)
print("CLASSIFICATION EVALUATION (High-Value Customer)")
classification_results, X_test_clf = comprehensive_model_evaluation(
    X_processed, high_value_target, classification_models, cv_strategy='stratified'
)

## 4. Real-World Data Integration & MLOps

**⏱️ Estimated time:** 50 minutes

**Learning objectives:**
- Connect to real data sources (APIs, databases, web scraping)
- Implement model persistence and versioning
- Apply MLOps principles for production deployments
- Handle data quality monitoring and model drift detection
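
Drift detection, the last objective above, can be illustrated with a two-sample Kolmogorov-Smirnov test from SciPy. This is one simple approach among several, sketched under assumptions: `detect_drift` is a hypothetical helper, the 0.05 threshold is a common default rather than a recommendation, and the data is synthetic.

```python
import numpy as np
from scipy.stats import ks_2samp


def detect_drift(reference, current, alpha=0.05):
    """Return True when a two-sample KS test rejects 'same distribution' at level alpha."""
    result = ks_2samp(reference, current)
    return bool(result.pvalue < alpha)


rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)  # feature values seen at training time
shifted = reference + 1.0                               # same shape, mean moved by one unit

print("Drift vs. itself:", detect_drift(reference, reference))
print("Drift vs. shifted data:", detect_drift(reference, shifted))
```

In practice the check would run per feature on a schedule, and with many features the threshold usually needs a multiple-comparison adjustment to avoid false alarms.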

In [None]:
# Real-world data integration strategies

import sqlite3
import json
from urllib.parse import urljoin
import time

class DataConnector:
    """Professional data connector for various sources"""
    
    def __init__(self):
        self.connection_cache = {}
        self.request_session = None
        if HAS_REQUESTS:
            self.request_session = requests.Session()
            self.request_session.headers.update({
                'User-Agent': 'ML-Training-Notebook/1.0'
            })
    
    def create_sample_database(self, db_path='sample_data.db'):
        """Create a sample SQLite database for demonstration"""
        
        print(f"🗄️ Creating sample database: {db_path}")
        
        # Create connection
        conn = sqlite3.connect(db_path)
        cursor = conn.cursor()
        
        # Create tables
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS customers (
                customer_id INTEGER PRIMARY KEY,
                name TEXT NOT NULL,
                email TEXT UNIQUE,
                registration_date DATE,
                country TEXT,
                subscription_tier TEXT
            )
        ''')
        
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS transactions (
                transaction_id INTEGER PRIMARY KEY,
                customer_id INTEGER,
                amount REAL,
                transaction_date DATETIME,
                product_category TEXT,
                FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
            )
        ''')
        
        # Insert sample data
        np.random.seed(RANDOM_STATE)
        
        # Generate customers
        countries = ['USA', 'UK', 'Germany', 'France', 'Canada', 'Australia']
        tiers = ['Basic', 'Premium', 'Enterprise']
        
        customers_data = []
        for i in range(1, 101):  # 100 customers
            customers_data.append((
                i,
                f"Customer_{i:03d}",
                f"customer{i:03d}@email.com",
                # Store as ISO text; sqlite3's implicit date adapters are deprecated in Python 3.12+
                (datetime.now() - pd.Timedelta(days=np.random.randint(30, 365))).date().isoformat(),
                np.random.choice(countries),
                np.random.choice(tiers, p=[0.5, 0.35, 0.15])
            ))
        
        cursor.executemany(
            'INSERT OR REPLACE INTO customers VALUES (?, ?, ?, ?, ?, ?)',
            customers_data
        )
        
        # Generate transactions
        categories = ['Electronics', 'Clothing', 'Books', 'Home', 'Sports']
        transactions_data = []
        
        for i in range(1, 301):  # 300 transactions
            customer_id = np.random.randint(1, 101)
            amount = np.random.lognormal(3, 1)  # Log-normal distribution for realistic amounts
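            # Store as ISO text; sqlite3's implicit datetime adapters are deprecated in Python 3.12+
            transaction_date = (datetime.now() - pd.Timedelta(days=np.random.randint(0, 90))).isoformat(sep=' ')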
            category = np.random.choice(categories)
            
            transactions_data.append((
                i, customer_id, round(amount, 2), transaction_date, category
            ))
        
        cursor.executemany(
            'INSERT OR REPLACE INTO transactions VALUES (?, ?, ?, ?, ?)',
            transactions_data
        )
        
        conn.commit()
        conn.close()
        
        print(f"✅ Database created with {len(customers_data)} customers and {len(transactions_data)} transactions")
        return db_path
    
    def connect_to_database(self, db_path, query):
        """Connect to SQLite database and execute query"""
        
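        conn = None
        try:
            print(f"🗄️ Connecting to database: {db_path}")
            conn = sqlite3.connect(db_path)
            
            # Execute query and return DataFrame
            df = pd.read_sql_query(query, conn)
            
            print(f"✅ Query executed successfully. Retrieved {len(df)} rows")
            return df
            
        except (sqlite3.Error, pd.errors.DatabaseError) as e:
            # pandas wraps driver errors raised during the query in its own DatabaseError
            print(f"❌ Database error: {e}")
            return None
        
        finally:
            # Close the connection whether or not the query succeeded
            if conn is not None:
                conn.close()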
    
    def simulate_web_scraping(self, num_records=50):
        """Simulate web scraping (without actual scraping)"""
        
        print(f"🕷️ Simulating web scraping for {num_records} records...")
        
        # Simulate realistic web-scraped data
        np.random.seed(RANDOM_STATE)
        
        # Simulate product data from e-commerce site
        products = []
        categories = ['Electronics', 'Books', 'Clothing', 'Home & Garden', 'Sports']
        brands = ['BrandA', 'BrandB', 'BrandC', 'BrandD', 'BrandE']
        
        for i in range(num_records):
            # Simulate some missing data (realistic for web scraping)
            rating = np.random.uniform(1, 5) if np.random.random() > 0.1 else None
            price = np.random.lognormal(3, 0.8) if np.random.random() > 0.05 else None
            
            product = {
                'product_id': f'P{i:04d}',
                'name': f'Product {i}',
                'category': np.random.choice(categories),
                'brand': np.random.choice(brands) if np.random.random() > 0.15 else None,
                'price': round(price, 2) if price else None,
                'rating': round(rating, 1) if rating else None,
                'num_reviews': np.random.poisson(50) if rating else 0,
                'in_stock': np.random.choice([True, False], p=[0.85, 0.15]),
                'scraped_date': datetime.now() - pd.Timedelta(hours=np.random.randint(0, 24))
            }
            products.append(product)
        
        df = pd.DataFrame(products)
        
        print(f"✅ Simulated scraping complete. Created dataset with shape {df.shape}")
        print(f"Missing data: {df.isnull().sum().sum()} total missing values")
        
        return df

# Initialize data connector
data_connector = DataConnector()

# Demonstrate different data sources
print("🌍 REAL-WORLD DATA INTEGRATION EXAMPLES")
print("=" * 50)

# 1. Database connection
db_path = data_connector.create_sample_database()

# Query customer data
customer_query = '''
    SELECT 
        c.customer_id,
        c.name,
        c.country,
        c.subscription_tier,
        COUNT(t.transaction_id) as num_transactions,
        SUM(t.amount) as total_spent,
        AVG(t.amount) as avg_transaction
    FROM customers c
    LEFT JOIN transactions t ON c.customer_id = t.customer_id
    GROUP BY c.customer_id
    ORDER BY total_spent DESC
    LIMIT 10
'''

customer_data = data_connector.connect_to_database(db_path, customer_query)
if customer_data is not None:
    print("\n🏆 Top 10 customers by total spent:")
    print(customer_data.head(10))

# 2. Web scraping simulation
scraped_data = data_connector.simulate_web_scraping(30)
print("\n🕷️ Sample scraped data:")
print(scraped_data.head())
print(f"\nData quality check - Missing values per column:")
print(scraped_data.isnull().sum())

## 5. Model Persistence & Production Deployment

**⏱️ Estimated time:** 30 minutes

**Learning objectives:**
- Save and load trained models
- Version control for ML models
- Create prediction APIs
- Monitor model performance
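
The "prediction API" objective can be approximated without a web framework by a plain function that validates its input before calling the model. The sketch below is an assumption-laden illustration: `predict_from_payload` is a hypothetical helper, the toy model and its two feature names are invented for the demo, and a real service would wrap such a function in an HTTP layer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def predict_from_payload(model, feature_names, payload):
    """Validate a JSON-like dict and return a prediction with class probabilities."""
    missing = [name for name in feature_names if name not in payload]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Order columns exactly as at training time; extra keys are ignored.
    row = pd.DataFrame(
        [[payload[name] for name in feature_names]], columns=feature_names
    )
    prediction = model.predict(row)[0]
    probabilities = model.predict_proba(row)[0]
    return {
        "prediction": int(prediction),
        "probabilities": [float(p) for p in probabilities],
    }


# Toy model for demonstration only.
rng = np.random.default_rng(42)
feature_names = ["age", "monthly_spend"]
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=feature_names)
y = (X["age"] + X["monthly_spend"] > 0).astype(int)
model = LogisticRegression().fit(X, y)

response = predict_from_payload(
    model, feature_names, {"age": 2.0, "monthly_spend": 2.0}
)
print(response)
```

Storing the feature names in the model metadata at save time gives the serving code a single source of truth for this validation.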

In [None]:
# Model persistence and versioning

import pickle
import json
from pathlib import Path

class ModelManager:
    """Professional model management and versioning"""
    
    def __init__(self, model_dir='models'):
        self.model_dir = Path(model_dir)
        self.model_dir.mkdir(exist_ok=True)
        
    def save_model(self, model, model_name, version='1.0', metadata=None):
        """Save model with versioning and metadata"""
        
        # Create version directory
        version_dir = self.model_dir / model_name / f"v{version}"
        version_dir.mkdir(parents=True, exist_ok=True)
        
        # Save model
        model_path = version_dir / 'model.pkl'
        
        if HAS_JOBLIB:
            import joblib
            joblib.dump(model, model_path)
        else:
            with open(model_path, 'wb') as f:
                pickle.dump(model, f)
        
        # Save metadata
        model_metadata = {
            'model_name': model_name,
            'version': version,
            'created_at': datetime.now().isoformat(),
            'model_type': type(model).__name__,
            'sklearn_version': sklearn.__version__,
            'python_version': sys.version,
        }
        
        if metadata:
            model_metadata.update(metadata)
        
        metadata_path = version_dir / 'metadata.json'
        with open(metadata_path, 'w') as f:
            json.dump(model_metadata, f, indent=2)
        
        print(f"✅ Model saved: {model_path}")
        print(f"📝 Metadata saved: {metadata_path}")
        
        return str(model_path)
    
    def load_model(self, model_name, version='1.0'):
        """Load model with specified version"""
        
        model_path = self.model_dir / model_name / f"v{version}" / 'model.pkl'
        metadata_path = self.model_dir / model_name / f"v{version}" / 'metadata.json'
        
        if not model_path.exists():
            raise FileNotFoundError(f"Model not found: {model_path}")
        
        # Load model
        if HAS_JOBLIB:
            import joblib
            model = joblib.load(model_path)
        else:
            with open(model_path, 'rb') as f:
                model = pickle.load(f)
        
        # Load metadata
        metadata = {}
        if metadata_path.exists():
            with open(metadata_path, 'r') as f:
                metadata = json.load(f)
        
        print(f"✅ Model loaded: {model_path}")
        print(f"📊 Model type: {metadata.get('model_type', 'Unknown')}")
        print(f"📅 Created: {metadata.get('created_at', 'Unknown')}")
        
        return model, metadata
    
    def list_models(self):
        """List all available models and versions"""
        
        models = []
        for model_dir in self.model_dir.iterdir():
            if model_dir.is_dir():
                for version_dir in model_dir.iterdir():
                    if version_dir.is_dir() and version_dir.name.startswith('v'):
                        metadata_path = version_dir / 'metadata.json'
                        metadata = {}
                        if metadata_path.exists():
                            with open(metadata_path, 'r') as f:
                                metadata = json.load(f)
                        
                        models.append({
                            'name': model_dir.name,
                            'version': version_dir.name[1:],  # Remove 'v' prefix
                            'type': metadata.get('model_type', 'Unknown'),
                            'created': metadata.get('created_at', 'Unknown')
                        })
        
        return models

# Demonstrate model persistence
print("💾 MODEL PERSISTENCE & VERSIONING")
print("=" * 50)

# Initialize model manager
model_manager = ModelManager()

# Train and save a simple model
best_model = RandomForestClassifier(n_estimators=50, random_state=RANDOM_STATE)
best_model.fit(X_processed, high_value_target)

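# Calculate model performance for metadata
# Note: this is accuracy on the training data, so it is optimistic;
# use a held-out set or cross-validation for a realistic estimate
y_pred = best_model.predict(X_processed)
accuracy = accuracy_score(high_value_target, y_pred)

# Save model with metadata
model_metadata = {
    'train_accuracy': float(accuracy),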
    'n_features': X_processed.shape[1],
    'n_samples': X_processed.shape[0],
    'target_distribution': high_value_target.value_counts().to_dict(),
    'feature_names': list(X_processed.columns)
}

model_path = model_manager.save_model(
    model=best_model,
    model_name='customer_value_classifier',
    version='1.0',
    metadata=model_metadata
)

# List available models
print("\n📋 Available Models:")
available_models = model_manager.list_models()
for model in available_models:
    print(f"  {model['name']} v{model['version']} ({model['type']}) - {model['created'][:10]}")

# Load model back
print("\n🔄 Loading Model:")
loaded_model, loaded_metadata = model_manager.load_model('customer_value_classifier', '1.0')

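# Test loaded model: its predictions should match the in-memory model exactly
test_predictions = loaded_model.predict(X_processed[:5])
original_predictions = best_model.predict(X_processed[:5])
assert np.array_equal(test_predictions, original_predictions), \
    "Loaded model predictions differ from the original model"
print(f"\n🧪 Test predictions on first 5 samples: {test_predictions}")
print("✅ Model persistence working correctly!")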
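
Pickle and joblib files can execute arbitrary code when loaded, so a saved model should only be loaded from a trusted location. One way to detect a corrupted or accidentally swapped file is to record a SHA-256 digest next to the model and compare it before loading. The sketch below is independent of `ModelManager`: the helper names `file_sha256` and `verify_model_file` and the `sha256` metadata key are assumptions made for illustration, and a plain dictionary stands in for a trained model.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(model_path, metadata_path):
    """Return True if the model file matches the digest stored in its metadata."""
    with open(metadata_path, "r") as f:
        metadata = json.load(f)
    expected = metadata.get("sha256")  # hypothetical metadata key
    return expected is not None and expected == file_sha256(model_path)


# Round-trip demonstration with a stand-in object instead of a real model
with tempfile.TemporaryDirectory() as tmp:
    demo_model_path = Path(tmp) / "model.pkl"
    demo_metadata_path = Path(tmp) / "metadata.json"

    with open(demo_model_path, "wb") as f:
        pickle.dump({"weights": [0.1, 0.2, 0.3]}, f)
    with open(demo_metadata_path, "w") as f:
        json.dump({"sha256": file_sha256(demo_model_path)}, f)

    intact = verify_model_file(demo_model_path, demo_metadata_path)

    # Simulate corruption by appending one byte to the model file
    with open(demo_model_path, "ab") as f:
        f.write(b"\x00")
    tampered = verify_model_file(demo_model_path, demo_metadata_path)

print(f"Intact file verified: {intact}")
print(f"Modified file verified: {tampered}")
```

A digest stored beside the model catches corruption and mix-ups, but not an attacker who can rewrite both files; for that, the digest would need to live somewhere the attacker cannot modify.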

## 6. Final Project: End-to-End ML Pipeline

**⏱️ Estimated time:** 60 minutes

**Capstone project combining all learned concepts:**
- Data collection and preprocessing
- Model training and evaluation
- Model deployment and monitoring
- Business insights and recommendations
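
Monitoring depends heavily on how a model is deployed, so the following is only a minimal sketch of one common drift check. The Population Stability Index (PSI) compares the binned distribution of a feature at training time with its distribution in recent data. The function name `population_stability_index`, the bin count, and the synthetic data are illustrative assumptions; the 0.1 and 0.25 thresholds mentioned afterwards are a widely used rule of thumb, not a guarantee.

```python
import math
import random


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a new sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant features

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each fraction at eps so the logarithm below is always defined
        return [max(c / len(values), eps) for c in counts]

    e_frac = bin_fractions(expected)
    a_frac = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_frac, a_frac))


# Synthetic feature values: a reference sample, a fresh sample from the
# same distribution, and a sample whose mean has shifted
rng = random.Random(42)
train_feature = [rng.gauss(50, 10) for _ in range(5000)]
same_feature = [rng.gauss(50, 10) for _ in range(5000)]
shifted_feature = [rng.gauss(60, 10) for _ in range(5000)]

psi_same = population_stability_index(train_feature, same_feature)
psi_shifted = population_stability_index(train_feature, shifted_feature)

print(f"PSI, same distribution:    {psi_same:.4f}")
print(f"PSI, shifted distribution: {psi_shifted:.4f}")
```

By the usual rule of thumb, a PSI below 0.1 suggests little change and a PSI above 0.25 suggests a shift worth investigating, for example by retraining or reviewing upstream data sources.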

In [None]:
# Complete End-to-End ML Pipeline

class MLPipeline:
    """Complete ML pipeline for production use"""
    
    def __init__(self, random_state=42):
        self.random_state = random_state
        self.preprocessor = None
        self.model = None
        self.model_metadata = {}
        self.performance_metrics = {}
        
    def preprocess_data(self, df, target_column=None, test_size=0.2):
        """Complete data preprocessing pipeline"""
        
        print("🔧 PREPROCESSING DATA")
        print("=" * 30)
        
        # Separate features and target
        if target_column:
            X = df.drop(columns=[target_column])
            y = df[target_column]
        else:
            X = df
            y = None
        
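        # Remove ID columns: match 'id', a '_id' suffix, or a camel-case 'Id'/'ID'
        # suffix (a plain substring test would also drop 'paid' or 'width')
        id_columns = [
            col for col in X.columns
            if str(col).lower() == 'id'
            or str(col).lower().endswith('_id')
            or str(col).endswith(('Id', 'ID'))
        ]
        X = X.drop(columns=id_columns)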
        
        print(f"Data shape: {X.shape}")
        print(f"Features: {list(X.columns)}")
        
        # Initialize preprocessor
        self.preprocessor = MLPreprocessor(handle_outliers=True)
        
        # Split data if target is provided
        if y is not None:
            # Heuristic: fewer than 20 distinct target values is treated as classification
            is_classification = len(np.unique(y)) < 20
            X_train_raw, X_test_raw, y_train, y_test = train_test_split(
                X, y, test_size=test_size, random_state=self.random_state,
                stratify=y if is_classification else None
            )
            
            # Fit the preprocessor on the training split only, then apply it to
            # the test split, so no test-set statistics leak into training
            X_train = self.preprocessor.fit_transform(X_train_raw)
            X_test = self.preprocessor.transform(X_test_raw)
            
            print(f"Train set: {X_train.shape[0]} samples")
            print(f"Test set: {X_test.shape[0]} samples")
            
            return X_train, X_test, y_train, y_test
        else:
            return self.preprocessor.fit_transform(X)
    
    def train_model(self, X_train, y_train, model_type='auto'):
        """Train the best model for the given problem"""
        
        print("\n🎯 TRAINING MODEL")
        print("=" * 25)
        
        # Determine problem type
        is_classification = len(np.unique(y_train)) < 20
        problem_type = 'classification' if is_classification else 'regression'
        
        print(f"Problem type: {problem_type}")
        
        # Select models based on problem type
        if is_classification:
            models = {
                'Random Forest': RandomForestClassifier(n_estimators=100, random_state=self.random_state),
                'Logistic Regression': LogisticRegression(random_state=self.random_state, max_iter=1000),
                'SVM': SVC(probability=True, random_state=self.random_state)
            }
            scoring = 'accuracy'
        else:
            models = {
                'Random Forest': RandomForestRegressor(n_estimators=100, random_state=self.random_state),
                'Linear Regression': LinearRegression(),
                'Ridge': Ridge(alpha=1.0, random_state=self.random_state)
            }
            scoring = 'r2'
        
        # Compare models using cross-validation
        best_score = -np.inf
        best_model_name = None
        best_model = None
        
        cv_results = {}
        
        for name, model in models.items():
            cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=scoring)
            mean_score = cv_scores.mean()
            cv_results[name] = {
                'mean_score': float(mean_score),
                'std_score': float(cv_scores.std()),
                'scores': cv_scores.tolist()  # plain list keeps metadata JSON-serializable
            }
            
            print(f"{name}: {mean_score:.3f} (+/- {cv_scores.std() * 2:.3f})")
            
            if mean_score > best_score:
                best_score = mean_score
                best_model_name = name
                best_model = model
        
        # Train the best model
        print(f"\n🏆 Best model: {best_model_name} (Score: {best_score:.3f})")
        best_model.fit(X_train, y_train)
        
        self.model = best_model
        self.model_metadata = {
            'model_name': best_model_name,
            'model_type': type(best_model).__name__,
            'problem_type': problem_type,
            'cv_score': best_score,
            'cv_results': cv_results,
            'n_features': X_train.shape[1],
            'n_samples': X_train.shape[0]
        }
        
        return best_model
    
    def evaluate_model(self, X_test, y_test):
        """Comprehensive model evaluation"""
        
        print("\n📊 MODEL EVALUATION")
        print("=" * 25)
        
        if self.model is None:
            raise ValueError("No model trained yet. Call train_model() first.")
        
        # Make predictions
        y_pred = self.model.predict(X_test)
        
        # Calculate metrics based on problem type
        is_classification = self.model_metadata['problem_type'] == 'classification'
        
        if is_classification:
            accuracy = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred, average='macro', zero_division=0)
            recall = recall_score(y_test, y_pred, average='macro', zero_division=0)
            f1 = f1_score(y_test, y_pred, average='macro', zero_division=0)
            
            self.performance_metrics = {
                'accuracy': accuracy,
                'precision': precision,
                'recall': recall,
                'f1_score': f1
            }
            
            print(f"Accuracy: {accuracy:.3f}")
            print(f"Precision: {precision:.3f}")
            print(f"Recall: {recall:.3f}")
            print(f"F1-Score: {f1:.3f}")
            
        else:
            mse = mean_squared_error(y_test, y_pred)
            mae = mean_absolute_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)
            
            self.performance_metrics = {
                'mse': mse,
                'mae': mae,
                'r2_score': r2
            }
            
            print(f"Mean Squared Error: {mse:.3f}")
            print(f"Mean Absolute Error: {mae:.3f}")
            print(f"R² Score: {r2:.3f}")
        
        return self.performance_metrics
    
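    def predict(self, X):
        """Make predictions on new data.
        
        X must contain the same feature columns that were passed to the
        preprocessor (no target or ID columns).
        
        Returns predictions alone, or a (predictions, probabilities) tuple
        when the model exposes predict_proba.
        """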
        
        if self.model is None:
            raise ValueError("No model trained yet. Call train_model() first.")
        
        if self.preprocessor is None:
            raise ValueError("No preprocessor fitted yet. Call preprocess_data() first.")
        
        # Preprocess new data
        X_processed = self.preprocessor.transform(X)
        
        # Make predictions
        predictions = self.model.predict(X_processed)
        
        # Get prediction probabilities for classification
        if hasattr(self.model, 'predict_proba'):
            probabilities = self.model.predict_proba(X_processed)
            return predictions, probabilities
        
        return predictions
    
    def generate_insights(self):
        """Generate business insights from the model"""
        
        print("\n💡 BUSINESS INSIGHTS")
        print("=" * 25)
        
        if self.model is None:
            print("No model available for insights.")
            return
        
        # Feature importance (if available)
        if hasattr(self.model, 'feature_importances_'):
            feature_importance = pd.DataFrame({
                'feature': self.preprocessor.feature_names,
                'importance': self.model.feature_importances_
            }).sort_values('importance', ascending=False)
            
            print("🔍 Top 5 Most Important Features:")
            for i, (_, row) in enumerate(feature_importance.head().iterrows(), 1):
                print(f"  {i}. {row['feature']}: {row['importance']:.3f}")
        
        # Model performance summary
        print(f"\n📈 Model Performance Summary:")
        print(f"  Model: {self.model_metadata.get('model_name', 'Unknown')}")
        print(f"  Problem: {self.model_metadata.get('problem_type', 'Unknown')}")
        print(f"  CV Score: {self.model_metadata.get('cv_score', 0):.3f}")
        
        if self.performance_metrics:
            for metric, value in self.performance_metrics.items():
                print(f"  {metric.replace('_', ' ').title()}: {value:.3f}")

# Demonstrate complete pipeline
print("🚀 COMPLETE ML PIPELINE DEMONSTRATION")
print("=" * 50)

# Initialize pipeline
pipeline = MLPipeline(random_state=RANDOM_STATE)

# Create dataset with target
df_pipeline = df_raw.copy()
df_pipeline['high_value_customer'] = high_value_target

# Run complete pipeline
X_train, X_test, y_train, y_test = pipeline.preprocess_data(df_pipeline, 'high_value_customer')
best_model = pipeline.train_model(X_train, y_train)
performance = pipeline.evaluate_model(X_test, y_test)
pipeline.generate_insights()

print("\n🎉 PIPELINE COMPLETE!")
print("Your ML model is trained and evaluated. Review the test metrics before deploying it.")
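
One detail worth checking in any pipeline like this is the order of splitting and fitting. Statistics used for preprocessing (means, standard deviations, category lists) should be learned from the training split only and then applied unchanged to the test split, so that the test score is not influenced by test data. The sketch below illustrates the idea with a hand-written standard scaler on a single synthetic numeric feature; `fit_scaler` and `apply_scaler` are hypothetical helpers, not part of `MLPreprocessor`.

```python
import random
import statistics


def fit_scaler(values):
    """Learn mean and standard deviation from the given (training) values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against constant features
    return mean, std


def apply_scaler(values, mean, std):
    """Scale values with previously learned statistics."""
    return [(v - mean) / std for v in values]


# Synthetic feature for illustration
rng = random.Random(42)
feature = [rng.gauss(100, 15) for _ in range(1000)]

# 1. Split first (80% train, 20% test)
rng.shuffle(feature)
split = int(len(feature) * 0.8)
train_raw, test_raw = feature[:split], feature[split:]

# 2. Fit the scaler on the training split only
train_mean, train_std = fit_scaler(train_raw)

# 3. Apply the same statistics to both splits
train_scaled = apply_scaler(train_raw, train_mean, train_std)
test_scaled = apply_scaler(test_raw, train_mean, train_std)

print(f"Train mean after scaling: {statistics.fmean(train_scaled):.3f}")
print(f"Test mean after scaling:  {statistics.fmean(test_scaled):.3f}")
```

The scaled training data has mean 0 and standard deviation 1 by construction. The scaled test data is only close to that, which is expected: the test split is treated as unseen data and contributes nothing to the learned statistics.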

## 📝 Final Assessment & Next Steps

### Congratulations! 🎉

You have completed the comprehensive ML & AI 101 training. You now have the skills to:

✅ **Build production-ready ML pipelines** with proper data preprocessing and validation  
✅ **Implement robust model evaluation** using cross-validation and multiple metrics  
✅ **Handle real-world data** from databases, APIs, and web sources  
✅ **Deploy and persist models** with proper versioning and metadata  
✅ **Generate business insights** from ML models and their predictions  

### 🎯 Final Project Checklist

- [ ] Successfully completed all checkpoint assessments
- [ ] Built and evaluated at least 2 different ML models
- [ ] Implemented a complete end-to-end pipeline
- [ ] Generated actionable business insights
- [ ] Saved and loaded models with proper versioning

### 🚀 Next Steps in Your ML Journey

**Immediate (1-2 weeks):**
- Apply these techniques to your own datasets
- Explore additional algorithms (XGBoost, LightGBM)
- Practice with Kaggle competitions

**Short-term (1-3 months):**
- Learn deep learning frameworks (TensorFlow, PyTorch)
- Explore specialized domains (NLP, Computer Vision, Time Series)
- Study MLOps tools (MLflow, Kubeflow, Docker)

**Long-term (3-12 months):**
- Build end-to-end ML applications
- Contribute to open-source ML projects
- Stay updated with latest research and techniques

### 📚 Recommended Resources

- **Books:** "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron
- **Courses:** fast.ai, Coursera Machine Learning Specialization
- **Practice:** Kaggle, Google Colab, GitHub projects
- **Community:** Reddit r/MachineLearning, ML Twitter, local meetups

### 💼 Career Applications

You're now prepared for roles in:
- Data Science and Analytics
- ML Engineering
- Business Intelligence
- Product Analytics
- Consulting and Strategy

**Keep learning, keep building, and keep pushing the boundaries of what's possible with ML!** 🌟