# Module 08: CRISP-DM Framework for Data Science

**Difficulty**: ⭐⭐⭐ (Advanced)

**Estimated Time**: 60 minutes

**Prerequisites**: Module 07 - Data Governance and Ethics

## Learning Objectives

By the end of this notebook, you will be able to:

1. Apply CRISP-DM's 6 iterative phases to real data science projects
2. Compare CRISP-DM with Team Data Science Process (TDSP) and choose appropriate methodologies
3. Adapt methodology for academic vs industry contexts
4. Identify and avoid common data science pitfalls systematically
5. Integrate Agile practices with CRISP-DM for modern development


## Setup

Let's import the libraries we'll use in this notebook.


In [None]:
# Standard data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import FancyBboxPatch, FancyArrowPatch
import warnings
warnings.filterwarnings('ignore')

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print('✓ Libraries imported successfully!')


## 1. Understanding CRISP-DM: The Industry Standard

### What is CRISP-DM?

**CRISP-DM** (Cross-Industry Standard Process for Data Mining) is the most widely adopted methodology for data science projects. Developed between 1996 and 2000 by an industry consortium that included SPSS, NCR, and DaimlerChrysler, it provides a structured, tool-neutral framework for planning and executing data science initiatives.

**Key Statistics (2020 Survey)**:
- 43% of data scientists use CRISP-DM
- 22% use proprietary in-house methodologies
- 15% use newer frameworks like TDSP

### The Six Phases

1. **Business Understanding**: Define objectives, assess situation, define success criteria
2. **Data Understanding**: Collect data, explore, assess quality
3. **Data Preparation**: Select variables, clean, construct features
4. **Modeling**: Select techniques, build models, tune hyperparameters
5. **Evaluation**: Assess performance, review business impact
6. **Deployment**: Plan deployment, implement, monitor

### Iterative Nature

CRISP-DM is NOT linear. Phases loop back based on findings:
- Evaluation shows poor performance → cycle back to Data Preparation
- Data Understanding reveals issues → cycle back to Business Understanding
- Deployment monitoring shows drift → cycle back to Modeling
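
The loop-backs listed above can be sketched as a small lookup table. This is only an illustration, not part of the CRISP-DM specification: `LOOP_BACKS` and `next_phase` are hypothetical names, the finding labels are made up for this example, and only the three loops named above are encoded.

```python
# Hypothetical sketch: the six phases plus the three loop-backs listed above.
PHASES = [
    'Business Understanding',
    'Data Understanding',
    'Data Preparation',
    'Modeling',
    'Evaluation',
    'Deployment',
]

# Maps (current phase, finding) to the phase the project returns to.
LOOP_BACKS = {
    ('Evaluation', 'poor performance'): 'Data Preparation',
    ('Data Understanding', 'data issues'): 'Business Understanding',
    ('Deployment', 'drift detected'): 'Modeling',
}


def next_phase(current, finding=None):
    """Return the loop-back target for a finding, else the next phase in order."""
    if (current, finding) in LOOP_BACKS:
        return LOOP_BACKS[(current, finding)]
    idx = PHASES.index(current)
    # Deployment has no successor; the project stays in monitoring.
    return PHASES[min(idx + 1, len(PHASES) - 1)]


print(next_phase('Modeling'))
print(next_phase('Evaluation', 'poor performance'))
```

Without a finding, the helper simply advances to the next phase, which makes the default path linear and every deviation an explicit, recorded decision.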


In [None]:
# Example: Customer Churn Prediction
crisp_example = {
    'Business Understanding': {
        'Objectives': ['Identify customers likely to churn', 'Reduce churn from 25% to 18%'],
        'Timeline': '2 weeks',
        'Success': 'Model with >85% precision for high-risk customers'
    },
    'Data Understanding': {
        'Sources': ['Customer database', 'Transaction history', 'Support tickets'],
        'Issues': ['12% missing support records', 'Churn definition varies'],
        'Timeline': '3 weeks'
    },
    'Data Preparation': {
        'Actions': ['Standardize churn definition', 'Handle missing values', 'Create features'],
        'Timeline': '4 weeks'
    },
    'Modeling': {
        'Algorithms': ['Logistic Regression', 'Random Forest', 'Gradient Boosting'],
        'Timeline': '2 weeks'
    },
    'Evaluation': {
        'Best Model': 'Gradient Boosting (91% precision, 68% recall)',
        'Decision': 'Proceed with A/B testing',
        'Timeline': '2 weeks'
    },
    'Deployment': {
        'Implementation': 'Score all customers weekly, integrate with CRM',
        'Timeline': '4 weeks',
        'Ongoing': 'Monthly monitoring and quarterly retraining'
    }
}

print('CRISP-DM Application: Customer Churn Prediction')
print('=' * 70)
for phase, details in crisp_example.items():
    print(f'\n{phase}')
    for key, value in details.items():
        if isinstance(value, list):
            print(f'  {key}: {", ".join(value)}')
        else:
            print(f'  {key}: {value}')


## 2. CRISP-DM vs Team Data Science Process (TDSP)

### CRISP-DM Limitations

While CRISP-DM is industry standard, it has limitations:
- Predates big-data tooling, so it gives no guidance on distributed or out-of-core processing
- No MLOps or DevOps integration
- Lacks team collaboration structures
- No version control or experiment tracking
- Silent on cloud infrastructure

### TDSP (Microsoft, 2016)

TDSP modernizes CRISP-DM with:
- **5 lifecycle stages**: business understanding, data acquisition and understanding, modeling, deployment, and customer acceptance
- **Team roles** with clear responsibilities
- **Git-based workflows** mandatory
- **Cloud-first** architecture (Azure focus)
- **Automated pipelines** for data and model management

### Key TDSP Innovations

1. **Team Structure**: Project Lead, Data Scientists, Data Engineers, DevOps
2. **Standardized directories**: Docs/, Data/, Code/, Output/
3. **Version control**: All work in Git with code reviews
4. **Automated testing**: Unit tests, integration tests, CI/CD pipelines

### Comparison

| Aspect | CRISP-DM | TDSP |
|--------|----------|------|
| **Phases** | 6 (detailed) | 5 (lifecycle stages) |
| **Team** | Individual | Cross-functional |
| **Version Control** | Optional | Mandatory |
| **Scalability** | Limited | Cloud-native |
| **Best For** | Individual projects | Enterprise production |
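
The standardized directory layout listed under Key TDSP Innovations can be scaffolded in a few lines. This is a minimal sketch that assumes only the four folder names given in this notebook; `scaffold_project` is a hypothetical helper, and real TDSP project templates may use different names or add subfolders.

```python
import tempfile
from pathlib import Path

# Top-level folders named in this notebook; any subfolders a real
# template might add are not modeled here.
TDSP_DIRS = ['Docs', 'Data', 'Code', 'Output']


def scaffold_project(root):
    """Create the standardized folders under root and return their paths."""
    root = Path(root)
    created = []
    for name in TDSP_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demo in a throwaway directory so nothing is written to the working tree.
with tempfile.TemporaryDirectory() as tmp:
    paths = scaffold_project(Path(tmp) / 'churn_project')
    layout = sorted(p.name for p in paths)
    print(layout)
```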


## 3. Common Data Science Pitfalls

### The Top 5 Pitfalls

**Pitfall 1: Solving the Wrong Problem**
- Building a technically correct solution to a misunderstood problem
- Prevention: Dedicated Business Understanding phase with stakeholder interviews
- Example: Optimizing accuracy when business needs interpretability

**Pitfall 2: Data Leakage**
- Using information in training that wouldn't be available at prediction time
- Prevention: Split data BEFORE preprocessing, fit scalers only on training data
- Example: Scaling or imputing with statistics computed on the full dataset before the train/test split
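
The split-then-fit order from the prevention step can be sketched with the standard library alone. The data here is synthetic and the `scale` helper is a hypothetical stand-in for a fitted scaler; in practice a library scaler would play the same role.

```python
import random
import statistics

random.seed(42)

# Synthetic feature values (hypothetical data, for illustration only).
values = [random.gauss(50, 10) for _ in range(200)]

# 1. Split FIRST, before computing any statistics.
random.shuffle(values)
split = int(0.8 * len(values))
train, test = values[:split], values[split:]

# 2. Fit the scaler (mean and standard deviation) on training data only.
train_mean = statistics.mean(train)
train_std = statistics.stdev(train)


def scale(xs):
    """Standardize values using the training-set statistics."""
    return [(x - train_mean) / train_std for x in xs]


# 3. Apply the same fitted parameters to both splits.
train_scaled = scale(train)
test_scaled = scale(test)

# The leaky alternative would compute these statistics on `values` as a
# whole, letting test-set information influence the transformation.
print(f'train mean after scaling: {statistics.mean(train_scaled):.3f}')
print(f'test mean after scaling:  {statistics.mean(test_scaled):.3f}')
```

The scaled test set will generally not have a mean of exactly zero. That is expected: it was transformed with parameters it had no part in estimating.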

**Pitfall 3: Correlation ≠ Causation**
- Claiming X causes Y based on correlation alone
- Prevention: Create causal diagrams, identify confounding variables
- Example: Ice cream sales correlate with drowning (both driven by temperature)
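
The ice cream example can be reproduced with a small simulation. This is a sketch on synthetic data with made-up coefficients: temperature drives both variables, and neither influences the other. The `pearson` and `residuals` helpers are hypothetical names written for this illustration.

```python
import random

random.seed(0)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """Residuals of a least-squares line fitting ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


# Synthetic data: temperature drives both variables.
temperature = [random.gauss(25, 5) for _ in range(2000)]
ice_cream = [2.0 * t + random.gauss(0, 3) for t in temperature]
drownings = [0.5 * t + random.gauss(0, 2) for t in temperature]

raw_corr = pearson(ice_cream, drownings)
# Partial correlation: correlate what is left after removing temperature.
partial_corr = pearson(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))

print(f'raw correlation:     {raw_corr:.2f}')
print(f'partial correlation: {partial_corr:.2f}')
```

The raw correlation is strong, while the correlation that remains after controlling for temperature is close to zero, which is the signature of a confounder.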

**Pitfall 4: Overfitting**
- Model fits training noise rather than true signal
- Prevention: Cross-validation, regularization, hold-out test set
- Example: 99% train accuracy, 65% test accuracy
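
Cross-validation rests on a simple index split, sketched below without any ML library. Both `kfold_indices` and `overfit_gap` are hypothetical helpers, and the 0.10 tolerance is an arbitrary illustrative threshold rather than an established rule.

```python
import random


def kfold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # A stride of k gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def overfit_gap(train_score, val_score, tolerance=0.10):
    """Flag a train/validation gap larger than an arbitrary tolerance."""
    return (train_score - val_score) > tolerance


splits = list(kfold_indices(23, k=5))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f'fold {fold}: {len(train_idx)} train, {len(val_idx)} validation')

# The example from the text: 99% train accuracy versus 65% test accuracy.
print(overfit_gap(0.99, 0.65))
```

Every sample lands in exactly one validation fold, so each observation is used for validation once and for training in the remaining folds.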

**Pitfall 5: P-Hacking**
- Testing many hypotheses and reporting only significant ones
- Prevention: Pre-register hypotheses, use multiple comparison corrections
- Example: Try 20 predictors, report the 1 with p < 0.05
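
The arithmetic behind the 20-predictor example is worth making explicit. Assuming the 20 tests are independent and every null hypothesis is true, the chance of at least one spurious "significant" result is 1 - 0.95^20, or about 64%. The sketch below computes that figure and the Bonferroni-corrected threshold; the function names are hypothetical.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


n_predictors = 20
fwer = family_wise_error_rate(n_predictors)
threshold = bonferroni_threshold(n_predictors)

print(f'Chance of at least one false positive: {fwer:.0%}')
print(f'Bonferroni per-test threshold: {threshold:.4f}')
```

Bonferroni is the simplest correction and tends to be conservative; it is used here only because it fits in one line.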


## 4. Academic vs Industry Applications

### Key Differences

| Aspect | Academic | Industry |
|--------|----------|----------|
| **Goal** | Advance knowledge, publish | Drive business value |
| **Timeline** | 6-12 months | Days to weeks |
| **Success Metric** | Novel insights, strong evidence | ROI, customer impact |
| **Team** | Individual or small group | Cross-functional |
| **Reproducibility** | Critical and mandatory | Important but business-driven |
| **Deployment** | Publication, one-time | Continuous iteration and monitoring |
| **Structure** | IMRaD (Intro, Methods, Results, Discussion) | Agile sprints |

### Adapting CRISP-DM for Academic Research

Map CRISP-DM phases to IMRaD:
- Business Understanding → Introduction + Literature Review
- Data Understanding → Methods (Data Collection)
- Data Preparation → Methods (Preprocessing)
- Modeling → Methods (Statistical Techniques)
- Evaluation → Results + Discussion
- Deployment → Conclusions + Future Work

### Adapting CRISP-DM for Industry

Follow Agile principles with CRISP-DM:
- **1-2 week sprints**: Each sprint covers a vertical slice (end-to-end)
- **MVP approach**: Build minimum viable product first
- **Iterative refinement**: Weekly stakeholder demos
- **A/B testing**: Measure business impact before full rollout
- **Continuous monitoring**: Track performance and data drift
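
For the A/B testing step, one common way to check whether a difference in churn rates is more than noise is a two-proportion z-test. The sketch below uses only the standard library; the counts are hypothetical, and `two_proportion_z_test` is an illustrative helper rather than a library function.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical A/B result: 250 of 1000 control customers churned versus
# 200 of 1000 customers who received model-driven retention offers.
z, p_value = two_proportion_z_test(events_a=250, n_a=1000,
                                   events_b=200, n_b=1000)
print(f'z = {z:.2f}, p = {p_value:.4f}')
```

Statistical significance is only part of the rollout decision; the size of the effect still has to justify the cost of the intervention.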


In [None]:
# Pitfall Detection System
class PitfallDetector:
    def __init__(self):
        self.pitfalls = {
            'wrong_problem': {
                'name': 'Solving the Wrong Problem',
                'checklist': [
                    'Success criteria documented',
                    'Business metrics defined',
                    'Stakeholder sign-off obtained',
                    'Technical metrics mapped to business metrics'
                ]
            },
            'data_leakage': {
                'name': 'Data Leakage',
                'checklist': [
                    'Train/test split done FIRST',
                    'Scalers fitted only on training data',
                    'No future information in features',
                    'Temporal order preserved for time-series'
                ]
            },
            'overfitting': {
                'name': 'Overfitting',
                'checklist': [
                    'Cross-validation used',
                    'Regularization applied',
                    'Test set held out from tuning',
                    'Simple models compared to complex'
                ]
            }
        }
    
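    def check_pitfall(self, pitfall_key, items_completed):
        if pitfall_key not in self.pitfalls:
            return {'error': 'Pitfall key not found'}

        pitfall = self.pitfalls[pitfall_key]
        n_items = len(pitfall['checklist'])
        # Require one status flag per checklist item so the rate is not skewed
        if len(items_completed) != n_items:
            return {'error': f'Expected {n_items} items, got {len(items_completed)}'}
        items_done = sum(bool(item) for item in items_completed)
        completion_rate = items_done / n_items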
        
        if completion_rate >= 0.75:
            risk = 'LOW RISK'
        elif completion_rate >= 0.5:
            risk = 'MEDIUM RISK'
        else:
            risk = 'HIGH RISK'
        
        return {
            'pitfall': pitfall['name'],
            'risk_level': risk,
            'completion': f'{items_done}/{n_items}',
            'completion_rate': f'{completion_rate*100:.0f}%',
            'checklist': pitfall['checklist'],
            'completed': items_completed
        }

# Example assessment
detector = PitfallDetector()
project = {
    'wrong_problem': [True, True, True, False],
    'data_leakage': [True, True, False, True],
    'overfitting': [True, True, True, True]
}

print('Project Pitfall Assessment')
print('=' * 60)
for key, items in project.items():
    result = detector.check_pitfall(key, items)
    print(f'\n{result["pitfall"]}: {result["risk_level"]}')
    print(f'  Completion: {result["completion"]} ({result["completion_rate"]})')


## 5. Hybrid Agile-CRISP-DM Framework

### Why Hybrid?

Modern projects need:
- **CRISP-DM's structure** for proper methodology
- **Agile's flexibility** for rapid iteration

### The Hybrid Approach

**Week 1: Discovery Sprint**
- CRISP Phase: Business Understanding
- Duration: 1 week
- Activities: Stakeholder interviews, objectives, data audit
- Deliverable: Project Charter

**Weeks 2-N: Development Sprints**
- CRISP Phases: Data Understanding through Evaluation
- Duration: 1-2 weeks per sprint
- Activities: Daily standups, mid-sprint checks, sprint reviews
- Deliverable: Incremental model improvements

**Final Weeks: Deployment Sprint**
- CRISP Phase: Deployment
- Duration: 1-2 weeks
- Activities: Code review, testing, A/B test design, production deployment
- Deliverable: Production model + monitoring

**Ongoing: Support Phase**
- CRISP Phase: Deployment (Monitoring)
- Duration: Ongoing (1-2 hours/week)
- Activities: Performance monitoring, retraining planning
- Deliverable: Health metrics
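
One possible health metric for the support phase is the Population Stability Index (PSI), which compares a feature's distribution at training time with its recent distribution. The sketch below is a minimal standard-library version on synthetic data; the equal-width binning, the `1e-6` floor, and the function name are assumptions made for this example. Commonly quoted rules of thumb treat values above roughly 0.25 as a significant shift, but teams should calibrate their own thresholds.

```python
import math
import random


def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline sample and a recent sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the baseline range land in edge bins.
            idx = int((v - lo) / width)
            counts[max(0, min(idx, n_bins - 1))] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(42)
baseline = [rng.gauss(0, 1) for _ in range(5000)]
same_dist = [rng.gauss(0, 1) for _ in range(5000)]
shifted = [rng.gauss(1, 1) for _ in range(5000)]  # simulated drift

psi_same = population_stability_index(baseline, same_dist)
psi_shifted = population_stability_index(baseline, shifted)
print(f'PSI, no drift: {psi_same:.3f}')
print(f'PSI, shifted:  {psi_shifted:.3f}')
```

A rising PSI on key features is one trigger for the loop back from Deployment to Modeling.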


In [None]:
# CRISP-DM Project Tracker
class CRISPDMTracker:
    def __init__(self, project_name):
        self.project_name = project_name
        self.phases = {
            1: {'name': 'Business Understanding', 'completion': 0},
            2: {'name': 'Data Understanding', 'completion': 0},
            3: {'name': 'Data Preparation', 'completion': 0},
            4: {'name': 'Modeling', 'completion': 0},
            5: {'name': 'Evaluation', 'completion': 0},
            6: {'name': 'Deployment', 'completion': 0}
        }
    
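    def update_phase(self, phase_num, completion_pct):
        if phase_num not in self.phases:
            raise ValueError(f'Unknown phase number: {phase_num}')
        # Clamp to 0-100 and store an int so the progress-bar arithmetic works
        self.phases[phase_num]['completion'] = int(max(0, min(completion_pct, 100)))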
    
    def get_status(self):
        total = sum(p['completion'] for p in self.phases.values())
        return total / len(self.phases)
    
    def report(self):
        print(f'Project: {self.project_name}')
        print('=' * 60)
        print(f'Overall Completion: {self.get_status():.0f}%\n')
        for phase_num in sorted(self.phases.keys()):
            p = self.phases[phase_num]
            bar = '█' * (p['completion'] // 5) + '░' * (20 - p['completion'] // 5)
            print(f'Phase {phase_num}: {p["name"]}')
            print(f'  [{bar}] {p["completion"]}%')

# Example
tracker = CRISPDMTracker('Customer Churn Prediction')
tracker.update_phase(1, 100)
tracker.update_phase(2, 100)
tracker.update_phase(3, 75)
tracker.update_phase(4, 50)
tracker.update_phase(5, 0)
tracker.update_phase(6, 0)
tracker.report()


## Exercises

### Exercise 1: CRISP-DM Phase Planning

Project: "Predict equipment failure for manufacturing plant"

**Task**: For each CRISP-DM phase, identify 3-4 key activities and questions.

Consider:
- What does the business care about?
- What data exists in the manufacturing systems?
- How will predictions be deployed to production?


In [None]:
# Exercise 1 Template: Equipment Failure Prediction
phases_template = {
    'Business Understanding': {
        'Questions': ['???', '???', '???'],
        'Activities': ['???', '???', '???']
    },
    'Data Understanding': {
        'Questions': ['???', '???', '???'],
        'Activities': ['???', '???', '???']
    },
    'Data Preparation': {
        'Questions': ['???', '???', '???'],
        'Activities': ['???', '???', '???']
    },
    'Modeling': {
        'Questions': ['???', '???', '???'],
        'Activities': ['???', '???', '???']
    },
    'Evaluation': {
        'Questions': ['???', '???', '???'],
        'Activities': ['???', '???', '???']
    },
    'Deployment': {
        'Questions': ['???', '???', '???'],
        'Activities': ['???', '???', '???']
    }
}

print('Exercise 1: Equipment Failure Prediction')
print('Fill in the ??? marks for each phase')


### Exercise 2: Identifying Pitfalls

A team completed this analysis:
- Model: 95% accuracy on test set
- Used 95% train, 5% test (no validation)
- Scaled ALL features using entire dataset statistics
- Found "customer service calls" is strongest predictor
- Tried 47 feature combinations, reported top 5
- No stakeholder interviews

**Task**: Identify the pitfalls and what should be done.


In [None]:
# Exercise 2 Template
findings = {
    'Finding 1: 95% train/5% test split': {
        'Pitfall': 'Overfitting',
        'Solution': 'Use 60/20/20 split + cross-validation'
    },
    'Finding 2: No stakeholder interviews': {
        'Pitfall': '???',
        'Solution': '???'
    },
    'Finding 3: Scaled using entire dataset': {
        'Pitfall': '???',
        'Solution': '???'
    },
    'Finding 4: Service calls is top predictor': {
        'Pitfall': '???',
        'Solution': '???'
    },
    'Finding 5: Tried 47 feature combinations, reported top 5': {
        'Pitfall': '???',
        'Solution': '???'
    }
}

print('Exercise 2: Identify Pitfalls')
print('Options: Overfitting, Data Leakage, Wrong Problem,')
print('         Correlation≠Causation, P-Hacking')


### Exercise 3: Design Hybrid Agile-CRISP-DM Project

Project: "Optimize pricing strategy based on competitor data"

**Task**: Create a 6-week sprint plan:
- Week 1: Discovery
- Weeks 2-4: Development
- Week 5: Validation
- Week 6: Deployment

For each week, specify:
- Key activities
- CRISP-DM phases
- Deliverables
- Success metrics


In [None]:
# Exercise 3 Template: Pricing Optimization Sprint Plan
sprint_plan = {
    'Week 1: Discovery': {
        'CRISP Phase': 'Business Understanding',
        'Activities': ['???', '???', '???'],
        'Deliverables': ['???'],
        'Metrics': ['???']
    },
    'Week 2: Data Exploration': {
        'CRISP Phase': 'Data Understanding',
        'Activities': ['???', '???', '???'],
        'Deliverables': ['???'],
        'Metrics': ['???']
    },
    'Week 3: Preparation & Modeling': {
        'CRISP Phase': 'Data Preparation + Modeling',
        'Activities': ['???', '???', '???'],
        'Deliverables': ['???'],
        'Metrics': ['???']
    },
    'Week 4: Refinement': {
        'CRISP Phase': 'Modeling + Evaluation',
        'Activities': ['???', '???', '???'],
        'Deliverables': ['???'],
        'Metrics': ['???']
    },
    'Week 5: Validation': {
        'CRISP Phase': 'Evaluation',
        'Activities': ['???', '???', '???'],
        'Deliverables': ['???'],
        'Metrics': ['???']
    },
    'Week 6: Deployment': {
        'CRISP Phase': 'Deployment',
        'Activities': ['???', '???', '???'],
        'Deliverables': ['???'],
        'Metrics': ['???']
    }
}

print('Exercise 3: Design your hybrid project sprint plan')


## Summary

### Key Takeaways

✅ **CRISP-DM** is the industry-standard 6-phase methodology used by 43% of data scientists
- Iterative, not linear - cycles based on findings

✅ **TDSP** modernizes CRISP-DM for enterprise projects
- Better for large-scale, team-based, production systems

✅ **Academic and industry** contexts require different adaptations
- Academic: IMRaD structure, statistical rigor, reproducibility
- Industry: Agile sprints, MVP approach, continuous iteration

✅ **Five common pitfalls** have methodological solutions
- Wrong problem → Dedicated Business Understanding
- Data leakage → Split before preprocessing, fit transformers on training data only
- Correlation ≠ causation → Causal diagrams + domain expertise
- Overfitting → Cross-validation + regularization
- P-hacking → Pre-registration + multiple comparison corrections

✅ **Hybrid Agile-CRISP-DM** combines structure with flexibility
- Sprint-based iteration with CRISP-DM rigor
- Weekly stakeholder demos, daily standups

### What's Next?

In **Module 09: MLOps and Model Lifecycle Management**, you'll learn:
- How to operationalize CRISP-DM in production systems
- Model versioning, experimentation tracking, and deployment
- Monitoring and continuous retraining frameworks


## Self-Assessment

Before moving to the next module, ensure you can:

- [ ] Explain all 6 CRISP-DM phases and how they iterate
- [ ] Contrast CRISP-DM with TDSP and choose appropriate frameworks
- [ ] Adapt CRISP-DM for academic research (IMRaD structure)
- [ ] Adapt CRISP-DM for industry (Agile sprints)
- [ ] Identify the 5 major data science pitfalls
- [ ] Describe prevention strategies for each pitfall
- [ ] Design a hybrid Agile-CRISP-DM project plan
- [ ] Track project progress through CRISP-DM phases
- [ ] Apply CRISP-DM to a real problem

If you can confidently check all boxes, you're ready for Module 09! 🎉
