# Chapter 88: Project Management

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the unique challenges of managing machine learning projects compared to traditional software development.
- Choose an appropriate project management methodology (Agile, Scrum, Kanban) for a time‑series prediction system.
- Conduct effective sprint planning, backlog grooming, and estimation.
- Identify and manage stakeholders, including business sponsors, domain experts, and end users.
- Recognize and mitigate technical, data, and business risks.
- Plan resources (people, compute, budget) for the lifecycle of an ML project.
- Track progress using metrics like velocity, burndown charts, and key performance indicators.
- Communicate status effectively through reports and demos.
- Run productive retrospectives to drive continuous improvement.
- Apply best practices tailored to ML projects, such as managing data dependencies and ensuring reproducibility.

---

## **88.1 Introduction to Project Management for ML Systems**

Building a time‑series prediction system like the NEPSE stock predictor is not just about writing code and training models. It involves coordinating people, managing expectations, mitigating risks, and delivering value to stakeholders. Project management provides the framework to do this systematically.

Machine learning projects differ from traditional software projects in several ways:

- **Experimentation**: The path to a good model is uncertain; multiple approaches may fail before one succeeds.
- **Data dependencies**: Model performance depends on data quality and availability, which may be outside the team's control.
- **Reproducibility**: Experiments must be reproducible, requiring careful tracking of code, data, and hyperparameters.
- **Model decay**: Model performance degrades over time as the relationship between inputs and targets shifts (concept drift), so deployed models need ongoing monitoring and retraining.
- **Interdisciplinary teams**: Data scientists, engineers, and domain experts must collaborate closely.

These differences mean that project management practices must be adapted. This chapter will guide you through the essential elements, using the NEPSE system as a running example.

---

## **88.2 Choosing a Methodology: Agile, Scrum, or Kanban?**

### **88.2.1 Waterfall vs. Agile**
Traditional **Waterfall** (requirements → design → implementation → testing → deployment) is ill‑suited to ML: a requirement fixed up front (e.g., a target accuracy) may turn out to be unachievable, and the best approach only emerges through experimentation. **Agile** methodologies embrace change and iterative delivery, which aligns well with ML.

### **88.2.2 Scrum**
Scrum is a popular Agile framework with fixed‑length iterations (sprints), typically 1‑4 weeks. It includes roles (Product Owner, Scrum Master, Development Team) and ceremonies (Sprint Planning, Daily Stand‑up, Sprint Review, Retrospective). Scrum works well when the team has a clear backlog and can commit to a set of work each sprint.

For the NEPSE project, a 2‑week sprint might include tasks like:
- Implement a new feature (e.g., RSI calculation).
- Train and evaluate an LSTM model.
- Fix a bug in the data ingestion pipeline.
- Set up monitoring for prediction errors.

### **88.2.3 Kanban**
Kanban is a continuous flow method, ideal for teams with varying priorities or unpredictable workloads. Work items are pulled from a backlog as capacity allows. Kanban is often used for maintenance or support teams, but can also work for ML when experiments have uncertain duration.

### **88.2.4 Hybrid Approaches**
Many teams use a hybrid: Scrum for planned feature work, and Kanban for ad‑hoc requests or research spikes. For example, the NEPSE team might have a sprint for feature engineering and model training, while also maintaining a Kanban board for urgent bug fixes.

**Recommendation**: Start with Scrum if you have a dedicated team and a clear product roadmap. Use Kanban if the work is highly variable or if you are in a research phase. Most ML projects benefit from the structure of sprints to ensure regular progress.

---

## **88.3 Sprint Planning**

Sprint planning is a ceremony where the team selects a set of backlog items to complete in the upcoming sprint, and defines a sprint goal.

### **88.3.1 Preparing the Backlog**
Before sprint planning, the Product Owner should groom the backlog: ensure items are well‑defined, prioritized, and estimated. Each item (user story or task) should have:

- A clear description.
- Acceptance criteria.
- Dependencies identified.
- An estimate of effort.

### **88.3.2 Estimation**
Estimation in ML projects is notoriously difficult because of uncertainty. Common techniques:

- **Story points**: Relative sizing (e.g., 1, 2, 3, 5, 8, 13). The team agrees on a baseline story (e.g., “add a simple lag feature” = 2 points) and compares others.
- **T‑shirt sizes**: Small, Medium, Large, Extra‑Large.
- **Time‑based**: Estimates in hours or days. These tend to be unreliable for exploratory work, where the outcome of an experiment is not known in advance.

For ML tasks, break down work into smaller, more predictable chunks. For example, “Build LSTM model” could be split into:

- Prepare sequence data for LSTM (2 points)
- Implement LSTM architecture (3 points)
- Train and tune hyperparameters (5 points – uncertain)
- Evaluate and document (1 point)
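
The breakdown above can be captured in a few lines to see how much of the total estimate is uncertain. This is only a sketch: the `Subtask` class and the `uncertain` flag are illustrative names, not part of any project management tool.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    points: int
    uncertain: bool = False  # flag exploratory work whose estimate may slip


# The "Build LSTM model" breakdown from the text.
subtasks = [
    Subtask("Prepare sequence data for LSTM", 2),
    Subtask("Implement LSTM architecture", 3),
    Subtask("Train and tune hyperparameters", 5, uncertain=True),
    Subtask("Evaluate and document", 1),
]

total_points = sum(t.points for t in subtasks)
uncertain_points = sum(t.points for t in subtasks if t.uncertain)
uncertain_share = uncertain_points / total_points

print(f"Total: {total_points} points")
print(f"Uncertain: {uncertain_points} points ({uncertain_share:.0%} of the total)")
```

When a large share of the estimate sits in uncertain subtasks, it may be worth time-boxing those subtasks rather than committing to a fixed outcome.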

### **88.3.3 Capacity Planning**
Determine the team’s available capacity for the sprint (e.g., 4 developers × 10 days = 40 person‑days, minus meetings, support, etc.). Select items whose total estimated effort fits within capacity.
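
A minimal sketch of this calculation follows. The 25% overhead fraction, the backlog items, and their person-day estimates are assumptions chosen for illustration; the function names are hypothetical.

```python
def sprint_capacity(developers: int, sprint_days: int, overhead_fraction: float) -> float:
    """Net person-days available after subtracting meetings, support, etc."""
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return developers * sprint_days * (1 - overhead_fraction)


def select_items(backlog, capacity):
    """Walk the backlog in priority order and take every item that still fits."""
    selected, remaining = [], capacity
    for name, estimate in backlog:
        if estimate <= remaining:
            selected.append(name)
            remaining -= estimate
    return selected, remaining


# 4 developers x 10 days = 40 gross person-days; assume 25% goes to overhead.
net_days = sprint_capacity(developers=4, sprint_days=10, overhead_fraction=0.25)

# Hypothetical backlog, already sorted by priority: (item, estimate in person-days).
backlog = [
    ("RSI feature", 5),
    ("LSTM baseline", 15),
    ("Ingestion bug fix", 3),
    ("Prediction error monitoring", 10),
]

selected, slack = select_items(backlog, net_days)
print(f"Net capacity: {net_days} person-days")
print(f"Selected: {selected}, unallocated: {slack} person-days")
```

The greedy walk is deliberately simple. In practice the team would discuss which items to pull in rather than fill capacity mechanically.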

### **88.3.4 Sprint Goal**
Define a concise goal that summarises the sprint’s objective. Example: “Deliver an initial LSTM model with baseline performance and integrate it into the prediction service for testing.”

---

## **88.4 Backlog Management**

The backlog is a living document of all work to be done. Keeping it healthy is crucial.

### **88.4.1 Grooming / Refinement**
Regularly (e.g., mid‑sprint) review and update backlog items:

- Remove items that are no longer relevant.
- Break large epics into smaller stories.
- Add new items based on stakeholder feedback.
- Re‑prioritize based on business value and dependencies.

### **88.4.2 Prioritization Frameworks**
- **MoSCoW**: Must have, Should have, Could have, Won't have. Helps focus on essentials.
- **Value vs. Effort**: Plot items on a 2x2 matrix; focus on high‑value, low‑effort items first.
- **Weighted Shortest Job First (WSJF)**: Used in SAFe; ranks items by cost of delay divided by job duration (or size), so that short, high‑value jobs come first.

For the NEPSE system, a must‑have might be “ensure data ingestion runs daily without failure”, while a could‑have might be “add sentiment analysis from news”.
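
As a rough illustration of WSJF, the snippet below ranks three NEPSE backlog items. The cost-of-delay and duration scores are made-up relative values, not measurements.

```python
# Hypothetical relative scores; higher cost_of_delay means waiting hurts more.
backlog = [
    {"item": "Daily ingestion reliability", "cost_of_delay": 20, "duration": 4},
    {"item": "RSI feature", "cost_of_delay": 8, "duration": 2},
    {"item": "News sentiment analysis", "cost_of_delay": 6, "duration": 8},
]

# WSJF = cost of delay / job duration.
for entry in backlog:
    entry["wsjf"] = entry["cost_of_delay"] / entry["duration"]

# Highest WSJF first.
ranked = sorted(backlog, key=lambda entry: entry["wsjf"], reverse=True)

for entry in ranked:
    print(f"{entry['item']:<30} WSJF = {entry['wsjf']:.2f}")
```

With these scores the must-have ingestion work ranks first and the could-have sentiment analysis ranks last, which matches the MoSCoW classification above.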

### **88.4.3 Technical Debt**
Include time in the backlog for refactoring and addressing technical debt (Chapter 86). If ignored, it will slow down future development.

---

## **88.5 Stakeholder Management**

Stakeholders are individuals or groups with an interest in the project. They can include:

- Business sponsors (e.g., executives funding the project).
- Domain experts (e.g., financial analysts for NEPSE).
- End users (e.g., traders who will use predictions).
- Operations teams (who will maintain the system).
- Regulators (if in a regulated industry).

### **88.5.1 Identifying Stakeholders**
Create a stakeholder map: list all stakeholders, their interest, influence, and communication needs.

### **88.5.2 Communication Plan**
Define how and when to communicate with each stakeholder group:

- **Weekly status report** for sponsors: high‑level progress, risks, next steps.
- **Monthly demo** for domain experts: show new features, get feedback.
- **Quarterly review** with executives: align on strategic goals.
- **Daily stand‑up** for the team (internal).

### **88.5.3 Managing Expectations**
ML projects often face an “expectation gap”: stakeholders may expect 100% accuracy, which no forecasting model can deliver. Explain early that predictions carry uncertainty and that the model improves iteratively. Report concrete metrics (e.g., MAE, accuracy) against a benchmark so that progress is measurable.

For the NEPSE system, you might show that the current model has an MAE of 12 points against 15 points for a naive forecast (the previous day's close): a 20% reduction in error, but not zero error. Explain that predictions are a tool, not a crystal ball.
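
A comparison against the naive previous-close baseline can be sketched as follows. The closing values and model predictions are invented for illustration and are not NEPSE data.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up daily closes (day 1 to day 5) and model predictions for days 2 to 5.
closes = [2000.0, 2010.0, 1995.0, 2020.0, 2030.0]
model_preds = [2006.0, 2000.0, 2012.0, 2026.0]

actual = closes[1:]        # values to be predicted
naive_preds = closes[:-1]  # naive forecast: yesterday's close

model_mae = mean_absolute_error(actual, model_preds)
naive_mae = mean_absolute_error(actual, naive_preds)

# Fraction of the naive error removed by the model (0 = no better than naive).
improvement = 1 - model_mae / naive_mae

print(f"Model MAE: {model_mae:.2f}, naive MAE: {naive_mae:.2f}")
print(f"Improvement over naive: {improvement:.0%}")
```

Reporting the improvement relative to a baseline tends to be easier for stakeholders to interpret than a raw error figure.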

### **88.5.4 Feedback Loops**
Involve stakeholders early and often. A quick demo after each sprint can catch misunderstandings and incorporate valuable feedback.

---

## **88.6 Risk Management**

Risk management is the process of identifying, assessing, and mitigating risks that could derail the project.

### **88.6.1 Types of Risks**
- **Technical risks**: Model may not achieve required accuracy; infrastructure may not scale; integration challenges.
- **Data risks**: Data quality issues; data not available; concept drift.
- **Business risks**: Changing requirements; loss of stakeholder support; competitor moves.
- **Resource risks**: Key person leaving; insufficient compute budget.
- **Regulatory risks**: Compliance with data protection laws; model explainability requirements.

### **88.6.2 Risk Register**
Maintain a simple table:

| Risk | Likelihood (1‑5) | Impact (1‑5) | Score | Mitigation |
|------|------------------|--------------|-------|------------|
| Data feed from NEPSE API fails | 3 | 5 | 15 | Implement fallback (manual upload); monitor feed health |
| Model accuracy drops after 6 months | 4 | 4 | 16 | Set up automated drift detection; plan monthly retraining |
| Key data scientist leaves | 2 | 5 | 10 | Cross‑train other team members; document model code and rationale |
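
The Score column is likelihood multiplied by impact. A small sketch of the same register in code, which ranks risks and flags the ones that need a concrete mitigation plan, is shown below. The `Risk` class and the threshold of 15 are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    likelihood: int  # 1 (rare) to 5 (almost certain)
    impact: int      # 1 (minor) to 5 (severe)
    mitigation: str

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


register = [
    Risk("Data feed from NEPSE API fails", 3, 5,
         "Implement fallback (manual upload); monitor feed health"),
    Risk("Model accuracy drops after 6 months", 4, 4,
         "Set up automated drift detection; plan monthly retraining"),
    Risk("Key data scientist leaves", 2, 5,
         "Cross-train other team members; document model code and rationale"),
]

HIGH_SCORE_THRESHOLD = 15  # hypothetical cut-off for "needs a concrete action"

ranked = sorted(register, key=lambda risk: risk.score, reverse=True)
high_risks = [risk for risk in ranked if risk.score >= HIGH_SCORE_THRESHOLD]

for risk in ranked:
    flag = "HIGH" if risk in high_risks else "    "
    print(f"{flag} {risk.score:>2}  {risk.description}")
```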

### **88.6.3 Mitigation Strategies**
For high‑score risks, define concrete actions. For example, for the “data feed fails” risk, the mitigation could be: “Build a monitoring alert that notifies the team if no new data arrives by 9am; have a manual upload process documented.”
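
The freshness check behind such an alert might look like the sketch below. It assumes the pipeline records the timestamp of the last successful data arrival, and it ignores weekends, market holidays, and time zones, all of which a real check would need to handle. The function names and the `notify` hook are hypothetical.

```python
from datetime import datetime, time


def feed_is_stale(last_arrival: datetime, now: datetime,
                  deadline: time = time(9, 0)) -> bool:
    """True if today's deadline has passed and no data has arrived today."""
    if now.time() < deadline:
        return False  # too early to judge
    return last_arrival.date() < now.date()


def check_feed(last_arrival: datetime, now: datetime, notify=print) -> bool:
    """Send an alert through `notify` when the feed is stale."""
    stale = feed_is_stale(last_arrival, now)
    if stale:
        notify(
            f"ALERT: no new data since {last_arrival:%Y-%m-%d %H:%M}; "
            "follow the documented manual upload process."
        )
    return stale


# Example: last data arrived yesterday evening, and it is now 09:05.
check_feed(
    last_arrival=datetime(2024, 1, 9, 18, 0),
    now=datetime(2024, 1, 10, 9, 5),
)
```

Passing `notify` as a parameter keeps the check testable; in production it could be replaced by whatever alerting channel the team already uses.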

### **88.6.4 Review Risks Regularly**
Revisit the risk register at least monthly, or when significant changes occur. Update likelihoods and impacts as the project progresses.

---

## **88.7 Resource Planning**

### **88.7.1 People**
Identify the skills needed: data engineering, ML, software development, DevOps, domain expertise. Plan for hiring or training if gaps exist. Consider using contractors or consultants for specialised needs.

For the NEPSE team, you might need:

- 1 data engineer (2 months)
- 2 ML engineers (ongoing)
- 1 backend developer (3 months for API)
- 1 DevOps (part‑time)

### **88.7.2 Compute**
ML training requires compute resources. Estimate:

- Data size and expected model complexity.
- Number of experiments.
- Need for GPUs.

Plan the cloud budget accordingly. Spot (preemptible) instances can reduce cost, but they can be interrupted at short notice, so checkpoint long training runs. For the NEPSE project, a small GPU instance might suffice for occasional LSTM training, while XGBoost can run on CPUs.
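
A back-of-the-envelope estimate can be scripted so that the assumptions are explicit and easy to revise. Every number below (experiment count, hours per run, hourly rate, spot discount) is a placeholder, not a real cloud price.

```python
def estimate_training_cost(num_experiments: int, hours_per_experiment: float,
                           hourly_rate: float, spot_discount: float = 0.0) -> float:
    """Estimated compute cost; spot_discount is the fraction saved vs. on-demand."""
    if not 0 <= spot_discount < 1:
        raise ValueError("spot_discount must be in [0, 1)")
    return num_experiments * hours_per_experiment * hourly_rate * (1 - spot_discount)


# Placeholder assumptions for one quarter of LSTM experiments.
on_demand_cost = estimate_training_cost(
    num_experiments=40, hours_per_experiment=1.5, hourly_rate=1.00
)
spot_cost = estimate_training_cost(
    num_experiments=40, hours_per_experiment=1.5, hourly_rate=1.00, spot_discount=0.5
)

print(f"On-demand estimate: {on_demand_cost:.2f}")
print(f"Spot estimate:      {spot_cost:.2f}")
```

Replace the placeholders with rates from your provider's pricing page, and revisit the estimate once the first few experiments show how long a run takes.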

### **88.7.3 Budget**
Track actual spend against planned budget. Use cloud cost management tools (e.g., AWS Cost Explorer) to avoid surprises.

---

## **88.8 Progress Tracking**

Tracking progress helps the team stay on course and provides visibility to stakeholders.

### **88.8.1 Velocity**
In Scrum, velocity is the number of story points the team completes in a sprint. Averaged over several sprints, it becomes a predictor of how much work the team can take on next. For the NEPSE team, if velocity is consistently around 20 points per sprint, they can plan future sprints at roughly that size.
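
One common way to use velocity for forecasting is to average the last few sprints and divide the remaining backlog by that figure. The sprint history and backlog size below are hypothetical.

```python
from math import ceil
from statistics import mean

# Hypothetical story points completed in the last five sprints.
completed_points = [18, 22, 20, 19, 21]

# Average the three most recent sprints to smooth out one-off spikes.
velocity = mean(completed_points[-3:])

remaining_backlog = 90  # hypothetical points left for the next release
sprints_needed = ceil(remaining_backlog / velocity)

print(f"Velocity (3-sprint average): {velocity} points per sprint")
print(f"Estimated sprints to finish {remaining_backlog} points: {sprints_needed}")
```

Treat the result as a rough forecast: exploratory ML work can make individual sprints deviate considerably from the average.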

### **88.8.2 Burndown Charts**
A burndown chart shows remaining work (in story points or hours) vs. time. It helps spot if the team is ahead or behind schedule. Many project management tools (Jira, Azure DevOps) generate these automatically.
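
The logic behind a burndown chart is simple enough to sketch by hand: compare the remaining work each day with a straight line from the sprint total down to zero. The daily figures below are invented for illustration.

```python
def ideal_burndown(total_points: float, sprint_days: int) -> list:
    """Remaining work on each day if the team burned down at a constant rate."""
    return [total_points - total_points * day / sprint_days
            for day in range(sprint_days + 1)]


total_points = 20
sprint_days = 10
ideal = ideal_burndown(total_points, sprint_days)

# Hypothetical remaining points recorded at the end of day 0 through day 5.
remaining_actual = [20, 20, 18, 17, 17, 14]

today = len(remaining_actual) - 1
gap = remaining_actual[today] - ideal[today]  # positive means behind the ideal line

status = "behind" if gap > 0 else "on or ahead of"
print(f"Day {today}: {remaining_actual[today]} points left, ideal is {ideal[today]:.0f}")
print(f"The team is {status} the ideal line by {abs(gap):.0f} points")
```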

### **88.8.3 Key Performance Indicators (KPIs)**
Track metrics that reflect project health:

- **Sprint completion rate**: % of committed work completed.
- **Bug rate**: Number of bugs found in production per sprint or release.
- **Model performance**: MAE, accuracy, etc., over time.
- **Data freshness**: Time since last data update.
- **Deployment frequency**: How often new models are deployed.

### **88.8.4 OKRs (Objectives and Key Results)**
Set quarterly objectives with measurable key results. Example for the NEPSE project:

- Objective: Improve prediction accuracy and service reliability.
  - KR1: Reduce MAE from 12 to 10 points by end of quarter.
  - KR2: Implement and test three new features (RSI, MACD, Bollinger Bands).
  - KR3: Achieve 95% availability of prediction API.

OKRs align the team on strategic goals.
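
Progress toward a key result can be expressed as the fraction of the distance from the starting value to the target that has been covered. The sketch below applies this to KR1 and KR2; the current values (MAE of 11, two features done) are hypothetical.

```python
def kr_progress(start: float, target: float, current: float) -> float:
    """Fraction of the way from start to target, clamped to the range [0, 1]."""
    if start == target:
        raise ValueError("start and target must differ")
    progress = (current - start) / (target - start)
    return max(0.0, min(1.0, progress))


# KR1: reduce MAE from 12 to 10; assume the current MAE is 11.
kr1 = kr_progress(start=12, target=10, current=11)

# KR2: implement three new features; assume two are done.
kr2 = kr_progress(start=0, target=3, current=2)

print(f"KR1 progress: {kr1:.0%}")
print(f"KR2 progress: {kr2:.0%}")
```

The same function works whether the metric should go down (MAE) or up (feature count), because progress is measured relative to the direction of the target.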

---

## **88.9 Reporting**

Regular reporting keeps stakeholders informed and builds trust.

### **88.9.1 Status Reports**
A simple template:

- **Last period accomplishments**: What was delivered.
- **Current period focus**: What the team is working on now.
- **Risks and issues**: Any blockers or concerns.
- **Metrics**: Key project metrics (e.g., velocity, model performance).

### **88.9.2 Dashboards**
Create a live dashboard (e.g., using Grafana) showing:

- Model performance over time.
- System uptime and latency.
- Data pipeline health.
- Recent predictions vs. actuals.

This can be shared with the team and stakeholders for real‑time visibility.

### **88.9.3 Demos**
At the end of each sprint, hold a demo for stakeholders. Show working software, even if it's just a new feature or improved accuracy. This builds confidence and solicits early feedback.

---

## **88.10 Retrospectives**

Retrospectives are the key to continuous improvement. After each sprint, the team reflects on what went well, what could be improved, and what actions to take.

### **88.10.1 Format**
A common format:

1. **Set the stage**: Review the sprint goal and outcome.
2. **Gather data**: Each team member writes notes on sticky notes (physical or virtual) for:
   - What went well?
   - What didn't go well?
   - What puzzles me?
3. **Generate insights**: Group similar items, discuss root causes.
4. **Decide what to do**: Select 1‑3 actionable improvements for the next sprint.
5. **Close**: Thank the team.

### **88.10.2 Action Items**
Ensure improvements are concrete and assigned. For example:

- “Set up a shared calendar for team availability” (assign: Alice).
- “Create a template for PR descriptions” (assign: Bob).
- “Schedule a weekly knowledge‑sharing session” (assign: Carol).

### **88.10.3 Blameless Culture**
Focus on processes, not individuals. If something went wrong, ask “How can we change our process to prevent this?” rather than “Who made the mistake?”

---

## **88.11 Best Practices for ML Project Management**

Drawing from the above, here are specific best practices for ML projects:

1. **Treat data as a first‑class dependency**: Include data acquisition and validation tasks in your backlog. Plan for data cleaning and feature engineering early.
2. **Version everything**: Code, data, and models should be versioned. This enables reproducibility and rollback.
3. **Track experiments automatically**: Use an experiment tracker (MLflow, Weights & Biases) to record every run together with its hyperparameters and results.
4. **Plan for model decay**: After deployment, allocate time for monitoring and retraining. Build this into your roadmap.
5. **Involve domain experts early**: Their input is crucial for feature engineering and evaluation. Invite them to demos and reviews.
6. **Set realistic expectations**: Be transparent about the limitations of ML. Use benchmarks to show progress.
7. **Manage technical debt**: Regularly refactor code and improve infrastructure. Include technical debt items in your backlog.
8. **Celebrate small wins**: ML projects can be long and uncertain. Celebrate each improvement, each new feature, each successful deployment.
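
To make practices 2 and 3 concrete, the sketch below derives a single fingerprint from the data, the hyperparameters, and the code version of a run, using only the standard library. It is not a substitute for a tool such as MLflow; the function name, the sample data, and the commit identifier are all illustrative.

```python
import hashlib
import json


def run_fingerprint(data: bytes, params: dict, code_version: str) -> str:
    """Hash the data, hyperparameters, and code version into one identifier."""
    digest = hashlib.sha256()
    digest.update(hashlib.sha256(data).digest())
    # sort_keys makes the hash independent of dictionary ordering.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    digest.update(code_version.encode("utf-8"))
    return digest.hexdigest()


# Hypothetical inputs for one training run.
data = b"date,close\n2024-01-01,2000.0\n2024-01-02,2010.0\n"
params = {"model": "lstm", "window": 30, "learning_rate": 0.001}
code_version = "abc1234"  # e.g., a short Git commit hash

fingerprint = run_fingerprint(data, params, code_version)
print(f"Run fingerprint: {fingerprint[:12]}")
```

Two runs with the same fingerprint used the same data, parameters, and code, so any difference in their results points to nondeterminism (for example, an unset random seed) rather than a changed input.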

---

## **Chapter Summary**

In this chapter, we explored project management tailored to building a time‑series prediction system like the NEPSE stock predictor. We discussed:

- Choosing an Agile methodology (Scrum or Kanban) to accommodate the iterative nature of ML.
- Sprint planning, estimation, and capacity planning.
- Keeping the backlog healthy through grooming and prioritization.
- Engaging stakeholders through communication plans and demos.
- Identifying and mitigating technical, data, and business risks.
- Planning resources (people, compute, budget).
- Tracking progress with velocity, burndown charts, and KPIs.
- Reporting and using dashboards to communicate status.
- Running effective retrospectives to drive continuous improvement.
- Best practices specific to ML project management.

Effective project management ensures that the team delivers value predictably, adapts to change, and maintains a healthy work environment. It complements the technical practices covered in previous chapters to create a successful prediction system.

In the next chapter, we will discuss **Documentation Strategies**, diving deeper into how to document code, models, and processes for long‑term maintainability.

---

**End of Chapter 88**

