# 📜 IBM Data Science Professional Certificate  
*Curiosity to Capability — One Notebook at a Time*

---

**Compiled and Authored by:**  
**Partho Sarothi Das**  
Dhaka, Bangladesh  
🎓 Bachelor's & Master's in Statistics  
💼 Investment Banking Professional → Aspiring Data Scientist  

>**Disclaimer:** This notebook is based on content from the [IBM Data Science Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-science) offered on Coursera. It is intended for personal learning and review purposes.

---
---

# Introduction to Data Science Methodology

### Modules & Lessons

#### Module 1

* Lesson 1: Business Understanding & Analytic Approach
* Lesson 2: Data Requirements & Data Collection

#### Module 2

* Lesson 1: Data Understanding & Data Preparation
* Lesson 2: Modeling & Evaluation

#### Module 3

* Final Lesson: Deployment & Feedback

---

### Case Study Highlight

A healthcare facility with a limited budget wants to improve patient care before discharge.
**Goal**: Use data science to help doctors make **timely, data-driven decisions**.
If successful, this pilot will **improve patient outcomes** and optimize **budget allocation**.

---

### Interactive Features

* Case study icons differentiate **theory vs. practice**
* Discussion forums for learner support and interaction

---

### Prerequisites

You’re ready if you:

* Have **basic knowledge of data science**
* Know how to use **Jupyter Notebooks**

---
---

# Data Science Methodology Overview

### What is Methodology?

* A methodology is a system of methods used in a particular area of study.
* In data science, it provides a structured approach to solving problems and making data-driven decisions.
* It includes:

  * Data collection methods
  * Measurement strategies
  * Comparison of analysis techniques based on research goals

---

### Why Use Methodology in Data Science?

* Despite improved technology and data access, teams often struggle to define questions clearly or apply data effectively.
* Methodology helps avoid jumping straight to solutions and ensures a disciplined process for reliable results.

---

### John Rollins’s Contribution

* The methodology taught in this course is based on the work of John Rollins, IBM Senior Data Scientist.
* His framework outlines 10 stages to guide data scientists from problem understanding to solution deployment.

---

### 10 Stages of the Data Science Methodology

1. Business Understanding
2. Analytic Approach
3. Data Requirements
4. Data Collection
5. Data Understanding
6. Data Preparation
7. Modeling
8. Evaluation
9. Deployment
10. Feedback

---

### Key Questions for Each Stage

#### Problem Definition & Approach (Stages 1–2)

* What is the problem you're trying to solve?
* How can data help answer the question?

#### Organizing Around Data (Stages 3–6)

* What data do you need?
* Where is the data sourced from and how will you access it?
* Does the data represent the problem?
* What preparation is needed to use the data?

#### Model Validation & Solution (Stages 7–10)

* Do visualizations reveal insights related to the problem?
* Does the model address the business question, or does it need adjustments?
* Can you deploy the model?
* Can stakeholders provide useful feedback?

---

### ✅ **Takeaways**

* Data science methodology **structures complex problem-solving**.
* It emphasizes the **importance of asking the right questions** at each stage.
* Following a proven methodology, like Rollins’s 10-stage process, supports **better decisions and more reliable outcomes**.

---
---

# From Problem to Approach – Business Understanding

### Purpose of the Lesson

To explain the **first and foundational step** in data science methodology: **Business Understanding** — clarifying the problem to ensure the right questions are being asked and the right data is being used.

---

### Key Concepts

#### Why Business Understanding Matters

* Jumping into analysis without fully understanding the business problem can lead to wasted effort solving the *wrong* question.
* Clarity around goals and objectives helps identify the correct analytic approach.
* Rollins emphasizes that a clearly defined question is crucial to direct effective data science work.

---

### 👔 **Real-Life Example: A Healthcare Case Study**

#### 🏥 **The Problem**

* An American health insurance provider faced funding cuts for readmissions.
* If they didn’t find a solution, they might have to **raise insurance rates**, which was undesirable.

#### 🔍 **The Question**

> *“What is the best way to allocate a limited healthcare budget to maximize quality care?”*

#### 🧠 **The Process Followed**

1. **Stakeholders (insurance company, healthcare authorities, IBM data scientists)** aligned to define the **goal** and **supporting objectives**.
2. The team identified **patient readmissions** as a key priority.
3. Data showed:

   * 30% of rehab patients were readmitted within 1 year.
   * 50% within 5 years.
   * **Congestive Heart Failure** patients were most frequently readmitted.
4. A **decision tree model** was proposed to analyze this trend.

#### 🛠️ **Business Workshop**

* IBM conducted an **on-site workshop** to frame the project.
* The **key sponsor** played a vital role in:

  * Setting direction
  * Staying involved
  * Providing support

#### 📋 **Identified Business Requirements**

1. Predict readmission outcomes for Congestive Heart Failure (CHF) patients.
2. Assess readmission **risk** levels.
3. Understand the **event combinations** leading to readmissions.
4. Apply an **easy-to-understand model** for future predictions.

---

### 💡 **Main Takeaways**

* Data science starts with **asking the right questions** and defining **clear goals**.
* **Engage stakeholders** early and throughout the project.
* Structure conversations to **identify goals, break them into objectives, and set priorities**.
* **Business understanding** determines the analytic approach and data requirements.

---
---

# From Problem to Approach – Analytic Approach

### 🎯 **Purpose of the Lesson**

To explain the **second stage** of the data science methodology: **Analytic Approach** — choosing the right type of analysis based on the business question.

---

### 🧩 **Key Concepts**

#### ✅ **Selecting the Right Analytic Approach**

* Depends on the **type of question** being asked.
* Requires **clarification from the question-asker** to ensure alignment with the **business goal**.
* Helps determine what **patterns** need to be discovered in the data.

#### 🎯 **Examples of Analytic Approaches**

| **Question Type**            | **Suggested Approach**                      |
| ---------------------------- | ------------------------------------------- |
| Yes/No (binary outcome)      | **Classification**                          |
| Predicting future outcomes   | **Predictive Modeling**                     |
| Identifying relationships    | **Descriptive Modeling** (e.g., clustering) |
| Understanding human behavior | **Clustering** or **Association Rules**     |
| Quantitative answers         | **Statistical Analysis**                    |
| Complex pattern discovery    | **Machine Learning Techniques**             |

> **Machine Learning** is used to identify hidden relationships and trends in data without being explicitly programmed.

---

### 🏥 **Case Study: Healthcare Readmission Model**

* **Goal**: Predict risk of **readmission** for patients with **Congestive Heart Failure**.
* **Analytic Approach Used**: **Decision Tree Classification**

  * Easy to interpret and apply.
  * Breaks data into paths (nodes → leaves) based on variables and thresholds.
  * Produces both the **predicted outcome** and its **probability** (likelihood).

    * If the dominant outcome in a leaf is "Yes" → Risk = proportion of "Yes".
    * If "No" → Risk = 1 - proportion of "No".
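The leaf-risk rule above can be sketched in a few lines of Python. This is a minimal illustration, not the case study's actual implementation: the function name `leaf_risk`, the sample leaves, and the tie-breaking choice (ties count as "Yes") are all assumptions made for the example.

```python
from collections import Counter


def leaf_risk(outcomes):
    """Return (predicted_label, readmission_risk) for one decision-tree leaf.

    `outcomes` holds the "Yes"/"No" readmission labels of the training
    patients that landed in the leaf (hypothetical data).
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("a leaf must contain at least one patient")
    counts = Counter(outcomes)
    p_yes = counts["Yes"] / total
    p_no = counts["No"] / total
    if p_yes >= p_no:  # assumption: a tie is treated as "Yes"
        # Dominant outcome is "Yes": risk is the proportion of "Yes".
        return "Yes", p_yes
    # Dominant outcome is "No": risk is 1 minus the proportion of "No".
    return "No", 1 - p_no


# Two hypothetical leaves: one mostly readmitted, one mostly not.
high_risk_leaf = ["Yes"] * 17 + ["No"] * 3
low_risk_leaf = ["Yes"] * 2 + ["No"] * 8
print(leaf_risk(high_risk_leaf))
print(leaf_risk(low_risk_leaf))
```

With only two outcomes, both branches of the rule give the same number: the share of "Yes" patients in the leaf. The predicted label changes with the dominant outcome, but the risk is always the readmission proportion.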

#### 🩺 **Benefits for Clinicians**

* Transparent: Helps clinicians **understand which conditions** lead to high risk.
* Dynamic: Can build **multiple models** at different hospital stages to **track evolving patient risk**.

---

### 💡 **Main Takeaways**

* The **analytic approach** must match the **nature of the business question**.
* In the case study, **decision trees** were ideal due to:

  * Interpretability for healthcare staff.
  * Ability to provide clear risk estimates for readmission.
  * Flexibility to apply models at various stages during a hospital stay.

---
---

# Analytic Approach Based on the Question Type

When choosing an analytic approach, the type of question you are trying to answer largely determines which techniques fit. Here are five common question types and their corresponding analytic approaches:

### 1. Descriptive Questions: “What is the current status?”

**Approach:** Descriptive Analytics

**Question:**  "What is the current status of our sales?"  

**Techniques:**

- *Data aggregation:* Combining data from various sources into a unified view.

- *Data mining:* Extracting useful information from large datasets.

- *Data visualization:* Using visual tools to present data in an easily understandable format.

**Examples:**

- Summarizing sales data

- Creating dashboards

- Generating reports
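As a rough sketch of data aggregation for a descriptive question, the snippet below summarizes sales by region using only the standard library. The records, the region names, and the helper `summarize_by_region` are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records: (region, month, revenue).
sales = [
    ("North", "Jan", 1200.0),
    ("North", "Feb", 1500.0),
    ("South", "Jan", 900.0),
    ("South", "Feb", 1100.0),
    ("South", "Mar", 1000.0),
]


def summarize_by_region(records):
    """Aggregate revenue per region into total, average, and record count."""
    grouped = defaultdict(list)
    for region, _month, revenue in records:
        grouped[region].append(revenue)
    return {
        region: {
            "total": sum(values),
            "average": mean(values),
            "count": len(values),
        }
        for region, values in grouped.items()
    }


summary = summarize_by_region(sales)
for region, stats in sorted(summary.items()):
    print(region, stats)
```

A summary table like this is the kind of input a dashboard or report would visualize.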


### 2. Diagnostic Questions: “Why did it happen?”

**Approach:** Diagnostic Analytics

**Question:** "Why did our sales decline in the last quarter?"  

**Techniques:**

- *Drill-down:* Exploring detailed data to find underlying causes.

- *Data discovery:* Identifying patterns and relationships in data.

- *Correlation analysis:* Assessing the relationship between different variables.

**Examples:**

- Identifying root causes of sales decline

- Analyzing customer complaints

- Understanding failure points in a process
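Correlation analysis can be illustrated with a hand-written Pearson coefficient. This is a minimal sketch under stated assumptions: the ad-spend and units-sold figures are invented, and `pearson_r` is a helper written for this example rather than a library call.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical quarterly figures: ad spend (thousands) and units sold.
ad_spend = [10, 12, 9, 7, 5]
units_sold = [200, 230, 190, 150, 120]
r = pearson_r(ad_spend, units_sold)
print(round(r, 3))
```

A coefficient near 1 or -1 signals a strong linear relationship, but it does not by itself show that one variable caused the other, so it is a starting point for drill-down rather than a root cause.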


### 3. Predictive Questions: “What is likely to happen?”

**Approach:** Predictive Analytics

**Question:** "What is our sales forecast for the next year?"  

**Techniques:**

- *Regression analysis:* Predicting outcomes based on relationships between variables.

- *Time series forecasting:* Predicting future values based on past trends.

- *Machine learning models:* Using algorithms to predict future outcomes based on historical data.

**Examples:**

- Forecasting sales

- Predicting customer churn

- Estimating future demand
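One simple way to sketch regression-based forecasting is an ordinary least squares trend line. The yearly sales figures and the helper `fit_line` below are hypothetical, and a straight-line trend is an assumption that real sales data may not satisfy.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical yearly sales (in thousands) for years 1 to 5.
years = [1, 2, 3, 4, 5]
yearly_sales = [100.0, 110.0, 125.0, 130.0, 145.0]

intercept, slope = fit_line(years, yearly_sales)
forecast_year_6 = intercept + slope * 6  # extrapolate one year ahead
print(round(slope, 2), round(intercept, 2), round(forecast_year_6, 2))
```

The slope estimates the average yearly change, and the forecast extends that trend by one period. Time series methods would additionally account for seasonality and autocorrelation, which this sketch ignores.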


### 4. Prescriptive Questions: “What should we do?”

**Approach:** Prescriptive Analytics

**Question:** "What should we do to increase website traffic?"  

**Techniques:**

- *Optimization models:* Finding the best solution from a set of alternatives.

- *Simulation:* Modeling scenarios to predict outcomes.

- *Decision analysis:* Evaluating and comparing different decisions.

**Examples:**

- Recommending inventory levels

- Optimizing marketing campaigns

- Determining pricing strategies
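A prescriptive question such as "what inventory level should we hold?" can be sketched as choosing the alternative with the highest expected profit across demand scenarios. Everything in this example is assumed for illustration: the scenarios, their probabilities, the unit cost and price, and the helper names.

```python
# Hypothetical demand scenarios: (units demanded, probability).
demand_scenarios = [(80, 0.3), (100, 0.5), (120, 0.2)]
UNIT_COST = 6.0    # assumed cost to stock one unit
UNIT_PRICE = 10.0  # assumed selling price per unit


def expected_profit(stock, scenarios):
    """Expected profit of stocking `stock` units across demand scenarios."""
    total = 0.0
    for demand, probability in scenarios:
        units_sold = min(stock, demand)  # cannot sell more than is stocked
        profit = UNIT_PRICE * units_sold - UNIT_COST * stock
        total += probability * profit
    return total


def best_stock_level(candidates, scenarios):
    """Return the candidate stock level with the highest expected profit."""
    return max(candidates, key=lambda stock: expected_profit(stock, scenarios))


candidates = [80, 100, 120]
for stock in candidates:
    print(stock, round(expected_profit(stock, demand_scenarios), 2))
print("recommended:", best_stock_level(candidates, demand_scenarios))
```

Evaluating each alternative against the scenarios is a small-scale form of decision analysis, and picking the maximum is the optimization step. Real problems usually have far more alternatives and constraints than this brute-force comparison handles.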


### 5. Classification Questions: “Which category does this belong to?”

**Approach:** Classification (Supervised Learning)

**Question:** "Which category does this data point belong to?"   

**Techniques:**

- *Logistic regression:* Predicting the probability of a categorical outcome.

- *Decision trees:* Splitting data into branches to classify it.

- *Support vector machines:* Finding the best boundary to separate categories.

- *Neural networks:* Using interconnected nodes to classify data.

**Examples:**

- Email spam detection

- Image classification

- Disease diagnosis
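To make the logistic regression idea concrete, the sketch below turns a weighted sum of features into a class probability and then into a spam label. The coefficients are hand-picked for illustration and not fitted to any data; the feature meanings, the 0.5 threshold, and the function names are likewise assumptions.

```python
from math import exp


def sigmoid(z):
    """Logistic function mapping a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


# Hypothetical, hand-picked coefficients (not fitted to real data).
WEIGHTS = [1.5, 0.8]  # per suspicious word, per link in the email
BIAS = -3.0


def spam_probability(features):
    """Probability of the "spam" class under the assumed logistic model."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return sigmoid(z)


def classify(features, threshold=0.5):
    """Return (label, probability) using an assumed decision threshold."""
    p = spam_probability(features)
    label = "spam" if p >= threshold else "not spam"
    return label, p


# Features are [suspicious_word_count, link_count] for two made-up emails.
print(classify([0, 1]))
print(classify([3, 2]))
```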
---
---

# From Requirements to Collection

## Data Requirements

![Data Requirements](images/Data_Requirements.png)

### 🎯 **Objective**

Understand the **data requirements** stage of the data science methodology — identifying *what data is needed*, *how it should be formatted*, and *what sources to collect from* to support the analytic approach and solve the business problem.

---

### 🧩 **Key Concepts**

#### 🥘 **Analogy**: Cooking with Data

* The **problem** is the recipe.
* The **data** are the ingredients.
* The **data scientist** must identify:

  * What data (ingredients) are needed,
  * Where to collect them from,
  * How to process and prepare them.

---

### 📊 **What Are Data Requirements?**

* **Defined before** data collection and preparation.
* Must be aligned with:

  * The **business problem**,
  * The **analytic approach** (e.g., decision tree classification),
  * The structure and **format** required for modeling.

---

### 🏥 **Case Study: Congestive Heart Failure Readmission**

To build a decision tree model, the team defined **strict data requirements**:

#### ✅ **Cohort Selection Criteria**

1. **In-patient admission** within the provider’s service area.
2. **Primary diagnosis** of congestive heart failure (CHF) during one calendar year.
3. **Continuous insurance enrollment** for at least 6 months before admission (to ensure full medical history access).
4. **Exclude patients** with other serious conditions to prevent skewing readmission risk.
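The four cohort criteria can be expressed as a single filter. The sketch below is illustrative only: the `Admission` fields, the study year, and the 183-day stand-in for "6 months" are assumptions, not details taken from the case study.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Admission:
    """One candidate admission record; every field name is hypothetical."""
    patient_id: str
    is_inpatient: bool
    in_service_area: bool
    primary_diagnosis: str
    admitted_on: date
    enrolled_since: date
    has_excluding_condition: bool


STUDY_YEAR = 2020          # assumed calendar year for the cohort
MIN_ENROLLMENT_DAYS = 183  # assumed day count standing in for "6 months"


def in_cohort(adm):
    """Apply the four selection criteria to one admission record."""
    days_enrolled = (adm.admitted_on - adm.enrolled_since).days
    return (
        adm.is_inpatient
        and adm.in_service_area                      # criterion 1
        and adm.primary_diagnosis == "CHF"
        and adm.admitted_on.year == STUDY_YEAR       # criterion 2
        and days_enrolled >= MIN_ENROLLMENT_DAYS     # criterion 3
        and not adm.has_excluding_condition          # criterion 4
    )


admissions = [
    Admission("P1", True, True, "CHF", date(2020, 9, 1), date(2019, 1, 1), False),
    Admission("P2", True, True, "CHF", date(2020, 3, 1), date(2020, 1, 15), False),
    Admission("P3", True, True, "CHF", date(2020, 6, 1), date(2018, 5, 1), True),
    Admission("P4", False, True, "CHF", date(2020, 6, 1), date(2018, 5, 1), False),
]
cohort = [a.patient_id for a in admissions if in_cohort(a)]
print(cohort)
```

In this made-up sample only P1 qualifies: P2 has too little enrollment history, P3 has an excluding condition, and P4 was not an in-patient admission.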

#### 📦 **Data Content & Format**

* **Model format**: One row per patient; columns = model variables.
* Raw data includes:

  * Admissions
  * Diagnoses (primary, secondary, tertiary)
  * Procedures
  * Prescriptions
  * Outpatient services
* This data was **rolled up** from thousands of rows per patient to one summarized row using aggregation — preparing for modeling.
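The roll-up from many transactional rows to one row per patient can be sketched with simple counting. The record layout, the codes, and the `n_<type>` column naming are hypothetical; the real project would have used richer aggregates than counts.

```python
from collections import defaultdict

# Hypothetical transactional records: (patient_id, record_type, code).
transactions = [
    ("P1", "admission", "A100"),
    ("P1", "diagnosis", "CHF"),
    ("P1", "prescription", "RX7"),
    ("P1", "prescription", "RX9"),
    ("P2", "admission", "A200"),
    ("P2", "admission", "A201"),
    ("P2", "diagnosis", "CHF"),
]


def roll_up(records):
    """Collapse many rows per patient into one row of per-type counts."""
    rows = defaultdict(lambda: defaultdict(int))
    for patient_id, record_type, _code in records:
        rows[patient_id][f"n_{record_type}"] += 1
    # Give every patient the same columns, filling absent counts with 0.
    columns = sorted({col for row in rows.values() for col in row})
    return {
        pid: {col: row.get(col, 0) for col in columns}
        for pid, row in rows.items()
    }


patient_table = roll_up(transactions)
for pid, row in sorted(patient_table.items()):
    print(pid, row)
```

Each patient ends up with exactly one row and the same set of columns, which is the shape a decision tree model expects as input.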

---

### 💡 **Main Takeaways**

* Clearly **define what data is needed** based on the analytic model.
* Understand the **source, content, and structure** of that data.
* Anticipate **future stages** like data preparation and modeling.
* Preparing well at this stage ensures effective and efficient downstream analysis.

---
---

## Data Collection

![Data Collection](images/Data_Collection.png)

### 🎯 **Objective**

Understand the **Data Collection** stage of the data science methodology — where raw data is gathered based on the data requirements, assessed, and prepared for analysis.

---

### 🧩 **Key Concepts**

#### 🍽️ **Analogy: Ingredients for Cooking**

* Just like gathering ingredients for a meal, collecting data may involve unexpected challenges.
* Some "ingredients" (data sources) might be:

  * Unavailable,
  * Expensive to acquire,
  * Or come with quality issues.

---

### 📥 **What Happens in Data Collection?**

* Follows the **Data Requirements** stage.
* Involves identifying and **retrieving data** from various sources.
* Data scientists **assess**:

  * **Availability**
  * **Completeness**
  * **Quality**
  * **Suitability** for analysis

#### 🛠️ **Assessment Tools**

* **Descriptive statistics** and **visualizations** help:

  * Explore the dataset,
  * Understand distributions,
  * Identify missing values or inconsistencies.
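A first-pass assessment of a collected column might look like the sketch below, which reports completeness, basic descriptive statistics, and implausible values. The age values, the plausible range of 0 to 120, and the helper `profile` are assumptions for illustration.

```python
from statistics import mean, median

# Hypothetical patient ages; None marks a missing value.
ages = [67, 72, None, 58, 81, None, 75, 199]


def profile(values, low, high):
    """Summarize completeness, central tendency, and out-of-range values."""
    present = [v for v in values if v is not None]
    return {
        "n_total": len(values),
        "n_missing": len(values) - len(present),
        "mean": mean(present),
        "median": median(present),
        "min": min(present),
        "max": max(present),
        # Values outside the plausible range [low, high] are flagged for review.
        "out_of_range": [v for v in present if not low <= v <= high],
    }


report = profile(ages, low=0, high=120)
for key, value in report.items():
    print(key, value)
```

Here the gap between mean and median hints at the inconsistent value 199, which the range check then flags explicitly.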

---

### 🏥 **Case Study: Congestive Heart Failure Readmission**

#### 📌 **Data Sources Needed**

* **Demographics**
* **Clinical histories**
* **Insurance coverage**
* **Provider information**
* **Claims and pharmaceutical data**

> **Note:** Drug information was needed but not integrated initially.

#### 📌 **Action Taken**

* The team **deferred** acquiring the drug data until intermediate model results justified the effort.
* In the end, the **model performed well without it**, validating the decision.

#### 🤝 **Collaboration**

* **DBAs and programmers** extracted and merged data from multiple sources.
* Redundant or duplicate records were removed.
* Data was made ready for the **next stage: Data Understanding**.
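The extract, merge, and de-duplicate steps can be sketched with plain dictionaries. The two source extracts, their field names, and the choice of an inner join on `patient_id` are assumptions for this example; the case study does not describe how the merge was actually implemented.

```python
# Hypothetical extracts from two source systems, keyed by patient_id.
demographics = [
    {"patient_id": "P1", "age": 67},
    {"patient_id": "P2", "age": 72},
    {"patient_id": "P2", "age": 72},  # exact duplicate row
]
claims = [
    {"patient_id": "P1", "total_claims": 14},
    {"patient_id": "P3", "total_claims": 3},
]


def drop_duplicates(rows):
    """Remove exact duplicate rows while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def merge_on_patient(left, right):
    """Inner-join two row lists on patient_id (assumes one right row per id)."""
    right_by_id = {row["patient_id"]: row for row in right}
    return [
        {**row, **right_by_id[row["patient_id"]]}
        for row in left
        if row["patient_id"] in right_by_id
    ]


clean_demographics = drop_duplicates(demographics)
merged = merge_on_patient(clean_demographics, claims)
print(merged)
```

An inner join keeps only patients present in both sources, so P2 and P3 drop out here. Whether to keep such partial records instead is a judgment call that depends on the data requirements.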

---

### 💡 **Main Takeaways**

* It’s okay to proceed with partial data and **adjust** as needed.
* Always **assess what you have** after initial data collection.
* Consider **collaborating with IT/data teams** for efficient extraction and preparation.
* **Automation** can improve future data collection processes.

---
---

![Data Requirements and Data Collection](images/Data_Requirement_and_Collection.png)