# 📜 IBM Data Science Professional Certificate  
*Curiosity to Capability: One Notebook at a Time*

---

**Compiled and Authored by:**  
**Partho Sarothi Das**  
Dhaka, Bangladesh  
🎓 Bachelor's & Master's in Statistics  
💼 Investment Banking Professional → Aspiring Data Scientist  

>**Disclaimer:** This notebook is based on content from the [IBM Data Science Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-science) offered on Coursera. It is intended for personal learning and review purposes.

---
---

# Data Understanding

### Objective

Understand the **Data Understanding** stage of the data science methodology, which focuses on exploring and evaluating collected data to ensure it is suitable and representative of the problem.

---

###  What is Data Understanding?

* This stage answers:
  🔍 *"Is the data collected truly representative of the problem we're trying to solve?"*
* It involves deep exploration, statistical analysis, and data quality assessment.

---

### Key Techniques Used

1. **Descriptive Statistics** (univariate):

   * Mean, Median, Min, Max, Standard Deviation
   * Help summarize the characteristics of each variable

2. **Pairwise Correlation**:

   * Identify variables that are highly correlated (redundant)
   * Helps decide which variables to drop for modeling

3. **Histograms**:

   * Visual tool to understand variable distributions
   * Helps spot skewness, outliers, and opportunities to simplify categorical variables

4. **Assessing Data Quality**:

   * Identify **missing values**, **invalid entries**, or **anomalies**
   * For example:

     * An age variable with values 0–100 plus "999", where 999 is a code for "missing"
     * Recode or remove misleading data
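
The techniques above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the course's code: the `ages` and `visits` lists are made-up values, and `pearson` is a hypothetical helper written here for the example. It recodes the 999 sentinel as missing, summarizes the remaining values, and measures one pairwise correlation.

```python
import math
import statistics

# Hypothetical records (illustration only); 999 is the code for a missing age.
ages = [34, 51, 999, 67, 45, 999, 72, 58]
visits = [2, 4, 1, 7, 3, 5, 8, 5]
MISSING_CODE = 999

# Data quality: recode the sentinel to None so it cannot distort the statistics.
clean_ages = [None if a == MISSING_CODE else a for a in ages]
observed = [a for a in clean_ages if a is not None]

# Univariate descriptive statistics on the observed values only.
summary = {
    "n": len(observed),
    "missing": len(clean_ages) - len(observed),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "min": min(observed),
    "max": max(observed),
    "stdev": statistics.stdev(observed),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Pairwise correlation, using only rows where age is present.
pairs = [(a, v) for a, v in zip(clean_ages, visits) if a is not None]
age_visit_corr = pearson([a for a, _ in pairs], [v for _, v in pairs])

naive_mean = statistics.mean(ages)  # inflated by the 999 codes
print(summary)
print(f"naive mean {naive_mean:.1f} vs cleaned mean {summary['mean']:.1f}")
print(f"age/visits correlation: {age_visit_corr:.2f}")
```

Comparing `naive_mean` with the cleaned mean shows why a sentinel such as 999 has to be recoded before any summary statistic is trusted.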

---

### Case Study Application: Congestive Heart Failure

* The patient cohort was initially defined by a **primary diagnosis** of heart failure.
* During analysis, it was discovered that this definition missed many true cases.
* Result: Analysts **looped back to the Data Collection stage** to include **secondary and tertiary diagnoses**.
* Shows the **iterative nature** of the methodology: insights at one stage may lead to revisiting earlier stages.

---

### Why This Matters

* Refinement during Data Understanding helps:

  * Improve variable selection
  * Detect errors and inconsistencies early
  * Build a more accurate, representative model
  * Prevent misleading or biased outcomes

---

### Main Takeaways

* **Data Understanding** is critical before moving to modeling.
* Apply statistics and visualizations to assess quality and relevance.
* Be ready to revise your data sources or definitions.
* The process is **iterative and interactive**: constant refinement leads to better solutions.

---
---

# Data Preparation (Concepts)

### **Objective**

To understand the **importance, tasks, and techniques** involved in the **Data Preparation** stage of the data science methodology.

---

### **What is Data Preparation?**

* Like **washing and chopping vegetables**, it involves **cleaning and transforming raw data** into a usable format.
* Answers the key question:
  🔍 *“What are the ways in which data is prepared?”*

---

### **Time Commitment**

* Most **time-consuming phase** of a data science project:

  * Takes **70–90%** of total time.
  * Can be reduced to **\~50%** if processes are automated (e.g., via database scripts).
  * The time saved can go to **modeling and analysis**.

---

### **Key Activities in Data Preparation**

1. **Cleaning the Data**:

   * Remove duplicates
   * Handle missing or invalid values
   * Format data consistently

2. **Transforming the Data**:

   * Modify raw data into more usable forms
   * Make it easier for models to analyze effectively

3. **Feature Engineering**:

   * Create new features (variables) using domain knowledge
   * Enhances predictive power of machine learning models
   * Features = characteristics useful in solving the problem

4. **Text Analysis (if working with text data)**:

   * Encode and process text for modeling
   * Ensure groupings are logical and hidden insights are captured
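
As a rough illustration of the cleaning activities listed above, the sketch below removes duplicates, handles missing or invalid values, and formats fields consistently. Everything in it is an assumption made for the example: the `raw_rows` data, the `GENDER_MAP` lookup, the 0 to 120 valid age range, and the `clean_row` and `prepare` helpers.

```python
# Hypothetical raw rows (illustration only), as they might arrive from a source system.
raw_rows = [
    {"patient_id": "P001", "gender": " Female", "age": "64"},
    {"patient_id": "P002", "gender": "M", "age": "-5"},        # invalid age
    {"patient_id": "P001", "gender": " Female", "age": "64"},  # exact duplicate
    {"patient_id": "P003", "gender": "male", "age": ""},       # missing age
]

# Assumed mapping of spelling variants to one consistent code.
GENDER_MAP = {"f": "F", "female": "F", "m": "M", "male": "M"}

def clean_row(row):
    """Format fields consistently and turn missing or invalid values into None."""
    gender = GENDER_MAP.get(row["gender"].strip().lower())
    age_text = row["age"].strip()
    age = int(age_text) if age_text.lstrip("-").isdigit() else None
    if age is not None and not 0 <= age <= 120:  # assumed valid range
        age = None
    return {"patient_id": row["patient_id"].strip(), "gender": gender, "age": age}

def prepare(rows):
    """Clean every row, then drop exact duplicates while keeping the original order."""
    seen = set()
    cleaned = []
    for row in rows:
        record = clean_row(row)
        key = tuple(record.items())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned

prepared = prepare(raw_rows)
for record in prepared:
    print(record)
```

Deduplicating after cleaning, rather than before, means rows that differ only in whitespace or capitalization are also recognized as duplicates.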

---

### **Analogy to Cooking**

* Just as chopped onions release flavor better than whole ones in a sauce, **transformed data** enables better insights and modeling results.

---

### **Why This Stage Is Critical**

* Done properly, it sets the foundation for accurate models.
* If **skipped or done poorly**, the **entire project can fail** or require major rework.
* Attention to detail is **essential**: even one “bad ingredient” can spoil the outcome.

---

### **Main Takeaways**

* Data preparation is a **foundational step**.
* Involves **cleaning**, **formatting**, **feature engineering**, and **text processing**.
* Investing time here results in **better models** and **more reliable outcomes**.
* Automating parts of the process can improve efficiency and accuracy.

---
---

# Data Preparation – Case Study

### Objective

To understand how **data preparation concepts** are practically applied in a real-world **case study on congestive heart failure (CHF)** readmissions.

---

### Key Analogy

Like washing and cleaning vegetables before cooking, **data preparation** involves removing noise, inconsistencies, and restructuring data into a usable form.

---

### Case Study: CHF Readmission Risk Modeling

#### 1. Defining the Problem Clearly

* The **first step** was to **define “congestive heart failure (CHF)”** precisely.

  * Required identifying the correct **diagnosis-related group (DRG) codes**.
  * CHF is a specific **subset of heart failure**; **clinical input** was essential to get the codes right.

#### 2. Readmission Criteria

* Defined **readmission** as any return within **30 days** after a discharge related to CHF.

  * Based on **clinical guidelines**, 30 days was chosen as the relevant window.
  * Differentiated **index admissions** (initial events) from **readmissions**.
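
The 30-day rule can be sketched as a small labeling function. This is an assumption-laden illustration, not the case study's logic: the `stays` dates are invented, `label_stays` is a hypothetical helper, and it simplifies by measuring each admission only against the immediately preceding discharge.

```python
from datetime import date

READMISSION_WINDOW_DAYS = 30

# Hypothetical CHF-related stays for one patient: (admit_date, discharge_date).
stays = [
    (date(2023, 1, 5), date(2023, 1, 12)),
    (date(2023, 2, 1), date(2023, 2, 6)),    # 20 days after the prior discharge
    (date(2023, 6, 15), date(2023, 6, 20)),  # 129 days after the prior discharge
]

def label_stays(stays, window_days=READMISSION_WINDOW_DAYS):
    """Label each stay as an index admission or a readmission.

    Simplifying assumption: a stay is a readmission when it begins within
    `window_days` of the previous discharge; otherwise it is an index admission.
    """
    labeled = []
    previous_discharge = None
    for admit, discharge in sorted(stays):
        gap = None if previous_discharge is None else (admit - previous_discharge).days
        kind = "readmission" if gap is not None and gap <= window_days else "index"
        labeled.append({"admit": admit, "gap_days": gap, "kind": kind})
        previous_discharge = discharge
    return labeled

labeled_stays = label_stays(stays)
# Patient-level target: was there any readmission within the window?
readmitted_within_30 = any(s["kind"] == "readmission" for s in labeled_stays)
print(labeled_stays, readmitted_within_30)
```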

#### 3. Handling Transactional Data

* Patients had **hundreds or thousands** of clinical records:

  * Claims from physicians, labs, hospitals
  * Diagnoses, prescriptions, procedures
* These were **aggregated** to create **a single record per patient** for modeling (as required for decision trees).

#### 4. Feature Engineering

* Created new variables (columns) during aggregation:

  * Visit frequencies
  * Recent diagnoses or treatments
  * Prescription data
  * Co-morbidities (e.g., diabetes, hypertension)
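
Aggregating transactional records into one row per patient, with engineered features built along the way, might look like the sketch below. The `claims` rows, the field names, and the `COMORBIDITIES` list are all hypothetical; the real case study worked from clinical claims data that is not reproduced here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactional claims (illustration only): many rows per patient.
claims = [
    {"patient_id": "P001", "date": date(2023, 1, 3), "type": "visit", "code": "diabetes"},
    {"patient_id": "P001", "date": date(2023, 2, 9), "type": "prescription", "code": "diuretic"},
    {"patient_id": "P001", "date": date(2023, 3, 1), "type": "visit", "code": "hypertension"},
    {"patient_id": "P002", "date": date(2023, 1, 20), "type": "visit", "code": "checkup"},
]

# Assumed co-morbidity flags to derive; a real list would come from clinical input.
COMORBIDITIES = ("diabetes", "hypertension")

def aggregate(claims):
    """Collapse many claim rows into a single feature record per patient."""
    by_patient = defaultdict(list)
    for claim in claims:
        by_patient[claim["patient_id"]].append(claim)

    table = []
    for patient_id, rows in sorted(by_patient.items()):
        codes = {r["code"] for r in rows}
        record = {
            "patient_id": patient_id,
            "n_visits": sum(r["type"] == "visit" for r in rows),                # visit frequency
            "n_prescriptions": sum(r["type"] == "prescription" for r in rows),  # prescription data
            "last_claim_date": max(r["date"] for r in rows),                    # recency
        }
        for condition in COMORBIDITIES:
            record[f"has_{condition}"] = condition in codes                     # co-morbidity flag
        table.append(record)
    return table

patient_table = aggregate(claims)
for row in patient_table:
    print(row)
```

Each output row corresponds to one patient and each key to one attribute, which is the shape the case study describes for its final modeling table.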

#### 5. Literature Review

* A **literature review** was conducted to identify **any overlooked variables** or co-morbidities.

  * This led to **looping back** to the data collection phase to add missing indicators.

#### 6. Merging and Final Dataset

* Combined all features with **demographic data** (age, gender, insurance type, etc.).
* Resulted in a **single table** where each row represented a patient and each column an attribute.

---

### Modeling Inputs

* **Dependent (target) variable**:

  * Readmission within 30 days (Yes/No)
* **Cohort size**: 2,343 patients meeting all criteria
* Dataset was **split into training and test sets** for modeling and validation.
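
A training/test split of the cohort can be sketched as a seeded shuffle. The notes do not state the split ratio used in the case study, so the 30% test fraction, the seed, and the `train_test_split` helper below are assumptions for illustration only.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and hold out a fraction as the test set.

    Both `test_fraction` and `seed` are assumed values, not taken from the case study.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in identifiers for the 2,343 patients in the cohort.
patient_ids = list(range(2343))
train_ids, test_ids = train_test_split(patient_ids)
print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the split reproducible, so later model comparisons are evaluated on the same held-out patients.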

---

### Takeaways

* Data preparation required:

  * **Precise definitions** of medical conditions
  * **Aggregation** of transactional records
  * **Creation of new features**
  * **Clinical and literature insights**
* Prepared data laid the foundation for **effective predictive modeling** using a decision tree.

---
---

# From Modeling to Evaluation

## Modeling – Concepts

---

### Objective

To understand the **purpose** and **characteristics** of the **modeling** stage in the data science methodology.

---

### Cooking Analogy

Modeling is like **tasting the sauce**: you check whether your preparation is on track or needs tweaking.

---

### What is Data Modeling?

* **Modeling** is the stage where data scientists develop models to:

  * **Describe** patterns or behavior
  * **Predict** outcomes (e.g., Yes/No, Stop/Go)

* Models are based on the **chosen analytic approach**, which may be:

  * Statistical
  * Machine Learning

* The **training set** (historical data with known outcomes) is used to build and calibrate the model.

---

### Characteristics of the Modeling Process

* Involves **testing various algorithms** to find the best model fit.
* Requires checking whether **variables** used are relevant and effective.
* Relies heavily on:

  * Clear understanding of the problem
  * Correct analytic method
  * Quality data preparation

---

### Modeling in the Methodology Framework (John Rollins)

Rollins’ framework emphasizes:

1. Understanding the business question
2. Choosing the right analytic method
3. Collecting, understanding, preparing, and modeling the data

Modeling is not isolated; it is a **dynamic** and **iterative** process involving:

* Refinement
* Adjustment
* Continuous learning

---

### Outcome

* A successful model **answers the business question** effectively.
* Evaluation, deployment, and feedback (next stages) ensure the solution is **relevant and useful** in real-world applications.

---

### Key Takeaway

Modeling is where everything comes together (**data, preparation, and strategy**) to deliver a working solution. It's a critical turning point in the data science lifecycle.

---
---

# Modeling – Case Study

### Objective

To demonstrate how **model building**, specifically **parameter tuning**, is applied in a real-world case study on **congestive heart failure readmission**.

---

### 🏥 **Case Study Overview**

* The goal: **Predict high-risk patients** likely to be readmitted for congestive heart failure.
* Method: **Decision tree classification model** using a prepared training dataset.
* Focus: Improving model performance, especially for predicting **"yes" (readmission)** outcomes.

---

### 🔁 **Model Iteration and Parameter Tuning**

#### 🔹 **Model 1** (Default setting):

* **Overall Accuracy**: 85%
* **Accuracy for “Yes” outcomes** (sensitivity): 45%
* ❗ Too many **false negatives** → patients at risk not identified.

#### 🔹 **Model 2** (Relative cost of misclassifying “Yes” vs. “No” set to **9:1**):

* **Accuracy for “Yes”** (sensitivity): 97%
* **Overall Accuracy**: 49%
* ❗ Too many **false positives** → unnecessary costly interventions.

#### 🔹 **Model 3** (Relative cost adjusted to **4:1**):

* **Sensitivity (Yes)**: 68%
* **Specificity (No)**: 85%
* **Overall Accuracy**: 81%
* ✅ **Best balance** between detecting high-risk patients and avoiding unnecessary actions.
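
One way to see why raising the relative cost of a missed “Yes” produces more “Yes” predictions is the expected-cost rule: predict “Yes” whenever the predicted probability exceeds cost_FP / (cost_FP + cost_FN). The sketch below applies that rule to made-up risk scores. It is a stand-in for illustration, since the case study tuned misclassification costs inside a decision tree tool; `cost_threshold`, `predict`, and `risk_scores` are hypothetical names and values.

```python
def cost_threshold(cost_false_negative, cost_false_positive=1.0):
    """Probability above which predicting "yes" has the lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

def predict(probabilities, cost_false_negative, cost_false_positive=1.0):
    """Label each predicted probability using the cost-derived threshold."""
    threshold = cost_threshold(cost_false_negative, cost_false_positive)
    return ["yes" if p > threshold else "no" for p in probabilities]

# Hypothetical predicted readmission probabilities (illustration only).
risk_scores = [0.05, 0.15, 0.25, 0.45, 0.60]

# Cost ratios of 1:1, 4:1, and 9:1 for missing a "yes" versus a false alarm.
yes_counts = {}
for ratio in (1, 4, 9):
    labels = predict(risk_scores, cost_false_negative=ratio)
    yes_counts[ratio] = labels.count("yes")
    print(f"{ratio}:1 -> threshold {cost_threshold(ratio):.2f}, {yes_counts[ratio]} flagged")
```

As the ratio grows, the threshold drops and more patients are flagged, which matches the pattern in the three models: higher sensitivity at the price of more false positives.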

---

### ⚖️ **Types of Errors Considered**

* **False Positive (Type I Error)**: Predicting a readmission when it won’t happen → waste of resources.
* **False Negative (Type II Error)**: Missing a true readmission risk → potential harm to the patient and high cost.
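
The two error types, and the sensitivity, specificity, and accuracy figures quoted for the models, all come from the same four confusion-matrix counts. A minimal sketch, using made-up labels rather than the case study's data (`confusion_counts` and `metrics` are hypothetical helpers):

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a binary outcome."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1  # fp = Type I error
        else:
            counts["fn" if a == positive else "tn"] += 1  # fn = Type II error
    return counts

def metrics(counts):
    """Derive sensitivity, specificity, and overall accuracy from the counts."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]
    return {
        "sensitivity": tp / (tp + fn),  # share of actual "yes" cases detected
        "specificity": tn / (tn + fp),  # share of actual "no" cases cleared
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical outcomes for ten patients (illustration only).
actual = ["yes"] * 4 + ["no"] * 6
predicted = ["yes", "yes", "yes", "no", "no", "no", "no", "no", "no", "yes"]

counts = confusion_counts(actual, predicted)
scores = metrics(counts)
print(counts)
print(scores)
```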

---

### üîÑ **Iterative Nature of Modeling**

* Tuning parameters is just one part.
* Data scientists often loop back to **data preparation** to refine variables and improve model performance.

---

### ✅ **Key Takeaway**

Effective model building in data science involves:

* Balancing sensitivity and specificity
* Adjusting misclassification costs
* Iterating across stages (especially between modeling and data preparation)

---
---

# Data Science Methodology 101: Evaluation

### 🎯 **Objective**

To understand the role of **model evaluation** in the data science methodology and apply it to the congestive heart failure case study.

---

### 🧠 **What is Model Evaluation?**

* An **iterative process** performed alongside modeling.
* Helps assess:

  * ✅ Model **quality** (how well it works)
  * ❓ Whether it answers the **original business question**
* Can lead to **model adjustments** if necessary before deployment.

---

### 🔍 **Two Main Phases of Evaluation**

1. **Diagnostic Measures**

   * Checks how the model is functioning.
   * For **predictive models**: a **decision tree** can be used to check whether the model's output aligns with the initial design.
   * For **descriptive models**: apply a **test set** with known outcomes and refine the model as needed.

2. **Statistical Significance Testing**

   * Ensures results are **not due to chance**.
   * Confirms that **data has been interpreted correctly**.

---

### 🏥 **Case Study Application: Congestive Heart Failure Model**

* **Goal**: Select the best model for predicting patient readmissions.
* Four models were tested using different **relative misclassification costs**.

  * Higher costs for misclassifying “yes” (actual readmissions) improved sensitivity.
  * But this also increased **false positives**, which could lead to wasted interventions.

---

### 📈 **ROC Curve for Model Selection**

* **ROC = Receiver Operating Characteristic** curve.
* Plots **True Positive Rate (Sensitivity)** vs. **False Positive Rate**.
* Helps **visualize trade-offs** in model performance.
* **Optimal model**: the one with the **maximum separation** from the diagonal baseline.
* **Model 3** (with a 4:1 cost ratio) was selected as best:

  * Balanced sensitivity and specificity.
  * Met both **clinical** and **budgetary constraints**.
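
The ROC idea can be sketched without any plotting library: sweep a threshold over predicted scores, record (False Positive Rate, True Positive Rate) at each step, and measure how far each point sits above the diagonal. Reading “maximum separation” as the vertical distance TPR − FPR is an assumption made here, and the `actual` and `scores` lists are invented. Only the Model 3 point uses figures from these notes (sensitivity 68%, specificity 85%).

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) pairs, one per distinct score used as a threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s >= threshold)
        fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s >= threshold)
        points.append((fp / negatives, tp / positives))
    return points

def separation(point):
    """Vertical distance above the diagonal baseline: TPR - FPR."""
    fpr, tpr = point
    return tpr - fpr

# Hypothetical labels (1 = readmitted) and model scores, for illustration only.
actual = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

points = roc_points(actual, scores)
best_point = max(points, key=separation)
print(points)
print("best operating point:", best_point, "separation:", separation(best_point))

# Model 3 from the notes: sensitivity 0.68, specificity 0.85, so FPR = 1 - 0.85.
model_3 = (1 - 0.85, 0.68)
print(f"Model 3 separation: {separation(model_3):.2f}")
```

A point on the diagonal has zero separation and is no better than guessing, so larger values indicate a more useful trade-off between catching readmissions and raising false alarms.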

---

### ✅ **Key Takeaways**

* **Model evaluation is crucial** before deployment.
* It validates if the model:

  * Accurately answers the business question
  * Aligns with real-world constraints (e.g., cost, care quality)
* Tools like the **ROC curve** are essential for choosing the most appropriate model.

---
---

![Model to Evaluation](images/Model_to_Evaluation.png)