# ***Logistic Regression: Comprehensive Exercise***

**Write-up and Coding Set 2**

------------------------------------------------

**Submission Guidelines**

1. Please submit your work in PDF format and ensure that it includes all necessary details as specified in the corresponding item number.
2. All modules or libraries used must be clearly documented.
3. Each line of code should be accompanied by a description explaining its function; simply providing a series of code lines will not be sufficient for scoring points.
4. Each part is worth 10 points.

--------------------------------------------------------------------

 **Dataset**

**Heart Disease Dataset** (available from the UCI Machine Learning Repository or Kaggle) <br>
This dataset contains information about patients and whether they have heart disease, with 303 observations and 13 predictor variables. <br>
**Target Variable:** target (1 = presence of heart disease, 0 = no heart disease)

**Predictor Variables:** <br>

- age: Age in years <br>
- sex: Sex (1 = male, 0 = female) <br>
- cp: Chest pain type (0-3)<br>
- trestbps: Resting blood pressure (mm Hg)<br>
- chol: Serum cholesterol (mg/dl)<br>
- fbs: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)<br>
- restecg: Resting electrocardiographic results (0-2)<br>
- thalach: Maximum heart rate achieved <br>
- exang: Exercise induced angina (1 = yes, 0 = no) <br>
- oldpeak: ST depression induced by exercise relative to rest<br>
- slope: Slope of the peak exercise ST segment (0-2)<br>
- ca: Number of major vessels colored by fluoroscopy (0-3)<br>
- thal: Thalassemia (0 = normal, 1 = fixed defect, 2 = reversible defect)<br>


----------------------------------------------------------------

## **Exercise Tasks**

------------------------------------------------------------------

**Part 1: Exploratory Data Analysis (EDA)**

**Task 1.1:** Load and examine the dataset <br>
- Load the dataset and display the first few rows<br>
- Check dimensions (rows and columns)<br>
- Display data types for each variable<br>
- Identify and handle missing values (if any)<br>
- Impute the missing values with the mean if continuous and the mode if categorical<br>
- Check for class imbalance in the target variable<br>
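
The imputation rule and the class-balance check in Task 1.1 can be sketched as follows. This is a minimal illustration on a tiny made-up frame, not the real dataset; the split into `continuous_cols` and `categorical_cols` is an assumption that has to be adapted to the actual columns.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (NOT the real dataset) with one gap per column type.
df = pd.DataFrame({
    "chol": [233.0, np.nan, 250.0, 204.0],  # continuous
    "cp": [0, 2, np.nan, 2],                # categorical code
    "target": [1, 0, 1, 1],
})

# Assumed split of column types; adapt to the real data.
continuous_cols = ["chol"]
categorical_cols = ["cp"]

# Count missing values per column before imputing.
missing_before = df.isna().sum()

# Mean imputation for continuous variables.
for col in continuous_cols:
    df[col] = df[col].fillna(df[col].mean())

# Mode imputation for categorical variables (mode() can return ties; take the first).
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Share of each class in the target; a strong skew signals imbalance.
class_share = df["target"].value_counts(normalize=True)
```

Computing the mean and mode on the training portion only, and reusing them for the test portion, avoids leaking test information once the data is split in Part 9.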

**Task 1.2:** Visualize distributions<br>
- Create histograms for continuous variables<br>
- Create bar plots for categorical variables<br>
- Create a count plot for the target variable<br>
- Identify potential outliers using boxplots<br>

**Task 1.3:** Explore relationships with target variable    <br>
- Create side-by-side boxplots of continuous predictors by target class  <br>
- Create stacked bar charts for categorical predictors by target class  <br>
- Identify which variables show clear differences between classes  <br>

-------------------------------------------------------------------

**Part 2: Correlation Analysis**

**Task 2.1:** Compute and visualize correlation matrix <br> 
- Calculate correlations between all numeric variables <br>
- Create a heatmap with annotations <br>
- Include the target variable in the correlation analysis <br>

**Task 2.2:** Interpret correlations <br>
- Which predictors are most strongly correlated with the target variable? <br>
- Which predictor pairs show high correlation (potential multicollinearity)? <br>
- List any correlation coefficients > 0.7 or < -0.7 between predictors<br>
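
One way to list the predictor pairs beyond the ±0.7 cutoff is to walk over each unordered pair of columns once. The sketch below uses synthetic columns `a`, `b`, `c` (hypothetical names, with `b` built as a noisy copy of `a`) rather than the real predictors.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: b is nearly a copy of a, c is independent noise.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=n),
    "c": rng.normal(size=n),
})

# Pearson correlation matrix of all numeric columns.
corr = df.corr()

# Visit each unordered pair once and keep those with |r| > 0.7.
high_pairs = [
    (x, y, corr.loc[x, y])
    for x, y in combinations(corr.columns, 2)
    if abs(corr.loc[x, y]) > 0.7
]
```

On the real data, drop the target column before building the pair list so that only predictor-to-predictor correlations are reported.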


-----------------------------------------------------------------

**Part 3: Full Logistic Regression Model**

**Task 3.1:** Build the full model <br>
- Create a logistic regression model using ALL predictor variables<br>
- Display the model summary with coefficients, standard errors, and p-values<br>
- Report the log-likelihood and deviance; refer to the Programming for DSA notes for the definition of deviance.<br>


**Task 3.2:** Initial model interpretation <br>
- How many predictors are statistically significant at α = 0.05? <br>
- Is the model significant overall (likelihood ratio test)? Refer to the Programming for DSA notes. <br>

------------------------------------------------------------------

**Part 4: Linearity Assumption - Log Odds**

**Task 4.1:** Test linearity of continuous predictors to log odds <br>
For each continuous predictor, test the linearity assumption using the following method:<br>
- Box-Tidwell test: for each continuous predictor x, add the interaction term x·ln(x) to the model; a statistically significant term suggests that the linearity assumption is violated<br>

**Task 4.2:** Document findings <br>
- Which continuous variables satisfy the linearity assumption?<br>
- Which variables violate the assumption?<br>
- Report the statistical findings.<br>

--------------------------------------------------------

**Part 5: Multicollinearity Assessment**

**Task 5.1:** Calculate Variance Inflation Factor (VIF) <br>
- Compute VIF for each predictor in the full model<br>
- Create a table or plot displaying VIF values<br>
- Use VIF > 5 or VIF > 10 as thresholds for concern<br>
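
The VIF of predictor j equals 1 / (1 - R²), where R² comes from regressing predictor j on the other predictors, and this is also the j-th diagonal entry of the inverse of the predictors' correlation matrix. The sketch below uses that identity on synthetic columns `x1`, `x2`, `x3` (hypothetical names); `statsmodels` also provides `variance_inflation_factor` as an alternative.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: x2 is nearly collinear with x1, x3 is independent.
rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
predictors = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.2 * rng.normal(size=n),
    "x3": rng.normal(size=n),
})

# VIF_j is the j-th diagonal element of the inverse correlation matrix.
corr = predictors.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=predictors.columns, name="VIF")

# Flag predictors above the stricter of the two thresholds named in the task.
flagged = vif[vif > 5]
```

Only predictors go into this calculation; the target variable is left out.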

**Task 5.2:** Interpret and address multicollinearity <br>
- Which predictors exhibit problematic multicollinearity?<br>
- How might multicollinearity affect coefficient estimates and p-values?<br>
- Suggest which variables could be removed or combined<br>


-----------------------------------------------------------------

**Part 6: Goodness of Fit - Hosmer-Lemeshow Test**

**Task 6.1:** Perform Hosmer-Lemeshow test on full model <br>
- Conduct the Hosmer-Lemeshow goodness-of-fit test (typically with g = 10 groups) <br>
- Report the test statistic, degrees of freedom, and p-value <br>
- Create a calibration plot showing observed vs. expected probabilities<br>
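
The Hosmer-Lemeshow statistic can be computed by hand from predicted probabilities. The sketch below groups observations into g quantile bins of predicted risk and compares observed with expected event counts, using g - 2 degrees of freedom. The function name `hosmer_lemeshow` is hypothetical, and the data are synthetic probabilities that are well calibrated by construction.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, g=10):
    """Return (statistic, degrees of freedom, p-value) for g risk groups."""
    data = pd.DataFrame({"y": y_true, "p": y_prob})
    # Quantile groups of predicted risk; duplicates="drop" guards against ties.
    data["group"] = pd.qcut(data["p"], q=g, duplicates="drop")
    grouped = data.groupby("group", observed=True)
    observed = grouped["y"].sum()   # observed events per group
    expected = grouped["p"].sum()   # expected events per group
    size = grouped["y"].count()     # group sizes
    # Sum over groups of (O - E)^2 / (E * (1 - E / n)).
    stat = (((observed - expected) ** 2) / (expected * (1 - expected / size))).sum()
    dof = len(size) - 2
    return stat, dof, chi2.sf(stat, dof)

# Synthetic example: outcomes drawn from the stated probabilities.
rng = np.random.default_rng(3)
p = rng.uniform(0.05, 0.95, size=1000)
y = rng.binomial(1, p)
hl_stat, hl_df, hl_p = hosmer_lemeshow(y, p)
```

The per-group observed rate (`observed / size`) plotted against the mean predicted probability (`expected / size`) gives the points for a calibration plot.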


**Task 6.2:** Interpret the results<br>
- Does the model fit the data well? (a p-value > 0.05 means there is no evidence of poor fit)<br>
- Are there systematic patterns in the calibration plot?<br>
- What does this tell you about the model's predictive ability?<br>

------------------------------------------------------

**Part 7: Alternative Model Building - Method 1**

**Task 7.1:** Build your first alternative model <br>
Choose one method:<br>
- Manual feature selection based on clinical/domain knowledge<br>
- Best subset selection<br>


**Task 7.2:** Document your approach<br>
- Clearly state which method you chose and why<br>
- Describe the selection criterion used (AIC, BIC, cross-validation)<br>
- Report which variables were selected/retained<br>
- Display the final model summary<br>

-------------------------------------

**Part 8: Alternative Model Building - Method 2**

**Task 8.1:** Build your second alternative model <br>
Choose a DIFFERENT method from Part 7: <br>
- Model it using decision tree-based feature importance then logistic regression<br>
- Stepwise selection (forward, backward, or both)<br>



**Task 8.2:** Document your approach<br>
- State your method and rationale<br>
- Describe the tuning process (if applicable) <br>
- Report which variables/features were used<br>
- Display the final model summary<br>

---------------------------------------------

**Part 9: Model Evaluation and Comparison**

**Task 9.1:** Split data for evaluation <br>
- Create train-test split (80-20 or use cross-validation) <br>
- Ensure both classes are represented in both sets <br>
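
A stratified 80-20 split keeps the class proportions of the target roughly equal in both sets. The sketch below uses scikit-learn's `train_test_split` with `stratify=y` on synthetic arrays; the `random_state` value is arbitrary and only fixes the shuffle for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a 55/45 binary target (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 55 + [0] * 45)

# stratify=y preserves the class proportions in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Share of the positive class in each set.
train_share = y_train.mean()
test_share = y_test.mean()
```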

**Task 9.2:** Calculate performance metrics for all three models <br>
For each model (Full Model, Method 1, Method 2), calculate on TEST data: <br>
- Accuracy: (TP + TN) / Total<br>
- Precision: TP / (TP + FP)<br>
- Recall/Sensitivity: TP / (TP + FN)<br>
- F1 Score: 2 × (Precision × Recall) / (Precision + Recall)<br>
- AUC-ROC: Area under the ROC curve<br>
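
The formulas above translate directly into code. The sketch below computes them in plain Python from a small made-up set of labels and scores, with AUC taken as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count one half). The function names are hypothetical; `sklearn.metrics` offers equivalent functions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def auc_roc(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up test labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # default 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = auc_roc(y_true, scores)
```

Here the confusion matrix has TP = 3, FP = 1, FN = 1, TN = 3, so accuracy, precision, recall, and F1 all equal 0.75, and 15 of the 16 positive-negative pairs are ranked correctly, giving an AUC of 0.9375.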

**Task 9.3:** Create visualizations <br>
- Plot ROC curves for all three models on the same graph<br>
- Create confusion matrices for each model<br>
- Include optimal probability threshold determination<br>
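
The task does not name a criterion for the optimal threshold. One common choice, assumed here, is Youden's J statistic (J = sensitivity + specificity - 1, equivalently TPR - FPR), maximized over candidate thresholds. The function name `youden_threshold` and the data are made up for illustration.

```python
def youden_threshold(y_true, scores):
    """Return the threshold maximizing TPR - FPR, and that maximum."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    best_thr, best_j = None, float("-inf")
    # Every distinct score is a candidate; predict positive when score >= threshold.
    for thr in sorted(set(scores)):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.55, 0.35, 0.6, 0.3, 0.2, 0.1, 0.05]

best_thr, best_j = youden_threshold(y_true, scores)
```

In this example the best threshold is 0.35, where all five positives are caught at the cost of one false positive. In a screening setting, where a missed case may cost more than a false alarm, a cost-weighted criterion may be preferable to Youden's J.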

**Task 9.4:** Build comparison table <br>
![image.png](attachment:image.png)

**Task 9.5:** Select the best model <br>
- Which model performs best overall? <br>
- Is there a tradeoff between metrics (e.g., precision vs. recall)?<br>
- Consider model parsimony (number of predictors)<br>
- State your final model choice with justification<br>

------------------------------------------

**Part 10: Best Model Interpretation**

**Task 10.1:** Interpret coefficients of the best model <br>
For each predictor in your chosen best model:<br>
- Report the coefficient (β)<br>
- Calculate and report the Odds Ratio (OR = e^β)<br>
- Interpret the odds ratio in context:<br>
  - For continuous variables: "Each one-unit increase in [variable] multiplies the odds of heart disease by [OR]"
  - For binary variables: "[Group 1] has [OR] times the odds of heart disease compared to [Group 0]"
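
The conversion from coefficient to odds ratio is a single exponential. The sketch below uses hypothetical coefficient values (not fitted to the real data) and also shows how to rescale the odds ratio for a change of several units, which often reads more naturally for variables such as age.

```python
import math

def odds_ratio(beta, units=1.0):
    """Odds ratio for a change of `units` in the predictor: exp(beta * units)."""
    return math.exp(beta * units)

def percent_change_in_odds(beta, units=1.0):
    """Percentage change in the odds for a change of `units` in the predictor."""
    return (odds_ratio(beta, units) - 1.0) * 100.0

# Hypothetical coefficients, for illustration only.
beta_age = 0.05   # continuous predictor: effect per one-year increase
beta_flag = -0.9  # binary predictor: group 1 compared with group 0

or_age = odds_ratio(beta_age)               # per year
or_age_10 = odds_ratio(beta_age, units=10)  # per ten years
or_flag = odds_ratio(beta_flag)             # group 1 vs. group 0
pct_flag = percent_change_in_odds(beta_flag)
```

With these values, each additional year multiplies the odds by about 1.05 and ten years by about 1.65, while the negative coefficient gives an odds ratio of about 0.41, meaning roughly 59% lower odds for group 1.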


**Task 10.2:** Wald Test for statistical significance<br>
- Report the Wald test statistic for each coefficient<br>
- Report the p-value for each predictor<br>
- State which predictors are statistically significant at α = 0.05<br>
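
The Wald statistic for a single coefficient is z = β / SE(β), with a two-sided p-value from the standard normal distribution. The sketch below computes both with the standard library only, using hypothetical estimates; some software reports z² instead, referred to a chi-square distribution with 1 degree of freedom, which gives the same p-value.

```python
import math

def wald_test(beta, se):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = beta / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical (coefficient, standard error) pairs, for illustration only.
estimates = {"age": (0.05, 0.02), "chol": (0.3, 0.25)}

results = {name: wald_test(beta, se) for name, (beta, se) in estimates.items()}

# Predictors significant at alpha = 0.05.
significant = [name for name, (z, p) in results.items() if p < 0.05]
```

The same z values and p-values appear in the `z` and `P>|z|` columns of a `statsmodels` summary table, so this calculation mainly serves as a check on the reported output.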

**Task 10.3:** Create an interpretation table

![image.png](attachment:image.png)

**Task 10.4:** Summarize key findings <br>
- Which 3 predictors have the strongest association with heart disease?<br>
- Are the directions of effects (positive/negative) consistent with medical knowledge?<br>
- Are there any surprising findings?<br>


----------------------------------------------------------

**Part 11: Executive Summary for Non-Technical Audience**

**Task 11.1:** Write an executive summary (1-5 paragraphs). Your summary should be
understandable to a healthcare administrator or policymaker with no statistical background.
Include:


1. Key Risk Factors Identified:
    - What are the most important factors that increase heart disease risk?
    - Use plain language (e.g., "older patients are more likely to have heart disease" instead of "age has a positive coefficient")
    - Include 2-3 visualizations that clearly show risk patterns

2. Model Explanation (Non-Technical):
    - Explain what the prediction model does without using jargon
    - Use analogies: "The model works like a checklist that weighs different risk factors"
    - Avoid terms like "logistic regression", "AUC", "log odds"

3. Model Performance in Plain Language:
    - How accurate are the predictions? (e.g., "The model correctly identifies 85% of heart disease cases")
    - What are the implications of false positives vs. false negatives?
    - Translate accuracy metrics into actionable insights

4. Clinical Recommendations:
    - Which risk factors should clinicians focus on during screening?
    - How can this model be used in practice?
    - What are the limitations and when should clinical judgment override the model?

**Writing Guidelines:**
- Replace statistical terms:
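    - "Odds ratio of 2.5" → "About 2.5 times the odds" (odds are not the same as probability, so "2.5 times more likely" overstates the effect when the condition is common)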
    - "Statistically significant" → "Has a meaningful impact on risk"
    - "AUC of 0.85" → "The model is highly accurate at distinguishing between patients"
    - "Sensitivity" → "Ability to catch actual heart disease cases"
    - "Specificity" → "Ability to correctly rule out healthy patients"
- Use bullet points for key findings
- Focus on actionable insights
- Include clear recommendations


---------------------------------