# PROJECT 3 — MACHINE LEARNING MODELING REPORT
## Part A: Logistic Regression
## Part B: Random Forest Classifier (Non-Logistic Model)

### 1. Introduction
The objective of this project is to build and evaluate predictive models on a cleaned and preprocessed dataset. The modeling process is divided into two stages: Logistic Regression (Part A), where I used feature engineering to try to improve the ROC-AUC score as well as accuracy and F1, and a more advanced non-logistic model (Random Forest, Part B).

### 2. Data Cleaning & Preprocessing
Key steps included:
- Removed unwanted characters such as `##`, `%`, and `_`.
- Dropped irrelevant columns that were either empty or only held column IDs and carried no useful information.
- Handled missing data (NaN or inf) by imputing the median for numeric columns and the mode for categorical columns.
- Dummy-encoded (`pd.get_dummies`) the categorical variables in the object columns identified from `.info()`, and converted numeric-looking columns to a numeric type.
- Removed highly correlated columns (correlation above 0.8).
- Created quantile bins for the continuous variables, which were identified by checking the number of unique values in each column.
- Created a few visualizations: a correlation plot of the cleaned DataFrame, a heatmap of the null values, an analysis of the continuous variables, and box plots to detect outliers.
- Saved the cleaned data as a CSV file named "cleaned_new_train_data.csv".
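To make the steps above concrete, here is a minimal sketch of how they might look in pandas. It is not the project's actual notebook code: the toy DataFrame `raw`, its column names, and the helper names `impute_missing` and `drop_correlated` are all hypothetical, and which column of a correlated pair to drop is an assumption (here, the later one).

```python
import numpy as np
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and all other columns with the mode."""
    out = df.replace([np.inf, -np.inf], np.nan)  # treat inf like a missing value
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


def drop_correlated(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)


# Hypothetical toy frame standing in for the raw training data.
raw = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x2": [2.0, 4.0, 6.0, 8.0, np.inf, 12.0],  # strongly correlated with x1
    "x3": [5.0, 1.0, 6.0, 2.0, 3.0, 4.0],
    "x4": ["A", "B", None, "B", "A", "B"],
})

clean = impute_missing(raw)
clean = pd.get_dummies(clean, columns=["x4"], drop_first=True, dtype=float)
clean = drop_correlated(clean, threshold=0.8)
clean["x3_bin"] = pd.qcut(clean["x3"], q=3, labels=False)  # quantile bins
print(clean)
```

Imputing before encoding keeps the mode fill on the original category labels, and computing correlations after encoding lets the dummy columns take part in the 0.8 check as well.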

### 3. Part A — Logistic Regression Model
The logistic regression model showed stable performance with or without hyperparameter tuning. The validation accuracy, F1-score, and AUC remained essentially unchanged before and after applying RandomizedSearchCV.

This indicates that:
- The dataset is well suited to logistic regression.
- The model is not highly sensitive to the choice of regularization strength (C).
- The default logistic regression configuration already achieves near-optimal performance.
- Hyperparameter tuning primarily confirmed the best solver/penalty rather than improving predictive power.

I scaled the continuous columns with `StandardScaler`.
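A search of this kind could be set up roughly as follows. This is a sketch under assumptions rather than the exact configuration used here: the data are synthetic (`make_classification`), the search ranges and `n_iter` are illustrative, and for brevity the scaler is applied to every column instead of only the continuous ones.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned training data (assumption, not the project data).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", max_iter=1000),
)

# liblinear supports both l1 and l2, so the two penalties can share one search space.
param_distributions = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": loguniform(1e-3, 1e2),
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=10,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

val_auc = search.score(X_val, y_val)  # ROC-AUC of the refit best model on held-out data
print(search.best_params_)
print(round(val_auc, 3))
```

Sampling `C` on a log scale is a common choice because regularization strength tends to matter across orders of magnitude rather than in linear steps.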

**Best Parameters:** {'solver': 'liblinear', 'penalty': 'l1', 'C': 0.078}

**Validation Metrics:**
- Accuracy: 0.878
- Precision: 0.869
- Recall: 0.847
- F1: 0.858
- AUC: 0.95

**Test Metrics:**
- Accuracy: 0.872
- Precision: 0.866
- Recall: 0.833
- F1: 0.849
- AUC: 0.94
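As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 values can be recomputed from the precision and recall figures above. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the tuned logistic regression model.
val_f1 = f1_from(0.869, 0.847)
test_f1 = f1_from(0.866, 0.833)

print(round(val_f1, 3), round(test_f1, 3))  # prints: 0.858 0.849
```

Both values agree with the reported F1 scores to three decimals.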

### Top Logistic Regression Coefficients for Identifying Interaction Features
- Positive: x25_LC, x14, x17, x20, x18, x9
- Negative: x26_PT, x4, x28
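A ranking like this can be read off a fitted model's `coef_` attribute. The sketch below assumes synthetic data and hypothetical feature names (`f0` to `f5`); only the `liblinear` solver, the `l1` penalty, and `C=0.078` are taken from the best parameters reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=1
)
feature_names = [f"f{i}" for i in range(X.shape[1])]

# Coefficients are only comparable across features when the inputs share a scale.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.078).fit(X_scaled, y)

# Sort ascending: the most negative coefficients come first, the most positive last.
coefs = pd.Series(model.coef_[0], index=feature_names).sort_values()
print("Most negative:", coefs.head(3).index.tolist())
print("Most positive:", coefs.tail(3).index.tolist()[::-1])
```

With an L1 penalty some coefficients can be exactly zero, so the "top" lists are best read together with the coefficient values themselves.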

In addition to standard preprocessing and scaling, I explored interaction features to help logistic regression capture relationships that are not purely linear. Logistic regression by itself assumes additive effects, meaning each feature contributes independently to the log-odds. However, in many real-world datasets, the effect of one variable may depend on the value of another variable.

To capture such dependencies, I created a set of pairwise interaction terms between the continuous variables:

- `x23_x29 = x23 * x29`
- `x23_x31 = x23 * x31`
- `x29_x31 = x29 * x31`
- `x5_x23 = x5 * x23`
- `x5_x29 = x5 * x29`

These product terms let the logistic model capture patterns such as:

- "When x23 is high and x29 is high, the risk increases more sharply."
- "The effect of x5 may only be visible when combined with x23."
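The interaction terms listed above could be generated with a small helper like the one below. The function name `add_interactions` and the demo values are hypothetical; only the column pairs come from this report.

```python
import pandas as pd

# Column pairs taken from the interaction terms listed above.
INTERACTION_PAIRS = [
    ("x23", "x29"),
    ("x23", "x31"),
    ("x29", "x31"),
    ("x5", "x23"),
    ("x5", "x29"),
]


def add_interactions(df: pd.DataFrame, pairs=INTERACTION_PAIRS) -> pd.DataFrame:
    """Return a copy of df with one product column per pair, named '<a>_<b>'."""
    out = df.copy()
    for a, b in pairs:
        out[f"{a}_{b}"] = out[a] * out[b]
    return out


# Hypothetical values, only to show the shape of the result.
demo = pd.DataFrame({
    "x5": [1.0, 2.0],
    "x23": [3.0, 4.0],
    "x29": [0.5, 2.0],
    "x31": [10.0, -1.0],
})
result = add_interactions(demo)
print(result[["x23_x29", "x5_x23", "x5_x29"]])
```

Keeping the pairs in a single list makes it easy to add or remove candidate interactions and re-run the evaluation.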

**Feature engineering (interaction + log transforms for Part A)**

**Validation Metrics:**
- Accuracy: 0.880
- Precision: 0.873
- Recall: 0.848
- F1: 0.860
- AUC: 0.95

**Test Metrics:**
- Accuracy: 0.873
- Precision: 0.866
- Recall: 0.833
- F1: 0.850
- AUC: 0.94

Validation accuracy and F1 improved slightly (by 0.002 each), while on the test set accuracy and F1 improved by only 0.001. ROC-AUC did not change meaningfully.

### 4. Part B — Random Forest Classifier
For Part B, I chose a Random Forest classifier and tuned its hyperparameters with RandomizedSearchCV.

**Best Parameters:** class_weight='balanced', max_features=0.7, min_samples_leaf=2, min_samples_split=10, n_estimators=400
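A randomized search over these hyperparameters might look like the sketch below. The search ranges, `n_iter`, and the synthetic data are assumptions chosen so the example runs quickly; they are not the ranges used to obtain the parameters reported above.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the cleaned training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

# Illustrative ranges; n_estimators is kept small so the sketch finishes quickly.
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_features": [0.5, 0.7, "sqrt"],
    "min_samples_leaf": randint(1, 5),
    "min_samples_split": randint(2, 12),
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated ROC-AUC of the best candidate
```

Fixing `random_state` on both the forest and the search makes the sampled candidates and the fitted trees reproducible between runs.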

**Validation Metrics:**
- Accuracy: 0.964
- Precision: 0.970
- Recall: 0.946
- F1: 0.958
- AUC: 0.99

**Test Metrics:**
- Accuracy: 0.962
- Precision: 0.969
- Recall: 0.942
- F1: 0.955
- AUC: 0.99

### Top Random Forest Features
- x14, x9, x26_PT, x28, x16, x25_LC, x5, x20, x18, x29

The Random Forest model provides feature importance values that indicate how frequently and how effectively each variable is used to split the data into more homogeneous groups. Higher importance means the feature plays a stronger role in determining the prediction. Based on the tuned Random Forest model, the top contributing features were:

1. x14 (Most Important Feature)

x14 was consistently the strongest predictor of the target.
Its high importance indicates that this variable frequently helped differentiate between the two classes through clean and effective split points. This suggests that x14 captures a strong underlying pattern or threshold effect associated with the outcome.

2. x9

x9 also ranked highly and likely interacts with other predictors in determining the risk.
Random Forest models naturally account for such interactions, which may explain why this feature gained more importance compared to its coefficient weight in logistic regression.

3. x26_PT

This categorical indicator (_PT) had strong discriminative power.
Its importance means that belonging to the PT category has a notable impact on the predicted class. Interestingly, this feature had one of the most negative coefficients in logistic regression, and the Random Forest’s high importance further validates its relevance.

4. x28

x28 emerged as another influential feature.
Its importance shows it may represent a non-linear or threshold-driven relationship, which tree-based models capture more effectively than logistic regression.

5. x16

Though not extremely strong in the logistic regression model, x16 appears frequently in Random Forest splits, suggesting it interacts with other predictors or has non-linear boundaries.

6. x25_LC

This categorical variable (_LC) had one of the highest positive coefficients in logistic regression and also ranks among top features in the tree model.
Both models agree on its strong association with the target.

7. x5

x5 is one of the continuous features.
Its importance indicates that the Random Forest found useful thresholds in the variable, showing non-linear effects (e.g., risk increasing or decreasing rapidly past certain values).

8. x20 & x18

Both features appear moderately important.
Their ranking suggests they do not individually drive predictions but likely interact with other variables to improve split quality.

9. x29

As another continuous variable, x29 contributes meaningfully to model predictions.
Although its effect appears subtle in logistic regression, the Random Forest reveals more complex patterns involving this variable.
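For reference, a ranking like the one discussed above can be obtained from a fitted forest's `feature_importances_` attribute, which in scikit-learn holds impurity-based importances normalized to sum to 1. The data and feature names below are synthetic placeholders, not the project's columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=2, random_state=0
).fit(X, y)

# Sort descending so the most important features come first.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.head(5))
```

Impurity-based importances describe how useful a feature was for splitting; they do not by themselves show whether the effect is an interaction or a threshold, so those readings remain interpretations.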

### 5. Model Comparison

This project evaluated two modeling approaches:
- Logistic Regression (Part A): a linear, interpretable model
- Random Forest (Part B): a nonlinear, tree-based ensemble model

Both models were trained on the same cleaned dataset and evaluated on validation and test sets.

Performance Summary (test set; Logistic Regression values are from the feature-engineered model):

| Metric | Logistic Regression | Random Forest |
|--------|--------------------|---------------|
| Test AUC | 0.94 | 0.99 |
| Test Accuracy | 0.873 | 0.962 |
| Test F1 | 0.850 | 0.955 |

Random Forest performs substantially better than Logistic Regression across every metric.
The improvement is especially large in Recall, F1 Score, and overall accuracy.
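One way to put the size of the improvement in perspective is to compare how far each model falls short of a perfect score of 1.0. The short calculation below uses only the test metrics from the summary table; the "shortfall" framing is my own illustration, not a standard metric.

```python
# Test metrics from the performance summary table.
lr = {"accuracy": 0.873, "f1": 0.850, "auc": 0.94}
rf = {"accuracy": 0.962, "f1": 0.955, "auc": 0.99}

reductions = {}
for metric in lr:
    lr_gap = 1 - lr[metric]  # distance of Logistic Regression from a perfect score
    rf_gap = 1 - rf[metric]  # distance of Random Forest from a perfect score
    reductions[metric] = (lr_gap - rf_gap) / lr_gap
    print(f"{metric}: shortfall {lr_gap:.3f} -> {rf_gap:.3f} "
          f"({reductions[metric]:.0%} smaller)")
```

On these numbers the shortfall shrinks by about 70% for accuracy and F1 and by about 83% for AUC.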

**Why Random Forest performs better:**

1. Captures Nonlinear Relationships

Logistic Regression assumes a linear relationship between each feature and the log-odds of the outcome.
Random Forest does not; it automatically learns:

- nonlinear trends
- thresholds
- interactions between features

This gives it a major advantage on complex datasets.

2. Handles Feature Interactions Automatically

In Part A, interaction terms had to be manually engineered and tested.
In Part B, Random Forest learns interactions on its own through tree splits.

3. More Robust to Irrelevant or Noisy Features

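Logistic Regression, especially with regularization, is sensitive to multicollinearity and feature scaling.
Random Forest is much less so: tree splits are unaffected by monotonic rescaling of a feature, and uninformative features are rarely chosen for splits.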

4. Consistent Generalization

After hyperparameter tuning, Random Forest achieved:

- Train AUC: 0.9996
- Validation/Test AUC: 0.99

The small gap (about 0.01) indicates excellent generalization with minimal overfitting.
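A train-versus-held-out comparison like this can be computed with `roc_auc_score` on each split. The sketch below uses synthetic data and a smaller forest (`n_estimators=100`) to keep it fast; the remaining settings mirror the best parameters reported in Part B. The numbers it prints will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,  # reduced from 400 so the sketch runs quickly
    max_features=0.7,
    min_samples_leaf=2,
    min_samples_split=10,
    class_weight="balanced",
    random_state=0,
).fit(X_train, y_train)

# AUC needs scores, so use the predicted probability of the positive class.
train_auc = roc_auc_score(y_train, forest.predict_proba(X_train)[:, 1])
val_auc = roc_auc_score(y_val, forest.predict_proba(X_val)[:, 1])
print(f"train={train_auc:.4f} val={val_auc:.4f} gap={train_auc - val_auc:.4f}")
```

A large gap between the two values would point to overfitting, which is why the comparison is worth reporting alongside the held-out score.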

### 6. Conclusion
Random Forest substantially outperforms Logistic Regression on all reported test metrics (AUC 0.99 vs. 0.94, accuracy 0.962 vs. 0.873, F1 0.955 vs. 0.850), showing excellent generalization and robustness after hyperparameter tuning.
