## Penalized Logistic Regression

#### Imports

In [31]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

#### Loading Training and Validation Data

In [19]:
train = pd.read_csv("/nvss_train.csv")
val = pd.read_csv("/nvss_val.csv")

# Remove any rows with missing target values
train = train.dropna(subset=["infant_death"])

# Separate training features (X) and labels (y)
X_train = train.drop(columns=["infant_death"])
y_train = train["infant_death"]

# Separate validation features and labels (validation stays untouched)
X_val = val.drop(columns=["infant_death"])
y_val = val["infant_death"]

#### Class Imbalance Check
- We first check how many infants died versus survived in the training data.
- This shows how imbalanced the outcome is.
- Infant death is extremely rare, so we expect the positive class (1) to be very small.

In [33]:
# Reuse `train` from the loading step (rows with a missing target are already removed)

# Count 0s and 1s
counts = train["infant_death"].value_counts()

print(counts)
print("\nPercentage distribution:")
print((counts / len(train)) * 100)

infant_death
0    2126113
1      12536
Name: count, dtype: int64

Percentage distribution:
infant_death
0    99.413836
1     0.586164
Name: count, dtype: float64


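The data is extremely imbalanced (99.4% survivors, 0.6% deaths). A model trained on it could predict “0” almost everywhere and look accurate while being useless for identifying deaths. To reduce the imbalance, we keep all death cases and randomly sample 10% of survivors. This raises the share of deaths from about 0.6% to about 5.6%, shrinks the training set from ~2.1M to ~225k rows, and speeds up fitting. The classes are still not balanced after subsampling, so the model below also uses class weights. The validation set keeps its original class distribution.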
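To make the "accurate but useless" point concrete, the sketch below scores a hypothetical classifier that predicts 0 for every row, using only the class counts printed above. The helper name `always_zero_scores` is ours for illustration and is not part of the notebook.

```python
def always_zero_scores(n_neg, n_pos):
    """Accuracy, recall and F1 for a classifier that predicts 0 for every row."""
    tp, fp = 0, 0            # it never predicts the positive class
    fn, tn = n_pos, n_neg    # every death is missed, every survivor is correct
    accuracy = (tp + tn) / (n_neg + n_pos)
    recall = tp / (tp + fn)
    # F1 = 2*TP / (2*TP + FP + FN), which is 0 whenever TP is 0
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, recall, f1


# Class counts taken from the value_counts() output above
acc, rec, f1 = always_zero_scores(n_neg=2126113, n_pos=12536)
print(f"accuracy={acc:.4f} recall={rec:.4f} f1={f1:.4f}")
```

Accuracy comes out above 99% even though no death is ever detected, which is why the later cells score models with F1 instead.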

In [35]:
# Separate minority (death) and majority (survival) classes
minority = train[train["infant_death"] == 1]     # All infant deaths
majority = train[train["infant_death"] == 0]     # All survivors

# Subsample only 10% of majority class
majority_sampled = majority.sample(frac=0.10, random_state=42)

# Combine balanced dataset
train_balanced = pd.concat([minority, majority_sampled], axis=0)

# Shuffle
train_balanced = train_balanced.sample(frac=1, random_state=42)

# Display balanced counts
print("Original training size:", len(train))
print("Balanced training size:", len(train_balanced))
print("Deaths:", train_balanced.infant_death.sum())
print("Survivors:", len(train_balanced) - train_balanced.infant_death.sum())

Original training size: 2138649
Balanced training size: 225147
Deaths: 12536
Survivors: 212611
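As a quick check on how much the subsampling changes the class mix, the following sketch recomputes the share of deaths before and after, using the counts printed above. `positive_rate` is an illustrative helper, not something defined elsewhere in the notebook.

```python
def positive_rate(n_pos, n_neg):
    """Fraction of rows that belong to the positive class."""
    return n_pos / (n_pos + n_neg)


# Counts from the original and subsampled training sets printed above
rate_before = positive_rate(n_pos=12536, n_neg=2126113)
rate_after = positive_rate(n_pos=12536, n_neg=212611)

print(f"before subsampling: {rate_before:.2%} deaths")
print(f"after subsampling:  {rate_after:.2%} deaths")
print(f"survivors per death: {2126113 / 12536:.1f} -> {212611 / 12536:.1f}")
```

The subsampled set is roughly ten times less imbalanced than the original, but survivors still outnumber deaths by about 17 to 1.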


#### Final X and y, CV setup, and pipeline

In [36]:
# Final training features and labels
X_train = train_balanced.drop(columns=["infant_death"])
y_train = train_balanced["infant_death"]

# Stratified 3-fold cross-validation
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Pipeline: StandardScaler + Logistic Regression
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(
        solver="saga",          # Supports L1, L2, and elastic net penalties
        max_iter=5000,
        class_weight="balanced" # Helps with residual imbalance
    ))
])
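For reference, `class_weight="balanced"` in scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the subsampled counts to show roughly what weights the pipeline would use on the full training set; inside cross-validation the weights are computed per training fold, so they will differ slightly. `balanced_class_weights` is an illustrative helper name.

```python
def balanced_class_weights(class_counts):
    """Apply the 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Survivor and death counts from the subsampled training set
weights = balanced_class_weights({0: 212611, 1: 12536})
print(weights)
print("death-to-survivor weight ratio:", weights[1] / weights[0])
```

Each death counts about 17 times as much as each survivor in the loss, which offsets the imbalance that remains after subsampling.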


#### Modeling
We train three penalized logistic models: Ridge (L2) shrinks coefficients, Lasso (L1) selects features by zeroing some out, and Elastic Net mixes both penalties.

We tune each model with GridSearchCV using F1 score, since accuracy isn’t reliable for rare-event data.
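To illustrate how the three penalties differ, here is a minimal sketch of the penalty term each one adds to the logistic loss, following the form given in the scikit-learn user guide (L1 norm, half the squared L2 norm, and a mix controlled by `l1_ratio`). The function `penalty` and the example weights are made up for illustration. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values of `C` penalize the coefficients more heavily.

```python
def penalty(weights, kind, l1_ratio=0.5):
    """Penalty term r(w) for a coefficient vector, excluding the intercept."""
    l1 = sum(abs(w) for w in weights)          # sum of absolute values
    l2 = 0.5 * sum(w * w for w in weights)     # half the squared L2 norm
    if kind == "l1":
        return l1
    if kind == "l2":
        return l2
    if kind == "elasticnet":
        return l1_ratio * l1 + (1 - l1_ratio) * l2
    raise ValueError(f"unknown penalty: {kind}")


example_weights = [2.0, -0.5, 0.0]   # made-up coefficients
for kind in ("l2", "l1", "elasticnet"):
    print(kind, penalty(example_weights, kind))
```

The L2 term grows quadratically, so it mostly punishes large coefficients, while the L1 term charges the same rate for every unit of magnitude, which is what lets it push small coefficients all the way to zero.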

#### Ridge Model

In [37]:
param_grid_ridge = {
    "logreg__penalty": ["l2"],
    "logreg__C": [0.01, 0.1, 1, 10]
}

grid_ridge = GridSearchCV(
    pipe, param_grid_ridge, cv=cv,
    scoring="f1", n_jobs=-1, verbose=1
)

grid_ridge.fit(X_train, y_train)

ridge_best = grid_ridge.best_estimator_
ridge_pred = ridge_best.predict(X_val)

print("RIDGE best params:", grid_ridge.best_params_)
print("RIDGE CV F1:", grid_ridge.best_score_)
print("RIDGE VAL F1:", f1_score(y_val, ridge_pred))
print(confusion_matrix(y_val, ridge_pred))

Fitting 3 folds for each of 4 candidates, totalling 12 fits
RIDGE best params: {'logreg__C': 0.01, 'logreg__penalty': 'l2'}
RIDGE CV F1: 0.5005626833358247
RIDGE VAL F1: 0.10463057707612597
[[654536  54169]
 [   958   3221]]
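The confusion matrix printed above follows scikit-learn's layout (rows are true classes, columns are predicted classes), so it reads `[[TN, FP], [FN, TP]]`. As a sanity check, the sketch below recomputes precision, recall, F1 and accuracy from those four numbers; `binary_metrics` is an illustrative helper.

```python
def binary_metrics(cm):
    """Metrics from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Ridge confusion matrix on the validation set, copied from the output above
ridge_metrics = binary_metrics([[654536, 54169], [958, 3221]])
for name, value in ridge_metrics.items():
    print(f"{name}: {value:.4f}")
```

The model catches about 77% of deaths, but only about 5.6% of its positive predictions are correct, and the low precision is what pulls F1 down. Accuracy stays above 90% regardless, which again shows why it is a poor guide here.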


#### Lasso Model

In [38]:
param_grid_lasso = {
    "logreg__penalty": ["l1"],
    "logreg__C": [0.01, 0.1, 1, 10]
}

grid_lasso = GridSearchCV(
    pipe, param_grid_lasso, cv=cv,
    scoring="f1", n_jobs=-1, verbose=1
)

grid_lasso.fit(X_train, y_train)

lasso_best = grid_lasso.best_estimator_
lasso_pred = lasso_best.predict(X_val)

print("LASSO best params:", grid_lasso.best_params_)
print("LASSO CV F1:", grid_lasso.best_score_)
print("LASSO VAL F1:", f1_score(y_val, lasso_pred))
print(confusion_matrix(y_val, lasso_pred))

Fitting 3 folds for each of 4 candidates, totalling 12 fits
LASSO best params: {'logreg__C': 0.01, 'logreg__penalty': 'l1'}
LASSO CV F1: 0.5019065170812075
LASSO VAL F1: 0.10483463874550163
[[654692  54013]
 [   960   3219]]


#### Elastic Net Model

In [39]:
param_grid_en = {
    "logreg__penalty": ["elasticnet"],
    "logreg__C": [0.1, 1],
    "logreg__l1_ratio": [0.5]   # 50% L1, 50% L2
}

grid_en = GridSearchCV(
    pipe, param_grid_en, cv=cv,
    scoring="f1", n_jobs=-1, verbose=1
)

grid_en.fit(X_train, y_train)

en_best = grid_en.best_estimator_
en_pred = en_best.predict(X_val)

print("ElasticNet best params:", grid_en.best_params_)
print("ElasticNet CV F1:", grid_en.best_score_)
print("ElasticNet VAL F1:", f1_score(y_val, en_pred))
print(confusion_matrix(y_val, en_pred))

Fitting 3 folds for each of 2 candidates, totalling 6 fits
ElasticNet best params: {'logreg__C': 0.1, 'logreg__l1_ratio': 0.5, 'logreg__penalty': 'elasticnet'}
ElasticNet CV F1: 0.5000940215687087
ElasticNet VAL F1: 0.10456773690874266
[[654499  54206]
 [   958   3221]]


#### Model Comparison

To compare the three penalized logistic regression models (Ridge (L2), Lasso (L1), and Elastic Net) we evaluated each model using both cross-validated F1 scores and validation set metrics. The table below summarizes their performance.

In [40]:
from sklearn.metrics import precision_score, recall_score

# (label, fitted grid search, validation predictions) for each model
fitted = [
    ("Ridge (L2)", grid_ridge, ridge_pred),
    ("Lasso (L1)", grid_lasso, lasso_pred),
    ("Elastic Net", grid_en, en_pred),
]

# One row of CV and validation metrics per model
results = [
    {
        "Model": name,
        "Best Params": grid.best_params_,
        "CV F1": grid.best_score_,
        "Val F1": f1_score(y_val, pred),
        "Val Precision": precision_score(y_val, pred),
        "Val Recall": recall_score(y_val, pred),
    }
    for name, grid, pred in fitted
]

results_df = pd.DataFrame(results)
results_df

| | Model | Best Params | CV F1 | Val F1 | Val Precision | Val Recall |
|---|---|---|---|---|---|---|
| 0 | Ridge (L2) | {'logreg__C': 0.01, 'logreg__penalty': 'l2'} | 0.500563 | 0.104631 | 0.056125 | 0.770759 |
| 1 | Lasso (L1) | {'logreg__C': 0.01, 'logreg__penalty': 'l1'} | 0.501907 | 0.104835 | 0.056245 | 0.770280 |
| 2 | Elastic Net | {'logreg__C': 0.1, 'logreg__l1_ratio': 0.5, 'l... | 0.500094 | 0.104568 | 0.056089 | 0.770759 |


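Among the three penalized logistic regression models (Ridge, Lasso, and Elastic Net), Lasso (L1) performed marginally best. It achieved the highest F1 score on both cross-validation (0.5019) and the validation set (0.1048), along with the highest validation precision (0.0562). The differences appear only in the third or fourth decimal place, and Lasso's recall (0.7703) is fractionally lower than that of the other two models (0.7708), so in practice the three models perform almost identically.

For every model, validation F1 (about 0.10) is far below CV F1 (about 0.50). The CV folds are drawn from the subsampled training data, where deaths make up about 5.6% of rows, while the validation set keeps the original rate of about 0.6%. With ten times as many survivors per death, the same false-positive rate produces roughly ten times as many false positives, which lowers precision and therefore F1.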
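The gap between CV F1 and validation F1 can be checked with a back-of-the-envelope calculation. The sketch below takes the Ridge validation confusion matrix and asks what F1 would be if only 10% of the survivors were present, assuming the model's true-positive and false-positive rates stay the same. That assumption and the helper name `f1_with_fewer_negatives` are ours; this is an approximation, not a rerun of the cross-validation.

```python
def f1_with_fewer_negatives(cm, keep_frac):
    """F1 if only `keep_frac` of the negatives were present.

    `cm` is laid out as [[TN, FP], [FN, TP]]. Positives are left unchanged,
    and false positives scale in proportion to the number of negatives.
    """
    (tn, fp), (fn, tp) = cm
    fp_scaled = fp * keep_frac
    return 2 * tp / (2 * tp + fp_scaled + fn)


ridge_val_cm = [[654536, 54169], [958, 3221]]
f1_original_mix = f1_with_fewer_negatives(ridge_val_cm, keep_frac=1.0)
f1_subsampled_mix = f1_with_fewer_negatives(ridge_val_cm, keep_frac=0.10)

print(f"F1 at validation class mix:  {f1_original_mix:.4f}")
print(f"F1 at 10% of the survivors:  {f1_subsampled_mix:.4f}")
```

Under this assumption the projected F1 lands close to the cross-validated value of about 0.50, which suggests the gap is driven mainly by the change in class mix rather than by overfitting.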

Because an L1 penalty can shrink weaker coefficients exactly to zero, Lasso aligns well with our goal of identifying key predictors of newborn mortality, and a sparser model may also generalize better.
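One way to see which predictors an L1 model kept is to list the features whose coefficients are not zero. The sketch below shows the idea on made-up feature names and coefficients; `nonzero_features` and all the values passed to it are hypothetical, not results from this notebook.

```python
def nonzero_features(feature_names, coefficients, tol=0.0):
    """Return (name, coefficient) pairs with |coefficient| > tol, largest first."""
    kept = [
        (name, coef)
        for name, coef in zip(feature_names, coefficients)
        if abs(coef) > tol
    ]
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical names and coefficients, for illustration only
names = ["feature_a", "feature_b", "feature_c", "feature_d"]
coefs = [0.0, -1.2, 0.3, 0.0]

selected = nonzero_features(names, coefs)
print(selected)
```

In this notebook, the coefficients would presumably come from `lasso_best.named_steps["logreg"].coef_[0]` and the names from `X_train.columns`. Since the pipeline standardizes the inputs, the coefficient magnitudes are on a comparable scale across features.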

#### Saving the Lasso Model

In [41]:
from joblib import dump
dump(lasso_best, "lasso_model.joblib")

['lasso_model.joblib']