Amisha Meka

Assignment Week 09 - Machine Learning with Scikit Learn

**Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.**

The notebook's results table compares several machine learning models for predicting patient mortality. Based on that evidence, RandomForest_CV2 performed best overall: it reached a training accuracy of 0.7473 and a test accuracy of 0.736, so the two scores are close to each other.
RandomForest_noCV scored almost perfectly in training (0.9993) but reached only 0.686 on the test set, a clear sign of severe overfitting. A gap this large indicates that the model memorized the training data rather than learning patterns that generalize.
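The size of the train/test gap can be made explicit with a few lines of arithmetic. This is only an illustrative sketch: the scores are the ones quoted in this answer, and the helper name `generalization_gap` is hypothetical, not something defined in the notebook.

```python
# Training and test accuracies as quoted in this answer
scores = {
    "RandomForest_noCV": (0.9993, 0.686),
    "RandomForest_CV2": (0.7473, 0.736),
}


def generalization_gap(train_acc, test_acc):
    """Return training accuracy minus test accuracy (hypothetical helper)."""
    return train_acc - test_acc


# A large positive gap suggests the model fits the training data
# much better than it fits unseen data
gaps = {
    name: round(generalization_gap(train_acc, test_acc), 4)
    for name, (train_acc, test_acc) in scores.items()
}

for name, gap in gaps.items():
    print(f"{name}: train/test gap = {gap:.4f}")
```

The gap for RandomForest_noCV comes out roughly 28 times larger than the gap for RandomForest_CV2, which is the basis for calling the former overfit.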
The logistic regression models (Logistic, and Logistic_L1 with several C values) all performed similarly, with training accuracy of about 0.73 and test accuracy between 0.71 and 0.72. Their training and test scores agree closely, but RandomForest_CV2 still has slightly higher test accuracy.
RandomForest_CV2 reached the highest test accuracy while still generalizing well because its maximum tree depth was tuned by grid search with cross-validation. Limiting tree depth curbed overfitting while preserving predictive power. Compared with RandomForest_noCV, it trades training accuracy for better performance on new data.
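For readers who want to see the general idea, tuning tree depth by cross-validated grid search might look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the depth grid, forest size, and fold count are assumptions chosen for illustration; they are not the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the patient dataset)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid of candidate depths; the notebook's grid may differ
param_grid = {"max_depth": [2, 4, 6, 8]}

# 5-fold cross-validation on the training set picks the depth with the
# best mean validation accuracy, then refits on the full training set
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print(f"best max_depth={best_depth}, train={train_acc:.3f}, test={test_acc:.3f}")
```

Because the depth is chosen on validation folds rather than on training accuracy, shallow trees can win even though deeper trees fit the training data more closely.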
In short, RandomForest_CV2 is the best-performing model: its test accuracy of 0.736 is the highest in the table, and the small gap between its training and test scores indicates that it generalizes well.

**Next, fit a series of logistic regression models, without regularization. Each model should use the same set of predictors (all of the relevant predictors in the dataset) and should use the entire dataset, rather than a fraction of it. Use a randomly chosen 80% proportion of observations for training and the remaining for checking the generalizable performance (i.e., performance on the holdout subset). Be sure to ensure that the training and holdout subsets are identical across all models. Each model should choose a different solver.**



**Compare the results of the models in terms of their accuracy (use this as the performance metric to assess generalizability error on the holdout subset) and the time taken (use appropriate timing function). Summarize your results via a table with the following structure:**

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import time
import sklearn

In [2]:
patient_data = pd.read_csv('./PatientAnalyticFile.csv')

In [3]:
patient_data['mortality'] = np.where(patient_data['DateOfDeath'].isnull(), 0, 1)

patient_data['DateOfBirth'] = pd.to_datetime(patient_data['DateOfBirth'])
patient_data['Age_years'] = ((pd.to_datetime('2015-01-01') - patient_data['DateOfBirth']).dt.days/365.25)

vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth',
               'Last_Appointment_Date', 'DateOfDeath', 'mortality']
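# Sort the predictor names so the formula (and the design-matrix column order)
# is the same on every run; iterating over a set has no guaranteed order
vars_left = sorted(set(patient_data.columns) - set(vars_remove))
formula = "mortality ~ " + " + ".join(vars_left)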

In [7]:
Y, X = dmatrices(formula, patient_data)

# Split into training and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, np.ravel(Y),
    test_size=0.20,
    random_state=42)

In [8]:
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

In [9]:
results = {
    'Solver': [],
    'Training Accuracy': [],
    'Holdout Accuracy': [],
    'Time Taken (seconds)': []
}

In [13]:
for solver in solvers:
    start_time = time.time()

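    # liblinear does not support penalty=None, so it is fitted with an L2 penalty
    # at the default C=1.0; unlike the other four solvers, this model is regularized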
    if solver == 'liblinear':
        clf = LogisticRegression(solver=solver, penalty='l2', random_state=42)
    else:
        clf = LogisticRegression(solver=solver, penalty=None, random_state=42)

    # Fit the model
    clf.fit(X_train, y_train)

    # Calculate time taken
    time_taken = time.time() - start_time

    # Calculate accuracies
    train_accuracy = accuracy_score(y_train, clf.predict(X_train))
    test_accuracy = accuracy_score(y_test, clf.predict(X_test))

    # Store results
    results['Solver'].append(solver)
    results['Training Accuracy'].append(train_accuracy)
    results['Holdout Accuracy'].append(test_accuracy)
    results['Time Taken (seconds)'].append(time_taken)

    print(f"Completed solver: {solver}")

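# `results` is created in an earlier cell, so re-running this cell
# appends another set of five rows to the table
results_df = pd.DataFrame(results)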
print(results_df)

Completed solver: newton-cg
Completed solver: lbfgs
Completed solver: liblinear
Completed solver: sag
Completed solver: saga
      Solver  Training Accuracy  Holdout Accuracy  Time Taken (seconds)
0  newton-cg           0.748062           0.73550              0.475472
1      lbfgs           0.747812           0.73575              0.515028
2  liblinear           0.747938           0.73625              0.138068
3        sag           0.748000           0.73600              1.825584
4       saga           0.748437           0.73500              1.213396
5  newton-cg           0.748062           0.73550              0.136485
6      lbfgs           0.747812           0.73575              0.212048
7  liblinear           0.747938           0.73625              0.141457
8        sag           0.748000           0.73600              2.741828
9       saga           0.748437           0.73500              4.063922




**Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?**



Ranking the solvers calls for more than one criterion, because each metric captures a different aspect of performance. Taking all three metrics together, liblinear gave the best results. It had the highest holdout accuracy (0.73625), which is the most important metric because it shows how well the model performs on new data, although the spread across the five solvers is only 0.00125 (0.73500 to 0.73625). One caveat is that liblinear does not accept penalty=None, so it was fitted with an L2 penalty and is not strictly an unregularized model.
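To put the holdout differences in perspective, a rough yardstick is the binomial standard error of an accuracy estimate. The sketch below uses the holdout accuracies from the table; the holdout size of 4,000 is an assumption inferred from the 0.00025 steps in those accuracies, and `accuracy_standard_error` is a hypothetical helper name.

```python
import math


def accuracy_standard_error(accuracy, n_obs):
    """Binomial standard error of an accuracy estimated on n_obs observations."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_obs)


# Assumption: 4,000 holdout observations (not stated in the notebook output)
n_holdout = 4000

# Holdout accuracies from the results table (first run)
holdout = {
    "newton-cg": 0.73550,
    "lbfgs": 0.73575,
    "liblinear": 0.73625,
    "sag": 0.73600,
    "saga": 0.73500,
}

best = max(holdout, key=holdout.get)
spread = max(holdout.values()) - min(holdout.values())
se = accuracy_standard_error(holdout[best], n_holdout)

print(f"best solver by holdout accuracy: {best}")
print(f"spread across solvers: {spread:.5f}")
print(f"approx. standard error of one accuracy: {se:.5f}")
```

Under that assumption the spread between the best and worst solver is well below one standard error, so the accuracy ranking alone is weak evidence. Because all models are scored on the same holdout rows, a paired comparison would be more precise than this simple check.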
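In the first run, liblinear finished in 0.14 seconds, roughly 3.4 to 3.7 times faster than newton-cg (0.48 s) and lbfgs (0.52 s), about 9 times faster than saga (1.21 s), and about 13 times faster than sag (1.83 s). The repeated run (rows 5 to 9 of the table) shows that single timings are noisy: newton-cg matched liblinear there (0.136 s versus 0.141 s), while sag and saga remained the slowest at 2.7 s and 4.1 s. Fast fitting matters most when datasets are large or models are retrained repeatedly.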
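Since single wall-clock measurements vary from run to run, one option is to time each fit several times and report the median. The following is a minimal sketch using only the standard library; `time_call` and `workload` are hypothetical names, and the workload is a stand-in rather than a model fit.

```python
import statistics
import time


def time_call(func, repeats=5):
    """Return the median wall-clock time in seconds over `repeats` calls of func()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        func()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def workload():
    """Stand-in computation (an assumption; replace with a model fit)."""
    return sum(i * i for i in range(50_000))


median_seconds = time_call(workload, repeats=5)
print(f"median time over 5 runs: {median_seconds:.6f} s")
```

In the notebook, the same helper could be called with something like `lambda: clf.fit(X_train, y_train)` for each solver, so that the ranking by speed does not hinge on one measurement.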
The saga solver had the highest training accuracy (0.748437), but the advantage is tiny and did not carry over to the holdout set, where saga ranked last (0.73500). Its extra computation therefore bought no generalization benefit.
The sag and newton-cg solvers fall in the middle on holdout accuracy (0.73600 and 0.73550) but differ in cost, with sag several times slower. The sag solver also raised a convergence warning because it hit the maximum number of iterations before fully converging, so its coefficients may not be fully optimized.
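Convergence warnings from sag and saga are commonly addressed by standardizing the features and raising `max_iter`, since these solvers converge fastest when features are on similar scales. The sketch below shows that pattern on synthetic data with deliberately mismatched feature scales. The data, the `max_iter` value, and the use of scikit-learn's default L2 penalty are all assumptions for illustration and do not reproduce the notebook's models.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with very different feature scales (an assumption)
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=0
)
X = X * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

max_iter = 1000  # higher than scikit-learn's default of 100

# Standardize inside a pipeline so the scaler is fitted on training data only;
# the default L2 penalty is kept here for simplicity
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", max_iter=max_iter, random_state=0),
)

# Record any warnings so a ConvergenceWarning can be detected explicitly
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(X, y)

n_iter = int(model.named_steps["logisticregression"].n_iter_[0])
converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
print(f"epochs used: {n_iter} of {max_iter}, converged: {converged}")
```

Checking `n_iter_` against `max_iter` gives a quick way to confirm whether a solver stopped because it converged or because it ran out of iterations.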
Overall, liblinear is the best choice for this logistic regression task because it offers the best combination of holdout accuracy and computational efficiency.