# Sowmya Uppaluru

# Week 09 - Machine Learning with Scikit-learn

For this week’s assignment, you are required to investigate the accuracy-computation time tradeoffs of the different optimization algorithms (solvers) that are available for fitting logistic regression models in Scikit-Learn. Using the code shared via the Python notebook (part of this week’s uploads archive) where the use of logistic regression was demonstrated, complete the following operations:

1. Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.

The RandomForest models posted high scores in the results summary, but RandomForest_noCV overfit badly: its training score of 0.9993 fell to 0.686 on the test set.
Logistic_L1_C_10 (logistic regression with an L1 penalty and C=10) gave the most balanced results, with a training accuracy of 0.7347 and a test accuracy of 0.718. The small gap between the two indicates that the model generalizes well.
The shared notebook output does not include complete performance metrics for the tuned RandomForest models that used cross-validation (RandomForest_CV and RandomForest_CV2).
The plain "Logistic" model, a simple logistic regression, also reached a test accuracy of 0.718.
The null model, which always predicts the most common class, reached 0.6467 accuracy on the training data and 0.608 on the test data.
Among the fully reported models, Logistic_L1_C_10 has the best overall performance: it reaches 0.718 test accuracy, well above the null model's 0.608, and its training and test results agree closely. The L1 penalty with C=10 appears to regularize the model enough to avoid overfitting without pushing it into underfitting.
Because the complete metrics for the cross-validated RandomForest models are missing from the provided notebook output, it remains possible that one of them would have performed better after proper tuning.
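
The comparison above can be made concrete with a short sketch that computes the train-test gap for each fully reported model. The accuracies are the ones quoted in the answer; the dictionary layout and the `gap` helper are illustrative assumptions and are not part of the shared notebook.

```python
# Accuracies as quoted in the answer above: (training, test).
reported = {
    "RandomForest_noCV": (0.9993, 0.686),
    "Logistic_L1_C_10": (0.7347, 0.718),
    "Null model": (0.6467, 0.608),
}


def gap(scores):
    """Return training accuracy minus test accuracy (hypothetical helper)."""
    train, test = scores
    return round(train - test, 4)


gaps = {name: gap(scores) for name, scores in reported.items()}

# Rank by test accuracy first (higher is better), then by the smaller gap.
ranking = sorted(reported, key=lambda m: (-reported[m][1], gaps[m]))

for name in ranking:
    train, test = reported[name]
    print(f"{name:<20} train={train:.4f} test={test:.4f} gap={gaps[name]:.4f}")
```

A large positive gap, as for RandomForest_noCV, is the usual sign of overfitting, while a gap near zero together with a test accuracy above the null model supports the choice made here.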

In [3]:
import os
import numpy as np
import pandas as pd
import time
from patsy import dmatrices
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df_patient = pd.read_csv('./PatientAnalyticFile.csv')
df_patient['mortality'] = np.where(df_patient['DateOfDeath'].isnull(), 0, 1)
df_patient['DateOfBirth'] = pd.to_datetime(df_patient['DateOfBirth'])
df_patient['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days/365.25)
df_patient.head()

Unnamed: 0,PatientID,DateOfBirth,Gender,Race,Myocardial_infarction,Congestive_heart_failure,Peripheral_vascular_disease,Stroke,Dementia,Pulmonary,...,Obesity,Depression,Hypertension,Drugs,Alcohol,First_Appointment_Date,Last_Appointment_Date,DateOfDeath,mortality,Age_years
0,1,1962-02-27,female,hispanic,0,0,0,0,0,0,...,0,0,0,0,0,2013-04-27,2018-06-01,,0,52.843258
1,2,1959-08-18,male,white,0,0,0,0,0,0,...,0,0,1,0,0,2005-11-30,2008-11-02,2008-11-02,1,55.373032
2,3,1946-02-15,female,white,0,0,0,0,0,0,...,0,0,1,0,0,2011-11-05,2015-11-13,,0,68.876112
3,4,1979-07-27,female,white,0,0,0,0,0,1,...,0,0,0,0,0,2010-03-01,2016-01-17,2016-01-17,1,35.433265
4,5,1983-02-19,female,hispanic,0,0,0,0,0,0,...,0,0,1,0,0,2006-09-22,2018-06-01,,0,31.865845


In [4]:
vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth',
               'Last_Appointment_Date', 'DateOfDeath', 'mortality']
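# Sort the remaining columns so the formula (and the design-matrix column order) is reproducible
vars_left = sorted(set(df_patient.columns) - set(vars_remove))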
formula = "mortality ~ " + " + ".join(vars_left)

In [6]:
Y, X = dmatrices(formula, df_patient)
X_train, X_test, y_train, y_test = train_test_split(
    X, np.ravel(Y), test_size=0.2, random_state=42
)

solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
results = {
    'Solver': [],
    'Training Accuracy': [],
    'Holdout Accuracy': [],
    'Time Taken (seconds)': []
}


In [7]:
for solver in solvers:
    print(f"Fitting model with solver: {solver}")

    # All five solvers support the L2 penalty, so use it throughout
    penalty = 'l2'

    # Initialize model
    model = LogisticRegression(
        solver=solver,
        penalty=penalty,
        max_iter=1000,
        random_state=42
    )

    # Time the fitting process (perf_counter is a monotonic, high-resolution clock)
    start_time = time.perf_counter()
    model.fit(X_train, y_train)
    end_time = time.perf_counter()

    # Calculate training and testing accuracy
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    results['Solver'].append(solver)
    results['Training Accuracy'].append(round(train_accuracy, 4))
    results['Holdout Accuracy'].append(round(test_accuracy, 4))
    results['Time Taken (seconds)'].append(round(end_time - start_time, 4))
results_df = pd.DataFrame(results)
print(results_df)

Fitting model with solver: newton-cg
Fitting model with solver: lbfgs
Fitting model with solver: liblinear
Fitting model with solver: sag
Fitting model with solver: saga
      Solver  Training Accuracy  Holdout Accuracy  Time Taken (seconds)
0  newton-cg             0.7482            0.7362                0.1623
1      lbfgs             0.7482            0.7360                0.7087
2  liblinear             0.7479            0.7362                0.1152
3        sag             0.7481            0.7362               17.6033
4       saga             0.7480            0.7362               24.9559




4. Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?


Holdout accuracy is the most important metric for generalizing to new data, and on it the solvers are effectively tied: "newton-cg", "liblinear", "sag", and "saga" all scored 0.7362, while "lbfgs" scored 0.7360. Because holdout performance is equivalent, the ranking has to rest on other characteristics.
Execution time is where the solvers differ substantially. "liblinear" was fastest at 0.1152 seconds, followed by "newton-cg" at 0.1623 seconds and "lbfgs" at 0.7087 seconds. "sag" took 17.6033 seconds and "saga" took 24.9559 seconds, roughly 153 and 217 times slower than "liblinear", respectively.
Training accuracy did not separate the solvers either: all values fell between 0.7479 and 0.7482, a spread of only 0.0003.
"liblinear" is therefore the best choice for this classification problem: it matched the highest holdout accuracy and had the shortest fitting time. The ranking uses holdout accuracy first and execution time as the tie-breaker; training accuracy was too similar across solvers to affect it.
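
The ranking described above can be reproduced with a small sketch that sorts the solvers by holdout accuracy and then by fitting time, and reports each solver's slowdown relative to the fastest one. The numbers are copied from the results table printed earlier; the variable names `ranked` and `slowdown` are illustrative assumptions.

```python
# Results as printed in the solver comparison table:
# solver -> (training accuracy, holdout accuracy, fit time in seconds).
results = {
    "newton-cg": (0.7482, 0.7362, 0.1623),
    "lbfgs": (0.7482, 0.7360, 0.7087),
    "liblinear": (0.7479, 0.7362, 0.1152),
    "sag": (0.7481, 0.7362, 17.6033),
    "saga": (0.7480, 0.7362, 24.9559),
}

# Rank by holdout accuracy (higher is better), then by fit time (lower is better).
ranked = sorted(results, key=lambda s: (-results[s][1], results[s][2]))

# Slowdown of each solver relative to the fastest fit time.
fastest_time = min(fit_time for _, _, fit_time in results.values())
slowdown = {s: round(results[s][2] / fastest_time, 1) for s in results}

for solver in ranked:
    _, holdout, fit_time = results[solver]
    print(f"{solver:<10} holdout={holdout:.4f} time={fit_time:>8.4f}s "
          f"slowdown={slowdown[solver]:>6.1f}x")
```

A strict sort places "lbfgs" last because of its 0.0002 lower holdout accuracy, a difference that is too small to be meaningful on a single train-test split. The timings also come from a single run, so repeating the fits would be needed before treating the small gap between "liblinear" and "newton-cg" as reliable.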
