Week 09 - Machine Learning with Scikit-learn




Hemavathi Karuppaiah

1. Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.



Comparing training and test accuracy across the notebook's classification models identifies the best performer. The standard Logistic Regression model and the Logistic Regression with L1 penalty (C=10) achieved the same test accuracy of 0.718, the highest of any model, so they generalized best to new data. Both struck the best balance between model complexity and performance on the patient mortality dataset.

The unvalidated RandomForest model reached a training accuracy of 0.9993 but only 0.686 on the test data. This large gap between training and test accuracy indicates that the model memorized the training data instead of learning patterns that generalize. It is therefore a poor choice despite being a more sophisticated algorithm. In comparison, the regularized logistic regression models performed consistently, with similar accuracy on the training and test sets.
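The overfitting argument can be made concrete by computing the gap between training and test accuracy for each model. The sketch below uses only the accuracy figures quoted in this response; the dictionary layout and the helper name `generalization_gap` are illustrative assumptions, not part of the notebook.

```python
# (training accuracy, test accuracy) as quoted in this response.
reported = {
    "Logistic Regression": (0.7333, 0.718),
    "Logistic Regression, L1 penalty, C=10": (0.7347, 0.718),
    "RandomForest (unvalidated)": (0.9993, 0.686),
}


def generalization_gap(train_acc, test_acc):
    """Return training accuracy minus test accuracy (hypothetical helper)."""
    return train_acc - test_acc


# A small gap suggests the model generalizes; a large gap suggests overfitting.
gaps = {
    name: round(generalization_gap(train_acc, test_acc), 4)
    for name, (train_acc, test_acc) in reported.items()
}

# Print the models from smallest to largest gap.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: train - test = {gap:.4f}")
```

On these figures the two logistic regression models have gaps under 0.02, while the RandomForest gap is above 0.3, which matches the overfitting reading given above.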

The Logistic Regression with L1 penalty and C=10 edges out the standard Logistic Regression because of its slightly higher training accuracy (0.7347 vs. 0.7333) at an identical test accuracy. L1 regularization can shrink some coefficients to exactly zero, so at this penalty strength it may have performed some feature selection while maintaining generalization. The model performs reliably and stably on both metrics, which makes it the top classification model among those tested in the notebook.


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from patsy import dmatrices
import time

# Load the data
df_patients_full = pd.read_csv('./PatientAnalyticFile.csv')

# Creating mortality variable
df_patients_full['mortality'] = np.where(df_patients_full['DateOfDeath'].isnull(), 0, 1)

# Convert DateOfBirth to datetime and calculate age in years as of 2015-01-01
df_patients_full['DateOfBirth'] = pd.to_datetime(df_patients_full['DateOfBirth'])
df_patients_full['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patients_full['DateOfBirth']).dt.days/365.25)

vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth',
              'Last_Appointment_Date', 'DateOfDeath', 'mortality']
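# Sort the remaining columns so the formula (and the column order of X) is reproducible across runs
vars_left = sorted(set(df_patients_full.columns) - set(vars_remove))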
formula = "mortality ~ " + " + ".join(vars_left)

In [2]:
# Create model matrices
Y, X = dmatrices(formula, df_patients_full)

# Split data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, np.ravel(Y),
    test_size=0.2,
    random_state=42
)

In [3]:
# List of solvers to test
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Results dictionary
results = {
    'Solver': [],
    'Training Accuracy': [],
    'Holdout Accuracy': [],
    'Time Taken (seconds)': []
}

In [4]:
for solver in solvers:
    start_time = time.time()
    # All five solvers support the L2 penalty, so it is used throughout
    penalty = 'l2'

    model = LogisticRegression(
        solver=solver,
        penalty=penalty,
        random_state=42,
        max_iter=1000
    )

    model.fit(X_train, y_train)

    end_time = time.time()
    time_taken = end_time - start_time

    # Making predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Calculating accuracy
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    # Store the results
    results['Solver'].append(solver)
    results['Training Accuracy'].append(train_accuracy)
    results['Holdout Accuracy'].append(test_accuracy)
    results['Time Taken (seconds)'].append(time_taken)

results_df = pd.DataFrame(results)
results_df['Training Accuracy'] = results_df['Training Accuracy'].map('{:.4f}'.format)
results_df['Holdout Accuracy'] = results_df['Holdout Accuracy'].map('{:.4f}'.format)
results_df['Time Taken (seconds)'] = results_df['Time Taken (seconds)'].map('{:.4f}'.format)
print(results_df)



      Solver Training Accuracy Holdout Accuracy Time Taken (seconds)
0  newton-cg            0.7482           0.7362               0.1974
1      lbfgs            0.7482           0.7360               0.8815
2  liblinear            0.7479           0.7362               0.1517
3        sag            0.7481           0.7362              16.9453
4       saga            0.7480           0.7362              12.7136




4. Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?


Choosing the "best" solver requires weighing several performance metrics from the results above. Holdout accuracy was nearly identical across solvers, ranging from 0.7360 to 0.7362, so all of them generalize about equally well. Training accuracy was likewise nearly identical, ranging from 0.7479 to 0.7482, so every solver optimized the logistic regression about equally well.

Because accuracy is effectively tied, execution time becomes the deciding factor in ranking the solvers. The liblinear solver was the fastest at 0.1517 seconds, followed by newton-cg at 0.1974 seconds and lbfgs at 0.8815 seconds. The sag and saga solvers took far longer, 16.9453 and 12.7136 seconds respectively, without improving accuracy.
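The times above come from a single run per solver, and single wall-clock measurements can be noisy. One way to make the comparison more robust, sketched here as an assumption and not as something the notebook does, is to repeat each fit several times and take the median. The helper `median_runtime` and the stand-in workload `placeholder_fit` are hypothetical names used only for illustration.

```python
import statistics
import time


def median_runtime(fn, repeats=5):
    """Call fn() `repeats` times and return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def placeholder_fit():
    """Stand-in workload (hypothetical). In the notebook this would be the
    model.fit(X_train, y_train) call for one solver."""
    sum(i * i for i in range(10_000))


elapsed = median_runtime(placeholder_fit, repeats=5)
print(f"median fit time: {elapsed:.6f} s")
```

With gaps as large as those between liblinear and sag, repeated timing is unlikely to change the ranking, but it would help separate liblinear from newton-cg, whose single-run times differ by only a few hundredths of a second.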

The liblinear solver is the best choice because it matched the predictive accuracy of the other solvers at the lowest computational cost. The ranking therefore used all three measures: holdout accuracy first, training accuracy as a consistency check, and execution time as the tie-breaker. The efficiency advantage shown by liblinear would matter in real-life deployments that involve large datasets or models that need frequent retraining. When prediction metrics are equivalent, choosing the most efficient approach conserves resources without sacrificing accuracy.
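The ranking rule described in this answer (compare holdout accuracy first, then break ties by execution time) can be written out as a short sketch. The numbers are copied from the results table; the tolerance of 0.001 for treating holdout accuracies as tied is an assumption chosen for illustration, not a value from the notebook.

```python
# (solver, training accuracy, holdout accuracy, seconds) from the results table.
results = [
    ("newton-cg", 0.7482, 0.7362, 0.1974),
    ("lbfgs", 0.7482, 0.7360, 0.8815),
    ("liblinear", 0.7479, 0.7362, 0.1517),
    ("sag", 0.7481, 0.7362, 16.9453),
    ("saga", 0.7480, 0.7362, 12.7136),
]

# Assumed threshold: holdout accuracies within this margin count as tied.
TOLERANCE = 0.001

# Step 1: keep every solver whose holdout accuracy is within the tolerance of the best.
best_holdout = max(row[2] for row in results)
contenders = [row for row in results if best_holdout - row[2] <= TOLERANCE]

# Step 2: among the tied solvers, rank by execution time (fastest first).
ranked = sorted(contenders, key=lambda row: row[3])
best_solver = ranked[0][0]

for solver, train_acc, holdout_acc, seconds in ranked:
    print(f"{solver:10s} holdout={holdout_acc:.4f} time={seconds:.4f}s")
print("selected:", best_solver)
```

Under this tolerance all five solvers survive the accuracy filter, so execution time alone decides the order and liblinear is selected. A tighter tolerance would drop lbfgs at the first step without changing the winner.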