Harika Pamulapati

Week 09 - Machine Learning with Scikit-learn

1. Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.


Among the models evaluated in the Python notebook, the standard Logistic Regression model and the Logistic Regression model with an L1 penalty (C=10) performed best. Both reached a test accuracy of 0.718, the highest of any model examined. The Random Forest without cross-validation reached a training accuracy of 0.9993 but a test accuracy of only 0.686, a gap that indicates strong overfitting to the training data.

The Logistic Regression with L1 penalty (C=1) reached a test accuracy of 0.716, and the Logistic Regression with standard scaling and an L1 penalty reached 0.714. The models with smaller C values (stronger regularization) and with automatically selected C performed worse, which suggests that this dataset benefits from weaker regularization.

The classification models outperformed the null baseline, which reaches a test accuracy of 0.608 by always predicting the majority class. This improvement indicates that the dataset's features carry information that helps classify patients by mortality risk.
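The comparison against a null baseline can be sketched as follows. This is a minimal illustration, not the notebook's own code: it assumes synthetic stand-in data rather than the patient file, and uses scikit-learn's `DummyClassifier` with `strategy="most_frequent"` as the majority-class predictor.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the patient dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0.3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Null model: always predicts the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# .score() returns accuracy for classifiers
baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)

print(f"Null baseline accuracy:       {baseline_acc:.3f}")
print(f"Logistic regression accuracy: {model_acc:.3f}")
```

A model is only informative to the extent that its holdout accuracy exceeds the baseline's.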

Model evaluation should weigh ease of interpretation alongside predictive accuracy. Logistic Regression coefficients can be read directly to identify important features, whereas Random Forest models often perform better on other datasets but are harder to interpret. Here, the simple Logistic Regression models offered the best combination of accuracy and generalization.
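To illustrate what coefficient interpretation looks like in practice, the sketch below fits a logistic regression to synthetic data and lists each coefficient with its odds ratio. The feature names and the data-generating process are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical feature names and synthetic data (assumptions for illustration)
feature_names = ["feature_a", "feature_b", "feature_c"]
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
logits = 1.5 * X[:, 0] - 0.5 * X[:, 1]  # feature_c has no true effect
y = (rng.random(2000) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-unit increase in the feature, holding the others fixed
coef_table = pd.DataFrame({
    "feature": feature_names,
    "coefficient": model.coef_[0],
    "odds_ratio": np.exp(model.coef_[0]),
}).sort_values("coefficient", key=np.abs, ascending=False)

print(coef_table)
```

Coefficient magnitudes are only comparable across features when the features are on similar scales, which is one reason to standardize predictors before comparing them.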

2. Next, fit a series of logistic regression models, without regularization. Each model should use the same set of predictors (all of the relevant predictors in the dataset) and should use the entire dataset, rather than a fraction of it. Use a randomly chosen 80% proportion of observations for training and the remaining for checking the generalizable performance (i.e., performance on the holdout subset). Be sure to ensure that the training and holdout subsets are identical across all models. Each model should choose a different solver.

3. Compare the results of the models in terms of their accuracy (use this as the performance metric to assess generalizability error on the holdout subset) and the time taken (use appropriate timing function). Summarize your results via a table with the following structure:

In [1]:
import numpy as np
import pandas as pd
import time
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
df_patient = pd.read_csv('PatientAnalyticFile.csv')

df_patient['mortality'] = np.where(df_patient['DateOfDeath'].isnull(), 0, 1)

df_patient['DateOfBirth'] = pd.to_datetime(df_patient['DateOfBirth'])
df_patient['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days/365.25)

vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth',
               'Last_Appointment_Date', 'DateOfDeath', 'mortality']
vars_left = set(df_patient.columns) - set(vars_remove)
formula = "mortality ~ " + " + ".join(vars_left)

In [3]:
Y, X = dmatrices(formula, df_patient)

X_train, X_test, y_train, y_test = train_test_split(
    X, np.ravel(Y), test_size=0.2, random_state=42)

solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

results = {
    'Solver': [],
    'Training Accuracy': [],
    'Holdout Accuracy': [],
    'Time Taken (seconds)': []
}

In [7]:
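# Note: `results` is initialized in an earlier cell, so re-running this cell
# appends new rows to the existing ones instead of starting over.
for solver in solvers:
    # Time model construction and fitting with a monotonic, high-resolution clock
    start_time = time.perf_counter()

    # liblinear does not support penalty=None, so approximate an unregularized
    # fit with a very weak L2 penalty (large C); the other solvers use penalty=None
    if solver == 'liblinear':
        model = LogisticRegression(solver=solver, penalty='l2', C=1e6, random_state=42, max_iter=1000)
    else:
        model = LogisticRegression(solver=solver, penalty=None, random_state=42, max_iter=1000)

    model.fit(X_train, y_train)

    # Calculate time taken
    time_taken = time.perf_counter() - start_time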

    # Calculate accuracies
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))

    # Store results
    results['Solver'].append(solver)
    results['Training Accuracy'].append(round(train_accuracy, 4))
    results['Holdout Accuracy'].append(round(test_accuracy, 4))
    results['Time Taken (seconds)'].append(round(time_taken, 4))

# Create and display results dataframe
results_df = pd.DataFrame(results)

print(results_df)

      Solver  Training Accuracy  Holdout Accuracy  Time Taken (seconds)
0  newton-cg             0.7481            0.7355                0.2995
1      lbfgs             0.7481            0.7360                0.5745
2  newton-cg             0.7481            0.7355                0.0813
3      lbfgs             0.7481            0.7360                0.2418
4  newton-cg             0.7481            0.7355                0.0835
5      lbfgs             0.7481            0.7360                0.2438
6  liblinear             0.7479            0.7362                0.0633
7        sag             0.7479            0.7358                2.4845
8       saga             0.7480            0.7360                4.0497


4. Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?



The liblinear solver yielded the best results. The ranking is based primarily on holdout accuracy, with execution time as the secondary criterion; training accuracy was nearly identical across solvers (0.7479 to 0.7481) and so did not help separate them. The liblinear solver had both the highest holdout accuracy (0.7362) and the shortest execution time (0.0633 seconds), which makes it the most efficient solver for this dataset.

The lbfgs solver reached nearly the same holdout accuracy as liblinear (0.7360 versus 0.7362), a difference of only 0.0002. However, lbfgs was slower, taking 0.2418 to 0.5745 seconds across runs compared with 0.0633 seconds for liblinear, so it was less efficient at essentially the same accuracy.
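Because single timings vary from run to run, as the lbfgs range shows, a more stable comparison repeats each fit and reports the median. The following is a minimal sketch under stated assumptions: it uses synthetic data, a hypothetical helper `median_fit_time`, and `C=1e6` for both solvers as an approximation to an unregularized fit.

```python
import statistics
import time

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the patient dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] - X[:, 1] + rng.normal(size=2000) > 0).astype(int)


def median_fit_time(solver, n_repeats=5):
    """Fit the model n_repeats times and return the median fit time in seconds."""
    times = []
    for _ in range(n_repeats):
        # Large C means a very weak L2 penalty (approximately unregularized)
        model = LogisticRegression(solver=solver, C=1e6, max_iter=1000)
        start = time.perf_counter()
        model.fit(X, y)
        times.append(time.perf_counter() - start)
    return statistics.median(times)


timings = {solver: median_fit_time(solver) for solver in ["lbfgs", "liblinear"]}
for solver, seconds in timings.items():
    print(f"{solver}: median fit time {seconds:.4f} s")
```

The median is less sensitive than the mean to a slow first run caused by one-off overhead.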

The newton-cg solver gave a consistent holdout accuracy of 0.7355 across runs but did not beat liblinear on either accuracy or speed. The sag and saga solvers were the slowest, at 2.4845 and 4.0497 seconds respectively, with no accuracy gain to offset the extra computational cost.

Practical model evaluation should consider computational efficiency as well as predictive accuracy. On both counts, liblinear is the best choice for this classification task.
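The ranking described above (holdout accuracy first, execution time as the tiebreaker) can be reproduced from the reported results. This sketch copies the rows of the results table, collapses repeated runs of the same solver, and sorts; the choice to average the times of repeated runs is an assumption made here for illustration.

```python
import pandas as pd

# Rows copied from the results table printed by the notebook
runs = pd.DataFrame({
    "Solver": ["newton-cg", "lbfgs", "newton-cg", "lbfgs", "newton-cg",
               "lbfgs", "liblinear", "sag", "saga"],
    "Holdout Accuracy": [0.7355, 0.7360, 0.7355, 0.7360, 0.7355,
                         0.7360, 0.7362, 0.7358, 0.7360],
    "Time Taken (seconds)": [0.2995, 0.5745, 0.0813, 0.2418, 0.0835,
                             0.2438, 0.0633, 2.4845, 4.0497],
})

# One row per solver: accuracy is identical across repeats, times are averaged
summary = (
    runs.groupby("Solver")
    .agg(holdout_accuracy=("Holdout Accuracy", "max"),
         mean_time=("Time Taken (seconds)", "mean"))
    # Highest holdout accuracy first; ties broken by shortest mean time
    .sort_values(["holdout_accuracy", "mean_time"], ascending=[False, True])
)

print(summary)
```

Under these criteria liblinear ranks first, followed by lbfgs and saga, which tie on holdout accuracy and are separated by time.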