# Week 09 - Machine Learning with Scikit-learn

# Srija Velumula

# Question 1

Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.

# Answer

Logistic: Train accuracy 0.7333, Test accuracy 0.718

Null: Train accuracy 0.6467, Test accuracy 0.608

Logistic_L1_C_1: Train accuracy 0.732, Test accuracy 0.716

Logistic_L1_C_01: Train accuracy 0.726, Test accuracy 0.706

Logistic_L1_C_10: Train accuracy 0.7347, Test accuracy 0.718

Logistic_L1_C_auto: Train accuracy 0.7233, Test accuracy 0.708

Logistic_SL1_C_auto: Train accuracy 0.7307, Test accuracy 0.714

RandomForest_noCV: Train accuracy 0.9993, Test accuracy 0.686

The best model is one that balances training and test accuracy without overfitting. RandomForest_noCV overfits severely: its training accuracy is almost perfect (0.9993), yet its test accuracy (0.686) is lower than that of every model except the Null baseline, which indicates poor generalization.

The Logistic_L1_C_10 model (logistic regression with an L1 penalty and C=10) performed best. Its test accuracy of 0.718 ties the base Logistic model for the highest value, and its training accuracy of 0.7347 is slightly higher than the base model's 0.7333. The small gap between its training and test accuracy suggests a good balance between model complexity and generalization.

Because C=10 corresponds to a weak L1 penalty, the regularization did not compromise generalization here. The two models are nearly indistinguishable on test accuracy, and Logistic_L1_C_10 has only a slight edge on training accuracy.
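
The comparison above can be reproduced with a few lines of plain Python. This is a minimal sketch that only uses the accuracies listed in this answer; the helper name `summarize` is hypothetical and not part of the course notebook. It sorts the models by test accuracy and reports the train-test gap as a rough overfitting indicator. It does not break the tie at 0.718 between Logistic and Logistic_L1_C_10.

```python
# Accuracies copied from the list above: (train accuracy, test accuracy).
results = {
    "Logistic": (0.7333, 0.718),
    "Null": (0.6467, 0.608),
    "Logistic_L1_C_1": (0.732, 0.716),
    "Logistic_L1_C_01": (0.726, 0.706),
    "Logistic_L1_C_10": (0.7347, 0.718),
    "Logistic_L1_C_auto": (0.7233, 0.708),
    "Logistic_SL1_C_auto": (0.7307, 0.714),
    "RandomForest_noCV": (0.9993, 0.686),
}


def summarize(results):
    """Return (name, train, test, gap) rows sorted by test accuracy, highest first."""
    rows = []
    for name, (train_acc, test_acc) in results.items():
        gap = round(train_acc - test_acc, 4)  # large gap suggests overfitting
        rows.append((name, train_acc, test_acc, gap))
    rows.sort(key=lambda row: -row[2])  # stable sort: ties keep listing order
    return rows


ranked = summarize(results)
for name, train_acc, test_acc, gap in ranked:
    print(f"{name:<20} train={train_acc:.4f} test={test_acc:.4f} gap={gap:.4f}")
```

The gap column makes the RandomForest_noCV overfitting visible at a glance, while the logistic variants all show gaps of roughly 0.02 or less.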



# Question 2 and 3


Next, fit a series of logistic regression models, without regularization. Each model should use the same set of predictors (all of the relevant predictors in the dataset) and should use the entire dataset, rather than a fraction of it. Use a randomly chosen 80% proportion of observations for training and the remaining for checking the generalizable performance (i.e., performance on the holdout subset). Be sure to ensure that the training and holdout subsets are identical across all models. Each model should choose a different solver.

Compare the results of the models in terms of their accuracy (use this as the performance metric to assess generalizability error on the holdout subset) and the time taken (use appropriate timing function). Summarize your results via a table with the following structure:


| Solver used | Training subset accuracy | Holdout subset accuracy | Time taken |
|---|---|---|---|

In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from patsy import dmatrices
from sklearn.metrics import accuracy_score


In [2]:
df_patient = pd.read_csv('PatientAnalyticFile.csv')

# Create mortality variable
df_patient['mortality'] = np.where(df_patient['DateOfDeath'].isnull(), 0, 1)

# Calculate age
df_patient['DateOfBirth'] = pd.to_datetime(df_patient['DateOfBirth'])
df_patient['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days/365.25)

# Create formula for all variables in model
vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth',
               'Last_Appointment_Date', 'DateOfDeath', 'mortality']
# Sort the names so the predictor order in the formula is the same on every run
vars_left = sorted(set(df_patient.columns) - set(vars_remove))
formula = "mortality ~ " + " + ".join(vars_left)

In [3]:
Y, X = dmatrices(formula, df_patient)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, np.ravel(Y),
    test_size=0.2,
    random_state=42
)

# List of solvers to evaluate
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

In [6]:
def evaluate_solver(solver_name, X_train, X_test, y_train, y_test):
    start_time = time.time()

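    if solver_name == 'liblinear':
        # liblinear does not support penalty=None, so this fit uses an L2
        # penalty with the default C=1.0 and is not strictly unregularized.
        model = LogisticRegression(solver=solver_name, penalty='l2', random_state=42)
    elif solver_name in ('sag', 'saga'):
        # Allow sag and saga more iterations than the default of 100.
        model = LogisticRegression(solver=solver_name, penalty=None, random_state=42, max_iter=1000)
    else:
        # newton-cg and lbfgs run unpenalized with the default max_iter=100.
        model = LogisticRegression(solver=solver_name, penalty=None, random_state=42)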

    # Fit model
    model.fit(X_train, y_train)

    # Calculate time
    elapsed_time = time.time() - start_time

    # Calculate accuracies
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))

    return {
        'Solver': solver_name,
        'Training Accuracy': train_accuracy,
        'Holdout Accuracy': test_accuracy,
        'Time Taken (seconds)': elapsed_time
    }


In [7]:
results = []
for solver in solvers:
    result = evaluate_solver(solver, X_train, X_test, y_train, y_test)
    results.append(result)

# Create and display results table
results_df = pd.DataFrame(results)
results_df = results_df[['Solver', 'Training Accuracy', 'Holdout Accuracy', 'Time Taken (seconds)']]
print(results_df.to_string(index=False))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


   Solver  Training Accuracy  Holdout Accuracy  Time Taken (seconds)
newton-cg           0.748062           0.73550              0.098839
    lbfgs           0.748125           0.73575              0.204017
liblinear           0.747938           0.73625              0.062977
      sag           0.747938           0.73575              2.989801
     saga           0.748000           0.73600              5.766594


# Question 4

Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?


Accuracy varies very little across solvers. Training accuracy ranges from 0.747938 (liblinear and sag) to 0.748125 (lbfgs), a difference of only 0.000187. Holdout accuracy ranges from 0.73550 (newton-cg) to 0.73625 (liblinear), a difference of 0.00075.
Execution time, in contrast, varies widely. liblinear finished in 0.062977 seconds, which makes it about 1.6 times faster than newton-cg, 3.2 times faster than lbfgs, 47.5 times faster than sag, and 91.6 times faster than saga.
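
The spreads and speed ratios quoted above follow directly from the results table. The sketch below recomputes them from the printed values; the names `table`, `spread`, and `slowdown` are illustrative only and do not appear in the notebook.

```python
# Values copied from the results table: (train accuracy, holdout accuracy, seconds).
table = {
    "newton-cg": (0.748062, 0.73550, 0.098839),
    "lbfgs": (0.748125, 0.73575, 0.204017),
    "liblinear": (0.747938, 0.73625, 0.062977),
    "sag": (0.747938, 0.73575, 2.989801),
    "saga": (0.748000, 0.73600, 5.766594),
}


def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


train_spread = spread([row[0] for row in table.values()])
holdout_spread = spread([row[1] for row in table.values()])

# Express every solver's run time as a multiple of the fastest solver's time.
fastest = min(table, key=lambda solver: table[solver][2])
slowdown = {solver: table[solver][2] / table[fastest][2] for solver in table}

print(f"training accuracy spread: {train_spread:.6f}")
print(f"holdout accuracy spread:  {holdout_spread:.6f}")
for solver, ratio in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{solver:<10} {ratio:6.1f}x the run time of {fastest}")
```
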
I recommend liblinear as the best solver for this dataset and task. I considered all three metrics, but execution time was decisive because the accuracy values were nearly identical. liblinear combined the highest holdout accuracy (0.73625) with the shortest run time (0.062977 seconds). One caveat is that liblinear was fit with an L2 penalty, because it does not support penalty=None, so its result is not strictly comparable with the unregularized fits.
liblinear would remain the first choice even if holdout accuracy were the only criterion, since its holdout accuracy is marginally the highest. In practice, balancing accuracy against efficiency matters most when datasets are large or models are retrained frequently.
newton-cg is my second choice because it was the next fastest solver with comparable accuracy, and lbfgs ranks third on the same time and accuracy considerations.
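
Each time in the table comes from a single run measured with `time.time()`, so sub-second differences such as liblinear versus newton-cg may partly reflect noise. A hedged sketch of a more robust measurement is shown below: it repeats the call several times with `time.perf_counter()` and reports the median and the best run. The helper `time_callable` and the stand-in `workload` are hypothetical; in the notebook the timed call would be something like `lambda: model.fit(X_train, y_train)`.

```python
import statistics
import time


def time_callable(fn, repeats=5):
    """Call fn() `repeats` times and return (median_seconds, best_seconds)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # high-resolution clock intended for timing
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), min(durations)


def workload():
    """Stand-in for a model fit; replace with the real call when timing solvers."""
    sum(i * i for i in range(50_000))


median_s, best_s = time_callable(workload, repeats=5)
print(f"median={median_s:.6f}s  best={best_s:.6f}s")
```

Reporting the median rather than one measurement reduces the influence of a single slow or fast run on the solver ranking.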