Bhavya

Week 09 Assignment - Scikit Learn

Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.


According to the results in the Python notebook, the Random Forest model was the best-performing classification model. This conclusion rests on a comparison of training and testing accuracy across all of the models tested. The Random Forest model achieved higher accuracy than the logistic regression models, including those that used L1 (LASSO) regularization and those built as pipelines with feature scaling.

The RandomForest_noCV model reached a near-perfect training accuracy of 0.9993, but its testing accuracy was only 0.686. This large gap indicates severe overfitting: the model memorized the training data rather than generalizing from it. Even so, it surpassed several logistic regression models in testing accuracy.

Cross-validation with GridSearchCV further improved the generalization of the Random Forest model. The grid search tuned hyperparameters including the number of estimators and the maximum number of features, and the testing accuracy of the Random Forest model increased after this tuning. Performance improved further after two grid searches were run to optimize tree depth.
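The tuning step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the synthetic data from make_classification, the grid values, and the names X_demo, y_demo, and param_grid are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: the notebook's real dataset is not used here)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# Hypothetical grid over the hyperparameters named in the text
param_grid = {
    "n_estimators": [25, 50],
    "max_features": ["sqrt", None],
    "max_depth": [3, None],
}

# 3-fold cross-validation on the training data only; the test set stays untouched
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

# The best estimator is refit on the full training set, so score() uses it directly
train_acc = search.score(X_tr, y_tr)
test_acc = search.score(X_te, y_te)
print(search.best_params_)
print(round(train_acc, 4), round(test_acc, 4))
```

Comparing train_acc with test_acc for the tuned model shows whether the gap seen in the untuned forest has narrowed.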

The logistic regression models with L1 and L2 penalties and different values of C settled at training and testing accuracies of about 0.73 and 0.71, respectively. These models behaved stably and showed little overfitting, but their accuracy was lower than that of the Random Forest model.
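A comparison of this kind might look like the sketch below. It is illustrative only: the synthetic data, the particular C values, and the choice of the 'liblinear' solver (which supports both penalties) are assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

penalty_rows = []
for penalty in ("l1", "l2"):
    for C in (0.01, 1.0, 100.0):  # hypothetical C values; smaller C = stronger penalty
        # Scale features first so the penalty treats all coefficients comparably
        model = make_pipeline(
            StandardScaler(),
            LogisticRegression(penalty=penalty, C=C,
                               solver="liblinear", max_iter=1000),
        )
        model.fit(X_tr, y_tr)
        penalty_rows.append((penalty, C,
                             model.score(X_tr, y_tr),
                             model.score(X_te, y_te)))

for penalty, C, train_acc, test_acc in penalty_rows:
    print(f"{penalty}  C={C:<6}  train={train_acc:.4f}  test={test_acc:.4f}")
```

A small, consistent gap between the train and test columns is the pattern the text describes as stable behavior with little overfitting.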

Despite its overfitting risk, the Random Forest model surpassed all of the logistic regression models in accuracy. After cross-validated hyperparameter optimization, it achieved the highest accuracy on both training and testing data. Its strong predictive performance comes from combining many decision trees into a single ensemble.

Next, fit a series of logistic regression models, without regularization. Each model should use the same set of predictors (all of the relevant predictors in the dataset) and should use the entire dataset, rather than a fraction of it. Use a randomly chosen 80% proportion of observations for training and the remaining for checking the generalizable performance (i.e., performance on the holdout subset). Be sure to ensure that the training and holdout subsets are identical across all models. Each model should choose a different solver.

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import accuracy_score
import time

In [12]:
# Load data
df_patient = pd.read_csv('./PatientAnalyticFile.csv')

In [13]:
# Create mortality variable
df_patient['mortality'] = np.where(df_patient['DateOfDeath'].isnull(), 0, 1)

# Convert DateOfBirth to date and calculate age
df_patient['DateOfBirth'] = pd.to_datetime(df_patient['DateOfBirth'])
df_patient['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days/365.25)

# Create formula for all variables in model
vars_remove = ['PatientID','First_Appointment_Date','DateOfBirth',
              'Last_Appointment_Date','DateOfDeath','mortality']
# Keep the original column order so the formula is the same on every run
vars_left = [col for col in df_patient.columns if col not in vars_remove]
formula = "mortality ~ " + " + ".join(vars_left)

In [19]:
Y, X = dmatrices(formula, df_patient)

# Split Data into training and testing samples - using 80% for training
X_train, X_test, y_train, y_test = train_test_split(X,
                                                  np.ravel(Y),
                                                  test_size=0.2,
                                                  random_state=42)

# List to store one result dictionary per solver
results = []

# List of solvers to try
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

In [20]:
for solver in solvers:
    # Define model with specific solver
    if solver == 'liblinear':
        # liblinear doesn't support penalty=None
        clf = linear_model.LogisticRegression(fit_intercept=True,
                                              penalty='l2',
                                              solver=solver,
                                              C=1e10,
                                              max_iter=1000)
    else:
        clf = linear_model.LogisticRegression(fit_intercept=True,
                                              penalty=None,
                                              solver=solver,
                                              max_iter=1000)

    # Time the fitting process (perf_counter is a monotonic, high-resolution clock)
    start_time = time.perf_counter()
    clf.fit(X_train, y_train)
    fit_time = time.perf_counter() - start_time

    # Calculate training and test accuracy
    train_accuracy = accuracy_score(y_train, clf.predict(X_train))
    test_accuracy = accuracy_score(y_test, clf.predict(X_test))

    # Store results
    results.append({
        'Solver': solver,
        'Training Accuracy': train_accuracy,
        'Holdout Accuracy': test_accuracy,
        'Time (seconds)': fit_time
    })

# Create and display results table
results_df = pd.DataFrame(results)
results_df = results_df[['Solver', 'Training Accuracy', 'Holdout Accuracy', 'Time (seconds)']]
results_df['Training Accuracy'] = results_df['Training Accuracy'].round(4)
results_df['Holdout Accuracy'] = results_df['Holdout Accuracy'].round(4)
results_df['Time (seconds)'] = results_df['Time (seconds)'].round(2)

print(results_df.to_string(index=False))

   Solver  Training Accuracy  Holdout Accuracy  Time (seconds)
newton-cg             0.7481            0.7355            0.09
    lbfgs             0.7480            0.7358            0.25
liblinear             0.7479            0.7362            0.07
      sag             0.7479            0.7358            2.61
     saga             0.7480            0.7360            4.45


Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?


The 'liblinear' solver had the highest holdout accuracy at 0.7362. Holdout accuracy is the primary ranking criterion because it is the best available estimate of generalization ability. The other solvers scored between 0.7355 and 0.7360 on the holdout subset, so the largest gap is only 0.0007 (0.07 percentage points). A difference this small could still matter when mortality predictions are applied to a large patient population, but it may also reflect nothing more than sampling variation in the holdout subset.
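To put the 0.0007 gap in context, a rough binomial standard error for a single accuracy estimate can be computed as below. The holdout size is not reported in the output above, so n_holdout_assumed is a purely hypothetical value; the real figure should be substituted. The helper name accuracy_standard_error is also introduced only for this sketch.

```python
import math


def accuracy_standard_error(accuracy, n_holdout):
    """Approximate binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_holdout)


best_holdout = 0.7362   # 'liblinear', from the results table
worst_holdout = 0.7355  # 'newton-cg', from the results table
gap = best_holdout - worst_holdout

n_holdout_assumed = 4000  # hypothetical holdout size, NOT taken from the notebook

se = accuracy_standard_error(best_holdout, n_holdout_assumed)
print(f"gap = {gap:.4f}, approximate standard error = {se:.4f}")
```

Because every solver is scored on the same holdout rows, this is only a coarse yardstick; a paired comparison of the predictions would be more precise. Under the assumed sample size, though, the gap is much smaller than the standard error.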

The 'liblinear' solver was also the fastest, fitting in 0.07 seconds. It narrowly beat 'newton-cg' (0.09 seconds), was several times faster than 'lbfgs' (0.25 seconds), and was far faster than 'sag' (2.61 seconds) and 'saga' (4.45 seconds).

The 'liblinear' solver therefore gave the best results on this dataset, combining the highest holdout accuracy with the shortest fitting time. That combination suits real-world applications that need both fast processing and good accuracy. Training accuracy carried little weight in the ranking because holdout accuracy better reflects how the model performs on new data. For this problem the accuracy differences between solvers are small, whereas the choice of solver strongly affects computation time.
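The ranking rule described above (holdout accuracy first, fitting time as the tie-breaker) can be written out as a short sketch. The values are copied from the results table; the names solver_results, ranked, and spread are introduced only for this illustration.

```python
import pandas as pd

# Values copied from the printed results table
solver_results = pd.DataFrame({
    "Solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
    "Training Accuracy": [0.7481, 0.7480, 0.7479, 0.7479, 0.7480],
    "Holdout Accuracy": [0.7355, 0.7358, 0.7362, 0.7358, 0.7360],
    "Time (seconds)": [0.09, 0.25, 0.07, 2.61, 4.45],
})

# Rank by holdout accuracy (highest first), then by fitting time (lowest first)
ranked = solver_results.sort_values(
    ["Holdout Accuracy", "Time (seconds)"],
    ascending=[False, True],
).reset_index(drop=True)

# Spread between the best and worst holdout accuracy
spread = (solver_results["Holdout Accuracy"].max()
          - solver_results["Holdout Accuracy"].min())

print(ranked.to_string(index=False))
print("Holdout accuracy spread:", round(spread, 4))
```

With these numbers 'liblinear' ranks first on both criteria, so the ordering of the two sort keys does not change the winner here.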

