Week 09 - Machine Learning with Scikit-learn

Dhanraj Pallepati

1. Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.


Judged by training accuracy alone, the basic Random Forest Classifier scored highest among the classification models in the notebook, reaching 99.93%. Its testing accuracy, however, was only 68.6%, a gap of about 31 percentage points that is a clear sign of overfitting.

The logistic regression models, including standard logistic regression and the L1-penalized (LASSO) models with different hyperparameters, performed similarly on training and testing data. Their testing accuracy ranged from 70.6% to 71.8%. The baseline logistic regression model and the L1-penalized model with C=10 both achieved the top testing accuracy of 71.8%.

The Random Forest's near-perfect training accuracy comes from its many decision trees, which add enough model complexity to fit the training data almost exactly. The testing results show that this fit does not generalize well. The logistic regression models had lower training accuracy but generalized better to unseen data.

If training performance were the only criterion, the Random Forest Classifier would be the top model. Judged on both training and testing performance, the standard logistic regression model and the L1-penalized logistic regression model with C=10 are the best choice: their accuracy is stable across the two sets, they show minimal overfitting, and they have the highest testing accuracy (71.8%).
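The overfitting argument above comes down to simple arithmetic on the reported accuracies. The sketch below works through it using only the figures quoted in this answer; the helper name `generalization_gap` is hypothetical and is not part of the notebook.

```python
# Accuracies quoted in the answer above, expressed as fractions.
rf_train_acc = 0.9993   # Random Forest, training subset
rf_test_acc = 0.686     # Random Forest, testing subset
logit_test_acc = 0.718  # best logistic regression models, testing subset


def generalization_gap(train_acc, test_acc):
    """Return training accuracy minus testing accuracy (hypothetical helper)."""
    return train_acc - test_acc


rf_gap = generalization_gap(rf_train_acc, rf_test_acc)
test_advantage = logit_test_acc - rf_test_acc

print(f"Random Forest train-test gap: {rf_gap:.1%}")
print(f"Logistic regression advantage on test data: {test_advantage:.1%}")
```

A large gap combined with a lower testing accuracy is what supports preferring the logistic regression models here.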

2. Next, fit a series of logistic regression models, without regularization. Each model should use the same set of predictors (all of the relevant predictors in the dataset) and should use the entire dataset, rather than a fraction of it. Use a randomly chosen 80% proportion of observations for training and the remaining for checking the generalizable performance (i.e., performance on the holdout subset). Be sure to ensure that the training and holdout subsets are identical across all models. Each model should choose a different solver.

3. Compare the results of the models in terms of their accuracy (use this as the performance metric to assess generalizability error on the holdout subset) and the time taken (use appropriate timing function). Summarize your results via a table with the following structure:




In [5]:
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from patsy import dmatrices
import warnings

warnings.filterwarnings('ignore')
df_patient = pd.read_csv('./PatientAnalyticFile.csv')

# Create mortality variable
df_patient['mortality'] = np.where(df_patient['DateOfDeath'].isnull(), 0, 1)

# Convert DateOfBirth to datetime and calculate age in years as of 2015-01-01
df_patient['DateOfBirth'] = pd.to_datetime(df_patient['DateOfBirth'])
df_patient['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days / 365.25)

vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth', 'Last_Appointment_Date', 'DateOfDeath', 'mortality']
# Keep the original column order so the formula is identical on every run
vars_left = [col for col in df_patient.columns if col not in vars_remove]
formula = "mortality ~ " + " + ".join(vars_left)

# Create model matrices
Y, X = dmatrices(formula, df_patient, return_type='dataframe')

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, np.ravel(Y), test_size=0.2, random_state=42)

In [6]:
# Solvers to be tested
solvers = ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']

# Store results
results = []

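# Train and evaluate a model for each solver
for solver in solvers:
    start_time = time.time()
    # Note: LogisticRegression applies an L2 penalty (C=1.0) by default;
    # a very large C would approximate an unregularized fit
    clf = LogisticRegression(solver=solver, max_iter=500)
    clf.fit(X_train, y_train)
    train_accuracy = clf.score(X_train, y_train)
    test_accuracy = clf.score(X_test, y_test)
    # Elapsed time covers fitting plus scoring on both subsets
    time_taken = time.time() - start_time
    results.append([solver, train_accuracy, test_accuracy, time_taken])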

# Create results dataframe
results_df = pd.DataFrame(results, columns=['Solver used', 'Training subset accuracy', 'Holdout subset accuracy', 'Time taken'])
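The loop above measures fitting and both scoring calls as one interval. If only the fit time is of interest, a small wrapper around `time.perf_counter` is one way to isolate it. This is a minimal sketch: the helper `timed` is hypothetical, and the workload is a stand-in so the block runs without the dataset.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs on its own. In the notebook the call
# would look something like: _, fit_seconds = timed(clf.fit, X_train, y_train)
total, seconds = timed(sum, range(1_000_000))
print(f"sum={total}, took {seconds:.6f} s")
```

`time.perf_counter` is a monotonic, high-resolution clock intended for measuring short durations, which makes it a reasonable alternative to `time.time` for comparisons on the order of milliseconds.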

In [7]:
results_df

   Solver used  Training subset accuracy  Holdout subset accuracy  Time taken
0    liblinear                  0.747938                  0.73625    0.050005
1        lbfgs                  0.748062                  0.73600    0.759226
2    newton-cg                  0.748125                  0.73625    0.073325
3          sag                  0.748000                  0.73625    5.102967
4         saga                  0.748000                  0.73600    5.839588


Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?

The liblinear solver gave the best overall result. This conclusion considers all three metrics: training subset accuracy, holdout subset accuracy, and time taken.

The liblinear solver reached a training accuracy of 0.747938 and a holdout accuracy of 0.73625. It matched the other solvers on accuracy and had the fastest execution time, completing training and scoring in 0.050005 seconds. That speed is consistent with scikit-learn's guidance that liblinear is a good choice for small datasets, whereas sag and saga tend to be faster on large ones.

The lbfgs and newton-cg solvers were essentially as accurate as liblinear, with training accuracies of 0.748062 and 0.748125 and holdout accuracies of 0.73600 and 0.73625, respectively. They took 0.759226 seconds and 0.073325 seconds, both slower than liblinear, although newton-cg was close.

The sag and saga solvers also matched on accuracy, with a training accuracy of 0.748000 each and holdout accuracies of 0.73625 and 0.73600, but they took 5.102967 seconds and 5.839588 seconds, respectively, roughly 100 times longer than liblinear. With no accuracy gain to offset the extra time, these two solvers are poor choices for this specific problem.

Holdout subset accuracy was the main ranking criterion because it shows how well the model performs on new data. Since the holdout accuracies differed by at most 0.00025 across solvers, execution time served as the tiebreaker, and liblinear, the fastest solver, is the best selection for this task.
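The ranking rule described above (holdout accuracy first, execution time as the tiebreaker) can be sketched in a few lines. The values are copied from the results table; the variable names are illustrative and do not come from the notebook.

```python
# (solver, holdout accuracy, seconds) copied from the results table.
solver_results = [
    ("liblinear", 0.73625, 0.050005),
    ("lbfgs", 0.73600, 0.759226),
    ("newton-cg", 0.73625, 0.073325),
    ("sag", 0.73625, 5.102967),
    ("saga", 0.73600, 5.839588),
]

# Sort by holdout accuracy (highest first), then by time (lowest first).
ranked = sorted(solver_results, key=lambda r: (-r[1], r[2]))
best_solver = ranked[0][0]

# Spread in holdout accuracy across all solvers.
accuracies = [acc for _, acc, _ in solver_results]
accuracy_spread = max(accuracies) - min(accuracies)

for solver, acc, seconds in ranked:
    print(f"{solver:<10} holdout={acc:.5f} time={seconds:.6f}s")
print(f"best: {best_solver}, accuracy spread: {accuracy_spread:.5f}")
```

Because the accuracy spread is so small, a different tiebreaking rule (for example, ranking by time alone) would still place liblinear first on these numbers.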