Name: Anusri Bachina

Assg: Week 09 - Machine Learning with Scikit-learn

1. Among all the classification algorithms tested in the notebook, the Random Forest Classifier delivered the highest accuracy on the training data, at 99.93%. Its test accuracy, however, dropped to 68.6%. A gap of roughly 31 percentage points between training and test performance is a classic sign of overfitting.

  On the other hand, the logistic regression approaches, both the standard model and the L1-regularized (LASSO) versions fitted at several values of the regularization parameter C, gave more consistent results, with test accuracy between 70.6% and 71.8%. The plain logistic regression and the LASSO variant with C=10 both reached the highest test accuracy, 71.8%.

  The Random Forest's outstanding performance on the training data is largely due to its ensemble of complex decision trees, which allows it to memorize the training data exceptionally well. However, this same complexity limits its performance on unseen data, revealing poor generalization capabilities. Conversely, the logistic regression models maintained more balanced performance, showing better adaptability to new data even though their training accuracy was lower.

  If the objective is to maximize accuracy on the training data alone, the Random Forest Classifier is unmatched. However, for scenarios where both accuracy and generalization are important, the standard logistic regression and the L1-penalized model with C=10 offer the most reliable and stable performance. Their consistent outcomes across both training and testing datasets suggest they are less prone to overfitting and more suitable for robust real-world applications.
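The train/test gap described above can be reproduced in miniature. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the patient file (an assumption), and the helper `accuracy_gap` is a hypothetical name, so the numbers it prints will not match the ones reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient data (assumption, not the real file).
# flip_y adds label noise, so memorizing the training rows does not help
# on unseen rows.
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=5, flip_y=0.3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=42)

def accuracy_gap(model):
    """Fit the model and return (train accuracy, test accuracy, gap)."""
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    return train_acc, test_acc, train_acc - test_acc

gaps = {
    "random_forest": accuracy_gap(RandomForestClassifier(random_state=0)),
    "logistic": accuracy_gap(LogisticRegression(max_iter=500)),
}
for name, (train_acc, test_acc, gap) in gaps.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A large positive gap for one model and a small one for the other is the pattern to look for; the absolute accuracies depend entirely on the data.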



In [24]:
#2.
## Import Modules
import os
import sys
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.metrics import confusion_matrix
import sklearn
from sklearn import datasets

In [25]:
## Import Data
df_patient = \
 pd.read_csv('./PatientAnalyticFile.csv')
df_patient

Unnamed: 0,PatientID,DateOfBirth,Gender,Race,Myocardial_infarction,Congestive_heart_failure,Peripheral_vascular_disease,Stroke,Dementia,Pulmonary,...,Metastatic_solid_tumour,HIV,Obesity,Depression,Hypertension,Drugs,Alcohol,First_Appointment_Date,Last_Appointment_Date,DateOfDeath
0,1,1962-02-27,female,hispanic,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2013-04-27,2018-06-01,
1,2,1959-08-18,male,white,0,0,0,0,0,0,...,0,0,0,0,1,0,0,2005-11-30,2008-11-02,2008-11-02
2,3,1946-02-15,female,white,0,0,0,0,0,0,...,0,1,0,0,1,0,0,2011-11-05,2015-11-13,
3,4,1979-07-27,female,white,0,0,0,0,0,1,...,0,0,0,0,0,0,0,2010-03-01,2016-01-17,2016-01-17
4,5,1983-02-19,female,hispanic,0,0,0,0,0,0,...,0,0,0,0,1,0,0,2006-09-22,2018-06-01,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,19996,1997-12-19,female,other,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2008-06-14,2018-06-01,
19996,19997,1984-03-31,female,white,0,0,0,0,0,0,...,0,1,0,0,1,0,0,2007-04-24,2018-06-01,
19997,19998,1993-07-04,female,white,0,0,0,0,0,0,...,0,0,1,0,1,0,0,2010-10-16,2018-06-01,
19998,19999,1984-04-17,male,other,0,0,0,0,0,0,...,0,0,0,0,1,0,0,2015-01-04,2018-06-01,


In [26]:
# Create mortality variable
df_patient['mortality'] = \
    np.where(df_patient['DateOfDeath'].isnull(),
             0,1)
# Examine
df_patient['mortality']

Unnamed: 0,mortality
0,0
1,1
2,0
3,1
4,0
...,...
19995,0
19996,0
19997,0
19998,0


In [27]:
df_patient['mortality'].describe()

Unnamed: 0,mortality
count,20000.0
mean,0.3547
std,0.478434
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0
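The mean of 0.3547 is the share of patients with a recorded death date. It also fixes the accuracy of a trivial classifier that always predicts the majority class, which is a useful yardstick for the model accuracies discussed in this notebook. A small arithmetic sketch using only the numbers reported above (the 10% subsample used for modelling may differ slightly):

```python
mortality_rate = 0.3547   # mean of df_patient['mortality'] reported above
n_patients = 20000        # count reported above

# Number of patients with a recorded date of death.
n_deaths = round(mortality_rate * n_patients)

# Accuracy of always predicting the more common class (here: survived).
baseline_accuracy = max(mortality_rate, 1 - mortality_rate)

print(n_deaths, round(baseline_accuracy, 4))  # prints: 7094 0.6453
```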


In [28]:
df_patient.describe()

Unnamed: 0,PatientID,Myocardial_infarction,Congestive_heart_failure,Peripheral_vascular_disease,Stroke,Dementia,Pulmonary,Rheumatic,Peptic_ulcer_disease,LiverMild,...,Cancer,LiverSevere,Metastatic_solid_tumour,HIV,Obesity,Depression,Hypertension,Drugs,Alcohol,mortality
count,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,...,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0,20000.0
mean,10000.5,0.0456,0.04345,0.02395,0.02865,0.0314,0.07265,0.0123,0.00965,0.00925,...,0.05045,0.05145,0.03315,0.00645,0.16345,0.1063,0.3029,0.04005,0.07975,0.3547
std,5773.647028,0.208621,0.203873,0.152897,0.166825,0.174401,0.259568,0.110224,0.097762,0.095733,...,0.218877,0.220919,0.179033,0.080054,0.369785,0.308229,0.459524,0.196081,0.270913,0.478434
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5000.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,10000.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,15000.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
max,20000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [29]:
df_patient.dtypes

Unnamed: 0,0
PatientID,int64
DateOfBirth,object
Gender,object
Race,object
Myocardial_infarction,int64
Congestive_heart_failure,int64
Peripheral_vascular_disease,int64
Stroke,int64
Dementia,int64
Pulmonary,int64


In [30]:
# Convert DateOfBirth to datetime
df_patient['DateOfBirth'] = \
    pd.to_datetime(df_patient['DateOfBirth'])
# Calculate age in years as of 2015-01-01
df_patient['Age_years'] = \
    ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days/365.25)
df_patient['Age_years'].describe()

Unnamed: 0,Age_years
count,20000.0
mean,47.247474
std,18.145086
min,15.753593
25%,31.733744
50%,47.099247
75%,62.924025
max,78.743326
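As a sanity check on the age calculation, the same arithmetic can be done by hand for one patient with the standard library. The date of birth below is the first row of the table shown earlier; dividing by 365.25 is the same leap-year approximation used in the cell above.

```python
from datetime import date

reference_date = date(2015, 1, 1)   # same reference date as the cell above
date_of_birth = date(1962, 2, 27)   # first patient in the table shown earlier

age_days = (reference_date - date_of_birth).days
age_years = age_days / 365.25       # 365.25 averages over leap years

print(age_days, round(age_years, 2))  # prints: 19301 52.84
```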


In [31]:
df_patient.columns

Index(['PatientID', 'DateOfBirth', 'Gender', 'Race', 'Myocardial_infarction',
       'Congestive_heart_failure', 'Peripheral_vascular_disease', 'Stroke',
       'Dementia', 'Pulmonary', 'Rheumatic', 'Peptic_ulcer_disease',
       'LiverMild', 'Diabetes_without_complications',
       'Diabetes_with_complications', 'Paralysis', 'Renal', 'Cancer',
       'LiverSevere', 'Metastatic_solid_tumour', 'HIV', 'Obesity',
       'Depression', 'Hypertension', 'Drugs', 'Alcohol',
       'First_Appointment_Date', 'Last_Appointment_Date', 'DateOfDeath',
       'mortality', 'Age_years'],
      dtype='object')

In [32]:
## Create formula for all variables in model
vars_remove = ['PatientID','First_Appointment_Date','DateOfBirth',
               'Last_Appointment_Date','DateOfDeath','mortality']
vars_left = set(df_patient.columns) - set(vars_remove)
formula = "mortality ~ " + " + ".join(vars_left)
formula

'mortality ~ HIV + Congestive_heart_failure + Myocardial_infarction + Renal + Depression + Metastatic_solid_tumour + Drugs + Alcohol + Obesity + Diabetes_without_complications + Peptic_ulcer_disease + Age_years + LiverMild + Hypertension + LiverSevere + Paralysis + Race + Rheumatic + Stroke + Dementia + Gender + Cancer + Diabetes_with_complications + Peripheral_vascular_disease + Pulmonary'
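Because `vars_left` is a Python set of strings, the order of terms in the formula can change from one Python session to the next, and with it the column order of the design matrix. If a reproducible order matters, one option is to filter the column list instead of taking a set difference. The column names below are a hypothetical subset chosen only for illustration:

```python
# Hypothetical subset of the column names, for illustration only.
columns = ["PatientID", "Gender", "Race", "Age_years", "Stroke", "mortality"]
vars_remove = {"PatientID", "mortality"}

# Filtering a list keeps the original column order, unlike set subtraction.
vars_left = [col for col in columns if col not in vars_remove]
formula = "mortality ~ " + " + ".join(vars_left)

print(formula)  # prints: mortality ~ Gender + Race + Age_years + Stroke
```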

In [33]:
## only use subset of data so models fit in reasonable time
df_patient_sub = \
    df_patient.sample(frac=0.1,
                     random_state=32)
## use Patsy to create model matrices
Y,X = dmatrices(formula,
                df_patient_sub)

In [34]:
X

DesignMatrix with shape (2000, 28)
  Columns:
    ['Intercept',
     'Race[T.hispanic]',
     'Race[T.other]',
     'Race[T.white]',
     'Gender[T.male]',
     'HIV',
     'Congestive_heart_failure',
     'Myocardial_infarction',
     'Renal',
     'Depression',
     'Metastatic_solid_tumour',
     'Drugs',
     'Alcohol',
     'Obesity',
     'Diabetes_without_complications',
     'Peptic_ulcer_disease',
     'Age_years',
     'LiverMild',
     'Hypertension',
     'LiverSevere',
     'Paralysis',
     'Rheumatic',
     'Stroke',
     'Dementia',
     'Cancer',
     'Diabetes_with_complications',
     'Peripheral_vascular_disease',
     'Pulmonary']
  Terms:
    'Intercept' (column 0)
    'Race' (columns 1:4)
    'Gender' (column 4)
    'HIV' (column 5)
    'Congestive_heart_failure' (column 6)
    'Myocardial_infarction' (column 7)
    'Renal' (column 8)
    'Depression' (column 9)
    'Metastatic_solid_tumour' (column 10)
    'Drugs' (column 11)
    'Alcohol' (column 12)
    'Obesi

In [35]:
Y

DesignMatrix with shape (2000, 1)
  mortality
          0
          0
          1
          1
          0
          0
          1
          1
          0
          0
          1
          0
          1
          0
          1
          0
          1
          0
          0
          1
          0
          1
          0
          0
          0
          0
          1
          1
          0
          0
  [1970 rows omitted]
  Terms:
    'mortality' (column 0)
  (to view full data, use np.asarray(this_obj))

In [13]:
## Split data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X,
                     np.ravel(Y), # prevents dimensionality error later!
                     test_size=0.25,
                     random_state=42)
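With `test_size=0.25`, the 2,000 sampled rows are divided into 1,500 training rows and 500 holdout rows. A quick check of those sizes, using an array of row indices as a stand-in for the design matrix (an assumption made to keep the example self-contained):

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(2000)  # stand-in for the 2,000 sampled patients

train_ids, test_ids = train_test_split(row_ids,
                                       test_size=0.25,
                                       random_state=42)

print(len(train_ids), len(test_ids))  # prints: 1500 500
```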

**Comparing Logistic Regression Solvers**

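I evaluated the performance of the different solvers available in scikit-learn for fitting a logistic regression model. The solvers tested are:

- liblinear
- lbfgs
- newton-cg
- sag
- saga

Each model is trained on the same 75% training subset (1,500 of the 2,000 sampled patients) and evaluated on the remaining 25% holdout set (500 patients), matching test_size=0.25 in the split above. The same predictors are used for every solver. Note that the code below leaves LogisticRegression at its default L2 penalty (C=1.0), so regularization is not actually switched off.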
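If the goal is a fit with effectively no regularization, one approach that every solver accepts is a very large `C`, since `C` is the inverse of the regularization strength. The sketch below runs on synthetic data (an assumption) and only shows how the coefficient norm grows as the penalty weakens. The exact spelling of a true "no penalty" option has changed across scikit-learn releases, so check the documentation of the installed version before relying on it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption). Label noise keeps the classes
# overlapping, so the weakly penalized coefficients stay finite.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=4, flip_y=0.2,
                                     random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 1e10):
    # Larger C means a weaker L2 penalty; 1e10 is effectively unpenalized.
    clf = LogisticRegression(solver="lbfgs", C=C, max_iter=1000)
    clf.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:g}: coefficient norm = {coef_norms[C]:.3f}")
```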

**To Compare:**

- **Training Accuracy**: how well the model fits the training data.
- **Holdout Accuracy**: how well the model generalizes to unseen data.
- **Time Taken**: how long the model takes to train and evaluate.

This comparison helps us understand the trade-offs between accuracy and computational efficiency across solvers.

**Solver Comparison Table**

Below is the summary table showing how each solver performed in terms of training and holdout accuracy, along with the time taken for model fitting and evaluation:


In [43]:
#3.
import pandas as pd
import time
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Solvers to be tested
solvers = ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']

# Store results
results = []

# Train and evaluate models for each solver
for solver in solvers:
    start_time = time.time()
    clf = LogisticRegression(solver=solver, max_iter=500)
    clf.fit(X_train, y_train)
    train_accuracy = clf.score(X_train, y_train)
    test_accuracy = clf.score(X_test, y_test)
    time_taken = time.time() - start_time
    results.append([solver, train_accuracy, test_accuracy, time_taken])

# Create results dataframe
results_df = pd.DataFrame(results, columns=['Solver used', 'Training subset accuracy', 'Holdout subset accuracy', 'Time taken'])



In [44]:
results_df

Unnamed: 0,Solver used,Training subset accuracy,Holdout subset accuracy,Time taken
0,liblinear,0.733333,0.718,0.012224
1,lbfgs,0.732667,0.714,0.093577
2,newton-cg,0.733333,0.714,0.014773
3,sag,0.732,0.718,0.359868
4,saga,0.737333,0.72,0.440257
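The times in this table come from a single run per solver and include both fitting and scoring, so differences of a few milliseconds can be noise. A hedged sketch of a steadier measurement follows: it times only `fit`, repeats each measurement, and reports the median. The data is synthetic (an assumption) and `median_fit_time` is a hypothetical helper name.

```python
import time
from statistics import median

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with a shape similar to the training subset (assumption).
X_demo, y_demo = make_classification(n_samples=1500, n_features=27,
                                     n_informative=8, random_state=0)

def median_fit_time(solver, repeats=5):
    """Median wall-clock seconds for fitting only (scoring excluded)."""
    durations = []
    for _ in range(repeats):
        clf = LogisticRegression(solver=solver, max_iter=500)
        start = time.perf_counter()
        clf.fit(X_demo, y_demo)
        durations.append(time.perf_counter() - start)
    return median(durations)

solvers = ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']
fit_times = {solver: median_fit_time(solver) for solver in solvers}
for solver, seconds in fit_times.items():
    print(f"{solver}: {seconds:.4f} s")
```

`time.perf_counter` is used instead of `time.time` because it is intended for measuring short intervals.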


### **Final Analysis: Which Solver is Best?**

Based on the table above, I evaluated the solvers using three key metrics:
- **Holdout Accuracy** (primary metric for generalization)
- **Training Accuracy** (secondary)
- **Time Taken** (especially important for large-scale applications)

While several solvers performed similarly in terms of accuracy, liblinear stood out due to its strong holdout accuracy and fastest runtime. This makes it an efficient and reliable choice when speed and generalization are both important.

#4.
Out of all the solver options evaluated, the liblinear solver emerged as the most efficient and well-balanced choice. This conclusion is supported by its consistent performance across three key metrics: training accuracy, holdout accuracy, and execution time.

In terms of predictive accuracy, liblinear achieved a training accuracy of 0.733333 and a holdout accuracy of 0.718, placing it near the top alongside sag (0.718 holdout) and saga (0.720 holdout). What set it apart was computational efficiency: training and evaluation together took just 0.012224 seconds, the fastest of the five, narrowly ahead of newton-cg and well ahead of the others.

The lbfgs and newton-cg solvers also performed well, with training accuracies of 0.732667 and 0.733333 and holdout accuracies of 0.714 each. Their execution times were 0.093577 and 0.014773 seconds, respectively. Both are fast: newton-cg nearly matches liblinear, while lbfgs took roughly 7.7 times as long.

The sag and saga solvers produced similar or slightly better accuracy. Saga in particular achieved the highest training accuracy (0.737333) and the highest holdout accuracy (0.720) of all solvers tested. These results came at the cost of much longer execution times: 0.359868 seconds for sag and 0.440257 seconds for saga, roughly 29 and 36 times liblinear's runtime. In contexts where speed is critical (e.g., real-time prediction systems or large-scale datasets), this added time may be a disadvantage.
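One likely contributor to the slower sag and saga runs is feature scale: the scikit-learn documentation notes that fast convergence of these two solvers is only guaranteed when features are on approximately the same scale, and here `Age_years` sits next to 0/1 indicators. The sketch below compares iteration counts with and without standardization. It uses synthetic data with one column rescaled to resemble an age in years (an assumption), so it illustrates the idea rather than reproducing this notebook's timings; the unscaled fit may emit a convergence warning.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=1500, n_features=10,
                                     n_informative=5, flip_y=0.2,
                                     random_state=0)
X_demo = X_demo.copy()
# Rescale one column to mimic an unscaled age variable (mean ~47, sd ~18).
X_demo[:, 0] = 47 + 18 * X_demo[:, 0]

# saga on the raw features.
raw_model = LogisticRegression(solver="saga", max_iter=500)
raw_model.fit(X_demo, y_demo)

# saga after standardizing every feature to zero mean and unit variance.
scaled_model = make_pipeline(StandardScaler(),
                             LogisticRegression(solver="saga", max_iter=500))
scaled_model.fit(X_demo, y_demo)

raw_iters = int(raw_model.n_iter_[0])
scaled_iters = int(scaled_model[-1].n_iter_[0])
print(f"iterations without scaling: {raw_iters}")
print(f"iterations with scaling:    {scaled_iters}")
```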

While saga had the edge in raw accuracy, its holdout advantage over liblinear is 0.720 versus 0.718, which on a 500-row holdout set amounts to a single additional correct prediction. Holdout accuracy was the key metric for assessing generalization to unseen data, but with the solvers this close, runtime becomes the deciding factor. On that basis liblinear offers the best trade-off between generalization performance and computational efficiency, and it stands out as the most practical solver for logistic regression in this scenario.
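To put the accuracy differences in perspective, the reported holdout accuracies can be converted back into counts of correct predictions and compared with a rough standard error. The sketch assumes a 500-row holdout set (25% of the 2,000 sampled rows) and uses the binomial approximation sqrt(p(1 - p)/n). Since all solvers were scored on the same rows, this is only a rough yardstick, not a formal test.

```python
import math

n_holdout = 500  # 25% of the 2,000 sampled rows

# Holdout accuracies from the solver comparison table.
holdout_acc = {"liblinear": 0.718, "lbfgs": 0.714, "newton-cg": 0.714,
               "sag": 0.718, "saga": 0.720}

# Convert accuracies back to counts of correct holdout predictions.
correct = {solver: round(acc * n_holdout)
           for solver, acc in holdout_acc.items()}

# Approximate standard error of a proportion: sqrt(p * (1 - p) / n).
p = holdout_acc["liblinear"]
std_error = math.sqrt(p * (1 - p) / n_holdout)

print(correct)
print(round(std_error, 4))  # prints: 0.0201
```

The spread between the best and worst solver is three correct predictions (360 versus 357), well inside a standard error of about two percentage points.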