<a href="https://colab.research.google.com/github/SRARNAB7/HDS_5230_07_Arnab/blob/main/Week%2009/Week_09_Assignment_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1) Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.

| Model | Train | Test |
|---|---|---|
| Logistic | 0.7333 | 0.718 |
| Null | 0.6467 | 0.608 |
| Logistic_L1_C_1 | 0.732 | 0.716 |
| Logistic_L1_C_01 | 0.726 | 0.706 |
| Logistic_L1_C_10 | 0.7347 | 0.718 |
| Logistic_L1_C_auto | 0.7233 | 0.708 |
| Logistic_SL1_C_auto | 0.7307 | 0.714 |
| RandomForest_noCV | 0.9993 | 0.69 |
| RandomForest_CV | 0.9987 | 0.702 |
| RandomForest_CV2 | 0.7273 | 0.702 |

Based on the results above, the logistic regression model with an L1 penalty and C=10 (Logistic_L1_C_10) demonstrated the best overall performance. It achieved the highest test accuracy, 0.718 (tied with the baseline Logistic model), which is the key indicator of how well a model generalizes to new, unseen data. In contrast, the Random Forest model without cross-validation (RandomForest_noCV) had an almost perfect training accuracy of 0.9993, but its test accuracy dropped sharply to about 0.69, which indicates significant overfitting.

The Logistic_L1_C_10 model showed a strong balance between training and testing performance, with a training accuracy of 0.7347 and a test accuracy of 0.718. This close alignment indicates that the model generalizes well without being overly fitted to the training data. Therefore, among all the classification models evaluated, Logistic_L1_C_10 proved to be the most reliable and generalizable.
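
The comparison above can be made explicit by computing the train-minus-test gap for each model. The sketch below re-enters the accuracies from the results by hand; the `scores` and `ranked` names and the `Gap` column are illustrative and do not come from the original notebook. Ties on test accuracy are broken by training accuracy, which is only one possible choice.

```python
import pandas as pd

# Accuracies copied by hand from the results listed above
scores = pd.DataFrame(
    [
        ("Logistic", 0.7333, 0.718),
        ("Null", 0.6467, 0.608),
        ("Logistic_L1_C_1", 0.732, 0.716),
        ("Logistic_L1_C_01", 0.726, 0.706),
        ("Logistic_L1_C_10", 0.7347, 0.718),
        ("Logistic_L1_C_auto", 0.7233, 0.708),
        ("Logistic_SL1_C_auto", 0.7307, 0.714),
        ("RandomForest_noCV", 0.9993, 0.69),
        ("RandomForest_CV", 0.9987, 0.702),
        ("RandomForest_CV2", 0.7273, 0.702),
    ],
    columns=["Model", "Train", "Test"],
)

# A large positive gap means the model fits the training data much better
# than the holdout data, which is the usual symptom of overfitting
scores["Gap"] = (scores["Train"] - scores["Test"]).round(4)

# Rank by test accuracy first; break ties with training accuracy (an assumption)
ranked = scores.sort_values(["Test", "Train"], ascending=False).reset_index(drop=True)
print(ranked)
```

Sorting on `Gap` instead would surface the two random forests fitted to near-perfect training accuracy as the least balanced models.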

In [None]:
## Import Modules
import os
import sys
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.metrics import confusion_matrix
import sklearn
from sklearn import datasets

In [None]:
## Set default figure size to be larger
## this may only work in matplotlib 2.0+!
matplotlib.rcParams['figure.figsize'] = [10.0,6.0]
## Enable multiple outputs from jupyter cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
## Get Version information
print(sys.version)
print("Pandas version: {0}".format(pd.__version__))
print("Matplotlib version: {0}".format(matplotlib.__version__))
print("Numpy version: {0}".format(np.__version__))
print("SciKitLearn version: {0}".format(sklearn.__version__))

Check the working directory

Set the working directory to make paths easier :)

In [None]:
# Working Directory
import os
print("My working directory:\n" + os.getcwd())
# Set Working Directory
os.chdir(".")
print("My new working directory:\n" + os.getcwd())

Patient Mortality Dataset

We will use a dataset with a binary outcome of mortality as a motivating example.

This is a dataset of patient demographics and disease status, with mortality indicated. The dataset is here:

data\healthcare\patientAnalyticFile.csv

In practice, you most likely would have created a dataset like this from multiple other files after cleaning, reshaping, and joining them.

You can generalize this setup to any situation with a binary outcome, such as estimating the probability of a customer filing a warranty claim, or the probability of a transaction being fraudulent.

We will first import this dataset and examine the potential variables to use in our classification algorithm.

In [None]:
## Set print limits
pd.options.display.max_rows = 10
## Import Data
df_patient = \
 pd.read_csv('./PatientAnalyticFile.csv')
df_patient

We need to make a variable to indicate mortality. We can do that based on the absence of 'date of death':

In [None]:
# Create mortality variable
df_patient['mortality'] = \
    np.where(df_patient['DateOfDeath'].isnull(),
             0,1)
# Examine
df_patient['mortality']

In [None]:
df_patient['mortality'].describe()

In [None]:
df_patient.describe()

In [None]:
df_patient.dtypes

We should change date of birth to be an actual date and calculate age if we want to include it in the model:

In [None]:
# Convert dateofBirth to date
df_patient['DateOfBirth'] = \
    pd.to_datetime(df_patient['DateOfBirth'])
# Calculate age in years as of 2015-01-01
df_patient['Age_years'] = \
    ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days/365.25)
df_patient['Age_years'].describe()

Use Patsy to Create the Model Matrices

We typically start out with a pandas dataframe for manipulation purposes, then use this dataframe as the input to the machine learning library. I created a pandas dataframe above to replicate this process. We will use the dmatrices function from the patsy library to generate the design matrices that represent the inputs to the machine learning algorithms. This handles the following:

- drops rows with missing data
- constructs one-hot encoding for categorical variables
- optionally adds a constant intercept
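
To make those three behaviours concrete, here is a rough pandas-only analogue on a toy dataframe. It is not patsy's implementation and patsy's column names differ; the `toy`, `complete`, `design`, and `response` names and all the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numeric predictor, one categorical predictor,
# and a single missing value in the numeric column
toy = pd.DataFrame({
    "outcome": [0, 1, 0, 1],
    "age": [50.0, np.nan, 61.0, 72.0],
    "sex": ["F", "M", "M", "F"],
})

# 1) Drop rows with missing data
complete = toy.dropna()

# 2) One-hot encode the categorical column, dropping the first level so it
#    acts as the reference category alongside the intercept
design = pd.get_dummies(
    complete[["age", "sex"]], columns=["sex"], drop_first=True, dtype=float
)

# 3) Add a constant intercept column
design.insert(0, "Intercept", 1.0)

# The response is taken from the same complete rows
response = complete["outcome"].to_numpy()

print(design)
print(response)
```

The row with the missing age disappears from both the design matrix and the response, which is why the number of rows after `dmatrices` can be smaller than in the original dataframe.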

In [None]:
df_patient.columns

In [None]:
## Create formula for all variables in model
vars_remove = ['PatientID','First_Appointment_Date','DateOfBirth',
               'Last_Appointment_Date','DateOfDeath','mortality']
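# Keep the original column order so the formula is the same on every run
# (a set difference does not guarantee a stable ordering)
vars_left = [col for col in df_patient.columns if col not in vars_remove]
formula = "mortality ~ " + " + ".join(vars_left)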
formula

**2) Next, fit a series of logistic regression models without regularization. Each model should use the same set of predictors (all of the relevant predictors in the dataset) and should use the entire dataset, rather than a fraction of it. Use a randomly chosen 80% proportion of observations for training and the remaining 20% for checking the generalizable performance (i.e., performance on the holdout subset). Ensure that the training and holdout subsets are identical across all models. Each model should use a different solver.**

In [None]:
Y, X = dmatrices(formula, df_patient)

In [None]:
Y

In [None]:
X

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, np.ravel(Y), test_size=0.2, random_state=42)

Confirming the Output Dimensions. Within each subset, X and y should have the same number of rows, and the train and test matrices should have the same number of columns. The share of rows in the test subset should also be close to the test_size argument.

In [None]:
## Confirm dimensions
X_train.shape

In [None]:
X_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape
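
The same checks can be written as assertions instead of being read off by eye. The helper below is a sketch: `check_split` and its `tol` parameter are hypothetical names, and the toy arrays only stand in for the notebook's matrices.

```python
import numpy as np

def check_split(X_train, X_test, y_train, y_test, test_size=0.2, tol=0.01):
    """Verify that the split shapes line up and return the realized holdout fraction."""
    # Features and labels must have the same number of rows in each subset
    assert X_train.shape[0] == y_train.shape[0]
    assert X_test.shape[0] == y_test.shape[0]
    # Both subsets must share the same predictor columns
    assert X_train.shape[1] == X_test.shape[1]
    # The holdout share should be close to the requested test_size
    frac = X_test.shape[0] / (X_train.shape[0] + X_test.shape[0])
    assert abs(frac - test_size) <= tol
    return frac

# Toy arrays with hypothetical sizes, split 80/20 by a random permutation
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 5))
y_toy = rng.integers(0, 2, size=1000)
idx = rng.permutation(1000)
test_idx, train_idx = idx[:200], idx[200:]

frac = check_split(X_toy[train_idx], X_toy[test_idx], y_toy[train_idx], y_toy[test_idx])
print(frac)
```

In the notebook itself, the call would be `check_split(X_train, X_test, y_train, y_test)` with the arrays produced by `train_test_split`.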

**Logistic regression models without regularization, each fitted with a different solver. The solvers are 'lbfgs', 'newton-cg', 'newton-cholesky', 'sag', and 'saga'.**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time
import pandas as pd

# List of solvers that support penalty=None
solvers = ['lbfgs', 'newton-cg', 'newton-cholesky', 'sag', 'saga']

# Store results
results = []

# Loop through solvers
for solver in solvers:
    print(f"\nTraining with solver = {solver}")

    # Initialize model
    clf = LogisticRegression(
        penalty=None,
        solver=solver,
        fit_intercept=True,
        max_iter=1000,
        random_state=42
    )

    # Time the fitting process with a monotonic, high-resolution clock
    start_time = time.perf_counter()
    clf.fit(X_train, y_train)
    end_time = time.perf_counter()

    # Evaluate
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    test_acc = accuracy_score(y_test, clf.predict(X_test))
    duration = end_time - start_time

    # Store results
    results.append({
        "Solver": solver,
        "Train Accuracy": round(train_acc, 4),
        "Test Accuracy": round(test_acc, 4),
        "Time (seconds)": round(duration, 4)
    })

# Convert to DataFrame
results_df = pd.DataFrame(results)

# Display table
results_df.sort_values(by="Test Accuracy", ascending=False, inplace=True)
results_df.reset_index(drop=True, inplace=True)
results_df

**4) Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?**
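
One possible ranking rule, offered as a sketch rather than as the answer to the question: order by holdout accuracy first, since it measures generalization, and use fit time only to break ties. The `rank_solvers` function is a hypothetical helper, and the numbers in `demo` are placeholders for illustration only, not results from this notebook. The column names match those of `results_df` above.

```python
import pandas as pd

def rank_solvers(results_df):
    """Order solvers by holdout accuracy (higher first), then by fit time (lower first)."""
    return (
        results_df
        .sort_values(by=["Test Accuracy", "Time (seconds)"], ascending=[False, True])
        .reset_index(drop=True)
    )

# Placeholder values for illustration only; NOT results from this notebook
demo = pd.DataFrame({
    "Solver": ["solver_a", "solver_b", "solver_c"],
    "Train Accuracy": [0.75, 0.75, 0.74],
    "Test Accuracy": [0.73, 0.73, 0.72],
    "Time (seconds)": [0.50, 0.10, 0.05],
})

ranked_demo = rank_solvers(demo)
print(ranked_demo)
```

Training accuracy is left out of the sort keys on purpose: it is useful for spotting overfitting when compared with holdout accuracy, but a higher training accuracy alone does not make a model better.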