**You are given a dataset (MiniExam4Dataset.csv) that includes 14 features 
representing the clinical conditions of 500 ICU patients and a target variable, death, that 
indicates whether the patient died in the ICU (=1) or was discharged alive (=0).**

**(a) Split the data set into a training set and a test set (80% Training, 20% Test)**

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

df = pd.read_csv('MiniExam4DataSet.csv')
df.head()

Unnamed: 0,age,LOS,previous_LOS,previous_ICU_stays,cvc_status,SIRS_48_hour,MV_24_hour,Initial_SOFA,Discharge_SOFA,Max_SOFA,AdmitApache,DischargeApache,sex,Type,death
0,86,160.983333,1.8,0,0,0,0,6,3,6,66,57,F,Surgical,0
1,61,103.533333,11.433333,0,1,1,0,7,4,7,80,73,F,Surgical,0
2,22,572.383333,14.45,0,1,1,0,10,4,12,90,74,F,Surgical,0
3,58,51.2,0.0,0,1,1,1,7,3,7,78,64,M,Medical,0
4,18,35.116667,0.0,0,0,1,1,7,2,7,73,49,M,Medical,0


In [2]:
# Convert the categorical variables (sex, Type) to dummy variables with a one-hot encoder.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

df_encoded = encoder.fit_transform(df[['sex', 'Type']])

df_encoded = pd.DataFrame(df_encoded.toarray(), columns=encoder.get_feature_names_out(['sex', 'Type']))

df = df.drop(['sex', 'Type'], axis=1)

df = pd.concat([df, df_encoded], axis=1)

In [3]:
df.head()

Unnamed: 0,age,LOS,previous_LOS,previous_ICU_stays,cvc_status,SIRS_48_hour,MV_24_hour,Initial_SOFA,Discharge_SOFA,Max_SOFA,AdmitApache,DischargeApache,death,sex_F,sex_M,Type_Medical,Type_Surgical
0,86,160.983333,1.8,0,0,0,0,6,3,6,66,57,0,1.0,0.0,0.0,1.0
1,61,103.533333,11.433333,0,1,1,0,7,4,7,80,73,0,1.0,0.0,0.0,1.0
2,22,572.383333,14.45,0,1,1,0,10,4,12,90,74,0,1.0,0.0,0.0,1.0
3,58,51.2,0.0,0,1,1,1,7,3,7,78,64,0,0.0,1.0,1.0,0.0
4,18,35.116667,0.0,0,0,1,1,7,2,7,73,49,0,0.0,1.0,1.0,0.0


In [4]:
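# Separate the features (X) and the target (y). One dummy column per categorical
# variable (sex_M, Type_Medical) is dropped because it is redundant with its counterpart.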

X = df.drop(columns=['death','sex_M','Type_Medical'])
y = df['death'].values

In [5]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

In [6]:
# Split into training (80%) and test (20%) sets, stratified on the target.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

**(b) Standardize your features.**

In [7]:
# Fit the scaler on the training set only, then standardize both the training and test sets with it.

st_scaler = preprocessing.StandardScaler()
st_scaler.fit(X_train)
X_trainStandard = st_scaler.transform(X_train)
X_testStandard = st_scaler.transform(X_test)
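The fit-on-train, transform-both pattern in the cell above can be written out by hand to show what the scaler stores. The following is a minimal sketch using NumPy on hypothetical toy arrays (`X_train_toy`, `X_test_toy` are illustrative names, not the ICU data). It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical toy data standing in for X_train / X_test (not the ICU data).
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# "Fit": estimate per-feature mean and standard deviation on the training rows only.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # ddof=0, matching StandardScaler

# "Transform": apply the training statistics to both sets.
X_train_std = (X_train_toy - mu) / sigma
X_test_std = (X_test_toy - mu) / sigma

print("train mean:", X_train_std.mean(axis=0).round(6))
print("train std: ", X_train_std.std(axis=0).round(6))
print("test mean: ", X_test_std.mean(axis=0).round(3))
```

The standardized training columns have mean 0 and standard deviation 1 by construction. The test columns generally do not, because they are scaled with statistics they did not contribute to; that is the intended behaviour.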

**(c) Use cross-validation to select the best method and the best set of parameters. 
Try Regularized Logistic Regression (both L1 and L2 penalties and different C 
values), KNN classifier (different numbers of neighbors you believe to be 
reasonable). BE CAREFUL that the best model should be selected using cross 
validation hence you should never evaluate different methods using the test set. 
Also, be very careful that the standardization needs to be carefully done during 
cross validation not to end up with data snooping (recall the pipe approach 
discussed in the class).**

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

In [9]:
# Cross-validate logistic regression with the L1 penalty (Lasso) over a range of C values.

C_param = [0.001,0.01,0.1,1,10,100,1000,10000]
scoresCV_l1 = []
for c in C_param:
    lr = LogisticRegression(C=c,penalty='l1',max_iter=1000,solver='liblinear')
    pipeline = make_pipeline(preprocessing.StandardScaler(), lr)
    scoreCV_l1 = cross_val_score(pipeline, X_train, y_train, scoring='accuracy',
                             cv=KFold(n_splits=10, shuffle=True,
                                            random_state=1))
    scoresCV_l1.append([c,np.mean(scoreCV_l1)])

In [10]:
# Then, validation accuracy scores are printed for L1 penalty.

print("Validation Accuracy in L1 Penalty")
df_l1 = pd.DataFrame(scoresCV_l1,columns=['C (1/lambda)','Validation Accuracy'])
df_l1

Validation Accuracy in L1 Penalty


Unnamed: 0,C (1/lambda),Validation Accuracy
0,0.001,0.8075
1,0.01,0.8075
2,0.1,0.9025
3,1.0,0.895
4,10.0,0.8975
5,100.0,0.8975
6,1000.0,0.8975
7,10000.0,0.8975


C = 0.1 has the highest validation accuracy, 0.9025, for the L1 penalty.

In [11]:
# Similarly, cross-validate logistic regression with the L2 penalty (Ridge) over the same C values.

scoresCV_l2 = []
for c in C_param:
    lr = LogisticRegression(C=c,penalty='l2',max_iter=1000)
    pipeline = make_pipeline(preprocessing.StandardScaler(), lr)
    scoreCV_l2 = cross_val_score(pipeline, X_train, y_train, scoring='accuracy',
                             cv=KFold(n_splits=10, shuffle=True,
                                            random_state=1))
    scoresCV_l2.append([c,np.mean(scoreCV_l2)])

In [12]:
# Validation accuracy scores are printed for L2 penalty.

print("Validation Accuracy in L2 Penalty")
df_l2 = pd.DataFrame(scoresCV_l2,columns=['C (1/lambda)','Validation Accuracy'])
df_l2

Validation Accuracy in L2 Penalty


Unnamed: 0,C (1/lambda),Validation Accuracy
0,0.001,0.8075
1,0.01,0.8775
2,0.1,0.895
3,1.0,0.8975
4,10.0,0.8975
5,100.0,0.8975
6,1000.0,0.8975
7,10000.0,0.8975


C = 1 is the smallest value that reaches the highest validation accuracy, 0.8975, for the L2 penalty; all larger C values tie with it.

In [13]:
# Cross-validate a KNN classifier for K = 1..25.
# As with logistic regression, the scaler sits inside the pipeline, so it is refit on the
# training portion of every fold and the validation fold never influences the scaling.

cv = KFold(n_splits=10, random_state=1, shuffle=True)
CV_accuracy=[]
for j in range(1,26):
    knn = KNeighborsClassifier(n_neighbors = j)
    pipe = make_pipeline(preprocessing.StandardScaler(), knn)
    scores = cross_val_score(pipe, X_train, y_train, scoring='accuracy', cv=cv)
    CV_accuracy.append([j,scores.mean()])

In [14]:
df_knn = pd.DataFrame (CV_accuracy,columns=['NeighbourSize','Validation Accuracy'])

In [15]:
# Print each neighbour size (K) with its mean validation accuracy.

kfoldCV = df_knn.groupby("NeighbourSize")
kfoldCV = kfoldCV.mean()
kfoldCV = kfoldCV.reset_index()
print('KNN Neighbour Size & Accuracy')
kfoldCV[['NeighbourSize', 'Validation Accuracy']]

KNN Neighbour Size & Accuracy


Unnamed: 0,NeighbourSize,Validation Accuracy
0,1,0.8325
1,2,0.855
2,3,0.8725
3,4,0.87
4,5,0.89
5,6,0.89
6,7,0.8975
7,8,0.895
8,9,0.895
9,10,0.9


Cross-validation gives the following best validation accuracy for each method. Logistic regression with L1 (Lasso) regularization reaches 0.9025 at C=0.1. With L2 (Ridge) regularization it reaches 0.8975 at C=1, with larger C values tying. The KNN classifier reaches 0.900 at K=10.

Logistic regression with the L1 penalty therefore has the highest validation accuracy, although its margin over KNN (0.9025 vs. 0.900) is small: roughly one observation out of the 400 in the training set. It also outperforms the L2 penalty (0.8975). Thus, logistic regression with the L1 (Lasso) penalty is selected as the final model.
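The three separate loops above can also be expressed as one search. The following is a minimal sketch, assuming scikit-learn's `GridSearchCV` and a synthetic dataset from `make_classification` as a stand-in for the ICU data (`X_demo`, `y_demo`, the step names `scale` and `clf`, and the shortened C grid are illustrative choices). It uses `liblinear` for both penalties, whereas the L2 cell above uses the default solver, so scores need not match the tables exactly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X_train, y_train); NOT the ICU data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=14, n_informative=6,
    weights=[0.8, 0.2], random_state=1,
)

# The scaler is a pipeline step, so it is refit inside every CV fold.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

param_grid = [
    {   # regularized logistic regression: both penalties, several C values
        "clf": [LogisticRegression(solver="liblinear", max_iter=1000)],
        "clf__penalty": ["l1", "l2"],
        "clf__C": [0.001, 0.01, 0.1, 1, 10, 100],
    },
    {   # KNN: K = 1..25
        "clf": [KNeighborsClassifier()],
        "clf__n_neighbors": list(range(1, 26)),
    },
]

search = GridSearchCV(
    pipe, param_grid, scoring="accuracy",
    cv=KFold(n_splits=10, shuffle=True, random_state=1),
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 4))
```

Because every candidate is scored on the same folds, the winning method and its parameters come from a single ranking, and the test set is never touched during selection.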

**(d) Once you decide on the final method and the set of best parameters, refit 
your model on the standardized training set and evaluate the performance 
(accuracy) on the standardized test set.**

In [16]:
# Fit L1-penalized logistic regression on the standardized training set,
# then print its accuracy on the standardized test set.
# Note: C=1 is used here, although cross-validation above selected C=0.1 for the L1 penalty;
# the outputs below correspond to C=1.

logreg = LogisticRegression(C=1,penalty='l1',max_iter=1000, solver='liblinear')
logreg.fit(X_trainStandard, y_train)
score = logreg.score(X_testStandard, y_test)
print(score)

0.92


The model fitted on the standardized training set achieves an accuracy of 0.92 on the standardized test set.
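Scaling and fitting can also be bundled into one object for the final refit. The following is a minimal sketch on synthetic data (`X_demo`, `y_demo` and the other names are illustrative, not the ICU data), checking that a pipeline fitted on the raw training set gives the same test accuracy as the manual scale-then-fit route. `C=0.1` is used only because it is the value cross-validation favoured for L1 above, and `random_state=0` is an assumption added to make `liblinear` deterministic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 500-patient dataset; NOT the ICU data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=14, n_informative=6,
    weights=[0.8, 0.2], random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

def make_lr():
    # Same settings for both routes so the comparison is like for like.
    return LogisticRegression(C=0.1, penalty="l1", solver="liblinear",
                              max_iter=1000, random_state=0)

# Manual route: scale with training statistics, then fit and score.
scaler = StandardScaler().fit(X_tr)
manual = make_lr().fit(scaler.transform(X_tr), y_tr)
manual_acc = manual.score(scaler.transform(X_te), y_te)

# Pipeline route: the scaler is fitted on X_tr inside .fit() and reused in .score().
final_model = make_pipeline(StandardScaler(), make_lr()).fit(X_tr, y_tr)
pipe_acc = final_model.score(X_te, y_te)

print(manual_acc, pipe_acc)
```

The pipeline version has the practical advantage that new patients can be passed to `final_model.predict` unscaled, with no risk of forgetting the transform.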

In [17]:
Model_intercept = pd.DataFrame({"Variables":'Intercept',"Coefficients":logreg.intercept_[0]},index=[0])
Model_coefficients = pd.DataFrame({"Variables":X.columns,"Coefficients":np.transpose(logreg.coef_[0])})
Model_coefficients = pd.concat([Model_intercept,Model_coefficients]).reset_index(drop=True)
print(Model_coefficients)

             Variables  Coefficients
0            Intercept     -2.222448
1                  age      0.645888
2                  LOS      0.242947
3         previous_LOS      0.450976
4   previous_ICU_stays      0.165730
5           cvc_status     -0.103515
6         SIRS_48_hour      0.232399
7           MV_24_hour      0.488888
8         Initial_SOFA     -0.164343
9       Discharge_SOFA      1.134682
10            Max_SOFA      0.000000
11         AdmitApache     -0.170475
12     DischargeApache      0.746385
13               sex_F      0.000000
14       Type_Surgical     -0.237872


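The model coefficients are given in the table above. Because the features are standardized, each coefficient is the change in the log-odds of death per one-standard-deviation increase in that feature. The L1 penalty has shrunk the coefficients of Max_SOFA and sex_F to exactly zero, removing them from the model.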
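Logistic regression coefficients are easier to read as odds ratios. The following is a minimal sketch that exponentiates the coefficients copied from the table above; since the model was fitted on standardized features, each odds ratio refers to a one-standard-deviation increase in that feature, not a one-unit increase.

```python
import math

# Coefficients copied from the table above (fitted on standardized features).
coefs = {
    "age": 0.645888,
    "LOS": 0.242947,
    "previous_LOS": 0.450976,
    "previous_ICU_stays": 0.165730,
    "cvc_status": -0.103515,
    "SIRS_48_hour": 0.232399,
    "MV_24_hour": 0.488888,
    "Initial_SOFA": -0.164343,
    "Discharge_SOFA": 1.134682,
    "Max_SOFA": 0.000000,
    "AdmitApache": -0.170475,
    "DischargeApache": 0.746385,
    "sex_F": 0.000000,
    "Type_Surgical": -0.237872,
}

# exp(coefficient) = multiplicative change in the odds of death per +1 SD.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

# Features the L1 penalty removed (coefficient exactly zero, odds ratio 1).
dropped = [name for name, b in coefs.items() if b == 0.0]

# Print from strongest to weakest effect, regardless of sign.
for name in sorted(coefs, key=lambda n: abs(coefs[n]), reverse=True):
    print(f"{name:>20s}  odds ratio per +1 SD = {odds_ratios[name]:.3f}")
print("dropped by L1:", dropped)
```

An odds ratio above 1 means the feature raises the odds of death in this model, and one below 1 means it lowers them. These are associations within the fitted model, not causal effects.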

**(e) Provide the test confusion matrix.**

In [18]:
y_pred = logreg.predict(X_testStandard)

In [19]:
from sklearn.metrics import confusion_matrix

In [20]:
cnf_matrix = confusion_matrix(y_test, y_pred)
cnf_matrix

array([[79,  2],
       [ 6, 13]], dtype=int64)

True Positives (TP): 13
These are instances that belong to the positive class and are correctly predicted as positive.

True Negatives (TN): 79
These are instances that belong to the negative class and are correctly predicted as negative.

False Positives (FP): 2
These are instances that belong to the negative class but are incorrectly predicted as positive.

False Negatives (FN): 6
These are instances that belong to the positive class but are incorrectly predicted as negative.
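The mapping from matrix cells to TN, FP, FN and TP follows from scikit-learn's layout, where rows are true labels and columns are predicted labels. The following is a minimal sketch in plain Python that rebuilds label lists matching the reported counts (the lists are reconstructed for illustration; the ordering of patients is arbitrary) and tallies them the same way.

```python
# Label lists reconstructed to match the reported counts: 81 survivors, 19 deaths.
y_true_demo = [0] * 81 + [1] * 19
y_pred_demo = [0] * 79 + [1] * 2 + [0] * 6 + [1] * 13

# matrix[i][j] = number of patients with true label i and predicted label j
matrix = [[0, 0], [0, 0]]
for t, p in zip(y_true_demo, y_pred_demo):
    matrix[t][p] += 1

# Row 0 holds the true negatives and false positives; row 1 the false negatives and true positives.
(tn, fp), (fn, tp) = matrix

print(matrix)  # [[79, 2], [6, 13]]
print("TN, FP, FN, TP =", tn, fp, fn, tp)
```

With a NumPy array from `confusion_matrix`, the same unpacking is usually written as `tn, fp, fn, tp = cnf_matrix.ravel()`.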

In [21]:
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))

Accuracy: 0.92
Precision: 0.8666666666666667
Recall: 0.6842105263157895


The accuracy of the model is 92%. This metric represents the overall correctness of predictions and is calculated as the ratio of correctly predicted instances to the total number of instances.

Precision is 86.67%. Precision measures the accuracy of the positive predictions made by the model. In this case, 86.67% of the instances predicted as positive by the model were indeed true positives.

The recall of the model is 68.42%. Recall, also known as sensitivity, measures the model's ability to correctly identify all relevant instances of a particular class. In this context, the model correctly identified 68.42% of all actual positive instances.

The accuracy of 92% indicates that the large majority of the model's predictions are correct.
The precision of 86.67% means that when the model predicts a death, it is usually right (13 of 15 predicted deaths).
However, the recall of 68.42% means that the model misses 6 of the 19 patients who actually died. This is a concern in settings such as this one, where identifying every high-risk patient may matter more than avoiding false alarms.
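The reported metrics can be reproduced directly from the four confusion-matrix counts. The following is a minimal sketch in plain Python; specificity, F1 and the majority-class baseline are additions not computed in the cells above, included to put the 92% accuracy in context.

```python
# Counts from the test confusion matrix.
tn, fp, fn, tp = 79, 2, 6, 13
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of predicted deaths, the share that were real
recall = tp / (tp + fn)         # of real deaths, the share that were caught
specificity = tn / (tn + fp)    # of real survivors, the share that were cleared
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class ("discharged alive").
baseline_accuracy = (tn + fp) / total

print(f"accuracy    = {accuracy:.4f}")
print(f"precision   = {precision:.4f}")
print(f"recall      = {recall:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"F1          = {f1:.4f}")
print(f"baseline    = {baseline_accuracy:.4f}")
```

Since 81 of the 100 test patients survived, a model that never predicts death would already reach 81% accuracy. This is why recall and precision are more informative than accuracy alone for this imbalanced target.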