<h1>2. Predicting Hospital Readmission Using Logistic Regression</h1>
<h3><b> Preprocessing Steps:</b></h3>
<ul>
    <li>Handle missing values (e.g., fill missing values with mode for categorical variables).</li>
    <li>Encode categorical variables (e.g., one-hot encoding for hospital type, region, etc.).</li>
    <li>Standardize numerical features.</li>
</ul>
<h3><b> Task:</b> Implement logistic regression to predict hospital readmission and evaluate the model using precision, recall, and F1-score.</h3>

In [1]:
# Importing Libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

In [2]:
# Loading the dataset
hospital_dataset = pd.read_csv('..\\..\\Datasets\\HospitalReadmissions.csv')
print(hospital_dataset.shape, '\n')
hospital_dataset.head()

(25000, 17) 



Unnamed: 0,age,time_in_hospital,n_lab_procedures,n_procedures,n_medications,n_outpatient,n_inpatient,n_emergency,medical_specialty,diag_1,diag_2,diag_3,glucose_test,A1Ctest,change,diabetes_med,readmitted
0,[70-80),8,72,1,18,2,0,0,Missing,Circulatory,Respiratory,Other,no,no,no,yes,no
1,[70-80),3,34,2,13,0,0,0,Other,Other,Other,Other,no,no,no,yes,no
2,[50-60),5,45,0,18,0,0,0,Missing,Circulatory,Circulatory,Circulatory,no,no,yes,yes,yes
3,[70-80),2,36,0,12,1,0,0,Missing,Circulatory,Other,Diabetes,no,no,yes,yes,yes
4,[60-70),1,42,0,7,0,0,0,InternalMedicine,Other,Circulatory,Respiratory,no,no,no,yes,no


In [3]:
# Printing the basic statistics of the dataset
hospital_dataset.describe()

Unnamed: 0,time_in_hospital,n_lab_procedures,n_procedures,n_medications,n_outpatient,n_inpatient,n_emergency
count,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0
mean,4.45332,43.24076,1.35236,16.2524,0.3664,0.61596,0.1866
std,3.00147,19.81862,1.715179,8.060532,1.195478,1.177951,0.885873
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0
25%,2.0,31.0,0.0,11.0,0.0,0.0,0.0
50%,4.0,44.0,1.0,15.0,0.0,0.0,0.0
75%,6.0,57.0,2.0,20.0,0.0,1.0,0.0
max,14.0,113.0,6.0,79.0,33.0,15.0,64.0


In [4]:
# Printing information of the dataset
hospital_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   age                25000 non-null  object
 1   time_in_hospital   25000 non-null  int64 
 2   n_lab_procedures   25000 non-null  int64 
 3   n_procedures       25000 non-null  int64 
 4   n_medications      25000 non-null  int64 
 5   n_outpatient       25000 non-null  int64 
 6   n_inpatient        25000 non-null  int64 
 7   n_emergency        25000 non-null  int64 
 8   medical_specialty  25000 non-null  object
 9   diag_1             25000 non-null  object
 10  diag_2             25000 non-null  object
 11  diag_3             25000 non-null  object
 12  glucose_test       25000 non-null  object
 13  A1Ctest            25000 non-null  object
 14  change             25000 non-null  object
 15  diabetes_med       25000 non-null  object
 16  readmitted         25000 non-null  object

<h2>Data Preprocessing</h2>

<h3>1. Handling Missing Values</h3>

In [5]:
# Checking the missing values in the dataset
hospital_dataset.isnull().sum()

age                  0
time_in_hospital     0
n_lab_procedures     0
n_procedures         0
n_medications        0
n_outpatient         0
n_inpatient          0
n_emergency          0
medical_specialty    0
diag_1               0
diag_2               0
diag_3               0
glucose_test         0
A1Ctest              0
change               0
diabetes_med         0
readmitted           0
dtype: int64

-> Since no column contains null values, we can proceed to the next step, i.e., <b>Encoding Categorical Variables</b>.
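The null check above only counts NaN entries. The first rows printed earlier show the literal string 'Missing' in medical_specialty, which isnull() does not flag. A minimal sketch of a complementary placeholder check follows; the `sample` frame is a toy copy of two columns from those five rows, not the full dataset.

```python
import pandas as pd

# Toy frame copied from the first five rows shown earlier (two columns only).
sample = pd.DataFrame({
    "medical_specialty": ["Missing", "Other", "Missing", "Missing", "InternalMedicine"],
    "diag_1": ["Circulatory", "Other", "Circulatory", "Circulatory", "Other"],
})

# isnull() finds no NaN here, yet the placeholder string is still present.
nan_counts = sample.isnull().sum()
placeholder_counts = (sample == "Missing").sum()

print(nan_counts)
print(placeholder_counts)
```

The same comparison could be run on hospital_dataset. Whether 'Missing' is kept as its own category or imputed is a modelling choice this notebook does not address.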

<h3>2. Encoding Categorical Variables</h3>

In [6]:
# Separating the categorical and numerical features from the dataset
categorical_features = hospital_dataset.select_dtypes('object').columns
print("Categorical Variables:", len(categorical_features), '\n', categorical_features)

numerical_features = hospital_dataset.drop(columns=categorical_features)
print("\nNumerical Features:", numerical_features.shape[1], '\n', numerical_features.columns)

Categorical Variables: 10 
 Index(['age', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3',
       'glucose_test', 'A1Ctest', 'change', 'diabetes_med', 'readmitted'],
      dtype='object')

Numerical Features: 7 
 Index(['time_in_hospital', 'n_lab_procedures', 'n_procedures', 'n_medications',
       'n_outpatient', 'n_inpatient', 'n_emergency'],
      dtype='object')


In [7]:
# Printing the categories of the categorical variables
for category in categorical_features:
    print(category, ":", hospital_dataset[category].nunique(), ': ', hospital_dataset[category].unique())
    print()

age : 6 :  ['[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)']

medical_specialty : 7 :  ['Missing' 'Other' 'InternalMedicine' 'Family/GeneralPractice'
 'Cardiology' 'Surgery' 'Emergency/Trauma']

diag_1 : 8 :  ['Circulatory' 'Other' 'Injury' 'Digestive' 'Respiratory' 'Diabetes'
 'Musculoskeletal' 'Missing']

diag_2 : 8 :  ['Respiratory' 'Other' 'Circulatory' 'Injury' 'Diabetes' 'Digestive'
 'Musculoskeletal' 'Missing']

diag_3 : 8 :  ['Other' 'Circulatory' 'Diabetes' 'Respiratory' 'Injury' 'Musculoskeletal'
 'Digestive' 'Missing']

glucose_test : 3 :  ['no' 'normal' 'high']

A1Ctest : 3 :  ['no' 'normal' 'high']

change : 2 :  ['no' 'yes']

diabetes_med : 2 :  ['yes' 'no']

readmitted : 2 :  ['no' 'yes']



In [8]:
# Applying label encoding (keeps one column per feature, unlike one-hot encoding).
# Each category is replaced by an integer code assigned in sorted order of the labels.
encoder = LabelEncoder()

for category in categorical_features:
    hospital_dataset[category] = encoder.fit_transform(hospital_dataset[category])

In [9]:
# Printing the categories after encoding
for category in categorical_features:
    print(category, ":", hospital_dataset[category].nunique(), ': ', hospital_dataset[category].unique())
    print()

age : 6 :  [3 1 2 0 4 5]

medical_specialty : 7 :  [4 5 3 2 0 6 1]

diag_1 : 8 :  [0 6 3 2 7 1 5 4]

diag_2 : 8 :  [7 6 0 3 1 2 5 4]

diag_3 : 8 :  [6 0 1 7 3 5 2 4]

glucose_test : 3 :  [1 2 0]

A1Ctest : 3 :  [1 2 0]

change : 2 :  [0 1]

diabetes_med : 2 :  [1 0]

readmitted : 2 :  [0 1]



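-> All ten categorical columns, including the target readmitted, are now integer-coded. The codes follow the alphabetical order of the category names, so for nominal columns such as diag_1 the numeric order carries no meaning.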
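The preprocessing steps at the top of this section mention one-hot encoding, while the cell above uses label encoding. For comparison, here is a minimal sketch of the one-hot alternative using pandas get_dummies. The `sample` frame is a toy copy of three columns from the first five rows, and treating age as ordinal is an assumption made for illustration, not something the notebook does.

```python
import pandas as pd

# Toy frame copied from the first five rows of the dataset (three columns only).
sample = pd.DataFrame({
    "age": ["[70-80)", "[70-80)", "[50-60)", "[70-80)", "[60-70)"],
    "diag_1": ["Circulatory", "Other", "Circulatory", "Circulatory", "Other"],
    "change": ["no", "no", "yes", "yes", "no"],
})

# Age bands are ordered, so mapping them to integers keeps a meaningful order.
age_order = ["[40-50)", "[50-60)", "[60-70)", "[70-80)", "[80-90)", "[90-100)"]
sample["age"] = sample["age"].map({band: i for i, band in enumerate(age_order)})

# Nominal columns get one indicator column per category instead of an integer code.
encoded = pd.get_dummies(sample, columns=["diag_1", "change"], dtype=int)
print(encoded.columns.tolist())
```

One-hot encoding increases the number of columns, which is the trade-off the label-encoding cell avoids.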

<h3>3. Standardizing Numerical Features</h3>

In [10]:
# Applying standardization
scaler = StandardScaler()

# Scaling all numerical columns in a single call
num_cols = numerical_features.columns
hospital_dataset[num_cols] = scaler.fit_transform(hospital_dataset[num_cols])

In [11]:
# Printing the basic statistics of the dataset
hospital_dataset.describe().round(2)

Unnamed: 0,age,time_in_hospital,n_lab_procedures,n_procedures,n_medications,n_outpatient,n_inpatient,n_emergency,medical_specialty,diag_1,diag_2,diag_3,glucose_test,A1Ctest,change,diabetes_med,readmitted
count,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0
mean,2.34,0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,3.46,3.3,3.33,3.14,1.0,0.94,0.46,0.77,0.47
std,1.32,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.43,2.83,2.91,2.84,0.23,0.4,0.5,0.42,0.5
min,0.0,-1.15,-2.13,-0.79,-1.89,-0.31,-0.52,-0.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,-0.82,-0.62,-0.79,-0.65,-0.31,-0.52,-0.21,3.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
50%,2.0,-0.15,0.04,-0.21,-0.16,-0.31,-0.52,-0.21,4.0,3.0,3.0,2.0,1.0,1.0,0.0,1.0,0.0
75%,3.0,0.52,0.69,0.38,0.46,-0.31,0.33,-0.21,4.0,6.0,6.0,6.0,1.0,1.0,1.0,1.0,1.0
max,5.0,3.18,3.52,2.71,7.78,27.3,12.21,72.04,6.0,7.0,7.0,7.0,2.0,2.0,1.0,1.0,1.0


-> Each numerical feature now has a mean of 0 and a standard deviation of 1, so the dataset is ready for model training.
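The scaler above is fitted on all 25,000 rows before the train/test split, so the test rows contribute to the means and standard deviations. A common alternative is to fit the scaler on the training split only. The sketch below shows that pattern with a scikit-learn pipeline on synthetic data; the arrays `X` and `y` are made-up stand-ins, not the hospital data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical columns; not the hospital data.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_train only, then reuses those training
# means and standard deviations when it transforms X_test.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
print(scaler.mean_)                  # means learned from the training split
print(model.score(X_test, y_test))   # accuracy on the held-out split
```

With this arrangement the test split stays unseen until evaluation, at the cost of the scaled test features no longer having exactly zero mean.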

<h2>Model Training</h2>

In [12]:
# Separating features and target variable
X = hospital_dataset.drop('readmitted', axis=1)
Y = hospital_dataset['readmitted']

# Splitting the dataset into train and test data in 80/20 ratio
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [13]:
# Implementing the model
lr_model = LogisticRegression()
lr_model.fit(X_train, Y_train)

In [14]:
# Predicting the target variable
Y_pred = lr_model.predict(X_test)
Y_pred.shape

(5000,)

<h2>Model Evaluation</h2>

<h3>1. Precision Score</h3>

In [15]:
# Calculating the precision score of the model
precision = precision_score(Y_test, Y_pred)
print("Precision Score of the model:", precision)

Precision Score of the model: 0.6278916060806345


<h3>2. Recall Score</h3>

In [16]:
# Calculating the recall score of the model
recall = recall_score(Y_test, Y_pred)
print("Recall Score of the model:", recall)

Recall Score of the model: 0.40563620836891545


<h3>3. F1-Score</h3>

In [17]:
# Calculating the f1 score of the model
f1 = f1_score(Y_test, Y_pred)
print("F1 Score of the model:", f1)

F1 Score of the model: 0.49286640726329445


<h3>The Classification Report</h3>

In [18]:
# Printing the classification report of the model
print('Classification Report:\n\n', classification_report(Y_test, Y_pred))

Classification Report:

               precision    recall  f1-score   support

           0       0.60      0.79      0.68      2658
           1       0.63      0.41      0.49      2342

    accuracy                           0.61      5000
   macro avg       0.61      0.60      0.59      5000
weighted avg       0.61      0.61      0.59      5000
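
The three scores above all derive from the same confusion-matrix counts. The notebook never prints a confusion matrix, so the counts below are inferred from the reported scores and the supports (2658 and 2342); treat them as a worked check rather than notebook output.

```python
# Counts for class 1 (readmitted), inferred from the printed scores and supports.
tp, fp, fn, tn = 950, 563, 1392, 2095

precision = tp / (tp + fp)                           # share of predicted readmissions that are correct
recall = tp / (tp + fn)                              # share of actual readmissions that are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.63 0.41 0.49 0.61
```

Under these counts, 1392 of the 2342 actual readmissions in the test split are predicted as class 0.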



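-> The logistic regression model performs moderately: for readmissions (class 1) it reaches a precision of 0.63 and a recall of 0.41, giving an F1 score of 0.49. It identifies non-readmissions (class 0, recall 0.79) far better than readmissions, so it misses roughly 59% of the patients who were actually readmitted. Its accuracy of 0.61 is only modestly above the 0.53 that always predicting class 0 would achieve on this test split (2658 of 5000). This gap suggests the need for further tuning or additional features to improve recall and overall performance.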
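
One form of tuning that targets recall directly is lowering the decision threshold applied to the predicted probabilities. The sketch below illustrates this on synthetic data from make_classification, not the hospital dataset; the 0.3 cut-off is an arbitrary value chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary data as a stand-in; not the hospital dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # estimated probability of class 1

# predict() corresponds to a 0.5 cut-off; a lower cut-off flags more cases as positive.
recall_default = recall_score(y_test, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_test, (proba >= 0.3).astype(int))
print(recall_default, recall_lowered)
```

Lowering the threshold can only keep or raise recall, but it usually lowers precision, so the cut-off is best chosen on a validation split rather than on the test data.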

<hr>