---

# Predictive Analysis for Patient's Re-admission in Hospital

> **We propose a predictive-analytics approach for the ‘Un-Healthy Patient’.**

---

```
Using ML algorithms, we predict whether a patient will need to be readmitted to the hospital in the future, based on a predictive analysis of the patient’s past and present reports.
```

---

In [None]:
# Required libraries
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from pandas_profiling import ProfileReport  # newer releases are published as ydata-profiling
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,
)

---

The process of collecting data has finished and we are now examining the patient dataset contained in the comma-separated file to determine its content.

---

In [2]:
data1 = pd.read_csv("framingham5k.csv") #path for dataset file
data1.head(10)

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,target
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0
5,0,43,2.0,0,0.0,0.0,0,1,0,228.0,180.0,110.0,30.3,77.0,99.0,0
6,0,63,1.0,0,0.0,0.0,0,0,0,205.0,138.0,71.0,33.11,60.0,85.0,1
7,0,45,2.0,1,20.0,0.0,0,0,0,313.0,100.0,71.0,21.68,79.0,78.0,0
8,1,52,1.0,0,0.0,0.0,0,1,0,260.0,141.5,89.0,26.36,76.0,79.0,0
9,1,43,1.0,1,30.0,0.0,0,1,0,225.0,162.0,107.0,23.61,93.0,88.0,0


---


The **data1.head()** method in Pandas displays the first rows of the dataset; similarly, **data1.tail()** displays the last rows.


---

In [3]:
data1.tail(10)

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,target
4230,0,56,1.0,1,3.0,0.0,0,1,0,268.0,170.0,102.0,22.89,57.0,,0
4231,1,58,3.0,0,0.0,0.0,0,1,0,187.0,141.0,81.0,24.96,80.0,81.0,0
4232,1,68,1.0,0,0.0,0.0,0,1,0,176.0,168.0,97.0,23.14,60.0,79.0,1
4233,1,50,1.0,1,1.0,0.0,0,1,0,313.0,179.0,92.0,25.97,66.0,86.0,1
4234,1,51,3.0,1,43.0,0.0,0,0,0,207.0,126.5,80.0,19.71,65.0,68.0,0
4235,0,48,2.0,1,20.0,,0,0,0,248.0,131.0,72.0,22.0,84.0,86.0,0
4236,0,44,1.0,1,15.0,0.0,0,0,0,210.0,126.5,87.0,19.16,86.0,,0
4237,0,52,2.0,0,0.0,0.0,0,0,0,269.0,133.5,83.0,21.47,80.0,107.0,0
4238,1,40,3.0,0,0.0,0.0,0,1,0,185.0,141.0,98.0,25.6,67.0,72.0,0
4239,0,39,3.0,1,30.0,0.0,0,0,0,196.0,133.0,86.0,20.91,85.0,80.0,0


---


The **"data1.describe()"** method in Pandas is used to generate descriptive statistics of a dataframe. It provides a summary of the central tendency, dispersion, and shape of the distribution of a set of continuous variables.


---

In [4]:
data1.describe()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,target
count,4240.0,4240.0,4135.0,4240.0,4211.0,4187.0,4240.0,4240.0,4240.0,4190.0,4240.0,4240.0,4221.0,4239.0,3852.0,4240.0
mean,0.429245,49.580189,1.979444,0.494104,9.005937,0.029615,0.005896,0.310613,0.025708,236.699523,132.354599,82.897759,25.800801,75.878981,81.963655,0.151887
std,0.495027,8.572942,1.019791,0.500024,11.922462,0.169544,0.076569,0.462799,0.15828,44.591284,22.0333,11.910394,4.07984,12.025348,23.954335,0.358953
min,0.0,32.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,107.0,83.5,48.0,15.54,44.0,40.0,0.0
25%,0.0,42.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,206.0,117.0,75.0,23.07,68.0,71.0,0.0
50%,0.0,49.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,234.0,128.0,82.0,25.4,75.0,78.0,0.0
75%,1.0,56.0,3.0,1.0,20.0,0.0,0.0,1.0,0.0,263.0,144.0,90.0,28.04,83.0,87.0,0.0
max,1.0,70.0,4.0,1.0,70.0,1.0,1.0,1.0,1.0,696.0,295.0,142.5,56.8,143.0,394.0,1.0


---

The **"data1.info()"** method in Pandas is used to get a summary of the dataframe's structure, including its index dtype, column dtypes, non-null values, and memory usage.

---

In [5]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4240 non-null   int64  
 1   age              4240 non-null   int64  
 2   education        4135 non-null   float64
 3   currentSmoker    4240 non-null   int64  
 4   cigsPerDay       4211 non-null   float64
 5   BPMeds           4187 non-null   float64
 6   prevalentStroke  4240 non-null   int64  
 7   prevalentHyp     4240 non-null   int64  
 8   diabetes         4240 non-null   int64  
 9   totChol          4190 non-null   float64
 10  sysBP            4240 non-null   float64
 11  diaBP            4240 non-null   float64
 12  BMI              4221 non-null   float64
 13  heartRate        4239 non-null   float64
 14  glucose          3852 non-null   float64
 15  target           4240 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 530.1 KB


---

#### **EDA Process (Exploratory Data Analysis)**


Exploratory Data Analysis (EDA) is a process of analyzing and summarizing a dataset in order to understand its underlying structure, relationships between variables, and overall characteristics. The goal of EDA is to gain insights into the data and identify patterns and trends that can help inform further analysis and modeling.

EDA often involves visualizing the data using graphs and plots, calculating summary statistics, and finding relationships between variables. It helps to identify potential issues with the data, such as outliers or missing values, and gives a better understanding of the data before building predictive models or conducting hypothesis testing.
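
The checks described above can be sketched on a small frame. This is only an illustration: the values are made up (the column names mirror the dataset), and the 1.5 × IQR rule used to flag outliers is one common convention rather than something this notebook prescribes.

```python
import pandas as pd

# Made-up values for illustration only; column names mirror the dataset.
toy = pd.DataFrame({
    "age":    [39, 46, 48, 61, 46, 43],
    "sysBP":  [106.0, 121.0, 127.5, 150.0, 130.0, 300.0],
    "target": [0, 0, 0, 1, 0, 0],
})

# Summary statistics and pairwise (Pearson) correlations.
print(toy.describe().loc[["mean", "std"]])
print(toy.corr().round(2))

# Flag sysBP outliers with the common 1.5 * IQR rule.
q1, q3 = toy["sysBP"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["sysBP"] < q1 - 1.5 * iqr) | (toy["sysBP"] > q3 + 1.5 * iqr)]
print(outliers["sysBP"].tolist())
```

Here only the deliberately extreme reading of 300.0 falls outside the IQR fences.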

---

In [None]:
profile = ProfileReport(data1, title="Pandas Profiling Report")
profile.to_file("report.html")

---

To find the missing values in the dataset, **data1.isnull().sum()** returns a Series containing the number of missing (null) values in each column of the dataframe, and **data1.isna().sum().sum()** adds those per-column counts into a single total for the whole dataframe.
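
As a small illustration of the difference between the per-column count and the overall total, here is a sketch on a toy frame. The values are made up and the names `toy`, `per_column`, and `total` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing entry.
toy = pd.DataFrame({
    "age": [39, 46, 48],
    "glucose": [77.0, np.nan, 70.0],
    "BPMeds": [np.nan, 0.0, np.nan],
})

per_column = toy.isna().sum()   # Series: missing count per column
total = int(per_column.sum())   # scalar: missing count for the whole frame

print(per_column.to_dict())     # {'age': 0, 'glucose': 1, 'BPMeds': 2}
print(total)                    # 3
```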

---

In [7]:
# Missing values
print("Missing values:")
print(data1.isnull().sum())
print("\nTotal missing Values:  ",data1.isna().sum().sum())

Missing values:
male                 0
age                  0
education          105
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
target               0
dtype: int64

Total missing Values:   645


---

**KNN imputation** is a method of imputing missing values in a dataset using the k-nearest neighbors algorithm. The idea behind **KNN imputation** is to fill in missing values with the average value of the k-nearest neighbors of the data point with missing values.

By handling missing values in a dataset, the potential biases and inaccuracies can be minimized, allowing for more accurate and interpretable machine learning models to be developed.
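
To make the idea concrete, here is a minimal sketch on four made-up rows, assuming scikit-learn's `KNNImputer` with its default uniform weighting. The last row is missing its third value, and its two nearest neighbours (measured on the columns that are present) are the first two rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up rows: [age, sysBP, glucose]; the last row is missing glucose.
rows = np.array([
    [40.0, 120.0, 80.0],
    [41.0, 122.0, 90.0],
    [70.0, 180.0, 150.0],
    [40.0, 121.0, np.nan],
])

imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(rows)

# The gap is filled with the mean glucose of the two closest rows: (80 + 90) / 2.
print(filled[3, 2])  # 85.0
```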

---

In [8]:
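# Fill each missing entry with the mean of that feature over the 5 nearest rows.
data = data1.to_numpy()
imputer = KNNImputer(n_neighbors=5)
imputed_data = imputer.fit_transform(data)
imputed_df = pd.DataFrame(imputed_data, columns=data1.columns)
# Save the imputed dataset and reload it for the modelling steps.
imputed_df.to_csv("framingham5k_New.csv", index=False)
df = pd.read_csv("framingham5k_New.csv")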

In [9]:
# Check whether all the missing values were imputed by the KNN imputation.
print("Missing values:")
print(df.isnull().sum())

Missing values:
male               0
age                0
education          0
currentSmoker      0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
target             0
dtype: int64


In [10]:
data = pd.read_csv("framingham5k_New.csv")
data.head(7)

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,target
0,1.0,39.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,195.0,106.0,70.0,26.97,80.0,77.0,0.0
1,0.0,46.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,250.0,121.0,81.0,28.73,95.0,76.0,0.0
2,1.0,48.0,1.0,1.0,20.0,0.0,0.0,0.0,0.0,245.0,127.5,80.0,25.34,75.0,70.0,0.0
3,0.0,61.0,3.0,1.0,30.0,0.0,0.0,1.0,0.0,225.0,150.0,95.0,28.58,65.0,103.0,1.0
4,0.0,46.0,3.0,1.0,23.0,0.0,0.0,0.0,0.0,285.0,130.0,84.0,23.1,85.0,85.0,0.0
5,0.0,43.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,228.0,180.0,110.0,30.3,77.0,99.0,0.0
6,0.0,63.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,205.0,138.0,71.0,33.11,60.0,85.0,1.0


In [11]:
# In the output, "1" is the count of readmission cases and "0" is the count of patients with a low chance of being readmitted.

data['target'].value_counts()

0.0    3596
1.0     644
Name: target, dtype: int64

In [12]:
data['target'].value_counts().plot(kind='bar')

<AxesSubplot: >

---
```
We are evaluating the accuracy of the algorithms.
```
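
Because the two classes are imbalanced, it may help to keep a reference point in mind when reading the accuracy scores. The sketch below uses the class counts from the `value_counts()` output above and computes the accuracy that a trivial predictor, one that always outputs the majority class, would reach on the full dataset. The variable names are hypothetical.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 3596, 1: 644}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))  # 0 0.848
```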
---

1. We are evaluating the accuracy of the **Decision Tree**.

---

In [13]:
# Separate the features (X) from the target (y), then split into training and test sets.
X = data.drop("target", axis=1)
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
X_train.shape, X_test.shape

((3180, 15), (1060, 15))

In [14]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

In [15]:
# Making Predictions with Our Model
predictions = model.predict(X_test)

# Measuring the accuracy of our model
print("Accuracy score is: ",accuracy_score(y_test, predictions))

Accuracy score is:  0.7632075471698113


---

2. We are evaluating the accuracy of the **Naive Bayes** algorithm.

---

In [38]:
model = GaussianNB()
cv_scores = cross_val_score(model, X, y, cv=5)

In [39]:
predict_train = model.fit(X_train, y_train).predict(X_train)

# predict the target on the test dataset
predict_test = model.predict(X_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(y_test,predict_test)
print('Accuracy score is: ', accuracy_test)

Accuracy score is:  0.8320754716981132


---

The accuracy score on the test data is **83.2%**.

---

In [40]:
print(classification_report(y_test,predict_test))

              precision    recall  f1-score   support

         0.0       0.87      0.94      0.91       905
         1.0       0.37      0.21      0.27       155

    accuracy                           0.83      1060
   macro avg       0.62      0.58      0.59      1060
weighted avg       0.80      0.83      0.81      1060



---

3. We are evaluating the accuracy of **K-Nearest Neighbors (KNN)**.

---

In [19]:
#Create KNN Object.
knn = KNeighborsClassifier()
#Create x and y variables.
x = df.drop(columns=['target'])
y = df['target']


In [20]:
#Split data into training and testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)
#Training the model.
knn.fit(x_train, y_train)


# Generate predictions on the test set
y_pred = knn.predict(x_test)
# Check the performance of our model with a classification report.
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.85      0.95      0.89       719
         1.0       0.17      0.06      0.09       129

    accuracy                           0.81       848
   macro avg       0.51      0.50      0.49       848
weighted avg       0.75      0.81      0.77       848



In [21]:
# Measuring the accuracy of our model
print('Accuracy score is: ',accuracy_score(y_test, y_pred))

Accuracy score is:  0.8113207547169812


---

The accuracy score on the test data is **81.1%**.

---

---

4. We are evaluating the accuracy of the **Random Forest**.

---

In [22]:
# Separate the features (X) from the target (y), then split into training and test sets.
X = data.drop("target", axis=1)
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
X_train.shape, X_test.shape

((3180, 15), (1060, 15))

In [23]:
#applying random forest algorithm
model = RandomForestClassifier()
model.fit(X_train, y_train)

In [24]:
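# Predict with the model
y_pred = model.predict(X_test)

# Performance evaluation metrics. classification_report expects (y_true, y_pred);
# the report recorded below was produced with the two arguments reversed, so its
# per-class precision and recall are swapped and "support" counts predictions.
print(classification_report(y_test, y_pred))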

              precision    recall  f1-score   support

         0.0       0.99      0.86      0.92      1047
         1.0       0.05      0.62      0.10        13

    accuracy                           0.86      1060
   macro avg       0.52      0.74      0.51      1060
weighted avg       0.98      0.86      0.91      1060



In [25]:
# Measuring the accuracy of our model
print('Accuracy score is: ',accuracy_score(y_test, y_pred))

Accuracy score is:  0.8566037735849057


---

The random forest algorithm resulted in an accuracy of **85.7%**.

---

---

##### We are evaluating the accuracy after tuning the hyperparameters of the algorithms mentioned above.

---

---

Adjusting the parameters of the Decision Tree by applying hyperparameter tuning

---

In [26]:
# Define the hyperparameters to tune
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'max_features': ['sqrt', 'log2']
}

In [27]:
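# Define the grid search that performs the hyperparameter tuning.
# A fresh DecisionTreeClassifier is passed explicitly: `model` is reassigned in
# several cells, so relying on it would tune whichever estimator ran last.
grid_search = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=param_grid, cv=5, scoring='accuracy')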


In [28]:
# Fit the grid search to the data
grid_search.fit(X, y)

---
###### **GridSearchCV evaluates every combination of the candidate values to find the best parameters, so its running time grows quickly as more parameters and values are added to the grid.**
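
For a sense of scale, the number of combinations in the decision-tree grid defined above can be counted with scikit-learn's `ParameterGrid`. This is only a sketch; the names `n_candidates` and `n_fits` are hypothetical, and the fold count of 5 matches the `cv=5` setting used in the search.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the decision-tree search above.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'max_features': ['sqrt', 'log2'],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 8 * 9 * 10 * 2
n_fits = n_candidates * 5                      # cv=5 folds per candidate

print(n_candidates, n_fits)  # 2880 14400
```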
---

In [29]:
# Print the best parameters
print("Best parameters: ", grid_search.best_params_)

Best parameters:  {'criterion': 'entropy', 'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 5, 'min_samples_split': 2}


In [30]:
# Get the best model
best_model = grid_search.best_estimator_

In [31]:
# Predict with the best model on the full dataset X (the same data the search was fitted on)
y_pred = best_model.predict(X)

In [32]:
# Calculate the accuracy of the best model (measured on the data used for fitting, not on a held-out test set)
acc = accuracy_score(y, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.8655660377358491


---

Tuning the hyperparameters for the Naive Bayes algorithm

---

In [41]:
cv_method = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=999)

# 100 candidate values for var_smoothing, log-spaced from 1 down to 1e-9
params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}

gs_NB = GridSearchCV(estimator=model, 
                     param_grid=params_NB, 
                     cv=cv_method,
                     verbose=1, 
                     scoring='accuracy')

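# Note: the search is fitted on the transformed test split, so the accuracy
# reported below is measured on the same rows that were used for tuning.
Data_transformed = PowerTransformer().fit_transform(X_test)

gs_NB.fit(Data_transformed, y_test)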

Fitting 15 folds for each of 100 candidates, totalling 1500 fits


In [42]:
# predict the target on the test dataset
predict_test = gs_NB.predict(Data_transformed)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(y_test,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

accuracy_score on test dataset :  0.8424528301886792


---

Tuning the hyperparameters for the Random Forest

---

In [35]:
param_grid = {
    'n_estimators': [25, 50, 100, 150],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [3, 6, 9],
    'max_leaf_nodes': [3, 6, 9],
}

In [36]:
random_search = RandomizedSearchCV(RandomForestClassifier(),
								param_grid)
random_search.fit(X_train, y_train)
print(random_search.best_estimator_)


RandomForestClassifier(max_depth=6, max_features='log2', max_leaf_nodes=6)


In [37]:
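# Rebuild the forest with the best parameters printed above.
model_random = RandomForestClassifier(max_depth=6,
                                      max_features='log2',
                                      max_leaf_nodes=6,
                                      n_estimators=100)
model_random.fit(X_train, y_train)
# Predict with the tuned model. The report recorded below is identical to the one
# in In [24], i.e. it reflects the untuned `model` with the arguments to
# classification_report reversed; rerun this cell for the tuned forest's metrics.
y_pred_rand = model_random.predict(X_test)
print(classification_report(y_test, y_pred_rand))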

              precision    recall  f1-score   support

         0.0       0.99      0.86      0.92      1047
         1.0       0.05      0.62      0.10        13

    accuracy                           0.86      1060
   macro avg       0.52      0.74      0.51      1060
weighted avg       0.98      0.86      0.91      1060



---

Tuning the hyperparameters for the KNN

Hyperparameter tuning on KNN takes considerably more time as the dataset grows. KNN is computationally expensive at prediction time because each query is compared against the stored training points, and a grid search repeats that work for every candidate and every fold, which can make the tuning process slow. For KNN, tuning may not always be necessary, and a simpler model without hyperparameter tuning may be sufficient for many problems.
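
To illustrate the cost, the sketch below counts the candidates in the KNN search space used in the grid-search cell below and then shows one possible cheaper alternative, `RandomizedSearchCV`, which samples a fixed number of candidates instead of trying all of them. The data here is a synthetic stand-in from `make_classification`, and `n_iter=10`, `cv=5`, and the `_demo` names are assumptions chosen for the example, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Same search space as the grid-search cell below.
hyperparameters = {
    'leaf_size': list(range(1, 50)),
    'n_neighbors': list(range(1, 30)),
    'p': [1, 2],
}
n_candidates = len(ParameterGrid(hyperparameters))  # 49 * 29 * 2
print(n_candidates, n_candidates * 10)              # 2842 candidates, 28420 fits at cv=10

# Synthetic stand-in data (an assumption; not the patient dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Randomized search samples only n_iter candidates instead of all of them.
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    hyperparameters,
    n_iter=10,
    cv=5,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With these settings the randomized search performs 10 × 5 = 50 fits, at the price of possibly missing the best combination in the full grid.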

---

In [None]:
# List the hyperparameters that you want to tune
leaf_size = list(range(1, 50))
n_neighbors = list(range(1, 30))
p = [1, 2]

# Convert the hyperparameters to a dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)



# Create a new KNN object
knn_2 = KNeighborsClassifier()

# Use grid search to find the best hyperparameters
clf = GridSearchCV(knn_2, hyperparameters, cv=10)

# Fit the model
best_model = clf.fit(x, y)

# Print the best hyperparameters
print('Best leaf_size:', best_model.best_estimator_.get_params()['leaf_size'])
print('Best p:', best_model.best_estimator_.get_params()['p'])
print('Best n_neighbors:', best_model.best_estimator_.get_params()['n_neighbors'])