## Module 11 Assignment

### By Niharika Madhadi

This dataset contains the medical records of 299 patients with heart failure, collected during their follow-up period. Each patient profile has 13 clinical features: 12 predictors and the target.

The features are:
- age: age of the patient (years)
- anaemia: decrease of red blood cells or hemoglobin (boolean)
- creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
- diabetes: if the patient has diabetes (boolean)
- ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
- high blood pressure: if the patient has hypertension (boolean)
- platelets: platelets in the blood (kiloplatelets/mL)
- sex: woman or man (binary)
- serum creatinine: level of serum creatinine in the blood (mg/dL)
- serum sodium: level of serum sodium in the blood (mEq/L)
- smoking: if the patient smokes or not (boolean)
- time: follow-up period (days)
- [target] death event: if the patient died during the follow-up period (boolean)

Reference:
https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records

In [1]:
#install ucimlrepo
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from ucimlrepo import fetch_ucirepo

# fetch dataset
heart_failure_clinical_records = fetch_ucirepo(id=519)

# data (as pandas dataframes)
X = heart_failure_clinical_records.data.features
y = heart_failure_clinical_records.data.targets



In [3]:
print("The information of features in the dataset\n")
print(X.info())
print("The statistical values of the dataset\n")
print(X.describe())
print("Checking if the dataset has any null values\n")
print(X.isnull().sum())

The information of features in the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
dtypes: float64(3), int64(9)
memory usage: 28.2 KB
None
The statistical values of

In [4]:
# Hold out 20% of the patients as a test set; the fixed seed makes the split reproducible
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

In [5]:
# SelectFromModel with a RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
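# Flatten the single-column target DataFrame to the 1-D array scikit-learn expects
y_train_flat = y_train.values.ravel()

# With no explicit threshold, features at or above the mean importance are kept
selections = SelectFromModel(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42)
).fit(X_train, y_train_flat)
selected_feats = X_train.columns[selections.get_support()]
selected_feats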

Index(['ejection_fraction', 'serum_creatinine', 'time'], dtype='object')
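
SelectFromModel kept only three features here. When no `threshold` is passed and the estimator is a tree ensemble, scikit-learn keeps the features whose importance is at least the mean importance. The sketch below illustrates that rule on synthetic data; `make_classification` and the `X_demo`/`y_demo` names are stand-ins (assumptions) for the heart failure data, so the columns it selects carry no clinical meaning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 299 rows and 12 features, like X in the notebook
X_demo, y_demo = make_classification(n_samples=299, n_features=12,
                                     n_informative=3, random_state=42)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42)
).fit(X_demo, y_demo)

# Importances of the forest that was fitted inside the selector
importances = selector.estimator_.feature_importances_

# Reproduce the default rule by hand: keep features at or above the mean importance
manual_mask = importances >= importances.mean()

print("threshold used :", selector.threshold_)
print("mean importance:", importances.mean())
print("same selection :", np.array_equal(manual_mask, selector.get_support()))
```

Passing an explicit `threshold` (for example `"median"`) would keep more or fewer features than the three reported above.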

In [6]:
# Recursive Feature Elimination with Cross-Validation (RFECV)

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

y_train_flat = y_train.values.ravel()
selects = RFECV(RandomForestClassifier(n_estimators=100, random_state=42), step=1, cv=5)
selects.fit(X_train, y_train_flat)
selected_feats = X_train.columns[(selects.get_support())]
print(selected_feats)

X_train_final = selects.transform(X_train)
X_test_final = selects.transform(X_test)

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time'],
      dtype='object')
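
RFECV retained all 12 features in this run, so `X_train_final` and `X_test_final` hold the same columns as `X_train` and `X_test`. The fitted selector exposes `n_features_`, `support_` and `ranking_`, which show how many features were kept and in what order the rest were dropped. A minimal sketch on synthetic data (an assumption; in the notebook the same attributes would be read from `selects`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data (assumption), kept small so the sketch runs quickly
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, random_state=0)

rfecv = RFECV(RandomForestClassifier(n_estimators=25, random_state=0),
              step=1, cv=3)
rfecv.fit(X_demo, y_demo)

# Number of features at the best cross-validated score
print("features kept:", rfecv.n_features_)
# Rank 1 marks a kept feature; higher ranks were eliminated earlier
print("ranking      :", rfecv.ranking_)

# transform() keeps only the columns where support_ is True
X_reduced = rfecv.transform(X_demo)
print("reduced shape:", X_reduced.shape)
```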


In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Flatten the single-column target DataFrames to 1-D arrays
y_train_flat = y_train.values.ravel()
y_test_flat = y_test.values.ravel()

model = LogisticRegression(class_weight='balanced', solver='liblinear')
model.fit(X_train_final, y_train_flat)
predictions = model.predict(X_test_final)

print(f'Training Score: {model.score(X_train_final, y_train_flat)}')
print(f'Test Score: {model.score(X_test_final, y_test_flat)}')
print(f'Accuracy Score :{accuracy_score(y_test_flat,predictions)}')
print(f'Precision : {precision_score(y_test_flat,predictions)}')
print(f'Recall Score :{recall_score(y_test_flat,predictions)}')

Training Score: 0.8326359832635983
Test Score: 0.8
Accuracy Score :0.8
Precision : 0.8421052631578947
Recall Score :0.64
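
The three test metrics are linked through the confusion matrix. With 299 rows and `test_size=0.20` the test set has 60 patients, and the printed scores are consistent with 16 true positives, 3 false positives, 9 false negatives and 32 true negatives. These counts are inferred from the scores rather than printed by the notebook, so treat them as an assumption.

```python
# Confusion-matrix counts inferred from the printed scores (assumption, not notebook output)
tp, fp, fn, tn = 16, 3, 9, 32

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted deaths that were real deaths
recall = tp / (tp + fn)                     # share of real deaths that the model caught

print(f"Accuracy : {accuracy}")
print(f"Precision: {precision}")
print(f"Recall   : {recall}")
```

Read this way, a recall of 0.64 means 9 of the 25 patients who died in the test set were predicted to survive.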


In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

hyperparameters = {
            'n_estimators': [50, 200],
            'criterion': ['entropy', 'gini'],
            'max_depth': [3, 4],
            'max_leaf_nodes': [7, 9],
            'bootstrap': [True, False]
            }

grid_search = GridSearchCV(estimator = RandomForestClassifier(),
                           param_grid = hyperparameters,
                           scoring = 'accuracy',
                           cv = 10)

grid_search = grid_search.fit(X_train, y_train_flat)

best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

print('best accuracy', best_accuracy)
print('best parameters', best_parameters)

best accuracy 0.8954710144927536
best parameters {'bootstrap': True, 'criterion': 'entropy', 'max_depth': 3, 'max_leaf_nodes': 7, 'n_estimators': 200}
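
For a sense of the search cost: `GridSearchCV` tries every combination in `param_grid` and fits each one once per fold. The sketch below counts those fits for the grid above using only the standard library; `cv_folds` is just a local name for the `cv = 10` setting.

```python
from itertools import product

# Same grid as in the cell above
hyperparameters = {
    'n_estimators': [50, 200],
    'criterion': ['entropy', 'gini'],
    'max_depth': [3, 4],
    'max_leaf_nodes': [7, 9],
    'bootstrap': [True, False],
}
cv_folds = 10

# Every combination of one value per hyperparameter
combinations = list(product(*hyperparameters.values()))
total_fits = len(combinations) * cv_folds

print("combinations         :", len(combinations))
print("cross-validation fits:", total_fits)
```

Note that `best_score_` is the mean cross-validated accuracy on the training data, so it is not directly comparable to an accuracy measured on the held-out test set.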


model = RandomForestClassifier(
            n_estimators=200,
            criterion='gini',
            max_depth=3,
            max_leaf_nodes=7,
            bootstrap=True
            )

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

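# Hyperparameters entered by hand. Note that criterion='gini' differs from the
# criterion='entropy' reported in best_parameters by the grid search.
model = RandomForestClassifier(n_estimators=200,
                               criterion='gini',
                               max_depth=3,
                               max_leaf_nodes=7,
                               bootstrap=True,
                               random_state=42)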

model.fit(X_train_final, y_train_flat)
predictions = model.predict(X_test_final)
print(accuracy_score(y_test_flat, predictions))

0.75


In [11]:
# Build the final RandomForestClassifier using set_params and best_parameters; report its scores
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# ** unpacks the best_parameters dict into keyword arguments for set_params
model = RandomForestClassifier(random_state=42).set_params(**best_parameters)
model.fit(X_train_final, y_train_flat)
predictions = model.predict(X_test_final)
print(f'Training Score: {model.score(X_train_final, y_train_flat)}')
print(f'Test Score: {model.score(X_test_final, y_test_flat)}')
print(f'Accuracy Score :{accuracy_score(y_test_flat,predictions)}')
print(f'Precision : {precision_score(y_test_flat,predictions)}')
print(f'Recall Score :{recall_score(y_test_flat,predictions)}')

Training Score: 0.9121338912133892
Test Score: 0.75
Accuracy Score :0.75
Precision : 0.9166666666666666
Recall Score :0.44


## Summary

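Comparing the results of the logistic regression and random forest classifier models on the same train/test split:

For logistic regression we got the values below:
- Training score: 0.8326
- Test score (accuracy): 0.80
- Precision: 0.8421
- Recall: 0.64

For RandomForestClassifier we got the values below:
- Training score: 0.9121
- Test score (accuracy): 0.75
- Precision: 0.9167
- Recall: 0.44

Logistic regression gives the higher test accuracy (0.80 vs 0.75) and the higher recall (0.64 vs 0.44), while the random forest gives the higher precision (0.92 vs 0.84). The random forest also shows a larger gap between its training and test scores (0.91 vs 0.75), which suggests it overfits more on this small dataset. Because the target is death during follow-up, the lower recall means the random forest misses more of the patients who died.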
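
One way to put the two models side by side is to compute the train/test gap and an F1 score from the values reported above. The F1 score is not computed anywhere in the notebook; it is added here only as an illustrative single-number summary of precision and recall, and the `results` and `summary` names are hypothetical.

```python
# Scores copied from the notebook output above
results = {
    "LogisticRegression": {"train": 0.8326359832635983, "test": 0.8,
                           "precision": 0.8421052631578947, "recall": 0.64},
    "RandomForestClassifier": {"train": 0.9121338912133892, "test": 0.75,
                               "precision": 0.9166666666666666, "recall": 0.44},
}

summary = {}
for name, r in results.items():
    gap = r["train"] - r["test"]  # a larger gap hints at more overfitting
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * r["precision"] * r["recall"] / (r["precision"] + r["recall"])
    summary[name] = {"gap": gap, "f1": f1}
    print(f"{name}: train-test gap = {gap:.3f}, F1 = {f1:.3f}")
```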
