In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# load the CSV data into a pandas DataFrame
heart_data = pd.read_csv('Documents/ml/data.csv')




1. Visit the UCI Machine Learning Repository:
   - The UCI Machine Learning Repository is available at https://archive.ics.uci.edu/ml/index.php

2. Select the 'Health and Medicine' related data category:
   - On the UCI Machine Learning Repository homepage, you can find the 'Health and Medicine' category under the 'Data Categories' section.

3. From the list of datasets, select the 'Heart Disease' dataset for building classification models:
   - The 'Heart Disease' dataset is a popular choice for building heart disease prediction models.

4. Describe the selected dataset in detail:
This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them.  In particular, the Cleveland database is the only one that has been used by ML researchers to date.  The "goal" field refers to the presence of heart disease in the patient.  It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).  
   
The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

One file has been "processed", that one containing the Cleveland database.  All four unprocessed files also exist in this directory.
a. Features:
   - The Heart Disease dataset contains 13 features that describe various clinical and medical attributes of the patients.
   - The features include:
     - Age
     - Sex
     - Chest pain type
     - Resting blood pressure
     - Serum cholesterol
     - Fasting blood sugar
     - Resting electrocardiographic results
     - Maximum heart rate achieved
     - Exercise-induced angina
     - ST depression induced by exercise relative to rest
     - The slope of the peak exercise ST segment
     - Number of major vessels colored by fluoroscopy
     - Thalassemia
These features represent various risk factors and indicators related to heart disease.

b. Target value:
   - The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4, so the raw target has 5 possible values: 0, 1, 2, 3, or 4.
   - Values 1 to 4 likely represent different levels or stages of heart disease severity, which makes the raw target a multi-class label rather than a simple binary one.
   - Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).
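
The binary framing described above (presence for values 1 to 4, absence for 0) can be sketched as follows. The small `sample` frame is made up for illustration; only the `target` column name comes from this notebook.

```python
import pandas as pd

# Hypothetical labels covering the five severity levels (0-4).
sample = pd.DataFrame({"target": [0, 2, 1, 0, 4, 3]})

# Collapse severity 1-4 into a single "disease present" class.
sample["target_binary"] = (sample["target"] > 0).astype(int)

print(sample)
```

The rest of this notebook keeps the five-class target, so this step is shown only as an option.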

c. Samples:
   - The dataset contains 303 samples, with each sample representing a patient's medical information.

This Heart Disease dataset from the UCI Machine Learning Repository is a well-known and widely used dataset for building classification models to predict the presence or absence of heart disease based on the provided clinical and medical attributes.

In [3]:
# print first 5 rows of the dataset
heart_data.head()

Unnamed: 0,age,sex,cp,tresebps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thalach.1,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0


In [4]:
# print last 5 rows of the dataset
heart_data.tail()


Unnamed: 0,age,sex,cp,tresebps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thalach.1,target
298,45,1,1,110,264,0,0,132,0,1.2,2,0,7,1
299,68,1,4,144,193,1,0,141,0,3.4,2,2,7,2
300,57,1,4,130,131,0,0,115,1,1.2,2,1,7,3
301,57,0,2,130,236,0,2,174,0,0.0,2,1,3,1
302,38,1,3,138,175,0,0,173,0,0.0,1,?,3,0


In [5]:
# number of rows and columns in the dataset
heart_data.shape

(303, 14)

In [6]:
# getting some info about the data
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        303 non-null    int64  
 1   sex        303 non-null    int64  
 2   cp         303 non-null    int64  
 3   tresebps   303 non-null    int64  
 4   chol       303 non-null    int64  
 5   fbs        303 non-null    int64  
 6   restecg    303 non-null    int64  
 7   thalach    303 non-null    int64  
 8   exang      303 non-null    int64  
 9   oldpeak    303 non-null    float64
 10  slope      303 non-null    int64  
 11  ca         303 non-null    object 
 12  thalach.1  303 non-null    object 
 13  target     303 non-null    int64  
dtypes: float64(1), int64(11), object(2)
memory usage: 33.3+ KB


In [7]:
# checking for missing values
heart_data.isnull().sum()

age          0
sex          0
cp           0
tresebps     0
chol         0
fbs          0
restecg      0
thalach      0
exang        0
oldpeak      0
slope        0
ca           0
thalach.1    0
target       0
dtype: int64
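
`isnull()` reports no missing values here, yet `ca` and `thalach.1` have `object` dtype in the `info()` output, which usually means a non-numeric placeholder is hiding in those columns. A minimal sketch of one way to surface such entries, using a made-up three-row frame rather than the real file:

```python
import pandas as pd

# Made-up sample: 'ca' mixes numeric strings with a '?' placeholder.
sample = pd.DataFrame({"age": [63, 67, 38], "ca": ["0", "3", "?"]})

# Coerce to numeric; anything that cannot be parsed becomes NaN.
ca_numeric = pd.to_numeric(sample["ca"], errors="coerce")

# Rows where coercion failed hold the placeholder tokens.
placeholders = sample.loc[ca_numeric.isna(), "ca"].unique()

print(placeholders)             # the raw tokens that are not numbers
print(ca_numeric.isna().sum())  # how many entries are effectively missing
```

Passing `na_values='?'` to `pd.read_csv` is another option, since it converts the placeholder to NaN at load time.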

In [8]:
# statistical measures about the data
heart_data.describe()

Unnamed: 0,age,sex,cp,tresebps,chol,fbs,restecg,thalach,exang,oldpeak,slope,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.438944,0.679868,3.158416,131.689769,246.693069,0.148515,0.990099,149.607261,0.326733,1.039604,1.60066,0.937294
std,9.038662,0.467299,0.960126,17.599748,51.776918,0.356198,0.994971,22.875003,0.469794,1.161075,0.616226,1.228536
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0
25%,48.0,0.0,3.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0
50%,56.0,1.0,3.0,130.0,241.0,0.0,1.0,153.0,0.0,0.8,2.0,0.0
75%,61.0,1.0,4.0,140.0,275.0,0.0,2.0,166.0,1.0,1.6,2.0,2.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,4.0


In [9]:
import numpy as np
from sklearn.impute import SimpleImputer

# Replace '?' with NaN
heart_data.replace('?', np.nan, inplace=True)

heart_data.info()

# Check for missing values

missing_values = heart_data.isnull().sum()
print(missing_values)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        303 non-null    int64  
 1   sex        303 non-null    int64  
 2   cp         303 non-null    int64  
 3   tresebps   303 non-null    int64  
 4   chol       303 non-null    int64  
 5   fbs        303 non-null    int64  
 6   restecg    303 non-null    int64  
 7   thalach    303 non-null    int64  
 8   exang      303 non-null    int64  
 9   oldpeak    303 non-null    float64
 10  slope      303 non-null    int64  
 11  ca         299 non-null    object 
 12  thalach.1  301 non-null    object 
 13  target     303 non-null    int64  
dtypes: float64(1), int64(11), object(2)
memory usage: 33.3+ KB
age          0
sex          0
cp           0
tresebps     0
chol         0
fbs          0
restecg      0
thalach      0
exang        0
oldpeak      0
slope        0
ca           4
thalach.1  

In [10]:
# Check for missing values
heart_data.isnull().sum()


age          0
sex          0
cp           0
tresebps     0
chol         0
fbs          0
restecg      0
thalach      0
exang        0
oldpeak      0
slope        0
ca           4
thalach.1    2
target       0
dtype: int64

In [11]:
missing_values = heart_data.isnull().sum()
print(missing_values)

age          0
sex          0
cp           0
tresebps     0
chol         0
fbs          0
restecg      0
thalach      0
exang        0
oldpeak      0
slope        0
ca           4
thalach.1    2
target       0
dtype: int64


In [12]:
from sklearn.impute import SimpleImputer



# Drop the rows that contain missing values (no imputation is applied here)
heart_data_imputed = heart_data.dropna()
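
The cell above drops the incomplete rows; `SimpleImputer` is imported but never used. If keeping every row mattered, imputation would be the alternative. A hedged sketch on a made-up frame, assuming most-frequent imputation is acceptable for a categorical-like column such as `ca`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up sample with one missing 'ca' value.
sample = pd.DataFrame({"age": [63, 67, 38, 57], "ca": [0.0, 3.0, np.nan, 0.0]})

# Replace each missing entry with the most common value in its column.
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(sample), columns=sample.columns)

print(filled)
```

In a real workflow the imputer would normally be fitted on the training split only and then applied to the test split, so that test data does not influence the fill values.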

In [13]:
# checking the distribution of target Variable
heart_data_imputed['target'].value_counts()

target
0    160
1     54
2     35
3     35
4     13
Name: count, dtype: int64
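
The class counts above are heavily skewed toward class 0, so a useful reference point is the accuracy of always predicting the most common class. A small sketch using the counts printed above:

```python
# Class counts copied from the value_counts() output above.
counts = {0: 160, 1: 54, 2: 35, 3: 35, 4: 13}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))
```

This baseline is computed on the full cleaned dataset; any classifier's test accuracy is worth reading against it.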

In [14]:
# separate the features (X) from the target (Y)
X = heart_data_imputed.drop(columns='target')
Y = heart_data_imputed['target']


In [15]:
print(X)

     age  sex  cp  tresebps  chol  fbs  restecg  thalach  exang  oldpeak  \
0     63    1   1       145   233    1        2      150      0      2.3   
1     67    1   4       160   286    0        2      108      1      1.5   
2     67    1   4       120   229    0        2      129      1      2.6   
3     37    1   3       130   250    0        0      187      0      3.5   
4     41    0   2       130   204    0        2      172      0      1.4   
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
297   57    0   4       140   241    0        0      123      1      0.2   
298   45    1   1       110   264    0        0      132      0      1.2   
299   68    1   4       144   193    1        0      141      0      3.4   
300   57    1   4       130   131    0        0      115      1      1.2   
301   57    0   2       130   236    0        2      174      0      0.0   

     slope ca thalach.1  
0        3  0         6  
1        2  3         3  
2        

In [16]:
print(Y)

0      0
1      2
2      1
3      0
4      0
      ..
297    1
298    1
299    2
300    3
301    1
Name: target, Length: 297, dtype: int64


In [17]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=4)

In [18]:
print(X.shape, X_train.shape, X_test.shape)

(297, 13) (237, 13) (60, 13)


In [19]:
model=LogisticRegression()

In [20]:
model.fit(X_train,Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
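
The warning above says the solver hit its iteration limit and suggests scaling the data or raising `max_iter`. One hedged way to do both is a pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the heart data, so its score says nothing about this dataset; the `max_iter=1000` value is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 300 samples, 13 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Standardize the features, then fit with a higher iteration cap.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

demo_score = scaled_model.score(X_te, y_te)
print(demo_score)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, which avoids leaking test-set statistics into the model.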


In [21]:
# accuracy on training data (accuracy_score expects the true labels first)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [22]:
print('Accuracy on training data : ',training_data_accuracy)

Accuracy on training data :  0.5991561181434599


In [23]:
# accuracy on test data (accuracy_score expects the true labels first)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [24]:
print('Accuracy on test data : ',test_data_accuracy)

Accuracy on test data :  0.6166666666666667
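
With five imbalanced classes, a single accuracy figure hides which classes the model gets wrong. A confusion matrix is one way to look closer. The labels below are made up for illustration and are not predictions from this notebook's model.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for six patients.
y_true_demo = [0, 0, 1, 2, 0, 1]
y_pred_demo = [0, 0, 0, 2, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2])
print(cm)
```

In this toy case the second row shows one class-1 patient predicted as class 0, which is the kind of error an overall accuracy number would not reveal.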


In [25]:
import numpy as np

input_data = (41, 0, 2, 130, 204, 0, 2, 172, 0, 1.4, 1, 0, 3)

# Change the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape the numpy array as we are predicting for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

# Provide feature names (example)
feature_names = ['age', 'sex', 'cp', 'tresebps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thalach.1']

# Assign feature names to the reshaped numpy array
input_data_reshaped_named = pd.DataFrame(input_data_reshaped, columns=feature_names)

# Predict using the logistic regression model
prediction = model.predict(input_data_reshaped_named)

print(prediction)

[0]


In [26]:
import numpy as np

input_data = (53,1,4,140,203,1,2,155,1,3.1,3,0,7)

# Change the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape the numpy array as we are predicting for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

# Provide feature names (example)
feature_names = ['age', 'sex', 'cp', 'tresebps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thalach.1']

# Assign feature names to the reshaped numpy array
input_data_reshaped_named = pd.DataFrame(input_data_reshaped, columns=feature_names)

# Predict using the logistic regression model
prediction = model.predict(input_data_reshaped_named)

print(prediction)

[1]


In [27]:
import numpy as np

input_data = (68,1,4,144,193,1,0,141,0,3.4,2,2,7)

# Change the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape the numpy array as we are predicting for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

# Provide feature names (example)
feature_names = ['age', 'sex', 'cp', 'tresebps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thalach.1']

# Assign feature names to the reshaped numpy array
input_data_reshaped_named = pd.DataFrame(input_data_reshaped, columns=feature_names)

# Predict using the logistic regression model
prediction = model.predict(input_data_reshaped_named)

print(prediction)

[2]
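
The three prediction cells above repeat the same steps with different inputs. They could be wrapped in a small helper. `predict_one` is a hypothetical name, and the tiny model below is trained on made-up data purely so the sketch runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def predict_one(model, values, feature_names):
    """Return the predicted class for a single row of feature values."""
    # Build a one-row DataFrame so the column names match the training data.
    row = pd.DataFrame([values], columns=feature_names)
    return model.predict(row)[0]


# Made-up two-feature training data so the helper can be exercised.
demo_features = ["age", "chol"]
demo_X = pd.DataFrame(
    [[40, 180], [45, 190], [65, 300], [70, 320]], columns=demo_features
)
demo_y = [0, 0, 1, 1]
demo_model = LogisticRegression(max_iter=1000).fit(demo_X, demo_y)

print(predict_one(demo_model, (68, 310), demo_features))
```

With the notebook's own objects, the call would take the fitted `model`, one of the input tuples, and the 13-item `feature_names` list.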


In [39]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create an SVM classifier
SVM_classifier = SVC()

# Fit the model to the training data
SVM_classifier.fit(X_train, Y_train)

# Accuracy on the training data
X_train_prediction = SVM_classifier.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy on training data:', training_data_accuracy)

# Accuracy on the test data
X_test_prediction = SVM_classifier.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy on test data:', test_data_accuracy)


Accuracy on training data: 0.540084388185654
Accuracy on test data: 0.5333333333333333


In [31]:
# Random Forest Classifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, Y_train)
rf_train_predictions = rf_classifier.predict(X_train)
rf_test_predictions = rf_classifier.predict(X_test)
rf_train_accuracy = accuracy_score(Y_train, rf_train_predictions)
rf_test_accuracy = accuracy_score(Y_test, rf_test_predictions)
print("Random Forest Classifier:")
print("Training Accuracy:", rf_train_accuracy)
print("Test Accuracy:", rf_test_accuracy)

Random Forest Classifier:
Training Accuracy: 1.0
Test Accuracy: 0.55


In [32]:
# Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
gb_classifier = GradientBoostingClassifier()
gb_classifier.fit(X_train, Y_train)
gb_train_predictions = gb_classifier.predict(X_train)
gb_test_predictions = gb_classifier.predict(X_test)
gb_train_accuracy = accuracy_score(Y_train, gb_train_predictions)
gb_test_accuracy = accuracy_score(Y_test, gb_test_predictions)
print("Gradient Boosting Classifier:")
print("Training Accuracy:", gb_train_accuracy)
print("Test Accuracy:", gb_test_accuracy)

Gradient Boosting Classifier:
Training Accuracy: 1.0
Test Accuracy: 0.5333333333333333
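
A training accuracy of 1.0 next to a test accuracy of about 0.53 to 0.55 suggests the tree ensembles are memorizing the training set. A single 60-sample test split is also a noisy estimate, so cross-validation can give a steadier picture. A sketch on synthetic stand-in data (the `max_depth=3` setting is an illustrative assumption, not a tuned value):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: 300 samples, 13 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6,
    n_classes=3, random_state=0,
)

# Five stratified folds keep the class proportions similar in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Limiting tree depth is one common way to reduce overfitting.
shallow_forest = RandomForestClassifier(max_depth=3, random_state=0)
scores = cross_val_score(shallow_forest, X_demo, y_demo, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```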


In [34]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

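# X holds the features and Y holds the labels (both defined above).

# Split the data into training and testing sets.
# Note: this overwrites X_train and X_test using a different seed and no
# stratification, so the scores below come from a different split than the
# models above.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)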

# K-Nearest Neighbors (KNN)
knn_classifier = KNeighborsClassifier()
knn_classifier.fit(X_train, y_train)
knn_train_predictions = knn_classifier.predict(X_train)
knn_test_predictions = knn_classifier.predict(X_test)
knn_train_accuracy = accuracy_score(y_train, knn_train_predictions)
knn_test_accuracy = accuracy_score(y_test, knn_test_predictions)
print("K-Nearest Neighbors (KNN):")
print("Training Accuracy:", knn_train_accuracy)
print("Test Accuracy:", knn_test_accuracy)

# Decision Tree Classifier
dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(X_train, y_train)
dt_train_predictions = dt_classifier.predict(X_train)
dt_test_predictions = dt_classifier.predict(X_test)
dt_train_accuracy = accuracy_score(y_train, dt_train_predictions)
dt_test_accuracy = accuracy_score(y_test, dt_test_predictions)
print("Decision Tree Classifier:")
print("Training Accuracy:", dt_train_accuracy)
print("Test Accuracy:", dt_test_accuracy)

# Naive Bayes Classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
nb_train_predictions = nb_classifier.predict(X_train)
nb_test_predictions = nb_classifier.predict(X_test)
nb_train_accuracy = accuracy_score(y_train, nb_train_predictions)
nb_test_accuracy = accuracy_score(y_test, nb_test_predictions)
print("Naive Bayes Classifier:")
print("Training Accuracy:", nb_train_accuracy)
print("Test Accuracy:", nb_test_accuracy)

K-Nearest Neighbors (KNN):
Training Accuracy: 0.6118143459915611
Test Accuracy: 0.5
Decision Tree Classifier:
Training Accuracy: 1.0
Test Accuracy: 0.5166666666666667
Naive Bayes Classifier:
Training Accuracy: 0.5527426160337553
Test Accuracy: 0.5833333333333334
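
The cell above re-splits the data with a different seed and no stratification, so its scores and those of the earlier models come from different test sets. A hedged sketch of comparing several classifiers on one shared, stratified split, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 samples, 13 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6,
    n_classes=3, random_state=0,
)

# One split, reused by every model so the scores are comparable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

candidates = {
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    results[name] = {"train": clf.score(X_tr, y_tr), "test": clf.score(X_te, y_te)}

for name, model_scores in results.items():
    print(name, model_scores)
```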


In [None]:
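SVM Classifier:
Training Accuracy: 0.540084388185654
Test Accuracy: 0.5333333333333333

Random Forest Classifier:
Training Accuracy: 1.0
Test Accuracy: 0.55

Gradient Boosting Classifier:
Training Accuracy: 1.0
Test Accuracy: 0.5333333333333333

K-Nearest Neighbors (KNN):
Training Accuracy: 0.6118143459915611
Test Accuracy: 0.5

Decision Tree Classifier:
Training Accuracy: 1.0
Test Accuracy: 0.5166666666666667

Naive Bayes Classifier:
Training Accuracy: 0.5527426160337553
Test Accuracy: 0.5833333333333334

Among the classifiers listed here, Naive Bayes has the highest test accuracy (about 0.58). The logistic regression model trained earlier scored higher (about 0.62) on its own test split. These figures are not strictly comparable, because KNN, Decision Tree, and Naive Bayes were evaluated on a different train/test split from the other models. Random Forest, Gradient Boosting, and Decision Tree all reach a training accuracy of 1.0 with a test accuracy near 0.5, which points to overfitting.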