## Project Overview
This project applies **Supervised Machine Learning** techniques to a binary classification problem. I am using a **Logistic Regression** model, which works well for binary classification problems. The primary objective is to predict whether a patient has **Heart Disease** or **Not**.

The analysis follows a structured data science workflow:

* Data Collection and Data Processing
* Separate The Data (Features) and The Label (Target)
* Split The Data Into 'Training' And 'Testing' Data
* Train The Logistic Regression Model on The Training Data
* Model Evaluation Using ACCURACY SCORE
* Build a Predictive System

### IMPORT THE DEPENDENCIES

In [3]:
# import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### DATA COLLECTION AND DATA PROCESSING

In [5]:
# Loading the dataset to a pandas dataframe
Heart_Disease_Predict = pd.read_csv('heart.csv')

In [6]:
# Check the first 5 rows of the dataset
Heart_Disease_Predict.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [7]:
# Check the shape of the dataset
Heart_Disease_Predict.shape

(1025, 14)

In [8]:
# Check the value counts of the (target) column
Heart_Disease_Predict['target'].value_counts()

target
1    526
0    499
Name: count, dtype: int64

1 ---> Has Heart Disease

0 ---> Does Not Have Heart Disease
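The class counts above also give a simple reference point for the accuracy scores reported later: a model that always predicts the majority class. The sketch below works that out from the counts shown in the output; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {1: 526, 0: 499}

total = sum(counts.values())

# Accuracy of a trivial model that always predicts the most common class
baseline_accuracy = max(counts.values()) / total

print("Total rows        :", total)
print("Majority baseline :", round(baseline_accuracy, 4))
```

Because the two classes are nearly balanced, this baseline is close to 51%, so accuracy is a fair first metric for this dataset.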

In [10]:
# Check for missing values in the dataset
Heart_Disease_Predict.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [11]:
# Check the statistical measures of the dataset
Heart_Disease_Predict.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0
mean,54.434146,0.69561,0.942439,131.611707,246.0,0.149268,0.529756,149.114146,0.336585,1.071512,1.385366,0.754146,2.323902,0.513171
std,9.07229,0.460373,1.029641,17.516718,51.59251,0.356527,0.527878,23.005724,0.472772,1.175053,0.617755,1.030798,0.62066,0.50007
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,48.0,0.0,0.0,120.0,211.0,0.0,0.0,132.0,0.0,0.0,1.0,0.0,2.0,0.0
50%,56.0,1.0,1.0,130.0,240.0,0.0,1.0,152.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,275.0,0.0,1.0,166.0,1.0,1.8,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


### Separate The Data (Features) and The Label (Target)

In [13]:
# Separate the features and the target
# (columns='target' already selects the column axis, so axis=1 is not needed)
x = Heart_Disease_Predict.drop(columns='target')
y = Heart_Disease_Predict['target']

### Splitting The Data Into 'Training' And 'Test' Data

In [15]:
# create four variables and split the dataset into training and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state = 2, stratify = y)

In [16]:
print(x.shape, x_train.shape, x_test.shape)

(1025, 13) (820, 13) (205, 13)
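The split above passes `stratify = y`, which asks `train_test_split` to keep the class proportions roughly equal in the training and test sets. The following is a minimal sketch of that effect. It assumes placeholder data (`X_demo`, `y_demo`) built only from the class counts shown earlier, not the real `heart.csv` rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels with the same class counts as the dataset (526 ones, 499 zeros)
y_demo = np.array([1] * 526 + [0] * 499)
# Placeholder single feature column; its values are irrelevant for this check
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

# The share of class 1 should be nearly identical in all three sets
print("full :", round(y_demo.mean(), 3))
print("train:", round(y_tr.mean(), 3))
print("test :", round(y_te.mean(), 3))
```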


### Training the model ---> I am using the Logistic Regression model

In [18]:
# Create the Logistic Regression model and store it in a variable
model = LogisticRegression()

In [19]:
# Train the Logistic Regression model on the training data
model.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
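The warning above means the solver stopped before converging, and it suggests two remedies: raising `max_iter` or scaling the features. One possible way to apply both is to wrap `StandardScaler` and `LogisticRegression` in a pipeline. The sketch below is not what this notebook ran; it uses synthetic data from `make_classification` as a stand-in for `heart.csv`, and the `max_iter=1000` value is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 1025 rows, 13 features (not the real heart.csv)
X_demo, y_demo = make_classification(
    n_samples=1025, n_features=13, n_informative=6, random_state=2
)
# Give the columns very different magnitudes, as unscaled clinical features have
X_demo = X_demo * np.linspace(1.0, 250.0, 13)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

# The scaler is fitted on the training data only, then applied before the classifier
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

demo_accuracy = scaled_model.score(X_te, y_te)
iterations_used = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("Iterations used:", iterations_used)
```

Putting the scaler inside the pipeline keeps test data out of the scaling statistics, and the same `fit` / `predict` calls used in this notebook work on the pipeline object.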


### MODEL EVALUATION USING THE ACCURACY SCORE

In [21]:
# Check the accuracy score on the training data
# accuracy_score expects the true labels first, then the predictions
x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)

In [22]:
print('Accuracy Score on Training Data : ', training_data_accuracy)

Accuracy Score on Training Data :  0.8524390243902439


#### The Accuracy Score for the Training data is about 85%.

#### This is a very good score.

In [24]:
# Check the accuracy score on the test data
# accuracy_score expects the true labels first, then the predictions
x_test_prediction = model.predict(x_test)
test_data_accuracy = accuracy_score(y_test, x_test_prediction)

In [25]:
print('Accuracy Score on Test Data : ', test_data_accuracy)

Accuracy Score on Test Data :  0.8048780487804879


#### The Accuracy Score for the Test data is about 80%.

#### This is a good score, and it is close to the training accuracy, so the model does not appear to overfit heavily.
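Accuracy alone does not show which kind of mistake the model makes, which matters for a medical prediction where a missed case and a false alarm have different costs. The sketch below shows how a confusion matrix, precision, and recall could be computed with `sklearn.metrics`. The labels `y_true_demo` and `y_pred_demo` are made-up illustrative values, not results from this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels for illustration only (1 = has heart disease, 0 = does not)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)

acc = accuracy_score(y_true_demo, y_pred_demo)
prec = precision_score(y_true_demo, y_pred_demo)  # share of predicted positives that are correct
rec = recall_score(y_true_demo, y_pred_demo)      # share of actual positives that are found

print(cm)
print("accuracy:", acc, "precision:", prec, "recall:", rec)
```

In the notebook, the same calls would take `y_test` and `x_test_prediction` as arguments.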

### Build A Predictive System That Predicts Whether A Patient 'Has Heart Disease' or 'Does Not Have Heart Disease'

In [28]:
input_data = (54,1,0,124,266,0,0,109,1,2.2,1,1,3)

# Change the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape the numpy array as we are predicting for only one instance or one data point
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

prediction = model.predict(input_data_reshaped)
print(prediction)

# predict() returns integer labels, so compare against the integer 0, not the string '0'
if prediction[0] == 0:
    print('The Person Does Not Have Heart Disease')
else:
    print('The Person Has Heart Disease')


[0]
The Person Does Not Have Heart Disease
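The prediction step above could also be wrapped in a small reusable function that builds a one-row DataFrame with the training column names, so the input matches the format the model was fitted on. This is a sketch only: `describe_prediction` and `toy_model` are hypothetical names, and the toy model is trained on random numbers purely so the example runs without `heart.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Feature columns in the order they appear in the dataset
FEATURES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal"]


def describe_prediction(model, values, feature_names=FEATURES):
    """Return a readable message for one patient's feature values."""
    # A one-row DataFrame keeps the column names used during training
    row = pd.DataFrame([values], columns=feature_names)
    label = int(model.predict(row)[0])  # integer label: 0 or 1
    if label == 0:
        return "The Person Does Not Have Heart Disease"
    return "The Person Has Heart Disease"


# Toy model fitted on random data, only to make the sketch runnable
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(60, len(FEATURES))), columns=FEATURES)
y_toy = (X_toy["age"] > 0).astype(int)
toy_model = LogisticRegression().fit(X_toy, y_toy)

message = describe_prediction(
    toy_model, (54, 1, 0, 124, 266, 0, 0, 109, 1, 2.2, 1, 1, 3)
)
print(message)
```

With the notebook's trained `model` in place of `toy_model`, the function returns the message for the same `input_data` tuple used above.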


