# Student Information

Author: Suhaib Atef

Student ID: 132823

Section: 1

Class: Special Topics CPE597 (10:30-11:30)

Assignment Title: Logistic Regression - Part 1

Dataset Source: https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression


# Assignment

## Required Code 

### Variables 

In [148]:
datasetName= "framingham"
xFeatures= ['male','age','currentSmoker','prevalentStroke','prevalentHyp']
yFeature = ['TenYearCHD']
testPrecentage = 0.2
learning_Rate = 0.5
MaxiterationNumber=10000
accuracy = 1e-4

### Libraries

In [37]:
import numpy as np 
import pandas as pd
import math
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix , accuracy_score,precision_score,recall_score,roc_auc_score,auc,roc_curve

## Data Set

### Info about the Dataset

The dataset is publically available on the Kaggle website, and it is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. 

The classification goal is to predict whether the patient has 10-year risk of future coronary heart disease (CHD).The dataset provides the patients’ information. It includes over 4,000 records and 15 attributes.

Attributes:


--> Demographic


1. Sex: male or female
2. Age: Age of the patient

--> Behavioral


1.  Current Smoker: whether or not the patient is a current smoker
2. Cigs Per Day: the number of cigarettes that the person smoked on average in one day.


--> Medical( history)


1.  BP Meds: whether or not the patient was on blood pressure medication
2. Prevalent Stroke: whether or not the patient had previously had a stroke
3. Prevalent Hyp: whether or not the patient was hypertensive
4. Diabetes: whether or not the patient had diabetes

--> Medical(current)


1. Tot Chol: total cholesterol level
2. Sys BP: systolic blood pressure
3. Dia BP: diastolic blood pressure
4. BMI: Body Mass Index
5. Heart Rate: heart rate
6. Glucose: glucose level
7. 10 year risk of coronary heart disease CHD


### Downloading the dataset

In [38]:
!gdown 1LR6yNKK0MyB43j9YxVYPWizB9GSss-xr

Downloading...
From: https://drive.google.com/uc?id=1LR6yNKK0MyB43j9YxVYPWizB9GSss-xr
To: /content/framingham.csv
  0% 0.00/196k [00:00<?, ?B/s]100% 196k/196k [00:00<00:00, 94.0MB/s]


### Viewing the Dataset 

In [39]:
df = pd.read_csv(datasetName+'.csv')

# Check the data 
df.head(5)

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


## Standardization The Data Set

In [40]:
def Standardization(X):
    mu = np.mean(X, axis=0)
    sigma  = np.std(X, axis=0)
    X_norm = (X - mu) / sigma      
    return X_norm

In [41]:
X=df[xFeatures]
y=df[yFeature]
X = Standardization(X)

In [42]:
X

Unnamed: 0,male,age,currentSmoker,prevalentStroke,prevalentHyp
0,1.153192,-1.234951,-0.988271,-0.077033,-0.671101
1,-0.867158,-0.418257,-0.988271,-0.077033,-0.671101
2,1.153192,-0.184916,1.011868,-0.077033,-0.671101
3,-0.867158,1.331800,1.011868,-0.077033,1.490089
4,-0.867158,-0.418257,1.011868,-0.077033,-0.671101
...,...,...,...,...,...
4233,1.153192,0.048425,1.011868,-0.077033,1.490089
4234,1.153192,0.165095,1.011868,-0.077033,-0.671101
4235,-0.867158,-0.184916,1.011868,-0.077033,-0.671101
4236,-0.867158,-0.651598,1.011868,-0.077033,-0.671101


In [43]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=testPrecentage, random_state=1)

## Building the Logistic Function

### The Sigmoid Function

In [44]:
def sig(z):
 return 1/(1 + np.exp(-z))

### The Cost Function


In [45]:
def J(X, y, w, b):
    m, n = X.shape
    cost = 0
    X = X.values.tolist()
    y = y.values.tolist()

    for i in range(m):
        z = np.dot(X[i],w) + b
        h = sig(z)
        cost += -1 * y[i][0] * np.log(h) - (1-y[i][0])* np.log(1-h)
    cost = cost/m
    return cost

### The Gradient Function

In [46]:
def gradientFunction(X, y, w, b): 
    m, n = X.shape
    dj_dw = np.zeros(w.shape)
    dj_db = 0.
    X = X.values.tolist()
    y = y.values.tolist()

    for i in range(m):
        f_wb_i = sig(np.dot(X[i],w) + b)          
        err_i  = f_wb_i  - y[i][0]   
        for j in range(n):
          dj_dw[j] = dj_dw[j] + err_i * X[i][j]      
        dj_db = dj_db + err_i
    dj_dw = dj_dw/m                                   
    dj_db = dj_db/m                                         
    return dj_db, dj_dw

### The Gradient Descent

In [47]:
def Gradient_Descent_Algorithm(X,Y,old_w,old_b,costFunction,gradientFunction,iterationNumber,LearningRate): 
    m = len(X)
    J_history = []
    w_history = []
    
    for i in range(iterationNumber):

        dj_db, dj_dw = gradientFunction(X, y, old_w, old_b)   

        # Update Parameters using w, b, alpha and gradient
        old_w = old_w - LearningRate * dj_dw               
        old_b = old_b - LearningRate * dj_db              
       
        # Save cost J at each iteration
        if i<iterationNumber:      # prevent resource exhaustion 
            cost =  costFunction(X, y, old_w, old_b)
            J_history.append(cost)
        
    return old_w, old_b #return w and J,w history for graphing



### Predict Function

In [145]:
def predict(X, w, b): 
    m, n = X.shape
    X = X.values.tolist()   
    predictions = np.zeros(m)
    for i in range(m):   
        z_wb = np.dot(X[i],w) 
        for j in range(n): 
            z_wb += 0
        z_wb += b
        f_wb = sig(z_wb)
        print(f_wb)
        predictions[i] = f_wb>0.1684
    return predictions

## Testing The Model 

### Applying the model to our Train Data 
In order to get the W and B parameters 

In [149]:
w_old = np.zeros(len(xFeatures))
global_w,global_b = Gradient_Descent_Algorithm(X_train,y_train,w_old,0,J,gradientFunction,MaxiterationNumber,learning_Rate)


### Use Theta's values to predict the test sample 

In [None]:
y_predict = predict(X_test, global_w, global_b)
n= y_test.to_numpy()
y_test2 = []
for item in n:
  y_test2.append(item[0])
y_predict

## Metrics 

In [147]:
print('Confusion Matrix')
tn, fp, fn, tp = confusion_matrix(y_test2,y_predict).ravel()
print(tn)
print(confusion_matrix(y_test2 , y_predict))

print('\nAccuracy')
print(accuracy_score(y_test2 , y_predict))

print('\nPrecision')
print(precision_score(y_test2, y_predict))

print('\nRecall')
print(recall_score(y_test2, y_predict))

print('\nROC')
print(roc_auc_score(y_test2,y_predict))

Confusion Matrix
666
[[666  67]
 [108   7]]

Accuracy
0.7936320754716981

Precision
0.0945945945945946

Recall
0.06086956521739131

ROC
0.4847321905213833
