<h3>Introduction</h3>
    
<p>We have a dataset that records whether patients have heart disease, together with a set of clinical features for each patient. We will use this data to build a model that tries to predict whether a patient has the disease. We will use the logistic regression (classification) algorithm.</p>
    

<h3>Data Description</h3>

<p>This is a multivariate dataset, meaning that each record involves several separate statistical variables. It is composed of 14 attributes: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels, thalassemia, and the target attribute to be predicted. There are two main tasks on this dataset. The first is to predict, from a patient's attributes, whether that person has heart disease. The second is exploratory: to find insights in the data that help in understanding the problem.</p>
The data was downloaded from Kaggle (https://www.kaggle.com/ronitf/heart-disease-uci).

### Variable Descriptions:
 1.  age: age of the patient in years
 2.  sex: Male/Female
 3.  cp: chest pain type [typical angina, atypical angina, non-anginal, asymptomatic]
 4.  trestbps: resting blood pressure (in mm Hg on admission to the hospital)
 5.  chol: serum cholesterol in mg/dl
 6.  fbs: whether fasting blood sugar > 120 mg/dl
 7.  restecg: resting electrocardiographic results [normal, stt abnormality, lv hypertrophy]
 8.  thalach: maximum heart rate achieved
 9.  exang: exercise-induced angina (True/False)
 10. oldpeak: ST depression induced by exercise relative to rest
 11. slope: the slope of the peak exercise ST segment
 12. ca: number of major vessels (0-3) colored by fluoroscopy
 13. thal: [normal; fixed defect; reversible defect]
 14. target: the predicted attribute

### Project question/idea

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

In [3]:
Dataset = pd.read_csv('Dataset/heart.csv')
Dataset.shape

(303, 14)

In [11]:
df = pd.read_csv('Dataset/framingham.csv')
df.head(10)

,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0
5,0,43,2.0,0,0.0,0.0,0,1,0,228.0,180.0,110.0,30.3,77.0,99.0,0
6,0,63,1.0,0,0.0,0.0,0,0,0,205.0,138.0,71.0,33.11,60.0,85.0,1
7,0,45,2.0,1,20.0,0.0,0,0,0,313.0,100.0,71.0,21.68,79.0,78.0,0
8,1,52,1.0,0,0.0,0.0,0,1,0,260.0,141.5,89.0,26.36,76.0,79.0,0
9,1,43,1.0,1,30.0,0.0,0,1,0,225.0,162.0,107.0,23.61,93.0,88.0,0


In [12]:
# Drop every row that contains at least one missing value
df = df.dropna()
df.info()
df['age']

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3658 entries, 0 to 4239
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             3658 non-null   int64  
 1   age              3658 non-null   int64  
 2   education        3658 non-null   float64
 3   currentSmoker    3658 non-null   int64  
 4   cigsPerDay       3658 non-null   float64
 5   BPMeds           3658 non-null   float64
 6   prevalentStroke  3658 non-null   int64  
 7   prevalentHyp     3658 non-null   int64  
 8   diabetes         3658 non-null   int64  
 9   totChol          3658 non-null   float64
 10  sysBP            3658 non-null   float64
 11  diaBP            3658 non-null   float64
 12  BMI              3658 non-null   float64
 13  heartRate        3658 non-null   float64
 14  glucose          3658 non-null   float64
 15  TenYearCHD       3658 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 485.8 KB


0       39
1       46
2       48
3       61
4       46
        ..
4233    50
4234    51
4237    52
4238    40
4239    39
Name: age, Length: 3658, dtype: int64
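
Before dropping rows it can help to see which columns carry the missing values and how much data `dropna()` removes. The following is a minimal sketch, not part of the original notebook: `MissingSummary` is a hypothetical helper, and the small frame is made-up illustration data rather than rows from `framingham.csv`.

```python
import numpy as np
import pandas as pd

def MissingSummary(Data):
    """Return per-column missing counts and the share of rows dropna() would remove."""
    missingPerColumn = Data.isna().sum()
    rowsWithMissing = Data.isna().any(axis=1).sum()
    return missingPerColumn, rowsWithMissing / len(Data)

# Made-up illustration data (not taken from framingham.csv)
toy = pd.DataFrame({
    'age': [39, 46, 48, 61],
    'glucose': [77.0, np.nan, 70.0, np.nan],
    'BMI': [26.97, 28.73, np.nan, 28.58],
})
perColumn, droppedShare = MissingSummary(toy)
print(perColumn)
print(droppedShare)
```

If a single column accounts for most of the missing values, imputing or excluding that column may keep more rows than dropping every incomplete record.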

### DATA HANDLING FUNCTIONS

In [13]:
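# 60-20-20 split in row order (rows are not shuffled)
def SplitData(Data):
    size = Data.shape[0]
    trainEnd = int(0.6*size)        # end of the training slice
    validationEnd = int(0.8*size)   # end of the validation slice
    training = Data.iloc[:trainEnd]
    validation = Data.iloc[trainEnd:validationEnd]
    testing = Data.iloc[validationEnd:]
    
    return (training, validation, testing)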
    
    
# FOR DISTINCTION PURPOSES
def DropNULLS(Data):
    return Data.dropna()
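
`SplitData` slices the rows in file order, so if the file happens to be sorted in some way the three parts may not share the same class balance. A possible variant shuffles the rows first. This is only a sketch: `ShuffledSplit` and its `seed` parameter are hypothetical names, and the toy frame is made-up data.

```python
import numpy as np
import pandas as pd

def ShuffledSplit(Data, seed=0):
    """60-20-20 split after shuffling the rows (hypothetical variant of SplitData)."""
    # sample(frac=1) returns all rows in random order; random_state makes it repeatable
    shuffled = Data.sample(frac=1, random_state=seed)
    size = shuffled.shape[0]
    first, second = int(0.6 * size), int(0.8 * size)
    return (shuffled.iloc[:first], shuffled.iloc[first:second], shuffled.iloc[second:])

# Made-up illustration data
toy = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10) % 2})
training, validation, testing = ShuffledSplit(toy, seed=0)
print(len(training), len(validation), len(testing))
```

Fixing the seed keeps the split reproducible between notebook runs.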

### NAIVE BAYES

In [29]:
def Priors(ClassArray):
    # Unique class labels and how often each one occurs
    classes, counts = np.unique(ClassArray, return_counts = True)
    
    # Prior probability of each class = class count / total count
    probabilities = counts / np.sum(counts)
        
    return (classes, counts, probabilities)




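def NaiveBayes(classIndex , trainData , dataPoint):
    # Class labels, their counts and the prior probabilities P(class)
    classes , classCounts , priorProbs = Priors(trainData.iloc[:,classIndex])
    totalProbs = np.array([])
    
    # Positional access that works for a DataFrame row or a plain sequence
    values = np.asarray(dataPoint)
    
    for i in range(len(classes)):
        
        prob = priorProbs[i]
        Class = classes[i]
        classOccurance = classCounts[i]
        
        # Training rows that belong to the current class
        classRows = trainData[trainData.iloc[:,classIndex] == Class]
        
        for j in range(len(values)):
            
            # The class column itself is not a feature
            if j == classIndex:
                continue
                
            # Rows of this class whose feature j equals the data point's value
            fCount = np.sum(classRows.iloc[:,j] == values[j])
            
            # Likelihood P(feature j = value | class), multiplied into the product
            prob *= fCount/classOccurance
            
        totalProbs = np.append(totalProbs , [prob])
    
    return (totalProbs , classes)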


def ConfusionMatric(trainingSet , testingSet):
    # Position of the class column; TenYearCHD is the label being predicted
    index = trainingSet.columns.get_loc('TenYearCHD')
    TP = 0 
    TN = 0 
    FN = 0 
    FP = 0
    
    for i in range(len(testingSet)):
        dataPoint = testingSet.iloc[i]
        probs , classes = NaiveBayes(index , trainingSet , dataPoint)
        # Predict the class with the largest (unnormalised) posterior
        predicted = classes[np.argmax(probs)]
        trueValue = dataPoint.TenYearCHD
        print(trueValue , predicted)
        if trueValue == 1:
            if predicted == trueValue:
                TP += 1
            else : 
                FN += 1
        else:
            if predicted == trueValue:
                TN += 1
            else : 
                FP += 1
    
    
    # Rows: predicted (positive, negative); columns: actual (positive, negative)
    ConfusionMatrix = np.array([[TP , FP],[FN , TN]])
    return ConfusionMatrix
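
Counting exact matches gives a likelihood of zero whenever a feature value never occurs together with a class in the training data, and a single zero wipes out the whole product. This is likely for near-continuous columns such as `totChol` or `sysBP`. One common remedy is Laplace (add-alpha) smoothing. The sketch below shows the idea for one feature; `SmoothedLikelihood` and `alpha` are hypothetical names, the toy data is made up, and this is not the method used in the cells above.

```python
import numpy as np
import pandas as pd

def SmoothedLikelihood(featureColumn, classColumn, value, Class, alpha=1.0):
    """Estimate P(feature == value | class) with add-alpha (Laplace) smoothing."""
    inClass = classColumn == Class
    # Rows of the given class whose feature equals the queried value
    matches = np.sum(inClass & (featureColumn == value))
    # Number of distinct feature values seen in training
    distinctValues = featureColumn.nunique()
    return (matches + alpha) / (np.sum(inClass) + alpha * distinctValues)

# Made-up illustration data
toy = pd.DataFrame({
    'currentSmoker': [1, 1, 0, 0, 1, 0],
    'TenYearCHD':    [1, 1, 1, 0, 0, 0],
})
seen = SmoothedLikelihood(toy['currentSmoker'], toy['TenYearCHD'], value=1, Class=1)
unseen = SmoothedLikelihood(toy['currentSmoker'], toy['TenYearCHD'], value=2, Class=1)
print(seen, unseen)
```

With smoothing, a value that was never observed for a class still receives a small non-zero probability. For truly continuous features, binning the values or assuming a Gaussian likelihood are other options worth considering.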

In [30]:
training, validation, testing = SplitData(df)

In [None]:
matrix = ConfusionMatric(training, testing)

1.0 0
0.0 0
0.0 0


In [None]:
matrix

In [None]:
testing.shape
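
Once the confusion matrix is available, summary metrics can be read from it. The sketch below assumes the `[[TP, FP], [FN, TN]]` layout returned by `ConfusionMatric`; `MatrixMetrics` is a hypothetical helper, and the example counts are made up rather than results from this notebook.

```python
import numpy as np

def MatrixMetrics(ConfusionMatrix):
    """Accuracy, precision and recall from a [[TP, FP], [FN, TN]] matrix."""
    (TP, FP), (FN, TN) = ConfusionMatrix
    total = TP + FP + FN + TN
    accuracy = (TP + TN) / total
    # Guard against division by zero when no positives are predicted or present
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    return accuracy, precision, recall

# Made-up counts for illustration only
example = np.array([[30, 10], [20, 40]])
accuracy, precision, recall = MatrixMetrics(example)
print(accuracy, precision, recall)
```

Recall is often the more informative number for a disease screen, because a false negative means a patient at risk goes unflagged.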