#### Logistic Regression is a classification algorithm, despite its name. It is applied only to classification data, but what it computes internally is a continuous number (a probability) that is then turned into a class.
We fit an S-shaped curve to the data to make classifications,
so you can think of it as a best-fit S curve.

### Sigmoid function:
g(z) = 1 / (1 + e⁻ᶻ) , 0 < g(z) < 1


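This gives us an output between 0 and 1.
Because the curve flattens at both ends, extreme values of z (outliers) move the output very little.
e (Euler's number) is a constant, approximately 2.718.

g(z) is interpreted as the probability of the positive class (output = 1).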
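
To make these properties concrete, here is a minimal sketch of the sigmoid in plain Python. The function name `sigmoid` and the sample z values are my own choices for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative z gives values near 0, large positive z gives values near 1,
# and z = 0 sits exactly in the middle at 0.5.
for z in (-5, -1, 0, 1, 5):
    print(f"z = {z:>2}  ->  g(z) = {sigmoid(z):.4f}")
```

Note how going from z = 1 to z = 5 changes the output far less than going from z = 0 to z = 1: the curve saturates at the ends.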

1. Calculate the best-fit line (BFL).
2. Use the y value of that line as z, the exponent in the sigmoid function.
3. The sigmoid returns a value between 0 and 1.
4. Compare g(z) with the predefined cutoff value of 0.5.
   if g(z) > 0.5 then output = 1 / Yes
   if g(z) < 0.5 then output = 0 / No

   if z > 0, g(z) will always be > 0.5
   if z = 0, g(z) is exactly 0.5
   if z < 0, g(z) will always be < 0.5

#### In short:

1. calculate :
   z = ŷ = m1x1 + m2x2 + m3x3 + m4x4 + c

2. plug the z value into the sigmoid function
   g(z) = 1 / (1 + e⁻ᶻ)
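
The two steps, plus the 0.5 cutoff, can be sketched end to end as below. The coefficients `m`, the intercept `c`, and the helper name `predict_one` are hypothetical values invented for illustration; a real model learns them from data. Treating g(z) of exactly 0.5 as class 0 is also an assumption of this sketch.

```python
import math

# Hypothetical coefficients and intercept (made up, not learned from data)
m = [0.8, -0.5]
c = 0.1

def predict_one(x, threshold=0.5):
    # Step 1: linear combination z = m1*x1 + m2*x2 + c
    z = sum(mi * xi for mi, xi in zip(m, x)) + c
    # Step 2: pass z through the sigmoid to get a value between 0 and 1
    g = 1.0 / (1.0 + math.exp(-z))
    # Step 3: compare with the cutoff (a tie at exactly 0.5 becomes class 0 here)
    return z, g, int(g > threshold)

print(predict_one([2.0, 1.0]))    # positive z, so g(z) > 0.5 and the class is 1
print(predict_one([-1.0, 2.0]))   # negative z, so g(z) < 0.5 and the class is 0
```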

In [350]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

### Reading and exploring the data:

1. import data
2. check the shape and the dtype of each column
3. check for missing values
4. check for duplicates and remove them
5. check for outliers in each column and deal with them
6. check for columns with object dtype and encode them
7. necessary visualization
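
As a rough sketch of steps 2, 3, 4 and 6 of this checklist, here is the sequence on a tiny made-up DataFrame (not the heart dataset). Dropping rows with missing values and one-hot encoding with `pd.get_dummies` are just one possible choice for each step; outlier handling and visualization are left out.

```python
import pandas as pd

# Toy data invented for illustration
df = pd.DataFrame({
    "age": [63, 37, 41, 41, None],
    "sex": ["M", "M", "F", "F", "M"],
    "chol": [233, 250, 204, 204, 354],
})

print(df.shape)                        # step 2: (rows, columns)
print(df.dtypes)                       # step 2: dtype of each column
print(df.isnull().sum())               # step 3: missing values per column

n_dupes = int(df.duplicated().sum())   # step 4: count exact duplicate rows
print(n_dupes)
df = df.drop_duplicates()              # step 4: remove them

df = df.dropna()                       # one simple way to handle missing values

# step 6: encode the non-numeric columns (object dtype in most pandas versions)
non_numeric = list(df.select_dtypes(exclude="number").columns)
df = pd.get_dummies(df, columns=non_numeric)
print(df.head())
```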

In [352]:
heart = pd.read_csv(r"C:\Users\Pooja\Downloads\heart.csv")
heart.head()

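|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |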


In [353]:
heart.shape

(303, 14)

In [354]:
heart.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

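1. axis = 0 runs down the rows (top to bottom), so an aggregation returns one result per column
2. axis = 1 runs across the columns (left to right), so an aggregation returns one result per row

### Simple mnemonic:
##### axis=0 → "down the rows" → collapses the rows, works top to bottom, one result per column.
##### axis=1 → "across the columns" → collapses the columns, works left to right, one result per row.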
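
A quick way to check the axis direction is to sum a tiny DataFrame both ways. The data below is made up purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = df.sum(axis=0)   # moves down the rows: one total per column
row_totals = df.sum(axis=1)   # moves across the columns: one total per row

print(col_totals.tolist())    # [6, 60]
print(row_totals.tolist())    # [11, 22, 33]
```

With `axis=0` the result is indexed by column name, with `axis=1` by row label, which matches the two `isnull().sum(...)` outputs in the cells that follow.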

In [356]:
# isnull() and isna() are aliases; they do the same thing

heart.isnull().sum(axis = 0)

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [357]:
heart.isnull().sum(axis = 1)

0      0
1      0
2      0
3      0
4      0
      ..
298    0
299    0
300    0
301    0
302    0
Length: 303, dtype: int64

In [358]:
heart.duplicated().sum()

1

In [359]:
heart[heart.duplicated()]

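|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |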


In [360]:
heart.drop_duplicates(inplace = True)

In [361]:
heart.shape

(302, 14)

In [362]:
heart.describe()

|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 302.0 | 302.0 | 302.0 | 302.0 | 302.0 | 302.0 | 302.0 | 302.0 | 302.0 | 302.0 | 302.0 | 302.0 | 302.0 | 302.0 |
| mean | 54.42053 | 0.682119 | 0.963576 | 131.602649 | 246.5 | 0.149007 | 0.52649 | 149.569536 | 0.327815 | 1.043046 | 1.397351 | 0.718543 | 2.31457 | 0.543046 |
| std | 9.04797 | 0.466426 | 1.032044 | 17.563394 | 51.753489 | 0.356686 | 0.526027 | 22.903527 | 0.470196 | 1.161452 | 0.616274 | 1.006748 | 0.613026 | 0.49897 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.25 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.5 | 1.0 | 1.0 | 130.0 | 240.5 | 0.0 | 1.0 | 152.5 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.75 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


In [363]:
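def remove_outliers(data, columns):
    """Return `data` without the rows outside the 1.5*IQR fences of the given columns."""
    for column in columns:
        if column in data.columns:
            Q1 = data[column].quantile(0.25)
            Q3 = data[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5*IQR
            upper_bound = Q3 + 1.5*IQR
            # Keep only the rows inside the fences for this column
            data = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]
    # Return only after every column has been processed
    return data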

In [364]:
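# The function returns a new, filtered DataFrame; without an assignment, `heart` itself stays unchanged
remove_outliers(heart, ['age', 'chol', 'thalach'])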

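|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |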


In [365]:
heart.shape

(302, 14)
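
The shape is still (302, 14) because the result of `remove_outliers` was never assigned back to `heart`. To see what the IQR rule would flag, here is the arithmetic for `chol`, using the 25% and 75% values from the `describe()` output above. This is only a worked example of the formula, not a rerun of the notebook.

```python
# Quartiles of `chol` copied from the describe() output
q1, q3 = 211.0, 274.75

iqr = q3 - q1                  # 63.75
lower_bound = q1 - 1.5 * iqr   # 115.375
upper_bound = q3 + 1.5 * iqr   # 370.375

print(iqr, lower_bound, upper_bound)

# describe() reports max chol = 564.0 and min chol = 126.0
print(564.0 > upper_bound)     # True: the maximum counts as an outlier
print(126.0 < lower_bound)     # False: the minimum is inside the fences
```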

### Machine Learning Process

1. create X and y
2. split the variables into training and testing sets
3. standardization / scaling of the data
4. apply logistic regression to the data
5. check the performance of the model on the test set

In [367]:
X = heart.drop(columns = 'target')
y = heart['target']

In [368]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 100)

#### Standardization / Scaling of the data

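Always do this after the train-test split, so that no information from the test set leaks into the scaling.

To scale a column we use:
Z = (Data - Mean) / SD


The Z value says how many standard deviations the original value lies from the mean.
Without scaling, features measured on large scales get small m values (coefficients) and vice versa, so the coefficients cannot be compared directly and are easy to misinterpret.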
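
Here is a small hand-rolled sketch of the formula, using only the standard library and made-up numbers. It uses the population standard deviation (`pstdev`), which as far as I know is also what `StandardScaler` uses. The variable names are my own.

```python
from statistics import mean, pstdev

# Toy values invented for illustration
train = [120.0, 130.0, 140.0, 150.0]
test = [135.0, 160.0]

# "fit": learn the mean and SD from the training data only
mu = mean(train)
sd = pstdev(train)

# "transform": apply the same mu and sd to both sets
train_scaled = [(x - mu) / sd for x in train]
test_scaled = [(x - mu) / sd for x in test]

print(train_scaled)   # mean 0 and SD 1 by construction
print(test_scaled)    # scaled with the training statistics, so not exactly mean 0
```

The test values are deliberately scaled with the training mean and SD, which mirrors calling `fit_transform` on the training set and only `transform` on the test set.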

In [370]:
from sklearn.preprocessing import StandardScaler

In [371]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [372]:
# fit_transform: computes the mean and SD of each column of X_train, then scales X_train with them
# transform: scales X_test using the mean and SD already computed from X_train

#### Apply Logistic Regression on the data

In [374]:
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

In [402]:
y_pred = log_reg.predict(X_test_scaled)
y_pred

array([0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0], dtype=int64)

In [400]:
log_reg.predict_proba(X_test_scaled)

array([[0.99184208, 0.00815792],
       [0.93439187, 0.06560813],
       [0.08448225, 0.91551775],
       [0.99257794, 0.00742206],
       [0.03508389, 0.96491611],
       [0.20654315, 0.79345685],
       [0.07430371, 0.92569629],
       [0.98723367, 0.01276633],
       [0.09763965, 0.90236035],
       [0.42787258, 0.57212742],
       [0.86964694, 0.13035306],
       [0.49592035, 0.50407965],
       [0.08147662, 0.91852338],
       [0.81642406, 0.18357594],
       [0.02886708, 0.97113292],
       [0.16346754, 0.83653246],
       [0.45730734, 0.54269266],
       [0.6931893 , 0.3068107 ],
       [0.93451095, 0.06548905],
       [0.03568064, 0.96431936],
       [0.95474946, 0.04525054],
       [0.91116173, 0.08883827],
       [0.83572331, 0.16427669],
       [0.83802194, 0.16197806],
       [0.92096411, 0.07903589],
       [0.0190559 , 0.9809441 ],
       [0.92294981, 0.07705019],
       [0.13027014, 0.86972986],
       [0.82470086, 0.17529914],
       [0.92350415, 0.07649585],
       ...
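
Each row of `predict_proba` holds [P(class 0), P(class 1)], and the label from `predict` follows from comparing the second column with 0.5. A minimal sketch using a few rows copied from the output above (rows 1, 2, 3 and 12):

```python
# A few rows copied from the predict_proba output: [P(class 0), P(class 1)]
proba = [
    [0.99184208, 0.00815792],
    [0.93439187, 0.06560813],
    [0.08448225, 0.91551775],
    [0.49592035, 0.50407965],
]

# The two probabilities in each row add up to 1
for p0, p1 in proba:
    print(round(p0 + p1, 6))

# Class 1 when P(class 1) is above the 0.5 cutoff, otherwise class 0
labels = [int(p1 > 0.5) for _, p1 in proba]
print(labels)   # [0, 0, 1, 1]
```

These labels agree with the corresponding entries of `y_pred`; the last row shows how close a prediction can sit to the cutoff.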

In [376]:
accuracy_score(y_test, y_pred)

0.8360655737704918

Once `fit` has run, the decision boundary is fixed, which means the training is done.
`predict` only applies the boundary that was already learned; it does not train again.

##### Checking the confusion matrix

In [409]:
cm = confusion_matrix(y_test, y_pred)

In [421]:
TN, FP, FN, TP = cm.ravel()
# ravel flattens the 2x2 matrix to 1D; for labels 0 and 1 the order is TN, FP, FN, TP

In [423]:
print(f"TN {TN}")
print(f"FP {FP}")
print(f"FN {FN}")
print(f"TP {TP}")

TN 27
FP 8
FN 2
TP 24


In [433]:
precision_score(y_test, y_pred)
# precision = TP / (TP + FP): it penalizes false positives, so 1.00 means no FP

0.75

In [435]:
recall_score(y_test, y_pred)
# recall = TP / (TP + FN): it penalizes false negatives, so 1.00 means no FN

0.9230769230769231
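
All of these scores can be recomputed by hand from the four confusion-matrix counts printed earlier, which is a useful sanity check. The sketch below also computes the F1 score (the harmonic mean of precision and recall), which was imported at the top but not calculated in the notebook.

```python
# Counts taken from the confusion matrix output
TN, FP, FN, TP = 27, 8, 2, 24

accuracy = (TP + TN) / (TP + TN + FP + FN)   # share of all predictions that are correct
precision = TP / (TP + FP)                   # of the predicted positives, how many are real
recall = TP / (TP + FN)                      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  {accuracy:.4f}")
print(f"precision {precision:.4f}")
print(f"recall    {recall:.4f}")
print(f"f1        {f1:.4f}")
```

The first three values match the `accuracy_score`, `precision_score` and `recall_score` outputs above.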