**What is Logistic Regression?**

> Logistic Regression is a **statistical model** that uses a **logistic (sigmoid) function** to estimate the **probability of a binary outcome**.  
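
To make the definition concrete, here is a minimal sketch of the logistic (sigmoid) function. The scores in `z` are made-up values standing in for the linear combination of features that a fitted model would produce; they are not taken from the dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores for three observations (illustrative values only)
z = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(z)

# Large negative scores give probabilities near 0, large positive scores near 1,
# and a score of 0 gives exactly 0.5
print(probs)
```

A probability above a chosen cut-off (0.5 by default) is then read as class `1`, otherwise as class `0`.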

---

**What does Binary Outcome mean?**

A **binary outcome** means the **target variable** can take only **two possible values**, usually represented as `1` or `0`.

- Example: *Does a patient have a disease?*  
  - **Yes = 1**  
  - **No = 0**

We use `1` and `0` for simplicity in modeling:

- **1 → Event happens (positive class)**  
- **0 → Event does not happen (negative class)**


In [1]:
# Step 1: Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [2]:
df = pd.read_csv('heart_disease.csv')
df.head()


**Step 2: Perform Sanity Checks on the Dataset**

Before building any model, we must **understand and inspect the dataset**.  
This step ensures that the data is clean, consistent, and ready for analysis.

**Common Sanity Checks:**
1. **Check the shape** of the dataset → how many rows and columns.  
2. **Look at the first few rows** to understand the structure.  
3. **Check column data types** (numerical, categorical, object).  
4. **Check for missing values**.  
5. **Get summary statistics** (mean, median, min, max, etc.).


In [None]:
# 1. Check shape attribute
df.shape  # Tells the dimensions of the data set

So we have:
 - 319,795 rows → these are the observations or records.
 - 18 columns → one target column plus the features (variables) that describe each observation.

In [None]:
df.head(2) # Display first two rows

**Target Variable vs Features**

- **HeartDisease** is our **target variable**  
  - Also called the **dependent variable** or **label**  
  - This is what we are trying to **predict**  
  - It’s a **binary outcome**:  
    - **Yes = 1**  
    - **No = 0**

---

- All the other columns are **features**  
  - Also called **independent variables** or **predictors**  
  - These are the inputs the model will use to make predictions  
  - Examples: `BMI`, `Smoking`, `Sex`, `AgeCategory`, etc.
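
If the target is stored as text, the `Yes = 1` / `No = 0` convention can be applied explicitly. The sketch below uses a toy Series and assumes the labels are the strings `"Yes"` and `"No"`; check `df['HeartDisease'].unique()` before applying it to the real column.

```python
import pandas as pd

# Toy stand-in for the real target column (label strings are an assumption)
target = pd.Series(["No", "Yes", "No", "No"], name="HeartDisease")

# Map the positive class to 1 and the negative class to 0
target_binary = target.map({"No": 0, "Yes": 1})

print(target_binary.tolist())  # [0, 1, 0, 0]
```

scikit-learn classifiers also accept string labels directly, so this step is optional for the modelling cells later in the notebook.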


In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.duplicated().sum() #Check how many duplicate rows exist in the dataset.

In [None]:
# Check Target variable distribution
df['HeartDisease'].value_counts()


In [None]:
df['HeartDisease'].value_counts(normalize=True) # Check distribution as proportions (multiply by 100 for percentages)

**Class Imbalance in the Target Variable**

If we look at the distribution of the target variable:

- Around **91% = No (0)**
- Around **8.6% = Yes (1)**

This is a **huge imbalance** in the data.

Why is this important?  
- A Logistic Regression model trained on this data might become **biased towards predicting "No"**.  
- For example, if the model always predicts "No", it would already be **91% accurate**, but it would **completely fail to identify positive (Yes) cases**.  
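
The accuracy trap described above can be reproduced with a few lines. This is a sketch on synthetic labels (91 negatives, 9 positives) chosen to mirror the rough proportions quoted here, not the real data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with a similar imbalance: 91 "No" (0) and 9 "Yes" (1)
y_true = np.array([0] * 91 + [1] * 9)

# A "model" that always predicts the majority class
y_always_no = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_no)
baseline_recall = recall_score(y_true, y_always_no, zero_division=0)

print(baseline_accuracy)  # 0.91
print(baseline_recall)    # 0.0
```

Any real model should be judged against this baseline, which is why recall and precision on the positive class matter more than raw accuracy here.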



In [None]:
# Look for outliers in BMI

sns.boxplot(y=df['BMI'])  # vertical boxplot; the y-axis shows BMI values
plt.ylabel("BMI")
plt.show()


##### In the BMI column:
  - Most values lie between **18 and 40**.
  - Values above 40 are marked as **outliers**.
  - Maximum BMI in the dataset is around **95**.

In [None]:
# Look for outliers in PhysicalHealth

sns.boxplot(y=df['PhysicalHealth'])  # the y-axis shows days of poor physical health
plt.ylabel("PhysicalHealth (days)")
plt.show()

#### Outliers in PhysicalHealth

- Boxplot shows that most people reported **0 days** of poor physical health.
- A small number reported values up to **30 days**.
- Statistically, these appear as outliers, but they are **valid values** because:
  - The feature is measured in days (range: 0–30).
  - Reporting 30 days of poor health is possible, not an error.

---
#### Understanding the Box in a Boxplot

- The **blue box** represents the **Interquartile Range (IQR)**, which contains the middle 50% of the data.
- It has **three key horizontal lines**:

1. **Bottom line of the box (Q1 / 25th percentile)**  
   - 25% of the data lies **below** this value.  

2. **Middle line of the box (Median / Q2 / 50th percentile)**  
   - The midpoint of the data.  
   - 50% of values lie **below** this line, 50% above.  

3. **Top line of the box (Q3 / 75th percentile)**  
   - 75% of the data lies **below** this value.  

---

### Whiskers
- Lines extending out of the box = **whiskers**.  
- They typically reach up to **1.5 × IQR** beyond Q1 and Q3.  
- Points beyond the whiskers are plotted as **outliers**.
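
As a worked example of the 1.5 × IQR rule, the sketch below computes the box and the whisker limits for a small made-up sample. The numbers are illustrative only, and quartiles use NumPy's default linear interpolation.

```python
import numpy as np

# Made-up sample with one obviously extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, q3 = np.percentile(data, [25, 75])  # 3.25 and 7.75
iqr = q3 - q1                           # 4.5

lower_fence = q1 - 1.5 * iqr            # -3.5
upper_fence = q3 + 1.5 * iqr            # 14.5

# Points outside the fences would be drawn as individual outlier markers
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)  # [50]
```

In the plot, each whisker ends at the most extreme data point that still lies inside its fence, so a whisker can be shorter than 1.5 × IQR.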


---
## Data Cleaning

**First, we clean the dataset to remove noise.**

---
 **1. Handle Missing values**

  If any columns have missing data:
  
 - Option 1: Drop rows/columns with too many missing values.
     
 - Option 2: Impute (fill) them using mean/median for numeric or mode for categorical features.

In [None]:
# Collect the names of numeric and categorical columns
df_num = df.select_dtypes(include='number').columns.tolist()

df_cat = df.select_dtypes(include='object').columns.tolist()

print("Numerical columns:", df_num)

print("Categorical columns:", df_cat)

In [None]:
# Option 1 (used here): drop rows that contain missing values
df = df.dropna()

# Option 2 (alternative): impute instead of dropping.
# After dropna() these lines change nothing; to impute, skip Option 1.
df[df_num] = df[df_num].fillna(df[df_num].mean())  # numeric → mean

for col in df_cat:
    df[col] = df[col].fillna(df[col].mode()[0])    # categorical → mode

**2. Handle Duplicated Rows**

In [None]:
print(f"Before Dropping duplicated rows:{df.duplicated().sum()}")
df = df.drop_duplicates()
print(f"After Dropping duplicated rows:{df.duplicated().sum()}")

---
**3. Handle outliers**

In [None]:
def iqr_bounds(feature):
    """Return the (lower, upper) 1.5 × IQR fences for a numeric column of df."""
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    return Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

def detect_outliers(feature):
    """Print how many rows fall outside the IQR fences."""
    lower_bound, upper_bound = iqr_bounds(feature)
    outliers = df[(df[feature] < lower_bound) | (df[feature] > upper_bound)]
    print(outliers.shape[0])

def cap_outliers(feature):
    """Clip values outside the IQR fences to the nearest fence (modifies df)."""
    lower_bound, upper_bound = iqr_bounds(feature)
    df[feature] = df[feature].clip(lower=lower_bound, upper=upper_bound)

    return df

In [None]:
print("Before Capping Outliers")
detect_outliers("BMI")

cap_outliers("BMI")

print("After Capping Outliers")
detect_outliers("BMI")

### Exploratory Data Analysis (EDA)

### 1. Univariate Analysis
- Goal: Understand **what each feature looks like** individually.  
- Tools:  
  - For **numeric features**: Histograms, KDE plots, Boxplots, Summary statistics.  
  - For **categorical features**: Count plots, Value counts, Bar plots.  

In [None]:
# Univariate Analysis of Numerical Features
plt.figure(figsize=(8,8))
sns.histplot(df["BMI"],kde=True )
plt.title("Distribution of BMI")
plt.xlabel("BMI")
plt.ylabel("Count")
plt.show()

### Distribution of BMI
- Most BMI values are centered around **25–27**.  
- The KDE curve shows a **right-skewed distribution** → some people have much higher BMI.  
- Outliers above **40** are present but still valid (extremely high BMI cases).


In [None]:
# Univariate Analysis for Categorical features

plt.figure(figsize=(5,5))
sns.countplot(x=df["KidneyDisease"],color='red')
plt.xlabel("Kidney Disease")
plt.ylabel("Count")
plt.title("Distribution of Kidney Diseases")
plt.show()

### Distribution of Kidney Diseases
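- The `KidneyDisease` feature is highly **imbalanced**:
  - **No** → ~290,000 individuals  
  - **Yes** → ~10,000 individuals  
- Most people do **not** have kidney disease.  
- Important: This is an imbalance in a feature, not in the target `HeartDisease`. It may still affect model training, because the model sees far fewer **Yes** examples from which to estimate the effect of kidney disease.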


---
### 2. Bivariate Analysis
- **Goal**: Understand relationships between two variables (especially with the target `HeartDisease`).  

**1. Numeric vs Numeric**
- Tools: Scatter plots, Correlation heatmaps.  
- Example: Relationship between `BMI` and `SleepTime`.  

**2. Numeric vs Categorical**
- Tools: Boxplots, Violin plots.  
- Example: Compare `BMI` distribution across HeartDisease = Yes/No.  

**3. Categorical vs Categorical**
- Tools: Countplots with `hue`, Crosstabs, Grouped bar plots.  
- Example: Compare `Smoking` frequency across HeartDisease = Yes/No.
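
Two of these comparisons can also be made numerically instead of visually. The sketch below uses a six-row toy DataFrame with made-up values; the column names match the dataset, but the numbers do not come from it.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Smoking": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "HeartDisease": ["Yes", "No", "No", "No", "Yes", "Yes"],
    "BMI": [31.0, 27.5, 22.0, 24.5, 29.0, 33.5],
})

# Categorical vs categorical: share of each HeartDisease value within each Smoking group
rates = pd.crosstab(toy["Smoking"], toy["HeartDisease"], normalize="index")

# Numeric vs categorical: mean BMI for each HeartDisease group
mean_bmi = toy.groupby("HeartDisease")["BMI"].mean()

print(rates)
print(mean_bmi)
```

Normalizing by row (`normalize="index"`) matters with an imbalanced target: raw counts are dominated by the **No** class, while row shares show the rate of heart disease within each group.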


In [None]:
# Categorical vs Categorical plot
plt.figure(figsize=(10,8))
sns.countplot(x='Smoking', hue='HeartDisease', data=df)
plt.title("Heart Disease Vs Smoking")
plt.show()

### 3. Correlation Analysis
- Goal: Identify **multicollinearity** or strong relationships between numeric features.  
- Tools: Correlation matrix, Heatmap.  
- Example: Check correlation between `PhysicalHealth`, `MentalHealth`, and `SleepTime`.


In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df[df_num].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()


### Correlation Heatmap (Numeric Features)

- Shows pairwise correlation between numeric variables.  
- Values range from **-1 to +1**:
  - **+1** → perfect positive correlation  
  - **-1** → perfect negative correlation  
  - **0** → no correlation  

Observations:
- `PhysicalHealth` and `MentalHealth` have a **weak-to-moderate positive correlation (~0.28)** → people reporting poor physical health often also report poor mental health.  
- `BMI` has very weak correlation with other features.  
- `SleepTime` has a weak negative correlation with `MentalHealth` (-0.12).  
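
The three reference points on the correlation scale can be checked with a small sketch on made-up arrays. It also shows a caveat: the coefficient measures linear association only, so a value of 0 does not rule out a non-linear relationship.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect positive linear relationship → correlation of +1
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# Perfect negative linear relationship → correlation of -1
r_neg = np.corrcoef(x, -3 * x)[0, 1]

# A symmetric U-shape: y depends on x, yet the linear correlation is 0
r_zero = np.corrcoef(x, (x - 3) ** 2)[0, 1]

print(r_pos, r_neg, r_zero)
```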


### 4. Class Balance
- Goal: Confirm **distribution of the target variable**.  
- Already checked → Highly imbalanced (91% No, 9% Yes).

In [None]:
sns.countplot(x = df["HeartDisease"])

### Data Preprocessing


In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns = 'HeartDisease')
y = df["HeartDisease"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,  # keep the Yes/No ratio the same in both splits
)


**Why Split Before Scaling/Encoding?**

- To avoid **data leakage** → test set must remain unseen.  
- To mimic **real-world use** → we train only on training data, then apply the same transformations on unseen data.  
- To keep **consistency** → 
  - Training set: `fit_transform()`  
  - Test set: `transform()`  
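
A minimal sketch of the `fit_transform()` / `transform()` pattern on made-up numbers. The variable names (`toy_scaler`, `train`, `test`) are illustrative and not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[20.0], [25.0], [30.0]])
test = np.array([[35.0]])

toy_scaler = StandardScaler()

# fit_transform: learn the mean and standard deviation from the training data only
train_scaled = toy_scaler.fit_transform(train)

# transform: reuse the training statistics; nothing is learned from the test data
test_scaled = toy_scaler.transform(test)

print(toy_scaler.mean_)  # [25.]
print(test_scaled)
```

The scaled test value lands outside the training range because it is measured against the training mean and standard deviation, which is exactly what happens with new data in production.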


In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder


# Ensure target is not in categorical feature list
if "HeartDisease" in df_cat:
    df_cat.remove("HeartDisease")

# Scale numeric columns
scaler = StandardScaler()
X_train[df_num] = scaler.fit_transform(X_train[df_num])
X_test[df_num] = scaler.transform(X_test[df_num])

# Encode categorical columns
encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')

X_train_encoded = pd.DataFrame(
    encoder.fit_transform(X_train[df_cat]),
    columns=encoder.get_feature_names_out(df_cat),
    index=X_train.index
)

X_test_encoded = pd.DataFrame(
    encoder.transform(X_test[df_cat]),
    columns=encoder.get_feature_names_out(df_cat),
    index=X_test.index
)

# Replace categorical with encoded features
X_train = X_train.drop(columns=df_cat).join(X_train_encoded)
X_test = X_test.drop(columns=df_cat).join(X_test_encoded)


**Scale Numeric Features**

   - Use `StandardScaler` → mean = 0, std = 1.
   - Prevents large-scale variables from dominating.
---

**Encode Categorical Features**

   - Use `OneHotEncoder`:
     - `drop='first'` → avoid dummy variable trap.
     - `sparse_output=False` → return a dense NumPy array, which the code above wraps in a DataFrame.
     - `handle_unknown='ignore'` → safe for unseen categories in test.
---
**Combine Features**
   - Drop original categorical columns.
   - Join encoded columns back to the dataset.
   - Now `X_train` and `X_test` are fully numeric → ready for Logistic Regression.

In [None]:
X_train.head(2)

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Train model
log_reg.fit(X_train, y_train)

In [None]:
# Predictions
y_pred = log_reg.predict(X_test)
y_pred_proba = log_reg.predict_proba(X_test)[:, 1]  # probabilities for ROC-AUC

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Precision, Recall, F1-score
print("Classification Report:\n", classification_report(y_test, y_pred))

# ROC-AUC
print("ROC-AUC Score:", roc_auc_score(y_test, y_pred_proba))


**Key Insight**
- High overall accuracy (91%) is misleading due to **class imbalance**.  
- Model performs well at predicting **No**, but poorly at predicting **Yes** (low recall = 10%).  
- ROC-AUC (0.83) shows the model has good discriminatory ability, but threshold tuning or resampling is needed to improve minority class detection.
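
Threshold tuning means replacing the default 0.5 cut-off used by `predict()` with a lower one applied to the predicted probabilities. The sketch below uses made-up labels and probabilities, and `predict_with_threshold` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.25, 0.40, 0.55, 0.70])

def predict_with_threshold(proba, threshold):
    """Label an observation positive when its probability reaches the threshold."""
    return (proba >= threshold).astype(int)

pred_default = predict_with_threshold(proba, 0.5)
pred_lowered = predict_with_threshold(proba, 0.3)

recall_default = recall_score(y_true, pred_default)        # 0.5
recall_lowered = recall_score(y_true, pred_lowered)        # 0.75
precision_default = precision_score(y_true, pred_default)  # 1.0
precision_lowered = precision_score(y_true, pred_lowered)  # 0.5

print(recall_default, recall_lowered, precision_default, precision_lowered)
```

Lowering the threshold catches more positives at the cost of more false alarms. With the fitted model, the same idea would be applied to `y_pred_proba`.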

In [None]:

# Initialize model with class weights inversely proportional to class frequencies
log_reg = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')

# Train model
log_reg.fit(X_train, y_train)

# Predictions
y_pred = log_reg.predict(X_test)
y_pred_proba = log_reg.predict_proba(X_test)[:, 1]  # probabilities for ROC-AUC

# Precision, Recall, F1-score
print("Classification Report:\n", classification_report(y_test, y_pred))


#### Updated Model Performance (Logistic Regression with `class_weight='balanced'`)

- **Accuracy**: 74%  
- **Classification Report**:
  - Class **No** → Precision = 0.97, Recall = 0.74, F1 = 0.84  
  - Class **Yes** → Precision = 0.23, Recall = 0.78, F1 = 0.35  

### Key Insights
- Accuracy dropped (from 91% → 74%) because the model is now focusing more on the minority class.  
- **Recall for Yes improved significantly (10% → 78%)** → model is catching many more positive cases.  
- Precision for Yes is lower (23%) → more false positives, but this trade-off is often acceptable in health-related predictions.  
- **Balanced performance**: The model is no longer biased toward predicting "No".
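
To see why `class_weight='balanced'` shifts attention to the minority class, the sketch below computes the weights scikit-learn would assign, using the formula `n_samples / (n_classes * class_count)`. The labels are synthetic (91 negatives, 9 positives), chosen to mirror the rough proportions in this dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 91 "No" (0) and 9 "Yes" (1)
y = np.array([0] * 91 + [1] * 9)

weights = compute_class_weight(
    class_weight="balanced",
    classes=np.array([0, 1]),
    y=y,
)

# Class 0: 100 / (2 * 91) ≈ 0.55, class 1: 100 / (2 * 9) ≈ 5.56
print(weights)
```

Each misclassified positive therefore costs roughly ten times as much as a misclassified negative during training, which pushes recall for **Yes** up and precision down.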


---
**Recall (Sensitivity / True Positive Rate)**
- Definition: Out of all the **actual positive cases**, how many did the model correctly identify as positive?  

**Precision (Positive Predictive Value)**
- Definition: Out of all the cases the model **predicted as positive**, how many were actually positive?  


**F1-score**
- Definition: The **harmonic mean** of Precision and Recall.  

- Why harmonic mean?
  - It punishes extreme imbalance between Precision and Recall.
  - Ensures a model must do well on **both** to get a good F1.
- Useful for imbalanced datasets where Accuracy is misleading.
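
In terms of confusion-matrix counts, Recall = TP / (TP + FN), Precision = TP / (TP + FP), and F1 = 2 × Precision × Recall / (Precision + Recall). The sketch below plugs in the rounded precision and recall values reported above; `f1_from` is an illustrative helper, not a library function.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values from the classification report of the class-weighted model
f1_yes = f1_from(0.23, 0.78)
f1_no = f1_from(0.97, 0.74)

# The arithmetic mean, for comparison, hides the weak precision on "Yes"
arithmetic_yes = (0.23 + 0.78) / 2

print(f"{f1_yes:.3f}")          # 0.355
print(f"{f1_no:.3f}")           # 0.840
print(f"{arithmetic_yes:.3f}")  # 0.505
```

The results are in line with the reported F1 values of 0.35 and 0.84; small differences come from the inputs already being rounded to two decimals.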

In [None]:
print(df.head())

In [None]:
male_list = df[df['sex'] == 1]
female_list = df[df['sex'] == 0]

print("Male List:")
display(male_list)

print("Female List:")
display(female_list)

In [None]:
import pandas as pd

df = pd.read_csv("heart_disease.csv")

print(df.head())
print(df.columns)


