
## Aim

1. To perform preprocessing, descriptive statistics, feature engineering and model building on the Chronic Kidney Disease (CKD) dataset.
2. To build and evaluate a Logistic Regression model for CKD classification.
3. To apply Logistic Regression on the USA Housing dataset (even though it is not meant for classification) to observe poor performance and understand why Logistic Regression is not suitable for regression problems.


We import the Python libraries required for:

- Data handling (pandas, numpy)
- Visualization (matplotlib, seaborn)
- ML preprocessing and modeling (scikit-learn)

Then we load the datasets directly from GitHub raw URLs.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix
)
from sklearn.linear_model import LogisticRegression


In [2]:
# CKD Dataset
ckd_url = "https://raw.githubusercontent.com/Tanishq-Choudhary/Tanishq-Choudhary-23FE10CSE00664-ML-Lab-Sem-6/main/data/chronic_kidney_disease_full.csv"
ckd_df = pd.read_csv(ckd_url)

# Clean CKD column names
ckd_df.columns = ckd_df.columns.str.strip().str.replace("'", "")

# Housing Dataset
housing_url = "https://raw.githubusercontent.com/Tanishq-Choudhary/Tanishq-Choudhary-23FE10CSE00664-ML-Lab-Sem-6/main/data/USA_Housing.csv"
housing_df = pd.read_csv(housing_url)

print("CKD Shape:", ckd_df.shape)
print("Housing Shape:", housing_df.shape)


CKD Shape: (400, 25)
Housing Shape: (5000, 7)


Before preprocessing, we inspect each dataset's:

- first rows
- columns
- datatypes
- missing values

In [3]:
print("===== CKD DATASET =====")
display(ckd_df.head())
print("\nCKD Columns:\n", ckd_df.columns)
print("\nCKD Info:")
ckd_df.info()

print("\nCKD Missing Values:")
display(ckd_df.isna().sum().sort_values(ascending=False).head(20))


===== CKD DATASET =====


index,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,121.0,...,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,,...,38,6000,,no,no,no,good,no,no,ckd
2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,423.0,...,31,7500,,no,yes,no,poor,no,yes,ckd
3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,106.0,...,35,7300,4.6,no,no,no,good,no,no,ckd



CKD Columns:
 Index(['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
       'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'class'],
      dtype='object')

CKD Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     391 non-null    float64
 1   bp      388 non-null    float64
 2   sg      353 non-null    float64
 3   al      354 non-null    float64
 4   su      351 non-null    float64
 5   rbc     248 non-null    object 
 6   pc      335 non-null    object 
 7   pcc     396 non-null    object 
 8   ba      396 non-null    object 
 9   bgr     356 non-null    float64
 10  bu      381 non-null    float64
 11  sc      383 non-null    float64
 12  sod     313 non-null    float64
 13  pot     312 non-null    float64
 14  hemo    348 non-null    float64
 15  pcv     330 non-nul

column,missing
rbc,152
rbcc,130
wbcc,105
pot,88
sod,87
pcv,70
pc,65
hemo,52
su,49
sg,47


In [4]:
print("===== HOUSING DATASET =====")
display(housing_df.head())
print("\nHousing Columns:\n", housing_df.columns)
print("\nHousing Info:")
housing_df.info()

print("\nHousing Missing Values:")
display(housing_df.isna().sum().sort_values(ascending=False).head(20))


===== HOUSING DATASET =====


index,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386



Housing Columns:
 Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')

Housing Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Avg. Area Income              5000 non-null   float64
 1   Avg. Area House Age           5000 non-null   float64
 2   Avg. Area Number of Rooms     5000 non-null   float64
 3   Avg. Area Number of Bedrooms  5000 non-null   float64
 4   Area Population               5000 non-null   float64
 5   Price                         5000 non-null   float64
 6   Address                       5000 non-null   object 
dtypes: float64(6), object(1)
memory usage: 273.6+ KB

Housing Missing Values:


column,missing
Avg. Area Income,0
Avg. Area House Age,0
Avg. Area Number of Rooms,0
Avg. Area Number of Bedrooms,0
Area Population,0
Price,0
Address,0


Before preprocessing, we must decide on the:

- Target column (y)
- Feature columns (X)

In [5]:
# CKD Target
print("CKD class unique values:", ckd_df["class"].unique())
print("CKD class value counts:\n", ckd_df["class"].value_counts())

# Housing: create classification target from Price
housing_df = housing_df.copy()

median_price = housing_df["Price"].median()
housing_df["Price_Class"] = (housing_df["Price"] > median_price).astype(int)

print("\nHousing Price median:", median_price)
print("Housing Price_Class value counts:\n", housing_df["Price_Class"].value_counts())


CKD class unique values: ['ckd' 'notckd' 'no']
CKD class value counts:
 class
ckd       250
notckd    149
no          1
Name: count, dtype: int64

Housing Price median: 1232669.3779657914
Housing Price_Class value counts:
 Price_Class
0    2500
1    2500
Name: count, dtype: int64
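
The median split used for `Price_Class` can be seen on a toy example. This is a minimal sketch with made-up prices (the values and the `toy` name are illustrative assumptions, not taken from the dataset). It shows that the strict `>` comparison sends a row sitting exactly on the median to class 0.

```python
import pandas as pd

# Hypothetical prices, for illustration only
toy = pd.DataFrame({"Price": [100.0, 200.0, 300.0, 400.0, 500.0]})

median_price = toy["Price"].median()  # middle value of the sorted prices

# Strictly above the median -> 1, at or below the median -> 0
toy["Price_Class"] = (toy["Price"] > median_price).astype(int)

print("Median:", median_price)
print("Classes:", toy["Price_Class"].tolist())
```

With an odd number of rows the median row itself lands in class 0, so the classes are not exactly equal. The 2500/2500 split reported above is consistent with an even row count where the median falls between two distinct prices.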


Descriptive statistics help us understand:

- central tendency (mean, median, mode)
- spread (std, min, max)
- whether features are skewed

In [6]:
def descriptive_stats(df, name):
    print(f"\n===== {name} DESCRIPTIVE STATISTICS =====")

    numeric_cols = df.select_dtypes(include=["int64", "float64"]).columns

    stats_df = pd.DataFrame({
        "Mean": df[numeric_cols].mean(),
        "Median": df[numeric_cols].median(),
        "Mode": df[numeric_cols].mode().iloc[0],
        "Std": df[numeric_cols].std(),
        "Min": df[numeric_cols].min(),
        "Max": df[numeric_cols].max(),
        "Missing": df[numeric_cols].isna().sum()
    })

    display(stats_df)

descriptive_stats(ckd_df, "CKD")
descriptive_stats(housing_df, "HOUSING")



===== CKD DESCRIPTIVE STATISTICS =====


feature,Mean,Median,Mode,Std,Min,Max,Missing
age,51.483376,55.0,60.0,17.169714,2.0,90.0,9
bp,76.469072,80.0,80.0,13.683637,50.0,180.0,12
sg,1.017408,1.02,1.02,0.005717,1.005,1.025,47
al,1.016949,0.0,0.0,1.352679,0.0,5.0,46
su,0.450142,0.0,0.0,1.099191,0.0,5.0,49
bgr,148.036517,121.0,99.0,79.281714,22.0,490.0,44
bu,57.425722,42.0,46.0,50.503006,1.5,391.0,19
sc,3.072454,1.3,1.2,5.741126,0.4,76.0,17
sod,137.528754,138.0,135.0,10.408752,4.5,163.0,87
pot,4.627244,4.4,3.5,3.193904,2.5,47.0,88



===== HOUSING DESCRIPTIVE STATISTICS =====


feature,Mean,Median,Mode,Std,Min,Max,Missing
Avg. Area Income,68583.11,68804.29,17796.63119,10657.991214,17796.63119,107701.7,0
Avg. Area House Age,5.977222,5.970429,2.644304,0.991456,2.644304,9.519088,0
Avg. Area Number of Rooms,6.987792,7.002902,3.236194,1.005833,3.236194,10.75959,0
Avg. Area Number of Bedrooms,3.98133,4.05,4.38,1.234137,2.0,6.5,0
Area Population,36163.52,36199.41,172.610686,9925.650114,172.610686,69621.71,0
Price,1232073.0,1232669.0,15938.657923,353117.626581,15938.657923,2469066.0,0
Price_Class,0.5,0.5,0.0,0.50005,0.0,1.0,0
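
The table above lets us judge skew only indirectly, by comparing mean and median (for example, `sc` has a mean of about 3.07 against a median of 1.3, which hints at a long right tail). The following is a minimal sketch of how skewness could be reported directly. The helper name `skew_report` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def skew_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return mean, median and sample skewness for each numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "Mean": numeric.mean(),
        "Median": numeric.median(),
        "Skew": numeric.skew(),  # > 0: long right tail, < 0: long left tail
    })

# Hypothetical values: one symmetric column, one with a large outlier
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 50.0],
})

report = skew_report(toy)
print(report)
```

In the notebook, the same helper could be called as `skew_report(ckd_df)` or `skew_report(housing_df)`.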


The CKD dataset has two kinds of columns:

- Numeric columns: float64
- Categorical columns: object

We must treat them differently:

- Numeric: impute missing values with the median
- Categorical: impute missing values with the mode, then encode

In [7]:
ckd_numeric_cols = ckd_df.select_dtypes(include=["int64", "float64"]).columns.tolist()
ckd_cat_cols = ckd_df.select_dtypes(include=["object"]).columns.tolist()

# Remove target column from categorical list
ckd_cat_cols.remove("class")

print("CKD Numeric Columns:", ckd_numeric_cols)
print("\nCKD Categorical Columns:", ckd_cat_cols)


CKD Numeric Columns: ['age', 'bp', 'sg', 'al', 'su', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo']

CKD Categorical Columns: ['rbc', 'pc', 'pcc', 'ba', 'pcv', 'wbcc', 'rbcc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane']


Machine learning needs consistent target labels.
The CKD dataset contains a single row labelled "no", which we take to mean the same as "notckd".
We merge it into "notckd" to get a clean binary classification target.

In [8]:
ckd_df = ckd_df.copy()

# Replace wrong label
ckd_df["class"] = ckd_df["class"].replace("no", "notckd")

print("Fixed CKD class counts:\n", ckd_df["class"].value_counts())
print("Unique labels:", ckd_df["class"].unique())


Fixed CKD class counts:
 class
ckd       250
notckd    150
Name: count, dtype: int64
Unique labels: ['ckd' 'notckd']


In the CKD dataset:

- Some columns (pcv, wbcc, rbcc) are stored as object
- They are actually numeric (packed cell volume, WBC count, RBC count)

We will:

- Convert them to numeric using pd.to_numeric(errors="coerce")
- Let any value that fails to convert become NaN, which we impute later

In [9]:
ckd_df = ckd_df.copy()

numeric_like_cols = ["pcv", "wbcc", "rbcc"]

for col in numeric_like_cols:
    ckd_df[col] = pd.to_numeric(ckd_df[col], errors="coerce")

print("Updated dtypes:\n")
display(ckd_df[numeric_like_cols].dtypes)

print("\nMissing values after conversion:")
display(ckd_df[numeric_like_cols].isna().sum())


Updated dtypes:



column,dtype
pcv,float64
wbcc,float64
rbcc,float64



Missing values after conversion:


column,missing
pcv,71
wbcc,106
rbcc,131
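
The missing counts for pcv, wbcc and rbcc each rose by one after conversion (70 to 71, 105 to 106, 130 to 131), which means one non-numeric entry per column was coerced to NaN. The sketch below shows the coercion on hypothetical raw strings. The stray tab and `?` placeholder are assumptions about what such entries might look like, not values confirmed from the file.

```python
import pandas as pd

# Hypothetical raw values: clean number, tab-prefixed number, placeholder, missing
raw = pd.Series(["44", "\t43", "\t?", None], name="pcv")

# Strip surrounding whitespace first, then coerce anything non-numeric to NaN
cleaned = pd.to_numeric(raw.str.strip(), errors="coerce")

print(cleaned.tolist())
print("Missing after conversion:", int(cleaned.isna().sum()))
```

Stripping before coercing keeps values that are numeric apart from stray whitespace, so only genuinely unparseable entries turn into NaN.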


After conversion, we must re-identify numeric and categorical columns, because pcv/wbcc/rbcc are now numeric.

In [10]:
ckd_numeric_cols = ckd_df.select_dtypes(include=["int64", "float64"]).columns.tolist()
ckd_cat_cols = ckd_df.select_dtypes(include=["object"]).columns.tolist()

# Remove target from features
ckd_cat_cols.remove("class")

print("CKD Numeric Columns:", ckd_numeric_cols)
print("\nCKD Categorical Columns:", ckd_cat_cols)


CKD Numeric Columns: ['age', 'bp', 'sg', 'al', 'su', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc']

CKD Categorical Columns: ['rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane']


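CKD has many missing values, so we impute them instead of dropping rows:

- Numeric columns: fill missing values with the median (robust to outliers)
- Categorical columns: fill missing values with the mode (most frequent value)

This keeps all 400 rows. Dropping incomplete rows would discard a large share of the data (rbc alone is missing in 152 rows), although imputation can still distort the feature distributions.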

In [11]:
ckd_df_filled = ckd_df.copy()

# Fill numeric with median
for col in ckd_numeric_cols:
    ckd_df_filled[col] = ckd_df_filled[col].fillna(ckd_df_filled[col].median())

# Fill categorical with mode
for col in ckd_cat_cols:
    ckd_df_filled[col] = ckd_df_filled[col].fillna(ckd_df_filled[col].mode()[0])

print("Missing values AFTER filling (should be 0):")
display(ckd_df_filled.isna().sum().sort_values(ascending=False).head(10))


Missing values AFTER filling (should be 0):


column,missing
age,0
bp,0
sg,0
al,0
su,0
rbc,0
pc,0
pcc,0
ba,0
bgr,0
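
The imputation above computes medians and modes on the full dataset, before any train/test split. One alternative, sketched here under the assumption that we want the test rows to stay unseen during preprocessing, is to learn the fill values from the training rows only. The toy table is hypothetical; `SimpleImputer` is scikit-learn's imputation transformer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical numeric feature table with gaps
toy = pd.DataFrame({
    "bgr": [100.0, np.nan, 120.0, 140.0, 400.0, np.nan, 110.0, 130.0],
    "sc": [1.0, 1.2, np.nan, 0.9, 7.5, 1.1, np.nan, 1.3],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # medians learned from training rows only
test_filled = imputer.transform(test)        # the same medians reused on the test rows

print("Learned medians:", imputer.statistics_)
```

For categorical columns, `SimpleImputer(strategy="most_frequent")` plays the same role as the mode fill.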


scikit-learn's Logistic Regression requires numeric input, so categorical columns must be converted into numeric form.

We use One-Hot Encoding, which:

- Converts each category into its own 0/1 (boolean) column
- Avoids imposing an artificial ordering on the categories

In [12]:
ckd_encoded = pd.get_dummies(ckd_df_filled, columns=ckd_cat_cols, drop_first=True)

# Encode target
ckd_encoded["class"] = ckd_encoded["class"].map({"ckd": 1, "notckd": 0})

print("CKD Encoded shape:", ckd_encoded.shape)
display(ckd_encoded.head())


CKD Encoded shape: (400, 31)


index,age,bp,sg,al,su,bgr,bu,sc,sod,pot,...,dm_ yes,dm_no,dm_yes,cad_no,cad_yes,appet_no,appet_poor,pe_no,pe_yes,ane_yes
0,48.0,80.0,1.02,1.0,0.0,121.0,36.0,1.2,138.0,4.4,...,False,False,True,True,False,False,False,True,False,False
1,7.0,50.0,1.02,4.0,0.0,121.0,18.0,0.8,138.0,4.4,...,False,True,False,True,False,False,False,True,False,False
2,62.0,80.0,1.01,2.0,3.0,423.0,53.0,1.8,138.0,4.4,...,False,False,True,True,False,False,True,True,False,True
3,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,2.5,...,False,True,False,True,False,False,True,False,True,True
4,51.0,80.0,1.01,2.0,0.0,106.0,26.0,1.4,138.0,4.4,...,False,True,False,True,False,False,False,True,False,False
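
The encoded header contains both `dm_ yes` and `dm_yes`, which suggests the raw `dm` column holds the same answer with different surrounding whitespace. A minimal sketch of normalising labels before encoding follows. The toy column is hypothetical, and whether lower-casing is also needed for the real data is an assumption.

```python
import pandas as pd

# Hypothetical column with the same kind of stray whitespace seen in `dm`
toy = pd.DataFrame({"dm": ["yes", " yes", "\tyes", "no", "no"]})

raw_encoded = pd.get_dummies(toy, columns=["dm"], drop_first=True)

# Normalise labels before encoding so variants collapse into one category
toy_clean = toy.assign(dm=toy["dm"].str.strip().str.lower())
clean_encoded = pd.get_dummies(toy_clean, columns=["dm"], drop_first=True)

print("Without cleaning:", list(raw_encoded.columns))
print("With cleaning:   ", list(clean_encoded.columns))
```

Applied to every categorical column before `pd.get_dummies`, this would leave one indicator per genuine category instead of several near-duplicates.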


We split the dataset into:

- X (features) = all columns except the target
- y (target) = class (1 = ckd, 0 = notckd)

In [13]:
X_ckd = ckd_encoded.drop("class", axis=1)
y_ckd = ckd_encoded["class"]

print("X_ckd shape:", X_ckd.shape)
print("y_ckd distribution:\n", y_ckd.value_counts())


X_ckd shape: (400, 30)
y_ckd distribution:
 class
1    250
0    150
Name: count, dtype: int64


We split the data into:

- 80% training
- 20% testing

We pass stratify=y_ckd to preserve the class ratio in both splits.

In [14]:
X_train_ckd, X_test_ckd, y_train_ckd, y_test_ckd = train_test_split(
    X_ckd, y_ckd,
    test_size=0.2,
    random_state=42,
    stratify=y_ckd
)

print("Train shape:", X_train_ckd.shape)
print("Test shape:", X_test_ckd.shape)
print("\nTrain target distribution:\n", y_train_ckd.value_counts())
print("\nTest target distribution:\n", y_test_ckd.value_counts())


Train shape: (320, 30)
Test shape: (80, 30)

Train target distribution:
 class
1    200
0    120
Name: count, dtype: int64

Test target distribution:
 class
1    50
0    30
Name: count, dtype: int64


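Logistic Regression works best when features are scaled, especially when:

- numeric columns have very different ranges (e.g., bgr vs sg)
- the model is fitted with an iterative, gradient-based solver such as lbfgs

The scaler is fitted on the training split only and then applied unchanged to the test split.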

In [15]:
scaler_ckd = StandardScaler()

X_train_scaled_ckd = scaler_ckd.fit_transform(X_train_ckd)
X_test_scaled_ckd = scaler_ckd.transform(X_test_ckd)

print("Scaled CKD Train shape:", X_train_scaled_ckd.shape)
print("Scaled CKD Test shape:", X_test_scaled_ckd.shape)


Scaled CKD Train shape: (320, 30)
Scaled CKD Test shape: (80, 30)
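
What `StandardScaler` computes can be checked by hand: each column is shifted by the training mean and divided by the training standard deviation. The sketch below uses made-up numbers on two very different scales (they only loosely imitate bgr and sg and are not taken from the dataset).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test values on very different scales
X_train = np.array([[100.0, 1.010], [150.0, 1.020], [400.0, 1.005], [120.0, 1.025]])
X_test = np.array([[200.0, 1.015]])

scaler = StandardScaler().fit(X_train)

# Manual z-score using the training mean and population standard deviation
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # ddof=0, which matches StandardScaler
manual_test = (X_test - mean) / std

print(np.allclose(scaler.transform(X_test), manual_test))
```

The test row is transformed with statistics from the training rows, which is why the notebook calls `fit_transform` on the training split and only `transform` on the test split.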


Feature selection helps to:

- reduce noise
- improve generalization
- reduce overfitting
- make the model simpler

We compute the correlation of each feature with the target and keep the top features.

In [16]:
ckd_corr = ckd_encoded.corr(numeric_only=True)["class"].sort_values(ascending=False)

print("Top 15 positively correlated features with CKD:\n")
display(ckd_corr.head(15))

print("\nTop 15 negatively correlated features with CKD:\n")
display(ckd_corr.tail(15))


Top 15 positively correlated features with CKD:



feature,class
class,1.0
htn_yes,0.590438
dm_yes,0.549778
al,0.531562
appet_poor,0.393341
bgr,0.379321
pe_yes,0.375154
bu,0.369393
ane_yes,0.325396
su,0.294555



Top 15 negatively correlated features with CKD:



feature,class
wbcc,0.177571
pot,0.065218
dm_\tyes,0.05491
dm_ yes,0.038778
appet_no,-0.064631
cad_no,-0.243599
rbc_normal,-0.282642
sod,-0.3349
pe_no,-0.365101
pc_normal,-0.375154


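We select the top N features by absolute correlation with the target, then train Logistic Regression using only those features.
The correlations are computed on the full dataset, so test rows also influence which features are chosen; the resulting test score should therefore be read as optimistic.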

In [17]:
top_n = 15

top_features = ckd_corr.drop("class").abs().sort_values(ascending=False).head(top_n).index.tolist()
print("Selected Top Features:", top_features)

X_ckd_selected = ckd_encoded[top_features]
y_ckd = ckd_encoded["class"]

X_train_ckd, X_test_ckd, y_train_ckd, y_test_ckd = train_test_split(
    X_ckd_selected, y_ckd,
    test_size=0.2,
    random_state=42,
    stratify=y_ckd
)

scaler_ckd = StandardScaler()
X_train_scaled_ckd = scaler_ckd.fit_transform(X_train_ckd)
X_test_scaled_ckd = scaler_ckd.transform(X_test_ckd)

print("Selected feature train shape:", X_train_scaled_ckd.shape)


Selected Top Features: ['hemo', 'pcv', 'sg', 'htn_yes', 'dm_no', 'rbcc', 'dm_yes', 'al', 'appet_poor', 'bgr', 'pc_normal', 'pe_yes', 'bu', 'pe_no', 'sod']
Selected feature train shape: (320, 15)
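
A variant of the selection step that ranks features using the training split only is sketched below. It runs on synthetic data, and the column names `signal`, `noise_a` and `noise_b` are hypothetical. The point is the order of operations: split first, then compute correlations on `X_train` and `y_train`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
y = pd.Series(rng.integers(0, 2, size=n), name="class")

# Hypothetical features: one tracks the target, two are pure noise
X = pd.DataFrame({
    "signal": y + rng.normal(0, 0.1, size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Rank features by |correlation| with the target on the training split only
train_corr = X_train.corrwith(y_train).abs().sort_values(ascending=False)
top_features = train_corr.head(1).index.tolist()

X_train_sel = X_train[top_features]
X_test_sel = X_test[top_features]  # same columns, chosen without seeing test rows
print(top_features)
```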


We now train the Logistic Regression model on the selected, scaled CKD features.

In [18]:
ckd_model = LogisticRegression(
    max_iter=2000,
    solver="lbfgs",
    class_weight="balanced",
    random_state=42
)

ckd_model.fit(X_train_scaled_ckd, y_train_ckd)

print("CKD Logistic Regression model trained successfully.")


CKD Logistic Regression model trained successfully.
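
The `class_weight="balanced"` setting reweights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces those weights for labels with the same counts as the CKD training split (120 zeros, 200 ones). The label array is rebuilt by hand here rather than taken from the notebook's variables.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the same counts as the CKD training split: 120 zeros, 200 ones
y_train = np.array([0] * 120 + [1] * 200)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)

# Same rule written out: n_samples / (n_classes * count_per_class)
manual = len(y_train) / (2 * np.bincount(y_train))

print(dict(zip([0, 1], weights.round(3))))
```

The minority class (notckd) gets the larger weight, so its errors count more during fitting.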


We evaluate the model using:

- Accuracy
- Confusion Matrix
- Precision, Recall, F1-score (Classification Report)


In [19]:
y_pred_ckd = ckd_model.predict(X_test_scaled_ckd)

print("CKD Accuracy:", accuracy_score(y_test_ckd, y_pred_ckd))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test_ckd, y_pred_ckd)
display(cm)

print("\nClassification Report:")
print(classification_report(y_test_ckd, y_pred_ckd))


CKD Accuracy: 1.0

Confusion Matrix:


array([[30,  0],
       [ 0, 50]])


Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        30
           1       1.00      1.00      1.00        50

    accuracy                           1.00        80
   macro avg       1.00      1.00      1.00        80
weighted avg       1.00      1.00      1.00        80
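
A perfect score on a single 80-row test split is worth cross-checking. One common way, sketched here on synthetic data rather than the CKD features, is stratified k-fold cross-validation with the scaler placed inside a pipeline so that it is refit on each training fold. The dataset from `make_classification` is a stand-in and its parameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CKD features (400 rows, 15 columns); not the real data
X, y = make_classification(
    n_samples=400, n_features=15, n_informative=5, random_state=42
)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000, class_weight="balanced"),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

In the notebook, `X` and `y` would be replaced by the CKD feature matrix and target; the spread across folds then shows how stable the result is.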



The Housing dataset contains:

- 6 numeric columns
- 1 text column, Address

Logistic Regression cannot work with text directly, and Address is not used as a predictor here, so we drop it.

We already created Price_Class earlier.

In [20]:
housing_clean = housing_df.copy()

# Drop Address column (not useful for numeric ML here)
housing_clean = housing_clean.drop("Address", axis=1)

print("Housing dataset after dropping Address:", housing_clean.shape)
display(housing_clean.head())


Housing dataset after dropping Address: (5000, 7)


index,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Price_Class
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,0
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,1
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,0
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,1
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,0


We separate:

- X = all numeric predictors (Price is excluded because Price_Class is derived from it)
- y = Price_Class (0/1)

In [21]:
X_house = housing_clean.drop(["Price", "Price_Class"], axis=1)
y_house = housing_clean["Price_Class"]

print("X_house shape:", X_house.shape)
print("y_house distribution:\n", y_house.value_counts())


X_house shape: (5000, 5)
y_house distribution:
 Price_Class
0    2500
1    2500
Name: count, dtype: int64


We use an 80/20 split with stratification so that both splits keep the 50/50 class balance.

In [22]:
X_train_house, X_test_house, y_train_house, y_test_house = train_test_split(
    X_house, y_house,
    test_size=0.2,
    random_state=42,
    stratify=y_house
)

print("Train shape:", X_train_house.shape)
print("Test shape:", X_test_house.shape)


Train shape: (4000, 5)
Test shape: (1000, 5)


The Housing features have large scale differences:

- income ~ 70k
- population ~ 36k
- rooms ~ 7

Scaling helps Logistic Regression converge properly.

In [23]:
scaler_house = StandardScaler()

X_train_scaled_house = scaler_house.fit_transform(X_train_house)
X_test_scaled_house = scaler_house.transform(X_test_house)

print("Scaled train shape:", X_train_scaled_house.shape)
print("Scaled test shape:", X_test_scaled_house.shape)


Scaled train shape: (4000, 5)
Scaled test shape: (1000, 5)


We now train Logistic Regression on this derived binary target.

The model runs, but the dataset was built for regression, so the results may be weak or misleading with respect to the original price-prediction task.

In [24]:
house_model = LogisticRegression(
    max_iter=2000,
    solver="lbfgs",
    random_state=42
)

house_model.fit(X_train_scaled_house, y_train_house)

print("Housing Logistic Regression trained successfully.")


Housing Logistic Regression trained successfully.
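
The reason Logistic Regression cannot answer the original question is visible in its outputs: `predict` returns class labels and `predict_proba` returns probabilities between 0 and 1, never a price. The sketch below demonstrates this on a synthetic one-feature dataset. The price formula and its constants are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical single feature and a continuous "price" that depends on it
X = rng.normal(size=(500, 1))
price = 1_000_000 + 300_000 * X[:, 0] + rng.normal(0, 50_000, size=500)
price_class = (price > np.median(price)).astype(int)

model = LogisticRegression().fit(X, price_class)

labels = model.predict(X[:5])             # class labels, 0 or 1
probs = model.predict_proba(X[:5])[:, 1]  # P(class = 1), always in [0, 1]

print(labels)
print(probs.round(3))
```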


We evaluate using:

- Accuracy
- Confusion matrix
- Precision, Recall, F1-score


In [25]:
y_pred_house = house_model.predict(X_test_scaled_house)

print("Housing Accuracy:", accuracy_score(y_test_house, y_pred_house))

print("\nConfusion Matrix:")
cm_house = confusion_matrix(y_test_house, y_pred_house)
display(cm_house)

print("\nClassification Report:")
print(classification_report(y_test_house, y_pred_house))


Housing Accuracy: 0.916

Confusion Matrix:


array([[449,  51],
       [ 33, 467]])


Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.90      0.91       500
           1       0.90      0.93      0.92       500

    accuracy                           0.92      1000
   macro avg       0.92      0.92      0.92      1000
weighted avg       0.92      0.92      0.92      1000
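
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, recall and F1 for class 1 from the matrix reported above (rows are true classes, columns are predicted classes, as in scikit-learn's `confusion_matrix`).

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[449, 51],
               [33, 467]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)  # of predicted class 1, how many were truly class 1
recall_1 = tp / (tp + fn)     # of true class 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.3f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

The 51 false positives are below-median houses predicted as above-median, and the 33 false negatives are the reverse.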



We print the final results of both models together for lab documentation.

In [26]:
print("===== FINAL RESULTS SUMMARY =====\n")

print("CKD Logistic Regression:")
print("Accuracy:", accuracy_score(y_test_ckd, y_pred_ckd))
print("Confusion Matrix:\n", confusion_matrix(y_test_ckd, y_pred_ckd))

print("\n--------------------------------\n")

print("Housing Logistic Regression (Price_Class):")
print("Accuracy:", accuracy_score(y_test_house, y_pred_house))
print("Confusion Matrix:\n", confusion_matrix(y_test_house, y_pred_house))


===== FINAL RESULTS SUMMARY =====

CKD Logistic Regression:
Accuracy: 1.0
Confusion Matrix:
 [[30  0]
 [ 0 50]]

--------------------------------

Housing Logistic Regression (Price_Class):
Accuracy: 0.916
Confusion Matrix:
 [[449  51]
 [ 33 467]]


## Conclusion

In this lab, we performed preprocessing, descriptive statistics, feature engineering, feature scaling and feature selection on two datasets: Chronic Kidney Disease (CKD) and USA Housing.

### CKD Dataset
- The CKD dataset contained missing values and categorical features.
- Missing values were handled using median (numeric) and mode (categorical).
- Categorical features were encoded using One-Hot Encoding.
- Features were scaled using StandardScaler.
- Logistic Regression reached an accuracy of 1.0 on the 80-row test split, with no misclassifications in the confusion matrix.
- This suggests that Logistic Regression is well suited to this binary classification task, though a single small test split is limited evidence of how well it would generalize.

### USA Housing Dataset
- The original dataset is meant for regression (predicting continuous house price).
- To apply Logistic Regression, we converted the problem into classification by creating a new target:
  - 1 = House price above median
  - 0 = House price below/equal median
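- Contrary to the poor performance anticipated in the Aim, Logistic Regression reached an accuracy of 0.916 on the test split, which indicates that the features separate above-median from below-median prices well with a linear decision boundary.
- However, this does not solve the original regression problem of predicting the exact house price.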

### Final Learning Outcome
- Logistic Regression is best suited for classification tasks.
- Proper preprocessing (handling missing values, encoding, scaling) is critical for good results.
- Feature selection helps identify the most important predictors.
- Regression datasets should ideally use regression models (Linear Regression, Random Forest Regressor, etc.), not Logistic Regression.
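
For comparison, a regression model predicts the continuous target directly and is scored with regression metrics. The following is a minimal sketch using `LinearRegression` on synthetic data. The coefficients, noise level and feature count are assumptions and do not come from the USA Housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the housing features and price; not the real data
X = rng.normal(size=(1000, 5))
true_coefs = np.array([230_000, 165_000, 120_000, 2_000, 150_000])  # assumed values
price = 1_200_000 + X @ true_coefs + rng.normal(0, 100_000, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)

print("R^2:", round(r2_score(y_test, pred), 3))
print("MAE:", round(mean_absolute_error(y_test, pred)))
```

Unlike accuracy on `Price_Class`, R^2 and mean absolute error measure how close the predicted prices are to the actual ones.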
