# Logistic Regression

Logistic regression is binary classification - predicting one of two outcomes:

- Yes/No
- True/False
- 1/0
- Subscribed/Not Subscribed

**Key Difference from Linear Regression:**

- Linear: Predicts continuous values (y = θ₀ + θ₁*x)

- Logistic: Predicts probabilities between 0 and 1, then classifies

## The Model Equation
z = θ₀ + θ₁*x₁ + θ₂*x₂ + ... + θₙ*xₙ

y_pred = sigmoid(z) = 1 / (1 + e^(-z))

I'll Explain when ever a concept is used.

## Understanding the Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv("./data/bank-full.csv", sep=";")
data

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [3]:
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [4]:
data.shape

(45211, 17)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


In [6]:
data.describe()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
count,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0
mean,40.93621,1362.272058,15.806419,258.16308,2.763841,40.197828,0.580323
std,10.618762,3044.765829,8.322476,257.527812,3.098021,100.128746,2.303441
min,18.0,-8019.0,1.0,0.0,1.0,-1.0,0.0
25%,33.0,72.0,8.0,103.0,1.0,-1.0,0.0
50%,39.0,448.0,16.0,180.0,2.0,-1.0,0.0
75%,48.0,1428.0,21.0,319.0,3.0,-1.0,0.0
max,95.0,102127.0,31.0,4918.0,63.0,871.0,275.0


In [7]:
data.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In [8]:
print(f"   Total samples: {data.shape[0]}")
print(f"   Total columns: {data.shape[1]}")

   Total samples: 45211
   Total columns: 17


In [9]:
numerical_features = data.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = data.select_dtypes(include=['object']).columns.tolist()

print(f"Numerical_features: {numerical_features}\nNumerical Count: {len(numerical_features)}")
print(f" \n\nCategorical_features: {categorical_features}\nCategorical Count: {len(categorical_features)}")

Numerical_features: ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
Numerical Count: 7
 

Categorical_features: ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'y']
Categorical Count: 10


## Encoding 
ENCODING STRATEGY:
1. ORDINAL ENCODING: For features with inherent order
   - Examples: education (primary < secondary < tertiary)
   - Method: Map to integers preserving order
   
2. ONE-HOT ENCODING: For features with no inherent order
   - Examples: job, marital status
   - Method: Create binary columns for each category

## 1. Ordinal Encoding

In [10]:
df_encoded = data.copy()

if 'education' in df_encoded.columns:
    print(f"Before: {df_encoded['education'].unique()}") # Education has this natural order primary < secondary < tertiary

    education_mapping = {
        'primary.education': 1,
        'secondary.education': 2,
        'tertiary.education': 3,
        'unknown': 0  # Unknown defaults to 0
    }

    df_encoded['education'] = df_encoded['education'].map(education_mapping)
    print(f"\nAfter: {df_encoded['education'].unique()}")
    print(f"\nMapping: {education_mapping}")

Before: ['tertiary' 'secondary' 'unknown' 'primary']

After: [nan  0.]

Mapping: {'primary.education': 1, 'secondary.education': 2, 'tertiary.education': 3, 'unknown': 0}


In [11]:
categorical_to_encode = ['job', 'marital', 'default', 'housing', 'loan', 'contact', 'poutcome']

for feature in categorical_to_encode:
    if feature in df_encoded.columns:
        print(f"\n   {feature}: {df_encoded[feature].nunique()} categories")
        print(f"     Categories: {df_encoded[feature].unique()[:5]}...")

# One-hot encode categorical features (drop original column)
df_encoded = pd.get_dummies(df_encoded, columns=categorical_to_encode, drop_first=True)

print(f"\n   After one-hot encoding:")
print(f"   New shape: {df_encoded.shape} (added binary columns for categories)")

# Drop features that should not be in the model
features_to_drop = ['month', 'day_of_week', 'duration']  # These are not predictive in the context
df_encoded = df_encoded.drop(columns=features_to_drop, errors='ignore')

# Encode target variable
print(f"\n3. TARGET VARIABLE ENCODING:")
print(f"   Before: {df_encoded['y'].unique()}")
df_encoded['y'] = (df_encoded['y'] == 'yes').astype(int)
print(f"   After: {df_encoded['y'].unique()}")
print(f"   Mapping: yes → 1, no → 0")


   job: 12 categories
     Categories: ['management' 'technician' 'entrepreneur' 'blue-collar' 'unknown']...

   marital: 3 categories
     Categories: ['married' 'single' 'divorced']...

   default: 2 categories
     Categories: ['no' 'yes']...

   housing: 2 categories
     Categories: ['yes' 'no']...

   loan: 2 categories
     Categories: ['no' 'yes']...

   contact: 3 categories
     Categories: ['unknown' 'cellular' 'telephone']...

   poutcome: 4 categories
     Categories: ['unknown' 'failure' 'other' 'success']...

   After one-hot encoding:
   New shape: (45211, 31) (added binary columns for categories)

3. TARGET VARIABLE ENCODING:
   Before: ['no' 'yes']
   After: [0 1]
   Mapping: yes → 1, no → 0


In [12]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 29 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   age                45211 non-null  int64  
 1   education          1857 non-null   float64
 2   balance            45211 non-null  int64  
 3   day                45211 non-null  int64  
 4   campaign           45211 non-null  int64  
 5   pdays              45211 non-null  int64  
 6   previous           45211 non-null  int64  
 7   y                  45211 non-null  int64  
 8   job_blue-collar    45211 non-null  bool   
 9   job_entrepreneur   45211 non-null  bool   
 10  job_housemaid      45211 non-null  bool   
 11  job_management     45211 non-null  bool   
 12  job_retired        45211 non-null  bool   
 13  job_self-employed  45211 non-null  bool   
 14  job_services       45211 non-null  bool   
 15  job_student        45211 non-null  bool   
 16  job_technician     452

In [13]:
print(f"Class distribution:")
class_counts = df_encoded['y'].value_counts()
print(f"  Class 0 (no): {class_counts[0]} ({100*class_counts[0]/len(df_encoded):.2f}%)")
print(f"  Class 1 (yes): {class_counts[1]} ({100*class_counts[1]/len(df_encoded):.2f}%)")

if class_counts[1] / class_counts[0] < 0.3:
    print(f"\nDataset is IMBALANCED (minority class < 30%)")
else:
    print(f"\nDataset is reasonably balanced")

Class distribution:
  Class 0 (no): 39922 (88.30%)
  Class 1 (yes): 5289 (11.70%)

Dataset is IMBALANCED (minority class < 30%)


In [14]:
# correlation with target
correlations = df_encoded.corr()['y'].drop('y')
abs_correlations = correlations.abs().sort_values(ascending=False)

print(f"\nTop 10 features by absolute correlation:")
for i, (feature, corr) in enumerate(abs_correlations.head(10).items(), 1):
    print(f"  {i}. {feature}: {correlations[feature]:+.4f}")

# top 7 features
selected_features = abs_correlations.head(7).index.tolist()
print(f"\n✓ Selected 7 features:")
for i, feat in enumerate(selected_features, 1):
    print(f"  {i}. {feat} (corr: {correlations[feat]:+.4f})")


Top 10 features by absolute correlation:
  1. poutcome_success: +0.3068
  2. poutcome_unknown: -0.1671
  3. contact_unknown: -0.1509
  4. housing_yes: -0.1392
  5. pdays: +0.1036
  6. previous: +0.0932
  7. job_retired: +0.0792
  8. job_student: +0.0769
  9. campaign: -0.0732
  10. job_blue-collar: -0.0721

✓ Selected 7 features:
  1. poutcome_success (corr: +0.3068)
  2. poutcome_unknown (corr: -0.1671)
  3. contact_unknown (corr: -0.1509)
  4. housing_yes (corr: -0.1392)
  5. pdays (corr: +0.1036)
  6. previous (corr: +0.0932)
  7. job_retired (corr: +0.0792)


In [15]:
# Keep only selected features and target
df_selected = df_encoded[selected_features + ['y']].copy()

print(f"\nDataset before removing duplicates: {df_selected.shape}")
df_selected = df_selected.drop_duplicates()
print(f"Dataset after removing duplicates: {df_selected.shape}")
print(f"Duplicates removed: {df_encoded.shape[0] - df_selected.shape[0]}")


Dataset before removing duplicates: (45211, 8)
Dataset after removing duplicates: (4547, 8)
Duplicates removed: 40664


In [16]:
X = df_selected[selected_features].values

y = df_selected['y'].values

print(f"\nBefore split:")
print(f"  X shape: {X.shape}")
print(f"  y shape: {y.shape}")


Before split:
  X shape: (4547, 7)
  y shape: (4547,)


In [17]:
np.random.seed(42)

n_samples = len(X)

# Separate indices by class
class_0_idx = np.where(y == 0)[0]
class_1_idx = np.where(y == 1)[0]

split_0 = int(0.8 * len(class_0_idx))
split_1 = int(0.8 * len(class_1_idx))


np.random.shuffle(class_0_idx)
np.random.shuffle(class_1_idx)

# Create train and test indices
train_idx = np.concatenate([class_0_idx[:split_0], class_1_idx[:split_1]])
test_idx = np.concatenate([class_0_idx[split_0:], class_1_idx[split_1:]])

# Split data
X_train = X[train_idx]
y_train = y[train_idx]
X_test = X[test_idx]
y_test = y[test_idx]


In [19]:
print(f"\nAfter stratified split:")
print(f"  X_train shape: {X_train.shape}")
print(f"  y_train shape: {y_train.shape}")
print(f"  X_test shape: {X_test.shape}")
print(f"  y_test shape: {y_test.shape}")

print(f"\nClass distribution in train set:")
print(f"  Class 0: {np.sum(y_train == 0)} ({100*np.sum(y_train == 0)/len(y_train):.2f}%)")
print(f"  Class 1: {np.sum(y_train == 1)} ({100*np.sum(y_train == 1)/len(y_train):.2f}%)")

print(f"\nClass distribution in test set:")
print(f"  Class 0: {np.sum(y_test == 0)} ({100*np.sum(y_test == 0)/len(y_test):.2f}%)")
print(f"  Class 1: {np.sum(y_test == 1)} ({100*np.sum(y_test == 1)/len(y_test):.2f}%)")



After stratified split:
  X_train shape: (3637, 7)
  y_train shape: (3637,)
  X_test shape: (910, 7)
  y_test shape: (910,)

Class distribution in train set:
  Class 0: 2523 (69.37%)
  Class 1: 1114 (30.63%)

Class distribution in test set:
  Class 0: 631 (69.34%)
  Class 1: 279 (30.66%)


In [21]:
# Convert all to float first (fixes the error)
X_train = X_train.astype(float)
X_test = X_test.astype(float)

# Identify which features are numerical
# After one-hot encoding, numerical features are those that are NOT one-hot encoded
# One-hot encoded features contain '_' in their name
numerical_mask = np.array([not ('_' in feat) for feat in selected_features])

print(f"\nNumerical features to standardize:")
numerical_feat_names = [selected_features[i] for i in range(len(selected_features)) if numerical_mask[i]]
print(f"Found {np.sum(numerical_mask)} numerical features")
for feat in numerical_feat_names:
    print(f"  - {feat}")

# Standardize using training set statistics
X_train_mean = np.mean(X_train, axis=0)
X_train_std = np.std(X_train, axis=0)

# Avoid division by zero
X_train_std = np.where(X_train_std == 0, 1, X_train_std)

# Apply standardization only to numerical features
X_train_standardized = X_train.copy()
X_test_standardized = X_test.copy()

X_train_standardized[:, numerical_mask] = (X_train[:, numerical_mask] - X_train_mean[numerical_mask]) / (X_train_std[numerical_mask] + 1e-8)
X_test_standardized[:, numerical_mask] = (X_test[:, numerical_mask] - X_train_mean[numerical_mask]) / (X_train_std[numerical_mask] + 1e-8)

print(f"\nStandardization parameters (from training set):")
for i, feat in enumerate(selected_features):
    if numerical_mask[i]:
        print(f"  {feat}: mean={X_train_mean[i]:.4f}, std={X_train_std[i]:.4f}")



Numerical features to standardize:
Found 2 numerical features
  - pdays
  - previous

Standardization parameters (from training set):
  pdays: mean=212.9871, std=125.3330
  previous: mean=4.0839, std=6.1599


In [22]:
n_numerical = np.sum(numerical_mask)
fig, axes = plt.subplots(1, n_numerical, figsize=(15, 4))

if n_numerical == 1:
    axes = [axes]

for idx, (feat_idx, feat_name) in enumerate(zip(np.where(numerical_mask)[0], numerical_feat_names)):
    ax = axes[idx]
    ax.hist(X_train_standardized[:, feat_idx], bins=30, alpha=0.7, edgecolor='black')
    ax.set_xlabel(feat_name)
    ax.set_ylabel('Frequency')
    ax.set_title(f'Distribution of {feat_name}\n(after standardization)')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('task2_numerical_distributions.png', dpi=150, bbox_inches='tight')
print(f"✓ Histogram saved as 'task2_numerical_distributions.png'")
plt.close()

✓ Histogram saved as 'task2_numerical_distributions.png'


In [24]:
class LogisticRegression:
    """Binary Logistic Regression from scratch"""
    
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.theta = None
        self.cost_history = []
    
    def sigmoid(self, z):
        """Sigmoid activation function"""
        # Clip to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def fit(self, X, y):
        """
        Fit logistic regression model
        
        X: shape (m, n) - training features
        y: shape (m,) - binary labels (0 or 1)
        """
        m, n = X.shape
        
        # Add bias term
        X_with_bias = np.column_stack([np.ones(m), X])  # (m, n+1)
        
        # Initialize parameters
        self.theta = np.zeros(X_with_bias.shape[1])
        
        print(f"   DEBUG: X_with_bias shape: {X_with_bias.shape}")
        print(f"   DEBUG: theta shape: {self.theta.shape}")
        print(f"   DEBUG: m = {m}")
        
        # Gradient descent
        for iteration in range(self.n_iterations):
            # Predictions
            z = X_with_bias @ self.theta  # (m,)
            y_pred = self.sigmoid(z)  # (m,)
            
            # Errors
            errors = y_pred - y  # (m,)
            
            # Gradients (same as linear regression!)
            gradients = (1 / m) * (X_with_bias.T @ errors)  # (n+1,)
            
            # Update parameters
            self.theta -= self.learning_rate * gradients
            
            # Record cost (cross-entropy loss)
            if iteration % 100 == 0:
                # Cost function: -1/m * sum(y*log(y_pred) + (1-y)*log(1-y_pred))
                # Avoid log(0)
                y_pred_clipped = np.clip(y_pred, 1e-15, 1 - 1e-15)
                cost = -1/m * np.sum(y * np.log(y_pred_clipped) + (1 - y) * np.log(1 - y_pred_clipped))
                self.cost_history.append(cost)
        
        return self
    
    def predict_proba(self, X):
        """Predict probabilities"""
        m = len(X)
        X_with_bias = np.column_stack([np.ones(m), X])
        z = X_with_bias @ self.theta
        return self.sigmoid(z)

    def predict(self, X, threshold=0.5):
        """Predict binary labels"""
        return (self.predict_proba(X) >= threshold).astype(int)
    
    def get_params(self):
        """Return learned parameters"""
        return self.theta

In [25]:
learning_rates = [0.01, 0.1, 1.0]
log_reg_models = {}

for lr in learning_rates:
    print(f"\nTraining with learning_rate = {lr}")
    model = LogisticRegression(learning_rate=lr, n_iterations=1000)
    model.fit(X_train_standardized, y_train)
    log_reg_models[lr] = model
    
    # Make predictions
    y_train_pred = model.predict(X_train_standardized)
    y_test_pred = model.predict(X_test_standardized)
    
    # Calculate accuracy
    train_accuracy = np.mean(y_train_pred == y_train)
    test_accuracy = np.mean(y_test_pred == y_test)
    
    print(f"   Train Accuracy: {train_accuracy:.4f}")
    print(f"   Test Accuracy: {test_accuracy:.4f}")



Training with learning_rate = 0.01
   DEBUG: X_with_bias shape: (3637, 8)
   DEBUG: theta shape: (8,)
   DEBUG: m = 3637
   Train Accuracy: 0.6937
   Test Accuracy: 0.6934

Training with learning_rate = 0.1
   DEBUG: X_with_bias shape: (3637, 8)
   DEBUG: theta shape: (8,)
   DEBUG: m = 3637
   Train Accuracy: 0.7314
   Test Accuracy: 0.7253

Training with learning_rate = 1.0
   DEBUG: X_with_bias shape: (3637, 8)
   DEBUG: theta shape: (8,)
   DEBUG: m = 3637
   Train Accuracy: 0.7322
   Test Accuracy: 0.7209


In [26]:
print("TASK 2.4: EVALUATION METRICS")
print("="*80)

best_lr = 0.1
best_model = log_reg_models[best_lr]

# Predictions
y_train_pred = best_model.predict(X_train_standardized)
y_test_pred = best_model.predict(X_test_standardized)

print(f"\nUsing best model (lr={best_lr}):")


TASK 2.4: EVALUATION METRICS

Using best model (lr=0.1):


In [27]:
def compute_confusion_matrix(y_true, y_pred):
    """Compute confusion matrix"""
    TP = np.sum((y_true == 1) & (y_pred == 1))
    FP = np.sum((y_true == 0) & (y_pred == 1))
    FN = np.sum((y_true == 1) & (y_pred == 0))
    TN = np.sum((y_true == 0) & (y_pred == 0))
    return TP, FP, FN, TN

# Test set confusion matrix
TP, FP, FN, TN = compute_confusion_matrix(y_test, y_test_pred)

print(f"\n1. CONFUSION MATRIX (Test Set):")
print(f"\n   Predicted:     No(0)  Yes(1)")
print(f"   Actual:    No  {TN:5d}  {FP:5d}")
print(f"              Yes {FN:5d}  {TP:5d}")

print(f"\n   Explanation:")
print(f"   - TP (True Positive): {TP} - Correctly predicted yes")
print(f"   - FP (False Positive): {FP} - Incorrectly predicted yes")
print(f"   - FN (False Negative): {FN} - Incorrectly predicted no")
print(f"   - TN (True Negative): {TN} - Correctly predicted no")


1. CONFUSION MATRIX (Test Set):

   Predicted:     No(0)  Yes(1)
   Actual:    No    577     54
              Yes   196     83

   Explanation:
   - TP (True Positive): 83 - Correctly predicted yes
   - FP (False Positive): 54 - Incorrectly predicted yes
   - FN (False Negative): 196 - Incorrectly predicted no
   - TN (True Negative): 577 - Correctly predicted no


In [28]:
def compute_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1-score"""
    TP, FP, FN, TN = compute_confusion_matrix(y_true, y_pred)
    
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    return accuracy, precision, recall, f1

# Training set metrics
train_acc, train_prec, train_rec, train_f1 = compute_metrics(y_train, y_train_pred)

print(f"\n2. TRAINING SET METRICS:")
print(f"   Accuracy:  {train_acc:.4f}")
print(f"   Precision: {train_prec:.4f}")
print(f"   Recall:    {train_rec:.4f}")
print(f"   F1-Score:  {train_f1:.4f}")

# Test set metrics
test_acc, test_prec, test_rec, test_f1 = compute_metrics(y_test, y_test_pred)

print(f"\n3. TEST SET METRICS:")
print(f"   Accuracy:  {test_acc:.4f}")
print(f"   Precision: {test_prec:.4f}")
print(f"   Recall:    {test_rec:.4f}")
print(f"   F1-Score:  {test_f1:.4f}")

print(f"\n4. METRIC DEFINITIONS:")
print(f"""
   Accuracy: (TP + TN) / (TP + TN + FP + FN)
     → Overall correctness
     → Best for balanced datasets
   
   Precision: TP / (TP + FP)
     → Of predicted positives, how many are correct?
     → "Is my model right when it says YES?"
   
   Recall: TP / (TP + FN)
     → Of actual positives, how many did model find?
     → "Does model find all the YES cases?"
   
   F1-Score: 2 * (Precision * Recall) / (Precision + Recall)
     → Harmonic mean of precision and recall
     → Good for imbalanced datasets
""")


2. TRAINING SET METRICS:
   Accuracy:  0.7314
   Precision: 0.6163
   Recall:    0.3259
   F1-Score:  0.4263

3. TEST SET METRICS:
   Accuracy:  0.7253
   Precision: 0.6058
   Recall:    0.2975
   F1-Score:  0.3990

4. METRIC DEFINITIONS:

   Accuracy: (TP + TN) / (TP + TN + FP + FN)
     → Overall correctness
     → Best for balanced datasets

   Precision: TP / (TP + FP)
     → Of predicted positives, how many are correct?
     → "Is my model right when it says YES?"

   Recall: TP / (TP + FN)
     → Of actual positives, how many did model find?
     → "Does model find all the YES cases?"

   F1-Score: 2 * (Precision * Recall) / (Precision + Recall)
     → Harmonic mean of precision and recall
     → Good for imbalanced datasets



In [29]:
print("VISUALIZING COST FUNCTION")
print("="*80)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Cost function convergence
ax = axes[0]
for lr in learning_rates:
    ax.plot(log_reg_models[lr].cost_history, marker='o', label=f'lr={lr}')
ax.set_xlabel('Iterations (×100)')
ax.set_ylabel('Cross-Entropy Loss')
ax.set_title('Logistic Regression: Cost Function Convergence')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: ROC-like plot (prediction distribution)
ax = axes[1]
y_test_proba = best_model.predict_proba(X_test_standardized)
ax.hist(y_test_proba[y_test == 0], bins=30, alpha=0.6, label='Actual: No', color='blue')
ax.hist(y_test_proba[y_test == 1], bins=30, alpha=0.6, label='Actual: Yes', color='orange')
ax.axvline(x=0.5, color='red', linestyle='--', linewidth=2, label='Decision Boundary')
ax.set_xlabel('Predicted Probability')
ax.set_ylabel('Frequency')
ax.set_title('Prediction Probability Distribution (Test Set)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('task2_logistic_regression.png', dpi=150, bbox_inches='tight')
print("✓ Plots saved as 'task2_logistic_regression.png'")
plt.close()

VISUALIZING COST FUNCTION
✓ Plots saved as 'task2_logistic_regression.png'


In [31]:
print(f"""
✓ Task 2.2: Explored dataset ({data.shape[0]} samples, {data.shape[1]} features)
✓ Task 2.3A: Encoded categorical features (ordinal + one-hot)
✓ Task 2.3B: Analyzed class balance
✓ Task 2.3C: Selected 7 best features by correlation
✓ Task 2.3D: Removed duplicates
✓ Task 2.3E: Stratified train-test split (80-20)
✓ Task 2.3F: Standardized numerical features
✓ Task 2.3G: Plotted feature distributions
✓ Task 2.4: Built logistic regression from scratch
✓ Trained with 3 learning rates
✓ Computed all evaluation metrics
✓ Visualized results

FINAL RESULTS (Test Set with best lr={best_lr}):
  Accuracy:  {test_acc:.4f}
  Precision: {test_prec:.4f}
  Recall:    {test_rec:.4f}
  F1-Score:  {test_f1:.4f}
""")


✓ Task 2.2: Explored dataset (45211 samples, 17 features)
✓ Task 2.3A: Encoded categorical features (ordinal + one-hot)
✓ Task 2.3B: Analyzed class balance
✓ Task 2.3C: Selected 7 best features by correlation
✓ Task 2.3D: Removed duplicates
✓ Task 2.3E: Stratified train-test split (80-20)
✓ Task 2.3F: Standardized numerical features
✓ Task 2.3G: Plotted feature distributions
✓ Task 2.4: Built logistic regression from scratch
✓ Trained with 3 learning rates
✓ Computed all evaluation metrics
✓ Visualized results

FINAL RESULTS (Test Set with best lr=0.1):
  Accuracy:  0.7253
  Precision: 0.6058
  Recall:    0.2975
  F1-Score:  0.3990



this is question two results:

✓ Task 2.2: Explored dataset (45211 samples, 17 features) ✓ Task 2.3A: Encoded categorical features (ordinal + one-hot) ✓ Task 2.3B: Analyzed class balance ✓ Task 2.3C: Selected 7 best features by correlation ✓ Task 2.3D: Removed duplicates ✓ Task 2.3E: Stratified train-test split (80-20) ✓ Task 2.3F: Standardized numerical features ✓ Task 2.3G: Plotted feature distributions ✓ Task 2.4: Built logistic regression from scratch ✓ Trained with 3 learning rates ✓ Computed all evaluation metrics ✓ Visualized results  FINAL RESULTS (Test Set with best lr=0.1):   Accuracy:  0.7253   Precision: 0.6058   Recall:    0.2975   F1-Score:  0.3990



and you also has question3 resulsts from me, 