# Logistic Regression

Logistic regression is a binary classification model: it predicts one of two outcomes, for example:

- Yes/No
- True/False
- 1/0
- Subscribed/Not Subscribed

**Key Difference from Linear Regression:**

- Linear: Predicts continuous values (y = θ₀ + θ₁*x)

- Logistic: Predicts probabilities between 0 and 1, then classifies by thresholding (commonly at 0.5)

## The Model Equation
z = θ₀ + θ₁*x₁ + θ₂*x₂ + ... + θₙ*xₙ

y_pred = sigmoid(z) = 1 / (1 + e^(-z))
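
A minimal NumPy sketch of the two equations above. The parameter values and the 0.5 threshold are illustrative assumptions, not values fitted in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    """theta[0] is the intercept θ₀; theta[1:] are the feature weights."""
    z = theta[0] + X @ theta[1:]
    return sigmoid(z)

# Hypothetical parameters and two example rows with two features each
theta = np.array([0.0, 1.0, -1.0])
X_demo = np.array([[0.0, 0.0],
                   [2.0, 0.0]])

probs = predict_proba(X_demo, theta)
labels = (probs >= 0.5).astype(int)  # classify by thresholding the probability
print(probs, labels)
```

A score of z = 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual default threshold.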

I'll explain each concept when it is first used.

## Understanding the Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv("./data/bank-full.csv", sep=";")
data

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [3]:
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [4]:
data.shape

(45211, 17)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


In [6]:
data.describe()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
count,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0
mean,40.93621,1362.272058,15.806419,258.16308,2.763841,40.197828,0.580323
std,10.618762,3044.765829,8.322476,257.527812,3.098021,100.128746,2.303441
min,18.0,-8019.0,1.0,0.0,1.0,-1.0,0.0
25%,33.0,72.0,8.0,103.0,1.0,-1.0,0.0
50%,39.0,448.0,16.0,180.0,2.0,-1.0,0.0
75%,48.0,1428.0,21.0,319.0,3.0,-1.0,0.0
max,95.0,102127.0,31.0,4918.0,63.0,871.0,275.0


In [7]:
data.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In [8]:
print(f"   Total samples: {data.shape[0]}")
print(f"   Total columns: {data.shape[1]}")

   Total samples: 45211
   Total columns: 17


In [9]:
numerical_features = data.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = data.select_dtypes(include=['object']).columns.tolist()

print(f"Numerical_features: {numerical_features}\nNumerical Count: {len(numerical_features)}")
print(f" \n\nCategorical_features: {categorical_features}\nCategorical Count: {len(categorical_features)}")

Numerical_features: ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
Numerical Count: 7
 

Categorical_features: ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'y']
Categorical Count: 10


## Encoding 
ENCODING STRATEGY:
1. ORDINAL ENCODING: For features with inherent order
   - Examples: education (primary < secondary < tertiary)
   - Method: Map to integers preserving order
   
2. ONE-HOT ENCODING: For features with no inherent order
   - Examples: job, marital status
   - Method: Create binary columns for each category
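
A minimal sketch of both strategies on a made-up four-row frame (the `toy` data is illustrative, not taken from the bank dataset). `Series.map` silently returns NaN for any category missing from the mapping, so the sketch checks for unmapped values first.

```python
import pandas as pd

toy = pd.DataFrame({
    "education": ["primary", "tertiary", "secondary", "unknown"],
    "marital": ["married", "single", "divorced", "single"],
})

# Ordinal encoding: integers that preserve the natural order
education_order = {"unknown": 0, "primary": 1, "secondary": 2, "tertiary": 3}

# Guard: every category in the data must appear as a key in the mapping
unmapped = set(toy["education"]) - set(education_order)
if unmapped:
    raise ValueError(f"Unmapped categories: {unmapped}")
toy["education"] = toy["education"].map(education_order)

# One-hot encoding: one binary column per category, minus the first level
toy = pd.get_dummies(toy, columns=["marital"], drop_first=True)
print(toy.columns.tolist())
```

With `drop_first=True` the alphabetically first level (`divorced` here) becomes the implicit baseline, represented by all remaining marital columns being false.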

## 1. Ordinal Encoding

In [10]:
df_encoded = data.copy()

if 'education' in df_encoded.columns:
    print(f"Before: {df_encoded['education'].unique()}")

    # Education has a natural order: primary < secondary < tertiary.
    # The keys must match the values in the data exactly; any value
    # missing from the mapping becomes NaN after .map().
    education_mapping = {
        'primary': 1,
        'secondary': 2,
        'tertiary': 3,
        'unknown': 0  # Unknown defaults to 0
    }

    df_encoded['education'] = df_encoded['education'].map(education_mapping)
    print(f"\nAfter: {df_encoded['education'].unique()}")
    print(f"\nMapping: {education_mapping}")

Before: ['tertiary' 'secondary' 'unknown' 'primary']

After: [3 2 0 1]

Mapping: {'primary': 1, 'secondary': 2, 'tertiary': 3, 'unknown': 0}


## 2. One-Hot Encoding

In [11]:
categorical_to_encode = ['job', 'marital', 'default', 'housing', 'loan', 'contact', 'poutcome']

for feature in categorical_to_encode:
    if feature in df_encoded.columns:
        print(f"\n   {feature}: {df_encoded[feature].nunique()} categories")
        print(f"     Categories: {df_encoded[feature].unique()[:5]}...")

# One-hot encode categorical features; drop_first=True drops one level per
# feature so the remaining columns are not redundant
df_encoded = pd.get_dummies(df_encoded, columns=categorical_to_encode, drop_first=True)

print(f"\n   After one-hot encoding:")
print(f"   New shape: {df_encoded.shape} (added binary columns for categories)")

# Drop features that should not be in the model:
# - 'duration' is only known once the call has ended, so it leaks the outcome
# - 'month' is left unencoded and is dropped here
# ('day_of_week' is not a column in this file; errors='ignore' skips it)
features_to_drop = ['month', 'day_of_week', 'duration']
df_encoded = df_encoded.drop(columns=features_to_drop, errors='ignore')

# Encode target variable
print(f"\n3. TARGET VARIABLE ENCODING:")
print(f"   Before: {df_encoded['y'].unique()}")
df_encoded['y'] = (df_encoded['y'] == 'yes').astype(int)
print(f"   After: {df_encoded['y'].unique()}")
print(f"   Mapping: yes → 1, no → 0")


   job: 12 categories
     Categories: ['management' 'technician' 'entrepreneur' 'blue-collar' 'unknown']...

   marital: 3 categories
     Categories: ['married' 'single' 'divorced']...

   default: 2 categories
     Categories: ['no' 'yes']...

   housing: 2 categories
     Categories: ['yes' 'no']...

   loan: 2 categories
     Categories: ['no' 'yes']...

   contact: 3 categories
     Categories: ['unknown' 'cellular' 'telephone']...

   poutcome: 4 categories
     Categories: ['unknown' 'failure' 'other' 'success']...

   After one-hot encoding:
   New shape: (45211, 31) (added binary columns for categories)

3. TARGET VARIABLE ENCODING:
   Before: ['no' 'yes']
   After: [0 1]
   Mapping: yes → 1, no → 0


In [12]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 29 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   age                45211 non-null  int64  
 1   education          45211 non-null  int64  
 2   balance            45211 non-null  int64  
 3   day                45211 non-null  int64  
 4   campaign           45211 non-null  int64  
 5   pdays              45211 non-null  int64  
 6   previous           45211 non-null  int64  
 7   y                  45211 non-null  int64  
 8   job_blue-collar    45211 non-null  bool   
 9   job_entrepreneur   45211 non-null  bool   
 10  job_housemaid      45211 non-null  bool   
 11  job_management     45211 non-null  bool   
 12  job_retired        45211 non-null  bool   
 13  job_self-employed  45211 non-null  bool   
 14  job_services       45211 non-null  bool   
 15  job_student        45211 non-null  bool   
 16  job_technician     452

In [13]:
print(f"Class distribution:")
class_counts = df_encoded['y'].value_counts()
print(f"  Class 0 (no): {class_counts[0]} ({100*class_counts[0]/len(df_encoded):.2f}%)")
print(f"  Class 1 (yes): {class_counts[1]} ({100*class_counts[1]/len(df_encoded):.2f}%)")

if class_counts[1] / len(df_encoded) < 0.3:  # share of the minority class, matching the message below
    print(f"\nDataset is IMBALANCED (minority class < 30%)")
else:
    print(f"\nDataset is reasonably balanced")

Class distribution:
  Class 0 (no): 39922 (88.30%)
  Class 1 (yes): 5289 (11.70%)

Dataset is IMBALANCED (minority class < 30%)
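
To see why this imbalance matters, the sketch below uses the class counts printed above to compute the accuracy of a trivial baseline that always predicts the majority class. The names `counts` and `baseline_accuracy` are illustrative only.

```python
# Class counts as printed above
counts = {0: 39922, 1: 5289}
total = sum(counts.values())

# A model that always predicts the majority class is right this often
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"Always predicting class {majority_class}: accuracy = {baseline_accuracy:.4f}")
```

Any classifier trained on this data has to beat that figure before its accuracy means anything, which is why precision and recall on class 1 are usually more informative here.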


In [14]:
# correlation with target
correlations = df_encoded.corr()['y'].drop('y')
abs_correlations = correlations.abs().sort_values(ascending=False)

print(f"\nTop 10 features by absolute correlation:")
for i, (feature, corr) in enumerate(abs_correlations.head(10).items(), 1):
    print(f"  {i}. {feature}: {correlations[feature]:+.4f}")

# top 7 features
selected_features = abs_correlations.head(7).index.tolist()
print(f"\n✓ Selected 7 features:")
for i, feat in enumerate(selected_features, 1):
    print(f"  {i}. {feat} (corr: {correlations[feat]:+.4f})")


Top 10 features by absolute correlation:
  1. poutcome_success: +0.3068
  2. poutcome_unknown: -0.1671
  3. contact_unknown: -0.1509
  4. housing_yes: -0.1392
  5. pdays: +0.1036
  6. previous: +0.0932
  7. job_retired: +0.0792
  8. job_student: +0.0769
  9. campaign: -0.0732
  10. job_blue-collar: -0.0721

✓ Selected 7 features:
  1. poutcome_success (corr: +0.3068)
  2. poutcome_unknown (corr: -0.1671)
  3. contact_unknown (corr: -0.1509)
  4. housing_yes (corr: -0.1392)
  5. pdays (corr: +0.1036)
  6. previous (corr: +0.0932)
  7. job_retired (corr: +0.0792)
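
The ranking step above can be wrapped in a small helper. This is a sketch on made-up data; `top_k_by_abs_corr` and the `toy` frame are hypothetical names, not part of the notebook.

```python
import pandas as pd

def top_k_by_abs_corr(df, target, k):
    """Return the k column names with the largest |Pearson correlation| to target."""
    corr = df.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

toy = pd.DataFrame({
    "strong_pos": [0, 0, 1, 1, 1, 0],   # identical to y
    "strong_neg": [1, 1, 0, 0, 1, 1],   # mostly the opposite of y
    "noise":      [1, 0, 1, 0, 1, 0],   # weakly related to y
    "y":          [0, 0, 1, 1, 1, 0],
})

print(top_k_by_abs_corr(toy, "y", k=2))
```

Ranking by absolute value keeps strongly negative features such as `strong_neg`, since the sign only tells the direction of the relationship, not its strength.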


In [15]:
# Keep only selected features and target
df_selected = df_encoded[selected_features + ['y']].copy()

# With only 7 mostly binary features, many customers share identical rows,
# so drop_duplicates() removes most of the data.
n_before = df_selected.shape[0]
print(f"\nDataset before removing duplicates: {df_selected.shape}")
df_selected = df_selected.drop_duplicates()
print(f"Dataset after removing duplicates: {df_selected.shape}")
print(f"Duplicates removed: {n_before - df_selected.shape[0]}")


Dataset before removing duplicates: (45211, 8)
Dataset after removing duplicates: (4547, 8)
Duplicates removed: 40664
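
Dropping duplicates on a handful of coarse features also changes the class balance, because each distinct combination of features and label is kept only once. A toy sketch of the effect (the `toy` frame is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "housing_yes": [1, 1, 1, 1, 0, 0],
    "y":           [0, 0, 0, 1, 0, 1],
})

share_before = toy["y"].mean()          # 2 positives out of 6 rows
deduped = toy.drop_duplicates()         # keeps one row per distinct (feature, label) pair
share_after = deduped["y"].mean()

print(f"rows: {len(toy)} -> {len(deduped)}")
print(f"positive share: {share_before:.2f} -> {share_after:.2f}")
```

The same thing happens in this notebook: class 1 makes up 11.70% of the full dataset, but roughly 30% of the deduplicated rows used for the split below.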


In [16]:
X = df_selected[selected_features].values

y = df_selected['y'].values

print(f"\nBefore split:")
print(f"  X shape: {X.shape}")
print(f"  y shape: {y.shape}")


Before split:
  X shape: (4547, 7)
  y shape: (4547,)


In [17]:
np.random.seed(42)

n_samples = len(X)

# Separate indices by class
class_0_idx = np.where(y == 0)[0]
class_1_idx = np.where(y == 1)[0]

split_0 = int(0.8 * len(class_0_idx))
split_1 = int(0.8 * len(class_1_idx))


np.random.shuffle(class_0_idx)
np.random.shuffle(class_1_idx)

# Create train and test indices
train_idx = np.concatenate([class_0_idx[:split_0], class_1_idx[:split_1]])
test_idx = np.concatenate([class_0_idx[split_0:], class_1_idx[split_1:]])

# Split data
X_train = X[train_idx]
y_train = y[train_idx]
X_test = X[test_idx]
y_test = y[test_idx]
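
The same stratified split can be written as a reusable function. This is a sketch, not the notebook's code: it assumes `np.random.default_rng` instead of `np.random.seed`, so it will not reproduce the exact rows selected above, and `stratified_split` is a hypothetical name.

```python
import numpy as np

def stratified_split(X, y, train_fraction=0.8, seed=42):
    """Split so that each class keeps about the same share in train and test."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        rng.shuffle(idx)                        # shuffle within the class
        cut = int(train_fraction * len(idx))    # per-class split point
        train_parts.append(idx[:cut])
        test_parts.append(idx[cut:])
    train_idx = np.concatenate(train_parts)
    test_idx = np.concatenate(test_parts)
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 7 samples of class 0 and 3 of class 1
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 7 + [1] * 3)

X_tr, X_te, y_tr, y_te = stratified_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

Because the split point is computed per class, a rare class cannot end up entirely in one partition, which can happen with a plain random split on imbalanced data.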


In [18]:
print(f"\nAfter stratified split:")
print(f"  X_train shape: {X_train.shape}")
print(f"  y_train shape: {y_train.shape}")
print(f"  X_test shape: {X_test.shape}")
print(f"  y_test shape: {y_test.shape}")

print(f"\nClass distribution in train set:")
print(f"  Class 0: {np.sum(y_train == 0)} ({100*np.sum(y_train == 0)/len(y_train):.2f}%)")
print(f"  Class 1: {np.sum(y_train == 1)} ({100*np.sum(y_train == 1)/len(y_train):.2f}%)")

print(f"\nClass distribution in test set:")
print(f"  Class 0: {np.sum(y_test == 0)} ({100*np.sum(y_test == 0)/len(y_test):.2f}%)")
print(f"  Class 1: {np.sum(y_test == 1)} ({100*np.sum(y_test == 1)/len(y_test):.2f}%)")



After stratified split:
  X_train shape: (3637, 7)
  y_train shape: (3637,)
  X_test shape: (910, 7)
  y_test shape: (910,)

Class distribution in train set:
  Class 0: 2523 (69.37%)
  Class 1: 1114 (30.63%)

Class distribution in test set:
  Class 0: 631 (69.34%)
  Class 1: 279 (30.66%)
