Design a logistic regression model to predict the likelihood of individuals defaulting on a loan based on various personal and financial features. The dataset contains information about 100 individuals, including their age, income, credit score, education level, and marital status, along with a binary target variable indicating whether they defaulted on a loan.

Required tasks:

Data Understanding: Analyze the dataset to understand the relationships between the features and the target variable (loan default).

Data Preprocessing: Clean the dataset and handle categorical variables using one-hot encoding.
Normalize or standardize numerical features if necessary.

Model Development:

1. Implement a logistic regression algorithm from scratch without using any predefined functions or libraries for logistic regression.
2. Repeat the task using a predefined logistic regression model function.

Split the data into training and testing in the ratio 80:20, and train the model on the training dataset.

Model Evaluation:

1. Evaluate the model’s performance on the remaining test dataset by calculating accuracy.
2. Compare the accuracy of the model with and without the predefined logistic regression functions.
3. Summarize your observations in a paragraph or bullet points.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
# Load the data
data = pd.read_csv('./data.csv')

In [None]:
# Data Understanding
print(data.head())
print(data.info())
print(data.describe())

   Age  Income  Credit_Score Education_Level Marital_Status  Default
0   57   33748           664        Master's        Married        1
1   28   33561           422     High School         Single        1
2   38   63159           715     High School        Married        0
3   31   51959           337        Master's         Single        1
4   54   53419           301     High School        Married        1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Age              100 non-null    int64 
 1   Income           100 non-null    int64 
 2   Credit_Score     100 non-null    int64 
 3   Education_Level  100 non-null    object
 4   Marital_Status   100 non-null    object
 5   Default          100 non-null    int64 
dtypes: int64(4), object(2)
memory usage: 4.8+ KB
None
             Age         Income  Credit_Score     Default
count  100

In [None]:
# Check for missing values
print(data.isnull().sum())

Age                0
Income             0
Credit_Score       0
Education_Level    0
Marital_Status     0
Default            0
dtype: int64


In [None]:

# Analyze the relationship between features and target variable
print(data.groupby('Default').mean(numeric_only=True))

               Age        Income  Credit_Score
Default                                       
0        39.264706  81447.647059    726.205882
1        41.712121  72651.560606    485.863636
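
Since accuracy is the only metric reported later, it helps to know how balanced the target is before reading the scores. The following is a minimal sketch, assuming a 0/1 label array; the function name `majority_baseline` and the toy labels are illustrative and not part of the original notebook.

```python
import numpy as np

def majority_baseline(y):
    """Return (majority_class, accuracy of always predicting that class)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    best = counts.argmax()
    return int(classes[best]), counts[best] / y.size

# Toy labels for illustration only (not the notebook's data)
toy_y = [1, 1, 0, 1, 1, 0]
cls, acc = majority_baseline(toy_y)
print(f"Majority class: {cls}, baseline accuracy: {acc:.2f}")
```

In this notebook the call would be `majority_baseline(data['Default'])`; a model's accuracy is only informative if it clearly exceeds that baseline.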


In [None]:
# Data Preprocessing
# One-hot encoding for categorical variables
data_encoded = pd.get_dummies(data, columns=['Education_Level', 'Marital_Status'])
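
With full one-hot encoding, the dummy columns of each categorical variable always sum to 1, so one column per variable is redundant once the model has an intercept. A possible variant, shown here on a small hypothetical frame (`toy` and its values are made up for illustration), drops the first level of each variable:

```python
import pandas as pd

# Toy frame with the same categorical columns as the notebook (values are illustrative)
toy = pd.DataFrame({
    "Age": [57, 28, 38],
    "Education_Level": ["Master's", "High School", "High School"],
    "Marital_Status": ["Married", "Single", "Married"],
})

# drop_first=True removes the first (alphabetically sorted) level of each variable;
# dtype=int yields 0/1 integers instead of booleans
toy_encoded = pd.get_dummies(
    toy,
    columns=["Education_Level", "Marital_Status"],
    drop_first=True,
    dtype=int,
)
print(toy_encoded.columns.tolist())
```

This is optional for a regularized model, and it is not what the cell above does; the notebook keeps every dummy column.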

In [None]:
# Separate features and target variable
X = data_encoded.drop(['Default'], axis=1)
y = data_encoded['Default']

In [None]:
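# Standardize every column, including the one-hot dummies (zero mean, unit variance).
# Note: the scaler is fit on the full dataset before the split, so test-set
# statistics influence the scaling.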
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
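
A common alternative ordering is to split first and compute the scaling statistics from the training rows only, so that nothing about the test set leaks into preprocessing. The sketch below does this with plain NumPy on synthetic data; `standardize_train_test` is a hypothetical helper, not part of the notebook. With scikit-learn the equivalent would be `scaler.fit_transform(X_train)` followed by `scaler.transform(X_test)`.

```python
import numpy as np

def standardize_train_test(X_train, X_test):
    """Scale both splits using the mean and std of the training split only."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X_train - mean) / std, (X_test - mean) / std

# Synthetic data for illustration only
rng = np.random.default_rng(0)
toy_X = rng.normal(loc=50.0, scale=10.0, size=(10, 3))
toy_train, toy_test = toy_X[:8], toy_X[8:]
train_scaled, test_scaled = standardize_train_test(toy_train, toy_test)
print(train_scaled.mean(axis=0).round(6))
```

The training columns end up with zero mean and unit variance; the test columns generally do not, which is expected because they are scaled with the training statistics.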

In [None]:
# Logistic Regression from scratch
class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

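        for _ in range(self.num_iterations):
            # Forward pass: predicted probabilities for the current parameters
            linear_model = np.dot(X, self.weights) + self.bias
            y_predicted = self.sigmoid(linear_model)

            # Gradients of the mean log loss with respect to weights and bias
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)

            # Gradient descent update
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db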

    def predict(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = self.sigmoid(linear_model)
        return [1 if i > 0.5 else 0 for i in y_predicted]
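
The `dw` and `db` expressions in `fit` are the gradients of the mean binary cross-entropy (log loss). One way to sanity-check that, sketched here on synthetic data, is to compare them with central finite differences of the loss. All function names and the toy arrays below are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b, X, y):
    """Mean binary cross-entropy for weights w and bias b."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guards against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def analytic_grad(w, b, X, y):
    """Same expressions as dw and db in the fit method."""
    err = sigmoid(X @ w + b) - y
    return X.T @ err / len(y), err.mean()

def numeric_grad_w(w, b, X, y, h=1e-6):
    """Central finite differences of log_loss with respect to each weight."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (log_loss(w + step, b, X, y) - log_loss(w - step, b, X, y)) / (2 * h)
    return grad

# Synthetic data for illustration only
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(30, 4))
toy_y = (rng.random(30) > 0.5).astype(float)
w0, b0 = rng.normal(size=4) * 0.1, 0.0

dw, db = analytic_grad(w0, b0, toy_X, toy_y)
max_gap = np.max(np.abs(dw - numeric_grad_w(w0, b0, toy_X, toy_y)))
print(f"max |analytic - numeric| = {max_gap:.2e}")
```

A gap close to floating-point noise indicates that the update rule in the class follows the gradient of the log loss.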

In [None]:
# Train and evaluate the scratch model
model_scratch = LogisticRegressionScratch()
model_scratch.fit(X_train, y_train)
y_pred_scratch = model_scratch.predict(X_test)
accuracy_scratch = accuracy_score(y_test, y_pred_scratch)
print(f"Accuracy of scratch model: {accuracy_scratch}")

Accuracy of scratch model: 0.95


In [None]:
# Train and evaluate using predefined logistic regression
model_predefined = LogisticRegression()
model_predefined.fit(X_train, y_train)
y_pred_predefined = model_predefined.predict(X_test)
accuracy_predefined = accuracy_score(y_test, y_pred_predefined)
print(f"Accuracy of predefined model: {accuracy_predefined}")

Accuracy of predefined model: 0.9
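
Accuracy alone does not show which kind of error a model makes (missed defaults versus false alarms). A minimal sketch of confusion counts with precision and recall, assuming 0/1 labels where 1 means default; the helper names and toy labels are hypothetical. scikit-learn offers `sklearn.metrics.confusion_matrix` for the same purpose.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        "tp": int(np.sum((y_true == 1) & (y_pred == 1))),
        "fp": int(np.sum((y_true == 0) & (y_pred == 1))),
        "fn": int(np.sum((y_true == 1) & (y_pred == 0))),
        "tn": int(np.sum((y_true == 0) & (y_pred == 0))),
    }

def precision_recall(counts):
    """Precision and recall for the positive class; 0.0 when undefined."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels for illustration only
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1]
counts = confusion_counts(toy_true, toy_pred)
print(counts, precision_recall(counts))
```

In the notebook these helpers could be applied to `(y_test, y_pred_scratch)` and `(y_test, y_pred_predefined)` to see whether the two models disagree on a default or a non-default case.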


In [None]:
# Summarize observations
print("\nObservations:")
print("1. The scratch model scored 0.95 and the predefined model 0.90 on the test set; with 20 test samples, that gap is a single prediction.")
print("2. Both models achieve high accuracy, which suggests the features, Credit_Score in particular, have predictive power for loan default.")
print("3. The small gap likely reflects differences in setup (scikit-learn applies L2 regularization by default and uses a different solver) rather than one model being better.")
print("4. Further feature engineering or trying other algorithms could potentially improve the model's performance.")


Observations:
1. The scratch model scored 0.95 and the predefined model 0.90 on the test set; with 20 test samples, that gap is a single prediction.
2. Both models achieve high accuracy, which suggests the features, Credit_Score in particular, have predictive power for loan default.
3. The small gap likely reflects differences in setup (scikit-learn applies L2 regularization by default and uses a different solver) rather than one model being better.
4. Further feature engineering or trying other algorithms could potentially improve the model's performance.
