# Logistic Regression and AdaBoost for Binary Classification

## Import Statements

**NumPy Documentation:** https://numpy.org/doc/  
**`pandas` Documentation:** https://pandas.pydata.org/docs/

In [1]:
import numpy as np
import pandas as pd

In [2]:
# ref: https://stackoverflow.com/questions/21494489/what-does-numpy-random-seed0-do
np.random.seed(1)

# ref: https://stackoverflow.com/questions/14861891/runtimewarning-invalid-value-encountered-in-divide/54364060
# ref: https://numpy.org/doc/stable/reference/generated/numpy.seterr.html
np.seterr(divide='ignore', invalid='ignore')

# ref: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
pd.options.mode.chained_assignment = None

**`sklearn.impute.SimpleImputer`:** https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html  
**`sklearn.preprocessing.StandardScaler`:** https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html  
**`sklearn.model_selection.train_test_split`:** https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [3]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## Dataset Preprocessing  

### References:  
- **Preprocessing data:** https://scikit-learn.org/stable/modules/preprocessing.html  
- **Imputation of missing values:** https://scikit-learn.org/stable/modules/impute.html  
- **Easy Guide to Data Preprocessing in Python:** https://www.kdnuggets.com/2020/07/easy-guide-data-preprocessing-python.html  
- **Data preprocessing:** https://cs.ccsu.edu/~markov/ccsu_courses/DataMining-3.html  

### Datasets:  
- **Telco Customer Churn:** https://www.kaggle.com/blastchar/telco-customer-churn  
- **Adult Salary Scale:** https://archive.ics.uci.edu/ml/datasets/adult  
- **Credit Card Fraud Detection:** https://www.kaggle.com/mlg-ulb/creditcardfraud

### Telco Customer Churn Dataset Preprocessing

In [4]:
def telco_customer_churn_dataset_preprocessing():
    data_frame = pd.read_csv('./datasets/telco-customer-churn.csv')
    
    # imputing missing values in specific columns
    numeric_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    
    # 'TotalCharges' is read as text; entries that cannot be parsed (blank strings) become NaN
    data_frame['TotalCharges'] = pd.to_numeric(data_frame['TotalCharges'], errors='coerce')
    data_frame['TotalCharges'] = numeric_imputer.fit_transform(data_frame[['TotalCharges']]).ravel()
    
    # dropping unnecessary columns
    data_frame.drop(['customerID'], inplace=True, axis=1)
    
    # modifying values in specific columns
    # (assigning the result back instead of calling replace(..., inplace=True) on a column selection,
    # which is not guaranteed to modify data_frame itself)
    data_frame['MultipleLines'] = data_frame['MultipleLines'].replace({'No phone service': 'No'})
    
    for key in ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']:
        data_frame[key] = data_frame[key].replace({'No internet service': 'No'})
    
    data_frame['Churn'] = data_frame['Churn'].replace({'Yes': 1, 'No': 0})
    
    # encoding categorical features
    data_frame = pd.get_dummies(
        data_frame, 
        columns=[
            'gender', 'SeniorCitizen', 'Partner', 'Dependents', 
            'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 
            'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 
            'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod'
        ]
    )
    
    # separating target column from data_frame
    data_frame_target = data_frame['Churn']
    data_frame.drop(['Churn'], inplace=True, axis=1)
    
    # standardizing specific columns in data_frame
    standard_scaler = StandardScaler()
    
    for key in ['tenure', 'MonthlyCharges', 'TotalCharges']:
        data_frame[key] = standard_scaler.fit_transform(data_frame[[key]])
    
    return data_frame.to_numpy(), data_frame_target.to_numpy()
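
As a small illustration of the imputation step in the function above, the following sketch runs mean imputation on a toy column. The frame `toy_frame` and its values are hypothetical stand-ins for the real `TotalCharges` column, and `pd.to_numeric(..., errors='coerce')` is used here to mark the blank entry as missing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy stand-in for the 'TotalCharges' column: numbers stored as strings, one blank entry
toy_frame = pd.DataFrame({'TotalCharges': ['10.0', ' ', '30.0']})

# blank strings cannot be parsed, so errors='coerce' converts them to NaN
toy_frame['TotalCharges'] = pd.to_numeric(toy_frame['TotalCharges'], errors='coerce')

# replacing NaN with the mean of the observed values: (10.0 + 30.0) / 2 = 20.0
numeric_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed = numeric_imputer.fit_transform(toy_frame[['TotalCharges']]).ravel()

print(imputed)
```

The imputer returns a two-dimensional array with one column, so `ravel()` flattens it before it is used as a single column.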

### Adult Salary Scale Dataset Preprocessing

In [5]:
def adult_dataset_preprocessing():
    train_data_frame = pd.read_csv('./datasets/adult-train.csv')
    test_data_frame = pd.read_csv('./datasets/adult-test.csv')
    
    # modifying values in specific columns
    # (labels in the test file carry a trailing period)
    train_data_frame['salary-scale'] = train_data_frame['salary-scale'].replace({' >50K': 1, ' <=50K': 0})
    test_data_frame['salary-scale'] = test_data_frame['salary-scale'].replace({' >50K.': 1, ' <=50K.': 0})
    
    # concatenating train_data_frame and test_data_frame
    # ref: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion/60425
    train_data_frame['is_train'] = 1
    test_data_frame['is_train'] = 0
    data_frame = pd.concat([train_data_frame, test_data_frame], ignore_index=True)
    
    # imputing missing values in specific columns
    categorical_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    
    for key in ['workclass', 'occupation', 'native-country']:
        data_frame[key] = data_frame[key].replace({' ?': np.nan})
        data_frame[key] = categorical_imputer.fit_transform(data_frame[[key]]).ravel()
    
    # encoding categorical features
    data_frame = pd.get_dummies(
        data_frame, 
        columns=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
    )
    data_frame.drop(['native-country_ Holand-Netherlands'], inplace=True, axis=1)
    
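    # separating train_data_frame and test_data_frame from data_frame
    # (copy() makes both frames independent of data_frame, so the in-place drops below are safe)
    train_data_frame = data_frame[data_frame['is_train'] == 1].copy()
    test_data_frame = data_frame[data_frame['is_train'] == 0].copy()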
    
    train_data_frame.drop(['is_train'], inplace=True, axis=1)
    test_data_frame.drop(['is_train'], inplace=True, axis=1)
    
    # separating target column from data_frames
    train_data_frame_target = train_data_frame['salary-scale']
    train_data_frame.drop(['salary-scale'], inplace=True, axis=1)
    
    test_data_frame_target = test_data_frame['salary-scale']
    test_data_frame.drop(['salary-scale'], inplace=True, axis=1)
    
    # standardizing specific columns in data_frames
    # (the scaler is fitted on the training set only and reused for the testing set)
    standard_scaler = StandardScaler()
    
    for key in ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']:
        train_data_frame[key] = standard_scaler.fit_transform(train_data_frame[[key]])
        test_data_frame[key] = standard_scaler.transform(test_data_frame[[key]])
    
    return (
        train_data_frame.to_numpy(), 
        train_data_frame_target.to_numpy(), 
        test_data_frame.to_numpy(), 
        test_data_frame_target.to_numpy()
    )
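
The function above concatenates the training and testing frames before one-hot encoding, so that both parts end up with the same dummy columns. A minimal sketch of that idea on hypothetical toy frames (`toy_train`, `toy_test`, and the category names are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# hypothetical toy frames: the category 'c' appears only in the training part
toy_train = pd.DataFrame({'workclass': ['a', 'b', 'c']})
toy_test = pd.DataFrame({'workclass': ['a', 'b']})

# marking the origin of each row before concatenating
toy_train['is_train'] = 1
toy_test['is_train'] = 0
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)

# encoding on the combined frame yields the same dummy columns for both parts
toy_all = pd.get_dummies(toy_all, columns=['workclass'])

encoded_train = toy_all[toy_all['is_train'] == 1].drop(columns=['is_train'])
encoded_test = toy_all[toy_all['is_train'] == 0].drop(columns=['is_train'])

print(list(encoded_train.columns))
print(list(encoded_test.columns))
```

In this toy case the column `workclass_c` is constant in the testing part. A constant column has zero standard deviation, which matters for any later per-column standardization.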

### Credit Card Fraud Detection Dataset Preprocessing

In [6]:
def credit_card_fraud_detection_dataset_preprocessing(use_smaller_subset=False):
    data_frame = pd.read_csv('./datasets/credit-card-fraud-detection.csv')
    
    # separating data samples based on value in 'Class' column
    data_frame_0 = data_frame[data_frame['Class'] == 0]
    data_frame_1 = data_frame[data_frame['Class'] == 1]
    
    # splitting data_frame_0 and data_frame_1 into training and testing sets
    train_data_frame_0 = data_frame_0.sample(frac=0.8, random_state=1)
    test_data_frame_0 = data_frame_0.drop(train_data_frame_0.index)
    
    if use_smaller_subset:
        train_data_frame_0 = train_data_frame_0.sample(n=16000, random_state=1)
        test_data_frame_0 = test_data_frame_0.sample(n=4000, random_state=1)
    
    train_data_frame_1 = data_frame_1.sample(frac=0.8, random_state=1)
    test_data_frame_1 = data_frame_1.drop(train_data_frame_1.index)
    
    # concatenating train_data_frame_0 and train_data_frame_1, test_data_frame_0 and test_data_frame_1
    train_data_frame = pd.concat([train_data_frame_0, train_data_frame_1], ignore_index=True).sample(frac=1, random_state=1)
    test_data_frame = pd.concat([test_data_frame_0, test_data_frame_1], ignore_index=True).sample(frac=1, random_state=1)
    
    # separating target columns from data_frames
    train_data_frame_target = train_data_frame['Class']
    train_data_frame.drop(['Class'], inplace=True, axis=1)
    
    test_data_frame_target = test_data_frame['Class']
    test_data_frame.drop(['Class'], inplace=True, axis=1)
    
    # standardizing all columns in data_frames
    # (the scaler is fitted on the training set only and reused for the testing set)
    standard_scaler = StandardScaler()
    
    for key in list(train_data_frame.columns):
        train_data_frame[key] = standard_scaler.fit_transform(train_data_frame[[key]])
        test_data_frame[key] = standard_scaler.transform(test_data_frame[[key]])
    
    return (
        train_data_frame.to_numpy(), 
        train_data_frame_target.to_numpy(), 
        test_data_frame.to_numpy(), 
        test_data_frame_target.to_numpy()
    )
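
The function above splits each class separately, which keeps the ratio of fraudulent to regular transactions the same in the training and testing sets. A comparable effect can be sketched with the `stratify` argument of `train_test_split`. This is an alternative shown on hypothetical toy data, not the approach used in the function:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical imbalanced toy data: 90 samples of class 0 and 10 samples of class 1
toy_features = np.arange(100).reshape(100, 1)
toy_target = np.array([0] * 90 + [1] * 10)

# stratify=toy_target keeps the class ratio the same in both parts
train_X, test_X, train_y, test_y = train_test_split(
    toy_features, toy_target, test_size=0.2, stratify=toy_target, random_state=1
)

# number of class-1 samples in each part
print(train_y.sum(), test_y.sum())
```

With a 10% minority class and an 80/20 split, 8 of the 10 minority samples land in the training part and 2 in the testing part.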

## Logistic Regression Implementation  

**Reference:** https://towardsdatascience.com/logistic-regression-from-scratch-in-python-ec66603592e2

In [7]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

In [8]:
def tanh(z):
    # np.tanh saturates at -1 and 1 for large |z|, where the explicit
    # quotient of exponentials would overflow to inf / inf = nan
    return np.tanh(z)
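
The `tanh`-based hypothesis used later in this notebook, `(1 + tanh(z)) / 2`, is mathematically the same as `sigmoid(2z)`. The sketch below checks this identity numerically and also shows, under the assumption of a very large input, why a quotient of exponentials is numerically fragile compared with `np.tanh`:

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 11)

# hypothesis built from tanh, rescaled from (-1, 1) to (0, 1)
tanh_hypothesis = (1 + np.tanh(z)) / 2

# sigmoid evaluated at 2z
sigmoid_of_2z = 1 / (1 + np.exp(-2 * z))

print(np.allclose(tanh_hypothesis, sigmoid_of_2z))

# for large |z| the explicit formula evaluates inf / inf, while np.tanh saturates
with np.errstate(over='ignore', invalid='ignore'):
    z_big = np.array([1000.0])
    naive = (np.exp(z_big) - np.exp(-z_big)) / (np.exp(z_big) + np.exp(-z_big))

print(naive, np.tanh(z_big))
```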

In [9]:
def mean_squared_error(y_true, y_predicted):
    return np.mean((y_true - y_predicted) ** 2)

In [10]:
def gradients(X, y_true, y_predicted, is_weak_learning=False):
    num_samples = X.shape[0]
    
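    # calculating the descent direction, i.e. the negative gradient of the MSE loss
    # with respect to weights w (the constant factor 2 is absorbed into the learning rate)
    if is_weak_learning:
        # using tanh as logistic function: y_predicted = (1 + tanh(z)) / 2
        # note: the exact derivative of y_predicted with respect to z is 2 * y_predicted * (1 - y_predicted);
        # the factor (1 - y_predicted ** 2) used here is a different non-negative weighting,
        # so dw is not the exact negative gradient in this branch
        dw = np.dot(X.T, (y_true - y_predicted) * (1 - y_predicted ** 2)) / num_samples
    else:
        # using sigmoid as logistic function
        dw = np.dot(X.T, (y_true - y_predicted) * y_predicted * (1 - y_predicted)) / num_samples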
    
    return dw
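
One way to sanity-check the `sigmoid` branch of the function above is to compare the analytic gradient of the MSE loss with central finite differences. The sketch below does this on random toy data. The helper `mse_loss`, the random inputs, and the step size are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def mse_loss(w, X, y_true):
    # mean squared error between targets and sigmoid hypotheses
    return np.mean((y_true - sigmoid(X @ w)) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y_true = rng.integers(0, 2, size=(20, 1)).astype(float)
w = rng.normal(size=(3, 1))

# analytic gradient of the MSE loss: -2/N * X^T [(y - h) * h * (1 - h)]
h = sigmoid(X @ w)
analytic = -2 * (X.T @ ((y_true - h) * h * (1 - h))) / X.shape[0]

# central finite differences, one weight at a time
step = 1e-6
numeric = np.zeros_like(w)
for j in range(w.shape[0]):
    shift = np.zeros_like(w)
    shift[j] = step
    numeric[j] = (mse_loss(w + shift, X, y_true) - mse_loss(w - shift, X, y_true)) / (2 * step)

print(np.allclose(analytic, numeric, atol=1e-6))
```

The quantity `dw` returned by `gradients` in the `sigmoid` branch equals `-analytic / 2`, which is why adding `learning_rate * dw` to `w` moves the weights downhill on the loss.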

In [11]:
def normalize(X):
    # standardizing to have zero mean and unit variance
    return (X - X.mean(axis=0)) / X.std(axis=0)
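
A column with zero standard deviation (for example a one-hot column that is constant in a given sample) makes the division above evaluate `0 / 0`. The following sketch shows one possible guard, together with the option of reusing training statistics on test data. The helper `standardize` and its signature are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

def standardize(X, mean=None, std=None):
    # hypothetical helper: falls back to the statistics of X when none are given
    mean = X.mean(axis=0) if mean is None else mean
    std = X.std(axis=0) if std is None else std
    # constant columns have std == 0; dividing by 1 instead leaves them at zero
    safe_std = np.where(std == 0, 1.0, std)
    return (X - mean) / safe_std, mean, std

toy_train = np.array([[1.0, 5.0], [3.0, 5.0]])   # second column is constant
toy_test = np.array([[2.0, 5.0]])

# statistics are computed on the training part and reused for the testing part
scaled_train, train_mean, train_std = standardize(toy_train)
scaled_test, _, _ = standardize(toy_test, train_mean, train_std)

print(scaled_train)
print(scaled_test)
```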

In [12]:
def train(X, y_true, epochs, learning_rate, report_loss=False, is_weak_learning=False, early_stopping_threshold=0):
    num_samples, num_features = X.shape
    
    # initializing weights w to zero
    w = np.zeros((num_features + 1, 1))
    
    # normalizing inputs X
    X = normalize(X)
    
    # augmenting dummy input attribute 1 to each row of X
    X = np.concatenate((X, np.ones((num_samples, 1))), axis=1)
    
    # reshaping target y_true
    y_true = y_true.reshape(num_samples, 1)
    
    # training loop
    for epoch in range(epochs):
        # calculating hypotheses
        if is_weak_learning:
            y_predicted = (1 + tanh(np.dot(X, w))) / 2
        else:
            y_predicted = sigmoid(np.dot(X, w))
        
        # calculating gradients of loss with respect to weights w
        dw = gradients(X, y_true, y_predicted, is_weak_learning=is_weak_learning)
        
        # gradient descent: updating parameters weights w
        w = w + learning_rate * dw
        
        # calculating MSE loss in this epoch
        if is_weak_learning:
            loss = mean_squared_error(y_true, (1 + tanh(np.dot(X, w))) / 2)
        else:
            loss = mean_squared_error(y_true, sigmoid(np.dot(X, w)))
        
        # reporting MSE loss in this epoch
        if report_loss:
            print(f'Epoch {epoch + 1}: MSE Loss = {loss}')
        
        # early termination of gradient descent
        if loss <= early_stopping_threshold:
            break
    
    return w

In [13]:
def predict(X, w, is_weak_learning=False):
    num_samples = X.shape[0]
    
    # normalizing inputs X
    X = normalize(X)
    
    # augmenting dummy input attribute 1 to each row of X
    X = np.concatenate((X, np.ones((num_samples, 1))), axis=1)
    
    # calculating hypotheses
    if is_weak_learning:
        y_predicted = (1 + tanh(np.dot(X, w))) / 2
    else:
        y_predicted = sigmoid(np.dot(X, w))
    
    # determining and storing predictions (threshold 0.5) as a column vector
    return (y_predicted >= 0.5).astype(int).reshape(num_samples, 1)

## AdaBoost Implementation

In [14]:
def adaptive_boosting(X, y_true, num_boosting_rounds, report_accuracy=False):
    num_samples, num_features = X.shape
    
    # initializing local variables
    example_weights = np.full((num_samples), 1 / num_samples)
    hypotheses = []
    hypothesis_weights = []
    
    # boosting loop
    for k in range(num_boosting_rounds):
        # resampling input examples
        examples = np.concatenate((X, y_true), axis=1)
        
        # ref: https://stackoverflow.com/questions/14262654/numpy-get-random-set-of-rows-from-2d-array
        # ref: https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html
        data = examples[np.random.choice(num_samples, size=num_samples, replace=True, p=example_weights)]
        
        data_X = data[:, :num_features]
        data_y_true = data[:, -1:]
        
        # getting hypothesis from a learning algorithm
        w = train(
            data_X, 
            data_y_true, 
            epochs=1000, 
            learning_rate=0.01, 
            report_loss=False, 
            is_weak_learning=True, 
            early_stopping_threshold=0.2
        )
        
        # predicting target values with hypothesis
        y_predicted = predict(X, w, is_weak_learning=True)
        
        # reporting accuracy of hypothesis
        if report_accuracy:
            print(np.sum(y_true == y_predicted) / num_samples)
        
        # calculating error for hypothesis
        error = 0
        
        for i in range(num_samples):
            error = error + (example_weights[i] if y_true[i] != y_predicted[i] else 0)
        
        # discarding hypotheses that are worse than random guessing
        if error > 0.5:
            continue
        
        hypotheses.append(w)
        
        # updating example_weights
        for i in range(num_samples):
            example_weights[i] = example_weights[i] * (error / (1 - error) if y_true[i] == y_predicted[i] else 1)
        
        example_weights = example_weights / example_weights.sum()
        
        # updating hypothesis_weights
        hypothesis_weights.append(np.log2((1 - error) / error))
    
    return hypotheses, np.array(hypothesis_weights).reshape(len(hypotheses), 1)
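
To make the weight update in the boosting loop above concrete, the sketch below works through a single round on a hypothetical toy case with five examples, one of which is misclassified. It uses vectorized NumPy operations in place of the explicit loops, and the labels and predictions are made up for illustration.

```python
import numpy as np

# hypothetical toy round: 5 examples with uniform weights, the last one misclassified
example_weights = np.full(5, 1 / 5)
y_true = np.array([1, 0, 1, 1, 0])
y_predicted = np.array([1, 0, 1, 1, 1])

correct = y_true == y_predicted

# weighted error: total weight of the misclassified examples
error = example_weights[~correct].sum()

# correctly classified examples are scaled down by error / (1 - error)
example_weights = np.where(correct, example_weights * error / (1 - error), example_weights)
example_weights = example_weights / example_weights.sum()

# weight of this hypothesis in the final vote
hypothesis_weight = np.log2((1 - error) / error)

print(error, example_weights, hypothesis_weight)
```

After normalization the single misclassified example carries half of the total weight, so the next resampling round draws it far more often. Note that an error of exactly 0 would make the hypothesis weight infinite, so a practical variant may want to clamp `error` away from 0.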

In [15]:
def weighted_majority_predict(X, hypotheses, hypothesis_weights):
    num_samples = X.shape[0]
    num_hypotheses = len(hypotheses)
    
    # normalizing inputs X
    X = normalize(X)
    
    # augmenting dummy input attribute 1 to each row of X
    X = np.concatenate((X, np.ones((num_samples, 1))), axis=1)
    
    # calculating hypotheses
    y_predicteds = []
    
    for i in range(num_hypotheses):
        y_predicted = (1 + tanh(np.dot(X, hypotheses[i]))) / 2
        y_predicteds.append([1 if y_pred >= 0.5 else -1 for y_pred in y_predicted])
        
    y_predicteds = np.array(y_predicteds)
    
    # calculating weighted majority hypothesis and storing predictions
    weighted_majority_hypothesis = np.dot(y_predicteds.T, hypothesis_weights)
    predictions = [1 if y_pred >= 0 else 0 for y_pred in weighted_majority_hypothesis]
    
    return np.array(predictions).reshape(num_samples, 1)
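
The weighted majority vote in the function above can be illustrated on a small hypothetical example with three hypotheses and four samples. The votes and weights below are made up. As in the function, votes are encoded as +1 / -1 and a non-negative weighted sum maps to class 1.

```python
import numpy as np

# hypothetical votes of 3 hypotheses (rows) on 4 samples (columns), encoded as +1 / -1
votes = np.array([
    [ 1,  1, -1, -1],
    [-1,  1, -1,  1],
    [-1, -1,  1,  1],
])
hypothesis_weights = np.array([[2.0], [0.5], [1.0]])

# weighted sum of votes per sample; the sign decides the final class
weighted_sum = votes.T @ hypothesis_weights
predictions = (weighted_sum >= 0).astype(int)

print(weighted_sum.ravel(), predictions.ravel())
```

For the first sample, two of the three hypotheses vote -1, yet the prediction is class 1 because the first hypothesis has a larger weight than the other two combined.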

## Performance Evaluation  

**Reference:** https://en.wikipedia.org/wiki/Confusion_matrix

In [16]:
def performance_evaluation(y_true, y_predicted):
    num_samples = y_true.shape[0]
    
    # initializing confusion matrix outcomes
    true_positive = 0
    false_negative = 0
    true_negative = 0
    false_positive = 0
    
    # calculating and storing confusion matrix outcomes
    for i in range(num_samples):
        if y_true[i] == 1:
            if y_true[i] == y_predicted[i]:
                true_positive = true_positive + 1
            else:
                false_negative = false_negative + 1
        elif y_true[i] == 0:
            if y_true[i] == y_predicted[i]:
                true_negative = true_negative + 1
            else:
                false_positive = false_positive + 1
    
    # calculating and storing performance measures
    accuracy = (true_positive + true_negative) / (true_positive + false_negative + true_negative + false_positive)
    sensitivity = true_positive / (true_positive + false_negative)
    specificity = true_negative / (true_negative + false_positive)
    precision = true_positive / (true_positive + false_positive)
    false_discovery_rate = false_positive / (true_positive + false_positive)
    f1_score = 2 * sensitivity * precision / (sensitivity + precision)
    
    return (accuracy, sensitivity, specificity, precision, false_discovery_rate, f1_score)
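
The confusion matrix counts above can also be computed with boolean masks. The sketch below is a hypothetical variant, not a drop-in replacement. The helpers `confusion_counts` and `safe_ratio` are assumptions introduced here, the latter to avoid a `ZeroDivisionError` when, for example, no sample is predicted positive.

```python
import numpy as np

def confusion_counts(y_true, y_predicted):
    # hypothetical helper: counts the four confusion matrix outcomes with boolean masks
    y_true = np.asarray(y_true).ravel()
    y_predicted = np.asarray(y_predicted).ravel()
    true_positive = int(np.sum((y_true == 1) & (y_predicted == 1)))
    false_negative = int(np.sum((y_true == 1) & (y_predicted == 0)))
    true_negative = int(np.sum((y_true == 0) & (y_predicted == 0)))
    false_positive = int(np.sum((y_true == 0) & (y_predicted == 1)))
    return true_positive, false_negative, true_negative, false_positive

def safe_ratio(numerator, denominator):
    # returns NaN instead of raising ZeroDivisionError when the denominator is zero
    return numerator / denominator if denominator else float('nan')

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_predicted = np.array([1, 1, 0, 0, 0, 0, 1, 1])

tp, fn, tn, fp = confusion_counts(y_true, y_predicted)
sensitivity = safe_ratio(tp, tp + fn)
precision = safe_ratio(tp, tp + fp)
f1_score = safe_ratio(2 * sensitivity * precision, sensitivity + precision)

print(tp, fn, tn, fp, sensitivity, precision, f1_score)
```

In this toy case the counts are 2 true positives, 1 false negative, 3 true negatives, and 2 false positives, which gives a sensitivity of 2/3, a precision of 1/2, and an F1 score of 4/7.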

## Extracting Dataset Features & Targets and Splitting Datasets into Training & Testing Sets  

**Reference:** https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data

### Telco Customer Churn Dataset

In [17]:
telco_customer_churn_features, telco_customer_churn_target = telco_customer_churn_dataset_preprocessing()

In [18]:
(
    train_churn_features, 
    test_churn_features, 
    train_churn_target, 
    test_churn_target
) = train_test_split(telco_customer_churn_features, telco_customer_churn_target, test_size=0.2, random_state=1)

In [19]:
train_churn_target = train_churn_target.reshape(train_churn_target.shape[0], 1)
test_churn_target = test_churn_target.reshape(test_churn_target.shape[0], 1)

### Adult Salary Scale Dataset

In [20]:
(train_salary_features, train_salary_target, test_salary_features, test_salary_target) = adult_dataset_preprocessing()

In [21]:
train_salary_target = train_salary_target.reshape(train_salary_target.shape[0], 1)
test_salary_target = test_salary_target.reshape(test_salary_target.shape[0], 1)

### Credit Card Fraud Detection Dataset (Entire)

In [22]:
(
    train_fraud_features, 
    train_fraud_target, 
    test_fraud_features, 
    test_fraud_target
) = credit_card_fraud_detection_dataset_preprocessing(use_smaller_subset=False)

In [23]:
train_fraud_target = train_fraud_target.reshape(train_fraud_target.shape[0], 1)
test_fraud_target = test_fraud_target.reshape(test_fraud_target.shape[0], 1)

### Credit Card Fraud Detection Dataset (Subsampled)

In [24]:
(
    train_fraud_sub_features, 
    train_fraud_sub_target, 
    test_fraud_sub_features, 
    test_fraud_sub_target
) = credit_card_fraud_detection_dataset_preprocessing(use_smaller_subset=True)

In [25]:
train_fraud_sub_target = train_fraud_sub_target.reshape(train_fraud_sub_target.shape[0], 1)
test_fraud_sub_target = test_fraud_sub_target.reshape(test_fraud_sub_target.shape[0], 1)

## Performance Measurement

### Logistic Regression with `sigmoid`

#### Telco Customer Churn Dataset

In [26]:
w = train(
    train_churn_features, 
    train_churn_target, 
    epochs=1000, 
    learning_rate=0.01, 
    report_loss=False, 
    is_weak_learning=False, 
    early_stopping_threshold=0
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(train_churn_target, predict(train_churn_features, w, is_weak_learning=False))

print(
    f'Logistic Regression with sigmoid for Telco Customer Churn Dataset: Train\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(test_churn_target, predict(test_churn_features, w, is_weak_learning=False))

print(
    f'Logistic Regression with sigmoid for Telco Customer Churn Dataset: Test\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

Logistic Regression with sigmoid for Telco Customer Churn Dataset: Train
Accuracy: 0.7781327653532126
Sensitivity: 0.7061143984220908
Specificity: 0.8047653780695356
Precision: 0.5721896643580181
False Discovery Rate: 0.4278103356419819
F1 Score: 0.6321365509123013

Logistic Regression with sigmoid for Telco Customer Churn Dataset: Test
Accuracy: 0.7778566359119943
Sensitivity: 0.7557471264367817
Specificity: 0.7851083883129123
Precision: 0.5356415478615071
False Discovery Rate: 0.46435845213849286
F1 Score: 0.6269368295589988



#### Adult Salary Scale Dataset

In [27]:
w = train(
    train_salary_features, 
    train_salary_target, 
    epochs=1000, 
    learning_rate=0.01, 
    report_loss=False, 
    is_weak_learning=False, 
    early_stopping_threshold=0
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(train_salary_target, predict(train_salary_features, w, is_weak_learning=False))

print(
    f'Logistic Regression with sigmoid for Adult Salary Scale Dataset: Train\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(test_salary_target, predict(test_salary_features, w, is_weak_learning=False))

print(
    f'Logistic Regression with sigmoid for Adult Salary Scale Dataset: Test\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

Logistic Regression with sigmoid for Adult Salary Scale Dataset: Train
Accuracy: 0.8200608089432143
Sensitivity: 0.7448029588062747
Specificity: 0.8439320388349515
Precision: 0.6021860177356156
False Discovery Rate: 0.3978139822643844
F1 Score: 0.6659444666172529

Logistic Regression with sigmoid for Adult Salary Scale Dataset: Test
Accuracy: 0.8195442540384498
Sensitivity: 0.7433697347893916
Specificity: 0.8431041415359871
Precision: 0.5943866943866943
False Discovery Rate: 0.4056133056133056
F1 Score: 0.6605822550831792



#### Credit Card Fraud Detection Dataset (Entire)

In [28]:
w = train(
    train_fraud_features, 
    train_fraud_target, 
    epochs=1000, 
    learning_rate=0.01, 
    report_loss=False, 
    is_weak_learning=False, 
    early_stopping_threshold=0
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(train_fraud_target, predict(train_fraud_features, w, is_weak_learning=False))

print(
    f'Logistic Regression with sigmoid for Credit Card Fraud Detection Dataset (Entire): Train\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(test_fraud_target, predict(test_fraud_features, w, is_weak_learning=False))

print(
    f'Logistic Regression with sigmoid for Credit Card Fraud Detection Dataset (Entire): Test\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

Logistic Regression with sigmoid for Credit Card Fraud Detection Dataset (Entire): Train
Accuracy: 0.998854489435847
Sensitivity: 0.42385786802030456
Specificity: 0.9998505179114714
Precision: 0.8308457711442786
False Discovery Rate: 0.1691542288557214
F1 Score: 0.561344537815126

Logistic Regression with sigmoid for Credit Card Fraud Detection Dataset (Entire): Test
Accuracy: 0.9989115359632029
Sensitivity: 0.42857142857142855
Specificity: 0.9998944832316269
Precision: 0.875
False Discovery Rate: 0.125
F1 Score: 0.5753424657534246



#### Credit Card Fraud Detection Dataset (Subsampled)

In [29]:
w = train(
    train_fraud_sub_features, 
    train_fraud_sub_target, 
    epochs=1000, 
    learning_rate=0.01, 
    report_loss=False, 
    is_weak_learning=False, 
    early_stopping_threshold=0
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(train_fraud_sub_target, predict(train_fraud_sub_features, w, is_weak_learning=False))

print(
    f'Logistic Regression with sigmoid for Credit Card Fraud Detection Dataset (Subsampled): Train\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(test_fraud_sub_target, predict(test_fraud_sub_features, w, is_weak_learning=False))

print(
    f'Logistic Regression with sigmoid for Credit Card Fraud Detection Dataset (Subsampled): Test\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

Logistic Regression with sigmoid for Credit Card Fraud Detection Dataset (Subsampled): Train
Accuracy: 0.9951201659143589
Sensitivity: 0.8045685279187818
Specificity: 0.9998125
Precision: 0.990625
False Discovery Rate: 0.009375
F1 Score: 0.8879551820728291

Logistic Regression with sigmoid for Credit Card Fraud Detection Dataset (Subsampled): Test
Accuracy: 0.9943875061005368
Sensitivity: 0.7857142857142857
Specificity: 0.9995
Precision: 0.9746835443037974
False Discovery Rate: 0.02531645569620253
F1 Score: 0.8700564971751412



### Logistic Regression with `tanh`

#### Telco Customer Churn Dataset

In [30]:
w = train(
    train_churn_features, 
    train_churn_target, 
    epochs=1000, 
    learning_rate=0.01, 
    report_loss=False, 
    is_weak_learning=True, 
    early_stopping_threshold=0
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(train_churn_target, predict(train_churn_features, w, is_weak_learning=True))

print(
    f'Logistic Regression with tanh for Telco Customer Churn Dataset: Train\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(test_churn_target, predict(test_churn_features, w, is_weak_learning=True))

print(
    f'Logistic Regression with tanh for Telco Customer Churn Dataset: Test\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

Logistic Regression with tanh for Telco Customer Churn Dataset: Train
Accuracy: 0.8015619453319134
Sensitivity: 0.5575279421433268
Specificity: 0.8918064672988086
Precision: 0.6558391337973705
False Discovery Rate: 0.34416086620262953
F1 Score: 0.6027007818052595

Logistic Regression with tanh for Telco Customer Churn Dataset: Test
Accuracy: 0.801277501774308
Sensitivity: 0.5890804597701149
Specificity: 0.8708765315739868
Precision: 0.5994152046783626
False Discovery Rate: 0.40058479532163743
F1 Score: 0.5942028985507246



#### Adult Salary Scale Dataset

In [31]:
w = train(
    train_salary_features, 
    train_salary_target, 
    epochs=1000, 
    learning_rate=0.01, 
    report_loss=False, 
    is_weak_learning=True, 
    early_stopping_threshold=0
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(train_salary_target, predict(train_salary_features, w, is_weak_learning=True))

print(
    f'Logistic Regression with tanh for Adult Salary Scale Dataset: Train\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(test_salary_target, predict(test_salary_features, w, is_weak_learning=True))

print(
    f'Logistic Regression with tanh for Adult Salary Scale Dataset: Test\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

Logistic Regression with tanh for Adult Salary Scale Dataset: Train
Accuracy: 0.8441694051165505
Sensitivity: 0.5883178166050249
Specificity: 0.9253236245954692
Precision: 0.7141972441554421
False Discovery Rate: 0.285802755844558
F1 Score: 0.6451748251748252

Logistic Regression with tanh for Adult Salary Scale Dataset: Test
Accuracy: 0.8450340888151834
Sensitivity: 0.5852834113364535
Specificity: 0.925371934057097
Precision: 0.7080843032400126
False Discovery Rate: 0.2919156967599874
F1 Score: 0.6408540925266905



#### Credit Card Fraud Detection Dataset (Entire)

In [32]:
w = train(
    train_fraud_features, 
    train_fraud_target, 
    epochs=1000, 
    learning_rate=0.01, 
    report_loss=False, 
    is_weak_learning=True, 
    early_stopping_threshold=0
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(train_fraud_target, predict(train_fraud_features, w, is_weak_learning=True))

print(
    f'Logistic Regression with tanh for Credit Card Fraud Detection Dataset (Entire): Train\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(test_fraud_target, predict(test_fraud_features, w, is_weak_learning=True))

print(
    f'Logistic Regression with tanh for Credit Card Fraud Detection Dataset (Entire): Test\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

Logistic Regression with tanh for Credit Card Fraud Detection Dataset (Entire): Train
Accuracy: 0.9989729905286905
Sensitivity: 0.49746192893401014
Specificity: 0.9998417248474404
Precision: 0.8448275862068966
False Discovery Rate: 0.15517241379310345
F1 Score: 0.6261980830670927

Logistic Regression with tanh for Credit Card Fraud Detection Dataset (Entire): Test
Accuracy: 0.9990168711925703
Sensitivity: 0.4897959183673469
Specificity: 0.9998944832316269
Precision: 0.8888888888888888
False Discovery Rate: 0.1111111111111111
F1 Score: 0.631578947368421



#### Credit Card Fraud Detection Dataset (Subsampled)

In [33]:
w = train(
    train_fraud_sub_features, 
    train_fraud_sub_target, 
    epochs=1000, 
    learning_rate=0.01, 
    report_loss=False, 
    is_weak_learning=True, 
    early_stopping_threshold=0
)

(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(train_fraud_sub_target, predict(train_fraud_sub_features, w, is_weak_learning=True))

print(
    f'Logistic Regression with tanh for Credit Card Fraud Detection Dataset (Subsampled): Train\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

# Evaluate `w` on the held-out test split of the subsampled fraud dataset.
(
    accuracy, 
    sensitivity, 
    specificity, 
    precision, 
    false_discovery_rate, 
    f1_score
) = performance_evaluation(test_fraud_sub_target, predict(test_fraud_sub_features, w, is_weak_learning=True))

print(
    f'Logistic Regression with tanh for Credit Card Fraud Detection Dataset (Subsampled): Test\n'
    f'Accuracy: {accuracy}\n'
    f'Sensitivity: {sensitivity}\n'
    f'Specificity: {specificity}\n'
    f'Precision: {precision}\n'
    f'False Discovery Rate: {false_discovery_rate}\n'
    f'F1 Score: {f1_score}\n'
)

Logistic Regression with tanh for Credit Card Fraud Detection Dataset (Subsampled): Train
Accuracy: 0.9951201659143589
Sensitivity: 0.8045685279187818
Specificity: 0.9998125
Precision: 0.990625
False Discovery Rate: 0.009375
F1 Score: 0.8879551820728291

Logistic Regression with tanh for Credit Card Fraud Detection Dataset (Subsampled): Test
Accuracy: 0.9943875061005368
Sensitivity: 0.7857142857142857
Specificity: 0.9995
Precision: 0.9746835443037974
False Discovery Rate: 0.02531645569620253
F1 Score: 0.8700564971751412



### AdaBoost

#### Telco Customer Churn Dataset

In [34]:
# Sweep the number of boosting rounds: K = 5, 10, 15, 20.
for K in range(5, 25, 5):
    hypotheses, hypothesis_weights = adaptive_boosting(
        train_churn_features,
        train_churn_target,
        num_boosting_rounds=K,
        report_accuracy=False
    )

    # Only accuracy, the first value returned by performance_evaluation, is reported here.
    accuracy = performance_evaluation(
        train_churn_target,
        weighted_majority_predict(train_churn_features, hypotheses, hypothesis_weights)
    )[0]

    # The header shows requested rounds -> number of hypotheses actually returned.
    print(f'AdaBoost ({K} -> {len(hypotheses)} hypotheses) for Telco Customer Churn Dataset: Train\nAccuracy: {accuracy}\n')

    accuracy = performance_evaluation(
        test_churn_target,
        weighted_majority_predict(test_churn_features, hypotheses, hypothesis_weights)
    )[0]

    print(f'AdaBoost ({K} -> {len(hypotheses)} hypotheses) for Telco Customer Churn Dataset: Test\nAccuracy: {accuracy}\n')

AdaBoost (5 -> 5 hypotheses) for Telco Customer Churn Dataset: Train
Accuracy: 0.7823926162584309

AdaBoost (5 -> 5 hypotheses) for Telco Customer Churn Dataset: Test
Accuracy: 0.78708303761533

AdaBoost (10 -> 10 hypotheses) for Telco Customer Churn Dataset: Train
Accuracy: 0.7933972310969116

AdaBoost (10 -> 10 hypotheses) for Telco Customer Churn Dataset: Test
Accuracy: 0.7849538679914834

AdaBoost (15 -> 15 hypotheses) for Telco Customer Churn Dataset: Train
Accuracy: 0.7875399361022364

AdaBoost (15 -> 15 hypotheses) for Telco Customer Churn Dataset: Test
Accuracy: 0.7927608232789212

AdaBoost (20 -> 20 hypotheses) for Telco Customer Churn Dataset: Train
Accuracy: 0.7783102591409301

AdaBoost (20 -> 20 hypotheses) for Telco Customer Churn Dataset: Test
Accuracy: 0.7842441447835344
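
The cell above combines `hypotheses` and `hypothesis_weights` through `weighted_majority_predict`, whose definition lies outside this section. As a rough sketch of the general technique only, the block below assumes labels in {-1, +1} and treats each hypothesis as a callable that maps a feature matrix to a vector of -1/+1 predictions. The name `weighted_majority_vote`, the toy threshold classifiers, and the toy weights are all hypothetical; the notebook may represent its hypotheses differently.

```python
import numpy as np


def weighted_majority_vote(features, hypotheses, hypothesis_weights):
    """Hypothetical sketch: sum the weighted -1/+1 predictions, then take the sign."""
    votes = np.zeros(features.shape[0])
    for hypothesis, weight in zip(hypotheses, hypothesis_weights):
        votes += weight * hypothesis(features)
    # A vote of exactly 0 is resolved to +1 in this sketch.
    return np.where(votes >= 0, 1, -1)


# Three toy threshold classifiers on a single feature (illustrative only).
stumps = [
    lambda x: np.where(x[:, 0] > 0.0, 1, -1),
    lambda x: np.where(x[:, 0] > 1.0, 1, -1),
    lambda x: np.where(x[:, 0] > -1.0, 1, -1),
]
toy_features = np.array([[-2.0], [-0.5], [0.5], [2.0]])

equal_vote = weighted_majority_vote(toy_features, stumps, [1.0, 1.0, 1.0])
skewed_vote = weighted_majority_vote(toy_features, stumps, [0.3, 1.0, 0.3])

print(equal_vote)
print(skewed_vote)
```

With equal weights the third sample is labelled +1 because two of the three classifiers vote that way; giving the second classifier a dominant weight flips that sample to -1, which is the effect the hypothesis weights are meant to have.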



#### Adult Salary Scale Dataset

In [35]:
# Sweep the number of boosting rounds: K = 5, 10, 15, 20.
for K in range(5, 25, 5):
    hypotheses, hypothesis_weights = adaptive_boosting(
        train_salary_features,
        train_salary_target,
        num_boosting_rounds=K,
        report_accuracy=False
    )

    # Only accuracy, the first value returned by performance_evaluation, is reported here.
    accuracy = performance_evaluation(
        train_salary_target,
        weighted_majority_predict(train_salary_features, hypotheses, hypothesis_weights)
    )[0]

    # The header shows requested rounds -> number of hypotheses actually returned.
    print(f'AdaBoost ({K} -> {len(hypotheses)} hypotheses) for Adult Salary Scale Dataset: Train\nAccuracy: {accuracy}\n')

    accuracy = performance_evaluation(
        test_salary_target,
        weighted_majority_predict(test_salary_features, hypotheses, hypothesis_weights)
    )[0]

    print(f'AdaBoost ({K} -> {len(hypotheses)} hypotheses) for Adult Salary Scale Dataset: Test\nAccuracy: {accuracy}\n')

AdaBoost (5 -> 5 hypotheses) for Adult Salary Scale Dataset: Train
Accuracy: 0.8378735296827493

AdaBoost (5 -> 5 hypotheses) for Adult Salary Scale Dataset: Test
Accuracy: 0.836496529697193

AdaBoost (10 -> 10 hypotheses) for Adult Salary Scale Dataset: Train
Accuracy: 0.8385798961948343

AdaBoost (10 -> 10 hypotheses) for Adult Salary Scale Dataset: Test
Accuracy: 0.8374178490264725

AdaBoost (15 -> 15 hypotheses) for Adult Salary Scale Dataset: Train
Accuracy: 0.8400847639814502

AdaBoost (15 -> 15 hypotheses) for Adult Salary Scale Dataset: Test
Accuracy: 0.8368036361402862

AdaBoost (20 -> 15 hypotheses) for Adult Salary Scale Dataset: Train
Accuracy: 0.8372900095205921

AdaBoost (20 -> 15 hypotheses) for Adult Salary Scale Dataset: Test
Accuracy: 0.8370493212947607
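
The last run above returns 15 hypotheses for 20 requested rounds. The definition of `adaptive_boosting` lies outside this section, so the reason is not confirmed here. One common pattern, shown below purely as an assumption, is to discard a weak learner whose weighted error exceeds 0.5. The sketch uses the textbook AdaBoost update (scale the weights of correctly classified examples by `error / (1 - error)`, renormalize, and set the hypothesis weight to `log((1 - error) / error)`); `boosting_round` and its arguments are hypothetical names.

```python
import numpy as np


def boosting_round(predictions, targets, example_weights):
    """One AdaBoost round in the textbook formulation (an assumption, not the notebook's code).

    Returns (hypothesis_weight, new_example_weights). When the weighted error
    exceeds 0.5 the hypothesis is discarded: (None, example_weights) is returned.
    """
    misclassified = predictions != targets
    error = float(np.sum(example_weights[misclassified]))
    if error > 0.5:
        return None, example_weights
    # Guard against a perfect weak learner, which would divide by zero below.
    error = max(error, np.finfo(float).eps)
    new_weights = example_weights.copy()
    new_weights[~misclassified] *= error / (1.0 - error)
    new_weights /= new_weights.sum()
    return float(np.log((1.0 - error) / error)), new_weights


targets = np.array([1, 1, -1, -1])
uniform = np.full(4, 0.25)

# One mistake out of four: weighted error 0.25, so the hypothesis is kept.
kept_weight, kept_example_weights = boosting_round(np.array([1, 1, -1, 1]), targets, uniform)

# Three mistakes out of four: weighted error 0.75, so the hypothesis is discarded.
dropped_weight, _ = boosting_round(np.array([-1, -1, 1, -1]), targets, uniform)

print(kept_weight, kept_example_weights, dropped_weight)
```

In the kept case the single misclassified example ends up with half of the total weight, so the next weak learner is pushed to get it right. If the notebook follows a similar rule, discarded rounds would explain a returned hypothesis count below `K`.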



#### Credit Card Fraud Detection Dataset (Subsampled)

In [36]:
# Sweep the number of boosting rounds: K = 5, 10, 15, 20.
for K in range(5, 25, 5):
    hypotheses, hypothesis_weights = adaptive_boosting(
        train_fraud_sub_features,
        train_fraud_sub_target,
        num_boosting_rounds=K,
        report_accuracy=False
    )

    # Only accuracy, the first value returned by performance_evaluation, is reported here.
    accuracy = performance_evaluation(
        train_fraud_sub_target,
        weighted_majority_predict(train_fraud_sub_features, hypotheses, hypothesis_weights)
    )[0]

    # The header shows requested rounds -> number of hypotheses actually returned.
    print(
        f'AdaBoost ({K} -> {len(hypotheses)} hypotheses) for Credit Card Fraud Detection Dataset (Subsampled): Train\n'
        f'Accuracy: {accuracy}\n'
    )

    accuracy = performance_evaluation(
        test_fraud_sub_target,
        weighted_majority_predict(test_fraud_sub_features, hypotheses, hypothesis_weights)
    )[0]

    print(
        f'AdaBoost ({K} -> {len(hypotheses)} hypotheses) for Credit Card Fraud Detection Dataset (Subsampled): Test\n'
        f'Accuracy: {accuracy}\n'
    )

AdaBoost (5 -> 5 hypotheses) for Credit Card Fraud Detection Dataset (Subsampled): Train
Accuracy: 0.9951201659143589

AdaBoost (5 -> 5 hypotheses) for Credit Card Fraud Detection Dataset (Subsampled): Test
Accuracy: 0.9943875061005368

AdaBoost (10 -> 9 hypotheses) for Credit Card Fraud Detection Dataset (Subsampled): Train
Accuracy: 0.9951201659143589

AdaBoost (10 -> 9 hypotheses) for Credit Card Fraud Detection Dataset (Subsampled): Test
Accuracy: 0.9943875061005368

AdaBoost (15 -> 11 hypotheses) for Credit Card Fraud Detection Dataset (Subsampled): Train
Accuracy: 0.9951201659143589

AdaBoost (15 -> 11 hypotheses) for Credit Card Fraud Detection Dataset (Subsampled): Test
Accuracy: 0.9943875061005368

AdaBoost (20 -> 13 hypotheses) for Credit Card Fraud Detection Dataset (Subsampled): Train
Accuracy: 0.9951201659143589

AdaBoost (20 -> 13 hypotheses) for Credit Card Fraud Detection Dataset (Subsampled): Test
Accuracy: 0.9943875061005368

