# Function Description: BinomialLogisticRegression

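BinomialLogisticRegression is a Python function that fits a binomial logistic regression model to a Pandas DataFrame and predicts the class of each observation from the fitted model. The last column of the DataFrame is treated as the dependent variable and all other columns as predictors.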

## Input Parameters

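- df (pandas.DataFrame): the input DataFrame, with the predictor variables in the first columns and the binary dependent variable in the last column. The code assumes the two classes are coded as 0 and 1.
- opt (bool): whether to drop non-significant predictors and refit the model (default is False).
- cutoff (float): the probability threshold for classification (default is 0.5).
- it (int): the iteration counter used by the recursive calls when opt is True (default is 0). It normally does not need to be set by the caller.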
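
As a rough illustration of what these parameters expect, the sketch below builds a small DataFrame in the layout the code cell reads (predictors first, binary dependent variable in the last column) and shows how a cutoff turns probabilities into classes. The column names, the data and the probabilities are made up for the example and are not produced by the function.

```python
import numpy as np
import pandas as pd

# Hypothetical input: predictor columns first, 0/1 dependent variable last
df = pd.DataFrame({
    "x1": [0.5, 1.2, 2.3, 3.1, 4.0, 5.2],
    "x2": [1.0, 0.3, 1.8, 0.7, 2.5, 1.1],
    "y":  [0, 0, 1, 0, 1, 1],
})

# Conditions the function checks before fitting
has_no_nan = not df.isna().any().any()
n_classes = df.iloc[:, -1].nunique()

# Made-up probabilities, only to show the effect of the cutoff
p1 = np.array([0.10, 0.35, 0.50, 0.62, 0.80, 0.95])
classes_default = np.where(p1 > 0.5, 1, 0)  # strict inequality, as in the code cell
classes_higher = np.where(p1 > 0.7, 1, 0)   # a higher cutoff predicts fewer positives

print(has_no_nan, n_classes)
print(classes_default)
print(classes_higher)
```

Because the comparison is strict, a probability exactly equal to the cutoff is assigned to class 0.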


## Dependencies

This function requires the Pandas, NumPy and SciPy libraries to be installed.

## Function Steps

**1.** The function starts by importing the required libraries (Pandas, NumPy and `scipy.stats`), validating the types of its arguments, and checking the DataFrame for missing values. If any column contains NaN, it prints the affected columns and returns.

**2.** It then applies a nested function that checks for multicollinearity, dropping one column from each pair of columns whose correlation is above 0.98.

**3.** The function takes the last column as the dependent variable and checks that it has exactly two classes. The remaining columns are the predictors, to which a column of ones is added for the intercept.

**4.** It estimates the intercept and the coefficients of the logistic regression model with Newton's method, iterating up to 1000 times or until the coefficients stop changing within a relative tolerance of 0.0001.

**5.** The function computes the predicted probabilities, assigns an observation to class 1 when its probability is above the cutoff, and prints the log-likelihood, AIC and BIC. It then computes the standard error, z statistic and p-value of each coefficient to check the significance of each individual predictor variable.

**6.** If the opt parameter is set to True and any predictor is not significant at the 0.05 level, the function drops the predictor with the largest p-value and performs the regression again recursively until all remaining predictors are significant.

**7.** Finally, the function prints the coefficient table, the confusion matrix and the related metrics (accuracy, precision, recall, specificity and F1-score), and returns a DataFrame with the variables used in the model and a new column, class, with the predicted classes.
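
The fitting and significance-testing logic used in the code cell can be sketched in isolation as follows. This is a minimal illustration, not the function itself: the helper name `newton_logistic` and the synthetic data are assumptions made for the example, and the Hessian is computed with broadcasting instead of an explicit diagonal matrix.

```python
import numpy as np
from scipy import stats

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def newton_logistic(X, y, max_iter=1000, tol=1e-4):
    """Fit logistic-regression coefficients with Newton's method.

    X must already contain a leading column of ones for the intercept.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = logistic(X @ theta)
        gradient = X.T @ (p - y)              # gradient of the negative log-likelihood
        hessian = (X.T * (p * (1 - p))) @ X   # X'WX without building the n x n matrix W
        new_theta = theta - np.linalg.solve(hessian, gradient)
        converged = np.allclose(theta, new_theta, rtol=tol)
        theta = new_theta
        if converged:
            break
    return theta

# Synthetic data (an assumption for the demo): one predictor, overlapping classes
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.uniform(size=200) < logistic(0.5 + 1.5 * x)).astype(float)
X = np.column_stack([np.ones_like(x), x])

theta = newton_logistic(X, y)

# Wald z-test: standard errors come from the inverse Hessian at the estimate
p = logistic(X @ theta)
cov = np.linalg.inv((X.T * (p * (1 - p))) @ X)
se = np.sqrt(np.diag(cov))
z = theta / se
p_values = 2 * stats.norm.sf(np.abs(z))

# Log-likelihood and information criteria, with k estimated parameters
ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
k = X.shape[1]
aic = -2 * ll + 2 * k
bic = -2 * ll + k * np.log(len(y))

print(theta, se, p_values)
print(ll, aic, bic)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse Hessian at every iteration. The inverse is only needed once, at the end, for the standard errors.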

In [None]:
def BinomialLogisticRegression(df, opt=False, cutoff=.5, it=0):
    
    '''
    The function BinomialLogisticRegression performs binomial logistic regression on a Pandas DataFrame and predicts
    each observation's class based on the model. The function requires the Pandas, NumPy and SciPy libraries.
    
    The function first checks the input types and the presence of NaN values. It then takes the last column as the
    dependent variable and the remaining columns as predictors, and checks whether the dependent variable has two
    classes. If it does not, the function prints an error message and returns.
    
    The multicollinearity_analysis function is defined inside BinomialLogisticRegression and aims to exclude columns
    with near-perfect correlation to other columns in the DataFrame. It generates a correlation matrix and checks
    for pairs of columns with a correlation coefficient above 0.98. If it finds such a pair, it removes one of them.
    
    The function initializes theta, defines the tolerance, and fits the logistic regression using Newton's method.
    The predicted classes are obtained by applying a threshold value (default 0.5) to the probabilities generated
    by the logistic function.
    
    The function also computes a z statistic and a p-value for each coefficient to check the significance of each
    individual predictor variable. If the "opt" parameter is set to True and any predictor is not significant at
    the 0.05 level, the function drops the predictor with the largest p-value and performs the regression again
    recursively until all remaining predictors are significant.
    
    The main logistic regression metrics, such as the log-likelihood, Akaike information criterion (AIC),
    Bayesian information criterion (BIC), confusion matrix (based on the predicted classes), accuracy, precision,
    recall, specificity and F1-score, are also printed.
    
    Finally, the function returns a DataFrame with the variables used in the model and a new column, "class", with
    the classes predicted by the logistic regression model.
    
    Args:
        df: a Pandas DataFrame with the predictors in the first columns and the binary dependent variable last.
        opt: a boolean indicating whether or not to optimize the model (default is False).
        cutoff: a float indicating the threshold for classification (default is 0.5).
        it: an integer counting the recursive iterations when opt is True (default is 0).
        
    '''
    
    
    # Importing needed libraries
    import pandas as pd
    import numpy as np
    from scipy import stats
    
    # Function for Multicollinearity Analysis
    def multicollinearity_analysis(df):
        
        '''This function drops one column from each pair of predictor columns with a correlation above 0.98.'''
    
        # Correlation matrix of the predictors (the last column is the dependent variable)
        df_cor = df.iloc[:, :-1].corr()

        # Getting each pair of columns with near-perfect correlation exactly once
        cols = list(df_cor.columns)
        perfect_correlation_columns = []
        for i, col in enumerate(cols):
            for row in cols[i + 1:]:
                if df_cor.loc[row, col] > .98:
                    perfect_correlation_columns.append((col, row))
                            
        # Excluding the first column of each pair
        for tuple_col in perfect_correlation_columns:
            if tuple_col[0] in df:
                print(f'The "{tuple_col[0]}" column will be deleted for having a correlation above 0.98 with {tuple_col[1]}.\n')
                df = df.drop(columns=tuple_col[0])
                      
        # Returning treated Dataframe
        return df
    
    # Logistic Function 
    def logistic(x):
        return 1 / (1 + np.exp(-x))
    
    # Function for crafting the confusion matrix and related metrics
    def confusion_matrix_and_metrics(df):
        
        '''Develop a confusion matrix based on the predicted and the real values and print the related main metrics.'''
        
        # Creating the Confusion Matrix (rows: predicted classes, columns: real classes)
        df_classes = df.drop(columns=predictor_columns)
        conf_mat = pd.crosstab(index=df_classes[df_classes.columns[1]], columns=df_classes[df_classes.columns[0]])
        # Keeping both classes in the matrix even if one of them was never predicted
        conf_mat = conf_mat.reindex(index=[0, 1], columns=[0, 1], fill_value=0)
        conf_mat.columns.name = 'Real'
        conf_mat.index.name = 'Predicted'
        
        # Calculating metrics (the classes are assumed to be coded as 0 and 1)
        accuracy = (conf_mat[0].loc[0] + conf_mat[1].loc[1])/conf_mat.sum().sum()
        precision = conf_mat[1].loc[1]/conf_mat.sum(axis=1).loc[1]
        recall = conf_mat[1].loc[1]/conf_mat.sum(axis=0).loc[1]
        specificity = conf_mat[0].loc[0]/conf_mat[0].sum()
        f1_score = (precision * recall * 2) / (precision + recall)
        
        # Printing outputs
        print(conf_mat)
        print(f'\n\nAccuracy: {round(accuracy,2)}\nPrecision: {round(precision,2)}\nRecall: {round(recall,2)}\nSpecificity: {round(specificity,2)}\nF1-score: {round(f1_score,2)}')
        
        return
    
    # Check input types
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame.")
    if not isinstance(opt, bool):
        raise TypeError("opt must be a boolean.")
    if not isinstance(it, int):
        raise TypeError("it must be an integer.")
        
    # Checking Null Values
    null_vals = df.isna().sum()
    if len(null_vals.loc[null_vals>0].index) > 0:
        print(f'NaN values in the columns: {null_vals.loc[null_vals>0].index}')
        return
    
    # Excluding columns with perfect correlation
    df = multicollinearity_analysis(df)  
    
    # Checking the dependent variable (last column)
    y = df.iloc[:, -1].values
    if len(np.unique(y)) != 2:
        print('Incorrect number of classes on the dependent variable.')
        return
    
    # Splitting the predictor variables and adding the intercept column
    predictor_columns = df.columns[0:-1]
    X = df[predictor_columns].values
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    
    # Initializing theta and defining tolerance
    theta = np.zeros(X.shape[1])
    tolerance = 0.0001
    
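    # Iterating with Newton's method to get the best theta
    for i in range(1000):
        y_pred = logistic(np.dot(X, theta))    
        error = y_pred - y

        gradient = X.T @ error
        # Hessian X'WX, computed without building the n x n diagonal matrix W
        H = (X.T * (y_pred * (1 - y_pred))) @ X
        
        # Solving H @ step = gradient is more stable than inverting H
        new_theta = theta - np.linalg.solve(H, gradient)
        converged = np.allclose(theta, new_theta, rtol=tolerance)
        theta = new_theta
        if converged:
            break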
    
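    # Getting Logit, p1 and p0
    logit = np.dot(X, theta)
    p1 = logistic(logit)
    # Keeping p1 away from exactly 0 or 1 so that both logarithms below are finite
    p1 = np.where(p1 == 1, .9999, p1)
    p1 = np.where(p1 == 0, .0001, p1)
    p0 = 1 - p1
    predictions = np.where(p1 > cutoff, 1, 0)
    # Working on a copy so the caller's DataFrame is not modified
    df = df.copy()
    df['class'] = predictions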
    
    # Calculating log-likelihood, AIC and BIC
    ll = np.sum(y * np.log(p1) + (1 - y) * np.log(p0))
    aic = -2 * ll + 2 * (len(predictor_columns) + 1 )
    bic = -2 * ll + (len(predictor_columns) + 1) * np.log(len(df))
    print(f'Log-likelihood: {round(ll, 4)},\t AIC: {round(aic, 4)},\t BIC: {round(bic, 4)}\n______________________________________________________________________\n')

    # Wald z statistics (standard errors from the inverse of the Hessian)
    H = (X.T * (p1 * (1 - p1))) @ X
    I = np.linalg.inv(H)
    se = np.sqrt(np.diagonal(I))
    z = theta / se
    p_val = 2 * stats.norm.sf(np.abs(z))
    
    # Informative DataFrame
    df_info = pd.DataFrame(index=['intercept'] + list(predictor_columns), columns = ['Estimate', 'Std. Error', 'Z statistic', 'P value', 'Sig. at 0.05'])
    df_info['Estimate'] = np.round(theta, 4)
    df_info['Std. Error'] = np.round(se,4)
    df_info['Z statistic'] = np.round(z,4)
    df_info['P value'] = np.round(p_val,4)
    df_info['Sig. at 0.05'] = np.where(p_val > 0.05, 'n', 'y')
    
    # If opt is True and any predictor has a p-value greater than 0.05, the function drops the least significant predictor and refits
    if opt and df_info['P value'].iloc[1:].max() > 0.05:
        print(df_info)
        print('\n______________________________________________________________________\nExcluding columns that are not statistically significant.')
        # Predictor with the largest p-value (the intercept is never dropped)
        max_index = df_info['P value'].iloc[1:].idxmax()
        it += 1
        print(f'Excluded column: {max_index}')
        df = df.drop(columns=[max_index, 'class'])
        print('______________________________________________________________________\n######################################################################')
        print(f'\t\t\tIteration number {it}')
        print('######################################################################\n______________________________________________________________________\n')
        df = BinomialLogisticRegression(df, opt=True, cutoff=cutoff, it=it)
        return df
    
    # Printing the results and returning the DataFrame with the predicted classes
    else:  
        print(df_info)
        print('\n______________________________________________________________________\n')
        print(f'Considering the Cutoff of: {cutoff}\n\nConfusion Matrix:\n')
        confusion_matrix_and_metrics(df)
        return df
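
The metrics printed by the nested `confusion_matrix_and_metrics` helper can be checked by hand on a small example. The sketch below is a standalone illustration with made-up real and predicted classes. It follows the same layout as the helper (rows are predicted classes, columns are real classes) and assumes the classes are coded as 0 and 1.

```python
import pandas as pd

# Hypothetical real and predicted classes
real = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="Real")
predicted = pd.Series([1, 0, 0, 1, 0, 1, 1, 0], name="Predicted")

# Rows are predicted classes, columns are real classes; reindex keeps a
# class in the table even if it never appears among the predictions
conf_mat = (
    pd.crosstab(index=predicted, columns=real)
    .reindex(index=[0, 1], columns=[0, 1], fill_value=0)
)

tp = conf_mat.loc[1, 1]  # predicted 1, real 1
tn = conf_mat.loc[0, 0]  # predicted 0, real 0
fp = conf_mat.loc[1, 0]  # predicted 1, real 0
fn = conf_mat.loc[0, 1]  # predicted 0, real 1

accuracy = (tp + tn) / conf_mat.to_numpy().sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1_score = 2 * precision * recall / (precision + recall)

print(conf_mat)
print(accuracy, precision, recall, specificity, f1_score)
```

Here there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so every metric equals 0.75.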