In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Binary Logistic Regression For Cancer Prediction

In this study, we developed a predictive model using **Binary Logistic Regression** implemented with `sm.Logit` from the **statsmodels** package in Python. The model was trained to classify cancer cases based on genomic data.

After building the model, we evaluated its performance using the **ROC curve, Gini coefficient, accuracy, specificity, and sensitivity**. Together, these metrics show how reliably the model separates cancer cases from non-cancer cases.

In [None]:
# In[0.1]: Package installation

!pip install pandas
!pip install numpy
!pip install -U seaborn
!pip install matplotlib
!pip install plotly
!pip install scipy
!pip install statsmodels
!pip install scikit-learn
!pip install statstests

In [None]:
# In[0.2]: Package Import

import pandas as pd # data manipulation in dataframe format
import numpy as np # mathematical operations
import seaborn as sns # graphical visualization
import matplotlib.pyplot as plt # graphical visualization
from scipy.interpolate import UnivariateSpline # smoothed sigmoid curve
import statsmodels.api as sm # model estimation
import statsmodels.formula.api as smf # binary logistic model estimation
from statstests.process import stepwise # Stepwise procedure
from scipy import stats # chi2 statistics
import plotly.graph_objects as go # 3D graphics
from statsmodels.iolib.summary2 import summary_col # model comparison
from statsmodels.discrete.discrete_model import MNLogit # multinomial logistic model estimation
import warnings
from sklearn.preprocessing import MinMaxScaler
warnings.filterwarnings('ignore')

In [None]:
# In[0.3] Loading the Data

# Assigning the data to the variable df_cancer
df_cancer = pd.read_csv('/kaggle/input/genomic-data-for-cancer/gene_expression.csv')

# Removing spaces from variable labels
df_cancer.columns = df_cancer.columns.str.replace(' ', '_')

df_cancer.info()

### Model Evaluation and Results  

The **Binary Logistic Regression** model for cancer prediction fits the data well. Its **pseudo R-squared of 0.5441** (McFadden's measure, as reported by statsmodels) means the fitted log-likelihood is about 54% closer to zero than that of the intercept-only model. It is a measure of relative fit, not a share of explained variance as in linear regression.

The **log-likelihood improved** from -2079.4 (null model) to -948.05, and the **LLR p-value < 0.001** confirms that the model as a whole is statistically significant.
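
As a quick check, the pseudo R-squared and the likelihood-ratio statistic can be recomputed from the two log-likelihoods quoted above. This is a minimal sketch that assumes the reported value is McFadden's pseudo R-squared; the variable names are illustrative and the inputs are the rounded figures from the text.

```python
# Log-likelihoods as quoted in the text (rounded values from the model summary)
ll_null = -2079.4    # intercept-only model
ll_model = -948.05   # model with Gene_One and Gene_Two

# McFadden's pseudo R-squared: 1 - (LL_model / LL_null)
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic: 2 * (LL_model - LL_null)
# Under the null it follows a chi-square distribution with 2 degrees of freedom here
llr_stat = 2 * (ll_model - ll_null)

print(f"McFadden pseudo R-squared: {pseudo_r2:.4f}")
print(f"LLR statistic: {llr_stat:.1f}")
```

A statistic this large on 2 degrees of freedom gives a p-value far below any conventional threshold, which is why the summary displays it as 0.000.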

#### Coefficients Analysis:  
- **Intercept (3.7855, p < 0.001):** The log-odds of cancer presence when both gene expression values are zero.  
- **Gene_One (0.6145, p < 0.001):** A positive coefficient: each one-unit increase in expression raises the log-odds of cancer by 0.6145, holding Gene_Two constant.  
- **Gene_Two (-1.3362, p < 0.001):** A negative coefficient: each one-unit increase in expression lowers the log-odds of cancer by 1.3362, holding Gene_One constant.  

All coefficients are **statistically significant (p < 0.001)**, so both genes contribute to the prediction.
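
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, and the logistic function turns a linear predictor into a probability. The sketch below uses the coefficients quoted above; the expression values passed to `predicted_probability` (a helper name chosen here for illustration) are hypothetical and not taken from the dataset.

```python
import math

# Coefficients as quoted in the text (log-odds scale)
intercept = 3.7855
beta_gene_one = 0.6145
beta_gene_two = -1.3362

# Odds ratio for a one-unit increase in each gene, holding the other fixed
or_gene_one = math.exp(beta_gene_one)
or_gene_two = math.exp(beta_gene_two)

def predicted_probability(gene_one, gene_two):
    """Apply the logistic function to the linear predictor."""
    logit = intercept + beta_gene_one * gene_one + beta_gene_two * gene_two
    return 1 / (1 + math.exp(-logit))

print(f"Odds ratio Gene_One: {or_gene_one:.3f}")
print(f"Odds ratio Gene_Two: {or_gene_two:.3f}")

# Hypothetical expression values, chosen only for illustration
print(f"P(cancer | Gene_One=5, Gene_Two=5): {predicted_probability(5, 5):.3f}")
```

An odds ratio above 1 (Gene_One) multiplies the odds of cancer upward for each extra unit of expression, while an odds ratio below 1 (Gene_Two) shrinks them.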

In [None]:
# In[0.4] Model formula and estimation

Formula_model = "Cancer_Present ~ Gene_One + Gene_Two"

model_cancer = sm.Logit.from_formula(Formula_model, df_cancer).fit()

model_cancer.summary()

## Confusion Matrix Function

This function generates a confusion matrix to evaluate the performance of a classification model. It calculates sensitivity, specificity, and accuracy based on a given cutoff value. The confusion matrix is displayed as a plot for better visualization.


In [None]:
# In[0.5] Building the Function for the Confusion Matrix

from sklearn.metrics import confusion_matrix, accuracy_score,\
    ConfusionMatrixDisplay, recall_score

def confusion_matrix_function(predicts, observed, cutoff):
    
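    values = predicts.values
    
    # Classify as 0 when the predicted probability is below the cutoff, else 1
    binary_prediction = np.where(values < cutoff, 0, 1)
           
    # Predictions are passed first on purpose: rows show the classified label
    # and columns show the true label, matching the axis labels set below
    cm = confusion_matrix(binary_prediction, observed)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot()
    plt.xlabel('True')
    plt.ylabel('Classified')
    # Invert both axes so that class 1 is shown first
    plt.gca().invert_xaxis()
    plt.gca().invert_yaxis()
    plt.show()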
        
    sensitivity = recall_score(observed, binary_prediction, pos_label=1)
    specificity = recall_score(observed, binary_prediction, pos_label=0)
    accuracy = accuracy_score(observed, binary_prediction)

    # Visualizing the main indicators of this confusion matrix
    indicators = pd.DataFrame({'Sensitivity':[sensitivity],
                               'Specificity':[specificity],
                               'Accuracy':[accuracy]})
    return indicators

## Confusion Matrix Analysis and Model Performance

The confusion matrix and performance metrics indicate that the model performs well in predicting cancer presence:  

- **Sensitivity (85.73%)**: The model correctly identifies 85.73% of actual cancer cases, meaning it effectively detects positive cases.  
- **Specificity (85.67%)**: The model correctly identifies 85.67% of non-cancer cases, reducing false positives.  
- **Accuracy (85.7%)**: Overall, the model classifies cases correctly in 85.7% of instances.  

With **1286 true positives (TP)** and **1285 true negatives (TN)**, the model performs almost identically on both classes. The remaining **215 false positives (FP)** and **214 false negatives (FN)**, about 14.3% of all cases, show where there is room for improvement. (These FP and FN counts are the assignment consistent with the sensitivity and specificity reported above.)
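
The three indicators follow directly from the four cells of the confusion matrix. The sketch below recomputes them from the true-positive and true-negative counts quoted above, and assumes 214 false negatives and 215 false positives, which is the assignment that reproduces the reported percentages.

```python
# TP and TN as quoted in the text; the FN/FP split is an assumption (see above)
tp, tn = 1286, 1285
fn, fp = 214, 215

sensitivity = tp / (tp + fn)                  # share of actual positives detected
specificity = tn / (tn + fp)                  # share of actual negatives detected
accuracy = (tp + tn) / (tp + tn + fp + fn)    # share of all cases classified correctly

print(f"Sensitivity: {sensitivity:.2%}")
print(f"Specificity: {specificity:.2%}")
print(f"Accuracy:    {accuracy:.2%}")
```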

In [None]:
# In[0.6] Building the confusion matrix

# Adding the predicted probabilities to the dataset by creating the column 'phat'
df_cancer['phat'] = model_cancer.predict()

# Confusion matrix for cutoff = 0.5
confusion_matrix_function(observed=df_cancer['Cancer_Present'],
                predicts=df_cancer['phat'],
                cutoff=0.50)

## **Analyzing Sensitivity and Specificity Across Different Cutoff Points**

This code is designed to analyze the behavior of **sensitivity and specificity** across different cutoff values, ranging from **0 to 1**.  

The function **`spec_sens`** takes observed values and predicted probabilities as input and evaluates how **sensitivity and specificity change** for each cutoff point in increments of **0.01**.  

By iterating through the cutoff range, it classifies predictions as **binary (0 or 1)** and computes **sensitivity (recall for positive class) and specificity (recall for negative class)**, storing the results in a DataFrame for further analysis.

In [None]:
# In[0.7] Equalizing specificity and sensitivity criteria for educational purposes

def spec_sens(observed, predicts):
    
    # Add object with the predicted values
    values = predicts.values
    
    # Range of cutoffs to be analyzed in steps of 0.01
    cutoffs = np.arange(0, 1.01, 0.01)
    
    # Lists to store specificity and sensitivity results
    sensitivity_list = []
    specificity_list = []
    
    for cutoff in cutoffs:
        
        # Defining binary result according to the prediction
        binary_prediction = np.where(values >= cutoff, 1, 0)
                
        # Calculate sensitivity and specificity for the cutoff
        sensitivity = recall_score(observed, binary_prediction, pos_label=1)
        specificity = recall_score(observed, binary_prediction, pos_label=0)
        
        # Add values to the lists
        sensitivity_list.append(sensitivity)
        specificity_list.append(specificity)
        
    # Create dataframe with results for their respective cutoffs
    result = pd.DataFrame({'cutoffs': cutoffs, 'sensitivity': sensitivity_list, 'specificity': specificity_list})
    return result

In [None]:
# In[0.8] We create a dataframe that contains the vectors 'sensitivity', 'specificity', and 'cutoffs'

plotting_data = spec_sens(observed = df_cancer['Cancer_Present'],
                          predicts = df_cancer['phat'])
plotting_data

The graph below plots sensitivity and specificity against the cutoff. The two curves intersect at a cutoff of approximately 0.5, which is where the model treats both classes equally well.
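
The crossing point can also be located numerically instead of read off the plot. The following is a minimal sketch on synthetic labels and scores (not the cancer data), using a vectorized NumPy sweep over the same 0.01 grid of cutoffs; all names and the simulated distribution are assumptions made for illustration.

```python
import numpy as np

# Synthetic labels and scores, purely for illustration (not the cancer data)
rng = np.random.default_rng(0)
observed = rng.integers(0, 2, size=2000)
noise = rng.normal(0, 0.2, size=2000)
phat = np.clip(0.5 + 0.25 * (2 * observed - 1) + noise, 0, 1)

cutoffs = np.linspace(0, 1, 101)

# Classify as 1 when the score is >= cutoff; one column per cutoff
pred = phat[:, None] >= cutoffs[None, :]
pos = observed == 1
neg = ~pos

# Recall of the positive class and of the negative class at every cutoff
sensitivity = (pred & pos[:, None]).sum(axis=0) / pos.sum()
specificity = (~pred & neg[:, None]).sum(axis=0) / neg.sum()

# Grid cutoff where the two curves are closest to each other
best = np.argmin(np.abs(sensitivity - specificity))
print(f"Crossing near cutoff {cutoffs[best]:.2f}: "
      f"sensitivity={sensitivity[best]:.3f}, specificity={specificity[best]:.3f}")
```

With the notebook's own results, the same `argmin` step could be applied to the `sensitivity` and `specificity` columns of `plotting_data`.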

In [None]:
# In[0.9]: Plotting a graph that shows the variation of specificity and sensitivity as a function of the cutoff

plt.figure(figsize=(15,10))
with plt.style.context('seaborn-v0_8-whitegrid'):
    plt.plot(plotting_data.cutoffs, plotting_data.sensitivity, marker='o',
         color='indigo', markersize=8)
    plt.plot(plotting_data.cutoffs, plotting_data.specificity, marker='o',
         color='limegreen', markersize=8)
plt.xlabel('Cutoff', fontsize=20)
plt.ylabel('Sensitivity / Specificity', fontsize=20)
plt.xticks(np.arange(0, 1.1, 0.2), fontsize=14)
plt.yticks(np.arange(0, 1.1, 0.2), fontsize=14)
plt.legend(['Sensitivity', 'Specificity'], fontsize=20)
plt.show()

The ROC curve has an AUC of 0.9392, which corresponds to a Gini coefficient of 0.8785. Both show that the model discriminates very well between the two classes.

- **AUC = 0.9392**: a randomly chosen cancer case receives a higher predicted probability than a randomly chosen non-cancer case about 94% of the time.
- **Gini = 0.8785**: a rescaling of the AUC, `(AUC - 0.5) / 0.5`, in which 0 corresponds to random ranking and 1 to perfect ranking.

Both values are computed on the same data used to fit the model, so they describe in-sample performance in separating cancer from non-cancer cases.
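
To make the link between the two numbers concrete, the sketch below computes the AUC as the share of (positive, negative) pairs that are ranked correctly and then derives the Gini coefficient from it. The helper `auc_by_ranking` and the four-observation example are made up for illustration; the notebook itself uses `roc_curve` and `auc` from scikit-learn.

```python
import numpy as np

def auc_by_ranking(observed, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    observed = np.asarray(observed)
    scores = np.asarray(scores, dtype=float)
    pos = scores[observed == 1]
    neg = scores[observed == 0]
    diff = pos[:, None] - neg[None, :]   # every positive score minus every negative score
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Tiny made-up example, not the cancer data
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]

roc_auc = auc_by_ranking(y, p)   # 3 of the 4 pairs are ranked correctly
gini = 2 * roc_auc - 1           # same as (roc_auc - 0.5) / 0.5
print(roc_auc, gini)
```

Applying the same formula to the rounded AUC of 0.9392 gives 0.8784; the reported 0.8785 presumably differs only because it was computed from the unrounded AUC.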

In [None]:
# In[1.0] Construction of the ROC Curve

from sklearn.metrics import roc_curve, auc

# 'roc_curve' function from the 'metrics' package in sklearn

fpr, tpr, thresholds = roc_curve(df_cancer['Cancer_Present'],
                                 df_cancer['phat'])
roc_auc = auc(fpr, tpr)

# Calculation of the GINI coefficient
gini = (roc_auc - 0.5) / (0.5)

# Plotting the ROC curve
plt.figure(figsize=(15,10))
plt.plot(fpr, tpr, marker='o', color='darkorchid', markersize=10, linewidth=3)
plt.plot(fpr, fpr, color='gray', linestyle='dashed')
plt.title('Area under the curve: %g' % round(roc_auc, 4) +
          ' | GINI Coefficient: %g' % round(gini, 4), fontsize=22)
plt.xlabel('1 - Specificity', fontsize=20)
plt.ylabel('Sensitivity', fontsize=20)
plt.xticks(np.arange(0, 1.1, 0.2), fontsize=14)
plt.yticks(np.arange(0, 1.1, 0.2), fontsize=14)
plt.show()

# **Conclusion**

In conclusion, the binary logistic regression model estimated with the `statsmodels` package in Python achieved an AUC of 0.9392 and a Gini coefficient of 0.8785, which shows that it separates cancer from non-cancer cases very well. At a cutoff of 0.5 it classifies about 85.7% of cases correctly, with nearly equal sensitivity and specificity. These figures were measured on the data used to fit the model, and within that scope the logistic regression model shows great promise as a tool for predicting cancer presence from gene expression.