# **Breast Cancer Statistical Testing**

Before fitting a logistic regression to classify breast cancer tumors with the [Breast Cancer Wisconsin (Original)](https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original "Breast Cancer Wisconsin (Original)") dataset, it is important to run hypothesis tests to understand the association between each feature and the target variable (`Class`).

In this analysis, I will use the Chi-square test to evaluate the relationship between each feature and the target variable. The hypotheses for these tests are as follows:

* **Null Hypothesis (H0):** There is no association between the feature and the target variable (`Class`).

* **Alternative Hypothesis (H1):** There is an association between the feature and the target variable (`Class`).
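
To make the hypotheses above concrete, here is a minimal sketch of the Chi-square test of independence on a small contingency table. The counts are hypothetical and are not taken from the dataset. The sketch computes the expected counts and the Pearson statistic by hand and compares them with `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of counts (rows: Class 0/1, columns: Low/Medium/High).
observed = np.array([[40, 8, 2],
                     [5, 15, 30]])

# Expected counts under H0: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson statistic: sum over cells of (observed - expected)^2 / expected.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# SciPy applies Yates' continuity correction only when dof == 1 (2x2 tables),
# so for this 2x3 table the two statistics should agree.
chi2, p, dof, expected = chi2_contingency(observed)
print(f"manual chi2 = {chi2_manual:.4f}, scipy chi2 = {chi2:.4f}, dof = {dof}, p = {p:.2e}")
```

A small p-value here means that the observed counts differ from what independence between the two variables would predict, which is the evidence used to reject H0.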

By conducting these tests, I aim to identify which features have a statistically significant association with the target variable. This step is crucial as it helps in understanding the data better and ensures that the features included in the logistic regression model are relevant for predicting the class of the tumor.

The Chi-square tests reported in the Results section give p-values that round to 0.0000 for every feature, well below the 0.05 significance level, so I reject the null hypothesis for each feature. This indicates a statistically significant association between each feature and the target variable. Consequently, I proceed with logistic regression, confident that the features used have a meaningful relationship with the target variable.

**Chi-square test references**

Dodge, Y. (2008). Chi-Square Test. In: The Concise Encyclopedia of Statistics (pp. 77-79). Springer, New York, NY.

## **Importing the libraries**

In [36]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

## **Importing the dataset**

In [37]:
# GitHub repository hosting the dataset.
url = "https://raw.githubusercontent.com/SantiagoMorenoV/Breast_Cancer_Logit_Model/refs/heads/main/breast-cancer-wisconsin.data"

headers = [
    "Sample code number", "Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape", "Marginal Adhesion",
    "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class"
]

data = pd.read_csv(url, header = None, names = headers)

data.replace("?", pd.NA, inplace=True)
data["Bare Nuclei"] = pd.to_numeric(data["Bare Nuclei"]).astype('Int64')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Sample code number           699 non-null    int64
 1   Clump Thickness              699 non-null    int64
 2   Uniformity of Cell Size      699 non-null    int64
 3   Uniformity of Cell Shape     699 non-null    int64
 4   Marginal Adhesion            699 non-null    int64
 5   Single Epithelial Cell Size  699 non-null    int64
 6   Bare Nuclei                  683 non-null    Int64
 7   Bland Chromatin              699 non-null    int64
 8   Normal Nucleoli              699 non-null    int64
 9   Mitoses                      699 non-null    int64
 10  Class                        699 non-null    int64
dtypes: Int64(1), int64(10)
memory usage: 60.9 KB


### **Missing values**

In [38]:
missing_percentage = data.isnull().sum() * 100 / len(data)
for column, percentage in missing_percentage.items():
    print(f'{column}: {percentage:.2f}%')

Sample code number: 0.00%
Clump Thickness: 0.00%
Uniformity of Cell Size: 0.00%
Uniformity of Cell Shape: 0.00%
Marginal Adhesion: 0.00%
Single Epithelial Cell Size: 0.00%
Bare Nuclei: 2.29%
Bland Chromatin: 0.00%
Normal Nucleoli: 0.00%
Mitoses: 0.00%
Class: 0.00%


### **Working Dataset** 

In [50]:
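# Drop the 16 rows with a missing 'Bare Nuclei' value; copy so later edits do not touch `data`
dataset = data.dropna().copy()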

In [51]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 683 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Sample code number           683 non-null    int64
 1   Clump Thickness              683 non-null    int64
 2   Uniformity of Cell Size      683 non-null    int64
 3   Uniformity of Cell Shape     683 non-null    int64
 4   Marginal Adhesion            683 non-null    int64
 5   Single Epithelial Cell Size  683 non-null    int64
 6   Bare Nuclei                  683 non-null    Int64
 7   Bland Chromatin              683 non-null    int64
 8   Normal Nucleoli              683 non-null    int64
 9   Mitoses                      683 non-null    int64
 10  Class                        683 non-null    int64
dtypes: Int64(1), int64(10)
memory usage: 64.7 KB


In [41]:
# Recode 'Class' as a binary variable: 2 (benign) -> 0, 4 (malignant) -> 1
dataset.loc[:, 'Class'] = dataset['Class'].map({2: 0, 4: 1})

### **Converting the numerical features to categorical**

In [43]:
# Make sure the nine feature columns are numeric; unparseable values become NaN.
# headers[1:-1] skips the identifier (first column) and the target 'Class' (last column).
for column in headers[1:-1]:
    dataset[column] = pd.to_numeric(dataset[column], errors='coerce')

# Drop any column that contains only NaN values
dataset = dataset.dropna(axis=1, how='all').copy()

# Bin each feature into three equal-width categories and store the labels as objects
for column in dataset.columns[1:-1]:
    dataset[column] = pd.cut(dataset[column], bins=3, labels=["Low", "Medium", "High"]).astype(object)
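
To illustrate what `pd.cut` with `bins=3` does in the cell above, the sketch below bins a hypothetical feature that takes every integer value from 1 to 10. The series `scores` is an assumption made for illustration. Because `pd.cut` derives equal-width bins from the observed minimum and maximum, a feature with a narrower range in the real data would get different cut points.

```python
import pandas as pd

# Hypothetical feature covering the full 1..10 scale.
scores = pd.Series(range(1, 11), name="score")

# Three equal-width bins over the observed range [1, 10].
binned = pd.cut(scores, bins=3, labels=["Low", "Medium", "High"])

# Show the smallest and largest score that falls into each bin.
mapping = pd.DataFrame({"score": scores, "bin": binned})
print(mapping.groupby("bin", observed=True)["score"].agg(["min", "max"]))
```

Under this assumption, scores 1 to 4 fall into `Low`, 5 to 7 into `Medium`, and 8 to 10 into `High`.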


## **Function for the Chi-Squared test**

In [52]:
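def chi2_test(var1, var2):
    """Run a Chi-square test of independence between two columns of `dataset`."""
    # Cross-tabulate the observed counts for every pair of categories
    contingency_table = pd.crosstab(dataset[var1], dataset[var2])
    # chi2_contingency also returns the degrees of freedom and the expected counts
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    return chi2, p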
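
The Chi-square approximation is usually considered reliable only when the expected counts are not too small (a common rule of thumb is at least 5 per cell). The sketch below shows one way to check this with the `expected` array that `chi2_contingency` already returns. The helper `min_expected_count` and the `toy` data frame are hypothetical and are not part of the original analysis.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def min_expected_count(df, var1, var2):
    """Return the smallest expected cell count for the var1 x var2 table."""
    table = pd.crosstab(df[var1], df[var2])
    _, _, _, expected = chi2_contingency(table)
    return expected.min()

# Hypothetical data: 60 rows, two balanced classes, two feature levels.
toy = pd.DataFrame({
    "Class": [0] * 30 + [1] * 30,
    "Feature": ["Low"] * 20 + ["High"] * 10 + ["Low"] * 10 + ["High"] * 20,
})

smallest = min_expected_count(toy, "Class", "Feature")
print(f"Smallest expected count: {smallest:.1f}")
```

Calling the helper on `dataset` for each feature would flag any table where sparse categories might make the p-value unreliable.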

### **Applying the Chi-squared test between `Class` and the rest of the categorical features**

In [53]:
results = {}
for column in headers[1:-1]:  # Skip the identifier (first column) and the target 'Class' (last column)
    if column in dataset.columns:
        chi2, p = chi2_test('Class', column)
        results[column] = {'chi2': chi2, 'p_val': p}
    else:
        print(f"Column '{column}' not found in dataset")

### **Results**

In [54]:
for feature, result in results.items():
    print(f"{feature}: chi2 = {result['chi2']:.4f}, p_val = {result['p_val']:.4f}")

Clump Thickness: chi2 = 378.0816, p_val = 0.0000
Uniformity of Cell Size: chi2 = 539.7931, p_val = 0.0000
Uniformity of Cell Shape: chi2 = 523.0710, p_val = 0.0000
Marginal Adhesion: chi2 = 390.0595, p_val = 0.0000
Single Epithelial Cell Size: chi2 = 447.8612, p_val = 0.0000
Bare Nuclei: chi2 = 489.0095, p_val = 0.0000
Bland Chromatin: chi2 = 453.2097, p_val = 0.0000
Normal Nucleoli: chi2 = 416.6306, p_val = 0.0000
Mitoses: chi2 = 191.9682, p_val = 0.0000
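
The p-values above print as 0.0000 because the `.4f` format rounds to four decimal places; they are very small, not exactly zero. The sketch below shows how scientific notation keeps the magnitude visible. The statistic (50.0) and degrees of freedom (2) are hypothetical values chosen for illustration, not results from this dataset.

```python
from scipy.stats import chi2

# Upper-tail probability of a hypothetical Chi-square statistic of 50.0 with 2 dof.
p_val = chi2.sf(50.0, df=2)

fixed = f"{p_val:.4f}"       # four decimals: rounds to 0.0000
scientific = f"{p_val:.2e}"  # scientific notation keeps the order of magnitude

print(f"fixed-point: {fixed}")
print(f"scientific:  {scientific}")
```

Replacing `:.4f` with `:.2e` for `p_val` in the results loop would report the actual magnitudes in the same way.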


## **Interpretation of Chi-Squared Tests**

As the results above show, every *p-value* rounds to 0.0000 at four decimal places, far below the 0.05 significance level, so the null hypothesis of no association is rejected for each feature. Each analyzed feature is therefore significantly associated with the target variable (`Class`), which suggests that these features can help distinguish between benign and malignant tumors.

Moreover, the Chi-squared tests support including these features as explanatory variables in a binary classification model such as logistic regression. Because each test examines one feature at a time, the results do not show how the features behave jointly in the model.