## In this notebook, we will perform statistical tests on a preprocessed breast cancer dataset

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import scipy.stats as stats

In [2]:
# Load the dataset
file_path = '/kaggle/input/breastcancer-preprocessed/BreastCancer-Preprocessed.csv'
df = pd.read_csv(file_path)

## Display the first few rows of the dataset to understand its structure

In [3]:
df.head()

Unnamed: 0,id,age,pathsize,lnpos,histgrad,er,pr,status,time,lnpos_YN,pathsize_Cat
0,1.0,60.0,,0.0,3.0,0.0,0.0,0.0,9.466667,No,
1,2.0,79.0,,0.0,,,,0.0,8.6,No,
2,3.0,82.0,,0.0,2.0,,,0.0,19.333333,No,
3,4.0,66.0,,0.0,2.0,1.0,1.0,0.0,16.333333,No,
4,5.0,52.0,,0.0,3.0,,,0.0,8.5,No,


## Understanding the Variables

`lnpos_YN` and `status` are the two categorical variables that we want to test for independence.
Let's look at the unique values in these columns.

In [4]:
# Check the unique values in 'lnpos_YN' and 'status'
print("Unique values in 'lnpos_YN':", df['lnpos_YN'].unique())
print("Unique values in 'status':", df['status'].unique())


Unique values in 'lnpos_YN': ['No' 'Yes']
Unique values in 'status': [0. 1.]


## Creating a Contingency Table

We will create a contingency table to summarize the relationship between the two categorical variables.

In [5]:
# Create a contingency table
contingency_table = pd.crosstab(df['lnpos_YN'], df['status'])
print(contingency_table)


status    0.0  1.0
lnpos_YN          
No        887   42
Yes       248   30
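
The raw counts are easier to compare once they are expressed as proportions within each row. The following is a minimal sketch that hard-codes the four counts printed above instead of reading the CSV; the names `counts` and `rates` are illustrative and not part of the original notebook.

```python
# Counts copied from the contingency table above: {lnpos_YN: (status 0, status 1)}
counts = {"No": (887, 42), "Yes": (248, 30)}

# Share of status == 1 within each lnpos_YN group
rates = {}
for group, (n_status0, n_status1) in counts.items():
    rates[group] = n_status1 / (n_status0 + n_status1)
    print(f"lnpos_YN = {group}: {rates[group]:.1%} with status 1")
```

With the DataFrame loaded, `pd.crosstab(df['lnpos_YN'], df['status'], normalize='index')` should give the same row-wise proportions.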


## Performing the Chi-Square Test for Independence

The Chi-Square Test for Independence will help us determine whether there is a significant association between `lnpos_YN` and `status`.


In [6]:
# Perform the Chi-Square Test for Independence
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)

print("Chi-Square Statistic:", chi2)
print("***********************************")
print("p-value:", p)
print("***********************************")
print("Degrees of Freedom:", dof)
print("***********************************")
print("Expected Frequencies Table:\n", expected)

Chi-Square Statistic: 13.900758213995182
***********************************
p-value: 0.00019272070767964805
***********************************
Degrees of Freedom: 1
***********************************
Expected Frequencies Table:
 [[873.58326429  55.41673571]
 [261.41673571  16.58326429]]


## Interpreting the Results

- **Chi-Square Statistic**: A measure of how far the observed counts deviate from the expected counts. Because this is a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, so 13.90 is the corrected statistic.
- **p-value**: The probability of obtaining a Chi-Square statistic at least as extreme as the one computed, assuming that the variables are independent.
- **Degrees of Freedom**: (rows - 1) * (columns - 1), which is 1 for a 2x2 table.
- **Expected Frequencies Table**: The counts we would expect in each cell if the variables were independent.

If the p-value is less than our significance level (typically 0.05), we reject the null hypothesis of independence. Here the p-value is about 0.00019, well below 0.05, so we conclude that there is a statistically significant association between `lnpos_YN` and `status`.
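
To see where these numbers come from, the statistic can be recomputed by hand from the observed counts. This is a sketch in plain Python, not the notebook's own method: it assumes a 2x2 table with 1 degree of freedom, applies Yates' continuity correction (the default of `chi2_contingency` for such tables), and uses the identity that the chi-square tail probability for 1 degree of freedom equals `erfc(sqrt(x / 2))`. The variable names are illustrative.

```python
import math

# Observed counts from the contingency table (rows: lnpos_YN No/Yes, cols: status 0/1)
observed = [[887, 42], [248, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected_counts = [[r * c / n for c in col_totals] for r in row_totals]

# Yates' continuity correction: shrink each |observed - expected| by 0.5
# (valid here because every |observed - expected| is larger than 0.5)
chi2_yates = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected_counts)
    for o, e in zip(obs_row, exp_row)
)

# For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2_yates / 2))

print(f"chi2 = {chi2_yates:.4f}, p = {p_value:.6f}")
```

The result should agree with the Chi-Square statistic and p-value printed by `stats.chi2_contingency` above.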


## Calculate the residuals

In [7]:
# Calculate the raw residuals (observed minus expected counts)
residuals = contingency_table - expected
print("Residuals:\n", residuals)

Residuals:
 status          0.0        1.0
lnpos_YN                      
No        13.416736 -13.416736
Yes      -13.416736  13.416736


## Interpreting the Residuals

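Residuals show the difference between observed and expected frequencies. The values above are raw differences in counts; they are not standardized. In a 2x2 table all four raw residuals have the same magnitude (here 13.42), so they show the direction of the deviation (more `status` = 1 cases than expected when `lnpos_YN` is `Yes`) but not which cell contributes most to the Chi-Square statistic. Dividing each raw residual by the square root of its expected count gives Pearson residuals, and large Pearson residuals mark the cells that contribute most.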
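
Standardized (Pearson) residuals divide each raw difference by the square root of the expected count. A minimal sketch follows, with the observed and expected counts copied from the outputs above rather than taken from the DataFrame; the names `pearson` and `chi2_uncorrected` are illustrative.

```python
import math

# Observed counts (rows: lnpos_YN No/Yes, cols: status 0/1)
observed = [[887, 42], [248, 30]]
# Expected counts copied from the chi2_contingency output above
expected_counts = [[873.58326429, 55.41673571], [261.41673571, 16.58326429]]

# Pearson residual for each cell: (observed - expected) / sqrt(expected)
pearson = [
    [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected_counts)
]

for label, row in zip(["No", "Yes"], pearson):
    print(label, [round(r, 2) for r in row])

# Squared Pearson residuals sum to the chi-square statistic
# computed without the continuity correction
chi2_uncorrected = sum(r ** 2 for row in pearson for r in row)
print(f"Uncorrected chi-square: {chi2_uncorrected:.2f}")
```

The cell with `lnpos_YN` = `Yes` and `status` = 1 should show the largest residual, because its expected count is the smallest. The sum of squares is the statistic without Yates' correction, so it is somewhat larger than the corrected value reported by `chi2_contingency`.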


## Calculating the Odds Ratio

The odds ratio will provide a measure of association between the two categorical variables. Note that the odds ratio is only appropriate for 2x2 tables.


In [8]:
# Check if the table is 2x2
if contingency_table.shape == (2, 2):
    # Extract values from the contingency table
    a = contingency_table.iloc[0, 0]
    b = contingency_table.iloc[0, 1]
    c = contingency_table.iloc[1, 0]
    d = contingency_table.iloc[1, 1]
    
    # Calculate the Odds Ratio
    odds_ratio = (a / b) / (c / d)
    print(f"Odds Ratio: {odds_ratio:.2f}")
else:
    print("Odds Ratio calculation is only applicable for 2x2 tables.")


Odds Ratio: 2.55
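
A point estimate is easier to judge with an interval around it. The sketch below adds an approximate 95% confidence interval using the log (Woolf) method, which is an addition for illustration and not part of the original analysis. It assumes all four cells are non-zero and hard-codes the counts from the contingency table.

```python
import math

# Cell counts from the contingency table above
a, b = 887, 42   # lnpos_YN = No:  status 0, status 1
c, d = 248, 30   # lnpos_YN = Yes: status 0, status 1

# Cross-product form, algebraically the same as (a / b) / (c / d)
odds_ratio = (a * d) / (b * c)

# Standard error of log(OR) under the Woolf (log) approximation
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

z = 1.96  # approximate 97.5th percentile of the standard normal
lower = math.exp(math.log(odds_ratio) - z * se_log_or)
upper = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"Odds Ratio: {odds_ratio:.2f}, 95% CI: ({lower:.2f}, {upper:.2f})")
```

If the interval excludes 1, that is consistent with the association found by the Chi-Square test.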


## Interpreting the Results

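- **Residuals**: Raw differences between observed and expected frequencies (not standardized).
- **Odds Ratio**: Measure of association between the two variables, applicable for 2x2 tables. The value of 2.55 means the odds of `status` = 1 are about 2.55 times as high when `lnpos_YN` is `Yes` as when it is `No`.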



## The Fisher Exact Test 

In [9]:
# Fisher's exact test is typically used when sample sizes are small, especially
# when any expected cell frequency in the contingency table is less than 5.
# Check whether any expected frequency is less than 5.

if (expected < 5).any():
    print("At least one expected frequency is less than 5. Performing Fisher Exact Test.")
    
    # Fisher Exact Test can only be performed on 2x2 tables
    if contingency_table.shape == (2, 2):
        oddsratio, p_fisher = stats.fisher_exact(contingency_table)
        print("Fisher Exact Test p-value:", p_fisher)
    else:
        print("Fisher Exact Test is not applicable for tables larger than 2x2.")
else:
    print("All expected frequencies are 5 or greater. Performing Chi-Square Test for Independence.")


All expected frequencies are 5 or greater. Performing Chi-Square Test for Independence.
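
The check above skips Fisher's test because every expected count is at least 5. To illustrate what the exact test computes, here is a sketch in plain Python that sums hypergeometric probabilities with the table margins held fixed. It is a simplified illustration using the counts from the contingency table; in practice `stats.fisher_exact(contingency_table)`, as called in the cell above, is the function to use. The helper `table_probability` is hypothetical.

```python
from math import comb

# Counts from the contingency table above
a, b = 887, 42   # lnpos_YN = No:  status 0, status 1
c, d = 248, 30   # lnpos_YN = Yes: status 0, status 1

row1, row2 = a + b, c + d   # row totals
col1 = a + c                # total number of status 0
n = row1 + row2             # grand total

def table_probability(x):
    """Hypergeometric probability that the top-left cell equals x, margins fixed."""
    return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

# Range of top-left values compatible with the fixed margins
x_min = max(0, col1 - row2)
x_max = min(row1, col1)

all_probs = [table_probability(x) for x in range(x_min, x_max + 1)]
p_observed = table_probability(a)

# Two-sided p-value: total probability of tables no more likely than the observed one
# (the small relative tolerance guards against floating-point ties)
p_two_sided = sum(p for p in all_probs if p <= p_observed * (1 + 1e-7))

print(f"Fisher exact two-sided p-value: {p_two_sided:.6f}")
```

With a sample this large, the exact p-value is expected to lead to the same conclusion as the Chi-Square test.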


**In future versions, I will perform additional statistical tests on this dataset**
> Stay Tuned 🌀