**Dataset:** Breast Cancer Dataset

**Aim:** To investigate whether there is a statistically significant difference in post-surgery survival rates between patients with 2nd stage cancer and patients with 3rd stage cancer.

To answer this question, we'll perform a hypothesis test.

### **Task 1. Imports and Data Loading**
Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.


In [32]:
import pandas as pd
from scipy.stats import chi2_contingency
import numpy as np

In [19]:
data = pd.read_csv('Breast_cancer_data.csv')

### **Task 2. Data exploration**

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).



In [20]:
data.head()

| | Patient_ID | Age | Gender | Protein1 | Protein2 | Protein3 | Protein4 | Tumour_Stage | Histology | ER status | PR status | HER2 status | Surgery_type | Date_of_Surgery | Date_of_Last_Visit | Patient_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCGA-D8-A1XD | 36 | FEMALE | 0.080353 | 0.42638 | 0.54715 | 0.27368 | III | Infiltrating Ductal Carcinoma | Positive | Positive | Negative | Modified Radical Mastectomy | 15-Jan-17 | 19-Jun-17 | Alive |
| 1 | TCGA-EW-A1OX | 43 | FEMALE | -0.42032 | 0.57807 | 0.61447 | -0.031505 | II | Mucinous Carcinoma | Positive | Positive | Negative | Lumpectomy | 26-Apr-17 | 09-Nov-18 | Dead |
| 2 | TCGA-A8-A079 | 69 | FEMALE | 0.21398 | 1.3114 | -0.32747 | -0.23426 | III | Infiltrating Ductal Carcinoma | Positive | Positive | Negative | Other | 08-Sep-17 | 09-Jun-18 | Alive |
| 3 | TCGA-D8-A1XR | 56 | FEMALE | 0.34509 | -0.21147 | -0.19304 | 0.12427 | II | Infiltrating Ductal Carcinoma | Positive | Positive | Negative | Modified Radical Mastectomy | 25-Jan-17 | 12-Jul-17 | Alive |
| 4 | TCGA-BH-A0BF | 56 | FEMALE | 0.22155 | 1.9068 | 0.52045 | -0.31199 | II | Infiltrating Ductal Carcinoma | Positive | Positive | Negative | Other | 06-May-17 | 27-Jun-19 | Dead |


In [21]:
data.describe()

| | Age | Protein1 | Protein2 | Protein3 | Protein4 |
|---|---|---|---|---|---|
| count | 334.0 | 334.0 | 334.0 | 334.0 | 334.0 |
| mean | 58.886228 | -0.029991 | 0.946896 | -0.090204 | 0.009819 |
| std | 12.961212 | 0.563588 | 0.911637 | 0.585175 | 0.629055 |
| min | 29.0 | -2.3409 | -0.97873 | -1.6274 | -2.0255 |
| 25% | 49.0 | -0.358888 | 0.362173 | -0.513748 | -0.37709 |
| 50% | 58.0 | 0.006129 | 0.992805 | -0.17318 | 0.041768 |
| 75% | 68.0 | 0.343598 | 1.6279 | 0.278353 | 0.42563 |
| max | 90.0 | 1.5936 | 3.4022 | 2.1934 | 1.6299 |


Check for and handle missing values.

In [23]:
data.shape

(334, 16)

In [24]:
data = data.dropna(axis = 0)
data.shape

(317, 16)
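Before dropping rows, it can help to see which columns actually contain the missing values. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset, and restricting `dropna` to a `subset` is an optional variation on the cell above rather than what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "Tumour_Stage": ["II", "III", "II", "I"],
    "Patient_Status": ["Alive", np.nan, "Dead", "Alive"],
    "Date_of_Last_Visit": ["19-Jun-17", np.nan, "09-Nov-18", "12-Jul-17"],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Optionally drop only the rows missing a field that the analysis needs
cleaned = toy.dropna(subset=["Tumour_Stage", "Patient_Status"])
print(cleaned.shape)
```

Running `data.isna().sum()` on the real frame would show where the 17 dropped rows come from.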

After cleaning the data, we count the number of patients with each Patient_Status within each Tumour_Stage.

In [29]:
data.groupby('Tumour_Stage')['Patient_Status'].value_counts().unstack(fill_value=0)

| Tumour_Stage | Alive | Dead |
|---|---|---|
| I | 51 | 9 |
| II | 144 | 36 |
| III | 60 | 17 |


In [34]:
#Alive and Dead status of 2nd stage cancer patients
stage_2 = data[data['Tumour_Stage']=='II']['Patient_Status'].value_counts()
stage_2

Alive    144
Dead      36
Name: Patient_Status, dtype: int64

In [27]:
#Alive and Dead status of 3rd stage cancer patients
stage_3 = data[data['Tumour_Stage']=='III']['Patient_Status'].value_counts()
stage_3

Alive    60
Dead     17
Name: Patient_Status, dtype: int64

Based on these counts, the post-surgery survival rate for patients with 2nd stage cancer is 80% (144 of 180), while the survival rate for patients with 3rd stage cancer is approximately 77.92% (60 of 77).

In this sample, patients with 3rd stage cancer have a slightly higher post-surgery mortality rate than those with 2nd stage cancer (about 22.08% versus 20%). The gap is small and could be the result of sampling variation, so a statistical test is needed to judge whether the difference is significant.
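The percentages quoted above follow directly from the counts. A short sketch of the arithmetic, with the counts copied by hand from the `value_counts()` outputs (the `counts` and `survival_rate` names are introduced here for illustration):

```python
# Counts copied from the value_counts() outputs for stages II and III
counts = {
    "II": {"Alive": 144, "Dead": 36},
    "III": {"Alive": 60, "Dead": 17},
}

# Survival rate = patients alive / all patients in that stage
survival_rate = {
    stage: c["Alive"] / (c["Alive"] + c["Dead"])
    for stage, c in counts.items()
}

for stage, rate in survival_rate.items():
    print(f"Stage {stage}: {rate:.2%} alive, {1 - rate:.2%} dead")
```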

### **Task 3. Hypothesis testing**
1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis

**Null Hypothesis**: The post-surgery mortality rate is the same for individuals with 2nd and 3rd stage cancer; there is no difference in the proportions of individuals dying in these two cancer stages.


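**Alternative Hypothesis**: The post-surgery mortality rate differs between 2nd stage and 3rd stage cancer patients. (The Chi-Square test of independence is non-directional: it detects a difference in either direction, which matches the stated aim.)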

We choose 5% as the significance level and proceed with a Chi-Square test of independence.

Assumptions for Chi-Square test

1.   Both variables are categorical.
2.   All observations are independent.
3.   Cells in the contingency table are mutually exclusive.
4.   Expected value of cells should be 5 or greater in at least 80% of cells.
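The fourth assumption can be checked numerically before running the test. A minimal sketch, assuming the stage II and stage III counts from the exploration step are entered by hand; `expected_freq` from `scipy.stats.contingency` returns the cell counts expected under independence:

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Rows: stage II, stage III; columns: Alive, Dead (counts from the EDA step)
observed = np.array([[144, 36],
                     [60, 17]])

# Expected counts if survival were independent of tumour stage
expected = expected_freq(observed)
print(expected.round(2))

# Share of cells with an expected count of at least 5
share_ok = (expected >= 5).mean()
print(f"Cells with expected count >= 5: {share_ok:.0%}")
```

Here every expected count is well above 5, so the rule of thumb is satisfied for this table.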





In [37]:
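# Build the 2x2 contingency table: rows are stages II and III, columns are Alive and Dead.
# Label-based lookup avoids relying on the order returned by value_counts().
observed_data = np.array([[stage_2['Alive'], stage_2['Dead']],
                          [stage_3['Alive'], stage_3['Dead']]])
# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(observed_data)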

In [38]:
p_value

0.8345435107410862

With a p-value of about 0.83, well above the chosen 5% significance level, we fail to reject the null hypothesis. The data do not provide sufficient evidence of a difference in post-surgery mortality rates between patients with 2nd stage and 3rd stage cancer.
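To make the decision step concrete, the sketch below recomputes the statistic by hand using only the standard library. It assumes the same 2x2 counts and includes Yates' continuity correction, which `chi2_contingency` applies by default to 2x2 tables. With one degree of freedom, the Chi-Square tail probability equals `erfc(sqrt(x / 2))`. This is an illustrative cross-check, not a replacement for the SciPy call.

```python
import math

# Rows: stage II, stage III; columns: Alive, Dead
observed = [[144, 36], [60, 17]]
alpha = 0.05  # chosen significance level

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / total
        # Yates' correction: shrink |observed - expected| by 0.5, never below 0
        diff = max(abs(obs - exp) - 0.5, 0.0)
        chi2 += diff ** 2 / exp

# Tail probability of a Chi-Square distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))

reject_null = p_value < alpha
print(f"chi2 = {chi2:.4f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

The hand-computed p-value agrees with the value reported by `chi2_contingency`, and the comparison against `alpha` gives the same decision.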