# Breast Cancer Survival Analysis for Healthcare Insights

## About the Dataset

### Breast Cancer Dataset Description

This dataset includes information about breast cancer patients, with various clinical and demographic attributes. It has the following columns:

- `PatientID`: Unique identifier for each patient.
- `Age`: Age of the patient at the time of diagnosis.
- `CancerStage`: Stage of breast cancer at the time of diagnosis (I, II, III, IV).
- `CancerType`: Type of breast cancer (Ductal, Lobular, Mixed).
- `PrimaryTreatment`: Primary treatment method (Surgery, Chemotherapy, Radiation, Hormone Therapy).
- `SurvivalStatus`: Survival status of the patient 5 years post-diagnosis (0 = No, 1 = Yes).
- `RecurrenceStatus`: Whether the cancer recurred (0 = No, 1 = Yes).
- `GeographicRegion`: Geographic region where the patient resides (Urban, Suburban, Rural).
- `AnnualIncome`: Annual income of the patient in USD.
- `HealthInsurance`: Whether the patient has health insurance (Yes, No).
- `DiagnosisDate`: Date of cancer diagnosis.
- `TreatmentEffectiveness`: Effectiveness of the treatment on a scale of 1 to 10.

### Dataset Characteristics

- `Age`: Ranges from 25 to 85 years.
- `CancerStage`: Includes all stages from I to IV.
- `CancerType`: Categorized into Ductal, Lobular, and Mixed.
- `PrimaryTreatment`: Includes Surgery, Chemotherapy, Radiation, and Hormone Therapy.
- `SurvivalStatus`: Indicates whether the patient survived for at least 5 years post-diagnosis.
- `RecurrenceStatus`: Indicates whether the cancer recurred after initial treatment.
- `GeographicRegion`: Reflects the type of area where the patient resides (Urban, Suburban, Rural).
- `AnnualIncome`: Ranges from $20,000 to $100,000.
- `HealthInsurance`: Indicates whether the patient has health insurance.
- `DiagnosisDate`: Date when the patient was diagnosed with cancer.
- `TreatmentEffectiveness`: Scaled from 1 to 10, indicating the effectiveness of the treatment.

In [1]:
import pandas as pd

# Load the dataset
file_path = 'breast_cancer_dataset.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
df.head()


,PatientID,Age,CancerStage,CancerType,PrimaryTreatment,SurvivalStatus,RecurrenceStatus,GeographicRegion,AnnualIncome,HealthInsurance,DiagnosisDate,TreatmentEffectiveness
0,1,63,III,Ductal,Radiation,1,1,Suburban,82824,Yes,2004-01-16,7
1,2,76,III,Mixed,Surgery,0,1,Suburban,62752,Yes,2017-08-05,6
2,3,53,II,Ductal,Hormone Therapy,1,0,Urban,52327,Yes,2009-04-30,9
3,4,39,I,Mixed,Surgery,1,1,Urban,62070,Yes,2001-01-24,6
4,5,67,II,Mixed,Surgery,0,0,Urban,60197,Yes,2002-09-26,5
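
The preview shows `DiagnosisDate` as text and `CancerStage` as Roman numerals, which sort alphabetically rather than clinically unless they are typed. The following is a minimal sketch of one way to type them, assuming the dates are all in `YYYY-MM-DD` form as in the preview. The `preview` frame and `stage_order` name are illustrative and not part of the original notebook.

```python
import pandas as pd

# A few columns from the five preview rows shown above (illustrative subset).
preview = pd.DataFrame({
    "PatientID": [1, 2, 3, 4, 5],
    "CancerStage": ["III", "III", "II", "I", "II"],
    "DiagnosisDate": ["2004-01-16", "2017-08-05", "2009-04-30",
                      "2001-01-24", "2002-09-26"],
})

# Parse the date strings into datetime64 values (assumed format: YYYY-MM-DD).
preview["DiagnosisDate"] = pd.to_datetime(preview["DiagnosisDate"], format="%Y-%m-%d")

# Declare the stages as an ordered categorical so that I < II < III < IV.
stage_order = pd.CategoricalDtype(["I", "II", "III", "IV"], ordered=True)
preview["CancerStage"] = preview["CancerStage"].astype(stage_order)

print(preview.dtypes)
# Ordered categoricals support comparisons against a category label.
print((preview["CancerStage"] >= "III").sum(), "patients at stage III or later")
```

With the ordered dtype in place, sorting and comparisons follow stage severity instead of string order.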


In [3]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)


Missing values in each column:
 PatientID                 0
Age                       0
CancerStage               0
CancerType                0
PrimaryTreatment          0
SurvivalStatus            0
RecurrenceStatus          0
GeographicRegion          0
AnnualIncome              0
HealthInsurance           0
DiagnosisDate             0
TreatmentEffectiveness    0
dtype: int64
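
A zero null count does not rule out values that are present but fall outside the documented ranges or category sets. The sketch below checks the ranges and categories listed in the dataset description. The `sample` frame reuses the five preview rows, and `NUMERIC_RANGES`, `ALLOWED_VALUES`, and `find_violations` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Five rows copied from the df.head() preview; substitute the full DataFrame in practice.
sample = pd.DataFrame({
    "Age": [63, 76, 53, 39, 67],
    "CancerStage": ["III", "III", "II", "I", "II"],
    "CancerType": ["Ductal", "Mixed", "Ductal", "Mixed", "Mixed"],
    "PrimaryTreatment": ["Radiation", "Surgery", "Hormone Therapy", "Surgery", "Surgery"],
    "SurvivalStatus": [1, 0, 1, 1, 0],
    "RecurrenceStatus": [1, 1, 0, 1, 0],
    "GeographicRegion": ["Suburban", "Suburban", "Urban", "Urban", "Urban"],
    "AnnualIncome": [82824, 62752, 52327, 62070, 60197],
    "HealthInsurance": ["Yes"] * 5,
    "TreatmentEffectiveness": [7, 6, 9, 6, 5],
})

# Ranges and categories as stated in the dataset description.
NUMERIC_RANGES = {
    "Age": (25, 85),
    "AnnualIncome": (20_000, 100_000),
    "TreatmentEffectiveness": (1, 10),
}
ALLOWED_VALUES = {
    "CancerStage": {"I", "II", "III", "IV"},
    "CancerType": {"Ductal", "Lobular", "Mixed"},
    "PrimaryTreatment": {"Surgery", "Chemotherapy", "Radiation", "Hormone Therapy"},
    "SurvivalStatus": {0, 1},
    "RecurrenceStatus": {0, 1},
    "GeographicRegion": {"Urban", "Suburban", "Rural"},
    "HealthInsurance": {"Yes", "No"},
}

def find_violations(frame):
    """Return {column: count of rows outside the documented range or category set}."""
    violations = {}
    for col, (low, high) in NUMERIC_RANGES.items():
        violations[col] = int((~frame[col].between(low, high)).sum())
    for col, allowed in ALLOWED_VALUES.items():
        violations[col] = int((~frame[col].isin(allowed)).sum())
    return violations

violations = find_violations(sample)
print(violations)
```

Any nonzero count points at a column worth inspecting before outlier handling.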


In [6]:
import numpy as np

# Compute IQR-based outlier bounds: 1.5 * IQR below Q1 and above Q3
def calculate_iqr_bounds(data):
    # NumPy 1.22 renamed the `interpolation` keyword to `method`;
    # the old keyword raises a DeprecationWarning.
    Q1 = np.percentile(data, 25, method='midpoint')
    Q3 = np.percentile(data, 75, method='midpoint')
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return lower_bound, upper_bound

# Calculate the bounds for 'Age'
lower_bound_age, upper_bound_age = calculate_iqr_bounds(df['Age'])

# Calculate the bounds for 'AnnualIncome'
lower_bound_income, upper_bound_income = calculate_iqr_bounds(df['AnnualIncome'])

# Cap 'Age' values that fall outside the bounds
df['Age'] = np.where(df['Age'] < lower_bound_age, lower_bound_age, df['Age'])
df['Age'] = np.where(df['Age'] > upper_bound_age, upper_bound_age, df['Age'])

# Cap 'AnnualIncome' values that fall outside the bounds
df['AnnualIncome'] = np.where(df['AnnualIncome'] < lower_bound_income, lower_bound_income, df['AnnualIncome'])
df['AnnualIncome'] = np.where(df['AnnualIncome'] > upper_bound_income, upper_bound_income, df['AnnualIncome'])

# Verify that outliers have been handled
print("Data after handling outliers:\n", df.describe())


Data after handling outliers:
         PatientID         Age  SurvivalStatus  RecurrenceStatus  AnnualIncome  \
count  400.000000  400.000000      400.000000        400.000000     400.00000   
mean   200.500000   54.882500        0.705000          0.257500   60160.62750   
std    115.614301   17.698072        0.456614          0.437805   24165.20098   
min      1.000000   25.000000        0.000000          0.000000   20138.00000   
25%    100.750000   40.000000        0.000000          0.000000   39461.50000   
50%    200.500000   56.000000        1.000000          0.000000   59153.50000   
75%    300.250000   69.250000        1.000000          1.000000   82089.50000   
max    400.000000   84.000000        1.000000          1.000000   99674.00000   

       TreatmentEffectiveness  
count              400.000000  
mean                 5.545000  
std                  2.842212  
min                  1.000000  
25%                  3.000000  
50%                  6.000000  
75%            
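
In the summary above, `Age` and `AnnualIncome` stay within 25 to 84 and 20,138 to 99,674, so the capping step appears to have had little or nothing to cap. To show what IQR capping does when an extreme value is present, here is a small sketch on made-up numbers. The `incomes` values are hypothetical, and the sketch uses NumPy's default linear percentile interpolation rather than the midpoint rule used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, not taken from the dataset; 250_000 is a deliberate outlier.
incomes = pd.Series([30_000, 40_000, 50_000, 60_000, 70_000, 250_000], dtype="float64")

# Quartiles and the usual 1.5 * IQR fences.
q1, q3 = np.percentile(incomes, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values beyond the fences, then cap them with Series.clip.
outside = (incomes < lower) | (incomes > upper)
capped = incomes.clip(lower=lower, upper=upper)

print("values outside the fences:", int(outside.sum()))
print("max before:", incomes.max(), "max after:", capped.max())
```

Counting `outside` before capping makes it explicit how many rows the step actually changes, which a `describe()` summary alone does not show.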

In [9]:
# Save the cleaned dataset
cleaned_file_path = 'cleaned_breast_cancer_dataset.csv'
df.to_csv(cleaned_file_path, index=False)
print(f"Cleaned dataset saved to {cleaned_file_path}")


Cleaned dataset saved to cleaned_breast_cancer_dataset.csv
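
A quick round-trip read can confirm that the saved file matches what was written. The sketch below assumes the same `index=False` setting as above and writes to a temporary directory so it does not touch the real output. The `cleaned` frame is a two-row illustrative stand-in for the cleaned DataFrame.

```python
import os
import tempfile

import pandas as pd

# Two illustrative rows standing in for the cleaned DataFrame.
cleaned = pd.DataFrame({
    "PatientID": [1, 2],
    "Age": [63.0, 76.0],
    "AnnualIncome": [82824.0, 62752.0],
    "DiagnosisDate": ["2004-01-16", "2017-08-05"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_breast_cancer_dataset.csv")
    cleaned.to_csv(path, index=False)
    # parse_dates restores DiagnosisDate as datetime64 instead of plain strings.
    reloaded = pd.read_csv(path, parse_dates=["DiagnosisDate"])

print(reloaded.dtypes)
print("same shape:", reloaded.shape == cleaned.shape)
```

Because CSV stores no type information, date columns come back as strings unless `parse_dates` (or a later `pd.to_datetime` call) is used.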
