# Introduction
This notebook explores a dataset on breast cancer survival. The goal is to summarize the data with descriptive statistics and get a first view of the clinical and demographic factors recorded alongside survival.

# Objective
Our primary objective is to use descriptive statistics to summarize the data and understand its behavior, as groundwork for later analyses such as predictive modeling or hypothesis testing.

# Data Description
The dataset is stored in a .sav file, the native data format of SPSS, and contains a mix of clinical and demographic variables for breast cancer patients.

In [1]:
!pip install pyreadstat

Collecting pyreadstat
  Downloading pyreadstat-1.2.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.0 kB)
Downloading pyreadstat-1.2.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
Installing collected packages: pyreadstat
Successfully installed pyreadstat-1.2.7


# Load the Dataset
Now, we will load the dataset from the SPSS file and inspect the initial few entries to understand the structure.

In [2]:
# Import necessary libraries
import pandas as pd
import pyreadstat  # To read SPSS files
import numpy as np

In [3]:
# Load the data
df, meta = pyreadstat.read_sav('/kaggle/input/breast-cancer-dataset/Breast cancer survival.sav')

# Display the first few rows to understand the data
print(df.head())

# Check metadata for any additional insights into variable labels
print(meta.variable_value_labels)

    id   age  pathsize  lnpos  histgrad   er   pr  status       time
0  1.0  60.0       NaN    0.0       3.0  0.0  0.0     0.0   9.466667
1  2.0  79.0       NaN    0.0       NaN  NaN  NaN     0.0   8.600000
2  3.0  82.0       NaN    0.0       2.0  NaN  NaN     0.0  19.333333
3  4.0  66.0       NaN    0.0       2.0  1.0  1.0     0.0  16.333333
4  5.0  52.0       NaN    0.0       3.0  NaN  NaN     0.0   8.500000
{'histgrad': {4.0: 'Unknown'}, 'er': {0.0: 'Negative', 1.0: 'Positive', 2.0: 'Unknown'}, 'pr': {0.0: 'Negative', 1.0: 'Positive', 2.0: 'Unknown'}, 'status': {0.0: 'Censored', 1.0: 'Died'}}
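
The value labels printed above can be attached to the coded columns to make later tables easier to read. The following is a minimal sketch, not part of the original notebook: the `toy` frame is a hypothetical stand-in for `df`, and the `_label` column names are our own choice. The label dictionaries are copied from the metadata output.

```python
import numpy as np
import pandas as pd

# Value labels copied from the metadata printed above.
value_labels = {
    "er": {0.0: "Negative", 1.0: "Positive", 2.0: "Unknown"},
    "status": {0.0: "Censored", 1.0: "Died"},
}

# Toy stand-in for df (illustrative rows, not the real data).
toy = pd.DataFrame({
    "er": [0.0, np.nan, 1.0],
    "status": [0.0, 0.0, 1.0],
})

# Map each coded column to its label; NaN and unlabelled codes stay NaN.
labelled = toy.copy()
for col, mapping in value_labels.items():
    labelled[col + "_label"] = toy[col].map(mapping)

print(labelled)
```

On the real data, the same loop could iterate over `meta.variable_value_labels.items()` instead of the hand-copied dictionary.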


In [4]:
df.describe()

               id        age  pathsize     lnpos  histgrad        er        pr    status       time
count      1207.0     1207.0    1121.0    1207.0     920.0     869.0     851.0    1207.0     1207.0
mean    621.07208  56.387738  1.733488  0.880696  2.269565  0.611047  0.542891  0.059652  46.956476
std    359.623207  13.327627  0.995857  2.535457  0.607487  0.487793   0.49845  0.236939  29.638977
min           1.0       22.0       0.1       0.0       1.0       0.0       0.0       0.0   2.633333
25%         310.5       46.0       1.0       0.0       2.0       0.0       0.0       0.0      22.55
50%         619.0       56.0       1.5       0.0       2.0       1.0       1.0       0.0  42.966667
75%         931.5       66.5       2.2       0.0       3.0       1.0       1.0       0.0  65.583333
max        1266.0       88.0       7.0      35.0       3.0       1.0       1.0       1.0      133.8


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1207 entries, 0 to 1206
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        1207 non-null   float64
 1   age       1207 non-null   float64
 2   pathsize  1121 non-null   float64
 3   lnpos     1207 non-null   float64
 4   histgrad  920 non-null    float64
 5   er        869 non-null    float64
 6   pr        851 non-null    float64
 7   status    1207 non-null   float64
 8   time      1207 non-null   float64
dtypes: float64(9)
memory usage: 85.0 KB
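
The non-null counts above imply that several columns have missing values. As a rough sketch, the counts from the `df.info()` output can be turned into a missing-value summary. The numbers are copied from that output; the `summary` layout is our own illustration.

```python
import pandas as pd

# Non-null counts copied from the df.info() output above.
n_rows = 1207
non_null = pd.Series({
    "id": 1207, "age": 1207, "pathsize": 1121, "lnpos": 1207,
    "histgrad": 920, "er": 869, "pr": 851, "status": 1207, "time": 1207,
})

# Missing count and share of rows per column, largest first.
missing = n_rows - non_null
summary = pd.DataFrame({
    "missing": missing,
    "missing_pct": (missing / n_rows * 100).round(1),
}).sort_values("missing", ascending=False)

print(summary)
```

On the loaded frame itself, `df.isna().sum()` and `df.isna().mean()` should give the same counts and shares directly.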


# Identifying Categorical Variables
Based on the data inspection and the value labels in the metadata, four columns are categorical: 'histgrad', 'er', 'pr', and 'status'. They are stored as float64 codes, so we convert them to strings so that pandas treats them as categories rather than as numbers.

In [6]:
# List of categorical columns
categorical_columns = ['histgrad', 'er', 'pr', 'status']

# Convert each categorical column to string type. astype(str) turns missing
# values into the literal string 'nan', so we map those back to np.nan.
for col in categorical_columns:
    df[col] = df[col].astype(str).replace('nan', np.nan)

# Restoring np.nan ensures that missing values are treated as missing data
# in later analyses rather than being counted as a category of their own.
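
An alternative to string conversion is the pandas `category` dtype, which keeps missing values as NaN without the string round trip. The sketch below is not what this notebook does; it uses a hypothetical `toy` frame and assumes that histological grade is ordered (1 < 2 < 3).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the histgrad column (illustrative values only).
toy = pd.DataFrame({"histgrad": [3.0, np.nan, 2.0, 2.0, 1.0]})

# Ordered categorical, assuming grades rank as 1 < 2 < 3; NaN stays missing.
grade_type = pd.CategoricalDtype(categories=[1.0, 2.0, 3.0], ordered=True)
toy["histgrad_cat"] = toy["histgrad"].astype(grade_type)

print(toy["histgrad_cat"].isna().sum())               # number of missing grades
print(toy["histgrad_cat"].value_counts(sort=False))  # counts in grade order
```

An ordered categorical also allows comparisons such as `toy["histgrad_cat"] >= 2.0`, which plain strings do not support in a meaningful way.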

In [7]:
# Display the updated data types to confirm the changes
print(df.dtypes)

id          float64
age         float64
pathsize    float64
lnpos       float64
histgrad     object
er           object
pr           object
status       object
time        float64
dtype: object


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1207 entries, 0 to 1206
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        1207 non-null   float64
 1   age       1207 non-null   float64
 2   pathsize  1121 non-null   float64
 3   lnpos     1207 non-null   float64
 4   histgrad  920 non-null    object 
 5   er        869 non-null    object 
 6   pr        851 non-null    object 
 7   status    1207 non-null   object 
 8   time      1207 non-null   float64
dtypes: float64(5), object(4)
memory usage: 85.0+ KB


In [9]:
categorical_columns = ['histgrad', 'er', 'pr', 'status']
continuous_columns = ['age', 'pathsize', 'lnpos', 'time']
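
The continuous columns can be summarized with robust statistics as well as the mean and standard deviation. The following is a small sketch on a hypothetical `toy` frame with illustrative values, not the real data. The choice of median, IQR, and skewness is ours.

```python
import numpy as np
import pandas as pd

continuous_columns = ["age", "pathsize", "lnpos", "time"]

# Toy stand-in for df; all values are illustrative only.
toy = pd.DataFrame({
    "age": [60.0, 79.0, 82.0, 66.0, 52.0],
    "pathsize": [1.0, 1.5, np.nan, 2.2, 0.8],
    "lnpos": [0.0, 0.0, 0.0, 2.0, 10.0],
    "time": [9.47, 8.60, 19.33, 16.33, 8.50],
})

# One row per variable: centre, spread, and asymmetry (NaN is skipped).
summary = toy[continuous_columns].agg(["median", "mean", "std", "skew"]).T

# Interquartile range as a spread measure that is less sensitive to outliers.
quartiles = toy[continuous_columns].quantile([0.25, 0.75])
summary["iqr"] = quartiles.loc[0.75] - quartiles.loc[0.25]

print(summary.round(2))
```

For a count variable like `lnpos`, where most values are zero and a few are large, the median and IQR tend to describe the typical patient better than the mean.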

In [10]:
# Generate a frequency table and find the mode for each categorical column.
# value_counts() and mode() both skip NaN by default.
for col in categorical_columns:
    frequency_table = df[col].value_counts()
    mode_value = df[col].mode()[0]
    print(f"Frequency table for {col}:\n{frequency_table}\n")
    print(f"Mode for {col}: {mode_value}\n")
    print('**'*30)

Frequency table for histgrad:
histgrad
2.0    514
3.0    327
1.0     79
Name: count, dtype: int64

Mode for histgrad: 2.0

************************************************************
Frequency table for er:
er
1.0    531
0.0    338
Name: count, dtype: int64

Mode for er: 1.0

************************************************************
Frequency table for pr:
pr
1.0    462
0.0    389
Name: count, dtype: int64

Mode for pr: 1.0

************************************************************
Frequency table for status:
status
0.0    1135
1.0      72
Name: count, dtype: int64

Mode for status: 0.0

************************************************************
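
Raw counts are easier to compare across columns when shown as percentages. As a sketch, the `status` column is rebuilt here from the counts printed above (1135 censored, 72 died), and `value_counts(normalize=True)` gives the relative frequencies. The `table` layout is our own illustration.

```python
import pandas as pd

# Rebuild the status column from the counts in the frequency table above.
status = pd.Series(["0.0"] * 1135 + ["1.0"] * 72, name="status")

# Absolute counts next to percentages of all rows.
counts = status.value_counts()
percent = (status.value_counts(normalize=True) * 100).round(2)
table = pd.DataFrame({"count": counts, "percent": percent})

print(table)
```

About 6% of patients have the status 'Died', which matches the mean of 0.059652 reported for `status` by `df.describe()`. For columns with missing values, such as `er` and `pr`, passing `dropna=False` to `value_counts` would include the missing share in the table.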
