# **CLINICAL SURVIVAL ANALYSIS**

# Clinical Survival Analysis – Overview & Data Inspection

## Objectives

- Introduce the clinical survival analysis project and dataset from Kaggle (https://www.kaggle.com/datasets/imtkaggleteam/clinical-dataset) 
- Perform initial loading and inspection of the discovery and validation cohorts  
- Identify the structure, column types, and potential data issues

### Core Statistical Concepts

A few core statistical ideas guide this analysis and help turn raw data into meaningful insights. The mean (average) and median (middle value) summarise what is typical in the dataset, while the standard deviation measures how spread out the values are. Hypothesis testing checks whether differences observed between groups are likely to be real or could plausibly have arisen by chance. Basic probability helps to interpret the results in terms of uncertainty: for example, a confidence interval gives a range of plausible values for the true quantity. Together, these concepts provide a foundation for making sense of the data and supporting well-reasoned conclusions.
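
As a rough illustration of these measures, the sketch below applies them to the five `Time` values visible in the discovery cohort preview further down. It uses only the Python standard library. The helper `bootstrap_mean_ci` is a hypothetical name introduced here, and a percentile bootstrap is just one of several ways to build a confidence interval. Five rows are far too few for real inference, so the numbers are purely illustrative.

```python
import random
import statistics

# Time (days) for the first five discovery patients shown in the head() output
times = [2536, 514, 2153, 1172, 61]

mean_time = statistics.mean(times)      # 1287.2
median_time = statistics.median(times)  # 1172
sd_time = statistics.stdev(times)       # sample standard deviation (n - 1 denominator)


def bootstrap_mean_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


ci_low, ci_high = bootstrap_mean_ci(times)
print(f"mean={mean_time:.1f}, median={median_time}, sd={sd_time:.1f}")
print(f"95% bootstrap interval for the mean: {ci_low:.1f} to {ci_high:.1f}")
```

The mean sits above the median here because a few long survival times pull it upwards, which is one reason to report both. All five of these patients have `Event = 1`; once censored patients are included, a plain average of follow-up times no longer estimates survival time properly.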


## Inputs

- `data/clinical_data_discovery_cohort.csv`  
- `data/clinical_data_validation_cohort.xlsx`

## Outputs

- Initial observations documented in Markdown  
- Dataset structure (rows, columns, types)  
- Missing value summary

## Additional Comments

* The working directory was fixed based on Copilot's suggestions
* As one of the files in the dataset is an Excel file, Copilot also assisted in the installation of the openpyxl package to read that file.

### Initial Project Wireframe

The diagram below outlines a proposed design for the final interactive dashboard to be developed in Tableau.  
It highlights the key metrics, charts, and visual components that will likely be included, providing a blueprint for how results from the analysis can be communicated to end-users.  

This wireframe serves as a visual reference for the intended layout — from high-level KPIs (e.g., average survival days, mortality rate) to detailed visualisations (e.g., survival time by stage, mutation status vs mortality rate, pack-years vs survival, and age distribution).  
The "Key Takeaways" section at the bottom will summarise the most important findings for quick interpretation.


![Clinical Survival Analysis Wireframe](images/clinical_survival_wireframe.png)


---

# Change working directory

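The notebook runs from the `jupyter_notebooks` folder, while the data paths used below (for example `data/clinical_data_discovery_cohort.csv`) are relative to the project root. We therefore change the working directory from the current folder to its parent folder.
* We access the current directory with `os.getcwd()`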

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\petal\\Downloads\\CI-DBC\\vscode-projects\\clinical-survival-analysis\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\petal\\Downloads\\CI-DBC\\vscode-projects\\clinical-survival-analysis'
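
Re-running the cells above after the change moves the working directory up one more level, because `current_dir` is reassigned to the new location. A possible alternative, sketched below, searches upwards for the folder that contains the `data` directory, so it can be run repeatedly without side effects. The helper name `find_project_root` and the choice of `data` as the marker are assumptions made for this illustration, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found.

    Hypothetical helper: assumes the project root is the nearest ancestor
    (or `start` itself) that holds a `data` folder.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")
```

In the notebook this would be used as `os.chdir(find_project_root(Path.cwd()))`, which gives the same result whether the notebook starts in `jupyter_notebooks` or already in the project root.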

# Section 1

# Import Libraries

In [4]:
# Step 0: Import core libraries
import pandas as pd

# Optional: Display all columns in wide datasets
pd.set_option('display.max_columns', None)


# Load Datasets

In [9]:
# Step 1: Load discovery cohort (CSV)
discovery = pd.read_csv("data/clinical_data_discovery_cohort.csv")

# Step 2: Load validation cohort (Excel)
validation = pd.read_excel("data/clinical_data_validation_cohort.xlsx")

# Step 3: Confirm load
print("Discovery shape:", discovery.shape)
print("Validation shape:", validation.shape)

display(discovery.head())
display(validation.head())

Discovery shape: (30, 10)
Validation shape: (95, 14)


| | PatientID | Specimen date | Dead or Alive | Date of Death | Date of Last Follow Up | sex | race | Stage | Event | Time |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3/17/2003 | Dead | 2/24/2010 | 2/24/2010 | F | B | pT2N2MX | 1 | 2536 |
| 1 | 2 | 6/17/2003 | Dead | 11/12/2004 | 11/12/2004 | M | W | T2N2MX | 1 | 514 |
| 2 | 3 | 9/9/2003 | Dead | 8/1/2009 | 8/1/2009 | F | B | T2N1MX | 1 | 2153 |
| 3 | 4 | 10/14/2003 | Dead | 12/29/2006 | 12/29/2006 | M | W | pT2NOMX | 1 | 1172 |
| 4 | 5 | 12/1/2003 | Dead | 1/31/2004 | 1/31/2004 | F | W | T2NOMX | 1 | 61 |


| | Patient ID | Survival time (days) | Event (death: 1, alive: 0) | Tumor size (cm) | Grade | Stage (TNM 8th edition) | Age | Sex | Cigarette | Pack per year | Type.Adjuvant | batch | EGFR | KRAS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P109342 | 2329 | 1 | 2.6 | 3 | IB | 67 | Male | Former | 50.0 | | 1 | | |
| 1 | P124450 | 2532 | 0 | 3.5 | 3 | IB | 68 | Female | Former | 52.5 | | 1 | Negative | Negative |
| 2 | P131833 | 2271 | 0 | 2.0 | 2 | IA2 | 80 | Female | Never | 0.0 | | 1 | Negative | Negative |
| 3 | P131888 | 2193 | 0 | 3.0 | 2 | IA3 | 63 | Male | Former | 47.0 | | 1 | Negative | G12C |
| 4 | P131946 | 2387 | 0 | 4.0 | 2 | IIIA | 88 | Female | Never | 0.0 | | 1 | Negative | Negative |


Both the discovery and validation datasets were successfully loaded. 
The validation dataset required the openpyxl package for Excel reading. The package was installed based on Copilot's suggestions.
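
Since `pd.read_excel` relies on openpyxl for `.xlsx` files, a notebook can check for the package before attempting the load. The snippet below is a minimal sketch using only the standard library. The helper name `excel_engine_available` is hypothetical, and the install command in the message assumes a pip-based environment.

```python
import importlib.util


def excel_engine_available(module_name: str = "openpyxl") -> bool:
    """Return True if the named module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if not excel_engine_available():
    # Assumption: pip is the package manager in use
    print("openpyxl not found; install it (e.g. `pip install openpyxl`) "
          "before reading .xlsx files with pd.read_excel")
```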


---

# Section 2

# Inspect Structure and Missing Data 

In [10]:
# Step 4: Inspect data structure
# Note: DataFrame.info() prints its summary directly and returns None,
# so it is called without a print() wrapper
print("\n--- Discovery Cohort Info ---")
discovery.info()
print("\nMissing values per column:")
print(discovery.isnull().sum())

print("\n--- Validation Cohort Info ---")
validation.info()
print("\nMissing values per column:")
print(validation.isnull().sum())



--- Discovery Cohort Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   PatientID               30 non-null     int64 
 1   Specimen date           30 non-null     object
 2   Dead or Alive           30 non-null     object
 3   Date of Death           30 non-null     object
 4   Date of Last Follow Up  30 non-null     object
 5   sex                     30 non-null     object
 6   race                    30 non-null     object
 7   Stage                   30 non-null     object
 8   Event                   30 non-null     int64 
 9   Time                    30 non-null     int64 
dtypes: int64(3), object(7)
memory usage: 2.5+ KB

Missing values per column:
PatientID                 0
Specimen date             0
Dead or Alive             0
Date of Death             0
Date of Last Follow Up    0
sex                

### Initial Observations
- Discovery cohort: 30 rows × 10 columns, no missing values
- Validation cohort: 95 rows × 14 columns, with missing values in:
    - Type.Adjuvant: 73 of 95 missing (about 77%), so the column is highly sparse
    - EGFR: 9 missing (about 9%)
    - KRAS: 30 missing (about 32%)
- The date columns in the discovery cohort (`Specimen date`, `Date of Death`, `Date of Last Follow Up`) are currently `object` dtype. They will be converted to datetime in the cleaning steps
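
The planned date conversion could look roughly like the sketch below, which uses the first two discovery rows from the preview above. It assumes every date follows the month/day/year pattern seen in those rows; the `days_elapsed` column is a hypothetical name added only to cross-check the existing `Time` column. The actual cleaning notebook may handle this differently.

```python
import pandas as pd

# First two discovery rows, copied from the head() output
sample = pd.DataFrame({
    "Specimen date": ["3/17/2003", "6/17/2003"],
    "Date of Death": ["2/24/2010", "11/12/2004"],
    "Time": [2536, 514],
})

for col in ["Specimen date", "Date of Death"]:
    # Assumption: month/day/year format; errors="coerce" turns anything else into NaT
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%Y", errors="coerce")

# Cross-check: days between specimen and death should match Time for these rows
sample["days_elapsed"] = (sample["Date of Death"] - sample["Specimen date"]).dt.days
print(sample[["Time", "days_elapsed"]])
```

For these two rows the computed difference matches `Time`, which suggests `Time` was derived from the specimen date. That would still need to be confirmed on the full cohort.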


> **Note:**  
> A large number of patients have no adjuvant therapy recorded.  
> In oncology, absence of adjuvant therapy may be intentional and clinically valid.
> Therefore, this is not necessarily missing data.  
> For this analysis, these entries will be coded as `"No_Adjuvant_Therapy"` to distinguish them from other therapy types.
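
A minimal sketch of that recoding is shown below on a small made-up frame. The third patient ID and the `"Chemo"` label are hypothetical placeholders, since the real therapy categories are not shown in this notebook. Only the `"No_Adjuvant_Therapy"` label comes from the note above.

```python
import pandas as pd

sample = pd.DataFrame({
    "Patient ID": ["P109342", "P124450", "P000000"],  # last ID is hypothetical
    "Type.Adjuvant": [None, None, "Chemo"],           # "Chemo" is a hypothetical label
})

# Treat an empty entry as "no adjuvant therapy given" rather than as missing data
sample["Type.Adjuvant"] = sample["Type.Adjuvant"].fillna("No_Adjuvant_Therapy")

counts = sample["Type.Adjuvant"].value_counts()
print(counts)
```

Recoding in this way keeps all 95 validation rows available for later grouping, at the cost of assuming that a blank entry always means no therapy was given.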


---

# Next Steps

* Move to Notebook 1 for cleaning, encoding, and missing value handling