# ANALYSIS OF LUNG CANCER DATASET AND PREDICTION OF SURVIVAL RATE
Lung cancer is a leading cause of cancer-related deaths worldwide. Early detection and accurate prediction of survival rates can significantly improve patient outcomes. In this analysis, we will explore a lung cancer dataset, perform data preprocessing, and build a predictive model to estimate survival rates.

In [14]:
# suppress warnings
import warnings
warnings.filterwarnings("ignore")

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## EXTRACTION
In this step, we will load the lung cancer dataset from a CSV file. The dataset contains various features related to lung cancer patients, including demographic information, clinical data, and survival outcomes.

**Loading the Dataset:**

In [15]:
# type: ignore
# loading the dataset
cancer = pd.read_csv("../data/Lung Cancer.csv")

# pd.read_csv already returns a DataFrame, so no conversion is needed
cancer_df = cancer
cancer_df.head()


Unnamed: 0,id,age,gender,country,diagnosis_date,cancer_stage,family_history,smoking_status,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,treatment_type,end_treatment_date,survived
0,1,64.0,Male,Sweden,2016-04-05,Stage I,Yes,Passive Smoker,29.4,199,0,0,1,0,Chemotherapy,2017-09-10,0
1,2,50.0,Female,Netherlands,2023-04-20,Stage III,Yes,Passive Smoker,41.2,280,1,1,0,0,Surgery,2024-06-17,1
2,3,65.0,Female,Hungary,2023-04-05,Stage III,Yes,Former Smoker,44.0,268,1,1,0,0,Combined,2024-04-09,0
3,4,51.0,Female,Belgium,2016-02-05,Stage I,No,Passive Smoker,43.0,241,1,1,0,0,Chemotherapy,2017-04-23,0
4,5,37.0,Male,Luxembourg,2023-11-29,Stage I,No,Passive Smoker,19.7,178,0,0,0,0,Combined,2025-01-08,0


**a) Full extraction**
This step involves loading the entire dataset from a CSV file into a DataFrame. A full extraction is useful when we want to analyze the entire dataset without any filtering or selection criteria.



In [16]:
# full extraction
full_extraction = pd.read_csv("../data/Lung Cancer.csv")
print(f"Pulled {len(full_extraction)} rows via full extraction.")
full_extraction.head()

Pulled 890000 rows via full extraction.


Unnamed: 0,id,age,gender,country,diagnosis_date,cancer_stage,family_history,smoking_status,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,treatment_type,end_treatment_date,survived
0,1,64.0,Male,Sweden,2016-04-05,Stage I,Yes,Passive Smoker,29.4,199,0,0,1,0,Chemotherapy,2017-09-10,0
1,2,50.0,Female,Netherlands,2023-04-20,Stage III,Yes,Passive Smoker,41.2,280,1,1,0,0,Surgery,2024-06-17,1
2,3,65.0,Female,Hungary,2023-04-05,Stage III,Yes,Former Smoker,44.0,268,1,1,0,0,Combined,2024-04-09,0
3,4,51.0,Female,Belgium,2016-02-05,Stage I,No,Passive Smoker,43.0,241,1,1,0,0,Chemotherapy,2017-04-23,0
4,5,37.0,Male,Luxembourg,2023-11-29,Stage I,No,Passive Smoker,19.7,178,0,0,0,0,Combined,2025-01-08,0


**b) Incremental extraction** 
This step involves loading only the new or updated data from the source. Incremental extraction is useful when we want to keep our dataset up to date without reloading the entire dataset. In this case, we filter the DataFrame to include only rows where `diagnosis_date` is later than the last extraction date, `2023-01-01`.

In [17]:
# set the last extraction date
last_extraction = ("2023-01-01")

# Load cancer dataset
incremental_ext = pd.read_csv("../data/Lung Cancer.csv")

# convert diagnosis date to datetime
incremental_ext['diagnosis_date'] = pd.to_datetime(incremental_ext['diagnosis_date'], errors='coerce')

# filter to include only rows where the diagnosis date is greater than the last extraction date
incremental_ext = incremental_ext[incremental_ext["diagnosis_date"] > pd.to_datetime(last_extraction)]

# reset the index of the DataFrame
incremental_ext.reset_index(drop=True, inplace=True)

# print output
print(f"Pulled {len(incremental_ext)} rows via incremental extraction.")
incremental_ext.head()

Pulled 125749 rows via incremental extraction.


Unnamed: 0,id,age,gender,country,diagnosis_date,cancer_stage,family_history,smoking_status,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,treatment_type,end_treatment_date,survived
0,2,50.0,Female,Netherlands,2023-04-20,Stage III,Yes,Passive Smoker,41.2,280,1,1,0,0,Surgery,2024-06-17,1
1,3,65.0,Female,Hungary,2023-04-05,Stage III,Yes,Former Smoker,44.0,268,1,1,0,0,Combined,2024-04-09,0
2,5,37.0,Male,Luxembourg,2023-11-29,Stage I,No,Passive Smoker,19.7,178,0,0,0,0,Combined,2025-01-08,0
3,6,50.0,Male,Italy,2023-01-02,Stage I,No,Never Smoked,37.6,274,1,0,0,0,Radiation,2024-12-27,0
4,11,48.0,Female,Luxembourg,2023-12-24,Stage IV,No,Never Smoked,30.7,262,1,1,0,0,Surgery,2024-10-28,1
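
The incremental filter above can be wrapped in a small reusable helper. The sketch below is illustrative only: the function name `extract_since` and the toy rows are assumptions, not part of the original notebook. It also shows one common way to carry the watermark forward, by taking the newest date seen in the current pull.

```python
import pandas as pd

def extract_since(df: pd.DataFrame, date_col: str, watermark: str) -> pd.DataFrame:
    """Return rows whose date_col is strictly later than the watermark (hypothetical helper)."""
    out = df.copy()
    # unparseable dates become NaT and are excluded by the comparison below
    out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
    mask = out[date_col] > pd.Timestamp(watermark)
    return out.loc[mask].reset_index(drop=True)

# toy rows for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis_date": ["2016-04-05", "2023-04-20", "2023-01-01", "not a date"],
})

recent = extract_since(toy, "diagnosis_date", "2023-01-01")
print(recent)

# the newest date seen can serve as the watermark for the next run
next_watermark = recent["diagnosis_date"].max()
print(next_watermark.date())
```

Because the comparison is strict (`>`), a record dated exactly on the watermark is not pulled again, which is why the row dated `2023-01-01` is excluded here.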


## Description of the Dataset
The lung cancer dataset initially contained 890,000 records. The incremental extraction kept only records with a `diagnosis_date` after `2023-01-01`, leaving 125,749 records.

The data contains the following columns:
- `id`: Unique identifier for each patient.
- `age`: Age of the patient.
- `gender`: Either male or female.
- `country`: Country of the patient.
- `diagnosis_date`: Date when the patient was diagnosed with lung cancer.
- `cancer_stage`: Stage of lung cancer at the time of diagnosis.
- `family_history`: Indicates whether the patient has a family history of lung cancer.
- `smoking_status`: Smoking status of the patient (categories include passive smoker, never smoked, and former smoker).
- `bmi`: Body Mass Index of the patient.
- `cholesterol_level`: Cholesterol level of the patient.
- `hypertension`: Indicates whether the patient has hypertension.
- `asthma`: Indicates whether the patient has asthma.
- `cirrhosis`: Indicates whether the patient has cirrhosis.
- `other_cancer`: Indicates whether the patient has other types of cancer.
- `treatment_type`: Type of treatment received by the patient.
- `end_treatment_date`: Date when the treatment ended.
- `survived`: Indicates whether the patient survived (1) or not (0).

In [18]:
# checking for null values
missing = incremental_ext.isnull().sum()
print(f"Total number of missing values:\n{missing}")

# checking for duplicate values
dups = incremental_ext.duplicated().sum()
print(f"Total number of duplicate values: {dups}")

# checking the datatypes 
print("The datatypes of the columns:\n")
print(incremental_ext.dtypes)

# describing the dataset
incremental_ext.describe()

Total number of missing values:
id                    0
age                   0
gender                0
country               0
diagnosis_date        0
cancer_stage          0
family_history        0
smoking_status        0
bmi                   0
cholesterol_level     0
hypertension          0
asthma                0
cirrhosis             0
other_cancer          0
treatment_type        0
end_treatment_date    0
survived              0
dtype: int64
Total number of duplicate values: 0
The datatypes of the columns:

id                             int64
age                          float64
gender                        object
country                       object
diagnosis_date        datetime64[ns]
cancer_stage                  object
family_history                object
smoking_status                object
bmi                          float64
cholesterol_level              int64
hypertension                   int64
asthma                         int64
cirrhosis                      int64

Unnamed: 0,id,age,diagnosis_date,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,survived
count,125749.0,125749.0,125749,125749.0,125749.0,125749.0,125749.0,125749.0,125749.0,125749.0
mean,445566.758925,54.979889,2023-09-16 07:20:53.382532096,30.527964,233.744443,0.751688,0.471002,0.226069,0.08738,0.22071
min,2.0,7.0,2023-01-02 00:00:00,16.0,150.0,0.0,0.0,0.0,0.0,0.0
25%,223641.0,48.0,2023-05-10 00:00:00,23.3,196.0,1.0,0.0,0.0,0.0,0.0
50%,445321.0,55.0,2023-09-16 00:00:00,30.6,242.0,1.0,0.0,0.0,0.0,0.0
75%,668793.0,62.0,2024-01-23 00:00:00,37.8,271.0,1.0,1.0,0.0,0.0,0.0
max,889989.0,101.0,2024-05-30 00:00:00,45.0,300.0,1.0,1.0,1.0,1.0,1.0
std,256948.785913,9.987977,,8.377526,43.423667,0.432036,0.49916,0.418286,0.282393,0.414727


The extracted dataset has no missing values and no duplicate records, making it suitable for further analysis.
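
The same null and duplicate checks can be collected into a single summary, which is convenient when the extraction is re-run. This is a minimal sketch; the helper name `quality_report` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Summarise row count, missing cells, and fully duplicated rows (hypothetical helper)."""
    return {
        "rows": len(df),
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# toy frame with one missing value and one duplicated row
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "age": [64.0, 50.0, 50.0, None],
})

report = quality_report(toy)
print(report)

# a frame is considered clean here only if both counts are zero
is_clean = report["missing_cells"] == 0 and report["duplicate_rows"] == 0
print(is_clean)
```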

## TRANSFORMATION
In this step, we will perform data preprocessing and transformation to prepare the dataset for analysis. This includes the following:

1) **Date Conversion**: Convert the `diagnosis_date` and `end_treatment_date` columns to datetime format. This enables us to perform date-related operations and calculations.

In [19]:
#type: ignore
# converting the diagnosis_date and end_treatment_date columns to datetime format
incremental_ext['diagnosis_date'] = pd.to_datetime(incremental_ext['diagnosis_date'], errors='coerce') 
incremental_ext['end_treatment_date'] = pd.to_datetime(incremental_ext['end_treatment_date'], errors='coerce')
incremental_ext.head()

Unnamed: 0,id,age,gender,country,diagnosis_date,cancer_stage,family_history,smoking_status,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,treatment_type,end_treatment_date,survived
0,2,50.0,Female,Netherlands,2023-04-20,Stage III,Yes,Passive Smoker,41.2,280,1,1,0,0,Surgery,2024-06-17,1
1,3,65.0,Female,Hungary,2023-04-05,Stage III,Yes,Former Smoker,44.0,268,1,1,0,0,Combined,2024-04-09,0
2,5,37.0,Male,Luxembourg,2023-11-29,Stage I,No,Passive Smoker,19.7,178,0,0,0,0,Combined,2025-01-08,0
3,6,50.0,Male,Italy,2023-01-02,Stage I,No,Never Smoked,37.6,274,1,0,0,0,Radiation,2024-12-27,0
4,11,48.0,Female,Luxembourg,2023-12-24,Stage IV,No,Never Smoked,30.7,262,1,1,0,0,Surgery,2024-10-28,1


2) **Feature Engineering**: 

a) Create a new column `treatment_duration` that calculates the duration of treatment in days by subtracting the `diagnosis_date` from the `end_treatment_date`. This will help us understand the impact of treatment duration on survival rates.

In [20]:
incremental_ext['treatment_duration'] = (incremental_ext['end_treatment_date'] - incremental_ext['diagnosis_date']).dt.days
incremental_ext.head()

Unnamed: 0,id,age,gender,country,diagnosis_date,cancer_stage,family_history,smoking_status,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,treatment_type,end_treatment_date,survived,treatment_duration
0,2,50.0,Female,Netherlands,2023-04-20,Stage III,Yes,Passive Smoker,41.2,280,1,1,0,0,Surgery,2024-06-17,1,424
1,3,65.0,Female,Hungary,2023-04-05,Stage III,Yes,Former Smoker,44.0,268,1,1,0,0,Combined,2024-04-09,0,370
2,5,37.0,Male,Luxembourg,2023-11-29,Stage I,No,Passive Smoker,19.7,178,0,0,0,0,Combined,2025-01-08,0,406
3,6,50.0,Male,Italy,2023-01-02,Stage I,No,Never Smoked,37.6,274,1,0,0,0,Radiation,2024-12-27,0,725
4,11,48.0,Female,Luxembourg,2023-12-24,Stage IV,No,Never Smoked,30.7,262,1,1,0,0,Surgery,2024-10-28,1,309
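
As a quick check of the date arithmetic, the sketch below recomputes the duration for the first two rows shown above using a small stand-alone DataFrame (the `toy` frame is an illustration, not part of the original notebook).

```python
import pandas as pd

# dates copied from the first two rows of the output above
toy = pd.DataFrame({
    "diagnosis_date": pd.to_datetime(["2023-04-20", "2023-04-05"]),
    "end_treatment_date": pd.to_datetime(["2024-06-17", "2024-04-09"]),
})

# subtracting two datetime columns yields a timedelta column;
# .dt.days extracts the whole number of days
delta = toy["end_treatment_date"] - toy["diagnosis_date"]
toy["treatment_duration"] = delta.dt.days

print(toy["treatment_duration"].tolist())  # [424, 370]
```

Note that this interval runs from diagnosis to the end of treatment, so it also includes any time between diagnosis and the start of treatment.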


b) Create a new column `comorbidities_count` that counts the number of comorbidities (hypertension, asthma, cirrhosis, and other_cancer) for each patient. Comorbidity is the presence of one or more additional diseases or disorders co-occurring with a primary disease. This will help us understand the impact of comorbidities on survival rates.


In [21]:
# creating a new column
incremental_ext['comorbidities_count'] = incremental_ext[['hypertension', 'asthma', 'cirrhosis', 'other_cancer']].sum(axis=1)
incremental_ext.head()

Unnamed: 0,id,age,gender,country,diagnosis_date,cancer_stage,family_history,smoking_status,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,treatment_type,end_treatment_date,survived,treatment_duration,comorbidities_count
0,2,50.0,Female,Netherlands,2023-04-20,Stage III,Yes,Passive Smoker,41.2,280,1,1,0,0,Surgery,2024-06-17,1,424,2
1,3,65.0,Female,Hungary,2023-04-05,Stage III,Yes,Former Smoker,44.0,268,1,1,0,0,Combined,2024-04-09,0,370,2
2,5,37.0,Male,Luxembourg,2023-11-29,Stage I,No,Passive Smoker,19.7,178,0,0,0,0,Combined,2025-01-08,0,406,0
3,6,50.0,Male,Italy,2023-01-02,Stage I,No,Never Smoked,37.6,274,1,0,0,0,Radiation,2024-12-27,0,725,1
4,11,48.0,Female,Luxembourg,2023-12-24,Stage IV,No,Never Smoked,30.7,262,1,1,0,0,Surgery,2024-10-28,1,309,2


c) Level binning: This will involve creating bins for the `age` column to categorize patients into groups like `children`, `adolescents`, `adults`, and `elderly`. This will help us analyze survival rates based on age groups.

We will also create bins for the `bmi` column to categorize patients into groups like `underweight`, `normal`, `overweight`, and `obese`. This will help us analyze survival rates based on BMI categories.

Finally, we will create bins for the `cholesterol_level` column to categorize patients into groups like `Desirable`, `Borderline high`, and `High`. This will help us analyze survival rates based on cholesterol levels.

Binning can simplify relationships, make models more robust to outliers, and allow for easier interpretation of certain patterns, especially for visualization and initial exploration.

In [22]:
# creating age bins (right=False makes each bin left-closed, e.g. [0, 12));
# ages of 100 or more fall outside the last bin and become NaN
incremental_ext['age_group'] = pd.cut(incremental_ext['age'], bins=[0, 12, 19, 59, 100], labels=['children', 'adolescents', 'adults', 'elderly'], right=False)

# creating bmi bins: <18.5 underweight, 18.5-24.9 normal, 25-29.9 overweight, >=30 obese
incremental_ext['bmi_category'] = pd.cut(incremental_ext['bmi'], bins=[0, 18.5, 25, 30, 100], labels=['underweight', 'normal', 'overweight', 'obese'], right=False)

# creating cholesterol bins: <200 Desirable, 200-239 Borderline high, >=240 High
incremental_ext['cholesterol_category'] = pd.cut(incremental_ext['cholesterol_level'], bins=[0, 200, 240, 1000], labels=['Desirable', 'Borderline high', 'High'], right=False)
incremental_ext.head()

Unnamed: 0,id,age,gender,country,diagnosis_date,cancer_stage,family_history,smoking_status,bmi,cholesterol_level,...,cirrhosis,other_cancer,treatment_type,end_treatment_date,survived,treatment_duration,comorbidities_count,age_group,bmi_category,cholesterol_category
0,2,50.0,Female,Netherlands,2023-04-20,Stage III,Yes,Passive Smoker,41.2,280,...,0,0,Surgery,2024-06-17,1,424,2,adults,obese,High
1,3,65.0,Female,Hungary,2023-04-05,Stage III,Yes,Former Smoker,44.0,268,...,0,0,Combined,2024-04-09,0,370,2,elderly,obese,High
2,5,37.0,Male,Luxembourg,2023-11-29,Stage I,No,Passive Smoker,19.7,178,...,0,0,Combined,2025-01-08,0,406,0,adults,normal,Desirable
3,6,50.0,Male,Italy,2023-01-02,Stage I,No,Never Smoked,37.6,274,...,0,0,Radiation,2024-12-27,0,725,1,adults,obese,High
4,11,48.0,Female,Luxembourg,2023-12-24,Stage IV,No,Never Smoked,30.7,262,...,0,0,Surgery,2024-10-28,1,309,2,adults,obese,High
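
The `right=False` argument used in the binning step controls which edge of each bin is inclusive, and values outside the outermost edges are returned as `NaN`. The sketch below uses the same age edges on a handful of made-up ages to show both effects; the open-ended variant with `float("inf")` is one possible way to avoid losing very old patients, not something the original notebook does.

```python
import pandas as pd

ages = pd.Series([11, 12, 58, 59, 99, 100, 101])  # made-up ages near the bin edges
edges = [0, 12, 19, 59, 100]
labels = ["children", "adolescents", "adults", "elderly"]

# right=False: bins are [0, 12), [12, 19), [19, 59), [59, 100)
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)

# right=True (the pandas default): bins are (0, 12], (12, 19], (19, 59], (59, 100]
right_closed = pd.cut(ages, bins=edges, labels=labels, right=True)

# an open-ended last bin keeps ages of 100 and above in 'elderly'
open_ended = pd.cut(ages, bins=[0, 12, 19, 59, float("inf")], labels=labels, right=False)

comparison = pd.DataFrame({
    "age": ages,
    "left_closed": left_closed,
    "right_closed": right_closed,
    "open_ended": open_ended,
})
print(comparison)
```

With left-closed bins, an age of exactly 12 is labelled `adolescents`, whereas with right-closed bins it is labelled `children`; ages of 100 and 101 are `NaN` under the left-closed edges but `elderly` under the open-ended ones.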


3) **Drop Unnecessary Columns**: Drop columns that are no longer needed for analysis: `id`, `age`, `bmi`, `cholesterol_level`, `hypertension`, `asthma`, `cirrhosis`, and `other_cancer`. Apart from `id`, these are now represented by the derived columns `age_group`, `bmi_category`, `cholesterol_category`, and `comorbidities_count`. Dropping them reduces the dimensionality of the dataset and focuses on relevant features for analysis.

In [23]:
# Drop unnecessary columns
incremental_ext = incremental_ext.drop(columns=['id', 'age', 'bmi', 'cholesterol_level', 'hypertension', 'asthma', 'cirrhosis', 'other_cancer'])

# save the transformed DataFrame to a new CSV file
incremental_ext.to_csv("../data/transformed_lung_cancer.csv", index=False) # this is for analysis purposes
incremental_ext.head()

Unnamed: 0,gender,country,diagnosis_date,cancer_stage,family_history,smoking_status,treatment_type,end_treatment_date,survived,treatment_duration,comorbidities_count,age_group,bmi_category,cholesterol_category
0,Female,Netherlands,2023-04-20,Stage III,Yes,Passive Smoker,Surgery,2024-06-17,1,424,2,adults,obese,High
1,Female,Hungary,2023-04-05,Stage III,Yes,Former Smoker,Combined,2024-04-09,0,370,2,elderly,obese,High
2,Male,Luxembourg,2023-11-29,Stage I,No,Passive Smoker,Combined,2025-01-08,0,406,0,adults,normal,Desirable
3,Male,Italy,2023-01-02,Stage I,No,Never Smoked,Radiation,2024-12-27,0,725,1,adults,obese,High
4,Female,Luxembourg,2023-12-24,Stage IV,No,Never Smoked,Surgery,2024-10-28,1,309,2,adults,obese,High


4) **Categorical Encoding**

a) **One-Hot Encoding**: Convert categorical variables such as `gender`, `country`, `family_history`, `smoking_status`, and `treatment_type` into numerical format, where each category becomes a new binary column. This is necessary for machine learning algorithms that require numerical input. Note that `cancer_stage` is not encoded in this step and remains a text column.

In [24]:
# columns to encode
encoded_col = ['gender', 'country', 'smoking_status', 'treatment_type', 'family_history']

# drop the first category of each column to avoid multicollinearity
df_encoded_pd = pd.get_dummies(incremental_ext, columns=encoded_col, drop_first=True, dtype=int)
print("\nInfo after One-Hot Encoding:")
print(df_encoded_pd.info())
print("DataFrame after One-Hot Encoding with pd.get_dummies():")
df_encoded_pd.head()


Info after One-Hot Encoding:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125749 entries, 0 to 125748
Data columns (total 43 columns):
 #   Column                         Non-Null Count   Dtype         
---  ------                         --------------   -----         
 0   diagnosis_date                 125749 non-null  datetime64[ns]
 1   cancer_stage                   125749 non-null  object        
 2   end_treatment_date             125749 non-null  datetime64[ns]
 3   survived                       125749 non-null  int64         
 4   treatment_duration             125749 non-null  int64         
 5   comorbidities_count            125749 non-null  int64         
 6   age_group                      125748 non-null  category      
 7   bmi_category                   125749 non-null  category      
 8   cholesterol_category           125749 non-null  category      
 9   gender_Male                    125749 non-null  int64         
 10  country_Belgium                125749 

Unnamed: 0,diagnosis_date,cancer_stage,end_treatment_date,survived,treatment_duration,comorbidities_count,age_group,bmi_category,cholesterol_category,gender_Male,...,country_Slovenia,country_Spain,country_Sweden,smoking_status_Former Smoker,smoking_status_Never Smoked,smoking_status_Passive Smoker,treatment_type_Combined,treatment_type_Radiation,treatment_type_Surgery,family_history_Yes
0,2023-04-20,Stage III,2024-06-17,1,424,2,adults,obese,High,0,...,0,0,0,0,0,1,0,0,1,1
1,2023-04-05,Stage III,2024-04-09,0,370,2,elderly,obese,High,0,...,0,0,0,1,0,0,1,0,0,1
2,2023-11-29,Stage I,2025-01-08,0,406,0,adults,normal,Desirable,1,...,0,0,0,0,0,1,1,0,0,0
3,2023-01-02,Stage I,2024-12-27,0,725,1,adults,obese,High,1,...,0,0,0,0,1,0,0,1,0,0
4,2023-12-24,Stage IV,2024-10-28,1,309,2,adults,obese,High,0,...,0,0,0,0,1,0,0,0,1,0
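
To make the effect of `drop_first=True` concrete, the sketch below encodes a tiny made-up frame. For plain text columns, `pd.get_dummies` orders categories alphabetically and drops the first one, so that category becomes the implicit baseline represented by all zeros.

```python
import pandas as pd

# made-up rows using category names that appear in the dataset
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "treatment_type": ["Surgery", "Chemotherapy", "Combined"],
})

encoded = pd.get_dummies(
    toy,
    columns=["gender", "treatment_type"],
    drop_first=True,  # drops 'Female' and 'Chemotherapy', the first categories alphabetically
    dtype=int,
)

print(encoded.columns.tolist())
print(encoded)
```

The second row (`Female`, `Chemotherapy`) is encoded as zeros in every dummy column, which is how the dropped baseline categories are recovered.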
