
In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
khwaishsaxena_lung_cancer_dataset_path = kagglehub.dataset_download('khwaishsaxena/lung-cancer-dataset')

print('Data source import complete.')


# 1.0 About Author
- Project: Lung Cancer Dataset
- Author: Zulqarnain Haider
- Code Submission Date: 15-07-2025
- Author's Contact Info:\
[Email](mailto:nainhaider989@gmail.com)\
[Kaggle](https://www.kaggle.com/zulqarnain11)


# 2.0 About Data
- Data: Lung Cancer Dataset
- Data Size: 93.38 MB
- Rows and Columns: 890,000 rows × 17 columns
- Data Age: Collected between June 2, 2014 and May 30, 2024
- **Dataset:** 🔗 [*link*](https://www.kaggle.com/datasets/khwaishsaxena/lung-cancer-dataset/data)
## 2.1 Task:
We aim to perform Exploratory Data Analysis (EDA) to uncover underlying patterns and insights from the dataset. The EDA will help us identify:

- Data distribution and trends

- Missing or inconsistent data

- Relationships among features

- Potential outliers or anomalies

This EDA will form the foundation for subsequent data wrangling, cleaning, and preprocessing, which are essential for any modeling or predictive analysis.

## 2.2 Objectives:
The key objectives of this analysis are to:

- Investigate factors contributing to lung cancer mortality and survival

- Explore demographic and clinical features such as age, gender, BMI, smoking status, and comorbidities

- Determine patterns in treatment types, cancer stages, and survival outcomes

- Offer insights that can assist healthcare researchers, developers, and policymakers in better understanding patient dynamics and risks

- Prepare a clean, normalized dataset suitable for further machine learning applications



## 2.3 Kernel Version Used:
- Python 3.11.13

# Importing necessary libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Loading the dataset

In [None]:
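from glob import glob  # read_csv needs a file, not a folder, so pick the CSV inside the dataset directory
df = pd.read_csv(sorted(glob('/kaggle/input/lung-cancer-dataset/*.csv'))[0])  # assumes at least one CSV sits directly in that folder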

# Exploring and Preprocessing the Dataset

In [None]:
df.head()

In [None]:
df.shape  # (rows, columns) of the dataset

In [None]:
df.describe()

In [None]:
# Normalize the survived column to integer 0/1 (any truthy value becomes 1).
# Note: astype(bool) maps NaN to True, so a missing value would become 1 here.
df['survived'] = df['survived'].astype(bool).astype(int)


In [None]:
df.info()

In [None]:
df.survived.unique()

In [None]:
#checking null values in each column
df.isnull().sum()

In [None]:
# plot a heatmap of the missing values for further clarity
sns.heatmap(df.isnull())

In [None]:
# All the null values sit in a single row, so it is simplest to drop that row
df.dropna(inplace=True)
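
Before dropping rows, it can help to confirm that the missing values really are confined to fully empty rows. The sketch below is a minimal illustration on a made-up toy frame (`toy` and its values are assumptions, not the real dataset); the same checks could be run on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "age": [64.0, np.nan, 50.0],
    "bmi": [29.4, np.nan, 31.2],
    "survived": [0.0, np.nan, 1.0],
})

# Number of missing values in each row
nulls_per_row = toy.isnull().sum(axis=1)

# Rows where every column is missing vs. rows with only some values missing
fully_empty = toy[nulls_per_row == toy.shape[1]]
partly_empty = toy[(nulls_per_row > 0) & (nulls_per_row < toy.shape[1])]
print(f"fully empty rows: {len(fully_empty)}, partly empty rows: {len(partly_empty)}")

# Dropping rows loses no information only if no row is partly empty
if partly_empty.empty:
    cleaned = toy.dropna()
    print(f"dropped {len(toy) - len(cleaned)} row(s) that carried no data")
```

If `partly_empty` is not empty, `dropna()` also discards rows that still hold usable values, and imputation may be worth considering instead.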

In [None]:
df.isnull().sum()

In [None]:
df.shape

In [None]:
df.age.unique()  # the ages are whole numbers, so convert the column to int and bin it for clearer insights

In [None]:
df.age=df.age.astype('int')


In [None]:
# now let's bin the ages into groups
bins=[0, 10, 20, 35, 50, 65, 80, 120]
labels=['Child', 'Teenager', 'Young Adult', 'Adult', 'Middle Aged', 'Senior', 'Elderly']
df["age_group"]=pd.cut(df["age"], bins=bins, labels=labels)
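
By default `pd.cut` builds right-closed intervals, so with these bins an age of 50 lands in "Adult" (35, 50] and an age of 0 gets no group at all. The sketch below shows this on a few boundary ages chosen purely for illustration.

```python
import pandas as pd

bins = [0, 10, 20, 35, 50, 65, 80, 120]
labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Middle Aged', 'Senior', 'Elderly']

# Boundary ages chosen for illustration, not taken from the dataset
ages = pd.Series([0, 10, 11, 50, 51, 65, 66, 120])
groups = pd.cut(ages, bins=bins, labels=labels)

# Side-by-side view of each age and the group it falls into
print(pd.DataFrame({"age": ages, "age_group": groups}))
```

Passing `include_lowest=True` would place age 0 in the first bin, should that ever matter for the data at hand.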

# Which Age Group Has the Most Lung Cancer Patients?

In [None]:
# Total patients per age group
total_patients = df['age_group'].value_counts().sort_index()

# Died patients (survived == 0)
died_patients = df[df['survived'] == 0]['age_group'].value_counts().sort_index()

# Survived patients (survived == 1)
survived_patients = df[df['survived'] == 1]['age_group'].value_counts().sort_index()

for age_group in total_patients.index:
    total = total_patients.get(age_group, 0)
    died = died_patients.get(age_group, 0)
    survived = survived_patients.get(age_group, 0)
    print(f"{age_group}: Total = {total}, Died = {died}, Survived = {survived}")



---
# Observation
- The Middle Aged (50–65) group had the highest number of patients and deaths: 469,405 total, with 78,387 deaths and 391,018 survivors.
- The Adult (35–50) group followed with 267,253 total, 44,848 deaths, and 222,405 survivors.
- The Senior (65–80) group had 125,764 total, 20,955 deaths, and 104,809 survivors.

---

In [None]:
# percentage of deaths per age group
died_percentage = (died_patients / total_patients) * 100
print(died_percentage)

# Is There Any Relationship Between Cancer Stage and Deaths?

In [None]:
df.columns  # list the available columns
# Keep only the patients who died (survived == 0)
died_df = df[df['survived'] == 0]

# Count deaths per cancer stage
died_by_stage = died_df['cancer_stage'].value_counts().sort_values(ascending=False)

# Print result
print(died_by_stage)


---
# Observation – Deaths by Cancer Stage
- Stage IV: 37,327 deaths
- Stage III: 37,326 deaths
- Stage I: 37,260 deaths
- Stage II: 36,907 deaths
#### Death counts are nearly identical across stages. This suggests that cancer stage is not strongly associated with mortality in this dataset, although raw counts depend on how many patients fall in each stage, and the pattern may also reflect limitations of the data.
---
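
Death counts alone cannot separate "more deaths" from "more patients", so a per-category death rate is a useful companion figure. The helper below is a minimal sketch: `death_rate_by` is a hypothetical name introduced here, and the toy frame is made up for illustration. The same function could be applied to `df` with `cancer_stage`, `smoking_status`, `family_history`, or `country`.

```python
import pandas as pd

def death_rate_by(frame, column, target="survived"):
    """Return total patients, deaths, and death rate (%) per category of `column`."""
    died = frame[target].eq(0)  # True where the patient died
    summary = (
        frame.assign(died=died)
        .groupby(column, observed=True)["died"]
        .agg(total="size", deaths="sum")
    )
    summary["death_rate_pct"] = (summary["deaths"] / summary["total"] * 100).round(2)
    return summary.sort_values("death_rate_pct", ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    "cancer_stage": ["Stage I", "Stage I", "Stage IV", "Stage IV", "Stage IV", "Stage IV"],
    "survived": [1, 0, 0, 0, 1, 0],
})
stage_summary = death_rate_by(toy, "cancer_stage")
print(stage_summary)
```

Comparing `death_rate_pct` across categories, rather than `deaths`, avoids mistaking a larger group for a riskier one.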



# Which Country Had the Most Deaths?

In [None]:
died_df = df[df['survived'] == 0]

# Count deaths per country
died_by_country = died_df['country'].value_counts().sort_values(ascending=False)

# Print the 10 countries with the most deaths
print(died_by_country.head(10))

---
# Observation – Deaths by Country
- Ireland has the highest number of lung cancer deaths in the dataset (5,643), followed closely by Croatia (5,611) and Malta (5,577).
- The distribution across countries is fairly uniform, with only minor differences (all between ~5,400 and ~5,600 deaths).
- This suggests that the dataset may be evenly sampled across European countries, possibly for balanced analysis.
---

# Does Smoking Status Affect Mortality?


In [None]:

died_df['smoking_status'].value_counts().sort_values(ascending=False)

---
# Observation – Deaths by Smoking Status
- The number of deaths is quite similar across all smoking status categories.
- Passive Smokers had the highest number of deaths (37,499), followed by Never Smoked (37,223), Former Smoker (37,054), and Current Smoker (37,044).
---

# Does Family History Impact Lung Cancer Mortality?

In [None]:

df[df['survived'] == 0]['family_history'].value_counts()


---
# Observation – Does Family History Impact Lung Cancer Mortality?
- Out of all patients who died, 74,498 had a family history of cancer, while 74,322 had no family history.
- The death counts are almost equal, suggesting that family history may not have a significant impact on lung cancer mortality in this dataset.
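
To make "almost equal" concrete, the two counts reported above can be turned into shares of all deaths. This is plain arithmetic on the figures in the observation; note that it gives each group's share of deaths, not a death rate, which would also need the number of patients in each group.

```python
# Death counts taken from the observation above
deaths_with_history = 74_498
deaths_without_history = 74_322
total_deaths = deaths_with_history + deaths_without_history

# Share of all deaths contributed by each group
share_with = deaths_with_history / total_deaths * 100
share_without = deaths_without_history / total_deaths * 100

print(f"Total deaths: {total_deaths:,}")
print(f"With family history: {share_with:.2f}% of deaths")
print(f"Without family history: {share_without:.2f}% of deaths")
```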


In [None]:
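# numeric_only=True skips text columns, which would otherwise raise an error in recent pandas versions
sns.heatmap(df.corr(numeric_only=True))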

In [None]:
df.columns

In [None]:
df[df.survived==0]["gender"].value_counts().sort_values(ascending=False)

# Death Rate by BMI Category – Is Weight a Risk Factor?

In [None]:
# Define BMI groups
bins = [0, 18.5, 24.9, 29.9, df['bmi'].max()]
labels = ['Underweight', 'Normal', 'Overweight', 'Obese']

# Create a new column for BMI category
df['bmi_group'] = pd.cut(df['bmi'], bins=bins, labels=labels)

# Count total patients in each BMI group
total_by_bmi = df['bmi_group'].value_counts().sort_index()

# Count died patients in each BMI group
died_by_bmi = df[df['survived'] == 0]['bmi_group'].value_counts().sort_index()

# Calculate death rate (%)
death_rate_bmi = (died_by_bmi / total_by_bmi) * 100
death_rate_bmi = death_rate_bmi.round(2)

# Display results
death_rate_bmi
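
With the bins above and the default right-closed intervals of `pd.cut`, a BMI of exactly 18.5 is labelled Underweight and a BMI of 24.95 is labelled Overweight. Assuming the intent is the commonly used cut-offs (under 18.5, 18.5 to under 25, 25 to under 30, 30 and above), one alternative is half-open bins with `right=False`. The sample values below are chosen for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Half-open bins [low, high): 18.5 counts as Normal, 25 as Overweight, 30 as Obese
bins = [0, 18.5, 25, 30, np.inf]
labels = ['Underweight', 'Normal', 'Overweight', 'Obese']

# Boundary values chosen for illustration, not taken from the dataset
sample_bmi = pd.Series([18.4, 18.5, 24.95, 25.0, 29.95, 30.0, 45.0])
sample_groups = pd.cut(sample_bmi, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"bmi": sample_bmi, "bmi_group": sample_groups}))
```

Using `np.inf` as the last edge also keeps the bins valid regardless of the largest BMI in the data.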


# Heatmap to check correlation

In [None]:
# Select only numeric columns
numeric_df = df.select_dtypes(include='number')

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap (Numeric Features Only)")
plt.show()

# Observation
The correlation heatmap shows that the survived column has no strong correlation with any of the numeric features (age, bmi, cholesterol_level, hypertension, etc.).
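
The same reading can be taken from the numbers directly by pulling out the `survived` column of the correlation matrix and ranking it by absolute value. The sketch below uses a small made-up frame (`toy` is an assumption for illustration); on the real data the equivalent call would start from `numeric_df`.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "age": [40, 55, 60, 70, 45, 65],
    "bmi": [22.0, 30.5, 27.1, 24.3, 31.0, 26.2],
    "survived": [1, 0, 0, 0, 1, 1],
})

# Pearson correlation of every numeric feature with the target
corr_with_target = toy.corr(numeric_only=True)["survived"].drop("survived")

# Strongest relationships first, regardless of sign
ranked = corr_with_target.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(3))
```

Values near zero only rule out a strong linear relationship; they say nothing about non-linear effects or interactions between features.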

# Summary Observation
In this Exploratory Data Analysis (EDA) of the lung cancer dataset:

- Most patients fall in the Middle Aged (50–65) category, which also shows the highest number of deaths.

- Cancer stage, smoking status, family history, and BMI show minimal variation in deaths across their categories, suggesting that none of these features alone strongly influences survival outcomes.

- The correlation heatmap reveals that the survived variable has no strong linear correlation with any numeric feature.

- Country-wise distribution of deaths is fairly uniform, indicating a balanced dataset across European regions.

- 🔍 Key Insight: The dataset appears to be well-balanced, and no single factor stands out as a clear predictor of survival. This suggests that survival may depend on complex interactions between multiple features rather than individual variables.
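
One simple way to start probing such interactions is a two-way table of death rates, for example cancer stage against smoking status. The sketch below is only an illustration on made-up rows (`toy` and the helper column `died` are assumptions); it is not a result from the dataset.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "cancer_stage": ["Stage I", "Stage I", "Stage IV", "Stage IV", "Stage I", "Stage IV"],
    "smoking_status": ["Never Smoked", "Current Smoker", "Never Smoked",
                       "Current Smoker", "Current Smoker", "Current Smoker"],
    "survived": [1, 0, 0, 0, 1, 0],
})

# 1 where the patient died, 0 otherwise, so the mean is the death rate
toy["died"] = (toy["survived"] == 0).astype(int)

# Death rate (%) for every combination of the two features
rate_table = toy.pivot_table(
    index="cancer_stage", columns="smoking_status", values="died", aggfunc="mean"
) * 100
print(rate_table.round(1))
```

Cells that differ sharply from their row and column averages would hint at an interaction worth modeling.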
