# Data Loading and Cleaning

### Notebook Summary
This notebook documents the data loading and cleaning process, ensuring the dataset is accurate, consistent, and ready for analysis. Steps include:

- Importing and inspecting the raw dataset
- Identifying and handling missing values
- Removing duplicates and inconsistencies
- Correcting data types and formats
- Dropping irrelevant or redundant features

The goal is to produce a clean, reliable dataset that forms the foundation for the pre-processing and modeling stages.

### Notebook Setup

In [1]:
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data Loading

In [4]:
# Loading dataset
cancer_df = pd.read_csv('D:/lc-mortality/data/Lung Cancer.csv')

# View dataset
cancer_df.head(10)

,id,age,gender,country,diagnosis_date,cancer_stage,family_history,smoking_status,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,treatment_type,end_treatment_date,survived
0,1,64.0,Male,Sweden,2016-04-05,Stage I,Yes,Passive Smoker,29.4,199,0,0,1,0,Chemotherapy,2017-09-10,0
1,2,50.0,Female,Netherlands,2023-04-20,Stage III,Yes,Passive Smoker,41.2,280,1,1,0,0,Surgery,2024-06-17,1
2,3,65.0,Female,Hungary,2023-04-05,Stage III,Yes,Former Smoker,44.0,268,1,1,0,0,Combined,2024-04-09,0
3,4,51.0,Female,Belgium,2016-02-05,Stage I,No,Passive Smoker,43.0,241,1,1,0,0,Chemotherapy,2017-04-23,0
4,5,37.0,Male,Luxembourg,2023-11-29,Stage I,No,Passive Smoker,19.7,178,0,0,0,0,Combined,2025-01-08,0
5,6,50.0,Male,Italy,2023-01-02,Stage I,No,Never Smoked,37.6,274,1,0,0,0,Radiation,2024-12-27,0
6,7,49.0,Female,Croatia,2018-05-21,Stage III,Yes,Passive Smoker,43.1,259,0,0,0,0,Radiation,2019-05-06,1
7,8,51.0,Male,Denmark,2017-02-18,Stage IV,Yes,Former Smoker,25.8,195,1,1,0,0,Combined,2017-08-26,0
8,9,64.0,Male,Sweden,2021-03-21,Stage III,Yes,Current Smoker,21.5,236,0,0,0,0,Chemotherapy,2022-03-07,0
9,10,56.0,Male,Hungary,2021-11-30,Stage IV,Yes,Current Smoker,17.3,183,1,0,0,1,Surgery,2023-11-29,0


In [5]:
# View datatypes 
cancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 890000 entries, 0 to 889999
Data columns (total 17 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  890000 non-null  int64  
 1   age                 890000 non-null  float64
 2   gender              890000 non-null  object 
 3   country             890000 non-null  object 
 4   diagnosis_date      890000 non-null  object 
 5   cancer_stage        890000 non-null  object 
 6   family_history      890000 non-null  object 
 7   smoking_status      890000 non-null  object 
 8   bmi                 890000 non-null  float64
 9   cholesterol_level   890000 non-null  int64  
 10  hypertension        890000 non-null  int64  
 11  asthma              890000 non-null  int64  
 12  cirrhosis           890000 non-null  int64  
 13  other_cancer        890000 non-null  int64  
 14  treatment_type      890000 non-null  object 
 15  end_treatment_date  890000 non-null  object 
 16  survived            890000 non-null  int64  

In [7]:
# Viewing number of rows and columns
cancer_df.shape

(890000, 17)

### Dataset Observation  

The dataset comprises **890,000 patient records** with **17 attributes** describing demographic, lifestyle, and clinical characteristics. The **dependent variable** is the **`survived`** column, a binary indicator where:  

- `1` denotes that the patient **survived**, and  
- `0` denotes that the patient **did not survive**.  

The feature set includes a mix of **numerical** variables (e.g., age, BMI, cholesterol level) and **categorical** variables (e.g., gender, country, smoking status, cancer stage, treatment type). This combination supports both statistical and machine learning approaches to outcome prediction.
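The split between numerical and categorical features, and the balance of the target, can be checked with a few lines of pandas. The sketch below is illustrative only: `sample_df` is a small hand-made stand-in for `cancer_df` with a subset of its columns, and the variable names `numeric_cols`, `categorical_cols`, and `target_share` are hypothetical.

```python
import pandas as pd

# Toy stand-in for cancer_df; the values are illustrative, not the real data.
sample_df = pd.DataFrame({
    "age": [64.0, 50.0, 65.0, 51.0],
    "gender": ["Male", "Female", "Female", "Female"],
    "smoking_status": ["Passive Smoker", "Passive Smoker",
                       "Former Smoker", "Never Smoked"],
    "cholesterol_level": [199, 280, 268, 241],
    "survived": [0, 1, 0, 0],
})

# Separate the features from the target before splitting by dtype.
features = sample_df.drop(columns="survived")
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

# Proportion of each outcome class in the target column.
target_share = sample_df["survived"].value_counts(normalize=True)

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
print(target_share)
```

A dtype-based split like this would place the 0/1 comorbidity flags (`hypertension`, `asthma`, `cirrhosis`, `other_cancer`) among the numeric columns, so they may need to be handled separately if they are to be treated as categories.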


In [8]:
# Viewing number of null values
cancer_df.isnull().sum()

id                    0
age                   0
gender                0
country               0
diagnosis_date        0
cancer_stage          0
family_history        0
smoking_status        0
bmi                   0
cholesterol_level     0
hypertension          0
asthma                0
cirrhosis             0
other_cancer          0
treatment_type        0
end_treatment_date    0
survived              0
dtype: int64

In [None]:
# Viewing total number of duplicate entries
cancer_df.duplicated().sum()

0

In [11]:
# Viewing number of unique values per column
print(cancer_df.nunique())

id                    890000
age                       95
gender                     2
country                   27
diagnosis_date          3651
cancer_stage               4
family_history             2
smoking_status             4
bmi                      291
cholesterol_level        151
hypertension               2
asthma                     2
cirrhosis                  2
other_cancer               2
treatment_type             4
end_treatment_date      4194
survived                   2
dtype: int64


### Dataset Cleanliness & Description  

The dataset is in good condition, with no missing values and no duplicate entries, ensuring a reliable foundation for analysis. Each patient is uniquely identified by an `id`, which will later be dropped as it has no predictive value.  

The unique-value counts are consistent with what each column represents. **gender**, **family_history**, and the comorbidity flags (**hypertension**, **asthma**, **cirrhosis**, **other_cancer**) are binary. **cancer_stage**, **treatment_type**, and **smoking_status** each have four categories, and **country** has 27 distinct values.  
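Unique-value counts alone do not reveal inconsistent labels (for example, stray capitalisation), so it can help to list the actual levels of each low-cardinality column. The following is a minimal sketch on a toy frame; `sample_df` and the `MAX_LEVELS` threshold are assumptions for illustration, and a larger threshold would be needed on the real data to include a column such as `country`.

```python
import pandas as pd

# Toy stand-in for cancer_df; the values are illustrative, not the real data.
sample_df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "cancer_stage": ["Stage I", "Stage III", "Stage III", "Stage IV"],
    "hypertension": [0, 1, 1, 0],
    "bmi": [29.4, 41.2, 44.0, 25.8],
})

MAX_LEVELS = 3  # hypothetical cut-off for "low cardinality" in this toy example

# Keep only the columns whose number of distinct values is at or below the cut-off.
unique_counts = sample_df.nunique()
low_card_cols = unique_counts[unique_counts <= MAX_LEVELS].index.tolist()

# Collect the sorted levels of each such column for visual inspection.
levels = {col: sorted(sample_df[col].unique().tolist()) for col in low_card_cols}

for col, values in levels.items():
    print(f"{col}: {values}")
```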

Continuous variables such as **age** (95 unique values), **bmi** (291), and **cholesterol_level** (151) take a wide but plausible number of distinct values, which is expected in medical data. **diagnosis_date** and **end_treatment_date** contain 3,651 and 4,194 unique entries respectively, reflecting the day-level granularity of patient records; these will need to be converted into proper datetime objects for meaningful analysis.  
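One way the date conversion could look is sketched below, using the first three date pairs from the `head()` output above. The `format="%Y-%m-%d"` argument assumes every row follows the pattern seen in that preview, and the ordering check at the end is an extra sanity test that is not part of the original notebook.

```python
import pandas as pd

# First three date pairs from the head() output, used as a small sample.
sample_df = pd.DataFrame({
    "diagnosis_date": ["2016-04-05", "2023-04-20", "2023-04-05"],
    "end_treatment_date": ["2017-09-10", "2024-06-17", "2024-04-09"],
})

date_cols = ["diagnosis_date", "end_treatment_date"]
for col in date_cols:
    # errors="coerce" turns any string that does not match the format into NaT
    # instead of raising, so bad rows can be counted afterwards.
    sample_df[col] = pd.to_datetime(sample_df[col], format="%Y-%m-%d", errors="coerce")

# Number of values per column that failed to parse.
unparsed = sample_df[date_cols].isna().sum()

# Number of rows where treatment ends before the diagnosis date.
ends_before_start = int(
    (sample_df["end_treatment_date"] < sample_df["diagnosis_date"]).sum()
)

print(sample_df.dtypes)
print(unparsed)
print("Rows ending before diagnosis:", ends_before_start)
```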

The dependent variable, **`survived`**, is binary (1 = survived, 0 = did not survive). With a mix of categorical and continuous variables and no nulls or duplicates, the dataset is ready for exploratory analysis and preprocessing.  

Because no rows or values had to be removed or imputed, there is no need to create a separate “cleaned” version of the data. Only minor adjustments (dropping the `id` column and converting the date fields) will be applied during preprocessing.
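The checks performed in this notebook can be bundled into a small helper so they are easy to re-run if the source file changes. This is a sketch only: `check_clean`, `sample_df`, and `model_df` are hypothetical names, and the toy frame stands in for `cancer_df`.

```python
import pandas as pd


def check_clean(df: pd.DataFrame, id_col: str = "id") -> dict:
    """Summarise the cleanliness checks: nulls, duplicate rows, unique ids."""
    return {
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "ids_unique": bool(df[id_col].is_unique),
    }


# Toy stand-in for cancer_df; the values are illustrative, not the real data.
sample_df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [64.0, 50.0, 65.0],
    "survived": [0, 1, 0],
})

report = check_clean(sample_df)
print(report)

# The identifier carries no predictive information, so it is dropped
# once its uniqueness has been confirmed.
model_df = sample_df.drop(columns="id")
print(model_df.columns.tolist())
```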
