# Task
Using the lung cancer patient diagnosis data in "/content/dataset_med.csv", build an end-to-end, beginner-level project that predicts patient survival.

## Load the dataset

### Subtask:
Load the data from the provided CSV file into a pandas DataFrame.


**Reasoning**:
Import the pandas library, read the CSV file into a DataFrame, and display the first 5 rows.



In [1]:
import pandas as pd

df = pd.read_csv('/content/dataset_med.csv')
display(df.head())

Unnamed: 0,id,age,gender,country,diagnosis_date,cancer_stage,family_history,smoking_status,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,treatment_type,end_treatment_date,survived
0,1,64.0,Male,Sweden,2016-04-05,Stage I,Yes,Passive Smoker,29.4,199,0,0,1,0,Chemotherapy,2017-09-10,0
1,2,50.0,Female,Netherlands,2023-04-20,Stage III,Yes,Passive Smoker,41.2,280,1,1,0,0,Surgery,2024-06-17,1
2,3,65.0,Female,Hungary,2023-04-05,Stage III,Yes,Former Smoker,44.0,268,1,1,0,0,Combined,2024-04-09,0
3,4,51.0,Female,Belgium,2016-02-05,Stage I,No,Passive Smoker,43.0,241,1,1,0,0,Chemotherapy,2017-04-23,0
4,5,37.0,Male,Luxembourg,2023-11-29,Stage I,No,Passive Smoker,19.7,178,0,0,0,0,Combined,2025-01-08,0


## Explore the data

### Subtask:
Perform initial data exploration, including checking for missing values, understanding data types, and examining the distribution of key features and the target variable (`survived`).


**Reasoning**:
Display column names and data types, check for missing values, display descriptive statistics for numerical columns, and show value counts for the target variable.



In [2]:
df.info()
print("\nMissing values per column:")
print(df.isnull().sum())
print("\nDescriptive statistics for numerical columns:")
display(df.describe())
print("\nValue counts for 'survived':")
display(df['survived'].value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 890000 entries, 0 to 889999
Data columns (total 17 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  890000 non-null  int64  
 1   age                 890000 non-null  float64
 2   gender              890000 non-null  object 
 3   country             890000 non-null  object 
 4   diagnosis_date      890000 non-null  object 
 5   cancer_stage        890000 non-null  object 
 6   family_history      890000 non-null  object 
 7   smoking_status      890000 non-null  object 
 8   bmi                 890000 non-null  float64
 9   cholesterol_level   890000 non-null  int64  
 10  hypertension        890000 non-null  int64  
 11  asthma              890000 non-null  int64  
 12  cirrhosis           890000 non-null  int64  
 13  other_cancer        890000 non-null  int64  
 14  treatment_type      890000 non-null  object 
 15  end_treatment_date  890000 non-nul

Unnamed: 0,id,age,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,survived
count,890000.0,890000.0,890000.0,890000.0,890000.0,890000.0,890000.0,890000.0,890000.0
mean,445000.5,55.007008,30.494172,233.633916,0.750024,0.46974,0.225956,0.088157,0.220229
std,256921.014128,9.994485,8.368539,43.432278,0.432999,0.499084,0.418211,0.283524,0.414401
min,1.0,4.0,16.0,150.0,0.0,0.0,0.0,0.0,0.0
25%,222500.75,48.0,23.3,196.0,1.0,0.0,0.0,0.0,0.0
50%,445000.5,55.0,30.5,242.0,1.0,0.0,0.0,0.0,0.0
75%,667500.25,62.0,37.7,271.0,1.0,1.0,0.0,0.0,0.0
max,890000.0,104.0,45.0,300.0,1.0,1.0,1.0,1.0,1.0



Value counts for 'survived':


survived,count
0,693996
1,196004
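
Raw counts make the imbalance visible, but the class shares are easier to reason about. The sketch below rebuilds the two counts printed above into a small Series (the variable names `counts` and `shares` are illustrative, not part of the notebook) and converts them to proportions.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 693996, 1: 196004}, name="count")

# Convert the counts to the share of each class.
shares = counts / counts.sum()
print(shares.round(4))
```

In the notebook itself, `df['survived'].value_counts(normalize=True)` should give the same proportions directly.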


## Preprocess the data

### Subtask:
Handle missing values, encode categorical variables, and potentially perform feature scaling if necessary.


**Reasoning**:
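Identify and encode categorical columns using one-hot encoding. The exploration step found no missing values, so no imputation is needed, and feature scaling is skipped because the tree-based model trained later does not depend on feature scale.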



In [3]:
categorical_cols = ['gender', 'country', 'cancer_stage', 'family_history', 'smoking_status', 'treatment_type']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
display(df.head())

Unnamed: 0,id,age,diagnosis_date,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,end_treatment_date,...,cancer_stage_Stage II,cancer_stage_Stage III,cancer_stage_Stage IV,family_history_Yes,smoking_status_Former Smoker,smoking_status_Never Smoked,smoking_status_Passive Smoker,treatment_type_Combined,treatment_type_Radiation,treatment_type_Surgery
0,1,64.0,2016-04-05,29.4,199,0,0,1,0,2017-09-10,...,False,False,False,True,False,False,True,False,False,False
1,2,50.0,2023-04-20,41.2,280,1,1,0,0,2024-06-17,...,False,True,False,True,False,False,True,False,False,True
2,3,65.0,2023-04-05,44.0,268,1,1,0,0,2024-04-09,...,False,True,False,True,True,False,False,True,False,False
3,4,51.0,2016-02-05,43.0,241,1,1,0,0,2017-04-23,...,False,False,False,False,False,False,True,False,False,False
4,5,37.0,2023-11-29,19.7,178,0,0,0,0,2025-01-08,...,False,False,False,False,False,False,True,True,False,False
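
To see what `drop_first=True` does, here is a minimal sketch on a toy frame (the `toy` DataFrame is made up for illustration). The first category in sorted order, `Stage I`, gets no column of its own and is represented by all remaining indicator columns being `False`.

```python
import pandas as pd

# Toy data: one categorical column with the four stage labels from the dataset.
toy = pd.DataFrame(
    {"cancer_stage": ["Stage I", "Stage II", "Stage III", "Stage IV", "Stage I"]}
)

# One-hot encode and drop the first category to avoid a redundant column.
encoded = pd.get_dummies(toy, columns=["cancer_stage"], drop_first=True)

print(encoded.columns.tolist())
print(encoded.iloc[0].tolist())  # the "Stage I" row: every indicator is False
```

This matches the encoded output above, where `cancer_stage_Stage I` does not appear among the columns.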


## Feature engineering

### Subtask:
Create new features that could be relevant for prediction, such as calculating the duration between diagnosis and end of treatment.


**Reasoning**:
Convert the date columns to datetime objects, calculate the difference in days, store it in a new column, and then drop the original date columns.



In [4]:
df['diagnosis_date'] = pd.to_datetime(df['diagnosis_date'])
df['end_treatment_date'] = pd.to_datetime(df['end_treatment_date'])
df['treatment_duration_days'] = (df['end_treatment_date'] - df['diagnosis_date']).dt.days
df = df.drop(columns=['diagnosis_date', 'end_treatment_date'])
display(df.head())

Unnamed: 0,id,age,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,survived,gender_Male,...,cancer_stage_Stage III,cancer_stage_Stage IV,family_history_Yes,smoking_status_Former Smoker,smoking_status_Never Smoked,smoking_status_Passive Smoker,treatment_type_Combined,treatment_type_Radiation,treatment_type_Surgery,treatment_duration_days
0,1,64.0,29.4,199,0,0,1,0,0,True,...,False,False,True,False,False,True,False,False,False,523
1,2,50.0,41.2,280,1,1,0,0,1,False,...,True,False,True,False,False,True,False,False,True,424
2,3,65.0,44.0,268,1,1,0,0,0,False,...,True,False,True,True,False,False,True,False,False,370
3,4,51.0,43.0,241,1,1,0,0,0,False,...,False,False,False,False,False,True,False,False,False,443
4,5,37.0,19.7,178,0,0,0,0,0,True,...,False,False,False,False,False,True,True,False,False,406
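
The date arithmetic can be checked on a small scale. This sketch reuses the first two date pairs from the dataset preview and adds a sanity check for negative durations, which is an extra step assumed here and not part of the original notebook.

```python
import pandas as pd

# First two date pairs from the dataset preview.
toy = pd.DataFrame(
    {
        "diagnosis_date": ["2016-04-05", "2023-04-20"],
        "end_treatment_date": ["2017-09-10", "2024-06-17"],
    }
)

# Parse the strings into datetimes so they can be subtracted.
for col in ["diagnosis_date", "end_treatment_date"]:
    toy[col] = pd.to_datetime(toy[col])

# Subtracting two datetime columns gives a Timedelta; .dt.days extracts whole days.
toy["treatment_duration_days"] = (
    toy["end_treatment_date"] - toy["diagnosis_date"]
).dt.days

# Sanity check: treatment should not end before the diagnosis.
n_negative = int((toy["treatment_duration_days"] < 0).sum())

print(toy["treatment_duration_days"].tolist())
print("Rows with negative duration:", n_negative)
```

Note that the feature measures the time from diagnosis to the end of treatment, so it also includes any waiting period before treatment started.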


## Split the data

### Subtask:
Split the preprocessed data into training and testing sets.


**Reasoning**:
Split the data into training and testing sets using train_test_split and print the shapes of the resulting sets.



In [5]:
from sklearn.model_selection import train_test_split

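# Separate the features from the target.
# Note: `id` is still among the features here; it is a row identifier
# with no predictive meaning and is a candidate for removal.
X = df.drop('survived', axis=1)
y = df['survived']

# Hold out 25% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)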

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (667500, 46)
Shape of X_test: (222500, 46)
Shape of y_train: (667500,)
Shape of y_test: (222500,)
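
Because the target is imbalanced, one option is a stratified split, which keeps the class ratio the same in both sets. The sketch below is a variation on the split above, not what the notebook ran; it uses synthetic data (`X_toy`, `y_toy`) with roughly the same 78/22 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 22% positive class.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 220 + [0] * 780)

# stratify=y_toy preserves the class proportions in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```

With 890,000 rows a plain random split will already be close to the overall ratio, so stratification matters more on smaller datasets.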


## Build a model

### Subtask:
Choose and train a suitable classification model for survival prediction.


**Reasoning**:
Import and train a RandomForestClassifier model on the training data.



In [6]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

## Evaluate the model

### Subtask:
Evaluate the trained model's performance using appropriate metrics.


**Reasoning**:
Use the trained model to make predictions on the test set and then calculate and print the evaluation metrics.



In [7]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Accuracy: 0.7788
Precision: 0.1538
Recall: 0.0001
F1-score: 0.0002
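
An accuracy near 0.78 looks reasonable until it is compared with a trivial baseline. The sketch below uses made-up labels with a 78/22 split (not the real test set) to show what a model that always predicts class 0 would score, together with its confusion matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 78 non-survivors (0) and 22 survivors (1).
y_true = np.array([0] * 78 + [1] * 22)

# A trivial "model" that always predicts the majority class.
y_always_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_zero)
rec = recall_score(y_true, y_always_zero, zero_division=0)
cm = confusion_matrix(y_true, y_always_zero)

print(f"Accuracy: {acc:.2f}")
print(f"Recall:   {rec:.2f}")
print(cm)  # rows are true classes, columns are predicted classes
```

Running `confusion_matrix(y_test, y_pred)` on the notebook's own predictions would show how close the trained model is to this always-zero behavior.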


## Make predictions

### Subtask:
Use the trained model to make predictions on new data.


**Reasoning**:
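No separate set of new patients is available, so the held-out test set stands in for new data. Use the trained model to make predictions on the test set and store them in a variable named `predictions` (these are the same values as `y_pred` from the evaluation step).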



In [8]:
predictions = model.predict(X_test)
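
Predicting for a genuinely new patient requires the same preprocessing as the training data, including the same one-hot columns in the same order. The sketch below shows one way to do that on a tiny made-up dataset; `train_raw`, `new_patient`, and `clf` are hypothetical names, and the approach (aligning columns with `reindex`) is an assumption rather than something the notebook does.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training set with one numeric and one categorical feature.
train_raw = pd.DataFrame(
    {
        "age": [64, 50, 65, 51, 37, 58],
        "treatment_type": [
            "Chemotherapy", "Surgery", "Combined",
            "Chemotherapy", "Combined", "Radiation",
        ],
    }
)
y_toy = [0, 1, 0, 0, 0, 1]

# Same encoding as in the preprocessing step.
X_toy = pd.get_dummies(train_raw, columns=["treatment_type"], drop_first=True)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# A single new patient contains only one category, so encoding it alone
# produces fewer columns than the training data has.
new_patient = pd.DataFrame({"age": [60], "treatment_type": ["Surgery"]})
new_X = pd.get_dummies(new_patient, columns=["treatment_type"])

# Align to the training columns: add missing indicators as False, drop extras.
new_X = new_X.reindex(columns=X_toy.columns, fill_value=False)

new_pred = clf.predict(new_X)
print(new_pred)
```

In a larger project, a scikit-learn `Pipeline` with `OneHotEncoder` would handle this alignment automatically.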

## Summary:

### Data Analysis Key Findings

*   The dataset contains 890,000 entries with 17 columns and no missing values.
*   The target variable, `survived`, is imbalanced: 693,996 patients (about 78%) did not survive and 196,004 (about 22%) survived.
*   Categorical features were successfully one-hot encoded.
*   A new feature, `treatment_duration_days`, was created by calculating the difference between `end_treatment_date` and `diagnosis_date`.
*   The data was split into training (75%) and testing (25%) sets, resulting in 667,500 training samples and 222,500 testing samples.
*   A RandomForestClassifier model was trained on the training data.
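*   The model reached an accuracy of 0.7788, but precision (0.1538), recall (0.0001), and F1-score (0.0002) for the `survived = 1` class were very low. The accuracy is close to the roughly 78% share of non-survivors, so it mostly reflects predicting the majority class.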

### Insights or Next Steps

*   The near-zero recall and F1-score show that the model almost never predicts survival, so it is not effective at identifying patients who survived. The imbalanced target variable is a likely contributor.
*   Next steps should address the class imbalance, for example with oversampling, undersampling, or class weights, and should report metrics better suited to imbalanced datasets (e.g., AUC-ROC). Dropping the `id` column from the features, tuning the model, or exploring different algorithms might also improve performance on the minority class.
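
As a starting point for these next steps, the sketch below trains a random forest with `class_weight="balanced"` and reports ROC AUC alongside recall. It runs on synthetic data from `make_classification` with a similar 78/22 class ratio, so the numbers it prints say nothing about the real dataset; whether class weighting helps there is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real features and target.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=42
)

# class_weight="balanced" weights each class inversely to its frequency.
clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
clf.fit(X_tr, y_tr)

# ROC AUC is computed from predicted probabilities, not hard class labels.
proba = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
rec = recall_score(y_te, clf.predict(X_te))

print(f"ROC AUC: {auc:.4f}")
print(f"Recall:  {rec:.4f}")
```

If the real features carry little signal about survival, rebalancing alone may not raise recall much, so comparing against the majority-class baseline remains useful.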
