### Problem Statement

You are a data scientist / AI engineer at a medical research firm. You have been provided with a dataset named **`"cancer_data.csv"`**, which includes medical and lifestyle information for 1500 patients. The dataset is designed to predict the presence of cancer based on various features. The dataset comprises the following columns:

- `age:` Integer values representing the patient's age, ranging from 20 to 80.
- `gender:` Binary values representing gender, where 0 indicates Male and 1 indicates Female.
- `bmi:` Continuous values representing Body Mass Index, ranging from 15 to 40.
- `smoking:` Binary values indicating smoking status, where 0 means No and 1 means Yes.
- `genetic_risk:` Categorical values representing genetic risk levels for cancer, with 0 indicating Low, 1 indicating Medium, and 2 indicating High.
- `physical_activity:` Continuous values representing the number of hours per week spent on physical activities, ranging from 0 to 10.
- `alcohol_intake:` Continuous values representing the number of alcohol units consumed per week, ranging from 0 to 5.
- `cancer_history:` Binary values indicating whether the patient has a personal history of cancer, where 0 means No and 1 means Yes.
- `diagnosis:` Binary values indicating the cancer diagnosis status, where 0 indicates No Cancer and 1 indicates Cancer.

  
Your task is to use this dataset to build and compare the performance of Decision Tree and Random Forest models to predict the presence of cancer. Additionally, explore various parameters of the RandomForestClassifier to enhance model performance.

**Dataset credits:** Rabie El Kharoua (https://www.kaggle.com/datasets/rabieelkharoua/cancer-prediction-dataset)

In [1]:
#import necessary libraries
import pandas as pd
df = pd.read_csv("cancer_data.csv")
df.head()

Unnamed: 0,age,gender,bmi,smoking,genetic_risk,physical_activity,alcohol_intake,cancer_history,diagnosis
0,58,1,16.085313,0,1,8.146251,4.148219,1,1
1,71,0,30.828784,0,1,9.36163,3.519683,0,0
2,48,1,38.785084,0,2,5.135179,4.728368,0,1
3,34,0,30.040295,0,0,9.502792,2.044636,0,0
4,62,1,35.479721,0,0,5.35689,3.309849,0,1


### Data Preparation and Exploration

In [3]:
df.shape

(1500, 9)

In [5]:
# Check for any missing values in the dataset
df.isna().sum()

age                  0
gender               0
bmi                  0
smoking              0
genetic_risk         0
physical_activity    0
alcohol_intake       0
cancer_history       0
diagnosis            0
dtype: int64

### Model Training Using Decision Tree Classifier

In [10]:
from sklearn.model_selection import train_test_split

X = df[['age', 'gender', 'bmi', 'smoking', 'genetic_risk', 'physical_activity', 'alcohol_intake', 'cancer_history']]
y = df['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [12]:
# Initialize and train a Decision Tree Classifier model using the training data        # ifpcr
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

# Initialize the model
dt = DecisionTreeClassifier()

# fit the model
dt.fit(X_train, y_train)  

# Make predictions
y_pred = dt.predict(X_test)

# classification report
cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.90      0.89      0.89       233
           1       0.83      0.83      0.83       142

    accuracy                           0.87       375
   macro avg       0.86      0.86      0.86       375
weighted avg       0.87      0.87      0.87       375



### Model Training Using Random Forest Classifier

In [15]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
rf = RandomForestClassifier(n_estimators=25)

# Fit the model
rf.fit(X_train, y_train)

# Make predictions using Random Forest, not Decision Tree
y_pred = rf.predict(X_test)

# Classification report
cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.92      0.94      0.93       233
           1       0.90      0.86      0.88       142

    accuracy                           0.91       375
   macro avg       0.91      0.90      0.91       375
weighted avg       0.91      0.91      0.91       375



### Exploring Various Parameters in Random Forest Classifier.

In [18]:
f = RandomForestClassifier(
    n_estimators=50,
    max_features='log2',
    criterion='entropy',
    bootstrap=False,
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=3,
    random_state=42  # Optional but helps with reproducibility
)

# Fit the model
rf.fit(X_train, y_train)

# Make predictions using Random Forest, not Decision Tree
y_pred = rf.predict(X_test)

# Classification report
cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.93      0.96      0.94       233
           1       0.93      0.87      0.90       142

    accuracy                           0.93       375
   macro avg       0.93      0.92      0.92       375
weighted avg       0.93      0.93      0.92       375



In [None]:
# overall accuracy and recall has improved which is important in this case