### Problem Statement
The dataset supports predicting the presence of cancer from patient features. It comprises the following columns:

- age: Integer values representing the patient's age, ranging from 20 to 80.
- gender: Binary values representing gender, where 0 indicates Male and 1 indicates Female.
- bmi: Continuous values representing Body Mass Index, ranging from 15 to 40.
- smoking: Binary values indicating smoking status, where 0 means No and 1 means Yes.
- genetic_risk: Categorical values representing genetic risk levels for cancer, with 0 indicating Low, 1 indicating Medium, and 2 indicating High.
- physical_activity: Continuous values representing the number of hours per week spent on physical activities, ranging from 0 to 10.
- alcohol_intake: Continuous values representing the number of alcohol units consumed per week, ranging from 0 to 5.
- cancer_history: Binary values indicating whether the patient has a personal history of cancer, where 0 means No and 1 means Yes.
- diagnosis: Binary values indicating the cancer diagnosis status, where 0 indicates No Cancer and 1 indicates Cancer.
    
The task is to use this dataset to build Decision Tree and Random Forest models that predict the presence of cancer and to compare their performance. Additionally, explore various parameters of the RandomForestClassifier to see whether they enhance model performance.

Dataset credits: Rabie El Kharoua (https://www.kaggle.com/datasets/rabieelkharoua/cancer-prediction-dataset)

**Import Necessary Libraries**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

### Data Preparation and Exploration

In [2]:
# Step 1: Import the data from the "cancer_data.csv" file and store it in a variable 'df'
df = pd.read_csv("cancer_data.csv")

# Step 2: Display the number of rows and columns in the dataset
print("Number of rows and columns:", df.shape)

# Step 3: Display the first few rows of the dataset to get an overview
print("First few rows of the dataset:")
df.head()

Number of rows and columns: (1500, 9)
First few rows of the dataset:


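|   | age | gender | bmi | smoking | genetic_risk | physical_activity | alcohol_intake | cancer_history | diagnosis |
|---|-----|--------|-----|---------|--------------|-------------------|----------------|----------------|-----------|
| 0 | 58 | 1 | 16.085313 | 0 | 1 | 8.146251 | 4.148219 | 1 | 1 |
| 1 | 71 | 0 | 30.828784 | 0 | 1 | 9.361630 | 3.519683 | 0 | 0 |
| 2 | 48 | 1 | 38.785084 | 0 | 2 | 5.135179 | 4.728368 | 0 | 1 |
| 3 | 34 | 0 | 30.040295 | 0 | 0 | 9.502792 | 2.044636 | 0 | 0 |
| 4 | 62 | 1 | 35.479721 | 0 | 0 | 5.356890 | 3.309849 | 0 | 1 |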


In [3]:
# Step 4: Check for any missing values in the dataset
print("Missing values in the dataset:")
print(df.isna().sum())

Missing values in the dataset:
age                  0
gender               0
bmi                  0
smoking              0
genetic_risk         0
physical_activity    0
alcohol_intake       0
cancer_history       0
diagnosis            0
dtype: int64
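
Beyond missing values, the column ranges listed in the problem statement can also be checked. The following is a minimal sketch, not part of the original notebook: `EXPECTED_RANGES`, `out_of_range_counts`, and the synthetic `demo_df` are hypothetical names introduced here so the block runs without `cancer_data.csv`.

```python
import numpy as np
import pandas as pd

# Expected (min, max) per column, copied from the problem statement.
EXPECTED_RANGES = {
    "age": (20, 80),
    "gender": (0, 1),
    "bmi": (15, 40),
    "smoking": (0, 1),
    "genetic_risk": (0, 2),
    "physical_activity": (0, 10),
    "alcohol_intake": (0, 5),
    "cancer_history": (0, 1),
    "diagnosis": (0, 1),
}


def out_of_range_counts(frame: pd.DataFrame) -> pd.Series:
    """Count, per column, the values outside the documented range.

    Missing values are counted as out of range, because
    Series.between returns False for NaN.
    """
    counts = {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in EXPECTED_RANGES.items()
    }
    return pd.Series(counts, name="out_of_range")


# Synthetic stand-in for the real data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    "age": rng.integers(20, 81, n),
    "gender": rng.integers(0, 2, n),
    "bmi": rng.uniform(15, 40, n),
    "smoking": rng.integers(0, 2, n),
    "genetic_risk": rng.integers(0, 3, n),
    "physical_activity": rng.uniform(0, 10, n),
    "alcohol_intake": rng.uniform(0, 5, n),
    "cancer_history": rng.integers(0, 2, n),
    "diagnosis": rng.integers(0, 2, n),
})

print(out_of_range_counts(demo_df))
```

In the notebook itself, the call would presumably be `out_of_range_counts(df)`; any non-zero count points at rows worth inspecting before modeling.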


### Model Training Using Decision Tree Classifier

In [4]:
# Step 1: Select the features and target variables for modeling
X = df[["age", "gender", "bmi", "smoking", "genetic_risk", "physical_activity", "alcohol_intake", "cancer_history"]]
y = df["diagnosis"]

# Step 2: Split the data into training and test sets with a test size of 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)
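
The split above does not pass `stratify`, so the share of cancer cases in the training and test sets is left to chance. A possible variant is sketched below on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`, and the 40% positive rate is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (illustrative only).
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(400, 3))
y_demo = (rng.random(400) < 0.4).astype(int)

# stratify=y_demo keeps the class proportions nearly identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=5, stratify=y_demo
)

print("positive share, full data:", y_demo.mean())
print("positive share, train:    ", y_tr.mean())
print("positive share, test:     ", y_te.mean())
```

Applied to the notebook, this would amount to adding `stratify=y` to the existing `train_test_split` call. Doing so changes which rows land in the test set, so the reports below would no longer match exactly.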

In [5]:
# Step 3: Initialize and train a Decision Tree Classifier model using the training data
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
 
# Step 4: Make predictions on the test data using the trained model
y_pred_dt = dt_model.predict(X_test)

# Step 5: Evaluate the model using a classification report and print the report
report_dt = classification_report(y_test, y_pred_dt)
print("Decision Tree Classification Report:")
print(report_dt)

Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.90      0.89       231
           1       0.84      0.78      0.81       144

    accuracy                           0.86       375
   macro avg       0.85      0.84      0.85       375
weighted avg       0.86      0.86      0.85       375
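
The report shows the lowest score on recall for class 1 (0.78), which corresponds to cancer cases the model misses. A confusion matrix makes those misses explicit as counts. The sketch below uses small hand-made label arrays rather than the notebook's predictions; `error_breakdown` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def error_breakdown(y_true, y_pred):
    """Return TN, FP, FN, TP counts for binary labels 0/1."""
    # With labels=[0, 1], rows are true classes and columns are predictions,
    # so ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {"tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp)}


# Hand-made example: one false positive and one false negative.
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 1, 0, 1, 1, 0])

breakdown = error_breakdown(y_true_demo, y_pred_demo)
print(breakdown)  # {'tn': 3, 'fp': 1, 'fn': 1, 'tp': 3}

# Recall for the positive class: TP / (TP + FN)
recall_pos = breakdown["tp"] / (breakdown["tp"] + breakdown["fn"])
print(recall_pos)  # 0.75
```

In the notebook, `error_breakdown(y_test, y_pred_dt)` would give the corresponding counts for the decision tree.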



### Model Training Using Random Forest Classifier

In [6]:
# Step 1: Initialize and train a Random Forest Classifier model using training data
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Step 2: Make predictions on the test set using the trained model
y_pred_rf = rf_model.predict(X_test)

# Step 3: Evaluate the model using a classification report and print the report
report_rf = classification_report(y_test, y_pred_rf)
print("Random Forest Classification Report:")
print(report_rf)

Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.97      0.95       231
           1       0.95      0.88      0.91       144

    accuracy                           0.93       375
   macro avg       0.94      0.92      0.93       375
weighted avg       0.93      0.93      0.93       375
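
Both reports come from a single train/test split, and neither model sets `random_state`, so the exact figures can shift between runs. One way to make the comparison steadier is k-fold cross-validation with fixed seeds. This is a sketch under assumptions: the data comes from `make_classification` rather than `cancer_data.csv`, and the seed, fold count, and `n_estimators=50` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 8-feature binary problem standing in for the cancer dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=5
)

# The same shuffled, stratified folds are used for both models.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=5),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=5),
}

cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the mean and spread across folds helps judge whether a gap like 0.86 versus 0.93 is larger than the variation from one split to another.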



### Exploring Various Parameters in Random Forest Classifier

In [7]:
# Step 1: Train a Random Forest model with specified parameters
rf_params = {
    'n_estimators':50,
    'max_features':'log2',
    'criterion':'entropy',
    'bootstrap':False,
    'max_depth':15,
    'min_samples_split':5,
    'min_samples_leaf':3
}
rf_model_custom = RandomForestClassifier(**rf_params)
rf_model_custom.fit(X_train, y_train)

# Step 2: Make predictions on the test set using the trained model
y_pred_rf_custom = rf_model_custom.predict(X_test)

# Step 3: Evaluate the model using a classification report and print the report
# (stored under its own name so the default model's report_rf is not overwritten)
report_rf_custom = classification_report(y_test, y_pred_rf_custom)
print("Random Forest Custom Classification Report:")
print(report_rf_custom)

Random Forest Custom Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.97      0.94       231
           1       0.94      0.88      0.91       144

    accuracy                           0.93       375
   macro avg       0.93      0.92      0.93       375
weighted avg       0.93      0.93      0.93       375
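
The parameters above were chosen by hand. A more systematic way to explore them is a cross-validated grid search, sketched below. The grid values are hypothetical and loosely centered on the hand-picked settings, the data is synthetic, and `scoring="f1"` is an assumption made here because recall on the cancer class is the weaker metric in the reports.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cancer dataset (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=5
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=5, stratify=y_demo
)

# Small, hypothetical grid: 2 * 2 * 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [5, 15],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=5),
    param_grid,
    cv=3,          # 3-fold cross-validation on the training set only
    scoring="f1",  # F1 of the positive class
)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("best cross-validated F1:", round(search.best_score_, 3))
print("test accuracy of refit model:", round(search.score(X_te, y_te), 3))
```

Note that `search.score` uses the `scoring` metric given to `GridSearchCV`, so the last line actually prints the test F1 rather than accuracy; for accuracy, `search.best_estimator_.score(X_te, y_te)` can be used instead. Keeping the test set out of the search avoids tuning the parameters to the same rows that are later used for the final report.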



### Conclusion 

We built and evaluated several machine learning models to predict the presence of cancer. Here are the key findings:
1. Decision Tree Classifier:
- Accuracy: 0.86
- The Decision Tree Classifier serves as the baseline. Its weakest result is recall on the cancer class (0.78).

2. Random Forest Classifier:
- Accuracy with default parameters: 0.93
- Accuracy with custom parameters: 0.93
- The Random Forest Classifier outperformed the Decision Tree Classifier with both default and custom parameters.

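Overall, the Random Forest Classifier provided the best performance for predicting the presence of cancer in the dataset. The custom parameters did not improve on the defaults: both configurations reached an accuracy of 0.93 on the test set, and their per-class precision, recall, and F1 scores differ by at most 0.01.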