### Problem Statement

You are a data scientist / AI engineer at a medical research firm. You have been provided with a dataset named **`"cancer_data.csv"`**, which includes medical and lifestyle information for 1500 patients. The dataset is designed to predict the presence of cancer based on various features. The dataset comprises the following columns:

- `age`: Integer values representing the patient's age, ranging from 20 to 80.
- `gender`: Binary values representing gender, where 0 indicates Male and 1 indicates Female.
- `bmi`: Continuous values representing Body Mass Index, ranging from 15 to 40.
- `smoking`: Binary values indicating smoking status, where 0 means No and 1 means Yes.
- `genetic_risk`: Categorical values representing genetic risk levels for cancer, with 0 indicating Low, 1 indicating Medium, and 2 indicating High.
- `physical_activity`: Continuous values representing the number of hours per week spent on physical activities, ranging from 0 to 10.
- `alcohol_intake`: Continuous values representing the number of alcohol units consumed per week, ranging from 0 to 5.
- `cancer_history`: Binary values indicating whether the patient has a personal history of cancer, where 0 means No and 1 means Yes.
- `diagnosis`: Binary values indicating the cancer diagnosis status, where 0 indicates No Cancer and 1 indicates Cancer.

  
Your task is to use this dataset to build and compare the performance of Decision Tree and Random Forest models to predict the presence of cancer. Additionally, explore various parameters of the RandomForestClassifier to enhance model performance.

**Dataset credits:** Rabie El Kharoua (https://www.kaggle.com/datasets/rabieelkharoua/cancer-prediction-dataset)

**Import Necessary Libraries**

In [2]:
#import necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

### Task 1: Data Preparation and Exploration

1. Import the data from the `"cancer_data.csv"` file and store it in a variable df.
2. Display the number of rows and columns in the dataset.
3. Display the first few rows of the dataset to get an overview.
4. Check for any missing values in the dataset.

In [5]:
# Step 1: Import the data from the "cancer_data.csv" file and store it in a variable 'df'
df = pd.read_csv("cancer_data.csv")

# Step 2: Display the number of rows and columns in the dataset
print(df.shape)

# Step 3: Display some rows of the dataset to get an overview
# Note: sample(10) returns 10 random rows; df.head() would return the first five instead.
df.sample(10)

(1500, 9)


| | age | gender | bmi | smoking | genetic_risk | physical_activity | alcohol_intake | cancer_history | diagnosis |
|---|---|---|---|---|---|---|---|---|---|
| 1143 | 64 | 1 | 29.878292 | 1 | 0 | 4.86405 | 3.291419 | 0 | 1 |
| 839 | 38 | 1 | 30.026854 | 0 | 2 | 3.58219 | 1.666209 | 0 | 1 |
| 1364 | 60 | 0 | 33.23362 | 0 | 1 | 2.541473 | 4.362314 | 0 | 0 |
| 349 | 68 | 1 | 19.700618 | 1 | 0 | 3.295304 | 2.452272 | 0 | 0 |
| 997 | 42 | 1 | 27.190695 | 0 | 0 | 3.830588 | 2.568167 | 0 | 0 |
| 179 | 34 | 1 | 39.917421 | 0 | 1 | 9.859342 | 2.122863 | 1 | 1 |
| 1101 | 74 | 1 | 26.44249 | 1 | 0 | 6.178132 | 1.068683 | 0 | 1 |
| 48 | 74 | 0 | 21.854017 | 1 | 1 | 0.522791 | 1.047034 | 0 | 0 |
| 648 | 51 | 1 | 26.508957 | 1 | 1 | 6.618603 | 3.38652 | 0 | 1 |
| 414 | 74 | 0 | 35.995057 | 0 | 0 | 9.487905 | 0.644895 | 0 | 0 |


In [6]:
# Step 4: Check for any missing values in the dataset
df.isna().sum()

age                  0
gender               0
bmi                  0
smoking              0
genetic_risk         0
physical_activity    0
alcohol_intake       0
cancer_history       0
diagnosis            0
dtype: int64
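
Beyond missing values, the column ranges listed in the problem statement can also be checked. The following is a minimal sketch, not part of the original exercise: the helper name `find_out_of_range` and the toy frame are hypothetical, and the bounds are copied from the column descriptions above. On the real data you would call it as `find_out_of_range(df)`.

```python
import pandas as pd

# Expected (min, max) per column, taken from the column descriptions above.
EXPECTED_RANGES = {
    "age": (20, 80),
    "gender": (0, 1),
    "bmi": (15, 40),
    "smoking": (0, 1),
    "genetic_risk": (0, 2),
    "physical_activity": (0, 10),
    "alcohol_intake": (0, 5),
    "cancer_history": (0, 1),
    "diagnosis": (0, 1),
}


def find_out_of_range(frame: pd.DataFrame) -> dict:
    """Return {column: number of values outside the documented range}."""
    violations = {}
    for column, (low, high) in EXPECTED_RANGES.items():
        if column not in frame.columns:
            continue  # skip columns that are absent from this frame
        # between() is inclusive on both ends by default
        outside = ~frame[column].between(low, high)
        if outside.any():
            violations[column] = int(outside.sum())
    return violations


# Tiny hand-made frame (NOT rows from cancer_data.csv) with two deliberate errors.
toy = pd.DataFrame({
    "age": [25, 81, 40],
    "bmi": [22.5, 30.0, 14.0],
    "smoking": [0, 1, 1],
})
print(find_out_of_range(toy))
```

An empty dictionary means every checked column stays within its documented range.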

### Task 2: Model Training Using Decision Tree Classifier

1. Select the features `(age, gender, bmi, smoking, genetic_risk, physical_activity, alcohol_intake, cancer_history)` and the target variable `(diagnosis)` for modeling.
2. Split the data into training and test sets with a test size of 25%.
3. Initialize and train a Decision Tree Classifier model using the training data.
4. Make predictions on the test set using the trained model.
5. Evaluate the model using a classification report and print the report.

In [9]:
# Step 1: Select the features and target variable for modeling
X = df.drop(['diagnosis'], axis=1)
y = df['diagnosis']

# Step 2: Split the data into training and test sets
# Note: the task asks for a 25% test size, but this cell uses test_size=0.20,
# which is why the reports below show 300 test rows (20% of 1500).
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2)

In [14]:
# Step 3: Initialize and train a Decision Tree Classifier model using the training data
model = DecisionTreeClassifier()
model.fit(x_train, y_train)

# Step 4: Make predictions on the test set using the trained model
y_pred = model.predict(x_test)

# Step 5: Evaluate the model using a classification report and print the report
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.93      0.86      0.89       197
           1       0.77      0.87      0.82       103

    accuracy                           0.87       300
   macro avg       0.85      0.87      0.86       300
weighted avg       0.87      0.87      0.87       300
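
To make the precision and recall columns of the report concrete, here is a small sketch that derives both from a confusion matrix. The labels are hand-made for illustration only and are not taken from `cancer_data.csv`; the variable names are assumptions of this example.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels for illustration only (not taken from cancer_data.csv).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision: of all samples predicted as class 1, how many really are class 1.
precision = tp / (tp + fp)
# Recall: of all samples that really are class 1, how many were found.
recall = tp / (tp + fn)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}")

# The manual values agree with scikit-learn's own metric functions.
assert abs(precision - precision_score(y_true, y_hat)) < 1e-12
assert abs(recall - recall_score(y_true, y_hat)) < 1e-12
```

In a screening setting such as this one, recall for class 1 is often the number to watch, since a false negative corresponds to a missed cancer case.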



### Task 3: Model Training Using Random Forest Classifier

1. Initialize and train a Random Forest Classifier model with 25 estimators using the training data.
2. Make predictions on the test set using the trained model.
3. Evaluate the model using a classification report and print the report.

In [21]:
# Step 1: Initialize and train a Random Forest Classifier model using the training data
# Note: the task specifies 25 estimators, but this cell uses n_estimators=100.
model = RandomForestClassifier(n_estimators=100)
model.fit(x_train, y_train)

# Step 2: Make predictions on the test set using the trained model
y_pred = model.predict(x_test)

# Step 3: Evaluate the model using a classification report and print the report
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.93      0.94      0.93       197
           1       0.89      0.85      0.87       103

    accuracy                           0.91       300
   macro avg       0.91      0.90      0.90       300
weighted avg       0.91      0.91      0.91       300
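
A single train/test split can make one model look better by chance, so a comparison like the one above is often repeated with cross-validation. The sketch below is an assumption-laden illustration, not part of the original exercise: it uses synthetic data from `make_classification` so that it runs without `cancer_data.csv`, and the names `X_demo`, `y_demo`, and `cv_scores` are hypothetical. With the real data you would pass `X` and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features (NOT the cancer dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Fixed random_state values make the comparison repeatable.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=25, random_state=0),
}

# 5-fold cross-validated F1 score for each model.
cv_scores = {}
for name, estimator in models.items():
    cv_scores[name] = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {cv_scores[name].mean():.3f} "
          f"(+/- {cv_scores[name].std():.3f})")
```

The spread across folds gives a rough sense of whether a gap between the two models is larger than the noise from the split itself.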



### Task 4: Exploring Various Parameters in Random Forest Classifier

1. Train a Random Forest model with the following parameters:
   - n_estimators = 50
   - max_features = "log2"
   - criterion = "entropy"
   - bootstrap = False
   - max_depth = 15
   - min_samples_split = 5
   - min_samples_leaf = 3

2. Evaluate the model using a classification report and print the report.

Learn about these parameters here: [scikit-learn RandomForestClassifier Parameters](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [51]:
# Step 1: Initialize and train a Random Forest Classifier model with custom parameters
# Note: this cell deviates from the task list in three places:
# criterion="gini" (task: "entropy"), max_depth=200 (task: 15), min_samples_leaf=4 (task: 3).
model = RandomForestClassifier(
    n_estimators=50,
    max_features="log2",
    criterion="gini",
    bootstrap=False,
    max_depth=200,
    min_samples_split=5,
    min_samples_leaf=4,
)
model.fit(x_train, y_train)

# Step 2: Make predictions on the test set and evaluate with a classification report
y_pred = model.predict(x_test)
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.94      0.95      0.95       197
           1       0.91      0.88      0.90       103

    accuracy                           0.93       300
   macro avg       0.93      0.92      0.92       300
weighted avg       0.93      0.93      0.93       300
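
Rather than trying parameter combinations by hand, the exploration in Task 4 could be automated with `GridSearchCV`. The following is a sketch under stated assumptions: it runs on synthetic data so that it is self-contained, the grid values are illustrative choices around the Task 4 settings, and the names `X_demo`, `y_demo`, `param_grid`, and `search` are hypothetical. With the real data you would fit on `x_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 8 features (NOT the cancer dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Parameters held fixed at the Task 4 values.
base_model = RandomForestClassifier(
    n_estimators=50,
    bootstrap=False,
    max_depth=15,
    min_samples_split=5,
    random_state=0,
)

# Parameters to vary (illustrative values, 2 x 2 x 2 = 8 combinations).
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_features": ["log2", "sqrt"],
    "min_samples_leaf": [3, 4],
}

# 3-fold cross-validation over every combination, scored by F1.
search = GridSearchCV(base_model, param_grid, cv=3, scoring="f1")
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best mean CV F1:", round(search.best_score_, 3))
```

Because the search scores each combination by cross-validation on the training data, the held-out test set stays untouched for the final classification report.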

