You are provided with a breast cancer dataset (Breast_Cancer_Data.csv), originally taken from the UCI data repository:
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

This dataset contains approximately 683 patient records, each with 10 features and 1 class label indicating whether the patient has cancer. Each row describes one patient, and the class column indicates whether the patient's tumor is benign (label = 2) or malignant (label = 4). For this dataset, build all the classification models listed below (using Python and Scikit-learn; no visualization is needed) and tabulate the accuracy and confusion matrix obtained for each. Split the dataset so that the test set is 25% of the total dataset.

7. Random Forest (estimators = 10)

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
# Load the dataset directly from the UCI repository (the raw file has no header row)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
column_names = ['id', 'clump_thickness', 'cell_size_uniformity', 'cell_shape_uniformity', 'marginal_adhesion', 
                'single_epithelial_cell_size', 'bare_nuclei', 'bland_chromatin', 'normal_nucleoli', 'mitoses', 'class']
data = pd.read_csv(url, header=None, names=column_names)


In [3]:
# Drop the 'id' column
data.drop('id', axis=1, inplace=True)

# Replace missing values denoted by '?' with NaN and then drop them
data.replace('?', pd.NA, inplace=True)
data.dropna(inplace=True)

# Convert all columns to numeric
data = data.apply(pd.to_numeric)

Split the Dataset

In [4]:
# Separate features and target variable
X = data.drop('class', axis=1)
y = data['class']

# Split the dataset into training and testing sets (75% training, 25% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
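
The size of the resulting split can be checked by hand. The sketch below assumes the cleaned dataset has the 683 rows mentioned in the description; the variable names are illustrative only. For a fractional test_size, scikit-learn rounds the test count up and assigns the remaining rows to the training set, which is consistent with the 171 test predictions counted in the confusion matrix reported for the model.

```python
import math

# Assumption: 683 rows remain after dropping rows with missing values,
# as stated in the dataset description.
n_samples = 683
test_size = 0.25

# For a fractional test_size, scikit-learn rounds the test count up
# and gives the remaining rows to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```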


Build and Evaluate Random Forest (estimators = 10)

In [9]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the model with 10 trees
# (no random_state is set, so the results can vary slightly between runs)
rf = RandomForestClassifier(n_estimators=10)

# Train the model
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print(f"Random Forest Accuracy: {accuracy}")
print(f"Random Forest Confusion Matrix:\n{conf_matrix}")


Random Forest Accuracy: 0.9473684210526315
Random Forest Confusion Matrix:
[[102   1]
 [  8  60]]
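
Since the task asks for the results to be tabulated, the following is a minimal sketch of one way to label the matrix above and recompute the accuracy from it. It hard-codes the reported numbers rather than rerunning the model, and the names cm_table, accuracy_from_cm, and recall_per_class are illustrative. In scikit-learn's confusion_matrix, rows are true classes and columns are predicted classes, with labels sorted in ascending order, so the first row and column correspond to benign (2) and the second to malignant (4).

```python
import numpy as np
import pandas as pd

# Confusion matrix reported above (rows = true class, columns = predicted class).
# Labels are sorted ascending: index 0 is benign (2), index 1 is malignant (4).
conf_matrix = np.array([[102, 1],
                        [8, 60]])

labels = ["benign (2)", "malignant (4)"]
cm_table = pd.DataFrame(
    conf_matrix,
    index=[f"true {name}" for name in labels],
    columns=[f"predicted {name}" for name in labels],
)

# Accuracy is the diagonal (correct predictions) over all test samples.
accuracy_from_cm = np.trace(conf_matrix) / conf_matrix.sum()

# Recall per class: correct predictions divided by the row total.
recall_per_class = np.diag(conf_matrix) / conf_matrix.sum(axis=1)

print(cm_table)
print(f"Accuracy recomputed from the matrix: {accuracy_from_cm:.4f}")
print(f"Recall for benign: {recall_per_class[0]:.4f}, malignant: {recall_per_class[1]:.4f}")
```

Read this way, the matrix shows 8 malignant tumors predicted as benign against 1 benign tumor predicted as malignant, a distinction that the single accuracy figure does not reveal.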
