In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn import metrics
import matplotlib.pyplot as plt

# Load the breast cancer dataset
# Replace this with the actual path to your copy of the dataset
csv_path = '/kaggle/input/breast-cancer/Breast_Cancer.csv'
breast_cancer = pd.read_csv(csv_path)

# Assuming the target variable is in a column named 'Status' and you want to classify it
X = breast_cancer.drop('Status', axis=1)
y = breast_cancer['Status']

# Convert categorical variables to numeric using one-hot encoding
X = pd.get_dummies(X, drop_first=True)

# Binarize the target: 'Alive' becomes 1, every other value becomes 0
y_class = (y == 'Alive').astype(int)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_class, test_size=0.2, random_state=42)

# Create a decision tree classifier with reduced max depth for better visualization
dt_classifier = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)

# Evaluate the model
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Visualize the decision tree with reduced max depth
plt.figure(figsize=(15, 10))
plot_tree(dt_classifier, filled=True, feature_names=list(X.columns), class_names=['Not Alive', 'Alive'], max_depth=3)
plt.title("Decision Tree Visualization for Breast Cancer Dataset (Max Depth=3)")
plt.show()

# Display the decision tree rules
tree_rules = export_text(dt_classifier, feature_names=list(X.columns), max_depth=3)
print("Decision Tree Rules:\n", tree_rules)

# Plot the feature importances
importances = dt_classifier.feature_importances_
features = X.columns

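plt.figure(figsize=(10, 8))  # size the figure so every feature label stays readable
plt.barh(range(len(importances)), importances, align="center")
plt.yticks(range(len(importances)), features)
plt.xlabel("Feature Importance")
plt.title("Decision Tree Feature Importances")
plt.tight_layout()  # keep long one-hot feature names from being clipped
plt.show()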

SyntaxError: unterminated string literal (detected at line 12) (1281467040.py, line 12)

This Python script uses scikit-learn to train a decision tree classifier on a breast cancer dataset. The goal is to predict a patient's 'Status', which the script treats as binary: 'Alive' versus anything else (labeled 'Not Alive' in the plot).

The script begins by importing the necessary libraries. It then loads the breast cancer dataset from a CSV file at the given path into a pandas DataFrame. The 'Status' column, which is the target variable, is stored in `y`, and the remaining columns, which are the features, are stored in `X`. Any categorical feature columns are converted to numeric indicator columns with one-hot encoding (`pd.get_dummies`). Because `drop_first=True` is set, the first category of each column is dropped and acts as the baseline.
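To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame. The column names and values (`Age`, `Grade`, `low`/`medium`/`high`) are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; names and values are invented for illustration only.
toy = pd.DataFrame({
    "Age": [45, 60, 52],
    "Grade": ["low", "high", "medium"],
})

# Categories are sorted alphabetically (high, low, medium) and the first
# one ("high") is dropped, so it becomes the implicit baseline.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())  # ['Age', 'Grade_low', 'Grade_medium']
```

A row whose `Grade` is `high` ends up with zeros (or `False`) in both indicator columns, which is how the baseline category is represented.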

The 'Status' column is binarized for classification: 'Alive' is mapped to 1 and every other value to 0. The result is stored in `y_class`. Note that the comparison is an exact string match, so a misspelled or differently cased label would silently become 0.
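The comparison-based binarization can be checked on a few hand-written labels. This is a small sketch. The labels `Dead` and `Unknown` are hypothetical stand-ins for "anything other than `Alive`" and are not confirmed values from the dataset.

```python
import pandas as pd

# Hypothetical labels; only "Alive" is known from the script itself.
status = pd.Series(["Alive", "Dead", "Alive", "Unknown"])

# Exact string comparison gives booleans, which astype(int) turns into 1/0.
y_bin = (status == "Alive").astype(int)
print(y_bin.tolist())  # [1, 0, 1, 0]
```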

The dataset is then split into a training set and a test set with scikit-learn's `train_test_split` function: 80% of the rows are used for training and 20% for testing. Setting `random_state=42` makes the split reproducible across runs.
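The 80/20 split can be sketched on synthetic data as follows. The data is made up, and the `stratify` argument is an optional addition that the original script does not use. It is shown here only because it keeps the class ratio the same in both subsets, which can matter when one class is much rarer than the other.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 70 positives and 30 negatives.
X_demo = pd.DataFrame({"feature": range(100)})
y_demo = pd.Series([1] * 70 + [0] * 30)

# stratify=y_demo is an assumption for illustration, not part of the script.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))  # 80 20
print(y_te.mean())           # share of positives in the test set
```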

A decision tree classifier is created with a maximum depth of 3 to keep the tree small and easier to visualize. The classifier is trained using the training data.
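As a rough illustration of what `max_depth=3` buys, the sketch below fits a depth-limited tree on synthetic data from `make_classification` (not the breast cancer data) and inspects its size. A binary tree of depth 3 can have at most 8 leaves, which is why it stays readable when plotted.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real script uses the breast cancer CSV.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_syn, y_syn)
unlimited = DecisionTreeClassifier(random_state=42).fit(X_syn, y_syn)

# Depth and leaf count of the capped tree versus the unrestricted one.
print(shallow.get_depth(), shallow.get_n_leaves())
print(unlimited.get_depth(), unlimited.get_n_leaves())
```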

The trained classifier is then used to make predictions on the test set. Accuracy, the fraction of test samples whose predicted status matches the actual status, is computed with `metrics.accuracy_score` and printed to two decimal places.
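Accuracy is simply the share of matching labels, which the following sketch verifies by computing it by hand and with `accuracy_score`. The two label lists are invented for illustration.

```python
from sklearn import metrics

# Hypothetical true and predicted labels.
y_true = [1, 0, 1, 1, 0]
y_hat = [1, 0, 0, 1, 1]

# Manual computation: count matches and divide by the number of samples.
manual = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)
library = metrics.accuracy_score(y_true, y_hat)

print(manual, library)  # both are 0.6 (3 of 5 correct)
```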

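The decision tree is visualized using matplotlib and scikit-learn's `plot_tree` function. With `filled=True`, each node is colored by its majority class, and the color intensity reflects how pure the node is. The feature names and class names passed in are used to label the splits and the predicted class at each node.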

The rules of the decision tree are displayed using the `export_text` function from scikit-learn. This provides a text representation of the decision tree, showing the conditions for each split in the tree.
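To show what the text output looks like, here is a minimal sketch on a four-point toy problem. The feature name `size` and the data are assumptions for illustration. Each indented line is one branch, and the leaf lines report the predicted class.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: one feature, classes separate cleanly between 2 and 3.
X_toy = [[1], [2], [3], [4]]
y_toy = [0, 0, 1, 1]

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# feature_names must have one entry per column of X_toy.
rules = export_text(clf, feature_names=["size"])
print(rules)
```

The single split lands at the midpoint between the two classes (2.50), so the printed rules contain one `<=` branch and one `>` branch.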

Finally, the feature importances are plotted in a horizontal bar chart. Each feature receives a score, and the higher the score, the more that feature contributed to the tree's splits. In scikit-learn these scores are impurity-based: they measure the total reduction in the split criterion attributable to each feature, normalized so that they sum to 1. Features the tree never splits on receive a score of 0. This helps show which features most influence the classifier's decisions.
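With many one-hot columns, a sorted table of importances is often easier to read than the raw array. The sketch below uses made-up data with one informative column and one constant column. The names `informative` and `constant` are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: only the first column carries any signal.
X_imp = pd.DataFrame({
    "informative": [1, 2, 3, 4, 5, 6],
    "constant": [0, 0, 0, 0, 0, 0],
})
y_imp = [0, 0, 0, 1, 1, 1]

clf_imp = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_imp, y_imp)

# Pair each score with its column name and sort from most to least important.
ranked = pd.Series(clf_imp.feature_importances_, index=X_imp.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

The constant column can never be used for a split, so it gets a score of 0 and the informative column receives the full weight.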