## **Decision Tree Algorithm using Scikit-learn**

A **Decision Tree** is a supervised learning algorithm used for classification and regression tasks. It works by recursively splitting the dataset on the feature and threshold that best separate the target values, forming a tree-like structure in which each internal node represents a decision rule and each leaf node represents a prediction (a class label in classification).  

In this case, we're using a **Decision Tree Classifier** on the **Breast Cancer dataset** to classify tumors as **malignant** (cancerous) or **benign** (non-cancerous).  

---

### **Step 1: Importing Required Libraries**

In [None]:
import pandas as pd

# Import the dataset loader from scikit-learn

from sklearn.datasets import load_breast_cancer

### **Step 2: Loading the Dataset**


- We use **pandas** to handle datasets.  
- `load_breast_cancer()` loads a dataset containing **features** (tumor properties) and **target labels** (malignant = 0, benign = 1). 

In [None]:
data = load_breast_cancer()
dataset = pd.DataFrame(data=data['data'], columns=data['feature_names'])
dataset

- The dataset is stored in a **pandas DataFrame** for better visualization and manipulation.  
- `data['feature_names']` contains **30 numerical features** related to tumor characteristics (e.g., mean radius, texture, area, etc.).
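
Before modeling, it can help to confirm what was loaded. The sketch below is a minimal check, assuming the same `load_breast_cancer()` call as above; `class_counts` is an illustrative name, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
dataset = pd.DataFrame(data=data["data"], columns=data["feature_names"])

# (number of samples, number of features)
print(dataset.shape)

# The position of each name in target_names is its integer label
print(list(data["target_names"]))

# Count how many samples carry each label (illustrative helper dict)
counts = np.bincount(data["target"])
class_counts = {str(name): int(n) for name, n in zip(data["target_names"], counts)}
print(class_counts)
```

The class counts show that the dataset is moderately imbalanced, which matters when interpreting accuracy later on.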

### **Step 3: Splitting Data into Train and Test Sets**


In [None]:
from sklearn.model_selection import train_test_split
X = dataset.copy()
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

- `X` stores **input features** (tumor characteristics).  
- `y` stores **target labels** (0 = malignant, 1 = benign).  
- **`train_test_split()`** splits the dataset into:  
  - **67% training data**  
  - **33% testing data** 
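
The call above draws a different random split on every run, so the scores change between runs. A possible variation, sketched below, fixes the seed and keeps the class ratio equal in both subsets; the choices `random_state=42` and `stratify=y` are assumptions for illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and labels as pandas objects
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# random_state makes the split repeatable; stratify=y preserves the
# malignant/benign ratio in both the training and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print(len(X_train), len(X_test))

# Fraction of benign samples (label 1) in each subset
print(y_train.mean(), y_test.mean())
```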

### **Step 4: Training the Decision Tree Model**


In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(ccp_alpha=0.01)
clf = clf.fit(X_train, y_train)

- `DecisionTreeClassifier()` creates a **Decision Tree model**.  
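- `ccp_alpha=0.01` sets the complexity parameter for **minimal cost-complexity pruning**: after the tree is grown, subtrees whose impurity reduction does not justify their extra leaves are collapsed, which reduces overfitting. Larger values prune more aggressively.  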
- `clf.fit(X_train, y_train)` **trains the model** on the training dataset.  
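
To see what pruning does, one option is to compare an unpruned tree with a pruned one. The sketch below assumes a fixed `random_state=0` for repeatability (not used in the original notebook) and uses `cost_complexity_pruning_path` to list the alpha values at which the tree would lose a subtree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Candidate alphas: each one is a threshold at which pruning removes a subtree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

# Same data, with and without pruning
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("candidate alphas:", len(path.ccp_alphas))
print("nodes without pruning:", full.tree_.node_count)
print("nodes with ccp_alpha=0.01:", pruned.tree_.node_count)
print("test accuracy (full, pruned):",
      full.score(X_test, y_test), pruned.score(X_test, y_test))
```

In practice, a suitable `ccp_alpha` is usually chosen by cross-validating over the candidate values rather than fixed in advance.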

### **Step 5: Checking Model Parameters**


In [None]:
clf.get_params()


- Returns a dictionary of the model's **hyperparameters**, such as `criterion`, `max_depth`, and `min_samples_split`.  
- `criterion='gini'` is the default; it scores candidate splits by the Gini impurity of the resulting nodes. 
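
The Gini impurity of a node is $1 - \sum_k p_k^2$, where $p_k$ is the fraction of samples in class $k$. As a small illustration, the hypothetical helper `gini_impurity` below (not a scikit-learn function) computes it for a list of labels.

```python
import numpy as np

def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k of `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity([0, 0, 0, 0]))  # pure node: 0.0
print(gini_impurity([0, 0, 1, 1]))  # evenly mixed two-class node: 0.5
```

A split is considered good when the child nodes it produces have lower weighted impurity than the parent.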

### **Step 6: Making Predictions**


In [None]:
predictions = clf.predict(X_test)
predictions

- The trained model **predicts tumor types** for the test dataset.  
- Outputs an array of **0s (malignant) and 1s (benign)**.  

### **Step 7: Probability Predictions**


In [None]:
clf.predict_proba(X_test)


- Instead of a **hard classification (0 or 1)**, this gives **probabilities** of each class for test samples.  
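
The sketch below shows how the probability output relates to the hard predictions. It assumes a fixed `random_state=0` for repeatability; column `i` of the result corresponds to `clf.classes_[i]`, and for a decision tree each probability is the fraction of training samples of that class in the leaf the sample lands in.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
clf = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

proba = clf.predict_proba(X_test)

# Column order of proba follows clf.classes_
print(clf.classes_)
print(proba[:5])

# Picking the most probable class per row reproduces predict()
from_proba = clf.classes_[proba.argmax(axis=1)]
print(np.array_equal(from_proba, clf.predict(X_test)))
```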


### **Step 8: Evaluating Model Performance**
#### **Accuracy Score**

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

- Measures **overall correctness**:  
  $$ \text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}} $$
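- Higher is better, but accuracy alone can mislead when classes are imbalanced. This dataset has 212 malignant and 357 benign samples, so always predicting benign would already score about 63%; read accuracy together with the per-class metrics that follow.  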
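
As a sanity check, accuracy can be recomputed by hand and compared with a trivial baseline. This is a minimal sketch assuming a fixed `random_state=0`; `manual_accuracy` and `baseline_accuracy` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
clf = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)
predictions = clf.predict(X_test)

# Accuracy is the fraction of test samples predicted correctly
manual_accuracy = np.mean(predictions == y_test)

# Baseline: always predict whichever class is most frequent in the test set
benign_fraction = np.mean(y_test == 1)
baseline_accuracy = max(benign_fraction, 1 - benign_fraction)

print("model accuracy:", manual_accuracy)
print("majority-class baseline:", baseline_accuracy)
```

A model is only useful if it clearly beats the baseline.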


#### **Confusion Matrix**


In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions, labels=[0,1])

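- Shows the counts of **True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN)**. Rows are the true labels and columns are the predicted labels, in the order given by `labels=[0, 1]`. Treating benign (1) as the positive class, the layout is `[[TN, FP], [FN, TP]]`.  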
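
For a binary problem, the four cells can be unpacked with `ravel()`. The sketch below assumes a fixed `random_state=0` and treats benign (label 1) as the positive class, which matches scikit-learn's default for binary metrics.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
clf = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)
predictions = clf.predict(X_test)

cm = confusion_matrix(y_test, predictions, labels=[0, 1])

# Row-major order of the 2x2 matrix: TN, FP, FN, TP (positive class = 1, benign)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("malignant predicted as malignant (TN):", tn)
print("malignant predicted as benign (FP):", fp)
print("benign predicted as malignant (FN):", fn)
print("benign predicted as benign (TP):", tp)
```

Under this convention, a false positive is a malignant tumor labeled benign, which is typically the costlier mistake in a screening setting.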


#### **Precision and Recall**


In [None]:
from sklearn.metrics import precision_score, recall_score
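# A notebook cell only displays its last expression, so print both scores
print("Precision:", precision_score(y_test, predictions))
print("Recall:", recall_score(y_test, predictions))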

- **Precision**: How many predicted **benign tumors** were actually benign?  
- **Recall**: How many actual **benign tumors** were correctly detected?  
- These metrics help understand the model's reliability.  
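
By default both functions score label 1 (benign) as the positive class. If the malignant class is the one of interest, `pos_label=0` switches the perspective. The sketch below assumes a fixed `random_state=0`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
clf = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)
predictions = clf.predict(X_test)

# Score the malignant class (label 0) instead of the default label 1
precision_malignant = precision_score(y_test, predictions, pos_label=0)
recall_malignant = recall_score(y_test, predictions, pos_label=0)

# Recall for malignant: share of actual malignant tumors that were detected
manual_recall = np.sum((y_test == 0) & (predictions == 0)) / np.sum(y_test == 0)

print("precision (malignant):", precision_malignant)
print("recall (malignant):", recall_malignant)
```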

#### **Full Classification Report**


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions, target_names=['malignant', 'benign']))

- Provides **Precision, Recall, F1-score** for both classes.  


### **Step 9: Feature Importance Analysis**


In [None]:
feature_names = X.columns
clf.feature_importances_

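- `feature_importances_` **scores each feature** by the total impurity reduction of the splits that use it, normalized so that the scores sum to 1.  
- Higher values indicate more impact on classification; a feature that is never used in a split scores 0.  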
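
A quick way to read the importances is to attach the feature names and keep only the features the tree actually uses. This is a minimal sketch assuming a fixed `random_state=0`; `used` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
clf = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

# Pair each importance score with its feature name
importances = pd.Series(clf.feature_importances_, index=X.columns)

# Features with a nonzero score are the ones that appear in at least one split
used = importances[importances > 0].sort_values(ascending=False)

print(used)
print("sum of all importances:", importances.sum())
```

Because the tree is pruned, it is likely that only a handful of the 30 features receive a nonzero score.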

In [None]:
# Build a table of importances indexed by feature name, most important first
feature_importance = pd.DataFrame(clf.feature_importances_, index=feature_names).sort_values(0, ascending=False)

# Display the sorted list of the most important features
feature_importance

#### **Visualizing Feature Importance**

Let's plot the **top 10 most important features** in classification.

In [None]:
feature_importance.head(10).plot(kind='bar')


### **Step 10: Visualizing the Decision Tree**


In [None]:
from sklearn import tree
from matplotlib import pyplot as plt

fig = plt.figure(figsize=(25, 20))
_ = tree.plot_tree(clf,
                   feature_names=list(feature_names),    # plain list of column names
                   class_names=['Malignant', 'Benign'],  # ordered to match labels 0 and 1
                   filled=True,
                   fontsize=12)

- Displays the **entire Decision Tree**.  
- `filled=True` colors each node by its majority class, with a deeper shade for purer nodes.  
- Helps **understand decision paths**.  
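
When the plot is too large to read, the same rules can be printed as text with `sklearn.tree.export_text`. The sketch below assumes a fixed `random_state=0`; `rules` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=0
)
clf = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

# One indented line per node: a threshold test for internal nodes,
# a predicted class for leaves
rules = export_text(clf, feature_names=list(data.feature_names))
print(rules)
```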

### **Step 11: Understanding Decision Paths**


In [None]:
# Print explicitly: a notebook cell only displays its last expression
print(X_test.head())
clf.decision_path(X_test)

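- `decision_path()` returns a sparse indicator matrix with one row per sample and one column per tree node; a nonzero entry **shows that the sample passed through that node**.  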
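
To read the path of a single sample, the rows of the sparse matrix can be sliced through its `indptr` and `indices` arrays. The sketch below assumes a fixed `random_state=0` and inspects the first test sample; `clf.apply` returns the leaf each sample ends in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)
clf = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

node_indicator = clf.decision_path(X_test)  # sparse matrix: samples x nodes
leaf_ids = clf.apply(X_test)                # leaf node id reached by each sample

# Node ids visited by one sample are stored between two indptr offsets
sample_id = 0
start = node_indicator.indptr[sample_id]
end = node_indicator.indptr[sample_id + 1]
path_nodes = node_indicator.indices[start:end]

print("nodes visited by sample 0:", path_nodes)
print("leaf reached by sample 0:", leaf_ids[sample_id])
```

Node 0 is the root, so every path contains it, and the leaf id reported by `apply` is always part of the path.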


### Let's visualize the **first 101 test samples** passing through the Decision Tree.  


In [None]:
# Dense 0/1 matrix: rows are the first 101 test samples, columns are tree nodes
paths = clf.decision_path(X_test).toarray()[:101]
plt.figure(figsize=(20, 20))
plt.spy(paths, markersize=5)


## **🔹 Summary of Key Takeaways**
1. **Decision Trees classify data by learning decision rules from training data.**
2. **Pruning (via `ccp_alpha`) reduces overfitting.**
3. **Accuracy, Precision, Recall, and the Confusion Matrix help evaluate model performance.**
4. **Tree visualization and decision paths help explain how the model makes predictions.**
5. **Feature importance identifies which factors influence classification the most.**