# Decision Tree Algorithm in Machine Learning

## What is a Decision Tree?
A **Decision Tree** is a supervised learning algorithm used for both classification and regression tasks. It works by splitting the data into subsets based on feature values, forming a tree-like structure where:
- Each **internal node** represents a decision based on a feature.
- Each **branch** represents the outcome of a decision.
- Each **leaf node** represents a predicted outcome or value.

## Key Characteristics
- **Non-parametric**: No assumptions about the data distribution.
- **Interpretable**: Easy to visualize and understand.
- **Versatile**: Handles both categorical and numerical data.

---

## Basic Algorithm
1. **Start with the entire dataset** as the root node.
2. **Choose a splitting criterion**, such as:
   - **Gini Impurity** (used in CART)
   - **Entropy / Information Gain** (used in ID3 and C4.5)
   - **Mean Squared Error (MSE)** for regression tasks.
3. Split the dataset on the feature and threshold that most reduce impurity (or error) in the resulting child nodes.
4. Recursively apply steps 2 and 3 to create child nodes until:
   - A stopping criterion is met (e.g., max depth, minimum samples per leaf).
   - No further splits improve the model significantly.
5. Assign the majority class (classification) or the mean value (regression) to leaf nodes.
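
The impurity measures and the split search in steps 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration, not how any particular library implements it: the helper names (`gini`, `entropy`, `best_threshold`) and the toy data are made up for this example, and only a single numeric feature is searched.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(feature, labels, impurity=gini):
    """Return (threshold, weighted_impurity) for the best binary split on one numeric feature."""
    best = (None, impurity(labels))  # start from the unsplit node
    values = np.unique(feature)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for t in (values[:-1] + values[1:]) / 2:
        left, right = labels[feature <= t], labels[feature > t]
        weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Toy data, invented for illustration only.
ages = np.array([1, 2, 3, 10, 11, 12])
labels = np.array(["absent", "absent", "absent", "present", "present", "present"])

print("Gini before split:", gini(labels))        # 0.5 for a 50/50 node
print("Entropy before split:", entropy(labels))  # 1.0 bit for a 50/50 node
threshold, impurity_after = best_threshold(ages, labels)
print("Best threshold:", threshold, "weighted impurity:", impurity_after)
```

A full tree builder would repeat this search over every feature at every node and recurse on the two resulting subsets until a stopping criterion is met.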

---

## Key Parameters
- **Max Depth**: Limits the depth of the tree to prevent overfitting.
- **Min Samples Split**: Minimum samples required to split a node.
- **Min Samples Leaf**: Minimum samples required to form a leaf node.
- **Criterion**: Metric used to measure the quality of splits (e.g., Gini, Entropy).
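
In scikit-learn these parameters map onto the constructor arguments `max_depth`, `min_samples_split`, `min_samples_leaf` and `criterion`. The sketch below assumes a synthetic dataset from `make_classification`; the specific values are arbitrary and chosen only to show the effect of the constraints.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only as a stand-in for a real dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

tree = DecisionTreeClassifier(
    criterion="entropy",    # split quality measure ("gini" is the default)
    max_depth=3,            # at most 3 levels of splits below the root
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    random_state=0,
)
tree.fit(X, y)

depth = tree.get_depth()
n_leaves = tree.get_n_leaves()
print("depth:", depth, "leaves:", n_leaves)
```

Tightening any of these values gives a smaller tree; loosening them lets the tree grow until the leaves are pure.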

---

## Use Cases
1. **Classification**:
   - Diagnosing diseases
   - Predicting loan default
2. **Regression**:
   - House price prediction
   - Sales forecasting
3. **Feature Selection**:
   - Identifying important features in datasets.
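
For the feature selection use case, a fitted scikit-learn tree exposes impurity-based importances through `feature_importances_`. The sketch below assumes synthetic data in which only the first two of five features carry signal (`shuffle=False` keeps them in the first columns); the feature labels are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: features 0 and 1 are informative, features 2-4 are pure noise.
X, y = make_classification(
    n_samples=300,
    n_features=5,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Importances are normalized impurity reductions and sum to 1.
importances = tree.feature_importances_
for i, importance in enumerate(importances):
    print(f"feature_{i}: {importance:.3f}")
```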
---

## Limitations
- Prone to overfitting (can be mitigated with pruning or setting constraints).
- High variance: small changes in the training data can produce a very different tree.
- Can be biased toward the majority class on imbalanced datasets and tends to fit noise in the data.
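
The overfitting behaviour is easy to reproduce. The sketch below assumes a synthetic dataset with deliberately noisy labels (`flip_y=0.2`) and compares an unconstrained tree with one limited to `max_depth=3`; the exact scores depend on the data, so none are quoted here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% randomly reassigned labels to simulate noise.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unconstrained", full), ("max_depth=3", shallow)]:
    print(
        f"{name}: depth={model.get_depth()}, "
        f"train accuracy={model.score(X_train, y_train):.2f}, "
        f"test accuracy={model.score(X_test, y_test):.2f}"
    )
```

The unconstrained tree memorizes the training set, noise included, so its training accuracy says little about how it will do on unseen data.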

---

## Visualization of Decision Trees
Visualizing a fitted tree helps interpret its structure and trace how each prediction is made. Common options:
- **`plot_tree`** (`sklearn.tree`): draws the tree with Matplotlib.
- **Graphviz** (via `sklearn.tree.export_graphviz`): produces more detailed diagrams.
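
When no plotting backend is available, `sklearn.tree.export_text` prints the same structure as indented rules. The sketch below assumes the Iris dataset bundled with scikit-learn, which is not part of this notebook; `plot_tree(tree, feature_names=..., filled=True)` would draw the equivalent diagram on Matplotlib axes.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is used only as a small, readily available example dataset.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# One line per node: internal nodes show the test, leaves show the predicted class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```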

---

## Best Practices
1. Perform **cross-validation** to estimate generalization performance and detect overfitting.
2. Use **ensemble methods** like Random Forest or Gradient Boosting for improved performance.
3. Preprocess data by handling missing values and encoding categorical variables (if required).
4. Regularize by setting constraints like max depth, min samples, etc.
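
Practices 1, 2 and 4 can be combined in one comparison. The sketch below assumes synthetic data and arbitrary hyperparameter values; it scores an unconstrained tree, a depth-limited tree and a Random Forest with 5-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only as a stand-in for a real dataset.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

models = {
    "tree (unconstrained)": DecisionTreeClassifier(random_state=0),
    "tree (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# cross_val_score returns one accuracy value per fold.
scores = {name: cross_val_score(model, X, y, cv=5) for name, model in models.items()}
for name, fold_scores in scores.items():
    print(f"{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

Comparing the mean and spread across folds gives a steadier picture than a single train/test split, which matters most on small datasets.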


![Screenshot (8112).png](attachment:3011f044-551b-4aab-8081-cb6292001b3e.png)

![Screenshot (8114).png](attachment:f3c2741c-4693-4990-8791-2bb7601e51a4.png)

![Screenshot (8115).png](attachment:07471e48-eb3d-4ecb-8209-615733ff886a.png)

![Screenshot (8119).png](attachment:b9bb296c-1181-4302-ba19-b7e2affedf3c.png)

![Screenshot (8120).png](attachment:4704830b-4d44-4e29-8d50-22be6224b715.png)



In [8]:
import numpy as np
import pandas as pd

In [10]:
data = pd.read_csv('Kyphosis.csv')

In [12]:
data.head()

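|   | Kyphosis | Age | Number | Start |
|---|----------|-----|--------|-------|
| 0 | absent   | 71  | 3      | 5     |
| 1 | absent   | 158 | 3      | 14    |
| 2 | present  | 128 | 4      | 5     |
| 3 | absent   | 2   | 5      | 1     |
| 4 | absent   | 1   | 4      | 15    |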


In [14]:
data.shape

(81, 4)

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Kyphosis  81 non-null     object
 1   Age       81 non-null     int64 
 2   Number    81 non-null     int64 
 3   Start     81 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 2.7+ KB


In [18]:
# features: every column except the target 'Kyphosis'
x = data.drop('Kyphosis', axis=1)

In [20]:
x.head()

|   | Age | Number | Start |
|---|-----|--------|-------|
| 0 | 71  | 3      | 5     |
| 1 | 158 | 3      | 14    |
| 2 | 128 | 4      | 5     |
| 3 | 2   | 5      | 1     |
| 4 | 1   | 4      | 15    |


In [22]:
# target: the 'Kyphosis' column (absent / present)
y = data['Kyphosis']
y.head()

0     absent
1     absent
2    present
3     absent
4     absent
Name: Kyphosis, dtype: object

In [24]:
from sklearn.model_selection import train_test_split
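# Hold out 30% of the rows for testing. No random_state is set, so the split
# (and every result below) changes from run to run.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)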
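
A reproducible variant of this split fixes `random_state` and, because the two classes are unevenly represented, passes `stratify` so both partitions keep roughly the same class ratio. The sketch below does not read `Kyphosis.csv`: it builds a hypothetical stand-in DataFrame (`demo`) with the same column names and made-up values, and the names `X_demo`, `y_demo` and the seed `42` are likewise assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same columns as the notebook's DataFrame.
rng = np.random.default_rng(0)
n = 80
demo = pd.DataFrame({
    "Kyphosis": ["present"] * 16 + ["absent"] * 64,
    "Age": rng.integers(1, 200, size=n),
    "Number": rng.integers(2, 10, size=n),
    "Start": rng.integers(1, 18, size=n),
})
X_demo = demo.drop("Kyphosis", axis=1)
y_demo = demo["Kyphosis"]

# random_state makes the split repeatable; stratify preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print("overall share of 'present':", (y_demo == "present").mean())
print("test share of 'present':   ", (y_te == "present").mean())
```
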

In [26]:
x_train.shape

(56, 3)

In [28]:
x_test.shape

(25, 3)

### Decision Tree

In [32]:
from sklearn.tree import DecisionTreeClassifier
# Default hyperparameters: Gini criterion, no depth limit
model = DecisionTreeClassifier()
model.fit(x_train, y_train)

In [34]:
pred = model.predict(x_test)
pred

array(['absent', 'present', 'absent', 'absent', 'absent', 'present',
       'absent', 'absent', 'absent', 'absent', 'present', 'present',
       'absent', 'absent', 'absent', 'absent', 'present', 'absent',
       'absent', 'present', 'absent', 'present', 'absent', 'absent',
       'absent'], dtype=object)

In [36]:
y_test

40    present
67     absent
25     absent
32     absent
34     absent
23     absent
55     absent
64     absent
18     absent
26     absent
12     absent
79    present
10    present
44     absent
80     absent
7      absent
61    present
8      absent
45    present
50     absent
71     absent
17     absent
38     absent
68     absent
51     absent
Name: Kyphosis, dtype: object

In [38]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)

0.68

In [40]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)

array([[15,  5],
       [ 3,  2]], dtype=int64)
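
The confusion matrix above says more than the accuracy alone. The sketch below recomputes the metrics by hand from those four numbers; it relies on scikit-learn's convention that rows are true classes and columns are predicted classes in sorted label order (`absent`, `present`), and it treats `present` as the positive class, which is a choice made for this example.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true class, columns: predicted class, order: absent, present.
cm = np.array([[15, 5],
               [3, 2]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # 17 / 25 = 0.68
recall_present = tp / (tp + fn)       # 2 / 5 = 0.40
precision_present = tp / (tp + fp)    # 2 / 7, about 0.29
baseline = cm[0].sum() / cm.sum()     # always predicting "absent": 20 / 25 = 0.80

print(f"accuracy:            {accuracy:.2f}")
print(f"recall (present):    {recall_present:.2f}")
print(f"precision (present): {precision_present:.2f}")
print(f"majority baseline:   {baseline:.2f}")
```

On this particular split the tree finds only 2 of the 5 `present` cases and scores below the 0.80 obtained by always predicting `absent`, so accuracy by itself overstates how useful the model is.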