# Introduction to Decision Trees and Random Forests
This notebook introduces two popular machine learning algorithms: Decision Trees and Random Forests. We will explore these algorithms through practical examples and visualizations.

## Decision Trees

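A **Decision Tree** is a supervised machine learning algorithm used for classification and regression. At each node it chooses the feature and threshold that best separate the target values (for example, by minimizing Gini impurity), splits the data into subsets, and repeats recursively on each subset, forming a tree-like structure.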
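
To make the splitting step concrete, here is a minimal sketch of how one split could be chosen on a single numeric feature using Gini impurity, which is scikit-learn's default `criterion` for `DecisionTreeClassifier`. The helper names `gini` and `best_threshold` and the toy arrays are hypothetical and exist only for illustration; the library's own implementation is more elaborate.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    """Return (threshold, weighted impurity) of the best binary split on one feature."""
    best_t, best_score = None, np.inf
    values = np.unique(feature)
    # Candidate thresholds: midpoints between consecutive distinct values
    for t in (values[:-1] + values[1:]) / 2:
        left, right = labels[feature <= t], labels[feature > t]
        # Impurity of the two children, weighted by their sizes
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data (hypothetical): one feature, two well-separated classes
feature = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labels = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_threshold(feature, labels)
print(f"Best threshold: {threshold}, weighted Gini after split: {impurity}")
```

A tree builder would run this search over every feature, keep the best split, and then recurse into the left and right subsets.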

### Advantages:
- Easy to interpret
- Simple to visualize
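- Can handle numerical and categorical data (note that scikit-learn's tree implementation expects categorical features to be numerically encoded first)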

### Disadvantages:
- Prone to overfitting, especially when grown to full depth
- Unstable: small changes in the training data can produce a very different tree
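
One way to see the overfitting risk is to compare training and test accuracy for a fully grown tree and a depth-limited one. The sketch below is self-contained and assumes the same Iris split used later in this notebook; `max_depth=2` is an arbitrary illustrative value, not a recommended setting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=14)

# Fully grown tree vs. a depth-limited one (max_depth=2 is an illustrative choice)
full_tree = DecisionTreeClassifier(random_state=14).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=14).fit(X_train, y_train)

for name, model in [("full", full_tree), ("max_depth=2", shallow_tree)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name:>12}: depth={model.get_depth()}, train={train_acc:.2f}, test={test_acc:.2f}")
```

A large gap between training and test accuracy is the usual symptom of overfitting. On a small, clean dataset such as Iris the gap may be modest.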

## Importing Necessary Libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

## Load and Explore the Dataset
We'll use the Iris dataset for this demonstration.

In [None]:
iris = load_iris()
X = iris.data
y = iris.target

# DataFrame to view data
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(y, iris.target_names)

print(df.head())

## Splitting the Dataset
Split the data into training (70%) and testing (30%) sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=14)
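
The split above is purely random, so the three species may not be equally represented in the test set. As an optional variation, the sketch below passes `stratify=y` to `train_test_split` to preserve class proportions. The variable names `X_tr`, `X_te`, `y_tr`, and `y_te` are chosen here so they do not overwrite the notebook's own split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class ratio the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=14, stratify=y
)

print("Train class counts:", np.bincount(y_tr))
print("Test class counts: ", np.bincount(y_te))
```

Iris has 50 samples per class, so a stratified 70/30 split yields 35 training and 15 test samples for each species.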

## Training a Decision Tree Classifier

In [None]:
dt_clf = DecisionTreeClassifier(random_state=14)
dt_clf.fit(X_train, y_train)

## Visualizing the Decision Tree

In [None]:
plt.figure(figsize=(16,10))
plot_tree(dt_clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.title('Decision Tree Visualization')
plt.show()

## Evaluating Decision Tree Performance

In [None]:
y_pred_dt = dt_clf.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f'Decision Tree Accuracy: {accuracy_dt:.2f}')

## Random Forests

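A **Random Forest** is an ensemble method that builds many decision trees, each trained on a bootstrap sample of the training data and restricted to a random subset of features at each split. It aggregates the trees' predictions (by voting or averaging class probabilities for classification, and by averaging for regression), which reduces variance and overfitting compared with a single tree.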
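
The idea can be sketched by hand with plain decision trees: draw a bootstrap sample for each tree, let each tree consider a random subset of features per split, and combine the predictions by majority vote. This is an illustrative approximation, not how `RandomForestClassifier` is implemented internally. The tree count of 25 and the seed are arbitrary choices, and scikit-learn's forest averages predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=14)

rng = np.random.default_rng(14)
trees = []
for _ in range(25):  # 25 trees is an arbitrary illustrative number
    # Bootstrap sample: draw training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Collect every tree's predictions: shape (n_trees, n_test_samples)
all_preds = np.stack([t.predict(X_test) for t in trees])

# Majority vote per test sample (per column)
votes = np.apply_along_axis(
    lambda col: np.bincount(col, minlength=3).argmax(), 0, all_preds
)
print(f"Hand-rolled forest accuracy: {(votes == y_test).mean():.2f}")
```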

### Advantages:
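- Less prone to overfitting than a single decision tree
- Often more accurate than a single tree
- Scales to large datasets, since the trees are independent and can be trained in parallel (e.g., `n_jobs=-1` in scikit-learn)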
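
Because each tree is trained on a random sample of the rows, the rows a tree never saw can serve as a built-in validation set. The sketch below assumes scikit-learn's `oob_score=True` option, which reports this out-of-bag accuracy estimate without a separate test set. The name `rf_oob` is used here to avoid clashing with the notebook's own `rf_clf`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True evaluates each sample using only the trees that did not train on it
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=14)
rf_oob.fit(X, y)

print(f"Out-of-bag accuracy estimate: {rf_oob.oob_score_:.2f}")
```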

---


## Training a Random Forest Classifier

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=14)
rf_clf.fit(X_train, y_train)

## Visualizing a Single Tree from the Random Forest

In [None]:
# Visualize a single tree from the Random Forest
estimator = rf_clf.estimators_[0]  # the first of the fitted trees

plt.figure(figsize=(16,10))
# plot_tree takes the fitted tree as its first argument (decision_tree)
plot_tree(estimator,
          feature_names=iris.feature_names,
          class_names=iris.target_names,
          filled=True)
plt.title('Visualization of a Single Decision Tree from Random Forest')
plt.show()

## Evaluating Random Forest Performance

In [None]:
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f'Random Forest Accuracy: {accuracy_rf:.2f}')

## Feature Importance from Random Forest

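**Feature Importance** measures how much each feature contributes to a Random Forest's predictions. scikit-learn's `feature_importances_` attribute is impurity-based: each feature's score reflects the total impurity reduction from splits on that feature, averaged over all trees and normalized so the scores sum to 1. Features with higher scores have a greater influence on the model's splits. Identifying them helps with feature selection and provides insight into the data. Note that impurity-based scores are computed from the training data and can favor features with many distinct values.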


In [None]:
importances = rf_clf.feature_importances_
indices = np.argsort(importances)[::-1]

print("Feature Importances:")
for idx in indices:
    print(f"{iris.feature_names[idx]}: {importances[idx]:.4f}")

# Visualize Feature Importance
plt.figure(figsize=(8, 6))
plt.title("Feature Importances in Random Forest")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [iris.feature_names[i] for i in indices], rotation=45)
plt.ylabel('Importance Score')
plt.tight_layout()
plt.show()
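
The `feature_importances_` values above are derived from the training data. As a cross-check, one option is permutation importance, which measures how much the test accuracy drops when a single feature's values are shuffled. The sketch below is self-contained and assumes scikit-learn's `sklearn.inspection.permutation_importance`; `n_repeats=10` is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=14
)
rf = RandomForestClassifier(n_estimators=100, random_state=14).fit(X_train, y_train)

# Shuffle each feature n_repeats times on held-out data and record the accuracy drop
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=14)

for idx in result.importances_mean.argsort()[::-1]:
    print(
        f"{iris.feature_names[idx]}: "
        f"{result.importances_mean[idx]:.4f} +/- {result.importances_std[idx]:.4f}"
    )
```

If the two rankings disagree noticeably, the permutation-based one is often considered the more reliable guide to what the model actually relies on for held-out data.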

## Conclusion
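We explored Decision Trees and Random Forests, visualized their structures and results, and learned how to interpret feature importance. Random Forests typically outperform single Decision Trees because averaging many decorrelated trees reduces variance and overfitting, although on a small, easily separable dataset like Iris the difference may be small.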
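
A single 70/30 split leaves only 45 test samples, so the two accuracy figures can shift with the random seed. As a hedged follow-up, the sketch below compares both models with 5-fold cross-validation; the fold count and the `models` dictionary are illustrative choices rather than part of the original analysis.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=14),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=14),
}

# 5-fold cross-validated accuracy for each model
cv_scores = {name: cross_val_score(model, X, y, cv=5) for name, model in models.items()}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```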