# Module 4.4: Decision Trees and Random Forests

Linear and Logistic Regression are powerful, but both assume a roughly linear relationship between the features and the target (for Logistic Regression, between the features and the log-odds of the target). What happens when the patterns are more complex? We turn to non-linear models like **Decision Trees**. 🌳

**What is a Decision Tree?**
A Decision Tree is like a flowchart of `if-then` questions. At each step it splits the data on one feature, producing smaller and smaller groups until it reaches a leaf that gives the final prediction. Decision Trees are intuitive and easy to interpret.
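
To make the flowchart idea concrete, here is a minimal hand-written sketch of what a very small tree amounts to. The function name `predict_purchase` and both thresholds are hypothetical, chosen only for illustration. They are not learned from the dataset used later in this notebook.

```python
def predict_purchase(age, estimated_salary):
    """Hand-written stand-in for a tiny decision tree.

    The thresholds (42 and 90000) are made up for illustration;
    a real tree learns its thresholds from the training data.
    """
    if age <= 42:                      # first question (root node)
        if estimated_salary <= 90000:  # second question (inner node)
            return 0                   # leaf: predicts "did not purchase"
        return 1                       # leaf: predicts "purchased"
    return 1                           # leaf: predicts "purchased"


print(predict_purchase(30, 50000))
print(predict_purchase(30, 120000))
print(predict_purchase(50, 20000))
```

A trained `DecisionTreeClassifier` encodes the same kind of nested questions, only with thresholds chosen automatically.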



**The Problem with a Single Tree:** A single Decision Tree can easily **overfit** the training data. It might learn the data's noise and quirks too well, causing it to perform poorly on new, unseen data.

**The Solution: Random Forests!**
A Random Forest is an **ensemble** model. Instead of one tree, it builds many Decision Trees, each trained on a random sample of the rows and considering a random subset of the features at each split, and takes a majority vote for the final prediction. This "wisdom of the crowd" approach usually makes it more robust than a single tree, and often more accurate.
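
The voting step can be sketched in a few lines of plain Python. The three "trees" below (`tree_a`, `tree_b`, `tree_c`) are hypothetical hand-written rules, not trained models. The point is only to show how individual predictions are combined.

```python
from collections import Counter


# Three hypothetical "trees", each a simple rule on (age, salary).
def tree_a(age, salary):
    return 1 if age > 42 else 0


def tree_b(age, salary):
    return 1 if salary > 90000 else 0


def tree_c(age, salary):
    return 1 if age > 35 and salary > 60000 else 0


def forest_predict(age, salary, trees=(tree_a, tree_b, tree_c)):
    """Return the class predicted by most trees.

    An odd number of trees avoids ties in a two-class problem.
    """
    votes = [tree(age, salary) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


print(forest_predict(45, 70000))  # votes: 1, 0, 1
print(forest_predict(30, 95000))  # votes: 0, 1, 0
```

A single tree that is wrong on a given example can be outvoted by the others, which is where the added robustness comes from.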

**Goal of this Notebook:**
1.  Train a single Decision Tree Classifier.
2.  Train a Random Forest Classifier on the same data.
3.  Compare their performance.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

### Dataset Setup

We will reuse the `Social_Network_Ads.csv` dataset from the previous notebook. This allows us to directly compare the performance of these new models against Logistic Regression.

In [None]:
df = pd.read_csv('../02_Data_Analysis_and_Wrangling/data/Social_Network_Ads.csv')

# Define Features and Target
X = df[['Age', 'EstimatedSalary']]
y = df['Purchased']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

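> **Important Note:** Decision Trees and Random Forests do **not** require feature scaling. Each split compares a single feature against a threshold, so only the ordering of values within a feature matters, not their scale. This is a practical advantage over models such as Logistic Regression.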
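
One way to check this claim is to fit the same tree on raw and on standardized features and compare the predictions. The sketch below uses synthetic data as a stand-in for `Age` and `EstimatedSalary` (it does not read the CSV), and the labelling rule is an assumption made up for the demo.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for Age and EstimatedSalary (not the real dataset)
rng = np.random.default_rng(0)
age = rng.uniform(18, 60, size=400)
salary = rng.uniform(15000, 150000, size=400)
X = np.column_stack([age, salary])
y = ((age > 42) | (salary > 90000)).astype(int)  # made-up labelling rule

X_train, X_test = X[:300], X[300:]
y_train = y[:300]

# Fit one tree on the raw features and one on standardized features
scaler = StandardScaler().fit(X_train)
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train
)

# Fraction of test rows where both trees give the same prediction
pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))
agreement = float(np.mean(pred_raw == pred_scaled))
print(f"Agreement between raw and scaled trees: {agreement:.2f}")
```

Because standardization preserves the ordering of values within each feature, the two trees are expected to make essentially the same splits.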

## 1. Training a Decision Tree Classifier

In [None]:
# Instantiate and train the model
dt_model = DecisionTreeClassifier(random_state=0)
dt_model.fit(X_train, y_train)

# Make predictions
dt_predictions = dt_model.predict(X_test)

# Evaluate the model
print("--- Decision Tree Results ---")
print(f"Accuracy: {accuracy_score(y_test, dt_predictions):.2f}\n")
print(classification_report(y_test, dt_predictions))
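
The overfitting concern raised earlier can be checked by comparing training accuracy with test accuracy. The sketch below does this on synthetic, deliberately noisy data from `make_classification` rather than on the notebook's CSV, and `max_depth=3` is an arbitrary value picked for comparison, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data with 10% label noise (not the notebook's CSV)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Compare a fully grown tree with a depth-limited one
results = {}
for name, depth in [("unconstrained", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    train_acc, test_acc = results[name]
    print(f"{name:>13}: train={train_acc:.2f}  test={test_acc:.2f}")
```

A large gap between training and test accuracy for the unconstrained tree is the usual sign that it has memorized noise in the training set.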

## 2. Training a Random Forest Classifier

In [None]:
# Instantiate and train the model
# n_estimators is the number of trees in the forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
rf_model.fit(X_train, y_train)

# Make predictions
rf_predictions = rf_model.predict(X_test)

# Evaluate the model
print("--- Random Forest Results ---")
print(f"Accuracy: {accuracy_score(y_test, rf_predictions):.2f}\n")
print(classification_report(y_test, rf_predictions))
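
It can help to look inside a fitted forest. The sketch below, again on synthetic data rather than the notebook's CSV, checks that the forest holds one fitted tree per `n_estimators`. It also shows that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities, which is a soft variant of the majority vote described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-feature data (not the notebook's CSV)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The fitted trees are exposed through the estimators_ attribute
print(len(forest.estimators_))

# Average the per-tree class probabilities by hand ...
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# ... and compare with what the forest itself reports
matches = np.allclose(mean_proba, forest.predict_proba(X))
print(matches)
```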

## 3. Comparison

Let's compare the results:
* **Logistic Regression Accuracy:** ~88%
* **Decision Tree Accuracy:** ~92%
* **Random Forest Accuracy:** ~92%

In this case, both the Decision Tree and the Random Forest outperform the simpler Logistic Regression model. The Random Forest matches the single tree's accuracy here, but because it combines many trees it is generally less prone to overfitting the training data and tends to generalize better to new data, which makes it a common default choice.
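
A comparison based on a single train-test split can shift with the choice of `random_state`. One common way to get a steadier picture is k-fold cross-validation, sketched below on synthetic data (not the notebook's CSV). The 5-fold setting and the scaling pipeline for Logistic Regression are assumptions made for this sketch, and the printed numbers will not match the accuracies listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data with 10% label noise (not the notebook's CSV)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)

models = {
    # Logistic Regression gets a scaler; the tree-based models do not need one
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Accuracy on each of 5 folds, summarized as mean and standard deviation
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```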

## ✅ What's Next?

You've now trained and compared three of the most common classification algorithms!

In the final notebook of this module, we will formalize our understanding of **Model Evaluation Metrics**. We'll review the metrics we've used and introduce a powerful visualization tool called the ROC curve to better understand the trade-offs in a model's performance.