# 🌳 Decision Tree Streamlit App - Complete Implementation Guide

## 🎯 **Project Overview**
Build an interactive Streamlit web application that walks through a Decision Tree workflow: data exploration, model training, and tree visualization.

## 📊 **Core Features**

### 1. **Data Input & Management**
- **CSV file upload** with drag & drop interface
- **Built-in datasets**: Iris, Wine, Breast Cancer, Diabetes, Titanic
- **Data preview** with head/tail options
- **Automatic data type detection**
- **Missing value handling** options
- **Data statistics** display
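
The data-input features above could be backed by a few small helpers. This is a minimal sketch, not the app's actual code: the names `BUILTIN_LOADERS`, `load_builtin`, `load_csv`, `missing_value_report`, and `handle_missing` are hypothetical, and only two missing-value strategies are shown. Titanic is left out because scikit-learn does not bundle it; it would need a separate source.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris, load_wine

# Datasets that ship with scikit-learn (Titanic is not bundled, so it is omitted).
BUILTIN_LOADERS = {
    "Iris": load_iris,
    "Wine": load_wine,
    "Breast Cancer": load_breast_cancer,
    "Diabetes": load_diabetes,
}


def load_builtin(name: str) -> pd.DataFrame:
    """Return a built-in dataset as one DataFrame that includes a 'target' column."""
    return BUILTIN_LOADERS[name](as_frame=True).frame


def load_csv(file_like) -> pd.DataFrame:
    """Read a CSV from a path or any file-like object (for example an uploaded file)."""
    return pd.read_csv(file_like)


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise the detected dtype and the number of missing values per column."""
    return pd.DataFrame({"dtype": df.dtypes.astype(str), "missing": df.isna().sum()})


def handle_missing(df: pd.DataFrame, strategy: str = "drop") -> pd.DataFrame:
    """Apply one of two illustrative strategies: drop rows, or fill numeric columns with the mean."""
    if strategy == "drop":
        return df.dropna()
    if strategy == "mean":
        return df.fillna(df.mean(numeric_only=True))
    raise ValueError(f"Unknown strategy: {strategy}")
```

In a Streamlit script, the object returned by `st.file_uploader("Upload CSV", type="csv")` is file-like and can be passed to `load_csv` once it is not `None`; `df.head()`, `df.tail()`, and `df.describe()` cover the preview and statistics items.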

### 2. **Feature Selection**
- **Multi-select feature picker** (2-10 features recommended)
- **Target variable dropdown**
- **Feature description tooltips**
- **Correlation matrix** visualization
- **Feature distribution** histograms
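
As a rough illustration of the feature-selection items above, the sketch below computes the data behind the correlation matrix and the histograms. The helper names (`feature_count_ok`, `correlation_matrix`, `feature_histogram`) are hypothetical, and the 2-10 range is only the recommendation stated in the list, not a hard limit.

```python
import numpy as np
import pandas as pd


def feature_count_ok(features, low: int = 2, high: int = 10) -> bool:
    """Check the selection against the recommended 2-10 feature range."""
    return low <= len(features) <= high


def correlation_matrix(df: pd.DataFrame, features) -> pd.DataFrame:
    """Pearson correlation between the selected numeric features (non-numeric ones are skipped)."""
    return df[list(features)].select_dtypes("number").corr()


def feature_histogram(df: pd.DataFrame, feature: str, bins: int = 10):
    """Return (counts, bin_edges) for one feature, ignoring missing values."""
    return np.histogram(df[feature].dropna(), bins=bins)
```

In the app, `st.multiselect` and `st.selectbox` would be natural choices for the feature picker and the target dropdown, and the returned matrix can be shown with `st.dataframe` or drawn as a heatmap with a plotting library.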

### 3. **Model Configuration**
#### Algorithm Type
- **Auto-detect** classification vs regression
- **Manual mode selection**
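
Auto-detection could rely on a simple heuristic over the target column. The rule below is an assumption, not something the guide specifies: non-numeric or boolean targets, and numeric targets with few distinct whole-number values, are treated as classification; everything else as regression. The function name `detect_task` and the `max_classes` threshold of 20 are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_float_dtype, is_numeric_dtype


def detect_task(target: pd.Series, max_classes: int = 20) -> str:
    """Guess 'classification' or 'regression' from the target column (heuristic)."""
    # Strings, categories, and booleans are labels, not quantities.
    if not is_numeric_dtype(target) or is_bool_dtype(target):
        return "classification"
    values = target.dropna()
    # Floats with fractional parts are treated as continuous.
    if is_float_dtype(values) and not (values == values.round()).all():
        return "regression"
    # Whole-number targets: few distinct values suggest class labels.
    return "classification" if values.nunique() <= max_classes else "regression"
```

Because any such heuristic can guess wrong (for example, an integer rating from 1 to 5 may be either), the manual mode selection remains useful as an override.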

#### Hyperparameters Panel
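
The values collected from this panel could be passed to a scikit-learn estimator roughly as follows. This is a sketch under assumptions: `build_tree` is a hypothetical helper, the defaults are illustrative, and the regression criterion is written as `squared_error`, the name current scikit-learn releases use for mean squared error.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor


def build_tree(task, max_depth=5, min_samples_split=2, min_samples_leaf=1,
               criterion=None, max_features=None, random_state=42,
               ccp_alpha=0.0, class_weight=None):
    """Create an unfitted decision tree for the given task from panel values."""
    common = dict(
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        random_state=random_state,
        ccp_alpha=ccp_alpha,
    )
    if task == "classification":
        # class_weight only exists on the classifier.
        return DecisionTreeClassifier(
            criterion=criterion or "gini", class_weight=class_weight, **common
        )
    return DecisionTreeRegressor(criterion=criterion or "squared_error", **common)
```

In a Streamlit sidebar, each argument would typically come from a widget, for example `st.sidebar.slider("max_depth", 1, 20, 5)` for the depth and `st.sidebar.selectbox` for the criterion.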
```python
# Tree Structure
- max_depth: 1-20 (slider)
- min_samples_split: 2-20 (slider)
- min_samples_leaf: 1-20 (slider)

# Splitting Criteria
- criterion: gini/entropy/log_loss (classification), squared_error/absolute_error (regression; named mse/mae in older scikit-learn releases)
- max_features: sqrt, log2, None, or a custom int/float

# Advanced Options
- random_state: number input
- ccp_alpha: 0-0.1 (pruning)
- class_weight: balanced, None (classification only)

```

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris into a DataFrame and append the class label as 'target'
iris = load_iris()
new_df = pd.DataFrame(iris.data, columns=iris.feature_names)
new_df['target'] = iris.target
new_df
```

```python
feature1 = 'sepal length (cm)'  # feature to split on
feature2 = 'target'             # column whose squared error is minimised
df = new_df.copy()
splitting_points = []

# Keep splitting until 20 or fewer rows remain
while len(df) > 20:
    error = {}

    # Candidate thresholds are midpoints between consecutive rows
    # (in row order, not sorted order); record the total SSE for each
    for i in range(len(df) - 1):
        avg_x = (df[feature1].iloc[i] + df[feature1].iloc[i + 1]) / 2
        above_rows = df[df[feature1] > avg_x]
        below_rows = df[df[feature1] <= avg_x]

        # Skip thresholds that leave one side empty
        if len(above_rows) > 0 and len(below_rows) > 0:
            sse_above = np.sum(np.square(above_rows[feature2] - np.mean(above_rows[feature2])))
            sse_below = np.sum(np.square(below_rows[feature2] - np.mean(below_rows[feature2])))
            total_sse = sse_above + sse_below
            error[avg_x] = total_sse

    if error:  # Only if we found valid splits
        # The best split is the threshold with the lowest total SSE
        best_split = min(error, key=error.get)
        splitting_points.append(best_split)

        # A full tree would recurse into both branches; this simplified
        # version follows only the larger one (ties go to the right branch)
        left_count = len(df[df[feature1] <= best_split])
        right_count = len(df[df[feature1] > best_split])

        if left_count > right_count:
            df = df[df[feature1] <= best_split]
        else:
            df = df[df[feature1] > best_split]
    else:
        break  # No more valid splits

splitting_points
```

```
[np.float64(5.550000000000001),
 np.float64(6.1),
 np.float64(7.05),
 np.float64(6.5),
 np.float64(6.25),
 np.float64(6.45)]
```
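
The loop above follows only the larger branch after each split. A recursive variant that grows both branches might look like the sketch below. It is an illustration under assumptions rather than the guide's method: `best_split` and `grow` are hypothetical names, candidate thresholds are taken as midpoints between sorted unique feature values (which differs from the row-order midpoints used above, so the thresholds it finds need not match the output shown), and the stopping rule reuses the 20-row limit.

```python
import numpy as np


def best_split(x, y):
    """Return (threshold, total_sse) with the lowest SSE, or None if no split exists."""
    values = np.unique(x)                        # sorted unique feature values
    candidates = (values[:-1] + values[1:]) / 2  # midpoints between neighbours
    best = None
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[1]:
            best = (float(t), float(sse))
    return best


def grow(x, y, min_samples=20):
    """Recursively split both branches until a node has min_samples rows or fewer."""
    split = best_split(x, y) if len(y) > min_samples else None
    if split is None:
        return {"leaf": True, "n": int(len(y)), "value": float(y.mean())}
    threshold, _ = split
    mask = x <= threshold
    return {
        "leaf": False,
        "threshold": threshold,
        "left": grow(x[mask], y[mask], min_samples),
        "right": grow(x[~mask], y[~mask], min_samples),
    }
```

With the Iris DataFrame from earlier, it could be called as `grow(new_df['sepal length (cm)'].to_numpy(), new_df['target'].to_numpy().astype(float))`, which returns a nested dictionary describing the tree.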