Decision Trees:

Introduction:  
A decision tree is a simple and visual way of making decisions or solving problems. 
It works like a flowchart, where each step (or node) represents a decision or a test based on a condition, and the branches represent the outcomes of that decision.
Example:
Suppose you want to decide whether to go out for a picnic. The decision process could look like this:
Step 1: Check the weather.
If it's sunny, go to Step 2.
If it's rainy, don't go for a picnic.
Step 2: Check your friends' availability.
If they are free, go for a picnic.
If they are busy, stay at home.
This decision process can be represented visually as a decision tree.
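
As a minimal sketch, the picnic decision above can be written as nested conditions in Python. The function name and the string values ("sunny", "rainy") are illustrative choices, and the sketch assumes the weather is always one of those two values.

```python
def should_go_for_picnic(weather: str, friends_free: bool) -> str:
    """Walk the picnic decision process from Step 1 to a final outcome."""
    # Step 1: check the weather.
    if weather == "sunny":
        # Step 2: check your friends' availability.
        if friends_free:
            return "go for a picnic"
        return "stay at home"
    # Rainy weather ends the decision immediately.
    return "don't go for a picnic"


print(should_go_for_picnic("sunny", True))   # go for a picnic
print(should_go_for_picnic("sunny", False))  # stay at home
print(should_go_for_picnic("rainy", True))   # don't go for a picnic
```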


Definition:
A decision tree is a machine learning algorithm used for both classification (categorizing data) and regression (predicting continuous values). It uses a tree-like structure where:
Root Node: The starting point (e.g., checking the weather).
Internal Nodes: Points where a decision is made (e.g., is it sunny?).
Branches: Possible outcomes of a decision (e.g., sunny, rainy).
Leaf Nodes: Final outcomes (e.g., go for a picnic, stay at home).
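
One way to picture these four parts is as a small data structure. The sketch below is only an illustration: the Node class, its field names, and the decide function are hypothetical and are not how any particular library stores its trees.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    # Question asked at a root or internal node; None on a leaf.
    question: Optional[str] = None
    # Branches: each possible answer leads to a child node.
    branches: Dict[str, "Node"] = field(default_factory=dict)
    # Final outcome; only set on leaf nodes.
    outcome: Optional[str] = None


def decide(node: Node, answers: Dict[str, str]) -> str:
    """Follow branches from the root until a leaf node is reached."""
    while node.outcome is None:
        node = node.branches[answers[node.question]]
    return node.outcome


# Root node: the weather check. Its "sunny" branch leads to an internal node.
tree = Node(
    question="weather",
    branches={
        "rainy": Node(outcome="don't go for a picnic"),
        "sunny": Node(
            question="friends",
            branches={
                "free": Node(outcome="go for a picnic"),
                "busy": Node(outcome="stay at home"),
            },
        ),
    },
)

print(decide(tree, {"weather": "sunny", "friends": "busy"}))  # stay at home
```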


Algorithms:
1. ID3 (Iterative Dichotomiser 3)
How it works:
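Uses information gain, the reduction in entropy (uncertainty about the class label) that a split produces, to decide which feature to split on at each step.
The feature that reduces entropy the most is placed at the root, and the process then repeats within each branch.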
Example: If you’re choosing between "weather" and "temperature" to split the data, the feature with the highest information gain is selected.
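
To make the "weather" versus "temperature" comparison concrete, here is a small sketch of the information gain calculation. The four rows are made up for illustration, and the helper names (entropy, information_gain) are not taken from any library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    before = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


# Made-up rows for illustration only.
rows = [
    {"weather": "sunny", "temperature": "hot", "picnic": "yes"},
    {"weather": "sunny", "temperature": "mild", "picnic": "yes"},
    {"weather": "rainy", "temperature": "hot", "picnic": "no"},
    {"weather": "rainy", "temperature": "mild", "picnic": "no"},
]

gains = {f: information_gain(rows, f, "picnic") for f in ("weather", "temperature")}
best_feature = max(gains, key=gains.get)
print(best_feature)  # weather
```

In this toy data, "weather" separates the two outcomes completely while "temperature" tells us nothing, so ID3 would split on "weather" first.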
2. CART (Classification and Regression Trees)
How it works:
Uses Gini impurity for classification and mean squared error (MSE) for regression to decide splits.
Produces binary splits (each node has two branches).
Example: Deciding if a student passes a class based on "hours studied" and "class attendance."
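
The Gini impurity of a group is 1 minus the sum of the squared class proportions. The sketch below scores one binary split on "hours studied". The student records and the threshold of 3 hours are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for a more mixed group."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a binary split, weighting each side by its size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up (hours_studied, result) records for illustration only.
students = [(1, "fail"), (2, "fail"), (3, "fail"), (4, "pass"), (5, "pass"), (6, "pass")]

# Binary split: hours_studied <= 3 goes left, everything else goes right.
left = [result for hours, result in students if hours <= 3]
right = [result for hours, result in students if hours > 3]

print(gini([result for _, result in students]))  # 0.5
print(weighted_gini(left, right))                # 0.0
```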
3. C4.5
How it works:
An improvement over ID3.
Uses gain ratio, which divides information gain by the split information to correct ID3's bias toward features with many distinct values.
Handles both categorical and continuous data.
Example: Handling cases where student performance depends on both "test scores" (numerical) and "attendance" (categorical).
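
The sketch below illustrates why gain ratio helps. A made-up "id" column that is unique per row gets the same information gain as "weather", but its gain ratio is lower because its split information is larger. The rows and helper names are assumptions for illustration. C4.5's handling of continuous features is not shown.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(rows, feature, target):
    before = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


def gain_ratio(rows, feature, target):
    """Information gain divided by split information (entropy of the feature itself)."""
    split_info = entropy([row[feature] for row in rows])
    if split_info == 0:
        return 0.0  # the feature has a single value, so it cannot split the data
    return information_gain(rows, feature, target) / split_info


# Made-up rows for illustration only; "id" is unique for every row.
rows = [
    {"id": 1, "weather": "sunny", "picnic": "yes"},
    {"id": 2, "weather": "sunny", "picnic": "yes"},
    {"id": 3, "weather": "rainy", "picnic": "no"},
    {"id": 4, "weather": "rainy", "picnic": "no"},
]

for feature in ("id", "weather"):
    print(feature, information_gain(rows, feature, "picnic"), gain_ratio(rows, feature, "picnic"))
```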
4. CHAID (Chi-squared Automatic Interaction Detector)
How it works:
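Uses the Chi-squared test of independence to find the feature most strongly associated with the target, and splits on it.
Can produce splits with more than two branches, and is often applied to large datasets such as survey or market segmentation data.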
Example: Analyzing survey data to segment customers based on their preferences.
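
As a rough sketch of the statistic behind CHAID, the code below computes the Chi-squared value for two contingency tables of made-up survey counts. Only the statistic is shown. The later parts of CHAID (turning the statistic into a p-value and merging similar categories) are left out, and the table values and function name are assumptions for illustration.

```python
def chi_squared(table):
    """Pearson Chi-squared statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the feature and target were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up survey counts. Rows are customer groups; columns are "prefers A", "prefers B".
by_age_group = [[30, 10], [10, 30]]
by_region = [[20, 20], [20, 20]]

print(chi_squared(by_age_group))  # 20.0
print(chi_squared(by_region))     # 0.0
```

In this toy example, age group is strongly associated with product preference while region is not, so age group would be the better candidate to split on.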
5. Random Forest (Extension of Decision Trees)
How it works:
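Trains many decision trees, each on a random sample of the rows and features, and combines their predictions by majority vote (classification) or averaging (regression). This is an ensemble method.
Reduces the risk of overfitting compared with a single deep tree.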
Example: Predicting house prices using various features like location, size, and number of bedrooms.
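
A minimal random forest sketch with scikit-learn follows. It uses the copy of the Iris data that ships with scikit-learn (load_iris) rather than house prices, so it is a classification example. The choice of 100 trees and the random_state values are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Built-in Iris data: four measurements per flower, three species.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

# Each of the 100 trees is trained on a bootstrap sample of the training rows.
forest = RandomForestClassifier(n_estimators=100, random_state=10)
forest.fit(X_train, y_train)

# The forest's prediction is the combined vote of its individual trees.
forest_accuracy = accuracy_score(y_test, forest.predict(X_test))
print(len(forest.estimators_), forest_accuracy)
```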

Steps of the CART Algorithm:
1) Start with the Entire Dataset
Begin with the full dataset and treat it as the root node.
2) Choose the Best Split
Evaluate all possible splits for each feature (e.g., "age < 30" or "income > 50k").
Use a splitting criterion:
For Classification: Use Gini Impurity to measure the "purity" of the groups after the split.
For Regression: Use Mean Squared Error (MSE) to measure how well the split predicts the target.
3) Split the Data
Divide the dataset into two groups based on the best split chosen.
4) Repeat for Each Subset
Treat each subset as a new "node" and repeat steps 2 and 3.
Continue splitting until a stopping condition is met:
The node is pure (contains data of only one class).
The maximum tree depth is reached.
The node has fewer samples than the minimum required for a split.
5) Prune the Tree (Optional)
To prevent overfitting, remove branches that add little predictive power.
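
Steps 1 to 4 can be sketched in plain Python for the classification case. This is a simplified illustration, not a full CART implementation. It tries each observed value as a threshold, splits only when the weighted Gini impurity improves, and skips pruning (step 5). The function names, the nested-dict tree format, and the student data are all assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, labels):
    """Step 2: return the (feature_index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best


def build(rows, labels, depth=0, max_depth=3, min_samples=2):
    """Steps 3 and 4: split the data and recurse until a stopping condition is met."""
    split = None
    if depth < max_depth and len(rows) >= min_samples and len(set(labels)) > 1:
        split = best_split(rows, labels)
    if split is None:
        # Leaf node: predict the most common class among the remaining samples.
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build([r for r, _ in left], [y for _, y in left], depth + 1, max_depth, min_samples),
        "right": build([r for r, _ in right], [y for _, y in right], depth + 1, max_depth, min_samples),
    }


def predict(node, row):
    """Follow the splits from the root until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node


# Step 1: the full (made-up) dataset. Columns: hours studied, attendance in percent.
rows = [(1, 60), (2, 90), (3, 70), (6, 80), (7, 50), (8, 95)]
labels = ["fail", "fail", "fail", "pass", "pass", "pass"]

tree = build(rows, labels)
print(tree)
print(predict(tree, (5, 40)))  # pass
```

In this toy data, a single split on hours studied separates the classes, so both children are pure and the recursion stops after one level.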

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

In [7]:
df = pd.read_csv(r"C:\Users\iamni\OneDrive\Desktop\Iris.csv")  # load the Iris CSV from a local path into a DataFrame

In [10]:
df.head()  # preview the first five rows

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


In [13]:
df.shape, df.size  # (rows, columns) and total number of cells

((150, 6), 900)

In [24]:
df.info()  # column names, non-null counts, and dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [23]:
df.describe()  # summary statistics for the numeric columns

              Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count      150.0          150.0         150.0          150.0         150.0
mean        75.5       5.843333         3.054       3.758667      1.198667
std    43.445368       0.828066      0.433594        1.76442      0.763161
min          1.0            4.3           2.0            1.0           0.1
25%        38.25            5.1           2.8            1.6           0.3
50%         75.5            5.8           3.0           4.35           1.3
75%       112.75            6.4           3.3            5.1           1.8
max        150.0            7.9           4.4            6.9           2.5


In [28]:
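# Features: all columns except the label. Id is a row identifier, not a measurement;
# df.drop(['Id', 'Species'], axis=1) would keep the tree from splitting on it.
x = df.drop('Species', axis=1)
y = df['Species']  # target: the species label to predict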

In [29]:
# Hold out 30% of the rows for testing; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)

In [32]:
# Fit a CART-style classifier; scikit-learn's default split criterion is Gini impurity.
model = DecisionTreeClassifier()
model.fit(x_train, y_train)

In [35]:
# Predict species for the held-out rows and compare against the true labels.
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)

In [36]:
accuracy

1.0
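
To see which rules a fitted tree has learned, scikit-learn can print it as text with export_text. The sketch below is self-contained, so it uses the built-in load_iris data instead of the CSV above. The max_depth=2 limit and random_state=10 are assumptions chosen to keep the printed tree short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules easy to read.
clf = DecisionTreeClassifier(max_depth=2, random_state=10)
clf.fit(iris.data, iris.target)

# Each indented line is a test at a node; "class:" lines are leaf nodes.
rules = export_text(clf, feature_names=iris.feature_names)
print(rules)
```

The same call should work on the model trained above, for example export_text(model, feature_names=list(x.columns)).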