                                                    Decision Tree                                                           

A decision tree is a supervised machine learning algorithm used to solve classification and regression problems.
It works by splitting the dataset into smaller subsets based on feature values, leading to a tree-like model of decisions. Each internal node in the tree represents a decision based on a feature, each branch represents the outcome of that decision, and each leaf node represents the final output (class label or predicted value).

Decision Tree Construction:-

1. Selecting the Best Split: The core of the decision tree algorithm is to decide where to split the data at each node. This is based on criteria that measure the "impurity" of the dataset. The most common criteria are:

- Gini Impurity: Measures how often a randomly chosen sample at the node would be misclassified if it were labeled at random according to the node's class distribution. It's defined as:
    - Gini = 1 − ∑ 𝑝𝑖^2 (summed over 𝑖 = 1, ..., 𝐶)
        where 
        - 𝐶 is the number of classes,
        - 𝑝𝑖 is the proportion of samples belonging to class 𝑖 at a given node.
    - A Gini impurity of 0 means the node is pure (all samples belong to one class).

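- Entropy (used for Information Gain): Measures the amount of uncertainty or impurity in the data. It’s calculated as:
    - Entropy = − ∑ 𝑝𝑖 · log2(𝑝𝑖) (summed over the classes 𝑖 present at the node)
        where 
        - 𝑝𝑖 is the proportion of samples of class 𝑖 at the node.
    - Lower entropy means a purer node. Information gain is the reduction in entropy achieved by a split.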

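- Mean Squared Error (for Regression Trees): Measures the average squared difference between actual and predicted values at a node, where the prediction is typically the mean target value of the node's samples. It plays the role of the impurity measure when choosing splits in regression trees.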

    The algorithm calculates the impurity before and after each potential split and selects the split that results in the largest reduction in impurity.
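
The impurity measures above can be checked by hand on a tiny example. The following is a minimal sketch in plain Python, not the scikit-learn implementation; the function names (`gini`, `entropy`, `impurity_reduction`) and the four-sample label list are made up for illustration.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the classes present in labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i)) over the classes present in labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def impurity_reduction(parent, children, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# Hypothetical node with two classes in equal proportion.
parent = ["buy", "buy", "no", "no"]
print(gini(parent))     # 0.5
print(entropy(parent))  # 1.0

# A split that separates the classes perfectly removes all impurity.
print(impurity_reduction(parent, [["buy", "buy"], ["no", "no"]]))  # 0.5
```

Passing `impurity=entropy` to `impurity_reduction` gives the information gain of the same split.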

2. Splitting the Data: Once the best split is identified, the data is divided into two or more subsets based on the feature and value chosen. This splitting process is repeated recursively on each subset, creating branches and nodes in the tree.
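
To make the search-and-split step concrete, here is a rough sketch for a single numeric feature. It assumes binary splits of the form `value <= threshold` with candidate thresholds at the midpoints between distinct values, which is one common convention rather than the only one. The helper `best_threshold` and the toy `ages`/`bought` data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, reduction) for the best split `value <= threshold`.

    Returns (None, 0.0) if no candidate split reduces the impurity.
    """
    n = len(labels)
    parent_impurity = gini(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        reduction = parent_impurity - weighted
        if reduction > best[1]:
            best = (threshold, reduction)
    return best

# Toy data, invented for illustration: age and whether the person bought (1) or not (0).
ages = [22, 25, 47, 52, 46, 56]
bought = [0, 0, 1, 1, 1, 0]

threshold, reduction = best_threshold(ages, bought)
print(threshold, round(reduction, 3))  # 35.5 0.25
```

A full tree builder would run this search over every feature, split on the winner, and then call itself on the left and right subsets.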

3. Stopping Criteria: To prevent the decision tree from growing indefinitely (which can lead to overfitting), certain stopping criteria are applied:
    - Maximum Depth: Limits the depth of the tree.
    - Minimum Samples per Leaf: Requires a minimum number of samples at each leaf node.
    - Minimum Information Gain: If the reduction in impurity is below a certain threshold, splitting stops.
When one of these criteria is met, the node becomes a leaf and represents the final prediction (class label or continuous value).
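
In scikit-learn, these three stopping criteria correspond to the `max_depth`, `min_samples_leaf`, and `min_impurity_decrease` parameters of `DecisionTreeClassifier`. The sketch below uses synthetic data from `make_classification` as a stand-in, and the specific values (4, 5, 0.01) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the dataset used elsewhere in this notebook.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

limited_tree = DecisionTreeClassifier(
    max_depth=4,                 # Maximum Depth
    min_samples_leaf=5,          # Minimum Samples per Leaf
    min_impurity_decrease=0.01,  # Minimum reduction in impurity required to split
    random_state=0,
)
limited_tree.fit(X, y)

# Inspect how large the fitted tree actually became.
print(limited_tree.get_depth(), limited_tree.get_n_leaves())
```

Tightening any of these values produces a smaller tree, which usually trades some training accuracy for better generalization.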

4. Prediction: Once the tree is built, predictions for new data are made by following the path from the root node to a leaf node based on the feature values of the data point. The leaf node’s label or value is used as the prediction.
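
The root-to-leaf walk can be sketched in a few lines. This assumes a binary tree with numeric `value <= threshold` tests; the `Node` class, `predict_one` function, and the hand-built tree are hypothetical and only illustrate the traversal.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    # A leaf sets `prediction`; an internal node sets the other four fields.
    prediction: Optional[int] = None
    feature: Optional[int] = None      # index of the feature to test
    threshold: Optional[float] = None  # go left if sample[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def predict_one(node: Node, sample: Sequence[float]) -> int:
    """Follow the path from the root to a leaf and return the leaf's prediction."""
    while node.prediction is None:
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.prediction

# Hand-built tree: test feature 0 first, then feature 1 on the right branch.
tree = Node(
    feature=0, threshold=0.5,
    left=Node(prediction=0),
    right=Node(feature=1, threshold=1.0,
               left=Node(prediction=1),
               right=Node(prediction=0)),
)

print(predict_one(tree, [0.2, 3.0]))  # 0
print(predict_one(tree, [0.9, 0.4]))  # 1
print(predict_one(tree, [0.9, 2.0]))  # 0
```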



Example of a Simple Decision Tree (Manual Construction):-

Suppose we want to classify whether a person will buy a product based on two features: Age (Young, Middle-aged, Senior) and Income (Low, High).

1. Root Node: Start with all data points.
2. Choose Split: Based on Gini impurity or entropy, determine the best feature to split on. Suppose Age is selected.
3. Create Branches: Split the data into branches based on Age values:
    - If Age = Young: Check Income.
    - If Age = Middle-aged: Predict Will Buy.
    - If Age = Senior: Predict Won't Buy.
4. Further Split: Repeat the process on branches where necessary (e.g., Age = Young branch might split further based on Income).

This structure results in a tree where each path from root to leaf represents a set of rules for predicting the target.
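
The example tree can be written down as a nested dictionary and walked with a short function. Note that the text does not say what the Income check predicts for young buyers, so the two outcomes under Age = Young below are assumptions added purely for illustration. The names `example_tree` and `classify` are also hypothetical.

```python
# Leaves are strings; internal nodes are dicts of the form
# {"feature": name, "branches": {feature_value: subtree}}.
example_tree = {
    "feature": "Age",
    "branches": {
        "Young": {
            "feature": "Income",
            # Assumed outcomes: the original example leaves these unspecified.
            "branches": {"High": "Will Buy", "Low": "Won't Buy"},
        },
        "Middle-aged": "Will Buy",
        "Senior": "Won't Buy",
    },
}

def classify(tree, person):
    """Walk from the root to a leaf using the person's feature values."""
    while isinstance(tree, dict):
        tree = tree["branches"][person[tree["feature"]]]
    return tree

print(classify(example_tree, {"Age": "Middle-aged", "Income": "Low"}))  # Will Buy
print(classify(example_tree, {"Age": "Senior", "Income": "High"}))      # Won't Buy
print(classify(example_tree, {"Age": "Young", "Income": "High"}))       # Will Buy (assumed leaf)
```

Each root-to-leaf path reads as one rule, for example "if Age is Senior, predict Won't Buy".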

In [1]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [2]:
raw_data = pd.read_csv(r"S:\VS code\python\Data _Analytics\Dataset\Classified_Data.csv")
raw_data

,Unnamed: 0,WTT,PTI,EQW,SBI,LQE,QWG,FDJ,PJF,HQE,NXJ,TARGET CLASS
0,0,0.913917,1.162073,0.567946,0.755464,0.780862,0.352608,0.759697,0.643798,0.879422,1.231409,1
1,1,0.635632,1.003722,0.535342,0.825645,0.924109,0.648450,0.675334,1.013546,0.621552,1.492702,0
2,2,0.721360,1.201493,0.921990,0.855595,1.526629,0.720781,1.626351,1.154483,0.957877,1.285597,0
3,3,1.234204,1.386726,0.653046,0.825624,1.142504,0.875128,1.409708,1.380003,1.522692,1.153093,1
4,4,1.279491,0.949750,0.627280,0.668976,1.232537,0.703727,1.115596,0.646691,1.463812,1.419167,1
...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,1.010953,1.034006,0.853116,0.622460,1.036610,0.586240,0.746811,0.319752,1.117340,1.348517,1
996,996,0.575529,0.955786,0.941835,0.792882,1.414277,1.269540,1.055928,0.713193,0.958684,1.663489,0
997,997,1.135470,0.982462,0.781905,0.916738,0.901031,0.884738,0.386802,0.389584,0.919191,1.385504,1
998,998,1.084894,0.861769,0.407158,0.665696,1.608612,0.943859,0.855806,1.061338,1.277456,1.188063,1


In [3]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    1000 non-null   int64  
 1   WTT           1000 non-null   float64
 2   PTI           1000 non-null   float64
 3   EQW           1000 non-null   float64
 4   SBI           1000 non-null   float64
 5   LQE           1000 non-null   float64
 6   QWG           1000 non-null   float64
 7   FDJ           1000 non-null   float64
 8   PJF           1000 non-null   float64
 9   HQE           1000 non-null   float64
 10  NXJ           1000 non-null   float64
 11  TARGET CLASS  1000 non-null   int64  
dtypes: float64(10), int64(2)
memory usage: 93.9 KB


In [4]:
finalDataset = raw_data.drop(columns=['Unnamed: 0'])

In [5]:
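# Count non-null values per column: .sum() counts the True entries, whereas .count() would count every row regardless of nulls
finalDataset.notnull().sum()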

WTT             1000
PTI             1000
EQW             1000
SBI             1000
LQE             1000
QWG             1000
FDJ             1000
PJF             1000
HQE             1000
NXJ             1000
TARGET CLASS    1000
dtype: int64

In [6]:
finalDataset.describe()

,WTT,PTI,EQW,SBI,LQE,QWG,FDJ,PJF,HQE,NXJ,TARGET CLASS
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,0.949682,1.114303,0.834127,0.682099,1.032336,0.943534,0.963422,1.07196,1.158251,1.362725,0.5
std,0.289635,0.257085,0.291554,0.229645,0.243413,0.256121,0.255118,0.288982,0.293738,0.204225,0.50025
min,0.174412,0.441398,0.170924,0.045027,0.315307,0.262389,0.295228,0.299476,0.365157,0.639693,0.0
25%,0.742358,0.942071,0.615451,0.51501,0.870855,0.761064,0.784407,0.866306,0.93434,1.222623,0.0
50%,0.940475,1.118486,0.813264,0.676835,1.035824,0.941502,0.945333,1.0655,1.165556,1.375368,0.5
75%,1.163295,1.307904,1.02834,0.834317,1.19827,1.12306,1.134852,1.283156,1.383173,1.504832,1.0
max,1.721779,1.833757,1.722725,1.634884,1.65005,1.666902,1.713342,1.78542,1.88569,1.89395,1.0


Making the Decision Tree Model

In [9]:
x = finalDataset.iloc[:,:-1]
y = finalDataset['TARGET CLASS']

In [11]:
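# Note: test_size=0.8 holds out 80% of the rows (800 of 1000) for testing and trains on only 200; no random_state is set, so the split varies between runs
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8)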

In [15]:
DecisionTree_Model = DecisionTreeClassifier(random_state=42)
DecisionTree_Model.fit(x_train, y_train)

In [16]:
predications = DecisionTree_Model.predict(x_test)
predications

array([1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,

In [17]:
y_test

342    1
588    0
3      1
182    1
547    1
      ..
859    0
950    0
711    0
148    1
435    1
Name: TARGET CLASS, Length: 800, dtype: int64

Metrics Evaluation

In [18]:
print(metrics.accuracy_score(y_test, predications))

0.855
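
Accuracy alone does not show which class the errors fall on. The sketch below shows how a confusion matrix and per-class precision and recall could be obtained with `sklearn.metrics`. It runs on synthetic stand-in data from `make_classification`, not on `Classified_Data.csv`, so its numbers say nothing about the model above; the `demo_*` names and the 0.2 test fraction are assumptions for illustration.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's data: 1000 rows, 10 numeric features, 2 classes.
demo_x, demo_y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Fixed random_state for a repeatable split; stratify keeps class proportions equal in both parts.
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=0, stratify=demo_y
)

demo_model = DecisionTreeClassifier(random_state=42).fit(demo_x_train, demo_y_train)
demo_pred = demo_model.predict(demo_x_test)

# Rows are true classes, columns are predicted classes.
demo_cm = metrics.confusion_matrix(demo_y_test, demo_pred)
print(demo_cm)

# Per-class precision, recall, and F1 score.
print(metrics.classification_report(demo_y_test, demo_pred))
```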
