# **Decision Tree**

A decision tree is a supervised learning algorithm, trained on pre-labelled data, that is used for both classification and regression tasks in machine learning.
It builds a hierarchical, tree-like structure that makes decisions or predictions based on the features of a dataset.

Key Components of a Decision Tree:
- Root Node: The topmost node in the tree, representing the initial decision or test on the entire dataset.
- Internal Node (Decision Node): A node that has incoming and outgoing branches. It represents a test on a specific attribute, splitting the data into subsets based on the outcome of that test.
- Branches: The connections between nodes, representing the possible outcomes of a decision or test.
- Leaf Node (Terminal Node): A node that does not split further. It represents the final classification or predicted value.
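To make these components concrete, here is a minimal sketch that fits a tree on a tiny toy dataset and inspects its structure. The data and the names `X_toy`, `y_toy`, and `toy_tree` are made up for illustration and are not part of the baseball dataset used below.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): one feature, two cleanly separable classes.
X_toy = [[1], [2], [3], [10], [11], [12]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# The fitted structure is exposed through the tree_ attribute.
structure = toy_tree.tree_
print("total nodes:", structure.node_count)   # root, internal and leaf nodes
print("leaf nodes:", toy_tree.get_n_leaves())
print("depth:", toy_tree.get_depth())

# Text rendering: the first test is the root node; "class:" lines are leaves.
print(export_text(toy_tree, feature_names=["x"]))
```

Because a single threshold separates the two classes, this tree consists of one root node that tests `x` and two leaf nodes, with no other internal nodes.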

Because single decision trees tend to overfit and are unstable (a small change in the training data can produce a very different tree), they are often used as building blocks for more robust ensemble methods such as:
- Random Forests: Builds multiple decision trees on different subsets of the data and features, and then averages their predictions (for regression) or takes a majority vote (for classification). This reduces variance and overfitting.
- Gradient Boosting (e.g., XGBoost, LightGBM, CatBoost): Builds trees sequentially, where each new tree corrects the errors of the previous ones, improving overall accuracy.
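As a rough illustration of the ensemble methods listed above, the sketch below trains a single tree, a random forest, and scikit-learn's own `GradientBoostingClassifier` on synthetic data. The dataset from `make_classification` and all parameter values are assumptions chosen for the example; XGBoost, LightGBM, and CatBoost are separate libraries and are not used here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not the baseball dataset).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(f"{name}: {scores[name]:.3f}")
```

The ensembles often score higher than the single tree on data like this, but that is not guaranteed for every dataset or split.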

A single decision tree is rarely the most accurate model, but it is simple to code and implement.

In [248]:
import pandas as pd

df = pd.read_csv("https://github.com/RyanNolanData/YouTubeData/blob/main/500hits.csv?raw=true", encoding="latin-1")

In [249]:
df.head(20)

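|   | PLAYER | YRS | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | CS | BA | HOF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ty Cobb | 24 | 3035 | 11434 | 2246 | 4189 | 724 | 295 | 117 | 726 | 1249 | 357 | 892 | 178 | 0.366 | 1 |
| 1 | Stan Musial | 22 | 3026 | 10972 | 1949 | 3630 | 725 | 177 | 475 | 1951 | 1599 | 696 | 78 | 31 | 0.331 | 1 |
| 2 | Tris Speaker | 22 | 2789 | 10195 | 1882 | 3514 | 792 | 222 | 117 | 724 | 1381 | 220 | 432 | 129 | 0.345 | 1 |
| 3 | Derek Jeter | 20 | 2747 | 11195 | 1923 | 3465 | 544 | 66 | 260 | 1311 | 1082 | 1840 | 358 | 97 | 0.31 | 1 |
| 4 | Honus Wagner | 21 | 2792 | 10430 | 1736 | 3430 | 640 | 252 | 101 | 0 | 963 | 327 | 722 | 15 | 0.329 | 1 |
| 5 | Carl Yastrzemski | 23 | 3308 | 11988 | 1816 | 3419 | 646 | 59 | 452 | 1844 | 1845 | 1393 | 168 | 116 | 0.285 | 1 |
| 6 | Paul Molitor | 21 | 2683 | 10835 | 1782 | 3319 | 605 | 114 | 234 | 1307 | 1094 | 1244 | 504 | 131 | 0.306 | 1 |
| 7 | Eddie Collins | 25 | 2826 | 9949 | 1821 | 3315 | 438 | 187 | 47 | 520 | 1499 | 286 | 744 | 173 | 0.333 | 1 |
| 8 | Willie Mays | 22 | 2992 | 10881 | 2062 | 3283 | 523 | 140 | 660 | 1903 | 1464 | 1526 | 338 | 103 | 0.302 | 1 |
| 9 | Eddie Murray | 21 | 3026 | 11336 | 1627 | 3255 | 560 | 35 | 504 | 1917 | 1333 | 1516 | 110 | 43 | 0.287 | 1 |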


In [250]:
df = df.drop(columns=['PLAYER', 'CS'])

In [251]:
X = df.iloc[:, 0:13]
y = df.iloc[:, 13]

In [252]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

In [253]:
display(X_train.shape)
display(y_train.shape)

(372, 13)

(372,)

In [254]:
display(X_test.shape)
display(y_test.shape)

(93, 13)

(93,)

In [255]:
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree classifier.
# criterion: the function used to measure the quality of a split. 'gini' selects
#   Gini impurity, a common choice for classification: the probability that a
#   randomly chosen sample would be mislabelled if it were labelled at random
#   according to the distribution of labels in the node.
# max_depth: caps the depth of the tree to limit overfitting. With max_depth=5,
#   no path from the root to a leaf contains more than 5 splits. Adjust this
#   value to suit the dataset and requirements.
# random_state: fixes the random number generator so the result is reproducible.
dtc = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=11)
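The Gini impurity described in the comments above can be worked out by hand. The following is a minimal sketch with made-up label lists; `gini_impurity` and `weighted_gini` are hypothetical helper names, not scikit-learn functions.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2), where p_k is the share of class k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


parent = [0, 0, 0, 1]                      # 75% class 0, 25% class 1
print(gini_impurity(parent))               # 1 - (0.75**2 + 0.25**2) = 0.375
print(weighted_gini([0, 0, 0], [1]))       # a perfect split leaves 0.0 impurity
```

A split is preferred when the weighted impurity of its children is lower than the impurity of the parent node.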

In [256]:
dtc.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 5,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': 11,
 'splitter': 'best'}

In [257]:
dtc.fit(X_train, y_train)

In [258]:
Y_pred = dtc.predict(X_test)
display(Y_pred)

array([0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 1], dtype=int64)

In [259]:
print("Decision Tree Score: ", dtc.score(X_test, y_test))

Decision Tree Score:  0.8064516129032258


In [260]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

decision_tree_cm = confusion_matrix(y_test, Y_pred)
decision_tree_accuracy = accuracy_score(y_test, Y_pred)
decision_tree_cr = classification_report(y_test, Y_pred)

print("Confusion Matrix:\n", decision_tree_cm)
print("Accuracy Score:", decision_tree_accuracy)    
print("Classification Report:\n", decision_tree_cr)

Confusion Matrix:
 [[58  9]
 [ 9 17]]
Accuracy Score: 0.8064516129032258
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.87      0.87        67
           1       0.65      0.65      0.65        26

    accuracy                           0.81        93
   macro avg       0.76      0.76      0.76        93
weighted avg       0.81      0.81      0.81        93
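The figures in the classification report follow directly from the confusion matrix. As a quick check, this sketch recomputes accuracy, precision, recall, and F1 for class 1 from the matrix printed above; only the variable names are new.

```python
import numpy as np

# Confusion matrix from the output above.
# Rows are true classes, columns are predicted classes (scikit-learn's layout).
cm = np.array([[58, 9],
               [9, 17]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()            # correct predictions / all predictions
precision_1 = tp / (tp + fp)               # of predicted class 1, share that is right
recall_1 = tp / (tp + fn)                  # of actual class 1, share that is found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision_1:.2f}")
print(f"recall:    {recall_1:.2f}")
print(f"f1-score:  {f1_1:.2f}")
```

With 17 true positives, 9 false positives, and 9 false negatives, precision and recall for class 1 are both 17/26, which is why the report shows 0.65 in all three columns for that class.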



In [261]:
dtc.feature_importances_

array([0.        , 0.07195871, 0.11325707, 0.41271045, 0.02337237,
       0.        , 0.02293956, 0.04263994, 0.05943043, 0.0151489 ,
       0.01755963, 0.02734464, 0.19363829])

In [262]:
X.columns

Index(['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB',
       'BA'],
      dtype='object')

In [263]:
features = pd.DataFrame(dtc.feature_importances_, index=X.columns)

In [264]:
features.head(14)

|     | 0 |
|---|---|
| YRS | 0.0 |
| G | 0.071959 |
| AB | 0.113257 |
| R | 0.41271 |
| H | 0.023372 |
| 2B | 0.0 |
| 3B | 0.02294 |
| HR | 0.04264 |
| RBI | 0.05943 |
| BB | 0.015149 |
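The importances are easier to read once they are paired with their column names and sorted. This sketch does that with the values and column names from the `dtc.feature_importances_` and `X.columns` outputs above; the names `importances`, `columns`, and `ranked` are chosen for illustration.

```python
import pandas as pd

# Column names and importances copied from the outputs above.
columns = ['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB', 'BA']
importances = [0.0, 0.07195871, 0.11325707, 0.41271045, 0.02337237,
               0.0, 0.02293956, 0.04263994, 0.05943043, 0.0151489,
               0.01755963, 0.02734464, 0.19363829]

# A named Series sorted from most to least important feature.
ranked = pd.Series(importances, index=columns, name="importance")
ranked = ranked.sort_values(ascending=False)

print(ranked.head(5))
print("sum of importances:", round(ranked.sum(), 6))
```

For this tree, runs scored (`R`) carries the most weight, followed by batting average (`BA`) and at-bats (`AB`). The importances sum to 1 up to rounding.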


In [265]:
# criterion='entropy' chooses splits by information gain. Entropy measures the
# uncertainty (class mixing) in a node, so a good split is one that lowers it,
# meaning the feature separates the classes well.
# ccp_alpha enables minimal cost-complexity pruning to limit overfitting: larger
# values prune the tree more aggressively. Here it is set to 0.02.
# Unlike dtc, no random_state is set here, so tie-breaking between equally good
# splits may differ between runs.
dtc2 = DecisionTreeClassifier(criterion='entropy', ccp_alpha=0.02, max_depth=5)
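Entropy and information gain can also be computed by hand. The sketch below uses made-up label lists; `entropy` and `information_gain` are hypothetical helper names, not part of scikit-learn.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)) over class shares p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


mixed = [0, 0, 1, 1]
print(entropy(mixed))                              # evenly mixed node: 1 bit
print(information_gain(mixed, [0, 0], [1, 1]))     # perfect split recovers all of it
print(information_gain(mixed, [0, 1], [0, 1]))     # uninformative split gains nothing
```

With `criterion='entropy'`, the tree prefers the split with the largest information gain at each node.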

In [266]:
dtc2.fit(X_train, y_train)
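To see what `ccp_alpha` does, it can help to compare an unpruned tree with a pruned one and to list the candidate alpha values scikit-learn computes. This is a sketch on synthetic data from `make_classification`, which is an assumption for illustration; the numbers it prints do not describe the baseball dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# Same settings except for ccp_alpha, so pruning is the only difference.
unpruned = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_syn, y_syn)
pruned = DecisionTreeClassifier(
    criterion="entropy", ccp_alpha=0.02, random_state=0
).fit(X_syn, y_syn)

print("leaves without pruning:", unpruned.get_n_leaves())
print("leaves with ccp_alpha=0.02:", pruned.get_n_leaves())

# Candidate alphas at which the tree would lose another subtree.
path = DecisionTreeClassifier(
    criterion="entropy", random_state=0
).cost_complexity_pruning_path(X_syn, y_syn)
print("number of candidate alphas:", len(path.ccp_alphas))
print("largest alpha (prunes down to the root):", path.ccp_alphas[-1])
```

In practice, a suitable `ccp_alpha` is usually picked by trying several of these candidates and comparing validation scores, rather than by fixing a value up front.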

In [267]:
Y_pred2 = dtc2.predict(X_test)
display(Y_pred2)

array([0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 1], dtype=int64)

In [268]:
print("Decision Tree 2 Score: ", dtc2.score(X_test, y_test))

decision_tree_cm2 = confusion_matrix(y_test, Y_pred2)
decision_tree_accuracy2 = accuracy_score(y_test, Y_pred2)
decision_tree_cr2 = classification_report(y_test, Y_pred2)

print("Confusion Matrix 2:\n", decision_tree_cm2)
print("Accuracy Score 2:", decision_tree_accuracy2)    
print("Classification Report 2:\n", decision_tree_cr2)

Decision Tree 2 Score:  0.8602150537634409
Confusion Matrix 2:
 [[63  4]
 [ 9 17]]
Accuracy Score 2: 0.8602150537634409
Classification Report 2:
               precision    recall  f1-score   support

           0       0.88      0.94      0.91        67
           1       0.81      0.65      0.72        26

    accuracy                           0.86        93
   macro avg       0.84      0.80      0.81        93
weighted avg       0.86      0.86      0.86        93



In [269]:
features2 = pd.DataFrame(dtc2.feature_importances_, index=X.columns)
features2.head(15)

|     | 0 |
|---|---|
| YRS | 0.0 |
| G | 0.105823 |
| AB | 0.0 |
| R | 0.440691 |
| H | 0.0 |
| 2B | 0.0 |
| 3B | 0.0 |
| HR | 0.064995 |
| RBI | 0.076152 |
| BB | 0.0 |
