# 2.1 Decision Trees

## 2.1.1 Decision Tree with All Valuable Features

### Decision Trees in Machine Learning

A Decision Tree is a non-parametric supervised learning method used for classification and regression tasks. It predicts the target variable by learning simple if/else decision rules from the input features.

#### Structure of a Decision Tree

- **Nodes**: A decision tree is built from three kinds of nodes.
  - **Root Node**: Represents the entire dataset and is the starting point of the tree.
  - **Decision Nodes**: Internal nodes that test a feature and route each sample to a child node according to a decision rule.
  - **Leaf Nodes/Terminal Nodes**: Nodes that provide the prediction and do not split further.
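
To make the node types concrete, here is a minimal sketch of a tree as plain Python objects. The class names `Leaf` and `DecisionNode`, the helper `predict`, and the thresholds are hypothetical and chosen only for illustration; this is not how scikit-learn stores its trees.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Terminal node: holds the prediction and does not split further."""
    prediction: Union[bool, int, float, str]


@dataclass
class DecisionNode:
    """Internal node: routes a sample left or right with a threshold rule."""
    feature: str
    threshold: float
    left: Union["DecisionNode", Leaf]   # taken when sample[feature] <= threshold
    right: Union["DecisionNode", Leaf]  # taken otherwise


def predict(node, sample):
    """Walk from the root down to a leaf and return the leaf's prediction."""
    while isinstance(node, DecisionNode):
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# Hand-built example tree (assumed rules, not learned from the data)
root = DecisionNode(
    feature="CryoSleep",
    threshold=0.5,
    left=DecisionNode(feature="Spa", threshold=100.0,
                      left=Leaf(True), right=Leaf(False)),
    right=Leaf(True),
)

print(predict(root, {"CryoSleep": 1, "Spa": 0.0}))    # True
print(predict(root, {"CryoSleep": 0, "Spa": 500.0}))  # False
```

The root is simply the first decision node; every path from it ends in exactly one leaf.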

#### Learning Process

1. **Select the Best Feature**: The algorithm selects the feature (and, for numerical features, the threshold) whose split most reduces impurity, measured by a criterion such as Gini impurity or entropy for classification, or variance reduction for regression.

2. **Split the Data**: The dataset is divided into subsets based on the selected feature.

3. **Recurse on Each Sub-Dataset**: The splitting process is applied recursively to each subset.

4. **Stopping Criteria**: The recursion stops if any of the following conditions are met:
   - All observations in a node have the same value of the target variable.
   - No further features are available to split on.
   - A predefined tree depth is reached.
   - A node has too few samples to split further.

5. **Prediction**: To make a prediction:
   - **Classification**: Traverse from the root to a leaf node based on the input features, and use the leaf's class label (the majority class of the training samples that reached it) as the prediction.
   - **Regression**: Similarly, traverse the tree, but the prediction is a continuous value, typically the average of target values in the leaf node.
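
The learning steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: it uses Gini impurity, tries every observed value as a threshold, and returns nested dictionaries. The function names `gini`, `best_split`, and `build_tree` are hypothetical, and scikit-learn's actual implementation differs in many details.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best split.

    A split is kept only if it lowers impurity below that of the unsplit node;
    otherwise feature_index and threshold stay None.
    """
    best = (None, None, gini(labels))
    n = len(rows)
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, threshold, score)
    return best


def build_tree(rows, labels, depth=0, max_depth=3, min_samples=2):
    """Recursively split until one of the stopping criteria is met."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: pure node, depth limit reached, or too few samples
    if gini(labels) == 0.0 or depth >= max_depth or len(rows) < min_samples:
        return {"leaf": majority}
    feature, threshold, _ = best_split(rows, labels)
    if feature is None:  # stop: no split improves impurity
        return {"leaf": majority}
    left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right = [i for i, row in enumerate(rows) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth, min_samples),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth, min_samples),
    }


# Toy data: column 0 separates the classes perfectly, column 1 is noise
rows = [[0, 5.0], [0, 1.0], [1, 5.0], [1, 1.0]]
labels = [False, False, True, True]

print(gini(labels))              # 0.5
print(best_split(rows, labels))  # (0, 0, 0.0)
tree = build_tree(rows, labels)
print(tree)
```

On the toy data the tree needs a single split on column 0, after which both children are pure and the recursion stops.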

#### Important Features of Decision Trees

- **Interpretability**: Easy to understand and interpret visually.
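- **Non-Linear Data Handling**: Can model non-linear relationships and, in principle, handle both numerical and categorical features (scikit-learn's implementation requires categorical features to be encoded as numbers first).
- **Feature Importance**: The most informative features tend to be used for the earliest splits, and a fitted tree exposes an importance score per feature.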
- **Overfitting**: Prone to overfitting, which can be mitigated by pruning, setting maximum depth, or minimum samples per leaf.
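
As a rough illustration of the last two points, the sketch below fits an unrestricted tree, a depth-limited tree, and a cost-complexity-pruned tree on synthetic data from `make_classification`. The dataset and the specific values (`max_depth=4`, `min_samples_leaf=10`, `ccp_alpha=0.005`) are assumptions picked for the example, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the notebook's dataset)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Unrestricted tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limited tree: capped depth and a minimum number of samples per leaf
limited_tree = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=10, random_state=0).fit(X_train, y_train)

# Pruned tree: cost-complexity pruning removes weak branches after growing
pruned_tree = DecisionTreeClassifier(
    ccp_alpha=0.005, random_state=0).fit(X_train, y_train)

models = [("unrestricted", full_tree), ("limited", limited_tree),
          ("pruned", pruned_tree)]
for name, model in models:
    print(f"{name:>12}: depth={model.get_depth()}, "
          f"leaves={model.get_n_leaves()}, "
          f"train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")

# Importance scores of a fitted tree sum to 1 across features
print(limited_tree.feature_importances_.round(3))
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting; the restricted variants trade some training accuracy for a simpler model.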


In [14]:
# Begin by importing all required libraries
import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
from sklearn import datasets
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer


In [15]:
# Read data
train_df = pd.read_csv('csv_files/train.csv')

# Preview data
print('Raw data format:')
display(train_df.head())

# Determining the amount of missing data per column
missing_data = train_df.isna().sum()

# Calculating the percentage of missing data per column
missing_percentage = (missing_data / len(train_df)) * 100

missing_info = pd.DataFrame({
    "Missing Values": missing_data,
    "Percentage": missing_percentage
})

# Sort so the columns with the most missing values come first
missing_info = missing_info.sort_values(by="Missing Values", ascending=False)

# Imputers
median_imputer = SimpleImputer(strategy='median')
mode_imputer = SimpleImputer(strategy='most_frequent')

# List of numerical and categorical columns that need imputation
numerical_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
categorical_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP']

# Imputation
train_df[numerical_cols] = median_imputer.fit_transform(train_df[numerical_cols])
train_df[categorical_cols] = mode_imputer.fit_transform(train_df[categorical_cols])


# Work on a copy so the imputed train_df is left untouched
decision_tree_df = train_df.copy()

# Drop unnecessary columns
decision_tree_df.drop(columns=['PassengerId', 'Name', 'VIP'], inplace=True)

# Drop rows without a Cabin, since Deck and Side cannot be derived for them
decision_tree_df.dropna(subset=['Cabin'], inplace=True)

# Split 'Cabin' (format deck/number/side) into 'Deck' and 'Side'
decision_tree_df['Deck'] = decision_tree_df['Cabin'].str.split('/').str[0]
decision_tree_df['Side'] = decision_tree_df['Cabin'].str.split('/').str[2]

decision_tree_df.drop(columns=['Cabin'], inplace=True)

# Convert 'CryoSleep' boolean to int
decision_tree_df['CryoSleep'] = decision_tree_df['CryoSleep'].astype(int)

# Initialize LabelEncoder
label_encoder = LabelEncoder()

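# Encode categorical variables as integers.
# Note: fit_transform refits the shared encoder on every column, so after
# the loop label_encoder only holds the mapping for 'Side'.
for col in ['HomePlanet', 'Destination', 'Deck', 'Side']:
    decision_tree_df[col] = label_encoder.fit_transform(decision_tree_df[col])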

# Preview the preprocessed DataFrame
print('DataFrame used for trees:')
decision_tree_df.head()


Raw data format:


| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


DataFrame used for trees:


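| | HomePlanet | CryoSleep | Destination | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Deck | Side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 2 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | 1 | 0 |
| 1 | 0 | 0 | 2 | 24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | 5 | 1 |
| 2 | 1 | 0 | 2 | 58.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | 0 | 1 |
| 3 | 1 | 0 | 2 | 33.0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | 0 | 1 |
| 4 | 0 | 0 | 2 | 16.0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | 5 | 1 |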
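
To see where the integer codes in the preview come from, the following sketch repeats the cabin split and label encoding on a tiny hand-made frame. The frame `df` and the `encoders` dictionary are hypothetical; keeping one encoder per column is a variation on the cell above (which reuses a single encoder) so that each mapping can be inspected afterwards.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny hypothetical frame that mimics the relevant columns
df = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars", "Earth"],
    "Cabin": ["B/0/P", "F/0/S", "A/1/S", "F/1/P"],
})

# Cabin has the form deck/number/side
cabin_parts = df["Cabin"].str.split("/", expand=True)
df["Deck"] = cabin_parts[0]
df["Side"] = cabin_parts[2]
df = df.drop(columns=["Cabin"])

# One encoder per column, so every mapping stays available after the loop
encoders = {}
for col in ["HomePlanet", "Deck", "Side"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# LabelEncoder sorts the classes, so the code is the position in this list
for col, enc in encoders.items():
    print(col, enc.classes_.tolist())
print(df)
```

Because the classes are sorted, `Earth` becomes 0 and `Europa` becomes 1, which matches the `HomePlanet` column in the preview. The codes reflect alphabetical order only, not any meaningful ranking of the categories.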


## Develop the Decision Tree using the DataFrame

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score

# Now, split the data into features (X) and target label (y)
X = decision_tree_df.drop(['Transported'], axis=1)  # Features
y = decision_tree_df['Transported']  # Target variable

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# Initialize the Decision Tree Classifier
decision_tree_model = DecisionTreeClassifier(random_state=4)

# Train the model
decision_tree_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = decision_tree_model.predict(X_test)

# Evaluate the model
print(f'Accuracy of the Decision Tree model: {accuracy_score(y_test, y_pred):.2f}')
print(classification_report(y_test, y_pred))


Accuracy of the Decision Tree model: 0.76
              precision    recall  f1-score   support

       False       0.75      0.75      0.75       830
        True       0.76      0.77      0.76       869

    accuracy                           0.76      1699
   macro avg       0.76      0.76      0.76      1699
weighted avg       0.76      0.76      0.76      1699



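An accuracy of 0.76 is a reasonable starting point; however, with feature engineering we can try to improve it by reducing the model's complexity.

We start by reducing the number of features: the five spending columns are combined into a single `Expenditure` total.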

In [17]:
# Start from the preprocessed frame used for the first tree
test_expenditure_df = decision_tree_df.copy()

# Combine the five spending columns into a single total
spending_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
test_expenditure_df['Expenditure'] = test_expenditure_df[spending_cols].sum(axis=1)
test_expenditure_df = test_expenditure_df.drop(columns=spending_cols)

# Now, split the data into features (X) and target label (y)
X = test_expenditure_df.drop(['Transported'], axis=1)  # Features
y = test_expenditure_df['Transported']  # Target variable

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# Initialize the Decision Tree Classifier
decision_tree_model = DecisionTreeClassifier(random_state=4)

# Train the model
decision_tree_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = decision_tree_model.predict(X_test)

# Evaluate the model
print(f'Accuracy of the Decision Tree model: {accuracy_score(y_test, y_pred):.2f}')
print(classification_report(y_test, y_pred))



Accuracy of the Decision Tree model: 0.67
              precision    recall  f1-score   support

       False       0.67      0.66      0.66       830
        True       0.68      0.68      0.68       869

    accuracy                           0.67      1699
   macro avg       0.67      0.67      0.67      1699
weighted avg       0.67      0.67      0.67      1699
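
The two experiments above repeat the same split, fit, and evaluate steps. One way to compare feature sets with less duplication is a small helper such as the one sketched below. The name `evaluate_tree` is hypothetical, and the demo runs on synthetic data from `make_classification` rather than the notebook's DataFrame, so its numbers say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_tree(X, y, test_size=0.2, random_state=4):
    """Split the data, fit a decision tree, and return hold-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model = DecisionTreeClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in data (an assumption for this sketch)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

results = {
    "all features": evaluate_tree(X, y),
    "first four features": evaluate_tree(X[:, :4], y),
}
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

With the notebook's data, the same helper could be called as `evaluate_tree(decision_tree_df.drop(columns=['Transported']), decision_tree_df['Transported'])` and again with `test_expenditure_df`, using the same `random_state` so both feature sets are scored on the same split.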

