### **Decision Trees**

Decision trees are predictive models that use a set of rules based on data characteristics to make decisions.

A decision tree is a hierarchical structure where each internal node tests a feature (or attribute) of the data, each branch represents an outcome of that test, and each leaf holds a class label or a predicted value.

`USE CASE EXAMPLES`

Purchase Probability:<br>
Variables: Income, Age, Has Bought Before<br>
Objective: Determine whether a person will make a certain purchase or not.<br>

    A[Income > $70,000] -->|Yes| B[Age]
    B -->|≥ 40| C[Will Purchase: Yes]
    B -->|< 40| D[Will Purchase: No]
    A -->|No| E[Has Bought Before]
    E -->|Yes| F[Will Purchase: Yes]
    E -->|No| G[Will Purchase: No]


`Income:` the first decision criterion.<br>
If income is higher than $70,000, the tree then splits on age. If not, it splits on whether the customer has bought before.<br>

`Age:` for customers with income above $70,000, those aged 40 or older are predicted to purchase; younger customers are not.<br>

`Has Bought Before:` for customers with income of $70,000 or less, those who have bought before are predicted to purchase; the rest are not.
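
The same tree can be read as nested if/else rules. The sketch below is a hand-written version of the diagram, not a trained model; the function name `will_purchase_rule` is an assumption introduced here for illustration.

```python
def will_purchase_rule(income, age, has_bought_before):
    """Hand-coded copy of the example tree; returns True for 'Will Purchase: Yes'."""
    if income > 70_000:
        # High-income branch: the outcome depends on age
        return age >= 40
    # Lower-income branch: the outcome depends on purchase history
    return bool(has_bought_before)


# Walk a few hypothetical customers through the rules
print(will_purchase_rule(85_000, 45, 0))  # high income, 40 or older
print(will_purchase_rule(85_000, 30, 1))  # high income, under 40
print(will_purchase_rule(50_000, 30, 1))  # lower income, has bought before
```

Each call follows exactly one path from the root to a leaf, which is how a trained tree produces a prediction as well.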

In [None]:
!pip3 install numpy pandas matplotlib scikit-learn

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.preprocessing import LabelEncoder

print("All libraries imported successfully.")

In [None]:
# Generating synthetic data for market segmentation
np.random.seed(42)
n = 100
# Factors to consider for market segmentation
# - Age: between 18 and 69 years (np.random.randint excludes the upper bound)
# - Income: between 20,000 and 99,999
# - Has bought before: 1 (yes) or 0 (no)
# - Purchase: 1 (yes) or 0 (no)

# There are two datasets: one with a very clear pattern and one that is completely random (commented out below).
# Comparing them shows how prediction accuracy varies.

# Dataset 1: a very clear purchase pattern
incomes = np.random.randint(20000, 100000, n)
ages = np.random.randint(18, 70, n)
has_bought_before = np.random.randint(0, 2, n)
will_purchase = []
for i in range(n):
    if incomes[i] > 70000:
        if ages[i] >= 40:
            will_purchase.append(1)  # Yes
        else:
            will_purchase.append(0)  # No
    else:
        if has_bought_before[i] == 1:
            will_purchase.append(1)  # Yes
        else:
            will_purchase.append(0)  # No
            
data = {
    'Age': ages,
    'Income': incomes,
    'HasBoughtBefore': has_bought_before,
    'WillPurchase': will_purchase
}

# Dataset 2: completely random values; accuracy should drop significantly. Uncomment to try it.
# incomes = np.random.randint(20000, 120000, n)
# ages = np.random.randint(18, 70, n)
# has_bought_before = np.random.choice([0, 1], n)
# will_purchase = np.random.choice(['Yes', 'No'], n)

# # Create DataFrame
# data = {
#     'Age': ages,
#     'Income': incomes,
#     'HasBoughtBefore': has_bought_before,
#     'WillPurchase': will_purchase
# }

# Convert the dataset into a pandas DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv('market_segmentation.csv', index=False)

In [None]:
# Load data
data = pd.read_csv('market_segmentation.csv')

# Show the first few rows
print(data.head())

# Descriptive statistics
# count: number of non-null values
# mean: average (sum of the values in each column divided by the number of rows)
# std: standard deviation
# min: minimum value per column
# 25%: the 25th percentile
# 50%: the 50th percentile (median)
# 75%: the 75th percentile
# max: maximum value per column
print(data.describe())

In [None]:
# Purchase distribution
# The x-axis shows the WillPurchase categories (0 = No and 1 = Yes in the patterned dataset).
# The y-axis shows the frequency, i.e. how often each category appears in the dataset.
plt.hist(data['WillPurchase'])
plt.xlabel('WillPurchase')
plt.ylabel('Frequency')
plt.title('Purchase Distribution')
plt.show()

# Encode categorical variables
# The patterned dataset already stores 0/1, so the encoding leaves it unchanged.
# It matters for the random dataset, where the 'No'/'Yes' labels become 0/1.
label_encoder = LabelEncoder()
data['HasBoughtBefore'] = label_encoder.fit_transform(data['HasBoughtBefore'])
data['WillPurchase'] = label_encoder.fit_transform(data['WillPurchase'])

In [None]:
# Features to be considered for training the model
# (taken from `data`, the DataFrame that was loaded and encoded above)
X = data[['Age', 'Income', 'HasBoughtBefore']]
y = data['WillPurchase']

# Split into training and test sets
# X: variables used for prediction
# y: variable we want to predict
# test_size=0.3: 30% of the data goes to the test set and 70% to the training set
# The training set teaches the model the relationship between the features and the target
# (more training data generally helps)
# The test set is held out to compare the predictions with the true values and see how accurate they are
# random_state: controls how the data is randomly split. If two people run the same function with the same value for random_state,
# they will get exactly the same data split (test and training sets).
# Alternative with a 20% test set:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
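
To make the effect of `random_state` concrete, here is a minimal sketch on toy arrays. The names `X_demo` and `y_demo` are assumptions introduced only for this illustration; they are not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with 2 features each, and one label per row
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Two calls with the same random_state produce the same split
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
print("Same split with same seed:", np.array_equal(y_te_a, y_te_b))

# A different random_state usually selects different test rows
_, _, _, y_te_c = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
print("Same split with different seed:", np.array_equal(y_te_a, y_te_c))

# Every row ends up in exactly one of the two sets
print("Train rows:", len(y_tr_a), "Test rows:", len(y_te_a))
```

Fixing the seed matters mostly for reproducibility: accuracy figures can only be compared between runs when the held-out rows are the same.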

In [None]:
# Create the model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
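
One way to check what a fitted tree has learned is to print its rules as text with scikit-learn's `export_text`. The sketch below is self-contained and rebuilds a patterned dataset similar to the one above; the names `demo` and `demo_model`, the seed, and the sample size of 500 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
n_demo = 500

# Rebuild a dataset that follows the same purchase pattern as the notebook
demo = pd.DataFrame({
    'Age': rng.integers(18, 70, n_demo),
    'Income': rng.integers(20000, 100000, n_demo),
    'HasBoughtBefore': rng.integers(0, 2, n_demo),
})
demo['WillPurchase'] = np.where(
    demo['Income'] > 70000,
    demo['Age'] >= 40,             # high income: decided by age
    demo['HasBoughtBefore'] == 1,  # otherwise: decided by purchase history
).astype(int)

feature_names = ['Age', 'Income', 'HasBoughtBefore']
demo_model = DecisionTreeClassifier(random_state=42)
demo_model.fit(demo[feature_names], demo['WillPurchase'])

# Print the learned splits as indented if/else rules
rules = export_text(demo_model, feature_names=feature_names)
print(rules)
```

The thresholds in the printed rules should land close to the ones used to generate the data, although the tree may order or split the features differently from the diagram.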

In [None]:
# Predict on the test set
y_pred_tree = model.predict(X_test)

# Evaluate the model
print("Decision Tree - Accuracy:", accuracy_score(y_test, y_pred_tree))

# Precision: The proportion of true positives over the total predicted positives (TP / (TP + FP)).
# Recall: The proportion of true positives over the total actual positives (TP / (TP + FN)).
# F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics (2 * (Precision * Recall) / (Precision + Recall)).
# Support: The number of actual occurrences of the class in the data.
print("Decision Tree - Classification Report:\n", classification_report(y_test, y_pred_tree))

# Confusion matrix
# TP (True Positives): Correct predictions where the model predicts the positive class correctly.
# FP (False Positives): Incorrect predictions where the model predicts the positive class but the instance is negative.
# FN (False Negatives): Incorrect predictions where the model predicts the negative class but the instance is positive.
# TN (True Negatives): Correct predictions where the model predicts the negative class correctly.

# scikit-learn orders rows (actual) and columns (predicted) by label, so with labels 0 (No) and 1 (Yes):
#                    Predicted 0 (No)    Predicted 1 (Yes)
# Actual 0 (No)            TN                  FP
# Actual 1 (Yes)           FN                  TP

print("Decision Tree - Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tree))
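
The formulas for precision, recall, and F1 can be verified by hand from the four confusion-matrix counts. The sketch below uses small made-up label lists (`y_true_demo` and `y_pred_demo` are hypothetical, not the notebook's results) and compares the manual values with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels for illustration only (1 = Yes, 0 = No)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]

# For binary labels 0/1, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)

# Apply the formulas directly
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print("Manual  :", precision, recall, f1)

# Compare with the library implementations
print("sklearn :",
      precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```

Here there are 4 true positives, 1 false positive, and 1 false negative, so precision and recall are both 4 / 5 = 0.8.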

In [None]:
# from sklearn.tree import plot_tree

# Visualize feature importance
importances = model.feature_importances_
features = X.columns
indices = np.argsort(importances)[::-1]

# The following bar chart ranks the features from most to least important

plt.figure(figsize=(12, 8))
plt.title('Feature Importance')
plt.bar(range(X.shape[1]), importances[indices], align='center')
plt.xticks(range(X.shape[1]), [features[i] for i in indices])
plt.show()
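
The data-generation cell mentions a second, fully random dataset as a way to see how accuracy varies. A minimal sketch of that comparison follows; the helper `holdout_accuracy`, the seed, and the sample size of 1000 (larger than the notebook's 100, to make the estimate less noisy) are assumptions introduced here.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)  # arbitrary seed for this sketch
n_demo = 1000

ages_demo = rng.integers(18, 70, n_demo)
incomes_demo = rng.integers(20000, 100000, n_demo)
bought_demo = rng.integers(0, 2, n_demo)
X_compare = np.column_stack([ages_demo, incomes_demo, bought_demo])

# Labels that follow the tree's rules versus labels drawn at random
y_patterned = np.where(incomes_demo > 70000, ages_demo >= 40, bought_demo == 1).astype(int)
y_random = rng.integers(0, 2, n_demo)


def holdout_accuracy(X, y):
    """Train a decision tree on 70% of the rows and score it on the remaining 30%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
    clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))


acc_patterned = holdout_accuracy(X_compare, y_patterned)
acc_random = holdout_accuracy(X_compare, y_random)
print("Accuracy with patterned labels:", acc_patterned)
print("Accuracy with random labels:   ", acc_random)
```

With patterned labels the tree can recover the underlying rules, so test accuracy should be high. With random labels there is nothing to learn, and accuracy should hover around what guessing would give.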