# 🌳 1D to 2D Classification using Decision Trees

This notebook demonstrates how to classify data by learning the mapping from 1D measurements (e.g., chord lengths) to 2D shape categories using decision tree algorithms. Such tasks arise in stereology and materials analysis where observed features are lower-dimensional projections of higher-dimensional structures.


## Inferring Sphere Radius from Chord Length Distributions

The code below places non-overlapping spheres (modeled as circles in the unit square) at random, draws horizontal line cuts through them, and collects the resulting chord lengths. The chord lengths of each cut are binned into a histogram (feature vector), which is then used to identify the original sphere radius via pattern matching.
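
As a quick worked example of the geometry involved: a line at distance `d` from the center of a circle with radius `r` cuts a chord of length `2 * sqrt(r**2 - d**2)`. The sketch below mirrors the notebook's `chord_length` helper but uses only the standard library; the offsets are arbitrary illustrative values, not taken from the simulation.

```python
import math


def chord_length(radius, d_line):
    """Chord cut from a circle by a line at distance d_line from its center."""
    return 2 * math.sqrt(radius**2 - d_line**2)


radius = 0.08
# Illustrative offsets: through the center, halfway out, and close to the edge
for d_line in (0.0, 0.04, 0.07):
    print(f"d = {d_line:.2f} -> chord = {chord_length(radius, d_line):.4f}")
```

The chord is longest (equal to the diameter) when the cut passes through the center and shrinks toward zero as the cut approaches the edge, which is why the histogram range used below ends at `2 * radius = 0.16`.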


In [1]:
import matplotlib.pyplot as plt
from collections import defaultdict
import numpy as np

def generate_spheres(num_spheres, radius):
    positions = []
    for _ in range(num_spheres):
        # Rejection sampling: redraw until the new circle overlaps none of the placed ones
        while True:
            x = np.random.uniform(radius, 1 - radius)
            y = np.random.uniform(radius, 1 - radius)
            if all(np.sqrt((x - px) ** 2 + (y - py) ** 2) >= 2 * radius for px, py in positions):
                positions.append((x, y))
                break
    return positions

def chord_length(radius, d_line):
    return 2 * np.sqrt(radius**2 - d_line**2)

def distance_from_line(a, b, x, y):
    return abs(a * x - y + b) / np.sqrt(a**2 + 1)

def collect_chords_for_cuts(positions, radius, num_cuts, y_range=(0, 1)):
    chord_lengths_per_cut = []
    # Spacing for horizontal cuts
    y_cut_values = np.linspace(y_range[0], y_range[1], num_cuts) 
    # Collect chord lengths for each cut
    for y_cut in y_cut_values:
        cut_chords = []
        for (x, y) in positions:
            # Calculate the perpendicular distance from the sphere center to the cut
            d_line = abs(y - y_cut)  # This is simply the vertical distance since it's a horizontal line

            # Check if the line intersects the sphere (distance must be less than the radius)
            if d_line < radius:
                length = chord_length(radius, d_line)
                cut_chords.append(length)      
        # Sort the chords for the current cut
        chord_lengths_per_cut.append(sorted(cut_chords))
    
    return chord_lengths_per_cut


def discretize_chord_lengths(chord_lengths, num_bins, range_min=0, range_max=0.16):
    bin_edges = np.linspace(range_min, range_max, num_bins + 1)
    
    # Initialize a list to store the count of chord lengths in each bin
    bin_counts = np.zeros(num_bins, dtype=int)

    # For each chord length, find which bin it belongs to and increment the count for that bin
    for chord in chord_lengths:
        for i in range(num_bins):
            if bin_edges[i] < chord <= bin_edges[i + 1]:
                bin_counts[i] += 1
                break

    return bin_counts
    

def distribution_chord_lengths(chord_lengths_per_cut, num_bins):
    discretized_vectors = []
    
    for chord_lengths in chord_lengths_per_cut:
        # Cuts that miss every sphere are skipped, so no all-zero vectors are added
        if chord_lengths:
            discretized_vector = discretize_chord_lengths(chord_lengths, num_bins)
            discretized_vectors.append(discretized_vector)
    
    return discretized_vectors


def flatten_discretized_vectors(discretized_vectors):
    # Flatten each discretized vector into a 1D feature vector
    return np.array([vec.flatten() for vec in discretized_vectors])

def create_feature_radius_pairs(distributions, true_radii):
    # Ensure the number of distributions matches the number of true radii
    if len(distributions) != len(true_radii):
        raise ValueError("The number of distributions must match the number of true radii.")
    
    # Flatten the distributions
    flattened_distributions = [flatten_discretized_vectors(distribution) for distribution in distributions]
    
    # Pair each flattened vector with its corresponding true radius
    feature_radius_pairs = []
    for i, flattened_distribution in enumerate(flattened_distributions):
        for feature_vector in flattened_distribution:
            feature_radius_pairs.append((feature_vector, true_radii[i]))
    
    return feature_radius_pairs

def compute_vector_frequencies(distribution):
    vector_counts = defaultdict(int)

    for vector in distribution:
        vector_tuple = tuple(vector)  # Convert numpy array to tuple for hashing
        vector_counts[vector_tuple] += 1

    return vector_counts


def discretize_vector(vector, num_bins=10, range_min=0, range_max=0.16):
    # Discretize the observed vector using the same bins as the training data
    bin_edges = np.linspace(range_min, range_max, num_bins + 1)
    discretized_vector = np.zeros(num_bins, dtype=int)

    for val in vector:
        for i in range(num_bins):
            if bin_edges[i] < val <= bin_edges[i + 1]:
                discretized_vector[i] += 1
                break

    return tuple(discretized_vector)  # Convert to tuple for comparison in dictionary

# Function to find the best match across multiple dictionaries
def find_best_match_in_dictionaries(observed_vector, dictionaries, num_bins=10, range_min=0, range_max=0.16):
    # Discretize the observed vector
    discretized_observed = discretize_vector(observed_vector, num_bins, range_min, range_max)
    best_match = None
    best_frequency = 0
    best_dict_name = None

    # Look up the observed vector in each dictionary and keep the most frequent match
    for dict_name, vector_frequencies in dictionaries.items():
        frequency = vector_frequencies.get(discretized_observed, 0)

        if frequency > best_frequency:
            best_frequency = frequency
            best_match = discretized_observed
            best_dict_name = dict_name

    return best_match, best_frequency, best_dict_name
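
The pattern-matching idea behind `compute_vector_frequencies` and `find_best_match_in_dictionaries` can be sketched on toy data: count how often each histogram occurs per known radius, then pick the radius whose table contains the observed histogram most often. The histograms, the labels, and the `best_match` helper below are hypothetical and only illustrate the lookup; they are not output of the simulation.

```python
from collections import Counter

# Hypothetical 4-bin histograms (as hashable tuples) for two known radii
histograms_by_radius = {
    "r=0.04": [(0, 1, 0, 0), (0, 1, 0, 0), (1, 1, 0, 0)],
    "r=0.08": [(0, 0, 0, 1), (0, 1, 0, 0), (0, 0, 1, 1)],
}

# One frequency table per radius: histogram tuple -> number of occurrences
frequency_tables = {
    name: Counter(vectors) for name, vectors in histograms_by_radius.items()
}


def best_match(observed, tables):
    """Return (name, count) of the table where `observed` is most frequent."""
    best_name, best_count = None, 0
    for name, table in tables.items():
        count = table.get(observed, 0)
        if count > best_count:
            best_name, best_count = name, count
    return best_name, best_count


print(best_match((0, 1, 0, 0), frequency_tables))
```

A histogram that never occurred in any table yields `(None, 0)`, which is the main weakness of exact matching and one motivation for the learned models in the tasks below.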

## Simulating Chord Distributions for Different Sphere Densities

We simulate non-overlapping spheres of fixed radius (`radius = 0.08`) at four different densities: 17, 15, 13, and 11 spheres in the unit square. For each case, we compute the chord lengths along horizontal cuts and discretize them into histograms for further analysis.
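
To get a feel for what these densities mean, the covered area fraction can be computed directly as `num_spheres * pi * radius**2` for the unit square. This is a small side calculation, not part of the original pipeline; the name `area_fractions` is introduced here for illustration.

```python
import math

radius = 0.08

# Fraction of the unit square covered by the circles at each density
area_fractions = {
    num_spheres: num_spheres * math.pi * radius**2
    for num_spheres in (11, 13, 15, 17)
}

for num_spheres, fraction in area_fractions.items():
    print(f"{num_spheres} spheres -> area fraction {fraction:.3f}")
```

The coverage ranges from roughly 22% to 34%, so the densest case is noticeably harder to fill by random placement than the sparsest one.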


In [2]:
# Parameters
num_cuts = 100  # Number of horizontal cuts
num_bins = 10  # Number of histogram bins (length of each feature vector)

radius = 0.08  # Radius of each circle

positions = generate_spheres(17, radius)
chord_lengths_set_17 = collect_chords_for_cuts(positions, radius, num_cuts)
distribution_17 = distribution_chord_lengths(chord_lengths_set_17, num_bins) #discretised vectors


positions = generate_spheres(15, radius)
chord_lengths_set_15 = collect_chords_for_cuts(positions, radius, num_cuts)
distribution_15 = distribution_chord_lengths(chord_lengths_set_15, num_bins) #discretised vectors


positions = generate_spheres(13, radius)
chord_lengths_set_13 = collect_chords_for_cuts(positions, radius, num_cuts)
distribution_13 = distribution_chord_lengths(chord_lengths_set_13, num_bins) #discretised vectors


positions = generate_spheres(11, radius)
chord_lengths_set_11 = collect_chords_for_cuts(positions, radius, num_cuts)
distribution_11 = distribution_chord_lengths(chord_lengths_set_11, num_bins) #discretised vectors

## Classification Task 1: Inferring Sphere Density from Chord Distributions

We aim to classify the density of spheres based on discretized chord length distributions obtained from horizontal cuts.

- **Classes**: `distribution_11` (low density) vs. `distribution_17` (high density)
- **Input**: a `num_bins`×1 feature vector of discretized chord lengths (10×1 with the settings above, 15×1 if `num_bins = 15` is used)
- **Goal**: predict which class (density level) the feature vector belongs to
- **Parameters**: 
  - `num_bins = 15` for finer discretization
  - Use a simple classifier (e.g., Decision Tree) to distinguish between the two distributions

This task mimics a supervised learning setting, where the model learns from labeled distributions and predicts the class of new observations.
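
One possible starting point for this task is sketched below. It assumes scikit-learn is installed and re-implements the simulation compactly so that the block runs on its own; `place_centers` and `cut_histograms` are hypothetical stand-ins for the notebook's helpers, and the tree depth and split ratio are arbitrary choices rather than recommended settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def place_centers(num_spheres, radius, max_tries=10_000):
    """Randomly place non-overlapping circle centers in the unit square."""
    while True:  # start over if a placement attempt stalls
        centers = []
        for _ in range(max_tries):
            x, y = rng.uniform(radius, 1 - radius, size=2)
            if all(np.hypot(x - px, y - py) >= 2 * radius for px, py in centers):
                centers.append((x, y))
            if len(centers) == num_spheres:
                return centers


def cut_histograms(num_spheres, radius, num_cuts=100, num_bins=10, range_max=0.16):
    """One chord-length histogram per horizontal cut that hits a circle."""
    center_y = np.array([y for _, y in place_centers(num_spheres, radius)])
    bin_edges = np.linspace(0, range_max, num_bins + 1)
    rows = []
    for y_cut in np.linspace(0, 1, num_cuts):
        d_line = np.abs(center_y - y_cut)
        hit = d_line < radius
        if hit.any():
            chords = 2 * np.sqrt(radius**2 - d_line[hit] ** 2)
            # np.histogram uses half-open bins [a, b), not the (a, b] bins
            # of the notebook's discretize_chord_lengths helper
            rows.append(np.histogram(chords, bins=bin_edges)[0])
    return np.array(rows)


# Class 0: low density (11 spheres), class 1: high density (17 spheres)
X_low = cut_histograms(11, 0.08)
X_high = cut_histograms(17, 0.08)
X = np.vstack([X_low, X_high])
y = np.concatenate([np.zeros(len(X_low), dtype=int), np.ones(len(X_high), dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Passing `num_bins=15` to `cut_histograms` gives the finer discretization suggested above. Note that each class here comes from a single sphere arrangement, so the reported accuracy says little about how well the tree generalizes to new arrangements.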


## Regression Task 2: Inferring Sphere Radius from Chord Distributions

We now approach a **regression** problem: estimating the true **radius** of the spheres based on the shape of their chord length distributions.

- **Target values (radii)**: choose two known radii, e.g. `r = 0.04` and `r = 0.08`
- **Input**: a 10×1 discretized feature vector from chord lengths
- **Output**: estimated radius of the sphere sample
- **Goal**: train a regression model to learn the mapping from discretized distributions to true radius values

This task demonstrates how statistical features derived from lower-dimensional cuts can be used to infer geometric properties of the underlying higher-dimensional structure.
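
A minimal regression sketch is shown below, assuming scikit-learn is available. To keep it short, it swaps the full sphere simulation for a simplified generator: each synthetic cut hits between one and four circles at uniformly random center-to-line distances. The `synthetic_histogram` helper and its `max_hits` parameter are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
bin_edges = np.linspace(0, 0.16, 11)  # 10 bins, matching the notebook's defaults


def synthetic_histogram(radius, max_hits=4):
    """Chord-length histogram for one simplified, hypothetical cut."""
    num_hits = rng.integers(1, max_hits + 1)        # circles hit by this cut
    d_line = rng.uniform(0, radius, size=num_hits)  # center-to-line distances
    chords = 2 * np.sqrt(radius**2 - d_line**2)
    return np.histogram(chords, bins=bin_edges)[0]


# 200 synthetic cuts per radius; the target is the radius itself
radii = [0.04, 0.08]
X = np.array([synthetic_histogram(r) for r in radii for _ in range(200)])
y = np.repeat(radii, 200)

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Predict the radius for a few fresh cuts through spheres with r = 0.08
X_new = np.array([synthetic_histogram(0.08) for _ in range(5)])
predictions = reg.predict(X_new)
print(predictions)
```

Because chords of spheres with `r = 0.04` can never exceed 0.08, they only populate the lower half of the bins, which gives the tree an easy signal to split on. With only two training radii the regressor can at best interpolate between them, so adding more radii would be a natural extension.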


In [None]:
######### for you to fill in

### 🔧 Lifehack: Sampling a Realistic Discretized Vector to save your time

This small code snippet is a quick way to extract a **realistic discretized vector** from a generated chord length distribution. It simulates a sphere arrangement with known radius and density, collects chord lengths from horizontal cuts, and selects the **first meaningful vector** (i.e., one that contains actual data).


In [4]:
# Parameters
num_spheres = 17
radius = 0.08
num_cuts = 100
num_bins = 10

# Generate spheres and chord lengths
positions = generate_spheres(num_spheres, radius)
chord_lengths_set = collect_chords_for_cuts(positions, radius, num_cuts)
distribution = distribution_chord_lengths(chord_lengths_set, num_bins)

# Extract a single discretized vector (e.g., the first non-empty one)
sample_vector = None
for vec in distribution:
    if vec.sum() > 0:  # Guard against all-zero vectors (empty cuts are already dropped upstream)
        sample_vector = vec
        break

# Show the result
print("Sample discretized vector:", sample_vector)

Sample discretized vector: [0 0 0 0 1 0 0 0 0 0]
