# Dataclass Tutorial Notes

Here are some common use cases for Python's `dataclass` decorator in machine learning code.

* **Storing Model Configurations:** Machine learning models often have various hyperparameters that can be tweaked. Dataclasses provide a clean way to define these configurations, making it easier to experiment with different settings and track results. For instance, you could create a dataclass to store learning rate, batch size, and optimizer settings for a neural network.

* **Data Preprocessing Pipelines:** Data preprocessing is a crucial step in machine learning. Dataclasses can be used to represent the various stages of a preprocessing pipeline, including data normalization, feature scaling, and transformation. This improves code readability and maintainability.

* **Experiment Logging:** When running machine learning experiments, it's essential to keep track of the data used, model configurations, and performance metrics. Dataclasses can be used to create structured logs that capture this information, simplifying analysis and comparison of different runs.

* **Feature Engineering:** Feature engineering involves creating new features from existing data. Dataclasses can be used to represent these new features, making it easier to track their origin and impact on model performance.


**NOTE:** Dataclasses promote clean, concise, and well-organized code for data-centric tasks in AI and machine learning. This improves readability and maintainability, and helps you manage complex data structures effectively.

## Example 1: Hyperparameter Management

In [5]:
from dataclasses import dataclass

@dataclass
class Hyperparameters:
    learning_rate: float = 0.01
    batch_size: int = 32
    epochs: int = 100

params = Hyperparameters(learning_rate=0.001,
                         batch_size=32,
                         epochs=10)
params2 = Hyperparameters(learning_rate=0.005, 
                          batch_size=8)

print(params)
print(params2)
print('='*20)
print(params.learning_rate)

Hyperparameters(learning_rate=0.001, batch_size=32, epochs=10)
Hyperparameters(learning_rate=0.005, batch_size=8, epochs=100)
====================
0.001
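
To try a variation of an existing configuration without retyping every field, the standard library's `dataclasses.replace` builds a new instance with selected fields changed. A minimal sketch, re-declaring `Hyperparameters` so it runs on its own; the names `base` and `variant` are only illustrative:

```python
from dataclasses import dataclass, replace

@dataclass
class Hyperparameters:
    learning_rate: float = 0.01
    batch_size: int = 32
    epochs: int = 100

base = Hyperparameters(learning_rate=0.001, epochs=10)

# replace() returns a new instance; `base` itself is left unchanged
variant = replace(base, batch_size=64)

print(base)
print(variant)
```

Since `replace` copies rather than mutates, the original settings remain available for comparison.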


## Example 2: Model Configuration

In [4]:
@dataclass
class ModelConfig:
    input_dim: int
    hidden_dim: int
    output_dim: int
    activation: str

config = ModelConfig(input_dim=100,
                     hidden_dim=50,
                     output_dim=10, 
                     activation='relu')

print(config)

ModelConfig(input_dim=100, hidden_dim=50, output_dim=10, activation='relu')
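
Dataclasses do not check type annotations at runtime, so a config such as `hidden_dim=0` or a misspelled activation would be accepted silently. One way to catch this is the `__post_init__` hook, which runs right after the generated `__init__`. A sketch under the assumption that only a few activations are supported; `ALLOWED_ACTIVATIONS` is a hypothetical name, not part of the original notes:

```python
from dataclasses import dataclass

# Hypothetical set of supported activations, for illustration only
ALLOWED_ACTIVATIONS = {"relu", "tanh", "sigmoid"}

@dataclass
class ModelConfig:
    input_dim: int
    hidden_dim: int
    output_dim: int
    activation: str

    def __post_init__(self):
        # Called automatically after the generated __init__ has set the fields
        for name in ("input_dim", "hidden_dim", "output_dim"):
            value = getattr(self, name)
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")
        if self.activation not in ALLOWED_ACTIVATIONS:
            raise ValueError(f"unsupported activation: {self.activation!r}")

config = ModelConfig(input_dim=100, hidden_dim=50, output_dim=10, activation="relu")

try:
    ModelConfig(input_dim=100, hidden_dim=0, output_dim=10, activation="relu")
except ValueError as err:
    print(err)
```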


In [14]:
from dataclasses import dataclass
from typing import Callable

@dataclass
class PreprocessingStep:
    name: str
    func: Callable  # any callable that takes the data and returns the transformed data

def normalize_data(data):
    # Placeholder: replace with a real normalization routine
    raise NotImplementedError('Function `normalize_data` is not implemented yet.')

def scale_features(data):
    # Placeholder: replace with a real feature-scaling routine
    raise NotImplementedError('Function `scale_features` is not implemented yet.')

# Example usage
preprocessing_pipeline = [
    PreprocessingStep("Normalization", normalize_data),
    PreprocessingStep("Feature Scaling", scale_features),
]

data = [[1, 3], [2, 5]]

# The first step raises because its function is still a stub
for step in preprocessing_pipeline:
    data = step.func(data)


NotImplementedError: Function `normalize_data` is not implemented yet.
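
The stubs above raise as soon as the pipeline runs. For a version that executes end to end, here is a sketch with two simple pure-Python steps. The implementations are assumptions chosen for illustration (per-column min-max scaling and mean centering), and `center_features` and `Matrix` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Callable, List

Matrix = List[List[float]]

@dataclass
class PreprocessingStep:
    name: str
    func: Callable[[Matrix], Matrix]

def normalize_data(data: Matrix) -> Matrix:
    # Min-max scale each column to [0, 1]; constant columns map to 0.0
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(x - low) / span if span else 0.0 for x, low, span in zip(row, lows, spans)]
        for row in data
    ]

def center_features(data: Matrix) -> Matrix:
    # Subtract each column's mean so every column averages to zero
    means = [sum(col) / len(col) for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]

pipeline = [
    PreprocessingStep("Normalization", normalize_data),
    PreprocessingStep("Centering", center_features),
]

data = [[1, 3], [2, 5]]
for step in pipeline:
    data = step.func(data)
    print(step.name, data)
```

Keeping each step as a named dataclass instance makes it easy to log which transformations ran and in what order.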

In [15]:
from dataclasses import dataclass

@dataclass
class ExperimentLog:
    model_name: str
    hyperparameters: Hyperparameters
    training_time: float
    accuracy: float

log = ExperimentLog("MLP", params, 120.5, 0.88)

print(log)

ExperimentLog(model_name='MLP', hyperparameters=Hyperparameters(learning_rate=0.001, batch_size=32, epochs=10), training_time=120.5, accuracy=0.88)
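
Because each log is a structured object, comparing runs reduces to ordinary list operations. A small sketch, re-declaring the two dataclasses so it runs on its own; the numbers below are illustrative placeholders, not real results:

```python
from dataclasses import dataclass

@dataclass
class Hyperparameters:
    learning_rate: float = 0.01
    batch_size: int = 32
    epochs: int = 100

@dataclass
class ExperimentLog:
    model_name: str
    hyperparameters: Hyperparameters
    training_time: float
    accuracy: float

# Illustrative values only
logs = [
    ExperimentLog("MLP", Hyperparameters(learning_rate=0.001), 120.5, 0.88),
    ExperimentLog("MLP", Hyperparameters(learning_rate=0.005, batch_size=8), 95.0, 0.91),
    ExperimentLog("MLP", Hyperparameters(learning_rate=0.01), 80.2, 0.85),
]

# Pick the run with the highest accuracy
best = max(logs, key=lambda entry: entry.accuracy)
print(best.hyperparameters)

# List all runs from best to worst
for entry in sorted(logs, key=lambda e: e.accuracy, reverse=True):
    print(entry.hyperparameters.learning_rate, entry.accuracy)
```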


In [23]:
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class NewFeature:
    name: str
    func: Callable
    # Optional: the source features this feature is derived from
    original_features: list = field(default_factory=list)

def calculate_ratio(a, b):
    # Return a / b, or print the error and return None if the division fails
    try:
        return a / b
    except Exception as e:
        print(e)
        return None

original_feature1 = [1, 2, 3]

original_feature2 = [5.0, 3, 0]

# Example usage
new_feature = NewFeature("Ratio", calculate_ratio, [original_feature1, original_feature2])


In [20]:
def calculate_ratio(a, b):
    try:
        c = a/b
        print(c)
    except Exception as e:
        print(e)

In [21]:
calculate_ratio(2,  1)

2.0


In [22]:
calculate_ratio(2, 0)

division by zero


In [24]:
import json

# After experiment run
log = ExperimentLog(model_name = "MLP", 
                    hyperparameters = params, 
                    training_time = 120.5, 
                    accuracy = 0.88)

with open("experiment_log.json", "w") as f:
    json.dump(log.__dict__, f)  # __dict__ is shallow: the nested Hyperparameters object is not converted

TypeError: Object of type Hyperparameters is not JSON serializable
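
The error occurs because `__dict__` converts only the top level, leaving the nested `Hyperparameters` instance as an object that `json` cannot encode. The standard library's `dataclasses.asdict` recurses into nested dataclasses (as well as lists, tuples, and dicts). A minimal sketch, re-declaring the classes so it runs on its own and using `json.dumps` to avoid touching the file system:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Hyperparameters:
    learning_rate: float = 0.01
    batch_size: int = 32
    epochs: int = 100

@dataclass
class ExperimentLog:
    model_name: str
    hyperparameters: Hyperparameters
    training_time: float
    accuracy: float

params = Hyperparameters(learning_rate=0.001, batch_size=32, epochs=10)
log = ExperimentLog("MLP", params, 120.5, 0.88)

# asdict() converts the nested Hyperparameters instance into a plain dict as well
payload = json.dumps(asdict(log), indent=2)
print(payload)
```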

In [25]:
from dataclasses import dataclass

@dataclass
class MLPConfig:
    learning_rate: float = 0.01
    batch_size: int = 32
    epochs: int = 100
    hidden_units: int = 64

# Example usage
config = MLPConfig(learning_rate=0.005, hidden_units=128)


@dataclass
class ExperimentLog:
    model_name: str
    config: MLPConfig
    training_time: float
    accuracy: float

# Example usage
log = ExperimentLog("MLP", config, 120.5, 0.87)
# You can then store or visualize this log information

In [37]:
import json

# ... your experiment code ...

# After experiment run
log = ExperimentLog(model_name="MLP", config=config, training_time=1100.0, accuracy=0.90)

# Convert config dataclass to dictionary
config_dict = log.config.__dict__

print(config_dict)

# Create the serializable dictionary for JSON
log_dict = {
    "model_name": log.model_name,
    "config": config_dict,
    "training_time": log.training_time,
    "accuracy": log.accuracy
}

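# Append mode: write one JSON object per line so earlier runs stay parseable
with open("experiment_log.json", "a") as f:
    json.dump(log_dict, f)  # log_dict already holds only plain, JSON-compatible values
    f.write("\n")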


{'learning_rate': 0.005, 'batch_size': 32, 'epochs': 100, 'hidden_units': 128}
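
To read saved logs back into dataclasses, one option is to store one JSON object per line and rebuild the nested config before the outer log. The sketch below assumes that layout and writes to a temporary directory so it does not touch any real log file; `append_log` and `load_logs` are hypothetical helper names:

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass

@dataclass
class MLPConfig:
    learning_rate: float = 0.01
    batch_size: int = 32
    epochs: int = 100
    hidden_units: int = 64

@dataclass
class ExperimentLog:
    model_name: str
    config: MLPConfig
    training_time: float
    accuracy: float

def append_log(path, log):
    # One JSON object per line, so appended runs stay individually parseable
    with open(path, "a") as f:
        f.write(json.dumps(asdict(log)) + "\n")

def load_logs(path):
    logs = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            # Rebuild the nested dataclass first, then the outer one
            record["config"] = MLPConfig(**record["config"])
            logs.append(ExperimentLog(**record))
    return logs

config = MLPConfig(learning_rate=0.005, hidden_units=128)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "experiment_log.jsonl")
    append_log(path, ExperimentLog("MLP", config, 120.5, 0.87))
    append_log(path, ExperimentLog("MLP", config, 1100.0, 0.90))
    restored = load_logs(path)

print(restored[1])
```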
