# AutoGluon MultiModal (AutoMM) - Essentials Workshop

Welcome to the AutoGluon MultiModal essentials workshop! In this notebook, we'll explore how to use AutoGluon's `MultiModalPredictor` to solve problems involving multiple data modalities such as text, images, and tabular data.

`MultiModalPredictor` is a powerful tool that can automatically build state-of-the-art deep learning models for inputs including images, text, and tabular data. It can predict the values of one column based on other features, handling the complexities of multiple data types seamlessly.

## What You Will Learn

* How to prepare multimodal data for AutoGluon
* How to train models with `MultiModalPredictor`
* How to make predictions on new data
* How to evaluate model performance
* How to handle different data modalities
* Best practices for multimodal machine learning

Let's get started by installing AutoGluon and importing the required modules.

In [None]:
# Install uv, then use it to install AutoGluon; the visualization dependencies are installed with pip
!pip install -q uv
!uv pip install autogluon==1.4
!pip install -q matplotlib seaborn scikit-learn pillow

In [None]:
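import os
import random
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image, display

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Seed both NumPy and the standard-library RNG (random.sample is used later to pick rows)
np.random.seed(123)
random.seed(123)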

## 1. Introduction to MultiModal Learning

Multimodal learning involves training models that can process and learn from multiple types of data modalities. In real-world applications, we often have access to data from various sources and formats:

- **Text**: Product descriptions, reviews, articles
- **Images**: Photos, diagrams, visualizations
- **Documents**: PDFs, Word documents, HTML pages
- **Tabular data**: Structured information in rows and columns

Traditional machine learning approaches often require separate models for each data type or extensive feature engineering to combine them. AutoGluon's `MultiModalPredictor` simplifies this process by automatically handling the complexities of multimodal data and creating models that can leverage all available information sources.

### Benefits of MultiModal Learning

- **Complementary information**: Different modalities can provide complementary perspectives on the same problem
- **Enhanced predictive power**: Combining modalities often leads to better performance than using a single modality
- **Robustness**: The model can still make predictions if one modality is missing or noisy
- **Real-world applicability**: Most real-world problems involve multiple data types

AutoGluon's `MultiModalPredictor` is designed to make this powerful approach accessible to everyone, without requiring deep expertise in multimodal deep learning.

## 2. Example Dataset: PetFinder

For this tutorial, we'll use a simplified and subsampled version of the [PetFinder dataset](https://www.kaggle.com/c/petfinder-adoption-prediction). The goal is to predict how quickly a pet is adopted based on its adoption profile, which includes images, a text description, and tabular features.

In this simplified version, the adoption speed is grouped into two categories: 0 (slow) and 1 (fast).

Let's download and prepare the dataset:

In [None]:
from autogluon.core.utils.loaders import load_zip

download_dir = './ag_multimodal_tutorial'
zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_for_tutorial.zip'

# Download and unzip the dataset
load_zip.unzip(zip_file, unzip_dir=download_dir)

Now, let's load the dataset into pandas DataFrames and examine it:

In [None]:
dataset_path = f'{download_dir}/petfinder_for_tutorial'

train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)
test_data = pd.read_csv(f'{dataset_path}/test.csv', index_col=0)

label_col = 'AdoptionSpeed'

# Display the first few rows of the training data
train_data.head()

### Data Preparation and Visualization for Multimodal Learning

Let's explore the different data modalities in our dataset:

1. **Tabular features**: Age, Breed, Color, etc.
2. **Text**: Description of the pet
3. **Images**: Photos of the pets

For AutoGluon's `MultiModalPredictor`, image columns should contain strings whose values are paths to image files. Some records in our data have multiple images associated with them, but for simplicity, we'll use only the first image for each pet:

In [None]:
image_col = 'Images'

# Take only the first image for each pet
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0])
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])

# Helper function to convert relative paths to absolute paths
def path_expander(path, base_folder):
    path_l = path.split(';')
    return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])

# Convert relative paths to absolute paths
train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
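
Because the predictor reads images from the paths stored in the image column, it can help to confirm that those paths actually exist before training. The following is a minimal sketch of such a check. The helper name `find_missing_images` and the temporary files are hypothetical and only serve to make the example runnable without the dataset.

```python
import os
import tempfile

def find_missing_images(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

# Self-contained demo: a temporary directory stands in for the dataset folder
with tempfile.TemporaryDirectory() as tmp_dir:
    existing_path = os.path.join(tmp_dir, "pet_0.jpg")
    with open(existing_path, "wb") as f:
        f.write(b"placeholder")  # not a real image, it only needs to exist on disk
    candidate_paths = [existing_path, os.path.join(tmp_dir, "pet_1.jpg")]
    missing_paths = find_missing_images(candidate_paths)
    n_missing = len(missing_paths)

print(f"{n_missing} of {len(candidate_paths)} image paths are missing")
```

In this notebook, the same helper could be called as `find_missing_images(train_data[image_col])` after the paths have been converted to absolute paths.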

### Exploring and Visualizing the Data

Let's examine the different modalities in our dataset through visualizations to better understand what we're working with:

In [None]:
# Visualize a few random samples, then plot feature distributions
import random
import seaborn as sns

# Randomly sample row positions for visualization
sample_indices = random.sample(range(len(train_data)), 5)

# Create one figure per sample
for i, idx in enumerate(sample_indices):
    sample = train_data.iloc[idx]
    
    # Create a new figure for each sample
    plt.figure(figsize=(14, 8))
    
    # Create a layout with subplots
    plt.subplot(2, 2, 1)  # Top left for image
    try:
        img_path = sample[image_col]
        img = plt.imread(img_path)
        plt.imshow(img)
        plt.title(f"Pet #{idx} Image", fontsize=12)
        plt.axis('off')
    except Exception as e:
        plt.text(0.5, 0.5, f"Error loading image: {e}", ha='center', va='center')
        plt.axis('off')
    
    # Add pet info in top right
    plt.subplot(2, 2, 2)
    pet_type = "Dog" if sample['Type'] == 1 else "Cat"
    adoption_speed = "Fast" if sample[label_col] == 1 else "Slow"
    # Gender codes follow the same mapping as the distribution plot (3 = mixed group)
    gender = {1: 'Male', 2: 'Female', 3: 'Mixed'}.get(sample['Gender'], 'Unknown')
    
    pet_info = f"Pet #{idx}: {pet_type}\n"
    pet_info += f"Age: {sample['Age']} months\n"
    pet_info += f"Gender: {gender}\n"
    pet_info += f"Color: {sample['Color1']}\n"
    pet_info += f"Maturity Size: {sample['MaturitySize']}\n"
    pet_info += f"Adoption Speed: {adoption_speed}\n"
    
    plt.text(0.05, 0.5, pet_info, fontsize=12, va='center')
    plt.axis('off')
    plt.title("Pet Information", fontsize=12)
    
    # Add text description in bottom part (spans full width)
    plt.subplot(2, 1, 2)  # Bottom half
    desc = sample['Description']
    desc = desc if isinstance(desc, str) else ''  # guard against missing descriptions
    short_desc = desc[:400] + '...' if len(desc) > 400 else desc
    plt.text(0.05, 0.9, "Description:", fontsize=12, fontweight='bold', va='top')
    plt.text(0.05, 0.8, short_desc, fontsize=10, va='top', wrap=True)
    plt.axis('off')
    
    plt.tight_layout()
    plt.show()

# Let's also visualize the distribution of adoption speeds and other features
plt.figure(figsize=(16, 10))

# Subplot 1: Adoption speed distribution
plt.subplot(2, 3, 1)
adoption_counts = train_data[label_col].value_counts().sort_index()
ax = sns.barplot(x=adoption_counts.index, y=adoption_counts.values)
plt.title('Adoption Speed Distribution')
plt.xlabel('Adoption Speed (0=Slow, 1=Fast)')
plt.ylabel('Count')

# Add actual count values on top of bars
for i, count in enumerate(adoption_counts.values):
    ax.text(i, count + 5, str(count), ha='center')

# Subplot 2: Pet type distribution
plt.subplot(2, 3, 2)
type_counts = train_data['Type'].map({1: 'Dog', 2: 'Cat'}).value_counts()
ax = sns.barplot(x=type_counts.index, y=type_counts.values)
plt.title('Pet Type Distribution')
plt.ylabel('Count')

# Add actual count values on top of bars
for i, count in enumerate(type_counts.values):
    ax.text(i, count + 5, str(count), ha='center')

# Subplot 3: Age distribution
plt.subplot(2, 3, 3)
sns.histplot(train_data['Age'], bins=15, kde=True)
plt.title('Pet Age Distribution (months)')
plt.xlabel('Age')

# Subplot 4: Gender distribution
plt.subplot(2, 3, 4)
gender_counts = train_data['Gender'].map({1: 'Male', 2: 'Female', 3: 'Mixed'}).value_counts()
ax = sns.barplot(x=gender_counts.index, y=gender_counts.values)
plt.title('Gender Distribution')
plt.ylabel('Count')

# Add actual count values on top of bars
for i, count in enumerate(gender_counts.values):
    ax.text(i, count + 5, str(count), ha='center')

# Subplot 5: Description length distribution
plt.subplot(2, 3, 5)
# Create a copy of the data for visualization only, don't modify the original data
viz_data = train_data.copy()
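# Treat missing descriptions as empty strings so the length is always defined
viz_data['desc_length'] = viz_data['Description'].fillna('').astype(str).str.len()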
sns.histplot(viz_data['desc_length'], bins=15, kde=True)
plt.title('Description Length Distribution')
plt.xlabel('Character Count')

# Subplot 6: Average adoption speed by pet type
plt.subplot(2, 3, 6)
pet_type_adoption = train_data.groupby('Type')[label_col].mean().reset_index()
pet_type_adoption['Type'] = pet_type_adoption['Type'].map({1: 'Dog', 2: 'Cat'})
sns.barplot(x='Type', y=label_col, data=pet_type_adoption)
plt.title('Average Adoption Speed by Pet Type')
plt.ylabel('Adoption Speed (Higher = Faster)')

plt.tight_layout()
plt.show()

As we can see, our dataset includes multiple types of data for each pet: an image, a text description, and various tabular features. The target variable is `AdoptionSpeed`, which indicates whether the pet was adopted quickly (1) or slowly (0).

## 3. Training with MultiModalPredictor

Now that our data is ready, let's train our multimodal model using AutoGluon's `MultiModalPredictor`. The `MultiModalPredictor` will automatically handle the different data types and train appropriate models for each modality, then combine them to make predictions.

In [None]:
from autogluon.multimodal import MultiModalPredictor

# Initialize the MultiModalPredictor
predictor = MultiModalPredictor(
    label=label_col,
    path='ag_petfinder_model',
    presets='medium_quality'
)

Now, let's fit the predictor to our training data. We'll set a time limit to control the training duration:

In [None]:
# Train the model
predictor.fit(
    train_data=train_data,
    time_limit=300  # 5 minutes
)

### Understanding What Happens During Training

During the training process, `MultiModalPredictor` performs several important steps:

1. **Problem type detection**: Automatically detects whether it's a classification or regression problem
2. **Modality detection**: Identifies which columns contain images, text, or tabular data
3. **Model selection**: Selects appropriate models for each modality from the multimodal model pool
4. **Feature extraction**: Processes each modality to extract meaningful features
   - For images: Uses pre-trained vision models like ResNet or ViT
   - For text: Uses NLP models like BERT or RoBERTa
   - For tabular: Uses numeric and categorical feature processors
5. **Fusion**: Combines features from different modalities using a fusion model (MLP or transformer)
6. **Training**: Trains the end-to-end model to predict the target variable

All of these steps happen automatically, allowing you to focus on your problem rather than the technical details of multimodal model development.
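
To make the fusion step more concrete, here is a deliberately simplified sketch of late fusion: per-modality feature vectors are concatenated and passed through a single linear layer followed by a sigmoid. This is not AutoGluon's implementation (its fusion module is a learned MLP or transformer, as noted above). The function name, feature values, and weights are all hypothetical.

```python
import math

def late_fusion_predict(image_feat, text_feat, tabular_feat, weights, bias):
    """Concatenate per-modality features, then apply a linear layer and a sigmoid."""
    fused = list(image_feat) + list(text_feat) + list(tabular_feat)
    if len(fused) != len(weights):
        raise ValueError("weights must have one entry per fused feature")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Toy feature vectors (hypothetical values, not produced by any real backbone)
toy_image_feat = [0.2, -0.1]
toy_text_feat = [0.5]
toy_tabular_feat = [1.0, 0.0]
toy_weights = [0.0] * 5  # zero weights, so the logit equals the bias

p_fast = late_fusion_predict(
    toy_image_feat, toy_text_feat, toy_tabular_feat, toy_weights, bias=0.0
)
print(p_fast)  # 0.5, because sigmoid(0) = 0.5
```

In a trained model the weights are learned jointly with the backbones, so each modality's contribution to the final score depends on the data.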

## 4. Making Predictions

Now that we've trained our model, let's use it to make predictions on the test dataset:

In [None]:
# Drop the label column so the model only sees the input features
test_data_pred = test_data.drop(columns=[label_col])

# Make predictions
predictions = predictor.predict(test_data_pred)

# Display the first few predictions
print("Predicted adoption speeds (0=slow, 1=fast):")
print(predictions[:10])

For classification tasks, we can also get the prediction probabilities for each class:

In [None]:
# Get prediction probabilities
# Use the same test data without the label column
probs = predictor.predict_proba(test_data_pred)

# Display the first few prediction probabilities
print("Prediction probabilities (one column per class):")
print(probs[:10])
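
Class probabilities can be turned into labels by thresholding the probability of the positive class. The sketch below illustrates this on plain Python lists. It assumes each row is ordered as `[P(class 0), P(class 1)]`, and the helper name and example rows are hypothetical.

```python
def labels_from_probs(prob_rows, threshold=0.5):
    """Map rows of [P(class 0), P(class 1)] to 0/1 labels."""
    return [1 if p_fast >= threshold else 0 for _, p_fast in prob_rows]

# Hypothetical probability rows, for illustration only
example_probs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]

default_labels = labels_from_probs(example_probs)
strict_labels = labels_from_probs(example_probs, threshold=0.65)

print(default_labels)  # [0, 1, 1]
print(strict_labels)   # [0, 0, 1]
```

Raising the threshold trades recall on the "fast" class for precision, which can be useful when false positives are costly.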

## 5. Evaluation

Let's evaluate our model's performance on the test dataset using various metrics:

In [None]:
# Evaluate the model on the test data
# Note: For evaluation, we use the full test data with labels
evaluation_results = predictor.evaluate(test_data, metrics=['accuracy', 'f1', 'roc_auc'])

# Display the evaluation results
print("Evaluation Results:")
for metric, value in evaluation_results.items():
    print(f"{metric}: {value}")
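
To clarify what the accuracy and F1 numbers measure, here is a small from-scratch sketch on toy labels. It is only meant to show the definitions. The helper name and the toy labels are hypothetical and unrelated to the PetFinder results.

```python
def accuracy_and_f1(true_labels, pred_labels, positive=1):
    """Compute accuracy and the F1 score of the positive class."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1

# Toy labels for illustration only
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 1]

toy_accuracy, toy_f1 = accuracy_and_f1(toy_true, toy_pred)
print(f"accuracy={toy_accuracy:.4f}, f1={toy_f1:.4f}")  # accuracy=0.6000, f1=0.6667
```

Here three of five predictions are correct (accuracy 0.6), while F1 combines the two true positives with one false positive and one false negative.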

Let's create a few visualizations to better understand our model's performance:

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Create confusion matrix
y_true = test_data[label_col].values
y_pred = predictions.values
cm = confusion_matrix(y_true, y_pred)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Slow', 'Fast'], 
            yticklabels=['Slow', 'Fast'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Plot ROC curve
# For binary classification, we need to use only the probabilities of the positive class (class 1)
# If probs is a DataFrame with one column per class, select just the positive class column
if isinstance(probs, pd.DataFrame) and probs.shape[1] > 1:
    # Get the probability of the positive class (class 1)
    # Assuming class 1 is the second column (index 1)
    probs_positive_class = probs.iloc[:, 1]
else:
    # If probs is already a Series or 1D array
    probs_positive_class = probs

# Convert to numpy array
probs_values = probs_positive_class.values

# Now we use the 1D array of positive class probabilities
fpr, tpr, _ = roc_curve(y_true, probs_values)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc='lower right')
plt.show()
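
The area under the ROC curve also has a direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on toy data. The function name and the scores are hypothetical, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc_pairwise(true_labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos_scores = [s for t, s in zip(true_labels, scores) if t == 1]
    neg_scores = [s for t, s in zip(true_labels, scores) if t == 0]
    if not pos_scores or not neg_scores:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

# Toy labels and scores for illustration only
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = roc_auc_pairwise(toy_labels, toy_scores)
print(toy_auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, which gives 0.75.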

## 6. Understanding Model Behavior

To better understand how our model makes decisions, let's examine a few test examples and their predictions:

In [None]:
# Select random examples
sample_indices = random.sample(range(len(test_data)), 5)
samples = test_data.iloc[sample_indices].copy()
sample_predictions = predictions.iloc[sample_indices]
sample_probs = probs.iloc[sample_indices]

for i, idx in enumerate(sample_indices):
    sample = test_data.iloc[idx]
    true_label = sample[label_col]
    pred_label = sample_predictions.iloc[i]
    
    # Handle probability correctly based on its type
    prob = sample_probs.iloc[i]
    if isinstance(prob, pd.Series) and len(prob) > 0:
        # If it's a Series with multiple values, get the probability for the positive class (1)
        if len(prob) > 1:
            prob_value = prob.iloc[1]  # Probability of class 1
        else:
            prob_value = prob.iloc[0]  # If there's only one value
    else:
        # If it's already a scalar
        prob_value = prob
    
    print(f"\nExample {i+1} (Pet #{idx}):")
    print(f"True adoption speed: {true_label} ({'Fast' if true_label == 1 else 'Slow'})")
    print(f"Predicted adoption speed: {pred_label} ({'Fast' if pred_label == 1 else 'Slow'})")
    print(f"Probability of fast adoption: {prob_value:.4f}")
    
    # Display the image
    img_path = sample[image_col]
    display(Image(filename=img_path))
    
    # Display key features
    print("Key features:")
    for feature in ['Type', 'Age', 'Breed1', 'Gender', 'Color1', 'MaturitySize']:
        print(f"- {feature}: {sample[feature]}")
    
    # Display a short version of the description
    desc = sample['Description']
    desc = desc if isinstance(desc, str) else ''  # guard against missing descriptions
    short_desc = desc[:200] + '...' if len(desc) > 200 else desc
    print(f"Description excerpt: {short_desc}")
    print("-" * 80)

## 7. Advanced Features: Handling Missing Modalities

In real-world applications, data is often incomplete. A key advantage of multimodal learning is the ability to handle missing modalities. Let's see how our model performs when certain modalities are missing:

In [None]:
# Create test data with missing image modality
test_data_no_images = test_data.copy()
test_data_no_images[image_col] = None  # Set all image paths to None

# Create test data with missing text modality
test_data_no_text = test_data.copy()
test_data_no_text['Description'] = None  # Set all descriptions to None

# Make predictions with missing modalities (without label column)
predictions_no_images = predictor.predict(test_data_no_images.drop(columns=[label_col]))
predictions_no_text = predictor.predict(test_data_no_text.drop(columns=[label_col]))

# Compare evaluation metrics with different modalities
print("Evaluation with all modalities:")
eval_full = predictor.evaluate(test_data)
print(eval_full)

print("\nEvaluation without images:")
eval_no_images = predictor.evaluate(test_data_no_images)
print(eval_no_images)

print("\nEvaluation without text:")
eval_no_text = predictor.evaluate(test_data_no_text)
print(eval_no_text)

# Create a summary table with available metrics
# First, identify common metrics across all evaluations
common_metrics = []
for metric in eval_full.keys():
    if metric in eval_no_images and metric in eval_no_text:
        common_metrics.append(metric)

if common_metrics:
    print("\nComparison of common metrics:")
    results_data = {}
    
    for metric in common_metrics:
        results_data[metric] = [
            eval_full[metric],
            eval_no_images[metric],
            eval_no_text[metric]
        ]
    
    # Create DataFrame with rows as metrics and columns as modality configurations
    results = pd.DataFrame(
        results_data,
        index=['All Modalities', 'No Images', 'No Text']
    ).T  # Transpose to have metrics as rows
    
    print(results)

### Analysis of Missing Modalities

The results above show how the model's performance changes when certain modalities are missing. This can help us understand:

1. The relative importance of each modality for the prediction task
2. The model's robustness to missing data
3. The complementary nature of different modalities

In many real-world applications, having a model that can gracefully handle missing modalities is crucial for deployment in production environments.
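
One simple way to quantify the importance of a modality is to subtract the ablated score from the full score for each metric. The sketch below does this for two metric dictionaries. The helper name is hypothetical, and the numbers are placeholders chosen for easy arithmetic, not results from this workshop.

```python
def metric_drops(full_scores, ablated_scores):
    """Return full minus ablated for every metric present in both dictionaries."""
    return {
        metric: full_scores[metric] - ablated_scores[metric]
        for metric in full_scores
        if metric in ablated_scores
    }

# Placeholder scores for illustration only (NOT measured results)
toy_full = {"accuracy": 0.75, "f1": 0.5}
toy_ablated = {"accuracy": 0.5, "f1": 0.5}

toy_drop = metric_drops(toy_full, toy_ablated)
print(toy_drop)  # {'accuracy': 0.25, 'f1': 0.0}
```

With the evaluation dictionaries from the cell above, a call such as `metric_drops(eval_full, eval_no_images)` would give the per-metric drop caused by removing images. A larger drop suggests the model relies more heavily on that modality.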

## 8. Model Customization: Training with Different Configurations

AutoGluon's `MultiModalPredictor` supports a variety of configurations to customize the training process. Let's explore some of these options:

In [None]:
# Define a custom hyperparameter configuration
hyperparameters = {
    # Use a specific text backbone
    'model.hf_text.checkpoint_name': 'google/electra-small-discriminator',
    
    # Use a DINOv2 image backbone
    'model.timm_image.checkpoint_name': 'timm/vit_small_patch14_dinov2.lvd142m',
    
    # Customize training parameters
    'optim.lr': 5e-5,
    'optim.max_epochs': 30,
    'env.batch_size': 64,
}
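
The dotted keys above address entries in a nested configuration: the part before the first dot names a top-level group such as `optim` or `env`, and the remaining parts name settings inside it. The sketch below shows that naming convention by expanding dotted keys into a nested dictionary. It is an illustration only, with a hypothetical helper name, and AutoGluon resolves these overrides with its own configuration machinery.

```python
def nest_dotted_keys(flat):
    """Expand {'a.b.c': v} into {'a': {'b': {'c': v}}}."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})  # create intermediate levels on demand
        node[leaf] = value
    return nested

# Keys copied from the configuration above, used here only to show the structure
toy_overrides = {"optim.lr": 5e-5, "optim.max_epochs": 30, "env.batch_size": 64}

toy_nested = nest_dotted_keys(toy_overrides)
print(toy_nested)
# {'optim': {'lr': 5e-05, 'max_epochs': 30}, 'env': {'batch_size': 64}}
```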

In [None]:
# Train a model with custom hyperparameters
custom_predictor = MultiModalPredictor(
    label=label_col,
    path='ag_petfinder_custom_model'
)

custom_predictor.fit(
    train_data=train_data,
    hyperparameters=hyperparameters,
    time_limit=600  # 10 minutes
)

In [None]:
# Evaluate the custom model
custom_eval = custom_predictor.evaluate(test_data, metrics=['accuracy', 'f1'])
print("Custom model evaluation:")
print(custom_eval)

# Compare with the default model
default_eval = predictor.evaluate(test_data, metrics=['accuracy', 'f1'])
print("\nDefault model evaluation:")
print(default_eval)

# Create comparison table for common metrics
common_metrics = []
for metric in default_eval.keys():
    if metric in custom_eval:
        common_metrics.append(metric)

if common_metrics:
    print("\nModel Comparison:")
    comparison_data = {}
    for metric in common_metrics:
        comparison_data[metric] = [default_eval[metric], custom_eval[metric]]
    
    comparison = pd.DataFrame(
        comparison_data,
        index=['Default Model', 'Custom Model']
    ).T  # Transpose to have metrics as rows
    
    print(comparison)

## 9. Working with Text-Only Data

While the PetFinder dataset contains multiple modalities, AutoGluon's `MultiModalPredictor` can also work effectively with single-modality data. Let's see how to use it for a text-only classification task.

### Text Classification Example

Let's create a simple text classification task using just the pet descriptions:

In [None]:
# Create text-only datasets
text_train_data = train_data[['Description', label_col]].copy()
text_test_data = test_data[['Description', label_col]].copy()

# Train a text-only model
text_predictor = MultiModalPredictor(
    label=label_col,
    path='ag_petfinder_text_model',
    presets='medium_quality'
)

text_predictor.fit(
    train_data=text_train_data,
    time_limit=180  # 3 minutes
)

# Evaluate the text-only model
# Note: For evaluation, keep the label column
text_eval = text_predictor.evaluate(text_test_data, metrics=['accuracy', 'f1'])
print("Text-only model performance:")
for metric, value in text_eval.items():
    print(f"{metric}: {value:.4f}")

### Model Comparison: Text-Only vs. Full Multimodal

Let's compare the performance of our text-only model with the full multimodal model to understand the value of including images and tabular features:

In [None]:
# Print individual evaluations
print("Multimodal model evaluation:")
print(default_eval)

print("\nText-only model evaluation:")
print(text_eval)

# Create comparison table for common metrics
common_metrics = []
for metric in default_eval.keys():
    if metric in text_eval:
        common_metrics.append(metric)

if common_metrics:
    print("\nModality Comparison:")
    comparison_data = {}
    for metric in common_metrics:
        comparison_data[metric] = [
            default_eval[metric],
            text_eval[metric]
        ]
    
    comparison_full = pd.DataFrame(
        comparison_data,
        index=['Multimodal', 'Text Only']
    ).T  # Transpose to have metrics as rows
    
    print(comparison_full)
    
    # Visualization (DataFrame.plot creates its own figure, so no separate plt.figure call is needed)
    comparison_full.plot(kind='bar', figsize=(8, 5))
    plt.title('Performance Comparison: Multimodal vs. Text-Only Models')
    plt.ylabel('Score')
    plt.xlabel('Metric')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.xticks(rotation=0)
    plt.legend(title='Model Type')
    plt.tight_layout()
    plt.show()

### Insights from Modality Comparison

From this comparison, we can observe how much the text descriptions alone contribute to predicting adoption speed versus the full multimodal approach that incorporates images and tabular features. This analysis helps us understand the relative importance of different modalities in our specific prediction task.

A gap in favor of the multimodal model reflects the value of combining multiple data types when they are available. A small gap suggests that a text-only model can still give reasonable results when images or other modalities are unavailable.

## 10. Model Saving and Loading

Once you've trained a model, you might want to save it for future use or load it in a different session. Here's how to do that with the `MultiModalPredictor`:

In [None]:
# The model is automatically saved during training at the specified path
print(f"Model saved at: {predictor.path}")

# To load a saved model
loaded_predictor = MultiModalPredictor.load(predictor.path)

# Prepare a small subset of data for testing the loaded model (without label column)
test_subset = test_data.iloc[:5].drop(columns=[label_col])

# Verify that the loaded model works
loaded_predictions = loaded_predictor.predict(test_subset)
print("Predictions from loaded model:")
print(loaded_predictions)

## 11. Extracting Embeddings for Other Applications

`MultiModalPredictor` can also be used to extract embeddings (feature vectors) from the data, which can be useful for other applications such as clustering, visualization, or as features for other models:

In [None]:
# Extract embeddings from test data (without label column)
sample_data = test_data.iloc[:]
embeddings = predictor.extract_embedding(sample_data.drop(columns=[label_col]))

# Display the shape of the embeddings
print(f"Embedding shape: {embeddings.shape}")

# Visualize the first two dimensions of the embeddings
plt.figure(figsize=(8, 6))
plt.scatter(
    embeddings[:, 0], 
    embeddings[:, 1],
    c=sample_data[label_col],  # We can still use labels for coloring
    cmap='viridis',
    alpha=0.8
)
plt.colorbar(label='Adoption Speed')
plt.xlabel('Embedding Dimension 1')
plt.ylabel('Embedding Dimension 2')
plt.title('2D Visualization of Multimodal Embeddings')
plt.grid(alpha=0.3)
plt.show()
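
Embeddings are often compared with cosine similarity, for example to retrieve the most similar pet profile for a given query. The following is a minimal pure-Python sketch of that idea. The helper names and the toy vectors are hypothetical, and for real workloads a vectorized NumPy implementation would be preferable.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)

def most_similar(query_index, vectors):
    """Index of the vector closest to vectors[query_index], excluding itself."""
    query = vectors[query_index]
    candidates = [i for i in range(len(vectors)) if i != query_index]
    return max(candidates, key=lambda i: cosine_similarity(query, vectors[i]))

# Toy 2D vectors standing in for real embeddings
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

nearest = most_similar(0, toy_embeddings)
print(nearest)  # 1
```

Assuming `embeddings` is the 2D NumPy array produced in the cell above, `most_similar(0, embeddings.tolist())` would return the row position of the test example closest to the first one.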

## 12. Summary and Next Steps

### Workshop Recap

In this workshop, we explored AutoGluon's `MultiModalPredictor` for handling text, image, and tabular data simultaneously. We trained models to predict pet adoption speed using a combination of pet images, text descriptions, and structured features. The workshop demonstrated how AutoGluon automatically handles different data types, simplifying the typically complex process of multimodal machine learning.

### Key Insights

Multimodal models often outperform single-modality models by leveraging complementary information. We examined this by comparing the full multimodal model against a model trained on text only. We also probed the model's robustness by evaluating it with the image column or the text column blanked out.

### Best Practices

For effective multimodal learning with AutoGluon:
- Ensure image columns contain valid file paths and text data is reasonably clean
- Drop label columns when making predictions but include them for evaluation
- Consider longer training times for better performance
- Set `path` when creating the predictor so the model is saved automatically during training, and reload it later with `MultiModalPredictor.load()`
- Extract embeddings when you need representations for downstream tasks

### Additional Capabilities

Beyond what we've covered, AutoGluon MultiModal supports:

- **Computer Vision**: Object detection, semantic segmentation, and instance segmentation
- **Natural Language Processing**: Named entity recognition, text classification, and question answering
- **Cross-Modal Tasks**: Text-to-image matching and multimodal retrieval
- **Advanced Techniques**: Zero-shot learning, few-shot learning, multi-task learning, and model distillation

### Further Learning

For specialized applications, consult the [AutoGluon documentation](https://auto.gluon.ai/stable/tutorials/multimodal/index.html), which provides comprehensive guides for adapting models to specific domains, optimizing for production deployment, and configuring advanced model architectures.