AI Deep Learning – Simon Stijnen – May 2025

---

# Dinosaur Species Classification using Convolutional Neural Networks

This notebook fine-tunes a pre-trained ResNet34 CNN with fastai to classify dinosaur species using image data from Kaggle.

## 0. Download and Setup Kaggle Dataset

In this section, we'll download the dinosaur image dataset from Kaggle. You need to have a Kaggle account and API key to download datasets programmatically.
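
As a minimal sketch of the credential check, the helper below looks for Kaggle credentials in both places the Kaggle tooling reads them from: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, and a `kaggle.json` token in the `.kaggle` folder of the home directory. The function name `find_kaggle_credentials` and its `home` and `env` parameters are hypothetical and only exist to make the check easy to try out.

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(home=None, env=None):
    """Return 'env', 'file', or None depending on where credentials are found.

    `home` and `env` are hypothetical overrides used for illustration and testing.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Environment variables are checked first
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "env"

    # Otherwise look for the API token file in the home directory
    token = home / ".kaggle" / "kaggle.json"
    if token.is_file():
        try:
            data = json.loads(token.read_text())
        except (OSError, json.JSONDecodeError):
            return None
        if isinstance(data, dict) and data.get("username") and data.get("key"):
            return "file"
    return None


print(find_kaggle_credentials())
```

The function only reports where credentials were found; it never prints the key itself.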

In [None]:
# Install the Kaggle tooling and fastai (pulled in by fastbook) if not already installed
!pip install -q kaggle kagglehub fastbook

# Instructions for downloading kaggle.json credentials (run this once)
print("To download datasets from Kaggle:")
print("1. Go to your Kaggle account settings at https://www.kaggle.com/account")
print("2. Click on 'Create New API Token' to download your kaggle.json file")
print("3. Place the kaggle.json file in ~/.kaggle/")
print("4. Run the cells below to download the dataset")

In [None]:
import os
from fastai.vision.all import *
import pandas as pd

# Check if kaggle.json exists
kaggle_path = os.path.expanduser('~/.kaggle/kaggle.json')  # expanduser only expands a leading '~'
if os.path.exists(kaggle_path):
    print("Kaggle API credentials found!")
else:
    print("Kaggle API credentials not found. Please follow the instructions above.")

In [None]:
# Set the download path for the dataset - using a directory with user permissions
DOWNLOAD_PATH = os.path.join("data")

# Create download directory if it doesn't exist
if not os.path.exists(DOWNLOAD_PATH):
    os.makedirs(DOWNLOAD_PATH)

import kagglehub
import shutil

# Download the latest version. kagglehub.dataset_download returns the path to a
# local directory that already holds the extracted files, not a zip archive.
try:
    path = kagglehub.dataset_download("larserikrisholm/dinosaur-image-dataset-15-species")
    print("Dataset downloaded to:", path)

    # Copy the extracted files into our data directory instead of moving them
    DATASET_PATH = os.path.join(DOWNLOAD_PATH, "dinosaur_dataset")
    shutil.copytree(path, DATASET_PATH, dirs_exist_ok=True)
    print(f"Dataset copied to: {DATASET_PATH}")
    print(f"Dataset path set to: {DATASET_PATH}")
except Exception as e:
    print(f"Error downloading or copying dataset: {e}")
    print("Please download the dataset manually from https://www.kaggle.com/datasets/larserikrisholm/dinosaur-image-dataset-15-species")
    print("Extract it to the 'data/dinosaur_dataset' folder.")
    DATASET_PATH = os.path.join(DOWNLOAD_PATH, "dinosaur_dataset")

## 1. Load and Explore Dataset

In this section, we'll load the Kaggle dataset containing dinosaur images, explore its structure, and visualize some sample images.
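
A minimal sketch of the per-class count, assuming the dataset root contains one subfolder per species (the layout the cells in this section rely on). The helper name `count_images_per_class` and the extension set are illustrative choices, not part of the dataset or of fastai.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; adjust if the dataset uses others
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root):
    """Count image files in each immediate subfolder of `root`."""
    counts = Counter()
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # Skip stray files in the dataset root
        counts[class_dir.name] = sum(
            1
            for p in class_dir.rglob("*")
            if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling `count_images_per_class(DATASET_PATH)` returns a `Counter` keyed by folder name. Unlike a plain `os.listdir` count, it ignores non-image files and matches extensions case-insensitively.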

In [None]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import random

# Configure plot settings
plt.style.use('ggplot')
%matplotlib inline

In [None]:
# Check if the dataset is available
print(DATASET_PATH)
if not os.path.exists(DATASET_PATH):
    print(f"Dataset not found. Please download the dataset from Kaggle and extract it to the {DATASET_PATH} folder.")
    raise FileNotFoundError(f"Dataset not found at {DATASET_PATH}")

print(f"Dataset found at {DATASET_PATH}!")
# List the contents of the dataset directory
print("\nDataset structure:")
for root, dirs, files in os.walk(DATASET_PATH, topdown=True, onerror=None):
    level = root.replace(DATASET_PATH, '').count(os.sep)
    indent = ' ' * 4 * level
    print(f"{indent}{os.path.basename(root)}/")
    for file in files[:5]:  # Show only first 5 files per directory
        print(f"{indent}    {file}")
    if len(files) > 5:
        print(f"{indent}    ... ({len(files) - 5} more files)")

In [None]:
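def can_open_image(fn):
    """Return True if PIL can open and verify the file as an image."""
    try:
        # The context manager closes the file handle; verify() checks integrity
        with Image.open(fn) as img:
            img.verify()
        return True
    except Exception:
        return False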


def get_species(fn: str):
    """Get the species label from the filename."""
    # Assuming the filename format is 'species_name_1.jpg', 'species_name_2.jpg', etc.
    # Adjust the split logic based on your filename format
    # Example: 'species_name_1.jpg' -> 'species_name'
    # rsplit drops only the trailing index, so underscores inside the species name survive
    label = os.path.basename(fn).rsplit("_", 1)[0]
    return label

In [None]:
# Count images per class
class_counts = {}

for class_name in os.listdir(DATASET_PATH):
    class_path = os.path.join(DATASET_PATH, class_name)
    if os.path.isdir(class_path):
        class_counts[class_name] = len(os.listdir(class_path))
        
# Create dataframe and plot distribution
class_df = pd.DataFrame({
    'Dinosaur Species': list(class_counts.keys()),
    'Image Count': list(class_counts.values())
})

class_df = class_df.sort_values('Image Count', ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x='Dinosaur Species', y='Image Count', data=class_df)
plt.xticks(rotation=90)
plt.title('Number of Images per Dinosaur Species')
plt.tight_layout()
plt.show()

print(f"Total number of classes: {len(class_counts)}")
print(f"Total number of images: {sum(class_counts.values())}")

In [None]:
# Display sample images from a random selection of classes
def display_sample_images(dataset_path, num_classes=5, samples_per_class=4):
    """
    Display sample images from random classes in the dataset
    """
    # Only directories are classes; skip stray files in the dataset root
    classes = [d for d in os.listdir(dataset_path) if os.path.isdir(os.path.join(dataset_path, d))]
    selected_classes = random.sample(classes, min(num_classes, len(classes)))

    # squeeze=False keeps axs two-dimensional even with a single row or column
    fig, axs = plt.subplots(len(selected_classes), samples_per_class, figsize=(12, 10), squeeze=False)

    # Hide every axis up front so cells without an image stay blank
    for ax in axs.flat:
        ax.axis('off')

    for i, class_name in enumerate(selected_classes):
        class_path = os.path.join(dataset_path, class_name)
        image_files = os.listdir(class_path)
        selected_images = random.sample(image_files, min(samples_per_class, len(image_files)))

        for j, img_file in enumerate(selected_images):
            img_path = os.path.join(class_path, img_file)
            img = Image.open(img_path)
            axs[i, j].imshow(img)
            axs[i, j].set_title(f"{class_name}")

    plt.tight_layout()
    plt.show()

# Display sample images
display_sample_images(DATASET_PATH)

## 2. Preprocess Dataset

In this section, we'll collect the image files and balance the dataset by undersampling every species to the size of the smallest class. Resizing, augmentation, and normalization are applied later, inside the DataBlock.
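
The balancing step in this section undersamples every species down to the size of the smallest class. The sketch below shows the same idea in plain Python, without pandas, to make the logic explicit. The function name `undersample` and the toy file names are hypothetical.

```python
import random
from collections import defaultdict


def undersample(items, labels, seed=42):
    """Keep the same number of items per label, set by the smallest class."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    if not by_label:
        return []

    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # Fixed seed so the selection is reproducible

    balanced = []
    for label in sorted(by_label):
        # Sampling without replacement, like DataFrame.sample(replace=False)
        for item in rng.sample(by_label[label], n):
            balanced.append((item, label))
    return balanced


# Toy example: three classes with 3, 2, and 1 images
files = ["a1.jpg", "a2.jpg", "a3.jpg", "b1.jpg", "b2.jpg", "c1.jpg"]
labels = ["a", "a", "a", "b", "b", "c"]
balanced = undersample(files, labels)
print(len(balanced))  # 3: one image per class
```

Undersampling discards data from the larger classes, which is a reasonable trade-off when the classes are close in size but costly when one class is much smaller than the rest.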

In [None]:
def get_image_files(path, recurse=True) -> list[str]:
    """Get all image files in the dataset directory.

    Note: this shadows fastai's own `get_image_files` imported above.
    """
    extensions = ('.jpg', '.jpeg', '.png')
    # Compare lowercased names so files such as 'IMG.JPG' are not skipped
    if recurse:
        return [os.path.join(root, file) for root, _, files in os.walk(path) for file in files if file.lower().endswith(extensions)]
    else:
        return [os.path.join(path, file) for file in os.listdir(path) if file.lower().endswith(extensions)]

print("Getting image files...")

all_images = get_image_files(DATASET_PATH, recurse=True)

df = pd.DataFrame({'image': all_images, 'species': [get_species(fn) for fn in all_images]})
min_count = df['species'].value_counts().min()

balanced_df = df.groupby('species').sample(n=min_count, replace=False, random_state=42)

plt.figure(figsize=(12, 6))
sns.countplot(data=balanced_df, x='species', order=balanced_df['species'].value_counts().index)
plt.xticks(rotation=90)
plt.title('Balanced Number of Images per Dinosaur Species')
plt.tight_layout()
plt.show()
print(f"Balanced dataset size: {len(balanced_df)}")

def get_balanced_image_files(path):
    """Return the balanced file list; `path` is unused but matches the `get_items` signature."""
    return balanced_df['image'].tolist()

## 3. Create a DataBlock and DataLoaders, and Train the Model

Set up the DataBlock with data augmentation, create a learner, use the learning rate finder to suggest a learning rate, and train the model.
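
The `Normalize.from_stats(*imagenet_stats)` step in the DataBlock standardizes each color channel with the ImageNet mean and standard deviation, matching the statistics the pre-trained ResNet34 was trained with. The sketch below shows the arithmetic for a single 8-bit RGB pixel in plain Python. The helper `normalize_pixel` is illustrative only; fastai applies the same operation to whole batches of tensors.

```python
# Per-channel ImageNet statistics for images scaled to [0, 1]
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


print(normalize_pixel((0, 0, 0)))        # Black pixel: all channels negative
print(normalize_pixel((255, 255, 255)))  # White pixel: all channels positive
```

After normalization the inputs are roughly centered on zero, which is the range the pre-trained weights expect.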

In [None]:
# Check the number of images in the balanced dataset
balanced_files = get_balanced_image_files(DATASET_PATH)
print(f"Number of images in the balanced dataset: {len(balanced_files)}")

# If the above number is greater than zero, proceed to create the DataLoaders
if len(balanced_files) > 0:
    # Use DATASET_PATH rather than `path`, which is undefined if the download failed
    dls = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_balanced_image_files,
        splitter=RandomSplitter(valid_pct=0.25, seed=42),
        get_y=parent_label,
        item_tfms=Resize(460),
        batch_tfms=[
            *aug_transforms(size=224, min_scale=0.75),
            Normalize.from_stats(*imagenet_stats),
        ],
    ).dataloaders(DATASET_PATH, bs=32)

    # Create a learner
    learn = vision_learner(dls, resnet34, metrics=error_rate)

    # Find learning rate; suggestions are returned in the order of suggest_funcs
    lr_valley, lr_steep = learn.lr_find(suggest_funcs=(valley, steep))

    # Train the model
    try:
        learn.fine_tune(6, base_lr=lr_valley)
    except Exception as e:
        print(f"Error during fine-tuning: {e}")

else:
    print("No images found in the balanced dataset. Please check your dataset.")

In [None]:
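# Save the model
from pathlib import Path

# learn.export resolves relative names against learn.path, so pass an absolute
# path to keep the file where load_learner looks for it later
model_file = Path('model/dinosaur_classifier.pkl').absolute()
model_file.parent.mkdir(parents=True, exist_ok=True)
learn.export(model_file)
print(f"Model saved as '{model_file}'")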

## 5. Model Inference on New Images

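Let's use our trained model to make predictions on individual dinosaur images. Note that the sample images in this section are drawn at random from the full dataset, so some of them may have been seen during training.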
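
The prediction helper in this section returns one probability per class and then keeps the five most likely species. A minimal sketch of that ranking step, independent of fastai, might look as follows. The function name `top_k_predictions` and the class names in the example are placeholders, not taken from the dataset.

```python
def top_k_predictions(probabilities, vocab, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    if len(probabilities) != len(vocab):
        raise ValueError("probabilities and vocab must have the same length")
    ranked = sorted(zip(vocab, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Placeholder class names and probabilities for illustration
vocab = ["species_a", "species_b", "species_c", "species_d"]
probs = [0.05, 0.70, 0.15, 0.10]
for label, p in top_k_predictions(probs, vocab, k=3):
    print(f"{label}: {p:.2%}")
```

In the notebook the same ranking is done with a pandas `sort_values` followed by `head(5)`.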

In [None]:
# Load the saved model
try:
    learn_inference = load_learner('model/dinosaur_classifier.pkl')
    print("Model loaded successfully!")
except Exception as e:
    print(f"Error loading model: {e}")
    print("Falling back to the learner that is still in memory...")
    learn_inference = learn

In [None]:
# Create a function to make predictions on new images
def predict_image(img_path):
    """Make prediction on a dinosaur image"""
    img = PILImage.create(img_path)
    pred_class, pred_idx, probs = learn_inference.predict(img)
    return {
        'prediction': pred_class,
        'probability': float(probs[pred_idx]),
        'all_probabilities': {learn_inference.dls.vocab[i]: float(probs[i]) for i in range(len(probs))}
    }

# Create a function to display predictions on an image
def show_prediction(img_path):
    """Display an image with its prediction"""
    img = PILImage.create(img_path)
    pred_data = predict_image(img_path)
    
    # Display image with prediction
    plt.figure(figsize=(8, 6))
    plt.imshow(img)
    plt.axis('off')
    plt.title(f"Prediction: {pred_data['prediction']}\nProbability: {pred_data['probability']:.2%}", fontsize=16)
    plt.show()
    
    # Display prediction probabilities
    probs_df = pd.DataFrame({
        'Species': list(pred_data['all_probabilities'].keys()),
        'Probability': list(pred_data['all_probabilities'].values())
    }).sort_values('Probability', ascending=False).head(5)
    
    plt.figure(figsize=(10, 5))
    sns.barplot(x='Probability', y='Species', data=probs_df)
    plt.title('Top 5 Predictions', fontsize=14)
    plt.xlim(0, 1)
    plt.show()
    
    return pred_data

In [None]:
# Let's try our model on a few random images drawn from the dataset
import random

# Get random images from different dinosaur species
random_test_images = []
classes = list(os.listdir(DATASET_PATH))
for class_name in random.sample(classes, min(5, len(classes))):
    class_path = os.path.join(DATASET_PATH, class_name)
    if os.path.isdir(class_path):
        image_files = os.listdir(class_path)
        if image_files:
            random_img = random.choice(image_files)
            random_test_images.append(os.path.join(class_path, random_img))

# Make predictions on random test images
for img_path in random_test_images:
    show_prediction(img_path)

## 6. Evaluate Model Performance

Let's evaluate our trained model on the validation set (the 25% of images held out by `RandomSplitter`) to see how well it generalizes. The notebook does not keep a separate test set.
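
The metrics reported in this section all derive from comparing predicted and true labels. As a rough sketch with made-up labels, the error rate is the fraction of mismatches, accuracy is its complement, and the most confused pairs are the most frequent (actual, predicted) mismatches. The helper names below are illustrative and not part of fastai.

```python
from collections import Counter


def error_rate_from_labels(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and equally long")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def confused_pairs(y_true, y_pred):
    """Count each (actual, predicted) pair where the prediction was wrong."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)


# Made-up labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

err = error_rate_from_labels(y_true, y_pred)
print(f"Error rate: {err:.4f}, accuracy: {1 - err:.4f}")
print(confused_pairs(y_true, y_pred))
```

Because the validation images come from the same source as the training images, these numbers say little about performance on photos or drawings from elsewhere.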

In [None]:
# Evaluate the model on the validation dataset using FastAI
val_loss, val_metrics = learn.validate()

print(f"Validation Loss: {val_loss}")
print(f"Validation Error Rate: {val_metrics}")
print(f"Validation Accuracy: {1 - val_metrics:.4f} ({(1 - val_metrics) * 100:.2f}%)")

In [None]:
# Plot the confusion matrix with more detail
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12, 12), dpi=100)

# Get detailed classification report
from sklearn.metrics import classification_report
preds, y = learn.get_preds()
pred_class = preds.argmax(dim=1)
report = classification_report(y.numpy(), pred_class.numpy(), target_names=learn.dls.vocab)
print("\nDetailed Classification Report:")
print(report)

In [None]:
# Look at the most confused categories
interp.most_confused(min_val=2)

## 7. Model Deployment Considerations

In this section, we'll export the metadata a deployment would need alongside the model: the class mapping, the architecture details, and the validation metrics.
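
One detail matters when the class mapping is read back in a deployed service: JSON object keys are always strings, so the integer indices written by `json.dump` come back as `"0"`, `"1"`, and so on. The sketch below shows the round trip with placeholder class names and converts the keys back to integers.

```python
import json

# Placeholder vocabulary standing in for learn.dls.vocab
vocab = ["species_a", "species_b", "species_c"]
class_mapping = {i: name for i, name in enumerate(vocab)}

# Serializing turns the integer keys into strings
restored_raw = json.loads(json.dumps(class_mapping))
print(list(restored_raw.keys()))

# Convert the keys back before indexing with a predicted class index
restored = {int(index): name for index, name in restored_raw.items()}
print(restored[1])
```

Skipping the conversion leads to a `KeyError` when the mapping is indexed with the integer class index returned by the model.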

In [None]:
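# Export model class mapping for deployment
import json
class_mapping = {i: class_name for i, class_name in enumerate(learn.dls.vocab)}

# Make sure the output folder exists before writing any files
os.makedirs('model', exist_ok=True)

# Save mapping to JSON file for later use in deployment
# (JSON stores the integer keys as strings)
with open('model/dinosaur_class_mapping.json', 'w') as f:
    json.dump(class_mapping, f)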

# Save model architecture details
model_architecture = {
    "model_type": "ResNet34",
    "pre_trained": True,
    "num_classes": len(class_mapping),
    "image_size": 224,
    "normalization": "ImageNet stats"
}

with open('model/dinosaur_model_architecture.json', 'w') as f:
    json.dump(model_architecture, f)

# Save model performance metrics
model_performance = {
    "validation_loss": float(val_loss),
    "validation_error_rate": float(val_metrics),
    "validation_accuracy": float(1 - val_metrics),
    "training_date": "May 5, 2025"
}

with open('model/dinosaur_model_performance.json', 'w') as f:
    json.dump(model_performance, f)

## 8. Conclusion and Future Improvements

In this notebook, we've built a Convolutional Neural Network model for classifying dinosaur species using images. Here's a summary of what we've achieved:

1. **Data Loading and Exploration**: We loaded a dataset of dinosaur images, explored the class distribution, and visualized sample images from a random selection of classes.
2. **Data Preprocessing**: We balanced the classes by undersampling, then resized the images and applied data augmentation to help the model generalize better.
3. **Model Training**: We fine-tuned a pre-trained ResNet34 model on our dinosaur dataset and tracked its error rate on a held-out validation split.
4. **Evaluation**: We evaluated the model's performance using validation metrics and visualized the confusion matrix to understand where the model makes mistakes.
5. **Inference**: We demonstrated how to use the trained model to make predictions on new dinosaur images.

### Potential Improvements:

1. **More Data**: Collect more dinosaur images to improve model robustness.
2. **Advanced Architectures**: Experiment with more advanced CNN architectures like EfficientNet or Vision Transformers.
3. **Ensemble Methods**: Combine predictions from multiple models for better accuracy.
4. **Extended Augmentation**: Implement more aggressive data augmentation to handle varied image conditions.
5. **Transfer Learning**: Explore different pre-trained models and fine-tuning strategies.
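
To illustrate the ensemble idea from the list above, one simple option is to average the class probabilities of several models and pick the class with the highest mean. This is a sketch under that assumption, not something the notebook implements. The function name `average_ensemble` and the probability vectors are made up.

```python
def average_ensemble(prob_vectors):
    """Average per-class probabilities across models (soft voting)."""
    if not prob_vectors:
        raise ValueError("need at least one probability vector")
    n_classes = len(prob_vectors[0])
    if any(len(v) != n_classes for v in prob_vectors):
        raise ValueError("all probability vectors must have the same length")
    n_models = len(prob_vectors)
    return [sum(v[i] for v in prob_vectors) / n_models for i in range(n_classes)]


# Made-up outputs of three models for one image and three classes
model_outputs = [
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.50, 0.40, 0.10],
]
averaged = average_ensemble(model_outputs)
predicted_index = max(range(len(averaged)), key=averaged.__getitem__)
print(averaged, predicted_index)
```

Averaging only helps when the models make different mistakes, so the members of an ensemble are usually trained with different architectures, seeds, or data splits.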

### Applications:

This model could be integrated into educational applications, museum exhibits, or paleontology research tools to help identify dinosaur species from images.