<a href="https://colab.research.google.com/github/radhakrishnan-omotec/arwan-iris-dog-repo/blob/main/RPI_TESTING_ISEF1_RestNet152_Dog_Emotion_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enhanced Python Notebook for **TailSense** : ResNet152-based Canine Pet Image and Audio Spectrogram Classification

### Author : ARWAN MAKHIJA

---

This Google Colab notebook combines image classification and audio-spectrogram classification, using a separate ResNet152 model for each modality.

It leverages ResNet152’s deep architecture (~60M parameters) with residual connections for classifying dog emotions from both image and audio data derived from videos of a Cocker Spaniel.

The dataset is assumed to contain 8-10 emotion classes (e.g., "defensive," "stressed," "friendly"), and the implementation includes data preprocessing, model training, evaluation, a Gradio interface for real-time inference, and TensorFlow Lite conversion for edge deployment.

---

# ResNet152-based Image and Spectrogram Audio Classification

This notebook implements a dual-classification system using the ResNet152 Convolutional Neural Network (CNN) for classifying dog emotions from both image data (extracted from video frames) and spectrogram audio data (derived from video audio).

The system is optimized for Colab’s GPU environment and includes:

**Image Classification**: Extracts frames from videos and classifies them into 8-10 emotion categories using ResNet152. <br>
**Spectrogram Audio Classification**: Converts audio segments into mel-spectrogram images and classifies them using a separate ResNet152 model.<br>
**Real-time Inference**: Provides a Gradio interface for capturing and predicting emotions from image and audio inputs.<br>
**Edge Deployment**: Converts models to TensorFlow Lite for deployment on devices like Raspberry Pi 5.<br><br><br>
The goal is to achieve the highest possible accuracy (targeting 92-95%) by leveraging ResNet152’s depth, fine-tuning, and data augmentation.

# Workflow <br>

1.   Setup and Import Libraries<br>
2.   Generate Individual IMAGES Dataset from Videos<br>
3.   Train ResNet152 Model on IMAGES Dataset for Emotion Classification<br>
4.   Split Audio into AUDIO Dataset Folders<br>
5.   Convert Audio to SPECTROGRAM Images<br>
6.   Train ResNet152 Model on AUDIO SPECTROGRAM Dataset for Emotion Classification<br>
7.   Develop Gradio User Interface for Input Capture (Image and Audio Spectrogram)<br>
8.   Integrate the Gradio Interface with Raspberry Pi 5<br>
9.   Predict Emotions Using CNN Models via Gradio Interface<br>
10.   Evaluate Results with Confusion Matrix and Metrics + Convert to TensorFlow Lite<br>
11.   Evaluate and Visualize Model Results<br>
12.   Develop Gradio Interface for Real-Time Input Capture and Prediction<br>
13.   Raspberry Pi Integration Snippet [OPTIONAL] <br>


---

### **Step 1: Setup and Import Libraries**

Explanation:

Imports libraries for deep learning (TensorFlow, Keras), image processing (OpenCV), audio processing (Librosa), and user interface (Gradio).
Enables GPU support in Google Colab for faster training and inference.

In [None]:
# Cell : Setup and Imports
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet152
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
import numpy as np
import matplotlib.pyplot as plt
import os
import cv2
import librosa
import librosa.display
import gradio as gr
from google.colab import drive
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import zipfile

In [None]:
# Enable GPU
physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
print("TensorFlow version:", tf.__version__)
# tf.test.is_gpu_available() is deprecated; reuse the device list queried above
print("GPU available:", bool(physical_devices))

### **Step 2: Generate Individual IMAGES Dataset from Videos**

Explanation:

Mounts Google Drive to access the video dataset.
Extracts frames from videos at a rate of 1 frame per second and organizes them into emotion-specific subfolders within IMAGE_DATASET.
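
To keep roughly one frame per second, the sampling step has to follow the video's frame rate. The sketch below illustrates that arithmetic only; the helper name `frames_to_keep` and the 30 fps figure are assumptions for illustration, not values taken from the dataset.

```python
def frames_to_keep(total_frames, fps, frames_per_second=1):
    """Return indices of frames kept when sampling about `frames_per_second` per second."""
    # Keep every `step`-th frame; never let the step drop below 1
    step = max(1, round(fps / frames_per_second))
    return list(range(0, total_frames, step))


# A 3-second clip recorded at 30 fps (hypothetical numbers)
kept = frames_to_keep(total_frames=90, fps=30)
print(kept)  # [0, 30, 60]
```

With a step of 1, every frame is written, which quickly inflates the dataset with near-duplicate images.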

In [None]:
# Cell : Extract Frames from Videos
drive.mount('/content/drive')

# Define paths
video_dir = '/content/drive/MyDrive/Arwan IRIS/DATASET_VIDEOS'
image_dataset_dir = '/content/drive/MyDrive/Arwan IRIS/IMAGE_DATASET'



In [None]:
# Create image dataset directory if it doesn't exist
os.makedirs(image_dataset_dir, exist_ok=True)


In [None]:
# Extract frames from videos
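def extract_frames(video_path, output_folder, frame_rate=1):
    """Save about `frame_rate` frames per second of video as JPEG files."""
    vidcap = cv2.VideoCapture(video_path)
    fps = vidcap.get(cv2.CAP_PROP_FPS) or 30  # fall back to 30 if FPS metadata is missing
    step = max(1, int(round(fps / frame_rate)))  # keep every `step`-th frame
    # Prefix with the video name so frames from different videos do not overwrite each other
    prefix = os.path.splitext(os.path.basename(video_path))[0]
    success, image = vidcap.read()
    count = 0
    while success:
        if count % step == 0:
            cv2.imwrite(os.path.join(output_folder, f"{prefix}_frame{count}.jpg"), image)
        success, image = vidcap.read()
        count += 1
    vidcap.release()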

In [None]:
# Process each emotion folder
for emotion_folder in os.listdir(video_dir):
    emotion_path = os.path.join(video_dir, emotion_folder)
    if os.path.isdir(emotion_path):
        output_emotion_folder = os.path.join(image_dataset_dir, emotion_folder)
        os.makedirs(output_emotion_folder, exist_ok=True)
        for video_file in os.listdir(emotion_path):
            if video_file.endswith('.mp4'):
                video_path = os.path.join(emotion_path, video_file)
                extract_frames(video_path, output_emotion_folder)

print("Image dataset generated successfully.")
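
Before training, it can help to confirm how many images each emotion folder received. The helper below is a minimal sketch; `count_images_per_class` is a hypothetical name, and the extension list is an assumption.

```python
import os


def count_images_per_class(dataset_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Map each class subfolder of `dataset_dir` to the number of image files it contains."""
    counts = {}
    for entry in sorted(os.listdir(dataset_dir)):
        class_path = os.path.join(dataset_dir, entry)
        if os.path.isdir(class_path):  # ignore loose files next to the class folders
            counts[entry] = sum(
                1 for name in os.listdir(class_path)
                if name.lower().endswith(extensions)
            )
    return counts

# Example (in this notebook): count_images_per_class(image_dataset_dir)
```

A strongly unbalanced result suggests collecting more videos for the rare emotions or weighting classes during training.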

### **Step 3: Train ResNet152 Model on IMAGES Dataset for Emotion Classification**

Explanation:

Preprocesses images with augmentation to improve generalization.
Defines a ResNet152 model with a custom head (2048 and 1024 dense layers) for image classification.
Trains initially with frozen base layers, then fine-tunes the last 30 layers for higher accuracy.
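
To give a sense of how much of the model is trained while the base is frozen, the sketch below counts the parameters of the custom head. It assumes the head receives the 2048-dimensional pooled ResNet152 features and uses 8 classes as an example; the function names are illustrative.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


def head_params(num_classes, feature_dim=2048):
    """Trainable parameters in the Dense(2048) -> Dense(1024) -> Dense(num_classes) head.

    Pooling and dropout layers contribute no parameters.
    """
    return (dense_params(feature_dim, 2048)
            + dense_params(2048, 1024)
            + dense_params(1024, num_classes))


print(head_params(num_classes=8))  # 6302728
```

In the first training phase only these roughly 6.3M parameters are updated; fine-tuning then unfreezes part of the base on top of that.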

In [None]:
# Cell : Load and Preprocess Image Dataset
img_height, img_width = 224, 224
batch_size = 32
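# Count only subfolders so stray files in the dataset directory are not treated as classes
num_classes = len([d for d in os.listdir(image_dataset_dir)
                   if os.path.isdir(os.path.join(image_dataset_dir, d))])  # Number of emotion classes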

train_datagen = ImageDataGenerator(
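    # No rescale: resnet.preprocess_input expects raw 0-255 pixels, and rescaling
    # here would make training inputs differ from the inference inputs in Step 7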
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.2,
    preprocessing_function=tf.keras.applications.resnet.preprocess_input
)

train_generator = train_datagen.flow_from_directory(
    image_dataset_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    subset='training',
    shuffle=True
)

val_generator = train_datagen.flow_from_directory(
    image_dataset_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    subset='validation',
    shuffle=False
)

class_names = list(train_generator.class_indices.keys())
print("Class names:", class_names)

In [None]:
# Cell : Define and Train ResNet152 Model for Images
def create_resnet152_model(num_classes):
    base_model = ResNet152(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    base_model.trainable = False
    model = models.Sequential([
        base_model,
        layers.GlobalAveragePooling2D(),
        layers.Dense(2048, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(1024, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax')
    ])
    return model

model_image = create_resnet152_model(num_classes)
model_image.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

epochs = 20
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint('/content/drive/MyDrive/Arwan IRIS/resnet152_image_best.h5',
                                       monitor='val_accuracy', save_best_only=True)
]

history_image = model_image.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    validation_data=val_generator,
    validation_steps=val_generator.samples // batch_size,
    epochs=epochs,
    callbacks=callbacks
)

# Fine-tune last 30 layers
base_model = model_image.layers[0]
base_model.trainable = True
for layer in base_model.layers[:-30]:
    layer.trainable = False

model_image.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

fine_tune_epochs = 10
history_fine_image = model_image.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    validation_data=val_generator,
    validation_steps=val_generator.samples // batch_size,
    epochs=fine_tune_epochs,
    callbacks=callbacks
)


In [None]:
model_image.save('/content/drive/MyDrive/Arwan IRIS/resnet152_image_final.h5')

### **Step 4: Split Audio into AUDIO Dataset Folders**

Explanation:

Extracts audio from videos using FFmpeg and splits it into 10-second segments.
Saves segments into emotion-specific subfolders within AUDIO_DATASET.
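
The segment boundaries can be worked out from the sample count alone. The following sketch shows that arithmetic, assuming the 44.1 kHz sample rate and 10-second segments described above; `segment_bounds` is a hypothetical helper name.

```python
def segment_bounds(num_samples, sr=44100, segment_length=10):
    """Return (start, end) sample indices for consecutive fixed-length segments."""
    samples_per_segment = sr * segment_length
    return [
        (start, min(start + samples_per_segment, num_samples))
        for start in range(0, num_samples, samples_per_segment)
    ]


# A 25-second clip yields two full segments and one 5-second remainder
print(segment_bounds(25 * 44100))
# [(0, 441000), (441000, 882000), (882000, 1102500)]
```

The final segment is usually shorter than the others, so its spectrogram covers less time; dropping or padding it are both reasonable options.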

In [None]:
# Cell : Extract and Split Audio from Videos
audio_dataset_dir = '/content/drive/MyDrive/Arwan IRIS/AUDIO_DATASET'
os.makedirs(audio_dataset_dir, exist_ok=True)

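import subprocess
import soundfile as sf  # librosa.output.write_wav was removed in librosa 0.8

def extract_and_split_audio(video_path, output_folder, segment_length=10):
    prefix = os.path.splitext(os.path.basename(video_path))[0]
    # Keep the full-length track outside the dataset folder so it is not mistaken for a segment
    audio_path = '/content/temp_audio.wav'
    # Pass arguments as a list so paths containing spaces are handled correctly
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "pcm_s16le",
                    "-ar", "44100", "-ac", "1", audio_path], check=True)
    y, sr = librosa.load(audio_path, sr=44100)
    duration = librosa.get_duration(y=y, sr=sr)
    for i in range(0, int(duration), segment_length):
        start = i * sr
        end = min((i + segment_length) * sr, len(y))
        segment = y[start:end]
        # Prefix with the video name so segments from different videos do not overwrite each other
        segment_path = os.path.join(output_folder, f"{prefix}_segment_{i}.wav")
        sf.write(segment_path, segment, sr)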

for emotion_folder in os.listdir(video_dir):
    emotion_path = os.path.join(video_dir, emotion_folder)
    if os.path.isdir(emotion_path):
        output_emotion_folder = os.path.join(audio_dataset_dir, emotion_folder)
        os.makedirs(output_emotion_folder, exist_ok=True)
        for video_file in os.listdir(emotion_path):
            if video_file.endswith('.mp4'):
                video_path = os.path.join(emotion_path, video_file)
                extract_and_split_audio(video_path, output_emotion_folder)

print("Audio dataset generated successfully.")

### **Step 5: Convert Audio to SPECTROGRAM Images**

Explanation:

Converts each audio segment into a mel-spectrogram image using Librosa.
Saves spectrograms into emotion-specific subfolders within SPECTROGRAM_DATASET.
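
As a rough guide to the size of each mel-spectrogram before it is rendered as an image, the sketch below estimates its shape. It assumes librosa's defaults (22,050 Hz loading rate, hop length 512, centered frames) and 128 mel bands; `mel_spectrogram_shape` is an illustrative helper, not a librosa function.

```python
def mel_spectrogram_shape(duration_s, sr=22050, hop_length=512, n_mels=128):
    """Expected (n_mels, n_frames) for a centered short-time Fourier transform."""
    num_samples = int(duration_s * sr)
    n_frames = 1 + num_samples // hop_length
    return (n_mels, n_frames)


print(mel_spectrogram_shape(10))  # (128, 431)
```

The saved PNG is later resized to 224x224, so the time axis is compressed by about half for a full 10-second segment.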

In [None]:
# Cell : Generate Spectrogram Images
spectrogram_dataset_dir = '/content/drive/MyDrive/Arwan IRIS/SPECTROGRAM_DATASET'
os.makedirs(spectrogram_dataset_dir, exist_ok=True)

def generate_spectrogram(audio_path, output_image_path):
    y, sr = librosa.load(audio_path)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    S_dB = librosa.power_to_db(S, ref=np.max)
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(S_dB, sr=sr, x_axis='time', y_axis='mel')
    plt.axis('off')
    plt.savefig(output_image_path, bbox_inches='tight', pad_inches=0)
    plt.close()

for emotion_folder in os.listdir(audio_dataset_dir):
    emotion_path = os.path.join(audio_dataset_dir, emotion_folder)
    if os.path.isdir(emotion_path):
        output_emotion_folder = os.path.join(spectrogram_dataset_dir, emotion_folder)
        os.makedirs(output_emotion_folder, exist_ok=True)
        for audio_file in os.listdir(emotion_path):
            if audio_file.endswith('.wav'):
                audio_path = os.path.join(emotion_path, audio_file)
                output_image_path = os.path.join(output_emotion_folder, f"{os.path.splitext(audio_file)[0]}.png")
                generate_spectrogram(audio_path, output_image_path)

print("Spectrogram dataset generated successfully.")

### **Step 6: Train ResNet152 Model on AUDIO SPECTROGRAM Dataset for Emotion Classification**

Explanation:

Preprocesses spectrogram images with augmentation.
Trains a separate ResNet152 model on the spectrogram dataset, including fine-tuning for optimal performance.

In [None]:
# Cell : Load and Preprocess Spectrogram Dataset
train_datagen_spectrogram = ImageDataGenerator(
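    # No rescale: resnet.preprocess_input expects raw 0-255 pixels, and rescaling
    # here would make training inputs differ from the inference inputs in Step 7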
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.2,
    preprocessing_function=tf.keras.applications.resnet.preprocess_input
)

train_generator_spectrogram = train_datagen_spectrogram.flow_from_directory(
    spectrogram_dataset_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    subset='training',
    shuffle=True
)

val_generator_spectrogram = train_datagen_spectrogram.flow_from_directory(
    spectrogram_dataset_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    subset='validation',
    shuffle=False
)

In [None]:
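# Cell : Define and Train ResNet152 Model for Spectrograms
# Use a separate checkpoint file so the best image model saved in Step 3 is not overwritten
# (the file name below is an assumption that follows the image model's naming)
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint('/content/drive/MyDrive/Arwan IRIS/resnet152_spectrogram_best.h5',
                                       monitor='val_accuracy', save_best_only=True)
]
model_spectrogram = create_resnet152_model(num_classes)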
model_spectrogram.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

history_spectrogram = model_spectrogram.fit(
    train_generator_spectrogram,
    steps_per_epoch=train_generator_spectrogram.samples // batch_size,
    validation_data=val_generator_spectrogram,
    validation_steps=val_generator_spectrogram.samples // batch_size,
    epochs=epochs,
    callbacks=callbacks
)

# Fine-tune last 30 layers
base_model_spectrogram = model_spectrogram.layers[0]
base_model_spectrogram.trainable = True
for layer in base_model_spectrogram.layers[:-30]:
    layer.trainable = False

model_spectrogram.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

history_fine_spectrogram = model_spectrogram.fit(
    train_generator_spectrogram,
    steps_per_epoch=train_generator_spectrogram.samples // batch_size,
    validation_data=val_generator_spectrogram,
    validation_steps=val_generator_spectrogram.samples // batch_size,
    epochs=fine_tune_epochs,
    callbacks=callbacks
)

model_spectrogram.save('/content/drive/MyDrive/Arwan IRIS/resnet152_spectrogram_final.h5')

### **Step 7: Develop Gradio User Interface for Input Capture**

Explanation:

Creates a Gradio interface to capture image and audio inputs.
Converts audio to a spectrogram and predicts emotions using both models.
Displays predictions with confidence scores.
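
Showing only the top class can hide close calls between emotions. One possible extension, sketched below, lists the few most likely classes with their confidence; `top_k_predictions` is a hypothetical helper and the probabilities are made-up example values.

```python
def top_k_predictions(probabilities, class_names, k=3):
    """Return the k most likely (class_name, confidence_percent) pairs."""
    ranked = sorted(zip(class_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return [(name, round(100 * float(p), 2)) for name, p in ranked[:k]]


print(top_k_predictions([0.1, 0.7, 0.2], ["defensive", "friendly", "stressed"], k=2))
# [('friendly', 70.0), ('stressed', 20.0)]
```

In the notebook, the first argument would be a single row of the model output, for example `pred_image[0]`.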

In [None]:
# Cell : Gradio Interface Setup
def predict_emotions(image, audio):
    # Preprocess image
    # Convert to RGB first so images with an alpha channel still yield 3 channels
    img_array = tf.keras.applications.resnet.preprocess_input(img_to_array(image.convert('RGB').resize((224, 224))))
    img_array = np.expand_dims(img_array, axis=0)
    pred_image = model_image.predict(img_array)
    pred_class_image = class_names[np.argmax(pred_image)]
    confidence_image = np.max(pred_image) * 100

    # Process audio to spectrogram with the same routine used to build the
    # training set (Step 5), so inference inputs match the training images
    spectrogram_path = '/content/temp_spectrogram.png'
    generate_spectrogram(audio, spectrogram_path)

    # Preprocess spectrogram
    spectrogram_img = load_img(spectrogram_path, target_size=(224, 224))
    spectrogram_array = tf.keras.applications.resnet.preprocess_input(img_to_array(spectrogram_img))
    spectrogram_array = np.expand_dims(spectrogram_array, axis=0)
    pred_spectrogram = model_spectrogram.predict(spectrogram_array)
    pred_class_spectrogram = class_names[np.argmax(pred_spectrogram)]
    confidence_spectrogram = np.max(pred_spectrogram) * 100

    return (f"Image: {pred_class_image} ({confidence_image:.2f}%)",
            f"Spectrogram: {pred_class_spectrogram} ({confidence_spectrogram:.2f}%)")

with gr.Blocks() as demo:
    gr.Markdown("## Dog Emotion Classification")
    with gr.Row():
        image_input = gr.Image(label="Capture Image", type="pil")
        audio_input = gr.Audio(label="Record Audio", type="filepath")
    with gr.Row():
        predict_button = gr.Button("Predict")
    with gr.Row():
        image_output = gr.Textbox(label="Image Prediction")
        spectrogram_output = gr.Textbox(label="Spectrogram Prediction")

    predict_button.click(predict_emotions, inputs=[image_input, audio_input],
                         outputs=[image_output, spectrogram_output])

demo.launch()

### **Step 8: Integrate the Gradio Interface with Raspberry Pi 5**

**Steps:**

Deploy the Gradio app on Raspberry Pi 5 by installing dependencies (TensorFlow, Librosa, Gradio).
Use the Pi’s camera and microphone for real-time input.
Run the script locally on the Pi with demo.launch(server_name="0.0.0.0") for network access.
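
If the Pi runs the TensorFlow Lite models without a full TensorFlow install (an assumption about the deployment, not something this notebook sets up), the ResNet preprocessing has to be reproduced by hand. The NumPy sketch below is intended to mirror what `tf.keras.applications.resnet.preprocess_input` does: reorder RGB to BGR and subtract the ImageNet channel means, with no scaling. `resnet_preprocess_numpy` is a hypothetical name.

```python
import numpy as np

# ImageNet channel means in BGR order, as used by Caffe-style ResNet preprocessing
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def resnet_preprocess_numpy(rgb_image):
    """Convert an RGB image (H, W, 3) with 0-255 values into a (1, H, W, 3) input batch."""
    x = np.asarray(rgb_image, dtype=np.float32)
    x = x[..., ::-1]            # RGB -> BGR
    x = x - IMAGENET_BGR_MEANS  # zero-center each channel
    return np.expand_dims(x, axis=0)
```

It is worth comparing the output against the Keras function on a few sample images in Colab before relying on it on the device.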

### **Step 9: Predict Emotions Using CNN Models via Gradio Interface**

**Steps:**

The Gradio interface in Step 7 already handles predictions.
Ensure models are loaded (model_image and model_spectrogram) before launching the interface.
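
In a fresh session the models are no longer in memory, so they need to be reloaded from the files saved in Steps 3 and 6. The sketch below checks that both files exist before loading them; the helper names are illustrative, and TensorFlow is imported lazily so the path check can run on its own.

```python
import os

# Paths written by model.save(...) in Steps 3 and 6
MODEL_PATHS = {
    "image": "/content/drive/MyDrive/Arwan IRIS/resnet152_image_final.h5",
    "spectrogram": "/content/drive/MyDrive/Arwan IRIS/resnet152_spectrogram_final.h5",
}


def missing_model_files(paths):
    """Return the names of models whose file does not exist."""
    return [name for name, path in paths.items() if not os.path.isfile(path)]


def load_models(paths=MODEL_PATHS):
    """Load every model in `paths`, failing early if a file is missing."""
    missing = missing_model_files(paths)
    if missing:
        raise FileNotFoundError(f"Model files not found for: {', '.join(missing)}")
    import tensorflow as tf  # lazy import: only needed once the files are confirmed
    return {name: tf.keras.models.load_model(path) for name, path in paths.items()}
```

Usage would look like `loaded = load_models()`, followed by `model_image = loaded["image"]` and `model_spectrogram = loaded["spectrogram"]`.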

### **Step 10: Evaluate Results with Confusion Matrix and Metrics + Convert to TensorFlow Lite**

Explanation:

Generates confusion matrices to evaluate model performance.
Converts both models to TensorFlow Lite with float16 quantization for edge deployment.
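
A confusion matrix also yields per-class precision, recall, and F1 scores. The sketch below derives them in plain Python, assuming rows are true classes and columns are predicted classes (the layout used by `sklearn.metrics.confusion_matrix`); the function names and the 2x2 example matrix are illustrative.

```python
def per_class_metrics(cm):
    """Compute precision, recall, and F1 for each class of a square confusion matrix."""
    n = len(cm)
    metrics = []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum: predicted as class i
        actual = sum(cm[i])                          # row sum: truly class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total if total else 0.0


example_cm = [[8, 2], [1, 9]]
print(accuracy(example_cm))  # 0.85
```

In the notebook, `cm_image.tolist()` or `cm_spectrogram.tolist()` could be passed in; scikit-learn's `classification_report` offers a ready-made alternative.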

In [None]:
# Cell : Evaluate and Visualize Results
# Image model evaluation
val_generator.reset()
preds_image = np.argmax(model_image.predict(val_generator), axis=1)
true_labels_image = val_generator.classes
cm_image = confusion_matrix(true_labels_image, preds_image)
disp_image = ConfusionMatrixDisplay(confusion_matrix=cm_image, display_labels=class_names)
disp_image.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix - Image Model')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Spectrogram model evaluation
val_generator_spectrogram.reset()
preds_spectrogram = np.argmax(model_spectrogram.predict(val_generator_spectrogram), axis=1)
true_labels_spectrogram = val_generator_spectrogram.classes
cm_spectrogram = confusion_matrix(true_labels_spectrogram, preds_spectrogram)
disp_spectrogram = ConfusionMatrixDisplay(confusion_matrix=cm_spectrogram, display_labels=class_names)
disp_spectrogram.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix - Spectrogram Model')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Convert to TensorFlow Lite
converter = tf.lite.TFLiteConverter.from_keras_model(model_image)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model_image = converter.convert()
with open('/content/drive/MyDrive/Arwan IRIS/resnet152_image.tflite', 'wb') as f:
    f.write(tflite_model_image)

converter = tf.lite.TFLiteConverter.from_keras_model(model_spectrogram)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model_spectrogram = converter.convert()
with open('/content/drive/MyDrive/Arwan IRIS/resnet152_spectrogram.tflite', 'wb') as f:
    f.write(tflite_model_spectrogram)

print("TFLite models saved successfully.")

---

### **Step 11: Evaluate and Visualize Model Results**

In [None]:
# =============================
# Step 11: Evaluate and Visualize Model Results
# =============================
# Define a helper function to evaluate a model using a confusion matrix.
def evaluate_model(model, generator, class_labels, title="Confusion Matrix"):
    """
    Evaluates the model on a given data generator and displays the confusion matrix.

    Parameters:
        model: Trained Keras model.
        generator: Data generator (validation/test).
        class_labels: List of class names.
        title (str): Title for the confusion matrix plot.
    """
    generator.reset()
    preds = np.argmax(model.predict(generator), axis=1)
    true_labels = generator.classes
    cm = confusion_matrix(true_labels, preds)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_labels)
    disp.plot(cmap=plt.cm.Blues)
    plt.title(title)
    plt.xticks(rotation=45)
    plt.show()

# Evaluate the image model (names match the models and generators defined in Steps 3 and 6)
print("Evaluating Image Model:")
evaluate_model(model_image, val_generator, class_names, title="Image Model Confusion Matrix")
# Evaluate the audio (spectrogram) model
print("Evaluating Audio Model:")
evaluate_model(model_spectrogram, val_generator_spectrogram, class_names, title="Audio Model Confusion Matrix")


### **Step 12: Develop Gradio Interface for Real-Time Input Capture and Prediction**

In [None]:
# =============================
# Step 12: Develop Gradio Interface for Real-Time Input Capture and Prediction
# =============================
# This function defines how the system processes an uploaded image and audio file.
# For the audio input, the file is saved, converted to a spectrogram, and then passed to the audio model.
def predict_emotions(input_image, input_audio):
    # Process image input:
    # Resize and preprocess the uploaded image (cv2.resize expects (width, height))
    img = cv2.resize(np.array(input_image), (img_width, img_height))
    img = tf.keras.applications.resnet.preprocess_input(img)
    img = np.expand_dims(img, axis=0)
    pred_img = model_image.predict(img)
    emotion_img = class_names[np.argmax(pred_img)]
    confidence_img = np.max(pred_img) * 100

    # Process audio input:
    # Gradio passes the audio as a file path (type="filepath"), so it can be
    # converted to a spectrogram image directly with the Step 5 helper
    spec_path = "temp_spec.png"
    generate_spectrogram(input_audio, spec_path)
    # Load and preprocess the spectrogram image
    spec_img = load_img(spec_path, target_size=(img_height, img_width))
    spec_array = img_to_array(spec_img)
    spec_array = tf.keras.applications.resnet.preprocess_input(spec_array)
    spec_array = np.expand_dims(spec_array, axis=0)
    pred_audio = model_spectrogram.predict(spec_array)
    emotion_audio = class_names[np.argmax(pred_audio)]
    confidence_audio = np.max(pred_audio) * 100

    # Return both predictions together as a single result string.
    result = (
        f"Image Emotion: {emotion_img} ({confidence_img:.2f}%)\n"
        f"Audio Emotion: {emotion_audio} ({confidence_audio:.2f}%)"
    )
    return result

# Create the Gradio interface with two inputs: an image and an audio file.
# gr.Image / gr.Audio replace the older gr.inputs.* components.
iface = gr.Interface(
    fn=predict_emotions,
    inputs=[
        gr.Image(label="Capture/Upload Image"),
        gr.Audio(type="filepath", label="Record/Upload Audio")
    ],
    outputs="text",
    title="Emotion Classification from Image and Audio",
    description="Uses ResNet152-based CNN models to predict dog emotions from images and vocalizations (via audio spectrograms)."
)
# Launch the Gradio interface for real-time testing.
iface.launch()
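
The function above reports the image and audio predictions side by side. If a single combined label is wanted, one possible approach (an assumption, not part of this notebook's pipeline) is a weighted average of the two probability vectors, which requires both models to use the same class order. `fuse_predictions` and the example probabilities are illustrative.

```python
def fuse_predictions(image_probs, audio_probs, class_names, image_weight=0.5):
    """Weighted average of two probability vectors; returns (class_name, confidence_percent)."""
    if not (len(image_probs) == len(audio_probs) == len(class_names)):
        raise ValueError("Probability vectors and class_names must have the same length")
    fused = [
        image_weight * p_img + (1.0 - image_weight) * p_aud
        for p_img, p_aud in zip(image_probs, audio_probs)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return class_names[best], 100.0 * fused[best]


label, confidence = fuse_predictions(
    [0.6, 0.3, 0.1], [0.2, 0.7, 0.1], ["defensive", "friendly", "stressed"]
)
print(label)  # friendly
```

The weight could be tuned on the validation set, for instance favoring whichever model is more accurate on its own.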

### **Step 13: Raspberry Pi Integration Snippet [OPTIONAL]**

In [None]:
# =============================
# Step 13: (Optional) Raspberry Pi Integration Snippet
# =============================
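# The code below shows how you might capture an image using a PiCamera on a Raspberry Pi.
# Run this cell only on Raspberry Pi hardware; the picamera package is not available in Colab.
# Note: on recent Raspberry Pi OS releases (including those for the Raspberry Pi 5), the legacy
# picamera library may be unavailable, and picamera2 is the usual replacement.

import time  # needed for time.sleep below
from picamera import PiCamera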
def capture_image_with_picamera():
    camera = PiCamera()
    camera.resolution = (1024, 768)
    camera.start_preview()
    time.sleep(2)  # Allow the camera to adjust to lighting conditions
    image_path = "captured_image.jpg"
    camera.capture(image_path)
    camera.stop_preview()
    camera.close()
    return image_path

---