# DAY 23 — Image Captioning (Multimodal Deep Learning)

## Overview

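This notebook builds the core architecture of an image captioning model. A frozen, ImageNet-pretrained ResNet50 encodes each image as a 2048-dimensional feature vector, and an LSTM-based decoder combines that vector with the caption words seen so far to predict the next word. The model is defined and compiled here but not trained.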

## Import Libraries

In [1]:
import tensorflow as tf
from tensorflow.keras import layers, models, applications

## CNN Encoder

In [2]:
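cnn = applications.ResNet50(
    weights="imagenet",   # pretrained ImageNet weights
    include_top=False,    # drop the 1000-class classification head
    pooling="avg"         # global average pooling -> one 2048-d vector per image
)
cnn.trainable = False     # freeze the encoder so its weights are not updated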


Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5

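To make the encoder's role concrete, the sketch below pushes one image-shaped array through the same ResNet50 configuration and checks the output shape. It is an illustration rather than part of the original notebook: it assumes `weights=None` and a random array so that it runs without a download or an image file, which means the feature values themselves are meaningless. With `weights="imagenet"` and a real image, the same calls would yield usable features.

```python
import numpy as np
from tensorflow.keras import applications

# Same configuration as the encoder above, but with random weights
# (assumption: weights=None avoids the download for this shape check).
encoder = applications.ResNet50(weights=None, include_top=False, pooling="avg")
encoder.trainable = False

# Stand-in for one 224x224 RGB image with pixel values in [0, 255].
fake_image = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype("float32")

# ResNet50 has its own input preprocessing, applied before the forward pass.
batch = applications.resnet50.preprocess_input(fake_image)

features = encoder.predict(batch, verbose=0)
print(features.shape)  # (1, 2048)
```

The 2048-dimensional output is what the decoder's `image_input` expects, so features can be computed once per image and reused.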

## Decoder Architecture

In [3]:
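# Inputs: a precomputed 2048-d ResNet50 feature vector and a 20-token caption prefix
image_input = layers.Input(shape=(2048,))
caption_input = layers.Input(shape=(20,))

# Project image features and word ids (vocabulary of 5000) into the same 256-d space
img_embed = layers.Dense(256, activation='relu')(image_input)
word_embed = layers.Embedding(5000, 256)(caption_input)

# Summarize the caption prefix with an LSTM, then merge it with the image by addition
lstm_out = layers.LSTM(256)(word_embed)
combined = layers.Add()([lstm_out, img_embed])

# Probability distribution over the vocabulary for the next word
outputs = layers.Dense(5000, activation='softmax')(combined)

model = models.Model([image_input, caption_input], outputs)
model.summary()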
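
The decoder takes a 20-token caption prefix and outputs a single distribution over the 5000-word vocabulary, so each caption has to be unrolled into (prefix, next word) pairs before training. The sketch below shows one way to do that in plain Python. The token ids, the padding id `0`, and the helper name `make_training_pairs` are illustrative assumptions, not taken from the notebook.

```python
MAX_LEN = 20   # matches caption_input = layers.Input(shape=(20,))
PAD_ID = 0     # assumed padding id


def make_training_pairs(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Unroll one tokenized caption into (padded_prefix, next_word) pairs."""
    pairs = []
    for i in range(1, len(token_ids)):
        prefix = token_ids[:i][-max_len:]                      # keep at most max_len tokens
        padded = [pad_id] * (max_len - len(prefix)) + prefix   # left-pad to max_len
        pairs.append((padded, token_ids[i]))
    return pairs


# Hypothetical caption: <start>=1, "a"=4, "dog"=17, "runs"=52, <end>=2
caption = [1, 4, 17, 52, 2]
pairs = make_training_pairs(caption)
for prefix, target in pairs:
    print(prefix[-4:], "->", target)
# [0, 0, 0, 1] -> 4
# [0, 0, 1, 4] -> 17
# [0, 1, 4, 17] -> 52
# [1, 4, 17, 52] -> 2
```

Left-padding keeps the real tokens at the end of the sequence, which matters here because `layers.LSTM(256)` returns only its final state and the embedding layer does not mask padding.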

## Compile

In [4]:
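# categorical_crossentropy expects one-hot targets of shape (batch, 5000);
# with integer word ids as targets, use 'sparse_categorical_crossentropy' instead
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy'
)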
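
`categorical_crossentropy` compares the softmax output with a one-hot target, so for a single example it reduces to the negative log of the probability assigned to the correct word. Here is a small worked example in plain Python, using a made-up 4-word vocabulary rather than the notebook's 5000:

```python
import math


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def categorical_crossentropy(target, probs):
    """-sum(t_i * log(p_i)); only the true class contributes for a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


probs = [0.1, 0.7, 0.15, 0.05]   # hypothetical softmax output
true_word = 1                    # index of the correct next word
loss = categorical_crossentropy(one_hot(true_word, len(probs)), probs)
print(round(loss, 4))  # 0.3567, i.e. -ln(0.7)
```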


## Observations
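- The frozen ResNet50 encoder turns each image into a 2048-dimensional semantic feature vector
- The LSTM reads the caption so far, and the model predicts one next word at a time
- Adding the projected image features to the LSTM output places vision and language in a shared 256-dimensional representation, which the model learns once trained
- This CNN encoder plus sequence decoder pattern is an early template for later multimodal AI systems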
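
The word-by-word generation noted above can be sketched as a greedy decoding loop. This is a minimal illustration under stated assumptions: `predict_next` is a hypothetical stand-in for a trained model's prediction call (the notebook does not train one), and the start, end, and padding ids are made up.

```python
def greedy_decode(predict_next, image_features, start_id=1, end_id=2,
                  pad_id=0, max_len=20):
    """Generate token ids by repeatedly taking the most probable next word."""
    tokens = [start_id]
    while len(tokens) < max_len:
        prefix = tokens[-max_len:]
        padded = [pad_id] * (max_len - len(prefix)) + prefix     # left-pad to max_len
        probs = predict_next(image_features, padded)             # vocabulary probabilities
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens


# Toy stand-in for a trained model (hypothetical): it continues with the next
# integer id, starting from 3, and emits the end id once the last token is 5.
def toy_predict_next(image_features, padded_prefix):
    last = padded_prefix[-1]
    nxt = 2 if last >= 5 else max(last + 1, 3)
    return [1.0 if i == nxt else 0.0 for i in range(8)]


print(greedy_decode(toy_predict_next, image_features=None))  # [1, 3, 4, 5, 2]
```

With a trained version of the model above, `predict_next` could wrap a call such as `model.predict([features, padded_batch])`, where `features` has shape `(1, 2048)` and `padded_batch` has shape `(1, 20)`. Greedy search is the simplest choice; beam search is a common alternative.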