# Poster Image Feature Extraction

## Environment Setup

In [None]:
%pip install pandas numpy requests Pillow python-dotenv scikit-image opencv-python

In [None]:
import pandas as pd
import cv2
import numpy as np
import requests
from PIL import Image
from io import BytesIO
import os
import re
from typing import Optional
import time
from dotenv import load_dotenv
from skimage.feature import local_binary_pattern, hog
from skimage import color

## Load Processed Dataset

In [None]:
train = pd.read_csv('data/train_complete.csv')
test = pd.read_csv('data/test_complete.csv')
# pd.concat on two Series returns a Series, so it is converted directly
movies = pd.concat([train['movieId'], test['movieId']], axis=0, ignore_index=True)
ids = movies.astype('Int64').to_numpy()
print(ids[:15])

## Feature Extraction Methods
To represent the visual characteristics of each movie poster, we employ complementary feature extraction techniques that capture different aspects of the image: color, texture, and shape.
These features will later be combined and used for clustering and genre correlation analysis.

1. **Color Histograms (HSV):**
We extract color histograms in the HSV (Hue–Saturation–Value) color space, one 32-bin histogram per channel.
HSV is preferred over RGB because it separates chromatic content (hue and saturation) from brightness (value), which is closer to how people describe color and makes the hue and saturation histograms less sensitive to lighting changes.
The histograms capture the overall color distribution, which often reflects the poster’s emotional tone or mood: for instance, dark and desaturated colors in horror movies or bright, vivid tones in comedies.

2. **Texture Descriptors (Local Binary Patterns – LBP):**
To characterize the texture and surface patterns, we use Local Binary Patterns (LBP).
LBP encodes local texture information by comparing each pixel with its neighbors and creating a binary pattern, effectively capturing micro-textures and fine structures.
This descriptor is simple, compact, and invariant to monotonic changes in illumination, making it suitable for differentiating posters based on visual style (e.g., realistic vs. illustrated).

3. **Shape and Edge Descriptors (Histogram of Oriented Gradients – HOG):**
To capture the spatial structure and composition of each poster, we extract Histogram of Oriented Gradients (HOG) features.
HOG summarizes the distribution of edge directions and gradient magnitudes in local regions of the image, describing object outlines, silhouettes, and text layouts.
It is particularly effective for representing the overall layout and dominant geometric patterns in the poster design.

4. **Moment Descriptors (Hu Moments):**
To complement local and textural information, we compute Hu Image Moments, a set of seven values derived from normalized central moments that are invariant to translation, rotation, and scale.
These descriptors capture global shape and symmetry properties of the poster, reflecting its balance, structure, and spatial organization.

5. **Combined Representation:**
Each poster is represented by the concatenation of its normalized HSV, LBP, HOG, and Hu Moment feature vectors, forming a comprehensive and discriminative descriptor that encodes color, texture, edge, and geometric information.
The combined feature vectors will later be reduced in dimensionality before clustering algorithms are applied.
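
To get a feel for how large the combined descriptor is, the sketch below estimates the length of each feature block with plain arithmetic. It assumes the settings used by the extraction functions in this notebook (32 bins per HSV channel, uniform LBP with 24 neighbors, HOG on a 128×256 image with 8×8 cells, 2×2 blocks, and 9 orientations) and a HOG block stride of one cell. The helper name `hog_length` is hypothetical and only serves this illustration.

```python
# Back-of-the-envelope sizes of each feature block, assuming the settings
# used by this notebook's extraction functions.

def hog_length(height, width, cell=8, block=2, orientations=9):
    """Length of a HOG vector, assuming blocks slide one cell at a time."""
    cells_y, cells_x = height // cell, width // cell
    blocks_y, blocks_x = cells_y - block + 1, cells_x - block + 1
    return blocks_y * blocks_x * block * block * orientations

sizes = {
    "hsv": 3 * 32,                # three 32-bin channel histograms
    "lbp": 24 + 2,                # uniform LBP with P=24 yields P + 2 codes
    "hog": hog_length(256, 128),  # poster resized to 128 wide by 256 high
    "hu": 7,                      # seven Hu moments
}
total = sum(sizes.values())

for name, n in sizes.items():
    print(f"{name}: {n} values ({n / total:.1%} of the vector)")
print("total:", total)
```

Under these assumptions the HOG block accounts for almost all of the vector, which is one reason a dimensionality-reduction step before clustering is useful.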

In [None]:
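def extract_hsv(image: np.ndarray) -> np.ndarray:
    # cv2.imread returns BGR, so convert from BGR (not RGB) to HSV
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)

    # One 32-bin histogram per channel; 8-bit hue spans [0, 180) in OpenCV
    hist_h = cv2.calcHist([hsv], [0], None, [32], [0, 180])
    hist_s = cv2.calcHist([hsv], [1], None, [32], [0, 256])
    hist_v = cv2.calcHist([hsv], [2], None, [32], [0, 256])

    hist = np.concatenate((hist_h, hist_s, hist_v)).flatten()
    # Normalize so the bins sum to 1 (epsilon guards against division by zero)
    hist = hist / (hist.sum() + 1e-7)

    return hist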

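def extract_lbp(image: np.ndarray) -> np.ndarray:
    # local_binary_pattern expects a 2D grayscale image; cv2.imread returns BGR
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Uniform LBP with P=24 neighbors at radius R=3 yields P + 2 = 26 codes
    lbp = local_binary_pattern(gray, 24, 3, method='uniform')
    hist, _ = np.histogram(lbp.ravel(), bins=26, range=(0, 26))
    hist = hist / (hist.sum() + 1e-7)
    return hist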

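def extract_hog(image: np.ndarray) -> np.ndarray:
    # cv2.imread returns BGR, so convert from BGR (not RGB) to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # cv2.resize takes (width, height): every poster becomes 128 wide, 256 high
    gray = cv2.resize(gray, (128, 256))

    features = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2), visualize=False)

    return features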

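def extract_hu(image: np.ndarray) -> np.ndarray:
    # cv2.moments needs a single-channel image; cv2.imread returns BGR
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    moments = cv2.moments(gray)
    hu_moments = cv2.HuMoments(moments).flatten()
    # Signed log transform compresses the wide dynamic range of the raw moments
    hu_moments = -np.sign(hu_moments) * np.log10(np.abs(hu_moments) + 1e-10)
    return hu_moments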

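def extract_features(image_path: str) -> np.ndarray:
    # cv2.imread returns a BGR array, or None if the file cannot be read
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    # Concatenate color, texture, edge, and moment descriptors into one vector
    features = np.hstack([
        extract_hsv(image),
        extract_lbp(image),
        extract_hog(image),
        extract_hu(image)
    ])
    return features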
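
The four blocks stacked by `extract_features` live on different scales: the histograms sum to 1, while the log-scaled Hu moments can be much larger in magnitude. One possible way to put the blocks on a comparable footing before dimensionality reduction is to scale each block to unit L2 norm. The sketch below is an assumption, not part of the pipeline above: the helper `l2_normalize_blocks` and the constant `BLOCK_SIZES` are hypothetical names, the block lengths assume the extractor settings used in this notebook, and the demo input is random data standing in for a real poster vector.

```python
import numpy as np

# Assumed block lengths for HSV, LBP, HOG, and Hu with this notebook's settings.
BLOCK_SIZES = (96, 26, 16740, 7)

def l2_normalize_blocks(features, sizes=BLOCK_SIZES, eps=1e-7):
    """Scale each feature block to unit L2 norm (hypothetical helper)."""
    out = np.asarray(features, dtype=np.float64).copy()
    if out.shape[-1] != sum(sizes):
        raise ValueError("feature length does not match the block sizes")
    start = 0
    for size in sizes:
        block = out[..., start:start + size]
        norm = np.linalg.norm(block, axis=-1, keepdims=True)
        # eps avoids division by zero for an all-zero block
        out[..., start:start + size] = block / (norm + eps)
        start += size
    return out

# Synthetic stand-in for one poster's feature vector (random, not real data).
rng = np.random.default_rng(0)
demo = rng.random(sum(BLOCK_SIZES))
scaled = l2_normalize_blocks(demo)
print(scaled.shape)
```

Because the function indexes along the last axis, it also accepts a 2D array with one row per poster. Whether block-wise scaling helps depends on the downstream clustering method, so it is worth comparing against the unscaled vectors.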