<a href="https://colab.research.google.com/github/Gibonn24/MexicanSignLanguage/blob/main/Proyecto_Final_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sign Language to Text Translator

**Final Project – Machine Learning**



---

## 1. Team Members
| Name | Contribution (%) |
|------|------------------|
| Giordano Fuentes | 100% |

> Adjust the table as needed.

## 2. Introduction


> Communication between deaf and hearing people still runs into a barrier. This project aims to automatically translate Sign Language videos into Spanish text, using deep learning and computer vision, to support inclusion.

In [1]:
# See the notebooks "./notebooks/EDA_dynamics.ipynb" and "./notebooks/EDA_letters.ipynb"

## 4. Methodology
Describe the overall architecture:
1. **Feature extraction** with a pretrained model (e.g. *I3D* / *S3D*) using [`video_features`](https://github.com/v-iashin/video_features).
2. **Sequence-to-sequence translation model** (GRU/Transformer) that maps video embeddings → text (glosses or phrases).
3. **Loss**: CTC or CrossEntropy, depending on the alignment.

Include an optional diagram.
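
Step 3 leaves the choice between CTC and cross-entropy open. As a rough illustration of what a CTC-style output implies at decoding time, the sketch below picks the best class per frame, merges consecutive repeats, and drops a blank symbol. The blank index, the three-symbol vocabulary, and the per-frame scores are all hypothetical; the cells in this notebook train with cross-entropy on one label per clip, so this is not the notebook's method.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical)

def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Best class per frame, then merge repeats, then drop blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    decoded: List[int] = []
    prev = None
    for idx in best:
        # keep a symbol only when it differs from the previous frame
        # and is not the blank
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded

# Hypothetical per-frame scores over [blank, 'A', 'L'] for a clip spelling "ALL"
vocab = ["<blank>", "A", "L"]
scores = [
    [0.1, 0.8, 0.1],  # A
    [0.1, 0.8, 0.1],  # A again (merged with the previous frame)
    [0.7, 0.2, 0.1],  # blank
    [0.1, 0.1, 0.8],  # L
    [0.6, 0.1, 0.3],  # blank keeps the two L's apart
    [0.2, 0.1, 0.7],  # L
]
decoded_text = "".join(vocab[i] for i in ctc_greedy_decode(scores))
print(decoded_text)
```

The blank is what lets a doubled letter survive the merge step; a plain "drop consecutive duplicates" rule cannot tell `LL` from a long `L`.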

In [2]:
from models.r21d.extract_r21d import ExtractR21D
from utils.utils import build_cfg_path
from omegaconf import OmegaConf
import pandas as pd
import os
import glob
import numpy as np
import torch
from pyprojroot import here
from pathlib import Path
import random
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu"  # avoids an error on CPU-only machines

'NVIDIA GeForce GTX 1660 Ti'

In [3]:
import av, torchvision, sys, importlib.metadata
print("PyAV:", av.__version__)              # expected 14.4.0; the run below reports 12.2.0
print("TorchVision:", torchvision.__version__)


PyAV: 12.2.0
TorchVision: 0.20.1+cu121


In [4]:
from torchvision.io import read_video
rgb, _, info = read_video("C:/Users/User/Documents/ML/data/letters/dynamics/J/S1-J-perfil-1.mp4")
print(rgb.shape, info)



torch.Size([57, 900, 900, 3]) {'video_fps': 30.0}


In [5]:
from omegaconf import OmegaConf
from utils.utils import build_cfg_path
from models.r21d.extract_r21d import ExtractR21D

# Load the base config
args = OmegaConf.load(build_cfg_path("r21d"))
args.feature_type     = "r21d"
args.model_name       = "r2plus1d_34_8_ig65m_ft_kinetics"
args.stack_size       = 8
args.step_size        = 8
args.extraction_fps   = 15          # resamples every video to the same frame rate
args.tmp_path         = "tmp"
args.output_path      = "feats"
args.on_extraction    = "return"    # or 'save_numpy'
args.device           = "cuda:0"    # or 'cpu'
args.show_pred        = False

extractor = ExtractR21D(args)


Using cache found in C:\Users\User/.cache\torch\hub\moabitcoin_ig65m-pytorch_master
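
The extractor returns one 512-dimensional embedding per stack of frames, so `stack_size`, `step_size` and `extraction_fps` together decide how long the feature sequence is. A back-of-the-envelope sketch, assuming the video is first resampled to `extraction_fps` and then cut into sliding windows with trailing frames dropped (the real `video_features` resampling may round differently; `count_stacks` is a hypothetical helper):

```python
def count_stacks(n_frames: int, src_fps: float, extraction_fps: float,
                 stack_size: int, step_size: int) -> int:
    """Estimate how many stacks a clip yields (assumed resampling + sliding window)."""
    # frames left after resampling to the extraction rate (assumption: truncation)
    resampled = int(n_frames * extraction_fps / src_fps)
    if resampled < stack_size:
        return 0
    return (resampled - stack_size) // step_size + 1

# The sample clip read earlier has 57 frames at 30 fps
n_stacks = count_stacks(57, 30.0, 15, stack_size=8, step_size=8)
print(n_stacks)
```

Under these assumptions the 57-frame sample clip gives only a handful of stacks, which is worth keeping in mind when choosing between temporal pooling and a recurrent model.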


In [6]:
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import pandas as pd

class CSVDataset(torch.utils.data.Dataset):
    def __init__(self, csv_path, transform=None, class_to_idx=None):
        self.data = pd.read_csv(csv_path)
        self.transform = transform

        # If no external mapping is given, build it from the labels in the CSV
        if class_to_idx is None:
            classes = sorted(self.data["label"].unique())
            class_to_idx = {cls: idx for idx, cls in enumerate(classes)}

        self.class_to_idx = class_to_idx

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        img = Image.open(row["image_path"]).convert("RGB")
        if self.transform:
            img = self.transform(img)
        label = self.class_to_idx[row["label"]]
        return img, label
    
all_labels = pd.concat([
    pd.read_csv("letter_labels.csv")['label'],
    pd.read_csv("dynamics_videos.csv")['label']
]).unique()

class_to_idx = {cls: idx for idx, cls in enumerate(sorted(all_labels))}

# Same transforms as before
tfm = transforms.Compose([
    transforms.Resize(128),
    transforms.CenterCrop(112),
    transforms.ToTensor(),
    transforms.Normalize([0.43216,0.39466,0.37645],
                         [0.22803,0.22145,0.21698]),
])

ds_static  = CSVDataset("letter_labels.csv",  tfm,   class_to_idx)
# 1. Reproducible 80/10/10 split
from torch.utils.data import random_split, DataLoader
N = len(ds_static)
train_len = int(0.8*N); val_len = int(0.1*N); test_len = N - train_len - val_len

train_s, val_s, test_s = random_split(
    ds_static, [train_len, val_len, test_len],
    generator=torch.Generator().manual_seed(42)
)

# 2. DataLoaders
dl_train = DataLoader(train_s, batch_size=64, shuffle=True, num_workers=0)
dl_val   = DataLoader(val_s,   batch_size=64, shuffle=False, num_workers=0)
dl_test  = DataLoader(test_s,  batch_size=64, shuffle=False, num_workers=0)
# Adapted ResNet model
from torchvision import models
import torch.nn as nn
# Load the pretrained model and replace its head with one output per class
model_img = models.resnet18(weights="IMAGENET1K_V1")
model_img.fc = nn.Linear(model_img.fc.in_features, len(class_to_idx))  # 27 in this run
model_img = model_img.to("cuda" if torch.cuda.is_available() else "cpu")

In [11]:
# 1) outside the class, only once
n_classes = len(class_to_idx)          # normally 27
print("Número de clases:", n_classes)  # <- sanity check

# 2) video dataset definition
class VideoCSVDataset(torch.utils.data.Dataset):
    """
    Dataset that reads video paths and labels from a CSV and
    extracts features (embeddings) with a 3D-CNN extractor.

    The CSV must contain the columns:
        video_path,label
    """

    def __init__(self, csv_path: str, extractor, class_to_idx: dict):
        self.data = pd.read_csv(csv_path)
        self.extractor = extractor
        self.class_to_idx = class_to_idx    # single letter → index mapping

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        video_path = row["video_path"]

        # embeddings (stacks, 512)
        feats = self.extractor.extract(video_path)["r21d"]

        # temporal mean pooling → vector (512,)
        feats = torch.tensor(feats, dtype=torch.float32).mean(0)

        label = self.class_to_idx[row["label"]]
        return feats, label
    
# 0. Reuse the R21D extractor instantiated above
ds_dynamic = VideoCSVDataset("dynamics_videos.csv", extractor, class_to_idx)

# 1. Split
M = len(ds_dynamic)
tr_len = int(0.8*M); va_len = int(0.1*M); te_len = M - tr_len - va_len

video_train, video_val, video_test = random_split(
    ds_dynamic, [tr_len, va_len, te_len],
    generator=torch.Generator().manual_seed(42)
)

# 2. DataLoaders
dl_vtrain = DataLoader(video_train, batch_size=16, shuffle=True, num_workers=0)
dl_vval   = DataLoader(video_val,   batch_size=16, shuffle=False, num_workers=0)
dl_vtest  = DataLoader(video_test,  batch_size=16, shuffle=False, num_workers=0)



Número de clases: 27
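
`VideoCSVDataset.__getitem__` calls `extractor.extract` on every access, so the 3D-CNN runs again for every video in every epoch. One possible mitigation is to memoise the features per path. The sketch below shows the idea with a stand-in extractor; `CachedExtractor` and `fake_extract` are hypothetical names, and an in-memory dict is an assumption (saving to disk, e.g. via the extractor's `save_numpy` mode, would be another option).

```python
from typing import Callable, Dict, List

class CachedExtractor:
    """Wraps a slow extract(path) callable and memoises its result per path."""

    def __init__(self, extract_fn: Callable[[str], List[float]]):
        self._extract_fn = extract_fn
        self._cache: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the slow extractor actually ran

    def extract(self, video_path: str) -> List[float]:
        if video_path not in self._cache:
            self.misses += 1
            self._cache[video_path] = self._extract_fn(video_path)
        return self._cache[video_path]

def fake_extract(path: str) -> List[float]:
    # stand-in for extractor.extract(path)["r21d"] followed by pooling
    return [float(len(path))] * 4

cached = CachedExtractor(fake_extract)
paths = ["a.mp4", "b.mp4"]
for _epoch in range(3):
    feats = [cached.extract(p) for p in paths]
print(cached.misses)  # the slow call ran once per video, not once per epoch
```

Caching only makes sense because the extractor is frozen; if it were fine-tuned, the stored features would go stale.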


## 5. Implementation
- Framework: **PyTorch**
- Reproducibility seed: `42`
- Link to notebook/Colab: <colab_link>

Describe any optimization or special technique (e.g., *gradient clipping*, *mixed precision*, *early stopping*).

## Training on static images

In [8]:
from tqdm import tqdm
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_img.parameters(), lr=1e-4)

for epoch in range(10):
    model_img.train()
    running_loss = 0
    pbar = tqdm(dl_train, desc=f"Epoch {epoch+1:02d}", unit="batch")

    for imgs, labels in pbar:
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        out = model_img(imgs)
        loss = criterion(out, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * imgs.size(0)

        pbar.set_postfix(loss=loss.item())

    # Validation
    model_img.eval()
    val_loss, correct, total = 0, 0, 0
    with torch.no_grad():
        for imgs, labels in dl_val:
            imgs, labels = imgs.to(device), labels.to(device)
            out = model_img(imgs)
            val_loss += criterion(out, labels).item() * imgs.size(0)
            pred = out.argmax(1)
            correct += (pred == labels).sum().item()
            total += labels.size(0)

    print(f"📊 Epoch {epoch+1:02d} | train_loss={running_loss/len(train_s):.4f} | "
          f"val_loss={val_loss/len(val_s):.4f} | val_acc={correct/total:.3f}")


Epoch 01: 100%|██████████| 1041/1041 [14:26<00:00,  1.20batch/s, loss=0.00117] 


📊 Epoch 01 | train_loss=0.0694 | val_loss=0.0010 | val_acc=1.000


Epoch 02: 100%|██████████| 1041/1041 [13:22<00:00,  1.30batch/s, loss=0.00422] 


📊 Epoch 02 | train_loss=0.0083 | val_loss=0.0084 | val_acc=0.998


Epoch 03: 100%|██████████| 1041/1041 [16:58<00:00,  1.02batch/s, loss=0.0288]  


📊 Epoch 03 | train_loss=0.0051 | val_loss=0.0001 | val_acc=1.000


Epoch 04: 100%|██████████| 1041/1041 [08:02<00:00,  2.16batch/s, loss=6.6e-5]  


📊 Epoch 04 | train_loss=0.0036 | val_loss=0.0000 | val_acc=1.000


Epoch 05: 100%|██████████| 1041/1041 [07:21<00:00,  2.36batch/s, loss=0.00362] 


📊 Epoch 05 | train_loss=0.0043 | val_loss=0.0005 | val_acc=1.000


Epoch 06: 100%|██████████| 1041/1041 [04:45<00:00,  3.64batch/s, loss=0.000199]


📊 Epoch 06 | train_loss=0.0016 | val_loss=0.0001 | val_acc=1.000


Epoch 07: 100%|██████████| 1041/1041 [04:38<00:00,  3.73batch/s, loss=0.00257] 


📊 Epoch 07 | train_loss=0.0013 | val_loss=0.0177 | val_acc=0.995


Epoch 08: 100%|██████████| 1041/1041 [06:36<00:00,  2.62batch/s, loss=7.38e-5] 


📊 Epoch 08 | train_loss=0.0051 | val_loss=0.0000 | val_acc=1.000


Epoch 09: 100%|██████████| 1041/1041 [04:37<00:00,  3.76batch/s, loss=0.000561]


📊 Epoch 09 | train_loss=0.0024 | val_loss=0.0040 | val_acc=0.998


Epoch 10: 100%|██████████| 1041/1041 [12:16<00:00,  1.41batch/s, loss=0.00187] 


📊 Epoch 10 | train_loss=0.0023 | val_loss=0.0432 | val_acc=0.988


In [34]:
from tqdm import tqdm
import torch.nn as nn
import torch

class DynGRU(nn.Module):
    """
    Input   : feats  (B, stacks, 512)
    Output  : logits (B, n_classes)
    Process : bidirectional GRU + dropout + dense layer
    """
    def __init__(self, in_dim=512, hidden=256, n_layers=1,
                 n_classes=n_classes, dropout=0.3):
        super().__init__()
        self.gru = nn.GRU(
            input_size=in_dim,
            hidden_size=hidden,
            num_layers=n_layers,
            batch_first=True,
            bidirectional=True,          # ← also captures context in the reverse direction
            dropout=dropout if n_layers > 1 else 0.0
        )
        # Bidirectional doubles the hidden size → 2*hidden
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(2*hidden, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Dropout(dropout),
            nn.Linear(256, n_classes)
        )

    def forward(self, feats):            # (B, stacks, 512)  or  (B, 512)
        if feats.dim() == 2:             # already pooled
            feats = feats.unsqueeze(1)   # (B,1,512) so the GRU still gets a sequence
        _, h = self.gru(feats)           # h shape: (2*n_layers, B, hidden)
        # Concatenate the final states of both directions
        h_cat = torch.cat([h[-2], h[-1]], dim=1)   # (B, 2*hidden)
        return self.classifier(h_cat)


model_vid = DynGRU().to(device)

opt_v       = torch.optim.Adam(model_vid.parameters(), lr=3e-4)
scheduler   = torch.optim.lr_scheduler.ReduceLROnPlateau(
                 opt_v, mode="min", factor=0.5, patience=2, verbose=True)
crit        = nn.CrossEntropyLoss()          # add weight=class_weights if needed

for epoch in range(25):                      # train for more epochs; the GRU converges quickly
    model_vid.train()
    running = 0
    pbar = tqdm(dl_vtrain, desc=f"VidGRU {epoch+1:02d}", unit="batch")

    for feats, labels in pbar:
        feats, labels = feats.to(device), labels.to(device)
        opt_v.zero_grad()
        out  = model_vid(feats)
        loss = crit(out, labels)
        
        loss.backward()
        opt_v.step()
        running += loss.item() * feats.size(0)
        pbar.set_postfix(loss=loss.item())

    # validation
    model_vid.eval()
    v_loss = corr = tot = 0
    with torch.no_grad():
        for feats, labels in dl_vval:
            feats, labels = feats.to(device), labels.to(device)
            out = model_vid(feats)
            v_loss += crit(out, labels).item() * feats.size(0)
            pred = out.argmax(1)
            corr += (pred == labels).sum().item()
            tot  += labels.size(0)

    val_loss = v_loss / len(video_val)
    val_acc  = corr / tot
    scheduler.step(val_loss)                 # lowers the LR when validation loss plateaus

    print(f"📈 epoch {epoch+1:02d}  "
          f"train_loss={running/len(video_train):.4f}  "
          f"val_loss={val_loss:.4f}  val_acc={val_acc:.3f}")


VidGRU 01: 100%|██████████| 31/31 [07:51<00:00, 15.20s/batch, loss=3.4] 


📈 epoch 01  train_loss=3.1841  val_loss=2.9528  val_acc=0.403


VidGRU 02: 100%|██████████| 31/31 [07:17<00:00, 14.13s/batch, loss=2.25]


📈 epoch 02  train_loss=2.5653  val_loss=2.2694  val_acc=0.500


VidGRU 03: 100%|██████████| 31/31 [07:19<00:00, 14.19s/batch, loss=1.82]


📈 epoch 03  train_loss=2.0521  val_loss=1.7124  val_acc=0.565


VidGRU 04: 100%|██████████| 31/31 [07:40<00:00, 14.84s/batch, loss=1.29] 


📈 epoch 04  train_loss=1.6545  val_loss=1.2151  val_acc=0.694


VidGRU 05: 100%|██████████| 31/31 [07:19<00:00, 14.18s/batch, loss=1.03] 


📈 epoch 05  train_loss=1.3097  val_loss=0.9462  val_acc=0.758


VidGRU 06: 100%|██████████| 31/31 [07:38<00:00, 14.80s/batch, loss=0.665]


📈 epoch 06  train_loss=1.2046  val_loss=1.0173  val_acc=0.710


VidGRU 07: 100%|██████████| 31/31 [07:01<00:00, 13.60s/batch, loss=1.03] 


📈 epoch 07  train_loss=1.0046  val_loss=0.6567  val_acc=0.871


VidGRU 08: 100%|██████████| 31/31 [06:54<00:00, 13.37s/batch, loss=1.31] 


📈 epoch 08  train_loss=0.9285  val_loss=0.6288  val_acc=0.855


VidGRU 09: 100%|██████████| 31/31 [07:42<00:00, 14.92s/batch, loss=1.04] 


📈 epoch 09  train_loss=0.8393  val_loss=0.7406  val_acc=0.726


VidGRU 10: 100%|██████████| 31/31 [07:27<00:00, 14.43s/batch, loss=0.532]


📈 epoch 10  train_loss=0.7866  val_loss=0.5067  val_acc=0.887


VidGRU 11: 100%|██████████| 31/31 [07:10<00:00, 13.88s/batch, loss=0.885]


📈 epoch 11  train_loss=0.6713  val_loss=0.4495  val_acc=0.855


VidGRU 12: 100%|██████████| 31/31 [07:18<00:00, 14.15s/batch, loss=0.666]


📈 epoch 12  train_loss=0.6464  val_loss=0.5266  val_acc=0.823


VidGRU 13: 100%|██████████| 31/31 [07:34<00:00, 14.66s/batch, loss=0.729]


📈 epoch 13  train_loss=0.5706  val_loss=0.4541  val_acc=0.839


VidGRU 14: 100%|██████████| 31/31 [07:18<00:00, 14.16s/batch, loss=0.513]


📈 epoch 14  train_loss=0.5172  val_loss=0.5731  val_acc=0.774


VidGRU 15: 100%|██████████| 31/31 [07:22<00:00, 14.28s/batch, loss=0.446]


📈 epoch 15  train_loss=0.4628  val_loss=0.4168  val_acc=0.839


VidGRU 16: 100%|██████████| 31/31 [07:10<00:00, 13.87s/batch, loss=0.57] 


📈 epoch 16  train_loss=0.4396  val_loss=0.3984  val_acc=0.855


VidGRU 17: 100%|██████████| 31/31 [07:09<00:00, 13.87s/batch, loss=0.466]


📈 epoch 17  train_loss=0.3925  val_loss=0.4663  val_acc=0.806


VidGRU 18: 100%|██████████| 31/31 [19:55<00:00, 38.56s/batch, loss=0.294] 


📈 epoch 18  train_loss=0.3688  val_loss=0.3159  val_acc=0.919


VidGRU 19: 100%|██████████| 31/31 [07:27<00:00, 14.44s/batch, loss=0.283] 


📈 epoch 19  train_loss=0.3367  val_loss=0.3509  val_acc=0.855


VidGRU 20: 100%|██████████| 31/31 [07:11<00:00, 13.92s/batch, loss=0.266]


📈 epoch 20  train_loss=0.3321  val_loss=0.3274  val_acc=0.871


VidGRU 21: 100%|██████████| 31/31 [07:12<00:00, 13.95s/batch, loss=0.229]


📈 epoch 21  train_loss=0.3970  val_loss=0.4186  val_acc=0.839


VidGRU 22: 100%|██████████| 31/31 [07:04<00:00, 13.71s/batch, loss=0.172]


📈 epoch 22  train_loss=0.3039  val_loss=0.3457  val_acc=0.855


VidGRU 23: 100%|██████████| 31/31 [34:58<00:00, 67.71s/batch, loss=0.35]   


📈 epoch 23  train_loss=0.3064  val_loss=0.3620  val_acc=0.855


VidGRU 24: 100%|██████████| 31/31 [07:06<00:00, 13.76s/batch, loss=0.178]


📈 epoch 24  train_loss=0.2968  val_loss=0.3075  val_acc=0.871


VidGRU 25: 100%|██████████| 31/31 [07:03<00:00, 13.67s/batch, loss=0.166] 


📈 epoch 25  train_loss=0.2745  val_loss=0.3166  val_acc=0.855
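
The validation loss above keeps fluctuating after roughly epoch 18, and the loop always ends with the final-epoch weights. A minimal sketch of patience-based early stopping, replayed over the logged `val_loss` values, shows which epoch such a rule would have kept. `best_epoch_with_patience` is a hypothetical helper and the patience values are arbitrary choices, not settings used in this notebook.

```python
from typing import List, Tuple

def best_epoch_with_patience(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once `patience` epochs pass without a new best val_loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)

# val_loss per epoch, copied from the VidGRU training output
val_losses = [
    2.9528, 2.2694, 1.7124, 1.2151, 0.9462, 1.0173, 0.6567, 0.6288, 0.7406,
    0.5067, 0.4495, 0.5266, 0.4541, 0.5731, 0.4168, 0.3984, 0.4663, 0.3159,
    0.3509, 0.3274, 0.4186, 0.3457, 0.3620, 0.3075, 0.3166,
]

print(best_epoch_with_patience(val_losses, patience=5))
print(best_epoch_with_patience(val_losses, patience=10))
```

In a real loop the same bookkeeping would sit next to a `torch.save` of the best `state_dict`, so that the checkpoint written afterwards holds the best weights rather than the last ones.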


In [None]:
import torch, json, pathlib

ckpt_dir = pathlib.Path("checkpoints")
ckpt_dir.mkdir(exist_ok=True)

# 1️⃣  image model (adapted ResNet-18)
torch.save(model_img.state_dict(), ckpt_dir / "resnet18_letters.pt")

# 2️⃣  video model (the GRU trained above)
torch.save(model_vid.state_dict(), ckpt_dir / "gru_letters.pt")

# 3️⃣  optional: letter → index dictionary, to reproduce training runs
with open(ckpt_dir / "class_to_idx.json", "w", encoding="utf-8") as f:
    json.dump(class_to_idx, f, indent=2, ensure_ascii=False)

print("✅ Pesos y diccionario guardados en", ckpt_dir.resolve())

# to load the weights and the dictionary:
'''
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- load the dictionary ---
with open("checkpoints/class_to_idx.json", "r", encoding="utf-8") as f:
    class_to_idx = json.load(f)
n_classes = len(class_to_idx)

# --- load the image model ---
model_img = models.resnet18(weights=None)
model_img.fc = torch.nn.Linear(model_img.fc.in_features, n_classes)
model_img.load_state_dict(torch.load("checkpoints/resnet18_letters.pt", map_location=device))
model_img.to(device).eval()

# --- load the video model ---
model_vid = DynGRU(n_classes=n_classes).to(device)
model_vid.load_state_dict(torch.load("checkpoints/gru_letters.pt", map_location=device))
model_vid.eval() '''

✅ Pesos y diccionario guardados en C:\Users\User\Documents\ML\checkpoints


## 6. Experiments
Present the training configurations and results. Use tables or plots (matplotlib) for loss and accuracy per epoch.
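
As one way to get from the printed training lines to a per-epoch table, the sketch below parses lines in the format the training loops print and renders a Markdown table. Only three log lines are included to keep it short; `parse_log` and `to_markdown` are hypothetical helpers, and the same `rows` list could be handed to matplotlib for the plots.

```python
import re

# Three lines copied from the VidGRU training output (emoji prefix dropped)
LOG = """\
epoch 01  train_loss=3.1841  val_loss=2.9528  val_acc=0.403
epoch 18  train_loss=0.3688  val_loss=0.3159  val_acc=0.919
epoch 25  train_loss=0.2745  val_loss=0.3166  val_acc=0.855
"""

PATTERN = re.compile(
    r"epoch (\d+)\s+train_loss=([\d.]+)\s+val_loss=([\d.]+)\s+val_acc=([\d.]+)"
)
COLUMNS = ["epoch", "train_loss", "val_loss", "val_acc"]

def parse_log(text):
    """Turn the printed training lines into a list of dicts, one per epoch."""
    rows = []
    for m in PATTERN.finditer(text):
        rows.append({
            "epoch": int(m.group(1)),
            "train_loss": float(m.group(2)),
            "val_loss": float(m.group(3)),
            "val_acc": float(m.group(4)),
        })
    return rows

def to_markdown(rows):
    """Render the parsed rows as a Markdown table."""
    header = "| " + " | ".join(COLUMNS) + " |"
    rule = "|" + "|".join("---" for _ in COLUMNS) + "|"
    body = ["| " + " | ".join(str(r[c]) for c in COLUMNS) + " |" for r in rows]
    return "\n".join([header, rule, *body])

rows = parse_log(LOG)
table = to_markdown(rows)
print(table)
```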

In [35]:
img_path = "data/letters/statics/G/S14-G-4-19.jpg"
img = tfm(Image.open(img_path).convert("RGB")).unsqueeze(0).to(device)
pred = model_img(img).argmax(1).item()
print("Predicción imagen:", list(class_to_idx.keys())[pred])

Predicción imagen: G


In [48]:
vid = "data/letters/dynamics/Z/S1-Z-perfil-3.mp4"
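feats = extractor.extract(vid)["r21d"]          # (stacks,512)
# Note: training fed mean-pooled (512,) vectors (see VideoCSVDataset); here the full stack sequence is passed
out = model_vid(torch.tensor(feats, dtype=torch.float32).unsqueeze(0).to(device))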
pred = out.argmax(1).item()
print("Predicción vídeo :", list(class_to_idx.keys())[pred])

Predicción vídeo : Z


In [50]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
import json, pandas as pd
import torch

def eval_model(model, dataloader, name):
    model.eval()
    all_preds, all_labels = [], []

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)

            # Same forward call for both inputs: DynGRU takes pooled video
            # features (B, 512); ResNet-18 takes images (B, 3, 112, 112)
            out = model(X)

            preds = out.argmax(1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(y.cpu().numpy())

    acc = accuracy_score(all_labels, all_preds)
    f1  = f1_score(all_labels, all_preds, average="macro")
    cm  = confusion_matrix(all_labels, all_preds)

    # ----- save -----
    metrics_path = f"{name}_metrics.json"
    conf_path    = f"{name}_confusion.csv"

    with open(metrics_path, "w") as fp:
        json.dump({"accuracy": acc, "macro_f1": f1}, fp, indent=2)

    pd.DataFrame(cm, dtype=int).to_csv(conf_path, index=False, header=False)

    print(f"\n[{name.upper()}]  accuracy={acc:.3f}  macro-F1={f1:.3f}")
    print(f"Matriz de confusión guardada en  {conf_path}")
    print(f"Métricas guardadas en            {metrics_path}")

# --------- evaluation -----------
eval_model(model_img, dl_test,  "static")     # images
eval_model(model_vid, dl_vtest, "dynamic")    # videos



[STATIC]  accuracy=0.989  macro-F1=0.989
Matriz de confusión guardada en  static_confusion.csv
Métricas guardadas en            static_metrics.json

[DYNAMIC]  accuracy=0.919  macro-F1=0.923
Matriz de confusión guardada en  dynamic_confusion.csv
Métricas guardadas en            dynamic_metrics.json


In [62]:
import cv2, os, torch, numpy as np
from torchvision.io import read_video
from torchvision.transforms.functional import to_pil_image

idx_to_class = {v: k for k, v in class_to_idx.items()}
model_img.eval()
model_vid.eval()


def ensure_hwc(t):
    """
    Makes the tensor (H, W, C) with C <= 4.
    Works for (H,W,C), (C,H,W) and (H,C,W).
    """
    if t.ndim != 3:
        raise ValueError(f"Frame must be 3-D, got {tuple(t.shape)}")

    # find the axis with 1, 3 or 4 channels
    chan_axis = None
    for ax, size in enumerate(t.shape):
        if size in {1, 3, 4}:
            chan_axis = ax
            break
    if chan_axis is None:
        raise ValueError(f"No channel axis found in {tuple(t.shape)}")

    # build a permutation that moves that axis to the end
    if chan_axis == 2:          # already (H,W,C)
        return t
    order = [ax for ax in range(3) if ax != chan_axis] + [chan_axis]
    return t.permute(*order)



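import subprocess

def reencode_to_fps(src_path, out_tmp="tmp_norm.mp4", target_fps=30):
    """Re-encode the video at a fixed frame rate with ffmpeg (must be on PATH)."""
    if os.path.exists(out_tmp):
        os.remove(out_tmp)
    # Argument list instead of a shell string: paths with spaces or quotes are safe,
    # and check=True raises if ffmpeg fails instead of silently continuing
    subprocess.run(
        ["ffmpeg", "-y", "-hide_banner", "-loglevel", "error", "-i", str(src_path),
         "-filter:v", f"fps=fps={target_fps}", str(out_tmp)],
        check=True,
    )
    return out_tmp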

def sliding_windows(total_frames, win, step):
    i = 0
    while i + win <= total_frames:
        yield i, i + win
        i += step

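@torch.no_grad()
def classify_window(clip_tensor, frame_tensor):
    """Returns (best_letter, best_prob)"""
    # --- dynamic ---
    vec = extractor.name2module['model'](clip_tensor.to(device))  # (1,512)
    prob_dyn = model_vid(vec).softmax(1).squeeze(0)               # (C,)

    # --- static ---
    prob_img = model_img(frame_tensor.to(device).unsqueeze(0)).softmax(1).squeeze(0)

    # --- simple fusion: take the global max ---
    probs = torch.stack([prob_dyn, prob_img])                     # (2,C)
    # Reduce over the model axis (dim 0) first, then over classes, so that
    # best_idx is a class index. Reducing over dim 1 first, as in
    # probs.max(1)[0].max(0), returns the index of the winning model (0 or 1);
    # an output containing only 'A' and 'B', like the recorded run below,
    # is the symptom of that.
    best_prob, best_idx = probs.max(0)[0].max(0)
    best_letter = idx_to_class[best_idx.item()]
    return best_letter, best_prob.item()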

def video_to_word_mixed(video_path, win_sec=0.8, step_sec=0.4,
                        fps=30, prob_thr=0.4):
    tmp = reencode_to_fps(video_path, target_fps=fps)
    rgb, _, _ = read_video(tmp, pts_unit="sec")     # (T,H,W,3)
    total = rgb.shape[0]
    win   = int(win_sec * fps)
    step  = int(step_sec * fps)

    letters = []
    for s, e in sliding_windows(total, win, step):
        clip = rgb[s:e]                             # (win,H,W,3)
        if clip.shape[0] != win:                    # video too short
            continue

        # clip_tensor → (1,3,win,H,W)
        clip_t = extractor.transforms(clip).unsqueeze(0)

        # middle frame to PIL → tfm → tensor (3,112,112)
        mid = s + win // 2
        frame = ensure_hwc(rgb[mid])
        frame_np = frame.byte().cpu().numpy()  # Ensure numpy array (H, W, C)
        frame_pil = to_pil_image(frame_np)
        frame_t   = tfm(frame_pil).to(device)

        letter, prob = classify_window(clip_t, frame_t)
        if prob >= prob_thr:
            letters.append(letter)

    # remove consecutive duplicates
    word = [letters[0]] if letters else []
    for l in letters[1:]:
        if l != word[-1]:
            word.append(l)
    return "".join(word), letters

# ---------- usage ----------
video = "video_prueba2.mp4"
palabra, secuencia = video_to_word_mixed(video)
print("Palabra detectada:", palabra)
print("Secuencia cruda  :", secuencia)



Palabra detectada: ABABABAB
Secuencia cruda  : ['A', 'A', 'A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'B', 'B']
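
The raw sequence printed above flips label for single windows, and every flip survives the "remove consecutive duplicates" step as an extra letter. One possible post-processing step, not part of the notebook, is a sliding majority vote over neighbouring windows before collapsing. The sketch below applies it to the recorded raw sequence; `majority_smooth` and `collapse` are hypothetical helpers and the window size `k=3` is an arbitrary choice.

```python
from collections import Counter
from typing import List

def majority_smooth(seq: List[str], k: int = 3) -> List[str]:
    """Replace each label by the most common one in a centred window of size k (odd)."""
    half = k // 2
    out = []
    for i in range(len(seq)):
        window = seq[max(0, i - half): i + half + 1]
        # ties are resolved in favour of the label seen first in the window
        out.append(Counter(window).most_common(1)[0][0])
    return out

def collapse(seq: List[str]) -> str:
    """Drop consecutive duplicates, as video_to_word_mixed does."""
    word: List[str] = []
    for label in seq:
        if not word or label != word[-1]:
            word.append(label)
    return "".join(word)

# Raw per-window predictions copied from the recorded output
raw = ['A', 'A', 'A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A',
       'A', 'A', 'A', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'B', 'B']

word_raw = collapse(raw)
word_smooth = collapse(majority_smooth(raw, k=3))
print(word_raw, word_smooth)
```

Smoothing removes isolated one-window flips, but like the plain collapse it still cannot represent a genuinely doubled letter.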


## 7. Discussion
Analyze the results: what patterns do you find? Which gestures were difficult? How did lighting or background affect them?

## 8. Conclusions
Summarize the most relevant findings and mention possible future improvements.

## 9. Contribution Statement
Describe each team member's contribution, with percentages of time/activity.