# Exploratory Data Analysis (EDA) - Video Annotation System
**Deliverable 1 - Artificial Intelligence**

This notebook performs a complete exploratory analysis of the landmarks extracted with MediaPipe from the team's videos.

## EDA Objectives:
1. **Load and explore** the generated landmark datasets
2. **Analyze the distribution** of activities and participants
3. **Visualize movement patterns** for each activity
4. **Evaluate the quality** of MediaPipe detection
5. **Identify features** that distinguish the activities
6. **Prepare the data** for future modeling

## Expected Dataset:
- **📁 30 videos** from the team (10 per person)
- **📊 5 different activities**
- **🎯 16 relevant landmarks** per frame
- **👥 3 diverse participants**
---
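
The expected dataset shape listed above can be turned into a quick sanity check once the data is loaded. The following is a minimal sketch, not part of the original notebook: `check_expected_counts` and the `toy` frame are hypothetical, and the column names `video_file`, `activity` and `participant` are the metadata columns the loader below adds.

```python
import pandas as pd

# Expected values copied from the "Expected Dataset" list above.
EXPECTED = {"videos": 30, "activities": 5, "participants": 3}


def check_expected_counts(df: pd.DataFrame, expected: dict = EXPECTED) -> dict:
    """Compare observed unique counts against the expected dataset shape."""
    observed = {
        "videos": df["video_file"].nunique(),
        "activities": df["activity"].nunique(),
        "participants": df["participant"].nunique(),
    }
    return {
        key: {
            "expected": expected[key],
            "observed": observed[key],
            "ok": observed[key] == expected[key],
        }
        for key in expected
    }


# Hypothetical toy frame: 2 videos, 2 activities, 1 participant.
toy = pd.DataFrame({
    "video_file": ["P001_girar", "P001_girar", "P001_sentarse"],
    "activity": ["girar", "girar", "sentarse"],
    "participant": ["P001", "P001", "P001"],
})
report = check_expected_counts(toy)
for key, row in report.items():
    print(key, row)
```

A check like this would flag an incomplete upload before any plots are produced.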


In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import json
import os
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Configure plotting defaults
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("✅ Libraries imported successfully")
print(f"📊 Pandas version: {pd.__version__}")
print(f"📈 Matplotlib version: {plt.matplotlib.__version__}")
print(f"🎨 Seaborn version: {sns.__version__}")


✅ Libraries imported successfully
📊 Pandas version: 2.2.2
📈 Matplotlib version: 3.10.0
🎨 Seaborn version: 0.13.2


In [2]:
# Project configuration for the EDA
EDA_CONFIG = {
    'activities': [
        'caminar_hacia',
        'caminar_regreso',
        'girar',
        'sentarse',
        'ponerse_pie'
    ],
    'team_members': {
        'P001': 'Juan Esteban Ruiz',
        'P002': 'Juan David Quintero',
        'P003': 'Tomas Quintero'
    },
    'landmark_names': [
        'L_shoulder', 'R_shoulder', 'L_elbow', 'R_elbow',
        'L_wrist', 'R_wrist', 'L_hip', 'R_hip',
        'L_knee', 'R_knee', 'L_ankle', 'R_ankle',
        'L_heel', 'R_heel', 'L_foot', 'R_foot'
    ],
    'colors': {
        'caminar_hacia': '#1f77b4',
        'caminar_regreso': '#ff7f0e',
        'girar': '#2ca02c',
        'sentarse': '#d62728',
        'ponerse_pie': '#9467bd'
    },
    'paths': {
        'landmarks': 'data/landmarks/',
        'metadata': 'data/metadata/',
        'videos': 'data/videos/',
        'results': 'data/eda_results/'
    }
}

# Create the directory for EDA results
os.makedirs(EDA_CONFIG['paths']['results'], exist_ok=True)

print("✅ EDA configuration loaded")
print(f"🎯 Target activities: {len(EDA_CONFIG['activities'])}")
print(f"👥 Team members: {len(EDA_CONFIG['team_members'])}")


✅ EDA configuration loaded
🎯 Target activities: 5
👥 Team members: 3


In [3]:
# MAIN EDA CLASS
class LandmarksEDA:
    """Exploratory data analysis of pose landmarks."""

    def __init__(self, config=None):
        if config is None:
            config = EDA_CONFIG
        self.config = config
        self.landmarks_data = None
        self.summary_stats = {}

    def load_all_landmarks(self):
        """Load every landmark CSV file and combine them into one DataFrame."""
        print("📂 LOADING LANDMARK DATASETS")
        print("=" * 50)

        landmarks_dir = Path(self.config['paths']['landmarks'])

        if not landmarks_dir.exists():
            print(f"❌ Directory not found: {landmarks_dir}")
            return None

        # Find all landmark CSV files
        csv_files = list(landmarks_dir.glob("*_landmarks.csv"))

        if not csv_files:
            print(f"❌ No landmark files found in {landmarks_dir}")
            return None

        print(f"📁 Files found: {len(csv_files)}")

        # Load and combine all CSV files
        all_dataframes = []
        loading_stats = {
            'total_files': len(csv_files),
            'loaded_successfully': 0,
            'failed_files': [],
            'total_frames': 0
        }

        for csv_file in csv_files:
            try:
                df = pd.read_csv(csv_file)

                # Extract information from the file name
                filename_parts = csv_file.stem.replace('_landmarks', '').split('_')

                # Try to extract participant and activity
                if len(filename_parts) >= 2:
                    participant = filename_parts[0] if filename_parts[0] in self.config['team_members'] else 'Unknown'
                    # Everything after the participant ID is treated as the activity name
                    activity = '_'.join(filename_parts[1:])
                else:
                    participant = 'Unknown'
                    activity = 'Unknown'

                # Add metadata columns if they are missing
                if 'activity' not in df.columns:
                    df['activity'] = activity
                if 'participant' not in df.columns:
                    df['participant'] = participant
                if 'video_file' not in df.columns:
                    df['video_file'] = csv_file.stem.replace('_landmarks', '')

                all_dataframes.append(df)
                loading_stats['loaded_successfully'] += 1
                loading_stats['total_frames'] += len(df)

                print(f"   ✅ {csv_file.name}: {len(df)} frames")

            except Exception as e:
                loading_stats['failed_files'].append(f"{csv_file.name}: {str(e)}")
                print(f"   ❌ Error loading {csv_file.name}: {e}")

        if not all_dataframes:
            print("❌ No data could be loaded")
            return None

        # Combine all DataFrames
        self.landmarks_data = pd.concat(all_dataframes, ignore_index=True)

        print("\n📊 LOADING SUMMARY:")
        print(f"   ✅ Files loaded: {loading_stats['loaded_successfully']}/{loading_stats['total_files']}")
        print(f"   📊 Total frames: {loading_stats['total_frames']:,}")
        print(f"   🎬 Unique videos: {self.landmarks_data['video_file'].nunique()}")
        print(f"   👥 Participants: {self.landmarks_data['participant'].nunique()}")
        print(f"   🎯 Activities: {self.landmarks_data['activity'].nunique()}")

        if loading_stats['failed_files']:
            print("\n⚠️ FILES WITH ERRORS:")
            for error in loading_stats['failed_files']:
                print(f"   • {error}")

        # Save loading statistics
        with open(f"{self.config['paths']['results']}loading_stats.json", 'w') as f:
            json.dump(loading_stats, f, indent=2)

        return self.landmarks_data

    def basic_dataset_info(self):
        """Print and return basic information about the dataset."""
        if self.landmarks_data is None:
            print("❌ Run load_all_landmarks() first")
            return

        print("📋 BASIC DATASET INFORMATION")
        print("=" * 50)

        df = self.landmarks_data

        print("📊 DIMENSIONS:")
        print(f"   Rows (frames): {len(df):,}")
        print(f"   Columns: {len(df.columns)}")
        print(f"   Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

        print("\n🎯 DISTRIBUTION BY ACTIVITY:")
        activity_counts = df['activity'].value_counts()
        for activity, count in activity_counts.items():
            percentage = (count / len(df)) * 100
            print(f"   {activity.replace('_', ' ').title()}: {count:,} frames ({percentage:.1f}%)")

        print("\n👥 DISTRIBUTION BY PARTICIPANT:")
        participant_counts = df['participant'].value_counts()
        for participant, count in participant_counts.items():
            name = self.config['team_members'].get(participant, participant)
            percentage = (count / len(df)) * 100
            print(f"   {participant} ({name}): {count:,} frames ({percentage:.1f}%)")

        print("\n🎬 VIDEOS BY ACTIVITY AND PARTICIPANT:")
        video_summary = df.groupby(['activity', 'participant']).agg({
            'video_file': 'nunique',
            'frame_number': 'count'
        }).round(2)
        print(video_summary)

        # Check for missing data
        missing_data = df.isnull().sum()
        landmark_columns = [col for col in df.columns if any(landmark in col for landmark in self.config['landmark_names'])]
        missing_landmarks = missing_data[landmark_columns]

        if missing_landmarks.sum() > 0:
            print("\n⚠️ MISSING DATA IN LANDMARKS:")
            print(f"   Total missing values: {missing_landmarks.sum():,}")
            print(f"   Percentage: {(missing_landmarks.sum() / (len(df) * len(landmark_columns))) * 100:.2f}%")
        else:
            print("\n✅ NO MISSING DATA IN LANDMARKS")

        # Detection quality statistics
        if 'detection_rate' in df.columns:
            print("\n📈 MEDIAPIPE DETECTION QUALITY:")
            print(f"   Mean: {df['detection_rate'].mean():.1f}%")
            print(f"   Median: {df['detection_rate'].median():.1f}%")
            print(f"   Min/Max: {df['detection_rate'].min():.1f}% / {df['detection_rate'].max():.1f}%")

        return {
            'total_frames': len(df),
            'total_videos': df['video_file'].nunique(),
            'activities': activity_counts.to_dict(),
            'participants': participant_counts.to_dict(),
            'missing_data_percentage': (missing_landmarks.sum() / (len(df) * len(landmark_columns))) * 100 if len(landmark_columns) > 0 else 0
        }

# Create the EDA analyzer instance
eda = LandmarksEDA()
print("✅ EDA analyzer configured")


✅ EDA analyzer configured
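
The loader above infers the participant and the activity from each file name. The sketch below isolates that step so it can be tested without any CSV files. It assumes a naming convention of the form `<participant>_<activity>_landmarks.csv`, which is inferred from the parsing code rather than stated in the notebook; `parse_landmark_filename` is a hypothetical helper.

```python
from pathlib import Path

# Participant IDs taken from EDA_CONFIG['team_members'].
TEAM_MEMBERS = {"P001", "P002", "P003"}


def parse_landmark_filename(path, known_participants=TEAM_MEMBERS):
    """Return (participant, activity) inferred from a landmark CSV file name."""
    stem = Path(path).stem
    if stem.endswith("_landmarks"):
        stem = stem[: -len("_landmarks")]

    # Split only on the first underscore: activity names contain underscores too.
    participant, sep, activity = stem.partition("_")
    if not sep or not activity:
        return "Unknown", "Unknown"
    if participant not in known_participants:
        participant = "Unknown"
    return participant, activity


examples = [
    "data/landmarks/P001_caminar_hacia_landmarks.csv",
    "X9_girar_landmarks.csv",
    "girar_landmarks.csv",
]
for name in examples:
    print(name, "->", parse_landmark_filename(name))
```

If the real file names carry extra tokens, such as a take number, they would end up inside the activity label, so it may be worth validating the parsed activity against `EDA_CONFIG['activities']`.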


In [4]:
# LOAD AND EXPLORE THE DATA
print("🔄 LOADING LANDMARK DATASETS...")
landmarks_df = eda.load_all_landmarks()

if landmarks_df is not None:
    print("\n✅ DATA LOADED SUCCESSFULLY")

    # Show basic information
    basic_info = eda.basic_dataset_info()

    # Show the first rows
    print("\n👀 FIRST 5 ROWS OF THE DATASET:")
    display(landmarks_df.head())

    # Show the column structure
    print("\n📋 DATASET COLUMNS:")
    landmark_cols = [col for col in landmarks_df.columns if any(lm in col for lm in EDA_CONFIG['landmark_names'])]
    metadata_cols = [col for col in landmarks_df.columns if col not in landmark_cols]

    print(f"   📊 Landmark columns: {len(landmark_cols)}")
    print(f"   📝 Metadata columns: {len(metadata_cols)}")
    print(f"   📋 Metadata: {metadata_cols}")

else:
    print("❌ ERROR: The data could not be loaded")
    print("💡 Make sure you have run notebook 1 and uploaded the videos")


🔄 LOADING LANDMARK DATASETS...
📂 LOADING LANDMARK DATASETS
❌ Directory not found: data/landmarks
❌ ERROR: The data could not be loaded
💡 Make sure you have run notebook 1 and uploaded the videos


In [5]:
# MAIN VISUALIZATIONS - DISTRIBUTIONS
def create_distribution_visualizations(df):
    """Create distribution plots for the dataset."""
    print("📊 CREATING DISTRIBUTION VISUALIZATIONS")
    print("=" * 50)

    # Set up the subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Distribution of the Human Activity Dataset', fontsize=16, fontweight='bold')

    # 1. Distribution by activity
    # 'gray' is the fallback for activities without a configured color ('#gray' is not a valid color)
    activity_counts = df['activity'].value_counts()
    colors = [EDA_CONFIG['colors'].get(activity, 'gray') for activity in activity_counts.index]

    axes[0,0].pie(activity_counts.values, labels=[act.replace('_', ' ').title() for act in activity_counts.index],
                  autopct='%1.1f%%', colors=colors, startangle=90)
    axes[0,0].set_title('Distribution by Activity')

    # 2. Videos per participant
    participant_videos = df.groupby('participant')['video_file'].nunique()
    axes[0,1].bar(range(len(participant_videos)), participant_videos.values,
                  color=['#1f77b4', '#ff7f0e', '#2ca02c'])
    axes[0,1].set_title('Videos per Participant')
    axes[0,1].set_xlabel('Participant')
    axes[0,1].set_ylabel('Number of Videos')
    axes[0,1].set_xticks(range(len(participant_videos)))
    axes[0,1].set_xticklabels([f"{p}\n({EDA_CONFIG['team_members'].get(p, p)})" for p in participant_videos.index])

    # 3. Frames per activity
    frames_per_activity = df.groupby('activity').size()
    axes[1,0].barh(range(len(frames_per_activity)), frames_per_activity.values,
                   color=[EDA_CONFIG['colors'].get(activity, 'gray') for activity in frames_per_activity.index])
    axes[1,0].set_title('Frames per Activity')
    axes[1,0].set_xlabel('Number of Frames')
    axes[1,0].set_yticks(range(len(frames_per_activity)))
    axes[1,0].set_yticklabels([act.replace('_', ' ').title() for act in frames_per_activity.index])

    # 4. Matrix of videos by participant and activity
    video_matrix = df.groupby(['activity', 'participant'])['video_file'].nunique().unstack(fill_value=0)
    sns.heatmap(video_matrix, annot=True, fmt='d', cmap='Blues', ax=axes[1,1])
    axes[1,1].set_title('Videos by Participant and Activity')
    axes[1,1].set_xlabel('Participant')
    axes[1,1].set_ylabel('Activity')

    plt.tight_layout()
    plt.savefig(f"{EDA_CONFIG['paths']['results']}distribuciones_dataset.png", dpi=300, bbox_inches='tight')
    plt.show()

    return fig

# Create the visualizations if data is available
if landmarks_df is not None:
    dist_viz = create_distribution_visualizations(landmarks_df)
else:
    print("⚠️ No data to visualize")


⚠️ No data to visualize
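
The fourth panel relies on a `groupby` / `nunique` / `unstack` chain to count videos per activity and participant. Because no data was available in this run, here is a small sketch of that chain on a hypothetical toy frame (the rows are made up for illustration), including a way to list the combinations that have no video.

```python
import pandas as pd

# Hypothetical frame-level rows; each row stands for one frame of a video.
toy = pd.DataFrame({
    "activity": ["girar", "girar", "girar", "sentarse"],
    "participant": ["P001", "P001", "P002", "P001"],
    "video_file": ["P001_girar", "P001_girar", "P002_girar", "P001_sentarse"],
})

# Count distinct videos (not frames) for every activity/participant pair.
video_matrix = (
    toy.groupby(["activity", "participant"])["video_file"]
    .nunique()
    .unstack(fill_value=0)
)
print(video_matrix)

# Combinations with zero videos show up as gaps in the heatmap.
missing = [
    (activity, participant)
    for activity in video_matrix.index
    for participant in video_matrix.columns
    if video_matrix.loc[activity, participant] == 0
]
print("Missing combinations:", missing)
```

Using `nunique` rather than `count` matters here: two frames of the same video count once.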


In [6]:
# MEDIAPIPE DETECTION QUALITY ANALYSIS
def analyze_detection_quality(df):
    """Analyze the quality of MediaPipe detection."""
    print("🔍 MEDIAPIPE DETECTION QUALITY ANALYSIS")
    print("=" * 60)

    # Check whether a detection_rate column is available
    if 'detection_rate' not in df.columns:
        print("⚠️ Column 'detection_rate' not found")
        print("💡 Computing quality from missing values instead...")

        # Quality per frame = percentage of landmark values that are not NaN
        # (note: this adds a 'calculated_quality' column to the DataFrame passed in)
        landmark_cols = [col for col in df.columns if any(lm in col for lm in EDA_CONFIG['landmark_names'])]
        df['calculated_quality'] = 100 - (df[landmark_cols].isnull().sum(axis=1) / len(landmark_cols) * 100)
        quality_col = 'calculated_quality'
    else:
        quality_col = 'detection_rate'

    # Overall quality statistics
    print("📊 QUALITY STATISTICS:")
    print(f"   Mean: {df[quality_col].mean():.1f}%")
    print(f"   Median: {df[quality_col].median():.1f}%")
    print(f"   Standard deviation: {df[quality_col].std():.1f}%")
    print(f"   Min/Max: {df[quality_col].min():.1f}% / {df[quality_col].max():.1f}%")

    # Quality by activity
    print("\n🎯 QUALITY BY ACTIVITY:")
    quality_by_activity = df.groupby('activity')[quality_col].agg(['mean', 'median', 'std']).round(1)
    for activity, stats in quality_by_activity.iterrows():
        print(f"   {activity.replace('_', ' ').title()}: {stats['mean']:.1f}% ± {stats['std']:.1f}%")

    # Quality by participant
    print("\n👥 QUALITY BY PARTICIPANT:")
    quality_by_participant = df.groupby('participant')[quality_col].agg(['mean', 'median', 'std']).round(1)
    for participant, stats in quality_by_participant.iterrows():
        name = EDA_CONFIG['team_members'].get(participant, participant)
        print(f"   {participant} ({name}): {stats['mean']:.1f}% ± {stats['std']:.1f}%")

    # Quality visualizations
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    fig.suptitle('MediaPipe Detection Quality Analysis', fontsize=16, fontweight='bold')

    # Histogram of overall quality
    axes[0].hist(df[quality_col], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0].axvline(df[quality_col].mean(), color='red', linestyle='--', label=f'Mean: {df[quality_col].mean():.1f}%')
    axes[0].set_title('Detection Quality Distribution')
    axes[0].set_xlabel('Detection Quality (%)')
    axes[0].set_ylabel('Frequency')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Boxplot by activity (tick_labels replaces the deprecated labels argument in Matplotlib >= 3.9)
    activities = df['activity'].unique()
    quality_data = [df[df['activity'] == activity][quality_col].values for activity in activities]
    bp1 = axes[1].boxplot(quality_data, tick_labels=[act.replace('_', ' ').title() for act in activities], patch_artist=True)

    # Color the boxes; 'gray' is the fallback for activities without a configured color
    for patch, activity in zip(bp1['boxes'], activities):
        patch.set_facecolor(EDA_CONFIG['colors'].get(activity, 'gray'))
        patch.set_alpha(0.7)

    axes[1].set_title('Quality by Activity')
    axes[1].set_xlabel('Activity')
    axes[1].set_ylabel('Detection Quality (%)')
    axes[1].grid(True, alpha=0.3)
    plt.setp(axes[1].get_xticklabels(), rotation=45)

    # Boxplot by participant
    participants = df['participant'].unique()
    quality_data_p = [df[df['participant'] == p][quality_col].values for p in participants]
    bp2 = axes[2].boxplot(quality_data_p, tick_labels=[f"{p}\n{EDA_CONFIG['team_members'].get(p, p)}" for p in participants],
                          patch_artist=True)

    colors_p = ['#1f77b4', '#ff7f0e', '#2ca02c']
    for patch, color in zip(bp2['boxes'], colors_p):
        patch.set_facecolor(color)
        patch.set_alpha(0.7)

    axes[2].set_title('Quality by Participant')
    axes[2].set_xlabel('Participant')
    axes[2].set_ylabel('Detection Quality (%)')
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig(f"{EDA_CONFIG['paths']['results']}calidad_deteccion.png", dpi=300, bbox_inches='tight')
    plt.show()

    # Identify low-quality frames and the videos they belong to
    threshold = 70.0  # quality threshold in percent
    low_quality = df[df[quality_col] < threshold]

    if not low_quality.empty:
        print(f"\n⚠️ VIDEOS WITH LOW-QUALITY FRAMES (<{threshold}%):")
        # Mean quality of the low-quality frames in each video, worst first
        low_quality_summary = low_quality.groupby(['activity', 'participant', 'video_file'])[quality_col].mean().reset_index()
        low_quality_summary = low_quality_summary.sort_values(quality_col)

        for _, row in low_quality_summary.head(10).iterrows():
            print(f"   📹 {row['video_file']}: {row[quality_col]:.1f}% ({row['activity']}, {row['participant']})")
    else:
        print(f"\n✅ ALL FRAMES HAVE QUALITY ≥{threshold}%")

    return {
        'average_quality': df[quality_col].mean(),
        'quality_by_activity': quality_by_activity.to_dict(),
        'quality_by_participant': quality_by_participant.to_dict(),
        # Count distinct videos (not frames) that contain low-quality frames
        'low_quality_videos': low_quality['video_file'].nunique()
    }

# Run the quality analysis
if landmarks_df is not None:
    quality_analysis = analyze_detection_quality(landmarks_df)
else:
    print("⚠️ No data available for quality analysis")


⚠️ No data available for quality analysis
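
When no `detection_rate` column exists, the function above falls back to the share of non-missing landmark values per frame. The sketch below shows that fallback on a hypothetical toy frame, written so that it returns a new Series instead of adding a column to the input. `row_quality` is an illustrative helper, not part of the notebook.

```python
import numpy as np
import pandas as pd


def row_quality(df, landmark_cols):
    """Percentage of non-missing landmark values in each row (frame)."""
    return df[landmark_cols].notna().mean(axis=1) * 100


# Hypothetical data: 4 frames from 2 videos, with some failed detections (NaN).
toy = pd.DataFrame({
    "L_hip_x": [0.5, np.nan, 0.4, 0.6],
    "L_hip_y": [0.5, np.nan, 0.4, np.nan],
    "R_hip_x": [0.5, 0.3, 0.4, 0.6],
    "R_hip_y": [0.5, 0.3, 0.4, 0.6],
    "video_file": ["a", "a", "b", "b"],
})
cols = ["L_hip_x", "L_hip_y", "R_hip_x", "R_hip_y"]

quality = row_quality(toy, cols)                      # one value per frame
per_video = quality.groupby(toy["video_file"]).mean()  # one value per video
low_quality_videos = int((per_video < 70.0).sum())

print(quality.tolist())
print(per_video.to_dict())
print("Videos below 70%:", low_quality_videos)
```

Aggregating to one value per video before applying the 70% threshold is a design choice: it flags videos that are poor overall rather than videos with a few bad frames.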


In [7]:
# MOVEMENT PATTERN ANALYSIS
def analyze_movement_patterns(df):
    """Analyze movement patterns for each activity."""
    print("🏃 MOVEMENT PATTERN ANALYSIS")
    print("=" * 50)

    # Select key landmarks for the analysis
    key_landmarks = ['L_shoulder', 'R_shoulder', 'L_hip', 'R_hip', 'L_knee', 'R_knee']

    # Create a figure with subplots
    fig = make_subplots(
        rows=2, cols=3,
        subplot_titles=[act.replace('_', ' ').title() for act in EDA_CONFIG['activities']],
        specs=[[{'secondary_y': True}]*3, [{'secondary_y': True}]*3]
    )

    activities = EDA_CONFIG['activities']

    for i, activity in enumerate(activities):
        activity_data = df[df['activity'] == activity]

        if activity_data.empty:
            continue

        # Compute the subplot position
        row = (i // 3) + 1
        col = (i % 3) + 1

        # Approximate the center of mass by the midpoint of the two hips
        if 'L_hip_y' in activity_data.columns and 'R_hip_y' in activity_data.columns:
            # Vertical position of the hip midpoint
            center_of_mass_y = (activity_data['L_hip_y'] + activity_data['R_hip_y']) / 2

            # Take up to 100 evenly spaced samples
            # (frames from all videos of the activity are concatenated before sampling)
            sample_size = min(100, len(center_of_mass_y))
            sample_indices = np.linspace(0, len(center_of_mass_y)-1, sample_size, dtype=int)
            sample_y = center_of_mass_y.iloc[sample_indices]

            fig.add_trace(
                go.Scatter(
                    x=list(range(len(sample_y))),
                    y=sample_y,
                    mode='lines',
                    name=f'Center Y - {activity}',
                    line=dict(color=EDA_CONFIG['colors'][activity], width=2)
                ),
                row=row, col=col
            )

    fig.update_layout(
        title_text="Movement Patterns by Activity (Vertical Center of Mass)",
        height=800,
        showlegend=False
    )

    fig.show()

    # Statistical analysis of movement
    movement_stats = {}

    print("\n📈 MOVEMENT STATISTICS BY ACTIVITY:")

    for activity in EDA_CONFIG['activities']:
        activity_data = df[df['activity'] == activity]

        if activity_data.empty:
            continue

        # Compute movement variability
        movement_features = []

        for landmark in key_landmarks:
            for coord in ['x', 'y']:
                col_name = f'{landmark}_{coord}'
                if col_name in activity_data.columns:
                    # Variability = standard deviation of the coordinate
                    variability = activity_data[col_name].std()
                    movement_features.append(variability)

        if movement_features:
            avg_movement = np.mean(movement_features)
            movement_stats[activity] = {
                'avg_variability': avg_movement,
                'total_frames': len(activity_data),
                'unique_videos': activity_data['video_file'].nunique()
            }

            print(f"   {activity.replace('_', ' ').title()}:")
            print(f"      Mean variability: {avg_movement:.4f}")
            print(f"      Total frames: {len(activity_data):,}")
            print(f"      Unique videos: {activity_data['video_file'].nunique()}")

    return movement_stats

# Run the movement pattern analysis
if landmarks_df is not None:
    movement_analysis = analyze_movement_patterns(landmarks_df)
else:
    print("⚠️ No data available for movement pattern analysis")


⚠️ No data available for movement pattern analysis
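
The plot above samples the hip midpoint from all frames of an activity at once, so curves from different videos are mixed together. One possible alternative, sketched here under assumptions, is to resample each video separately to a fixed number of points. `hip_midpoint_curve` and the toy values are hypothetical. If the CSVs hold MediaPipe's normalized image coordinates, `y` grows downward, so a person sitting down would show an increasing curve.

```python
import numpy as np
import pandas as pd


def hip_midpoint_curve(video_df, n_points=100):
    """Vertical hip midpoint of one video, resampled to at most n_points values."""
    mid_y = ((video_df["L_hip_y"] + video_df["R_hip_y"]) / 2).to_numpy()
    n = min(n_points, len(mid_y))
    # Evenly spaced frame indices from the first to the last frame.
    idx = np.linspace(0, len(mid_y) - 1, n, dtype=int)
    return mid_y[idx]


# Hypothetical 10-frame video with the hips moving steadily down the image.
left = np.arange(10) / 10
toy = pd.DataFrame({
    "video_file": ["P001_sentarse"] * 10,
    "L_hip_y": left,
    "R_hip_y": left + 0.2,
})

curve = hip_midpoint_curve(toy, n_points=5)
print(curve)

# One curve per video keeps different recordings from being mixed together.
curves = {name: hip_midpoint_curve(g, n_points=5) for name, g in toy.groupby("video_file")}
```

Curves resampled to the same length can then be averaged or overlaid per activity.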


In [8]:
# CORRELATION MATRIX BETWEEN LANDMARKS
def create_correlation_analysis(df):
    """Run a correlation analysis between landmark coordinates."""
    print("🔗 CORRELATION ANALYSIS BETWEEN LANDMARKS")
    print("=" * 50)

    # Select the landmark columns
    landmark_cols = [col for col in df.columns if any(lm in col for lm in EDA_CONFIG['landmark_names'])]

    if not landmark_cols:
        print("❌ No landmark columns found")
        return None

    print(f"📊 Analyzing correlations between {len(landmark_cols)} landmark variables")

    # Compute the correlation matrix
    landmarks_numeric = df[landmark_cols].select_dtypes(include=[np.number])
    correlation_matrix = landmarks_numeric.corr()

    # Select a subset of landmarks for the heatmap (x and y of the first 8 landmarks)
    important_landmarks = []
    for landmark in EDA_CONFIG['landmark_names'][:8]:
        for coord in ['x', 'y']:
            col_name = f'{landmark}_{coord}'
            if col_name in landmarks_numeric.columns:
                important_landmarks.append(col_name)

    if important_landmarks:
        # Create the figure only when there is something to plot
        plt.figure(figsize=(20, 16))

        subset_corr = correlation_matrix.loc[important_landmarks, important_landmarks]

        # Hide the upper triangle (and the diagonal), which duplicates the lower triangle
        mask = np.triu(np.ones_like(subset_corr, dtype=bool))

        sns.heatmap(subset_corr,
                   mask=mask,
                   annot=True,
                   cmap='RdBu_r',
                   center=0,
                   square=True,
                   fmt='.2f',
                   cbar_kws={'label': 'Pearson correlation'})

        plt.title('Correlation Matrix - Main Landmarks', fontsize=16, fontweight='bold')
        plt.xticks(rotation=45, ha='right')
        plt.yticks(rotation=0)
        plt.tight_layout()
        plt.savefig(f"{EDA_CONFIG['paths']['results']}correlacion_landmarks.png", dpi=300, bbox_inches='tight')
        plt.show()

    # Collect every pair of distinct variables (upper triangle, so no self-correlations)
    correlation_pairs = []

    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            corr_value = correlation_matrix.iloc[i, j]
            if not np.isnan(corr_value):
                correlation_pairs.append({
                    'var1': correlation_matrix.columns[i],
                    'var2': correlation_matrix.columns[j],
                    'correlation': corr_value
                })

    # Sort by absolute correlation, strongest first
    correlation_pairs = sorted(correlation_pairs, key=lambda x: abs(x['correlation']), reverse=True)

    print("\n🔝 TOP 10 STRONGEST CORRELATIONS:")
    for i, pair in enumerate(correlation_pairs[:10]):
        print(f"   {i+1:2d}. {pair['var1']} ↔ {pair['var2']}: {pair['correlation']:.3f}")

    print("\n🔻 TOP 5 WEAKEST CORRELATIONS (|r| < 0.3):")
    # correlation_pairs is sorted strongest first, so re-sort ascending to list the weakest pairs
    low_correlations = sorted(
        (pair for pair in correlation_pairs if abs(pair['correlation']) < 0.3),
        key=lambda x: abs(x['correlation'])
    )
    for i, pair in enumerate(low_correlations[:5]):
        print(f"   {i+1}. {pair['var1']} ↔ {pair['var2']}: {pair['correlation']:.3f}")

    return {
        'correlation_matrix': correlation_matrix,
        'top_correlations': correlation_pairs[:10],
        'low_correlations': low_correlations[:5]
    }

# Run the correlation analysis
if landmarks_df is not None:
    correlation_analysis = create_correlation_analysis(landmarks_df)
else:
    print("⚠️ No data available for correlation analysis")


⚠️ No data available for correlation analysis
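
The pair ranking above uses nested loops over the correlation matrix. As an illustration only, the same ranking can be sketched in vectorized form with `np.triu_indices`. `ranked_correlation_pairs` and the toy columns are hypothetical and chosen so the correlations are easy to verify by hand.

```python
import numpy as np
import pandas as pd


def ranked_correlation_pairs(df):
    """Return all distinct variable pairs sorted by descending |Pearson r|."""
    corr = df.corr()
    # Indices of the strict upper triangle: each pair once, no self-correlations.
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.DataFrame({
        "var1": corr.index[rows],
        "var2": corr.columns[cols],
        "correlation": corr.to_numpy()[rows, cols],
    }).dropna()
    return pairs.sort_values(
        "correlation", key=lambda s: s.abs(), ascending=False
    ).reset_index(drop=True)


# Hypothetical columns: b is a scaled copy of a, c is uncorrelated with both.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [1, -1, 0, -1, 1],
})
ranked = ranked_correlation_pairs(toy)
strongest = ranked.head(1)   # highest |r| first
weakest = ranked.tail(2)     # lowest |r| last
print(ranked)
```

A near-zero Pearson correlation only indicates the absence of a linear relationship, so the weakest pairs are candidates for independence rather than proof of it.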


In [9]:
# COMPARISON BETWEEN ACTIVITIES - DISCRIMINATIVE ANALYSIS
def discriminative_analysis(df):
    """Analyze features that discriminate between activities."""
    print("🎯 DISCRIMINATIVE ANALYSIS BETWEEN ACTIVITIES")
    print("=" * 50)

    # Select the key landmarks
    key_features = []
    for landmark in ['L_shoulder', 'R_shoulder', 'L_hip', 'R_hip', 'L_knee', 'R_knee']:
        for coord in ['x', 'y']:
            col_name = f'{landmark}_{coord}'
            if col_name in df.columns:
                key_features.append(col_name)

    if not key_features:
        print("❌ No key features found")
        return None

    print(f"📊 Analyzing {len(key_features)} candidate features")

    # Compute statistics for each activity
    activity_stats = {}

    for activity in EDA_CONFIG['activities']:
        activity_data = df[df['activity'] == activity]

        if activity_data.empty:
            continue

        activity_stats[activity] = {}

        for feature in key_features:
            if feature in activity_data.columns:
                activity_stats[activity][feature] = {
                    'mean': activity_data[feature].mean(),
                    'std': activity_data[feature].std(),
                    'median': activity_data[feature].median()
                }
    # Create the comparative visualization
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('Discriminative Analysis - Features by Activity', fontsize=16, fontweight='bold')

    # Plot the first 6 key features (taken in list order, not ranked by discriminative power)
    discriminative_features = key_features[:6]

    for i, feature in enumerate(discriminative_features):
        row = i // 3
        col = i % 3

        # Build one boxplot per activity for this feature
        feature_data = []
        labels = []
        colors_list = []

        for activity in EDA_CONFIG['activities']:
            activity_data = df[df['activity'] == activity]
            if not activity_data.empty and feature in activity_data.columns:
                feature_data.append(activity_data[feature].dropna().values)
                labels.append(activity.replace('_', ' ').title())
                colors_list.append(EDA_CONFIG['colors'][activity])

        if feature_data:
            # tick_labels replaces the deprecated labels argument in Matplotlib >= 3.9
            bp = axes[row, col].boxplot(feature_data, tick_labels=labels, patch_artist=True)

            # Color the boxes
            for patch, color in zip(bp['boxes'], colors_list):
                patch.set_facecolor(color)
                patch.set_alpha(0.7)

            axes[row, col].set_title(f'{feature.replace("_", " ").title()}')
            axes[row, col].grid(True, alpha=0.3)
            plt.setp(axes[row, col].get_xticklabels(), rotation=45)

    plt.tight_layout()
    plt.savefig(f"{EDA_CONFIG['paths']['results']}analisis_discriminativo.png", dpi=300, bbox_inches='tight')
    plt.show()

    # Compute distances between activities
    print("\n📏 DISTANCES BETWEEN ACTIVITIES (RMS difference of feature means):")

    activities = list(EDA_CONFIG['activities'])
    distances = {}

    for i, act1 in enumerate(activities):
        for j, act2 in enumerate(activities[i+1:], i+1):
            if act1 in activity_stats and act2 in activity_stats:
                # Accumulate squared differences between the per-activity feature means
                distance = 0
                valid_features = 0

                for feature in key_features:
                    if feature in activity_stats[act1] and feature in activity_stats[act2]:
                        diff = activity_stats[act1][feature]['mean'] - activity_stats[act2][feature]['mean']
                        distance += diff ** 2
                        valid_features += 1

                if valid_features > 0:
                    # Dividing by the feature count gives a root-mean-square difference,
                    # i.e. the Euclidean distance scaled by 1/sqrt(valid_features)
                    distance = np.sqrt(distance / valid_features)
                    distances[f"{act1} ↔ {act2}"] = distance

                    print(f"   {act1.replace('_', ' ').title()} ↔ {act2.replace('_', ' ').title()}: {distance:.4f}")

    # Find the most similar and the most different pair of activities
    if distances:
        most_similar = min(distances.items(), key=lambda x: x[1])
        most_different = max(distances.items(), key=lambda x: x[1])

        print(f"\n🔍 MOST SIMILAR ACTIVITIES: {most_similar[0]} (distance: {most_similar[1]:.4f})")
        print(f"🔍 MOST DIFFERENT ACTIVITIES: {most_different[0]} (distance: {most_different[1]:.4f})")

    return {
        'activity_stats': activity_stats,
        'distances': distances,
        'most_similar': most_similar if distances else None,
        'most_different': most_different if distances else None
    }

# Run the discriminative analysis
if landmarks_df is not None:
    discriminative_results = discriminative_analysis(landmarks_df)
else:
    print("⚠️ No data available for discriminative analysis")


⚠️ No data available for discriminative analysis
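
The distance used above is the root-mean-square difference between the per-activity feature means, which equals the Euclidean distance divided by the square root of the number of features. The sketch below reproduces that computation with `groupby` on a hypothetical toy frame. `pairwise_mean_distances` is an illustrative helper, not the notebook's function.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def pairwise_mean_distances(df, features, label_col="activity"):
    """RMS difference between the mean feature vectors of every pair of labels."""
    means = df.groupby(label_col)[features].mean()
    distances = {}
    for a, b in combinations(means.index, 2):
        diff = means.loc[a] - means.loc[b]
        # Mean of squared differences, then square root (NaN features are skipped).
        distances[(a, b)] = float(np.sqrt((diff ** 2).mean()))
    return distances


# Hypothetical frames: the two activities differ only in the vertical hip position.
toy = pd.DataFrame({
    "activity": ["girar", "girar", "sentarse", "sentarse"],
    "L_hip_x": [0.4, 0.6, 0.5, 0.5],
    "L_hip_y": [0.5, 0.5, 0.7, 0.9],
})
d = pairwise_mean_distances(toy, ["L_hip_x", "L_hip_y"])
print(d)
```

Because the features are not standardized, coordinates with a larger spread dominate this distance. Scaling each feature by its pooled standard deviation first would be one way to address that.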


In [10]:
# RESUMEN EJECUTIVO Y CONCLUSIONES
def generate_executive_summary(df, quality_analysis, movement_analysis, discriminative_results):
    """Generar resumen ejecutivo del EDA"""
    print("üìã RESUMEN EJECUTIVO - EDA LANDMARKS")
    print("=" * 60)

    # Estad√≠sticas generales
    total_videos = df['video_file'].nunique()
    total_frames = len(df)
    total_participants = df['participant'].nunique()
    total_activities = df['activity'].nunique()

    print(f"üìä ESTAD√çSTICAS GENERALES:")
    print(f"   üé¨ Total de videos procesados: {total_videos}")
    print(f"   üìä Total de frames analizados: {total_frames:,}")
    print(f"   üë• Participantes del equipo: {total_participants}")
    print(f"   üéØ Actividades diferentes: {total_activities}")
    print(f"   ‚è±Ô∏è Promedio frames por video: {total_frames/total_videos:.0f}")

    # Data quality
    if quality_analysis:
        avg_quality = quality_analysis['average_quality']
        print("\n✅ MEDIAPIPE DETECTION QUALITY:")
        print(f"   📈 Average quality: {avg_quality:.1f}%")
        print(f"   🎯 Quality rating: {'EXCELLENT' if avg_quality > 90 else 'GOOD' if avg_quality > 80 else 'ACCEPTABLE'}")

        # Best and worst activity in terms of detection quality
        if 'quality_by_activity' in quality_analysis:
            activity_qualities = {k: v['mean'] for k, v in quality_analysis['quality_by_activity'].items()}
            best_activity = max(activity_qualities.items(), key=lambda x: x[1])
            worst_activity = min(activity_qualities.items(), key=lambda x: x[1])

            print(f"   🥇 Best activity: {best_activity[0].replace('_', ' ').title()} ({best_activity[1]:.1f}%)")
            print(f"   📉 Most challenging activity: {worst_activity[0].replace('_', ' ').title()} ({worst_activity[1]:.1f}%)")

    # Data distribution
    print("\n📈 DATA DISTRIBUTION:")
    activity_distribution = df['activity'].value_counts()  # sorted from most to least frequent
    most_represented = activity_distribution.iloc[0]
    least_represented = activity_distribution.iloc[-1]
    balance_ratio = least_represented / most_represented

    print(f"   ⚖️ Dataset balance: {balance_ratio:.2f} (1.0 = perfect)")
    print(f"   📊 Balance status: {'✅ BALANCED' if balance_ratio > 0.7 else '⚠️ IMBALANCED'}")
    print(f"   🔝 Most represented activity: {activity_distribution.index[0].replace('_', ' ').title()} ({most_represented} frames)")
    print(f"   📉 Least represented activity: {activity_distribution.index[-1].replace('_', ' ').title()} ({least_represented} frames)")

    # Main findings
    print("\n🔍 MAIN FINDINGS:")

    if discriminative_results and discriminative_results.get('most_similar') and discriminative_results.get('most_different'):
        most_similar = discriminative_results['most_similar']
        most_different = discriminative_results['most_different']

        print(f"   🤝 Most similar activities: {most_similar[0]}")
        print(f"   🆚 Most different activities: {most_different[0]}")
        # Skip the ratio when the smallest distance is zero to avoid dividing by zero
        if most_similar[1] > 0:
            print(f"   📏 Separability ratio: {most_different[1]/most_similar[1]:.2f}x")

    if movement_analysis:
        # Activities with the highest and lowest movement variability
        movement_vars = {k: v['avg_variability'] for k, v in movement_analysis.items()}
        if movement_vars:
            most_dynamic = max(movement_vars.items(), key=lambda x: x[1])
            least_dynamic = min(movement_vars.items(), key=lambda x: x[1])

            print(f"   🏃 Most dynamic activity: {most_dynamic[0].replace('_', ' ').title()}")
            print(f"   🧘 Most static activity: {least_dynamic[0].replace('_', ' ').title()}")

    # Recommendations
    print("\n💡 RECOMMENDATIONS FOR MODELING:")

    # Based on quality
    if quality_analysis:
        if quality_analysis['average_quality'] > 85:
            print("   ✅ Dataset suitable for direct training")
        else:
            print("   🔧 Consider filtering by quality (threshold: 70%)")

    # Based on balance
    if balance_ratio < 0.6:
        print("   ⚖️ Consider data augmentation for the least represented activities")
    else:
        print("   ✅ Adequate balance between activities")

    # Based on separability
    if discriminative_results and discriminative_results.get('most_similar'):
        similar_distance = discriminative_results['most_similar'][1]
        if similar_distance < 0.1:
            print("   🎯 Similar activities may require additional features")
        else:
            print("   ✅ Activities well differentiated for classification")

    # Preparation for the next stage
    print("\n🚀 PREPARATION FOR DELIVERABLE 2:")
    print("   📊 Dataset validated and characterized")
    print(f"   🎯 {total_videos} videos processed successfully")
    print("   📈 Movement patterns identified")
    print("   🤖 Ready for ML model training")

    # Save the summary
    summary_data = {
        'generation_date': datetime.now().isoformat(),
        'dataset_stats': {
            'total_videos': total_videos,
            'total_frames': total_frames,
            'participants': total_participants,
            'activities': total_activities
        },
        'quality_stats': quality_analysis if quality_analysis else {},
        'movement_stats': movement_analysis if movement_analysis else {},
        'discriminative_stats': {
            'most_similar': discriminative_results.get('most_similar') if discriminative_results else None,
            'most_different': discriminative_results.get('most_different') if discriminative_results else None
        },
        'balance_ratio': balance_ratio,
        'recommendations': [
            "Dataset suitable for training" if quality_analysis and quality_analysis['average_quality'] > 85 else "Filter by quality",
            # Same boundary as the printed recommendation above (augmentation only when ratio < 0.6)
            "Adequate balance" if balance_ratio >= 0.6 else "Consider data augmentation",
            "Ready for ML modeling"
        ]
    }

    summary_path = f"{EDA_CONFIG['paths']['results']}resumen_ejecutivo.json"
    # UTF-8 and ensure_ascii=False keep characters such as "↔" readable in the file
    with open(summary_path, 'w', encoding='utf-8') as f:
        json.dump(summary_data, f, indent=2, ensure_ascii=False)

    print(f"\n💾 Summary saved to: {summary_path}")

    return summary_data

# Generate the executive summary, passing whichever analyses were produced above
if landmarks_df is not None:
    executive_summary = generate_executive_summary(
        landmarks_df,
        quality_analysis if 'quality_analysis' in locals() else None,
        movement_analysis if 'movement_analysis' in locals() else None,
        discriminative_results if 'discriminative_results' in locals() else None
    )
else:
    print("⚠️ Not enough data to generate the executive summary")


⚠️ Not enough data to generate the executive summary
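
The balance ratio reported in the summary is the frame count of the least represented activity divided by that of the most represented one. A minimal sketch of that calculation in isolation, assuming a label column like `activity`; the helper name `balance_ratio` and the counts are hypothetical.

```python
import pandas as pd

def balance_ratio(labels: pd.Series) -> float:
    """Ratio of the rarest to the most common class (1.0 = perfectly balanced)."""
    counts = labels.value_counts()
    if counts.empty:
        return float("nan")  # no labels, so no ratio can be computed
    return float(counts.min() / counts.max())

# Hypothetical frame counts per activity
labels = pd.Series(["girar"] * 80 + ["sentarse"] * 100 + ["ponerse_pie"] * 60)
ratio = balance_ratio(labels)
print(f"Balance ratio: {ratio:.2f}")
```

Using `min()` and `max()` instead of positional indexing gives the same result as `value_counts().iloc[-1] / value_counts().iloc[0]` and also handles an empty label column without raising an error.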

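Summary dictionaries built from pandas and NumPy results often contain NumPy scalars such as `np.int64`, which the standard `json` module rejects. The sketch below shows one way to handle this with a `default` hook; the helper name `to_builtin` and the sample values are hypothetical, and whether the notebook's own analysis dictionaries need it depends on how they were built.

```python
import json
import numpy as np

def to_builtin(obj):
    """Fallback for json.dumps: convert NumPy scalars and arrays to built-in types."""
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Not JSON serializable: {type(obj).__name__}")

# Hypothetical summary fragment containing NumPy types
sample = {
    "total_frames": np.int64(1234),
    "balance_ratio": np.float64(0.6),
    "most_similar": ("girar ↔ sentarse", np.float32(0.25)),
}

text = json.dumps(sample, default=to_builtin, ensure_ascii=False, indent=2)
restored = json.loads(text)
print(restored)
```

Tuples are written as JSON arrays, so `most_similar` comes back as a list when the file is read again.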

## ✅ EDA Checklist Completed

### Analyses Performed:
- [x] **Data loading** for the landmarks
- [x] **Basic information** about the dataset
- [x] **Distribution plots** by activity and participant
- [x] **Quality analysis** of MediaPipe detection
- [x] **Movement patterns** by activity
- [x] **Correlation matrix** between landmarks
- [x] **Discriminative analysis** between activities
- [x] **Executive summary** with conclusions

### Results Generated:
- 📊 Complete **descriptive statistics**
- 📈 **Visualizations** saved in `data/eda_results/`
- 🔍 MediaPipe **quality analysis** by activity
- 🎯 **Discriminative features** identified
- 💡 **Recommendations** for future modeling

### Files Generated:
- `distribuciones_dataset.png` - Distribution plots
- `calidad_deteccion.png` - MediaPipe quality analysis
- `correlacion_landmarks.png` - Correlation matrix
- `analisis_discriminativo.png` - Features by activity
- `resumen_ejecutivo.json` - Full summary of the analysis
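
A quick way to confirm that a run actually produced the files listed above is to check for them on disk. This is a minimal sketch, assuming the results live in `data/eda_results/`; the helper name `missing_outputs` is hypothetical.

```python
from pathlib import Path

# File names taken from the list above
EXPECTED_FILES = [
    "distribuciones_dataset.png",
    "calidad_deteccion.png",
    "correlacion_landmarks.png",
    "analisis_discriminativo.png",
    "resumen_ejecutivo.json",
]

def missing_outputs(results_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in results_dir."""
    base = Path(results_dir)
    return [name for name in expected if not (base / name).is_file()]

missing = missing_outputs("data/eda_results")
if missing:
    print("Missing outputs:", ", ".join(missing))
else:
    print("All expected EDA outputs are present")
```

If the data-loading cell found no landmarks, every file is reported as missing, which signals that the analyses were skipped rather than completed.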

---
**Status**: EDA completed and documented
**Next step**: Preparation for Deliverable 2 - ML modeling
