# 04 - Statistical Analysis of the Interactions

This notebook implements **Stage 4** of the SINKT pipeline: a full statistical analysis of the generated interaction data.

## Objective
Analyze patterns, distributions, and metrics of the interaction data to validate its quality.

## Output
- `data/output/analysis_report.json`: Full statistical analysis report
- `data/output/metrics_summary.json`: Metrics summary (listed here, but no cell in this notebook writes it)

## Library Imports

In [11]:
import json
import os
from datetime import datetime
from typing import Dict, List, Any
import numpy as np
import pandas as pd
from collections import defaultdict, Counter
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## Loading All the Data

In [12]:
# Load all the data
with open('data/output/notebooks/geracao_perfis/profiles.json', 'r', encoding='utf-8') as f:
    profiles_data = json.load(f)
profiles = profiles_data['profiles']

with open('data/output/notebooks/geracao_estudantes/students.json', 'r', encoding='utf-8') as f:
    students_data = json.load(f)
students = students_data['students']

with open('data/output/interactions.json', 'r', encoding='utf-8') as f:
    interactions_data = json.load(f)
interactions = interactions_data['interactions']

print("✅ Data loaded:")
print(f"  - Profiles: {len(profiles)}")
print(f"  - Students: {len(students)}")
print(f"  - Interactions: {len(interactions)}")

✅ Data loaded:
  - Profiles: 6
  - Students: 100
  - Interactions: 4450


## Analysis 1: General Interaction Statistics

In [13]:
def analyze_general_statistics(interactions: List[Dict], students: List) -> Dict[str, Any]:
    """Compute general statistics for the interactions."""
    
    stats = {}
    
    total_interactions = len(interactions)
    correct_interactions = sum(1 for i in interactions if i['is_correct'])
    incorrect_interactions = total_interactions - correct_interactions
    
    stats['total_interactions'] = total_interactions
    stats['correct_interactions'] = correct_interactions
    stats['incorrect_interactions'] = incorrect_interactions
    stats['accuracy'] = correct_interactions / total_interactions if total_interactions > 0 else 0
    
    times = [i['time_spent_seconds'] for i in interactions]
    stats['time_statistics'] = {
        'mean': np.mean(times),
        'median': np.median(times),
        'std': np.std(times),
        'min': np.min(times),
        'max': np.max(times),
        'q25': np.percentile(times, 25),
        'q75': np.percentile(times, 75)
    }
    
    masteries = [i['mastery_after'] for i in interactions]
    stats['mastery_statistics'] = {
        'mean': np.mean(masteries),
        'median': np.median(masteries),
        'std': np.std(masteries),
        'min': np.min(masteries),
        'max': np.max(masteries),
        'q25': np.percentile(masteries, 25),
        'q75': np.percentile(masteries, 75)
    }
    
    interactions_per_student = defaultdict(int)
    for interaction in interactions:
        interactions_per_student[interaction['student_id']] += 1
    
    student_interaction_counts = list(interactions_per_student.values())
    stats['interactions_per_student'] = {
        'mean': np.mean(student_interaction_counts),
        'median': np.median(student_interaction_counts),
        'std': np.std(student_interaction_counts),
        'min': np.min(student_interaction_counts),
        'max': np.max(student_interaction_counts)
    }
    
    return stats

general_stats = analyze_general_statistics(interactions, students)

print("\n📊 General Interaction Statistics:\n")
print(f"  Total Interactions: {general_stats['total_interactions']}")
print(f"  Correct: {general_stats['correct_interactions']} ({general_stats['accuracy']:.1%})")
print(f"  Incorrect: {general_stats['incorrect_interactions']} ({1-general_stats['accuracy']:.1%})")
print("\n  Time Spent (seconds):")
print(f"    Mean: {general_stats['time_statistics']['mean']:.1f}s")
print(f"    Median: {general_stats['time_statistics']['median']:.1f}s")
print(f"    Std Dev: {general_stats['time_statistics']['std']:.1f}s")
print(f"    Range: [{general_stats['time_statistics']['min']:.0f}s, {general_stats['time_statistics']['max']:.0f}s]")
print("\n  Mastery:")
print(f"    Mean: {general_stats['mastery_statistics']['mean']:.3f}")
print(f"    Median: {general_stats['mastery_statistics']['median']:.3f}")
print(f"    Std Dev: {general_stats['mastery_statistics']['std']:.3f}")
print("\n  Interactions per Student:")
print(f"    Mean: {general_stats['interactions_per_student']['mean']:.1f}")
print(f"    Range: [{general_stats['interactions_per_student']['min']:.0f}, {general_stats['interactions_per_student']['max']:.0f}]")


📊 General Interaction Statistics:

  Total Interactions: 4450
  Correct: 1975 (44.4%)
  Incorrect: 2475 (55.6%)

  Time Spent (seconds):
    Mean: 157.7s
    Median: 159.0s
    Std Dev: 81.9s
    Range: [15s, 299s]

  Mastery:
    Mean: 0.549
    Median: 0.548
    Std Dev: 0.137

  Interactions per Student:
    Mean: 44.5
    Range: [30, 60]
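
The time statistics above look close to what a uniform distribution over the observed range would produce. The sketch below computes the theoretical mean and standard deviation of a continuous uniform distribution on [15, 299] for comparison. The helper name `uniform_moments` is hypothetical, and whether `time_spent_seconds` was actually sampled uniformly is an assumption, since the generator is not shown in this notebook.

```python
import math


def uniform_moments(low: float, high: float):
    """Return (mean, std) of a continuous uniform distribution on [low, high]."""
    mean = (low + high) / 2
    std = (high - low) / math.sqrt(12)
    return mean, std


# Range taken from the reported output: [15s, 299s]
mean, std = uniform_moments(15, 299)
print(f"uniform mean = {mean:.1f}s, uniform std = {std:.1f}s")
# prints: uniform mean = 157.0s, uniform std = 82.0s
```

The reported values (mean 157.7s, standard deviation 81.9s) are within about one second of these, so time spent may carry little information about the student or the question.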


## Analysis 2: Error Distribution

In [14]:
def analyze_error_distribution(interactions: List[Dict]) -> Dict[str, Any]:
    """Analyze the distribution of error types."""
    
    error_distribution = Counter()
    error_explanations = defaultdict(list)
    
    for interaction in interactions:
        # .get() tolerates records that lack an 'error_type' key
        error_type = interaction.get('error_type')
        if error_type:
            error_distribution[error_type] += 1
            error_explanations[error_type].append(
                interaction.get('error_explanation', 'N/A')
            )
    
    total_errors = sum(error_distribution.values())
    
    error_stats = {}
    for error_type, count in error_distribution.items():
        error_stats[error_type] = {
            'count': count,
            'percentage': (count / total_errors * 100) if total_errors > 0 else 0
        }
    
    return {
        'total_errors': total_errors,
        'error_distribution': error_stats,
        'error_types': list(error_distribution.keys())
    }

error_analysis = analyze_error_distribution(interactions)

print("\n📊 Error Distribution Analysis:\n")
print(f"  Total Errors: {error_analysis['total_errors']}")
print("\n  Distribution by Type:")
for error_type, stats in error_analysis['error_distribution'].items():
    print(f"    - {error_type}: {stats['count']} ({stats['percentage']:.1f}%)")


📊 Error Distribution Analysis:

  Total Errors: 2475

  Distribution by Type:
    - misconception: 490 (19.8%)
    - slip: 513 (20.7%)
    - careless: 462 (18.7%)
    - misunderstanding: 513 (20.7%)
    - incomplete: 497 (20.1%)
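
The five error types each account for roughly 20% of the errors. As a rough check of how close this is to an even split, the sketch below computes a chi-square goodness-of-fit statistic against a uniform distribution, using the counts printed above. The helper `chi_square_uniform` is hypothetical. The critical value 9.488 is the standard 5% threshold for 4 degrees of freedom, and the test assumes the errors are independent draws, which may not hold for repeated interactions by the same student.

```python
def chi_square_uniform(counts):
    """Chi-square statistic of observed counts against equal expected counts."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Counts copied from the output above
observed = {
    "misconception": 490,
    "slip": 513,
    "careless": 462,
    "misunderstanding": 513,
    "incomplete": 497,
}

stat = chi_square_uniform(list(observed.values()))
CRITICAL_5PCT_DF4 = 9.488  # chi-square critical value, alpha = 0.05, df = 4

print(f"chi2 = {stat:.2f} (critical value {CRITICAL_5PCT_DF4})")
# prints: chi2 = 3.57 (critical value 9.488)
```

Because the statistic is well below the critical value, the counts are consistent with error types being assigned with equal probability.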


## Analysis 3: Performance by Profile

In [15]:
def analyze_performance_by_profile(interactions: List[Dict], students: List, profiles: Dict) -> Dict[str, Any]:
    """Analyze performance grouped by cognitive profile."""
    
    profile_data = defaultdict(lambda: {
        'students': [],
        'accuracies': [],
        'masteries': [],
        'times': []
    })
    
    student_interactions = defaultdict(list)
    for interaction in interactions:
        student_interactions[interaction['student_id']].append(interaction)
    
    for student in students:
        student_id = student['id']
        profile_id = student['profile_id']
        student_ints = student_interactions.get(student_id, [])
        
        if not student_ints:
            continue
        
        accuracy = sum(1 for i in student_ints if i['is_correct']) / len(student_ints)
        avg_mastery = np.mean([i['mastery_after'] for i in student_ints])
        avg_time = np.mean([i['time_spent_seconds'] for i in student_ints])
        
        profile_data[profile_id]['students'].append(student_id)
        profile_data[profile_id]['accuracies'].append(accuracy)
        profile_data[profile_id]['masteries'].append(avg_mastery)
        profile_data[profile_id]['times'].append(avg_time)
    
    profile_stats = {}
    for profile_id, data in profile_data.items():
        if data['accuracies']:
            profile_stats[profile_id] = {
                'num_students': len(data['students']),
                'accuracy': {
                    'mean': np.mean(data['accuracies']),
                    'std': np.std(data['accuracies']),
                    'min': np.min(data['accuracies']),
                    'max': np.max(data['accuracies'])
                },
                'mastery': {
                    'mean': np.mean(data['masteries']),
                    'std': np.std(data['masteries'])
                },
                'time': {
                    'mean': np.mean(data['times']),
                    'std': np.std(data['times'])
                }
            }
    
    return profile_stats

profile_performance = analyze_performance_by_profile(interactions, students, profiles)

print("\n📊 Performance by Cognitive Profile:\n")
for profile_id, stats in sorted(profile_performance.items()):
    print(f"  {profile_id}:")
    print(f"    Students: {stats['num_students']}")
    print(f"    Accuracy: {stats['accuracy']['mean']:.1%} (±{stats['accuracy']['std']:.1%})")
    print(f"    Mastery: {stats['mastery']['mean']:.3f} (±{stats['mastery']['std']:.3f})")
    print(f"    Time: {stats['time']['mean']:.1f}s (±{stats['time']['std']:.1f}s)")
    print()


📊 Performance by Cognitive Profile:

  balanced:
    Students: 30
    Accuracy: 46.0% (±7.5%)
    Mastery: 0.583 (±0.049)
    Time: 158.5s (±8.8s)

  careful:
    Students: 20
    Accuracy: 40.9% (±8.6%)
    Mastery: 0.499 (±0.052)
    Time: 157.0s (±11.7s)

  intuitive:
    Students: 10
    Accuracy: 40.5% (±5.9%)
    Mastery: 0.480 (±0.031)
    Time: 159.7s (±6.4s)

  logical:
    Students: 10
    Accuracy: 42.8% (±6.6%)
    Mastery: 0.533 (±0.054)
    Time: 153.5s (±9.4s)

  quick_learner:
    Students: 20
    Accuracy: 55.7% (±8.5%)
    Mastery: 0.734 (±0.062)
    Time: 158.2s (±8.5s)

  struggling:
    Students: 10
    Accuracy: 30.8% (±7.0%)
    Mastery: 0.293 (±0.037)
    Time: 158.5s (±11.3s)
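
The per-profile figures can be recombined as a rough consistency check against the overall accuracy from Analysis 1. The sketch below weights each profile's mean accuracy by its number of students, using the values printed above. The variable names are illustrative only.

```python
# (num_students, mean accuracy) per profile, copied from the output above
profile_summary = {
    "balanced": (30, 0.460),
    "careful": (20, 0.409),
    "intuitive": (10, 0.405),
    "logical": (10, 0.428),
    "quick_learner": (20, 0.557),
    "struggling": (10, 0.308),
}

total_students = sum(n for n, _ in profile_summary.values())
student_weighted_accuracy = (
    sum(n * acc for n, acc in profile_summary.values()) / total_students
)

print(total_students, f"{student_weighted_accuracy:.1%}")
# prints: 100 44.5%
```

The result is close to, but not identical to, the pooled accuracy of 44.4%. The per-profile means give every student equal weight, whereas the pooled figure gives every interaction equal weight, and students have between 30 and 60 interactions each.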



## Analysis 4: Learning Patterns

In [16]:
def analyze_learning_patterns(interactions: List[Dict], students: List) -> Dict[str, Any]:
    """Analyze learning patterns over time."""
    
    student_interactions = defaultdict(list)
    for interaction in interactions:
        student_interactions[interaction['student_id']].append(interaction)
    
    monotonic_increasing = 0
    monotonic_decreasing = 0
    fluctuating = 0
    
    learning_curves = {}
    
    for student_id, student_ints in student_interactions.items():
        if len(student_ints) < 2:
            continue
        
        sorted_ints = sorted(student_ints, key=lambda x: x['timestamp'])
        masteries = [i['mastery_after'] for i in sorted_ints]
        
        learning_curves[student_id] = masteries
        
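        # Despite the counter names, this is not a strict monotonicity test:
        # it compares the mean mastery of the first and second halves of the
        # time-ordered sequence. 'fluctuating' is counted only when the two
        # means are exactly equal.
        half = len(masteries) // 2
        first_half_mean = np.mean(masteries[:half])
        second_half_mean = np.mean(masteries[half:])
        
        if second_half_mean > first_half_mean:
            monotonic_increasing += 1
        elif second_half_mean < first_half_mean:
            monotonic_decreasing += 1
        else:
            fluctuating += 1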
    
    total_students = len(learning_curves)
    
    return {
        'total_students_analyzed': total_students,
        'monotonic_increasing': {
            'count': monotonic_increasing,
            'percentage': (monotonic_increasing / total_students * 100) if total_students > 0 else 0
        },
        'monotonic_decreasing': {
            'count': monotonic_decreasing,
            'percentage': (monotonic_decreasing / total_students * 100) if total_students > 0 else 0
        },
        'fluctuating': {
            'count': fluctuating,
            'percentage': (fluctuating / total_students * 100) if total_students > 0 else 0
        }
    }

learning_patterns = analyze_learning_patterns(interactions, students)

print("\n📊 Learning Pattern Analysis:\n")
print(f"  Students Analyzed: {learning_patterns['total_students_analyzed']}")
print("\n  Trends:")
print(f"    Increasing trend: {learning_patterns['monotonic_increasing']['count']} ({learning_patterns['monotonic_increasing']['percentage']:.1f}%)")
print(f"    Decreasing trend: {learning_patterns['monotonic_decreasing']['count']} ({learning_patterns['monotonic_decreasing']['percentage']:.1f}%)")
print(f"    Fluctuating: {learning_patterns['fluctuating']['count']} ({learning_patterns['fluctuating']['percentage']:.1f}%)")


📊 Learning Pattern Analysis:

  Students Analyzed: 100

  Trends:
    Increasing trend: 81 (81.0%)
    Decreasing trend: 19 (19.0%)
    Fluctuating: 0 (0.0%)
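
The classification above compares the mean mastery of the two halves of each student's sequence, so a curve that rises overall but dips along the way still counts as increasing. The sketch below is one possible complement, not part of the notebook's pipeline: `classify_curve` applies a strict step-by-step check and `trend_slope` fits a least-squares line. Both function names and the tolerance `tol` are hypothetical.

```python
import numpy as np


def classify_curve(masteries, tol=1e-9):
    """Label a mastery sequence by the sign of its step-to-step changes."""
    diffs = np.diff(np.asarray(masteries, dtype=float))
    if diffs.size == 0:
        return "too_short"
    if np.all(diffs >= -tol):
        return "non_decreasing"
    if np.all(diffs <= tol):
        return "non_increasing"
    return "fluctuating"


def trend_slope(masteries):
    """Slope of the least-squares line through (index, mastery) points."""
    y = np.asarray(masteries, dtype=float)
    x = np.arange(y.size)
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)


# Toy sequences for illustration (not taken from the dataset)
steady = [0.2, 0.3, 0.3, 0.5]
bumpy = [0.2, 0.4, 0.3, 0.5]

print(classify_curve(steady), classify_curve(bumpy))
print(round(trend_slope(bumpy), 3))
```

Here `bumpy` is labelled `fluctuating` by the strict check even though its fitted slope is positive, which is the case the half-mean comparison would report as increasing.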


## Analysis 5: Correlation Between Parameters and Performance

In [17]:
def analyze_parameter_correlations(interactions: List[Dict], students: List) -> Dict[str, Any]:
    """Analyze the correlation between student parameters and their performance."""
    
    student_interactions = defaultdict(list)
    for interaction in interactions:
        student_interactions[interaction['student_id']].append(interaction)
    
    data = []
    for student in students:
        student_id = student['id']
        student_ints = student_interactions.get(student_id, [])
        if not student_ints:
            continue
        
        accuracy = sum(1 for i in student_ints if i['is_correct']) / len(student_ints)
        avg_mastery = np.mean([i['mastery_after'] for i in student_ints])
        
        data.append({
            'student_id': student_id,
            'accuracy': accuracy,
            'avg_mastery': avg_mastery,
            'learn_rate': student.get('learn_rate', 0),
            'logic_skill': student.get('logic_skill', 0),
            'reading_skill': student.get('reading_skill', 0),
            'tech_familiarity': student.get('tech_familiarity', 0),
            'mastery_init': student.get('mastery_init_level', 0),
            'slip': student.get('slip', 0),
            'guess': student.get('guess', 0)
        })
    
    df = pd.DataFrame(data)
    
    correlations = {}
    for param in ['learn_rate', 'logic_skill', 'reading_skill',
                  'tech_familiarity', 'mastery_init', 'slip', 'guess']:
        if param in df.columns:
            corr_with_accuracy = df[param].corr(df['accuracy'])
            corr_with_mastery = df[param].corr(df['avg_mastery'])
            correlations[param] = {
                'correlation_with_accuracy': round(corr_with_accuracy, 3),
                'correlation_with_mastery': round(corr_with_mastery, 3)
            }
    
    sorted_by_accuracy = sorted(
        correlations.items(),
        key=lambda x: abs(x[1]['correlation_with_accuracy']),
        reverse=True
    )
    
    return {
        'correlations': dict(sorted_by_accuracy),
        'top_3_factors': [f[0] for f in sorted_by_accuracy[:3]]
    }

correlations = analyze_parameter_correlations(interactions, students)

print("\n📊 Correlation Between Parameters and Performance:\n")
print("  Parameter | Correlation with Accuracy | Correlation with Mastery")
print("  " + "-" * 60)
for param, corrs in list(correlations['correlations'].items())[:10]:
    acc_corr = corrs['correlation_with_accuracy']
    mas_corr = corrs['correlation_with_mastery']
    print(f"  {param:20s} | {acc_corr:>6.3f} | {mas_corr:>6.3f}")

print("\n  Top 3 Factors (by absolute correlation with accuracy):")
for i, factor in enumerate(correlations['top_3_factors'], 1):
    print(f"    {i}. {factor}")


📊 Correlation Between Parameters and Performance:

  Parameter | Correlation with Accuracy | Correlation with Mastery
  ------------------------------------------------------------
  mastery_init         |  0.730 |  0.956
  tech_familiarity     |  0.617 |  0.857
  learn_rate           |  0.585 |  0.785
  logic_skill          |  0.458 |  0.633
  guess                | -0.208 | -0.367
  reading_skill        |  0.196 |  0.258
  slip                 | -0.025 | -0.119

  Top 3 Factors (by absolute correlation with accuracy):
    1. mastery_init
    2. tech_familiarity
    3. learn_rate
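
The coefficients above come from `Series.corr`, which computes the Pearson correlation by default. The sketch below spells that computation out with NumPy and compares it with `np.corrcoef`. The helper `pearson_r` and the toy values are illustrative only and are not taken from the dataset.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0:
        # A constant input has no defined correlation
        return float("nan")
    return float((xc * yc).sum() / denom)


# Toy values for illustration (hypothetical, not from students.json)
learn_rate = [0.10, 0.20, 0.30, 0.40]
accuracy = [0.30, 0.35, 0.50, 0.45]

r_manual = pearson_r(learn_rate, accuracy)
r_numpy = float(np.corrcoef(learn_rate, accuracy)[0, 1])
print(round(r_manual, 3), round(r_numpy, 3))
```

With 100 students, a Pearson coefficient whose absolute value is below roughly 0.2 is not distinguishable from zero at the 5% level, which is where `reading_skill` (0.196) and `slip` (-0.025) sit. Pearson correlation also measures only linear association, so the ranking is a description of this dataset rather than a statement about causal importance.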


## Compiling the Final Report

In [18]:
def convert_to_json_serializable(obj):
    """Recursively convert NumPy types to native Python types."""
    if isinstance(obj, dict):
        return {key: convert_to_json_serializable(value) for key, value in obj.items()}
    elif isinstance(obj, list):
        return [convert_to_json_serializable(item) for item in obj]
    elif isinstance(obj, np.bool_):
        return bool(obj)
    elif isinstance(obj, np.integer):  # covers np.int32, np.int64, ...
        return int(obj)
    elif isinstance(obj, np.floating):  # covers np.float32, np.float64, ...
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj

final_report = {
    "metadata": {
        "description": "Statistical analysis report for the SINKT interactions",
        "version": "2.0.0",
        "created_at": datetime.now().isoformat(),
        "total_profiles": len(profiles),
        "total_students": len(students),
        "total_interactions": len(interactions)
    },
    "general_statistics": general_stats,
    "error_analysis": error_analysis,
    "profile_performance": profile_performance,
    "learning_patterns": learning_patterns,
    "parameter_correlations": correlations,
    "summary": {
        # Heuristic labels; the values are kept in Portuguese as written to the report.
        "data_quality": "ALTA" if 0.3 < general_stats['accuracy'] < 0.9 else "MÉDIA",
        "learning_pattern_quality": "EXCELENTE" if learning_patterns['monotonic_increasing']['percentage'] > 70 else "BOA",
        # Note: this only checks that more than one profile has data,
        # not that the profiles' metrics actually differ.
        "profile_differentiation": "CLARA" if len(profile_performance) > 1 else "FRACA"
    }
}

final_report = convert_to_json_serializable(final_report)

report_file = 'data/output/analysis_report.json'
with open(report_file, 'w', encoding='utf-8') as f:
    json.dump(final_report, f, indent=2, ensure_ascii=False)

print(f"✅ Report saved to: {report_file}")

✅ Report saved to: data/output/analysis_report.json
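
The recursive conversion above is needed because `json.dump` rejects most NumPy scalars and arrays. A possible alternative, sketched below, is to pass a `default` hook to the standard `json` module so that values are converted only when the encoder encounters them. The hook name `numpy_default` and the sample payload are hypothetical.

```python
import json

import numpy as np


def numpy_default(obj):
    """Fallback for json: convert NumPy scalars and arrays to native types."""
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Hypothetical payload mixing NumPy and native values
payload = {
    "mean": np.float64(0.549),
    "count": np.int64(4450),
    "curve": np.array([0.1, 0.2]),
}

text = json.dumps(payload, default=numpy_default)
restored = json.loads(text)
print(restored)
```

One limitation of this approach is that the `default` hook is applied to values only, not to dictionary keys, so a report keyed by NumPy integers would still need its keys converted beforehand.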


## Final Summary

In [19]:
print("\n" + "="*70)
print("🎉 STATISTICAL ANALYSIS COMPLETED SUCCESSFULLY!")
print("="*70)
print("\n📁 File generated:")
print(f"  - {report_file}")
print("\n📊 Executive Summary:")
print(f"  - Data Quality: {final_report['summary']['data_quality']}")
print(f"  - Learning Pattern Quality: {final_report['summary']['learning_pattern_quality']}")
print(f"  - Profile Differentiation: {final_report['summary']['profile_differentiation']}")
print("\n✅ Next step: See 'docs/03_respostas_obrigatorias.md' for answers to the questions")
print("\n" + "="*70)


🎉 STATISTICAL ANALYSIS COMPLETED SUCCESSFULLY!

📁 File generated:
  - data/output/analysis_report.json

📊 Executive Summary:
  - Data Quality: ALTA
  - Learning Pattern Quality: EXCELENTE
  - Profile Differentiation: CLARA

✅ Next step: See 'docs/03_respostas_obrigatorias.md' for answers to the questions

