# WhatsApp Chat Data Analysis
## Cedros Alumni Group - "A nossa turma"

This notebook explores the WhatsApp chat export to understand message patterns, participant activity, and group dynamics.


## 1. Data Loading and Cleaning

In [None]:
import pandas as pd
import numpy as np
import re
from collections import Counter, defaultdict
from datetime import datetime

# Load the data, handling trailing commas and multi-line messages
df = pd.read_csv('../w2.txt',
                 on_bad_lines='skip',
                 usecols=[0, 1, 2, 3],
                 names=['date', 'time', 'name', 'text'],
                 skiprows=1)

# Keep only rows with a valid date (YY-MM-DD format)
df = df[df['date'].str.match(r'^\d{2}-\d{2}-\d{2}$', na=False)].copy()

# Convert the date column to datetime
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')

print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")
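
With a comma-delimited reader, message text that itself contains commas may be truncated or skipped, and continuation lines of multi-line messages are removed by the date filter. A minimal alternative sketch follows. It assumes each message line has the form `YY-MM-DD,HH:MM,name,text` (inferred from the column names, not confirmed against the export) and that names contain no commas. `parse_chat_lines` is a hypothetical helper, and the sample lines are made up for illustration.

```python
import re

import pandas as pd

# Assumed line layout: YY-MM-DD,HH:MM,name,text (hypothetical, inferred from the column names)
MESSAGE_START = re.compile(r'^\d{2}-\d{2}-\d{2},\d{1,2}:\d{2},')


def parse_chat_lines(lines):
    """Build a DataFrame from raw lines, keeping commas and line breaks in the text."""
    rows = []
    for raw in lines:
        line = raw.rstrip('\n')
        if MESSAGE_START.match(line):
            # Split on the first three commas only, so commas in the text survive
            parts = line.split(',', 3)
            if len(parts) < 4:
                parts.append('')
            rows.append(parts)
        elif rows:
            # Continuation of a multi-line message: append to the previous text
            rows[-1][3] += '\n' + line
    return pd.DataFrame(rows, columns=['date', 'time', 'name', 'text'])


# Made-up sample lines, not taken from the real chat
sample = [
    '22-02-01,13:05,Ana,Olá, tudo bem?\n',
    'segunda linha\n',
    '22-02-01,13:06,Rui,Parabéns!\n',
]
parsed = parse_chat_lines(sample)
print(parsed)
```

The resulting frame has the same four columns as `df`, so the date filter and conversion above would apply unchanged.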

In [None]:
# Preview data
df.head(10)

## 2. Dataset Overview

In [None]:
print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"Total messages: {len(df):,}")
print(f"Date range: {df['date'].min().strftime('%Y-%m-%d')} to {df['date'].max().strftime('%Y-%m-%d')}")
print(f"Duration: {(df['date'].max() - df['date'].min()).days} days (~{(df['date'].max() - df['date'].min()).days // 30} months)")
print(f"Unique participants: {df['name'].nunique()}")

days_active = df['date'].nunique()
print(f"Days with activity: {days_active}")
print(f"Avg messages/day: {len(df) / days_active:.1f}")

## 3. Top Contributors

In [None]:
print("=" * 60)
print("TOP 15 CONTRIBUTORS (by message count)")
print("=" * 60)
top_contributors = df['name'].value_counts().head(15)
total = len(df)
for name, count in top_contributors.items():
    pct = count / total * 100
    bar = '█' * int(pct)
    print(f"{name[:25]:<25} {count:>5} ({pct:>5.1f}%) {bar}")

print(f"\n{'─' * 60}")
print(f"Top 5 contributors account for {top_contributors.head(5).sum() / total * 100:.1f}% of all messages")
print(f"Top 10 contributors account for {top_contributors.head(10).sum() / total * 100:.1f}% of all messages")

## 4. Activity by Hour of Day

In [None]:
df['hour'] = df['time'].str.split(':').str[0].astype(int)

print("=" * 60)
print("ACTIVITY BY HOUR OF DAY")
print("=" * 60)
hourly = df['hour'].value_counts().sort_index()
max_hour = hourly.max()
for hour in range(24):
    count = hourly.get(hour, 0)
    bar_len = int(count / max_hour * 40)
    bar = '█' * bar_len
    print(f"{hour:02d}:00 {count:>5} {bar}")

peak_hours = hourly.nlargest(3)
print(f"\nPeak hours: {', '.join([f'{h:02d}:00 ({c} msgs)' for h, c in peak_hours.items()])}")

## 5. Monthly Activity

In [None]:
df['year_month'] = df['date'].dt.to_period('M')

print("=" * 60)
print("MONTHLY ACTIVITY")
print("=" * 60)
monthly = df['year_month'].value_counts().sort_index()
max_month = monthly.max()

for period, count in monthly.items():
    bar_len = int(count / max_month * 40)
    bar = '█' * bar_len
    print(f"{period} {count:>5} {bar}")

print(f"\n{'─' * 60}")
print(f"Most active month: {monthly.idxmax()} ({monthly.max()} messages)")
print(f"Least active month: {monthly.idxmin()} ({monthly.min()} messages)")
print(f"Average messages/month: {monthly.mean():.0f}")

## 6. Day of Week Analysis

In [None]:
df['day_of_week'] = df['date'].dt.dayofweek  # 0=Monday, 6=Sunday

print("=" * 60)
print("ACTIVITY BY DAY OF WEEK")
print("=" * 60)
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
daily = df['day_of_week'].value_counts().sort_index()
max_day = daily.max()

for day_num, count in daily.items():
    bar_len = int(count / max_day * 40)
    bar = '█' * bar_len
    print(f"{days[day_num]:<12} {count:>5} {bar}")

print(f"\n{'─' * 60}")
weekday_msgs = df[df['day_of_week'] < 5]['day_of_week'].count()
weekend_msgs = df[df['day_of_week'] >= 5]['day_of_week'].count()
print(f"Weekday messages: {weekday_msgs} ({weekday_msgs/len(df)*100:.1f}%)")
print(f"Weekend messages: {weekend_msgs} ({weekend_msgs/len(df)*100:.1f}%)")
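
Raw weekday and weekend totals are not directly comparable, because a week has five weekdays and only two weekend days. The sketch below divides each share by its number of days. `per_day_share` is a hypothetical helper, and the example input is the 74.6% / 25.4% split reported in the Summary.

```python
def per_day_share(weekday_pct, weekend_pct):
    """Return the average share of messages per single weekday and per single weekend day."""
    return weekday_pct / 5, weekend_pct / 2


weekday_avg, weekend_avg = per_day_share(74.6, 25.4)
print(f"Per weekday: {weekday_avg:.1f}%  Per weekend day: {weekend_avg:.1f}%")
```

On these figures a single weekday carries about 14.9% of messages and a single weekend day about 12.7%, so the weekday lead is real but modest.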

## 7. Message Content Analysis

In [None]:
print("=" * 60)
print("MESSAGE CONTENT ANALYSIS")
print("=" * 60)

# Media files
media_count = df['text'].str.contains('<Ficheiro não revelado>', na=False).sum()
print(f"Media files shared: {media_count} ({media_count/len(df)*100:.1f}%)")

# Messages with emojis
emoji_pattern = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF]+")
emoji_msgs = df['text'].apply(lambda x: bool(emoji_pattern.search(str(x)))).sum()
print(f"Messages with emojis: {emoji_msgs} ({emoji_msgs/len(df)*100:.1f}%)")

# Message length distribution
df['msg_len'] = df['text'].fillna('').apply(len)
print(f"\nMessage length stats:")
print(f"  Average: {df['msg_len'].mean():.0f} characters")
print(f"  Median: {df['msg_len'].median():.0f} characters")
print(f"  Max: {df['msg_len'].max()} characters")
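
The Summary lists "abraço" and "parabéns" as top keywords. One way such a count could be produced is sketched below. The tokenizer, the minimum word length, and the `top_keywords` helper are assumptions for illustration, not necessarily the method behind the reported keywords. The sample messages are made up.

```python
import re
from collections import Counter

import pandas as pd


def top_keywords(texts, n=10, min_len=5):
    """Count lowercase word tokens of at least min_len characters."""
    counts = Counter()
    for text in texts.dropna():
        # \w matches accented letters in Python 3, so 'abraço' stays one token
        for word in re.findall(r'\w+', str(text).lower()):
            if len(word) >= min_len and not word.isdigit():
                counts[word] += 1
    return counts.most_common(n)


# Made-up sample messages, not taken from the real chat
demo = pd.Series(['Parabéns! Grande abraço', 'Um abraço', None, 'parabéns'])
keywords = top_keywords(demo, n=2)
print(keywords)
```

On the real data, rows holding the media placeholder would likely need to be filtered out first, since its words would otherwise rank among the top tokens.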

## 8. Birthday Celebrations

In [None]:
print("=" * 60)
print("BIRTHDAY/CELEBRATION PATTERNS")
print("=" * 60)

# Proxy: any message containing "parabéns" (congratulations) is counted as a birthday wish
birthday_msgs = df[df['text'].str.contains('parabéns', case=False, na=False)]
print(f"Total birthday messages: {len(birthday_msgs)}")

# Days with most birthday wishes
birthday_days = birthday_msgs['date'].value_counts().head(10)
print(f"\nTop 10 celebration days:")
for date, count in birthday_days.items():
    print(f"  {date.strftime('%Y-%m-%d')} ({date.strftime('%A')[:3]}): {count} messages")

# Birthday celebrations by month
birthday_months = birthday_msgs['date'].dt.month.value_counts().sort_index()
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
print(f"\nBirthday messages by month:")
for m in range(1, 13):
    count = birthday_months.get(m, 0)
    bar = '█' * (count // 5)
    print(f"  {months[m-1]}: {count:>4} {bar}")

## 9. Interaction Patterns

In [None]:
df['datetime'] = pd.to_datetime(df['date'].astype(str) + ' ' + df['time'], format='%Y-%m-%d %H:%M', errors='coerce')

print("=" * 60)
print("CONVERSATION FLOW ANALYSIS")
print("=" * 60)

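# Count pairs of consecutive messages from different senders sent within 5 minutes.
# Pairs are undirected: A replying to B and B replying to A count as the same pair.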
df_sorted = df.sort_values('datetime').dropna(subset=['datetime'])
reply_pairs = defaultdict(int)
prev_row = None

for idx, row in df_sorted.iterrows():
    if prev_row is not None:
        time_diff = (row['datetime'] - prev_row['datetime']).total_seconds() / 60
        if time_diff <= 5 and row['name'] != prev_row['name']:
            pair = tuple(sorted([prev_row['name'], row['name']]))
            reply_pairs[pair] += 1
    prev_row = row

print("Top 15 most interacting pairs (5-min window):")
sorted_pairs = sorted(reply_pairs.items(), key=lambda x: x[1], reverse=True)[:15]
for (p1, p2), count in sorted_pairs:
    p1_short = p1[:20]
    p2_short = p2[:20]
    bar = '█' * (count // 10)
    print(f"  {p1_short} <-> {p2_short}: {count} {bar}")
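
The Summary also reports a top conversation starter, which the pair count above does not produce. One possible definition, assumed here rather than taken from the original analysis, is that a message starts a conversation when it follows at least 60 minutes of silence. `conversation_starters` and the `gap_minutes` threshold are hypothetical, and the demo frame is made up.

```python
import pandas as pd


def conversation_starters(frame, gap_minutes=60):
    """Count, per sender, messages that follow a silence longer than gap_minutes."""
    ordered = frame.dropna(subset=['datetime']).sort_values('datetime')
    gap = ordered['datetime'].diff()
    # The first message has no predecessor (NaT gap) and counts as a start
    starts = gap.isna() | (gap > pd.Timedelta(minutes=gap_minutes))
    return ordered.loc[starts, 'name'].value_counts()


# Made-up demo data, not taken from the real chat
demo = pd.DataFrame({
    'name': ['Ana', 'Rui', 'Ana', 'Rui'],
    'datetime': pd.to_datetime([
        '2022-02-01 09:00', '2022-02-01 09:03',
        '2022-02-01 13:00', '2022-02-02 08:00',
    ]),
})
starters = conversation_starters(demo)
print(starters)
```

The function expects the `name` and `datetime` columns built earlier in this section, so it could be called as `conversation_starters(df)`. The counts depend heavily on the chosen threshold.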

## 10. Participant Engagement

In [None]:
print("=" * 60)
print("PARTICIPANT ENGAGEMENT")
print("=" * 60)

# Most consistent participants (appeared in most months)
participant_months = df.groupby('name')['year_month'].nunique()
consistent = participant_months.sort_values(ascending=False).head(10)
print(f"Most consistent participants (months active):")
for name, months_active in consistent.items():
    name_short = name[:25]
    print(f"  {name_short:<25}: {months_active} months")

# Participant longevity
first_msg = df.groupby('name')['date'].min()
last_msg = df.groupby('name')['date'].max()
participation_days = (last_msg - first_msg).dt.days
long_term = participation_days[participation_days > 365].count()
print(f"\nParticipants active for >1 year: {long_term}")

---

# Summary

## Key Findings

### Dataset Overview
- **9,470 messages** from **40 participants** over **~4 years** (Feb 2022 - Jan 2026)
- Average of **13.6 messages/day** on active days (694 days with activity)

### Participant Dynamics
- **Top 5 contributors** account for ~49% of all messages
- **Top 10 contributors** account for ~70% of all messages
- **32 participants** have been active for over 1 year
- Most consistent members: Tiago Burnay, cedros, Rui Pedro (38 months each)

### Temporal Patterns
- **Peak hour**: 13:00 (lunch time) with 863 messages
- **Most active day**: Tuesday (1,766 messages)
- **Weekday vs Weekend**: 74.6% weekday, 25.4% weekend
- **Most active month**: January 2024 (833 messages)
- **Quietest month**: July 2023 (6 messages), possibly a vacation period

### Content Characteristics
- **18.5% of messages** contain media files
- **21.7% of messages** include emojis
- **Average message length**: 57 characters (median: 27)
- **Top keywords**: "abraço" (hug), "parabéns" (congratulations), suggesting a warm, celebratory group culture

### Birthday Celebrations
- **1,286 messages** contain "parabéns" (congratulations), used here as a proxy for birthday wishes
- **Peak months**: May (220) and November (211)
- **No birthday messages in July** (possibly a vacation or inactive period)

### Social Network
- Most interactive pair: +351 912 551 133 <-> Colegio Cedros Com Foto S (188 interactions)
- Rui Pedro is the top **conversation starter** (204 conversations initiated)

### Group Culture
This is a **close-knit school reunion group** characterized by:
- Regular birthday celebrations (parabéns)
- Warm greetings (abraço = hug)
- Strong core of consistent participants
- Activity pattern consistent with working professionals (lunchtime peak, somewhat more activity per weekday than per weekend day)
- Members with international phone numbers (Portugal +351, Australia +61, Brazil +55)