{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 📊 Exploratory Data Analysis - Olist E-Commerce\n",
    "\n",
    "**Exploratory Data Analysis of Brazilian E-Commerce**\n",
    "\n",
    "---\n",
    "\n",
    "## 🎯 Objectives\n",
    "\n",
    "1. Understand the structure and quality of the data\n",
    "2. Identify patterns, trends, and anomalies\n",
    "3. Validate initial hypotheses\n",
    "4. Generate insights for advanced analyses\n",
    "\n",
    "---\n",
    "\n",
    "**Author:** Andre Bomfim  \n",
    "**Date:** October 2025  \n",
    "**Dataset:** Brazilian E-Commerce (Olist) - 100k orders (2016-2018)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup and Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Imports\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import plotly.express as px\n",
    "import plotly.graph_objects as go\n",
    "from plotly.subplots import make_subplots\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Visualization settings\n",
    "plt.style.use('seaborn-v0_8-darkgrid')\n",
    "sns.set_palette('husl')\n",
    "%matplotlib inline\n",
    "\n",
    "# pandas display settings\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_rows', 100)\n",
    "pd.set_option('display.float_format', lambda x: f'{x:,.2f}')\n",
    "\n",
    "print(\"✅ Libraries loaded successfully!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the BigQuery helper\n",
    "import sys\n",
    "sys.path.append('..')\n",
    "\n",
    "from python.utils.bigquery_helper import BigQueryHelper\n",
    "from dotenv import load_dotenv\n",
    "import os\n",
    "\n",
    "# Load environment variables\n",
    "load_dotenv()\n",
    "\n",
    "# Initialize the helper\n",
    "bq = BigQueryHelper()\n",
    "\n",
    "print(\"✅ Connected to BigQuery\")\n",
    "print(f\"📊 Project: {bq.project_id}\")\n",
    "print(f\"📊 Dataset: {bq.dataset_id}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Data Loading"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List available tables\n",
    "tables = bq.list_tables()\n",
    "print(f\"📊 Available tables ({len(tables)}):\")\n",
    "for table in tables:\n",
    "    info = bq.get_table_info(table)\n",
    "    print(f\"  - {table:30s} | {info.get('num_rows', 0):>8,} rows | {info.get('size_mb', 0):>8.2f} MB\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the main data from the complete orders view.\n",
    "# Note: LIMIT without ORDER BY returns an arbitrary subset, not a random sample.\n",
    "query = f\"\"\"\n",
    "SELECT *\n",
    "FROM `{bq.project_id}.{bq.dataset_id}.vw_orders_complete`\n",
    "WHERE order_status = 'delivered'\n",
    "LIMIT 10000\n",
    "\"\"\"\n",
    "\n",
    "df_orders = bq.query_to_dataframe(query)\n",
    "\n",
    "print(f\"✅ Loaded {len(df_orders):,} orders\")\n",
    "print(f\"📅 Period: {df_orders['order_purchase_timestamp'].min()} to {df_orders['order_purchase_timestamp'].max()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load order items with details\n",
    "query = f\"\"\"\n",
    "SELECT *\n",
    "FROM `{bq.project_id}.{bq.dataset_id}.vw_order_items_detail`\n",
    "LIMIT 10000\n",
    "\"\"\"\n",
    "\n",
    "df_items = bq.query_to_dataframe(query)\n",
    "\n",
    "print(f\"✅ Loaded {len(df_items):,} order items\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Data Overview"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# General information\n",
    "print(\"=\" * 60)\n",
    "print(\"GENERAL INFORMATION - ORDERS\")\n",
    "print(\"=\" * 60)\n",
    "print(f\"Rows: {len(df_orders):,}\")\n",
    "print(f\"Columns: {len(df_orders.columns)}\")\n",
    "print(f\"Memory: {df_orders.memory_usage(deep=True).sum() / 1024**2:.2f} MB\")\n",
    "print()\n",
    "df_orders.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First rows\n",
    "print(\"📋 First 5 rows:\")\n",
    "df_orders.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Descriptive statistics\n",
    "print(\"📊 Descriptive Statistics:\")\n",
    "df_orders.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Data Quality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Missing values (column labels are kept in Portuguese because later cells reference them)\n",
    "missing = df_orders.isnull().sum()\n",
    "missing_pct = (missing / len(df_orders) * 100).round(2)\n",
    "\n",
    "missing_df = pd.DataFrame({\n",
    "    'Coluna': missing.index,\n",
    "    'Valores Faltantes': missing.values,\n",
    "    'Percentual (%)': missing_pct.values\n",
    "}).sort_values('Valores Faltantes', ascending=False)\n",
    "\n",
    "print(\"❌ Missing Values:\")\n",
    "print(missing_df[missing_df['Valores Faltantes'] > 0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize missing values\n",
    "fig, ax = plt.subplots(figsize=(12, 6))\n",
    "\n",
    "missing_plot = missing_df[missing_df['Valores Faltantes'] > 0].head(10)\n",
    "if len(missing_plot) > 0:\n",
    "    ax.barh(missing_plot['Coluna'], missing_plot['Percentual (%)'])\n",
    "    ax.set_xlabel('Share of Missing Values (%)')\n",
    "    ax.set_title('Top 10 Columns with Missing Values', fontsize=14, fontweight='bold')\n",
    "    ax.grid(axis='x', alpha=0.3)\n",
    "    plt.tight_layout()\n",
    "else:\n",
    "    ax.text(0.5, 0.5, '✅ No missing values found!',\n",
    "            ha='center', va='center', fontsize=16)\n",
    "    ax.axis('off')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Duplicates\n",
    "duplicates = df_orders.duplicated(subset=['order_id']).sum()\n",
    "print(f\"🔍 Duplicates found: {duplicates:,}\")\n",
    "\n",
    "if duplicates > 0:\n",
    "    print(f\"⚠️ {duplicates / len(df_orders) * 100:.2f}% of the records are duplicates\")\n",
    "else:\n",
    "    print(\"✅ No duplicates found!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Temporal Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare temporal data\n",
    "df_orders['order_date'] = pd.to_datetime(df_orders['order_purchase_timestamp'])\n",
    "df_orders['year_month'] = df_orders['order_date'].dt.to_period('M')\n",
    "\n",
    "# Orders per month (column labels are kept in Portuguese because later cells reference them)\n",
    "orders_by_month = df_orders.groupby('year_month').agg({\n",
    "    'order_id': 'count',\n",
    "    'payment_value': 'sum'\n",
    "}).reset_index()\n",
    "orders_by_month.columns = ['Mês', 'Pedidos', 'Receita']\n",
    "orders_by_month['Mês'] = orders_by_month['Mês'].astype(str)\n",
    "\n",
    "print(\"📅 Orders and Revenue by Month:\")\n",
    "orders_by_month.tail(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Temporal visualization\n",
    "fig = make_subplots(\n",
    "    rows=2, cols=1,\n",
    "    subplot_titles=('Orders per Month', 'Revenue per Month (R$)'),\n",
    "    vertical_spacing=0.15\n",
    ")\n",
    "\n",
    "# Orders\n",
    "fig.add_trace(\n",
    "    go.Scatter(x=orders_by_month['Mês'], y=orders_by_month['Pedidos'],\n",
    "               mode='lines+markers', name='Orders',\n",
    "               line=dict(color='royalblue', width=2),\n",
    "               marker=dict(size=8)),\n",
    "    row=1, col=1\n",
    ")\n",
    "\n",
    "# Revenue\n",
    "fig.add_trace(\n",
    "    go.Scatter(x=orders_by_month['Mês'], y=orders_by_month['Receita'],\n",
    "               mode='lines+markers', name='Revenue',\n",
    "               line=dict(color='green', width=2),\n",
    "               marker=dict(size=8)),\n",
    "    row=2, col=1\n",
    ")\n",
    "\n",
    "fig.update_xaxes(title_text=\"Month\", row=2, col=1)\n",
    "fig.update_yaxes(title_text=\"Orders\", row=1, col=1)\n",
    "fig.update_yaxes(title_text=\"Revenue (R$)\", row=2, col=1)\n",
    "\n",
    "fig.update_layout(height=700, showlegend=False,\n",
    "                  title_text=\"📈 Evolution over Time - Orders and Revenue\")\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Geographic Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Orders by state (column labels are kept in Portuguese because later cells reference them)\n",
    "orders_by_state = df_orders.groupby('customer_state').agg({\n",
    "    'order_id': 'count',\n",
    "    'payment_value': ['sum', 'mean']\n",
    "}).reset_index()\n",
    "orders_by_state.columns = ['Estado', 'Pedidos', 'Receita Total', 'Ticket Médio']\n",
    "orders_by_state = orders_by_state.sort_values('Pedidos', ascending=False)\n",
    "\n",
    "print(\"🗺️ Top 10 States by Number of Orders:\")\n",
    "orders_by_state.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Geographic visualization\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
    "\n",
    "# Top 10 states by orders (orders_by_state is already sorted by 'Pedidos')\n",
    "top_states = orders_by_state.head(10)\n",
    "axes[0].barh(top_states['Estado'], top_states['Pedidos'], color='skyblue')\n",
    "axes[0].set_xlabel('Number of Orders')\n",
    "axes[0].set_title('Top 10 States - Orders', fontweight='bold')\n",
    "axes[0].invert_yaxis()\n",
    "\n",
    "# Top 10 states by revenue (re-rank by revenue rather than reusing the order ranking)\n",
    "top_revenue = orders_by_state.nlargest(10, 'Receita Total')\n",
    "axes[1].barh(top_revenue['Estado'], top_revenue['Receita Total'], color='lightgreen')\n",
    "axes[1].set_xlabel('Total Revenue (R$)')\n",
    "axes[1].set_title('Top 10 States - Revenue', fontweight='bold')\n",
    "axes[1].invert_yaxis()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Product and Category Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Top categories (column labels are kept in Portuguese because later cells reference them)\n",
    "category_analysis = df_items.groupby('category_english').agg({\n",
    "    'order_id': 'count',\n",
    "    'price': 'sum',\n",
    "    'total_value': 'mean'\n",
    "}).reset_index()\n",
    "category_analysis.columns = ['Categoria', 'Itens Vendidos', 'Receita', 'Ticket Médio']\n",
    "category_analysis = category_analysis.sort_values('Receita', ascending=False)\n",
    "\n",
    "print(\"🛍️ Top 15 Categories by Revenue:\")\n",
    "category_analysis.head(15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Category visualization\n",
    "fig = px.treemap(\n",
    "    category_analysis.head(20),\n",
    "    path=['Categoria'],\n",
    "    values='Receita',\n",
    "    title='🗂️ Top 20 Categories by Revenue (Treemap)',\n",
    "    color='Receita',\n",
    "    color_continuous_scale='Viridis'\n",
    ")\n",
    "fig.update_layout(height=600)\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Average Order Value and Distribution Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Order value statistics\n",
    "print(\"💰 Order Value Statistics:\")\n",
    "print(f\"Mean: R$ {df_orders['payment_value'].mean():,.2f}\")\n",
    "print(f\"Median: R$ {df_orders['payment_value'].median():,.2f}\")\n",
    "print(f\"Standard Deviation: R$ {df_orders['payment_value'].std():,.2f}\")\n",
    "print(f\"Minimum: R$ {df_orders['payment_value'].min():,.2f}\")\n",
    "print(f\"Maximum: R$ {df_orders['payment_value'].max():,.2f}\")\n",
    "print()\n",
    "print(f\"P25: R$ {df_orders['payment_value'].quantile(0.25):,.2f}\")\n",
    "print(f\"P50: R$ {df_orders['payment_value'].quantile(0.50):,.2f}\")\n",
    "print(f\"P75: R$ {df_orders['payment_value'].quantile(0.75):,.2f}\")\n",
    "print(f\"P90: R$ {df_orders['payment_value'].quantile(0.90):,.2f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution of order values\n",
    "# Drop NaNs first: missing values break the histogram range and the boxplot statistics\n",
    "payment_values = df_orders['payment_value'].dropna()\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 5))\n",
    "\n",
    "# Histogram\n",
    "axes[0].hist(payment_values, bins=50, edgecolor='black', alpha=0.7)\n",
    "axes[0].axvline(payment_values.mean(), color='red',\n",
    "                linestyle='--', linewidth=2, label=f'Mean: R$ {payment_values.mean():.2f}')\n",
    "axes[0].axvline(payment_values.median(), color='green',\n",
    "                linestyle='--', linewidth=2, label=f'Median: R$ {payment_values.median():.2f}')\n",
    "axes[0].set_xlabel('Order Value (R$)')\n",
    "axes[0].set_ylabel('Frequency')\n",
    "axes[0].set_title('Distribution of Order Values', fontweight='bold')\n",
    "axes[0].legend()\n",
    "axes[0].grid(alpha=0.3)\n",
    "\n",
    "# Boxplot (vertical is the default orientation)\n",
    "axes[1].boxplot(payment_values)\n",
    "axes[1].set_ylabel('Order Value (R$)')\n",
    "axes[1].set_title('Boxplot - Order Values', fontweight='bold')\n",
    "axes[1].grid(alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Review Score Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution of review scores\n",
    "review_dist = df_orders['review_score'].value_counts().sort_index()\n",
    "review_pct = (review_dist / review_dist.sum() * 100).round(2)\n",
    "\n",
    "review_summary = pd.DataFrame({\n",
    "    'Score': review_dist.index,\n",
    "    'Quantidade': review_dist.values,\n",
    "    'Percentual (%)': review_pct.values\n",
    "})\n",
    "\n",
    "print(\"⭐ Distribution of Review Scores:\")\n",
    "print(review_summary)\n",
    "print()\n",
    "# This is the mean of the 1-5 review score, not a Net Promoter Score\n",
    "print(f\"Average Review Score: {df_orders['review_score'].mean():.2f}\")\n",
    "print(f\"Positive Reviews (4-5): {review_pct[review_pct.index >= 4].sum():.2f}%\")\n",
    "print(f\"Negative Reviews (1-2): {review_pct[review_pct.index <= 2].sum():.2f}%\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Review visualization\n",
    "fig = px.bar(\n",
    "    review_summary,\n",
    "    x='Score',\n",
    "    y='Quantidade',\n",
    "    text='Percentual (%)',\n",
    "    title='⭐ Distribution of Review Scores',\n",
    "    labels={'Quantidade': 'Count'},\n",
    "    color='Score',\n",
    "    color_continuous_scale='RdYlGn'\n",
    ")\n",
    "fig.update_traces(texttemplate='%{text}%', textposition='outside')\n",
    "fig.update_layout(height=500, showlegend=False)\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. Delivery Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delivery statistics\n",
    "delayed_count = (df_orders['is_delayed'] == True).sum()\n",
    "\n",
    "print(\"🚚 Delivery Time Statistics:\")\n",
    "print(f\"Mean: {df_orders['delivery_days'].mean():.1f} days\")\n",
    "print(f\"Median: {df_orders['delivery_days'].median():.1f} days\")\n",
    "print(f\"P90: {df_orders['delivery_days'].quantile(0.90):.1f} days\")\n",
    "print()\n",
    "print(f\"Average Delay: {df_orders['delivery_delay_days'].mean():.1f} days\")\n",
    "print(f\"Delayed Orders: {delayed_count:,} ({delayed_count / len(df_orders) * 100:.2f}%)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Relationship between delivery time and review score\n",
    "# Note: trendline='ols' requires the statsmodels package; random_state keeps the sample reproducible\n",
    "fig = px.scatter(\n",
    "    df_orders.sample(min(1000, len(df_orders)), random_state=42),\n",
    "    x='delivery_days',\n",
    "    y='review_score',\n",
    "    trendline='ols',\n",
    "    title='📦 Correlation: Delivery Time vs Review Score',\n",
    "    labels={'delivery_days': 'Days to Delivery', 'review_score': 'Review Score'},\n",
    "    opacity=0.5\n",
    ")\n",
    "fig.update_layout(height=500)\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 11. Key Insights and Findings\n",
    "\n",
    "### 📊 Main Findings:\n",
    "\n",
    "1. **Volume and Growth**\n",
    "   - The dataset contains ~100k orders from 2016-2018\n",
    "   - Steady growth over the period\n",
    "   - Visible seasonality (peaks in Nov/Dec)\n",
    "\n",
    "2. **Geography**\n",
    "   - SP dominates with ~42% of orders\n",
    "   - The South/Southeast account for 75%+ of volume\n",
    "   - The North/Northeast have lower penetration\n",
    "\n",
    "3. **Products**\n",
    "   - Best-selling categories: Beauty, Furniture, Sports\n",
    "   - High category fragmentation (70+ categories)\n",
    "   - Opportunity in categories with high review scores but low volume\n",
    "\n",
    "4. **Satisfaction (Review Scores)**\n",
    "   - Average review score: ~4.0/5.0 (good)\n",
    "   - 75%+ positive reviews (4-5 stars)\n",
    "   - 11% negative reviews (1-2 stars)\n",
    "\n",
    "5. **Logistics**\n",
    "   - Average delivery time: ~12 days\n",
    "   - Delays significantly lower review scores\n",
    "   - Negative correlation between delivery time and satisfaction\n",
    "\n",
    "### 🎯 Next Steps:\n",
    "\n",
    "1. In-depth LTV analysis by segment\n",
    "2. Cohort retention analysis\n",
    "3. RFM customer segmentation\n",
    "4. Performance by category\n",
    "5. Logistics optimization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12. Clean Data Export"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save a clean version for the next analyses\n",
    "from pathlib import Path\n",
    "\n",
    "output_dir = Path('../data/processed')\n",
    "output_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing\n",
    "\n",
    "output_path = output_dir / 'orders_clean.csv'\n",
    "df_orders.to_csv(output_path, index=False)\n",
    "print(f\"✅ Data saved to: {output_path}\")\n",
    "\n",
    "output_path_items = output_dir / 'items_clean.csv'\n",
    "df_items.to_csv(output_path_items, index=False)\n",
    "print(f\"✅ Data saved to: {output_path_items}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n
    

In [None]:
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 📊 Exploratory Data Analysis - Olist E-Commerce\n",
    "\n",
    "**Análise Exploratória de Dados do E-commerce Brasileiro**\n",
    "\n",
    "---\n",
    "\n",
    "## 🎯 Objetivos\n",
    "\n",
    "1. Entender a estrutura e qualidade dos dados\n",
    "2. Identificar padrões, tendências e anomalias\n",
    "3. Validar hipóteses iniciais\n",
    "4. Gerar insights para análises avançadas\n",
    "\n",
    "---\n",
    "\n",
    "**Autor:** Andre Bomfim  \n",
    "**Data:** Outubro 2025  \n",
    "**Dataset:** Brazilian E-Commerce (Olist) - 100k pedidos (2016-2018)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup e Configuração"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Imports\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import plotly.express as px\n",
    "import plotly.graph_objects as go\n",
    "from plotly.subplots import make_subplots\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Configurações de visualização\n",
    "plt.style.use('seaborn-v0_8-darkgrid')\n",
    "sns.set_palette('husl')\n",
    "%matplotlib inline\n",
    "\n",
    "# Configurações do pandas\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_rows', 100)\n",
    "pd.set_option('display.float_format', lambda x: f'{x:,.2f}')\n",
    "\n",
    "print(\"✅ Bibliotecas carregadas com sucesso!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Importar helper do BigQuery\n",
    "import sys\n",
    "sys.path.append('..')\n",
    "\n",
    "from python.utils.bigquery_helper import BigQueryHelper\n",
    "from dotenv import load_dotenv\n",
    "import os\n",
    "\n",
    "# Carregar variáveis de ambiente\n",
    "load_dotenv()\n",
    "\n",
    "# Inicializar helper\n",
    "bq = BigQueryHelper()\n",
    "\n",
    "print(f\"✅ Conectado ao BigQuery\")\n",
    "print(f\"📊 Project: {bq.project_id}\")\n",
    "print(f\"📊 Dataset: {bq.dataset_id}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Carregamento de Dados"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Listar tabelas disponíveis\n",
    "tables = bq.list_tables()\n",
    "print(f\"📊 Tabelas disponíveis ({len(tables)}):\")\n",
    "for table in tables:\n",
    "    info = bq.get_table_info(table)\n",
    "    print(f\"  - {table:30s} | {info.get('num_rows', 0):>8,} linhas | {info.get('size_mb', 0):>8.2f} MB\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Carregar dados principais usando view completa\n",
    "query = f\"\"\"\n",
    "SELECT *\n",
    "FROM `{bq.project_id}.{bq.dataset_id}.vw_orders_complete`\n",
    "WHERE order_status = 'delivered'\n",
    "LIMIT 10000\n",
    "\"\"\"\n",
    "\n",
    "df_orders = bq.query_to_dataframe(query)\n",
    "\n",
    "print(f\"✅ Carregados {len(df_orders):,} pedidos\")\n",
    "print(f\"📅 Período: {df_orders['order_purchase_timestamp'].min()} a {df_orders['order_purchase_timestamp'].max()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Carregar items com detalhes\n",
    "query = f\"\"\"\n",
    "SELECT *\n",
    "FROM `{bq.project_id}.{bq.dataset_id}.vw_order_items_detail`\n",
    "LIMIT 10000\n",
    "\"\"\"\n",
    "\n",
    "df_items = bq.query_to_dataframe(query)\n",
    "\n",
    "print(f\"✅ Carregados {len(df_items):,} itens de pedidos\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Visão Geral dos Dados"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Informações gerais\n",
    "print(\"=\" * 60)\n",
    "print(\"INFORMAÇÕES GERAIS - ORDERS\")\n",
    "print(\"=\" * 60)\n",
    "print(f\"Linhas: {len(df_orders):,}\")\n",
    "print(f\"Colunas: {len(df_orders.columns)}\")\n",
    "print(f\"Memória: {df_orders.memory_usage(deep=True).sum() / 1024**2:.2f} MB\")\n",
    "print(f\"\")\n",
    "df_orders.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Primeiras linhas\n",
    "print(\"📋 Primeiras 5 linhas:\")\n",
    "df_orders.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Estatísticas descritivas\n",
    "print(\"📊 Estatísticas Descritivas:\")\n",
    "df_orders.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Qualidade dos Dados"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Valores faltantes\n",
    "missing = df_orders.isnull().sum()\n",
    "missing_pct = (missing / len(df_orders) * 100).round(2)\n",
    "\n",
    "missing_df = pd.DataFrame({\n",
    "    'Coluna': missing.index,\n",
    "    'Valores Faltantes': missing.values,\n",
    "    'Percentual (%)': missing_pct.values\n",
    "}).sort_values('Valores Faltantes', ascending=False)\n",
    "\n",
    "print(\"❌ Valores Faltantes:\")\n",
    "print(missing_df[missing_df['Valores Faltantes'] > 0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualizar valores faltantes\n",
    "fig, ax = plt.subplots(figsize=(12, 6))\n",
    "\n",
    "missing_plot = missing_df[missing_df['Valores Faltantes'] > 0].head(10)\n",
    "if len(missing_plot) > 0:\n",
    "    ax.barh(missing_plot['Coluna'], missing_plot['Percentual (%)'])\n",
    "    ax.set_xlabel('Percentual de Valores Faltantes (%)')\n",
    "    ax.set_title('Top 10 Colunas com Valores Faltantes', fontsize=14, fontweight='bold')\n",
    "    ax.grid(axis='x', alpha=0.3)\n",
    "    plt.tight_layout()\n",
    "else:\n",
    "    ax.text(0.5, 0.5, '✅ Nenhum valor faltante encontrado!', \n",
    "            ha='center', va='center', fontsize=16)\n",
    "    ax.axis('off')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Duplicatas\n",
    "duplicates = df_orders.duplicated(subset=['order_id']).sum()\n",
    "print(f\"🔍 Duplicatas encontradas: {duplicates:,}\")\n",
    "\n",
    "if duplicates > 0:\n",
    "    print(f\"⚠️ {duplicates / len(df_orders) * 100:.2f}% dos registros são duplicados\")\n",
    "else:\n",
    "    print(\"✅ Nenhuma duplicata encontrada!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Análise Temporal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Preparar dados temporais\n",
    "df_orders['order_date'] = pd.to_datetime(df_orders['order_purchase_timestamp'])\n",
    "df_orders['year_month'] = df_orders['order_date'].dt.to_period('M')\n",
    "\n",
    "# Pedidos por mês\n",
    "orders_by_month = df_orders.groupby('year_month').agg({\n",
    "    'order_id': 'count',\n",
    "    'payment_value': 'sum'\n",
    "}).reset_index()\n",
    "orders_by_month.columns = ['Mês', 'Pedidos', 'Receita']\n",
    "orders_by_month['Mês'] = orders_by_month['Mês'].astype(str)\n",
    "\n",
    "print(\"📅 Pedidos e Receita por Mês:\")\n",
    "orders_by_month.tail(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualização temporal\n",
    "fig = make_subplots(\n",
    "    rows=2, cols=1,\n",
    "    subplot_titles=('Pedidos por Mês', 'Receita por Mês (R$)'),\n",
    "    vertical_spacing=0.15\n",
    ")\n",
    "\n",
    "# Pedidos\n",
    "fig.add_trace(\n",
    "    go.Scatter(x=orders_by_month['Mês'], y=orders_by_month['Pedidos'],\n",
    "               mode='lines+markers', name='Pedidos',\n",
    "               line=dict(color='royalblue', width=2),\n",
    "               marker=dict(size=8)),\n",
    "    row=1, col=1\n",
    ")\n",
    "\n",
    "# Receita\n",
    "fig.add_trace(\n",
    "    go.Scatter(x=orders_by_month['Mês'], y=orders_by_month['Receita'],\n",
    "               mode='lines+markers', name='Receita',\n",
    "               line=dict(color='green', width=2),\n",
    "               marker=dict(size=8)),\n",
    "    row=2, col=1\n",
    ")\n",
    "\n",
    "fig.update_xaxes(title_text=\"Mês\", row=2, col=1)\n",
    "fig.update_yaxes(title_text=\"Pedidos\", row=1, col=1)\n",
    "fig.update_yaxes(title_text=\"Receita (R$)\", row=2, col=1)\n",
    "\n",
    "fig.update_layout(height=700, showlegend=False, \n",
    "                  title_text=\"📈 Evolução Temporal - Pedidos e Receita\")\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Análise Geográfica"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pedidos por estado\n",
    "orders_by_state = df_orders.groupby('customer_state').agg({\n",
    "    'order_id': 'count',\n",
    "    'payment_value': ['sum', 'mean']\n",
    "}).reset_index()\n",
    "orders_by_state.columns = ['Estado', 'Pedidos', 'Receita Total', 'Ticket Médio']\n",
    "orders_by_state = orders_by_state.sort_values('Pedidos', ascending=False)\n",
    "\n",
    "print(\"🗺️ Top 10 Estados por Número de Pedidos:\")\n",
    "orders_by_state.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualização geográfica\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
    "\n",
    "# Top 10 estados por pedidos\n",
    "top_states = orders_by_state.head(10)\n",
    "axes[0].barh(top_states['Estado'], top_states['Pedidos'], color='skyblue')\n",
    "axes[0].set_xlabel('Número de Pedidos')\n",
    "axes[0].set_title('Top 10 Estados - Pedidos', fontweight='bold')\n",
    "axes[0].invert_yaxis()\n",
    "\n",
    "# Top 10 estados por receita\n",
    "top_revenue = orders_by_state.head(10)\n",
    "axes[1].barh(top_revenue['Estado'], top_revenue['Receita Total'], color='lightgreen')\n",
    "axes[1].set_xlabel('Receita Total (R$)')\n",
    "axes[1].set_title('Top 10 Estados - Receita', fontweight='bold')\n",
    "axes[1].invert_yaxis()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Análise de Produtos e Categorias"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Top categorias\n",
    "category_analysis = df_items.groupby('category_english').agg({\n",
    "    'order_id': 'count',\n",
    "    'price': 'sum',\n",
    "    'total_value': 'mean'\n",
    "}).reset_index()\n",
    "category_analysis.columns = ['Categoria', 'Itens Vendidos', 'Receita', 'Ticket Médio']\n",
    "category_analysis = category_analysis.sort_values('Receita', ascending=False)\n",
    "\n",
    "print(\"🛍️ Top 15 Categorias por Receita:\")\n",
    "category_analysis.head(15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualização de categorias\n",
    "fig = px.treemap(\n",
    "    category_analysis.head(20),\n",
    "    path=['Categoria'],\n",
    "    values='Receita',\n",
    "    title='🗂️ Top 20 Categorias por Receita (Treemap)',\n",
    "    color='Receita',\n",
    "    color_continuous_scale='Viridis'\n",
    ")\n",
    "fig.update_layout(height=600)\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Análise de Ticket Médio e Distribuição"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics of order value\n",
    "print(\"💰 Order Value Statistics:\")\n",
    "print(f\"Mean: R$ {df_orders['payment_value'].mean():,.2f}\")\n",
    "print(f\"Median: R$ {df_orders['payment_value'].median():,.2f}\")\n",
    "print(f\"Standard deviation: R$ {df_orders['payment_value'].std():,.2f}\")\n",
    "print(f\"Minimum: R$ {df_orders['payment_value'].min():,.2f}\")\n",
    "print(f\"Maximum: R$ {df_orders['payment_value'].max():,.2f}\")\n",
    "print()\n",
    "print(f\"P25: R$ {df_orders['payment_value'].quantile(0.25):,.2f}\")\n",
    "print(f\"P50: R$ {df_orders['payment_value'].quantile(0.50):,.2f}\")\n",
    "print(f\"P75: R$ {df_orders['payment_value'].quantile(0.75):,.2f}\")\n",
    "print(f\"P90: R$ {df_orders['payment_value'].quantile(0.90):,.2f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution of order values\n",
    "# Drop missing values first: NaNs would break the histogram binning and the boxplot statistics\n",
    "payment_values = df_orders['payment_value'].dropna()\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 5))\n",
    "\n",
    "# Histogram with mean and median reference lines\n",
    "axes[0].hist(payment_values, bins=50, edgecolor='black', alpha=0.7)\n",
    "axes[0].axvline(payment_values.mean(), color='red',\n",
    "                linestyle='--', linewidth=2, label=f'Mean: R$ {payment_values.mean():.2f}')\n",
    "axes[0].axvline(payment_values.median(), color='green',\n",
    "                linestyle='--', linewidth=2, label=f'Median: R$ {payment_values.median():.2f}')\n",
    "axes[0].set_xlabel('Order Value (R$)')\n",
    "axes[0].set_ylabel('Frequency')\n",
    "axes[0].set_title('Distribution of Order Values', fontweight='bold')\n",
    "axes[0].legend()\n",
    "axes[0].grid(alpha=0.3)\n",
    "\n",
    "# Boxplot\n",
    "axes[1].boxplot(payment_values, vert=True)\n",
    "axes[1].set_ylabel('Order Value (R$)')\n",
    "axes[1].set_title('Boxplot - Order Values', fontweight='bold')\n",
    "axes[1].grid(alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
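  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The boxplot flags high-value orders as points beyond the upper whisker. A minimal sketch of how that share could be quantified, assuming the common 1.5 × IQR rule; the helper name `iqr_outlier_share` and the toy values are illustrative and not part of the original analysis. In this notebook one would pass `df_orders['payment_value']`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "def iqr_outlier_share(values: pd.Series, k: float = 1.5) -> float:\n",
    "    # Hypothetical helper: share of observations above Q3 + k * IQR\n",
    "    clean = values.dropna()\n",
    "    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)\n",
    "    upper_fence = q3 + k * (q3 - q1)\n",
    "    return (clean > upper_fence).mean()\n",
    "\n",
    "\n",
    "# Toy data for illustration only\n",
    "toy_values = pd.Series([50, 80, 100, 120, 150, 5000])\n",
    "print(f'Share above upper fence: {iqr_outlier_share(toy_values):.1%}')"
   ]
  },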
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Review Analysis (NPS Proxy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution of review scores (1-5 stars)\n",
    "review_dist = df_orders['review_score'].value_counts().sort_index()\n",
    "review_pct = (review_dist / review_dist.sum() * 100).round(2)\n",
    "\n",
    "review_summary = pd.DataFrame({\n",
    "    'Score': review_dist.index,\n",
    "    'Quantidade': review_dist.values,\n",
    "    'Percentual (%)': review_pct.values\n",
    "})\n",
    "\n",
    "print(\"⭐ Review Score Distribution:\")\n",
    "print(review_summary)\n",
    "print()\n",
    "# The mean star rating is a satisfaction proxy, not a true Net Promoter Score\n",
    "print(f\"Average review score: {df_orders['review_score'].mean():.2f}\")\n",
    "print(f\"Positive reviews (4-5): {review_pct[review_pct.index >= 4].sum():.2f}%\")\n",
    "print(f\"Negative reviews (1-2): {review_pct[review_pct.index <= 2].sum():.2f}%\")"
   ]
  },
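  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mean star rating reported above is a satisfaction proxy rather than a Net Promoter Score in the strict sense (percentage of promoters minus percentage of detractors). A minimal sketch of an NPS-style metric on the 1-5 scale, assuming 5-star reviews count as promoters and 1-3 star reviews as detractors; this mapping and the helper name `nps_from_scores` are illustrative assumptions, not something defined by the dataset. In this notebook one would pass `df_orders['review_score']`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "def nps_from_scores(scores: pd.Series, promoter_min: int = 5, detractor_max: int = 3) -> float:\n",
    "    # Hypothetical helper: % promoters minus % detractors (thresholds are assumptions)\n",
    "    valid = scores.dropna()\n",
    "    if valid.empty:\n",
    "        return float('nan')\n",
    "    promoters = (valid >= promoter_min).mean()\n",
    "    detractors = (valid <= detractor_max).mean()\n",
    "    return (promoters - detractors) * 100\n",
    "\n",
    "\n",
    "# Toy data for illustration only; missing scores are ignored\n",
    "example_scores = pd.Series([5, 5, 4, 3, 1, None])\n",
    "print(f'NPS-style score: {nps_from_scores(example_scores):.1f}')"
   ]
  },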
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Review score visualization\n",
    "fig = px.bar(\n",
    "    review_summary,\n",
    "    x='Score',\n",
    "    y='Quantidade',\n",
    "    text='Percentual (%)',\n",
    "    title='⭐ Review Score Distribution (NPS Proxy)',\n",
    "    labels={'Quantidade': 'Count'},\n",
    "    color='Score',\n",
    "    color_continuous_scale='RdYlGn'\n",
    ")\n",
    "fig.update_traces(texttemplate='%{text}%', textposition='outside')\n",
    "# 'Score' is numeric, so Plotly draws a color bar rather than a legend; hide both\n",
    "fig.update_layout(height=500, showlegend=False, coloraxis_showscale=False)\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. Delivery Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delivery time statistics\n",
    "print(\"🚚 Delivery Time Statistics:\")\n",
    "print(f\"Mean: {df_orders['delivery_days'].mean():.1f} days\")\n",
    "print(f\"Median: {df_orders['delivery_days'].median():.1f} days\")\n",
    "print(f\"P90: {df_orders['delivery_days'].quantile(0.90):.1f} days\")\n",
    "print()\n",
    "print(f\"Average delay: {df_orders['delivery_delay_days'].mean():.1f} days\")\n",
    "# Count delayed orders once; the share is relative to all orders in df_orders\n",
    "delayed_count = int((df_orders['is_delayed'] == True).sum())\n",
    "print(f\"Delayed orders: {delayed_count:,} ({delayed_count / len(df_orders) * 100:.2f}%)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
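    "# Relationship between delivery time and review score\n",
    "# Sample up to 1,000 complete rows; a fixed seed keeps the plot reproducible\n",
    "scatter_data = df_orders.dropna(subset=['delivery_days', 'review_score'])\n",
    "fig = px.scatter(\n",
    "    scatter_data.sample(min(1000, len(scatter_data)), random_state=42),\n",
    "    x='delivery_days',\n",
    "    y='review_score',\n",
    "    trendline='ols',  # requires the statsmodels package\n",
    "    title='📦 Correlation: Delivery Time vs Review Score',\n",
    "    labels={'delivery_days': 'Days to Delivery', 'review_score': 'Review Score'},\n",
    "    opacity=0.5\n",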
    ")\n",
    "fig.update_layout(height=500)\n",
    "fig.show()"
   ]
  },
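  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The scatter plot shows the trend visually. A rank correlation is one way to put a number on it. The sketch below computes a Spearman coefficient as the Pearson correlation of ranks, which suits the ordinal 1-5 review scale. It assumes the `delivery_days` and `review_score` columns used above; the helper name `delivery_review_correlation` and the toy data are illustrative only. In this notebook one would pass `df_orders`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "def delivery_review_correlation(df: pd.DataFrame) -> float:\n",
    "    # Hypothetical helper: keep only rows where both values are present\n",
    "    pairs = df[['delivery_days', 'review_score']].dropna()\n",
    "    # Spearman correlation = Pearson correlation computed on ranks (ties get average ranks)\n",
    "    return pairs['delivery_days'].rank().corr(pairs['review_score'].rank())\n",
    "\n",
    "\n",
    "# Toy data for illustration only\n",
    "toy_orders = pd.DataFrame({\n",
    "    'delivery_days': [3, 7, 12, 20, 35, None],\n",
    "    'review_score': [5, 5, 4, 2, 1, 3],\n",
    "})\n",
    "print(f'Spearman correlation: {delivery_review_correlation(toy_orders):.2f}')"
   ]
  },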
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 11. Key Insights and Findings\n",
    "\n",
    "### 📊 Main Findings:\n",
    "\n",
    "1. **Volume and Growth**\n",
    "   - The dataset contains ~100k orders from 2016-2018\n",
    "   - Steady growth over the period\n",
    "   - Visible seasonality (peaks in Nov/Dec)\n",
    "\n",
    "2. **Geography**\n",
    "   - SP dominates with ~42% of orders\n",
    "   - The South and Southeast account for 75%+ of volume\n",
    "   - The North and Northeast have lower penetration\n",
    "\n",
    "3. **Products**\n",
    "   - Best-selling categories: Beauty, Furniture, Sports\n",
    "   - Highly fragmented catalog (70+ categories)\n",
    "   - Opportunity in categories with high review scores but low volume\n",
    "\n",
    "4. **Satisfaction (NPS Proxy)**\n",
    "   - Average review score: ~4.0/5.0 (good)\n",
    "   - 75%+ positive reviews (4-5 stars)\n",
    "   - 11% negative reviews (1-2 stars)\n",
    "\n",
    "5. **Logistics**\n",
    "   - Average delivery time: ~12 days\n",
    "   - Delays are associated with markedly lower review scores\n",
    "   - Negative correlation between delivery time and satisfaction\n",
    "\n",
    "### 🎯 Next Steps:\n",
    "\n",
    "1. In-depth LTV analysis by segment\n",
    "2. Cohort retention analysis\n",
    "3. RFM customer segmentation\n",
    "4. Performance by category\n",
    "5. Logistics optimization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12. Export of Cleaned Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save cleaned versions for downstream analyses\n",
    "from pathlib import Path\n",
    "\n",
    "output_path = '../data/processed/orders_clean.csv'\n",
    "# Create the target directory if it does not exist yet\n",
    "Path(output_path).parent.mkdir(parents=True, exist_ok=True)\n",
    "df_orders.to_csv(output_path, index=False)\n",
    "print(f\"✅ Data saved to: {output_path}\")\n",
    "\n",
    "output_path_items = '../data/processed/items_clean.csv'\n",
    "df_items.to_csv(output_path_items, index=False)\n",
    "print(f\"✅ Data saved to: {output_path_items}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n

