<a href="https://colab.research.google.com/github/Santosdevbjj/relatoVendasLucros/blob/main/notebooks/analise_vendas_lucros.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# üìà An√°lise de Vendas e Modelagem de Lucro\n",
    "\n",
    "Este notebook realiza uma an√°lise explorat√≥ria e um **modelo de regress√£o** para prever *Lucro* com base em vari√°veis transacionais.\n",
    "\n",
    "**Fluxo:**\n",
    "1. Carregamento e limpeza dos dados\n",
    "2. An√°lise descritiva e visualiza√ß√µes\n",
    "3. Engenharia de features\n",
    "4. Treinamento e avalia√ß√£o de modelo (Regress√£o Linear)\n",
    "5. Interpreta√ß√£o e salvar o modelo\n",
    "\n",
    "> Observa√ß√£o: este notebook espera que exista o arquivo `../data/vendas_tratadas.csv` (gerado pelo ETL em `/src/etl_limpeza_dados.py`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1) Imports e configura√ß√µes\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
    "import joblib\n",
    "\n",
    "sns.set(style='whitegrid')\n",
    "plt.rcParams['figure.figsize'] = (10,6)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2) Carregar dados tratados\n",
    "df = pd.read_csv('../data/vendas_tratadas.csv')\n",
    "print('Linhas:', df.shape[0], ' Colunas:', df.shape[1])\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3) Limpeza e padroniza√ß√£o de colunas (defensivo)\n",
    "df.columns = [c.strip() for c in df.columns]\n",
    "# Padroniza nomes comuns caso tenham varia√ß√µes\n",
    "for col in ['Data', 'Data_venda', 'data', 'data_venda']:\n",
    "    if col in df.columns and 'Data' not in df.columns:\n",
    "        df.rename(columns={col: 'Data'}, inplace=True)\n",
    "\n",
    "if 'Data' in df.columns:\n",
    "    df['Data'] = pd.to_datetime(df['Data'])\n",
    "\n",
    "# Garante tipos num√©ricos para os campos usados\n",
    "numeric_cols = ['Quantidade', 'Preco_Unitario', 'Custo_Unitario', 'Receita', 'Custo', 'Lucro']\n",
    "for nc in numeric_cols:\n",
    "    if nc in df.columns:\n",
    "        df[nc] = pd.to_numeric(df[nc], errors='coerce')\n",
    "\n",
    "# Remove linhas com valores cr√≠ticos nulos\n",
    "df.dropna(subset=['Quantidade','Preco_Unitario','Lucro'], inplace=True)\n",
    "\n",
    "df.reset_index(drop=True, inplace=True)\n",
    "print('Ap√≥s limpeza ‚Äî Linhas:', df.shape[0])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Estat√≠sticas descritivas r√°pidas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[['Quantidade','Preco_Unitario','Receita','Custo','Lucro']].describe().round(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualiza√ß√µes - distribui√ß√£o e rela√ß√µes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Histograma do lucro\n",
    "plt.figure()\n",
    "sns.histplot(df['Lucro'], bins=20, kde=True)\n",
    "plt.title('Distribui√ß√£o do Lucro')\n",
    "plt.xlabel('Lucro')\n",
    "plt.show()\n",
    "\n",
    "# Boxplot por Categoria (se existir)\n",
    "if 'Categoria' in df.columns:\n",
    "    plt.figure(figsize=(12,6))\n",
    "    sns.boxplot(data=df, x='Categoria', y='Lucro')\n",
    "    plt.title('Lucro por Categoria')\n",
    "    plt.xticks(rotation=45)\n",
    "    plt.show()\n",
    "\n",
    "# Scatter: Pre√ßo unit√°rio x Lucro\n",
    "plt.figure()\n",
    "sns.scatterplot(data=df, x='Preco_Unitario', y='Lucro', size='Quantidade', hue='Nome_Regiao' if 'Nome_Regiao' in df.columns else None, alpha=0.8)\n",
    "plt.title('Pre√ßo Unit√°rio vs Lucro (bolha = quantidade)')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Engenharia de Features\n",
    "Criaremos features simples e robustas que servem bem para um modelo linear:\n",
    "- `Quantidade` (j√° existente)\n",
    "- `Preco_Unitario` (j√° existente)\n",
    "- `Receita` (Quantidade * Preco_Unitario) [se n√£o existir]\n",
    "- `Custo` (Quantidade * Custo_Unitario) [se n√£o existir]\n",
    "- `Mes` (vari√°vel categ√≥rica / ordinal) ‚Äî transformada em dummy\n",
    "- `Categoria` (one-hot)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Garantir Receita e Custo\n",
    "if 'Receita' not in df.columns:\n",
    "    df['Receita'] = df['Quantidade'] * df['Preco_Unitario']\n",
    "if 'Custo' not in df.columns and 'Custo_Unitario' in df.columns:\n",
    "    df['Custo'] = df['Quantidade'] * df['Custo_Unitario']\n",
    "\n",
    "# Criar vari√°veis temporais\n",
    "if 'Data' in df.columns:\n",
    "    df['Ano'] = df['Data'].dt.year\n",
    "    df['Mes_num'] = df['Data'].dt.month\n",
    "    df['Mes_nome'] = df['Data'].dt.month_name()\n",
    "\n",
    "# Sele√ß√£o de features candidatas\n",
    "features = ['Quantidade', 'Preco_Unitario', 'Receita']\n",
    "if 'Custo' in df.columns:\n",
    "    features.append('Custo')\n",
    "\n",
    "# One-hot para Categoria (se existir) e Mes (opcional)\n",
    "df_model = df.copy()\n",
    "categorical_cols = []\n",
    "if 'Categoria' in df_model.columns:\n",
    "    categorical_cols.append('Categoria')\n",
    "if 'Mes_nome' in df_model.columns:\n",
    "    categorical_cols.append('Mes_nome')\n",
    "\n",
    "if categorical_cols:\n",
    "    df_model = pd.get_dummies(df_model, columns=categorical_cols, drop_first=True)\n",
    "\n",
    "# Atualiza lista de features com poss√≠veis dummies geradas\n",
    "feature_cols = [c for c in df_model.columns if c in features or any(c.startswith(prefix + '_') for prefix in categorical_cols)]\n",
    "print('Features usadas (exemplo):', feature_cols[:20])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparar dados de treino e teste"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Target\n",
    "y = df_model['Lucro']\n",
    "\n",
    "# Se n√£o existirem feature_cols detectadas automaticamente, usar fallback\n",
    "if len(feature_cols) == 0:\n",
    "    feature_cols = ['Quantidade', 'Preco_Unitario', 'Receita']\n",
    "\n",
    "X = df_model[feature_cols].fillna(0)\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "print('Treino:', X_train.shape, 'Teste:', X_test.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Treinamento: Regress√£o Linear (baseline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = LinearRegression()\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "# Previs√µes\n",
    "y_pred_train = model.predict(X_train)\n",
    "y_pred = model.predict(X_test)\n",
    "\n",
    "# M√©tricas\n",
    "mae = mean_absolute_error(y_test, y_pred)\n",
    "rmse = mean_squared_error(y_test, y_pred, squared=False)\n",
    "r2 = r2_score(y_test, y_pred)\n",
    "\n",
    "print(f'MAE: R$ {mae:,.2f}')\n",
    "print(f'RMSE: R$ {rmse:,.2f}')\n",
    "print(f'R¬≤: {r2:.4f}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Interpreta√ß√£o dos coeficientes (Modelo Linear)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coefs = pd.DataFrame({'feature': X_train.columns, 'coef': model.coef_})\n",
    "coefs_sorted = coefs.reindex(coefs.coef.abs().sort_values(ascending=False).index)\n",
    "coefs_sorted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gr√°fico: Predi√ß√µes vs Observado (Conjunto de Teste)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(8,6))\n",
    "plt.scatter(y_test, y_pred, alpha=0.7)\n",
    "plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')\n",
    "plt.xlabel('Lucro Observado')\n",
    "plt.ylabel('Lucro Predito')\n",
    "plt.title('Predi√ß√µes vs Observado ‚Äî Teste')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Salvar modelo (artefato)\n",
    "O modelo √© salvo com `joblib` para reuso ou deploy simples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "joblib.dump(model, '../models/regressao_lucro_linear.joblib')\n",
    "print('Modelo salvo em ../models/regressao_lucro_linear.joblib')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclus√µes e pr√≥ximos passos\n",
    "\n",
    "- O modelo linear fornece um baseline interpret√°vel. Verifique R¬≤ e RMSE para entender a qualidade.\n",
    "- Pr√≥ximos passos: testar modelos mais sofisticados (RandomForest, XGBoost), engenharia de features avan√ßada (lags, agrega√ß√µes por cliente/produto), valida√ß√£o temporal e tuning de hiperpar√¢metros.\n",
    "- Integrar esse modelo em um servi√ßo simples (Flask/FastAPI) ou em um painel Dashboard (Dash) para previs√µes em tempo real.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
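
The notebook reports MAE, RMSE, and R² through scikit-learn. As a rough illustration of what those three numbers measure, here is a minimal pure-Python sketch. The function name `regression_metrics` and the sample profit values are hypothetical and not part of the notebook. Returning NaN for a constant target is a simplification and may differ from scikit-learn's behavior.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average absolute error, in the same unit as the target
    mae = sum(abs(e) for e in errors) / n

    # RMSE: square root of the mean squared error; penalizes large errors more
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R^2: 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Simplification: report NaN when the target is constant
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return mae, rmse, r2


# Hypothetical observed and predicted profit values (illustration only)
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

mae, rmse, r2 = regression_metrics(observed, predicted)
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  R2: {r2:.4f}")
```

MAE and RMSE are expressed in the target's unit (R$ in the notebook), while R² is unitless, which is why the notebook formats them differently.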