{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"language": "markdown"
},
"source": [
"# Simulating the model_builder flow\n",
"This notebook loads the data, applies feature engineering, splits the data into train and test sets, tries several combinations of algorithms and balancing methods, and inspects the parameters of the best pipeline.\n",
"Run the cells in order."
]
},
{
"cell_type": "code",
"metadata": {
"language": "python"
},
"source": [
"# Main imports\n",
"import pandas as pd\n",
"import json\n",
"from pprint import pprint\n",
"\n",
"# Project functions/objects\n",
"from analise_qualidade_vinhos.data.dataset import load_featured_data, train_test_split_featured\n",
"from analise_qualidade_vinhos.pipeline import model_builder\n",
"from analise_qualidade_vinhos.pipeline.model_builder import (\n",
"    build_training_pipeline,\n",
"    test_multiple_algorithms,\n",
"    build_best_pipeline,\n",
"    get_class_labels,\n",
")\n",
"\n",
"# Print the versions of the main libraries\n",
"import sklearn\n",
"print('pandas', pd.__version__)\n",
"print('sklearn', sklearn.__version__)\n",
"try:\n",
"    import xgboost\n",
"    print('xgboost', xgboost.__version__)\n",
"except Exception:\n",
"    print('xgboost: not installed')\n",
"try:\n",
"    import lightgbm\n",
"    print('lightgbm', lightgbm.__version__)\n",
"except Exception:\n",
"    print('lightgbm: not installed')\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"language": "markdown"
},
"source": [
"## 1) Load the data with the features already built\n",
"The load_featured_data function uses build_feature_matrix from the features package. To inspect the raw CSV instead, use load_raw_data()."
]
},
{
"cell_type": "code",
"metadata": {
"language": "python"
},
"source": [
"# Load the data (uses RAW_DATA_PATH from settings when no path is passed).\n",
"df = load_featured_data()\n",
"print('Shape:', df.shape)\n",
"display(df.head())\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"language": "markdown"
},
"source": [
"## 2) Split X and y (train/test)\n",
"We use train_test_split_featured, which expects the TARGET_COLUMN column (defined in settings) to already be present in the DataFrame."
]
},
{
"cell_type": "code",
"metadata": {
"language": "python"
},
"source": [
"# Train/test split\n",
"X_train, X_test, y_train, y_test = train_test_split_featured(df)\n",
"print('X_train', X_train.shape, 'X_test', X_test.shape)\n",
"print('y_train distribution:')\n",
"display(y_train.value_counts(normalize=True))\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"language": "markdown"
},
"source": [
"## 3) Build a simple pipeline and inspect its parameters\n",
"We create a pipeline with random_forest + smoteenn and list the parameters of the preprocessor and the model."
]
},
{
"cell_type": "code",
"metadata": {
"language": "python"
},
"source": [
"pipeline = build_training_pipeline(algorithm='random_forest', balance_method='smoteenn')\n",
"print('Pipeline steps:')\n",
"pprint(pipeline.steps)\n",
"\n",
"# Show the parameter names at a high level\n",
"print('\\nPipeline get_params (relevant keys):')\n",
"for k in sorted(pipeline.get_params()):\n",
"    if any(prefix in k for prefix in ['model', 'preprocess', 'balance']):\n",
"        print(k)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"language": "markdown"
},
"source": [
"## 4) Train the pipeline and check simple metrics\n",
"We fit the pipeline on the training set and evaluate accuracy and f1_weighted on the test set."
]
},
{
"cell_type": "code",
"metadata": {
"language": "python"
},
"source": [
"pipeline.fit(X_train, y_train)\n",
"preds = pipeline.predict(X_test)\n",
"\n",
"from sklearn.metrics import accuracy_score, f1_score, classification_report\n",
"acc = accuracy_score(y_test, preds)\n",
"f1w = f1_score(y_test, preds, average='weighted')\n",
"print(f'Accuracy: {acc:.4f} | F1 weighted: {f1w:.4f}')\n",
"print('\\nClassification report:')\n",
"print(classification_report(y_test, preds))\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"language": "markdown"
},
"source": [
"## 5) Test multiple algorithms and balancing methods\n",
"Use test_multiple_algorithms to run the combinations. This can take a while, depending on the dataset size and on which models are available."
]
},
{
"cell_type": "code",
"metadata": {
"language": "python"
},
"source": [
"# Optional: shorten these lists for faster runs while experimenting\n",
"algorithms = ['random_forest', 'gradient_boosting']\n",
"if getattr(model_builder, 'XGBOOST_AVAILABLE', False):\n",
"    algorithms.append('xgboost')\n",
"if getattr(model_builder, 'LIGHTGBM_AVAILABLE', False):\n",
"    algorithms.append('lightgbm')\n",
"\n",
"balance_methods = ['smoteenn', 'adasyn', 'smote']\n",
"\n",
"results = test_multiple_algorithms(X_train, y_train, X_test, y_test, algorithms=algorithms, balance_methods=balance_methods)\n",
"\n",
"print('\\nSummary of results:')\n",
"for key, res in results.items():\n",
"    if 'f1_weighted' in res:\n",
"        print(key, '->', f\"F1={res['f1_weighted']:.4f}\", f\"Acc={res['accuracy']:.4f}\")\n",
"    else:\n",
"        print(key, '-> ERROR:', res.get('error'))\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"language": "markdown"
},
"source": [
"## 6) Retrain the best pipeline on the full training set\n",
"build_best_pipeline picks the best combination by weighted F1, rebuilds that pipeline, and retrains it on the training set."
]
},
{
"cell_type": "code",
"metadata": {
"language": "python"
},
"source": [
"best_pipeline = build_best_pipeline(X_train, y_train, X_test, y_test)\n",
"print('\\nBest trained pipeline:')\n",
"print(best_pipeline)\n",
"\n",
"# Inspect the final estimator and its parameters\n",
"model_final = best_pipeline.named_steps['model']\n",
"print('\\nFinal model: ', type(model_final))\n",
"pprint(model_final.get_params())\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"language": "markdown"
},
"source": [
"## 7) Save the results and the pipeline (optional)\n",
"You can save the pipeline with joblib or pickle if you want to reuse it later."
]
},
{
"cell_type": "code",
"metadata": {
"language": "python"
},
"source": [
"# Save the trained pipeline under models/\n",
"import joblib\n",
"from pathlib import Path\n",
"out_dir = Path('models')\n",
"out_dir.mkdir(exist_ok=True)\n",
"model_path = out_dir / 'best_pipeline.joblib'\n",
"joblib.dump(best_pipeline, model_path)\n",
"print('Pipeline saved to', model_path)\n",
"\n",
"# Save the metrics as JSON, keeping only the serializable summary fields\n",
"metrics_path = out_dir / 'training_results.json'\n",
"keep = ('accuracy', 'f1_weighted', 'algorithm', 'balance')\n",
"simple_results = {\n",
"    name: {k: v for k, v in res.items() if k in keep}\n",
"    for name, res in results.items()\n",
"}\n",
"metrics_path.write_text(json.dumps(simple_results, indent=2))\n",
"print('Metrics saved to', metrics_path)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"language": "markdown"
},
"source": [
"## Final tips\n",
"- To inspect feature importances, check that the final model exposes feature_importances_ (RandomForest, GradientBoosting, LightGBM). For XGBoost, use get_booster() or feature_importances_, whichever is available.\n",
"- To speed up experiments, reduce n_estimators in the models or use a smaller sample of the training set.\n",
"- If you get an error about columns missing from NUMERIC_FEATURES, run df.columns, compare the result with model_builder.NUMERIC_FEATURES, and adjust the feature engineering.\n"
]
}
]
}