# 🧱 ETL_Modelo_Streamlit_C_SemNumpy
Training notebook for the model behind the Streamlit app (no explicit use of NumPy).

## 🚀 Steps
1. Installs libraries compatible with Python 3.13.7 (the same ones used in the Streamlit Cloud deployment).
2. Loads `aprovados_reprovados.csv`.
3. Automatically selects numeric and categorical features.
4. Trains a compact `HistGradientBoostingClassifier`.
5. Exports `modelo_c.joblib` (compressed) and `feature_schema_c.json`.

## 1) Install compatible dependencies

In [12]:
!pip install streamlit
!pip install --upgrade --force-reinstall pandas==2.2.2 scikit-learn==1.5.2 joblib==1.4.2


Collecting pandas==2.2.2
  Downloading pandas-2.2.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting scikit-learn==1.5.2
  Downloading scikit_learn-1.5.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting joblib==1.4.2
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting numpy>=1.26.0 (from pandas==2.2.2)
  Downloading numpy-2.3.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting python-dateutil>=2.8.2 (from pandas==2.2.2)
  Downloading python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting pytz>=2020.1 (from pandas==2.2.2)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas==2.2.2
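
Because step 1 depends on the pinned versions matching the deployment, a quick check that the running kernel actually has them can help. The sketch below is a minimal illustration and not part of the original notebook: `PINS` mirrors the versions in the `pip install` command above, while `installed_version` and `check_pins` are hypothetical helper names.

```python
from importlib import metadata

# Versions copied from the pip command in this section.
PINS = {"pandas": "2.2.2", "scikit-learn": "1.5.2", "joblib": "1.4.2"}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every package that misses its pin."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Report any package whose installed version differs from the pin.
for name, (expected, found) in check_pins(PINS).items():
    print(f"{name}: expected {expected}, found {found}")
```

The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.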

## 2) Imports and configuration

In [1]:

import os, json
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score, f1_score

import joblib

# Required columns
ID_VAGA_COL, ID_CAND_COL, TARGET_COL = "id_vaga", "id_candidato", "target"

# Paths
TRAIN_PATH = "/content/aprovados_reprovados.csv"
EXPORT_DIR = "/content/"


## 3) Load training data

In [2]:

df_train = pd.read_csv(TRAIN_PATH)
print("Training shape:", df_train.shape)
df_train.head()


Training shape: (10110, 31)


Unnamed: 0,id_vaga,inf_titulo_vaga,inf_cliente,inf_vaga_sap,perfil_nivel_academico,perfil_nivel profissional,perfil_nivel_ingles,perfil_nivel_espanhol,perfil_competencia_tecnicas_e_comportamentais,perfil_principais_atividades,...,qualificacoes,certificacoes,experiencias,nivel_academico,nivel_ingles,nivel_espanhol,cargo_atual,nivel_profissional,outro_idioma,cursos
0,5184,Consultor PP/QM Sênior,"Morris, Moran and Dodson",Não,Ensino Superior Completo,Sênior,Fluente,Nenhum,• Consultor PP/QM Sênior com experiencia em pr...,Consultor PP/QM Sr.\n\n• Consultor PP/QM Sênio...,...,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado
1,5184,Consultor PP/QM Sênior,"Morris, Moran and Dodson",Não,Ensino Superior Completo,Sênior,Fluente,Nenhum,• Consultor PP/QM Sênior com experiencia em pr...,Consultor PP/QM Sr.\n\n• Consultor PP/QM Sênio...,...,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado
2,5184,Consultor PP/QM Sênior,"Morris, Moran and Dodson",Não,Ensino Superior Completo,Sênior,Fluente,Nenhum,• Consultor PP/QM Sênior com experiencia em pr...,Consultor PP/QM Sr.\n\n• Consultor PP/QM Sênio...,...,Não informado,Não informado,Não informado,Mestrado Completo,Fluente,Fluente,Não informado,Não informado,Não informado,Engenharia da Computação
3,5184,Consultor PP/QM Sênior,"Morris, Moran and Dodson",Não,Ensino Superior Completo,Sênior,Fluente,Nenhum,• Consultor PP/QM Sênior com experiencia em pr...,Consultor PP/QM Sr.\n\n• Consultor PP/QM Sênio...,...,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado,Não informado
4,5183,ANALISTA PL/JR C/ SQL,"Morris, Moran and Dodson",Não,Ensino Superior Completo,Analista,Nenhum,Intermediário,Requisitos mandatórios:\n\no Conhecimentos Téc...,Descrição – Atividades:\n\no Monitoramento das...,...,Não informado,Não informado,Não informado,Pós Graduação Cursando,Básico,Básico,Não informado,Não informado,Não informado,Direito
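
The preview shows many cells holding the literal string `Não informado`. Because the `SimpleImputer` steps configured in section 4 only fill values they recognize as missing (NaN by default), such placeholders would presumably reach the encoder as an ordinary category. A minimal, pandas-free sketch for measuring how common the placeholder is per column follows; `placeholder_share` and the toy rows are illustrative assumptions, not part of the notebook.

```python
def placeholder_share(rows, placeholder="Não informado"):
    """Map each column to the fraction of rows whose value equals the placeholder."""
    if not rows:
        return {}
    counts = {}
    for row in rows:
        for col, value in row.items():
            # bool is added as 0 or 1, so this counts placeholder hits per column.
            counts[col] = counts.get(col, 0) + (value == placeholder)
    return {col: n / len(rows) for col, n in counts.items()}


# Toy rows loosely modelled on the preview above (values are illustrative).
rows = [
    {"nivel_ingles": "Não informado", "cursos": "Não informado"},
    {"nivel_ingles": "Fluente", "cursos": "Engenharia da Computação"},
    {"nivel_ingles": "Não informado", "cursos": "Não informado"},
    {"nivel_ingles": "Básico", "cursos": "Direito"},
]
shares = placeholder_share(rows)
print(shares)
```

On the real data, the rows could come from `df_train.to_dict("records")`, or the same figure could be computed directly with `(df_train == "Não informado").mean()`.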


## 4) Feature selection and pipeline

In [3]:

X = df_train.drop(columns=[TARGET_COL])
y = df_train[TARGET_COL].astype(int)

# Detect numeric and categorical columns
# Note: the ID columns are numeric here, so they end up in num_cols (see the printed output).
num_cols = X.select_dtypes(include=["number"]).columns.tolist()
cat_cols = [c for c in X.select_dtypes(include=["object","category","bool"]).columns if c not in [ID_VAGA_COL, ID_CAND_COL]]

print("Numeric:", num_cols)
print("Categorical:", cat_cols)

# Preprocessing
num_transform = Pipeline([("imp", SimpleImputer(strategy="median"))])
cat_transform = Pipeline([
    ("imp", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False, min_frequency=0.01))
])

preprocessor = ColumnTransformer([
    ("num", num_transform, num_cols),
    ("cat", cat_transform, cat_cols)
])

# Compact model
clf = HistGradientBoostingClassifier(max_iter=300, learning_rate=0.08, random_state=42)

pipe = Pipeline([("prep", preprocessor), ("clf", clf)])


Numeric: ['id_vaga', 'id_candidato']
Categorical: ['inf_titulo_vaga', 'inf_cliente', 'inf_vaga_sap', 'perfil_nivel_academico', 'perfil_nivel profissional', 'perfil_nivel_ingles', 'perfil_nivel_espanhol', 'perfil_competencia_tecnicas_e_comportamentais', 'perfil_principais_atividades', 'titulo', 'nome', 'data_candidatura', 'recrutador', 'situacao_candidado', 'objetivo_profissional', 'titulo_profissional', 'area_atuacao', 'conhecimentos_tecnicos', 'qualificacoes', 'certificacoes', 'experiencias', 'nivel_academico', 'nivel_ingles', 'nivel_espanhol', 'cargo_atual', 'nivel_profissional', 'outro_idioma', 'cursos']
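
Several of the categorical columns are free text or near-unique (for example `perfil_principais_atividades` or `nome`), so the `min_frequency=0.01` setting matters: in scikit-learn, a float value makes `OneHotEncoder` treat categories seen in fewer than `min_frequency * n_samples` rows as infrequent and group them together. The plain-Python sketch below only illustrates that threshold rule; `split_frequent` and the sample values are hypothetical, and it does not reproduce the encoder itself.

```python
from collections import Counter


def split_frequent(values, min_frequency=0.01):
    """Split distinct values into (frequent, infrequent) by relative frequency."""
    counts = Counter(values)
    # Categories with fewer occurrences than this are considered infrequent.
    threshold = min_frequency * len(values)
    frequent = sorted(v for v, n in counts.items() if n >= threshold)
    infrequent = sorted(v for v, n in counts.items() if n < threshold)
    return frequent, infrequent


# 100 illustrative values: two common levels and one rare level.
values = ["Fluente"] * 60 + ["Básico"] * 39 + ["Técnico"] * 1
frequent, infrequent = split_frequent(values, min_frequency=0.05)
print(frequent, infrequent)
```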


## 5) Training and validation

In [4]:

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

pipe.fit(X_train, y_train)

proba = pipe.predict_proba(X_val)[:,1]
pred = (proba >= 0.5).astype(int)

print(classification_report(y_val, pred))
print("ROC AUC:", roc_auc_score(y_val, proba))
print("PR AUC:", average_precision_score(y_val, proba))
print("F1:", f1_score(y_val, pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1534
           1       1.00      1.00      1.00       488

    accuracy                           1.00      2022
   macro avg       1.00      1.00      1.00      2022
weighted avg       1.00      1.00      1.00      2022

ROC AUC: 1.0
PR AUC: 0.9999999999999998
F1: 1.0
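
Scores of exactly 1.00 on every metric are unusual for this kind of data and often point to target leakage rather than a perfect model. One plausible suspect, which is an assumption and not something the notebook confirms, is a column such as `situacao_candidado` that may encode the hiring outcome; the ID columns and near-unique fields like `nome` also deserve a look. The sketch below measures, for each column, how well a per-value majority vote reproduces the target; `column_purity` and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def column_purity(rows, target_key):
    """For each column, return (purity, n_distinct).

    purity is the fraction of rows whose target equals the most common target
    among rows sharing the same column value. A purity of 1.0 with few distinct
    values suggests the column alone determines the target.
    """
    per_column = defaultdict(lambda: defaultdict(Counter))
    for row in rows:
        for col, value in row.items():
            if col != target_key:
                per_column[col][value][row[target_key]] += 1

    total = len(rows)
    result = {}
    for col, by_value in per_column.items():
        majority_hits = sum(max(counts.values()) for counts in by_value.values())
        result[col] = (majority_hits / total, len(by_value))
    return result


# Toy data: "status" determines the target, "level" carries no signal.
rows = [
    {"status": "hired", "level": "senior", "target": 1},
    {"status": "rejected", "level": "senior", "target": 0},
    {"status": "hired", "level": "junior", "target": 1},
    {"status": "rejected", "level": "junior", "target": 0},
]
report = column_purity(rows, "target")
print(report)
```

The distinct-value count is reported alongside purity because a near-unique column reaches high purity trivially, which signals memorization risk rather than a genuine predictor.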


## 6) Export artifacts

In [9]:

# Export the compressed model
joblib.dump(pipe, os.path.join(EXPORT_DIR, "modelo_c.joblib"), compress=("gzip", 3))

# Export the feature schema
schema = {"num_cols": num_cols, "cat_cols": cat_cols,
          "id_vaga_col": ID_VAGA_COL, "id_cand_col": ID_CAND_COL}

with open(os.path.join(EXPORT_DIR, "feature_schema_c.json"), "w", encoding="utf-8") as f:
    json.dump(schema, f, indent=2)

print("✅ Artifacts exported to:", EXPORT_DIR)


✅ Artifacts exported to: /content/
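
The app reads `feature_schema_c.json` back at startup, so a malformed file would only surface at prediction time. A small validation step, sketched here under the assumption that the schema has exactly the four keys written above, could fail earlier; `validate_schema` is a hypothetical helper, not part of the notebook.

```python
import json

REQUIRED_LIST_KEYS = ("num_cols", "cat_cols")
REQUIRED_STR_KEYS = ("id_vaga_col", "id_cand_col")


def validate_schema(schema):
    """Raise ValueError if the schema lacks a required key or has a wrong type."""
    for key in REQUIRED_LIST_KEYS:
        value = schema.get(key)
        if not isinstance(value, list) or not all(isinstance(c, str) for c in value):
            raise ValueError(f"{key!r} must be a list of column names")
    for key in REQUIRED_STR_KEYS:
        if not isinstance(schema.get(key), str):
            raise ValueError(f"{key!r} must be a column name")
    return schema


# Round-trip through JSON, as the export cell and the app would do.
text = json.dumps({
    "num_cols": ["id_vaga", "id_candidato"],
    "cat_cols": ["inf_cliente"],
    "id_vaga_col": "id_vaga",
    "id_cand_col": "id_candidato",
})
schema = validate_schema(json.loads(text))
```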


In [21]:
# 7) Export app.py (version with additional fields and layout adjustments)
app_code = """
import os, json
import pandas as pd
import streamlit as st
import joblib

MODEL_FILE = "modelo_c.joblib"
SCHEMA_FILE = "feature_schema_c.json"

@st.cache_resource(show_spinner=False)
def load_schema(schema_path: str = SCHEMA_FILE) -> dict:
    with open(schema_path, "r", encoding="utf-8") as f:
        return json.load(f)

@st.cache_resource(show_spinner=False)
def load_model(model_path: str = MODEL_FILE):
    return joblib.load(model_path)

def align_columns(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    num_cols = schema["num_cols"]
    cat_cols = schema["cat_cols"]
    id_vaga_col = schema["id_vaga_col"]
    id_cand_col = schema["id_cand_col"]

    feature_cols = num_cols + cat_cols
    id_cols = [id_vaga_col, id_cand_col]
    needed = feature_cols + id_cols

    for c in needed:
        if c not in df.columns:
            df[c] = pd.NA

    return df.reindex(columns=needed)

def rank_candidates(df_pending: pd.DataFrame, schema: dict, model, top_k: int = 10) -> pd.DataFrame:
    df_aligned = align_columns(df_pending.copy(), schema)
    df_aligned = df_aligned.loc[:, ~df_aligned.columns.duplicated()]

    expected = pd.Index(model.feature_names_in_).drop_duplicates()
    X_input = df_aligned.reindex(columns=expected, fill_value=pd.NA)

    scores = model.predict_proba(X_input)[:, 1]
    df_aligned["score"] = pd.Series(scores, index=df_aligned.index).round(2)
    df_aligned["percent_match"] = (df_aligned["score"] * 100).round(1)
    df_aligned["rank"] = df_aligned.groupby(schema["id_vaga_col"])["score"].rank(ascending=False, method="first")

    ranking = (
        df_aligned[df_aligned["rank"] <= top_k]
        .sort_values([schema["id_vaga_col"], "rank"])
        .reset_index(drop=True)
    )
    return ranking

# ------------------- Streamlit UI -------------------
st.set_page_config(page_title="Netflix das Vagas", layout="wide")
st.title("🎬 Netflix das Vagas")

uploaded = st.file_uploader("📂 CSV of pending (unclassified) candidates", type=["csv"])
top_k = st.sidebar.number_input("Candidate limit per job opening", min_value=1, max_value=50, value=10, step=1)

if uploaded is not None:
    df_pending = pd.read_csv(uploaded)
    schema = load_schema()
    model = load_model()
    ranking = rank_candidates(df_pending, schema, model, top_k=int(top_k))

    st.success("✅ Ranking generated!")

    # 🔹 Job filter showing ID + title
    if "inf_titulo_vaga" in ranking.columns:
        ranking["vaga_display"] = ranking[schema["id_vaga_col"]].astype(str) + " - " + ranking["inf_titulo_vaga"].astype(str)
    else:
        ranking["vaga_display"] = ranking[schema["id_vaga_col"]].astype(str)

    vagas = sorted(ranking["vaga_display"].unique())
    vaga_sel = st.sidebar.selectbox("Select the job opening", vagas)
    # The display value is either "<id> - <title>" or just "<id>"; keep only the ID.
    vaga_id = vaga_sel.split(" - ", 1)[0]

    top = ranking[ranking[schema["id_vaga_col"]].astype(str) == vaga_id].sort_values("rank")
    st.subheader(f"Top {len(top)} candidates for job opening {vaga_sel}")

    cols = st.columns(3)
    for i, (_, row) in enumerate(top.iterrows()):
        with cols[i % 3]:
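            # align_columns keeps only schema columns, so "nome_candidato" is shown only
            # if the schema lists it; otherwise the candidate ID is used as a fallback.
            nome = row.get("nome_candidato", f"Candidate {row[schema['id_cand_col']]}")
            empresa = row.get("inf_cliente", "Company not provided")
            st.markdown(f"### 👤 {nome}")
            st.caption(f"🏢 {empresa}")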
            st.metric("Match %", f"{row['percent_match']:.1f}%")
            st.caption(f"Rank: {int(row['rank'])}")

    with st.expander("📊 Full details for the job opening and candidates"):
        cols_show = [
            "inf_titulo_vaga", "inf_cliente", "inf_qualificacoes",
            "nome_candidato", "data_inscricao", "nome_recrutador",
            "rank", "percent_match"
        ]
        cols_show = [c for c in cols_show if c in top.columns]
        st.dataframe(top[cols_show], use_container_width=True)

else:
    st.info("⏳ Waiting for the pending-candidates CSV upload…")
"""

# Export app.py
EXPORT_DIR = "."
with open(os.path.join(EXPORT_DIR, "app.py"), "w", encoding="utf-8") as f:
    f.write(app_code)

print("✅ app.py exported to:", os.path.join(EXPORT_DIR, "app.py"))


✅ app.py exported to: ./app.py
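
The ranking inside `rank_candidates` relies on a grouped rank with `method="first"`, which breaks ties by order of appearance. As an illustration of that behavior only, and not a replacement for the pandas code, the same idea can be written in plain Python; `top_k_per_group` and the sample records are hypothetical.

```python
from collections import defaultdict


def top_k_per_group(records, group_key, score_key, top_k):
    """Rank records within each group by descending score and keep the best top_k.

    Ties keep their original order, because sorted() is stable.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    ranked = []
    for group in sorted(groups):
        ordered = sorted(groups[group], key=lambda r: -r[score_key])
        for position, record in enumerate(ordered[:top_k], start=1):
            ranked.append({**record, "rank": position})
    return ranked


# Candidates 2 and 4 tie on score; candidate 2 appears first and ranks higher.
records = [
    {"id_vaga": 5184, "id_candidato": 1, "score": 0.40},
    {"id_vaga": 5184, "id_candidato": 2, "score": 0.90},
    {"id_vaga": 5183, "id_candidato": 3, "score": 0.70},
    {"id_vaga": 5184, "id_candidato": 4, "score": 0.90},
]
ranking = top_k_per_group(records, "id_vaga", "score", top_k=2)
for row in ranking:
    print(row)
```

Since the app rounds scores to two decimals before ranking, ties of this kind are more likely than the raw probabilities would suggest, so the input order of the uploaded CSV can influence which candidates make the cut.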
