# 14. Starting Predictive Modeling

In this stage we begin preparing the data for Machine Learning.  
We reload the datasets (2020 and 2021), create copies to preserve the raw data,  
and inspect the table structure to identify variable types and any adjustments needed.


In [91]:
# Core libraries for ML
import pandas as pd
import numpy as np

# scikit-learn helpers for training and evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Regression models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from xgboost import XGBRegressor


In [92]:
# Read the CSV datasets
raw_2021 = pd.read_csv('data/ifood-restaurants-february-2021.csv')
raw_2020 = pd.read_csv('data/ifood-restaurants-november-2020.csv')

In [93]:
# Create copies so the original data is not modified
df_2021 = raw_2021.copy()
df_2020 = raw_2020.copy()

In [94]:
# Structure of the 2020 dataset
df_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 361447 entries, 0 to 361446
Data columns (total 11 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   avatar         361149 non-null  object 
 1   category       361447 non-null  object 
 2   delivery_fee   361447 non-null  float64
 3   delivery_time  361447 non-null  int64  
 4   distance       361447 non-null  float64
 5   name           361447 non-null  object 
 6   price_range    361447 non-null  object 
 7   rating         361447 non-null  float64
 8   url            361447 non-null  object 
 9   lat            361447 non-null  float64
 10  long           361447 non-null  float64
dtypes: float64(5), int64(1), object(5)
memory usage: 30.3+ MB


In [96]:
# First rows of the 2021 dataset
df_2021.head(4)

|  | availableForScheduling | avatar | category | delivery_fee | delivery_time | distance | ibge | minimumOrderValue | name | paymentCodes | price_range | rating | tags | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | https://static-images.ifood.com.br/image/uploa... | Marmita | 3.99 | 27 | 1.22 | 5300108 | 10.0 | Cantina Arte & Sabor | DNR $$ MPAY $$ MOVPAY_MC $$ MC $$ GPY_ELO $$ E... | CHEAPEST | 0.0 | ADDRESS_PREFORM_TYPE $$ CART::MCHT::100_DELIVE... | https://www.ifood.com.br/delivery/brasilia-df/... |
| 1 | False | https://static-images.ifood.com.br/image/uploa... | Açaí | 7.99 | 61 | 4.96 | 5300108 | 10.0 | Raruty Açaí Raiz | DNR $$ MPAY $$ MOVPAY_MC $$ MC $$ GPY_ELO $$ E... | CHEAPEST | 0.0 | ADDRESS_PREFORM_TYPE $$ GUIDED_HELP_TYPE $$ ME... | https://www.ifood.com.br/delivery/brasilia-df/... |
| 2 | False | https://static-images.ifood.com.br/image/uploa... | Bebidas | 11.99 | 70 | 8.35 | 5300108 | 5.0 | Toma na Kombi | DNR $$ MPAY $$ MOVPAY_MC $$ MC $$ GPY_ELO $$ R... | MODERATE | 0.0 | ADDRESS_PREFORM_TYPE $$ CPGN_USER_DISCOUNT_6_L... | https://www.ifood.com.br/delivery/brasilia-df/... |
| 3 | False | https://static-images.ifood.com.br/image/uploa... | Carnes | 16.49 | 63 | 6.35 | 5300108 | 20.0 | Churrasquinho do Barriga´s | DNR $$ MPAY $$ MOVPAY_MC $$ MC $$ GPY_ELO $$ E... | CHEAPEST | 0.0 | ADDRESS_PREFORM_TYPE $$ GUIDED_HELP_TYPE $$ NO... | https://www.ifood.com.br/delivery/brasilia-df/... |


## 15. Structuring and Preparing the Data

- Standardize the columns of the 2020 and 2021 datasets.
- Create the `state` column from the URL.
- Remove irrelevant variables (e.g. `avatar`, `url`, `tags`).
- Add an `ano` (year) column to tell the datasets apart.
- Merge both datasets into a single dataframe (`df_total`) for predictive analysis.


In [97]:
# Extract 'state' from the URL (2020)
df_2020['state'] = df_2020['url'].str.split("/").str[4].str.split("-").str[-1]
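
To make the chained `.str` calls concrete, here is a minimal plain-Python sketch of the same split logic applied to single URLs. The helper name `extract_state` and the sample URLs (including their restaurant slugs) are hypothetical; only the `/delivery/<city>-<state>/` layout is taken from the URLs shown in the data.

```python
def extract_state(url: str) -> str:
    """Return the state code from an iFood restaurant URL.

    Mirrors url.split("/")[4].split("-")[-1] from the pandas chain.
    """
    # "https://www.ifood.com.br/delivery/brasilia-df/..." splits into
    # ['https:', '', 'www.ifood.com.br', 'delivery', 'brasilia-df', ...]
    city_state = url.split("/")[4]
    # The state code is the last hyphen-separated token, so multi-word
    # city slugs such as "sao-paulo-sp" still work.
    return city_state.split("-")[-1]


# Hypothetical URLs, for illustration only
sample_urls = [
    "https://www.ifood.com.br/delivery/brasilia-df/exemplo-restaurante/0000",
    "https://www.ifood.com.br/delivery/sao-paulo-sp/outro-exemplo/1111",
]
states = [extract_state(u) for u in sample_urls]
print(states)
```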

In [98]:
# Drop irrelevant columns from the 2020 dataset
df_2020 = df_2020.drop(columns=['avatar', 'url', 'lat', 'long'])

In [100]:
# Extract 'state' from the URL (2021)
df_2021['state'] = df_2021['url'].str.split("/").str[4].str.split("-").str[-1]

In [101]:
# Drop irrelevant columns from the 2021 dataset
df_2021 = df_2021.drop(columns=['availableForScheduling', 'avatar', 'ibge', 'paymentCodes', 'tags', 'url'])

In [102]:
# Add an 'ano' (year) column to tell the datasets apart
df_2020['ano'] = 2020
df_2021['ano'] = 2021

In [103]:
df_2021.tail(8)

|  | category | delivery_fee | delivery_time | distance | minimumOrderValue | name | price_range | rating | state | ano |
|---|---|---|---|---|---|---|---|---|---|---|
| 406391 | Lanches | 3.0 | 50 | 1.77 | 20.0 | Hamburgueria Duarte | MODERATE | 5.0 | rs | 2021 |
| 406392 | Lanches | 3.5 | 60 | 2.74 | 15.0 | ki-sabor | CHEAP | 5.0 | rs | 2021 |
| 406393 | Lanches | 0.0 | 60 | 2.58 | 15.0 | Pizza Cone del Miko | CHEAPEST | 5.0 | rs | 2021 |
| 406394 | Açaí | 9.0 | 60 | 3.53 | 30.0 | Açaí da Duda | CHEAPEST | 4.95 | rs | 2021 |
| 406395 | Açaí | 6.0 | 50 | 2.6 | 10.0 | Pede aí açaí | CHEAPEST | 0.0 | rs | 2021 |
| 406396 | Açaí | 0.0 | 40 | 3.61 | 0.0 | Açaí do Jeitinho Brasileiro | CHEAPEST | 4.46602 | rs | 2021 |
| 406397 | Lanches | 8.0 | 60 | 3.54 | 20.0 | Classic Burger | CHEAPEST | 5.0 | rs | 2021 |
| 406398 | Doces & Bolos | 9.0 | 20 | 0.95 | 0.0 | Cacau Show - Cachoerinha Shopping | MODERATE | 4.84 | rs | 2021 |


## 14. Adjusting the `minimumOrderValue` column

- The 2021 dataset already has the `minimumOrderValue` column.
- For 2020, values are assigned by matching on `delivery_fee`, using the mean `minimumOrderValue` observed for each fee in 2021.
- Fees with no match in 2021 are filled with the overall 2021 mean.
- Finally, the 2020 and 2021 datasets are consolidated into a single dataframe (`df_total`).


In [104]:
# Build a lookup table from the 2021 data
# (if a fee has more than one minimumOrderValue, use the mean)
map_min_order = (
    df_2021.groupby('delivery_fee')['minimumOrderValue']
    .mean()
    .to_dict()
)

# Apply the lookup to 2020
df_2020['minimumOrderValue'] = df_2020['delivery_fee'].map(map_min_order)

# Fill 2020 values with no match using the overall 2021 mean
df_2020['minimumOrderValue'] = df_2020['minimumOrderValue'].fillna(df_2021['minimumOrderValue'].mean())
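
Because `minimumOrderValue` contains extreme entries (values as large as 99999999.99 appear in the descriptive statistics later on), a mean-based lookup is sensitive to them. The sketch below uses made-up numbers, not the real data, to compare a mean-based and a median-based lookup. The median variant is only an alternative worth considering, not what the notebook runs, and the names `toy_2021`, `toy_2020`, `mean_map` and `median_map` are hypothetical.

```python
import pandas as pd

# Toy data (made up): one absurd minimum order value for fee 4.0
toy_2021 = pd.DataFrame({
    "delivery_fee":      [4.0, 4.0, 4.0, 9.0, 9.0],
    "minimumOrderValue": [10.0, 20.0, 1000.0, 12.0, 14.0],
})
toy_2020 = pd.DataFrame({"delivery_fee": [4.0, 9.0, 7.5]})

by_fee = toy_2021.groupby("delivery_fee")["minimumOrderValue"]
mean_map = by_fee.mean().to_dict()      # same statistic as the notebook
median_map = by_fee.median().to_dict()  # robust alternative (assumption)

# Fees with no match in 2021 (7.5 here) fall back to the overall statistic
toy_2020["min_order_mean"] = (
    toy_2020["delivery_fee"].map(mean_map)
    .fillna(toy_2021["minimumOrderValue"].mean())
)
toy_2020["min_order_median"] = (
    toy_2020["delivery_fee"].map(median_map)
    .fillna(toy_2021["minimumOrderValue"].median())
)
print(toy_2020)
```

In this toy case the single value of 1000 lifts the mean for fee 4.0 to about 343, while the median stays at 20.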


In [105]:
# Last rows of the processed 2020 dataset
df_2020.tail(8)

|  | category | delivery_fee | delivery_time | distance | name | price_range | rating | state | ano | minimumOrderValue |
|---|---|---|---|---|---|---|---|---|---|---|
| 361439 | Brasileira | 3.99 | 38 | 1.82 | Restaurante Cantinho da Serra | MODERATE | 0.0 | sp | 2020 | 15.429268 |
| 361440 | Bebidas | 3.99 | 42 | 1.74 | Boteco Itapeva | MODERATE | 4.75 | sp | 2020 | 15.429268 |
| 361441 | Argentina | 3.99 | 38 | 1.78 | Cantinho da Serra Grill | MODERATE | 0.0 | sp | 2020 | 15.429268 |
| 361442 | Brasileira | 3.99 | 18 | 1.66 | Restaurante Nosso Tempero | CHEAPEST | 0.0 | sp | 2020 | 15.429268 |
| 361443 | Lanches | 3.99 | 18 | 1.7 | Belgian Waffles | CHEAPEST | 0.0 | sp | 2020 | 15.429268 |
| 361444 | Pizza | 8.0 | 45 | 1.34 | Sabor de Campos Pizzaria e Restaurante | CHEAPEST | 0.0 | sp | 2020 | 14.25701 |
| 361445 | Cafeteria | 1.0 | 35 | 0.2 | Maria Bonita | CHEAPEST | 0.0 | sc | 2020 | 11.51023 |
| 361446 | Doces & Bolos | 0.0 | 60 | 0.96 | Cacau Show - Supermercado Campos Salles | CHEAPEST | 5.0 | ce | 2020 | 21.016338 |


In [106]:
# Concatenate the 2020 and 2021 datasets
df_total = pd.concat([df_2020, df_2021], ignore_index=True)
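
The two frames list their columns in a different order (`minimumOrderValue` is last in 2020 but not in 2021). The small sketch below, on made-up frames `a` and `b`, illustrates that `pd.concat` matches columns by name rather than by position and that `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Made-up frames whose columns come in a different order
a = pd.DataFrame({"rating": [4.5], "ano": [2020], "minimumOrderValue": [15.0]})
b = pd.DataFrame({"minimumOrderValue": [20.0], "rating": [5.0], "ano": [2021]})

# concat aligns columns by name; ignore_index=True renumbers rows from 0
both = pd.concat([a, b], ignore_index=True)
print(both)
```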

In [107]:
# Show the shape of the consolidated dataset
print(df_total.shape)


(767846, 10)


In [108]:
# First rows of the consolidated dataset
df_total.head()

|  | category | delivery_fee | delivery_time | distance | name | price_range | rating | state | ano | minimumOrderValue |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lanches | 9.0 | 80 | 6.2 | El'moedor | CHEAP | 4.30303 | pe | 2020 | 13.483475 |
| 1 | Doces & Bolos | 6.0 | 35 | 3.03 | Delicia de Brigadeiro | CHEAPEST | 0.0 | pe | 2020 | 13.599419 |
| 2 | Brasileira | 4.0 | 40 | 1.51 | Pizzaria Rappi10 - Moreno | CHEAPEST | 0.0 | pe | 2020 | 366.540819 |
| 3 | Lanches | 0.0 | 50 | 0.79 | Tapioca Arretada | CHEAPEST | 5.0 | pe | 2020 | 21.016338 |
| 4 | Salgados | 12.99 | 36 | 5.12 | Minuto Kit Festa ( Salgados e Doces ) | CHEAPEST | 5.0 | pe | 2020 | 15.018331 |


## 15. Handling the target variable (`rating`)

- Convert the `rating` column to a numeric type.
- Remove invalid values (outside the 1–5 range).
- Drop rows without a rating (`NaN`), which includes ratings recorded as 0.
- Assess the impact of the cleaning on the number of records.


In [109]:
# Convert 'rating' to numeric (unparseable values become NaN)
df_total['rating'] = pd.to_numeric(df_total['rating'], errors='coerce')

# Replace ratings equal to 0 with NaN (no rating)
df_total.loc[df_total['rating'] == 0, 'rating'] = np.nan

# Keep only values between 1 and 5 (NaN rows are dropped as well)
df_total = df_total[df_total['rating'].between(1, 5, inclusive='both')]
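
A small sketch with made-up values shows what each cleaning step does. The series `raw` is hypothetical and only stands in for the `rating` column.

```python
import numpy as np
import pandas as pd

# Made-up ratings: a text value, zeros (not rated yet) and an out-of-range value
raw = pd.Series(["4.5", "0", "abc", "5", "7.2", "0.0", "1"])

rating = pd.to_numeric(raw, errors="coerce")     # "abc" becomes NaN
rating = rating.mask(rating == 0, np.nan)        # zeros mean "not rated"
valid = rating.between(1, 5, inclusive="both")   # NaN evaluates to False

clean = rating[valid]
print(clean.tolist())
print("removed:", len(raw) - len(clean))
```

Since `between` returns `False` for `NaN`, the range filter alone already removes the unrated rows, so no separate `dropna` is needed.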


In [110]:
# Descriptive statistics of rating after cleaning
df_total['rating'].describe()

count    448237.000000
mean          4.518793
std           0.604238
min           1.000000
25%           4.363640
50%           4.667880
75%           4.916670
max           5.000000
Name: rating, dtype: float64

In [111]:
# Count missing values in 'rating'
print("NaN count in rating:", df_total['rating'].isna().sum())


NaN count in rating: 0


In [112]:
# Measure the impact of the cleaning on the dataset size
total_original = 767846  # rows in df_total right after the concat
total_final = df_total.shape[0]

removidos = total_original - total_final
perc = removidos / total_original * 100

print(f"Original total: {total_original:,}")
print(f"After cleaning: {total_final:,}")
print(f"Rows removed (rating = 0/NaN): {removidos:,} ({perc:.1f}%)")


Original total: 767,846
After cleaning: 448,237
Rows removed (rating = 0/NaN): 319,609 (41.6%)


## 16. Outlier handling and variable preparation

- Apply **cut-off rules** to limit extreme values in `delivery_time`, `distance` and `minimumOrderValue`.
- Define the independent variables (X) and the target variable (`rating`).
- Split the columns into **categorical** and **numeric**.
- Apply **One-Hot Encoding** to turn the categorical variables into numeric format.
- Assemble the final dataset (`X_final`), ready for training and testing the models.


In [113]:
# Statistics of the relevant numeric variables
num_cols = ['delivery_fee','delivery_time','distance','minimumOrderValue','ano']
print(df_total[num_cols].describe().T)

                      count         mean            std      min      25%  \
delivery_fee       448237.0     6.682139       4.357669     0.00     4.00   
delivery_time      448237.0    46.215730      20.929946    -5.00    34.00   
distance           448237.0     3.529183      33.750311     0.01     1.71   
minimumOrderValue  448237.0   374.414960  149366.793103     0.00    13.00   
ano                448237.0  2020.506957       0.499952  2020.00  2020.00   

                           50%      75%          max  
delivery_fee          6.000000     9.00        40.00  
delivery_time        45.000000    60.00      5060.00  
distance              3.050000     4.77     11170.24  
minimumOrderValue    15.429268    20.00  99999999.99  
ano                2021.000000  2021.00      2021.00  


In [114]:
# Look at the largest values in each column
for c in num_cols:
    print(f"\nColumn: {c}")
    print("Largest values:")
    print(df_total[c].sort_values(ascending=False).head(5).to_list())


Column: delivery_fee
Largest values:
[40.0, 40.0, 36.9, 35.0, 35.0]

Column: delivery_time
Largest values:
[5060, 5050, 1335, 480, 350]

Column: distance
Largest values:
[11170.24, 11170.24, 9843.84, 9274.69, 5196.45]

Column: minimumOrderValue
Largest values:
[99999999.99, 150000.0, 100000.0, 100000.0, 30000.0]

Column: ano
Largest values:
[2021, 2021, 2021, 2021, 2021]


In [115]:
# Cut-off rules for extreme values
df_total = df_total[df_total['delivery_time'] <= 180]
df_total = df_total[df_total['distance'] <= 50]
# .copy() avoids SettingWithCopyWarning when columns are added later
df_total = df_total[df_total['minimumOrderValue'] <= 50].copy()

# Check the dataset size after removing the outliers
print("New shape:", df_total.shape)
print(df_total[['delivery_fee','delivery_time','distance','minimumOrderValue']].describe())


New shape: (421420, 10)
        delivery_fee  delivery_time       distance  minimumOrderValue
count  421420.000000  421420.000000  421420.000000      421420.000000
mean        6.775921      46.026731       3.404854          15.789975
std         4.420967      17.874158       2.125572           6.686551
min         0.000000      -5.000000       0.010000           0.000000
25%         4.000000      33.000000       1.720000          11.764267
50%         6.000000      45.000000       3.070000          15.286707
75%         9.490000      60.000000       4.860000          20.000000
max        35.000000     180.000000      43.570000          50.000000
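
The three filters can also be written as named boolean masks, which makes it easy to see how many rows each rule rejects. This is a sketch on made-up rows: the upper thresholds are the ones used above, while the extra lower bound on `delivery_time` is an assumption added here because the statistics still show a minimum of -5 after the cuts. The names `toy`, `rules`, `rejected` and `toy_clean` are hypothetical.

```python
import pandas as pd

# Made-up rows covering each kind of extreme value
toy = pd.DataFrame({
    "delivery_time":     [30, 5060, 45, -5, 60],
    "distance":          [2.5, 3.0, 11170.24, 1.0, 4.0],
    "minimumOrderValue": [15.0, 20.0, 10.0, 12.0, 150000.0],
})

rules = {
    "delivery_time <= 180":    toy["delivery_time"] <= 180,
    "distance <= 50":          toy["distance"] <= 50,
    "minimumOrderValue <= 50": toy["minimumOrderValue"] <= 50,
    # Assumption: negative delivery times are treated as data errors
    "delivery_time >= 0":      toy["delivery_time"] >= 0,
}

# Rows rejected by each rule, evaluated independently of the others
rejected = {name: int((~mask).sum()) for name, mask in rules.items()}
print(rejected)

# Keep only the rows that satisfy every rule
keep = pd.concat(rules, axis=1).all(axis=1)
toy_clean = toy[keep].copy()
print(toy_clean.shape)
```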


In [116]:
# Variable definitions
cat_cols = ['category', 'price_range', 'state']
num_cols = ['delivery_fee', 'delivery_time', 'distance', 'minimumOrderValue', 'ano']

X = df_total[cat_cols + num_cols].copy()
y = df_total['rating'].astype(float)

# One-Hot Encoding of the categorical variables
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_encoded = encoder.fit_transform(X[cat_cols])

# Build a DataFrame with the encoded column names
encoded_cols = encoder.get_feature_names_out(cat_cols)
X_encoded = pd.DataFrame(X_encoded, columns=encoded_cols, index=X.index)

# Join numeric + encoded categorical columns
X_final = pd.concat([X[num_cols], X_encoded], axis=1)

# Show the shape and first rows of the final modeling matrix
print(X_final.shape)
X_final.head(3)

(421420, 97)


|  | delivery_fee | delivery_time | distance | minimumOrderValue | ano | category_Africana | category_Alemã | category_Argentina | category_Asiática | category_Açaí | ... | state_pr | state_rj | state_rn | state_ro | state_rr | state_rs | state_sc | state_se | state_sp | state_to |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.0 | 80 | 6.2 | 13.483475 | 2020 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 50 | 0.79 | 21.016338 | 2020 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 12.99 | 36 | 5.12 | 15.018331 | 2020 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## 17. Model Training and Evaluation

- Split the data into training (80%) and test (20%) sets.
- Run several algorithms:
  - **Linear Regression** (baseline).
  - **Random Forest** (ensemble of trees).
  - **Extra Trees**, **Gradient Boosting** and **XGBoost** (ensemble models).
- Evaluate with **MAE (mean absolute error)**, **RMSE (root mean squared error)** and **R² (coefficient of determination)**.
- Compare performance across the models.


In [117]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, random_state=42)

# 1) Linear Regression (baseline)
lin = LinearRegression().fit(X_train, y_train)
p_lin = lin.predict(X_test)

mae_lin  = mean_absolute_error(y_test, p_lin)
rmse_lin = np.sqrt(mean_squared_error(y_test, p_lin))
r2_lin   = r2_score(y_test, p_lin)

# 2) Random Forest (non-linear, tree-based model)
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1).fit(X_train, y_train)
p_rf = rf.predict(X_test)

# Random Forest metrics
mae_rf  = mean_absolute_error(y_test, p_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, p_rf))
r2_rf   = r2_score(y_test, p_rf)

# Initial comparison: Linear Regression vs. Random Forest
results = pd.DataFrame(
    [["Linear Regression", mae_lin, rmse_lin, r2_lin],
     ["Random Forest",     mae_rf,  rmse_rf,  r2_rf]],
    columns=["Modelo", "MAE", "RMSE", "R²"]).round({"MAE": 3, "RMSE": 3, "R²": 3})

print(results)

              Modelo    MAE   RMSE     R²
0  Linear Regression  0.379  0.592  0.039
1      Random Forest  0.399  0.619 -0.050


In [118]:
# 3) Additional models (ExtraTrees, GradientBoosting, XGBoost)
models = {
    "ExtraTrees": ExtraTreesRegressor(n_estimators=200, random_state=42, n_jobs=-1),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=200, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=8, random_state=42, n_jobs=-1, tree_method="hist"),
}

results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)

    results.append([name, mae, rmse, r2])

# Final comparison of the ensemble models
df_results = pd.DataFrame(results, columns=["Modelo", "MAE", "RMSE", "R²"]).round(4)
print(df_results)


             Modelo     MAE    RMSE      R²
0        ExtraTrees  0.4265  0.6749 -0.2494
1  GradientBoosting  0.3770  0.5898  0.0458
2           XGBoost  0.3743  0.5875  0.0534
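
For reference, the three metrics can be computed by hand. The sketch below uses made-up ratings and plain Python to show why R² is 0 for a model that always predicts the mean and negative for anything worse than that, which is how the Random Forest and ExtraTrees scores above should be read. The function name `regression_metrics` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) for two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = mean(abs(e) for e in errors)
    rmse = sqrt(mean(e * e for e in errors))
    y_bar = mean(y_true)
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)   # total sum of squares
    return mae, rmse, 1 - ss_res / ss_tot


# Made-up ratings clustered near the top of the scale
y_true = [4.0, 4.5, 5.0, 4.5, 2.0]
baseline = [mean(y_true)] * len(y_true)   # always predict the mean
worse = [5.0] * len(y_true)               # always predict 5 stars

mae_base, rmse_base, r2_base = regression_metrics(y_true, baseline)
mae_worse, rmse_worse, r2_worse = regression_metrics(y_true, worse)
print(mae_base, rmse_base, r2_base)
print(mae_worse, rmse_worse, r2_worse)
```

A low MAE and an R² near zero can therefore coexist: when almost all ratings sit close to the mean, a constant prediction is already hard to beat.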


## 19. Converting `rating` into classes

- Create rating groups from `rating`:
  - **Poor (1–2)**: rating ≤ 2
  - **Neutral (3)**: rating exactly 3
  - **Good (4–5)**: every other value
- Add the new `rating_group` column to the dataset.
- Compute the percentage distribution of each class to understand the balance.

In [119]:
# Classify rating values into groups.
# Note: ratings are continuous, so only an exact 3.0 is "Neutral" and
# fractional values between 2 and 3 fall into the last group.
def rating_group(r):
    if r <= 2:
        return "Poor (1-2)"
    elif r == 3:
        return "Neutral (3)"
    else:
        return "Good (4-5)"

# Add a new column with the rating classes
df_total["rating_group"] = df_total["rating"].apply(rating_group)

# Share of each class, in percent
dist = df_total["rating_group"].value_counts(normalize=True).mul(100).round(2)
print(dist)


rating_group
Good (4-5)     97.16
Poor (1-2)      1.46
Neutral (3)     1.38
Name: proportion, dtype: float64
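
Because `rating` is continuous, the function above sends every value that is neither ≤ 2 nor exactly 3 (for example 2.5) to the last group. If range-based bands are wanted instead, one option is sketched below. The band edges (2 and 4), the function name `rating_band` and the sample ratings are all assumptions; this is not what produced the distribution shown above, which would change under different edges.

```python
from collections import Counter


def rating_band(r: float) -> str:
    """Assign a continuous rating to a band (the edges are assumptions)."""
    if r <= 2:
        return "Poor (<= 2)"
    elif r < 4:
        return "Neutral (2 to 4)"
    else:
        return "Good (>= 4)"


# Made-up ratings, including fractional values between 2 and 4
ratings = [1.0, 2.5, 3.0, 3.9, 4.0, 4.67, 5.0]
counts = Counter(rating_band(r) for r in ratings)
share = {band: round(100 * n / len(ratings), 2) for band, n in counts.items()}
print(share)
```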


---
## 📌 Conclusion of the Machine Learning Project (iFood)

### 1. What was done.
- We **prepared the 2020 and 2021 datasets**: standardized columns, created the `state` and `ano` columns, and removed irrelevant attributes.
- We **cleaned the `rating` target**, turning `0` ratings into `NaN` and keeping only values between **1 and 5**.
- We handled **outliers** in `delivery_time`, `distance` and `minimumOrderValue`.
- We built the final matrix with **numeric and categorical variables (via OneHotEncoder)**.
- We tested several regression models (**Linear, Random Forest, ExtraTrees, GradientBoosting, XGBoost**), evaluated with **MAE, RMSE and R²**.

### 2. Why we used `rating` as the target.
- The initial idea was to predict the **average rating of restaurants** from operational characteristics (delivery fee, delivery time, distance, price range, etc.).
- That would make sense as a measure of **quality as perceived by the customer**.

### 3. What we found.
- Despite a relatively low average error (MAE ≈ 0.38 stars), **R² was very low: at most ≈ 0.05 (about 5%), and negative for Random Forest and ExtraTrees**. In other words, the models **cannot explain the variability** in the ratings.
- This happens because `rating` is **extremely imbalanced**:
  - ~97% of restaurants fall into the top group (labelled 4–5).
  - Low ratings (1–2) account for only about 1.5% of restaurants.
- In addition, `rating` depends on factors **outside the dataset** (food quality, service, promotions, marketing) that the available features cannot capture.

### 4. Conclusion.
- The problem is **not the model** but the **chosen target**.
- With the available data, **predicting `rating` is not viable**: the target has too little variability and the features have too little explanatory power.
- The pipeline remains valid: we demonstrated the whole process of **EDA, feature engineering, preparation, modeling and evaluation**, and we justify **ending the predictive modeling of `rating`** here.

### 5. Recommended next steps.
- Replace the target with variables that are easier to explain, such as:
  - `delivery_fee` (predicting the delivery cost)
  - `delivery_time` (predicting the estimated time)
  - Popularity (number of ratings).
- Insisting on predicting `rating` with the current data would yield **models with little explanatory power** that would at best make a "sophisticated guess", always predicting close to 4 or 5.
- That would generate **false insights** for the business, since the model would not actually be learning relevant patterns.
- Continuing down this path would risk presenting **misleading** results to stakeholders, with no practical gain.

---
**Final summary.**  
- The project demonstrated the full ML pipeline (preparation, cleaning, modeling, evaluation).
- However, it became clear that **`rating` is not a suitable target variable** given the available features.
- The work is therefore **deliberately ended here**, on the grounds that **a good model requires a good-quality target**.
- If new data is added or the problem is redefined, the pipeline is ready to be reused.


---