<h1 style="text-align: center;">🥈 Data transformation from the bronze layer to the silver layer</h1>


<img src="Imagem_projeto.png" alt="Project overview" width="600" style="display: block; margin-left: auto; margin-right: auto;"/>


**📄 Standard transformations**

+ Null values:
    - Standardize null values that were loaded as the strings **("NaN", "nan", "None", "none", "NULL", "null", "")**

+ Basic string cleanup:
    - Remove noise such as extra spaces, line breaks, and invisible control characters

+ Duplicate ID check:
    - Check for IDs that may be duplicated in the bronze layer; when duplicates exist, keep only the latest row per ID, based on scrape_id or the most recent date
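
As a rough illustration of how these three rules combine, the sketch below applies them to a tiny invented DataFrame. The function name `standardize`, its parameters, and the toy rows are assumptions made for this example; the cells further down apply the same ideas to the real tables.

```python
import re

import numpy as np
import pandas as pd

# Null tokens taken from the list above; the toy rows below are invented.
NULL_TOKENS = ["NaN", "nan", "None", "none", "NULL", "null", ""]


def clean_text(value):
    """Collapse whitespace runs and drop any remaining control characters."""
    if not isinstance(value, str):
        return value
    value = re.sub(r"\s+", " ", value.strip())
    return "".join(ch for ch in value if ch >= " ")


def standardize(df, id_col, version_col, text_cols):
    """Apply string cleanup, null standardization and deduplication (hypothetical helper)."""
    out = df.copy()
    for col in text_cols:
        # Clean first, so a value such as "  null " is recognized as a null token.
        out[col] = out[col].map(clean_text)
        out[col] = out[col].where(~out[col].isin(NULL_TOKENS), np.nan)
    # Keep only the most recent version of each id.
    out = (out.sort_values([id_col, version_col], ascending=[True, False])
              .drop_duplicates(subset=[id_col], keep="first"))
    return out.reset_index(drop=True)


toy = pd.DataFrame({
    "id": [1, 1, 2],
    "scrape_id": [2018, 2020, 2018],
    "name": ["  Old\tname ", "New\n name", "null"],
})
result = standardize(toy, id_col="id", version_col="scrape_id", text_cols=["name"])
print(result)
```

Cleaning before null detection is a design choice of this sketch: it lets padded tokens such as `" null "` be caught as well.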


In [35]:
# 📦 Import the main libraries


import pandas as pd
import numpy as np
import re
from typing import List, Optional, Sequence
import psycopg2
from psycopg2.extras import execute_values
from sqlalchemy import create_engine
import os
from dotenv import load_dotenv
import geopandas as gpd
from shapely.geometry import Point
from urllib.parse import urlparse

In [102]:
# Load environment variables
load_dotenv(r"..\scripts\.env")


# PostgreSQL connection
try:
    DB_URI = (f"postgresql://{os.getenv('DB_USER')}:{os.getenv('DB_PASSWORD')}@{os.getenv('DB_HOST')}:{os.getenv('DB_PORT')}/{os.getenv('DB_NAME')}")


    engine = create_engine(DB_URI)
    conn = engine.connect()             # SQLAlchemy connection, used by pd.read_sql
    raw_conn = engine.raw_connection()  # DBAPI (psycopg2) connection, used by execute_values
    print("✅ Connection established successfully!")
except Exception as e:
    print(f"❌ Error connecting to the database: {e}")
    raise  # later cells need conn and raw_conn, so stop here


✅ Connection established successfully!
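
One thing to watch when building the URI by hand: a password containing characters such as `@` or `/` has to be percent-encoded, otherwise the URI is parsed incorrectly. The sketch below is one possible way to assemble it with the standard library. The helper name `build_db_uri` and the example credentials are hypothetical; only the environment variable names come from the cell above.

```python
import os
from urllib.parse import quote_plus


def build_db_uri(env=os.environ) -> str:
    """Assemble a PostgreSQL URI, percent-encoding the user and password."""
    required = ["DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT", "DB_NAME"]
    missing = [key for key in required if not env.get(key)]
    if missing:
        # Fail early instead of producing a URI that contains the text "None".
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])
    return f"postgresql://{user}:{password}@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"


# Hypothetical values, for illustration only.
example_env = {
    "DB_USER": "airbnb",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "dw",
}
uri = build_db_uri(example_env)
print(uri)
```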


### **Processing the T_DIM_ANUNCIO table**

In [63]:
# Read the Anuncio (listing) table
df_anuncio = pd.read_sql('SELECT * FROM a_bronze."T_DIM_ANUNCIO"', conn)
df_anuncio.head(3)

Unnamed: 0,id_anuncio,listing_url,scrape_id,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,picture_url
0,14063,https://www.airbnb.com/rooms/14063,20180414160018,Living in a Postcard,"Besides the most iconic's view, our apartment ...",,"Besides the most iconic's view, our apartment ...",none,Best and favorite neighborhood of Rio. Perfect...,,Everything is there. METRO is 5 min walk. Dir...,,,strictly no smoking in the apartment ! We want...,https://a0.muscache.com/im/pictures/66421/ae9b...
1,17878,https://www.airbnb.com/rooms/17878,20180414160018,Very Nice 2Br - Copacabana - WiFi,Please note that special rates apply for New Y...,- large balcony which looks out on pedestrian ...,Please note that special rates apply for New Y...,none,This is the best spot in Rio. Everything happe...,,Excellent location. Close to all major public ...,The entire apartment is yours. It is a vacatio...,I will be available throughout your stay shoul...,Please leave the apartment in a clean fashion ...,https://a0.muscache.com/im/pictures/65320518/3...
2,24480,https://www.airbnb.com/rooms/24480,20180414160018,Nice and cozy near Ipanema Beach,My studio is located in the best of Ipanema. ...,The studio is located at Vinicius de Moraes St...,My studio is located in the best of Ipanema. ...,none,"The beach, the lagoon, Ipanema is a great loca...","O prédio é bastante simples , mas o apartament...",,"From the International airport, take a regula...",Os hóspedes podem perguntar por email suas que...,Please remove sand when you come from the beac...,https://a0.muscache.com/im/pictures/11955612/b...


In [38]:
# Inspect column dtypes
df_anuncio.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70185 entries, 0 to 70184
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   id_anuncio             70185 non-null  int64 
 1   listing_url            70185 non-null  object
 2   scrape_id              70185 non-null  int64 
 3   name                   70185 non-null  object
 4   summary                70185 non-null  object
 5   space                  70185 non-null  object
 6   description            70185 non-null  object
 7   experiences_offered    70185 non-null  object
 8   neighborhood_overview  70185 non-null  object
 9   notes                  70185 non-null  object
 10  transit                70185 non-null  object
 11  access                 70185 non-null  object
 12  interaction            70185 non-null  object
 13  house_rules            70185 non-null  object
 14  picture_url            70185 non-null  object
dtypes: int64(2), object

In [57]:
# Count null values (real NaN plus the string tokens that represent null)
valores_nulos = ["NaN", "nan", "None", "none", "NULL", "null", ""]

contagem_nulos = df_anuncio.apply(lambda col: col.isnull().sum() + col.isin(valores_nulos).sum())

# Build a DataFrame with the null percentage per column.
nulos_anuncio = pd.DataFrame({
    "colunas" : contagem_nulos.index,
    "total_nulos": contagem_nulos.values,
    "%nulos" : (contagem_nulos.values / len(df_anuncio) * 100).round(2).astype(str) + "%"
})

nulos_anuncio

Unnamed: 0,colunas,total_nulos,%nulos
0,id_anuncio,0,0.0%
1,listing_url,0,0.0%
2,scrape_id,0,0.0%
3,name,83,0.12%
4,summary,4943,7.04%
5,space,30096,42.88%
6,description,3031,4.32%
7,experiences_offered,70185,100.0%
8,neighborhood_overview,34418,49.04%
9,notes,51079,72.78%


#### Notes on the Anuncio table:

+ Columns with 60% or more null values will be dropped from the Silver layer.
+ Validate the listing URLs
+ Standardize strings
+ Create audit columns
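
To make the null-threshold and audit rules concrete, here is a minimal sketch on invented data. The helper names `drop_sparse_columns` and `add_audit_columns` are assumptions for this example; the column names `fonte` and `dt_ingestao` and the 0.6 threshold follow the notebook.

```python
import pandas as pd


def drop_sparse_columns(df, threshold=0.6):
    """Drop columns whose share of nulls is at or above the threshold."""
    null_ratio = df.isnull().mean()
    to_drop = null_ratio[null_ratio >= threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop


def add_audit_columns(df, source):
    """Add the source label and a UTC ingestion timestamp."""
    out = df.copy()
    out["fonte"] = source
    out["dt_ingestao"] = pd.Timestamp.now(tz="UTC")
    return out


# Toy data, invented for illustration.
toy = pd.DataFrame({
    "id_anuncio": [1, 2, 3, 4, 5],
    "name": ["a", "b", None, "d", "e"],      # 20% null, kept
    "notes": [None, None, None, "x", None],  # 80% null, dropped
})
kept, dropped = drop_sparse_columns(toy)
silver = add_audit_columns(kept, source="Airbnb_bronze")
print(dropped)
print(silver.columns.tolist())
```

Returning the list of dropped columns alongside the DataFrame makes it easy to log which columns were removed in each run.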

#### Apply the "T_DIM_ANUNCIO" transformations and run the ETL into the silver layer

In [79]:
# Copy the DataFrame before standardizing
df = df_anuncio.copy()

# 1) Standardize null values
valores_nulos = ["NaN", "nan", "None", "none", "NULL", "null", ""]
colunas = df.select_dtypes(include="object").columns
for c in colunas:
    df[c] = df[c].replace(list(valores_nulos), np.nan)


# 2) Basic string cleanup
def clean_text(s):
    if not isinstance(s, str): return s
    s = s.strip()
    s = re.sub(r"\s+", " ", s) # collapse whitespace runs (including \n and \t) into one space
    s = "".join(ch for ch in s if ch >= " ") # drop any remaining control characters
    return s 

for c in colunas:
    df[c] = df[c].map(clean_text)


# 3) Drop columns with 60% or more null values
null_ratio = df.isnull().mean()
to_drop = null_ratio[null_ratio >= 0.6].index.to_list()
df = df.drop(columns=to_drop, errors='ignore')


# 4) Deduplicate IDs, keeping the row with the most recent scrape_id

df = (df
      .sort_values(['id_anuncio', 'scrape_id'], ascending=[True, False])
      .drop_duplicates(subset=['id_anuncio'], keep='first'))


# 5) URL validation
def is_valid_url(u):
    if not isinstance(u, str) or not u: return False
    try:
        p = urlparse(u)
        return bool(p.scheme and p.netloc)
    except ValueError:  # urlparse raises ValueError on malformed input
        return False
    
for ucol in ["listing_url","picture_url"]:
    if ucol in df.columns:
        df.loc[~df[ucol].map(is_valid_url), ucol] = np.nan


# 6) Audit columns
df["fonte"] = 'Airbnb_bronze'
df['dt_ingestao'] = pd.Timestamp.utcnow()


# 7) Adjusted DataFrame, ready to load into the database


df_anuncio_silver = df.copy()



In [80]:
df_anuncio_silver

Unnamed: 0,id_anuncio,listing_url,scrape_id,name,summary,space,description,neighborhood_overview,transit,access,interaction,house_rules,picture_url,fonte,dt_ingestao
0,14063,https://www.airbnb.com/rooms/14063,20180414160018,Living in a Postcard,"Besides the most iconic's view, our apartment ...",,"Besides the most iconic's view, our apartment ...",Best and favorite neighborhood of Rio. Perfect...,Everything is there. METRO is 5 min walk. Dire...,,,strictly no smoking in the apartment ! We want...,https://a0.muscache.com/im/pictures/66421/ae9b...,Airbnb_bronze,2025-08-17 21:58:12.064897+00:00
1,17878,https://www.airbnb.com/rooms/17878,20180414160018,Very Nice 2Br - Copacabana - WiFi,Please note that special rates apply for New Y...,- large balcony which looks out on pedestrian ...,Please note that special rates apply for New Y...,This is the best spot in Rio. Everything happe...,Excellent location. Close to all major public ...,The entire apartment is yours. It is a vacatio...,I will be available throughout your stay shoul...,Please leave the apartment in a clean fashion ...,https://a0.muscache.com/im/pictures/65320518/3...,Airbnb_bronze,2025-08-17 21:58:12.064897+00:00
49801,21280,https://www.airbnb.com/rooms/21280,20200420135919,Renovated Modern Apt. Near Beach,Immaculately renovated top-floor apartment ove...,Immaculately renovated top-floor apartment in ...,Immaculately renovated top-floor apartment ove...,This is the best neighborhood in Zona Sul. For...,The new metro station is just a few steps away...,"This is an older ""Art Deco"" style building, so...",Someone will be there at check in and check ou...,This is a booking agreement for rental of a tw...,https://a0.muscache.com/im/pictures/60851312/b...,Airbnb_bronze,2025-08-17 21:58:12.064897+00:00
2,24480,https://www.airbnb.com/rooms/24480,20180414160018,Nice and cozy near Ipanema Beach,My studio is located in the best of Ipanema. T...,The studio is located at Vinicius de Moraes St...,My studio is located in the best of Ipanema. T...,"The beach, the lagoon, Ipanema is a great loca...",,"From the International airport, take a regular...",Os hóspedes podem perguntar por email suas que...,Please remove sand when you come from the beac...,https://a0.muscache.com/im/pictures/11955612/b...,Airbnb_bronze,2025-08-17 21:58:12.064897+00:00
3,25026,https://www.airbnb.com/rooms/25026,20180414160018,Beautiful Modern Decorated Studio in Copa,"Our apartment is a little gem, everyone loves ...",This newly renovated studio (last renovations ...,"Our apartment is a little gem, everyone loves ...",Copacabana is a lively neighborhood and the ap...,At night we recommend you to take taxis only. ...,"internet wi-fi, cable tv, air cond, ceiling fa...","Only at check in, we like to leave our guests ...",Smoking outside only. Family building so pleas...,https://a0.muscache.com/im/pictures/3003965/68...,Airbnb_bronze,2025-08-17 21:58:12.064897+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68584,43514675,https://www.airbnb.com/rooms/43514675,20200524171540,Quarto no Leblon,Meu apartamento é simples e muito aconchegante...,,Meu apartamento é simples e muito aconchegante...,,,,,,https://a0.muscache.com/im/pictures/74ed68fe-b...,Airbnb_bronze,2025-08-17 21:58:12.064897+00:00
68585,43515526,https://www.airbnb.com/rooms/43515526,20200524171540,"Lindo Apartamento, 2 quadras da praia",,,,,,,,,https://a0.muscache.com/im/pictures/3dce639d-a...,Airbnb_bronze,2025-08-17 21:58:12.064897+00:00
68586,43517044,https://www.airbnb.com/rooms/43517044,20200524171540,Conforto e concentração pra sua estadia,,,,,,,,,https://a0.muscache.com/im/pictures/80589a65-9...,Airbnb_bronze,2025-08-17 21:58:12.064897+00:00
68587,43522964,https://www.airbnb.com/rooms/43522964,20200524171540,Botafogo ll,Excelente apartamento 2 quartos e dois banheir...,,Excelente apartamento 2 quartos e dois banheir...,,,,,,https://a0.muscache.com/im/pictures/b0703b26-2...,Airbnb_bronze,2025-08-17 21:58:12.064897+00:00


#### Load the b_silver."T_DIM_ANUNCIO" table into the database

In [88]:
# ETL into the silver table

def _ensure_columns(df: pd.DataFrame, target_cols: Sequence[str]) -> pd.DataFrame:
    """Ensure the DataFrame has every contract column, in the correct order."""
    df = df.copy()  # do not mutate the caller's DataFrame
    for c in target_cols:
        if c not in df.columns:
            df[c] = np.nan
    return df[list(target_cols)]


def _nan_to_none_records(df: pd.DataFrame) -> List[tuple]:
    """Convert NaN -> None (NULL in Postgres) and return a list of row tuples."""
    records = []
    for row in df.itertuples(index=False, name=None):
        # replace each float NaN with None
        fixed = tuple(None if (isinstance(x, float) and np.isnan(x)) else x for x in row)
        records.append(fixed)
    return records


def upsert_df(
    conn,
    df: pd.DataFrame,
    schema: str,
    table: str,
    target_cols: Sequence[str],
    conflict_cols: Sequence[str],
    update_cols: Optional[Sequence[str]] = None,
    page_size: int = 10000,
):
    """
    UPSERT a DataFrame into a Postgres table.
    - target_cols: columns in contract order (matching the table).
    - conflict_cols: UNIQUE/PK columns used in ON CONFLICT.
    - update_cols: columns to update when a conflict occurs
      (by default, every target column EXCEPT the conflict columns).
    """
    if not len(df):
        return  # nothing to do

    # 1) Enforce the contract (columns + order)
    df_up = _ensure_columns(df, target_cols)

    # 2) Choose the columns to update in DO UPDATE
    # The quoted list is built up front: nesting it inside the f-string with
    # backslash-escaped quotes only parses on Python 3.12+.
    conflict_sql = ", ".join(f'"{c}"' for c in conflict_cols)
    if update_cols is None:
        update_cols = [c for c in target_cols if c not in conflict_cols]
    if not update_cols:
        # Nothing left to update: use DO NOTHING to avoid an empty SET clause
        on_conflict_sql = f'ON CONFLICT ({conflict_sql}) DO NOTHING'
    else:
        set_sql = ", ".join(f'"{c}"=EXCLUDED."{c}"' for c in update_cols)
        on_conflict_sql = f'ON CONFLICT ({conflict_sql}) DO UPDATE SET {set_sql}'

    # 3) Build the INSERT with a %s placeholder (execute_values fills it in batches)
    cols_sql = ", ".join(f'"{c}"' for c in target_cols)
    sql = f'INSERT INTO {schema}."{table}" ({cols_sql}) VALUES %s {on_conflict_sql};'

    # 4) Convert NaN -> None and send in batches
    records = _nan_to_none_records(df_up)
    with conn.cursor() as cur:
        execute_values(cur, sql, records, page_size=page_size)
    conn.commit()

In [89]:

TARGET_COLS = [
    "id_anuncio","listing_url","scrape_id","name","summary","space","description",
    "neighborhood_overview","transit","access","interaction","house_rules",
    "picture_url","fonte","dt_ingestao"
]


upsert_df(
    conn=raw_conn,  
    df=df_anuncio_silver,
    schema="b_silver",
    table="T_DIM_ANUNCIO",
    target_cols=TARGET_COLS,
    conflict_cols=["id_anuncio"],   # UPSERT key (PK/UNIQUE)
    # update_cols=None  -> updates every column except the conflict key (default)
    page_size=10000
)
print("✅ T_DIM_ANUNCIO loaded/updated in Silver.")

✅ T_DIM_ANUNCIO loaded/updated in Silver.
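
To see what statement an upsert of this kind sends to Postgres, the sketch below rebuilds only the SQL string, without touching the database. The function name `build_upsert_sql` and the shortened column list are assumptions for this example; the `INSERT ... ON CONFLICT ... DO UPDATE SET col = EXCLUDED.col` shape is the one used by `upsert_df`.

```python
def build_upsert_sql(schema, table, target_cols, conflict_cols, update_cols=None):
    """Return an INSERT ... ON CONFLICT statement with a single %s placeholder."""
    def quoted(cols):
        return ", ".join(f'"{c}"' for c in cols)

    if update_cols is None:
        # Default: update every column that is not part of the conflict key.
        update_cols = [c for c in target_cols if c not in conflict_cols]
    if update_cols:
        set_sql = ", ".join(f'"{c}"=EXCLUDED."{c}"' for c in update_cols)
        action = f"DO UPDATE SET {set_sql}"
    else:
        action = "DO NOTHING"
    return (f'INSERT INTO {schema}."{table}" ({quoted(target_cols)}) VALUES %s '
            f"ON CONFLICT ({quoted(conflict_cols)}) {action};")


sql = build_upsert_sql(
    schema="b_silver",
    table="T_DIM_ANUNCIO",
    target_cols=["id_anuncio", "name", "fonte"],  # shortened list for the example
    conflict_cols=["id_anuncio"],
)
print(sql)
```

`EXCLUDED` refers to the row that was proposed for insertion, so on a key collision the existing row takes the incoming values.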


### **Processing the T_DIM_ANFITRIAO table**

In [4]:
# Read the Anfitriao (host) table

df_anfitriao = pd.read_sql('SELECT * FROM a_bronze."T_DIM_ANFITRIAO"', conn)
df_anfitriao.head(3)

Unnamed: 0,host_id,fk_anuncio,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified
0,53598,14063,https://www.airbnb.com/users/show/53598,Shalev,2009-11-12,FL,"Hello , my name is Shalev , I am an orchestra ...",,,f,https://a0.muscache.com/im/users/53598/profile...,https://a0.muscache.com/im/users/53598/profile...,Botafogo,1.0,1.0,"['email', 'phone', 'reviews', 'jumio']",t,t
1,68997,17878,https://www.airbnb.com/users/show/68997,Matthias,2010-01-08,"Rio de Janeiro, Rio de Janeiro, Brazil",I used to work as a journalist all around the ...,within an hour,100%,t,https://a0.muscache.com/im/pictures/67b13cea-8...,https://a0.muscache.com/im/pictures/67b13cea-8...,Copacabana,2.0,2.0,"['email', 'phone', 'reviews']",t,f
2,99249,24480,https://www.airbnb.com/users/show/99249,Goya,2010-03-26,"Rio de Janeiro, Rio de Janeiro, Brazil",Welcome to Rio!\r\nI am a filmmaker and a tea...,within an hour,100%,f,https://a0.muscache.com/im/pictures/6b40475c-2...,https://a0.muscache.com/im/pictures/6b40475c-2...,Ipanema,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t


##### Initial analysis: data types, null values, new features.

In [47]:
df_anfitriao.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43180 entries, 0 to 43179
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   host_id                    43180 non-null  int64  
 1   fk_anuncio                 43180 non-null  int64  
 2   host_url                   43180 non-null  object 
 3   host_name                  43180 non-null  object 
 4   host_since                 43147 non-null  object 
 5   host_location              43180 non-null  object 
 6   host_about                 43180 non-null  object 
 7   host_response_time         43180 non-null  object 
 8   host_response_rate         43180 non-null  object 
 9   host_is_superhost          43180 non-null  object 
 10  host_thumbnail_url         43180 non-null  object 
 11  host_picture_url           43180 non-null  object 
 12  host_neighbourhood         43180 non-null  object 
 13  host_listings_count        43147 non-null  flo

In [48]:
# Count null values (real NaN plus the string tokens that represent null)
valores_nulos = ["NaN", "nan", "None", "none", "NULL", "null", ""]

contagem_nulos = df_anfitriao.apply(lambda col: col.isnull().sum() + col.isin(valores_nulos).sum())

# Build a DataFrame with the null percentage per column.
nulos_anfitriao = pd.DataFrame({
    "colunas" : contagem_nulos.index,
    "total_nulos": contagem_nulos.values,
    "%nulos" : (contagem_nulos.values / len(df_anfitriao) * 100).round(2).astype(str) + "%"
})
nulos_anfitriao

Unnamed: 0,colunas,total_nulos,%nulos
0,host_id,0,0.0%
1,fk_anuncio,0,0.0%
2,host_url,0,0.0%
3,host_name,33,0.08%
4,host_since,33,0.08%
5,host_location,332,0.77%
6,host_about,27467,63.61%
7,host_response_time,19886,46.05%
8,host_response_rate,19886,46.05%
9,host_is_superhost,33,0.08%


In [49]:
df_anfitriao['host_total_listings_count'].sort_values(ascending=False)

38681    1495.0
42570     336.0
38634     327.0
38636     325.0
38633     313.0
          ...  
28037       NaN
28825       NaN
40584       NaN
40996       NaN
41908       NaN
Name: host_total_listings_count, Length: 43180, dtype: float64

In [84]:
# Row count per host_id
dup_counts = df_anfitriao.groupby("host_id").size().sort_values(ascending=False)

n_hosts = dup_counts.shape[0]
n_rows  = len(df_anfitriao)
n_dups  = (dup_counts > 1).sum()

print(f"rows={n_rows:,}  unique hosts={n_hosts:,}  hosts with duplicates={n_dups:,}")
dup_counts[dup_counts > 1].head(10)

rows=43,180  unique hosts=43,180  hosts with duplicates=0


Series([], dtype: int64)

#### Notes:
+ Column **host_since**         : stored as object, but it holds dates
+ Column **host_response_rate** : stored as object, but it holds percentages
+ Column **host_about**         : will be dropped because more than 60% of its values are null and it has little relevance to the analysis
+ Column **host_listings_count**: stored as float, but it holds integer values **(the data type also needs to be fixed in the database)**
+ Column **host_total_listings_count** : stored as float, but it holds integer values
+ Columns **"host_is_superhost", "host_has_profile_pic", "host_identity_verified"**: convert t/f to True and False
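
The conversions listed in these notes can be pictured value by value. The sketch below uses plain Python helpers (`parse_rate`, `parse_flag`, `parse_date` are hypothetical names) that return `None` for missing or malformed input. That is a deliberate difference from the notebook's own cell, which fills missing rates with 0 and missing dates with 1900-01-01; treat it as an alternative to compare against, not as the pipeline's behaviour.

```python
import math
from datetime import date
from typing import Optional


def parse_rate(value) -> Optional[float]:
    """'100%' -> 1.0; missing or malformed input -> None."""
    if not isinstance(value, str):
        return None
    try:
        rate = float(value.strip().rstrip("%")) / 100
    except ValueError:
        return None
    # float("nan") parses successfully, so filter it out explicitly.
    return None if math.isnan(rate) else rate


def parse_flag(value) -> Optional[bool]:
    """'t' -> True, 'f' -> False, anything else -> None."""
    return {"t": True, "f": False}.get(value)


def parse_date(value) -> Optional[date]:
    """ISO date string -> date; missing or malformed input -> None."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


print(parse_rate("100%"), parse_flag("t"), parse_date("2009-11-12"))
print(parse_rate(None), parse_flag(None), parse_date(""))
```

Keeping missing values as `None` lets them arrive in Postgres as NULL, so "unknown" is not confused with a genuine 0% response rate or a `False` flag.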

In [120]:
# Copy the DataFrame before standardizing
df = df_anfitriao.copy()

# 1) Standardize null values
valores_nulos = ["NaN", "nan", "None", "none", "NULL", "null", ""]
colunas = df.select_dtypes(include="object").columns
for c in colunas:
    df[c] = df[c].replace(list(valores_nulos), np.nan)


# 2) Basic string cleanup
def clean_text(s):
    if not isinstance(s, str): return s
    s = s.strip()
    s = re.sub(r"\s+", " ", s) # collapse whitespace runs (including \n and \t) into one space
    s = "".join(ch for ch in s if ch >= " ") # drop any remaining control characters
    return s 

for c in colunas:
    df[c] = df[c].map(clean_text)


# 3) Drop columns with 60% or more null values
null_ratio = df.isnull().mean()
to_drop = null_ratio[null_ratio >= 0.6].index.to_list()
df = df.drop(columns=to_drop, errors='ignore')



# 4) Convert the object column to datetime; missing dates get the sentinel 1900-01-01
df['host_since'] = pd.to_datetime(df['host_since'], errors='coerce')
df['host_since'] = df['host_since'].fillna(pd.Timestamp("1900-01-01"))


# 5) host_response_rate: strip the "%" sign, replace null tokens with 0 and store the value as a proportion
df['host_response_rate'] = (df['host_response_rate']
                                      .astype(str)
                                      .str.strip()
                                      .str.replace('%','', regex=False)
                                      .replace(valores_nulos, 0)
                                      .astype(float)
                                      )/100


# 6) Convert host_listings_count and host_total_listings_count from float to integer (missing -> 0)

fltcolumns = ['host_listings_count', 'host_total_listings_count']

for col in fltcolumns:
    df[col] = (pd.to_numeric(df[col], errors='coerce')
                         .fillna(0)
                         .astype(int)
                         )

# 7) Convert t/f flags to True/False
# isin() returns False for "f" and for missing values; casting NaN with astype("bool") would yield True
for c in ["host_is_superhost","host_has_profile_pic","host_identity_verified"]:
    if c in df.columns:
        df[c] = df[c].isin(["t", True])




# 8) Standardize categories (values outside the map become NaN)
map_resp_time = {
    "within an hour":"within an hour",
    "within a few hours":"within a few hours",
    "within a day":"within a day",
    "a few days or more":"a few days or more"
}
df["host_response_time"] = df["host_response_time"].str.strip().str.lower().map(map_resp_time)



# 9) Validate URLs (reuses is_valid_url from the T_DIM_ANUNCIO cell)
for ucol in ["host_url","host_picture_url","host_thumbnail_url"]:
    if ucol in df.columns:
        df.loc[~df[ucol].map(is_valid_url), ucol] = np.nan


# 10) host_verifications: turn a serialized list (text such as ['email','phone']) into a clean list

import ast
def parse_verifs(v):
    if isinstance(v, list): return v
    if not isinstance(v, str) or not v.strip(): return None
    try:
        x = ast.literal_eval(v)
        if isinstance(x, list):
            return [str(i).strip().lower() for i in x]
    except (ValueError, SyntaxError):  # not a valid Python literal
        pass
    return None

df["host_verifications"] = df["host_verifications"].apply(parse_verifs)


# 11) Audit columns
df["fonte"] = 'Airbnb_bronze'
df['dt_ingestao'] = pd.Timestamp.utcnow()



# 12) Adjusted DataFrame, ready to load into the database
df_anfitriao_silver = df.copy()
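
The `host_verifications` parsing relies on `ast.literal_eval`, which evaluates only Python literals and therefore does not execute arbitrary code from the scraped text. The standalone sketch below shows how such a parser behaves on a few inputs; the name `parse_list_literal` and the sample strings are invented for this example.

```python
import ast


def parse_list_literal(value):
    """Parse text such as "['email', 'phone']" into a list of lowercase strings."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str) or not value.strip():
        return None
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Raised when the text is not a valid Python literal.
        return None
    if isinstance(parsed, list):
        return [str(item).strip().lower() for item in parsed]
    return None


print(parse_list_literal("['email', 'Phone ', 'reviews']"))
print(parse_list_literal("email, phone"))
print(parse_list_literal("None"))
```

When the result is loaded with psycopg2, Python lists are adapted to PostgreSQL arrays, so this assumes the target column is an array type such as `text[]`.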



#### Load the b_silver."T_DIM_ANFITRIAO" table into the database

In [122]:

TARGET_COLS = ['host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_response_time', 'host_response_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'fonte',
       'dt_ingestao'
]


upsert_df(
    conn=raw_conn,  
    df=df_anfitriao_silver,
    schema="b_silver",
    table="T_DIM_ANFITRIAO",
    target_cols=TARGET_COLS,
    conflict_cols=["host_id"],   # UPSERT key (PK/UNIQUE)
    # update_cols=None  -> updates every column except the conflict key (default)
    page_size=10000
)
print("✅ T_DIM_ANFITRIAO loaded/updated in Silver.")

✅ T_DIM_ANFITRIAO loaded/updated in Silver.


### **Processing the T_DIM_LOCALIZACAO table**

In [148]:
# Read the Localizacao (location) table

df_localizacao = pd.read_sql('SELECT * FROM a_bronze."T_DIM_LOCALIZACAO"', conn)
df_localizacao.head()

Unnamed: 0,id_localizacao,fk_anuncio,street,neighbourhood,neighbourhood_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact
0,1,14063,"Rio de Janeiro, RJ, Brazil",Botafogo,Botafogo,Rio de Janeiro,RJ,22250-040,Rio De Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.946854,-43.182737,t
1,2,17878,"Rio de Janeiro, Rio de Janeiro, Brazil",Copacabana,Copacabana,Rio de Janeiro,Rio de Janeiro,22020-050,Rio De Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.965919,-43.178962,t
2,3,24480,"Rio de Janeiro, Rio de Janeiro, Brazil",Ipanema,Ipanema,Rio de Janeiro,Rio de Janeiro,22411-010,Rio De Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.985698,-43.201935,t
3,4,25026,"Rio de Janeiro, Rio de Janeiro, Brazil",Copacabana,Copacabana,Rio de Janeiro,Rio de Janeiro,22060-020,Rio De Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.977117,-43.190454,t
4,5,31560,"Rio de Janeiro, RJ, Brazil",Ipanema,Ipanema,Rio de Janeiro,RJ,22410-003,Rio De Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.983024,-43.21427,t


In [149]:
df_localizacao.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 902210 entries, 0 to 902209
Data columns (total 15 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   id_localizacao          902210 non-null  int64  
 1   fk_anuncio              902210 non-null  int64  
 2   street                  902210 non-null  object 
 3   neighbourhood           902210 non-null  object 
 4   neighbourhood_cleansed  902210 non-null  object 
 5   city                    902210 non-null  object 
 6   state                   902210 non-null  object 
 7   zipcode                 902210 non-null  object 
 8   market                  902210 non-null  object 
 9   smart_location          902210 non-null  object 
 10  country_code            902210 non-null  object 
 11  country                 902210 non-null  object 
 12  latitude                902210 non-null  float64
 13  longitude               902210 non-null  float64
 14  is_location_exact   

In [150]:
# Check how consistent the values are
print(df_localizacao['smart_location'].unique())
print("-"*60)
print(df_localizacao['market'].unique())
print("-"*60)
print(df_localizacao['country_code'].unique())
print("-"*60)
print(df_localizacao['city'].unique())

['Rio de Janeiro, Brazil' 'Rio, Brazil' 'Joatinga, Brazil'
 'Copacabana, Brazil' 'Rio de janeiro , Brazil' 'Rio De Janeiro, Brazil'
 'Itanhangá, Brazil' 'Barra da Tijuca, Brazil' 'Glória, Brazil'
 'Jacarepagua, Brazil' 'Pitangueiras, Brazil' 'Ipanema, Brazil'
 'Estácio, Brazil' 'Vidigal, Brazil' 'Santa Teresa, Brazil'
 'Santa Tereza, Brazil' 'Jardim Botanico , Brazil'
 'Rio de janeiro, Brazil' 'Leblon, Brazil' 'Lagoa, Brazil'
 'Rio de Janeiro / Copacabana , Brazil' 'RJ, Brazil' 'Urca, Brazil'
 'Tijuca, Brazil' 'Rio de Janeiro - 22440-000, Brazil'
 'angra dos reis, Brazil' 'Rio de Janeiro -IPANEMA, Brazil'
 'Copacabana, Rio de Janeiro , Brazil'
 'Rio de Janeiro - Laranjeiras, Brazil' 'Riode Janeiro, Brazil'
 ' Copacabana, Brazil' 'Rio, Copacabana, Brazil' 'Rio de Janeiro , Brazil'
 'Vila Isabel, Brazil' 'Catete, Brazil' 'Vargem Grande, Brazil'
 'Leblon, Rio de Janeiro, Brazil' 'Centro, Brazil' 'Rj, Brazil'
 'リオ・デ・ジャネイロ, Brazil' 'Colegio, Brazil' 'São Conrado, Brazil'
 'Andarai, Brazil' 

#### Notes:

+ Column **neighbourhood**: this column holds the value entered by the host, which Airbnb then normalizes into the **neighbourhood_cleansed** column. The neighbourhood column will therefore be dropped and neighbourhood_cleansed renamed to neighbourhood
+ Column **smart_location**: holds inconsistent free-text values (see the output above) and will be rebuilt as "City, Country"



In [151]:
# Use a public REST API to fetch the neighbourhood boundaries of Rio de Janeiro.

url_bairros = "https://pgeo3.rio.rj.gov.br/arcgis/rest/services/Cartografia/Limites_administrativos/MapServer/4/query?where=1=1&outFields=*&f=geojson"
bairros_rio = gpd.read_file(url_bairros)

print(bairros_rio.columns)
print(bairros_rio.head())


Index(['objectid', 'nome', 'regiao_adm', 'area_plane', 'codbairro', 'codra',
       'codbnum', 'link', 'rp', 'cod_rp', 'codbairro_long', 'st_area(shape)',
       'st_perimeter(shape)', 'geometry'],
      dtype='object')
   objectid            nome                regiao_adm area_plane codbairro  \
0       481         Grumari  BARRA DA TIJUCA                   4       133   
1       402  Jardim Sulacap  REALENGO                          5       137   
2       425           Saúde  PORTUARIA                         1       001   
3       377        Vaz Lobo  MADUREIRA                         3       084   
4       354         Ribeira  ILHA DO GOVERNADOR                3       091   

   codra  codbnum                                               link  \
0     24      133  Grumari                   &area=133           ...   
1     33      137  Jardim Sulacap            &area=137           ...   
2      1        1  Saúde                     &area=1             ...   
3     15       84  Vaz 

In [152]:
# 1) Set the CRS and build a GeoDataFrame of points (lon, lat)
gdf_pts = gpd.GeoDataFrame(
    df_localizacao.copy(),
    geometry=gpd.points_from_xy(df_localizacao['longitude'], df_localizacao['latitude']),
    crs="EPSG:4326"
)

# 2) Make sure bairros_rio uses the same CRS
if bairros_rio.crs is None:
    bairros_rio = bairros_rio.set_crs("EPSG:4326")
elif bairros_rio.crs != gdf_pts.crs:
    bairros_rio = bairros_rio.to_crs(gdf_pts.crs)

# 3) Spatial join (point inside the neighbourhood polygon)
gjoined = gpd.sjoin(
    gdf_pts,
    bairros_rio[['nome', 'regiao_adm', 'rp', 'geometry']],  # keep only the columns needed
    how="left",
    predicate="within"
).drop(columns=['index_right'])
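
The `within` predicate asks, for each listing, whether its point lies inside a neighbourhood polygon. geopandas delegates that test to its geometry engine; the pure-Python ray-casting sketch below only illustrates the idea and is not how the library implements it. The rectangular "neighbourhood" is invented, while the first test point is the coordinate of listing 14063 from the table above.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray going east from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # an odd number of crossings means "inside"
    return inside


# Hypothetical rectangle given as (lon, lat) vertices.
box = [(-43.20, -22.96), (-43.17, -22.96), (-43.17, -22.93), (-43.20, -22.93)]

print(point_in_polygon(-43.182737, -22.946854, box))  # listing 14063
print(point_in_polygon(-43.25, -22.946854, box))      # a point west of the box
```

With `how="left"`, listings that fall in no polygon are kept and simply get missing values in the joined columns, which is why the attribution percentage can be below 100%.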

In [158]:
# Field standardization

def title_pt(s: pd.Series) -> pd.Series:
    def fix(x):
        if not isinstance(x, str): return x
        t = x.title()
        # lowercase the Portuguese connectives that str.title() capitalizes
        for a,b in [(" De "," de "),(" Da "," da "),(" Do "," do "),
                    (" Dos "," dos "),(" Das "," das "),(" Em "," em "),(" E "," e ")]:
            t = t.replace(a,b)
        return t.strip()
    return s.apply(fix)

# Official neighbourhood name
gjoined['neighbourhood_cleansed'] = title_pt(gjoined['nome'])

# City and market: if the point fell inside a Rio neighbourhood, set "Rio de Janeiro"
gjoined['city'] = gjoined['city'].where(gjoined['neighbourhood_cleansed'].isna(), 'Rio de Janeiro')
gjoined['market'] = gjoined['market'].where(gjoined['neighbourhood_cleansed'].isna(), 'Rio de Janeiro')

# Country (fixes wrong scrapes such as "Andorra")
gjoined['country_code'] = gjoined.get('country_code', pd.Series(index=gjoined.index))
gjoined['country'] = gjoined.get('country', pd.Series(index=gjoined.index))
gjoined.loc[gjoined['city'].eq('Rio de Janeiro'), ['country_code','country']] = ['BR','Brazil']

# Fallback using a bounding box for Brazil (optional)
mask_bbox_br = gjoined['longitude'].between(-74, -34) & gjoined['latitude'].between(-34, 5)
gjoined.loc[mask_bbox_br, ['country_code','country']] = ['BR','Brazil']

# smart_location = "City, Country"
gjoined['smart_location'] = (
    title_pt(gjoined['city']).fillna('') + ', ' + title_pt(gjoined['country']).fillna('')
).str.strip(', ')

# Standardize capitalization
gjoined['city'] = title_pt(gjoined['city'])
gjoined['market'] = title_pt(gjoined['market'])
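
The capitalization rule can be tried on single strings. The sketch below is a word-based variant of the same idea, written for scalar values: it is not the notebook's `title_pt` (which works on a Series and uses substring replacement), and the name `title_pt_word` is an assumption for this example.

```python
# Connectives that stay lowercase unless they are the first word.
LOWERCASE_WORDS = ("De", "Da", "Do", "Dos", "Das", "Em", "E")


def title_pt_word(text):
    """Title-case a place name, keeping Portuguese connectives lowercase."""
    if not isinstance(text, str):
        return text
    words = text.strip().title().split()
    fixed = [
        word.lower() if index > 0 and word in LOWERCASE_WORDS else word
        for index, word in enumerate(words)
    ]
    return " ".join(fixed)


for raw in ["RIO DE JANEIRO", "barra da tijuca", " angra dos reis "]:
    print(repr(raw), "->", repr(title_pt_word(raw)))
```

Splitting into words also collapses repeated spaces, which the substring approach leaves untouched.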

In [159]:
# Drop the geometry column to get back to a plain DataFrame
df_localizacao_silver = pd.DataFrame(gjoined.drop(columns='geometry'))

log_geo = {
    "pct_bairro_atribuido": round(df_localizacao_silver['neighbourhood_cleansed'].notna().mean()*100, 2),
    "pct_city_atribuido": round(df_localizacao_silver['city'].notna().mean()*100, 2),
    "corrigidos_para_BR": int(((df_localizacao_silver['country_code'] == 'BR') & (df_localizacao.get('country_code') != 'BR')).sum()),
    "total_linhas": len(df_localizacao_silver),
}
print(log_geo)

{'pct_bairro_atribuido': np.float64(99.96), 'pct_city_atribuido': np.float64(100.0), 'corrigidos_para_BR': 31, 'total_linhas': 902210}


In [160]:
df_localizacao_silver.head()

Unnamed: 0,id_localizacao,fk_anuncio,street,neighbourhood,neighbourhood_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,nome,regiao_adm,rp
0,1,14063,"Rio de Janeiro, RJ, Brazil",Botafogo,Botafogo,Rio de Janeiro,RJ,22250-040,Rio de Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.946854,-43.182737,t,Botafogo,BOTAFOGO,Zona Sul
1,2,17878,"Rio de Janeiro, Rio de Janeiro, Brazil",Copacabana,Copacabana,Rio de Janeiro,Rio de Janeiro,22020-050,Rio de Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.965919,-43.178962,t,Copacabana,COPACABANA,Zona Sul
2,3,24480,"Rio de Janeiro, Rio de Janeiro, Brazil",Ipanema,Ipanema,Rio de Janeiro,Rio de Janeiro,22411-010,Rio de Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.985698,-43.201935,t,Ipanema,LAGOA,Zona Sul
3,4,25026,"Rio de Janeiro, Rio de Janeiro, Brazil",Copacabana,Copacabana,Rio de Janeiro,Rio de Janeiro,22060-020,Rio de Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.977117,-43.190454,t,Copacabana,COPACABANA,Zona Sul
4,5,31560,"Rio de Janeiro, RJ, Brazil",Ipanema,Ipanema,Rio de Janeiro,RJ,22410-003,Rio de Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,-22.983024,-43.21427,t,Ipanema,LAGOA,Zona Sul


In [161]:
df_localizacao_silver['market'].value_counts()

market
Rio de Janeiro           902208
Other (International)         2
Name: count, dtype: int64

In [162]:
# Remove the two rows that could not be mapped with geopandas
df_localizacao_silver = df_localizacao_silver[df_localizacao_silver['market'] != 'Other (International)']

In [163]:
# Replace the original neighbourhood columns with the name obtained from the spatial join
df_localizacao_silver = (
    df_localizacao_silver
    .drop(columns=['neighbourhood', 'neighbourhood_cleansed'])
    .rename(columns={'nome': 'neighbourhood'})
)

In [164]:
# Definition of the tourist points of interest (POIs)
pois = pd.DataFrame({
    'poi': [
        'Cristo Redentor','Pao de Acucar','Praia de Copacabana','Praia de Ipanema',
        'Maracana','Arcos da Lapa','Museu do Amanha','Jardim Botanico'
    ],
    'lat': [-22.951916, -22.948611, -22.971177, -22.986869, -22.912161, -22.912167, -22.895911, -22.968801],
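    # NOTE: the 4th longitude (Praia de Ipanema, -43.155444) lies east of Copacabana (-43.182543),
    # although Ipanema is west of Copacabana; this value probably needs checking.
    'lon': [-43.210487, -43.156389, -43.182543, -43.155444, -43.230184, -43.179954, -43.180763, -43.223593]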
})


gdf_pts = gpd.GeoDataFrame(
    df_localizacao_silver.copy(),
    geometry=gpd.points_from_xy(df_localizacao_silver['longitude'], df_localizacao_silver['latitude']),
    crs="EPSG:4326"
)

gdf_pois = gpd.GeoDataFrame(
    pois.copy(),
    geometry=gpd.points_from_xy(pois['lon'], pois['lat']),
    crs="EPSG:4326"
)

# Project to a metric CRS (EPSG:31983, SIRGAS 2000 / UTM zone 23S)
gdf_pts_m = gdf_pts.to_crs(31983)
gdf_pois_m = gdf_pois.to_crs(31983)

# Distance from every listing to each POI, in km
for _, r in gdf_pois_m.iterrows():
    col = f"dist_{r['poi'].lower().replace(' ','_')}_km"
    gdf_pts_m[col] = gdf_pts_m.geometry.distance(r.geometry) / 1000.0

# Nearest POI (using the distance columns computed above)
dist_cols = [c for c in gdf_pts_m.columns if c.startswith('dist_') and c.endswith('_km')]
gdf_pts_m['nearest_poi_km'] = gdf_pts_m[dist_cols].min(axis=1)
gdf_pts_m['nearest_poi_name'] = gdf_pts_m[dist_cols].idxmin(axis=1).str.replace(r'^dist_|_km$', '', regex=True)

# Convert back to a plain DataFrame
df_localizacao_silver = pd.DataFrame(gdf_pts_m.drop(columns='geometry'))


In [165]:
df_localizacao_silver.head(5)

Unnamed: 0,id_localizacao,fk_anuncio,street,city,state,zipcode,market,smart_location,country_code,country,...,dist_cristo_redentor_km,dist_pao_de_acucar_km,dist_praia_de_copacabana_km,dist_praia_de_ipanema_km,dist_maracana_km,dist_arcos_da_lapa_km,dist_museu_do_amanha_km,dist_jardim_botanico_km,nearest_poi_km,nearest_poi_name
0,1,14063,"Rio de Janeiro, RJ, Brazil",Rio de Janeiro,RJ,22250-040,Rio de Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,...,2.900808,2.709394,2.693764,5.241443,6.200742,3.85205,5.645376,4.843959,2.693764,praia_de_copacabana
1,2,17878,"Rio de Janeiro, Rio de Janeiro, Brazil",Rio de Janeiro,Rio de Janeiro,22020-050,Rio de Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,...,3.585761,3.005603,0.688432,3.346509,7.940186,5.953718,7.755335,4.58799,0.688432,praia_de_copacabana
2,3,24480,"Rio de Janeiro, Rio de Janeiro, Brazil",Rio de Janeiro,Rio de Janeiro,22411-010,Rio de Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,...,3.842621,6.219783,2.557381,4.768766,8.643863,8.449579,10.177916,2.904091,2.557381,praia_de_copacabana
3,4,25026,"Rio de Janeiro, Rio de Janeiro, Brazil",Rio de Janeiro,Rio de Janeiro,22060-020,Rio de Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,...,3.46551,4.708603,1.044435,3.748849,8.26755,7.273149,9.048024,3.520814,1.044435,praia_de_copacabana
4,5,31560,"Rio de Janeiro, RJ, Brazil",Rio de Janeiro,RJ,22410-003,Rio de Janeiro,"Rio de Janeiro, Brazil",BR,Brazil,...,3.466836,7.053931,3.507942,6.046838,8.015621,8.600274,10.241277,1.84254,1.84254,jardim_botanico
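
As a sanity check on the projected distances, the great-circle (haversine) distance can be computed directly from latitude and longitude. This is a minimal sketch under stated assumptions: `haversine_km` is a hypothetical helper that does not exist in the notebook, it treats the Earth as a sphere with a mean radius of 6371.0088 km, and it uses the Cristo Redentor and Pao de Acucar coordinates from the POI table above.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in km; accepts scalars or NumPy arrays in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Cristo Redentor -> Pao de Acucar, coordinates taken from the POI table
d_cristo_pao = haversine_km(-22.951916, -43.210487, -22.948611, -43.156389)
print(round(float(d_cristo_pao), 2))

# Vectorised use: distance from several points to a single POI
lats = np.array([-22.946854, -22.965919])
lons = np.array([-43.182737, -43.178962])
d_to_cristo = haversine_km(lats, lons, -22.951916, -43.210487)
```

Over distances of a few kilometres inside a single UTM zone, the spherical and projected results should agree closely; a large gap between them would point to swapped latitude/longitude or a wrong CRS.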


### **Processing the T_DIM_PROPRIEDADE table**

In [7]:
# Read the property table
df_propriedade = pd.read_sql('SELECT * FROM a_bronze."T_DIM_PROPRIEDADE"', conn)

In [8]:
df_propriedade

Unnamed: 0,id_propriedade,fk_anuncio,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365
0,1,14063,Apartment,Entire home/apt,4,1.0,0.0,2.0,Real Bed,"{TV,Internet,""Air conditioning"",Kitchen,Doorma...",60.0,365.0,7 weeks ago,t,28.0,58.0,88.0,363.0
1,2,17878,Condominium,Entire home/apt,5,1.0,2.0,2.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",4.0,90.0,yesterday,t,11.0,29.0,58.0,286.0
2,3,24480,Apartment,Entire home/apt,2,1.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",Wifi,""Air conditioning"",""First ...",3.0,90.0,5 weeks ago,t,0.0,0.0,0.0,0.0
3,4,25026,Apartment,Entire home/apt,3,1.0,1.0,2.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",4.0,30.0,today,t,28.0,58.0,88.0,363.0
4,5,31560,Apartment,Entire home/apt,3,1.0,1.0,2.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",2.0,1125.0,5 weeks ago,t,15.0,45.0,75.0,345.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
902205,902206,38844730,Apartment,Entire home/apt,4,1.0,0.0,2.0,Real Bed,"{TV,Wifi,""Air conditioning"",Pool,Kitchen,""Free...",1.0,1125.0,today,t,16.0,46.0,76.0,351.0
902206,902207,38846408,Apartment,Entire home/apt,4,2.0,2.0,3.0,Real Bed,"{TV,Wifi,""Air conditioning"",Pool,Kitchen,""Free...",2.0,1125.0,today,t,23.0,53.0,83.0,83.0
902207,902208,38846703,Apartment,Entire home/apt,5,1.0,1.0,2.0,Real Bed,"{TV,Wifi,""Air conditioning"",Kitchen,Elevator,W...",3.0,1125.0,today,t,30.0,60.0,90.0,365.0
902208,902209,38847050,Apartment,Entire home/apt,4,1.0,1.0,1.0,Real Bed,"{TV,Wifi,""Air conditioning"",Pool,Kitchen,""Free...",1.0,1125.0,today,t,17.0,47.0,77.0,77.0


#### Notes:

+ Column **"bathrooms"**: a decimal value makes sense here, since 1.5 = one full bathroom plus a half bath (lavabo)

+ Columns **"beds", "bedrooms", "minimum_nights", "maximum_nights", "availability_30", "availability_60", "availability_90", "availability_365"**: stored as float, but should be integers

+ Column **"calendar_updated"**: in the sample above it holds relative-date text (e.g. "7 weeks ago"), not a number
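
The float-to-integer conversion noted above could be sketched as follows. This is an illustration on toy data, not the notebook's actual implementation: `to_nullable_int` is a hypothetical helper, the `sample` frame is made up, and pandas' nullable `Int64` dtype is an assumed choice so that missing values survive the cast.

```python
from typing import Sequence

import numpy as np
import pandas as pd

def to_nullable_int(df: pd.DataFrame, cols: Sequence[str]) -> pd.DataFrame:
    """Cast float columns to nullable Int64 when every non-null value is whole."""
    out = df.copy()
    for col in cols:
        values = out[col].dropna()
        # Only cast when no information would be lost (e.g. skip a column holding 1.5)
        if ((values % 1) == 0).all():
            out[col] = out[col].astype("Int64")
    return out

# Toy data: 'half_steps' holds a fractional value and must stay float
sample = pd.DataFrame({
    "bedrooms": [0.0, 2.0, np.nan],
    "minimum_nights": [60.0, 4.0, 3.0],
    "half_steps": [1.0, 1.5, 2.0],
})

converted = to_nullable_int(sample, ["bedrooms", "minimum_nights", "half_steps"])
print(converted.dtypes)
```

The whole-number guard keeps the helper from failing on, or silently altering, a column that legitimately carries fractions, while columns with missing values become `<NA>` instead of forcing the whole column to stay float.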