<h1 style="text-align: center;">🥈Bronze-to-Silver Layer Data Transformation</h1>


<img src="Imagem_projeto.png" alt="Project Overview" width="600" style="display: block; margin-left: auto; margin-right: auto;"/>


In [64]:
# 📦 Import the main libraries

import pandas as pd
import psycopg2
from sqlalchemy import create_engine
import os
from dotenv import load_dotenv

In [81]:
# Load the environment variables
load_dotenv(r"..\scripts\.env")


# Connect to PostgreSQL
try:
    DB_URI = (f"postgresql://{os.getenv('DB_USER')}:{os.getenv('DB_PASSWORD')}@{os.getenv('DB_HOST')}:{os.getenv('DB_PORT')}/{os.getenv('DB_NAME')}")

    engine = create_engine(DB_URI)
    conn = engine.connect()
    print("✅ Connection established successfully!")
except Exception as e:
    print(f"❌ Error connecting to the database: {e}")


✅ Connection established successfully!
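
If the user name or password stored in `.env` contains reserved characters such as `@`, `:` or `/`, the f-string above would likely yield a URI that cannot be parsed as intended. A minimal sketch of one way to guard against that by percent-encoding the credentials; `build_db_uri` is a hypothetical helper and the sample values are made up:

```python
from urllib.parse import quote


def build_db_uri(user, password, host, port, name):
    """Build a PostgreSQL URI with percent-encoded credentials.

    Hypothetical helper for illustration; the arguments mirror the
    DB_USER / DB_PASSWORD / DB_HOST / DB_PORT / DB_NAME variables.
    """
    # safe="" makes quote() encode every reserved character, including "/"
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{name}"


# Made-up credentials containing reserved characters
print(build_db_uri("etl_user", "p@ss:word/1", "localhost", 5432, "airbnb"))
```

The resulting string can be passed to `create_engine` in place of `DB_URI`.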


### **Processing the T_DIM_ANUNCIO table**

In [83]:
# Read the listings table (T_DIM_ANUNCIO)
df_anuncio = pd.read_sql('SELECT * FROM a_bronze."T_DIM_ANUNCIO"', conn)
df_anuncio.head(3)

Unnamed: 0,id_anuncio,listing_url,scrape_id,last_scraped,calendar_last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,picture_url
0,14063,https://www.airbnb.com/rooms/14063,20180414160018,2018-04-14,2018-04-14,Living in a Postcard,"Besides the most iconic's view, our apartment ...",,"Besides the most iconic's view, our apartment ...",none,Best and favorite neighborhood of Rio. Perfect...,,Everything is there. METRO is 5 min walk. Dir...,,,strictly no smoking in the apartment ! We want...,https://a0.muscache.com/im/pictures/66421/ae9b...
1,17878,https://www.airbnb.com/rooms/17878,20180414160018,2018-04-14,2018-04-14,Very Nice 2Br - Copacabana - WiFi,Please note that special rates apply for New Y...,- large balcony which looks out on pedestrian ...,Please note that special rates apply for New Y...,none,This is the best spot in Rio. Everything happe...,,Excellent location. Close to all major public ...,The entire apartment is yours. It is a vacatio...,I will be available throughout your stay shoul...,Please leave the apartment in a clean fashion ...,https://a0.muscache.com/im/pictures/65320518/3...
2,24480,https://www.airbnb.com/rooms/24480,20180414160018,2018-04-14,2018-04-14,Nice and cozy near Ipanema Beach,My studio is located in the best of Ipanema. ...,The studio is located at Vinicius de Moraes St...,My studio is located in the best of Ipanema. ...,none,"The beach, the lagoon, Ipanema is a great loca...","O prédio é bastante simples , mas o apartament...",,"From the International airport, take a regula...",Os hóspedes podem perguntar por email suas que...,Please remove sand when you come from the beac...,https://a0.muscache.com/im/pictures/11955612/b...


In [72]:
# Inspect the column data types
df_anuncio.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70185 entries, 0 to 70184
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id_anuncio             70185 non-null  int64         
 1   listing_url            70185 non-null  object        
 2   scrape_id              70185 non-null  int64         
 3   last_scraped           70185 non-null  datetime64[ns]
 4   calendar_last_scraped  70185 non-null  datetime64[ns]
 5   name                   70185 non-null  object        
 6   summary                70185 non-null  object        
 7   space                  70185 non-null  object        
 8   description            70185 non-null  object        
 9   neighborhood_overview  70185 non-null  object        
 10  transit                70185 non-null  object        
 11  access                 70185 non-null  object        
 12  interaction            70185 non-null  object        
 13  h

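##### Columns that need transformation:

+ **last_scraped**:          convert from object to datetime
+ **calendar_last_scraped**: convert from object to datetime

The remaining columns are already in the correct format and only need a null-value analysis. The `info()` output above appears to have been captured after these conversions had already run, which is why it shows both columns as `datetime64[ns]`.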

In [59]:
# Count null values, including placeholder strings that stand for missing data
valores_nulos = ["NaN", "nan", "None", "none", "NULL", "null", ""]

contagem_nulos = df_anuncio.apply(lambda col: col.isnull().sum() + col.isin(valores_nulos).sum())

# Build a DataFrame to evaluate the percentage of null values per column.
nulos_anuncio = pd.DataFrame({
    "colunas" : contagem_nulos.index,
    "total_nulos": contagem_nulos.values,
    "%nulos" : (contagem_nulos.values / len(df_anuncio) * 100).round(2).astype(str) + "%"
})

nulos_anuncio

Unnamed: 0,colunas,total_nulos,%nulos
0,id_anuncio,0,0.0%
1,listing_url,0,0.0%
2,scrape_id,0,0.0%
3,last_scraped,0,0.0%
4,calendar_last_scraped,0,0.0%
5,name,83,0.12%
6,summary,4943,7.04%
7,space,30096,42.88%
8,description,3031,4.32%
9,experiences_offered,70185,100.0%


##### Notes:

+ Columns with more than 60% null values will be removed from the Silver layer.
+ Since no IDs are null, no rows need to be dropped from this table.
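
The 60% rule above could also be applied programmatically instead of listing the columns by hand. A minimal sketch, assuming the same placeholder strings used in the null count; `columns_above_null_threshold` is a hypothetical helper and the toy DataFrame is illustrative, not data from the bronze table:

```python
import pandas as pd

# Placeholder strings treated as missing, as in the null-count cell
NULL_TOKENS = ["NaN", "nan", "None", "none", "NULL", "null", ""]


def columns_above_null_threshold(df, threshold=0.60, null_tokens=NULL_TOKENS):
    """Return the columns whose share of missing values exceeds `threshold`."""
    # A cell counts as missing if it is NaN/None or one of the placeholders
    missing = df.isna() | df.isin(null_tokens)
    share = missing.mean()  # fraction of missing cells per column
    return share[share > threshold].index.tolist()


# Toy data for illustration only
toy = pd.DataFrame({
    "id_anuncio": [1, 2, 3, 4, 5],
    "experiences_offered": ["none", "none", "none", "none", "none"],
    "space": ["large balcony", None, "", "studio", "2 bedrooms"],
})

to_drop = columns_above_null_threshold(toy)
silver = toy.drop(columns=to_drop)
print(to_drop)
print(silver.columns.tolist())
```

Here `space` is 40% missing and is kept, while `experiences_offered` contains only the placeholder `"none"` and is dropped.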

#### Apply the transformations to the "T_DIM_ANUNCIO" table

In [None]:
# Convert the date columns to datetime.

colunas = ['last_scraped', 'calendar_last_scraped']

for col in colunas:
    df_anuncio[col] = pd.to_datetime(df_anuncio[col])

In [70]:
# Drop the columns with more than 60% null values
df_anuncio.drop(columns=['experiences_offered', 'notes'], inplace=True)

In [None]:
# Add the month and year of the listing's calendar scrape
df_anuncio['mes_scp'] = pd.to_datetime(df_anuncio['calendar_last_scraped']).dt.month.astype(int)
df_anuncio['ano_scp'] = pd.to_datetime(df_anuncio['calendar_last_scraped']).dt.year.astype(int)

### **Processing the T_DIM_ANFITRIAO table**

In [84]:
# Read the hosts table (T_DIM_ANFITRIAO)

df_anfitriao = pd.read_sql('SELECT * FROM a_bronze."T_DIM_ANFITRIAO"', conn)
df_anfitriao.head(3)

Unnamed: 0,host_id,fk_anuncio,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified
0,53598,14063,https://www.airbnb.com/users/show/53598,Shalev,2009-11-12,FL,"Hello , my name is Shalev , I am an orchestra ...",,,f,https://a0.muscache.com/im/users/53598/profile...,https://a0.muscache.com/im/users/53598/profile...,Botafogo,1.0,1.0,"['email', 'phone', 'reviews', 'jumio']",t,t
1,68997,17878,https://www.airbnb.com/users/show/68997,Matthias,2010-01-08,"Rio de Janeiro, Rio de Janeiro, Brazil",I used to work as a journalist all around the ...,within an hour,100%,t,https://a0.muscache.com/im/pictures/67b13cea-8...,https://a0.muscache.com/im/pictures/67b13cea-8...,Copacabana,2.0,2.0,"['email', 'phone', 'reviews']",t,f
2,99249,24480,https://www.airbnb.com/users/show/99249,Goya,2010-03-26,"Rio de Janeiro, Rio de Janeiro, Brazil",Welcome to Rio!\r\nI am a filmmaker and a tea...,within an hour,100%,f,https://a0.muscache.com/im/pictures/6b40475c-2...,https://a0.muscache.com/im/pictures/6b40475c-2...,Ipanema,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t


##### Initial analysis: data types, null values, new features.

In [85]:
df_anfitriao.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43180 entries, 0 to 43179
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   host_id                    43180 non-null  int64  
 1   fk_anuncio                 43180 non-null  int64  
 2   host_url                   43180 non-null  object 
 3   host_name                  43180 non-null  object 
 4   host_since                 43147 non-null  object 
 5   host_location              43180 non-null  object 
 6   host_about                 43180 non-null  object 
 7   host_response_time         43180 non-null  object 
 8   host_response_rate         43180 non-null  object 
 9   host_is_superhost          43180 non-null  object 
 10  host_thumbnail_url         43180 non-null  object 
 11  host_picture_url           43180 non-null  object 
 12  host_neighbourhood         43180 non-null  object 
 13  host_listings_count        43147 non-null  flo

In [87]:
# Count null values, including placeholder strings that stand for missing data
valores_nulos = ["NaN", "nan", "None", "none", "NULL", "null", ""]

contagem_nulos = df_anfitriao.apply(lambda col: col.isnull().sum() + col.isin(valores_nulos).sum())

# Build a DataFrame to evaluate the percentage of null values per column.
nulos_anuncio = pd.DataFrame({
    "colunas" : contagem_nulos.index,
    "total_nulos": contagem_nulos.values,
    "%nulos" : (contagem_nulos.values / len(df_anfitriao) * 100).round(2).astype(str) + "%"
})
nulos_anuncio

Unnamed: 0,colunas,total_nulos,%nulos
0,host_id,0,0.0%
1,fk_anuncio,0,0.0%
2,host_url,0,0.0%
3,host_name,33,0.08%
4,host_since,33,0.08%
5,host_location,332,0.77%
6,host_about,27467,63.61%
7,host_response_time,19886,46.05%
8,host_response_rate,19886,46.05%
9,host_is_superhost,33,0.08%


In [93]:
df_anfitriao['host_total_listings_count'].sort_values(ascending=False)

38681    1495.0
42570     336.0
38634     327.0
38636     325.0
38633     313.0
          ...  
28037       NaN
28825       NaN
40584       NaN
40996       NaN
41908       NaN
Name: host_total_listings_count, Length: 43180, dtype: float64

#### Notes:

+ Column **host_response_rate**:  stored as object, but it holds percentage values.
+ Column **host_about**:          will be dropped because more than 60% of its values are null and it has little relevance to the analysis.
+ Column **host_listings_count**: stored as float, but it holds integer values **(the data type will need to be fixed in the database)**.
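
The three notes above can be sketched in pandas as follows. This is only an illustration on toy rows, assuming `host_response_rate` values look like `"100%"` as in the `head()` output; it does not replace the data type fix in the database:

```python
import pandas as pd

# Toy rows shaped like the bronze table (values are illustrative)
toy = pd.DataFrame({
    "host_response_rate": ["100%", None, "87%"],
    "host_listings_count": [1.0, 2.0, None],
    "host_about": ["Hello", None, None],
})

# Strip the trailing "%" and parse as a number; anything that cannot be
# parsed (empty strings, other placeholders) becomes NaN
toy["host_response_rate"] = pd.to_numeric(
    toy["host_response_rate"].str.rstrip("%"), errors="coerce"
)

# The nullable integer dtype stores whole numbers and still allows missing values
toy["host_listings_count"] = toy["host_listings_count"].astype("Int64")

# Drop the mostly empty free-text column
toy = toy.drop(columns=["host_about"])

print(toy.dtypes)
```

The nullable `Int64` dtype is used here because a plain `int64` column cannot hold the missing values that `host_listings_count` contains.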
