<a href="https://colab.research.google.com/github/RaphaelRAY/airbnb-rating-ml/blob/main/notebooks/01_limpeza_dados.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 01 - Airbnb Data Cleaning and Preprocessing - Rio de Janeiro

This notebook walks through the cleaning and preprocessing of Airbnb data for the city of Rio de Janeiro, using the `listings.csv` file. The `calendar.csv` and `neighbourhoods.csv` files are left out of this step: they are large, contain no data relevant here, and can be processed separately if needed.

## 1. Initial Setup and Data Loading

Import the required libraries and load the main dataset, `listings.csv`.

In [51]:
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import OneHotEncoder

# Create the output directory if it does not exist
output_dir = "data/processed"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Directory '{output_dir}' created.")

print("\n--- Loading listings.csv ---")
df = pd.read_csv("https://raw.githubusercontent.com/RaphaelRAY/airbnb-rating-ml/refs/heads/main/data/listings.csv")
len_df = len(df)  # original row count, used later to report how many rows were removed
print(f"DataFrame loaded with {len_df} rows.")


--- Loading listings.csv ---
DataFrame loaded with 42572 rows.


In [52]:
df

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,availability_eoy,number_of_reviews_ly,estimated_occupancy_l365d,estimated_revenue_l365d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,17878,https://www.airbnb.com/rooms/17878,20250624193519,2025-06-28,city scrape,"Very Nice 2Br in Copacabana w. balcony, fast WiFi",Please note that elevated rates apply for New ...,This is the one of the bests spots in Rio. Bec...,https://a0.muscache.com/pictures/65320518/3069...,68997,https://www.airbnb.com/users/show/68997,Matthias,2010-01-08,"Rio de Janeiro, Brazil",I am a journalist/writer. Lived in NYC for ...,within an hour,100%,95%,f,https://a0.muscache.com/im/pictures/user/67b13...,https://a0.muscache.com/im/pictures/user/67b13...,Copacabana,2.0,5.0,"['email', 'phone']",t,t,"Rio de Janeiro, Brazil",Copacabana,,-22.965990,-43.179400,Entire condo,Entire home/apt,5,1.0,1 bath,2.0,2.0,"[""Oven"", ""Building staff"", ""TV with standard c...",$254.00,5,28,5.0,5.0,28.0,28.0,5.0,28.0,,t,4,21,31,225,2025-06-28,338,19,1,100,16,190,48260.0,2010-07-15,2025-06-25,4.71,4.76,4.64,4.83,4.91,4.78,4.67,,f,1,1,0,0,1.86
1,25026,https://www.airbnb.com/rooms/25026,20250624193519,2025-07-04,city scrape,Beautiful Modern Decorated Studio in Copacabana,"**Fully renovated in Dec 2022, new kitchen, n...",Copacabana is a lively neighborhood and the ap...,https://a0.muscache.com/pictures/7c08fa4f-1d7b...,102840,https://www.airbnb.com/users/show/102840,Viviane,2010-04-03,"Rio de Janeiro, Brazil","Hi guys,\n\nViviane is a commercial photograph...",,,,t,https://a0.muscache.com/im/pictures/user/315dd...,https://a0.muscache.com/im/pictures/user/315dd...,Copacabana,1.0,5.0,"['email', 'phone']",t,t,"Rio de Janeiro, Brazil",Copacabana,,-22.976490,-43.191220,Entire rental unit,Entire home/apt,3,1.0,1 bath,1.0,2.0,"[""Window AC unit"", ""Room-darkening shades"", ""D...",$252.00,2,60,2.0,2.0,60.0,60.0,2.0,60.0,,t,1,1,29,193,2025-07-04,313,23,2,112,29,138,34776.0,2010-06-07,2025-06-23,4.75,4.74,4.81,4.83,4.93,4.85,4.65,,f,1,1,0,0,1.71
2,35764,https://www.airbnb.com/rooms/35764,20250624193519,2025-06-25,city scrape,COPACABANA SEA BREEZE - RIO - 25 X Superhost,Our newly renovated studio is located in the b...,Our guests will experience living with a local...,https://a0.muscache.com/pictures/23782972/1d3e...,153691,https://www.airbnb.com/users/show/153691,Patricia Miranda & Paulo,2010-06-27,"Rio de Janeiro, Brazil","Hello, We are Patricia Miranda and Paulo.\nW...",within an hour,100%,97%,t,https://a0.muscache.com/im/users/153691/profil...,https://a0.muscache.com/im/users/153691/profil...,Copacabana,1.0,2.0,"['email', 'phone']",t,t,"Rio de Janeiro, Brazil",Copacabana,,-22.981070,-43.191360,Entire loft,Entire home/apt,2,1.5,1.5 baths,1.0,1.0,"[""Building staff"", ""Bed linens"", ""Heating"", ""P...",$190.00,3,15,3.0,5.0,7.0,15.0,3.1,14.8,,t,7,15,30,103,2025-06-25,516,41,2,103,43,246,46740.0,2010-10-03,2025-06-05,4.91,4.94,4.92,4.97,4.95,4.95,4.89,,f,1,1,0,0,2.88
3,48305,https://www.airbnb.com/rooms/48305,20250624193519,2025-06-26,city scrape,Bright 6bed Penthouse Seconds from Beach,Enter Bossa Nova's history by staying in the v...,Enter Bossa Nova history by staying in the ver...,https://a0.muscache.com/pictures/miso/Hosting-...,70933,https://www.airbnb.com/users/show/70933,Goitaca,2010-01-16,"Rio de Janeiro, Brazil",A new frontier of hospitality\n\nThe word mean...,within an hour,100%,95%,t,https://a0.muscache.com/im/pictures/user/c2d77...,https://a0.muscache.com/im/pictures/user/c2d77...,Ipanema,7.0,33.0,"['email', 'phone', 'work_email']",t,t,"Ipanema, Rio de Janeiro, Brazil",Ipanema,,-22.985910,-43.203020,Entire rental unit,Entire home/apt,13,7.0,7 baths,6.0,7.0,"[""Pack \u2019n play/Travel crib"", ""Private pat...","$2,239.00",7,89,6.0,15.0,89.0,89.0,14.1,89.0,,t,23,53,83,351,2025-06-26,183,5,0,178,16,70,156730.0,2011-03-02,2025-02-25,4.77,4.74,4.73,4.84,4.84,4.95,4.59,,t,6,5,1,0,1.05
4,48901,https://www.airbnb.com/rooms/48901,20250624193519,2025-07-01,city scrape,Extra large 4BD 3BT on the AtlanticAve. Copaca...,LARGE Beach side 4 bedrooms 2 Complete bathro...,"Plenty of shops, entertainment andrestaurants<...",https://a0.muscache.com/pictures/hosting/Hosti...,222884,https://www.airbnb.com/users/show/222884,Marcio,2010-09-03,"Rio de Janeiro, Brazil","Carioca "" da gema "", fala português e inglês. ...",within an hour,100%,69%,f,https://a0.muscache.com/im/users/222884/profil...,https://a0.muscache.com/im/users/222884/profil...,Copacabana,1.0,7.0,"['email', 'phone']",t,t,"Rio, Rio de Janeiro, Brazil",Copacabana,,-22.965740,-43.175140,Entire rental unit,Entire home/apt,10,2.5,2.5 baths,4.0,4.0,"[""Microwave"", ""Dedicated workspace"", ""Hot wate...",$743.00,3,1125,3.0,4.0,1125.0,1125.0,3.0,1125.0,,t,15,20,50,311,2025-07-01,48,17,1,131,15,102,75786.0,2015-08-01,2025-06-13,4.63,4.67,4.42,4.88,4.83,4.94,4.60,,f,1,1,0,0,0.40
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42567,1450108622211032237,https://www.airbnb.com/rooms/1450108622211032237,20250624193519,2025-06-28,city scrape,Aconchego Constante,"Overlooking the sea, Aconchego Constante is an...",,https://a0.muscache.com/pictures/miso/Hosting-...,703072927,https://www.airbnb.com/users/show/703072927,Marco,2025-06-23,,,,,100%,f,https://a0.muscache.com/defaults/user_pic-50x5...,https://a0.muscache.com/defaults/user_pic-225x...,Copacabana,1.0,1.0,['phone'],f,t,,Copacabana,,-22.974055,-43.189129,Entire rental unit,Entire home/apt,3,1.0,1 bath,1.0,2.0,"[""Exterior security cameras on property"", ""TV""...",$186.00,2,365,2.0,4.0,365.0,365.0,2.0,365.0,,t,11,41,71,137,2025-06-28,0,0,0,137,0,0,0.0,,,,,,,,,,,t,1,1,0,0,
42568,1450108828266063076,https://www.airbnb.com/rooms/1450108828266063076,20250624193519,2025-06-26,city scrape,Quarto casal,Premium and cozy location.,,https://a0.muscache.com/pictures/hosting/Hosti...,320682549,https://www.airbnb.com/users/show/320682549,Anna,2019-12-24,"Rio de Janeiro, Brazil",,,,,f,https://a0.muscache.com/im/pictures/user/User-...,https://a0.muscache.com/im/pictures/user/User-...,Jacarepaguá,1.0,1.0,"['email', 'phone']",t,t,,Camorim,,-22.979640,-43.423380,Private room in rental unit,Private room,2,1.0,1 private bath,1.0,1.0,"[""Exterior security cameras on property"", ""Ded...",$288.00,1,365,1.0,1.0,365.0,365.0,1.0,365.0,,t,30,60,90,365,2025-06-26,0,0,0,189,0,0,0.0,,,,,,,,,,,f,1,0,1,0,
42569,1450124185987579534,https://www.airbnb.com/rooms/1450124185987579534,20250624193519,2025-06-30,city scrape,Cama em Dorm Misto (9) com AC,"Single bed in a mixed room with nine beds, loc...",,https://a0.muscache.com/pictures/hosting/Hosti...,37776540,https://www.airbnb.com/users/show/37776540,Mariana,2015-07-07,,,within an hour,93%,100%,t,https://a0.muscache.com/im/pictures/user/52b65...,https://a0.muscache.com/im/pictures/user/52b65...,Ipanema,8.0,12.0,"['email', 'phone']",t,t,,Ipanema,,-22.983102,-43.208741,Shared room in hostel,Shared room,9,6.0,6 shared baths,,1.0,"[""Microwave"", ""Coffee maker: drip coffee maker...",$87.00,1,365,1.0,5.0,1.0,365.0,1.2,170.6,,t,24,44,74,213,2025-06-30,0,0,0,166,0,0,0.0,,,,,,,,,,,t,8,0,4,4,
42570,1450124362124784419,https://www.airbnb.com/rooms/1450124362124784419,20250624193519,2025-06-26,city scrape,Quarto para casal,Great location and cozy room to enjoy your trip.,,https://a0.muscache.com/pictures/hosting/Hosti...,378959794,https://www.airbnb.com/users/show/378959794,Anna,2020-12-09,Brazil,,,,,f,https://a0.muscache.com/im/pictures/user/58354...,https://a0.muscache.com/im/pictures/user/58354...,Camorim,1.0,1.0,['phone'],t,t,,Camorim,,-22.984526,-43.431740,Private room in rental unit,Private room,2,1.0,1 private bath,1.0,1.0,"[""Dedicated workspace"", ""TV"", ""Kitchen"", ""Free...",$240.00,1,365,1.0,1.0,365.0,365.0,1.0,365.0,,t,30,60,90,365,2025-06-26,0,0,0,189,0,0,0.0,,,,,,,,,,,f,1,0,1,0,


## 2. Cleaning `listings.csv`

This section applies the cleaning steps to the `listings.csv` dataset.

### 2.1. Column Standardization

Column names are lowercased, stripped of surrounding whitespace, and have spaces replaced with underscores. Other special characters are left as they are.

In [53]:
print("\n--- 2.1. Column standardization ---")
df.columns = df.columns.str.lower().str.strip().str.replace(" ", "_")
print("Standardized columns:")
print(df.columns.tolist())


--- 2.1. Column standardization ---
Standardized columns:
['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'availa
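
The cell above only lowercases, trims, and replaces spaces. If special characters should also be removed, a stricter rule is needed. The following is a minimal sketch, assuming a hypothetical helper `standardize_column` and invented demo column names; note that it would also replace accented letters, which may or may not be desirable.

```python
import re

import pandas as pd


def standardize_column(name: str) -> str:
    """Lowercase, trim, and reduce a column name to the characters a-z, 0-9 and '_'."""
    name = name.strip().lower()
    # Any run of characters outside a-z / 0-9 collapses into a single underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


# Hypothetical column names, used only to illustrate the rule.
demo = pd.DataFrame(columns=[" Host Name ", "Price ($)", "review_scores_rating"])
demo.columns = [standardize_column(c) for c in demo.columns]
print(demo.columns.tolist())
```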

### 2.2. Monetary Variable Conversion

Monetary columns are converted to float by removing currency symbols and thousands separators. The code looks for `price`, `cleaning_fee`, `security_deposit`, and `extra_people`; of these, only `price` is present in this dataset.

In [54]:
print("\n--- 2.2. Monetary variable conversion ---")
monetary_cols = ["price", "cleaning_fee", "security_deposit", "extra_people"]
for col in monetary_cols:
    if col in df.columns:
        # Strip "$" and "," before casting. NaN becomes the string "nan", which float() reads back as NaN.
        df[col] = (
            df[col]
            .astype(str)
            .str.replace(r'[$,]', '', regex=True)
            .astype(float)
        )
        print(f"Column '{col}' converted to float.")


--- 2.2. Monetary variable conversion ---
Column 'price' converted to float.
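
As an alternative to the string cast, `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN instead of raising an error. This is a sketch under that assumption; `parse_money` and the sample values are illustrative and not part of the notebook.

```python
import pandas as pd


def parse_money(series: pd.Series) -> pd.Series:
    """Convert strings such as "$2,239.00" to floats; unparseable values become NaN."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Illustrative values: two valid prices, one missing entry, one malformed entry.
prices = pd.Series(["$254.00", "$2,239.00", None, "n/a"])
parsed = parse_money(prices)
print(parsed.tolist())
```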


### 2.3. Data Type Conversion

The date column (`host_since`) is converted to datetime, and the boolean columns (`host_is_superhost`, `instant_bookable`) are mapped from `t`/`f` to 1/0.

In [55]:
print("\n--- 2.3. Type conversion ---")
if "host_since" in df.columns:
    df["host_since"] = pd.to_datetime(df["host_since"], errors="coerce")
    print("Column 'host_since' converted to datetime.")
if "host_is_superhost" in df.columns:
    df["host_is_superhost"] = df["host_is_superhost"].map({"t": 1, "f": 0})
    print("Column 'host_is_superhost' converted to binary.")
if "instant_bookable" in df.columns:
    df["instant_bookable"] = df["instant_bookable"].map({"t": 1, "f": 0})
    print("Column 'instant_bookable' converted to binary.")


--- 2.3. Type conversion ---
Column 'host_since' converted to datetime.
Column 'host_is_superhost' converted to binary.
Column 'instant_bookable' converted to binary.


### 2.4. Removing Irrelevant Columns

Columns that are not useful for the analysis, or that hold redundant or sensitive information, are removed.

In [56]:
print("\n--- 2.4. Removing irrelevant columns ---")
drop_cols = [
    "listing_url", "name", "description", "neighborhood_overview", "picture_url",
    "host_url", "host_name", "host_thumbnail_url", "host_picture_url",
    "license", "reviews_per_month", "review_scores_rating",
    "calendar_updated", "neighbourhood_group_cleansed", "id",
    "scrape_id", "source", "last_scraped", "host_id", "host_about",
    "host_location", "host_neighbourhood", "first_review", "last_review",
    "calendar_last_scraped", "estimated_occupancy_l365d",
    "estimated_revenue_l365d", "availability_eoy", "host_verifications", "neighbourhood",
    "number_of_reviews", "number_of_reviews_ltm", "number_of_reviews_l30d", "number_of_reviews_ly",
    "availability_30", "availability_60", "availability_90", "availability_365",
    "review_scores_accuracy", "review_scores_cleanliness", "review_scores_checkin",
    "review_scores_communication", "review_scores_location", "review_scores_value",
    # Added based on the earlier analysis
]
# errors="ignore" skips any name that is not present in the DataFrame
df.drop(columns=drop_cols, inplace=True, errors="ignore")
print("Irrelevant columns removed.")
print(f"Number of columns after removal: {df.shape[1]}")


--- 2.4. Removing irrelevant columns ---
Irrelevant columns removed.
Number of columns after removal: 35
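
Because `errors="ignore"` silently skips names that are not in the DataFrame, a misspelled column name would go unnoticed. A small wrapper can report those names first. This is an illustrative sketch; `drop_known_columns` and the demo frame are assumptions, not part of the notebook.

```python
import pandas as pd


def drop_known_columns(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Drop `columns` from `frame` and report the names that were not found."""
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        print(f"Not found (check for typos): {missing}")
    present = [c for c in columns if c in frame.columns]
    return frame.drop(columns=present)


# Hypothetical frame; "licence" is a deliberate misspelling of "license".
demo = pd.DataFrame({"id": [1], "price": [254.0], "license": [None]})
trimmed = drop_known_columns(demo, ["id", "license", "licence"])
print(trimmed.columns.tolist())
```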


### 2.5. Handling Missing Values

Missing values are analysed and treated: columns with more than 70% NaN are dropped, and the remaining gaps are filled with the median for numeric columns or with the category 'unknown' for categorical ones.

In [57]:
print("\n--- 2.5. Handling missing values ---")
# Treat the literal string "unknown" as missing in the rate columns
df["host_response_rate"] = df["host_response_rate"].replace("unknown", np.nan)
df["host_acceptance_rate"] = df["host_acceptance_rate"].replace("unknown", np.nan)



--- 2.5. Handling missing values ---


In [58]:
df["host_response_rate"] = (
    df["host_response_rate"]
    .astype(str)
    .str.replace("%", "", regex=False)
    .astype(float)
) / 100  # convert percentage to proportion

df["host_acceptance_rate"] = (
    df["host_acceptance_rate"]
    .astype(str)
    .str.replace("%", "", regex=False)
    .astype(float)
) / 100  # convert percentage to proportion
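
The two conversions above follow the same pattern, so they could share one helper. The sketch below assumes a hypothetical function `percent_to_proportion` and uses `pd.to_numeric(errors="coerce")`, so that leftover text such as "unknown" becomes NaN rather than raising an error.

```python
import pandas as pd


def percent_to_proportion(series: pd.Series) -> pd.Series:
    """Turn strings like "95%" into 0.95; missing or malformed values become NaN."""
    stripped = series.str.rstrip("%")
    return pd.to_numeric(stripped, errors="coerce") / 100


# Illustrative values in the format used by the rate columns.
rates = pd.Series(["100%", "95%", None, "unknown"])
proportions = percent_to_proportion(rates)
print(proportions.tolist())
```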


In [59]:
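# Caution: 'host_is_superhost' and 'instant_bookable' were already mapped to 1/0 in section 2.3.
# Mapping them again with {'t': 1, 'f': 0} matches nothing, so both columns become entirely NaN.
for col in ['host_has_profile_pic', 'host_identity_verified', 'host_is_superhost', 'instant_bookable','has_availability']:
    if col in df.columns:
        df[col] = df[col].map({'t': 1, 'f': 0}).astype(float)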
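
Mapping `t`/`f` with a dictionary is not safe to run twice: values that are already 1/0 are not keys of the dictionary, so a second pass turns them into NaN. One way to make the step re-runnable is to leave numeric columns untouched. This is a sketch with a hypothetical helper `to_flag`, not the notebook's own code.

```python
import pandas as pd


def to_flag(series: pd.Series) -> pd.Series:
    """Map "t"/"f" to 1.0/0.0; a column that is already numeric passes through unchanged."""
    if pd.api.types.is_numeric_dtype(series):
        return series.astype(float)  # already converted by an earlier run
    return series.map({"t": 1.0, "f": 0.0})


raw = pd.Series(["t", "f", None])
once = to_flag(raw)    # first conversion
twice = to_flag(once)  # running the step again keeps the values
print(once.tolist(), twice.tolist())
```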


In [60]:

missing_percentage = (df.isna().sum() / len(df)).sort_values(ascending=False)
print("Fraction of missing values per column (top 20):")
print(missing_percentage.head(20))

# Drop columns with more than 70% NaN
cols_to_drop_nan = missing_percentage[missing_percentage > 0.7].index.tolist()
df.drop(columns=cols_to_drop_nan, inplace=True, errors="ignore")
print(f"Columns with more than 70% NaN removed: {cols_to_drop_nan}")

# Fill specific missing values
if "bathrooms" in df.columns:
    df["bathrooms"] = df["bathrooms"].fillna(df["bathrooms"].median())
    print("Missing values in 'bathrooms' filled with the median.")
if "host_response_time" in df.columns:
    df["host_response_time"] = df["host_response_time"].fillna("unknown")
    print("Missing values in 'host_response_time' filled with 'unknown'.")

# Fill other numeric columns with the median
for col in ["beds", "bedrooms"]:
    if col in df.columns and df[col].dtype != "object":  # column exists and is not text
        df[col] = df[col].fillna(df[col].median())
        print(f"Missing values in '{col}' filled with the median.")

# Fill categorical columns with 'unknown' (both were dropped in 2.4, so this loop is a no-op here)
for col in ["host_location", "host_neighbourhood"]:
    if col in df.columns:
        df[col] = df[col].fillna("unknown")
        print(f"Missing values in '{col}' filled with 'unknown'.")

# Final pass so that no NaN remains
for col in df.columns:
    if df[col].dtype != "object":  # numeric (and datetime) columns
        df[col] = df[col].fillna(df[col].median())
    else:  # categorical columns
        df[col] = df[col].fillna("unknown")

# bathrooms_text duplicates the information in the numeric 'bathrooms' column
df.drop(columns='bathrooms_text', inplace=True, errors="ignore")

print("✅ All missing values have been handled.")


Fraction of missing values per column (top 20):
host_is_superhost            1.000000
instant_bookable             1.000000
host_response_time           0.188011
host_response_rate           0.188011
host_acceptance_rate         0.117307
bathrooms                    0.085056
beds                         0.084351
price                        0.084093
host_since                   0.036244
host_has_profile_pic         0.036244
host_identity_verified       0.036244
host_listings_count          0.036244
host_total_listings_count    0.036244
bedrooms                     0.017288
has_availability             0.010711
bathrooms_text               0.001315
minimum_maximum_nights       0.000117
minimum_minimum_nights       0.000117
maximum_maximum_nights       0.000117
maximum_minimum_nights       0.000117
dtype: float64
Columns with more than 70% NaN removed: ['host_is_superhost', 'instant_bookable']
Missing values in 'bathrooms' filled with the median.
Missing values in 'host_resp
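
The drop-then-impute logic of this section can be condensed into one function, which makes the threshold explicit and easier to test. The following is a minimal sketch, assuming a hypothetical `handle_missing` helper and an invented demo frame; it mirrors the median / 'unknown' rule described above and is not the notebook's exact code.

```python
import numpy as np
import pandas as pd


def handle_missing(frame: pd.DataFrame, max_missing: float = 0.7) -> pd.DataFrame:
    """Drop columns whose NaN share exceeds `max_missing`, then impute the rest."""
    share = frame.isna().mean()                       # fraction of NaN per column
    kept = frame.loc[:, share <= max_missing].copy()  # independent copy of the surviving columns
    for col in kept.columns:
        if not kept[col].isna().any():
            continue
        is_number = pd.api.types.is_numeric_dtype(kept[col])
        is_date = pd.api.types.is_datetime64_any_dtype(kept[col])
        if is_number or is_date:
            kept[col] = kept[col].fillna(kept[col].median())
        else:
            kept[col] = kept[col].fillna("unknown")
    return kept


# Invented demo data; "mostly_empty" is a hypothetical sparse column.
demo = pd.DataFrame({
    "beds": [1.0, np.nan, 3.0],
    "host_response_time": ["within an hour", None, "within a day"],
    "mostly_empty": [np.nan, np.nan, 5.0],
})
clean = handle_missing(demo, max_missing=0.5)
print(clean)
```

Returning a new frame instead of modifying the input in place keeps the original data available for comparison.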

### 2.6. Outlier Removal

Price outliers are removed by keeping only listings whose `price` lies between the 1st and 99th percentiles.

In [61]:
print("\n--- 2.6. Remove outliers ---")
if "price" in df.columns:
    q_low, q_high = df["price"].quantile([0.01, 0.99])
    df = df[(df["price"] >= q_low) & (df["price"] <= q_high)]
    print(f"Price outliers removed (bottom 1% and top 1%). New shape: {df.shape}")


--- 2.6. Remove outliers ---
Price outliers removed (bottom 1% and top 1%). New shape: (41724, 32)
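
Filtering with a boolean mask returns a slice of the original DataFrame, and later column assignments on that slice can trigger pandas' `SettingWithCopyWarning`. Taking an explicit `.copy()` after the filter avoids this. The sketch below assumes a hypothetical helper `trim_by_quantile` and synthetic prices.

```python
import pandas as pd


def trim_by_quantile(frame: pd.DataFrame, column: str,
                     lower: float = 0.01, upper: float = 0.99) -> pd.DataFrame:
    """Keep rows whose `column` lies within the given quantiles; return an independent copy."""
    q_low, q_high = frame[column].quantile([lower, upper])
    mask = frame[column].between(q_low, q_high)  # inclusive on both ends
    return frame.loc[mask].copy()


# Synthetic data: one very low and one very high price among 98 typical ones.
demo = pd.DataFrame({"price": [10.0] + [100.0] * 98 + [5000.0]})
trimmed = trim_by_quantile(demo, "price")
trimmed["price_rounded"] = trimmed["price"].round()  # safe: trimmed owns its data
print(len(demo), len(trimmed))
```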


In [62]:
new_len_df = len(df)
print(f"\nRows removed: {len_df - new_len_df}")
print(f"Rows remaining: {new_len_df}")


Rows removed: 848
Rows remaining: 41724


Check the data type of each column.


In [63]:
pd.set_option('display.max_columns', None)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41724 entries, 0 to 42571
Data columns (total 32 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   host_since                                    41724 non-null  datetime64[ns]
 1   host_response_time                            41724 non-null  object        
 2   host_response_rate                            41724 non-null  float64       
 3   host_acceptance_rate                          41724 non-null  float64       
 4   host_listings_count                           41724 non-null  float64       
 5   host_total_listings_count                     41724 non-null  float64       
 6   host_has_profile_pic                          41724 non-null  float64       
 7   host_identity_verified                        41724 non-null  float64       
 8   neighbourhood_cleansed                        41724 non-null  object   

In [64]:
df.head()

Unnamed: 0,host_since,host_response_time,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms
0,2010-01-08,within an hour,1.0,0.95,2.0,5.0,1.0,1.0,Copacabana,-22.96599,-43.1794,Entire condo,Entire home/apt,5,1.0,2.0,2.0,"[""Oven"", ""Building staff"", ""TV with standard c...",254.0,5,28,5.0,5.0,28.0,28.0,5.0,28.0,1.0,1,1,0,0
1,2010-04-03,unknown,1.0,0.9,1.0,5.0,1.0,1.0,Copacabana,-22.97649,-43.19122,Entire rental unit,Entire home/apt,3,1.0,1.0,2.0,"[""Window AC unit"", ""Room-darkening shades"", ""D...",252.0,2,60,2.0,2.0,60.0,60.0,2.0,60.0,1.0,1,1,0,0
2,2010-06-27,within an hour,1.0,0.97,1.0,2.0,1.0,1.0,Copacabana,-22.98107,-43.19136,Entire loft,Entire home/apt,2,1.5,1.0,1.0,"[""Building staff"", ""Bed linens"", ""Heating"", ""P...",190.0,3,15,3.0,5.0,7.0,15.0,3.1,14.8,1.0,1,1,0,0
3,2010-01-16,within an hour,1.0,0.95,7.0,33.0,1.0,1.0,Ipanema,-22.98591,-43.20302,Entire rental unit,Entire home/apt,13,7.0,6.0,7.0,"[""Pack \u2019n play/Travel crib"", ""Private pat...",2239.0,7,89,6.0,15.0,89.0,89.0,14.1,89.0,1.0,6,5,1,0
4,2010-09-03,within an hour,1.0,0.69,1.0,7.0,1.0,1.0,Copacabana,-22.96574,-43.17514,Entire rental unit,Entire home/apt,10,2.5,4.0,4.0,"[""Microwave"", ""Dedicated workspace"", ""Hot wate...",743.0,3,1125,3.0,4.0,1125.0,1125.0,3.0,1125.0,1.0,1,1,0,0


Convert `host_since` into host tenure (number of days the host has been active).

In [65]:
# Derive host tenure in days. The result depends on the day the notebook is run.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning raised when
# writing into the filtered slice created in section 2.6.
df = df.assign(
    host_days_active=(pd.Timestamp.today() - pd.to_datetime(df['host_since'], errors='coerce')).dt.days
).drop(columns='host_since')
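
Because `pd.Timestamp.today()` changes with every run, `host_days_active` is not reproducible. Measuring tenure against a fixed reference date is one alternative. In this sketch, `host_tenure_days` is a hypothetical helper and the reference date `2025-06-28` is an assumption, chosen because it appears as a `last_scraped` value in the raw data.

```python
import pandas as pd


def host_tenure_days(host_since: pd.Series, reference_date: str) -> pd.Series:
    """Days between each host's sign-up date and a fixed reference date."""
    since = pd.to_datetime(host_since, errors="coerce")
    return (pd.Timestamp(reference_date) - since).dt.days


# Two sign-up dates taken from the sample rows, plus one missing value.
demo = pd.Series(["2010-01-08", "2025-06-23", None])
tenure = host_tenure_days(demo, reference_date="2025-06-28")  # assumed reference date
print(tenure.tolist())
```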


Process `amenities` (a list of features per listing) by replacing it with a count.


In [66]:
import ast

# Count the amenities per listing. ast.literal_eval parses the list literal without
# executing arbitrary code, unlike eval(). assign() returns a new DataFrame, so no
# SettingWithCopyWarning is raised.
df = df.assign(
    amenities_count=df['amenities'].apply(lambda x: len(ast.literal_eval(x)) if pd.notna(x) else 0)
).drop(columns='amenities')
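
The `amenities` strings shown in the sample rows look like JSON arrays, so `json.loads` is another way to parse them without `eval`. The sketch below assumes that format and uses a hypothetical helper `parse_amenities`, which returns an empty list for missing or malformed entries. It also shows how a binary flag for a single amenity could be derived, which goes beyond what the notebook does.

```python
import json

import pandas as pd


def parse_amenities(raw) -> list:
    """Parse a JSON-style amenities string; return [] for missing or malformed values."""
    if not isinstance(raw, str):
        return []
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


# Illustrative values: two valid lists, one missing entry, one malformed entry.
demo = pd.Series([
    '["Oven", "Building staff", "TV"]',
    '["Pack \\u2019n play/Travel crib"]',
    None,
    "not a list",
])
amenity_lists = demo.apply(parse_amenities)
amenities_count = amenity_lists.apply(len)
has_tv = amenity_lists.apply(lambda items: int("TV" in items))
print(amenities_count.tolist(), has_tv.tolist())
```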


In [69]:
df.head()

Unnamed: 0,host_response_time,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,host_days_active,amenities_count
0,within an hour,1.0,0.95,2.0,5.0,1.0,1.0,Copacabana,-22.96599,-43.1794,Entire condo,Entire home/apt,5,1.0,2.0,2.0,254.0,5,28,5.0,5.0,28.0,28.0,5.0,28.0,1.0,1,1,0,0,5774,26
1,unknown,1.0,0.9,1.0,5.0,1.0,1.0,Copacabana,-22.97649,-43.19122,Entire rental unit,Entire home/apt,3,1.0,1.0,2.0,252.0,2,60,2.0,2.0,60.0,60.0,2.0,60.0,1.0,1,1,0,0,5689,38
2,within an hour,1.0,0.97,1.0,2.0,1.0,1.0,Copacabana,-22.98107,-43.19136,Entire loft,Entire home/apt,2,1.5,1.0,1.0,190.0,3,15,3.0,5.0,7.0,15.0,3.1,14.8,1.0,1,1,0,0,5604,30
3,within an hour,1.0,0.95,7.0,33.0,1.0,1.0,Ipanema,-22.98591,-43.20302,Entire rental unit,Entire home/apt,13,7.0,6.0,7.0,2239.0,7,89,6.0,15.0,89.0,89.0,14.1,89.0,1.0,6,5,1,0,5766,33
4,within an hour,1.0,0.69,1.0,7.0,1.0,1.0,Copacabana,-22.96574,-43.17514,Entire rental unit,Entire home/apt,10,2.5,4.0,4.0,743.0,3,1125,3.0,4.0,1125.0,1125.0,3.0,1125.0,1.0,1,1,0,0,5536,33


In [73]:
from sklearn.preprocessing import OneHotEncoder

# 🔹 Categorical columns to encode
cat_features = [
    'host_response_time',
    'property_type',
    'room_type',
    'neighbourhood_cleansed'
]

# 🔹 Create the encoder (dense output; categories unseen at fit time are encoded as all zeros)
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# 🔹 Apply the encoding and wrap the result in a DataFrame
encoded = encoder.fit_transform(df[cat_features])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(cat_features))

# 🔹 Join the encoded columns with the rest of the DataFrame (indexes are reset so rows align by position)
df_encoded = pd.concat([df.drop(columns=cat_features).reset_index(drop=True),
                        encoded_df.reset_index(drop=True)], axis=1)

print("✅ One-Hot Encoding complete!")
print("Final shape:", df_encoded.shape)
df_encoded.head()


✅ One-Hot Encoding complete!
Final shape: (41724, 265)


Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,latitude,longitude,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,host_days_active,amenities_count,host_response_time_a few days or more,host_response_time_unknown,host_response_time_within a day,host_response_time_within a few hours,host_response_time_within an hour,property_type_Boat,property_type_Camper/RV,property_type_Casa particular,property_type_Cave,property_type_Earthen home,property_type_Entire bungalow,property_type_Entire cabin,property_type_Entire chalet,property_type_Entire condo,property_type_Entire cottage,property_type_Entire guest suite,property_type_Entire guesthouse,property_type_Entire home,property_type_Entire loft,property_type_Entire place,property_type_Entire rental unit,property_type_Entire serviced apartment,property_type_Entire townhouse,property_type_Entire vacation home,property_type_Entire villa,property_type_Farm stay,property_type_Houseboat,property_type_Hut,property_type_Private room,property_type_Private room in bed and breakfast,property_type_Private room in boat,property_type_Private room in bungalow,property_type_Private room in cabin,property_type_Private room in casa particular,property_type_Private room in castle,property_type_Private room in chalet,property_type_Private room in condo,property_type_Private room in cottage,property_type_Private room in earthen home,property_type_Private room in farm stay,property_type_Private room in guest suite,property_type_Private room in guesthouse,property_type_Private room in home,property_type_Private room in 
hostel,property_type_Private room in loft,property_type_Private room in nature lodge,property_type_Private room in rental unit,property_type_Private room in resort,property_type_Private room in serviced apartment,property_type_Private room in shipping container,property_type_Private room in tent,property_type_Private room in tiny home,property_type_Private room in tower,property_type_Private room in townhouse,property_type_Private room in treehouse,property_type_Private room in vacation home,property_type_Private room in villa,property_type_Ranch,property_type_Room in aparthotel,property_type_Room in bed and breakfast,property_type_Room in boutique hotel,property_type_Room in hostel,property_type_Room in hotel,property_type_Room in serviced apartment,property_type_Shared room in aparthotel,property_type_Shared room in bed and breakfast,property_type_Shared room in condo,property_type_Shared room in dome,property_type_Shared room in guest suite,property_type_Shared room in guesthouse,property_type_Shared room in home,property_type_Shared room in hostel,property_type_Shared room in hotel,property_type_Shared room in rental unit,property_type_Shared room in serviced apartment,property_type_Shipping container,property_type_Tiny home,property_type_Tower,property_type_Treehouse,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room,neighbourhood_cleansed_Abolição,neighbourhood_cleansed_Alto da Boa Vista,neighbourhood_cleansed_Anchieta,neighbourhood_cleansed_Andaraí,neighbourhood_cleansed_Anil,neighbourhood_cleansed_Bancários,neighbourhood_cleansed_Bangu,neighbourhood_cleansed_Barra da Tijuca,neighbourhood_cleansed_Barra de Guaratiba,neighbourhood_cleansed_Barros Filho,neighbourhood_cleansed_Benfica,neighbourhood_cleansed_Bento Ribeiro,neighbourhood_cleansed_Bonsucesso,neighbourhood_cleansed_Botafogo,neighbourhood_cleansed_Brás de 
Pina,neighbourhood_cleansed_Cachambi,neighbourhood_cleansed_Cacuia,neighbourhood_cleansed_Caju,neighbourhood_cleansed_Camorim,neighbourhood_cleansed_Campinho,neighbourhood_cleansed_Campo Grande,neighbourhood_cleansed_Cascadura,neighbourhood_cleansed_Catete,neighbourhood_cleansed_Catumbi,neighbourhood_cleansed_Cavalcanti,neighbourhood_cleansed_Centro,neighbourhood_cleansed_Cidade Nova,neighbourhood_cleansed_Cidade Universitária,neighbourhood_cleansed_Cidade de Deus,neighbourhood_cleansed_Cocotá,neighbourhood_cleansed_Coelho Neto,neighbourhood_cleansed_Colégio,neighbourhood_cleansed_Complexo do Alemão,neighbourhood_cleansed_Copacabana,neighbourhood_cleansed_Cordovil,neighbourhood_cleansed_Cosme Velho,neighbourhood_cleansed_Cosmos,neighbourhood_cleansed_Curicica,neighbourhood_cleansed_Del Castilho,neighbourhood_cleansed_Deodoro,neighbourhood_cleansed_Encantado,neighbourhood_cleansed_Engenheiro Leal,neighbourhood_cleansed_Engenho Novo,neighbourhood_cleansed_Engenho da Rainha,neighbourhood_cleansed_Engenho de Dentro,neighbourhood_cleansed_Estácio,neighbourhood_cleansed_Flamengo,neighbourhood_cleansed_Freguesia (Ilha),neighbourhood_cleansed_Freguesia (Jacarepaguá),neighbourhood_cleansed_Galeão,neighbourhood_cleansed_Gamboa,neighbourhood_cleansed_Gardênia Azul,neighbourhood_cleansed_Gericinó,neighbourhood_cleansed_Glória,neighbourhood_cleansed_Grajaú,neighbourhood_cleansed_Grumari,neighbourhood_cleansed_Guadalupe,neighbourhood_cleansed_Guaratiba,neighbourhood_cleansed_Gávea,neighbourhood_cleansed_Higienópolis,neighbourhood_cleansed_Honório Gurgel,neighbourhood_cleansed_Humaitá,neighbourhood_cleansed_Inhaúma,neighbourhood_cleansed_Inhoaíba,neighbourhood_cleansed_Ipanema,neighbourhood_cleansed_Irajá,neighbourhood_cleansed_Itanhangá,neighbourhood_cleansed_Jacarepaguá,neighbourhood_cleansed_Jacaré,neighbourhood_cleansed_Jardim América,neighbourhood_cleansed_Jardim Botânico,neighbourhood_cleansed_Jardim Carioca,neighbourhood_cleansed_Jardim 
Guanabara,neighbourhood_cleansed_Jardim Sulacap,neighbourhood_cleansed_Joá,neighbourhood_cleansed_Lagoa,neighbourhood_cleansed_Laranjeiras,neighbourhood_cleansed_Leblon,neighbourhood_cleansed_Leme,neighbourhood_cleansed_Lins de Vasconcelos,neighbourhood_cleansed_Madureira,neighbourhood_cleansed_Magalhães Bastos,neighbourhood_cleansed_Mangueira,neighbourhood_cleansed_Manguinhos,neighbourhood_cleansed_Maracanã,neighbourhood_cleansed_Marechal Hermes,neighbourhood_cleansed_Maria da Graça,neighbourhood_cleansed_Maré,neighbourhood_cleansed_Moneró,neighbourhood_cleansed_Méier,neighbourhood_cleansed_Olaria,neighbourhood_cleansed_Osvaldo Cruz,neighbourhood_cleansed_Paciência,neighbourhood_cleansed_Padre Miguel,neighbourhood_cleansed_Paquetá,neighbourhood_cleansed_Parada de Lucas,neighbourhood_cleansed_Parque Anchieta,neighbourhood_cleansed_Pavuna,neighbourhood_cleansed_Pechincha,neighbourhood_cleansed_Pedra de Guaratiba,neighbourhood_cleansed_Penha,neighbourhood_cleansed_Penha Circular,neighbourhood_cleansed_Piedade,neighbourhood_cleansed_Pilares,neighbourhood_cleansed_Pitangueiras,neighbourhood_cleansed_Portuguesa,neighbourhood_cleansed_Praia da Bandeira,neighbourhood_cleansed_Praça Seca,neighbourhood_cleansed_Praça da Bandeira,neighbourhood_cleansed_Quintino Bocaiúva,neighbourhood_cleansed_Ramos,neighbourhood_cleansed_Realengo,neighbourhood_cleansed_Recreio dos Bandeirantes,neighbourhood_cleansed_Riachuelo,neighbourhood_cleansed_Ribeira,neighbourhood_cleansed_Ricardo de Albuquerque,neighbourhood_cleansed_Rio Comprido,neighbourhood_cleansed_Rocha,neighbourhood_cleansed_Rocha Miranda,neighbourhood_cleansed_Rocinha,neighbourhood_cleansed_Sampaio,neighbourhood_cleansed_Santa Cruz,neighbourhood_cleansed_Santa Teresa,neighbourhood_cleansed_Santo Cristo,neighbourhood_cleansed_Santíssimo,neighbourhood_cleansed_Saúde,neighbourhood_cleansed_Senador Camará,neighbourhood_cleansed_Senador Vasconcelos,neighbourhood_cleansed_Sepetiba,neighbourhood_cleansed_São 
Conrado,neighbourhood_cleansed_São Cristóvão,neighbourhood_cleansed_São Francisco Xavier,neighbourhood_cleansed_Tanque,neighbourhood_cleansed_Taquara,neighbourhood_cleansed_Tauá,neighbourhood_cleansed_Tijuca,neighbourhood_cleansed_Todos os Santos,neighbourhood_cleansed_Tomás Coelho,neighbourhood_cleansed_Turiaçú,neighbourhood_cleansed_Urca,neighbourhood_cleansed_Vargem Grande,neighbourhood_cleansed_Vargem Pequena,neighbourhood_cleansed_Vasco da Gama,neighbourhood_cleansed_Vaz Lobo,neighbourhood_cleansed_Vicente de Carvalho,neighbourhood_cleansed_Vidigal,neighbourhood_cleansed_Vigário Geral,neighbourhood_cleansed_Vila Isabel,neighbourhood_cleansed_Vila Kosmos,neighbourhood_cleansed_Vila Militar,neighbourhood_cleansed_Vila Valqueire,neighbourhood_cleansed_Vila da Penha,neighbourhood_cleansed_Zumbi,neighbourhood_cleansed_Água Santa
0,1.0,0.95,2.0,5.0,1.0,1.0,-22.96599,-43.1794,5,1.0,2.0,2.0,254.0,5,28,5.0,5.0,28.0,28.0,5.0,28.0,1.0,1,1,0,0,5774,26,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.9,1.0,5.0,1.0,1.0,-22.97649,-43.19122,3,1.0,1.0,2.0,252.0,2,60,2.0,2.0,60.0,60.0,2.0,60.0,1.0,1,1,0,0,5689,38,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.97,1.0,2.0,1.0,1.0,-22.98107,-43.19136,2,1.5,1.0,1.0,190.0,3,15,3.0,5.0,7.0,15.0,3.1,14.8,1.0,1,1,0,0,5604,30,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.95,7.0,33.0,1.0,1.0,-22.98591,-43.20302,13,7.0,6.0,7.0,2239.0,7,89,6.0,15.0,89.0,89.0,14.1,89.0,1.0,6,5,1,0,5766,33,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.69,1.0,7.0,1.0,1.0,-22.96574,-43.17514,10,2.5,4.0,4.0,743.0,3,1125,3.0,4.0,1125.0,1125.0,3.0,1125.0,1.0,1,1,0,0,5536,33,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [76]:
print("\n--- Columns after cleaning ---")
print(df.columns.tolist())


--- Columns after cleaning ---
['host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_listings_count', 'host_total_listings_count', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', 'host_days_active', 'amenities_count']
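
Before saving, it can help to check which cleaned columns are still non-numeric and whether any missing values remain. The sketch below is not part of the original notebook: it uses a small toy frame (`toy`, with illustrative values) in place of `df`, and the helper name `summarize_columns` is hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and the number of missing values for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
    })


# Toy stand-in for the cleaned df (values are illustrative only)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", None],
    "price": [254.0, 252.0, np.nan],
    "accommodates": [5, 3, 2],
})

summary = summarize_columns(toy)
print(summary)

# Columns that would still need encoding before modelling
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Applied to the real `df`, the same two calls would show whether columns such as `room_type` or `neighbourhood_cleansed` are still stored as text at this point.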


In [77]:
# Save the processed DataFrame
output_path = os.path.join(output_dir, "listings_processed.csv")
df.to_csv(output_path, index=False)
print(f"\nProcessed DataFrame saved to '{output_path}'")


Processed DataFrame saved to 'data/processed/listings_processed.csv'
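
A quick way to confirm the export worked is to read the file back and compare its shape and column names with the frame in memory. The following is a minimal sketch under assumptions: it writes a toy frame (`toy`, illustrative values) to a temporary directory rather than to `data/processed`, so it can run on its own.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the processed df (values are illustrative only)
toy = pd.DataFrame({
    "latitude": [-22.96599, -22.97649],
    "accommodates": [5, 3],
    "price": [254.0, 252.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "listings_processed.csv")
    # index=False matches the notebook's call, so no extra index column is written
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same shape and column order
same_shape = reloaded.shape == toy.shape
same_columns = reloaded.columns.tolist() == toy.columns.tolist()
print(same_shape, same_columns)
```

In the notebook itself, the same comparison could be made between `df` and `pd.read_csv(output_path)`.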
