# Data extraction (Extract)

This notebook is the first stage of the ETL (Extract, Transform, Load) process. Its goal is to extract, and run an initial inspection of, Airbnb rental listing data for the city of San Francisco, California, USA.

At this stage, **no transformation is applied to the data**. The focus is on understanding the structure of the raw dataset, checking its integrity, and ensuring traceability for the later stages of the project.

## Objective: extraction and initial inspection

Contents:

- Data source (Inside Airbnb)
- Download / reading of the datasets
- `head()`, `info()`, `shape`
- Storage in `data/raw/`

Objectives:

- Obtain the data directly from a reliable source (Inside Airbnb)
- Ensure that the data faithfully represent the phenomenon under study
- Preserve the integrity of the raw dataset
- Run an initial inspection of the structure and data types
- Prepare the environment for the transformation and analysis stages
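
One way to support the integrity and traceability goals listed above is to record a checksum for every raw file, so that later stages can confirm the files have not changed. The sketch below is not part of the original notebook: `sha256_of_file` and `build_manifest` are hypothetical helper names, and it assumes all raw files sit in a single directory such as `../data/raw/`.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never load fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir):
    """Map each file name directly inside raw_dir to its SHA-256 digest."""
    raw_dir = Path(raw_dir)
    return {
        path.name: sha256_of_file(path)
        for path in sorted(raw_dir.iterdir())
        if path.is_file()
    }
```

Calling `build_manifest("../data/raw/")` right after the download and storing the result next to the data would give a reference to compare against before each later run.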

The data used in this project were obtained from the **Inside Airbnb** portal, an independent initiative that publishes public data on short-term rentals in many cities around the world, with the aim of promoting transparency and supporting academic and social analyses.

- Source: Inside Airbnb
- City: San Francisco, California, USA
- Files: `listings.csv.gz`, `calendar.csv.gz`, `reviews.csv.gz`, `neighbourhoods.csv` and `neighbourhoods.geojson`, plus the summary files `listings.csv` and `reviews.csv`

## Importing the libraries

In [1]:
# import the libraries

import pandas as pd
import geopandas as gpd

pd.set_option('display.max_columns', None)

## Loading the datasets



In [2]:
# load the files downloaded from Inside Airbnb

base_path = '../data/raw/'

# detailed files (pandas infers gzip compression from the .gz extension)
df_listings_raw = pd.read_csv(base_path + "listings.csv.gz")
df_calendar_raw = pd.read_csv(base_path + "calendar.csv.gz")
df_reviews_raw = pd.read_csv(base_path + "reviews.csv.gz")

# neighbourhood names (CSV) and neighbourhood boundaries (GeoJSON)
df_neighbourhoods_raw = pd.read_csv(base_path + "neighbourhoods.csv")
gdf_neighbourhoods_raw = gpd.read_file(base_path + "neighbourhoods.geojson")

# summary files
df_listings = pd.read_csv(base_path + "listings.csv", low_memory=False)
df_reviews = pd.read_csv(base_path + "reviews.csv", low_memory=False)

## Initial inspection


In [3]:
# detailed listings file
df_listings_raw.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,host_profile_id,host_profile_url,host_name,host_since,hosts_time_as_user_years,hosts_time_as_user_months,hosts_time_as_host_years,hosts_time_as_host_months,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,availability_eoy,number_of_reviews_ly,estimated_occupancy_l365d,estimated_revenue_l365d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,958,https://www.airbnb.com/rooms/958,20251204025409,2025-12-04,city scrape,"Bright, Modern Garden Unit - 1BR/1BTH",Our bright garden unit overlooks a lovely back...,Quiet cul de sac in friendly neighborhood<br /...,https://a0.muscache.com/pictures/be1bf5ac-a955...,1169,https://www.airbnb.com/users/show/1169,1462506189282101689,https://www.airbnb.com/users/profile/146250618...,Holly,2008-07-31,17.0,4.0,14.0,11.0,"San Francisco, CA",We are a family of four that live upstairs. W...,within an hour,92%,100%,t,https://a0.muscache.com/im/pictures/user/efdad...,https://a0.muscache.com/im/pictures/user/efdad...,Duboce Triangle,1.0,1.0,"['email', 'phone']",t,t,Neighborhood highlights,Western Addition,,37.77028,-122.43317,Entire serviced apartment,Entire home/apt,3,1.0,1 bath,1.0,2.0,"[""Private entrance"", ""Kitchenette"", ""Luggage d...",,2,30,2.0,2.0,1125.0,1125.0,2.0,1125.0,,t,4,10,15,228,2025-12-04,507,41,1,4,50,255,,2009-07-23,2025-11-15,4.88,4.94,4.93,4.96,4.9,4.98,4.77,STR-0006854,f,1,1,0,0,2.54
1,5858,https://www.airbnb.com/rooms/5858,20251204025409,2025-12-04,city scrape,Creative Sanctuary,We live in a large Victorian house on a quiet ...,I love how our neighborhood feels quiet but is...,https://a0.muscache.com/pictures/hosting/Hosti...,8904,https://www.airbnb.com/users/show/8904,1462506623299518225,https://www.airbnb.com/users/profile/146250662...,Philip Jonathon,2009-03-02,16.0,9.0,14.0,11.0,"San Francisco, CA",Philip: English transplant to the Bay Area and...,within a few hours,80%,79%,f,https://a0.muscache.com/im/users/8904/profile_...,https://a0.muscache.com/im/users/8904/profile_...,Bernal Heights,2.0,2.0,"['email', 'phone', 'work_email']",t,t,Neighborhood highlights,Bernal Heights,,37.74474,-122.42089,Entire rental unit,Entire home/apt,4,2.0,2 baths,2.0,2.0,"[""Private entrance"", ""Washer"", ""Hangers"", ""Iro...",,30,90,30.0,30.0,90.0,90.0,30.0,90.0,,t,30,60,90,365,2025-12-04,105,0,0,28,0,0,,2009-05-03,2017-08-06,4.87,4.85,4.87,4.89,4.85,4.77,4.68,,f,1,1,0,0,0.52
2,8142,https://www.airbnb.com/rooms/8142,20251204025409,2025-12-04,city scrape,*FriendlyRoom Apt. Style -UCSF/USF - San Franc...,Nice and good public transportation. 7 minute...,"N Juda Muni, Bus and UCSF Shuttle.<br /><br />...",https://a0.muscache.com/pictures/hosting/Hosti...,21994,https://www.airbnb.com/users/show/21994,1462506956810615042,https://www.airbnb.com/users/profile/146250695...,Aaron,2009-06-17,16.0,5.0,14.0,11.0,"San Francisco, CA",7 minutes walk to UCSF hospital & school campu...,within a few hours,100%,86%,t,https://a0.muscache.com/im/users/21994/profile...,https://a0.muscache.com/im/users/21994/profile...,Cole Valley,20.0,21.0,"['email', 'phone']",t,t,Neighborhood highlights,Haight Ashbury,,37.76555,-122.45213,Private room in rental unit,Private room,1,4.0,4 shared baths,3.0,1.0,"[""Private entrance"", ""Dishes and silverware"", ...",,32,90,32.0,32.0,90.0,90.0,32.0,90.0,,t,27,57,87,362,2025-12-04,10,0,0,25,0,0,,2014-09-08,2023-07-30,4.7,4.5,4.5,4.8,4.8,4.7,4.7,,f,20,0,20,0,0.07
3,8339,https://www.airbnb.com/rooms/8339,20251204025409,2025-12-04,city scrape,Historic Alamo Square Victorian,"For creative humans who love art, space, photo...",,https://a0.muscache.com/pictures/miso/Hosting-...,24215,https://www.airbnb.com/users/show/24215,1462506994551169471,https://www.airbnb.com/users/profile/146250699...,Rosmarie,2009-07-02,16.0,5.0,14.0,11.0,"San Francisco, CA",I'm an Interior Stylist living in SF.,within a few hours,100%,0%,f,https://a0.muscache.com/im/pictures/user/3e1a7...,https://a0.muscache.com/im/pictures/user/3e1a7...,Alamo Square,1.0,5.0,"['email', 'phone']",t,t,,Western Addition,,37.77377,-122.43614,Entire condo,Entire home/apt,2,1.5,1.5 baths,1.0,1.0,"[""First aid kit"", ""Dishes and silverware"", ""Be...",,9,91,9.0,9.0,91.0,91.0,9.0,91.0,,t,4,34,64,339,2025-12-04,25,0,0,2,0,0,,2010-04-14,2019-06-28,4.86,4.88,5.0,4.94,5.0,4.94,4.75,STR-0000264,f,1,1,0,0,0.13
4,10537,https://www.airbnb.com/rooms/10537,20251204025409,2025-12-04,city scrape,Elegant & Cozy w/City views. Private room: Purple,Casa de Paz (House of Peace) is like staying w...,,https://a0.muscache.com/pictures/airflow/Hosti...,36752,https://www.airbnb.com/users/show/36752,1462507288958203289,https://www.airbnb.com/users/profile/146250728...,Teresa,2009-09-07,16.0,2.0,14.0,11.0,"San Francisco, CA",From San Francisco,within an hour,100%,96%,t,https://a0.muscache.com/im/users/36752/profile...,https://a0.muscache.com/im/users/36752/profile...,Bayview,3.0,4.0,"['email', 'phone', 'work_email']",t,t,,Bayview,,37.7175,-122.39698,Private room,Private room,2,1.5,1.5 shared baths,1.0,1.0,"[""Luggage dropoff allowed"", ""First aid kit"", ""...",,1,90,1.0,1.0,90.0,90.0,1.0,90.0,,t,30,60,90,365,2025-12-04,46,12,1,28,12,132,,2010-02-21,2025-11-07,4.98,4.93,4.98,4.95,5.0,4.68,4.8,2022-011003STR,f,3,1,2,0,0.24


In [4]:
df_listings_raw.shape

(7535, 85)

In [5]:
# calendar file
df_calendar_raw.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,1575184,2025-12-04,t,,,3,28
1,1575184,2025-12-05,t,,,3,28
2,1575184,2025-12-06,t,,,3,28
3,1575184,2025-12-07,t,,,3,28
4,1575184,2025-12-08,t,,,3,28


In [6]:
df_calendar_raw.shape

(2750275, 7)

In [7]:
df_calendar_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2750275 entries, 0 to 2750274
Data columns (total 7 columns):
 #   Column          Dtype  
---  ------          -----  
 0   listing_id      int64  
 1   date            object 
 2   available       object 
 3   price           float64
 4   adjusted_price  float64
 5   minimum_nights  int64  
 6   maximum_nights  int64  
dtypes: float64(2), int64(3), object(2)
memory usage: 146.9+ MB


In [8]:
df_calendar_raw.isna().sum()

listing_id              0
date                    0
available               0
price             2750275
adjusted_price    2750275
minimum_nights          0
maximum_nights          0
dtype: int64
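
The null counts above show that `price` and `adjusted_price` are missing in every calendar row. The same check can be repeated on any table with a small helper. This is a minimal sketch rather than the notebook's own code: `null_report` and `fully_empty_columns` are hypothetical names, and the sample frame is synthetic.

```python
import pandas as pd


def null_report(df):
    """Return the null count and null share per column, most-missing first."""
    report = pd.DataFrame({
        "null_count": df.isna().sum(),
        "null_share": df.isna().mean(),
    })
    return report.sort_values("null_share", ascending=False)


def fully_empty_columns(df):
    """List the columns in which every value is missing."""
    return [col for col in df.columns if df[col].isna().all()]


# Tiny synthetic frame standing in for the calendar data (values are made up).
sample = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "available": ["t", "f", "t"],
    "price": [None, None, None],
})
print(null_report(sample))
print(fully_empty_columns(sample))
```

Running `fully_empty_columns` on each raw table would flag columns that carry no information before any transformation is planned around them.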

In [9]:
# reviews
df_reviews_raw.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,958,5977,2009-07-23,15695,Edmund C,"Our experience was, without a doubt, a five st..."
1,958,6660,2009-08-03,26145,Simon,Returning to San Francisco is a rejuvenating t...
2,958,11519,2009-09-27,25839,Denis,We were very pleased with the accommodations a...
3,958,16282,2009-11-05,33750,Anna,We highly recommend this accomodation and agre...
4,958,26008,2010-02-13,15416,V Jillian,Holly's place was great. It was exactly what I...


In [10]:
df_reviews_raw.shape

(424402, 6)

In [11]:
# neighbourhoods - csv
df_neighbourhoods_raw.head()

Unnamed: 0,neighbourhood_group,neighbourhood
0,,Bayview
1,,Bernal Heights
2,,Castro/Upper Market
3,,Chinatown
4,,Crocker Amazon


In [12]:
df_neighbourhoods_raw.shape

(37, 2)

In [13]:
# neighbourhoods - GeoJSON
gdf_neighbourhoods_raw.head()

Unnamed: 0,neighbourhood,neighbourhood_group,geometry
0,Seacliff,,"MULTIPOLYGON (((-122.48409 37.78791, -122.4843..."
1,Haight Ashbury,,"MULTIPOLYGON (((-122.43596 37.76904, -122.4368..."
2,Outer Mission,,"MULTIPOLYGON (((-122.45428 37.70822, -122.4545..."
3,Downtown/Civic Center,,"MULTIPOLYGON (((-122.40891 37.79013, -122.4088..."
4,Diamond Heights,,"MULTIPOLYGON (((-122.43553 37.74146, -122.4356..."


In [14]:
gdf_neighbourhoods_raw.shape

(37, 3)
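
Both neighbourhood files have 37 rows, but equal row counts do not guarantee that they list the same names. The sketch below compares the two sets of names. `compare_names` is a hypothetical helper and the sample frames are synthetic; the same call should work on `df_neighbourhoods_raw` and `gdf_neighbourhoods_raw`, since a GeoDataFrame supports ordinary column selection.

```python
import pandas as pd


def compare_names(left, right, column="neighbourhood"):
    """Return the names found only in `left` and those found only in `right`."""
    left_names = set(left[column].dropna())
    right_names = set(right[column].dropna())
    return sorted(left_names - right_names), sorted(right_names - left_names)


# Synthetic stand-ins for the CSV and GeoJSON tables (names chosen for the demo).
csv_like = pd.DataFrame({"neighbourhood": ["Bayview", "Chinatown", "Seacliff"]})
geo_like = pd.DataFrame({"neighbourhood": ["Seacliff", "Bayview", "Mission"]})

only_csv, only_geo = compare_names(csv_like, geo_like)
print("only in CSV:", only_csv)
print("only in GeoJSON:", only_geo)
```

Two empty lists would mean the files agree on the neighbourhood names.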

In [15]:
# listings.csv
df_listings.head()

Unnamed: 0,id,name,host_id,host_profile_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,958,"Bright, Modern Garden Unit - 1BR/1BTH",1169,1462506189282101689,Holly,,Western Addition,37.77028,-122.43317,Entire home/apt,,2,507,2025-11-15,2.54,1,228,41,STR-0006854
1,5858,Creative Sanctuary,8904,1462506623299518225,Philip Jonathon,,Bernal Heights,37.74474,-122.42089,Entire home/apt,,30,105,2017-08-06,0.52,1,365,0,
2,8142,*FriendlyRoom Apt. Style -UCSF/USF - San Franc...,21994,1462506956810615042,Aaron,,Haight Ashbury,37.76555,-122.45213,Private room,,32,10,2023-07-30,0.07,20,362,0,
3,8339,Historic Alamo Square Victorian,24215,1462506994551169471,Rosmarie,,Western Addition,37.77377,-122.43614,Entire home/apt,,9,25,2019-06-28,0.13,1,339,0,STR-0000264
4,10537,Elegant & Cozy w/City views. Private room: Purple,36752,1462507288958203289,Teresa,,Bayview,37.7175,-122.39698,Private room,,1,46,2025-11-07,0.24,3,365,12,2022-011003STR


In [16]:
df_listings.shape

(7535, 19)
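
The summary and detailed listings files both have 7,535 rows, which suggests they describe the same listings. A quick key comparison can confirm this without changing either table. The sketch below assumes `id` is the listing key in both files; `compare_keys` is a hypothetical helper and the sample frames are synthetic.

```python
import pandas as pd


def compare_keys(detailed, summary, key="id"):
    """Summarise duplicates and mismatches in the key column of two tables."""
    detailed_keys = set(detailed[key])
    summary_keys = set(summary[key])
    return {
        "duplicates_in_detailed": int(detailed[key].duplicated().sum()),
        "duplicates_in_summary": int(summary[key].duplicated().sum()),
        "only_in_detailed": len(detailed_keys - summary_keys),
        "only_in_summary": len(summary_keys - detailed_keys),
    }


# Synthetic stand-ins: the summary table repeats one id and misses another.
detailed_demo = pd.DataFrame({"id": [958, 5858, 8142]})
summary_demo = pd.DataFrame({"id": [958, 5858, 5858]})
print(compare_keys(detailed_demo, summary_demo))
```

For the real data the call would be `compare_keys(df_listings_raw, df_listings)`, and four zeros would indicate matching, unique keys.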

In [17]:
df_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              7535 non-null   int64  
 1   name                            7535 non-null   object 
 2   host_id                         7535 non-null   int64  
 3   host_profile_id                 7535 non-null   int64  
 4   host_name                       7341 non-null   object 
 5   neighbourhood_group             0 non-null      float64
 6   neighbourhood                   7535 non-null   object 
 7   latitude                        7535 non-null   float64
 8   longitude                       7535 non-null   float64
 9   room_type                       7535 non-null   object 
 10  price                           0 non-null      float64
 11  minimum_nights                  7535 non-null   int64  
 12  number_of_reviews               75

In [18]:
df_reviews.head()

Unnamed: 0,listing_id,date
0,958,2009-07-23
1,958,2009-08-03
2,958,2009-09-27
3,958,2009-11-05
4,958,2010-02-13


In [19]:
# organize the raw tables by dataset name
raw_data = {
    "listings": df_listings_raw,
    "calendar": df_calendar_raw,
    "reviews": df_reviews_raw,
    "neighbourhoods": df_neighbourhoods_raw,
    "neighbourhoods_geo": gdf_neighbourhoods_raw
}
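
Keeping the tables in one dictionary makes it easy to inspect them all in a single pass. As an illustration, the sketch below builds a one-row-per-dataset overview. `overview` is a hypothetical helper and `demo` is a synthetic stand-in for `raw_data`.

```python
import pandas as pd


def overview(tables):
    """Build one row per table with its row count, column count and memory use."""
    rows = []
    for name, table in tables.items():
        n_rows, n_cols = table.shape
        # deep=True also counts the memory held by object (string) columns.
        memory_mb = table.memory_usage(deep=True).sum() / 1024 ** 2
        rows.append({
            "dataset": name,
            "rows": n_rows,
            "columns": n_cols,
            "memory_mb": round(memory_mb, 2),
        })
    return pd.DataFrame(rows).set_index("dataset")


# Synthetic stand-in for the raw_data dictionary.
demo = {
    "listings": pd.DataFrame({"id": [958, 5858], "name": ["a", "b"]}),
    "reviews": pd.DataFrame({"listing_id": [958, 958, 958]}),
}
print(overview(demo))
```

With the real dictionary, `overview(raw_data)` would collect the shapes reported cell by cell above into a single table.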

## Documentation

### Loaded datasets

- **df_listings_raw**: detailed information on the listings
- **df_calendar_raw**: availability and price by listing and date
- **df_reviews_raw**: detailed reviews, including guest comments
- **df_neighbourhoods_raw**: list of neighbourhood names
- **gdf_neighbourhoods_raw**: geographic boundaries of the neighbourhoods
- **df_listings**: summary information on the listings
- **df_reviews**: summary of the reviews (listing ID and date only)

### Data limitation

The datasets made available by Inside Airbnb for the city of San Francoisco, as loaded in this project, contain no usable price information. The `price` column exists, but it is empty in every row of `df_calendar_raw` (2,750,275 nulls, as is `adjusted_price`) and of `df_listings` (0 non-null values out of 7,535). This limitation of the data source restricts the analysis to structural, geographic and review-related aspects of the listings, and rules out any investigation of pricing or other financial dimensions.
