# Data Extraction (Extract)


This notebook covers the first stage of the ETL (Extract, Transform, Load) process. Its goal is to extract and perform an initial inspection of Airbnb listing data for the city of San Francisco, California, USA.

At this stage, **no transformation is applied to the data**. The focus is on understanding the structure of the raw dataset, verifying its integrity, and ensuring traceability for the later stages of the project.

## Objective: Extraction and initial inspection

Contents:

- Data source (Inside Airbnb)
- Downloading / reading the dataset
- head(), info(), shape
- Saving to data/raw/


Objectives:

- Obtain the data directly from a reliable source (Inside Airbnb)
- Ensure that the data faithfully represent the phenomenon under study
- Preserve the integrity of the raw dataset
- Perform an initial inspection of the structure and data types
- Prepare the environment for the transformation and analysis stages

The data used in this project were obtained from the **Inside Airbnb** portal, an independent initiative that publishes public data on short-term rentals in many cities around the world, with the aim of promoting transparency and supporting academic and social analyses.

- Source: Inside Airbnb
- City: San Francisco, California, USA
- Datasets: listings.csv.gz, calendar.csv.gz, reviews.csv.gz, neighbourhoods.csv, neighbourhoods.geojson

## Importing Libraries

In [1]:
# import libraries

import pandas as pd
import numpy as np
import os
import geopandas as gpd
import matplotlib.pyplot as plt


## Loading the Datasets



In [2]:
# load multiple files from Inside Airbnb

base_path = '../data/raw/'

df_listings_raw = pd.read_csv(base_path + "listings.csv.gz")
df_calendar_raw = pd.read_csv(base_path + "calendar.csv.gz")
df_reviews_raw = pd.read_csv(base_path + "reviews.csv.gz")
df_neighbourhoods_raw = pd.read_csv(base_path + "neighbourhoods.csv")
gdf_neighbourhoods_raw = gpd.read_file(base_path + "neighbourhoods.geojson")
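
Since this stage is meant to preserve the raw files and keep them traceable, one option is to record a checksum for each file at load time. The following is a minimal sketch, not part of the original notebook: the helper names `sha256_of_file` and `build_manifest` are hypothetical, and it assumes the files sit in the folder referenced by `base_path`.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never held fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder, names):
    """Map each file name to its size in bytes and SHA-256 digest (hypothetical helper)."""
    manifest = {}
    for name in names:
        path = Path(folder) / name
        if not path.exists():
            # Missing files are recorded as None instead of raising an error.
            manifest[name] = None
            continue
        manifest[name] = {
            "bytes": path.stat().st_size,
            "sha256": sha256_of_file(path),
        }
    return manifest
```

Calling `build_manifest(base_path, ["listings.csv.gz", "calendar.csv.gz", "reviews.csv.gz"])` would return one entry per file. Storing that dictionary next to the raw data makes it possible to confirm later that the files were not modified.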

## Initial inspection


In [3]:
# listings file
df_listings_raw.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,958,https://www.airbnb.com/rooms/958,20251204025409,2025-12-04,city scrape,"Bright, Modern Garden Unit - 1BR/1BTH",Our bright garden unit overlooks a lovely back...,Quiet cul de sac in friendly neighborhood<br /...,https://a0.muscache.com/pictures/be1bf5ac-a955...,1169,...,4.9,4.98,4.77,STR-0006854,f,1,1,0,0,2.54
1,5858,https://www.airbnb.com/rooms/5858,20251204025409,2025-12-04,city scrape,Creative Sanctuary,We live in a large Victorian house on a quiet ...,I love how our neighborhood feels quiet but is...,https://a0.muscache.com/pictures/hosting/Hosti...,8904,...,4.85,4.77,4.68,,f,1,1,0,0,0.52
2,8142,https://www.airbnb.com/rooms/8142,20251204025409,2025-12-04,city scrape,*FriendlyRoom Apt. Style -UCSF/USF - San Franc...,Nice and good public transportation. 7 minute...,"N Juda Muni, Bus and UCSF Shuttle.<br /><br />...",https://a0.muscache.com/pictures/hosting/Hosti...,21994,...,4.8,4.7,4.7,,f,20,0,20,0,0.07
3,8339,https://www.airbnb.com/rooms/8339,20251204025409,2025-12-04,city scrape,Historic Alamo Square Victorian,"For creative humans who love art, space, photo...",,https://a0.muscache.com/pictures/miso/Hosting-...,24215,...,5.0,4.94,4.75,STR-0000264,f,1,1,0,0,0.13
4,10537,https://www.airbnb.com/rooms/10537,20251204025409,2025-12-04,city scrape,Elegant & Cozy w/City views. Private room: Purple,Casa de Paz (House of Peace) is like staying w...,,https://a0.muscache.com/pictures/airflow/Hosti...,36752,...,5.0,4.68,4.8,2022-011003STR,f,3,1,2,0,0.24


In [4]:
# calendar file
df_calendar_raw.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,1575184,2025-12-04,t,,,3,28
1,1575184,2025-12-05,t,,,3,28
2,1575184,2025-12-06,t,,,3,28
3,1575184,2025-12-07,t,,,3,28
4,1575184,2025-12-08,t,,,3,28


In [5]:
# reviews
df_reviews_raw.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,958,5977,2009-07-23,15695,Edmund C,"Our experience was, without a doubt, a five st..."
1,958,6660,2009-08-03,26145,Simon,Returning to San Francisco is a rejuvenating t...
2,958,11519,2009-09-27,25839,Denis,We were very pleased with the accommodations a...
3,958,16282,2009-11-05,33750,Anna,We highly recommend this accomodation and agre...
4,958,26008,2010-02-13,15416,V Jillian,Holly's place was great. It was exactly what I...


In [6]:
# neighbourhoods - csv
df_neighbourhoods_raw.head()

Unnamed: 0,neighbourhood_group,neighbourhood
0,,Bayview
1,,Bernal Heights
2,,Castro/Upper Market
3,,Chinatown
4,,Crocker Amazon


In [7]:
# neighbourhoods - GeoJSON
gdf_neighbourhoods_raw.head()

Unnamed: 0,neighbourhood,neighbourhood_group,geometry
0,Seacliff,,"MULTIPOLYGON (((-122.48409 37.78791, -122.4843..."
1,Haight Ashbury,,"MULTIPOLYGON (((-122.43596 37.76904, -122.4368..."
2,Outer Mission,,"MULTIPOLYGON (((-122.45428 37.70822, -122.4545..."
3,Downtown/Civic Center,,"MULTIPOLYGON (((-122.40891 37.79013, -122.4088..."
4,Diamond Heights,,"MULTIPOLYGON (((-122.43553 37.74146, -122.4356..."


In [8]:
# organize the raw datasets by type
raw_data = {
    "listings": df_listings_raw,
    "calendar": df_calendar_raw,
    "reviews": df_reviews_raw,
    "neighbourhoods": df_neighbourhoods_raw,
    "neighbourhoods_geo": gdf_neighbourhoods_raw
}
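
With the frames collected in one dictionary, the shape and missing-value counts can be gathered in a single pass. This complements the `head()` previews above. The sketch below is illustrative only: `summarize` is a hypothetical helper, and the toy frames stand in for the real data so the block runs on its own.

```python
import pandas as pd


def summarize(datasets):
    """Build one summary row per dataset: row count, column count, missing cells."""
    rows = []
    for name, frame in datasets.items():
        rows.append({
            "dataset": name,
            "rows": frame.shape[0],
            "columns": frame.shape[1],
            # isna() marks missing cells; the double sum counts them over the whole frame.
            "missing_cells": int(frame.isna().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("dataset")


# Toy stand-ins for the real frames (illustrative values only).
toy_data = {
    "listings": pd.DataFrame({"id": [958, 5858], "license": ["STR-0006854", None]}),
    "calendar": pd.DataFrame({"listing_id": [958, 958, 958], "price": [None, None, None]}),
}
summary = summarize(toy_data)
print(summary)
```

In the notebook itself, `summarize(raw_data)` would produce the same table for the five loaded datasets. For per-column types, `df_listings_raw.info()` can be called on each frame.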

## Documentation

### Loaded datasets

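- **df_listings_raw**: general information about each listing
- **df_calendar_raw**: availability and prices by date
- **df_reviews_raw**: guest reviews and comments
- **df_neighbourhoods_raw**: neighbourhood names (the neighbourhood_group column is empty in the preview)
- **gdf_neighbourhoods_raw**: geographic boundaries of the neighbourhoods (MULTIPOLYGON geometries)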
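
The previews suggest that `listing_id` in the calendar and reviews files refers to `id` in the listings file. A simple integrity check is to look for keys with no match, without changing any data. The sketch below assumes that relationship. `unmatched_keys` is a hypothetical helper, and the value 9999 is made up to show an orphan key.

```python
import pandas as pd


def unmatched_keys(child, child_key, parent, parent_key):
    """Return the sorted child key values that have no match in the parent table."""
    child_values = set(child[child_key].dropna().tolist())
    parent_values = set(parent[parent_key].dropna().tolist())
    return sorted(child_values - parent_values)


# Toy frames: ids 958, 5858 and 8142 appear in the previews; 9999 is invented.
listings = pd.DataFrame({"id": [958, 5858, 8142]})
reviews = pd.DataFrame({"listing_id": [958, 958, 9999]})

orphans = unmatched_keys(reviews, "listing_id", listings, "id")
print(orphans)
```

On the real data, the equivalent call would be `unmatched_keys(df_reviews_raw, "listing_id", df_listings_raw, "id")`. The same function could compare the `neighbourhood` column of the CSV and GeoJSON files.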