# Exploratory analysis of Airbnb (Rio de Janeiro)
**Author:** Douglas Trajano

This notebook gathers the exploratory analysis of the Airbnb data for Rio de Janeiro.

The goal of this project is to build a classifier that predicts the `room_type` of each Airbnb listing in Rio de Janeiro. For that reason, many of the visualizations focus on understanding how the data behaves with respect to this variable.

The full project structure can be seen [here](https://github.com/DougTrajano/ds_airbnb_rio).

**Disclaimer:** There is no explanation of each column. ¯\\_(ツ)_/¯

## / imports

In [1]:
import pandas as pd
import pandas_profiling as pdp
import numpy as np
import utilis_script
import ast
from tqdm import tqdm
import math

# graphs
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

## / load dataset

In [2]:
df_listings = utilis_script.get_data(origin="listings")
df_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33715 entries, 0 to 33714
Columns: 106 entries, id to reviews_per_month
dtypes: float64(24), int64(21), object(61)
memory usage: 27.3+ MB


## / Pandas Profiling

I like to generate the ProfileReport with the Pandas Profiling library because it gives a quick and useful first overview of the dataset before we go deeper into the exploratory analysis.

In [None]:
from IPython.display import HTML

profile = pdp.ProfileReport(df_listings, title='Airbnb RJ - listings.csv')
profile.to_file(output_file="airbnb_data_report.html")

HTML('<a href="airbnb_data_report.html" target="_blank">airbnb_data_report.html</a>')

## / Exploratory analysis

After reviewing the file generated by Pandas Profiling, we found many columns that hold a single value, are entirely null, or are mostly null.

These columns will probably add nothing to our model; later on we will remove the ones that are not relevant.
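
A minimal sketch of how such columns could be flagged programmatically instead of by reading the report. The helper name `find_uninformative_columns` and the 50% null threshold are assumptions made for illustration, and a small toy frame stands in for `df_listings`.

```python
import numpy as np
import pandas as pd


def find_uninformative_columns(frame, null_threshold=0.5):
    """Return columns that are constant or whose null share exceeds null_threshold."""
    null_share = frame.isna().mean()        # fraction of nulls per column
    n_unique = frame.nunique(dropna=True)   # distinct non-null values per column
    constant = set(n_unique[n_unique <= 1].index)
    mostly_null = set(null_share[null_share > null_threshold].index)
    # Keep the original column order
    return [c for c in frame.columns if c in constant or c in mostly_null]


# Toy data (hypothetical values, not taken from listings.csv)
toy = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Brazil", "Brazil"],  # single value
    "license": [np.nan, np.nan, np.nan, np.nan],          # all null
    "square_feet": [np.nan, np.nan, np.nan, 300.0],       # mostly null
    "bedrooms": [1, 2, 3, 1],                             # informative
})
flagged = find_uninformative_columns(toy)
print(flagged)
```

The threshold is a judgment call: a column that is mostly null can still be useful if the missingness itself is informative.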

Now let's get a better understanding of the data by looking at some charts.

### / Distribution of room_types

In [3]:
df_temp = df_listings["room_type"].value_counts()

fig = {
  "data": [
    {
      "values": df_temp.values,
      "labels": df_temp.index,
      "domain": {"x": [0, .48]},
      "hole": .7,
      "type": "pie"
    },
    
    ],
  "layout": {
        "title": "Distribution of room_types",
    }
}

fig = go.Figure(fig)
fig.update_layout(template="seaborn")
iplot(fig)
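
Because `room_type` is the target of the classifier, it can help to read the class shares as numbers in addition to the pie chart. A minimal sketch using `value_counts(normalize=True)` on a toy series (the values below are made up; on the real data the series would be `df_listings["room_type"]`):

```python
import pandas as pd

# Toy target column (hypothetical values)
toy_room_type = pd.Series(
    ["Entire home/apt", "Entire home/apt", "Private room", "Shared room"],
    name="room_type",
)

# Share of each class instead of raw counts
share = toy_room_type.value_counts(normalize=True)
print(share.round(2))
```

A strongly unbalanced target usually calls for stratified splits and metrics other than plain accuracy.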

### / Distribution of accommodates

In [4]:
room_types = df_listings["room_type"].unique().tolist()

# Explicit copy, so the assignment below does not raise SettingWithCopyWarning
df_accommodates = df_listings[["room_type", "accommodates"]].copy()

# Group every listing that accommodates more than 10 guests into one bucket
mask = df_accommodates["accommodates"] > 10
df_accommodates["accommodates"] = df_accommodates["accommodates"].astype(object)
df_accommodates.loc[mask, "accommodates"] = "more than 10"

x_values = df_accommodates.groupby("accommodates").count().index.tolist()
x_values

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 'more than 10']

In [5]:
df_temp = df_accommodates.merge(pd.get_dummies(df_accommodates["room_type"]), left_index=True, right_index=True)
df_temp = df_temp.groupby("accommodates").sum()
df_temp

| accommodates | Entire home/apt | Hotel room | Private room | Shared room |
|---|---|---|---|---|
| 1 | 113.0 | 39.0 | 1505.0 | 200.0 |
| 2 | 2859.0 | 110.0 | 5349.0 | 173.0 |
| 3 | 2379.0 | 34.0 | 888.0 | 62.0 |
| 4 | 7977.0 | 80.0 | 613.0 | 85.0 |
| 5 | 2649.0 | 18.0 | 114.0 | 37.0 |
| 6 | 4363.0 | 18.0 | 135.0 | 65.0 |
| 7 | 775.0 | 0.0 | 16.0 | 11.0 |
| 8 | 1305.0 | 4.0 | 43.0 | 21.0 |
| 9 | 194.0 | 4.0 | 10.0 | 5.0 |
| 10 | 677.0 | 3.0 | 22.0 | 17.0 |
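
The `get_dummies` merge followed by `groupby(...).sum()` builds a contingency table of `accommodates` against `room_type`. `pd.crosstab` is a pandas function that produces the same kind of table in a single call. A minimal sketch on a toy frame (the values are made up for illustration, not taken from the listings data):

```python
import pandas as pd

# Toy stand-in for df_accommodates (hypothetical values)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Private room", "Shared room"],
    "accommodates": [4, 2, 2, 1],
})

# Rows: accommodates, columns: room_type, cells: number of listings
counts = pd.crosstab(toy["accommodates"], toy["room_type"])
print(counts)
```

The result holds integer counts, with 0 for combinations that never occur.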


In [6]:
data_vis = []

for room_type in room_types:
    y_values = df_temp[room_type].tolist()
    temp = go.Bar(name=room_type, x=x_values, y=y_values)
    data_vis.append(temp)
    
fig = go.Figure(data=data_vis)
fig.update_layout(barmode='group', xaxis=dict(tickmode = 'array',
                                              tickvals = list(range(1, len(x_values)+1)),
                                              ticktext = [str(each) for each in x_values]),
                  title="Distribution of accommodates / room_type", template="seaborn")
fig.show()

### / Selecting the most likely features for the model

Using the <a href="airbnb_data_report.html" target="_blank">airbnb_data_report.html</a> report, we will qualitatively assess which features are likely to matter most for the model and, based on them, build a few more charts.

Irrelevant features that can be removed:

In [7]:
cols_to_remove = ["city", "calendar_updated", "bed_type", "availability_60", "availability_90", 
                  "availability_365", "calendar_last_scraped", "calculated_host_listings_count_entire_homes", 
                  "country", "country_code", "experiences_offered", "first_review", "has_availability", 
                  "host_acceptance_rate", "host_has_profile_pic", "host_id", "host_location", "host_name", 
                  "host_picture_url", "host_since", "host_thumbnail_url", "host_total_listings_count", 
                  "host_url", "id", "interaction", "is_business_travel_ready", "jurisdiction_names", 
                  "last_review", "last_scraped", "latitude", "longitude", "license", "listing_url", 
                  "market", "maximum_minimum_nights", "maximum_nights", "maximum_nights_avg_ntm", 
                  "medium_url", "minimum_maximum_nights", "minimum_minimum_nights","minimum_nights", 
                  "minimum_nights_avg_ntm", "neighborhood_overview", "neighbourhood_cleansed", 
                  "neighbourhood_group_cleansed", "notes","number_of_reviews", "number_of_reviews_ltm", 
                  "picture_url", "require_guest_phone_verification", "require_guest_profile_picture", 
                  "requires_license", "review_scores_accuracy", "review_scores_checkin", "review_scores_cleanliness", 
                  "review_scores_communication", "review_scores_location", "review_scores_rating", 
                  "review_scores_value", "reviews_per_month", "scrape_id", "smart_location", "space", 
                  "square_feet", "state", "street", "summary", "thumbnail_url", "transit", 
                  "xl_picture_url", "zipcode"]

df_listings.drop(columns=cols_to_remove, inplace=True)
df_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33715 entries, 0 to 33714
Data columns (total 35 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   name                                          33654 non-null  object 
 1   description                                   32785 non-null  object 
 2   access                                        16032 non-null  object 
 3   house_rules                                   17424 non-null  object 
 4   host_about                                    16172 non-null  object 
 5   host_response_time                            21481 non-null  object 
 6   host_response_rate                            21481 non-null  object 
 7   host_is_superhost                             33695 non-null  object 
 8   host_neighbourhood                            21399 non-null  object 
 9   host_listings_count                           33695 non-null 


Relevant features:

- bedrooms
- beds
- bathrooms
- cleaning_fee
- cancellation_policy
- calculated_host_listings_count
- calculated_host_listings_count_private_rooms
- calculated_host_listings_count_shared_rooms
- extra_people
> Needs to be transformed to yield insights
- amenities (dict)
> Needs to be transformed to yield insights
- availability_30
> Needs to be transformed to yield insights
- access (str)
> Needs to be transformed to yield insights
- description
> Needs to be transformed to yield insights
- guests_included
- host_about
> Needs to be transformed to yield insights
- host_identity_verified
- host_is_superhost
- host_listings_count
- host_neighbourhood
- host_response_rate
> Needs to be transformed to yield insights
- host_response_time
- host_verifications
> Needs to be transformed to yield insights
- house_rules
> Needs to be transformed to yield insights
- instant_bookable
- is_location_exact
- maximum_maximum_nights
- monthly_price
> Needs to be transformed to yield insights
- name
> Needs to be transformed to yield insights
- neighbourhood
- price
- property_type
- security_deposit
> Needs to be transformed to yield insights
- weekly_price
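
As an illustration of the kind of transformation the notes above refer to, here is a minimal sketch that turns currency and percentage strings into numbers with vectorized pandas calls. The input formats (`"$1,250.00"`, `"95%"`) and the toy values are assumptions made for the example.

```python
import pandas as pd

# Toy columns (hypothetical values in the assumed formats)
toy_price = pd.Series(["$1,250.00", "$80.00", None], name="price")
toy_rate = pd.Series(["95%", "100%", None], name="host_response_rate")

# Drop the currency symbol and thousands separator, then parse as numbers
price = pd.to_numeric(
    toy_price.str.replace(r"[$,]", "", regex=True),
    errors="coerce",  # anything unparseable becomes NaN
)

# Drop the trailing percent sign, then parse as numbers
rate = pd.to_numeric(toy_rate.str.rstrip("%"), errors="coerce")

print(price.tolist())
print(rate.tolist())
```

Missing values stay missing, so the decision on how to impute them can be made later.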

In [8]:
df_listings["property_type"].value_counts()

Apartment                 25774
House                      3507
Condominium                1851
Serviced apartment          722
Loft                        622
Guest suite                 202
Bed and breakfast           141
Guesthouse                  134
Villa                       110
Hostel                      103
Other                        94
Hotel                        90
Townhouse                    64
Aparthotel                   59
Cottage                      32
Chalet                       31
Boutique hotel               29
Earth house                  29
Tiny house                   29
Boat                         23
Casa particular (Cuba)       11
Cabin                        10
Nature lodge                 10
Bungalow                     10
Island                        5
Treehouse                     4
Hut                           3
Castle                        3
Houseboat                     2
Camper/RV                     2
Dorm                          2
Farm sta

### / Distribution of property_type

In [9]:
df_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33715 entries, 0 to 33714
Data columns (total 35 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   name                                          33654 non-null  object 
 1   description                                   32785 non-null  object 
 2   access                                        16032 non-null  object 
 3   house_rules                                   17424 non-null  object 
 4   host_about                                    16172 non-null  object 
 5   host_response_time                            21481 non-null  object 
 6   host_response_rate                            21481 non-null  object 
 7   host_is_superhost                             33695 non-null  object 
 8   host_neighbourhood                            21399 non-null  object 
 9   host_listings_count                           33695 non-null 

In [26]:
def processing(dataset):
    dataset = dataset.to_dict(orient="records")
    new_dataset = []
    bool_cols = ["is_location_exact", "host_is_superhost", "host_identity_verified",
                "instant_bookable"]
    
    # processing each record
    with tqdm(total=len(dataset)) as pbar:      
        for each in dataset:
            each["property_type"] = _property_type(each["property_type"])
            each["host_response_rate"] = _host_response_rate(each["host_response_rate"])
            each["price"] = _price(each["price"])
            each["weekly_price"] = _price(each["weekly_price"])
            each["monthly_price"] = _price(each["monthly_price"])
            each["security_deposit"] = _price(each["security_deposit"])
            each["cleaning_fee"] = _price(each["cleaning_fee"])
            each["extra_people"] = _price(each["extra_people"])
            
            host_verifications = _host_verifications(each["host_verifications"])
            if isinstance(host_verifications, dict):
                each = {**each, **host_verifications}
                del each["host_verifications"]

            amenities = _amenities(each["amenities"])
            if isinstance(amenities, dict):
                each = {**each, **amenities}
                del each["amenities"]
            
            for col in bool_cols:
                each[col] = _bool_convert(each[col])

            # add processed record
            new_dataset.append(each)
            pbar.update(1)
    new_dataset = pd.DataFrame(new_dataset)
    return new_dataset

def _property_type(value):
    # Collapse rare property types into a handful of broader groups
    if value in ["Apartment", "House", "Condominium", "Loft", "Guest suite"]:
        return value
    elif value == "Serviced apartment":
        return "Apartment"
    elif value in ["Guesthouse", "Townhouse", "Tiny house", "Earth house"]:
        return "House"
    elif value in ["Boutique hotel", "Aparthotel", "Hostel"]:
        return "Hotel"
    else:
        return "Others"
    
def _bool_convert(value):
    if value == "t":
        return 1
    elif value == "f":
        return 0
    else:
        return np.nan
    
def _host_response_rate(value):
    # Convert strings such as "95%" to int; anything else (e.g. NaN) becomes NaN
    try:
        return int(value.replace("%", ""))
    except (AttributeError, ValueError):
        return np.nan

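def _host_verifications(value):
    # Parse a string such as "['email', 'phone']" into one indicator key per item
    hosts = {}
    try:
        # Parsing happens inside the try so a missing or malformed value yields NaN
        for each in ast.literal_eval(value):
            key_name = "host_verifications_" + each
            hosts[key_name] = 1
    except (ValueError, SyntaxError, TypeError):
        hosts = np.nan
    return hosts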

def _amenities(value):
    # Turn a string such as '{TV,"Cable TV",Wifi}' into {"amenities_tv": 1, ...}
    try:
        # Rewrite the {a,b,c} literal as a Python list literal and parse it
        value = value.replace('"', '')
        value = value.replace('{', '["')
        value = value.replace('}', '"]')
        value = value.replace(',', '","')
        value = ast.literal_eval(value)

        amenities = {}
        for each in value:
            key_name = "amenities_" + each.replace(" ", "_").lower()
            amenities[key_name] = 1
    except (AttributeError, ValueError, SyntaxError):
        # Missing or unparseable values are returned as NaN
        amenities = np.nan
    return amenities

def _price(value):
    if isinstance(value, str):
        value = value.replace("$", "")
        value = value.split(".")[0]
        value = value.replace(",", "")
        value = int(value)
    else:
        value = np.nan
    return value

df = processing(df_listings)
print(df.shape)
df.info()

100%|█████████████████████████████████████████████████████████████████████████| 33715/33715 [00:03<00:00, 10054.26it/s]


(33715, 219)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33715 entries, 0 to 33714
Columns: 219 entries, name to amenities_hbo_go
dtypes: float64(196), int64(11), object(12)
memory usage: 56.3+ MB
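
The record-by-record loop works, but parts of it could also be written with vectorized pandas operations. A minimal sketch for the boolean flags and the amenities indicators, run on a toy frame with made-up values. One difference to be aware of: `str.get_dummies` fills absent amenities with 0, whereas the loop above leaves them as NaN.

```python
import pandas as pd

# Toy stand-in for df_listings (hypothetical values)
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", None],
    "amenities": ['{TV,"Cable TV",Wifi}', "{Wifi}", "{}"],
})

# "t"/"f" flags become 1/0; anything else becomes NaN
toy["host_is_superhost"] = toy["host_is_superhost"].map({"t": 1, "f": 0})

# One indicator column per amenity, named like the loop's output columns
amenities = (
    toy["amenities"]
    .str.strip("{}")
    .str.replace('"', "", regex=False)
    .str.lower()
    .str.replace(" ", "_", regex=False)
    .str.get_dummies(sep=",")
    .add_prefix("amenities_")
)
print(amenities)
```

The indicator frame can then be joined back with `pd.concat([toy, amenities], axis=1)`.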


In [24]:
for col in df.columns:
    if df[col].dtype == np.dtype("O"):
        if df[col].nunique() > 10:
            print(col, "-", df[col].nunique())

name - 32590
description - 32052
access - 14177
house_rules - 15597
host_about - 9571
host_neighbourhood - 164
neighbourhood - 99


In [27]:
df_listings = df

df_temp = df_listings[["property_type", "room_type", "host_is_superhost"]].groupby(["room_type", "property_type"]).count()
df_temp = pd.DataFrame(df_temp).reset_index()

y_values = df_temp["property_type"].unique().tolist()

temp = {}
for each in y_values:
    y_lst = []
    for room_type in room_types:
        try:
            value = df_temp[(df_temp["property_type"] == each) & 
                            (df_temp["room_type"] == room_type)]["host_is_superhost"].values[0]
        except:
            value = 0
        finally:
            y_lst.append(value)
    temp[each] = y_lst

fig = go.Figure()

for each in y_values:
    fig.add_trace(go.Bar(x=room_types, y=temp[each], name=each))

fig.update_layout(barmode='stack', template="seaborn", title="Distribution of property_type / room_type")
fig.show()

In [28]:
file_path = "./data/listings.csv"
df_listings.to_csv(file_path, index=False, encoding="utf-8")
print("File saved:", file_path)

File saved: ./data/listings.csv


## / Conclusions

This gave us an initial view of the data. The other features I consider relevant have `dtype == "object"`, so we will need to process the data to generate the numeric features required to develop the model.

See you in the next notebook (processing.ipynb). =)