>Instituto Federal de Educação, Ciência e Tecnologia
>>Câmpus Campinas<br>
>>D3TOP – Tópicos em Ciência de Dados<br>
>>Prof.: Samuel Botter Martins<br>

## Preprocessing and data clean step

### Environment Config

In [2]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [3]:
%%capture
!pip install -U spacy
!pip install -U gensim
!pip install -U neattext
!pip install -U pyspellchecker
!pip install -U stanza 
!pip install unidecode

In [4]:
# Importando bibliotecas e setando parâmetros

import numpy as np
import pandas as pd
from IPython.display import display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 500)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# seta o idioma principal do dataset a ser processado e limpo
LANG = 'english' # or 'portuguese'

#### Useful Functions

In [5]:
def remove_cols(df, mask_cmds, inverse=True):
  '''
  Remove colunas de um dado dataframe df, a partir de uma máscara booleana mask.

  Parameters:
    df (pd.Dataframe): dataframe no qual sofrera a operacao de remocao de colunas
    mask_cmds (list): array de expressões a serem avaliadas. As expressões deve retornar array de booleanos contendo True nas posições das colunas que se deseja remover
    inverse (bool): (opcional) - indica se a máscara será invertida ou não. Default: True

  Returns:
    df (pd.Dataframe): dataframe sem as colunas indicadas na máscara
  '''
  mask_exp = mask_cmds.pop(0)
  mask = eval(mask_exp)
  if inverse:
    return remove_cols(df[df.columns[~mask]], mask_cmds) if len(mask_cmds) else df[df.columns[~mask]]
  return remove_cols(df[df.columns[mask]], mask_cmds) if len(mask_cmds) else df[df.columns[mask]]

In [6]:
def missing_values_table(df):
  '''
  Funcao que retorna o calculo de valores ausentes, absoluto e percentual, das colunas de um dado dataframe pandas

  Parameters:
    df (pd.Dataframe): dataframe que será  avaliado

  Returns:
    df (pd.Dataframe): dataframe contendo nas linhas o nome das colunas do dataframe original, e nas colunas
    a quantidade total de valores ausentes e o seu percentual correspondente.
  '''
  # Total missing values
  mis_val = df.isnull().sum()
  
  # Percentage of missing values
  mis_val_percent = 100 * df.isnull().sum() / len(df)
  
  # Make a table with the results
  mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
  
  # Rename the columns
  mis_val_table_ren_columns = mis_val_table.rename(
  columns = {0 : 'Missing Values', 1 : '% of Total Values'})
  
  # Sort the table by percentage of missing descending
  mis_val_table_ren_columns = mis_val_table_ren_columns[
      mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
  '% of Total Values', ascending=False).round(1)
  
  # Print some summary information
  print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
      "There are " + str(mis_val_table_ren_columns.shape[0]) +
        " columns that have missing values.")
  
  # Return the dataframe with missing information
  return mis_val_table_ren_columns

In [7]:
def neat_preprocessing(text_in: str ) -> str:
  '''
  Funcao auxiliar que executa algumas etapas de limpeza em um dado texto usando a biblioteca neattext.
  - Remove tags HTML, URLs, pontuacoes, emails, numeros, simbolos de moeda, espaços multiplos
  - Substitue @usernames pela palavra "USER"
  - Substitue codigos de emogis pela palavra "EMOJI"

  Parameters:
    text_in (str): string de texto que sofrera o processo de limpeza.

  Returns:
    text: string de texto apos o processo de limpeza.
  '''
  import neattext.functions as ntx

  text = text_in.lower()
  
  # text = ntx.fix_contractions(text)
  text = ntx.remove_html_tags(text)
  text = ntx.remove_punctuations(text) 
  text = ntx.replace_term(text, ntx.USER_HANDLES_REGEX, 'USER')  
  text = ntx.remove_urls(text)
  text = ntx.remove_emails(text)
  text = ntx.remove_phone_numbers(text)
  text = ntx.remove_numbers(text)  
  text = ntx.replace_emojis(text, "EMOJI")
  text = ntx.remove_multiple_spaces(text)
  text = ntx.remove_currency_symbols(text)
  
  return text


In [8]:
from concurrent.futures import ThreadPoolExecutor
from tqdm.auto import tqdm

tqdm.pandas()

def parallel_applymap(df, func, worker_count, **kwargs):
  '''
  Funcao que paraleliza o metodo applymap do pandas

  Parameters:
    df (pd.DataFrame): dataframe no qual sera aplicado a funcao.
    func (callable):  funcao a ser aplicada a todas a celulas do dataframe.
    worker_count (int): quantidade de tarefas em paralelo que serão iniciadas. O dataframe sera subdividido em um numero = worker_count para cada tarefa processar.
    kwargs: demais parametros da funcao

  Returns:
    pd.Dataframe concatenado.
  '''
  def _apply(shard):
      return shard.progress_applymap(func, **kwargs)

  shards = np.array_split(df, worker_count)
  with ThreadPoolExecutor(max_workers=worker_count) as e:
      futures = e.map(_apply, shards)
  return pd.concat(list(futures))

### Data gathering

The data gathering procedure was done through [Apify](https://apify.com/dtrungtin/airbnb-scraper/input-schema#locationQuery), a web scraping and automation platform.

In [10]:
file_path = 'https://media.githubusercontent.com/media/HWatanuki/Projeto_D3TOP/main/Datasets/'
file_name = 'dataset_airbnb-scraper_2023-04-13_03-28-09-439.csv'

df_raw = pd.read_csv(file_path + file_name, dtype=str)
print(f'No Rows: {df_raw.shape[0]}')
print(f'No Columns: {df_raw.shape[1]}', end='\n\n')
display(df_raw.head(2))

No Rows: 547
No Columns: 796



Unnamed: 0,additionalHosts/0/about,additionalHosts/0/badges/0,additionalHosts/0/firstName,additionalHosts/0/hasInclusionBadge,additionalHosts/0/hasProfilePic,additionalHosts/0/hostUrl,additionalHosts/0/id,additionalHosts/0/isSuperHost,additionalHosts/0/languages/0,additionalHosts/0/languages/1,additionalHosts/0/languages/2,additionalHosts/0/languages/3,additionalHosts/0/languages/4,additionalHosts/0/memberSince,additionalHosts/0/pictureUrl,additionalHosts/0/responseRate,additionalHosts/0/responseTime,additionalHosts/0/smartName,additionalHosts/0/thumbnailUrl,additionalHosts/1/about,additionalHosts/1/firstName,additionalHosts/1/hasInclusionBadge,additionalHosts/1/hasProfilePic,additionalHosts/1/hostUrl,additionalHosts/1/id,additionalHosts/1/isSuperHost,additionalHosts/1/languages/0,additionalHosts/1/languages/1,additionalHosts/1/languages/2,additionalHosts/1/memberSince,additionalHosts/1/pictureUrl,additionalHosts/1/smartName,additionalHosts/1/thumbnailUrl,additionalHosts/2/about,additionalHosts/2/firstName,additionalHosts/2/hasInclusionBadge,additionalHosts/2/hasProfilePic,additionalHosts/2/hostUrl,additionalHosts/2/id,additionalHosts/2/isSuperHost,additionalHosts/2/languages/0,additionalHosts/2/languages/1,additionalHosts/2/memberSince,additionalHosts/2/pictureUrl,additionalHosts/2/smartName,additionalHosts/2/thumbnailUrl,address,isAvailable,isHostedBySuperhost,location/lat,location/lng,name,numberOfGuests,photos/0/caption,photos/0/pictureUrl,photos/0/thumbnailUrl,photos/1/caption,photos/1/pictureUrl,photos/1/thumbnailUrl,photos/2/caption,photos/2/pictureUrl,photos/2/thumbnailUrl,photos/3/caption,photos/3/pictureUrl,photos/3/thumbnailUrl,photos/4/caption,photos/4/pictureUrl,photos/4/thumbnailUrl,photos/5/caption,photos/5/pictureUrl,photos/5/thumbnailUrl,photos/6/caption,photos/6/pictureUrl,photos/6/thumbnailUrl,photos/7/caption,photos/7/pictureUrl,photos/7/thumbnailUrl,photos/8/caption,photos/8/pictureUrl,photos/8/thumbnailUrl,photos/9/caption,photos/9/pictureUrl,photos/9/thumbnailUrl,photos/10/caption,photos/10/pictureUrl,photos/10/thumbnailUrl,photos/11/caption,photos/11/pictureUrl,photos/11/thumbnailUrl,photos/12/caption,photos/12/pictureUrl,photos/12/thumbnailUrl,photos/13/caption,photos/13/pictureUrl,photos/13/thumbnailUrl,photos/14/caption,photos/14/pictureUrl,photos/14/thumbnailUrl,photos/15/caption,photos/15/pictureUrl,photos/15/thumbnailUrl,photos/16/caption,photos/16/pictureUrl,photos/16/thumbnailUrl,photos/17/caption,photos/17/pictureUrl,photos/17/thumbnailUrl,photos/18/caption,photos/18/pictureUrl,photos/18/thumbnailUrl,photos/19/caption,photos/19/pictureUrl,photos/19/thumbnailUrl,photos/20/caption,photos/20/pictureUrl,photos/20/thumbnailUrl,photos/21/caption,photos/21/pictureUrl,photos/21/thumbnailUrl,photos/22/caption,photos/22/pictureUrl,photos/22/thumbnailUrl,photos/23/caption,photos/23/pictureUrl,photos/23/thumbnailUrl,photos/24/caption,photos/24/pictureUrl,photos/24/thumbnailUrl,photos/25/caption,photos/25/pictureUrl,photos/25/thumbnailUrl,photos/26/caption,photos/26/pictureUrl,photos/26/thumbnailUrl,photos/27/caption,photos/27/pictureUrl,photos/27/thumbnailUrl,photos/28/caption,photos/28/pictureUrl,photos/28/thumbnailUrl,photos/29/caption,photos/29/pictureUrl,photos/29/thumbnailUrl,photos/30/caption,photos/30/pictureUrl,photos/30/thumbnailUrl,photos/31/caption,photos/31/pictureUrl,photos/31/thumbnailUrl,photos/32/caption,photos/32/pictureUrl,photos/32/thumbnailUrl,photos/33/caption,photos/33/pictureUrl,photos/33/thumbnailUrl,photos/34/caption,photos/34/pictureUrl,photos/34/thumbnailUrl,photos/35/caption,photos/35/pictureUrl,photos/35/thumbnailUrl,photos/36/caption,photos/36/pictureUrl,photos/36/thumbnailUrl,photos/37/caption,photos/37/pictureUrl,photos/37/thumbnailUrl,photos/38/caption,photos/38/pictureUrl,photos/38/thumbnailUrl,photos/39/caption,photos/39/pictureUrl,photos/39/thumbnailUrl,photos/40/caption,photos/40/pictureUrl,photos/40/thumbnailUrl,photos/41/caption,photos/41/pictureUrl,photos/41/thumbnailUrl,photos/42/caption,photos/42/pictureUrl,photos/42/thumbnailUrl,photos/43/caption,photos/43/pictureUrl,photos/43/thumbnailUrl,photos/44/caption,photos/44/pictureUrl,photos/44/thumbnailUrl,photos/45/caption,photos/45/pictureUrl,photos/45/thumbnailUrl,photos/46/caption,photos/46/pictureUrl,photos/46/thumbnailUrl,photos/47/caption,photos/47/pictureUrl,photos/47/thumbnailUrl,photos/48/caption,photos/48/pictureUrl,photos/48/thumbnailUrl,photos/49/caption,photos/49/pictureUrl,photos/49/thumbnailUrl,photos/50/caption,photos/50/pictureUrl,photos/50/thumbnailUrl,photos/51/caption,photos/51/pictureUrl,photos/51/thumbnailUrl,photos/52/caption,photos/52/pictureUrl,photos/52/thumbnailUrl,photos/53/caption,photos/53/pictureUrl,photos/53/thumbnailUrl,photos/54/caption,photos/54/pictureUrl,photos/54/thumbnailUrl,photos/55/caption,photos/55/pictureUrl,photos/55/thumbnailUrl,photos/56/caption,photos/56/pictureUrl,photos/56/thumbnailUrl,photos/57/caption,photos/57/pictureUrl,photos/57/thumbnailUrl,photos/58/caption,photos/58/pictureUrl,photos/58/thumbnailUrl,photos/59/caption,photos/59/pictureUrl,photos/59/thumbnailUrl,photos/60/caption,photos/60/pictureUrl,photos/60/thumbnailUrl,photos/61/caption,photos/61/pictureUrl,photos/61/thumbnailUrl,photos/62/caption,photos/62/pictureUrl,photos/62/thumbnailUrl,photos/63/caption,photos/63/pictureUrl,photos/63/thumbnailUrl,photos/64/caption,photos/64/pictureUrl,photos/64/thumbnailUrl,photos/65/caption,photos/65/pictureUrl,photos/65/thumbnailUrl,photos/66/caption,photos/66/pictureUrl,photos/66/thumbnailUrl,photos/67/caption,photos/67/pictureUrl,photos/67/thumbnailUrl,photos/68/caption,photos/68/pictureUrl,photos/68/thumbnailUrl,photos/69/caption,photos/69/pictureUrl,photos/69/thumbnailUrl,photos/70/caption,photos/70/pictureUrl,photos/70/thumbnailUrl,photos/71/caption,photos/71/pictureUrl,photos/71/thumbnailUrl,photos/72/caption,photos/72/pictureUrl,photos/72/thumbnailUrl,photos/73/caption,photos/73/pictureUrl,photos/73/thumbnailUrl,photos/74/caption,photos/74/pictureUrl,photos/74/thumbnailUrl,photos/75/caption,photos/75/pictureUrl,photos/75/thumbnailUrl,photos/76/caption,photos/76/pictureUrl,photos/76/thumbnailUrl,photos/77/caption,photos/77/pictureUrl,photos/77/thumbnailUrl,photos/78/caption,photos/78/pictureUrl,photos/78/thumbnailUrl,photos/79/caption,photos/79/pictureUrl,photos/79/thumbnailUrl,photos/80/caption,photos/80/pictureUrl,photos/80/thumbnailUrl,photos/81/caption,photos/81/pictureUrl,photos/81/thumbnailUrl,photos/82/caption,photos/82/pictureUrl,photos/82/thumbnailUrl,photos/83/caption,photos/83/pictureUrl,photos/83/thumbnailUrl,photos/84/caption,photos/84/pictureUrl,photos/84/thumbnailUrl,photos/85/caption,photos/85/pictureUrl,photos/85/thumbnailUrl,photos/86/caption,photos/86/pictureUrl,photos/86/thumbnailUrl,photos/87/caption,photos/87/pictureUrl,photos/87/thumbnailUrl,photos/88/caption,photos/88/pictureUrl,photos/88/thumbnailUrl,photos/89/caption,photos/89/pictureUrl,photos/89/thumbnailUrl,photos/90/caption,photos/90/pictureUrl,photos/90/thumbnailUrl,photos/91/caption,photos/91/pictureUrl,photos/91/thumbnailUrl,photos/92/caption,photos/92/pictureUrl,photos/92/thumbnailUrl,photos/93/caption,photos/93/pictureUrl,photos/93/thumbnailUrl,photos/94/caption,photos/94/pictureUrl,photos/94/thumbnailUrl,photos/95/caption,photos/95/pictureUrl,photos/95/thumbnailUrl,photos/96/caption,photos/96/pictureUrl,photos/96/thumbnailUrl,photos/97/caption,photos/97/pictureUrl,photos/97/thumbnailUrl,photos/98/caption,photos/98/pictureUrl,photos/98/thumbnailUrl,photos/99/caption,photos/99/pictureUrl,photos/99/thumbnailUrl,photos/100/caption,photos/100/pictureUrl,photos/100/thumbnailUrl,photos/101/caption,photos/101/pictureUrl,photos/101/thumbnailUrl,photos/102/caption,photos/102/pictureUrl,photos/102/thumbnailUrl,photos/103/caption,photos/103/pictureUrl,photos/103/thumbnailUrl,photos/104/caption,photos/104/pictureUrl,photos/104/thumbnailUrl,photos/105/caption,photos/105/pictureUrl,photos/105/thumbnailUrl,photos/106/caption,photos/106/pictureUrl,photos/106/thumbnailUrl,photos/107/caption,photos/107/pictureUrl,photos/107/thumbnailUrl,photos/108/caption,photos/108/pictureUrl,photos/108/thumbnailUrl,photos/109/caption,photos/109/pictureUrl,photos/109/thumbnailUrl,photos/110/caption,photos/110/pictureUrl,photos/110/thumbnailUrl,photos/111/caption,photos/111/pictureUrl,photos/111/thumbnailUrl,photos/112/caption,photos/112/pictureUrl,photos/112/thumbnailUrl,photos/113/caption,photos/113/pictureUrl,photos/113/thumbnailUrl,photos/114/caption,photos/114/pictureUrl,photos/114/thumbnailUrl,photos/115/caption,photos/115/pictureUrl,photos/115/thumbnailUrl,photos/116/caption,photos/116/pictureUrl,photos/116/thumbnailUrl,photos/117/caption,photos/117/pictureUrl,photos/117/thumbnailUrl,photos/118/caption,photos/118/pictureUrl,photos/118/thumbnailUrl,photos/119/caption,photos/119/pictureUrl,photos/119/thumbnailUrl,photos/120/caption,photos/120/pictureUrl,photos/120/thumbnailUrl,photos/121/caption,photos/121/pictureUrl,photos/121/thumbnailUrl,photos/122/caption,photos/122/pictureUrl,photos/122/thumbnailUrl,photos/123/caption,photos/123/pictureUrl,photos/123/thumbnailUrl,photos/124/caption,photos/124/pictureUrl,photos/124/thumbnailUrl,photos/125/caption,photos/125/pictureUrl,photos/125/thumbnailUrl,photos/126/caption,photos/126/pictureUrl,photos/126/thumbnailUrl,photos/127/caption,photos/127/pictureUrl,photos/127/thumbnailUrl,photos/128/caption,photos/128/pictureUrl,photos/128/thumbnailUrl,photos/129/caption,photos/129/pictureUrl,photos/129/thumbnailUrl,photos/130/caption,photos/130/pictureUrl,photos/130/thumbnailUrl,photos/131/caption,photos/131/pictureUrl,photos/131/thumbnailUrl,photos/132/caption,photos/132/pictureUrl,photos/132/thumbnailUrl,photos/133/caption,photos/133/pictureUrl,photos/133/thumbnailUrl,photos/134/caption,photos/134/pictureUrl,photos/134/thumbnailUrl,photos/135/caption,photos/135/pictureUrl,photos/135/thumbnailUrl,photos/136/caption,photos/136/pictureUrl,photos/136/thumbnailUrl,photos/137/caption,photos/137/pictureUrl,photos/137/thumbnailUrl,photos/138/caption,photos/138/pictureUrl,photos/138/thumbnailUrl,photos/139/caption,photos/139/pictureUrl,photos/139/thumbnailUrl,photos/140/caption,photos/140/pictureUrl,photos/140/thumbnailUrl,photos/141/caption,photos/141/pictureUrl,photos/141/thumbnailUrl,photos/142/caption,photos/142/pictureUrl,photos/142/thumbnailUrl,photos/143/caption,photos/143/pictureUrl,photos/143/thumbnailUrl,photos/144/caption,photos/144/pictureUrl,photos/144/thumbnailUrl,photos/145/caption,photos/145/pictureUrl,photos/145/thumbnailUrl,photos/146/caption,photos/146/pictureUrl,photos/146/thumbnailUrl,photos/147/caption,photos/147/pictureUrl,photos/147/thumbnailUrl,photos/148/caption,photos/148/pictureUrl,photos/148/thumbnailUrl,photos/149/caption,photos/149/pictureUrl,photos/149/thumbnailUrl,photos/150/caption,photos/150/pictureUrl,photos/150/thumbnailUrl,photos/151/caption,photos/151/pictureUrl,photos/151/thumbnailUrl,photos/152/caption,photos/152/pictureUrl,photos/152/thumbnailUrl,photos/153/caption,photos/153/pictureUrl,photos/153/thumbnailUrl,photos/154/caption,photos/154/pictureUrl,photos/154/thumbnailUrl,photos/155/caption,photos/155/pictureUrl,photos/155/thumbnailUrl,pricing/rate/amount,pricing/rate/amountFormatted,pricing/rate/currency,pricing/rate/isMicrosAccuracy,pricing/rateType,primaryHost/about,primaryHost/badges/0,primaryHost/badges/1,primaryHost/badges/2,primaryHost/firstName,primaryHost/hasInclusionBadge,primaryHost/hasProfilePic,primaryHost/hostUrl,primaryHost/id,primaryHost/isSuperHost,primaryHost/languages/0,primaryHost/languages/1,primaryHost/languages/2,primaryHost/languages/3,primaryHost/languages/4,primaryHost/languages/5,primaryHost/languages/6,primaryHost/languages/7,primaryHost/languages/8,primaryHost/listingsCount,primaryHost/memberSince,primaryHost/pictureUrl,primaryHost/responseRate,primaryHost/responseTime,primaryHost/smartName,primaryHost/thumbnailUrl,primaryHost/totalListingsCount,reviews/0/author/firstName,reviews/0/author/hasProfilePic,reviews/0/author/id,reviews/0/author/pictureUrl,reviews/0/author/smartName,reviews/0/author/thumbnailUrl,reviews/0/collectionTag,reviews/0/comments,reviews/0/createdAt,reviews/0/id,reviews/0/language,reviews/0/localizedDate,reviews/0/localizedReview,reviews/0/localizedReview/comments,reviews/0/localizedReview/disclaimer,reviews/0/localizedReview/needsTranslation,reviews/0/localizedReview/response,reviews/0/rating,reviews/0/recipient/firstName,reviews/0/recipient/hasProfilePic,reviews/0/recipient/id,reviews/0/recipient/pictureUrl,reviews/0/recipient/smartName,reviews/0/recipient/thumbnailUrl,reviews/0/response,reviews/1/author/firstName,reviews/1/author/hasProfilePic,reviews/1/author/id,reviews/1/author/pictureUrl,reviews/1/author/smartName,reviews/1/author/thumbnailUrl,reviews/1/collectionTag,reviews/1/comments,reviews/1/createdAt,reviews/1/id,reviews/1/localizedDate,reviews/1/localizedReview,reviews/1/rating,reviews/1/recipient/firstName,reviews/1/recipient/hasProfilePic,reviews/1/recipient/id,reviews/1/recipient/pictureUrl,reviews/1/recipient/smartName,reviews/1/recipient/thumbnailUrl,reviews/1/response,reviews/2/author/firstName,reviews/2/author/hasProfilePic,reviews/2/author/id,reviews/2/author/pictureUrl,reviews/2/author/smartName,reviews/2/author/thumbnailUrl,reviews/2/collectionTag,reviews/2/comments,reviews/2/createdAt,reviews/2/id,reviews/2/language,reviews/2/localizedDate,reviews/2/localizedReview,reviews/2/localizedReview/comments,reviews/2/localizedReview/disclaimer,reviews/2/localizedReview/needsTranslation,reviews/2/localizedReview/response,reviews/2/rating,reviews/2/recipient/firstName,reviews/2/recipient/hasProfilePic,reviews/2/recipient/id,reviews/2/recipient/pictureUrl,reviews/2/recipient/smartName,reviews/2/recipient/thumbnailUrl,reviews/2/response,reviews/3/author/firstName,reviews/3/author/hasProfilePic,reviews/3/author/id,reviews/3/author/pictureUrl,reviews/3/author/smartName,reviews/3/author/thumbnailUrl,reviews/3/collectionTag,reviews/3/comments,reviews/3/createdAt,reviews/3/id,reviews/3/language,reviews/3/localizedDate,reviews/3/localizedReview,reviews/3/localizedReview/comments,reviews/3/localizedReview/disclaimer,reviews/3/localizedReview/needsTranslation,reviews/3/localizedReview/response,reviews/3/rating,reviews/3/recipient/firstName,reviews/3/recipient/hasProfilePic,reviews/3/recipient/id,reviews/3/recipient/pictureUrl,reviews/3/recipient/smartName,reviews/3/recipient/thumbnailUrl,reviews/3/response,reviews/4/author/firstName,reviews/4/author/hasProfilePic,reviews/4/author/id,reviews/4/author/pictureUrl,reviews/4/author/smartName,reviews/4/author/thumbnailUrl,reviews/4/collectionTag,reviews/4/comments,reviews/4/createdAt,reviews/4/id,reviews/4/language,reviews/4/localizedDate,reviews/4/localizedReview,reviews/4/localizedReview/comments,reviews/4/localizedReview/disclaimer,reviews/4/localizedReview/needsTranslation,reviews/4/localizedReview/response,reviews/4/rating,reviews/4/recipient/firstName,reviews/4/recipient/hasProfilePic,reviews/4/recipient/id,reviews/4/recipient/pictureUrl,reviews/4/recipient/smartName,reviews/4/recipient/thumbnailUrl,reviews/4/response,reviews/5/author/firstName,reviews/5/author/hasProfilePic,reviews/5/author/id,reviews/5/author/pictureUrl,reviews/5/author/smartName,reviews/5/author/thumbnailUrl,reviews/5/collectionTag,reviews/5/comments,reviews/5/createdAt,reviews/5/id,reviews/5/language,reviews/5/localizedDate,reviews/5/localizedReview,reviews/5/localizedReview/comments,reviews/5/localizedReview/disclaimer,reviews/5/localizedReview/needsTranslation,reviews/5/localizedReview/response,reviews/5/rating,reviews/5/recipient/firstName,reviews/5/recipient/hasProfilePic,reviews/5/recipient/id,reviews/5/recipient/pictureUrl,reviews/5/recipient/smartName,reviews/5/recipient/thumbnailUrl,reviews/5/response,reviews/6/author/firstName,reviews/6/author/hasProfilePic,reviews/6/author/id,reviews/6/author/pictureUrl,reviews/6/author/smartName,reviews/6/author/thumbnailUrl,reviews/6/collectionTag,reviews/6/comments,reviews/6/createdAt,reviews/6/id,reviews/6/localizedDate,reviews/6/localizedReview,reviews/6/rating,reviews/6/recipient/firstName,reviews/6/recipient/hasProfilePic,reviews/6/recipient/id,reviews/6/recipient/pictureUrl,reviews/6/recipient/smartName,reviews/6/recipient/thumbnailUrl,reviews/6/response,reviews/7/author/firstName,reviews/7/author/hasProfilePic,reviews/7/author/id,reviews/7/author/pictureUrl,reviews/7/author/smartName,reviews/7/author/thumbnailUrl,reviews/7/collectionTag,reviews/7/comments,reviews/7/createdAt,reviews/7/id,reviews/7/language,reviews/7/localizedDate,reviews/7/localizedReview,reviews/7/localizedReview/comments,reviews/7/localizedReview/disclaimer,reviews/7/localizedReview/needsTranslation,reviews/7/localizedReview/response,reviews/7/rating,reviews/7/recipient/firstName,reviews/7/recipient/hasProfilePic,reviews/7/recipient/id,reviews/7/recipient/pictureUrl,reviews/7/recipient/smartName,reviews/7/recipient/thumbnailUrl,reviews/7/response,reviews/8/author/firstName,reviews/8/author/hasProfilePic,reviews/8/author/id,reviews/8/author/pictureUrl,reviews/8/author/smartName,reviews/8/author/thumbnailUrl,reviews/8/collectionTag,reviews/8/comments,reviews/8/createdAt,reviews/8/id,reviews/8/language,reviews/8/localizedDate,reviews/8/localizedReview,reviews/8/localizedReview/comments,reviews/8/localizedReview/disclaimer,reviews/8/localizedReview/needsTranslation,reviews/8/localizedReview/response,reviews/8/rating,reviews/8/recipient/firstName,reviews/8/recipient/hasProfilePic,reviews/8/recipient/id,reviews/8/recipient/pictureUrl,reviews/8/recipient/smartName,reviews/8/recipient/thumbnailUrl,reviews/8/response,reviews/9/author/firstName,reviews/9/author/hasProfilePic,reviews/9/author/id,reviews/9/author/pictureUrl,reviews/9/author/smartName,reviews/9/author/thumbnailUrl,reviews/9/collectionTag,reviews/9/comments,reviews/9/createdAt,reviews/9/id,reviews/9/language,reviews/9/localizedDate,reviews/9/localizedReview,reviews/9/localizedReview/comments,reviews/9/localizedReview/disclaimer,reviews/9/localizedReview/needsTranslation,reviews/9/localizedReview/response,reviews/9/rating,reviews/9/recipient/firstName,reviews/9/recipient/hasProfilePic,reviews/9/recipient/id,reviews/9/recipient/pictureUrl,reviews/9/recipient/smartName,reviews/9/recipient/thumbnailUrl,reviews/9/response,roomType,stars,url
0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"Jersey City, New Jersey, United States",True,False,40.7233,-74.03946,Entire Cozy Unit(15 mins to Manhattan),2,,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/0812b9a1-2d2e-4725-b51f-881045a939b2.jpeg?aki_policy=large,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/0812b9a1-2d2e-4725-b51f-881045a939b2.jpeg?aki_policy=small,,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/b280cd02-c018-4813-ad7d-4faa6987480d.jpeg?aki_policy=large,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/b280cd02-c018-4813-ad7d-4faa6987480d.jpeg?aki_policy=small,,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/2e854153-dae6-44ac-9bbc-c0737cce4221.jpeg?aki_policy=large,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/2e854153-dae6-44ac-9bbc-c0737cce4221.jpeg?aki_policy=small,,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/85a67377-6fd5-4eed-a2ad-adedac32cbb6.jpeg?aki_policy=large,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/85a67377-6fd5-4eed-a2ad-adedac32cbb6.jpeg?aki_policy=small,,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/8fb449f4-165b-4cf1-a04c-c3684969af47.jpeg?aki_policy=large,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/8fb449f4-165b-4cf1-a04c-c3684969af47.jpeg?aki_policy=small,,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/12dc2b37-954f-4560-9b43-193ae6eda375.jpeg?aki_policy=large,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/12dc2b37-954f-4560-9b43-193ae6eda375.jpeg?aki_policy=small,,https://a0.muscache.com/im/pictures/e961ee7e-38c8-4a70-a0c0-d9499a2bbffb.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/e961ee7e-38c8-4a70-a0c0-d9499a2bbffb.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/2503fa36-2bb1-4eca-9754-6600ad060650.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/2503fa36-2bb1-4eca-9754-6600ad060650.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/d624d377-cade-4abf-8a30-d7827d223207.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/d624d377-cade-4abf-8a30-d7827d223207.jpg?aki_policy=small,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,99999,"$99,999",USD,False,nightly,Hello all and welcome!\nI love seeing the world and to be able to help people feel right at home.,82 Reviews,Identity verified,,Joes,False,True,https://www.airbnb.com/users/show/245395267,245395267,False,中文 (简体),English,,,,,,,,8,Joined in February 2019,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_large,,,Joes,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_small,14,Miranda,False,74889676,https://a0.muscache.com/im/pictures/user/d4b6e27a-5c71-40eb-9692-0237677c4d23.jpg?aki_policy=profile_x_medium,Miranda,https://a0.muscache.com/im/pictures/user/d4b6e27a-5c71-40eb-9692-0237677c4d23.jpg?aki_policy=profile_small,,"great location and nice building, loved that parking was included",2022-06-20T19:58:54Z,653626364155945838,,June 2022,,,,,,4,Joes,False,245395267,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_x_medium,Joes,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_small,,Abhimanyu,False,4800160.0,https://a0.muscache.com/im/pictures/user/8d2009ea-5512-41cc-8ef4-5511a14d4a14.jpg?aki_policy=profile_x_medium,Abhimanyu,https://a0.muscache.com/im/pictures/user/8d2009ea-5512-41cc-8ef4-5511a14d4a14.jpg?aki_policy=profile_small,,"Everything during our stay was perfect and as mentioned in the listing. The location is a 7-min walk to Newport PATH, and right opposite a huge grocery store. Joes was super kind with an early check-in and a late check-out.",2022-05-23T18:57:26Z,6.333017075212511e+17,May 2022,,5.0,Joes,False,245395267.0,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_x_medium,Joes,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_small,,Robert,False,162871850.0,https://a0.muscache.com/im/pictures/user/6874cac7-a3a9-43ab-9ce5-51922332a664.jpg?aki_policy=profile_x_medium,Robert,https://a0.muscache.com/im/pictures/user/6874cac7-a3a9-43ab-9ce5-51922332a664.jpg?aki_policy=profile_small,,Excellent place to stay. Very central fo travel into NYC or if working / travelling in New Jersey. Stayed for 3 months in the apartment. The place was as advertised. Joes is an excellent host also.,2022-05-06T17:28:10Z,6.209355874005248e+17,,May 2022,,,,,,5.0,Joes,False,245395267.0,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_x_medium,Joes,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_small,,Yu,False,303996502.0,https://a0.muscache.com/im/pictures/user/b08b0244-5bfd-400f-b875-fa0f24731ff9.jpg?aki_policy=profile_x_medium,Yu,https://a0.muscache.com/im/pictures/user/b08b0244-5bfd-400f-b875-fa0f24731ff9.jpg?aki_policy=profile_small,,It is the best airbnb that I have ever stayed! It gets the great view outside the window. Comfortable and convenient living experience. Everything just settled up for you. It is so great!,2022-01-09T21:55:09Z,5.362712073993113e+17,,January 2022,,,,,,5.0,Joes,False,245395267.0,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_x_medium,Joes,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_small,,Mitesh,False,253080011.0,https://a0.muscache.com/im/pictures/user/02b87617-c19a-4b4d-bdc3-e089a7003ebb.jpg?aki_policy=profile_x_medium,Mitesh,https://a0.muscache.com/im/pictures/user/02b87617-c19a-4b4d-bdc3-e089a7003ebb.jpg?aki_policy=profile_small,,"A very nice apartment is great locality. The apartment is as listed. i would definitely rate it a 5 star airbnb if not for the pet hair on the carpet and couch. As I had a 9 month old, we were a bit concerned about him having allergies and looking for a clean place. As for Joes, he is very approachable and very easy to communicate.",2021-12-26T20:27:02Z,5.26079993255321e+17,,December 2021,,,,,,4.0,Joes,False,245395267.0,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_x_medium,Joes,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_small,,Arusha,False,103758126.0,https://a0.muscache.com/im/pictures/user/955bf887-2b6f-4018-8c77-20849453fbb8.jpg?aki_policy=profile_x_medium,Arusha,https://a0.muscache.com/im/pictures/user/955bf887-2b6f-4018-8c77-20849453fbb8.jpg?aki_policy=profile_small,,"Being the first customer, I took a chance in trusting this listing but this apartment exceeded my expectations. It was super clean and also very aesthetically pleasing with a minimalistic vibe. This is definitely a luxury apartment building and in a convenient location - right by Shoprite and near the waterfront and path! The host was also very responsive about any questions. Thanks for a great stay!",2021-12-22T21:29:10Z,5.232121651541744e+17,,December 2021,,,,,,5.0,Joes,False,245395267.0,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_x_medium,Joes,https://a0.muscache.com/im/pictures/user/96e734ce-0b07-4b03-b00f-ef4a72ba2a3e.jpg?aki_policy=profile_small,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Entire rental unit,4.67,https://www.airbnb.com/rooms/53775685
1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"New York, United States",True,False,40.70641,-74.0092,Lux Studio on Wall Street. Heart of Fidi!,2,,https://a0.muscache.com/im/pictures/e2388507-1f5f-4000-aa1b-d3b2279e682a.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/e2388507-1f5f-4000-aa1b-d3b2279e682a.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/b697a55b-8f7b-49bf-9136-081edebbdf78.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/b697a55b-8f7b-49bf-9136-081edebbdf78.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/a7ddf48b-dff7-4303-9ee2-bb43da85913f.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/a7ddf48b-dff7-4303-9ee2-bb43da85913f.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/8f422327-f899-4129-be13-d977f8f466f2.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/8f422327-f899-4129-be13-d977f8f466f2.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/b5fd4efc-03ce-4a0e-8c19-0002cacf5b6b.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/b5fd4efc-03ce-4a0e-8c19-0002cacf5b6b.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/05a44019-5083-463e-9aaf-6abd65c1fdb5.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/05a44019-5083-463e-9aaf-6abd65c1fdb5.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/b491d20d-32fe-4b7e-8df1-e4da4c214a22.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/b491d20d-32fe-4b7e-8df1-e4da4c214a22.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/ff6cac25-6549-4a23-b4c9-673b07cf9df3.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/ff6cac25-6549-4a23-b4c9-673b07cf9df3.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/da721860-9808-444b-9410-d22ac7df7860.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/da721860-9808-444b-9410-d22ac7df7860.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/9b8eca08-5970-494d-8d2e-ff3074f1243a.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/9b8eca08-5970-494d-8d2e-ff3074f1243a.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/1c9edf4c-7884-4478-9295-602be4529f06.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/1c9edf4c-7884-4478-9295-602be4529f06.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/5fd4aaef-dd94-4561-87ed-0dbf7a83e565.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/5fd4aaef-dd94-4561-87ed-0dbf7a83e565.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/2720b3bc-2ada-43ed-88fa-67cdf65a4eab.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/2720b3bc-2ada-43ed-88fa-67cdf65a4eab.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/855ca8ea-9c79-4cd4-a096-020af1ce26b1.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/855ca8ea-9c79-4cd4-a096-020af1ce26b1.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/92a10d56-ceef-46d6-a82b-30f0228d1846.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/92a10d56-ceef-46d6-a82b-30f0228d1846.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/ffba0e42-578e-436b-ab11-03139eaa41db.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/ffba0e42-578e-436b-ab11-03139eaa41db.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/68c59392-76cb-4bea-b92b-dba1ba9fde2a.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/68c59392-76cb-4bea-b92b-dba1ba9fde2a.jpg?aki_policy=small,,https://a0.muscache.com/im/pictures/cc5eff3d-eddd-4851-9fb2-ab40a9215775.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/cc5eff3d-eddd-4851-9fb2-ab40a9215775.jpg?aki_policy=small,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10000,"$10,000",USD,False,nightly,Professional working and living in NYC,1 Review,Identity verified,,Chris,False,True,https://www.airbnb.com/users/show/57586379,57586379,False,,,,,,,,,,1,Joined in February 2016,https://a0.muscache.com/im/pictures/user/07e98a7a-5003-4cba-bf84-0a70db8c6209.jpg?aki_policy=profile_large,,,Chris,https://a0.muscache.com/im/pictures/user/07e98a7a-5003-4cba-bf84-0a70db8c6209.jpg?aki_policy=profile_small,2,Natalia,False,222837411,https://a0.muscache.com/im/pictures/user/286edd1a-d927-4f17-b4f7-62b72abcacf3.jpg?aki_policy=profile_x_medium,Natalia,https://a0.muscache.com/im/pictures/user/286edd1a-d927-4f17-b4f7-62b72abcacf3.jpg?aki_policy=profile_small,,"This place is exactly as described, perfect for my 1 month stay! Very clean and organized. The location was also quite convenient, close to the supermarket, pharmacy, restaurants, etc. Chris was a wonderful host, extremely attentive and easy to communicate with. Would definitely book again!",2021-12-13T19:49:59Z,516639267027005367,,December 2021,,,,,,5,Chris,False,57586379,https://a0.muscache.com/im/pictures/user/07e98a7a-5003-4cba-bf84-0a70db8c6209.jpg?aki_policy=profile_x_medium,Chris,https://a0.muscache.com/im/pictures/user/07e98a7a-5003-4cba-bf84-0a70db8c6209.jpg?aki_policy=profile_small,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Entire rental unit,,https://www.airbnb.com/rooms/52862058


In [None]:
df_raw.columns[df_raw.columns.str.contains('id')]

Index(['additionalHosts/0/id', 'additionalHosts/1/id', 'additionalHosts/2/id',
       'primaryHost/id', 'reviews/0/author/id', 'reviews/0/id',
       'reviews/0/recipient/id', 'reviews/1/author/id', 'reviews/1/id',
       'reviews/1/recipient/id', 'reviews/2/author/id', 'reviews/2/id',
       'reviews/2/recipient/id', 'reviews/3/author/id', 'reviews/3/id',
       'reviews/3/recipient/id', 'reviews/4/author/id', 'reviews/4/id',
       'reviews/4/recipient/id', 'reviews/5/author/id', 'reviews/5/id',
       'reviews/5/recipient/id', 'reviews/6/author/id', 'reviews/6/id',
       'reviews/6/recipient/id', 'reviews/7/author/id', 'reviews/7/id',
       'reviews/7/recipient/id', 'reviews/8/author/id', 'reviews/8/id',
       'reviews/8/recipient/id', 'reviews/9/author/id', 'reviews/9/id',
       'reviews/9/recipient/id'],
      dtype='object')

### Step 01 - discarding information that doesn't matter

Removing columns relating to:
- `additionalHosts`
- property `photos` (we'll keep just first) 
- URL's
- `primaryHost/languages` (we'll keep just first) 

In [11]:
mask_additionalHosts = "df.columns.str.startswith('additionalHosts')"
mask_photos = "df.columns.str.contains(r'^photos/[^0]')"
mask_url = "df.columns.str.contains(r'^(?!photos).*Url$')"
mask_phlang = "df.columns.str.contains(r'^primaryHost/languages/[^0]')"
masks_commands = [mask_additionalHosts, mask_photos, mask_url, mask_phlang]

df_s01 = df_raw.pipe(remove_cols, masks_commands)

print(f'No Columns before step 01: {df_raw.shape[1]}')
print(f'No Columns after step 01: {df_s01.shape[1]}', end='\n\n')
df_s01.sample(1)

No Columns before step 01: 796
No Columns after step 01: 234



Unnamed: 0,address,isAvailable,isHostedBySuperhost,location/lat,location/lng,name,numberOfGuests,photos/0/caption,photos/0/pictureUrl,photos/0/thumbnailUrl,pricing/rate/amount,pricing/rate/amountFormatted,pricing/rate/currency,pricing/rate/isMicrosAccuracy,pricing/rateType,primaryHost/about,primaryHost/badges/0,primaryHost/badges/1,primaryHost/badges/2,primaryHost/firstName,primaryHost/hasInclusionBadge,primaryHost/hasProfilePic,primaryHost/id,primaryHost/isSuperHost,primaryHost/languages/0,primaryHost/listingsCount,primaryHost/memberSince,primaryHost/responseRate,primaryHost/responseTime,primaryHost/smartName,primaryHost/totalListingsCount,reviews/0/author/firstName,reviews/0/author/hasProfilePic,reviews/0/author/id,reviews/0/author/smartName,reviews/0/collectionTag,reviews/0/comments,reviews/0/createdAt,reviews/0/id,reviews/0/language,reviews/0/localizedDate,reviews/0/localizedReview,reviews/0/localizedReview/comments,reviews/0/localizedReview/disclaimer,reviews/0/localizedReview/needsTranslation,reviews/0/localizedReview/response,reviews/0/rating,reviews/0/recipient/firstName,reviews/0/recipient/hasProfilePic,reviews/0/recipient/id,reviews/0/recipient/smartName,reviews/0/response,reviews/1/author/firstName,reviews/1/author/hasProfilePic,reviews/1/author/id,reviews/1/author/smartName,reviews/1/collectionTag,reviews/1/comments,reviews/1/createdAt,reviews/1/id,reviews/1/localizedDate,reviews/1/localizedReview,reviews/1/rating,reviews/1/recipient/firstName,reviews/1/recipient/hasProfilePic,reviews/1/recipient/id,reviews/1/recipient/smartName,reviews/1/response,reviews/2/author/firstName,reviews/2/author/hasProfilePic,reviews/2/author/id,reviews/2/author/smartName,reviews/2/collectionTag,reviews/2/comments,reviews/2/createdAt,reviews/2/id,reviews/2/language,reviews/2/localizedDate,reviews/2/localizedReview,reviews/2/localizedReview/comments,reviews/2/localizedReview/disclaimer,reviews/2/localizedReview/needsTranslation,reviews/2/localizedReview/response,reviews/2/rating,reviews/2/recipient/firstName,reviews/2/recipient/hasProfilePic,reviews/2/recipient/id,reviews/2/recipient/smartName,reviews/2/response,reviews/3/author/firstName,reviews/3/author/hasProfilePic,reviews/3/author/id,reviews/3/author/smartName,reviews/3/collectionTag,reviews/3/comments,reviews/3/createdAt,reviews/3/id,reviews/3/language,reviews/3/localizedDate,reviews/3/localizedReview,reviews/3/localizedReview/comments,reviews/3/localizedReview/disclaimer,reviews/3/localizedReview/needsTranslation,reviews/3/localizedReview/response,reviews/3/rating,reviews/3/recipient/firstName,reviews/3/recipient/hasProfilePic,reviews/3/recipient/id,reviews/3/recipient/smartName,reviews/3/response,reviews/4/author/firstName,reviews/4/author/hasProfilePic,reviews/4/author/id,reviews/4/author/smartName,reviews/4/collectionTag,reviews/4/comments,reviews/4/createdAt,reviews/4/id,reviews/4/language,reviews/4/localizedDate,reviews/4/localizedReview,reviews/4/localizedReview/comments,reviews/4/localizedReview/disclaimer,reviews/4/localizedReview/needsTranslation,reviews/4/localizedReview/response,reviews/4/rating,reviews/4/recipient/firstName,reviews/4/recipient/hasProfilePic,reviews/4/recipient/id,reviews/4/recipient/smartName,reviews/4/response,reviews/5/author/firstName,reviews/5/author/hasProfilePic,reviews/5/author/id,reviews/5/author/smartName,reviews/5/collectionTag,reviews/5/comments,reviews/5/createdAt,reviews/5/id,reviews/5/language,reviews/5/localizedDate,reviews/5/localizedReview,reviews/5/localizedReview/comments,reviews/5/localizedReview/disclaimer,reviews/5/localizedReview/needsTranslation,reviews/5/localizedReview/response,reviews/5/rating,reviews/5/recipient/firstName,reviews/5/recipient/hasProfilePic,reviews/5/recipient/id,reviews/5/recipient/smartName,reviews/5/response,reviews/6/author/firstName,reviews/6/author/hasProfilePic,reviews/6/author/id,reviews/6/author/smartName,reviews/6/collectionTag,reviews/6/comments,reviews/6/createdAt,reviews/6/id,reviews/6/localizedDate,reviews/6/localizedReview,reviews/6/rating,reviews/6/recipient/firstName,reviews/6/recipient/hasProfilePic,reviews/6/recipient/id,reviews/6/recipient/smartName,reviews/6/response,reviews/7/author/firstName,reviews/7/author/hasProfilePic,reviews/7/author/id,reviews/7/author/smartName,reviews/7/collectionTag,reviews/7/comments,reviews/7/createdAt,reviews/7/id,reviews/7/language,reviews/7/localizedDate,reviews/7/localizedReview,reviews/7/localizedReview/comments,reviews/7/localizedReview/disclaimer,reviews/7/localizedReview/needsTranslation,reviews/7/localizedReview/response,reviews/7/rating,reviews/7/recipient/firstName,reviews/7/recipient/hasProfilePic,reviews/7/recipient/id,reviews/7/recipient/smartName,reviews/7/response,reviews/8/author/firstName,reviews/8/author/hasProfilePic,reviews/8/author/id,reviews/8/author/smartName,reviews/8/collectionTag,reviews/8/comments,reviews/8/createdAt,reviews/8/id,reviews/8/language,reviews/8/localizedDate,reviews/8/localizedReview,reviews/8/localizedReview/comments,reviews/8/localizedReview/disclaimer,reviews/8/localizedReview/needsTranslation,reviews/8/localizedReview/response,reviews/8/rating,reviews/8/recipient/firstName,reviews/8/recipient/hasProfilePic,reviews/8/recipient/id,reviews/8/recipient/smartName,reviews/8/response,reviews/9/author/firstName,reviews/9/author/hasProfilePic,reviews/9/author/id,reviews/9/author/smartName,reviews/9/collectionTag,reviews/9/comments,reviews/9/createdAt,reviews/9/id,reviews/9/language,reviews/9/localizedDate,reviews/9/localizedReview,reviews/9/localizedReview/comments,reviews/9/localizedReview/disclaimer,reviews/9/localizedReview/needsTranslation,reviews/9/localizedReview/response,reviews/9/rating,reviews/9/recipient/firstName,reviews/9/recipient/hasProfilePic,reviews/9/recipient/id,reviews/9/recipient/smartName,reviews/9/response,roomType,stars,url
45,"New York, United States",True,False,40.75455,-73.98123,Modern Luxury 2 Bed/ 2 Bath apartment in Midtown!,6,,https://a0.muscache.com/im/pictures/69e26eea-dfd5-4e12-b925-44c43693f292.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/69e26eea-dfd5-4e12-b925-44c43693f292.jpg?aki_policy=small,2000,"$2,000",USD,False,nightly,"Henry prides himself on offering superb accommodations. Over the years he has cultivated a keen talent for creating elegant, luxurious environments and assembled an impeccable collection of luxury rental properties.\n\nFound in prestigious locations around the world, each home is an expression of Henry's meticulous approach to hosting and entertaining. Elegant, lavish, sophisticated and comfortable; these properties were originally intended for Henry's closest friends and relatives to utilize and he now extends that privilege to his private clients.\nHenry was chief legal officer for NYC Law Department before his departure to focus on a more creative role in Architecture and Design.",673 Reviews,4 References,Identity verified,Henry,False,True,836168,False,English,14,Joined in July 2011,100%,within an hour,Henry,17,Jess,False,22098219,Jess,,Amazing location and modern luxury make for a wonderful stay. Totally can't wait to return. ;),2017-03-11T15:12:17Z,136579287,,March 2017,,,,,,5,Henry,False,836168,Henry,,Steve,False,2186402,Steve,,"Great place, excellent host!",2017-01-26T15:10:49Z,128656460,January 2017,,5,Henry,False,836168,Henry,,Alexander,False,32006245,Alexander,,Very modern and well appointed host provided great support. Building staff was courteous and helpful. I would recommend to anyone staying in NYC.,2016-10-27T14:44:38Z,110635962,,October 2016,,,,,,5,Henry,False,836168,Henry,,Alessandro,False,44735648,Alessandro,,Everything was great \r<br/>\r<br/>Thank you,2016-10-06T14:52:19Z,106504100,,October 2016,,,,,,5,Henry,False,836168,Henry,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Entire condo,5,https://www.airbnb.com/rooms/15347268


### Step 02 - Evaluate Missing Values

In [12]:
missing_values_table(df_s01)

Your selected dataframe has 234 columns.
There are 215 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
reviews/4/localizedReview/response,547,100.0
reviews/7/localizedReview/response,547,100.0
reviews/1/collectionTag,547,100.0
reviews/8/localizedReview/response,547,100.0
reviews/1/localizedReview,547,100.0
reviews/8/localizedReview,547,100.0
reviews/8/collectionTag,547,100.0
reviews/2/collectionTag,547,100.0
reviews/2/localizedReview,547,100.0
reviews/2/localizedReview/response,547,100.0


### Step 03 - discarding information based on missing values

keeping only the following columns about reviews:
- Review: `comments`, `createdAt`, `id`, `rating`
- Review Author: `firstName`, `id`

In [13]:
mask_reviews_data = "df.columns.str.contains(r'^reviews/[0-9]+/(?!author/)(?!comments|createdAt|id|rating)')"
mask_author_data = "df.columns.str.contains(r'^reviews/[0-9]+/author/(?!firstName|id)')"
masks_commands = [mask_reviews_data, mask_author_data]

df_s03 = df_s01.pipe(remove_cols, masks_commands)

print(f'No Columns before step 02: {df_s01.shape[1]}')
print(f'No Columns after step 02: {df_s03.shape[1]}', end='\n\n')
df_s03.sample(1)

No Columns before step 02: 234
No Columns after step 02: 94



Unnamed: 0,address,isAvailable,isHostedBySuperhost,location/lat,location/lng,name,numberOfGuests,photos/0/caption,photos/0/pictureUrl,photos/0/thumbnailUrl,pricing/rate/amount,pricing/rate/amountFormatted,pricing/rate/currency,pricing/rate/isMicrosAccuracy,pricing/rateType,primaryHost/about,primaryHost/badges/0,primaryHost/badges/1,primaryHost/badges/2,primaryHost/firstName,primaryHost/hasInclusionBadge,primaryHost/hasProfilePic,primaryHost/id,primaryHost/isSuperHost,primaryHost/languages/0,primaryHost/listingsCount,primaryHost/memberSince,primaryHost/responseRate,primaryHost/responseTime,primaryHost/smartName,primaryHost/totalListingsCount,reviews/0/author/firstName,reviews/0/author/id,reviews/0/comments,reviews/0/createdAt,reviews/0/id,reviews/0/rating,reviews/1/author/firstName,reviews/1/author/id,reviews/1/comments,reviews/1/createdAt,reviews/1/id,reviews/1/rating,reviews/2/author/firstName,reviews/2/author/id,reviews/2/comments,reviews/2/createdAt,reviews/2/id,reviews/2/rating,reviews/3/author/firstName,reviews/3/author/id,reviews/3/comments,reviews/3/createdAt,reviews/3/id,reviews/3/rating,reviews/4/author/firstName,reviews/4/author/id,reviews/4/comments,reviews/4/createdAt,reviews/4/id,reviews/4/rating,reviews/5/author/firstName,reviews/5/author/id,reviews/5/comments,reviews/5/createdAt,reviews/5/id,reviews/5/rating,reviews/6/author/firstName,reviews/6/author/id,reviews/6/comments,reviews/6/createdAt,reviews/6/id,reviews/6/rating,reviews/7/author/firstName,reviews/7/author/id,reviews/7/comments,reviews/7/createdAt,reviews/7/id,reviews/7/rating,reviews/8/author/firstName,reviews/8/author/id,reviews/8/comments,reviews/8/createdAt,reviews/8/id,reviews/8/rating,reviews/9/author/firstName,reviews/9/author/id,reviews/9/comments,reviews/9/createdAt,reviews/9/id,reviews/9/rating,roomType,stars,url
247,"Brooklyn, New York, United States",True,True,40.68649,-73.94174,Solar Powered Brownstone w/ treetop terrace,7,"This is the living and dining space which is on the main floor. It's open plan and the kitchen is just to the left. On this floor there are two bedrooms and a bathroom and up the stairs there are three more bedrooms, a second bathroom and a terrace.",https://a0.muscache.com/im/pictures/54c1ccea-b50f-48f4-8d0d-a8a65fb1b17c.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/54c1ccea-b50f-48f4-8d0d-a8a65fb1b17c.jpg?aki_policy=small,859,$859,USD,False,nightly,husband/ wife \r\narchitect/interior designer & artist\r\nwelcoming hosts/ eager travelers,301 Reviews,Identity verified,,James And Siobhan,False,True,521964,True,,2,Joined in April 2011,100%,within an hour,James And Siobhan,2,Phil,245284228,"James has a beautiful apartment in which my family of five had plenty of space. It is very stylish, clean and in a very interesting neighbourhood. James and siobhan were very welcoming and we always felt at home",2023-04-12T19:11:02Z,868135885016915648,5,Masha,18169883,"I had the best experience staying here. The space is just as pictured (gorgeous!), and actually better in person. James & Siobhan are my favorite hosts - they left so many lovely snacks out for our group and we’re always so quick to respond whenever we needed anything. This is by far my favorite Airbnb and it’s in such a lovely area. Highly recommend.",2023-04-09T18:57:18Z,865954648136194431,5,Alisa,2044811,"It was our first time staying in Brooklyn, and James and Siobhan's brownstone apartment made a very suitable base. The townhouse is in a quiet residential area but is still within easy walking distance of a plethora of small restaurants, coffee shops, bodegas and public transportation. The place was clean, well laid out with enjoyable design elements and thoughtfully stocked with treats and basic supplies. The only caveat I would add is that it is not suitable for mobility challenged people. There are in total three decently steep and longish staircases to navigate both to get into the building and once you are in the apartment. <br/>Replies to my inquiries were received usually within 24 hours (usually less than that), before our arrival, and once we had arrived, replies to quick questions ie. about wifi , took a couple of minutes or less, so it was excellent on that front.<br/>If you have a larger group and want a Brooklyn address, this apartment is a very good choice.",2023-04-04T19:28:29Z,862346462859203532,5,Daryl,430357742,Amazing stay in great Brooklyn neighborhood. Responsive and attentive hosts who live on site. Will definitely be booking again.,2023-03-27T17:55:20Z,856501377863151159,5,Ashley,380226,Lovely place! Our favorite part was the balcony- we had breakfast and coffee up there every morning. Easy walk to the subway stations. Amazing restaurant just around the corner. Would stay again if it’s available!,2023-03-21T17:14:41Z,852132259171539715,5,Chelsea,12242662,We had a comfortable stay at James and Siobhan’s place! It is a beautiful space full of fun and thoughtful touches that is just as pictured. The welcome basket of treats and food are extremely kind and also delicious. A great space to host a group of friends. We had a great time!,2023-03-13T20:27:11Z,846430944804741104,4,Lisa,10499153,"James was such a wonderful host, greeted us when we were there, told us great places to go to in the neighborhood, and made us feel right at home! The house is extremely spacious and quiet with a lot of natural light. The beds are all very comfortable, great linens on the beds as well. This was a great launching point to explore both Brooklyn and Manhattan. Would definitely come here again.",2023-02-26T20:51:36Z,835571595653453291,5,Mitchell,132525888,"I recently stayed at this Airbnb in Brooklyn, New York and had a wonderful experience! The apartment was clean, cozy, and filled with amenities that made the stay even better. The location of the apartment was in a safe and convenient area, with easy access to local parks, restaurants, shops, and more. The host was very friendly, hospitable, and provided us with plenty of helpful tips to make our trip perfect. Overall, I highly recommend this Airbnb in Brooklyn, New York!",2023-02-19T21:51:24Z,830528263978664486,5,Tommy,4532568,"We had 5 amazing days in Brooklyn. <br/><br/>Neighbourhood was very local and with a good vibe. Plenty of good coffee places just around the corner. Chicky is the best!! Also for bread, cheese etc.<br/><br/>Brooklyn and Williamsburg are within walking distance. Easy access to Manhattan by subway - 10 minutes walk to nearest Subway station<br/><br/>The apartment was as good as expected. Nice decorated rooms and very welcoming. This is a place where you can rest and enjoy. <br/><br/>The host was 2nd to none. Friendly and with good taste. <br/><br/>Absolutely a place to recommend",2023-02-16T19:49:42Z,828292686285144763,5,Brittany,79836099,James and Siobhan were great hosts. Very friendly and quick response time. They offered additional kitchen appliances to us from their home. Beautiful terrace and home. Great neighborhood and an easy walk to the subway. Will definitely be back.,2022-12-18T20:55:16Z,784839139162752646,5,Private room in townhouse,4.97,https://www.airbnb.com/rooms/32668712


In [14]:
file_path = 'data/processed/'
file_name = 'df_nyc_s03.csv'

df_s03.to_csv(file_path+file_name, index=False)

In [15]:
df_s03 = pd.read_csv('data/processed/df_nyc_s03.csv')
df_s03.head(3)

Unnamed: 0,address,isAvailable,isHostedBySuperhost,location/lat,location/lng,name,numberOfGuests,photos/0/caption,photos/0/pictureUrl,photos/0/thumbnailUrl,pricing/rate/amount,pricing/rate/amountFormatted,pricing/rate/currency,pricing/rate/isMicrosAccuracy,pricing/rateType,primaryHost/about,primaryHost/badges/0,primaryHost/badges/1,primaryHost/badges/2,primaryHost/firstName,primaryHost/hasInclusionBadge,primaryHost/hasProfilePic,primaryHost/id,primaryHost/isSuperHost,primaryHost/languages/0,primaryHost/listingsCount,primaryHost/memberSince,primaryHost/responseRate,primaryHost/responseTime,primaryHost/smartName,primaryHost/totalListingsCount,reviews/0/author/firstName,reviews/0/author/id,reviews/0/comments,reviews/0/createdAt,reviews/0/id,reviews/0/rating,reviews/1/author/firstName,reviews/1/author/id,reviews/1/comments,reviews/1/createdAt,reviews/1/id,reviews/1/rating,reviews/2/author/firstName,reviews/2/author/id,reviews/2/comments,reviews/2/createdAt,reviews/2/id,reviews/2/rating,reviews/3/author/firstName,reviews/3/author/id,reviews/3/comments,reviews/3/createdAt,reviews/3/id,reviews/3/rating,reviews/4/author/firstName,reviews/4/author/id,reviews/4/comments,reviews/4/createdAt,reviews/4/id,reviews/4/rating,reviews/5/author/firstName,reviews/5/author/id,reviews/5/comments,reviews/5/createdAt,reviews/5/id,reviews/5/rating,reviews/6/author/firstName,reviews/6/author/id,reviews/6/comments,reviews/6/createdAt,reviews/6/id,reviews/6/rating,reviews/7/author/firstName,reviews/7/author/id,reviews/7/comments,reviews/7/createdAt,reviews/7/id,reviews/7/rating,reviews/8/author/firstName,reviews/8/author/id,reviews/8/comments,reviews/8/createdAt,reviews/8/id,reviews/8/rating,reviews/9/author/firstName,reviews/9/author/id,reviews/9/comments,reviews/9/createdAt,reviews/9/id,reviews/9/rating,roomType,stars,url
0,"Jersey City, New Jersey, United States",True,False,40.723,-74.03946,Entire Cozy Unit(15 mins to Manhattan),2,,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/0812b9a1-2d2e-4725-b51f-881045a939b2.jpeg?aki_policy=large,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/0812b9a1-2d2e-4725-b51f-881045a939b2.jpeg?aki_policy=small,99999,"$99,999",USD,False,nightly,Hello all and welcome!\nI love seeing the world and to be able to help people feel right at home.,82 Reviews,Identity verified,,Joes,False,True,245395267,False,中文 (简体),8,Joined in February 2019,,,Joes,14,Miranda,74889676.0,"great location and nice building, loved that parking was included",2022-06-20T19:58:54Z,6.536263641559459e+17,4.0,Abhimanyu,4800160.0,"Everything during our stay was perfect and as mentioned in the listing. The location is a 7-min walk to Newport PATH, and right opposite a huge grocery store. Joes was super kind with an early check-in and a late check-out.",2022-05-23T18:57:26Z,6.333017075212511e+17,5.0,Robert,162871850.0,Excellent place to stay. Very central fo travel into NYC or if working / travelling in New Jersey. Stayed for 3 months in the apartment. The place was as advertised. Joes is an excellent host also.,2022-05-06T17:28:10Z,6.209355874005248e+17,5.0,Yu,303996502.0,It is the best airbnb that I have ever stayed! It gets the great view outside the window. Comfortable and convenient living experience. Everything just settled up for you. It is so great!,2022-01-09T21:55:09Z,5.362712073993113e+17,5.0,Mitesh,253080011.0,"A very nice apartment is great locality. The apartment is as listed. i would definitely rate it a 5 star airbnb if not for the pet hair on the carpet and couch. As I had a 9 month old, we were a bit concerned about him having allergies and looking for a clean place. As for Joes, he is very approachable and very easy to communicate.",2021-12-26T20:27:02Z,5.26079993255321e+17,4.0,Arusha,103758126.0,"Being the first customer, I took a chance in trusting this listing but this apartment exceeded my expectations. It was super clean and also very aesthetically pleasing with a minimalistic vibe. This is definitely a luxury apartment building and in a convenient location - right by Shoprite and near the waterfront and path! The host was also very responsive about any questions. Thanks for a great stay!",2021-12-22T21:29:10Z,5.2321216515417446e+17,5.0,,,,,,,,,,,,,,,,,,,,,,,,,Entire rental unit,4.67,https://www.airbnb.com/rooms/53775685
1,"New York, United States",True,False,40.706,-74.0092,Lux Studio on Wall Street. Heart of Fidi!,2,,https://a0.muscache.com/im/pictures/e2388507-1f5f-4000-aa1b-d3b2279e682a.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/e2388507-1f5f-4000-aa1b-d3b2279e682a.jpg?aki_policy=small,10000,"$10,000",USD,False,nightly,Professional working and living in NYC,1 Review,Identity verified,,Chris,False,True,57586379,False,,1,Joined in February 2016,,,Chris,2,Natalia,222837411.0,"This place is exactly as described, perfect for my 1 month stay! Very clean and organized. The location was also quite convenient, close to the supermarket, pharmacy, restaurants, etc. Chris was a wonderful host, extremely attentive and easy to communicate with. Would definitely book again!",2021-12-13T19:49:59Z,5.166392670270054e+17,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Entire rental unit,,https://www.airbnb.com/rooms/52862058
2,"New York, United States",True,False,40.753,-73.98487002307368,3 Studio Double Queen at Refinery Hotel,12,2 Comfortable Queen sized beds,https://a0.muscache.com/im/pictures/prohost-api/Hosting-812513786864992000/original/4f70cac3-9b26-4385-91fe-7cb385d0ee46.jpeg?aki_policy=large,https://a0.muscache.com/im/pictures/prohost-api/Hosting-812513786864992000/original/4f70cac3-9b26-4385-91fe-7cb385d0ee46.jpeg?aki_policy=small,1639,"$1,639",USD,False,nightly,,32 Reviews,,,Reservations,False,True,496932087,False,,59,Joined in January 2023,100%,within an hour,Reservations,163,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Entire serviced apartment,,https://www.airbnb.com/rooms/812513786864992000


In [16]:
df_s03.columns

Index(['address', 'isAvailable', 'isHostedBySuperhost', 'location/lat',
       'location/lng', 'name', 'numberOfGuests', 'photos/0/caption',
       'photos/0/pictureUrl', 'photos/0/thumbnailUrl', 'pricing/rate/amount',
       'pricing/rate/amountFormatted', 'pricing/rate/currency',
       'pricing/rate/isMicrosAccuracy', 'pricing/rateType',
       'primaryHost/about', 'primaryHost/badges/0', 'primaryHost/badges/1',
       'primaryHost/badges/2', 'primaryHost/firstName',
       'primaryHost/hasInclusionBadge', 'primaryHost/hasProfilePic',
       'primaryHost/id', 'primaryHost/isSuperHost', 'primaryHost/languages/0',
       'primaryHost/listingsCount', 'primaryHost/memberSince',
       'primaryHost/responseRate', 'primaryHost/responseTime',
       'primaryHost/smartName', 'primaryHost/totalListingsCount',
       'reviews/0/author/firstName', 'reviews/0/author/id',
       'reviews/0/comments', 'reviews/0/createdAt', 'reviews/0/id',
       'reviews/0/rating', 'reviews/1/author/firstName', 

### Step 04 - Getting text columns from dataframe

In [17]:
# Definindo as  colunas de texto
mask_text_cols = "df.columns.str.contains(r'name|about|comments')"
df_txt_cols = df_s03.pipe(remove_cols, [mask_text_cols], inverse=False)
df_txt_cols.sample(1)

Unnamed: 0,name,primaryHost/about,reviews/0/comments,reviews/1/comments,reviews/2/comments,reviews/3/comments,reviews/4/comments,reviews/5/comments,reviews/6/comments,reviews/7/comments,reviews/8/comments,reviews/9/comments
1,Lux Studio on Wall Street. Heart of Fidi!,Professional working and living in NYC,"This place is exactly as described, perfect for my 1 month stay! Very clean and organized. The location was also quite convenient, close to the supermarket, pharmacy, restaurants, etc. Chris was a wonderful host, extremely attentive and easy to communicate with. Would definitely book again!",,,,,,,,,


In [18]:
df_notxt_cols = df_s03.loc[:, df_s03.columns.difference(df_txt_cols.columns)].copy()
df_notxt_cols.sample(1)

Unnamed: 0,address,isAvailable,isHostedBySuperhost,location/lat,location/lng,numberOfGuests,photos/0/caption,photos/0/pictureUrl,photos/0/thumbnailUrl,pricing/rate/amount,pricing/rate/amountFormatted,pricing/rate/currency,pricing/rate/isMicrosAccuracy,pricing/rateType,primaryHost/badges/0,primaryHost/badges/1,primaryHost/badges/2,primaryHost/firstName,primaryHost/hasInclusionBadge,primaryHost/hasProfilePic,primaryHost/id,primaryHost/isSuperHost,primaryHost/languages/0,primaryHost/listingsCount,primaryHost/memberSince,primaryHost/responseRate,primaryHost/responseTime,primaryHost/smartName,primaryHost/totalListingsCount,reviews/0/author/firstName,reviews/0/author/id,reviews/0/createdAt,reviews/0/id,reviews/0/rating,reviews/1/author/firstName,reviews/1/author/id,reviews/1/createdAt,reviews/1/id,reviews/1/rating,reviews/2/author/firstName,reviews/2/author/id,reviews/2/createdAt,reviews/2/id,reviews/2/rating,reviews/3/author/firstName,reviews/3/author/id,reviews/3/createdAt,reviews/3/id,reviews/3/rating,reviews/4/author/firstName,reviews/4/author/id,reviews/4/createdAt,reviews/4/id,reviews/4/rating,reviews/5/author/firstName,reviews/5/author/id,reviews/5/createdAt,reviews/5/id,reviews/5/rating,reviews/6/author/firstName,reviews/6/author/id,reviews/6/createdAt,reviews/6/id,reviews/6/rating,reviews/7/author/firstName,reviews/7/author/id,reviews/7/createdAt,reviews/7/id,reviews/7/rating,reviews/8/author/firstName,reviews/8/author/id,reviews/8/createdAt,reviews/8/id,reviews/8/rating,reviews/9/author/firstName,reviews/9/author/id,reviews/9/createdAt,reviews/9/id,reviews/9/rating,roomType,stars,url
327,"New York, United States",True,False,40.738,-73.997,2,Living Room,https://a0.muscache.com/im/pictures/prohost-api/Hosting-30388449/original/d2209902-9d6f-444a-a88f-d9fbd19a99c1.jpeg?aki_policy=large,https://a0.muscache.com/im/pictures/prohost-api/Hosting-30388449/original/d2209902-9d6f-444a-a88f-d9fbd19a99c1.jpeg?aki_policy=small,607,$607,USD,False,nightly,"2,685 Reviews",Identity verified,,Blueground,False,True,107434423,False,English,4805,Joined in December 2016,100%,within an hour,Blueground,5427,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Entire rental unit,,https://www.airbnb.com/rooms/30388449


In [19]:
arr_texts = df_txt_cols.fillna(' ').values
arr_texts[:5]

array([['Entire Cozy Unit(15 mins to Manhattan)',
        'Hello all and welcome!\nI love seeing the world and to be able to help people feel right at home.',
        'great location and nice building, loved that parking was included',
        'Everything during our stay was perfect and as mentioned in the listing. The location is a 7-min walk to Newport PATH, and right opposite a huge grocery store. Joes was super kind with an early check-in and a late check-out.',
        'Excellent place to stay. Very central fo travel into NYC or if working / travelling in New Jersey. Stayed for 3 months in the apartment. The place was as advertised. Joes is an excellent host also.',
        'It is the best airbnb that I have ever stayed! It gets the great view outside the window. Comfortable and convenient living experience. Everything just settled up for you. It is so great!',
        'A very nice apartment is great locality. The apartment is as listed. i would definitely rate it a 5 star airbnb 

### Step 05 - Defining preprocessing and cleaning operations

In [20]:
if LANG == 'english':
  stanza_lang = 'en'
  spacy_pipe = 'en_core_web_lg'
  
elif LANG == 'portuguese':
  stanza_lang = 'pt'
  spacy_pipe = 'pt_core_news_lg'
  
else:
  print('Antes de continuar, defina a variavel LANG = "portuguese" | "english"')

In [21]:
from urllib.request import Request, urlopen 

req = Request(f'https://countwordsfree.com/stopwords/{LANG}/txt')
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0')
content = urlopen(req)
stop_words = pd.read_csv(content, header=None)
stop_words = stop_words[0].values
# These words are important for problem. We don't want to remove them.
excluding = ['against', 'not', 'don', 'don\'t','ain', 'are', 'aren\'t', 'could', 'couldn\'t',
             'did', 'didn\'t', 'does', 'doesn\'t', 'had', 'hadn\'t', 'has', 'hasn\'t', 
             'have', 'haven\'t', 'is', 'isn\'t', 'might', 'mightn\'t', 'must', 'mustn\'t',
             'need', 'needn\'t','should', 'shouldn\'t', 'was', 'wasn\'t', 'were', 
             'weren\'t', 'won\'t', 'would', 'wouldn\'t']
stop_words = [sw for sw in stop_words if sw not in excluding]

In [22]:
!python -m spacy download {spacy_pipe}


2023-06-01 22:22:56.356247: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-01 22:22:58.733741: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-01 22:22:58.734200: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-06-

In [26]:
from functools import cache
from typing import Union
import spacy
spacy.prefer_gpu()
import stanza


stanza.download(stanza_lang)
nlp_spacy = spacy.load(spacy_pipe, disable=['parser', 'ner', 'textcat']) 
spacy_type = type(nlp_spacy)

@cache
def preprocess_txt(textraw:str, nlp_obj:Union[spacy_type, stanza.pipeline.core.Pipeline], lemma=False) -> str:
  '''
  Realiza a limpeza e o pre processamento de um dado texto.
  As etapas de pré-processamento a serem executadas são:
  * Conversão para Minúsculas.
  * Remocao de tags HTML.
  * Remocao de pontuacoes.
  * Remocao de algarismos numericos e numeros de telefone.
  * Remocao de multiplos espacos.
  * Remocao de URLs e E-mails: são removidos, pois eles não adicionam significado ao texto.
  * Substituicao de Emojis: Substitui emojis usando um dicionário predefinido contendo emojis junto com seu significado. (por exemplo: ":)" para "sorriso")
  * Substituindo nomes de usuário: Substitua @Usernames pela palavra "USER". (por exemplo: "@nick_name" para "USER")
  * Remocao de ascentuacao.
  * Removendo Não-Alfabetos: Substitui caracteres (exceto Dígitos e Alfabetos) por um espaço.
  * Removendo letras consecutivas: 3 ou mais letras consecutivas são substituídas por 2 letras. (por exemplo: "Muitoooo" para "Muitoo")
  * Remoção de palavras curtas: palavras com comprimento menor que 2 são removidas.
  * Removendo Stopwords: Stopwords são as palavras que não adicionam muito significado a uma frase. Eles podem ser ignorados com segurança sem sacrificar o significado da frase. (por exemplo: "the", "he", "have")
  * Lematização: A lematização é o processo de converter uma palavra em sua forma básica. (por exemplo: “Great” para “Good”). A lematização geralmente retorna palavras válidas (que existem) enquanto as técnicas de stemming retornam (na maioria das vezes) palavras abreviadas

  Parameters:
    textraw (str): texto bruto o qual sofrera as operacoes de limpeza e pre processamento.
    nlp_obj (spacy.lang.pt.Portuguese | spacy.lang.en.English | stanza.pipeline.core.Pipeline): - objeto que indica a biblioteca a ser utilizada para a etapa de tokenizacao, remocao de stop words e lemmatizacao.
    lemma (boolean): (opcional) se deve substituir as palavras originais do texto por seu lemma ou nao. Default: False

  Returns:
    wordsclean (pd.Dataframe): dataframe sem as colunas indicadas na máscara
  '''
  import re
  from unidecode import unidecode
  # from gensim.utils import simple_preprocess
  
  global stop_words

  # Definindo padrões Regex
  # urlPattern        = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
  # userPattern       = '@[^\s]+'
  alphaPattern      = "[^a-zA-Z0-9]"
  sequencePattern   = r"(.)\1\1+" # letras consecutivas
  seqReplacePattern = r"\1\1"
  
  # Definindo dicionario de emojis e seus respectivos significados
  emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad', 
            ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
            ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed', 
            ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
            '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
            '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink', 
            ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat'}

   
  # ##################
  # Tentativa de aplicação de técnica para correção de palavras com erro de escrita, 
  # mas o resultado não foi considerado satisfatório.
  # ##################
  # from gensim.utils import simple_tokenize
  # from spellchecker import SpellChecker
  # spell = SpellChecker(language='pt')
  # wordlist = list(simple_tokenize(textraw))
  # misspelled = spell.unknown(wordlist)
  # correctedtext = [spell.correction(word) for word in misspelled]
  # print(correctedtext)

  textclean = neat_preprocessing(textraw)  
  for emoji in emojis.keys():
      textclean = textclean.replace(emoji, "EMOJI" + emojis[emoji])  
  textclean = re.sub(sequencePattern, seqReplacePattern, textclean)
  textclean = unidecode(textclean)
  textclean = re.sub(alphaPattern, " ", textclean)
  wordsclean = ''

  if nlp_obj.__module__ == 'spacy.lang.en' or nlp_obj.__module__ == 'spacy.lang.pt':
    doc = nlp_obj(textclean)
    for token in doc:      
      if not token.is_stop and len(token) > 1:
        if lemma and token.pos_ == 'VERB':
          wordsclean += token.lemma_+' '
        else:
          wordsclean += token.text + ' '

  elif isinstance(nlp_obj, stanza.pipeline.core.Pipeline):
    doc = nlp_obj(textclean)
    for sent in doc.sentences:
      for word in sent.words:   
        if word.text not in stop_words and len(word.text) > 1:   
          if lemma and word.upos == 'VERB':
            wordsclean += word.lemma +' '
          else:
            wordsclean += word.text + ' '
  else:
      raise BaseException('Instancia nao reconhecida do objeto "nlp_obj". Opcoes: spacy | stanza')
    
  return wordsclean[:-1]


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:File exists: /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources.


### Step 06 - Tests

In [24]:
arr_texts[133]

array(['Beautiful house in the heart of Williamsburg',
       'There are 3 places that we call home. Brooklyn, the 1000 Islands &  Puerto Rico. We split our time between all 3 and have created homes that we are happy to share with you.',
       'Julia’s place was exactly what we needed. Great spot to work remotely and explore Brooklyn and the city. The location is close to everything including transit. Beds were comfortable and the home had lots of space to spread out. Both Julia and her cleaning staff were very friendly and kind, quick to respond. <br/>As others mentioned this is a louder location on a busy street. Earplugs are provided and necessary at night which was fine for us. Worth noting: there is no microwave or way to reheat food so if you plan to cook or stay for an extended period of time it’s something to consider.',
       'Great spot in a perfect location! The house was spacious with a well appointed kitchen and plenty of room for our family.',
       "We stayed at Julia

#### Teste Spacy

In [27]:
%%time

postprocess = []
for rawtxt in arr_texts[133]:
  if len(rawtxt):
    postprocess.append(preprocess_txt(rawtxt, nlp_obj=nlp_spacy, lemma=True))

postprocess

CPU times: user 227 ms, sys: 1.4 ms, total: 228 ms
Wall time: 235 ms


['beautiful house heart williamsburg',
 'places home brooklyn islands puerto rico split time create homes happy share',
 'julia place exactly need great spot work remotely explore brooklyn city location close include transit beds comfortable home lots space spread julia clean staff friendly kind quick respond mention louder location busy street earplugs provide necessary night fine worth note microwave way reheat food plan cook stay extended period time consider',
 'great spot perfect location house spacious appoint kitchen plenty room family',
 'stay julias home weeks home renovate home central get williamsburg close best restaurants williamsburg go eat night able multiple daily walks dogs mccarren park day home huge clean work home ample room meetings time yard huge plus play time dogs julia incredibly responsive kind let check earlier plan highly recommend stay',
 'julia responsible kind wellprepared house',
 'place gem humble exterior give way beautiful bespoke interior thoughtful 

####  Teste Stanza

In [None]:
%%time
########## Stanza defined's processors for this task:
# tokenize - Segments a Document into Sentences, each containing a list of Tokens.
# mwt - Expands multi-word tokens (MWTs) into multiple words when they are predicted by the tokenizer.Fix contractions.
# pos - UPOS, XPOS, and UFeats annotations are accessible through Word’s properties `pos`, `xpos`, and `ufeats`.
# lemma - Perform lemmatization on a Word using the `Word.text` and `Word.upos` values. The result can be accessed as `Word.lemma`. 
nlp_stanza = stanza.Pipeline(lang=stanza_lang, processors='tokenize,mwt,pos,lemma', use_gpu=True, download_method=None, verbose=False, max_cache_size=2)
postprocess = []
for rawtxt in arr_texts[133]:
  if len(rawtxt):
    postprocess.append(preprocess_txt(rawtxt, nlp_obj=nlp_stanza, lemma=True))

postprocess

CPU times: user 2.36 s, sys: 570 ms, total: 2.93 s
Wall time: 6.28 s


['beautiful house heart williamsburg ',
 'places brooklyn islands puerto rico split time create homes happy share ',
 'julia place need great spot work remotely explore brooklyn city location close include transit beds comfortable lots space spread julia cleaning staff friendly kind quick respond mention louder location busy street earplugs night fine worth note microwave reheat food plan cook stay extended period time ',
 'great spot perfect location house spacious appoint kitchen plenty room family ',
 'stay julias weeks renovate central williamsburg close restaurants williamsburg eat night multiple daily walks dogs mccarren park day huge clean work ample room meetings time yard huge play time dogs julia incredibly responsive kind check earlier plan highly recommend stay ',
 'julia responsible kind wellprepare house ',
 'place gem humble exterior beautiful bespoke interior thoughtful creative touches living room kitchen backyard area open spacious surprising nyc bedrooms smaller host

### Step 07 - Applying preprocessing operations on dataframe text columns

#### Using SpaCy

In [28]:
%%time
df_txt_cols_processed_spacy = parallel_applymap(df_txt_cols, preprocess_txt, worker_count=10, na_action='ignore', nlp_obj=nlp_spacy, lemma=True)

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/684 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

CPU times: user 1min 14s, sys: 5.46 s, total: 1min 20s
Wall time: 1min 25s


In [29]:
df_txt_cols_processed_spacy.loc[133:134]

Unnamed: 0,name,primaryHost/about,reviews/0/comments,reviews/1/comments,reviews/2/comments,reviews/3/comments,reviews/4/comments,reviews/5/comments,reviews/6/comments,reviews/7/comments,reviews/8/comments,reviews/9/comments
133,beautiful house heart williamsburg,places home brooklyn islands puerto rico split time create homes happy share,julia place exactly need great spot work remotely explore brooklyn city location close include transit beds comfortable home lots space spread julia clean staff friendly kind quick respond mention louder location busy street earplugs provide necessary night fine worth note microwave way reheat food plan cook stay extended period time consider,great spot perfect location house spacious appoint kitchen plenty room family,stay julias home weeks home renovate home central get williamsburg close best restaurants williamsburg go eat night able multiple daily walks dogs mccarren park day home huge clean work home ample room meetings time yard huge plus play time dogs julia incredibly responsive kind let check earlier plan highly recommend stay,julia responsible kind wellprepared house,place gem humble exterior give way beautiful bespoke interior thoughtful creative touches need home living room kitchen backyard area open spacious surprising nyc bedrooms smaller hosts create storage space etc speak hosts julia incredibly responsive stay know start booking process work reliable helpful people great trip hope stay,julias place truly little oasis outdoor area nice rarely find new york house super spacious modern locate great location right mccarren park near williamsburg shops restaurants stay time visit nyc,wonderful place comfy wellappointed spacious open thoughtful amenities minutes walk mccarren park minutes foods pharmacy ideal location place make feel like safe home open floor plan downstairs make great living space projector tv setup ideal son game heaven outdoor space calming delightful cooks kitchen wellappointe lay outtheres calm sanctuary inside bustling brooklyn right door everchange hipster parade love stay,space lovely room relax breath walk distance love stay hope future,unassuming street facade lovely quiet oasis heart williamsburg husband grown daughters gather julia place home base great space offer comforts home extremely comfortable amenities story spot bedrooms comfy beds need williamsburg adventure walk door walk mccarren park walk ferry stop travel ferry th street river manhattan walk shops bedford ave park car street safe secure lot plenty street parking julia extremely responsive easy work hope enjoy significant outdoor deck find,great place stay
134,beautiful bdrm bath private luxury townhouse,lawyer enjoy garden read,great apartment greenwich village area manhattan location convenient walk get subway call uber feel completely safe come apartment evenings entrance light near major street apartment spacious nyc comfortable beds baths stock plenty towels good showers hairdryers nice kitchen include size fridge coffee maker plenty glasses plates vike range need cook manhattan apartment huge windows nice views roomswe family plenty space bit street noise early morning days sanitation trucks pick evenings mornings apartment completely quiet dan great host responsive send long helpful list restaurants arrive,great location stay recommend place little minor issues fix beds comfortable,great location trip nyc,fabulous location great nyc lots great bars restaurants accessible manhattan key assets lovely apartment maintenance need eg leak roof inaccessible terrace consequently despite show list photographs dollar price pay hope better maintain property dan communicate provide great tips local venues visit etc raise legitimate concerns terrace leak roof go pretty quiet recommend property definitively worth seek assurance feedback provide recent guests address,dan give lot info local guidance responsive keypad entry great lose keys come early area great like,dan responsive considerate host christmas period keep contact send maintenance team hour ask help beds super comfy location amazing walks key landmarks restaurants bars great live music locations doorstep subway,dan place wonderful place sink wasn fall downstairs wall toilet seat wasn break cleaning great notice comforters wash open dishwasher dishes dirty upstairs bathroom windows shower people different apartments shower want work maintenance guy come day pm ceiling dripping get bucket night morning spot leak rent place dan say fix complaint dan start respond partner fill complaint airbnb soon get phone airbnb notice mold upstairs corner steps notice mold upstairs bedroom beds leave,place great great location bed comfortable short walk subway restaurants perfect spot larger group,location location location great place great neighborhood local living experience get right village look forward stay highly recommend,fantastic location great space good amenities lovely apartment nice home base hotel room trip leadup trip great communication dan simple instructions checkin apartmentunfortunately roof leak rain rainwater drip living room places course right laptops luckily cause minimal damage wait response dan resolve little disappointing nt cheap stay nt spoil tripoverall location great apartment recommend check forecast


#### Using Stanza

In [None]:
%%time
nlp_stanza = stanza.Pipeline(lang=stanza_lang, processors='tokenize,mwt,pos,lemma', use_gpu=True, download_method=None, verbose=False, max_cache_size=2)
df_txt_cols_processed_stanza = parallel_applymap(df_txt_cols, preprocess_txt, worker_count=10, na_action='ignore', nlp_obj=nlp_stanza, lemma=True)

  0%|          | 0/684 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

  0%|          | 0/672 [00:00<?, ?it/s]

CPU times: user 7min 45s, sys: 30.6 s, total: 8min 15s
Wall time: 6min 48s


#### Comparando o texto não processado com o processado

In [30]:
df_txt_cols.loc[0:5]

Unnamed: 0,name,primaryHost/about,reviews/0/comments,reviews/1/comments,reviews/2/comments,reviews/3/comments,reviews/4/comments,reviews/5/comments,reviews/6/comments,reviews/7/comments,reviews/8/comments,reviews/9/comments
0,Entire Cozy Unit(15 mins to Manhattan),Hello all and welcome!\nI love seeing the world and to be able to help people feel right at home.,"great location and nice building, loved that parking was included","Everything during our stay was perfect and as mentioned in the listing. The location is a 7-min walk to Newport PATH, and right opposite a huge grocery store. Joes was super kind with an early check-in and a late check-out.",Excellent place to stay. Very central fo travel into NYC or if working / travelling in New Jersey. Stayed for 3 months in the apartment. The place was as advertised. Joes is an excellent host also.,It is the best airbnb that I have ever stayed! It gets the great view outside the window. Comfortable and convenient living experience. Everything just settled up for you. It is so great!,"A very nice apartment is great locality. The apartment is as listed. i would definitely rate it a 5 star airbnb if not for the pet hair on the carpet and couch. As I had a 9 month old, we were a bit concerned about him having allergies and looking for a clean place. As for Joes, he is very approachable and very easy to communicate.","Being the first customer, I took a chance in trusting this listing but this apartment exceeded my expectations. It was super clean and also very aesthetically pleasing with a minimalistic vibe. This is definitely a luxury apartment building and in a convenient location - right by Shoprite and near the waterfront and path! The host was also very responsive about any questions. Thanks for a great stay!",,,,
1,Lux Studio on Wall Street. Heart of Fidi!,Professional working and living in NYC,"This place is exactly as described, perfect for my 1 month stay! Very clean and organized. The location was also quite convenient, close to the supermarket, pharmacy, restaurants, etc. Chris was a wonderful host, extremely attentive and easy to communicate with. Would definitely book again!",,,,,,,,,
2,3 Studio Double Queen at Refinery Hotel,,,,,,,,,,,
3,Home-1 near NYC (Filming/Weddings/Retreats),,,,,,,,,,,
4,The Artist's Loft,"Producer/Director with award-winning independent production company, http://www.IndustrialStrengthNYC.com",,,,,,,,,,
5,"Cozy, quiet, private 1 br apt NO SMOKING ALLOWED",,Thank you so much Johanna. I couldn’t be more grateful for how well you accommodated my parents. It means the world to me that they are happy and they told me that you were a wonderful host. 🙏,"DO NOT BOOK !!!!!!!!<br/>By far the worst airbnb experience I have ever had. The woman repeatedly harassed me whenever she knew I was alone and I think it was because she knew I wasn’t from the area. After the first night I stayed she lied to Airbnb about what I was doing with the apartment and tried to have me kicked out without a refund. I spoke to Airbnb and explained to them what was happening and she immediately switched up her story and asked me to stay. I explained to Airbnb and her that I was no longer comfortable staying because I felt she was trying to take advantage of me and she didn’t even have the decency to return my calls or messages even though I messaged her through the Airbnb app. Whenever I had company over she was quiet but whenever she knew I was alone they would lie and try to finesse me for money. She insisted on trying to add random nonsense cleaning fees over things that i had absolutely nothing to do with, at one point I offered her to come inside and check around and she refused to, but she still went to Airbnb accusing me of things that clearly were not happening. Even though Airbnb told her she was not to have any more contact with me she came repeatedly stating that I would have to leave without refund until I called Airbnb and they told her she couldn’t. The next time she came down she brought her husband in what was a clear attempt to intimidate a small foreign woman. At one point she tried to get the police involved and even they told her that what she was doing was wrong, and this is when Airbnb finally refunded me the rest of the stay and I was allowed to leave. She’s a liar trying to finesse people out of their money and even the police said that the Airbnb she’s running is illegal. The apartment is tiny and the woman is rude, disrespectful and a liar. The only thing that separates the apartment from the host upstairs is a staircase and a door that they can open at anytime (of course not mentioned in the listing). Even if the woman was the nicest person in the world the place isn’t worth the price but I would not chance meeting this woman and dealing with her lying for any price.","Place is as described, cozy and clean. Enjoyed having a tv in the living room and bedroom. Bathroom was nice and stocked with lots of shampoo, conditioner and body wash which is perfect incase you forget to bring toiletries for your stay. <br/><br/>Complimentary coffee and hot chocolate with half & half were also a really lovely touch as well. <br/><br/>Host was very accommodating and check in was easy. Definitely recommend this place for a short stay. 5 stars.",Johanna’s place was exactly as described. Everything was so clean and very comfortable. The communication with Johanna was beyond expectations and she put my mind at so much ease while traveling with my newborn son. She accommodated every request.,The stay was wonderful. We enjoyed the neighborhood and the location close to some nightlife and groceries that we needed. Johanna was a great host!,Perfect location to enjoy a weekend in the Bronx close to all mass transit to bring you into the city. Many great places to eat nearby in walking distance. Apartment was clean and cozy.,It’s fine For one or two Guest,La estancia es muy cómoda y limpia. El lugar es muy tranquilo. El lugar muy limpio y práctico.,,


In [None]:
df_txt_cols_processed_spacy.loc[0:5]

Unnamed: 0,name,primaryHost/about,reviews/0/comments,reviews/1/comments,reviews/2/comments,reviews/3/comments,reviews/4/comments,reviews/5/comments,reviews/6/comments,reviews/7/comments,reviews/8/comments,reviews/9/comments
0,entire cozy unit mins manhattan,love people feel,great location nice building love parking include,stay perfect mention listing location min walk newport path huge grocery store joes super kind early checkin late checkout,excellent place stay central fo travel nyc work travel jersey stay months apartment place advertise joes excellent host,airbnb stay great view window comfortable convenient living experience settle great,nice apartment great locality apartment list rate star airbnb pet hair carpet couch month bit concerned allergies clean place joes approachable easy communicate,customer chance trust listing apartment exceed expectations super clean aesthetically pleasing minimalistic vibe luxury apartment building convenient location shoprite waterfront path host responsive questions great stay,,,,
1,lux studio wall street heart fidi,professional work live nyc,place perfect month stay clean organize location convenient close supermarket pharmacy restaurants chris wonderful host extremely attentive easy communicate book,,,,,,,,,
2,studio double queen refinery hotel,,,,,,,,,,,
3,nyc filming weddings retreat,,,,,,,,,,,
4,artists loaf,producer director awardwin independent production company http wwindustrialstrengthnyccom,,,,,,,,,,
5,cozy quiet private br apt smoking allow,,johanna couldn grateful accommodate parents happy tell wonderful host EMOJI,book worst airbnb experience woman repeatedly harass know know wasn area night stay lie airbnb apartment kick refund speak airbnb explain happen switch story ask stay explain airbnb longer comfortable stay feel advantage didn decency return calls messages message airbnb app company quiet know lie finesse money insist add random nonsense cleaning fees absolutely point offer check refuse airbnb accuse happen airbnb tell contact repeatedly state leave refund call airbnb tell couldn time bring husband clear attempt intimidate small foreign woman point police involve tell wrong airbnb finally refund rest stay allow leave liar finesse people money police airbnb run illegal apartment tiny woman rude disrespectful liar separate apartment host upstairs staircase door open anytime mention listing woman nicest person place isn worth price chance meet woman deal lie price,place cozy clean enjoy tv living room bedroom bathroom nice stock lots shampoo conditioner body wash perfect incase forget bring toiletries stay complimentary coffee hot chocolate lovely touch host accommodating check easy recommend place short stay stars,johanna place clean comfortable communication johanna expectations mind ease travel newborn son accommodate request,stay wonderful enjoy neighborhood location close nightlife groceries need johanna great host,perfect location enjoy weekend bronx close mass transit bring city great places eat nearby walking distance apartment clean cozy,fine guest,la estancia es muy comoda limpia el lugar es muy tranquilo el lugar muy limpio practico,,


#### Salvando o dataset preprocessado

In [32]:
df_final = pd.concat([df_notxt_cols, df_txt_cols_processed_spacy], axis=1)
df_final.loc[0:5]

Unnamed: 0,address,isAvailable,isHostedBySuperhost,location/lat,location/lng,numberOfGuests,photos/0/caption,photos/0/pictureUrl,photos/0/thumbnailUrl,pricing/rate/amount,pricing/rate/amountFormatted,pricing/rate/currency,pricing/rate/isMicrosAccuracy,pricing/rateType,primaryHost/badges/0,primaryHost/badges/1,primaryHost/badges/2,primaryHost/firstName,primaryHost/hasInclusionBadge,primaryHost/hasProfilePic,primaryHost/id,primaryHost/isSuperHost,primaryHost/languages/0,primaryHost/listingsCount,primaryHost/memberSince,primaryHost/responseRate,primaryHost/responseTime,primaryHost/smartName,primaryHost/totalListingsCount,reviews/0/author/firstName,reviews/0/author/id,reviews/0/createdAt,reviews/0/id,reviews/0/rating,reviews/1/author/firstName,reviews/1/author/id,reviews/1/createdAt,reviews/1/id,reviews/1/rating,reviews/2/author/firstName,reviews/2/author/id,reviews/2/createdAt,reviews/2/id,reviews/2/rating,reviews/3/author/firstName,reviews/3/author/id,reviews/3/createdAt,reviews/3/id,reviews/3/rating,reviews/4/author/firstName,reviews/4/author/id,reviews/4/createdAt,reviews/4/id,reviews/4/rating,reviews/5/author/firstName,reviews/5/author/id,reviews/5/createdAt,reviews/5/id,reviews/5/rating,reviews/6/author/firstName,reviews/6/author/id,reviews/6/createdAt,reviews/6/id,reviews/6/rating,reviews/7/author/firstName,reviews/7/author/id,reviews/7/createdAt,reviews/7/id,reviews/7/rating,reviews/8/author/firstName,reviews/8/author/id,reviews/8/createdAt,reviews/8/id,reviews/8/rating,reviews/9/author/firstName,reviews/9/author/id,reviews/9/createdAt,reviews/9/id,reviews/9/rating,roomType,stars,url,name,primaryHost/about,reviews/0/comments,reviews/1/comments,reviews/2/comments,reviews/3/comments,reviews/4/comments,reviews/5/comments,reviews/6/comments,reviews/7/comments,reviews/8/comments,reviews/9/comments
0,"Jersey City, New Jersey, United States",True,False,40.723,-74.03946,2,,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/0812b9a1-2d2e-4725-b51f-881045a939b2.jpeg?aki_policy=large,https://a0.muscache.com/im/pictures/miso/Hosting-53775685/original/0812b9a1-2d2e-4725-b51f-881045a939b2.jpeg?aki_policy=small,99999,"$99,999",USD,False,nightly,82 Reviews,Identity verified,,Joes,False,True,245395267,False,中文 (简体),8,Joined in February 2019,,,Joes,14,Miranda,74889676.0,2022-06-20T19:58:54Z,6.536263641559459e+17,4.0,Abhimanyu,4800160.0,2022-05-23T18:57:26Z,6.333017075212511e+17,5.0,Robert,162871850.0,2022-05-06T17:28:10Z,6.209355874005248e+17,5.0,Yu,303996502.0,2022-01-09T21:55:09Z,5.362712073993113e+17,5.0,Mitesh,253080011.0,2021-12-26T20:27:02Z,5.26079993255321e+17,4.0,Arusha,103758126.0,2021-12-22T21:29:10Z,5.2321216515417446e+17,5.0,,,,,,,,,,,,,,,,,,,,,Entire rental unit,4.67,https://www.airbnb.com/rooms/53775685,entire cozy unit mins manhattan,hello welcome love see world able help people feel right home,great location nice building love parking include,stay perfect mention listing location min walk newport path right opposite huge grocery store joes super kind early checkin late checkout,excellent place stay central fo travel nyc work travel new jersey stay months apartment place advertise joes excellent host,best airbnb stay get great view outside window comfortable convenient living experience settle great,nice apartment great locality apartment list definitely rate star airbnb pet hair carpet couch month old bit concerned have allergies look clean place joes approachable easy communicate,customer take chance trust listing apartment exceed expectations super clean aesthetically pleasing minimalistic vibe definitely luxury apartment building convenient location right shoprite near waterfront path host responsive questions thanks great stay,,,,
1,"New York, United States",True,False,40.706,-74.0092,2,,https://a0.muscache.com/im/pictures/e2388507-1f5f-4000-aa1b-d3b2279e682a.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/e2388507-1f5f-4000-aa1b-d3b2279e682a.jpg?aki_policy=small,10000,"$10,000",USD,False,nightly,1 Review,Identity verified,,Chris,False,True,57586379,False,,1,Joined in February 2016,,,Chris,2,Natalia,222837411.0,2021-12-13T19:49:59Z,5.166392670270054e+17,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Entire rental unit,,https://www.airbnb.com/rooms/52862058,lux studio wall street heart fidi,professional working live nyc,place exactly describe perfect month stay clean organize location convenient close supermarket pharmacy restaurants etc chris wonderful host extremely attentive easy communicate definitely book,,,,,,,,,
2,"New York, United States",True,False,40.753,-73.98487002307368,12,2 Comfortable Queen sized beds,https://a0.muscache.com/im/pictures/prohost-api/Hosting-812513786864992000/original/4f70cac3-9b26-4385-91fe-7cb385d0ee46.jpeg?aki_policy=large,https://a0.muscache.com/im/pictures/prohost-api/Hosting-812513786864992000/original/4f70cac3-9b26-4385-91fe-7cb385d0ee46.jpeg?aki_policy=small,1639,"$1,639",USD,False,nightly,32 Reviews,,,Reservations,False,True,496932087,False,,59,Joined in January 2023,100%,within an hour,Reservations,163,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Entire serviced apartment,,https://www.airbnb.com/rooms/812513786864992000,studio double queen refinery hotel,,,,,,,,,,,
3,"East Orange, New Jersey, United States",True,False,40.753,-74.20601,16,,https://a0.muscache.com/im/pictures/1a7a1b5e-6782-47c8-9087-581ce7d439b2.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/1a7a1b5e-6782-47c8-9087-581ce7d439b2.jpg?aki_policy=small,1801,"$1,801",USD,False,nightly,151 Reviews,Identity verified,,Lisa & Bernard,False,True,99385862,False,,4,Joined in October 2016,100%,within an hour,Lisa & Bernard,9,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Entire condo,,https://www.airbnb.com/rooms/45544118,home near nyc filming weddings retreat,,,,,,,,,,,
4,"Passaic, New Jersey, United States",True,False,40.856,-74.14331,16,,https://a0.muscache.com/im/pictures/9c55e037-32ee-41e7-b902-be6fb7476890.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/9c55e037-32ee-41e7-b902-be6fb7476890.jpg?aki_policy=small,2500,"$2,500",USD,False,nightly,Identity verified,,,Glenn,False,True,1518598,False,,2,Joined in December 2011,100%,within an hour,Glenn,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Entire loft,,https://www.airbnb.com/rooms/47817262,artists loft,producer director awardwinne independent production company http wwindustrialstrengthnyccom,,,,,,,,,,
5,"The Bronx, New York, United States",True,False,40.833,-73.82839,4,,https://a0.muscache.com/im/pictures/fa1f73ce-eab3-45ce-a498-9d393f39f547.jpg?aki_policy=large,https://a0.muscache.com/im/pictures/fa1f73ce-eab3-45ce-a498-9d393f39f547.jpg?aki_policy=small,2000,"$2,000",USD,False,nightly,51 Reviews,Identity verified,,Johanna,False,True,165054817,False,,1,Joined in December 2017,,,Johanna,1,Tennice,370213901.0,2022-03-05T20:46:50Z,5.760994875389545e+17,5.0,Ana,423671254.0,2022-02-25T19:09:12Z,5.702521370681191e+17,1.0,Julissa,337365171.0,2022-02-14T20:49:51Z,5.623302643692237e+17,5.0,Amanda,318219716.0,2021-12-24T21:15:33Z,5.246548650366917e+17,5.0,Holly,331433990.0,2021-12-08T18:15:13Z,5.129676871483319e+17,5.0,Melissa,146311552.0,2021-11-28T21:52:07Z,5.058291024017224e+17,5.0,Waled,402554229.0,2021-11-25T20:42:34Z,5.036197632250281e+17,4.0,Ana,194764806.0,2021-12-14T19:56:27Z,5.173672927807784e+17,5.0,,,,,,,,,,,Entire rental unit,4.38,https://www.airbnb.com/rooms/40816205,cozy quiet private br apt smoking allow,,thank johanna couldn grateful accommodate parents mean world happy tell wonderful host EMOJI,book far worst airbnb experience woman repeatedly harass know think know wasn area night stay lie airbnb apartment try kick refund speak airbnb explain happen immediately switch story ask stay explain airbnb longer comfortable stay feel try advantage didn decency return calls messages message airbnb app company quiet know lie try finesse money insist try add random nonsense cleaning fees things absolutely point offer come inside check refuse go airbnb accuse things clearly happen airbnb tell contact come repeatedly state leave refund call airbnb tell couldn time come bring husband clear attempt intimidate small foreign woman point try police involve tell wrong airbnb finally refund rest stay allow leave liar try finesse people money police say airbnb run illegal apartment tiny woman rude disrespectful liar thing separate apartment host upstairs staircase door open anytime course mention listing woman nicest person world place isn worth price chance meet woman deal lie price,place describe cozy clean enjoy have tv living room bedroom bathroom nice stock lots shampoo conditioner body wash perfect incase forget bring toiletries stay complimentary coffee hot chocolate half half lovely touch host accommodating check easy definitely recommend place short stay stars,johanna place exactly describe clean comfortable communication johanna expectations mind ease travel newborn son accommodate request,stay wonderful enjoy neighborhood location close nightlife groceries need johanna great host,perfect location enjoy weekend bronx close mass transit bring city great places eat nearby walking distance apartment clean cozy,fine guest,la estancia es muy comoda limpia el lugar es muy tranquilo el lugar muy limpio practico,,


In [33]:
file_path = 'data/processed/'
# file_name = 'dataset_airbnb-processed_2023-04-04_01-45-49-997.csv'
file_name = 'dataset_airbnb-processed-spacy_2023-04-13_03-28-09-439.csv'

df_final.to_csv(file_path+file_name, index=False)