In [2]:
import pandas as pd
import numpy as np

# Load the raw JSON and flatten the nested "info_inmuebles" records
datos = pd.read_json("data/datos_hosting.json")
datos = pd.json_normalize(datos["info_inmuebles"])

columnas = list(datos.columns)

# Columns from the fourth onward hold list-like values; explode them into one row per element
datos = datos.explode(columnas[3:])
datos.reset_index(inplace=True, drop=True)

# Convert the count and rating columns to numeric types
datos["max_hospedes"] = datos["max_hospedes"].astype(np.int64)
datos[["cantidad_baños", "cantidad_cuartos", "cantidad_camas"]] = datos[
    ["cantidad_baños", "cantidad_cuartos", "cantidad_camas"]
].astype(np.int64)
datos["evaluacion_general"] = datos["evaluacion_general"].astype(np.float64)

# Strip "$" and "," from the money columns before converting them to float
datos["precio"] = datos["precio"].apply(lambda x: x.replace("$", "").replace(",", "").strip())
datos["precio"] = datos["precio"].astype(np.float64)
datos[["cuota_deposito", "cuota_limpieza"]] = datos[
    ["cuota_deposito", "cuota_limpieza"]
].map(lambda x: x.replace("$", "").replace(",", "").strip())
datos[["cuota_deposito", "cuota_limpieza"]] = datos[
    ["cuota_deposito", "cuota_limpieza"]
].astype(np.float64)

# The problem with text

In [3]:
datos.head()

Unnamed: 0,evaluacion_general,experiencia_local,max_hospedes,descripcion_local,descripcion_vecindad,cantidad_baños,cantidad_cuartos,cantidad_camas,modelo_cama,comodidades,cuota_deposito,cuota_limpieza,precio
0,10.0,--,1,This clean and comfortable one bedroom sits ri...,Lower Queen Anne is near the Seattle Center (s...,1,1,1,Real Bed,"{Internet,""Wireless Internet"",Kitchen,""Free Pa...",0.0,0.0,110.0
1,10.0,--,1,Our century old Upper Queen Anne house is loca...,"Upper Queen Anne is a really pleasant, unique ...",1,1,1,Futon,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",0.0,0.0,45.0
2,10.0,--,1,Cozy room in two-bedroom apartment along the l...,The convenience of being in Seattle but on the...,1,1,1,Futon,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",0.0,0.0,55.0
3,10.0,--,1,Very lovely and cozy room for one. Convenientl...,"Ballard is lovely, vibrant and one of the most...",1,1,1,Pull-out Sofa,"{Internet,""Wireless Internet"",Kitchen,""Free Pa...",0.0,20.0,52.0
4,10.0,--,1,The “Studio at Mibbett Hollow' is in a Beautif...,--,1,1,1,Real Bed,"{""Wireless Internet"",Kitchen,""Free Parking on ...",0.0,15.0,85.0


In this study, the features that affect the price are not only numeric. Text fields such as the description also influence the value.

**How is value extracted from this kind of data?**
For example, words and phrases such as "excellent", "fantastic", "horrible", "never again", or "I don't recommend it" could affect the final price, since they relate to comfort.
To extract value from a large body of text, the text is first broken into individual words. This step is called `tokenization`.

A more accessible structure could be built to support the text analysis.

- tokenization: splitting the text into tokens (small units such as words, syllables, or groups of words).
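
As a minimal sketch of the definition above, the following splits a made-up sentence (not taken from the dataset) into word tokens and into two-word tokens. The variable names are only illustrative.

```python
# Hypothetical sample sentence, not taken from the dataset
texto = "cozy room in two-bedroom apartment"

# Word-level tokens: split on whitespace
palabras = texto.split()

# Two-word tokens (bigrams): pair each word with the one that follows it
bigramas = list(zip(palabras, palabras[1:]))

print(palabras)
print(bigramas)
```

Which unit is appropriate depends on the analysis; the rest of this notebook works with single words.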

In [4]:
# selecting a single column returns a Series, not a DataFrame
datos['descripcion_local']

0       This clean and comfortable one bedroom sits ri...
1       Our century old Upper Queen Anne house is loca...
2       Cozy room in two-bedroom apartment along the l...
3       Very lovely and cozy room for one. Convenientl...
4       The “Studio at Mibbett Hollow' is in a Beautif...
                              ...                        
3813    Beautiful craftsman home in the historic Wedgw...
3814    Located in a very easily accessible area of Se...
3815    This home is fully furnished and available wee...
3816    This business-themed modern home features:  *H...
3817    This welcoming home is in the quiet residentia...
Name: descripcion_local, Length: 3818, dtype: object

The text is not normalized; it appears exactly as the customer wrote it.
A quick look at the text shows:
- mixed uppercase and lowercase
- unusual characters

In [5]:
# convert to lowercase
datos["descripcion_local"] = datos["descripcion_local"].str.lower()
datos['descripcion_vecindad'] = datos["descripcion_vecindad"].str.lower()

In [6]:
datos.head()

Unnamed: 0,evaluacion_general,experiencia_local,max_hospedes,descripcion_local,descripcion_vecindad,cantidad_baños,cantidad_cuartos,cantidad_camas,modelo_cama,comodidades,cuota_deposito,cuota_limpieza,precio
0,10.0,--,1,this clean and comfortable one bedroom sits ri...,lower queen anne is near the seattle center (s...,1,1,1,Real Bed,"{Internet,""Wireless Internet"",Kitchen,""Free Pa...",0.0,0.0,110.0
1,10.0,--,1,our century old upper queen anne house is loca...,"upper queen anne is a really pleasant, unique ...",1,1,1,Futon,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",0.0,0.0,45.0
2,10.0,--,1,cozy room in two-bedroom apartment along the l...,the convenience of being in seattle but on the...,1,1,1,Futon,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",0.0,0.0,55.0
3,10.0,--,1,very lovely and cozy room for one. convenientl...,"ballard is lovely, vibrant and one of the most...",1,1,1,Pull-out Sofa,"{Internet,""Wireless Internet"",Kitchen,""Free Pa...",0.0,20.0,52.0
4,10.0,--,1,the “studio at mibbett hollow' is in a beautif...,--,1,1,1,Real Bed,"{""Wireless Internet"",Kitchen,""Free Parking on ...",0.0,15.0,85.0


# Removing characters with regex

Characters that are not wanted in this analysis will be removed, such as periods, commas, exclamation marks, slashes, double quotes, etc.

In [7]:
datos["descripcion_local"][3169]

"built, run and supported by seattle tech and start up veterans, grokhome's focus is to create a supportive environment for smart people working on interesting projects, start ups and more. this listing is an upper bunk, in a 2-person shared room. *note: this fall, there will be major renovations happening on one kitchen and bathroom at a time. there will always be two other working kitchens and two working bathrooms in the house. we'll work to minimize the impact these renovations have on your stay. **this listing is only available to those working in the tech/science space. live in a hacker house, and immerse yourself in the seattle tech scene. you can expect to be surrounded by smart people solving big problems or working on something fun. we have frequent demo nights, and love when our guests share something they are passionate about. if you're new to the city, our deep ties to the seattle tech scene can help you get involved. expand your network, develop your ideas, and learn some

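This record contains several characters to deal with:
- commas
- single quotes
- hyphens
- asterisks
- colons
- slashes

Single quotes and hyphens inside words (grokhome's, 2-person) carry meaning, so the pattern used below keeps them and only standalone hyphens are removed afterwards.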
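
One way to list such characters without reading the text by eye is to count everything that is not a lowercase letter, a digit, or whitespace. This is only a sketch: the string is a short hypothetical excerpt modeled on the record above, not the full value.

```python
import re
from collections import Counter

# Hypothetical excerpt modeled on the record above
texto = "grokhome's focus: an upper bunk, in a 2-person shared room. *note: tech/science"

# Every character that is not a lowercase letter, a digit, or whitespace
caracteres = Counter(re.findall(r"[^a-z0-9\s]", texto))
print(caracteres)
```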

In [8]:
# regex=True makes pandas treat the pattern as a regular expression
# in [^a-zA-Z0-9\-\'], the leading ^ negates the set; the rest covers uppercase and lowercase letters and the digits 0 to 9
# the trailing \- and \' keep hyphens and single quotes from being replaced
datos["descripcion_local"]  = datos["descripcion_local"].str.replace(
    r"[^a-zA-Z0-9\-\']", " ", regex=True
)

In [9]:
# be careful with words joined by a hyphen: removing it would break the compound word
# a standalone hyphen is not attached (without a space) to any word
# those standalone hyphens are the ones to remove

# hyphens with no word character on the left and none on the right
datos["descripcion_local"] = datos["descripcion_local"].str.replace(
    r"(?<!\w)-(?!\w)", " ", regex=True
)

In [10]:
datos["descripcion_local"].head()

0    this clean and comfortable one bedroom sits ri...
1    our century old upper queen anne house is loca...
2    cozy room in two-bedroom apartment along the l...
3    very lovely and cozy room for one  convenientl...
4    the  studio at mibbett hollow' is in a beautif...
Name: descripcion_local, dtype: object
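
To see the effect of the two substitutions in isolation, the sketch below applies the same patterns to a pair of made-up strings (assumed for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical sample strings, not taken from the dataset
muestra = pd.Series(
    ["two-bedroom apartment - it's cozy!", "tech/science space: 2-person room"]
)

# Step 1: replace everything except letters, digits, hyphens and apostrophes
limpio = muestra.str.replace(r"[^a-zA-Z0-9\-']", " ", regex=True)

# Step 2: replace hyphens that have no word character on either side
limpio = limpio.str.replace(r"(?<!\w)-(?!\w)", " ", regex=True)

print(limpio.tolist())
```

Hyphens inside `two-bedroom` and `2-person` survive, while the free-standing one is replaced. A hyphen attached to a word on one side only would also be kept, because the pattern requires both sides to lack a word character.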

# Tokenizing strings

Each description can now be turned into a list of words.

In [11]:
# With no arguments, str.split() splits on any run of whitespace,
# so the repeated spaces left by the cleanup do not produce empty tokens
datos["descripcion_local"] = datos["descripcion_local"].str.split()
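
The difference between calling `str.split()` with no argument and passing an explicit `" "` matters here, because the cleanup replaces punctuation with spaces and leaves double spaces behind. A small sketch on a made-up string:

```python
import pandas as pd

# Hypothetical string with the kind of double space the cleanup leaves behind
muestra = pd.Series(["cozy room for one  conveniently located"])

sin_argumento = muestra.str.split()     # splits on any run of whitespace
con_espacio = muestra.str.split(" ")    # splits on every single space

print(sin_argumento[0])
print(con_espacio[0])
```

With the explicit separator, the double space yields an empty string token, which would have to be filtered out later.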

In [12]:
datos.head()

Unnamed: 0,evaluacion_general,experiencia_local,max_hospedes,descripcion_local,descripcion_vecindad,cantidad_baños,cantidad_cuartos,cantidad_camas,modelo_cama,comodidades,cuota_deposito,cuota_limpieza,precio
0,10.0,--,1,"[this, clean, and, comfortable, one, bedroom, ...",lower queen anne is near the seattle center (s...,1,1,1,Real Bed,"{Internet,""Wireless Internet"",Kitchen,""Free Pa...",0.0,0.0,110.0
1,10.0,--,1,"[our, century, old, upper, queen, anne, house,...","upper queen anne is a really pleasant, unique ...",1,1,1,Futon,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",0.0,0.0,45.0
2,10.0,--,1,"[cozy, room, in, two-bedroom, apartment, along...",the convenience of being in seattle but on the...,1,1,1,Futon,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",0.0,0.0,55.0
3,10.0,--,1,"[very, lovely, and, cozy, room, for, one, conv...","ballard is lovely, vibrant and one of the most...",1,1,1,Pull-out Sofa,"{Internet,""Wireless Internet"",Kitchen,""Free Pa...",0.0,20.0,52.0
4,10.0,--,1,"[the, studio, at, mibbett, hollow', is, in, a,...",--,1,1,1,Real Bed,"{""Wireless Internet"",Kitchen,""Free Parking on ...",0.0,15.0,85.0


In [13]:
# the backslash escapes a character so it is matched literally; | separates alternatives. This matches {, } and "
datos["comodidades"] = datos["comodidades"].str.replace(r'\{|}|"', "", regex=True)

In [14]:
# Turn the comodidades column into a list
datos["comodidades"] = datos["comodidades"].str.split(',')
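
The two steps for `comodidades` can be checked on a couple of values written in the same format. The values below are hypothetical, and the `{}` case is an assumption about how a listing without amenities might appear.

```python
import pandas as pd

# Hypothetical raw values in the same format as the comodidades column
muestra = pd.Series(['{TV,"Wireless Internet",Kitchen}', "{}"])

# Drop braces and double quotes, then split on commas
sin_simbolos = muestra.str.replace(r'\{|}|"', "", regex=True)
listas = sin_simbolos.str.split(",")

print(listas.tolist())
```

Note that an empty `{}` value, if one exists in the data, becomes a list holding one empty string rather than an empty list.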

In [15]:
datos.head()

Unnamed: 0,evaluacion_general,experiencia_local,max_hospedes,descripcion_local,descripcion_vecindad,cantidad_baños,cantidad_cuartos,cantidad_camas,modelo_cama,comodidades,cuota_deposito,cuota_limpieza,precio
0,10.0,--,1,"[this, clean, and, comfortable, one, bedroom, ...",lower queen anne is near the seattle center (s...,1,1,1,Real Bed,"[Internet, Wireless Internet, Kitchen, Free Pa...",0.0,0.0,110.0
1,10.0,--,1,"[our, century, old, upper, queen, anne, house,...","upper queen anne is a really pleasant, unique ...",1,1,1,Futon,"[TV, Internet, Wireless Internet, Kitchen, Fre...",0.0,0.0,45.0
2,10.0,--,1,"[cozy, room, in, two-bedroom, apartment, along...",the convenience of being in seattle but on the...,1,1,1,Futon,"[TV, Internet, Wireless Internet, Kitchen, Fre...",0.0,0.0,55.0
3,10.0,--,1,"[very, lovely, and, cozy, room, for, one, conv...","ballard is lovely, vibrant and one of the most...",1,1,1,Pull-out Sofa,"[Internet, Wireless Internet, Kitchen, Free Pa...",0.0,20.0,52.0
4,10.0,--,1,"[the, studio, at, mibbett, hollow', is, in, a,...",--,1,1,1,Real Bed,"[Wireless Internet, Kitchen, Free Parking on P...",0.0,15.0,85.0


In [16]:
# remove unwanted characters
datos["descripcion_vecindad"] = datos["descripcion_vecindad"].str.replace(
    r"[^a-zA-Z0-9\-\']", " ", regex=True
)

# handle standalone hyphens
datos["descripcion_vecindad"] = datos["descripcion_vecindad"].str.replace(
    r"(?<!\w)-(?!\w)", " ", regex=True
)

# convert to a list of words
datos["descripcion_vecindad"] = datos["descripcion_vecindad"].str.split()
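
Since both description columns go through the same steps, they could be wrapped in one helper. The function below is a sketch, and its name `limpiar_y_tokenizar` is hypothetical. Because it lowercases first, the character class only needs `a-z`.

```python
import pandas as pd


def limpiar_y_tokenizar(serie: pd.Series) -> pd.Series:
    """Lowercase, strip unwanted characters and split into word lists."""
    return (
        serie.str.lower()
        .str.replace(r"[^a-z0-9\-']", " ", regex=True)   # keep letters, digits, - and '
        .str.replace(r"(?<!\w)-(?!\w)", " ", regex=True)  # drop standalone hyphens
        .str.split()
    )


# Hypothetical sample, not taken from the dataset
muestra = pd.Series(["Ballard is lovely, vibrant - and one-of-a-kind!", "--"])
tokens = limpiar_y_tokenizar(muestra)
print(tokens.tolist())
```

A value made only of hyphens, such as the `--` placeholder visible in the tables above, ends up as an empty list under these patterns.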