In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ManagementAirbnb').getOrCreate()

# Exercise - Searching for Accommodation on Airbnb

Suppose we are an [Airbnb](http://www.airbnb.com) agent based in Lisbon and have to handle requests from several clients. We have a file called `airbnb.csv` (in the `data` folder) with information on every Airbnb listing in Lisbon.

In [2]:
# ------ Load data ------
airbnbCsvPath = "data/airbnb.csv"

# Read the CSV with a header row and let Spark infer the column types
airbnbDF = (spark.read
  .option("sep", ",")
  .option("header", True)
  .option("inferSchema", True)
  .csv(airbnbCsvPath))

airbnbDF.show(n=5, truncate=False)

+-------+-------+---------------+-----------------+-------+--------------------+------------+--------+-----+
|room_id|host_id|room_type      |neighborhood     |reviews|overall_satisfaction|accommodates|bedrooms|price|
+-------+-------+---------------+-----------------+-------+--------------------+------------+--------+-----+
|6499   |14455  |Entire home/apt|Belém            |8      |5.0                 |2           |1.0     |57.0 |
|17031  |66015  |Entire home/apt|Alvalade         |0      |0.0                 |2           |1.0     |46.0 |
|25659  |107347 |Entire home/apt|Santa Maria Maior|63     |5.0                 |3           |1.0     |69.0 |
|29248  |125768 |Entire home/apt|Santa Maria Maior|225    |4.5                 |4           |1.0     |58.0 |
|29396  |126415 |Entire home/apt|Santa Maria Maior|132    |5.0                 |4           |1.0     |67.0 |
+-------+-------+---------------+-----------------+-------+--------------------+------------+--------+-----+
only showing top 5 rows

In [4]:
# Show basic information about the data: columns, counts, etc.
airbnbDF.describe().show(vertical=True)

-RECORD 0------------------------------------
 summary              | count                
 room_id              | 13232                
 host_id              | 13232                
 room_type            | 13232                
 neighborhood         | 13232                
 reviews              | 13232                
 overall_satisfaction | 13222                
 accommodates         | 13232                
 bedrooms             | 13232                
 price                | 13232                
-RECORD 1------------------------------------
 summary              | mean                 
 room_id              | 1.0550814671327086E7 
 host_id              | 3.616443520435309E7  
 room_type            | null                 
 neighborhood         | null                 
 reviews              | 29.13006348246675    
 overall_satisfaction | 3.2846770533958556   
 accommodates         | 3.917775090689238    
 bedrooms             | 1.5495012091898428   
 price                | 86.5923518

In [12]:
airbnbDF.show(10)

+-------+-------+---------------+-----------------+-------+--------------------+------------+--------+-------+
|room_id|host_id|      room_type|     neighborhood|reviews|overall_satisfaction|accommodates|bedrooms|  price|
+-------+-------+---------------+-----------------+-------+--------------------+------------+--------+-------+
|   6499|  14455|Entire home/apt|            Belém|      8|                 5.0|           2|     1.0|  57.00|
|  17031|  66015|Entire home/apt|         Alvalade|      0|                 0.0|           2|     1.0|  46.00|
|  25659| 107347|Entire home/apt|Santa Maria Maior|     63|                 5.0|           3|     1.0|  69.00|
|  29248| 125768|Entire home/apt|Santa Maria Maior|    225|                 4.5|           4|     1.0|  58.00|
|  29396| 126415|Entire home/apt|Santa Maria Maior|    132|                 5.0|           4|     1.0|  67.00|
|  29720| 128075|Entire home/apt|          Estrela|     14|                 5.0|          16|     9.0|1154.00|
|

In [13]:
airbnbDF.select(airbnbDF.room_type).distinct().show()

+---------------+
|      room_type|
+---------------+
|    Shared room|
|Entire home/apt|
|   Private room|
+---------------+



In [7]:
airbnbDF.printSchema()

root
 |-- room_id: integer (nullable = true)
 |-- host_id: integer (nullable = true)
 |-- room_type: string (nullable = true)
 |-- neighborhood: string (nullable = true)
 |-- reviews: integer (nullable = true)
 |-- overall_satisfaction: double (nullable = true)
 |-- accommodates: integer (nullable = true)
 |-- bedrooms: double (nullable = true)
 |-- price: double (nullable = true)



In [3]:
from pyspark.sql.types import DecimalType

# Cast 'price' to DecimalType(10, 2)
airbnbDF = airbnbDF.withColumn('price',airbnbDF.price.cast(DecimalType(10,2)))
airbnbDF.printSchema()

root
 |-- room_id: integer (nullable = true)
 |-- host_id: integer (nullable = true)
 |-- room_type: string (nullable = true)
 |-- neighborhood: string (nullable = true)
 |-- reviews: integer (nullable = true)
 |-- overall_satisfaction: double (nullable = true)
 |-- accommodates: integer (nullable = true)
 |-- bedrooms: double (nullable = true)
 |-- price: decimal(10,2) (nullable = true)



Specifically, the dataset has the following variables:
- room_id: the property's identifier
- host_id: the identifier of the property's owner
- room_type: type of property (entire home / shared room / private room)
- neighborhood: the Lisbon neighborhood
- reviews: the number of reviews
- overall_satisfaction: the property's average rating
- accommodates: the number of people the property can host
- bedrooms: the number of bedrooms
- price: the price (in euros) per night
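
The column list above can also be written down as an explicit schema, which avoids the extra pass over the file that `inferSchema` makes and lets `price` be read as a decimal directly. The sketch below only builds the DDL string in plain Python. The type choices mirror the `printSchema()` output and the `DecimalType(10, 2)` cast, and the names `AIRBNB_COLUMNS` and `airbnb_schema_ddl` are made up for illustration.

```python
# Hypothetical column list: names from the dataset description,
# types taken from the printSchema() output plus the decimal cast on price.
AIRBNB_COLUMNS = [
    ("room_id", "INT"),
    ("host_id", "INT"),
    ("room_type", "STRING"),
    ("neighborhood", "STRING"),
    ("reviews", "INT"),
    ("overall_satisfaction", "DOUBLE"),
    ("accommodates", "INT"),
    ("bedrooms", "DOUBLE"),
    ("price", "DECIMAL(10,2)"),
]

# Join the (name, type) pairs into a DDL-formatted schema string.
airbnb_schema_ddl = ", ".join(f"{name} {dtype}" for name, dtype in AIRBNB_COLUMNS)
print(airbnb_schema_ddl)
```

Assuming the `spark` session from the first cell, the string could be passed as `spark.read.option("header", True).schema(airbnb_schema_ddl).csv("data/airbnb.csv")`, since `DataFrameReader.schema` accepts either a `StructType` or a DDL string like this one.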

<h3>Creating the View</h3>

In [4]:
airbnbDF.createOrReplaceTempView('properties')

## Using SQL and pyspark.sql

### Case 1

Alicia is going to Lisbon for a week with her husband and their 2 children. They are looking for an apartment with separate bedrooms for the parents and the children. They do not care about location or price; they simply want a pleasant experience. This means they only accept places with more than 10 reviews and a rating above 4. When selecting listings for Alicia, we must sort them from best to worst rating. Among listings with the same rating, those with more reviews come first. We must give her 3 alternatives.

In [5]:
# Alicia needs an apartment for 4 people with separate bedrooms for parents and
# children: entire homes with at least 2 bedrooms, best rated first, then most reviewed.
query = """SELECT * FROM properties
        WHERE room_type = 'Entire home/apt' AND
        accommodates >= 4 AND
        bedrooms >= 2 AND
        reviews > 10 AND
        overall_satisfaction > 4.0
        ORDER BY overall_satisfaction DESC, reviews DESC
        """
case1DF = spark.sql(query)
case1DF.show(3)



In [6]:
# Entire homes for 4+ guests with 2+ bedrooms, best rated first, ties broken by review count
cond = ((airbnbDF.room_type == "Entire home/apt") & (airbnbDF.accommodates >= 4) & (airbnbDF.bedrooms >= 2)
        & (airbnbDF.reviews > 10) & (airbnbDF.overall_satisfaction > 4.0))
airbnbDF.where(cond).orderBy(airbnbDF.overall_satisfaction.desc(), airbnbDF.reviews.desc()).show(3)



### Case 2

Roberto is a landlord with a house listed on Airbnb. From time to time he calls us asking about the reviews of his listing. Today he is particularly upset, because his sister Clara has listed a house on Airbnb and Roberto wants to make sure his house has more reviews than Clara's. We have to create a dataframe with both properties. The ids of Roberto's and Clara's houses are 97503 and 90387 respectively. Finally, we save this dataframe as an Excel file called "roberto.xls".

In [7]:
query = """SELECT * FROM properties
        WHERE room_id IN (97503, 90387)
        """
case2DF = spark.sql(query)
# case2DF.show()

# Output directory (Spark writes a folder of part files, not a single file)
path_out = 'data/exercises/compareRobertoClara'

# Writer options: include the header row and use a comma as the field separator
options = {
    'header': True,
    'sep': ','
}

# Write as CSV
(
case2DF.write
 .format("csv")
 .mode("overwrite")
 .options(**options)
 .save(path_out)
)
    
# Export as XLS


Py4JJavaError: An error occurred while calling o67.save.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
	at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
	at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645)
	at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230)
	at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435)
	at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
	at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:678)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:332)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:402)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:375)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:182)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:220)
	... 32 more
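
The last `Caused by` line of the trace, an `UnsatisfiedLinkError` on `NativeIO$Windows.access0`, usually points at Hadoop's native Windows binaries (`winutils.exe` / `hadoop.dll`) being missing or not matching the Hadoop version bundled with Spark, rather than at the query itself. Since the result here is only two rows, one possible workaround is to bring them to the driver and write the file with Python's standard library. The helper below is a sketch under that assumption. `write_rows_csv`, the `owner` column, and the sample rows are hypothetical placeholders, not data from `airbnb.csv`.

```python
import csv
import os
import tempfile

def write_rows_csv(rows, path, sep=","):
    """Write a list of dicts (one per row) to a single CSV file with a header."""
    if not rows:
        raise ValueError("no rows to write")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()), delimiter=sep)
        writer.writeheader()
        writer.writerows(rows)

# Placeholder rows standing in for [r.asDict() for r in case2DF.collect()];
# "owner" is not a real column of the dataset.
sample_rows = [
    {"room_id": 97503, "owner": "Roberto"},
    {"room_id": 90387, "owner": "Clara"},
]
out_path = os.path.join(tempfile.mkdtemp(), "roberto.csv")
write_rows_csv(sample_rows, out_path)

# Read the file back to check what was written.
with open(out_path, encoding="utf-8") as f:
    written = f.read().splitlines()
print(written)
```

With the notebook's dataframe this would be called as `write_rows_csv([r.asDict() for r in case2DF.collect()], "roberto.csv")`. For the Excel file the exercise asks for, `case2DF.toPandas().to_excel(...)` is another option, assuming pandas and an Excel writer engine such as openpyxl are installed.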



### Case 3

Diana is going to Lisbon for 3 nights and wants to meet new people. She has a budget of 50€ for her accommodation. We must find her the 10 cheapest properties, giving preference to shared rooms *(room_type == Shared room)*, and among shared rooms we must choose those with the best rating.
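
One way to read Diana's request as a query is sketched below. It runs against an in-memory SQLite table so it can be tried without Spark, and the rows are invented for illustration rather than taken from `airbnb.csv`. Two assumptions are baked in: the 50€ budget is treated as the total for the 3 nights (so `price * 3 <= 50`), and the ordering puts shared rooms first, ranks shared rooms by rating, and falls back to price for everything else.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE properties (room_id INTEGER, room_type TEXT, "
    "overall_satisfaction REAL, price REAL)"
)
# Invented rows, only to exercise the query.
conn.executemany(
    "INSERT INTO properties VALUES (?, ?, ?, ?)",
    [
        (1, "Shared room", 4.5, 12.0),
        (2, "Shared room", 5.0, 15.0),
        (3, "Private room", 5.0, 10.0),
        (4, "Entire home/apt", 4.0, 16.0),
        (5, "Shared room", 5.0, 30.0),   # 3 nights cost 90, over budget
        (6, "Private room", 3.0, 16.5),
    ],
)

# Assumed reading: total budget of 50 for 3 nights, shared rooms first,
# shared rooms ranked by rating, remaining ties and other rooms by price.
case3_query = """
SELECT room_id, room_type, overall_satisfaction, price
FROM properties
WHERE price * 3 <= 50
ORDER BY
    CASE WHEN room_type = 'Shared room' THEN 0 ELSE 1 END,
    CASE WHEN room_type = 'Shared room' THEN overall_satisfaction END DESC,
    price ASC
LIMIT 10
"""
case3_rows = conn.execute(case3_query).fetchall()
for row in case3_rows:
    print(row)
```

The same statement should carry over to `spark.sql(case3_query)` against the `properties` view, since it only relies on `CASE`, `ORDER BY` and `LIMIT`. If the budget is meant per night instead, the filter becomes `price <= 50`.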