Test 2, completed by Cristóbal Novoa

Description
Yelp is a worldwide directory of services that lets its users rate services
(restaurants, banks, clinics, gyms, among others) in order to find and recommend
better ones.
For this test we will use the data published by Yelp to:
● Identify annoying users.
● Estimate the probability that a business closes.
The data was published on yelp.com.
For the practical purposes of the test:
● The available files are in the module's bucket, at
s3://bigdata-desafio/yelp-data/.
● That location contains several JSON files with:
○ Business records (business.json).
○ User check-ins at a business (checkin.json).
○ Photos associated with a review (photo.json).
○ User reviews of a service (review.json).
○ User tips about a service (tip.json).
○ User information (user.json).
The data structure of each JSON file is defined at the following
address provided by Yelp. (That link gives the column definitions
and the record types.)


Exercise 1: Identifying annoying users (4.6 points)
Using the file user.json.
Yelp is interested in identifying users who can be considered
annoying. To that end, they define an annoying user as follows:
● An annoying user is one whose average rating is less than or equal to
2, who has fewer than 100 reviews on average, and who has zero fans.
Based on this definition, you are asked to do the following:
● Flag in a dummy variable every user who can be classified as
annoying according to the criterion.
● Recodings in the file user.json:
○ friends, a string with the user_id of every other
user j who follows user i. The goal is to count the number of
friends.
○ elite, a string with every year in which user
i was considered an elite reviewer. The goal is to count the number
of years in which the user was considered elite.
○ Be sure to drop the following fields: friends, yelping_since,
name, elite, user_id.
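
As a quick check of the boundary cases in the definition above, the rule can be sketched in plain Python, independent of Spark. The function name is_annoying and the sample values are hypothetical and are not taken from user.json.

```python
def is_annoying(average_stars, review_count, fans):
    """Return 1 when all three criteria hold, otherwise 0."""
    return int(average_stars <= 2 and review_count < 100 and fans == 0)


# Hypothetical users, chosen only to exercise the thresholds.
print(is_annoying(2.0, 99, 0))   # 1: rating of exactly 2 still qualifies
print(is_annoying(2.0, 100, 0))  # 0: 100 reviews is not "fewer than 100"
print(is_annoying(4.03, 50, 5))  # 0: high rating and has fans
```

The rating threshold is inclusive (<=) while the review threshold is strict (<), which is easy to get wrong when writing the equivalent column expression.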

In [330]:
# Import libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import when
from pyspark.ml.feature import VectorAssembler

In [331]:
# Load the JSON file from S3
df = spark.read.json("s3://bigdata-desafio/yelp-data/user.json")

In [332]:
# Show the DataFrame schema
df.printSchema()

root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: string (nullable = true)
 |-- fans: long (nullable = true)
 |-- friends: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)

In [333]:
# Show the first record of the JSON file
df.first()

Row(average_stars=4.03, compliment_cool=1, compliment_cute=0, compliment_funny=1, compliment_hot=2, compliment_list=0, compliment_more=0, compliment_note=1, compliment_photos=0, compliment_plain=1, compliment_profile=0, compliment_writer=2, cool=25, elite='2015,2016,2017', fans=5, friends='c78V-rj8NQcQjOI8KP3UEA, alRMgPcngYSCJ5naFRBz5g, ajcnq75Z5xxkvUSmmJ1bCg, BSMAmp2-wMzCkhTfq9ToNg, jka10dk9ygX76hJG0gfPZQ, dut0e4xvme7QSlesOycHQA, l4l5lBnK356zBua7B-UJ6Q, 0HicMOOs-M_gl2eO-zES4Q, _uI57wL2fLyftrcSFpfSGQ, T4_Qd0YWbC3co6WSMw4vxg, iBRoLWPtWmsI1kdbE9ORSA, xjrUcid6Ymq0DoTJELkYyw, GqadWVzJ6At-vgLzK_SKgA, DvB13VJBmSnbFXBVBsKmDA, vRP9nQkYTeNioDjtxZlVhg, gT0A1iN3eeQ8EMAjJhwQtw, 6yCWjFPtp_AD4x93WAwmnw, 1dKzpNnib-JlViKv8_Gt5g, 3Bv4_JxHXq-gVLOxYMQX0Q, ikQyfu1iViYh8T0us7wiFQ, f1GGltNaB7K5DR1jf3dOmg, tgeFUChlh7v8bZFVl2-hjQ, -9-9oyXlqsMG2he5xIWdLQ, Adj9fBPVJad8vSs-mIP7gw, Ce49RY8CKXVsTifxRYFTsw, M1_7TLi8CbdA89nFLlH4iw, wFsNv-hqbW_F5-IRqfBN6g, 0Q1L7zXHocaUZ2gsG2XJeg, cBFgmOCBdhYa0xoFEAzp_g, VrD_AgiFvzqtl

In [334]:
# Flag annoying users: average rating <= 2, fewer than 100 reviews and zero fans
df = df.withColumn('dummy', when((df['average_stars'] <= 2)\
                                         & (df['review_count'] < 100)\
                                         & (df['fans'] == 0), 1).otherwise(0))

In [335]:
# Show the generated column
df.select('dummy').show()

+-----+
|dummy|
+-----+
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
|    0|
+-----+
only showing top 20 rows

In [109]:
# Drop average_stars, review_count and fans (the columns used to build the label)
drop_list = ['average_stars', 'review_count', 'fans']
df = df.select([column for column in df.columns if column not in drop_list])

In [110]:
# Check the changes
df.printSchema()

root
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: string (nullable = true)
 |-- friends: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)
 |-- dummy: integer (nullable = false)

In [111]:
# Inspect the friends column
df.select('friends').show()

+--------------------+
|             friends|
+--------------------+
|c78V-rj8NQcQjOI8K...|
|kEBTgDvFX754S68Fl...|
|4N-HU_T32hLENLnts...|
|RZ6wS38wnlXyj-OOd...|
|mbwrZ-RS76V1HoJ0b...|
|AJxDPGVTzefy3vSHW...|
|RJQTcJVlBsJ3_Yo0J...|
|d1z7Xc9RG5TVBkdUP...|
|ctr_BlCf3Ogny-vLs...|
|N-xeG3U6rUkjVtQ0o...|
|CfGCj80EdA-xS-mTW...|
|tYyOnNs7tBfqAT9IC...|
|Dg8_xYNvjVC6KGNRc...|
|_5p_nO7OczVP7czj_...|
|mFwRTTDW0Yr-rFkTF...|
|xVv_pVxAcOfY_xBjB...|
|Xqo1ru1F7srvbUJaC...|
|9ljCrn-qgfAMRS5wL...|
|zH2whhmSEhwKUqPjg...|
|a4kpoV3nrxRPUHR6z...|
+--------------------+
only showing top 20 rows

Each value is a string containing the user_id of every friend, as stated in the test description.

In [112]:
# Import the functions used to split the string and count the users
from pyspark.sql.functions import split, col, count
import pyspark.sql.functions as f

In [113]:
# Split the friends column on commas and store the number of elements in friend_count
df = df.withColumn('friend_count', f.size(f.split(f.col('friends'), ',')))

In [114]:
# Check the changes
df.select('friend_count').show()

+------------+
|friend_count|
+------------+
|          99|
|        1152|
|          15|
|         525|
|         231|
|        5450|
|        4326|
|        1193|
|         382|
|         898|
|         194|
|          83|
|         582|
|          25|
|         248|
|         367|
|         286|
|         258|
|        3451|
|          46|
+------------+
only showing top 20 rows

In [115]:
# Inspect the elite column
df.select('elite').show()

+--------------------+
|               elite|
+--------------------+
|      2015,2016,2017|
|                    |
|                    |
|                    |
| 2015,2016,2017,2018|
| 2015,2016,2017,2018|
|2006,2007,2008,20...|
|                    |
|2006,2007,2008,20...|
|                    |
|           2017,2018|
|                    |
|                    |
|                    |
|                    |
|                    |
|2011,2012,2013,20...|
|                    |
|2014,2015,2016,20...|
|                    |
+--------------------+
only showing top 20 rows

The column has a structure similar to friends: a comma-separated string, which is empty for users who were never elite.

In [116]:
# Split the elite column on commas and store the number of elements in elite_count
df = df.withColumn('elite_count', f.size(f.split(f.col('elite'), ',')))

In [117]:
# Check the changes
df.select('elite_count').show()

+-----------+
|elite_count|
+-----------+
|          3|
|          1|
|          1|
|          1|
|          4|
|          4|
|          8|
|          1|
|          7|
|          1|
|          2|
|          1|
|          1|
|          1|
|          1|
|          1|
|          6|
|          1|
|          5|
|          1|
+-----------+
only showing top 20 rows
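
Note that in the output above the rows whose elite string is empty get elite_count = 1, because splitting an empty string yields one empty element. The plain-Python sketch below reproduces that behaviour and shows a variant that returns 0 for empty values. The helper name count_items is hypothetical and not part of the notebook; in Spark, one possible equivalent would be to wrap the size(split(...)) expression in a when() that checks for the empty string.

```python
def count_items(value, sep=","):
    """Count the non-empty items in a separator-delimited string."""
    if value is None:
        return 0
    return sum(1 for item in value.split(sep) if item.strip())


print(len("".split(",")))             # 1: same behaviour as elite_count above
print(count_items(""))                # 0
print(count_items("2015,2016,2017"))  # 3
```

Whether a never-elite user should count as 0 or 1 matters little for tree models, but the value 0 matches the stated goal of counting elite years.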

In [118]:
# Show the schema with the new columns
df.printSchema()

root
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: string (nullable = true)
 |-- friends: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- yelping_since: string (nullable = true)
 |-- dummy: integer (nullable = false)
 |-- friend_count: integer (nullable = false)
 |-- elite_count: integer (nullable = false)

In [119]:
# Drop the columns listed in the exercise
drop_list = ['friends', 'yelping_since', 'name', 'elite', 'user_id']
df = df.select([column for column in df.columns if column not in drop_list])

In [120]:
# Check the changes
df.printSchema()

root
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- funny: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- dummy: integer (nullable = false)
 |-- friend_count: integer (nullable = false)
 |-- elite_count: integer (nullable = false)

In [121]:
# Show the first row of the DataFrame to compare it with the one shown in the test
df.first()

Row(compliment_cool=1, compliment_cute=0, compliment_funny=1, compliment_hot=2, compliment_list=0, compliment_more=0, compliment_note=1, compliment_photos=0, compliment_plain=1, compliment_profile=0, compliment_writer=2, cool=25, funny=17, useful=84, dummy=0, friend_count=99, elite_count=3)

Requirements
All objectives must be solved using pyspark.
● Measure the annoying users based on the stated criteria. (0.8 points)
● Split the sample into training (70% of the data) and validation (30% of the data) sets. (0.4 points)
● Train three models (LogisticRegression, GBTClassifier and DecisionTreeClassifier), without modifying hyperparameters, that classify annoying users based on the attributes available in the file user.json. (2.2 points)
● Report which model is best according to the AUC metric. (0.4 points)
● Identify the main attributes associated with an annoying user and report them. (0.8 points)

In [122]:
# Count annoying and non-annoying users
df.groupBy('dummy').count().show()

+-----+-------+
|dummy|  count|
+-----+-------+
|    1| 183676|
|    0|1453462|
+-----+-------+

The query above gives the measurement: 183,676 users (about 11.2% of the total) are flagged as annoying and 1,453,462 are not.

In [123]:
# Rename the dummy column to label and store the result in a new DataFrame
train_df = df.withColumnRenamed('dummy','label')

In [124]:
# Build the feature list: every column except label
feats = train_df.columns
feats.remove('label')
print(feats)

['compliment_cool', 'compliment_cute', 'compliment_funny', 'compliment_hot', 'compliment_list', 'compliment_more', 'compliment_note', 'compliment_photos', 'compliment_plain', 'compliment_profile', 'compliment_writer', 'cool', 'funny', 'useful', 'friend_count', 'elite_count']

In [125]:
# Assemble the feature columns into a single vector column and keep only the label and the features
assemble_feats = VectorAssembler(inputCols = feats, outputCol = 'assembled_features')
assemble_feats = assemble_feats.transform(train_df)
assemble_feats = assemble_feats.select('label', 'assembled_features')

In [126]:
# Show the first assembled row
assemble_feats.take(1)

[Row(label=0, assembled_features=DenseVector([1.0, 0.0, 1.0, 2.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 2.0, 25.0, 17.0, 84.0, 99.0, 3.0]))]

In [127]:
# Split into train (70%) and test (30%) sets
train, test = assemble_feats.randomSplit([0.7, 0.3], seed=4982)
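
A seeded random split of this kind assigns each row independently, so the resulting proportions are approximately, not exactly, 70/30. The plain-Python sketch below illustrates the idea under that assumption. The function random_split and the stand-in rows are hypothetical, and this is not Spark's actual implementation.

```python
import random


def random_split(rows, train_fraction=0.7, seed=4982):
    """Assign each row to train or test with an independent seeded draw."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        target = train_rows if rng.random() < train_fraction else test_rows
        target.append(row)
    return train_rows, test_rows


rows = list(range(10_000))  # hypothetical stand-in for the assembled rows
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, which matters here because all three models are compared on the same test set.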

In [128]:
# Import the classifiers and the evaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()

In [129]:
# Create the estimator, explicitly naming the feature, label and prediction columns
logistic_model = LogisticRegression(featuresCol='assembled_features',labelCol='label',
                    predictionCol='annoying_pred')

In [130]:
# Train the logistic regression model
logistic_model = logistic_model.fit(train)

In [131]:
# Inspect the model output on the test set: raw prediction, probability and predicted class
logistic_model.transform(test).show(5)

+-----+--------------------+--------------------+--------------------+-------------+
|label|  assembled_features|       rawPrediction|         probability|annoying_pred|
+-----+--------------------+--------------------+--------------------+-------------+
|    0|(16,[0,1,2,3,6,11...|[10.9736645074473...|[0.99998285290274...|          0.0|
|    0|(16,[0,1,2,5,11,1...|[10.2452000648680...|[0.99996447363964...|          0.0|
|    0|(16,[0,1,2,6,11,1...|[13.6501356778461...|[0.99999882016605...|          0.0|
|    0|(16,[0,1,2,7,11,1...|[14.2161432830640...|[0.99999933010439...|          0.0|
|    0|(16,[0,1,2,8,11,1...|[11.1106013003195...|[0.99998504726382...|          0.0|
+-----+--------------------+--------------------+--------------------+-------------+
only showing top 5 rows

In [132]:
# Report the area under the ROC curve
print(f"Area Under ROC: {logistic_model.evaluate(test).areaUnderROC}")

Area Under ROC: 0.737494065984864

In [133]:
# Create the estimator, explicitly naming the feature, label and prediction columns
gbt_model = GBTClassifier(featuresCol='assembled_features',labelCol='label',
                    predictionCol='annoying_pred')

In [134]:
# Train the gradient boosting model
gbt_model = gbt_model.fit(train)

In [135]:
# Inspect the model output on the test set: raw prediction, probability and predicted class
gbt_model.transform(test).show(5)

+-----+--------------------+--------------------+--------------------+-------------+
|label|  assembled_features|       rawPrediction|         probability|annoying_pred|
+-----+--------------------+--------------------+--------------------+-------------+
|    0|(16,[0,1,2,3,6,11...|[1.53321390234645...|[0.95548648655102...|          0.0|
|    0|(16,[0,1,2,5,11,1...|[1.53454215608874...|[0.95559933667287...|          0.0|
|    0|(16,[0,1,2,6,11,1...|[1.54087814306956...|[0.95613390555061...|          0.0|
|    0|(16,[0,1,2,7,11,1...|[1.54087814306956...|[0.95613390555061...|          0.0|
|    0|(16,[0,1,2,8,11,1...|[1.51253829786626...|[0.95369423285479...|          0.0|
+-----+--------------------+--------------------+--------------------+-------------+
only showing top 5 rows

In [136]:
# Store the predictions on the test set in gbt
gbt = gbt_model.transform(test)

In [137]:
# Report the area under the ROC curve
print(f"Area Under ROC: {evaluator.evaluate(gbt)}")

Area Under ROC: 0.7483202148423891

In [138]:
# Create the estimator, explicitly naming the feature, label and prediction columns
dct_model = DecisionTreeClassifier(featuresCol='assembled_features',labelCol='label',
                    predictionCol='annoying_pred')

In [139]:
# Train the decision tree classifier
dct_model = dct_model.fit(train)

In [140]:
# Inspect the model output on the test set: raw prediction, probability and predicted class
dct_model.transform(test).show(5)

+-----+--------------------+--------------------+--------------------+-------------+
|label|  assembled_features|       rawPrediction|         probability|annoying_pred|
+-----+--------------------+--------------------+--------------------+-------------+
|    0|(16,[0,1,2,3,6,11...|[1016844.0,129027.0]|[0.88739831970614...|          0.0|
|    0|(16,[0,1,2,5,11,1...|[1016844.0,129027.0]|[0.88739831970614...|          0.0|
|    0|(16,[0,1,2,6,11,1...|[1016844.0,129027.0]|[0.88739831970614...|          0.0|
|    0|(16,[0,1,2,7,11,1...|[1016844.0,129027.0]|[0.88739831970614...|          0.0|
|    0|(16,[0,1,2,8,11,1...|[1016844.0,129027.0]|[0.88739831970614...|          0.0|
+-----+--------------------+--------------------+--------------------+-------------+
only showing top 5 rows

In [141]:
# Store the predictions on the test set in dct
dct = dct_model.transform(test)

In [142]:
# Report the area under the ROC curve
print(f"Area Under ROC: {evaluator.evaluate(dct)}")

Area Under ROC: 0.5
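
In the decision tree output above, the five rows shown share the same rawPrediction, [1016844.0, 129027.0]. Those counts add up to 1,145,871, roughly 70% of the 1,637,138 users, which is consistent with every row landing in a single node that holds the whole training set. If the score is the same for every row, the AUC is 0.5 by construction. The plain-Python sketch below illustrates this with the pairwise definition of AUC; the function auc and the sample labels are hypothetical.

```python
def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 0, 1, 1]      # hypothetical labels
constant = [0.1126] * 5       # the same score for every row
print(auc(labels, constant))                   # 0.5
print(auc(labels, [0.1, 0.2, 0.3, 0.8, 0.9]))  # 1.0
```

A score of 0.5 therefore says that the tree ranks users no better than chance, not that it misclassifies half of them.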

The best model by AUC is the GBTClassifier (0.748), ahead of LogisticRegression (0.737) and DecisionTreeClassifier (0.5).

In [143]:
# Show the feature importances of the GBT model
gbt_model.featureImportances

SparseVector(16, {0: 0.0111, 1: 0.0011, 3: 0.008, 5: 0.0002, 6: 0.0001, 7: 0.0045, 8: 0.0064, 9: 0.0018, 10: 0.0021, 11: 0.8687, 12: 0.0416, 13: 0.014, 14: 0.0403})

In [144]:
# Report the attributes in descending order of importance
importances = list(zip(gbt_model.featureImportances, feats))
importances.sort(reverse = True)
importances

[(0.8687327862337242, 'cool'), (0.04156570055572947, 'funny'), (0.0403120355947416, 'friend_count'), (0.013982980798536482, 'useful'), (0.011113685928693854, 'compliment_cool'), (0.008023341010039941, 'compliment_hot'), (0.006441703276079543, 'compliment_plain'), (0.004499575710379125, 'compliment_photos'), (0.002113169512751205, 'compliment_writer'), (0.0018062440514676147, 'compliment_profile'), (0.0010953507145389232, 'compliment_cute'), (0.00017452265202530208, 'compliment_more'), (0.00013890396129270303, 'compliment_note'), (0.0, 'elite_count'), (0.0, 'compliment_list'), (0.0, 'compliment_funny')]

According to the GBT feature importances, the most important attributes are cool (0.869), funny (0.042) and friend_count (0.040).
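
To make the link between vector positions and column names explicit, the sketch below maps the rounded values printed in the SparseVector output to the feature list printed earlier, then ranks them. It is plain Python over the values copied from those outputs; the variable names are hypothetical.

```python
# Feature names in the order used by the VectorAssembler (from the printed list).
feats = ['compliment_cool', 'compliment_cute', 'compliment_funny',
         'compliment_hot', 'compliment_list', 'compliment_more',
         'compliment_note', 'compliment_photos', 'compliment_plain',
         'compliment_profile', 'compliment_writer', 'cool', 'funny',
         'useful', 'friend_count', 'elite_count']

# Non-zero importances by position (rounded values from the SparseVector output).
sparse = {0: 0.0111, 1: 0.0011, 3: 0.008, 5: 0.0002, 6: 0.0001, 7: 0.0045,
          8: 0.0064, 9: 0.0018, 10: 0.0021, 11: 0.8687, 12: 0.0416,
          13: 0.014, 14: 0.0403}

# Pair each position with its name and sort by importance, largest first.
ranked = sorted(((feats[i], value) for i, value in sparse.items()),
                key=lambda pair: pair[1], reverse=True)
top3 = ranked[:3]
share = round(sum(value for _, value in top3), 4)
print(top3)
print(share)
```

With these rounded values the top three attributes account for about 95% of the total importance, and cool alone for about 87%.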

Exercise 2: Identifying the probability that a service closes (5.4
points)
Using the file business.json.
Yelp is interested in predicting the probability that a service closes based on
the reviews and characteristics of a business. The first iteration of the model is therefore to
identify which factors are associated with closure.
The Yelp development team gives you a file called
recoding_business_schema.py that describes:
● Attributes to recode.
● Attributes to keep.
This file serves as a guide and does not implement the recoding on the
pyspark.sql.dataframe.DataFrame; that is your task.
In addition, note that this file does not include the recoding of the target
vector (is_open). You must recode it so that services that closed are
identified as 1 and the rest as 0.

Requirements
All objectives must be solved using pyspark.
● Implement the recoding schema. (0.8 points)
● Recode the target vector. (0.8 points)
● Split the sample into training (70% of the data)
and validation (30% of the data) sets. (0.4 points)
● Train three models (LogisticRegression, GBTClassifier and
DecisionTreeClassifier), without modifying hyperparameters, that classify
closed services based on the recoded attributes of the file business.json.
(2.2 points)
● Report which model is best according to the AUC metric. (0.4 points)
● Identify the main attributes associated with the closure of a service.
(0.8 points)
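
The target recoding inverts the meaning of is_open. As a minimal plain-Python sketch, assuming is_open is stored as 1 for open and 0 for closed (an assumption about the data, not confirmed in this notebook), the rule looks like this. The function name recode_closed is hypothetical; in Spark the same rule could be written with when(col('is_open') == 0, 1).otherwise(0).

```python
def recode_closed(is_open):
    """Return 1 for a closed service (is_open == 0), otherwise 0."""
    return 1 if is_open == 0 else 0


print(recode_closed(0))  # 1: the service closed
print(recode_closed(1))  # 0: the service is still open
```

With this convention, a missing is_open value is treated as "not closed", which follows the statement's "0 for the rest".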

In [261]:
# Load the JSON file from S3
load_business_data = spark.read.json("s3://bigdata-desafio/yelp-data/business.json")

In [262]:
# Show the schema of the JSON file
load_business_data.printSchema()

root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: string (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: string (nullable = true)
 |    |-- BYOB: string (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: string (nullable = true)
 |    |-- BikeParking: string (nullable = true)
 |    |-- BusinessAcceptsBitcoin: string (nullable = true)
 |    |-- BusinessAcceptsCreditCards: string (nullable = true)
 |    |-- BusinessParking: string (nullable = true)
 |    |-- ByAppointmentOnly: string (nullable = true)
 |    |-- Caters: string (nullable = true)
 |    |-- CoatCheck: string (nullable = true)
 |    |-- Corkage: string (nullable = true)
 |    |-- DietaryRestrictions: string (nullable = true)
 |    |-- DogsAllowed: string (nullable = true)
 |    |-- DriveThru: string (nullable = true)
 |    |-- GoodForDancing: str

In [263]:
# Show the distinct values of AcceptsInsurance
load_business_data\
    .select("attributes.AcceptsInsurance")\
    .distinct()\
    .show()

+----------------+
|AcceptsInsurance|
+----------------+
|            null|
|            True|
|            None|
|           False|
+----------------+

In [264]:
# Recode AcceptsInsurance: 1 if True, otherwise 0
load_business_data = load_business_data\
    .withColumn('accepts_insurance',
        when((col('attributes.AcceptsInsurance') == 'True')\
             | (col('attributes.AcceptsInsurance') == "\'True\'")\
             | (col('attributes.AcceptsInsurance') == "u\'True\'"), 1)\
        .otherwise(0))

In [265]:
# Compare the original and recoded columns to check the change
load_business_data\
    .select("attributes.AcceptsInsurance","accepts_insurance")\
    .show(1)

+----------------+-----------------+
|AcceptsInsurance|accepts_insurance|
+----------------+-----------------+
|            null|                0|
+----------------+-----------------+
only showing top 1 row

In [266]:
# Inspect the distinct labels of the column
load_business_data\
    .select("attributes.AgesAllowed")\
    .distinct()\
    .show()

+-----------+
|AgesAllowed|
+-----------+
|       null|
|  u'19plus'|
|  u'21plus'|
| u'allages'|
|       None|
|  u'18plus'|
+-----------+

In [267]:
# Recode AgesAllowed as specified: 1 for allages, otherwise 0
load_business_data = load_business_data\
    .withColumn('all_ages_allowed',
        when((col('attributes.AgesAllowed') == 'allages')\
            | (col('attributes.AgesAllowed') == "\'allages\'")\
            | (col('attributes.AgesAllowed') == "u\'allages\'"), 1)\
        .otherwise(0))


In [268]:
# Compare the original and recoded columns
load_business_data\
    .select("attributes.AgesAllowed","all_ages_allowed")\
    .show(1)

+-----------+----------------+
|AgesAllowed|all_ages_allowed|
+-----------+----------------+
|       null|               0|
+-----------+----------------+
only showing top 1 row

In [269]:
# Inspect the distinct labels of the column
load_business_data\
    .select("attributes.Alcohol")\
    .distinct()\
    .show()

+----------------+
|         Alcohol|
+----------------+
|            null|
|          'none'|
| 'beer_and_wine'|
|      'full_bar'|
|            None|
|     u'full_bar'|
|         u'none'|
|u'beer_and_wine'|
+----------------+
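
The distinct values above show the same label in several spellings (for example 'full_bar' and u'full_bar'), which look like leftover Python string representations. As a sketch, a small helper can strip the optional u prefix and the quotes so that each recoding needs a single comparison. The names normalize_label and serves_alcohol are hypothetical and are not part of recoding_business_schema.py; in Spark a similar effect could be obtained with regexp_replace.

```python
import re

# Matches values wrapped as 'label' or u'label' and captures the label.
_WRAPPED = re.compile(r"^u?'(.*)'$")


def normalize_label(value):
    """Strip the u'...' wrapper and map null-like values to None."""
    if value is None:
        return None
    value = value.strip()
    match = _WRAPPED.match(value)
    if match:
        value = match.group(1)
    return None if value == "None" else value


def serves_alcohol(value):
    """Return 1 for beer_and_wine or full_bar in any spelling, otherwise 0."""
    return int(normalize_label(value) in {"beer_and_wine", "full_bar"})


print(normalize_label("u'full_bar'"))   # full_bar
print(serves_alcohol("'beer_and_wine'"))  # 1
print(serves_alcohol("u'none'"))          # 0
```

Normalizing first keeps each recoding rule to one set of labels and avoids silently missing a spelling that was not listed.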

In [270]:
# Recode Alcohol as specified: 1 for beer_and_wine or full_bar, otherwise 0
load_business_data = load_business_data\
    .withColumn('alcohol_consumption',
        when((col('attributes.Alcohol') == 'beer_and_wine')\
             | (col('attributes.Alcohol') == "\'beer_and_wine\'")\
             | (col('attributes.Alcohol') == "u\'beer_and_wine\'")\
             | (col('attributes.Alcohol') == 'full_bar')\
             | (col('attributes.Alcohol') == "\'full_bar\'")\
             | (col('attributes.Alcohol') == "u\'full_bar\'"), 1)\
        .otherwise(0))


In [271]:
# Show the first row to check the change
load_business_data\
    .select("attributes.Alcohol","alcohol_consumption")\
    .show(1)

+-------+-------------------+
|Alcohol|alcohol_consumption|
+-------+-------------------+
|   null|                  0|
+-------+-------------------+
only showing top 1 row

In [272]:
# Inspect the distinct labels of the column
load_business_data\
    .select("attributes.BusinessAcceptsBitcoin")\
    .distinct()\
    .show()

+----------------------+
|BusinessAcceptsBitcoin|
+----------------------+
|                  null|
|                  True|
|                  None|
|                 False|
+----------------------+

In [273]:
# Recode as instructed: True becomes 1, anything else 0
load_business_data = load_business_data\
    .withColumn('bitcoin_friendly',
        when((col('attributes.BusinessAcceptsBitcoin') == 'True')\
             | (col('attributes.BusinessAcceptsBitcoin') == True)\
             | (col('attributes.BusinessAcceptsBitcoin') == "\'True\'")\
             | (col('attributes.BusinessAcceptsBitcoin') == "u\'True\'"), 1)\
        .otherwise(0))


In [274]:
# Show the first row to check the change
load_business_data\
    .select("attributes.BusinessAcceptsBitcoin","bitcoin_friendly")\
    .show(1)

+----------------------+----------------+
|BusinessAcceptsBitcoin|bitcoin_friendly|
+----------------------+----------------+
|                  null|               0|
+----------------------+----------------+
only showing top 1 row

In [275]:
# Inspect the distinct labels in the column
load_business_data\
    .select("categories")\
    .distinct()\
    .show()

+--------------------+
|          categories|
+--------------------+
|Home Health Care,...|
|Cocktail Bars, It...|
|Used, Vintage & C...|
|Convenience Store...|
|Coffee & Tea, Bak...|
|Food, Fast Food, ...|
|Bikes, Local Serv...|
|Tires, Oil Change...|
|Movers, Professio...|
|Japanese, Restaur...|
|Bars, Party Bus R...|
|Coffee & Tea, Gro...|
|Bars, American (N...|
|Vape Shops, Head ...|
|Active Life, Gyms...|
|Sandwiches, Resta...|
|Heating & Air Con...|
|Automotive, Local...|
|Beauty & Spas, Ha...|
|Venues & Event Sp...|
+--------------------+
only showing top 20 rows

In [276]:
# Recode categories into food_related: 1 if the string mentions Food, Restaurants or Bars, otherwise 0
load_business_data = load_business_data\
    .withColumn('food_related',
        when((col('categories').rlike('Food'))\
             | (col('categories').rlike('Restaurants'))\
             | (col('categories').rlike('Bars')), 1)\
        .otherwise(0))


In [277]:
# Recode categories into finance_related: 1 if the string mentions Banks, Insurance or Finance, otherwise 0
load_business_data = load_business_data\
    .withColumn('finance_related', when(
        (col('categories').rlike('Banks'))\
         | (col('categories').rlike('Insurance'))\
         | (col('categories').rlike('Finance')), 1)\
        .otherwise(0))

In [278]:
# Recode categories into health_related: 1 if the string mentions Fitness, Hospitals or Health, otherwise 0
load_business_data = load_business_data\
    .withColumn('health_related', when(
        (col('categories').rlike('Fitness'))\
        | (col('categories').rlike('Hospitals'))\
        | (col('categories').rlike('Health')), 1)\
            .otherwise(0))
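
The three category recodes rely on `rlike`, which matches a keyword anywhere inside the comma-separated `categories` string. The sketch below, in plain Python, contrasts that substring behaviour with an exact match on the individual category names. The helper names and the two sample rows are hypothetical, chosen only to illustrate the difference, and plain `in` stands in for `rlike` on the assumption that the keywords contain no regex metacharacters.

```python
from typing import Iterable, Optional, Set


def category_tokens(categories: Optional[str]) -> Set[str]:
    """Split the comma-separated categories string into a set of names."""
    if categories is None:
        return set()
    return {token.strip() for token in categories.split(",") if token.strip()}


def substring_flag(categories: Optional[str], keywords: Iterable[str]) -> int:
    """Mimics rlike(keyword): 1 if any keyword occurs anywhere in the string."""
    if categories is None:
        return 0
    return int(any(keyword in categories for keyword in keywords))


def exact_flag(categories: Optional[str], keywords: Iterable[str]) -> int:
    """1 only if some category name equals one of the keywords exactly."""
    return int(bool(category_tokens(categories) & set(keywords)))


# Hypothetical rows for illustration
bar_row = "Cocktail Bars, Nightlife"
care_row = "Home Health Care, Professional Services"

print(substring_flag(bar_row, ["Food", "Restaurants", "Bars"]),
      exact_flag(bar_row, ["Food", "Restaurants", "Bars"]))
print(substring_flag(care_row, ["Fitness", "Hospitals", "Health"]),
      exact_flag(care_row, ["Fitness", "Hospitals", "Health"]))
```

Substring matching is the more inclusive choice: it counts "Cocktail Bars" as food related and "Home Health Care" as health related, which an exact match would miss.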


In [279]:
# Inspect the distinct labels in the column
load_business_data\
    .select("attributes.Smoking")\
    .distinct()\
    .show()

+----------+
|   Smoking|
+----------+
|      null|
|      'no'|
|     'yes'|
|     u'no'|
| 'outdoor'|
|    u'yes'|
|      None|
|u'outdoor'|
+----------+

In [280]:
# Recode yes and outdoor as 1, anything else as 0
load_business_data = load_business_data\
    .withColumn('smoking',when((col('attributes.Smoking') == '\'yes\'')\
                 |(col('attributes.Smoking') == 'u\'yes\'')\
                 |(col('attributes.Smoking') == 'yes')\
                 |(col('attributes.Smoking') == '\'outdoor\'')\
                 |(col('attributes.Smoking') == 'u\'outdoor\'')\
                 |(col('attributes.Smoking') == 'outdoor'), 1)\
           .otherwise(0))

In [281]:
# Show the first row to check the change
load_business_data\
    .select("attributes.Smoking","smoking")\
    .show(1)

+-------+-------+
|Smoking|smoking|
+-------+-------+
|   null|      0|
+-------+-------+
only showing top 1 row

In [282]:
# Inspect the distinct labels in the column
load_business_data\
    .select("attributes.WiFi")\
    .distinct()\
    .show()

+-------+
|   WiFi|
+-------+
|   null|
|   'no'|
|  u'no'|
| 'free'|
|u'paid'|
|u'free'|
|   None|
| 'paid'|
+-------+

In [283]:
# Recode into free_wifi: 1 if WiFi is free, otherwise 0
load_business_data = load_business_data\
    .withColumn('free_wifi',when((col('attributes.WiFi') == '\'free\'')\
                | (col('attributes.WiFi') == 'u\'free\'')\
                | (col('attributes.WiFi') == 'free'), 1)\
           .otherwise(0))


In [284]:
# Show the first row to check the change
load_business_data\
    .select("attributes.WiFi","free_wifi")\
    .show(1)

+----+---------+
|WiFi|free_wifi|
+----+---------+
|null|        0|
+----+---------+
only showing top 1 row

In [285]:
# Inspect the distinct labels in the column
load_business_data\
    .select("attributes.RestaurantsPriceRange2")\
    .distinct()\
    .show()

+----------------------+
|RestaurantsPriceRange2|
+----------------------+
|                  null|
|                     3|
|                  None|
|                     1|
|                     4|
|                     2|
+----------------------+

In [286]:
# Recode into expensive_restaurant: 1 if the price range is 3 or 4, otherwise 0
load_business_data = load_business_data\
    .withColumn('expensive_restaurant',when((col('attributes.RestaurantsPriceRange2') == 3)\
                 | (col('attributes.RestaurantsPriceRange2') == 4), 1)\
            .otherwise(0))


In [287]:
# Show the first row to check the change
load_business_data\
    .select("attributes.RestaurantsPriceRange2","expensive_restaurant")\
    .show(1)

+----------------------+--------------------+
|RestaurantsPriceRange2|expensive_restaurant|
+----------------------+--------------------+
|                  null|                   0|
+----------------------+--------------------+
only showing top 1 row

In [288]:
# Inspect the distinct labels in the column
load_business_data\
    .select("attributes.GoodForKids")\
    .distinct()\
    .show()

+-----------+
|GoodForKids|
+-----------+
|       null|
|       True|
|       None|
|      False|
+-----------+

In [289]:
# Recode True as 1, anything else as 0
load_business_data = load_business_data\
    .withColumn('kid_friendly',when((col('attributes.GoodForKids') == 'True')\
                 | (col('attributes.GoodForKids') == True)\
                 | (col('attributes.GoodForKids') == "\'True\'")\
                 | (col('attributes.GoodForKids') == "u\'True\'"), 1)\
            .otherwise(0))

In [290]:
# Show the first row to check the change
load_business_data\
    .select("attributes.GoodForKids", "kid_friendly")\
    .show(1)

+-----------+------------+
|GoodForKids|kid_friendly|
+-----------+------------+
|      False|           0|
+-----------+------------+
only showing top 1 row

In [291]:
# Inspect the distinct labels in the column
load_business_data\
    .select("attributes.HasTv")\
    .distinct()\
    .show()

+-----+
|HasTv|
+-----+
| null|
| True|
| None|
|False|
+-----+

In [292]:
# Recode True as 1, anything else as 0
load_business_data = load_business_data\
    .withColumn('has_tv', when((col('attributes.HasTV') == 'True')\
                 | (col('attributes.HasTV') == True)\
                 | (col('attributes.HasTV') == "\'True\'")\
                 | (col('attributes.HasTV') == "u\'True\'"), 1)\
            .otherwise(0))


In [293]:
# Show the first row to check the change
load_business_data\
    .select("attributes.HasTV", "has_tv")\
    .show(1)

+-----+------+
|HasTV|has_tv|
+-----+------+
| null|     0|
+-----+------+
only showing top 1 row

In [294]:
# Inspect the distinct labels in the column
load_business_data\
    .select("attributes.DogsAllowed")\
    .distinct()\
    .show()

+-----------+
|DogsAllowed|
+-----------+
|       null|
|       True|
|       None|
|      False|
+-----------+

In [295]:
# Recode True as 1, anything else as 0
load_business_data = load_business_data\
    .withColumn('dog_friendly', when((col('attributes.DogsAllowed') == 'True')\
                 | (col('attributes.DogsAllowed') == True)\
                 | (col('attributes.DogsAllowed') == "\'True\'")\
                 | (col('attributes.DogsAllowed') == "u\'True\'"), 1)\
            .otherwise(0))

In [296]:
# Show the first row to check the change
load_business_data\
    .select("attributes.DogsAllowed", "dog_friendly")\
    .show(1)

+-----------+------------+
|DogsAllowed|dog_friendly|
+-----------+------------+
|       null|           0|
+-----------+------------+
only showing top 1 row

In [297]:
# Inspect the distinct labels in the column
load_business_data\
    .select("attributes.NoiseLevel")\
    .distinct()\
    .show()

+------------+
|  NoiseLevel|
+------------+
|     u'loud'|
|    u'quiet'|
|        null|
|u'very_loud'|
|     'quiet'|
|        None|
| 'very_loud'|
|   'average'|
|  u'average'|
|      'loud'|
+------------+

In [298]:
# Recode loud and very_loud as 1, anything else as 0
load_business_data = load_business_data\
    .withColumn('loud_place', when((col('attributes.NoiseLevel') == 'loud')\
                 | (col('attributes.NoiseLevel') == "\'loud\'")\
                 | (col('attributes.NoiseLevel') == "u\'loud\'")\
                 | (col('attributes.NoiseLevel') == "very_loud")\
                 | (col('attributes.NoiseLevel') == "\'very_loud\'")\
                 | (col('attributes.NoiseLevel') == "u\'very_loud\'"), 1)\
            .otherwise(0))


In [299]:
# Show the first row to check the change
load_business_data\
    .select("attributes.NoiseLevel", "loud_place")\
    .show(1)

+----------+----------+
|NoiseLevel|loud_place|
+----------+----------+
|      null|         0|
+----------+----------+
only showing top 1 row

In [300]:
# Inspect the distinct labels in the column
load_business_data\
    .select("attributes.HappyHour")\
    .distinct()\
    .show()

+---------+
|HappyHour|
+---------+
|     null|
|     True|
|     None|
|    False|
+---------+

In [301]:
# 1 if business offers happy hours, 0 otherwise (applies only to restaurants)
load_business_data = load_business_data\
    .withColumn('happy_hour', when((col('attributes.HappyHour') == 'True')\
                 | (col('attributes.HappyHour') == True)\
                 | (col('attributes.HappyHour') == "\'True\'")\
                 | (col('attributes.HappyHour') == "u\'True\'"), 1)\
            .otherwise(0))

In [302]:
# Show the first row to check the change
load_business_data\
    .select("attributes.HappyHour", "happy_hour")\
    .show(1)

+---------+----------+
|HappyHour|happy_hour|
+---------+----------+
|     null|         0|
+---------+----------+
only showing top 1 row

In [303]:
# Show the first record of the business data
load_business_data.take(1)

[Row(address='2818 E Camino Acequia Drive', attributes=Row(AcceptsInsurance=None, AgesAllowed=None, Alcohol=None, Ambience=None, BYOB=None, BYOBCorkage=None, BestNights=None, BikeParking=None, BusinessAcceptsBitcoin=None, BusinessAcceptsCreditCards=None, BusinessParking=None, ByAppointmentOnly=None, Caters=None, CoatCheck=None, Corkage=None, DietaryRestrictions=None, DogsAllowed=None, DriveThru=None, GoodForDancing=None, GoodForKids='False', GoodForMeal=None, HairSpecializesIn=None, HappyHour=None, HasTV=None, Music=None, NoiseLevel=None, Open24Hours=None, OutdoorSeating=None, RestaurantsAttire=None, RestaurantsCounterService=None, RestaurantsDelivery=None, RestaurantsGoodForGroups=None, RestaurantsPriceRange2=None, RestaurantsReservations=None, RestaurantsTableService=None, RestaurantsTakeOut=None, Smoking=None, WheelchairAccessible=None, WiFi=None), business_id='1SWheh84yJXfytovILXOAQ', categories='Golf, Active Life', city='Phoenix', hours=None, is_open=0, latitude=33.5221425, longit

In [304]:
# Show the is_open column
load_business_data.select('is_open').show()

+-------+
|is_open|
+-------+
|      0|
|      1|
|      1|
|      1|
|      1|
|      1|
|      1|
|      1|
|      0|
|      1|
|      1|
|      1|
|      1|
|      1|
|      1|
|      1|
|      1|
|      0|
|      1|
|      1|
+-------+
only showing top 20 rows

In [305]:
# Recode is_open as instructed: closed businesses (is_open == 0) get 1, open ones get 0
load_business_data = load_business_data\
    .withColumn('is_open', when(load_business_data['is_open'] == 0, 1)\
    .otherwise(0))

In [306]:
# Rename the is_open column to label
load_business_data = load_business_data.withColumnRenamed('is_open','label')

In [307]:
# Select the columns of the recoded DataFrame
business = load_business_data.select('label', 'review_count', 'stars', 'accepts_insurance',
'all_ages_allowed', 'alcohol_consumption', 'bitcoin_friendly', 'food_related',
'finance_related', 'health_related','smoking','free_wifi','has_tv',
'dog_friendly','kid_friendly','expensive_restaurant','loud_place','happy_hour')

In [308]:
# Show the first row of the recoded DataFrame
business.take(1)

[Row(label=1, review_count=5, stars=3.0, accepts_insurance=0, all_ages_allowed=0, alcohol_consumption=0, bitcoin_friendly=0, food_related=0, finance_related=0, health_related=0, smoking=0, free_wifi=0, has_tv=0, dog_friendly=0, kid_friendly=0, expensive_restaurant=0, loud_place=0, happy_hour=0)]

In [309]:
# Drop the user.json columns listed in the exercise from df
drop_list = ['friends', 'yelping_since', 'name', 'elite', 'user_id']
df = df.select([column for column in df.columns if column not in drop_list])

# Count the businesses that closed (label == 1) and those that did not (label == 0)
business.groupBy('label').count().show()

+-----+------+
|label| count|
+-----+------+
|    1| 34084|
|    0|158525|
+-----+------+
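
The counts above imply an imbalanced target. As a small worked calculation using only the two printed counts, the share of closed businesses and the accuracy of a trivial classifier that always predicts "open" can be computed as follows (the variable names are illustrative):

```python
# Counts taken from the groupBy output above
closed = 34084       # label == 1
still_open = 158525  # label == 0

total = closed + still_open
closed_share = closed / total

# Accuracy of always predicting the majority class (label == 0)
baseline_accuracy = max(closed, still_open) / total

print(f"total businesses:  {total}")
print(f"closed share:      {closed_share:.4f}")
print(f"majority baseline: {baseline_accuracy:.4f}")
```

With roughly 18% of businesses closed, plain accuracy would flatter a model that never predicts a closure, which is one reason to compare the models by area under the ROC curve instead.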

In [310]:
# Build the feature list: every column of business except label
feats = business.columns
feats.remove('label')
print(feats)

['review_count', 'stars', 'accepts_insurance', 'all_ages_allowed', 'alcohol_consumption', 'bitcoin_friendly', 'food_related', 'finance_related', 'health_related', 'smoking', 'free_wifi', 'has_tv', 'dog_friendly', 'kid_friendly', 'expensive_restaurant', 'loud_place', 'happy_hour']

In [311]:
# Assemble the feature columns into a single vector column, then keep only label and assembled_features
assemble_feats = VectorAssembler(inputCols = feats, outputCol = 'assembled_features')
assemble_feats = assemble_feats.transform(business)
assemble_feats = assemble_feats.select('label', 'assembled_features')

In [312]:
# Show the first row
assemble_feats.take(1)

[Row(label=1, assembled_features=SparseVector(17, {0: 5.0, 1: 3.0}))]
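
The `SparseVector(17, {0: 5.0, 1: 3.0})` above stores only the non-zero entries: a length of 17 plus the values at positions 0 and 1. A minimal sketch in plain Python of how that notation expands back to the 17 named features; `to_dense` is a hypothetical helper written for this illustration, not a Spark function.

```python
from typing import Dict, List

# Feature names in the order printed for feats
feats = ['review_count', 'stars', 'accepts_insurance', 'all_ages_allowed',
         'alcohol_consumption', 'bitcoin_friendly', 'food_related',
         'finance_related', 'health_related', 'smoking', 'free_wifi', 'has_tv',
         'dog_friendly', 'kid_friendly', 'expensive_restaurant', 'loud_place',
         'happy_hour']


def to_dense(size: int, active: Dict[int, float]) -> List[float]:
    """Expand {index: value} pairs into a full list, filling the gaps with 0.0."""
    dense = [0.0] * size
    for index, value in active.items():
        dense[index] = value
    return dense


row = to_dense(17, {0: 5.0, 1: 3.0})
named = dict(zip(feats, row))
print(named['review_count'], named['stars'], named['food_related'])
```

The expanded row agrees with the first record of `business` shown earlier: 5 reviews, 3.0 stars, and every dummy equal to 0.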

In [313]:
# Split the data into train (70%) and test (30%) sets
train, test = assemble_feats.randomSplit([0.7, 0.3], seed = 4982)
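
As a rough illustration of what a weighted random split with a fixed seed does, the sketch below assigns each row to a partition with an independent random draw. This is a plain-Python analogy under the assumption that rows are sampled independently; it is not Spark's implementation, and `random_split` is a hypothetical name.

```python
import random
from typing import List, Sequence


def random_split(rows: Sequence[int], weights: Sequence[float], seed: int) -> List[List[int]]:
    """Assign every row to one partition with probability proportional to its weight."""
    total = sum(weights)
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    parts: List[List[int]] = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last partition also absorbs any floating-point remainder
            if draw < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=4982)
print(len(train_rows), len(test_rows))
```

Because each row is drawn independently, the partition sizes are only approximately 70/30, while reusing the same seed reproduces the same split.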

In [314]:
# Create the estimator, naming the features, label and prediction columns explicitly
logistic_model = LogisticRegression(featuresCol='assembled_features',labelCol='label',
                    predictionCol='open_pred')

In [315]:
# Fit the logistic regression model
logistic_model = logistic_model.fit(train)

In [316]:
# Show rawPrediction, probability and the predicted class (open_pred) for five test rows
logistic_model.transform(test).show(5)

+-----+--------------------+--------------------+--------------------+---------+
|label|  assembled_features|       rawPrediction|         probability|open_pred|
+-----+--------------------+--------------------+--------------------+---------+
|    0|(17,[0,1],[3.0,1.0])|[1.78138615815434...|[0.85586794397960...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[1.78138615815434...|[0.85586794397960...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[1.78138615815434...|[0.85586794397960...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[1.78138615815434...|[0.85586794397960...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[1.78138615815434...|[0.85586794397960...|      0.0|
+-----+--------------------+--------------------+--------------------+---------+
only showing top 5 rows
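
The three output columns are linked: for logistic regression, `probability` is the logistic (sigmoid) function of `rawPrediction`, and `open_pred` is the class with the higher probability. A short check in plain Python using the first printed margin; the value is truncated in the output, so the result is only expected to match to the printed digits.

```python
import math


def sigmoid(margin: float) -> float:
    """Logistic function: maps a raw margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


# First entry of rawPrediction in the table above (truncated there)
raw_class0 = 1.78138615815434

p_class0 = sigmoid(raw_class0)
p_class1 = 1.0 - p_class0
predicted = 0.0 if p_class0 >= 0.5 else 1.0

print(f"P(label=0) = {p_class0:.8f}")
print(f"P(label=1) = {p_class1:.8f}")
print(f"open_pred  = {predicted}")
```

The computed probability for class 0 agrees with the printed `0.85586794...`, and since it exceeds 0.5 the predicted class is 0.0, as in the table.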

In [317]:
# Report the area under the ROC curve on the test set
print(f"Area Under ROC: {logistic_model.evaluate(test).areaUnderROC}")

Area Under ROC: 0.6897550315459314

In [318]:
# Create the estimator, naming the features, label and prediction columns explicitly
gbt_model = GBTClassifier(featuresCol='assembled_features',labelCol='label',
                    predictionCol='open_pred')

In [319]:
# Fit the gradient boosting model
gbt_model = gbt_model.fit(train)

In [320]:
# Show rawPrediction, probability and the predicted class (open_pred) for five test rows
gbt_model.transform(test).show(5)

+-----+--------------------+--------------------+--------------------+---------+
|label|  assembled_features|       rawPrediction|         probability|open_pred|
+-----+--------------------+--------------------+--------------------+---------+
|    0|(17,[0,1],[3.0,1.0])|[0.92004930751026...|[0.86296037003160...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[0.92004930751026...|[0.86296037003160...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[0.92004930751026...|[0.86296037003160...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[0.92004930751026...|[0.86296037003160...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[0.92004930751026...|[0.86296037003160...|      0.0|
+-----+--------------------+--------------------+--------------------+---------+
only showing top 5 rows
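
For the gradient boosting model, the printed numbers are consistent with a logistic link applied to twice the raw margin. This relationship is inferred here from the output (and from the log-loss formulation commonly used for boosted trees); the notebook itself does not state it, so treat the sketch as a numerical check rather than a description of the library internals.

```python
import math

# First entry of rawPrediction in the GBT table above (truncated there)
raw_class0 = 0.92004930751026

# Assumption: probability = 1 / (1 + exp(-2 * margin))
p_class0 = 1.0 / (1.0 + math.exp(-2.0 * raw_class0))

print(f"P(label=0) = {p_class0:.8f}")
```

The result matches the printed `0.86296037...`, whereas a plain sigmoid of the same margin would give a noticeably lower value (about 0.715).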

In [321]:
# Store the GBT predictions on the test set in gbt
gbt = gbt_model.transform(test)

In [322]:
# Report the area under the ROC curve for the GBT model
print(f"Area Under ROC: {evaluator.evaluate(gbt)}")

Area Under ROC: 0.704484378921334

In [323]:
# Create the estimator, naming the features, label and prediction columns explicitly
dct_model = DecisionTreeClassifier(featuresCol='assembled_features',labelCol='label',
                    predictionCol='open_pred')

In [324]:
# Fit the decision tree classifier
dct_model = dct_model.fit(train)

In [325]:
# Show rawPrediction, probability and the predicted class (open_pred) for five test rows
dct_model.transform(test).show(5)

+-----+--------------------+----------------+--------------------+---------+
|label|  assembled_features|   rawPrediction|         probability|open_pred|
+-----+--------------------+----------------+--------------------+---------+
|    0|(17,[0,1],[3.0,1.0])|[71259.0,9213.0]|[0.88551297345660...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[71259.0,9213.0]|[0.88551297345660...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[71259.0,9213.0]|[0.88551297345660...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[71259.0,9213.0]|[0.88551297345660...|      0.0|
|    0|(17,[0,1],[3.0,1.0])|[71259.0,9213.0]|[0.88551297345660...|      0.0|
+-----+--------------------+----------------+--------------------+---------+
only showing top 5 rows
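
For the decision tree, `rawPrediction` is not a margin: the two numbers look like per-class counts at the leaf the row falls into, and `probability` is those counts divided by their sum. A quick check in plain Python with the printed values; the interpretation as leaf counts is an assumption based on the numbers.

```python
# rawPrediction printed for the first test row of the decision tree
raw_prediction = [71259.0, 9213.0]

leaf_total = sum(raw_prediction)
probability = [count / leaf_total for count in raw_prediction]

print(f"leaf total  = {leaf_total:.0f}")
print(f"probability = [{probability[0]:.8f}, {probability[1]:.8f}]")
```

The first probability reproduces the printed `0.88551297...`, so every row that lands in this leaf gets the same score.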

In [326]:
# Store the decision tree predictions on the test set in dct
dct = dct_model.transform(test)

In [327]:
# Report the area under the ROC curve for the decision tree
print(f"Area Under ROC: {evaluator.evaluate(dct)}")

Area Under ROC: 0.36349751197389435
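
An area under the ROC curve below 0.5 means the scores given to the evaluator rank closed businesses below open ones more often than not; reversing the ranking would give about 1 - 0.3635 = 0.6365. One possible explanation, offered here only as an assumption, is that the evaluator reads the tree's `rawPrediction` column, whose leaf counts are not comparable across leaves of different sizes. The sketch below uses hypothetical toy data to show the pairwise definition of the metric and the effect of reversing the scores.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: the positives mostly receive the lowest scores
labels = [1, 0, 1, 0, 0]
scores = [0.2, 0.7, 0.4, 0.9, 0.3]

original = auc(labels, scores)
reversed_auc = auc(labels, [-s for s in scores])
print(f"AUC = {original:.4f}, AUC with reversed scores = {reversed_auc:.4f}")
```

In the toy data only one of the six positive/negative pairs is ordered correctly, so the metric is 1/6, and reversing the scores yields the complement 5/6.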

In [328]:
# Show the feature importances of the GBT model
gbt_model.featureImportances

SparseVector(17, {0: 0.2104, 1: 0.1841, 2: 0.0771, 3: 0.0004, 4: 0.1085, 5: 0.0018, 6: 0.1617, 7: 0.0251, 8: 0.0195, 9: 0.0027, 10: 0.031, 11: 0.0382, 12: 0.0045, 13: 0.0625, 14: 0.0398, 15: 0.0189, 16: 0.0138})

In [329]:
# List the attributes in descending order of importance
importances = list(zip(gbt_model.featureImportances, feats))
importances.sort(reverse = True)
importances

[(0.21043643080701604, 'review_count'), (0.184106795545111, 'stars'), (0.1616908057325615, 'food_related'), (0.10846847288536979, 'alcohol_consumption'), (0.07705823480821926, 'accepts_insurance'), (0.06253224214329314, 'kid_friendly'), (0.03981474556172798, 'expensive_restaurant'), (0.0382396455856052, 'has_tv'), (0.030977606079952362, 'free_wifi'), (0.025137524088466857, 'finance_related'), (0.019454306986342994, 'health_related'), (0.018858574540806618, 'loud_place'), (0.013844185514394588, 'happy_hour'), (0.004476831955784336, 'dog_friendly'), (0.002701146375408592, 'smoking'), (0.0018466945489796552, 'bitcoin_friendly'), (0.0003557568409600836, 'all_ages_allowed')]
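
To put the ranking in perspective, the rounded importances from the `SparseVector` output can be paired with the feature names and accumulated. This is a plain-Python sketch using the four-decimal values printed above, so the sums are approximate.

```python
feats = ['review_count', 'stars', 'accepts_insurance', 'all_ages_allowed',
         'alcohol_consumption', 'bitcoin_friendly', 'food_related',
         'finance_related', 'health_related', 'smoking', 'free_wifi', 'has_tv',
         'dog_friendly', 'kid_friendly', 'expensive_restaurant', 'loud_place',
         'happy_hour']

# Rounded values copied from gbt_model.featureImportances
importances = {0: 0.2104, 1: 0.1841, 2: 0.0771, 3: 0.0004, 4: 0.1085,
               5: 0.0018, 6: 0.1617, 7: 0.0251, 8: 0.0195, 9: 0.0027,
               10: 0.031, 11: 0.0382, 12: 0.0045, 13: 0.0625, 14: 0.0398,
               15: 0.0189, 16: 0.0138}

# Pair each importance with its feature name and sort from largest to smallest
ranked = sorted(((importances.get(i, 0.0), name) for i, name in enumerate(feats)),
                reverse=True)

top3 = [name for _, name in ranked[:3]]
top3_share = sum(value for value, _ in ranked[:3])
print(top3, f"{top3_share:.4f}")
```

The three leading attributes together carry a little over half of the total importance, while the last three (smoking, bitcoin_friendly, all_ages_allowed) contribute less than 1% combined.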

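The most important attributes in the GBT model are review_count (0.210), stars (0.184) and food_related (0.162), which together account for about 56% of the total importance.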