## Recommendation system - Spark

Objective: FlixBR plans to expand its database and improve its user experience, and to that end it wants to implement a recommendation system.

**Steps to be carried out:**

1. Use the dataset generated in the previous task.
2. Develop a collaborative-filtering recommendation algorithm using PySpark.

In [1]:
# Import the libraries

from pyspark.sql.functions import col, isnan, when, count, trim
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

In [2]:
# Create a Spark session

spark = SparkSession.builder.appName('predicao').getOrCreate()

In [3]:
# Read the saved file

df = spark.read.csv('dataset.csv', header=True, inferSchema=True)

In [4]:
df.toPandas()

Unnamed: 0,_c0,user_id,item_id,rating
0,0,186,302,3
1,1,22,377,1
2,2,244,51,2
3,3,166,346,1
4,4,298,474,4
...,...,...,...,...
99542,99542,880,476,3
99543,99543,716,204,5
99544,99544,276,1090,1
99545,99545,13,225,2


In [5]:
type(df)

pyspark.sql.dataframe.DataFrame

In [6]:
# Clean up the columns: drop the leftover index column `_c0`

df = df.drop('_c0')

In [7]:
df.toPandas()

Unnamed: 0,user_id,item_id,rating
0,186,302,3
1,22,377,1
2,244,51,2
3,166,346,1
4,298,474,4
...,...,...,...
99542,880,476,3
99543,716,204,5
99544,276,1090,1
99545,13,225,2


In [8]:
df.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- item_id: integer (nullable = true)
 |-- rating: integer (nullable = true)



In [9]:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+-------+-------+------+
|user_id|item_id|rating|
+-------+-------+------+
|      0|      0|     0|
+-------+-------+------+



In [10]:
(training, test) = df.randomSplit([0.8, 0.2])

In [11]:
# Build the recommendation model using ALS on the training data

als = ALS(maxIter=5, regParam=0.01, userCol="user_id", itemCol="item_id", ratingCol="rating")

In [12]:
model = als.fit(training)
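
As a rough illustration of what the fitted model does: ALS learns one latent-factor vector per user and one per item, and a predicted rating is the dot product of the two. The sketch below uses plain Python and hypothetical two-dimensional factors (the values are made up for illustration; the real vectors are learned from `training`).

```python
# Hypothetical latent factors, invented for illustration only.
user_factors = {532: [1.0, 2.0]}
item_factors = {272: [0.5, 1.5], 148: [2.0, 0.25]}


def predict_rating(user_id, item_id):
    """Predicted rating = dot product of the user and item factor vectors."""
    u = user_factors[user_id]
    v = item_factors[item_id]
    return sum(a * b for a, b in zip(u, v))


print(predict_rating(532, 272))
print(predict_rating(532, 148))
```

Nothing constrains the dot product to the 1 to 5 rating scale, which is why raw predictions can fall outside it.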

In [13]:
predictions = model.transform(test)

In [14]:
predictions.show()

+-------+-------+------+----------+
|user_id|item_id|rating|prediction|
+-------+-------+------+----------+
|    916|    148|     2| 2.4415765|
|    222|    148|     2|   2.65728|
|    363|    148|     3| 1.7397189|
|    620|    148|     3| 3.7090974|
|    455|    148|     3| 3.2294407|
|     15|    148|     3| 4.1548963|
|    486|    148|     2| 2.1218996|
|    677|    148|     4| 1.0111227|
|    757|    148|     4| 2.8575993|
|    438|    148|     5|   5.03842|
|    234|    148|     3| 2.4393735|
|    403|    148|     5| 5.4368844|
|    893|    148|     3| 3.4261796|
|    203|    148|     3| 3.3813648|
|    708|    148|     4|  3.522882|
|    536|    148|     4| 4.1980977|
|    865|    148|     3|0.10078798|
|    825|    148|     4|  4.604362|
|    344|    148|     2| 2.9518635|
|    270|    148|     4| 4.3900695|
+-------+-------+------+----------+
only showing top 20 rows



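After transforming the test set, missing values were found in the predictions. By default, ALS returns NaN for any user or item in the test split that did not appear in the training split (the cold-start case).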
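
To make the cold-start case concrete, here is a minimal plain-Python sketch that finds the test rows whose user or item never appears in the training split. The tuples are hypothetical and stand in for `(user_id, item_id, rating)` rows. In Spark itself, passing `coldStartStrategy="drop"` to `ALS` should remove such rows from the output of `transform`, which may be a simpler alternative to filtering by hand.

```python
def cold_start_rows(training, test):
    """Return the test rows whose user or item never appears in training."""
    seen_users = {user for user, _item, _rating in training}
    seen_items = {item for _user, item, _rating in training}
    return [
        row for row in test
        if row[0] not in seen_users or row[1] not in seen_items
    ]


# Hypothetical (user_id, item_id, rating) tuples, for illustration only.
training = [(1, 10, 4), (1, 11, 3), (2, 10, 5)]
test = [(2, 11, 2), (3, 10, 4), (1, 12, 1)]

print(cold_start_rows(training, test))  # [(3, 10, 4), (1, 12, 1)]
```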

In [15]:
predictions.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in predictions.columns]).show()

+-------+-------+------+----------+
|user_id|item_id|rating|prediction|
+-------+-------+------+----------+
|      0|      0|     0|        41|
+-------+-------+------+----------+



In [16]:
# Handle missing values: map NaN, null, and blank entries to null, then drop those rows

def to_null(c):
    return when(~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == "")), col(c))


predict = predictions.select([to_null(c).alias(c) for c in predictions.columns]).na.drop()

In [17]:
predict.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in predict.columns]).toPandas()

Unnamed: 0,user_id,item_id,rating,prediction
0,0,0,0,0


In [18]:
# Evaluate the model using RMSE (root-mean-square error)

evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predict)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 1.07757997736801


In [19]:
# Evaluate the model using MAE (mean absolute error)

evaluator = RegressionEvaluator(metricName="mae", labelCol="rating", predictionCol="prediction")
mae = evaluator.evaluate(predict)
print("Mean absolute error = " + str(mae))

Mean absolute error = 0.8181212489981198
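
For reference, both metrics can be written out by hand. The sketch below is a plain-Python version (the function names `rmse_of` and `mae_of` are illustrative, not part of Spark), applied to the first three `(rating, prediction)` rows of the predictions table shown earlier.

```python
import math


def rmse_of(pairs):
    """Root-mean-square error over (rating, prediction) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))


def mae_of(pairs):
    """Mean absolute error over (rating, prediction) pairs."""
    return sum(abs(r - p) for r, p in pairs) / len(pairs)


# First three (rating, prediction) rows from the predictions table.
pairs = [(2, 2.4415765), (2, 2.65728), (3, 1.7397189)]

print("RMSE on the sample:", rmse_of(pairs))
print("MAE on the sample:", mae_of(pairs))
```

RMSE squares each error before averaging, so it weights large misses more heavily than MAE and is never smaller than it.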


Testing the recommendation model.

User 532 was used for this test.

In [20]:
single_user = test.filter(test['user_id']==532).select(['user_id','item_id'])
single_user.show()

+-------+-------+
|user_id|item_id|
+-------+-------+
|    532|      9|
|    532|     24|
|    532|     52|
|    532|    132|
|    532|    153|
|    532|    168|
|    532|    181|
|    532|    191|
|    532|    205|
|    532|    218|
|    532|    229|
|    532|    230|
|    532|    250|
|    532|    268|
|    532|    272|
|    532|    277|
|    532|    304|
|    532|    305|
|    532|    307|
|    532|    312|
+-------+-------+
only showing top 20 rows



In [21]:
recomen = model.transform(single_user)
recomen.orderBy('prediction',ascending=False).toPandas()

Unnamed: 0,user_id,item_id,prediction
0,532,633,5.46836
1,532,496,5.39467
2,532,272,5.166042
3,532,191,5.036676
4,532,485,5.027513
5,532,739,5.015213
6,532,132,4.897423
7,532,591,4.77234
8,532,1189,4.709031
9,532,153,4.614224
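
Note that the cell above only scores items from user 532's test split, that is, items the user has already rated. A recommender would usually rank items the user has not rated yet. The plain-Python sketch below shows that ranking step, using scores copied from the output above and a hypothetical set of already-rated items. In Spark, `ALSModel.recommendForUserSubset` may be a more direct way to get top-N lists for chosen users.

```python
def top_n_unseen(scores, already_rated, n=10):
    """Rank items by predicted score, skipping items the user already rated."""
    candidates = [
        (item, score) for item, score in scores.items()
        if item not in already_rated
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]


# Scores copied from the output above; the already-rated set is hypothetical.
scores = {633: 5.46836, 496: 5.39467, 272: 5.166042, 191: 5.036676}
already_rated = {496}

print(top_n_unseen(scores, already_rated, n=2))  # [(633, 5.46836), (272, 5.166042)]
```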
