# Recommendation

With collaborative filtering we make predictions (filtering) about a user's interests by collecting information on the preferences or tastes of many users (collaboration). The underlying assumption is that if user A shares user B's opinion on one topic, A is more likely to share B's opinion on a different topic x than the opinion on x of a randomly chosen user.
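
The assumption above can be illustrated with a minimal sketch of user-based collaborative filtering. The toy ratings, the names `cosine_similarity` and `predict`, and the choice of cosine similarity are all assumptions made for illustration; this is not the ALS model used later in the notebook.

```python
import math

# Hypothetical toy ratings: user -> {item: rating}. Not taken from the dataset.
ratings = {
    "A": {"m1": 5.0, "m2": 1.0, "m3": 4.0},
    "B": {"m1": 5.0, "m2": 1.0, "m3": 4.0, "x": 5.0},
    "C": {"m1": 1.0, "m2": 5.0, "m3": 2.0, "x": 1.0},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine_similarity(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else None

# A agrees with B on every shared item, so B's opinion of x counts more than C's.
print(predict("A", "x"))
```

Because A and B agree on the items they share, the estimate for A on `x` lands closer to B's rating than the plain average of B and C would.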

In [2]:
import warnings
warnings.filterwarnings("ignore")

## Creating the Spark Session

In [1]:
import os, subprocess

java8_home = "/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"

os.environ["JAVA_HOME"] = java8_home
os.environ["PATH"] = os.path.join(java8_home, "bin") + os.pathsep + os.environ.get("PATH","")

os.environ["HADOOP_USER_NAME"] = os.environ.get("USER", "tomas")

print("JAVA_HOME set to:", os.environ["JAVA_HOME"])
try:
    print("which java (kernel):", subprocess.check_output(["which","java"]).decode().strip())
    print("java -version (kernel):")
    print(subprocess.check_output(["java","-version"], stderr=subprocess.STDOUT).decode())
except Exception as e:
    print("Error calling java from the kernel:", e)

JAVA_HOME set to: /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
which java (kernel): /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/bin/java
java -version (kernel):
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)



In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('rec').getOrCreate()

25/09/13 22:50:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/13 22:50:08 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/09/13 22:50:08 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/09/13 22:50:08 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
25/09/13 22:50:08 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
25/09/13 22:50:08 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.


## Importing the Data

In [4]:
data = spark.read.csv('../PySparkCourse/MLData/movielens_ratings.csv',inferSchema=True,header=True)


In [5]:
data.head(10)

[Row(movieId=2, rating=3.0, userId=0),
 Row(movieId=3, rating=1.0, userId=0),
 Row(movieId=5, rating=2.0, userId=0),
 Row(movieId=9, rating=4.0, userId=0),
 Row(movieId=11, rating=1.0, userId=0),
 Row(movieId=12, rating=2.0, userId=0),
 Row(movieId=15, rating=1.0, userId=0),
 Row(movieId=17, rating=1.0, userId=0),
 Row(movieId=19, rating=1.0, userId=0),
 Row(movieId=21, rating=1.0, userId=0)]

In [6]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



## Training and Test Sets

In [7]:
training, test = data.randomSplit([0.8, 0.2])
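
The split above is random, so each run of the notebook produces different training and test sets and a different RMSE. `randomSplit` accepts an optional `seed` argument for repeatable runs. The pure-Python sketch below illustrates the idea of a seeded, per-row random split; the function `random_split` is a hypothetical stand-in for illustration, not Spark's implementation.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries in (0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for idx, bound in enumerate(bounds):
            if draw < bound:
                splits[idx].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point rounding
    return splits

rows = list(range(1501))  # same row count as the dataset summary above
train_a, test_a = random_split(rows, [0.8, 0.2], seed=42)
train_b, test_b = random_split(rows, [0.8, 0.2], seed=42)

print(len(train_a), len(test_a))  # sizes are close to 80/20, not exact
print(train_a == train_b)         # same seed, same split
```

Since every row is drawn independently, the 80/20 proportions hold only approximately, which is also why the test set in this notebook is not exactly 20% of the 1501 rows.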

## Model

- **Alternating Least Squares (ALS):** A matrix factorization technique for collaborative recommendation.
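
For reference, the objective that ALS minimizes can be written in a common textbook form. The symbols are notation introduced here, not taken from the notebook: $r_{ui}$ is the rating user $u$ gave item $i$, $\Omega$ is the set of observed (user, item) pairs in the training data, $x_u$ and $y_i$ are the latent factor vectors for users and items, and $\lambda$ plays the role of `regParam`.

```latex
\min_{X,\,Y} \; \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)
```

With the item factors $Y$ held fixed, each $x_u$ has a closed-form regularized least-squares solution, and the same holds for each $y_i$ when $X$ is fixed. The algorithm alternates between the two updates for `maxIter` rounds. Spark's implementation additionally scales the regularization term by the number of ratings each user or item has, so the formula above is a simplified sketch rather than the exact loss Spark optimizes.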

In [11]:
from pyspark.ml.recommendation import ALS

In [9]:
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(training)
predictions = model.transform(test)

25/09/13 23:01:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
25/09/13 23:01:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
25/09/13 23:01:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
25/09/13 23:01:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK

In [10]:
predictions.show()


+-------+------+------+-----------+
|movieId|rating|userId| prediction|
+-------+------+------+-----------+
|     31|   4.0|    12|  2.6823602|
|     31|   3.0|     7|  2.5327888|
|     85|   1.0|    28| -3.1888137|
|     85|   3.0|     1|   4.807709|
|     85|   1.0|    13|  1.5778544|
|     85|   3.0|     6| 0.13224229|
|     85|   1.0|     5|-0.83726645|
|     85|   1.0|    25| -1.7026141|
|     65|   1.0|    16|-0.23292011|
|     65|   2.0|     3|  -0.597949|
|     65|   1.0|     4|  0.4489627|
|     53|   1.0|    23| 0.97864366|
|     78|   1.0|    28| 0.11902475|
|     78|   1.0|    12| 0.80926186|
|     78|   1.0|     1| 0.91904163|
|     78|   1.0|    20| 0.71978164|
|     78|   1.0|     4| 0.49625564|
|     34|   1.0|     4|  2.1076875|
|     81|   1.0|    19| 0.92394173|
|     28|   1.0|    23| 0.02589339|
+-------+------+------+-----------+
only showing top 20 rows



## Evaluation

In [12]:
from pyspark.ml.evaluation import RegressionEvaluator

In [13]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("RMSE on test = ", rmse)



RMSE on test =  1.6509647329597468



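**Analysis**

Ratings range from 1 to 5. An RMSE of 1.65 means the model's typical prediction error is about 1.65 rating points; because RMSE squares each error before averaging, large misses weigh more than they would in a plain average of absolute errors. For a user who gave 4 stars, predictions as far off as 2.35 or 5.65 would not be unusual, and nothing constrains the output to the 1 to 5 scale, as the negative values in the predictions table show. This is a relatively high error: it exceeds the standard deviation of the ratings themselves (about 1.19), which is roughly the RMSE that predicting the mean rating for every user and movie would give.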
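
To make the metric concrete, the following sketch computes RMSE by hand on the first five rows of the predictions table shown earlier and compares it with the mean absolute error (MAE). The helper names `rmse` and `mae` and the clipping step are assumptions added for illustration; the notebook itself relies only on `RegressionEvaluator`.

```python
import math

def rmse(labels, preds):
    """Root mean squared error: square, average, then take the root."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))

def mae(labels, preds):
    """Mean absolute error: plain average of the absolute differences."""
    return sum(abs(y - p) for y, p in zip(labels, preds)) / len(labels)

# First five rows of the predictions table (rating, prediction).
labels = [4.0, 3.0, 1.0, 3.0, 1.0]
preds = [2.6823602, 2.5327888, -3.1888137, 4.807709, 1.5778544]

print("RMSE:", rmse(labels, preds))
print("MAE: ", mae(labels, preds))

# Optional post-processing (an assumption, not done in the notebook):
# clamp predictions to the valid 1 to 5 rating scale before scoring.
clipped = [min(5.0, max(1.0, p)) for p in preds]
print("RMSE after clipping:", rmse(labels, clipped))
```

On these rows the single prediction of about -3.19 dominates the RMSE, which shows how a few out-of-range predictions can inflate the metric.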

## Implementation

In [14]:
single_user = test.filter(test['userId']==11).select(['movieId','userId'])

In [15]:
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|      0|    11|
|     13|    11|
|     16|    11|
|     23|    11|
|     32|    11|
|     35|    11|
|     38|    11|
|     61|    11|
|     82|    11|
|     88|    11|
+-------+------+



In [16]:
recommendations = model.transform(single_user)

In [17]:
recommendations.orderBy('prediction',ascending=False).show()



+-------+------+----------+
|movieId|userId|prediction|
+-------+------+----------+
|     16|    11| 3.4341564|
|     32|    11| 2.9593024|
|     38|    11| 2.7665155|
|     13|    11| 2.7166576|
|      0|    11|  2.708358|
|     35|    11|  2.708118|
|     88|    11| 2.3236823|
|     23|    11| 2.2557573|
|     82|    11|  2.085736|
|     61|    11| 1.8779682|
+-------+------+----------+




