
# Recommender systems for recommending movies and jokes to a user



## Dataset Description



### 1. MovieLens:

MovieLens is one of the most popular recommendation datasets. It collects the movie preferences of Internet users, expressed as ratings of 0.5 to 5 stars. It has been used in research areas such as personalized recommendation and social psychology.
-  **Files:** 
   -  ML_ratings.csv: contains the ratings given by users to movies. It is made up of the user_id, movie_id and rating columns.
   -  movies.csv: contains the metadata about the movies. It includes the columns movie_id, title, and genres (where the movie genres are separated by "|").
-  **Number of ratings**: 100836
-  **Number of users**: 610
-  **Number of movies**: 9724
-  **Score**: 0.5 to 5 (half-star increments)

References: [MovieLens Dataset](https://grouplens.org/datasets/movielens/)
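
The figures listed above can be checked directly from the ratings file. The following is a minimal sketch, assuming the column names described above; `summarize_ratings` is a hypothetical helper, and the toy frame only stands in for `ML_ratings.csv`.

```python
import pandas as pd

def summarize_ratings(df, user_col="user_id", item_col="movie_id", rating_col="rating"):
    """Return basic counts and the rating range of a long-format ratings table."""
    return {
        "n_ratings": len(df),
        "n_users": df[user_col].nunique(),
        "n_items": df[item_col].nunique(),
        "min_rating": df[rating_col].min(),
        "max_rating": df[rating_col].max(),
    }

# Toy data standing in for pd.read_csv("ML_ratings.csv")
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie_id": [1, 3, 1, 47],
    "rating": [4.0, 4.0, 2.5, 5.0],
})
summary = summarize_ratings(toy)
print(summary)
```

Running the same helper on the real file should reproduce the number of ratings, users and movies reported in the list.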


In [None]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

In [None]:
df_movies = pd.read_csv("ML_ratings.csv")
df_movies.head()

Unnamed: 0,user_id,movie_id,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0



### 2. Jester:

Jester is a dataset developed by Ken Goldberg and his team at the University of California, Berkeley. The full collection contains around 6 million ratings of 150 short jokes; the subset used here is described below.
-  **Files**: 
   -  JT_ratings.csv: contains the ratings given by users to different short jokes. It is made up of the columns joke_id, user_id and rating.
   -  jokes.csv: contains the text of the jokes. It is made up of the columns joke_id and joke_text.
-  **Number of ratings**: 199900
-  **Number of users**: 1999
-  **Number of jokes**: 100
-  **Rating**: -10 to 10 (real values)

 References: [Jester Dataset](http://eigentaste.berkeley.edu/dataset/)


In [None]:
df = pd.read_csv("JT_ratings.csv")
df.head()

Unnamed: 0,joke_id,user_id,rating
0,0,1,5.1
1,1,1,4.9
2,2,1,1.75
3,3,1,-4.17
4,4,1,5.15



## Recommender System Types



a. User-based collaborative filtering: to make recommendations for a user X, the system finds the set of users with similar tastes (computed from historical rating data) and recommends what those users prefer.

 b. Item-based collaborative filtering: to make recommendations for a user X, the system finds the items most similar to those the user already prefers and recommends from that set.


<img src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11227-020-03266-2/MediaObjects/11227_2020_3266_Fig1_HTML.png" title="Title text" width="60%" />
<center> <i> Figure 1. Collaborative Filtering. </i> </center>
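
The difference between the two approaches comes down to which axis of the rating matrix is compared. The sketch below illustrates this on a small made-up matrix (the values are not taken from either dataset): similarities between rows give user-user closeness, similarities between columns give item-item closeness.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix: rows are users, columns are items, 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 0, 5, 4],
])

# User-based: compare rows (users) with each other
user_sim = cosine_similarity(ratings)
# Item-based: compare columns (items) with each other
item_sim = cosine_similarity(ratings.T)

# User most similar to user 0, excluding user 0 itself
others = user_sim[0].copy()
others[0] = -np.inf
most_similar_user = int(np.argmax(others))

print(user_sim.shape, item_sim.shape)
print(most_similar_user)
```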


## 1. User-based Collaborative Filtering



We will apply this method to both datasets.



Sections:
1.  Implement a user-based collaborative filtering recommender using the k nearest neighbors, with **cosine similarity** as the measure of similarity between the target user and the rest of the users.
1.  A new user enters the system and is asked to rate 10 items of their choice. From those ratings, the system recommends 5 items the user has not rated and displays the ids of the k nearest neighbors.
1.  We will answer the following questions:

>  a. What were the 5 recommendations obtained?
> **Dataset Movies:** The recommendations obtained are listed in the first table. The first recommendation is The Silence of the Lambs. Looking at genres, Action, Crime and Thriller all appear among the movies the user rated, and movies in these genres received an average rating of about 4, which could explain the recommendation. However, the user also gave some action movies a low rating, so the recommendation may instead be explained by the combination of genres and by more complex patterns captured when computing the nearest neighbors. It is worth noting that, although the user rated many adventure movies, no adventure movie was recommended. The recommendations therefore diversify beyond the rated genres, and other factors seem to dominate: the movies rated by the user can all be regarded as landmark movies, and the recommended ones, such as The Godfather or The Matrix, are also very well known.

 **Dataset Jokes:** The recommendations obtained are listed in the second table. Two results are particularly interesting: the jokes with ids 47 and 35 have a structure very similar to the jokes with ids 27 and 25, respectively, which were among the 10 jokes rated by the new user. One pair shares the contrast between different professions and the other the setting of a bar. Both rated jokes received positive scores, so it makes sense that similar jokes were recommended. On the other hand, the jokes with ids 49, 35 and 20 do not seem to match the tastes of the new user, who gave negative scores to jokes with a sexual connotation. These jokes were presumably rated highly by the neighboring users and thus came to be recommended as well; an item-based approach with bias correction at the item level could solve this problem. Finally, the joke with id 10 resembles the joke with id 22: both are short jokes, and since the joke with id 22 was rated positively, it is consistent that a short joke was recommended to the new user.

| movie_id |                            Movie | Predicted_Rating |                          Genre |
|---------:|---------------------------------:|-----------------:|-------------------------------:|
| 593      | Silence of the Lambs, The (1991) | 3.924192         | Crime\|Horror\|Thriller        |
| 858      | Godfather, The (1972)            | 3.917455         | Crime\|Drama                   |
| 608      | Fargo (1996)                     | 3.909946         | Comedy\|Crime\|Drama\|Thriller |
| 50       | Usual Suspects, The (1995)       | 3.897961         | Crime\|Mystery\|Thriller       |
| 2571     | Matrix, The (1999)               | 3.884048         | Action\|Sci-Fi\|Thriller       |

| Joke_id | Joke | Predicted_Rating |
|----------:|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|-----------------:|
| 49        | A guy goes into confession and says to the priest, "Father,  I'm 80 years\nold, widower, with 11 grandchildren. Last night I met two  beautiful flight\nattendants. They took me home and I made love to both  of them. Twice."\n\nThe priest said: "Well, my son, when was the last  time you were in\nconfession?"\n "Never Father, I'm Jewish."\n "So then,  why are you telling me?"\n "I'm telling everybody."\n | 8.759101         |
| 10        | Q. What do a hurricane, a tornado, and a redneck\ndivorce all have in common? \nA. Someone's going to lose their trailer...\n                                                                                                                                                                                                                                                                                         | 7.832140         |
| 47        | The graduate with a Science degree asks, "Why does it  work?"\nThe graduate with an Engineering degree asks, "How does it  work?"\nThe graduate with an Accounting degree Asks, "How much will it  cost?" \nThe graduate with a  Liberal Arts degree asks, "Do you want  fries \nwith  that?"\n                                                                                                                       | 7.580787         |
| 35        | A guy walks into a bar, orders a beer and says to the  bartender,\n"Hey, I got this great Polish Joke..." \n\nThe barkeep  glares at him and says in a warning tone of voice:\n"Before you go  telling that joke you better know that I'm Polish, both\nbouncers are  Polish and so are most of my customers"\n\n"Okay" says the  customer,"I'll tell it very slowly." \n                                             | 7.368727         |
| 20        | What's the difference between a used tire and 365 used condoms?\n\nOne's a Goodyear, the other's a great year.\n                                                                                                                                                                                                                                                                                                      | 7.220927         |

>  b. What number of nearest neighbors (k) was chosen for the recommendation? What influences the choice of this parameter? What happens as this parameter increases?<blockquote> **Dataset Movies:** The number of nearest neighbors chosen was 50. A larger k brings more users similar to the new user into the recommendation, which can give a more precise estimate, but only up to a point: if k is excessively large, users who are not similar to the new user are also included. Here k=50 was chosen because the matrix is highly sparse, so two users rarely have many rated movies in common. Considering more neighbors makes it more likely that some of them have rated the same movies, which matters because the algorithm only compares each pair of users on the intersection of the movies they have rated. **Dataset Jokes:** In this case k=10 is used. Since the matrix is dense, the problem of the movies dataset does not arise, so it is preferable to use only the closest users and not risk involving users who are less similar to the new user.</blockquote>
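
The trade-off described in this answer can be made visible with a small experiment. The sketch below uses synthetic data (random vectors, not the MovieLens or Jester ratings) and only illustrates the general trend: as k grows, the average cosine distance to the selected neighbors cannot decrease, so less similar users enter the neighborhood.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic mean-centered rating matrix: 200 users x 30 items (illustration only)
users = rng.normal(size=(200, 30))
target = rng.normal(size=(1, 30))

knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(users)

mean_distance = {}
for k in (5, 10, 50, 100):
    distances, _ = knn.kneighbors(target, n_neighbors=k)
    # Average cosine distance between the target user and its k nearest neighbors
    mean_distance[k] = float(distances.mean())

for k, d in mean_distance.items():
    print(f"k={k:3d}  mean cosine distance={d:.3f}")
```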

 c. What was the sparsity percentage of the user-item matrix? What are the disadvantages of this approach?

>  **Dataset Movies:** The user-movie matrix is 98.3% sparse. The disadvantage of neighbor-based collaborative filtering is that closeness between users is determined from the ratings of the items they have in common; with a rating matrix this sparse, the resulting similarities do not necessarily reflect reality. Another disadvantage is that transitive relationships between users and movies, and between users, are not considered, unlike a graph-based approach, which could capture more information by traversing the relationships between nodes. **Dataset Jokes:** The user-joke matrix is 0.0% sparse. On this dataset the method gives good results: because the rating matrix is completely filled, users with similar tastes can be clearly identified.
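
The sparsity figures quoted here follow from the counts in the dataset descriptions: sparsity is one minus the ratio of observed ratings to the number of cells in the user-item matrix. The helper `sparsity` below is a hypothetical name used only for this sketch.

```python
def sparsity(n_ratings, n_users, n_items):
    """Fraction of empty cells in an n_users x n_items rating matrix."""
    return 1 - n_ratings / (n_users * n_items)

# Counts taken from the dataset descriptions
movies_sparsity = sparsity(100836, 610, 9724)
jokes_sparsity = sparsity(199900, 1999, 100)

print(f"MovieLens: {movies_sparsity * 100:.1f}% sparse")  # 98.3% sparse
print(f"Jester:    {jokes_sparsity * 100:.1f}% sparse")   # 0.0% sparse
```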


In [None]:
def predict_rating_ubcf(knn,rating_matrix,rating_nuevo_usuario,ignorar):
  # Mean of the new user's ratings (unrated entries are skipped), used for bias correction
  media_usuario = np.true_divide(rating_nuevo_usuario.sum(where=rating_nuevo_usuario!=ignorar),(rating_nuevo_usuario!=ignorar).sum())
  # Find the nearest neighbors of the mean-centered new user with the fitted kNN
  distancias, indices = knn.kneighbors(rating_nuevo_usuario-media_usuario)
  print(f'The ids of the nearest neighbors are shown below: {indices}')
  # Extract the neighbors' rows from the rating matrix
  vecinos = rating_matrix.iloc[indices[0]]
  # The rating matrix passed to this function is not the mean-centered one, so the neighbors' means are recomputed here
  # TODO: avoid this double computation
  medias = np.true_divide(vecinos.sum(1),(vecinos!=ignorar).sum(1)).values
  # Numerator of the predicted rating, skipping entries where the neighbor has no rating.
  # Note: the weights are the cosine distances returned by kneighbors (distance = 1 - similarity)
  numerador = np.sum(np.where(vecinos.values.T!=ignorar,np.multiply((np.subtract(vecinos.values.T,medias)),distancias.reshape((indices.shape[1],))),0),axis=1)
  # Denominator: sum of the absolute weights
  denominador = np.sum(np.absolute(distancias))
  # Predicted rating: user mean plus the weighted average of the neighbors' deviations
  resultado = media_usuario + (numerador/denominador)
  return resultado
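
Read as a formula, the function above appears to compute the expression below. The notation is introduced here for illustration only: $N_k(u)$ is the set of $k$ neighbors returned by `kneighbors`, $\bar r_u$ and $\bar r_v$ are the mean ratings of the new user and of neighbor $v$, $r_{v,i}$ is the rating of neighbor $v$ for item $i$, and $w_{uv}$ is the value the code uses as the weight of neighbor $v$ (the corresponding entry of `distancias`).

```latex
\hat r_{u,i} \;=\; \bar r_u \;+\;
\frac{\displaystyle\sum_{\substack{v \in N_k(u) \\ r_{v,i} \neq \texttt{ignorar}}} w_{uv}\,\bigl(r_{v,i} - \bar r_v\bigr)}
     {\displaystyle\sum_{v \in N_k(u)} \lvert w_{uv} \rvert}
```

The numerator skips neighbors that have not rated item $i$, while the denominator runs over all $k$ neighbors, matching the `np.where` mask and the unmasked sum in the code.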

In [None]:
df_movies_titles = pd.read_csv("movies.csv")
rating_matrix = df_movies.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)
rating_matrix_2 = rating_matrix.astype('i1').T.join(df_movies_titles.set_index('movie_id').title).set_index('title').T.rename_axis('user_id')
rating_matrix_genres = rating_matrix.astype('i1').T.join(df_movies_titles.set_index('movie_id').genres).set_index('genres').T.rename_axis('user_id')

In [None]:
# Bias correction: subtract each user's mean rating
medias = np.true_divide(rating_matrix.sum(1),(rating_matrix!=0).sum(1)).values
rating_matrix_norm = np.subtract(rating_matrix.values.T,medias).T

# Fit kNN, later used to find the nearest neighbors
nbrs_movies = NearestNeighbors(n_neighbors=50, metric='cosine', algorithm='brute').fit(rating_matrix_norm)

adventures_movies = df_movies_titles[df_movies_titles['genres'].str.contains("Adventure")]

# Array of zeros that will hold the 10 movies rated by the new user
rating_nuevo_usuario = np.zeros(shape=(1,rating_matrix.shape[1]))

# Ids of the rated movies
movies_rated = [1,2,13,170,260,362,393,480,648,1198]
# Rating of each movie, in the same order
rate_movies =  [4,5,5,4,3,4,2,4,2,5]

# For display with dataframe.head(10)
calificadas = adventures_movies[adventures_movies['movie_id'].isin(movies_rated)]
calificadas = calificadas.assign(rating_nuevo_usuario = rate_movies)

# Write the ratings into the new user's vector, mapping each movie_id to its column position
for movie_id, rating in zip(movies_rated, rate_movies):
  rating_nuevo_usuario[0][rating_matrix.columns.get_loc(movie_id)] = rating
  
# Predict the ratings of the movies
resultado = predict_rating_ubcf(nbrs_movies, rating_matrix, rating_nuevo_usuario, ignorar=0)
df_resultado = pd.DataFrame({'movie_id':rating_matrix.columns,'Pelicula':rating_matrix_2.columns,'Predicted_Rating': resultado,'Genero':rating_matrix_genres.columns})
df_resultado = df_resultado.sort_values(by=['Predicted_Rating'],ascending=False)
# Keep only the movies the user has not rated
df_resultado = df_resultado[(~df_resultado['Pelicula'].isin(calificadas['title'].values))]

The ids of the nearest neighbors are shown below: [[543  52 188  25 193 568 594 319  48 277 438 363  86 506 256  36 546 298
   34 213 280 292  59 156 144 191 406 573 117 288 405 244 250 432 205 146
  119 493 430 530 575 548 396 567 517 323 162 484 544  91]]
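
The bias correction used in this notebook subtracts each user's mean from every cell, including the zeros that stand for unrated movies. A common variant centers only the cells that hold a rating and leaves the rest at zero. The sketch below illustrates that variant on a toy matrix; `center_rated` is a hypothetical helper and is not the method used in this notebook.

```python
import numpy as np

def center_rated(matrix, missing=0):
    """Subtract each row's mean over rated cells, leaving unrated cells at 0."""
    matrix = np.asarray(matrix, dtype=float)
    rated = matrix != missing
    counts = rated.sum(axis=1)
    # Mean over rated cells only; rows without ratings get a mean of 0
    sums = np.where(rated, matrix, 0.0).sum(axis=1)
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return np.where(rated, matrix - means[:, None], 0.0), means

toy = np.array([
    [4, 0, 5, 0],
    [0, 2, 0, 4],
])
centered, means = center_rated(toy)
print(means)
print(centered)
```

With this variant an unrated cell and a rating equal to the user's mean both map to zero, so which form is preferable depends on how the similarity measure should treat missing ratings.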



## Movies rated by the new user


In [None]:
calificadas.head(10)

Unnamed: 0,movie_id,title,genres,rating_nuevo_usuario
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4
1,2,Jumanji (1995),Adventure|Children|Fantasy,5
12,13,Balto (1995),Adventure|Animation|Children,5
142,170,Hackers (1995),Action|Adventure|Crime|Thriller,4
224,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,3
320,362,"Jungle Book, The (1994)",Adventure|Children|Romance,4
349,393,Street Fighter (1994),Action|Adventure|Fantasy,2
418,480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,4
546,648,Mission: Impossible (1996),Action|Adventure|Mystery|Thriller,2
900,1198,Raiders of the Lost Ark (Indiana Jones and the...,Action|Adventure,5



## Top 5 Movies to recommend to the user


In [None]:
df_resultado.head(5)

Unnamed: 0,movie_id,Pelicula,Predicted_Rating,Genero
510,593,"Silence of the Lambs, The (1991)",3.924192,Crime|Horror|Thriller
659,858,"Godfather, The (1972)",3.917455,Crime|Drama
520,608,Fargo (1996),3.909946,Comedy|Crime|Drama|Thriller
46,50,"Usual Suspects, The (1995)",3.897961,Crime|Mystery|Thriller
1938,2571,"Matrix, The (1999)",3.884048,Action|Sci-Fi|Thriller



## Sparsity User-Movie Matrix


In [None]:
movies_sparsity = rating_matrix[rating_matrix!=0.0].isnull().sum().sum()/rating_matrix.size
print(f'The user-movie matrix is {np.round(movies_sparsity*100,4)}% sparse')

The user-movie matrix is 98.3% sparse



# Jokes


In [None]:
df_jester = pd.read_csv("JT_ratings.csv")

In [None]:
df_jester_texts = pd.read_csv("jokes.csv")
rating_matrix_jester = df_jester.pivot(index='user_id', columns='joke_id', values='rating').fillna(0)
rating_matrix_jester_2 = rating_matrix_jester.astype('i1').T.join(df_jester_texts.set_index('joke_id').joke_text).set_index('joke_text').T.rename_axis('user_id')

In [None]:
# Bias correction before applying kNN
medias_jester = np.true_divide(rating_matrix_jester.sum(1),(rating_matrix_jester.notnull()).sum(1)).values
rating_matrix_jester_norm = np.subtract(rating_matrix_jester.values.T,medias_jester).T
# Fit kNN, later used to find the nearest neighbors
nbrs_jester = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='brute').fit(rating_matrix_jester_norm)
# Array that will hold the 10 jokes rated by the new user; -11 marks "not rated" (outside the -10 to 10 scale)
rating_nuevo_usuario_jester = np.zeros(shape=(1,rating_matrix_jester.shape[1]))
rating_nuevo_usuario_jester -= 11
# Ids of the rated jokes
jokes_rated = [22,23,24,25,26,27,28,29,30,31]
# Rating of each joke, in the same order
rate_jokes = [3,-4,-10,1,-2,5,-10,5,-8,4]
# For display with dataframe.head(10)
calificadas_jester = df_jester_texts[df_jester_texts['joke_id'].isin(jokes_rated)]
calificadas_jester = calificadas_jester.assign(rating_nuevo_usuario = rate_jokes)
# Write each rating into the new user's vector at the position of its joke
for i in range(len(jokes_rated)):
  rating_nuevo_usuario_jester[0][jokes_rated[i]] = rate_jokes[i]
  
# Predict the ratings of the jokes
resultado_jester = predict_rating_ubcf(nbrs_jester, rating_matrix_jester, rating_nuevo_usuario_jester, ignorar=-11)
df_resultado_jester = pd.DataFrame({'Chiste_id': df_jester_texts['joke_id'].values,'Chiste':df_jester_texts['joke_text'].values,'Predicted_Rating': resultado_jester})
df_resultado_jester = df_resultado_jester.sort_values(by=['Predicted_Rating'],ascending=False)
# Keep only the jokes the user has not rated
df_resultado_jester = df_resultado_jester[(~df_resultado_jester['Chiste_id'].isin(calificadas_jester['joke_id'].values))]

The ids of the nearest neighbors are shown below: [[1969  116 1895  146  917  773  888  951  843 1980]]



## Jokes rated by new user


In [None]:
calificadas_jester.head(10)

Unnamed: 0,joke_id,joke_text,rating_nuevo_usuario
22,22,Q: What is the Australian word for a boomerang...,3
23,23,What do you get when you run over a parakeet w...,-4
24,24,Two kindergarten girls were talking outside: o...,-10
25,25,A guy walks into a bar and sits down next to a...,1
26,26,Clinton returns from a vacation in Arkansas an...,-2
27,27,"A mechanical, electrical and a software engine...",5
28,28,An old Scotsmen is sitting with a younger Scot...,-10
29,29,Q: What's the difference between a Lawyer and ...,5
30,30,President Clinton looks up from his desk in t...,-8
31,31,A man arrives at the gates of heaven. St. Pete...,4



## Top 5 jokes to recommend to the user


In [None]:
df_resultado_jester.head(5)

Unnamed: 0,Chiste_id,Chiste,Predicted_Rating
49,49,A guy goes into confession and says to the pri...,8.759101
10,10,"Q. What do a hurricane, a tornado, and a redne...",7.83214
47,47,"The graduate with a Science degree asks, ""Why ...",7.580787
35,35,"A guy walks into a bar, orders a beer and says...",7.368727
20,20,What's the difference between a used tire and ...,7.220927



## Sparsity User-Jokes Matrix


In [None]:
# rating_matrix_jester was filled with zeros, so missing ratings are counted from the long-format table instead
jokes_sparsity = 1 - len(df_jester)/rating_matrix_jester.size
print(f'The user-joke matrix is {np.round(jokes_sparsity*100,10)}% sparse')

The user-joke matrix is 0.0% sparse



## 2. Item-based Collaborative Filtering



Sections:
1.  Implement an item-based collaborative filtering recommender using the k closest items, with **cosine similarity** as the measure of similarity between items.
2.  A new user enters the system and is asked to rate 10 items of their choice. From those ratings, the system recommends 5 items the user has not rated. To produce the recommendation, keep the following step in mind:<blockquote> a. Generate the item-item similarity matrix based on cosine similarity.</blockquote>
3.  We will answer the following questions:
    >  a. What were the 5 recommendations obtained?

 **MovieLens:** The new user (id=700) mostly rated movies in the Action and War categories. With the item-based procedure, the movies recommended to this user are expected to be related to the categories of the movies the user rated. The movies listed below are mostly Horror titles, together with one Comedy and one Action movie. As the table shows, the recommended movies have practically the same value in the Ranking column. This is probably due to the bias correction: because the movies rated by the new user were chosen at random, relating new movies to those already rated is harder, since they have few attributes in common.

       
| Movies                                                                | Genre                               | Ranking            |
|-----------------------------------------------------------------------|-------------------------------------|--------------------|
| Nightmare on Elm Street 4: The Dream Master, A (1988)                 | Horror\|Thriller                    | 2.5000000000000004 |
| 8 ½ Women (a.k.a. 8 1/2 Women) (a.k.a. Eight and a Half Women) (1999) | Comedy                              | 2.5                |
| Friday the 13th Part 3: 3D (1982)                                     | Horror                              | 2.5                |
| Friday the 13th Part IV: The Final Chapter (1984)                     | Horror                              | 2.5                |
| Jurassic World: Fallen Kingdom (2018)                                 | Action\|Adventure\|Sci-Fi\|Thriller | 2.5                |

 **Jester:** For this dataset a new user with id=2000 was created and asked to rate some jokes, most of which could be classified as "soft" jokes: short ones, in some cases the kind told to get out of an awkward situation. The table below shows the recommendations derived from this user's ratings. The values obtained are high, and noticeably better than those obtained with the previous dataset. This may be because the jokes rated by the new user are more closely related to the jokes that were recommended.

 |  Jokes                                                                                                                                                                                                                                                                                                        | Ranking           |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------|
| "May I take your order?" the waiter asked. "Yes, how do you prepare your chickens?" "Nothing special sir," he replied. "We just tell them straight out that they're going to die."                                                                                                                             | 8.555944999999998 |
| Q: What is the Australian word for a boomerang that won't come back? A: A stick                                                                                                                                                                                                                                | 8.555944999999998 |
| A dog walks into Western Union and asks the clerk to send a telegram. He fills out a form on which he writes down the telegram he wishes to send: "Bow wow wow, Bow wow wow." The clerk says, "You can add another 'Bow wow' for the same price." The dog responded, "Now wouldn't that sound a little silly?" | 7.886754219543453 |
| Q. Did you hear about the dyslexic devil worshiper? A. He sold his soul to Santa.                                                                                                                                                                                                                              | 7.753240030771999 |
| Q:  What did the blind person say when given some matzah? A:  Who the hell wrote this?                                                                                                                                                                                                                         | 5.346952669160046 |



> b. What number of nearest neighbors (k) was chosen for the recommendation? What influences the choice of this parameter? What happens as this parameter increases?<blockquote> **MovieLens:** The number of nearest neighbors chosen was k=50, as with the first algorithm. A larger k brings more items similar to the target item into the recommendation, which can give a more precise estimate; if k is too large, however, movies unrelated to the user's tastes end up being recommended. **Jester:** In this case k=10 is used. Since the matrix is dense, the problem of the movies dataset does not arise, so it is preferable to use only the closest items and not risk involving jokes that are not similar to the tastes of the new user (id=2000).</blockquote>

 c. What are the advantages of this approach over the previous one?

>  The main advantage of this approach is that it works from items similar to those the user has already rated, that is, recommendations of new movies or jokes are grounded in the user's own tastes. Unlike the user-based approach, which relies on the ratings of neighboring users, here we look at the items closest to the movie or joke that the user has not yet rated.
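
The idea of predicting from the user's own ratings on similar items can be sketched as follows. This is a simplified illustration on a made-up matrix, not the implementation used later in this notebook; `predict_item_based` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy item-user matrix: rows are items, columns are users, 0 means "not rated"
item_user = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 0.0],
    [1.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])

# Item-item cosine similarity matrix
item_sim = cosine_similarity(item_user)

def predict_item_based(item, user, k=2):
    """Similarity-weighted average of the user's ratings on the k items most similar to `item`."""
    sims = item_sim[item].copy()
    sims[item] = -np.inf                    # never use the item itself
    neighbors = np.argsort(sims)[::-1][:k]  # k most similar items
    rated = [j for j in neighbors if item_user[j, user] != 0]
    weight = sum(item_sim[item, j] for j in rated)
    if weight <= 0:
        return 0.0
    return sum(item_sim[item, j] * item_user[j, user] for j in rated) / weight

# User 0 has not rated item 3; predict it from the items most similar to item 3
prediction = predict_item_based(item=3, user=0)
print(round(prediction, 3))
```

Because the prediction is a weighted average of ratings the user actually gave, it always lies between the lowest and highest of those ratings.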


# MovieLens



In [None]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors


In [None]:
# Add the data of a new user and their ratings for 10 movies
usuario = pd.read_csv("usuario_movies.csv") # assigned user_id=700, since this id was not in use
usuario

Unnamed: 0,user_id,movie_id,rating
0,700,5749,4
1,700,2835,3
2,700,149406,3
3,700,866,5
4,700,73017,4
5,700,5388,5
6,700,90376,3
7,700,1517,2
8,700,1198,1
9,700,3877,5


In [None]:
# Read the ratings table
ratings = pd.read_csv("ML_ratings.csv") 
# Append the new user to the ratings dataframe
ratings = pd.concat([ratings, usuario])
# Read the movie metadata
movies = pd.read_csv("movies.csv")
# Compute each movie's mean rating, needed for the adjusted rating used by the algorithm
media= ratings.groupby(['movie_id'], as_index = False, sort = False).mean().rename(columns = {'rating': 'rating_mean'})[['movie_id','rating_mean']]
# Merge the three dataframes
df = pd.merge(ratings,media,on = 'movie_id', how = 'left', sort = False).merge(movies,on='movie_id')
# Adjusted rating: the rating minus the movie's mean rating
df['rating_adjusted']=df['rating']-df['rating_mean']
df

Unnamed: 0,user_id,movie_id,rating,rating_mean,title,genres,rating_adjusted
0,1,1,4.0,3.92093,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0.07907
1,5,1,4.0,3.92093,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0.07907
2,7,1,4.5,3.92093,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0.57907
3,15,1,2.5,3.92093,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,-1.42093
4,17,1,4.5,3.92093,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0.57907
...,...,...,...,...,...,...,...
100841,610,160341,2.5,2.50000,Bloodmoon (1997),Action|Thriller,0.00000
100842,610,160527,4.5,4.50000,Sympathy for the Underdog (1971),Action|Crime|Drama,0.00000
100843,610,160836,3.0,3.00000,Hazard (2005),Action|Drama|Thriller,0.00000
100844,610,163937,3.5,3.50000,Blair Witch (2016),Horror|Thriller,0.00000
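
The adjusted rating shown in this table is each rating minus the mean rating of its movie. The same column can also be obtained without a merge by using `groupby(...).transform`, as in the sketch below on toy data (the values are made up for illustration).

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2],
    "movie_id": [10, 10, 10, 20, 20],
    "rating": [4.0, 2.0, 3.0, 5.0, 4.0],
})

# Mean rating of each movie, broadcast back to every row of that movie
toy["rating_mean"] = toy.groupby("movie_id")["rating"].transform("mean")
# Adjusted rating: how far each rating is from the movie's mean
toy["rating_adjusted"] = toy["rating"] - toy["rating_mean"]
print(toy)
```

Note that a movie with a single rating always gets an adjusted rating of 0, as in the last rows of the table above.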


In [None]:
# Check that our new user (id=700) has 10 reviews
freq = df['user_id'].value_counts() 
print(freq)

414    2698
599    2478
474    2108
448    1864
274    1346
       ... 
320      20
569      20
189      20
442      20
700      10
Name: user_id, Length: 611, dtype: int64


In [None]:
# Build the title-by-user matrix from user_id and the adjusted rating
df = df.pivot_table(index='title',columns='user_id',values='rating_adjusted').fillna(0)
# Make a copy (df1 will later hold the predicted ratings)
df1 = df.copy()
# Show the title-by-user matrix
df.head() 

user_id,1,2,3,4,5,6,7,8,9,10,...,602,603,604,605,606,607,608,609,610,700
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
def recomendar_peliculas(usuario, cantidad_recomendaciones):  
    peliculas_recomendadas = []
    for m in df[df[usuario] == 0].index.tolist():
        index_df = df.index.tolist().index(m)
        rating_predecido = df1.iloc[index_df, df1.columns.tolist().index(usuario)]
        peliculas_recomendadas.append((m, rating_predecido))
    
    recomendacion_ordenado = sorted(peliculas_recomendadas, key=lambda x:x[1], reverse=True)
    print('Based on your tastes, the recommended movies are the following: \n')
    rank = 1
    for peliculas_recomendada in recomendacion_ordenado[:cantidad_recomendaciones]:
        print('{}: {} --- Rating:{}'.format(rank, peliculas_recomendada[0], peliculas_recomendada[1]))
        rank = rank + 1

def recomendador_peliculas(usuario, k, n_recomendacion):
  
    numero_k = k

    knn = NearestNeighbors(metric='cosine', algorithm='brute')
    knn.fit(df.values)
    distancia, indices = knn.kneighbors(df.values, n_neighbors=numero_k)
    usuario_index = df.columns.tolist().index(usuario)
    
    # Only movies the user has not rated (value 0) receive a prediction;
    # movies already rated are skipped so that they do not bias the results.
    for m,t in list(enumerate(df.index)):
        if df.iloc[m, usuario_index] == 0:
            sim_movies = indices[m].tolist()
            distancia_peliculas = distancia[m].tolist()

            # Drop the movie itself from its own neighbor list
            if m in sim_movies:
                id_movie = sim_movies.index(m)
                sim_movies.remove(m)
                distancia_peliculas.pop(id_movie) 
            else:
                sim_movies = sim_movies[:k-1]
                distancia_peliculas = distancia_peliculas[:k-1]

            # Cosine similarity = 1 - cosine distance
            pelicula_similar = [1-x for x in distancia_peliculas]
            pelicula_similar_copy = pelicula_similar.copy()
            nominador = 0

            for s in range(0, len(pelicula_similar)):
                if df.iloc[sim_movies[s], usuario_index] == 0:
                    # Neighbor not rated by the user: remove its similarity from the denominator
                    if len(pelicula_similar_copy) == (numero_k - 1):
                        pelicula_similar_copy.pop(s)
                    else:
                        pelicula_similar_copy.pop(s-(len(pelicula_similar)-len(pelicula_similar_copy)))
                else:
                    nominador = nominador + pelicula_similar[s]*df.iloc[sim_movies[s],usuario_index]

            # Similarity-weighted average of the user's ratings on the neighboring movies
            if len(pelicula_similar_copy) > 0:
                if sum(pelicula_similar_copy) > 0:
                    prediccion_rating = nominador/sum(pelicula_similar_copy)
                else:
                    prediccion_rating = 0
            else:
                prediccion_rating = 0

            df1.iloc[m,usuario_index] = prediccion_rating
    
    recomendar_peliculas(usuario,n_recomendacion)

In [None]:
recomendador_peliculas(700, 50, 5)

En base a sus gustos las pelÃ­culas recomendadas son las siguientes: 

1: Nightmare on Elm Street 4: The Dream Master, A (1988) --- Rating:2.5000000000000004
2: 8 Â½ Women (a.k.a. 8 1/2 Women) (a.k.a. Eight and a Half Women) (1999) --- Rating:2.5
3: Friday the 13th Part 3: 3D (1982) --- Rating:2.5
4: Friday the 13th Part IV: The Final Chapter (1984) --- Rating:2.5
5: Jurassic World: Fallen Kingdom (2018) --- Rating:2.5
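
The neighbour lookup in `recomendador_peliculas` relies on the cosine distance between rows of the movie-by-user matrix. As a rough, NumPy-only illustration of what `metric='cosine'` measures (one minus the cosine similarity), the sketch below ranks neighbours on a toy matrix. The array values and variable names are assumptions made for this example, and it assumes no row is all zeros.

```python
import numpy as np

# Toy item-by-user matrix (rows = items, columns = users); values are made up.
item_user = np.array([
    [ 3.0, -3.0],
    [ 2.0, -1.0],
    [-1.0,  1.0],
])

# Normalise each row, then cosine distance = 1 - cosine similarity.
norms = np.linalg.norm(item_user, axis=1, keepdims=True)
unit_rows = item_user / norms
cosine_distance = 1.0 - unit_rows @ unit_rows.T

# Sort each row by distance; position 0 is the item itself (distance ~0),
# position 1 is its closest other item.
neighbour_order = np.argsort(cosine_distance, axis=1)
nearest_other = neighbour_order[:, 1]
print(nearest_other)
```

Each item is normally its own closest neighbour, which is why the recommender removes `m` from its neighbour list before averaging.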



# Jester


In [None]:
# add the data for a new user and their ratings for 10 jokes
usuario = pd.read_csv("usuario_joker.csv")  # user_id=2000 is assigned because it was not already in use
usuario

Unnamed: 0,user_id,joke_id,rating
0,2000,45,-5
1,2000,100,-10
2,2000,36,1
3,2000,89,2
4,2000,94,3
5,2000,20,4
6,2000,10,5
7,2000,15,6
8,2000,6,7
9,2000,9,10
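
The comment in the cell above states that `user_id=2000` was not in use. A small check along the following lines could confirm that, and also confirm that every rated `joke_id` exists in the joke catalogue, before the frames are concatenated. The helper name `check_new_user` and the toy frames are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def check_new_user(ratings, new_user, catalogue_ids):
    """Return (user ids already in use, joke ids missing from the catalogue)."""
    clashing_users = set(new_user["user_id"]) & set(ratings["user_id"])
    unknown_jokes = set(new_user["joke_id"]) - set(catalogue_ids)
    return clashing_users, unknown_jokes

# Toy data standing in for JT_ratings.csv, usuario_joker.csv and jokes.csv.
toy_ratings = pd.DataFrame({"joke_id": [0, 1], "user_id": [1, 1], "rating": [5.1, 4.9]})
toy_new_user = pd.DataFrame({"user_id": [2000, 2000], "joke_id": [1, 7], "rating": [3, -5]})
toy_catalogue_ids = [0, 1, 2]

clashing_users, unknown_jokes = check_new_user(toy_ratings, toy_new_user, toy_catalogue_ids)
print(clashing_users, unknown_jokes)
```

Any joke id reported as unknown would be dropped silently by a later inner merge on `joke_id`, so it is worth catching early.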


In [None]:
ratings = pd.read_csv("JT_ratings.csv")
jokes = pd.read_csv("jokes.csv")
ratings = pd.concat([ratings, usuario])
ratings

Unnamed: 0,joke_id,user_id,rating
0,0,1,5.10
1,1,1,4.90
2,2,1,1.75
3,3,1,-4.17
4,4,1,5.15
...,...,...,...
5,20,2000,4.00
6,10,2000,5.00
7,15,2000,6.00
8,6,2000,7.00


In [None]:
# adjust each rating by the joke's mean rating to remove bias
Mean = ratings.groupby(['joke_id'], as_index=False, sort=False).mean().rename(columns={'rating': 'rating_mean'})[['joke_id', 'rating_mean']]
df = pd.merge(ratings, Mean, on='joke_id', how='left', sort=False).merge(jokes, on='joke_id')
df['rating_adjusted'] = df['rating'] - df['rating_mean']
df.head()

Unnamed: 0,joke_id,user_id,rating,rating_mean,joke_text,rating_adjusted
0,0,1,5.1,1.381991,"A man visits the doctor. The doctor says ""I ha...",3.718009
1,0,2,-8.79,1.381991,"A man visits the doctor. The doctor says ""I ha...",-10.171991
2,0,3,-3.5,1.381991,"A man visits the doctor. The doctor says ""I ha...",-4.881991
3,0,4,7.14,1.381991,"A man visits the doctor. The doctor says ""I ha...",5.758009
4,0,5,-8.79,1.381991,"A man visits the doctor. The doctor says ""I ha...",-10.171991
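
The mean adjustment above can also be written without the intermediate merge. The sketch below shows the same per-joke centring with `groupby(...).transform("mean")` on a toy frame; the data are made up for illustration and the column names simply mirror those used in the notebook.

```python
import pandas as pd

# Toy ratings: two jokes, two users each (values are illustrative only).
toy = pd.DataFrame({
    "joke_id": [0, 0, 1, 1],
    "user_id": [1, 2, 1, 2],
    "rating":  [5.0, -1.0, 2.0, 4.0],
})

# transform("mean") returns the per-joke mean aligned with the original rows.
toy["rating_mean"] = toy.groupby("joke_id")["rating"].transform("mean")
toy["rating_adjusted"] = toy["rating"] - toy["rating_mean"]
print(toy)
```

One caveat under this scheme: a rating that happens to equal the joke's mean becomes exactly 0 after centring, which the later steps would read as "not rated".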


In [None]:
df2 = df.copy()
# pivot the adjusted ratings into a joke-by-user matrix (unrated cells are filled with 0)
df = df.pivot_table(index='joke_text', columns='user_id', values='rating_adjusted').fillna(0)
df1 = df.copy()
df.head()

user_id,1,2,3,4,5,6,7,8,9,10,...,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000
joke_text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"President Clinton looks up from his desk in the Oval Office to see\n one of his aides nervously approach him. \n ""What is it?"" exclaims the President. \n""It's this Abortion Bill Mr. President, what do you want to do\n about it?"" the aide replies. \n""Just go ahead and pay it."" responds the President. \n",-3.855913,3.814087,2.744087,-5.315913,-4.675913,4.304087,-6.665913,-1.185913,-0.655913,-1.325913,...,-0.895913,-7.105913,-7.785913,-1.035913,-8.225913,2.504087,-0.405913,4.394087,3.964087,0.0
"""May I take your order?"" the waiter asked. \n\n""Yes, how do you prepare your chickens?"" \n\n""Nothing special sir,"" he replied. ""We just tell them straight out\nthat they're going to die.""\n",2.011931,-8.278069,0.021931,0.751931,4.531931,7.641931,-2.508069,-1.198069,-0.228069,-8.818069,...,3.371931,3.611931,2.441931,-2.938069,6.281931,-3.138069,4.481931,2.741931,-2.068069,0.0
"A Czechoslovakian man felt his eyesight was growing steadily worse, and \nfelt it was time to go see an optometrist. \n\nThe doctor started with some simple testing, and showed him a standard eye \nchart with letters of\ndiminishing size: CRKBNWXSKZY. . . \n\n""Can you read this?"" the doctor asked. \n\n""Read it?"" the Czech answered. ""Doc, I know him!""\n",-6.828834,-0.228834,-11.108834,-0.768834,3.211166,0.351166,-2.418834,-0.378834,-0.768834,-7.028834,...,2.971166,4.041166,-7.948834,1.761166,3.511166,-0.178834,3.801166,1.761166,-12.078834,0.0
"A Jewish young man was seeing a psychiatrist for an eating and\nsleeping disorder. \n\n""I am so obsessed with my mother... As soon as I go to sleep, I start\ndreaming, and everyone in my dream turns into my mother. I wake up in\nsuch a state, all I can do is go downstairs and eat a piece of toast.""\n\nThe psychiatrist replies:\n\n""What, just one piece of toast, for a big boy like you?""\n",-8.775115,5.354885,-4.405115,8.024885,1.664885,6.714885,0.064885,-3.185115,-2.025115,6.084885,...,-1.635115,-0.225115,-3.435115,0.544885,6.324885,-4.845115,-2.465115,-4.115115,-0.955115,1.984885
"A Panda bear walks into a bar. Sits down at a table and orders a beer \nand a double cheeseburger. After he is finished eating, he pulls out a gun\nand rips the place with gunfire. Patrons scatter and dive under chairs and\ntables as the bear runs out the door. After ensuring that no one is hurt, \nthe bartender races out the door, and calls after the bear ""What the hell did\nyou do that for?"" The bear calls back, ""I'm a Panda bear. Look it up in the\ndictionary."" \n\nThe bartender returns, pulls out his dictionary.\n\npanda : \Pan""da\, n. (Zo[""o]l.)\nA small Asiatic mammal (Ailurus fulgens) having fine soft fur.\nIt is related to the bears, and inhabits the mountains of Northern India.\nEats shoots and leaves.\n",3.282221,1.532221,-0.217779,-6.527779,-2.347779,-1.327779,-2.207779,1.682221,3.522221,1.532221,...,6.242221,1.142221,-5.557779,2.022221,4.742221,1.482221,-2.057779,3.472221,-1.327779,0.0


In [None]:
def recomendar_chistes(usuario, cantidad_recomendaciones):
    # Collect (joke text, predicted rating) for every joke the user has not rated.
    chistes_recomendadas = []
    for m in df[df[usuario] == 0].index.tolist():
        index_df = df.index.tolist().index(m)
        rating_predecido = df1.iloc[index_df, df1.columns.tolist().index(usuario)]
        chistes_recomendadas.append((m, rating_predecido))

    # Sort by predicted rating, highest first, and print the top entries.
    recomendacion_ordenado = sorted(chistes_recomendadas, key=lambda x: x[1], reverse=True)
    print('Based on your tastes, the recommended jokes are the following: \n')
    rank = 1
    for chistes_recomendada in recomendacion_ordenado[:cantidad_recomendaciones]:
        print('{}: {} --- Rating:{}'.format(rank, chistes_recomendada[0], chistes_recomendada[1]))
        rank = rank + 1

def recomendador_chistes(usuario, k, n_recomendacion):

    numero_k = k

    # Item-item nearest neighbours under cosine distance (each row of df is a joke).
    knn = NearestNeighbors(metric='cosine', algorithm='brute')
    knn.fit(df.values)
    distancia, indices = knn.kneighbors(df.values, n_neighbors=numero_k)
    usuario_index = df.columns.tolist().index(usuario)

    # Many jokes have a value of 0 because the user did not rate them. Only those
    # jokes get a prediction, and unrated neighbours are removed from the weighted
    # average so that the results are not biased.
    for m, t in list(enumerate(df.index)):
        if df.iloc[m, usuario_index] == 0:
            sim_chistes = indices[m].tolist()
            distancia_chistes = distancia[m].tolist()

            # Drop the joke itself from its own neighbour list.
            if m in sim_chistes:
                id_chiste = sim_chistes.index(m)
                sim_chistes.remove(m)
                distancia_chistes.pop(id_chiste)
            else:
                sim_chistes = sim_chistes[:k-1]
                distancia_chistes = distancia_chistes[:k-1]

            # Cosine similarity = 1 - cosine distance.
            chiste_similar = [1-x for x in distancia_chistes]
            chiste_similar_copy = chiste_similar.copy()
            nominador = 0

            for s in range(0, len(chiste_similar)):
                if df.iloc[sim_chistes[s], usuario_index] == 0:
                    # Neighbour not rated by the user: remove its weight.
                    if len(chiste_similar_copy) == (numero_k - 1):
                        chiste_similar_copy.pop(s)
                    else:
                        chiste_similar_copy.pop(s-(len(chiste_similar)-len(chiste_similar_copy)))
                else:
                    nominador = nominador + chiste_similar[s]*df.iloc[sim_chistes[s], usuario_index]

            # Similarity-weighted average over the neighbours that were rated.
            if len(chiste_similar_copy) > 0:
                if sum(chiste_similar_copy) > 0:
                    prediccion_rating = nominador/sum(chiste_similar_copy)
                else:
                    prediccion_rating = 0
            else:
                prediccion_rating = 0

            df1.iloc[m, usuario_index] = prediccion_rating

    recomendar_chistes(usuario, n_recomendacion)

In [None]:
recomendador_chistes(2000, 10, 5)

Based on your tastes, the recommended jokes are the following: 

1: "May I take your order?" the waiter asked. 

"Yes, how do you prepare your chickens?" 

"Nothing special sir," he replied. "We just tell them straight out
that they're going to die."
 --- Rating:8.555944999999998
2: Q: What is the Australian word for a boomerang that won't
   come back? 

A: A stick
 --- Rating:8.555944999999998
3: A dog walks into Western Union and asks the clerk to send a telegram. He fills out a form on which he
writes down the telegram he wishes to send: "Bow wow wow, Bow wow wow."

The clerk says, "You can add another 'Bow wow' for the same price."

The dog responded, "Now wouldn't that sound a little silly?" 
 --- Rating:7.886754219543453
4: Q. Did you hear about the dyslexic devil worshiper? 

A. He sold his soul to Santa.
 --- Rating:7.753240030771999
5: Q:  What did the blind person say when given some matzah?

A:  Who the hell wrote this?
 --- Rating:5.346952669160046
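
Both `recomendador_peliculas` and `recomendador_chistes` end with the same prediction step: a similarity-weighted average over the neighbours the user has actually rated. A compact sketch of that step on toy numbers follows. The function name `predict_rating` and the values are assumptions for illustration, and a rating of 0 is treated as "not rated", as in the notebook.

```python
import numpy as np

def predict_rating(similarities, neighbour_ratings):
    """Similarity-weighted average over neighbours with a non-zero rating."""
    similarities = np.asarray(similarities, dtype=float)
    neighbour_ratings = np.asarray(neighbour_ratings, dtype=float)
    rated = neighbour_ratings != 0            # ignore neighbours the user did not rate
    weight_sum = similarities[rated].sum()
    if weight_sum <= 0:                       # nothing usable to average over
        return 0.0
    weighted = (similarities[rated] * neighbour_ratings[rated]).sum()
    return float(weighted / weight_sum)

# Three neighbours; the user did not rate the second one.
example_prediction = predict_rating([0.9, 0.8, 0.5], [4.0, 0.0, 2.0])
print(example_prediction)
```

In the Jester case the matrix holds mean-adjusted ratings, so the value printed as `Rating` appears to be a deviation from the joke's mean rather than a score on the original -10 to 10 scale.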



## 3. Conclusions


1. Analysis of the results obtained and comparison of both approaches (UBCF and IBCF).

 To understand the results, each case is interpreted in turn, followed by a conclusion on the approaches applied.

>  **UBCF:** Although the new user's first 10 ratings are all for adventure movies, those movies also carry several other genres, so the recommendations are varied in genre as well and adventure does not predominate. Interestingly, the user rated 10 movies that marked milestones in cinema, and the recommendations also include well-known titles such as **The Godfather** and **The Matrix**, which relate to the rated movies that contain action. For the jokes dataset, the most interesting case is the jokes involving professionals, both of which have the same structure.
>
>  **IBCF:** The item-based collaborative approach draws on items similar to those the user has already rated, which in my view is an advantage because the recommendations stay close to the user's own tastes. To process the new user, a .csv file was used containing 10 movies or jokes with their assigned ratings. Because the movies were chosen in a "semi-random" way, most belonged to the action and war genres; only one was a comedy, but it had a high score. The expected result was movies in the action or war genres, or something close to them, since 9 of the 10 belonged to that field. I was therefore surprised to see **8 ½ Women (a.k.a. 8 1/2 Women) (a.k.a. Eight and a Half Women) (1999)**, a comedy, in second place in the output above. I think this recommendation arises mainly because the comedy the new user rated had a high rating, so the item-item algorithm looked for movies similar to it.
>
>  For the Jester dataset, the recommendations derived from the new user's joke ratings are very good. The new user's tastes consisted mostly of clean jokes, the kind that get you out of a tight spot when you are asked to tell one. The recommended jokes are of this same type, as expected, and their predicted ratings are good, that is, they are similar to the items the new user rated. For example, the #1 recommendation is **"May I take your order?" the waiter asked. "Yes, how do you prepare your chickens?" "Nothing special sir," he replied. "We just tell them straight out that they're going to die."** with a rating of 8.555944999999998.

 These results show that, although good recommendations are obtained for the new users on both the MovieLens and Jester datasets, the jokes dataset gives the best results: its predicted ratings are higher than those obtained on the movie dataset. Even so, the movie results are good, since the recommended movies belong to genres similar to those the users rated under both recommendation approaches.
2.  Advantages and disadvantages of both approaches.

>  **UBCF:** One of the main advantages of this approach is greater diversity in the recommendations. In the case of movies, for example, it is rare for a person to watch only one genre, and two people who share a taste for one genre do not necessarily share all their other tastes. This produces more combinations when weighting by users similar to the target user, so the user may be recommended genres they have not consumed so far and might like. In other words, UBCF leads the user to explore more content instead of exploiting the same type of content. Another advantage is that the algorithm is easy to implement. An important disadvantage is that when the number of items to choose from (for example, movies) is large, it can be difficult to find users who have seen the same movies, so UBCF suffers badly when the ratings matrix is highly sparse. Another disadvantage is the well-known cold-start problem: new users do not provide enough information to find their close neighbors, so the recommendation can be ambiguous.
>
>  **IBCF:** Item-item collaborative filtering finds similar items based on items that users have already liked or interacted with positively; that is, recommendations are built from the items. This makes the problem easier to solve, since there will probably be fewer products than users, which makes it more manageable computationally. The approach is also largely unaffected by users changing over time (their personality, their tastes, among others), since the products ideally stay the same. Trends can still change over time, however, because the way products are perceived depends on the cultural and socioeconomic context, social crises, etc. Even so, because we are working with items, they can be considered easier to manage than the data of nearby users, which is why this approach has an advantage when it uses movies or jokes similar to the item being analyzed.

 The main disadvantage arises when trying to recommend an item that has no similarity to anything the user has rated: it falls outside the range of similar items and can produce biased results when the item is ranked.
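
The sparsity point can be made concrete with the figures given in the dataset description at the top of this notebook. The short sketch below computes the share of empty user-item cells for each dataset; the helper name `sparsity` is an assumption introduced here, and the Jester figures describe the subset before the new user is appended.

```python
def sparsity(n_ratings, n_users, n_items):
    """Fraction of user-item cells that carry no rating."""
    return 1.0 - n_ratings / (n_users * n_items)

# Counts taken from the dataset description earlier in the notebook.
movielens_sparsity = sparsity(100836, 610, 9724)
jester_sparsity = sparsity(199900, 1999, 100)

print(f"MovieLens: {movielens_sparsity:.1%} of cells are empty")
print(f"Jester:    {jester_sparsity:.1%} of cells are empty")
```

On these counts the MovieLens matrix is almost entirely empty while the Jester subset is fully rated, which may help explain why the joke predictions look stronger.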
