# Final Project

***Team 07***

- Aide Jazmín González Cruz
- Elena Villalobos Nolasco
- Carolina Acosta Tovany

#### Instructions

The final project/exam consists of:

Implementing the collaborative filtering algorithm using the methodology covered in class (work that uses any other methodology will not be graded).

All machine learning algorithms used must be written by you. You may only use Transformers and helper functions from scikit-learn (to split the data into training and test sets, to run cross-validation, etc.), but no estimators (logistic regression, support vector machines, k-means, etc.).

You must explain how you obtained the k used to generate the final result.

You will use the files with the small set of ratings and movies available at https://www.kaggle.com/rounakbanik/the-movies-dataset:

- **links_small.csv**: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

- **ratings_small.csv**: The subset of 100,000 ratings from 700 users on 9,000 movies.

To improve the grade (optional, extra credit), you may use the algorithms developed in the course assignments together with the relevant data (the records that match the data above) contained in the following files:

- **movies_metadata.csv**: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

- **keywords.csv**: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

- **credits.csv**: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

The metric used to measure the algorithm's performance is the NDCG (https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG).
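As a rough illustration of the metric, the following sketch computes NDCG@k for one user with NumPy. It assumes the linear-gain form of DCG (relevance divided by $\log_2(p+1)$ at position $p$); the function names `dcg_at_k` and `ndcg_at_k` and the toy ratings are hypothetical and not part of the assignment.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevances, taken in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    # Position p (1-based) is discounted by log2(p + 1)
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

def ndcg_at_k(true_ratings, predicted_scores, k=5):
    """NDCG@k: DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    order = np.argsort(predicted_scores)[::-1]   # highest predicted score first
    ideal = np.sort(true_ratings)[::-1]          # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(true_ratings[order], k) / ideal_dcg

# Toy example (made-up values): true ratings of four movies for one user
true_ratings = [5.0, 3.0, 4.0, 1.0]
perfect = ndcg_at_k(true_ratings, [0.9, 0.2, 0.5, 0.1])    # ranks movies as the truth does
reversed_ = ndcg_at_k(true_ratings, [0.1, 0.5, 0.2, 0.9])  # ranks them in the opposite order
print(perfect, reversed_)
```

A ranking that agrees with the true ratings scores 1, and any other ranking scores strictly less when the ratings are distinct.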

Once the ratings matrix has been obtained, the program must be able to return the 5 best recommendations for the user or users queried.
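One possible way to extract the recommendations from a completed ratings matrix is sketched below. The function `top_n_recommendations`, the toy matrix, and the choice to exclude movies the user has already rated are all assumptions made for illustration; the returned values are column positions, which would still need to be mapped back to `movieId`.

```python
import numpy as np

def top_n_recommendations(predicted, observed_mask, user_index, n=5):
    """Return the column indices of the n highest predicted ratings among unrated movies."""
    scores = predicted[user_index].astype(float).copy()
    scores[observed_mask[user_index]] = -np.inf   # assumption: skip movies already rated
    ranked = np.argsort(scores)[::-1]
    # Drop masked entries in case the user has fewer than n unrated movies
    ranked = ranked[np.isfinite(scores[ranked])]
    return ranked[:n]

# Toy data: 2 users, 7 movies (all values are made up)
predicted = np.array([[4.1, 3.0, 4.8, 2.2, 3.9, 4.5, 1.0],
                      [2.0, 4.9, 3.1, 4.0, 1.5, 3.3, 4.4]])
observed_mask = np.array([[True, False, False, False, False, False, False],
                          [False, True, False, False, False, False, False]])

top_user0 = top_n_recommendations(predicted, observed_mask, user_index=0)
print(top_user0)
```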

The project will be submitted as a Jupyter notebook. The readme file must contain the instructions for running the code. Make sure that the program runs without errors when those instructions are followed.

It must be uploaded to the proyecto_final/equipo_xx folder in the GitHub repository before 7:00 am on the day of the final exam (December 14, 2020).

In [1]:
# Import the required packages
import pandas as pd
import numpy as np
import random
from sympy import solve
import sympy
from sympy.tensor.array import derive_by_array
from scipy.optimize import fsolve

In [2]:
np.set_printoptions(precision=3, suppress=True)

In [3]:
# Load the data
links_small = pd.read_csv('links_small.csv')
links_small.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [4]:
links_small.shape

(9125, 3)

In [5]:
ratings_small = pd.read_csv('ratings_small.csv')
ratings_small.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [6]:
ratings_small.shape

(100004, 4)

In [7]:
# Movies in the catalog that no user has rated
df_mov_u = pd.DataFrame(ratings_small['movieId'])
df_mov = pd.DataFrame(links_small['movieId'])

In [8]:
common = df_mov.merge(df_mov_u, on=["movieId"])
common

Unnamed: 0,movieId
0,1
1,1
2,1
3,1
4,1
...,...
99999,161944
100000,162376
100001,162542
100002,162672


In [9]:
result = df_mov[~df_mov.movieId.isin(common.movieId)]
result.shape

(59, 1)
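The same check can probably be done without the intermediate merge, by testing catalog ids directly against the rated ids. The sketch below uses small made-up frames (`links_toy`, `ratings_toy`) standing in for `links_small` and `ratings_small`.

```python
import pandas as pd

# Toy stand-ins for links_small and ratings_small (values are made up)
links_toy = pd.DataFrame({'movieId': [1, 2, 3, 4]})
ratings_toy = pd.DataFrame({'userId': [1, 1, 2],
                            'movieId': [1, 3, 3],
                            'rating': [4.0, 2.5, 5.0]})

# Catalog movies whose movieId never appears among the ratings
unrated = links_toy[~links_toy['movieId'].isin(ratings_toy['movieId'])]
print(unrated['movieId'].tolist())
```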

In [82]:
# Build the Y_ai matrix
y_ia = links_small.set_index('movieId').join(ratings_small.set_index('movieId'))
y_ia = y_ia.reset_index()
y_ia

Unnamed: 0,movieId,imdbId,tmdbId,userId,rating,timestamp
0,1,114709,862.0,7.0,3.0,8.518667e+08
1,1,114709,862.0,9.0,4.0,9.386292e+08
2,1,114709,862.0,13.0,5.0,1.331380e+09
3,1,114709,862.0,15.0,2.0,9.979383e+08
4,1,114709,862.0,19.0,3.0,8.551901e+08
...,...,...,...,...,...,...
100058,162672,3859980,402672.0,611.0,3.0,1.471524e+09
100059,163056,4262980,315011.0,,,
100060,163949,2531318,391698.0,547.0,5.0,1.476419e+09
100061,164977,27660,137608.0,,,


In [88]:
y_ia['movieId'].nunique()

9125

In [11]:
max(y_ia.rating)

5.0

In [12]:
#y_ia.pivot(index="userId", columns="movieId", values="rating") 
y_ia = pd.DataFrame(y_ia.pivot(index='userId', columns='movieId', values='rating'))
y_ia = pd.DataFrame(y_ia.to_records())
y_ia = y_ia[pd.notnull(y_ia['userId'])]
y_ia['userId'] = y_ia['userId'].astype(int)
y_ia = y_ia.drop(['userId'], axis=1)
y_ia

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,161830,161918,161944,162376,162542,162672,163056,163949,164977,164979
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,4.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,4.0,...,,,,,,,,,,
5,,,4.0,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,,,,,,4.0,,,,,...,,,,,,,,,,
668,,,,,,,,,,,...,,,,,,,,,,
669,,,,,,,,,,,...,,,,,,,,,,
670,4.0,,,,,,,,,,...,,,,,,,,,,


In [13]:
users, movies = y_ia.shape

The objective function:
    
$$J(X) = \frac{1}{2} \displaystyle\sum_{(a,i)\in\mathbb{D}} \left(Y_{ai}-\left [ UV^T \right ]_{ai} \right)^2 + \frac{\lambda}{2} \displaystyle\sum_{a=1}^n \displaystyle\sum_{j=1}^k U_{aj}^2 + \frac{\lambda}{2} \displaystyle\sum_{i=1}^m \displaystyle\sum_{j=1}^k V_{ij}^2$$

We take $k = 1$.
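To make the objective concrete, here is a minimal NumPy sketch that evaluates it on the small example used later in this notebook. It assumes that missing ratings are stored as `NaN`, that `U` has shape $n \times k$ and `V` has shape $m \times k$ as in the formula (the notebook cells below store them as row vectors instead), and that $\lambda = 1$; the function name `objective` is hypothetical.

```python
import numpy as np

def objective(Y, U, V, lam):
    """0.5 * squared error over observed entries + 0.5 * lam * (||U||^2 + ||V||^2)."""
    observed = ~np.isnan(Y)                           # D: the (a, i) pairs that have a rating
    residual = np.where(observed, Y - U @ V.T, 0.0)   # unobserved entries contribute nothing
    return 0.5 * np.sum(residual ** 2) + 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))

Y = np.array([[5.0, np.nan, 7.0],
              [1.0, 2.0, np.nan]])
U = np.array([[2.0], [2.0]])           # n x k with n = 2 users, k = 1
V = np.array([[2.0], [7.0], [8.0]])    # m x k with m = 3 movies, k = 1
J = objective(Y, U, V, lam=1.0)
print(J)
```

With these values the squared errors sum to 235 and the squared norms to 125, so the printed value is 180.0.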

In [14]:
# Set the seed; V and U are drawn with np.random, so NumPy's generator is seeded too
random.seed(0)
np.random.seed(0)
# Randomly initialize the movie vector V
V = np.random.randint(1,9,size = (1,movies))
V.shape

(1, 9125)

In [15]:
# Initialize the user vector U
U = np.random.randint(1,9,size = (1,users))
U.shape

(1, 671)

In [45]:
a =  np.array([[ 2,  7,  8]])
b = a*a.T
b = 1/b
sum(b)


array([0.384, 0.11 , 0.096])

In [51]:
yai = np.array([[5,0, 7.0],
                [1,   2,0]])
yai

array([[5., 0., 7.],
       [1., 2., 0.]])

In [18]:
v = np.array([2, 7, 8])
v

array([2, 7, 8])

In [19]:
u = np.array([2,2])
u

array([2, 2])

In [20]:
# Outer product
np.outer(u,v)

array([[ 4, 14, 16],
       [ 4, 14, 16]])

In [53]:
u1, u2 = sympy.symbols("u1, u2")

In [54]:
fo_u1 = (5-2*u1)**2/2 + (7-8*u1)**2/2 + 1/2 * u1**2

In [55]:
fo_u1

0.5*u1**2 + (5 - 2*u1)**2/2 + (7 - 8*u1)**2/2

In [42]:
gf_u1 = derive_by_array(fo_u1, (u1))

In [43]:
gf_u1

69.0*u1 - 66

In [44]:
solve(gf_u1)

[0.956521739130435]

In [27]:
fo_u2 = (1-2*u2)**2/2 + (2-7*u2)**2/2 + 1/2*u2**2
fo_u2

0.5*u2**2 + (1 - 2*u2)**2/2 + (2 - 7*u2)**2/2

In [28]:
gf_u2 = derive_by_array(fo_u2, (u2))
gf_u2

54.0*u2 - 16

In [29]:
solve(gf_u2)

[0.296296296296296]
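For $k = 1$ with $V$ held fixed, setting the derivative of the objective with respect to $u_a$ to zero gives $u_a = \sum_i Y_{ai} v_i / (\lambda + \sum_i v_i^2)$, where both sums run over the movies rated by user $a$. The sketch below applies that formula to the same toy data and should reproduce the two symbolic solutions above; it assumes missing ratings are stored as `NaN` and $\lambda = 1$, and the function name `update_users_k1` is hypothetical.

```python
import numpy as np

def update_users_k1(Y, v, lam):
    """Closed-form minimizer over each u_a with v fixed, for k = 1."""
    observed = ~np.isnan(Y)
    numer = np.where(observed, Y * v, 0.0).sum(axis=1)   # sum_i Y_ai * v_i over rated movies
    denom = lam + (observed * v ** 2).sum(axis=1)        # lam + sum_i v_i^2 over rated movies
    return numer / denom

Y = np.array([[5.0, np.nan, 7.0],
              [1.0, 2.0, np.nan]])
v = np.array([2.0, 7.0, 8.0])
u = update_users_k1(Y, v, lam=1.0)
print(u)   # 66/69 and 16/54
```

The same formula with the roles of users and movies swapped updates $v$ while $u$ is fixed, which is the alternating step of the method.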

In [30]:
ula = np.array([solve(gf_u1),solve(gf_u2)])

In [31]:
ula

array([[33/34],
       [0.296296296296296]], dtype=object)

In [32]:
yai

array([[5, None, 7.0],
       [1, 2, None]], dtype=object)

In [35]:
33/34

0.9705882352941176

In [36]:
66/68

0.9705882352941176

In [37]:
22/23

0.9565217391304348

In [39]:
66/69

0.9565217391304348

In [47]:
np.outer(a,a.T)

array([[ 4, 14, 16],
       [14, 49, 56],
       [16, 56, 64]])

In [49]:
np.outer(v,u)

array([[ 4,  4],
       [14, 14],
       [16, 16]])

In [52]:
yai

array([[5., 0., 7.],
       [1., 2., 0.]])

In [56]:
v

array([2, 7, 8])

In [60]:
np.dot(v.T,v)

117

In [None]:
# Movies in the catalog compared against movies_metadata
# (movies_metadata is loaded in In [66]; its identifier column is 'id', not 'movieId')
df_mov_u = pd.DataFrame(movies_metadata['id'])
df_mov = pd.DataFrame(links_small['movieId'])

In [66]:
movies_metadata = pd.read_csv('movies_metadata.csv',low_memory=False)
movies_metadata.shape

(45466, 24)

In [72]:
movies_metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [73]:
movies_metadata.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [89]:
movies_metadata['imdb_id'].nunique()

45417

In [90]:
movies_metadata['title'].nunique()

42277

In [91]:
movies_metadata['id'].nunique()

45436

In [94]:
df_mov_u = pd.DataFrame(ratings_small['movieId'])
df_mov_u

Unnamed: 0,movieId
0,31
1,1029
2,1061
3,1129
4,1172
...,...
99999,6268
100000,6269
100001,6365
100002,6385


In [95]:
df_mov = pd.DataFrame(links_small['movieId'])
df_mov

Unnamed: 0,movieId
0,1
1,2
2,3
3,4
4,5
...,...
9120,162672
9121,163056
9122,163949
9123,164977


In [93]:
common = df_mov.merge(df_mov_u, on=["movieId"])
common

Unnamed: 0,movieId
0,1
1,1
2,1
3,1
4,1
...,...
99999,161944
100000,162376
100001,162542
100002,162672


In [146]:
hola1 = pd.DataFrame(movies_metadata[['id','title']])
hola2 = pd.DataFrame(y_ia['movieId'].unique(),columns=['movieId'])

In [149]:
hola1 = hola1.rename(columns = {'id':'movieId'})

In [150]:
hola1

Unnamed: 0,movieId,title
0,862,Toy Story
1,8844,Jumanji
2,15602,Grumpier Old Men
3,31357,Waiting to Exhale
4,11862,Father of the Bride Part II
...,...,...
45461,439050,Subdue
45462,111109,Century of Birthing
45463,67758,Betrayal
45464,227506,Satan Triumphant


In [151]:
hola2

Unnamed: 0,movieId
0,1
1,2
2,3
3,4
4,5
...,...
9120,162672
9121,163056
9122,163949
9123,164977


In [162]:
result = hola2[~hola2.movieId.isin(hola1.movieId)]

In [163]:
result

Unnamed: 0,movieId
0,1
1,2
2,3
3,4
4,5
...,...
9120,162672
9121,163056
9122,163949
9123,164977
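The comparison above appears to leave essentially every `movieId` unmatched. A likely reason is that `id` in `movies_metadata` is the TMDB identifier rather than the MovieLens `movieId`: Toy Story has `id` 862 there, which is the `tmdbId` of `movieId` 1 in `links_small`. The sketch below joins on `tmdbId` instead, using toy frames built from the rows shown earlier. It also assumes that `id` may be read as text and may contain non-numeric values, which is why it is converted with `pd.to_numeric`; the third metadata row is made up to illustrate that case.

```python
import pandas as pd

# Toy stand-ins for links_small and movies_metadata
links_toy = pd.DataFrame({'movieId': [1, 2, 3],
                          'tmdbId': [862.0, 8844.0, 15602.0]})
metadata_toy = pd.DataFrame({'id': ['862', '8844', 'not-a-number'],
                             'title': ['Toy Story', 'Jumanji', 'Hypothetical malformed row']})

# Convert the metadata id to a number; unparseable values become NaN and are dropped
metadata_toy['tmdbId'] = pd.to_numeric(metadata_toy['id'], errors='coerce')
metadata_toy = metadata_toy.dropna(subset=['tmdbId'])

# Join on the TMDB identifier, keeping every catalog movie
titles = links_toy.merge(metadata_toy[['tmdbId', 'title']], on='tmdbId', how='left')
missing = titles[titles['title'].isna()]
print(titles)
print(missing['movieId'].tolist())
```

Catalog movies that end up in `missing` are the ones with no metadata row under this join.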
