# Final Project

***Team 07***

- Aide Jazmín González Cruz
- Elena Villalobos Nolasco
- Carolina Acosta Tovany

#### Instructions

The final project/exam will consist of:

Implementing the collaborative filtering algorithm using the methodology covered in class (work that uses a different methodology will not be graded).

All machine learning algorithms used must have been written by you. You may use only Transformers and helper functions from scikit-learn (to split the data into training and test sets, to run cross-validation, etc.), but no estimators (logistic regression, support vector machines, k-means, etc.).

You must explain how you obtained the k used to generate the final result.

The files with the small set of ratings and movies will be used; they are available at https://www.kaggle.com/rounakbanik/the-movies-dataset:

- **links_small.csv**: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

- **ratings_small.csv**: The subset of 100,000 ratings from 700 users on 9,000 movies.

To improve the grade (optional, extra credit), you may use the algorithms developed in the course assignments and the relevant data (the records that match the data above) contained in the files:

- **movies_metadata.csv**: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.

- **keywords.csv**: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

- **credits.csv**: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.

The metric used to assess the algorithm's performance is NDCG

(https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG)
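As a rough illustration of the metric, the sketch below computes NDCG@k for a single user with NumPy only. It assumes the linear-gain form of DCG (relevance divided by log2 of position + 1); other formulations, such as exponential gain, exist. The function names `dcg_at_k` and `ndcg_at_k` are hypothetical and are not part of the project code.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevances (linear gain)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    # Position i (1-based) is discounted by log2(i + 1)
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

def ndcg_at_k(true_ratings, predicted_scores, k=5):
    """NDCG@k: DCG of the predicted ordering divided by the ideal DCG."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    predicted_scores = np.asarray(predicted_scores, dtype=float)
    # Rank the items by predicted score, highest first
    order = np.argsort(-predicted_scores, kind="stable")
    # The ideal ordering sorts the items by their true rating
    ideal = np.sort(true_ratings)[::-1]
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0  # no relevant items, so the score is defined as 0 here
    return dcg_at_k(true_ratings[order], k) / idcg
```

A prediction that ranks the items in the same order as the true ratings scores 1; any other ordering scores less.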

Once the ratings matrix has been obtained, the program must be able to return the top 5 recommendations for the queried user or users.

The project will be delivered as a Jupyter notebook. The readme file must contain the instructions for running the code. Make sure that the program runs without errors when those instructions are followed.

It must be uploaded to the proyecto_final/equipo_xx folder in the GitHub repository before 7:00 am on the day of the final exam (December 14, 2020).

In [3]:
# Import the required packages
import pandas as pd
import numpy as np
import random

The objective function:
    
$$J(X) = \frac{1}{2} \displaystyle\sum_{(a,i)\in\mathbb{D}} \left(Y_{ai}-\left [ UV^T \right ]_{ai} \right)^2 + \frac{\lambda}{2} \displaystyle\sum_{a=1}^n \displaystyle\sum_{j=1}^k U_{aj}^2 + \frac{\lambda}{2} \displaystyle\sum_{i=1}^m \displaystyle\sum_{j=1}^k V_{ij}^2$$
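To make the formula concrete, the following sketch evaluates J on a small matrix and minimizes it by alternating least squares. This is an assumption for illustration: the notebook does not state which optimization method was covered in class. The names `objective` and `update_rows` and the toy data are hypothetical; NaN marks entries outside $\mathbb{D}$, and `lam` must be positive so that every linear system is solvable.

```python
import numpy as np

def objective(Y, U, V, lam):
    """J: half the squared error on observed entries plus the two penalties."""
    observed = ~np.isnan(Y)
    # Unobserved entries contribute nothing to the first sum
    residual = np.where(observed, Y - U @ V.T, 0.0)
    penalty = np.sum(U ** 2) + np.sum(V ** 2)
    return 0.5 * np.sum(residual ** 2) + 0.5 * lam * penalty

def update_rows(Y, fixed, lam):
    """Exactly minimize J over one factor while the other is held fixed."""
    k = fixed.shape[1]
    updated = np.zeros((Y.shape[0], k))
    for a in range(Y.shape[0]):
        seen = ~np.isnan(Y[a])
        F = fixed[seen]
        # Normal equations of the regularized least-squares problem for row a
        updated[a] = np.linalg.solve(F.T @ F + lam * np.eye(k), F.T @ Y[a, seen])
    return updated

# Toy 3 x 3 ratings matrix with missing entries (hypothetical data)
Y_toy = np.array([[5.0, np.nan, 1.0],
                  [4.0, 2.0, np.nan],
                  [np.nan, 1.0, 5.0]])
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 2))
V = rng.normal(size=(3, 2))

history = [objective(Y_toy, U, V, lam=0.1)]
for _ in range(10):
    U = update_rows(Y_toy, V, lam=0.1)    # user factors, V fixed
    V = update_rows(Y_toy.T, U, lam=0.1)  # movie factors, U fixed
    history.append(objective(Y_toy, U, V, lam=0.1))
```

Because each update is an exact minimizer over its block of variables, the values in `history` should never increase from one iteration to the next.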

In [4]:
def load_data():
    """
    Loads the data.
    Returns the ratings matrix as a DataFrame: one row per user, one column
    per movie, with NaN where the user has not rated the movie.
    """
    
    # Load the data
    links_small = pd.read_csv('links_small.csv')
    ratings_small = pd.read_csv('ratings_small.csv')
    
    # Movies in the catalog that no user has rated
    # (computed for reference only; not used below)
    df_mov_u = pd.DataFrame(ratings_small['movieId'])
    df_mov = pd.DataFrame(links_small['movieId'])

    common = df_mov.merge(df_mov_u, on=["movieId"])
    result = df_mov[~df_mov.movieId.isin(common.movieId)]
    
    # Build the matrix Y_ai
    y_ia = links_small.set_index('movieId').join(ratings_small.set_index('movieId'))
    y_ia = y_ia.reset_index()
    y_ia = pd.DataFrame(y_ia.pivot(index='userId', columns='movieId', values='rating'))
    y_ia = pd.DataFrame(y_ia.to_records())
    # Drop the NaN user (produced by the unrated movies)
    y_ia = y_ia[pd.notnull(y_ia['userId'])]
    # Drop the first column, which holds userId
    y_ia = y_ia.drop(['userId'], axis=1)
    
    return y_ia
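The pivot step can be tried without the CSV files. The sketch below builds a user-by-movie matrix from a tiny in-memory table and then reindexes the columns against a catalog so that unrated movies appear as all-NaN columns. The data and the names `ratings`, `catalog`, and `Y_toy` are made up for illustration, and the `reindex` step is an alternative to the join used in `load_data`, not the notebook's own approach.

```python
import pandas as pd

# Hypothetical ratings: three users, three rated movies
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0],
})
# Hypothetical catalog that also contains a movie nobody rated (40)
catalog = pd.DataFrame({"movieId": [10, 20, 30, 40]})

# One row per user, one column per rated movie
Y_toy = ratings.pivot(index="userId", columns="movieId", values="rating")
# Add the catalog movies without ratings as all-NaN columns
Y_toy = Y_toy.reindex(columns=catalog["movieId"])
```

With this variant no NaN user row is created, so there is nothing to drop afterwards.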
    

In [5]:
Y = load_data()
Y

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,161830,161918,161944,162376,162542,162672,163056,163949,164977,164979
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,4.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,4.0,...,,,,,,,,,,
5,,,4.0,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,,,,,,4.0,,,,,...,,,,,,,,,,
668,,,,,,,,,,,...,,,,,,,,,,
669,,,,,,,,,,,...,,,,,,,,,,
670,4.0,,,,,,,,,,...,,,,,,,,,,


In [37]:
maxValues = Y.max() 
maxValues = pd.DataFrame(maxValues)
maxValues = maxValues.max() 
maxValues

0    5.0
dtype: float64

In [39]:
minValues = Y.min() 
minValues = pd.DataFrame(minValues)
minValues = minValues.min() 
minValues

0    0.5
dtype: float64

In [6]:
# Replace NaN with zeros
Y_0 = Y.copy()
Y_0[np.isnan(Y_0)] = 0
Y_0

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,161830,161918,161944,162376,162542,162672,163056,163949,164977,164979
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
670,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
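Once NaN is replaced by 0, the matrix no longer records which entries were actually rated, yet the objective function sums only over the observed pairs in $\mathbb{D}$. One possible way to keep that information, sketched here on hypothetical toy data, is to store a boolean mask next to the zero-filled matrix. The names `Y_small`, `observed`, and `Y_filled` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical 2 x 3 ratings matrix with missing entries
Y_small = pd.DataFrame([[4.0, np.nan, 1.0],
                        [np.nan, 3.5, np.nan]])

# True where a rating exists, False where it was missing
observed = Y_small.notna().to_numpy()
# Zero-filled copy, convenient for matrix arithmetic
Y_filled = Y_small.fillna(0).to_numpy()

# Sum of squares over observed entries only, as in the first term of J
observed_sq_sum = np.sum((Y_filled ** 2)[observed])
```

Since the ratings shown above range from 0.5 to 5.0, a 0 in the filled matrix cannot be mistaken for a real rating, but the explicit mask avoids relying on that.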


In [7]:
Prueba = Y_0.iloc[[0,1,2,3,4], : 10].copy()
Prueba

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
# Add more test data
Prueba.iloc[1, 1] = 1
Prueba.iloc[1, 4] = 5
Prueba

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,4.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [60]:
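# Set the seed
# Note: random.seed(0) seeds only Python's random module, not NumPy's
# generator, so it does not make np.random.randint below reproducible;
# np.random.seed(0) would be needed for that.
random.seed(0)
random_matrix = np.random.randint(1,5,(5,10))
random_matrix = pd.DataFrame(random_matrix)
# Copy in the ratings from the original matrix
random_matrix.iloc[1, 1] = 1
random_matrix.iloc[1, 4] = 5
random_matrix.iloc[1, 9] = 4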
random_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,4,2,2,3,4,2,2,1,1,1
1,4,1,4,3,5,1,4,1,4,4
2,1,2,4,2,1,1,1,4,2,2
3,1,1,3,2,3,3,3,2,1,3
4,4,2,2,1,2,3,2,4,1,2


In [54]:
def recomendaciones_id(Y, MCompleta, top=5):
    """
    Returns the ids of the movies recommended to one user.
    Params: Y, the user's row of the ratings matrix with 0 for unrated movies;
            MCompleta, the user's row of the predicted matrix;
            top, the number of recommendations (default 5)
    Return: array of ids of new suggestions, and array of ids of suggestions
            that may include movies the user has already rated
    """
    
    # ------------------------------------ New recommendations only ------------------------------
    # Multiply each prediction by a boolean so that movies the user has
    # already rated get a score of 0
    nvos = []
    # Add the index
    for i in range (len(MCompleta)):
        # Add 1 so that ids start at 1 (1-based position in the row)
        nvos.append([i+1,MCompleta[i]*(~Y[i].any())])
    
    # nvos is a list; convert it to a DataFrame for easier handling
    nvos = pd.DataFrame(nvos)
    # Name the columns
    nvos.columns = ['id_movie','rating_recom']
    # Sort in descending order
    nvos = nvos.sort_values(by=['rating_recom'], ascending=False)
    # Remove the zeros
    nvos = nvos[nvos['rating_recom'] != 0]
    # Take the top
    recomendaciones = nvos['id_movie'].head(top).to_numpy()
    
    # ------------------------------------ Recommendations including rated movies ----------
    # Include the movies already rated
    todos = []
    # Add the index
    for i in range (len(MCompleta)):
        # Add 1 so that ids start at 1 (1-based position in the row)
        todos.append([i+1,MCompleta[i]])
    # todos is a list; convert it to a DataFrame for easier handling
    todos = pd.DataFrame(todos)
    # Name the columns
    todos.columns = ['id_movie','rating_recom']
    # Sort in descending order
    todos = todos.sort_values(by=['rating_recom'], ascending=False)
    # Take the top
    recomendaciones_t = todos['id_movie'].head(top).to_numpy()  
    
    
    return recomendaciones, recomendaciones_t


In [55]:
user = 1
Prueba.iloc[user]

1     0.0
2     1.0
3     0.0
4     0.0
5     5.0
6     0.0
7     0.0
8     0.0
9     0.0
10    4.0
Name: 2, dtype: float64

In [61]:
# Prueba is the matrix Y with NaN replaced by 0, and random_matrix is the predicted matrix
recomendaciones_id(np.array(Prueba.iloc[user]), np.array(random_matrix.iloc[user]))

(array([1, 3, 7, 9, 4], dtype=int64), array([5, 1, 3, 7, 9], dtype=int64))
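The first of the two results can also be obtained without a loop or a DataFrame. The sketch below is a vectorized variant under the same assumption that 0 marks an unrated movie; `top_n_unseen` is a hypothetical name. Unlike `recomendaciones_id`, it excludes only the movies the user has rated and does not drop unseen movies whose predicted score is exactly 0.

```python
import numpy as np

def top_n_unseen(user_ratings, predicted_row, top=5):
    """1-based positions of the highest predicted scores among unrated movies."""
    user_ratings = np.asarray(user_ratings, dtype=float)
    scores = np.asarray(predicted_row, dtype=float).copy()
    # Movies the user has already rated can never be recommended
    scores[user_ratings != 0] = -np.inf
    # A stable sort keeps tied scores in their original column order
    order = np.argsort(-scores, kind="stable")
    order = order[np.isfinite(scores[order])]
    return order[:top] + 1

# Same rows as in the cell above: the user's ratings and the predictions
user_row = [0, 1, 0, 0, 5, 0, 0, 0, 0, 4]
predicted = [4, 1, 4, 3, 5, 1, 4, 1, 4, 4]
new_recommendations = top_n_unseen(user_row, predicted)
```

For these two rows the result matches the first array returned by `recomendaciones_id` above.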