# Unsupervised Learning: Recommender Systems
#### Maximiliano Vides

### Objective
In this assignment you will integrate into the notebook the content-based recommender system proposed in http://nbviewer.jupyter.org/github/khanhnamle1994/movielens/blob/master/Content_Based_and_Collaborative_Filtering_Models.ipynb.

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

# Dataset

We will use a dataset extracted from https://arxiv.org/ that consists of data on 10000 papers from the mathematics section of that site, downloaded using the arXiv API (https://arxiv.org/help/api/index).
The dataset is stored in JSON format and contains metadata for each article, such as title, abstract, authors, article link, and various tags, among others. Of these, we will use only the abstracts and titles, to try to cluster the papers according to the words they have in common.

We load the dataset, extract the titles and abstracts of the articles as in the previous lab, and use TfidfVectorizer to represent them.

In [2]:
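import json
# Set the encoding in open(): json.load() no longer accepts an `encoding` argument (removed in Python 3.9).
with open('data.json', encoding='utf-8') as f:
    datamath = json.load(f)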

In [3]:
titles = [entry["title"] for entry in datamath["entries"]]
len(titles)


10000

In [4]:
summaries = [entry["summary"] for entry in datamath["entries"]]
len(summaries)

10000

In [5]:
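from sklearn.feature_extraction.text import TfidfVectorizer
# Unigrams and bigrams, with English stop words removed. min_df=1 keeps every term;
# recent scikit-learn versions reject the integer min_df=0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(summaries)
tfidf_matrix.shape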

(10000, 349232)
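To make the representation concrete, here is a minimal sketch on a toy corpus. The three documents are invented for illustration and are not taken from the arXiv dataset; the vectorizer settings mirror the ones used above. Each row is one document and each column is one distinct unigram or bigram.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not part of the arXiv dataset).
toy_docs = [
    "fuzzy matrix models",
    "fuzzy algebra",
    "permutation matrix groups",
]

toy_tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
toy_matrix = toy_tf.fit_transform(toy_docs)

# One row per document, one column per unigram or bigram in the vocabulary.
print(toy_matrix.shape)
print(sorted(toy_tf.vocabulary_))

# With the default norm='l2', every row has unit Euclidean length.
toy_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(toy_norms)
```

The large number of columns in the real matrix comes mostly from bigrams, since nearly every pair of adjacent words in an abstract adds a new column.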

# Content-Based Recommendation

To implement the content-based recommender we will use the cosine similarity between the article abstracts.

In [6]:
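from sklearn.metrics.pairwise import linear_kernel
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# computed by linear_kernel equals the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[:4, :4]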

array([[1.        , 0.00753694, 0.04343791, 0.27474781],
       [0.00753694, 1.        , 0.00369972, 0.005619  ],
       [0.04343791, 0.00369972, 1.        , 0.07211436],
       [0.27474781, 0.005619  , 0.07211436, 1.        ]])
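The reason `linear_kernel` can stand in for cosine similarity is that TF-IDF rows have unit length by default, so the dot product of two rows is already their cosine. The following sketch checks this on a small invented corpus (the documents are assumptions made for illustration only).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Small invented corpus, used only to check the equivalence.
check_docs = [
    "classification of equivariant vector bundles over the real projective plane",
    "classification of equivariant vector bundles over the two-torus",
    "countability of the real numbers",
]

X = TfidfVectorizer().fit_transform(check_docs)

dot = linear_kernel(X, X)       # plain dot products between rows
cos = cosine_similarity(X, X)   # dot products divided by the row norms

# The rows are already L2-normalized, so both matrices coincide.
print(np.allclose(dot, cos))
```

If the vectorizer were created with `norm=None`, the two results would differ and `cosine_similarity` would be the one to use.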

Next, we look at the most "similar" articles to check that it works well.

In [7]:
# Print every pair among the first 1000 papers whose similarity exceeds 0.5.
for i in range(1000):
    for j in range(i + 1, 1000):
        if cosine_sim[i, j] > 0.5:
            print(i, titles[i])
            print(j, titles[j])
            print("cosine similarity: %r" % cosine_sim[i, j])
            print('=' * 60)

7 Right Bousfield Localization and Operadic Algebras
949 Right Bousfield Localization and Eilenberg-Moore Categories
cosine similarity: 0.5471909482710594
292 Green's function of the problem of bounded solutions in the case of a
  block triangular coefficient
987 Computation of Green's function of the bounded solutions problem
cosine similarity: 0.5631867992305797
303 Classification of equivariant vector bundles over real projective plane
304 Classification of equivariant vector bundles over two-torus
cosine similarity: 0.5306357243588083
328 Ore's theorem for cyclic subfactor planar algebras and applications
739 Ore's theorem on cyclic subfactor planar algebras and beyond
cosine similarity: 0.5272474922502542
370 An Introduction to Hilbert Module Approach to Multivariable Operator
  Theory
371 Applications of Hilbert Module Approach to Multivariable Operator Theory
cosine similarity: 0.7079576076445369
557 Functions of perturbed operators
560 Operator Hölder--Zygmund functions
cosine 

We observe that the most similar articles include versions of the same paper, translations (which share the same English abstract), and volumes of the same article, so it seems to work quite well.

We now define the function "recommendations", which takes the title of a paper as input and returns the 20 most similar papers.

In [8]:
Titles = pd.Series(titles)


def recommendations(title):
    # Position of the paper in the dataset (first match if the title is repeated).
    idx = titles.index(title)
    # Pair each paper index with its similarity to the query, highest first.
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0, assumed to be the query paper itself, and keep the next 20.
    sim_scores = sim_scores[1:21]
    paper_indices = [i[0] for i in sim_scores]
    sim_scores = [i[1] for i in sim_scores]
    df = pd.DataFrame({'Title': Titles.iloc[paper_indices], 'Cosine Similiarity': sim_scores})
    print(title)
    return df
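The full similarity matrix is dense: with 10000 papers and float64 entries it takes about 800 MB. A possible alternative, sketched below under assumptions, computes only the row for the queried paper. The helper name `top_k_similar` and the toy titles and abstracts are hypothetical and are not part of the original notebook. Unlike the function above, this sketch excludes the query by its index rather than by assuming it is ranked first, which matters when another paper has an identical abstract.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def top_k_similar(idx, matrix, k=20):
    """Return (indices, scores) of the k rows most similar to row idx."""
    # Similarity of the query row against every row: shape (n_docs,).
    scores = linear_kernel(matrix[idx], matrix).ravel()
    scores[idx] = -np.inf                 # exclude the query itself
    k = min(k, len(scores) - 1)
    top = np.argsort(scores)[::-1][:k]    # highest scores first
    return top, scores[top]


# Toy data (hypothetical), standing in for `titles` and `summaries`.
toy_titles = ["Fuzzy models", "Fuzzy algebra", "Permutation groups", "Permutation matrices"]
toy_summaries = [
    "fuzzy matrix models for social analysis",
    "fuzzy algebra and fuzzy matrix theory",
    "primitive permutation groups containing a cycle",
    "classification of permutation matrices",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_summaries)
top_idx, top_scores = top_k_similar(0, toy_tfidf, k=2)
result = pd.DataFrame({'Title': [toy_titles[i] for i in top_idx],
                       'Cosine Similiarity': top_scores}, index=top_idx)
print(result)
```

The trade-off is one sparse matrix product per query instead of a single large precomputation, which is usually acceptable when recommendations are requested one paper at a time.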

Next we look at some examples to evaluate how the recommender works.

In [9]:
x=5563 
recommendations(titles[x])

Applications of Bimatrices to some Fuzzy and Neutrosophic Models


| | Title | Cosine Similiarity |
|---|---|---|
| 5477 | Basic Neutrosophic Algebraic Structures and their Application to Fuzzy\n and Neutrosophic Models | 0.270344 |
| 5546 | Introduction to Bimatrices | 0.221517 |
| 5581 | Fuzzy and Neutrosophic Analysis of Periyar's views on Untouchability | 0.21889 |
| 5590 | Introduction to n-adaptive fuzzy models to analyze public opinion on\n AIDS | 0.211508 |
| 6441 | Super Fuzzy Matrices and Super Fuzzy Models for Social Scientists | 0.208041 |
| 5697 | Elementary fuzzy matrix theory and fuzzy models for social scientists | 0.187087 |
| 5403 | Smarandache Fuzzy Algebra | 0.181675 |
| 6278 | Mathematical Analysis of the Problems faced by the People With\n Disabilities (PWDs) | 0.179262 |
| 5988 | Reservation for Other Backward Classes in Indian Central Government\n Institutions like IITs, I... | 0.173588 |
| 6528 | New classes of Neutrosophic Linear Algebras | 0.16579 |


In [10]:
recommendations(titles[5305])

The cardinality of the set of real numbers


| | Title | Cosine Similiarity |
|---|---|---|
| 5438 | Countability of the Real Numbers | 0.295263 |
| 5440 | Remarks on Cantor's diagonalization proof of 1891 | 0.200049 |
| 5465 | A severe inconsistency of transfinite set theory | 0.196416 |
| 5823 | The property of the set of the real numbers generated by a\n Gelfond-Schneider operator and the... | 0.19268 |
| 5896 | The Continuum is Countable: Infinity is Unique | 0.192603 |
| 6033 | Cantor versus Cantor | 0.148338 |
| 5439 | The Meaning of Infinity | 0.137424 |
| 5383 | On Cantor's important proofs | 0.132368 |
| 4504 | Continuous Maps on Aronszajn Trees | 0.132241 |
| 3005 | A new approach to the real numbers | 0.130614 |


In [11]:
recommendations(titles[7087])

A Note on the Classification of Permutation Matrix


| | Title | Cosine Similiarity |
|---|---|---|
| 5558 | Tensor Permutation Matrices in Finite Dimensions | 0.271487 |
| 8650 | Pseudospectra of Isospectrally Reduced Matrices and Systems | 0.131143 |
| 7704 | An algebraic approach to representations of the permutation group | 0.118978 |
| 6703 | The Neighbor Matrix: Generalizing the Degree Distribution | 0.114383 |
| 4028 | Nonnegative Factorization of a Data Matrix as a Motivational Example for\n Basic Linear Algebra | 0.111425 |
| 4049 | A Short and Elementary Proof of the Two-sidedness of the Matrix-Inverse | 0.11112 |
| 7119 | A Fast Algorithm to Calculate Power Sum of Natural Numbers | 0.10837 |
| 5479 | A matrix-based proof of the quaternion representation theorem for\n four-dimensional rotations | 0.10754 |
| 515 | A Note on the Reduction Formulas for Some Systems of Linear Operator\n Equations | 0.10701 |
| 5048 | Primitive permutation groups containing a cycle | 0.102744 |


In [12]:
recommendations(titles[1560])

Generalizations of Triangle Inequalities to Spherical and Hyperbolic
  Geometry


| | Title | Cosine Similiarity |
|---|---|---|
| 1529 | On the works of Euler and his followers on spherical geometry | 0.148198 |
| 1527 | On Lobachevsky's trigonometric formulae | 0.107013 |
| 7085 | New Proofs of Classical Triangle Inequality | 0.103398 |
| 2355 | Ceva's triangle inequalities | 0.100967 |
| 1531 | Euclidean plane and its relatives; a minimalist introduction | 0.087734 |
| 5831 | A little noticed right triangle | 0.081389 |
| 6039 | Smarandache's Cevian Triangle Theorem in The Einstein Relativistic\n Velocity Model of Hyperbol... | 0.081108 |
| 3971 | Revisiting the quadrisection problem of Jacob Bernoulli | 0.077686 |
| 1517 | Generalization of the Apollonius Circles | 0.075631 |
| 7054 | Mercer's inequality for h-convex functions | 0.075124 |


In these examples the recommendations appear to perform well. Although not every recommendation seems relevant, since some have a fairly low similarity score, overall the recommender seems to work quite well.
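One way to act on the low-score observation would be to drop recommendations below a minimum similarity. The sketch below is only an illustration: the helper `filter_by_score` and the cutoff of 0.15 are assumptions, not part of the original notebook, and a sensible cutoff would have to be chosen by inspecting the results. The sample rows are four entries copied from the output for "The cardinality of the set of real numbers" shown above.

```python
import pandas as pd


def filter_by_score(df, min_score=0.15, score_col='Cosine Similiarity'):
    """Keep only the rows whose similarity is at least min_score."""
    return df[df[score_col] >= min_score]


# Four rows taken from the recommendations for paper 5305 shown earlier.
sample = pd.DataFrame(
    {'Title': ["Countability of the Real Numbers",
               "Remarks on Cantor's diagonalization proof of 1891",
               "A severe inconsistency of transfinite set theory",
               "Cantor versus Cantor"],
     'Cosine Similiarity': [0.295263, 0.200049, 0.196416, 0.148338]},
    index=[5438, 5440, 5465, 6033],
)

# With the hypothetical cutoff of 0.15, the last row (0.148338) is dropped.
filtered = filter_by_score(sample, min_score=0.15)
print(filtered)
```

A fixed cutoff can leave some papers with no recommendations at all, so it trades coverage for relevance.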