Content-Based Recommender System Example
===
Import the required libraries
---

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Load the book data
--

In [2]:
df = pd.read_csv('datasets/book-crossing/Books.csv',
                 on_bad_lines='skip',
                 sep=';')

In [3]:
df.head()

Unnamed: 0,ISBN,Title,Author,Year,Publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


Load the user and rating data


In [4]:
user_ratings = pd.read_csv('datasets/book-crossing/users-ratings.csv')

Filter the data: keep only the subset of books that appear in the ratings of at least one user

In [5]:
df = df[df['ISBN'].isin(user_ratings['ISBN'])]

* Fill in the missing values
* Create a new variable `text` that combines the title and the author

In [6]:
# Replace missing titles and authors with empty strings
df = df.fillna({'Title': '', 'Author': ''})
# Lowercase both fields so that matching is case-insensitive
df['Title'] = df['Title'].str.lower()
df['Author'] = df['Author'].str.lower()
# Combine title and author into a single text field
df['text'] = df['Title'] + ' ' + df['Author']

* Remove duplicates by author and title
* Drop rows with empty values

In [7]:
df.drop_duplicates(subset=['Author', 'Title'], inplace=True)
df.dropna(subset=['text'], inplace=True)

Since the code below relies on the positions of books in the DataFrame, reset the index.

In [8]:
df.reset_index(drop=True, inplace=True)
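
The reset matters because `cosine_sim` is addressed by row position, while filtering and dropping duplicates leave gaps in the index labels. A small sketch with a made-up frame (the name `toy` and its values are purely illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'ISBN': ['a', 'b', 'c', 'd']})
toy = toy[toy['ISBN'].isin(['b', 'd'])]   # filtering keeps the old labels 1 and 3
labels_before = toy.index.tolist()        # labels no longer match row positions

toy = toy.reset_index(drop=True)
labels_after = toy.index.tolist()         # labels now equal row positions 0 and 1
```

Without the reset, a label taken from `df.index` could point at the wrong row, or past the end, of the similarity matrix.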

Apply TF-IDF to transform the text into a matrix

In [9]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['text'])
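
To make the weighting concrete, here is a minimal pure-Python sketch of what `TfidfVectorizer` computes with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper name `tfidf_rows` and the whitespace tokenizer are assumptions for illustration; the real vectorizer also applies its own tokenization and, in this notebook, the English stop-word list.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-idf, L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]  # simplistic whitespace tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df_counts = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + df_counts[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0  # guard against empty docs
        rows.append([w / norm for w in row])
    return vocab, rows

vocab, rows = tfidf_rows(["the firm john grisham", "the client john grisham"])
firm, john = vocab.index("firm"), vocab.index("john")
print(rows[0][firm] > rows[0][john])  # the rarer term gets the larger weight
```

Terms shared by many books, such as a prolific author's name, are down-weighted relative to terms that distinguish one title from another.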

Compute the cosine similarity between the vectors

In [10]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
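
For intuition, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms. A minimal sketch in plain Python (the helper `cosine` is hypothetical and not part of scikit-learn):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2, 0], [2, 4, 0])  # parallel vectors
orthogonal = cosine([1, 0, 0], [0, 1, 0])      # no shared terms
```

Because TF-IDF weights are non-negative, the scores fall between 0 and 1. Note that calling `cosine_similarity` on the full matrix produces a dense n × n array, so memory use grows quadratically with the number of books.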

Find books similar to a given one.

In [11]:
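def get_recommendations(isbn, cosine_sim=cosine_sim, top_n=10):
    # Row position of the requested book (raises IndexError if the ISBN is absent)
    idx = df.index[df['ISBN'] == isbn].tolist()[0]

    # Pair every book position with its similarity to the requested book
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the book itself) and keep the next top_n
    sim_scores = sim_scores[1:top_n + 1]

    book_indices = [i[0] for i in sim_scores]

    return df.iloc[book_indices]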
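
The ranking step of this function can be looked at in isolation. The sketch below uses a made-up similarity row and a hypothetical helper `top_similar`; it is an illustration of the sort-and-slice idea, not a replacement for the function above.

```python
def top_similar(sim_row, top_n):
    """Return indices of the top_n most similar items, skipping the first hit."""
    ranked = sorted(enumerate(sim_row), key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in ranked[1:top_n + 1]]

# Similarity of item 1 to every item; item 1 matches itself with score 1.0
row = [0.2, 1.0, 0.7, 0.0, 0.7]
print(top_similar(row, 2))  # [2, 4]
```

Skipping the first entry assumes the query book always ranks first. If another book had identical text, the two would tie at the top and the query book itself could slip into the results; dropping duplicates by author and title beforehand reduces that risk.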

In [12]:
df[df['ISBN']=='0440234743']

Unnamed: 0,ISBN,Title,Author,Year,Publisher,text
0,440234743,the testament,john grisham,1999,Dell,the testament john grisham


In [13]:
get_recommendations('0440234743')

Unnamed: 0,ISBN,Title,Author,Year,Publisher,text
153,0385424728,the chamber,john grisham,1994,Doubleday Books,the chamber john grisham
154,0385472951,the partner,john grisham,1997,Doubleday Books,the partner john grisham
322,0385510438,the last juror,john grisham,2004,Doubleday,the last juror john grisham
61,0385497466,the brethren,john grisham,2000,Doubleday,the brethren john grisham
64,0385511612,bleachers,john grisham,2003,Doubleday,bleachers john grisham
99,038542471X,the client,john grisham,1993,Doubleday Books,the client john grisham
114,044021145X,the firm,john grisham,1992,Bantam Dell Publishing Group,the firm john grisham
228,044022165X,the rainmaker,john grisham,1996,Dell,the rainmaker john grisham
854,0440241073,the summons,john grisham,2002,Dell Publishing Company,the summons john grisham
62,0385508042,the king of torts,john grisham,2003,Doubleday Books,the king of torts john grisham


Rewrite the function so that recommendations are now selected for a user, taking into account their preferences and the books they have read.

In [16]:
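def get_recommendations(user_id, top_n=10):
    # Books this user rated 8 or higher
    high_rated_books = user_ratings[(user_ratings['User-ID'] == user_id) & (user_ratings['Rating'] >= 8)]
    # Map each ISBN to its row position in df
    indices = pd.Series(df.index, index=df['ISBN'])
    read_indices = [indices[isbn] for isbn in high_rated_books['ISBN'] if isbn in indices.index]
    if not read_indices:
        return df.iloc[[]]  # none of the user's highly rated books are in df

    # Sum the similarity rows of all highly rated books into one score per book
    sim_scores = sum(cosine_sim[idx] for idx in read_indices)
    sim_scores = sorted(enumerate(sim_scores), key=lambda x: x[1], reverse=True)

    # Drop the books already rated highly and keep the top_n remaining
    seen = set(read_indices)
    book_indices = [i[0] for i in sim_scores if i[0] not in seen][:top_n]
    return df.iloc[book_indices]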
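
The aggregation idea can be checked on a tiny example: each candidate's score is the sum of its similarities to the books the user liked, and the liked books themselves are excluded from the result. The matrix values and the helper `recommend_for_profile` below are made up for illustration.

```python
def recommend_for_profile(sim, liked, top_n):
    """Rank items by summed similarity to the liked items, excluding those items."""
    liked_set = set(liked)
    n_items = len(sim)
    # One total score per item: sum of its similarity to every liked item
    totals = [sum(sim[j][i] for j in liked) for i in range(n_items)]
    ranked = sorted(range(n_items), key=lambda i: totals[i], reverse=True)
    return [i for i in ranked if i not in liked_set][:top_n]

sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
]
print(recommend_for_profile(sim, liked=[0, 2], top_n=2))  # [1, 3]
```

An unweighted sum treats every liked book equally. Weighting each row by the user's rating would be one possible variation, though the notebook does not do this.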

In [23]:
(user_ratings[(user_ratings['User-ID']==243)&(user_ratings['Rating']>8)]
.join(df.set_index('ISBN'), on='ISBN'))

Unnamed: 0,User-ID,Age,ISBN,Rating,Title,Author,Year,Publisher,text
0,243,,60915544,10,the bean trees,barbara kingsolver,1989.0,Perennial,the bean trees barbara kingsolver
4,243,,316601950,9,the pilot's wife : a novel,anita shreve,1999.0,Back Bay Books,the pilot's wife : a novel anita shreve
8,243,,316776963,9,me talk pretty one day,david sedaris,2001.0,Back Bay Books,me talk pretty one day david sedaris
14,243,,375400117,10,memoirs of a geisha,arthur golden,1997.0,Alfred A. Knopf,memoirs of a geisha arthur golden
27,243,,425163407,9,,,,,
35,243,,446364800,9,the general's daughter,nelson demille,1993.0,Warner Books,the general's daughter nelson demille


In [17]:
get_recommendations(243)

Unnamed: 0,ISBN,Title,Author,Year,Publisher,text
1099,316788228,the pilot's wife,anita shreve,2001,"Little, Brown",the pilot's wife anita shreve
279,99771519,memoirs of a geisha uk,arthur golden,0,Trafalgar Square,memoirs of a geisha uk arthur golden
342,156006529,where or when : a novel,anita shreve,1999,Harvest Books,where or when : a novel anita shreve
184,316789089,the pilot's wife : a novel tag: author of the ...,anita shreve,1999,"Little, Brown",the pilot's wife : a novel tag: author of the ...
1767,446611913,up country,nelson demille,2003,Warner Vision,up country nelson demille
970,679745203,the english patient,michael ondaatje,1996,Vintage Books USA,the english patient michael ondaatje
345,316789844,resistance : a novel,anita shreve,1997,Back Bay Books,resistance : a novel anita shreve
921,446605409,plum island,nelson demille,1998,Warner Books,plum island nelson demille
718,446608262,the lion's game,nelson demille,2000,Warner Books,the lion's game nelson demille
810,60959037,prodigal summer: a novel,barbara kingsolver,2001,Perennial,prodigal summer: a novel barbara kingsolver
