[Markdown styles](https://towardsdatascience.com/enrich-your-jupyter-notebook-with-these-tips-55c8ead25255)

[Hydra overview](https://towardsdatascience.com/stop-hard-coding-in-a-data-science-project-use-config-files-instead-479ac8ffc76f)

<u>credits</u>: [Will Koehrsen](https://medium.com/p/4d028e6f0526)

<h1 style="text-align:center"><u>Book Recommendation System</u></h1>

<h2>Introduction</h2>

The main purpose of this work is to learn <font color='orange'>**embeddings**</font>. An embedding is a representation of a document as a vector in a space (called the **embedding space**) where similar documents lie close to each other. To guide our understanding of embeddings, we will build a recommendation system based on books on Wikipedia. The principle is simple: "<u>`books with Wikipedia pages that link to similar Wikipedia pages are similar to each other`</u>". For example, if Book1 links to link2 and Book3 also links to link2, then Book1 and Book3 are similar. To capture this similarity, we will use neural network entity embeddings, mapping each book and each Wikipedia link (wikilink) to a 50-number vector.

Embeddings have two advantages for representing categorical variables. First, they reduce the dimensionality of the space in which documents are expressed. In our case, if we used *one-hot encoding* to represent books, each book would be a vector of length roughly 37,000 (one entry per book). Second, they preserve similarity in the semantic meaning of each document (the books in our case), which *one-hot encoding* does not. By training a neural network to learn entity embeddings, we not only get a reduced-dimension representation of the books, we also get a representation that keeps similar books (in terms of what they talk about) closer to each other. A basic approach for a recommendation system is to find the books closest to a given book, in order to recommend books whose content resembles what the user has already read. Thanks to [this notebook](1-Downloading%20and%20Parsing%20Wikipedia%20Articles.ipynb), we have access to every single book article on Wikipedia, which will let us create an effective recommendation system.
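To make the size difference concrete, here is a minimal sketch comparing a one-hot vector with an embedding lookup. The book count (37020) is the number reported further down in this notebook; `one_hot` and `toy_table` are hypothetical names, and the table holds random numbers only for illustration, whereas a real embedding table is learned.

```python
import random

n_books = 37020      # number of books reported later in this notebook
embedding_dim = 50   # embedding size used in this notebook

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Hypothetical embedding table: one row of 50 numbers per book.
# Only 5 rows are built here, and the values are random rather than learned.
rng = random.Random(0)
toy_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in range(5)]

book_idx = 3
sparse = one_hot(book_idx, n_books)  # one entry per book, almost all zeros
dense = toy_table[book_idx]          # 50 numbers, whatever the number of books

print(len(sparse), sum(sparse))
print(len(dense))
```

Any two distinct one-hot vectors are equally far apart, so they carry no notion of similarity; the dense rows can be moved closer together or further apart during training.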

<h2>Approach</h2>

To create entity embeddings, we need to <u>build an embedding neural network</u> and train it <u>on a supervised machine learning task</u> that will <u>result in similar books (and similar links) having closer representations</u> in embedding space. The parameters of the neural network - the weights - are the embeddings, and so during training, these numbers are adjusted to minimize the loss on the prediction problem. In other words, the network tries to accurately complete the task by changing the representations of the books and the links.

Once we have the embeddings for the books and the links, we can find the most similar book to a given book by computing the distance between the embedded vector for that book and all the other book embeddings. <u>We'll use the cosine distance</u> which measures the angle between two vectors as a measure of similarity (another valid option is the Euclidean distance). We can also do the same with the links, finding the most similar page to a given page. (I use links and wikilinks interchangeably in this notebook). The steps we will follow are:

1. Load in data and clean
2. Prepare data for supervised machine learning task
3. Build the entity embedding neural network
4. Train the neural network on prediction task
5. Extract embeddings and find most similar books and wikilinks
6. Visualize the embeddings using dimension reduction techniques
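As a rough illustration of step 5, the sketch below ranks items by cosine similarity (cosine distance is one minus this value). The three toy vectors and the helper names `cosine_similarity` and `most_similar` are assumptions made for the example; the real embeddings have 50 numbers each and come from the trained network.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(name, embeddings, n=2):
    """Return the n entries closest to `name`, most similar first."""
    query = embeddings[name]
    scores = {
        other: cosine_similarity(query, vec)
        for other, vec in embeddings.items()
        if other != name
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]

# Toy 3-number embeddings (hypothetical values)
toy_embeddings = {
    'Book1': [1.0, 0.0, 1.0],
    'Book2': [0.0, 1.0, 0.0],
    'Book3': [2.0, 0.1, 2.0],
}

ranked = most_similar('Book1', toy_embeddings)
print(ranked)
```

Book3 points in almost the same direction as Book1, so it ranks first even though its vector is longer; cosine similarity ignores vector length and compares direction only.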

*Question*: But what does the supervised machine learning task consist of? 🤔<br/>
*Answer*: Find the model that correctly <u>Maps Books to Links</u> 😀

For our machine learning task, we'll set up the problem as <u>identifying</u> <font color='green'>whether</font> <u>or</u> <font color='red'>not</font> <u>a particular link was present in a book article</u>. The training examples will consist of (book, link) pairs, with some pairs true examples - actually in the data - and others negative examples - do not occur in the data. It will be the network's job to adjust the entity embeddings of the books and the links in order to accurately make the classification. 

| Books | Links | Link present/absent? |
|:-----:|:---------:|:------:|
| Book4 | wikilink1 | <font color='green'>Present</font> |
| Book4 | wikilink7 | <font color='red'>Absent</font> |
| Book4 | wikilink8 | <font color='green'>Present</font> |
| Book2 | wikilink7 | <font color='red'>Absent</font> |
| Book2 | wikilink1 | <font color='green'>Present</font> |
| Book5 | wikilink3 | <font color='red'>Absent</font> |
| Book5 | wikilink5 | <font color='red'>Absent</font> |
| Book5 | wikilink1 | <font color='green'>Present</font> |
| Book1 | wikilink5 | <font color='green'>Present</font> |
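A table like the one above could be produced along the following lines. This is only a sketch under assumptions: the toy `book_links` data, the 1/0 labels for present/absent, the one-negative-per-positive ratio, and the helper `sample_negative` are all illustrative choices, not necessarily what the notebook does later.

```python
import random

# Toy data: each book maps to the set of wikilinks on its page (hypothetical values)
book_links = {
    'Book1': {'wikilink5'},
    'Book2': {'wikilink1'},
    'Book4': {'wikilink1', 'wikilink8'},
    'Book5': {'wikilink1'},
}
# All known links, including two that none of the toy books use
all_links = sorted(
    {link for links in book_links.values() for link in links}
    | {'wikilink3', 'wikilink7'}
)

# Positive examples: every (book, link) pair that occurs in the data
positive_pairs = {(book, link) for book, links in book_links.items() for link in links}

def sample_negative(rng):
    """Draw a random (book, link) pair that does not occur in the data."""
    while True:
        pair = (rng.choice(list(book_links)), rng.choice(all_links))
        if pair not in positive_pairs:
            return pair

rng = random.Random(42)
examples = [(book, link, 1) for book, link in sorted(positive_pairs)]
examples += [(*sample_negative(rng), 0) for _ in range(len(positive_pairs))]

for book, link, label in examples:
    print(book, link, 'Present' if label == 1 else 'Absent')
```

Negative pairs are drawn at random and rejected if they happen to be real pairs, which is cheap when, as here, most possible (book, link) combinations never occur.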

<u>**Keep in mind**</u>: Although we are training for a supervised machine learning task, our end objective is not to make accurate predictions on new data but to learn the best entity embeddings, so we do not use a validation or testing set. We use the prediction problem as a means to an end rather than the final outcome.

<h2>Neural Network Embeddings for better representing large sequences of text</h2>


Neural network embeddings have proven to be a powerful tool both for modeling language and for representing categorical variables. For example, the [Word2Vec](https://www.tensorflow.org/tutorials/text/word2vec?hl=fr) word embeddings by Google map a word to a vector based on training a neural network on millions of words. These embeddings can be used in any supervised model because they are just numerical representations of categorical variables. Much as we one-hot-encode categorical variables to use them in a random forest for a supervised task, we can also use entity embeddings to include categorical variables in a model. The embeddings are also useful because we can find entities that are close to one another in embedding space, which might - as in a book recommendation system - allow us to find the most similar categories among tens of thousands of choices.

We can also use the entity embeddings to visualize words or categorical variables, such as creating a map of all books on Wikipedia. The entity embeddings are typically still high-dimensional - we'll use 50 numbers for each entity - so we need to use a dimension reduction technique such as [t-SNE](https://distill.pub/2016/misread-tsne/) or [UMAP](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html) to visualize the embeddings in lower dimensions. (These are both manifold embedding methods, so in effect we will embed the embeddings for visualization!) We'll take a look at doing this at the end of the notebook and later will upload the embeddings into an application custom-built for this purpose ([projector.tensorflow.org](https://projector.tensorflow.org/)). Entity embeddings are becoming more widespread thanks to the ease of development of neural networks in Keras and are a useful approach when we want to represent categorical variables with vectors that place similar categories close to one another. Other approaches for encoding categorical variables do not represent similar entities as being closer to one another, whereas entity embedding is a learning-based method for this important task.

Overall, this project is a great look at the potential for neural networks to create meaningful embeddings of high dimensional data and a practical application of deep learning. The code itself is relatively simple, and the Keras library makes developing deep learning models enjoyable!

The code here is adapted from the excellent [Deep Learning Cookbook](https://www.amazon.com/Deep-Learning-Cookbook-Practical-Recipes/dp/149199584X), the notebooks for which can be found on [this GitHub repository](https://github.com/DOsinga/deep_learning_cookbook/tree/master). Check out this book for practical applications of deep learning and great projects!

<h3>1. Load in data and clean</h3>

The data is stored as JSON with one line for every book. This data contains every single book article on Wikipedia, which was parsed in the [Downloading and Parsing Wikipedia Data](1-Downloading%20and%20Parsing%20Wikipedia%20Articles.ipynb) notebook.

In [1]:
from IPython.core.interactiveshell import InteractiveShell

# Set shell to show all lines of output
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
import json

with open('./data/found_books_filtered.ndjson', 'r') as file:
    # Parse each line (one JSON record per book) into a Python object
    books = [json.loads(row) for row in file]

# Remove non-book articles
books_with_wikipedia = [book for book in books if 'Wikipedia:' in book[0]]
books = [book for book in books if 'Wikipedia:' not in book[0]]
print(f'Found {len(books)} books.')

Found 37020 books.


In [3]:
books_with_wikipedia[0]

['Wikipedia:Wikipedia Signpost/2014-06-25/Recent research',
 {'name': 'Global Wikipedia',
  'author': 'Pnina Fichman and Noriko Hara',
  'country': 'United States',
  'language': 'English',
  'subject': 'Wikipedia',
  'publisher': 'Rowman  &  Littlefield',
  'release_date': '2014',
  'pages': '178',
  'isbn': '978-0810891012'},
 ['User:Adler.fa',
  'User:Maximilianklein',
  'User:Piotrus',
  'User:Kimaus',
  'User:Tbayer (WMF)',
  'Rowman  &  Littlefield',
  'Indiana University Bloomington',
  'User:Maximilianklein',
  'User talk:Maximilianklein',
  'File:Immanuel Kant (painted portrait).jpg',
  'Immanuel Kant',
  'PageRank',
  'CheiRank',
  'm:Research:Newsletter/2013/April#How_Wikipedia.27s_Google_matrix_differs_for_politicians_and_artists',
  'm:Research:Newsletter/2013/July#Multilingual_ranking_analysis:_Napoleon_and_Michael_Jackson_as_Wikipedia.27s_.22global_heroes.22',
  'DBpedia',
  'User:Piotrus',
  'OpenSym',
  'Chinese Wikipedia',
  'Baidu Baike',
  'microblog',
  'Twitter',


In [4]:
books[0]

['Freud: His Life and His Mind',
 {'1': '< !-- See Wikipedia:WikiProject_Books -- >',
  'name': 'Freud: His Life and His Mind',
  'image': 'File:Freud, His Life and His Mind (first edition).jpg',
  'caption': 'Cover of the first edition',
  'author': 'Helen Walker Puner',
  'country': 'United States',
  'language': 'English',
  'subject': 'Sigmund Freud',
  'publisher': 'Dell Publishing',
  'pub_date': '1947',
  'media_type': 'Print (Hardcover and Paperback)',
  'pages': '288 (1959 edition)',
  'isbn': '978-1560006114'},
 ['Sigmund Freud',
  'Dell Publishing',
  'Hardcover',
  'Paperback',
  'Sigmund Freud',
  'Erich Fromm',
  'Dell Publishing',
  'Anna Freud',
  'Ernest Jones',
  'Carl Jung',
  'Wilhelm Stekel',
  'Fritz Wittels',
  'Maurice English',
  'The Nation',
  'Frederick Crews',
  'The New York Review of Books',
  'Peter Gay',
  'Freud: A Life for Our Time',
  'The Life and Work of Sigmund Freud',
  'Louis Breger',
  'Wilhelm Fliess',
  'Freud family',
  'Cambridge University

There are a few articles that were caught which are clearly not books.

In [5]:
[book[0] for book in books_with_wikipedia][:5]

['Wikipedia:Wikipedia Signpost/2014-06-25/Recent research',
 'Wikipedia:New pages patrol/Unpatrolled articles/December 2010',
 'Wikipedia:Templates for discussion/Log/2012 September 23',
 'Wikipedia:Articles for creation/Redirects/2012-10',
 'Wikipedia:Templates for discussion/Log/2012 October 4']

<img src="static\book_template.PNG" width=20% align ='center'>

Each legitimate book contains the title, the information from the `Infobox book` (image above) template, the internal Wikipedia links, the external links, the date of last edit, and the number of characters in the article (a rough estimate of the length of the article).

In [6]:
n = 21  # book at index 21
print(f"-Title-- : {books[n][0]}; \n\n -Infobox-- : {books[n][1]};\n\n -Wikilinks-- : {books[n][2][:5]}; \n\n -External links-- : \
    {books[n][3][:5]};\n\n -Date of last edit-- : {books[n][4]};\n\n -Number of characters-- : {books[n][5]}")

-Title-- : Limonov (novel); 

 -Infobox-- : {'name': 'Limonov', 'author': 'Emmanuel Carrère', 'translator': 'John Lambert', 'country': 'France', 'language': 'French', 'publisher': 'P.O.L.', 'pub_date': '2011', 'english_pub_date': '2014', 'pages': '488', 'isbn': '978-2-8180-1405-9'};

 -Wikilinks-- : ['Emmanuel Carrère', 'biographical novel', 'Emmanuel Carrère', 'Eduard Limonov', 'Prix de la langue française']; 

 -External links-- :     ['http://www.lefigaro.fr/flash-actu/2011/10/05/97001-20111005FILWWW00615-le-prix-de-la-langue-francaise-a-e-carrere.php', 'http://www.lexpress.fr/culture/livre/emmanuel-carrere-prix-renaudot-2011_1046819.html', 'http://limonow.de/carrere/index.html', 'http://www.tout-sur-limonov.fr/222318809'];

> Book 21

<img src="./static/Limonov(novel).png" width=75% align ='center'>

**General note:** We will only use the wikilinks, which are saved as the third element (index 2) for each book.

<h3>2. Prepare data for supervised machine learning task</h3>

> Map books to integers

First we want to create a mapping of book titles to integers. When we feed books into the embedding neural network, we will have to represent them as numbers, and this mapping will let us keep track of the books. We'll also create the reverse mapping, from integers back to the title.

In [7]:
book_index = {book[0]: idx for idx, book in enumerate(books)}
index_book = {idx: book for book, idx in book_index.items()}

book_index['Anna Karenina']
index_book[22494]

22494

'Anna Karenina'

> Exploring Wikilinks

Although it's not our main focus, we can do a little exploration. Let's find the number of unique wikilinks and the most common ones. To create a single list from a list of lists, we can use the `chain` function from `itertools`.

In [8]:
from itertools import chain

wikilinks = list(chain(*[book[2] for book in books])) # list(chain(*[['a', 'b'], ['c', 'a', 'e']])) => ['a', 'b', 'c', 'a', 'e']
print(f"There are {len(set(wikilinks))} unique wikilinks.")

There are 311276 unique wikilinks.


In [9]:
wikilinks_other_books = [link for link in wikilinks if link in book_index.keys()]
print(f"There are {len(set(wikilinks_other_books))} unique wikilinks to other books.")

There are 17032 unique wikilinks to other books.


> EDA: Most Linked-to Articles

Let's take a look at which pages are most linked to by books on Wikipedia.

We'll make a utility function that takes in a list and returns a sorted ordered dictionary of the counts of the items in the list. The collections module has a number of useful functions for dealing with groups of objects.

In [10]:
from collections import Counter, OrderedDict

def count_items(l:list):
    """Return ordered dictionary of counts of objects in `l`"""
    
    # Create a counter object
    counts = Counter(l)
    
    # Sort by highest count first and place in ordered dictionary
    counts = sorted(counts.items(), key = lambda x: x[1], reverse = True)
    counts = OrderedDict(counts)
    
    return counts
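For reference, the same highest-count-first ordering can also be obtained from `Counter.most_common()`. The sketch below uses a small made-up list (`toy_links`) to show the shape of the result; it is an alternative illustration, not a replacement for the function above.

```python
from collections import Counter, OrderedDict

# Hypothetical list of wikilinks
toy_links = ['novel', 'fantasy', 'novel', 'science fiction', 'novel', 'fantasy']

# Counter.most_common() returns (item, count) pairs sorted by count, highest first
toy_counts = OrderedDict(Counter(toy_links).most_common())

print(list(toy_counts.items()))
# [('novel', 3), ('fantasy', 2), ('science fiction', 1)]
```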

We only want to count wikilinks from each book once, so we first find the set of links for each book, then we flatten the list of lists to a single list, and finally pass it to the count_items function.

In [11]:
# Find set of wikilinks for each book and convert to a flattened list
unique_wikilinks = list(chain(*[list(set(book[2])) for book in books]))

wikilink_counts = count_items(unique_wikilinks)
list(wikilink_counts.items())[:10]

[('Hardcover', 7489),
 ('Paperback', 7311),
 ('Wikipedia:WikiProject Books', 6043),
 ('Wikipedia:WikiProject Novels', 6015),
 ('English language', 4185),
 ('United States', 3060),
 ('Science fiction', 3030),
 ('The New York Times', 2727),
 ('science fiction', 2502),
 ('novel', 1979)]

The most linked to pages are in fact not that surprising! One thing we should notice is that there are discrepancies in capitalization. We want to normalize across capitalization, so we'll lowercase all of the links and redo the counts.

In [12]:
wikilinks = [link.lower() for link in unique_wikilinks]
print(f"There are {len(set(wikilinks))} unique wikilinks.")

wikilink_counts = count_items(wikilinks)
list(wikilink_counts.items())[:10]

There are 297624 unique wikilinks.


[('paperback', 8740),
 ('hardcover', 8648),
 ('wikipedia:wikiproject books', 6043),
 ('wikipedia:wikiproject novels', 6016),
 ('science fiction', 5665),
 ('english language', 4248),
 ('united states', 3063),
 ('novel', 2983),
 ('the new york times', 2742),
 ('fantasy', 2003)]

*Conclusion*: Normalizing wikilinks changes the rankings! This illustrates an important point: always make sure to take a look at your data before modeling! 😉

> Remove Most Popular Wikilinks

I'm going to remove the **most popular wikilinks** because these **are not very informative**. Knowing whether a book is 'hardcover' or 'paperback' is not that important to the content. We also don't need the two WikiProject links (wikipedia:wikiproject books, wikipedia:wikiproject novels) since these do not distinguish the books based on content. <br/>

-> Task 1: It is recommended to play around with the wikilinks that are removed because some might have a large effect on the recommendations.

(This step is similar to the idea of [TF-IDF (Term Frequency Inverse Document Frequency)](https://tfidf.com/). When dealing with words in documents, the words that appear most often across documents are usually not that helpful because they don't distinguish documents. TF-IDF is a way to weight a word higher for appearing more often within an article but decrease the weighting for a word that appears in many articles.)
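To see why very common links carry little information, here is a minimal TF-IDF sketch applied to wikilinks instead of words. It uses one common variant of the formula (term frequency times the log of total documents over documents containing the term); the toy books and the helper `tf_idf` are assumptions for illustration, and the notebook itself simply removes the popular links rather than computing these weights.

```python
import math
from collections import Counter

# Hypothetical books with their (lowercased) wikilinks
toy_books = {
    'BookA': ['paperback', 'science fiction', 'science fiction'],
    'BookB': ['paperback', 'fantasy'],
    'BookC': ['paperback', 'science fiction'],
}

n_docs = len(toy_books)
# Document frequency: in how many books does each link appear at least once?
doc_freq = Counter(link for links in toy_books.values() for link in set(links))

def tf_idf(link, book):
    """Term frequency within `book` times inverse document frequency."""
    links = toy_books[book]
    tf = links.count(link) / len(links)
    idf = math.log(n_docs / doc_freq[link])
    return tf * idf

print(tf_idf('paperback', 'BookA'))        # appears in every book, so the weight is 0
print(tf_idf('science fiction', 'BookA'))  # positive weight
print(tf_idf('fantasy', 'BookB'))          # rarest link, largest weight of the three
```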

In [13]:
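to_remove = ['hardcover', 'paperback', 'hardback', 'e-book', 'wikipedia:wikiproject books', 'wikipedia:wikiproject novels']

# Drop every occurrence of these links (list.remove would only drop the first one)
wikilinks = [link for link in wikilinks if link not in to_remove]
for t in to_remove:
    _ = wikilink_counts.pop(t, None)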

Since there are so many unique wikilinks, I'm going to limit the list to wikilinks mentioned 4 or more times. Hopefully this reduces the noise that might come from wikilinks that only appear a few times. Keeping every single link will increase the training time significantly, but experiment with this parameter if you are interested.

-> Task 2: Try different thresholds for the number of wikilink appearances, then analyse the results obtained.

In [14]:
# Keep only wikilinks that appear at least 4 times
links = [t[0] for t in wikilink_counts.items() if t[1] >= 4]
print(len(links))

41758
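One way to approach Task 2 is to wrap the filter in a small function and compare how many links survive each threshold. The sketch below uses a made-up dictionary in place of `wikilink_counts`, and `links_at_least` is a hypothetical helper name.

```python
# Hypothetical counts standing in for `wikilink_counts`
toy_wikilink_counts = {
    'science fiction': 12,
    'fantasy': 7,
    'novel': 4,
    'dell publishing': 3,
    'opensym': 1,
}

def links_at_least(counts, min_count):
    """Return the links whose count is at least `min_count`."""
    return [link for link, count in counts.items() if count >= min_count]

for min_count in (1, 4, 10):
    kept = links_at_least(toy_wikilink_counts, min_count)
    print(f"min_count={min_count}: {len(kept)} links kept")
```

A lower threshold keeps more rare links (more signal but also more noise and longer training), while a higher one keeps only links shared by many books.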


> Most Linked-to Books

As a final bit of exploration, let's look at the books that are mentioned the most by other books on Wikipedia. We'll take the set of links for each book so that we don't have multiple counts for books that are linked to by another book more than once.

In [15]:
# Find set of book wikilinks for each book
unique_wikilinks_books = list(chain(*[list(set(link for link in book[2] if link in book_index.keys())) for book in books]))

# Count the number of books linked to by other books
wikilink_book_counts = count_items(unique_wikilinks_books)
list(wikilink_book_counts.items())[:10]

[('The Encyclopedia of Science Fiction', 127),
 ('The Discontinuity Guide', 104),
 ('The Encyclopedia of Fantasy', 63),
 ('Dracula', 55),
 ('Encyclopædia Britannica', 51),
 ('Nineteen Eighty-Four', 51),
 ('Don Quixote', 49),
 ('The Wonderful Wizard of Oz', 49),
 ("Alice's Adventures in Wonderland", 47),
 ('Jane Eyre', 39)]