### Get Data with API

In [2]:
import pandas as pd
import wikipedia
import wikipediaapi
import wptools

wiki_wiki = wikipediaapi.Wikipedia('en')

def get_sections(bookname, level=0):
    """Return {section title: section text} for the top-level sections of a page.

    `level` is currently unused; subsections are not traversed.
    """
    p = wiki_wiki.page(bookname)
    return {s.title: s.text for s in p.sections}


book = pd.read_csv('book.csv')
books = book[:50]['name']


In [19]:
%%capture

title = []
summary = []
content = []
link = []
author = []
country = []
language = []
genre = []

for i in range(10):
    try:
        # Do every lookup that can fail before touching the result lists
        page = wikipedia.page(books[i])
        row = (page.title, page.summary, get_sections(books[i]), page.links)
        infobox = wptools.page(books[i]).get_parse().data.get('infobox') or {}
    except Exception:
        # Skip books whose pages cannot be fetched or parsed
        continue

    # Append only after everything succeeded, so all lists stay the same length
    title.append(row[0])
    summary.append(row[1])
    content.append(row[2])
    link.append(row[3])
    author.append(infobox.get('author'))
    country.append(infobox.get('country'))
    language.append(infobox.get('language'))
    genre.append(infobox.get('genre'))

df = pd.DataFrame({'title':title, 
                   'author': author,
                   'country': country,
                   'language':language,
                   'genre': genre,
                   'summary':summary, 
                   'content': content, 
                   'link':link,})


In [21]:
display(df)

Unnamed: 0,title,author,country,language,genre,summary,content,link
0,Freud: His Life and His Mind,Helen Walker Puner,United States,English,,Freud: His Life and His Mind is a 1947 biograp...,{'Summary': 'Puner attempts to explain Freud's...,"[Cambridge University Press, Carl Jung, Dell P..."
1,Blackbox (novel),Nick Walker,[[United Kingdom]],[[English language|English]],,Blackbox is the first novel by British writer ...,"{'Plot': 'A stowaway dies on board a flight.',...","[2002 in literature, Chris Morris (author), Co..."
2,My Real Children,[[Jo Walton]],United States,English,"[[Fantasy literature]], [[alternate history]]",My Real Children is a 2014 alternate history n...,"{'Plot': 'In 2015, Patricia is 89 years old an...","[1964 United States presidential election, A W..."
3,The Alleys of Eden,[[Robert Olen Butler]],United States,English,,The Alleys of Eden is the first published nove...,{'Synopsis': 'Set in Saigon during the final d...,"[1981 in literature, Dewey Decimal Classificat..."
4,Day of the Dogs,[[Andrew Cartmel]],,,Science fiction,Day of the Dogs is an original novel written b...,{'Synopsis': 'Johnny Alpha and Middenface McNu...,"[Andrew Cartmel, Black Flame (publisher), Brit..."
5,Oedipus in the Trobriands,[[Melford Spiro]],United States,English,,Oedipus in the Trobriands is a 1982 book about...,{'Summary': 'Spiro discusses the Oedipus compl...,"[American Anthropologist, American Ethnologica..."
6,On the Genealogy of Morality,[[Christopher Janaway]],,English,,On the Genealogy of Morality: A Polemic (Germa...,{'Critical reception': 'Paul Bishop believed J...,"[Amor fati, Anarchism and Friedrich Nietzsche,..."
7,Wild Boy (novel),[[Jill Dawson]],United Kingdom,English,,Wild Boy is a 2003 novel by English author Jil...,{'Plot introduction': 'The novel is split into...,"[Asperger's Syndrome, Autism: Explaining the E..."
8,Ali and Ramazan,[[Perihan Mağden]],[[Turkey]],[[Turkish language|Turkish]],Novel,Ali and Ramazan (Ali ile Ramazan in Turkish) i...,{'Plot summary': 'Ali and Ramazan are two boys...,"[AmazonCrossing, Bisexuality, Gay, Internation..."
9,Future perfect,[[Steven Berlin Johnson]],,English,,The future perfect is a verb form or construct...,{'Key concepts': 'The main idea that Johnson p...,"[Afrikaans, Auxiliary verb, Catalan grammar, C..."


### Map Books to Integers

First we want to create a mapping of book titles to integers. When we feed books into the embedding neural network, we will have to represent them as numbers, and this mapping will let us keep track of the books. We'll also create the reverse mapping, from integers back to the title.

In [25]:
book_index = {book: idx for idx, book in enumerate(title)}
index_book = {idx: book for book, idx in book_index.items()}
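
As a quick sanity check, the two dictionaries should be inverses of each other. The following is a minimal sketch that uses a short hypothetical list of titles (`sample_titles`) in place of the `title` list built above:

```python
# Hypothetical sample standing in for the `title` list collected above
sample_titles = ['Blackbox (novel)', 'My Real Children', 'Wild Boy (novel)']

sample_book_index = {book: idx for idx, book in enumerate(sample_titles)}
sample_index_book = {idx: book for book, idx in sample_book_index.items()}

# Every title should map to an integer and back to itself
round_trip_ok = all(
    sample_index_book[sample_book_index[t]] == t for t in sample_titles
)
print(sample_book_index['My Real Children'], round_trip_ok)  # 1 True
```

Note that if the same title appeared twice in the list, the later index would overwrite the earlier one in the forward mapping, so the round trip only holds for unique titles.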

### Exploring Wikilinks

Although it's not our main focus, we can do a little exploration. Let's find the number of unique wikilinks and the most common ones. To flatten a list of lists into a single list, we can use the `chain` function from `itertools`.

In [24]:
from itertools import chain
wikilinks = list(chain.from_iterable(link))
print(f"There are {len(set(wikilinks))} unique wikilinks.")

There are 362 unique wikilinks.
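
To make the flattening step concrete, here is a small sketch on hypothetical data (`sample_links` stands in for the per-book `link` lists). It uses `chain.from_iterable`, which flattens one level of nesting without unpacking the outer list into arguments:

```python
from itertools import chain

# Hypothetical per-book link lists, standing in for `link`
sample_links = [['Paperback', 'Sigmund Freud'], ['Paperback', 'OCLC'], []]

# chain.from_iterable walks each inner list in turn
flat = list(chain.from_iterable(sample_links))
print(flat)            # ['Paperback', 'Sigmund Freud', 'Paperback', 'OCLC']
print(len(set(flat)))  # 3
```

The flattened list keeps duplicates, which is why the count of unique links is taken on `set(flat)`.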


How many of these are links to other books?

In [26]:
# Use a loop variable that does not shadow the `link` list
wikilinks_other_books = [wl for wl in wikilinks if wl in book_index]
print(f"There are {len(set(wikilinks_other_books))} unique wikilinks to other books.")

There are 0 unique wikilinks to other books.


### Most Linked-to Articles

Let's take a look at which pages are most linked to by books on Wikipedia.

We'll make a utility function that takes in a list and returns an ordered dictionary of item counts, sorted from most to least frequent. The `collections` module has a number of useful container types for this, including `Counter` and `OrderedDict`.

In [27]:
from collections import Counter, OrderedDict

def count_items(l):
    """Return ordered dictionary of counts of objects in `l`"""
    
    # Create a counter object
    counts = Counter(l)
    
    # Sort by highest count first and place in ordered dictionary
    counts = sorted(counts.items(), key = lambda x: x[1], reverse = True)
    counts = OrderedDict(counts)
    
    return counts
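
For comparison, `Counter` already provides a `most_common` method that returns `(item, count)` pairs sorted from highest to lowest count. The sketch below is an alternative to the utility above, shown on hypothetical data, not a replacement the notebook relies on:

```python
from collections import Counter

# Hypothetical list of links with repeats
sample = ['paperback', 'oclc', 'paperback', 'hardcover', 'oclc', 'paperback']

counts = Counter(sample)

# The two most frequent items, highest count first
top_two = counts.most_common(2)
print(top_two)  # [('paperback', 3), ('oclc', 2)]
```

Because regular dictionaries preserve insertion order in Python 3.7 and later, `dict(counts.most_common())` would give a similarly ordered mapping without `OrderedDict`.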


We only want to count wikilinks from each book once, so we first find the set of links for each book, then we flatten the list of lists to a single list, and finally pass it to the count_items function.

In [29]:
# Find set of wikilinks for each book and convert to a flattened list
unique_wikilinks = list(chain.from_iterable(set(links) for links in link))

wikilink_counts = count_items(unique_wikilinks)
list(wikilink_counts.items())[:10]

[('International Standard Book Number', 9),
 ('Paperback', 3),
 ('Hardcover', 2),
 ('Psychoanalysis', 2),
 ('Sigmund Freud', 2),
 ('Peter Gay', 2),
 ('Vietnam War', 2),
 ('OCLC', 2),
 ('Cambridge University Press', 1),
 ('Wilhelm Fliess', 1)]


The most linked-to pages are not that surprising. Link titles can differ only in capitalization, so we'll lowercase all of the links and redo the counts to normalize them. (In this small sample the number of unique links stays at 362, so no two links differed only by case.)

In [30]:
wikilinks = [wl.lower() for wl in unique_wikilinks]
print(f"There are {len(set(wikilinks))} unique wikilinks.")

wikilink_counts = count_items(wikilinks)
list(wikilink_counts.items())[:10]

There are 362 unique wikilinks.


[('international standard book number', 9),
 ('paperback', 3),
 ('hardcover', 2),
 ('psychoanalysis', 2),
 ('sigmund freud', 2),
 ('peter gay', 2),
 ('vietnam war', 2),
 ('oclc', 2),
 ('cambridge university press', 1),
 ('wilhelm fliess', 1)]

### Remove Most Popular Wikilinks
I'm going to remove the most popular wikilinks because they are not very informative. Knowing whether a book is hardcover or paperback says little about its content. We also don't need the two Wikipedia: links, since these do not distinguish the books based on content. I'd recommend experimenting with which wikilinks are removed, because the choice can have a large effect on the recommendations.

(This step is similar to the idea of TF-IDF (Term Frequency-Inverse Document Frequency). When dealing with words in documents, the words that appear most often across documents are usually not that helpful because they don't distinguish documents. TF-IDF weights a word more heavily the more often it appears within a document, and less heavily the more documents it appears in.)
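
To illustrate the inverse-document-frequency half of that idea, here is a minimal sketch that treats each book as a document and each wikilink as a word. It is only an illustration of the analogy, not a step this notebook performs; `sample_book_links` is hypothetical data, and the weight used is the plain `log(N / df)` form of IDF.

```python
import math
from collections import Counter

# Hypothetical per-book link sets (already lowercased and de-duplicated)
sample_book_links = [
    {'paperback', 'sigmund freud', 'psychoanalysis'},
    {'paperback', 'vietnam war'},
    {'paperback', 'sigmund freud'},
]

n_books = len(sample_book_links)

# Document frequency: in how many books does each link appear?
doc_freq = Counter(wl for links in sample_book_links for wl in links)

# Inverse document frequency: links shared by every book get weight 0
idf = {wl: math.log(n_books / df) for wl, df in doc_freq.items()}

for wl, weight in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{wl}: {weight:.2f}")
```

In this sample, `paperback` appears in every book and so receives a weight of zero, which matches the intuition behind dropping the most popular links by hand.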