In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%load_ext autotime

# Wikipedia Extractor - Example

This Jupyter notebook walks through typical uses of the library. The library draws on the knowledge in Wikipedia and fetches only the information that is requested, keeping overhead to a minimum.

In [3]:
import wiki_extract

time: 107 ms


The language can be changed via `wiki_extract.set_lang(<prefix>)`. Note that NER is only supported for '*en*' and '*nl*'.

In [4]:
wiki_extract.set_lang('en')  # Default language is 'en'

time: 453 µs


If no page matches the query exactly (*Trump*), the query is autocorrected to the most suitable page (*Donald Trump*). This behaviour can be turned off via the Boolean argument `autocorrect` of `wiki_extract.get()`.

In [5]:
wiki_extract.get(query='Trump')



{'page': <WikipediaPage 'Donald Trump'>, 'title': 'Donald Trump'}

time: 2.54 s


Results are cached, so repeating the same query is considerably faster (387 ms here, compared with 2.54 s for the first call).

In [6]:
result = wiki_extract.get(query='Trump')

time: 387 ms
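The effect of caching can be illustrated with a small, self-contained sketch. It does not show how `wiki_extract` caches internally (the notebook does not say); it uses Python's standard `functools.lru_cache`, and `fetch_page` and `calls` are hypothetical names standing in for a slow Wikipedia request.

```python
import functools

calls = []  # records every simulated request that actually runs


@functools.lru_cache(maxsize=128)
def fetch_page(query: str) -> str:
    """Hypothetical stand-in for a slow Wikipedia request (not the library's API)."""
    calls.append(query)
    return f"<page for {query!r}>"


first = fetch_page("Trump")   # cache miss: the function body runs
second = fetch_page("Trump")  # cache hit: the stored result is returned

print(fetch_page.cache_info())
```

The second call returns the stored value without running the function body again, which is why repeated queries tend to be cheaper than the first one.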


## Extracted Fields

Besides the page itself, the `wiki_extract.get` method can fetch additional fields of the Wikipedia page, which are useful for other use cases as well. The method accepts the following arguments:
 - query: Wikipedia query to be searched
 - autocorrect: If the query does not match a specific page, correct it to the most similar one (if one exists)
 - categories: Extract categories to which query belongs
 - content: Extract content in clean text
 - images: Extract all image-URLs present
 - infobox: Extract infobox as dictionary (if exists)
 - languages: Get language alternatives for found page
 - links: List all links present
 - ner: Perform Named-Entity Recognition on the page
 - parent_id: Get ID of parent page
 - parse_tree: Get page's parse-tree
 - raw_html: Get the raw HTML
 - rev_id: Get the page's revision ID
 - sections: List all sections present
 - short_desc: Get page's short description (if exists)
 - summary: Get page's summary (introduction, first paragraph)

In [7]:
wiki_extract.get(query='Trump', infobox=True, short_desc=True)

{'page': <WikipediaPage 'Donald Trump'>,
 'title': 'Donald Trump',
 'infobox': {'image': 'Donald Trump official portrait.jpg',
  'alt': 'Official White House portrait. Head shot of Trump smiling in front of the U.S. flag, wearing a dark blue suit jacket with American flag lapel pin, white shirt, and light blue necktie.',
  'order': '45th',
  'office': 'President of the United States',
  'vicepresident': '[[Mike Pence]]',
  'term_start': 'January 20, 2017',
  'predecessor': '[[Barack Obama]]',
  'birth_name': 'Donald John Trump',
  'birth_date': '{{birth date and age|1946|6|14}}',
  'birth_place': '[[Queens]], [[New York City]]',
  'party': '[[Republican Party (United States)|Republican]] (1987–1999, 2009–2011, 2012–present)',
  'otherparty': '{{plainlist|\n* [[Democratic Party (United States)|Democratic]] (until 1987, 2001–2009)\n* [[Reform Party of the United States of America|Reform]] (1999–2001)\n* [[Independent politician|Independent]] (2011–2012)}}',
  'spouse': '{{plainlist|\n* |

time: 1.59 s
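The infobox values above are returned as raw wiki markup, such as `[[Mike Pence]]` or `{{birth date and age|1946|6|14}}`. The following is a minimal post-processing sketch, assuming the values look like the ones in the output above. The helpers `strip_wikilinks` and `parse_birth_date` are hypothetical and not part of `wiki_extract`; they only handle simple, non-nested links and this one date template.

```python
import re
from datetime import date
from typing import Optional

# Matches [[Target]] and [[Target|Label]]; nested links are not handled.
WIKILINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")

# Matches {{birth date|Y|M|D}} and {{birth date and age|Y|M|D}}.
BIRTH_DATE = re.compile(
    r"\{\{birth date(?: and age)?\|(\d{4})\|(\d{1,2})\|(\d{1,2})"
)


def strip_wikilinks(value: str) -> str:
    """Replace each wiki link with its visible label (hypothetical helper)."""
    return WIKILINK.sub(r"\1", value)


def parse_birth_date(value: str) -> Optional[date]:
    """Extract a date from a birth-date template, or None if absent."""
    match = BIRTH_DATE.search(value)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


vice_president = strip_wikilinks("[[Mike Pence]]")
birth_place = strip_wikilinks("[[Queens]], [[New York City]]")
born = parse_birth_date("{{birth date and age|1946|6|14}}")
```

For anything beyond these simple cases, a dedicated wikitext parser is likely a better choice than regular expressions.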


Furthermore, every call to `wiki_extract.get` returns the raw `WikipediaPage` object under the `'page'` key. This object acts as a container for the supported properties, so individual properties can also be requested from it directly.

In [8]:
result = wiki_extract.get(query='Trump')
page = result['page']
page.title

'Donald Trump'

time: 517 ms


The page object can also be queried directly to fetch further information, such as the summary.

In [9]:
page.summary

"Donald John Trump (born June 14, 1946) is the 45th and current president of the United States. Before entering politics, he was a businessman and television personality.\nTrump was born and raised in Queens, a borough of New York City. He attended Fordham University for two years and received a bachelor's degree in economics from the Wharton School of the University of Pennsylvania. He became president of his father's real-estate business in 1971, renamed it The Trump Organization, and expanded its operations to building or renovating skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, mostly by licensing his name. Trump and his businesses have been involved in more than 4,000 state and federal legal actions, including six bankruptcies. He owned the Miss Universe brand of beauty pageants from 1996 to 2015. He produced and hosted The Apprentice, a reality television series, from 2003 to 2015. As of April 2020, Forbes estimated his net worth to be 

time: 329 ms
