# Python and Wikipedia
The functions below extract data from Wikipedia and export it for analysis. With each function, I've listed the goal and the required Python skills: the idea is that a CS course could assign students to write similar code.

All functions use the Wikipedia package, available here: https://pypi.org/project/wikipedia/. The package includes a function, wikipedia.page, that returns a WikipediaPage object holding the content of a Wikipedia page.

### Contents

1. getValidPage
2. getHeaders
3. getSomePages
4. pagesFromList
5. pageInfo
6. getContent
7. setLanguage
8. showOnMap

# The getValidPage function

## Goal
Input a string and return a WikipediaPage object (the type returned by wikipedia.page). The function must:
* Ensure the string leads to a valid page
* Handle disambiguation errors (strings that can refer to multiple pages)

## Python Skills Used
* Loops (while and for)
* If/else statements
* Error checking (try, except, pass)
* Input validation (using pyinputplus)

In [30]:
import wikipedia
import pyinputplus as pyip

def getValidPage(pageName):
        
    while True:
                
        try:
            Page = wikipedia.page(pageName, auto_suggest=False)
            break
        
        except wikipedia.exceptions.DisambiguationError as d: #if a disambiguation page
            print (pageName + ' could refer to multiple pages. Here they are:')
            for i in range(len(d.options)): #prints a list of the potential matches
                print (str(i+1) + '. '+ d.options[i])
            print ('Enter the number for your choice. Enter 0 if none of these choices is right.')
            selectionNum = pyip.inputInt(min=0, max=len(d.options)) #asks for the number of the desired entry (0 means none of them); min=0 rejects negative numbers
            if selectionNum == 0:
                print ('Reenter the topic whose Wikipedia page you want.')
                pageName = pyip.inputStr()
                pass
            else:
                pageName = d.options[selectionNum-1]
            pass #goes back to start, in case new title raises disambiguation error
        
        except wikipedia.exceptions.PageError: #if the page doesn't exist
            print (pageName + ' is not a valid page. Please enter another option.')
            pageName = pyip.inputStr()
            pass
    
    return Page

In [37]:
examplePage = getValidPage('FiskUniver')
examplePage

FiskUniver is not a valid page. Please enter another option.
Fisk
Fisk could refer to multiple pages. Here they are:
1. Fisk, Iowa
2. Fisk, Missouri
3. Fisk, Wisconsin
4. Fisk University
5. Fisk Generating Station
6. Fisk (surname)
7. Fisk Tire Company
8. Fria liberaler i Svenska kyrkan
9. Fiske
10. Fisker (disambiguation)
11. Justice Fisk (disambiguation)
Enter the number for your choice. Enter 0 if none of these choices is right.
4


<WikipediaPage 'Fisk University'>
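
The menu logic inside the disambiguation branch can be practiced separately from the network calls. Here is a minimal sketch, assuming two helper functions of my own naming (formatOptions and resolveChoice are hypothetical and not part of the Wikipedia package). It builds the numbered list with enumerate and maps the user's number back to a title.

```python
def formatOptions(options):
    """Return numbered menu lines such as '1. Fisk, Iowa' for a list of titles."""
    return [f'{number}. {title}' for number, title in enumerate(options, start=1)]

def resolveChoice(options, selectionNum):
    """Map a 1-based menu number to a title; 0 means 'none of these'."""
    if selectionNum == 0:
        return None
    return options[selectionNum - 1]

sampleOptions = ['Fisk, Iowa', 'Fisk, Missouri', 'Fisk, Wisconsin', 'Fisk University']
for line in formatOptions(sampleOptions):
    print(line)
print(resolveChoice(sampleOptions, 4))  # Fisk University
```

Keeping these pieces free of input() and print() calls makes them easy for students to test on their own.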

# The getHeaders Function
Wikipedia articles are typically broken down into sections, and section headings convey some information about the structure of each page. Wikipedia uses Wikicode, which puts headers between equals signs: =header=

Because the first-level header is reserved for the page name, section headers on Wikipedia begin with == (subheaders with ===, etc.).

## Goal
Input a Wikipedia page and return a list of the headers in the document. Use the content property of the Wikipedia page class.

## Python Skills
* Regular Expressions (the code below will include headers of six or fewer words. To get subheaders, add an extra = to each side).

In [38]:
import re

def getHeaders(wikiPage): #takes a wikipedia page and returns a list of the headers
    
    headerRegex = re.compile(r'(== )(\w+\s*\w*\s*\w*\s*\w*\s*\w*\s*\w*)( ==\s+)')
    matches = []
    for groups in headerRegex.findall(wikiPage.content):
        matches.append(groups[1]) #group 1 catches the middle group, the word
    return matches

In [39]:
getHeaders(examplePage)

['History',
 'Campus',
 'Science programs',
 'Rankings',
 'Athletics',
 'Notable alumni',
 'Notable faculty',
 'References',
 'External links']
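
The pattern in getHeaders stops at six words and skips subheaders. As a possible extension, here is a sketch of a more general pattern, tested on a hand-written sample string rather than a live page. The names HEADER_PATTERN and findHeaders are my own. It assumes the content text keeps each header on its own line, as in `== Early life ==`.

```python
import re

# Two to six '=' on each side; the backreference \1 requires the closing run
# to be the same as the opening run, so '==' and '===' headers are told apart.
HEADER_PATTERN = re.compile(r'^(={2,6}) *(.+?) *\1 *$', re.MULTILINE)

def findHeaders(text):
    """Return (level, heading) pairs; level 2 is a section, 3 a subsection."""
    return [(len(marks), heading) for marks, heading in HEADER_PATTERN.findall(text)]

sample = (
    'Intro paragraph.\n\n\n'
    '== Early life ==\n\nText.\n\n\n'
    '=== Relationship with Annette Vallon ===\n\nMore text.\n\n\n'
    '== First publication and Lyrical Ballads ==\n'
)
headers = findHeaders(sample)
print(headers)
```

Returning the level alongside the heading lets students rebuild the outline of a page, not just a flat list.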

# The getSomePages Function

## Goal
Prompt a user to input one or more topics they'd like to explore on Wikipedia. Return a list of the pages.
* Optional: print summaries of the pages in the list, using the summary property of the Wikipedia page class.

## Python Skills
* Uses the getValidPage function
* Input Validation (pyinputplus)
* Creating and adding to lists
* While loops

In [17]:
def getSomePages():
    somePages = []
    
    print ('You will be asked to enter a list of topics, and we\'ll collect data from those topics\' Wikipedia pages.')
    
    while True:
        print ('\nEnter the name of a topic to get the Wikipedia page. If you are finished, enter "done".')
        pageName = pyip.inputStr()
        if pageName.strip().lower() == 'done': #accepts done, Done, DONE, etc.
            break
        somePages.append(getValidPage(pageName))
        
    print ('\nWould you like to see a list of page summaries? (yes or no)')
    sums = pyip.inputYesNo()
    if sums == 'yes':
        for i, page in enumerate(somePages, start=1):
            print (str(i) + '. ' + wikipedia.summary(page.title, sentences=1, auto_suggest=False)) #auto_suggest=False keeps the lookup on the exact title
    
    return somePages

In [19]:
getSomePages()

You will be asked to enter a list of topics, and we'll collect data from those topics' Wikipedia pages.

Enter the name of a topic to get the Wikipedia page. If you are finished, enter /"done./"
Nashville

Enter the name of a topic to get the Wikipedia page. If you are finished, enter /"done./"
Memphis
Memphis could refer to multiple pages. Here they are:
1. Memphis, Egypt
2. Memphis, Tennessee
3. Mampsis
4. Memphis, Egypt
5. Memphis, Alabama
6. Memphis, Florida
7. Memphis, Indiana
8. Memphis, Michigan
9. Memphis, Mississippi
10. Memphis, Missouri
11. Memphis, Nebraska
12. Memphis, New York
13. Memphis, Ohio
14. Memphis, Tennessee
15. Memphis metropolitan area
16. Memphis, Texas
17. Memphis (film)
18. Memphis (band)
19. Memphis Industries
20. Memphis (musical)
21. Memphis (Boz Scaggs album)
22. Memphis (Roy Orbison album)
23. Memphis (The Badloves song)
24. "Memphis, Tennessee" (song)
25. Lonnie Mack
26. Walking in Memphis
27. Fleet Foxes
28. David Nail
29. Journals
30. Peter Tosh
31. 

[<WikipediaPage 'Nashville, Tennessee'>,
 <WikipediaPage 'Memphis, Tennessee'>,
 <WikipediaPage 'Knoxville, Tennessee'>]

# The pagesFromList function
This does the same as getSomePages(), but takes a list of strings to avoid the need for user input.

## Goal
From a list of strings, output a list of corresponding Wikipedia pages.
* Use the same error handling as the getValidPage function. Print a list of strings that don't return valid pages.
* Optional: allow user input, for any pages that raise disambiguation errors
* Optional: make English the default language, but allow the function to be passed another language abbreviation. Then use the set_lang function in the Wikipedia package to change the language.

## Python Skills
* Functions with multiple inputs
* Lists
* Handling errors (try, except)

In [20]:
def pagesFromList(topics, language = None, disambiguation = False):
    somePages = []
    errors = []
    
    if language is not None:
        wikipedia.set_lang(language)
    
    for pageName in topics:
        try:
            somePages.append(wikipedia.page(pageName, auto_suggest=False))
        
        except wikipedia.exceptions.DisambiguationError as d: #if a disambiguation page
            if not disambiguation:
                errors.append(pageName)
            else:
                print (pageName + ' could refer to multiple pages. Here they are:')
                for i in range(len(d.options)): #prints a list of the potential matches
                    print (str(i+1) + '. '+ d.options[i])
                print ('Enter the number for your choice.')
                selectionNum = pyip.inputInt(min=1, max=len(d.options)) #asks for the number of the desired entry; the limits keep the index in range
                pageName = d.options[selectionNum-1]
                try:
                    somePages.append(wikipedia.page(pageName, auto_suggest=False))
                except wikipedia.exceptions.WikipediaException: #the chosen title may itself be ambiguous or missing
                    errors.append(pageName)
        
        except wikipedia.exceptions.PageError: #if the page doesn't exist
            errors.append(pageName)
            
    if errors != []:
        print ('The following topics returned errors and were not added to the list: ')
        print(*errors, sep = "\n")
    
    return somePages

In [24]:
Romantics = ['William Wordsworth', 'Samuel Taylor Coleridge', 'Anna Barbauld', 'Mary Wollstonecraft', 'Percy Shelley']
Victorians = ['Christina Rossetti', 'Alfred Tennyson', 'Robert Browning', 'Elizabeth Barrett Browning', 'Rudyard Kipling']
Modernists = ['Virginia Woolf', 'W. B. Yeats', 'T. S. Eliot', 'James Joyce', 'Claude McKay']
Postmodernists = ['Daljit Nagra', 'Grace Nichols', 'Seamus Heaney', 'Derek Walcott', 'Zadie Smith']

BLsyllabus = Romantics + Victorians + Modernists + Postmodernists

testList = pagesFromList(BLsyllabus)
testList
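
The error-collecting loop can also be tested without a network connection by passing the lookup function in as a parameter. Here is a minimal sketch under that assumption. The names collectPages, fakeWiki, and fakeFetch are hypothetical stand-ins. With the real package, the except clause would catch wikipedia.exceptions.PageError and DisambiguationError instead of LookupError.

```python
def collectPages(topics, fetch):
    """Split topics into (pages, errors) using any lookup function.

    fetch is a hypothetical parameter: a callable that returns a page
    for a title or raises LookupError when there is none.
    """
    pages, errors = [], []
    for topic in topics:
        try:
            pages.append(fetch(topic))
        except LookupError:
            errors.append(topic)
    return pages, errors

# A stand-in for wikipedia.page, so the logic can be tested offline.
fakeWiki = {'Zadie Smith': '<page: Zadie Smith>'}

def fakeFetch(title):
    return fakeWiki[title]  # raises KeyError (a kind of LookupError) for unknown titles

foundPages, failedTopics = collectPages(['Zadie Smith', 'Zadie Smyth'], fakeFetch)
print(foundPages)    # ['<page: Zadie Smith>']
print(failedTopics)  # ['Zadie Smyth']
```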

# The pageInfo Function

## Goal
Take a list of Wikipedia pages and return a list of dictionaries, one per page, with the attribute names as the keys and that page's attribute values as the values.
* See https://wikipedia.readthedocs.io/en/latest/code.html#api for the attributes of each page that could be collected. Currently collects these attributes: title, summary, page length, section headers, and categories. 
* Optional: export the dictionary as a csv file, with the attributes as column headers. (Could also use Pandas here, for different kinds of analysis)

## Python Skills
* Dictionaries
* File output

In [28]:
import csv

def pageInfo(pages, fileName = None):
    listOfDicts = []
        
    for i in range(len(pages)):
        dictForPage = {}
        dictForPage['Title'] = pages[i].title
        dictForPage['Summary'] = wikipedia.summary(pages[i].title, sentences=1, auto_suggest=False) #auto_suggest=False keeps the lookup on the exact title
        dictForPage['Page length'] = len(pages[i].content)
        dictForPage['Section Headers'] = getHeaders(pages[i])
        dictForPage['Categories'] = pages[i].categories
        '''Can also collect more information:      
        dictForPage['Full text'] = pages[i].content
        try:
            dictForPage['Coordinates'] = pages[i].coordinates
        except:
            dictForPage['Coordinates'] = 'NA'
        dictForPage['Image URLs'] = pages[i].images
        dictForPage['Wikipedia Pages Linked to on the Page'] = pages[i].links
        dictForPage['Links to External Pages'] = pages[i].references
        dictForPage['Parent ID'] = pages[i].parent_id
        dictForPage['Revision ID'] = pages[i].revision_id'''
                
        listOfDicts.append(dictForPage)
    
    if fileName is not None and listOfDicts: #skips the export if there are no pages to write
        fileName += '.csv'
        colHeaders = listOfDicts[0].keys() #Gets the keys from the first dictionary (they all have the same)
        with open(fileName, 'w', encoding='utf-8', newline='') as csv_file_object: #newline='' lets the csv module control line endings
            writer = csv.DictWriter(csv_file_object, fieldnames=colHeaders)
            writer.writeheader()
            for row_dict in listOfDicts:
                writer.writerow(row_dict)
     
    return listOfDicts
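
One thing to watch in the csv export: the 'Section Headers' and 'Categories' values are lists, so they are written in their Python form, brackets and quotes included. Here is a sketch of one way to flatten them first. The flattenLists helper is my own name, and the row is made-up sample data (the page length is invented), written to an in-memory buffer rather than a file.

```python
import csv
import io

def flattenLists(row, separator='; '):
    """Return a copy of row with every list value joined into one string."""
    return {key: separator.join(value) if isinstance(value, list) else value
            for key, value in row.items()}

# Made-up sample row in the shape pageInfo produces.
sampleRow = {'Title': 'Fisk University', 'Page length': 12345,
             'Section Headers': ['History', 'Campus']}

buffer = io.StringIO()  # in-memory stand-in for a file opened with newline=''
writer = csv.DictWriter(buffer, fieldnames=list(sampleRow))
writer.writeheader()
writer.writerow(flattenLists(sampleRow))
csvLines = buffer.getvalue().splitlines()
print(csvLines)
```

The separator is a judgment call: a semicolon is less likely than a comma to appear inside a header or category name.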

In [29]:
pageInfo(testList,'wikiBLSyllabus')

[{'Title': 'William Wordsworth',
  'Summary': 'William Wordsworth (7 April 1770 – 23 April 1850) was an English Romantic poet who, with Samuel Taylor Coleridge, helped to launch the Romantic Age in English literature with their joint publication Lyrical Ballads (1798).',
  'Page length': 20736,
  'Section Headers': ['Early life',
   'Relationship with Annette Vallon',
   'First publication and Lyrical Ballads',
   'The Borderers',
   'Marriage and children',
   'The Prospectus',
   'Religious beliefs',
   'Laureateship and other honours',
   'Death',
   'In popular culture',
   'Major works',
   'See also',
   'References',
   'Further reading',
   'External links'],
  'Categories': ['1770 births',
   '1850 deaths',
   '18th-century Christian mystics',
   '18th-century English poets',
   '18th-century English writers',
   '18th-century male writers',
   '19th-century Christian mystics',
   '19th-century English poets',
   '19th-century English writers',
   '19th-century male writers',


# The getContent Function

## Goal
Input a list of Wikipedia pages and create a list of dictionaries, one per page, each holding the 'Title' and 'Full text' of the article. If passed a file name, the function also outputs a csv file.

## Python Skills
* Dictionaries
* File output

In [32]:
def getContent(pages, fileName = None):
    listOfDicts = []
    
    for i in range(len(pages)):
        dictForPage = {}
        dictForPage['Title'] = pages[i].title
        dictForPage['Full text'] = pages[i].content
        
        listOfDicts.append(dictForPage)
    
    if fileName is not None and listOfDicts: #skips the export if there are no pages to write
        fileName += '.csv'
        colHeaders = listOfDicts[0].keys() #Gets the keys from the first dictionary (they all have the same)
    
        with open(fileName, 'w', encoding='utf-8', newline='') as csv_file_object: #newline='' lets the csv module control line endings
            writer = csv.DictWriter(csv_file_object, fieldnames=colHeaders)
            writer.writeheader()
            for row_dict in listOfDicts:
                writer.writerow(row_dict)
     
    return listOfDicts
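
pageInfo and getContent share the same csv-writing steps, so a natural follow-up exercise is to move those steps into one helper. Here is a minimal sketch, assuming a helper I've called writeDictsToCsv (a hypothetical name). The demo writes to a temporary folder so it doesn't leave files behind in the working directory.

```python
import csv
import os
import tempfile

def writeDictsToCsv(listOfDicts, fileName):
    """Write same-keyed dictionaries to fileName + '.csv'.

    Returns the path written, or None if there were no rows.
    """
    if not listOfDicts:  # no first row to read the column headers from
        return None
    path = fileName + '.csv'
    with open(path, 'w', encoding='utf-8', newline='') as csvFile:
        writer = csv.DictWriter(csvFile, fieldnames=list(listOfDicts[0]))
        writer.writeheader()
        writer.writerows(listOfDicts)
    return path

# Demo with made-up rows; multi-line text survives the round trip.
demoRows = [{'Title': 'A', 'Full text': 'line one\nline two'}]
demoPath = writeDictsToCsv(demoRows, os.path.join(tempfile.mkdtemp(), 'demo'))
with open(demoPath, encoding='utf-8', newline='') as csvFile:
    readBack = list(csv.DictReader(csvFile))
print(readBack == demoRows)  # True
```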

In [33]:
getContent(testList,'syllabusFullText')

[{'Title': 'William Wordsworth',
  'Full text': 'William Wordsworth (7 April 1770 – 23 April 1850) was an English Romantic poet who, with Samuel Taylor Coleridge, helped to launch the Romantic Age in English literature with their joint publication Lyrical Ballads (1798).\nWordsworth\'s magnum opus is generally considered to be The Prelude, a semi-autobiographical poem of his early years that he revised and expanded a number of times. It was posthumously titled and published by his wife in the year of his death, before which it was generally known as "the poem to Coleridge".Wordsworth was Britain\'s poet laureate from 1843 until his death from pleurisy on 23 April 1850.\n\n\n== Early life ==\n\nThe second of five children born to John Wordsworth and Ann Cookson, William Wordsworth was born on 7 April 1770 in what is now named Wordsworth House in Cockermouth, Cumberland, part of the scenic region in northwestern England known as the Lake District. William\'s sister, the poet and diarist 

# The setLanguage function
At the time of writing there were 309 Wikipedias, each in a different language. For a list of languages and their abbreviations, see https://meta.wikimedia.org/wiki/List_of_Wikipedias. This isn't a hard programming assignment, but accessing the different language Wikipedias opens up some interesting questions to ask.

## Goal
Change the wikipedia API to a different language and return the language as a string. After the language is set, all queries will be to that language Wikipedia.

## Python Skills
* Input validation (using a while loop; could also use pyinputplus)

In [11]:
def setLanguage(language_wanted):
  
    languages = wikipedia.languages() #fetches the dictionary of abbreviations once and reuses it below
    while language_wanted not in languages:
        print('Sorry, ' + language_wanted + ' is not a valid choice. Please select an abbreviation from https://meta.wikimedia.org/wiki/List_of_Wikipedias')
        language_wanted = input()
    wikipedia.set_lang(language_wanted)
    print ('Language now set to ' + languages[language_wanted])
    
    return languages[language_wanted]

In [12]:
setLanguage('en')

Language now set to English


'English'
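
The validation step can be written as a small pure function, which is easier to test than a loop that waits for input. Here is a sketch, assuming a helper named chooseLanguage (hypothetical) and a short hand-written dictionary in place of the full one that wikipedia.languages() returns.

```python
def chooseLanguage(code, available):
    """Return the normalised code if it is a key of available, else None.

    available is assumed to be a dict shaped like the one
    wikipedia.languages() returns: abbreviation -> language name.
    """
    normalised = code.strip().lower()
    return normalised if normalised in available else None

# A small hand-written sample, not the full list.
sampleLanguages = {'en': 'English', 'fr': 'français', 'sw': 'Kiswahili'}
print(chooseLanguage(' FR ', sampleLanguages))  # fr
print(chooseLanguage('xx', sampleLanguages))    # None
```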

# The showOnMap Function

## Goal
Given a Wikipedia page for a place, get the coordinates using the coordinates property and show them on Google Maps.

## Python Skills
* Using the webbrowser package

In [34]:
import webbrowser

def showOnMap(page):

    try:
        coords = page.coordinates
        latitude = str(coords[0])
        longitude = str(coords[1])
        
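    except Exception: #reading the coordinates property fails when the page has no location
        print ('No location available.')
        return #stops here; otherwise latitude and longitude would be undefined below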
    
    zoom = '4'
    googleMapUrl = 'http://www.google.com/maps/place/'+latitude+','+longitude+'/@'+latitude+','+longitude+','+zoom+'z'
    print(googleMapUrl)  # not necessary to print this, but useful for debugging
    webbrowser.open_new_tab(googleMapUrl)

In [35]:
place = getValidPage('Nashville')
showOnMap(place)

http://www.google.com/maps/place/36.1666666699999979073254507966339588165283203125,-86.78333333000000493484549224376678466796875/@36.1666666699999979073254507966339588165283203125,-86.78333333000000493484549224376678466796875,4z
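
The URL above carries far more decimal places than a map needs. Here is a sketch of a URL builder that rounds the coordinates first, assuming they arrive as decimal.Decimal values (which the long output suggests). buildMapUrl and its places parameter are my own names. The URL pattern is the one used in showOnMap.

```python
from decimal import Decimal

def buildMapUrl(latitude, longitude, zoom=4, places=6):
    """Build a Google Maps URL with coordinates rounded to a fixed number of decimals."""
    lat = f'{latitude:.{places}f}'
    lon = f'{longitude:.{places}f}'
    return f'http://www.google.com/maps/place/{lat},{lon}/@{lat},{lon},{zoom}z'

mapUrl = buildMapUrl(Decimal('36.16666666999999790'), Decimal('-86.78333333000000493'))
print(mapUrl)
# http://www.google.com/maps/place/36.166667,-86.783333/@36.166667,-86.783333,4z
```

Building the string in its own function also means the URL can be checked without opening a browser tab.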
