# Scraping Wikipedia for suburb summaries and images

## Goal
Swap out the planet images and descriptions in Endless Sky for suburbs in Victoria, using publicly available images and Wikipedia summaries.

## Method
The list of suburbs per LGA (local government area) was compiled from this [Wikipedia page](https://en.wikipedia.org/wiki/List_of_localities_in_Victoria), partly by web scraping and partly by hand.

The suburbs are stored in `Working LGA Suburbs sheet.xlsx`.
There are 79 LGAs, and we made an arbitrary decision to pick 4 suburbs per LGA, giving 316 suburbs in total.


### Libraries used
- pandas: DataFrames and Excel read/write
- requests: HTTP access, used here to download the images
- Wikipedia API wrappers: search Wikipedia for pages and get each page's summary and image list. I found two and tried them in this order:
  - [wiki-api](https://github.com/richardasaurus/wiki-api)
  - [wikipedia](https://wikipedia.readthedocs.io/en/latest/quickstart.html#quickstart)
  - I initially used `wiki-api` for the suburb summaries and images, but it only returns a low-resolution thumbnail of the main article image, so I switched to `wikipedia`, which gives the full list of images in each article.
- re: regular expressions
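
As a rough sketch of how the `wikipedia` package is used further down: `search` returns a list of page titles, and `page` returns an object with `url`, `summary` and `images` attributes. The `lookup` helper and the `fake_wk` stand-in are hypothetical names made up for this example (the stand-in only lets the sketch run without network access); they are not part of the library.

```python
from types import SimpleNamespace

def lookup(query, wk=None):
    """Return (url, summary, images) for the first search hit, or None if there are no hits."""
    if wk is None:
        import wikipedia as wk  # the real third-party package, imported lazily
    titles = wk.search(query)   # list of page titles
    if not titles:
        return None
    page = wk.page(titles[0])
    return page.url, page.summary, page.images

# Hypothetical offline stand-in that mimics the two calls used above
fake_wk = SimpleNamespace(
    search=lambda query: ['Bright, Victoria'],
    page=lambda title: SimpleNamespace(
        url='https://en.wikipedia.org/wiki/Bright,_Victoria',
        summary='Bright is a town in ...',
        images=['https://example.org/Bright_main_street.jpg'],
    ),
)

url, summary, images = lookup('Bright victoria', wk=fake_wk)
print(url)
```

With the real package installed, calling `lookup('Bright victoria')` without the `wk` argument would query Wikipedia instead of the stand-in.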

In [1]:
import pandas as pd

In [2]:
from lxml import html, etree
import requests

Read in the LGA suburb list

In [3]:
dat = pd.read_excel(r'Working LGA Suburbs sheet.xlsx')

In [4]:
dat = dat.set_index('Local government')

In [7]:
from wikiapi import WikiApi
import wikipedia as wk
import re

In [8]:
wiki = WikiApi()

In [9]:
flatsuburb=dat.stack()
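
To make the `stack()` step concrete, here is a minimal sketch on a made-up two-row sheet. The column names `Suburb 1` and `Suburb 2` and the LGA labels are assumptions for illustration; the real spreadsheet's layout may differ.

```python
import pandas as pd

# Hypothetical wide table: one row per LGA, one column per chosen suburb
wide = pd.DataFrame(
    {
        'Local government': ['LGA A', 'LGA B'],
        'Suburb 1': ['Myrtleford', 'Ararat'],
        'Suburb 2': ['Bright', 'Moyston'],
    }
).set_index('Local government')

# stack() turns the columns into an inner index level,
# leaving one flat Series with an entry per (LGA, column) cell
flat = wide.stack()
print(list(flat))
```

The result is ordered row by row, so all suburbs of the first LGA come before those of the second.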

In [5]:
# Obsolete function - used initially with wiki-api
import urllib.parse as up
def suburb_image(row):
    suburb,url = [row['index'],row['source']]
    mobj = re.search(r'.*\.(.*)',up.urlparse(url).path)
    if mobj:
        return '{0}.{1}'.format(suburb,mobj.group(1).lower())
    else:
        return url
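
The regex above takes everything after the last dot in the URL path as the file extension. A possible alternative, sketched here with only the standard library, is `os.path.splitext`, which looks at the final path component only. The function name `image_filename` and the example URLs are hypothetical.

```python
import os
from urllib.parse import urlparse

def image_filename(suburb, url):
    """Build '<suburb>.<ext>' from the image URL, or return None if the path has no extension."""
    path = urlparse(url).path               # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower() # e.g. '.jpg', or '' when there is no extension
    return suburb + ext if ext else None

print(image_filename('Myrtleford', 'https://example.org/images/Main_Street.JPG?width=800'))
print(image_filename('Myrtleford', 'https://example.org/wiki/Myrtleford'))
```

Unlike `suburb_image`, this sketch returns `None` rather than the URL when no extension is found, which is a design choice of the example and not what the notebook does.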

In [4]:
# Image downloading function adapted from Stack Overflow, with basic error handling
def download_image(url,fname):
    try:
        r = requests.get(url, stream=True)
        if r.status_code != 200:
            # Report failed responses instead of marking them 'Done'
            print('{0}: HTTP {1}'.format(url, r.status_code))
            return 'null'
        with open(fname, 'wb') as f:
            for chunk in r:
                f.write(chunk)
        return 'Done'
    except Exception:
        print(url+": something's wrong")
        return 'null'

In [19]:
# placeholder suburb image
hawaii_image = r'https://www.google.com.au/imgres?imgurl=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F8%2F8d%2FNa_Pali_Coast%2C_Kauai%2C_Hawaii.jpg&imgrefurl=https%3A%2F%2Fen.m.wikipedia.org%2Fwiki%2FHawaiian_culture&docid=MrshDp0UcD3_nM&tbnid=rDnsZuzMjRMggM%3A&w=1761&h=1173&bih=747&biw=760&ved=0ahUKEwiS5sTB4JrOAhUHpJQKHay-AYgQMwgcKAEwAQ&iact=mrc&uact=8'
def spicker(hitlist):
    viclist = [hit for hit in hitlist if re.search(r'victoria',hit,flags=re.I)]
    if len(viclist)>0:
        return viclist[0]
    elif len(hitlist)>0:
        return hitlist[0]
    else:
        return None
    
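def wiki_suburb_search(suburb):
    # wiki-api version, superseded by wiki_suburb_search2 below
    results = wiki.find(suburb+' victoria')
    if not results:
        return ['null','null']
    # Only fetch the article once we know there is at least one hit
    article = wiki.get_article(spicker(results))
    print(article)
    return [article.summary,article.image]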
    
def wiki_suburb_search2(suburb):
    try:
        results = wk.search(suburb+' victoria')
        #print(results)
        if not results:
            return ['null']*6
        a = wk.page(spicker(results))
        # Rank the images so that photos sort ahead of PNGs, maps, logos and SVGs
        img_list = sorted(a.images,key=avoid)
        img_url = img_list[0]
        mobj = re.search(r'.*\.(.*)',up.urlparse(img_url).path)
        img_fn = '{0}.{1}'.format(suburb,mobj.group(1).lower())
        # Skip the download if the file is already in the working directory (see fdict)
        if img_fn in fdict:
            img_status = 'Done'
        else:
            img_status = download_image(img_url,img_fn)
        return [a.url, a.summary, img_url, img_fn, img_status, str(img_list)]
    except Exception:
        # Any lookup, parsing or download failure yields a row of 'null' placeholders
        return ['null']*6
    
def avoid(instr):
    # Sort key: 0 for preferred images, 2 or more for names matching a pattern below
    for i,x in enumerate(['png', 'map', 'logo', 'red_pog', 'svg']):
        if re.search(x,instr,flags=re.I):
            return i+2
    return 0

# cmp-style comparator; not called anywhere in this notebook (the sort uses key=avoid)
def img_cmp(a,b):
    if avoid(a) < avoid(b):
        return 1
    else:
        return -1
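
To show how `avoid` behaves as a sort key, here is a small self-contained sketch. The three URLs are invented for the example; the pattern list and scoring are the same as in `avoid` above.

```python
import re

AVOID_PATTERNS = ['png', 'map', 'logo', 'red_pog', 'svg']

def avoid(url):
    """Return 0 for preferred images, or position + 2 of the first matching pattern."""
    for i, pattern in enumerate(AVOID_PATTERNS):
        if re.search(pattern, url, flags=re.I):
            return i + 2
    return 0

# Hypothetical image URLs, as they might appear in a page's image list
urls = [
    'https://example.org/Australia_Victoria_location_map.svg',
    'https://example.org/Town_logo.png',
    'https://example.org/Main_street.jpg',
]

scores = [avoid(u) for u in urls]
ranked = sorted(urls, key=avoid)
print(scores)
print(ranked[0])
```

Because the patterns are matched as plain substrings anywhere in the URL, a photo whose file name happens to contain `map` or `logo` would also be pushed down the list.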

## Code testing

In [23]:
a = wiki_suburb_search2(flatsuburb.iloc[7])

['Lake Bolac, Victoria', 'The Exploders', 'Lake Bolac stone arrangement', 'Mininera & District Football League', 'Hopkins River', 'Victoria River (Victoria)', 'Plenty River (Victoria)', 'Calder River (Victoria)', 'Little River (Moroka River, Victoria)', 'Glenelg Highway']


In [22]:
flatsuburb.iloc[6]

'Taytoon'

In [778]:
results = wk.search('Taytoon'+' Victoria')

In [17]:
import os
# Snapshot of the files already in the working directory, so existing images are not downloaded again
fdict = {f: 1 for f in os.listdir()}
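
`fdict` is only ever used to ask "is this file already here?", so a set would express the same idea. A minimal sketch, using a temporary directory and the hypothetical helper name `already_have` so that it can run anywhere:

```python
import os
import tempfile

def already_have(fname, existing):
    """True if fname was present when the directory snapshot was taken."""
    return fname in existing

with tempfile.TemporaryDirectory() as workdir:
    # Pretend one image was downloaded on an earlier run
    open(os.path.join(workdir, 'Myrtleford.jpg'), 'wb').close()
    existing = set(os.listdir(workdir))

print(already_have('Myrtleford.jpg', existing))
print(already_have('Bright.jpg', existing))
```

Note that the snapshot is taken once, so files written after it was built are not in it.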

## Running the wikipedia scraper

In [25]:
suburbdata=[wiki_suburb_search2(suburb) for suburb in flatsuburb]

['Myrtleford', 'Richard Colbeck', 'Shire of Myrtleford', 'County of Bogong', 'Myrtleford railway station', 'Curing of tobacco', 'Ross Milne', 'Battle of Mont Saint-Quentin', 'Chloe McConville', 'Tobacco']
['Bright, Victoria', 'Bright', 'Shire of Bright', 'Christian Geiger', 'County of Bogong', '87.6 FM', 'Audax Alpine Classic', 'Tourist gateway', 'Centenary Test', 'Phillip Tracey']
['Porepunkah', 'Porepunkah Airfield', 'Porepunkah railway station', 'County of Bogong', 'Ovens River', 'Buckland River (Victoria)', 'Mount Selwyn (mountain)', 'List of airports in Victoria (Australia)', 'Warrnambool Airport', 'Portland Airport (Victoria)']
['Tawonga, Victoria', 'Mount Beauty, Victoria', 'Electoral district of Benambra', 'Shire of Bright', 'Kiewa Valley Highway', 'Tallangatta & District Football League', 'Kiewa River', 'Great Alpine Road', 'Tea in Australia', 'List of localities in Victoria']
['Ararat, Victoria', 'Dunneworthy, Victoria', 'Ararat Football Club', 'Hopkins Correctional Centre (A
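
One thing these hit lists show is that `spicker` can be led astray: for Porepunkah the town's own title does not contain "Victoria", so the first title that does ('Buckland River (Victoria)') wins. The sketch below reproduces that on the first six hits printed above and tries one possible refinement. `pick_title` is a hypothetical helper, not something the notebook uses.

```python
import re

def spicker(hitlist):
    """Same logic as the notebook's spicker: first title mentioning Victoria, else first hit."""
    viclist = [hit for hit in hitlist if re.search(r'victoria', hit, flags=re.I)]
    if viclist:
        return viclist[0]
    return hitlist[0] if hitlist else None

def pick_title(suburb, hitlist):
    """Hypothetical refinement: prefer '<suburb>, Victoria' or '<suburb>' before falling back."""
    for wanted in (suburb.lower() + ', victoria', suburb.lower()):
        for hit in hitlist:
            if hit.lower() == wanted:
                return hit
    return spicker(hitlist)

hits = ['Porepunkah', 'Porepunkah Airfield', 'Porepunkah railway station',
        'County of Bogong', 'Ovens River', 'Buckland River (Victoria)']

print(spicker(hits))
print(pick_title('Porepunkah', hits))
```

An exact-title match will not help with misspelled suburb names, so this is only a partial fix.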

## Formatting and Exporting the results

In [28]:
#a.url, a.summary, img_url, img_fn, img_status, str(img_list)
suburbdf = pd.DataFrame(suburbdata,columns=['wiki.url','wiki.summary','img_url','img_fn','img_status','img_list'])
suburbdf = suburbdf.set_index(flatsuburb)
#suburbdf = suburbdf.reset_index()
#suburbdf['image']=suburbdf.apply(suburb_image, axis=1)

In [30]:
suburbdf[suburbdf['wiki.url']=='null']

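| | wiki.url | wiki.summary | img_url | img_fn | img_status | img_list |
|---|---|---|---|---|---|---|
| Taytoon | | | | | | |
| Tamnick | | | | | | |
| Gembrook | | | | | | |
| Reservoir | | | | | | |
| Karingal | | | | | | |
| Jeparit | | | | | | |
| Boort | | | | | | |
| Croydon | | | | | | |
| Wattle Glen | | | | | | |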
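
The same check can be made on the raw list before it goes into a DataFrame, which is handy for counting misses early. A minimal sketch with two made-up rows (the URL, summary and image fields are placeholders, not scraper output):

```python
suburbs = ['Myrtleford', 'Taytoon']
rows = [
    ['https://en.wikipedia.org/wiki/Myrtleford', 'summary text', 'image url',
     'Myrtleford.jpg', 'Done', '[]'],
    ['null'] * 6,  # the row wiki_suburb_search2 returns when a lookup fails
]

# A failed lookup is recognisable by 'null' in the first field (the page URL)
failed = [suburb for suburb, row in zip(suburbs, rows) if row[0] == 'null']
print(failed)
```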


In [31]:
# The context manager saves and closes the workbook on exit
with pd.ExcelWriter('suburbs-v2.xlsx') as writer:
    suburbdf.to_excel(writer, sheet_name='Sheet1')

In [633]:
suburbdf

| | index | summary | source | image |
|---|---|---|---|---|
| 0 | Myrtleford | Myrtleford is a town in Victoria, Australia. I... | http://upload.wikimedia.org/wikipedia/commons/... | Myrtleford.jpg |
| 1 | Bright | Bright (pronunciation [ˈbɹɑet̥]) is a town in ... | http://upload.wikimedia.org/wikipedia/en/thumb... | Bright.png |
| 2 | Porepunkah | Porepunkah is a town in northeast Victoria, Au... | http://upload.wikimedia.org/wikipedia/commons/... | Porepunkah.jpg |
| 3 | Tawonga South | | | |
| 4 | Ararat | Ararat is a city in south-west Victoria, Austr... | http://upload.wikimedia.org/wikipedia/commons/... | Ararat.jpg |
| 5 | Moyston | Moyston is a town in the Western District regi... | http://upload.wikimedia.org/wikipedia/commons/... | Moyston.jpg |
| 6 | Taytoon | Tatyoon is a town in the northern region of Vi... | http://upload.wikimedia.org/wikipedia/commons/... | Taytoon.png |
| 7 | Cathcart | | | |
| 8 | Wendouree | Wendouree is a large suburb on the north weste... | http://upload.wikimedia.org/wikipedia/en/thumb... | Wendouree.jpg |
| 9 | Sebastopol | Sebastopol is a southern suburb on the rural-u... | http://upload.wikimedia.org/wikipedia/commons/... | Sebastopol.jpg |
