# WikiRecentPhase3

[WikiRecentPhase2](./imgAna_2.jupyter-py36.ipynb) illustrated processing Wikipedia events continuously with Streams using the windowing facility to process 'chunks' of events on a time or count basis.
Building on the previous notebooks, this notebook extracts images from Wikipedia events and renders them.


## Overview - Image Extraction

The previous notebooks received and filtered events from Wikipedia. This notebook continues the processing of events: it determines whether an event pertains to an image and extracts the URL using
[beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). The image's URL is injected into the stream. This notebook gets the extracted URL via a view and renders it.


<a name="setup"></a>
# Setup
### Add credentials for the IBM Streams service

#### ICPD setup

With the cell below selected, click the "Connect to instance" button in the toolbar to insert the credentials for the service.

<a target="blank" href="https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2019/02/connect_icp4d.gif">See an example</a>.

#### Cloud setup

To use a Streams instance running in the cloud, set up a [credential.py](setup_credential.ipynb).


##  Show me
After completing the 'Setup' above, use the menu 'Cell' | 'Run All' to compose, build, submit and start the rendering of the live Wikidata events. Go to [Show me now](#showMeNow) for the rendering.


In [1]:
# Install components
!pip install sseclient
!pip install --user --upgrade streamsx

Requirement already up-to-date: streamsx in /Users/siegenth/.local/lib/python3.5/site-packages (1.12.10)


In [2]:
# Setup 
import pandas as pd

from IPython.core.debugger import set_trace
from IPython.display import display, clear_output

from statistics import mean
from collections import deque
from collections import Counter

import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import Button, HBox, VBox, Layout

from bs4 import BeautifulSoup

%matplotlib inline

from sseclient import SSEClient as EventSource

# json, os and datetime are used by get_events() and WikiPhase3() below
import datetime
import json
import os
from functools import lru_cache
import requests

from streamsx.topology.topology import *
import streamsx.rest as rest
from streamsx.topology import context

#from streamsx.topology.topology import Topology

## Support functions for Jupyter

In [3]:
def catchInterrupt(func):
    """decorator : when an interrupt occurs the display is lost if you don't catch it
       TODO * <view>.stop_data_fetch()  # stop
       
    """
    def catch_interrupt(*args, **kwargs):
        try: 
            func(*args, **kwargs)
        except (KeyboardInterrupt): pass
    return catch_interrupt

#
# Support for locating/rendering views.
def display_view_stop(eventView, period=2):
    """Wrapper for streamsx.rest_primitives.View.display() to have button. """
    button =  widgets.Button(description="Stop Updating")
    display(button)
    eventView.display(period=period) 
    def on_button_clicked(b):
        eventView.stop_data_fetch()
        b.description = "Stopped"
    button.on_click(on_button_clicked)

def view_events(views):
    """
    Build interface to display a list of views and 
    display view when selected from list.
     
    """
    view_names = [view.name for view in views]
    nameView = dict(zip(view_names, views))    
    select = widgets.RadioButtons(
        options = view_names,
        value = None,
        description = 'Select view to display',
        disabled = False
    )
    def on_change(b):
        if (b['name'] == 'label'):
            clear_output(wait=True)
            [view.stop_data_fetch() for view in views ]
            display(select)
            display_view_stop(nameView[b['new']], period=2)
    select.observe(on_change)
    display(select)

def find_job(instance, job_name=None):
    """locate job within instance"""
    for job in instance.get_jobs():    
        if job.applicationName.split("::")[-1] == job_name:
            return job
    else:
        return None

def display_views(instance, job_name):
    "Locate/promote and display all views of a job"
    job = find_job(instance, job_name=job_name)
    if job is None:
        print("Failed to locate job")
    else:
        views = job.get_views()
        view_events(views)

def list_jobs(_instance=None, cancel=False):
    """
    Interactive selection of jobs to cancel.
    
    Prompts with a SelectMultiple widget; if there are no jobs, you are presented with a blank list.
    
    """
    active_jobs = { "{}:{}".format(job.name, job.health):job for job in _instance.get_jobs()}

    selectMultiple_jobs = widgets.SelectMultiple(
        options=active_jobs.keys(),
        value=[],
        rows=len(active_jobs),
        description = "Cancel job(s)" if cancel else "Active job(s):",
        layout=Layout(width='60%')
    )
    cancel_jobs = widgets.ToggleButton(
        value=False,
        description='Cancel',
        disabled=False,
        button_style='warning', # 'success', 'info', 'warning', 'danger' or ''
        tooltip='Delete selected jobs',
        icon="stop"
    )
    def on_value_change(change):
        for job in selectMultiple_jobs.value:
            print("canceling job:", job, active_jobs[job].cancel())
        cancel_jobs.disabled = True
        selectMultiple_jobs.disabled = True

    cancel_jobs.observe(on_value_change, names='value')
    if cancel:
        return HBox([selectMultiple_jobs, cancel_jobs])
    else:
        return HBox([selectMultiple_jobs])

# Connect to the server: ICP4D or Cloud instance
Attempt to import `icpd_core`. If the import fails, `cfg` is not defined and we know we are using the Cloud.
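
The environment probe described above can also be expressed without a try/except around the import. The sketch below is one possible alternative, not the notebook's method: `module_available` is a hypothetical helper name, and `icpd_core` is the module the notebook imports when running inside ICP4D.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


# 'icpd_core' is only expected to exist inside ICP4D; elsewhere this reports Cloud.
environment = "ICP4D" if module_available("icpd_core") else "Cloud"
print("Running against:", environment)
```

The try/except form used in the next cell has the advantage that the imported module is immediately usable; the probe above only answers the question of where the notebook is running.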

In [4]:
def get_instance():
    """Setup to access your Streams instance.

    ..note::The notebook works within Cloud and ICP4D. 
            Refer to the 'Setup' cells above.              
    Returns:
        instance : Access to Streams instance, used for submitting and rendering views.
    """
    try:
        from icpd_core import icpd_util
        import urllib3
        global cfg
        cfg[context.ConfigParams.SSL_VERIFY] = False
        instance = rest.Instance.of_service(cfg)
        print("Within ICP4D")
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    except ImportError:
        cfg = None
        print("Outside ICP4D")
        import credential  
        sc = rest.StreamingAnalyticsConnection(service_name='Streaming3Turbine', 
                                               vcap_services=credential.vcap_conf)
        instance = sc.get_instances()[0]
    return instance,cfg

instance,cfg = get_instance()

Outside ICP4D


## List jobs and cancel....
This page will submit a job named 'WikiPhase3'. If it is already running, cancel it before submitting a new version. If it is running and you do not need a new version, there is no need to cancel and resubmit; you can proceed to the [Viewing data section](#viewingData).

In [5]:
list_jobs(instance)

## Support functions that are executed within Streams
Details of these functions can be found in previous notebooks of this suite.

In [6]:
def get_events():
    """fetch recent changes from wikievents site using SSE"""
    for change in EventSource('https://stream.wikimedia.org/v2/stream/recentchange'):
        if len(change.data):
            try:
                obj = json.loads(change.data)
            except json.JSONDecodeError as err:
                # json.decoder.JSONDecodeError is the same class, so one clause covers both
                print("JSON error:", err, "Invalid JSON:", change.data)
            else:
                yield(obj)


class sum_aggregation():
    def __init__(self, sum_map={'new_len':'newSum','old_len':'oldSum','delta_len':'deltaSum' }):
        """
        Summation of column(s) over a window's tuples. 
        Args::
            sum_map :  specify tuple columns to be summed and the result field. 
            tuples : at run time, list of tuples will flow in. Sum each fields
        """
        self.sum_map = sum_map
    def __call__(self, tuples)->dict: 
        """
        Args:
            tuples : list of tuples constituting a window, over all the tuples sum using the sum_map key/value 
                     to specify the input and result field.
        Returns:
            dictionary of fields summations over tuples
            
        """
        summaries = dict()
        for summary_field,result_field in self.sum_map.items():
            summation = sum([ele[summary_field] for ele in tuples])
            summaries.update({result_field : summation})
        return(summaries)

import collections
class tally_fields(object):
    def __init__(self, top_count=3, fields=['user', 'wiki', 'title']):
        """
        Tally fields of a list of tuples.
        Args::
            fields :  fields of tuples that are to be tallied
        """
        self.fields = fields
        self.top_count = top_count
    def __call__(self, tuples)->dict:
        """
        Args::
            tuples : list of tuples tallying to perform. 
        return::
            dict of tallies
        """
        tallies = dict()
        for field in self.fields:
            stage = [tuple[field] for tuple in tuples if tuple[field] is not None]
            tallies[field] = collections.Counter(stage).most_common(self.top_count)
        return tallies

import csv
class wiki_lang():
    """
    Augment the tuple to include language wiki event.
    
    Mapping is loaded at build time and utilized at runtime.
    """

    def __init__(self, fname="wikimap.csv"):
        self.wiki_map = dict()
        with open(fname, mode='r') as csv_file:
            csv_reader = csv.DictReader(csv_file)
            for row in csv_reader:
                self.wiki_map[row['dbname']] = row

    def __call__(self, tuple):
        """using the 'wiki' field to look up the page's code, language and native name
        Args:
            tuple: tuple (dict) with a 'wiki' field
        Returns:
            input tuple with 'code', 'language', 'native' fields added to the input tuple.
        """
        if tuple['wiki'] in self.wiki_map:
            key = tuple['wiki']
            tuple['code'] = self.wiki_map[key]['code']
            tuple['language'] = self.wiki_map[key]['in_english']
            tuple['native'] = self.wiki_map[key]['name_language']
        else:
            tuple['code'] = tuple['language'] = tuple['native'] = None
        return tuple



## Shredding web pages

The next phase of the Stream checks whether the event is associated with an image; if it is, the image URL is extracted. 

- find a possible link to an image
- build the URL and use it to fetch the page, then shred it, searching for an image link
- shredding can go down multiple levels.
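
The shredding step in the list above can be illustrated without network access. The following is a minimal sketch, assuming a page that wraps the full-size link in `<div class="fullImageLink">` as the functions below expect. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so it runs with no extra packages; `FullImageLinkParser`, `extract_full_image`, `SAMPLE_PAGE` and the `example.org` URLs are hypothetical names made up for the illustration.

```python
from html.parser import HTMLParser


class FullImageLinkParser(HTMLParser):
    """Collect the href of the first <a> found inside <div class="fullImageLink">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while inside the target div (or a div nested in it)
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.div_depth or "fullImageLink" in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def extract_full_image(html):
    """Return the full-size image URL, or None when the page has no such link."""
    parser = FullImageLinkParser()
    parser.feed(html)
    if parser.href is None:
        return None
    # Protocol-relative links ("//host/path") need a scheme before they can be fetched.
    return "https:" + parser.href if parser.href.startswith("//") else parser.href


# Hypothetical page fragment; real pages are fetched with requests in the cells below.
SAMPLE_PAGE = """
<html><body>
  <div class="thumb"><a class="image" href="/wiki/File:Example.jpg">thumb</a></div>
  <div class="fullImageLink" id="file">
    <a href="//upload.example.org/Example.jpg"><img src="//upload.example.org/thumb.jpg"></a>
  </div>
</body></html>
"""

print(extract_full_image(SAMPLE_PAGE))
print(extract_full_image("<div class='other'><a href='/x'>x</a></div>"))
```

The link inside the `thumb` div is ignored because only anchors nested in the `fullImageLink` div are considered, which mirrors the second level of the shredding.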



In [7]:
#@lru_cache(maxsize=None)
def shred_item_image(url):
    """Shred the item page, seeking an image. 
    
    Discover whether the referencing URL points to an image by shredding the page. 
    If it does, dig deeper and extract the link to the image. 
    
    Locate the thumbnail within the page: <div class='thumb'> <a class='image' href=**url** ...>
    
    This traverses two pages: it pulls the thumbnail reference and follows it to the full-size image.
    
    Args:
        url: item page to analyse
    
    Returns: 
        If an image is found [{title,img,org_url},...], otherwise an empty list.
    
    .. warning:: this fetches from Wikipedia; requesting too frequently is bad manners. The lru_cache()
    decorator (commented out above) can be enabled to minimise the requests.
    
    A page may contain several candidates; only the first one is extracted. 
    """
    img_urls = list()
    try:
        rThumb = requests.get(url = url)
        #print(r.content)
        soupThumb = BeautifulSoup(rThumb.content, "html.parser")
        divThumb = soupThumb.find("div", class_="thumb")
        if divThumb is None:
            print("No thumb found", url  )
            return img_urls
        thumbA = divThumb.find("a", class_="image")
        thumbHref = thumbA.attrs['href']

        rFullImage = requests.get(url=thumbHref)
        soupFull = BeautifulSoup(rFullImage.content, "html.parser")
    except Exception as e:
        print("Error request.get, url: {} except:{}".format(url, str(e)))
    else:
        divFull = soupFull.find("div", class_="fullImageLink", id="file")
        if (divFull is not None):
            fullA = divFull.find("a")
            img_urls.append({"title":soupThumb.title.getText(),"img": fullA.attrs['href'],"org_url":url})
    finally:
        return img_urls

In [8]:
#@lru_cache(maxsize=None)
def shred_jpg_image(url):
    """Shred the jpg page, seeking an image; the reference begins with 'File:' and 
    ends with '.jpg'.
    
    Discover whether the referencing URL points to an image by shredding the page. 
    If it does, dig deeper and extract the link to the image. 
    
    Locate the image within the page, 
            locate : <div class='fullImageLink'..>
                         <a href="..url to image" ...>.</a>
                         :
                     </div>  
    Args:
        url: item page to analyse
    
    Returns: 
        If an image is found [{title,img,org_url='requesting url'},...], otherwise an empty list.
    
    .. warning:: this fetches from Wikipedia; requesting too frequently is bad manners. The lru_cache()
    decorator (commented out above) can be enabled to minimise the requests.
    
    """
    img_urls = list()
    try:
        r = requests.get(url = url)
        soup = BeautifulSoup(r.content, "html.parser")
    except Exception as e:
        print("Error request.get, url: {} except:{}".format(url, str(e)))
    else:
        div = soup.find("div", class_="fullImageLink")
        if (div is not None):
            imgA = div.find("a")
            img_urls.append({"title":soup.title.getText(),"img":"https:" + imgA.attrs['href'],"org_url":url})
        else: 
            print("failed to find div for",url)
    finally:
        return img_urls

In [9]:
class soup_image_extract():
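    """If the field_name field has the potential to reference an image, attempt to extract its URL.
    
    Return: 
        The input tuple with an 'img_desc' field added:
        None : field did not have potential for an image.
        [] : had potential but no url found. 
        [{title,img,org_url}] : image found.
    """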
    def __init__(self, field_name="title", url_base="https://www.wikidata.org/wiki/"):
        self.url_base = url_base
        self.field_name = field_name
    
    def __call__(self, _tuple):
        title = _tuple[self.field_name]
        img_desc = None 
        if (title[0] == "Q"):
            lnk = self.url_base + title
            img_desc = shred_item_image(lnk)
        elif title.startswith("File:") and (title.endswith('.JPG') or title.endswith('.jpg')):
            lnk = self.url_base + title.replace(' ','_')
            img_desc = shred_jpg_image(lnk)
        _tuple['img_desc'] = img_desc
        return _tuple

<a id='composeBuildSubmit'></a>
## Compose, build and submit the Streams application.
The following code cell composes the Streams application depicted here:
![stillPhase3.jpg](images/stillPhase3.jpg)

This notebook is an extension of the previous one, so I'll only discuss processing beyond 'langAugment'. For details regarding prior processing, refer to the previous [notebooks](./imgAna_2.ipynb).

The events output by the map named 'langAugment' are limited to those of type 'edit' where bot is 'False'. 
The fields are: code, delta_len, language, native, new_len, old_len, timestamp,
title, user and wiki. This phase uses the 'title' field to build the URL of a web page; the page is fetched and processed, looking for an image URL. 

The map named 'imgSoup' invokes soup_image_extract(), which uses the 'title' field to attempt to locate an image. The tuple that flows out of the map always carries an 'img_desc' field: None when the title had no potential for an image, an empty list when nothing was found, or a list describing the image. 
The filter named 'soupActive' checks 'img_desc' for content; 
only tuples that have content proceed to the view 'soupActive', where they can be viewed.
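
The map-then-filter behaviour can be tried locally before submitting anything to Streams. The sketch below is an approximation under stated assumptions: `classify_title`, `fake_extract` and `has_image` are hypothetical stand-ins, the `lookup` table replaces the web requests, and plain list comprehensions replace the Streams `map` and `filter` operators.

```python
def classify_title(title):
    """Return 'item', 'jpg' or None, following the two title patterns checked by soup_image_extract."""
    if title.startswith("Q"):
        return "item"
    if title.startswith("File:") and title.endswith((".JPG", ".jpg")):
        return "jpg"
    return None


def fake_extract(tup, lookup):
    """Stand-in for the 'imgSoup' map: no network, results come from `lookup`."""
    kind = classify_title(tup["title"])
    # None: no potential for an image; []: potential but nothing found; [...]: found
    tup["img_desc"] = None if kind is None else lookup.get(tup["title"], [])
    return tup


def has_image(tup):
    """Same predicate as the 'soupActive' filter."""
    return tup["img_desc"] is not None and len(tup["img_desc"]) > 0


# Hypothetical data: one title with a known image, one item without, one ordinary page.
lookup = {"File:Cat.jpg": [{"title": "Cat",
                            "img": "https://upload.example.org/Cat.jpg",
                            "org_url": "https://www.wikidata.org/wiki/File:Cat.jpg"}]}
events = [{"title": "Main Page"}, {"title": "Q42"}, {"title": "File:Cat.jpg"}]

mapped = [fake_extract(dict(e), lookup) for e in events]   # every tuple flows out of the map
kept = [t for t in mapped if has_image(t)]                 # only tuples with content survive

print([t["img_desc"] for t in mapped][:2])
print([t["title"] for t in kept])
```

All three tuples leave the map; only the last one passes the filter and would reach the view.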


In [10]:
list_jobs(instance, cancel=True)

In [11]:
def WikiPhase3(jobName=None, wiki_lang_fname=None):
    """
    Compose topology. 
    -- wiki_lang : csv file mapping database name to language

    """
    topo = Topology(name=jobName)
    ### make sure sseclient and bs4 are available in the Streams environment.
    topo.add_pip_package('sseclient')
    topo.add_pip_package('bs4')

    ## wiki events
    wiki_events = topo.source(get_events, name="wikiEvents")
    ## select events generated by humans
    human_filter = wiki_events.filter(lambda x: x['type']=='edit' and x['bot'] is False, name='humanFilter')
    # pare down the humans set of columns
    pared_human= human_filter.map(lambda x : {'timestamp':x['timestamp'],
                                              'new_len':x['length']['new'],
                                              'old_len':x['length']['old'], 
                                              'delta_len':x['length']['new'] - x['length']['old'],
                                              'wiki':x['wiki'],'user':x['user'],
                                              'title':x['title']}, 
                        name="paredHuman")
    pared_human.view(buffer_time=1.0, sample_size=200, name="paredEdits", description="Edits done by humans")

    ## Define window(count)& aggregate
    sum_win = pared_human.last(100).trigger(20)
    sum_aggregate = sum_win.aggregate(sum_aggregation(sum_map={'new_len':'newSum','old_len':'oldSum','delta_len':'deltaSum' }), name="sumAggregate")
    sum_aggregate.view(buffer_time=1.0, sample_size=200, name="aggEdits", description="Aggregations of human edits")

    ## Define window(count) & tally edits
    tally_win = pared_human.last(100).trigger(10)
    tally_top = tally_win.aggregate(tally_fields(fields=['user', 'title'], top_count=10), name="talliesTop")
    tally_top.view(buffer_time=1.0, sample_size=200, name="talliesCount", description="Top count tallies: user,titles")

    ## augment filtered/pared edits with language
    if cfg is None:
        lang_augment = pared_human.map(wiki_lang(fname='../datasets/wikimap.csv'), name="langAugment")
    else:
        lang_augment = pared_human.map(wiki_lang(fname=os.environ['DSX_PROJECT_DIR']+'/datasets/wikimap.csv'), name="langAugment")
    lang_augment.view(buffer_time=1.0, sample_size=200, name="langAugment", description="Language derived from wiki")

    ## Define window(time) & tally language
    time_lang_win = lang_augment.last(datetime.timedelta(minutes=2)).trigger(5)
    time_lang = time_lang_win.aggregate(tally_fields(fields=['language'], top_count=10), name="timeLang")
    time_lang.view(buffer_time=1.0, sample_size=200, name="talliesTime", description="Top timed tallies: language")

    ## attempt to extract image using beautifulsoup add img_desc[{}] field
    soup_image = lang_augment.map(soup_image_extract(field_name="title", url_base="https://www.wikidata.org/wiki/"),name="imgSoup")
    soup_active = soup_image.filter(lambda x: x['img_desc'] is not None and len(x['img_desc']) > 0, name="soupActive")
    soup_active.view(buffer_time=1.0, sample_size=200, name="soupActive", description="Image extracted via Bsoup")


    return ({"topo":topo,"view":{ }})

## Submitting job : ICP or Cloud

In [12]:
resp = WikiPhase3(jobName="WikiPhase3")
if cfg is not None:
    # Disable SSL certificate verification if necessary
    cfg[context.ConfigParams.SSL_VERIFY] = False
    submission_result = context.submit("DISTRIBUTED",resp['topo'], config=cfg)

if cfg is None:
    import credential
    cloud = {
        context.ConfigParams.VCAP_SERVICES: credential.vcap_conf,
        context.ConfigParams.SERVICE_NAME: "Streaming3Turbine",
        context.ContextTypes.STREAMING_ANALYTICS_SERVICE:"STREAMING_ANALYTIC",
        context.ConfigParams.FORCE_REMOTE_BUILD: True,
    }
    submission_result = context.submit("STREAMING_ANALYTICS_SERVICE",resp['topo'],config=cloud)

# The submission_result object contains information about the running application, or job
if submission_result.job:
    print("JobId: ", submission_result['id'] , "Name: ", submission_result['name'])


JobId:  10 Name:  ipythoninput113a4b30d5de96::WikiPhase3_10


<a id='viewingData'></a>
## Viewing data 

The running application has a number of views that show what data is moving through the stream. The following 
cell will fetch the views' queue and display its data when selected. 

|view name | description of data in the view | bot |
|---------|-------------|------|
|aggEdits  | summarised fields | False |
|langAugment | mapped augmented fields | False |
|paredEdits | selected fields | False |
|talliesCount | last 100 messages tallied | False | 
|talliesTime | 2 minute windowed | False |
|soupActive | extracted image links | False | 


Stop fetching the view data when you are done.

## Access Views / Render Views UI

In [13]:
# Render the views.....
display_views(instance, job_name="WikiPhase3")

## Render image submitted to wiki feed 
Build a dashboard to display images that are being submitted to Wikipedia. 

It's not uncommon to see the same image multiple times. An image (or any content) may need to be vetted for 
quality, copyright, pornography, etc. Each vetting stage generates another event on the Stream.

A variety of images are submitted; unfortunately, not all images are rendered in all browsers. I found that the Safari 
browser can render .tif files. 
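
Because the same image can arrive several times, a dashboard may want to skip URLs it has rendered recently. This is not part of the notebook's code; it is a minimal sketch of one way to do it, assuming a bounded memory of recent URLs is acceptable. `RecentlySeen` and the `example.org` URLs are hypothetical.

```python
from collections import deque


class RecentlySeen:
    """Remember the last `maxlen` keys and report whether a key is new."""

    def __init__(self, maxlen=50):
        self._order = deque(maxlen=maxlen)   # oldest key on the left
        self._members = set()                # fast membership test

    def is_new(self, key):
        if key in self._members:
            return False
        if len(self._order) == self._order.maxlen:
            # The deque is full: the oldest key is evicted by the append below.
            self._members.discard(self._order[0])
        self._order.append(key)
        self._members.add(key)
        return True


seen = RecentlySeen(maxlen=2)
urls = ["https://upload.example.org/a.jpg",
        "https://upload.example.org/a.jpg",
        "https://upload.example.org/b.jpg"]
to_render = [u for u in urls if seen.is_new(u)]
print(to_render)
```

In the notebook, such a check could sit in front of the call that renders the image, so a repeated URL would not trigger another download.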


In [14]:
# Notebook support

def render_image(image_url=None, output_region=None):
    """Write the image into an output region.
    
    Args::
        image_url: URL of the image to render
        output_region: output region
        
    .. note:: The output 'stage' must be created; if this is not done the image is rendered 
        in both the page and the output region. 
        
    """
    
    try:
        response = requests.get(image_url)
        stage = widgets.Output(layout={'border': '1px solid green'})
    except:
        print("Error on request : ", image_url)
    else:
        if response.status_code == 200:
            with output_region:
                stage.append_display_data(widgets.Image(
                    value=response.content,
                    #format='jpg',
                    width=300,
                    height=400,
                ))
            output_region.clear_output(wait=True) 

ana_stage = list()
def display_image(tup, image_region=None, title_region=None, url_region=None):
    if tup['img_desc'] is not None and len(tup['img_desc']) > 0:
        display_desc = tup['img_desc'][0]
        ana_stage.append(display_desc)
        title_region.value = "Img Title:{}".format(display_desc['title'] )
        url_region.value = "{}".format(display_desc['img'])
        render_image(image_url=display_desc['img'], output_region=image_region)

### Show me now
<a id='showMeNow'></a>

In [None]:
## Setup the Dashboard - display images sent to Wikipedia 
##                         Next cell populates the 'Dashboard'.....
status_widget = widgets.Label(value="Status", layout={'border': '1px solid green','width':'30%'})
url_widget = widgets.Label(value="Img URL", layout={'border': '1px solid green','width':'100%'})
image_widget = widgets.Output(layout={'border': '1px solid red','width':'30%','height':'270pt'})
title_widget = widgets.Label(value="Title", layout={'border': '1px solid green','width':'30%'})

dashboard = widgets.VBox([status_widget, image_widget, title_widget, url_widget])
display(dashboard)

In [None]:
# Notebook support
# setup 
_view = instance.get_views(name="soupActive")[0]
_view.start_data_fetch()

@catchInterrupt
def server_soup(count=25):
    """Fetch and display images from view.
    Args::
        count: number of iterations to fetch images, count<0
        is infinite
    """
    while count != 0:
        count -= 1
        view_tuples = _view.fetch_tuples(max_tuples=100, timeout=2)
        for soup_tuple in view_tuples:
            status_widget.value = soup_tuple['title']
            display_image(soup_tuple, image_region=image_widget, title_region=title_widget, url_region=url_widget)

server_soup()

## Cancel jobs when you're done

In [None]:
list_jobs(instance, cancel=True)

# Notebook wrap up
This notebook composed and deployed a Streams application that processes live Wikipedia events on a server. It 
extended the previous application to extract images associated with the event. When an event
does have an associated image, it is pushed out to a view where it is rendered. 


In the next notebook we will continue the build out: using the extracted image, we'll apply AI image processing to extract faces and score them. 
