# Reading Data

> This notebook retrieves all necessary data from the API and stores it in JSON files for visualization in downstream tasks. You can find the visualization tasks in `visualize.ipynb`.

In [346]:
# These are standard python modules
import json, time, urllib.parse, collections

# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

import pandas as pd

### Constants 

> These variables stay the same for all functions used in this notebook. If you choose to reproduce or build on this work, you may need to change a few of them: `REQUEST_HEADERS`, `ARTICLE_PAGEVIEWS_PARAMS_DESKTOP`, `ARTICLE_PAGEVIEWS_PARAMS_MOBILE_APP`, and `ARTICLE_PAGEVIEWS_PARAMS_MOBILE_WEB`. Change the `User-Agent` to your own name and affiliation, and change the `start` and `end` keys to the dates you would like to access.

In [347]:
# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include a "unique ID" that will allow them to
# contact you if something happens - such as - your code exceeding request limits - or some other error happens
REQUEST_HEADERS = {
    'User-Agent': '<braj1@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_DESKTOP = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",
    "agent":       "user",
    "article":     "",
    "granularity": "monthly",
    "start":       "2015070100",  # July 1st Hour 00 of 2015
    "end":         "2022090200"   # Sept 2nd Hour 00 of 2022
}

ARTICLE_PAGEVIEWS_PARAMS_MOBILE_APP = {
    "project":     "en.wikipedia.org",
    "access":      "mobile-app",
    "agent":       "user",
    "article":     "",
    "granularity": "monthly",
    "start":       "2015070100",  # July 1st Hour 00 of 2015
    "end":         "2022090200"   # Sept 2nd Hour 00 of 2022
}

ARTICLE_PAGEVIEWS_PARAMS_MOBILE_WEB = {
    "project":     "en.wikipedia.org",
    "access":      "mobile-web",
    "agent":       "user",
    "article":     "",
    "granularity": "monthly",
    "start":       "2015070100",  # July 1st Hour 00 of 2015
    "end":         "2022090200"   # Sept 2nd Hour 00 of 2022
}
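
> To make the role of these constants concrete, here is a minimal sketch of how a parameter template and the format string combine into a request URL. The helper name `build_pageviews_url` is hypothetical and not part of this notebook; the sketch also assumes that copying the template (rather than editing it in place) and encoding every reserved character in the title are acceptable choices.

```python
import urllib.parse

# Same endpoint and parameter pattern as the constants above
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
pattern = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

def build_pageviews_url(title, template):
    """Hypothetical helper: fill the pattern without mutating the shared template."""
    # Spaces become underscores, then the title is URL encoded
    encoded = urllib.parse.quote(title.replace(' ', '_'), safe='')
    # dict(template, article=...) returns a copy with only 'article' replaced
    params = dict(template, article=encoded)
    return endpoint + pattern.format(**params)

desktop_template = {
    "project": "en.wikipedia.org", "access": "desktop", "agent": "user",
    "article": "", "granularity": "monthly",
    "start": "2015070100", "end": "2022090200",
}

url = build_pageviews_url('Coelosaurus antiquus', desktop_template)
print(url)
```

> Because the template is copied, the same dictionary can be reused for every article without carrying over the previous title.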

### Functions

> These are the functions we use for making requests to the API. The function `request_pageviews_per_article` queries the endpoint for the monthly views of a single article.


In [350]:
def request_pageviews_per_article(article_title = None, 
                                  request_type = None,
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  headers = REQUEST_HEADERS) -> dict:
    '''Retrieves the JSON response (parsed into a dict) from the Wikimedia Pageviews API given the title of the article and an access type (parameter templates are saved in the constants above)'''
    # Make sure we have a valid access type
    if not request_type or request_type.lower() not in ['desktop', 'mobile-app', 'mobile-web']:
        print('Please pass in request_type as desktop, mobile-app, or mobile-web')
        return None
    
    if request_type.lower() == 'mobile-web':
        request_template = ARTICLE_PAGEVIEWS_PARAMS_MOBILE_WEB
    elif request_type.lower() == 'mobile-app':
        request_template = ARTICLE_PAGEVIEWS_PARAMS_MOBILE_APP
    else:
        request_template = ARTICLE_PAGEVIEWS_PARAMS_DESKTOP
    
    # Make sure we have an article title
    if not article_title: return None
    
    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    # (safe='' also encodes "/" so a title cannot be split into extra path segments)
    article_title_encoded = urllib.parse.quote(article_title.replace(' ','_'), safe='')
    # Work on a copy so the module-level template constant is not mutated between calls
    request_template = dict(request_template, article=article_title_encoded)
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

def get_dinosaur_article_names(input_csv_file_location) -> list:
    '''Read in a csv to get all dinosaur article names'''
    # Make sure we are actually passing in an input csv
    if not input_csv_file_location: return None
    
    df = pd.read_csv(input_csv_file_location)
    
    if 'name' not in df.columns:
        return None # could not find name of articles in that input csv.
    
    names = list(df['name'].values) # convert to list of names from numpy array
                                             
    return names

def generate_json_for_articles(article_names, request_type, save_name) -> str:
    ''' 
    Requests monthly pageviews for every article name, writes the result to
    <save_name>.json, and returns the JSON string. Accepts three request types:
    desktop, mobile (mobile-app and mobile-web summed), and cumulative (all three summed).
    '''
    request_type = (request_type or '').lower()
    if request_type not in ['desktop', 'mobile', 'cumulative']:
        print('Please pass in request_type as desktop, mobile, or cumulative')
        return None
    
    per_article_map = {}
    
    for name in article_names:
        try:
            if request_type == 'desktop':
                desktop_views = request_pageviews_per_article(article_title=name, request_type='desktop')
                for item in desktop_views['items']:
                    del item['access']

                per_article_map[name] = desktop_views['items']
            elif request_type == 'mobile':
                app_views = request_pageviews_per_article(article_title=name, request_type='mobile-app')
                web_views = request_pageviews_per_article(article_title=name, request_type='mobile-web')
                summed_views = _sum_article_views(name_of_article=name, type_views=[web_views, app_views])
                # Skip articles whose views could not be summed, so no null entries are written
                if summed_views is not None:
                    per_article_map[name] = summed_views
            else:
                desktop_views = request_pageviews_per_article(article_title=name, request_type='desktop')
                app_views = request_pageviews_per_article(article_title=name, request_type='mobile-app')
                web_views = request_pageviews_per_article(article_title=name, request_type='mobile-web')
                summed_views = _sum_article_views(name_of_article=name, type_views=[desktop_views, app_views, web_views])
                if summed_views is not None:
                    per_article_map[name] = summed_views
        except Exception as e:
            print(f'Error at article: {name}, error; {e}')
    
    article_json_obj = json.dumps(per_article_map, indent = 4) 
    
    with open(f"{save_name}.json", "w+") as outfile:
        outfile.write(article_json_obj)
        
    print(f'Wrote JSON to file: {save_name}.json')
    return article_json_obj

def _sum_article_views(name_of_article, type_views=None):
    '''Helper function that sums an article's monthly views across several access types (one API response per access type)'''
    if not type_views:
        print('Please pass in JSON objects of num views')
        return None
    try:
        combined_view = collections.defaultdict(int)
        
        for views in type_views:
            for view in views['items']:
                ts, num_views = view['timestamp'], view['views']
                combined_view[ts] += num_views

        result_output = []

        for ts,num_views in combined_view.items():
            result_output.append({
                'project': 'en.wikipedia',
                'article': name_of_article,
                'granularity': 'monthly',
                'timestamp': ts,
                'agent': 'user',
                'views': num_views
            })
            
        return result_output
    except Exception as e:
        print(f'error with article: {name_of_article} with error : {e}')
        return None
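
> As an illustration of the summing step, here is a minimal sketch that adds up views per timestamp across two responses. The function name `sum_views_by_timestamp` is hypothetical and the view counts are made up; only the `items`, `timestamp`, and `views` keys are taken from the code above.

```python
import collections

def sum_views_by_timestamp(responses):
    """Hypothetical stand-in for the summing step: add up 'views' per timestamp."""
    totals = collections.defaultdict(int)
    for response in responses:
        for item in response['items']:
            totals[item['timestamp']] += item['views']
    return dict(totals)

# Made-up numbers, shaped like the 'items' list used by the functions above
mobile_app = {'items': [{'timestamp': '2015070100', 'views': 10},
                        {'timestamp': '2015080100', 'views': 20}]}
mobile_web = {'items': [{'timestamp': '2015070100', 'views': 5},
                        {'timestamp': '2015080100', 'views': 7}]}

combined = sum_views_by_timestamp([mobile_app, mobile_web])
print(combined)  # {'2015070100': 15, '2015080100': 27}
```

> A response without an `items` key would raise a `KeyError` here, which is why the helper above wraps the same logic in a `try`/`except`.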
    
        

### Get the output for each name within the dinosaur csv

In [351]:
article_names = get_dinosaur_article_names('./dinosaur_genera.cleaned.SEPT.2022 - dinosaur_genera.cleaned.SEPT.2022.csv.csv')
article_names[:10]

['Coelosaurus antiquus',
 'Aachenosaurus',
 'Aardonyx',
 'Abdarainurus',
 'Abditosaurus',
 'Abelisaurus',
 'Abrictosaurus',
 'Abrosaurus',
 'Abydosaurus',
 'Acantholipan']

### Make Requests to Wikipedia to generate JSON files

> Here we make requests to the Wikipedia API using the `generate_json_for_articles` function. For the `mobile` and `cumulative` request types, it sums the JSON responses for each article across access types using the `_sum_article_views` helper function. Finally, it saves the result into the JSON file whose name we pass in. Below, we pass in one name per request type: `dino_monthly_mobile_<201507>-<202209>`, `dino_monthly_desktop_<201507>-<202209>`, and `dino_monthly_cumulative_<201507>-<202209>`.
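
> For orientation, here is a minimal sketch of the shape of a saved file and how it could be read back. The record values and the file name `example_views.json` are made up for illustration; the assumed structure (article name mapped to a list of monthly records) follows the functions above.

```python
import json
import os
import tempfile

# Made-up data in the shape written by generate_json_for_articles:
# article name -> list of monthly records
per_article = {
    'Aardonyx': [
        {'project': 'en.wikipedia', 'article': 'Aardonyx', 'granularity': 'monthly',
         'timestamp': '2015070100', 'agent': 'user', 'views': 15},
        {'project': 'en.wikipedia', 'article': 'Aardonyx', 'granularity': 'monthly',
         'timestamp': '2015080100', 'agent': 'user', 'views': 27},
    ],
}

# Write to a temporary directory so the sketch does not touch real data files
path = os.path.join(tempfile.mkdtemp(), 'example_views.json')
with open(path, 'w') as outfile:
    outfile.write(json.dumps(per_article, indent=4))

# Read the file back and total the views per article
with open(path) as infile:
    loaded = json.load(infile)

total_views = {name: sum(record['views'] for record in records)
               for name, records in loaded.items()}
print(total_views)  # {'Aardonyx': 42}
```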

In [339]:
mobile_articles = generate_json_for_articles(article_names, 'mobile', 'data/dino_monthly_mobile_<201507>-<202209>')

error with article: Elemgasem with error : 'items'
error with article: Tuebingosaurus with error : 'items'
Wrote JSON to file: dino_monthly_mobile_<201507>-<202209>.json


In [340]:
desktop_articles = generate_json_for_articles(article_names, 'desktop', 'data/dino_monthly_desktop_<201507>-<202209>')

Error at article: Elemgasem, error; 'items'
Error at article: Tuebingosaurus, error; 'items'
Wrote JSON to file: dino_monthly_desktop_<201507>-<202209>.json


In [345]:
cumulative_articles = generate_json_for_articles(article_names, 'cumulative', 'data/dino_monthly_cumulative_<201507>-<202209>')

error with article: Elemgasem with error : 'items'
error with article: Tuebingosaurus with error : 'items'
Wrote JSON to file: dino_monthly_cumulative_<201507>-<202209>.json


> It looks like the API responses for Elemgasem and Tuebingosaurus did not contain an `items` field (hence the `'items'` errors above), so these two articles will have no pageview data in our output.
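
> One way to handle such responses explicitly is to check for the `items` key before using it. This is a minimal sketch, not the approach used in this notebook: the helper name `extract_items` is hypothetical, and the shape of the error body is an assumption made only to show a response without `items`.

```python
def extract_items(response):
    """Hypothetical guard: return the 'items' list, or None when it is absent."""
    if not isinstance(response, dict) or 'items' not in response:
        return None
    return response['items']

# Made-up successful response
ok_response = {'items': [{'timestamp': '2015070100', 'views': 3}]}
# Assumed shape of an error body; the real fields may differ
error_response = {'title': 'Not found.', 'detail': 'made-up error text'}

for label, response in [('ok', ok_response), ('error', error_response), ('none', None)]:
    items = extract_items(response)
    if items is None:
        print(f'{label}: no pageview data, skipping')
    else:
        print(f'{label}: {len(items)} monthly record(s)')
```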