# Monthly Wikipedia article views of famous movies
This notebook was used to publish datasets of monthly article traffic for a select set of movies on English Wikipedia from July 1, 2015 through September 30, 2023. It also includes some exploratory analysis in the form of visualizations.


## Source of data
The page view data is sourced using the [Wikimedia REST API](https://www.mediawiki.org/wiki/Wikimedia_REST_API). The API documentation, [pageviews/per-article](https://wikimedia.org/api/rest_v1/#/Pageviews%20data), covers additional details that may be helpful when trying to understand how to retrieve monthly page views per article.

The list of articles represents 1359 Academy Award-winning movies.

## License
Step 1a and Step 1b were developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - August 14, 2023


## Implementation
Let us dive into the implementation.

### Step 1: Data Acquisition

In [1]:

# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

# matplotlib modules for data visualization (not part of the standard library, install with pip/pip3 if needed)
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

### Step 1a: Defining the parameters used to request data from the Wikimedia API

In [2]:
#########
#
#    CONSTANTS
#

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error 
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Everything Everywhere All at Once','Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example, below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023090100"    # end date
}
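
To make the template concrete, here is a minimal sketch of how the format string and the parameter dictionary combine into a single request URL. The endpoint, format string, and start/end values are copied from the cell above. The helper name `build_pageviews_url` and the use of `safe=''` (so that a `/` inside a title is percent-encoded as well) are assumptions for illustration, not part of the original notebook.

```python
import urllib.parse

ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

def build_pageviews_url(article_title, access='desktop',
                        start='2015010100', end='2023090100'):
    """Hypothetical helper: fill the per-article template for one title."""
    values = {
        'project': 'en.wikipedia.org',
        'access': access,
        'agent': 'user',
        # spaces become underscores, then the title is percent-encoded
        'article': urllib.parse.quote(article_title.replace(' ', '_'), safe=''),
        'granularity': 'monthly',
        'start': start,
        'end': end,
    }
    return ENDPOINT + PARAMS.format(**values)

url = build_pageviews_url('The Whale (2022 film)')
print(url)
```

Only the `article` and `access` fields change between requests in this notebook, which is why the remaining fields can live in a shared template.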


### Step 1b: Defining a function that can be reused to request data from the API. The function returns pageviews per article.

In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageviews_per_article(article_title = None, 
                                  access_method='desktop',
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):
    
    # work on a copy so that the shared module-level template is not modified between calls
    request_template = dict(request_template)
    request_template['access'] = access_method

    # article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['article'] = article_title

    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded;
    # safe='' also encodes "/" so that a title containing a slash stays a single path segment
    article_title_encoded = urllib.parse.quote(request_template['article'].replace(' ','_'), safe='')
    request_template['article'] = article_title_encoded
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
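
The function returns `None` when the request raises an exception, and a response for a missing article may be a JSON body without an `items` key. A minimal sketch of a guard that callers could apply before indexing into the result follows; the helper name `extract_items` is hypothetical and not part of the original code.

```python
def extract_items(json_response):
    """Hypothetical helper: return the list of monthly records, or [] when unavailable."""
    if not isinstance(json_response, dict):
        return []          # covers None (failed request) and unexpected types
    items = json_response.get('items')
    return items if isinstance(items, list) else []

# Usage with stand-in responses (no network access needed)
ok_response = {'items': [{'timestamp': '2023010100', 'views': 12}]}
error_response = {'title': 'Not found.'}   # assumed shape of an error body

print(len(extract_items(ok_response)))     # one record
print(extract_items(error_response))       # empty list
print(extract_items(None))                 # empty list
```

Returning an empty list keeps downstream loops simple, at the cost of hiding which articles failed, so logging the title alongside is worth considering.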


### Step 1c: Loading data of Academy award winning movies from a CSV

In [4]:
# Academy award winning movies in a dataframe

import pandas as pd
famous_movies_df=pd.read_csv('/Users/sayo/Documents/Projects/Repos/DATA-512/sources/thank_the_academy.AUG.2023.csv',
                             header=0
                             )
print('Shape ',famous_movies_df.shape)
famous_movies_df['name'][2]

Shape  (1358, 2)


'The Whale (2022 film)'

### Step 1d: Iterating through all movie names to get their page views for DESKTOP access

In [None]:
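formatted_view={}
for article_title in famous_movies_df['name']:
    print("Getting pageview data for: ",article_title)
    views = request_pageviews_per_article(article_title)
    # skip articles where the request failed or the response has no 'items' (e.g. an error body)
    if not views or 'items' not in views:
        print(f"No pageview data returned for: {article_title}")
        continue
    formatted_view[article_title]=views['items']
    print(f"Collected {len(views['items'])} months of data for: {article_title}")
# write once in 'w' mode: appending to an existing file would produce invalid JSON on a re-run
with open('academy_monthly_desktop.json','w' ) as file:
    json.dump(formatted_view,file, indent=4)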

### Checking that the file produced is valid JSON

In [None]:


try:
  with open('academy_monthly_desktop.json', "r") as read_content: 
    json.load(read_content)
except json.JSONDecodeError:
  print("The JSON string is not valid")
else:
  print("The JSON string is valid")
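
One failure this check can catch is a file that has been written more than once in append mode, which leaves two JSON documents back to back. The sketch below reproduces that situation in a temporary directory; the file name `demo.json` and the helper names are made up for the demonstration.

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.json')

def dump(mode):
    """Write one small JSON object using the given file mode."""
    with open(path, mode) as fh:
        json.dump({'views': 1}, fh)

def is_valid_json(file_path):
    """Return True when the whole file parses as a single JSON document."""
    try:
        with open(file_path) as fh:
            json.load(fh)
    except json.JSONDecodeError:
        return False
    return True

dump('w')
dump('w')                                   # overwriting keeps a single document
valid_after_overwrite = is_valid_json(path)

dump('a')                                   # appending adds a second document
valid_after_append = is_valid_json(path)

print(valid_after_overwrite, valid_after_append)   # True False
```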

### Step 1e: Iterating through all movie names to get their page views for MOBILE access (mobile-app and mobile-web combined)

In [None]:
formatted_view_mob={}
for i in range(len(famous_movies_df)):
    article_title_mob=famous_movies_df['name'][i]
    print("Getting pageview data for: ",article_title_mob)
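    views_mob_app = request_pageviews_per_article(article_title_mob,access_method='mobile-app')
    views_mob_web = request_pageviews_per_article(article_title_mob,access_method='mobile-web')
    # skip articles where either request failed or returned no 'items'
    if any(not r or 'items' not in r for r in (views_mob_app, views_mob_web)):
        print("No pageview data returned for: ",article_title_mob)
        continue
    views_mob_app=views_mob_app['items']
    views_mob_web=views_mob_web['items']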
    consolidated_page_view=[]
    for x in range(len(views_mob_app)):
        total_views=views_mob_app[x]['views']+views_mob_web[x]['views']
        consolidated_page_view.append({
                                "project": views_mob_app[x]['project'],
                                "article": views_mob_app[x]['article'],
                                "granularity": views_mob_app[x]['granularity'],
                                "timestamp": views_mob_app[x]['timestamp'],
                                "agent": views_mob_app[x]['agent'],
                                "views": total_views
                                })
    formatted_view_mob[article_title_mob]=(consolidated_page_view)
# 'w' mode overwrites any earlier output; appending would leave invalid JSON after a re-run
with open('academy_monthly_mobile2.json','w' ) as file:
    json.dump(formatted_view_mob,file, indent=4)
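
The loop above adds the mobile-app and mobile-web series position by position, which assumes both responses cover exactly the same months in the same order. A minimal sketch of an alternative that matches records on their `timestamp` instead is shown here; the helper name `sum_views_by_timestamp` is hypothetical.

```python
def sum_views_by_timestamp(*series):
    """Hypothetical helper: add 'views' across series, matching on 'timestamp'."""
    merged = {}
    for records in series:
        for rec in records:
            ts = rec['timestamp']
            if ts not in merged:
                merged[ts] = dict(rec, views=0)   # copy the metadata, reset the count
                merged[ts].pop('access', None)    # a combined total has no single access type
            merged[ts]['views'] += rec['views']
    # timestamps are fixed-width YYYYMMDDHH strings, so string order is time order
    return [merged[ts] for ts in sorted(merged)]

app = [{'timestamp': '2023010100', 'views': 5, 'access': 'mobile-app'},
       {'timestamp': '2023020100', 'views': 7, 'access': 'mobile-app'}]
web = [{'timestamp': '2023020100', 'views': 10, 'access': 'mobile-web'}]

combined = sum_views_by_timestamp(app, web)
print(combined)
```

With this approach a month that is missing from one series is kept with the views of the other, rather than raising an `IndexError` or pairing the wrong months.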


In [None]:
# unit testing
try:
  with open('academy_monthly_mobile2.json', "r") as read_content: 
    json.load(read_content)
except json.JSONDecodeError:
  print("The JSON string is not valid")
else:
  print("The JSON string is valid")

### Step 1f: Iterating through all movie names to get their cumulative page views (MOBILE + DESKTOP)

In [None]:
formatted_view_all={}
for i in range(len(famous_movies_df)):
    article_title_all=famous_movies_df['name'][i]
    print("Getting pageview data for: ",article_title_all)
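    views_mob_app = request_pageviews_per_article(article_title_all,access_method='mobile-app')
    views_mob_web = request_pageviews_per_article(article_title_all,access_method='mobile-web')
    views_mob_desktop = request_pageviews_per_article(article_title_all,access_method='desktop')
    # skip articles where any request failed or returned no 'items'
    if any(not r or 'items' not in r for r in (views_mob_app, views_mob_web, views_mob_desktop)):
        print("No pageview data returned for: ",article_title_all)
        continue
    views_mob_app=views_mob_app['items']
    views_mob_web=views_mob_web['items']
    views_mob_desktop=views_mob_desktop['items']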
    consolidated_page_view=[]
    for x in range(len(views_mob_app)):
        total_views=views_mob_app[x]['views']+views_mob_web[x]['views']+views_mob_desktop[x]['views']
        consolidated_page_view.append({
                                "project": views_mob_app[x]['project'],
                                "article": views_mob_app[x]['article'],
                                "granularity": views_mob_app[x]['granularity'],
                                "timestamp": views_mob_app[x]['timestamp'],
                                "agent": views_mob_app[x]['agent'],
                                "views": total_views
                                })
    formatted_view_all[article_title_all]=(consolidated_page_view)
# 'w' mode overwrites any earlier output; appending would leave invalid JSON after a re-run
with open('academy_monthly_cumulative2.json','w' ) as file:
    json.dump(formatted_view_all,file, indent=4)

In [None]:
try:
  with open('academy_monthly_cumulative2.json', "r") as read_content: 
    json.load(read_content)
except json.JSONDecodeError:
  print("The JSON string is not valid")
else:
  print("The JSON string is valid")

### Step 2: Analysis

### Step 2a: Transforming JSON dataset to pandas dataframe for easier Exploratory data analysis

In [None]:

## convert json to pandas DF for easier EDA -DESKTOP
with open('academy_monthly_desktop.json', "r") as read_content: 
    data_json=json.load(read_content)

dataframes = []
for key, value in data_json.items():
    df = pd.DataFrame(value) 
    df['article']=key
    dataframes.append(df)


monthly_desktop_df = pd.concat(dataframes, ignore_index=True)
monthly_desktop_df.head(10)   # preview the first 10 rows


In [None]:
## convert json to pandas DF for easier EDA -MOBILE

with open('academy_monthly_mobile2.json', "r") as read_content2: 
    data_json_mob=json.load(read_content2)

dataframes_mob = []
for key1, value1 in data_json_mob.items():
    df_mob = pd.DataFrame(value1) 
    df_mob['article']=key1
    dataframes_mob.append(df_mob)


monthly_mobile_df = pd.concat(dataframes_mob, ignore_index=True)
monthly_mobile_df.head(10)   # preview the first 10 rows


In [None]:
## convert json to pandas DF for easier EDA - ALL(CUMULATIVE)

with open('academy_monthly_cumulative2.json', "r") as read_content3: 
    data_json_all=json.load(read_content3)

dataframes_all = []
for key2, value2 in data_json_all.items():
    df_all = pd.DataFrame(value2) 
    df_all['article']=key2
    dataframes_all.append(df_all)


monthly_all_df = pd.concat(dataframes_all, ignore_index=True)
monthly_all_df.head(10)   # preview the first 10 rows
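
The three cells above repeat the same flattening step for each file. As a sketch, the step can be written once as a function that turns the `{title: [records]}` structure into a flat list of rows; the name `flatten_pageviews` is hypothetical, and the example uses only the standard library so that it runs without the data files.

```python
def flatten_pageviews(data_json):
    """Hypothetical helper: turn {title: [monthly records]} into one flat list of rows."""
    rows = []
    for title, records in data_json.items():
        for rec in records:
            row = dict(rec)
            row['article'] = title   # keep the readable title used as the dictionary key
            rows.append(row)
    return rows

sample = {
    'Film A': [{'article': 'Film_A', 'timestamp': '2023010100', 'views': 3},
               {'article': 'Film_A', 'timestamp': '2023020100', 'views': 4}],
    'Film B': [{'article': 'Film_B', 'timestamp': '2023010100', 'views': 9}],
}

rows = flatten_pageviews(sample)
print(len(rows), rows[0]['article'])
```

Passing the result to `pd.DataFrame(rows)` should give a long-form table comparable to the ones built above, with one row per article and month.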


### Step 2b: Visual analysis

#### Maximum Average and Minimum Average
The first graph shows time series for the articles with the highest and the lowest average monthly page requests, for desktop access and for mobile access.

In [None]:


average_views = monthly_desktop_df.groupby('article')['views'].mean().reset_index()
average_views_mob=monthly_mobile_df.groupby('article')['views'].mean().reset_index()

# Sort articles based on average views for desktop
sorted_articles = average_views.sort_values(by='views', ascending=False)

# Convert 'timestamp' column to datetime objects
monthly_desktop_df['timestamp'] = pd.to_datetime(monthly_desktop_df['timestamp'], format='%Y%m%d%H')
monthly_mobile_df['timestamp'] = pd.to_datetime(monthly_mobile_df['timestamp'], format='%Y%m%d%H')

# Sort articles based on average views for mobile
sorted_articles_mob = average_views_mob.sort_values(by='views', ascending=False)

# Select articles with highest and lowest average views for desktop
articles_highest_avg = sorted_articles.head(1)['article'].values[0]
articles_lowest_avg = sorted_articles.tail(1)['article'].values[0]

# Select articles with highest and lowest average views for mobile
articles_highest_avg_mob = sorted_articles_mob.head(1)['article'].values[0]
articles_lowest_avg_mob = sorted_articles_mob.tail(1)['article'].values[0]

# Filter DataFrame for selected articles for desktop, ordered by time so dates and views stay paired
highest_avg_df = monthly_desktop_df[monthly_desktop_df['article'] == articles_highest_avg].sort_values('timestamp')
lowest_avg_df = monthly_desktop_df[monthly_desktop_df['article'] == articles_lowest_avg].sort_values('timestamp')

# Filter DataFrame for selected articles for mobile, ordered by time so dates and views stay paired
highest_avg_df_mob = monthly_mobile_df[monthly_mobile_df['article'] == articles_highest_avg_mob].sort_values('timestamp')
lowest_avg_df_mob = monthly_mobile_df[monthly_mobile_df['article'] == articles_lowest_avg_mob].sort_values('timestamp')

# Plot time series data for articles with highest and lowest average views
fig,ax=plt.subplots(figsize=(20, 10))

#plot desktop high /low
plt.plot(highest_avg_df['timestamp'], highest_avg_df['views'], label=f'Highest Avg for Desktop: {articles_highest_avg}')
plt.plot(lowest_avg_df['timestamp'], lowest_avg_df['views'], label=f'Lowest Avg for Desktop: {articles_lowest_avg}')
#plot mobile high/low
plt.plot(highest_avg_df_mob['timestamp'], highest_avg_df_mob['views'], label=f'Highest Avg for Mobile: {articles_highest_avg_mob}')
plt.plot(lowest_avg_df_mob['timestamp'], lowest_avg_df_mob['views'], label=f'Lowest Avg for Mobile: {articles_lowest_avg_mob}')

half_year_locator = mdates.MonthLocator(interval=6)
year_month_formatter = mdates.DateFormatter("%Y-%m") # four digits for year, two for month

ax.xaxis.set_major_locator(half_year_locator)
ax.xaxis.set_major_formatter(year_month_formatter) # formatter for major axis only

plt.xlabel('Timestamp')
plt.ylabel('Page Requests')
plt.title('Page Requests Over Time')
plt.xticks(rotation=45)
plt.legend()
plt.ticklabel_format(axis='y', style='plain')
plt.show()
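
When a time series is plotted, each date has to stay paired with its own view count. The sketch below parses the API's `YYYYMMDDHH` timestamps with the standard library and sorts dates and views together; the records and the helper name `to_sorted_series` are made up for illustration.

```python
from datetime import datetime

def to_sorted_series(records):
    """Hypothetical helper: parse timestamps and sort (date, views) pairs together."""
    pairs = [(datetime.strptime(r['timestamp'], '%Y%m%d%H'), r['views']) for r in records]
    pairs.sort(key=lambda pair: pair[0])   # order by date, views travel with their date
    dates = [d for d, _ in pairs]
    views = [v for _, v in pairs]
    return dates, views

# Deliberately out of order to show that the pairing survives the sort
records = [
    {'timestamp': '2023030100', 'views': 30},
    {'timestamp': '2023010100', 'views': 10},
    {'timestamp': '2023020100', 'views': 20},
]

dates, views = to_sorted_series(records)
print([d.strftime('%Y-%m') for d in dates], views)
```

Sorting only the dates, while leaving the views in their original order, would attach the March count to January in this example.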


#### Top 10 Peak Page Views
The second graph shows time series for the top 10 articles by largest (peak) page views over the entire period, by access type. The peak of an article is the single month with its highest page views, and the articles are ordered by these peak values. The graph contains the top 10 for desktop and the top 10 for mobile access (20 lines).

In [None]:
# Find the peak (highest single-month) page views for each article
peak_views_desktop = monthly_desktop_df.groupby('article')['views'].max().reset_index()
peak_views_mobile = monthly_mobile_df.groupby('article')['views'].max().reset_index()

# Sort articles based on peak views and select the top 10
top_10_articles_desktop = peak_views_desktop.sort_values(by='views', ascending=False).head(10)
top_10_articles_mobile = peak_views_mobile.sort_values(by='views', ascending=False).head(10)

# Filter each DataFrame for its own top 10 articles
top_10_df_desktop = monthly_desktop_df[monthly_desktop_df['article'].isin(top_10_articles_desktop['article'])]
top_10_df_mobile = monthly_mobile_df[monthly_mobile_df['article'].isin(top_10_articles_mobile['article'])]

# Plot time series data for top 10 articles
fig,ax=plt.subplots(figsize=(20, 10))

for article in top_10_articles_desktop['article']:
    article_data = top_10_df_desktop[top_10_df_desktop['article'] == article]
    plt.plot(article_data['timestamp'], article_data['views'],'o-', label='Desktop: '+article)

for article in top_10_articles_mobile['article']:
    article_data_mobile = top_10_df_mobile[top_10_df_mobile['article'] == article]
    plt.plot(article_data_mobile['timestamp'], article_data_mobile['views'],'o-', label='Mobile: '+article)

half_year_locator = mdates.MonthLocator(interval=6)
year_month_formatter = mdates.DateFormatter("%Y-%m") # four digits for year, two for month

ax.xaxis.set_major_locator(half_year_locator)
ax.xaxis.set_major_formatter(year_month_formatter) # formatter for major axis only

plt.xlabel('Timestamp')
plt.ylabel('Page Requests')
plt.title('Top 10 Articles by Page Views Over Time')
plt.xticks(rotation=45)
plt.legend()
plt.ticklabel_format(axis='y', style='plain')
plt.show()
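
Ranking by peak and ranking by total can pick different articles, so it matters which aggregate is used. The toy example below, with invented titles and numbers, is a sketch of that difference; `rank_articles` is a hypothetical helper, and in pandas the two rankings correspond to `groupby(...).max()` and `groupby(...).sum()`.

```python
def rank_articles(series_by_article, score, top_n=10):
    """Hypothetical helper: rank article titles by score(views), highest first."""
    scores = {title: score(views) for title, views in series_by_article.items() if views}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Invented monthly view counts for three articles
toy = {
    'Film A': [100, 100, 100, 100],   # steady traffic: total 400, peak 100
    'Film B': [5, 350, 5, 5],         # one spike: total 365, peak 350
    'Film C': [1, 2, 3, 4],           # low traffic throughout
}

by_total = rank_articles(toy, sum, top_n=2)
by_peak = rank_articles(toy, max, top_n=2)
print(by_total)   # ['Film A', 'Film B']
print(by_peak)    # ['Film B', 'Film A']
```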

#### Fewest Months of Data

The third graph shows the pages that have the fewest months of available data. These are all relatively short time series and are expected to include the most recent Academy Award winners. The graph shows the 10 articles with the fewest months of data for desktop access and the 10 articles with the fewest months of data for mobile access.

In [None]:
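# Count the months of available data per article, for each access type
total_data_available_desktop = monthly_desktop_df.groupby('article')['timestamp'].nunique().reset_index()
total_data_available_mobile = monthly_mobile_df.groupby('article')['timestamp'].nunique().reset_index()

# Keep the 10 articles with the fewest months of data
least_available_10articles_desktop=total_data_available_desktop.sort_values(by='timestamp',ascending=True).head(10)
least_available_10articles_mobile=total_data_available_mobile.sort_values(by='timestamp',ascending=True).head(10)

least_10_df_desktop = monthly_desktop_df[monthly_desktop_df['article'].isin(least_available_10articles_desktop['article'])]
least_10_df_mobile = monthly_mobile_df[monthly_mobile_df['article'].isin(least_available_10articles_mobile['article'])]

# Plot time series data for the 10 articles with least available data, desktop and mobile
fig2,ax2=plt.subplots(figsize=(20, 10))

for article in least_available_10articles_desktop['article']:
    article_data_least = least_10_df_desktop[least_10_df_desktop['article'] == article]
    plt.plot(article_data_least['timestamp'], article_data_least['views'],label='Desktop: '+article)

for article in least_available_10articles_mobile['article']:
    article_data_least_mob = least_10_df_mobile[least_10_df_mobile['article'] == article]
    plt.plot(article_data_least_mob['timestamp'], article_data_least_mob['views'],label='Mobile: '+article)

plt.xlabel('Timestamp')
plt.ylabel('Page Views')
plt.title('10 Articles with the Fewest Months of Data')
plt.xticks(rotation=45)
plt.legend()
plt.ticklabel_format(axis='y', style='plain')
plt.show()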
