# Analysis of Wikipedia article traffic for rare diseases

This notebook contains the code for the analysis of Wikipedia article traffic for a list of rare diseases as specified by the [National Organization for Rare Disorders (NORD)](https://rarediseases.org). The collected list can be found in the CSV file in the repository associated with this notebook ([link to the CSV file](https://github.com/Chakita/data-512-homework_1/blob/master/rare-disease_cleaned.AUG.2024.csv)).

The aim of this analysis is to visualize the desktop and mobile traffic trends for these rare diseases and try to unearth any patterns within the data.  

To gather the traffic data, we use the Wikimedia Analytics Pageviews API ([documentation](https://doc.wikimedia.org/generated-data-platform/aqs/analytics-api/reference/page-views.html)). The Pageviews API provides desktop, mobile web, and mobile app traffic data starting from July 2015 through the previous complete month. At the time of writing, this amounts to up to 111 months of data per article. Articles created after July 2015 will have fewer months of data. The extracted data is stored in JSON files ordered using the article title as the key.
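
The month count quoted above can be checked with a little date arithmetic. The following is a minimal sketch; the helper name `months_available` is hypothetical and not part of the notebook, and the date passed in is only an example of a day in October 2024.

```python
from datetime import date

def months_available(today, first_year=2015, first_month=7):
    """Count the complete months from July 2015 up to the month before `today`."""
    # Step back to the last complete month (the month before `today`)
    last_year, last_month = today.year, today.month - 1
    if last_month == 0:
        last_year, last_month = last_year - 1, 12
    # Inclusive count of months between the first and last month
    return (last_year - first_year) * 12 + (last_month - first_month) + 1

# Any day in October 2024: the data runs from July 2015 through September 2024
print(months_available(date(2024, 10, 15)))
```

An article created after July 2015 would return fewer records than this upper bound.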

There are 3 JSON files containing data corresponding to desktop access, mobile access and cumulative (sum total of desktop and mobile) accesses. The JSON files are named in the following format: ```rare-disease_monthly_<access_type>_<startYYYYMM>-<endYYYYMM>.json ```
where access type can be one of mobile, desktop or cumulative.
The JSON files produced for the time period July 1, 2015 to Oct 1, 2024 can be found [here](https://github.com/Chakita/data-512-homework_1) in the repository associated with this notebook.
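
As an illustration of the naming convention, the sketch below builds a file name from its three parts. The function `output_filename` is a hypothetical helper introduced here for clarity; the notebook itself builds the names inline with f-strings.

```python
def output_filename(access_type, start_yyyymm, end_yyyymm):
    """Build a file name following the convention described above."""
    allowed = {"mobile", "desktop", "cumulative"}
    if access_type not in allowed:
        raise ValueError(f"access_type must be one of {sorted(allowed)}")
    return f"rare-disease_monthly_{access_type}_{start_yyyymm}-{end_yyyymm}.json"

print(output_filename("desktop", "201507", "202410"))
```

Validating the access type up front is a design choice of this sketch; it guards against a typo silently producing a fourth, unexpected file.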

The data is utilised to visualize:

1. <b> Maximum Average and Minimum Average </b> - Articles that have the highest average page requests and the lowest average page requests for desktop access and mobile access over the entire time series.

2. <b> Top 10 Peak Page Views </b> -  Top 10 article pages by largest (peak) page views over the entire time series by access type (top 10 for desktop access + top 10 for mobile access).

3. <b> Fewest Months of Data </b> -  The 10 articles with the fewest months of data for desktop access and the 10 articles with the fewest months of data for mobile access.

The work in this notebook builds upon the code provided in this [notebook](https://drive.google.com/file/d/1fYTIX79t9jk-Jske8IwysV-rbRkD4_dc/view) by Dr. David McDonald, which provides example usage of the Pageviews API.

# Step 1: Loading the required dependencies

Import the required Python libraries. If any are not already present in your environment, install them with the ```pip install <library-name>``` command.

In [1]:
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
from datetime import datetime, timedelta
import pandas as pd
from collections import OrderedDict
import matplotlib.pyplot as plt

# Step 2: Defining constants as global variables

These are variables related to the API call parameters and are not expected to change throughout the execution of this notebook. They provide a good template for making API requests. Code adapted from the example notebook provided by Dr. David McDonald.

In [21]:
#########
#
#    CONSTANTS
#

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example, below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "",
    "agent":       "user",
    "article":     "",
    "granularity": "monthly",
    "start":       "2015070100",  # July 2015
    "end":         (datetime.now().replace(day=1) - timedelta(days=1)).strftime("%Y%m%d00")  # Last day of previous month
}
ACCESS_TYPES = {"desktop": "desktop",
                "mobile-web": "mobile-web",
                "mobile-app": "mobile-app",
                "all-access": "all-access"
                }


# Step 3: Defining a function to request pageview data

This function performs the API request to fetch pageview data for a given article title and access type (mobile/desktop) and returns the result as a parsed JSON response. Code adapted from the example notebook provided by Dr. David McDonald.

In [15]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageviews_per_article(article_title = None,
                                  access_type = None,
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT,
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS,
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):
    """
    Requests pageview data for a specific Wikipedia article from the Wikimedia Pageviews API.

    This function constructs and sends a request to the Wikimedia Pageviews API to retrieve
    pageview data for a given article. It handles URL encoding of the article title and
    implements rate limiting to avoid exceeding API usage limits.

    Args:
        article_title (str, required): The title of the Wikipedia article to request data for.
            If not provided, it must be present in the request_template.
        access_type (str, required): The access type for the pageviews (e.g., 'desktop', 'mobile-web').
            If not provided, it must be present in the request_template.
        endpoint_url (str, optional): The base URL for the Wikimedia Pageviews API.
            Defaults to API_REQUEST_PAGEVIEWS_ENDPOINT.
        endpoint_params (str, optional): The parameter template for the API request.
            Defaults to API_REQUEST_PER_ARTICLE_PARAMS.
        request_template (dict, optional): A template dictionary containing default values for the request.
            Defaults to ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE.
        headers (dict, optional): HTTP headers to include in the request.
            Defaults to REQUEST_HEADERS.

    Returns:
        dict or None: A dictionary containing the JSON response from the API if the request was successful,
        or None if an error occurred.

    Raises:
        Exception: If neither article_title nor request_template['article'] is provided.
        Exception: If neither access_type nor request_template['access'] is provided.

    Note:
        This function implements rate limiting using the API_THROTTLE_WAIT constant.
        Ensure that this constant is properly set to comply with the API's usage policies.
    """
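    # Work on a copy so that the shared module-level template is not modified between calls
    request_template = dict(request_template)

    # article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['article'] = article_title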

    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    if access_type:
        request_template['access'] = access_type

    if not request_template['access']:
        raise Exception("Must supply an access type to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    # Properly encode the article title
    article_title_encoded = request_template['article'].replace(' ', '_')
    article_title_encoded = urllib.parse.quote(article_title_encoded, safe='')
    request_template['article'] = article_title_encoded

    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
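
To make the URL construction concrete without touching the network, here is a minimal offline sketch of the same assembly steps. The function name `build_pageviews_url` and the fixed `end` value are assumptions made for a reproducible example; the endpoint and parameter strings are copied from the constants cell.

```python
import urllib.parse

ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

def build_pageviews_url(article_title, access, start="2015070100", end="2024093000"):
    """Assemble a per-article request URL (hypothetical helper, no request is sent)."""
    # Build a fresh dict on every call so no shared template is modified
    params = {
        "project": "en.wikipedia.org",
        "access": access,
        "agent": "user",
        # Spaces become underscores, then everything else is percent-encoded
        "article": urllib.parse.quote(article_title.replace(' ', '_'), safe=''),
        "granularity": "monthly",
        "start": start,
        "end": end,  # assumed example value: last day of September 2024
    }
    return ENDPOINT + PARAMS.format(**params)

url = build_pageviews_url("Aarskog–Scott syndrome", "desktop")
print(url)
```

The en dash in the title is percent-encoded as its UTF-8 bytes (`%E2%80%93`), which is why titles with non-ASCII characters need the `quote` step.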


# Step 4: Reading the list of rare diseases

The list of rare diseases is stored in the ```rare-disease_cleaned.AUG.2024.csv``` file. The file is known to contain some non-ASCII characters and hence is read with the UTF-8 encoding.

In [6]:
import pandas as pd

# Read the CSV file, specifying encoding to handle special characters
df = pd.read_csv('rare-disease_cleaned.AUG.2024.csv', encoding='utf-8')

print(df.head())


                           disease    pageid  \
0             Klinefelter syndrome  19833554   
1           Aarskog–Scott syndrome   7966521   
2             Abetalipoproteinemia     68451   
3                            MT-TP  20945466   
4  Ablepharon macrostomia syndrome  10776100   

                                                 url  
0  https://en.wikipedia.org/wiki/Klinefelter_synd...  
1  https://en.wikipedia.org/wiki/Aarskog–Scott_sy...  
2  https://en.wikipedia.org/wiki/Abetalipoprotein...  
3                https://en.wikipedia.org/wiki/MT-TP  
4  https://en.wikipedia.org/wiki/Ablepharon_macro...  


In [7]:
rare_diseases = df['disease'].tolist()

['Klinefelter syndrome', 'Aarskog–Scott syndrome', 'Abetalipoproteinemia', 'MT-TP', 'Ablepharon macrostomia syndrome', 'Acanthocheilonemiasis', 'Acanthosis nigricans', 'Aceruloplasminemia', 'Megaesophagus', 'Achard–Thiers syndrome', 'Achondrogenesis', 'Achondroplasia', 'Dwarfism', 'Osteochondrodysplasia', 'Fibroblast growth factor receptor 3', 'Vestibular schwannoma', 'Brain tumor', 'Acquired generalized lipodystrophy', 'Barraquer–Simons syndrome', 'Acrodermatitis enteropathica', 'Zinc deficiency', 'Brown-Séquard syndrome', 'Spinal cord injury', 'Brucellosis', 'Yellowstone Park bison herd', 'Māui dolphin', 'Brugada syndrome', 'Nav1.8', 'Sports cardiology', 'Budd–Chiari syndrome', 'Thrombosis', 'Hepatic veno-occlusive disease', 'Thromboangiitis obliterans', 'Bullous pemphigoid', 'Pemphigoid', 'Trigonocephaly', 'CADASIL', 'Campomelic dysplasia', 'Camurati–Engelmann disease', 'Canavan disease', 'Spongy degeneration of the central nervous system', 'Candidiasis', 'Breastfeeding difficulties

# Step 5: Get pageview data for each rare disease

We iterate through our list of rare diseases and call the ```request_pageviews_per_article()``` function for each disease to get the required article traffic details. The resulting JSON response contains the field ```items``` that contains the required metrics.

The ```items``` field contains the following sub-fields:
 1. project - usually en.wikipedia, since these are Wikipedia articles.
 2. article - the title of the article or in this case the name of the rare disease
 3. granularity - in our case this is monthly since we are pulling monthly traffic data from the API
 4. timestamp - timestamp associated with the data in the format YYYYMMDDHH (for example, 2015070100)
 5. agent - in this case we only consider user pageview details so this field should be "user"
 6. views - the number of page views for the given article and timestamp.
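
As a quick illustration of what one record looks like and how its timestamp can be parsed, consider the sketch below. The record is hand-written for this example: the view count is made up, and only the field names follow the list above.

```python
from datetime import datetime

# Illustrative record; the 'views' value is invented for this example
item = {
    "project": "en.wikipedia",
    "article": "Achondroplasia",
    "granularity": "monthly",
    "timestamp": "2015070100",
    "access": "desktop",
    "agent": "user",
    "views": 1234,
}

# Monthly timestamps such as 2015070100 carry a trailing two-digit hour field
month = datetime.strptime(item["timestamp"], "%Y%m%d%H")
print(month.strftime("%Y-%m"), item["views"])
```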

 We drop a field called access since it is misleading for the mobile and cumulative files, where each record combines more than one access type.
 Mobile views consist of two types: mobile app views and mobile website views. The API distinguishes between the two, so we query it separately for app and website access and sum the views to find the total number of mobile views.

 We then sort the records using the article title as key and save them in the JSON format.
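
One way to sum the two mobile series is to match records on their timestamp rather than on their position in the list. The sketch below takes that approach; it is an alternative illustration, not the notebook's own method, and it assumes that a month missing from one source should simply contribute zero views. The helper name `sum_views_by_timestamp` and the sample records are hypothetical.

```python
def sum_views_by_timestamp(web_items, app_items):
    """Combine two lists of monthly records by timestamp, summing their views."""
    totals = {}
    for item in list(web_items) + list(app_items):
        ts = item['timestamp']
        if ts not in totals:
            # Keep the descriptive fields, drop 'access', and start the sum at zero
            totals[ts] = {k: v for k, v in item.items() if k not in ('access', 'views')}
            totals[ts]['views'] = 0
        totals[ts]['views'] += item['views']
    # Return the records in chronological order
    return [totals[ts] for ts in sorted(totals)]

# Made-up sample data: the app series is missing July 2015
web = [
    {'article': 'X', 'timestamp': '2015070100', 'access': 'mobile-web', 'views': 10},
    {'article': 'X', 'timestamp': '2015080100', 'access': 'mobile-web', 'views': 20},
]
app = [
    {'article': 'X', 'timestamp': '2015080100', 'access': 'mobile-app', 'views': 3},
]
combined = sum_views_by_timestamp(web, app)
print(combined)
```

Matching on the timestamp keeps the months aligned even when the two series have different lengths, which a purely positional pairing would not guarantee.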

In [8]:
def get_pageviews(diseases, access_type):
    """
    Retrieve pageview data for multiple diseases and a specific access type.

    This function iterates through a list of diseases, making API requests to fetch
    pageview data for each disease. It collects monthly pageview statistics for the
    specified access type.

    Parameters:
    diseases (list): A list of disease names to fetch pageview data for.
    access_type (str): The type of access to filter pageviews by. Should be a key in the ACCESS_TYPES dictionary.

    Returns:
    dict: A dictionary where keys are disease names and values are lists of monthly pageview data.
          Each list item is expected to be a dictionary containing pageview statistics for a specific month.

    Side effects:
    - Prints progress messages to the console for each disease processed.
    - Prints error messages if data collection fails for any disease.

    Note:
    - Relies on an external function `request_pageviews_per_article` to fetch the actual data.
    - Uses a global dictionary ACCESS_TYPES to map access_type to its corresponding value.
    """
    all_data = {}
    for disease in diseases:
        print(f"Getting pageview data for: {disease}")
        views = request_pageviews_per_article(disease, ACCESS_TYPES[access_type])
        if views and 'items' in views:
            print(f"Collected {len(views['items'])} months of pageview data")
            all_data[disease] = views['items']
        else:
            print(f"Failed to collect data for {disease}")
    return all_data

In [9]:
def sort_data_by_article(data):
    """
    Sort the input data dictionary by article names in alphabetical order.

    This function takes a dictionary where keys are article names and values are
    their associated data. It returns a new OrderedDict with the same key-value
    pairs, but sorted alphabetically by the keys (article names).

    Parameters:
    data (dict): A dictionary with article names as keys and associated data as values.

    Returns:
    OrderedDict: A new OrderedDict containing the same key-value pairs as the input,
                 but sorted alphabetically by the keys (article names).

    Note:
    - This function assumes that the keys in the input dictionary are strings (article names).
    - The original dictionary is not modified; a new OrderedDict is created and returned.
    - Requires the OrderedDict class to be imported from the collections module.
    """
    return OrderedDict(sorted(data.items()))

In [10]:
def process_data(data):
    """
    Process and restructure the input data, removing the 'access' field from each item.

    This function takes a dictionary where keys are disease names and values are lists
    of data items. It creates a new dictionary with the same structure, but each data
    item is transformed to include only specific fields, excluding the 'access' field.

    Parameters:
    data (dict): A dictionary where keys are disease names and values are lists of
                 data items. Each data item is expected to be a dictionary containing
                 various fields including 'project', 'article', 'granularity',
                 'timestamp', 'views', and potentially others.

    Returns:
    dict: A new dictionary with the same keys as the input, but values are lists of
          processed data items. Each processed item is a dictionary containing only
          the fields: 'project', 'article', 'granularity', 'timestamp', 'agent', and 'views'.

    Note:
    - The 'access' field, if present in the original data items, is deliberately excluded
      from the processed output.
    - A new field 'agent' is added to each processed item with a fixed value of "user".
    - The original input dictionary is not modified; a new dictionary is created and returned.
    """
    processed_data = {}
    for disease, items in data.items():
        processed_data[disease] = [
            {
                'project': item['project'],
                'article': item['article'],
                'granularity': item['granularity'],
                'timestamp': item['timestamp'],
                'agent': "user",
                'views': item['views']
            }
            for item in items
        ]
    return processed_data

In [11]:
def save_json(data, filename):
    """
    Save the given data as a JSON file, with articles sorted alphabetically.

    This function takes a dictionary of data, sorts it by article names using
    the sort_data_by_article function, and then saves the sorted data as a
    JSON file with the specified filename.

    Parameters:
    data (dict): The data to be saved. Expected to be a dictionary where keys
                 are article names and values are the corresponding data.
    filename (str): The name (including path if necessary) of the file where
                    the JSON data will be saved.

    Returns:
    None

    Side effects:
    - Creates or overwrites a file with the given filename.
    - Writes the sorted JSON data to this file.
    """
    sorted_data = sort_data_by_article(data)
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(sorted_data, f, ensure_ascii=False, indent=2)
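
The intended effect of sorting before serializing can be seen on a tiny in-memory example. This is a sketch only: the three titles come from the disease list, the empty lists stand in for the monthly records, and `json.dumps` is used instead of writing a file.

```python
import json
from collections import OrderedDict

# Article titles as keys; empty lists stand in for the monthly records
data = {"Māui dolphin": [], "Achondroplasia": [], "Brucellosis": []}

# Sort by key (article title), then serialize without escaping non-ASCII text
sorted_data = OrderedDict(sorted(data.items()))
text = json.dumps(sorted_data, ensure_ascii=False, indent=2)
print(text)
```

Note that Python compares strings by code point, so this ordering is not locale-aware: uppercase letters sort before lowercase ones, and accented characters sort after plain ASCII letters.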

# Step 6: Putting it all together

Here we run the entire pipeline we have defined so far: querying the API for data, processing the data to suit our requirements, and saving the data in JSON files.

In [23]:
def main():
    """
    Process and save Wikipedia pageview data for rare diseases across different access types.

    This function performs the following steps:
    1. Reads a list of rare diseases from a CSV file.
    2. Defines a date range from July 2015 to the current month.
    3. Retrieves and processes pageview data for desktop access.
    4. Retrieves and processes pageview data for mobile access (combining mobile-web and mobile-app).
    5. Combines desktop and mobile data to create cumulative pageview data.
    6. Saves processed data for desktop, mobile, and cumulative views as separate JSON files.

    The function uses several helper functions:
    - get_pageviews(): To retrieve pageview data for each access type.
    - process_data(): To process and structure the retrieved data.
    - save_json(): To save the processed data as JSON files.

    Files generated:
    - rare-disease_monthly_desktop_{date_range}.json: Desktop pageview data.
    - rare-disease_monthly_mobile_{date_range}.json: Combined mobile pageview data.
    - rare-disease_monthly_cumulative_{date_range}.json: Total pageview data across all access types.

    Note:
    - Requires pandas for reading the CSV file.
    - Uses the datetime module for date calculations.
    - The date range in filenames is formatted as "YYYYMM-YYYYMM".
    - Assumes all helper functions and necessary modules are imported and available.
    """
    # Read the CSV file
    df = pd.read_csv('rare-disease_cleaned.AUG.2024.csv', encoding='utf-8')
    diseases = df['disease'].tolist()

    # Get start and end dates
    start_date = "2015070100"  # July 2015
    end_date = datetime.now().strftime("%Y%m0100")
    date_range = f"{start_date[:6]}-{end_date[:6]}"

    # Process desktop data
    desktop_data = get_pageviews(diseases, 'desktop')
    processed_desktop = process_data(desktop_data)
    save_json(processed_desktop, f"rare-disease_monthly_desktop_{date_range}.json")

    # Process mobile data (combine mobile-web and mobile-app)
    mobile_web_data = get_pageviews(diseases, 'mobile-web')
    mobile_app_data = get_pageviews(diseases, 'mobile-app')

    mobile_data = {}
    for disease in diseases:
        mobile_data[disease] = []
        web_items = mobile_web_data.get(disease, [])
        app_items = mobile_app_data.get(disease, [])

        for web_item, app_item in zip(web_items, app_items):
            mobile_data[disease].append({
                'project': web_item['project'],
                'article': web_item['article'],
                'granularity': web_item['granularity'],
                'timestamp': web_item['timestamp'],
                'views': web_item['views'] + app_item['views']
            })

    processed_mobile = process_data(mobile_data)
    save_json(processed_mobile, f"rare-disease_monthly_mobile_{date_range}.json")

    # Process cumulative data
    cumulative_data = {}
    for disease in diseases:
        cumulative_data[disease] = []
        desktop_items = desktop_data.get(disease, [])
        mobile_items = mobile_data.get(disease, [])

        for desktop_item, mobile_item in zip(desktop_items, mobile_items):
            cumulative_data[disease].append({
                'project': desktop_item['project'],
                'article': desktop_item['article'],
                'granularity': desktop_item['granularity'],
                'timestamp': desktop_item['timestamp'],
                'views': desktop_item['views'] + mobile_item['views']
            })

    processed_cumulative = process_data(cumulative_data)
    save_json(processed_cumulative, f"rare-disease_monthly_cumulative_{date_range}.json")

if __name__ == "__main__":
    main()

Getting pageview data for: Klinefelter syndrome
Collected 111 months of pageview data
Getting pageview data for: Aarskog–Scott syndrome
Collected 111 months of pageview data
Getting pageview data for: Abetalipoproteinemia
Collected 111 months of pageview data
Getting pageview data for: MT-TP
Collected 111 months of pageview data
Getting pageview data for: Ablepharon macrostomia syndrome
Collected 111 months of pageview data
Getting pageview data for: Acanthocheilonemiasis
Collected 111 months of pageview data
Getting pageview data for: Acanthosis nigricans
Collected 111 months of pageview data
Getting pageview data for: Aceruloplasminemia
Collected 111 months of pageview data
Getting pageview data for: Megaesophagus
Collected 111 months of pageview data
Getting pageview data for: Achard–Thiers syndrome
Collected 111 months of pageview data
Getting pageview data for: Achondrogenesis
Collected 111 months of pageview data
Getting pageview data for: Achondroplasia
Collected 111 months of p

KeyboardInterrupt: 

# Step 7: Visualizing the data

We load the data from the JSON files with ```load_data()``` and convert it into DataFrames with the ```create_dataframe()``` function, since raw JSON is awkward to work with when visualizing data.

The first plot is the **Maximum Average and Minimum Average** plot, which depicts the articles with the highest and lowest average page requests for desktop access and mobile access over the entire time series.

The second plot is the **Top 10 Peak Page Views** which visualizes the top 10 article pages by largest (peak) page views over the entire time series by access type (top 10 for desktop access + top 10 for mobile access).

The third plot is the **Fewest Months of Data** that shows the 10 articles with the fewest months of data for desktop access and the 10 articles with the fewest months of data for mobile access.
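
Before plotting, it may help to see the three underlying summaries computed on a toy dataset. The sketch below uses plain Python on the same dictionary shape as the JSON files; the article names `A`, `B` and `C` and all view counts are invented for illustration, and the notebook itself performs these steps with pandas `groupby`.

```python
from statistics import mean

# Toy data in the same shape as the JSON files (made-up articles and counts)
toy = {
    "A": [{"timestamp": "2015070100", "views": 10}, {"timestamp": "2015080100", "views": 30}],
    "B": [{"timestamp": "2015070100", "views": 100}, {"timestamp": "2015080100", "views": 50}],
    "C": [{"timestamp": "2015080100", "views": 5}],
}

# 1. Highest and lowest average monthly views
avg = {article: mean(r["views"] for r in recs) for article, recs in toy.items()}
highest_avg = max(avg, key=avg.get)
lowest_avg = min(avg, key=avg.get)

# 2. Articles ranked by their single largest monthly value (top 10)
peak = {article: max(r["views"] for r in recs) for article, recs in toy.items()}
top_peaks = sorted(peak, key=peak.get, reverse=True)[:10]

# 3. Articles with the fewest months of data (first 10)
months = {article: len(recs) for article, recs in toy.items()}
fewest = sorted(months, key=months.get)[:10]

print(highest_avg, lowest_avg, top_peaks, fewest)
```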

In [24]:
def load_data(file_path):
    """
    Load JSON data from a specified file.

    This function opens a file at the given path and reads its contents,
    parsing them as JSON data.

    Parameters:
    file_path (str): The path to the JSON file to be loaded.

    Returns:
    dict or list: The parsed JSON data, typically a dictionary or a list,
                  depending on the structure of the JSON file.

    Raises:
    FileNotFoundError: If the specified file does not exist.
    json.JSONDecodeError: If the file content is not valid JSON.
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        return json.load(f)

In [25]:
def create_dataframe(data):
    """
    Convert a dictionary of articles and view data into a pandas DataFrame.

    This function processes a dictionary where the keys represent article names,
    and the values are lists of dictionaries containing timestamp and view count
    for each article. It organizes this data into a pandas DataFrame.

    Parameters:
    data (dict): A dictionary where each key is an article (str) and each value
                 is a list of dictionaries with keys 'timestamp' (str or datetime)
                 and 'views' (int).

    Returns:
    pandas.DataFrame: A DataFrame containing the articles, timestamps, and view counts,
                      with columns 'article', 'timestamp', and 'views'.

    Raises:
    KeyError: If the view data does not contain 'timestamp' or 'views' keys.
    """
    rows = []
    for article, views in data.items():
        for view in views:
            rows.append({
                'article': article,
                'timestamp': view['timestamp'],
                'views': view['views']
            })
    return pd.DataFrame(rows)

In [26]:
def plot_max_min_average(desktop_df, mobile_df):
    """
    Plot the maximum and minimum average page views for articles accessed via Desktop and Mobile.

    This function takes two dataframes, one for desktop access and one for mobile access,
    calculates the average page views per article for each, and then plots the time series
    of the articles with the highest and lowest average views for both access types. The plot
    is saved as 'max_min_average.png'.

    Parameters:
    desktop_df (pandas.DataFrame): DataFrame containing page view data for articles accessed
                                   via desktop. It should have columns 'article', 'timestamp',
                                   and 'views'.
    mobile_df (pandas.DataFrame): DataFrame containing page view data for articles accessed
                                  via mobile. It should have columns 'article', 'timestamp',
                                  and 'views'.

    Returns:
    None: The function saves the plot as a PNG file and does not return any value.
    """
    plt.figure(figsize=(15, 10))

    for df, access_type in [(desktop_df, 'Desktop'), (mobile_df, 'Mobile')]:
        avg_views = df.groupby('article')['views'].mean().sort_values()
        max_article = avg_views.index[-1]
        min_article = avg_views.index[0]

        for article, label in [(max_article, f'Max {access_type}'), (min_article, f'Min {access_type}')]:
            article_data = df[df['article'] == article].sort_values('timestamp')
            plt.plot(article_data['timestamp'], article_data['views'], label=f"{label} ({article})")

    plt.title('Maximum and Minimum Average Page Requests')
    plt.xlabel('Date')
    plt.ylabel('Page Views')
    plt.legend()
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('max_min_average.png')
    plt.close()

In [27]:
def plot_top_10_peak(desktop_df, mobile_df):

    """
    Plot the top 10 articles with the highest peak page views for Desktop and Mobile access.

    This function takes two dataframes, one for desktop access and one for mobile access,
    identifies the top 10 articles with the highest single-month page views for each access type,
    and plots their view trends over time. The plot is saved as 'top_10_peak.png'.

    Parameters:
    desktop_df (pandas.DataFrame): DataFrame containing page view data for articles accessed
                                   via desktop. It should have columns 'article', 'timestamp',
                                   and 'views'.
    mobile_df (pandas.DataFrame): DataFrame containing page view data for articles accessed
                                  via mobile. It should have columns 'article', 'timestamp',
                                  and 'views'.

    Returns:
    None: The function saves the plot as a PNG file and does not return any value.

    Example:
    plot_top_10_peak(desktop_df, mobile_df)
    """
    plt.figure(figsize=(15, 10))

    for df, access_type in [(desktop_df, 'Desktop'), (mobile_df, 'Mobile')]:
        peak_views = df.groupby('article')['views'].max().sort_values(ascending=False).head(10)

        for article in peak_views.index:
            article_data = df[df['article'] == article].sort_values('timestamp')
            plt.plot(article_data['timestamp'], article_data['views'], label=f"{access_type}: {article}")

    plt.title('Top 10 Peak Page Views')
    plt.xlabel('Date')
    plt.ylabel('Page Views')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('top_10_peak.png')
    plt.close()

In [None]:
def plot_fewest_months(desktop_df, mobile_df):

    """
    Plot the articles with the fewest months of data for Desktop and Mobile access.

    This function takes two dataframes, one for desktop access and one for mobile access,
    identifies the 10 articles with the fewest months of recorded data for each access type,
    and plots their time series data. The plot is saved as 'fewest_months.png'.

    Parameters:
    desktop_df (pandas.DataFrame): DataFrame containing page view data for articles accessed
                                   via desktop. It should have columns 'article', 'timestamp',
                                   and 'views'.
    mobile_df (pandas.DataFrame): DataFrame containing page view data for articles accessed
                                  via mobile. It should have columns 'article', 'timestamp',
                                  and 'views'.

    Returns:
    None: The function saves the plot as a PNG file and does not return any value.

    Raises:
    KeyError: If the dataframes do not contain the expected 'article', 'timestamp', or 'views' columns.
    """

    plt.figure(figsize=(15, 10))

    for df, access_type in [(desktop_df, 'Desktop'), (mobile_df, 'Mobile')]:
        month_counts = df.groupby('article').size().sort_values().head(10)

        for article in month_counts.index:
            article_data = df[df['article'] == article].sort_values('timestamp')
            plt.plot(article_data['timestamp'], article_data['views'], label=f"{access_type}: {article}")

    plt.title('Articles with Fewest Months of Data')
    plt.xlabel('Date')
    plt.ylabel('Page Views')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('fewest_months.png')
    plt.close()

In [28]:
def main():
    """
    Main function to load data, create dataframes, and generate plots.

    This function performs the following steps:
    1. Loads desktop and mobile page view data from two JSON files.
    2. Converts the loaded data into pandas DataFrames.
    3. Parses the 'timestamp' column as a datetime object.
    4. Generates and saves the following plots:
       - Maximum and minimum average page views per article for both Desktop and Mobile.
       - Top 10 articles with the highest peak page views for both Desktop and Mobile.
       - Articles with the fewest months of data for both Desktop and Mobile.

    The plots are saved as PNG files in the current directory.
    Returns:
    None: The function generates plots and does not return any value.
    """
    desktop_data = load_data('rare-disease_monthly_desktop_201507-202410.json')
    mobile_data = load_data('rare-disease_monthly_mobile_201507-202410.json')

    desktop_df = create_dataframe(desktop_data)
    mobile_df = create_dataframe(mobile_data)

    desktop_df['timestamp'] = pd.to_datetime(desktop_df['timestamp'], format='%Y%m%d00')
    mobile_df['timestamp'] = pd.to_datetime(mobile_df['timestamp'], format='%Y%m%d00')

    plot_max_min_average(desktop_df, mobile_df)
    plot_top_10_peak(desktop_df, mobile_df)
    plot_fewest_months(desktop_df, mobile_df)

if __name__ == "__main__":
    main()

# Conclusion

From the plots we can observe that the rare disease with the highest average monthly page views on both desktop and mobile is "Black Death", while the one with the lowest is "Filippi Syndrome".

It is interesting to note that "COVID-19 vaccine misinformation and hesitancy" is listed as a rare disease on the list and features on the plot of the articles with the fewest months of data. This makes sense considering that the COVID-19 virus is relatively new, and the peak can be explained by a sudden interest in the topic due to the widespread vaccination efforts to control the pandemic.

It is also worth noting the peak of the term "pandemic" for both mobile and desktop access in 2020, when the COVID-19 pandemic hit. It seems that the impact of the COVID-19 pandemic prompted people to look into other fatal pandemics of the past, with "Black Death" being considered one of the most fatal, claiming the lives of around 50 million people. This could explain why the two pandemic-related terms peaked in page views during 2020.