# Rare Disease Article Page Views - Data Collection and Analysis
This notebook demonstrates how to collect and analyze monthly article traffic data from Wikipedia for a set of rare disease-related pages. The data is collected using the [Wikimedia REST API](https://www.mediawiki.org/wiki/Wikimedia_REST_API), specifically the [pageviews/per-article](https://wikimedia.org/api/rest_v1/#/Pageviews%20data) endpoint, which provides access to desktop, mobile web, and mobile app traffic data. 

The dataset spans from July 1, 2015, through September 30, 2024, and includes separate monthly pageview counts for desktop and mobile access. This notebook also illustrates basic visual analysis of pageview trends across time for different subsets of rare disease articles.

## Objective
The goal of this project is to:
1. Retrieve and aggregate Wikipedia page view data for articles related to rare diseases.
2. Perform basic analysis and visualize trends for desktop and mobile access.
3. Follow best practices for reproducibility in data science by documenting and sharing the dataset, code, and results.

#### IMPORTS
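Of these, only `os` is part of the Python standard library; `pandas`, `seaborn`, and `matplotlib` are third-party packages that may need to be installed with pip.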

In [1]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

We will also import a local module that contains the functions for getting page views from Wikipedia. <br />This module imports several standard-library modules as well as `requests`, which is not part of the standard library and may need to be installed with pip.

In [2]:
import access_wiki_pageviews as access

#### SET CONSTANTS
Let's begin by setting some constants that we will need to provide to the `access` functions:

Wikimedia requests that you include information in the **request header** that allows them to contact you if something goes wrong, such as your code exceeding rate limits or causing some other error. <br />

For convenience, I have created a function that will set the request header in the correct format when provided with 
   - email
   - organization
   - project

In [3]:
REQUEST_HEADER = access.set_request_header(user_email   = 'rollk@uw.edu',
                                           organization = 'University of Washington',
                                           project      = 'Homework 1')
# view the output:
REQUEST_HEADER

{'User-Agent': 'rollk@uw.edu, University of Washington, Homework 1'}
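
The source of the `access` module is not shown in this notebook, so the following is only a guess at what a helper like `set_request_header` might do, inferred from the output above: it joins the three values with commas under a `User-Agent` key. The name `build_request_header` and the placeholder values are hypothetical.

```python
def build_request_header(user_email: str, organization: str, project: str) -> dict:
    """Hypothetical stand-in for access.set_request_header (assumed behavior)."""
    # Wikimedia asks for contact details in the User-Agent header.
    return {"User-Agent": f"{user_email}, {organization}, {project}"}


header = build_request_header("you@example.edu", "Example University", "Example Project")
print(header)
```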

Let's also set the **date range** for the dataset we're going to collect.<br />
We are given a start date of July 2015 and asked to collect data through the last completed month.<br />
At the time of this assignment, the last completed month was September 2024, which I will hardcode.


In [4]:
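# Set the start and end dates for the lookup
# each should be a string formatted as YYYYMMDDHH
START_DATE = "2015070100"  # July 1, 2015
END_DATE   = "2024093000"  # September 30, 2024

# confirm the values that were set
print('Start date: {}'.format(START_DATE))
print('End date: {}'.format(END_DATE))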


Start date: 2015070100
End date: 2024093000


If you would like to use other dates, you can either hardcode them manually or use the convenience function `get_previous_complete_month()` in the `access` module, which dynamically returns the last day of the most recently completed month.

You could use it in code like so:
```
END_DATE = access.get_previous_complete_month()
```
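
As a rough illustration of the date logic involved, here is a minimal sketch of how the last day of the previous month can be computed and formatted as `YYYYMMDDHH`. The function name `last_day_of_previous_month` is hypothetical, and the actual implementation of `get_previous_complete_month()` in the `access` module may differ.

```python
from datetime import date, timedelta
from typing import Optional


def last_day_of_previous_month(today: Optional[date] = None) -> str:
    """Hypothetical equivalent of access.get_previous_complete_month (assumed behavior)."""
    today = today or date.today()
    # Stepping back one day from the first of the current month lands on
    # the last day of the previous month.
    last_day = today.replace(day=1) - timedelta(days=1)
    # Format as YYYYMMDDHH with the hour fixed at 00.
    return last_day.strftime("%Y%m%d") + "00"


print(last_day_of_previous_month(date(2024, 10, 15)))  # 2024093000
```

Passing an explicit date, as in the last line, keeps the result reproducible; calling the function with no argument uses the current date.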

### Load Rare Disease Information
The CSV with information on rare diseases is provided to us. We are to assume it is clean and ready to use. <br /> 
A copy of the CSV is saved in the current working folder for loading convenience, as well as in the Provided Resources subfolder.

In [5]:
# load the data
rare_diseases_df = pd.read_csv('rare-disease_cleaned_AUG2024.csv')

# let's look at the rare diseases data we were given
rare_diseases_df

| Unnamed: 0 | disease | pageid | url |
|---|---|---|---|
| 0 | Klinefelter syndrome | 19833554 | https://en.wikipedia.org/wiki/Klinefelter_synd... |
| 1 | Aarskog–Scott syndrome | 7966521 | https://en.wikipedia.org/wiki/Aarskog–Scott_sy... |
| 2 | Abetalipoproteinemia | 68451 | https://en.wikipedia.org/wiki/Abetalipoprotein... |
| 3 | MT-TP | 20945466 | https://en.wikipedia.org/wiki/MT-TP |
| 4 | Ablepharon macrostomia syndrome | 10776100 | https://en.wikipedia.org/wiki/Ablepharon_macro... |
| ... | ... | ... | ... |
| 1768 | Bowen–Conradi syndrome | 46447611 | https://en.wikipedia.org/wiki/Bowen–Conradi_sy... |
| 1769 | Bowenoid papulosis | 22589842 | https://en.wikipedia.org/wiki/Bowenoid_papulosis |
| 1770 | Branchio-oculo-facial syndrome | 41341790 | https://en.wikipedia.org/wiki/Branchio-oculo-f... |
| 1771 | Bronchopulmonary dysplasia | 3251174 | https://en.wikipedia.org/wiki/Bronchopulmonary... |


## Collect Page View Data

### Single article example
Let's start with a simple example. <br />
Here we will request a single access type for a single example disease and return the pageviews to look at the formatting.<br />We will use the `request_pageviews` function from the `access` module. <br />
The function arguments are as follows:
- article_title: string
- start_date: string
- end_date: string
- access_type: string, options are "desktop", "mobile-app", "mobile-web"
- request_header: dict
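
For orientation, the sketch below shows the URL shape of the Wikimedia pageviews per-article endpoint that a function like `request_pageviews` would presumably call with these arguments. The helper name `build_pageviews_url` and its defaults are hypothetical; the internals of the `access` module are not shown in this notebook.

```python
from urllib.parse import quote

# Base path of the Wikimedia pageviews per-article endpoint.
API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_pageviews_url(article_title, start_date, end_date, access_type,
                        project="en.wikipedia.org", agent="user",
                        granularity="monthly"):
    """Hypothetical helper: assemble the per-article request URL."""
    # Titles use underscores instead of spaces and must be percent-encoded,
    # including any "/" that appears inside the title.
    encoded_title = quote(article_title.replace(" ", "_"), safe="")
    return "/".join([API_ROOT, project, access_type, agent,
                     encoded_title, granularity, start_date, end_date])


url = build_pageviews_url("Acanthosis nigricans", "2015070100", "2024093000", "desktop")
print(url)
```

A URL built this way could then be fetched with something like `requests.get(url, headers=REQUEST_HEADER)`.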

In [6]:
# set an access type
example_access_type = 'desktop'

# set a single example disease
example_single_disease = rare_diseases_df["disease"][6]  # get the disease at index 6 
# see what disease was set
print('Disease: {}'.format(example_single_disease))

# look at the output from the pageviews request: 
access.request_pageviews(article_title = example_single_disease,
                         start_date     = START_DATE,
                         end_date       = END_DATE,
                         access_type    = example_access_type,
                         request_header = REQUEST_HEADER)

Disease: Acanthosis nigricans


{'items': [{'project': 'en.wikipedia',
   'article': 'Acanthosis_nigricans',
   'granularity': 'monthly',
   'timestamp': '2015070100',
   'access': 'desktop',
   'agent': 'user',
   'views': 16602},
  {'project': 'en.wikipedia',
   'article': 'Acanthosis_nigricans',
   'granularity': 'monthly',
   'timestamp': '2015080100',
   'access': 'desktop',
   'agent': 'user',
   'views': 15241},
  {'project': 'en.wikipedia',
   'article': 'Acanthosis_nigricans',
   'granularity': 'monthly',
   'timestamp': '2015090100',
   'access': 'desktop',
   'agent': 'user',
   'views': 14754},
  {'project': 'en.wikipedia',
   'article': 'Acanthosis_nigricans',
   'granularity': 'monthly',
   'timestamp': '2015100100',
   'access': 'desktop',
   'agent': 'user',
   'views': 13779},
  {'project': 'en.wikipedia',
   'article': 'Acanthosis_nigricans',
   'granularity': 'monthly',
   'timestamp': '2015110100',
   'access': 'desktop',
   'agent': 'user',
   'views': 13203},
  {'project': 'en.wikipedia',
   'ar

#### Expected Output Dictionary Structure for `request_pageviews`
The `request_pageviews` function returns a dictionary of pageview data fetched from the Wikimedia API for a specified article and access type. The structure of that dictionary is explained below.

Structure of the Output Dictionary:
The output dictionary returned by `request_pageviews` is expected to have the following format:

```
{
    "items": [
        {
            "project": "en.wikipedia",
            "article": "Article_Title",
            "granularity": "monthly",
            "timestamp": "YYYYMM0100",
            "access": "desktop",    # or "mobile-web" / "mobile-app"
            "agent": "user",
            "views": <integer>      # pageviews for the given article and month
        },
        # More monthly data entries...
    ]
}
```



##### Breakdown of the Output Fields:
- **items**:
This key contains a list of dictionaries. Each dictionary represents the pageview data for a specific month. This is where the main data of interest resides.

- **project**:
The Wikimedia project from which the data is retrieved. In the example output above, it is "en.wikipedia", meaning English Wikipedia.

- **article**:
The title of the article for which pageviews are retrieved. The spaces in the article title are replaced with underscores (e.g., "Klinefelter_syndrome").

- **granularity**:
The time granularity of the data. Since this function is designed to retrieve monthly data, this field will be "monthly".

- **timestamp**:
The timestamp representing the specific month for the data. It is in the format "YYYYMMDD00" where:
  - YYYY is the year.
  - MM is the month.
  - DD is the day (always set to 01 for monthly data).
  - 00 represents the hour, which is not relevant in monthly data.

- **access**:
Indicates the platform through which the article was viewed:
  - "desktop" for desktop pageviews.
  - "mobile-web" for mobile web pageviews.
  - "mobile-app" for mobile app pageviews.

- **agent**:
The type of agent that generated the pageviews. In the output shown here it is always "user", representing views by human readers rather than automated traffic.

- **views**:
The number of pageviews for the article in the specified month.
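
To make the field descriptions above concrete, here is a minimal sketch that walks the `items` list and pulls out the month and view count from each record. The two records are copied from the example response shown earlier; the helper name `monthly_views` is hypothetical and not part of the `access` module.

```python
from datetime import datetime

# Two records copied from the example response for "Acanthosis nigricans".
response = {"items": [
    {"project": "en.wikipedia", "article": "Acanthosis_nigricans",
     "granularity": "monthly", "timestamp": "2015070100",
     "access": "desktop", "agent": "user", "views": 16602},
    {"project": "en.wikipedia", "article": "Acanthosis_nigricans",
     "granularity": "monthly", "timestamp": "2015080100",
     "access": "desktop", "agent": "user", "views": 15241},
]}


def monthly_views(response):
    """Return {(year, month): views} from a pageviews response dict."""
    result = {}
    for item in response.get("items", []):
        # Timestamps are formatted as YYYYMMDDHH.
        ts = datetime.strptime(item["timestamp"], "%Y%m%d%H")
        result[(ts.year, ts.month)] = item["views"]
    return result


views = monthly_views(response)
print(views)
```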

## Dataset Example (multiple diseases, all access types)
Now let's collect the entire dataset. <br />
We will use the `generate_pageview_datasets` function from the `access` module.<br />
This function gets data for each access type ("desktop", "mobile-web", and "mobile-app") for each article.<br />
Its input is very similar to that of the `request_pageviews` function, except that it does not require an `access_type` and it takes a list of article titles rather than a single title. <br />

**Function arguments**:
- articles_list: list of strings
- start_date: string
- end_date: string
- request_header: dict


**Output**:
- tuple of dicts: desktop_data, mobile_data, cumulative_data
- json files saved for desktop_data, mobile_data, cumulative_data
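
This notebook does not show how `generate_pageview_datasets` builds its three outputs. One plausible reading, and it is only an assumption, is that the mobile dataset sums the "mobile-web" and "mobile-app" views per month and the cumulative dataset sums all three access types. The sketch below illustrates that kind of aggregation with invented numbers; `sum_views_by_month` is a hypothetical helper.

```python
from collections import Counter


def sum_views_by_month(*item_lists):
    """Add up the 'views' of records that share a timestamp."""
    totals = Counter()
    for items in item_lists:
        for item in items:
            totals[item["timestamp"]] += item["views"]
    # Sort by timestamp so the months come out in chronological order.
    return dict(sorted(totals.items()))


# Invented example records, one list per access type.
desktop = [{"timestamp": "2015070100", "views": 200},
           {"timestamp": "2015080100", "views": 180}]
mobile_web = [{"timestamp": "2015070100", "views": 100},
              {"timestamp": "2015080100", "views": 120}]
mobile_app = [{"timestamp": "2015070100", "views": 10},
              {"timestamp": "2015080100", "views": 15}]

mobile = sum_views_by_month(mobile_web, mobile_app)
cumulative = sum_views_by_month(desktop, mobile_web, mobile_app)
print(mobile)
print(cumulative)
```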

In [7]:
# create the diseases list from the dataframe
rare_diseases_list = rare_diseases_df["disease"].tolist()

## Warning- THIS WILL TAKE A WHILE TO RUN! 
# there are print statements along the way to let you know how far you are
desktop_data, mobile_data, cumulative_data = access.generate_pageview_datasets(articles_list  = rare_diseases_list,
                                                                               start_date     = START_DATE,
                                                                               end_date       = END_DATE,
                                                                               request_header = REQUEST_HEADER)

Processing article: Klinefelter syndrome, 1 of 1773
Processing article: Aarskog–Scott syndrome, 2 of 1773
Processing article: Abetalipoproteinemia, 3 of 1773
Processing article: MT-TP, 4 of 1773
Processing article: Ablepharon macrostomia syndrome, 5 of 1773
Processing article: Acanthocheilonemiasis, 6 of 1773
Processing article: Acanthosis nigricans, 7 of 1773
Processing article: Aceruloplasminemia, 8 of 1773
Processing article: Megaesophagus, 9 of 1773
Processing article: Achard–Thiers syndrome, 10 of 1773
Processing article: Achondrogenesis, 11 of 1773
Processing article: Achondroplasia, 12 of 1773
Processing article: Dwarfism, 13 of 1773
Processing article: Osteochondrodysplasia, 14 of 1773
Processing article: Fibroblast growth factor receptor 3, 15 of 1773
Processing article: Vestibular schwannoma, 16 of 1773
Processing article: Brain tumor, 17 of 1773
Processing article: Acquired generalized lipodystrophy, 18 of 1773
Processing article: Barraquer–Simons syndrome, 19 of 1773
Proces

FileNotFoundError: [WinError 3] The system cannot find the path specified: ''
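
The traceback for this error is not shown, so its cause cannot be confirmed from the notebook alone. One common way to get this message with an empty path (`''`) is calling `os.makedirs` on the result of `os.path.dirname` for a bare filename, which is an empty string. Assuming that is what happened when the JSON files were saved, a defensive save helper could look like the sketch below; `save_json` is a hypothetical name, not a function from the `access` module.

```python
import json
import os
import tempfile


def save_json(data, path):
    """Write data to path as JSON, creating the parent folder only if there is one."""
    folder = os.path.dirname(path)
    # os.path.dirname("file.json") is "", and os.makedirs("") raises
    # FileNotFoundError, so only create the folder when the name is non-empty.
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)


# Usage: write into a nested folder inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "output", "desktop.json")
    save_json({"Acanthosis nigricans": []}, target)
    with open(target, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded)
```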

## Output Format