## Homework 2:
##### In this project, we analyze a dataset of Wikipedia articles about US cities together with a dataset of US state populations, and use a machine learning service called ORES to estimate the quality of each city article. Part of this file uses code developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - August 14, 2023

### Step 1: Setting up the API call function to get information about the Wikipedia page.

#### Step 1.1: Importing required libraries

In [180]:
# Standard-library modules plus third-party packages (requests, pandas, seaborn, matplotlib, tqdm, numpy)
import json, time, requests, urllib.parse
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math
import warnings
warnings.filterwarnings('ignore')
from tqdm import tqdm
import numpy as np



#### Step 1.2: Defining constants for the API call to get Revision ID

In [2]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<mzameer@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is a string of additional page properties that can be returned; see the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string.
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
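
The throttle arithmetic in the constants above can be checked with a few lines. This is a minimal sketch: the helper name `throttle_wait` is hypothetical and not part of the original code, and it assumes the limit implied by the constants (100 requests per second) and the same 2 ms latency estimate.

```python
def throttle_wait(max_requests, per_seconds, assumed_latency):
    """Seconds to sleep before each request so the rate limit is not exceeded."""
    # time budget available for one request at the allowed rate
    interval = per_seconds / max_requests
    # subtract the time the request itself is assumed to take; never go negative
    return max(interval - assumed_latency, 0.0)

# 100 requests per second with roughly 2 ms of latency per request
wait = throttle_wait(max_requests=100, per_seconds=1.0, assumed_latency=0.002)
print(f"{wait:.3f} s between requests")
```

The `max(..., 0.0)` guard mirrors the `if API_THROTTLE_WAIT > 0.0` check used before sleeping: if the assumed latency already exceeds the per-request budget, no extra wait is needed.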

#### Step 1.3:  Defining the API call function to get Revision ID

In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

    # work on a copy so the shared module-level template is never mutated between calls
    request_params = dict(request_template)

    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_params['titles'] = article_title

    if not request_params['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_params)
        json_response = response.json()
    except Exception as e:
        # report the failure and return None so the caller can decide how to handle it
        print(e)
        json_response = None
    return json_response
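
The page info response nests the page record under a key that is the page ID, so the latest revision ID has to be dug out of `query.pages`. The sketch below shows one defensive way to do that. The helper name `extract_lastrevid` is hypothetical, and the sample dictionaries are hand-written illustrations of the response shape (the page ID `12345` is a made-up placeholder), not captured API output.

```python
def extract_lastrevid(pageinfo_response):
    """Return the latest revision ID from a page info response, or None if absent."""
    if not pageinfo_response:
        # the request function returns None when the request itself failed
        return None
    pages = pageinfo_response.get("query", {}).get("pages", {})
    for page in pages.values():
        # a page that does not exist carries no 'lastrevid' key
        if "lastrevid" in page:
            return page["lastrevid"]
    return None

# Illustrative responses only; the page id 12345 is a placeholder
found = {"query": {"pages": {"12345": {"pageid": 12345, "ns": 0,
                                       "title": "Abbeville, Alabama",
                                       "lastrevid": 1171163550}}}}
missing = {"query": {"pages": {"-1": {"ns": 0, "title": "No such page", "missing": ""}}}}

print(extract_lastrevid(found))
print(extract_lastrevid(missing))
print(extract_lastrevid(None))
```

Returning `None` for every failure mode keeps the calling loop simple: a single `is None` check covers a failed request, a missing page, and an unexpected response shape.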

#### Step 1.4:  Defining constants for the API call to get Article Scores

In [37]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "mzameer@uw.edu",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = "Mehjabeenz"
ACCESS_TOKEN = ""
#

#### Step 1.5:  Defining the API call function to get Article Scores

In [38]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Work on copies so the shared module-level templates are never mutated between calls
    request_data = dict(request_data)
    header_params = dict(header_params)

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #    Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        # report the failure and return None so the caller can decide how to handle it
        print(e)
        json_response = None

    return json_response
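
The quality label sits several levels deep in the scoring response, and a failed request returns `None`, so it helps to isolate the lookup. The sketch below is one way to do that under stated assumptions. The helper name `extract_prediction` is hypothetical, and the sample dictionary is a hand-written illustration containing only the keys this notebook reads, not a full captured response.

```python
def extract_prediction(ores_response, rev_id, wiki="enwiki", model="articlequality"):
    """Return the predicted quality class for rev_id, or None if it is not present."""
    try:
        return ores_response[wiki]["scores"][str(rev_id)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        # KeyError: the response has an unexpected shape (for example an error payload)
        # TypeError: the response is None because the request itself failed
        return None

# Illustrative response containing only the keys that are read above
sample = {"enwiki": {"scores": {"1171163550": {"articlequality": {"score": {"prediction": "C"}}}}}}

print(extract_prediction(sample, 1171163550))
print(extract_prediction(None, 1171163550))
print(extract_prediction({"error": "unavailable"}, 1171163550))
```

Note that the revision ID is converted to a string before the lookup, because the response keys scores by the revision ID as text.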

### Step 2: Setting up the Dataset of Articles with ORES Scores for Analysis

##### For this analysis, the data lives in several different places. We need a list of Wikipedia articles about US cities, US state population data, and data that assigns each state to a regional division within the US. To get a Wikipedia page quality prediction from ORES for each article page, we will: a) read each line of us_cities_by_state_SEPT.2023.csv, b) make a page info request to get the current revision of the article page, and c) make an ORES request using that revision ID.

#### Step 2.1: Read the list of Wikipedia article pages about US cities from each state.

In [70]:
filename = r'C:\Users\mehja\Documents\UW Masters\DATA 512 Human Centered Data Science\Homeworks\Homework 2\data-512-homework_2\data\us_cities_by_state_SEPT.2023.csv'
articles = pd.read_csv(filename)
display(articles)

Unnamed: 0,state,page_title,url
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama"
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama"
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama"
...,...,...,...
22152,Wyoming,"Wamsutter, Wyoming","https://en.wikipedia.org/wiki/Wamsutter,_Wyoming"
22153,Wyoming,"Wheatland, Wyoming","https://en.wikipedia.org/wiki/Wheatland,_Wyoming"
22154,Wyoming,"Worland, Wyoming","https://en.wikipedia.org/wiki/Worland,_Wyoming"
22155,Wyoming,"Wright, Wyoming","https://en.wikipedia.org/wiki/Wright,_Wyoming"


#### Step 2.2: For each Article, get the Revision ID

In [29]:
for name in tqdm(articles["page_title"]):
    try:
        page = request_pageinfo_per_article(name)
        pageid = str(list(page["query"]["pages"].keys())[0])
        revid = str(page["query"]["pages"][pageid]["lastrevid"])
        articles.loc[articles["page_title"] == name, "revid"] = revid
    except KeyError:
        print("Revision ID not found for article: {}".format(name))
    except Exception as e:
        # print the exception itself; e.__cause__ is None unless the error was explicitly chained
        print("Error making request for article: {}. Error raised: {}".format(name, e))

 13%|█▎        | 2983/22157 [45:00<146:00:22, 27.41s/it]

('Connection aborted.', TimeoutError(10060, 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond', None, 10060, None))


 80%|███████▉  | 17633/22157 [5:00:50<94:23:26, 75.11s/it]

('Connection aborted.', TimeoutError(10060, 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond', None, 10060, None))


100%|██████████| 22157/22157 [6:08:50<00:00,  1.00it/s]   
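
The two connection timeouts in the run above each leave an article without a revision ID. One possible mitigation, not used in the original run, is to retry a failed request a few times with an increasing delay. The sketch below is a generic illustration: `call_with_retries` is a hypothetical helper, and it relies on the fact that the request function returns `None` when a request fails. The `sleep` parameter exists only so the demo can record delays instead of actually waiting.

```python
import time

def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it returns a non-None value or the attempts run out."""
    for attempt in range(attempts):
        result = func()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * (2 ** attempt))
    return None

# Demo with a stand-in function that fails twice and then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    return None if calls["n"] < 3 else {"ok": True}

delays = []
result = call_with_retries(flaky, attempts=4, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In the loop above, this could wrap the existing call as `call_with_retries(lambda: request_pageinfo_per_article(name))`, leaving the rest of the loop unchanged.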


##### Two requests above failed with connection timeouts, so we check whether any articles are missing a Revision ID.

In [96]:
data_types = {"revid": "Int64"}
articles = pd.read_csv('output.csv', dtype = data_types)

display(articles.isna().mean()*100)

state         0.000000
page_title    0.000000
url           0.000000
pageid        0.009026
revid         0.009026
dtype: float64

##### Revision IDs were not found for about 0.009% of the articles (2 of 22,157). Because this is such a small number, we leave these rows as they are.

#### Step 2.3: For each Article, get the ORES score

In [97]:
for revid in tqdm(articles["revid"]):

    # skip rows where no revision ID was found - there is nothing to score
    if pd.isna(revid):
        continue

    try:
        score = request_ores_score_per_article(article_revid=int(revid),
                                       email_address="mzameer@uw.edu",
                                       access_token=ACCESS_TOKEN)

        score = score["enwiki"]["scores"][str(revid)]["articlequality"]["score"]["prediction"]
        articles.loc[articles["revid"] == revid, "score"] = score

    except Exception as e:
        # report the revision that failed (the earlier loop variable 'name' is stale here)
        print("ORES score not found for revision: {}. Error raised: {}".format(revid, e))


 21%|██▏       | 4739/22157 [1:46:54<8:02:40,  1.66s/it] 

Expecting value: line 1 column 1 (char 0)


 25%|██▌       | 5614/22157 [2:05:42<7:58:54,  1.74s/it] 

Expecting value: line 1 column 1 (char 0)


100%|██████████| 22157/22157 [7:38:54<00:00,  1.24s/it]   
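
The message `Expecting value: line 1 column 1 (char 0)` printed above is what Python's `json` decoder reports when it is given an empty or non-JSON string. This suggests the service occasionally returned a body that was not JSON, although that is an assumption, since the raw responses are not shown. The sketch below reproduces the message with the standard library only; `parse_json_or_none` is a hypothetical helper name.

```python
import json

def parse_json_or_none(body_text):
    """Decode a response body, returning None when it is not valid JSON."""
    try:
        return json.loads(body_text)
    except json.JSONDecodeError as err:
        print(f"Response was not JSON: {err}")
        return None

print(parse_json_or_none('{"enwiki": {}}'))
print(parse_json_or_none(""))
```

Logging the HTTP status code and the first few characters of the body at this point would make such failures easier to diagnose on a rerun.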


##### Checking if there are any articles for which ORES scores were not found

In [106]:
articles.isna().mean()*100

state         0.000000
page_title    0.000000
url           0.000000
revid         0.009026
score         0.031593
dtype: float64

#### Articles for which Revision IDs were not found.

In [107]:
display(articles[articles['revid'].isnull()])

Unnamed: 0,state,page_title,url,revid,score
2982,Georgia_(U.S._state),"Bogart, Georgia","https://en.wikipedia.org/wiki/Bogart,_Georgia",,
17632,Pennsylvania,"Salem Township, Clarion County, Pennsylvania","https://en.wikipedia.org/wiki/Salem_Township,_...",,


#### Articles for which ORES scores were not found.

In [108]:
display(articles[articles['score'].isnull()])

Unnamed: 0,state,page_title,url,revid,score
2982,Georgia_(U.S._state),"Bogart, Georgia","https://en.wikipedia.org/wiki/Bogart,_Georgia",,
4738,Illinois,"Peoria Heights, Illinois","https://en.wikipedia.org/wiki/Peoria_Heights,_...",1175551393.0,
5613,Indiana,"Shoals, Indiana","https://en.wikipedia.org/wiki/Shoals,_Indiana",1150406533.0,
6722,Kansas,"Wichita, Kansas","https://en.wikipedia.org/wiki/Wichita,_Kansas",1179871048.0,
7376,Louisiana,"Madisonville, Louisiana","https://en.wikipedia.org/wiki/Madisonville,_Lo...",1177247546.0,
7396,Louisiana,"Mooringsport, Louisiana","https://en.wikipedia.org/wiki/Mooringsport,_Lo...",1167346653.0,
17632,Pennsylvania,"Salem Township, Clarion County, Pennsylvania","https://en.wikipedia.org/wiki/Salem_Township,_...",,


##### Dropping the URL column as it is not needed, and cleaning up the state names so they match the population data.

In [110]:
# drop the url column, then normalize state names (e.g. 'Georgia_(U.S._state)' becomes 'Georgia')
articles.drop(columns=['url'], inplace=True)
articles['state'] = articles['state'].str.replace('_', ' ')
articles['state'] = articles['state'].replace('Georgia (U.S. state)', 'Georgia')
display(articles)

Unnamed: 0,state,page_title,revid,score
0,Alabama,"Abbeville, Alabama",1171163550,C
1,Alabama,"Adamsville, Alabama",1177621427,C
2,Alabama,"Addison, Alabama",1168359898,C
3,Alabama,"Akron, Alabama",1165909508,GA
4,Alabama,"Alabaster, Alabama",1179139816,C
...,...,...,...,...
22152,Wyoming,"Wamsutter, Wyoming",1169591845,GA
22153,Wyoming,"Wheatland, Wyoming",1176370621,GA
22154,Wyoming,"Worland, Wyoming",1166347917,GA
22155,Wyoming,"Wright, Wyoming",1166334449,GA


#### Step 2.4: Read the population data of each state.

##### Clean the dataset so that it can be merged with articles. 

In [230]:
filename2 = r'C:\Users\mehja\Documents\UW Masters\DATA 512 Human Centered Data Science\Homeworks\Homework 2\data-512-homework_2\data\NST-EST2022-POP.xlsx'
population = pd.read_excel(filename2)
population.dropna(inplace=True)
population = population.reset_index(drop=True)
population.drop(index=range(5), inplace=True)
population = population.reset_index(drop=True)
population = population.drop(columns=['Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3'])
population = population.rename(columns = {'table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts)': 'State', 'Unnamed: 4': 'Population'})
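# strip the leading dots literally; regex=False stops '.' from being treated as a match-any-character pattern
population['State'] = population['State'].str.replace('.', '', regex=False)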
population['Population'] = population['Population'].astype('int64')

display(population)

Unnamed: 0,State,Population
0,Alabama,5074296
1,Alaska,733583
2,Arizona,7359197
3,Arkansas,3045637
4,California,39029342
5,Colorado,5839926
6,Connecticut,3626205
7,Delaware,1018396
8,District of Columbia,671803
9,Florida,22244823


In [200]:
articles_pop = articles.merge(population, left_on='state', right_on ='State', how='left')
articles_pop.drop('State', axis=1, inplace=True)
print(population[~population['State'].isin(articles_pop['state'])]['State'].unique())
articles_pop.isna().sum()


['Connecticut' 'District of Columbia' 'Nebraska' 'Puerto Rico']


state         0
page_title    0
revid         2
score         7
Population    0
dtype: int64
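
The check above lists states that appear in the population data but have no rows in the article data. An alternative way to see mismatches in both directions at once is an outer merge with pandas' `indicator` option. The sketch below uses small made-up frames (`articles_demo`, `population_demo`, and the placeholder state `Elsewhere` and population values are all hypothetical) rather than the real data.

```python
import pandas as pd

# Toy data: the population values are placeholders, not census figures
articles_demo = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Wyoming", "Elsewhere"],
    "page_title": ["Abbeville, Alabama", "Akron, Alabama", "Wright, Wyoming", "Some town"],
})
population_demo = pd.DataFrame({
    "State": ["Alabama", "Wyoming", "Nebraska"],
    "Population": [100, 200, 300],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = articles_demo.merge(population_demo, left_on="state", right_on="State",
                             how="outer", indicator=True)

only_population = merged.loc[merged["_merge"] == "right_only", "State"].tolist()
only_articles = merged.loc[merged["_merge"] == "left_only", "state"].tolist()

print("In population data only:", only_population)
print("In article data only:", only_articles)
```

States that show up as `right_only` have population figures but no city articles in the list, which is the situation the printed list above describes.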

In [156]:
filename3 = r'C:\Users\mehja\Documents\UW Masters\DATA 512 Human Centered Data Science\Homeworks\Homework 2\data-512-homework_2\data\US States by Region - US Census Bureau.xlsx'
divisions = pd.read_excel(filename3)
divisions

Unnamed: 0,REGION,DIVISION,STATE
0,Northeast,,
1,,New England,
2,,,Connecticut
3,,,Maine
4,,,Massachusetts
...,...,...,...
58,,,Alaska
59,,,California
60,,,Hawaii
61,,,Oregon


In [157]:
divisions.drop(columns=['REGION'], inplace=True)
display(divisions)

Unnamed: 0,DIVISION,STATE
0,,
1,New England,
2,,Connecticut
3,,Maine
4,,Massachusetts
...,...,...
58,,Alaska
59,,California
60,,Hawaii
61,,Oregon


In [168]:
target_states1 = ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont']
target_states2 = ['New Jersey', 'New York', 'Pennsylvania']
target_states3 = ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin']
target_states4 = ['Iowa','Kansas','Minnesota','Missouri','Nebraska','North Dakota','South Dakota']
target_states5 = ['Delaware','Florida','Georgia','Maryland','North Carolina','South Carolina','Virginia','West Virginia']
target_states6 = ['Alabama','Kentucky','Mississippi','Tennessee']
target_states7 = ['Arkansas','Louisiana','Oklahoma','Texas']
target_states8 = ['Arizona','Colorado','Idaho','Montana','Nevada','New Mexico','Utah','Wyoming']
target_states9 = ['Alaska','California','Hawaii','Oregon','Washington']

divisions.loc[divisions['STATE'].isin(target_states1), 'DIVISION'] = 'New England'
divisions.loc[divisions['STATE'].isin(target_states2), 'DIVISION'] = 'Middle Atlantic'
divisions.loc[divisions['STATE'].isin(target_states3), 'DIVISION'] = 'East North Central'
divisions.loc[divisions['STATE'].isin(target_states4), 'DIVISION'] = 'West North Central'
divisions.loc[divisions['STATE'].isin(target_states5), 'DIVISION'] = 'South Atlantic'
divisions.loc[divisions['STATE'].isin(target_states6), 'DIVISION'] = 'East South Central'
divisions.loc[divisions['STATE'].isin(target_states7), 'DIVISION'] = 'West South Central'
divisions.loc[divisions['STATE'].isin(target_states8), 'DIVISION'] = 'Mountain'
divisions.loc[divisions['STATE'].isin(target_states9), 'DIVISION'] = 'Pacific'

divisions.dropna(inplace=True)
display(divisions)

Unnamed: 0,DIVISION,STATE
2,New England,Connecticut
3,New England,Maine
4,New England,Massachusetts
5,New England,New Hampshire
6,New England,Rhode Island
7,New England,Vermont
9,Middle Atlantic,New Jersey
10,Middle Atlantic,New York
11,Middle Atlantic,Pennsylvania
14,East North Central,Illinois
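
The division labels above are assigned from hand-typed state lists. Because the spreadsheet places each division name on its own row above its states, a forward fill could derive the same mapping directly from the file. The sketch below shows the idea on a small hand-built frame that mimics the layout displayed earlier; `raw` and `tidy` are hypothetical names, and this is an alternative to the approach used in this notebook, not a description of it.

```python
import pandas as pd

# Hand-built frame mimicking the sheet layout: region row, division row, then state rows
raw = pd.DataFrame({
    "REGION":   ["Northeast", None, None, None, None, None],
    "DIVISION": [None, "New England", None, None, "Middle Atlantic", None],
    "STATE":    [None, None, "Connecticut", "Maine", None, "New Jersey"],
})

tidy = raw.copy()
# carry each division name down onto the state rows beneath it
tidy["DIVISION"] = tidy["DIVISION"].ffill()
# keep only the rows that actually name a state
tidy = tidy.dropna(subset=["STATE"])[["DIVISION", "STATE"]].reset_index(drop=True)

print(tidy)
```

This avoids retyping state names, so a misspelled or omitted state cannot silently fall out of the mapping.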


In [201]:
articles_pop = articles_pop.merge(divisions, left_on='state', right_on ='STATE', how='left')
articles_pop.drop('STATE', axis=1, inplace=True)
articles_pop.isna().sum()

state         0
page_title    0
revid         2
score         7
Population    0
DIVISION      0
dtype: int64

In [202]:
wp_scored_city_articles_by_state = articles_pop.rename(columns={'page_title':'article_title', 'revid':'revision_id', 'score':'article_quality', 'Population':'population','DIVISION':'regional_division'})
new_order = ['state', 'regional_division', 'population', 'article_title', 'revision_id', 'article_quality' ]
wp_scored_city_articles_by_state = wp_scored_city_articles_by_state[new_order]



In [205]:
outputfile = r'C:\Users\mehja\Documents\UW Masters\DATA 512 Human Centered Data Science\Homeworks\Homework 2\data-512-homework_2\data\wp_scored_city_articles_by_state.csv'
wp_scored_city_articles_by_state.to_csv(outputfile)

# Top 10 US states by coverage

In [184]:
articles_by_state = (wp_scored_city_articles_by_state.groupby('state')['revision_id'].count()/(wp_scored_city_articles_by_state.groupby('state')['population'].max())).to_frame(name='total_articles_per_capita').reset_index()
articles_by_state_top = articles_by_state.sort_values('total_articles_per_capita', ascending= False).head(10)
display(articles_by_state_top)


Unnamed: 0,state,total_articles_per_capita
42,Vermont,0.000508
31,North Dakota,0.000457
17,Maine,0.000349
38,South Dakota,0.000342
13,Iowa,0.000326
1,Alaska,0.000203
35,Pennsylvania,0.000197
0,Alabama,0.000182
20,Michigan,0.000177
47,Wyoming,0.00017
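
The per-capita values in the table above are computed as the number of scored articles for a state divided by that state's population, which produces very small decimals. Rescaling to articles per million residents makes them easier to read. The sketch below does that for the first three rows of the table; `per_million` is a hypothetical helper, and the inputs are the rounded values as displayed, so the results are approximate.

```python
def per_million(per_capita):
    """Convert an articles-per-person ratio to articles per million residents."""
    return per_capita * 1_000_000

# Rounded values copied from the table above
top_states = {"Vermont": 0.000508, "North Dakota": 0.000457, "Maine": 0.000349}

for state, value in top_states.items():
    print(f"{state}: about {per_million(value):.0f} articles per million residents")
```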


# Bottom 10 US states by coverage

In [185]:
articles_by_state_bottom = articles_by_state.sort_values('total_articles_per_capita', ascending= True).head(10)
display(articles_by_state_bottom)

Unnamed: 0,state,total_articles_per_capita
30,North Carolina,5e-06
25,Nevada,6e-06
4,California,1.2e-05
2,Arizona,1.2e-05
7,Florida,1.9e-05
33,Oklahoma,1.9e-05
14,Kansas,2.1e-05
18,Maryland,2.5e-05
43,Virginia,3.1e-05
46,Wisconsin,3.3e-05


# Top 10 US states by high quality

In [217]:
# keep only the articles predicted to be high quality: Featured Article (FA) or Good Article (GA)
articles_by_quality = wp_scored_city_articles_by_state[wp_scored_city_articles_by_state['article_quality'].isin(['FA', 'GA'])]

articles_by_quality_state = (articles_by_quality.groupby('state')['revision_id'].count()/(articles_by_quality.groupby('state')['population'].max())).to_frame(name='total_articles_per_capita').reset_index().replace(np.inf, np.nan).dropna(axis=0)
articles_by_quality_top = articles_by_quality_state.sort_values('total_articles_per_capita',ascending= False).head(10)
display(articles_by_quality_top)

Unnamed: 0,state,total_articles_per_capita
42,Vermont,7e-05
47,Wyoming,6.7e-05
38,South Dakota,6.2e-05
45,West Virginia,6e-05
24,Montana,4.9e-05
26,New Hampshire,4.5e-05
35,Pennsylvania,4.4e-05
23,Missouri,4.3e-05
1,Alaska,4.2e-05
27,New Jersey,4.1e-05


# Bottom 10 US states by high quality

In [218]:
articles_by_quality_bottom = articles_by_quality_state.sort_values('total_articles_per_capita', ascending= True).head(10)
display(articles_by_quality_bottom)

Unnamed: 0,state,total_articles_per_capita
30,North Carolina,2e-06
25,Nevada,3e-06
2,Arizona,3e-06
43,Virginia,4e-06
4,California,4e-06
7,Florida,5e-06
29,New York,6e-06
18,Maryland,7e-06
14,Kansas,7e-06
33,Oklahoma,8e-06


# Census divisions by total coverage

In [220]:
articles_by_division = (wp_scored_city_articles_by_state.groupby('regional_division')['revision_id'].count()/(wp_scored_city_articles_by_state.groupby('regional_division')['population'].unique().apply(sum))).to_frame(name='total_articles_per_capita').reset_index()
articles_by_division = articles_by_division.sort_values('total_articles_per_capita',ascending= False)
display(articles_by_division)

Unnamed: 0,regional_division,total_articles_per_capita
7,West North Central,0.000181
4,New England,0.000125
1,East South Central,0.000102
0,East North Central,0.000101
2,Middle Atlantic,9e-05
8,West South Central,5.1e-05
3,Mountain,4.7e-05
6,South Atlantic,3e-05
5,Pacific,2.4e-05
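
The division totals above are built by summing the unique population values within each division. That works as long as no two states in a division share exactly the same population figure, which is very unlikely with real census estimates but worth knowing about. The sketch below uses made-up data (the frame `demo` and its values are hypothetical) to show the edge case and a variant that de-duplicates by state instead.

```python
import pandas as pd

# Toy data: two different states that happen to share the same population
demo = pd.DataFrame({
    "regional_division": ["A", "A", "A"],
    "state": ["X", "X", "Y"],
    "population": [100, 100, 100],
})

# Summing unique population values collapses X and Y into a single 100
unique_sum = demo.groupby("regional_division")["population"].unique().apply(sum)

# Keeping one row per state first counts each state's population exactly once
per_state_sum = (demo.drop_duplicates(subset=["regional_division", "state"])
                     .groupby("regional_division")["population"].sum())

print(int(unique_sum["A"]), int(per_state_sum["A"]))
```

With the data in this notebook both approaches should give the same totals; the second simply does not depend on population values being distinct.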


# Census divisions by high quality coverage

In [222]:
articles_by_quality_division = (articles_by_quality.groupby('regional_division')['revision_id'].count()/(articles_by_quality.groupby('regional_division')['population'].unique().apply(sum))).to_frame(name='total_articles_per_capita').reset_index()

articles_by_quality_division = articles_by_quality_division.sort_values('total_articles_per_capita',ascending= False)
display(articles_by_quality_division)

Unnamed: 0,regional_division,total_articles_per_capita
7,West North Central,3.2e-05
2,Middle Atlantic,2.5e-05
4,New England,2e-05
1,East South Central,1.9e-05
8,West South Central,1.5e-05
0,East North Central,1.5e-05
3,Mountain,1.3e-05
5,Pacific,9e-06
6,South Atlantic,8e-06
