Bazham Khanatayev \
Data 512 HW_2 \
10.15.2023 \
The purpose of this notebook is to take the prepared article data and retrieve a predicted quality score for each article revision from the ORES API.

In [None]:
import pandas as pd

In [None]:
# Read the CSV file into a DataFrame
df_input = pd.read_csv('article_data_clean.csv')

In [1]:
# Display the first few rows of the DataFrame
df_input.head()

Unnamed: 0,Article Title,URL,Revision ID
0,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",1171163550
1,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",1177621427
2,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama",1168359898
3,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama",1165909508
4,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama",1179139816


In [None]:
# Number of NaN values in the 'Revision ID' column
num_nan_revision_id = df_input['Revision ID'].isna().sum()

In [None]:
# Number of NaN values in the 'Article Title' column
num_nan_article_title = df_input['Article Title'].isna().sum()

In [2]:
print(f"Number of NaN values in 'Revision ID' column: {num_nan_revision_id}")
print(f"Number of NaN values in 'Article Title' column: {num_nan_article_title}")

Number of NaN values in 'Revision ID' column: 0
Number of NaN values in 'Article Title' column: 0


The following cells are example code provided by the course instructor. Citation: This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the Creative Commons CC-BY license. Revision 1.0 - August 15, 2023

In [3]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

In [4]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that it's English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#
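
The throttle arithmetic in the constants above can be made explicit. The sketch below is not part of the instructor's code; `throttle_wait` and its parameter names are hypothetical. It reproduces the `(60.0/5000.0)-API_LATENCY_ASSUMED` expression, which budgets 5000 requests per 60 seconds. If the limit granted by a token is expressed over a different window (for example per hour), only the window argument changes. Which window applies is an assumption to check against the token itself.

```python
def throttle_wait(max_requests, window_seconds, latency_assumed=0.002):
    """Seconds to sleep before each request so that at most `max_requests`
    are sent per `window_seconds`, after subtracting an assumed latency.

    Hypothetical helper for illustration; never returns a negative wait.
    """
    seconds_per_request = window_seconds / max_requests
    return max(0.0, seconds_per_request - latency_assumed)


# Same arithmetic as API_THROTTLE_WAIT: 5000 requests per 60 seconds
per_minute_wait = throttle_wait(5000, 60.0)

# The same budget spread over an hour instead (an assumption, not a confirmed limit)
per_hour_wait = throttle_wait(5000, 60.0 * 60.0)

print(f"wait with a 60 s window:   {per_minute_wait:.3f} s")
print(f"wait with a 3600 s window: {per_hour_wait:.3f} s")
```

With a 60 second window the wait is roughly 0.010 s per request, so the sleep itself contributes very little to the total run time of a large batch.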

In [5]:
#   Once you've done the right set up with your Wikimedia account, it should provide you with three different keys, a Client ID,
#   a Client secret, and an Access token.
#
#   In this case I don't want to distribute my keys with the source of the notebook, so I wrote a key manager object that helps
#   track all of my API keys - a username and domain name retrieves the key. The key manager hides the keys on disk separate
#   from the code. A common code idiom to hide API keys will use code to extract the key from an OS environment variable. 

USERNAME = "<Insert your username>"
ACCESS_TOKEN = "<Insert your own Access Code>"
#print(ACCESS_TOKEN)
#
#   You can specify these as constants for your own use - just don't distribute the notebook without removing your token
#
#USERNAME = "<your_wikimedia_username>"
#ACCESS_TOKEN = "<your_wikimedia_provided_access_token_its_a_really_long_string>"
#
#
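
The comment above mentions the common idiom of reading an API key from an OS environment variable. A minimal sketch of that idiom follows. The variable name `WIKIMEDIA_ACCESS_TOKEN` and the helper `load_access_token` are assumptions chosen for illustration; they are not defined by Wikimedia or by the instructor's code.

```python
import os


def load_access_token(var_name="WIKIMEDIA_ACCESS_TOKEN"):
    """Return the access token stored in the environment variable `var_name`.

    The variable name is a hypothetical choice. Raises RuntimeError when the
    variable is missing or empty so the notebook fails early, before any
    API request is attempted.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this notebook"
        )
    return token
```

In the notebook this could replace the hard-coded placeholder, for example `ACCESS_TOKEN = load_access_token()`, so that the token never appears in the saved file.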

The function below requests the ORES score for a single article revision. It was written with the help of Google Bard and is based on the example code cited above.

In [9]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    #    Work on copies so the module-level templates are not modified between calls
    request_data = dict(request_data)
    header_params = dict(header_params)
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
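
To see what the function sends without making a network call, the template handling can be isolated. The sketch below is a simplified illustration, not the instructor's implementation: `build_request` is a hypothetical helper, the templates are copied from the constants cell, and the email address and token are placeholders. Converting the revision id with `int()` is an added choice so that the payload always holds a plain Python integer.

```python
ENDPOINT_TEMPLATE = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
HEADER_TEMPLATE = {
    "User-Agent": "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    "Content-Type": "application/json",
    "Authorization": "Bearer {access_token}",
}
DATA_TEMPLATE = {"lang": "en", "rev_id": "", "features": True}


def build_request(rev_id, email_address, access_token,
                  model_name="enwiki-articlequality"):
    """Return (url, headers, payload) for one scoring request.

    Hypothetical helper: fills the templates and returns new objects,
    leaving the module-level templates untouched.
    """
    url = ENDPOINT_TEMPLATE.format(model_name=model_name)
    headers = {
        key: value.format(email_address=email_address, access_token=access_token)
        for key, value in HEADER_TEMPLATE.items()
    }
    payload = dict(DATA_TEMPLATE, rev_id=int(rev_id))
    return url, headers, payload


# Placeholder credentials, for illustration only
url, headers, payload = build_request(1171163550, "someone@example.edu", "PLACEHOLDER_TOKEN")
print(url)
print(headers["User-Agent"])
print(payload)
```

Building a fresh payload on every call means a revision id from one request can never leak into the next one through a shared dictionary.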


Let's test the API with just the first 10 rows.

In [None]:
from tqdm import tqdm
import pandas as pd

In [None]:
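def get_quality_score(revision_id, article_title):
    # article_title is accepted for readability at the call site; it is not used in the request
    email = "<put your email here>"
    access_token = ACCESS_TOKEN  # Set in the credentials cell above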
    
    # Get the ORES score for the given revision_id
    response = request_ores_score_per_article(
        article_revid=revision_id,
        email_address=email,
        access_token=access_token
    )
    
    # Handle the response: extract the quality score, or return 'Issue' if the response is missing or malformed
    try:
        # Assuming the highest probability score indicates the quality
        scores = response['enwiki']['scores'][str(revision_id)]['articlequality']['score']['probability']
        return max(scores, key=scores.get)
    except (TypeError, KeyError, AttributeError):  # handling possible issues in the response
        return "Issue"
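
The extraction step can be tried on its own with a hand-built dictionary. The sketch below assumes the response is nested exactly as the lookup in `get_quality_score` expects. The probability values in `sample_response` are invented for illustration and are not real ORES output, and `top_quality_label` is a hypothetical name.

```python
def top_quality_label(response, revision_id, default="Issue"):
    """Return the class with the highest probability, or `default` when the
    response is None, lacks the expected keys, or has no probabilities."""
    try:
        probabilities = (
            response["enwiki"]["scores"][str(revision_id)]
            ["articlequality"]["score"]["probability"]
        )
        return max(probabilities, key=probabilities.get)
    except (TypeError, KeyError, AttributeError, ValueError):
        return default


# Invented numbers, shaped like the path used in get_quality_score
sample_response = {
    "enwiki": {
        "scores": {
            "1171163550": {
                "articlequality": {
                    "score": {
                        "probability": {
                            "B": 0.20, "C": 0.45, "FA": 0.02,
                            "GA": 0.18, "Start": 0.13, "Stub": 0.02,
                        }
                    }
                }
            }
        }
    }
}

print(top_quality_label(sample_response, 1171163550))
print(top_quality_label(None, 1171163550))
print(top_quality_label({"unexpected": "shape"}, 1171163550))
```

Because every failure collapses to the same `'Issue'` label, the label alone cannot say whether a row failed because of a network error, an authentication problem, or an unexpected response body. Keeping the raw response for failed rows would make that distinction possible.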

In [None]:
# Sample the first 10 rows for testing
subset_df = df_input.iloc[:10].copy()

In [None]:
# Apply the function to each row in the subset with a tqdm progress bar
subset_df['Quality Score Pred'] = [get_quality_score(series['Revision ID'], series['Article Title']) for _, series in tqdm(subset_df.iterrows(), total=subset_df.shape[0])]

In [12]:
subset_df

100%|██████████| 10/10 [00:08<00:00,  1.22it/s]


Unnamed: 0,Article Title,URL,Revision ID,Quality Score Pred
0,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",1171163550,C
1,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",1177621427,C
2,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama",1168359898,C
3,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama",1165909508,GA
4,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama",1179139816,C
5,"Albertville, Alabama","https://en.wikipedia.org/wiki/Albertville,_Ala...",1179198677,C
6,"Alexander City, Alabama","https://en.wikipedia.org/wiki/Alexander_City,_...",1179140073,GA
7,"Aliceville, Alabama","https://en.wikipedia.org/wiki/Aliceville,_Alabama",1167792390,GA
8,"Allgood, Alabama","https://en.wikipedia.org/wiki/Allgood,_Alabama",1165909718,C
9,"Altoona, Alabama","https://en.wikipedia.org/wiki/Altoona,_Alabama",1165909823,C


In [None]:
# Copy df_input to a new DataFrame
quality_scores_df = df_input.copy()

In [None]:
# Apply the function to each row in the quality_scores_df with a tqdm progress bar
quality_scores_df['Quality Score Pred'] = [get_quality_score(series['Revision ID'], series['Article Title']) for _, series in tqdm(quality_scores_df.iterrows(), total=quality_scores_df.shape[0])]

In [13]:
quality_scores_df

100%|██████████| 21484/21484 [3:02:32<00:00,  1.96it/s]   


Unnamed: 0,Article Title,URL,Revision ID,Quality Score Pred
0,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama",1171163550,C
1,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama",1177621427,C
2,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama",1168359898,C
3,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama",1165909508,GA
4,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama",1179139816,C
...,...,...,...,...
21479,"Wamsutter, Wyoming","https://en.wikipedia.org/wiki/Wamsutter,_Wyoming",1169591845,Issue
21480,"Wheatland, Wyoming","https://en.wikipedia.org/wiki/Wheatland,_Wyoming",1176370621,Issue
21481,"Worland, Wyoming","https://en.wikipedia.org/wiki/Worland,_Wyoming",1166347917,Issue
21482,"Wright, Wyoming","https://en.wikipedia.org/wiki/Wright,_Wyoming",1166334449,Issue


Let us look at the results and count the rows that could not be scored.

In [14]:
issue_count = (quality_scores_df['Quality Score Pred'] == 'Issue').sum()
print(f"There are {issue_count} rows with 'Issue' as their Quality Score Pred.")


There are 5136 rows with 'Issue' as their Quality Score Pred.


We will create two DataFrames, one with the rows that were scored successfully and one with the rows marked 'Issue', and save each to its own CSV file.

In [None]:
# Filter rows without issue
no_issue_df = quality_scores_df[quality_scores_df['Quality Score Pred'] != 'Issue']

In [15]:
# Filter rows with issue
issue_df = quality_scores_df[quality_scores_df['Quality Score Pred'] == 'Issue']

In [16]:
# Save to CSV files
issue_df.to_csv('issue_quality_scores.csv', index=False)

In [None]:
no_issue_df.to_csv('quality_scores_pred.csv', index=False)
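
Since the rows marked 'Issue' are saved separately, one possible follow-up is to request scores for only those revisions again. The sketch below is an assumption about how that could be done, not something this notebook runs. `rescore_failures` is a hypothetical helper, `score_fn` stands in for any function that maps a revision id to a label, and the pause between rounds is an arbitrary choice.

```python
import time


def rescore_failures(revision_ids, score_fn, failure_label="Issue",
                     max_rounds=3, pause_seconds=0.0):
    """Call `score_fn` for each revision id, then retry only the ids that
    still return `failure_label`, for up to `max_rounds` rounds in total.

    Returns a dict mapping each revision id to its last label.
    """
    results = {}
    pending = list(revision_ids)
    for round_number in range(max_rounds):
        if not pending:
            break
        # Wait longer before each later round; the first round starts immediately
        if round_number > 0 and pause_seconds > 0:
            time.sleep(pause_seconds * round_number)
        still_failing = []
        for rev_id in pending:
            label = score_fn(rev_id)
            results[rev_id] = label
            if label == failure_label:
                still_failing.append(rev_id)
        pending = still_failing
    return results
```

In this notebook the call might look like `rescore_failures(issue_df['Revision ID'], lambda rev_id: get_quality_score(rev_id, None))`, with the returned labels mapped back onto the 'Quality Score Pred' column.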