# Article Page Info MediaWiki API Example
This example illustrates how to access page info data using the [MediaWiki Action API for the English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.1 - August 14, 2023



In [61]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
import pandas as pd
import os
import csv
import openpyxl
import numpy as np

The example relies on some constants that help make the code a bit more readable.

In [3]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<sboral@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
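The throttle constants above come down to simple arithmetic: divide the time window by the number of requests allowed in it, then subtract the assumed latency. A minimal sketch of that calculation follows; the helper name `throttle_wait` is hypothetical and not part of the original notebook.

```python
def throttle_wait(max_requests, per_seconds, assumed_latency=0.002):
    """Return the seconds to sleep before each request to stay under a rate cap."""
    wait = (per_seconds / max_requests) - assumed_latency
    # Never return a negative sleep time
    return max(wait, 0.0)

# 100 requests per second, as used for the page info requests
pageinfo_wait = throttle_wait(100, 1.0)
# 5000 requests per 60 seconds, as used later for the LiftWing requests
liftwing_wait = throttle_wait(5000, 60.0)
print(pageinfo_wait, liftwing_wait)
```

With the numbers used in this notebook, this gives a wait of roughly 8 ms per page info request and roughly 10 ms per LiftWing request.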


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [4]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
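    # article title can be as a parameter to the call or in the request_template
    # work on a copy so the shared template constant is not modified between calls
    request_template = dict(request_template)
    if article_title:
        request_template['titles'] = article_title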

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


### Step 1: Data Acquisition

#### Step 1A: Retrieving and loading article data from Wikipedia for all cities. 
First we will load the cities data into a dataframe.
Then we will pass the page titles from this dataframe to the Wikipedia API request function to get details on these cities.

In [5]:
cities_df=pd.read_csv('/Users/sayo/Documents/Projects/Home-Projects/ScratchPad/us_cities_by_state_SEPT.2023.csv')
cities_df.head(5)
#cities_df['page_title'][1]

Unnamed: 0,state,page_title,url
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama"
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama"
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama"


In [6]:
# fetch city data
print(f"Getting page info data for: {cities_df['page_title'][1]}")
info = request_pageinfo_per_article(cities_df['page_title'][1])
print(json.dumps(info['query']['pages'],indent=4))

Getting page info data for: Adamsville, Alabama
{
    "104761": {
        "pageid": 104761,
        "ns": 0,
        "title": "Adamsville, Alabama",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2023-10-10T22:35:37Z",
        "lastrevid": 1177621427,
        "length": 18040,
        "talkid": 281272,
        "fullurl": "https://en.wikipedia.org/wiki/Adamsville,_Alabama",
        "editurl": "https://en.wikipedia.org/w/index.php?title=Adamsville,_Alabama&action=edit",
        "canonicalurl": "https://en.wikipedia.org/wiki/Adamsville,_Alabama"
    }
}
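The response above nests the page record under its page ID, so code that consumes it has to dig one level down without knowing the key in advance. The sketch below shows one way to do that. The helper name `extract_page_record` is hypothetical, and the handling of missing pages assumes the default JSON format, where a title that does not exist comes back under a negative page ID with a `missing` key.

```python
def extract_page_record(json_response):
    """Return the single page record from an action=query response, or None."""
    if not json_response:
        return None
    pages = json_response.get("query", {}).get("pages", {})
    for page_id, record in pages.items():
        # A negative page ID or a "missing" key means the title was not found
        if "missing" in record or int(page_id) < 0:
            return None
        return record
    return None

# Trimmed copy of the response shown above
sample = {"query": {"pages": {"104761": {
    "pageid": 104761, "title": "Adamsville, Alabama", "lastrevid": 1177621427}}}}
record = extract_page_record(sample)
print(record["title"], record["lastrevid"])
```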


In [9]:
# Request page info for every city and write the results to a JSON file
formatted_view = {}
for page_title in cities_df['page_title']:
    print("Getting page info data for:", page_title)
    info = request_pageinfo_per_article(page_title)
    if info is None:
        # the request function returns None when the request fails; skip and continue
        continue
    formatted_view[page_title] = info['query']['pages']
# Mode 'w' overwrites any existing file, so no separate delete step is needed
with open('wikipedia_us_city_info.json', 'w') as file:
    json.dump(formatted_view, file, indent=4)

Getting page info data for: Abbeville, Alabama
Getting page info data for: Adamsville, Alabama
Getting page info data for: Addison, Alabama
Getting page info data for: Akron, Alabama
Getting page info data for: Alabaster, Alabama
Getting page info data for: Albertville, Alabama
Getting page info data for: Alexander City, Alabama
Getting page info data for: Aliceville, Alabama
Getting page info data for: Allgood, Alabama
Getting page info data for: Altoona, Alabama
Getting page info data for: Andalusia, Alabama
Getting page info data for: Anderson, Lauderdale County, Alabama
Getting page info data for: Anniston, Alabama
Getting page info data for: Arab, Alabama
Getting page info data for: Ardmore, Alabama
Getting page info data for: Argo, Alabama
Getting page info data for: Ariton, Alabama
Getting page info data for: Arley, Alabama
Getting page info data for: Ashford, Alabama
Getting page info data for: Ashland, Alabama
Getting page info data for: Ashville, Alabama
Getting page info dat
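The loop above makes one request per title, which means more than twenty thousand requests for the full city list. The Action API also accepts several titles in one request, separated by `|`; to my knowledge the limit is 50 titles per request for ordinary clients, so treat that number as an assumption and check the API documentation. A minimal sketch of the batching step, using a hypothetical helper `chunk_titles`:

```python
def chunk_titles(titles, batch_size=50):
    """Yield pipe-joined batches of titles suitable for the 'titles' parameter."""
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])

example_titles = ["Abbeville, Alabama", "Adamsville, Alabama", "Addison, Alabama"]
batches = list(chunk_titles(example_titles, batch_size=2))
print(batches)
```

Each batch string could then be passed as `article_title` to `request_pageinfo_per_article`. The response would hold several page records keyed by page ID, so the `title` field of each record, rather than the request string, would be needed to map results back to cities.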

#### Step 1B: Getting Population data for these cities

In [115]:
population_df=pd.read_excel('/Users/sayo/Documents/Projects/Home-Projects/ScratchPad/NST-EST2022-POP.xlsx'
                            ,header=[3]
                            ,names=['Geography'
                                    ,'Pop_estimate_april2020'
                                    ,'Pop_estimate_july2020'
                                    ,'Pop_estimate_july2021'
                                    ,'Pop_estimate_july2022']
                            ,skiprows=5
                            )
population_df['State']=population_df['Geography'].str.replace(".","",regex=False)
population_df.drop(columns=['Geography'],inplace=True)
population_df.head(10)

Unnamed: 0,Pop_estimate_april2020,Pop_estimate_july2020,Pop_estimate_july2021,Pop_estimate_july2022,State
0,5024356.0,5031362.0,5049846.0,5074296.0,Alabama
1,733378.0,732923.0,734182.0,733583.0,Alaska
2,7151507.0,7179943.0,7264877.0,7359197.0,Arizona
3,3011555.0,3014195.0,3028122.0,3045637.0,Arkansas
4,39538245.0,39501653.0,39142991.0,39029342.0,California
5,5773733.0,5784865.0,5811297.0,5839926.0,Colorado
6,3605942.0,3597362.0,3623355.0,3626205.0,Connecticut
7,989957.0,992114.0,1004807.0,1018396.0,Delaware
8,689546.0,670868.0,668791.0,671803.0,District of Columbia
9,21538226.0,21589602.0,21828069.0,22244823.0,Florida
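The cleanup step above removes dots from the `Geography` column, presumably because the state names in the spreadsheet carry a leading dot. The sketch below compares that approach with stripping only leading dots. The sample values are made up for illustration, and `.St. Example` is a hypothetical name included only to show where the two approaches differ.

```python
import pandas as pd

raw = pd.DataFrame({"Geography": [".Alabama", ".Alaska", ".St. Example"]})

# str.replace with regex=False removes every literal dot in the name
raw["State_replace"] = raw["Geography"].str.replace(".", "", regex=False)
# str.lstrip(".") removes only the leading dots and keeps interior ones
raw["State_lstrip"] = raw["Geography"].str.lstrip(".")
print(raw)
```

For US state names the two give the same result, since none of them contains an interior dot.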


#### Step 1C: Getting Region data

In [20]:
csv_file_path='/Users/sayo/Documents/Projects/Home-Projects/ScratchPad/US States by Region - US Census Bureau.xlsx'

In [22]:
# Read the Excel file
df = pd.read_excel(csv_file_path, header=0)

# Remove empty rows
df = df.dropna(how='all')
# Create a nested dictionary from the DataFrame
data = {}
current_region = None
current_division = None
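# The sheet is read as if each row fills one of the first three columns: region, division or state.
# Positional access goes through .iloc because the row Series is labelled by column name.
for index, row in df.iterrows():
    if pd.notnull(row.iloc[0]):
        current_region = row.iloc[0]
        data[current_region] = {}
    elif pd.notnull(row.iloc[1]):
        current_division = row.iloc[1]
        data[current_region][current_division] = []
    elif pd.notnull(row.iloc[2]):
        state = row.iloc[2]
        data[current_region][current_division].append(state)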

# Convert the nested dictionary to JSON
json_data = json.dumps(data, indent=4)

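# Write the JSON data to a file in the notebook directory.
# Mode 'w' overwrites any leftover file from an earlier run instead of appending to it.
if os.path.exists('staging_outputs/region_df.json'):
    os.remove('staging_outputs/region_df.json')
with open('region_df.json', 'w') as json_file:
    json_file.write(json_data)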

os.replace('/Users/sayo/Documents/Projects/Home-Projects/Human-Centered-Data-Science/data-512-homework_2/notebooks/region_df.json',
           '/Users/sayo/Documents/Projects/Home-Projects/Human-Centered-Data-Science/data-512-homework_2/staging_outputs/region_df.json')

print("JSON data has been written to region_df.json in staging_outputs")

JSON data has been written to region_df.json in staging_outputs
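The row-by-row loop above builds a nested dictionary. An alternative, sketched here under the assumption that every row of the sheet fills exactly one of the region, division or state columns, is to forward-fill the region and division columns and keep only the state rows. The column names and the tiny stand-in table are assumptions for illustration, not the real spreadsheet.

```python
import pandas as pd

# Tiny stand-in for the spreadsheet layout: each row fills exactly one column
sheet = pd.DataFrame({
    "REGION":   ["South", None, None, None, "West", None, None],
    "DIVISION": [None, "East South Central", None, None, None, "Pacific", None],
    "STATE":    [None, None, "Alabama", "Kentucky", None, None, "Oregon"],
})

filled = sheet.copy()
# Carry the most recent region and division down to the rows below them
filled[["REGION", "DIVISION"]] = filled[["REGION", "DIVISION"]].ffill()
# Only rows that name a state are data rows; region and division header rows are dropped
states = filled.dropna(subset=["STATE"]).reset_index(drop=True)
print(states)
```

This produces one row per state with its region and division already attached, which is the shape needed for the merges later in the notebook.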


#### Step 1D: Converting JSON into Dataframe for ease of merging

#### First we need to convert the city info JSON output to Dataframe for ease of merging.

In [12]:
## convert city info json to pandas DF
with open('wikipedia_us_city_info.json', "r") as read_content: 
    data_json=json.load(read_content)

#print(data_json.items())
data_list=[]
for city, city_data in data_json.items():
    for pageid, page_data in city_data.items():
        flat_data = {
            #"State":city.split(',',2)[1],
            #"City": city.split(',',2)[0],
            "City": city,
            "PageID": pageid,
            "Title": page_data["title"],
            "LatestRevisionid":page_data["lastrevid"]
        }
        data_list.append(flat_data)

city_wiki_df = pd.DataFrame(data_list)
print(len(city_wiki_df))
print(city_wiki_df.head(-10))

21519
                             City  PageID                       Title  \
0              Abbeville, Alabama  104730          Abbeville, Alabama   
1             Adamsville, Alabama  104761         Adamsville, Alabama   
2                Addison, Alabama  105188            Addison, Alabama   
3                  Akron, Alabama  104726              Akron, Alabama   
4              Alabaster, Alabama  105109          Alabaster, Alabama   
...                           ...     ...                         ...   
21504           Sinclair, Wyoming  140079           Sinclair, Wyoming   
21505  Star Valley Ranch, Wyoming  140145  Star Valley Ranch, Wyoming   
21506           Sundance, Wyoming  140089           Sundance, Wyoming   
21507           Superior, Wyoming  140218           Superior, Wyoming   
21508          Ten Sleep, Wyoming  140241          Ten Sleep, Wyoming   

       LatestRevisionid  
0            1171163550  
1            1177621427  
2            1168359898  
3            

While loading the city data from Wikipedia, I noticed a few invalid city names. All valid city names follow the format `City, State`. I will use a regex to separate the rows that match that format from the rows that do not.

In [43]:
# Regular expression pattern for valid 'city, State' format
#valid_pattern1 = r'^[A-Za-z\s\.\'()\'-]+,\s[A-Za-z\s]+$'
#valid_pattern2 = r'^[A-Za-z\s\.\'()\'-]+,\s[A-Za-z\s]+,\s[A-Za-z\s]+$'

In [48]:
# Invalid rows
#invalid_rows = city_wiki_df[~city_wiki_df['Title'].str.contains(valid_pattern1, na=False)
#                            & ~city_wiki_df['Title'].str.contains(valid_pattern2, na=False)]
#print(invalid_rows)
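The two cells above are commented out, so here is a small self-contained sketch of how the check could work. The patterns are lightly tidied versions of the commented-out ones. The first two titles come from the city list, and `List of cities in Alabama` is a hypothetical invalid title added for illustration.

```python
import pandas as pd

# 'City, State' and 'City, County, State'
valid_pattern1 = r"^[A-Za-z\s.'()-]+,\s[A-Za-z\s]+$"
valid_pattern2 = r"^[A-Za-z\s.'()-]+,\s[A-Za-z\s]+,\s[A-Za-z\s]+$"

titles = pd.Series([
    "Abbeville, Alabama",
    "Anderson, Lauderdale County, Alabama",
    "List of cities in Alabama",   # hypothetical invalid title
])

# A title is valid when it matches either pattern
is_valid = titles.str.match(valid_pattern1) | titles.str.match(valid_pattern2)
invalid_titles = titles[~is_valid]
print(invalid_titles.tolist())
```

Note that these patterns only accept unaccented Latin letters, so a city name with other characters would be flagged as invalid and would need a manual check.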

### Step 2: Requesting ORES scores through LiftWing ML Service API

In [16]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/5000.0)-API_LATENCY_ASSUMED

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2023",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
#ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that it's English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#

In [17]:
#   Once you've done the right set up with your Wikimedia account, it should provide you with three different keys, a Client ID,
#   a Client secret, and an Access token.
#
#   In this case I don't want to distribute my keys with the source of the notebook, so I wrote a key manager object that helps
#   track all of my API keys - a username and domain name retrieves the key. The key manager hides the keys on disk separate
#   from the code. A common code idiom to hide API keys will use code to extract the key from an OS environment variable. 

USERNAME = "" #<redacted>
ACCESS_TOKEN = "" #<redacted>
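The comment above mentions the common idiom of reading an API key from an OS environment variable. A minimal sketch of that idiom follows. The variable name `WIKIMEDIA_ACCESS_TOKEN` and the helper `load_access_token` are hypothetical; any name would do as long as it is set in the shell before the notebook starts.

```python
import os

def load_access_token(env_var="WIKIMEDIA_ACCESS_TOKEN"):
    """Read the access token from the environment; the variable name is hypothetical."""
    token = os.environ.get(env_var, "")
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before requesting scores.")
    return token
```

With this in place, `ACCESS_TOKEN = load_access_token()` would keep the token out of the notebook source.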

#### Define a function to make the ORES API request

The API request is made through a function that encapsulates the call and makes it reusable in other notebooks. The function is parameterized and relies on the constants above for some important default parameters. The primary assumption is that it will be used to request data for a set of article revisions, so the main parameter is 'article_revid'. You should be able to pass in a new article revision id on each call and get back a Python dictionary as the result. A valid result is a dictionary that contains the probabilities that the specific revision belongs to each of six article quality levels. Generally, the quality level with the highest probability is taken as the quality level of the article. This can be tricky when two (or more) quality levels are highly probable.

In [18]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        #response = requests.post(request_url, headers=headers, data=request_data)        
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
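As noted before the function definition, picking the most probable quality level can be tricky when two levels are close. The sketch below ranks the levels and reports the margin between the top two. It assumes the response carries a dictionary of per-level probabilities next to the `prediction` value; the helper name `top_two_classes` is hypothetical and the probabilities are made-up illustration values, not real model output.

```python
def top_two_classes(probabilities):
    """Return (best_label, runner_up_label, margin) from a label -> probability dict."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best, best_p), (second, second_p) = ranked[0], ranked[1]
    return best, second, best_p - second_p

# Illustrative values only
example_probabilities = {"Stub": 0.05, "Start": 0.20, "C": 0.35,
                         "B": 0.30, "GA": 0.07, "FA": 0.03}
best, second, margin = top_two_classes(example_probabilities)
print(best, second, round(margin, 2))
```

A small margin is a signal that the predicted level is a close call and may deserve a second look.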


In [72]:
#
#  
# Which article to score - the title is taken from the city dataframe built above

article_title = city_wiki_df['Title'][1]
#
#   We have to pass in some parameters used for the request header. Create a copy of the template and fill in some fields.
hparams = REQUEST_HEADER_PARAMS_TEMPLATE.copy()
hparams['email_address'] = "sboral@uw.edu"
hparams['access_token'] = ACCESS_TOKEN
#
#    We can also do this with the request data - although this might not be as useful as with the header params
rd = ORES_REQUEST_DATA_TEMPLATE.copy()
rd['rev_id']= 1171098806
print(rd)
#
print(f"ORES scores for '{article_title}' with revid: {1171098806}")
#
#    Make the call, just pass in the article revision ID and the header parameters
score = request_ores_score_per_article(article_revid = 1171098806, 
                                       email_address="sboral@uw.edu", 
                                       access_token=ACCESS_TOKEN,
                                       header_params=hparams)

#
#    Output the result
print(score['enwiki']['scores'][str(1171098806)]['articlequality']['score']['prediction'])
#

{'lang': 'en', 'rev_id': 1171098806, 'features': True}
ORES scores for 'Abbeville, Alabama' with revid: 1171098806
C


In [None]:
# This conversion is required to make the JSON data serializable
city_wiki_df['LatestRevisionid'] = city_wiki_df['LatestRevisionid'].astype(np.int64)
# Adding an ORES score column and setting it to a default value
city_wiki_df['ORES_Score']='0'

In [188]:
ores_scores_df=pd.DataFrame()

In [229]:

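# Request an ORES score for every city article.
# Note: this cell uses the 'Article_Title' and 'Revision_ID' column names, which are created by the rename in Step 3.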
for i in range(len(city_wiki_df)):
    article_title = city_wiki_df['Article_Title'][i]
    
#    We can also do this with the request data - although this might not be as useful as with the header params
    rd = ORES_REQUEST_DATA_TEMPLATE.copy()
    rd['rev_id']= city_wiki_df['Revision_ID'][i].tolist()
#   We have to pass in some parameters used for the request header. Create a copy of the template and fill in some fields.
    hparams = REQUEST_HEADER_PARAMS_TEMPLATE.copy()
    hparams['email_address'] = "sboral@uw.edu"
    hparams['access_token'] = ACCESS_TOKEN  
#
    print(f"ORES scores for '{article_title}' with revid: {city_wiki_df['Revision_ID'][i]}")
#
#    Make the call, just pass in the article revision ID and the header parameters
    revid=city_wiki_df['Revision_ID'][i]
    score = request_ores_score_per_article(
                                           request_data=rd,
                                           header_params=hparams)


#    Output the result
    ores_scores_df.loc[i,'Revision_ID']=revid
    ores_scores_df.loc[i,'ORES_score']= [score]
    #city_wiki_df.at[i,'ORES_Score']=score['enwiki']['scores'][str(city_wiki_df['LatestRevisionid'][i])]['articlequality']['score']['prediction']
#
ores_scores_df.to_csv('ores_scores_df2.csv')

ORES scores for 'Abbeville, Alabama' with revid: 1171163550
ORES scores for 'Adamsville, Alabama' with revid: 1177621427
ORES scores for 'Addison, Alabama' with revid: 1168359898
ORES scores for 'Akron, Alabama' with revid: 1165909508
ORES scores for 'Alabaster, Alabama' with revid: 1179139816
ORES scores for 'Albertville, Alabama' with revid: 1179198677
ORES scores for 'Alexander City, Alabama' with revid: 1179140073
ORES scores for 'Aliceville, Alabama' with revid: 1167792390
ORES scores for 'Allgood, Alabama' with revid: 1165909718
ORES scores for 'Altoona, Alabama' with revid: 1165909823
ORES scores for 'Andalusia, Alabama' with revid: 1179141586
ORES scores for 'Anderson, Lauderdale County, Alabama' with revid: 662691565
ORES scores for 'Anniston, Alabama' with revid: 1176049382
ORES scores for 'Arab, Alabama' with revid: 1171375371
ORES scores for 'Ardmore, Alabama' with revid: 1176903479
ORES scores for 'Argo, Alabama' with revid: 1177623964
ORES scores for 'Ariton, Alabama' wit

In [256]:
ores_scores_df['ORES_prediction']=0

In [260]:
# Extract 'prediction' values with a loop, falling back to None when a score is missing
predictions = []
for i, row in ores_scores_df.iterrows():
    try:
        prediction = row['ORES_score'][0]['enwiki']['scores'][str(row['Revision_ID'])]['articlequality']['score']['prediction']
        predictions.append(prediction)
    except (KeyError, TypeError):
        # Handle missing or inconsistent data gracefully
        predictions.append(None)  # Or any default value you prefer for invalid entries

# Assign the 'predictions' list to a new column in the DataFrame
ores_scores_df['ORES_Prediction'] = predictions
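One thing to watch in the lookup above is the type of `Revision_ID`. The response is keyed by the revision ID as a string of digits, so if the column ends up stored as a float, `str()` produces a key with a trailing `.0` and the lookup silently falls into the `KeyError` branch. A minimal sketch of a normalizing helper, with the hypothetical name `revision_key`:

```python
def revision_key(rev_id):
    """Convert a revision ID stored as int, float or str into the string key."""
    return str(int(float(rev_id)))

print(str(1171163550.0))           # the key a float column would produce
print(revision_key(1171163550.0))  # the key the response actually uses
```

Whether this applies depends on how the dataframe was built, so it is worth checking `ores_scores_df['Revision_ID'].dtype` before relying on the predictions.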

In [263]:
ores_scores_df.head(5)

Unnamed: 0,Revision_ID,ORES_score,ORES_Prediction
0,1171163550,[{'enwiki': {'models': {'articlequality': {'ve...,C
1,1177621427,[{'enwiki': {'models': {'articlequality': {'ve...,C
2,1168359898,[{'enwiki': {'models': {'articlequality': {'ve...,C
3,1165909508,[{'enwiki': {'models': {'articlequality': {'ve...,GA
4,1179139816,[{'enwiki': {'models': {'articlequality': {'ve...,C


### Step 3: Combining all datasets

Retrieving State column.

In [None]:
city_wiki_df=city_wiki_df.merge(cities_df,how='left',left_on='Title',right_on='page_title')

In [147]:
city_wiki_df.drop(columns=['state_x','state_y','url_x','url_y','page_title_x','page_title_y'],inplace=True)
city_wiki_df.head(5)

Unnamed: 0,City,PageID,Title,LatestRevisionid,ORES_Score,state
0,"Abbeville, Alabama",104730,"Abbeville, Alabama",1171163550,0,Alabama
1,"Abbeville, Alabama",104730,"Abbeville, Alabama",1171163550,0,Alabama
2,"Abbeville, Alabama",104730,"Abbeville, Alabama",1171163550,0,Alabama
3,"Abbeville, Alabama",104730,"Abbeville, Alabama",1171163550,0,Alabama
4,"Abbeville, Alabama",104730,"Abbeville, Alabama",1171163550,0,Alabama
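The output above shows the same city repeated on several rows. One possible cause, offered here as an assumption rather than a diagnosis, is that the merge key is not unique on the right-hand side (or that the merge cell was run more than once). The sketch below reproduces the effect on made-up data and shows how `drop_duplicates` together with the `validate` argument of `merge` guards against it.

```python
import pandas as pd

left = pd.DataFrame({"Title": ["Abbeville, Alabama", "Akron, Alabama"]})
right = pd.DataFrame({
    "page_title": ["Abbeville, Alabama", "Abbeville, Alabama", "Akron, Alabama"],
    "state": ["Alabama", "Alabama", "Alabama"],
})

# A repeated key on the right multiplies the matching rows on the left
naive = left.merge(right, how="left", left_on="Title", right_on="page_title")

# Deduplicate the right side first; validate raises if its keys are still not unique
deduped = left.merge(right.drop_duplicates(subset="page_title"), how="left",
                     left_on="Title", right_on="page_title", validate="many_to_one")
print(len(naive), len(deduped))
```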


Cleaning up population data

In [122]:
# dropping all rows having null/NaN in them 
population_df.dropna(axis=0, inplace=True)

# Keeping only required columns
population_df_final=population_df.loc[:,['State','Pop_estimate_july2022']]
population_df_final.head(5)

Unnamed: 0,State,Pop_estimate_july2022
0,Alabama,5074296.0
1,Alaska,733583.0
2,Arizona,7359197.0
3,Arkansas,3045637.0
4,California,39029342.0


Joining Population data with city wiki data.

In [150]:
city_wiki_df=city_wiki_df.merge(population_df_final,how='left',left_on='state',right_on='State')

In [152]:
city_wiki_df.drop(columns=['State'],inplace=True)

In [187]:
# Final population data merged with wiki data

city_wiki_df.head(-10)

Unnamed: 0,index,City,PageID,Title,LatestRevisionid,ORES_Score,Pop_estimate_july2022,Division,State
0,0,"Abbeville, Alabama",104730,"Abbeville, Alabama",1171163550,C,5074296.0,East South Central,Alabama
1,8,"Adamsville, Alabama",104761,"Adamsville, Alabama",1177621427,0,5074296.0,East South Central,Alabama
2,16,"Addison, Alabama",105188,"Addison, Alabama",1168359898,0,5074296.0,East South Central,Alabama
3,24,"Akron, Alabama",104726,"Akron, Alabama",1165909508,0,5074296.0,East South Central,Alabama
4,32,"Alabaster, Alabama",105109,"Alabaster, Alabama",1179139816,0,5074296.0,East South Central,Alabama
...,...,...,...,...,...,...,...,...,...
18148,22957,"Sinclair, Wyoming",140079,"Sinclair, Wyoming",1162125136,0,581381.0,Mountain,Wyoming
18149,22958,"Star Valley Ranch, Wyoming",140145,"Star Valley Ranch, Wyoming",1166752249,0,581381.0,Mountain,Wyoming
18150,22959,"Sundance, Wyoming",140089,"Sundance, Wyoming",1166330861,0,581381.0,Mountain,Wyoming
18151,22960,"Superior, Wyoming",140218,"Superior, Wyoming",1166330907,0,581381.0,Mountain,Wyoming


Next we will merge the Division data from the Region data into the city wiki dataframe.
We previously wrote the Region data to a JSON file, region_df.json. Let us load that back and convert it into a dataframe.

In [159]:
with open('/Users/sayo/Documents/Projects/Home-Projects/Human-Centered-Data-Science/data-512-homework_2/staging_outputs/region_df.json', "r") as read_content: 
    region_data_json=json.load(read_content)

region_data_json

{'Northeast': {'New England': ['Connecticut',
   'Maine',
   'Massachusetts',
   'New Hampshire',
   'Rhode Island',
   'Vermont'],
  'Middle Atlantic': ['New Jersey', 'New York', 'Pennsylvania']},
 'Midwest': {'East North Central': ['Illinois',
   'Indiana',
   'Michigan',
   'Ohio',
   'Wisconsin'],
  'West North Central': ['Iowa',
   'Kansas',
   'Minnesota',
   'Missouri',
   'Nebraska',
   'North Dakota',
   'South Dakota']},
 'South': {'South Atlantic': ['Delaware',
   'Florida',
   'Georgia',
   'Maryland',
   'North Carolina',
   'South Carolina',
   'Virginia',
   'West Virginia'],
  'East South Central': ['Alabama', 'Kentucky', 'Mississippi', 'Tennessee'],
  'West South Central': ['Arkansas', 'Louisiana', 'Oklahoma', 'Texas']},
 'West': {'Mountain': ['Arizona',
   'Colorado',
   'Idaho',
   'Montana',
   'Nevada',
   'New Mexico',
   'Utah',
   'Wyoming'],
  'Pacific': ['Alaska', 'California', 'Hawaii', 'Oregon', 'Washington']}}

In [164]:
# Flatten the region -> division -> states hierarchy into one row per state
region_data = []
for region in ["Northeast", "Midwest", "South", "West"]:
    for division, states in region_data_json[region].items():
        for state in states:
            region_data.append({"Division": division, "State": state})

region_data_df = pd.DataFrame(region_data)
region_data_df

Unnamed: 0,Division,State
0,New England,Connecticut
1,New England,Maine
2,New England,Massachusetts
3,New England,New Hampshire
4,New England,Rhode Island
5,New England,Vermont
6,Middle Atlantic,New Jersey
7,Middle Atlantic,New York
8,Middle Atlantic,Pennsylvania
9,East North Central,Illinois


Combining city Wiki data and Division data

In [168]:
city_wiki_df=city_wiki_df.merge(region_data_df,left_on='state',right_on='State')

In [198]:
city_wiki_df.drop(columns=['City','PageID'],inplace=True)

In [171]:
# dropping duplicates if any
city_wiki_df = city_wiki_df.drop_duplicates()

In [194]:
#Renaming columns

city_wiki_df=city_wiki_df.rename(columns={"Title":"Article_Title",
                             "LatestRevisionid":"Revision_ID",
                             "ORES_Score":"article_quality",
                             "Pop_estimate_july2022":"population",
                             "Division":"Regional_division"
                             })

In [200]:
city_wiki_df=city_wiki_df.reindex(columns=['State','Regional_division','population','Article_Title','Revision_ID','article_quality'])
city_wiki_df.to_csv("wp_scored_city_articles_by_state.csv",index=False)

Combining ORES score with rest of the data

In [266]:
city_wiki_df=city_wiki_df.merge(ores_scores_df,how='left',left_on='Revision_ID',right_on='Revision_ID')

In [268]:
city_wiki_df.drop(columns=['article_quality','ORES_score'],inplace=True)

In [269]:
city_wiki_df.rename(columns={'ORES_Prediction':'article_quality'},inplace=True)

In [270]:
city_wiki_df.head(5)

Unnamed: 0,State,Regional_division,population,Article_Title,Revision_ID,article_quality
0,Alabama,East South Central,5074296.0,"Abbeville, Alabama",1171163550,C
1,Alabama,East South Central,5074296.0,"Adamsville, Alabama",1177621427,C
2,Alabama,East South Central,5074296.0,"Addison, Alabama",1168359898,C
3,Alabama,East South Central,5074296.0,"Akron, Alabama",1165909508,GA
4,Alabama,East South Central,5074296.0,"Alabaster, Alabama",1179139816,C


In [271]:
city_wiki_df.to_csv("wp_scored_city_articles_by_state.csv",index=False)

### Step 4 + 5 : Analysis and Results

We need to calculate total-articles-per-population (a ratio representing the number of articles per person) and high-quality-articles-per-population (a ratio representing the number of high quality articles per person) on a state-by-state and divisional basis. All of these values are “per capita” ratios.

For this analysis we will consider "high quality" articles to be articles that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes
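The two ratios defined above can be sketched on a tiny made-up table before applying them to the real data. The column names follow the ones used in this notebook; the states `A` and `B`, their populations, and the helper column `is_high_quality` are invented for illustration.

```python
import pandas as pd

articles = pd.DataFrame({
    "State": ["A", "A", "A", "B", "B"],
    "population": [1000.0, 1000.0, 1000.0, 500.0, 500.0],
    "article_quality": ["GA", "C", "Stub", "FA", "GA"],
})

# High quality means the predicted class is FA or GA
articles["is_high_quality"] = articles["article_quality"].isin(["FA", "GA"])

per_state = articles.groupby("State").agg(
    total_articles=("article_quality", "size"),
    high_quality_articles=("is_high_quality", "sum"),
    population=("population", "max"),   # population repeats on every row of a state
).reset_index()

per_state["articles_per_capita"] = per_state["total_articles"] / per_state["population"]
per_state["high_quality_per_capita"] = per_state["high_quality_articles"] / per_state["population"]
print(per_state)
```

The same aggregation grouped by `Regional_division` instead of `State` would need the division population, which is the sum of the populations of its states rather than the maximum.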


##### Top 10 US states by coverage: The 10 US states with the highest total articles per capita (in descending order).



In [223]:
analysis_1_df=pd.read_csv('/Users/sayo/Documents/Projects/Home-Projects/Human-Centered-Data-Science/data-512-homework_2/final_outputs/wp_scored_city_articles_by_state.csv')
analysis_1_df
#analysis_1_df.groupby('State').count()

# Calculate total articles per state
total_articles_per_state = analysis_1_df.groupby('State')['Revision_ID'].count().reset_index()
total_articles_per_state.columns = ['State', 'Total_Articles']

# Calculate total population per state
total_population_per_state = analysis_1_df.groupby('State')['population'].max().reset_index()

# Merge the total articles and total population DataFrames
merged_df = pd.merge(total_articles_per_state, total_population_per_state, on='State')

# Calculate articles per population
merged_df['Articles_Per_Population'] = merged_df['Total_Articles'] / merged_df['population']

# Get top 10 states with highest articles per population
top_10_states_by_coverage = merged_df.nlargest(10, 'Articles_Per_Population')

# Print the top 10 states
print(top_10_states_by_coverage[['State', 'Articles_Per_Population']])


           State  Articles_Per_Population
32       Vermont                 0.000508
16         Maine                 0.000349
12          Iowa                 0.000326
1         Alaska                 0.000203
28  Pennsylvania                 0.000197
19      Michigan                 0.000177
36       Wyoming                 0.000170
3       Arkansas                 0.000164
22      Missouri                 0.000154
20     Minnesota                 0.000149


##### Bottom 10 US states by coverage: The 10 US states with the lowest total articles per capita (in ascending order)


In [224]:
# Get bottom 10 states with lowest articles per population
bottom_10_states_by_coverage=merged_df.nsmallest(10, 'Articles_Per_Population')

# Print the bottom 10 states
print(bottom_10_states_by_coverage[['State', 'Articles_Per_Population']])

         State  Articles_Per_Population
24      Nevada                 0.000006
4   California                 0.000012
2      Arizona                 0.000012
33    Virginia                 0.000015
7      Florida                 0.000019
26    Oklahoma                 0.000019
13      Kansas                 0.000021
17    Maryland                 0.000025
35   Wisconsin                 0.000033
34  Washington                 0.000036


##### Top 10 US states by high quality: The 10 US states with the highest high quality articles per capita (in descending order) 

In [280]:
analysis_2_df=pd.read_csv('/Users/sayo/Documents/Projects/Home-Projects/Human-Centered-Data-Science/data-512-homework_2/final_outputs/wp_scored_city_articles_by_state.csv')

# Keep only the articles that ORES predicted to be FA or GA
cities_with_high_quality_articles = analysis_2_df[analysis_2_df['article_quality'].isin(['FA', 'GA'])]

# Count high-quality articles per state
total_hq_articles_per_state = cities_with_high_quality_articles.groupby('State')['Revision_ID'].count().reset_index()
total_hq_articles_per_state.columns = ['State', 'Total_high_quality_Articles']

# Look up each state's population
# Note: this uses the filtered rows, so states with no FA or GA articles do not appear in the result
total_population_per_state = cities_with_high_quality_articles.groupby('State')['population'].max().reset_index()

# Merge the high-quality article counts and state populations
merged_df2 = pd.merge(total_hq_articles_per_state, total_population_per_state, on='State')

# Calculate high-quality articles per population
merged_df2['HQ_Articles_Per_Population'] = merged_df2['Total_high_quality_Articles'] / merged_df2['population']

# Get top 10 states with highest high-quality articles per population
top_10_states_by_high_quality = merged_df2.nlargest(10, 'HQ_Articles_Per_Population')

# Print the top 10 states
print(top_10_states_by_high_quality[['State', 'HQ_Articles_Per_Population']])

        State  HQ_Articles_Per_Population
26    Wyoming                    0.000067
16    Montana                    0.000049
15   Missouri                    0.000043
1      Alaska                    0.000042
11       Iowa                    0.000033
20     Oregon                    0.000033
13  Minnesota                    0.000030
6    Delaware                    0.000025
3    Arkansas                    0.000024
9       Idaho                    0.000021


##### Bottom 10 US states by high quality: The 10 US states with the lowest high quality articles per capita (in ascending order)

In [281]:
# Get bottom 10 states with lowest high-quality articles per population
bottom_10_states_by_high_quality = merged_df2.nsmallest(10, 'HQ_Articles_Per_Population')

# Print the bottom 10 states
print(bottom_10_states_by_high_quality[['State', 'HQ_Articles_Per_Population']])

           State  HQ_Articles_Per_Population
21  Pennsylvania                    0.000001
17        Nevada                    0.000003
2        Arizona                    0.000003
4     California                    0.000004
7        Florida                    0.000005
19      Oklahoma                    0.000008
10      Illinois                    0.000009
12      Michigan                    0.000010
0        Alabama                    0.000010
25     Wisconsin                    0.000011
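
Because `merged_df2` is built only from rows rated FA or GA, a state with no high-quality articles never enters the table, so it cannot appear in the bottom 10 even though its ratio is zero. The row labels in the printed outputs (Wyoming is row 36 in the coverage table but row 26 in the high-quality table) suggest that some states were dropped this way. The following is a minimal sketch of one way to keep them, not the method used above. It runs on hypothetical toy data that reuses the notebook's column names, and `state_population`, `hq_counts`, and `all_states_df` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data; state 'B' has no FA or GA articles
toy_df = pd.DataFrame({
    'State': ['A', 'A', 'B', 'B', 'C'],
    'population': [1000.0, 1000.0, 500.0, 500.0, 2000.0],
    'Revision_ID': [11, 12, 13, 14, 15],
    'article_quality': ['GA', 'C', 'Stub', 'C', 'FA'],
})

# Population for every state, taken from the unfiltered data
state_population = toy_df.groupby('State')['population'].max().reset_index()

# High-quality counts exist only for states with at least one FA/GA article
hq_counts = (
    toy_df[toy_df['article_quality'].isin(['FA', 'GA'])]
    .groupby('State')['Revision_ID'].count()
    .reset_index(name='Total_high_quality_Articles')
)

# Left merge keeps every state; a missing count means zero high-quality articles
all_states_df = state_population.merge(hq_counts, on='State', how='left')
all_states_df['Total_high_quality_Articles'] = (
    all_states_df['Total_high_quality_Articles'].fillna(0).astype(int)
)
all_states_df['HQ_Articles_Per_Population'] = (
    all_states_df['Total_high_quality_Articles'] / all_states_df['population']
)

# The state with zero high-quality articles now ranks at the bottom
print(all_states_df.nsmallest(2, 'HQ_Articles_Per_Population')[['State', 'HQ_Articles_Per_Population']])
```

With this approach, ties at zero are possible, so a bottom-10 list may depend on how `nsmallest` orders equal values.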


##### Census divisions by total coverage: A rank ordered list of US census divisions (in descending order) by total articles per capita
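
No code for this ranking appears in this part of the notebook, so the following is only a sketch of one possible approach. It runs on hypothetical toy data that reuses the column names `State`, `Regional_division`, `population`, and `Revision_ID`. It assumes that `population` holds the state population repeated on every row, so each state is counted once when a division's population is summed. A division's population here covers only the states that actually appear in the data.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the scored-articles file
toy_df = pd.DataFrame({
    'State': ['A', 'A', 'B', 'C', 'C', 'C'],
    'Regional_division': ['North', 'North', 'North', 'South', 'South', 'South'],
    'population': [1000.0, 1000.0, 3000.0, 2000.0, 2000.0, 2000.0],
    'Revision_ID': [1, 2, 3, 4, 5, 6],
})

# Count articles per division
division_articles = (
    toy_df.groupby('Regional_division')['Revision_ID'].count()
    .reset_index(name='Total_Articles')
)

# Division population: keep one row per state, then sum within the division
division_population = (
    toy_df.drop_duplicates(subset='State')
    .groupby('Regional_division')['population'].sum()
    .reset_index()
)

division_df = pd.merge(division_articles, division_population, on='Regional_division')
division_df['Articles_Per_Population'] = division_df['Total_Articles'] / division_df['population']

# Rank divisions in descending order of total articles per capita
division_ranking = (
    division_df.sort_values('Articles_Per_Population', ascending=False)
    .reset_index(drop=True)
)
print(division_ranking[['Regional_division', 'Articles_Per_Population']])
```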

##### Census divisions by high quality coverage: Rank ordered list of US census divisions (in descending order) by high quality articles per capita
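
This ranking also has no code in this part of the notebook. As a hedged sketch under the same assumptions as the state-level analysis (FA and GA count as high quality, and `population` is the state population repeated on every row), the example below flags high-quality rows first and then aggregates, so a division with no FA or GA articles still appears with a ratio of zero. The data and the name `is_high_quality` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; division 'West' has no FA or GA articles
toy_df = pd.DataFrame({
    'State': ['A', 'A', 'B', 'C', 'C', 'D'],
    'Regional_division': ['North', 'North', 'North', 'South', 'South', 'West'],
    'population': [1000.0, 1000.0, 3000.0, 1000.0, 1000.0, 500.0],
    'article_quality': ['GA', 'C', 'FA', 'GA', 'Stub', 'C'],
})

# Flag FA/GA rows, then sum the flags within each division
toy_df['is_high_quality'] = toy_df['article_quality'].isin(['FA', 'GA'])
division_hq = (
    toy_df.groupby('Regional_division')['is_high_quality'].sum()
    .reset_index(name='Total_high_quality_Articles')
)

# Division population: keep one row per state, then sum within the division
division_population = (
    toy_df.drop_duplicates(subset='State')
    .groupby('Regional_division')['population'].sum()
    .reset_index()
)

division_hq_df = pd.merge(division_hq, division_population, on='Regional_division')
division_hq_df['HQ_Articles_Per_Population'] = (
    division_hq_df['Total_high_quality_Articles'] / division_hq_df['population']
)

# Rank divisions in descending order of high-quality articles per capita
division_hq_ranking = (
    division_hq_df.sort_values('HQ_Articles_Per_Population', ascending=False)
    .reset_index(drop=True)
)
print(division_hq_ranking[['Regional_division', 'HQ_Articles_Per_Population']])
```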