## Data Extraction
This notebook presents a Python script that queries the Wikipedia API for page information on a dataset of U.S. cities by state. It imports the required modules, defines the API configuration and a request function, loads the list of cities, and saves the collected page information to a CSV file.



## Importing Essential Python Modules for Data Processing and HTTP Requests
This section imports the modules the rest of the notebook relies on: json for handling JSON data, time for pausing between requests, requests for making HTTP requests, pandas for data manipulation, and tqdm for progress bars.

In [14]:
# Import necessary Python modules
import json
import time
import requests
import pandas as pd
from tqdm import tqdm



## Code Attribution and Licensing Information
Original Code Attribution
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program.
This code is provided under the Creative Commons CC-BY license. Revision 1.1 - August 14, 2023
Source: https://colab.research.google.com/drive/15UoE16s-IccCTOXREjU3xDIz07tlpyrl#scrollTo=2i0WSJn4TXqu&printMode=true

Note: this notebook uses a modified version of the original code.


## Defining Constants and Configuration for Wikipedia API Requests
This section defines the constants and configuration used throughout the code: the Wikipedia API endpoint, the assumed API latency and the throttle wait derived from it, the request headers, and the parameter template for a page info query.

In [15]:

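# Endpoint for the English Wikipedia API
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
# Assumed per-request latency, in seconds
API_LATENCY_ASSUMED = 0.002
# Pause before each request; with the assumed latency this caps the loop at about 100 requests per second
API_THROTTLE_WAIT = (1.0/100.0) - API_LATENCY_ASSUMED
# Identify the client to Wikipedia; '<uwnetid@uw.edu>' is a placeholder for a real contact address
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}
# Extra page info properties requested through the "inprop" parameter
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"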
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
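
The throttle arithmetic can be checked in isolation. The sketch below copies the two timing constants from the cell above and adds a hypothetical helper, `max_requests_per_second`, which is not part of the original notebook. It assumes each request takes exactly the assumed latency plus the throttle wait.

```python
# Values copied from the configuration cell.
API_LATENCY_ASSUMED = 0.002                              # assumed latency, in seconds
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED  # pause before each request

# Hypothetical helper (an assumption for illustration): the highest request rate
# the loop could reach if every call took exactly latency + wait seconds.
def max_requests_per_second(latency, wait):
    return 1.0 / (latency + max(wait, 0.0))

print(f"wait per request: {API_THROTTLE_WAIT:.3f} s")
ceiling = max_requests_per_second(API_LATENCY_ASSUMED, API_THROTTLE_WAIT)
print(f"rate ceiling: {ceiling:.0f} requests/s")
```

The run recorded at the end of this notebook averaged roughly 4 requests per second, well below this ceiling, which suggests that actual network latency rather than the throttle set the pace.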



## Defining a Function for Requesting Wikipedia Page Information
This section defines request_pageinfo_per_article, a function that requests page information for a single Wikipedia article. It raises an exception when no article title is supplied, pauses for the throttle interval before each call, and prints the error and returns None if the request or the JSON decoding fails.

In [16]:
# Function to request page info for a single article
def request_pageinfo_per_article(article_title=None, endpoint_url=API_ENWIKIPEDIA_ENDPOINT,
                                 request_template=PAGEINFO_PARAMS_TEMPLATE, headers=REQUEST_HEADERS):
    # Work on a copy so the shared template is not modified between calls
    params = dict(request_template)
    if article_title:
        params['titles'] = article_title
    if not params['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    try:
        # Pause briefly so the loop does not exceed the intended request rate
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=params)
        json_response = response.json()
    except Exception as e:
        # On any request or decoding error, report it and signal failure with None
        print(e)
        json_response = None
    return json_response
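
Filling the title into a copy of the parameter template, rather than into the template itself, keeps PAGEINFO_PARAMS_TEMPLATE reusable across calls. The following is a minimal offline sketch of that idea; `build_pageinfo_params` is a hypothetical helper introduced only for illustration, and no request is sent.

```python
# Template copied from the configuration cell.
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

# Hypothetical helper (not in the original notebook): return a fresh parameter
# dictionary for one article without touching the shared template.
def build_pageinfo_params(article_title, template=PAGEINFO_PARAMS_TEMPLATE):
    params = dict(template)           # shallow copy is enough: all values are strings
    params["titles"] = article_title
    return params

params = build_pageinfo_params("Abbeville, Alabama")
print(params["titles"])                         # title set on the copy
print(repr(PAGEINFO_PARAMS_TEMPLATE["titles"])) # template still holds the empty string
```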



## Loading U.S. City Data from CSV into a Pandas DataFrame

In [17]:
# Load the list of U.S. city Wikipedia articles from CSV
df = pd.read_csv('../data/us_cities_by_state_SEPT.2023.csv')



In [18]:
df

Unnamed: 0,state,page_title,url
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama"
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama"
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama"
...,...,...,...
22152,Wyoming,"Wamsutter, Wyoming","https://en.wikipedia.org/wiki/Wamsutter,_Wyoming"
22153,Wyoming,"Wheatland, Wyoming","https://en.wikipedia.org/wiki/Wheatland,_Wyoming"
22154,Wyoming,"Worland, Wyoming","https://en.wikipedia.org/wiki/Worland,_Wyoming"
22155,Wyoming,"Wright, Wyoming","https://en.wikipedia.org/wiki/Wright,_Wyoming"
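
Because every title in the list triggers one API request, it can be useful to check the list for repeated titles before querying. Whether this file contains duplicates is not stated here, so the sketch below runs on a made-up list standing in for `df['page_title'].tolist()`. The helper `find_duplicate_titles` is hypothetical.

```python
from collections import Counter

# Made-up sample standing in for df['page_title'].tolist(); not taken from the real CSV.
titles = ["Abbeville, Alabama", "Akron, Alabama", "Abbeville, Alabama"]

# Hypothetical helper: map each repeated title to the number of times it occurs.
def find_duplicate_titles(titles):
    counts = Counter(titles)
    return {title: n for title, n in counts.items() if n > 1}

duplicates = find_duplicate_titles(titles)
# dict.fromkeys keeps the first occurrence of each title and preserves order.
unique_titles = list(dict.fromkeys(titles))

print(duplicates)
print(unique_titles)
```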


## Requesting Wikipedia Page Info for U.S. Cities and Storing in CSV
This code block queries Wikipedia for each city in the DataFrame. It iterates through the page titles, requests page information for each one, and collects the title and last revision ID of every returned page in a list. The list is then converted into a DataFrame and saved to a CSV file.

In [19]:
# Request and store Page Info for Wikipedia Articles in a CSV file
data = []
for title in tqdm(df['page_title'].tolist()):
    info = request_pageinfo_per_article(title)
    # Skip failed requests: the function returns None when the call or JSON decoding fails
    if info and 'query' in info and 'pages' in info['query']:
        pages = info['query']['pages']
        for key, value in pages.items():
            if 'lastrevid' in value and 'title' in value:
                data.append({'Title': value['title'], 'Last_Revision_ID': value['lastrevid']})

# Convert the data into a DataFrame and store it in a CSV file
result_df = pd.DataFrame(data)
result_df.to_csv('../data/wiki_page_info.csv', index=False)

100%|███████████████████████████████████| 22157/22157 [1:29:54<00:00,  4.11it/s]
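
The extraction step in the loop can be tried offline. The sketch below uses a hand-written dictionary in the shape the API returns for this query, where pages are keyed by page ID and a title with no article appears under a negative key without a `lastrevid`. The page ID, revision ID, and the missing title are invented for illustration, and `extract_revision_rows` is a hypothetical helper that mirrors the loop body.

```python
# Hand-written sample response; the IDs and the missing title are invented.
sample_response = {
    "query": {
        "pages": {
            "104761": {
                "pageid": 104761,
                "ns": 0,
                "title": "Abbeville, Alabama",
                "lastrevid": 1171163550,
            },
            "-1": {"ns": 0, "title": "No Such Place, Alabama", "missing": ""},
        }
    }
}

# Hypothetical helper: collect title and last revision ID for each returned page,
# skipping failed requests (None) and pages that lack a revision ID.
def extract_revision_rows(info):
    rows = []
    if not info:
        return rows
    pages = info.get("query", {}).get("pages", {})
    for page in pages.values():
        if "title" in page and "lastrevid" in page:
            rows.append({"Title": page["title"], "Last_Revision_ID": page["lastrevid"]})
    return rows

rows = extract_revision_rows(sample_response)
print(rows)
```

Pages without a `lastrevid` are silently dropped here, as in the loop above, so the output file can hold fewer rows than the number of titles requested.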
