# **Implementing the Wikipedia API**

Copyright 2024, Denis Rothman

[Wikipedia API documentation](https://pypi.org/project/Wikipedia-API/)

The citations for the Wikipedia pages are in the `Chapter10/citations` directory of the repository.

For more on Wikipedia citations: [Citations](https://en.wikipedia.org/w/index.php?title=Special:CiteThisPage&page=Mark_Twain&id=1231834317&wpFormIdentifier=titleform)


# Installing the environment

In [1]:
try:
  import wikipediaapi
except ImportError:
  !pip install Wikipedia-API==0.6.0
  import wikipediaapi

## Defining the tokenization function

In [2]:
import nltk
from nltk.tokenize import word_tokenize

# Ensure you have the necessary NLTK resource downloaded
nltk.download('punkt')

def nb_tokens(text):
    # More sophisticated tokenization which includes punctuation
    tokens = word_tokenize(text)
    return len(tokens)

[nltk_data] Downloading package punkt to /usr/local/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
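
To see what counting punctuation as separate tokens means in practice, the sketch below approximates it with a regular expression from the standard library. It is not NLTK's algorithm: `approx_nb_tokens` is a hypothetical helper introduced only for illustration, and its counts can differ from `word_tokenize` on contractions, abbreviations, and similar cases.

```python
import re

def approx_nb_tokens(text):
    # Hypothetical stand-in for nb_tokens (not NLTK): a run of word characters
    # is one token, and every other non-space character is its own token.
    return len(re.findall(r"\w+|[^\w\s]", text))

sample = "Hello, world!"
print(sample.split())            # whitespace split keeps punctuation attached
print(approx_nb_tokens(sample))  # counts "Hello", ",", "world", "!" separately
```

A plain whitespace split would report two tokens for this sample, which is why a punctuation-aware count gives a closer estimate of text length.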


# Retrieving Wikipedia Data and Metadata

## Creating an instance

In [3]:
# Create an instance of the Wikipedia API with a detailed user agent
wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent='Knowledge/1.0 ([dk548@cornell.edu])'
)

## Defining the root page

Check the [Wikipedia rate limits](https://api.wikimedia.org/wiki/Rate_limits#:~:text=User%2Dauthenticated%20requests,requests%20per%20hour%20per%20user.) before making API requests.
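
One way to stay under a rate limit is to space out consecutive requests. The following is a minimal sketch using only the standard library; `throttled` and the `min_interval` value are assumptions made for illustration, not part of Wikipedia-API, and the interval shown is not the actual limit.

```python
import time

def throttled(items, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    # Hypothetical helper: yield items one at a time, waiting so that
    # consecutive items are handed out at least min_interval seconds apart.
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item

for title in throttled(["ARPANET", "Adware"], min_interval=0.1):
    print(title)  # a request such as wiki.page(title) would go here
```

The `sleep` and `clock` parameters exist only so the pacing logic can be exercised without real delays.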


In [4]:
topic="cybersecurity"     # topic
filename="cybersecurity"  # filename for saving the outputs
maxl=100              # maximum number of links to retrieve; set to 100 for the URL dataset

## Root page summary

In [5]:
import textwrap # to wrap the text for display
nltk.download('punkt_tab')
page=wiki.page(topic)

if page.exists():
  print("Page - Exists: %s" % page.exists())
  summary=page.summary
  # number of tokens
  nbt=nb_tokens(summary)
  print("Number of tokens: ",nbt)
  # Use textwrap to wrap the summary text to a specified width, here 60 characters
  wrapped_text = textwrap.fill(summary, width=60)
  # Print the wrapped summary text
  print(wrapped_text)
else:
  print("Page does not exist")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /usr/local/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Page - Exists: True
Number of tokens:  224
Computer security (also cybersecurity, digital security, or
information technology (IT) security) is the protection of
computer software, systems and networks from threats that
can lead to unauthorized information disclosure, theft or
damage to hardware, software, or data, as well as from the
disruption or misdirection of the services they provide. The
significance of the field stems from the expanded reliance
on computer systems, the Internet, and wireless network
standards. Its importance is further amplified by the growth
of smart devices, including smartphones, televisions, and
the various devices that constitute the Internet of things
(IoT). Cybersecurity has emerged as one of the most
significant new challenges facing the contemporary world,
due to both the complexity of information systems and the
societies they support. Security is particularly crucial for
systems that govern large-scale systems with far-reaching
physical effects, such

## URLs and Citations

In [6]:
print(page.fullurl)

https://en.wikipedia.org/wiki/Computer_security
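
The URL shows that the topic `cybersecurity` resolved to the article titled "Computer security". As a small sketch, a readable title can be recovered from such a URL with the standard library alone; `title_from_url` is a hypothetical helper, and it assumes the `/wiki/<title>` layout seen in the output above.

```python
from urllib.parse import urlparse, unquote

def title_from_url(url):
    # Hypothetical helper: keep what follows "/wiki/" in the path, decode
    # percent-escapes (e.g. %26 -> &), and turn underscores back into spaces.
    segment = urlparse(url).path.partition("/wiki/")[2]
    return unquote(segment).replace("_", " ")

print(title_from_url("https://en.wikipedia.org/wiki/Computer_security"))
```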


## Links in the page

In [7]:
# Get all the links on the page (a dict keyed by page title)
links = page.links

# Print the title, URL, and summary of each linked page
urls = []
counter = 0
for link in links:
  linked_page = wiki.page(link)
  if not linked_page.exists():
    # Ignore pages that don't exist
    continue
  counter += 1
  print(f"Link {counter}: {link}")
  print(f"Link: {link}")
  print(linked_page.fullurl)
  urls.append(linked_page.fullurl)
  print(f"Summary: {linked_page.summary}")
  if counter >= maxl:
    break

print(counter)
print(urls)

Link 1: 2015 Ukraine power grid hack
Link: 2015 Ukraine power grid hack
https://en.wikipedia.org/wiki/2015_Ukraine_power_grid_hack
Summary: On December 23, 2015, the power grid in two western oblasts of Ukraine was hacked, which resulted in power outages for roughly 230,000 consumers in Ukraine for 1-6 hours.  The attack took place during the ongoing Russo-Ukrainian War (2014-present) and is attributed to a Russian advanced persistent threat group known as "Sandworm". It is the first publicly acknowledged successful cyberattack on a power grid.
Link 2: 2600: The Hacker Quarterly
Link: 2600: The Hacker Quarterly
https://en.wikipedia.org/wiki/2600:_The_Hacker_Quarterly
Summary: 2600: The Hacker Quarterly is an American seasonal publication of technical information and articles, many of which are written and submitted by the readership, on a variety of subjects including hacking, telephone switching systems, Internet protocols and services, as well as general news concerning the computer 

In [8]:
page.fullurl

'https://en.wikipedia.org/wiki/Computer_security'

## Writing the citations page and collecting the URLs

In [9]:
from datetime import datetime

# Get all the links on the page
links = page.links

# Prepare a file to store the outputs
fname = filename+"_citations.txt"
with open(fname, "w", encoding="utf-8") as file:
    # Write the citation header
    file.write(f"Citation. In Wikipedia, The Free Encyclopedia. Pages retrieved from the following Wikipedia contributors on {datetime.now()}\n")
    file.write("Root page: " + page.fullurl + "\n")
    counter = 0
    urls = []
    urls.append(page.fullurl)
    # Loop through the links and collect summaries
    for link in links:
        page_detail = wiki.page(link)
        if not page_detail.exists():
            # Ignore pages that don't exist
            continue
        counter += 1
        summary = page_detail.summary

        # Write details to the file
        file.write(f"Link {counter}: {link}\n")
        file.write(f"Link: {link}\n")
        file.write(f"{page_detail.fullurl}\n")
        urls.append(page_detail.fullurl)
        file.write(f"Summary: {summary}\n")

        # Stop after maxl pages to avoid excessive requests
        if counter >= maxl:
            break

    # Write the total counts and URLs at the end
    file.write(f"Total links processed: {counter}\n")
    file.write("URLs:\n")
    file.write("\n".join(urls))

# Note: Ensure the topic you specify corresponds to a valid Wikipedia article.
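
The citation header records when the pages were retrieved. `datetime.now()` returns local time without a time zone, so a reader of the file cannot tell which zone was meant. A possible refinement, shown here as a sketch, is to record the retrieval time in UTC and ISO 8601 format; `retrieval_stamp` is a hypothetical helper and is not used in the notebook.

```python
from datetime import datetime, timezone

def retrieval_stamp(now=None):
    # Hypothetical helper: return the retrieval time as an ISO 8601 string
    # in UTC, truncated to whole seconds.
    if now is None:
        now = datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).isoformat(timespec="seconds")

print(retrieval_stamp())
```

The optional `now` argument is there so the formatting can be checked against a fixed time.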

In [10]:
urls

['https://en.wikipedia.org/wiki/Computer_security',
 'https://en.wikipedia.org/wiki/2015_Ukraine_power_grid_hack',
 'https://en.wikipedia.org/wiki/2600:_The_Hacker_Quarterly',
 'https://en.wikipedia.org/wiki/ACM_Computing_Classification_System',
 'https://en.wikipedia.org/wiki/ARPANET',
 'https://en.wikipedia.org/wiki/AT%26T',
 'https://en.wikipedia.org/wiki/Accelerometer',
 'https://en.wikipedia.org/wiki/Access-control_list',
 'https://en.wikipedia.org/wiki/Access_control',
 'https://en.wikipedia.org/wiki/Access-control_list',
 'https://en.wikipedia.org/wiki/Fitness_tracker',
 'https://en.wikipedia.org/wiki/Acute_stress_reaction',
 'https://en.wikipedia.org/wiki/Adam_Back',
 'https://en.wikipedia.org/wiki/Address_Resolution_Protocol',
 'https://en.wikipedia.org/wiki/Advanced_Encryption_Standard',
 'https://en.wikipedia.org/wiki/Advanced_driver-assistance_system',
 'https://en.wikipedia.org/wiki/Advanced_persistent_threat',
 'https://en.wikipedia.org/wiki/Adware',
 'https://en.wikipedi
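
The list above contains `Access-control_list` twice, presumably because two different link titles resolve to the same article. If duplicates are unwanted in the URL dataset, they can be dropped while keeping the original order. The sketch below uses a shortened sample list; `unique_in_order` is a hypothetical helper name.

```python
def unique_in_order(items):
    # dict preserves insertion order, so the first occurrence of each item wins.
    return list(dict.fromkeys(items))

sample = [
    "https://en.wikipedia.org/wiki/Access-control_list",
    "https://en.wikipedia.org/wiki/Access_control",
    "https://en.wikipedia.org/wiki/Access-control_list",
]
print(unique_in_order(sample))
```

Applying the same call to `urls` before writing the URL file would leave one line per article.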

## Writing the URL file

In [11]:
# Write URLs to a file
ufname = filename+"_urls.txt"
with open(ufname, 'w') as file:
    for url in urls:
        file.write(url + '\n')

print(f"URLs have been written to {ufname}")

URLs have been written to cybersecurity_urls.txt


In [12]:
# Read URLs from the file
with open(ufname, 'r') as file:
    urls = [line.strip() for line in file]

# Display the URLs
print("Read URLs:")
for url in urls:
    print(url)

Read URLs:
https://en.wikipedia.org/wiki/Computer_security
https://en.wikipedia.org/wiki/2015_Ukraine_power_grid_hack
https://en.wikipedia.org/wiki/2600:_The_Hacker_Quarterly
https://en.wikipedia.org/wiki/ACM_Computing_Classification_System
https://en.wikipedia.org/wiki/ARPANET
https://en.wikipedia.org/wiki/AT%26T
https://en.wikipedia.org/wiki/Accelerometer
https://en.wikipedia.org/wiki/Access-control_list
https://en.wikipedia.org/wiki/Access_control
https://en.wikipedia.org/wiki/Access-control_list
https://en.wikipedia.org/wiki/Fitness_tracker
https://en.wikipedia.org/wiki/Acute_stress_reaction
https://en.wikipedia.org/wiki/Adam_Back
https://en.wikipedia.org/wiki/Address_Resolution_Protocol
https://en.wikipedia.org/wiki/Advanced_Encryption_Standard
https://en.wikipedia.org/wiki/Advanced_driver-assistance_system
https://en.wikipedia.org/wiki/Advanced_persistent_threat
https://en.wikipedia.org/wiki/Adware
https://en.wikipedia.org/wiki/Agencies_of_the_European_Union
https://en.wikipedia.