# Wikipedia API

Copyright 2024, Denis Rothman

**Summary:**
This Python notebook shows how to interact with the Wikipedia API to retrieve information about specified topics, tokenize the retrieved text, and manage citations from Wikipedia articles.

**Description of Functions and Sections**

#### Installing the environment
- **Installation of Wikipedia-API**: This section ensures that the `wikipediaapi` library is installed.

#### Functions
- **Tokenization (`nb_tokens`)**: This function takes a string of text as input and returns the number of tokens in it. It uses NLTK's `word_tokenize`, which counts punctuation marks as separate tokens.

#### Retrieving Wikipedia Metadata
- **Creating an instance**: Sets up an instance of the Wikipedia API with a specified language and user agent to start retrieving data.
- **Defining root page**: Specifies the main topic and filename associated with the Wikipedia page of interest.
- **Root page summary**: Retrieves the summary of the specified Wikipedia page, checks if the page exists, and prints its summary.
- **URLs and Citations**: Displays the full URL of the Wikipedia page.
- **Links in the page**: Retrieves the links found within the page and prints each link with its URL and summary, up to a maximum number of links.

#### Writing the citations and URLs pages
- **Management of citations**: Generates text files containing the citations for the links retrieved from a Wikipedia page and their URLs.


[Wikipedia API documentation](https://pypi.org/project/Wikipedia-API/)

The citations of the Wikipedia pages are in the `Chapter10/citations` directory of the repository.

For more on Wikipedia citations: [Citations](https://en.wikipedia.org/w/index.php?title=Special:CiteThisPage&page=Mark_Twain&id=1231834317&wpFormIdentifier=titleform)


# Installing the environment

In [8]:
try:
  import wikipediaapi
except ImportError:
  !pip install Wikipedia-API==0.6.0
  import wikipediaapi

## Functions

In [9]:
import nltk
from nltk.tokenize import word_tokenize

# Ensure you have the necessary NLTK resource downloaded
nltk.download('punkt')

def nb_tokens(text):
    # More sophisticated tokenization which includes punctuation
    tokens = word_tokenize(text)
    return len(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
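
When the `punkt` resource cannot be downloaded, a rough token count can still be obtained with a regular expression. The sketch below is an assumption-laden fallback, not the notebook's method: `approx_nb_tokens` is a hypothetical helper that counts each run of word characters and each punctuation mark as one token. Its counts can differ from `word_tokenize`, for instance on contractions.

```python
import re

# Hypothetical fallback, not part of the notebook: a word or a single
# punctuation mark counts as one token.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def approx_nb_tokens(text):
    """Return an approximate token count without any NLTK resources."""
    return len(_TOKEN_PATTERN.findall(text))

print(approx_nb_tokens("Turing led Hut 8, at Bletchley Park."))  # prints 9
```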


# Retrieving Wikipedia Data and Metadata

## Creating an instance

In [10]:
# Create an instance of the Wikipedia API with a detailed user agent
wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent='Knowledge/1.0 (Denis.Rothman76@gmail.com)'
)
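
The user agent above follows the pattern `AppName/Version (contact)`. As a minimal sketch, a small helper can build that string and refuse an empty contact. The name `build_user_agent` and the example address are illustrative assumptions, not part of the Wikipedia-API library.

```python
def build_user_agent(app_name, version, contact):
    """Format a user agent as 'AppName/Version (contact)'."""
    if not contact:
        # A contact lets site operators reach the script's owner
        raise ValueError("a contact address or URL is required")
    return f"{app_name}/{version} ({contact})"

# Placeholder contact, shown for illustration only
user_agent = build_user_agent("Knowledge", "1.0", "contact@example.com")
print(user_agent)  # prints Knowledge/1.0 (contact@example.com)
```

The resulting string could then be passed as the `user_agent` argument when creating the instance.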

## Defining root page

In [11]:
topic="Alan Turing"     #topic
filename="AlanTuring"   #filename for saving the outputs
maxl=25                 #maximum number of links to retrieve. This value was set to 100 for the Alan Turing URL dataset
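
Here `filename` is typed by hand to match `topic`. One possible way to derive it instead is sketched below. `topic_to_filename` is a hypothetical helper, and the rule it applies (keep only ASCII letters and digits) is an assumption chosen to reproduce `AlanTuring` from `Alan Turing`.

```python
import re

def topic_to_filename(topic):
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", topic)

print(topic_to_filename("Alan Turing"))     # prints AlanTuring
print(topic_to_filename("ACE (computer)"))  # prints ACEcomputer
```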

## Root page summary

In [12]:
import textwrap # to wrap the text and display in paragraphs
page=wiki.page(topic)

if page.exists():
  print("Page - Exists: %s" % page.exists())
  summary=page.summary
  # Number of tokens in the summary
  nbt=nb_tokens(summary)
  print("Number of tokens: ",nbt)
  # Use textwrap to wrap the summary text to a specified width (60 characters here)
  wrapped_text = textwrap.fill(summary, width=60)
  # Print the wrapped summary text
  print(wrapped_text)
else:
  print("Page does not exist")

Page - Exists: True
Number of tokens:  531
Alan Mathison Turing  (; 23 June 1912 – 7 June 1954) was an
English mathematician, computer scientist, logician,
cryptanalyst, philosopher and theoretical biologist. He was
highly influential in the development of theoretical
computer science, providing a formalisation of the concepts
of algorithm and computation with the Turing machine, which
can be considered a model of a general-purpose computer.
Turing is widely considered to be the father of theoretical
computer science. Born in London, Turing was raised in
southern England. He graduated in maths from King's College,
Cambridge, and in 1938, earned a maths PhD from Princeton
University. During the Second World War, Turing worked for
the Government Code and Cypher School at Bletchley Park,
Britain's codebreaking centre that produced Ultra
intelligence. He led Hut 8, the section responsible for
German naval cryptanalysis. Turing devised techniques for
speeding the breaking of German ciphers,

## URLs and Citations

In [13]:
print(page.fullurl)

https://en.wikipedia.org/wiki/Alan_Turing
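
The URL above shows the usual shape of an article address: the title with spaces replaced by underscores. The sketch below guesses a URL from a title without calling the API. `guess_wikipedia_url` is a hypothetical helper and only an approximation: it does not follow redirects, so `page.fullurl` remains the authoritative value (in the links output of this notebook, "ACE (computer)" resolves to `Automatic_Computing_Engine`).

```python
from urllib.parse import quote

def guess_wikipedia_url(title, language="en"):
    """Approximate an article URL from its title (no redirect handling)."""
    # Keep the punctuation that appears unescaped in the notebook's URLs
    path = quote(title.replace(" ", "_"), safe="():,")
    return f"https://{language}.wikipedia.org/wiki/{path}"

print(guess_wikipedia_url("Alan Turing"))
# prints https://en.wikipedia.org/wiki/Alan_Turing
```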


## Links in the page

In [14]:
# prompt: read the program up to this cell. Then retrieve all the links for this page: print the link and a summary of each link

# Get all the links on the page
links = page.links

# Print the link and a summary of each link
urls = []
counter=0
for link in links:
  linked_page = wiki.page(link)  # fetch the linked page once and reuse it
  if not linked_page.exists():
    # Ignore pages that don't exist
    continue
  counter+=1
  print(f"Link {counter}: {link}")
  print(f"Link: {link}")
  print(linked_page.fullurl)
  urls.append(linked_page.fullurl)
  print(f"Summary: {linked_page.summary}")
  if counter>=maxl:
    break

print(counter)
print(urls)

Link 1: 1926 United Kingdom general strike
Link: 1926 United Kingdom general strike
https://en.wikipedia.org/wiki/1926_United_Kingdom_general_strike
Summary: The 1926 General Strike in the United Kingdom was a general strike that lasted nine days, from 4 to 12 May 1926. It was called by the General Council of the Trades Union Congress (TUC) in an unsuccessful attempt to force the British government to act to prevent wage reductions and worsening conditions for 1.2 million locked-out coal miners. Some 1.7 million workers went out, especially in transport and heavy industry. 
It was a sympathy strike, with many of those who were not miners and not directly affected striking to support the locked-out miners. The government was well prepared, and enlisted middle class volunteers to maintain essential services. There was little violence and the TUC gave up in defeat.
Link 2: ACE (computer)
Link: ACE (computer)
https://en.wikipedia.org/wiki/Automatic_Computing_Engine
Summary: The Automatic C
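
The loop in this section is meant to skip missing pages and stop once `maxl` links have been collected. That pattern can be tried offline with stand-in objects, as in the sketch below. `FakePage` and `collect_urls` are hypothetical names, and `FakePage` only mimics the `exists()` method and `fullurl` attribute used above.

```python
from dataclasses import dataclass

@dataclass
class FakePage:
    """Hypothetical stand-in exposing the two members the loop relies on."""
    fullurl: str
    found: bool = True

    def exists(self):
        return self.found

def collect_urls(pages, limit):
    """Return the URLs of at most `limit` existing pages, in order."""
    urls = []
    if limit <= 0:
        return urls
    for page in pages:
        if not page.exists():
            continue  # skip missing pages without counting them
        urls.append(page.fullurl)
        if len(urls) >= limit:
            break
    return urls

pages = [FakePage("a"), FakePage("b", found=False), FakePage("c"), FakePage("d")]
print(collect_urls(pages, 2))  # prints ['a', 'c']
```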

## Writing the citations page

In [15]:
from datetime import datetime

# Get all the links on the page
links = page.links

# Prepare a file to store the outputs
fname = filename+"_citations.txt"
with open(fname, "w") as file:
    # Write the citation header
    file.write(f"Citation. In Wikipedia, The Free Encyclopedia. Pages retrieved from the following Wikipedia contributors on {datetime.now()}\n")

    # Initialize a counter
    counter = 0
    urls = []

    # Loop through the links and collect summaries
    for link in links:
        page_detail = wiki.page(link)
        if not page_detail.exists():
            # Ignore pages that don't exist
            continue
        counter += 1
        summary = page_detail.summary

        # Write details to the file
        file.write(f"Link {counter}: {link}\n")
        file.write(f"Link: {link}\n")
        file.write(f"{page_detail.fullurl}\n")
        urls.append(page_detail.fullurl)
        file.write(f"Summary: {summary}\n")

        # Limit to maxl pages to avoid excessive scraping
        if counter >= maxl:
            break

    # Write the total counts and URLs at the end
    file.write(f"Total links processed: {counter}\n")
    file.write("URLs:\n")
    file.write("\n".join(urls))

# Note: Ensure the topic you specify corresponds to a valid Wikipedia article.
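
The file layout written above (a header, one block per link, then a total) can be checked without network access by writing to an in-memory stream. The following is a sketch under assumptions: `write_citations` is a hypothetical helper, the header wording is simplified, and the entry uses a placeholder summary rather than real Wikipedia text.

```python
import io
from datetime import datetime, timezone

def write_citations(out, entries, retrieved_at):
    """Write (title, url, summary) tuples to a text stream; return the count."""
    out.write("Citation. In Wikipedia, The Free Encyclopedia. "
              f"Pages retrieved on {retrieved_at.isoformat()}\n")
    for number, (title, url, summary) in enumerate(entries, start=1):
        out.write(f"Link {number}: {title}\n")
        out.write(f"{url}\n")
        out.write(f"Summary: {summary}\n")
    out.write(f"Total links processed: {len(entries)}\n")
    return len(entries)

# Placeholder entry for illustration; the summary is not real page text
entries = [("Algorithm", "https://en.wikipedia.org/wiki/Algorithm", "placeholder summary")]
buffer = io.StringIO()
write_citations(buffer, entries, datetime(2024, 1, 1, tzinfo=timezone.utc))
print(buffer.getvalue())
```

Passing the timestamp as an argument, instead of calling `datetime.now()` inside the function, keeps the output reproducible.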


In [16]:
urls

['https://en.wikipedia.org/wiki/1926_United_Kingdom_general_strike',
 'https://en.wikipedia.org/wiki/Automatic_Computing_Engine',
 'https://en.wikipedia.org/wiki/Abraham_Wald',
 'https://en.wikipedia.org/wiki/Abram_Besicovitch',
 'https://en.wikipedia.org/wiki/Action_This_Day_(memo)',
 'https://en.wikipedia.org/wiki/Ada_Lovelace',
 'https://en.wikipedia.org/wiki/Adele_Goldstine',
 'https://en.wikipedia.org/wiki/Adolf_Hitler',
 'https://en.wikipedia.org/wiki/Akio_Morita',
 'https://en.wikipedia.org/wiki/Alan_Turing:_The_Enigma',
 'https://en.wikipedia.org/wiki/Alan_Turing_law',
 'https://en.wikipedia.org/wiki/Albert_Einstein',
 'https://en.wikipedia.org/wiki/Albert_Neuberger',
 'https://en.wikipedia.org/wiki/Alexander_Fleming',
 'https://en.wikipedia.org/wiki/Alfred_D._Chandler_Jr.',
 'https://en.wikipedia.org/wiki/Alfred_Ubbelohde',
 'https://en.wikipedia.org/wiki/Algorithm',
 'https://en.wikipedia.org/wiki/Alick_Glennie',
 'https://en.wikipedia.org/wiki/Alonzo_Church',
 'https://en.wi

In [17]:
# Write URLs to a file
ufname = filename+"_urls.txt"
with open(ufname, 'w') as file:
    for url in urls:
        file.write(url + '\n')

print(f"URLs have been written to {ufname}")

URLs have been written to AlanTuring_urls.txt


In [18]:
# Read URLs from the file
with open(ufname, 'r') as file:
    urls = [line.strip() for line in file]

# Display the URLs
print("Read URLs:")
for url in urls:
    print(url)

Read URLs:
https://en.wikipedia.org/wiki/1926_United_Kingdom_general_strike
https://en.wikipedia.org/wiki/Automatic_Computing_Engine
https://en.wikipedia.org/wiki/Abraham_Wald
https://en.wikipedia.org/wiki/Abram_Besicovitch
https://en.wikipedia.org/wiki/Action_This_Day_(memo)
https://en.wikipedia.org/wiki/Ada_Lovelace
https://en.wikipedia.org/wiki/Adele_Goldstine
https://en.wikipedia.org/wiki/Adolf_Hitler
https://en.wikipedia.org/wiki/Akio_Morita
https://en.wikipedia.org/wiki/Alan_Turing:_The_Enigma
https://en.wikipedia.org/wiki/Alan_Turing_law
https://en.wikipedia.org/wiki/Albert_Einstein
https://en.wikipedia.org/wiki/Albert_Neuberger
https://en.wikipedia.org/wiki/Alexander_Fleming
https://en.wikipedia.org/wiki/Alfred_D._Chandler_Jr.
https://en.wikipedia.org/wiki/Alfred_Ubbelohde
https://en.wikipedia.org/wiki/Algorithm
https://en.wikipedia.org/wiki/Alick_Glennie
https://en.wikipedia.org/wiki/Alonzo_Church
https://en.wikipedia.org/wiki/Amadeo_Giannini
https://en.wikipedia.org/wiki/Amer
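
If the URL file is edited by hand or appended to across runs, it may contain blank lines or repeated entries. The sketch below is one way to read it defensively. `load_urls` is a hypothetical helper, and the sample file is created in a temporary directory purely for illustration.

```python
import os
import tempfile

def load_urls(path):
    """Read one URL per line, skipping blank lines and repeated entries."""
    seen = set()
    urls = []
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls

# Round trip through a temporary file containing a blank line and a duplicate
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample_urls.txt")
    with open(path, "w", encoding="utf-8") as file:
        file.write("https://en.wikipedia.org/wiki/Algorithm\n\n"
                   "https://en.wikipedia.org/wiki/Ada_Lovelace\n"
                   "https://en.wikipedia.org/wiki/Algorithm\n")
    sample = load_urls(path)

print(sample)
```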