# Wikipedia API

Copyright 2024, Denis Rothman

**Summary:**
This Python notebook shows how to interact with the Wikipedia API to retrieve information about specified topics, tokenize the retrieved text, and manage citations from Wikipedia articles.

**Description of Functions and Sections**

#### Installing the environment
- **Installation of Wikipedia-API**: This section ensures that the `wikipediaapi` library is installed.

#### Functions
- **Tokenization (`nb_tokens`)**: This function takes a string of text as input and returns the number of tokens in it, using NLTK's `word_tokenize`, which counts punctuation marks as separate tokens.

#### Retrieving Wikipedia Metadata
- **Creating an instance**: Sets up an instance of the Wikipedia API with a specified language and user agent to start retrieving data.
- **Defining root page**: Specifies the main topic and filename associated with the Wikipedia page of interest.
- **Root page summary**: Retrieves the summary of the specified Wikipedia page, checks if the page exists, and prints its summary.
- **URLs and Citations**: Displays the full URL of the Wikipedia page.
- **Links in the page**: Iterates over the links found in the page and prints the title, URL, and summary of each one, up to a maximum number of links.

#### Writing the citations and URLs pages
- **Management of citations**: Generates text files containing the citations and URLs for the links retrieved from a Wikipedia page.


[Wikipedia API documentation](https://pypi.org/project/Wikipedia-API/)

The citations for the Wikipedia pages are in the `Chapter10/citations` directory of the repository.

For more on Wikipedia citations: [Citations](https://en.wikipedia.org/w/index.php?title=Special:CiteThisPage&page=Mark_Twain&id=1231834317&wpFormIdentifier=titleform)


# Installing the environment

In [1]:
try:
  import wikipediaapi
except ImportError:
  # Install the pinned version only if the library is missing
  !pip install Wikipedia-API==0.6.0
  import wikipediaapi

Collecting Wikipedia-API==0.6.0
  Downloading Wikipedia_API-0.6.0-py3-none-any.whl (14 kB)
Installing collected packages: Wikipedia-API
Successfully installed Wikipedia-API-0.6.0
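
Because the install cell pins a specific release, it can help to confirm which version is actually present in the environment. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and `Wikipedia-API` is the distribution name taken from the `pip` command above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name):
    # Return the installed version string, or None if the distribution is absent.
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

found = installed_version("Wikipedia-API")
print("Wikipedia-API:", found if found else "not installed")
```

If the printed version differs from the pinned one, the rest of the notebook may behave differently from the outputs shown here.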


## Functions

In [2]:
import nltk
from nltk.tokenize import word_tokenize

# Ensure you have the necessary NLTK resource downloaded
nltk.download('punkt')

def nb_tokens(text):
    # More sophisticated tokenization which includes punctuation
    tokens = word_tokenize(text)
    return len(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
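
To see why counting punctuation changes the token count, the sketch below compares whitespace splitting with a punctuation-aware count. It uses a simple regular expression as a stand-in, not NLTK's `word_tokenize`, so the numbers will not match `nb_tokens` exactly on real text; `approx_nb_tokens` and the sample sentence are hypothetical.

```python
import re

def approx_nb_tokens(text):
    # Each run of word characters and each single punctuation mark counts as one token.
    return len(re.findall(r"\w+|[^\w\s]", text))

sample = "Twain wrote novels, essays, and letters."
whitespace_count = len(sample.split())   # punctuation stays glued to words
approx_count = approx_nb_tokens(sample)  # punctuation counted separately
print(whitespace_count, approx_count)    # prints: 6 9
```

The gap between the two counts grows with the amount of punctuation, which matters when token counts are used to estimate the size of text passed to downstream tools.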


# Retrieving Wikipedia Metadata

## Creating an instance

In [3]:
# Create an instance of the Wikipedia API with a detailed user agent
wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent='Knowledge/1.0 (Denis.Rothman76@gmail.com)'
)

## Defining root page

In [4]:
topic="Mark Twain"     #topic
filename="MarkTwain"   #filename for saving the outputs
maxl=20                #maximum number of links to retrieve
#topic="Honoré de Balzac"

## Root page summary

In [5]:
import textwrap # to wrap the text and display in paragraphs
page=wiki.page(topic)

if page.exists():
  print("Page - Exists: %s" % page.exists())
  summary=page.summary
  # number of tokens
  nbt=nb_tokens(summary)
  print("Number of tokens: ",nbt)
  # Use textwrap to wrap the summary text to a specified width (60 characters here)
  wrapped_text = textwrap.fill(summary, width=60)
  # Print the wrapped summary text
  print(wrapped_text)
else:
  print("Page does not exist")

Page - Exists: True
Number of tokens:  578
Samuel Langhorne Clemens (November 30, 1835 – April 21,
1910), known by the pen name Mark Twain, was an American
writer, humorist, and essayist. He was praised as the
"greatest humorist the United States has produced," with
William Faulkner calling him "the father of American
literature." Twain's novels include The Adventures of Tom
Sawyer (1876) and its sequel, Adventures of Huckleberry Finn
(1884), with the latter often called the "Great American
Novel." He also wrote A Connecticut Yankee in King Arthur's
Court (1889) and Pudd'nhead Wilson (1894), and co-wrote The
Gilded Age: A Tale of Today (1873) with Charles Dudley
Warner. Twain was raised in Hannibal, Missouri, which later
provided the setting for both Tom Sawyer and Huckleberry
Finn. He served an apprenticeship with a printer early in
his career, and then worked as a typesetter, contributing
articles to his older brother Orion Clemens' newspaper.
Twain then became a riverboat pilot on t

## URLs and Citations

In [6]:
print(page.fullurl)

https://en.wikipedia.org/wiki/Mark_Twain
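
The URL above follows a recognizable pattern: spaces in the title become underscores, and some characters are percent-encoded. The sketch below approximates that pattern with the standard library. It is an illustration only: `approx_page_url` is a hypothetical helper, and its set of unescaped characters is an assumption inferred from the URLs shown in this notebook. In the notebook itself, `page.fullurl` remains the reliable source.

```python
from urllib.parse import quote

BASE_URL = "https://en.wikipedia.org/wiki/"  # English Wikipedia, as used in this notebook

def approx_page_url(title):
    # Spaces become underscores; other unsafe characters are percent-encoded.
    # The safe set "():," is an assumption based on the URLs in this notebook.
    return BASE_URL + quote(title.replace(" ", "_"), safe="():,")

for title in ["Mark Twain", "1601 (Mark Twain)", "A Dog's Tale"]:
    print(approx_page_url(title))
```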


## Links in the page

In [7]:
# prompt: read the program up to this cell. Then retrieve all the links for this page: print the link and a summary of each link

# Get all the links on the page
links = page.links

# Print the link and a summary of each link
urls = []
counter=0
for link in links:
  page_detail = wiki.page(link)
  if not page_detail.exists():
    # Ignore pages that don't exist
    continue
  counter+=1
  print(f"Link {counter}: {link}")
  print(f"Link: {link}")
  print(page_detail.fullurl)
  urls.append(page_detail.fullurl)
  print(f"Summary: {page_detail.summary}")
  if counter>=maxl: # limit Wikipedia scraping
    break

print(counter)
print(urls)

Link 1: 1601 (Mark Twain)
Link: 1601 (Mark Twain)
https://en.wikipedia.org/wiki/1601_(Mark_Twain)
Summary: [Date: 1601.] Conversation, as it was by the Social Fireside, in the Time of the Tudors. or simply 1601 is the title of a short risqué squib by Mark Twain, first published anonymously in 1880, and finally acknowledged by the author in 1906.
Written as an extract from the diary of an "old man", Queen Elizabeth I's "cup-bearer", the pamphlet purports to record a conversation between Elizabeth and several famous writers of the day. The topics discussed are scatological, notably flatulence, flatulence humor, and sex.
1601 was, according to Edward Wagenknecht, "the most famous piece of pornography in American literature." However, it was more ribaldry than pornography; its content was more in the nature of irreverent and vulgar comedic shock than obscenity for sexual arousal.
Before the court decisions in the United States in 1959–1966 that legalized the publication of Lady Chatterley'
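
In Wikipedia-API, `page.links` is a dictionary keyed by page title, so the number of pages visited can also be capped before the loop starts. The following sketch shows that idea with `itertools.islice`; the `links` dictionary here is a hypothetical stand-in with placeholder titles, not real data from the API.

```python
from itertools import islice

# Hypothetical stand-in for page.links (title -> page object)
links = {f"Title {i}": None for i in range(1, 101)}
maxl = 20  # same cap as in the notebook

# islice stops after maxl keys without building the full list first
first_titles = list(islice(links, maxl))
print(len(first_titles), first_titles[0], first_titles[-1])
```

Capping the titles up front bounds the number of requests, although it does not skip missing pages the way an existence check inside the loop does.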

## Writing the citations page

In [18]:
from datetime import datetime

# Get all the links on the page
links = page.links

# Prepare a file to store the outputs
fname = filename+"_citations.txt"
with open(fname, "w", encoding="utf-8") as file:  # summaries contain non-ASCII characters
    # Write the citation header
    file.write(f"Citation. In Wikipedia, The Free Encyclopedia. Pages retrieved from the following Wikipedia contributors on {datetime.now()}\n")

    # Initialize a counter
    counter = 0
    urls = []

    # Loop through the links and collect summaries
    for link in links:
        page_detail = wiki.page(link)
        if not page_detail.exists():
            # Ignore pages that don't exist
            continue
        counter += 1
        summary = page_detail.summary

        # Write details to the file
        file.write(f"Link {counter}: {link}\n")
        file.write(f"Link: {link}\n")
        file.write(f"{page_detail.fullurl}\n")
        urls.append(page_detail.fullurl)
        file.write(f"Summary: {summary}\n")

        # Limit to maxl pages to avoid excessive scraping
        if counter >= maxl:
            break

    # Write the total counts and URLs at the end
    file.write(f"Total links processed: {counter}\n")
    file.write("URLs:\n")
    file.write("\n".join(urls))

# Note: Ensure the topic you specify corresponds to a valid Wikipedia article.
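
The layout of the citations file (header, one block per link, then a total) can be checked without calling the API. The sketch below writes the same kind of report to an in-memory buffer. Everything in it is illustrative: `write_citations` is a hypothetical helper, the summaries are placeholders, the timestamp is fixed for reproducibility, and the header wording is shortened.

```python
import io
from datetime import datetime

def write_citations(out, entries, retrieved_at):
    """Write a header, one block per entry, and a total; return the URLs."""
    out.write(f"Citation. In Wikipedia, The Free Encyclopedia. Retrieved on {retrieved_at:%Y-%m-%d %H:%M}\n")
    urls = []
    for number, (title, url, summary) in enumerate(entries, start=1):
        out.write(f"Link {number}: {title}\n")
        out.write(f"{url}\n")
        out.write(f"Summary: {summary}\n")
        urls.append(url)
    # The total comes from the entries actually written
    out.write(f"Total links processed: {len(urls)}\n")
    return urls

# Placeholder entries: real titles and URLs from this notebook, dummy summaries
entries = [
    ("1601 (Mark Twain)", "https://en.wikipedia.org/wiki/1601_(Mark_Twain)", "placeholder summary 1"),
    ("A Dog's Tale", "https://en.wikipedia.org/wiki/A_Dog%27s_Tale", "placeholder summary 2"),
]
buffer = io.StringIO()
urls_written = write_citations(buffer, entries, datetime(2024, 1, 1, 12, 0))
report = buffer.getvalue()
print(report)
```

Because the function accepts any file-like object, the same code could write to a file opened with `open(fname, "w", encoding="utf-8")`.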


In [19]:
urls

['https://en.wikipedia.org/wiki/1601_(Mark_Twain)',
 'https://en.wikipedia.org/wiki/1884_United_States_presidential_election',
 'https://en.wikipedia.org/wiki/1906_San_Francisco_earthquake',
 'https://en.wikipedia.org/wiki/A_Connecticut_Yankee_(film)',
 'https://en.wikipedia.org/wiki/A_Connecticut_Yankee_(musical)',
 'https://en.wikipedia.org/wiki/A_Connecticut_Yankee_in_King_Arthur%27s_Court',
 'https://en.wikipedia.org/wiki/A_Connecticut_Yankee_in_King_Arthur%27s_Court_(1921_film)',
 'https://en.wikipedia.org/wiki/A_Connecticut_Yankee_in_King_Arthur%27s_Court_(1949_film)',
 'https://en.wikipedia.org/wiki/A_Dog%27s_Tale',
 'https://en.wikipedia.org/wiki/A_Double_Barrelled_Detective_Story',
 'https://en.wikipedia.org/wiki/A_Horse%27s_Tale',
 'https://en.wikipedia.org/wiki/A_Kid_in_King_Arthur%27s_Court',
 'https://en.wikipedia.org/wiki/A_Knight_in_Camelot',
 'https://en.wikipedia.org/wiki/A_Literary_Nightmare',
 'https://en.wikipedia.org/wiki/A_Modern_Twain_Story:_The_Prince_and_the_Pa

In [20]:
# Write URLs to a file
ufname = filename+"_urls.txt"
with open(ufname, 'w') as file:
    for url in urls:
        file.write(url + '\n')

print(f"URLs have been written to {ufname}")

URLs have been written to MarkTwain_urls.txt


In [22]:
# Read URLs from the file
with open(ufname, 'r') as file:
    urls = [line.strip() for line in file]

# Display the URLs
print("Read URLs:")
for url in urls:
    print(url)

Read URLs:
https://en.wikipedia.org/wiki/1601_(Mark_Twain)
https://en.wikipedia.org/wiki/1884_United_States_presidential_election
https://en.wikipedia.org/wiki/1906_San_Francisco_earthquake
https://en.wikipedia.org/wiki/A_Connecticut_Yankee_(film)
https://en.wikipedia.org/wiki/A_Connecticut_Yankee_(musical)
https://en.wikipedia.org/wiki/A_Connecticut_Yankee_in_King_Arthur%27s_Court
https://en.wikipedia.org/wiki/A_Connecticut_Yankee_in_King_Arthur%27s_Court_(1921_film)
https://en.wikipedia.org/wiki/A_Connecticut_Yankee_in_King_Arthur%27s_Court_(1949_film)
https://en.wikipedia.org/wiki/A_Dog%27s_Tale
https://en.wikipedia.org/wiki/A_Double_Barrelled_Detective_Story
https://en.wikipedia.org/wiki/A_Horse%27s_Tale
https://en.wikipedia.org/wiki/A_Kid_in_King_Arthur%27s_Court
https://en.wikipedia.org/wiki/A_Knight_in_Camelot
https://en.wikipedia.org/wiki/A_Literary_Nightmare
https://en.wikipedia.org/wiki/A_Modern_Twain_Story:_The_Prince_and_the_Pauper
https://en.wikipedia.org/wiki/A_Murder,_a_