There is an existing Wikiquote API library (https://github.com/federicotdn/wikiquote); however, it did not work for us, so we fetch the data manually.

In the first few cells, we fetch and clean the data step by step.

**At the bottom of the notebook, there is a function that combines all of this code and extracts the quotes for any given Wikiquote page.**

In [109]:
import urllib.request  # a bare "import urllib" does not guarantee that the request submodule is loaded
import re
import json

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/zofialenarczyk/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/zofialenarczyk/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [97]:
# pieces of the MediaWiki API query for the "Art" page
endpoint_url = "https://en.wikiquote.org/w/api.php?"
fmt = "format=json"  # "fmt" rather than "format", which would shadow the built-in
action = "action=query"
page = "titles=Art"
prop = "prop=revisions&rvprop=content"

query = "{}{}&{}&{}&{}".format(endpoint_url, action, fmt, page, prop)
print(query)


https://en.wikiquote.org/w/api.php?action=query&format=json&titles=Art&prop=revisions&rvprop=content
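The query string above is assembled by hand, which works for a one-word title such as "Art". As a minimal sketch of an alternative, the standard library's `urlencode` can build the same string and escape titles that contain spaces or special characters. The names `API_ENDPOINT` and `build_query_url` are hypothetical helpers introduced here for illustration; they are not part of the notebook.

```python
from urllib.parse import urlencode

# Hypothetical constant: the same endpoint as above, without the trailing "?"
API_ENDPOINT = "https://en.wikiquote.org/w/api.php"


def build_query_url(title: str) -> str:
    """Build the revisions query URL for one Wikiquote page title."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
    }
    # urlencode escapes each value, so a space in the title becomes "+"
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(build_query_url("Art"))
print(build_query_url("Albert Einstein"))
```

For "Art" this yields the same URL as the hand-built query; for a multi-word title the space is encoded instead of being left raw in the URL.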


In [98]:
req = urllib.request.Request(query)
req.add_header("User-Agent", "MyWikipediaClient/1.0 (example@example.com)")
wikiresponse = urllib.request.urlopen(req)
wikidata = wikiresponse.read()
wikitext = wikidata.decode('utf-8')
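The request above assumes that the network call succeeds. A minimal sketch of a more defensive variant, assuming we want a timeout and a readable error message, could look as follows. The function name `fetch_json` and its `timeout` and `opener` parameters are hypothetical; `opener` exists only so that the function can be exercised without network access.

```python
import json
import urllib.request
from urllib.error import HTTPError, URLError


def fetch_json(url, user_agent, timeout=10, opener=urllib.request.urlopen):
    """Download a URL and parse the body as JSON (illustrative helper)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        # the response is closed automatically when the block exits
        with opener(req, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it has to be caught first
        raise RuntimeError(f"The API returned HTTP status {err.code}") from err
    except URLError as err:
        raise RuntimeError(f"Could not reach the API: {err.reason}") from err
```

In the notebook, this would be called as `fetch_json(query, "MyWikipediaClient/1.0 (example@example.com)")`, which returns the parsed dictionary directly.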

In [100]:
data = json.loads(wikitext)
data

{'batchcomplete': '',
  'revisions': {'*': 'Because "rvslots" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used.'}},
 'query': {'pages': {'126300': {'pageid': 126300,
    'ns': 0,
    'title': 'Art',
    'revisions': [{'contentformat': 'text/x-wiki',
      'contentmodel': 'wikitext',
      '*': '[[File:\'David\' by Michelangelo JBU16.JPG|thumb|The work of art … is an instrument for tilling the human psyche, that it may continue to yield a harvest of vital beauty. ~ [[Herbert Read]]]]\n[[File:André Gide01.jpg|thumb|The great artists are the ones who dare to entitle to beauty things so natural that when they’re seen afterward, people say: Why did I never realize before that this too was beautiful? ~ [[André Gide]]]]\n[[File:1901 Munch Mädchen auf der Brücke anagoria.JPG|thumb|Wherever rights are denied the poor, the prophetic anger is turned not merely against the incumbent rulers but equally a
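The warning in the response says that the legacy output format is used because `rvslots` was not specified. If `rvslots=main` is added to the query, the MediaWiki API nests the wikitext under `slots` and `main` instead of placing it directly in the revision. The sketch below uses two small hand-written dictionaries (not real API output) to show a lookup that accepts both layouts; `extract_wikitext` is a hypothetical helper name.

```python
def extract_wikitext(data: dict) -> str:
    """Return the wikitext of the first page in a revisions query response."""
    page = next(iter(data["query"]["pages"].values()))
    revision = page["revisions"][0]
    if "slots" in revision:
        # layout returned when the query includes rvslots=main
        return revision["slots"]["main"]["*"]
    # legacy layout, as in the response above
    return revision["*"]


# Hand-written miniature responses, for illustration only
legacy = {"query": {"pages": {"1": {"revisions": [{"*": "* legacy text"}]}}}}
slotted = {
    "query": {"pages": {"1": {"revisions": [{"slots": {"main": {"*": "* slot text"}}}]}}}
}

print(extract_wikitext(legacy))
print(extract_wikitext(slotted))
```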

To process the data, we rely on the following observations and choices:
- Quotes listed on the page are on lines that begin with *
- Some quotes also appear as image captions; we skip these because they are repeated in the list
- We do not extract the authors
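These rules can be tried out on a few lines of wikitext before running them on a full page. The sample below is invented for illustration and only imitates the usual Wikiquote layout, in which the quote sits on a line with a single * and the attribution on a nested ** line; the regular expressions are the ones used in this notebook.

```python
import re

# Invented sample that imitates the layout of a Wikiquote page
sample = (
    "[[File:Example.jpg|thumb|A caption quote ~ Someone]]\n"
    "== A ==\n"
    "* First quote.\n"
    "** Author, ''Some Work'' (1900)\n"
    "* Second quote.\n"
    "*: continuation line\n"
    "== See also ==\n"
    "* [[Beauty]]\n"
)

# drop the "See also" section and everything after it
body = re.sub(r'==\s*See also\s*==[\s\S]*$', '', sample, flags=re.IGNORECASE)

# keep lines that start with exactly one "*" (not "**", "*#", "*:" or "*;")
lines = re.findall(r'^\*(?![\*\#\:\;])(.*)', body, flags=re.MULTILINE)
top_level = [line.strip() for line in lines]

print(top_level)  # ['First quote.', 'Second quote.']
```

The image caption, the attribution line, the continuation line and the "See also" link are all left out, which matches the three rules listed above.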

In [101]:
# getting through the layered structure of the JSON response to get to the wikitext content
page = next(iter(data["query"]["pages"].values()))
text = page["revisions"][0]["*"]

# exclude the "See also" section and everything after it (it's after the quote list)
clean_text = re.sub(r'==\s*See also\s*==[\s\S]*$', '', text, flags=re.IGNORECASE)
# take only top-level bullet lines (a single "*"); nested lines such as "**" or "*:" are skipped
quote_lines = re.findall(r'^\*(?![\*\#\:\;])(.*)', clean_text, flags=re.MULTILINE)

# cleaning quotes to have only plain text
clean_quotes = []
for q in quote_lines:

    # remove templates such as {{...}}
    q = re.sub(r'\{\{[^{}]*\}\}', '', q)

    # replace internal wikilinks with their display text
    q = re.sub(r'\[\[([^|\]]*\|)?([^\]]+)\]\]', r'\2', q)

    # replace labelled external links with their label, drop bare ones
    q = re.sub(r'\[https?://[^\s\]]+\s+([^\]]+)\]', r'\1', q)
    q = re.sub(r'\[https?://[^\]]+\]', '', q)

    # remove <ref> footnotes (including their content), then any remaining HTML tags
    q = re.sub(r'<ref[^>]*>.*?</ref>', '', q, flags=re.IGNORECASE)
    q = re.sub(r'<[^>]+>', ' ', q)

    # remove bold and italic markers ('' and ''')
    q = re.sub(r"''+", '', q)

    # collapse runs of whitespace into a single space (this also trims both ends)
    q = ' '.join(q.split())

    if q:
        clean_quotes.append(q)

for quote in clean_quotes:
    print(quote)

When you have a cause, the best way to express yourself is artistically.
The coming extinction of art is prefigured in the increasing impossibility of representing historical events.
Art is magic delivered from the lie of being truth.
I have this very what you call today "square" idea that art is something that makes you breathe with a different kind of happiness.
Technique without art is shallow and doomed. Art without technique is insulting.
I make art because it centers me in my body, and by doing so I hope to offer that experience to someone else.
The arts are a wonderful medicine for the soul.
Art would not be important if life were not important, and life is important.
One writes out of one thing only — one's own experience. Everything depends on how relentlessly one forces from this experience the last drop, sweet or bitter, it can possibly give. This is the only real concern of the artist, to recreate out of the disorder of life that order which is art.
Art is made by the alone
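To see what each cleaning step does, the same substitutions can be wrapped in a small helper and applied to a single line. This is a sketch: `clean_wikitext_line` is a hypothetical name, and the input line is made up to contain one instance of each kind of markup. One known limitation is that the template pattern only matches templates without inner braces, so a nested template is only partly removed in one pass.

```python
import re


def clean_wikitext_line(line: str) -> str:
    """Apply the notebook's cleaning substitutions to one line of wikitext."""
    line = re.sub(r'\{\{[^{}]*\}\}', '', line)                        # templates
    line = re.sub(r'\[\[([^|\]]*\|)?([^\]]+)\]\]', r'\2', line)       # wikilinks -> display text
    line = re.sub(r'\[https?://[^\s\]]+\s+([^\]]+)\]', r'\1', line)   # labelled external links -> label
    line = re.sub(r'\[https?://[^\]]+\]', '', line)                   # bare external links
    line = re.sub(r'<ref[^>]*>.*?</ref>', '', line, flags=re.IGNORECASE)  # footnotes
    line = re.sub(r'<[^>]+>', ' ', line)                              # remaining HTML tags
    line = re.sub(r"''+", '', line)                                   # bold and italic markers
    return ' '.join(line.split())                                     # normalise whitespace


# Made-up input line, for illustration only
raw = " '''Art''' is [[w:Magic (supernatural)|magic]] delivered<ref>note</ref> from the [[lie]].<br/>{{cite}}"
print(clean_wikitext_line(raw))  # Art is magic delivered from the lie.
```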

In [None]:
# save to txt file
with open('art_quotes.txt', 'w', encoding='utf-8') as f:
    for quote in clean_quotes:
        f.write(quote + '\n')

The function below combines the code above into a single function that fetches, cleans, and saves the quotes from any given Wikiquote page.

In [103]:
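def get_wikiquotes(page_name: str) -> list:
    """Fetch a Wikiquote page, return its quotes as plain text, and save them to '<page_name>_quotes.txt'."""

    # access Wikiquote API to get wikitext content
    endpoint_url = "https://en.wikiquote.org/w/api.php?"
    fmt = "format=json"
    action = "action=query"
    # percent-encode the title so that names with spaces or special characters form a valid URL
    # (urllib.parse is loaded together with urllib.request)
    page_param = f"titles={urllib.parse.quote(page_name)}"
    prop = "prop=revisions&rvprop=content"

    query = "{}{}&{}&{}&{}".format(endpoint_url, action, fmt, page_param, prop)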

    req = urllib.request.Request(query)
    req.add_header("User-Agent", "MyWikipediaClient/1.0 (example@example.com)")
    wikiresponse = urllib.request.urlopen(req)
    wikidata = wikiresponse.read()
    wikitext = wikidata.decode('utf-8')

    data = json.loads(wikitext)

    # clean and extract quotes

    # getting through the layered structure of the JSON response to get to the wikitext content
    page = next(iter(data["query"]["pages"].values()))
    text = page["revisions"][0]["*"]

    # exclude the "See also" section and everything after it (it's after the quote list)
    clean_text = re.sub(r'==\s*See also\s*==[\s\S]*$', '', text, flags=re.IGNORECASE)
    # take only top-level bullet lines (a single "*"); nested lines such as "**" or "*:" are skipped
    quote_lines = re.findall(r'^\*(?![\*\#\:\;])(.*)', clean_text, flags=re.MULTILINE)

    # cleaning quotes to have only plain text
    clean_quotes = []
    for q in quote_lines:

        # remove templates such as {{...}}
        q = re.sub(r'\{\{[^{}]*\}\}', '', q)

        # replace internal wikilinks with their display text
        q = re.sub(r'\[\[([^|\]]*\|)?([^\]]+)\]\]', r'\2', q)

        # replace labelled external links with their label, drop bare ones
        q = re.sub(r'\[https?://[^\s\]]+\s+([^\]]+)\]', r'\1', q)
        q = re.sub(r'\[https?://[^\]]+\]', '', q)

        # remove <ref> footnotes (including their content), then any remaining HTML tags
        q = re.sub(r'<ref[^>]*>.*?</ref>', '', q, flags=re.IGNORECASE)
        q = re.sub(r'<[^>]+>', ' ', q)

        # remove bold and italic markers ('' and ''')
        q = re.sub(r"''+", '', q)

        # collapse runs of whitespace into a single space (this also trims both ends)
        q = ' '.join(q.split())

        if q:
            clean_quotes.append(q)

    # save to txt file
    with open(f'{page_name}_quotes.txt', 'w', encoding='utf-8') as f:
        for quote in clean_quotes:
            f.write(quote + '\n')


    return clean_quotes

In [None]:
page = "War"  # other titles to try: "Death", "Love"
quotes = get_wikiquotes(page)
quotes

['It would be superfluous in me to point out to your Lordship that this is war.',
 'My voice is still for war.',
 'They sent forth men to battle, But no such men return; And home, to claim their welcome, Come ashes in an urn.',
 "What is the only provocation that could bring about the use of nuclear weapons? Nuclear weapons. What is the priority target for nuclear weapons? Nuclear weapons. What is the only established defense against nuclear weapons? Nuclear weapons. How do we prevent the use of nuclear weapons? By threatening the use of nuclear weapons. And we can't get rid of nuclear weapons, because of nuclear weapons. The intransigence, it seems, is a function of the weapons themselves.",
 'The arms race is a race between nuclear weapons and ourselves.',
 "There are two rules of war that have not yet been invalidated by the new world order. The first rule is that the belligerent nation must be fairly sure that its actions will make things better; the second rule is that the bellige