A Wikiquote API library exists (https://github.com/federicotdn/wikiquote), but it did not work for us, so we fetch the data manually.

In the first few cells, we fetch and clean the data step by step.

**At the bottom of the notebook, a function combines the code from these cells and extracts the quotes for any given Wikiquote page.**

In [136]:
import urllib.request  # a bare "import urllib" does not guarantee that urllib.request is loaded
import re
import json

import nltk
nltk.download('punkt_tab')
nltk.download('words')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/zofialenarczyk/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/zofialenarczyk/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [None]:
endpoint_url = "https://en.wikiquote.org/w/api.php?"
fmt = "format=json"  # named fmt to avoid shadowing the built-in format()
action = "action=query"
page = "titles=Art"
prop = "prop=revisions&rvprop=content"

query = "{}{}&{}&{}&{}".format(endpoint_url, action, fmt, page, prop)
print(query)


https://en.wikiquote.org/w/api.php?action=query&format=json&titles=Art&prop=revisions&rvprop=content


In [138]:
req = urllib.request.Request(query)
req.add_header("User-Agent", "MyWikipediaClient/1.0 (example@example.com)")
wikiresponse = urllib.request.urlopen(req)
wikidata = wikiresponse.read()
wikitext = wikidata.decode('utf-8')

In [139]:
data = json.loads(wikitext)
data

{'batchcomplete': '',
  'revisions': {'*': 'Because "rvslots" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used.'}},
 'query': {'pages': {'126300': {'pageid': 126300,
    'ns': 0,
    'title': 'Art',
    'revisions': [{'contentformat': 'text/x-wiki',
      'contentmodel': 'wikitext',
      '*': '[[File:\'David\' by Michelangelo JBU16.JPG|thumb|The work of art … is an instrument for tilling the human psyche, that it may continue to yield a harvest of vital beauty. ~ [[Herbert Read]]]]\n[[File:André Gide01.jpg|thumb|The great artists are the ones who dare to entitle to beauty things so natural that when they’re seen afterward, people say: Why did I never realize before that this too was beautiful? ~ [[André Gide]]]]\n[[File:1901 Munch Mädchen auf der Brücke anagoria.JPG|thumb|Wherever rights are denied the poor, the prophetic anger is turned not merely against the incumbent rulers but equally a
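
The warning in the response above notes that `rvslots` was not specified. A minimal sketch of building the same request with `urllib.parse.urlencode` and `rvslots=main` follows. The helper names `build_query_url` and `revision_content` are hypothetical, and the `slots`/`main`/`*` layout is our understanding of the non-legacy response format rather than something verified in this notebook.

```python
from urllib.parse import urlencode


def build_query_url(title: str) -> str:
    """Build a Wikiquote API URL that requests the wikitext of one page."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,      # urlencode escapes spaces and non-ASCII characters
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",    # assumption: opts in to the non-legacy revision format
    }
    return "https://en.wikiquote.org/w/api.php?" + urlencode(params)


def revision_content(revision: dict) -> str:
    """Read the wikitext from one revision entry in either response layout."""
    if "slots" in revision:
        # assumed layout when rvslots=main is requested
        return revision["slots"]["main"]["*"]
    # legacy layout, as used in the cells of this notebook
    return revision["*"]


print(build_query_url("Art"))
```

Passing the title through `urlencode` also escapes spaces, so a title such as `Albert Einstein` becomes `titles=Albert+Einstein`.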

To process the data, we rely on the following observations:
- Quotes listed on a page begin with `*`
- Some quotes also appear as image captions; we skip these because they are repeated in the list
- We do not keep the authors
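
These observations can be tried out on a small hand-written wikitext sample before running them on the real page. The sample below is made up for illustration (including the placeholder attribution lines and the `File:Example.jpg` caption); the two regular expressions are the ones used in the next cell.

```python
import re

# hand-written sample that mimics the layout of a Wikiquote page
sample_wikitext = """[[File:Example.jpg|thumb|Art is magic delivered from the lie of being truth. ~ Author]]
== A ==
* Art is magic delivered from the lie of being truth.
** Author, source (placeholder attribution)
* The arts are a wonderful medicine for the soul.
** Author, source (placeholder attribution)
== See also ==
* [[Beauty]]
"""

# drop the "See also" section and everything after it
body = re.sub(r'==\s*See also\s*==[\s\S]*$', '', sample_wikitext, flags=re.IGNORECASE)

# keep lines that start with exactly one "*": image captions and "**" lines are skipped
top_level = [line.strip() for line in
             re.findall(r'^\*(?![\*\#\:\;])(.*)', body, flags=re.MULTILINE)]

for line in top_level:
    print(line)
```

Only the two quote lines are printed: the caption does not start with `*`, the attribution lines start with `**`, and the `[[Beauty]]` bullet is cut off together with the "See also" section.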

In [140]:
# walk down the nested JSON response to reach the wikitext content
page = next(iter(data["query"]["pages"].values()))
text = page["revisions"][0]["*"]

# exclude the "See also" section and everything after it (it follows the quote list)
clean_text = re.sub(r'==\s*See also\s*==[\s\S]*$', '', text, flags=re.IGNORECASE)
# keep only top-level bullet lines (a single "*"), which hold the quotes
quote_lines = re.findall(r'^\*(?![\*\#\:\;])(.*)', clean_text, flags=re.MULTILINE)

# cleaning quotes to have only plain text
clean_quotes = []
for q in quote_lines:

    # remove non-nested templates of the form {{...}}
    q = re.sub(r'\{\{[^{}]*\}\}', '', q)

    # replace internal wikilinks with their display text
    q = re.sub(r'\[\[([^|\]]*\|)?([^\]]+)\]\]', r'\2', q)

    # replace labelled external links with their label and drop bare ones
    q = re.sub(r'\[https?://[^\s\]]+\s+([^\]]+)\]', r'\1', q)
    q = re.sub(r'\[https?://[^\]]+\]', '', q)

    # remove <ref>...</ref> footnotes, then any remaining HTML tags
    q = re.sub(r'<ref[^>]*>.*?</ref>', '', q, flags=re.IGNORECASE)
    q = re.sub(r'<[^>]+>', ' ', q)

    # remove bold and italic markings
    q = re.sub(r"''+", '', q)

    # collapse runs of whitespace into a single space (this also trims both ends)
    q = ' '.join(q.split())

    if q:
        clean_quotes.append(q)

for quote in clean_quotes:
    print(quote)

When you have a cause, the best way to express yourself is artistically.
The coming extinction of art is prefigured in the increasing impossibility of representing historical events.
Art is magic delivered from the lie of being truth.
I have this very what you call today "square" idea that art is something that makes you breathe with a different kind of happiness.
Technique without art is shallow and doomed. Art without technique is insulting.
I make art because it centers me in my body, and by doing so I hope to offer that experience to someone else.
The arts are a wonderful medicine for the soul.
Art would not be important if life were not important, and life is important.
One writes out of one thing only — one's own experience. Everything depends on how relentlessly one forces from this experience the last drop, sweet or bitter, it can possibly give. This is the only real concern of the artist, to recreate out of the disorder of life that order which is art.
Art is made by the alone
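
To see what each clean-up step does, the same substitutions can be applied to a single line. This is a sketch: the wrapper name `clean_wikitext_line` is hypothetical, and the raw line is hand-written to contain one instance of each markup type.

```python
import re


def clean_wikitext_line(q: str) -> str:
    """Apply the clean-up substitutions from the cell above to one line."""
    q = re.sub(r'\{\{[^{}]*\}\}', '', q)                           # non-nested templates
    q = re.sub(r'\[\[([^|\]]*\|)?([^\]]+)\]\]', r'\2', q)          # wikilinks -> display text
    q = re.sub(r'\[https?://[^\s\]]+\s+([^\]]+)\]', r'\1', q)      # labelled external links -> label
    q = re.sub(r'\[https?://[^\]]+\]', '', q)                      # bare external links
    q = re.sub(r'<ref[^>]*>.*?</ref>', '', q, flags=re.IGNORECASE) # footnotes
    q = re.sub(r'<[^>]+>', ' ', q)                                 # remaining HTML tags
    q = re.sub(r"''+", '', q)                                      # bold and italic markers
    return ' '.join(q.split())                                     # normalise whitespace


# hand-written line, not taken from the real page
raw = ("''Art'' is [[Magic (supernatural)|magic]] delivered<br>from the [[lie]] "
       "of being truth.{{citation needed}}<ref>note</ref> [https://example.org/page]")

print(clean_wikitext_line(raw))
```

Because the template pattern excludes braces inside the match, a nested template such as `{{a|{{b}}}}` is only partly removed in one pass (the outer `{{a|}}` remains); repeating the substitution until the text stops changing would be one way to handle that.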

In [141]:
# keep only quotes in English

english_vocab = set(w.lower() for w in nltk.corpus.words.words())

filtered_quotes = []
for quote in clean_quotes:
    # tokenize the quote into words
    tokens = nltk.word_tokenize(quote.lower())
    
    # filter tokens to only keep alphabetic words
    alpha_tokens = [w for w in tokens if w.isalpha()]
    
    # count how many tokens are in the English vocabulary
    english_count = sum(1 for token in alpha_tokens if token in english_vocab)

    # check if at least 70% of the tokens are English
    if alpha_tokens and english_count >= len(alpha_tokens) * 0.70:
        filtered_quotes.append(quote)

print(f"Total quotes: {len(clean_quotes)}")
print(f"English quotes: {len(filtered_quotes)}")


Total quotes: 340
English quotes: 321
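
The 70% rule can be illustrated without downloading any NLTK data. The sketch below swaps in a regex tokenizer and a toy vocabulary for `nltk.word_tokenize` and `nltk.corpus.words`, so its ratios will not match the real filter exactly; `english_ratio` and `toy_vocab` are hypothetical names.

```python
import re


def english_ratio(text: str, vocab: set) -> float:
    """Share of alphabetic tokens in `text` that appear in `vocab`."""
    # simple regex tokenizer used here as a stand-in for nltk.word_tokenize
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in vocab for token in tokens) / len(tokens)


# toy vocabulary standing in for nltk.corpus.words.words()
toy_vocab = {"art", "is", "made", "by", "the", "alone"}

samples = [
    "Art is made by the alone",   # every token is in the toy vocabulary
    "Kunst ist the schoenste",    # made-up mixed-language string
]
for sample in samples:
    ratio = english_ratio(sample, toy_vocab)
    print(f"{ratio:.2f}", "keep" if ratio >= 0.70 else "drop", sample)
```

A threshold below 100% leaves room for proper names and for word forms that may be absent from the word list, at the cost of occasionally keeping a mixed-language quote.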


In [142]:
# save to txt file
with open('Art_quotes.txt', 'w', encoding='utf-8') as f:
    for quote in filtered_quotes:
        f.write(quote + '\n')

The function below fetches the quotes from a Wikiquote page; it combines the code above into a single function.

In [143]:
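def get_wikiquotes(page_name: str) -> list:
    """Fetch a Wikiquote page, return its English quotes, and save them to '<page_name>_quotes.txt'."""

    # access Wikiquote API to get wikitext content
    endpoint_url = "https://en.wikiquote.org/w/api.php?"
    fmt = "format=json"  # named fmt to avoid shadowing the built-in format()
    action = "action=query"
    # percent-encode the title so that names with spaces or non-ASCII characters work
    page_param = f"titles={urllib.parse.quote(page_name)}"
    prop = "prop=revisions&rvprop=content"

    query = "{}{}&{}&{}&{}".format(endpoint_url, action, fmt, page_param, prop)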

    req = urllib.request.Request(query)
    req.add_header("User-Agent", "MyWikipediaClient/1.0 (example@example.com)")
    wikiresponse = urllib.request.urlopen(req)
    wikidata = wikiresponse.read()
    wikitext = wikidata.decode('utf-8')

    data = json.loads(wikitext)

    # clean and extract quotes

    # walk down the nested JSON response to reach the wikitext content
    page = next(iter(data["query"]["pages"].values()))
    text = page["revisions"][0]["*"]

    # exclude the "See also" section and everything after it (it follows the quote list)
    clean_text = re.sub(r'==\s*See also\s*==[\s\S]*$', '', text, flags=re.IGNORECASE)
    # keep only top-level bullet lines (a single "*"), which hold the quotes
    quote_lines = re.findall(r'^\*(?![\*\#\:\;])(.*)', clean_text, flags=re.MULTILINE)

    # cleaning quotes to have only plain text
    clean_quotes = []
    for q in quote_lines:

        # remove non-nested templates of the form {{...}}
        q = re.sub(r'\{\{[^{}]*\}\}', '', q)

        # replace internal wikilinks with their display text
        q = re.sub(r'\[\[([^|\]]*\|)?([^\]]+)\]\]', r'\2', q)

        # replace labelled external links with their label and drop bare ones
        q = re.sub(r'\[https?://[^\s\]]+\s+([^\]]+)\]', r'\1', q)
        q = re.sub(r'\[https?://[^\]]+\]', '', q)

        # remove <ref>...</ref> footnotes, then any remaining HTML tags
        q = re.sub(r'<ref[^>]*>.*?</ref>', '', q, flags=re.IGNORECASE)
        q = re.sub(r'<[^>]+>', ' ', q)

        # remove bold and italic markings
        q = re.sub(r"''+", '', q)

        # collapse runs of whitespace into a single space (this also trims both ends)
        q = ' '.join(q.split())

        if q:
            clean_quotes.append(q)
    
    
    # keep only quotes in English
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())

    filtered_quotes = []
    for quote in clean_quotes:
        # tokenize the quote into words
        tokens = nltk.word_tokenize(quote.lower())
        
        # filter tokens to only keep alphabetic words
        alpha_tokens = [w for w in tokens if w.isalpha()]
        
        # count how many tokens are in the English vocabulary
        english_count = sum(1 for token in alpha_tokens if token in english_vocab)

        # check if at least 70% of the tokens are English
        if alpha_tokens and english_count >= len(alpha_tokens) * 0.70:
            filtered_quotes.append(quote)

    print(f"Total quotes: {len(clean_quotes)}")
    print(f"English quotes: {len(filtered_quotes)}")

    # save to txt file
    with open(f'{page_name}_quotes.txt', 'w', encoding='utf-8') as f:
        for quote in filtered_quotes:
            f.write(quote + '\n')


    return filtered_quotes
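
`get_wikiquotes` assumes that the requested page exists. As far as we know, the API reports an unknown title as a page entry that carries a `missing` key and no `revisions`, in which case `page["revisions"]` raises a `KeyError`. A small guard, sketched here with the hypothetical helper `first_page_wikitext`, would turn that into a clearer error:

```python
def first_page_wikitext(data: dict) -> str:
    """Return the wikitext of the first page in a parsed API response (legacy format)."""
    page = next(iter(data["query"]["pages"].values()))
    if "missing" in page or "revisions" not in page:
        raise ValueError(f"No Wikiquote page found for title {page.get('title')!r}")
    return page["revisions"][0]["*"]


# hand-written responses that mimic the two cases (not real API output)
found = {"query": {"pages": {"126300": {"title": "Art", "revisions": [{"*": "* A quote"}]}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(first_page_wikitext(found))
try:
    first_page_wikitext(missing)
except ValueError as err:
    print(err)
```

Inside `get_wikiquotes`, such a helper could replace the two lines that read `page` and `text` from the response.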

In [145]:
page = "Love" # "War"
quotes = get_wikiquotes(page)
quotes

Total quotes: 1591
English quotes: 1451


['Love flowers best in openness and freedom.',
 'Love is composed of a single soul inhabiting two bodies.',
 'Love can defeat that nameless terror. Loving one another, we take the sting from death. Loving our mysterious blue planet, we resolve riddles and dissolve all enigmas in contingent bliss.',
 '“Love,” Asa said, “is like a pigeon shitting over a crowd.” “How so?” “Where it lands hasn’t got much to do with who deserves it.”',
 "Well, when you think you love somebody, you love them. That's what love is. Thoughts…",
 'There is, in the human Breast, a social Affection, which extends to our whole Species.',
 'The Encyclopedia Galactica, in its chapter on Love states that it is far too complicated to define. The Hitchhiker\'s Guide to the Galaxy has this to say on the subject of love: "Avoid, if at all possible."',
 'Mysterious love, uncertain treasure, Hast thou more of pain or pleasure! Endless torments dwell about thee: Yet who would live, and live without thee!',
 "When love's well