In [1]:
import os
import wikipedia as wp

# We do only read operations, therefore no user config is necessary.
# Normally the system crashes when there is no user config unless we tell it otherwise with this environment variable.
#   0 is default
#   1 means ignore the config
#   2 means ignore the config and don't throw warnings
os.environ["PYWIKIBOT_NO_USER_CONFIG"] = "2"

# Now we can import pywikibot
import pywikibot as pwb

wiki_site = pwb.Site(code="en", fam="wikipedia")

# We list here the search terms for EPFL
epfl_alts = [
    "EPFL",
    "École Polytechnique Fédérale de Lausanne",
    "Swiss Federal Institute of Technology",
    "EPF Lausanne",
    "ETH Lausanne",
    "Poly Lausanne",
]
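
The order of operations in the cell above matters: the environment variable is set before `pywikibot` is imported. A minimal sketch of a guard for this ordering follows. It assumes, as the comments above imply, that Pywikibot reads the variable when it is first imported; the helper name `set_no_user_config` is hypothetical and not part of Pywikibot.

```python
import os
import sys


def set_no_user_config(level="2"):
    """Set PYWIKIBOT_NO_USER_CONFIG, refusing to do so when it is too late.

    Hypothetical helper. `level` follows the values listed in the cell above:
    "0" (default), "1" (ignore the config), "2" (ignore it without warnings).
    """
    if level not in {"0", "1", "2"}:
        raise ValueError("level must be '0', '1' or '2'")
    if "pywikibot" in sys.modules:
        # Assumption: the variable is read at import time, so changing it
        # after the import would have no effect.
        raise RuntimeError("pywikibot is already imported")
    os.environ["PYWIKIBOT_NO_USER_CONFIG"] = level


set_no_user_config("2")
print(os.environ["PYWIKIBOT_NO_USER_CONFIG"])
```

Calling the helper in a session where `pywikibot` was already imported raises an error instead of silently doing nothing.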

In [2]:
# Let's search the first result of an "EPFL" query

an = wp.search("EPFL")[0]
pg = wp.page(an)
pg.title

'École Polytechnique Fédérale de Lausanne'

In [3]:
# Its URL

pg.url

'https://en.wikipedia.org/wiki/%C3%89cole_Polytechnique_F%C3%A9d%C3%A9rale_de_Lausanne'
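
The URL above is percent-encoded, which hides the accented characters of the title. As a small illustration using only the standard library, `urllib.parse.unquote` turns it back into a readable form; the variable names are ours.

```python
from urllib.parse import unquote

url = "https://en.wikipedia.org/wiki/%C3%89cole_Polytechnique_F%C3%A9d%C3%A9rale_de_Lausanne"

# Decode the UTF-8 percent-escapes (%C3%89 is "É", %C3%A9 is "é")
readable_url = unquote(url)

# The article title is the last path segment, with underscores for spaces
title = readable_url.rsplit("/", 1)[-1].replace("_", " ")

print(readable_url)
print(title)
```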

In [4]:
# Its first 100 words

" ".join(pg.content.split(" ")[:100])

"The École polytechnique fédérale de Lausanne (EPFL) is a research institute and university in Lausanne, Switzerland, that specializes in natural sciences and engineering. It is one of the two Swiss Federal Institutes of Technology, and it has three main missions: education, research and technology transfer.The QS World University Rankings ranks EPFL 14th in the world across all fields in their 2020/2021 ranking, whereas Times Higher Education World University Rankings ranks EPFL as the world's 19th best school for Engineering and Technology in 2020. EPFL is located in the French-speaking part of Switzerland; the sister institution in the German-speaking part of"

In [5]:
# The links found in the page

pg.links[:5]

['Aalborg University',
 'Aalto University',
 'Aart de Geus',
 'Academic Ranking of World Universities',
 'Adolphe Merkle Institute']

In [6]:
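# How about the first 6 sentences of the French article, in French?
# Note: wp.search returns a list of titles rather than a single title,
# which may be why the summary below is not the EPFL article.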

wp.set_lang("fr")
pg = wp.search(an)
wp.summary(pg, sentences=6)

"Mattia Binotto, né le 3 novembre 1969 à Lausanne (Suisse), est un  concepteur, ingénieur automobile italien, nommé en janvier 2019 directeur de la gestion sportive (Team Principal) de la Scuderia Ferrari en Formule 1 après avoir été son responsable du département technique.\n\n\n== Biographie ==\nDiplômé en génie mécanique à l'École polytechnique fédérale de Lausanne (EPFL) en 1994, Mattia Binotto obtient une maîtrise en génie automobile à Modène. Il est trilingue français-italien-anglais.\nEn 1995, il rejoint l'équipe de test de la Scuderia Ferrari puis accède au rang d'ingénieur en 1997. Il rejoint le département moteur en 2004 pour devenir, à partir de 2007, l'un des principaux responsables du montage des moteurs ainsi que de l'électronique avec l'arrivée du système de récupération de l'énergie cinétique en Formule 1. Vice-directeur du département moteur en 2013, lors de l'arrivée du V6 turbo hybride, il en devient le directeur principal fin 2014, après une saison difficile pour le

Python's `wikipedia` package is not designed for large-scale data collection. In particular, `wikipedia.search` returns at most 500 results per query, which is too few for our purposes. We will therefore use Pywikibot (https://pywikibot.toolforge.org/) instead.

In [7]:
# The language is still set to French, so the same query now returns the French article
an = wp.search("EPFL")[0]
pg = wp.page(an)
print(pg.links[:5])

['1853', '1869', '1890', '1943', '1946']


In [8]:
# External links: split the HTML on href attributes and keep the absolute http(s) URLs
[x.split('"')[0] for x in pg.html().split('href="') if x[:4] == "http"][:5]

['https://fr.wikipedia.org/w/index.php?title=%C3%89cole_polytechnique_f%C3%A9d%C3%A9rale_de_Lausanne&amp;action=edit',
 'https://www.wikidata.org/wiki/Q262760?uselang=fr#P488',
 'https://www.wikidata.org/wiki/Q262760?uselang=fr#P131',
 'http://www.epfl.ch',
 'https://fr.wikipedia.org/w/index.php?title=%C3%89cole_polytechnique_f%C3%A9d%C3%A9rale_de_Lausanne&amp;veaction=edit&amp;section=0']
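
Splitting on `href="` works, but it leaves HTML entities such as `&amp;` in the URLs, as the output above shows. A possible alternative, sketched here with the standard library's `html.parser`, collects the same kind of links and decodes the entities. The class name `ExternalLinkParser` and the sample HTML are made up for illustration, and unlike the cell above this version only looks at `<a>` tags.

```python
from html.parser import HTMLParser


class ExternalLinkParser(HTMLParser):
    """Collect absolute http(s) URLs from the href attribute of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Attribute values arrive with entities already decoded
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


# Made-up HTML fragment standing in for pg.html()
sample_html = (
    '<p><a href="/wiki/Lausanne">Lausanne</a> '
    '<a href="http://www.epfl.ch">site</a> '
    '<a href="https://fr.wikipedia.org/w/index.php?title=EPFL&amp;action=edit">edit</a></p>'
)

parser = ExternalLinkParser()
parser.feed(sample_html)
print(parser.links)
```

With the real page, `parser.feed(pg.html())` would take the place of the sample string.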

## Limitations of the search functionality of `wikipedia` and `Pywikibot`
Search with `Pywikibot` works well but appears to cap at 10'000 results per query, whereas Wikipedia's own search goes well beyond that: searching for "obama" on Wikipedia yields over 32'000 results. This is still much better than the Python `wikipedia` package, which caps at 500, as the examples below show. For our task the cap is acceptable, since EPFL is mentioned in fewer than 1'000 articles overall and that number is unlikely to increase tenfold overnight.

In [9]:
obama_query_1 = wp.search("obama", results=999999)
obama_count_1 = len(obama_query_1)

print("Using the Wikipedia package, searching for 'obama' yields a maximum of", obama_count_1, "results.")

Using the Wikipedia package, searching for 'obama' yields a maximum of 500 results.


In [10]:
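# wiki_site.search returns a lazy iterable, so we count the results by consuming it;
# zipping with range(999999) bounds the number of items we are willing to fetch
obama_query_2 = wiki_site.search("obama", namespaces=0)
obama_count_2 = len([x for _, x in zip(range(999999), obama_query_2)])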

print("Using Pywikibot, searching for 'obama' yields a maximum of", obama_count_2, "results.")

Using Pywikibot, searching for 'obama' yields a maximum of 10000 results.
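
The counting idiom used above (zipping the results with a `range`) can also be written with `itertools.islice`, which avoids building an intermediate list. This is only a sketch: the helper name `count_results` is ours, and a plain generator stands in for the Pywikibot search results so that the snippet runs offline.

```python
from itertools import islice


def count_results(results, cap=999_999):
    """Count the items of an iterable, consuming at most `cap` of them."""
    return sum(1 for _ in islice(results, cap))


# Stand-in for a lazy search result: a generator of 10 000 fake titles
fake_results = (f"Article {i}" for i in range(10_000))
print(count_results(fake_results))
```

In the notebook, `count_results(wiki_site.search("obama", namespaces=0))` would be the equivalent call.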


In [11]:
for alt in epfl_alts:
    epfl_query = wiki_site.search(alt, namespaces=0)
    epfl_count = len([x for _, x in zip(range(999999), epfl_query)])

    print("Searching for '" + alt + "' yields", epfl_count, "results.")
    
print("All of those results are within the range of Pywikibot. Success!")

Searching for 'EPFL' yields 766 results.
Searching for 'École Polytechnique Fédérale de Lausanne' yields 667 results.
Searching for 'Swiss Federal Institute of Technology' yields 4036 results.
Searching for 'EPF Lausanne' yields 184 results.
Searching for 'ETH Lausanne' yields 381 results.
Searching for 'Poly Lausanne' yields 68 results.
All of those results are within the range of Pywikibot. Success!
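
The "within range" check above is done by eye. A minimal sketch of doing it programmatically: a count that reaches the cap observed for "obama" may be truncated, while anything below it is complete. The counts are copied from the output above; the names `PYWIKIBOT_CAP` and `possibly_truncated` are hypothetical.

```python
PYWIKIBOT_CAP = 10_000  # cap observed above for the "obama" query

# Result counts copied from the output above
counts = {
    "EPFL": 766,
    "École Polytechnique Fédérale de Lausanne": 667,
    "Swiss Federal Institute of Technology": 4036,
    "EPF Lausanne": 184,
    "ETH Lausanne": 381,
    "Poly Lausanne": 68,
}


def possibly_truncated(counts, cap=PYWIKIBOT_CAP):
    """Return the search terms whose result count reaches the cap."""
    return [term for term, n in counts.items() if n >= cap]


print(possibly_truncated(counts))
print(possibly_truncated({"obama": 10_000}))
```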


In [12]:
for alt in ["ecublens", "écublens", "Ecublens", "Écublens"]:
    test_query = wiki_site.search(alt, namespaces=0)
    test_count = len([x for _, x in zip(range(999999), test_query)])

    print("Searching for '" + alt + "' yields", test_count, "results.")

print("This suggests that Wikipedia's search API is case- and accent-insensitive.")

Searching for 'ecublens' yields 58 results.
Searching for 'écublens' yields 58 results.
Searching for 'Ecublens' yields 58 results.
Searching for 'Écublens' yields 58 results.
This suggests that Wikipedia's search API is case- and accent-insensitive.
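
To make concrete what case- and accent-insensitive matching means, the sketch below folds the four spellings to a single key with the standard library's `unicodedata`. It is an illustration only and makes no claim about how Wikipedia's search backend actually normalizes text; the function name `fold` is ours.

```python
import unicodedata


def fold(text):
    """Lowercase `text` and strip combining accents."""
    # NFD splits "é" into "e" followed by a combining acute accent
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


variants = ["ecublens", "écublens", "Ecublens", "Écublens"]
print({fold(v) for v in variants})
```

All four variants collapse to the same key, which is consistent with the identical result counts above.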
