# Required Python packages

* requests
* BeautifulSoup4

In [1]:
import requests
from bs4 import BeautifulSoup

# Investigate HTML structure

Here, I append **`?&limit=1000000&from=0`** to the full link in order to retrieve all the backlinks in one shot, instead of the default **50** per page.

In [2]:
a = requests.get('https://en.wikipedia.org/wiki/Special:WhatLinksHere/Alphabet?&limit=1000000&from=0')
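
The query string above can also be built programmatically, which helps once the page title varies. This is only a sketch: `backlinks_url` is a hypothetical helper name of mine, and it assumes the title is already in the underscore form used in `/wiki/...` links.

```python
from urllib.parse import quote, urlencode

BASE = "https://en.wikipedia.org/wiki/Special:WhatLinksHere/"

def backlinks_url(title, limit=1000000, offset=0):
    """Build a WhatLinksHere URL for `title` (hypothetical helper)."""
    query = urlencode({"limit": limit, "from": offset})
    # '%' is kept safe so titles that are already percent-encoded
    # are not encoded a second time.
    return BASE + quote(title, safe="()_,:%") + "?" + query

print(backlinks_url("Alphabet"))
```

The resulting string can be passed to `requests.get` exactly like the hard-coded link.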

Here is the HTML content:

In [3]:
b = BeautifulSoup(a.text, 'html.parser')
print(b.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Pages that link to "Alphabet" - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"Special","wgCanonicalSpecialPageName":"Whatlinkshere","wgNamespaceNumber":-1,"wgPageName":"Special:WhatLinksHere/Alphabet","wgTitle":"WhatLinksHere/Alphabet","wgCurRevisionId":0,"wgRevisionId":0,"wgArticleId":0,"wgIsArticle":false,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgBreakFrames":true,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","Augus

In [4]:
# I'm interested in the following list
c = b.find(id="mw-whatlinkshere-list")

In [5]:
# Grab all the list elements inside the list
d = c.find_all('li')
# Let's see how many backlinks there are
len(d)

2274
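
To see what `find` and `find_all` are doing without hitting the network, here is the same pair of calls on a tiny hand-written fragment. The fragment is made up and heavily simplified; only the `mw-whatlinkshere-list` id and the hrefs are taken from the real page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page (an assumption, heavily simplified)
sample_html = """
<ul id="mw-whatlinkshere-list">
  <li><a href="/wiki/A">A</a></li>
  <li><a href="/wiki/ASCII">ASCII</a></li>
  <li><a href="/wiki/Talk:Alphabet">Talk:Alphabet</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# find() returns the single element with this id (or None if it is missing)
backlink_list = soup.find(id="mw-whatlinkshere-list")
# find_all() returns every <li> nested inside that element
items = backlink_list.find_all("li")
# li.a is the first <a> tag inside each list item
sample_hrefs = [li.a.get("href") for li in items]
print(len(items), sample_hrefs)
```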

In [6]:
# Print to see all of the links
g = []
for elem in d:
    e = elem.a
    f = e.get('href')
    g.append(f)
    print(f)

/wiki/A
/wiki/Animalia_(book)
/wiki/ASCII
/wiki/Talk:Alphabet
/wiki/Abjad
/wiki/Abugida
/wiki/Aramaic_alphabet
/wiki/Asterix
/wiki/ABC
/wiki/Arabic_alphabet
/wiki/Armenian_language
/wiki/Talk:Arabic_alphabet/Archive_1
/wiki/Alphabet_song
/wiki/Braille
/wiki/Blissymbols
/wiki/Bi-directional_text
/wiki/Communication
/wiki/Cyrillic_script
/wiki/Consonant
/wiki/Cryptanalysis
/wiki/Civilization_(video_game)
/wiki/Celts
/wiki/Comet_Shoemaker%E2%80%93Levy_9
/wiki/Collation
/wiki/Cirth
/wiki/David_Lynch
/wiki/Delta_(letter)
/wiki/Deseret_alphabet
/wiki/Dyslexia
/wiki/Diacritic
/wiki/Daniel_Jones_(phonetician)
/wiki/Devanagari
/wiki/Enigma_machine
/wiki/Electrical_telegraph
/wiki/Eth
/wiki/History_of_Esperanto
/wiki/Formal_language
/wiki/Frederick_Douglass
/wiki/Glagolitic_script
/wiki/Grapheme
/wiki/Gurmukhi_script
/wiki/Hittites
/wiki/Hebrew_alphabet
/wiki/Hiragana
/wiki/History_of_Israel
/wiki/Ideogram
/wiki/International_Phonetic_Alphabet
/wiki/Korean_language
/wiki/Katakana
/wiki/Kana
/wik

Let's filter out some weird ones that are not articles. We have:

* Talk:Arabic_alphabet/Archive_1
* User:Szabi
* Wikipedia:Articles_for_creation/2006-01-10
* Wikipedia_talk:WikiProject_Languages/Template
* Category_talk:Radio_stations_in_the_Dallas%E2%80%93Fort_Worth_metroplex

I will filter out the ones with colons, since the colon marks a namespace prefix such as `Talk:` or `User:`.

In [7]:
h = list(filter(lambda x: ':' not in x, g))

In [8]:
len(h)

1115
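
One caveat with the colon test: it also drops real articles whose titles happen to contain a colon. A stricter variant, sketched below, only rejects links whose prefix is a known namespace. The `NAMESPACES` set is a partial list I picked by hand and `is_article` is a hypothetical helper, so treat both as assumptions.

```python
from urllib.parse import unquote

# Partial, hand-picked list of namespace prefixes (an assumption, not exhaustive)
NAMESPACES = {
    "Talk", "User", "User_talk", "Wikipedia", "Wikipedia_talk",
    "File", "Template", "Template_talk", "Category", "Category_talk",
    "Portal", "Help", "Draft",
}

def is_article(href):
    """Return True if `href` looks like a main-namespace article link."""
    title = unquote(href.split("/wiki/", 1)[-1])
    prefix, sep, _ = title.partition(":")
    # Keep the link if there is no colon at all, or if the text before
    # the first colon is not one of the known namespaces.
    return not sep or prefix not in NAMESPACES

# Illustrative sample; the last entry is an article title containing a colon
sample = [
    "/wiki/ASCII",
    "/wiki/Talk:Alphabet",
    "/wiki/User:Szabi",
    "/wiki/Star_Trek:_The_Next_Generation",
]
articles = [href for href in sample if is_article(href)]
print(articles)
```

Here the simple `':' not in x` filter would keep only `/wiki/ASCII`, while this version also keeps the last entry.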

Let's wrap this up into one function!

In [9]:
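# I am assuming site has the following format:
# /wiki/Alphabet
def scrape(site):
    # Everything after the '/wiki/' prefix is the page title, e.g. 'Alphabet'.
    # (site.split('/')[1] would be 'wiki', because the string starts with '/'.)
    title = site.split('/wiki/', 1)[1]
    url = 'https://en.wikipedia.org/wiki/Special:WhatLinksHere/{}?&limit=1000000&from=0'.format(title)
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # The <ul> holding the backlinks
    backlink_list = soup.find(id="mw-whatlinkshere-list")
    if backlink_list is None:  # assumed to mean the page has no backlinks
        return []
    hrefs = [li.a.get('href') for li in backlink_list.find_all('li')]
    # Drop links with a namespace prefix such as Talk: or User:
    return [href for href in hrefs if ':' not in href]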

In [10]:
test = '/wiki/Jean-Charles_Castelletto'
%time k = scrape(test)

CPU times: user 1.77 s, sys: 32 ms, total: 1.8 s
Wall time: 3.73 s


At roughly 3.7 s per page, it looks like we can scrape about 1,000 pages in an hour.

Btw, Wikipedia has 5,485,177 articles, so scraping them one at a time would take about 5,485 hours, or roughly 228 days.
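
The estimate can be redone with the measured wall time instead of the rounded 1,000 pages per hour. This is just the arithmetic above written out; the two constants are the figures quoted in this notebook.

```python
SECONDS_PER_PAGE = 3.73      # wall time measured above
TOTAL_ARTICLES = 5_485_177   # article count quoted above

pages_per_hour = 3600 / SECONDS_PER_PAGE
total_hours = TOTAL_ARTICLES * SECONDS_PER_PAGE / 3600
total_days = total_hours / 24

print(round(pages_per_hour), round(total_hours), round(total_days))
```

With the unrounded timing the rate is a bit under 1,000 pages per hour, so the total comes out slightly longer, at about 237 days.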

# Data storing

I want the data to be stored in a safe, atomic way.
I will use sqlite3 since it's simple.

In [11]:
import sqlite3
db = sqlite3.connect('testdb.sqlite')
c = db.cursor()

In [12]:
# Create table for the graph data
try:
    c.execute('CREATE TABLE EDGES (source TEXT, destination TEXT)')
except sqlite3.OperationalError:
    # Raised when the table already exists
    print("Don't make the table again")

In [13]:
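def add_edge(s, d):
    # Use ? placeholders so titles containing quotes cannot break the SQL
    c.execute("INSERT INTO EDGES (source, destination) VALUES (?, ?)", (s, d))
    db.commit()  # without a commit the row is not persisted to disk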

In [14]:
add_edge('/wiki/Jean-Charles_Castelletto', '/wiki/ABC')
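
Since the goal is to store things in a safe, atomic way, one option is to write all edges of a page inside a single transaction, so that either every edge is stored or none is. The sketch below uses an in-memory database and a hypothetical `add_edges` helper; the `UNIQUE` constraint and the sample edges are my own assumptions, not part of the table created above.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
demo_db = sqlite3.connect(":memory:")
demo_db.execute(
    "CREATE TABLE IF NOT EXISTS EDGES ("
    "source TEXT, destination TEXT, UNIQUE (source, destination))"
)

def add_edges(db, source, destinations):
    """Insert all edges of one page in a single transaction (hypothetical helper)."""
    rows = [(source, dest) for dest in destinations]
    # `with db:` commits on success and rolls back if an exception is raised
    with db:
        # OR IGNORE skips rows that would violate the UNIQUE constraint
        db.executemany(
            "INSERT OR IGNORE INTO EDGES (source, destination) VALUES (?, ?)",
            rows,
        )

# Made-up edges for illustration
add_edges(demo_db, "/wiki/Jean-Charles_Castelletto", ["/wiki/ABC", "/wiki/A"])
add_edges(demo_db, "/wiki/Jean-Charles_Castelletto", ["/wiki/ABC"])  # duplicate, ignored
edge_count = demo_db.execute("SELECT COUNT(*) FROM EDGES").fetchone()[0]
print(edge_count)
```

Batching the rows with `executemany` also avoids one commit per edge, which matters when a single page has thousands of backlinks.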