In [1]:
%load_ext autoreload
%autoreload 2
import os, sys

# Make the sibling `module` directory importable from this notebook.
sys.path.insert(1, os.path.join(sys.path[0], '..', 'module'))

## Exploring the wiki dump

In [2]:
import wiki

path_base = '/Users/harangju/Developer/data/wiki/dumps/'
name_xml = 'enwiki-20190801-pages-articles-multistream.xml.bz2'
name_index = 'enwiki-20190801-pages-articles-multistream-index.txt.bz2'
path_xml = path_base + name_xml
path_index = path_base + name_index
dump = wiki.Dump(path_xml, path_index)

In [3]:
%time dump.load_page('Portal:Physics/Topics')
dump.links[:5]

Dump: Loading index...
Dump: Loaded.
CPU times: user 1min 7s, sys: 1.92 s, total: 1min 9s
Wall time: 1min 9s


['Classical physics', 'Mechanics', 'Optics', 'Electricity', 'Magnetism']
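
The roughly one-minute wall time above appears to be spent largely on loading the index. As background, each line of a multistream index file has the form `offset:page_id:title`, where the offset is the byte position of the bz2 stream that contains the page. The sketch below parses such lines. `parse_index_line` and the sample values are hypothetical, and this is not necessarily how `wiki.Dump` reads the index.

```python
def parse_index_line(line):
    """Split one multistream index line into (byte offset, page id, title)."""
    # Titles may contain colons (e.g. 'Portal:Physics/Topics'),
    # so split on the first two colons only.
    offset, page_id, title = line.rstrip('\n').split(':', 2)
    return int(offset), int(page_id), title

# Illustrative lines with made-up offsets and ids, not taken from the dump.
sample_lines = [
    '616:10:AccessibleComputing\n',
    '616:12:Anarchism\n',
    '700000:12345:Portal:Physics/Topics\n',
]

# Map each title to the stream offset and page id needed to locate it.
index = {}
for line in sample_lines:
    offset, page_id, title = parse_index_line(line)
    index[title] = (offset, page_id)

print(index['Portal:Physics/Topics'])
```

Several pages share one offset because each compressed stream holds a batch of pages; a reader seeks to the offset, decompresses that stream, and then searches it for the page id or title.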

In [4]:
dump.load_page('Danielle Bassett')
dump.links[:4]

['University of pennsylvania',
 'Pennsylvania state university',
 'University of cambridge',
 'Sloan research fellowship']

In [5]:
dump.load_page('Matter', filter_top=True).strip_code()[:200]

"In classical physics and general chemistry, '''matter''' is any substance that has mass and takes up space by having volume. All everyday objects that can be touched are ultimately composed of atoms, "

## Get indices of topic articles

* [All indices on Wikipedia](https://en.wikipedia.org/wiki/Portal:Contents/Indices)
* Topics not searched:
  * international trade ("topics"), theory of constraints (small)
  * too big: mathematics, neuroscience

In [6]:
# natural & physical sciences
topics = ['anatomy', 'biochemistry', 'cognitive science', 'evolutionary biology',
          'genetics', 'immunology', 'molecular biology']
topics += ['chemistry', 'biophysics', 'energy', 'optics', 
           'earth science', 'geology', 'meteorology']
# philosophy
topics += ['philosophy of language', 'philosophy of law', 
           'philosophy of mind', 'philosophy of science']
# social sciences
topics += ['economics', 'accounting', 'education', 'linguistics', 'law', 'psychology', 'sociology']
# technology & applied sciences
topics += ['electronics', 'software engineering', 'robotics']

In [7]:
# Map each topic to the article titles linked from its index page.
links = {}
for topic in topics:
    dump.load_page(f'Index of {topic} articles')
    links[topic] = [str(l) for l in dump.article_links]
    print(f'Topic "{topic}" has {len(links[topic])} articles.')

Topic "anatomy" has 2331 articles.
Topic "biochemistry" has 1216 articles.
Topic "cognitive science" has 127 articles.
Topic "evolutionary biology" has 287 articles.
Topic "genetics" has 1441 articles.
Topic "immunology" has 572 articles.
Topic "molecular biology" has 507 articles.
Topic "chemistry" has 1088 articles.
Topic "biophysics" has 773 articles.
Topic "energy" has 158 articles.
Topic "optics" has 386 articles.
Topic "earth science" has 135 articles.
Topic "geology" has 116 articles.
Topic "meteorology" has 761 articles.
Topic "philosophy of language" has 275 articles.
Topic "philosophy of law" has 208 articles.
Topic "philosophy of mind" has 109 articles.
Topic "philosophy of science" has 448 articles.
Topic "economics" has 562 articles.
Topic "accounting" has 154 articles.
Topic "education" has 872 articles.
Topic "linguistics" has 420 articles.
Topic "law" has 3657 articles.
Topic "psychology" has 1801 articles.
Topic "sociology" has 772 articles.
Topic "electronics" has 127

In [8]:
import string

# The physics index is split across one page per leading character.
links['physics'] = []
for letter in ['!$@', '0–9'] + list(string.ascii_uppercase):
    dump.load_page(f'Index of physics articles ({letter})')
    links['physics'].extend([str(l) for l in dump.article_links])
print(f'Topic "physics" has {len(links["physics"])} articles.')

Topic "physics" has 15215 articles.
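
The per-topic counts above do not show whether topics share articles. The sketch below is one way to check. It assumes `links` maps each topic name to a list of title strings, as built in the cells above; `shared_titles` and the sample data are hypothetical.

```python
from itertools import combinations

def shared_titles(links):
    """Return {(topic_a, topic_b): set of titles listed under both topics}."""
    sets = {topic: set(titles) for topic, titles in links.items()}
    overlap = {}
    # Compare every unordered pair of topics once, in alphabetical order.
    for a, b in combinations(sorted(sets), 2):
        common = sets[a] & sets[b]
        if common:
            overlap[(a, b)] = common
    return overlap

# Hypothetical sample, not taken from the dump.
sample = {
    'optics': ['Lens', 'Laser', 'Photon'],
    'physics': ['Photon', 'Laser', 'Quark'],
    'geology': ['Basalt'],
}
for pair, titles in shared_titles(sample).items():
    print(pair, sorted(titles))
```

On the real data one would call `shared_titles(links)`; with `physics` at about 15,000 titles, the set intersections stay cheap compared with loading pages from the dump.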


## Build graphs of topics

### 1 network per topic

In [9]:
import pickle
import gensim.utils as gu

path_save = '/Users/harangju/Developer/data/wiki/models/'
tfidf = gu.SaveLoad.load(path_save + 'tfidf.model')
# Use a context manager so the file handle is closed after loading.
with open(path_save + 'dict.model', 'rb') as f:
    dct = pickle.load(f)

In [10]:
path_to_save = '/Users/harangju/Developer/data/wiki/graphs/dated/'

networks = {}
for topic, ls in links.items():
    print('Topic: ' + topic)
    networks[topic] = wiki.Net()
    networks[topic].build_graph(dump=dump, nodes=ls, model=tfidf, dct=dct)
    networks[topic].save_graph(path_to_save + topic + '.gexf')
    networks[topic].save_barcodes(path_to_save + topic + '.barcode')

Topic: anatomy
wiki.Net: traversing Wikipedia...
wiki.Net: depth = 0
wiki.Net: len(queue) = 7760
wiki.Net: depth = 1
wiki.Net: removing isolates...
wiki.Net: adding years...
wiki.Net: filling empty years...
wiki.Net: calculating weights...
wiki.Net: computing barcodes... (skip negatives)
wiki.Net: barcode 13082/13087
Topic: biochemistry
wiki.Net: traversing Wikipedia...
wiki.Net: depth = 0
wiki.Net: len(queue) = 4390
wiki.Net: depth = 1
wiki.Net: removing isolates...
wiki.Net: adding years...
wiki.Net: filling empty years...
wiki.Net: calculating weights...
wiki.Net: computing barcodes... (skip negatives)
wiki.Net: barcode 11250/11251
Topic: cognitive science
wiki.Net: traversing Wikipedia...
wiki.Net: depth = 0
wiki.Net: len(queue) = 480
wiki.Net: depth = 1
wiki.Net: removing isolates...
wiki.Net: adding years...
wiki.Net: filling empty years...
wiki.Net: calculating weights...
wiki.Net: computing barcodes... (skip negatives)
wiki.Net: barcode 794/796
Topic: evolutionary biology
wiki.

wiki.Net: barcode 6203/6209
Topic: software engineering
wiki.Net: traversing Wikipedia...
wiki.Net: depth = 0
wiki.Net: len(queue) = 101
wiki.Net: depth = 1
wiki.Net: removing isolates...
wiki.Net: adding years...
wiki.Net: filling empty years...
wiki.Net: calculating weights...
wiki.Net: computing barcodes... (skip negatives)
wiki.Net: barcode 1204/1219
Topic: robotics
wiki.Net: traversing Wikipedia...
wiki.Net: depth = 0
wiki.Net: len(queue) = 2920
wiki.Net: depth = 1
wiki.Net: removing isolates...
wiki.Net: adding years...
wiki.Net: filling empty years...
wiki.Net: calculating weights...
wiki.Net: computing barcodes... (skip negatives)
wiki.Net: barcode 2195/2247
Topic: physics
wiki.Net: traversing Wikipedia...
wiki.Net: depth = 0
wiki.Net: len(queue) = 45160
wiki.Net: depth = 1
wiki.Net: removing isolates...
wiki.Net: adding years...
wiki.Net: filling empty years...
wiki.Net: calculating weights...
wiki.Net: computing barcodes... (skip negatives)
wiki.Net: barcode 114155/114156
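
Before opening a saved graph in Gephi, its size can be sanity-checked. The sketch below assumes the `.gexf` files written above are standard GEXF XML; `count_gexf` and the sample document are hypothetical and use only the standard library.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def count_gexf(path):
    """Count <node> and <edge> elements in a GEXF file, ignoring the XML namespace."""
    n_nodes = n_edges = 0
    for _, elem in ET.iterparse(path):
        tag = elem.tag.rsplit('}', 1)[-1]  # drop a '{namespace}' prefix if present
        if tag == 'node':
            n_nodes += 1
        elif tag == 'edge':
            n_edges += 1
    return n_nodes, n_edges

# Tiny hand-written GEXF document for demonstration only.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="0" label="Matter"/>
      <node id="1" label="Mass"/>
    </nodes>
    <edges>
      <edge id="0" source="0" target="1"/>
    </edges>
  </graph>
</gexf>
"""

# Write the sample to a temporary file and count its nodes and edges.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.gexf')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    counts = count_gexf(path)
print(counts)
```

For a saved topic graph the call would be `count_gexf(path_to_save + topic + '.gexf')`, assuming the file exists at that location.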


## Gephi notes

* node size: Fruchterman Reingold = [10, 40], ForceAtlas 2 = [4, 16]
* text size = [1, 1.4]
* preview font size = 5