## Math genealogy

This is a website document mathematicians, their advisors and their students:

http://www.genealogy.ams.org

We start with webscraping the data. This is task is made easier because each mathematician gets their own page with a unique ID, so we can just scrap each page.

In [232]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml
import os
import random

In [289]:
base_url = "http://www.genealogy.ams.org/id.php?id="
an_id = "24594"
#an_id = '51796'
an_id = '123359'

In [290]:
response = requests.get(base_url + an_id)

In [291]:
if (response.status_code == 200):
    soup = BeautifulSoup(response.text, 'lxml')

In [201]:
#[x.text.split(":") for x in soup.findAll('p') if 'Advisor' in x.text]

In [215]:
def getName(soup):
    return soup.find("h2").text.strip().replace(',','')

In [216]:
def getAdvisors(soup):
    for x in soup.findAll("p"):
        if "Advisor" in x.text:
            return [y.text.replace(',','') for y in x.findAll('a')]

In [204]:
soup.findAll('p')[2]

<p style="text-align: center; line-height: 2.75ex">Advisor: <a href="id.php?id=74408">Boris Nikolaevich Delone</a><br/></p>

In [205]:
[x for x in soup.findAll('span') if 'Dissertation' in x.text][0].parent

<div style="text-align: center"><span style="color: #000066">Dissertation:</span> <span id="thesisTitle" style="font-style:italic">\n\nInvestigations on Finite Extensions</span></div>

In [214]:
def getThesisTitle(soup):
    return soup.find('span', id='thesisTitle').text.strip().replace(',', '')

In [300]:
def getUniversityYearString(soup):
    for x in soup.findAll('span'):
        if 'Ph.D.' in x.text or 'Dr. phil.' in x.text:
            return x.text.split()


In [311]:
getUniversityYearString(soup)

[u'Ph.D.', u'University', u'of', u'California,', u'Los', u'Angeles', u'1951']

In [314]:
def getUniversity(soup):
    for x in soup.findAll('span'):
        if 'Ph.D.' in x.text:
            return ' '.join(x.text.split()[1:-1]).replace(',', '')
        elif 'Dr. phil.' in x.text:
            return x.text.split()[2:-1]
    return ''

In [315]:
getUniversity(soup)

u'University of California Los Angeles'

In [272]:
def getCountry(soup):
    try:
        return [x['title'] for x in soup.findAll('img') if 'img/flags' in x['src']][0]
    except IndexError:
        return ''

In [273]:
getCountry(soup)

'Russia'

In [276]:
[x.text for x in soup.findAll('p') if 'According to our current on-line database' in x.text]

[u'According to our current on-line database, Igor Shafarevich has 13 students and 447  descendants.\n\nWe welcome any additional information.']

In [287]:
def numStudentsAndDescentants(soup):
    try: 
        line = [x.text for x in soup.findAll('p') if 'According to our current on-line database' in x.text][0].strip('.').split()
        students = -1
        descendants = -1
        for i in range(len(line)-1):
            if line[i].isdigit() and 'students' in line[i+1]:
                students = int(line[i])
            if line[i].isdigit() and 'descendants' in line[i+1]:
                descendants = int(line[i])
        return students, descendants
                
    except IndexError:
        return 0,0

In [292]:
numStudentsAndDescentants(soup)

(0, 0)

In [293]:
getName(soup)

u'Melvin E. Maron'

In [294]:
getCountry(soup)

'UnitedStates'

In [295]:
getAdvisors(soup)

[u'Hans  Reichenbach']

In [296]:
getThesisTitle(soup)

u'Theory of Probability'

In [297]:
getUniversityYearString(soup)

u'Ph.D. University of California, Los Angeles 1951'

In [208]:
# define the data frame
#cols = ['id', 'name','advisors','thesis']
#data = [1, 'bill' ,['bob'], 'alalala']
#mathematicians = pd.DataFrame(data, index=cols[0], columns=cols[1::])

In [226]:
outlines = []
for i in range(100,120):
    an_id = str(i)
    response = requests.get(base_url + an_id)
    
    if (response.status_code == 200):
        soup = BeautifulSoup(response.text, 'lxml')
    else:
        print("bad response, moving to next id.")
        continue
    comb = an_id+',' + getName(soup) + ',' + ';'.join(getAdvisors(soup)) +','+ getThesisTitle(soup)
    outlines.append(comb)
    

In [227]:
with open("test.csv", "w") as outFile:
    outFile.write('math_id, name, advisors, thesis\n')
    outFile.writelines([line+'\n' for line in outlines])

In [324]:
mathematicians = pd.read_csv('test.csv', delimiter=',', index_col='mathId')

In [325]:
mathematicians.thesisUniversity.

Unnamed: 0_level_0,name,advisors,thesis,thesisUniversity,thesisCountry,thesisYear,numStudents,numDescendants
mathId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
100,David L. Simpson,Robert Joe Lambert,A numerical method of characteristics for solv...,Iowa State University,UnitedStates,1967,0,0
101,James Douglas Watson,Clair George Maple,A Numerical Technique for Solution of the Line...,Iowa State University,UnitedStates,1967,0,0
102,Lonny B. Winrich,Clair George Maple,An Explicit Method for the Numerical Solution ...,Iowa State University,UnitedStates,1968,0,0
103,Douglas Warren Curtis,Donald Eugene Sanderson,Deficiency and Stability in Infinite Dimension...,Iowa State University,UnitedStates,1968,2,2
104,Paul Albert Haeder,Clair George Maple,On the Zeros of Solutions of Elliptic Partial ...,Iowa State University,UnitedStates,1968,0,0
105,Robert Allen McCoy,Donald Eugene Sanderson,Cells and Cellularity in Infinite-Dimensional ...,Iowa State University,UnitedStates,1968,4,7
106,John Melvan Clark,George H. Seifert,Asymptotic Stability in General Systems,Iowa State University,UnitedStates,1968,0,0
107,Joseph Nicholas O'Brien,Robert Joe Lambert,Stability and Error Analysis of Linear Multist...,Iowa State University,UnitedStates,1968,0,0
108,Werner William Shoultz,Bernard Vinograde,Chains of Minimal Generating Sets of Inseparab...,Iowa State University,UnitedStates,1968,0,0
109,Thomas Wayne Wineinger,Robert Joe Lambert;Thomas R. Rogge,Singular Perturbation Techniques and its Appli...,Iowa State University,UnitedStates,1968,0,0


In [231]:
x = range(1,200000)

In [233]:
random.shuffle(x)