<span style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">An Exception was encountered at '<a href="#papermill-error-cell">In [3]</a>'.</span>

# Scrape Data from sumo db

In this notebook we use requests, Beautiful Soup, and pandas to scrape the banzuke (rankings) and hoshitori (tournament results) from sumo db and store them locally for further processing.

## MVP

For the MVP we only need the rankings and results from the previous two tournaments.

### Banzuke format
Sample URL:
http://sumodb.sumogames.de/Banzuke_text.aspx?b=202009

URL template, where `yyyymm` is the four-digit year followed by the two-digit month of the tournament:
http://sumodb.sumogames.de/Banzuke_text.aspx?b=yyyymm
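
As a minimal sketch of how the template can be filled in, the helper below builds the URL for one tournament. The names `BANZUKE_URL` and `banzuke_url` are hypothetical and not part of the notebook; the only thing taken from the source is the URL pattern itself.

```python
# Hypothetical helper: only the URL pattern comes from the notebook.
BANZUKE_URL = 'http://sumodb.sumogames.de/Banzuke_text.aspx?b={year:04d}{month:02d}'


def banzuke_url(year, month):
    """Build the text-banzuke URL for one year/month pair."""
    if not 1 <= month <= 12:
        raise ValueError(f'month must be in 1..12, got {month}')
    # the month is zero-padded to two digits, matching the yyyymm template
    return BANZUKE_URL.format(year=year, month=month)


print(banzuke_url(2020, 9))
```

Passing the month as an integer and padding it inside the format string avoids building the `'01'`..`'12'` strings by hand.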

In [10]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Banzuke mvp

It is faster to request the plain-text version of the banzuke than the version rendered as tables.

In [11]:
url = 'http://sumodb.sumogames.de/Banzuke_text.aspx?b=202009'

res = requests.get(url)
# name the parser explicitly so Beautiful Soup does not have to guess one
soup = BeautifulSoup(res.content, 'html.parser')
# table = soup.find_all('table')[0] 
# df = pd.read_html(str(table))
# df.head()

<span id="papermill-error-cell" style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">Execution using papermill encountered an exception here and stopped:</span>

In [12]:
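def retrieve_rows(soup, start_rank='Makuuchi', end_rank='Juryo'):
    """
    Take a banzuke soup object and grab the rows listed between the
    start_rank and end_rank headings (Makuuchi up to Juryo by default).
    Returns a DataFrame, or the string 'None' if the page has no <pre> block.
    """

    pre_blocks = soup.find_all('pre')
    if len(pre_blocks) == 0:
        return 'None'

    # splitlines() copes with both Windows and Unix line endings
    lines = str(pre_blocks[0]).splitlines()

    # skip the heading line and the line that follows it;
    # stop one line short of the next heading
    start_idx = lines.index(start_rank) + 2
    end_idx = lines.index(end_rank) - 1

    # each row is whitespace-separated, one field per column
    rows = [line.split() for line in lines[start_idx:end_idx]]

    cols = ['rank', 'name', 'pob', 'stable', 'birthdate', 'height', 'weight']

    return pd.DataFrame(rows, columns=cols)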

x = retrieve_rows(soup)
print(x)

    rank          name        pob       stable   birthdate height weight
0    Y1e        Hakuho   Mongolia     Miyagino  11.03.1985    193  150.7
1    Y1w       Kakuryu   Mongolia    Michinoku  10.08.1985    186    150
2    O1e     Asanoyama     Toyama     Takasago  01.03.1994    189    158
3    O1w    Takakeisho      Hyogo   Chiganoura  05.08.1996    173    149
4    S1e        Shodai   Kumamoto  Tokitsukaze  05.11.1991    182    150
5    S1w     Mitakeumi     Nagano    Dewanoumi  25.12.1992    179    149
6    S2e      Daieisho    Saitama     Oitekaze  10.11.1993    179  135.7
7    K1e      Okinoumi    Shimane      Hakkaku  29.07.1985    190    154
8    K1w          Endo   Ishikawa     Oitekaze  19.10.1990    183  145.6
9    M1e    Terunofuji   Mongolia    Isegahama  29.11.1991    192  158.5
10   M1w     Takanosho      Chiba   Chiganoura  14.11.1994  181.5  128.5
11   M2e    Hokutofuji    Saitama      Hakkaku  15.07.1992    182    158
12   M2w     Tamawashi   Mongolia    Kataonami  16.

The function above should work both for the rows between Makuuchi and Juryo and for the rows between Juryo and Makushita.

In [13]:
retrieve_rows(soup, start_rank = 'Juryo', end_rank = 'Makushita')

|   | rank | name | pob | stable | birthdate | height | weight |
|---|------|------|-----|--------|-----------|--------|--------|
| 0 | J1e | Ikioi | Osaka | Isenoumi | 11.10.1986 | 194.0 | 157.0 |
| 1 | J1w | Nishikigi | Iwate | Isenoumi | 25.08.1990 | 184.5 | 149.5 |
| 2 | J2e | Kotoyuki | Kagawa | Sadogatake | 02.04.1991 | 176.0 | 162.4 |
| 3 | J2w | Kotonowaka | Chiba | Sadogatake | 19.11.1997 | 186.0 | 147.0 |
| 4 | J3e | Wakamotoharu | Fukushima | Arashio | 05.10.1993 | 185.0 | 127.2 |
| 5 | J3w | Chiyomaru | Kagoshima | Kokonoe | 17.04.1991 | 178.0 | 176.3 |
| 6 | J4e | Chiyoshoma | Mongolia | Kokonoe | 20.07.1991 | 183.0 | 118.2 |
| 7 | J4w | Daiamami | Kagoshima | Oitekaze | 15.12.1992 | 185.0 | 167.0 |
| 8 | J5e | Daishomaru | Osaka | Oitekaze | 10.07.1991 | 175.0 | 153.0 |
| 9 | J5w | Kyokushuho | Mongolia | Tomozuna | 09.08.1988 | 191.0 | 149.6 |
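
The index arithmetic behind `retrieve_rows` can be tried offline on a small sample. The sketch below is an illustration only: `SAMPLE` and `rows_between` are hypothetical, and the layout of the sample (one separator line after each heading, one blank line before the next heading) is an assumption chosen to match the `+ 2` and `- 1` offsets, not a copy of the real page.

```python
# Hypothetical sample text; the layout is assumed, the wrestler rows
# reuse values shown in the outputs above.
SAMPLE = '\r\n'.join([
    'Makuuchi',
    '--------',  # assumed separator line after the heading
    'Y1e Hakuho Mongolia Miyagino 11.03.1985 193 150.7',
    'Y1w Kakuryu Mongolia Michinoku 10.08.1985 186 150',
    '',          # assumed blank line before the next heading
    'Juryo',
    '--------',
    'J1e Ikioi Osaka Isenoumi 11.10.1986 194 157',
    '',
    'Makushita',
])


def rows_between(text, start_rank, end_rank):
    """Return the whitespace-split rows listed between two headings."""
    lines = text.splitlines()
    start_idx = lines.index(start_rank) + 2  # skip heading and separator
    end_idx = lines.index(end_rank) - 1      # stop before the blank line
    return [line.split() for line in lines[start_idx:end_idx]]


makuuchi = rows_between(SAMPLE, 'Makuuchi', 'Juryo')
juryo = rows_between(SAMPLE, 'Juryo', 'Makushita')
print(len(makuuchi), len(juryo))
```

Because each division is located only by its heading and the heading that follows it, the same function serves any adjacent pair of divisions.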


In [14]:
def combine_divisions(div1, div2):
    """
    join data frames from two different divisions
    """
    
    jf = pd.concat([div1, div2])
    
    return jf

div1 = retrieve_rows(soup, 'Makuuchi', 'Juryo')
print(div1.shape)
div2 = retrieve_rows(soup, 'Juryo', 'Makushita')
print(div2.shape)
jf = combine_divisions(div1, div2)
jf.shape

(42, 7)
(28, 7)


(70, 7)

In [15]:
jf.head()

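|   | rank | name | pob | stable | birthdate | height | weight |
|---|------|------|-----|--------|-----------|--------|--------|
| 0 | Y1e | Hakuho | Mongolia | Miyagino | 11.03.1985 | 193 | 150.7 |
| 1 | Y1w | Kakuryu | Mongolia | Michinoku | 10.08.1985 | 186 | 150.0 |
| 2 | O1e | Asanoyama | Toyama | Takasago | 01.03.1994 | 189 | 158.0 |
| 3 | O1w | Takakeisho | Hyogo | Chiganoura | 05.08.1996 | 173 | 149.0 |
| 4 | S1e | Shodai | Kumamoto | Tokitsukaze | 05.11.1991 | 182 | 150.0 |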


In [16]:
[str(x).zfill(2) for x in range(1, 13)]

['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']

In [17]:
def scrape_banzuke(year=2019):
    """
    Scrape every banzuke available for the given year and return them
    as one DataFrame with added year and month columns.
    """

    # try all twelve months; retrieve_rows returns a string for pages
    # without a <pre> block, and those months are skipped below
    months = [str(x).zfill(2) for x in range(1, 13)]

    url = 'http://sumodb.sumogames.de/Banzuke_text.aspx?b={}{}'

    banzuke = []

    for month in months:
        r = url.format(year, month)
        res = requests.get(r)
        soup = BeautifulSoup(res.content, 'html.parser')
        mak = retrieve_rows(soup, start_rank='Makuuchi', end_rank='Juryo')
        if isinstance(mak, str):
            continue
        jur = retrieve_rows(soup, start_rank='Juryo', end_rank='Makushita')

        print(r)
        df = combine_divisions(mak, jur)
        df['year'] = year
        df['month'] = month
        banzuke.append(df)

    # note: pd.concat raises ValueError if no month produced any rows
    banzuke = pd.concat(banzuke, ignore_index=True)

    return banzuke

scrape_banzuke(year = 2020)

http://sumodb.sumogames.de/Banzuke_text.aspx?b=202001
http://sumodb.sumogames.de/Banzuke_text.aspx?b=202003
http://sumodb.sumogames.de/Banzuke_text.aspx?b=202007
http://sumodb.sumogames.de/Banzuke_text.aspx?b=202009
http://sumodb.sumogames.de/Banzuke_text.aspx?b=202011


|   | rank | name | pob | stable | birthdate | height | weight | year | month |
|---|------|------|-----|--------|-----------|--------|--------|------|-------|
| 0 | Y1e | Hakuho | Mongolia | Miyagino | 11.03.1985 | 193 | 150.7 | 2020 | 01 |
| 1 | Y1w | Kakuryu | Mongolia | Michinoku | 10.08.1985 | 186 | 150 | 2020 | 01 |
| 2 | O1e | Takakeisho | Hyogo | Chiganoura | 05.08.1996 | 173 | 149 | 2020 | 01 |
| 3 | O1w | Goeido | Osaka | Sakaigawa | 06.04.1986 | 183 | 158.2 | 2020 | 01 |
| 4 | S1e | Asanoyama | Toyama | Takasago | 01.03.1994 | 189 | 158 | 2020 | 01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 345 | J12w | Jokoryu | Tokyo | Kise | 07.08.1988 | 187 | 162.4 | 2020 | 11 |
| 346 | J13e | Ura | Osaka | Kise | 22.06.1992 | 172 | 113 | 2020 | 11 |
| 347 | J13w | Nishikifuji | Aomori | Isegahama | 22.07.1996 | 180 | 134 | 2020 | 11 |
| 348 | J14e | Fujiazuma | Tokyo | Tamanoi | 19.04.1987 | 182 | 183.6 | 2020 | 11 |


In [18]:
year = 2020
banzuke = scrape_banzuke(year = year)
banzuke.to_csv('banzuke_{}.csv'.format(year), index = False)

http://sumodb.sumogames.de/Banzuke_text.aspx?b=202001
http://sumodb.sumogames.de/Banzuke_text.aspx?b=202003
http://sumodb.sumogames.de/Banzuke_text.aspx?b=202007
http://sumodb.sumogames.de/Banzuke_text.aspx?b=202009
http://sumodb.sumogames.de/Banzuke_text.aspx?b=202011
