# Scrape Data from sumo db

In this notebook we use requests, Beautiful Soup, and pandas to scrape the banzuke (rankings) and hoshitori (tournament results) from sumo db and store them locally for further processing.

## MVP

For the MVP we only need the rankings and results from the previous two tournaments.
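As a rough sketch of what "the previous two tournaments" means in practice, the helper below steps back from a given tournament and returns identifiers in the `yyyymm` form that the sumo db URLs use. The function name `previous_basho_ids` is hypothetical. It assumes one tournament in every odd month and knows nothing about cancelled tournaments, so check the results against the site.

```python
def previous_basho_ids(year, month, count=2):
    """Return `count` yyyymm ids for the tournaments before (year, month).

    Assumption: tournaments fall in every odd month (January to November).
    Cancelled tournaments are not accounted for.
    """
    if not 1 <= month <= 12 or month % 2 == 0:
        raise ValueError("month must be an odd number between 1 and 11")

    ids = []
    for _ in range(count):
        month -= 2          # step back one tournament
        if month < 1:       # wrap from January to the previous November
            month += 12
            year -= 1
        ids.append(f"{year}{month:02d}")
    return ids


print(previous_basho_ids(2020, 11))  # ['202009', '202007']
```

The ids come back newest first, so the first element is the tournament immediately before the one passed in.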

### Banzuke format
Sample URL:
http://sumodb.sumogames.de/Banzuke_text.aspx?b=202009

URL template (`yyyymm` is the year and month of the tournament):
http://sumodb.sumogames.de/Banzuke_text.aspx?b=yyyymm
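A minimal sketch of filling in the template from a year and month. The names `BANZUKE_TEXT_URL` and `banzuke_text_url` are hypothetical. Only the URL pattern itself comes from the template above.

```python
# URL pattern taken from the template above; {basho_id} stands for yyyymm.
BANZUKE_TEXT_URL = "http://sumodb.sumogames.de/Banzuke_text.aspx?b={basho_id}"


def banzuke_text_url(year, month):
    """Build the banzuke text URL for the tournament in (year, month)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # Zero-pad the month so September 2020 becomes 202009.
    return BANZUKE_TEXT_URL.format(basho_id=f"{year:04d}{month:02d}")


print(banzuke_text_url(2020, 9))
# http://sumodb.sumogames.de/Banzuke_text.aspx?b=202009
```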

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Banzuke MVP

Requesting the plain-text version of the banzuke is faster than requesting the page with the HTML tables, so we parse the text.

In [54]:
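def retrieve_rows(soup, start_rank='Makuuchi', end_rank='Juryo'):
    """
    Take a banzuke soup object and return the rows listed between the
    start_rank and end_rank division headers as a DataFrame.
    """
    # The banzuke text sits in the first <pre> block, one wrestler per line.
    lines = str(soup.find_all('pre')[0]).split('\r\n')

    # Skip the division header and the blank line that follows it.
    start_idx = lines.index(start_rank) + 2
    # Stop before the blank line that precedes the next division header.
    end_idx = lines.index(end_rank) - 1

    # Fields are split on whitespace, which assumes no field contains a space.
    rows = [line.split() for line in lines[start_idx:end_idx]]

    cols = ['rank', 'name', 'pob', 'stable', 'birthdate', 'height', 'weight']

    return pd.DataFrame(rows, columns=cols)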

x = retrieve_rows(soup)
print(x)

    rank          name        pob       stable   birthdate height weight
0    Y1e        Hakuho   Mongolia     Miyagino  11.03.1985    193  150.7
1    Y1w       Kakuryu   Mongolia    Michinoku  10.08.1985    186    150
2    O1e     Asanoyama     Toyama     Takasago  01.03.1994    189    158
3    O1w    Takakeisho      Hyogo   Chiganoura  05.08.1996    173    149
4    S1e        Shodai   Kumamoto  Tokitsukaze  05.11.1991    182    150
5    S1w     Mitakeumi     Nagano    Dewanoumi  25.12.1992    179    149
6    S2e      Daieisho    Saitama     Oitekaze  10.11.1993    179  135.7
7    K1e      Okinoumi    Shimane      Hakkaku  29.07.1985    190    154
8    K1w          Endo   Ishikawa     Oitekaze  19.10.1990    183  145.6
9    M1e    Terunofuji   Mongolia    Isegahama  29.11.1991    192  158.5
10   M1w     Takanosho      Chiba   Chiganoura  14.11.1994  181.5  128.5
11   M2e    Hokutofuji    Saitama      Hakkaku  15.07.1992    182    158
12   M2w     Tamawashi   Mongolia    Kataonami  16.
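Every value in the DataFrame above is still a string, because each row comes straight from `str.split`. The sketch below shows one way to convert a single row for later processing, using only the standard library. The helper name `convert_row` is hypothetical, and the day.month.year reading of the birthdate is an assumption inferred from values such as `25.12.1992`.

```python
from datetime import datetime


def convert_row(fields):
    """Convert one whitespace-split banzuke row into typed values."""
    rank, name, pob, stable, birthdate, height, weight = fields
    return {
        'rank': rank,
        'name': name,
        'pob': pob,
        'stable': stable,
        # Assumption: birthdates are written as day.month.year.
        'birthdate': datetime.strptime(birthdate, '%d.%m.%Y').date(),
        # Height and weight appear as plain numbers, some with decimals.
        'height': float(height),
        'weight': float(weight),
    }


line = 'Y1e    Hakuho         Mongolia  Miyagino    11.03.1985    193 150.7'
row = convert_row(line.split())
print(row['birthdate'], row['height'], row['weight'])
```

With pandas, the same conversion could be applied to whole columns at once. The per-row version here only makes the intended types explicit.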

The same function should work for both slices: Makuuchi to Juryo and Juryo to Makushita.

In [55]:
retrieve_rows(soup, start_rank = 'Juryo', end_rank = 'Makushita')

   rank          name        pob      stable   birthdate  height  weight
0   J1e         Ikioi      Osaka    Isenoumi  11.10.1986   194.0   157.0
1   J1w     Nishikigi      Iwate    Isenoumi  25.08.1990   184.5   149.5
2   J2e      Kotoyuki     Kagawa  Sadogatake  02.04.1991   176.0   162.4
3   J2w    Kotonowaka      Chiba  Sadogatake  19.11.1997   186.0   147.0
4   J3e  Wakamotoharu  Fukushima     Arashio  05.10.1993   185.0   127.2
5   J3w     Chiyomaru  Kagoshima     Kokonoe  17.04.1991   178.0   176.3
6   J4e    Chiyoshoma   Mongolia     Kokonoe  20.07.1991   183.0   118.2
7   J4w      Daiamami  Kagoshima    Oitekaze  15.12.1992   185.0   167.0
8   J5e    Daishomaru      Osaka    Oitekaze  10.07.1991   175.0   153.0
9   J5w    Kyokushuho   Mongolia    Tomozuna  09.08.1988   191.0   149.6
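To see why the same index arithmetic covers both slices, here is a small sketch that runs the slicing step of `retrieve_rows` on a hand-written, shortened sample. The helper name `slice_division` is hypothetical and the sample rows keep only rank and name. The sample also assumes that every division header is followed by one blank line and preceded by one.

```python
def slice_division(lines, start_rank, end_rank):
    """Return the lines between two division headers, as in retrieve_rows."""
    start_idx = lines.index(start_rank) + 2  # skip header and blank line
    end_idx = lines.index(end_rank) - 1      # drop blank line before next header
    return lines[start_idx:end_idx]


# Shortened sample; the layout of blank lines around headers is an assumption.
sample = [
    'Aki 2020',
    'Tokyo, Ryogoku Kokugikan',
    '',
    'Makuuchi',
    '',
    'Y1e Hakuho',
    'Y1w Kakuryu',
    '',
    'Juryo',
    '',
    'J1e Ikioi',
    'J1w Nishikigi',
    '',
    'Makushita',
]

print(slice_division(sample, 'Makuuchi', 'Juryo'))
# ['Y1e Hakuho', 'Y1w Kakuryu']
print(slice_division(sample, 'Juryo', 'Makushita'))
# ['J1e Ikioi', 'J1w Nishikigi']
```

Note that `list.index` raises `ValueError` when a header is missing, so a division name that does not appear on its own line in the text will fail loudly rather than return an empty result.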


In [14]:
url = 'http://sumodb.sumogames.de/Banzuke_text.aspx?b=202009'

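res = requests.get(url)
res.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(res.content, 'html.parser')  # name the parser explicitly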
# table = soup.find_all('table')[0] 
# df = pd.read_html(str(table))
# df.head()

In [37]:
str(soup.find_all('pre')[0]).split('\r\n')

['<pre>Aki 2020',
 'Tokyo, Ryogoku Kokugikan',
 '',
 'Makuuchi',
 '',
 'Y1e    Hakuho         Mongolia  Miyagino    11.03.1985    193 150.7',
 'Y1w    Kakuryu        Mongolia  Michinoku   10.08.1985    186   150',
 'O1e    Asanoyama      Toyama    Takasago    01.03.1994    189   158',
 'O1w    Takakeisho     Hyogo     Chiganoura  05.08.1996    173   149',
 'S1e    Shodai         Kumamoto  Tokitsukaze 05.11.1991    182   150',
 'S1w    Mitakeumi      Nagano    Dewanoumi   25.12.1992    179   149',
 'S2e    Daieisho       Saitama   Oitekaze    10.11.1993    179 135.7',
 'K1e    Okinoumi       Shimane   Hakkaku     29.07.1985    190   154',
 'K1w    Endo           Ishikawa  Oitekaze    19.10.1990    183 145.6',
 'M1e    Terunofuji     Mongolia  Isegahama   29.11.1991    192 158.5',
 'M1w    Takanosho      Chiba     Chiganoura  14.11.1994  181.5 128.5',
 'M2e    Hokutofuji     Saitama   Hakkaku     15.07.1992    182   158',
 'M2w    Tamawashi      Mongolia  Kataonami   16.11.1984    190   
