
# Scrape Data from sumo db

In this notebook we use requests, Beautiful Soup, and pandas to scrape the banzuke (rankings) and hoshitori (tournament results) from sumodb and store them locally for further processing.

## MVP

For the MVP we only need the rankings and results from the previous two tournaments.

### Banzuke format
Sample URL:
http://sumodb.sumogames.de/Banzuke_text.aspx?b=202009

URL template, where `yyyymm` is the tournament's year and month:
http://sumodb.sumogames.de/Banzuke_text.aspx?b=yyyymm
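
The template can be filled in with a small helper. This is a minimal sketch: `BANZUKE_URL` and `banzuke_url` are names introduced here for illustration, not part of the notebook, and the only thing assumed about the site is the `b=yyyymm` parameter shown above.

```python
# Hypothetical helper: builds the text-banzuke URL from a year and a month.
BANZUKE_URL = 'http://sumodb.sumogames.de/Banzuke_text.aspx?b={year:04d}{month:02d}'

def banzuke_url(year, month):
    """Return the banzuke URL for one tournament (year and month as ints)."""
    if not 1 <= month <= 12:
        raise ValueError('month must be between 1 and 12')
    return BANZUKE_URL.format(year=year, month=month)

print(banzuke_url(2020, 9))
# http://sumodb.sumogames.de/Banzuke_text.aspx?b=202009
```

Formatting the month with `02d` keeps single-digit months zero-padded, which matches the `yyyymm` shape of the sample URL.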

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Banzuke MVP

Requesting the plain-text version of the banzuke is faster than requesting the table version.

In [2]:
url = 'http://sumodb.sumogames.de/Banzuke_text.aspx?b=202009'

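res = requests.get(url)
res.raise_for_status()  # fail early on HTTP errors
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(res.content, 'html.parser')
# table = soup.find_all('table')[0]
# df = pd.read_html(str(table))
# df.head()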

<span id="papermill-error-cell" style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">Execution using papermill encountered an exception here and stopped:</span>

In [3]:
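def retrieve_rows(soup, start_rank='Makuuchi', end_rank='Juryo'):
    """
    Take a banzuke soup object and return the rows listed between the
    start_rank and end_rank division headers as a DataFrame.
    Returns the string 'None' if the page has no <pre> block
    (callers use this sentinel to skip months without a banzuke).
    """
    pre = soup.find('pre')
    if pre is None:
        return 'None'

    # get_text() drops the <pre> tags, and splitlines() copes with both
    # '\r\n' and '\n' endings; splitting on '\r\n' alone is a likely
    # cause of "ValueError: 'Makuuchi' is not in list".
    lines = [line.strip() for line in pre.get_text().splitlines()]

    # Assumes each division name sits alone on its own line, followed by
    # one header line, with one blank line before the next division.
    start_idx = lines.index(start_rank) + 2
    end_idx = lines.index(end_rank) - 1

    rows = [line.split() for line in lines[start_idx:end_idx]]

    cols = ['rank', 'name', 'pob', 'stable', 'birthdate', 'height', 'weight']

    return pd.DataFrame(rows, columns=cols)

x = retrieve_rows(soup)
print(x)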

ValueError: 'Makuuchi' is not in list

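The `ValueError` above is raised by `.index(start_rank)` when no line of the split `<pre>` text is exactly `Makuuchi`, for example when the page's line endings are not `\r\n`. Once the lines split correctly, the function should work for both the Makuuchi-to-Juryo and the Juryo-to-Makushita ranges.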

In [None]:
retrieve_rows(soup, start_rank = 'Juryo', end_rank = 'Makushita')
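
The slicing logic can be checked offline before touching the network. The sketch below runs the same index arithmetic on an invented text block; the rows are placeholders, not real banzuke data, and the layout (division name on its own line, one header line, one blank line before the next division) is an assumption about the page rather than something confirmed here. `slice_division` and `SAMPLE` are hypothetical names.

```python
# Invented placeholder text: NOT real banzuke data, only shaped like the
# layout the slicing assumes.
SAMPLE = (
    "Makuuchi\n"
    "Rank Shikona POB Heya Birth Height Weight\n"
    "Y1e RikishiA Tokyo HeyaA 01.01.1990 185 150\n"
    "O1e RikishiB Osaka HeyaB 02.02.1991 180 160\n"
    "\n"
    "Juryo\n"
    "Rank Shikona POB Heya Birth Height Weight\n"
    "J1e RikishiC Aichi HeyaC 03.03.1992 178 140\n"
    "\n"
    "Makushita\n"
)

def slice_division(text, start_rank, end_rank):
    """Return the whitespace-split rows between two division headers."""
    lines = [line.strip() for line in text.splitlines()]
    start_idx = lines.index(start_rank) + 2  # skip division name and header line
    end_idx = lines.index(end_rank) - 1      # drop the blank line before the next division
    return [line.split() for line in lines[start_idx:end_idx]]

makuuchi = slice_division(SAMPLE, 'Makuuchi', 'Juryo')
juryo = slice_division(SAMPLE, 'Juryo', 'Makushita')
print(len(makuuchi), len(juryo))  # 2 1
```

Each returned row has seven fields, so the lists can be passed to `pd.DataFrame` with the seven column names used in the notebook. A field containing a space would produce a longer row and would need different handling.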

In [None]:
def combine_divisions(div1, div2):
    """
    Stack the data frames from two different divisions into one.
    """

    # ignore_index avoids duplicate row labels in the combined frame
    jf = pd.concat([div1, div2], ignore_index=True)

    return jf

div1 = retrieve_rows(soup, 'Makuuchi', 'Juryo')
print(div1.shape)
div2 = retrieve_rows(soup, 'Juryo', 'Makushita')
print(div2.shape)
jf = combine_divisions(div1, div2)
jf.shape

In [None]:
jf.head()

In [None]:
[str(x).zfill(2) for x in range(1, 13)]
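
Official tournaments are normally held in the odd-numbered months (January, March, and so on through November), so the list can be narrowed with a step of 2. A minimal sketch; `odd_months` is a name introduced here.

```python
# Zero-padded strings for months 1, 3, ..., 11
odd_months = [f'{m:02d}' for m in range(1, 13, 2)]
print(odd_months)
# ['01', '03', '05', '07', '09', '11']
```

Requesting all twelve months instead trades a few extra requests for not having to hard-code the schedule.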

In [None]:
def scrape_banzuke(year=2019):
    """
    Scrape all the banzuke for the given year and return them as one DataFrame.
    """

    # Request every month and skip those without a banzuke. An earlier
    # draft listed only odd months and dropped '05' for 2020.
    months = [str(x).zfill(2) for x in range(1, 13)]

    url = 'http://sumodb.sumogames.de/Banzuke_text.aspx?b={}{}'

    banzuke = []

    for month in months:
        r = url.format(year, month)
        res = requests.get(r)
        soup = BeautifulSoup(res.content, 'html.parser')
        mak = retrieve_rows(soup, start_rank='Makuuchi', end_rank='Juryo')
        # retrieve_rows returns a non-DataFrame value when the page has no <pre> block
        if not isinstance(mak, pd.DataFrame):
            continue
        jur = retrieve_rows(soup, start_rank='Juryo', end_rank='Makushita')

        print(r)
        df = combine_divisions(mak, jur)
        df['year'] = year
        df['month'] = month
        banzuke.append(df)

    banzuke = pd.concat(banzuke, ignore_index=True)

    return banzuke

scrape_banzuke(year = 2020)

In [None]:
year = 2020
banzuke = scrape_banzuke(year = year)
banzuke.to_csv('banzuke_{}.csv'.format(year), index = False)
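
Since the MVP only needs the two most recent tournaments, the scraped frame can be filtered down after it is built or reloaded. The sketch below uses invented placeholder rows shaped like the `year` and `month` columns added by `scrape_banzuke`; the `basho` column and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Invented placeholder rows: NOT real banzuke data
banzuke = pd.DataFrame({
    'rank': ['Y1e', 'Y1e', 'Y1e', 'O1e'],
    'name': ['RikishiA', 'RikishiA', 'RikishiB', 'RikishiC'],
    'year': [2020, 2020, 2020, 2020],
    'month': ['03', '07', '09', '09'],
})

# Build a sortable yyyymm key. zfill(2) also covers the case where the
# month was read back from CSV as an integer such as 3.
banzuke['basho'] = (
    banzuke['year'].astype(str) + banzuke['month'].astype(str).str.zfill(2)
)

# Keep only the rows belonging to the two latest tournaments
latest_two = sorted(banzuke['basho'].unique().tolist())[-2:]
recent = banzuke[banzuke['basho'].isin(latest_two)]

print(latest_two)    # ['202007', '202009']
print(recent.shape)  # (3, 5)
```

Sorting the key as a string works because every value has the same six-digit width.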