# Getting the data
TODO make work with day 2

### development notes
The important code from this notebook has been exported to `data_collection_pipeline.py`,
and is being updated throughout the project. Currently, changes from the 'wrangling' phase have been implemented.
This document was updated to reflect these changes, but isn't 'perfect'.
This document is meant as a development notebook, and any future use of the pipeline should use the Python file.
Using this to test changes to the pipeline could still be useful, so I'll do my best to keep this updated going forward.



Every year, Bushiroad hosts a large circuit of tournaments all over the world. The goal of this pipeline is to take the information published about each circuit (the event information and the tournament reports) and gather all the data in one place for analysis.

The circuit starts with an announcement, and a webpage going up with all the information on the events.
We need to use a webscraper that:
* Takes this new URL as an input
* For each event in the circuit, creates a list of the:
    - region,
    - location,
    - and date,
    - then outputs this information to a 'super list' for storage

This list will not be enough on its own to get us started, since the event reports have URL extensions that don't perfectly match a pattern. In the future, a webscraper could be prepared to help with the next step, but it's easy enough to manage by hand for now.
The event report website 'hub' will take an extension to go to the 'branch' for a particular event.
By taking these extensions and adding them to each corresponding 'sub list', the next level of the pipeline can get the information about each event.
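
As a rough sketch of that last step, the extensions could be merged into the scraped sub lists like this. The helper `attach_extensions` and the `known_extensions` mapping are hypothetical (the extensions are still collected by hand), and events without an extension get the same `'TODO'` placeholder used later in this notebook.

```python
def attach_extensions(events, extensions):
    """Prepend a report-URL extension to each [location, region, date] sub list.

    `extensions` maps a location name to its extension (hypothetical mapping,
    filled in by hand). Events with no known extension get 'TODO'.
    """
    return [
        [extensions.get(location, 'TODO'), location, region, date]
        for location, region, date in events
    ]


# Two sub lists in the shape produced by the event info scraper
scraped = [
    ['Singapore', 'AO', 'January 24, 2026'],
    ['Toronto', 'NA', 'February 14, 2026'],
]
known_extensions = {'Singapore': 'singapore'}

events_with_urls = attach_extensions(scraped, known_extensions)
```

Each resulting sub list then has the shape `[url_extension, location, region, date]`, which is what Part 1 indexes into.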

Bushiroad uploads a list of all the players who participated in each of their higher level events, with each player's:
* handle
* placement
* \# of wins
* nation
* grade 3 in r.d.
* decklog id

This information is uploaded to a URL, with each page's HTML containing links to decklog, so we will need to:
* create a dataframe out of the information on the main page
* follow those decklog links to obtain more detailed data
* (in the future, look up individual card names for even more detailed data)
* store information about just this event in one place for easy isolation and redundancy
* add data about the event we got from the previous stage (location, region, date)
* store it all for future use in a central database for easy comparison between events

The data from decklists doesn't fit too well into a traditional database of rows and columns, but all the information can fit neatly into dictionaries, or JSON.
Each object we save to the database should contain 2 'levels' of data:
* the information from the original webpage at the base level
* a dictionary containing information about the deck list, which is a nested dictionary containing a dict each for the
    * main,
    * ride,
    * and g decks
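
To make the two levels concrete, here is a minimal sketch of what one saved object might look like. The field names follow the dataframe columns and deck keys used later in this notebook; the player, card names, and counts are made-up placeholders, not real data.

```python
import json

# Illustrative record only: values are placeholders, not scraped data.
record = {
    # Level 1: fields taken from the event report table
    'rank': 4,
    'name': 'ExamplePlayer',
    'boss': 'Example Grade 3',
    'wins': 10,
    'nation': 'Example Nation',
    'decklog': 'ABC12',
    # Level 2: the deck list, with one dict per deck (card name -> quantity)
    'deck': {
        'RideDeck': {'Example Grade 3': 1},
        'MainDeck': {'Example Trigger': 4},
        'GDeck': {},
    },
}

# The whole record survives a round trip through JSON unchanged
as_json = json.dumps(record, indent=2)
restored = json.loads(as_json)
```

Because everything is plain dicts, strings, and ints, the same object can be stored as a JSON document without any extra conversion.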

##### In the future, cards may have `id`'s instead of names, so that more detailed card information can be stored in a central location

Each 'layer' will be populated by a different pipeline.
A "zeroth" 'preparation' pipeline was created to gather information about each event (region, location, and date) before getting data from the event reports.
Pipeline 1 will get data from the central page and place it into the first dictionary, add data from the preparation pipeline, and format the data appropriately.
Pipeline 2 will take the decklog links from pipeline 1 and obtain data from each link, using it to fill the second dictionary.
A third pipeline will save this data to the database.

##### NOTE
Because the main table we're scraping uses special scripts for the ride deck G3s of the top 3 players, the second pipeline needs to extract the ride deck G3's name and add it as a key-value pair to the first dictionary.

# Part 0

In [1]:

# Import used in this notebook

import pandas as pd

import requests
from bs4 import BeautifulSoup as Soup

# For webscraping pages after loading
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions\
    import presence_of_element_located as present
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


The first approach to determining which events to process was a little bit of manual clicking, but gathering the dates and regions felt like a bit too much manual effort. So, we've whipped up a script to make a base list to use, containing some of the public info on each event, such as city, region, and date.
We'll still need to manually verify and enter some of the Bushiroad event report links to get the data into the format we want.

Maybe next year, we can write a scraper to scrape all the links on the website.
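
A minimal sketch of what that link scraper might look like, run against a made-up HTML snippet rather than the live site. The markup in `SAMPLE_HTML`, the helper `collect_extensions`, and the `/event/bcs2526/` prefix are assumptions for illustration; the real report hub may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the event report hub
SAMPLE_HTML = """
<ul>
  <li><a href="/event/bcs2526/bcs2526-california/">California</a></li>
  <li><a href="/event/bcs2526/singapore/">Singapore</a></li>
  <li><a href="/about/">About</a></li>
</ul>
"""


def collect_extensions(html, prefix='/event/bcs2526/'):
    """Return the URL extension of every link that sits under `prefix`."""
    soup = BeautifulSoup(html, 'html.parser')
    extensions = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        if href.startswith(prefix):
            # Drop the shared prefix and any trailing slash
            ext = href[len(prefix):].strip('/')
            if ext:
                extensions.append(ext)
    return extensions


extensions = collect_extensions(SAMPLE_HTML)
```

The extensions would still need to be matched to the right event by hand or by name, since they don't follow a strict pattern.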

In [2]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# NEW EVENT INFO SCRAPING TOOL (PRE_PREP PIPELINE)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def scrape_event_info():
    """
    This function is to prepare our event list for our main function.
    The output of this function is expected to be manually edited
    before being fed into main;
    - Names need to be changed
    - URLs need to be manually obtained from the event report site

    When re-using this function, start by changing the URL,
    and go from there. Good luck :P
    """

    URL = 'https://en.bushiroad.com/events/bcs2526/schedule/'
    request = requests.get(URL)
    soup=Soup(request.text, 'html.parser')

    region_map = {
        'regional-na': 'NA',
        'regional-eu': 'EU',
        'regional-asia': 'AO'
    }
    final_event_list = []
    # Credit: Gemini Pro
    # 3. Iterate through the defined regions
    for region_id, continent_code in region_map.items():
        # Find the main container for this specific continent
        region_container = soup.find('div', id=region_id)

        if not region_container:
            print(f"-> Info: Container for ID '{region_id}' ({continent_code}) not found in this HTML snippet. Skipping.")
            continue

        # print(f"-> Processing continent: {continent_code}")

        # 4. Find individual event cards within this region container.
        # In your HTML, every event seems to be wrapped in a div with class="event-card"
        event_cards = region_container.find_all('div', class_='event-card')

        for card in event_cards:
            # Extract Location Name
            # It looks like <h5 class="mb-0">City Name (Country)</h5>
            location_tag = card.find('h5', class_='mb-0')
            if location_tag:
                # .get_text(strip=True) removes surrounding whitespace and HTML tags
                location_name = location_tag.get_text(strip=True)
            else:
                # Fallback: try getting it from the data-city attribute if the h5 fails
                location_name = card.get('data-city', 'Unknown Location')

            # Extract Date
            # It looks like <p class="sm-txt schedule-date">Date Range</p>
            date_tag = card.find('p', class_='schedule-date')
            if date_tag:
                date_text = date_tag.get_text(strip=True)
            else:
                date_text = "Unknown Date"

            # Create the sublist [Location, Continent, Date]
            event_info = [location_name, continent_code, date_text]
            final_event_list.append(event_info)

    # print("\nExtraction complete. Here is your list:")
    # print("-" * 30)
    # # Pretty print the final list
    # for item in final_event_list:
        # print(item)

    # If you want to use this list later in the script, it is stored in `final_event_list`
    # print("\nRaw list output:")
    # print(final_event_list)

    return final_event_list

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# events = scrape_event_info()
# events
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Below is the copied and edited list.

The events list scraped is going to need some manual fixing up, for now. 

TODOs mark events whose data hasn't been collected yet, mostly future dates.

Maybe for next year, we can add a webscraper to automate this part as well :)

In [3]:
final_event_list=\
[
    # ['illinois', 'Rosemont', 'NA', 'October 4, 2025'],
    # ['mexico', 'Mexico', 'NA', 'November 8, 2025'],
    # ['bcs2526-houston-tx', 'TX', 'NA', 'November 22, 2025'],
    # ['bcs2526-california', 'LA', 'NA', 'December 6, 2025'],
    # ['vancouver', 'BC', 'NA', 'January 10, 2026'],
    # ['argentina', 'Argentina', 'NA', 'January 17, 2026'],
    # ['duluth', 'Duluth', 'NA', 'January 17, 2026'],
    # ['puerto-rico', 'Puerto Rico', 'NA', 'January 24, 2026'],
    
    # ['TODO', 'Toronto', 'NA', 'February 14, 2026'],
    # ['TODO', 'Costa Rica', 'NA', 'February 21, 2026'],
    # ['TODO', 'Philadelphia', 'NA', 'March 21, 2026'],

    
    # ['modling-austria', 'Austria', 'EU', 'November 15, 2025'],
    # ['italy', 'Italy', 'EU', 'December 13, 2025'],
    # ['spain', 'Spain', 'EU', 'January 3, 2026'],
    # ['france', 'France', 'EU', 'February 7, 2026'],
    
    # ['TODO', 'Germany', 'EU', 'February 21, 2026'],
    # ['TODO', 'Greece', 'EU', 'March 7, 2026'],
    # ['TODO', 'United Kingdom', 'EU', 'March 21, 2026'],

    
    # ['ho-chi-minh-city-vietnam', 'Vietnam', 'AO', 'November 2, 2025'],
    # ['surabaya-indonesia', 'Indonesia', 'AO', 'November 16, 2025'],
    # ['bcs2526-malaysia', 'Malaysia', 'AO', 'December 6, 2025'],
    # ['manila', 'Philippines', 'AO', 'January 17, 2026'],
    # ['singapore', 'Singapore', 'AO', 'January 24, 2026'],
    
    # ['TODO', 'Melbourne, Australia', 'AO', 'February 28, 2026'],
    # ['TODO', 'Sydney, Australia', 'AO', 'March 28, 2026'],
    # ['TODO', 'Indonesia', 'AO', 'March 28, 2026']
]

## **Part** 1

# **INPUT** This script requires an input string, 'EVENT', in order to function

### Obtain HTML for event, and extract to df

In [4]:
def get_values(row):
    """This procedural function deals with the fact that the first 3 rows of
    the table have a slightly different structure than the rest.

    While I could write two functions, one for the first 3 rows,
    and one for the rest, this felt simpler and more natural to write,
    although it's a bit tough to read.

    Checking the index i lets us special-case the second cell of the
    top 3 players, while still keeping the incremental counting logic.

    Since the first 3 rows are a different size, we need to adjust the number of values
    so our data frame is happy. It ended up seeming simplest to just add `None` as we pass through.
    """
    values = list()
    for i, item in enumerate(row):
        if i == 1: # This double conditional is true when the second element of a top 3 row
            if item.div: # is a division instead of text, so we need to treat it differently
                values.append("Champion") # we could extract the name, but it would be very complex
                values.append(None) # Add a None placeholder for our missing value
                continue # this continue statement separates this special behaviour from the simple case

        values.append(item.text.strip())

    return values
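
To see the padding in action, here is a small self-contained check on made-up markup. `SAMPLE_TABLE` is a guess at the two row shapes (the real table has more attributes), and `normalise_row` is just a condensed restatement of the logic above so the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a top 3 row whose second cell wraps a <div>,
# followed by a regular row with separate name and boss cells.
SAMPLE_TABLE = """
<table>
  <tr><td>1st</td><td><div>banner</div></td><td>11</td><td>Nation A</td></tr>
  <tr><td>4th</td><td>Ken</td><td>Some Grade 3</td><td>10</td><td>Nation B</td></tr>
</table>
"""


def normalise_row(cells):
    """Condensed restatement of the row handling, for illustration only."""
    values = []
    for i, cell in enumerate(cells):
        if i == 1 and cell.div:
            # One <div> cell stands in for two columns: name and boss
            values.extend(["Champion", None])
            continue
        values.append(cell.text.strip())
    return values


rows = BeautifulSoup(SAMPLE_TABLE, "html.parser").find_all("tr")
normalised = [normalise_row(row.find_all("td")) for row in rows]
```

Both rows come out with five values, so they line up with the same dataframe columns even though the top 3 row started with only four cells.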

In [None]:

# Full example for testing
# FULL_URL = "https://en.cf-vanguard.com/event/bcs2526/bcs2526-california/"

# Use this to substitute for the input this function will receive
# EVENT = 'bcs2526-california'

# After creation, we've started tracking more data.
# WIP: Updating the function to work with this new input:
# Event info schema
# event_info =[
#     url_extension, # <- this was the original input
#     converted_name,
#     region,
#     date
# ]

TEST_EVENT = [
    'bcs2526-california',
    'LA',
    'NA',
    'December 6, 2025'
]

# These are for indexing into the event info list object
URL_EXT = 0
NAME = 1
REGION = 2
DATE = 3

BASE_URL = "https://en.cf-vanguard.com/event/bcs2526/"

base_url=BASE_URL
event_info=TEST_EVENT


# Part 1: Get basic info and decklog from event page ~~~~~~~~~~~~~~~~~~~~~~
url = f'{base_url}{event_info[URL_EXT]}'
soup = Soup(requests.get(url).text, features='html.parser')
rows = soup.table.find_all('tr')
# Remove header row, so all rows have td, and not th
rows.pop(0)

dataDict = dict()
for row in rows:
    # Decklog as the key
    key = row['data-deck-id']
    values = get_values(row.find_all('td'))
    values.append(key)
    values.append(None) # deck info place holder

    dataDict[key] = values

df = pd.DataFrame(dataDict).transpose()
df = df.set_axis([
                    'rank',#
                    'name',
                    'boss',
                    'wins',#
                    'nation',
                    'decklog',
                    'deck'
                ], 
                axis=1)

# Part 1.5 - Wrangle the data ~~~~~~~~~~~~~
# the rank and wins column need to be converted to ints,
# and now we have some new attributes that need to be encoded.
df['rank'] = df['rank'].str[:-2].astype(int)
df['wins'] = df['wins'].astype(int)
df['location'] = event_info[NAME]
df['region'] = event_info[REGION]
df['date'] = event_info[DATE]

df.head

<bound method NDFrame.head of        rank              name                                      boss  wins  \
146MF     1          Champion                                      None    11   
6EZB7     2          Champion                                      None    11   
42N7X     3          Champion                                      None    10   
1AQ5W     4               Ken           Fated One of Taboo, Zorga Nadir    10   
6B3RK     5               Kay  Fated One of Unparalleled, Varga Dragres     8   
...     ...               ...                                       ...   ...   
3NLY5   485       gilalmighty              Omniscience Regalia, Minerva     0   
4HGK7   486  Guardian Paladin              Soul Awakening Guard, Leuhan     0   
2LQPN   487       Nessiechomp    Demon Stealth Dragon, Shiranui ‘Oboro’     0   
4DG85   488           Metelx8          Super Dimensional Robo, Daiyusha     0   
5WRG9   489        RidazD3135                         Dragonic Overlord     0  

In [11]:
df.describe()

       rank        wins
count  489.0       489.0
mean   245.0       2.822086
std    141.306405  1.953098
min    1.0         0.0
25%    123.0       1.0
50%    245.0       3.0
75%    367.0       4.0
max    489.0       11.0


## Part 2 

### Obtain HTML for each deck's decklog, and add to df

Now, we need to set up a function to run on each decklog that we have in the dataframe.

Not every decklog will have a G zone, so the number of 'row' divisions varies.
As far as standard decks go, there will be 7 rows if there is no G zone, and 8 if there is one.
The use of `None` for the G deck will help us skip over it without creating any errors.
These magic numbers for indexing below are a bit cringe, so I hope to come back and fix this later.

**NOTE: THIS WILL NEED UPDATING FOR PREMIUM DECKS DUE TO MAGIC NUMBER HACK**
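
One possible first step toward cleaning this up is to give the magic numbers names. This is a minimal sketch that only covers the standard-format layout described above (indices 5, 6, and 7, with 8 rows meaning a G zone is present); the constant names and the helper `split_deck_rows` are hypothetical, and premium decks would still need their own handling.

```python
# Row positions as observed for standard-format decklogs in this notebook
RIDE_ROW = 5
MAIN_ROW = 6
G_ROW = 7
ROWS_WITH_G_ZONE = 8


def split_deck_rows(rows):
    """Pick the ride, main, and (optional) G deck rows out of the page's rows."""
    has_g_zone = len(rows) == ROWS_WITH_G_ZONE
    return {
        'RideDeck': rows[RIDE_ROW],
        'MainDeck': rows[MAIN_ROW],
        'GDeck': rows[G_ROW] if has_g_zone else None,
    }


# Stand-in lists; in the notebook these would be the 'row' divs from the soup
without_g = split_deck_rows([f'row{i}' for i in range(7)])
with_g = split_deck_rows([f'row{i}' for i in range(8)])
```

This doesn't remove the dependence on page layout, but it puts the assumptions in one place where they're easy to change.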

In [12]:
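def decklogToDict(soup):
    """
    TODO make work with premium

    Given the BeautifulSoup for a decklog page, return a tuple of:
    - deckDict: {'RideDeck': {...}, 'MainDeck': {...}, 'GDeck': {...}},
      mapping each card name to its quantity
    - boss: the first card name encountered (normally from the ride deck)
    """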
    # Initialize the decks we return
    boss = None
    rideDeckDict = dict()
    mainDeckDict = dict()
    gDeckDict = dict()
    deckDict = {
        'RideDeck': rideDeckDict,
        'MainDeck': mainDeckDict,
        'GDeck': gDeckDict
    }

    # obtain data from soup
    rows =  soup.find_all('div', 'row')
    rideDeck = rows[5].find_all('div', 'card-controller')
    mainDeck = rows[6].find_all('div', 'card-controller')
    if len(rows) == 8:
        gDeck = rows[7].find_all('div', 'card-controller')
    else:
        gDeck = None

    # pair soup data with dict
    decks = {
        "RideDeck": rideDeck, 
        "MainDeck": mainDeck,
        "GDeck": gDeck
    }

    # extract data from the soup into the dictionary
    for deck in deckDict:
        if not decks[deck]: continue
        for card in decks[deck]:
            spans = card.find_all('span')
            namespan= str(spans[0])
            index1 = namespan.find(' : ') + 3 #trim whitespace
            index2 = namespan.find('"></span>')
            
            card_name = namespan[index1:index2]
            quant = int(spans[1].text)
        
            if card_name in deckDict[deck]:
                deckDict[deck][card_name] += quant
            else:
                boss = boss or card_name
                deckDict[deck][card_name] =  quant
            
    return deckDict, boss

Above, we created the framework to hold all the information.
Now, below, we will scrape all the information from the page into the framework.

In [None]:
# Credit: Gemini for the skeleton

# Setup Chrome to run headless (without a visible window)
chrome_options = Options()
chrome_options.add_argument("--headless")
# Initialize the browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options) 
waiter = WebDriverWait(driver=driver, timeout=20)

# Go to each page
BASE_URL = "https://decklog-en.bushiroad.com/view/" 
for i, row in df.iterrows():
    if df.at[i, 'deck'] is not None: continue # skip completed decks on re-run
    # Most of the issues with getting deck logs just need a retry.
    tries = 0
    while tries < 3:
        tries += 1
        try:
            code = df.at[i, 'decklog']
            url = BASE_URL + str(code)
            driver.get(url)
            # Wait until JavaScript has rendered at least one card
            waiter.until(
                present((By.CLASS_NAME, "card-controller"))
            )
            html = driver.page_source
            deckSoup = Soup(html, 'html.parser')
            deckDict, boss = decklogToDict(deckSoup)
            df.at[i, 'deck'] = deckDict
            #TODO MAKE WORK WITH PREMIUM DECKS
            df.at[i, 'boss'] = boss
            break
        except Exception:
            # Bad or unreachable decklog: retry, then move on to the next row
            continue

# clean-up
driver.quit()

This bit ensures we finished getting each deck, in case we run into connection issues

In [None]:
# if None in df.deck: raise ValueError to preserve database authenticity when re-using this notebook
todo = df.deck.isnull().sum()
if todo != 0: raise ValueError

I've encountered some errors with weird decklog codes that aren't correct: [5.97E+06, 2.62E+02].

I checked whether these were issues on the Bushiroad site or in the script, and those are the codes as they appear on the site.

These codes are incorrect and don't lead to a valid decklog page. It's important we keep the script functional even when it receives bad inputs.
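
One way to stay functional on bad inputs is to bound the retries and record which codes never resolved, instead of letting one bad code stop the run. The sketch below is an illustration only: `fetch_with_retries` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the Selenium page load so the snippet runs without a browser.

```python
def fetch_with_retries(fetch, code, max_tries=3):
    """Call fetch(code) up to max_tries times; return None if every attempt fails."""
    for _ in range(max_tries):
        try:
            return fetch(code)
        except Exception:
            continue
    return None


def fake_fetch(code):
    """Stand-in for the real page load: rejects codes that aren't alphanumeric."""
    if not code.isalnum():
        raise ValueError(f"not a valid decklog code: {code}")
    return {'RideDeck': {}, 'MainDeck': {}, 'GDeck': {}}


codes = ['1AQ5W', '5.97E+06']
results = {code: fetch_with_retries(fake_fetch, code) for code in codes}

# Codes that never produced a deck can be reviewed by hand afterwards
failed = [code for code, deck in results.items() if deck is None]
```

Keeping a `failed` list makes it easy to tell a known-bad code on the site apart from a connection problem that a re-run would fix.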

## Part 3
### Upload the data

TODO FIX SECURITY ISSUE

In [None]:
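import os
from pymongo import MongoClient

# # Connect to MongoDB
# # Credentials are read from the environment rather than hardcoded here;
# # the variable names BCS_DB_USER and BCS_DB_PASSWORD are placeholders.
# username = os.environ['BCS_DB_USER']
# password = os.environ['BCS_DB_PASSWORD']
# cluster_address = 'bcsproto.peazuyx.mongodb.net/?appName=BCSproto'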

# client = MongoClient(f'mongodb+srv://{username}:{password}@{cluster_address}')

# db = client['JSONproto']
# collection = db[event_info[NAME]]

# cleanDF = df.reset_index().rename(columns={'index':'_id'})
# collection.insert_many(cleanDF.to_dict(orient='records'))

# print("Done")
# client.close()