# Gathering Data

## Imports

- requests: Needed to pull data from the internet.
- time: Used to split our data into time intervals.
- json: Lets us decode the JSON obtained from the data source.
- math: Lets us use $\log$ to estimate how many steps a binary search will take.
- calendar: Used for time conversions from UTC.

In [2]:
import requests
import time
import json
import math
import calendar

## Pre-Info

Data for this project will be obtained from SendouQ match data, found at `https://sendou.ink/q/match/[ID]`. The ID numbers increase sequentially, with the first match ever having an ID of 1, and the current last match (at the time of writing) having an ID around 45400.

As a sample, I use my most recent match at the time of writing, which has ID [45359](https://sendou.ink/q/match/45359).

We're only interested in two pieces of data: the rating of every player in the game, and the result of the match. Since this information is shown on screen, it should be possible to extract it from the page source.

The page source isn't easy to decipher at first glance, especially since I have little web-dev knowledge. After some investigation, though, I found JSON containing the data we're looking for. It starts immediately after the phrase `"features/sendouq/routes/q.match.$id":` and runs until the matching closing brace.

In [36]:
match_ID = '45359'

def get_match(match_ID):
    # grab the data from the page
    r = requests.get('https://sendou.ink/q/match/' + match_ID)
    start_phrase = '"features/sendouq/routes/q.match.$id":'

    # find the start_phrase in the result
    start_index = r.text.find(start_phrase)

    # copy all text after the start_phrase
    data = r.text[start_index + len(start_phrase):]

    # find the end_index by looking for balanced parentheses
    open_parentheses = 0
    for i, c in enumerate(data):
        if c == '{':
            open_parentheses += 1
        elif c == '}':
            open_parentheses -= 1
        if open_parentheses == 0:
            break
    end_index = i + 1

    # trim off the extra data
    return data[:end_index]

# print out data (use an external tool to visualize the json like https://codebeautify.org/jsonviewer)
print(get_match(match_ID))

{"match":{"id":45643,"alphaGroupId":379223,"bravoGroupId":379231,"createdAt":1714871261,"reportedAt":null,"reportedByUserId":null,"memento":{"modePreferences":{"SZ":[{"userId":27094,"preference":"PREFER"},{"userId":23509,"preference":"PREFER"},{"userId":9800,"preference":"PREFER"},{"userId":41674}],"TC":[{"userId":27094,"preference":"PREFER"},{"userId":23509},{"userId":9800},{"userId":41674}],"RM":[{"userId":27094,"preference":"PREFER"},{"userId":23509,"preference":"AVOID"},{"userId":9800,"preference":"PREFER"},{"userId":41674}],"CB":[{"userId":27094,"preference":"PREFER"},{"userId":23509},{"userId":9800},{"userId":41674}]},"pools":[{"userId":27094,"pool":[{"stages":[2,7,10,15,16,17,21],"mode":"SZ"},{"stages":[2,3,6,8,9,10,19],"mode":"TC"},{"stages":[0,2,6,10,16,17,19],"mode":"RM"},{"stages":[0,2,6,8,10,15,17],"mode":"CB"}]},{"userId":7994,"pool":[{"stages":[0,1,2,5,8,9,10],"mode":"TW"},{"stages":[6,8,10,14,15,16,21],"mode":"SZ"},{"stages":[2,8,9,10,14,16,20],"mode":"TC"},{"stages":[2,

You can view the data above in any JSON visualizer. I personally used [jsonbeautifier](https://jsonbeautifier.org/).

### Finding player ratings

The first piece of data we want is the rating of each player. This can be found at `groupAlpha.members[0-3].skill.ordinal` for the left team, and at the same path under `groupBravo` for the right team.

However, this value is not what we might expect. On the website, player ratings (visible by clicking on the rank icons) are numbers around 1200-1700, but the value in the data is a smaller number between 0 and 50. Therefore, there must be some conversion from the ordinal to the rating.

We can compare ordinals vs. ratings to see how we can write a conversion function. This data is mostly from the game I was using above, and then some data from the match with the ID right before (ID: `45359`).

| ordinal $(o)$ | rating $(\mathrm R)$ |
|---|---|
|23.189129414253898|1348|
|25.8303791446169|1387|
|23.431072851780918|1351|
|19.726415339139816|1296|
|29.049225431933444|1436|
|24.00009753744956|1360|
|38.153938142944824|1572|
|31.82404177844055|1477|
|11.527993868993082|1173|
|7.477124875867283|1112|

If you graph this data, the points fall on a straight line, with only a very small rounding error. Thus, we can write a function to convert from ordinal to rating.
$$\operatorname R(o) = 15o + 1000$$
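
As a quick check of this formula, the sketch below applies it to every row of the table above. The function name `ordinal_to_rating` is my own, and rounding to the nearest integer is an assumption about how the site displays the value, not something confirmed by the page source.

```python
def ordinal_to_rating(ordinal: float) -> int:
    """Convert an ordinal to a displayed rating using R(o) = 15o + 1000."""
    # rounding to the nearest integer is an assumption that matches the table
    return round(15 * ordinal + 1000)

# (ordinal, rating) pairs copied from the table above
samples = [
    (23.189129414253898, 1348),
    (25.8303791446169, 1387),
    (23.431072851780918, 1351),
    (19.726415339139816, 1296),
    (29.049225431933444, 1436),
    (24.00009753744956, 1360),
    (38.153938142944824, 1572),
    (31.82404177844055, 1477),
    (11.527993868993082, 1173),
    (7.477124875867283, 1112),
]

# collect any rows where the formula disagrees with the website
mismatches = [(o, r) for o, r in samples if ordinal_to_rating(o) != r]
print(mismatches)  # []
```

An empty list means the formula reproduces all ten ratings in the table.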

### Finding the winner of the match

The second piece of data we want is the winner of the match. There seem to be two ways to find it. The first way, and what I expect the website does to display the winner, is to check `match.mapList[0-6].winnerGroupId`, which gives the ID of the team that won each game of the match. This ID can be compared against `alphaGroupId` and `bravoGroupId` to see which team won that game. The team that won the most maps is the winner of the match.
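
A minimal sketch of this first approach is shown here. The field names come from the paths mentioned above, but the sample dictionary is made up, and I'm assuming that maps which were never played carry a null `winnerGroupId`. The helper name `match_winner` is hypothetical.

```python
from collections import Counter
from typing import Optional

def match_winner(match: dict) -> Optional[str]:
    """Return 'alpha', 'bravo', or None if both teams won the same number of maps."""
    # count how many maps each group ID won, skipping maps without a winner
    wins = Counter(
        game["winnerGroupId"]
        for game in match["mapList"]
        if game.get("winnerGroupId") is not None
    )
    alpha_wins = wins[match["alphaGroupId"]]
    bravo_wins = wins[match["bravoGroupId"]]
    if alpha_wins == bravo_wins:
        return None
    return "alpha" if alpha_wins > bravo_wins else "bravo"

# made-up example data, shaped like the fields described above
sample_match = {
    "alphaGroupId": 1,
    "bravoGroupId": 2,
    "mapList": [
        {"winnerGroupId": 1},
        {"winnerGroupId": 2},
        {"winnerGroupId": 1},
        {"winnerGroupId": 1},
        {"winnerGroupId": 1},
        {"winnerGroupId": None},
        {"winnerGroupId": None},
    ],
}
print(match_winner(sample_match))  # alpha
```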

However, there is an easier, second way to find the information we want. Since we don't care about the actual score of the game, just the winner, we can check `groupAlpha.members[0].skillDifference.spDiff`. If this value is positive, then the first team was the winner, and if it's negative then the second team won. This is because this value contains the amount of points gained or lost as a result of the match, and since the team that wins will always gain points, and the team that loses will always lose points, this can be used to determine the result of the match.

There are times when the second method will not work: when players have no rating before the match starts, there is no gain or loss of points when it ends. However, for this project we must ignore matches where not all players have a rank anyway, since the ranks of all players are important for our data. Thus, this method should always work for the matches we care about.

In [18]:
match_ID = '45359'

json_data = json.loads(get_match(match_ID))
alpha_won = json_data['groupAlpha']['members'][0]['skillDifference']['spDiff'] > 0

if alpha_won:
    print("Alpha", end='')
else:
    print("Bravo", end='')
print(" won this game")

Alpha won this game


### Finding match time

We already have all the data we strictly need. However, it can be useful to split our data into time intervals. For example, SendouQ is split into seasons, which are periods of time when the service is active. Between seasons, rating data is partially reset, so matches at the beginning of a season might give noisier, less reliable data. Additionally, ratings were added to the match history some time during Season 1, between October 2nd and October 5th. Matches before that point have no rating data to pull.

We want to separate our data and see how different seasons might give different results. Additionally, removing the first two weeks of each season may prove useful for getting cleaner data.

The creation date of each match is stored at `match.createdAt` as seconds since the epoch, which can be converted and compared in Python pretty easily.

In [24]:
match_ID = '45359'

json_data = json.loads(get_match(match_ID))
local_time = time.localtime(json_data['match']['createdAt'])

print(time.strftime('%x, %I:%M %p', local_time))

05/02/24, 07:00 PM
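
Note that `time.localtime` depends on the timezone of the machine running the notebook, so the output above will differ elsewhere. As a small sketch, `time.gmtime` gives the same result everywhere. The timestamp used here is the `createdAt` value of match 45670, which is printed later in this notebook.

```python
import time

# `createdAt` of match 45670, as printed later in this notebook
created_at = 1714878038

# gmtime interprets the timestamp in UTC, independent of the local timezone
utc_time = time.gmtime(created_at)
print(time.strftime('%Y-%m-%d %H:%M:%S UTC', utc_time))  # 2024-05-05 03:00:38 UTC
```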


## Finding Intervals
*All code cells below are necessary to run for the algorithm at the bottom of the notebook to work.*

Before we start writing algorithms to grab data, we need to consider the risk of getting rate limited or IP-banned, which we would like to avoid. Since we're going to be pinging the server many times, it's important to space our requests out so we don't overload it. To do this, we define a class that wraps the `requests.get` function. This class regulates our usage based on time and stops us from spamming the server.

The `SafeRequester` class is initialized with a time interval in *seconds* to wait between requests. Its only method is `SafeRequester.get(...)`, which takes the same parameters as `requests.get(...)`.

In [7]:
REQUEST_INTERVAL = 2
HEADERS = {'User-Agent':
           'L1ghtBeam/cs5-final-project (contact __beam on discord if there are any issues)'}

class SafeRequester:
    def __init__(self, interval):
        self.last_request = time.monotonic()
        self.interval = interval

    def get(self, *args, **kwargs) -> requests.Response:
        current_time = time.monotonic()
        time_diff = current_time - self.last_request

        # if we haven't waited all of the time interval yet, sleep the remaining
        # amount of time before requesting
        while time_diff < self.interval:
            time.sleep(self.interval - time_diff) 
            current_time = time.monotonic()
            time_diff = current_time - self.last_request
        
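        self.last_request = current_time
        # use our identifying headers unless the caller supplies their own
        # (passing headers= twice would raise a TypeError)
        kwargs.setdefault('headers', HEADERS)
        return requests.get(*args, **kwargs)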

requester = SafeRequester(REQUEST_INTERVAL)

Before we do anything else, let's redefine our `get_match` function from earlier to use this new class. We also split it into two functions, for cases where we want the response but not the data parsed from it.

In [5]:
# grab the request data from the input match ID
def get_match(match_ID: int) -> requests.Response:
    # grab the info from the page
    return requester.get(f'https://sendou.ink/q/match/{match_ID}')

# grab the json data from the given request
def match_json(r: requests.Response) -> list | dict:
    assert(r.ok)

    start_phrase = '"features/sendouq/routes/q.match.$id":'

    # find the start_phrase in the result
    start_index = r.text.find(start_phrase)

    # copy all text after the start_phrase
    data = r.text[start_index + len(start_phrase):]

    # find the end_index by looking for balanced parentheses
    open_parentheses = 0
    in_quotes = False       # flag if we're in quotes
    escaped = False         # flag if we're escaped
    for i, c in enumerate(data):
        # if we're escaped, ignore this character
        if escaped:
            escaped = False
            continue
        if c == '\\':
            escaped = True
            continue

        # When we see a quote, toggle the flag in_quotes. This prevents us
        # from counting braces that appear inside strings. To see where this
        # would be a problem, see match 37500, where a player has the name
        # "プ{Doωm'"
        if c == '"':
            in_quotes = not in_quotes
            continue
        if in_quotes:
            continue

        if c == '{':
            open_parentheses += 1
        elif c == '}':
            open_parentheses -= 1
        if open_parentheses == 0:
            break
    end_index = i + 1

    # trim off the extra data
    return json.loads(data[:end_index])

# small test to make sure a match with a player named "プ{Doωm'" will work
# match_ID = 37500
# r = get_match(match_ID)
# assert(r.ok)
# j = match_json(r)
# assert(j['match']['isLocked'])
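
As an alternative to counting braces by hand, the standard library can find the end of the JSON for us: `json.JSONDecoder.raw_decode` parses one JSON value starting at a given position and ignores whatever follows it, so quotes and escapes are handled by the parser itself. This is only a sketch of that idea, not what the rest of the notebook uses. The function name `extract_json_after` and the sample page fragment are made up for illustration.

```python
import json

START_PHRASE = '"features/sendouq/routes/q.match.$id":'

def extract_json_after(text: str, start_phrase: str = START_PHRASE):
    """Parse the JSON value that immediately follows start_phrase in text."""
    start_index = text.find(start_phrase)
    if start_index == -1:
        raise ValueError("start phrase not found in page text")
    position = start_index + len(start_phrase)
    # raw_decode does not skip leading whitespace, so do that by hand
    while position < len(text) and text[position].isspace():
        position += 1
    # raw_decode parses one JSON value and ignores whatever follows it
    value, _end = json.JSONDecoder().raw_decode(text, position)
    return value

# made-up page fragment with a brace and an escaped quote inside a name
page = 'window.data = {' + START_PHRASE + '{"name": "a{b\\"c", "id": 7}, "other": 1};'
print(extract_json_after(page))
```

The brace inside the name does not confuse the parser, and the trailing `, "other": 1};` is simply left unread.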

In [65]:
match_ID = 45670
r = get_match(match_ID)
assert(r.ok)
j = match_json(r)

print(json.dumps(j, indent=2))

{
  "match": {
    "id": 45670,
    "alphaGroupId": 379394,
    "bravoGroupId": 379391,
    "createdAt": 1714878038,
    "reportedAt": 1714878120,
    "reportedByUserId": 23675,
    "memento": {
      "modePreferences": {
        "SZ": [
          {
            "userId": 2801,
            "preference": "PREFER"
          },
          {
            "userId": 1059
          },
          {
            "userId": 198
          }
        ],
        "TC": [
          {
            "userId": 2801
          },
          {
            "userId": 1059
          },
          {
            "userId": 198,
            "preference": "PREFER"
          }
        ],
        "RM": [
          {
            "userId": 2801
          },
          {
            "userId": 1059
          },
          {
            "userId": 198,
            "preference": "PREFER"
          }
        ],
        "CB": [
          {
            "userId": 2801
          },
          {
            "userId": 1059
          },
       

### Finding the last match

The first match is match `1` (this service is surprisingly not zero-indexed). However, the last match, the most recent one, is always changing. We can find it using binary search. Every match ID that exists should return status code `200`, and every match ID that doesn't exist yet should return `404`. Since all existing IDs come before the one we want to find and all missing IDs come after it, the IDs are effectively sorted by status, which is exactly what binary search needs.
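
To see the search logic without touching the network, here is a minimal offline sketch. The function `find_last_existing` and the `fake_exists` predicate are hypothetical stand-ins for the status-code check, and the "newest match" of 45670 is just an example value.

```python
import math
from typing import Callable

def find_last_existing(exists: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest ID in [low, high] for which exists(ID) is true.

    Assumes every existing ID comes before every missing one.
    Returns low - 1 if no ID in the range exists.
    """
    while low <= high:
        mid = (low + high) // 2
        if exists(mid):
            low = mid + 1   # the last existing ID is at mid or to its right
        else:
            high = mid - 1  # the last existing ID is to the left of mid
    return high

# pretend the newest match has ID 45670, and record every probe we make
probes = []
def fake_exists(match_id: int) -> bool:
    probes.append(match_id)
    return match_id <= 45670

print(find_last_existing(fake_exists, 1, 100000))  # 45670
# the number of probes never exceeds floor(log2(n)) + 1 for a range of n IDs
print(len(probes), "<=", math.floor(math.log2(100000)) + 1)
```

If every ID in the range exists, the function returns `high` unchanged, which is the signal to widen the range and search again.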

In [60]:
OK = 200
NOT_FOUND = 404

# starting value for the last match.
# it's very possible far in the future that this will no longer be greater than 
# the actual last match, which is why this algorithm uses two loops and not one
if 'search_interval' not in locals():
    search_interval = 100000


if 'last_match' in locals():
    L = last_match  # type: ignore
else:
    L = 1

R = search_interval
found_last_match = False

while not found_last_match:
    # define variables for our percentage estimate
    search_range = R - L + 1
    max_searches = math.floor(math.log2(search_range)) + 1
    searches = 0
    print(f"Searching range {L}-{R}:\n0%")

    # search loop using binary search
    while L <= R:
        m = (L + R) // 2
        r = get_match(m)
        if r.ok:
            L = m + 1
        elif r.status_code == NOT_FOUND:
            R = m - 1
        else:
            raise RuntimeError(f"Got an unexpected error code {r.status_code} ({r.reason})")
    
        searches += 1
        print(f"{searches/max_searches:.1%}")

    if R < search_interval:
        found_last_match = True
    else:
        # in the rare case that this code is being run years into the future
        # where there have been more than 100,000 matches played, expand our
        # search interval and search again
        print(f"There have been more games played than {search_interval}!")
        L = search_interval + 1
        search_interval *= 2
        R = search_interval
        print(f"Searching again up to {search_interval} games.")


# once the loop is complete, R is the last match
last_match = R
print(f"Found the last match: {last_match}.")

Searching range 1-100000:
0%
5.9%
11.8%
17.6%
23.5%
29.4%
35.3%
41.2%
47.1%
52.9%
58.8%
64.7%
70.6%
76.5%
82.4%
88.2%
94.1%
100.0%
Found the last match: 45670.


The match we found above is the last match that exists, but not necessarily the last match with a reported score. There's no easy way to find the last valid match without checking matches one by one. If any later algorithm needs it, we can be safe by only using matches that are at least a few hours old. However, I don't think this will be necessary.

### Finding match intervals by time

Now that we have the last match, we can use binary search to find the matches that bound certain time intervals. Since we want to do this several times to find multiple intervals, we create a function for this purpose.
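
Finding the first match created at or after a given time is a "lower bound" search. It relies on the assumption that creation times never decrease as IDs increase. On a local, already-sorted list, the same idea is what `bisect.bisect_left` computes, as this small sketch with made-up timestamps shows. The helper name `first_match_at_or_after` is hypothetical.

```python
from bisect import bisect_left

# made-up creation times for match IDs 1..6, in increasing order
created_at = [100, 250, 250, 400, 900, 1200]

def first_match_at_or_after(t: int) -> int:
    """Return the 1-based ID of the first match with createdAt >= t."""
    # bisect_left returns the first index whose value is >= t
    return bisect_left(created_at, t) + 1

print(first_match_at_or_after(250))   # 2
print(first_match_at_or_after(401))   # 5
print(first_match_at_or_after(5000))  # 7, one past the last match
```

The network version cannot use `bisect` directly, because each lookup costs a request, but the returned index has the same meaning.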

In [75]:
def find_match_after(t: int, L: int, R: int) -> int:
    # define variables for our percentage estimate
    search_range = R - L + 1
    max_searches = math.floor(math.log2(search_range)) + 1
    searches = 0
    print(f"Searching range {L}-{R}:\n0%")

    # search loop using binary search
    while L <= R:
        m = (L + R) // 2
        r = get_match(m)
        assert(r.ok) # should always be true as long as R <= last_match
        j = match_json(r)
        if j['match']['createdAt'] < t:
            L = m + 1
        else:
            R = m - 1
    
        searches += 1
        print(f"{searches/max_searches:.1%}")
    
    # L is our result
    input_time = time.strftime('%x, %I:%M %p', time.localtime(t))
    print(f"Found the first match after {input_time}: {L}.")
    return L

In [79]:
print("Enter time to search for")
print("Use format MM/DD/YYYY, HH:MM AM/PM")
input_time = input("Enter time:")
if not input_time:
    raise SystemExit()
t = time.strptime(input_time, "%m/%d/%Y, %I:%M %p")

find_match_after(time.mktime(t), 1, last_match)

Enter time to search for
Use format MM/DD/YYYY, HH:MM AM/PM
Searching range 1-45670:
0%
6.2%
12.5%
18.8%
25.0%
31.2%
37.5%
43.8%
50.0%
56.2%
62.5%
68.8%
75.0%
81.2%
87.5%
93.8%
Found the first match after 05/04/24, 06:00 PM: 45643.


45643

Now it's time to find the intervals we want to use. As mentioned before, SendouQ is split into seasons. Additionally, we want to skip the first few weeks of each season, until ratings level out, to reduce noise. Information about each season so far is listed below.

- **Season 0**: August 14 - August 27, 2023
- **Season 1**: September 11 - November 19, 2023
- **Season 2**: December 4, 2023 - February 18, 2024
- **Season 3**: March 4 - May 19, 2024

However, there's one more thing to consider: ranks were not stored in match history until sometime in Season 1. I manually searched through the games and found match `10988` to be the first one with ranks and a result. Thus, our intervals should look like this.

- Season 1, ranks introduced (10/3/2023, 11:34 AM), ID: `10988`
- Season 2, start (12/4/2023, 9:00 AM)
- Season 2, two weeks in (12/18/2023, 9:00 AM)
- Season 3, start (3/4/2024, 9:00 AM)
- Season 3, two weeks in (3/18/2024, 9:00 AM)
- Season 3, end (5/19/2024, 9:00 PM)

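These times are in *US Pacific time*. For the code below, it's easier to work in UTC, which turns 9:00 AM into 4:00 PM while daylight saving time is in effect (UTC-7). During standard time (UTC-8), 4:00 PM UTC corresponds to 8:00 AM Pacific, which is why some of the output below reports 08:00 AM.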
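
The offset between Pacific time and UTC depends on daylight saving time, so a small conversion helper can make the boundary times less error-prone. This is a minimal sketch, assuming fixed offsets of UTC-8 for standard time and UTC-7 for daylight time. The helper name `pacific_to_epoch` is my own, and the caller has to choose the right offset for each date.

```python
import calendar
from datetime import datetime, timedelta, timezone

# fixed offsets for US Pacific time (assumed: standard = UTC-8, daylight = UTC-7)
PST = timezone(timedelta(hours=-8))
PDT = timezone(timedelta(hours=-7))

def pacific_to_epoch(year, month, day, hour, minute, tz) -> int:
    """Convert a Pacific wall-clock time to seconds since the epoch."""
    local = datetime(year, month, day, hour, minute, tzinfo=tz)
    # utctimetuple converts the aware datetime to UTC before timegm reads it
    return calendar.timegm(local.utctimetuple())

s2_start = pacific_to_epoch(2023, 12, 4, 9, 0, PST)    # standard time in December
s3_2weeks = pacific_to_epoch(2024, 3, 18, 9, 0, PDT)   # daylight time after March 10, 2024

for epoch in (s2_start, s3_2weeks):
    as_utc = datetime.fromtimestamp(epoch, timezone.utc)
    print(as_utc.strftime('%m/%d/%y, %I:%M %p UTC'))
# 12/04/23, 05:00 PM UTC
# 03/18/24, 04:00 PM UTC
```

The two results differ by an hour in UTC even though both are 9:00 AM on the Pacific wall clock.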

In [84]:
s1_ranks = 10988

print("Finding season 2, start time")
s2_start_time = calendar.timegm(time.strptime('12/4/23, 4:00 PM', '%m/%d/%y, %I:%M %p'))
s2_start = find_match_after(s2_start_time, 1, last_match)

print("\nFinding season 2, two weeks in")
s2_2weeks_time = calendar.timegm(time.strptime('12/18/23, 4:00 PM', '%m/%d/%y, %I:%M %p'))
s2_2weeks = find_match_after(s2_2weeks_time, s2_start, last_match)

print("\nFinding season 3, start time")
s3_start_time = calendar.timegm(time.strptime('3/4/24, 4:00 PM', '%m/%d/%y, %I:%M %p'))
s3_start = find_match_after(s3_start_time, s2_2weeks, last_match)

print("\nFinding season 3, two weeks in")
s3_2weeks_time = calendar.timegm(time.strptime('3/18/24, 4:00 PM', '%m/%d/%y, %I:%M %p'))
s3_2weeks = find_match_after(s3_2weeks_time, s3_start, last_match)

print("\nFinding season 3, end time")
s3_end_time = calendar.timegm(time.strptime('5/19/24, 9:00 PM', '%m/%d/%y, %I:%M %p'))
s3_end = find_match_after(s3_end_time, s3_2weeks, last_match)

print("\nInterval Results:")
print("Interval 1: Ranks introduced in S1 - end of S1")
print(f"{s1_ranks} - {s2_start - 1} ({s2_start - s1_ranks} matches)")

print("Interval 2: Two weeks into S2 - end of S2")
print(f"{s2_2weeks} - {s3_start - 1} ({s3_start - s2_2weeks} matches)")

print("Interval 3: Two weeks into S3 - end of s3")
print(f"{s3_2weeks} - {s3_end - 1} ({s3_end - s3_2weeks} matches)")

Finding season 2, start time
Searching range 1-45670:
0%
6.2%
12.5%
18.8%
25.0%
31.2%
37.5%
43.8%
50.0%
56.2%
62.5%
68.8%
75.0%
81.2%
87.5%
93.8%
100.0%
Found the first match after 12/04/23, 08:00 AM: 20218.

Finding season 2, two weeks in
Searching range 20218-45670:
0%
6.7%
13.3%
20.0%
26.7%
33.3%
40.0%
46.7%
53.3%
60.0%
66.7%
73.3%
80.0%
86.7%
93.3%
100.0%
Found the first match after 12/18/23, 08:00 AM: 23027.

Finding season 3, start time
Searching range 23027-45670:
0%
6.7%
13.3%
20.0%
26.7%
33.3%
40.0%
46.7%
53.3%
60.0%
66.7%
73.3%
80.0%
86.7%
93.3%
Found the first match after 03/04/24, 08:00 AM: 34409.

Finding season 3, two weeks in
Searching range 34409-45670:
0%
7.1%
14.3%
21.4%
28.6%
35.7%
42.9%
50.0%
57.1%
64.3%
71.4%
78.6%
85.7%
92.9%
100.0%
Found the first match after 03/18/24, 09:00 AM: 37626.

Finding season 3, end time
Searching range 37626-45670:
0%
7.7%
15.4%
23.1%
30.8%
38.5%
46.2%
53.8%
61.5%
69.2%
76.9%
84.6%
92.3%
100.0%
Found the first match after 05/19/24, 02:0

The code above is ugly, but it gets the job done for the intervals of data I want to collect. The two-week cutoff could have been longer, since most games two weeks into a season still have at least one unranked player, but I think it is good enough to leave unchanged.
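
One possible way to tidy this up is to keep the boundary times in a list and loop over them, feeding each result in as the lower bound of the next search. This is only a sketch: `find_boundaries`, `to_epoch`, and `fake_find_after` are hypothetical names, and the fake finder pretends one match is created per hour so the example runs offline.

```python
import calendar
import time

TIME_FORMAT = '%m/%d/%y, %I:%M %p'

# (label, UTC time) pairs copied from the cell above
BOUNDARIES = [
    ("s2_start", '12/4/23, 4:00 PM'),
    ("s2_2weeks", '12/18/23, 4:00 PM'),
    ("s3_start", '3/4/24, 4:00 PM'),
    ("s3_2weeks", '3/18/24, 4:00 PM'),
    ("s3_end", '5/19/24, 9:00 PM'),
]

def to_epoch(utc_string: str) -> int:
    """Parse a UTC time string into seconds since the epoch."""
    return calendar.timegm(time.strptime(utc_string, TIME_FORMAT))

def find_boundaries(find_after, first_id: int, last_id: int) -> dict:
    """Run find_after once per boundary, narrowing the lower bound each time."""
    results = {}
    lower = first_id
    for label, utc_string in BOUNDARIES:
        lower = find_after(to_epoch(utc_string), lower, last_id)
        results[label] = lower
    return results

# offline stand-in for find_match_after: match ID i is "created" i - 1 hours
# after midnight UTC on December 1, 2023 (made-up data)
BASE = to_epoch('12/1/23, 12:00 AM')

def fake_find_after(t: int, low: int, high: int) -> int:
    for match_id in range(low, high + 1):
        if BASE + (match_id - 1) * 3600 >= t:
            return match_id
    return high + 1

results = find_boundaries(fake_find_after, 1, 5000)
print(results)
```

With the real function from this notebook, the call would be `find_boundaries(find_match_after, 1, last_match)`.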