# Preprocessing

## Imports

- csv: Allows us to interact with the file format we're using for storing data
- os: Lets us access the file system for creating folders

In [2]:
import csv
import os

## Grouping our Datasets

At the time of writing, around 11.5K data entries are available to train on. This data is good on its own, but it can also be split into different batches based on our intervals. Additionally, we should remove the first line, since it is only useful to our data-grabbing algorithm and not to our analysis.

### Main Batches

Let's use a folder to store all of our batches of data. First, we need to create that folder if it doesn't already exist.

In [7]:
BATCH_DIRNAME = 'batches'
if not os.path.exists(BATCH_DIRNAME):
    print(rf"Path .\{BATCH_DIRNAME} not found!")
    os.makedirs(BATCH_DIRNAME)
    print("Directory created!")
else:
    print(rf"Found .\{BATCH_DIRNAME}")

Found .\batches
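
As a side note, the existence check and the creation can be folded into one call. The sketch below is an alternative, not the cell used in this notebook; `ensure_dir` is a hypothetical helper name, and it relies on the `exist_ok` parameter of `os.makedirs`.

```python
import os

def ensure_dir(path: str) -> bool:
    """Create `path` if it is missing; return True if it had to be created."""
    existed = os.path.isdir(path)
    # exist_ok=True makes this call a no-op when the folder is already there
    os.makedirs(path, exist_ok=True)
    return not existed

if ensure_dir('batches'):
    print("Directory created!")
else:
    print("Found existing directory")
```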


Now, before we start making more batches, let's write a utility function that separates a batch based on whatever parameters we provide. We can pass in the input and output file names to use, and a function for comparing and modifying the data. Additionally, let's add an option to skip the first line, which does not contain relevant data. We only need it when reading from our main collected data.

In [17]:
def make_batch(input_batch: str, output_batch: str, operator, skip_first: bool = False) -> None:
    with open(input_batch, 'r', encoding='utf-8', newline='') as infile, \
            open(output_batch, 'w', encoding='utf-8', newline='') as outfile:
        
        if skip_first:
            infile.readline()   # discard the first line if we're reading from our main collected data.

        reader = csv.reader(infile, quoting=csv.QUOTE_MINIMAL, doublequote=False, escapechar='\\')
        writer = csv.writer(outfile, quoting=csv.QUOTE_MINIMAL, doublequote=False, escapechar='\\')
        for line in reader:
            data = operator(line)
            if data:
                writer.writerow(data)
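
To see the operator contract in isolation, here is a minimal sketch that applies the same loop to in-memory text instead of files. `apply_operator` and the sample rows are made up for illustration and are not part of the notebook; the sketch also uses the default CSV dialect rather than the escape settings above.

```python
import csv
import io

def apply_operator(rows_text: str, operator) -> str:
    """Run `operator` over each CSV row in `rows_text`; return the kept rows as CSV text."""
    infile = io.StringIO(rows_text, newline='')
    outfile = io.StringIO(newline='')
    reader = csv.reader(infile)
    writer = csv.writer(outfile, lineterminator='\n')
    for row in reader:
        data = operator(row)
        if data:  # None (or an empty list) means "drop this row"
            writer.writerow(data)
    return outfile.getvalue()

# Hypothetical sample: three rows with an integer ID in the first column
sample = "1,10.5,20.0\n2,11.0,9.5\n3,8.0,7.0\n"

# Keep only the rows whose ID is even
print(apply_operator(sample, lambda row: row if int(row[0]) % 2 == 0 else None))
```

Returning the row unchanged keeps it, returning a modified list rewrites it, and returning `None` drops it.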

Now we can easily create our batches just by specifying a batch function. This function receives one line of data as a list of strings and should return the data to write to the output, or nothing to drop the line.

In [18]:
COLLECTED_DATA_FILENAME = 'sendouq-data.csv'

make_batch(COLLECTED_DATA_FILENAME, f'{BATCH_DIRNAME}\\empty.csv', lambda x : None, True)
make_batch(COLLECTED_DATA_FILENAME, f'{BATCH_DIRNAME}\\all.csv', lambda x : x, True)

As you can see, we're easily able to make batches of our data by just providing a simple anonymous function.
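
The paths above hard-code the Windows separator (`\\`). If the notebook needs to run on another platform, one option is to build paths with `os.path.join`. This is a sketch of that idea; `batch_path` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import os

BATCH_DIRNAME = 'batches'  # same folder name as used above

def batch_path(name: str) -> str:
    """Build the path to a batch file using the current platform's separator."""
    return os.path.join(BATCH_DIRNAME, f'{name}.csv')

print(batch_path('all'))
```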

### Batching by Date

When collecting our data, we gathered it from separate intervals in seasons 1, 2, and 3. We can make a batch for each of these seasons by comparing each line's match_ID against the first match_ID of the next season.

In [20]:
# indicate the match_ID values for the start of each interval
SEASON_2_START = 23027
SEASON_3_START = 37626

make_batch(f'{BATCH_DIRNAME}\\all.csv', f'{BATCH_DIRNAME}\\season_1.csv',
           lambda x : x if int(x[0]) < SEASON_2_START else None)
make_batch(f'{BATCH_DIRNAME}\\all.csv', f'{BATCH_DIRNAME}\\season_2.csv',
           lambda x : x if SEASON_2_START <= int(x[0]) < SEASON_3_START else None)
make_batch(f'{BATCH_DIRNAME}\\all.csv', f'{BATCH_DIRNAME}\\season_3.csv',
           lambda x : x if SEASON_3_START <= int(x[0]) else None)
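
The three filters use half-open ranges, so every match ID lands in exactly one season. The sketch below restates that logic as a single function to make the boundaries easy to check; `season_of` is a hypothetical helper and is not used by the batches above.

```python
SEASON_2_START = 23027
SEASON_3_START = 37626

def season_of(match_id: int) -> int:
    """Return the season (1, 2, or 3) a match ID belongs to."""
    if match_id < SEASON_2_START:
        return 1
    if match_id < SEASON_3_START:
        return 2
    return 3

# The first ID of a season belongs to that season, not to the previous one
for match_id in (SEASON_2_START - 1, SEASON_2_START, SEASON_3_START - 1, SEASON_3_START):
    print(match_id, season_of(match_id))
```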

## Sorting Players

Finally, we can add a preprocessing step to see whether it makes any significant difference to our results. Instead of filtering the data as before, this step transforms it: the players on each team are sorted by rating. This doesn't change anything significant about our data, since the order in which a team's players are listed does not matter. Which players were on Alpha and which were on Bravo is not changed by this step.

In [24]:
def sort_players(data: list[str]) -> list[str | float]:
    alpha = []
    # 1-4 is team alpha (0 is match_id)
    for i in range(1, 5):
        alpha.append(float(data[i]))
    
    bravo = []
    # 5-8 is team bravo (9 is result)
    for i in range(5, 9):
        bravo.append(float(data[i]))
    
    # sort alpha and bravo teams
    alpha.sort()
    bravo.sort()

    # make our new data entry
    return [data[0]] + alpha + bravo + [data[9]]

batches_to_sort = ['all', 'season_1', 'season_2', 'season_3']
for batch in batches_to_sort:
    make_batch(f'{BATCH_DIRNAME}\\{batch}.csv',
               f'{BATCH_DIRNAME}\\{batch}_sorted.csv', sort_players)
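
To make the effect of the transformation concrete, here is a compact sketch of the same idea applied to one row. `sort_team_ratings` is a hypothetical stand-in for `sort_players`, and the ratings and result value in `sample_row` are invented for illustration.

```python
def sort_team_ratings(row: list) -> list:
    """Sort the four alpha ratings and the four bravo ratings independently."""
    alpha = sorted(float(value) for value in row[1:5])   # columns 1-4
    bravo = sorted(float(value) for value in row[5:9])   # columns 5-8
    # match_id (column 0) and result (column 9) pass through untouched
    return [row[0], *alpha, *bravo, row[9]]

# Hypothetical row: match_id, 4 alpha ratings, 4 bravo ratings, result
sample_row = ['101', '25.0', '18.5', '30.1', '22.0', '19.0', '27.3', '21.4', '16.8', '1']
print(sort_team_ratings(sample_row))
# → ['101', 18.5, 22.0, 25.0, 30.1, 16.8, 19.0, 21.4, 27.3, '1']
```

Each team keeps the same four ratings; only their order within the team changes, and the ratings are written back as floats rather than the original strings.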