# Trim and Format the Dataset

Shortly after the end of r/Place 2022, Reddit released a [dataset](https://www.reddit.com/r/place/comments/txvk2d/rplace_datasets_april_fools_2022/) containing the timestamp, user ID (hashed), pixel color, and coordinates of every tile placed throughout the entire event (that's about 160 million tiles!). Here's a sample of the data from the CSV file (I've modified the last entry to give it four coordinate values):

|timestamp                  |user_id                                                                                 |pixel_color|coordinate         |
|---------------------------|----------------------------------------------------------------------------------------|-----------|-------------------|
|2022-04-04 00:53:51.577 UTC|ovTZk4GyTS1mDQnTbV+vDOCu1f+u6w+CkIZ6445vD4XN8alFy/6GtNkYp5MSic6Tjo/fBCCGe6oZKMAN3rEZHw==|#00CCC0    |826,1048           |
|2022-04-04 00:53:53.758 UTC|6NSgFa1CvIPly1VniNhlbrmoN3vgDFbMSKqh+c4TTfrr3dMib91oUWONX96g5PPcioIxedF24ldNOu/g5yqDrg==|#94B3FF    |583,1031           |
|2022-04-04 00:53:54.685 UTC|O5Oityp3Z3owzTuwM9XnMggpLcqKEumsOMKGhRiDTTImWbNLhLKmLI4gn1QPbaABqZqmFC/OmE/O732n39dGIQ==|#6A5CFF    |1873,558           |
|2022-04-04 00:54:57.541 UTC|tc273UiqS0wKa6VwiOs/iz/t4LyPYrhL2Q347awn11IQQELrEzZBCmGe28NWM+O1IdfH4CieCpEnE5sHecW9Ow==|#009EAA    |1627,255           |
|2022-04-04 00:55:16.307 UTC|OOWsU/HLb4UUkQwclDeXFtsJTOXMlAdNHiRpFA1Qk+SxUrJE7lpGFevfV9w+zImFimmNANlDdfN3kluz69M9MQ==|#94B3FF    |49,1478            |
|2022-04-04 00:55:20.64 UTC |A0HdtcPvI7ipKivvXNVZDa3gkcjGXFjNxF5tca5QXazENCGVR8d3j65QVgnaVIgvGbdtiGuSvRs1rAj2f1oMAQ==|#E4ABFF    |408,1863           |
|2022-04-04 00:55:34.898 UTC|1U4LPuB22P6Yf7eRhKz6zU1dFMK5wXIzsNVPUNhP7eHIwGuMfnDz/hDh/8QhDug6qqsKXHYvgH9L5FnWNMGm3A==|#94B3FF    |111,1582           |
|2022-04-04 00:55:57.168 UTC|tPcrtm7OtEmSThdRSWmB7jmTF9lUVZ1pltNv1oKqPY9bom/EGIO3/b5kjRenbD3vMF48psnR9MnhIrTT1bpC9A==|#6A5CFF    |1908,1854          |
|2022-04-04 00:55:40.375 UTC|0AGoMGF50j0DJDc+704SwMylU90YDILDIgo8WOetgpiEWGEhMB3Eh8Q8r+Wa9XIJhQZQcOquqpRgZ5REB4jrOA==|#6A5CFF    |1134,1640,1135,1641|


This initial form of the dataset provided by Reddit isn't well suited for making our animation. Some things could be better:

- The CSV format is slow to load and is much larger than other binary formats, like [Apache Parquet](https://parquet.apache.org/).
- The bulk of the file comes from the `user_id` column, which we don't care about.
- The timestamp is a long string, and it's in a format that's expensive to parse. It would be better if it were just milliseconds as an integer.
- There is only one `coordinate` column, which contains the x and y values separated by a comma. There should be separate x and y columns.
- The r/Place admins used a "rectangle drawing tool" to cover "inappropriate content" on 21 occasions. These rectangles are given with four coordinates (x1, y1, x2, y2). The last row of the sample data above is an example of this.
- The `pixel_color` column gives the colors in hex format. There were only 32 colors to choose from, so we could cut this down significantly by assigning each color a number.
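
To put rough numbers on the points above, here is a back-of-the-envelope sketch of the trimmed row layout. It assumes the four remaining columns are stored as a 32-bit timestamp, an 8-bit color key, and two 16-bit coordinates (the dtypes used later in this section), and it ignores any compression the file format applies. `ROW_LAYOUT` and `APPROX_TILES` are names made up for this illustration.

```python
import struct

# Hypothetical packed layout: uint32 timestamp, uint8 color key, uint16 x, uint16 y.
# "<" disables padding, so the size is the plain sum of the field widths.
ROW_LAYOUT = "<IBHH"
bytes_per_row = struct.calcsize(ROW_LAYOUT)

# "About 160 million tiles", as quoted above; this is not an exact count.
APPROX_TILES = 160_000_000
approx_total_bytes = bytes_per_row * APPROX_TILES

# For comparison, one hashed user_id from the sample table, stored as text.
sample_user_id = (
    "ovTZk4GyTS1mDQnTbV+vDOCu1f+u6w+CkIZ6445vD4XN8alFy/"
    "6GtNkYp5MSic6Tjo/fBCCGe6oZKMAN3rEZHw=="
)

print(f"{bytes_per_row} bytes per trimmed row")
print(f"{approx_total_bytes / 1e9:.2f} GB for the trimmed columns, uncompressed")
print(f"{len(sample_user_id)} bytes for a single user_id string")
```

A single `user_id` string takes several times more space than an entire trimmed row, which is why dropping that column matters so much.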

Let's start by writing functions to convert our timestamps and pixel colors into more suitable formats. We'll worry about parsing the coordinate column later.

## Format Dates with `parse_timestamp()`

Reformatting the YYYY-MM-DD HH:MM:SS.SSS timestamp to integer milliseconds will cut down the resulting file size significantly. Integers are also much faster to sort, and they're a better format for the calculations we will perform later. We can also subtract the POSIX timestamp of the earliest pixel, which shrinks each timestamp to a number that fits into a 32-bit unsigned integer.
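
As a quick sanity check on the 32-bit claim, the sketch below compares the range of an unsigned 32-bit integer with a raw POSIX millisecond value from the start of April 2022 and with the largest offset in the final output shown at the end of this section (`300589880`). The variable names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

UINT32_MAX = 2**32 - 1

# A raw POSIX timestamp in milliseconds for midnight UTC on 2022-04-01.
raw_posix_ms = int(datetime(2022, 4, 1, tzinfo=timezone.utc).timestamp() * 1000)

# The largest offset in the trimmed output at the end of this section.
last_offset_ms = 300_589_880

print(raw_posix_ms <= UINT32_MAX)    # False: raw values are far too large
print(last_offset_ms <= UINT32_MAX)  # True: offsets from the first tile fit

# Longest time span that millisecond offsets in a uint32 can cover,
# next to the span the event actually needed.
print(timedelta(milliseconds=UINT32_MAX))
print(timedelta(milliseconds=last_offset_ms))
```

A `uint32` holds just under 50 days of milliseconds, so an event that lasted a few days fits with plenty of room to spare.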

In [28]:
from datetime import datetime, timezone

def parse_timestamp(timestamp):
    """Convert a YYYY-MM-DD HH:MM:SS.SSS UTC timestamp to milliseconds after the start of r/Place 2022."""
    date_format = "%Y-%m-%d %H:%M:%S.%f"
    # Drop the trailing " UTC" before parsing.
    timestamp = timestamp[:-4]
    try:
        parsed = datetime.strptime(timestamp, date_format)
    except ValueError:
        # The timestamp is exactly on the second, so there is no decimal (%f).
        # This happens 1/1000 of the time.
        parsed = datetime.strptime(timestamp, date_format[:-3])

    # strptime() returns a naive datetime, which .timestamp() would interpret in the
    # machine's local time zone. Mark it as UTC so the result is the same everywhere.
    seconds = parsed.replace(tzinfo=timezone.utc).timestamp()

    # Convert from a float in seconds to an int in milliseconds.
    # round() guards against float error that int() would truncate.
    milliseconds = round(seconds * 1000.0)

    # The earliest timestamp is 2022-04-01 12:44:10.315 UTC (1648817050315 ms),
    # so subtract that from each timestamp to get the time in milliseconds
    # since the beginning of the experiment.
    milliseconds -= 1648817050315

    return str(milliseconds)

## Format Colors with `parse_pixel_color()`

We can save storage space by assigning each of the 32 colors used in the event to an integer key. This key fits in an 8-bit unsigned integer and can later be used to convert back to the color value.

In [29]:
def parse_pixel_color(pixel_color):
    """Convert a hex color code to an integer key."""
    hex_to_key = {
        "#000000": "0",
        "#00756F": "1",
        "#009EAA": "2",
        "#00A368": "3",
        "#00CC78": "4",
        "#00CCC0": "5",
        "#2450A4": "6",
        "#3690EA": "7",
        "#493AC1": "8",
        "#515252": "9",
        "#51E9F4": "10",
        "#6A5CFF": "11",
        "#6D001A": "12",
        "#6D482F": "13",
        "#7EED56": "14",
        "#811E9F": "15",
        "#898D90": "16",
        "#94B3FF": "17",
        "#9C6926": "18",
        "#B44AC0": "19",
        "#BE0039": "20",
        "#D4D7D9": "21",
        "#DE107F": "22",
        "#E4ABFF": "23",
        "#FF3881": "24",
        "#FF4500": "25",
        "#FF99AA": "26",
        "#FFA800": "27",
        "#FFB470": "28",
        "#FFD635": "29",
        "#FFF8B8": "30",
        "#FFFFFF": "31",
    }

    return hex_to_key[pixel_color]
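
Converting back is a lookup in the opposite direction. The sketch below is an illustration rather than part of the pipeline: `KEY_TO_HEX` is a hypothetical reverse table holding only four of the 32 entries from the mapping above, and `key_to_rgb()` is a made-up helper that turns a key into an (R, G, B) tuple.

```python
# Hypothetical reverse lookup; only a few of the 32 entries are shown.
KEY_TO_HEX = {
    0: "#000000",
    5: "#00CCC0",
    17: "#94B3FF",
    31: "#FFFFFF",
}

def key_to_rgb(key):
    """Convert an integer color key back to an (R, G, B) tuple of 0-255 ints."""
    hex_color = KEY_TO_HEX[key]
    # bytes.fromhex() parses "00CCC0" into the three bytes 0x00, 0xCC, 0xC0.
    return tuple(bytes.fromhex(hex_color[1:]))

print(key_to_rgb(5))   # (0, 204, 192)
print(key_to_rgb(17))  # (148, 179, 255)
```

Since the keys above follow the sorted order of the hex strings, a full reverse table could also be a plain list of the 32 hex codes indexed by key.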

## Format Coordinates with `parse_coordinate()`

The Reddit dataset stores the x and y coordinates in a single column. Each entry is a string of values separated by commas (for example, `"1627,255"`). As mentioned previously, some entries have four values to represent a rectangle. We'll return to that problem soon.

In [30]:
def parse_coordinate(coordinate):
    """Split the coordinate string into two numbers."""
    return coordinate.split(",")
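
Note that `parse_coordinate()` returns strings and does not distinguish points from rectangles. A minimal sketch of that distinction, using two coordinate values from the sample table, might look like the following. `split_coordinate()` is a hypothetical helper, not something the pipeline below calls.

```python
def split_coordinate(coordinate):
    """Return ("point", (x, y)) or ("rectangle", (x1, y1, x2, y2)) for a coordinate string."""
    values = tuple(int(value) for value in coordinate.split(","))
    if len(values) == 2:
        return "point", values
    if len(values) == 4:
        return "rectangle", values
    # Anything else does not match either shape described in this section.
    raise ValueError(f"unexpected coordinate: {coordinate!r}")

print(split_coordinate("1627,255"))             # ('point', (1627, 255))
print(split_coordinate("1134,1640,1135,1641"))  # ('rectangle', (1134, 1640, 1135, 1641))
```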

## Putting it All Together

In [31]:
import pandas as pd

CHUNK_SIZE = 1_000_000

def process_chunk(chunk, df):
    """Process a chunk of data and append it to a dataframe."""
    chunk["timestamp"] = chunk["timestamp"].astype("uint32")
    chunk["pixel_color"] = chunk["pixel_color"].astype("uint8")

    # Group by point and rectangle coordinates.
    # Points have x and y coordinates, rectangles have x1, y1, x2, y2 coordinates.
    # We can determine the type of the coordinate by the number of commas.
    groups = chunk.groupby(chunk["coordinate"].str.count(",") == 1)
    rectangles = None
    points = groups.get_group(True).reset_index(drop=True)
    try:
        rectangles = groups.get_group(False).reset_index(drop=True)
    except KeyError:
        # There are no rectangles in this chunk.
        pass

    # Convert the points' coordinate column into x and y columns.
    points["coordinate"] = points["coordinate"].apply(lambda x: x.split(","))
    points["x"] = points["coordinate"].apply(lambda x: x[0]).astype("uint16")
    points["y"] = points["coordinate"].apply(lambda x: x[1]).astype("uint16")
    del points["coordinate"]

    # Append the points to the dataframe.
    df = pd.concat((df, points), ignore_index=True)

    if rectangles is None:
        # If this chunk has no rectangles.
        return df


    # Separate the rectangle coordinate string into a list of ints.
    rectangles["coordinate"] = rectangles["coordinate"].apply(
        lambda x: [int(c) for c in x.split(",")]
    )

    # We will convert each rectangle into several point coordinates.
    
    # Make a new dataframe to store the points created from the rectangles.
    pts_from_recs = pd.DataFrame(columns=["timestamp", "pixel_color", "x", "y"])

    # Iterate over the rectangles in this chunk.
    for rect in rectangles.itertuples():
        x1, y1, x2, y2 = rect.coordinate
        width = x2 - x1 + 1
        height = y2 - y1 + 1

        for i in range(width):
            for j in range(height):
                x = x1 + i
                y = y1 + j

                pts_from_recs.loc[len(pts_from_recs)] = [
                    rect.timestamp,
                    rect.pixel_color,
                    x,
                    y,
                ]

    # Convert the columns into the correct dtypes.
    pts_from_recs["timestamp"] = pts_from_recs["timestamp"].astype("uint32")
    pts_from_recs["pixel_color"] = pts_from_recs["pixel_color"].astype("uint8")
    pts_from_recs["x"] = pts_from_recs["x"].astype("uint16")
    pts_from_recs["y"] = pts_from_recs["y"].astype("uint16")

    return pd.concat((df, pts_from_recs), ignore_index=True)


def trim(infile, outfile):
    """Trim the infile data and write it to outfile."""
    df = pd.DataFrame(columns=["timestamp", "pixel_color", "x", "y"])
    df["timestamp"] = df["timestamp"].astype("uint32")
    df["pixel_color"] = df["pixel_color"].astype("uint8")
    df["x"] = df["x"].astype("uint16")
    df["y"] = df["y"].astype("uint16")

    with pd.read_csv(
        infile,
        usecols=["timestamp", "pixel_color", "coordinate"],
        converters={
            "timestamp": parse_timestamp,
            "pixel_color": parse_pixel_color,
        },
        chunksize=CHUNK_SIZE,
        engine="c",
        compression={"method": "gzip"},
    ) as csv:
        for chunk in csv:
            df = process_chunk(chunk, df)

    df["timestamp"] = df["timestamp"].astype("uint32")

    df.sort_values("timestamp", inplace=True, ignore_index=True)
    df.to_parquet(
        outfile,
        # With Parquet format version 1.0 (the pyarrow default when this was written),
        # the uint32 timestamp column is stored as int64.
        # https://github.com/pandas-dev/pandas/issues/37327
        # https://issues.apache.org/jira/browse/ARROW-9215
        version="2.6",
    )

    return df


trim('data/2022_place_canvas_history.csv.gzip', 'data/2022_place_canvas_history.parquet')

df = pd.read_parquet('data/2022_place_canvas_history.parquet')
df

|         |timestamp|pixel_color|x   |y   |
|---------|---------|-----------|----|----|
|0        |0        |14         |42  |42  |
|1        |12356    |3          |999 |999 |
|2        |16311    |7          |44  |42  |
|3        |21388    |21         |2   |2   |
|4        |34094    |7          |23  |23  |
|...      |...      |...        |... |... |
|160455374|300589751|31         |408 |493 |
|160455375|300589830|31         |1232|312 |
|160455376|300589857|31         |770 |866 |
|160455377|300589880|31         |1046|1721|
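
The nested loops in `process_chunk()` turn each admin rectangle into one row per covered pixel. The same expansion can be sketched on its own with `itertools.product`, applied here to the coordinates of the modified sample row from the start of this section. `expand_rectangle()` is a hypothetical helper, not part of the code above.

```python
from itertools import product

def expand_rectangle(x1, y1, x2, y2):
    """List every (x, y) covered by a rectangle whose corners are inclusive."""
    # range() excludes its stop value, hence the +1 on both axes.
    return list(product(range(x1, x2 + 1), range(y1, y2 + 1)))

pixels = expand_rectangle(1134, 1640, 1135, 1641)
print(pixels)  # [(1134, 1640), (1134, 1641), (1135, 1640), (1135, 1641)]
```

Collecting the expanded rows in a list like this and building the DataFrame once at the end would likely be faster than growing `pts_from_recs` one row at a time, although with only 21 rectangles in the dataset the difference may not matter much.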
