The goal of the final data pipeline is to parse each column of each CSV file into the correct data type and save the result as a Parquet file.

# 1. `games_description.csv`

The file stores every column as a string, and many columns hold nested data types. Attempting to parse the data types up front did not work. Processing the columns is a challenge, which I actually welcomed 🤗.
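To make the string-wrapped nesting concrete, here is a minimal sketch using only the standard library. The sample value `raw` is hypothetical, shaped like the stringified lists that the strip-and-split expressions in this section expect, and `strip_and_split` is an illustrative helper rather than part of the pipeline. `ast.literal_eval` is included only as an alternative for comparison.

```python
import ast

# Hypothetical raw cell, shaped like a stringified list column such as `genres`.
raw = "['Mythology', 'Action RPG', 'Violent']"


def strip_and_split(value):
    """Drop brackets and single quotes, then split on ", "."""
    for ch in "[]'":
        value = value.replace(ch, "")
    return value.split(", ")


print(strip_and_split(raw))   # list of three genre strings
print(ast.literal_eval(raw))  # same list, parsed as a Python literal
```

Stripping is simple and maps directly onto Polars string expressions, but it also removes apostrophes that belong to a value and splits any value that itself contains `, `. `ast.literal_eval` avoids both, at the cost of a Python call per row.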

Schema:
```
name: string
short_description: string
long_description: string
genres: object (array[string])
minimum_system_requirement: object (struct[string])
recommend_system_requirement: object (struct[string])
release_date: date
developer: object (array[string])
publisher: object (array[string])
overall_player_rating: categorical
number_of_reviews_from_purchased_people: int32
number_of_english_reviews: int32
link: string
```

In [6]:
import polars as pl
from pathlib import Path
import re

local_dir = Path("/teamspace/studios/this_studio/Steam-RecSys/data_pipeline/data")


def parse_reviews(value):
    """Return a review count parsed from the raw text, or None if nothing matches."""
    if "%" in value:
        # Extract the percentage and the total, then take that share of the total
        match = re.search(r"(\d+)% of ([\d,]+)", value)
        if match:
            percentage = int(match.group(1))
            total = int(match.group(2).replace(",", ""))
            return int((percentage / 100) * total)
    else:
        # Extract the parenthesized number directly
        match = re.search(r"\(([\d,]+)\)", value)
        if match:
            return int(match.group(1).replace(",", ""))
    return None


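def parse_system_requirements(requirements_list):
    """Turn a list of "key: value" strings into a dict, skipping items without a colon."""
    result = {}
    for item in requirements_list:
        if ":" in item:
            # Split on the first colon only so values that contain ":" stay intact
            key, value = item.split(":", 1)
            result[key.strip()] = value.strip()
    return result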


df = pl.scan_csv(local_dir / "games_description.csv")
df = df.with_columns(
    pl.col("genres").str.replace_many(["]", "'", "["], "").str.split(", "),
    pl.col("number_of_english_reviews").str.replace_all(",", "").cast(pl.Int32),
    pl.col(["minimum_system_requirement", "recommend_system_requirement"])
    .str.replace_many(["]", "'", "["], "")
    .str.split(", ")
    .map_elements(parse_system_requirements, return_dtype=pl.Object),
    pl.col(["developer", "publisher"])
    .str.replace_many(["]", "'", "["], "")
    .str.split(", "),
    pl.col("overall_player_rating").cast(pl.Categorical("lexical")),
    pl.when(pl.col("release_date").str.contains(r"\d{1,2} \w{3}, \d{4}"))
    .then(pl.col("release_date").str.to_date("%d %b, %Y", strict=False))
    .otherwise(pl.col("release_date").str.to_date("%b %Y", strict=False))
    .alias("release_date"),
    pl.col("number_of_reviews_from_purchased_people").map_elements(
        parse_reviews, return_dtype=pl.Int32
    ),
)

df.first().collect()

name,short_description,long_description,genres,minimum_system_requirement,recommend_system_requirement,release_date,developer,publisher,overall_player_rating,number_of_reviews_from_purchased_people,number_of_english_reviews,link
str,str,str,list[str],object,object,date,list[str],list[str],cat,i32,i32,str
"""Black Myth: Wukong""","""Black Myth: Wukong is an actio…","""About This Game  Black M…","[""Mythology"", ""Action RPG"", … ""Violent""]","{'OS': 'Windows 10 64-bit', 'Processor': 'Intel Core i5-8400 / AMD Ryzen 5 1600', 'Memory': '16 GB RAM', 'Graphics': 'NVIDIA GeForce GTX 1060 6GB / AMD Radeon RX 580 8GB', 'DirectX': 'Version 11', 'Storage': '130 GB available space', 'Sound Card': 'Windows Compatible Audio Device', 'Additional Notes': 'HDD Supported'}","{'OS': 'Windows 10 64-bit', 'Processor': 'Intel Core i7-9700 / AMD Ryzen 5 5500', 'Memory': '16 GB RAM', 'Graphics': 'NVIDIA GeForce RTX 2060 / AMD Radeon RX 5700 XT / INTEL Arc A750', 'DirectX': 'Version 12', 'Storage': '130 GB available space', 'Sound Card': 'Windows Compatible Audio Device', 'Additional Notes': 'SSD Required. The above specifications were tested with DLSS/FSR/XeSS enabled.'}",2024-08-19,"[""Game Science""]","[""Game Science""]","""Overwhelmingly Positive""",654820,51931,"""https://store.steampowered.com…"

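As a standalone check of the requirement parsing, the sketch below runs a first-colon-only variant of the helper on a hypothetical list of items. The `items` values are made up for illustration, including one value that contains a colon and one item with no separator. It uses plain Python, so it runs without the CSV.

```python
def parse_requirements(items):
    """Build a dict from "key: value" strings, splitting on the first colon only."""
    result = {}
    for item in items:
        if ":" in item:
            key, value = item.split(":", 1)
            result[key.strip()] = value.strip()
    return result


# Hypothetical items, as they would look after stripping brackets and splitting.
items = [
    "OS: Windows 10 64-bit",
    "Memory: 16 GB RAM",
    "Additional Notes: Resolution: 1080p",
    "no separator",
]

parsed = parse_requirements(items)
print(parsed)
```

Items without a colon are dropped silently. That is worth keeping in mind, because a requirement value that contains `, ` is cut into several items by the earlier split, and the trailing pieces then have no key.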

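The review-count helper can be exercised on its own with plain Python. The input strings below are hypothetical, chosen only to match the two regular expressions in `parse_reviews`; the real column text may look different.

```python
import re


def parse_reviews(value):
    """Return a count parsed from the text, or None if neither pattern matches."""
    if "%" in value:
        match = re.search(r"(\d+)% of ([\d,]+)", value)
        if match:
            percentage = int(match.group(1))
            total = int(match.group(2).replace(",", ""))
            return int((percentage / 100) * total)
    else:
        match = re.search(r"\(([\d,]+)\)", value)
        if match:
            return int(match.group(1).replace(",", ""))
    return None


# Hypothetical inputs: percentage branch, plain-count branch, and no match.
samples = ["(50% of 1,000) All Time", "(2,345)", "no numbers here"]
counts = [parse_reviews(s) for s in samples]
print(counts)
```

Values that match neither pattern come back as `None`, which Polars stores as a null in the `Int32` column.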
# 2. `games_ranking.csv`

The file is significantly easier to parse.

Schema:
```
game_name: string
genre: categorical
rank_type: categorical
rank: uint8
```

In [8]:
import polars as pl
from pathlib import Path

local_dir = Path("/teamspace/studios/this_studio/Steam-RecSys/data_pipeline/data")
schema = pl.Schema(
    {
        "game_name": pl.String(),
        "genre": pl.Categorical(),
        "rank_type": pl.Categorical(),
        "rank": pl.UInt8(),
    }
)

df = pl.scan_csv(local_dir / "games_ranking.csv", schema=schema)

df.head().collect()

game_name,genre,rank_type,rank
str,cat,cat,u8
"""Counter-Strike 2""","""Action""","""Sales""",1
"""Warhammer 40,000: Space Marine…","""Action""","""Sales""",2
"""Cyberpunk 2077""","""Action""","""Sales""",3
"""Black Myth: Wukong""","""Action""","""Sales""",4
"""ELDEN RING""","""Action""","""Sales""",5


# 3. `steam_game_reviews.csv`

This file matters most: it is our actual game review dataset. The key task is to parse the reviews into a format suitable for the recommendation system.

Schema:
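```
review: string
hours_played: float32
helpful: int64
funny: int64
recommendation: categorical
date: date
game_name: string
username: string
```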

In [14]:
import polars as pl
from pathlib import Path
from datetime import datetime

local_dir = Path("/teamspace/studios/this_studio/Steam-RecSys/data_pipeline/data")


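def parse_date(date_str):
    # Dates without a year (e.g. "13 September") are assumed to be from 2024.
    # The year is parsed together with the day so that "29 February" is valid;
    # without a year, strptime defaults to 1900, which is not a leap year.
    try:
        return datetime.strptime(f"{date_str} 2024", "%d %B %Y").date()
    except ValueError:
        return None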


df = pl.scan_csv(local_dir / "steam_game_reviews.csv", infer_schema_length=10000)

df = df.with_columns(
    pl.col("hours_played").str.replace_all(",", "").cast(pl.Float32),
    pl.col(["helpful", "funny"]).str.replace_all(",", "").cast(pl.Int64),
    pl.col("recommendation").cast(pl.Categorical("lexical")),
    pl.when(pl.col("date").str.contains(r"\d{1,2} \w+, \d{4}"))
    .then(pl.col("date").str.to_date("%d %B, %Y", strict=False))
    .when(pl.col("date").str.contains(r"\w+ \d{1,2}, \d{4}"))
    .then(pl.col("date").str.to_date("%B %d, %Y", strict=False))
    .otherwise(pl.col("date").map_elements(parse_date, return_dtype=pl.Date))
    .alias("date"),
    pl.when(pl.col("username").str.contains("\n"))
    .then(pl.col("username").str.extract(r"^(.*?)\n"))
    .otherwise(pl.col("username")),
)

df.collect()

review,hours_played,helpful,funny,recommendation,date,game_name,username
str,f32,i64,i64,cat,date,str,str
"""The game itself is also super …",39.900002,1152,13,"""Recommended""",2024-09-14,"""Warhammer 40,000: Space Marine…","""Sentinowl"""
"""Never cared much about Warhamm…",91.5,712,116,"""Recommended""",2024-09-13,"""Warhammer 40,000: Space Marine…","""userpig"""
"""A salute to all the fallen bat…",43.299999,492,33,"""Recommended""",2024-09-14,"""Warhammer 40,000: Space Marine…","""Imparat0r"""
"""this game feels like it was ma…",16.799999,661,15,"""Recommended""",2024-09-14,"""Warhammer 40,000: Space Marine…","""Fattest_falcon"""
"""Reminds me of something I've l…",24.0,557,4,"""Recommended""",2024-09-12,"""Warhammer 40,000: Space Marine…","""Jek"""
…,…,…,…,…,…,…,…
"""2022 Early Access Review Loads…",4.2,1,0,"""Recommended""",2022-08-04,"""Turbo Golf Racing""","""Fatal Exit"""
"""2022 Early Access Review Great…",8.5,1,0,"""Recommended""",2022-08-04,"""Turbo Golf Racing""","""cleybaR"""
"""2022 Early Access Review Excel…",83.300003,2,0,"""Recommended""",2022-08-04,"""Turbo Golf Racing""","""Sim"""
"""2022 Early Access Review This …",3.8,1,0,"""Recommended""",2022-08-04,"""Turbo Golf Racing""","""Fatboybadboy"""

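The year-less date fallback can be checked in isolation. This sketch assumes, as the pipeline code does, that dates without a year belong to 2024. `parse_short_date` is a hypothetical name, and it parses the year together with the day and month.

```python
from datetime import datetime


def parse_short_date(date_str, year=2024):
    """Parse a "13 September"-style date, assuming the given year."""
    try:
        return datetime.strptime(f"{date_str} {year}", "%d %B %Y").date()
    except ValueError:
        # Anything that is not "<day> <full month name>" yields None.
        return None


print(parse_short_date("13 September"))
print(parse_short_date("29 February"))
print(parse_short_date("September 13"))
```

Including the year in the parsed string matters for `29 February`: when the string carries no year, `strptime` defaults to 1900, which is not a leap year, so that date would be rejected.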