# Data Preparation - Reviews <a id='top'></a>

In this last data preparation notebook, we work with the reviews data. We clean the review text, exclude very short reviews, unpack several variables from a nested column, remove unnecessary columns, and format boolean features.


The structure of this notebook is as follows:

[0. Import Libraries](#libraries) <br>
[1. Define Functions](#functions) <br>
&emsp; [1.1. Clean Reviews](#clean) <br>
&emsp; [1.2. Prepare Data](#prepare) <br>
[2. Data Preparation](#preparation) <br>

# 0. Import Libraries<a id='libraries'></a>
[to the top](#top)  

The first step is to import the necessary libraries.

In [None]:
import polars as pl
from datetime import datetime, timezone
import re
from helper_functions import clean_parquet_files

# 1. Define Functions<a id='functions'></a>
[to the top](#top) 

To facilitate data preparation, we start by defining the functions that we will be using.

## 1.1. Clean Reviews<a id='clean'></a>
[to the top](#top)  

The function below cleans the text of a review by removing unwanted characters. It expects a string and does not validate its input explicitly: passing anything else makes re.sub raise a TypeError. The function removes all non-ASCII characters using a regular expression, so accented letters, emoji, and non-Latin scripts are dropped. It also deletes newline and carriage return characters; because they are removed rather than replaced with a space, the words on either side of a line break are joined. These steps standardize the reviews for further processing or analysis.

In [None]:
def preprocess_review(review):
    # Remove non-ASCII characters
    review = re.sub(r'[^\x00-\x7F]', '', review)
    # Remove newline and carriage return characters
    review = review.replace('\n', '').replace('\r', '')
    return review
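
As a quick check of this behavior, the sketch below repeats the function so that it runs on its own and applies it to two made-up review strings. The sample texts are illustrative assumptions, not taken from the dataset.

```python
import re

def preprocess_review(review):
    # Same logic as the function above, repeated so this sketch is self-contained
    review = re.sub(r'[^\x00-\x7F]', '', review)
    return review.replace('\n', '').replace('\r', '')

# Made-up sample reviews, for illustration only
sample_symbols = "Great game! ★★★ 10/10"
sample_multiline = "Loved it.\r\nWould play again."

cleaned_symbols = preprocess_review(sample_symbols)
cleaned_multiline = preprocess_review(sample_multiline)

print(repr(cleaned_symbols))    # 'Great game!  10/10' (the stars are dropped, both spaces remain)
print(repr(cleaned_multiline))  # 'Loved it.Would play again.' (the line break is removed, not spaced)
```

If joined words turn out to matter for later text analysis, replacing line breaks with a single space would be one possible variation; that is not what the notebook's function does.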


## 1.2. Prepare Data<a id='prepare'></a>
[to the top](#top) 

The preprocess_gamereviews function cleans and standardizes a single game review stored as a dictionary. It first preprocesses the review text to remove unwanted characters and returns None for reviews shorter than 20 characters, so that they can be filtered out later. It then converts the timestamp fields (timestamp_created, timestamp_updated, and the author's last_played) from Unix timestamps to UTC datetime objects for consistency. The function also unpacks the nested fields of the author section into flat user_* fields for easier access and removes fields that are not needed. Finally, it casts the boolean fields to bool and weighted_vote_score to float. The cleaned dictionary is then returned, ready for further analysis or storage.

In [None]:
def preprocess_gamereviews(gamereviews_dict):
    
    # Preprocess the review
    gamereviews_dict["review"] = preprocess_review(gamereviews_dict["review"])
    
    # Skip rows with reviews less than 20 characters
    if len(gamereviews_dict["review"]) < 20:
        return None
    
    # Format timestamps
    gamereviews_dict["timestamp_created"] = datetime.fromtimestamp(gamereviews_dict["timestamp_created"], tz=timezone.utc)
    gamereviews_dict["timestamp_updated"] = datetime.fromtimestamp(gamereviews_dict["timestamp_updated"], tz=timezone.utc)
    gamereviews_dict["author"]["last_played"] = datetime.fromtimestamp(gamereviews_dict["author"]["last_played"], tz=timezone.utc)
    
    # Unpack nested author fields
    gamereviews_dict["user_steamid"] = gamereviews_dict["author"]["steamid"]
    gamereviews_dict["user_num_games_owned"] = gamereviews_dict["author"]["num_games_owned"]
    gamereviews_dict["user_num_reviews"] = gamereviews_dict["author"]["num_reviews"]
    gamereviews_dict["user_playtime_forever"] = gamereviews_dict["author"]["playtime_forever"]
    gamereviews_dict["user_playtime_at_review"] = gamereviews_dict["author"]["playtime_at_review"]
    gamereviews_dict["user_last_played"] = gamereviews_dict["author"]["last_played"]
    
    # Remove unnecessary fields
    gamereviews_dict.pop("author")
    gamereviews_dict.pop("language", None)
    gamereviews_dict.pop("hidden_in_steam_china", None)
    gamereviews_dict.pop("steam_china_location", None)
    
    # Format boolean and float fields
    gamereviews_dict["voted_up"] = bool(gamereviews_dict["voted_up"])
    gamereviews_dict["steam_purchase"] = bool(gamereviews_dict["steam_purchase"])
    gamereviews_dict["received_for_free"] = bool(gamereviews_dict["received_for_free"])
    gamereviews_dict["written_during_early_access"] = bool(gamereviews_dict["written_during_early_access"])
    gamereviews_dict["weighted_vote_score"] = float(gamereviews_dict["weighted_vote_score"])
    
    # Return the cleaned dictionary
    return gamereviews_dict
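
To make the individual steps concrete, here is a minimal sketch on one hypothetical record. Every value in it is made up, and the assumption that weighted_vote_score arrives as a string is only for illustration. The loop that flattens the author dictionary is a compact stand-in for the explicit assignments in the function above, not the notebook's own code.

```python
from datetime import datetime, timezone

# Hypothetical record shaped like one review; all values are made up
record = {
    "review": "Solid gameplay loop, fun with friends.",
    "timestamp_created": 1700000000,
    "voted_up": 1,
    "weighted_vote_score": "0.52",  # assumed to arrive as a string
    "author": {"steamid": "0000", "num_reviews": 3},
}

# Unix seconds -> timezone-aware UTC datetime
record["timestamp_created"] = datetime.fromtimestamp(
    record["timestamp_created"], tz=timezone.utc
)

# Flatten the nested author dictionary into user_* keys, then drop it
author = record.pop("author")
for key, value in author.items():
    record[f"user_{key}"] = value

# Cast the flag and score fields to their intended types
record["voted_up"] = bool(record["voted_up"])
record["weighted_vote_score"] = float(record["weighted_vote_score"])

print(record["timestamp_created"].isoformat())  # 2023-11-14T22:13:20+00:00
print(sorted(record))
```

Passing tz=timezone.utc matters here: without it, fromtimestamp returns a naive datetime in the machine's local time zone, so the same file could yield different values on different machines.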

## 1.3. Process Parquet Files<a id='parquet'></a>
[to the top](#top) 

The preprocess_parquet_file function processes game review data stored in a Parquet file and prepares it for further analysis. It begins by reading the Parquet file into a DataFrame and applying the preprocess_gamereviews function to each row to clean and standardize the review data. Rows that do not meet the criteria, such as reviews shorter than 20 characters, are filtered out. The cleaned data is then compiled into a new DataFrame, with an additional column for the appid of the game. The function also ensures there are no duplicate rows in the DataFrame. Finally, the processed DataFrame is written back to a Parquet file at the specified output path, ready for efficient storage and retrieval.

In [None]:
def preprocess_parquet_file(filepath, output_filepath, appid):
    # Read the parquet file
    df = pl.read_parquet(filepath)
    
    # Apply preprocessing to each row (iter_rows never yields None, so no guard is needed here)
    preprocessed_data = [preprocess_gamereviews(row) for row in df.iter_rows(named=True)]
    
    # Filter out None values (reviews less than 20 characters)
    preprocessed_data = [row for row in preprocessed_data if row is not None]
    
    # Create a new DataFrame from preprocessed data
    if preprocessed_data:
        preprocessed_df = pl.DataFrame(preprocessed_data)
        
        # Add the appid column
        preprocessed_df = preprocessed_df.with_columns(pl.lit(appid).alias("appid"))
        
        # Check for duplicates
        preprocessed_df = preprocessed_df.unique()
        
        # Write the processed file
        preprocessed_df.write_parquet(output_filepath)

# 2. Data Preparation<a id='preparation'></a>
[to the top](#top) 

Before applying our newly created functions, we have to decide which parquet files to process. To do that, we created another function, clean_parquet_files, which cleans up the Parquet files in a given directory based on the review count of each app. It uses a regular expression to extract the app ID and review count from each filename. In a first pass, it scans all Parquet files, considers only those with more than 250 reviews, and records the file with the highest count for each app in a dictionary. In a second pass, it deletes every file that has 250 or fewer reviews or that is not the highest-count file for its app ID. As a result, at most one Parquet file is kept per app. Note that the files are deleted directly from the input folder.

Before applying this function, we define the input and output folders for the parquet files.

In [None]:
input_folder = 'data/parquets'
output_folder = 'data/parquets_preprocessed'

clean_parquet_files(input_folder)
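
The selection rule described above can be sketched as follows. This is not the code of clean_parquet_files, which lives in helper_functions and is not shown in this notebook. The function name select_files_to_keep, the min_reviews parameter, and the sample filenames are hypothetical, and the filename pattern `<appid>_reviews_<count>.parquet` is an assumption. The sketch only computes which files would be kept or deleted; it does not touch the disk.

```python
import re

# Assumed filename pattern: <appid>_reviews_<count>.parquet
FILENAME_PATTERN = re.compile(r"(\d+)_reviews_(\d+)\.parquet$")

def select_files_to_keep(filenames, min_reviews=250):
    """Return {appid: filename} with the highest-count file per app above the threshold."""
    best = {}  # appid -> (count, filename)
    for name in filenames:
        match = FILENAME_PATTERN.match(name)
        if not match:
            continue  # not a review parquet file
        appid, count = match.group(1), int(match.group(2))
        if count <= min_reviews:
            continue  # too few reviews to keep
        if appid not in best or count > best[appid][0]:
            best[appid] = (count, name)
    return {appid: name for appid, (count, name) in best.items()}

# Made-up filenames, for illustration only
files = [
    "10_reviews_300.parquet",
    "10_reviews_900.parquet",
    "20_reviews_250.parquet",
    "notes.txt",
]
keep = select_files_to_keep(files)
to_delete = [f for f in files if f.endswith(".parquet") and f not in keep.values()]

print(keep)       # {'10': '10_reviews_900.parquet'}
print(to_delete)  # ['10_reviews_300.parquet', '20_reviews_250.parquet']
```

Separating the decision (which files to keep) from the deletion makes it possible to inspect the result before anything is removed.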

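Below, we apply our functions to clean the data in the selected parquet files. The cell is self-contained: it repeats the imports and function definitions, this time reading fields with .get() and default values so that a missing field does not raise an error, and it keeps only four of the author fields (user_playtime_at_review and user_last_played are not extracted). It then processes every file in the input folder whose name matches the appid_reviews_count.parquet pattern and skips files whose preprocessed version already exists.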

In [1]:
import polars as pl
import os
from datetime import datetime, timezone
import re
from helper_functions import clean_parquet_files

def preprocess_review(review):
    # Remove non-ASCII characters
    review = re.sub(r'[^\x00-\x7F]', '', review)
    # Remove newline and carriage return characters
    review = review.replace('\n', '').replace('\r', '')
    return review

def preprocess_gamereviews(gamereviews_dict):
    # Preprocess the review
    gamereviews_dict["review"] = preprocess_review(gamereviews_dict.get("review", ""))
    
    # Skip rows with reviews less than 20 characters
    if len(gamereviews_dict["review"]) < 20:
        return None
    
    # Format timestamps
    gamereviews_dict["timestamp_created"] = datetime.fromtimestamp(gamereviews_dict.get("timestamp_created", 0), tz=timezone.utc)
    gamereviews_dict["timestamp_updated"] = datetime.fromtimestamp(gamereviews_dict.get("timestamp_updated", 0), tz=timezone.utc)
    
    # Fall back to an empty dict when the author field is missing or None
    author = gamereviews_dict.get("author") or {}
    
    # Unpack nested author fields
    gamereviews_dict["user_steamid"] = author.get("steamid", "")
    gamereviews_dict["user_num_games_owned"] = author.get("num_games_owned", 0)
    gamereviews_dict["user_num_reviews"] = author.get("num_reviews", 0)
    gamereviews_dict["user_playtime_forever"] = author.get("playtime_forever", 0)
    
    # Remove unnecessary fields
    gamereviews_dict.pop("author", None)
    gamereviews_dict.pop("author_last_played", None)
    gamereviews_dict.pop("language", None)
    gamereviews_dict.pop("hidden_in_steam_china", None)
    gamereviews_dict.pop("steam_china_location", None)
    
    # Format boolean and float fields
    gamereviews_dict["voted_up"] = bool(gamereviews_dict.get("voted_up", False))
    gamereviews_dict["steam_purchase"] = bool(gamereviews_dict.get("steam_purchase", False))
    gamereviews_dict["received_for_free"] = bool(gamereviews_dict.get("received_for_free", False))
    gamereviews_dict["written_during_early_access"] = bool(gamereviews_dict.get("written_during_early_access", False))
    gamereviews_dict["weighted_vote_score"] = float(gamereviews_dict.get("weighted_vote_score", 0.0))
    
    # Return the cleaned dictionary
    return gamereviews_dict

def preprocess_parquet_file(filepath, output_filepath, appid):
    # Read the parquet file
    df = pl.read_parquet(filepath)
    
    # Apply preprocessing to each row (iter_rows never yields None, so no guard is needed here)
    preprocessed_data = [preprocess_gamereviews(row) for row in df.iter_rows(named=True)]
    
    # Filter out None values (reviews less than 20 characters)
    preprocessed_data = [row for row in preprocessed_data if row is not None]
    
    # Create a new DataFrame from preprocessed data
    if preprocessed_data:
        preprocessed_df = pl.DataFrame(preprocessed_data)
        
        # Add the appid column
        preprocessed_df = preprocessed_df.with_columns(pl.lit(appid).alias('appid'))
        
        # Check for duplicates
        preprocessed_df = preprocessed_df.unique()
        
        # Write the processed file
        preprocessed_df.write_parquet(output_filepath)

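input_folder = 'data/parquets'
output_folder = 'data/parquets_preprocessed'

# Make sure the output folder exists before any file is written to it
os.makedirs(output_folder, exist_ok=True)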

clean_parquet_files(input_folder)

for file_name in os.listdir(input_folder):
    match = re.match(r"(\d+)_reviews_\d+\.parquet", file_name)
    if match:
        appid = match.group(1)
        input_filepath = os.path.join(input_folder, file_name)
        output_filepath = os.path.join(output_folder, file_name.replace('.parquet', '_preprocessed.parquet'))
        
        # Check if the processed file already exists
        if not os.path.exists(output_filepath):
            preprocess_parquet_file(input_filepath, output_filepath, appid)
