## Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
import os
import random

In [2]:
FORESIGHT_DIRECTORY = os.path.abspath(os.path.join(os.getcwd(), os.pardir))

DATA_RAW_DIRECTORY = os.path.join(FORESIGHT_DIRECTORY, "data", "raw")
DATA_INTERIM_DIRECTORY = os.path.join(FORESIGHT_DIRECTORY, "data", "interim")

# File name
DATA_FILENAME = "combat_results_lvl_5.csv"

# Full paths
FILE_PATH = os.path.join(DATA_RAW_DIRECTORY, DATA_FILENAME)

In [3]:
# Load the raw combat results
df = pd.read_csv(FILE_PATH)
# Drop the first column, keeping all rows
df = df.iloc[:,1:]

## Creating the `num_players` Column

In this section, we create a new column called `num_players`. It records the number of heroes in each party, derived from the hero class columns (`pc2_class`, `pc3_class`, ..., `pc7_class`). The rule is to look for the first slot marked `"-"`:

- If `pc2_class` is `"-"`, then only one hero is present, and `num_players` is set to 1.
- If `pc2_class` is valid but `pc3_class` is `"-"`, then there are two heroes, and `num_players` is 2.
- The same pattern continues up to `pc7_class`. If none of these columns contains `"-"`, then `num_players` is 7.

The check stops at the first `"-"` it finds, so it assumes party slots are filled in order with no gaps. The derived column gives a direct measure of party size, which is critical for subsequent analysis and modeling of encounter difficulty.

In [4]:
def determine_num_players(row):
    if row["pc2_class"] == "-":
        return 1
    elif row["pc3_class"] == "-":
        return 2
    elif row["pc4_class"] == "-":
        return 3
    elif row["pc5_class"] == "-":
        return 4
    elif row["pc6_class"] == "-":
        return 5
    elif row["pc7_class"] == "-":
        return 6
    else:
        return 7

df["num_players"] = df.apply(determine_num_players, axis=1)
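The same rule can also be expressed without a row-wise `apply`. The following is a minimal sketch on a toy DataFrame, not the notebook's own method; the class names and the `toy` frame are invented for illustration, and the column names are assumed to match the ones described above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real data; class names are made up.
class_cols = [f"pc{i}_class" for i in range(2, 8)]
toy = pd.DataFrame(
    [
        ["-", "-", "-", "-", "-", "-"],
        ["Cleric", "-", "-", "-", "-", "-"],
        ["Cleric", "Rogue", "Wizard", "-", "-", "-"],
        ["Cleric", "Rogue", "Wizard", "Bard", "Druid", "Monk"],
    ],
    columns=class_cols,
)

# Boolean matrix: True where a hero slot is marked as empty.
is_empty = toy[class_cols].eq("-").to_numpy()

# Position of the first empty slot in each row (0 for pc2_class, 1 for pc3_class, ...).
first_empty = is_empty.argmax(axis=1)

# If a row has an empty slot, the party size is that position + 1; otherwise it is 7.
toy["num_players"] = np.where(is_empty.any(axis=1), first_empty + 1, 7)
```

Like the `if`/`elif` chain, this stops at the first `"-"` in each row, so both versions make the same assumption about slots being filled in order.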

## Treating 0 and '-' as Null Values

In this section, we mark placeholder values as missing (null). The columns at positions 10, 20, ..., 130 are checked for `"-"`. Wherever one of them contains `"-"`, that value is replaced with `NaN`, and the next nine columns in the same row are set to `NaN` as well, so any placeholder values stored there (such as `0`) also become null.

Converting these placeholders to null makes it easier to filter, impute, or exclude them during exploratory data analysis and modeling, since they indicate the absence of meaningful data rather than real measurements.

In [5]:
# Column indices to check (0-based)
indices_to_check = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130]
total_cols = df.shape[1]  # total number of columns

# Iterate over the specified indices
for idx in indices_to_check:
    # Get the column name
    check_col = df.columns[idx]

    # Replace '-' with NaN in that column
    df[check_col] = df[check_col].replace("-", np.nan)

    # Build a mask of the rows where that column is NaN
    condition = df[check_col].isna()

    # Range covering the next 9 columns, or up to the end of the DataFrame
    start = idx + 1
    end = min(idx + 10, total_cols)

    # Iterate over the following columns
    for j in range(start, end):
        col_to_update = df.columns[j]

        # Set to NaN where the condition is True; other rows keep their original values
        df.loc[condition, col_to_update] = np.nan
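To see the propagation on a small scale, here is a hedged sketch that applies the same idea to a three-column block instead of a ten-column one. The helper `null_out_blocks`, its parameters, and the toy column names (`pc1_hp`, `pc1_ac`) are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

def null_out_blocks(frame, anchor_indices, block_width):
    """Replace '-' in each anchor column with NaN and null out the rest of its block.

    Hypothetical helper: a block is the anchor column plus the
    block_width - 1 columns that follow it.
    """
    out = frame.copy()
    for idx in anchor_indices:
        anchor = out.columns[idx]
        out[anchor] = out[anchor].replace("-", np.nan)
        empty = out[anchor].isna()
        # Slicing past the last column is safe; the slice is simply shorter.
        dependent = out.columns[idx + 1 : idx + block_width]
        out.loc[empty, dependent] = np.nan
    return out

# Toy data: the second row has no hero, so its stats are placeholder zeros.
toy = pd.DataFrame({
    "pc1_class": ["Fighter", "-"],
    "pc1_hp": [40.0, 0.0],
    "pc1_ac": [18.0, 0.0],
})
cleaned = null_out_blocks(toy, anchor_indices=[0], block_width=3)
```

In this toy frame, the second row ends up entirely null while the first row is untouched. Working on a copy leaves the input frame unchanged, which makes the before and after easy to compare.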


## Removing Columns with All Null Values

In this section, we will eliminate any columns that contain only null values. Removing these columns helps clean the dataset by discarding features that do not provide any useful information for our analysis or modeling.

In [6]:
# Count the nulls in each column
null_counts = df.isnull().sum()

# Total number of rows in the DataFrame
total_rows = len(df)

# Identify the columns that have at least one non-null value
columns_to_keep = [col for col in df.columns if null_counts[col] < total_rows]

# Keep only those columns
df = df[columns_to_keep]
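pandas also offers `DataFrame.dropna(axis=1, how="all")`, which drops columns whose values are all null. The sketch below compares it with the manual count on a toy frame; the frame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: column "b" is entirely null, "a" is only partly null.
toy = pd.DataFrame({
    "a": [1.0, np.nan],
    "b": [np.nan, np.nan],
    "c": ["x", "y"],
})

# Manual approach, mirroring the cell above.
null_counts = toy.isnull().sum()
manual = toy[[col for col in toy.columns if null_counts[col] < len(toy)]]

# Built-in approach: drop columns where every value is null.
compact = toy.dropna(axis=1, how="all")
```

On this toy frame both approaches keep columns `a` and `c` and discard `b`.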

Finally, we save the cleaned DataFrame as `combat_results.csv` in the interim data directory.

In [7]:
df.to_csv(os.path.join(DATA_INTERIM_DIRECTORY, "combat_results.csv"), index=False)
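`to_csv` does not create missing parent directories, so the write fails if the interim directory does not exist yet. A minimal sketch of a defensive write followed by a read-back check is shown below; it uses a temporary directory and a toy frame as stand-ins, so the paths and data are assumptions rather than the project's real ones.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as base_dir:
    # Stand-in for DATA_INTERIM_DIRECTORY; created only if it is missing.
    interim_dir = os.path.join(base_dir, "data", "interim")
    os.makedirs(interim_dir, exist_ok=True)

    out_path = os.path.join(interim_dir, "combat_results.csv")

    # Toy frame standing in for the cleaned data.
    toy = pd.DataFrame({"num_players": [1, 2]})
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm that no extra index column was written.
    reloaded = pd.read_csv(out_path)
```

Writing with `index=False` means the file can be read back without dropping a leading index column.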
