### Constants and lambdas
Constants, lambdas and helper functions used throughout the preprocessing.

In [10]:
columns_to_drop=["Platform", "Other_Sales", "Critic_Count", "User_Count", "Rating", "Developer"]
needed_samples_lambda = lambda x: round((30./x)+1)
range_for_samples_lambda = lambda x: x*0.04 if x*0.04>0.01 else 0.01

def standardize(column):
    print("Max: {} Min: {}".format(column.max(), column.min()))
    stddev = column.std()
    mean = column.mean()
    return column.apply(lambda x: (x-mean)/stddev)
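
To make the two lambdas concrete, the short sketch below evaluates them for a few Global_Sales values (in millions). The sample values are illustrative and not taken from the dataset; the lambda definitions are copied from the cell above.

```python
# Same definitions as in the cell above.
needed_samples_lambda = lambda x: round((30. / x) + 1)
range_for_samples_lambda = lambda x: x * 0.04 if x * 0.04 > 0.01 else 0.01

# Illustrative Global_Sales values (assumed, not from the dataset).
for sales in (0.2, 1.0, 5.0):
    print(sales, needed_samples_lambda(sales), range_for_samples_lambda(sales))
```

Lower sales demand more candidate rows (151 at 0.2m versus 7 at 5m), while the search window is 4% of the sales value with a floor of 0.01.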

### Data loading and preprocessing
Load the data from an Excel file, drop unused columns, replace missing-value indicators (such as "tbd") with NaN,
and convert column types from object to numeric where necessary.

In [11]:
import pandas as pd
import numpy as np

data = pd.read_excel("../data/games_sales_2016_modified.xlsx")
data = data.drop(columns=columns_to_drop)
data = data.replace({'tbd': np.nan})  # mark 'tbd' placeholders as missing
data["Critic_Score"] = pd.to_numeric(data["Critic_Score"])
data["User_Score"] = pd.to_numeric(data["User_Score"])
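
As a small illustration of this cleaning step, the sketch below applies the same replace and to_numeric calls to a toy frame. The frame and its values are invented for the example and do not come from the spreadsheet.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real spreadsheet (values are invented).
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "User_Score": ["8.5", "tbd", "7.1"],
})

# Turn the 'tbd' placeholder into a proper missing value ...
toy = toy.replace({"tbd": np.nan})
# ... so the column can be converted to a numeric dtype.
toy["User_Score"] = pd.to_numeric(toy["User_Score"])

print(toy.dtypes)
print(toy["User_Score"].isna().sum())
```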

### Handling missing values
There are no missing values in the EU_Sales, JP_Sales, NA_Sales and Global_Sales columns.
Some columns have to be filled in manually, while others can be generated (for example, some
critic and user scores).

Remove entries without a release date or without a name.

In [12]:
data = data.drop(data[data["Year_of_Release"].isna()].index)
data = data.drop(data[data["Name"].isna()].index)
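
The same filtering can also be expressed with dropna and its subset parameter. The sketch below uses invented rows and is shown as an alternative, not as what the notebook runs.

```python
import numpy as np
import pandas as pd

# Invented rows: one without a name, one without a release year.
toy = pd.DataFrame({
    "Name": ["A", None, "C", "D"],
    "Year_of_Release": [2010.0, 2011.0, np.nan, 2016.0],
})

# Drop every row where either of the listed columns is missing.
cleaned = toy.dropna(subset=["Year_of_Release", "Name"])
print(cleaned)
```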

Entries without a publisher will be filled in manually.

In [13]:
print(data[data["Publisher"].isna()])

                                                    Name  Year_of_Release  \
483    Moshi, Kono Sekai ni Kami-sama ga Iru to suru ...           2016.0   
530                                    Dance with Devils           2016.0   
1182                                      World of Tanks           2014.0   
3502                                        Stronghold 3           2011.0   
4052                    Demolition Company: Gold Edition           2011.0   
4131                              Driving Simulator 2011           2011.0   
4823                              Farming Simulator 2011           2010.0   
5237                                  UK Truck Simulator           2010.0   
8295                                  Sonic the Hedgehog           2007.0   
8410        Shrek / Shrek 2 2-in-1 Gameboy Advance Video           2007.0   
10522                         wwe Smackdown vs. Raw 2006           2005.0   
10626                                 Bentley's Hackpack           2005.0   

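Generate missing Critic Score values. For each game with global sales in the range [0.2, 8) and no Critic Score,
draw the score at random from a game with similar global sales (within 4% of the value, but at least 0.01),
provided that enough such games exist.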

In [14]:
data_for_global_sales_in_range = data[(data["Global_Sales"]<8) & (data["Global_Sales"]>=0.2)]
for index, values in data_for_global_sales_in_range[data_for_global_sales_in_range["Critic_Score"].isna()].iterrows():
    number_of_needed_samples = needed_samples_lambda(values["Global_Sales"])
    range_for_samples = range_for_samples_lambda(values["Global_Sales"])
    possible_samples = data.drop(data[data["Critic_Score"].isna()].index)
    possible_samples= possible_samples[(values["Global_Sales"] - range_for_samples <= possible_samples["Global_Sales"]) 
         & (values["Global_Sales"] + range_for_samples >= possible_samples["Global_Sales"])]
    if len(possible_samples.index) > number_of_needed_samples:
        # sample() returns a one-row Series; extract the scalar before assigning
        randomized_critic_score = possible_samples.sample()["Critic_Score"].iloc[0]
        data.at[index, "Critic_Score"] = randomized_critic_score

Generate missing User Score values with the same procedure.

In [15]:
data_for_global_sales_in_range = data[(data["Global_Sales"]<8) & (data["Global_Sales"]>=0.2)]
for index, values in data_for_global_sales_in_range[data_for_global_sales_in_range["User_Score"].isna()].iterrows():
    number_of_needed_samples = needed_samples_lambda(values["Global_Sales"])
    range_for_samples = range_for_samples_lambda(values["Global_Sales"])
    possible_samples = data.drop(data[data["User_Score"].isna()].index)
    possible_samples= possible_samples[(values["Global_Sales"] - range_for_samples <= possible_samples["Global_Sales"]) 
         & (values["Global_Sales"] + range_for_samples >= possible_samples["Global_Sales"])]
    if len(possible_samples.index) > number_of_needed_samples:
        # sample() returns a one-row Series; extract the scalar before assigning
        randomized_user_score = possible_samples.sample()["User_Score"].iloc[0]
        data.at[index, "User_Score"] = randomized_user_score
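
The two cells above differ only in the column name, so the procedure could be factored into one helper. The following is a minimal sketch under stated assumptions: the function name impute_from_similar_sales, its parameters, the seeded NumPy generator and the toy data are all hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def needed_samples(sales):
    # Same rule as needed_samples_lambda.
    return round((30.0 / sales) + 1)

def sales_window(sales):
    # Same rule as range_for_samples_lambda: 4% of sales, at least 0.01.
    return max(sales * 0.04, 0.01)

def impute_from_similar_sales(df, column, low=0.2, high=8.0, seed=0):
    """Return a copy of df where missing `column` values are drawn at
    random from rows with similar Global_Sales (hypothetical helper)."""
    result = df.copy()
    rng = np.random.default_rng(seed)
    in_range = result[(result["Global_Sales"] >= low) & (result["Global_Sales"] < high)]
    missing = in_range[in_range[column].isna()]
    for index, sales in missing["Global_Sales"].items():
        window = sales_window(sales)
        # Candidate rows: score present and sales inside the window (inclusive).
        candidates = result.loc[
            result[column].notna()
            & result["Global_Sales"].between(sales - window, sales + window),
            column,
        ]
        # Only impute when there are more candidates than the required minimum.
        if len(candidates) > needed_samples(sales):
            result.at[index, column] = rng.choice(candidates.to_numpy())
    return result

# Invented data: ten scored games around 5m, one unscored game at 5.1m,
# and one unscored game below the 0.2m lower bound.
toy = pd.DataFrame({
    "Global_Sales": [5.0] * 10 + [5.1, 0.05],
    "Critic_Score": [80.0] * 10 + [np.nan, np.nan],
})
filled = impute_from_similar_sales(toy, "Critic_Score")
print(filled.tail(2))
```

In this toy run the game at 5.1m receives a score because ten candidates exceed the required seven, while the game at 0.05m is left untouched because it falls outside the sales range.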

Remove games that lack a Critic Score or a User Score and have global sales below 0.2m.

In [16]:
data = data.drop(data[
    (data["Critic_Score"].isna() | data["User_Score"].isna()) &
    (data["Global_Sales"] < 0.2)
].index)
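
The mask combines two conditions: a missing score in either column, and sales below the threshold. A toy sketch with invented values shows which rows are removed:

```python
import numpy as np
import pandas as pd

# Invented rows covering the four interesting combinations.
toy = pd.DataFrame({
    "Critic_Score": [80.0, np.nan, np.nan, 70.0],
    "User_Score":   [7.5,  8.0,    np.nan, np.nan],
    "Global_Sales": [0.10, 0.10,   9.00,   0.19],
})

missing_any_score = toy["Critic_Score"].isna() | toy["User_Score"].isna()
low_sales = toy["Global_Sales"] < 0.2

# Keep rows unless they are BOTH low-selling and missing a score.
kept = toy[~(missing_any_score & low_sales)]
print(kept.index.tolist())
```

Rows 0 and 2 survive: row 0 has both scores, and row 2 is missing scores but sells well above the threshold.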

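Normalization & Standardization. Critic Score is divided by 100 and User Score by 10 so that both share one scale;
standardization of the sales columns is currently commented out.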

In [17]:
data["Critic_Score"] = data["Critic_Score"] / 100.
data["User_Score"] = data["User_Score"] / 10.

# data["Global_Sales"] = standardize(data["Global_Sales"])
# data["EU_Sales"] = standardize(data["EU_Sales"])
# data["JP_Sales"] = standardize(data["JP_Sales"])
# data["NA_Sales"] = standardize(data["NA_Sales"])
print(data["Critic_Score"])

74       0.85
75       0.85
76       0.93
77       0.77
78       0.88
         ... 
16705    0.51
16706    0.63
16707    0.77
16708    0.69
16709    0.92
Name: Critic_Score, Length: 10452, dtype: float64
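
For reference, the commented-out standardize calls would compute a z-score per column. The sketch below is a vectorized variant on invented sales figures; the name standardize_vectorized is hypothetical, and it relies on the sample standard deviation (ddof=1) that pandas uses by default.

```python
import pandas as pd

def standardize_vectorized(column):
    """Z-score a Series: subtract the mean, divide by the standard deviation."""
    return (column - column.mean()) / column.std()

sales = pd.Series([0.5, 1.0, 1.5, 2.0])  # invented values
z = standardize_vectorized(sales)
print(z.round(3).tolist())
```

After this transformation the column has mean 0 and standard deviation 1.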


Save modified dataset to file

In [18]:
data.to_excel("../data/games_sales_2016_preprocessed.xlsx")

### TODO

- add years for entries with a missing release date and sales greater than or equal to 0.5m - Arek
- add the missing publishers - J
- add Critic Score manually for games with global sales above 8m - J
- add User Score manually for games with global sales above 8m - J
- ask about standardization
- ask about filling in missing critic score/user score values and whether the release year should also be taken into account