In [1]:
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np
from common.standardization import standardize, de_standardize

### Data loading and preprocessing
Load data from excel file, drop unused columns, replace missing values indicators (like tbd) with NaNs,
convert column types from object to numeric if necessary.

In [2]:
columns_to_drop=["Other_Sales", "Critic_Count", "User_Count", "Rating", "Developer", "Publisher"]

data = pd.read_excel("../data/games_sales_2016_modified.xlsx", index_col=0)
data = data.drop(columns=columns_to_drop)
data = data.replace({'tbd': np.NaN})
data["Critic_Score"] = pd.to_numeric(data["Critic_Score"])
data["User_Score"] = pd.to_numeric(data["User_Score"])

### Basic statistics before missing values handling
Get basic statistics about the dataset before handling missing values:
 * total number of rows
 * number of missing values in each column
 * values like mean, max, min

In [3]:
print("Number of rows: {}".format(data.shape[0]))
print("\nNumber of NaNs in each column: \n{}".format(data.isna().sum()))

columns_to_get_stats_from = ["Global_Sales", "User_Score", "Critic_Score",
                             "EU_Sales", "NA_Sales", "JP_Sales", "Year_of_Release"]
stats_to_compute = ["max", "min", "std", "mean", "median"]
columns_stats = data.agg({item: stats_to_compute for item in columns_to_get_stats_from})
print("\nDetailed stats: \n{}".format(columns_stats))

Number of rows: 16710

Number of NaNs in each column: 
Name                  2
Platform              0
Year_of_Release      74
Genre                 2
NA_Sales              0
EU_Sales              0
JP_Sales              0
Global_Sales          0
Critic_Score       8544
User_Score         9085
dtype: int64

Detailed stats: 
        Global_Sales  User_Score  Critic_Score   EU_Sales   NA_Sales  \
max        82.530000    9.700000     99.000000  28.960000  41.360000   
min         0.010000    0.000000     13.000000   0.000000   0.000000   
std         1.548316    1.501310     13.951209   0.503408   0.813713   
mean        0.533778    7.129456     69.022410   0.145100   0.263459   
median      0.170000    7.500000     71.000000   0.020000   0.080000   

         JP_Sales  Year_of_Release  
max     10.220000      2016.000000  
min      0.000000      1977.000000  
std      0.308895         5.912285  
mean     0.077609      2006.463092  
median   0.000000      2007.000000  


### Handling missing values
There are no missing values in EU_Sales, JP_Sales, NA_Sales and Global_Sales columns.
Some columns should be filled manually, while others can be generated (for example some
critic and user scores).

Manual missing and strange values handling:
 * over 100 records with `Year_of_Release` have been filled (there are still 74 left, which will be discarded)
 * records with strange names like **Luminous Arc 2 (JP sales)** have been removed (at least 8)
 * **Brothers in Arms: Furious 4** record has been removed because this game has been cancelled
 * **Imagine: Makeup Artist** had `Year_of_Release` set to 2020 - changed to proper 2009
 * missing `Critic_Score` and `User_Score` for important games (like **Super Mario**) filled manually, using GameSpot data
 * missing values in `Published` column for important games filled manually, but this column has been discarded

Remove entries without release date or without name

In [4]:
data = data.drop(data[data["Year_of_Release"].isna()].index)
data = data.drop(data[data["Name"].isna()].index)

Remove games without Critic or User Score with global sales below 0.2m:
 * this is done because for small global sales there is a lot of missing data
 * filling it would probably have a negative impact on further analysis results

In [5]:
data = data.drop(data[
    (data["Critic_Score"].isna() | data["User_Score"].isna()) &
    (data["Global_Sales"] < 0.2)
].index)

Standardize data before proceeding further into missing values generation

In [6]:
columns_to_standardize = ["Global_Sales", "User_Score", "Critic_Score",
                          "EU_Sales", "NA_Sales", "JP_Sales", "Year_of_Release"]
standardize(data, columns_to_standardize, columns_stats)
print(data)

                                            Name Platform  Year_of_Release  \
5140                                  Wii Sports      Wii        -0.078327   
16561                          Super Mario Bros.      NES        -3.630253   
5899                              Mario Kart Wii      Wii         0.259952   
5100                           Wii Sports Resort      Wii         0.429091   
15741                   Pokemon Red/Pokemon Blue       GB        -1.769720   
...                                          ...      ...              ...   
7553   Greg Hastings' Tournament Paintball Max'd      PS2        -0.078327   
7569                                     Deus Ex       PC        -1.093163   
7601                   Monster Rancher Advance 2      GBA        -0.754884   
7605                               Karnaaj Rally      GBA        -0.585745   
7606                 Wade Hixton's Counter Punch      GBA        -0.416606   

              Genre   NA_Sales   EU_Sales   JP_Sales  Global_Sa

Generate missing User Score and Critic Score values (using KNN method)

In [7]:
data = data.reset_index(drop=True)
columns_to_use = ["Global_Sales", "Year_of_Release", "User_Score", "Critic_Score",
                  "EU_Sales", "NA_Sales", "JP_Sales", "Year_of_Release"]
missing_values_generator_df = data[columns_to_use]
missing_values_handler = KNNImputer(n_neighbors=10)
data[columns_to_use] = pd.DataFrame(missing_values_handler.fit_transform(missing_values_generator_df),
                                    columns=columns_to_use)

De-standardize data after generating missing values using KNN

In [8]:
de_standardize(data, columns_to_standardize, columns_stats)

Display statistics after handling missing values

In [9]:
print("Number of rows: {}".format(data.shape[0]))
print("\nNumber of NaNs in each column: \n{}".format(data.isna().sum()))
columns_stats = data.agg({item: stats_to_compute for item in columns_to_get_stats_from})
print("\nDetailed stats: \n{}".format(columns_stats))

Number of rows: 10452

Number of NaNs in each column: 
Name               0
Platform           0
Year_of_Release    0
Genre              0
NA_Sales           0
EU_Sales           0
JP_Sales           0
Global_Sales       0
Critic_Score       0
User_Score         0
dtype: int64

Detailed stats: 
        Global_Sales    User_Score  Critic_Score   EU_Sales   NA_Sales  \
max        82.530000  9.600000e+00     99.000000  28.960000  41.360000   
min         0.010000  8.881784e-16     13.000000   0.000000   0.000000   
std         1.904295  1.302502e+00     12.449681   0.622661   1.002670   
mean        0.810365  7.270068e+00     70.942011   0.225307   0.402908   
median      0.360000  7.500000e+00     72.400000   0.070000   0.170000   

         JP_Sales  Year_of_Release  
max     10.220000      2016.000000  
min      0.000000      1977.000000  
std      0.385861         6.070377  
mean     0.108312      2006.003636  
median   0.000000      2007.000000  


Save modified dataset to file

In [None]:
print(data)
data.to_excel("../data/games_sales_2016_preprocessed.xlsx")

                                            Name Platform  Year_of_Release  \
0                                     Wii Sports      Wii           2006.0   
1                              Super Mario Bros.      NES           1985.0   
2                                 Mario Kart Wii      Wii           2008.0   
3                              Wii Sports Resort      Wii           2009.0   
4                       Pokemon Red/Pokemon Blue       GB           1996.0   
...                                          ...      ...              ...   
10447  Greg Hastings' Tournament Paintball Max'd      PS2           2006.0   
10448                                    Deus Ex       PC           2000.0   
10449                  Monster Rancher Advance 2      GBA           2002.0   
10450                              Karnaaj Rally      GBA           2003.0   
10451                Wade Hixton's Counter Punch      GBA           2004.0   

              Genre  NA_Sales  EU_Sales  JP_Sales  Global_Sales

### TODO:
 - wstepna analiza danych - uporzadkować (np. ilosc z danego gatunku, roku, platoformy itp, suma sprzedaży na rok) - **Jarek**
 - outliers (analiza, kalibracja itd) 1D + 2D - **Jarek**
 - klasteryzacja (analiza, kalibracja, przekroje np. tylko dla nowych konsol) - **Arek**
 - klasyfikacja - **Arek**