## Step 1: Import libraries, load and explore the data

In [6]:
import pandas as pd
# import matplotlib.pyplot as plt
import numpy as np
# import seaborn as sns
# from scipy import stats

In [7]:
df = pd.read_csv('../data/videogame_dataset.csv')
df.info()
df.sample(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
9126,Arctic Tale,DS,2007.0,Adventure,0.13,0.0,0.0,0.01,38.0,tbd,E
5365,Iron Storm,SAT,1995.0,Shooter,0.0,0.0,0.34,0.0,,,
15942,Winning Post 7: Maximum 2007,PS3,2007.0,Sports,0.0,0.0,0.02,0.0,,,
8707,Driven,PS2,2001.0,Racing,0.08,0.06,0.0,0.02,38.0,3.4,T
9757,Thor: God of Thunder,DS,2011.0,Action,0.08,0.03,0.0,0.01,64.0,7.2,E10+
9798,Rise of the Tomb Raider,PC,2016.0,Adventure,0.0,0.11,0.0,0.01,86.0,7.9,M
9455,Legendary,PS3,2008.0,Shooter,0.08,0.03,0.0,0.02,50.0,7.1,M
13748,Chibi Maruko-Chan DS: Maru-Chan no Machi,DS,2009.0,Puzzle,0.0,0.0,0.04,0.0,,,
5606,Rayman 2: The Great Escape,PS,2000.0,Platform,0.18,0.12,0.0,0.02,,,
12858,Gothic 3,PC,2006.0,Role-Playing,0.0,0.05,0.0,0.01,63.0,7.3,T


### Certain:
- rename all columns to lowercase with .str.lower() [✓] => To simplify the analysis and standardize the names.
- rename column other_sales -> rest_of_world_sales or row_sales [✓]
- dtype alterations -> Year_of_Release from float to int [✓] => The year_of_release column has few missing values (269 of 16,715 rows) and they are not essential to the analysis, so I dropped only the rows with NaN in that column. That let me convert the years to integers without compromising the rest of the dataset.
- dtype alterations -> user_score from object to float [✓] => Since 'tbd' and NaN mean the same thing (no user score available) and together account for roughly half of the values in user_score, I replaced 'tbd' with NaN and converted the column to float. I had planned to fill every NaN with -1 as a placeholder, but the cleaning cell keeps them as NaN (user_score still shows missing values afterwards). I expect this column to be useful later in the analysis.
- create a 'global_sales' column with the sum of sales across all regions
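
The `user_score` decision can be checked on a small example. The sketch below uses a made-up `toy` frame (not the real dataset) to count how many entries are `'tbd'` or NaN, then converts the column with `pd.to_numeric(errors="coerce")`. This is one possible alternative to a `where`-based replacement, shown here only as an illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real user_score column (values are made up).
toy = pd.DataFrame({"user_score": ["8.0", "tbd", np.nan, "7.2", "tbd", "3.4"]})

# Entries without a usable score: the literal string 'tbd' or NaN.
is_tbd = toy["user_score"].eq("tbd")
is_nan = toy["user_score"].isna()
missing_share = (is_tbd | is_nan).mean()

# errors="coerce" turns anything non-numeric (here 'tbd') into NaN.
toy["user_score"] = pd.to_numeric(toy["user_score"], errors="coerce")

print(f"tbd: {is_tbd.sum()}, NaN: {is_nan.sum()}, without score: {missing_share:.0%}")
print(toy["user_score"].dtype)
```

Running the same counts on the real column before converting it would show how much of the "roughly half" comes from 'tbd' and how much from NaN.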

### Maybe:
- rating column (Entertainment Software Rating Board) [✓] -> keep the values as they are, but use a descriptive dictionary when generating the charts (cell below)
- unit of the sales columns -> millions of US dollars => keep this unit to avoid very large numbers; change it later when generating charts if necessary.

### Re-evaluate:
- score columns -> convert user to a 0-100 scale or critic to a 0-10 scale => hold off for now until I see how these values will be used later
- analyze the number of null values in the score and rating columns -> drop? keep? fill with mean/median? => I decided to leave them alone for now
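
Both open questions above can be explored cheaply before committing to a choice. The following is a minimal sketch on made-up rows: it puts user scores on the 0-100 scale in a separate column (the name `user_score_100` is hypothetical) and reports the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): critic scores on 0-100, user scores on 0-10.
toy = pd.DataFrame({
    "critic_score": [76.0, np.nan, 82.0, 38.0],
    "user_score": [8.0, np.nan, 8.3, 3.4],
})

# Scale user scores to 0-100 in a new column; NaN stays NaN.
toy["user_score_100"] = toy["user_score"] * 10

# Share of missing values per column, to inform drop/keep/fill decisions.
null_share = toy[["critic_score", "user_score"]].isna().mean()

print(toy)
print(null_share)
```

Keeping the rescaled values in a new column leaves the original scores untouched, so the decision is easy to reverse.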

In [8]:
# # RATING COLUMN
# rating_labels = {
#     "EC": "Early Childhood",
#     "E": "Everyone",
#     "E10+": "Everyone 10+",
#     "T": "Teen",
#     "M": "Mature 17+",
#     "AO": "Adults Only 18+",
#     "RP": "Rating Pending",
#     "K-A": "Kids to Adults"    
# }

# df['rating_clean'] = df['rating'].fillna("Unknown")
# df['rating_label'] = df['rating'].map(rating_labels)
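
If the commented-out cell is enabled later, two details are worth noting: at this point `df` still has the capitalized `Rating` column, and mapping `rating` instead of `rating_clean` would leave missing ratings without a label. A minimal sketch on made-up data, assuming the lowercase column name used after cleaning:

```python
import numpy as np
import pandas as pd

# ESRB codes and their descriptive labels, as in the commented-out cell.
rating_labels = {
    "EC": "Early Childhood",
    "E": "Everyone",
    "E10+": "Everyone 10+",
    "T": "Teen",
    "M": "Mature 17+",
    "AO": "Adults Only 18+",
    "RP": "Rating Pending",
    "K-A": "Kids to Adults",
}

# Toy column (made up) that includes a missing rating.
toy = pd.DataFrame({"rating": ["E", "M", np.nan, "K-A"]})

# Fill missing codes first, then map the filled column. "Unknown" is not a
# key in the dictionary, so the final fillna restores it after the mapping.
toy["rating_clean"] = toy["rating"].fillna("Unknown")
toy["rating_label"] = toy["rating_clean"].map(rating_labels).fillna("Unknown")

print(toy)
```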

In [9]:
game_df = df.copy()
# lowercase column names
game_df.columns = game_df.columns.str.lower()
# rename the other_sales column
game_df = game_df.rename(columns={"other_sales": "row_sales"})
# drop rows with a missing year and convert year_of_release to int
game_df = game_df.dropna(subset=["year_of_release"]).reset_index(drop=True)
game_df["year_of_release"] = game_df["year_of_release"].astype("int")
# replace 'tbd' with NaN and convert user_score to float
game_df["user_score"] = game_df["user_score"].where(game_df["user_score"] != 'tbd', np.nan)
game_df["user_score"] = game_df["user_score"].astype("float")
# create a column with total sales per game
game_df['global_sales'] = game_df[["na_sales", "eu_sales", "jp_sales", "row_sales"]].sum(axis=1)
# reorder columns so that all sales columns sit side by side
cols = list(game_df.columns)
row_index = cols.index('row_sales')
col_to_move = cols.pop(cols.index('global_sales'))
cols.insert(row_index + 1, col_to_move)
game_df = game_df[cols]

game_df.head()


Unnamed: 0,name,platform,year_of_release,genre,na_sales,eu_sales,jp_sales,row_sales,global_sales,critic_score,user_score,rating
0,Wii Sports,Wii,2006,Sports,41.36,28.96,3.77,8.45,82.54,76.0,8.0,E
1,Super Mario Bros.,NES,1985,Platform,29.08,3.58,6.81,0.77,40.24,,,
2,Mario Kart Wii,Wii,2008,Racing,15.68,12.76,3.79,3.29,35.52,82.0,8.3,E
3,Wii Sports Resort,Wii,2009,Sports,15.61,10.93,3.28,2.95,32.77,80.0,8.0,E
4,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,11.27,8.89,10.22,1.0,31.38,,,


In [10]:
game_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16446 entries, 0 to 16445
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16444 non-null  object 
 1   platform         16446 non-null  object 
 2   year_of_release  16446 non-null  int64  
 3   genre            16444 non-null  object 
 4   na_sales         16446 non-null  float64
 5   eu_sales         16446 non-null  float64
 6   jp_sales         16446 non-null  float64
 7   row_sales        16446 non-null  float64
 8   global_sales     16446 non-null  float64
 9   critic_score     7983 non-null   float64
 10  user_score       7463 non-null   float64
 11  rating           9768 non-null   object 
dtypes: float64(7), int64(1), object(4)
memory usage: 1.5+ MB
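
A few assertions can confirm that the cleaning behaves as the notes describe. The sketch below is an assumption-laden illustration: `clean_games` is a hypothetical helper that mirrors the cleaning steps (using `pd.to_numeric` for the 'tbd' handling), and `raw` is a made-up three-row frame, not the real dataset.

```python
import numpy as np
import pandas as pd

def clean_games(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the notebook's cleaning steps."""
    out = raw.copy()
    out.columns = out.columns.str.lower()
    out = out.rename(columns={"other_sales": "row_sales"})
    # Drop rows without a release year, then store years as integers.
    out = out.dropna(subset=["year_of_release"]).reset_index(drop=True)
    out["year_of_release"] = out["year_of_release"].astype("int")
    # 'tbd' becomes NaN so the column can hold floats.
    out["user_score"] = pd.to_numeric(out["user_score"], errors="coerce")
    sales_cols = ["na_sales", "eu_sales", "jp_sales", "row_sales"]
    out["global_sales"] = out[sales_cols].sum(axis=1)
    return out

# Made-up rows: one missing year, two 'tbd' user scores.
raw = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Year_of_Release": [2006.0, np.nan, 2011.0],
    "NA_sales": [1.0, 0.5, 0.08],
    "EU_sales": [0.5, 0.1, 0.03],
    "JP_sales": [0.25, 0.0, 0.0],
    "Other_sales": [0.25, 0.0, 0.01],
    "User_Score": ["8.0", "tbd", "tbd"],
})

clean = clean_games(raw)

# Sanity checks on the cleaned frame.
assert len(clean) == 2                              # row without a year dropped
assert clean["year_of_release"].dtype.kind == "i"   # integer years
assert clean["user_score"].dtype == "float64"       # numeric scores
assert np.isclose(clean.loc[0, "global_sales"], 2.0)
print(clean)
```

The same assertions, applied to `game_df`, would catch a silent regression if the cleaning cell is edited later.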


## Step 2: Prepare the data for analysis

## Step 3: Analyze the data

## Step 4: Build a profile for each region and analyze them

## Step 5: Test the hypotheses

## Step 6: General conclusion