## Data Analysis Project - Statistics - Prof. Maria Luísa - ADS III

### Students:
- João Victor Carrijo
- Danilo de Andrade
- Eloísa Rodrigues

### About the project
In this project we analyze a dataset of video game sales.

### Importing Libraries

In [49]:
import pandas as pd # reading the data
import numpy as np # calculations
import plotly.express as plt # data visualization (note: `plt` here is plotly.express, not matplotlib)

### Reading the data and making initial adjustments

In [50]:
# Reading the data
df_games = pd.read_csv('vgsales.csv')
df_games

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
...,...,...,...,...,...,...,...,...,...,...,...
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,Platform,Kemco,0.01,0.00,0.00,0.00,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.00,0.00,0.00,0.00,0.01
16596,16599,Know How 2,DS,2010.0,Puzzle,7G//AMES,0.00,0.01,0.00,0.00,0.01


In [51]:
# Renaming columns
df_games = df_games.rename(
    columns={
        'Name':'Titulo',
        'Platform':'Plataforma',
        'Year':'Ano_lancamento',
        'Genre':'Genero',
        'Publisher':'Distribuidora',
        'NA_Sales':'Vendas_EUA',
        'EU_Sales':'Vendas_Europa',
        'JP_Sales':'Vendas_JP',
        'Other_Sales':'Outras_Vendas',
        'Global_Sales':'Vendas_Totais'
    }
)

### Identifying and handling null values

In [52]:
df_games.info()

# Null values:

# Ano_lancamento -> 271
# Distribuidora -> 58
# Listing the rows where Distribuidora is missing
df_games.loc[df_games['Distribuidora'].isna(), 'Distribuidora']

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Rank            16598 non-null  int64  
 1   Titulo          16598 non-null  object 
 2   Plataforma      16598 non-null  object 
 3   Ano_lancamento  16327 non-null  float64
 4   Genero          16598 non-null  object 
 5   Distribuidora   16540 non-null  object 
 6   Vendas_EUA      16598 non-null  float64
 7   Vendas_Europa   16598 non-null  float64
 8   Vendas_JP       16598 non-null  float64
 9   Outras_Vendas   16598 non-null  float64
 10  Vendas_Totais   16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


470      NaN
1303     NaN
1662     NaN
2222     NaN
3159     NaN
3166     NaN
3766     NaN
4145     NaN
4526     NaN
4635     NaN
5302     NaN
5647     NaN
6272     NaN
6437     NaN
6562     NaN
6648     NaN
6849     NaN
7208     NaN
7351     NaN
7470     NaN
7953     NaN
8330     NaN
8341     NaN
8368     NaN
8503     NaN
8770     NaN
8848     NaN
8896     NaN
9517     NaN
9749     NaN
10382    NaN
10494    NaN
11076    NaN
11526    NaN
12487    NaN
12517    NaN
13278    NaN
13672    NaN
13962    NaN
14087    NaN
14296    NaN
14311    NaN
14698    NaN
14942    NaN
15056    NaN
15261    NaN
15325    NaN
15353    NaN
15788    NaN
15915    NaN
16191    NaN
16198    NaN
16208    NaN
16229    NaN
16367    NaN
16494    NaN
16543    NaN
16553    NaN
Name: Distribuidora, dtype: object
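
The null counts noted in the cell above can also be computed directly instead of being read off `info()`. A minimal sketch, assuming a hypothetical miniature frame (`mini`, with made-up values) that has the same two columns:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the two columns that contain nulls (illustrative values only)
mini = pd.DataFrame({
    'Ano_lancamento': [2006.0, np.nan, 2008.0, np.nan],
    'Distribuidora': ['Nintendo', 'Capcom', None, 'Activision'],
})

# isna() marks missing cells; sum() counts them per column
nulos_por_coluna = mini.isna().sum()
print(nulos_por_coluna)
```

Applied to the real frame, `df_games.isna().sum()` should report the per-column counts that were derived by hand from the `info()` output.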

In this dataset, some values about the games were not recorded or were never disclosed.
One common way to handle missing values of this kind is to fill them with the column mean, so that the gaps do not get in the way of the analysis.
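
As a minimal sketch of this imputation step, assuming a small made-up column of release years (the values in `anos` are illustrative, not taken from `vgsales.csv`), the mean fill can be compared with a median fill, which is a frequently used alternative when a column is skewed:

```python
import numpy as np
import pandas as pd

# Hypothetical release years with two missing entries (illustrative values only)
anos = pd.Series([1985.0, 2006.0, 2008.0, np.nan, 2009.0, np.nan, 1996.0])

media = np.round(anos.mean())   # rounded mean of the non-null values
mediana = anos.median()         # median of the non-null values

# fillna returns a new Series; the original is left untouched
preenchido_media = anos.fillna(media)
preenchido_mediana = anos.fillna(mediana)

print(media, mediana)
print(preenchido_media.isna().sum(), preenchido_mediana.isna().sum())
```

Both fills remove the nulls; they differ only in the value substituted, and the mean is pulled further toward the older release years than the median is.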

In [53]:
# Computing the (rounded) mean release year and using it to fill the missing years
media_ano = np.round(df_games['Ano_lancamento'].mean())
media_ano
df_games['Ano_lancamento'] = df_games['Ano_lancamento'].fillna(media_ano)

In [54]:
df_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Rank            16598 non-null  int64  
 1   Titulo          16598 non-null  object 
 2   Plataforma      16598 non-null  object 
 3   Ano_lancamento  16598 non-null  float64
 4   Genero          16598 non-null  object 
 5   Distribuidora   16540 non-null  object 
 6   Vendas_EUA      16598 non-null  float64
 7   Vendas_Europa   16598 non-null  float64
 8   Vendas_JP       16598 non-null  float64
 9   Outras_Vendas   16598 non-null  float64
 10  Vendas_Totais   16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


### Grouping sales of the same game across multiple platforms

In [55]:
# Summing rows that share the same title
top_vendas = np.round(df_games.groupby(['Titulo'])['Vendas_Totais'].sum().sort_values(ascending=False),2)
top_vendas.to_csv('./top_vendas.csv')
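
To illustrate what this grouping does, here is a small sketch using the four rows for one title that appears on four platforms in this dataset. The `resumo` frame and its `Plataformas` column are names introduced only for this example:

```python
import pandas as pd

# Rows for one title released on four platforms, copied from the dataset
gta = pd.DataFrame({
    'Titulo': ['Grand Theft Auto: San Andreas'] * 4,
    'Plataforma': ['PS2', 'XB', 'PC', 'X360'],
    'Vendas_Totais': [20.81, 1.95, 0.98, 0.12],
})

# Named aggregation: total sales and number of distinct platforms per title
resumo = (
    gta.groupby('Titulo')
       .agg(Vendas_Totais=('Vendas_Totais', 'sum'),
            Plataformas=('Plataforma', 'nunique'))
       .round(2)
)
print(resumo)
```

The four platform rows collapse into a single row per title, which is the same effect the `groupby(['Titulo'])` call has on the full frame.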

### Basic Visualization

In [56]:
# Descriptive statistics of the numeric columns
df_games.describe()

# cat = 'Vendas_Europa'
# Skewness check
# print(df_games[cat].mean(), df_games[cat].mode(), df_games[cat].median())

Unnamed: 0,Rank,Ano_lancamento,Vendas_EUA,Vendas_Europa,Vendas_JP,Outras_Vendas,Vendas_Totais
count,16598.0,16598.0,16598.0,16598.0,16598.0,16598.0,16598.0
mean,8300.605254,2006.399807,0.264667,0.146652,0.077782,0.048063,0.537441
std,4791.853933,5.781426,0.816683,0.505351,0.309291,0.188588,1.555028
min,1.0,1980.0,0.0,0.0,0.0,0.0,0.01
25%,4151.25,2003.0,0.0,0.0,0.0,0.0,0.06
50%,8300.5,2007.0,0.08,0.02,0.0,0.01,0.17
75%,12449.75,2010.0,0.24,0.11,0.04,0.04,0.47
max,16600.0,2020.0,41.49,29.02,10.22,10.57,82.74


In [57]:
print(np.round(df_games['Vendas_Totais'].mean(),2), df_games['Vendas_Totais'].mode(), df_games['Vendas_Totais'].median())

0.54 0    0.02
Name: Vendas_Totais, dtype: float64 0.17
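
The ordering printed above (mode 0.02 < median 0.17 < mean 0.54) suggests a right-skewed distribution. As a rough sketch, the skew can be quantified with Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, using the summary values from the `describe()` output. This coefficient is an approximation chosen for illustration, not something the notebook itself computes:

```python
# Summary statistics for Vendas_Totais, taken from the describe() output
media = 0.537441
mediana = 0.17
desvio_padrao = 1.555028

# Pearson's second skewness coefficient: positive values indicate a right-skewed distribution
assimetria_pearson = 3 * (media - mediana) / desvio_padrao
print(round(assimetria_pearson, 2))
```

On the full frame, `df_games['Vendas_Totais'].skew()` returns the moment-based sample skewness, which is a different estimator and will generally give a different number.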


In [58]:
# Showing the game with the highest total sales (82.74, the column maximum)
mais_vendas = df_games['Vendas_Totais'] == df_games['Vendas_Totais'].max()
df_games[mais_vendas]

Unnamed: 0,Rank,Titulo,Plataforma,Ano_lancamento,Genero,Distribuidora,Vendas_EUA,Vendas_Europa,Vendas_JP,Outras_Vendas,Vendas_Totais
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
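
The top seller can also be located without knowing its sales figure in advance. A minimal sketch on three rows copied from the dataset, where `amostra`, `campeao` and `top2` are names used only for this example:

```python
import pandas as pd

# Three rows copied from the dataset
amostra = pd.DataFrame({
    'Titulo': ['Wii Sports', 'Super Mario Bros.', 'Mario Kart Wii'],
    'Vendas_Totais': [82.74, 40.24, 35.82],
})

# idxmax returns the index label of the maximum value
campeao = amostra.loc[amostra['Vendas_Totais'].idxmax(), 'Titulo']

# nlargest returns the n rows with the largest values, already sorted
top2 = amostra.nlargest(2, 'Vendas_Totais')

print(campeao)
print(top2)
```

`nlargest` is convenient when more than one row is wanted, for example a top-10 list.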


In [59]:
# Showing the game(s) with the lowest US sales (zero recorded sales in the NA_Sales column)
pior_avaliacao = df_games['Vendas_EUA'] <= 0.0
df_games[pior_avaliacao]

Unnamed: 0,Rank,Titulo,Plataforma,Ano_lancamento,Genero,Distribuidora,Vendas_EUA,Vendas_Europa,Vendas_JP,Outras_Vendas,Vendas_Totais
214,215,Monster Hunter Freedom 3,PSP,2010.0,Role-Playing,Capcom,0.0,0.00,4.87,0.00,4.87
338,339,Friend Collection,DS,2009.0,Misc,Nintendo,0.0,0.00,3.67,0.00,3.67
383,384,Monster Hunter 4,3DS,2013.0,Role-Playing,Capcom,0.0,0.00,3.44,0.00,3.44
402,403,English Training: Have Fun Improving Your Skills!,DS,2006.0,Misc,Nintendo,0.0,0.99,2.32,0.02,3.33
426,427,Dragon Quest VI: Maboroshi no Daichi,SNES,1995.0,Role-Playing,Enix Corporation,0.0,0.00,3.19,0.00,3.19
...,...,...,...,...,...,...,...,...,...,...,...
16587,16590,Mezase!! Tsuri Master DS,DS,2009.0,Sports,Hudson Soft,0.0,0.00,0.01,0.00,0.01
16589,16592,Chou Ezaru wa Akai Hana: Koi wa Tsuki ni Shiru...,PSV,2016.0,Action,dramatic create,0.0,0.00,0.01,0.00,0.01
16590,16593,Eiyuu Densetsu: Sora no Kiseki Material Collec...,PSP,2007.0,Role-Playing,Falcom Corporation,0.0,0.00,0.01,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.0,0.00,0.00,0.00,0.01


In [60]:
# Returning the unique platform values
np.unique(df_games['Plataforma'])

array(['2600', '3DO', '3DS', 'DC', 'DS', 'GB', 'GBA', 'GC', 'GEN', 'GG',
       'N64', 'NES', 'NG', 'PC', 'PCFX', 'PS', 'PS2', 'PS3', 'PS4', 'PSP',
       'PSV', 'SAT', 'SCD', 'SNES', 'TG16', 'WS', 'Wii', 'WiiU', 'X360',
       'XB', 'XOne'], dtype=object)

In [61]:
# Searching for a specific title
jogo = df_games['Titulo'] == 'Grand Theft Auto: San Andreas'
df_games[jogo]


Unnamed: 0,Rank,Titulo,Plataforma,Ano_lancamento,Genero,Distribuidora,Vendas_EUA,Vendas_Europa,Vendas_JP,Outras_Vendas,Vendas_Totais
17,18,Grand Theft Auto: San Andreas,PS2,2004.0,Action,Take-Two Interactive,9.43,0.4,0.41,10.57,20.81
873,875,Grand Theft Auto: San Andreas,XB,2005.0,Action,Take-Two Interactive,1.26,0.61,0.0,0.09,1.95
2120,2122,Grand Theft Auto: San Andreas,PC,2005.0,Action,Take-Two Interactive,0.0,0.92,0.0,0.05,0.98
9827,9829,Grand Theft Auto: San Andreas,X360,2008.0,Action,Take-Two Interactive,0.08,0.03,0.0,0.01,0.12



In [62]:
# For the copies-sold columns, NaN fields are filled with 0, on the assumption that the vast majority of
# rows with NaN are free games. Per df_games.info() above, these columns have no nulls here, so this is only a safeguard.
colunas_vendas = ['Vendas_Totais', 'Vendas_JP', 'Vendas_Europa', 'Vendas_EUA', 'Outras_Vendas']
df_games[colunas_vendas] = df_games[colunas_vendas].fillna(0.0)
df_games

### Visualizing the dispersion

Now that we have resolved some of the dataset's problems, let's look at a few plots to check whether any outliers remain.
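
Alongside the plots, a numeric rule of thumb can flag candidate outliers. The sketch below applies Tukey's 1.5 × IQR fences to the `Vendas_Totais` quartiles reported by `describe()`. The 1.5 multiplier is a convention assumed for this example, not something the notebook prescribes:

```python
# Quartiles of Vendas_Totais, taken from the describe() output
q1 = 0.06
q3 = 0.47

# Interquartile range and Tukey's fences
iqr = q3 - q1
limite_superior = q3 + 1.5 * iqr   # values above this are flagged as candidate outliers
limite_inferior = q1 - 1.5 * iqr   # negative here, so no sales figure can fall below it

print(round(limite_superior, 3), round(limite_inferior, 3))
```

On the full frame, `df_games[df_games['Vendas_Totais'] > limite_superior]` would select the flagged rows. Since the column maximum is 82.74, the best-selling titles sit far above this fence.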

In [64]:
# pandas 2.0 removed DataFrame.iteritems, which older plotly versions still call.
# Restoring the old name as an alias of DataFrame.items works around this incompatibility between the two libraries.
pd.DataFrame.iteritems = pd.DataFrame.items

# Total sales x Japan sales x Europe sales x US sales
plt.scatter_matrix(df_games, dimensions=['Vendas_Totais','Vendas_JP','Vendas_Europa', 'Vendas_EUA'])

In [35]:
# Europe (PAL) x US x Japan sales
plt.scatter_matrix(df_games, dimensions=['Vendas_Europa','Vendas_EUA', 'Vendas_JP'])

### Preliminary Insights

In [65]:
# Genre by console
plt.treemap(df_games, path=['Genero', 'Plataforma'])

In [66]:
# Genre by total sales
plt.pie(df_games,names='Genero',values='Vendas_Totais')

In [81]:
# Total sales by console
plt.bar(df_games,x='Plataforma',y='Vendas_Totais')
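
When raw rows are passed to a bar chart, each game typically becomes its own stacked segment. A possible alternative, sketched here on a handful of rows copied from the dataset, is to sum sales per platform first and plot the aggregated frame. The names `amostra` and `vendas_por_plataforma` are introduced only for this example:

```python
import pandas as pd

# A few rows copied from the dataset
amostra = pd.DataFrame({
    'Plataforma': ['Wii', 'NES', 'Wii', 'Wii', 'GB'],
    'Vendas_Totais': [82.74, 40.24, 35.82, 33.00, 31.37],
})

# One row per platform, ordered from highest to lowest total sales
vendas_por_plataforma = (
    amostra.groupby('Plataforma', as_index=False)['Vendas_Totais']
           .sum()
           .sort_values('Vendas_Totais', ascending=False)
)
print(vendas_por_plataforma)
```

The aggregated frame could then be passed to `plt.bar(vendas_por_plataforma, x='Plataforma', y='Vendas_Totais')`, using the `plt` alias for `plotly.express` imported at the top of the notebook.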
