# Clustering

Clustering techniques are very useful for extracting *insights* from the available data. Even without exact labels for the data, we can often draw interesting and revealing conclusions.
<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_linkage_comparison_0011.png" alt="Source: Scikit-learn" width="500"/>

We will therefore use the [**Cartola FC**](https://globoesporte.globo.com/cartola-fc/) data available [here](https://github.com/henriquepgomide/caRtola) to try to group the players. The data contain the **Cartola FC** values for the years 2014 through 2017.

**We will use the version of the data in that repository that has already been cleaned and is ready for analysis.**

## Data Description

According to the [data source](https://github.com/henriquepgomide/caRtola/blob/master/data/README.md), there are **17567** instances and **77** features in total. Each row of the dataset corresponds to one round played by one player, and the columns hold that player's features for that round. For a complete view of all 77 features, see the source; some of the attributes are listed below.

| Feature         | Description                    |
|-----------------|--------------------------------|
| Pontos          | player's score                 |
| PontosMedia     | player's average score         |
| Preco           | player's price                 |
| ...             | ...                            |
| FS              | fouls suffered                 |
| PE              | missed passes                  |
| A               | assists                        |
| FT              | shots off the woodwork         |
| FD              | shots saved                    |
| FF              | shots off target               |
| G               | goals                          |
| ano             | year of the data               |
| ...             | ...                            |

## Visualizing the Data
First, let's import the data and take a look at it.

In [1]:
# imports
import pandas as pd 
data = pd.read_csv('dados_agregados_limpos.csv')
# brief look at the data
print(data.shape)
data.head(10)

(17567, 77)


Unnamed: 0,A,Apelido,AtletaID,CA,CV,ClubeID,DD,DP,FC,FD,...,away.score.x,dia,home.attack,home.defend,home.score.x,mes,pred.away.score,pred.home.score,risk_points,variable
0,0.0,Juan,36540,1.0,0.0,Internacional,0.0,0.0,2.0,0.0,...,0.0,19.0,0.0,0.0,1.0,4.0,0.0,0.0,1.0,home.team
1,0.0,Juan,36540,0.0,0.0,Internacional,0.0,0.0,2.0,1.0,...,2.0,27.0,0.0,0.0,2.0,4.0,0.0,0.0,1.0,away.team
2,0.0,Juan,36540,1.0,0.0,Internacional,0.0,0.0,2.0,0.0,...,1.0,4.0,0.0,0.0,2.0,5.0,0.0,0.0,1.0,home.team
3,0.0,Juan,36540,0.0,0.0,Internacional,0.0,0.0,0.0,1.0,...,1.0,10.0,0.0,0.0,2.0,5.0,0.0,0.0,1.0,home.team
4,0.0,Juan,36540,1.0,0.0,Internacional,0.0,0.0,5.0,0.0,...,0.0,18.0,0.0,0.0,0.0,5.0,0.0,0.0,1.0,away.team
5,0.0,Juan,36540,0.0,0.0,Internacional,0.0,0.0,0.0,0.0,...,3.0,25.0,0.0,0.0,1.0,5.0,0.0,0.0,1.0,home.team
6,0.0,Juan,36540,0.0,0.0,Internacional,0.0,0.0,1.0,0.0,...,0.0,28.0,0.0,0.0,2.0,5.0,0.0,0.0,1.0,home.team
7,0.0,Juan,36540,0.0,0.0,Internacional,0.0,0.0,1.0,1.0,...,1.0,1.0,0.0,0.0,1.0,6.0,0.0,0.0,3.226298,away.team
8,0.0,Juan,36540,0.0,0.0,Internacional,0.0,0.0,0.0,0.0,...,1.0,17.0,0.0,0.0,2.0,7.0,0.0,0.0,3.195153,away.team
9,1.0,Juan,36540,0.0,0.0,Internacional,0.0,0.0,0.0,0.0,...,0.0,20.0,0.0,0.0,4.0,7.0,0.0,0.0,3.97158,home.team


Indeed, we can see that there are 77 attributes and 17567 instances.

## Analyses
To carry out our analysis, we need some questions to answer. The questions are posed in the topics that follow.

Two analyses will be performed on these data:
- Ideal Position
- Best Teams

### Ideal Position

Suppose a coach is unsure which position a given player should be fielded in. We therefore want to know: **"What is a player's ideal position according to their statistics?"** To answer this, we will **group** the players according to their **metrics** and check which positions are most associated with each resulting group.

First, let's reshape the initial data so that each row represents a player's performance over the season (here, 2017). In addition, only the numeric statistics recorded in those rounds will be used.
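As a toy illustration of this reshaping, the sketch below filters one season and sums each statistic per player. The frame `rounds` and all of its values are made up for the example and are not taken from the Cartola FC dataset; only the column names `AtletaID`, `ano`, `G`, and `A` come from the data described above.

```python
import pandas as pd

# Toy per-round data: one row per player per round (values are invented)
rounds = pd.DataFrame({
    "AtletaID": [1, 1, 2, 2, 2],
    "ano":      [2017, 2017, 2017, 2016, 2017],
    "G":        [1.0, 0.0, 2.0, 5.0, 1.0],
    "A":        [0.0, 1.0, 0.0, 3.0, 1.0],
})

# Keep only the 2017 rounds, then sum each statistic per player
season = rounds[rounds["ano"] == 2017]
per_player = season[["AtletaID", "G", "A"]].groupby("AtletaID").sum()
print(per_player)
```

The 2016 row of player 2 is dropped by the filter, so it does not contribute to that player's season totals.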

In [10]:
# Filter 2017 season
data_2017 = data[data['ano']==2017]

# Filters numeric values in the season
columns = ['AtletaID','FS', 'PE', 'A', 'FT', 'FD', 'FF', 'G', 'I', 'PP', 'RB', 'FC', 'GC', 'CA', 'CV', 'SG', 'DD', 'DP', 'GS']
data_2017_players = data_2017[columns].groupby(by=['AtletaID']).sum()
data_2017_players.head()


AtletaID,FS,PE,A,FT,FD,FF,G,I,PP,RB,FC,GC,CA,CV,SG,DD,DP,GS
36540,7.0,14.0,2.0,1.0,3.0,6.0,0.0,0.0,0.0,19.0,10.0,0.0,5.0,0.0,3.0,0.0,0.0,0.0
36612,8.0,13.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,14.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0
37604,11.0,41.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,13.0,14.0,0.0,4.0,0.0,5.0,0.0,0.0,0.0
37607,11.0,40.0,2.0,0.0,3.0,12.0,0.0,3.0,0.0,19.0,11.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0
37608,2.0,11.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,7.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This leaves us with a total of **696 athletes** and **18 statistics** accumulated over the season.
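With one row per player, the grouping step described at the start of this topic could be sketched roughly as follows. This is a minimal illustration under several assumptions, not necessarily the method used later in the analysis: the algorithm (scikit-learn's `KMeans`), the choice of two clusters, and the standardization step are all assumptions, and the players and the `position` labels are synthetic stand-ins rather than real Cartola FC data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic season totals for two invented player profiles (not real data)
attackers = pd.DataFrame({
    "G": rng.poisson(10, 20), "FF": rng.poisson(25, 20), "RB": rng.poisson(5, 20),
})
defenders = pd.DataFrame({
    "G": rng.poisson(1, 20), "FF": rng.poisson(4, 20), "RB": rng.poisson(40, 20),
})
players = pd.concat([attackers, defenders], ignore_index=True)

# Hypothetical position labels, used only to interpret the clusters afterwards
position = pd.Series(["attacker"] * 20 + ["defender"] * 20, name="position")

# Standardize so that no single statistic dominates the distance computation
scaled = StandardScaler().fit_transform(players)

# Assumed choice: k-means with two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = pd.Series(kmeans.fit_predict(scaled), name="cluster")

# Cross-tabulate clusters against positions to see which positions
# are most associated with each group
table = pd.crosstab(clusters, position)
print(table)
```

The position labels are not used to fit the model; they only enter in the cross-tabulation, which is what lets us read each cluster as "mostly position X".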