# Clustering with Cartola FC Data

## 1.0 Dataset

The data come from Cartola FC and were obtained from the following GitHub [repository](https://github.com/henriquepgomide/caRtola). The CSV used is "dados_agregados_limpos.csv", which combines the data for the years 2014 to 2018 and has already been cleaned, so it contains no NaN values.

Each row represents...

Relevant columns:
- Rodada: round number of the Brasileirão
- ClubeID: the player's club
- AtletaID: the player's id
- Posicao: the player's position
- Pontos: the player's score
- PontosMedia: the player's average score
- FS: fouls suffered
- PE: misplaced passes
- A: assists
- FT: shots that hit the woodwork
- FD: shots saved
- FF: shots off target
- G: goals
- I: offsides
- FC: fouls committed
- CA: yellow cards
- CV: red cards
- SG: matches without conceding a goal
- GS: goals conceded
- RB: ball recoveries
- DD: difficult saves
- DP: penalty saves

First, the players will be clustered separately by position; that is, a separate clustering is run for each position (goalkeeper, centre-back, forward, full-back, and midfielder).
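
As a rough illustration of what clustering each position separately could look like, the sketch below groups a toy frame by `Posicao` and fits one model per group. The choice of scikit-learn's `KMeans`, the `StandardScaler` step, the helper name `cluster_by_position`, the value `n_clusters=2`, and the toy values are all assumptions made for illustration; the source does not state which algorithm or settings are used.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_by_position(df, feature_cols, n_clusters=2, position_col="Posicao"):
    """Fit a separate KMeans model per position and return the labels as a Series."""
    labels = pd.Series(-1, index=df.index, name="cluster")
    for _, group in df.groupby(position_col):
        # Standardize within the position so no single column dominates the distance.
        X = StandardScaler().fit_transform(group[feature_cols])
        model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        labels.loc[group.index] = model.fit_predict(X)
    return labels


# Toy data (made up for illustration, not taken from the dataset).
toy = pd.DataFrame({
    "Posicao": ["gol", "gol", "gol", "gol", "ata", "ata", "ata", "ata"],
    "G":       [0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.8, 0.9],
    "DD":      [0.5, 0.6, 2.5, 2.6, 0.0, 0.0, 0.0, 0.0],
})
toy["cluster"] = cluster_by_position(toy, ["G", "DD"], n_clusters=2)
print(toy)
```

Cluster labels are only meaningful within a position: cluster 0 among goalkeepers is unrelated to cluster 0 among forwards.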

After this first step, we will...

In [2]:
import pandas as pd

dados_agregados = pd.read_csv('dados_agregados_limpos.csv')
dados_agregados.shape

(17567, 77)
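
The claim in section 1.0 that the file contains no NaN values can be checked directly after loading. The sketch below uses a hypothetical helper, `missing_report`, on a toy frame with one deliberately missing value; neither the helper nor the toy values come from the source.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy frame with one deliberately missing value (illustrative only).
toy = pd.DataFrame({"Pontos": [2.0, np.nan, 5.2], "G": [0, 1, 0]})
report = missing_report(toy)
print(report)        # lists only the columns that contain NaN
print(report.empty)  # True would mean the frame has no NaN at all
```

Applied to the real data, `missing_report(dados_agregados).empty` should evaluate to `True` if the file is as clean as described.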

## 2.0 Exploring the data

### 2.1 Positions and their counts

In [4]:
dados_agregados.Posicao.value_counts()

mei    6516
ata    4050
lat    3045
zag    2628
gol    1328
Name: Posicao, dtype: int64

### 2.2 Column summary statistics

In [17]:
colunas = ['Pontos','PontosMedia','FS','PE','A','FT','FF','FC',
           'FD','G','I','RB','CA','CV','SG','DD','DP','GS']

dados_agregados[colunas].describe()

| Statistic | Pontos | PontosMedia | FS | PE | A | FT | FF | FC | FD | G | I | RB | CA | CV | SG | DD | DP | GS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 | 17567.0 |
| mean | 2.946798 | 2.960628 | 1.111914 | 2.315763 | 0.064325 | 0.018102 | 0.347356 | 1.159845 | 0.241305 | 0.090169 | 0.136847 | 1.026015 | 0.169693 | 0.00814 | 0.124039 | 0.115671 | 0.002505 | 0.090169 |
| std | 4.344379 | 2.050958 | 1.304447 | 2.042131 | 0.262381 | 0.135862 | 0.644317 | 1.260528 | 0.538621 | 0.315378 | 0.440791 | 1.265575 | 0.37598 | 0.090489 | 0.329636 | 0.552542 | 0.051112 | 0.435866 |
| min | -10.5 | -8.4 | -3.0 | -6.0 | -1.0 | 0.0 | -1.0 | -2.0 | -2.0 | 0.0 | -1.0 | -4.0 | -1.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 1.68 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2.0 | 2.88 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 5.2 | 4.09 | 2.0 | 3.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 33.0 | 21.83 | 12.0 | 14.0 | 4.0 | 2.0 | 7.0 | 10.0 | 6.0 | 3.0 | 7.0 | 9.0 | 1.0 | 1.0 | 1.0 | 7.0 | 2.0 | 6.0 |


In [42]:
teste = dados_agregados[colunas].describe()
teste.loc['mean','Pontos']

2.946797973472989
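
The `min` row of the `describe()` output in this section shows negative values in several count-type columns (for example FS, PE, and RB), which is unexpected for event counts. The source does not explain them. A minimal sketch for measuring how many rows are affected, using a hypothetical helper `count_negative` and toy values that are not taken from the dataset:

```python
import pandas as pd


def count_negative(df, columns):
    """Return, for each column, how many rows hold a value below zero."""
    return (df[columns] < 0).sum()


# Toy data (illustrative only).
toy = pd.DataFrame({"FS": [1, -3, 0], "PE": [2, 0, -6], "G": [0, 1, 0]})
negatives = count_negative(toy, ["FS", "PE", "G"])
print(negatives)
```

On the real data the same call could be made as `count_negative(dados_agregados, colunas)`; whether the affected rows should be dropped, clipped, or kept is a decision the source leaves open.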

### 2.3 Handling duplicate data

The issue is that "AtletaID" values repeat across the rows of the dataset, since an athlete may have played in several matches. To resolve this, the mean of each column to be used is computed per athlete.

In [0]:
dados = dados_agregados.copy()
# keep only the columns that will be used
dados = dados[['Rodada','ClubeID','AtletaID','Posicao','Pontos','PontosMedia','FS','PE','A','FT','FF','FC',
           'FD','G','I','RB','CA','CV','SG','DD','DP','GS']]

The checks below verify:
- whether each AtletaID has only one position: no, some have more than one;
- whether an AtletaID played for more than one ClubeID: yes.

In [49]:
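# Build the full per-athlete frames and truncate only for display;
# calling .head(10) before assignment would limit the later merges to 10 athletes.
df_posicao = dados.groupby(['AtletaID']).Posicao.unique().reset_index()
display(df_posicao.head(10))

df_clube = dados.groupby(['AtletaID']).ClubeID.unique().reset_index()
display(df_clube.head(10))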

| | AtletaID | Posicao |
|---|---|---|
| 0 | 36540 | [zag] |
| 1 | 36612 | [lat] |
| 2 | 37604 | [zag] |
| 3 | 37607 | [lat, mei] |
| 4 | 37608 | [mei] |
| 5 | 37623 | [zag] |
| 6 | 37644 | [zag] |
| 7 | 37646 | [zag] |
| 8 | 37652 | [mei] |
| 9 | 37653 | [ata] |


| | AtletaID | ClubeID |
|---|---|---|
| 0 | 36540 | [Internacional, Flamengo] |
| 1 | 36612 | [Grêmio, Palmeiras] |
| 2 | 37604 | [Atlético-PR] |
| 3 | 37607 | [São Paulo, Palmeiras] |
| 4 | 37608 | [Grêmio] |
| 5 | 37623 | [Vasco] |
| 6 | 37644 | [Ponte Preta] |
| 7 | 37646 | [Avaí] |
| 8 | 37652 | [Botafogo, Figueirense, Atlético-PR] |
| 9 | 37653 | [Internacional, Santos] |
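
Reading the first ten rows answers both questions by eye, but the same check can be made over every athlete with `nunique()`. The sketch below is one possible way to do it; the helper name `athletes_with_multiple` is hypothetical, and the toy rows only loosely mirror the athletes shown above.

```python
import pandas as pd


def athletes_with_multiple(df, column, id_col="AtletaID"):
    """Return the ids whose rows contain more than one distinct value in `column`."""
    distinct = df.groupby(id_col)[column].nunique()
    return distinct[distinct > 1].index.tolist()


# Toy rows (illustrative; one row per appearance of an athlete).
toy = pd.DataFrame({
    "AtletaID": [36540, 36540, 37607, 37607, 37608],
    "Posicao":  ["zag", "zag", "lat", "mei", "mei"],
    "ClubeID":  ["Internacional", "Flamengo", "São Paulo", "Palmeiras", "Grêmio"],
})
print(athletes_with_multiple(toy, "Posicao"))  # [37607]
print(athletes_with_multiple(toy, "ClubeID"))  # [36540, 37607]
```

The length of each returned list gives the number of athletes affected, which helps decide whether multi-position players need special treatment before clustering.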


Since some AtletaID values are associated with more than one position and more than one club, we will store this information as a list in the corresponding column.

In [53]:
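# numeric_only=True skips the text columns (Posicao, ClubeID), which pandas 2.x would otherwise reject
df_mean = dados.groupby(['AtletaID']).mean(numeric_only=True).reset_index().drop(['Rodada','PontosMedia'], axis=1)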
df_mean.head()

| | AtletaID | Pontos | FS | PE | A | FT | FF | FC | FD | G | I | RB | CA | CV | SG | DD | DP | GS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36540 | 4.095349 | 0.627907 | 1.418605 | 0.069767 | 0.023256 | 0.44186 | 0.860465 | 0.186047 | 0.0 | 0.0 | 1.348837 | 0.27907 | 0.0 | 0.395349 | 0.0 | 0.0 | 0.0 |
| 1 | 36612 | 2.895588 | 1.088235 | 1.705882 | 0.147059 | 0.0 | 0.191176 | 0.926471 | 0.073529 | 0.014706 | 0.102941 | 1.441176 | 0.176471 | 0.0 | 0.044118 | 0.0 | 0.0 | 0.0 |
| 2 | 37604 | 1.711111 | 0.611111 | 2.277778 | 0.0 | 0.055556 | 0.0 | 0.777778 | 0.111111 | 0.0 | 0.0 | 0.722222 | 0.222222 | 0.0 | 0.277778 | 0.0 | 0.0 | 0.0 |
| 3 | 37607 | 2.540426 | 0.531915 | 3.021277 | 0.212766 | 0.0 | 0.659574 | 0.914894 | 0.382979 | 0.042553 | 0.340426 | 0.829787 | 0.191489 | 0.042553 | 0.148936 | 0.0 | 0.0 | 0.0 |
| 4 | 37608 | 1.92 | 0.4 | 2.2 | 0.0 | 0.0 | 0.0 | 0.4 | 0.2 | 0.0 | 0.0 | 1.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [56]:
# merging the positions, clubs, and means
dados_unicos = pd.merge(df_mean,df_posicao,on='AtletaID')
dados_unicos = pd.merge(dados_unicos,df_clube,on='AtletaID')

dados_unicos.head()

| | AtletaID | Pontos | FS | PE | A | FT | FF | FC | FD | G | I | RB | CA | CV | SG | DD | DP | GS | Posicao | ClubeID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36540 | 4.095349 | 0.627907 | 1.418605 | 0.069767 | 0.023256 | 0.44186 | 0.860465 | 0.186047 | 0.0 | 0.0 | 1.348837 | 0.27907 | 0.0 | 0.395349 | 0.0 | 0.0 | 0.0 | [zag] | [Internacional, Flamengo] |
| 1 | 36612 | 2.895588 | 1.088235 | 1.705882 | 0.147059 | 0.0 | 0.191176 | 0.926471 | 0.073529 | 0.014706 | 0.102941 | 1.441176 | 0.176471 | 0.0 | 0.044118 | 0.0 | 0.0 | 0.0 | [lat] | [Grêmio, Palmeiras] |
| 2 | 37604 | 1.711111 | 0.611111 | 2.277778 | 0.0 | 0.055556 | 0.0 | 0.777778 | 0.111111 | 0.0 | 0.0 | 0.722222 | 0.222222 | 0.0 | 0.277778 | 0.0 | 0.0 | 0.0 | [zag] | [Atlético-PR] |
| 3 | 37607 | 2.540426 | 0.531915 | 3.021277 | 0.212766 | 0.0 | 0.659574 | 0.914894 | 0.382979 | 0.042553 | 0.340426 | 0.829787 | 0.191489 | 0.042553 | 0.148936 | 0.0 | 0.0 | 0.0 | [lat, mei] | [São Paulo, Palmeiras] |
| 4 | 37608 | 1.92 | 0.4 | 2.2 | 0.0 | 0.0 | 0.0 | 0.4 | 0.2 | 0.0 | 0.0 | 1.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | [mei] | [Grêmio] |
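
The steps in this section (per-athlete means, unique positions, unique clubs, and two merges) could also be expressed as a single `groupby().agg()` call. The sketch below is an alternative formulation, not the notebook's own method; the helper name `aggregate_per_athlete` and the toy values are assumptions for illustration.

```python
import pandas as pd


def aggregate_per_athlete(df, stat_cols, id_col="AtletaID"):
    """Average the stat columns and collect unique positions and clubs per athlete."""
    spec = {col: "mean" for col in stat_cols}
    # Keep every distinct position/club as a plain Python list.
    spec["Posicao"] = lambda s: list(s.unique())
    spec["ClubeID"] = lambda s: list(s.unique())
    return df.groupby(id_col).agg(spec).reset_index()


# Toy data (illustrative only).
toy = pd.DataFrame({
    "AtletaID": [1, 1, 2],
    "Posicao":  ["lat", "mei", "zag"],
    "ClubeID":  ["São Paulo", "Palmeiras", "Vasco"],
    "Pontos":   [2.0, 4.0, 1.5],
    "G":        [0, 1, 0],
})
result = aggregate_per_athlete(toy, ["Pontos", "G"])
print(result)
```

Doing everything in one aggregation avoids keeping three intermediate frames aligned, at the cost of a slightly less step-by-step notebook.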


### 2.4 Splitting the players by position

The dataframe is filtered below according to each player's position.

In [0]:
# TODO: adapt this to the list-valued Posicao column
dados_meia = dados[dados['Posicao']=='mei']

dados_atacante = dados[dados['Posicao']=='ata']

dados_lateral = dados[dados['Posicao']=='lat']

dados_zaga = dados[dados['Posicao']=='zag']

dados_gol = dados[dados['Posicao']=='gol']
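
The filters above compare `Posicao` to a string, which works on `dados` but not on `dados_unicos`, where `Posicao` holds a list of positions per athlete. A minimal sketch of a list-aware filter follows; the helper name `filter_by_position` and the toy frame are assumptions for illustration.

```python
import pandas as pd


def filter_by_position(df, position, position_col="Posicao"):
    """Keep rows whose list-valued position column contains `position`."""
    mask = df[position_col].apply(lambda positions: position in positions)
    return df[mask]


# Toy frame shaped like dados_unicos (values are illustrative).
toy = pd.DataFrame({
    "AtletaID": [36540, 37607, 37608],
    "Posicao":  [["zag"], ["lat", "mei"], ["mei"]],
})
meias = filter_by_position(toy, "mei")
print(meias["AtletaID"].tolist())  # [37607, 37608]
```

With this approach an athlete listed under two positions appears in both subsets, so the per-position groups are no longer disjoint.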