<a href="https://colab.research.google.com/github/cristianopoeta/DSWP/blob/master/Notebooks/NB10_04__Transformation_exerc_06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 6 - 120 years of Olympic history: athletes and results
* [120 years of Olympic history: athletes and results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)
    * Handle the variables 'sex', 'season', 'team', 'city', 'sport' and 'medal' appropriately;
    * Apply the transformations we have just studied to the numeric columns 'height' and 'weight'. Watch out for the missing values in these variables!
    * Check/evaluate the impact of outliers on these columns.
    * In this case, which transformation is most appropriate given the outliers?
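
The scaling task above can be sketched on made-up data. This is only an illustration: it assumes the transformations in question are standardization, min-max scaling and robust (median/IQR) scaling, which the exercise text does not name. The toy values and the helper names `standardize`, `min_max` and `robust` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one extreme outlier (400).
height = pd.Series([160.0, 170.0, 180.0, np.nan, 175.0, 400.0], name='height')

def standardize(s):
    # z-score; pandas skips NaN when computing mean and std
    return (s - s.mean()) / s.std()

def min_max(s):
    # rescale to [0, 1] using the observed minimum and maximum
    return (s - s.min()) / (s.max() - s.min())

def robust(s):
    # center on the median and divide by the interquartile range (IQR)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return (s - s.median()) / (q3 - q1)

scaled = pd.DataFrame({
    'standard': standardize(height),
    'min_max': min_max(height),
    'robust': robust(height),
})
print(scaled.round(2))
```

In this toy case the missing value stays NaN in every column. The single outlier squeezes the min-max values of the ordinary heights into a narrow band near 0, while the median/IQR version keeps them spread around 0. That behaviour is the usual argument for robust scaling when outliers are present.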

In [2]:
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown, display

In [3]:
pdod = pd.options.display    # shortcut to the display options
pdod.max_rows = 100          # max number of rows displayed
pdod.max_columns = 100       # max number of columns displayed
pdod.width = 200             # max total width in text mode

In [4]:
# For each column of `frame`, returns: column name, example value,
#     column dtype, class of the example value.
# The example value is taken from the row of `frame` at position `iloc`.
def exemplo_linha(frame, iloc=0):
  df_info = pd.DataFrame(dict(valor_exemplo=frame.iloc[iloc].copy()))
  df_info['dtype_coluna'] = frame.dtypes.map(lambda x: x.name)
  df_info['classe_valor'] = df_info['valor_exemplo'].map(lambda x: x.__class__.__name__)
  df_info.index.name = 'nome_coluna'
  return df_info

In [5]:
# `obj` can be a DataFrame or a Series.
# If `filtros` is `None`, displays the first `nh` rows, the last `nt` rows
#     and the `shape` of `obj`.
# If `filtros` is not `None`, it must be an iterable whose elements can be
#     used in `obj.loc[]` (filter functions are a good option).
def d_pd(obj, nh=1, nt=None, filtros=None):
    if nt is None:
        nt = nh
    if filtros is None:
        display(pd.concat([obj.head(nh), obj.tail(nt)]), obj.shape)
    else:
        for filtro in filtros:
            display(obj.loc[filtro])
        display(obj.shape)

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
df_olimp = pd.read_csv('/content/drive/My Drive/DSWP/athlete_events.csv')
d_pd(df_olimp)

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
271115,135571,Tomasz Ireneusz ya,M,34.0,185.0,96.0,Poland,POL,2002 Winter,2002,Winter,Salt Lake City,Bobsleigh,Bobsleigh Men's Four,


(271116, 15)

In [8]:
exemplo_linha(df_olimp).T

nome_coluna,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
valor_exemplo,1,A Dijiang,M,24,180,80,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
dtype_coluna,int64,object,object,float64,float64,float64,object,object,object,int64,object,object,object,object,object
classe_valor,int64,str,str,float64,float64,float64,str,str,str,int64,str,str,str,str,float


In [9]:
# lowercase the column names (`set_axis(..., inplace=True)` was removed in pandas 2.0)
df_olimp.columns = df_olimp.columns.str.lower()

In [10]:
df_vars = df_olimp[['sex', 'season', 'team', 'city', 'sport', 'medal']]

In [11]:
d_pd(df_vars, 3)

Unnamed: 0,sex,season,team,city,sport,medal
0,M,Summer,China,Barcelona,Basketball,
1,M,Summer,China,London,Judo,
2,M,Summer,Denmark,Antwerpen,Football,
271113,M,Winter,Poland,Sochi,Ski Jumping,
271114,M,Winter,Poland,Nagano,Bobsleigh,
271115,M,Winter,Poland,Salt Lake City,Bobsleigh,


(271116, 6)

In [12]:
exemplo_linha(df_vars).T

nome_coluna,sex,season,team,city,sport,medal
valor_exemplo,M,Summer,China,Barcelona,Basketball,
dtype_coluna,object,object,object,object,object,object
classe_valor,str,str,str,str,str,float


In [13]:
df_vars.medal.value_counts(dropna=False)

NaN       231333
Gold       13372
Bronze     13295
Silver     13116
Name: medal, dtype: int64
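
The counts above show that most rows have no medal (NaN). One possible treatment of the categorical columns is sketched below on a hypothetical toy frame: make the missing medals an explicit category, encode 'medal' as an ordered categorical, and one-hot encode low-cardinality columns such as 'sex' and 'season'. The label `'No medal'` and the category order are assumptions, not choices taken from the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows mimicking some columns of the dataset
toy = pd.DataFrame({
    'sex': ['M', 'F', 'M', 'F'],
    'season': ['Summer', 'Winter', 'Summer', 'Summer'],
    'medal': [np.nan, 'Gold', 'Bronze', np.nan],
})

# Make the absence of a medal an explicit category (the label is an assumption)
medal_order = ['No medal', 'Bronze', 'Silver', 'Gold']
toy['medal'] = pd.Categorical(
    toy['medal'].fillna('No medal'), categories=medal_order, ordered=True)

# Integer codes follow the declared order: 0 = No medal, ..., 3 = Gold
toy['medal_code'] = toy['medal'].cat.codes

# One-hot encode the low-cardinality nominal columns
encoded = pd.get_dummies(toy, columns=['sex', 'season'])
print(encoded)
```

For high-cardinality columns such as 'team', 'city' or 'sport', one-hot encoding produces many columns. Converting them to the `category` dtype or using a frequency encoding are common alternatives.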

### Example: transforming the correlation DataFrame into a Series of correlations


In [30]:
# build an example correlation DataFrame
df_corr = (
    df_olimp
    .select_dtypes('number')
    .drop(columns='id')
    .corr())
df_corr

Unnamed: 0,age,height,weight,year
age,1.0,0.138246,0.212069,-0.115137
height,0.138246,1.0,0.796213,0.047578
weight,0.212069,0.796213,1.0,0.019095
year,-0.115137,0.047578,0.019095,1.0


In [31]:
# keep only the values above the main diagonal (each pair appears once);
# everything else becomes NaN
df_corr = df_corr.where(np.triu(np.ones(df_corr.shape, dtype=bool), k=1))
df_corr

Unnamed: 0,age,height,weight,year
age,,0.138246,0.212069,-0.115137
height,,,0.796213,0.047578
weight,,,,0.019095
year,,,,


In [32]:
df_corr.stack().dropna().sort_values(ascending=False, key=abs)

height  weight    0.796213
age     weight    0.212069
        height    0.138246
        year     -0.115137
height  year      0.047578
weight  year      0.019095
dtype: float64
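
The steps of this example can be wrapped in a small reusable function. The sketch below uses a hypothetical helper name, `corr_pairs`, and made-up toy data; it assumes a pandas version that supports the `key` argument of `sort_values` (1.1 or later).

```python
import numpy as np
import pandas as pd

def corr_pairs(frame):
    """Return the unique pairwise correlations sorted by absolute value."""
    corr = frame.corr()
    # Boolean mask that is True strictly above the main diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # where() keeps the masked-in values and sets the rest to NaN
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False, key=abs)

# Hypothetical toy data: 'b' is exactly 2 * 'a', 'c' decreases as 'a' grows
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],
    'c': [4.0, 3.0, 2.0, 1.5],
})
pairs = corr_pairs(toy)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top, next to strong positive ones.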