# Lab2 Visualizations

Second round: last time you were given some visualizations to check the result of your cleaning. This time, you are going to implement some yourself.

The objectives of this lab are for you to practice:
* data normalization (lab 1 catch-up)
* data aggregation
* pandas/seaborn/pyplot plotting
* creating custom graphs (styling, additional info, subplots)

## Reminders
* [GitHub repo](https://github.com/Faur/ITU-Data-Science-in-Games-Exercises)
* **Shut down notebooks** when you are done. Otherwise the server will run out of resources, and we will be forced to restart them.
* Server storage is volatile! I.e. you must **save everything locally** that you don't want to lose.

In [None]:
# ! git pull

In [None]:
# Makes matplotlib plots work better with Jupyter
%matplotlib inline

# Import the necessary libraries. 
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Check that the data directory and the data file are present
# TODO fix path
basedir = "../"
file = "fifa.csv"
datapath = os.path.join(basedir, "data", file)
assert os.path.isfile(datapath), 'Data not found. Make sure to have the most recent version!'

data = pd.read_csv(datapath, sep=",")

# set default graph style
sns.set(rc={'figure.figsize':(20,8)})

## Given code 1: Color palettes

Objective: get familiar with color palette types

In [None]:
categorical_examples = ['deep', 'colorblind', 'muted']
sequential_examples = ['gray', 'Blues', 'BuGn_r', 'GnBu_d']
diverging_examples = ['coolwarm', 'BrBG', 'RdBu_r']

features = ['Age', 'Overall', 'Potential', 'Body Type']

subset = data.loc[data['Body Type'].isin(['Lean', 'Normal', 'Stocky']), features]

with sns.axes_style('white'):
    sns.pairplot(subset, hue='Body Type', palette=diverging_examples[0])
plt.show()
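
The three lists above correspond to the three palette families seaborn distinguishes: categorical palettes give distinct hues for unordered classes, sequential palettes run from light to dark for ordered values, and diverging palettes use two hues that meet at a neutral midpoint. A minimal sketch for inspecting one palette of each family, assuming 5 colors per palette (an arbitrary choice for illustration):

```python
import seaborn as sns

# One palette from each family used in the cell above; 5 colors is an arbitrary choice
examples = {
    'categorical': sns.color_palette('deep', 5),
    'sequential': sns.color_palette('Blues', 5),
    'diverging': sns.color_palette('coolwarm', 5),
}

for family, palette in examples.items():
    # as_hex() converts the RGB tuples into hex strings that are easy to compare
    print(family, palette.as_hex())
```

In a notebook, `sns.palplot(palette)` draws the same colors as swatches, which makes the difference between the families easier to see.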
    

## Task 1: Count and Histogram

> Estimated task time: 20 minutes.

1) Count the number of occurrences of each nationality

2) Plot a histogram of the results

3) Plot the histogram filtering out entries with a count "too small"

4) Plot the histogram showing the top X entries (choose X accordingly)
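
As a rough sketch of these steps on a toy dataset (the nationalities and both thresholds are made up for illustration, not taken from `fifa.csv`), `Series.value_counts()` does the counting and the sorting in one call:

```python
import pandas as pd

# Toy stand-in for data['Nationality']; the values are invented for this example
nationality = pd.Series(
    ['Denmark'] * 5 + ['Italy'] * 3 + ['Brazil'] * 2 + ['Peru'],
    name='Nationality',
)

# 1) count occurrences; the result is sorted from most to least frequent
counts = nationality.value_counts()

# 3) keep only entries whose count is not "too small" (threshold assumed to be 2)
frequent = counts[counts >= 2]

# 4) keep the top X entries (X assumed to be 2)
top = counts.head(2)

print(frequent.to_dict())
print(top.to_dict())
```

Calling `.plot(kind='bar')` on `counts`, `frequent`, or `top` then gives the corresponding chart for step 2.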

In [None]:
## YOUR CODE HERE 
# Pro tip: call plt.show() after creating each graph when a cell contains several graphs,
# otherwise they will all be rendered on top of each other

# 1
# Creates a DataFrame indexed by Nationality whose only column is 'count'
# (for details about indexing see https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)
gb = data.groupby('Nationality')
natCount = pd.DataFrame({'count': gb.size()})

# Moves the index back into a regular 'Nationality' column
natCount = natCount.reset_index()

# # 4 (top X entries, here X = 20)
# natCount = natCount \
#     .sort_values('count', ascending=False) \
#     .head(20)

# # 3 (drop nationalities with a "too small" count)
# natCount = natCount[natCount['count'] > 400]

# 2
natCount.plot(x='Nationality', y='count', kind='bar', figsize=(20, 8))
plt.show()


sns.set(rc={'figure.figsize':(20,8)})
sns.barplot(x=natCount['Nationality'].values, y=natCount['count'].values, color='firebrick')

plt.show()


plt.bar(natCount['Nationality'].values, natCount['count'].values, label='count')
plt.legend()
plt.show()

# NB: seaborn can group and count the dataset automatically, but it does not let you manipulate the resulting counts
# sns.countplot(x=data.Nationality, color='red')
# plt.show()

**NB**: matplotlib predates pandas by several years, so its plotting functions are built around raw values and `Series` rather than whole `DataFrame`s. Its default graphical style is also dated.
It is advisable to use pandas visualizations when possible, seaborn for high-level visualizations, and to fall back to matplotlib only when needed (keeping in mind that seaborn can automatically give matplotlib visualizations a modern look with `seaborn.set()`).
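
A small sketch of the difference, on a made-up table (the column names and numbers are assumptions for illustration): pandas plots straight from the `DataFrame` by column name, while the matplotlib call is handed the underlying values and is labeled by hand.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up counts, only for illustration
df = pd.DataFrame({'Nationality': ['Denmark', 'Italy', 'Brazil'],
                   'count': [5, 3, 2]})

fig, (ax_pandas, ax_mpl) = plt.subplots(1, 2, figsize=(10, 4))

# pandas: columns are referenced by name
df.plot(x='Nationality', y='count', kind='bar', ax=ax_pandas, title='pandas')

# matplotlib: pass the raw values and set title, label and legend explicitly
ax_mpl.bar(df['Nationality'].values, df['count'].values, label='count')
ax_mpl.set_title('matplotlib')
ax_mpl.set_xlabel('Nationality')
ax_mpl.legend()

fig.tight_layout()
```

In the notebook, finish the cell with `plt.show()` as in the other cells.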

## Task 2: Multiple data series

> Estimated task time: 10 minutes.

1) Calculate Min/Max/Avg for the given features, grouped by Club

2) Plot a histogram with all three features

3) Plot a boxplot of the results # NOTE: not sure at all that this is clear
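
A minimal sketch of step 1 on invented numbers (the clubs and values are assumptions, not taken from `fifa.csv`). Aggregating several features with several functions yields columns with two levels, (feature, statistic), which is worth knowing before plotting:

```python
import pandas as pd

# Invented data: two clubs, two goalkeeper features
toy = pd.DataFrame({
    'Club': ['A', 'A', 'B', 'B'],
    'GKDiving': [10, 20, 30, 50],
    'GKHandling': [12, 18, 40, 44],
})

stats = toy.groupby('Club')[['GKDiving', 'GKHandling']].agg(['min', 'max', 'mean'])

# Columns are a MultiIndex: select one statistic with a (feature, statistic) tuple
print(stats[('GKDiving', 'mean')])

# Optionally flatten to single-level names such as 'GKDiving_mean'
flat = stats.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
print(flat.columns.tolist())
```

Flattened names tend to give more readable legends when the result is passed to `.plot(kind='bar')`.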


In [None]:
features = ['GKDiving', 'GKHandling', 'GKKicking']

In [None]:
## YOUR CODE HERE

#1
aggregation_columns = ['min', 'max', 'mean']
gb = data.groupby('Club')[features].agg(aggregation_columns)
clubStat = gb.reset_index()

#2
gb \
    .head(10) \
    .plot(kind='bar', figsize=(20, 8))
plt.show()

#3
gb.boxplot(figsize=(20, 8))
plt.show()


## Given code 2: Visualization sins

Objective: recognize the pitfalls of visualizations

In [None]:
# exotic plots
sns.violinplot(data=gb)
plt.show()

# optimistic regression
sns.residplot(x=subset['Age'], y=subset['Overall'], lowess=True, color='g')
plt.show()

# misleading palettes?

# information clutter?

## Task 3: Subplots

> Estimated task time: 10 minutes.

1) Edit the given code to show the graphs in a grid

**NB**: include brief description of KDE
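
A kernel density estimate (KDE) smooths a sample into a continuous curve by placing a small bump (the kernel) on every observation and averaging the bumps: $\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$, where $h$ is the bandwidth. The sketch below implements this by hand with a Gaussian kernel. The sample and the bandwidth `h=0.5` are arbitrary choices for illustration; seaborn selects its own bandwidth rather than using this value.

```python
import numpy as np

def gaussian_kde_by_hand(samples, grid, h):
    """Average of Gaussian bumps of width h centered on each sample."""
    samples = np.asarray(samples, dtype=float)
    # Scaled distances between every grid point (rows) and every sample (columns)
    u = (grid[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * h)

samples = [1.0, 1.5, 2.0, 4.0]       # made-up observations
grid = np.linspace(-3, 8, 1101)      # points where the density is evaluated
density = gaussian_kde_by_hand(samples, grid, h=0.5)

# A density should integrate to approximately 1 over a wide enough range
area = np.sum(density) * (grid[1] - grid[0])
print(area)
```

A smaller `h` gives a spikier curve that follows individual observations, a larger `h` gives a smoother one.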


In [None]:
# currency2float = lambda x: x.replace(r'€|[KM]', '', regex=True).astype(float) * \
#             x.str.extract(r'[\d\.]+([KM]+)', expand=False) \
#                 .fillna(1) \
#                 .replace(['K','M'], [10**3, 10**6]).astype(int)

features = ['Crossing', 'Finishing']

value = data[features]

f, axs = plt.subplots(1, 2)

plt.ticklabel_format(style='plain', axis='x')
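for feature in features:
    # seaborn >= 0.11: `fill` replaces the deprecated `shade`
    sns.kdeplot(value[feature].dropna(), fill=True, ax=axs[0])
# plt.show()

for feature in features:
    # seaborn >= 0.11: `histplot` with a KDE overlay replaces the deprecated `distplot`
    sns.histplot(value[feature].dropna(), kde=True, stat='density', ax=axs[1])
plt.show()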
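
One possible way to arrange plots in a grid, sketched with plain matplotlib on random data (the 2x2 layout, the titles, and the data are assumptions for illustration): `plt.subplots(rows, cols)` returns an array of axes, and `axs.flat` iterates over them row by row.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
names = ['A', 'B', 'C', 'D']  # hypothetical feature names

fig, axs = plt.subplots(2, 2, figsize=(10, 6), sharex=True)

# Draw one histogram per cell of the grid and title it
for ax, name in zip(axs.flat, names):
    ax.hist(rng.normal(size=200), bins=20)
    ax.set_title(name)

fig.suptitle('2x2 grid of histograms')
fig.tight_layout()
```

seaborn functions that accept an `ax` argument, such as `sns.kdeplot`, can be pointed at a cell of the grid in the same way, for example `ax=axs[0, 1]`.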

## Given code 3: Plot functions

Objective: take a look at how to plot and analyze functions (will be useful for NN)
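
As a sketch of what plotting a mathematical function can look like, the example below draws the logistic sigmoid. That function is chosen here as an assumption, because it is a common neural-network activation; the lab itself does not prescribe a particular function.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Evaluate the function on an evenly spaced grid
x = np.linspace(-6, 6, 200)
y = sigmoid(x)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, y, label='sigmoid(x)')
ax.axhline(0.5, linestyle='--', color='gray')  # the value of the function at x = 0
ax.set_xlabel('x')
ax.set_ylabel('sigmoid(x)')
ax.legend()
```

Evaluating on a grid with `np.linspace` and plotting the result is the same pattern regardless of which function is being inspected.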

In [None]:
jointplots_kinds = ['scatter', 'kde', 'hex', 'reg']
"""
scatter(default): scatterplot with marginal histograms
kde:  kernel density estimate
hex:  joint histogram
reg:  regression and kernel density estimates
"""


uniform_data = np.random.rand(10, 12)
ax = sns.heatmap(uniform_data, square=True)
plt.show()

normal_data = np.random.randn(10, 12)
ax = sns.heatmap(normal_data, center=0, square=True)

plt.show()

with sns.axes_style('white'):
    for kind in jointplots_kinds:
        # each jointplot creates its own figure, so show them one at a time
        sns.jointplot(x='GKDiving', y='GKHandling', data=data, kind=kind)
        plt.show()
