# Data visualization notebook

## The Four Levels of Measurement
In order to choose an appropriate plot type or method of analysis for your data, you need to understand the types of data you have. One common method divides the data into four levels of measurement:

### Qualitative or categorical types (non-numeric types)
1. Nominal data: pure labels without inherent order (no label is intrinsically greater or less than any other)
2. Ordinal data: labels with an intrinsic order or ranking (comparison operations can be made between values, but the magnitude of differences is not well-defined)

### Quantitative or numeric types
3. Interval data: numeric values where absolute differences are meaningful (addition and subtraction operations can be made)
4. Ratio data: numeric values where relative differences are meaningful (multiplication and division operations can be made)
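
The four levels can be illustrated with a small sketch. All values here are made up for illustration (`colors`, `sizes` and the temperatures are hypothetical): an ordered pandas categorical supports comparisons while an unordered one does not, and Celsius temperatures show why interval data supports differences but not ratios.

```python
import numpy as np
import pandas as pd

# Nominal: labels only, no order (hypothetical values)
colors = pd.Series(pd.Categorical(['red', 'blue', 'green', 'blue']))

# Ordinal: labels with an intrinsic ranking (hypothetical values)
size_levels = ['small', 'medium', 'large']
sizes = pd.Series(pd.Categorical(['medium', 'small', 'large', 'small'],
                                 categories=size_levels, ordered=True))

# Comparisons are defined for the ordered categorical ...
is_bigger_than_small = sizes > 'small'
largest = sizes.max()

# ... but raise a TypeError for the unordered one
try:
    colors > 'blue'
    nominal_comparable = True
except TypeError:
    nominal_comparable = False

# Interval vs. ratio: Celsius has an arbitrary zero, Kelvin has a true zero
celsius = np.array([10.0, 20.0])
kelvin = celsius + 273.15
same_difference = np.isclose(celsius[1] - celsius[0], kelvin[1] - kelvin[0])
celsius_ratio = celsius[1] / celsius[0]   # 2.0, an artifact of where 0 °C sits
kelvin_ratio = kelvin[1] / kelvin[0]      # about 1.035, the meaningful ratio
```

Differences agree between the two temperature scales, but the ratios do not, which is the practical meaning of "interval" versus "ratio".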

All quantitative-type variables also come in one of two varieties: discrete and continuous.
#### Discrete quantitative variables
Discrete quantitative variables can only take on a specific set of values at some maximum level of precision.
#### Continuous quantitative variables
Continuous quantitative variables can (hypothetically) take on values to any level of precision.

In [1]:
# Import libraries for all the following charts
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

### Exploratory analysis
Exploratory analysis is done when you are searching for insights. You are the consumer of these plots, so they don't need to be perfect or aesthetically appealing; they only need to let you find the answers to your own questions about the data.


### Univariate Exploration of Data
#### 1. Bar chart ---- Qualitative data
A bar chart is used to depict the distribution of a categorical variable. In a bar chart, each level of the categorical variable is depicted with a bar, whose height indicates the frequency of data points that take on that level. A basic bar chart of frequencies can be created through the use of seaborn's countplot function.

In [None]:
# set one color for all the bars
base_color = sb.color_palette()[0]

# set the order of the bars:
# -- for ordinal: set up the order according to the data (example is not from the same dataset)
level_order = ['Alpha', 'Beta', 'Gamma', 'Delta']
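# astype no longer accepts ordered/categories keywords; build a CategoricalDtype instead
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['cat_var'] = df['cat_var'].astype(ordered_cat)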
sb.countplot(data = df, x = 'cat_var', color = base_color)

# -- for nominal: set up the order according to the counts/frequency
gen_order = pokemon.generation_id.value_counts().index
sb.countplot(data = pokemon, x = 'generation_id', color = base_color, order = gen_order)

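# if the xtick labels overlap, they can be rotated for readability:
plt.xticks(rotation = 90)

# or make the chart horizontal by passing the variable as y instead of x;
# the order must come from the plotted column (type_1), not from generation_id:
type_order = pokemon['type_1'].value_counts().index
sb.countplot(data = pokemon, y = 'type_1', color = base_color, order = type_order)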
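
The frequency-based ordering used for nominal data can be checked in isolation. This is a minimal sketch with a made-up series (`gens` is hypothetical, not the Pokemon data): `value_counts()` sorts from most to least common by default, so its index can be passed directly as `order`.

```python
import pandas as pd

# Hypothetical generation ids: 1 appears three times, 3 twice, 2 once
gens = pd.Series([1, 3, 1, 2, 1, 3])

gen_counts = gens.value_counts()      # sorted by descending frequency
gen_order = gen_counts.index          # category order for the bars
```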


In [None]:
# Relative frequency

# reshape so that each Pokemon contributes one row per type it has
pkmn_types = pokemon.melt(id_vars = ['id','species'],
                          value_vars = ['type_1', 'type_2'],
                          var_name = 'type_level', value_name = 'type').dropna()

# counts per type, most common first (assumed definitions: the original cell
# used type_counts and type_order without defining them)
type_counts = pkmn_types['type'].value_counts()
type_order = type_counts.index

n_pokemon = pokemon.shape[0]
max_type_count = type_counts.iloc[0]
max_prop = max_type_count / n_pokemon   # max proportion is 0.16

tick_props = np.arange(0, max_prop, 0.02)  # tick locations, as proportions
tick_names = ['{:0.2f}'.format(v) for v in tick_props]  # tick labels with two decimal places

base_color = sb.color_palette()[0]
sb.countplot(data = pkmn_types, y = 'type', color = base_color, order = type_order)
# countplot still draws absolute counts, so the tick positions must be given in counts
# (proportion * n_pokemon); only the tick labels show proportions
plt.xticks(tick_props * n_pokemon, tick_names)
plt.xlabel('proportion')

# add text next to each bar to show the percentage
for i in range(type_counts.shape[0]):
    count = type_counts.iloc[i]
    pct_string = '{:0.1f}%'.format(100*count/n_pokemon)
    # the label sits just past the end of the bar, so keep the default dark text color;
    # for vertical bars, swap the first two arguments and use ha = 'center' instead of va
    plt.text(count+1, i, pct_string, va = 'center')
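
The relabeling trick in the cell above (tick positions in counts, tick labels in proportions) is easier to see with small numbers. The sketch below uses plain matplotlib and a made-up `counts` series (hypothetical values, 100 observations in total) rather than the Pokemon data.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend for the sketch; not needed in a notebook
import matplotlib.pyplot as plt

# Hypothetical category counts, already sorted from most to least common
counts = pd.Series({'water': 40, 'normal': 30, 'grass': 20, 'fire': 10})
n_total = counts.sum()                    # 100 observations
max_prop = counts.iloc[0] / n_total       # 0.40

# Proportions to print on the axis
tick_props = np.arange(0, max_prop + 0.05, 0.1)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]

fig, ax = plt.subplots()
positions = np.arange(len(counts))
ax.barh(positions, counts.values)         # bar lengths are absolute counts
ax.set_yticks(positions)
ax.set_yticklabels(counts.index)
ax.invert_yaxis()                         # most common category on top

# The axis is measured in counts, so each tick sits at proportion * n_total
tick_positions = tick_props * n_total
ax.set_xticks(tick_positions)
ax.set_xticklabels(tick_names)
ax.set_xlabel('proportion')
```

The bars never change; only the text printed at each tick does, which is why the proportions have to be scaled back to counts when choosing tick locations.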

In [None]:
# using bar charts to count missing values

na_counts = df.isna().sum()   # number of missing values per column
base_color = sb.color_palette()[0]
sb.barplot(x = na_counts.index.values, y = na_counts, color = base_color)
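
To see what `isna().sum()` produces before it is plotted, here is a minimal sketch on a made-up frame (`df_demo` and its columns are hypothetical): the result is one missing-value count per column, indexed by column name.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing entries in two of three columns
df_demo = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', None, 'z', 'w'],
    'c': [1, 2, 3, 4],
})

# isna() marks missing cells as True; summing counts them per column
na_counts = df_demo.isna().sum()
```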

#### 2. Pie chart ---- Qualitative data

A pie chart is a common univariate plot type that is used to depict relative frequencies for levels of a categorical variable. Frequencies in a pie chart are depicted as wedges drawn on a circle: the larger the angle or area, the more common the categorical value taken.

In [None]:
sorted_counts = df['cat_var'].value_counts()
plt.pie(sorted_counts, labels = sorted_counts.index, startangle = 90,
        counterclock = False)
plt.axis('square')

# to make a 'donut plot' from the pie chart, add the argument wedgeprops = {'width': 0.4} to plt.pie
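
As a hedged sketch of how relative frequencies map to wedge angles, the example below draws a donut plot from made-up counts (`demo_counts` is hypothetical) and reads the angular span of each wedge back from the axes.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend for the sketch; not needed in a notebook
import matplotlib.pyplot as plt

# Hypothetical counts: 50%, 30% and 20% of the data
demo_counts = pd.Series({'Alpha': 50, 'Beta': 30, 'Gamma': 20})

fig, ax = plt.subplots()
ax.pie(demo_counts, labels=demo_counts.index, startangle=90,
       counterclock=False, wedgeprops={'width': 0.4})
ax.axis('square')

# Each wedge is a patch on the axes; its span in degrees is |theta2 - theta1|
angles = [abs(w.theta2 - w.theta1) for w in ax.patches]
```

Each wedge spans 360 degrees times its category's share of the total, so half of the data takes half of the circle.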

#### 3. Histograms ---- Quantitative data
A histogram is used to plot the distribution of a numeric variable. It's the quantitative version of the bar chart. However, rather than plot one bar for each unique numeric value, values are grouped into contiguous bins, and one bar is plotted for each bin, with its height depicting the number of data points that fall in that bin. For instance, using matplotlib's hist function with explicit bin edges:

In [None]:
# matplotlib
bins = np.arange(0, pokemon.speed.max()+5, 5)
plt.hist(data = pokemon, x = 'speed', bins = bins)

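# seaborn
bins = np.arange(0, pokemon.speed.max()+5, 5)
# hist_kws belongs to the deprecated distplot; histplot passes extra keywords such as alpha to matplotlib
sb.histplot(pokemon['speed'], bins = bins, kde = False, alpha = 1)  # alpha sets the transparency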
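
The binning itself can be inspected without drawing anything. This is a minimal sketch on randomly generated integers (`values` is a hypothetical stand-in for a column such as `speed`): `np.histogram` computes the per-bin counts that `plt.hist` draws as bar heights.

```python
import numpy as np

# Hypothetical stand-in for a numeric column: 500 integers from 5 to 160
rng = np.random.default_rng(0)
values = rng.integers(5, 161, size=500)

# Bin edges every 5 units; adding one bin width to the max makes sure
# the last edge is at or beyond the largest value
bin_width = 5
bins = np.arange(0, values.max() + bin_width, bin_width)

# One count per bin, so there is always one fewer count than edges
counts, edges = np.histogram(values, bins=bins)
```

Every bin is closed on the left and open on the right, except the last one, which also includes its right edge.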

#### 4. Exceptions:
#### Discrete Data ---- Bar chart with non-connected bins 
#### Ordinal Data --- Histogram.

In [None]:
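# Take die_rolls as an example: it only contains integers from 2 to 12 (e.g. the sum of two dice).
# The bin edges sit on half-integers so each bar is centered on one value, and the "rwidth"
# parameter sets the proportion of each bin width that is filled by the bar, leaving gaps between bars.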

bin_edges = np.arange(1.5, 12.5+1, 1)
plt.hist(die_rolls, bins = bin_edges, rwidth = 0.7)
plt.xticks(np.arange(2, 12+1, 1))
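
The half-integer bin edges can be verified numerically. The sketch below simulates the data (the assumption that `die_rolls` holds sums of two six-sided dice is mine) and checks that each bin is centered on exactly one integer outcome.

```python
import numpy as np

# Assumed data: 1000 sums of two six-sided dice
rng = np.random.default_rng(1)
die_rolls = rng.integers(1, 7, size=1000) + rng.integers(1, 7, size=1000)

# Edges at 1.5, 2.5, ..., 12.5 give 11 bins of width 1
bin_edges = np.arange(1.5, 12.5 + 1, 1)
counts, _ = np.histogram(die_rolls, bins=bin_edges)

# Bin centers land on the integer outcomes 2 through 12
centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# Direct tally of each outcome, for comparison with the histogram counts
tally = np.bincount(die_rolls, minlength=13)[2:]
```

Because each bin holds a single outcome, the histogram counts match a direct tally of the values, which is what makes the bar-chart-style gaps appropriate here.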

#### 5. Descriptive Statistics and Outliers
1). Axis Limits

In [None]:
# Simply set x-axis limits to keep the outliers out of view

plt.xlim(0, 35)

2). Scales and Transformations: Certain data distributions are amenable to scale transformations. The most common example is data that follows an approximately log-normal distribution. Such data, in its natural units, can look highly skewed: lots of points with low values, with a very long tail of data points with large values. However, after applying a logarithmic transform, the data will follow an approximately normal distribution.

In [None]:
np.log10(pokemon['weight'].describe())  #--- to get the range of the x ticks (-1 to 3)

bins = 10 ** np.arange(-1, 3 + 0.1, 0.1)  # bin edges evenly spaced in log10 units
ticks = [0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]  #--- 0.3, 3, 30, 300 added between the powers of ten
labels = ['{}'.format(v) for v in ticks]
plt.hist(data = pokemon, x = 'weight', bins = bins)
plt.xscale('log')
plt.xticks(ticks, labels)  # apply the custom ticks after switching to the log scale
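
The effect of the log transform can be sketched numerically. The data below is simulated (`weights` is a hypothetical log-normal sample, not the Pokemon weights), and `skewness` is a small helper defined here for illustration: the raw values are strongly right-skewed, while their base-10 logarithms are roughly symmetric.

```python
import numpy as np

# Hypothetical log-normal sample: the log10 of the values is normally distributed
rng = np.random.default_rng(2)
weights = 10 ** rng.normal(loc=1.5, scale=0.6, size=2000)

def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

raw_skew = skewness(weights)              # large and positive: long right tail
log_skew = skewness(np.log10(weights))    # close to zero: roughly symmetric

# Bin edges as powers of ten are unevenly spaced in natural units
# but evenly spaced (0.1 apart) on the log scale
bins = 10 ** np.arange(-1, 3 + 0.1, 0.1)
log_widths = np.diff(np.log10(bins))
```

This is why the bin edges are built as powers of ten: on a log-scaled axis they produce bars of equal visual width.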

### Explanatory analysis 
Explanatory analysis is done when you are presenting your results to others. These visualizations need to provide the emphasis necessary to convey your message. They should be accurate, insightful, and visually appealing.