# Notebook 3: Effective data visualization

This notebook introduces several chart types and explains when each one is the right choice for your data.

Specifically, in this notebook we will use:

- Barcharts
- Bubblecharts
- Linecharts
- Violinplots

## Getting some data

Let's start by getting some data to plot later. Here we use the `requests` and `json` packages, two of the most commonly used packages for obtaining data from APIs.

Let's create a function for this purpose. This function was obtained from: https://www.cryptodatadownload.com/blog/how-to-download-coinbase-price-data.html

In [146]:
# First import the libraries that we need to use
from pandas import DataFrame, to_datetime
import requests
import json

def fetch_daily_data(symbol):
    '''
    Function copied from: https://www.cryptodatadownload.com/blog/how-to-download-coinbase-price-data.html
    
    This function takes a trading pair as a string, downloads its daily price data from Coinbase
        and stores it in a .csv file.
        
    Arguments
    ---------
    symbol: :str: String in the type "XXX/XXX", e.g. "BTC/EUR", "BTC/USD", "ETH/EUR", etc...
    
    Output
    ------
    None.
    
    This function doesn't return any data, it simply stores the data in the folder "data/external/".
    
    '''
    pair_split = symbol.split('/')  # symbol must be in format XXX/XXX ie. BTC/EUR
    symbol = pair_split[0] + '-' + pair_split[1]
    url = f'https://api.pro.coinbase.com/products/{symbol}/candles?granularity=86400'
    response = requests.get(url)
    if response.status_code == 200:  # check to make sure the response from server is good
        data = DataFrame(json.loads(response.text), columns=['unix', 'low', 'high', 'open', 'close', 'volume'])
        data['datetime'] = to_datetime(data['unix'], unit='s')  # convert to a readable date
        data['vol_fiat'] = data['volume'] * data['close']      # multiply the BTC volume by closing price to approximate fiat volume

        # if we failed to get any data, print an error...otherwise write the file
        # (a DataFrame is never None, so we check whether it is empty instead)
        if data.empty:
            print("Did not return any data from Coinbase for this symbol")
        else:
            data.to_csv(f'data/external/Coinbase_{pair_split[0] + pair_split[1]}_dailydata.csv', index=False)

    else:
        print("Did not receive OK response from Coinbase API")
        

Then we execute this function and obtain the data and store it in the `data/external` folder.

In [147]:
fetch_daily_data('BTC/EUR')
fetch_daily_data('ETH/EUR')
fetch_daily_data('ADA/EUR')
fetch_daily_data('SOL/EUR')
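
The parsing step can also be separated from the download, which makes it easy to check without calling the API. The sketch below assumes the response is a list of `[unix, low, high, open, close, volume]` rows, matching the column list used in `fetch_daily_data`. The helper `candles_to_frame` and the sample rows are hypothetical, made up only for illustration.

```python
from pandas import DataFrame, to_datetime

CANDLE_COLUMNS = ['unix', 'low', 'high', 'open', 'close', 'volume']

def candles_to_frame(candles):
    """Turn a list of candle rows into a dataframe (hypothetical helper)."""
    data = DataFrame(candles, columns=CANDLE_COLUMNS)
    data['datetime'] = to_datetime(data['unix'], unit='s')  # seconds since epoch -> date
    data['vol_fiat'] = data['volume'] * data['close']       # approximate fiat volume
    return data

# Two made-up rows, only to illustrate the assumed shape of the payload
sample = [
    [1636416000, 100.0, 120.0, 105.0, 110.0, 2.0],
    [1636329600, 90.0, 110.0, 95.0, 105.0, 4.0],
]
frame = candles_to_frame(sample)
print(frame[['datetime', 'close', 'vol_fiat']])
```

Keeping the parsing in its own function means the network call is the only part that depends on the API being reachable.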

Now that we have the data we need to read it and unify it. Notice that we use the `parse_dates` argument to read the `datetime` column as datetimes. For more info on datetimes read the following link: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

In [148]:
from pandas import read_csv

df_btc = read_csv('data/external/Coinbase_BTCEUR_dailydata.csv', parse_dates=['datetime'])
df_eth = read_csv('data/external/Coinbase_ETHEUR_dailydata.csv', parse_dates=['datetime'])
df_ada = read_csv('data/external/Coinbase_ADAEUR_dailydata.csv', parse_dates=['datetime'])
df_sol = read_csv('data/external/Coinbase_SOLEUR_dailydata.csv', parse_dates=['datetime'])

list_dfs = [df_btc, df_eth, df_ada, df_sol]

There is no column indicating which coin each dataframe belongs to, so we will create one in the following way:

In [149]:
list_coins = ['BTC', 'ETH', 'ADA', 'SOL']
# Pair each dataframe with its coin name and add the label in place
for df, coin in zip(list_dfs, list_coins):
    df['coin'] = coin

Now we have a list of dataframes, and we want to concatenate it to obtain a single dataframe.

In [150]:
from pandas import concat

data_crypto = concat(list_dfs, axis=0)

data_crypto.head()

Unnamed: 0,unix,low,high,open,close,volume,datetime,vol_fiat,coin
0,1636416000,57906.6,59114.0,58278.1,58850.38,417.219088,2021-11-09,24553500.0,BTC
1,1636329600,54778.96,58498.0,54778.96,58293.11,1426.614721,2021-11-08,83161810.0,BTC
2,1636243200,53183.87,54794.87,53302.89,54790.6,441.921278,2021-11-07,24213130.0,BTC
3,1636156800,52133.72,53360.21,52816.7,53310.69,466.276996,2021-11-06,24857550.0,BTC
4,1636070400,52584.42,54253.61,53209.04,52832.65,728.725328,2021-11-05,38500490.0,BTC
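
Labelling and concatenating can also be done in one pass. The following is a minimal sketch on made-up frames (the `frames` dictionary is an assumption standing in for the CSV files). It uses `ignore_index=True`, which the notebook's own `concat` call does not, so the result gets a fresh 0..n-1 index instead of repeating each file's index.

```python
from pandas import DataFrame, concat

# Made-up frames standing in for the per-coin CSV files
frames = {
    'BTC': DataFrame({'close': [58850.38, 58293.11]}),
    'ETH': DataFrame({'close': [4156.27, 4150.81]}),
}

# assign() returns a labelled copy, so the original frames are left untouched
tagged = [df.assign(coin=coin) for coin, df in frames.items()]
combined = concat(tagged, axis=0, ignore_index=True)
print(combined)
```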


In [151]:
data_crypto.dtypes

unix                 int64
low                float64
high               float64
open               float64
close              float64
volume             float64
datetime    datetime64[ns]
vol_fiat           float64
coin                object
dtype: object

In [152]:
data_crypto.corr()

Unnamed: 0,unix,low,high,open,close,volume,vol_fiat
unix,1.0,-0.112409,-0.120507,-0.117722,-0.116683,0.070362,-0.265247
low,-0.112409,1.0,0.998982,0.999052,0.999385,-0.287341,0.428061
high,-0.120507,0.998982,1.0,0.999496,0.999499,-0.288812,0.453049
open,-0.117722,0.999052,0.999496,1.0,0.998845,-0.2882,0.446686
close,-0.116683,0.999385,0.999499,0.998845,1.0,-0.288171,0.44192
volume,0.070362,-0.287341,-0.288812,-0.2882,-0.288171,1.0,-0.201503
vol_fiat,-0.265247,0.428061,0.453049,0.446686,0.44192,-0.201503,1.0


Now that we have read the data and have it in the desired format, let's move on to the first part of this notebook.

In `plotly`, there are two ways to make plots: the quick one, using the `plotly.express` module, and a more verbose one, `plotly.graph_objects`, which gives you more control over your figure. We will see both throughout this notebook.

## Part I: Barcharts

https://plotly.com/python/bar-charts/

Even though barcharts are often not the best tool for plotting time series data, here we will use this data as an example to give you an idea of what it looks like. As you will see in the next parts, there are better ways to plot this data!

Let's start by making a simple plot of the daily Bitcoin closing prices.

In [153]:
import plotly.express as px

data_crypto_btc = data_crypto[data_crypto.coin == 'BTC']
fig = px.bar(data_crypto_btc, x='datetime', y='close')
fig.show()

Great! Now we have plotted the time series data as a barchart, but there seem to be too many bars. How can we average over different time periods? When the index holds datetimes, `pandas` offers the `.resample()` method:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html

First, we need to group by `datetime` and `coin` to obtain the desired values.

In [154]:
data_crypto.groupby(['datetime', 'coin']).first()

Unnamed: 0_level_0,Unnamed: 1_level_0,unix,low,high,open,close,volume,vol_fiat
datetime,coin,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2021-01-14,BTC,1610582400,30220.0100,33067.5200,30748.3700,32167.3800,3.574592e+03,1.149852e+08
2021-01-14,ETH,1610582400,895.6400,1024.9500,929.1500,1011.6900,5.701239e+04,5.767886e+07
2021-01-15,BTC,1610668800,28297.3200,32574.8700,32159.1800,30479.6200,3.928551e+03,1.197407e+08
2021-01-15,ETH,1610668800,881.0100,1030.0000,1012.1600,968.0000,7.176360e+04,6.946716e+07
2021-01-16,BTC,1610755200,29220.2700,31442.0300,30443.7900,29823.0900,1.925854e+03,5.743493e+07
...,...,...,...,...,...,...,...,...
2021-11-08,SOL,1636329600,208.2780,218.6330,215.9830,214.3630,6.582292e+04,1.411000e+07
2021-11-09,ADA,1636416000,1.8187,1.9848,1.8353,1.9579,1.048658e+07,2.053167e+07
2021-11-09,BTC,1636416000,57906.6000,59114.0000,58278.1000,58850.3800,4.172191e+02,2.455350e+07
2021-11-09,ETH,1636416000,4117.3800,4170.0000,4150.8900,4156.2700,2.657418e+03,1.104495e+07


Great, now we can select the desired column; we will use the `close` price. Then we use `unstack()` to turn this into a dataframe with dates as rows and coins as columns. To do so, we call `unstack(1)` to indicate that we are unstacking the second index level (remember that counting starts at 0 in Python).

In [155]:
data_crypto.groupby(['datetime', 'coin']).first()['close'].unstack(1)

coin,ADA,BTC,ETH,SOL
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2021-01-14,,32167.38,1011.69,
2021-01-15,,30479.62,968.00,
2021-01-16,,29823.09,1016.96,
2021-01-17,,29636.74,1019.78,
2021-01-18,,30266.54,1040.83,
...,...,...,...,...
2021-11-05,1.7179,52832.65,3879.12,204.543
2021-11-06,1.7384,53310.69,3917.34,224.080
2021-11-07,1.7491,54790.60,3994.57,216.074
2021-11-08,1.8354,58293.11,4150.81,214.363
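
The same reshaping can be tried on a four-row example. This is a minimal sketch; `long_df` is a made-up frame that reuses a few closing prices from the tables above.

```python
from pandas import DataFrame, to_datetime

long_df = DataFrame({
    'datetime': to_datetime(['2021-11-07', '2021-11-07', '2021-11-08', '2021-11-08']),
    'coin': ['BTC', 'ETH', 'BTC', 'ETH'],
    'close': [54790.60, 3994.57, 58293.11, 4150.81],
})

# Index level 0 is datetime, level 1 is coin; unstack(1) moves coin to the columns
wide_df = long_df.groupby(['datetime', 'coin']).first()['close'].unstack(1)
print(wide_df)
```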


In [156]:
data_crypto.dtypes

unix                 int64
low                float64
high               float64
open               float64
close              float64
volume             float64
datetime    datetime64[ns]
vol_fiat           float64
coin                object
dtype: object

Great. Now we have the desired dataframe! But how can we apply the resample method? We will first resample the data weekly with the parameter `W` and average over each week:

In [157]:
data_crypto.groupby(['datetime', 'coin']).first()['close'].unstack(1).resample('W').mean()

coin,ADA,BTC,ETH,SOL
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2021-01-17,,30526.7075,1004.1075,
2021-01-24,,27786.165714,1055.914286,
2021-01-31,,27118.148571,1099.154286,
2021-02-07,,30874.487143,1326.062857,
2021-02-14,,38836.01,1477.585714,
2021-02-21,,43742.064286,1554.981429,
2021-02-28,,39808.027143,1272.304286,
2021-03-07,,41202.135714,1320.49,
2021-03-14,,47545.141429,1535.805714,
2021-03-21,1.035675,48292.224286,1511.58,
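
To see what the weekly bins contain, the following sketch resamples fourteen made-up daily values (the `daily` frame is an assumption for illustration, not the crypto data).

```python
from pandas import DataFrame, Timestamp, date_range

# Fourteen made-up daily prices, starting on a Monday
daily = DataFrame(
    {'BTC': range(1, 15)},
    index=date_range('2021-11-01', periods=14, freq='D'),
)

# 'W' bins end on Sundays by default; each row is the mean of one week
weekly = daily.resample('W').mean()
print(weekly)
```

Each row is labelled with the Sunday that closes the week, which is why the dates in the table above (2021-01-17, 2021-01-24, ...) are all Sundays.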


Great! Now that we know how to group the data and resample it, we will resample it every 2 weeks:

In [158]:
data_grouped_weekly = data_crypto.groupby(['datetime', 'coin']).first()['close'].unstack(1).resample('2W').mean()

And now let's plot this data as barcharts. Here we use the `barmode` parameter as `group` to indicate that we want the variables to be grouped by their index:

In [159]:
fig = px.bar(data_grouped_weekly, barmode='group')
fig.show()

The scale of the BTC prices hides the rest of the variables. How can we fix this? One way is to standardize the data for the period observed. To do so, we subtract the mean from each column and divide by the standard deviation of each column as follows:

In [160]:
fig = px.bar((data_grouped_weekly-data_grouped_weekly.mean())/data_grouped_weekly.std(),
             barmode='group')
fig.show()
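
The standardization itself is easy to verify on a tiny table. In this sketch, `prices` is made-up data; note that `std()` in pandas uses the sample standard deviation (`ddof=1`) by default.

```python
from pandas import DataFrame

prices = DataFrame({'BTC': [30000.0, 40000.0, 50000.0], 'ADA': [1.0, 2.0, 3.0]})

# mean() and std() are computed per column and broadcast across the rows
standardized = (prices - prices.mean()) / prices.std()
print(standardized)
```

After this step both columns have mean 0 and standard deviation 1, so they can share one axis even though the raw prices differ by several orders of magnitude.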

Great! Another way to do this is to divide each column by its maximum, so that the values always lie between 0 and 1. Let's store this in a variable named `data_grouped_weekly_normalized`.

In [161]:
data_grouped_weekly_normalized = data_grouped_weekly / data_grouped_weekly.max()

And now let's plot this graph. It shows, for each currency, the values relative to the all-time high in our dataset.

In [162]:
fig = px.bar(data_grouped_weekly_normalized, barmode='group',
             color_discrete_map={
        'BTC': 'yellowgreen',
        'ETH': 'darkblue',
        'ADA': 'red',
        'SOL': 'green'
    })
fig.show()

Nice! Now you are familiar with the `plotly.express` module. Let's build the same graph ourselves using `plotly.graph_objects`, which gives us more flexibility. It's a bit more tedious, but you might want it in the future for more complex visualizations.

In [163]:
import plotly.graph_objects as go
coins=['BTC', 'ETH', 'ADA', 'SOL']

fig = go.Figure(data=[
    go.Bar(name='BTC', x=data_grouped_weekly_normalized.index,
                       y=data_grouped_weekly_normalized['BTC'],
                          marker=dict(color = 'yellowgreen')),
    go.Bar(name='ETH', x=data_grouped_weekly_normalized.index,
                       y=data_grouped_weekly_normalized['ETH'],
                          marker=dict(color = 'darkblue')),
    go.Bar(name='ADA', x=data_grouped_weekly_normalized.index,
                       y=data_grouped_weekly_normalized['ADA'],
                          marker=dict(color = 'red')),
    go.Bar(name='SOL', x=data_grouped_weekly_normalized.index,
                       y=data_grouped_weekly_normalized['SOL'],
                          marker=dict(color = 'green')),

],)

fig.update_layout(barmode='group')
fig.show()

Great, now you are familiar with grouped barcharts. Let's try the Stacked Barcharts:

In [164]:
import plotly.graph_objects as go
coins=['BTC', 'ETH', 'ADA', 'SOL']

fig = go.Figure(data=[
    go.Bar(name='BTC', x=data_grouped_weekly_normalized.index,
                       y=data_grouped_weekly_normalized['BTC'],
                          marker=dict(color = 'yellowgreen')),
    go.Bar(name='ETH', x=data_grouped_weekly_normalized.index,
                       y=data_grouped_weekly_normalized['ETH'],
                          marker=dict(color = 'darkblue')),
    go.Bar(name='ADA', x=data_grouped_weekly_normalized.index,
                       y=data_grouped_weekly_normalized['ADA'],
                          marker=dict(color = 'red')),
    go.Bar(name='SOL', x=data_grouped_weekly_normalized.index,
                       y=data_grouped_weekly_normalized['SOL'],
                          marker=dict(color = 'green')),

],)

fig.update_layout(barmode='relative')
fig.show()

As you may have noticed, we have only changed the `barmode` in `fig.update_layout()` to 'relative'. It's as simple as that!

This graph is normalized and a bit hard to interpret, but it indicates that most of the coins are close to their all-time highs. Now let's create a Normalized Stacked Barchart, where the values for each date always sum to 1:

In [165]:
data_grouped_weekly_normalized_new = data_grouped_weekly_normalized.fillna(0).div(data_grouped_weekly_normalized.fillna(0).sum(axis=1),
                                             axis=0)
fig = px.bar(data_grouped_weekly_normalized_new, 
             barmode='relative',
             color_discrete_map={
        'BTC': 'yellowgreen',
        'ETH': 'darkblue',
        'ADA': 'red',
        'SOL': 'green'
    })
fig.show()
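
The row-wise normalization used for this chart can be checked on a small example. The `relative` frame below is made up; the missing value plays the role of a coin that is not yet present in the data.

```python
from pandas import DataFrame

relative = DataFrame({
    'BTC': [0.5, 1.0],
    'ETH': [0.5, 0.6],
    'ADA': [float('nan'), 0.4],
})

# Missing coins count as 0, then each row is divided by its own total
filled = relative.fillna(0)
shares = filled.div(filled.sum(axis=1), axis=0)
print(shares)
```

Passing `axis=0` to `div` aligns the row totals with the row index, so every row is divided by its own sum and the shares in each row add up to 1.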

This graph allows us to identify when the different coins start to appear in our data, and show their relative value.

However, we see some problems in this graph. Apparently, 'ADA' and 'SOL' only appear after April and July respectively. 

What if we plot the data after July? 



In [166]:
fig = px.bar(data_grouped_weekly_normalized_new[data_grouped_weekly_normalized_new.index > to_datetime('2021-07-01')],
             barmode='relative',
             color_discrete_map={
        'BTC': 'yellowgreen',
        'ETH': 'darkblue',
        'ADA': 'red',
        'SOL': 'green'
    })
fig.show()

Now the graph looks a little better, but it's still a bit hard to interpret. The way to read this figure is that the larger the share a color takes on the Y-axis, the closer that coin is to its all-time high. However, as you may have noticed, the values always sum to 1, and are thus relative to the other coins.

So this chart is somewhat confusing. Let's move on to Exercise 1, and we will come back to this data and see if we can visualize it in a more informative manner in Part III.

## Exercise 1:

Now it's your turn! First, download the data for the following currencies relative to EUR:

- XRP
- DOT
- DOGE
- LINK

Then, once the data is downloaded, do the above processing, and convert the whole dataframe into the tabular format. Make sure that you also create the variable `datetime` in the right data type (using the pandas `to_datetime()` function).

Once you have the data in the tabular format, group it by datetime and coin, then unstack it using `unstack(1)`, converting into a dataframe with rows as dates and columns as the values of coins at the closing of each day.

Then resample the dataframe every 10 days (as opposed to the 2 weeks used before), and make a barchart relative to each coin's all-time high (i.e. `data_grouped_normalized = data_grouped / data_grouped.max()`).

Try making it both ways: with the `plotly.express` module and with the `graph_objects` one.

In [167]:
fetch_daily_data('BTC/EUR')
fetch_daily_data('DOT/EUR')
fetch_daily_data('DOGE/EUR')
fetch_daily_data('LINK/EUR')
fetch_daily_data('SHIB/EUR')

In [168]:
from pandas import read_csv

df_btc = read_csv('data/external/Coinbase_BTCEUR_dailydata.csv', parse_dates=['datetime'])
df_dot = read_csv('data/external/Coinbase_DOTEUR_dailydata.csv', parse_dates=['datetime'])
df_doge = read_csv('data/external/Coinbase_DOGEEUR_dailydata.csv', parse_dates=['datetime'])
df_link = read_csv('data/external/Coinbase_LINKEUR_dailydata.csv', parse_dates=['datetime'])
df_shib = read_csv('data/external/Coinbase_SHIBEUR_dailydata.csv', parse_dates=['datetime'])

list_dfs_euro = [df_btc, df_dot, df_doge, df_link,df_shib]

In [169]:
list_dfs_euro

[           unix       low      high      open     close       volume  \
 0    1636416000  57906.60  59114.00  58278.10  58833.47   417.096533   
 1    1636329600  54778.96  58498.00  54778.96  58293.11  1426.614721   
 2    1636243200  53183.87  54794.87  53302.89  54790.60   441.921278   
 3    1636156800  52133.72  53360.21  52816.70  53310.69   466.276996   
 4    1636070400  52584.42  54253.61  53209.04  52832.65   728.725328   
 ..          ...       ...       ...       ...       ...          ...   
 295  1610928000  28818.20  31000.00  29634.96  30266.54  2204.260517   
 296  1610841600  28055.00  30500.00  29816.15  29636.74  2239.666173   
 297  1610755200  29220.27  31442.03  30443.79  29823.09  1925.854277   
 298  1610668800  28297.32  32574.87  32159.18  30479.62  3928.550592   
 299  1610582400  30220.01  33067.52  30748.37  32167.38  3574.591683   
 
       datetime      vol_fiat  
 0   2021-11-09  2.453924e+07  
 1   2021-11-08  8.316181e+07  
 2   2021-11-07  2.421313e

In [170]:
list_coins_euro = ['BTC', 'DOT', 'DOGE', 'LINK','SHIB']
# Pair each dataframe with its coin name and add the label in place
for df, coin in zip(list_dfs_euro, list_coins_euro):
    df['coin'] = coin

In [171]:
data_crypto_euro = concat(list_dfs_euro, axis=0)

data_crypto_euro.head()

Unnamed: 0,unix,low,high,open,close,volume,datetime,vol_fiat,coin
0,1636416000,57906.6,59114.0,58278.1,58833.47,417.096533,2021-11-09,24539240.0,BTC
1,1636329600,54778.96,58498.0,54778.96,58293.11,1426.614721,2021-11-08,83161810.0,BTC
2,1636243200,53183.87,54794.87,53302.89,54790.6,441.921278,2021-11-07,24213130.0,BTC
3,1636156800,52133.72,53360.21,52816.7,53310.69,466.276996,2021-11-06,24857550.0,BTC
4,1636070400,52584.42,54253.61,53209.04,52832.65,728.725328,2021-11-05,38500490.0,BTC


In [172]:
data_crypto_btc_euro = data_crypto_euro[data_crypto_euro.coin == 'BTC']
fig = px.bar(data_crypto_btc_euro, x='datetime', y='close')
fig.show()

In [173]:
data_crypto_euro.groupby(['datetime', 'coin']).first()

Unnamed: 0_level_0,Unnamed: 1_level_0,unix,low,high,open,close,volume,vol_fiat
datetime,coin,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2021-01-14,BTC,1610582400,30220.010000,33067.520000,30748.370000,32167.380000,3.574592e+03,1.149852e+08
2021-01-14,LINK,1610582400,12.649550,14.771770,13.124700,14.768910,5.587520e+05,8.252158e+06
2021-01-15,BTC,1610668800,28297.320000,32574.870000,32159.180000,30479.620000,3.928551e+03,1.197407e+08
2021-01-15,LINK,1610668800,14.456300,17.749990,14.770170,17.220130,2.285014e+06,3.934825e+07
2021-01-16,BTC,1610755200,29220.270000,31442.030000,30443.790000,29823.090000,1.925854e+03,5.743493e+07
...,...,...,...,...,...,...,...,...
2021-11-09,BTC,1636416000,57906.600000,59114.000000,58278.100000,58833.470000,4.170965e+02,2.453924e+07
2021-11-09,DOGE,1636416000,0.240500,0.250500,0.243500,0.242300,1.422995e+07,3.447917e+06
2021-11-09,DOT,1636416000,45.378000,46.080000,46.000000,45.591000,3.144860e+04,1.433773e+06
2021-11-09,LINK,1636416000,29.403200,30.699440,29.711830,30.274460,7.527088e+04,2.278785e+06


In [220]:
data_crypto_euro.groupby(['datetime', 'coin']).first()['close'].unstack(1)

coin,BTC,DOGE,DOT,LINK,SHIB
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2021-01-14,32167.38,,,14.76891,
2021-01-15,30479.62,,,17.22013,
2021-01-16,29823.09,,,16.65589,
2021-01-17,29636.74,,,19.28302,
2021-01-18,30266.54,,,18.22327,
...,...,...,...,...,...
2021-11-05,52832.65,0.2259,44.875,28.48963,0.000054
2021-11-06,53310.69,0.2269,45.112,27.79000,0.000051
2021-11-07,54790.60,0.2304,45.208,28.00733,0.000049
2021-11-08,58293.11,0.2435,46.082,29.75077,0.000048


In [175]:
data_grouped_ten_days = data_crypto_euro.groupby(['datetime', 'coin']).first()['close'].unstack(1).resample('10D').mean()

In [176]:
fig = px.bar(data_grouped_ten_days, barmode='group')
fig.show()

In [177]:
fig = px.bar((data_grouped_ten_days-data_grouped_ten_days.mean())/data_grouped_ten_days.std(),
             barmode='group')
fig.show()

In [178]:
data_grouped_normalized = data_grouped_ten_days / data_grouped_ten_days.max()

In [179]:
fig = px.bar(data_grouped_normalized, barmode='group',
             color_discrete_map={
        'BTC': 'yellowgreen',
        'DOGE': 'darkblue',
        'DOT': 'red',
        'LINK': 'green',
        'SHIB': 'orange'
                 
    })
fig.show()

In [214]:
fig = go.Figure(data=[
    go.Bar(name='BTC', x=data_grouped_normalized.index,
                       y=data_grouped_normalized['BTC'],
                          marker=dict(color = 'yellowgreen')),
    go.Bar(name='DOGE', x=data_grouped_normalized.index,
                       y=data_grouped_normalized['DOGE'],
                          marker=dict(color = 'darkblue')),
    go.Bar(name='DOT', x=data_grouped_normalized.index,
                       y=data_grouped_normalized['DOT'],
                          marker=dict(color = 'red')),
    go.Bar(name='LINK', x=data_grouped_normalized.index,
                       y=data_grouped_normalized['LINK'],
                          marker=dict(color = 'green')),
    go.Bar(name='SHIB', x=data_grouped_normalized.index,
                       y=data_grouped_normalized['SHIB'],
                          marker=dict(color = 'orange')),

],)



fig.update_layout(barmode='group')

fig.show()

In [238]:
"""I tried building the charts in a loop but had trouble defining the colors. I still need to get
comfortable with nested loops, so I let Plotly pick the colors, which is not what I was after."""

coins = ['BTC', 'DOGE', 'DOT', 'LINK', 'SHIB']

fig = go.Figure()

# Add one bar trace per coin; Plotly assigns the default colors
for coin in coins:
    fig.add_trace(
        go.Bar(name=coin, x=data_grouped_normalized.index,
               y=data_grouped_normalized[coin]))

fig.update_layout(barmode='group')

fig.show()
        
        
        
        
 

## Part II: Bubblecharts

https://plotly.com/python/bubble-charts/

Now that we are familiar with barcharts, we want to introduce a new way of plotting data. This time, we want to extend scatterplots using bubblecharts. What is a bubblechart? It is essentially a scatterplot in which the size of each dot or circle is controlled by another numerical variable!

Let's start using the `px` module. First, we load the new data on GDPs and countries. In this example we use another dataset that can be loaded directly from `px`! This is a toy example, and it will help you understand this type of charts.

In [181]:
df_country = px.data.gapminder()

df_country.head()


Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,AFG,4
1,Afghanistan,Asia,1957,30.332,9240934,820.85303,AFG,4
2,Afghanistan,Asia,1962,31.997,10267083,853.10071,AFG,4
3,Afghanistan,Asia,1967,34.02,11537966,836.197138,AFG,4
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,AFG,4


So now we know what this data looks like.

Let's make a simple bubblechart, selecting the year 2007, with GDP per capita on the X-axis (using a logarithmic scale), life expectancy on the Y-axis, the population of each country as the size of each bubble, and the continent each country belongs to as the color. We need to set a `size_max` so that the bubbles do not get too large.

In [182]:
fig = px.scatter(df_country[df_country.year == 2007],
                 x="gdpPercap",
                 y="lifeExp",
                 size="pop", 
                 color="continent",
                 hover_name="country",
                 log_x=True,
                 size_max=60)
fig.show()

Great! Now you have seen how simple it is to make a bubblechart. Let's make another one to consolidate your knowledge of this function. We will now select only the year 2007 and the countries in the Americas. Each country now has its own color, but since there are too many countries, some colors are repeated.

In [183]:
fig = px.scatter(df_country[(df_country.year == 2007) & (df_country.continent=='Americas')],
                 x="gdpPercap",
                 y="lifeExp",
                 size="pop", 
                 color="country",
                 hover_name="country",
                 log_x=True,
                 size_max=60)
fig.show()

Lastly, `plotly` is so great that we can add an animation frame, such that we can move around the years, or press the play button and observe how the graph evolves over the years. Check the following cell:

In [184]:
import plotly.express as px

px.scatter(df_country, 
           x="gdpPercap",
           y="lifeExp",
           animation_frame="year",
           animation_group="country",
           size="pop", 
           color="continent",
           hover_name="country",
           log_x=True,
           size_max=55,
           range_x=[100,100000],
           range_y=[25,90])

## Exercise 2:

Now it's your turn! Create a bubblechart selecting all European countries for the year 2007, with life expectancy on the x axis and GDP per capita on the y axis, grouping by country, with the size of the bubbles as population and the color as population as well, using a logarithmic scale on the y axis and setting a max size of 50.

In [185]:
fig = px.scatter(df_country[(df_country.year == 2007) & (df_country.continent=='Europe')],
                 x="lifeExp",
                 y="gdpPercap",
                 size="pop", 
                 color="country",
                 hover_name="country",
                 log_y=True,  # the exercise asks for a logarithmic y axis (GDP per capita)
                 size_max=50)
fig.show()

If you really feel confident with this, try adding an `animation_frame='year'` parameter instead of selecting the year 2007 when querying. Remember to select suitable `range_x` and `range_y` parameters so that your animation doesn't go off the axes!

In [186]:
px.scatter(df_country, 
           x="lifeExp",
           y="gdpPercap",
           animation_frame="year",
           animation_group="country",
           size="pop", 
           color="continent",
           hover_name="country",
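           log_y=True,  # GDP per capita is on the y axis, so that is the axis to log-scale
           size_max=49,
           range_y=[100,100000],  # a logarithmic axis needs a positive lower bound
           range_x=[25,90])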
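
One way to choose `range_x` and `range_y` without trial and error is to derive them from the data across all years. The sketch below uses a made-up `toy` frame, and `padded_range` is a hypothetical helper, not a plotly function. The resulting lists are what you would pass as `range_x=` and `range_y=`.

```python
from pandas import DataFrame

toy = DataFrame({
    'year': [1952, 1952, 2007, 2007],
    'lifeExp': [40.0, 65.0, 60.0, 82.0],
    'gdpPercap': [500.0, 8000.0, 1500.0, 45000.0],
})

def padded_range(values, pad=0.05):
    """Return [low, high] widened by a fraction of the span (hypothetical helper)."""
    low, high = float(values.min()), float(values.max())
    margin = (high - low) * pad
    return [low - margin, high + margin]

# Linear axis: pad by a fraction of the span
range_x = padded_range(toy['lifeExp'])
# Logarithmic axis: multiplicative padding keeps the lower bound positive
range_y = [float(toy['gdpPercap'].min()) / 2, float(toy['gdpPercap'].max()) * 2]
print(range_x, range_y)
```

Computing the limits over every year, rather than a single frame, is what keeps the bubbles inside the axes for the whole animation.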

### Optional exercise for advanced users:

Can you now replicate this bubblechart using the `graph_objects` module?

In [187]:
# The original attempt passed size="pop" and failed with a ValueError: unlike plotly.express,
# graph_objects does not look up column names, so marker.size needs the numeric values themselves.

df_2007 = df_country[df_country.year == 2007]
x_country = df_2007["lifeExp"]
y_country = df_2007["gdpPercap"]

In [188]:
# Use the populations of the same 2007 rows that are being plotted
size_pop = df_2007["pop"]

fig = go.Figure(data=[go.Scatter(
    x=x_country, y=y_country,
    mode="markers",
    marker=dict(
        size=size_pop,  # numeric values, not the string "pop"
        sizemode='area',
        sizeref=2.*max(size_pop)/(40.**2),
        sizemin=4
    )
)])

fig.show()

## Part III: Linecharts

Now that we are familiar with barcharts and bubblecharts, as well as datetime variables, isn't there a better way to plot time series data? There sure is! Let's move on to Linecharts.

Let's start by making a simple plot of only the closing price of Bitcoin:

In [191]:
data_crypto_btc = data_crypto[data_crypto.coin == 'BTC']
data_crypto_btc.head()

Unnamed: 0,unix,low,high,open,close,volume,datetime,vol_fiat,coin
0,1636416000,57906.6,59114.0,58278.1,58850.38,417.219088,2021-11-09,24553500.0,BTC
1,1636329600,54778.96,58498.0,54778.96,58293.11,1426.614721,2021-11-08,83161810.0,BTC
2,1636243200,53183.87,54794.87,53302.89,54790.6,441.921278,2021-11-07,24213130.0,BTC
3,1636156800,52133.72,53360.21,52816.7,53310.69,466.276996,2021-11-06,24857550.0,BTC
4,1636070400,52584.42,54253.61,53209.04,52832.65,728.725328,2021-11-05,38500490.0,BTC


In [192]:
fig = px.line(data_crypto_btc, x='datetime', y='close')
fig.show()

That looks much better, doesn't it? How about plotting all the different coins at the same time?

In [193]:
fig = px.line(data_crypto, x='datetime', y='close', color='coin')
fig.show()

The scale doesn't really help! Let's try to use a logarithmic scale:

In [194]:
fig = px.line(data_crypto, x='datetime', y='close', color='coin', log_y=True)
fig.show()

Even though this graph looks a bit better, it doesn't really help either, as the logarithmic scale flattens all the curves. Let's create subplots instead, as we did in Notebook 1, but this time using plotly!

In [195]:
from plotly.subplots import make_subplots

fig = make_subplots(rows=4, subplot_titles=('BTC','ETH', 'ADA', 'SOL'), )
data_crypto_grouped = data_crypto.groupby(['coin', 'datetime']).first()['close'].unstack(0)


fig.add_trace(
    go.Scatter(x=data_crypto_grouped.index,
               y=data_crypto_grouped.BTC,
                              name='BTC'),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=data_crypto_grouped.index,
               y=data_crypto_grouped.ETH,
               name='ETH'),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=data_crypto_grouped.index,
               y=data_crypto_grouped.ADA,
               name='ADA'),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=data_crypto_grouped.index,
               y=data_crypto_grouped.SOL,
               name='SOL'),
    row=4, col=1
)

fig.update_layout(
    autosize=False,
    width=800,
    height=1200,)

Great! That looks much better than before! 

Now let's introduce error bars! Error bars are useful to plot volatility, for example, or to show standard errors around lines. This way, we can understand the variation in our time series data. In this example we will plot error bars as the difference between the 'high' and the 'low' of each row, as this is the difference between the highest and lowest prices during each day.

In [196]:
data_crypto_btc['high'] - data_crypto_btc['low']

0      1207.40
1      3719.04
2      1611.00
3      1226.49
4      1669.19
        ...   
295    2181.80
296    2445.00
297    2221.76
298    4277.55
299    2847.51
Length: 300, dtype: float64

Now we use that to plot it over the time series data.

In [197]:
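# Keep only the numeric price columns: averaging the text column 'coin' fails in recent pandas versions
data_for_errorbars = (data_crypto_btc.set_index('datetime')[['close', 'high', 'low']]
                      .resample('W').mean().reset_index())
fig = px.line(data_for_errorbars,
              x='datetime', y='close',
              error_y=(data_for_errorbars['high'] - data_for_errorbars['low']).tolist())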
fig.show()

## Exercise 3:

Make the same graph using subplots as above, but this time using the `px` weekly data on stocks for GOOG, AAPL, AMZN, FB, NFLX and MSFT. Notice that you now have 6 time series!

Can you also add the error bars using the volatility for each time period? (Hint: use `df.std(axis=1)` to obtain the standard deviation for each row.)
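
As a quick illustration of the hint, here is a minimal sketch on made-up values; `toy_stocks` is an assumption and not the `px` dataset.

```python
from pandas import DataFrame

toy_stocks = DataFrame({
    'GOOG': [1.0, 1.2],
    'AAPL': [1.0, 1.0],
    'AMZN': [1.0, 0.8],
})

# axis=1 computes one standard deviation per row, across the stock columns
row_std = toy_stocks.std(axis=1)
print(row_std)
```

The result has one value per date, so it can be passed directly as the array of error bar lengths. Note that it needs at least two numeric columns per row; with a single column the sample standard deviation is undefined.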

In [198]:
stocks = px.data.stocks()
data_stocks = stocks.copy()
data_stocks["datetime"] = to_datetime(stocks["date"])

In [199]:
data_stock_goog = data_stocks[["datetime","GOOG"]]
data_stock_aapl = data_stocks[["datetime","AAPL"]]
data_stock_amzn = data_stocks[["datetime","AMZN"]]
data_stock_fb = data_stocks[["datetime","FB"]]
data_stock_nflx = data_stocks[["datetime","NFLX"]]
data_stock_msft = data_stocks[["datetime","MSFT"]]

In [200]:
data_for_errorbars_goog = data_stock_goog.set_index('datetime').resample('W').mean().reset_index()
data_for_errorbars_aapl = data_stock_aapl.set_index('datetime').resample('W').mean().reset_index()
data_for_errorbars_amzn = data_stock_amzn.set_index('datetime').resample('W').mean().reset_index()
data_for_errorbars_fb = data_stock_fb.set_index('datetime').resample('W').mean().reset_index()
data_for_errorbars_nflx = data_stock_nflx.set_index('datetime').resample('W').mean().reset_index()
data_for_errorbars_msft = data_stock_msft.set_index('datetime').resample('W').mean().reset_index()

In [201]:
from plotly.subplots import make_subplots

tickers = ['GOOG', 'AAPL', 'AMZN', 'FB', 'NFLX', 'MSFT']
weekly_frames = [data_for_errorbars_goog, data_for_errorbars_aapl,
                 data_for_errorbars_amzn, data_for_errorbars_fb,
                 data_for_errorbars_nflx, data_for_errorbars_msft]

# Volatility per time period: standard deviation across the six stocks in each row.
# Selecting the stock columns first avoids the pandas FutureWarning about dropping
# non-numeric ("nuisance") columns. Calling std(axis=1) on a frame that holds a single
# stock column would return NaN for every row, so no error bars would be drawn.
# The stock data is already weekly, so these values line up row by row with the
# resampled frames.
volatility = stocks[tickers].std(axis=1)

fig = make_subplots(rows=3, cols=2, subplot_titles=tickers)

# Fill the 3x2 grid row by row: GOOG, AAPL / AMZN, FB / NFLX, MSFT
for i, (ticker, frame) in enumerate(zip(tickers, weekly_frames)):
    fig.add_trace(
        go.Scatter(x=frame.datetime,
                   y=frame[ticker],
                   error_y=dict(type='data', array=volatility),
                   name=ticker),
        row=i // 2 + 1, col=i % 2 + 1
    )

fig.update_layout(
    autosize=False,
    width=1000,
    height=1200,)

Lastly, we want to introduce violinplots (see https://plotly.com/python/violin/). These plots help you understand the distribution of your data and compare several distributions at the same time. We will again use the stock data from the `px.data` module.

In [202]:
df_stocks = px.data.stocks()

In [203]:
df_stocks.head()

Unnamed: 0,date,GOOG,AAPL,AMZN,FB,NFLX,MSFT
0,2018-01-01,1.0,1.0,1.0,1.0,1.0,1.0
1,2018-01-08,1.018172,1.011943,1.061881,0.959968,1.053526,1.015988
2,2018-01-15,1.032008,1.019771,1.05324,0.970243,1.04986,1.020524
3,2018-01-22,1.066783,0.980057,1.140676,1.016858,1.307681,1.066561
4,2018-01-29,1.008773,0.917143,1.163374,1.018357,1.273537,1.040708


In [204]:
df_stocks.dtypes

date     object
GOOG    float64
AAPL    float64
AMZN    float64
FB      float64
NFLX    float64
MSFT    float64
dtype: object

Notice that the 'date' column has dtype `object` (plain strings), not datetime. Let's create a proper datetime column from it.

In [205]:
df_stocks['datetime'] = to_datetime(df_stocks['date'])

Great, now let's derive some simple variables with the `.dt` accessor: `df['datetime'].dt` followed by the attribute we want (for example `.month` or `.year`). More info on this here: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.dayofweek.html

In [206]:
df_stocks['month'] = df_stocks['datetime'].dt.month
df_stocks['year'] = df_stocks['datetime'].dt.year
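
For a quick feel of what the `.dt` accessor offers, here is a minimal sketch on three hand-picked dates (not the stock data). It pulls out the year, the month, the day of the week and the month name.

```python
import pandas as pd

# Three hand-picked dates, converted from strings to datetime
dates = pd.Series(pd.to_datetime(['2018-01-01', '2018-06-15', '2019-12-30']))

parts = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'dayofweek': dates.dt.dayofweek,      # Monday=0 ... Sunday=6
    'month_name': dates.dt.month_name(),  # a method, so it needs parentheses
})
print(parts)
```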

Let's make a simple violinplot first:

In [207]:
fig = px.violin(df_stocks, y="GOOG")
fig.show()

Simple, right? Now let's add a boxplot inside:

In [208]:
fig = px.violin(df_stocks, y="GOOG",
                box=True)
fig.show()
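
For reference, the box drawn inside the violin marks the median and the quartiles. The following minimal sketch computes comparable numbers with pandas on a made-up series (not the stock data). Plotly's own quartile algorithm may differ slightly from the pandas default, so treat the values as an approximation of what the plot shows.

```python
import pandas as pd

# Made-up values, chosen so the quartiles are easy to check by hand
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# First quartile, median and third quartile (pandas uses linear interpolation by default)
quartiles = values.quantile([0.25, 0.5, 0.75])
print(quartiles)
```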

Can we also add points showing the values of the distribution?

In [209]:
fig = px.violin(df_stocks, y="GOOG",
                box=True,
                points='all')
fig.show()

Great! Now that we know how to use this, let's move to the next step.

Right now each stock has its own column (wide format). To use the data easily with `plotly.express`, we want it stacked into long (tabular) form, with 'date', 'NAME' and 'value' as three separate columns. The 'NAME' column holds the name of the stock.

The following cell does exactly that:

In [210]:
# Drop the helper columns first so that only the six stock columns get stacked
df_stacked = (df_stocks.set_index('date')
                       .drop(columns=['datetime', 'month', 'year'])
                       .stack()
                       .reset_index()
                       .rename(columns={'level_1': 'NAME', 0: 'value'}))
df_stacked.head()

Unnamed: 0,date,NAME,value
0,2018-01-01,GOOG,1.0
1,2018-01-01,AAPL,1.0
2,2018-01-01,AMZN,1.0
3,2018-01-01,FB,1.0
4,2018-01-01,NFLX,1.0


The cell above stacks the dataframe into three columns and then renames the new columns to "NAME" and "value".
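
As an alternative to `stack()`, the same wide-to-long reshaping can be sketched with `DataFrame.melt`. The example below uses a two-row, two-stock excerpt of the table shown earlier; the variable names `wide` and `long_df` are illustrative only.

```python
import pandas as pd

# Small wide-format excerpt: one column per stock
wide = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-08'],
    'GOOG': [1.0, 1.018172],
    'AAPL': [1.0, 1.011943],
})

# melt keeps 'date' as an identifier and stacks every other column into NAME/value pairs
long_df = wide.melt(id_vars='date', var_name='NAME', value_name='value')
print(long_df)
```

Note that `melt` orders the rows stock by stock, while `stack()` orders them date by date. For plotting with `color='NAME'` this should not matter.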

Let's plot them all together now!

In [211]:
fig = px.violin(df_stacked,
                y='value',
                color='NAME',
                box=True,
                points='all',
                range_x=[-.4, .25] # the smallest and largest x-axis values shown in the graph
                )
fig.show()

## Exercise 4:

Recreate the violinplot of the stock data using `graph_objects`. You will find all the required information here:

https://plotly.com/python/violin/

In [212]:
fig = go.Figure()

# One violin per stock, each with a hand-picked line color
stock_colors = {
    'GOOG': 'blue', 'AAPL': 'orange', 'FB': 'yellow',
    'AMZN': 'black', 'NFLX': 'darkblue', 'MSFT': 'red',
}

for name, color in stock_colors.items():
    fig.add_trace(go.Violin(y=df_stacked.loc[df_stacked['NAME'] == name, 'value'],
                            legendgroup=name, scalegroup=name, name=name,
                            line_color=color)
                 )

# Show the inner box plot and the mean line on every violin
fig.update_traces(box_visible=True, meanline_visible=True)
fig.update_layout(violinmode='group')
fig.show()

That's it for this lesson! Congratulations on finishing it!