<img src="../figures/HeaDS_logo_large_withTitle.png" width="300">

<img src="../figures/tsunami_logo.PNG" width="600">

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Center-for-Health-Data-Science/PythonTsunami/blob/spring2022/Visualizations/plotly.ipynb)

**For (anonymous) questions**, use this **[Padlet link](https://ucph.padlet.org/henrikezschach1/7f65ytua2sv0qt9g)**. 

# Plotly

## Python Open Source Graphing Library

Plotly's Python graphing library makes interactive, publication-quality graphs. It can be used to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.

![gallery](https://miro.medium.com/max/1458/1*qKpV3vkPZYoffsvFSEuw8A.png)


Plotly has an easy-to-use interface called [Plotly express](https://plotly.com/python/plotly-express/), which makes plotting with Plotly very easy. Plotly express works nicely with Pandas dataframes as input: we just need to specify which columns need to be plotted.

# Import modules

In [None]:
import pandas as pd
import plotly.express as px

# Introduction
Let's start exploring Plotly Express. The general syntax for any Plotly Express chart is `px.chart_type(df, parameters)`, so let's try a simple line chart: `px.line(df, parameters)`.

There are different ways to create the plot. We will go through all of them, but the third one usually makes the most sense.

1. using lists of values 
2. using `pandas.Series`
3. using `pandas.DataFrame` and referencing the column names 

**1. Using lists of values.**

We can create two lists of values for the `x` and `y` axis and use them as parameters for the line chart plot

In [None]:
year = list(range(1996,2020,4))
medals = [1,4,5,9,1,2]

print(year, medals)

In [None]:
px.line(x = year, y = medals)

**2. Using `pandas.Series`**

This is very much like using lists

In [None]:
year = pd.Series(year)
medals = pd.Series(medals)

print(year, medals)

In [None]:
px.line(x = year, y = medals)

**3. Using `pandas.DataFrame`**

This is most of the time the **best option**. We can plot things directly from our DataFrame of interest. We need to give the `px.chart_type()` function our dataframe using the argument `data_frame`. Then we only need to specify as `x` and `y` axis the name of the columns we want to use!

In [None]:
# We create our dataframe 
df = pd.DataFrame({"Year" : year, "Medals" : medals})
df.head()

In [None]:
px.line(data_frame = df, x = "Year" , y = "Medals")

**`Note`: If our dataframe is in wide format, we may need to change the shape to long format.** This means that we always need to have our variables of interest as columns!
Have a look at the [melt method in Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html). For example, let's make a wide dataframe:

In [None]:
df = pd.DataFrame({'Year': {0: '2000', 1: '2010', 2: '2012'},
                   'Canada': {0: 1, 1: 3, 2: 5},
                   'USA': {0: 2, 1: 4, 2: 6}})

df

In this case, we would like to have a column named "Countries" that will encompass Canada and USA. We use the `.melt()` method to do this.

In [None]:
df = pd.melt(df, id_vars=['Year'], value_vars=['Canada', 'USA'])
df

Now we can use the long format dataframe to plot

In [None]:
px.line(data_frame = df, x = "Year" , y = "value", color = "variable")
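The melted dataframe above keeps the default column names `variable` and `value`. As a minimal sketch, `pd.melt()` also accepts `var_name` and `value_name` to name those columns directly. The column name "Medals" is an assumption made for illustration, since the text does not say what the numbers represent.

```python
import pandas as pd

# Same wide table as above: one column per country
wide_df = pd.DataFrame({"Year": ["2000", "2010", "2012"],
                        "Canada": [1, 3, 5],
                        "USA": [2, 4, 6]})

# var_name / value_name replace the default "variable" / "value" column names
# ("Medals" is a hypothetical name chosen for this example)
long_df = pd.melt(wide_df, id_vars=["Year"], value_vars=["Canada", "USA"],
                  var_name="Countries", value_name="Medals")
print(long_df)
```

With these names, the plot call would use `y="Medals"` and `color="Countries"` instead of `"value"` and `"variable"`.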

## Save as variable and show

We can save our plots as variables. Then, if you would like to show your plot again, you can call it using the method `.show()`

In [None]:
fig = px.line(data_frame = df, x = "Year" , y = "value", color = "variable")

In [None]:
fig.show()

## Plotly object structure

Behind the scenes, each graph is a dictionary. When you store the chart in a variable, commonly `fig`, you can display this dictionary using `fig.to_dict()`. Use `fig["data"]` or `fig.data` to see the data elements, and `fig["layout"]` or `fig.layout` to review the design of the plot.

In [None]:
fig.data, fig.layout

As you can see, there are many attributes inside this dictionary. This means that a plot can be modified even after it is created. For example, we can use a layout template to modify the design of a plot or change the plot and axis titles

In [None]:
fig.update_traces(line={"color":"red"})
fig.update_layout(template="plotly_dark", title = "Example", yaxis_title='Medals')

We will see more ways of modifying the plots as we go through the different types of plots we can make!

# Line Charts


Although we have seen already how to make line charts in Python with Plotly, let's take a look at what else we can learn.

With `px.line`, each data point is represented as a vertex (whose location is given by the x and y columns) of a polyline mark in 2D space.

Let's use a more complicated dataframe to check it. The *gapminder* dataset from plotly contains information about different countries around the world.

In [None]:
# We use the query function to only retrieve data that we want
df = px.data.gapminder().query("country=='Canada'")

In [None]:
df.head()

In [None]:
# we can add a title to the plot using the argument `title`
fig = px.line(df, x="year", y="lifeExp", title='Life expectancy in Canada')
fig.show()

## Color argument

We can change the color of the lines based on another variable using the argument `color`. In order to show it, we will need information about different countries.

In [None]:
# By querying the data from a continent we now have information on several countries
df = px.data.gapminder().query("continent=='Oceania'")
df.head()

In [None]:
# We can now separate the data from the different countries by color using the argument `color`
fig = px.line(df, x="year", y="lifeExp", color='country')
fig.show()

If you want, instead, to **change the color of all the lines**, you need to use the method `update_traces()`.

In [None]:
fig = px.line(df, x="year", y="lifeExp", color = "country")
fig.update_traces(line={"color":"red"})
fig.show()

## Text argument

We can add the value of a variable at the coordinates given by the x and y arguments by using the `text` argument:

In [None]:
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              text="lifeExp", #The text argument allows us to plot the actual number on the datapoint
              labels={"year": "Year"}, # change year for Year
              title="Life expectancy per year")
fig.show()

Notice how the text argument positioned the text right on top of the data points? We can modify this behaviour by updating our figures using the `update_traces()` method, which will modify all data points inside `fig.data`.

In [None]:
fig.update_traces(textposition="top center")
fig.show()

## Color_discrete_map argument

We can set up the exact color of each line using a dictionary in the argument `color_discrete_map`

In [None]:
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              text="lifeExp",
              color_discrete_map = {"Australia":"Black", "New Zealand": "Red"},
              title="Life expectancy per year")

fig.update_traces(textposition="top center")
fig.show()

## Line_dash argument
By using the `line_dash` argument, we can change the dash pattern of the lines based on a variable.

In [None]:
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              text="lifeExp",
              color_discrete_map  = {"Australia":"Black", "New Zealand": "Red"},
              line_dash = "country",
              title="Life expectancy per year")

fig.update_traces(textposition="top center")
fig.show()

If you want, instead, to **change all lines to be dashed**, you need to use the `update_traces()` method. The dash type can be `dash`, `dot` or the default `solid`.

In [None]:
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              text="lifeExp",
              color_discrete_map  = {"Australia":"Black", "New Zealand": "Red"},
              title="Life expectancy per year")

fig.update_traces(textposition="top center", line = {"dash" : "dot"})
fig.show()

## Exercise

1) Create a line chart for the continents 'Oceania' and 'Africa'

Tip: use as query: `("continent=='Oceania' | continent=='Africa'")`


2) Color by the country and change the line type by the continent

3) Change the template of the plot. Check out templates [here](https://plotly.com/python/templates/)

## Exercise

What would you do if instead of a line chart you wanted to show the data in a scatter plot? 

# Scatter plots

Scatter plots are coordinate plots that use x and y coordinates to show the relationship between two variables. However, the values of the variables do not necessarily need to be linked or ordered like in a line plot.

Plotting a scatter plot is very much like plotting a line plot, but we use the `px.scatter()` function. Many of the arguments shown previously for the line plots work here as well, for example, the color argument:

In [None]:
df = px.data.gapminder().query("continent == 'Africa'")

fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 color="country")
fig.show()

## Symbol argument
If you want to further differentiate the countries from each other, you can use the `symbol` argument to assign different marker symbols, not just dots/circles.

In [None]:
fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 color="country",
                 symbol='country')
fig.show()

## Size argument
We can also play with the `size` of the dots to create **Bubble plots**

In [None]:
fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 color="country",
                 size='pop') # Using population as the size for the plot

fig.show()

## Trendline argument
We can easily add trendlines to our scatter plot using the argument `trendline`. By default you will use the Ordinary Least Squares trendline (linear regression). 

In [None]:
fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 trendline = "ols") # Fit an Ordinary Least Squares trendline

fig.show()

If you have separated the countries using the `color` argument, you will get a trendline per country.

*This will look quite ugly since there are many countries*

In [None]:
fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 color="country",
                 trendline = "ols")

fig.show()

If you want to color by a variable but still have a global trend, use the argument `trendline_scope="overall"`

In [None]:
fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 color="country",
                 trendline = "ols",
                 trendline_scope="overall")

fig.show()

## Exercise
1) Using the data from 'Oceania', create a scatter plot using GDP and population. Try to make the countries as distinguishable as possible. 

2) Model the correlation between GDP and population using a non-linear trendline (LOWESS) for each country

# Bar Charts

With `px.bar()`, each row of the DataFrame is represented as a rectangular mark. Bar plots are very useful to show quantitative information across qualitative features such as years, countries or other categorical data.

Like `px.line()` and `px.scatter()`, `px.bar()` shares many of its arguments with the other plot types.

In [None]:
df = px.data.gapminder().query("continent == 'Oceania'")
fig = px.bar(df, x='year', y='pop', color='country')
fig.show()

## Orientation argument

If we would rather see horizontal bars instead of vertical, we can set the argument `orientation` to `"h"`. We also need to **swap** the `x` and `y` arguments!

In [None]:
fig = px.bar(df, x='pop', y='year', color='country', orientation="h")
fig.show()

## Text on bar charts
You can add text to bars using the `text_auto` or `text` argument. `text_auto=True` will automatically use the same variable as the `y` argument, while you can use any variable with `text`.

In [None]:
df = px.data.medals_long()
df.head()

In [None]:
fig = px.bar(df, x="medal", y="count", color="nation", text="nation")
fig.show()

By default, Plotly will scale and rotate text labels to maximize the number of visible labels, which can result in a variety of text angles and sizes and positions in the same figure. The `textfont`, `textposition` and `textangle` trace attributes can be used to control these.

In addition, you can use the `text_auto` argument to format the text shown in the plot.

This is the default behaviour

In [None]:
df = px.data.gapminder().query("continent == 'Europe' and year == 2007 and pop > 2.e6")
fig = px.bar(df, y='pop', x='country', text_auto='.2s', # '.2s' shows 2 significant digits with an SI suffix
            title="Default: various text sizes, positions and angles")
fig.show()

Here we use `update_traces()` to control the angle (set to 0) and the position (outside the bar) and the size

In [None]:
fig = px.bar(df, y='pop', x='country', text_auto='.2s',
            title="Controlled text sizes, positions and angles")

fig.update_traces(textfont_size=12, textangle=0, textposition="outside")
fig.update_layout(yaxis_range=[0,10**8]) # We increase the range of the plot so the text fits
fig.show()

## Stacked vs Grouped Bars

When several rows share the same value of x (here Female or Male for the tips dataset), the rectangles are stacked on top of one another by default.

In [None]:
df = px.data.tips()
df.head()

In [None]:
fig = px.bar(df, x="sex", y="total_bill", color='time')
fig.show()

The default stacked bar chart behavior can be changed to grouped (also known as clustered) using the `barmode` argument:

In [None]:
fig = px.bar(df, x="sex", y="total_bill",
             color='smoker', barmode='group')
fig.show()

# Histograms

In statistics, a histogram is a representation of the distribution of numerical data, where the data are binned and the count for each bin is represented. More generally, in Plotly a histogram is an aggregated bar chart, with several possible aggregation functions (e.g. sum, average, count...) which can be used to visualize data on categorical and date axes as well as linear axes.

Compared to `px.bar()`, `px.histogram()` can work with only the `x` argument, which can be a continuous or categorical variable

In [None]:
fig = px.histogram(df, x="total_bill", title = "Continuous variable")
fig.show()

In [None]:
fig = px.histogram(df, x="day", title="Categorical variable")
fig.show()

`px.histogram()` also shares the `color`, `text_auto` and `barmode` argument

In [None]:
fig = px.histogram(df, x="total_bill", color="sex", text_auto=True)
fig.show()

## Bins argument

By default, the number of bins is chosen so that this number is comparable to the typical number of samples in a bin. This number can be customized with the `nbins` argument:

In [None]:
fig = px.histogram(df, x="total_bill", nbins=20)
fig.show()

## Histnorm argument

The default mode is to represent the count of samples in each bin. With the `histnorm` argument, it is also possible to represent the **percentage or fraction** of samples in each bin (`histnorm='percent'` or `histnorm='probability'`), a **density histogram** (`histnorm='density'`; the sum of all bar areas equals the total number of sample points), or a **probability density histogram** (`histnorm='probability density'`; the sum of all bar areas equals 1).

In [None]:
fig = px.histogram(df, x="total_bill", histnorm='probability density')
fig.show()

## Histfunc and y argument

For each bin of `x`, one can compute a function of data using `histfunc`. The argument of `histfunc` is the dataframe column given as the `y` argument. The plot below shows that the average tip increases with the total bill.

In [None]:
fig = px.histogram(df, x="total_bill", y="tip", histfunc='avg')
fig.show()

The default `histfunc` is `sum` if `y` is given, and works with categorical as well as binned numeric data on the `x` axis

**Note**: As noted above `px.bar()` will result in one rectangle drawn per row of input. This can sometimes result in a striped look as in the tips examples above. To combine these rectangles into one per color per position, you can use `px.histogram()`

In [None]:
fig = px.histogram(df, x="sex", y="total_bill",
             color='time')
fig.show()

In [None]:
fig = px.bar(df, x="sex", y="total_bill", color='time')
fig.show()

## Exercises

1) Using the tips dataset, create a chart that displays the average total bill depending on the day of the week

2) Create a chart using the "Oceania" gapminder and show the evolution of GDP per year. Separate the countries (non-stacked plot) and show the GDP value on the plot.

# Box plots and violin plots

Box plots and violin plots are another nice way of showing data distributions. `px.box()` and `px.violin()` share almost all their arguments and can be used interchangeably.

In [None]:
df = px.data.tips()
fig = px.box(df, y="tip", x="smoker", color="sex")
fig.show()

In [None]:
fig = px.violin(df, y="tip", x="smoker", color="sex")
fig.show()

## Points argument
You can show the underlying data inside the plots by setting the argument `points="all"`, show only the outliers with `points="outliers"`, or hide all points with `points=False`.

In [None]:
fig = px.violin(df, y="total_bill", x="smoker", color="sex", points = "all")
fig.show()

In [None]:
fig = px.box(df, y="total_bill", x="smoker", color="sex", points = False)
fig.show()

## Boxplot inside violin
You can show a boxplot inside a violin plot using `box=True`

In [None]:
fig = px.violin(df, y="tip", x="smoker", color="sex", box=True)
fig.show()

## Notched boxplot

You can add notches to your boxplot using `notched=True`

In [None]:
fig = px.box(df, y="total_bill", x="smoker", color="sex", points="all", notched=True)
fig.show()

## Show mean
We can show the mean in our boxplot by updating the traces with `boxmean=True`, and in our violin plots with `meanline_visible=True`.

In [None]:
fig = px.box(df, y="total_bill", x="smoker", color="sex", points="all", notched=True)
fig.update_traces(boxmean=True)
fig.show()

In [None]:
fig = px.violin(df, y="total_bill", x="smoker", color="sex", points="all", box=True)
fig.update_traces(meanline_visible=True)
fig.show()

# Heatmaps

The `px.imshow()` function can be used to display heatmaps (as well as full-color images, as its name suggests). It accepts both array-like objects like lists of lists, as well as pandas.DataFrame objects. Heatmaps are particularly useful to display correlations between the variables of the data

In [None]:
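df = px.data.tips()
# Correlations are only defined for numeric columns, so we select those first
px.imshow(df.select_dtypes("number").corr(), text_auto=True)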

We can modify the color scale using the argument `color_continuous_scale`

In [None]:
px.imshow(df.select_dtypes("number").corr(), text_auto=True, color_continuous_scale='RdBu_r')

We can also explicitly set the range of the color scale using the `range_color` argument.

In [None]:
px.imshow(df.select_dtypes("number").corr(), text_auto= '.2f',
          color_continuous_scale='RdBu_r', range_color=[-1,1])

# Advanced plotting

##  Facet_row and facet_col arguments
Another cool thing we can do in many types of plots is to split the chart into rows or columns depending on a variable. For example, we can divide the information of life expectancy into different plots using the variable "country"

In [None]:
df = px.data.gapminder().query("continent=='Oceania'")
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              facet_col ="country",
              text="lifeExp",
              title="Life expectancy per year")

fig.update_traces(textposition="top center")
fig.show()

## Plot marginals

In **scatter** and **histogram** plots, you can add extra plots on the margins of your figure (called [Plot Marginals](https://plotly.com/python/marginal-plots/)), for instance "histogram", "rug", "box", or "violin" plots. These plots can be added with the arguments `marginal_x` and `marginal_y` (scatter plots) or `marginal` (histograms).

In [None]:
df = px.data.iris()
df.head()

In [None]:
fig = px.scatter(df,
                 x="sepal_length",
                 y="sepal_width",
                 color="species",
                 marginal_x="box",
                 marginal_y="rug",
                 size='petal_width')
fig.show()

## Exercise

1) Can you get a scatter plot with a histogram instead of a rug distribution plot? 

2) Divide the previous plot using the species variable

## Error argument

In **scatter**, **line** and **bar** plots we can show error bar information, such as confidence intervals or measurement errors, using the `error` arguments. You can choose between displaying the error in the y or x axis (`error_y` and `error_x`, respectively).

**Note**: You will need another variable that contains such information! Below, we create an error variable for showcasing. 

In [None]:
df = px.data.gapminder().query("continent=='Oceania'")
df['e'] = df["lifeExp"]/100 # We create an artificial error variable just to showcase the argument
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              color_discrete_map  = {"Australia":"Black", "New Zealand": "Red"},
              error_y='e',
              title="Life expectancy per year")

fig.show()

## Modifying Tooltips

Tooltips are the square popups that appear when you hover the mouse over a data point in the plot. We can modify the behaviour of these:

* `hover_name` - highlights the value of this column at the top of the tooltip
* `hover_data` - lets you add or remove tooltip entries by setting them to True/False
* `labels` - lets you rename the column names inside the tooltip

In [None]:
df = px.data.gapminder().query("continent=='Oceania'")
fig = px.line(df, x="year", y="lifeExp", color='country')
fig.show()

In [None]:
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              hover_name="country",
              hover_data = {"country" : False}, # we remove country from the tooltip
              labels={"year": "Year"}, # change year for Year
              title="Life expectancy per year")
fig.show()

## Range Slider and Selector in Python


You can use sliders to navigate the range of your axis. This can for instance be very useful when visualizing time-series data. (https://plotly.com/python/reference/layout/xaxis/#layout-xaxis-rangeslider)

In [None]:
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              facet_col ="country",
              hover_name="country",
              text="lifeExp",
              color_discrete_map  = {"Australia":"Black", "New Zealand": "Red"},
              line_dash = "country",
              title="Life expectancy per year")

fig.update_traces(textposition="top center")
fig.update_xaxes(rangeslider_visible=True)
fig.show()

## Exercise

1) Using the gapminder data for Africa, create a scatter plot with a [range selector](https://plotly.com/python/reference/layout/xaxis/#layout-xaxis-rangeselector).

In [None]:
df = px.data.gapminder().query("continent=='Africa'")
fig = px.scatter(df,
              x="year",
              y="pop",
              color='country',
              title="Population per year")

fig.update_xaxes(rangeselector=dict(
            buttons=list([
                dict(count=10,
                     label="10y",
                     step="year")])),
                visible = True, type = 'date',
                rangeslider_visible=True)

fig.show()

2) Modify the tool tip so that when you hover over it will provide information about life expectancy, population, GDP and country code.

In [None]:
df.columns

In [None]:
fig = px.scatter(df,
              x="year",
              y="pop",
              color='country',
              title="Population per year",
              hover_data = {"country" : False, "gdpPercap":True,
                           'lifeExp' : True, 'pop' : True, 'iso_num' : True})

fig.update_xaxes(rangeselector=dict(
            buttons=list([
                dict(count=10,
                     label="10y",
                     step="year")])),
                visible = True, type = 'date',
                rangeslider_visible=True)

fig.show()

## Changing axis ticks

If we do not like the ticks on our axis, we can change them using the method `update_xaxes()` or `update_yaxes()`. We specify the labels we would like to show (`ticktext`) in place of the actual values (`tickvals`).

In [None]:
df = px.data.gapminder().query("continent=='Oceania'")

fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              facet_col ="country",
              hover_name="country",
              text="lifeExp",
              color_discrete_map  = {"Australia":"Black", "New Zealand": "Red"},
              line_dash = "country",
              title="Life expectancy per year")

fig.update_xaxes(
    ticktext=["50s", "60s", "70s", "80s", "90s", "00s"],
    tickvals=["1950", "1960", "1970", "1980", "1990", "2000"],
)
fig.show()

## Animating your plot

Several Plotly Express functions support the creation of animated figures through the `animation_frame` and `animation_group` arguments (https://plotly.com/python/animations/).

In order to make the animation look nicer, we will use the `orientation` argument to make the plot horizontal. In addition, the variable `gdpPercap` has too many decimals. We can change the look of the text value by using the `update_traces()` method again, with a text template that displays only 2 decimals.

In [None]:
df = px.data.gapminder().query("continent=='Oceania'")

fig = px.bar(df, 
             y="country", 
             x="gdpPercap", 
             color="country",
             orientation="h", 
             animation_frame="year",
             animation_group="country",
            title="Evolution of GDP",
            text="gdpPercap", range_x=[5000, 40000])

fig.update_traces(texttemplate='%{text:.2f}')
fig.show()

## Exercise

Animate a bar plot of the African gapminder data so that we see the evolution of life expectancy over time. Remember to separate the countries.