<img src="https://github.com/Center-for-Health-Data-Science/PythonTsunami/blob/oct_2022_3days/figures/HeaDS_logo_large_withTitle.png?raw=1" width="300">

<img src="https://github.com/Center-for-Health-Data-Science/PythonTsunami/blob/oct_2022_3days/figures/tsunami_logo.PNG?raw=1" width="600">



# Plotly

## Python Open Source Graphing Library

Plotly's Python graphing library makes interactive, publication-quality graphs. It supports line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple axes, polar charts, and bubble charts.

![gallery](https://miro.medium.com/max/1458/1*qKpV3vkPZYoffsvFSEuw8A.png)


Plotly has an easy-to-use interface called [Plotly Express](https://plotly.com/python/plotly-express/), which makes plotting with Plotly very easy. Plotly Express works nicely with Pandas dataframes as input: we just need to specify which columns should be plotted.

# Import modules

In [1]:
import pandas as pd
import plotly.express as px

# Introduction
Let's start exploring Plotly Express. The general syntax for *any* Plotly Express chart is

`px.chart_type(data, arguments)`


Let's try a simple line chart: `px.line(data, arguments)`.
We first need to create some data.

In [2]:
year = list(range(1996,2020,4))
medals = [1,4,5,9,1,2]

print(year)
print(medals)

[1996, 2000, 2004, 2008, 2012, 2016]
[1, 4, 5, 9, 1, 2]


In [3]:
# We create our dataframe

df = pd.DataFrame({"Year" : year, "Medals" : medals})
df.head()

Unnamed: 0,Year,Medals
0,1996,1
1,2000,4
2,2004,5
3,2008,9
4,2012,1


Now we insert this data into our general formula:

`px.chart_type(data, arguments)`

In [4]:
px.line(data_frame = df, x = "Year" , y = "Medals")

## Shaping your dataframe to fit to plotly

If our dataframe is in wide format, we may need to change the shape to long format. This is because in plotly we always need to have our variables of interest as columns!

For example, let's make a wide dataframe:


In [6]:
df = pd.DataFrame({'Year': {0: '2004', 1: '2008', 2: '2012', 3: '2016'},
                   'Canada': {0: 4, 1: 3, 2: 5, 3: 3},
                   'USA': {0: 5, 1: 9, 2: 1, 3: 2}})

df

Unnamed: 0,Year,Canada,USA
0,2004,4,5
1,2008,3,9
2,2012,5,1
3,2016,3,2


We can easily plot the medals for either USA or Canada:

In [7]:
px.line(data_frame = df, x = "Year" , y = "Canada")

But what if we want to have both countries in the same plot?

In this case, we would like one column that holds the country name and one column that holds the medal counts for both Canada and USA. We use the [`pd.melt()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html) function to do this.

In [8]:
long_df = pd.melt(df, id_vars=['Year'], value_vars=['Canada', 'USA'])
long_df

Unnamed: 0,Year,variable,value
0,2004,Canada,4
1,2008,Canada,3
2,2012,Canada,5
3,2016,Canada,3
4,2004,USA,5
5,2008,USA,9
6,2012,USA,1
7,2016,USA,2


Now we have all the medal values in the same column and we can easily plot them.

We may also want to update the column names of our `long_df` to something more meaningful. Do you remember how to do that from yesterday?

In [9]:
#update column names to 'country' and 'medals'
long_df.rename(columns={'variable': 'country', 'value': 'medals'}, inplace=True)
long_df

Unnamed: 0,Year,country,medals
0,2004,Canada,4
1,2008,Canada,3
2,2012,Canada,5
3,2016,Canada,3
4,2004,USA,5
5,2008,USA,9
6,2012,USA,1
7,2016,USA,2
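As a side note, the reshaping and the renaming can be combined in one step. The sketch below assumes the same wide table as above (rebuilt here as `wide_df`, a name used only for this example) and relies on the `var_name` and `value_name` parameters of `pd.melt()`.

```python
import pandas as pd

# Same wide table as above: one column per country
wide_df = pd.DataFrame({
    "Year": ["2004", "2008", "2012", "2016"],
    "Canada": [4, 3, 5, 3],
    "USA": [5, 9, 1, 2],
})

# var_name / value_name name the two new columns directly,
# so no separate rename step is needed
long_df2 = pd.melt(
    wide_df,
    id_vars=["Year"],
    value_vars=["Canada", "USA"],
    var_name="country",
    value_name="medals",
)
print(long_df2.columns.tolist())
```

The result should have the same shape and content as the renamed `long_df`.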


Now we can use the long format dataframe to plot both countries in one chart:

In [None]:
#What do I have to put?

In [10]:
#@title Solution
px.line(data_frame = long_df, x = "Year" , y = "medals", color = "country")

## Save as variable and show

We can save our plots as variables. Then, if you would like to show your plot again, you can call it using the method `.show()`

In [16]:
fig = px.line(data_frame = long_df, x = "Year" , y = "medals", color = "country")

In [17]:
fig.show()
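Saving the plot in a variable also lets us look inside it. The following is a small sketch, assuming the long medals table from above (rebuilt here so the block runs on its own): each line in the chart is stored as one "trace" in `fig.data`.

```python
import pandas as pd
import plotly.express as px

# Long-format medals table, rebuilt so this block is self-contained
long_df = pd.DataFrame({
    "Year": ["2004", "2008", "2012", "2016"] * 2,
    "country": ["Canada"] * 4 + ["USA"] * 4,
    "medals": [4, 3, 5, 3, 5, 9, 1, 2],
})

fig = px.line(data_frame=long_df, x="Year", y="medals", color="country")

# Each colored line is one trace; the trace name comes from the color column
trace_names = [trace.name for trace in fig.data]
print(type(fig).__name__, trace_names)
```

Calling `fig.show()` afterwards displays the figure as before.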

# Line plot arguments

We have already seen that we need the `x` and `y` arguments to define what is on which axis. But there are more arguments we can use to customize our line plots!

Let's use a dataframe with more rows and columns. We will make use of the *gapminder* dataset which is already integrated in plotly. We can load it by calling `px.data.gapminder()`. Let's see what kind of dataset this is:


In [11]:
gapminder_data = px.data.gapminder()
gapminder_data

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,AFG,4
1,Afghanistan,Asia,1957,30.332,9240934,820.853030,AFG,4
2,Afghanistan,Asia,1962,31.997,10267083,853.100710,AFG,4
3,Afghanistan,Asia,1967,34.020,11537966,836.197138,AFG,4
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,AFG,4
...,...,...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306,ZWE,716
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786,ZWE,716
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960,ZWE,716
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623,ZWE,716


For now we would like to only use countries from Oceania. Can you help me to subset the dataframe?

In [None]:
#Let's subset the data

In [39]:
#@title Solution
df = gapminder_data.loc[gapminder_data['continent'] == 'Oceania']
df.sample(5)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
1103,New Zealand,Oceania,2007,80.204,4115771,25185.00911,NZL,554
1101,New Zealand,Oceania,1997,77.55,3676187,21050.41377,NZL,554
61,Australia,Oceania,1957,70.33,9712569,10949.64959,AUS,36
1097,New Zealand,Oceania,1977,72.22,3164900,16233.7177,NZL,554
70,Australia,Oceania,2002,80.37,19546792,30687.75473,AUS,36


## `Color` argument

As shown above, we can change the color of the lines based on a dataframe column by using the argument `color`. In this example, we plot the life expectancy column against the year column, and the lines are colored by the content of the country column. This also gives us separate lines for the separate countries.

In [20]:
# We can separate the data from the different countries by color using the argument `color`
# Separating the px.line call into several lines like this is purely aesthetic. It does not influence the flow of the execution.
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country')
fig.show()

## `Color_discrete_map` argument

We can also decide the color palette to use with `color_discrete_map`. In this case, we need to specify for each level of the variable to color by, here `country`, what color should be used:

In [21]:
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              color_discrete_map = {"Australia":"Black", "New Zealand": "Red"})

fig.show()

## `Title` argument

We can also already pass a title when we make the plot with the `title` argument.

In [22]:
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              color_discrete_map = {"Australia":"Black", "New Zealand": "Red"},
              title="Life expectancy in Oceania")
fig.show()

## `Text` argument

We can further display the value of each 'dot' in the line (from the x and y values) by using the `text` argument.

In [25]:
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              color_discrete_map = {"Australia":"Black", "New Zealand": "Red"},
              title="Life expectancy per year",
              text="lifeExp") #The text argument allows us to plot the actual number on the datapoint
fig.show()

Notice how the `text` argument positioned the text right on top of the data points? We can modify this behaviour by updating our figure with the `update_traces()` method, which modifies all traces (here, one line per country) stored in `fig.data`.

In [26]:
fig.update_traces(textposition="top center")
fig.show()

## `Line_dash` argument
By using the `line_dash` argument, we can change the dash pattern of the lines based on a variable.

In [27]:
fig = px.line(df,
              x="year",
              y="lifeExp",
              color='country',
              text="lifeExp",
              color_discrete_map  = {"Australia":"Black", "New Zealand": "Red"},
              line_dash = "country",
              title="Life expectancy per year")

fig.update_traces(textposition="top center")
fig.show()

## `Line_dash_map` argument

Similar to `color_discrete_map` there is also `line_dash_map` to specify the line type at creation.

Note that for this to work you need to specify the `line_dash` argument (what column the dashing should depend on), otherwise a dash_map makes no sense.

In [41]:
fig = px.line(df,
              x="year",
              y="lifeExp",
              color="country",
              text="lifeExp",
              color_discrete_map  = {"Australia":"Black", "New Zealand": "Red"},
              line_dash = "country",
              line_dash_map = {"Australia":"dash", "New Zealand": "solid"},
              title="Life expectancy per year")
fig.show()

## Exercise 1: Line graphs (5 mins)

Now you!

1) Create a line graph of life expectancy per year for the three countries Denmark, Romania and Ghana.



In [33]:
#create the dataframe and verify that it has the data you want
#we're using `isin` instead of writing (gapminder_data['country'] == 'Denmark') | (gapminder_data['country'] == 'Ghana') etc.
df = gapminder_data.loc[gapminder_data['country'].isin(['Denmark', 'Ghana', 'Romania'])]
df.head()


Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
408,Denmark,Europe,1952,70.78,4334000,9692.385245,DNK,208
409,Denmark,Europe,1957,71.81,4487831,11099.65935,DNK,208
410,Denmark,Europe,1962,72.35,4646899,13583.31351,DNK,208
411,Denmark,Europe,1967,72.96,4838800,15937.21123,DNK,208
412,Denmark,Europe,1972,73.47,4991596,18866.20721,DNK,208


In [None]:
#now make the line plot

2) Color each line by the country. What do you observe when comparing to the previous plot?

3) Give your plot a title.

## Quiz 1

What would you do if instead of a line chart you wanted to show the data in a scatter plot?

# Scatter plots

Scatter plots are coordinate plots that use x and y coordinates to show the relationship between two variables. However, the values of the variables do not necessarily need to be linked or ordered like in a line plot.

Plotting a scatter plot is very much like plotting a line plot, but we use the `px.scatter()` function. Many of the arguments shown previously for the line plots work here as well, for example, the color argument:

In [44]:
df = gapminder_data.loc[gapminder_data["continent"] == 'Europe']

fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 color="country")
fig.show()

## Symbol argument
If you want to further differentiate the countries from each other, you can use the `symbol` argument to get different types of symbols, not just dots/circles.

In [35]:
fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 color="country",
                 symbol='country')
fig.show()

## Size argument
We can also play with the `size` of the dots to create **Bubble plots**

In [36]:
fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 color="country",
                 size='pop') # Using population as the size for the plot

fig.show()
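If the bubbles come out too small or too large, the `size_max` argument of `px.scatter()` sets the size of the largest marker. A minimal sketch, assuming we only keep the year 2007 so that each country contributes a single bubble (the subset name `europe_2007` is made up for this example):

```python
import plotly.express as px

gapminder_data = px.data.gapminder()

# One year only, so each country contributes a single bubble
europe_2007 = gapminder_data.loc[
    (gapminder_data["continent"] == "Europe") & (gapminder_data["year"] == 2007)
]

fig = px.scatter(
    europe_2007,
    x="lifeExp",
    y="gdpPercap",
    color="country",
    size="pop",
    size_max=40,  # maximum marker size (the default is 20)
)
print(len(fig.data), "traces, one per country")
```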

## Trendline argument
We can easily add trendlines to our scatter plot using the argument `trendline`. With `trendline="ols"` we get an Ordinary Least Squares trendline (linear regression). Note that trendlines require the `statsmodels` package to be installed.

We quickly see the relationship between GDP and life expectancy is not linear for Europe in general.

In [45]:
fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 trendline = "ols") # fitting a trendline with ordinary least squares

fig.show()
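To make concrete what an OLS trendline is, here is a small sketch that fits the same kind of straight line by hand with NumPy. This is not how Plotly computes it internally (it delegates to `statsmodels`), but a least-squares straight line through the same points should coincide with it.

```python
import numpy as np
import plotly.express as px

gapminder_data = px.data.gapminder()
europe = gapminder_data.loc[gapminder_data["continent"] == "Europe"]

x = europe["lifeExp"].to_numpy()
y = europe["gdpPercap"].to_numpy()

# A degree-1 polynomial fit is an ordinary least squares straight line
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals of the straight-line fit; a curved pattern in them
# hints that the relationship is not linear
residuals = y - (slope * x + intercept)
print(f"slope: {slope:.1f} GDP per capita per year of life expectancy")
```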

If you have separated the countries using the `color` argument, you will get one trendline per country.

*This will look quite ugly since there are many countries. Some of them actually look like the relationship could be linear.*

In [46]:
fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 color="country",
                 trendline = "ols")

fig.show()

If you want to color by a variable but still have a global trend, use the argument **`trendline_scope="overall"`**. We will also change to a non-linear fit called LOWESS (Locally Weighted Scatterplot Smoothing). This type of fit is also sometimes called LOESS, if you are familiar with that term.


In [47]:
fig = px.scatter(df,
                 x="lifeExp",
                 y="gdpPercap",
                 color="country",
                 trendline = "lowess",
                 trendline_scope="overall")

fig.show()

## Exercise 2: Scatter plots and trendlines (10 mins)
1) Using the data from 'Africa', create a scatter plot of GDP versus population. You will first have to subset the data like we did above for Europe.

Use different symbols and colors for the different countries.

2) Make two separate plots that model the correlation between GDP and population for each country, once using an OLS fit and once a LOWESS fit. Which fit do you think looks more convincing?

In [None]:
#ols

In [None]:
#lowess

# Bar Charts

With `px.bar()`, each row of the DataFrame is represented as a rectangular mark. Bar plots are very useful to show quantitative information across qualitative features such as years, countries or other categorical data.

`px.bar()` shares a lot of arguments with line and scatter plots.

In [71]:
df = gapminder_data.loc[gapminder_data["continent"] == 'Oceania']
fig = px.bar(df, x='year', y='pop', color='country')
fig.show()

## Orientation argument

If we would rather see horizontal bars instead of vertical, we can set the argument `orientation` to `"h"`. Note that we need to **change the order** of the `x` and `y` arguments now!

In [49]:
fig = px.bar(df, x='pop', y='year', color='country', orientation="h")
fig.show()

## Barmode argument

Using `barmode='group'` lets us have the bars for the different color groups next to each other instead of on top of each other.

In [72]:
fig = px.bar(df, x='year', y='pop', color='country', barmode='group')
fig.show()

## Text on bar charts
You can add text to bars using the `text_auto` or `text` argument. `text_auto=True` will automatically use the same variable as the `y` argument, while with `text` you can refer to any other column.


In [58]:
fig = px.bar(df, x='year', y='pop', color='country', text = 'continent')
fig.show()

In [78]:
fig = px.bar(df, x='year', y='pop', color='country', text_auto='.2s') #'.2s' shows two significant digits with an SI prefix (e.g. 17M)
fig.show()

## Exercise 3: Bar charts (10 mins)

1) Use the gapminder data for Oceania and show the GDP for each year in a bar plot.


2) Now separate the bars into countries and put them next to each other instead of stacked on top of each other.

3) Now, add the GDP as text with two digits.

# Histograms

In statistics, a histogram is a representation of the distribution of numerical data, where the data are binned and the count for each bin is represented. More generally, in Plotly a histogram is an aggregated bar chart, with several possible aggregation functions (e.g. sum, average, count...) which can be used to visualize data on categorical and date axes as well as linear axes.

Compared to `px.bar()`, `px.histogram()` can work with only the `x` argument, which can be a continuous or categorical variable.

Let's use the tips dataset, which reports on tips given by customers and some information about each customer. It looks like this:


In [73]:
tips_df = px.data.tips()
tips_df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [74]:
fig = px.histogram(tips_df, x="total_bill", title = "Total bill")
fig.show()

## Bins argument

By default, the number of bins is chosen so that this number is comparable to the typical number of samples in a bin. The number of bins can be customized with the `nbins` argument:

In [75]:
fig = px.histogram(tips_df, x="total_bill", nbins=20)
fig.show()

## Split by color

`px.histogram()` also shares the `color`, `text_auto` and `barmode` arguments of the bar plot.

In [76]:
fig = px.histogram(tips_df,
                   x="total_bill",
                   color="sex",
                   barmode='group',
                   text_auto=True)
fig.show()

So here we have actually two histograms next to each other:

One in red showing how much the total bill was when men paid, and another one in blue for the women.

## Histnorm argument

We can see higher bars for men when the bill was larger (20 dollars or more) compared to women.

However, when we investigate the dataframe we find that we actually have only about half as many data points of women paying:

In [45]:
tips_df.value_counts('sex')

sex
Male      157
Female     87
dtype: int64

So just comparing counts doesn't give us the full picture. What if we want to know whether men are more likely to have a higher total bill? We'll need to switch from looking at counts (1, 2, 4, 11 men) to **looking at the probability**, i.e. the fraction of each group that falls into each bin.

We do this by using the `histnorm` argument.

In [77]:
fig = px.histogram(tips_df,
                   x="total_bill",
                   color="sex",
                   barmode='group',
                   histnorm='probability')
fig.show()

What happens here is that we say if all men in the dataset are 100%, then how many have a total bill of 8 - 10 dollars? Now we can compare between the two groups.
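The same normalization can be reproduced by hand with pandas to see what the bars mean. This is only a sketch: the 10-dollar bin edges below are chosen for readability and do not match the bins Plotly picks automatically.

```python
import pandas as pd
import plotly.express as px

tips_df = px.data.tips()

# Hypothetical 10-dollar bins, chosen for readability; Plotly picks its own bins
bin_edges = list(range(0, 70, 10))
bill_bin = pd.cut(tips_df["total_bill"], bins=bin_edges)

# Raw counts per bin and sex (what the default histogram shows)
counts = pd.crosstab(bill_bin, tips_df["sex"])

# Divide each column by its total: the share of each sex falling in each bin
fractions = pd.crosstab(bill_bin, tips_df["sex"], normalize="columns")

print(counts)
print(fractions.round(2))
```

Each column of `fractions` sums to 1, which is why the two groups become comparable even though there are far fewer women in the data.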

## Exercise 4: Histogram (5 mins)

Make a histogram of the tip value from the tips dataframe. Then split it up by whether the customer is a smoker or not, and put the bars next to each other. Change from looking at counts to looking at probabilities (like above). You can do this step by step or all at once.

What do you observe? Is there a difference in how much smokers and non-smokers tip?

# Box plots and violin plots

Box plots and violin plots are another nice way of showing data distributions. `px.box()` and `px.violin()` share almost all their arguments and can be used interchangeably.

Let's start with a simple boxplot.

In [85]:
tips_df = px.data.tips()
fig = px.box(tips_df, y='total_bill', x='smoker')
fig.show()

Boxplots show the quartiles of the data distribution. You can also see the values when you hover your mouse over the plot.

<img src="https://github.com/Center-for-Health-Data-Science/PythonTsunami/blob/2024_april/figures/quartile-percentile.jpg?raw=1" width="600">



Just like barplots and histograms, we can also split up boxplots by using the `color` argument:

In [86]:
fig = px.box(tips_df, y="total_bill", x="smoker", color="sex")
fig.show()

Switching to a violin plot shows the actual shape of the distribution:

In [87]:
fig = px.violin(tips_df, y="total_bill", x="smoker", color="sex")
fig.show()

## Points argument
You can show the underlying data inside the plots by setting the argument `points="all"`.

Other options are to show only outliers: `points="outliers"` (this is the default), or not show any points at all: `points=False`

In [88]:
fig = px.violin(tips_df, y="total_bill", x="smoker", color="sex", points = "all")
fig.show()

## Boxplot inside violin
Or you can have a boxplot inside the violin plot by using `box=True`.

In [89]:
fig = px.violin(tips_df, y="total_bill", x="smoker", color="sex", box=True)
fig.show()

## Show mean
We can show the mean in our boxplots by updating the traces with `boxmean=True`, and in our violin plots with `meanline_visible=True`.

In [90]:
fig = px.box(tips_df, y="total_bill", x="smoker", color="sex", points="all")
fig.update_traces(boxmean=True)
fig.show()

In [92]:
fig = px.violin(tips_df, y="total_bill", x="smoker", color="sex", points="all", box=True)
fig.update_traces(meanline_visible=True)
fig.show()
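The mean and median lines drawn in the plots can be cross-checked with a quick pandas summary. A minimal sketch, using the same grouping as the plots above (smoker and sex):

```python
import plotly.express as px

tips_df = px.data.tips()

# Mean and median of the total bill for each smoker/sex combination
summary = (
    tips_df.groupby(["smoker", "sex"])["total_bill"]
    .agg(["mean", "median"])
    .round(2)
)
print(summary)
```

Hovering over the boxes in the figure should show values that agree with this table.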

## Exercise 5: Boxplots and violin plots (10 mins)

Have a look at the following dataframe. What information does it contain?

In [None]:
df = gapminder_data.loc[(gapminder_data["continent"] == 'Europe') & (gapminder_data['year'].isin([1987,2007]))]
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
19,Albania,Europe,1987,72.0,3075321,3738.932735,ALB,8
23,Albania,Europe,2007,76.423,3600523,5937.029526,ALB,8
79,Austria,Europe,1987,74.94,7578903,23687.82607,AUT,40
83,Austria,Europe,2007,79.829,8199783,36126.4927,AUT,40
115,Belgium,Europe,1987,75.35,9870200,22525.56308,BEL,56


1) Using this dataframe, make a boxplot of life expectancy versus the year.

2) Now do the same as a violin plot.

Which one do you prefer as a visualization and why?

3) Make a boxplot of the tips dataset where you show the amount of tip given on the y-axis and whether the customer was a smoker on the x-axis.

4) Split the plot up by the sex column and add a mean line. Are the mean and the median close? What's the difference between them?

# Heatmaps

The `px.imshow()` function can be used to display heatmaps (as well as full-color images, as its name suggests). It accepts both array-like objects like lists of lists, as well as `pandas.DataFrame` objects. Heatmaps are particularly useful to display correlations between the variables of the data.

We can use `corr` to see how much the variables in the tip data set are correlated with each other. Correlation can only be calculated on numerical columns.

In [62]:
tips_df = px.data.tips()
corr_df = tips_df.corr(numeric_only=True)
corr_df

Unnamed: 0,total_bill,tip,size
total_bill,1.0,0.675734,0.598315
tip,0.675734,1.0,0.489299
size,0.598315,0.489299,1.0
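To see which columns `numeric_only=True` keeps, we can select the numeric columns ourselves first. This sketch uses `select_dtypes()` and should give the same correlation table; the names `numeric_cols` and `corr_manual` are made up for the example.

```python
import plotly.express as px

tips_df = px.data.tips()

# Keep only the numeric columns, then correlate
numeric_cols = tips_df.select_dtypes(include="number")
corr_manual = numeric_cols.corr()

print(numeric_cols.columns.tolist())
print(corr_manual)
```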


In [63]:
px.imshow(corr_df, text_auto=True)

We can change the color scale to something easier on the eyes by using the argument `color_continuous_scale`.

In [64]:
px.imshow(corr_df, text_auto=True, color_continuous_scale='RdBu_r')

Notice how white is not at 0 where we would like it to be? We can correct that by explicitly stating the extremes of the color scale using the `range_color` argument.

In [None]:
px.imshow(corr_df, text_auto= '.2f',
          color_continuous_scale='RdBu_r', range_color=[-1,1])

## Exercise 6: Heatmaps

Extract info for the continent Europe from the gapminder dataset and calculate the correlation between columns. Plot the result in a heatmap. What do you observe? Are the correlations as you expected?

Change the color scheme to something you find pleasing or useful and add the correlation values to the squares.

Now, do the same for Africa. What do you observe? Are you surprised?

## Facet plots

Another cool thing we can do in many types of plots is to split the chart into rows or columns depending on a variable. For example, we can look at the relationship between total bill and tip, split up by day of the week. 


In [103]:
tips_df.sample(5)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
36,16.31,2.0,Male,No,Sat,Dinner,3
174,16.82,4.0,Male,Yes,Sun,Dinner,2
60,20.29,3.21,Male,Yes,Sat,Dinner,2
210,30.06,2.0,Male,Yes,Sat,Dinner,3
172,7.25,5.15,Male,Yes,Sun,Dinner,2


In [104]:
fig = px.scatter(tips_df,
              x="total_bill",
              y="tip",
              color='sex',
              title="Tips Vs Total Bill",
              facet_col ="day") #here we add the facet

fig.show()
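With many categories, a single row of panels gets cramped. The sketch below uses the `facet_col_wrap` argument to start a new row after a given number of panels; the value 2 is just an example.

```python
import plotly.express as px

tips_df = px.data.tips()

fig = px.scatter(
    tips_df,
    x="total_bill",
    y="tip",
    color="sex",
    facet_col="day",
    facet_col_wrap=2,  # start a new row of panels after every 2 columns
    title="Tips vs total bill, by day",
)

# Each panel gets a label built from the facet column, stored as a layout annotation
panel_labels = [annotation.text for annotation in fig.layout.annotations]
print(panel_labels)
```

There is also a `facet_row` argument if you would rather stack the panels vertically by a variable.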

## Plot marginals

In **scatter** and **histogram** plots, you can also add extra plots on the margins (called [Plot Marginals](https://plotly.com/python/marginal-plots/)), for instance "histogram", "rug", "box", or "violin" plots. For scatter plots, these can be added with the arguments `marginal_x` and `marginal_y`; `px.histogram()` uses a single `marginal` argument instead.

In [107]:
fig = px.scatter(tips_df,
              x="total_bill",
              y="tip",
              color='sex',
              title="Tips Vs Total Bill",
              marginal_x="histogram",
              marginal_y="violin")

fig.show()

If you have facets you can only have the margin on the 'non-faceted' side:

In [110]:
fig = px.scatter(tips_df,
              x="total_bill",
              y="tip",
              color='sex',
              title="Tips Vs Total Bill",
              marginal_x="histogram",
              facet_col= 'day')

fig.show()

This is it for now, but you can find more advanced plotting techniques in the extra notebook: plotly_extra_materials.