<p style="text-align: right; font-size:0.8em;"> Thomas Bury<br><a href=mailto:thomas.bury@mcgill.ca> thomas.bury@mcgill.ca </a> </p>
<h1> Notebook 1: Plotly fundamentals </h1> 

<img src='images/plotly.png' width='40'>

Learning objectives of this notebook:

1. Explore the basic properties of a dataframe using Pandas
2. Create simple plot representations including
    - Scatter plots 
    - Bar charts
    - Line plots
3. Create distribution plots including
    - Histograms
    - Box plots
4. Bonus
    - Violin plots
    - Parallel category diagram

<br>

<h3> Useful Jupyter notebook shortcuts </h3>
<img src='images/jupyter.png' width='40'>

Outside of cells:
- A : create new cell above
- B : create new cell below
- DD : delete cell
- Y : make cell of <i>code</i> type
- M : make cell of <i>text</i> type
- enter : enter cell

Within cell:
- **Cmd + /** : comment/uncomment a line of code
- **Shift + enter** : run cell
- **esc** : escape from cell


<h3> Import all required packages </h3>
This should be done at the start of any Python script or notebook.

In [7]:

import numpy as np # For numerical computation
import pandas as pd # For handling dataframes
import plotly.express as px # For rapid plotting
import plotly.graph_objects as go # For finer plot details


<br><h3> Import the data! </h3>

For this notebook we will all use the same dataset: the Palmer penguins. It is freely available on GitHub and was made available by Dr Kristen Gorman.

<img src='images/chinstrap2.jpg' width='250'>

In [10]:
# Import Palmer penguin data (made available on GitHub)
df_penguins = pd.read_csv(
    'https://raw.githubusercontent.com/JohnMount/Penguins/main/penguins.csv')

<br><br>
<h2> 1. Exploring properties of data with pandas </h2>

<img src='images/panda.jpg' width='250'>

Below are some useful pandas functions for getting a rapid overview of the data. This is important to do before plotting!
<br><br>


In [48]:
# Dimensions of the dataset
df_penguins.shape

(344, 7)

In [49]:
# Column names
df_penguins.columns

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

In [50]:
# First few rows
df_penguins.head()

# # Last few rows
# df_penguins.tail()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female


In [51]:
# Get info on memory usage, type of data and number of values in each column
# df_penguins.info()

In [52]:
# Statistical overview of numerical variables
# df_penguins.describe();

In [53]:
# Show data for a set of columns, e.g.
df_penguins[['species','body_mass_g']];

In [21]:
# Extract data according to a certain criterion
# E.g. get all penguins from the island Torgersen
df_penguins[df_penguins['island']=='Torgersen'];
# E.g. 2: get all penguins with flipper length greater than 228 mm
df_penguins[df_penguins['flipper_length_mm']>228];
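
Single-condition filters like the ones above can also be combined. The sketch below uses a small made-up dataframe (`df_toy` is hypothetical and only stands in for `df_penguins`, so the cell runs without downloading anything).

```python
import pandas as pd

# Made-up dataframe standing in for df_penguins (values are illustrative only)
df_toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'island': ['Torgersen', 'Biscoe', 'Biscoe', 'Biscoe'],
    'flipper_length_mm': [181.0, 190.0, 230.0, 215.0],
})

# Each comparison returns a boolean Series; combine them with & (and) or | (or).
# The parentheses are needed because & binds more tightly than == and >.
mask = (df_toy['island'] == 'Biscoe') & (df_toy['flipper_length_mm'] > 200)
df_filtered = df_toy[mask]
```

The same pattern should work on `df_penguins` by swapping in its column names.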

In [22]:
# All the different penguin species (unique values in the column)
df_penguins['species'].unique()

array(['Adelie', 'Gentoo', 'Chinstrap'], dtype=object)

<br>
<h3> Test yourself </h3>

1. List the different islands present in the data
2. Extract the bill lengths of the Adelie species

In [55]:
### Write code here ###

<br><br><br><br>
<h2> 2. Fundamental plot representations with Plotly Express </h2>


<h3> Scatter plots </h3>
<img src='images/scatter.jpg' width='60'>


Plot data points in 2D. How many dimensions in the data can we present? <br>
<a href=https://plotly.com/python/line-and-scatter/> Plotly documentation </a>

In [106]:
df_penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female


In [58]:
# Prepare data for plotting
df_plot = df_penguins.dropna()
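
As a rough illustration of what `dropna()` does, the sketch below uses a made-up dataframe (`df_toy`, not the penguin data) with a few missing entries. By default, any row containing at least one missing value is removed; the `subset` argument restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Made-up data with missing entries (illustrative values only)
df_toy = pd.DataFrame({
    'bill_length_mm': [39.1, np.nan, 36.7],
    'bill_depth_mm': [18.7, np.nan, 19.3],
    'sex': ['male', np.nan, np.nan],
})

# Count missing values per column before deciding what to drop
n_missing = df_toy.isna().sum()

# Drop rows that have a missing value in any column
df_all = df_toy.dropna()

# Drop rows only if one of the plotted columns is missing
df_subset = df_toy.dropna(subset=['bill_length_mm', 'bill_depth_mm'])
```

Using `subset` keeps rows that are complete in the columns being plotted, even if another column (here `sex`) is missing.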

In [109]:
fig = px.scatter(df_plot,
                 x='bill_length_mm',
                 y='bill_depth_mm',
                 color='species',
#                  size='body_mass_g',
#                  hover_data=['sex','flipper_length_mm','island']
                )

In [110]:
fig.write_html('figures/penguin_scatter1.html')

<br><br><h3> Basic presentation properties </h3>
Note: this workshop is on data *exploration*, not presentation. We will therefore not spend a lot of time on presentation, though the functionality in Plotly does exist to make publication-ready figures.

<h4> Layout properties </h4>

In [35]:
fig.update_layout(title_text='I love penguins',
                  title_font_size=15,
                  width=800,
                  height=400,
                  );

<h4> Axes properties </h4>
<a href=https://plotly.com/python/axes/> Axes documentation </a>

In [36]:
fig.update_xaxes(title='New axis label',
                 showgrid=False,
                 range=[1.4,2],
                 type='log',
                 tickangle=45,
                 );

In [37]:
fig.write_html('figures/penguin_scatter2.html')

<h4> Trace properties (e.g. colour, style) </h4>
This is difficult in Plotly Express; we need to go the long way around with Plotly graph objects. This is addressed later.

<br><br><h3> Bar charts </h3>
<img src='images/bar.jpg' width='80'>


Present values for categorical variables.<br>
<a href=https://plotly.com/python/bar-charts/> Plotly documentation </a>

The data in its current form is not appropriate for a bar chart. We need single values associated with categorical variables, whereas here each row is an individual penguin, of which there are many.

In [62]:
df_penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female


We can consider the mean of the numerical variables for each penguin species and sex.

In [81]:
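# numeric_only=True skips the non-numeric 'island' column (needed in pandas >= 2.0)
df_mean_bill_length = df_penguins.groupby(['sex','species']).mean(numeric_only=True)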
df_mean_bill_length.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
sex,species,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
female,Adelie,37.257534,17.621918,187.794521,3368.835616
female,Chinstrap,46.573529,17.588235,191.735294,3527.205882
female,Gentoo,45.563793,14.237931,212.706897,4679.741379
male,Adelie,40.390411,19.072603,192.410959,4043.493151
male,Chinstrap,51.094118,19.252941,199.911765,3938.970588


In [82]:
df_plot = df_mean_bill_length.reset_index()
df_plot.head()

Unnamed: 0,sex,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,female,Adelie,37.257534,17.621918,187.794521,3368.835616
1,female,Chinstrap,46.573529,17.588235,191.735294,3527.205882
2,female,Gentoo,45.563793,14.237931,212.706897,4679.741379
3,male,Adelie,40.390411,19.072603,192.410959,4043.493151
4,male,Chinstrap,51.094118,19.252941,199.911765,3938.970588
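
The two steps above (group and average, then flatten the index) can be followed on a small made-up dataframe. `df_toy` and its values are hypothetical and chosen so the means are easy to check by hand.

```python
import pandas as pd

# Made-up data: two female Adelie and two male Gentoo penguins
df_toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'sex': ['female', 'female', 'male', 'male'],
    'island': ['Torgersen', 'Biscoe', 'Biscoe', 'Biscoe'],
    'bill_length_mm': [36.0, 38.0, 48.0, 50.0],
})

# Step 1: one row per (sex, species) pair, holding the mean of each numeric column.
# The grouping columns become a MultiIndex rather than ordinary columns.
df_mean = df_toy.groupby(['sex', 'species']).mean(numeric_only=True)

# Step 2: move 'sex' and 'species' back into ordinary columns,
# which is the form Plotly Express expects for x= and color=
df_flat = df_mean.reset_index()
```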


In [85]:
fig = px.bar(df_plot,
             x='sex',
             y='bill_length_mm',
             color='species',
             barmode='group',
             )

In [86]:
fig.write_html('figures/bar_penguins.html')

<br><br>
<h3> Line charts </h3>
<img src='images/line.png' width='80'>

Line charts are best suited for data with a 'sequential' variable (e.g. time) and therefore not appropriate for the penguins. <br>
We will create an artificial dataset for the line chart.<br>
<a href=https://plotly.com/python/line-charts/> Plotly documentation for line charts </a>

In [87]:
# Create a noisy dataset
time_vals = np.arange(0,10,0.01)
x_vals = np.sin(time_vals)+np.random.normal(0,0.1,len(time_vals))
y_vals = np.cos(time_vals)+np.random.normal(0,0.1,len(time_vals))
df_waves = pd.DataFrame({'Time':time_vals,
                         'Noisy sine':x_vals,
                         'Noisy cosine':y_vals})
df_plot = df_waves.melt(id_vars=['Time'],
                        value_vars=['Noisy sine','Noisy cosine']
                        )


In [88]:
df_plot.head()

Unnamed: 0,Time,variable,value
0,0.0,Noisy sine,0.036298
1,0.01,Noisy sine,-0.030089
2,0.02,Noisy sine,-0.077557
3,0.03,Noisy sine,0.201639
4,0.04,Noisy sine,0.011456
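
The `melt` call reshapes the data from "wide" form (one column per signal) to "long" form (one row per observation), which lets `color='variable'` separate the lines. A minimal sketch on a tiny made-up dataframe (`df_wide` and its column names `a` and `b` are hypothetical):

```python
import pandas as pd

# Wide form: one column per signal
df_wide = pd.DataFrame({'Time': [0.0, 0.1],
                        'a': [1.0, 2.0],
                        'b': [3.0, 4.0]})

# Long form: the column names go into 'variable', the entries into 'value'
df_long = df_wide.melt(id_vars=['Time'], value_vars=['a', 'b'])
```

Here two rows and two signals become four rows, with `Time` repeated for each signal.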


In [90]:
fig = px.line(df_plot,
              x='Time',
              y='value',
              color='variable')

In [93]:
fig.write_html('figures/line_sin_cos.html')

<br><br><br>
<h2> 3. Distribution plots </h2>

We can investigate statistical distributions of the data with minimal code.

<h3> Histograms </h3>
<img src='images/histogram.png' width='80'>
Shows the distribution of data for a given variable, by splitting it into bins of a given width.<br>
<a href=https://plotly.com/python/histograms/> Plotly documentation on histograms </a>

In [95]:
df_penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female


In [98]:
# Remove entries with NaN
df_plot = df_penguins.dropna()

In [111]:
fig = px.histogram(df_plot,
                   x='bill_length_mm',
                   color='sex',
#                    color_discrete_sequence=["red", "green"],
                   nbins=100,
                  )
# fig.update_layout(barmode='overlay') # Overlay ('overlay') or stack ('stack') the histograms
# fig.update_traces(opacity=0.75); # Make sure to reduce opacity for overlaying


In [104]:
fig.write_html('figures/hist_penguins.html')
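
To see what binning does behind the scenes, the sketch below counts made-up values into equal-width bins with NumPy. This is an illustration of the idea, not Plotly's own binning code; note that in Plotly `nbins` acts as an upper limit on the number of bins, so the bins drawn may differ from a manual calculation.

```python
import numpy as np

# Made-up bill lengths in mm (illustrative values only)
values = np.array([32.0, 35.5, 38.0, 41.0, 44.5, 47.0, 52.0])

# Split the interval 30-60 mm into 4 bins of width 7.5 mm and count values in each
counts, edges = np.histogram(values, bins=4, range=(30, 60))

bin_width = edges[1] - edges[0]
```

Each bar in a histogram has height `counts[i]` and spans `edges[i]` to `edges[i+1]`.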

<br><br><h3> Box plots </h3>
<img src='images/box.svg' width='60'>


Visualise the distributions of the data via quartiles (median, inter-quartile range, range).<br>
<a href=https://plotly.com/python/box-plots/> Plotly documentation </a>

In [118]:
df_penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female


In [116]:
fig = px.box(df_penguins,
             x='island',
             y='bill_length_mm',
             color='species',
            )

In [117]:
fig.write_html('figures/box_penguin1.html')
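
The quantities a box plot summarises can be computed directly with pandas. The values below are made up, and Plotly may use a slightly different quartile convention than the pandas default, so treat this as a sketch of the idea rather than a reproduction of the plotted boxes.

```python
import pandas as pd

# Made-up bill lengths in mm (illustrative values only)
bill_lengths = pd.Series([36.0, 38.0, 40.0, 42.0, 44.0])

# Lower quartile, median and upper quartile
q1, median, q3 = bill_lengths.quantile([0.25, 0.5, 0.75])

# Inter-quartile range (height of the box) and full range of the data
iqr = q3 - q1
data_range = bill_lengths.max() - bill_lengths.min()
```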

<br><br><h2> Practice area </h2>
Exercises:
1. Familiarise yourself with Pandas basics (if required).
2. Edit code above to make plots for different penguin variables.
3. Create your favourite visualisation from scratch in the space below.

<br><br>
<h2> 4. Bonus material </h2>
If you have extra time during this session, play around with violin plots or parallel category diagrams.

<h3> Violin plots </h3>
Like a box plot, but with a smoothed probability density drawn alongside.<br>
<a href=https://plotly.com/python/violin/>Plotly documentation </a>
<br><br>

In [26]:
fig = px.violin(df_penguins.dropna(),
                x='species',
                y='body_mass_g',
                color='sex',
                box=True,
                points='all')
fig.write_html('figures/violin1.html')

<br><br>
<h3> Parallel category diagrams </h3>

Visualise relationships between **categorical** variables in a dataset <br>
<a href=https://plotly.com/python/parallel-categories-diagram/>Plotly documentation </a>  

In [120]:
fig = px.parallel_categories(df_penguins,
                             dimensions=['species','island','sex'],
                             )
fig.write_html('figures/parallel1.html')
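
The width of each ribbon in a parallel category diagram reflects how many rows share that combination of categories. The same counts can be tabulated with pandas; the sketch below uses a made-up dataframe (`df_toy` is hypothetical and its values are illustrative only).

```python
import pandas as pd

# Made-up categorical data standing in for the penguin dataframe
df_toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'island': ['Torgersen', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe'],
})

# Number of rows for each (species, island) combination
counts = (df_toy.groupby(['species', 'island'])
                .size()
                .reset_index(name='count'))
```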