# Making Box Plots Better with Jitter

Adding **jitter** to a box plot is a technique used to improve the visualization of individual data points, especially when working with small to medium-sized datasets. Jitter involves slightly shifting the position of data points along the x-axis (or y-axis, depending on orientation) to avoid overlap and clustering, making each data point more distinct and easier to identify. This is particularly useful in situations where data points are tightly packed or where outliers and individual observations are of interest, as it can reveal patterns or variations that might be hidden in the summarized box plot alone.
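To make the idea concrete, here is a minimal sketch of what jitter does numerically, assuming a uniform random offset. The function name `add_jitter` and its `width` and `seed` parameters are illustrative only and are not part of any library; Plotly applies its own jitter internally.

```python
import random

def add_jitter(positions, width=0.2, seed=None):
    """Return a copy of positions, each shifted by a uniform offset in [-width/2, width/2]."""
    rng = random.Random(seed)
    return [p + rng.uniform(-width / 2, width / 2) for p in positions]

# Five observations that all sit at the same position (1.0) on the category axis
positions = [1.0] * 5
jittered = add_jitter(positions, width=0.2, seed=0)
print(jittered)
```

Only the position along the category axis is perturbed; the data values themselves are left untouched, so the plot still shows the true observations.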

By overlaying jittered data points on a box plot, you can achieve a more detailed and informative visualization. The jittered points on their own are often called a **"strip plot"**; a **"beeswarm plot"** is a related variant that arranges the points so that they do not overlap. This hybrid approach allows the viewer to see the overall summary of the data distribution (via the box plot) while also observing the specific spread and density of individual points (via the jittered dots). It is particularly useful when you want to highlight the actual data behind the summary statistics, such as in smaller datasets where individual values are more significant.

However, jitter should be applied carefully. It works well when the data is sparse enough that individual points can be meaningfully separated. In very large datasets, excessive use of jitter can make the plot appear cluttered and difficult to interpret, negating the purpose of summarization in a box plot. In such cases, alternative visualizations like a violin plot, which combines a box plot with a kernel density estimate, might be more effective for displaying the underlying distribution.

## Getting ready


In addition to `plotly`, `numpy` and `pandas`, make sure the `scipy` Python library is available in your Python environment. You can install it using the command:

```
pip install scipy 
```

For this recipe we will create two data sets.

1. Import the Python modules `numpy` and `pandas`, and import the [`norm`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) and `t` objects from `scipy.stats`. These objects let us draw random samples from a normal distribution and from a Student's t-distribution, which we will use to build the data sets for this recipe.

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import norm, t

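2. Create the two data sets used in this recipe: `data1` holds a single sample from a standard normal distribution, and `data2` holds that sample together with a sample from a Student's t-distribution with 3 degrees of freedom, each observation tagged with a label.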

In [2]:
n = 200
sample1 = norm(loc=0).rvs(n)
sample2 = t(df=3).rvs(n)

In [3]:
data1 = pd.DataFrame({'Normal': sample1})

In [4]:
samples =  np.concatenate( (sample1, sample2))
labels = ['Normal']*n + ['t-Student']*n 
data2 = pd.DataFrame({'Data': samples, 'Label':labels})
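The samples above change on every run because no seed is set. If you want the figures in this recipe to be reproducible, one option is to pass a seeded NumPy generator through the `random_state` argument of `rvs`, as in this sketch. The seed value 42 is an arbitrary choice, not something the recipe prescribes.

```python
import numpy as np
from scipy.stats import norm, t

n = 200
rng = np.random.default_rng(42)  # arbitrary seed, chosen only for illustration

# The same generator is reused, so both samples are reproducible as a pair
sample1 = norm(loc=0).rvs(n, random_state=rng)
sample2 = t(df=3).rvs(n, random_state=rng)
```

The data frames `data1` and `data2` can then be built from these samples exactly as before.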

## How to do it

1. Import the `plotly.graph_objects` module as `go`

In [5]:
import plotly.graph_objects as go

### Single data set

Load the first data set.

In [6]:
df = data1

2. Create a `Figure` object and add a simple notched box plot using the `Box` trace class. Then customise the layout by calling the method `update_layout`.

In [7]:
fig = go.Figure()
fig.add_trace(go.Box(x=df["Normal"], 
                     notched=True, 
                     marker_color='purple', 
                     name='Normal',
                     ))
fig.update_layout(title='Box Plot - Sample from a Normal Distribution', 
                  height = 500, width = 800,)

3. Highlight all the points by setting the argument `boxpoints` to `'all'`.

In [8]:
fig = go.Figure()
fig.add_trace(go.Box(x=df["Normal"], 
                     notched=True, 
                     marker_color='purple', 
                     boxpoints="all",
                     name='Normal',
                     ))
fig.update_layout(title='Box Plot - Sample from a Normal Distribution', 
                  height = 500, width = 800,)

4. Set the amount of jitter applied to the sample points with the argument `jitter`. If 0, the sample points are aligned along the distribution axis. If 1, the sample points are drawn with a random jitter of width equal to the width of the box(es).

In [9]:
fig = go.Figure()
fig.add_trace(go.Box(x=df["Normal"], 
                     jitter=0.5,
                     notched=True, 
                     marker_color='coral', 
                     boxpoints="all",
                     name='Normal',
                     ))
fig.update_layout(title='Box Plot Sample from a Normal Distribution', 
                  height = 500, width = 800,)

In [10]:
fig = go.Figure()
fig.add_trace(go.Box(x=df["Normal"], 
                     jitter=0.0,
                     notched=True, 
                     marker_color='teal', 
                     boxpoints="all",
                     name='Normal',
                     ))
fig.update_layout(title='Box Plot Sample from a Normal Distribution', 
                  height = 500, width = 800,)

5. Customise the position of the points with respect to the box(es) by using the argument `pointpos`. If 0, the sample points are placed over the center of the box(es). Positive (negative) values correspond to positions to the right (left) for vertical boxes and above (below) for horizontal boxes.

In [11]:
fig = go.Figure()
fig.add_trace(go.Box(x=df["Normal"], 
                     pointpos=0.0,
                     jitter=0.5,
                     notched=True, 
                     marker_color='green', 
                     boxpoints="all",
                     name='Normal',
                     ))
fig.update_layout(title='Box Plot Sample from a Normal Distribution', 
                  height = 500, width = 800,)

### Multiple data sets

In [12]:
df = data2

1. To show several distributions, we add one `Box` trace per label, using the same approach as for a single distribution. Note that in each iteration we select the subset of the data set that carries the current label.

In [13]:
fig = go.Figure()

for l in df['Label'].unique():
    subdata = df[df.Label==l]
    
    fig.add_trace(go.Box(x=subdata["Data"], 
                        notched=True, 
                        boxpoints="all",
                        name=l
                        ))
fig.update_layout(title='Box Plots - Samples from Normal and t-Student Distributions', 
                  height = 500, width = 800,)

2. Customise the colors by defining a list of colors and zipping it with the labels in the `for` loop that adds the traces.

In [14]:
fig = go.Figure()

colors = ['teal', 'purple']
for l, cl in zip(df['Label'].unique(), colors):
    subdata = df[df.Label==l]
    
    fig.add_trace(go.Box(x=subdata["Data"], 
                        notched=True, 
                        boxpoints="all",
                        name=l,
                        marker_color=cl, 
                        ))
fig.update_layout(title='Box Plots - Samples from Normal and t-Student Distributions', 
                  height = 500, width = 800,)

3. Adjust `jitter` and `pointpos` for each trace, as in the single data set case.

In [15]:
fig = go.Figure()

colors = ['teal', 'purple']
for l, cl in zip(df['Label'].unique(), colors):
    subdata = df[df.Label==l]
    
    fig.add_trace(go.Box(x=subdata["Data"], 
                        pointpos=0.0,
                        jitter=0.5,
                        notched=True, 
                        boxpoints="all",
                        name=l,
                        marker_color=cl, 
                        ))
fig.update_layout(title='Box Plots - Samples from Normal and t-Student Distributions', 
                  height = 500, width = 800,)

4. Set `jitter` to 0 to align the points of each sample along a single line.

In [16]:
fig = go.Figure()

colors = ['teal', 'purple']
for l, cl in zip(df['Label'].unique(), colors):
    subdata = df[df.Label==l]
    
    fig.add_trace(go.Box(x=subdata["Data"], 
                        jitter=0.0,
                        notched=True, 
                        boxpoints="all",
                        name=l,
                        marker_color=cl, 
                        ))
fig.update_layout(title='Box Plots - Samples from Normal and t-Student Distributions', 
                  height = 500, width = 800,)

## There is more

In [17]:
n = 500
sample = norm(loc=0).rvs(n)
data1 = pd.DataFrame({'Sample': sample})

In [18]:
df = data1

1. Show the mean of the distribution by setting the argument `boxmean` to `True`.

In [19]:
fig = go.Figure()
fig.add_trace(go.Box(x=df["Sample"], 
                     boxmean=True,
                     notched=True, 
                     marker_color='green', 
                     boxpoints="outliers",
                     name='Normal',
                     ))
fig.update_layout(title='Box Plot', 
                  height = 500, width = 800,)

2. Set the method used to determine the sample's quartiles with the argument `quartilemethod`.

- The “linear” method uses the 25th percentile for Q1 and the 75th percentile for Q3, as computed using method #10 (listed on http://jse.amstat.org/v14n3/langford.html).
- The “exclusive” method uses the median to divide the ordered dataset into two halves. If the sample size is odd, the median is not included in either half. Q1 is then the median of the lower half and Q3 the median of the upper half.
- The “inclusive” method also uses the median to divide the ordered dataset into two halves, but if the sample size is odd, the median is included in both halves. Q1 is then the median of the lower half and Q3 the median of the upper half.

In [27]:
fig = go.Figure()
fig.add_trace(go.Box(x=df["Sample"], 
                     quartilemethod='inclusive',
                     notched=True, 
                     name='Normal',
                     ))
fig.update_layout(title='Box Plot', 
                  height = 500, width = 800,)

fig.show()
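The difference between the “exclusive” and “inclusive” methods is easiest to see on a small sample of odd size. The helper below is a plain Python sketch that follows the descriptions given above; the function name `quartiles` is hypothetical and this is not Plotly's internal implementation.

```python
from statistics import median

def quartiles(values, method='exclusive'):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the sorted data."""
    if method not in ('exclusive', 'inclusive'):
        raise ValueError("method must be 'exclusive' or 'inclusive'")
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        # Even size: the two halves are the same for both methods
        lower, upper = data[:half], data[half:]
    elif method == 'exclusive':
        # Odd size: leave the median out of both halves
        lower, upper = data[:half], data[half + 1:]
    else:
        # Odd size: include the median in both halves
        lower, upper = data[:half + 1], data[half:]
    return median(lower), median(upper)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(quartiles(sample, 'exclusive'))  # (2.5, 7.5)
print(quartiles(sample, 'inclusive'))  # (3, 7)
```

For a sample of even size the two methods give the same quartiles, since there is no single middle value to include or exclude.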