# Making scatter charts

A scatter plot is a diagram that uses Cartesian coordinates to display the values of, typically, two variables for a set of data. In the two-dimensional case, the position of each point is determined by its values on two variables, one represented on the x-axis and the other on the y-axis. Scatter plots are particularly useful for identifying relationships, patterns, or trends between the two variables.


Scatter plots can be useful for:

- Identifying Correlation: Scatter plots can reveal the direction (positive, negative, or none) and strength of a relationship between two variables.
- Detecting Outliers: Points that fall far from the general pattern can be easily spotted as outliers.
- Understanding Distribution: They help in visualizing the distribution and clustering of data points.
- Identifying Patterns: Scatter plots can reveal underlying patterns that might not be immediately obvious.

## Getting ready

For this recipe we will load the `Gapminder` data set and filter the data for the year 2007.

In [1]:
import plotly.express as px

In [2]:
df = px.data.gapminder()
df = df[df.year == 2007]  # keep only the rows for 2007

In [3]:
df.head()

| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 | AFG | 4 |
| 23 | Albania | Europe | 2007 | 76.423 | 3600523 | 5937.029526 | ALB | 8 |
| 35 | Algeria | Africa | 2007 | 72.301 | 33333216 | 6223.367465 | DZA | 12 |
| 47 | Angola | Africa | 2007 | 42.731 | 12420476 | 4797.231267 | AGO | 24 |
| 59 | Argentina | Americas | 2007 | 75.32 | 40301927 | 12779.37964 | ARG | 32 |


## How to do it

1. Make a simple scatter plot using `px.scatter`, passing the data frame and the names of the two columns to be plotted as `x` and `y` respectively.

In [4]:
fig = px.scatter(df, x='gdpPercap', y='lifeExp')
fig.show()

2. Add a title to your chart by passing the `title` argument.

In [5]:
fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()

3. Use the `color` argument to color the dots according to a third variable.

In this case, we pass `continent`, which lets us see whether the relationship between GDP per capita and life expectancy differs by continent.

In [6]:
fig = px.scatter(df, x='gdpPercap', y='lifeExp', color='continent',
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()

An alternative way to introduce a third variable is the `symbol` argument, which varies the marker shape according to the specified variable. Let's look at the result when passing `continent`.

In [7]:
fig = px.scatter(df, x='gdpPercap', y='lifeExp', symbol='continent',
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()

You can also use both arguments together, as follows:

In [8]:
fig = px.scatter(df, x='gdpPercap', y='lifeExp', symbol='continent', color='continent',
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()

4. Use the `size` argument to scale the marks according to a variable.

In this case, we pass `pop`, so the size of each dot reflects the population of that country. This lets us investigate whether population size is related to the two main variables in the scatter plot, GDP per capita and life expectancy.

In [9]:
fig = px.scatter(df, x='gdpPercap', y='lifeExp', color='continent', size='pop',
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()

5. Customize the maximum size of the marks by passing an integer to the optional `size_max` argument. The default value is `20`.

In this case, we want to make the markers bigger, so we choose `size_max=50`.

In [10]:
fig = px.scatter(df, x='gdpPercap', y='lifeExp', color='continent', size='pop', size_max=50,
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.show()

One big weakness of this chart is that the hover tooltip does not show the name of the country. We will fix that in the next step.

6. Add variables to the hover tooltip by passing the optional `hover_data` argument.

In this case, we add the column `country`.

In [11]:
fig = px.scatter(df, x='gdpPercap', y='lifeExp', color='continent', size='pop', size_max=50,
                 hover_data=['country'],
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')

fig.show()

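7. Customize the hover appearance further by calling the `update_traces` method on the figure and passing `hovertemplate`. This lets us specify the order and format of the variables to be displayed. Columns listed in `hover_data` are available in the template as `customdata`, indexed in the order they were passed, so `customdata[0]` is `pop` and `customdata[1]` is `country`.

In [12]:
fig = px.scatter(df, x='gdpPercap', y='lifeExp', color='continent', size='pop', size_max=60,
                 hover_data=['pop', 'country'],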
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')
fig.update_traces(
    hovertemplate="<br>".join([
        "Country: %{customdata[1]}",
        "GDP per capita: %{x}",
        "Life Expectancy: %{y}",
        "Population: %{customdata[0]}",
    ])
)
fig.show()

8. Finally, customize the axis labels by calling the `update_layout` method on the figure and passing both `xaxis_title` and `yaxis_title`.

In [13]:
fig = px.scatter(df, x='gdpPercap', y='lifeExp', color='continent', size='pop', size_max=60,
                 hover_data=['pop', 'country'],
                 title='Gap Minder Data: GDP per Capita vs Life Expectancy')

fig.update_traces(
    hovertemplate="<br>".join([
        "Country: %{customdata[1]}",
        "GDP per capita: %{x}",
        "Life Expectancy: %{y}",
        "Population: %{customdata[0]}",
    ])
)

fig.update_layout(
    xaxis_title="USD", yaxis_title="Years"
)
fig.show()