![Data Dunkers Banner](https://github.com/PS43Foundation/data-dunkers/blob/main/docs/top-banner.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fdata-dunkers%2Fdata-dunkers-modules&branch=main&subPath=..\demos/scatter-plots.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a><a href="https://colab.research.google.com/github/data-dunkers/data-dunkers-modules/blob/main/demos/scatter-plots.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg?sanitize=true" width="123" height="24" alt="Open in Colab"/></a>

# Creating Scatter Plots

## Lesson Objectives

Students will be able to:

- Load data from a CSV file into a [pandas](https://pandas.pydata.org/) DataFrame
- Filter the data
- Use the [Plotly Express](https://plotly.com/python/plotly-express/) library in [Python](https://www.python.org/) to create scatter plots
- Rename axis titles
- Identify trends using ordinary least squares (OLS) trendlines
- Use color and size of points to represent other columns from the data set
- Analyze the relationship between various basketball statistics, such as assists and points per game

## Getting and Cleaning Our Data

We'll load and filter the Pascal Siakam data the same way we did in the previous lessons.

In [None]:
import pandas as pd
import plotly.express as px

url = 'https://raw.githubusercontent.com/Data-Dunkers/data-dunkers-modules/main/data-dunkers/Data/Pascal_Siakam.csv'
df = pd.read_csv(url)

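# Keep seasons up to and including 2022-23.
# The variable is named season_filter so it doesn't shadow Python's built-in filter().
season_filter = df['SEASON_ID'] <= '2022-23'
df = df[season_filter]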

df
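
The filter above is a boolean mask: comparing a column to a value gives one `True`/`False` per row, and indexing the DataFrame with that mask keeps only the `True` rows. Here is a minimal sketch of the idea, assuming a tiny made-up table (the `demo` DataFrame and its values are hypothetical, not taken from the real CSV):

```python
import pandas as pd

# Hypothetical mini data set; the real CSV has many more rows and columns.
demo = pd.DataFrame({
    'SEASON_ID': ['2021-22', '2022-23', '2023-24'],
    'PTS': [100, 200, 300],  # made-up values for illustration only
})

# Comparing a string column to a string gives a Series of True/False values.
# Strings are compared character by character, so 'YYYY-YY' labels sort by year.
mask = demo['SEASON_ID'] <= '2022-23'

# Indexing with the mask keeps only the rows where it is True.
kept = demo[mask]

print(mask.tolist())
print(kept['SEASON_ID'].tolist())
```

Printing the mask before applying it is a quick way to check that a filter selects the rows you expect.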

## Scatter Plots

Previously we used `px.bar()` to create bar graphs. Now we'll use `px.scatter()` to create a scatter plot.

In [None]:
px.scatter(df, x='FGA', y='FGM', title='Siakam Field Goals Made versus Field Goal Attempts')

Let's update the x-axis and y-axis labels.

In [None]:
fig = px.scatter(df, x='FGA', y='FGM', title='Siakam Field Goals Made versus Field Goal Attempts')

fig.update_xaxes(title='Field Goal Attempts')
fig.update_yaxes(title='Field Goals Made')

fig.show()

## Method Chaining

We can also use method chaining for updating both axes in the same line of code.

In [None]:
fig = px.scatter(df, x='FGA', y='FGM', title='Siakam Field Goals Made versus Field Goal Attempts')

fig.update_xaxes(title='Field Goal Attempts').update_yaxes(title='Field Goals Made')

fig.show()
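
Chaining works because Plotly's update methods return the figure they were called on, so the next method has something to act on. The toy class below is a rough sketch of that pattern; `TinyFigure` is a hypothetical stand-in written for this example and is not part of Plotly.

```python
class TinyFigure:
    """Hypothetical stand-in for a figure; not part of Plotly."""

    def __init__(self):
        self.titles = {}

    def update_xaxes(self, title):
        self.titles['x'] = title
        return self  # returning the object itself is what allows chaining

    def update_yaxes(self, title):
        self.titles['y'] = title
        return self


toy = TinyFigure()
result = toy.update_xaxes(title='Field Goal Attempts').update_yaxes(title='Field Goals Made')

print(result is toy)  # the chain hands back the same object
print(toy.titles)
```

If a method returned `None` instead of `self`, the second call in the chain would fail, which is why not every method in every library can be chained.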

You can even combine them all together, but this makes for a very long line.

In [None]:
px.scatter(df, x='FGA', y='FGM', title='Siakam Field Goals Made versus Field Goal Attempts').update_xaxes(title='Field Goal Attempts').update_yaxes(title='Field Goals Made')

A more readable way is to break the line up, for example with a backslash `\` at the end of each wrapped line.

In [None]:
fig = px.scatter(df, x='FGA', y='FGM', title='Siakam Field Goals Made versus Field Goal Attempts') \
    .update_xaxes(title='Field Goal Attempts') \
    .update_yaxes(title='Field Goals Made')

fig.show()

Which technique you use is entirely your choice.

## Trend Analysis

What conclusions can you draw from the graph above? 

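To help us draw conclusions we can add a line of best fit, which we call a trendline. We often use the [ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares) method to calculate the parameters of the trendline. Plotly Express relies on the `statsmodels` package to compute OLS trendlines, so that package needs to be installed.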

In [None]:
fig = px.scatter(df, x='FGA', y='FGM', title='Siakam Field Goals Made versus Field Goal Attempts', trendline='ols')

fig.update_xaxes(title='Field Goal Attempts').update_yaxes(title='Field Goals Made')
fig.show()
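
To get a feel for what an OLS trendline computes, here is a rough sketch that fits a straight line by hand with NumPy. The numbers are made up for illustration and are not real Siakam statistics. This is not necessarily how Plotly computes the line internally, only the underlying math for a single straight line.

```python
import numpy as np

# Made-up attempts (x) and makes (y); not real data.
x = np.array([10, 12, 15, 18, 20], dtype=float)
y = np.array([4, 6, 7, 9, 10], dtype=float)

# Closed-form OLS for a straight line y = slope * x + intercept:
# the slope is the co-variation of x and y divided by the variation of x.
x_mean, y_mean = x.mean(), y.mean()
slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
intercept = y_mean - slope * x_mean

# np.polyfit with degree 1 solves the same least-squares problem,
# so it serves as a cross-check.
check_slope, check_intercept = np.polyfit(x, y, 1)

print(slope, intercept)
print(check_slope, check_intercept)
```

The slope estimates how many additional field goals are made for each additional attempt, which is the kind of conclusion a trendline helps you draw.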

## Colour Coding the Data

To show more information on the graph, we can color-code the points by another column from the data set, for example `color='SEASON_ID'`.

In [None]:
px.scatter(df, x='FGA', y='FGM', title='Siakam Field Goals Made versus Field Goal Attempts', color='SEASON_ID')

## Data Point Size

We can also change the size of the data points to be proportional to one of the data columns, for example `size='FGM'`.

In [None]:
fig = px.scatter(df,
    x='FGA',
    y='FGM',
    title='Siakam Field Goals Made versus Field Goal Attempts',
    color='SEASON_ID',
    size='FGM')

fig.update_xaxes(title='Field Goal Attempts').update_yaxes(title='Field Goals Made')

fig.show()


## Exercise

Create a scatter plot with assists per game (`'AST'`) on the x-axis, points per game (`'PTS'`) on the y-axis, and `color='AGE'`. Include a trendline.

What do you observe about the relationship between these columns?

In [None]:
import pandas as pd
import plotly.express as px

url = 'https://raw.githubusercontent.com/Data-Dunkers/data-dunkers-modules/main/data-dunkers/Data/Pascal_Siakam.csv'
df = pd.read_csv(url)



---
Back to [Lessons](../Lessons.ipynb)