In [None]:
# initializing otter-grader
import otter
grader = otter.Notebook()

# Lab 5: Data Transformation and Density Estimation

In this lab we are going to apply data transformations and convert the data into specific forms to explore their features. Some important and powerful functions like `.cut()` and `.groupby()` will be illustrated in detail. Moreover, we will learn some smoothing techniques to better present information in our data. 

To receive credit for a lab, answer all questions correctly and submit it to Gradescope before the deadline.

**This lab is due 5/2 at 12:00 AM PST.**

### Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually** and do not copy them from others. 

By submitting your work in this course, whether it is homework, a lab assignment, or a quiz/exam, you agree and acknowledge that **this submission is your own work and that you have read the policies regarding Academic Integrity**: https://studentconduct.sa.ucsb.edu/academic-integrity. The Office of Student Conduct has policies, tips, and resources for proper citation use, recognizing actions considered to be cheating or other forms of academic theft, and students’ responsibilities. You are required to read the policies and to abide by them.

## Setup

In [1]:
# Run this cell to set up your notebook.
# Do not change anything in this cell 

import csv
import numpy as np
import pandas as pd
import altair as alt
alt.data_transformers.disable_max_rows()

import zipfile
from pathlib import Path

# Basketball Shots Data

Are you a basketball fan? If you are, congratulations: you will be excited throughout this lab, as both datasets are on this topic. Don't worry if you are not a fan and have only limited knowledge of the terminology; you will only need some common sense to answer the questions here. This 2-minute video about [The Basic Rules of Basketball](https://www.youtube.com/watch?v=XbtmGKif7Ck) might also help if you are not familiar with the rules or some of the terminology.

We first load the dataset and explore some basic features.

In [2]:
# If you are working on a student version, try pd.read_csv("../data/bball_data.csv")

bball = pd.read_csv("data/bball_data.csv")
bball.head(10)

As shown in the above cell, this dataset is about shots made by the players. The first column indicates if the shot was made successfully; the second column represents the region where it was made, and the third column indicates how far the defender was from the player (measured in meters).

## Data Transformation
Intuitively, a player should perform better when not blocked by other people, and thus have a better chance of scoring the shot. Let us see if our data supports this intuition.

The '`defender_distance`' feature is measured in meters. A quick glimpse at its range suggests that it is always between 0 and 10 (quick check: what command would you run to get this information?). We will group our data based on this feature.

### Step 1: use `pd.cut()`

For this step, check out the documentation for the `pd.cut()` function (reminder: you can run `help(pd.cut)` to bring up the help page right in the notebook).

In [3]:
# help(pd.cut)

Create a new column, called '`defender_bin`', to represent the binned version of defender distance, such as (5, 6], (1, 2].  In particular, we need to create 10 bins since the range is between 0 and 10.

*Hint: You should be able to call pd.cut() with the column that needs to be binned, followed by the number of bins.*
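
To make the hint concrete, here is a minimal sketch of `pd.cut()` on a toy series. The values and the variable names `distances` and `binned` are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# Toy distances (hypothetical values, not the lab data)
distances = pd.Series([0.0, 0.5, 1.2, 3.3, 9.8, 10.0])

# Passing an integer asks pd.cut for that many equal-width bins
# spanning the observed range of the data.
binned = pd.cut(distances, 5)

print(binned.cat.categories)            # the five interval labels
print(binned.value_counts(sort=False))  # how many values fall in each bin
```

Each entry of `binned` is the interval that contains the corresponding value, which is why the result can be stored directly as a new column.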


<!--
BEGIN QUESTION
name: q1a
manual: false
points: 2
gradescope: show
-->

In [4]:
## Split the data into 10 bins using pd.cut

...
bball[['defender_bin']]

### Step 2: use `.groupby()`

In basketball terms, the ratio of field goals that were *made* to the field goals that were *attempted* is called the **field goal percentage**. A glossary of basketball terms can be found here: https://stats.nba.com/help/glossary/.

Now we want to provide an estimate of the **field goal percentage** given the defender distance in our dataset. We will divide the dataset into subgroups based on the '`defender_bin`'. Then, for each subgroup, we compute **_the mean_** of '`made_shot`' and '`defender_distance`' for observations within this subgroup (which effectively gives us the estimate of the field goal percentage). **The result is a dataframe with two columns ('`made_shot`' and '`defender_distance`'), and it is indexed by the '`defender_bin`' values.** 

We would expect the percentage of made shots to increase as the defender gets further away.

A natural but tedious way would be a `for` loop with some conditional statements. However, that is very inefficient on large datasets and less elegant than what seasoned data scientists would write. This operation is so common that it has a dedicated method, which in pandas is called `.groupby()`. Many variations of it are used extensively in practical data analysis.

The documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html can refresh your memory on some basics.
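
As a refresher, the split-then-average pattern can be sketched on a toy dataframe. The column names `team`, `hit`, and `dist` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Toy data (hypothetical columns, not the lab data)
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "hit": [1, 0, 1, 1],
    "dist": [1.0, 2.0, 3.0, 5.0],
})

# Split the rows by 'team', then average the selected columns within each group.
toy_means = toy.groupby("team")[["hit", "dist"]].mean()
print(toy_means)
```

The result is indexed by the grouping column, and the mean of a 0/1 column is the fraction of ones in that group.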

<!--
BEGIN QUESTION
name: q1b
manual: false
points: 3
gradescope: show
-->


In [7]:
## Group by bin and take the mean (hint: use chaining of operations)
bball_percentages = ...
bball_percentages.head(10)

## Visualization
Now we try to plot the field goal percentage estimates versus defender distances (binned version) using '`bball_percentages`'. The x-axis is the defender distances, and the y-axis is the probability estimates. **Note: since the y-axis is encoding probabilities, we can fix its range to be from 0 to 1** by using `scale=alt.Scale(domain=(0, 1))` when specifying the y-axis in Altair.

<!--
BEGIN QUESTION
name: q2
manual: true
points: 5
gradescope: show
-->
<!-- EXPORT TO PDF -->

In [11]:
## Plot the shot probability vs defender distances

alt.Chart(bball_percentages).mark_point().encode(
    x = ...,
    y = ...,
)

This should be really surprising. We expect defenders to affect the probability of making a shot; otherwise, why bother playing defense at all?! What could be going on?

Let's explore other ways of looking at the data.

## Another Transformation

Alternatively, we now group the data by both the defender bin **and** the shot region, and again compute the estimates of the field goal percentage for each subgroup. The shot regions are shown on the court below:

<img src="../images/court_colored.png" style="width: 250px;">

In the image above, the "Corner Three" is blue, "Arc Three" is purple, "Paint" is yellow, "Mid-range" is green and "Near hoop" is red.  The basket is at the bottom of the image and marked by a black dot.

Note that earlier we were grouping using a single column (the '`defender_bin`'). Now, we need to re-run effectively the same command, this time, grouping by multiple columns.

*Hint: the way to group the data by multiple columns can be found here*: [Group and Aggregate by One or More Columns in Pandas](https://jamesrledoux.com/code/group-by-aggregate-pandas).
*Since you only need the mean, you don't really need to use the `agg` method here, however, it will come in handy in the later questions.* 
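
Grouping by several columns can be sketched as follows. The toy columns `shape`, `color`, and `score` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical columns, not the lab data)
toy = pd.DataFrame({
    "shape": ["circle", "circle", "circle", "square"],
    "color": ["red", "red", "blue", "red"],
    "score": [1, 3, 5, 7],
})

# Passing a list of column names creates one subgroup per observed
# combination of values; the result has a two-level index.
by_two = toy.groupby(["shape", "color"])[["score"]].mean()
print(by_two)
```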

<!--
BEGIN QUESTION
name: q3
manual: false
points: 3
gradescope: show
-->

In [12]:
bball_percentages_by_region = ...
bball_percentages_by_region.head(10)

## Visualization by Two Variables
Now we create a plot for the newly created variable, `bball_percentages_by_region`. Notice that '`bball_percentages_by_region`' currently has a two-level index, since it was created by a `groupby` operation on two variables.

We can convert '`region`' back to a column with `.reset_index(level = 'region')` and then plot the result. This should be very similar to our previous visualization where the x-axis is the defender distances and the y-axis is the probability estimates; new here is the color, which encodes the region from which the shot is made.
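
Here is a small sketch of moving one index level back into the columns. The index values and the column `p` are invented for illustration.

```python
import pandas as pd

# A toy dataframe with a two-level index (hypothetical values)
idx = pd.MultiIndex.from_tuples(
    [("low", "x"), ("low", "y"), ("high", "x")],
    names=["bin", "region"],
)
toy = pd.DataFrame({"p": [0.1, 0.2, 0.3]}, index=idx)

# Move only the 'region' level into a regular column; 'bin' stays as the index.
flat = toy.reset_index(level="region")
print(flat)
```

Calling `.reset_index()` with no arguments would instead move every index level into the columns, which is useful because Altair encodings refer to columns rather than to the index.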


<!--
BEGIN QUESTION
name: q4
manual: true
points: 5
gradescope: show
-->
<!-- EXPORT TO PDF -->

In [16]:
...

Now let us check it carefully! At every shot region, the probability of making the shot increases as the defender gets further away. This is what we expected when we started, so that's good. But how can we reconcile this with the plot above? When we ignored where the shot came from, the defender distance appeared to have no effect! 

**Please discuss this phenomenon with your group members!**

## An Observation

One statistic worth checking is the number of shots in each defender bin. Are the counts similar across bins or drastically different?

To get this statistic, we again apply `.groupby` with respect to multiple columns. Specifically, for each subgroup we want to count the number of shots taken, compute the field goal percentage estimate, and compute the average defender distance. You should figure out which function to apply to which column after the data is divided into subgroups. The `.agg()` method is crucial for a concise implementation of this procedure; its reference can be found here:

https://pandas.pydata.org/pandas-docs/version/0.22.0/generated/pandas.core.groupby.DataFrameGroupBy.agg.html (you can also use the example that we linked above for how to [Group and Aggregate by One or More Columns in Pandas](https://jamesrledoux.com/code/group-by-aggregate-pandas)).
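
One way to apply different functions to different columns is sketched below on toy data. It uses named aggregation (available in pandas 0.25 and later), which is only one of several forms `.agg()` accepts and not necessarily the form this question expects. All column names here are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical columns, not the lab data)
toy = pd.DataFrame({
    "team": ["a", "a", "b"],
    "zone": ["x", "x", "y"],
    "hit": [1, 0, 1],
    "dist": [1.0, 3.0, 2.0],
})

# Each keyword names an output column and maps it to (input column, function).
summary = toy.groupby(["team", "zone"]).agg(
    hit_rate=("hit", "mean"),
    n=("hit", "count"),
    mean_dist=("dist", "mean"),
)
print(summary)
```

Passing a dictionary such as `{"hit": ["mean", "count"], "dist": "mean"}` to `.agg()` computes the same numbers but returns two-level column labels, which then need to be renamed or flattened.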


<!--
BEGIN QUESTION
name: q5
manual: false
points: 7
gradescope: show
-->

In [17]:
bball_percentages_by_region2 = ...

bball2 = bball_percentages_by_region2.reset_index('region')
bball2.columns = ['region', 'made_shot', 'count', 'defender_distance']
bball2.head(10)

## Visualize the Result

Now we create another plot (by updating our previous plot) making the points differ in both color and size: the `color` is based on the region, and the `size` of the point is encoded using the count that we just computed. 

<!--
BEGIN QUESTION
name: q6
manual: true
points: 5
gradescope: show
-->
<!-- EXPORT TO PDF -->

In [25]:
...

What do we observe in the above plot? Do you find any patterns that might explain the inconsistency between the previous two visualizations? Please discuss with your group members! 

# A More Comprehensive Basketball Shots Data

We had some fun playing around with the small dataset and explored an interesting and counterintuitive statistical phenomenon. Now here is a more comprehensive dataset. Get ready! We are going to learn more fascinating visualizations!

## Importing Data
In practice, large datasets are often stored in binary or compressed formats rather than as CSV files. Our new dataset is in the `pickle` format. Check out https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html and figure out how to load the pickle file titled `allplyrs2018.p` into a pandas dataframe called `allplyrs`, and load the pickle file titled `allshots2018.p` into a pandas dataframe called `allshots`.
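
To see how the pickle round trip works, here is a minimal sketch that writes a toy dataframe to a temporary file and reads it back. The file name `toy.p` and the dataframe contents are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe (hypothetical values)
toy = pd.DataFrame({"player": ["p1", "p2"], "shots": [3, 5]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.p"
    toy.to_pickle(path)              # serialize the dataframe to disk
    restored = pd.read_pickle(path)  # load it back as a dataframe

print(restored.equals(toy))
```

Unlike a CSV file, a pickle preserves dtypes and the index exactly. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.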


<!--
BEGIN QUESTION
name: q7
manual: false
points: 5
gradescope: show
-->

In [26]:
# If you are working on a student version, try "../data/"

allshots = ...

## 1D Investigation of the Data

We are interested in figuring out how many shots most players make and answering questions such as: are there many players who made over 1000 shots?

### Create a Histogram

A histogram of player counts versus the number of shots attempted will help us visualize this underlying information. Intuitively, we would start by counting the number of shots each player attempted, and then create a histogram of shots attempted by these players. To create the **shot_counts** variable, first apply the `.groupby` method on a particular feature; then use the `.size()` method to compute the size of each subgroup; lastly, use the `.reset_index` method and give the new column a proper name.
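
The three steps can be sketched on toy data as follows. The column name `player` and the output name `n_shots` are hypothetical.

```python
import pandas as pd

# Toy data: one row per shot (hypothetical values)
toy_shots = pd.DataFrame({"player": ["p1", "p1", "p2", "p3", "p3", "p3"]})

toy_counts = (
    toy_shots.groupby("player")    # one subgroup per player
    .size()                        # number of rows in each subgroup
    .reset_index(name="n_shots")   # back to a dataframe with a named count column
)
print(toy_counts)
```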

For the resulting histogram, label your x-axis `"Shot Counts"`, set the maximum number of bins to be 20, opacity to be 0.5, and color the bars blue. 

*Relevant links:* 

https://altair-viz.github.io/gallery/layered_histogram.html for setting maximum bins.

https://altair-viz.github.io/user_guide/customization.html for global and local customized configurations.


<!--
BEGIN QUESTION
name: q8
manual: true
points: 5
gradescope: show
-->
<!-- EXPORT TO PDF -->

In [32]:
shot_counts = ...

alt.Chart(shot_counts).mark_bar().encode(
    x = ...,
    y = ...,
).configure_mark(
    opacity = ...,
    color = ...,
)

### Plot the Estimated Density 

A smoothed version of the histogram is called a density estimate, and it is very useful in statistics. We are about to create a plot of the estimated density. Set the opacity to 0.5 and the color to red.

Check out https://altair-viz.github.io/user_guide/transform/density.html for more information. 


<!--
BEGIN QUESTION
name: q9
manual: true
points: 5
gradescope: show
-->

<!-- EXPORT TO PDF -->

In [33]:
...

### Combine Them Together
Let's overlay the above two plots now. Note that their $y$ scales are different (you can check this in the two plots above), so we need to normalize our previous histogram before overlaying them. This sounds natural, but it actually requires some work. Let us first look at a simpler example to understand how to create a normalized histogram.

In [34]:
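import pandas as pd
import altair as alt

# Ages stored as numbers so that 'age:Q' is binned as a quantitative field
source = pd.DataFrame({'age': [12, 32, 43, 54, 32, 32, 12]})

alt.Chart(source).transform_joinaggregate(
    total='count(*)'             # attach the total row count to every row
).transform_calculate(
    pct='1 / datum.total / 5'    # each row's share, divided by 5 (presumably the bin width)
).mark_bar().encode(
    alt.X('age:Q', bin=True),
    alt.Y('sum(pct):Q')          # bar height = sum of the shares in the bin
)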

Pay attention to `total='count(*)'`, `pct='1 / datum.total / 5'` and `alt.Y('sum(pct):Q')`. Figure out how the above code works and apply a similar procedure to our data.
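
The arithmetic behind the normalization can be checked outside Altair. The sketch below uses the same ages as numbers and assumes the bins are 5 units wide, which is what the `/ 5` in the example suggests. The bin edges are chosen by hand for this illustration.

```python
import numpy as np

ages = np.array([12, 32, 43, 54, 32, 32, 12])
width = 5                             # assumed bin width
edges = np.arange(10, 60, width)      # hand-picked edges: 10, 15, ..., 55

counts, _ = np.histogram(ages, bins=edges)

# Each row contributes 1 / (N * width), so a bin's height is count / (N * width).
heights = counts / (len(ages) * width)

# Height times width, summed over the bins, is the total area under the bars.
area = (heights * width).sum()

# NumPy's built-in density normalization gives the same heights.
density_heights, _ = np.histogram(ages, bins=edges, density=True)

print(area)
print(np.allclose(heights, density_heights))
```

Because the bars then enclose a total area of 1, they are on the same scale as a density estimate and the two can share a $y$-axis.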

<!--
BEGIN QUESTION
name: q10
manual: true
points: 5
gradescope: show
-->

*Hints*

1. Create the normalized histogram and the density plot as layer1 and layer2, respectively. 
2. Use alt.layer() to combine the two layers together.
<!-- EXPORT TO PDF -->

In [35]:
...

What do you observe about the distribution of shot attempts in the NBA across all players? Why might this be? Discuss in 1-3 sentences below.


<!--
BEGIN QUESTION
name: q11
manual: true
points: 5
gradescope: show
-->
<!-- EXPORT TO PDF -->

Solution:

# Running Built-in Tests
1. All tests are in the `tests` directory
1. Each Python file in `tests` is a test
1. `grader.check('testname')` runs test `'testname'`, e.g. `'q1'`
1. `grader.check_all()` runs all visible tests

In [None]:
# Run built-in checks
grader.check_all()

In [None]:
# Generate pdf in classic notebook (does not work in JupyterLab)
import nb2pdf
nb2pdf.convert('lab05.ipynb')

# To generate pdf using command-line, run in terminal,
# nb2pdf lab05.ipynb

# Submission Checklist
1. Check filename is 'lab05.ipynb'
1. Save file to confirm all changes are on disk
1. Run *Kernel > Restart & Run All* to execute all code from top to bottom
1. Check `grader.check_all()` output
1. Save file again to write any new output to disk
1. Check generated pdf that all responses are displayed correctly
1. Submit to Gradescope