In [None]:
%reload_ext nb_black

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

data_url = "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/nba_player_seasons.csv"

* Read the data located at `data_url` into a `pandas` dataframe

Get to know the data. What should we explore?

* List of things I must insist we explore (the rest is up to you)
  * The number of rows/columns
  * The datatypes of each column
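
As a minimal sketch of these first checks: the tiny CSV below is a hypothetical stand-in so the example runs offline, and its column names are assumptions in the style of the dataset. With the real data you would call `pd.read_csv(data_url)` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file (assumption: not the actual data).
# With the notebook's data, use: nba = pd.read_csv(data_url)
csv_text = """Player,Tm,PTS,FG%
A,BOS,10.5,0.45
B,LAL,22.1,0.51
C,BOS,3.2,
"""
toy = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = toy.shape  # .shape is a (rows, columns) tuple
print(n_rows, n_cols)

print(toy.dtypes)  # one dtype per column
toy.info()         # dtypes plus non-null counts in a single view
```

`.head()` and `.sample()` are also handy for eyeballing a few rows before going further.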
  

## What is a 'typical' value of the points column (`'PTS'`)?

Present your answer(s) with proof:
* numerically
  * generate at least 2 summary metrics that can be thought of as 'typical'
* graphically
  * what plot type might we use?
  * add your numeric values to the plot
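
One possible way to approach the numeric part, sketched on made-up values rather than the real `'PTS'` column. The mean, median, and mode are three common candidates for 'typical', and they can disagree when the data is skewed.

```python
import pandas as pd

# Hypothetical points values (assumption: not taken from the real dataset).
pts = pd.Series([2.0, 4.0, 4.0, 6.0, 8.0, 30.0], name="PTS")

mean_pts = pts.mean()          # pulled upward by the 30.0 outlier
median_pts = pts.median()      # middle value, robust to outliers
mode_pts = pts.mode().iloc[0]  # most frequent value (first one if tied)

print(mean_pts, median_pts, mode_pts)
```

On a histogram of the same column, each value could be marked with a vertical line, for example with `plt.axvline(mean_pts)`.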

A note about `sns.distplot` (and plots like it): just like a histogram or a bar plot, a `distplot` uses height to show which values are the most probable.  A histogram or bar plot will often use counts to show how probable each value is (the taller, the more likely).

A `distplot` uses a 'kernel density estimate' to show the probability density of each value rather than a count.  The area under a probability density curve equals 1 by definition, so it can be surprising to see values greater than 1 on the y-axis of a plot like the one below.

Remember that area depends on more than height.  For example, our plot is built from rectangles whose area is `height * width`.  The width is the missing piece that explains how heights can exceed 1: when the bars are narrower than one unit, tall bars can still add up to a total area of 1.  You can think of density as a probability per unit rather than a raw probability.

[Here](https://stats.stackexchange.com/a/4223/102646) is a great, more in-depth explanation.
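
The `height * width` argument can be checked numerically. This is a small sketch using `numpy` on simulated values (an assumption, not the NBA data): values confined to an interval of width 0.5 produce bar heights around 2, yet the total area is still 1.

```python
import numpy as np

# Simulated values confined to [0, 0.5) (assumption: illustrative only).
rng = np.random.default_rng(0)
values = rng.uniform(0.0, 0.5, size=1_000)

# density=True scales the bar heights so the total bar area equals 1.
heights, edges = np.histogram(values, bins=5, range=(0.0, 0.5), density=True)
widths = np.diff(edges)  # each bar is 0.1 wide

area = float(np.sum(heights * widths))

print(heights)  # heights near 2, i.e. greater than 1
print(area)     # total area is 1 (up to floating-point error)
```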

In [None]:
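# `sns.distplot` is deprecated since seaborn 0.11; on newer versions,
# sns.histplot(nba["ORB"], kde=True, stat="density") draws a comparable plot.
sns.distplot(nba["ORB"])
plt.show()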

## What shot percentage stat has the most variation?

Before we do that:
* Create a dataframe that contains only the columns with `'%'` in the column name. Name this data frame `percents`

* This data has missing values; the code below counts them per column.  Can you explain why values would be missing here?

In [None]:
percents.isna().sum()

* Drop NAs from this `percents` dataframe

In [None]:
percents = percents.dropna()
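
The steps in this section (select the `'%'` columns, count missing values, drop them) can be sketched on a toy frame. The data and the `toy_` names below are hypothetical. One plausible reason for the missing values, shown by player `B`, is that a percentage is undefined when a player has zero attempts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption: column names only mimic the dataset).
toy = pd.DataFrame(
    {
        "Player": ["A", "B", "C"],
        "FGA": [10, 0, 5],          # field goal attempts
        "FG": [4, 0, 3],            # field goals made
        "FG%": [0.4, np.nan, 0.6],  # B has 0 attempts, so 0 / 0 is undefined
        "FT%": [0.8, 0.75, np.nan],
    }
)

# Keep only the columns whose name contains a literal "%".
toy_percents = toy.loc[:, toy.columns.str.contains("%", regex=False)]

missing_per_column = toy_percents.isna().sum()
toy_percents_clean = toy_percents.dropna()  # drops rows with any missing value

print(missing_per_column)
print(toy_percents_clean)
```

Note that `.dropna()` removes a row if any of its columns is missing, so it can discard more rows than the per-column counts suggest.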

Back to the original question: What shot percentage stat has the most variation?

* What metric(s) can we use for this?
* What plot type(s) can we use to show this?

## Using `.describe()` with `pandas`

* `.describe()` might have already come up depending on how we answered the above questions
* Let's explore `.describe()`'s options using `?` and `help()`
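
As a starting point for that exploration, here is a sketch of two `.describe()` parameters, `percentiles` and `include`, applied to a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy frame (assumption: not the real dataset).
toy = pd.DataFrame(
    {
        "Tm": ["BOS", "LAL", "BOS", "NYK"],
        "PTS": [10.0, 20.0, 30.0, 40.0],
    }
)

default_summary = toy.describe()  # numeric columns only by default
custom_summary = toy.describe(percentiles=[0.1, 0.9])  # choose the percentiles
all_summary = toy.describe(include="all")  # also summarizes text columns

print(default_summary)
print(custom_summary)
print(all_summary)
```

With `include="all"`, text columns get rows such as `unique`, `top`, and `freq`, while the numeric rows are left as `NaN` for them.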

## Descriptive statistics with `groupby`

Sometimes we want descriptive statistics grouped by a categorical column in our data.  For example, instead of the average of the `'PTS'` column for our full dataset, maybe we want to see the average of the `'PTS'` column for each player.

* Calculate the average `'PTS'` grouped by `'Player'`

* Calculate the average, standard deviation, and count of `'PTS'` for each `'Tm'` (team)
* Sort this output in descending order by average points
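
One way these grouped summaries might look, sketched on a hypothetical toy frame that reuses the `'Player'`, `'Tm'`, and `'PTS'` column names from the text.

```python
import pandas as pd

# Hypothetical player seasons (assumption: illustrative values only).
toy = pd.DataFrame(
    {
        "Player": ["A", "A", "B", "B", "C"],
        "Tm": ["BOS", "BOS", "LAL", "LAL", "BOS"],
        "PTS": [10.0, 20.0, 30.0, 40.0, 6.0],
    }
)

# Average points per player.
avg_by_player = toy.groupby("Player")["PTS"].mean()

# Mean, standard deviation, and count per team, highest average first.
team_stats = (
    toy.groupby("Tm")["PTS"]
    .agg(["mean", "std", "count"])
    .sort_values("mean", ascending=False)
)

print(avg_by_player)
print(team_stats)
```

Passing a list of function names to `.agg()` produces one output column per statistic, which is what makes the final `sort_values("mean", ...)` possible.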

## Correlations

If we want to explore the relationship between 2 numeric columns we might use a correlation.  The correlation coefficient between 2 numeric columns always falls in the range `[-1, 1]`.

* A correlation of -1 is a perfect negative correlation; values near -1 indicate a strong negative relationship
  * For example, `amount of money spent` and `amount of money saved` would be negatively correlated.  As the `amount of money spent` goes up the `amount of money saved` would go down and vice versa.
* A correlation of 0 means no linear relationship; values near 0 indicate a weak one
  * For example, the `number of words in the Harry Potter books` and the `number of arrests in Costa Rica` are likely not very related.
* A correlation of 1 is a perfect positive correlation; values near 1 indicate a strong positive relationship
  * For example, `amount of ice cream sold` and `temperature` are likely positively correlated.  As the `temperature` goes up the `amount of ice cream sold` likely goes up as well.
  
Note the classic phrase: "correlation does not equal causation".  Just because 2 things are related doesn't mean one causes the other.  [This site](https://www.tylervigen.com/spurious-correlations) has examples of 'spurious' correlations, that is, cases where the 2 variables appear related but likely aren't.

To run a correlation on a `pandas` dataframe we can use `.corr()`.

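* Apply `.corr()` to our dataframe (in `pandas` 2.0 and later, pass `numeric_only=True` so that text columns such as `'Player'` are skipped instead of raising an error)
* Which numeric variables are related?  Is each relationship positive or negative?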
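
To see what `.corr()` returns, here is a sketch on hypothetical columns that echo the examples above. The values are constructed to move in perfect lockstep, so the correlations come out at the extremes of the range.

```python
import pandas as pd

# Hypothetical values (assumption: built to be perfectly linear).
toy = pd.DataFrame(
    {
        "temperature": [10, 20, 30, 40],
        "ice_cream_sold": [15, 25, 35, 45],  # rises with temperature
        "money_saved": [40, 30, 20, 10],     # falls as temperature rises
    }
)

# Pairwise Pearson correlations between all columns (the default method).
corr_matrix = toy.corr()

print(corr_matrix)
```

The result is a square, symmetric table with 1s on the diagonal. A heatmap such as `sns.heatmap(corr_matrix, annot=True)` is a common way to scan it.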

## Creating your own statistics

Sometimes you might create a metric of your own to summarize a record.  For example, we all have a credit score that is a combination of a lot of separate metrics.  Metrics of this style are sometimes called an index (e.g., a financial index); note that this is different from a `pandas` index.

How might we create a player rating metric for our data?
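
One possible approach, offered as a sketch rather than a recommended formula: standardize a few stats so they share a scale, then combine them with weights. The column names `'AST'` and `'TRB'`, the values, and the weights are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical per-player stats (assumption: illustrative values only).
toy = pd.DataFrame(
    {
        "PTS": [10.0, 20.0, 30.0],
        "AST": [8.0, 2.0, 5.0],
        "TRB": [4.0, 10.0, 7.0],
    },
    index=["A", "B", "C"],
)

# z-scores put every stat on the same scale (mean 0, std 1 per column).
z_scores = (toy - toy.mean()) / toy.std()

# Hypothetical weights (assumption): points count twice as much as the others.
weights = pd.Series({"PTS": 0.5, "AST": 0.25, "TRB": 0.25})

# Weighted sum across columns gives one rating per player.
rating = (z_scores * weights).sum(axis=1)

print(rating.sort_values(ascending=False))
```

Standardizing first keeps a stat with large raw values, like points, from dominating the rating purely because of its units. The weights then express how much each stat should matter.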