In this mission, we'll be calculating statistics using data from the National Basketball Association (NBA).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
from scipy.stats import skew
from scipy.stats import kurtosis

In [3]:
nba_stats = pd.read_csv("nba_2013.csv")

In [4]:
nba_stats.head()

,player,pos,age,bref_team_id,g,gs,mp,fg,fga,fg.,...,drb,trb,ast,stl,blk,tov,pf,pts,season,season_end
0,Quincy Acy,SF,23,TOT,63,0,847,66,141,0.468,...,144,216,28,23,26,30,122,171,2013-2014,2013
1,Steven Adams,C,20,OKC,81,20,1197,93,185,0.503,...,190,332,43,40,57,71,203,265,2013-2014,2013
2,Jeff Adrien,PF,27,TOT,53,12,961,143,275,0.52,...,204,306,38,24,36,39,108,362,2013-2014,2013
3,Arron Afflalo,SG,28,ORL,73,73,2552,464,1011,0.459,...,230,262,248,35,3,146,136,1330,2013-2014,2013
4,Alexis Ajinca,C,25,NOP,56,30,951,136,249,0.546,...,183,277,40,23,46,63,187,328,2013-2014,2013


In [5]:
nba_stats.columns

Index(['player', 'pos', 'age', 'bref_team_id', 'g', 'gs', 'mp', 'fg', 'fga',
       'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft',
       'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf',
       'pts', 'season', 'season_end'],
      dtype='object')

In [6]:
nba_stats[["age","pts","ast","fg.","pf"]].describe()

,age,pts,ast,fg.,pf
count,481.0,481.0,481.0,479.0,481.0
mean,26.509356,516.582121,112.536383,0.436436,105.869023
std,4.198265,470.422228,131.019557,0.098672,71.213627
min,19.0,0.0,0.0,0.0,0.0
25%,23.0,115.0,20.0,0.4005,44.0
50%,26.0,401.0,65.0,0.438,104.0
75%,29.0,821.0,152.0,0.4795,158.0
max,39.0,2593.0,721.0,1.0,273.0


While we've looked at the mean briefly before, it has an interesting property we'd like to point out here.

If we subtract the mean of a set of numbers from each number in that set, the differences will always add up to zero.

That's because the mean is the "center" of the data. All of the differences that are negative will always cancel out all of the differences that are positive. Let's look at some examples to verify this.

Let's also become familiar with the mathematical symbol for the mean:

$\Huge\mu_{\boldsymbol{x}}$

This symbol means "the average of all of the values in x." The fact that x is lowercase and in bold indicates that it's a vector.

You may also see the mean written with a bar:

$\Huge\overline{\boldsymbol{x}}$

The bar over the top indicates "the average of".
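
Using this notation, the zero-sum property described above can be checked in one line. This is a standard identity, written here for a vector with $n$ elements $x_1, \dots, x_n$ (the index $i$ and the count $n$ are notation introduced for this sketch). Since $\mu_x$ is the sum of the values divided by $n$, the sum of the values equals $n\mu_x$:

$\displaystyle\sum_{i=1}^{n}(x_i - \mu_x) = \sum_{i=1}^{n}x_i - n\mu_x = n\mu_x - n\mu_x = 0$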

In [7]:
# Make a list of values
values = [2, 4, 5, -1, 0, 10, 8, 9]

# Compute the mean of the values
values_mean = sum(values) / len(values)
print(values_mean)

# Find the difference between each of the values and the mean by subtracting the mean from each value.
differences = [i - values_mean for i in values]
print(differences)

# This equals 0.  If you'd like, try changing the values around to verify that it still equals 0.
print(sum(differences))

4.625
[-2.625, -0.625, 0.375, -5.625, -4.625, 5.375, 3.375, 4.375]
0.0
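
If you do try other values, keep in mind that floating-point rounding can leave a tiny nonzero remainder instead of an exact `0.0`. The sketch below is one way to check the property with a tolerance; the helper name `deviations_sum`, the sample values, and the tolerance of `1e-9` are all assumptions made for illustration.

```python
import math


def deviations_sum(values):
    """Return the sum of (value - mean) over all values."""
    mean = sum(values) / len(values)
    return sum(v - mean for v in values)


# Decimal fractions such as 0.1 can't be stored exactly as floats,
# so the result may be a very small number rather than exactly 0.0.
total = deviations_sum([0.1, 0.2, 0.3, 0.7])
print(total)

# Compare against zero with an absolute tolerance instead of ==.
print(math.isclose(total, 0.0, abs_tol=1e-9))
```

Note that `math.isclose` needs `abs_tol` when comparing against zero, because its default relative tolerance never treats a nonzero number as close to zero.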


In [8]:
# Find the median of the values list. Assign the result to values_median.
values_median = np.median(values)
print(values_median)

# Subtract the median from each element in values.
differences = [i - values_median for i in values]
print(differences)

# Sum up all of the differences, and assign the result to median_difference_sum.
median_difference_sum = sum(differences)
print(median_difference_sum)

4.5
[-2.5, -0.5, 0.5, -5.5, -4.5, 5.5, 3.5, 4.5]
1.0
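
The differences from the median don't cancel out here because the median (4.5) isn't equal to the mean (4.625). As a rough check, the sum of the differences from any number `m` works out to `n * (mean - m)`, which is zero only when `m` is the mean. The sketch below compares the two calculations; the names `direct_sum` and `shortcut` are introduced just for this illustration.

```python
import numpy as np

values = [2, 4, 5, -1, 0, 10, 8, 9]

values_mean = sum(values) / len(values)
values_median = float(np.median(values))

# Sum of the differences from the median, computed directly.
direct_sum = sum(v - values_median for v in values)

# The same quantity computed as n * (mean - median).
shortcut = len(values) * (values_mean - values_median)

print(direct_sum, shortcut)
```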


Let's look at **variance** in the data. Variance tells us how concentrated or "spread out" the data is around the mean.

We looked at kurtosis earlier, which measures the shape of a distribution. Variance measures spread directly: it's the average squared distance between each data point and the mean.

We calculate variance by subtracting the mean from every value, squaring the results, and then averaging them. Mathematically, this looks like this:

$\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu _x)^2} {n}$

$\sigma^2$ is variance, $x_i$ is the i-th element of the vector, $\mu_x$ is its mean, and $\sum_{i=1}^{n}$ means "the sum from 1 to n", where n is the number of elements in the vector.

This formula goes through the exact same process we just described, and is the most common way to represent it.

In [9]:
# We've already loaded the NBA data into the nba_stats variable.
# Find the mean value of the column.
pf_mean = nba_stats["pf"].mean()

# Initialize variance at zero.
variance = 0

# Loop through each item in the "pf" column.
for p in nba_stats["pf"]:
    # Calculate the difference between the value and the mean.
    difference = p - pf_mean
    # Square the difference. This ensures that the result isn't negative.
    # If we didn't square the differences, they would cancel out and the total would be zero.
    # ** in Python means "raise whatever comes before this to the power of whatever number is after this."
    square_difference = difference ** 2
    # Add the squared difference to the running total.
    variance += square_difference
    
# Average the total to find the final variance.
variance = variance / len(nba_stats["pf"])

print(variance)

5060.83731485
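
The loop above can also be written as a single vectorized expression. The sketch below compares both versions on a small stand-in array, so it runs without `nba_2013.csv`. The array holds the five `pf` values from the `head()` output shown earlier, and the names `pf_sample`, `loop_variance`, and `vectorized_variance` are introduced only for this example.

```python
import numpy as np

# The five "pf" values shown in the head() output, used as a small
# stand-in for nba_stats["pf"] so the sketch runs without the CSV file.
pf_sample = np.array([122, 203, 108, 136, 187], dtype=float)
sample_mean = pf_sample.mean()

# Loop version, mirroring the cell above.
total = 0.0
for p in pf_sample:
    total += (p - sample_mean) ** 2
loop_variance = total / len(pf_sample)

# Vectorized version: subtract the mean from every element at once,
# square element-wise, then average.
vectorized_variance = ((pf_sample - sample_mean) ** 2).mean()

print(loop_variance, vectorized_variance)
```

Both approaches divide by the number of elements, so they should agree up to floating-point rounding.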


In [10]:
# Compute the variance of the data set's "pts" column, which holds the total number of points each player scored.
# Assign the result to point_variance.
point_variance = np.var(nba_stats["pts"])

print(point_variance)

220836.99585496247
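
One detail worth knowing: `np.var` divides by n by default (`ddof=0`), which matches the formula above, while the pandas `Series.var` method divides by n - 1 by default (`ddof=1`). The sketch below shows the difference on a small stand-in Series built from the five `pts` values in the `head()` output; the variable names are introduced only for this example.

```python
import numpy as np
import pandas as pd

# The five "pts" values from the head() output, used as a small
# stand-in for nba_stats["pts"] so the sketch runs without the CSV file.
pts_sample = pd.Series([171, 265, 362, 1330, 328])

# NumPy's default: divide by n (population variance, ddof=0).
population_variance = np.var(pts_sample)

# pandas' default: divide by n - 1 (sample variance, ddof=1).
sample_variance = pts_sample.var()

# Passing ddof=0 makes pandas match the formula used in this mission.
matched_variance = pts_sample.var(ddof=0)

print(population_variance, sample_variance, matched_variance)
```

With 481 rows the two defaults give similar numbers, but on small data sets the gap is noticeable, so it helps to be explicit about `ddof` when comparing results.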


We've been multiplying and dividing values, but we haven't really discussed the order of operations yet.