# NBA Trends
In this project, you’ll analyze data from the NBA (National Basketball Association) and explore possible associations.

This data was originally sourced from 538’s Analysis of the Complete History Of The NBA and contains the original, unmodified data from Basketball Reference as well as several additional variables 538 added to perform their own analysis.

You can read more about the data and how it’s being used by 538 here. For this project we’ve limited the data to just 5 teams and 10 columns (plus one constructed column, point_diff, the difference between pts and opp_pts).

You will create several charts and tables in this project, so you’ll need to use plt.clf() between plots in your code so that the plots don’t layer on top of one another.

## Tasks


### Analyzing relationships between Quantitative and Categorical variables
##### 1. In script.py, the data has been subsetted for you into two smaller datasets: games from 2010 (named nba_2010) and games from 2014 (named nba_2014). To start, let’s focus on the 2010 data.

Suppose you want to compare the Knicks to the Nets with respect to points earned per game. Using the pts column from the nba_2010 DataFrame, create two series named knicks_pts (fran_id = "Knicks") and nets_pts (fran_id = "Nets") that represent the points each team has scored in their games.


`Hint` <br>
`You can filter the values in the DataFrame using the team names and selecting only the pts column.`

In [None]:
# Build the mask from nba_2010 itself so it lines up with the rows being filtered
knicks_pts = nba_2010.pts[nba_2010.fran_id=='____']
nets_pts = nba_2010.pts[nba_2010.fran_id=='___']
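
To illustrate the filtering pattern on its own, here is a minimal sketch on a made-up DataFrame. The `games` table and the `toy_` names are hypothetical and are not part of the project data; only the column names `fran_id` and `pts` come from the project.

```python
import pandas as pd

# Hypothetical toy data; the real project reads nba_games.csv instead.
games = pd.DataFrame({
    "fran_id": ["Knicks", "Nets", "Knicks", "Nets", "Celtics"],
    "pts": [100, 90, 110, 95, 105],
})

# The comparison yields a boolean Series; indexing with it keeps matching rows.
is_knicks = games.fran_id == "Knicks"
toy_knicks_pts = games.pts[is_knicks]
toy_nets_pts = games.pts[games.fran_id == "Nets"]

print(toy_knicks_pts.tolist())  # [100, 110]
print(toy_nets_pts.tolist())    # [90, 95]
```

Building the mask from the same DataFrame that is being indexed keeps the mask and the data aligned row for row.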

##### 2. Calculate the difference between the two teams’ average points scored and save the result as diff_means_2010. Based on this value, do you think fran_id and pts are associated? Why or why not?


`Hint` <br>
`Use the np.mean() function to calculate the mean points scored for each team. You can then take the difference of the two values.`

In [None]:
knicks_mean_score = np.mean(____) # Mean of Knicks Scores
nets_mean_score = np.mean(____) # Mean of Nets Scores
diff_means_2010 = knicks_mean_score - nets_mean_score

##### 3. Rather than comparing means, it’s useful to look at the full distribution of values to understand whether a difference in means is meaningful. Create a set of overlapping histograms that can be used to compare the points scored by the Knicks with those scored by the Nets. Use the series you created in Exercise 1 and the code below to create the plot. Do the distributions appear to be the same?


`Hint` <br>
`Fill in the code below:`

In [None]:
# density=True replaces the normed argument, which recent Matplotlib releases no longer accept
plt.hist(____, alpha=0.8, density=True, label='knicks')
plt.hist(____, alpha=0.8, density=True, label='nets')
plt.legend()
plt.show()
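
The normalization matters when the two groups have different numbers of games. As a rough sketch on made-up scores (`group_a`, `group_b`, and the bin edges are assumptions for illustration), `np.histogram` with `density=True` rescales the bar heights so that each histogram has a total area of 1; `plt.hist` applies the same kind of rescaling when `density=True` is passed.

```python
import numpy as np

# Hypothetical scores: group_a has twice as many games as group_b.
group_a = np.array([90, 95, 100, 100, 105, 110, 95, 100])
group_b = np.array([85, 90, 95, 100])

bins = np.arange(80, 121, 10)  # shared bin edges: 80, 90, 100, 110, 120

# Raw counts depend on how many games each group played.
counts_a, _ = np.histogram(group_a, bins=bins)

# Densities are rescaled so that the bars of each histogram cover an area of 1.
density_a, _ = np.histogram(group_a, bins=bins, density=True)
density_b, _ = np.histogram(group_b, bins=bins, density=True)

widths = np.diff(bins)
area_a = (density_a * widths).sum()
area_b = (density_b * widths).sum()

print(counts_a)
print(area_a, area_b)
```

Because both areas come out equal, the shapes of the two distributions can be compared directly even though the sample sizes differ.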

##### 4. Now, let’s compare the 2010 games to 2014. Replicate the steps from the previous three exercises using nba_2014. First, calculate the difference between the two teams’ mean points scored. Save and print the value as diff_means_2014. Did the difference in points get larger or smaller in 2014? Then, plot the overlapping histograms. Does the mean difference you calculated make sense?


`Hint` <br>
`Replicate the steps from Exercises 1-3 using nba_2014.`

##### 5. For the remainder of this project, we’ll focus on data from 2010. Let’s now include all teams in the dataset and investigate the relationship between franchise and points scored per game.

Using nba_2010, generate side-by-side boxplots with points scored (pts) on the y-axis and team (fran_id) on the x-axis. Is there any overlap between the boxes? Does this chart suggest that fran_id and pts are associated? Which pairs of teams, if any, earn different average scores per game?


`Hint` <br>
`You can use the boxplot function from Seaborn (commonly imported as sns) to generate the side-by-side boxplots. Modify the code below to create your boxplots.`

In [None]:
plt.clf() #to clear the previous plot
sns.boxplot(data = df, x = 'x_variable', y = 'y_variable')
plt.show()

### Analyzing relationships between Categorical variables
##### 6. The variable game_result indicates whether a team won a particular game ('W' stands for “win” and 'L' stands for “loss”). The variable game_location indicates whether a team was playing at home or away ('H' stands for “home” and 'A' stands for “away”). Do teams tend to win more games at home compared to away?

Data scientists will often calculate a contingency table of frequencies to help them determine if categorical variables are associated. Calculate a table of frequencies that shows the counts of game_result and game_location.

Save your result as location_result_freq and print your result. Based on this table, do you think the variables are associated?


`Hint` <br>
`You can use the crosstab function from pandas to create a contingency table. Fill in the code below with the correct variables.`

In [None]:
location_result_freq = pd.crosstab(nba_2010.____, nba_2010.____)
print(location_result_freq)
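
As a small illustration of what `pd.crosstab` returns, the sketch below uses six made-up games. The `toy` DataFrame is hypothetical; only the column names match the project data.

```python
import pandas as pd

# Hypothetical results for six games.
toy = pd.DataFrame({
    "game_result":   ["W", "W", "L", "W", "L", "L"],
    "game_location": ["H", "H", "H", "A", "A", "A"],
})

# Rows are the values of the first argument, columns the values of the second.
toy_freq = pd.crosstab(toy.game_result, toy.game_location)
print(toy_freq)

# Individual cells can be read by label: wins at home, losses away.
wins_at_home = toy_freq.loc["W", "H"]
losses_away = toy_freq.loc["L", "A"]
```

Each cell counts the games with that combination of result and location, so the cells sum to the number of rows in the data.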

##### 7. Convert this table of frequencies to a table of proportions and save the result as location_result_proportions. Print your result.


`Hint` <br>
`You can convert your table of frequencies to a table of proportions by dividing the values in location_result_freq by the total observations.`

In [None]:
location_result_proportions = location_result_freq/len(____)
print(location_result_proportions)

##### 8. Using the table of frequencies created in Exercise 6 (location_result_freq), calculate the expected contingency table (if there were no association) and the Chi-Square statistic and print your results. Does the actual contingency table look similar to the expected table, or different? Based on this output, do you think there is an association between these variables?


`Hint` <br>
`Use the chi2_contingency() function to see the expected table and Chi-Square statistic. The input to chi2_contingency is a contingency table of counts like the one you created earlier (step 6). Pass the frequencies rather than the proportions, because the Chi-Square statistic depends on the number of observations.`

In [None]:
chi2, pval, dof, expected = chi2_contingency(____)
print(expected)
print(chi2)
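
To show where the expected table comes from, the sketch below computes it by hand for made-up counts (the `observed` array is hypothetical). Under no association, each expected cell is its row total times its column total, divided by the overall total. Note that for a 2x2 table `chi2_contingency` applies a continuity correction by default, so its statistic can be somewhat smaller than the uncorrected value computed here; the expected table is the same either way.

```python
import numpy as np

# Hypothetical counts: rows are results (L, W), columns are locations (A, H).
observed = np.array([[30, 20],
                     [20, 30]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()

# Expected counts if result and location were unrelated.
expected_by_hand = row_totals * col_totals / n

# Uncorrected Chi-Square statistic: sum of (observed - expected)^2 / expected.
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()

print(expected_by_hand)
print(chi2_by_hand)
```

The further the observed counts sit from the expected counts, the larger the statistic becomes.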

### Analyzing Relationships Between Quantitative Variables
##### 9. For each game, 538 has calculated the probability that each team will win the game. In the data, this is saved as forecast. The point_diff column gives the margin of victory/defeat for each team (positive values mean that the team won; negative values mean that they lost). Did teams with a higher probability of winning (according to 538) also tend to win games by more points?

Using nba_2010, calculate the covariance between forecast (538’s projected win probability) and point_diff (the margin of victory/defeat) in the dataset. Save and print your result. Looking at the matrix, what is the covariance between these two variables?


`Hint` <br>
`Use the np.cov() function to calculate the covariance. Pass the DataFrame columns forecast and point_diff as arguments to np.cov(). The result is a 2x2 matrix: the diagonal holds the variance of each variable, and the covariance between the two variables is the number that appears twice, in the off-diagonal positions.`
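
The layout of the matrix can be checked on a few made-up values. In this sketch, `x` and `y` are hypothetical stand-ins for forecast and point_diff.

```python
import numpy as np

# Hypothetical win probabilities and point differentials.
x = np.array([0.2, 0.4, 0.6, 0.8])
y = np.array([-10.0, -2.0, 3.0, 9.0])

cov_matrix = np.cov(x, y)
print(cov_matrix)

# Layout: [[var(x), cov(x, y)],
#          [cov(x, y), var(y)]]
cov_xy = cov_matrix[0, 1]
var_x = cov_matrix[0, 0]
print(cov_xy)
```

By default `np.cov` divides by n - 1, so the diagonal entries match the sample variances.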

##### 10. Using nba_2010, calculate the correlation between forecast and point_diff. Save and print your result. Does this value suggest an association between the two variables?


`Hint` <br>
`Use pearsonr from the scipy.stats package to calculate correlation. Fill in the code below to calculate the association between these two variables.`

In [None]:
point_diff_forecast_corr = pearsonr(nba_2010._____, nba_2010._____)
print(point_diff_forecast_corr)
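
The correlation is the covariance rescaled by the two standard deviations, which is why it always falls between -1 and 1. The sketch below checks this on made-up values using only NumPy (`x` and `y` are hypothetical); `pearsonr` reports the same coefficient along with a p-value.

```python
import numpy as np

# Hypothetical win probabilities and point differentials.
x = np.array([0.2, 0.4, 0.6, 0.8])
y = np.array([-10.0, -2.0, 3.0, 9.0])

# Pearson correlation: covariance divided by the product of the sample
# standard deviations (ddof=1 matches the n - 1 divisor used by np.cov).
cov_xy = np.cov(x, y)[0, 1]
r_by_hand = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# np.corrcoef returns the full correlation matrix; take the off-diagonal entry.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_by_hand, r_numpy)
```

Unlike the covariance, this value does not change if either variable is rescaled, which makes it easier to interpret.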

##### 11. Generate a scatter plot of forecast (on the x-axis) and point_diff (on the y-axis). Does the correlation value make sense?


`Hint` <br>
`Use the plt.scatter() function and fill the correct variable names in below to generate a scatterplot.`

In [None]:
plt.clf() #to clear the previous plot
plt.scatter('____', '____', data=nba_2010)
plt.xlabel('Forecasted Win Prob.')
plt.ylabel('Point Differential')
plt.show()

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns

import codecademylib3
np.set_printoptions(suppress=True, precision = 2)

nba = pd.read_csv('./nba_games.csv')

# Subset Data to 2010 Season, 2014 Season
nba_2010 = nba[nba.year_id == 2010]
nba_2014 = nba[nba.year_id == 2014]

print(nba_2010.head())
print(nba_2014.head())

# 2010
# Task 1 
# Compare the Knicks to the Nets with respect to points earned per game.
# The mask is built from nba_2010 so it aligns with the rows being filtered.
knicks_pts_2010 = nba_2010.pts[nba_2010.fran_id=='Knicks']
nets_pts_2010 = nba_2010.pts[nba_2010.fran_id=='Nets']

print(knicks_pts_2010)
print(nets_pts_2010)

# Task 2 
# Difference between the two teams’ average points scored
knicks_mean_score_2010 = np.mean(knicks_pts_2010)
nets_mean_score_2010 = np.mean(nets_pts_2010)
diff_means_2010 = knicks_mean_score_2010 - nets_mean_score_2010
print(knicks_mean_score_2010)
print(nets_mean_score_2010)
print(diff_means_2010)

# Task 3 
# Create a set of overlapping histograms that can be used to compare the points scored
# density=True replaces the normed argument, which recent Matplotlib releases no longer accept
plt.hist(knicks_pts_2010, alpha=0.8, density=True, label='knicks')
plt.hist(nets_pts_2010, alpha=0.8, density=True, label='nets')
plt.legend()
plt.show()
plt.clf()

# Task 4 Compare 2014
# Compare the Knicks to the Nets with respect to points earned per game.
# The mask is built from nba_2014 so it aligns with the rows being filtered.
knicks_pts_2014 = nba_2014.pts[nba_2014.fran_id=='Knicks']
nets_pts_2014 = nba_2014.pts[nba_2014.fran_id=='Nets']

print(knicks_pts_2014)
print(nets_pts_2014)

# Difference between the two teams’ average points scored
knicks_mean_score_2014 = np.mean(knicks_pts_2014)
nets_mean_score_2014 = np.mean(nets_pts_2014)
diff_means_2014 = knicks_mean_score_2014 - nets_mean_score_2014
print(knicks_mean_score_2014)
print(nets_mean_score_2014)
print(diff_means_2014)

# Create a set of overlapping histograms that can be used to compare the points scored
# density=True replaces the normed argument, which recent Matplotlib releases no longer accept
plt.hist(knicks_pts_2014, alpha=0.8, density=True, label='knicks')
plt.hist(nets_pts_2014, alpha=0.8, density=True, label='nets')
plt.legend()
plt.show()
plt.clf()

# Task 5
# Investigate the relationship between franchise and points scored per game
# Task 5 asks for 2010 games only, with team on the x-axis and points on the y-axis
sns.boxplot(data = nba_2010, x = 'fran_id', y = 'pts')
plt.show()
plt.clf()

# Task 6
# Investigate whether teams tend to win more games at home compared to away?
location_result_freq = pd.crosstab(nba_2010.game_result, nba_2010.game_location)
print(location_result_freq)

# Task 7 
# Convert this table of frequencies to a table of proportions
location_result_proportions = location_result_freq/len(nba_2010)
print(location_result_proportions)

# Task 8 
# Calculate the expected contingency table (if there were no association) and the Chi-Square statistic and print your results
# chi2_contingency expects observed counts, so pass the frequency table rather than the proportions
chi2, pval, dof, expected = chi2_contingency(location_result_freq)
print(expected)
print(chi2)

# Task 9
# Using nba_2010, calculate the covariance between forecast (538’s projected win probability) and point_diff (the margin of victory/defeat) in the dataset.
covariance_forecast_point_diff = np.cov(nba_2010.forecast, nba_2010.point_diff)
print("covariance matrix: ")
print(covariance_forecast_point_diff)

# Task 10
# Using nba_2010, calculate the correlation between forecast and point_diff. Save and print your result.
correlation_forecast_point_diff = pearsonr(nba_2010.forecast, nba_2010.point_diff)
print(correlation_forecast_point_diff)

# Task 11
# Generate a scatter plot of forecast (on the x-axis) and point_diff (on the y-axis). Does the correlation value make sense?

plt.clf()
plt.scatter('forecast', 'point_diff', data=nba_2010)
plt.xlabel('Forecasted Win Prob.')
plt.ylabel('Point Differential')
plt.show()