# Data and Distributions

__Purpose:__ This lecture covers introductory Statistics topics: data types, distributions, and graphing.

__At the end of this lecture you will be able to:__
> 1. Understand data types and distributions 
> 2. Graph data

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import math 
import random
import matplotlib.pyplot as plt
%matplotlib inline

## 1.1 Data and Distributions:

### 1.1.1 Types of Data:

__Overview:__ 
- In general, we can distinguish between two types of data:
> 1. __[Qualitative Data](https://en.wikipedia.org/wiki/Qualitative_property):__ Qualitative Data takes its values from a set of discrete categories, also known as levels. Qualitative Data can be further classified into four levels of measurement as described below
> 2. __[Quantitative Data](https://en.wikipedia.org/wiki/Quantitative_research):__ Quantitative Data is also known as Numerical Data because it contains numbers as counts or measurements 

### 1.1.1.1 Qualitative Data:

__Overview:__
- Qualitative Data can be further broken down into four categories: 
> 1. __[Nominal](https://en.wikipedia.org/wiki/Level_of_measurement#Nominal_level):__ Nominal (Categorical) data includes levels that can be differentiated only by the names of the categories, not by any implicit ordering
>> - Nominal data can be compared using $=$ and $\neq$ and can be grouped into categories
>> - For example, Nominal Data includes Gender, States of the United States of America, Eye Color, Marital Status, etc.
>> - We CAN'T say one level is less than or greater than another 
> 2. __[Ordinal](https://en.wikipedia.org/wiki/Level_of_measurement#Ordinal_scale):__ Ordinal data includes levels that can be differentiated based on an implicit ordering, but the distances between the levels are unknown
>> - Ordinal data can be compared using $>$ and $<$ and can be sorted
>> - For example, Ordinal data includes responses on a Likert Scale (Like, Like Somewhat, Neutral, Dislike Somewhat, Dislike)
>> - We CAN'T say level 1 is less than level 2 by a difference of 1 
> 3. __[Interval](https://en.wikipedia.org/wiki/Level_of_measurement#Interval_scale):__ Interval data includes levels that can be differentiated based on an implicit ordering AND the distances between the levels are known, but the ratio between levels is not meaningful because the zero point is arbitrary
>> - Interval data can be added and subtracted using $+$ and $-$, so the difference between two levels can be measured
>> - For example, Interval data includes temperature on the Celsius Scale, Date when measured from a specific start point, and location in Cartesian coordinates 
>> - We CAN'T say one level is twice as large as another (20 °C is not twice as hot as 10 °C)
> 4. __[Ratio](https://en.wikipedia.org/wiki/Level_of_measurement#Ratio_scale):__ Ratio data includes levels that can be differentiated based on an implicit ordering AND the distances between the levels are known AND the ratio between the levels is known because there is a unique and non-arbitrary zero value
>> - Ratio data can be multiplied and divided using $*$ and $/$, so the ratio between two levels can be measured
>> - For example, Ratio data includes temperature on the Kelvin Scale and most measurements in Physics (mass, energy, etc.) 
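
The difference between nominal and ordinal levels can be sketched with pandas categoricals. The survey values below are hypothetical and only illustrate which comparisons each level supports; this is a minimal sketch, not part of the NBA analysis.

```python
import pandas as pd

# Hypothetical survey responses (illustrative data only)
eye_color = pd.Categorical(["brown", "blue", "green", "brown"])  # nominal: no order
rating = pd.Categorical(
    ["Neutral", "Like", "Dislike", "Like"],
    categories=["Dislike", "Neutral", "Like"],
    ordered=True,  # ordinal: the categories have a defined order
)

# Nominal data supports equality checks and grouping
same_as_brown = eye_color == "brown"
color_counts = pd.Series(eye_color).value_counts()

# Ordinal data additionally supports ordering comparisons and sorting
at_least_neutral = rating >= "Neutral"
lowest_rating = rating.min()

print(same_as_brown)
print(color_counts)
print(at_least_neutral)
print(lowest_rating)
```

Note that the ordered categorical still carries no information about the distance between "Dislike" and "Neutral", which is exactly what separates ordinal from interval data.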

### Problem 1:

Give an example of each type of qualitative data above (other than the examples that are mentioned in the definition)

In [None]:
# write your answer here 





### 1.1.1.2 Quantitative Data:

__Overview:__
- Quantitative Data can be further broken into two categories: 
> 1. __[Discrete Data](https://en.wikipedia.org/wiki/Continuous_or_discrete_variable#Discrete_variable):__ Discrete data can only take particular values over a range
>> - For example, number of three point shots attempted, made, and missed
> 2. __[Continuous Data](https://en.wikipedia.org/wiki/Continuous_or_discrete_variable#Continuous_variable):__ Continuous data are not restricted to defined values over a range, but instead can take on any value over a range 
>> - For example, height and weight 

__Helpful Points:__
1. Technically, Categorical Data mentioned above is also an example of Discrete Data 
2. It is possible to group Continuous Data into distinct, discrete categories known as bins (e.g. 100 - 120 cm, 120 - 130 cm, 130 - 150 cm, > 150 cm, etc.). The bins should share their edges so that no value falls into a gap between them
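
Binning continuous data can be sketched with `pd.cut`. The heights, bin edges, and labels below are assumptions chosen for illustration; each bin includes its right edge, so a value of exactly 120 falls into the first bin.

```python
import pandas as pd

# Hypothetical heights in cm (illustrative values only)
heights = pd.Series([104.5, 118.0, 125.3, 131.0, 149.9, 162.4])

# Assumed bin edges; each bin is right-inclusive, e.g. (100, 120]
edges = [100, 120, 130, 150, float("inf")]
labels = ["100-120 cm", "120-130 cm", "130-150 cm", "> 150 cm"]

# Convert the continuous values into ordered, discrete categories
height_bins = pd.cut(heights, bins=edges, labels=labels)

# Count how many observations fall into each bin, keeping the bin order
bin_counts = height_bins.value_counts(sort=False)
print(bin_counts)
```

The result is an ordered categorical, so the binned heights behave like ordinal data.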

### 1.1.2 Graphing Data:

__Overview:__ 
- Depending on the type of data, there exist different types of graphs that we can use to explore the properties of the data
> 1. __Qualitative Data:__ Qualitative Data can be summarized using the following methods:
>> a. Pie Charts <br>
>> b. Bar Charts 
> 2. __Quantitative Data:__ Quantitative Data can be summarized using the following methods: 
>> a. Histograms <br>
>> b. Scatter Plots <br>
>> c. Line Graph 

__Helpful Points:__
1. There is considerable debate about which type of graph should be chosen, and many best-practice guides exist. Feel free to read more about this online; [here](https://blog.hubspot.com/marketing/types-of-graphs-for-data-visualization) is one such resource

__Practice:__ Examples of Graphing Data in Python 

The NBA data will be used here to perform Statistical Analysis.

In [None]:
# read in data to analyze 
nba_df = pd.read_csv("NBA_GameLog_2010_2017.csv")

In [None]:
# view the data 
nba_df.head(10)

In [None]:
# convert to datetime
nba_df['Date'] = pd.to_datetime(nba_df['Date'])

In [None]:
# make index the datetime column for easier indexing
nba_df.set_index("Date", inplace = True)

In [None]:
nba_df.head(5)
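
A datetime index makes it possible to select rows by date strings. The sketch below uses a small hypothetical game log (the column names mirror the NBA data, the values are made up) to show partial string indexing and date slicing.

```python
import pandas as pd

# Hypothetical game log; values are made up for illustration
games = pd.DataFrame(
    {
        "Date": ["2016-12-30", "2017-01-05", "2017-01-20", "2017-02-03"],
        "Tm.Pts": [101, 95, 110, 99],
    }
)
games["Date"] = pd.to_datetime(games["Date"])
games = games.set_index("Date").sort_index()

# Partial string indexing: all rows from January 2017
january = games.loc["2017-01"]

# Slicing between two dates (both endpoints are inclusive)
new_year = games.loc["2017-01-01":"2017-01-31"]

print(january)
```

Sorting the index first keeps date slices predictable, which is why `sort_index()` is called in this sketch.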

### Example 1 (Graphing Qualitative Data):

First, we have to determine which columns in the NBA data set are qualitative variables:

In [None]:
nba_df.dtypes

We can see that the GameType, Team, Opp, W.L and Referee columns refer to qualitative data 
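
Rather than reading the `dtypes` output by eye, the split can be sketched with `select_dtypes`. The frame below is hypothetical (a few columns named as in the NBA data, with made-up values), and the rule "non-numeric means qualitative" is only a first approximation.

```python
import pandas as pd

# Hypothetical frame; column names follow the NBA data, values are made up
df = pd.DataFrame(
    {
        "Team": ["TOR", "BOS", "TOR"],
        "W.L": ["W", "L", "W"],
        "Tm.Pts": [101, 95, 110],
        "Tm.FG_Perc": [0.45, 0.41, 0.50],
    }
)

# Non-numeric columns are candidates for qualitative variables
qualitative_cols = df.select_dtypes(exclude="number").columns.tolist()

# Numeric columns are candidates for quantitative variables
quantitative_cols = df.select_dtypes(include="number").columns.tolist()

print(qualitative_cols)
print(quantitative_cols)
```

Numeric columns can still be qualitative in meaning (for example a 0/1 flag or a season label stored as an integer), so the result should be checked against what each column actually represents.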

### Example 1.1 (Frequency Table):

In [None]:
tor_2016_2017 = nba_df.loc[(nba_df.loc[:, "Season"] == 2017) & (nba_df.loc[:, "Team"] == "TOR"), ]
tor_2016_2017

In [None]:
tor_2016_2017.columns

In [None]:
freq_table = pd.DataFrame(tor_2016_2017.loc[:, "W.L"].value_counts())
freq_table.columns = ["Frequency"]
freq_table

In [None]:
freq_table["RelativeFrequency"] = freq_table["Frequency"] / freq_table.Frequency.sum()
freq_table
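
As an alternative sketch, the relative frequencies can be computed directly with `value_counts(normalize=True)`. The win/loss values below are hypothetical and are not the Toronto results.

```python
import pandas as pd

# Hypothetical win/loss results (made-up values)
results = pd.Series(["W", "L", "W", "W", "L"], name="W.L")

# Absolute counts and proportions per category
counts = results.value_counts()
proportions = results.value_counts(normalize=True)

# Combine both into one frequency table; rows are aligned on the category label
freq = pd.DataFrame({"Frequency": counts, "RelativeFrequency": proportions})
print(freq)
```

Both approaches give the same table; `normalize=True` simply avoids dividing by the total by hand.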

### Example 1.2 (Pie Charts):

In [None]:
plt.figure(figsize = [5,5])
plt.pie(freq_table.RelativeFrequency, labels=freq_table.index);  # take labels from the table so they always match the slices

### Example 1.3 (Bar Charts):

In [None]:
plt.bar(freq_table.index,freq_table.RelativeFrequency);

### Example 2 (Graphing Quantitative Data):

### Example 2.1 (Histogram):

In [None]:
# histogram of the team's points
plt.hist(tor_2016_2017.loc[:, "Tm.Pts"])

In [None]:
# histogram of the team's FG Percentage
plt.hist(tor_2016_2017.loc[:, "Tm.FG_Perc"])

In [None]:
sns.histplot(tor_2016_2017.loc[:, "Tm.FG_Perc"]) # same as above but using seaborn (histplot replaces the deprecated distplot)

The two histograms above may not look the same because the "binning" process which allocates the data into a pre-defined number of bins may be different, affecting the final visual of the histogram. 

In the histograms shown above, 

> 1. __X-Axis__: The X-Axis represents the range of values that the Team Field Goal Percentage takes on
> 2. __Y-Axis__: The Y-Axis represents the frequency count of the number of observations (games) that fall into each bin
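
The effect of the binning choice can be sketched with `np.histogram`, which computes the same counts that a histogram plot draws. The field goal percentages below are simulated, hypothetical values rather than the Toronto data.

```python
import numpy as np

# Hypothetical field goal percentages for 82 games (simulated, not real data)
rng = np.random.default_rng(0)
values = rng.normal(loc=0.45, scale=0.05, size=82)

# The same data summarized with two different numbers of bins
counts_10, edges_10 = np.histogram(values, bins=10)
counts_20, edges_20 = np.histogram(values, bins=20)

print("10 bins:", counts_10)
print("20 bins:", counts_20)
```

Every game is counted exactly once in both cases, yet the bar heights differ, which is why two histograms of the same column can look different. `plt.hist` uses 10 bins by default, while seaborn chooses the number of bins from the data.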

In [None]:
# cumulative frequency histogram 
sns.histplot(tor_2016_2017.loc[:, "Tm.FG_Perc"], stat="probability", cumulative=True) # histplot replaces the deprecated distplot

In the cumulative frequency histogram shown above, 

> 1. __X-Axis__: The X-Axis represents the range of values that the Team Field Goal Percentage takes on
> 2. __Y-Axis__: The Y-Axis represents the cumulative proportion of observations (games) that fall into each bin or any bin below it. We can interpret this in the following way:
>> - In about 20% of the games, the team had a Field Goal Percentage of 0.40 or less
>> - In about 50% of the games, the team had a Field Goal Percentage of 0.45 or less
>> - In about 80% of the games, the team had a Field Goal Percentage of 0.50 or less 
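
The readings above can also be computed directly instead of being read off the plot. This is a minimal sketch using simulated, hypothetical field goal percentages; `share_at_or_below` is an illustrative helper, not a library function.

```python
import numpy as np

# Hypothetical field goal percentages for 82 games (simulated, not real data)
rng = np.random.default_rng(1)
fg_perc = rng.normal(loc=0.45, scale=0.05, size=82)


def share_at_or_below(values, threshold):
    """Proportion of observations less than or equal to the threshold."""
    return float(np.mean(np.asarray(values) <= threshold))


# Cumulative proportion at a few thresholds
for threshold in (0.40, 0.45, 0.50):
    share = share_at_or_below(fg_perc, threshold)
    print(f"FG% <= {threshold:.2f}: {share:.0%} of games")

# The reverse question: which value has 50% of the games at or below it?
median_fg = np.quantile(fg_perc, 0.5)
print(f"Median FG%: {median_fg:.3f}")
```

The cumulative proportion never decreases as the threshold grows and reaches 1 at the largest observation, which matches the shape of a cumulative histogram.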

### Example 2.2 (Scatter Plots):

In [None]:
tor_2016_2017_home = tor_2016_2017.loc[(tor_2016_2017.loc[:, "Home"] == 1), ]
tor_2016_2017_home

In [None]:
# scatter plot of the home attendance 
plt.scatter(tor_2016_2017_home.index, tor_2016_2017_home.loc[:, "Home.Attendance"]);

In [None]:
# scatter plot of the turnovers 
plt.scatter(tor_2016_2017.index, tor_2016_2017.loc[:, "Tm.TOV"])

In [None]:
# combine scatter plot and histogram of the turnovers 
sns.jointplot(x=tor_2016_2017.loc[:, "G"], y=tor_2016_2017.loc[:, "Tm.TOV"])  # recent seaborn versions require x and y as keywords

### Example 2.3 (Line Graphs):

In [None]:
# plot 2 series - team and opponent points
plt.plot(tor_2016_2017.loc[:, "Tm.Pts"], linestyle = '--',linewidth = 3, c = 'b')
plt.plot(tor_2016_2017.loc[:, "Opp.Pts"],linewidth = 5, c = 'r') 
plt.legend(['Tm.Pts','Opp.Pts'],shadow = True, loc = 0);

### Problem 2 

Load in the Seattle Home Price Data, then choose 2-3 qualitative variables and 2-3 quantitative variables to summarize using the techniques explained above.

In [None]:
# Write your code here 





### ANSWERS 

### Problem 1:

Give an example of each type of qualitative data above (other than the examples that are mentioned in the definition)

__Nominal Variable:__ Political party (e.g. Democratic Party, Republican Party, etc.) <br>
__Ordinal Variable:__ Socio-economic class (e.g. working class, middle class, upper-middle class, upper class) <br>
__Interval Variable:__ Standardized Test Score (e.g. GRE, GMAT, SAT, etc.) <br>
__Ratio Variable:__ Height, Weight, etc. (a height of zero means something significant, a weight of zero means something significant)
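
The distinction between interval and ratio variables can be checked numerically: ratios on an interval scale change when the unit changes, while ratios on a ratio scale do not. The temperatures and heights below are arbitrary illustrative values.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


def celsius_to_kelvin(celsius):
    return celsius + 273.15


def cm_to_inches(cm):
    return cm / 2.54


# Interval scale: the zero of Celsius and Fahrenheit is arbitrary
cold, warm = 10.0, 20.0
ratio_celsius = warm / cold  # looks like "twice as hot"
ratio_fahrenheit = celsius_to_fahrenheit(warm) / celsius_to_fahrenheit(cold)
ratio_kelvin = celsius_to_kelvin(warm) / celsius_to_kelvin(cold)

# Ratio scale: a height of zero means the same thing in every unit
short, tall = 90.0, 180.0
ratio_cm = tall / short
ratio_inches = cm_to_inches(tall) / cm_to_inches(short)

print(ratio_celsius, ratio_fahrenheit, ratio_kelvin)
print(ratio_cm, ratio_inches)
```

The three temperature ratios disagree, so "twice as hot" has no meaning in Celsius, whereas the height ratio is the same in centimetres and inches.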

### Problem 2 

Load in the Seattle Home Price Data, then choose 2-3 qualitative variables and 2-3 quantitative variables to summarize using the techniques explained above.

In [None]:
home_df = pd.read_csv("SeattleHomePrices.csv")

In [None]:
home_df.head(5)

In [None]:
home_df.dtypes

- Choose the qualitative variable "PROPERTY TYPE"
- Choose the quantitative variable "PRICE"

### Part 1 - Qualitative Variables:

In [None]:
freq_table = pd.DataFrame(home_df.loc[:, "PROPERTY TYPE"].value_counts())
freq_table.columns = ["Frequency"]
freq_table

In [None]:
freq_table["RelativeFrequency"] = freq_table["Frequency"] / freq_table.Frequency.sum()
freq_table

In [None]:
plt.figure(figsize = [5,5])
plt.pie(freq_table.RelativeFrequency, labels=freq_table.index);

In [None]:
plt.bar(freq_table.index,freq_table.RelativeFrequency)
plt.xticks(rotation=90);

### Part 2 - Quantitative Variables:

In [None]:
plt.hist(home_df.loc[:, "PRICE"])

In [None]:
plt.scatter(home_df.index, home_df.loc[:, "PRICE"])

In [None]:
x = home_df.index
y1 = home_df.loc[:, "PRICE"]
y2 = home_df.loc[:, "SQUARE FEET"]

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()
ax1.plot(x, y1, 'g-')
ax2.plot(x, y2, 'b-')

ax1.set_xlabel('House Index')
ax1.set_ylabel('Price', color='g')
ax2.set_ylabel('Square Feet', color='b')