# Frequency Distributions
## Simplifying Data
![Steps](https://s3.amazonaws.com/dq-content/285/s1m3_workflow.svg)

Our capacity to understand a data set just by looking at it in a table format is limited, and it decreases dramatically as the size of the data set increases. To be able to analyze data, we need to find ways to simplify it.

One way to simplify this data set is to select a variable, count how many times each unique value occurs, and represent the `frequencies` (the number of times a unique value occurs) in a table.

Try to get a sense for how difficult it is to analyze the basketball data set in its original form.
* Read in the basketball data set (the name of the CSV file is wnba.csv) using pd.read_csv().
* Using DataFrame.shape, find the number of rows and columns of the data set.
* Print the entire data set, and try to analyze the output to find some patterns.

In [2]:
import pandas as pd

pd.options.display.max_rows = 200
pd.options.display.max_columns = 50
wnba = pd.read_csv("wnba.csv")
shape = wnba.shape

print('Shape of data:', shape)
print('Dataset:', wnba)

Shape of data: (143, 32)
Dataset:                          Name Team  Pos  Height  Weight        BMI  \
0               Aerial Powers  DAL    F     183    71.0  21.200991   
1                 Alana Beard   LA  G/F     185    73.0  21.329438   
2                Alex Bentley  CON    G     170    69.0  23.875433   
3             Alex Montgomery  SAN  G/F     185    84.0  24.543462   
4                Alexis Jones  MIN    G     175    78.0  25.469388   
5             Alexis Peterson  SEA    G     170    63.0  21.799308   
6               Alexis Prince  PHO    G     188    81.0  22.917610   
7               Allie Quigley  CHI    G     178    64.0  20.199470   
8                Allisha Gray  DAL    G     185    76.0  22.205990   
9           Allison Hightower  WAS    G     178    77.0  24.302487   
10               Alysha Clark  SEA    F     180    76.0  23.456790   
11              Alyssa Thomas  CON    F     188    84.0  23.766410   
12            Amanda Zahui B.   NY    C     196   113.0 

***
## Frequency Distribution Tables

A frequency distribution table has two columns. One column records the unique values of a variable, and the other the frequency of each unique value.
![table](https://s3.amazonaws.com/dq-content/285/s1m3_freq_table_anatomy.svg)

To generate a frequency distribution table using Python, we can use the `Series.value_counts()` method:


In [3]:
print(wnba['Pos'].value_counts())

G      60
F      33
C      25
G/F    13
F/C    12
Name: Pos, dtype: int64
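
To make the idea of a frequency table concrete, the sketch below counts a handful of made-up positions by hand with `collections.Counter` and compares the result with `Series.value_counts()`. The `positions` values are an assumption for illustration and are not taken from `wnba.csv`.

```python
from collections import Counter

import pandas as pd

# Made-up sample of positions (illustrative only, not wnba.csv data)
positions = pd.Series(['G', 'F', 'G', 'C', 'G/F', 'G', 'F'])

# Manual count: how many times does each unique value occur?
manual_counts = Counter(positions)

# The same counts expressed as a frequency distribution table
freq_table = positions.value_counts()

print(dict(manual_counts))
print(freq_table)
```

Both approaches should agree on every count; `value_counts()` additionally returns the result as a `Series` sorted by frequency.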


Using the Series.value_counts() method, generate frequency distribution tables for the following columns:
  * Pos. Assign the frequency distribution table to a variable named freq_distro_pos.
  * Height. Assign the frequency distribution table to a variable named freq_distro_height.

In [4]:
freq_distro_pos = wnba['Pos'].value_counts()
freq_distro_height = wnba['Height'].value_counts()

print('Position:','\n',freq_distro_pos)
print('\n')
print('Height:','\n',freq_distro_height)

Position: 
 G      60
F      33
C      25
G/F    13
F/C    12
Name: Pos, dtype: int64


Height: 
 188    20
193    18
175    16
185    15
183    11
173    11
191    11
196     9
178     8
180     7
170     6
198     5
201     2
168     2
206     1
165     1
Name: Height, dtype: int64


***
## Sorting Frequency Distribution Tables

![Sort](https://s3.amazonaws.com/dq-content/284/s1m2_interval_ratio.svg)

Because the Height variable has direction, we might be interested to find:
* How many players are under 170 cm?
* How many players are very tall (over 185 cm)?
* Are there any players below 160 cm?

It's time-consuming to answer these questions using the table above. The solution is to sort the table ourselves.

`wnba['Height'].value_counts()` returns a `Series` object with the measures of height as indices. This allows us to sort the table by index using the `Series.sort_index()` method:

In [5]:
print(wnba['Height'].value_counts().sort_index())

165     1
168     2
170     6
173    11
175    16
178     8
180     7
183    11
185    15
188    20
191    11
193    18
196     9
198     5
201     2
206     1
Name: Height, dtype: int64


We can also sort the table by index in descending order using `wnba['Height'].value_counts().sort_index(ascending = False)`.
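
Once the table is sorted, the height questions listed earlier reduce to summing part of the table. A minimal sketch, rebuilding the Height frequencies from the output printed above instead of loading `wnba.csv` (the variable names are illustrative):

```python
import pandas as pd

# Height frequencies copied from the sorted table printed above
height_freq = pd.Series({
    165: 1, 168: 2, 170: 6, 173: 11, 175: 16, 178: 8, 180: 7, 183: 11,
    185: 15, 188: 20, 191: 11, 193: 18, 196: 9, 198: 5, 201: 2, 206: 1,
})

# Select rows by comparing the index (the heights), then add up the counts
under_170 = height_freq[height_freq.index < 170].sum()  # 3
over_185 = height_freq[height_freq.index > 185].sum()   # 66
below_160 = height_freq[height_freq.index < 160].sum()  # 0

print(under_170, over_185, below_160)
```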

Generate a frequency distribution table for the Age variable, which is measured on a ratio scale, and sort the table by unique values.
* Sort the table by unique values in an ascending order, and assign the result to a variable named age_ascending.
* Sort the table by unique values in a descending order, and assign the result to a variable named age_descending.

Analyze one of the frequency distribution tables and brainstorm questions that might be interesting to answer. These include:
* How many players are under 20?
* How many players are 30 or over?

In [6]:
age_ascending = wnba['Age'].value_counts().sort_index()
age_descending = wnba['Age'].value_counts().sort_index(ascending = False)

print('Age Ascending:','\n', age_ascending)
print('\n')
print('Age Descending','\n', age_descending)

Age Ascending: 
 21     2
22    10
23    15
24    16
25    15
26    12
27    13
28    14
29     8
30     9
31     8
32     8
33     3
34     5
35     4
36     1
Name: Age, dtype: int64


Age Descending 
 36     1
35     4
34     5
33     3
32     8
31     8
30     9
29     8
28    14
27    13
26    12
25    15
24    16
23    15
22    10
21     2
Name: Age, dtype: int64


***
## Sorting Tables for Ordinal Variables

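Ordinal variables have direction too, so their frequency tables also benefit from sorting. To practice, we convert the `PTS` (total points) variable to an ordinal scale and name the new column `PTS_ordinal_scale`. Below is the conversion, followed by a short extract from our data set containing the new column: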

In [7]:
def make_pts_ordinal(row):
    if row['PTS'] <= 20:
        return 'very few points'
    if (20 < row['PTS'] <=  80):
        return 'few points'
    if (80 < row['PTS'] <=  150):
        return 'many, but below average'
    if (150 < row['PTS'] <= 300):
        return 'average number of points'
    if (300 < row['PTS'] <=  450):
        return 'more than average'
    else:
        return 'much more than average'
    
wnba['PTS_ordinal_scale'] = wnba.apply(make_pts_ordinal, axis = 1)

print(wnba[['Name', 'PTS', 'PTS_ordinal_scale']].head())

              Name  PTS         PTS_ordinal_scale
0    Aerial Powers   93   many, but below average
1      Alana Beard  217  average number of points
2     Alex Bentley  218  average number of points
3  Alex Montgomery  188  average number of points
4     Alexis Jones   50                few points


Let's examine the frequency distribution table for the `PTS_ordinal_scale` variable:

In [8]:
print(wnba['PTS_ordinal_scale'].value_counts())

average number of points    45
few points                  27
many, but below average     25
more than average           21
much more than average      13
very few points             12
Name: PTS_ordinal_scale, dtype: int64


We want to sort the labels in an ascending or descending order, but using `Series.sort_index()` doesn't work because the method can't infer quantities from words like "few points". `Series.sort_index()` can only order the index alphabetically in an ascending or descending order:

In [9]:
print(wnba['PTS_ordinal_scale'].value_counts().sort_index())

average number of points    45
few points                  27
many, but below average     25
more than average           21
much more than average      13
very few points             12
Name: PTS_ordinal_scale, dtype: int64


The solution is to do selection by index label. The output of `wnba['PTS_ordinal_scale'].value_counts()` is a Series object with the labels as indices. This means we can select by indices to reorder in any way we like:

In [10]:
print(wnba['PTS_ordinal_scale'].value_counts()[['very few points', 'few points', 'many, but below average', 'average number of points']])

very few points             12
few points                  27
many, but below average     25
average number of points    45
Name: PTS_ordinal_scale, dtype: int64


This approach can be time-consuming because it involves more typing than is ideal. We can use `iloc[]` instead to reorder by position. The positions refer to the order of the rows in the `value_counts()` output above, where position 0 is `average number of points` and position 5 is `very few points`:

In [11]:
print(wnba['PTS_ordinal_scale'].value_counts().iloc[[5, 1, 2, 0]])

very few points             12
few points                  27
many, but below average     25
average number of points    45
Name: PTS_ordinal_scale, dtype: int64
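
Positional selection depends on the order in which `value_counts()` happens to return the rows, and that order can shift when the data changes. One possible alternative, which is not part of the walkthrough above, is to write the label order once and reuse it with `Series.reindex()`. The sketch rebuilds the frequencies from the printed output, and `ordinal_order` is a hypothetical helper list:

```python
import pandas as pd

# Frequencies copied from the value_counts() output printed earlier
pts_freq = pd.Series({
    'average number of points': 45,
    'few points': 27,
    'many, but below average': 25,
    'more than average': 21,
    'much more than average': 13,
    'very few points': 12,
}, name='PTS_ordinal_scale')

# Hypothetical helper: the labels written once, from lowest to highest
ordinal_order = [
    'very few points',
    'few points',
    'many, but below average',
    'average number of points',
    'more than average',
    'much more than average',
]

pts_ascending = pts_freq.reindex(ordinal_order)
pts_descending = pts_freq.reindex(ordinal_order[::-1])

print(pts_descending)
```

Because the rows are selected by label, the result stays correctly ordered even if the frequencies, and therefore the default row order, change.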


Generate a frequency distribution table for the transformed PTS_ordinal_scale column.
* Order the table by unique values in a descending order (not alphabetically).
* Assign the result to a variable named pts_ordinal_desc.

In [12]:
def make_pts_ordinal(row):
    if row['PTS'] <= 20:
        return 'very few points'
    if (20 < row['PTS'] <=  80):
        return 'few points'
    if (80 < row['PTS'] <=  150):
        return 'many, but below average'
    if (150 < row['PTS'] <= 300):
        return 'average number of points'
    if (300 < row['PTS'] <=  450):
        return 'more than average'
    else:
        return 'much more than average'
    
wnba['PTS_ordinal_scale'] = wnba.apply(make_pts_ordinal, axis = 1)

# Type your answer below
pts_ordinal_desc = wnba['PTS_ordinal_scale'].value_counts().iloc[[4, 3, 0, 2, 1, 5]]
print(pts_ordinal_desc)

much more than average      13
more than average           21
average number of points    45
many, but below average     25
few points                  27
very few points             12
Name: PTS_ordinal_scale, dtype: int64


***
## Proportions and Percentages

In pandas, we can compute all the proportions at once by dividing each frequency by the total number of players:

In [13]:
print(wnba['Pos'].value_counts() / len(wnba))

G      0.419580
F      0.230769
C      0.174825
G/F    0.090909
F/C    0.083916
Name: Pos, dtype: float64


It's slightly more convenient, though, to use `Series.value_counts()` with the `normalize` parameter set to `True`:

In [14]:
print(wnba['Pos'].value_counts(normalize = True))

G      0.419580
F      0.230769
C      0.174825
G/F    0.090909
F/C    0.083916
Name: Pos, dtype: float64


To find percentages, we just have to multiply the proportions by 100:

In [15]:
print(wnba['Pos'].value_counts(normalize = True) * 100)

G      41.958042
F      23.076923
C      17.482517
G/F     9.090909
F/C     8.391608
Name: Pos, dtype: float64


Answer the following questions about the Age variable:
* What proportion of players are 25 years old? Assign your answer to a variable named proportion_25.
* What percentage of players are 30 years old? Assign your answer to a variable named percentage_30.
* What percentage of players are 30 years or older? Assign your answer to a variable named percentage_over_30.
* What percentage of players are 23 years or younger? Assign your answer to a variable named percentage_below_23.

In [16]:
print(wnba['Age'].value_counts(normalize = True))

24    0.111888
23    0.104895
25    0.104895
28    0.097902
27    0.090909
26    0.083916
22    0.069930
30    0.062937
31    0.055944
29    0.055944
32    0.055944
34    0.034965
35    0.027972
33    0.020979
21    0.013986
36    0.006993
Name: Age, dtype: float64


In [17]:
print(wnba['Age'].value_counts(normalize=True)*100)

24    11.188811
23    10.489510
25    10.489510
28     9.790210
27     9.090909
26     8.391608
22     6.993007
30     6.293706
31     5.594406
29     5.594406
32     5.594406
34     3.496503
35     2.797203
33     2.097902
21     1.398601
36     0.699301
Name: Age, dtype: float64


In [18]:

def make_age_ordinal(row):
    # 23 years or younger
    if row['Age'] <= 23:
        return 'under 23'
    # 30 years or older
    if row['Age'] >= 30:
        return 'over 30'
    else:
        return 'other ages'
    
wnba['Age_ordinal_scale'] = wnba.apply(make_age_ordinal, axis = 1)

print(wnba[['Age_ordinal_scale']].value_counts(normalize=True)*100)

Age_ordinal_scale
other ages           54.545455
over 30              26.573427
under 23             18.881119
dtype: float64
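
The four requested variables can also be read directly off a normalized frequency table. A minimal sketch, rebuilding the Age frequencies from the sorted table printed earlier rather than loading `wnba.csv` (with the real data, `wnba['Age'].value_counts(normalize = True)` would play the role of `proportions`):

```python
import pandas as pd

# Age frequencies copied from the sorted table printed earlier
age_counts = pd.Series({
    21: 2, 22: 10, 23: 15, 24: 16, 25: 15, 26: 12, 27: 13, 28: 14,
    29: 8, 30: 9, 31: 8, 32: 8, 33: 3, 34: 5, 35: 4, 36: 1,
})

proportions = age_counts / age_counts.sum()
percentages = proportions * 100

proportion_25 = proportions.loc[25]
percentage_30 = percentages.loc[30]
# Sum the percentages of every age that satisfies the condition
percentage_over_30 = percentages[percentages.index >= 30].sum()
percentage_below_23 = percentages[percentages.index <= 23].sum()

print(proportion_25, percentage_30, percentage_over_30, percentage_below_23)
```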


***
## Percentiles and Percentile Ranks

In the previous exercise, we found that the percentage of players aged 23 years or younger is 19% (rounded to the nearest integer). This percentage is also called a **percentile rank**.

In this context, the value of 23 is called the 19th **percentile**. If a value *x* is the 19th percentile, it means that 19% of *all* the values in the distribution are equal to or less than *x*.

![percent](https://s3.amazonaws.com/dq-content/285/s1m3_percentiles_v2.svg)

In the previous exercise, we found that 18.881% of players are 23 years old or younger. We can arrive at the same answer a bit faster using the `percentileofscore(a, score, kind='weak')` function from `scipy.stats`:


In [19]:
from scipy.stats import percentileofscore
print(percentileofscore(a = wnba['Age'], score = 23, kind = 'weak'))

18.88111888111888


We need to use `kind = 'weak'` to indicate that we want to find the percentage of values that are equal to or less than the value we specify in the score parameter.
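
To see what the `kind` argument changes, the sketch below computes a percentile rank by hand in plain Python. The two helper functions are hypothetical stand-ins written for this illustration, not SciPy code; `'weak'` counts values less than or equal to the score, while `'strict'` counts only values strictly less than it. The ages are rebuilt from the frequency table printed earlier.

```python
# Age frequencies copied from the sorted table printed earlier
age_counts = {
    21: 2, 22: 10, 23: 15, 24: 16, 25: 15, 26: 12, 27: 13, 28: 14,
    29: 8, 30: 9, 31: 8, 32: 8, 33: 3, 34: 5, 35: 4, 36: 1,
}
# Expand the table back into one age per player
ages = [age for age, count in age_counts.items() for _ in range(count)]


def percentile_rank_weak(values, score):
    """Percentage of values less than or equal to score."""
    return sum(v <= score for v in values) / len(values) * 100


def percentile_rank_strict(values, score):
    """Percentage of values strictly less than score."""
    return sum(v < score for v in values) / len(values) * 100


rank_23_weak = percentile_rank_weak(ages, 23)
rank_23_strict = percentile_rank_strict(ages, 23)

print(rank_23_weak)    # about 18.881, matching the result above
print(rank_23_strict)  # smaller, because the 23-year-olds are excluded
```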

Another question we had was what percentage of players are 30 years or older. We can answer this question too using percentile ranks. First we need to find the percentage of values equal to or less than 29 years (the percentile rank of 29). The rest of the values must be 30 years or more.

![age](https://s3.amazonaws.com/dq-content/285/s1m3_difference.svg)

In our exercise the answer we found was 26.573%. This is what we get using the technique we've just learned:

In [20]:
print(100 - percentileofscore(wnba['Age'], 29, kind = 'weak'))

26.573426573426573


`percentileofscore()` from `scipy.stats` has already been imported above. Use it to answer the following:

* What percentage of players played half the number of games or less in the 2016-2017 season (there are 34 games in the WNBA’s regular season)? Use the `Games Played` column to find the data you need, and assign your answer to a variable named `percentile_rank_half_less`.
* What percentage of players played more than half the number of games of the season 2016-2017? Assign your result to `percentage_half_more`.

In [21]:
percentile_rank_half_less = percentileofscore(wnba['Games Played'], 17, kind='weak')
percentage_half_more = 100 - percentile_rank_half_less

print(percentile_rank_half_less)
print(percentage_half_more)

16.083916083916083
83.91608391608392


***
## Finding Percentiles with pandas

To find percentiles, we can use the `Series.describe()` method, which by default returns the 25th, 50th, and 75th percentiles:

In [22]:
print(wnba['Age'].describe())

count    143.000000
mean      27.076923
std        3.679170
min       21.000000
25%       24.000000
50%       27.000000
75%       30.000000
max       36.000000
Name: Age, dtype: float64


We can use `iloc[]` to isolate just the output we want:

In [23]:
print(wnba['Age'].describe().iloc[3:])

min    21.0
25%    24.0
50%    27.0
75%    30.0
max    36.0
Name: Age, dtype: float64


![AgePercent](https://s3.amazonaws.com/dq-content/285/s1m3_quantiles_v2.svg)

The three percentiles that divide the distribution into four equal parts are also known as `quartiles`.
* The first quartile (also called lower quartile) is 24 (note that 24 is also the 25th percentile).
* The second quartile (also called the middle quartile) is 27 (note that 27 is also the 50th percentile).
* And the third quartile (also called the upper quartile) is 30 (note that 30 is also the 75th percentile).

We may be interested in finding the percentiles for percentages other than 25%, 50%, or 75%. For that, we can use the `percentiles` parameter of `Series.describe()`:

In [24]:
print(wnba['Age'].describe(percentiles = [.1, .15, .33, .5, .592, .85, .9]).iloc[3:])

min      21.0
10%      23.0
15%      23.0
33%      25.0
50%      27.0
59.2%    28.0
85%      31.0
90%      32.0
max      36.0
Name: Age, dtype: float64
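
When only the percentiles are needed, `Series.quantile()` is a possible alternative to slicing the output of `describe()`. It is not used in the walkthrough above, so treat this as a sketch under that assumption. The ages are rebuilt from the frequency table printed earlier instead of being read from `wnba.csv`:

```python
import pandas as pd

# Age frequencies copied from the sorted table printed earlier
age_counts = {
    21: 2, 22: 10, 23: 15, 24: 16, 25: 15, 26: 12, 27: 13, 28: 14,
    29: 8, 30: 9, 31: 8, 32: 8, 33: 3, 34: 5, 35: 4, 36: 1,
}
# Expand the table back into one age per player
ages = pd.Series(
    [age for age, count in age_counts.items() for _ in range(count)],
    name='Age',
)

# Quantiles are given as fractions between 0 and 1
quartiles = ages.quantile([0.25, 0.5, 0.75])
age_95th = ages.quantile(0.95)

print(quartiles)
print(age_95th)
```

The quartiles should match the `25%`, `50%`, and `75%` rows returned by `describe()`.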


Use the `Age` variable along with `Series.describe()` to answer the following questions:
* What's the upper quartile of the `Age` variable? Assign your answer to a variable named `age_upper_quartile`.
* What's the middle quartile of the `Age` variable? Assign your answer to a variable named `age_middle_quartile`.
* What's the 95th percentile of the `Age` variable? Assign your answer to a variable named `age_95th_percentile`.

Indicate the truth value of the following sentences:
* A percentile is a value of a variable, and it corresponds to a certain percentile rank in the distribution of that variable. (If you think this is true, assign `True` (boolean, not string) to a variable named `question1`, otherwise assign `False`.)
* A percentile rank is a numerical value from the distribution of a variable. (Assign `True` or `False` to `question2`.)
* The 25th percentile is the same thing as the lower quartile, and the upper quartile is the same thing as the third quartile. (Assign `True` or `False` to `question3`.)

In [25]:
wnba = pd.read_csv('wnba.csv')
percentiles = wnba['Age'].describe(percentiles = [.5, .75, .95])
age_upper_quartile = percentiles['75%']
age_middle_quartile = percentiles['50%']
age_95th_percentile = percentiles['95%']

question1 = True
question2 = False
question3 = True

print(age_upper_quartile)
print(age_middle_quartile)
print(age_95th_percentile)

30.0
27.0
34.0


***
## Grouped Frequency Distribution Tables

Not all frequency tables are straightforward:

In [26]:
print(wnba['Weight'].value_counts().sort_index())

55.0      1
57.0      1
58.0      1
59.0      2
62.0      1
63.0      3
64.0      5
65.0      4
66.0      8
67.0      1
68.0      2
69.0      2
70.0      3
71.0      2
73.0      6
74.0      4
75.0      4
76.0      4
77.0     10
78.0      5
79.0      6
80.0      3
81.0      5
82.0      4
83.0      4
84.0      9
85.0      2
86.0      7
87.0      6
88.0      6
89.0      3
90.0      2
91.0      3
93.0      3
95.0      2
96.0      2
97.0      1
104.0     2
108.0     1
113.0     2
Name: Weight, dtype: int64


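A table this long is hard to analyze. One solution is to group the values into class intervals of equal size and count how many values fall in each interval. Fortunately, pandas handles this grouping gracefully: we only need the `bins` parameter of `Series.value_counts()`. We want ten equal intervals, so we specify `bins = 10`: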

In [27]:
print(wnba['Weight'].value_counts(bins = 10).sort_index())

(54.941, 60.8]     5
(60.8, 66.6]      21
(66.6, 72.4]      10
(72.4, 78.2]      33
(78.2, 84.0]      31
(84.0, 89.8]      24
(89.8, 95.6]      10
(95.6, 101.4]      3
(101.4, 107.2]     2
(107.2, 113.0]     3
Name: Weight, dtype: int64


`(54.941, 60.8]`, `(60.8, 66.6]`, and `(107.2, 113.0]` are number intervals. The `(` character indicates that the starting point is not included, while the `]` indicates that the endpoint is included. The interval `(54.941, 60.8]` contains all real numbers greater than 54.941 and less than or equal to 60.8.
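
The open and closed endpoints can be checked directly with a `pd.Interval` object. A short sketch, using the first interval from the table above (the variable name is illustrative):

```python
import pandas as pd

# closed='right' means the right endpoint is included and the left one is not
first_interval = pd.Interval(left=54.941, right=60.8, closed='right')

left_endpoint_included = 54.941 in first_interval   # False: "(" excludes it
inside_value_included = 55.0 in first_interval      # True
right_endpoint_included = 60.8 in first_interval    # True: "]" includes it
beyond_right_included = 60.9 in first_interval      # False

print(left_endpoint_included, inside_value_included,
      right_endpoint_included, beyond_right_included)
```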

Examine the frequency table for the PTS (total points) variable trying to find some patterns in the distribution of values. Then, generate a grouped frequency distribution table for the PTS variable with the following characteristics:
* The table has 10 class intervals.
* For each class interval, the table shows percentages instead of frequencies.
* The class intervals are sorted in descending order. 

Assign the table to a variable named grouped_freq_table, then print it and try again to find some patterns in the distribution of values.

In [28]:
grouped_freq_table = wnba['PTS'].value_counts(bins = 10, normalize = True).sort_index(ascending = False)*100

print(grouped_freq_table)

(525.8, 584.0]     3.496503
(467.6, 525.8]     2.797203
(409.4, 467.6]     5.594406
(351.2, 409.4]     6.993007
(293.0, 351.2]     5.594406
(234.8, 293.0]    11.888112
(176.6, 234.8]    13.986014
(118.4, 176.6]    11.888112
(60.2, 118.4]     16.783217
(1.417, 60.2]     20.979021
Name: PTS, dtype: float64


***
## Information Loss

When we generate grouped frequency distribution tables, there's an inevitable information loss.

In [29]:
print(wnba['PTS'].value_counts(bins = 10))

(1.417, 60.2]     30
(60.2, 118.4]     24
(176.6, 234.8]    20
(118.4, 176.6]    17
(234.8, 293.0]    17
(351.2, 409.4]    10
(293.0, 351.2]     8
(409.4, 467.6]     8
(525.8, 584.0]     5
(467.6, 525.8]     4
Name: PTS, dtype: int64


Looking at the first interval, we can see there are 30 players who scored between 2 and 60 points (2 is the minimum value in our data set, and points in basketball can only be integers).

To get back this granular information, we can increase the number of class intervals. However, if we do that, we end up again with a table that's lengthy and very difficult to analyze.

On the other hand, if we decrease the number of class intervals, we lose even more information:

In [30]:
print(wnba['PTS'].value_counts(bins = 5).sort_index())

(1.417, 118.4]    54
(118.4, 234.8]    37
(234.8, 351.2]    25
(351.2, 467.6]    18
(467.6, 584.0]     9
Name: PTS, dtype: int64


As a rule of thumb, 10 is a good number of class intervals to choose because it offers a good balance between information and comprehensibility.

Generate a grouped frequency distribution for the MIN variable (minutes played during the season), and experiment with the number of class intervals to get a sense for what conclusions you can draw as you vary the number of class intervals. Try to experiment with the following numbers of class intervals:
* 1
* 2
* 3
* 5
* 10
* 15
* 20
* 40
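
One way to run this kind of experiment is a simple loop over the candidate numbers of class intervals. The sketch below uses synthetic minutes generated with `random`, which is an assumption made so the example runs without `wnba.csv`; with the real data, `wnba['MIN']` would replace `minutes`.

```python
import random

import pandas as pd

random.seed(0)
# Synthetic minutes played for 143 hypothetical players (not the real MIN column)
minutes = pd.Series([random.randint(10, 1000) for _ in range(143)], name='MIN')

tables = {}
for n_bins in [1, 2, 3, 5, 10, 15, 20, 40]:
    # One grouped frequency table per number of class intervals
    tables[n_bins] = minutes.value_counts(bins=n_bins).sort_index()
    print(n_bins, 'intervals, largest frequency:', tables[n_bins].max())
```

With a single interval the table only restates the number of players, while with 40 intervals it is nearly as hard to read as the ungrouped table.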

***
## Readability for Grouped Frequency Tables

The intervals pandas outputs are confusing at first sight:


In [31]:
print(wnba['PTS'].value_counts(bins = 6).sort_index())

(1.417, 99.0]     48
(99.0, 196.0]     27
(196.0, 293.0]    33
(293.0, 390.0]    13
(390.0, 487.0]    13
(487.0, 584.0]     9
Name: PTS, dtype: int64


Let's look at one way to code the intervals. We start with creating the intervals using the `pd.interval_range()` function:

In [32]:
intervals = pd.interval_range(start = 0, end = 600, freq = 100)
print(intervals)

IntervalIndex([(0, 100], (100, 200], (200, 300], (300, 400], (400, 500], (500, 600]], dtype='interval[int64, right]')


Next, we pass the `intervals` variable to the `bins` parameter, store the result to `gr_freq_table`, and print the result, like this:

In [33]:
gr_freq_table = wnba["PTS"].value_counts(bins = intervals).sort_index()
print(gr_freq_table)

(0, 100]      49
(100, 200]    28
(200, 300]    32
(300, 400]    17
(400, 500]    10
(500, 600]     7
Name: PTS, dtype: int64


Now we do a quick check of our work. There are 143 players in the data set, so the frequencies should add up to 143:

In [34]:
print(gr_freq_table.sum())

143
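
Another possible route to readable intervals, not used in the walkthrough above, is `pd.cut()` with explicit bin edges. The sketch below uses a few made-up point totals rather than the `PTS` column, and it also shows that a value sitting exactly on an edge, such as 100, falls in the interval that ends there:

```python
import pandas as pd

# Made-up point totals (illustrative only, not wnba.csv data)
points = pd.Series([2, 50, 100, 101, 250, 584], name='PTS')

# Bin edges 0, 100, ..., 600; intervals are right-closed by default
edges = list(range(0, 601, 100))
binned = pd.cut(points, bins=edges)

# Count the values per interval and keep the intervals in order
table = binned.value_counts().sort_index()

print(table)
```

As with the check above, the frequencies should add up to the number of values, here 6.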


Using the techniques above, generate a grouped frequency table for the PTS variable. The table should have the following characteristics:
* The first class interval starts at 0 (not included).
* The last class interval ends at 600 (included).
* Each interval has a range of 60 points.
* There are 10 class intervals.

Assign the table to a variable named gr_freq_table_10.

In [36]:
intervals = pd.interval_range(start = 0, end = 600, freq = 60)
gr_freq_table_10 = wnba['PTS'].value_counts(bins = intervals).sort_index()

print(gr_freq_table_10)

(0, 60]       30
(60, 120]     25
(120, 180]    17
(180, 240]    22
(240, 300]    15
(300, 360]     7
(360, 420]    11
(420, 480]     7
(480, 540]     4
(540, 600]     5
Name: PTS, dtype: int64
