# Find the median for grouped data (using LaTeX)

In [16]:
#import the libraries
import re
import pandas as pd
import latex

Now let's recreate the dataframe from page 85 of the book **'Statistical Methods and Data Analysis'** (Ott & Longnecker).

In [24]:
df = pd.DataFrame({'Class': [1,2,3,4,5,6,7,8,9,10,11], 
                   'Class Interval': ['16.25-18.75', '18.75-21.25', '21.25-23.75', 
                                      '23.75-26.25', '26.25-28.75', '28.75-31.25',
                                      '31.25-33.75', '33.75-36.25', '36.25-38.75',
                                      '38.75-41.25', '41.25-43.75'],
                   '$f_i$': [2,7,7,14,17,24,11,11,3,3,1],
                   'Cumulative $f_i$': [2,9,16,30,47,71,82,93,96,99,100],
                   '$f_i/n$': [.02,.07,.07,.14,.17,.24,.11,.11,.03,.03,.01],
                   'Cumulative $f_i/n$': [.02,.09,.16,.30,.47,.71,.82,.93,.96,.99,1.00]})
df

Unnamed: 0,Class,Class Interval,$f_i$,Cumulative $f_i$,$f_i/n$,Cumulative $f_i/n$
0,1,16.25-18.75,2,2,0.02,0.02
1,2,18.75-21.25,7,9,0.07,0.09
2,3,21.25-23.75,7,16,0.07,0.16
3,4,23.75-26.25,14,30,0.14,0.3
4,5,26.25-28.75,17,47,0.17,0.47
5,6,28.75-31.25,24,71,0.24,0.71
6,7,31.25-33.75,11,82,0.11,0.82
7,8,33.75-36.25,11,93,0.11,0.93
8,9,36.25-38.75,3,96,0.03,0.96
9,10,38.75-41.25,3,99,0.03,0.99
10,11,41.25-43.75,1,100,0.01,1.0


## Assignment:
Calculate the median of the grouped data in the dataframe `df`.

The formula for the median of grouped data is: $median = L + \frac{w}{f_m}\left(.5n - cf_b\right)$

where: 
- $L$ = lower class limit of the interval that contains the median
- $n$ = total frequency
- $cf_b$ = the sum of frequencies (cumulative frequency) for all classes before the median class
- $f_m$ = frequency of the class interval containing the median
- $w$ = interval width

## Solution 

To determine the interval that contains the median, we must find the first interval for which
the cumulative relative frequency exceeds .50.
For the dataframe above, that is class 6 (28.75-31.25): its cumulative relative frequency of .71 is the first to exceed .50.
With this information, we can fill in the formula:

- $L$ = 28.75
- $n$ = 100
- $cf_b$ = 47
- $f_m$ = 24
- $w$ = 2.5

So the formula becomes:
$median = 28.75 + \frac{2.5}{24}\left(.5 \times 100 - 47\right) = 28.75 + 0.10416667(50-47) = 28.75 + 0.3125 = 29.0625$
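The hand calculation above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name `grouped_median` is hypothetical (it is not part of any library), and it assumes contiguous classes of equal width, as in the table.

```python
def grouped_median(lower_limits, width, freqs):
    """Hypothetical helper: median = L + (w / f_m) * (0.5 * n - cf_b)."""
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    cf_b = 0  # cumulative frequency of all classes before the current one
    for lower, f in zip(lower_limits, freqs):
        # The median class is the first one whose cumulative frequency exceeds n / 2
        if cf_b + f > 0.5 * n:
            return lower + width * (0.5 * n - cf_b) / f
        cf_b += f
    raise ValueError("no median class found")

# Lower class limits and frequencies copied from the table
lower_limits = [16.25, 18.75, 21.25, 23.75, 26.25, 28.75,
                31.25, 33.75, 36.25, 38.75, 41.25]
freqs = [2, 7, 7, 14, 17, 24, 11, 11, 3, 3, 1]

median = grouped_median(lower_limits, 2.5, freqs)
print(median)  # 29.0625
```

The loop stops at class 6, where the running total goes from 47 to 71, so it uses the same $L$, $cf_b$ and $f_m$ as the manual solution.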

# Let's do this with Python

Now, we can also do this automatically with Python using `median_grouped`, but every value passed to it must be a number, and the column 'Class Interval' holds strings. Let's first try without altering anything.

In [72]:
# importing median_grouped from 
# the statistics module 
from statistics import median_grouped

# printing median_grouped for the set 
print("Grouped Median is %s" %(median_grouped(df['Class Interval']))) 

TypeError: expected number but got '28.75-31.25'
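The error arises because every entry in 'Class Interval' is a string such as `'28.75-31.25'`. As a rough sketch of how such strings could be turned into numbers, the snippet below splits each one on the hyphen. It uses only the first three classes for brevity, the variable names are illustrative, and it assumes that no bound is negative (a leading minus sign would break the split).

```python
# First three class intervals from the table, still stored as strings
intervals = ['16.25-18.75', '18.75-21.25', '21.25-23.75']

# Split each string on '-' and convert both halves to floats
bounds = [tuple(float(part) for part in s.split('-')) for s in intervals]

lower = [lo for lo, hi in bounds]
upper = [hi for lo, hi in bounds]

print(lower)  # [16.25, 18.75, 21.25]
print(upper)  # [18.75, 21.25, 23.75]
```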

To solve this, I tried splitting the range into two columns, one with the lower bound and one with the upper bound, because passing only the lower (or only the upper) bounds does not give the correct result. Let me show you what I mean.

In [73]:
#This time we create a dataframe with only the lower values
df2 = pd.DataFrame({'Class': [1,2,3,4,5,6,7,8,9,10,11], 
                    'Class Interval': [16.25, 18.75, 21.25, 23.75,26.25,28.75,31.25,33.75,36.25,38.75,41.25], 
                    '$f_i$' : [2,7,7,14,17,24,11,11,3,3,1],
                    'Cumulative $f_i$': [2,9,16,30,47,71,82,93,96,99,100],
                    '$f_i/n$' : [.02,.07,.07,.14,.17,.24,.11,.11,.03,.03,.01],
                    'Cumulative $f_i/n$' : [.02, .09,.16,.30,.47,.71,.82,.93,.96,.99,1.00]})
df2

Unnamed: 0,Class,Class Interval,$f_i$,Cumulative $f_i$,$f_i/n$,Cumulative $f_i/n$
0,1,16.25,2,2,0.02,0.02
1,2,18.75,7,9,0.07,0.09
2,3,21.25,7,16,0.07,0.16
3,4,23.75,14,30,0.14,0.3
4,5,26.25,17,47,0.17,0.47
5,6,28.75,24,71,0.24,0.71
6,7,31.25,11,82,0.11,0.82
7,8,33.75,11,93,0.11,0.93
8,9,36.25,3,96,0.03,0.96
9,10,38.75,3,99,0.03,0.99
10,11,41.25,1,100,0.01,1.0


In [74]:
# printing median_grouped for the set 
print("Grouped Median is %s" %(median_grouped(df2['Class Interval']))) 

Grouped Median is 28.75
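The result 28.75 is simply the middle of the eleven lower limits: `median_grouped` treats every value it receives as one observation and never sees the frequencies. According to the `statistics` documentation, the function expects the raw observations, each recorded as the midpoint of its class, together with the class width in the `interval` argument (default 1). A possible workaround, sketched here under the assumption that repeating each class midpoint $f_i$ times is an acceptable stand-in for the raw data, is:

```python
from statistics import median_grouped

lower_limits = [16.25, 18.75, 21.25, 23.75, 26.25, 28.75,
                31.25, 33.75, 36.25, 38.75, 41.25]
freqs = [2, 7, 7, 14, 17, 24, 11, 11, 3, 3, 1]
width = 2.5

# Rebuild 100 pseudo-observations: each class midpoint repeated f_i times
observations = []
for lower, f in zip(lower_limits, freqs):
    observations.extend([lower + width / 2] * f)

result = median_grouped(observations, interval=width)
print(result)  # 29.0625
```

This reproduces the value from the manual solution, because the function applies the same interpolation within the class centred on 30.0.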


In [34]:
#This time we create a dataframe with both the lower values and the higher values in two columns
df3 = pd.DataFrame({'Class': [1,2,3,4,5,6,7,8,9,10,11], 
                    'Class Interval_lower': [16.25, 18.75, 21.25, 23.75,26.25,28.75,31.25,33.75,36.25,38.75,41.25],
                    'Class Interval_higher': [18.75, 21.25, 23.75,26.25,28.75,31.25,33.75,36.25,38.75,41.25,43.75], 
                    '$f_i$' : [2,7,7,14,17,24,11,11,3,3,1],
                    'Cumulative $f_i$': [2,9,16,30,47,71,82,93,96,99,100],
                    '$f_i/n$' : [.02,.07,.07,.14,.17,.24,.11,.11,.03,.03,.01],
                    'Cumulative $f_i/n$' : [.02, .09,.16,.30,.47,.71,.82,.93,.96,.99,1.00]})
df3

Unnamed: 0,Class,Class Interval_lower,Class Interval_higher,$f_i$,Cumulative $f_i$,$f_i/n$,Cumulative $f_i/n$
0,1,16.25,18.75,2,2,0.02,0.02
1,2,18.75,21.25,7,9,0.07,0.09
2,3,21.25,23.75,7,16,0.07,0.16
3,4,23.75,26.25,14,30,0.14,0.3
4,5,26.25,28.75,17,47,0.17,0.47
5,6,28.75,31.25,24,71,0.24,0.71
6,7,31.25,33.75,11,82,0.11,0.82
7,8,33.75,36.25,11,93,0.11,0.93
8,9,36.25,38.75,3,96,0.03,0.96
9,10,38.75,41.25,3,99,0.03,0.99
10,11,41.25,43.75,1,100,0.01,1.0


In [75]:
# printing median_grouped for the set 
print("Grouped Median is %s" %(median_grouped(df3)))

TypeError: expected number but got 'Class Interval_higher'

Okay, so this did not work either. Iterating over a DataFrame yields its column names, so `median_grouped` received strings instead of numbers.

In [68]:
# This time I tried to write the bins manually in the dataframe, but this gave an error
df4 = pd.DataFrame({'Class': [1,2,3,4,5,6,7,8,9,10,11], 
                    'Class Interval': [(16.25,18.75], (18.75,21.25], (21.25,23.75], 
                                      (23.75,26.25], (26.25,28.75], (28.75,31.25],
                                      (31.25,33.75], (33.75,36.25], (36.25,38.75],
                                      (38.75,41.25], (41.25,43.75]],
                    '$f_i$' : [2,7,7,14,17,24,11,11,3,3,1],
                    'Cumulative $f_i$': [2,9,16,30,47,71,82,93,96,99,100],
                    '$f_i/n$' : [.02,.07,.07,.14,.17,.24,.11,.11,.03,.03,.01],
                    'Cumulative $f_i/n$' : [.02, .09,.16,.30,.47,.71,.82,.93,.96,.99,1.00]})
df4

SyntaxError: invalid syntax (<ipython-input-68-164a23b3c9dd>, line 3)
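The `SyntaxError` appears because `(16.25,18.75]` is mathematical interval notation, not valid Python. One way to store half-open bins in a dataframe, offered here only as a sketch and not as the answer to the question asked, is pandas' `IntervalIndex`. The variable names `breaks` and `bins` are illustrative.

```python
import pandas as pd

# Twelve class boundaries: 16.25, 18.75, ..., 43.75
breaks = [16.25 + 2.5 * i for i in range(12)]

# Eleven right-closed intervals, i.e. (16.25, 18.75], (18.75, 21.25], ...
bins = pd.IntervalIndex.from_breaks(breaks, closed='right')

df4 = pd.DataFrame({'Class': list(range(1, 12)),
                    'Class Interval': bins,
                    '$f_i$': [2, 7, 7, 14, 17, 24, 11, 11, 3, 3, 1]})

# Each entry is a pd.Interval with numeric .left and .right attributes
median_class = df4['Class Interval'].iloc[5]
print(median_class.left, median_class.right)
```

Note that `median_grouped` still cannot consume such a column directly; the intervals only make the bounds available as numbers.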

So I've now asked this question on Stack Overflow. As soon as I get an answer, I'll update this script.