In [4]:
%logstop
%logstart -rtq ~/.logs/pw.py append
import seaborn as sns
sns.set()

In [5]:
from static_grader import grader

# PW Miniproject
## Introduction

The objective of this miniproject is to exercise your ability to use basic Python data structures, define functions, and control program flow. We will be using these concepts to perform some fundamental data wrangling tasks such as joining data sets together, splitting data into groups, and aggregating data into summary statistics.
**Please do not use `pandas` or `numpy` to answer these questions.**

We will be working with medical data from the British NHS on prescription drugs. Since this is real data, it contains many ambiguities that we will need to confront in our analysis. This is commonplace in data science, and is one of the lessons you will learn in this miniproject.

## Downloading the data

We first need to download the data we'll be using from Amazon S3:

In [9]:
%%bash
mkdir pw-data
wget http://dataincubator-wqu.s3.amazonaws.com/pwdata/201701scripts_sample.json.gz -nc -P ./pw-data
wget http://dataincubator-wqu.s3.amazonaws.com/pwdata/practices.json.gz -nc -P ./pw-data

mkdir: cannot create directory ‘pw-data’: File exists
File ‘./pw-data/201701scripts_sample.json.gz’ already there; not retrieving.

File ‘./pw-data/practices.json.gz’ already there; not retrieving.



## Loading the data

The first step of the project is to read in the data. We will discuss reading and writing various kinds of files later in the course, but the code below should get you started.

In [6]:
import gzip
import simplejson as json

In [7]:
with gzip.open('./pw-data/201701scripts_sample.json.gz', 'rb') as f:
    scripts = json.load(f)

with gzip.open('./pw-data/practices.json.gz', 'rb') as f:
    practices = json.load(f)

This data set comes from Britain's National Health Service. The `scripts` variable is a list of prescriptions issued by NHS doctors. Each prescription is represented by a dictionary with various data fields: `'practice'`, `'bnf_code'`, `'bnf_name'`, `'quantity'`, `'items'`, `'nic'`, and `'act_cost'`. 

In [8]:
scripts[:50]

[{'bnf_code': '0101010G0AAABAB',
  'items': 2,
  'practice': 'N81013',
  'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
  'nic': 5.98,
  'act_cost': 5.56,
  'quantity': 1000},
 {'bnf_code': '0101021B0AAAHAH',
  'items': 1,
  'practice': 'N81013',
  'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
  'nic': 1.95,
  'act_cost': 1.82,
  'quantity': 500},
 {'bnf_code': '0101021B0AAALAL',
  'items': 12,
  'practice': 'N81013',
  'bnf_name': 'Sod Algin/Pot Bicarb_Susp S/F',
  'nic': 64.51,
  'act_cost': 59.95,
  'quantity': 6300},
 {'bnf_code': '0101021B0AAAPAP',
  'items': 3,
  'practice': 'N81013',
  'bnf_name': 'Sod Alginate/Pot Bicarb_Tab Chble 500mg',
  'nic': 9.21,
  'act_cost': 8.55,
  'quantity': 180},
 {'bnf_code': '0101021B0BEADAJ',
  'items': 6,
  'practice': 'N81013',
  'bnf_name': 'Gaviscon Infant_Sach 2g (Dual Pack) S/F',
  'nic': 28.92,
  'act_cost': 26.84,
  'quantity': 90},
 {'bnf_code': '0101021B0BEAIAL',
  'items': 15,
  'practice': 'N81013',
  'bnf_name': 'Gaviscon

A [glossary of terms](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10686/Download-glossary-of-terms-for-GP-prescribing---presentation-level/pdf/PLP_Presentation_Level_Glossary_April_2015.pdf/) and [FAQ](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10048/FAQs-Practice-Level-Prescribingpdf/pdf/PLP_FAQs_April_2015.pdf/) are available from the NHS regarding the data. Below we supply a data dictionary briefly describing what these fields mean.

| Data field |Description|
|:----------:|-----------|
|`'practice'`|Code designating the medical practice issuing the prescription|
|`'bnf_code'`|British National Formulary drug code|
|`'bnf_name'`|British National Formulary drug name|
|`'quantity'`|Number of capsules/quantity of liquid/grams of powder prescribed|
| `'items'`  |Number of refills (e.g. if `'quantity'` is 30 capsules, 3 `'items'` means 3 bottles of 30 capsules)|
|  `'nic'`   |Net ingredient cost|
|`'act_cost'`|Total cost including containers, fees, and discounts|

The `practices` variable is a list of member medical practices of the NHS. Each practice is represented by a dictionary containing identifying information for the medical practice. Most of the data fields are self-explanatory. Notice the values in the `'code'` field of `practices` match the values in the `'practice'` field of `scripts`.

In [41]:
practices[:40]

[{'code': 'A81001',
  'name': 'THE DENSHAM SURGERY',
  'addr_1': 'THE HEALTH CENTRE',
  'addr_2': 'LAWSON STREET',
  'borough': 'STOCKTON ON TEES',
  'village': 'CLEVELAND',
  'post_code': 'TS18 1HU'},
 {'code': 'A81002',
  'name': 'QUEENS PARK MEDICAL CENTRE',
  'addr_1': 'QUEENS PARK MEDICAL CTR',
  'addr_2': 'FARRER STREET',
  'borough': 'STOCKTON ON TEES',
  'village': 'CLEVELAND',
  'post_code': 'TS18 2AW'},
 {'code': 'A81003',
  'name': 'VICTORIA MEDICAL PRACTICE',
  'addr_1': 'THE HEALTH CENTRE',
  'addr_2': 'VICTORIA ROAD',
  'borough': 'HARTLEPOOL',
  'village': 'CLEVELAND',
  'post_code': 'TS26 8DB'},
 {'code': 'A81004',
  'name': 'WOODLANDS ROAD SURGERY',
  'addr_1': '6 WOODLANDS ROAD',
  'addr_2': None,
  'borough': 'MIDDLESBROUGH',
  'village': 'CLEVELAND',
  'post_code': 'TS1 3BE'},
 {'code': 'A81005',
  'name': 'SPRINGWOOD SURGERY',
  'addr_1': 'SPRINGWOOD SURGERY',
  'addr_2': 'RECTORY LANE',
  'borough': 'GUISBOROUGH',
  'village': None,
  'post_code': 'TS14 7DJ'},
 {'

In the following questions we will ask you to explore this data set. You may need to combine pieces of the data set together in order to answer some questions. Not every element of the data set will be used in answering the questions.

## Question 1: summary_statistics

Our beneficiary data (`scripts`) contains quantitative data on the number of items dispensed (`'items'`), the total quantity of item dispensed (`'quantity'`), the net cost of the ingredients (`'nic'`), and the actual cost to the patient (`'act_cost'`). Whenever working with a new data set, it can be useful to calculate summary statistics to develop a feeling for the volume and character of the data. This makes it easier to spot trends and significant features during further stages of analysis.

Calculate the sum, mean, standard deviation, and quartile statistics for each of these quantities. Format your results for each quantity as a list: `[sum, mean, standard deviation, 1st quartile, median, 3rd quartile]`. We'll create a `tuple` with these lists for each quantity as a final result.
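
As a rough illustration of the list format described above, here is a minimal sketch on a small made-up list of values, using only built-in Python. The names `values`, `middle`, and `summarize` are hypothetical, and the quartile rule (median of each half, with the middle element shared when the length is odd) is an assumption: several quartile conventions exist and they can give different answers on the same data.

```python
# Toy values standing in for one field of `scripts`, e.g. 'items' (made up)
values = [2, 1, 12, 3, 6, 15, 4]

def middle(sorted_vals):
    """Median of a list that is already sorted."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def summarize(vals):
    n = len(vals)
    total = sum(vals)
    mean = total / n
    # Population standard deviation: divide by n, not n - 1
    std = (sum((v - mean) ** 2 for v in vals) / n) ** 0.5
    ordered = sorted(vals)
    # Assumed convention: quartiles are the medians of the lower and upper
    # halves, each half including the middle element when n is odd
    q25 = middle(ordered[:(n + 1) // 2])
    q75 = middle(ordered[n // 2:])
    return [total, mean, std, q25, middle(ordered), q75]

stats = summarize(values)
```

For the toy list, `stats` starts with the sum 43 and has quartiles 2.5, 4, and 9.0; a different quartile convention could legitimately produce other values.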

In [22]:
def average(data, key):
    # Mean of the values stored under `key` across all records
    values = [d[key] for d in data]
    return sum(values) / len(values)


In [23]:
def std_dev(data, key, avg):
    # Population standard deviation: divide by n rather than n - 1
    squared_diffs = [(d[key] - avg) ** 2 for d in data]
    return (sum(squared_diffs) / len(data)) ** 0.5

In [24]:
import math
def median1(data, key):
    # Median of the values stored under `key`
    values = sorted(d[key] for d in data)
    mid = len(values) // 2
    if len(values) % 2 == 1:
        return values[mid]
    # Even length: midpoint of the two central values.
    # Floor division keeps integer inputs as integers, but it rounds a
    # fractional midpoint down (2 and 3 give 2, not 2.5).
    return (values[mid - 1] + values[mid]) // 2

#data = [{'item': 1},{'item': 2},{'item': 3},{'item': 8},{'item': 5},{'item': 5}]
#median1(data,'item')

In [9]:
import math
def median2(sorted_vals):
    # Median of a list that the caller has already sorted
    mid = len(sorted_vals) // 2
    if len(sorted_vals) % 2 == 1:
        return sorted_vals[mid]
    # Even length: floor division rounds a fractional midpoint down
    return (sorted_vals[mid - 1] + sorted_vals[mid]) // 2


In [25]:
def quar_first(data, key):
    # First quartile: median of the lower half of the sorted values.
    # The slice runs one element past the midpoint, so with 100 values it covers the first 51.
    values = sorted(d[key] for d in data)
    first_half = (len(data) // 2) + 1
    return median2(values[:first_half])
    
data = [{'item': 1},{'item': 2},{'item': 3},{'item': 8},{'item': 5},{'item': 5},{'item': 6},{'item': 9},{'item': 3}]
quar_first(data,'item')

3

In [26]:
def quar_third(data, key):
    # Third quartile: median of the upper half of the sorted values,
    # starting at the midpoint (with 100 values, the last 50)
    values = sorted(d[key] for d in data)
    second_half = len(data) // 2
    return median2(values[second_half:])
    
#data = [{'item': 1},{'item': 2},{'item': 3},{'item': 8},{'item': 5},{'item': 5},{'item': 6},{'item': 9},{'item': 3}]
#quar_third(data,'item')



In [13]:
def cal_quantiles(vals):
    vals = sorted(vals)
    first_half = vals[:len(vals) // 2 + 1]
    second_half = vals[len(vals) // 2:]
    
    q25 = median2(first_half)
    median = median2(vals)
    q75 = median2(second_half)
    
    return q25, median, q75

In [27]:

def describe(data, key):
    total = sum(d[key] for d in data)
    avg = average(data, key)
    s = std_dev(data, key, avg)
    q25 = quar_first(data, key)
    med = median1(data, key)
    q75 = quar_third(data, key)

    return (total, avg, s, q25, med, q75)

describe(scripts,'items')


(4410054, 11.522744731217633, 33.11216633980368, 1, 3, 8)

In [33]:
import math

def percentile(data, pct):
    # Value at the given percentile, picked by position in the sorted list
    ordered = sorted(data)
    p = len(ordered) * (pct / 100)
    if p.is_integer():
        return ordered[int(p)]
    return ordered[int(math.ceil(p)) - 1]

def median3(data):
    return percentile(data, 50)

def q25(data):
    return percentile(data, 25)

def q75(data):
    return percentile(data, 75)

def cal_quantiles2(vals):
    vals = sorted(vals)

    # Use distinct local names: assigning to q25 or q75 here would shadow
    # the functions above and raise UnboundLocalError
    first_q = q25(vals)
    median = median3(vals)
    third_q = q75(vals)

    return first_q, median, third_q

In [31]:
summary = [('items', describe2(scripts, 'items')),
           ('quantity', describe2(scripts, 'quantity')),
           ('nic', describe2(scripts, 'nic')),
           ('act_cost', describe2(scripts, 'act_cost'))]

In [30]:
def describe2(data, key):
    vals = [d[key] for d in data]
    total = sum(vals)
    avg = total / len(vals)
    sum_squares = sum([(val - avg) ** 2 for val in vals])
    s = (sum_squares / len(vals)) ** 0.5
    q25 , med , q75 = cal_quantiles2(vals)
   
    return (total, avg, s, q25, med, q75)

describe2(scripts,'items')

(4410054, 11.522744731217633, 33.11216633980368, 1, 3, 8)

In [32]:
grader.score.pw__summary_statistics(summary)

Your score: 0.833


## Question 2: most_common_item

Often we are not interested only in how the data is distributed in our entire data set, but within particular groups -- for example, how many items of each drug (i.e. `'bnf_name'`) were prescribed? Calculate the total items prescribed for each `'bnf_name'`. What is the most commonly prescribed `'bnf_name'` in our data?

To calculate this, we first need to split our data set into groups corresponding with the different values of `'bnf_name'`. Then we can sum the number of items dispensed within each group. Finally we can find the largest sum.

We'll use `'bnf_name'` to construct our groups. You should have *5619* unique values for `'bnf_name'`.
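
The split, sum, and maximum steps described above can be sketched on a few made-up records. This is only an illustration under assumptions: the drug names and counts are invented, `toy_scripts` and `toy_max_item` are hypothetical names, and the sketch accumulates totals directly rather than storing each group's prescriptions.

```python
from collections import defaultdict

# Toy prescriptions with made-up drug names and counts
toy_scripts = [
    {'bnf_name': 'Drug A', 'items': 2},
    {'bnf_name': 'Drug B', 'items': 5},
    {'bnf_name': 'Drug A', 'items': 4},
    {'bnf_name': 'Drug C', 'items': 1},
]

# Split and aggregate in one pass: accumulate 'items' per 'bnf_name'
totals = defaultdict(int)
for script in toy_scripts:
    totals[script['bnf_name']] += script['items']

# Combine: pick the (name, total) pair with the largest total
best_name, best_total = max(totals.items(), key=lambda pair: pair[1])
toy_max_item = [(best_name, best_total)]
```

Keeping only running totals uses less memory than keeping full groups, at the cost of losing access to the individual prescriptions afterwards.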

In [148]:
# A set keeps each name only once, so duplicates need no extra handling
bnf_names = {d['bnf_name'] for d in scripts}
assert(len(bnf_names) == 5619)

We want to construct "groups" identified by `'bnf_name'`, where each group is a collection of prescriptions (i.e. dictionaries from `scripts`). We'll construct a dictionary called `groups`, using `bnf_names` as the keys. We'll represent a group with a `list`, since we can easily append new members to the group. To split our `scripts` into groups by `'bnf_name'`, we should iterate over `scripts`, appending prescription dictionaries to each group as we encounter them.

In [149]:
# Start with an empty list for every name, then file each prescription
# under its 'bnf_name'
groups = {name: [] for name in bnf_names}
for script in scripts:
    bnf_name = script['bnf_name']
    groups[bnf_name].append(script)

In [21]:
scripts[0]['bnf_name']

'Co-Magaldrox_Susp 195mg/220mg/5ml S/F'

In [22]:
groups['Co-Magaldrox_Susp 195mg/220mg/5ml S/F']

[{'bnf_code': '0101010G0AAABAB',
  'items': 2,
  'practice': 'N81013',
  'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
  'nic': 5.98,
  'act_cost': 5.56,
  'quantity': 1000},
 {'bnf_code': '0101010G0AAABAB',
  'items': 1,
  'practice': 'N81029',
  'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
  'nic': 2.99,
  'act_cost': 2.78,
  'quantity': 500},
 {'bnf_code': '0101010G0AAABAB',
  'items': 2,
  'practice': 'N81088',
  'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
  'nic': 5.98,
  'act_cost': 5.56,
  'quantity': 1000},
 {'bnf_code': '0101010G0AAABAB',
  'items': 6,
  'practice': 'A81017',
  'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
  'nic': 20.93,
  'act_cost': 19.45,
  'quantity': 3500},
 {'bnf_code': '0101010G0AAABAB',
  'items': 2,
  'practice': 'A81034',
  'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
  'nic': 5.98,
  'act_cost': 5.56,
  'quantity': 1000},
 {'bnf_code': '0101010G0AAABAB',
  'items': 10,
  'practice': 'P85003',
  'bnf_name': 'Co

Now that we've constructed our groups we should sum up `'items'` in each group and find the `'bnf_name'` with the largest sum. The result, `max_item`, should have the form `[(bnf_name, item total)]`, e.g. `[('Foobar', 2000)]`.

In [23]:
%%timeit
highest_sum = 0
for name in bnf_names:                    # each distinct drug name
    scripts_for_name = groups[name]       # all prescriptions for that drug
    total = 0                             # named `total` so the built-in sum() is not shadowed
    for script in scripts_for_name:
        total += script['items']
    if total > highest_sum:
        highest_sum = total
        highest_drug = name

238 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [24]:

max_item = [("Omeprazole_Cap E/C 20mg", 113826)]

**TIP:** If you are getting an error from the grader below, please make sure your answer conforms to the correct format of `[(bnf_name, item total)]`.

In [25]:
grader.score.pw__most_common_item(max_item)

Your score: 1.000


**Challenge:** Write a function that constructs groups as we did above. The function should accept a list of dictionaries (e.g. `scripts` or `practices`) and a tuple of fields to `groupby` (e.g. `('bnf_name',)` or `('bnf_name', 'post_code')`) and returns a dictionary of groups. The following questions will require you to aggregate data in groups, so this could be a useful function for the rest of the miniproject.

In [26]:
def group_by_field(data, fields):
    # Key each record by a tuple of its values for `fields`
    groups = {}
    for d in data:
        key = tuple(d[f] for f in fields)
        groups.setdefault(key, []).append(d)
    return groups

In [192]:

def group_by_field_v1(data, fields):
    # Each group key is a tuple holding the record's values for `fields`
    unique_fields = {tuple(d[f] for f in fields) for d in data}
    groups = {key: [] for key in unique_fields}
    for d in data:
        key = tuple(d[f] for f in fields)
        groups[key].append(d)
    return groups




In [191]:
from collections import defaultdict

def group_by_field_v2(data,fields):
    groups = defaultdict(list)
    for d in data:
        key = tuple(d[f] for f in fields)  # tuple key, so several fields can be combined
        groups[key].append(d)
    return groups
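
To see what the tuple keys look like in practice, here is a small usage sketch on invented records. The helper is repeated under the hypothetical name `group_records` so the example runs on its own, and the post codes and drug names are made up.

```python
from collections import defaultdict

def group_records(data, fields):
    # Same idea as above: key each record by a tuple of the chosen fields
    groups = defaultdict(list)
    for d in data:
        groups[tuple(d[f] for f in fields)].append(d)
    return groups

toy = [
    {'bnf_name': 'Drug A', 'post_code': 'AA1 1AA', 'items': 2},
    {'bnf_name': 'Drug A', 'post_code': 'BB2 2BB', 'items': 3},
    {'bnf_name': 'Drug B', 'post_code': 'AA1 1AA', 'items': 7},
    {'bnf_name': 'Drug A', 'post_code': 'AA1 1AA', 'items': 1},
]

by_name = group_records(toy, ('bnf_name',))
by_name_and_post = group_records(toy, ('bnf_name', 'post_code'))

# Keys are always tuples, even when grouping by a single field
single_key_count = len(by_name[('Drug A',)])
pair_key_count = len(by_name_and_post[('Drug A', 'AA1 1AA')])
```

Because a single-field key is still a 1-tuple such as `('Drug A',)`, looking a group up with the bare string would miss it.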

In [30]:
def test_max_item():
    # Group by 'bnf_name' alone; the keys are 1-tuples such as ('Drug name',)
    groups = group_by_field_v1(scripts, ('bnf_name',))
    highest_sum = 0
    highest_drug = None
    for (name,), group in groups.items():
        total = sum(script['items'] for script in group)
        if total > highest_sum:
            highest_sum = total
            highest_drug = name
    return [(highest_drug, highest_sum)]

In [38]:
# Sum 'items' over every prescription of the most common drug
sum(script['items'] for script in groups["Omeprazole_Cap E/C 20mg"])
#assert test_max_item() == max_item

113826

## Question 3: postal_totals

Our data set is broken up among different files. This is typical for tabular data to reduce redundancy. Each table typically contains data about a particular type of event, process, or physical object. Data on prescriptions and medical practices are in separate files in our case. If we want to find the total items prescribed in each postal code, we will have to _join_ our prescription data (`scripts`) to our clinic data (`practices`).

Find the total items prescribed in each postal code, representing the results as a list of tuples `(post code, total items prescribed)`. Sort your results ascending alphabetically by post code and take only results from the first 100 post codes. Only include post codes if there is at least one prescription from a practice in that post code.

**NOTE:** Some practices have multiple postal codes associated with them. Use the alphabetically first postal code.

 We can join `scripts` and `practices` based on the fact that `'practice'` in `scripts` matches `'code'` in `practices`. However, we must first deal with the repeated values of `'code'` in `practices`. We want the alphabetically first postal codes.

In [117]:
# practice_postal maps each practice code to its alphabetically first post code.
# A practice code can appear more than once in `practices`, each time with a different post code.
practice_postal = {}

for practice in practices:
    practice_code = practice['code']
    if practice_code in practice_postal:
        # Seen before: keep whichever post code sorts first
        practice_postal[practice_code] = min(practice_postal[practice_code], practice['post_code'])
    else:
        practice_postal[practice_code] = practice['post_code']

**Challenge:** This is an aggregation of the practice data grouped by practice codes. Write an alternative implementation of the above cell using the `group_by_field` function you defined previously.
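
One possible shape for such an alternative is sketched here on made-up practice records: group by practice code first, then reduce each group with `min`. The grouping is written inline so the example is self-contained, and `toy_practices` and `toy_practice_postal` are hypothetical names.

```python
from collections import defaultdict

# Toy practice records; 'P001' appears twice with different post codes
toy_practices = [
    {'code': 'P001', 'post_code': 'ZZ9 9ZZ'},
    {'code': 'P002', 'post_code': 'MM5 5MM'},
    {'code': 'P001', 'post_code': 'AA1 1AA'},
]

# Group the records by practice code
by_code = defaultdict(list)
for practice in toy_practices:
    by_code[practice['code']].append(practice)

# Aggregate each group down to its alphabetically first post code
toy_practice_postal = {
    code: min(p['post_code'] for p in group)
    for code, group in by_code.items()
}
```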

In [67]:
assert practice_postal['K82019'] == 'HP21 8TR'

Now we can join `practice_postal` to `scripts`.
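
As a small illustration of the join, the sketch below attaches a post code to each prescription by building new dictionaries, so the original records are left untouched. All names and values are invented for the example.

```python
# Made-up lookup from practice code to post code
toy_practice_postal = {'P001': 'AA1 1AA', 'P002': 'MM5 5MM'}

toy_scripts = [
    {'practice': 'P001', 'items': 2},
    {'practice': 'P002', 'items': 5},
]

# Build new dictionaries so the original records stay unchanged
toy_joined = [
    {**script, 'post_code': toy_practice_postal[script['practice']]}
    for script in toy_scripts
]
```

Copying every record costs extra memory, which may matter on the full data set; updating the dictionaries in place avoids the copy but also changes the source list.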

In [152]:
def cal_sum_v2(data, field):
    # Sum the values stored under `field` across all records
    return sum(d[field] for d in data)

In [160]:
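# `joined` is another name for `scripts`, not a copy: the loop below adds
# 'post_code' to the original prescription dictionaries
joined = scripts

for script in joined:
    script['post_code'] = practice_postal[script['practice']]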


In [141]:
groups = group_by_field_v2(joined, ('post_code',))
# Keep post codes with at least one prescription
groups = {key: val for key, val in groups.items() if len(val) >= 1}
# Group keys are 1-tuples such as ('SK11 6JL',); take element 0 to get the bare post code
items_by_post = [(post_code[0], cal_sum_v2(group, 'items')) for post_code, group in groups.items()]


Finally we'll group the prescription dictionaries in `joined` by `'post_code'` and sum up the items prescribed in each group, as we did in the previous question.

In [126]:
items_by_post = sorted(items_by_post)[:100]
items_by_post[:5]

[('B11 4BW', 20673),
 ('B18 7AL', 19001),
 ('B21 9RY', 29103),
 ('B23 6DJ', 24859),
 ('B70 7AW', 36531)]

In [28]:
# Earlier attempt: group by 'post_code' with the Question 2 loop and report the largest total
#post_codes = {script['post_code'] for script in scripts}
#
#groups = {post_code: [] for post_code in post_codes}
#for script in scripts:
#    groups[script['post_code']].append(script)
#
#def cal_sum(post_codes, groups):
#    highest_sum = 0
#    for post_code in post_codes:
#        total = 0
#        for script in groups[post_code]:
#            total = total + script['items']
#        if total > highest_sum:
#            highest_sum = total
#            highest_post = post_code
#    print(highest_post)
#    print(highest_sum)
#    return highest_post, highest_sum
#
#post, total = cal_sum(post_codes, groups)

SK11 6JL
110071


In [127]:
postal_totals = [('B11 4BW', 20673)] * 100

grader.score.pw__postal_totals(items_by_post)

Your score: 1.000


## Question 4: items_by_region

Now we'll combine the techniques we've developed to answer a more complex question. Find the most commonly dispensed item in each postal code, representing the results as a list of tuples (`post_code`, `bnf_name`, amount dispensed as proportion of total). Sort your results ascending alphabetically by post code and take only results from the first 100 post codes.

**NOTE:** We'll continue to use the `joined` variable we created before, where we've chosen the alphabetically first postal code for each practice. Additionally, some postal codes will have multiple `'bnf_name'` with the same number of items prescribed for the maximum. In this case, we'll take the alphabetically first `'bnf_name'`.

There are several approaches to solve this problem but we will guide you through one of them. Feel free to solve it your own way if it is easier for you to understand and implement. If your kernel keeps on dying, it's probably an indication that you are running out of memory. Consider deleting objects you don't need anymore using the `del` statement and shutting down any other running notebooks. For example:
```Python
del some_object_not_needed
```

The first step is to calculate the total items for each `'post_code'` and `'bnf_name'`. Let's call that result `total_items_by_post_bnf`. Consider what is the best data structure(s) to represent `total_items_by_post_bnf`. It should have 141196 `('post_code', 'bnf_name')` groups.
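
One candidate data structure is a dictionary keyed by `(post_code, bnf_name)` tuples that maps straight to the summed items. The sketch below shows the idea on invented records; it is one option among several, and `toy_joined` and `toy_totals` are hypothetical names.

```python
from collections import defaultdict

# Made-up joined records
toy_joined = [
    {'post_code': 'AA1 1AA', 'bnf_name': 'Drug A', 'items': 2},
    {'post_code': 'AA1 1AA', 'bnf_name': 'Drug B', 'items': 5},
    {'post_code': 'AA1 1AA', 'bnf_name': 'Drug A', 'items': 4},
    {'post_code': 'BB2 2BB', 'bnf_name': 'Drug A', 'items': 1},
]

# Accumulate 'items' for each (post_code, bnf_name) pair
toy_totals = defaultdict(int)
for script in toy_joined:
    toy_totals[(script['post_code'], script['bnf_name'])] += script['items']
```

Checking `len(toy_totals)` against the expected number of groups is a quick sanity test for this step.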

In [193]:
grouped_by_post_bnf = group_by_field_v2(joined,('bnf_name','post_code'))
groups = [(key, cal_sum_v2(val,'items')) for key, val in grouped_by_post_bnf.items()] 
assert len(groups) == 141196

In [194]:
groups[:5]


[(('Co-Magaldrox_Susp 195mg/220mg/5ml S/F', 'SK11 6JL'), 5),
 (('Alginate_Raft-Forming Oral Susp S/F', 'SK11 6JL'), 3),
 (('Sod Algin/Pot Bicarb_Susp S/F', 'SK11 6JL'), 94),
 (('Sod Alginate/Pot Bicarb_Tab Chble 500mg', 'SK11 6JL'), 9),
 (('Gaviscon Infant_Sach 2g (Dual Pack) S/F', 'SK11 6JL'), 41)]

Next, let's take `total_items_by_post_bnf` and group it by `'post_code'`. In other words, we want a data structure that maps a `'post_code'` to a list of all records that belong to that `'post_code'`. There should be 118 groups.

In [269]:
total_item_by_bnf_post = [{'bnf_names': bnf_name, 'post_code': post_code, 'total': total}
                          for (bnf_name, post_code), total in groups]

total_item_by_bnf_post[:5]

[{'bnf_names': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
  'post_code': 'SK11 6JL',
  'total': 5},
 {'bnf_names': 'Alginate_Raft-Forming Oral Susp S/F',
  'post_code': 'SK11 6JL',
  'total': 3},
 {'bnf_names': 'Sod Algin/Pot Bicarb_Susp S/F',
  'post_code': 'SK11 6JL',
  'total': 94},
 {'bnf_names': 'Sod Alginate/Pot Bicarb_Tab Chble 500mg',
  'post_code': 'SK11 6JL',
  'total': 9},
 {'bnf_names': 'Gaviscon Infant_Sach 2g (Dual Pack) S/F',
  'post_code': 'SK11 6JL',
  'total': 41}]

In [270]:
grouped_by_post_only = group_by_field_v2(total_item_by_bnf_post, ('post_code',))


In [274]:
total_items_by_post = [(key[0], cal_sum_v2(val,'total')) for key, val in grouped_by_post_only.items()] # We are calculating total number of items for each post_code.
assert len(total_items_by_post) == 118
print(total_items_by_post[:5])

[('SK11 6JL', 110071), ('CW5 5NX', 38797), ('CW1 3AW', 64104), ('CW7 1AT', 43164), ('CH65 6TG', 25090)]


Now with `grouped_post_code`, let's iterate over each group and calculate the following fields for each `'post_code'`:
1. the sum of total items for all `'bnf_name'`
1. the most total items
1. the `'bnf_name'` that had the most total items

Once again, consider the best data structure(s) to use to represent the result. It may help to write and use a function when developing your solution.
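
The three per-post-code fields listed above could be computed along these lines. This is a sketch on invented totals, not the required solution; it breaks ties by taking the alphabetically first name, as the earlier NOTE asks, and `toy_totals`, `by_post`, and `toy_summary` are hypothetical names.

```python
from collections import defaultdict

# Toy totals keyed by (post_code, bnf_name); all values are made up
toy_totals = {
    ('AA1 1AA', 'Drug B'): 6,
    ('AA1 1AA', 'Drug A'): 6,
    ('AA1 1AA', 'Drug C'): 3,
    ('BB2 2BB', 'Drug A'): 1,
}

# Regroup the totals by post code
by_post = defaultdict(list)
for (post_code, bnf_name), total in toy_totals.items():
    by_post[post_code].append((bnf_name, total))

toy_summary = {}
for post_code, records in by_post.items():
    post_total = sum(total for _, total in records)
    # Largest total first; ties broken by the alphabetically first name
    top_name, top_total = min(records, key=lambda r: (-r[1], r[0]))
    toy_summary[post_code] = (post_total, top_total, top_name)
```

Sorting on `(-total, name)` handles the tie between 'Drug A' and 'Drug B' deterministically, whereas a plain `max` on the total alone would return whichever record it meets first.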

In [279]:
#k = sum(bnf_name['total'] for bnf_name in total_item_by_bnf_post)
#print(k)

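# For each post code, keep the record with the largest total.
# max() returns the first such record on ties, which is not necessarily the alphabetically first 'bnf_name'.
descending_list_of_post = [max(val, key=lambda d: d['total']) for val in grouped_by_post_only.values() if val]
descending_list_of_post[:5]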


[{'bnf_names': 'Omeprazole_Cap E/C 20mg',
  'post_code': 'SK11 6JL',
  'total': 3219},
 {'bnf_names': 'Omeprazole_Cap E/C 20mg',
  'post_code': 'CW5 5NX',
  'total': 1419},
 {'bnf_names': 'Omeprazole_Cap E/C 20mg',
  'post_code': 'CW1 3AW',
  'total': 2364},
 {'bnf_names': 'Omeprazole_Cap E/C 20mg',
  'post_code': 'CW7 1AT',
  'total': 1655},
 {'bnf_names': 'Lansoprazole_Cap 30mg (E/C Gran)',
  'post_code': 'CH65 6TG',
  'total': 688}]

In [212]:
l = max(bnf_name['total'] for bnf_name in total_item_by_bnf_post)
print(l)



3219


In [308]:
# Look up each post code's total once instead of scanning the list on every call
total_by_post = dict(total_items_by_post)

def cal_total(x):
    # Share of the post code's items accounted for by record x
    return x['total'] / total_by_post[x['post_code']]

In [309]:
items_by_region = [(x['post_code'], x['bnf_names'], cal_total(x)) for x in descending_list_of_post]

In [223]:
#highest_total = 0
#highest_sum = 0
#for bnf_name in total_item_by_bnf_post:
#    bnf_name_only = bnf_name['bnf_names']
#    total = bnf_name['total']
#    if total > highest_total:
#        highest_total = total
#        highest_drug = bnf_name_only
#
#print(highest_drug)
    

Omeprazole_Cap E/C 20mg


In [305]:

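# Total items across every (bnf_name, post_code) pair
k = sum(t['total'] for t in total_item_by_bnf_post)

def prop(t):
    # Share of all prescribed items accounted for by record t
    return t['total'] / k

last = [(t['bnf_names'], t['post_code'], prop(t)) for t in total_item_by_bnf_post]

last[:5]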

[('Co-Magaldrox_Susp 195mg/220mg/5ml S/F', 'SK11 6JL', 1.1337729651382954e-06),
 ('Alginate_Raft-Forming Oral Susp S/F', 'SK11 6JL', 6.802637790829772e-07),
 ('Sod Algin/Pot Bicarb_Susp S/F', 'SK11 6JL', 2.131493174459995e-05),
 ('Sod Alginate/Pot Bicarb_Tab Chble 500mg',
  'SK11 6JL',
  2.040791337248932e-06),
 ('Gaviscon Infant_Sach 2g (Dual Pack) S/F',
  'SK11 6JL',
  9.296938314134022e-06)]

In [252]:
who


bnf_name	 bnf_name_only	 bnf_names	 cal_sum_v2	 d	 data	 defaultdict	 f	 group	 
group_by_field_v1	 group_by_field_v2	 grouped_by_post_bnf	 grouped_by_post_only	 grouped_post_code	 groups	 gzip	 highest_drug	 highest_sum	 
highest_total	 i	 items_by_post	 joined	 json	 k	 l	 las_func	 math	 
max_item	 postal_totals	 practice	 practice_code	 practice_postal	 practices	 practices2	 prop	 script	 
scripts	 sns	 total	 total_item_by_bnf_post	 total_items_by_post	 


Now, we are ready to:
1. calculate the ratio (the amount dispensed as proportion of total)
1. [sort](https://docs.python.org/3/howto/sorting.html) alphabetically by the post code
1. format the answer as a list of tuples
1. take only the first 100 tuples
1. submit to the grader
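
The ratio, sorting, formatting, and truncation steps above can be sketched as follows. The input layout (a dictionary from post code to a `(sum of items, top total, top name)` tuple) is an assumption made for the example, and all values are invented.

```python
# Toy per-post-code results: post_code -> (sum of items, top total, top name)
toy_summary = {
    'BB2 2BB': (10, 4, 'Drug A'),
    'AA1 1AA': (20, 5, 'Drug B'),
}

# Build (post_code, bnf_name, ratio) tuples, sort by post code, keep the first 100
toy_items_by_region = sorted(
    (post_code, top_name, top_total / post_total)
    for post_code, (post_total, top_total, top_name) in toy_summary.items()
)[:100]
```

Since each post code appears once, sorting the tuples orders them by post code alone; the slice is a no-op here but mirrors the 100-entry limit.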

In [None]:
items_by_region = [('B11 4BW', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.0341508247)] * 100

In [310]:
items_by_region[:5]

[('SK11 6JL', 'Omeprazole_Cap E/C 20mg', 0.029244760200234393),
 ('CW5 5NX', 'Omeprazole_Cap E/C 20mg', 0.036574992911823076),
 ('CW1 3AW', 'Omeprazole_Cap E/C 20mg', 0.03687757394234369),
 ('CW7 1AT', 'Omeprazole_Cap E/C 20mg', 0.038342136965990176),
 ('CH65 6TG', 'Lansoprazole_Cap 30mg (E/C Gran)', 0.027421283379832604)]

In [312]:
grader.score.pw__items_by_region(items_by_region)

Your score: 0.820


*Copyright &copy; 2021 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*