In [1]:
%logstop
%logstart -ortq ~/.logs/pw.py append
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

In [3]:
from static_grader import grader

# PW Miniproject
## Introduction

The objective of this miniproject is to exercise your ability to use basic Python data structures, define functions, and control program flow. We will be using these concepts to perform some fundamental data wrangling tasks such as joining data sets together, splitting data into groups, and aggregating data into summary statistics.
**Please do not use `pandas` or `numpy` to answer these questions.**

We will be working with medical data from the British NHS on prescription drugs. Since this is real data, it contains many ambiguities that we will need to confront in our analysis. This is commonplace in data science, and is one of the lessons you will learn in this miniproject.

## Downloading the data

We first need to download the data we'll be using from Amazon S3:

In [4]:
%%bash
mkdir pw-data
wget http://dataincubator-wqu.s3.amazonaws.com/pwdata/201701scripts_sample.json.gz -nc -P ./pw-data
wget http://dataincubator-wqu.s3.amazonaws.com/pwdata/practices.json.gz -nc -P ./pw-data

mkdir: cannot create directory ‘pw-data’: File exists
File ‘./pw-data/201701scripts_sample.json.gz’ already there; not retrieving.

File ‘./pw-data/practices.json.gz’ already there; not retrieving.



## Loading the data

The first step of the project is to read in the data. We will discuss reading and writing various kinds of files later in the course, but the code below should get you started.

In [5]:
import gzip
import simplejson as json

In [6]:
with gzip.open('./pw-data/201701scripts_sample.json.gz', 'rb') as f:
    scripts = json.load(f)

with gzip.open('./pw-data/practices.json.gz', 'rb') as f:
    practices = json.load(f)

This data set comes from Britain's National Health Service. The `scripts` variable is a list of prescriptions issued by NHS doctors. Each prescription is represented by a dictionary with various data fields: `'practice'`, `'bnf_code'`, `'bnf_name'`, `'quantity'`, `'items'`, `'nic'`, and `'act_cost'`. 

In [7]:
scripts[:2]

[{'bnf_code': '0101010G0AAABAB',
  'items': 2,
  'practice': 'N81013',
  'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
  'nic': 5.98,
  'act_cost': 5.56,
  'quantity': 1000},
 {'bnf_code': '0101021B0AAAHAH',
  'items': 1,
  'practice': 'N81013',
  'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
  'nic': 1.95,
  'act_cost': 1.82,
  'quantity': 500}]

In [8]:
scripts[-2]

{'bnf_code': '23803108005',
 'items': 4,
 'practice': 'H81615',
 'bnf_name': '3m Health Care_Cavilon No Sting Barrier',
 'nic': 141.0,
 'act_cost': 130.54,
 'quantity': 180}

In [9]:
type(scripts)

list

A [glossary of terms](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10686/Download-glossary-of-terms-for-GP-prescribing---presentation-level/pdf/PLP_Presentation_Level_Glossary_April_2015.pdf/) and [FAQ](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10048/FAQs-Practice-Level-Prescribingpdf/pdf/PLP_FAQs_April_2015.pdf/) for the data are available from the NHS. Below we supply a data dictionary briefly describing what these fields mean.

| Data field |Description|
|:----------:|-----------|
|`'practice'`|Code designating the medical practice issuing the prescription|
|`'bnf_code'`|British National Formulary drug code|
|`'bnf_name'`|British National Formulary drug name|
|`'quantity'`|Number of capsules/quantity of liquid/grams of powder prescribed|
| `'items'`  |Number of refills (e.g. if `'quantity'` is 30 capsules, 3 `'items'` means 3 bottles of 30 capsules)|
|  `'nic'`   |Net ingredient cost|
|`'act_cost'`|Total cost including containers, fees, and discounts|

The `practices` variable is a list of member medical practices of the NHS. Each practice is represented by a dictionary containing identifying information for the medical practice. Most of the data fields are self-explanatory. Notice the values in the `'code'` field of `practices` match the values in the `'practice'` field of `scripts`.

In [10]:
practices[:2]

[{'code': 'A81001',
  'name': 'THE DENSHAM SURGERY',
  'addr_1': 'THE HEALTH CENTRE',
  'addr_2': 'LAWSON STREET',
  'borough': 'STOCKTON ON TEES',
  'village': 'CLEVELAND',
  'post_code': 'TS18 1HU'},
 {'code': 'A81002',
  'name': 'QUEENS PARK MEDICAL CENTRE',
  'addr_1': 'QUEENS PARK MEDICAL CTR',
  'addr_2': 'FARRER STREET',
  'borough': 'STOCKTON ON TEES',
  'village': 'CLEVELAND',
  'post_code': 'TS18 2AW'}]

In [11]:
type(practices)

list

The `practices` list corresponds to what JSON calls an *array*: an ordered sequence of values written in square brackets `[]`. Each dictionary inside it corresponds to a JSON *object*:

* JSON objects are surrounded by curly braces `{}`.

* JSON objects are written as key/value pairs.

* Keys must be strings, and values must be a valid JSON data type (string, number, object, array, boolean or null).

* Keys and values are separated by a colon.

* Key/value pairs are separated by commas.
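As a small illustration of how JSON maps onto Python types, the sketch below parses a made-up two-record string (not the NHS files). It assumes the standard-library `json` module rather than `simplejson`, only to keep the example dependency-free.

```python
import json

# Hypothetical JSON text: an array (square brackets) holding two objects (curly braces).
text = (
    '[{"code": "A81001", "name": "THE DENSHAM SURGERY"},'
    ' {"code": "A81002", "name": "QUEENS PARK MEDICAL CENTRE"}]'
)

records = json.loads(text)

print(type(records))       # the JSON array becomes a Python list
print(type(records[0]))    # each JSON object becomes a Python dict
print(records[1]["code"])  # values are looked up by key
```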

In the following questions we will ask you to explore this data set. You may need to combine pieces of the data set together in order to answer some questions. Not every element of the data set will be used in answering the questions.

## Question 1: summary_statistics

Our prescription data (`scripts`) contains quantitative data on the number of items dispensed (`'items'`), the total quantity of each item dispensed (`'quantity'`), the net cost of the ingredients (`'nic'`), and the actual cost to the patient (`'act_cost'`). Whenever working with a new data set, it can be useful to calculate summary statistics to develop a feeling for the volume and character of the data. This makes it easier to spot trends and significant features during further stages of analysis.

Calculate the sum, mean, standard deviation, and quartile statistics for each of these quantities. Format your results for each quantity as a list: `[sum, mean, standard deviation, 1st quartile, median, 3rd quartile]`. We'll create a `tuple` with these lists for each quantity as a final result.

#### Python statistics 

#### Statistical Functions in Python | Set 1 (Averages and Measure of Central Location)

Python can compute common summary statistics with the standard-library `statistics` module.

Important functions for averages and measures of central location:

* `mean()`: returns the arithmetic mean (average) of the data.
* `mode()`: returns the value with the largest number of occurrences.
* `median()`: returns the median, i.e. the middle value of the data.
* `median_low()`: returns the median when the number of elements is odd; when it is even, returns the lower of the two middle elements.
* `median_high()`: returns the median when the number of elements is odd; when it is even, returns the higher of the two middle elements.
* `median_grouped()`: returns the median (50th percentile) of grouped data.

#### Statistical Functions in Python | Set 2 ( Measure of Spread)

* `variance()`: returns the sample variance. Variance (σ²) measures the spread of a data set, i.e. how far each value lies from the mean; the larger the variance, the more spread out the values are.
* `pvariance()`: returns the population variance. The data are interpreted as the whole population rather than a sample.
* `stdev()`: returns the sample standard deviation (the square root of the sample variance).
* `pstdev()`: returns the population standard deviation (the square root of the population variance).

#### stdev()

The `statistics` module provides the function `stdev()`, which calculates the standard deviation of a sample of data rather than of an entire population.
To calculate the standard deviation of an entire population, use `pstdev()` instead.
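A minimal sketch of the difference between the two functions, using a made-up list of eight numbers chosen so the arithmetic is easy to check by hand (the squared deviations from the mean of 5 sum to 32):

```python
from statistics import pstdev, stdev

# Toy data, not taken from the NHS data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = stdev(data)       # divides the sum of squared deviations by n - 1
population_sd = pstdev(data)  # divides the sum of squared deviations by n

print(sample_sd)
print(population_sd)
```

The sample version is always at least as large as the population version, and the gap shrinks as the number of values grows.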

**Standard deviation** is a measure of spread: it quantifies how much a set of data values varies. It is closely related to the variance; the standard deviation is the square root of the variance, so the variance is expressed in squared units.

A **low standard deviation** indicates that the data are clustered close to the mean, whereas a **high standard deviation** indicates that the data are spread far from the mean. A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data.

* Applications:
     * Standard deviation is widely used in statistics, for example to express confidence in statistical calculations.
     * In finance, the standard deviation of the rate of return on an investment is a measure of the volatility of the investment.

#### Averages and measures of central location

* mean(): Arithmetic mean (“average”) of data.
* fmean(): Fast, floating point arithmetic mean.
* geometric_mean() : Geometric mean of data.
* harmonic_mean(): Harmonic mean of data.
* median() : Median (middle value) of data.
* median_low(): Low median of data.
* median_high(): High median of data.
* median_grouped(): Median, or 50th percentile, of grouped data.
* mode(): Single mode (most common value) of discrete or nominal data.
* multimode(): List of modes (most common values) of discrete or nominal data.
* quantiles(): Divide data into intervals with equal probability.
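A short sketch of how a few of these functions behave on made-up lists. It assumes Python 3.8 or later, since `quantiles()` was added in that version, and uses the default (exclusive) method.

```python
from statistics import median, median_high, median_low, quantiles

# With an even number of values there are two middle elements.
even = [1, 2, 3, 4]
mid = median(even)        # average of the two middle values
low = median_low(even)    # lower of the two middle values
high = median_high(even)  # higher of the two middle values
print(mid, low, high)

# n=4 asks for quartiles, which returns the three cut points.
odd = [1, 2, 3, 4, 5, 6, 7]
quartiles = quantiles(odd, n=4)
print(quartiles)
```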

#### Measures of spread

These functions calculate a measure of how much the population or sample tends to deviate from the typical or average values.
* pstdev() : Population standard deviation of data.
* pvariance() : Population variance of data
* stdev() : Sample standard deviation of data.
* variance() : Sample variance of data.

In [12]:
key = 'items'
values = [script[key] for script in scripts]
total = sum(values)
print(total)

4410054


In [13]:
from statistics import mean, stdev, median
from math import ceil, floor

# floor() : floor() method in Python returns floor of x i.e., the largest integer not greater than x.
# ceil() : The method ceil() in Python returns ceiling value of x i.e., the smallest integer not less than x.

def quantile(q, values):
    if len(values) % 2 != 0:
        idx = floor(len(values) * q)
    else:
        idx = ceil(len(values) * q)
    return sorted(values)[idx]


def describe(key):
    values = [script[key] for script in scripts]

    total = sum(values)
    avg = mean(values)
    s = stdev(values)
    q25 = quantile(.25, values)
    med = median(values)
    q75 = quantile(.75, values)

    return (total, avg, s, q25, med, q75)

In [14]:
describe('items')

(4410054, 11.522744731217633, 33.11220959819492, 1, 3.0, 8)
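To see what the index-based `quantile` helper returns, it can be run on a short list where the answer is easy to check by hand. The helper is copied here so the sketch runs on its own; the toy list is made up. Because the helper picks an existing element instead of interpolating, it may differ from interpolating definitions of a quartile on other inputs.

```python
from math import ceil, floor


def quantile(q, values):
    # Same rule as the notebook's helper: pick the element at a rounded index.
    if len(values) % 2 != 0:
        idx = floor(len(values) * q)
    else:
        idx = ceil(len(values) * q)
    return sorted(values)[idx]


toy = [7, 1, 5, 3, 2, 6, 4]  # seven values, deliberately unsorted

q25 = quantile(0.25, toy)  # floor(7 * 0.25) = 1, the second smallest value
q75 = quantile(0.75, toy)  # floor(7 * 0.75) = 5, the sixth smallest value
print(q25, q75)
```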

In [15]:
summary = [('items', describe('items')),
           ('quantity', describe('quantity')),
           ('nic', describe('nic')),
           ('act_cost', describe('act_cost'))]

In [16]:
grader.score.pw__summary_statistics(summary)

Your score:  1.0


## Question 2: most_common_item

Often we are not interested only in how the data is distributed in our entire data set, but within particular groups -- for example, how many items of each drug (i.e. `'bnf_name'`) were prescribed? Calculate the total items prescribed for each `'bnf_name'`. What is the most commonly prescribed `'bnf_name'` in our data?

To calculate this, we first need to split our data set into groups corresponding to the different values of `'bnf_name'`. Then we can sum the number of items dispensed within each group. Finally we can find the largest sum.

We'll use `'bnf_name'` to construct our groups. You should have *5619* unique values for `'bnf_name'`.
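The split, sum, and max steps can be sketched on a made-up list of three prescriptions. The names `toy_scripts`, `toy_groups`, `toy_totals`, and `toy_max` are hypothetical and used only for illustration.

```python
# Made-up prescriptions; only the two fields needed here are included.
toy_scripts = [
    {"bnf_name": "Drug A", "items": 2},
    {"bnf_name": "Drug B", "items": 5},
    {"bnf_name": "Drug A", "items": 4},
]

# Split: one list of records per bnf_name.
toy_groups = {}
for script in toy_scripts:
    toy_groups.setdefault(script["bnf_name"], []).append(script)

# Sum: total items within each group.
toy_totals = {name: sum(s["items"] for s in group)
              for name, group in toy_groups.items()}

# Max: the (name, total) pair with the largest total.
toy_max = max(toy_totals.items(), key=lambda pair: pair[1])
print(toy_max)
```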

In [17]:
scripts[1]

{'bnf_code': '0101021B0AAAHAH',
 'items': 1,
 'practice': 'N81013',
 'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
 'nic': 1.95,
 'act_cost': 1.82,
 'quantity': 500}

In [18]:
bnf_names = [script['bnf_name'] for script in scripts]
dir(bnf_names)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

In [19]:
bnf_names = {script['bnf_name'] for script in scripts}
# print(len(bnf_names))
assert(len(bnf_names) == 5619)

We want to construct "groups" identified by `'bnf_name'`, where each group is a collection of prescriptions (i.e. dictionaries from `scripts`). We'll construct a dictionary called `groups`, using `bnf_names` as the keys. We'll represent a group with a `list`, since we can easily append new members to the group. To split our `scripts` into groups by `'bnf_name'`, we should iterate over `scripts`, appending prescription dictionaries to each group as we encounter them.

In [20]:
groups = {name: [] for name in bnf_names}
for script in scripts:
    key = script['bnf_name']
    groups[key].append(script)
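As an alternative sketch, `collections.defaultdict` from the standard library creates each empty list on first access, so the set of names does not have to be built beforehand. The toy records below are made up.

```python
from collections import defaultdict

# Made-up prescriptions for illustration.
toy_scripts = [
    {"bnf_name": "Drug A", "items": 2},
    {"bnf_name": "Drug B", "items": 5},
    {"bnf_name": "Drug A", "items": 4},
]

# Missing keys are initialised to an empty list automatically.
toy_groups = defaultdict(list)
for script in toy_scripts:
    toy_groups[script["bnf_name"]].append(script)

drug_a_items = [s["items"] for s in toy_groups["Drug A"]]
print(len(toy_groups), drug_a_items)
```

One trade-off is that merely reading a missing key from a `defaultdict` inserts it, so converting the result with `dict(toy_groups)` can be worthwhile once grouping is finished.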

In [21]:
groups['Enoxaparin_Inj 100mg/ml 0.4ml Pfs']

[{'bnf_code': '0208010D0AAABAB',
  'items': 1,
  'practice': 'N81016',
  'bnf_name': 'Enoxaparin_Inj 100mg/ml 0.4ml Pfs',
  'nic': 15.14,
  'act_cost': 14.13,
  'quantity': 5},
 {'bnf_code': '0208010D0AAABAB',
  'items': 4,
  'practice': 'N81053',
  'bnf_name': 'Enoxaparin_Inj 100mg/ml 0.4ml Pfs',
  'nic': 133.19,
  'act_cost': 123.46,
  'quantity': 44},
 {'bnf_code': '0208010D0AAABAB',
  'items': 1,
  'practice': 'Y03408',
  'bnf_name': 'Enoxaparin_Inj 100mg/ml 0.4ml Pfs',
  'nic': 84.76,
  'act_cost': 78.58,
  'quantity': 28},
 {'bnf_code': '0208010D0AAABAB',
  'items': 2,
  'practice': 'A81017',
  'bnf_name': 'Enoxaparin_Inj 100mg/ml 0.4ml Pfs',
  'nic': 108.98,
  'act_cost': 101.11,
  'quantity': 36},
 {'bnf_code': '0208010D0AAABAB',
  'items': 2,
  'practice': 'A81034',
  'bnf_name': 'Enoxaparin_Inj 100mg/ml 0.4ml Pfs',
  'nic': 175.57,
  'act_cost': 162.66,
  'quantity': 58},
 {'bnf_code': '0208010D0AAABAB',
  'items': 1,
  'practice': 'A81040',
  'bnf_name': 'Enoxaparin_Inj 100m

In [22]:
groups['Alginate_Raft-Forming Oral Susp S/F']

[{'bnf_code': '0101021B0AAAHAH',
  'items': 1,
  'practice': 'N81013',
  'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
  'nic': 1.95,
  'act_cost': 1.82,
  'quantity': 500},
 {'bnf_code': '0101021B0AAAHAH',
  'items': 2,
  'practice': 'N81632',
  'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
  'nic': 3.51,
  'act_cost': 3.28,
  'quantity': 900},
 {'bnf_code': '0101021B0AAAHAH',
  'items': 7,
  'practice': 'A81017',
  'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
  'nic': 21.04,
  'act_cost': 19.57,
  'quantity': 3050},
 {'bnf_code': '0101021B0AAAHAH',
  'items': 8,
  'practice': 'A81034',
  'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
  'nic': 12.53,
  'act_cost': 11.71,
  'quantity': 2700},
 {'bnf_code': '0101021B0AAAHAH',
  'items': 2,
  'practice': 'A81012',
  'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
  'nic': 4.53,
  'act_cost': 4.22,
  'quantity': 650},
 {'bnf_code': '0101021B0AAAHAH',
  'items': 2,
  'practice': 'A81023',
  'bnf_name': 'Alginate_Raft

In [23]:
groups['3m Health Care_Cavilon No Sting Barrier']

[{'bnf_code': '23803108010',
  'items': 3,
  'practice': 'N81029',
  'bnf_name': '3m Health Care_Cavilon No Sting Barrier',
  'nic': 17.94,
  'act_cost': 16.62,
  'quantity': 3},
 {'bnf_code': '23803108005',
  'items': 1,
  'practice': 'N81062',
  'bnf_name': '3m Health Care_Cavilon No Sting Barrier',
  'nic': 0.78,
  'act_cost': 0.83,
  'quantity': 1},
 {'bnf_code': '23803108010',
  'items': 1,
  'practice': 'N81088',
  'bnf_name': '3m Health Care_Cavilon No Sting Barrier',
  'nic': 17.94,
  'act_cost': 16.61,
  'quantity': 3},
 {'bnf_code': '23803108010',
  'items': 3,
  'practice': 'N81632',
  'bnf_name': '3m Health Care_Cavilon No Sting Barrier',
  'nic': 17.94,
  'act_cost': 16.64,
  'quantity': 3},
 {'bnf_code': '23803108010',
  'items': 1,
  'practice': 'Y03882',
  'bnf_name': '3m Health Care_Cavilon No Sting Barrier',
  'nic': 5.98,
  'act_cost': 5.55,
  'quantity': 1},
 {'bnf_code': '23803108005',
  'items': 1,
  'practice': 'N81010',
  'bnf_name': '3m Health Care_Cavilon No S

Now that we've constructed our groups we should sum up `'items'` in each group and find the `'bnf_name'` with the largest sum. The result, `max_item`, should have the form `[(bnf_name, item total)]`, e.g. `[('Foobar', 2000)]`.

In [24]:
list(groups.items())[:5]

[('Hollister_Leg Bag Ster 500ml Inlet 10cm',
  [{'bnf_code': '22500105006',
    'items': 1,
    'practice': 'P81079',
    'bnf_name': 'Hollister_Leg Bag Ster 500ml Inlet 10cm',
    'nic': 29.35,
    'act_cost': 27.17,
    'quantity': 10},
   {'bnf_code': '22500105006',
    'items': 1,
    'practice': 'P81133',
    'bnf_name': 'Hollister_Leg Bag Ster 500ml Inlet 10cm',
    'nic': 29.35,
    'act_cost': 27.18,
    'quantity': 10}]),
 ('Sod Fluoride_Mthwsh 0.05% S/F',
  [{'bnf_code': '0905030G0AABCBC',
    'items': 1,
    'practice': 'P81069',
    'bnf_name': 'Sod Fluoride_Mthwsh 0.05% S/F',
    'nic': 3.02,
    'act_cost': 2.81,
    'quantity': 500},
   {'bnf_code': '0905030G0AABCBC',
    'items': 2,
    'practice': 'P81100',
    'bnf_name': 'Sod Fluoride_Mthwsh 0.05% S/F',
    'nic': 9.66,
    'act_cost': 8.94,
    'quantity': 1600},
   {'bnf_code': '0905030G0AABCBC',
    'items': 1,
    'practice': 'B81014',
    'bnf_name': 'Sod Fluoride_Mthwsh 0.05% S/F',
    'nic': 1.51,
    'act_cos

In [25]:
item_totals = []
for d_name, group in list(groups.items())[:]:
    item_total = (sum([d['items'] for d in group]))
    item_totals.append((d_name, item_total))

In [26]:
item_totals = []
for d_name, group in groups.items():
    item_total = (sum([d['items'] for d in group]))
    item_totals.append((d_name, item_total))

In [27]:
len(item_totals)

5619

In [28]:
[max(item_totals, key=lambda x: x[1])]

[('Omeprazole_Cap E/C 20mg', 113826)]

In [29]:
max_item = [max(item_totals, key=lambda x: x[1])]

**TIP:** If you are getting an error from the grader below, please make sure your answer conforms to the correct format of `[(bnf_name, item total)]`.

In [30]:
grader.score.pw__most_common_item(max_item)

Your score:  1.0


**Challenge:** Write a function that constructs groups as we did above. The function should accept a list of dictionaries (e.g. `scripts` or `practices`) and a tuple of fields to `groupby` (e.g. `('bnf_name',)` or `('bnf_name', 'post_code')`) and return a dictionary of groups. Note the trailing comma in `('bnf_name',)`: without it, `('bnf_name')` is just a string. The following questions will require you to aggregate data in groups, so this could be a useful function for the rest of the miniproject.

In [31]:
data = scripts.copy()
data[0]

{'bnf_code': '0101010G0AAABAB',
 'items': 2,
 'practice': 'N81013',
 'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
 'nic': 5.98,
 'act_cost': 5.56,
 'quantity': 1000}

In [32]:
data = scripts.copy()
fields = ('bnf_name', 'practice', 'nic')

In [33]:
print(type(data))
print(type(fields))

<class 'list'>
<class 'tuple'>


In [34]:
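names = set()
for d in data:
    # Collect this record's values for the grouping fields; a separate name
    # avoids overwriting the `groups` dictionary built earlier.
    key_parts = []
    for f in fields:
        key_parts.append(d[f])
    names.add(tuple(key_parts))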

In [35]:
names = {tuple(d[f] for f in fields) for d in data}

In [36]:
def group_by_field(data, fields):
    names = {tuple(d[f] for f in fields) for d in data}
    groups = {name: [] for name in names}
    for d in data:
        key = tuple(d[f] for f in fields)
        groups[key].append(d)
    return groups
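A brief usage sketch on made-up records. The function is repeated so the block runs on its own; the toy data and the names `by_name` and `by_name_practice` are hypothetical.

```python
def group_by_field(data, fields):
    # Copied from the notebook cell so this sketch is self-contained.
    names = {tuple(d[f] for f in fields) for d in data}
    groups = {name: [] for name in names}
    for d in data:
        key = tuple(d[f] for f in fields)
        groups[key].append(d)
    return groups


toy = [
    {"bnf_name": "Drug A", "practice": "P1", "items": 2},
    {"bnf_name": "Drug B", "practice": "P1", "items": 5},
    {"bnf_name": "Drug A", "practice": "P2", "items": 4},
]

# A one-field grouping needs a trailing comma: ("bnf_name",) is a tuple,
# whereas ("bnf_name") is just the string "bnf_name".
by_name = group_by_field(toy, ("bnf_name",))
by_name_practice = group_by_field(toy, ("bnf_name", "practice"))

print(sorted(by_name))        # the keys are 1-tuples
print(len(by_name_practice))  # one group per distinct (name, practice) pair
```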

In [37]:
my_dict = group_by_field(data, fields)

In [38]:
my_dict

{('Melatonin_Tab 2mg M/R',
  'L83100',
  886.46): [{'bnf_code': '0401010ADAAAAAA',
   'items': 25,
   'practice': 'L83100',
   'bnf_name': 'Melatonin_Tab 2mg M/R',
   'nic': 886.46,
   'act_cost': 821.76,
   'quantity': 1728}],
 ('Clopidogrel_Tab 75mg',
  'Y04585',
  0.24): [{'bnf_code': '0209000C0AAAAAA',
   'items': 1,
   'practice': 'Y04585',
   'bnf_name': 'Clopidogrel_Tab 75mg',
   'nic': 0.24,
   'act_cost': 0.33,
   'quantity': 5}],
 ('Hydralazine HCl_Tab 50mg',
  'E84069',
  24.9): [{'bnf_code': '0205010J0AAAHAH',
   'items': 1,
   'practice': 'E84069',
   'bnf_name': 'Hydralazine HCl_Tab 50mg',
   'nic': 24.9,
   'act_cost': 23.06,
   'quantity': 112}],
 ('Linagliptin_Tab 5mg',
  'A81017',
  1139.2): [{'bnf_code': '0601023AEAAAAAA',
   'items': 41,
   'practice': 'A81017',
   'bnf_name': 'Linagliptin_Tab 5mg',
   'nic': 1139.2,
   'act_cost': 1056.02,
   'quantity': 959}],
 ('CliniSupplies_ProSys Night Bag Lever Out',
  'A81007',
  11.96): [{'bnf_code': '22600956003',
   'item

In [39]:
my_dict[('Clobet But_Oint 0.05%', 'C87017', 3.72)]

[{'bnf_code': '1304000H0AABABA',
  'items': 1,
  'practice': 'C87017',
  'bnf_name': 'Clobet But_Oint 0.05%',
  'nic': 3.72,
  'act_cost': 3.46,
  'quantity': 60}]

In [40]:
# groups = group_by_field(scripts, ('bnf_name',))
# test_max_item = ...

# assert test_max_item == max_item

## Question 3: postal_totals

Our data set is broken up among different files. Splitting tabular data this way is typical, since it reduces redundancy. Each table typically contains data about a particular type of event, process, or physical object. Data on prescriptions and medical practices are in separate files in our case. If we want to find the total items prescribed in each postal code, we will have to _join_ our prescription data (`scripts`) to our clinic data (`practices`).

Find the total items prescribed in each postal code, representing the results as a list of tuples `(post code, total items prescribed)`. Sort your results ascending alphabetically by post code and take only results from the first 100 post codes. Only include post codes if there is at least one prescription from a practice in that post code.

**NOTE:** Some practices have multiple postal codes associated with them. Use the alphabetically first postal code.

We can join `scripts` and `practices` based on the fact that `'practice'` in `scripts` matches `'code'` in `practices`. However, we must first deal with the repeated values of `'code'` in `practices`. We want the alphabetically first postal codes.

In [41]:
practices[1]

{'code': 'A81002',
 'name': 'QUEENS PARK MEDICAL CENTRE',
 'addr_1': 'QUEENS PARK MEDICAL CTR',
 'addr_2': 'FARRER STREET',
 'borough': 'STOCKTON ON TEES',
 'village': 'CLEVELAND',
 'post_code': 'TS18 2AW'}

In [42]:
scripts[1]

{'bnf_code': '0101021B0AAAHAH',
 'items': 1,
 'practice': 'N81013',
 'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
 'nic': 1.95,
 'act_cost': 1.82,
 'quantity': 500}

We want a dictionary that maps each practice code to a single post code, for example:

```python
{'A81002': 'TS18 2AW'}
```

In [43]:
practice_postal = {}
for practice in practices:
    code = practice['code']
    if code in practice_postal:
        practice_postal[code] = min(practice_postal[code], practice['post_code'])
    else:
        practice_postal[code] = practice['post_code']

In [44]:
practice_postal['A81002']

'TS18 2AW'

In [45]:
practice_postal['K82019']

'HP21 8TR'

**Challenge:** This is an aggregation of the practice data grouped by practice codes. Write an alternative implementation of the above cell using the `group_by_field` function you defined previously.
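One possible approach to this challenge, sketched on made-up practice records: group by `'code'`, then reduce each group to its alphabetically first post code with `min`. The function is repeated so the block is self-contained, and `toy_practices` and `toy_practice_postal` are hypothetical names.

```python
def group_by_field(data, fields):
    # Copied from the notebook cell so this sketch is self-contained.
    names = {tuple(d[f] for f in fields) for d in data}
    groups = {name: [] for name in names}
    for d in data:
        key = tuple(d[f] for f in fields)
        groups[key].append(d)
    return groups


toy_practices = [
    {"code": "P1", "post_code": "TS18 2AW"},
    {"code": "P2", "post_code": "HP22 4LW"},
    {"code": "P2", "post_code": "HP21 8TR"},  # repeated code, different post code
]

grouped = group_by_field(toy_practices, ("code",))

# Aggregate each group down to its alphabetically first post code.
# The keys are 1-tuples, so `(code,)` unpacks the single element.
toy_practice_postal = {
    code: min(p["post_code"] for p in group)
    for (code,), group in grouped.items()
}
print(toy_practice_postal)
```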

In [46]:
assert practice_postal['K82019'] == 'HP21 8TR'

Now we can join `practice_postal` to `scripts`.

In [47]:
scripts[:3]

[{'bnf_code': '0101010G0AAABAB',
  'items': 2,
  'practice': 'N81013',
  'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
  'nic': 5.98,
  'act_cost': 5.56,
  'quantity': 1000},
 {'bnf_code': '0101021B0AAAHAH',
  'items': 1,
  'practice': 'N81013',
  'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
  'nic': 1.95,
  'act_cost': 1.82,
  'quantity': 500},
 {'bnf_code': '0101021B0AAALAL',
  'items': 12,
  'practice': 'N81013',
  'bnf_name': 'Sod Algin/Pot Bicarb_Susp S/F',
  'nic': 64.51,
  'act_cost': 59.95,
  'quantity': 6300}]

In [48]:
practice_postal['N81013']

'SK11 6JL'

In [49]:
# Build a new dictionary per record so the records in `scripts` stay unmodified.
joined = [dict(script, post_code=practice_postal[script['practice']])
          for script in scripts]

In [50]:
joined[2]

{'bnf_code': '0101021B0AAALAL',
 'items': 12,
 'practice': 'N81013',
 'bnf_name': 'Sod Algin/Pot Bicarb_Susp S/F',
 'nic': 64.51,
 'act_cost': 59.95,
 'quantity': 6300,
 'post_code': 'SK11 6JL'}
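When attaching a new key to each record during a join, it matters whether the records themselves are copied. The sketch below uses made-up data to show that slicing a list copies only the list, while `dict(rec, key=value)` builds a new dictionary per record.

```python
original = [{"practice": "P1", "items": 2}]
lookup = {"P1": "TS18 2AW"}  # hypothetical practice-to-post-code mapping

# A slice copies the list only; both lists still point at the same dict.
shallow = original[:]
same_object = shallow[0] is original[0]
print(same_object)

# Building a new dict per record leaves the originals untouched.
joined_toy = [dict(rec, post_code=lookup[rec["practice"]]) for rec in original]
print("post_code" in original[0])
print(joined_toy[0]["post_code"])
```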

Finally we'll group the prescription dictionaries in `joined` by `'post_code'` and sum up the items prescribed in each group, as we did in the previous question.

In [51]:
# Group 'joined' JSON by 'post_code'
joined_by_post = group_by_field(joined, ('post_code', ))
joined_by_post = list(joined_by_post.items())

postal_totals = []
for name, group in joined_by_post:
    sum_ = sum([d['items'] for d in group])
    postal_totals.append((name[0], sum_))

In [52]:
list(postal_totals)[:20]

[('CH65 6TG', 25090),
 ('GU9 9QS', 32131),
 ('BD3 8QH', 21010),
 ('ST8 6AG', 34516),
 ('L36 7XY', 22965),
 ('TW3 3LN', 40141),
 ('WS10 8SY', 12598),
 ('GL1 3PX', 38120),
 ('GL50 4DP', 74822),
 ('TW8 8DS', 26941),
 ('NE37 2PU', 57500),
 ('SR4 7XF', 49843),
 ('BL1 8TU', 26132),
 ('LE18 2EW', 37144),
 ('NE10 9QG', 39882),
 ('BL9 0NJ', 32062),
 ('NG7 5HY', 24903),
 ('LA1 1PN', 47335),
 ('SM3 8EP', 24965),
 ('CT11 8AD', 44358)]

In [53]:
postal_totals = sorted(postal_totals)
# print (list(postal_totals[:100]))

In [54]:
# joined_by_post[('SK11 6JL',)][:3]

In [55]:
# items_by_post = ...

In [56]:
# postal_totals = [('B11 4BW', 20673)] * 100

grader.score.pw__postal_totals(postal_totals[:100])

Your score:  1.0


In [57]:
len(postal_totals)

118

## Question 4: items_by_region

Now we'll combine the techniques we've developed to answer a more complex question. Find the most commonly dispensed item in each postal code, representing the results as a list of tuples (`post_code`, `bnf_name`, amount dispensed as proportion of total). Sort your results ascending alphabetically by post code and take only results from the first 100 post codes.

**NOTE:** We'll continue to use the `joined` variable we created before, where we've chosen the alphabetically first postal code for each practice. Additionally, some postal codes will have multiple `'bnf_name'` with the same number of items prescribed for the maximum. In this case, we'll take the alphabetically first `'bnf_name'`.

Now we need to calculate the total items of each `'bnf_name'` prescribed in each `'post_code'`. Use the techniques we developed in the previous questions to calculate these totals. You should have 141196 `('post_code', 'bnf_name')` groups.

In [81]:
# postal_totals

In [82]:
joined[0]

{'bnf_code': '0101010G0AAABAB',
 'items': 2,
 'practice': 'N81013',
 'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
 'nic': 5.98,
 'act_cost': 5.56,
 'quantity': 1000,
 'post_code': 'SK11 6JL'}

In [83]:
items_by_post_name = \
    group_by_field(joined, ('post_code', 'bnf_name')).items()

total_items_by_bnf_post = \
    {name: sum([d['items'] for d in group])
    for name, group in items_by_post_name}

In [84]:
list(items_by_post_name)[0]

(('WN2 5NG', 'Ramipril_Cap 1.25mg'),
 [{'bnf_code': '0205051R0AAAAAA',
   'items': 50,
   'practice': 'P92006',
   'bnf_name': 'Ramipril_Cap 1.25mg',
   'nic': 89.05,
   'act_cost': 83.06,
   'quantity': 1820,
   'post_code': 'WN2 5NG'},
  {'bnf_code': '0205051R0AAAAAA',
   'items': 10,
   'practice': 'P92031',
   'bnf_name': 'Ramipril_Cap 1.25mg',
   'nic': 15.07,
   'act_cost': 14.08,
   'quantity': 308,
   'post_code': 'WN2 5NG'},
  {'bnf_code': '0205051R0AAAAAA',
   'items': 9,
   'practice': 'Y02274',
   'bnf_name': 'Ramipril_Cap 1.25mg',
   'nic': 12.33,
   'act_cost': 11.52,
   'quantity': 252,
   'post_code': 'WN2 5NG'}])

In [85]:
total_items_by_bnf_post[('L7 6HD', 'Mepore Film + Pad 4cm x 5cm VP Adh Film')]

4

In [86]:
list(total_items_by_bnf_post.items())[:4]

[(('WN2 5NG', 'Ramipril_Cap 1.25mg'), 69),
 (('HG1 5AR', 'Flumetasone/Clioquinol_Ear Dps 0.02%/1%'), 15),
 (('DA1 2HA', 'Salbutamol_Inha B/A 100mcg (200 D) CFF'), 9),
 (('BH18 8EE', 'Lamotrigine_Tab Disper 100mg S/F'), 4)]

In [87]:
len(total_items_by_bnf_post)

141196

In [88]:
# total_items_by_bnf_post = ...
assert len(total_items_by_bnf_post) == 141196

Let's use `total_items_by_bnf_post` to find the maximum item total for each postal code. To do this, we will want to regroup the totals by `'post_code'` only, not by `('post_code', 'bnf_name')`. First let's turn `total_items_by_bnf_post` into a list of dictionaries called `total_items` (similar to `scripts` or `practices`) and then group it by `'post_code'`. You should have 118 groups in the resulting `total_items_by_post` after grouping `total_items` by `'post_code'`.

We want the dictionaries in `total_items` to look like this:
    
```python
{'post_code': post_code, 
 'bnf_name': bnf_name, 
 'total': total}
```

In [89]:
# my_tuple = ('OL1 1NL', 10)
# print(my_tuple)

In [90]:
# post_code = my_tuple[0]
# total = my_tuple[1]
# print(post_code)
# print(total)

In [91]:
# post_code, total = my_tuple
# print(post_code)
# print(total)

In [92]:
my_tuple = list(total_items_by_bnf_post.items())[0]
print(my_tuple)

(('WN2 5NG', 'Ramipril_Cap 1.25mg'), 69)


In [93]:
(post_code, bnf_name), total = my_tuple
print(post_code)
print(bnf_name)
print(total)

WN2 5NG
Ramipril_Cap 1.25mg
69


```python
(('post_code', 'bnf_name'), total)
```

In [94]:
total_items = [{'post_code': post_code,
               'bnf_name': bnf_name,
               'total': total}
              for (post_code, bnf_name), total
              in total_items_by_bnf_post.items()]

In [95]:
total_items[0]

{'post_code': 'WN2 5NG', 'bnf_name': 'Ramipril_Cap 1.25mg', 'total': 69}

In [96]:
total_items_by_post = group_by_field(total_items, ('post_code',))

In [97]:
len(total_items_by_post)

118

Each group in `total_items_by_post` is a list of dictionaries with the keys `'post_code'`, `'bnf_name'`, and `'total'`.

In [98]:
# total_items = ...
assert len(total_items_by_post) == 118

Now we will aggregate the groups in `total_items_by_post` to create `max_item_by_post`. Some `'bnf_name'` have the same item total within a given postal code. Therefore, if more than one `'bnf_name'` has the maximum item total in a given postal code, we'll take the alphabetically first `'bnf_name'`. We can do this by [sorting](https://docs.python.org/2.7/howto/sorting.html) each group by descending item total and then by ascending `'bnf_name'`.
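The tie-breaking rule can be sketched with a made-up group in which two names share the largest total. Negating the total sorts it in descending order while the name stays in ascending order. The `min` variant with the same key is offered as an assumption-free shortcut that avoids sorting the whole group.

```python
# Made-up group: "Drug C" and "Drug B" tie on the largest total.
toy_group = [
    {"bnf_name": "Drug C", "total": 9},
    {"bnf_name": "Drug B", "total": 9},
    {"bnf_name": "Drug A", "total": 3},
]

# Sort by descending total, then by ascending name, and take the first record.
ranked = sorted(toy_group, key=lambda d: (-d["total"], d["bnf_name"]))
top = ranked[0]

# The same key with min() picks the same record without a full sort.
top_via_min = min(toy_group, key=lambda d: (-d["total"], d["bnf_name"]))

print(top["bnf_name"], top["total"])
```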

In [99]:
total_items_by_post[('BB2 1AX',)]

[{'post_code': 'BB2 1AX', 'bnf_name': 'Cetraben Crm 150g', 'total': 8},
 {'post_code': 'BB2 1AX', 'bnf_name': 'Senna_Tab 7.5mg', 'total': 30},
 {'post_code': 'BB2 1AX', 'bnf_name': 'Quetiapine_Tab 150mg', 'total': 8},
 {'post_code': 'BB2 1AX',
  'bnf_name': 'Fybogel_Gran Sach 3.5g Orange G/F S/F',
  'total': 13},
 {'post_code': 'BB2 1AX', 'bnf_name': 'Lacosamide_Tab 50mg', 'total': 2},
 {'post_code': 'BB2 1AX', 'bnf_name': 'Felodipine_Tab 10mg M/R', 'total': 27},
 {'post_code': 'BB2 1AX',
  'bnf_name': 'Fluticasone Prop_Nsl Spy 50mcg (150 D)',
  'total': 24},
 {'post_code': 'BB2 1AX',
  'bnf_name': 'Paracet_Oral Susp 250mg/5ml S/F',
  'total': 23},
 {'post_code': 'BB2 1AX', 'bnf_name': 'Vit B Co_Tab', 'total': 2},
 {'post_code': 'BB2 1AX', 'bnf_name': 'Etoricoxib_Tab 90mg', 'total': 5},
 {'post_code': 'BB2 1AX', 'bnf_name': 'Mebeverine HCl_Tab 135mg', 'total': 24},
 {'post_code': 'BB2 1AX', 'bnf_name': 'Diazepam_Tab 5mg', 'total': 47},
 {'post_code': 'BB2 1AX', 'bnf_name': 'Quetiapine_

In [100]:
sorted(list(total_items_by_post.values())[0],
      key=lambda d: (-d['total'], d['bnf_name']))[0]

{'post_code': 'CH65 6TG',
 'bnf_name': 'Lansoprazole_Cap 30mg (E/C Gran)',
 'total': 688}

In [101]:
# my_list = list(total_items_by_post.values())[0][:10]
# my_list

In [102]:
# sorted(my_list, key=lambda d: (-d['total'], d['bnf_name']))

In [103]:
# my_list

In [104]:
max_item_by_post = [sorted(group, key=lambda d: (-d['total'], d['bnf_name']))[0]
                   for group in total_items_by_post.values()]

In [105]:
max_item_by_post[:2]

[{'post_code': 'CH65 6TG',
  'bnf_name': 'Lansoprazole_Cap 30mg (E/C Gran)',
  'total': 688},
 {'post_code': 'GU9 9QS', 'bnf_name': 'Omeprazole_Cap E/C 20mg', 'total': 919}]

In [106]:
postal_totals = dict(postal_totals)

In [107]:
len(postal_totals)

118

In [108]:
for item in max_item_by_post:
    post_code = item['post_code']
    total = item['total']
    item['proportion'] = total / postal_totals[post_code]

In [109]:
max_item_by_post[0]

{'post_code': 'CH65 6TG',
 'bnf_name': 'Lansoprazole_Cap 30mg (E/C Gran)',
 'total': 688,
 'proportion': 0.027421283379832604}

In [110]:
items_by_region = [(d['post_code'], d['bnf_name'], d['proportion'])
                  for d in max_item_by_post]

In [111]:
items_by_region[:5]

[('CH65 6TG', 'Lansoprazole_Cap 30mg (E/C Gran)', 0.027421283379832604),
 ('GU9 9QS', 'Omeprazole_Cap E/C 20mg', 0.028601661946406898),
 ('ST8 6AG', 'Omeprazole_Cap E/C 20mg', 0.03963379302352532),
 ('BD3 8QH', 'Atorvastatin_Tab 40mg', 0.03422179914326511),
 ('TW3 3LN', 'Omeprazole_Cap E/C 20mg', 0.02742831518895892)]

In [112]:
items_by_region = sorted(items_by_region)[:100]

In [113]:
# items_by_region

In order to express the item totals as a proportion of the total number of items prescribed across all `'bnf_name'` values in a postal code, we'll need to use the total items prescribed that we previously calculated as `items_by_post`. Calculate the proportions for the most common `'bnf_name'` in each postal code. Format your answer as a list of tuples: `[(post_code, bnf_name, proportion)]`
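The calculation can be sketched on toy data as follows. The names `postal_totals_demo`, `max_items_demo`, and `proportions_demo`, along with all the values, are hypothetical stand-ins for the real results.

```python
# Hypothetical total items per postal code (stand-in for the earlier result).
postal_totals_demo = {'X1 1XX': 200, 'Y2 2YY': 50}

# Hypothetical most-prescribed item in each postal code.
max_items_demo = [
    {'post_code': 'X1 1XX', 'bnf_name': 'Drug A', 'total': 30},
    {'post_code': 'Y2 2YY', 'bnf_name': 'Drug B', 'total': 10},
]

# Divide each item's total by the total for its postal code.
proportions_demo = [
    (d['post_code'], d['bnf_name'], d['total'] / postal_totals_demo[d['post_code']])
    for d in max_items_demo
]
print(proportions_demo)
# [('X1 1XX', 'Drug A', 0.15), ('Y2 2YY', 'Drug B', 0.2)]
```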

In [114]:
# items_by_region = [('B11 4BW', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.0341508247)] * 100

In [115]:
grader.score.pw__items_by_region(items_by_region)

Your score:  1.0


*Copyright &copy; 2019 The Data Incubator.  All rights reserved.*

**Personal Note**

**Python lambda (Anonymous Functions) | filter, map, reduce**

In Python, an anonymous function is a function defined without a name. The `def` keyword defines ordinary named functions, while the `lambda` keyword creates anonymous functions. A lambda has the following syntax:

**`lambda arguments: expression`**
* A lambda can take any number of arguments but only one expression, which is evaluated and returned.
* Lambda functions can be used wherever function objects are required.
* Lambdas are syntactically restricted to a single expression, so they cannot contain statements such as assignments or loops.
* They are commonly used as short, throwaway arguments to functions such as `sorted()`, `filter()`, and `map()`.


**Using Lambda:** A lambda definition does not include a `return` statement; it always contains a single expression whose value is returned. We can also put a lambda definition anywhere a function is expected, and we don't have to assign it to a variable at all. This is what makes lambda functions convenient for short, one-off operations.
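A small sketch comparing the two forms is shown below. The names `square`, `square_lambda`, and `words` are illustrative only.

```python
# A named function defined with def ...
def square(x):
    return x * x

# ... and the equivalent lambda, bound to a name here only for comparison.
square_lambda = lambda x: x * x

print(square(4), square_lambda(4))  # 16 16

# A lambda passed directly where a function is expected, without naming it.
words = ['pear', 'fig', 'banana']
by_length = sorted(words, key=lambda w: len(w))
print(by_length)  # ['fig', 'pear', 'banana']
```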

**Use of lambda() with filter():**

The filter() function in Python takes a function and an iterable (here, a list) as arguments. It offers an elegant way to keep only the elements of the sequence for which the function returns True. In Python 3, filter() returns a lazy iterator, which is why the example wraps the call in list().

In [120]:
# Python code to illustrate 
# filter() with lambda() 
li = [5, 7, 22, 97, 54, 62, 77, 23, 73, 61] 
final_list = list(filter(lambda x: (x%2 != 0) , li)) 
print(final_list) 

[5, 7, 97, 77, 23, 73, 61]


**Use of lambda() with map()**

The map() function in Python takes a function and an iterable (here, a list) as arguments. It calls the function on each item and yields the results. Wrapping the call in list() gives a new list containing the transformed items.

In [121]:
# Python code to illustrate 
# map() with lambda() 
# to get double of a list. 
li = [5, 7, 22, 97, 54, 62, 77, 23, 73, 61] 
final_list = list(map(lambda x: x*2 , li)) 
print(final_list) 

[10, 14, 44, 194, 108, 124, 154, 46, 146, 122]
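For comparison, the filter() and map() examples above can also be written as list comprehensions, which many Python programmers find easier to read. This is a sketch of the equivalence; the names `odds` and `doubled` are illustrative.

```python
li = [5, 7, 22, 97, 54, 62, 77, 23, 73, 61]

# Same result as filter(lambda x: x % 2 != 0, li)
odds = [x for x in li if x % 2 != 0]

# Same result as map(lambda x: x * 2, li)
doubled = [x * 2 for x in li]

print(odds)     # [5, 7, 97, 77, 23, 73, 61]
print(doubled)  # [10, 14, 44, 194, 108, 124, 154, 46, 146, 122]
```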


**Use of lambda() with reduce()**

The reduce() function, imported from the `functools` module in Python 3, takes a function of two arguments and an iterable (here, a list). It applies the function cumulatively to the items from left to right, so the list is reduced to a single value.

In [122]:
# Python code to illustrate 
# reduce() with lambda() 
# to get sum of a list 
from functools import reduce
li = [5, 8, 10, 20, 50, 100] 
total = reduce((lambda x, y: x + y), li)  # named total to avoid shadowing the built-in sum()
print(total)

193


In [123]:
# Python code to illustrate 
# reduce() with lambda() 
# to get product of a list
from functools import reduce
li = [5, 8, 10, 20, 50, 100]
product = reduce((lambda x, y: x * y), li)
print(product)

40000000
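reduce() also accepts an optional third argument, an initial value, which makes the call safe on an empty list. The sketch below uses it together with the standard `operator` module in place of the lambdas. The variable names are illustrative.

```python
from functools import reduce
import operator

li = [5, 8, 10, 20, 50, 100]

# operator.add and operator.mul behave like the two lambdas used above.
total = reduce(operator.add, li, 0)
product = reduce(operator.mul, li, 1)
print(total, product)  # 193 40000000

# With an initial value, reducing an empty list returns that value
# instead of raising a TypeError.
empty_total = reduce(operator.add, [], 0)
print(empty_total)  # 0
```

For a plain sum, the built-in `sum(li)` gives the same result and is usually the simpler choice.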

