# Finally: Practical things to do with Numpy and Pandas


### In this notebook:

- loading files: local (from your machine) and remote (accessed with online url)
- cleaning up the missing values and manipulating data
- visualisations with Plotly or matplotlib
- examples of the whole journey of prepping simple data to Visualisation
- mini-diary ⭐️⭐️⭐️❓


In [None]:
import pandas as pd
import numpy as np

# Importing data from files

There are two main file formats you will encounter.

**CSV (comma separated values)** - those basically look like a simplified Excel sheet. The first line of the file contains the names of the columns, and each following line is a set of values for those columns. Values are separated by commas.

`
name, age, department
Pim, 34, ER
Jannet, 54, Oncology
Aoife, 25, Oncology
`

Notice you cannot skip any values, since the only indicator of what a value means is its position in the line. So for example, if you do not know one person's age you can't just write

Natasha, ER

because it would treat ER as their age (second value means second column). Instead, you will sometimes see missing data marked with an empty value (nothing between the commas), like this:

Natasha,, ER
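
To see what pandas makes of that empty slot, here is a minimal sketch. The rows are made up in the style of the example above and typed inline, read through `io.StringIO` so it runs without any file. `skipinitialspace=True` is only there because the example has a space after each comma.

```python
import io
import pandas as pd

# made-up rows in the style of the example above; Natasha's age is left empty
csv_text = """name, age, department
Pim, 34, ER
Natasha,, ER
"""

# skipinitialspace drops the blank that follows each comma
staff = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(staff)

# the empty slot is read as NaN, pandas' marker for a missing value
print(staff["age"].isna().sum())
```

The later section on missing values shows what you can do with such NaN entries.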

**JSON (key-value pairs, like a Python dictionary)** - those will be very familiar to you. Collections are indicated by [ ] and key-value pairs (separated by a : colon) are enclosed in { }

`
[
    {"name":"Pim", "age":34, "department":"Oncology"},
    {"name":"Jannet", "age":54, "department":"Oncology"},
    {"name":"Aoife", "age":25, "department":"Oncology"}
]
`
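
The likeness to Python dictionaries can be checked directly. This is a small sketch, assuming a JSON string typed inline (two made-up records in the style of the example above): the standard `json` module turns it into a plain Python list of dictionaries, which pandas can then turn into a table.

```python
import json
import pandas as pd

# a JSON document as text; note the double quotes, which JSON requires
json_text = '[{"name": "Pim", "age": 34}, {"name": "Jannet", "age": 54}]'

# json.loads parses text into ordinary Python objects
people = json.loads(json_text)
print(type(people), type(people[0]))

# a list of dictionaries maps naturally onto rows and columns
people_df = pd.DataFrame(people)
print(people_df)
```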

In [None]:
# Let's import some data from a CSV file. pandas simplifies and streamlines importing data

# we do not specify index_col here, so the index will be the default 0,1,2,3,...
nursing_home_residents = pd.read_csv("./data/nursing_home_residents.csv")
nursing_home_residents

In [None]:
# we could also create an index from a column, but in this dataset none of the columns uniquely identifies a row
# instead what is unique is the combination of date, statistic and CA (council area)

# so e.g. IF the date was unique for each row, we could use index_col to specify the index
# nursing_home_residents = pd.read_csv("./data/nursing_home_residents.csv", index_col='Date')
# nursing_home_residents
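
Here is a hedged sketch of what `index_col` does when a column really is unique. The `staff_id` column and its values are invented for illustration, and the CSV is typed inline so no file is needed.

```python
import io
import pandas as pd

# made-up data with a column (staff_id) that uniquely identifies each row
csv_text = """staff_id,name,department
s01,Pim,ER
s02,Jannet,Oncology
"""

# index_col tells read_csv to use that column as the row labels
staff = pd.read_csv(io.StringIO(csv_text), index_col="staff_id")
print(staff)

# rows can now be looked up by their label instead of by position
print(staff.loc["s02"])
```

A quick way to check whether a column is a sensible index is `staff.index.is_unique`.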

### Online files use EXACTLY the same code

This is just so amazing: if the 'filename' you pass to `read_csv` starts with http..., the file is simply loaded from the internet.

In [None]:
nursing_home_residents = pd.read_csv("https://www.opendata.nhs.scot/dataset/75cca0a9-780d-40e0-9e1f-5f4796950794/resource/139f61d8-a87d-419d-b7af-31f555a60c89/download/file3_mean_median_age_years.csv")
nursing_home_residents
# 

### JSON files: similar, but might need some specifying if the JSON is very nested (things within things within things)

For example, if the file `simple.json` just contains a list of things, like this:

`
[
    {"name":"Pim", "age":34, "department":"Oncology"},
    {"name":"Jannet", "age":54, "department":"Oncology"},
    {"name":"Aoife", "age":25, "department":"Oncology"}
]
`

it can be loaded with `pd.read_json(filename)`


In [None]:
simple_staff = pd.read_json("./data/simple.json")
simple_staff
# 

But if the file is 'nested', as in it contains things within things, we need to tell pandas how we want it read.
Notice that in the file `nested.json` the staff members are no longer at the top level. Also notice that there is a list of information ('shifts') inside each staff member. It is no longer obvious how to turn it into the 'Excel spreadsheet' format that is so easy with CSV.

`{
  "hospital":"Western General",
  "staff":[
    {"name":"Pim", "age":34, "department":"Oncology", "shifts": ["day"]},
    {"name":"Jannet", "age":54, "department":"Oncology", "shifts": ["day", "night"]},
    {"name":"Aoife", "age":25, "department":"Oncology",  "shifts": ["night"]}
	]
}
`

If we do not specify what we mean, we'll get something like this: (not very useful)

In [None]:
nested_staff = pd.read_json("./data/nested.json")
nested_staff
# 

So we need to specify things, e.g. that the staff members are nested under the key `staff`.

There are two ways to do this:

1. Load the file, and then create another dataframe by 'interpreting' just one of the columns (the staff column in above example)

This is a simple way if your data is simple:

In [None]:
nested_all_data = pd.read_json("./data/nested.json")
nested_staff = pd.DataFrame.from_records(nested_all_data['staff'])
nested_staff

2. Load the contents of the file first, and then create dataframe specifying where things come from.

A) Load the content of a file into a python Dictionary

B) Turn that dictionary into DataFrame specifying settings

This is needed if your data is more complex and also allows you to add some meta information.

In [None]:
# local file:
import json

file_location = "./data/nested.json"
with open(file_location) as local_data:    
    file_data_staff = json.load(local_data) 
    
file_data_staff # see it  
# 

In [None]:
# online file
import urllib.request
import json

file_url = "https://www.opendata.nhs.scot/api/3/action/datastore_search?resource_id=139f61d8-a87d-419d-b7af-31f555a60c89"
with urllib.request.urlopen(file_url) as online_data:
    file_data_nursing_homes = json.load(online_data)
    
file_data_nursing_homes # see it  
# 

###  Coming back to the simpler json file:

Let's specify what is where in the json:

In [None]:
# first let's see the data
# list of staff is under a 'staff' key
file_data_staff

In [None]:
# now that we have the data as a Python dictionary, we can turn it into a DataFrame
# first argument:
# we specify where in the dictionary the thing we're after is.

nested_staff_df = pd.json_normalize(file_data_staff, ['staff'])
nested_staff_df
# if you do not have to drill to multiple levels you could also just say 'staff' without [ ]

In [None]:
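# second argument:
# what meta-data we also want to keep for each row. notice these can drill into many levels
# here we pass a list of the items we want to keep:
# hospital  - no drilling, just top level
# ['location', 'city']   - drill into location first, then into city
# (this assumes the file also has a top-level 'location' object with a 'city' key,
#  which the short excerpt of nested.json shown earlier does not include)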
nested_staff_df = pd.json_normalize(file_data_staff, ['staff'],
                                 ['hospital', ['location', 'city']])
nested_staff_df
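
If you want to try the two arguments without the file, here is a self-contained sketch. The dictionary below is an assumption: it mimics the structure of `nested.json` and adds a made-up `location` entry so that the nested meta path has something to find.

```python
import pandas as pd

# hypothetical stand-in for the contents of nested.json, with an invented 'location'
hospital_data = {
    "hospital": "Western General",
    "location": {"city": "Edinburgh"},
    "staff": [
        {"name": "Pim", "age": 34, "department": "Oncology"},
        {"name": "Jannet", "age": 54, "department": "Oncology"},
    ],
}

# record_path: where the list of rows lives
# meta: extra values copied onto every row; a nested path is given as a list
staff_df = pd.json_normalize(
    hospital_data,
    record_path=["staff"],
    meta=["hospital", ["location", "city"]],
)
print(staff_df)
print(list(staff_df.columns))
```

Nested meta paths are joined with a dot in the column name, so the city ends up in a column called `location.city`.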

### Extra quest for the brave ones: nested items inside of json and 'dummy' binary columns

In [None]:
# finally, to separate items individual to each row (e.g. 'age') from those for the whole set
# (e.g. hospital), we can specify a prefix for items from the list
nested_staff_df = pd.json_normalize(file_data_staff, ['staff'],
                                 ['hospital', ['location', 'city']], 
                    record_prefix='staff.')
nested_staff_df
# prefix can be anything, eg 'staff_', 'person-', or even nothing

The final bit of the puzzle would be untangling the staff.shifts column.

But this will have to be an adventure for another time. If you're curious to explore it by yourself, look for something along the lines of:

In [None]:
# turn data into binary dummies: 1 if present, 0 if not present
shifts_df = pd.get_dummies(nested_staff_df['staff.shifts'].explode()).groupby(level=0).sum()
shifts_df = shifts_df.add_prefix('staff.shift.')

In [None]:
pd.concat([nested_staff_df, shifts_df], axis = 1)
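
For the curious, the one-liner above can be taken apart step by step. This is a sketch on a tiny made-up table (names and shifts are illustrative), so each intermediate result is small enough to print and inspect.

```python
import pandas as pd

# made-up table where one column holds a list in every row
staff = pd.DataFrame({
    "name": ["Pim", "Jannet", "Aoife"],
    "shifts": [["day"], ["day", "night"], ["night"]],
})

# step 1: explode gives one row per list item; the original row label is repeated
exploded = staff["shifts"].explode()
print(exploded)

# step 2: one 0/1 column per distinct value
dummies = pd.get_dummies(exploded, dtype=int)

# step 3: rows sharing a label belong to one person, so add them back together
shift_flags = dummies.groupby(level=0).sum()
print(shift_flags)
```

`groupby(level=0)` groups by the row label, which is what glues the exploded rows back into one row per person.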

## Real Data example

Back to the complicated online file. Let's load it again:

In [None]:
# online file
import urllib.request
import json

file_url = "https://www.opendata.nhs.scot/api/3/action/datastore_search?resource_id=139f61d8-a87d-419d-b7af-31f555a60c89&limit=10000"
with urllib.request.urlopen(file_url) as online_data:
    file_data_nursing_homes = json.load(online_data)
    
file_data_nursing_homes # see it  
# 


In [None]:
# where is just the list of the data? This is still a dictionary
file_data_nursing_homes['result']['records']

In [None]:
# now load json into a dataframe
nursing_homes_df = pd.json_normalize(file_data_nursing_homes, ['result','records'])
nursing_homes_df

# notice it's ['result','records'] instead of ['result']['records']
# because at this point we are already in the pandas naming world. pandas uses [first_level, second_level, ...]

## Folding the data: 

In [None]:
### Pivot tables - these are quite powerful.
# take data and 'fold it' into a more excel like format. 
# You can apply transformations to things that got folded. eg average, or sum

fruit_df = pd.DataFrame({'fruit': ['apple', 'banana','banana', 'banana', 'kiwi', 'kiwi'],
                 'weight': [32,45,62,43,12,14]})


avg_fruit_weights = fruit_df.pivot_table(values='weight',
                 index='fruit',
                 aggfunc=np.mean)
avg_fruit_weights

In [None]:
# here's another aggregate function: sum
total_fruit_weights = fruit_df.pivot_table(values='weight',
                 index='fruit',
                 aggfunc=np.sum)
total_fruit_weights
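
If pivot tables feel like magic, it may help to know that with a single index column they give the same numbers as a plain `groupby`. A small sketch with the same fruit data, using the string name `"mean"` for the aggregate:

```python
import pandas as pd

fruit_df = pd.DataFrame({
    "fruit": ["apple", "banana", "banana", "banana", "kiwi", "kiwi"],
    "weight": [32, 45, 62, 43, 12, 14],
})

# 'fold' the rows by fruit and average the weights
by_pivot = fruit_df.pivot_table(values="weight", index="fruit", aggfunc="mean")

# the same idea spelled out as group, pick a column, aggregate
by_groupby = fruit_df.groupby("fruit")["weight"].mean()

print(by_pivot)
print(by_groupby)
```

The pivot table returns a DataFrame and the groupby returns a Series, but the values match.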

### But why and how we would fold data? Long format vs Wide format

In the real data above you can see that some column values are repeated many times. That's because this data has a **long format**.

Wide is what you know from excel - columns and rows are meaningful and contain data

`
student, math, literature
Pim,  B, A
Aoife, A, C
Marta, A, B
`

Long is what you can imagine as a tall, thin table with metadata. All the information is the same, but it is presented in a way where

- the first several columns tell you everything ABOUT the data
- the final column tells you what the actual value of the data is
 
`
student, subject, grade
Pim, math, B
Pim, literature, A
Aoife, math, A
Aoife, literature, C
Marta, math, A
Marta, literature, B
`
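
The two grade tables above can be turned into each other with pandas. A minimal sketch, using exactly the grades from the example: `melt` goes from wide to long, and `pivot` goes back again.

```python
import pandas as pd

# the wide table from above
wide = pd.DataFrame({
    "student": ["Pim", "Aoife", "Marta"],
    "math": ["B", "A", "A"],
    "literature": ["A", "C", "B"],
})

# wide -> long: keep 'student' as is, stack the other columns into subject/grade pairs
long = wide.melt(id_vars="student", var_name="subject", value_name="grade")
print(long)

# long -> wide: one row per student, one column per subject
wide_again = long.pivot(index="student", columns="subject", values="grade")
print(wide_again)
```

`pivot` works here because each student and subject pair appears only once. When several rows would land in the same cell, `pivot_table` with an aggregate function is needed instead.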

So let's look at the nursing home data. It is currently long, with the 'meta' columns being: `Date, KeyStatistic, CA, MainClientGroup`

In [None]:
nursing_homes_df.head()

In [None]:
# this is quite an advanced thing to try, but you can 'fold' the data using some columns
# eg. if we want a breakdown for each date
# where columns are KeyStatistic
# and each cell shows the Value
# and in case there are many items which would go into each cell, get the numpy.average

nursing_homes_df_wide=pd.pivot_table(nursing_homes_df, 
                               index=['Date'], 
                               columns = 'KeyStatistic',
                               aggfunc = np.average,
                               values = 'Value')

nursing_homes_df_wide.head(20)

In [None]:
# but if you also wanted to see how this breakdown works across council area
nursing_homes_df_wide=pd.pivot_table(nursing_homes_df, 
                               index=['CA','Date'], 
                               columns = 'KeyStatistic',
                               aggfunc = np.average,
                               values = 'Value')

nursing_homes_df_wide.head(20)

In [None]:
# and if needed you can then turn it back into long
columns_to_melt=list(nursing_homes_df_wide.columns)
nursing_homes_df_long_again = pd.melt(nursing_homes_df_wide, 
                                      value_vars=columns_to_melt,
                                      value_name='Age',  # here you can call your value column, eg 'Value' 
                                      ignore_index=False)
nursing_homes_df_long_again

In [None]:
# We might look more at folding things between Long and Wide later

# Finding and Fixing Missing Values

In [None]:
# Cleaning up missing data (with NaN values)

foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', None, 'Vegan', 'Meat', None, 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', None],
    'price':[4.30, 2.10, 0.7, 5.70, None, 0]
}).set_index('name', drop=True)

print("\nShape:",foods.shape)

print("\nMissing values:\n",foods.isnull().sum())

print("\nAll values:\n",foods)

In [None]:
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', None, 'Vegan', 'Meat', None, 'Vegan'],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', None],
    'price':[4.30, 2.10, 0.7, 5.70, None, 0]
}).set_index('name', drop=True)

# remove all rows with any missing values 
foods.dropna(inplace=True)

print("\nShape:",foods.shape)

print("\nMissing values:\n",foods.isnull().sum())

print("\nAll values:\n",foods)
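
Dropping every row with any gap is quite drastic: here it throws away half of the foods. A gentler option, sketched below on the same data, is to drop rows only when a column you actually need is missing, using the `subset` argument of `dropna`.

```python
import pandas as pd

foods = pd.DataFrame({
    "name": ["Bagel", "Milk Chocolate", "Carrot", "Ham Sandwich", "Egg Cake", "Tap Water"],
    "diet": ["Vegan", None, "Vegan", "Meat", None, "Vegan"],
    "supplier": ["Bros", "Luca", "Whitmore", "Union", "Lovecrumbs", None],
    "price": [4.30, 2.10, 0.7, 5.70, None, 0],
}).set_index("name", drop=True)

# drop rows with a gap anywhere
no_gaps_at_all = foods.dropna()

# drop only rows where the price is missing; gaps in other columns are kept
priced_foods = foods.dropna(subset=["price"])

print(no_gaps_at_all.shape, priced_foods.shape)
print(priced_foods)
```

Both calls return a new DataFrame and leave `foods` itself unchanged, which makes it easy to compare the results.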

In [None]:
foods = pd.DataFrame({
    'name': ['Bagel', 'Milk Chocolate', 'Carrot', 'Ham Sandwich', 'Egg Cake', 'Tap Water'],
    'diet':['Vegan', None, 'Vegan', 'Meat', None, 'Vegan'],
    'added_sugar':[True, True, False, False,  True, False],
    'supplier':['Bros', 'Luca', 'Whitmore', 'Union', 'Lovecrumbs', None],
    'price':[4.30, 2.10, 0.7, 5.70, None, 0]
}).set_index('name', drop=True)

# Cleanup the Diet column:

print("missing values in Diet:", foods.diet.isna().sum())
print("present values in Diet:\n", foods.diet.value_counts(sort=True))

In [None]:
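# assign the result back to the column; fillna(inplace=True) on a single column
# may not update the DataFrame itself in recent versions of pandas
foods['diet'] = foods['diet'].fillna('Unknown')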
print()

print("missing values in Diet:", foods.diet.isna().sum())
print("present values in Diet:\n", foods.diet.value_counts(sort=True))

In [None]:
# notice missing string values show up as None, while missing numeric values show up as NaN.
# keep cleaning up columns until there are no missing values left in any column
print(foods.isnull().sum() )

In [None]:
foods # still some NaN in price and supplier

In [None]:
foods['price'] = foods['price'].fillna(0) # cleanup price
foods['supplier'] = foods['supplier'].fillna("Unknown") # cleanup supplier
print(foods.isnull().sum() )

In [None]:
foods # no more NaNs - they were all replaced by meaningful values like 'Unknown' or 0
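
The column-by-column cleanup can also be written in one go. As a sketch on the same foods data: `fillna` accepts a dictionary that maps each column name to the replacement value for that column.

```python
import pandas as pd

foods = pd.DataFrame({
    "name": ["Bagel", "Milk Chocolate", "Carrot", "Ham Sandwich", "Egg Cake", "Tap Water"],
    "diet": ["Vegan", None, "Vegan", "Meat", None, "Vegan"],
    "supplier": ["Bros", "Luca", "Whitmore", "Union", "Lovecrumbs", None],
    "price": [4.30, 2.10, 0.7, 5.70, None, 0],
}).set_index("name", drop=True)

# one replacement value per column
foods_filled = foods.fillna({"diet": "Unknown", "supplier": "Unknown", "price": 0})

print(foods_filled.isnull().sum())
print(foods_filled)
```

Whether 0 is a sensible stand-in for a missing price depends on what you do next: it will pull averages down, so for statistics it can be safer to leave the gap in place.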

In [None]:
# some simple statistics on distribution of values
foods.diet.value_counts()

In [None]:
# or as fraction of all items
foods.diet.value_counts()/foods.shape[0]

In [None]:
# or as percent of all items
np.round( foods.diet.value_counts()/foods.shape[0]*100)
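
Dividing by the number of rows works fine, but `value_counts` can also do it for you. A small sketch, assuming a stand-alone Series with the same diet values as the cleaned-up foods table:

```python
import pandas as pd

# the diet column after cleanup, typed out so the example stands alone
diet = pd.Series(["Vegan", "Unknown", "Vegan", "Meat", "Unknown", "Vegan"])

# normalize=True returns fractions of the total instead of raw counts
shares = diet.value_counts(normalize=True)
print(shares)

# and as rounded percentages
percent = (shares * 100).round()
print(percent)
```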

In [None]:
# or as a breakdown to all possible values, between two categories (wide)
pd.crosstab(foods.added_sugar, foods.diet, margins=True)

# Visualisation

There are many visualisation libraries, but in this course we will use Plotly, and sometimes matplotlib. Some might be familiar to you from R 

In [None]:
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default='notebook'
import pprint as pp

In [None]:
foods

In [None]:
# 
fig = go.Figure(
    data=[go.Bar(y=foods.price, 
                 x=foods.index)],
    layout=go.Layout(
        title=go.layout.Title(text="Prices of foods")
    )
)
fig.show('notebook')

In [None]:
# 
diet_colours = {'Vegan':'green',
                'Meat':'red',
                'Vegetarian':'orange',
                'Unknown':'grey',
               }


foods['color'] = foods['diet'].map( lambda diet: diet_colours[diet] )

fig = go.Figure(
    data=[go.Bar(y=foods.price, 
                 x=foods.index, 
                 marker_color=foods.color)],
    layout=go.Layout(
        title=go.layout.Title(text="Prices of foods"),
        yaxis_title="Price in Pounds"
    )
)
fig.show('notebook')


### In matplotlib it looks very similar:

matplotlib is a very popular Python plotting library, but it needs more care to look great.

In [None]:
import matplotlib.pyplot as plt
    
plt.bar(foods.index, 
        foods.price,
        color=foods.color)
plt.ylabel('Price in Pounds')
plt.title('Prices of foods')
plt.show()

## Now with some real data

In [None]:
nursing_homes_df.head()

In [None]:
# Pick only the Mean Age statistic (and keep the result, otherwise the filter has no effect)
mean_age_df = nursing_homes_df.loc[nursing_homes_df['KeyStatistic']=='Mean Age']

# group values by Date to get overall averages
averages = mean_age_df['Value'].groupby(mean_age_df['Date']).mean()
print(type(averages)) # what type is averages? it's a Series. So basically one column of a dataframe
averages

In [None]:
# let's turn it into a dataframe, for ease of use
averages_df = pd.DataFrame(averages)
averages_df

In [None]:
fig = go.Figure(
    data=[go.Scatter(y=averages_df.Value, 
                 x=averages_df.index)]
)
fig.show('notebook')

In [None]:
# Do you see what is going wrong with the above graph (on the x axis)?
# That's because the x axis values are numbers like 20201029 instead of actual datetime objects
# let's fix that!

In [None]:
averages_df = pd.DataFrame(averages)
averages_df.index = pd.to_datetime(averages_df.index, format='%Y%m%d')
averages_df
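
To see the conversion on its own, here is a sketch with three made-up date numbers in the same `YYYYMMDD` style. Converting to text first (`astype(str)`) is an extra step added here so that the format string is matched against characters rather than against a number.

```python
import pandas as pd

# made-up dates stored as plain numbers, in the style of the dataset
raw_dates = pd.Series([20201029, 20201105, 20201112])

# %Y = four digit year, %m = two digit month, %d = two digit day
real_dates = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d")
print(real_dates)

# real dates know about calendars, e.g. the gap between two of them
print(real_dates[1] - real_dates[0])
```

As numbers, 20201029 and 20201105 are 76 apart, which is why the plot spacing looked odd; as dates they are exactly one week apart.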

In [None]:
fig = go.Figure(
    data=[go.Scatter(y=averages_df.Value, 
                 x=averages_df.index)]
)
fig.show('notebook')
# notice that the x axis suddenly makes sense!

# And to sum it up: let's try to use more than one part of the dataset:

In [None]:
# create a dataframe
stats_df = pd.DataFrame(nursing_homes_df['Value'].groupby(nursing_homes_df['Date']).mean())
# then add to it a few more columns, for example 1 standard deviation up and down from the mean.
stats_df['Median'] = nursing_homes_df['Value'].groupby(nursing_homes_df['Date']).median()
stats_df['Stdev'] = nursing_homes_df['Value'].groupby(nursing_homes_df['Date']).std()
stats_df['Stdev_top'] = stats_df['Value']  + stats_df['Stdev'] 
stats_df['Stdev_bottom'] = stats_df['Value']  - stats_df['Stdev'] 

stats_df['Max'] = nursing_homes_df['Value'].groupby(nursing_homes_df['Date']).max()
stats_df['Min'] =nursing_homes_df['Value'].groupby(nursing_homes_df['Date']).min()

# and let's make dates dates!
stats_df.index = pd.to_datetime(stats_df.index, format='%Y%m%d')
stats_df
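
Repeating the `groupby` for every statistic works, but several statistics can also be computed in one pass with `agg`. A sketch on a tiny made-up table (the `Date` and `Value` column names follow the dataset, the numbers are invented):

```python
import pandas as pd

# invented values: two measurements for each of two dates
toy = pd.DataFrame({
    "Date": [20200101, 20200101, 20200108, 20200108],
    "Value": [80.0, 84.0, 79.0, 83.0],
})

# one groupby, several statistics; each name in the list becomes a column
toy_stats = toy.groupby("Date")["Value"].agg(["mean", "median", "std", "min", "max"])

# derived columns work the same way as before
toy_stats["std_top"] = toy_stats["mean"] + toy_stats["std"]
toy_stats["std_bottom"] = toy_stats["mean"] - toy_stats["std"]
print(toy_stats)
```

Note that the mean column is called `mean` here rather than `Value`, so the plotting code would need that name instead.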

In [None]:
# results = nursing_home_ages_mean_overall_ca.merge(ca_lookup_simple, left_on = 'CA', right_on = 'CA')


In [None]:
fig = go.Figure(
    data=[go.Scatter(y=stats_df.Value, x=stats_df.index, name="Mean"),
         go.Scatter(y=stats_df.Stdev_bottom, x=stats_df.index, name="1 standard dev down"),
         go.Scatter(y=stats_df.Median, x=stats_df.index, name="Median"),
         go.Scatter(y=stats_df.Stdev_top, x=stats_df.index, name="1 standard dev up")]
)
fig.show('notebook')

In [None]:
# We'll continue this work during the lab.

## ⭐️⭐️⭐️💥 What you learned in this session: Three stars and a wish 
**In your own words** write in your Learn diary:

- 3 things you would like to remember from this badge
- 1 thing you wish to understand better in the future or a question you'd like to ask
