# Task 2

In [1]:
# imports

import pandas as pd
import datetime
import json


## Subtask A - Single Series creation

In [2]:
# we create a dictionary containing information on OUH
oslo_university_hospital_data = {
    "name": "Oslo University Hospital",
    "major_location": "Oslo, Oslo county",
    "campuses": ["Rikshospitalet", "Ullevål", "Aker", "Radiumhospitalet"],
    "founded_on": datetime.date(2009, 1, 1)
}

# we create a series from the hospital data
ouh_series = pd.Series(oslo_university_hospital_data)

# display the series to confirm correct structure
ouh_series

name                                       Oslo University Hospital
major_location                                    Oslo, Oslo county
campuses          [Rikshospitalet, Ullevål, Aker, Radiumhospitalet]
founded_on                                               2009-01-01
dtype: object
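
As a small illustration of what the Series constructor did with our dictionary, the sketch below rebuilds a shortened version of the record and reads values back by label. The variable name `hospital` is ours and only used for this illustration; the values are copied from the dictionary above.

```python
import datetime

import pandas as pd

# shortened copy of the hospital record (illustration only)
hospital = pd.Series({
    "name": "Oslo University Hospital",
    "campuses": ["Rikshospitalet", "Ullevål", "Aker", "Radiumhospitalet"],
    "founded_on": datetime.date(2009, 1, 1),
})

# the dictionary keys become the index labels, in insertion order
print(list(hospital.index))

# values are looked up by label, just like in the dictionary
print(hospital["name"])
print(len(hospital["campuses"]))

# mixed Python objects (str, list, date) force the generic "object" dtype
print(hospital.dtype)
```

Because the values have different types, pandas cannot pick a specialised dtype, which is why the output above reports `dtype: object`.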

## Subtask B - Multiple Series creation (reading from file)

In [3]:
# load the data and display
# data is retrieved from:
# https://en.wikipedia.org/wiki/List_of_hospitals_in_Norway and https://no.wikipedia.org/wiki/Liste_over_norske_sykehus
# and stored in a json file.

with open("data/hospital_data.json") as hospital_data_file:
    data = json.load(hospital_data_file)

data

[{'name': 'Levanger Hospital',
  'major_location': 'Levanger, TrÃ¸ndelag county',
  'campuses': ['Levanger'],
  'founded_on': '1844'},
 {'name': 'Namsos Hospital',
  'major_location': 'Namsos, TrÃ¸ndelag county',
  'campuses': ['Namsos'],
  'founded_on': '1848'},
 {'name': "St. Olav's University Hospital",
  'major_location': 'Trondheim, TrÃ¸ndelag county',
  'campuses': ['Ã˜ya'],
  'founded_on': '1902-06-18'},
 {'name': 'University Hospital of North Norway',
  'major_location': 'TromsÃ¸, Troms og Finnmark county',
  'campuses': ['TromsÃ¸ya', 'Harstad', 'Longyearbyen', 'Narvik'],
  'founded_on': '2001-12-18'},
 {'name': 'Akershus University Hospital',
  'major_location': 'LÃ¸renskog, Viken county',
  'campuses': ['Nordbyhagen'],
  'founded_on': '1961-05-15'},
 {'name': 'Elverum Hospital',
  'major_location': 'Elverum, Innlandet county',
  'campuses': ['Elverum'],
  'founded_on': '1878'},
 {'name': 'Lillehammer Hospital',
  'major_location': 'Lillehammer, Innlandet county',
  'campuses'

In [4]:
# we notice that the Norwegian letters are garbled, so the file was decoded with the wrong character encoding
# therefore we read the file again, explicitly specifying UTF-8, which supports Norwegian letters

with open("data/hospital_data.json", encoding="utf-8") as hospital_data_file:
    data = json.load(hospital_data_file)

data

[{'name': 'Levanger Hospital',
  'major_location': 'Levanger, Trøndelag county',
  'campuses': ['Levanger'],
  'founded_on': '1844'},
 {'name': 'Namsos Hospital',
  'major_location': 'Namsos, Trøndelag county',
  'campuses': ['Namsos'],
  'founded_on': '1848'},
 {'name': "St. Olav's University Hospital",
  'major_location': 'Trondheim, Trøndelag county',
  'campuses': ['Øya'],
  'founded_on': '1902-06-18'},
 {'name': 'University Hospital of North Norway',
  'major_location': 'Tromsø, Troms og Finnmark county',
  'campuses': ['Tromsøya', 'Harstad', 'Longyearbyen', 'Narvik'],
  'founded_on': '2001-12-18'},
 {'name': 'Akershus University Hospital',
  'major_location': 'Lørenskog, Viken county',
  'campuses': ['Nordbyhagen'],
  'founded_on': '1961-05-15'},
 {'name': 'Elverum Hospital',
  'major_location': 'Elverum, Innlandet county',
  'campuses': ['Elverum'],
  'founded_on': '1878'},
 {'name': 'Lillehammer Hospital',
  'major_location': 'Lillehammer, Innlandet county',
  'campuses': ['Lil
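
The garbled letters in the first attempt are consistent with UTF-8 bytes being decoded with a single-byte Windows code page. The sketch below reproduces the effect with one word from our data. Note that `cp1252` is our assumption about the platform default that was used; the notebook itself does not record which encoding was picked.

```python
original = "Trøndelag"

# how the word is stored on disk in a UTF-8 encoded file
raw_bytes = original.encode("utf-8")

# decoding those bytes with the wrong (assumed) encoding splits "ø" into two characters
misread = raw_bytes.decode("cp1252")
print(misread)  # TrÃ¸ndelag

# decoding with the encoding the file was written in restores the text
round_trip = raw_bytes.decode("utf-8")
print(round_trip)  # Trøndelag
```

Passing `encoding="utf-8"` to `open()` makes the result independent of the platform default.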

In [5]:
# now that we prepared our dataset, we notice that the json data is a list containing dictionary entries of every hospital
# we can therefore map this list (functional programming paradigm) to create series

# this will pass each element in the data list as argument into the pd.Series constructor (__init__)
series = map(pd.Series, data)

series

<map at 0x2285c117100>

In [6]:
# the map function in Python returns a lazy iterator (a map object), so if we want to check individual series, we first have to convert it to a list.
# Generally speaking, converting an iterator to a list just to pass it on isn't considered best practice because it allocates extra memory, potentially hurting performance in an application,
# and DataFrame creation only requires an iterable in the first place, which a map object is, so the conversion is technically a "wasted" operation.

series = list(series)

print(series[0])    # print the first hospital in our dataset
print(series[-1])   # print the last hospital in our dataset

name                       Levanger Hospital
major_location    Levanger, Trøndelag county
campuses                          [Levanger]
founded_on                              1844
dtype: object
name              Stavanger University Hospital
major_location       Stavanger, Rogaland county
campuses                               [Våland]
founded_on                                 1844
dtype: object
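
To make the behaviour of `map` concrete, here is a minimal sketch that is independent of our hospital data. It shows that a map object is a single-pass iterator (and not a generator in the strict sense), which is why we stored the list before indexing into it. The names `numbers`, `first_pass` and `second_pass` are only used for this illustration.

```python
import types

numbers = map(int, ["1", "2", "3"])

# a map object is its own iterator, but it is not a generator object
is_iterator = iter(numbers) is numbers
is_generator = isinstance(numbers, types.GeneratorType)
print(is_iterator, is_generator)  # True False

# the first pass consumes every element ...
first_pass = list(numbers)
print(first_pass)  # [1, 2, 3]

# ... so a second pass over the same map object yields nothing
second_pass = list(numbers)
print(second_pass)  # []
```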


## Subtask C - DataFrame creation from list of Series

In [79]:
# we can see that our list of series works as intended, so we can create a DataFrame with the list.
# note: in the task it's explicitly written "Create a DataFrame from the Series generated from step b".
# therefore, we will exclude the Series containing info on Oslo University Hospital (but we could've added it by doing series.append(ouh_series))

data_frame = pd.DataFrame(series)

data_frame

| | name | major_location | campuses | founded_on |
|---|---|---|---|---|
| 0 | Levanger Hospital | Levanger, Trøndelag county | [Levanger] | 1844 |
| 1 | Namsos Hospital | Namsos, Trøndelag county | [Namsos] | 1848 |
| 2 | St. Olav's University Hospital | Trondheim, Trøndelag county | [Øya] | 1902-06-18 |
| 3 | University Hospital of North Norway | Tromsø, Troms og Finnmark county | [Tromsøya, Harstad, Longyearbyen, Narvik] | 2001-12-18 |
| 4 | Akershus University Hospital | Lørenskog, Viken county | [Nordbyhagen] | 1961-05-15 |
| 5 | Elverum Hospital | Elverum, Innlandet county | [Elverum] | 1878 |
| 6 | Lillehammer Hospital | Lillehammer, Innlandet county | [Lillehammer] | 1878 |
| 7 | Skien Hospital | Skien, Vestfold og Telemark county | [Skien] | 2001 |
| 8 | Bærum Hospital | Bærum, Viken county | [Dønski] | 1924-03-29 |
| 9 | Drammen Hospital | Drammen, Viken county | [Drammen] | 1878 |
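
The sketch below shows, on two records taken from our data, how a list of Series turns into a DataFrame: each Series becomes one row and its index labels become the column names. It also checks that passing the map object directly gives the same result, which is our understanding of how the constructor treats iterables. The variable names are only used for this illustration.

```python
import pandas as pd

records = [
    {"name": "Levanger Hospital", "major_location": "Levanger, Trøndelag county", "founded_on": "1844"},
    {"name": "Namsos Hospital", "major_location": "Namsos, Trøndelag county", "founded_on": "1848"},
]

# one Series per hospital, collected in a list
rows = [pd.Series(record) for record in records]
frame = pd.DataFrame(rows)

# the Series labels become columns, the rows get a default integer index
print(list(frame.columns))
print(list(frame.index))

# the same frame, built straight from the lazy map object
frame_from_iterator = pd.DataFrame(map(pd.Series, records))
print(frame.equals(frame_from_iterator))
```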


## Subtask D - Column creation (using .map())

In an ideal world, we would have access to an API that already provides the Regional Health Authority for each hospital.
Since that isn't the case, we could opt for scraping the static data from Wikipedia (https://en.wikipedia.org/wiki/List_of_hospitals_in_Norway).
However, we do not know whether the page layout will change in the future, and depending on it would make our code brittle.

Our solution to this subtask is therefore to create a key-value map so that we can use the `pd.Series.map()` method to make a new column.

In [46]:
# first, we create the key-value map:
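# note that the order in the mapping does not matter and extra entries are harmless,
# therefore adding new hospitals to the map or having a different list of hospitals in the dataset won't be a problem in the future,
# as long as every hospital name in the dataset has an entry (a name without an entry ends up as NaN in the new column)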
hospital_map = {
    "Levanger Hospital": "Central Norway Regional Health Authority",
    "Namsos Hospital": "Central Norway Regional Health Authority",
    "St. Olav's University Hospital": "Central Norway Regional Health Authority",
    "University Hospital of North Norway": "Northern Norway Regional Health Authority",
    "Akershus University Hospital": "Southern and Eastern Norway Regional Health Authority",
    "Elverum Hospital": "Southern and Eastern Norway Regional Health Authority",
    "Lillehammer Hospital": "Southern and Eastern Norway Regional Health Authority",
    "Oslo University Hospital": "Southern and Eastern Norway Regional Health Authority",
    "Skien Hospital": "Southern and Eastern Norway Regional Health Authority",
    "Bærum Hospital": "Southern and Eastern Norway Regional Health Authority",
    "Drammen Hospital": "Southern and Eastern Norway Regional Health Authority",
    "Østfold Hospital": "Southern and Eastern Norway Regional Health Authority",
    "Sørlandet Hospital": "Southern and Eastern Norway Regional Health Authority",
    "Haukeland University Hospital": "Western Norway Regional Health Authority",
    "Sandviken Hospital": "Western Norway Regional Health Authority",
    "Førde Central Hospital": "Western Norway Regional Health Authority",
    "Stavanger University Hospital": "Western Norway Regional Health Authority",
}

In [47]:
# then we can use this mapping dictionary to create our new column:
regional_health_authority_column = data_frame["name"].map(hospital_map)

data_frame["regional_health_authority"] = regional_health_authority_column

# display the updated dataframe
data_frame

| | name | major_location | campuses | founded_on | regional_health_authority |
|---|---|---|---|---|---|
| 0 | Levanger Hospital | Levanger, Trøndelag county | [Levanger] | 1844 | Central Norway Regional Health Authority |
| 1 | Namsos Hospital | Namsos, Trøndelag county | [Namsos] | 1848 | Central Norway Regional Health Authority |
| 2 | St. Olav's University Hospital | Trondheim, Trøndelag county | [Øya] | 1902-06-18 | Central Norway Regional Health Authority |
| 3 | University Hospital of North Norway | Tromsø, Troms og Finnmark county | [Tromsøya, Harstad, Longyearbyen, Narvik] | 2001-12-18 | Northern Norway Regional Health Authority |
| 4 | Akershus University Hospital | Lørenskog, Viken county | [Nordbyhagen] | 1961-05-15 | Southern and Eastern Norway Regional Health Au... |
| 5 | Elverum Hospital | Elverum, Innlandet county | [Elverum] | 1878 | Southern and Eastern Norway Regional Health Au... |
| 6 | Lillehammer Hospital | Lillehammer, Innlandet county | [Lillehammer] | 1878 | Southern and Eastern Norway Regional Health Au... |
| 7 | Skien Hospital | Skien, Vestfold og Telemark county | [Skien] | 2001 | Southern and Eastern Norway Regional Health Au... |
| 8 | Bærum Hospital | Bærum, Viken county | [Dønski] | 1924-03-29 | Southern and Eastern Norway Regional Health Au... |
| 9 | Drammen Hospital | Drammen, Viken county | [Drammen] | 1878 | Southern and Eastern Norway Regional Health Au... |


We can see from the data display above that the new column has been added.
However, if we had a larger dataset where not every row could be shown, we wouldn't know whether any cell was missing data.
We therefore quickly check whether there are any NaN values.

In [48]:
has_nan_value = data_frame["regional_health_authority"].isna()

# "has_nan_value" is a boolean Series, so we can use it as a mask to select only the rows with a missing value:
data_frame[has_nan_value]

| | name | major_location | campuses | founded_on | regional_health_authority |
|---|---|---|---|---|---|


We end up with an empty DataFrame, which means that every item in the `has_nan_value` Series was `False`.

We have now confirmed that there are no NaN values in our new column, using a check that would work just as well on a genuinely large DataFrame.
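
To show what the check would look like when a value actually is missing, here is a minimal sketch on a three-row frame. "Example Hospital" is a made-up name that is deliberately left out of the mapping, and `authority_by_name` is a shortened stand-in for our `hospital_map`.

```python
import pandas as pd

frame = pd.DataFrame(
    {"name": ["Levanger Hospital", "Skien Hospital", "Example Hospital"]}  # last name is hypothetical
)

authority_by_name = {
    "Levanger Hospital": "Central Norway Regional Health Authority",
    "Skien Hospital": "Southern and Eastern Norway Regional Health Authority",
}

# names without an entry in the dictionary are mapped to NaN
frame["regional_health_authority"] = frame["name"].map(authority_by_name)

# boolean mask: True where the new column is missing a value
missing = frame["regional_health_authority"].isna()

# the mask selects the offending rows ...
unmapped_names = frame.loc[missing, "name"].tolist()
print(unmapped_names)  # ['Example Hospital']

# ... and can be reduced to a single yes/no answer or a count
any_missing = bool(missing.any())
missing_count = int(missing.sum())
print(any_missing, missing_count)  # True 1
```

On a large dataset, `missing.any()` gives the answer without displaying any rows at all.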

# Extra - Converting ISO8601 to datetime objects

The JSON standard has no native way to represent a date. We have therefore stored our dates as strings in the ISO 8601 format, which is an internationally accepted standard for dates and times (https://en.wikipedia.org/wiki/ISO_8601). A DataFrame column can hold datetime values directly, and we think dates are easier to handle as datetime values than as strings.

In [82]:
# pandas has an existing helper function pd.to_datetime() that does this for us
# format="ISO8601" (available from pandas 2.0) accepts both year-only and full-date strings
data_frame["founded_on"] = pd.to_datetime(data_frame["founded_on"], format="ISO8601")

# display the dtype of the column
data_frame["founded_on"].dtype

dtype('<M8[ns]')

Our `founded_on` column no longer contains strings, but rather `datetime64[ns]` values (year-only entries such as 1844 are parsed as 1 January of that year). This makes it easier to manipulate the data in further analysis of the dataset.
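
As a short sketch of what the conversion makes possible, the example below parses three of our founding dates and then uses the `.dt` accessor and ordinary comparisons on them. It assumes pandas 2.0 or newer because of `format="ISO8601"`; the variable names are only used for this illustration.

```python
import pandas as pd

founded_raw = pd.Series(["1844", "1902-06-18", "2001-12-18"])

# mixed-precision ISO 8601 strings are parsed in one call (assumes pandas >= 2.0)
founded = pd.to_datetime(founded_raw, format="ISO8601")

# date components are available through the .dt accessor
years = founded.dt.year.tolist()
print(years)  # [1844, 1902, 2001]

# datetime values support min/max and comparisons directly
earliest = founded.min()
print(earliest.date())  # 1844-01-01

founded_before_1900 = int((founded < pd.Timestamp("1900-01-01")).sum())
print(founded_before_1900)  # 1
```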