# TM351 Data Management & Analysis


## TMA02 Preparation Tutorial - TMA02 Review

In [1]:
# This cell imports the libraries needed for the tutorial

import pandas as pd
import folium

# for JSON data
import requests
import json

# for MongoDB
import pymongo

## TMA02

Review of what you need to do for TMA02

**Logistics reminder**

Preparation
Before you start the TMA:  
rename the folder `2024J_TMA02/` by prefixing it with your OU student PI (personal identifier).  
That is: `yourPI_2024J_TMA02`

![](images/TMA02-24b.JPG)

Make a backup copy of the folder in case you need to return to a fresh copy.

**Deadline**
If you are unable to submit the TMA by 10th March you must let your tutor know beforehand

Some of the questions should be answered in a notebook, some in a solution document called: `yourPI_TMA02_project.docx`. Each question will guide you as to where you should include your answer.


**Using Generative AI**

*The OU guidance document Generative AI for students defines the acceptable use policy for using Generative AI to support your studies. Using the framework defined in that document, TMA 02 is classed as a Category 2 activity, which means you may use Generative AI to assist you in completing an assessment piece as long as you acknowledge and report on its use.*

### Question One - 45 Marks

Food Standards Agency (FSA)

This should be answered in the notebook `yourPI_q1_2024.ipynb`

There are several parts to this question (a-f):

**a) Getting a feel for the Data (8 marks)**

You will need to do some investigation for this part and are directed to several FSA websites:

* data: [ https://www.food.gov.uk/our-data ]
* data catalogue: [https://data.food.gov.uk/catalog]
* other information on accessing the FSA data [ https://ratings.food.gov.uk/open-data ].

This means there is a lot of information in different places, and some of it may be duplicated. You need to assimilate this information in order to answer this part.

Do take note that you are asked to limit your answers to:

*data retrieved from the food hygiene ratings scheme (FHRS data). Note the distinction between `rating value` and `scores` as described in the FHRS Brand Standard, available at https://www.food.gov.uk/local-authorities/guidance-on-implementation-and-operation-of-the-food-hygiene-rating-scheme-the-brand-standard-and-statutory-guidance*

There are several bits to this part. Do make sure you answer them all:

- *under what terms is the data licensed?*
- *by what methods can copies of the data be obtained from the FSA website, and in what file or data format(s) is the data made available?*
- *what data is published for an establishment inspected under the Food Hygiene Ratings Scheme (FHRS)?*
- *identify one good aspect and one poor or weak aspect about how the FSA make their data available.*

__*8 marks*__

**b) Inspecting the data (5 marks)**

*Food standards rating data for the Milton Keynes local authority can be found in a JSON form at https://ratings.food.gov.uk/api/open-data-files/FHRS870en-GB.json, and in XML form at https://ratings.food.gov.uk/api/open-data-files/FHRS870en-GB.xml.*

This gives you a clue as to the formats in which the data is made available!

You have two choices: `XML` or `JSON`.

CSV is a common format for data exchange, with most systems that work with data (spreadsheets, databases, etc) providing functions to generate or load this type of data.

CSV has limitations: it is just comma-separated data, and the most you can expect in terms of meta-data is that the first line contains meaningful headings. Work has been done over the years to improve on this, with both XML and JSON becoming popular as non-proprietary formats for data exchange.
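To make the difference concrete, the sketch below writes the same record out as CSV and as JSON using only the Python standard library. The record and its field names are invented for illustration; they are not taken from the FSA data.

```python
import csv
import io
import json

# Hypothetical record, invented for illustration only
record = {
    "name": "Example Cafe",
    "address": {"street": "1 High Street", "town": "Milton Keynes"},
    "tags": ["cafe", "takeaway"],
}

# CSV is flat: nested values have to be flattened or joined by hand,
# and the header row is the only description of the data
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(["name", "street", "town", "tags"])
writer.writerow([
    record["name"],
    record["address"]["street"],
    record["address"]["town"],
    ";".join(record["tags"]),
])
csv_text = csv_buffer.getvalue()

# JSON keeps the nesting and labels every value with its key
json_text = json.dumps(record, indent=2)

print(csv_text)
print(json_text)
```

Notice that the CSV version needed a decision about how to store the list of tags, whereas the JSON version can be read back into exactly the same structure.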

XML allows you to create your own tags to describe the data, similar to those seen in HTML. It is stricter than HTML, in that every opening tag must have a matching closing tag; a document that follows these rules is *well formed*. To go further and check that a document has the expected structure, a Document Type Definition (DTD) can be written and the document validated against it.
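As a rough illustration of the matching-tags rule, the sketch below uses Python's built-in `xml.etree.ElementTree` parser, which rejects XML that is not well formed. The two XML strings and the helper name `is_well_formed` are made up for this example. Note that this parser only checks that the tags are properly formed; it does not validate a document against a DTD.

```python
import xml.etree.ElementTree as ET

well_formed = "<establishment><name>Example Cafe</name></establishment>"
# In this string the <name> tag is never closed
malformed = "<establishment><name>Example Cafe</establishment>"


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(well_formed))
print(is_well_formed(malformed))
```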

**XML**

The XML data is available from: https://www.w3schools.com/js/js_json_xml.asp, and was downloaded on 20th January 2025.

The data is more readable if you view the page in a browser, a snippet of it can be seen here:

!["XML snippet"](images/xml.jpg)


You can see the data is enclosed by XML tags, which are the meta-data describing what the data represents. Depending on how the data has been generated, all of it may appear on just one line, which makes it difficult to view in the notebook.

The first line of the raw data:

In [2]:
!head -1 data/2024J_TMA02_data/FSA/FHRS870en-GB.xml

<?xml version="1.0" encoding="UTF-8"?>


You can increase the number of lines returned, but beware: all the data appears in the second line!

`ScoreDescriptors.xml` provided with the FSA data is more readable:

In [3]:
!head -18 ~/notebooks/2024J_TMA02/2024J_TMA02_data/FSA/ScoreDescriptors.xml

<?xml version="1.0"?>
<ScoreDescriptorsCollection>
  <links>
    <link>
      <href>http://api.ratings.food.gov.uk/scoredescriptors</href>
      <rel>self</rel>
    </link>
  </links>
  <meta>
    <dataSource>API</dataSource>
    <extractDate>2014-10-27T16:02:16.3646745+00:00</extractDate>
    <itemCount>17</itemCount>
    <pageNumber>1</pageNumber>
    <pageSize>17</pageSize>
    <returncode>OK</returncode>
    <totalCount>17</totalCount>
    <totalPages>1</totalPages>
  </meta>


As can be seen above, the tags can be quite lengthy. XML is falling out of popularity, with JSON seen as a more lightweight solution.

Further information can be found here on XML and DTD:
- https://www.w3schools.com/xml/default.asp
- https://www.w3schools.com/xml/xml_dtd.asp

Comparison of XML and JSON:
- https://www.w3schools.com/js/js_json_xml.asp

**JSON Data**

JSON will be looked at in more detail, since it is becoming more popular.

When examining both the XML and the JSON data, the shell `head` command is less useful: because of the hierarchical structure and the way the files have been generated, all the data may be stored in a single line. Even limiting the number of lines returned will show too much data, making it hard to evaluate whether there are any issues.

In [4]:
# uncomment if you want to see the raw JSON data
#!head -1 data/2024J_TMA02_data/FSA/fsa_establishments_aldi_03-07-2024.json

In this case it is more appropriate to import the data and examine it in a dataframe.
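A minimal sketch of that approach is shown here. It assumes the JSON holds a list of records under a top-level key; the JSON text, the key name `items` and the field names are all hypothetical stand-ins, so check the keys of the real file before relying on any of them.

```python
import json
import pandas as pd

# Hypothetical JSON text, standing in for the contents of a downloaded file
raw = """
{
  "meta": {"itemCount": 2},
  "items": [
    {"id": 1, "name": "Example Cafe", "location": {"town": "Bletchley"}},
    {"id": 2, "name": "Example Diner", "location": {"town": "Wolverton"}}
  ]
}
"""

data = json.loads(raw)

# Look at the top-level keys first to find where the records live
print(list(data.keys()))

# Flatten the list of records; nested keys become dotted column names
items_df = pd.json_normalize(data["items"])
print(items_df)
```

With a real file you would replace `json.loads(raw)` with `json.load()` on an open file handle.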

The module notebook: `02.2.2 Data file formats - JSON.ipynb` includes examples of how to read the data.

Some of the examples use data directly from the BBC's iPlayer library. For example, have a look at BBC's Animal Park  

In [5]:
bbc_url = "http://www.bbc.co.uk/programmes/m0021r4h.json"
bbc_resp = requests.get(bbc_url)

programme = bbc_resp.json()
programme

{'programme': {'type': 'episode',
  'pid': 'm0021r4h',
  'expected_child_count': None,
  'position': 9,
  'image': {'pid': 'p0jj0dwm'},
  'media_type': 'audio_video',
  'title': 'Episode 9',
  'short_synopsis': 'The keepers rally around Ghost, an ageing lion who is suddenly losing weight.',
  'medium_synopsis': 'The keepers rally around Ghost, an ageing lion who is suddenly losing weight.',
  'long_synopsis': 'Along with feeding all the animals at Longleat Safari Park, the keepers need to keep an eye on their mood, behaviour, love life and, of course, their health. And that’s before doing any training or coming up with enrichment ideas. It’s a never-ending job, and they must be ready for anything. \n\nKate Humble’s first job today is creating a new toy for the magnificent colobus monkeys. There are seven boys that live in the troop on ‘gorilla island’, and keeper Carys has attached three bottles to a broom handle and stacked them full of monkey treats. Each bottle has different-sized h

Check what sort of data type is returned, since this can affect what you do with the results.

In [6]:
type(programme)

dict

In [7]:
# Convert to a dataframe
bbc_df = pd.DataFrame(programme)
bbc_df.head(10)

Unnamed: 0,programme
type,episode
pid,m0021r4h
expected_child_count,
position,9
image,{'pid': 'p0jj0dwm'}
media_type,audio_video
title,Episode 9
short_synopsis,"The keepers rally around Ghost, an ageing lion..."
medium_synopsis,"The keepers rally around Ghost, an ageing lion..."
long_synopsis,Along with feeding all the animals at Longleat...


In [8]:
# the data might be more readable when flattened
df_programme = pd.json_normalize(programme["programme"])
df_programme

Unnamed: 0,type,pid,expected_child_count,position,media_type,title,short_synopsis,medium_synopsis,long_synopsis,first_broadcast_date,...,peers.previous.title,peers.previous.first_broadcast_date,peers.previous.position,peers.previous.media_type,peers.next.type,peers.next.pid,peers.next.title,peers.next.first_broadcast_date,peers.next.position,peers.next.media_type
0,episode,m0021r4h,,9,audio_video,Episode 9,"The keepers rally around Ghost, an ageing lion...","The keepers rally around Ghost, an ageing lion...",Along with feeding all the animals at Longleat...,2024-08-22T09:30:00+01:00,...,Episode 8,2024-08-21T09:30:00+01:00,8,audio_video,episode,m0021r4j,Episode 10,2024-08-23T09:30:00+01:00,10,audio_video


Do remember that JSON data is schemaless. One approach is to look at the keys to find out what the structure is.

In [9]:
# find out what keys the data has
programme["programme"].keys(), df_programme.columns

(dict_keys(['type', 'pid', 'expected_child_count', 'position', 'image', 'media_type', 'title', 'short_synopsis', 'medium_synopsis', 'long_synopsis', 'first_broadcast_date', 'display_title', 'ownership', 'parent', 'peers', 'versions', 'links', 'supporting_content_items', 'categories']),
 Index(['type', 'pid', 'expected_child_count', 'position', 'media_type',
        'title', 'short_synopsis', 'medium_synopsis', 'long_synopsis',
        'first_broadcast_date', 'versions', 'links', 'supporting_content_items',
        'categories', 'image.pid', 'display_title.title',
        'display_title.subtitle', 'ownership.service.type',
        'ownership.service.id', 'ownership.service.key',
        'ownership.service.title', 'parent.programme.type',
        'parent.programme.pid', 'parent.programme.title',
        'parent.programme.short_synopsis', 'parent.programme.media_type',
        'parent.programme.position', 'parent.programme.image.pid',
        'parent.programme.expected_child_count',
     

As you can see, the data is not as structured as the CSV and Excel data seen in TMA01.

An alternative is to store the data in some sort of permanent storage. The next two parts look at relational and document databases.
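If you want to see the nesting without converting to a dataframe, a small recursive helper can list every key path in a document. The sketch below is one possible way of doing this; `key_paths` and the sample document are hypothetical, and the helper only descends into dictionaries, not lists.

```python
def key_paths(document, prefix=""):
    """Return the dotted path of every key in a nested dictionary."""
    paths = []
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        paths.append(path)
        # Recurse into sub-documents so their keys are listed too
        if isinstance(value, dict):
            paths.extend(key_paths(value, path))
    return paths


# Hypothetical document, loosely modelled on the programme data above
sample = {
    "pid": "x123",
    "title": "Example",
    "display_title": {"title": "Example Show", "subtitle": "Episode 1"},
    "ownership": {"service": {"id": "example_service"}},
}

print(key_paths(sample))
```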

**c) Representing the Data in a Relational Database (10 marks)**

*Describe how you would represent the establishment data using a relational database. For each table in your database design, you should identify:*

*1. what entity is represented by the table,*

*2. what columns would be used in the table, and, if it is not clear, what data those columns would contain,*

*3. the table's primary and, if required, foreign keys.*

*4. any constraints that should be applied (including any constraints on keys).*

*Note: this question is primarily about the __design__ of the relational database, rather than its implementation. __You do not need to build the relational database when answering this question.__*

Do note the last part: you do not need to implement the schema.

`Part 8 Introduction to relational databases`, `Part 9 Relational data modelling` and `Part 10 Normalisation` will help with this part.

The goal for relational databases is to avoid repeating information, such as storing your name and address for every module you take at the OU. This means the data is normally split over several tables, but each table should be meaningful and contain the same sort of data.

This can be seen in the Hospital example in Part 9:

![](images/tm351_pt09_f07.eps.jpg)


There is nothing to stop you putting all the information in one big table, but imagine the amount of data that would be duplicated.


For the TMA you need to examine the FHRS data and think about how the data could be stored in more than one table. Look for fields that contain data that could be repeated in different records, such as the details of an organisation.

JSON data is hierarchical, so another thing to look for is any sub-documents, which could indicate a potential table. 

Remember to include the primary and foreign keys too. 

- Each table has one primary key, which must be unique and not null.
- Tables are linked implicitly using foreign keys. Each foreign key is matched to a primary key in the table it links to. For instance, in the hospital example above, the Prescription table contains three separate foreign keys to link it to the Patient, Doctor and Drug tables.
- Foreign key field(s) may include duplicate values and can be null if the relationship is optional.
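Although the TMA does not ask you to build anything, seeing keys and constraints in action can make the ideas easier to describe. The sketch below uses Python's built-in `sqlite3` module with a cut-down version of the hospital example; the column names are assumptions made for illustration, and the Doctor table is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE drug (
    drug_code TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE prescription (
    prescription_id INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    drug_code TEXT NOT NULL REFERENCES drug(drug_code),
    quantity INTEGER CHECK (quantity > 0)
);
""")

conn.execute("INSERT INTO patient VALUES (1, 'A. Patient')")
conn.execute("INSERT INTO drug VALUES ('D01', 'Example drug')")
# The same foreign key values can appear in many rows
conn.execute("INSERT INTO prescription VALUES (1, 1, 'D01', 2)")
conn.execute("INSERT INTO prescription VALUES (2, 1, 'D01', 1)")

# Patient 99 does not exist, so the foreign key constraint rejects this row
try:
    conn.execute("INSERT INTO prescription VALUES (3, 99, 'D01', 1)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(fk_rejected)
```

Here the foreign keys are declared `NOT NULL` because a prescription without a patient or drug would make no sense; an optional relationship would simply omit that constraint.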

Let's look at our BBC data and what tables could be generated from it.

In [10]:
# was there a big news item this last week?
bbc_url = "http://www.bbc.co.uk/programmes/m00276p8.json"

bbc_resp = requests.get(bbc_url)

programme2 = bbc_resp.json()
programme2

{'programme': {'type': 'episode',
  'pid': 'm00276p8',
  'expected_child_count': None,
  'position': None,
  'image': {'pid': 'p0kjrzf5'},
  'media_type': 'audio_video',
  'title': "President Trump's Inauguration",
  'short_synopsis': 'Live coverage as Donald Trump is sworn in as the 47th president of the United States.',
  'medium_synopsis': 'Live from Washington DC as Donald Trump is sworn in as the 47th president of the United States. Special coverage of the inauguration with Clive Myrie and Sophie Raworth.',
  'long_synopsis': "Live from Washington DC as Donald Trump is sworn in as the 47th president of the United States. Special coverage of the inauguration with Clive Myrie and Sophie Raworth, and analysis from the BBC's team of experts.",
  'first_broadcast_date': '2025-01-20T15:30:00Z',
  'display_title': {'title': 'BBC News Special',
   'subtitle': "President Trump's Inauguration"},
  'ownership': {'service': {'type': None,
    'id': 'bbc_news',
    'key': 'news',
    'title': 

In this case there seem to be some sub-documents that are worth investigating.

There is some general programme information at the start, then information about the parent programme and peers. These could all be potential tables.

Look for potential unique keys in the data that could become primary keys. What could be used above?
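One way to experiment with this idea is to flatten the records with pandas and then pull the repeated sub-document out into its own table. The records below are invented, but they follow the same `parent.programme` nesting seen in the BBC data; the table and column names chosen here are only one possible design.

```python
import pandas as pd

# Hypothetical records with the same shape as the nested programme data
episodes = [
    {"pid": "e1", "title": "Episode 1",
     "parent": {"programme": {"pid": "s1", "title": "Series A"}}},
    {"pid": "e2", "title": "Episode 2",
     "parent": {"programme": {"pid": "s1", "title": "Series A"}}},
    {"pid": "e3", "title": "Special",
     "parent": {"programme": {"pid": "s2", "title": "Series B"}}},
]

flat = pd.json_normalize(episodes)

# One row per parent programme: the repeated details are stored once
parent_table = (
    flat[["parent.programme.pid", "parent.programme.title"]]
    .drop_duplicates()
    .rename(columns={
        "parent.programme.pid": "parent_pid",
        "parent.programme.title": "parent_title",
    })
    .reset_index(drop=True)
)

# The episode table keeps only a reference (foreign key) to its parent
episode_table = flat[["pid", "title", "parent.programme.pid"]].rename(
    columns={"parent.programme.pid": "parent_pid"}
)

print(parent_table)
print(episode_table)
```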

**d) Using a MongoDB database (10 marks)**

The above data is semi-structured and probably better suited to a document database, such as MongoDB.

Let's set up a MongoDB collection and store our two programmes in it.

In [11]:
# Set up a MongoDB connection
MONGO_CONNECTION_STRING = "mongodb://localhost:27017/"
print(f"MONGO_CONNECTION_STRING = {MONGO_CONNECTION_STRING}")

MONGO_CONNECTION_STRING = mongodb://localhost:27017/


In [12]:
# set up a client
from pymongo import MongoClient
mongo_client = MongoClient(MONGO_CONNECTION_STRING)

In [13]:
# I'm likely to run this several times, so will drop my previous version
mongo_client.drop_database("bbc_db")

In [14]:
# Create database
db = mongo_client["bbc_db"]

# Create collection
bbc_collection = db["bbc_collection"]

In [15]:
# insert our two records
# need to use insert_one() since there is only one document in each one
# use insert_many() if you have more than one document
bbc_collection.insert_one(programme["programme"])

InsertOneResult(ObjectId('679389b9280b6573548ebec8'), acknowledged=True)

In [16]:
bbc_collection.insert_one(programme2["programme"])

InsertOneResult(ObjectId('679389b9280b6573548ebec9'), acknowledged=True)

In [17]:
bbc_collection.find_one()

{'_id': ObjectId('679389b9280b6573548ebec8'),
 'type': 'episode',
 'pid': 'm0021r4h',
 'expected_child_count': None,
 'position': 9,
 'image': {'pid': 'p0jj0dwm'},
 'media_type': 'audio_video',
 'title': 'Episode 9',
 'short_synopsis': 'The keepers rally around Ghost, an ageing lion who is suddenly losing weight.',
 'medium_synopsis': 'The keepers rally around Ghost, an ageing lion who is suddenly losing weight.',
 'long_synopsis': 'Along with feeding all the animals at Longleat Safari Park, the keepers need to keep an eye on their mood, behaviour, love life and, of course, their health. And that’s before doing any training or coming up with enrichment ideas. It’s a never-ending job, and they must be ready for anything. \n\nKate Humble’s first job today is creating a new toy for the magnificent colobus monkeys. There are seven boys that live in the troop on ‘gorilla island’, and keeper Carys has attached three bottles to a broom handle and stacked them full of monkey treats. Each bottl

In [18]:
# how many documents
bbc_collection.count_documents({})

2

In [19]:
# parameters are usually provided as key: value pairs
# for example, to search for a particular title 
bbc_collection.find_one({"title": "President Trump's Inauguration"})

{'_id': ObjectId('679389b9280b6573548ebec9'),
 'type': 'episode',
 'pid': 'm00276p8',
 'expected_child_count': None,
 'position': None,
 'image': {'pid': 'p0kjrzf5'},
 'media_type': 'audio_video',
 'title': "President Trump's Inauguration",
 'short_synopsis': 'Live coverage as Donald Trump is sworn in as the 47th president of the United States.',
 'medium_synopsis': 'Live from Washington DC as Donald Trump is sworn in as the 47th president of the United States. Special coverage of the inauguration with Clive Myrie and Sophie Raworth.',
 'long_synopsis': "Live from Washington DC as Donald Trump is sworn in as the 47th president of the United States. Special coverage of the inauguration with Clive Myrie and Sophie Raworth, and analysis from the BBC's team of experts.",
 'first_broadcast_date': '2025-01-20T15:30:00Z',
 'display_title': {'title': 'BBC News Special',
  'subtitle': "President Trump's Inauguration"},
 'ownership': {'service': {'type': None,
   'id': 'bbc_news',
   'key': 'new

In [20]:
# you can use the dot notation to search sub-documents
bbc_collection.find_one({"display_title.title": "Animal Park"})

{'_id': ObjectId('679389b9280b6573548ebec8'),
 'type': 'episode',
 'pid': 'm0021r4h',
 'expected_child_count': None,
 'position': 9,
 'image': {'pid': 'p0jj0dwm'},
 'media_type': 'audio_video',
 'title': 'Episode 9',
 'short_synopsis': 'The keepers rally around Ghost, an ageing lion who is suddenly losing weight.',
 'medium_synopsis': 'The keepers rally around Ghost, an ageing lion who is suddenly losing weight.',
 'long_synopsis': 'Along with feeding all the animals at Longleat Safari Park, the keepers need to keep an eye on their mood, behaviour, love life and, of course, their health. And that’s before doing any training or coming up with enrichment ideas. It’s a never-ending job, and they must be ready for anything. \n\nKate Humble’s first job today is creating a new toy for the magnificent colobus monkeys. There are seven boys that live in the troop on ‘gorilla island’, and keeper Carys has attached three bottles to a broom handle and stacked them full of monkey treats. Each bottl

If you are interested in document databases and want to see some examples of how to use a MongoDB database, see this notebook: `MongoDB.ipynb`. 

Do note, the examples in the MongoDB notebook go beyond what is needed for TMA02.

### Data Validation

Once you have imported the FSA data, you are also asked to validate it.

You are given a `partial_validation_schema` to check your data against.

When creating your collection do check what other options are available, which will help with what to do with this schema:

https://www.mongodb.com/docs/manual/reference/method/db.createCollection
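As a rough sketch of how a schema can be attached when a collection is created, the example below continues with the BBC programme documents rather than the FSA data. The schema and the helper `create_validated_collection` are hypothetical and are not the `partial_validation_schema` supplied with the TMA. The helper assumes that `db` is a pymongo database object and passes the schema through the `validator` option.

```python
# Hypothetical schema for the BBC programme documents used earlier;
# it is not the partial_validation_schema supplied with the TMA
programme_schema = {
    "bsonType": "object",
    "required": ["pid", "title"],
    "properties": {
        "pid": {"bsonType": "string"},
        "title": {"bsonType": "string"},
        "position": {"bsonType": ["int", "null"]},
    },
}


def create_validated_collection(db, name, schema):
    """Create a collection whose documents are checked against schema."""
    return db.create_collection(name, validator={"$jsonSchema": schema})


# Usage, assuming `db` is the pymongo database created earlier:
# validated = create_validated_collection(db, "bbc_validated", programme_schema)
```

Once a validator is in place, inserting a document that breaks the schema (for example, one with no `pid`) should be rejected by the server rather than silently stored.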

**e) Using the Geographic Data (3 + 3 = 6 marks)**

Two images are required:

*i) Using the folium package, generate a map that uses markers with pop-up labels to identify “Premier” retail stores in the Milton Keynes local authority area based on the data in the food_ratings_cleaner collection*. ***To what extent do you think your data query reliably identifies only those outlets that are part of the “Premier” branded stores organisation?***

*ii) As well as using markers to locate individual establishments, we can also use choropleth maps to visualise aggregated data values within area boundaries.*

Part i requires pulling out the Longitude and Latitude values from either your MongoDB database or dataframe.

Do note the highlighted part - it could be easy to overlook!

Examples of creating choropleth maps for Part ii can be found in `Notebook 05.2 Getting started with maps - folium`. `Notebook 05.X Optional notes on Geo data formats` has some examples of adding a marker to a map.

**f) Database models (6 marks)**

*Having explored the data, you need to write a short memo explaining whether a relational database (such as PostgreSQL) or a document database (such as MongoDB) would be more suitable for storing the FSA data.*

*Give two advantages and one disadvantage of each solution for this particular dataset. State which database you think would be more appropriate for an actual implementation. Your justification should be made with specific reference to the scenario.*

*Write no more than 250 words.*

***Marks will be capped at 2 out of 6 for a generic answer that does not reference the specific dataset used in this scenario.***

__*6 marks*__


This gets you to think about which database system is best for the FSA data. Do tailor your answer to what you have discovered in parts c and d, otherwise your marks will be capped.

### Question Two - 30 Marks

This question is moving towards the EMA. Details of what you need to do can be found in the `2024J_TMA02.pdf` document.

Snippets are included here:

*This question is designed to get you started on a data investigation that will be developed into a larger investigation for the end-of-module assessment (EMA).*

*While answering this question, you should ask yourself what stories the dataset might contain that could be explored using the techniques you have studied.*

*You may find it useful to consider your data exploration notebooks as portfolio pieces in which you demonstrate what you have learned through studying TM351 in the context of using data analysis and visualisation techniques in an investigative or exploratory setting.*

*You are encouraged to make use of a database, where appropriate, that contains cleaned and validated data when completing this question. For example, you may use the MongoDB database that you created in Question 1. If you are unable to use a database, or prefer not to, you will not be penalised if you do your work using data loaded from a file directly into a dataframe, although you should justify why you have not used a database approach. If instead you create a DataFrame from the original datafiles (for example, 2024J_TMA02_data/FSA/FHRS870en-GB.json) and use that for your investigation, you should address any data cleaning and validation considerations, as explored in Question 1, part b, before making analytical use of it.*

For this question you should use the cleaned MongoDB database created in Question 1.

If this is not possible, you can load the provided JSON file into a dataframe, or import the provided zip version of the database. See details of how to do the latter in the `Restoring a sample MongoDB database` section.


## Question 2(a) - 20 marks

*Question 1 will have given you a basic feel for the Food Standards Agency food hygiene ratings scheme (FHRS) data.*

*In this question, you should further explore the FHRS data and investigate another aspect of it, or answer a question of your own devising based on it.*

*You are not limited to reporting on just food hygiene rating scores. For example, you might also use the FHRS dataset as a basis for a range of other exploratory questions, such as geographically profiling various food related businesses, comparing the size of corporate groupings or other forms of “competitive intelligence”, etc.*

*For this part of question 2, you should work in your
**yourPI_q2a_lab_notebook.ipynb**
notebook, under the heading “Question 2(a)”, renaming the notebook to use your PI. Treat this notebook as a lab notebook: keep all the work you do and don’t tidy it up too much or delete work that turns out to be a dead end; if you have code cells which do not execute correctly, make a comment in the code cell regarding the error.*

See the section on `Presenting your answer` for what should go into your lab notebook.

Overall you should:

*Summarise the main characteristics of the dataset that allow you to perform your exploration.*

*Create and label at least two different plots to visualise different aspects of the data. You should use at least two different types of plot, e.g. scatter, line, bar, etc.*

*Create at least one folium map.*

*Include notes critically evaluating what you think your investigations and visualisations tell you.*

## Question 2(b) - 10 marks

This question is practice for writing your EMA report. For this part you should write your answer in a Word document: `yourPI_TMA02_project.docx` under the heading *“Question 2(b)”*.

*In this part you will use your findings from part 2(a) to write a brief report using the following outline structure:*

- *Aims and objectives*
- *Background*
- *Sources of data (original source; locally managed source)*
- *Analysis pipeline*
- *Findings*
- *Conclusions*
- *References*
  
*Your report should be no more than 650 words*

*Note: your notebook reference need be no more than: Name, Initials. (year) yourPI_q2a_lab_notebook.ipynb.*


Writing a report can be difficult if you are out of practice doing this.

`Part 5 Presentation: telling the story` can be useful for understanding how to report your findings.

In the discussion of `Exercise 5.6 Exploratory` you can find a file: `TM351_Report_Outline.docx` which as the name suggests is a general-purpose reporting structure and can be useful for getting started.

When discussing your results, things to think about commenting on:
- the range: how much the measurement varies
- the outliers: the maximum and minimum values (Are these unexpected? Do they represent errors in the data?)
- any trends: do the values increase, decrease or oscillate over time
- any patterns: does anything stand out as being a regular pattern 
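
The points above can be backed up with a few numbers. The sketch below uses an invented series of values and flags possible outliers with the common 1.5 × IQR rule; this rule is just one convention, chosen here for illustration, and a flagged value still needs checking to decide whether it is an error or a genuine extreme.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration
values = pd.Series([12, 15, 14, 13, 16, 15, 14, 48])

# The range: how much the measurement varies
value_range = values.max() - values.min()

# Flag values more than 1.5 * IQR outside the quartiles as possible outliers
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(values.describe())
print("Range:", value_range)
print("Possible outliers:", outliers.tolist())
```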


### Question Three - 25 Marks

*This question is designed to help you in the planning stage of the EMA. This is your opportunity to develop a work plan so your tutor can give you helpful, timely feedback. Before you attempt this question you should read the EMA. Think about what you are being asked to do and how this builds on the analysis you did for Question 2.*

*A good answer to this question will mean you have mapped out your EMA work and got a good start at understanding what the EMA requires.*

## Question 3(a) EMA Questions - 10 marks

This should be answered in your `yourPI_TMA02_project.docx` solution document, under the heading *“Question 3(a)”*.

This is getting you to think about the questions you will explore in the EMA. You will investigate an additional dataset, as well as the FHRS dataset.

Overall you need to:
- specify a question that can be answered by looking at the new dataset (on its own)
- specify a question that can be answered when you combine both datasets

Two additional datasets have been provided:

1. Care home data from Care Quality Commission (CQC) containing residential care home names, addresses, organisational groupings and CQC ratings for care homes in England.

**You should limit the scope of any CQC data used to just data associated with residential care home establishments.**

2. Data from the Department for Education relating to schools, including administrative information for all schools and performance data for schools in the Milton Keynes Local Authority area

Two downloaded datasets are provided, although you are free to download other school performance related datasets from the links provided.

Note:
- You are welcome to download your own recent versions of the data from the locations specified.
- You may use additional FHRS ratings data obtained from Food Standards Agency website although you are not required to do so.

What you need to do for Question 3(a):

*For each of the following, write a paragraph or two in your solution document, explaining and justifying your choice:*

*1. Which additional dataset you have chosen.*

*2. A question you can investigate using just this additional dataset and why this dataset can answer it.*

*3. Another question, which you can answer only by joining or otherwise combining the food ratings dataset and the additional dataset you have chosen. You should also explain why this combination of datasets can answer that question.*

*At some point in the EMA data investigation, you will need to use a folium map, so you should consider this as you develop your questions.*

So open the datasets and explore them.

Think about how you will create a folium map.

Any metadata can help you understand them.

There is a lot of data, so you will need to limit what you use.

Do remember you only need to pick **one** of the two new datasets provided (Care home or Department for Education data).

**This is not a question to do last minute!!**

## Question 3(b) Strategic Concerns - 5 marks

Answer in `yourPI_TMA02_project.docx` solution document, under the heading *“Question 3(b)”*.

*Think about what you are being asked to do in the EMA and how this builds on the analysis you did for Question 2. How could you address the investigation questions you proposed in Part 3(a)?*

*Describe how you intend to store, combine and analyse the data, and how you intend to visualise your results. Identify and briefly describe the tools and techniques you could use.*

## Question 3(c) Time Planning - 10 marks

This is made up of two parts:

- workplan of activities for completing the EMA, including timescales and milestones
- initial entries in the project diary


The workplan can either be included in your `yourPI_TMA02_project.docx` solution document, under the heading *“Question 3(c)”*, or in a spreadsheet called: `yourPI_workplan.xlsx`

The notebook named `yourPI_project_diary.ipynb` should be used as your project diary.

The diary should include two cells:

*1. a code cell importing and briefly examining the dataset you chose in Part 3(a)*

*2. a markdown cell or code cell indicating how you might combine your Part 3(a) dataset with data from the FHRS data.*
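
As a starting point for the second cell, one possible way to combine two datasets is a pandas merge on a shared key. Everything in the sketch below is hypothetical: the dataframes, the column names and the choice of postcode as the key are assumptions for illustration, and your own datasets may need a different key altogether.

```python
import pandas as pd

# Hypothetical extracts; the real column names will differ
fhrs = pd.DataFrame({
    "BusinessName": ["Example Cafe", "Example Home Kitchen", "Example Shop"],
    "PostCode": ["MK1 1AA", "MK2 2BB", "MK3 3CC"],
    "RatingValue": ["5", "4", "3"],
})
other = pd.DataFrame({
    "name": ["Example Home", "Example School"],
    "postcode": ["mk2 2bb", "MK9 9ZZ"],
})

# Normalise the join key before merging: same case, no spaces
fhrs["postcode_key"] = fhrs["PostCode"].str.upper().str.replace(" ", "", regex=False)
other["postcode_key"] = other["postcode"].str.upper().str.replace(" ", "", regex=False)

# A left join keeps every row of the additional dataset;
# indicator=True adds a _merge column showing which rows found a match
combined = other.merge(fhrs, on="postcode_key", how="left", indicator=True)
print(combined[["name", "BusinessName", "RatingValue", "_merge"]])
```

Checking the `_merge` column is a quick way to see how many records failed to match, which is worth commenting on in your diary.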

## Reminders:

Deadlines:
- If you are unable to submit the TMA by 13th March you must let your Tutor know beforehand
- Bear in mind there are no extensions allowed for the EMA

iCMAs
- iCMA 44 is due by the 30th January.
- you need to get at least 30% in five of them

## Wrap Up

- Any questions?


### Data Sources



The FSA data used here is from the TM351 TMA02-24J assessment and can be found in the `2024J_TMA02_data` folder.

The data is licensed, and details can be found in the links mentioned in the assessment:

*The UK Food Standards Agency (FSA) publishes a wide range of data relating to food establishments and food standards [ https://www.food.gov.uk/our-data ]. As well as a data catalogue [https://data.food.gov.uk/catalog], the FSA also publishes information on accessing their data via an open data landing page [ https://ratings.food.gov.uk/open-data ].*
