
<h1 style ="color:blue;">TM351 25J</h1>
<h2 style="color:blue;">Data management and analysis</h2>
<h2 style="color:blue;">TMA01 Preparation Tutorial</h2>

<h3>Mary Garvey and Stephen Murphy</h3>
<h3>3rd November 2025</h3>

In [1]:
# This cell imports the libraries needed for this workbook: pandas and json

import pandas as pd
import json


## What we will cover today

* The data life cycle and data pipeline		
* Data acquisition
* Data preparation
* Data analysis
* Data presentation
* The TMA 

## Data life cycle and data pipeline


The module follows a data life cycle and pipeline:

![](images/tm351_pt1_f04.eps.jpg)

This takes into account what actually happens to the data, and the uses to which it is put, as it is processed.

The emphasis is on the management of data from its creation through to its reuse and eventual destruction:

!["Data cycle"](images/tm351_pt1_f03.eps.jpg)

Many variants of the data life cycle and data pipeline exist, for example CRISP-DM (CRoss Industry Standard Process for Data Mining).

This module is built around the data pipeline, which allows:

* Completeness and correctness

Analysts need to be confident that the information they generate is correct.

* Reproducibility and provenance

Researchers want to be able to reproduce and prove their results. For example, a research paper can't just say the answer is 42 without showing how the result came about.  This is not only to allow others to ensure there is a valid method leading to the answer, but also so that others can reproduce and check the results if needed.

From your point of view, notebooks let you document your data processes and make them reproducible, so others can see your reasoning, re-run your entire analysis and perhaps carry out further analysis.

Figure 1.9 from TM351 Part 1 illustrates a ‘big data’ workflow from a data-centric scientific research process:

![](images/tm351_pt1_f09.eps.jpg)


### Data Characteristics

Part 1 introduces various things to consider when working with data. 

Table 1.1 introduces Kitchin’s characterisation of data (Kitchin, 2014), which can be used as a checklist to characterise a dataset:

![](images/kitchin.png)

### Name that structure

For example, have a look at the following samples of data. For each one, decide how you would describe its structure (structured, semi-structured or unstructured):

In [2]:
! head data/SFR01_2016_UD_national_1.csv

Country_code_9_digit,Country_code,Country_name,Characteristic,Characteristic_category,NATden_girls_15,NATden_boys_15,NATden_all_15,KS4_LEVEL2_girls_15,KS4_LEVEL2_boys_15,KS4_LEVEL2_all_15,KS4_LEVEL2_EM_girls_15,KS4_LEVEL2_EM_boys_15,KS4_LEVEL2_EM_all_15,KS4_LEVEL1_girls_15,KS4_LEVEL1_boys_15,KS4_LEVEL1_all_15,KS4_LEVEL1_EM_girls_15,KS4_LEVEL1_EM_boys_15,KS4_LEVEL1_EM_all_15,KS4_L2BASICS_girls_15,KS4_L2BASICS_boys_15,KS4_L2BASICS_all_15,KS4_LEVEL2MFL_girls_15,KS4_LEVEL2MFL_boys_15,KS4_LEVEL2MFL_all_15,KS4_GLEVEL2EM_girls_15,KS4_GLEVEL2EM_boys_15,KS4_GLEVEL2EM_all_15,KS4_EBACC_E_girls_15,KS4_EBACC_E_boys_15,KS4_EBACC_E_all_15,KS4_EBACC_girls_15,KS4_EBACC_boys_15,KS4_EBACC_all_15,NATden_PROGENG_girls_15,NATden_PROGENG_boys_15,NATden_PROGENG_all_15,KS4_PROGENG_girls_15,KS4_PROGENG_boys_15,KS4_PROGENG_all_15,NATden_PROGMAT_girls_15,NATden_PROGMAT_boys_15,NATden_PROGMAT_all_15,KS4_PROGMAT_girls_15,KS4_PROGMAT_boys_15,KS4_PROGMAT_all_15,NATden_girls_FSM_15,NATden_boys_FSM_15,NATden_all_FSM_15

As presented above, the data is hard to read, but there do not appear to be any issues and it looks like comma-separated values (CSV).

Let's take a quick subset of the data, the first five columns and rows:

In [3]:
data_df = pd.read_csv("data/SFR01_2016_UD_national_1.csv", usecols = [0,1,2,3,4])
data_df.head()

Unnamed: 0,Country_code_9_digit,Country_code,Country_name,Characteristic,Characteristic_category
0,E92000001,921.0,England,,All pupils
1,E92000001,921.0,England,ETHNICGROUP_MAJOR,White
2,E92000001,921.0,England,ETHNICGROUP_MINOR,White British
3,E92000001,921.0,England,ETHNICGROUP_MINOR,Irish
4,E92000001,921.0,England,ETHNICGROUP_MINOR,Traveller of Irish Heritage


1. Structured, semi-structured or unstructured?

In [4]:
! head -25 data/role.json

{
 "meta": {
  "limit": 100,
  "offset": 0,
  "total_count": 100
 },
 "objects": [
  {
   "caucus": null,
   "congress_numbers": [
    116,
    117,
    118
   ],
   "current": true,
   "description": "Junior Senator for Washington",
   "district": null,
   "enddate": "2025-01-03",
   "extra": {
    "address": "511 Hart Senate Office Building Washington DC 20510",
    "contact_form": "https://www.cantwell.senate.gov/public/index.cfm/email-maria",
    "office": "511 Hart Senate Office Building",
    "rss_url": "http://www.cantwell.senate.gov/public/index.cfm/rss/feed"
   },
   "leadership_title": null,


2. Structured, semi-structured or unstructured?

In [5]:
! head -20 data/SFR01_2016_UD_metadata.txt

Release of underlying data for characteristics breakdowns in SFR 01/2016: Revised GCSE and equivalent results in England: 2014 to 2015.

Background
This statistical first release (SFR) includes 2014/15 information on attainment for GCSE and equivalent results by different pupil characteristics, specifically gender, ethnicity, eligibility for free school meals (FSM), disadvantage, special educational needs (SEN) and English as a first language, at National and LA level.
The 2014/15 figures contained within this SFR use data collected for the 2015 Secondary School Performance Tables, which has been checked by schools.

CHANGES TO THE PRODUCTION OF THESE STATISTICS
Two major reforms were implemented in 2013/14 which affected the calculation of key stage 4 (KS4) performance measures data:

	1. Professor Alison Wolf�s Review of Vocational Education recommendations which; 
		�	restrict the qualifications counted
		�	prevent any qualification from counting as larger than one GCSE
		�	cap the 

In [6]:
! tail data/SFR01_2016_UD_metadata.txt

KS4_LEVEL1_15: Total number of pupils achieving 5 or more A*-G grades at GCSE or equivalent
KS4_LEVEL1_EM_15: Total number of pupils achieving 5 or more A*-G grades at GCSE or equivalent including English and mathematics GCSEs
KS4_L2BASICS_15: Total number of pupils achieving an A*-C grade in English and mathematics GCSEs	
KS4_EBACC_E_15: Total number of pupils entering the English Baccalaureate
KS4_EBACC_15: Total number of pupils achieving the English Baccalaureate 







3. Structured, semi-structured or unstructured?

### Potential data issues


![](images/issues.png)


Can anyone think of other issues?

## Acquiring the data

![](images/tm351_pt2_f01.eps.small.jpg)

The first step in the data analysis process is to acquire the data.

* Has anyone worked with "found" datasets, other than for educational purposes?
For example, for work or hobbies. 

* If you have a question, or questions, to be answered, you may need to work with several datasets to get a full picture. These can come from different sources. For example, with the bat data seen in TMA01 you may also want to look at weather patterns, to see if they have any effect on the bats' hibernation.

* The TMA sometimes gets you to download the data from the original source, but live datasets have been known to disappear or change!


### Found Data

There are lots of data sets out there:

|Name|URL|
|-----|-------|
|ONS|<a href="https://www.ons.gov.uk">https://www.ons.gov.uk</a>|
|EU Stats|<a href="http://ec.europa.eu/eurostat">http://ec.europa.eu/eurostat</a>|
|European commission stats|<a href="http://ec.europa.eu/eurostat/data/statistics-a-z/abc">http://ec.europa.eu/eurostat/data/statistics-a-z/abc</a>|
|UK Government|<a href="https://www.gov.uk/government/statistics">https://www.gov.uk/government/statistics</a>|
|US Government|<a href="https://www.usa.gov/statistics">https://www.usa.gov/statistics</a>|
|Edinburgh University data share|<a href="http://datashare.is.ed.ac.uk/">http://datashare.is.ed.ac.uk/</a>|
|List of high quality data sets|<a href="https://github.com/caesar0301/awesome-public-datasets">https://github.com/caesar0301/awesome-public-datasets</a>|
|AWS public data sets|<a href="https://aws.amazon.com/datasets/">https://aws.amazon.com/datasets/</a>|
|Stanford: Computational Journalism lab|<a href="http://cjlab.stanford.edu/">http://cjlab.stanford.edu/ </a>|
|KDNuggets: data sets for mining/discovery|<a href="https://www.kdnuggets.com/datasets/">https://www.kdnuggets.com/datasets/</a>|
|Kaggle: AI & ML community|<a href="https://www.kaggle.com/">https://www.kaggle.com/</a>|
|Halifax house prices|<a href="http://www.lloydsbankinggroup.com/media/economic-insight/halifax-house-price-index/">http://www.lloydsbankinggroup.com/media/economic-insight/halifax-house-price-index/</a>|
|Nationwide house prices|<a href="http://www.nationwide.co.uk/about/house-price-index/headlines">http://www.nationwide.co.uk/about/house-price-index/headlines</a>|
|Historical weather|<a href="http://www.wunderground.com/history">http://www.wunderground.com/history</a>|


### Issues with Data


Things to consider when acquiring data:

**Character encodings**

- These define the mappings between the digital representation of a character (the binary 1s and 0s that represent it) and the character itself
- Data from different sources could use different encodings, so you will want to convert it to a common one, e.g., ASCII or UTF-8 (one of the encodings defined by the Unicode standard).
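
As a rough illustration of why the encoding matters, the sketch below encodes a short string containing a curly apostrophe as Windows-1252 and then decodes the same bytes as UTF-8. The choice of Windows-1252 is an assumption made for the example. It is one common way a replacement character (�) can end up in output like the metadata sample seen earlier, but the real file may use a different encoding.

```python
# A minimal sketch: the same bytes read with two different encodings.
text = "Professor Alison Wolf’s Review"

# Assumption: the text was saved as Windows-1252 (cp1252).
raw_bytes = text.encode("cp1252")

# Decoding with the wrong encoding: the apostrophe byte (0x92) is not valid
# UTF-8, so it is swapped for U+FFFD, the replacement character.
garbled = raw_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
recovered = raw_bytes.decode("cp1252")

print(garbled)
print(recovered)
```

With pandas, the equivalent control is the `encoding` argument of `read_csv()`, for example `encoding="cp1252"`.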

**Measurement scales**
- Nominal, ordinal, interval or ratio

**Data storage**
- tables (structured data) or documents (semi-structured data, such as JSON or XML)

We always tell you to check the data before importing it. The OS commands `head` and `tail` can be used for this.

Do check the results, things to look for include:

- what the format is, e.g., CSV, JSON
- how is the data separated? It's not always commas, even in a CSV file
- any decoding issues
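
The same checks can be roughed out in Python when `head` is not available. The sketch below is only an illustration: `peek` and `guess_separator` are hypothetical helper names, not part of pandas or the module materials, and the guess is a simple count that can be fooled by unusual files, so it does not replace looking at the data yourself.

```python
from itertools import islice


def peek(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file, much like the head command."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


def guess_separator(lines, candidates=(",", "|", "\t", ";")):
    """Pick the candidate separator that appears most consistently on every line."""
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return None
    # Score each candidate by the fewest times it occurs on any single line.
    scores = {sep: min(line.count(sep) for line in non_empty) for sep in candidates}
    return max(scores, key=scores.get)


# Two lines in the style of a pipe-separated file; one also contains a comma.
sample_lines = [
    "1718|al||Tirana||Bizet|18381025|18750603|fr|m|Carmen|fr|~|20180130|5||",
    "1718|al||Tirana||Strauss,J|1825|1899|at|m|Die Fledermaus|at|~L|20180329|5||",
]
print(repr(guess_separator(sample_lines)))
```

In practice you would call `guess_separator(peek("data/some_file.csv"))`, where the file name is a placeholder, and then pass the result to the `sep` option of `read_csv()`.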


The pandas `read_csv()` function can be used to read comma separated files.

Do check the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for the range of options, which can be used when the data is not straightforward.

![](images/read_csv.JPG)

### Reading CSV files

Example One

What options might be used if the data looks like the following?

In [7]:
! head data/85043238.csv


"DC6206EW - NS-SeC by ethnic group by sex by age"
"ONS Crown Copyright Reserved [from Nomis on 30 August 2020]"
"Population :","All usual residents aged 16 and over"
"Units      :","Persons"
"Date       :","2011"
"Sex        :","All persons"
"Age        :","Age 16 to 24"
"Ethnic Group:","All categories: Ethnic group"



In [8]:
# Let's look at a few more rows:
! head -20 data/85043238.csv


"DC6206EW - NS-SeC by ethnic group by sex by age"
"ONS Crown Copyright Reserved [from Nomis on 30 August 2020]"
"Population :","All usual residents aged 16 and over"
"Units      :","Persons"
"Date       :","2011"
"Sex        :","All persons"
"Age        :","Age 16 to 24"
"Ethnic Group:","All categories: Ethnic group"

"2011 super output area - middle layer","mnemonic","1.1 Large employers and higher managerial and administrative occupations","1.2 Higher professional occupations","2. Lower managerial, administrative and professional occupations","3. Intermediate occupations","4. Small employers and own account workers","5. Lower supervisory and technical occupations","6. Semi-routine occupations","7. Routine occupations","L14.1 Never worked","L14.2 Long-term unemployed","L15 Full-time students","L17 Not classifiable for other reasons"

"Darlington 001","E02002559",2,14,77,89,11,39,92,66,23,3,321,0
"Darlington 002","E02002560",3,9,35,80,9,43,97,60,25,7,242,0
"Darlington 003","E02002561"

In [9]:
# and the end of the file
! tail -15 data/85043238.csv

"Newport 014","W02000360",0,10,56,55,11,43,128,73,70,20,378,0
"Newport 015","W02000361",6,14,75,113,21,52,237,165,163,30,1020,0
"Newport 016","W02000362",3,10,53,51,6,25,66,55,21,7,260,0
"Newport 017","W02000363",1,4,36,68,4,21,75,39,35,12,234,0
"Newport 018","W02000364",5,9,61,84,13,50,148,80,139,23,931,0
"Newport 019","W02000365",1,5,57,99,9,36,146,94,108,24,316,0
"Newport 020","W02000366",1,20,76,103,10,45,96,54,25,7,340,0



"","In order to protect against disclosure of personal information, records"
"","have been swapped between different geographic areas. Some counts will"
"","be affected, particularly small counts at the lowest geographies"
"",""



- What seems to be the problem?
- How might we resolve this, looking at the `read_csv()` documentation?

Example Two 

(2021J TMA01 dataset):

In [10]:
! head data/stats1718.csv

﻿1718|al||Tirana||Bizet|18381025|18750603|fr|m|Carmen|fr|~|20180130|5||
1718|al||Tirana||Bizet|18381025|18750603|fr|m|Les Pecheurs de perles|fr|~|20170922|3|!|
1718|al||Tirana||Brahms|18330507|18970403|de|m|Ein deutsches Requiem|de|~O|20170930|1|c|
1718|al||Tirana||Strauss,J|1825|1899|at|m|Die Fledermaus|at|~L|20180329|5||
1718|am||Yerevan||Arutiunian|19200928|20120328|am|m|Sayat-Nova|am|~|20180313|1||
1718|am||Yerevan||Bizet|18381025|18750603|fr|m|Carmen|fr|~|20180321|1||
1718|am||Yerevan||Chukhadjian|1837|18980225|am|m|Arshak II|am|~|20171130|1||
1718|am||Yerevan||Donizetti|17971129|18480408|it|m|Poliuto|it|~|20171109|4||
1718|am||Yerevan||Puccini|18581222|19241129|it|m|Tosca|it|~|20171126|3||
1718|am||Yerevan||Tigranian|18791226|19500210|am|m|Anoush|am|~|20180301|2||


 - any issues here?
 - if so, how might we import the data?

Example Three

(2017J TMA01 dataset):

In [13]:
! head data/fruitveg.csv

﻿2004		Quality	Units	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
FRUIT															
Blackberries	All Varieties	-	£/kg							3.26	3.24	3.61	5.06		
Blackcurrants	All Varieties	-	£/kg							3.32	3.53				
Cherries	Sweet Black	1st	£/kg						1.69	1.72	3.68				
	Sweet Black	2nd	£/kg						1.11	1.08					
	Sweet Black	Ave	£/kg						1.53	1.54					
	Sweet White	1st	£/kg						1.53	1.56					
	Sweet White	2nd	£/kg							1.10					
	Sweet White	Ave	£/kg							1.47					


 - what are the issues?
 - how might we import the data?

## Measurement scales


After character encoding, the next consideration is classifying data so that it can be analysed. Part 2 introduces a scheme produced by Stanley Smith Stevens (Stevens, 1946), the NOIR measurement scale.

It is important to understand the classification and associated measurement scale so that you know what mathematical operations can legitimately be applied to the data found.

Stevens identified four classes: Nominal, Ordinal, Interval, Ratio (NOIR). The following diagram gives an overview: 

![](images/measurement.png)

Nominal and Ordinal values are discrete. They are distinct and separate, such as whole numbers representing the number of students taking TM351.

Interval and ratio values are continuous. They can be measured with precision and have an infinite number of values between two points, such as weight or temperature.
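
One way to make the scale explicit in pandas is to store nominal and ordinal data as categoricals. The sketch below is illustrative only: the blood groups and satisfaction ratings are made-up values, and ordered categoricals are one possible way to encode an ordinal scale, not the only one.

```python
import pandas as pd

# Nominal: categories with no inherent order (blood group is an illustrative choice).
blood_group = pd.Series(pd.Categorical(["A", "O", "B", "O", "AB", "O"]))

# Ordinal: categories with a meaningful order but no fixed spacing between them.
rating = pd.Series(
    pd.Categorical(
        ["good", "poor", "excellent", "good", "fair"],
        categories=["poor", "fair", "good", "excellent"],
        ordered=True,
    )
)

# Counting and finding the most common value make sense for nominal data.
most_common_group = blood_group.mode()[0]

# Ordering operations make sense only once the data is at least ordinal.
lowest_rating = rating.min()
at_least_good = (rating >= "good").sum()

print(most_common_group, lowest_rating, at_least_good)
```

Calling `.mean()` on either series should raise an error in pandas, which fits Stevens' scheme: averages need at least interval data.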

### NOIR question

What type of measurement is each of the following?
- hair colour
- weight
- IQ score
- Income
- Stage in development of a human

### Data Storage

After acquiring the data, we need to store it in an appropriate structure and make its semantics (meaning) identifiable.

There are 2 common structures used:

**Tables:**
- structured into rows
- each row contains information about some thing
- each row contains the same number of possibly empty cells
- cells provide values of properties of the thing described by the row
- cells within the same column provide values for the same property for the thing described by the particular row

**Document:**
- any file or representation that embodies a particular data record
- conforms to a document structure
- Typical formats: JSON, XML
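
To show how a document can be turned into a table, the sketch below parses a cut-down JSON document shaped like the `role.json` sample and flattens its `objects` list with `pd.json_normalize()`. The second record and its values are invented for the illustration, and the real file has many more keys.

```python
import json

import pandas as pd

# A cut-down document shaped like the role.json sample; the second record is made up.
document = """
{
  "meta": {"limit": 100, "offset": 0, "total_count": 100},
  "objects": [
    {"description": "Junior Senator for Washington",
     "current": true,
     "enddate": "2025-01-03",
     "extra": {"office": "511 Hart Senate Office Building"}},
    {"description": "Example role with no extra details",
     "current": true,
     "enddate": "2027-01-03"}
  ]
}
"""

parsed = json.loads(document)

# json_normalize flattens nested keys into dotted column names;
# records lacking a key get a missing value in that column.
roles_df = pd.json_normalize(parsed["objects"])

print(roles_df.columns.tolist())
print(roles_df)
```

Note how the second record has no `extra` key: a document format allows this, whereas the table has to represent it as an empty cell.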

## Preparation

![](images/tm351_pt3_f01.eps.small.jpg)


### Cleansing the data

**Cleansing data**

Once the data is acquired, we need to clean it.

Common types of data errors include:

![](images/dataErrors.png)

**Handling dirty data**

![](images/handlingData.png)

Do not forget to document what has been added/deleted/changed.

See: Activity 3.2 Exploratory in Part 3, which reviews the following paper and is available from the OU Library:

Kim, W., Choi, B. J., Hong, E. K., Kim, S. K. and Lee, D. (2003) ‘A taxonomy of dirty data’, Data Mining and Knowledge Discovery, vol. 7, no. 1, pp. 81–99.


**Reshaping data**

This can involve:

- Removing unnecessary columns
- Filtering out rows which do not meet specified criteria
- Sorting rows by column combinations
- Combining datasets, which may first need their values harmonised. For example, one dataset could use abbreviations in a country column, whereas another uses the full name, e.g., "UK" versus "United Kingdom".

This can be carried out in one or more ways:
- OpenRefine
- Pandas
- SQL
 
Notebook 04.7 helps with reshaping
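
The reshaping steps listed above can be sketched in pandas as follows. All of the data is made up for the example, and the column names and the abbreviation mapping are assumptions, so treat it as a pattern to adapt.

```python
import pandas as pd

# Illustrative data only: the values below are made up for the example.
sales_df = pd.DataFrame(
    {
        "country": ["UK", "FR", "DE", "UK"],
        "year": [2020, 2020, 2020, 2021],
        "sales": [10, 7, 12, 11],
        "notes": ["", "", "", ""],
    }
)
population_df = pd.DataFrame(
    {
        "country_name": ["United Kingdom", "France", "Germany"],
        "population_millions": [67, 68, 83],
    }
)

# Remove an unnecessary column.
sales_df = sales_df.drop(columns=["notes"])

# Keep only the rows that meet a criterion.
sales_2020 = sales_df[sales_df["year"] == 2020]

# Sort rows by a combination of columns.
sales_2020 = sales_2020.sort_values(["sales", "country"], ascending=[False, True])

# Harmonise the country labels so the two datasets share a key.
abbreviations = {"UK": "United Kingdom", "FR": "France", "DE": "Germany"}
sales_2020 = sales_2020.assign(country_name=sales_2020["country"].map(abbreviations))

# Combine the datasets on the harmonised key.
combined_df = sales_2020.merge(population_df, on="country_name", how="left")
print(combined_df)
```

A left merge keeps every row of the first dataset, so any country that failed to match shows up with a missing population, which is an easy thing to check for and document.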


## Analysing the data

![](images/tm351_pt4_f01.eps.small.jpg)

**Visualising descriptive data**

Once you have imported and cleaned the data, it is ready for analysis. Descriptive text can be hard to make sense of, so visualisations can help present the data.

This can be achieved through a variety of graphics, e.g. bar charts, histograms, trend lines and box-and-whisker plots.

Visualisation leverages the capabilities and bandwidth of the visual system. The techniques can help to:
- move a huge amount of information into the brain very quickly
- take advantage of the brain's ability to identify patterns
- communicate relationships and meaning
- inspire new questions and further exploration
- identify sub-problems
- identify trends and outliers
- search for interesting or specific data points in a larger field, etc.

Visualisation helps to get behind the numbers.


**Descriptive analysis**

Values on an interval scale can be aggregated into basic statistical measures, including:

![](images/statistics.png)
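
A few such measures can be computed directly in pandas. The sketch below uses made-up daily temperatures in °C (an interval scale) with one unusually warm day, to show how the mean and median can differ.

```python
import pandas as pd

# Made-up daily temperatures in degrees Celsius, including one unusually warm day.
temperatures = pd.Series([12.0, 13.5, 14.0, 14.5, 15.0, 15.5, 24.0])

summary = {
    "mean": temperatures.mean(),
    "median": temperatures.median(),
    "std": temperatures.std(),  # sample standard deviation (ddof=1)
    "min": temperatures.min(),
    "max": temperatures.max(),
}
print(summary)

# describe() reports the count, mean, std, min, quartiles and max in one call.
print(temperatures.describe())
```

The single warm day pulls the mean above the median, which is a quick hint that the data contains an outlier worth investigating.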

**Correlation**
  
A technique that can show whether and how strongly pairs of variables are related.

Often demonstrated by using a scatterplot:

![](images/correlation.png)


Remember correlation does not imply causation:

![](images/spiders.png)

See further examples here: http://www.tylervigen.com/spurious-correlations 
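
The strength of a linear relationship can be summarised with a correlation coefficient. The sketch below uses the pandas `corr()` method, which computes Pearson's coefficient by default, on made-up data. The column names and values are invented for the illustration.

```python
import pandas as pd

# Made-up measurements for illustration.
study_df = pd.DataFrame(
    {
        "hours_studied": [1, 2, 3, 4, 5, 6],
        "score": [52, 55, 61, 64, 70, 73],
        "shoe_size": [7, 5, 9, 6, 8, 6],
    }
)

# Pearson correlation between two columns.
r_strong = study_df["hours_studied"].corr(study_df["score"])
r_weak = study_df["hours_studied"].corr(study_df["shoe_size"])
print(round(r_strong, 3), round(r_weak, 3))

# The full matrix of pairwise correlations.
print(study_df.corr())
```

A coefficient near +1 or -1 says the points lie close to a straight line. It says nothing about whether one variable causes the other.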


## Data presentation: 

![](images/tm351_pt5_f01.eps.small.jpg)


**Reporting the story**

For a data investigation you need to include a record of activities and your findings so the final report must contain:
- the story of the investigation
- the story discovered in the data

General points to consider:
- Keep it simple: plain English, short paragraphs and sentences, clear headings and avoid jargon
- Keep it focused: develop a logical structure, highlight key points in each section
- Consider the audience: use their language and format – get it proofread
- Support the reader through long reports: provide a road map and signposting

The Notebook can be used for the reporting

Part 5 looks at `Presentation: telling the story`

See Exercise 5.6 Exploratory in Section 5.2 The report, which provides a sample layout for a report in the Discussion: `TM351_Report_Outline.docx` 

If time permits, we can look at a sample Notebook that works through the data pipeline using Boys Names: `Tutorial 2 BoyNames.ipynb` and there is an equivalent notebook for Girls names: `Tutorial 2 GirlNames.ipynb`


## TMA01 Detailed look

**TMA - logistics**

***Preparation***

Before you start the TMA:  

- Download the zip files with the Notebooks in (make sure they are the latest version!)
- At the top of each Notebook put your Name and Personal Identifier (PI) in the boxes provided. This will allow your Tutor to identify your work
- Complete the TM351 Generative AI academic integrity declaration (included in the TMA01 zip file)

***Completion***

Upon completing the TMA all of your files should be in the 25j_tma01 folder including:
- your Notebook files containing answers to Q1 and 2 (do not rename these)
- the original data files in the data folder
- any additional data files that you have added, created or updated, in the data folder
- the Generative AI academic integrity declaration

***Submission***
- zip the 25j_tma01 directory and confirm that all your files are in the resulting archive
- submit your zip file to the online TMA/EMA Service.

***Remember*** to back-up your files regularly so as not to lose your work

***Deadline*** If you are unable to submit the TMA by 4th December, you must let your Tutor know beforehand


### TMA01 Question 1 (40 marks)

In this TMA, you will investigate a dataset of bats in the UK. You are interested in the questions:

- Where do most horseshoe bats hibernate?
- Which bat species hibernate in the vicinity of rural or urban areas?

All the data required can be found in the `data` directory.

- `records-2025-05-12.csv`
- `cities500.txt`

Also read the metadata file for the cities: `geonames_readme.txt`

Question 1 is made up of several parts:
    
***Data provenance, and importing and shaping the data***

1. Licensing for the Bats dataset (3)
2. Licensing for the City dataset (3)

You need to find the specific clauses of the licence that allow the OU to distribute the data, and the obligations the licence places on the OU.

3. Importing the bats dataset (3)
4. Importing the cities dataset (4)
    
For parts 3 and 4, the pandas `read_csv()` function can be used.

Remember:
- do some initial investigation to see what the data looks like and discuss any issues
- show the first  5 rows once the data is in a dataframe

Note: you do not need to work with all the data found in these files - you are given information on what columns and rows are needed.

See Notebook 02.2.1 for reading csv files

(13 marks)

**Cleaning, reshaping and combining the data sets**

5. Adding the `normalised_common_name` column (5)
6. Identifying urban centres (2)
7. Combining the datasets: urban/rural bats (5)
   
(12 marks)

**Visualising the data**

8. Plot Horseshoe bats’ hibernation sites (6)
9. Plot Urban and rural hibernation sites (5)

The Notebooks in Part 05 will help with drawing maps, such as Notebook 05.2 Getting started with maps - folium

(10 marks)

**Reflection on the data**

10. A colleague wonders whether the heuristic of defining an urban hibernation site as being within 10km of a city is a good one - give 2 potential criticisms and what problems they might raise. (4)

(4 marks)

40 marks


### Question 2 (60 Marks)

**The Task**

[Tuberculosis](https://en.wikipedia.org/wiki/Tuberculosis) is an extremely widespread and contagious disease, occurring in all the world's continents. Although the rate of successful treatment is improving, there are several groups who are at particular risk from the disease.

You are given two datasets:
- tuberculosis data: data/WHO_TB_data.csv
- economic data: data/API_11_DS2_en_csv_v2_88104 (made up of several files)

and have to consider two questions:
1. Is there a relationship between these relative rates, and a country's GINI index? 
2. Is there a relationship between these relative rates, and the amount of a country's wealth held by the poorest 20% of citizens? 

You must produce two graphical illustrations of the available data:
1. The first graph should show the relationship between different countries' rate of tuberculosis, the rate of tuberculosis in people who are HIV-positive, and the GINI coefficient.

2. The second graph should show the relationship between different countries' rate of tuberculosis, the rate of tuberculosis in people who are HIV-positive, and the income share held by that country's poorest 20%.



This question requires that you complete a number of tasks:
- You should check the licenses for the datasets, and explain why you are permitted to carry out your chosen analysis.
- You need to import the two datasets, possibly pre-processing them first.

You must examine and clean the datasets to allow you to produce the two graphical illustrations asked for. You should consider questions such as:
- Is there ambiguity in the dataset? (That is, are there aspects of the data which are unclear, and not documented?)
- Is any data missing from the datasets?
- Is there any dirtiness in the datasets, or inconsistency in how the data is represented between the two datasets?
- In each of these cases you should describe the problem and explain how you have handled it.
- You will need to reshape the data into a DataFrame.
- Finally, you should select a visualisation method for the data in the dataset, and present two visualisations of the data, with a description of how you think it should be interpreted. 


Summary of the marks given:

| Category | Number of marks |
|-----------------|-------|
| 1. Identify and Explain the Relevant Licensing Terms and Conditions | 5 |
| 2. Preprocess the Data (if applicable) | (\*) |
| 3. Import the Datasets | 5 |
|4. Clean and Reshape the Data | (\*)20 |
|5. Put the data into an appropriate form for plotting | 5 |
|6. Visualise the data | 10 |
|7. Interpret the plot | 5 |
|Presentation (not explicit in the structure) | 10 |

A structure is given in the Notebook to help you frame your answer:
- Identify and Explain the Relevant Licensing Terms and Conditions
- Preprocess the Data (if applicable) - See section `Using Tools Outside the Notebooks` on how to display images from OpenRefine
- Import the Datasets
- Clean and Reshape the Data
- Put the data into an appropriate form for plotting
- Visualise the data
- Interpret the plots

Some general guidance on presentation:
- You must present your answer in this notebook.
- Do not put too much text or code into each notebook cell.
- Text or markdown cells should contain one or two paragraphs at most
- code cells should not contain more than about ten lines of python.
- Ensure you use meaningful variable names in your code.
- You should have a specific cell whose return value is the dataframe described above.
- You should have a specific cell which plots the data in the dataframe.

**Important**

When finished, go to `Kernel>Restart Kernel and Run All Cells...` to check that you have not made alterations in the notebook that have unintentionally affected other parts. If you have file size issues you can use `Kernel>Restart Kernel and Clear Outputs of All Cells...` before submitting.

Do also ensure that you have not hard-coded paths to local files, for example by trying to load an image with: `![](c:/MyComputer_you_cannot_access/Mary/OU/images/myFile.jpg)`
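
A relative path avoids this problem. The sketch below builds one with `pathlib`. The folder and file names are hypothetical and only mirror the example above.

```python
from pathlib import Path

# Hypothetical file name, used only to illustrate the idea.
image_path = Path("images") / "myFile.jpg"

# A relative path is resolved against the notebook's own folder, so it still
# works after the folder is zipped and opened on another computer.
print(image_path.as_posix())
print(image_path.is_absolute())
```

In a markdown cell the same idea is simply `![](images/myFile.jpg)`, with the image saved inside your TMA folder.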

Make sure you include any files not provided by the OU, such as image files produced from OpenRefine.

### TurnItIn and Generative AI

The OU uses the plagiarism detection tool TurnItIn in some of its modules.

The OU defines plagiarism in part as:  
- using text obtained from assignment writing sites, organisations or private individuals.
- obtaining work from other sources and submitting it as your own.  

The Plagiarism policy can be found here:
- https://help.open.ac.uk/documents/policies/plagiarism 

Under the Assessment tab you can find a link to check a draft of your TMA for plagiarism: `Check your draft assignment using Turnitin`.
Do not leave it until the last minute to check!

**Using Generative AI responsibly**

Do make sure you read the information on using Generative AI: <a href="https://about.open.ac.uk/policies-and-reports/policies-and-statements/generative-ai-learning-teaching-and-assessment-ou-0">Guidance on Generative AI</a>.

Note the requirement to fill in the Generative AI academic integrity declaration in the document *TM351 Generative AI academic integrity declaration.docx* and to indicate at the top of your notebooks whether you have completed it.


### Further help and questions

Getting further help with the TMA:
- Forums
- Tutor
- Module team

**Do not forget the next iCMA is due on 20th November at 23:59!**


## References and Data Sets

### References

Kitchin, R. (2014) The data revolution: Big data, open data, data infrastructures and their consequences, Sage.

Stevens, S. S. (1946) ‘On the Theory of Scales of Measurement’, Science, vol. 103, no. 2684, pp. 677–80 [Online]. DOI: 10.1126/science.103.2684.677

### Data sets and Licences

* role.json
 
A lot of publicly available data is in JSON format. This notebook has used USA government politician datasets - US Senators (before the forthcoming 2024 elections!) found here: https://www.govtrack.us/api/v2/role?current=true&role_type=senator.

There is no obvious licence given, the Privacy Policy mentions: *GovTrack is offered to as a free public service to users (“You”, “Your”) in the United States by Civic Impulse, LLC (“We”, “Us”, “Our”)* (https://www.govtrack.us/legal)

The rest of the datasets are from previous TM351 TMAs or notebooks:

* SFR01_2016_UD_metadata.txt
* SFR01_2016_UD_metadata.csv

Used in 2020J TMA02 - Key Stage 4 data, which was originally downloaded from:
https://www.gov.uk/government/statistics/revised-gcse-and-equivalent-results-in-england-2014-to-2015

Licence: OGL All content is available under the Open Government Licence v3.0, except where otherwise stated (https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)

* 85043238.csv - Used in 2020J TMA02

UK Census data Report DC6206EW - NS-SeC by ethnic group by sex by age (2011 data). 
Available at https://www.nomisweb.co.uk/query/construct/summary.asp?mode=construct&version=0&dataset=682 (Produced by OU: 31 August 2020).

Nomis is a service provided by Office for National Statistics (ONS), which provides data under the OGL licence v3.0

* stats1718.csv - Used in 2021J TMA01

This is a dataset for the 2017-2018 opera season, which is part of the larger dataset:
Cuntz, Alexander, 2020, "Replication Data for: Grand rights and opera reuse today", https://doi.org/10.7910/DVN/8LUFN8, Harvard Dataverse, V1

It is stored as a csv file called stats1718.csv in the data directory. This dataset was obtained from the Harvard dataverse portal on 13th March, 2021 from:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/8LUFN8

Licence: Public Domain CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

* fruitveg.csv - Used in 2017J TMA01

This file contains the fruit and vegetable wholesale prices from 2004 to 2015
It is available from:  
https://www.data.gov.uk/dataset/c774f8b9-a606-4480-8cb8-ff6839d4b362/agricultural_market_reports
Licence: Open Government Licence v3.0 (https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)


## Answers


## Name that structure

- structured: CSV files usually fit into a tabular structure
- semi-structured: JSON consists of key-value pairs, which can be nested, and is usually semi-structured
- unstructured: this seems to be part of the documentation, which may not follow any particular structure

## Reading CSV files

**Example One**

The problem is that meta-data can be found at the start and end of the file. 

We can get read_csv to ignore these rows when importing, for example:

In [14]:
# skipfooter is not supported by the default C engine, so we ask for the Python engine
data_df = pd.read_csv("data/85043238.csv", engine="python", skiprows=10, skipfooter=7)
data_df.head()

Unnamed: 0,2011 super output area - middle layer,mnemonic,1.1 Large employers and higher managerial and administrative occupations,1.2 Higher professional occupations,"2. Lower managerial, administrative and professional occupations",3. Intermediate occupations,4. Small employers and own account workers,5. Lower supervisory and technical occupations,6. Semi-routine occupations,7. Routine occupations,L14.1 Never worked,L14.2 Long-term unemployed,L15 Full-time students,L17 Not classifiable for other reasons
0,Darlington 001,E02002559,2,14,77,89,11,39,92,66,23,3,321,0
1,Darlington 002,E02002560,3,9,35,80,9,43,97,60,25,7,242,0
2,Darlington 003,E02002561,2,7,55,70,5,37,77,39,26,5,217,0
3,Darlington 004,E02002562,2,4,50,87,8,42,130,89,60,25,240,0
4,Darlington 005,E02002563,0,11,52,58,7,28,90,67,47,8,186,0


In [15]:
#check nothing is lost at the end
data_df.tail()

Unnamed: 0,2011 super output area - middle layer,mnemonic,1.1 Large employers and higher managerial and administrative occupations,1.2 Higher professional occupations,"2. Lower managerial, administrative and professional occupations",3. Intermediate occupations,4. Small employers and own account workers,5. Lower supervisory and technical occupations,6. Semi-routine occupations,7. Routine occupations,L14.1 Never worked,L14.2 Long-term unemployed,L15 Full-time students,L17 Not classifiable for other reasons
7196,Newport 016,W02000362,3,10,53,51,6,25,66,55,21,7,260,0
7197,Newport 017,W02000363,1,4,36,68,4,21,75,39,35,12,234,0
7198,Newport 018,W02000364,5,9,61,84,13,50,148,80,139,23,931,0
7199,Newport 019,W02000365,1,5,57,99,9,36,146,94,108,24,316,0
7200,Newport 020,W02000366,1,20,76,103,10,45,96,54,25,7,340,0


**Example Two**

If we try and read it in directly, it will generate an error:

In [16]:
data_df = pd.read_csv("data/stats1718.csv")
data_df.head()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2


The `head` command showed that the data is in fact separated by `|` characters instead of commas, so we can use the `sep` option to deal with this:

In [17]:
data_df = pd.read_csv("data/stats1718.csv", sep="|")
data_df.head()

Unnamed: 0,1718,al,Unnamed: 2,Tirana,Unnamed: 4,Bizet,18381025,18750603,fr,m,Carmen,fr.1,~,20180130,5,Unnamed: 15,Unnamed: 16
0,1718,al,,Tirana,,Bizet,18381025,18750603,fr,m,Les Pecheurs de perles,fr,~,20170922,3,!,
1,1718,al,,Tirana,,Brahms,18330507,18970403,de,m,Ein deutsches Requiem,de,~O,20170930,1,c,
2,1718,al,,Tirana,,"Strauss,J",1825,1899,at,m,Die Fledermaus,at,~L,20180329,5,,
3,1718,am,,Yerevan,,Arutiunian,19200928,20120328,am,m,Sayat-Nova,am,~,20180313,1,,
4,1718,am,,Yerevan,,Bizet,18381025,18750603,fr,m,Carmen,fr,~,20180321,1,,


Hmm, the first row does not look like column headings, but rather data. 

In [18]:
data_df = pd.read_csv("data/stats1718.csv", sep="|", header=None)
data_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,1718,al,,Tirana,,Bizet,18381025,18750603,fr,m,Carmen,fr,~,20180130,5,,
1,1718,al,,Tirana,,Bizet,18381025,18750603,fr,m,Les Pecheurs de perles,fr,~,20170922,3,!,
2,1718,al,,Tirana,,Brahms,18330507,18970403,de,m,Ein deutsches Requiem,de,~O,20170930,1,c,
3,1718,al,,Tirana,,"Strauss,J",1825,1899,at,m,Die Fledermaus,at,~L,20180329,5,,
4,1718,am,,Yerevan,,Arutiunian,19200928,20120328,am,m,Sayat-Nova,am,~,20180313,1,,


Hmm, the columns are now just numbers, which is not very informative. 

A first step in cleaning this dataset would be to rename the columns to something more meaningful. This can be done when importing the data using the `names=["col_1", "col_2", ..., "col_n"]` option, or afterwards using the Pandas DataFrame `rename()` function. You would need to investigate any documentation provided with the data, such as the metadata, to find out what the columns mean.
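
Both approaches can be sketched as follows, using two rows from the `head` output shown earlier, held in memory so the example is self-contained. The names `city`, `composer` and `work` are guesses made from looking at the values, not taken from the dataset's documentation, so treat them as hypothetical.

```python
import io

import pandas as pd

# Two rows from the head output shown earlier, held in memory for the example.
sample = (
    "1718|al||Tirana||Bizet|18381025|18750603|fr|m|Carmen|fr|~|20180130|5||\n"
    "1718|am||Yerevan||Puccini|18581222|19241129|it|m|Tosca|it|~|20171126|3||\n"
)

# Option 1: supply placeholder names for all 17 columns at import time.
placeholder_names = [f"col_{i}" for i in range(17)]
opera_df = pd.read_csv(io.StringIO(sample), sep="|", header=None, names=placeholder_names)

# Option 2: rename selected columns afterwards.
# Hypothetical meanings, guessed from the values; confirm against the documentation.
opera_df = opera_df.rename(columns={"col_3": "city", "col_5": "composer", "col_10": "work"})

print(opera_df[["city", "composer", "work"]])
```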

**Example Three**

This time the values are tab separated, so we can use the `sep` option again. 

The file contains information on fruit and vegetables. It also contains a lot of null data and various other issues, which would need to be addressed if you were working with this dataset. BTW, can anyone remember when fruit and vegetables were seasonal and not available all year round?

In [19]:
data_df = pd.read_csv("data/fruitveg.csv", sep="\t")
data_df.head()

Unnamed: 0,2004,Unnamed: 1,Quality,Units,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,FRUIT,,,,,,,,,,,,,,,
1,Blackberries,All Varieties,-,£/kg,,,,,,,3.26,3.24,3.61,5.06,,
2,Blackcurrants,All Varieties,-,£/kg,,,,,,,3.32,3.53,,,,
3,Cherries,Sweet Black,1st,£/kg,,,,,,1.69,1.72,3.68,,,,
4,,Sweet Black,2nd,£/kg,,,,,,1.11,1.08,,,,,
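
As a hedged sketch of how some of those issues might be tackled, the example below works on a small in-memory extract in the same layout. The column names `Produce` and `Variety` are assumptions chosen for the illustration, and a real clean-up would need to handle the other years and sections in the file too.

```python
import io

import pandas as pd

# A small tab-separated extract in the same layout as fruitveg.csv.
sample = (
    "2004\t\tQuality\tUnits\tJun\tJul\n"
    "FRUIT\t\t\t\t\t\n"
    "Cherries\tSweet Black\t1st\t£/kg\t1.69\t1.72\n"
    "\tSweet Black\t2nd\t£/kg\t1.11\t1.08\n"
    "\tSweet White\t1st\t£/kg\t1.53\t1.56\n"
)
fruit_df = pd.read_csv(io.StringIO(sample), sep="\t")

# Give the first two columns meaningful names (assumed names, not from the source).
fruit_df = fruit_df.rename(columns={"2004": "Produce", "Unnamed: 1": "Variety"})

# Drop section heading rows such as FRUIT, which have no unit of measurement.
fruit_df = fruit_df.dropna(subset=["Units"])

# The produce name appears only on the first row of each group, so fill it down.
fruit_df = fruit_df.assign(Produce=fruit_df["Produce"].ffill())

print(fruit_df)
```

Filling down is only safe here because the blank cells mean "same as the row above". Whichever approach you take, document the assumption when you describe your cleaning steps.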


### NOIR measurements

Question - what type of measurement is:
- hair colour -> nominal
- weight -> ratio
- IQ score -> interval
- Income -> ratio
- Stage in development of a human -> ordinal