
<h1 style ="color:blue;">TM351 24J</h1>
<h2 style="color:blue;">Data management and analysis</h2>
<h2 style="color:blue;">TMA01 Preparation Tutorial</h2>

<h3>Mary Garvey</h3>
<h3>4th November 2024</h3>

In [None]:
# This cell imports the libraries (pandas and json) needed for this workbook

import pandas as pd
import json

## What we will cover today

* The data life cycle and data pipeline		
* Data acquisition
* Data preparation
* Data analysis
* Data presentation
* The TMA 

## Data life cycle and data pipeline


The module follows a data life cycle and pipeline:

![](images/tm351_pt1_f04.eps.jpg)

This takes into account what actually happens to the data, and the uses to which it is put, as it is processed.

The emphasis is on the management of data from its creation through to its reuse and eventual destruction:

![](images/tm351_pt1_f03.eps.jpg)

Many variants of the data life cycle and data pipeline exist, for example CRISP-DM (CRoss Industry Standard Process for Data Mining).

This module is built around the data pipeline, which allows:

* Completeness and correctness

Analysts need to be confident that the information they generate is correct.

* Reproducibility and provenance

Researchers want to be able to reproduce and prove their results. For example, a research paper can't just say the answer is 42 without showing how the result came about.

From your point of view, notebooks allow you to document your data processes and make them reproducible, so others can see your reasoning, re-run your entire analysis and maybe carry out further analysis.

Figure 1.9 from TM351 Part 1 illustrates a ‘big data’ workflow from a data-centric scientific research process:

![](images/tm351_pt1_f09.eps.jpg)


### Data Characteristics

Part 1 introduces various things to consider when working with data. 

Table 1.1 introduces Kitchin’s characterisation of data (Kitchin, 2014), which can be used as a checklist to characterise a dataset:

![](images/kitchin.png)

### Name that structure

For example, have a look at the following samples of data. For each one, decide how you would describe its structure (structured, semi-structured or unstructured):

In [None]:
! head data/SFR01_2016_UD_national_1.csv

The data as presented above is hard to read, but there do not appear to be any issues and it looks like comma-separated values (CSV).

Let's take a quick subset of the data, the first 5 columns and rows:

In [None]:
data_df = pd.read_csv("data/SFR01_2016_UD_national_1.csv", usecols = [0,1,2,3,4])
data_df.head()

Structured, semi-structured or unstructured?

In [None]:
! head data/role.json

Structured, semi-structured or unstructured?

In [None]:
! head data/SFR01_2016_UD_metadata.txt

In [None]:
! tail data/SFR01_2016_UD_metadata.txt

Structured, semi-structured or unstructured?

### Potential data issues


![](images/issues.png)


Can anyone think of other issues?

## Acquiring the data

![](images/tm351_pt2_f01.eps.small.jpg)

The first step in the data analysis process is to acquire the data.

* Has anyone found some sites containing interesting datasets?
For example, for work or hobbies

* Why is it important to look at other datasets and analysis?

* What data should be acquired?

* The TMA sometimes asks you to find data, but live datasets have been known to disappear or change!


### Found Data

Lots of data sets out there:

|Name|URL|
|-----|-------|
|ONS|<a href="https://www.ons.gov.uk">https://www.ons.gov.uk</a>|
|EU Stats|<a href="http://ec.europa.eu/eurostat">http://ec.europa.eu/eurostat</a>|
|European commission stats|<a href="http://ec.europa.eu/eurostat/data/statistics-a-z/abc">http://ec.europa.eu/eurostat/data/statistics-a-z/abc</a>|
|UK Government|<a href="https://www.gov.uk/government/statistics">https://www.gov.uk/government/statistics</a>|
|US Government|<a href="https://www.usa.gov/statistics">https://www.usa.gov/statistics</a>|
|Edinburgh University data share|<a href="http://datashare.is.ed.ac.uk/">http://datashare.is.ed.ac.uk/</a>|
|List of high quality data sets|<a href="https://github.com/caesar0301/awesome-public-datasets">https://github.com/caesar0301/awesome-public-datasets</a>|
|AWS public data sets|<a href="https://aws.amazon.com/datasets/">https://aws.amazon.com/datasets/</a>|
|Stanford: Computational Journalism lab|<a href="http://cjlab.stanford.edu/">http://cjlab.stanford.edu/</a>|
|KDNuggets: data sets for mining/discovery|<a href="https://www.kdnuggets.com/datasets/">https://www.kdnuggets.com/datasets/</a>|
|Kaggle: AI & ML community|<a href="https://www.kaggle.com/">https://www.kaggle.com/</a>|
|Halifax house prices|<a href="http://www.lloydsbankinggroup.com/media/economic-insight/halifax-house-price-index/">http://www.lloydsbankinggroup.com/media/economic-insight/halifax-house-price-index/</a>|
|Nationwide house prices|<a href="http://www.nationwide.co.uk/about/house-price-index/headlines">http://www.nationwide.co.uk/about/house-price-index/headlines</a>|
|Historical weather|<a href="http://www.wunderground.com/history">http://www.wunderground.com/history</a>|


### Issues with Data


Things to consider when acquiring data:

**Character encodings**

- These define the mapping between the digital representation of a character (the binary 1s and 0s that represent it) and the character itself
- Data from different sources could use different encodings, so we want to convert it into a common one, e.g., ASCII or a Unicode encoding such as UTF-8
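
As a small illustration of why encodings matter, the sketch below encodes one made-up string in two encodings and then decodes the UTF-8 bytes with the wrong codec. The sample text and variable names are invented for this example.

```python
# A made-up string containing two non-ASCII characters
text = "café £5"

# The same characters map to different byte sequences in different encodings
utf8_bytes = text.encode("utf-8")      # é and £ take two bytes each
latin1_bytes = text.encode("latin-1")  # every character takes one byte

print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding with the wrong codec does not fail here, it silently garbles the text
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # cafÃ© Â£5

# Going the other way is stricter: these Latin-1 bytes are not valid UTF-8
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```

Garbled characters like these in the output of `head` are a sign that the file's encoding needs to be stated explicitly when the data is imported.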

**Measurement scales**
- Nominal, ordinal, interval or ratio

**Data storage**
- tables (structured data) or documents (semi-structured data, such as JSON or XML)

We always tell you to check the data before importing it. The OS commands `head` and `tail` can be used for this.

Do check the results. Things to look for include:

- what the format is, e.g., CSV, JSON
- how is the data separated - not always commas, even in a CSV file
- any decoding issues
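
If the OS commands are not available, a similar check can be made from Python. The following is a minimal sketch, not part of the module materials: the `peek` helper and the sample file are hypothetical and exist only so that the example runs on its own.

```python
from itertools import islice
from pathlib import Path
import tempfile


def peek(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file, much like the OS `head` command."""
    with open(path, encoding=encoding) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Hypothetical sample file, written here only to make the sketch self-contained
sample_path = Path(tempfile.mkdtemp()) / "sample.csv"
sample_path.write_text("name;score\nAda;90\nAlan;85\n", encoding="utf-8")

first_lines = peek(sample_path, n=2)
for line in first_lines:
    print(line)
```

Looking at the printed lines shows that this "CSV" file is actually separated by semicolons, which is exactly the kind of detail worth spotting before importing.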




The pandas `read_csv()` function can be used to read comma separated files.

Do check the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for the range of options, which can be used when the data is not straightforward.

![](images/read_csv.jpg)
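
To give a flavour of those options, here is a small sketch that reads made-up data from memory rather than from a file. `sep`, `skiprows` and `usecols` are genuine `read_csv()` parameters; the data, the column names and `cities_df` are invented for illustration.

```python
import io

import pandas as pd

# Made-up file contents: two lines of preamble, then semicolon-separated data
raw = io.StringIO(
    "Report generated for illustration\n"
    "Source: made-up figures\n"
    "id;city;population\n"
    "1;Leeds;500000\n"
    "2;York;150000\n"
)

cities_df = pd.read_csv(
    raw,
    sep=";",                         # the separator is not a comma
    skiprows=2,                      # ignore the two preamble lines
    usecols=["city", "population"],  # keep only the columns we need
)
print(cities_df)
```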

### Reading CSV files

For example, what options might be used if the data looks like the following:


In [None]:
! head data/85043238.csv

In [None]:
# Let's look at a few more rows:
! head -20 data/85043238.csv

In [None]:
# and the end of the file
! tail -15 data/85043238.csv

- What seems to be the problem?
- How might we resolve this, looking at the `read_csv()` documentation?

Second example (2021J TMA01 dataset):

In [None]:
! head data/stats1718.csv

- Any issues here?
- If so, how might we import the data?

## Measurement scales


After character encoding, the next consideration is classifying data so that it can be analysed. Part 2 introduces a scheme produced by Stanley Smith Stevens (Stevens, 1946), the NOIR measurement scale.

It is important to understand the classification and associated measurement scale so that you know what mathematical operations can legitimately be applied to the data found.

Stevens identified four classes: Nominal, Ordinal, Interval, Ratio (NOIR). The following diagram gives an overview: 

![](images/measurement.png)

Nominal and Ordinal values are discrete. They are distinct and separate, such as whole numbers representing the number of students taking TM351.

Interval and ratio values are continuous. They can be measured with precision and have an infinite number of values between two points, such as weight or temperature.
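
The link between scale and legitimate operations can be sketched in pandas using categorical data. This is only an illustration with invented values (blood groups and T-shirt sizes are not from the module data): an unordered categorical behaves like a nominal scale, and an ordered categorical behaves like an ordinal scale.

```python
import pandas as pd

# Nominal: categories with no inherent order
blood_group = pd.Series(pd.Categorical(["A", "O", "B", "O"], ordered=False))

# Ordinal: categories with a meaningful order, S < M < L
shirt_size = pd.Series(
    pd.Categorical(["M", "S", "L", "M"], categories=["S", "M", "L"], ordered=True)
)

# The mode (most frequent value) is meaningful on a nominal scale
most_common_group = blood_group.mode()[0]

# Ordering operations such as max are meaningful on an ordinal scale
largest_size = shirt_size.max()

# ...but pandas refuses to order nominal values
try:
    blood_group.max()
    unordered_max_failed = False
except TypeError:
    unordered_max_failed = True

print(most_common_group, largest_size, unordered_max_failed)
```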

### NOIR question

What type of measurement is:
- hair colour
- weight
- IQ score
- Income
- Stage in development of a human

### Data Storage

After acquiring the data, we need to store it into appropriate complex structures and make the semantics (meaning) identifiable.

There are 2 common structures used:

**Tables:**
- structured into rows
- each row contains information about some thing
- each row contains the same number of possibly empty cells
- cells provide values of properties of the thing described by the row
- cells within the same column provide values for the same property for the thing described by the particular row

**Document:**
- any file or representation that embodies a particular data record
- conforms to a document structure
- Typical formats: JSON, XML
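
The two structures are related: a document can often be flattened into a table. The sketch below uses a made-up JSON document and the pandas `json_normalize()` function; the field names (`name`, `roles`, `title`, `year`) are invented for illustration.

```python
import json

import pandas as pd

# A made-up document: one record containing a nested list of roles
document = json.loads(
    """
    {
      "name": "Ada",
      "roles": [
        {"title": "tutor", "year": 2023},
        {"title": "author", "year": 2024}
      ]
    }
    """
)

# Flatten the nested list into rows, repeating the parent's name on each row
roles_df = pd.json_normalize(document, record_path="roles", meta=["name"])
print(roles_df)
```

Each row of the resulting table describes one role, and each column holds the value of the same property for every row.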

## Preparation

![](images/tm351_pt3_f01.eps.small.jpg)


### Cleansing the data

**Cleansing data**

Once the data is acquired, we need to clean it.

Common types of data errors include:

![](images/dataErrors.png)

**Handling dirty data**

![](images/handlingData.png)

Don't forget to document what has been added/deleted/changed.
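
One way to keep such a record is to count rows before and after each cleaning step. The following is a minimal sketch on invented data, assuming that inconsistent spelling, missing names and duplicate rows are the problems to be handled; your own data may need different treatment.

```python
import pandas as pd

# Made-up dirty data: inconsistent case and spacing, a duplicate and a missing name
raw_df = pd.DataFrame(
    {
        "city": ["Leeds", "leeds ", "York", "York", None],
        "visitors": [10, 10, 7, 7, 3],
    }
)

clean_df = raw_df.copy()  # keep the original untouched

# Step 1: normalise the spelling of city names
clean_df["city"] = clean_df["city"].str.strip().str.title()

# Step 2: drop rows with no city, then drop exact duplicates
rows_before = len(clean_df)
clean_df = clean_df.dropna(subset=["city"]).drop_duplicates()
rows_removed = rows_before - len(clean_df)

# Record what changed so the cleaning can be reported
print(f"Removed {rows_removed} of {rows_before} rows")
```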

See: Activity 3.2 Exploratory in Part 3, which reviews the following paper and is available from the OU Library:

Kim, W., Choi, B. J., Hong, E. K., Kim, S. K. and Lee, D. (2003) ‘A taxonomy of dirty data’, Data Mining and Knowledge Discovery, vol. 7, no. 1, pp. 81–99.


**Reshaping data**

This can involve:

- Removing unnecessary columns
- Filtering rows which do not meet specified criteria
- Sorting rows by column combinations

This can be carried out in one or more ways:
- OpenRefine
- Pandas
- SQL
 
Notebook 04.7 helps with reshaping
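
As a quick reminder of what the three reshaping steps look like in pandas, here is a sketch on an invented DataFrame. The town names, column names and the population threshold are all made up.

```python
import pandas as pd

towns_df = pd.DataFrame(
    {
        "town": ["Ayton", "Beeford", "Catwick", "Dunswell"],
        "population": [1200, 800, 3500, 650],
        "notes": ["", "", "", ""],
    }
)

reshaped_df = (
    towns_df.drop(columns=["notes"])       # remove an unnecessary column
    .query("population >= 800")            # keep rows meeting a criterion
    .sort_values(["population", "town"], ascending=[False, True])  # sort rows
    .reset_index(drop=True)
)
print(reshaped_df)
```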


## Analysing the data

![](images/tm351_pt4_f01.eps.small.jpg)

**Visualising descriptive data**

Once you have imported and cleaned the data, it is now ready for analysis. Descriptive text can be hard to make sense of, so visualisations can help present the data.

This can be achieved through a variety of graphics, e.g. bar charts, histograms, trend lines and box-and-whisker plots.

Visualisation leverages the capabilities and bandwidth of the visual system. The techniques can help to:
- move a huge amount of information into the brain very quickly
- take advantage of the brain's ability to identify patterns
- communicate relationships and meaning
- inspire new questions and further exploration
- identify sub-problems
- identify trends and outliers
- search for interesting or specific data points in a larger field, etc.

Visualisation helps to get behind the numbers


**Descriptive analysis**

For elements on an interval scale we can aggregate into basic statistical measures including: 

![](images/statistics.png)
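
Common measures such as the mean, median, mode, range and standard deviation are all available in pandas. The sketch below uses an invented set of scores; `describe()` gives several of these measures in one call.

```python
import pandas as pd

# Made-up scores
scores = pd.Series([52, 61, 61, 70, 86])

summary = {
    "mean": scores.mean(),
    "median": scores.median(),
    "mode": scores.mode().tolist(),
    "range": scores.max() - scores.min(),
    "std": scores.std(),  # sample standard deviation
}
print(summary)

# A one-call overview: count, mean, std, min, quartiles and max
print(scores.describe())
```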

**Correlation**
  
A technique that can show whether and how strongly pairs of variables are related.

Often demonstrated by using a scatterplot:

![](images/correlation.png)
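
The strength of a linear relationship can also be expressed as a number. As a sketch on invented data, the pandas `corr()` method computes Pearson's correlation coefficient by default, which ranges from -1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive).

```python
import pandas as pd

# Made-up data: hours of study and test score for five students
study_df = pd.DataFrame(
    {"hours": [1, 2, 3, 4, 5], "score": [52, 55, 61, 64, 70]}
)

# Pearson's r between the two columns
r = study_df["hours"].corr(study_df["score"])
print(round(r, 3))
```

A value close to +1, as here, indicates a strong positive linear relationship between the two variables.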


Remember correlation does not imply causation:

![](images/spiders.png)

See further examples here: http://www.tylervigen.com/spurious-correlations 


## Data presentation: 

![](images/tm351_pt5_f01.eps.small.jpg)


**Reporting the story**

For a data investigation you need to include a record of your activities and your findings, so the final report must contain:
- the story of the investigation
- the story discovered in the data

General points to consider:
- Keep it simple: plain English, short paragraphs, avoid jargon
- Keep it focused: develop a logical structure, highlight key points in each section
- Consider the audience: use their language and format – get it proofread
- Support the reader through long reports: provide a road map and signposting

The Notebook can be used for the reporting

Part 5 looks at `Presentation: telling the story`

See Exercise 5.6 Exploratory in Section 5.2 The report, which provides a sample layout for a report in the Discussion: `TM351_Report_Outline.docx` 

If time permits, we can look at a sample Notebook that works through the data pipeline using Boys Names: `Tutorial 2 BoyNames.ipynb` and there is an equivalent notebook for Girls names: `Tutorial 2 GirlNames.ipynb`


## TMA01 Detailed look

**TMA - logistics**

***Preparation***

Before you start the TMA:  

- Download the zip files with the Notebooks in
- At the top of each Notebook put your Name and Personal Identifier (PI) in the boxes provided. This will allow your Tutor to identify your work

***Completion***

Upon completing the TMA all of your files should be in the 2024j_tma01 folder including:
- your Notebook files containing answers to Q1 and 2 (do not rename these)
- the original data files in the data folder
- any additional data files that you have added, created or updated, in the data folder

***Submission***
- zip the 2024j_tma01 directory and confirm that all your files are in the resulting archive
- submit your zip file to the online TMA/EMA Service.

***Remember*** to back-up your files regularly so as not to lose your work

***Deadline*** If you are unable to submit the TMA by 5th December you must let your Tutor know beforehand


### TMA01 Question 1 (40 marks)

In this TMA, you will investigate a dataset of museums in the UK. You are interested in the questions:

- Where are museums located in the United Kingdom?
- Are the country’s small museums equitably distributed among smaller communities?

All the data required can be found in the `data` directory.

- `MappingMuseumsData2021_09_30.csv`
- `cities500.txt`

Also read the meta-data file for the cities:  geonames_readme.txt

Question 1 is made up of several parts:
    
***Data provenance, and importing and shaping the data***

1. Licensing for the Museums dataset (2)
2. Licensing for the City dataset (3)

You need to find the specific clauses of the licence that allow the OU to distribute the data, and the obligations the licence places on the OU.

3. Importing the museums dataset (3)
4. Removing museums which have closed (2)  
5. Importing the cities dataset (4)
    
For parts 3 and 5, the pandas `read_csv()` function can be used.

Remember:
- do some initial investigation to see what the data looks like and discuss any issues
- show the first  5 rows once the data is in a dataframe

Note: you do not need to work with all the data found in these files - you are given information on what columns and rows are needed.

See Notebook 02.2.1 for reading csv files

(14 marks)

**Cleaning the data**

6. Remove duplicates from the uk_cities_df DataFrame (3)
7. Identifying discrepancies between the datasets (1)
8. Correcting the discrepancies between the datasets (6)
9. Estimate the quality of the data cleaning (1)

(11 marks)

**Visualising the data**

10. Plot and comment upon the museums’ locations (4+1)
11. Plot and comment upon the number of museums by size of town or city (4+1)

The Notebooks in Part 05 will help with drawing maps, such as Notebook 05.2 Getting started with maps - folium

(10 marks)

**Reflection on the data**

12. Alternative to selecting by population (5)

(5 marks)

### Question 2 (60 Marks)

**The Task**

You are given two datasets:
- world happiness data: data/happiness_2024.xls
- economic data: data/cultural_heritage_data (made up of several files)

and have to consider two questions:
- Is there a relationship between the amount that a country spends on cultural heritage per person, and the happiness of that country’s population?
- Is there a relationship between the amount that a country spends on cultural heritage per person, and the generosity of that country’s population?

You must produce two graphical illustrations of the available data:
- The first graph should illustrate the relationship between the amount that a country spends on cultural heritage per person, and the happiness of that country’s population.
- The second graph should illustrate the relationship between the amount that a country spends on cultural heritage per person, and the generosity of that country’s population.


This question requires that you complete a number of tasks:
- You should check the licenses for the datasets, and explain why you are permitted to carry out your chosen analysis.
- You need to import the two datasets, possibly pre-processing them first.

You must examine and clean the datasets to allow you to produce the two graphical illustrations asked for. You should consider questions such as:
- Is there ambiguity in the dataset? (That is, are there aspects of the data which are unclear, and not documented?)
- Is any data missing from the datasets?
- Is there any dirtiness in the datasets, or inconsistency in how the data is represented between the two datasets?
- In each of these cases you should describe the problem and explain how you have handled it.
- You will need to reshape the data into a DataFrame.
- Finally, you should select a visualisation method for the data in the dataset, and present two visualisations of the data, with a description of how you think it should be interpreted. 


Summary of the marks given:

| Category | Number of marks |
|-----------------|-------|
| 1. Identify and Explain the Relevant Licensing Terms and Conditions | 5 |
| 2. Preprocess the Data (if applicable) | (\*) |
| 3. Import the Datasets | 5 |
|4. Clean and Reshape the Data | (\*)20 |
|5. Put the data into an appropriate form for plotting | 5 |
|6. Visualise the data | 10 |
|7. Interpret the plot | 5 |
|Presentation (not explicit in the structure) | 10 |

A structure is given in the Notebook to help you frame your answer:
- Identify and Explain the Relevant Licensing Terms and Conditions
- Preprocess the Data (if applicable) - See section `Using Tools Outside the Notebooks` on how to display images from Open Refine
- Import the Datasets
- Clean and Reshape the Data
- Put the data into an appropriate form for plotting
- Visualise the data
- Interpret the plots

Some general guidance on presentation:
- You must present your answer in this notebook.
- Do not put too much text or code into each notebook cell.
- Text or markdown cells should contain one or two paragraphs at most
- code cells should not contain more than about ten lines of python.
- Ensure you use meaningful variable names in your code.
- You should have a specific cell whose return value is the dataframe described above.
- You should have a specific cell which plots the data in the dataframe.

**Important**

When finished, go to `Kernel>Restart Kernel and Run All Cells...` to check that you have not made alterations in the notebook that have unintentionally affected other parts. If you have file size issues you can use `Kernel>Restart Kernel and Clear Outputs of All Cells...` before submitting.

Do also ensure that you have not hard-coded paths to local files, for example by trying to load an image with: `![](c:/MyComputer/Mary/OU/images/myFile.jpg)`
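
A relative path such as `![](images/myFile.jpg)` will still work when your tutor unzips your submission on a different computer. The same applies to files loaded in code. The sketch below is only an illustration, and the file name is the made-up one from the example above.

```python
from pathlib import Path

# Build the path relative to the notebook's folder, not to your own machine
image_path = Path("images") / "myFile.jpg"

# A relative path travels with the zipped folder; an absolute one does not
is_portable = not image_path.is_absolute()
print(image_path.as_posix(), is_portable)
```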

Make sure you include any files not provided by the OU, such as image files produced from Open Refine.

### TurnItIn

The OU has introduced the plagiarism detection tool TurnItIn in some of its modules.

The OU defines plagiarism in part as:  
- using text obtained from assignment writing sites, organisations or private individuals.
- obtaining work from other sources and submitting it as your own.  

The Plagiarism policy can be found here:
- https://help.open.ac.uk/documents/policies/plagiarism 

Under the Assessment tab you can find a link to check a draft of your TMA for plagiarism: `Check your draft assignment using Turnitin`.
Do not leave it until the last minute to check!

**Breaking News**

See the News item from: 25 Oct 2024:
*How to use Generative AI responsibly...*


### Further help and questions

Getting further help with the TMA:
- Forums
- Tutor
- Module team

**Do not forget the next iCMA is due on 21st November at 23:59!**


## References and Data Sets

### References

Kitchin, R. (2014) The data revolution: Big data, open data, data infrastructures and their consequences, Sage.

Stevens, S. S. (1946) ‘On the Theory of Scales of Measurement’, Science, vol. 103, no. 2684, pp. 677–80 [Online]. DOI: 10.1126/science.103.2684.677

### Data sets and Licences

* role.json
 
A lot of publicly available data is in JSON format. This notebook has used USA government politician datasets - US Senators (before the forthcoming 2024 elections!) found here: https://www.govtrack.us/api/v2/role?current=true&role_type=senator.

There is no obvious licence given. The Privacy Policy mentions: *GovTrack is offered to as a free public service to users (“You”, “Your”) in the United States by Civic Impulse, LLC (“We”, “Us”, “Our”)* (https://www.govtrack.us/legal)

The rest of the datasets are from previous TM351 TMAs:

* SFR01_2016_UD_metadata.txt
* SFR01_2016_UD_metadata.csv

Used in 2020J TMA02 - Key Stage 4 data, which was originally downloaded from:
https://www.gov.uk/government/statistics/revised-gcse-and-equivalent-results-in-england-2014-to-2015

Licence: OGL All content is available under the Open Government Licence v3.0, except where otherwise stated (https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)

* 85043238.csv - Used in 2020J TMA02

UK Census data Report DC6206EW - NS-SeC by ethnic group by sex by age (2011 data). 
Available at https://www.nomisweb.co.uk/query/construct/summary.asp?mode=construct&version=0&dataset=682 (Produced by OU: 31 August 2020).

Nomis is a service provided by Office for National Statistics (ONS), which provides data under the OGL licence v3.0

* stats1718.csv - Used in 2021J TMA01

This is a dataset for the 2017-2018 opera season, which is part of the larger dataset:
Cuntz, Alexander, 2020, "Replication Data for: Grand rights and opera reuse today", https://doi.org/10.7910/DVN/8LUFN8, Harvard Dataverse, V1

It is stored as a csv file called stats1718.csv in the data directory. This dataset was obtained from the Harvard dataverse portal on 13th March, 2021 from:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/8LUFN8

Licence: Public Domain CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

## Answers


### Name that structure

- structured: CSV files usually fit into a tabular structure
- semi-structured: JSON consists of key-value pairs and is usually semi-structured
- unstructured: this seems to be part of the documentation, which may not follow any particular structure

### Reading CSV files

First example.

The problem is that meta-data can be found at the start and end of the file. 

We can get read_csv to ignore these rows when importing, for example:

In [None]:
# The default C engine does not support skipfooter, so request the Python engine explicitly
data_df = pd.read_csv("data/85043238.csv", engine="python", skiprows=10, skipfooter=7)
data_df.head()

In [None]:
# Check nothing has been lost at the end
data_df.tail()

Second example.

If we try and read it in directly, it will generate an error:

In [None]:
data_df = pd.read_csv("data/stats1718.csv")
data_df.head()

The `head` command showed that the data is in fact separated by |'s instead of commas, so we can use the `sep` option to deal with this:

In [None]:
data_df = pd.read_csv("data/stats1718.csv", sep="|")
data_df.head()

Hmm, the first row does not look like column headings, but rather data. 

In [None]:
data_df = pd.read_csv("data/stats1718.csv", sep="|", header=None)
data_df.head()

Hmm, the columns are now just numbers, which is not very meaningful. Something for you to think about.....

### NOIR measurements

Question - what type of measurement is:
- hair colour -> nominal
- weight -> ratio
- IQ score -> interval
- Income -> ratio
- Stage in development of a human -> ordinal