# Week 10 - Data Management

Learning Outcomes (Week 10)  
By the end of this week, you should be able to:
- Describe data management requirements from internal (data lifecycle) and external (data value chain) perspectives
- Understand conflicting business and legal objectives
- Differentiate between aspects of data governance and data management
- Examine data management with frameworks
- Recognise the relationship between ethics, privacy, storage, security and analysis

## 1. Data management

### Data Quality
You want the data you are using to be of sufficient quality for your purpose  
- Accuracy  
- Completeness  
- Consistency  
- Integrity  
- Reasonability  
- Timeliness  
- Uniqueness/deduplication  
- Validity  

(Source: Data Management Association, DAMA)

Much of this is a data management issue.  
Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets. It is not just data quality! Its goals are to:
- organise, maintain and protect data, ensuring its quality and accessibility throughout its lifecycle
- achieve data availability while minimising redundancy and managing latency
- develop a policy for user privacy
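
Several of the data quality dimensions listed above can be measured directly. The following is a minimal sketch, assuming hypothetical customer records with hypothetical field names (`id`, `email`, `age`). The validity rule (an age between 0 and 120) is an illustrative assumption, not a DAMA definition.

```python
# Hypothetical records; field names and values are made up for illustration.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 51},
    {"id": 2, "email": "b@example.com", "age": 51},   # duplicate id
    {"id": 3, "email": "c@example.com", "age": 430},  # implausible age
]


def completeness(rows, field):
    """Fraction of rows where the field is present (not None)."""
    return sum(r.get(field) is not None for r in rows) / len(rows)


def uniqueness(rows, key):
    """Number of distinct key values divided by the number of rows."""
    return len({r[key] for r in rows}) / len(rows)


def validity(rows, field, rule):
    """Fraction of rows whose field value satisfies the rule."""
    return sum(bool(rule(r[field])) for r in rows) / len(rows)


report = {
    "email completeness": completeness(records, "email"),
    "id uniqueness": uniqueness(records, "id"),
    # Assumed rule: a plausible human age lies between 0 and 120.
    "age validity": validity(records, "age", lambda a: 0 <= a <= 120),
}

for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Scores below 1.0 flag a dimension that needs attention. What counts as "sufficient quality" still depends on the purpose the data is used for.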

### Potential Data Management Issues
- Medical informatics: for predicting fungal infections from nursing notes, the team needs to abide by confidentiality and security requirements.
- Internet advertising: what implicit and explicit data is stored about a user?
- Retailing: conduct market intelligence on new products; put together data from different divisions (brands) within the company.
- Predictive medical system: implementation may require changing standard operating procedures for staff.

### Data Management Contexts
- Science: reproducibility and credibility of scientific work, producing artifacts of knowledge, creating scientific data
- Business: governance, compliance, information privacy, etc.
- Curation: e.g. museums and libraries, preservation, maintenance, etc.
- Government: operates in a unique legislative environment (e.g., “transparency” requirements), with archiving, freedom of information (FOI) requests, support for data infrastructure, etc.
- Medicine: significant privacy issues, conflicting corporate financial constraints, government regulations and furthering of medical science


## 2. Data lifecycles
Standard Value Chain (as before)
- Collection: getting the data
- Wrangling: data preprocessing, cleaning
- Analysis: discovery (learning, visualisation, etc.)
- Presentation: arguing that results are significant and useful
- Engineering: storage and computational resources
- Governance: overall management of data
- Operationalisation: putting the results to work

Data lifecycle
- Creating data
- Processing data
- Analysing data
- Preserving data
- Giving access to data
- Reusing data

DataOne model
- Plan
- Collect
- Assure
- Describe
- Preserve
- Discover
- Integrate
- Analyze  

![image.png](attachment:image.png)
<style type="text/css">
    img {
        width: 400px;
    }
</style>  

Different companies have different lifecycles to follow


## 3. Data governance and responsibilities
### Supporting and handling:
- ethics, confidentiality
- security
- consolidation and quality-assurance (e.g. link all customer related information together)
- persistence (backups and recoverability)
- regulatory compliance
- organisation policy compliance
- organisation business outcomes

These responsibilities may include handling the steps in the data science and/or big data value chain.

### Governance and management
Data governance and data management are often used interchangeably.  
It is better to treat them as separate levels:  
- Data Management is what you do to handle the data
    - Resources, practices, enacting policies
- Data Governance is making sure that it is done appropriately
    - Policies, training, providing resources
    - Planning and understanding

### Legal and ethical responsibilities
- Must follow laws
    - Australian privacy law
    - Australian medical data regulations
    - Australian Telecommunications Act
    - EU’s General Data Protection Regulation (GDPR)
- Must meet (funding) requirements
    - Australian Research Council (ARC)
    - National Health and Medical Research Council (NHMRC)
- Must be ethical
    - Don’t be evil!
    - Rights for
        - Privacy
        - Access
        - Erasure
        - etc
    - Work with the stakeholders
    - Be transparent and clear
- Confidentiality
- Ownership
- Copyright
- Intellectual property
- Licensing

Just because a data science project ends does not mean the data curation should!

### Regulations and Compliance
- Regulations are devised by various government bodies: taxation,
  medical care, securities and investments, work health and
  safety, employment, corporate law.
- These bodies need to check companies for compliance.
- Regulatory compliance:
    - organisations ensure that they are aware of, and take
      steps to comply with, relevant laws and regulations.
- Auditing:
    - systematic and independent examination of books,
      accounts, documents and vouchers of an organisation to
      ascertain how far they present a true and fair view
- Audit data and records are a good source of data for data science.
### Terminology
For our purposes, we define:
- Privacy as having control over how one shares oneself, e.g.,
  closing the blinds in your living room
- Confidentiality as information privacy, how information about an
  individual is treated and shared, e.g., excluding others from
  viewing your search terms or browsing history
- Security as the protection of data, preventing it from being
  improperly used, e.g., preventing hackers from stealing credit
  card data
- Ethics as the moral handling of data, e.g., not on-selling others’
  private data to scammers
- Implicit data as data that is not explicitly stored but can be inferred with
  reasonable precision from available data, see “Private traits and
  attributes are predictable ...”

Data privacy focuses on who has the right to access data, while
data security focuses on protecting data from unauthorized access
or breaches.  
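
One common way to support confidentiality when sharing data for analysis is to replace direct identifiers with keyed hashes (pseudonymisation). The following is a minimal sketch using Python's standard `hmac` and `hashlib` modules. The key, the field names and the search records are hypothetical, and in practice the key would have to be stored and managed securely.

```python
import hashlib
import hmac

# Hypothetical secret; a real key must be generated and stored securely.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymise(identifier, key=SECRET_KEY):
    """Map an identifier to a stable pseudonym using a keyed SHA-256 hash."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability


# Made-up search log used only for illustration.
searches = [
    {"user": "alice@example.com", "query": "flu symptoms"},
    {"user": "bob@example.com", "query": "cheap flights"},
    {"user": "alice@example.com", "query": "pharmacy near me"},
]

# The shared copy keeps the link between a user's records but not the email.
shared = [
    {"user": pseudonymise(row["user"]), "query": row["query"]}
    for row in searches
]

for row in shared:
    print(row)
```

The same user still maps to the same pseudonym, so analysis across records remains possible. This also means implicit data can still be inferred from the remaining fields, so pseudonymisation alone does not guarantee confidentiality.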


## 4. Stakeholders and the data scientist
Who's responsible?  
Stakeholders are any parties that have a relationship with a
project/policy/product/data.  
This includes
- the data’s source
- managers
- analysts and users
- IT developers
- data scientists!

With great data comes great responsibility for all stakeholders.  

![image.png](attachment:image.png)
<style type="text/css">
    img {
        width: 400px;
    }
</style>  



## 5. Data management planning
How do you get it all right?
- Policies and laws
    - rights, Australian privacy principles, EU GDPR
- Procedures and practices
    - access, ownership, security
- Planning and training
    - data management plans, design
- Management and capability
    - technology, staffing
- Governance
    - oversight & review, ethics

A Data Management Plan provides:
- Clarity
- Direction
- Transparency
- Expectations

The result is:
- Improvements to efficiency, protection, quality and exposure
- Value
- Innovation

It contains:
- Backups
- Survey of existing data
- Data owners & stakeholders
- File formats
- Metadata
- Access and security
- Data organisation
- Bibliography
- Storage
- Data sharing, publishing and archiving
- Destruction
- Responsibilities
- Budget
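
The contents list above can double as a completeness checklist for a draft plan. The following is a minimal sketch, assuming the plan is drafted as a Python dictionary keyed by section name. The section names come from the list above; the draft content and the function name `missing_sections` are hypothetical.

```python
# Section names taken from the Data Management Plan contents list.
REQUIRED_SECTIONS = [
    "Backups",
    "Survey of existing data",
    "Data owners & stakeholders",
    "File formats",
    "Metadata",
    "Access and security",
    "Data organisation",
    "Bibliography",
    "Storage",
    "Data sharing, publishing and archiving",
    "Destruction",
    "Responsibilities",
    "Budget",
]


def missing_sections(plan):
    """Return the required sections that are absent or empty in the plan."""
    return [
        section
        for section in REQUIRED_SECTIONS
        if not (plan.get(section) or "").strip()
    ]


# Hypothetical draft; "Metadata" is present but still empty.
draft_plan = {
    "Backups": "Nightly copy to a second site",
    "File formats": "CSV and Parquet",
    "Metadata": "",
    "Storage": "Encrypted project share",
}

gaps = missing_sections(draft_plan)
print(f"{len(gaps)} of {len(REQUIRED_SECTIONS)} sections still to write:")
for section in gaps:
    print(f"- {section}")
```

A check like this only confirms that each section exists. Whether the content is adequate remains a governance question for review.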

Capability Maturity Model
- Good management happens all through the data lifecycle
- 4 key process areas:
    - Data acquisition, processing and quality assurance  
    Goal: Reliably capture and describe scientific data in a way that
    facilitates preservation and reuse
    - Data description and representation  
    Goal: Create quality metadata for data discovery, preservation, and
    provenance functions
    - Data dissemination  
    Goal: Design and implement interfaces for users to obtain and
    interact with data
    - Repository services/preservation  
    Goal: Preserve collected data for long-term use
- Good data governance uses a good management system
    - A mature system manages data all through the data lifecycle and
    throughout all projects.

![image.png](attachment:image.png)
<style type="text/css">
    img {
        width: 400px;
    }
</style>  

Universality
- Data management and governance are not things just
  arranged for each project.
- They should be universal in how an organisation
  thinks about and approaches data
    - at all times
    - in all divisions
    - in all projects
    - for all stakeholders

## 6. Examples of good and bad data governance
Assignment 3 - Don’t forget data management and governance!
- Access & security
- Software & hardware
- Regulations, ethics & licensing
- Stakeholders & transparency

### Case studies  
#### Robodebt
- The Australian government wanted to double-check the incomes of people
  receiving social welfare payments.
- The Online Compliance Intervention system (aka Robodebt) was set up
  in 2016 to automatically compare ATO records to Centrelink records.
    - Calculates the benefits that people are entitled to, based on
    assumptions about their earnings
    - Sends debt collection letters if it calculates that benefits have been overpaid
- Problems were discovered, stemming from the lack of a human in the loop for double-checking
    - Incorrect/inappropriate calculations
    - Using out-of-date data
    - Sending debt notices to dead people & pensioners
- In November 2019, the Federal Court declared that the averaging process
  using ATO income data to calculate debts was unlawful
    - Government stopped the system
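
To see why income averaging can produce incorrect debts, consider a minimal sketch with entirely hypothetical figures. The payment amount, income-free area and taper rate below are illustrative assumptions and are not real Centrelink rules. The only feature taken from the case is that annual income was spread evenly across the year's fortnights.

```python
FORTNIGHTS = 26
FULL_PAYMENT = 500.0      # hypothetical full benefit per fortnight
INCOME_FREE_AREA = 300.0  # hypothetical income allowed before any reduction
TAPER = 0.5               # hypothetical reduction per dollar above the free area


def entitlement(income):
    """Benefit payable for one fortnight under the assumed rules."""
    reduction = max(0.0, income - INCOME_FREE_AREA) * TAPER
    return max(0.0, FULL_PAYMENT - reduction)


# A person works for 13 fortnights, then is out of work for 13 fortnights.
actual_income = [2000.0] * 13 + [0.0] * 13
# Benefits were claimed only during the fortnights without work.
paid = [0.0] * 13 + [FULL_PAYMENT] * 13


def debt(incomes):
    """Total apparent overpayment given per-fortnight income figures."""
    return sum(max(0.0, p - entitlement(i)) for p, i in zip(paid, incomes))


# Averaging spreads the annual total evenly over every fortnight.
averaged_income = [sum(actual_income) / FORTNIGHTS] * FORTNIGHTS

print("Debt using actual fortnightly income:", debt(actual_income))
print("Debt using averaged annual income:  ", debt(averaged_income))
```

Under these assumptions the person was paid correctly, yet the averaged figures suggest a debt of 4550.0, because income earned while working is attributed to the fortnights without work.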

#### British COVID data
- Like most countries, Britain has been trying to gather
  data about COVID-19 infections
- Nearly 16,000 new coronavirus infections went
  unreported in its test-and-trace program
    - Excel was used to process the original data
    - There was too much data for Excel, so not all of the data was processed
    - Output didn’t explain the context of the processing
    - The contacts of those who were COVID-19 positive -
    potentially around 50,000 people - were not traced
    immediately.
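
The failure described above is a truncation: records beyond a tool's capacity never reached the output. The following is a minimal sketch of a reconciliation check that compares input and output row counts. The cap of 65,536 is the row limit of a legacy `.xls` worksheet and is used here only as an illustrative assumption; the function names and the synthetic data are hypothetical.

```python
XLS_MAX_ROWS = 65_536  # row limit of a legacy .xls worksheet


def load_into_capped_sheet(rows, max_rows=XLS_MAX_ROWS):
    """Simulate a tool that keeps only the first max_rows rows, without warning."""
    return rows[:max_rows]


def load_with_reconciliation(rows, max_rows=XLS_MAX_ROWS):
    """Perform the same load, but fail loudly if any input rows went missing."""
    loaded = load_into_capped_sheet(rows, max_rows)
    if len(loaded) != len(rows):
        dropped = len(rows) - len(loaded)
        raise ValueError(f"{dropped} of {len(rows)} rows were dropped during load")
    return loaded


# Synthetic test results, larger than the assumed cap.
test_results = [{"case_id": i} for i in range(70_000)]

silent = load_into_capped_sheet(test_results)
print(len(test_results) - len(silent), "rows lost without any error")

try:
    load_with_reconciliation(test_results)
except ValueError as err:
    print("Caught:", err)
```

Counting records in and out at every processing step is a cheap data management control that turns a silent loss into a visible error.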

#### BUPA "leak"
- In 2015, Bupa, a healthcare company, was contracted to help the
  Australian government determine “the health of people applying
  for visas and permanent residency in Australia”.
    - It was granted limited access to departmental records.
- A subcontractor (SHP) copied government departmental
  data about the applicants into Excel spreadsheets and emailed them as
  status reports
    - Due to a typo, they were emailed to an unknown address, not Bupa
    - 5 weeks later, Bupa had to contact Google to get the details of the
    recipient
- The government department was not aware that Bupa and SHP
  had accessed/shared the health information in this way.
- Bupa “failed to comply” with the conditions of its contract and
  was not adhering to standard department policies.

### Summary
- Data usage is no good without good data management
- Data management needs good data governance to do it right
- All aspects of the life of data should be planned,
  managed and governed, and not necessarily restricted
  to one project.
- The responsibility of looking after the data lies with
  all stakeholders; the data scientist is key to making
  this happen

Backup, recovery, archiving. You need to back up your data in case of disaster.  
Everyone needs to be on the same page. IT cannot be the one to organise this; it must be the business stakeholders who standardise it.  
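
On the backup point above, a copy is only useful if it can be shown to match the original. The following is a minimal sketch that copies a file and verifies the copy with a SHA-256 checksum, using only the Python standard library. The file name, the directory layout and the function names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, backup_dir):
    """Copy source into backup_dir and confirm the copy is byte-identical."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target


# Demonstration in a temporary directory with a made-up data file.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "survey.csv"
    source.write_text("id,score\n1,0.9\n2,0.4\n")
    copy = backup_and_verify(source, Path(workdir) / "backup")
    verified = sha256_of(source) == sha256_of(copy)
    print(copy.name, "verified:", verified)
```

Storing the checksum alongside the backup also makes it possible to test recoverability later, which is the part of a backup policy that is most often skipped.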