# Week 4 Notes

### Analytic Levels  
• Descriptive analytics: What happened?  
• Predictive analytics: What will happen?  
• Prescriptive analytics: What should we do about it?  

#### Descriptive Analytics: gain insight from historical data  
• plot sales results by region and product category  
• correlate with advertising revenue per region  
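
As a minimal sketch of the first bullet in a Unix shell, the following totals sales per region from a small CSV. The file name `sales.csv`, its columns, and the figures are made up for illustration; a real analysis would go on to plot these totals rather than print them.

```shell
# Hypothetical sample data: region, product category, sales (illustrative values only).
workdir=$(mktemp -d)
cat > "$workdir/sales.csv" <<'EOF'
region,category,sales
North,Toys,120
South,Toys,80
North,Books,60
South,Books,40
EOF

# Skip the header row, sum the sales column per region, then sort by region name.
totals=$(awk -F, 'NR > 1 { sum[$1] += $3 } END { for (r in sum) print r "," sum[r] }' "$workdir/sales.csv" | sort)
echo "$totals"

rm -r "$workdir"
```

The final `sort` is there because awk does not guarantee any order when looping over an array.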

#### Predictive analytics
- make predictions using statistical and machine learning techniques  
    - predict next quarter’s sales results using economic projections and advertising targets  
- analyze current and historical facts to make predictions about future or otherwise unknown events  

##### Common Use Cases:
• Business: Forecasting demand, identifying fraud, targeting marketing campaigns.  
• Education: Predicting student dropout risk or performance.  
• Healthcare: Predicting disease outbreaks or patient readmission.  
• Finance: Credit scoring, risk assessment.  

#### Prescriptive analytics
- recommend decisions using optimisation, simulation, etc.  
    - recommend which regions to advertise in given a fixed budget  
- goes one step beyond predictive analytics.  

• While predictive analytics tells you what is likely to happen,
prescriptive analytics tells you what actions to take to
influence or optimize future outcomes.  

##### Common Use Cases:
• Business: Recommends inventory levels to minimize costs
while meeting demand.  
• Education: After predicting which students are at risk,
prescriptive analytics might suggest tailored learning
strategies or interventions.  
• Healthcare: Recommends treatment plans based on patients  

### Where to find data?
1. Sharing data
2. Open data
3. Utilising data
4. Data standards
5. Combining data
6. Scripting languages and tools

#### Shared data provides opportunities
- New combinations of data
    - Attributes from one dataset can be merged into another given similar properties
- New relationships in data
- New visualisations of data
- New understandings of data
- Also creates new data!

#### Open data
• Data that is “freely available to everyone to use and
republish as they wish, without restrictions from copyright,
patents or other mechanisms of control” – Wikipedia  
    - Free – accessible, costs nothing  
    - Free – unrestricted usage  
    - Free – simple, non-proprietary format  
• Commonly associated with open government data  

##### Open data opportunities
• open data provides new opportunities for business, new products
and services, and can raise productivity  
• open data supports public understanding and citizen engagement  
• scientists need to better publicise their data (with help from
universities, etc.)  
• industry sectors should work with regulators and coordinate
industry collaboration  
• collaboration across sectors in both public and private settings,
e.g., disaster response, education  

##### Open data problems
- Data is not always usable  
- Need the right skills to make use of it  


#### Utilising data sources
- Data requires work to clean up  
- Be creative about sources  
- Combine many sources  
- Some data might need to be generated  
- Fine-grained data really helps, but is hard to use  

Many companies offer a free, public API:
- Facebook
- Twitter
- Google Maps
- YouTube
- Amazon
- New York Times

#### Data standards
Standardisation is more efficient  
Efficiency lowers cost  
E.g., XML, JSON, and CSV are standards that tell you how to arrange your data  

##### Data formats
• Machine-readable data: data (or metadata) which is in a
format that can be understood by a computer,
e.g., XML, JSON  
• Markup language: system for annotating a document in a
way that is syntactically distinguishable from the text,
e.g., Markdown, Javadoc  
• Digital container: file format whose specification describes
how different elements of data and metadata coexist in a
computer file, e.g., MPEG  
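
To make "machine-readable" concrete, here is a rough sketch that rearranges the same records from CSV into JSON, one object per line. The file `products.csv` and its columns are hypothetical, every value is emitted as a string, and the conversion assumes simple fields with no embedded commas, quotes, or line breaks; real data would call for a proper CSV parser.

```shell
# Hypothetical input: a header row followed by two records.
workdir=$(mktemp -d)
cat > "$workdir/products.csv" <<'EOF'
id,name
1,Kettle
2,Toaster
EOF

# Remember the header names, then print each data row as a JSON object keyed by them.
json=$(awk -F, '
  NR == 1 { for (i = 1; i <= NF; i++) key[i] = $i; next }
  {
    line = "{"
    for (i = 1; i <= NF; i++) {
      line = line "\"" key[i] "\": \"" $i "\""
      if (i < NF) line = line ", "
    }
    print line "}"
  }' "$workdir/products.csv")
echo "$json"

rm -r "$workdir"
```

Both forms carry the same content; the standard only fixes how it is arranged, which is what lets another program read it without guessing.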

##### Metadata
- Data about data  
- Structured for computer interpretation and processing  
- Tells you when, where, how the data was collected  

• Descriptive: describes content for identification and
retrieval, e.g. title, author of a book  
• Structural: documents relationships and links, e.g.
chapters in a book, elements in XML, containers in MPEG  
• Administrative: helps to manage information, e.g. version
number, archiving date, Digital Rights Management (DRM)  

##### Why metadata?
• Facilitate data discovery  
• Help users determine the applicability of the data  
• Enable interpretation and reuse  
• Clarify ownership and restrictions on reuse  
• Metadata helps set standards  
• Metadata should also be standardised for:  
    - Archiving data  
    - Sharing data  
    - Searching data  



#### Combining data
Relationships in data  
- Tabular data  
    - Tables  
    - Relational database  
- Graph data  
    - Nodes: entities  
    - Edges: relationships between entities  
    - Graph database  

For datasets to be joined, they must have something in common  
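
A small sketch of what "something in common" means in practice, using the standard `join` command. The two files and the region codes in their first column are invented for illustration; the shared code is the common key.

```shell
# Hypothetical datasets that share a region code in their first column.
workdir=$(mktemp -d)
cat > "$workdir/sales.csv" <<'EOF'
S,120
N,180
EOF
cat > "$workdir/regions.csv" <<'EOF'
N,North
S,South
W,West
EOF

# join expects both inputs to be sorted on the join field, so sort them first.
sort -t, -k1,1 "$workdir/sales.csv" > "$workdir/sales.sorted"
sort -t, -k1,1 "$workdir/regions.csv" > "$workdir/regions.sorted"

# Inner join on column 1: only keys present in both files appear in the result.
joined=$(join -t, "$workdir/regions.sorted" "$workdir/sales.sorted")
echo "$joined"

rm -r "$workdir"
```

Region `W` has no matching sales row, so it drops out of the result. Redirecting the output to a file instead of a variable would store the combined data.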

##### Joining datasets
• Can be temporary  
    - Just for the current analysis  
• Can be permanent  
    - Store the combined data  
• Can have conditions  
    - Can you share the combined data?  
• Can be costly  
    - Memory  
    - Processing time & capacity  
        ‣ joining  
        ‣ searching  
        ‣ analysing  


#### Scripting languages and tools
• A script is a series of commands to be performed  
• A script is executable on demand  
    - not compiled to an executable form  
    - interpreted command-by-command as it is executed, like on a command line
• Examples:  
    - R  
    - Python  
    - Unix shell  
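
One way to see the command-by-command behaviour is to run a script whose third command is deliberately invalid. This is only a sketch; `demo.sh` and `no_such_command_xyz` are made-up names.

```shell
# Write a small script file; the third command does not exist.
workdir=$(mktemp -d)
cat > "$workdir/demo.sh" <<'EOF'
echo "step 1"
echo "step 2"
no_such_command_xyz
echo "step 3"
EOF

# Run it on demand with the shell interpreter; there is no compile step.
# The "command not found" message goes to stderr, which is discarded here.
output=$(sh "$workdir/demo.sh" 2>/dev/null)
echo "$output"

rm -r "$workdir"
```

The first two steps run before the shell reaches the bad line, and by default the script carries on with the third step afterwards.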

##### Unix shell script
• Command-line code for Unix (+ Linux & macOS)  
• Scripts commonly include:  
    - Wildcards: `*`, `?`  
    e.g., `./Customer??Loc*.txt`  
    - Piping: output from one command streams as input to another  
    e.g., `cat product*v1.txt | sort`  
    - `/` in filepaths  
    - `;` to separate commands  
    - `>` and `<` to redirect output and input  
    e.g., `cat product*v1.txt > contents`  
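
The features listed above can be tried out together in a throwaway directory. This is a sketch only; the file names and their contents are hypothetical, chosen to match the patterns in the examples.

```shell
# Hypothetical files; only the two "v1" files should match the patterns below.
workdir=$(mktemp -d)
printf 'pear\napple\n' > "$workdir/productAv1.txt"
printf 'fig\n' > "$workdir/productBv1.txt"
printf 'ignored\n' > "$workdir/productAv2.txt"

# Wildcard + pipe: * matches any run of characters, so both v1 files are read
# and their combined lines are streamed into sort.
sorted=$(cat "$workdir"/product*v1.txt | sort)
echo "$sorted"

# > writes a command's output to a file; < feeds a file to a command as input.
cat "$workdir"/product*v1.txt > "$workdir/contents"
line_count=$(wc -l < "$workdir/contents" | tr -d ' ')
echo "lines: $line_count"

# ; separates two commands on one line; ? matches exactly one character.
names=$(cd "$workdir"; ls product?v1.txt)
echo "$names"

rm -r "$workdir"
```

The `tr -d ' '` is there because some versions of `wc` pad the count with spaces.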

• Piping shell commands buffers their execution  
    - Commands don’t try to do everything at once, just enough for the
    next command  
    - They tend to work through text files line by line  
    - This allows different commands to be working on different
    parts of the data at the same time  
    - Scales up well for big files!  
        ‣ Reduces the memory load  
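
A short sketch of this streaming behaviour, assuming the `seq` utility is available (it ships with Linux and macOS but is not part of POSIX). The numbers stand in for the lines of a large file.

```shell
# Four commands run at the same time, connected by pipes:
#   seq writes numbers line by line,
#   awk keeps the even ones as they arrive,
#   wc counts the lines it receives,
#   tr strips any padding from the count.
# No stage needs the whole list in memory, and nothing is written to disk in between.
even_count=$(seq 1 100000 | awk '$1 % 2 == 0' | wc -l | tr -d ' ')
echo "even numbers: $even_count"
```

Raising the upper limit makes the pipeline run longer, but each stage still only handles the lines currently passing through it.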

##### Standardising workflow
• Need to standardise how data is accessed  
• Need to be able to reproduce  
    - Wrangling  
    - Analysis  
    - All other stages of the value chain!  
• Scripting allows these to be recorded  
• Scripting allows these to be shared  
• Scripting allows these to be modified  

• It is also vital to understand why certain steps are used  
    - Why was the wrangling done?  
    - What was the analysis for?  
• The context of working with data also needs to be
recorded  
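
One possible way to record both the steps and their rationale is to keep the reasons as comments next to the commands. The function `clean_ids` and the situation it describes are hypothetical, meant only to show the idea.

```shell
# clean_ids: a hypothetical wrangling step, written as a function so it can be
# rerun, shared, and modified.
# Why: the raw export (assumed here) contains blank lines, stray spaces, and
#      duplicate IDs, which would inflate counts in the later analysis.
clean_ids() {
  # 1. strip spaces and tabs  2. drop empty lines  3. sort and remove duplicates
  tr -d ' \t' | grep -v '^$' | sort -u
}

# Example run on a tiny made-up input.
cleaned=$(printf 'b2 \n\na1\nb2\n' | clean_ids)
echo "$cleaned"
```

Anyone who receives the script sees what was done and why, and can rerun it on a new export to reproduce the wrangling.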

• So how can you standardise data?  
    - Access  
    - Format  
    - Value & vocabulary  
    - Metadata  
    - Software & tools  
    - Process & workflow  
• What role do data scientists and data science play in
standardising things related to data?  
    - Establishing the standards  
    - Enacting the standards  

You can use a new method on an existing dataset, or on data from your own country, and that counts as novel.  
Only R is allowed, no Python!