# Week 11 - Issues in data science

## Learning Outcomes (Week 11)
By the end of this week, you should be able to:  
- Explain linked data  
- Understand some of the legal and social issues that arise in a Data Society  
- Understand some of the legal and ethical issues due to the use of AI and ML  

## 1. Ethics of linking data
- Connecting elements within multiple structured data sets  
- Allows data relating to an element to be collected from multiple data sets  
- Expands the knowledge base of a single data set  
- Linked Open Data (LOD) allows the links and data to be freely shared and accessed  
    - Companies use LOD but tend not to contribute their own data  
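
A minimal sketch of linking, assuming two small structured data sets that share an identifier. The data, the `person_id` key, and the `link_records` helper are all hypothetical and only illustrate how linking expands what is known about one element.

```python
# Two hypothetical structured data sets sharing a "person_id" key.
purchases = [
    {"person_id": 1, "item": "running shoes"},
    {"person_id": 2, "item": "cookbook"},
    {"person_id": 1, "item": "protein bars"},
]
profiles = [
    {"person_id": 1, "postcode": "3000"},
    {"person_id": 2, "postcode": "3168"},
]


def link_records(left, right, key):
    """Attach fields from `right` to every record in `left` that shares `key`."""
    index = {row[key]: row for row in right}  # look up right-hand rows by key
    linked = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            linked.append({**row, **match})  # merged record knows more than either source
    return linked


linked = link_records(purchases, profiles, "person_id")
for row in linked:
    print(row)
```

Each linked record now combines what a person bought with where they live, which neither data set revealed on its own. That is the source of both the value and the ethical risk.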

### Ethics
- Ethics - the moral handling of data, e.g., not selling on others’ private data to scammers  
- People have rights  
    - privacy  
    - access  
    - erasure  
    - … etc.  
- Companies have rights  
    - ownership of data  
    - intellectual property  
    - copyright  
    - confidentiality  

### Companies using linked data
- Business models  
    - Data has become a valuable asset  
    - Data has become a valuable product  
- Companies can link data from different services by buying out other companies or by establishing new services for other companies to use.  

### Governments using linked data
- Business models  
    - Multiple departments have separate systems  
    - Departments interact, so why can’t their data?  
    - Law enforcement needs to know what everyone else knows!  
- Problems  
    - Who should know what?  
    - How do you manage who should know what?  
    - What priorities do you give to the rights of people?  

What can you do?  
What should you do?  
How do you make sure the right thing is done?  

### Confidentiality
See: “The curly fry conundrum: Why social media ‘likes’ say more than you might think” by Jennifer Golbeck  
e.g., Target® predicting which women are pregnant based on their purchases  
- Many things can be predicted from Facebook “likes”  
- Homophily (tendency to associate with similar individuals) is important for enabling prediction  
- We often don’t own or manage corporate/internet/app data about ourselves  
- The source data is critical for advertisers, so we cannot expect companies to be banned/excluded from using it  
- So how can we manage confidentiality?  
    - For many apps/websites, you must accept their privacy and data-sharing policies to use their services fully  
    - The interface for selecting privacy preferences should move away from individual Internet platforms and be put into the hands of individual consumers  
    - Users could have an open-source agent that brokers their confidentiality preferences  
    - But would that be feasible, and would businesses ever agree?  
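
To see why homophily enables prediction, here is a toy sketch that guesses a user's undisclosed attribute from what their friends have disclosed. The friendship graph, the attribute values, and the `guess_attribute` helper are hypothetical; real systems use far richer signals than a majority vote.

```python
from collections import Counter

# Hypothetical friendship graph and the attribute values some users disclosed.
friends = {
    "ann": ["bob", "cat", "dan"],
    "bob": ["ann", "cat"],
    "cat": ["ann", "bob", "dan"],
    "dan": ["ann", "cat"],
}
disclosed = {"bob": "cyclist", "cat": "cyclist", "dan": "runner"}


def guess_attribute(user, friends, disclosed):
    """Guess an undisclosed attribute as the most common value among the user's friends."""
    values = [disclosed[f] for f in friends.get(user, []) if f in disclosed]
    if not values:
        return None  # nothing to infer from
    return Counter(values).most_common(1)[0][0]


guess = guess_attribute("ann", friends, disclosed)
print(guess)
```

Even though "ann" shared nothing, her friends' disclosures are enough to make a guess about her. This is why managing only your own privacy settings may not protect your confidentiality.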

## 3. AI veracity
Can you trust the analysis?  
- Various factors can affect the “accuracy” of any analysis  
    - Data quality  
    - Choice of analysis  
    - Design of analysis  
    - Choice of data  
- It is easy for the modelling to misrepresent what the data is supposed to reflect.  
    - Even statistical analysis can be biased!  

### Data Provenance and LLM Training
- What is Data Provenance?  
    - The origin, lineage, and history of data used to train AI models  
    - Key for assessing the credibility, fairness, and compliance of AI systems  
- Why It Matters for LLMs  
    - LLMs are trained on massive, web-scale datasets (e.g., Reddit, Wikipedia)  
    - Often lack transparency: unclear what data was included or excluded  
    - Impacts veracity: biased, outdated, or low-quality data → unreliable outputs  
- Risks of Poor Provenance
    - Legal & ethical concerns: copyright infringement, data consent violations  
    - Bias propagation: inherited from skewed or unbalanced training sources  
    - Hallucinations: generating false facts due to poor source grounding  

### Bias of data
- Sometimes the data used to train an ML system is biased, regardless of its volume
    - Narrow  
    - Regional  
    - Undertested in varied contexts  
- A biased system may discriminate in its results, for instance by
    - gender  
    - ethnic associations  
- A biased system may not be as accurate in its results for unfamiliar contexts and subjects
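
One simple way to look for this kind of bias is to compare accuracy across groups instead of reporting a single overall figure. The sketch below assumes hypothetical evaluation records of the form (group, true label, predicted label); the groups and numbers are made up for illustration.

```python
from collections import defaultdict

# Hypothetical evaluation records: (group, true label, predicted label).
results = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 0, 1),
]


def accuracy_by_group(records):
    """Return the fraction of correct predictions for each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / total[group] for group in total}


per_group = accuracy_by_group(results)
overall = sum(t == p for _, t, p in results) / len(results)
print(per_group, overall)
```

In this made-up data the overall accuracy hides the fact that the system performs much worse for one group than the other.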

- Not all bias is in the numbers  
- Bias can also be in how you have designed the research  
    - Are the variables appropriate for all situations being modelled?  
    - Are assumptions made about the stakeholders who the data relates to?  
    - Are assumptions being made about the context of the data?  
- What is Bias in LLMs?
    - Systematic skew in model outputs  
    - Can reflect and amplify social, cultural, or political stereotypes  
- Sources of Bias
    - Training data bias: Overrepresentation of certain groups/languages/views  
    - Modeling bias: Architectural choices may reinforce patterns  
    - User prompt bias: How questions are framed affects results  
- Gender Bias
    - Definition: Stereotyping based on gender in language generation or representation.
- Racial or Ethnic Bias
    - Definition: Bias where the model generates outputs that reflect stereotypes or disproportionate associations based on race or ethnicity.
- Cultural or Geographical Bias
    - Definition: The tendency of LLMs to prioritize perspectives, norms, or knowledge from dominant (usually Western) cultures.
- Other types of bias
    - Political or Ideological Bias  
    - Occupational Stereotyping  
    - Religious Bias  
    - ….

## 4. Sampling
- When collecting data for processing, it has to be relevant  
    - Can you get all data relating to the scenario you are modelling?  
    - Can you only get a random sample of data? The sample data has to be representative of the population being modelled  
    - How large a sample do you need?  
    - What known variables are included in the data?  
    - Is the sample data distributed to match the required strata/categories?  
- Observe the population before you make any unqualified assumptions  
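
A minimal sketch of one way to keep a sample representative: stratified sampling, which draws the same fraction from each stratum. The population, the `region` field, and the `stratified_sample` helper are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum so the sample mirrors the population mix."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        # At least one record per stratum, never more than the stratum holds.
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 80 urban and 20 rural respondents.
population = [{"id": i, "region": "urban" if i < 80 else "rural"} for i in range(100)]
sample = stratified_sample(population, lambda r: r["region"], fraction=0.1)
counts = {
    region: sum(1 for r in sample if r["region"] == region)
    for region in ("urban", "rural")
}
print(counts)
```

A plain random sample of 10 could easily contain no rural respondents at all; stratifying guarantees the 80/20 split is preserved.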

### A/B Testing
- Blind experiments or A/B testing may be used to show whether there is a relationship between variables
- The experimental scenario needs to be divided into:  
    - A: Sample is subjected to the known variable  
    - B: Sample is not subjected to the known variable (the Control set)  
- The validity of the hypothesis is based on whether A has a different response compared to B, where the response is the target variable.  
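
The A/B split described above can be sketched as a random assignment followed by a comparison of response rates. The subject IDs, the `responded` set, and both helper functions are hypothetical; a real experiment would record responses only after the treatment is applied.

```python
import random


def assign_groups(subject_ids, seed=0):
    """Randomly split subjects into treatment group A and control group B."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)  # random assignment avoids selection bias
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def response_rate(group, responded):
    """Fraction of a group in which the target variable (the response) was observed."""
    return sum(1 for subject in group if subject in responded) / len(group)


group_a, group_b = assign_groups(range(100))

# Hypothetical outcome data: the subjects who showed the response.
responded = {s for s in range(100) if s % 3 == 0}

print("A:", response_rate(group_a, responded))
print("B:", response_rate(group_b, responded))
```

Whether the gap between the two rates is large enough to matter is a separate question, which is what significance testing addresses.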

### Significance testing
How much of a difference in results is enough?  
- Must test the statistical significance  
    - p-value (0 to 1): how “surprised” you should be by your results, i.e., how likely you would be to get results like these by chance, regardless of the hypothesis  
- Hypothesis: Aspirin reduces heart attack  
    - Sample: studied 100 men for 5 years  
        - Group HA: 50 men take aspirin daily  
        - Group HP: 50 men take placebo daily (control)  
    - Results:  
        - High p: HA 4 heart attacks, HP 5 heart attacks, so both around 1 in 10 men and no real difference  
        - Low p: HA 1 heart attack, HP 10 heart attacks, so very different and significant!  
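
As a worked check of the aspirin example, the sketch below computes a p-value for each of the two outcomes. It assumes Fisher's exact test (two-sided) on a 2x2 table of heart attack versus no heart attack; the notes do not say which test is intended, so this choice and the `fisher_exact_two_sided` helper are illustrative only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x events in row 1, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(total, 1.0)


# Rows: HA (aspirin), HP (placebo). Columns: heart attack, no heart attack.
p_high = fisher_exact_two_sided(4, 46, 5, 45)   # HA 4 vs HP 5
p_low = fisher_exact_two_sided(1, 49, 10, 40)   # HA 1 vs HP 10

print(f"HA 4 vs HP 5:  p = {p_high:.3f}")
print(f"HA 1 vs HP 10: p = {p_low:.4f}")
```

Under this test the first outcome gives a p-value far above the usual 0.05 threshold, while the second falls well below it, matching the "high p" and "low p" labels above.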



## Tutorial
What's used on an Amazon web page? What's tracked?  
- Location to change  
- Reviews and ratings to add/read  
- Links to related products  

What's it used for?  
- Clustering: book recommendations  
- Text analysis: contextualize reviews  
- Location-specific pricing  

Who owns the data?  
- Not us  

Do we have a say?  
- No  