# Introduction to Machine Learning

Before we dive into machine learning algorithms, we need to understand what makes machine learning fundamentally different from traditional programming.

> **Machine Learning** is a subset of Artificial Intelligence that enables systems to learn and improve from experience without being explicitly programmed.

## Traditional Programming vs Machine Learning

### Traditional Programming (Algorithmic Approach)

In traditional programming, a developer writes explicit rules that transform inputs into outputs. The process is **deterministic** - given the same input, you will always get the same output.

```
INPUT + RULES → OUTPUT
```

### Machine Learning Approach

In machine learning, we provide data (inputs and expected outputs) and the system learns the rules. The process can be **probabilistic** - the output is an estimate based on patterns learned from data, not a guaranteed result.

```
INPUT + OUTPUT → RULES (Model)
```

## IPO Comparison: Traditional Algorithm vs Machine Learning

### Example: Adding Two Numbers

#### Traditional Algorithm IPO Table

| Input | Process | Output |
|-------|---------|--------|
| Two numbers: 2 and 2 | Apply explicit rule: `return a + b` | 4 (always guaranteed, deterministic) |

#### Machine Learning Algorithm IPO Table

| Input | Process | Output |
|-------|---------|--------|
| Two numbers: 2 and 2 | Black box: learned patterns from training data | ~4 (probabilistic, may not be exactly 4) |

#### Key Differences

| Aspect | Traditional Algorithm | Machine Learning Algorithm |
|--------|----------------------|----------------------------|
| Transparency | Fully transparent - we wrote the code | Opaque - we don't know exactly how it decided |
| Consistency | Deterministic - same input, same output | Approximate - outputs depend on the training data and may differ from the exact answer |
| Adaptability | Cannot adapt without reprogramming | Can adapt by learning from new data |

```python
# Traditional Algorithm - Explicit, Deterministic, Transparent
def traditional_add(a, b):
    """
    A simple algorithm with explicit rules.
    We know EXACTLY what this function does.
    """
    return a + b

# Always returns 4 - guaranteed
result = traditional_add(2, 2)
print(f"Traditional Algorithm: 2 + 2 = {result}")
print("We know exactly how we got this answer: by adding a + b")
```

```python
# A simple ML approach: learn addition from examples
# The model is never given the rule a + b; it only sees inputs and outputs

import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: examples of addition
# The model learns the PATTERN, not the explicit rule
X_train = np.array([
    [1, 1],
    [1, 2],
    [2, 1],
    [3, 3],
    [5, 5],
    [4, 3],
    [10, 5]
])
y_train = np.array([2, 3, 3, 6, 10, 7, 15])  # The sums

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Ask the ML model: what is 2 + 2?
prediction = model.predict([[2, 2]])[0]
print(f"ML Model Prediction: 2 + 2 = {prediction:.4f}")
print("\nNotice: the model was never told to add, yet it predicts (almost exactly) 4.")
print("These examples are noise-free, so the fit is near-perfect.")
print("Real-world data is noisy, and predictions are then only approximate.")
```
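The training examples above are exact sums, which is why a linear model can fit them so closely. Approximate answers appear once the training data contains noise, as real measurements usually do. The following is a minimal sketch in plain Python, assuming invented noisy examples (the noise level, seed, and sample size are illustrative assumptions, not part of the lesson data), that fits `w1*a + w2*b` by least squares:

```python
import random

random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

# Hypothetical noisy training data: each "sum" is off by a small random error
examples = []
for _ in range(200):
    a = random.uniform(0, 10)
    b = random.uniform(0, 10)
    noisy_sum = a + b + random.gauss(0, 0.5)
    examples.append((a, b, noisy_sum))

# Least-squares fit of y ≈ w1*a + w2*b via the 2x2 normal equations
s_aa = sum(a * a for a, b, y in examples)
s_ab = sum(a * b for a, b, y in examples)
s_bb = sum(b * b for a, b, y in examples)
s_ay = sum(a * y for a, b, y in examples)
s_by = sum(b * y for a, b, y in examples)
det = s_aa * s_bb - s_ab * s_ab
w1 = (s_ay * s_bb - s_by * s_ab) / det
w2 = (s_by * s_aa - s_ay * s_ab) / det

noisy_prediction = w1 * 2 + w2 * 2
print(f"Learned weights: w1 = {w1:.4f}, w2 = {w2:.4f}")
print(f"Noisy-data prediction: 2 + 2 = {noisy_prediction:.4f}")
```

The learned weights should land near 1 but not exactly on it, so the prediction is close to 4 rather than guaranteed to equal it.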

## The Black Box Problem

### What is a Black Box?

A **black box** is a system where we can see inputs and outputs, but the internal workings are hidden or too complex to understand.

```mermaid
flowchart LR
    INPUT --> BB[BLACK BOX<br/>Hidden Logic<br/>? ? ? ? ?]
    BB --> OUTPUT
    style BB fill:#333,stroke:#fff,color:#fff
```

### Why Machine Learning is Often a Black Box

| Aspect | Traditional Code | Machine Learning Model |
|--------|-----------------|------------------------|
| **Logic** | Written by humans, readable | Learned from data, mathematical weights |
| **Explainability** | Can trace every step | Often cannot explain why a decision was made |
| **Debugging** | Step through code line by line | Analyze patterns in millions of parameters |
| **Trust** | Trust the programmer | Trust the training data and process |

```python
# Looking inside the model: what did it actually learn?
print("What the ML model actually learned:")
print(f"Coefficients (weights): {model.coef_}")
print(f"Intercept (bias): {model.intercept_}")
print(f"\nThe model learned: output = {model.coef_[0]:.4f}*a + {model.coef_[1]:.4f}*b + {model.intercept_:.4f}")
print("\nIdeally it should be: output = 1*a + 1*b + 0")
print("\nThis tiny linear model has only three numbers, so we can still read its 'reasoning'.")
print("Larger models have thousands or millions of weights, and no single one explains a decision.")
```
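The linear model above has only three learned numbers, which is why its rule can be read off directly. The "millions of parameters" in the table refers to much larger models. As a rough illustration, the sketch below counts the weights and biases in a fully connected neural network; `count_parameters` is an illustrative helper and the layer sizes are hypothetical, chosen only to show how quickly the count grows.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    Each layer with n_in inputs and n_out outputs has
    n_in * n_out weights plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# The linear addition model: 2 inputs -> 1 output
print(f"Linear model:  {count_parameters([2, 1]):,} parameters")

# Hypothetical small image classifier: 784 inputs, two hidden layers, 10 outputs
print(f"Small network: {count_parameters([784, 512, 512, 10]):,} parameters")
```

Even this modest hypothetical network has hundreds of thousands of parameters, and none of them corresponds to a human-readable rule on its own.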

## Bias in Machine Learning

### What is Bias?

**Bias** in machine learning refers to systematic errors that cause the model to consistently make wrong predictions in a particular direction. Bias can come from:

1. **Data Bias** - The training data doesn't represent the real world
2. **Algorithm Bias** - The model's assumptions don't match reality
3. **Human Bias** - Prejudices of the people who collect data or design systems

### Types of Bias

| Type | Description | Example |
|------|-------------|----------|
| **Selection Bias** | Training data not representative | Training facial recognition only on certain demographics |
| **Confirmation Bias** | Model reinforces existing patterns | Hiring AI that favors historically hired demographics |
| **Historical Bias** | Past data reflects past inequalities | Loan approval models based on historically discriminatory lending |
| **Measurement Bias** | Data collection method is flawed | Using arrest records as proxy for criminal behavior |

```python
# Demonstrating Data Bias
import numpy as np
from sklearn.linear_model import LinearRegression

# Biased training data - the second number is ALWAYS 1!
X_biased = np.array([
    [1, 1],
    [2, 1],
    [3, 1],
    [4, 1],
    [5, 1]
])
y_biased = np.array([2, 3, 4, 5, 6])  # The sums

biased_model = LinearRegression()
biased_model.fit(X_biased, y_biased)

# Test on data similar to training
print("Testing on data SIMILAR to training data:")
print(f"2 + 1 = {biased_model.predict([[2, 1]])[0]:.2f} (expected: 3)")

# Test on data DIFFERENT from training (second number is not 1)
print("\nTesting on data DIFFERENT from training data:")
print(f"100 + 100 = {biased_model.predict([[100, 100]])[0]:.2f} (expected: 200)")
print(f"50 + 75 = {biased_model.predict([[50, 75]])[0]:.2f} (expected: 125)")

print("\n⚠️ The model never saw the second number change, so it could not learn its effect!")
print("This is DATA BIAS - the training data didn't represent all possible inputs.")
```

## Ethics in Machine Learning

### Why Ethics Matter

Machine learning models are increasingly used to make decisions that affect people's lives:
- Who gets a loan?
- Who gets hired?
- Who gets parole?
- What medical treatment is recommended?

### Key Ethical Considerations

| Principle | Description | Question to Ask |
|-----------|-------------|------------------|
| **Fairness** | Model treats all groups equitably | Does this model disadvantage any group? |
| **Transparency** | Decisions can be explained | Can we explain WHY a decision was made? |
| **Accountability** | Someone is responsible for outcomes | Who is responsible when the model is wrong? |
| **Privacy** | Personal data is protected | What data does this model use and store? |
| **Safety** | Model doesn't cause harm | What happens if this model makes a mistake? |

### Real-World Examples of ML Ethics Issues

#### 1. Amazon's Hiring Algorithm (2018)
Amazon developed an AI recruiting tool that showed bias against women. The model was trained on 10 years of resumes, which were predominantly from men. The system learned to penalize resumes containing words like "women's" (as in "women's chess club captain").

**Lesson:** Historical bias in training data leads to discriminatory models.

#### 2. COMPAS Recidivism Algorithm
COMPAS is a risk assessment tool used in US courts to predict the likelihood of reoffending. A 2016 ProPublica analysis found that Black defendants who did not go on to reoffend were incorrectly flagged as higher risk at nearly twice the rate of white defendants.

**Lesson:** Black box algorithms making life-changing decisions need transparency and oversight.

#### 3. Healthcare Algorithm Bias (2019)
A widely-used healthcare algorithm was found to systematically underestimate how sick Black patients were compared to white patients, affecting the care millions received.

**Lesson:** Using proxy variables (like healthcare costs) can encode existing inequalities.
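
To see how a proxy can mislead, consider a minimal sketch with invented patient records (the names, numbers, and field names are all hypothetical, not data from the study). Two patients are equally sick, but one has historically had less spent on their care. Ranking by past cost as a stand-in for need pushes that patient down the list:

```python
# Hypothetical records: 'conditions' stands for true health need,
# 'past_cost' is the proxy a cost-based model would rank on.
patients = [
    {"name": "Patient A", "conditions": 4, "past_cost": 12000},
    {"name": "Patient B", "conditions": 4, "past_cost": 7000},  # same need, less spent
    {"name": "Patient C", "conditions": 2, "past_cost": 9000},
]

# Highest value first in both rankings
by_cost = sorted(patients, key=lambda p: p["past_cost"], reverse=True)
by_need = sorted(patients, key=lambda p: p["conditions"], reverse=True)

cost_ranking = [p["name"] for p in by_cost]
need_ranking = [p["name"] for p in by_need]

print("Ranked by past cost (proxy):", cost_ranking)
print("Ranked by conditions (need):", need_ranking)
```

Patient B is as sick as Patient A, yet the cost-based ranking places them last, behind a healthier patient.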

#### 4. Australian Robodebt Scheme (2016-2019)
The Australian Government's "Robodebt" scheme used an automated system to calculate welfare debts by averaging annual tax office income data evenly across fortnights and comparing the result with the income recipients had reported. The algorithm falsely accused over 400,000 Australians of owing money they did not owe, resulting in wrongful debt notices totaling $1.76 billion. The scheme caused significant financial hardship and mental health crises, and was linked to suicides. A Royal Commission found the scheme was unlawful from its inception.

**Lesson:** Automated decision-making systems affecting vulnerable populations require human oversight, transparency, and proper legal review. The "black box" nature of the system prevented meaningful appeal processes.
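
The flaw in income averaging can be shown with simple arithmetic. The sketch below uses an invented worker who earned nothing for half the year while on payments, then worked full-time for the other half. All figures, the income threshold, and the helper name are hypothetical, and the real eligibility rules were more involved.

```python
FORTNIGHTS_PER_YEAR = 26
INCOME_FREE_THRESHOLD = 300  # hypothetical fortnightly limit before payments reduce

# Actual fortnightly earnings: 13 fortnights unemployed, then 13 employed
actual_income = [0] * 13 + [2000] * 13
on_payments = [True] * 13 + [False] * 13

annual_income = sum(actual_income)
averaged_income = annual_income / FORTNIGHTS_PER_YEAR

def fortnights_flagged(income_per_fortnight):
    """Count fortnights on payments where income exceeded the threshold."""
    return sum(
        1
        for income, paid in zip(income_per_fortnight, on_payments)
        if paid and income > INCOME_FREE_THRESHOLD
    )

flagged_actual = fortnights_flagged(actual_income)
flagged_averaged = fortnights_flagged([averaged_income] * FORTNIGHTS_PER_YEAR)

print(f"Annual income: ${annual_income:,}")
print(f"Averaged fortnightly income: ${averaged_income:,.2f}")
print(f"Fortnights flagged using actual income:   {flagged_actual}")
print(f"Fortnights flagged using averaged income: {flagged_averaged}")
```

Using actual income, this person never earned anything while receiving payments. Averaging spreads their later wages back over the unemployed fortnights and makes every one of them look like an overpayment.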

## The Machine Learning Pipeline

Bias and ethical issues can enter at every stage of the pipeline:

| Stage | Process | Potential Bias Issues |
|-------|---------|----------------------|
| 1. Data Collection | Gather training data | Selection bias, Sampling bias, Historical bias |
| 2. Model Selection | Choose algorithm type | Algorithm bias, Wrong assumptions |
| 3. Training | Model learns patterns | Overfitting to biased patterns |
| 4. Deployment | Model makes predictions | Real-world consequences, Feedback loops |

**Pipeline Flow:**

```mermaid
flowchart LR
    A[DATA COLLECTION] --> B[MODEL SELECTION]
    B --> C[TRAINING]
    C --> D[DEPLOYMENT]
    A -.-> A1[Selection bias]
    B -.-> B1[Algorithm bias]
    C -.-> C1[Overfitting to bias]
    D -.-> D1[Feedback loops]
```
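
Of the issues listed above, feedback loops are perhaps the least intuitive. The minimal simulation below assumes a hypothetical recommender that always promotes whichever item has the most clicks, and that a promoted item then receives extra clicks simply because it is shown more often (the item names and click numbers are invented):

```python
# Hypothetical starting click counts: item_a is ahead by a single click
clicks = {"item_a": 11, "item_b": 10, "item_c": 10}

ORGANIC_CLICKS = 1   # clicks every item earns per round on its own merits
PROMOTION_BONUS = 5  # extra clicks for whichever item is recommended

for round_number in range(1, 11):
    # The "model" recommends the item with the most clicks so far
    recommended = max(clicks, key=clicks.get)
    for item in clicks:
        clicks[item] += ORGANIC_CLICKS
    clicks[recommended] += PROMOTION_BONUS

print(clicks)
```

All three items are equally good by construction, yet a one-click head start grows into a large gap because the model's own recommendations generate the data it later learns from.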

### Ethical Checkpoints

At each stage, we should ask:

1. **Data Collection:** Is our data representative? Are we respecting privacy?
2. **Model Selection:** Is this model appropriate? Can it be explained?
3. **Training:** Are we evaluating for fairness across groups?
4. **Deployment:** Who is affected? What happens when it's wrong?
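
The training checkpoint, evaluating fairness across groups, can be made concrete by computing an error rate separately for each group. The sketch below uses a small invented set of predictions. The group labels, the records, and the choice of false positive rate as the metric are assumptions for illustration; other fairness metrics exist and can disagree with one another.

```python
from collections import defaultdict

# Hypothetical evaluation records: (group, predicted_high_risk, actually_reoffended)
records = [
    ("group_x", True, False), ("group_x", True, False),
    ("group_x", False, False), ("group_x", False, False),
    ("group_x", True, True),
    ("group_y", True, False), ("group_y", False, False),
    ("group_y", False, False), ("group_y", False, False),
    ("group_y", True, True),
]

def false_positive_rates(rows):
    """Per group: share of people who did NOT reoffend but were flagged high risk."""
    flagged = defaultdict(int)
    negatives = defaultdict(int)
    for group, predicted, actual in rows:
        if not actual:
            negatives[group] += 1
            if predicted:
                flagged[group] += 1
    return {group: flagged[group] / negatives[group] for group in negatives}

rates = false_positive_rates(records)
for group, rate in rates.items():
    print(f"{group}: false positive rate = {rate:.0%}")
```

In this invented data, people in `group_x` who did not reoffend are wrongly flagged twice as often as those in `group_y`. A single overall accuracy figure would hide that gap, which is why the per-group breakdown matters.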

```python
# Example: The Danger of Black Box Decision Making

# Simulated "loan approval" model - THIS IS A SIMPLIFIED EXAMPLE
# In reality, such models are far more complex

def black_box_loan_decision(income, credit_score, postcode):
    """
    A simulated black box loan approval system.
    Notice: postcode shouldn't affect loan decisions,
    but if the training data was biased...
    """
    # Hidden "learned" weights (we pretend we don't know these)
    score = (income * 0.005) + (credit_score * 0.5) + (postcode * -0.2)
    return "Approved" if score > 300 else "Denied"

# Same financial profile, different postcodes
print("Two applicants with IDENTICAL financial profiles:")
print("Income: $80,000 | Credit Score: 720")
print()
print(f"Applicant A (Postcode 2000): {black_box_loan_decision(80000, 720, 2000)}")
print(f"Applicant B (Postcode 2770): {black_box_loan_decision(80000, 720, 2770)}")
print()
print("⚠️ Same qualifications, different outcomes!")
print("This is an example of how bias can hide in black box models.")
```
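
One way to open up a model like this is to break its score into per-feature contributions. That only works directly when the model is a simple weighted sum, as assumed here. The weights below are hypothetical values chosen for illustration, and `explain_score` is an illustrative helper, not part of any library.

```python
# Hypothetical weights for a linear scoring model
WEIGHTS = {"income": 0.005, "credit_score": 0.5, "postcode": -0.2}

def explain_score(applicant):
    """Return each feature's contribution (weight * value) to the total score."""
    return {name: WEIGHTS[name] * applicant[name] for name in WEIGHTS}

applicant_a = {"income": 80000, "credit_score": 720, "postcode": 2000}
applicant_b = {"income": 80000, "credit_score": 720, "postcode": 2770}

for label, applicant in [("A", applicant_a), ("B", applicant_b)]:
    contributions = explain_score(applicant)
    print(f"Applicant {label}: total score = {sum(contributions.values()):.1f}")
    for name, value in contributions.items():
        print(f"  {name:<13}{value:>8.1f}")
```

The income and credit score contributions are identical for both applicants. The entire difference comes from the postcode term, which is exactly the kind of check that a true black box makes difficult.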

## Summary: Key Concepts

### Traditional Programming vs Machine Learning

| Traditional | Machine Learning |
|-------------|------------------|
| Rules are written by humans | Rules are learned from data |
| Deterministic outputs | Probabilistic outputs |
| Fully transparent | Often a "black box" |
| Cannot improve without reprogramming | Can improve with more data |

### The Black Box
- ML models often cannot explain their decisions
- This creates challenges for accountability
- We need to balance accuracy with interpretability

### Bias
- Bias can enter at any stage of the ML pipeline
- Common sources: data collection, historical patterns, algorithm assumptions
- Biased models can perpetuate and amplify existing inequalities

### Ethics
- ML decisions affect real people's lives
- Key principles: fairness, transparency, accountability, privacy, safety
- Engineers have a responsibility to consider ethical implications

## Discussion Questions

1. **Black Box Trade-offs:** Some of the most accurate ML models (like deep neural networks) are also the hardest to explain. When should we prioritize accuracy over explainability? When should we prioritize explainability?

2. **Data Responsibility:** If a company trains a model on biased historical data and the model makes unfair decisions, who is responsible - the data scientists, the company, or the people who created the biased data in the first place?

3. **Automation vs Human Judgment:** Should some decisions always require human oversight, even if an ML model is more accurate? What types of decisions?

4. **Feedback Loops:** If a model is trained on data that includes its own past predictions, how might this create problems? (Hint: think about a recommendation algorithm)

## Extension: Exploring Model Interpretability

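Some ML models are more interpretable than others. The ratings below are rough generalizations; actual accuracy depends heavily on the data and the task: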

| Model Type | Interpretability | Accuracy Potential |
|------------|------------------|--------------------|
| Linear Regression | High | Lower |
| Decision Trees | High | Medium |
| Random Forests | Medium | Higher |
| Neural Networks | Low | Highest |

In subsequent weeks, we'll explore these models and understand the trade-offs between interpretability and performance.