In [4]:
import mermaid as md

# Machine Learning Principles

We will learn about machine learning from 30,000 feet and see how machine-learning principles apply to modern AI systems.

You will learn more about machine learning in the [Machine Learning](https://www.coursera.org/learn/machine-learning-with-python) course at Coursera next week.

## Transforming Inputs to Outputs

Software development is mostly about writing functions that transform inputs to outputs.

For example, given data in a birth record and data about one of your ancestors, write a function to return True if the birth record belongs to that ancestor.

In traditional software development, you would write a sequence of carefully crafted steps to return the correct result.

In machine learning, you instead give a machine-learning algorithm hundreds or thousands (or more) of example input-output pairs, called **training data**, and let the algorithm learn how to determine the correct output (called a **label**) for a given input.
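
To make the contrast concrete, here is a minimal sketch of both approaches for the birth-record example. The field names (`name`, `birth_year`, `birth_place`, `year`, `place`) and the matching rules are assumptions made for illustration, not a real record schema.

```python
# Traditional approach: a programmer writes the matching steps by hand.
# All field names here are hypothetical.
def same_person_by_rules(person, record):
    return (
        person["name"] == record["name"]
        and person["birth_year"] == record["year"]
        and person["birth_place"] in record["place"]
    )

ancestor = {"name": "John Smith", "birth_year": 1883, "birth_place": "Illinois"}
matching_record = {"name": "John Smith", "year": 1883, "place": "Chicago, Illinois"}
other_record = {"name": "Jane Doe", "year": 1883, "place": "Chicago, Illinois"}

print(same_person_by_rules(ancestor, matching_record))
print(same_person_by_rules(ancestor, other_record))

# Machine-learning approach: instead of writing the steps, we collect
# (input, label) pairs and hand them to a learning algorithm.
training_data = [
    ((ancestor, matching_record), True),
    ((ancestor, other_record), False),
    # ... hundreds or thousands more examples
]
```

In the second approach nobody writes `same_person_by_rules`; the algorithm has to infer an equivalent function from `training_data`.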

## Machine-Learning Algorithms

There are many different machine-learning algorithms to determine what to output given an input. Some learn floating-point weights, others learn rules, and others learn complex combinations of rules and weights.

- Logistic Regression - weights
- Decision trees - rules
- XGBoost - rules + weights
- Deep neural networks - weights + non-linear functions

For example, a decision tree might learn that if the name, the year, and the place on the input birth record exactly match the name, birth year, and birth place of the input ancestor, then the birth record belongs to that ancestor and the function should output True.

## Parameters and Hyperparameters

The rules, weights, etc. that a machine-learning algorithm learns from the input-output examples are called its **parameters**.

But when faced with a function to write (like determining which records belong to which people in a tree), which algorithm do you choose? And each algorithm has several factors that govern how it behaves (for example, how deep can the learned decision tree be) - how can you determine the best combination of algorithm and governing factors?

These governing factors are called **hyperparameters**. They are not learned by the algorithm. You have to choose them up front, before the algorithm can learn its parameters.

## Model complexity

The thing that is learned by a machine-learning algorithm - the thing that transforms the inputs into the outputs - is called a **model**.

Models with more parameters (more weights or rules) can learn more complex input-to-output transformations, which are able to match (*fit*) the example input-output pairs better.

## Decision Tree Example

Suppose we have the following training data: inputs plus human-provided labels (outputs). Only a few examples are shown.

| Person Name | Birth Date | Birth Place | Record Name | Record Date | Record Place | Page # | Human Label |
|-------------|------------|-------------|------------|-------------|--------|-------------|---|
| John Smith  | about 1883 | Illinois    | Jane Doe | July 1, 1883 | Chicago, Illinois | 1 | False |
| John Smith  | about 1883 | Illinois    | John Smith | July 1, 1900 | Peru | 1 | False |
| John Smith  | about 1883 | Illinois    | John Smith | July 1, 1883 | Chicago, Illinois | 1 | True |
| John Doe    | about 1883 | Illinois    | John Doe   | July 1, 1883 | Chicago, Illinois | 2 | **False** |
| Jane Doe    | about 1883 | Illinois    | Jane Doe   | July 1, 1883 | Chicago, Illinois | 1 | True |

We hope the decision-tree algorithm learns something like the following:

In [14]:
%%mermaidjs
flowchart LR
    C{Names match?}
    C -->|No| D[False]
    C -->|Yes| E{Years match?}
    E -->|No| F[False]
    E -->|Yes| G{Places match?}
    G -->|No| H[False]
    G -->|Yes| I[True]
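
The tree above is just a function written as nested decisions. As a rough sketch, here it is in Python, applied to the five rows of the training table. Reducing each row to three booleans by hand is an assumption: we treat "about 1883" as matching "July 1, 1883" and "Illinois" as matching "Chicago, Illinois".

```python
def hoped_for_tree(names_match, years_match, places_match):
    """The decision tree from the diagram, written as nested decisions."""
    if not names_match:
        return False
    if not years_match:
        return False
    if not places_match:
        return False
    return True

# (names_match, years_match, places_match, human_label), derived by hand
# from the table rows above.
rows = [
    (False, True, True, False),   # John Smith vs. Jane Doe record
    (True, False, False, False),  # John Smith vs. 1900 Peru record
    (True, True, True, True),     # John Smith vs. matching record
    (True, True, True, False),    # John Doe, page 2, labeled False
    (True, True, True, True),     # Jane Doe vs. matching record
]

agree = sum(hoped_for_tree(n, y, p) == label for n, y, p, label in rows)
print(f"The tree agrees with {agree} of {len(rows)} human labels")
```

The one disagreement is the John Doe row, where the tree says True and the human label says False.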

## Underfitting

When a model doesn't have enough parameters, it isn't able to capture the complexity of the problem, and we say it **underfits** the training data. 

When a model underfits, it doesn't do a very good job generating the correct outputs from the inputs.

For example, if we set a hyperparameter that restricts our decision tree algorithm to make trees that contain only a single decision, then it won't be powerful enough to make good record-to-person-attachment decisions.

In [11]:
%%mermaidjs
flowchart LR
    C{Names match?}
    C -->|No| D[False]
    C -->|Yes| E[True]

## Overfitting

When a model has a lot of parameters, it is often able to match the training data perfectly or nearly perfectly.
But training data often has mistakes. When a model matches the training data so well that it even matches the mistakes in the training data, we say it **overfits** the training data. 

When a model overfits, it does really well on the training data but doesn't generalize well to unseen examples.

For example, remember this line in the training data? 

| Person Name | Birth Date | Birth Place | Record Name | Record Date | Record Place | Page # | Human Label |
|-------------|------------|-------------|------------|-------------|--------|-------------|---|
| John Doe    | about 1883 | Illinois    | John Doe   | July 1, 1883 | Chicago, Illinois | 2 | **False** |

**False** is likely a mistake. If we set a hyperparameter that says our decision tree algorithm is able to make trees that contain up to four decisions, it might make a decision tree that looks like this in order to match the training data perfectly.

In [21]:
%%mermaidjs
flowchart LR
    C{Names match?}
    C -->|No| D[False]
    C -->|Yes| E{Years match?}
    E -->|No| F[False]
    E -->|Yes| G{Places match?}
    G -->|No| H[False]
    G -->|Yes| I{Page number?}
    I -->|==2| J[False]
    I -->|!=2| K[True]
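
To see the difference between fitting the training data and generalizing, the sketch below writes the three trees from this notebook (single-decision, hoped-for, and overfit) as functions and scores each one on the training rows and on a few unseen rows. The unseen rows are invented for illustration, and the boolean features are derived by hand from the table.

```python
def underfit_tree(f):
    return f["names"]

def hoped_for_tree(f):
    return f["names"] and f["years"] and f["places"]

def overfit_tree(f):
    # Same as the hoped-for tree, plus a rule that memorizes the mislabeled row.
    return hoped_for_tree(f) and f["page"] != 2

def accuracy(tree, rows):
    return sum(tree(features) == label for features, label in rows) / len(rows)

def row(names, years, places, page, label):
    return ({"names": names, "years": years, "places": places, "page": page}, label)

# The five rows of the training table.
training_rows = [
    row(False, True, True, 1, False),
    row(True, False, False, 1, False),
    row(True, True, True, 1, True),
    row(True, True, True, 2, False),  # the likely labeling mistake
    row(True, True, True, 1, True),
]

# Hypothetical unseen examples with correct labels.
unseen_rows = [
    row(True, True, True, 2, True),
    row(True, False, False, 1, False),
    row(True, True, True, 3, True),
    row(False, True, True, 2, False),
]

for tree in (underfit_tree, hoped_for_tree, overfit_tree):
    print(tree.__name__, accuracy(tree, training_rows), accuracy(tree, unseen_rows))
```

The overfit tree is the only one that scores perfectly on the training rows, yet on the unseen rows it does worse than the hoped-for tree because it rejects every page-2 record.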

## Hyperparameter optimization

Picking a good set of hyperparameters makes a big difference in how good your model is at generating the correct outputs.

**Question:** How can you choose the best hyperparameters?

*Hint:* You can't simply choose the hyperparameters that give the best result on the training data because of overfitting.

**Answer:** You need more examples (input-output pairs)! This is a separate set of examples that your model hasn't been trained on. It's often called a **dev set** or **development dataset**.

### Hyperparameter optimization steps:

1. Choose a combination of hyperparameters.
2. Train your model using the training data with the chosen hyperparameters.
3. Evaluate how well your trained model does on the dev set.
4. Repeat until you can't find new combinations of hyperparameters that give better results than the results you've already gotten from previous combinations of hyperparameters. Return the best combination.
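
The steps above can be sketched as a simple loop. This is a minimal illustration, not a production learner: the tiny greedy tree trainer, the four boolean features (names match, years match, places match, page is 2), and both datasets are assumptions made up for the example. The only hyperparameter searched is `max_depth`, and the search is a plain grid over four values rather than an open-ended "repeat until no improvement".

```python
from collections import Counter

def majority(labels):
    """Most common label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def errors(rows):
    """How many rows a single majority-label leaf would get wrong."""
    labels = [label for _, label in rows]
    return sum(label != majority(labels) for label in labels)

def train_tree(rows, max_depth):
    """Greedily learn a tree over boolean features, at most max_depth decisions deep."""
    labels = [label for _, label in rows]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority(labels)  # leaf
    best = None
    for f in range(len(rows[0][0])):
        no = [r for r in rows if not r[0][f]]
        yes = [r for r in rows if r[0][f]]
        if no and yes and (best is None or errors(no) + errors(yes) < best[0]):
            best = (errors(no) + errors(yes), f, no, yes)
    if best is None:
        return majority(labels)
    _, f, no, yes = best
    return (f, train_tree(no, max_depth - 1), train_tree(yes, max_depth - 1))

def predict(tree, features):
    while isinstance(tree, tuple):
        f, no_branch, yes_branch = tree
        tree = yes_branch if features[f] else no_branch
    return tree

def accuracy(tree, rows):
    return sum(predict(tree, x) == y for x, y in rows) / len(rows)

T, F = True, False
# (names_match, years_match, places_match, page_is_2) -> label
train_rows = [
    ((F, T, T, F), F), ((T, F, F, F), F), ((T, T, T, F), T),
    ((T, T, T, T), F),  # labeling mistake, as in the table
    ((T, T, T, F), T), ((T, T, F, F), F), ((T, F, T, T), F),
    ((F, F, F, T), F), ((T, T, T, F), T),
]
dev_rows = [
    ((T, T, T, T), T), ((T, T, T, F), T), ((T, F, T, F), F),
    ((F, T, T, T), F), ((T, T, F, T), F), ((T, T, T, T), T),
]

best_depth, best_dev_accuracy = None, -1.0
for max_depth in [1, 2, 3, 4]:                    # step 1: choose hyperparameters
    tree = train_tree(train_rows, max_depth)      # step 2: train
    dev_accuracy = accuracy(tree, dev_rows)       # step 3: evaluate on the dev set
    print(max_depth, accuracy(tree, train_rows), dev_accuracy)
    if dev_accuracy > best_dev_accuracy:          # step 4: keep the best so far
        best_depth, best_dev_accuracy = max_depth, dev_accuracy

print("best max_depth:", best_depth)
```

On this toy data the deepest tree fits the training rows perfectly, including the mistake, but the dev set prefers a shallower tree. That is exactly why the choice is made on the dev set rather than on the training data.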

## Predicting "real world" model performance

**Question:** How can you know how well your model will do in the real world?

*Hint:* You can't simply re-use the dev set, because you chose your hyperparameters to optimize the model's performance on the dev set.

**Answer:** You need still more examples! This is a separate set of examples that your model hasn't been trained on, and you haven't used it even to choose the hyperparameters. This is called the **test set**. 

### Predicting real world model performance steps

1. After choosing a combination of hyperparameters, train your model. (You can train it using both the training and the dev set at this point.)
2. Evaluate how well your trained model does on the test set. This is your best estimate of how well it will do in the real world.

**Important:** Don't look into the mistakes made on the test set! If you review the mistakes your model makes on the test set and then use those results to make the model better, you will need a new test set in order to measure real-world performance.

## Data Pyramid

- Test dataset - smallest, used solely to estimate how well the model will perform in the real world. (hundreds of examples)
- Dev dataset - medium size, used to pick the best combination of hyperparameters. (hundreds to thousands of examples)
- Training dataset - largest, used to train the model parameters. (often thousands to tens of thousands of examples or even more)
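
One common way to get the three datasets is to shuffle a single pool of labeled examples and cut it into three disjoint pieces. The sketch below assumes a 5% test / 15% dev / 80% training split. Those fractions are illustrative only, chosen to mirror the pyramid's ordering.

```python
import random

def split_dataset(examples, dev_fraction=0.15, test_fraction=0.05, seed=0):
    """Shuffle once, then cut into disjoint train, dev, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_dev = round(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

# Stand-in for 1,000 labeled examples.
train, dev, test = split_dataset(range(1000))
print(len(train), len(dev), len(test))
```

Keeping the seed fixed matters: if the split changed between runs, examples could drift from the test set into the training set and quietly inflate the final estimate.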

## How do we create all this data?

It used to be that you could spend months creating the data and cleaning it up before you could start your machine-learning project. But with large language models (LLMs) you now have other options:

1. Find and use pre-trained models (the **P** in ChatG**P**T stands for **P**re-trained). Find a model that was trained on data similar to yours and build on top of it (called fine-tuning).
2. Ask an LLM like GPT-4 to generate the data for you. Use the best (expensive) LLM you can find. You can use the outputs from an expensive model as labels to train a cheaper model.
3. Ask humans to create the data. This is the most expensive approach, but usually provides the highest-quality data.

### Dataset quality

In general, the test dataset needs to be the highest quality, the dev dataset the next-highest quality, and you can tolerate some errors in the training dataset. However, I believe that every hour spent cleaning any of the datasets will save you at least two hours of headaches later on. You could get started quickly with computer-generated data and clean it later on. There are things you can do to identify the examples that are most likely to be incorrect and need human review.
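
As one possible heuristic (an assumption for illustration, not necessarily the technique the author has in mind), you can flag examples whose label disagrees with the majority label among examples that have identical inputs. The sketch uses hand-derived boolean features (names match, years match, places match) for the five rows of the decision-tree table.

```python
from collections import Counter, defaultdict

def flag_conflicting_labels(examples):
    """Return indexes of examples whose label differs from the majority
    label among examples with exactly the same features."""
    by_input = defaultdict(list)
    for index, (features, label) in enumerate(examples):
        by_input[features].append((index, label))
    flagged = []
    for group in by_input.values():
        majority_label = Counter(label for _, label in group).most_common(1)[0][0]
        flagged.extend(i for i, label in group if label != majority_label)
    return sorted(flagged)

examples = [
    ((False, True, True), False),
    ((True, False, False), False),
    ((True, True, True), True),
    ((True, True, True), False),  # John Doe, page 2
    ((True, True, True), True),
]
needs_review = flag_conflicting_labels(examples)
print(needs_review)
```

Here only the John Doe row is flagged, so a human reviewer looks at one example instead of five.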

## Application to AI projects

Now let's learn how these machine-learning principles apply to our personal and group AI projects.

Last week we talked about one way to create and deploy a personal project quickly. It's important to get something up and running as soon as you can. Once it's deployed you can work on making it better. Let's call this initial commit your **baseline**. 

Once you have implemented your baseline the next step is to improve it.

## How can I improve my baseline?

Here are the steps involved in a RAG question-answer AI system. Other generative AI systems will have similar steps. Each step has several different algorithms to try, each with different hyperparameters that can be set. 

1. Load data - where is your data coming from? do you need to clean or filter out bad data? deduplicate?
2. Split data into chunks - based on # characters, markdown titles, or similarity?
3. Index the data - vector, keyword, or hybrid? try different embedding models?
4. Retrieve the data - query rewrite? semantic router? rerank?
5. Send the data to pre-trained model for a generated response - try different prompts (prompt engineering)?
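
As an example of a step with tunable hyperparameters, here is a minimal sketch of step 2 using character-count chunking. The `chunk_size` and `overlap` parameters and their defaults are assumptions for illustration. Real chunking libraries offer more options, such as splitting on markdown titles.

```python
def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size characters,
    where each chunk repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

text = "abcdefghij" * 5  # 50 characters
chunks = split_into_chunks(text, chunk_size=20, overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Both `chunk_size` and `overlap` are hyperparameters in the sense used earlier: you choose them up front, and different values can noticeably change what the retrieval step finds.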

## How can I know if my new approach is better than the baseline?

You need to gather a set of inputs, and compare the quality of the outputs of your new approach against the quality of the outputs of your baseline on those same inputs.

This is hyperparameter optimization, same as for machine-learning projects. You need a way to **evaluate** the quality of your project's outputs so you can compare one approach against another.
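
A minimal sketch of such a comparison is shown below. The two "systems" are toy stand-ins for a real RAG pipeline, the questions are made up, and the exact-match scorer is only one possible way to measure quality.

```python
def average_score(system, examples, score):
    """Run the system on every input and average the per-example scores."""
    return sum(score(system(question), expected) for question, expected in examples) / len(examples)

def exact_match(output, expected):
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

# Hypothetical evaluation set: (input, ideal output) pairs.
examples = [
    ("Where was John Smith born?", "Illinois"),
    ("When was John Smith born?", "1883"),
    ("Where was Jane Doe born?", "Illinois"),
]

def baseline(question):
    return "Illinois"  # toy baseline: always gives the same answer

def new_approach(question):
    return "1883" if question.startswith("When") else "Illinois"

baseline_score = average_score(baseline, examples, exact_match)
new_score = average_score(new_approach, examples, exact_match)
print(f"baseline={baseline_score:.2f} new={new_score:.2f}")
```

Because both systems are scored on the same inputs with the same scorer, the two numbers are directly comparable.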

### Options for inputs:

1. Find a source for real-world user inputs.
2. Ask an LLM to generate inputs based upon your data (e.g., given text paragraphs, generate one or more questions that could be answered by each paragraph).

### Options for outputs (labels):

1. Eyeball it - review a dozen examples to see which approach is better. This approach is the easiest to start with, but will get very tiresome over time since you have to re-eyeball results for every approach.
2. Ask an LLM to review the inputs and outputs and judge how correct the outputs are. This approach is the next easiest, but may not give very accurate judgments, which may lead to choosing less-than-ideal hyperparameters.
3. Ask an LLM to generate "ideal/correct" outputs for each input. This can work if you ask a large, expensive LLM like GPT-4 to generate outputs for simple inputs, like simple questions you generated from your text paragraphs.
4. Ask humans to generate "ideal/correct" outputs for each input.

## But wait, there's one more problem!

If you choose option 3 or 4, you now have another issue: how do you judge how close the computer-generated output is to the ideal/correct output when the two may use different words that mean the same thing?

You can ask an LLM to review the inputs, the human-generated output, and the computer-generated output and tell you how close the computer-generated output is to the human-generated output. This is a much simpler problem than asking an LLM to come up with ideal/correct outputs to begin with, so this generally works ok.
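
A rough sketch of the plumbing for such an LLM judge follows. The prompt wording, the 1-5 scale, and the `call_llm` parameter are all assumptions. `call_llm` stands in for whatever function sends a prompt to your LLM and returns its text reply, and the `fake_llm` stub exists only so the example runs without any API.

```python
import re

JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
On a scale of 1 (different meaning) to 5 (same meaning), how close is the
candidate answer to the reference answer? Reply with a single number."""

def build_judge_prompt(question, reference, candidate):
    return JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate)

def parse_score(response):
    """Pull the first standalone digit 1-5 out of the reply, or None if absent."""
    match = re.search(r"\b[1-5]\b", response)
    return int(match.group()) if match else None

def judge(question, reference, candidate, call_llm):
    return parse_score(call_llm(build_judge_prompt(question, reference, candidate)))

def fake_llm(prompt):
    """Stub used only for this example; a real system would call an LLM here."""
    return "Score: 4"

score = judge(
    "Where was John Smith born?",
    "John Smith was born in Chicago, Illinois.",
    "He was born in Chicago.",
    call_llm=fake_llm,
)
print(score)
```

Returning `None` for an unparseable reply, instead of guessing a score, makes it easy to count how often the judge fails to follow the requested format.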

### But isn't this just another machine-learning problem? 

Given the original inputs, the human-generated output, and the computer-generated output we want to return how close the computer-generated output is to the human-generated output. We can ask an LLM to score them as in the previous slide, but another option is to learn a function to score them.

First have humans label a bunch of these (input, human-generated-output, computer-generated-output) triples with a score of 1-5 (5=computer- and human-generated outputs are the same; 1=different). Now train a model on these labeled triples so the computer outputs scores similar to the human scores.

Training a separate model to be the judge for the first model is extra work but will likely result in a more-accurate judge that may eventually be worth the extra work.
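
As a deliberately tiny illustration of a learned judge, the sketch below fits a straight line from a single hand-picked feature (word overlap between the two outputs) to the human scores. The feature, the five labeled triples, and the use of simple least squares are all assumptions. A real judge would be trained on far more data, use richer features or a neural model, and would likely also use the original input, which this sketch ignores.

```python
def word_overlap(a, b):
    """Jaccard overlap between the word sets of two strings (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

# Hypothetical labeled data:
# (input, human-generated output, computer-generated output, human score 1-5)
labeled_triples = [
    ("Where was John born?", "John was born in Chicago", "John was born in Chicago", 5),
    ("Where was John born?", "John was born in Chicago", "He was born in Chicago Illinois", 4),
    ("Where was John born?", "John was born in Chicago", "Peru", 1),
    ("When did Jane die?", "Jane died in 1950", "Jane died in 1950", 5),
    ("When did Jane die?", "Jane died in 1950", "She was born in 1950", 2),
]

features = [word_overlap(human, computer) for _, human, computer, _ in labeled_triples]
scores = [score for _, _, _, score in labeled_triples]
slope, intercept = fit_line(features, scores)

def learned_judge(human_output, computer_output):
    return slope * word_overlap(human_output, computer_output) + intercept

print(round(learned_judge("Jane died in 1950", "Jane died in 1950"), 2))
print(round(learned_judge("Jane died in 1950", "Peru"), 2))
```

Here `slope` and `intercept` are the judge's learned parameters, so everything said earlier about training, dev, and test sets applies to the judge as well.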

## Take-away for today

Start figuring out how you will get the data you plan to use to **evaluate** your personal AI project.

- What are the inputs and the outputs?
- Where will you get it from?
  - Can you find it online, will you have a computer generate it, or will it be human-generated?

## Coming up next

We talked about using a development dataset to choose the hyperparameters of our AI systems, and we can similarly use a test dataset to test the performance of our AI systems on real-world inputs.

But we haven't talked about how to use a training dataset to train an AI system. That will be the focus in two weeks, using an amazing framework called [DSPy](https://github.com/stanfordnlp/dspy).

Next we will show how to use LlamaIndex, a library called [Optuna](https://optuna.org/) that makes it easy to do hyperparameter optimization, and a library called [Arize Phoenix](https://phoenix.arize.com/) that makes it easy to evaluate the quality of your model's outputs.