## Introduction to Neural Networks

Session content:
* ML is about learning functions (15 min)
* Gradient descent helps find the "best" function (1 min)
* Neural networks (and their descendants) are very good function approximators (30 min)

## ML is about learning functions (15 min)

Think about Spotify's "Recommended Songs" feature. How does it know what songs you might like? At its core, it's using a function that:
- Takes in: Your listening history, liked songs, and other user data
- Gives out: A list of songs you might enjoy

This is what machine learning is all about - creating functions that can learn from data to make predictions or decisions. Let's practice identifying these input/output patterns in different ML systems.

For each example below:
1. First, think about what information the system needs (inputs)
2. Then, consider what the system should produce (outputs)
3. Finally, try writing the function header yourself before checking the solution

I'll do an example myself so that you can see what we're aiming for.

**Sentiment Classifier**
A movie theater wants to know whether the Instagram comments about their movies are positive or negative. The machine learning system they'll use will probably look like:

In [None]:
def classify_sentiment(movie_comment: str) -> float:
    """
    Input: A text message like "I love this movie!"
    Output: A number from -1 (very negative) to 1 (very positive)
    """
    pass # The 'pass' keyword indicates the function is not yet implemented

Try a couple of other examples yourself!

**Image classifier**  
A self-driving car company wants to know whether an image shows a pedestrian. In the code block below, try writing the function header for that machine learning system.

<details>
    <summary>One Possible Solution (Only look after trying yourself!!)</summary>
    ```python
    
    import numpy as np

    def is_pedestrian(image: np.ndarray) -> bool:
        """
        Input: A numpy array representing an image. The shape of the array is (height, width, 3)
        where the last dimension indicates the color in RGB with three numbers, each for red, green and blue.
        Output: A boolean (True or False) indicating whether the image shows a pedestrian
        """
        pass
    ```    
</details>

In [None]:
# Write the function header for the image classifier here:


An interesting question from the previous exercise is: how can we pass an image to an ML system in a way it understands? While we might think to use image files (like PNG or JPEG), to the system these are just collections of 1s and 0s that are hard to process directly.

Under the hood, ML systems can only work with numbers. Whether we're dealing with text, images, or audio, there's always a process to convert the input into numbers that the system can understand (and sometimes convert those numbers back into a format we humans can interpret).

For images, this conversion process is fairly straightforward. Think of an image as a grid of tiny squares called pixels. Each pixel is like a mixture of three colors - red, green, and blue - where we measure how much of each color we use (from 0 to 1). For example:

- (1, 0, 0) means "full red, no green, no blue" = Pure red
- (0, 1, 0) means "no red, full green, no blue" = Pure green
- (0.5, 0.5, 0.5) means "half of each" = Gray

When we represent an image in this way, it becomes an array with three dimensions:

- height = number of rows in our grid
- width = number of columns
- 3 = our red, green, blue measurements for each pixel
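
As a minimal sketch of this layout (the array name `tiny_image` and its colors are made up for illustration), here is a 2x2 image built by hand. It shows how to read one pixel and how to pull out a single color channel, which is the same slicing notation the later cells rely on:

```python
import numpy as np

# A hypothetical 2x2 image: each pixel holds (red, green, blue) values from 0 to 1
tiny_image = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # top row: pure red, pure green
    [[0.0, 0.0, 1.0], [0.5, 0.5, 0.5]],   # bottom row: pure blue, gray
])

print(tiny_image.shape)     # (height, width, 3)
print(tiny_image[0, 1])     # the pixel in row 0, column 1
print(tiny_image[:, :, 0])  # the red value of every pixel, as a 2x2 grid
```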

Let's see how this works in practice. In the code below, you can change the values of red_values, blue_values, and green_values to see how different color combinations create different images.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

red_values = [
    [0.8, 0, 0],
    [0, 0.7, 0],
    [0, 0, 0.9]
]
blue_values = [
    [0, 0, 0.6],
    [0, 0.5, 0],
    [0.7, 0, 0]
]
green_values = [
    [0, 0.8, 1],
    [0.6, 0, 0],
    [0, 0, 0.5]
]

# Create image matrices of shape (height, width, [R, G, B])
red_img = np.zeros((3, 3, 3))
blue_img = np.zeros((3, 3, 3))
green_img = np.zeros((3, 3, 3))

# Set the values for each channel
red_img[:, :, 0] = np.array(red_values)  # Red channel
blue_img[:, :, 2] = np.array(blue_values)  # Blue channel
green_img[:, :, 1] = np.array(green_values)  # Green channel
combined_img = red_img + blue_img + green_img

def plot_image_matrices(red_img, blue_img, green_img, combined_img):
    # Create a figure with 4 subplots side by side
    fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(16, 4))

    # Plot each image
    ax1.imshow(red_img)
    ax1.set_title('Red Values')
    ax1.axis('off')

    ax2.imshow(blue_img)
    ax2.set_title('Blue Values') 
    ax2.axis('off')

    ax3.imshow(green_img)
    ax3.set_title('Green Values')
    ax3.axis('off')

    ax4.imshow(combined_img)
    ax4.set_title('Combined Values')
    ax4.axis('off')

    plt.tight_layout()
    plt.show()

plot_image_matrices(red_img, blue_img, green_img, combined_img)


We can also see how the process looks for a real image from the internet. I lowered the resolution of the image so that it's more evident that it's composed of individual pixels.

In [None]:
# Load and display a sample image
from PIL import Image
import requests
from io import BytesIO

# Get a small sample image from the internet (an astronaut picture)
url = "https://raw.githubusercontent.com/scikit-image/scikit-image/master/skimage/data/astronaut.png"
response = requests.get(url)
response.raise_for_status()  # Stop with a clear error if the download failed
img = Image.open(BytesIO(response.content))

def plot_image_by_channels(img):
    # Resize to a very low resolution (e.g., 64x64)
    small_img = img.resize((64, 64))

    # Convert to numpy array
    img_array = np.array(small_img)

    # Create figure and display
    plt.figure(figsize=(15, 4))

    # Original image
    plt.subplot(1, 5, 1)
    plt.imshow(img)
    plt.title('Original Image')
    plt.axis('off')

    # Resized image
    plt.subplot(1, 5, 2)
    plt.imshow(img_array)
    plt.title('64x64 Resolution')
    plt.axis('off')

    # Red channel
    plt.subplot(1, 5, 3)
    plt.imshow(img_array[:,:,0], cmap='Reds')
    plt.title('Red Channel')
    plt.axis('off')

    # Green channel
    plt.subplot(1, 5, 4)
    plt.imshow(img_array[:,:,1], cmap='Greens')
    plt.title('Green Channel')
    plt.axis('off')

    # Blue channel
    plt.subplot(1, 5, 5)
    plt.imshow(img_array[:,:,2], cmap='Blues')
    plt.title('Blue Channel')
    plt.axis('off')

    plt.tight_layout()
    plt.show()

    # Print the shape of the low-resolution image array
    print(f"Low resolution image shape: {img_array.shape}")

plot_image_by_channels(img)
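
Images loaded from files typically store each color value as an integer from 0 to 255 rather than a decimal from 0 to 1. The sketch below shows the conversion between the two scales on a made-up two-pixel array (`pixels_uint8` and `pixels_float` are hypothetical names, not part of the cell above):

```python
import numpy as np

# Hypothetical 1x2 image stored the way most image files are: integers 0..255
pixels_uint8 = np.array([[[255, 0, 0], [128, 128, 128]]], dtype=np.uint8)

# Dividing by 255 rescales every value into the 0..1 range used earlier
pixels_float = pixels_uint8.astype(np.float32) / 255.0

print(pixels_uint8.dtype, pixels_uint8.max())
print(pixels_float.dtype, pixels_float.max())
```

Both versions describe the same picture; only the scale of the numbers changes.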


If you're interested, you can ask an AI assistant how text gets encoded into numbers, which is a necessary step in ML systems like ChatGPT (e.g., "How does ChatGPT encode text? I'm only slightly familiar with neural networks and machine learning").
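
As a rough illustration of the idea (this is not how ChatGPT does it; real systems use learned subword tokenizers), the sketch below maps each distinct word to an integer ID. The function names `build_vocabulary` and `encode_words` and the tiny vocabulary are made up for this example:

```python
def build_vocabulary(texts: list[str]) -> dict[str, int]:
    """Assign an integer ID to every distinct lowercase word, in order of first appearance."""
    vocabulary: dict[str, int] = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def encode_words(text: str, vocabulary: dict[str, int]) -> list[int]:
    """Turn a sentence into a list of word IDs; unknown words get -1."""
    return [vocabulary.get(word, -1) for word in text.lower().split()]


vocabulary = build_vocabulary(["I love this movie", "I hate this movie"])
print(vocabulary)
print(encode_words("I love this", vocabulary))
```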

For now, we'll ignore this process and assume ML systems can receive text directly. What would the function header look like for this ML system?

**LLM Chatbot**
OpenAI is creating a chatbot that can answer questions on chatgpt.com. To reduce costs, this chatbot does not accept images or voice, only text prompts. Similarly, it only outputs text.

<details>
    <summary>One Possible Solution</summary>
    ```python
    def chatbot(user_message: str) -> str:
        """
        Input: A string with the user's message
        Output: A string with the chatbot's response
        """
        pass
    ```    
</details>


In [None]:
# Write the function header for the chatbot here:


## Gradient descent helps find the "best" function (1 min)

Although the math and intuition behind gradient descent are incredibly cool, we don't have enough time to cover them here. If you're interested, I recommend this 3Blue1Brown video ([English](https://www.youtube.com/watch?v=IHZwWFHWa-w), [español](https://www.youtube.com/watch?v=mwHiaTrQOiI)). Also, if you want a challenge, you can try implementing the gradient descent algorithm yourself on [this page](https://neetcode.io/problems/gradient-descent).
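
For a quick taste of the idea, here is a minimal sketch of gradient descent on a made-up one-parameter problem: finding the `w` that minimizes `(w - 3)**2`. The loss function, learning rate, and number of steps are arbitrary choices for illustration.

```python
def loss(w: float) -> float:
    """A toy loss whose minimum is at w = 3."""
    return (w - 3) ** 2


def gradient(w: float) -> float:
    """Derivative of the toy loss with respect to w."""
    return 2 * (w - 3)


w = 0.0              # starting guess
learning_rate = 0.1  # how big a step to take each time

for step in range(100):
    w = w - learning_rate * gradient(w)  # step downhill, against the gradient

print(f"w = {w:.4f}, loss = {loss(w):.6f}")
```

Each step nudges `w` in the direction that lowers the loss, so the guess drifts toward 3. Training a neural network follows the same recipe, just with many more parameters.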

## Neural Networks: The Ultimate Function Learners (30 min)

Remember how we described machine learning as finding the right function for a task? To do this effectively, we need a flexible system that can be shaped into many different types of functions - like clay that can be molded into any shape. This is where neural networks shine.

Neural networks have revolutionized machine learning because they're incredibly good at learning complex patterns. They have three key advantages:

* **Expressive Power**: Neural networks are universal function approximators - a fancy way of saying that if you give them enough neurons and tune them correctly, they can approximate practically any continuous function as closely as you like. Think of it like having enough LEGO blocks to build anything you can imagine.

* **Parallel Processing**: Training neural networks involves doing many similar calculations at once. Modern hardware, especially graphics cards (GPUs), is very good at this kind of parallel processing, which makes large neural networks practical to train and run.

* **Pattern Recognition**: Advanced neural networks (like the Transformers used in ChatGPT) are remarkably efficient at spotting patterns in data. While they need lots of training data, the amount required is actually achievable with today's technology.
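
As a small illustration of expressive power, the sketch below builds by hand (no training involved) a network with two ReLU neurons whose outputs add up to the absolute value function. The weights are picked manually for this example; with more neurons, the same trick of adding up bent lines can trace out more complicated curves.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    """ReLU activation: keep positive values, replace negatives with 0."""
    return np.maximum(0, x)


def tiny_network(x: np.ndarray) -> np.ndarray:
    """Two hidden neurons with hand-picked weights +1 and -1, summed at the output."""
    hidden_1 = relu(1.0 * x)    # passes positive inputs through
    hidden_2 = relu(-1.0 * x)   # flips negative inputs to positive
    return hidden_1 + hidden_2  # together: the absolute value of x


inputs = np.array([-2.0, -0.5, 0.0, 1.5])
print(tiny_network(inputs))
```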

In this notebook, we'll focus on Multi-Layer Perceptrons (MLPs) - the simplest type of neural network. While MLPs might seem basic compared to the neural networks powering today's AI systems, they're perfect for learning the core concepts. Think of them as the "Hello World" of neural networks - once you understand MLPs, you'll have a strong foundation for understanding more advanced architectures like RNNs, CNNs, and Transformers.

Let's experiment with different types of ML models to see how they perform on a classic problem: recognizing handwritten digits (MNIST dataset).

You can adjust:
- MLP Width: How many neurons are in each layer (more = more complex patterns)
- MLP Depth: How many layers of neurons (more = deeper patterns)
- Tree Depth: How detailed the decision tree's rules can be
- Dataset Size: How many examples we use for training

Try to answer:
1. What happens when you increase the width vs. the depth?
2. Does more training data always help?
3. Which model seems to learn fastest with limited data?
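
To make width and depth concrete, the sketch below counts how many adjustable numbers (weights and biases) an MLP has for a given setting. The helper name `count_mlp_parameters` is made up for this example; it assumes every hidden layer has the same width, that each MNIST image has 28x28 = 784 pixels, and that there are 10 possible digits.

```python
def count_mlp_parameters(n_inputs: int, width: int, depth: int, n_outputs: int) -> int:
    """Count weights and biases in an MLP with `depth` hidden layers of `width` neurons each."""
    layer_sizes = [n_inputs] + [width] * depth + [n_outputs]
    total = 0
    for size_in, size_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += size_in * size_out  # one weight per connection between the two layers
        total += size_out            # one bias per neuron in the receiving layer
    return total


# Compare two widths at the same depth
print(count_mlp_parameters(n_inputs=784, width=32, depth=2, n_outputs=10))
print(count_mlp_parameters(n_inputs=784, width=64, depth=2, n_outputs=10))
```

Most of the parameters sit in the first layer, because every one of the 784 pixels connects to every neuron in it.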

In [None]:
from sklearn.datasets import fetch_openml

# Load MNIST dataset
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # Scale pixel values

# Plot some example MNIST digits
plt.figure(figsize=(10, 2))
for i in range(5):
    plt.subplot(1, 5, i+1)
    plt.imshow(X[i].reshape(28, 28), cmap='gray')
    plt.title(f'Label: {y[i]}')
    plt.axis('off')
plt.tight_layout()
plt.show()

print(X.shape, "The images are 28x28 pixels, so 784 pixels once flattened.")

In [None]:
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def compare_predictions():
    """Compare predictions of different models on MNIST."""
    
    # Create widgets for hyperparameters
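    # continuous_update=False retrains only when a slider is released, not on every drag step
    mlp_width = widgets.IntSlider(value=32, min=4, max=64, description='MLP Width:', continuous_update=False)
    mlp_depth = widgets.IntSlider(value=2, min=1, max=8, description='MLP Depth:', continuous_update=False)
    tree_depth = widgets.IntSlider(value=5, min=1, max=20, description='Tree Depth:', continuous_update=False)
    train_size = widgets.IntSlider(value=1_000, min=100, max=5_000, description='Dataset Size:', continuous_update=False)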
    # Create placeholder for output
    output = widgets.HTML()
        
    val_size = 2_000    
    
    def train_and_plot(mlp_width, mlp_depth, tree_depth, train_size):
        
        output.value = "Training models..."
        
        X_train, y_train = X[:train_size], y[:train_size]
        X_val, y_val = X[-val_size:], y[-val_size:]

        # Train logistic regression (it has no slider, but depends on the dataset size)
        lr_model = LogisticRegression(max_iter=1000)
        lr_model.fit(X_train, y_train)
        lr_acc = lr_model.score(X_val, y_val)
        
        # Train decision tree
        dt_model = DecisionTreeClassifier(max_depth=tree_depth)
        dt_model.fit(X_train, y_train)
        dt_acc = dt_model.score(X_val, y_val)
        
        # Train MLP
        mlp_model = MLPClassifier(
            hidden_layer_sizes=(mlp_width,)*mlp_depth,
            activation='relu',
            max_iter=1000
        )
        mlp_model.fit(X_train, y_train)
        mlp_acc = mlp_model.score(X_val, y_val)
    
        output.value = (f"MLP Accuracy: {mlp_acc:.2f}<br>"
                        f"Logistic Regression Accuracy: {lr_acc:.2f}<br>"
                        f"Decision Tree Accuracy: {dt_acc:.2f}")
    
    controls = {'mlp_width': mlp_width, 'mlp_depth': mlp_depth, 'tree_depth': tree_depth, 'train_size': train_size}
    widgets.interact(train_and_plot, **controls)
    display(output)

compare_predictions()

Looking at our results, something interesting emerges: the logistic regression performs surprisingly well, often matching or even outperforming the MLP on the MNIST dataset. While both these methods significantly outperform the decision tree, the MLP's advantage isn't as dramatic as we might expect. (Don't worry though - for more complex problems beyond MNIST, MLPs typically show much greater benefits.)

However, the MLP's flexibility comes with a trade-off: MLPs are much harder to interpret than simpler models. With logistic regression or decision trees, we can easily visualize and understand how they make decisions. In contrast, when we try to visualize an MLP's internal workings, we get something that looks like this:

![MLP weights visualization](https://i.sstatic.net/Z5L70.png)

In this visualization, each circle represents a neuron that can be activated by different inputs, and the lines show how earlier neurons influence later ones through weighted connections. While pretty, this complexity makes it hard to understand exactly how the network makes its decisions.
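
In code, what each circle computes is small: a weighted sum of the previous layer's values, plus a bias, passed through an activation function. Here is a minimal sketch of one layer with made-up numbers (the weights, biases, and the choice of ReLU are illustrative, not taken from the network in the picture):

```python
import numpy as np

inputs = np.array([1.0, 0.0, 0.5])  # values coming from three earlier neurons

# One row of weights per neuron in this layer (2 neurons, 3 incoming connections each)
weights = np.array([
    [0.5, -1.0, 2.0],
    [-1.0, 0.3, 0.0],
])
biases = np.array([0.1, 0.2])

weighted_sums = weights @ inputs + biases    # one number per neuron
activations = np.maximum(0, weighted_sums)   # ReLU: negative sums become 0

print(weighted_sums)
print(activations)
```

An MLP repeats this step layer after layer, feeding each layer's activations into the next.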

To better understand how neural networks work, let's zoom in and experiment with a much simpler version. Here's an interesting way to think about it: computers can perform complex tasks like sending emails, playing videos, or running Python code by combining very simple operations (like AND and OR) many times. Similarly, if we can show that neural networks can perform these basic operations, we can understand how they might be combined to tackle more complex tasks.

Let's try this hands-on by building the simplest possible neural network: one with just two inputs (each either 0 or 1) and a single neuron for output. Your challenge is to make this tiny network perform basic logical operations:

- For AND: The output should be ON only when both inputs are ON
- For OR: The output should be ON when either input (or both) is ON

You can adjust the network's behavior using:
- Weights (shown as lines): Blue = positive influence, Red = negative influence
- Bias: An additional value that makes it easier or harder for the neuron to activate

The darkness of the neuron shows how strongly it's activated.

Try different combinations and see if you can make the network properly implement these logical operations!

In [None]:
import ipywidgets as widgets
from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np

# Create widgets for weight adjustment
target_function = widgets.Dropdown(options=['AND', 'OR'], description='Target Function:')
w1_slider = widgets.FloatSlider(value=0.0, min=-2.0, max=2.0, step=0.1, description='Weight 1:')
w2_slider = widgets.FloatSlider(value=0.0, min=-2.0, max=2.0, step=0.1, description='Weight 2:') 
bias_slider = widgets.FloatSlider(value=0.0, min=-2.0, max=2.0, step=0.1, description='Bias:')
first_input_on = widgets.Checkbox(value=False, description='First input ON')
second_input_on = widgets.Checkbox(value=False, description='Second input ON')


def plot_neuron(target_function, w1, w2, bias, first_input_on, second_input_on):
    plt.clf()
    
    relu = lambda x: max(0, x)
    
    first_input_coords = (0, 1)
    second_input_coords = (1, 1)
    neuron_coords = (0.5, 0.5)
    
    # Convert the checkbox states into numeric inputs (1 = ON, 0 = OFF)
    x1 = 1 if first_input_on else 0
    x2 = 1 if second_input_on else 0
    
    # Calculate actual and expected values
    actual = relu(w1*x1 + w2*x2 + bias)
    # Expected output of the selected gate (AND or OR)
    expected = (x1 and x2) if target_function == 'AND' else (x1 or x2)
    
    # Plot input points with fill based on input state
    plt.plot(first_input_coords[0], first_input_coords[1], 'ko', markersize=10,
             fillstyle='full' if first_input_on else 'none')
    plt.plot(second_input_coords[0], second_input_coords[1], 'ko', markersize=10,
             fillstyle='full' if second_input_on else 'none')
    
    # Plot neuron with activation-based color
    fill_color = f"{1 - np.clip(actual, 0, 1):.2f}"
    plt.plot(neuron_coords[0], neuron_coords[1], 'ko', markersize=20,
             fillstyle='full', markerfacecolor=fill_color)
    
    # Plot connections with weight-based colors

    plt.plot(*zip(first_input_coords, neuron_coords), 
            color='blue' if w1 > 0 else 'red',
            alpha=abs(w1/2),
            linewidth=2)
    plt.plot(*zip(second_input_coords, neuron_coords),
            color='blue' if w2 > 0 else 'red',
            alpha=abs(w2/2),
            linewidth=2)
    
    # Remove grid and spines
    plt.gca().axis('off')
    plt.xlim(-0.5, 1.5)
    plt.ylim(-0.5, 1.5)
    
    # Print values
    is_on = lambda x: "ON" if x > 0 else "OFF"
    plt.text(-0.4, -0.4, f'Expected: {is_on(expected)}\nActual: {is_on(actual)}')
    print('The plots appear twice, sorry!')
    plt.show()
    
# Display interactive widgets
widgets.interactive(
    plot_neuron, 
    target_function=target_function,
    w1=w1_slider,
    w2=w2_slider,
    bias=bias_slider,
    first_input_on=first_input_on,
    second_input_on=second_input_on,
)
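
Clicking through all four input combinations by hand gets tedious. The helper below is a small sketch (the name `implements_gate` is made up for this example) that applies the same ReLU rule as the widget and reports whether a given set of weights and bias matches the target function for every input pair:

```python
def implements_gate(w1: float, w2: float, bias: float, gate: str) -> bool:
    """Return True if a single ReLU neuron with these parameters matches AND or OR on all inputs."""
    for x1 in (0, 1):
        for x2 in (0, 1):
            activation = max(0, w1 * x1 + w2 * x2 + bias)  # same rule as the widget
            actual_on = activation > 0
            expected_on = bool(x1 and x2) if gate == 'AND' else bool(x1 or x2)
            if actual_on != expected_on:
                return False
    return True


# With everything at zero the neuron never turns on, so neither gate is matched
print(implements_gate(0.0, 0.0, 0.0, 'AND'))
print(implements_gate(0.0, 0.0, 0.0, 'OR'))
```

Plug in the slider values you found to confirm your answer.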

Now that we've seen how a single neuron can carry out basic logical operations like AND and OR, let's zoom out and consider how these same principles can tackle much more complex challenges - like teaching a computer to have human-like conversations.

Imagine we have a dataset of written conversations and want to teach an ML model to participate in them naturally. We might think to use an MLP like the ones we've explored, but we'd quickly encounter several challenges:

* Overfitting: Just as our simple network needed the right balance of weights to learn AND/OR operations, a conversation model needs to learn genuine patterns of human communication rather than just memorizing specific examples.
* Model size: Remember how our digit recognition improved with larger networks? For something as complex as human language, we'd need a dramatically larger network - think billions of adjustable weights instead of tens of thousands.
* Computational cost: Training such a massive network on enough conversation data to make it useful would require enormous computing power - thousands of specialized chips (GPUs) running for months.

This was exactly the challenge OpenAI tackled. They started small, with GPT-1: a relatively simple neural network trained on a modest collection of books (you can try it [here](https://huggingface.co/spaces/mkmenta/try-gpt-1-and-gpt-2)). While this first attempt produced rather clumsy text, it proved something important: neural networks could begin to grasp human language patterns.

The truly remarkable discovery was that scaling up this approach - using bigger networks, more data, and more computing power - led to systems that could engage in surprisingly human-like conversation. Just as our simple networks learned to combine basic operations into more complex behaviors, these larger networks learned to combine basic language patterns into meaningful dialogue.

However, getting to this point required moving beyond the MLP architecture we've explored today. In our next session, we'll dive into Transformers - the specialized neural network architecture that powers modern AI language models like GPT-4, Claude, and Llama 3. We'll see how they build upon the fundamental concepts we've learned while introducing clever innovations that make them particularly well-suited for processing language.