## Introduction to Neural Networks
Lesson Plan:
* ML is about learning functions
* Gradient descent helps find the "best" function
* Neural networks (and their descendants) are very good function approximators

## ML is about learning functions (15 min)

To get an intuition for how current machine learning systems work, we'll pick a couple of real-world ML systems and try to express them as functions.

For each example below:
1. Think about what information the function needs (inputs)
2. Think about what the function should return (outputs)
3. Try writing the function header yourself before checking the solution

I'll do an example myself so that you can see what we're aiming for.

**Sentiment Classifier**
A movie theater wants to know whether the Instagram comments about their movies are positive or negative. The machine learning system they'll use will probably look like:

In [None]:
def classify_sentiment(movie_comment: str) -> float:
    """
    Input: A text message like "I love this movie!"
    Output: A number from -1 (very negative) to 1 (very positive)
    """
    pass # The 'pass' keyword indicates the function is not yet implemented

Try a couple other examples yourself!

**Image classifier**  
A self-driving car company wants to know whether an image shows a pedestrian. In the code block below, try writing the function header for that machine learning system.

<details>
    <summary>One Possible Solution (Only look after trying yourself!!)</summary>
    ```python
    
    import numpy as np

    def is_pedestrian(image: np.ndarray) -> bool:
        """
        Input: A numpy array representing an image. The shape of the array is (height, width, 3)
        where the last dimension indicates the color in RGB with three numbers, each for red, green and blue.
        Output: A boolean (True or False) indicating whether the image shows a pedestrian
        """
        pass
    ```    
</details>

In [None]:
# Create the function header for the image classifier here:


An interesting question from the previous exercise is how to pass the image to the ML system in a form it can work with. We could pass an image file (e.g., PNG or JPEG), but to the system that is just a long sequence of 1s and 0s that is hard to make sense of directly.

Under the hood, ML systems can only work with inputs and outputs that are numbers. When we pass in a piece of text, an image, or an audio clip, an underlying process converts it into numbers, and another converts the numeric output back into a format we can understand.

For images this process is fairly straightforward. We just take every pixel and extract the intensity (from 0 to 1) of three base colors: red, green, and blue. Then we represent the image as a collection of all the pixel values arranged as an array of shape (height, width, 3).
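As a quick sketch of that layout, here is a tiny hand-made image. The 2x2 size, the pixel colors, and the variable names are made up for illustration. The last lines assume the common case where an image file stores each color as an integer from 0 to 255, which can be rescaled to the 0 to 1 range by dividing by 255.

```python
import numpy as np

# A hypothetical 2x2 image: each pixel holds [red, green, blue] intensities from 0 to 1.
tiny_img = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # top row: a red pixel, then a green pixel
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],  # bottom row: a blue pixel, then a white pixel
])

print(tiny_img.shape)  # (height, width, 3)
print(tiny_img[0, 1])  # the [R, G, B] values of the pixel in row 0, column 1

# Assumption: 8-bit image data (integers from 0 to 255), rescaled to floats from 0 to 1.
pixels_uint8 = np.array([[[255, 128, 0]]], dtype=np.uint8)
pixels_float = pixels_uint8 / 255.0
print(pixels_float)
```

Indexing with `[row, column]` returns the three color values of one pixel, while `[:, :, 0]` would return the whole red channel.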

Let's see an example of how this works. You can change the values of `red_values`, `blue_values`, and `green_values` to see how the image changes.


In [None]:
import matplotlib.pyplot as plt
import numpy as np

red_values = [
    [0.8, 0, 0],
    [0, 0.7, 0],
    [0, 0, 0.9]
]
blue_values = [
    [0, 0, 0.6],
    [0, 0.5, 0],
    [0.7, 0, 0]
]
green_values = [
    [0, 0.8, 1],
    [0.6, 0, 0],
    [0, 0, 0.5]
]

# Create image matrices of shape (height, width, [R, G, B])
red_img = np.zeros((3, 3, 3))
blue_img = np.zeros((3, 3, 3))
green_img = np.zeros((3, 3, 3))

# Set the values for each channel
red_img[:, :, 0] = np.array(red_values)  # Red channel
blue_img[:, :, 2] = np.array(blue_values)  # Blue channel
green_img[:, :, 1] = np.array(green_values)  # Green channel
combined_img = red_img + blue_img + green_img

def plot_image_matrices(red_img, blue_img, green_img, combined_img):
    # Create a figure with 4 subplots side by side
    fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(16, 4))

    # Plot each image
    ax1.imshow(red_img)
    ax1.set_title('Red Values')
    ax1.axis('off')

    ax2.imshow(blue_img)
    ax2.set_title('Blue Values') 
    ax2.axis('off')

    ax3.imshow(green_img)
    ax3.set_title('Green Values')
    ax3.axis('off')

    ax4.imshow(combined_img)
    ax4.set_title('Combined Values')
    ax4.axis('off')

    plt.tight_layout()
    plt.show()

plot_image_matrices(red_img, blue_img, green_img, combined_img)


We can also see how the process looks for a real image from the internet. I lowered the resolution of the image so that it's more evident that it's composed of individual pixels.

In [None]:
# Load and display a sample image
from PIL import Image
import requests
from io import BytesIO

# Get a small sample image from the internet (an astronaut picture)
url = "https://raw.githubusercontent.com/scikit-image/scikit-image/master/skimage/data/astronaut.png"
response = requests.get(url)
img = Image.open(BytesIO(response.content))

def plot_image_by_channels(img):
    # Resize to a very low resolution (e.g., 64x64)
    small_img = img.resize((64, 64))

    # Convert to numpy array
    img_array = np.array(small_img)

    # Create figure and display
    plt.figure(figsize=(15, 4))

    # Original image
    plt.subplot(1, 5, 1)
    plt.imshow(img)
    plt.title('Original Image')
    plt.axis('off')

    # Resized image
    plt.subplot(1, 5, 2)
    plt.imshow(img_array)
    plt.title('64x64 Resolution')
    plt.axis('off')

    # Red channel
    plt.subplot(1, 5, 3)
    plt.imshow(img_array[:,:,0], cmap='Reds')
    plt.title('Red Channel')
    plt.axis('off')

    # Green channel
    plt.subplot(1, 5, 4)
    plt.imshow(img_array[:,:,1], cmap='Greens')
    plt.title('Green Channel')
    plt.axis('off')

    # Blue channel
    plt.subplot(1, 5, 5)
    plt.imshow(img_array[:,:,2], cmap='Blues')
    plt.title('Blue Channel')
    plt.axis('off')

    plt.tight_layout()
    plt.show()

    # Print the shape of the low-resolution image array
    print(f"Low resolution image shape: {img_array.shape}")

plot_image_by_channels(img)


If you're interested you can ask an AI how the process of encoding the input into numbers works for text, which would be needed to create ML systems like ChatGPT (e.g., "How does ChatGPT encode text? I'm only slightly familiar with neural networks and machine learning").
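As a rough illustration of the idea (not how ChatGPT actually does it), the simplest possible encoding maps each character to a number and back. The function names `encode_text` and `decode_text` are made up for this sketch. Real chatbots use more elaborate schemes that split text into word pieces called tokens.

```python
def encode_text(text: str) -> list[int]:
    """Turn each character into its Unicode code point (a number)."""
    return [ord(ch) for ch in text]


def decode_text(codes: list[int]) -> str:
    """Turn a list of code points back into readable text."""
    return "".join(chr(code) for code in codes)


codes = encode_text("I love this movie!")
print(codes)               # a list of integers, one per character
print(decode_text(codes))  # the original sentence again
```

The key point is that the conversion is reversible: the model only ever sees the numbers, and we translate back to text at the end.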

For now, we'll ignore this process and assume ML systems can receive text directly. What would the function header look like for this ML system?

**LLM Chatbot**
OpenAI is creating a chatbot that can answer questions on chatgpt.com. To reduce costs, this chatbot does not accept images or voice, only text prompts. Similarly, it only outputs text.

<details>
    <summary>One Possible Solution</summary>
    ```python
    def chatbot(user_message: str) -> str:
        """
        Input: A string with the user's message
        Output: A string with the chatbot's response
        """
        pass
    ```    
</details>


In [None]:
# Create the function header for the chatbot here:


## Gradient descent helps find the "best" function (1 min)

Although the math and intuition behind gradient descent are incredibly cool, we don't have enough time to cover them here. If you're interested, I recommend this 3Blue1Brown video ([English](https://www.youtube.com/watch?v=IHZwWFHWa-w), [español](https://www.youtube.com/watch?v=mwHiaTrQOiI)). Also, if you want a challenge, you can try implementing the gradient descent algorithm yourself on this [page](https://neetcode.io/problems/gradient-descent).
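For a taste of the idea, here is a minimal sketch of gradient descent on a toy problem. The loss function `(w - 3)^2`, the starting value, the learning rate, and the number of steps are all arbitrary choices made for illustration. Real ML systems apply the same loop to millions of parameters at once.

```python
def loss(w: float) -> float:
    """A toy loss that is smallest when w equals 3."""
    return (w - 3.0) ** 2


def loss_gradient(w: float) -> float:
    """The slope of the loss at w (the derivative of (w - 3)^2)."""
    return 2.0 * (w - 3.0)


w = 0.0               # start from an arbitrary guess
learning_rate = 0.1   # how big a step to take each time

for step in range(50):
    # Move w a little in the direction that makes the loss smaller.
    w -= learning_rate * loss_gradient(w)

print(f"w after training: {w:.4f}, loss: {loss(w):.6f}")
```

Each step nudges `w` downhill, so it ends up very close to 3, the value that minimizes the loss.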

## Neural networks are very good function approximators (30 min)

To ensure gradient descent finds a function that solves the task, we need a good substrate: a model architecture that can be tuned to take the form of many different functions. Neural networks (and their descendants) have proven to be remarkably good at learning functions, partly for these reasons:
* Expressive power: It can be proven that, with enough neurons and suitably tuned parameters, a neural network can approximate practically any continuous function as closely as needed.
* Parallelism: Training NNs consists of carrying out many identical operations, which can be computed in parallel, massively speeding up the process and reducing costs.
* Efficiency: Some NN variants (like Transformers, the architecture used by ChatGPT) can learn extremely sophisticated patterns from amounts of data that, while large, are still attainable from current sources.
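To make the expressive power point a bit more concrete, the sketch below combines a few ReLU neurons with hand-picked weights to build two different shapes. It is only an illustration, not a proof, and the weights and the helper name `relu` are chosen for this example rather than learned.

```python
import numpy as np


def relu(z):
    """ReLU activation: keeps positive values and replaces negatives with 0."""
    return np.maximum(z, 0.0)


x = np.linspace(-2.0, 2.0, 9)  # sample inputs: -2.0, -1.5, ..., 2.0

# Two ReLU neurons (input weights +1 and -1) add up to the absolute value |x|.
abs_from_relus = relu(x) + relu(-x)

# Three ReLU neurons build a "tent": 0 outside [0, 2], rising to 1 at x = 1.
tent = relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)

print(abs_from_relus)
print(tent)
```

Adding more neurons gives more "kinks" to work with, which is the intuition behind approximating complicated functions with simple building blocks.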

In this notebook we'll focus on multi-layer perceptrons (MLPs), which are the simplest type of neural networks. While they are rarely used directly in practical applications, many of the lessons we can derive from playing with them are directly applicable to more complex architectures (such as RNNs, CNNs, and Transformers).

To start with, let's try to get an intuition for just how expressive neural networks can be. In this exercise we'll take a twisted function and see how different machine learning systems do at approximating it.

You can play with the depth and width of the MLP to see how it changes its ability to approximate the underlying function.

We'll use PyTorch to create the data and train the models. You can think of PyTorch as a version of NumPy that is optimized for machine learning applications. PyTorch is also structured around arrays (which it calls tensors), but it includes additional functions and modules that are useful for training machine learning models.

In [None]:
import torch

def spiral(phi):
    x = (phi + 1) * torch.cos(phi)
    y = phi * torch.sin(phi)
    return torch.cat((x, y), dim=1)


def generate_data(num_data):
    angles = torch.empty((num_data, 1)).uniform_(1, 15)
    data = spiral(angles)
    # Add some noise to the data.
    data += torch.empty((num_data, 2)).normal_(0.0, 0.4)
    labels = torch.zeros((num_data,), dtype=torch.int)
    # Flip half of the points to create two classes.
    data[num_data // 2 :, :] *= -1
    labels[num_data // 2 :] = 1
    return data, labels

x_train, y_train = generate_data(4000)
x_val, y_val = generate_data(1000)

print("x_train shape:", x_train.size())
print("y_train shape:", y_train.size())


def plot_data(x, y):
    """Plot data points x with labels y. Label 1 is a red +, label 0 is a blue +."""
    plt.figure(figsize=(5, 5))
    plt.plot(x[y == 1, 0], x[y == 1, 1], "r+")
    plt.plot(x[y == 0, 0], x[y == 0, 1], "b+")
    
    
plot_data(x_train, y_train)

import ipywidgets as widgets
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import torch.nn as nn
import torch.optim as optim

def create_mlp(width, depth):
    layers = []
    layers.append(nn.Linear(2, width))
    layers.append(nn.ReLU())
    
    for _ in range(depth-1):
        layers.append(nn.Linear(width, width))
        layers.append(nn.ReLU())
        
    layers.append(nn.Linear(width, 1))
    layers.append(nn.Sigmoid())
    
    return nn.Sequential(*layers)

def train_mlp(model, x, y, epochs=100):
    optimizer = optim.Adam(model.parameters())
    criterion = nn.BCELoss()
    
    for _ in range(epochs):
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output.squeeze(), y.float())
        loss.backward()
        optimizer.step()

def compare_predictions(x=x_val, y=y_val):
    """Compare predictions of different models."""
    
    # Train logistic regression
    lr_model = LogisticRegression()
    lr_model.fit(x_train, y_train)
    lr_pred = lr_model.predict(x)
    lr_acc = (lr_pred == y.numpy()).mean()
    
    # Train decision tree
    dt_model = DecisionTreeClassifier()
    dt_model.fit(x_train, y_train)
    dt_pred = dt_model.predict(x)
    dt_acc = (dt_pred == y.numpy()).mean()
    
    # Create and train MLP with widget controls
    width_slider = widgets.IntSlider(value=32, min=4, max=128, description='Width:')
    depth_slider = widgets.IntSlider(value=2, min=1, max=5, description='Depth:')
    
    @widgets.interact(width=width_slider, depth=depth_slider)
    def train_and_plot(width, depth):
        mlp_model = create_mlp(width, depth)
        train_mlp(mlp_model, x_train, y_train)
        
        with torch.inference_mode():
            mlp_pred = (mlp_model(x).squeeze() > 0.5).float()
        mlp_acc = (mlp_pred == y).float().mean()

        plt.figure(figsize=(15, 5))

        # Plot logistic regression
        plt.subplot(131)
        reds = lr_pred > 0.5
        plt.plot(x[reds, 0], x[reds, 1], "r+")
        plt.plot(x[~reds, 0], x[~reds, 1], "b+")
        plt.title(f"Logistic Regression\nAccuracy: {lr_acc:.3f}")

        # Plot decision tree
        plt.subplot(132)
        reds = dt_pred > 0.5
        plt.plot(x[reds, 0], x[reds, 1], "r+")
        plt.plot(x[~reds, 0], x[~reds, 1], "b+")
        plt.title(f"Decision Tree\nAccuracy: {dt_acc:.3f}")

        # Plot MLP
        plt.subplot(133)
        reds = mlp_pred > 0.5
        plt.plot(x[reds, 0], x[reds, 1], "r+")
        plt.plot(x[~reds, 0], x[~reds, 1], "b+")
        plt.title(f"MLP (w={width}, d={depth})\nAccuracy: {mlp_acc:.3f}")

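        plt.tight_layout()
        plt.show()

# Build the sliders and draw the comparison (the function above is otherwise never called)
compare_predictions()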

Although the MLP does much better at this task, it does so partly at the cost of being less interpretable. We can easily visualize the algorithm behind a logistic regression or a decision tree, but if you were to plot the weights of an MLP, it would look somewhat like this:

![MLP weights visualization](https://i.sstatic.net/Z5L70.png)

The circles represent neurons that can be active or not for a given input, while the lines represent the weights that determine how early neurons influence the later ones to form the output.
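The sketch below shows how those weights turn an input into an output, using a tiny network with 2 inputs, 3 hidden neurons, and 1 output. All the weight values here are made up for illustration and are not the weights of the trained model. Following the `create_mlp` function above, it assumes ReLU for the hidden neurons and a sigmoid at the output.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights: one row per hidden neuron, one column per input.
W1 = np.array([[0.5, -1.0],
               [1.5, 0.2],
               [-0.7, 0.8]])
b1 = np.array([0.1, -0.3, 0.0])

# Hypothetical weights connecting the 3 hidden neurons to the single output neuron.
W2 = np.array([[1.0, -2.0, 0.5]])
b2 = np.array([0.2])

x = np.array([1.0, 2.0])            # one input point with two values

hidden = relu(W1 @ x + b1)          # activation of each hidden neuron (each "circle")
output = sigmoid(W2 @ hidden + b2)  # final prediction, a number between 0 and 1

print("Hidden activations:", hidden)
print("Output:", output)
```

A hidden neuron whose weighted sum is negative outputs 0 after the ReLU, which is what "not active for a given input" means in the figure.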

Although the network we trained is too big to quickly get an intuitive grasp of how it works, we can learn a few things by playing with a smaller version of an MLP.

As an illustration, we know that computers can send emails, play videos, run Python code, and do many other things by repeating a few basic operations many times (such as taking the AND and OR of two bits). If neural networks can also implement these basic operations, then in theory they can be stacked to perform anything a computer can do (in reality, we).
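To see how basic operations combine into something more useful, here is a small sketch in plain Python (no neural network yet). It builds an adder for two bits out of simple gates. Note that it also uses NOT, which the text above does not mention, and that the function names are chosen just for this example.

```python
def AND(a: int, b: int) -> int:
    return a & b


def OR(a: int, b: int) -> int:
    return a | b


def NOT(a: int) -> int:
    return 1 - a


def XOR(a: int, b: int) -> int:
    # "a or b, but not both", written using only the gates above.
    return AND(OR(a, b), NOT(AND(a, b)))


def half_adder(a: int, b: int) -> tuple[int, int]:
    """Add two bits and return (sum_bit, carry_bit)."""
    return XOR(a, b), AND(a, b)


for a in (0, 1):
    for b in (0, 1):
        print(f"{a} + {b} -> (sum, carry) = {half_adder(a, b)}")
```

Chaining adders like this one is, roughly, how hardware adds larger numbers, which gives a feel for how simple pieces scale up.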

Let's then try to manually set the weights for an MLP to implement the AND and OR operations.

The following neural network receives two inputs which are either 0 or 1. It has depth 1, so its output is just the value taken by its only neuron.

In [7]:
import ipywidgets as widgets
from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np

# Create widgets for weight adjustment
w1_slider = widgets.FloatSlider(value=0.0, min=-2.0, max=2.0, step=0.1, description='Weight 1:')
w2_slider = widgets.FloatSlider(value=0.0, min=-2.0, max=2.0, step=0.1, description='Weight 2:') 
bias_slider = widgets.FloatSlider(value=0.0, min=-2.0, max=2.0, step=0.1, description='Bias:')

def plot_neuron(w1, w2, bias):
    plt.figure(figsize=(8, 6))
    
    # Define sigmoid function
    sigmoid = lambda x: 1 / (1 + np.exp(-x))
    
    # Plot inputs (0,0), (0,1), (1,0), (1,1)
    inputs = [(0,0), (0,1), (1,0), (1,1)]
    for x1, x2 in inputs:
        # Calculate neuron activation
        activation = sigmoid(w1*x1 + w2*x2 + bias)
        
        # Plot input points
        plt.plot(x1, x2, 'ko', markersize=10)
        
        # Plot neuron with activation-based color
        plt.plot(0.5, 0.5, 'o', color=f'{1-float(activation):.2f}', 
                markersize=20, zorder=3)
        
        # Plot connections with weight-based colors
        if x1 == 1:
            plt.plot([1, 0.5], [x2, 0.5], 
                    color='red' if w1 > 0 else 'blue',
                    alpha=abs(w1/2),
                    linewidth=2)
        if x2 == 1:
            plt.plot([x1, 0.5], [1, 0.5],
                    color='red' if w2 > 0 else 'blue',
                    alpha=abs(w2/2),
                    linewidth=2)
    
    # Calculate accuracy for AND operation
    correct = 0
    for x1, x2 in inputs:
        pred = sigmoid(w1*x1 + w2*x2 + bias) > 0.5
        target = x1 and x2
        correct += int(pred == target)
    accuracy = correct / len(inputs)
    
    plt.title(f'Neural Network AND Gate\nAccuracy: {accuracy:.2f}')
    plt.xlim(-0.5, 1.5)
    plt.ylim(-0.5, 1.5)
    plt.grid(True)
    plt.show()

def update_plot(w1, w2, bias):
    plot_neuron(w1, w2, bias)

# Display interactive widgets
out = widgets.interactive(update_plot, 
                        w1=w1_slider,
                        w2=w2_slider,
                        bias=bias_slider)
display(out)

print("Try to set the weights to implement an AND gate!")
print("Hint: Both inputs should need to be ON (1) to activate the neuron.")


Try to set the weights to implement an AND gate!
Hint: Both inputs should need to be ON (1) to activate the neuron.


In [None]:
# TODO: Do the same for the OR gate.