# Introduction to Deep Learning

<img src="https://lamarr-institute.org/wp-content/uploads/deepLearn_2_EN-2048x1024.png" style="width: 800px; background: white;">

Deep learning is a subfield of machine learning, itself a branch of artificial intelligence, in which computers learn from examples. It is loosely inspired by the way the human brain works. In deep learning, we use neural networks, which are systems made of layers of connected nodes called “neurons.”

These networks learn by looking at many examples. For example, if we want a computer to recognize cats, we show it thousands of cat pictures. Over time, the network learns what features make a cat: shapes, colors, and patterns.

Deep learning is powerful because it can find complex patterns that are hard for humans to describe. It is used in many areas, such as voice assistants, self-driving cars, medical image analysis, and translation tools.

Although deep learning can achieve impressive results, it needs large amounts of data and powerful hardware (often GPUs) to train the models. It can also be difficult to understand how a model makes its decisions, since the learning is spread across many layers.

In summary, deep learning is a modern approach that allows computers to learn from data in a way that is flexible and effective, making it an important technology in today’s world.

<img src="https://towardsdatascience.com/wp-content/uploads/2021/12/1hkYlTODpjJgo32DoCOWN5w.png" style="width: 800px">

<br>
<img src="https://nickmccullum.com/images/python-deep-learning/understanding-neurons-deep-learning/activation-function.png" style="width: 800px">

In [1]:
# The features from the input layer
input_layer = [1, 2, 3, 4, 5]

# The weights for the connections between layers
hidden_layer1 = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [0.5, 0.4, 0.3, 0.2, 0.1],
    [0.2, 0.3, 0.4, 0.5, 0.6]
]
hidden_layer2 = [
    [0.5, 0.4, 0.3],
    [0.1, 0.2, 0.3]
]
output_layer = [
    [0.9, 0.8]
]


# Matrix-vector product: each row of `weights` holds the weights of one neuron
def matrix_multiply(inputs, weights):
    outputs = []
    for weight_vector in weights:
        output = sum(i * w for i, w in zip(inputs, weight_vector))
        outputs.append(output)
    return outputs

# Forward pass through the network
hidden_output1 = matrix_multiply(input_layer, hidden_layer1)
print("Output after first hidden layer:", hidden_output1)

hidden_output2 = matrix_multiply(hidden_output1, hidden_layer2)
print("Output after second hidden layer:", hidden_output2)

final_output = matrix_multiply(hidden_output2, output_layer)
print("Final output of the network:", final_output)

# ReLU activation function: negative values become 0, positive values pass through
def relu(x):
    return max(0, x)

# Applying ReLU activation function to final output
activated_output = [relu(x) for x in final_output]

print("Activated final output (ReLU):", activated_output)


Output after first hidden layer: [5.5, 3.5, 7.0]
Output after second hidden layer: [6.25, 3.35]
Final output of the network: [8.305]
Activated final output (ReLU): [8.305]
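
One detail of the forward pass above is worth a closer look: ReLU is applied only to the final output, so the three layers are purely linear. A stack of linear layers with no nonlinearity in between is equivalent to a single linear layer. The sketch below checks this on the same weights; the helper `compose` is hypothetical and introduced only for this illustration.

```python
# Weights copied from the cell above.
input_layer = [1, 2, 3, 4, 5]
hidden_layer1 = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [0.5, 0.4, 0.3, 0.2, 0.1],
    [0.2, 0.3, 0.4, 0.5, 0.6],
]
hidden_layer2 = [
    [0.5, 0.4, 0.3],
    [0.1, 0.2, 0.3],
]
output_layer = [[0.9, 0.8]]


def matrix_multiply(inputs, weights):
    """Matrix-vector product: one output per row of `weights`."""
    return [sum(i * w for i, w in zip(inputs, row)) for row in weights]


def compose(outer, inner):
    """Hypothetical helper: matrix product outer @ inner, so applying the
    result equals applying `inner` first and `outer` second."""
    inner_columns = list(zip(*inner))
    return [[sum(o * c for o, c in zip(row, col)) for col in inner_columns]
            for row in outer]


# Layer-by-layer forward pass (no activation between layers).
layered = matrix_multiply(
    matrix_multiply(matrix_multiply(input_layer, hidden_layer1), hidden_layer2),
    output_layer,
)

# Collapse the three weight matrices into a single 1x5 matrix.
collapsed_weights = compose(output_layer, compose(hidden_layer2, hidden_layer1))
collapsed = matrix_multiply(input_layer, collapsed_weights)

print("Layered output:  ", layered)
print("Collapsed output:", collapsed)
```

Both values agree up to floating-point rounding, which is why practical networks place an activation function between layers, as the PyTorch model later in this notebook does.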


<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*EZiBCBstS5malFC38HGKig.jpeg" style="width: 800px">
<br>
<img src="https://mukulrathi.com/static/b4793bf4ff6f9063ff9a3d00f1ebf6ec/e629e/grad-descent.webp" style="width: 800px">

Training a deep learning model means teaching the model to make good predictions by learning from data. The process starts with many examples, such as images, sentences, or numbers. Each example has an input and often a correct answer, called a label.

The model makes a prediction, and then we compare this prediction with the correct answer. The difference between them is measured by the loss. A high loss means the prediction is far from the correct answer, and a low loss means it is close.
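
As a concrete illustration, here is a minimal sketch of mean squared error (MSE), one common loss for numeric predictions. The function name `mean_squared_error` and the sample values are assumptions chosen for this example.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


targets = [3.0, 1.0]
bad_predictions = [8.0, -2.0]   # far from the targets
good_predictions = [3.1, 0.9]   # close to the targets

bad_loss = mean_squared_error(bad_predictions, targets)
good_loss = mean_squared_error(good_predictions, targets)

print("Loss for bad predictions: ", bad_loss)
print("Loss for good predictions:", good_loss)
```

Squaring makes every error positive and penalizes large mistakes more than small ones.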

To improve, the model adjusts its internal values, called weights. An algorithm called backpropagation computes the gradient of the loss with respect to each weight, that is, how much the loss would change if that weight changed. An optimizer (such as SGD or Adam) then uses these gradients to update the weights step by step.

This process repeats many times over the entire dataset; each full pass is called an epoch. With each pass, the model usually becomes a little better at recognizing patterns. After enough training, the model can often make accurate predictions on new, unseen data.
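
The cycle of predicting, measuring the loss, and updating can be sketched on the smallest possible model: a single weight `w` that should learn `y = 2 * x`. The data, learning rate, and number of steps are assumptions chosen for illustration, and the gradient is written out by hand rather than computed by backpropagation.

```python
# Toy data for the relationship y = 2 * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # initial weight
learning_rate = 0.05

for step in range(50):
    # Loss: mean((w * x - y) ** 2)
    # Derivative with respect to w: mean(2 * (w * x - y) * x)
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * gradient   # gradient descent update

final_loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"Learned weight: {w:.4f}, final loss: {final_loss:.6f}")
```

Each update moves `w` a small step in the direction that lowers the loss, so it approaches 2.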

In [2]:
# Calculate the error (called "loss" here): the signed difference between prediction and target

target = 3.0
prediction = activated_output[0]

loss = prediction - target
print("\nLoss:", loss)


# SUPER-SIMPLE WEIGHT UPDATE (educational only)
# This is not real backpropagation: every weight is shifted by the same amount,
# new_weight = old_weight - lr * loss

learning_rate = 0.01

def update_weights(layer):
    for i in range(len(layer)):
        for j in range(len(layer[i])):
            layer[i][j] -= learning_rate * loss

update_weights(output_layer)
update_weights(hidden_layer2)
update_weights(hidden_layer1)


# Forward pass again after weight update
hidden_output1_new = matrix_multiply(input_layer, hidden_layer1)
hidden_output2_new = matrix_multiply(hidden_output1_new, hidden_layer2)
final_output_new = matrix_multiply(hidden_output2_new, output_layer)
activated_output = [relu(x) for x in final_output_new]

print("Old prediction:", prediction)
print("New prediction after weight update:", activated_output)


Loss: 5.305
Old prediction: 8.305
New prediction after weight update: [5.479291101463751]
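
The update above shifts every weight by the same amount. Real backpropagation instead gives each weight its own gradient. As a minimal sketch, and for the output layer only, the snippet below computes the gradient of a squared-error loss with the chain rule and compares it with a finite-difference estimate. The activations are the values printed after the second hidden layer; the squared-error loss and the helper names are assumptions for illustration.

```python
hidden_output2 = [6.25, 3.35]   # activations feeding the output layer
output_weights = [0.9, 0.8]     # the single row of output_layer
target = 3.0


def relu(x):
    return max(0, x)


def squared_error(weights):
    prediction = relu(sum(h * w for h, w in zip(hidden_output2, weights)))
    return (prediction - target) ** 2


# Chain rule: dL/dw_j = 2 * (prediction - target) * relu'(z) * h_j
z = sum(h * w for h, w in zip(hidden_output2, output_weights))
prediction = relu(z)
relu_slope = 1.0 if z > 0 else 0.0
analytic = [2 * (prediction - target) * relu_slope * h for h in hidden_output2]

# Finite-difference check: nudge one weight at a time.
eps = 1e-6
numeric = []
for j in range(len(output_weights)):
    nudged = list(output_weights)
    nudged[j] += eps
    numeric.append((squared_error(nudged) - squared_error(output_weights)) / eps)

print("Analytic gradient:", analytic)
print("Numeric gradient: ", numeric)
```

The two gradients agree closely, and each weight's gradient is proportional to the activation it multiplies, so the weights no longer all move by the same amount.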


TensorFlow, from Google, has an interactive playground that we can use to see how this whole process works!

https://playground.tensorflow.org/

# PyTorch

PyTorch is an open-source library used to build and train deep learning models. It is widely used in research and industry because it is easy to learn and flexible.

PyTorch works with tensors, which are similar to arrays or matrices. Tensors can run on the CPU or on the GPU, making computations faster when working with large models.

https://pytorch.org/

https://developer.nvidia.com/cuda/toolkit


In [3]:
!nvidia-smi

Sun Dec 14 21:04:41 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.94                 Driver Version: 560.94         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4060      WDDM  |   00000000:01:00.0  On |                  N/A |
| 37%   41C    P8             N/A /  115W |     429MiB /   8188MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [4]:
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")

PyTorch version: 2.9.1+cu126
CUDA available: True
CUDA device: NVIDIA GeForce RTX 4060


Recreating the simple neural network from the previous example:

In [5]:
from torch import nn

class SimpleNN(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 3)
        self.fc2 = nn.Linear(3, 2)
        self.fc3 = nn.Linear(2, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x


In [6]:
torch.manual_seed(42)

model = SimpleNN()
prediction = model(torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0]))
print(prediction)

tensor([-0.1721], grad_fn=<ViewBackward0>)


In [7]:
epochs = 10
learning_rate = 0.1

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(epochs):
    optimizer.zero_grad()

    inputs = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
    target = torch.tensor([3.0])

    outputs = model(inputs)

    loss = criterion(outputs, target)
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")


Epoch 1/10, Loss: 10.062329292297363
Epoch 2/10, Loss: 5.28465461730957
Epoch 3/10, Loss: 3.38217830657959
Epoch 4/10, Loss: 2.1645941734313965
Epoch 5/10, Loss: 1.3853403329849243
Epoch 6/10, Loss: 0.88661789894104
Epoch 7/10, Loss: 0.5674353241920471
Epoch 8/10, Loss: 0.3631584942340851
Epoch 9/10, Loss: 0.23242133855819702
Epoch 10/10, Loss: 0.14874958992004395


In [8]:
prediction = model(torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0]))
print(prediction)

tensor([2.6915], grad_fn=<ViewBackward0>)


# Text generation with deep learning!

For this example, we are going to use recurrent neural networks!

We will use a dataset of Shakespeare's plays to create a neural network that can write text!

https://www.kaggle.com/datasets/kingburrito666/shakespeare-plays?resource=download

In [9]:
import pandas as pd

df = pd.read_csv('./datasets/Shakespeare_data.csv')
df.head()

| | Dataline | Play | PlayerLinenumber | ActSceneLine | Player | PlayerLine |
|---|---|---|---|---|---|---|
| 0 | 1 | Henry IV | | | | ACT I |
| 1 | 2 | Henry IV | | | | SCENE I. London. The palace. |
| 2 | 3 | Henry IV | | | | Enter KING HENRY, LORD JOHN OF LANCASTER, the ... |
| 3 | 4 | Henry IV | 1.0 | 1.1.1 | KING HENRY IV | So shaken as we are, so wan with care, |
| 4 | 5 | Henry IV | 1.0 | 1.1.2 | KING HENRY IV | Find we a time for frighted peace to pant, |


In [10]:
lines = df['PlayerLine'].dropna().astype(str).tolist()

cutoff_index = int(0.8 * len(lines))
train_lines, val_lines = lines[:cutoff_index], lines[cutoff_index:]

print(f"Number of training lines: {len(train_lines)}")
print(f"Number of validation lines: {len(val_lines)}")

Number of training lines: 89116
Number of validation lines: 22280


**Extracting the tokens!**

We have our phrases, but a neural network doesn't work with text directly; we need numbers!

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(train_lines)

tokenizer = vectorizer.build_tokenizer()

tokens = tokenizer(train_lines[3])
vocab = vectorizer.get_feature_names_out()

print(f"""
First phrase: {train_lines[3]}
Tokens in the first phrase: {tokens}

Vocabulary size: {len(vocab)}

First 20 tokens: {vocab[:20]}
""")


First phrase: So shaken as we are, so wan with care,
Tokens in the first phrase: ['So', 'shaken', 'as', 'we', 'are', 'so', 'wan', 'with', 'care']

Vocabulary size: 20574

First 20 tokens: ['10' '2d' '2s' '4d' '5s' '6d' '8d' 'abaissiez' 'abandon' 'abandoned'
 'abase' 'abate' 'abated' 'abatement' 'abatements' 'abbess' 'abbey'
 'abbeys' 'abbominable' 'abbot']



In [12]:
vocab = list(vocab)
# End-of-file token. It is never appended to the training sequences below,
# so the model cannot learn to emit it.
vocab.append('<EOF>')

In [13]:
word2index = {word: idx for idx, word in enumerate(vocab)}
index2word = {idx: word for word, idx in word2index.items()}

In [14]:
phrase = "To be, or not to be, that is the question."
tokens = tokenizer(phrase)
# Tokens that are not in the vocabulary are silently skipped
indices = [word2index[token] for token in tokens if token in word2index]
words = [index2word[idx] for idx in indices]

print(f"Phrase: {phrase}")
print(f"Tokens: {tokens}")
print(f"Indices: {indices}")
print(f"Words from indices: {words}")

Phrase: To be, or not to be, that is the question.
Tokens: ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
Indices: [1378, 12301, 12029, 18177, 1378, 17910, 9653, 17916, 14255]
Words from indices: ['be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
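
In the output above, ten tokens produce only nine indices. `To` is missing because the tokenizer keeps capitalization while the vocabulary built by `CountVectorizer` is lowercase. One possible remedy, sketched here with a toy vocabulary and a hypothetical `<UNK>` token for unknown words, is to lowercase the text before the lookup. The regular expression mirrors the default `token_pattern` of `CountVectorizer`, which keeps words of two or more characters.

```python
import re

# Same pattern as CountVectorizer's default token_pattern:
# words made of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Toy vocabulary (an assumption for illustration, not the real one).
toy_vocab = ["be", "is", "not", "or", "question", "that", "the", "to", "<UNK>"]
toy_word2index = {word: idx for idx, word in enumerate(toy_vocab)}
toy_index2word = {idx: word for word, idx in toy_word2index.items()}


def encode(text):
    """Lowercase, tokenize, and map each token to its index (or <UNK>)."""
    toy_tokens = TOKEN_PATTERN.findall(text.lower())
    unk = toy_word2index["<UNK>"]
    return [toy_word2index.get(token, unk) for token in toy_tokens]


toy_phrase = "To be, or not to be, that is the question."
toy_indices = encode(toy_phrase)
print("Indices:", toy_indices)
print("Words:  ", [toy_index2word[idx] for idx in toy_indices])
```

With lowercasing, all ten tokens are kept, and mapping unknown words to `<UNK>` preserves the length of the sequence instead of dropping words.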


**Embeddings**

Embeddings represent each token as a vector of floating-point numbers, so a phrase becomes a matrix with one row per token. They are important when working with AI, especially in NLP (natural language processing).

In [15]:
from torch import nn

phrase = "To be, or not to be, that is the question."

tokens = tokenizer(phrase)
indices = [word2index[token] for token in tokens if token in word2index]

input_tensor = torch.tensor(indices, dtype=torch.long).unsqueeze(0)

embedding_dim = 5
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embedding_dim)
embedded_output = embedding_layer(input_tensor)

print(f"Input indices: {indices}")
print("Embedded output:")
print(embedded_output.detach().squeeze(0))

Input indices: [1378, 12301, 12029, 18177, 1378, 17910, 9653, 17916, 14255]
Embedded output:
tensor([[-1.1392e+00,  7.2054e-02,  7.7972e-04, -1.4003e-01,  9.1053e-01],
        [-1.7629e-01,  1.3517e+00, -6.8322e-01, -1.0474e+00, -4.3596e-01],
        [-2.5106e-01, -1.1762e+00,  2.0946e-01,  4.6188e-01,  5.4676e-01],
        [ 1.1102e+00, -5.8836e-01, -2.8079e-03, -1.0986e-01,  2.9528e-01],
        [-1.1392e+00,  7.2054e-02,  7.7972e-04, -1.4003e-01,  9.1053e-01],
        [-4.9812e-01, -5.7807e-02,  3.2006e-01, -1.8862e-02, -1.5732e-02],
        [-9.3847e-01,  2.0246e-01,  2.4391e-01, -1.8471e-01,  8.6820e-01],
        [-1.6026e+00,  1.9846e-01,  1.0945e-01,  1.7788e+00, -7.6462e-01],
        [-4.3056e-01, -7.3652e-01, -1.1400e+00,  9.3324e-01,  2.8117e-01]])
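
Notice that the first and fifth rows of the output are identical: both correspond to index 1378 (`be`). Conceptually, an embedding layer is a lookup table with one row of floats per vocabulary entry. The pure-Python sketch below mimics that behavior with a tiny random table; the sizes and names are assumptions, and unlike `nn.Embedding` these vectors are not trainable.

```python
import random

random.seed(0)

toy_vocab_size = 6
toy_embedding_dim = 3

# One row of random floats per vocabulary index.
toy_embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(toy_embedding_dim)]
    for _ in range(toy_vocab_size)
]


def embed(token_indices):
    """Look up the vector for each index."""
    return [toy_embedding_table[idx] for idx in token_indices]


toy_token_indices = [4, 1, 4]   # index 4 appears twice
toy_vectors = embed(toy_token_indices)
for idx, vector in zip(toy_token_indices, toy_vectors):
    print(idx, [round(v, 3) for v in vector])
```

The same index always returns the same row; during training, a real embedding layer adjusts these rows so that they become useful for the task.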


**The Recurrent Neural Network Model**

Now we can create the model we are going to train.

<img src="https://mdpi-res.com/information/information-15-00517/article_deploy/html/images/information-15-00517-g001.png" style="width: 500px;">

This model is based on an RNN (recurrent neural network); specifically, it uses LSTM layers.
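
The core idea of a recurrent network is that a hidden state is carried from one time step to the next, so each output depends on everything seen so far. The scalar sketch below shows a plain (Elman-style) recurrent step with made-up weights; all values are assumptions for illustration. An LSTM follows the same pattern but adds gates that control what is kept and what is forgotten.

```python
import math

# Made-up scalar weights for illustration.
w_input = 0.5    # weight applied to the current input
w_hidden = 0.8   # weight applied to the previous hidden state
bias = 0.0


def rnn_step(x, h_prev):
    """One recurrent step: h_t = tanh(w_input * x_t + w_hidden * h_prev + bias)."""
    return math.tanh(w_input * x + w_hidden * h_prev + bias)


def run_rnn(inputs):
    """Feed a sequence one element at a time, carrying the hidden state."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states


# Same final input (1.0), different history, different final state.
states_a = run_rnn([0.0, 0.0, 1.0])
states_b = run_rnn([1.0, 1.0, 1.0])
print(states_a)
print(states_b)
```

Although both sequences end with the same input, their final hidden states differ, because the state remembers the earlier inputs.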

In [43]:
from torch import nn

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, hidden_layers=3, dropout_prob=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=hidden_layers, batch_first=True)
        self.dropout = nn.Dropout(dropout_prob)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        x = self.embedding(x)
        out, hidden = self.lstm(x, hidden)
        out = self.dropout(out)
        out = self.fc(out)
        return out, hidden


vocab_size = len(vocab)
embedding_dim = 5
hidden_dim = 32


model = SimpleRNN(vocab_size, embedding_dim, hidden_dim)
print(model)

SimpleRNN(
  (embedding): Embedding(20575, 5)
  (lstm): LSTM(5, 32, num_layers=3, batch_first=True)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=32, out_features=20575, bias=True)
)


**Training our model**

Now that we know how to extract tokens and convert them into embeddings, and we have our model, we can train it!

In [44]:
# Step 1: Prepare input data

sequences = []
for line in train_lines:
    tokens = tokenizer(line)
    indices = [word2index[token] for token in tokens if token in word2index]
    if len(indices) > 1:
        sequences.append(indices)

print(f"Number of sequences prepared: {len(sequences)}")

Number of sequences prepared: 82828


In [68]:
# Step 2: Set hyperparameters

embedding_dim = 5

hidden_dim = 1024
hidden_layers = 3
vocab_size = len(vocab)

batch_size = 32
num_epochs = 500
lr = 0.0001

In [69]:
# Step 3: Train the model
from torch.nn.utils.rnn import pad_sequence
import random

random.seed(42)
torch.manual_seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

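model = SimpleRNN(vocab_size, embedding_dim, hidden_dim, hidden_layers=hidden_layers).to(device)
# Ignore padded positions (index 0) when computing the loss.
# Caveat: index 0 is also the vocabulary entry '10', so that token is ignored too.
criterion = nn.CrossEntropyLoss(ignore_index=0)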
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

for epoch in range(num_epochs):
    total_loss = 0
    random.shuffle(sequences)

    for i in range(0, len(sequences), batch_size):
        batch = sequences[i:i+batch_size]
        
        input_batch = [torch.tensor(seq[:-1], dtype=torch.long) for seq in batch]
        target_batch = [torch.tensor(seq[1:], dtype=torch.long) for seq in batch]
        input_padded = pad_sequence(input_batch, batch_first=True, padding_value=0).to(device)
        target_padded = pad_sequence(target_batch, batch_first=True, padding_value=0).to(device)

        optimizer.zero_grad()
        output, _ = model(input_padded) 

        loss = criterion(output.view(-1, vocab_size), target_padded.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

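    if epoch == 0 or (epoch + 1) % 50 == 0:
        # total_loss adds up one mean loss per batch, so dividing by len(sequences)
        # reports a value roughly batch_size times smaller than the mean batch loss
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(sequences):.4f}")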

Using device: cuda
Epoch 1/500, Loss: 0.2184
Epoch 50/500, Loss: 0.0557
Epoch 100/500, Loss: 0.0390
Epoch 150/500, Loss: 0.0364
Epoch 200/500, Loss: 0.0352
Epoch 250/500, Loss: 0.0345
Epoch 300/500, Loss: 0.0340
Epoch 350/500, Loss: 0.0337
Epoch 400/500, Loss: 0.0335
Epoch 450/500, Loss: 0.0333
Epoch 500/500, Loss: 0.0333
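
Two details of the training loop are easy to miss. Each sequence is split into an input (all tokens but the last) and a target (all tokens but the first), and shorter sequences in a batch are padded with zeros so that the batch forms a rectangle. The pure-Python sketch below shows both steps on made-up index sequences; the helper `pad` is hypothetical and plays the role that `pad_sequence` plays in the loop above.

```python
PAD = 0  # padding value, matching padding_value=0 in the training loop

# Made-up index sequences of different lengths.
toy_batch = [
    [11, 12, 13, 14],
    [21, 22],
    [31, 32, 33],
]

# The model reads tokens 0..n-2 and must predict tokens 1..n-1.
toy_inputs = [seq[:-1] for seq in toy_batch]
toy_targets = [seq[1:] for seq in toy_batch]


def pad(rows, value=PAD):
    """Right-pad every row to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [row + [value] * (width - len(row)) for row in rows]


toy_inputs_padded = pad(toy_inputs)
toy_targets_padded = pad(toy_targets)

for source, expected in zip(toy_inputs_padded, toy_targets_padded):
    print(source, "->", expected)
```

Positions where the target equals the padding value are skipped by the loss because of `ignore_index=0`.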


In [57]:
def generate_text(model, start_text, tokenizer, word2index, index2word, max_len=20, device='cpu'):
    model.eval()
    tokens = tokenizer(start_text)
    indices = [word2index[token] for token in tokens if token in word2index]
    input_seq = torch.tensor(indices, dtype=torch.long).unsqueeze(0).to(device)
    generated = indices.copy()

    hidden = None
    for _ in range(max_len):
        with torch.no_grad():
            output, hidden = model(input_seq, hidden)
            next_token_logits = output[0, -1]
            next_token = torch.argmax(next_token_logits).item()
            if index2word[next_token] == '<EOF>':
                break
            generated.append(next_token)
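            # Note: the whole sequence is fed again together with the carried-over
            # hidden state, so earlier tokens are processed more than once
            input_seq = torch.tensor([generated], dtype=torch.long).to(device)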
            

    words = [index2word[idx] for idx in generated]
    return ' '.join(words)


start_text = "to be"
print(generate_text(model, start_text, tokenizer, word2index, index2word, max_len=20, device=device))

to be so to you to the king and is not so much as you have been the world to be so


In [49]:
start_text = val_lines[0]
print(start_text)
print(generate_text(model, start_text, tokenizer, word2index, index2word, max_len=10, device=device))

And I am tied to be obedient,
am tied to be obedient nay and ll say to me was that to my
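
`generate_text` uses greedy decoding: at every step it picks the single highest-scoring next token, and it stops at `<EOF>` or after `max_len` tokens. The loop can be sketched without a neural network by replacing the model with a hand-written table of next-token scores. The table `toy_next_scores` and its values are assumptions for illustration only.

```python
# Hypothetical next-token scores: toy_next_scores[current][candidate]
toy_next_scores = {
    "to": {"be": 2.0, "the": 1.5, "<EOF>": 0.1},
    "be": {"so": 1.2, "to": 0.9, "<EOF>": 0.3},
    "so": {"<EOF>": 2.5, "to": 0.4},
}


def greedy_generate(start_token, max_len=10):
    """Repeatedly append the highest-scoring next token (argmax)."""
    generated = [start_token]
    for _ in range(max_len):
        scores = toy_next_scores[generated[-1]]
        next_token = max(scores, key=scores.get)   # argmax over candidates
        if next_token == "<EOF>":
            break
        generated.append(next_token)
    return " ".join(generated)


print(greedy_generate("to"))
```

Because the choice is deterministic, the same prompt always yields the same continuation.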


In [32]:
import os
os.makedirs("models", exist_ok=True)
torch.save(model.state_dict(), "models/simple_rnn_shakespeare.pth")

In [35]:
model = SimpleRNN(vocab_size, embedding_dim, hidden_dim)
model.load_state_dict(torch.load("models/simple_rnn_shakespeare.pth", map_location=device))
model.to(device)
model.eval()

SimpleRNN(
  (embedding): Embedding(20575, 5)
  (lstm): LSTM(5, 1024, num_layers=3, batch_first=True)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=1024, out_features=20575, bias=True)
)