<h1 style="text-align: center;">Building a Transformer From Scratch</h1>
<h3 style="text-align: center;">by Rida Assalouh</h3>


Given an input matrix $X \in \mathbb{R}^{T \times d_{\text{model}}}$ and projection matrices  $W_Q, W_K, W_V \in \mathbb{R}^{d_{\text{model}} \times d_k}$,  
we compute the query, key, and value representations as

$$
Q = X W_Q, \qquad 
K = X W_K, \qquad 
V = X W_V,
$$

each in $\mathbb{R}^{T \times d_k}$.  
In this example, we take $d_v = d_k$.

The self-attention output, which produces updated token representations in  $\mathbb{R}^{T \times d_v}$, is given by

$$
\mathrm{Attention}(Q, K, V)
   = \mathrm{softmax}\!\left(\frac{QK^{\mathsf T}}{\sqrt{d_k}}\right)V 
$$

The matrix

$$
\mathrm{softmax}\!\left(\frac{QK^{\mathsf T}}{\sqrt{d_k}}\right)
   \in \mathbb{R}^{T \times T}
$$

contains all pairwise attention weights between tokens, describing how much each position attends to every other one.

In causal language models, this matrix is **masked** so that a token cannot attend to future tokens.  
This enforces the autoregressive property: the model may use only past context, never future information, as required by autoregressive systems such as ChatGPT. The encoder built in this notebook is bidirectional, so its attention function applies no such mask.
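
The attention function in this notebook is unmasked. As a minimal sketch of what causal masking could look like, assuming a single sequence with no batch or head dimensions, the hypothetical helper `causal_attention_weights` below sets the scores of future positions to $-\infty$ before the softmax, so their weights become exactly zero:

```python
import torch
import torch.nn.functional as F

def causal_attention_weights(Q, K):
    """Hypothetical helper: causally masked attention weights for Q, K of shape (T, d_k)."""
    T, d_k = Q.shape
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5            # (T, T)
    # True strictly above the diagonal marks "future" positions (column j > row i).
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return F.softmax(scores, dim=-1)                          # each row sums to 1

torch.manual_seed(0)
weights = causal_attention_weights(torch.randn(4, 8), torch.randn(4, 8))
print(weights)  # lower-triangular: row i only has non-zero weights for columns <= i
```

The first row always puts all of its weight on the first token, since that token has no past context other than itself.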


In [1]:
import torch
import torch.nn.functional as F
import torch.nn as nn

In [3]:
def scaled_dot_product_attention(Q, K, V):
    """
    Compute scaled dot-product attention.
    Q: (batch_size, n_heads, T_q, d_k)
    K: (batch_size, n_heads, T_k, d_k)
    V: (batch_size, n_heads, T_k, d_v)
    Returns:
      output: (batch_size, n_heads, T_q, d_v)
      attn_weights: (batch_size, n_heads, T_q, T_k)"""

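    d_k = Q.shape[-1]
    KT = K.transpose(-2, -1)  # (batch_size, n_heads, d_k, T_k)
    scores = torch.matmul(Q, KT) / d_k**0.5  # (batch_size, n_heads, T_q, T_k)
    attn_weights = F.softmax(scores, dim=-1)  # each row sums to 1 over the keys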
    output = torch.matmul(attn_weights, V)

    return output, attn_weights


In the multi-head attention setting, let $h$ be the number of heads and define $Q_i = X W_i^Q$, $K_i = X W_i^K$, and $V_i = X W_i^V$ for each head $i = 1, \dots, h$.

We set $d_k = d_q = d_v = d_{\text{model}} / h$.  
Each projection matrix $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$ is specific to its head.

The output of each head is computed as

$$
E_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\mathsf T}}{\sqrt{d_k}}\right)V_i
\in \mathbb{R}^{T \times d_k}
$$

Finally, the outputs of all heads are concatenated and projected using $W_O$:

$$
\mathrm{MultiHead}(X) = [E_1, \dots, E_h]\, W_O
$$

where $W_O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$, and therefore $\mathrm{MultiHead}(X) \in \mathbb{R}^{T \times d_{\text{model}}}$.
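
In code, the per-head matrices are usually fused into one $d_{\text{model}} \times d_{\text{model}}$ matrix and the result is split into heads by reshaping. The sketch below, which uses plain matrices and toy sizes chosen purely for illustration, checks that the fused projection followed by a reshape gives the same $Q_i$ as projecting with each head's block of columns separately:

```python
import torch

torch.manual_seed(0)
T, d_model, h = 5, 8, 2          # toy sizes, chosen for illustration only
d_k = d_model // h

X = torch.randn(T, d_model)
W_Q = torch.randn(d_model, d_model)   # fused matrix [W_1^Q, ..., W_h^Q]

# Fused projection, then split the feature axis into heads: (h, T, d_k).
Q_fused = (X @ W_Q).view(T, h, d_k).transpose(0, 1)

# Per-head projection using the matching block of columns of W_Q.
Q_heads = torch.stack([X @ W_Q[:, i * d_k:(i + 1) * d_k] for i in range(h)])

# Concatenating the heads along the feature axis restores (T, d_model).
merged = Q_fused.transpose(0, 1).contiguous().view(T, d_model)

print(torch.allclose(Q_fused, Q_heads, atol=1e-4))
print(torch.allclose(merged, X @ W_Q))
```

The same split and merge pattern applies to the keys, the values, and the concatenated head outputs $[E_1, \dots, E_h]$.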


In [4]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()

        self.d_model = d_model
        self.n_heads = n_heads

        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_k = d_model // n_heads  # per-head dimension

        self.W_q = nn.Linear(d_model, d_model) # this W_q is a concatenation of all W_qs [W_q_1, ..., W_q_h] where h=n_heads
        self.W_k = nn.Linear(d_model, d_model) # same
        self.W_v = nn.Linear(d_model, d_model) # same

        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, X):
        """
        X: (batch_size, T, d_model)
        Returns: (batch_size, T, d_model)
        """
        batch_size, T, _ = X.shape

        # For each batch element b, Q[b] is the concatenation [X . W_q_1, ..., X . W_q_h],
        # so Q has shape (batch_size, T, d_model)
        Q = self.W_q(X)
        K = self.W_k(X) # same
        V = self.W_v(X) # same

        # Split the last dimension into heads: (batch_size, n_heads, T, d_k)
        Q = Q.view(batch_size, T, self.n_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, T, self.n_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, T, self.n_heads, self.d_k).transpose(1, 2)

        outputs, _ = scaled_dot_product_attention(Q, K, V) # (batch_size, n_heads, T, d_v)
        # Merge the heads back: (batch_size, T, d_model)
        outputs = outputs.transpose(1, 2).contiguous().view(batch_size, T, self.d_model)

        return self.W_o(outputs) # (batch_size, T, d_model)

In [5]:
mha = MultiHeadAttention(1024, 8)
X = torch.randn(1, 20, 1024)

print(mha(X).shape)

torch.Size([1, 20, 1024])


After multi-head attention, instead of passing the output directly forward, we add a **skip connection** and apply **layer normalization**:

$$
X' = \text{LayerNorm}(X + \text{MultiHeadAttention}(X))
$$

Next, we feed $X'$ into an MLP, and again apply a skip connection followed by normalization:

$$
X_{\text{transformerblock}} = \text{LayerNorm}(X' + \text{MLP}(X'))
$$
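
To make the "add, then normalize" step concrete, here is a small sketch that uses a random tensor as a stand-in for the sublayer output (an assumption made only so the snippet runs on its own). After `nn.LayerNorm`, each token vector has approximately zero mean and unit variance across its $d_{\text{model}}$ features:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
norm = nn.LayerNorm(d_model)

X = torch.randn(2, 5, d_model)               # (batch_size, T, d_model)
sublayer_out = torch.randn(2, 5, d_model)    # stand-in for MultiHeadAttention(X)

X_prime = norm(X + sublayer_out)             # skip connection, then normalization

# LayerNorm normalizes over the last dimension, independently for every token.
per_token_mean = X_prime.mean(dim=-1)                  # close to 0
per_token_var = X_prime.var(dim=-1, unbiased=False)    # close to 1
print(per_token_mean.abs().max().item(), per_token_var.mean().item())
```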


In [34]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()

        self.mha = MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout_att =  nn.Dropout(dropout)
        self.ffn = nn.Sequential(
              nn.Linear(d_model, d_ff),
              nn.ReLU(),
              nn.Linear(d_ff, d_model)
            )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout_MLP =  nn.Dropout(dropout)


    def forward(self, X):
        X_attention = self.mha(X)
        X_att_norm = self.norm1(X + self.dropout_att(X_attention))
        X_ff = self.ffn(X_att_norm)
        X = self.norm2(X_att_norm + self.dropout_MLP(X_ff))
        return X

In [35]:
transformer_block = TransformerBlock(d_model=512, n_heads=8, d_ff=2048)
X = torch.randn(2, 45, 512)
X = transformer_block(X)
print(X.shape)

torch.Size([2, 45, 512])


Everything so far defines a single transformer block, which can be viewed as a function of its input $X$, denoted $f_{\theta}(X)$.  
In practice, we stack $N$ transformer blocks sequentially, so the full encoder is the composition of these functions applied to the initial input embeddings $X \in \mathbb{R}^{\text{batch\_size} \times T \times d_{\text{model}}}$:

$$
f_{\theta_N}\big(f_{\theta_{N-1}}(\dots f_{\theta_1}(X)\dots)\big)
$$

We also add **positional encodings** to retain order information.  
For position $t \in \{0, \dots, T-1\}$ and dimension $k \in \{0, \dots, d_{\text{model}}-1\}$ (zero-indexed, matching the code below), the sinusoidal positional encoding is defined as:

$$
PE(t)_k =
\begin{cases}
\sin\Big(\frac{t}{10000^{k / d_{\text{model}}}}\Big), & \text{if $k$ is even} \\[2mm]
\cos\Big(\frac{t}{10000^{(k-1) / d_{\text{model}}}}\Big), & \text{if $k$ is odd}
\end{cases}
$$

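- Because each even/odd pair of dimensions shares the same exponent, the term $10000^{k/d_\text{model}}$ gives every pair its own frequency, with wavelengths growing geometrically from $2\pi$ toward $10000 \cdot 2\pi$.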
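
As a small worked example of the formula, assuming zero-indexed positions and dimensions, the pure-Python sketch below evaluates the encoding entry by entry. The function name `sinusoidal_pe` is hypothetical and the loop is for readability only, not speed:

```python
import math

def sinusoidal_pe(t, d_model):
    """Hypothetical helper: positional encoding vector for position t (0-indexed dims)."""
    pe = []
    for k in range(d_model):
        exponent = (k - k % 2) / d_model     # k for even k, k - 1 for odd k
        angle = t / 10000 ** exponent
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

pe0 = sinusoidal_pe(0, 4)   # position 0: every sine is 0 and every cosine is 1
pe1 = sinusoidal_pe(1, 4)   # position 1: [sin(1), cos(1), sin(0.01), cos(0.01)]
print(pe0)
print(pe1)
```

Each (sine, cosine) pair lies on the unit circle, so moving from one position to the next rotates every pair by a fixed angle that depends only on its frequency.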

---

### Code details:

- Input: integer token IDs in $[0, V-1]$, where $V$ is the vocabulary size.
- The token embedding layer maps each ID to a dense vector of size $d_{\text{model}}$.
- Implemented in PyTorch as `nn.Embedding(vocab_size, d_model)`.
- Equivalence: the lookup is mathematically equivalent to multiplying a one-hot vector of size $V$ by a weight matrix $W \in \mathbb{R}^{V \times d_{\text{model}}}$ (a bias-free linear layer), but much more efficient because the large one-hot vector is never constructed.
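
The equivalence in the last bullet can be checked numerically. This is a minimal sketch with toy sizes chosen for illustration; it compares the `nn.Embedding` lookup with an explicit one-hot matrix product against the same weight matrix:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 10, 4                       # toy sizes for illustration
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[1, 5, 9], [0, 5, 2]])  # (batch_size, T)

looked_up = embedding(token_ids)                  # (2, 3, 4), direct row lookup

# Same result via an explicit one-hot encoding and a matrix product.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()   # (2, 3, 10)
multiplied = one_hot @ embedding.weight           # (2, 3, 4)

print(torch.allclose(looked_up, multiplied))
```

With a real vocabulary of about 30,000 tokens, the one-hot tensor would be thousands of times larger than the ID tensor, which is why the lookup form is preferred.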

---

### 1. Learned positional embeddings

- Create an `nn.Embedding(max_seq_len, d_model)` in the constructor.  
- These embeddings are **learned parameters** updated during training.  

### 2. Sinusoidal positional embeddings

- These are **fixed, deterministic values**.  
- Precompute them in the constructor for the maximum sequence length and store in `self.pos_embedding`, then slice in `forward()` according to the input length.  

- Optimization tip: **avoid explicit Python loops** over positions and dimensions; use broadcasting to vectorize the computation, which lets it run in parallel on a GPU.
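
The encoder built below uses the sinusoidal variant. For comparison, the learned variant (option 1) could be sketched as follows. The class name `LearnedPositionalEmbedding` is hypothetical and not part of this notebook's model:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Hypothetical module: adds a trainable embedding for each position index."""

    def __init__(self, max_seq_len, d_model):
        super().__init__()
        self.pos_embedding = nn.Embedding(max_seq_len, d_model)

    def forward(self, X):
        # X: (batch_size, T, d_model) with T <= max_seq_len
        T = X.shape[1]
        positions = torch.arange(T, device=X.device)          # (T,)
        return X + self.pos_embedding(positions).unsqueeze(0)  # broadcast over the batch

torch.manual_seed(0)
layer = LearnedPositionalEmbedding(max_seq_len=16, d_model=8)
X = torch.zeros(2, 5, 8)
out = layer(X)
print(out.shape)
```

Unlike the sinusoidal buffer, these position vectors are parameters, so they receive gradients and are updated by the optimizer.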

## **Now let's build the whole transformer encoder**

**A very useful idea for the classification task we will study at the end** is to initialize the embedding matrix with **BERT pretrained embeddings**.  
In one of the experiments, this increased the accuracy from **0.80 to 0.85**.

This amounts to a partial form of **BERT fine-tuning**: only the pretrained token embeddings are reused, while the transformer blocks are trained from scratch.

I added a parameter `embedding_initialization` (default: `'BERT'`) to choose between **BERT embeddings** and a **random initialization**.

We use sinusoidal positional encodings.


In [58]:
from transformers import AutoModel  # needed for the 'BERT' initialization below

class TransformerEncoder(nn.Module):
    def __init__(self, V, N, d_model, n_heads, d_ff, max_seq_len, embedding_initialization='BERT', dropout=0.1):
        super().__init__()

        if embedding_initialization == 'BERT':
            # Note: in this branch V is ignored; the vocabulary is BERT's (30522 tokens)
            bert = AutoModel.from_pretrained("bert-base-uncased")
            pretrained_emb = bert.embeddings.word_embeddings.weight.detach()  # shape (30522, 768)
            if d_model != 768:
                # Project to d_model with a randomly initialized linear map that is
                # applied once and then discarded (the projection itself is not trained)
                projection = nn.Linear(768, d_model, bias=False)
                with torch.no_grad():
                    projected_emb = projection(pretrained_emb)  # shape (30522, d_model)
            else:
                projected_emb = pretrained_emb.clone()
            self.token_embedding = nn.Embedding.from_pretrained(projected_emb, freeze=False)

        elif embedding_initialization == 'random':
            self.token_embedding = nn.Embedding(V, d_model)

        else:
            raise ValueError(f"Parameter 'embedding_initialization' must be either 'BERT' or 'random', got '{embedding_initialization}'")

        positions = torch.arange(max_seq_len).unsqueeze(1)  # (max_seq_len, 1)
        dims = torch.arange(d_model).unsqueeze(0)  # (1, d_model)
        # Inverse frequencies; (dims // 2) * 2 maps each odd index k to k - 1
        angles = 1 / 10000**((dims//2)*2 / d_model)
        PE = torch.zeros(max_seq_len, d_model)
        PE[:, 0::2] = torch.sin(positions * angles[:, 0::2]) # broadcasting keeps this vectorized and parallelizable on GPU
        PE[:, 1::2] = torch.cos(positions * angles[:, 1::2])
        self.register_buffer('pos_embedding', PE)

        self.layers = nn.ModuleList([TransformerBlock(d_model, n_heads, d_ff, dropout) for _ in range(N)])

        self.dropout = nn.Dropout(dropout)

    def forward(self, X):
        """
        X: (batch_size, T)  # token IDs (we assume X is already preprocessed such that T <= max_seq_len)
        """
        _ , T = X.shape
        X = self.token_embedding(X) # now X is (batch size, T, d_model)
        X = X + self.pos_embedding[:T, :].unsqueeze(0)
        X = self.dropout(X)

        for layer in self.layers:
          X = layer(X)

        return X

Now let's instantiate the encoder with the default BERT embedding initialization:

In [47]:
model = TransformerEncoder(100, 6, 512, 8, 1024, 16)
X = torch.zeros(4, 15, dtype=torch.long)
print(model(X).shape)

torch.Size([4, 15, 512])


Note that the `TransformerBlock` above also applies dropout to each sublayer output before the residual addition, so the block actually computes:

$$
X' = \text{LayerNorm}\big(X + \text{Dropout}(\text{MultiHeadAttention}(X))\big)
$$

and

$$
X_{\text{transformerblock}}
= \text{LayerNorm}\big(X' + \text{Dropout}(\text{MLP}(X'))\big)
$$


# **Using a transformer encoder for sentence classification**

At the end of the transformer, I use a linear layer to produce the predicted logits.  
This layer takes as input the embedding of the **[CLS]** token, which is placed as the **first token** in every input sequence, both during training and testing.


In [48]:
class TransformerClassifierCLS(nn.Module):
    def __init__(self, encoder, d_model, num_classes):
        super().__init__()
        self.encoder = encoder

        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, X):
        """
        X: (batch_size, T) token IDs,
           make sure the first token in each sequence is the [CLS] token
        """
        enc_out = self.encoder(X) # batch_size, T, d_model

        cls_emb = enc_out[:, 0, :] # batch_size, d_model

        logits = self.fc(cls_emb) # batch_size, num_classes

        return logits


In [49]:
encoder = TransformerEncoder(200, 6, 512, 8, 1024, 500)
classifier = TransformerClassifierCLS(encoder, 512, 2)
X = torch.zeros(64 ,13, dtype=torch.long)

print(classifier(X).shape)

torch.Size([64, 2])


## **Let's load the IMDb movie reviews dataset from Hugging Face:**

In [14]:
from datasets import load_dataset

raw_dataset = load_dataset("imdb")
train_texts = raw_dataset['train']['text']
train_labels = raw_dataset['train']['label']


Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 40023.62 examples/s]
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 46103.17 examples/s]
Generating unsupervised split: 100%|██████████| 50000/50000 [00:01<00:00, 44090.39 examples/s] 


In [15]:
avg_len = sum(len(text.split()) for text in train_texts) / len(train_texts)
print("The average number of words in the samples in the dataset is : ", avg_len)


The average number of words in the samples in the dataset is :  233.7872


In [16]:
print("number of samples in the training dataset :", len(raw_dataset['train']['text']))

number of samples in the training dataset : 25000


In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [18]:
sentence = ["", "hey bro how is Inezgane"]
encoding = tokenizer(
    sentence,
    truncation=True,
    padding='max_length',
    max_length=10,
    add_special_tokens=True,
)
print(encoding['input_ids'])

print([tokenizer.decode(encoding['input_ids'][i]) for i in range(len(sentence))])
print(tokenizer.decode(101))

[[101, 102, 0, 0, 0, 0, 0, 0, 0, 0], [101, 4931, 22953, 2129, 2003, 1999, 9351, 5289, 2063, 102]]
['[CLS] [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]', '[CLS] hey bro how is inezgane [SEP]']
[CLS]


In [19]:
tokenizer.vocab_size

30522


### **Training Procedure**

We now train our Transformer classifier on the sentence-classification task.  
The full training pipeline includes:

1. **Tokenization:**  
   We encode the raw text with the BERT tokenizer using `max_seq_len = 256`, applying truncation and padding.

2. **Dataset and DataLoader:**  
   We construct PyTorch datasets containing:
   - token IDs (padded sequences)
   - the corresponding integer labels  
   These are fed into dataloaders for mini-batch training and validation.

3. **Model construction:**  
   We instantiate our custom `TransformerEncoder` with:
   - `N` transformer blocks  
   - multi-head attention  
   - optional BERT-initialized token embeddings  
   - sinusoidal positional encodings  
   The encoder is wrapped in a `TransformerClassifierCLS`, which applies a final linear layer on top of the `[CLS]` token embedding.

4. **Embedding freezing / unfreezing:**  
   If we initialize embeddings with BERT, we optionally **freeze** them during the first training epochs.  
   After a warm-up period (`epochs_to_unfreeze`), we **unfreeze** them to allow fine-tuning.

5. **Optimization:**  
   We use the AdamW optimizer, standard for Transformer-based training, together with cross-entropy loss.  
   The training loop performs:
   - forward pass  
   - loss computation  
   - backpropagation  
   - parameter updates  

6. **Validation and checkpointing:**  
   After each epoch, we evaluate on the validation set and compute accuracy.  
   The model achieving the best validation accuracy is saved as `best_model.pth`.
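
Step 4 can be illustrated in isolation. The sketch below uses a toy embedding and linear head as stand-ins for the real classifier (an assumption made so the snippet runs without downloading BERT) and shows that a parameter with `requires_grad = False` is left untouched by AdamW, then starts to change once it is unfrozen:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Toy stand-in for the classifier: an embedding followed by a linear head.
embedding = nn.Embedding(10, 4)
head = nn.Linear(4, 2)
params = list(embedding.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-2)

def train_step(token_ids, labels):
    optimizer.zero_grad()
    logits = head(embedding(token_ids).mean(dim=1))   # mean-pool the token vectors
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()

token_ids = torch.tensor([[1, 2, 3], [4, 5, 6]])
labels = torch.tensor([0, 1])

embedding.weight.requires_grad = False                # freeze
before = embedding.weight.clone()
train_step(token_ids, labels)
frozen_unchanged = torch.equal(before, embedding.weight)

embedding.weight.requires_grad = True                 # unfreeze
train_step(token_ids, labels)
updated_after_unfreeze = not torch.equal(before, embedding.weight)

print(frozen_unchanged, updated_after_unfreeze)
```

While frozen, the embedding receives no gradient, so the optimizer skips it even though it was passed in at construction time.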

Below is the full training loop:


In [None]:
import torch.optim as optim
from torch.optim import AdamW

from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


class MyDataset(Dataset):
    def __init__(self, token_ids, labels):
        """
        token_ids: list/array of token sequences (padded to same length) (num_samples, max_seq_len)
        labels: list/array of integer labels (num_samples)
        """
        self.token_ids = torch.tensor(token_ids, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.long)


    def __len__(self):
      return len(self.token_ids)

    def __getitem__(self, idx):
        return self.token_ids[idx], self.labels[idx]


# Parameters
V = tokenizer.vocab_size    # number of tokens used by the tokenizer
N = 2                       # number of transformer blocks
d_model = 256               # embedding dimension
n_heads = 4
d_ff = 1024                 # typical value is 4 x d_model
max_seq_len = 256
dropout = 0.1
num_epochs = 20
batch_size = 64
initialization = 'BERT'
initially_freeze = True
epochs_to_unfreeze = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Split training data into train + validation
train_and_val_texts = list(raw_dataset['train']['text'])
train_and_val_labels = list(raw_dataset['train']['label'])

train_texts, val_texts, train_labels, val_labels = train_test_split(train_and_val_texts, train_and_val_labels, test_size=0.2, random_state=42)

# Tokenize
train_encodings = tokenizer(
    train_texts,
    truncation=True,
    padding='max_length',
    max_length=max_seq_len,
    add_special_tokens=True,
) # (num_samples, max_seq_len) = (20000, 256)

val_encodings = tokenizer(
    val_texts,
    truncation=True,
    padding='max_length',
    max_length=max_seq_len,
    add_special_tokens=True,
)

train_dataset = MyDataset(train_encodings['input_ids'], train_labels)
val_dataset = MyDataset(val_encodings['input_ids'], val_labels)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

encoder = TransformerEncoder(V, N, d_model, n_heads, d_ff, max_seq_len, initialization, dropout)
model = TransformerClassifierCLS(encoder, d_model=d_model, num_classes=2)
model.to(device)

# Freeze BERT embeddings initially
if initialization == 'BERT' and initially_freeze:
    model.encoder.token_embedding.weight.requires_grad = False
    print("BERT embeddings frozen for initial training")


criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=5e-5, betas=(0.9, 0.999), weight_decay=1e-5)
best_val_acc = 0.0  # Keep track of the best validation accuracy

for epoch in range(num_epochs):
    model.train()

    if initialization == 'BERT' and initially_freeze and epoch == epochs_to_unfreeze:  # epoch at which we start fine-tuning the BERT embeddings
        print("Unfreezing BERT embeddings for fine-tuning...")
        model.encoder.token_embedding.weight.requires_grad = True

        # Re-create the optimizer after unfreezing (this also resets AdamW's moment estimates).
        # Optional variant with a smaller learning rate for the embeddings:
        # optimizer = AdamW([
        #     {'params': model.encoder.token_embedding.parameters(), 'lr': 1e-5},  # small LR for embeddings
        #     {'params': [p for n, p in model.named_parameters() if 'token_embedding' not in n], 'lr': 5e-5}
        # ], betas=(0.9, 0.999), weight_decay=1e-5)

        optimizer = AdamW(model.parameters(), lr=5e-5, betas=(0.9, 0.999), weight_decay=1e-5)

    running_loss = 0.0
    num_batches = len(train_dataloader)

    for i, (batch_token_ids, batch_labels) in enumerate(train_dataloader, 1):
        batch_token_ids = batch_token_ids.to(device)
        batch_labels = batch_labels.to(device)
        optimizer.zero_grad()
        logits = model(batch_token_ids)
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        print(f"Epoch {epoch+1}/{num_epochs}, Batch {i}/{num_batches}, Loss: {loss.item():.4f}")

    avg_loss = running_loss / num_batches
    print(f"\nEpoch {epoch+1}/{num_epochs} finished, Avg Loss: {avg_loss:.4f}")

    # Validation loop
    model.eval()
    correct = 0
    total = 0
    num_val_batches = len(val_dataloader)

    with torch.no_grad():
        for j, (val_token_ids, val_labels_batch) in enumerate(val_dataloader, 1):
            val_token_ids = val_token_ids.to(device)
            val_labels_batch = val_labels_batch.to(device)
            logits = model(val_token_ids)
            predictions = logits.argmax(dim=1)
            correct += (predictions == val_labels_batch).sum().item()
            total += val_labels_batch.size(0)
            print(f"Epoch {epoch+1}/{num_epochs}, Validation Batch {j}/{num_val_batches}")

    val_acc = correct / total
    print(f"\nValidation Accuracy: {val_acc:.4f}")

    # Save the best model
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "best_model.pth")
        print(f"New best model saved with Validation Accuracy: {best_val_acc:.4f}")


BERT embeddings frozen for initial training
Epoch 1/20, Batch 1/313, Loss: 0.7461
Epoch 1/20, Batch 2/313, Loss: 0.8210
Epoch 1/20, Batch 3/313, Loss: 0.7178
Epoch 1/20, Batch 4/313, Loss: 0.7179
Epoch 1/20, Batch 5/313, Loss: 0.7318
Epoch 1/20, Batch 6/313, Loss: 0.7566
Epoch 1/20, Batch 7/313, Loss: 0.7459
Epoch 1/20, Batch 8/313, Loss: 0.7213
Epoch 1/20, Batch 9/313, Loss: 0.7272
Epoch 1/20, Batch 10/313, Loss: 0.6567
Epoch 1/20, Batch 11/313, Loss: 0.6984
Epoch 1/20, Batch 12/313, Loss: 0.7568
Epoch 1/20, Batch 13/313, Loss: 0.7455
Epoch 1/20, Batch 14/313, Loss: 0.7379
Epoch 1/20, Batch 15/313, Loss: 0.7011
Epoch 1/20, Batch 16/313, Loss: 0.7266
Epoch 1/20, Batch 17/313, Loss: 0.6591
Epoch 1/20, Batch 18/313, Loss: 0.7443
Epoch 1/20, Batch 19/313, Loss: 0.7120
Epoch 1/20, Batch 20/313, Loss: 0.7363
Epoch 1/20, Batch 21/313, Loss: 0.7210
Epoch 1/20, Batch 22/313, Loss: 0.6979
Epoch 1/20, Batch 23/313, Loss: 0.7556
Epoch 1/20, Batch 24/313, Loss: 0.6964
Epoch 1/20, Batch 25/313, Los

### **Let's evaluate the model**

In [75]:
from sklearn.metrics import f1_score, confusion_matrix
import numpy as np

model.eval()
model.to(device)

best_model_path = "best_model.pth"
model.load_state_dict(torch.load(best_model_path, map_location=device, weights_only=True))

all_preds = []
all_labels = []

test_texts = list(raw_dataset['test']['text'])
test_labels = list(raw_dataset['test']['label'])

test_encodings = tokenizer(
    test_texts,
    truncation=True,
    padding='max_length',
    max_length=max_seq_len,
    add_special_tokens=True,
)

test_dataset = MyDataset(test_encodings['input_ids'], test_labels)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

num_test_batches = len(test_dataloader)
with torch.no_grad():
    for j, (batch_token_ids, batch_labels) in enumerate(test_dataloader, 1):
        batch_token_ids = batch_token_ids.to(device)
        batch_labels = batch_labels.to(device)

        logits = model(batch_token_ids)  # (batch_size, num_classes)
        preds = logits.argmax(dim=1)     # (batch_size,)

        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(batch_labels.cpu().numpy())
        print(f"test Batch {j}/{num_test_batches}", end='\r')

# Accuracy
accuracy = (np.array(all_preds) == np.array(all_labels)).mean()
print(f"Test Accuracy: {accuracy:.4f}")

# F1 Score
f1 = f1_score(all_labels, all_preds, average='binary')
print(f"F1 Score: {f1:.4f}")

# Confusion Matrix
cm = confusion_matrix(all_labels, all_preds)
print("Confusion Matrix:")
print(cm)


Test Accuracy: 0.8567
F1 Score: 0.8512
Confusion Matrix:
[[11173  1327]
 [ 2255 10245]]
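
As a consistency check, the reported accuracy and F1 score can be recomputed from the confusion matrix alone. This sketch assumes scikit-learn's convention (rows are true labels, columns are predicted labels) and treats label 1 as the positive class, which is what `average='binary'` does by default:

```python
# Entries copied from the confusion matrix printed above.
tn, fp = 11173, 1327    # true label 0: predicted 0, predicted 1
fn, tp = 2255, 10245    # true label 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.8567
print(f"Precision: {precision:.4f}")  # 0.8853
print(f"Recall:    {recall:.4f}")     # 0.8196
print(f"F1 Score:  {f1:.4f}")         # 0.8512
```

Precision is noticeably higher than recall: the model misses more positive reviews (2255 false negatives) than it mislabels negative ones (1327 false positives).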
