### Tokenization with spaCy 
(Reminder: tokenization is the process of breaking text into smaller units called tokens, such as words, punctuation marks, or subwords.)

If spaCy is not installed yet:
!pip install spacy
!python -m spacy download en_core_web_sm

More information: https://spacy.io/

In [1]:
import spacy

In [2]:
# Load small English model
nlp = spacy.load("en_core_web_sm")

In [3]:
text = "A long time ago in a galaxy far, far away."

In [4]:
doc = nlp(text)

In [5]:
# Print tokens
print([token.text for token in doc])

['A', 'long', 'time', 'ago', 'in', 'a', 'galaxy', 'far', ',', 'far', 'away', '.']


In [6]:
# how does spacy deal with contractions?
text = "I'm loving spaCy's tokenization – it's great!"

doc = nlp(text)
print([token.text for token in doc])

['I', "'m", 'loving', 'spaCy', "'s", 'tokenization', '–', 'it', "'s", 'great', '!']
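For contrast, a naive tokenizer can be sketched with a single regular expression. This is only an illustration and not how spaCy works internally: the pattern and the helper name `naive_tokenize` are assumptions made for this example. It splits the first sentence the same way spaCy does, but it breaks contractions into three pieces instead of keeping `'m` and `'s` together.

```python
import re

# Hypothetical pattern: a run of word characters, or one single non-space symbol
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def naive_tokenize(text):
    """Split text into word and punctuation tokens with one regex."""
    return TOKEN_PATTERN.findall(text)

simple_tokens = naive_tokenize("A long time ago in a galaxy far, far away.")
contraction_tokens = naive_tokenize("I'm loving spaCy's tokenization")

print(simple_tokens)
print(contraction_tokens)
```

The apostrophe ends up as a token of its own here, which is the kind of case spaCy's language-specific rules are designed to handle.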


In [7]:
# how does spaCy deal with currency symbols, emojis and dates?
text = "The price is $5.99 💰 and the date is 12/08/2025."

doc = nlp(text)
print([token.text for token in doc])

['The', 'price', 'is', '$', '5.99', '💰', 'and', 'the', 'date', 'is', '12/08/2025', '.']


In [8]:
# how does spacy deal with emojis that are part of a word?
text = "You are my sweet❤️"

In [9]:
doc = nlp(text)
print([token.text for token in doc])

['You', 'are', 'my', 'sweet', '❤', '️']
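The last token looks empty but is not. As a small check using only the standard library (the variable names are illustrative), the red heart emoji consists of two code points, and the second one, an invisible variation selector, has ended up in its own token.

```python
import unicodedata

# The emoji as it appears in the text: a heart followed by a variation selector
heart = "\u2764\ufe0f"

code_points = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in heart]
for code, name in code_points:
    print(code, name)
```

This prints `U+2764 HEAVY BLACK HEART` and `U+FE0F VARIATION SELECTOR-16`.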


### Pre-processing: lowercasing
`str.lower()` converts every cased character of the string to lowercase; characters without case, such as the emoji, stay unchanged.

In [10]:
text_lowercase = text.lower()

In [11]:
print(text_lowercase)

you are my sweet❤️


In [12]:
# the opposite = UPPERCASE
text_uppercase = text.upper()

In [13]:
print(text_uppercase)

YOU ARE MY SWEET❤️
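Why lowercase at all? A short sketch with an invented example sentence: without lowercasing, `The` and `the` count as two different tokens, which inflates the vocabulary.

```python
from collections import Counter

words = "The cat saw the other cat".split()

raw_counts = Counter(words)                           # case-sensitive counts
lowercase_counts = Counter(w.lower() for w in words)  # counts after lowercasing

# Vocabulary size and the count of "the" before and after lowercasing
print(len(raw_counts), raw_counts["the"])
print(len(lowercase_counts), lowercase_counts["the"])
```

The first line prints `5 1` and the second `4 2`.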


### Tokenization with Bert-base-uncased
Reminder: BERT uses WordPiece, a subword tokenization algorithm. Pieces that continue a word are prefixed with `##`, and the `uncased` model lowercases its input.

If not installed: !pip install transformers

In [14]:
from transformers import BertTokenizer

In [15]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [16]:
tokenizer.tokenize("unhappiness")

['un', '##ha', '##pp', '##iness']

In [17]:
tokenizer.tokenize("untranslatable")

['un', '##tra', '##ns', '##lat', '##able']

In [18]:
tokenizer.tokenize("A cat cought a mouse.")

['a', 'cat', 'cough', '##t', 'a', 'mouse', '.']
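The splits above are consistent with how WordPiece is applied to a single word: greedy longest-match-first, where every piece after the first carries the `##` prefix. The sketch below is a simplified illustration, not the `transformers` implementation. `TOY_VOCAB` is a hypothetical vocabulary containing only pieces seen in the outputs above, whereas the real `bert-base-uncased` vocabulary has about 30,000 entries.

```python
# Hypothetical toy vocabulary, limited to pieces seen in the outputs above
TOY_VOCAB = {"un", "##ha", "##pp", "##iness", "cough", "##t"}

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("unhappiness", TOY_VOCAB))
print(wordpiece_tokenize("cought", TOY_VOCAB))
print(wordpiece_tokenize("xyz", TOY_VOCAB))
```

With this toy vocabulary the first two calls reproduce the splits shown above, and the third falls back to `[UNK]`.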

### Vectorization
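Reminder: a Bag-of-Words (BoW) representation turns each document into a vector of token frequencies, with one dimension per vocabulary token.
We can adjust BoW by changing the parameters of CountVectorizer, scikit-learn's implementation of BoW. By default it lowercases the text and ignores single-character tokens, which is why "a" does not appear in the vocabulary below.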

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = [
    "A long time ago in a galaxy far, far away.",
    "The galaxy is far away."
]

# Initialize vectorizer
vectorizer = CountVectorizer()

# Fit and transform
X = vectorizer.fit_transform(docs)

# Convert to DataFrame for readability
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df)


   ago  away  far  galaxy  in  is  long  the  time
0    1     1    2       1   1   0     1    0     1
1    0     1    1       1   0   1     0    1     0
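To see what CountVectorizer computes, the table can be reproduced by hand. This sketch mimics scikit-learn rather than calling it: it assumes the default behaviour of lowercasing and keeping only tokens of two or more word characters. The names `tokenize`, `vocabulary` and `bow` are chosen for this example.

```python
import re
from collections import Counter

docs = [
    "A long time ago in a galaxy far, far away.",
    "The galaxy is far away.",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all tokens seen in any document
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary token
bow = [[c[token] for token in vocabulary] for c in counts]

print(vocabulary)
for row in bow:
    print(row)
```

The two printed rows match the two rows of the DataFrame above.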


In [20]:
# binary BoW:
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

   ago  away  far  galaxy  in  is  long  the  time
0    1     1    1       1   1   0     1    0     1
1    0     1    1       1   0   1     0    1     0


In [21]:
# n-grams:
vectorizer = CountVectorizer(ngram_range=(1,2))  # unigrams + bigrams
X = vectorizer.fit_transform(docs)
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

   ago  ago in  away  far  far away  far far  galaxy  galaxy far  galaxy is  in  in galaxy  is  is far  long  long time  the  the galaxy  time  time ago
0    1       1     1    2         1        1       1           1          0   1          1   0       0     1          1    0           0     1         1
1    0       0     1    1         1        0       1           0          1   0          0   1       1     0          0    1           1     0         0
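Where do the bigram columns come from? A minimal sketch, assuming the token list that the default tokenization produces for the first document (the single-character "a" is already gone, which is why "in galaxy" shows up as a bigram). The helper `ngrams` is written for this example.

```python
# Tokens of the first document after lowercasing and dropping "a"
tokens = ["long", "time", "ago", "in", "galaxy", "far", "far", "away"]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(tokens, 2)
print(bigrams)
```

Each of the seven bigrams corresponds to one of the two-word columns with a 1 in row 0.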


In [22]:
# stopwords removal:

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

   ago  away  far  galaxy  long  time
0    1     1    2       1     1     1
1    0     1    1       1     0     0
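Stop-word removal is just a filter applied before counting. In the sketch below, `STOP_WORDS` is a hypothetical three-word list that happens to be enough for these two documents; scikit-learn's built-in English list is much longer.

```python
import re

# Hypothetical mini stop list, for illustration only
STOP_WORDS = {"in", "is", "the"}

docs = [
    "A long time ago in a galaxy far, far away.",
    "The galaxy is far away.",
]

def content_tokens(text):
    # Same tokenization assumption as CountVectorizer's default, then filter
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

filtered_vocabulary = sorted({t for doc in docs for t in content_tokens(doc)})
print(filtered_vocabulary)
```

The remaining vocabulary has the same six tokens as the columns above.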


In [23]:
# restricting vocabulary size: keep only the 5 most frequent tokens across the whole corpus
vectorizer = CountVectorizer(max_features=5)
X = vectorizer.fit_transform(docs)
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

   ago  away  far  galaxy  in
0    1     1    2       1   1
1    0     1    1       1   0


### BoW is not just a theory! 
It is a useful representation in plenty of tasks, including spam filtering:

In [24]:
# Step 1: Import required libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [25]:
# Step 2: Define a small dataset (5 spam + 5 non-spam emails)
emails = [
    "Congratulations! You've won a free iPhone!",        # spam
    "Limited offer: Claim your $1000 gift card now!",    # spam
    "Win money now!!!",                                  # spam
    "Get rich quick with this simple trick",             # spam
    "Act now! Limited time deal on lottery tickets!",    # spam
    "Meeting scheduled at 10 AM tomorrow",               # not spam
    "Don't forget to send the report",                   # not spam
    "Lunch at 12?",                                      # not spam
    "Please review the attached document",               # not spam
    "Let’s discuss the project update",                  # not spam
]

labels = [1, 1, 1, 1, 1,   # 1 = spam
          0, 0, 0, 0, 0]   # 0 = not spam

In [26]:
# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.3, random_state=42, stratify=labels  
)

In [27]:
# check the sizes of the training and test splits
print(len(X_train))
print(len(y_train))

print(len(X_test))
print(len(y_test))

7
7
3
3


In [28]:
# Step 4: Convert text to feature vectors using CountVectorizer
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

In [29]:
# Let's have a look inside!
pd.DataFrame(X_train_vec.toarray(), columns=vectorizer.get_feature_names_out())

   1000  12  at  card  claim  congratulations  discuss  don  forget  free  ...  this  to  trick  update  ve  win  with  won  you  your
0     1   0   0     1      1                0        0    0       0     0  ...     0   0      0       0   0    0     0    0    0     1
1     0   0   0     0      0                0        1    0       0     0  ...     0   0      0       1   0    0     0    0    0     0
2     0   0   0     0      0                0        0    1       1     0  ...     0   1      0       0   0    0     0    0    0     0
3     0   0   0     0      0                0        0    0       0     0  ...     1   0      1       0   0    0     1    0    0     0
4     0   0   0     0      0                0        0    0       0     0  ...     0   0      0       0   0    1     0    0    0     0
5     0   0   0     0      0                1        0    0       0     1  ...     0   0      0       0   1    0     0    1    1     0
6     0   1   1     0      0                0        0    0       0     0  ...     0   0      0       0   0    0     0    0    0     0


In [30]:
pd.DataFrame(X_test_vec.toarray(), columns=vectorizer.get_feature_names_out())

   1000  12  at  card  claim  congratulations  discuss  don  forget  free  ...  this  to  trick  update  ve  win  with  won  you  your
0     0   0   1     0      0                0        0    0       0     0  ...     0   0      0       0   0    0     0    0    0     0
1     0   0   0     0      0                0        0    0       0     0  ...     0   0      0       0   0    0     0    0    0     0
2     0   0   0     0      0                0        0    0       0     0  ...     0   0      0       0   0    0     0    0    0     0


In [31]:
# Step 5: Train an SVM classifier
model = SVC(kernel='linear')  # simple linear SVM
model.fit(X_train_vec, y_train)

In [32]:
# Step 6: Predict on test data
y_pred = model.predict(X_test_vec)

In [33]:
# Step 7: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [34]:
# Step 8: Print evaluation results
print("📊 Evaluation Metrics on Test Set:")
print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")

📊 Evaluation Metrics on Test Set:
Accuracy:  1.00
Precision: 1.00
Recall:    1.00
F1 Score:  1.00
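To make clear what each of these scores counts, here is a sketch that computes them from scratch. The labels `y_true` and `y_hat` are invented for this example (they are not the notebook's test set) and include two mistakes so that the numbers are not all 1.00.

```python
y_true = [1, 1, 0, 0, 1]  # hypothetical gold labels (1 = spam)
y_hat = [1, 0, 0, 1, 1]   # hypothetical predictions

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # ham wrongly flagged as spam
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam that slipped through
tn = sum(t == 0 and p == 0 for t, p in pairs)  # ham correctly passed

acc = (tp + tn) / len(pairs)
prec = tp / (tp + fp)
rec = tp / (tp + fn)
f1_manual = 2 * prec * rec / (prec + rec)

print(f"Accuracy:  {acc:.2f}")
print(f"Precision: {prec:.2f}")
print(f"Recall:    {rec:.2f}")
print(f"F1 Score:  {f1_manual:.2f}")
```

Here accuracy is 0.60, while precision, recall and F1 are all 0.67.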


In [35]:
# Bonus: try the classifier on an email of your own
my_input = ['Humanitarian Grant of 2M for you, contact me for quick claims.']
# Reuse the fitted vocabulary; separate names keep X_test_vec and y_pred intact
my_input_vec = vectorizer.transform(my_input)
my_pred = model.predict(my_input_vec)

In [36]:
print(my_pred)

[1]


... and we can do the same with TF-IDF, which down-weights tokens that occur in many documents!

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [38]:
tfidf = TfidfVectorizer()
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

In [39]:
pd.DataFrame(X_train_vec.toarray(), columns=tfidf.get_feature_names_out())

       1000       12       at      card     claim  congratulations   discuss       don    forget      free  ...      this        to     trick    update        ve       win      with       won       you      your
0  0.360632      0.0      0.0  0.360632  0.360632              0.0       0.0       0.0       0.0       0.0  ...       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0  0.360632
1       0.0      0.0      0.0       0.0       0.0              0.0  0.461804       0.0       0.0       0.0  ...       0.0       0.0       0.0  0.461804       0.0       0.0       0.0       0.0       0.0       0.0
2       0.0      0.0      0.0       0.0       0.0              0.0       0.0  0.419257  0.419257       0.0  ...       0.0  0.419257       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0
3       0.0      0.0      0.0       0.0       0.0              0.0       0.0       0.0       0.0       0.0  ...  0.377964       0.0  0.377964       0.0       0.0       0.0  0.377964       0.0       0.0       0.0
4       0.0      0.0      0.0       0.0       0.0              0.0       0.0       0.0       0.0       0.0  ...       0.0       0.0       0.0       0.0       0.0  0.609819       0.0       0.0       0.0       0.0
5       0.0      0.0      0.0       0.0       0.0         0.408248       0.0       0.0       0.0  0.408248  ...       0.0       0.0       0.0       0.0  0.408248       0.0       0.0  0.408248  0.408248       0.0
6       0.0  0.57735  0.57735       0.0       0.0              0.0       0.0       0.0       0.0       0.0  ...       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0       0.0
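The weights in this table can be checked by hand. The sketch below assumes TfidfVectorizer's default settings: smoothed idf, computed as `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It recomputes row 4 under two further assumptions that are inferred from the visible columns rather than shown in the truncated table: that row 4 is the email "Win money now!!!", and that "now" occurs in two of the seven training emails.

```python
import math

n_docs = 7  # number of training emails

# Assumed document frequencies in the training set (inferred, see above)
doc_freq = {"win": 1, "money": 1, "now": 2}
# Raw counts of each token in "Win money now!!!"
term_counts = {"win": 1, "money": 1, "now": 1}

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Term frequency times idf, then scale the row to unit Euclidean length
raw = {t: term_counts[t] * idf[t] for t in term_counts}
norm = math.sqrt(sum(w * w for w in raw.values()))
tfidf_row = {t: w / norm for t, w in raw.items()}

for token, weight in tfidf_row.items():
    print(f"{token}: {weight:.6f}")
```

Under these assumptions the weight for "win" comes out at about 0.6098, in line with row 4, and "now" gets a smaller weight because it appears in more documents.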


In [40]:
# Step 5: Train an SVM classifier
model = SVC(kernel='linear')  # simple linear SVM
model.fit(X_train_vec, y_train)

In [41]:
# Step 6: Predict on test data
y_pred = model.predict(X_test_vec)

In [42]:
# Step 7: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [43]:
# Step 8: Print evaluation results
print("📊 Evaluation Metrics on Test Set:")
print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")

📊 Evaluation Metrics on Test Set:
Accuracy:  1.00
Precision: 1.00
Recall:    1.00
F1 Score:  1.00
