# 💬 NLP: Text Preprocessing and Embeddings

In this notebook, we’ll cover the fundamental techniques for **preparing text** and **converting it into a useful numerical representation** for Machine Learning and Deep Learning models. We will review:

- How to clean and normalize text
- What *tokens*, *stopwords*, *lemmatization*, etc. are
- Different ways to represent text: Bag-of-Words, TF-IDF, and Word Embeddings
- How to use `spaCy`, `NLTK`, `scikit-learn`, and `gensim`

## Text Cleaning and Preprocessing

### Why is it necessary to preprocess text?

Text data is full of noise: punctuation, capitalization, accents, irrelevant words (*stopwords*), etc. The goal of preprocessing is to transform raw text into a form that models can understand and make the most of.

Useful steps:
- Convert everything to lowercase
- Remove punctuation and special characters
- Remove *stopwords*: very common words that carry little meaning on their own, such as "the", "and", "is", ...
- Tokenize (split text into words)
- Lemmatize (reduce words to their base form)
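
Accents deserve a note: the cell below keeps only the letters a-z, so an accented word such as "café" would lose its "é" entirely. One possible way to handle accents beforehand is sketched here with the standard library. The helper name `strip_accents` is an assumption for illustration and is not part of the notebook's pipeline.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Replace accented characters with their unaccented base letters."""
    # NFKD splits "é" into "e" plus a separate combining accent mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Café crème, naïve façade"))  # Cafe creme, naive facade
```

Calling a helper like this before lowercasing and the regex cleanup would keep the base letters of accented words.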


In [1]:
import nltk   # Natural language processing library (tokenization, stopwords, etc.)
import spacy  # Library for tokenization, lemmatization and grammatical analysis
import re     # Regular expressions library for cleaning text
from nltk.corpus import stopwords
nltk.download('stopwords')

nlp = spacy.load("en_core_web_sm") # Load spaCy's small English pipeline (install it first with: python -m spacy download en_core_web_sm)
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    # Apply tokenization and lemmatize the text using the model
    doc = nlp(text)

    # Save tokens in a list
    tokens = [token.lemma_ for token in doc if token.text not in stop_words and token.is_alpha]
    return tokens

example_text = "Natural Language Processing (NLP) is amazing! It's used in so many cool applications."
preprocess_text(example_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['natural',
 'language',
 'processing',
 'nlp',
 'amazing',
 'use',
 'many',
 'cool',
 'application']

## Representing text

### Bag of Words (BoW)

The Bag-of-Words technique converts text into vectors by **counting how many times each word appears**. It is simple and useful for many cases, although it does not capture contextual meaning.

For example:
- "I like coffee"
- "Coffee is good"

Each sentence is represented as a vector of word counts over the shared vocabulary, so the two vectors overlap only in the position for "coffee".
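
As a rough illustration, the two sentences above can be turned into count vectors by hand. This is a minimal sketch using a whitespace tokenizer, which is an assumption made for brevity and not what `CountVectorizer` does internally.

```python
from collections import Counter

sentences = ["I like coffee", "Coffee is good"]

# Lowercase and split on whitespace (a deliberately simple tokenizer)
tokenized = [s.lower().split() for s in sentences]

# The vocabulary is the sorted set of all words seen in the corpus
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Each sentence becomes a vector of counts, one position per vocabulary word
bow_vectors = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)   # ['coffee', 'good', 'i', 'is', 'like']
print(bow_vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 0]]
```

Unlike this sketch, `CountVectorizer` with default settings ignores one-character tokens such as "I".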


In [2]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love natural language processing",
    "Language models are powerful tools",
    "Processing text is fun"
]

# Create a vectorizer that learns the vocabulary from the corpus and counts word frequencies
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW matrix:\n", X.toarray())

Vocabulary: ['are' 'fun' 'is' 'language' 'love' 'models' 'natural' 'powerful'
 'processing' 'text' 'tools']
BoW matrix:
 [[0 0 0 1 1 0 1 0 1 0 0]
 [1 0 0 1 0 1 0 1 0 0 1]
 [0 1 1 0 0 0 0 0 1 1 0]]


### TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) **improves BoW by reducing the weight of frequent words** and increasing the weight of rare but meaningful words.

**TF**: calculates the frequency of words in a document. A word that appears frequently in a document will have a high TF.

**IDF**: measures how rare a word is across all documents. A word that appears in every document provides little information, so it will have a low IDF.
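
To make TF and IDF concrete, here is a minimal sketch that computes the weights by hand for the three-sentence corpus from the BoW example. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is the default behaviour of scikit-learn's `TfidfVectorizer`. The helper name `tfidf_vector` is hypothetical.

```python
import math
from collections import Counter

corpus = [
    "I love natural language processing",
    "Language models are powerful tools",
    "Processing text is fun",
]

# Keep tokens of two or more characters, mirroring scikit-learn's default tokenizer
docs = [[w for w in text.lower().split() if len(w) > 1] for text in corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))

def tfidf_vector(doc):
    """Return {word: weight} for one document, L2-normalized."""
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf_vector(docs[0])
print(round(weights["language"], 4))  # 0.428
print(round(weights["love"], 4))      # 0.5628
```

"language" appears in two documents and "love" in only one, so "love" receives the higher weight. These values agree with the first row of the matrix printed by the next cell.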

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Same procedure as before
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print("Vocabulary:", tfidf.get_feature_names_out())
print("TF-IDF matrix:\n", X_tfidf.toarray())

Vocabulary: ['are' 'fun' 'is' 'language' 'love' 'models' 'natural' 'powerful'
 'processing' 'text' 'tools']
TF-IDF matrix:
 [[0.         0.         0.         0.42804604 0.5628291  0.
  0.5628291  0.         0.42804604 0.         0.        ]
 [0.46735098 0.         0.         0.35543247 0.         0.46735098
  0.         0.46735098 0.         0.         0.46735098]
 [0.         0.52863461 0.52863461 0.         0.         0.
  0.         0.         0.40204024 0.52863461 0.        ]]


### Word Embeddings

Word embeddings are dense, continuous representations of words, trained to **capture semantic relationships**. For example, in a good embedding:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

This allows models to identify patterns and relationships between words in a way that is more effective than simple frequency-based methods. We will use `gensim` to load pretrained vectors such as `Word2Vec` or `GloVe`.
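
The analogy arithmetic can be illustrated with a toy example. The vectors below are hypothetical, hand-made 3-D values chosen so that the relationship holds exactly. Real embeddings are learned from data and only satisfy it approximately.

```python
import math

# Hypothetical 3-D vectors: [royalty, gender (+ male / - female), fruit]
toy_vectors = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman, computed component by component
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Rank the remaining words by cosine similarity to the target vector
candidates = {w: v for w, v in toy_vectors.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)  # queen
```

With pretrained gensim `KeyedVectors`, `most_similar(positive=[...], negative=[...])` runs the same kind of query over the full vocabulary.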

In [4]:
!pip install gensim



In [5]:
import gensim.downloader as api

# Download pretrained embeddings
wv = api.load("glove-wiki-gigaword-50")

print("'king' vector:", wv['king'])

# Measure similarity
print("Similarity between king and queen:", wv.similarity('king', 'queen'))

'king' vector: [ 0.50451   0.68607  -0.59517  -0.022801  0.60046  -0.13498  -0.08813
  0.47377  -0.61798  -0.31012  -0.076666  1.493    -0.034189 -0.98173
  0.68229   0.81722  -0.51874  -0.31503  -0.55809   0.66421   0.1961
 -0.13495  -0.11476  -0.30344   0.41177  -2.223    -1.0756   -1.0783
 -0.34354   0.33505   1.9927   -0.04234  -0.64319   0.71125   0.49159
  0.16754   0.34344  -0.25663  -0.8523    0.1661    0.40102   1.1685
 -1.0137   -0.21585  -0.15155   0.78321  -0.91241  -1.6106   -0.64426
 -0.51042 ]
Similarity between king and queen: 0.78390425


## Example: represent some sentences with TF-IDF

In [6]:
texts = [
    "Deep learning is revolutionizing AI.",
    "Neural networks are the core of deep learning.",
    "NLP enables machines to understand human language."
]

# Preprocess and vectorize with TF-IDF
clean_texts = [" ".join(preprocess_text(t)) for t in texts]
X_clean = tfidf.fit_transform(clean_texts)

print("Clean text:", clean_texts)
print("Vocabulary:", tfidf.get_feature_names_out())
print("TF-IDF:\n", X_clean.toarray())

Clean text: ['deep learning revolutionize ai', 'neural network core deep learning', 'nlp enable machine understand human language']
Vocabulary: ['ai' 'core' 'deep' 'enable' 'human' 'language' 'learning' 'machine'
 'network' 'neural' 'nlp' 'revolutionize' 'understand']
TF-IDF:
 [[0.5628291  0.         0.42804604 0.         0.         0.
  0.42804604 0.         0.         0.         0.         0.5628291
  0.        ]
 [0.         0.49047908 0.37302199 0.         0.         0.
  0.37302199 0.         0.49047908 0.49047908 0.         0.
  0.        ]
 [0.         0.         0.         0.40824829 0.40824829 0.40824829
  0.         0.40824829 0.         0.         0.40824829 0.
  0.40824829]]


## Conclusions

- Text preprocessing is a critical step when working with text.
- BoW and TF-IDF are simple but useful representations.
- Word embeddings enable capturing meaning and relationships between words, going beyond basic frequency-based approaches.

## More advanced models


So far, we have learned classic methods for representing text, such as BoW or TF-IDF. While these approaches have been foundational in text processing, they have some important limitations:

* They do not capture the actual meaning of words.
* They treat all words as independent, ignoring semantic relationships.
* They poorly handle the context in which a word appears.

To overcome these limitations, more advanced word embedding techniques were developed. These allow us to:
- Represent words as dense vectors in a space where **semantics** is reflected in the **distance and direction** of the vectors.
- Capture relationships between words without needing to rewrite the entire corpus.

Some examples of these are: Word2Vec, FastText, and BERT (Transformers).

### Word2Vec

Word2Vec is a model created by Google that transforms **words into numeric vectors**, so that words with similar meanings have vectors that are close to each other in space. The main algorithms are:

- CBOW (Continuous Bag of Words): predicts a word based on its context.

- Skip-gram: predicts the context based on a single word.
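
The difference between the two algorithms is easiest to see in the training examples each one builds from a sentence. The sketch below only generates those examples and does not train anything. The function name `training_pairs`, the sample sentence, and the window size are assumptions for illustration.

```python
def training_pairs(tokens, window=1):
    """Yield (context_words, center_word) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center

tokens = ["the", "cat", "sat", "down"]

# CBOW: the whole context is the input, the center word is the label
cbow = [(context, center) for context, center in training_pairs(tokens)]

# Skip-gram: the center word is the input, each context word is a separate label
skipgram = [(center, word) for context, center in training_pairs(tokens) for word in context]

print(cbow[1])       # (['the', 'sat'], 'cat')
print(skipgram[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

In gensim, the `sg` parameter selects between the two: `sg=0` for CBOW and `sg=1` for Skip-gram.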

In [7]:
# !pip install gensim
from gensim.models import Word2Vec

# Data for training the model
sentences = [
    ["king", "queen", "man", "woman"],
    ["paris", "france", "berlin", "germany"],
    ["apple", "orange", "banana", "fruit"]
]

# Word2Vec model: CBOW
model_w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)  # CBOW
# print("'King' vector:", model_w2v.wv['king'])

# Measure similarity between 'king' and 'queen'
print("Similarity between king and queen:", model_w2v.wv.similarity('king', 'queen'))

Similarity between king and queen: 0.0651624


> This does not work well here because the training sample is far too small.

### FastText

FastText represents each word as a **set of character n-grams**, which lets it build vectors even for unknown words (Out Of Vocabulary, OOV).
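
A minimal sketch of how a word can be split into character n-grams is shown below. It wraps the word in `<` and `>` boundary markers and uses lengths 3 to 6, which are the defaults in gensim's `FastText`. The function name `char_ngrams` is hypothetical, and the real model additionally hashes the n-grams into buckets and learns one vector per bucket.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word, with < and > as boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

print(sorted(char_ngrams("king", max_n=3)))  # ['<ki', 'ing', 'kin', 'ng>']

# An unseen word still shares n-grams with words seen in training
shared = char_ngrams("kingdom") & char_ngrams("king")
print(sorted(shared))
```

Because "kingdom" shares several n-grams with "king", a model trained only on "king" can still assemble a vector for "kingdom" from the pieces it knows.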

In [8]:
from gensim.models import FastText

# Data for training the model
sentences = [
    ["king", "queen", "man", "woman"],
    ["paris", "france", "berlin", "germany"],
    ["apple", "orange", "banana", "fruit"]
]

# FastText model
model_fasttext = FastText(sentences, vector_size=50, window=3, min_count=1)

# Display words similar to 'king'.
print("Similar words to 'king':", model_fasttext.wv.most_similar('king'))

Similar words to 'king': [('queen', 0.0559174083173275), ('paris', 0.019648853689432144), ('france', -0.008112777955830097), ('apple', -0.03811424598097801), ('fruit', -0.05331452935934067), ('woman', -0.11958061158657074), ('banana', -0.12536485493183136), ('man', -0.17913685739040375), ('orange', -0.1807599514722824), ('berlin', -0.18634317815303802)]


### BERT

BERT (Bidirectional Encoder Representations from Transformers) is a model that reads **context from both left to right and right to left** at the same time. It was revolutionary because it understands the meaning of a word in its full context.

In [9]:
#!pip install transformers

from transformers import BertTokenizer, BertModel
import torch
import numpy as np

# Load the base BERT model and its tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_token_embedding(sentence, token):
    # Tokenize the sentence and get its representations
    inputs = tokenizer(sentence, return_tensors="pt")
    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the position of the first occurrence of the token in the sentence
    token_id = inputs['input_ids'][0].tolist().index(tokenizer.convert_tokens_to_ids(token))
    # Return its embedding
    return outputs.last_hidden_state[0][token_id].numpy()

# Two sentences with "bank" in different contexts
sentence1 = "He deposited money in the bank."
sentence2 = "The river overflowed the bank after the storm."

# Get both embeddings
embedding1 = get_token_embedding(sentence1, 'bank')
embedding2 = get_token_embedding(sentence2, 'bank')

# Calculate cosine similarity between the two embedding vectors
cosine_similarity = np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))

print(f"Similarity between 'bank' in different contexts: {cosine_similarity:.4f}")

Similarity between 'bank' in different contexts: 0.4849


## (APPENDIX A) Intro to LLMs and Chatbots



### What is an LLM?

A **Large Language Model (LLM)** is a language model trained on massive amounts of text to predict and generate human language in a coherent way. Some well-known examples include:

- GPT (OpenAI)
- LLaMA (Meta)
- Claude (Anthropic)
- Mistral
- Falcon
- PaLM (Google)

These models are capable of:

✅ Answering questions

✅ Translating text

✅ Generating code

✅ Summarizing documents

✅ Holding conversations (like a chatbot)

LLMs are based on Transformers, an architecture introduced by Vaswani et al. in 2017.

### What is a Chatbot?

A chatbot is an application that simulates conversation with humans. There are two main types:

- **Rule-based**: Respond using predefined patterns (if...else, decision trees)
- **LLM-based**: Use models like GPT to generate more dynamic and natural responses
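
For contrast with the LLM-based demos further down, a rule-based chatbot can be sketched in a few lines. The patterns and replies here are hypothetical examples, not rules from any real system.

```python
import re

# Hypothetical rules: (pattern, canned reply). Order matters; the first match wins.
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE), "Hello! How can I help you?"),
    (re.compile(r"\b(hours|open)\b", re.IGNORECASE), "We are open from 9:00 to 17:00."),
    (re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE), "Goodbye!"),
]
FALLBACK = "Sorry, I did not understand that."

def rule_based_reply(message: str) -> str:
    """Return the reply of the first rule whose pattern matches the message."""
    for pattern, reply in RULES:
        if pattern.search(message):
            return reply
    return FALLBACK

print(rule_based_reply("Hello there"))           # Hello! How can I help you?
print(rule_based_reply("What is the weather?"))  # Sorry, I did not understand that.
```

The fallback reply shows the main limitation of this approach: anything outside the predefined patterns goes unanswered.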

Modern chatbots combine:
- An interface (web, mobile, WhatsApp, etc.)
- A backend with an LLM
- Possibly a database or memory

### DEMO: Example Chatbots

In [10]:
!pip install openai



In [11]:
import openai

# API management: https://platform.openai.com/settings/organization/api-keys
key = "YOUR_API_KEY_HERE"

In [None]:
from openai import OpenAI

client = OpenAI(api_key=key)
respuesta = client.chat.completions.create(
    model="gpt-3.5-turbo",  # or "gpt-4o"
    messages=[{"role": "user", "content": "How does DNA work?"}]
)
print(respuesta.choices[0].message.content)

> Although the code is complete, calling the API requires an OpenAI API key with billing set up (API usage is billed separately from a ChatGPT subscription). As an alternative, we can run a pre-trained open model from Hugging Face.

In [13]:
!pip install -q transformers huggingface_hub

In [14]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Configure the model
model_id = "tiiuae/falcon-rw-1b"
token = "YOUR_TOKEN_HERE"  # Token management in https://huggingface.co/settings/tokens

tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", token=token)

def chatbot(pregunta):
    # Move the inputs to whichever device the model was loaded on (GPU or CPU)
    inputs = tokenizer(pregunta, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=20)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example phrases
# print(chatbot("How does photosynthesis work?"))
print(chatbot("Why is important to drink water?"))


Why is important to drink water?
Water is important for the body to function properly. It is the main component of the body.


## (APPENDIX B) LLM integration

### FastAPI: Fast and modern API with Python

🔍 What is it?

FastAPI is a web framework for **building fast, secure, and easily scalable APIs**. It is based on standard Python type hints and is ideal for serving ML models or connecting a frontend with an intelligent backend (e.g., a chatbot).

🚀 Why use it?

- Very fast (built on Starlette, runs on an ASGI server such as Uvicorn, and supports `async` endpoints)
- Ideal for microservices
- Automatically generates interactive OpenAPI (Swagger UI) documentation
- Compatible with ML frameworks (scikit-learn, PyTorch, TensorFlow, etc.)

In [15]:
!pip install fastapi uvicorn



In [16]:
# Example: API that responds using ChatGPT
from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

# Create the FastAPI app and the OpenAI client (same client interface as the demo above)
app = FastAPI()
client = OpenAI(api_key="YOUR_API_KEY_HERE")

# The request body: a string message sent by the user
class Prompt(BaseModel):
    message: str

# Define a POST endpoint at the "/chat/" route that receives a JSON with the message
@app.post("/chat/")
def chat(prompt: Prompt):
    # Call the OpenAI API to generate a chat response
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt.message}]
    )
    # And return the response
    return {"reply": response.choices[0].message.content}

> To test locally, save the code as `app.py` and run: `uvicorn app:app --reload`
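
Once the server is running, the `/chat/` endpoint expects a POST request with a JSON body containing `message`. The sketch below builds such a request with the standard library. It assumes the server listens on Uvicorn's default address, `http://127.0.0.1:8000`, and the helper name `build_chat_request` is hypothetical.

```python
import json
import urllib.request

def build_chat_request(message: str, base_url: str = "http://127.0.0.1:8000"):
    """Build (but do not send) a POST request for the /chat/ endpoint."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_chat_request("How does DNA work?")
print(request.get_method(), request.full_url)  # POST http://127.0.0.1:8000/chat/

# With the server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["reply"])
```

The same request can also be tried interactively from the documentation page that FastAPI serves at `/docs`.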

### LangChain: LLMs with tools, memory, and logic

LangChain is a library for **building applications with LLMs** that can:
- Retain conversation memory
- Access documents, APIs, or databases
- Integrate with intelligent agents and external tools

🎯 Ideal for:
- Contextual chatbots with memory
- QA systems over PDFs
- Autonomous agents with logic and decision-making

🧠 Key Concepts:
* Chains: Sequences of steps (input → prompt → LLM → output)
* Memory: Stores the conversation history
* Tools: Access to external sources (Google, Python, databases, etc.)
* Agents: Models that decide which tool to use at each step
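
The ideas of a chain and of memory can be illustrated without LangChain itself. The sketch below is plain Python and does not use LangChain's API. The class `ConversationChain` defined here and the stand-in model `fake_llm` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

class ConversationChain:
    """Toy chain: history + new input -> LLM -> output, with memory."""

    def __init__(self, llm: Callable[[List[Message]], str]):
        self.llm = llm                   # any function that maps messages to a reply
        self.memory: List[Message] = []  # stored conversation history

    def run(self, user_input: str) -> str:
        self.memory.append({"role": "user", "content": user_input})
        reply = self.llm(self.memory)    # the model sees the full history
        self.memory.append({"role": "assistant", "content": reply})
        return reply

def fake_llm(messages: List[Message]) -> str:
    """Stand-in for a real model: reports how many user turns it has seen."""
    turns = sum(1 for m in messages if m["role"] == "user")
    return f"You have sent {turns} message(s)."

chain = ConversationChain(fake_llm)
print(chain.run("Hi"))          # You have sent 1 message(s).
print(chain.run("Still here"))  # You have sent 2 message(s).
```

Swapping `fake_llm` for a function that calls a real model would turn this into a chatbot that remembers earlier turns, which is the behaviour the Memory component provides.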