# Data Preprocessing and Cleaning

Data cleaning is an essential step in preparing data to be processed by, and served to, a large language model. Uncleaned data limits how accurately the model can interpret its input. Cleaning brings the data into a consistent format and removes inconsistencies, errors, and irrelevant content that could lead to incorrect results. It also shortens processing, because the data no longer has to be checked by hand. Taking the time to clean the data properly therefore leads to better results from the model.

> 📍 Fill out the missing pieces in the source code to get everything working (indicated by `#FIXME`).

## Count tokens

> **Important note**
>
> *You do not need to manually tokenize strings before feeding text into the model. This is done automatically once you put instructions into the `prompt` parameter. However, you can use the `tiktoken` library to check how a string is tokenized and to count the number of tokens in order to calculate the cost of an API call. Learn more [here](https://platform.openai.com/docs/introduction/tokens).*

Tokenizing text strings is an important step in natural language processing (NLP) as it helps models like GPT-3 understand the structure of a text string. Tokenizing breaks a text string into smaller pieces called tokens, which can then be analyzed and used by the model. By understanding the structure of a text string, models can better understand the meaning of the text. Additionally, tokenizing helps to determine the cost of an Azure OpenAI Service API call, as usage is priced by token. Furthermore, different models use different encodings, so it is important to tokenize text strings in the appropriate format.

`tiktoken` supports three encodings used by Azure OpenAI Service models:

| Encoding name | Azure OpenAI Service models |
| ------------- | -------------- |
| gpt2 (or r50k_base) | Most GPT-3 models |
| p50k_base | Code models, text-davinci-002, text-davinci-003 |
| cl100k_base | text-embedding-ada-002 |
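
The table can be turned into a small lookup so that the encoding name is derived from the model name instead of being hard-coded. This is only a sketch: the helper `encoding_name_for_model` and the rule that code models start with `code-` are assumptions made for illustration. `tiktoken` also provides its own `encoding_for_model` function, which is the more reliable choice for models it recognizes.

```python
# Explicit model-to-encoding pairs taken from the table above.
EXPLICIT_ENCODINGS = {
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "text-embedding-ada-002": "cl100k_base",
}

def encoding_name_for_model(model: str) -> str:
    """Return the encoding name for a model, following the table above."""
    if model in EXPLICIT_ENCODINGS:
        return EXPLICIT_ENCODINGS[model]
    # Assumption: code models can be recognized by a "code-" prefix.
    if model.startswith("code-"):
        return "p50k_base"
    # Fallback for the "most GPT-3 models" row of the table.
    return "r50k_base"

print(encoding_name_for_model("text-davinci-003"))        # p50k_base
print(encoding_name_for_model("text-embedding-ada-002"))  # cl100k_base
print(encoding_name_for_model("davinci"))                 # r50k_base
```

The returned name can be passed directly to `tiktoken.get_encoding`.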

Tokens in English typically range from one character to one word (e.g. "t" or " great"), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the start of a word (e.g. " is" instead of "is " or " "+"is"). You can use the Tokenizer to quickly check how a string is tokenized.

As a brief demonstration, we will use `tiktoken` to tokenize a text string and see what the output looks like.

In [None]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.8.0


In [None]:
import tiktoken

encoding = tiktoken.get_encoding("p50k_base")
encoding.encode("Hello world, this is fun!")

[15496, 995, 11, 428, 318, 1257, 0]

Write a script that shows the string tokens from an input phrase.

In [None]:
# Function to show tokens for a given input phrase
def show_tokens(input_phrase):
    tokens = encoding.encode(input_phrase)
    token_strings = [encoding.decode([token]) for token in tokens]
    return token_strings

# Input phrase
input_phrase = input("Enter a phrase: ")

# Show the tokens for the input phrase
tokens = show_tokens(input_phrase)
print(f"Tokens: {tokens}")

Enter a phrase: Hello world, this is fun!
Tokens: ['Hello', ' world', ',', ' this', ' is', ' fun', '!']


Let's write a function to count the number of tokens in a text string.

In [None]:
def get_num_tokens_from_string(string: str, encoding_name: str='p50k_base') -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

get_num_tokens_from_string("Hello World, this is fun!")

7
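
A token count is most useful when it is compared against a limit. The sketch below checks whether a prompt, plus the tokens reserved for the completion, fits into a model's context window. The function `fits_in_context` and its parameter names are hypothetical, and the default of 2048 is the davinci limit this notebook cites in the next section. In practice the prompt count would come from `get_num_tokens_from_string`.

```python
def fits_in_context(prompt_tokens: int, completion_tokens: int, context_limit: int = 2048) -> bool:
    """Return True if the prompt and the reserved completion fit in the context window."""
    if prompt_tokens < 0 or completion_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + completion_tokens <= context_limit

print(fits_in_context(7, 100))     # True: 107 tokens in total
print(fits_in_context(2000, 100))  # False: 2100 tokens exceed 2048
```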

## Clean data

Next we'll perform some light data cleaning by removing redundant whitespace and cleaning up the punctuation to prepare the data for tokenization. As with manual tokenization, cleaning is not required for the model to work, but it is good practice: it keeps the data in a consistent format that the model can process correctly. It also helps keep the request within the model's limit. The maximum number of tokens for davinci is 2048, i.e., roughly 2-3 pages of text.

> **Best Practices**
> - **Replace newlines with a single space**: Unless you're embedding code, we suggest replacing newlines (\n) in your input with a single space, as we have observed inferior results when newlines are present.

In [None]:
import re

def normalize_text(string, sep_token=" \n "):
    """Normalize text by collapsing whitespace and tidying repeated punctuation.

    `sep_token` is unused and kept only so that existing calls keep working.
    """
    # collapse every run of whitespace (spaces, tabs, newlines) into one space
    string = re.sub(r'\s+', ' ', string).strip()
    # remove stray ". ," sequences (the dot is escaped so it matches a literal period)
    string = re.sub(r"\. ,", "", string)
    # reduce doubled periods; a single pass turns "..." into "..", not "."
    string = string.replace("..", ".")
    string = string.replace(". .", ".")
    return string.strip()

Generate some data to test the cleaning function.

In [None]:
# Test cases to evaluate the function
test_strings = [
    "This is    a test   sentence.   ",  # Excessive spaces
    "Hello, world! How    are  you?    ",  # Punctuation and spaces
    "Test..\n\nString.",  # Periods and newlines
    "Multiple   spaces\n    between words. ",  # Multiple spaces and newlines
    "  \n   Leading and trailing spaces.\n",  # Leading and trailing spaces
    "Punctuation...and. spaces.,   should, be, normalized.",  # Multiple punctuation
    "     \n\n   Normalized text should  be all spaces now.    ",  # Newlines with extra spaces
    "This is an example... sentence. More text here.",  # Periods and sentence split
]

# Applying the function to test strings
for i, test in enumerate(test_strings, 1):
    normalized = normalize_text(test)
    print(f"Test {i}:")
    print(f"Original: '{test}'")
    print(f"Normalized: '{normalized}'")
    print('-' * 40)


Test 1:
Original: 'This is    a test   sentence.   '
Normalized: 'This is a test sentence.'
----------------------------------------
Test 2:
Original: 'Hello, world! How    are  you?    '
Normalized: 'Hello, world! How are you?'
----------------------------------------
Test 3:
Original: 'Test..

String.'
Normalized: 'Test. String.'
----------------------------------------
Test 4:
Original: 'Multiple   spaces
    between words. '
Normalized: 'Multiple spaces between words.'
----------------------------------------
Test 5:
Original: '  
   Leading and trailing spaces.
'
Normalized: 'Leading and trailing spaces.'
----------------------------------------
Test 6:
Original: 'Punctuation...and. spaces.,   should, be, normalized.'
Normalized: 'Punctuation..and. spaces., should, be, normalized.'
----------------------------------------
Test 7:
Original: '     

   Normalized text should  be all spaces now.    '
Normalized: 'Normalized text should be all spaces now.'
----------------------------

## Exercise 1: Validate text length before sending to OpenAI

Write a function called `validate_text_length` that checks whether a text exceeds a specified token limit. If the text exceeds the limit, the function should split it into segments of a suitable size to send to the OpenAI model. Use the `cl100k_base` encoding.



In [17]:
import tiktoken

def validate_text_length(text, max_tokens=2048, encoding_name='cl100k_base'):
    """Split text into segments of at most max_tokens tokens each."""
    # Get the encoder for the requested encoding ('cl100k_base' by default)
    encoding = tiktoken.get_encoding(encoding_name)

    # Encode the text to obtain its tokens
    tokens = encoding.encode(text)

    # Split the tokens into segments according to the limit
    segments = []
    for i in range(0, len(tokens), max_tokens):
        segments.append(encoding.decode(tokens[i:i + max_tokens]))

    return segments

# Test the function
text = "Lorem ipsum " * 500  # Long text
segments = validate_text_length(text)

# Check the number of segments and the token length of the first one
print(len(segments))  # 1 here: the text stays under the default 2048-token limit; a lower max_tokens yields more segments
print(len(tiktoken.get_encoding('cl100k_base').encode(segments[0])))  # <= max_tokens (2048)
# Expected result:
# The text is split into segments that OpenAI can process.
# Each segment has at most the specified number of tokens.


1
1001
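
Hard cuts at exactly `max_tokens` can separate a sentence from its context. One common variation, sketched here on a plain list of token ids so that it runs without `tiktoken`, lets consecutive segments share a few tokens. The function name `split_with_overlap` and the `overlap` parameter are assumptions for illustration and are not part of the exercise.

```python
from typing import List

def split_with_overlap(tokens: List[int], max_tokens: int, overlap: int = 0) -> List[List[int]]:
    """Split token ids into windows of at most max_tokens that share `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    segments = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start:start + max_tokens])
        # Stop once a window reaches the end, so no trailing window merely repeats the overlap.
        if start + max_tokens >= len(tokens):
            break
    return segments

token_ids = list(range(10))  # stand-in for encoding.encode(text)
print(split_with_overlap(token_ids, max_tokens=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window could then be turned back into text with `encoding.decode`, as `validate_text_length` does.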


## Exercise 2: Filter prohibited words from a text

Write a function called `filter_prohibited_words` that removes prohibited words from a text before it is sent to OpenAI.

As an example, we want to remove words such as: ['password', 'confidential', 'secret'].



In [19]:
import re

def filter_prohibited_words(text, prohibited_words):
    """Replace every prohibited word in text with '[REDACTED]'."""
    # Iterate over each prohibited word and replace it with '[REDACTED]'
    for word in prohibited_words:
        # re.escape keeps special characters in the word from being read as regex syntax;
        # \b restricts the match to whole words
        text = re.sub(r'\b' + re.escape(word) + r'\b', '[REDACTED]', text, flags=re.IGNORECASE)
    return text

# Test the function
text = "This document contains confidential and secret information."
prohibited_words = ['password', 'confidential', 'secret']

cleaned_text = filter_prohibited_words(text, prohibited_words)
print(cleaned_text)  # Expected result: "This document contains [REDACTED] and [REDACTED] information."
# Expected result:
# The prohibited words in the text are replaced by [REDACTED].



This document contains [REDACTED] and [REDACTED] information.
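
When the list of prohibited words grows, running one substitution per word rescans the text each time. A possible alternative, sketched below, compiles all words into a single alternation pattern. The helper name `build_redactor` is hypothetical. Its behavior (whole-word, case-insensitive replacement with `[REDACTED]`) mirrors the function above.

```python
import re
from typing import Callable, Iterable

def build_redactor(prohibited_words: Iterable[str], placeholder: str = "[REDACTED]") -> Callable[[str], str]:
    """Return a function that replaces any prohibited word with the placeholder."""
    # Longest words first, so a longer entry wins over a shorter one that is its prefix.
    words = sorted(set(prohibited_words), key=len, reverse=True)
    if not words:
        return lambda text: text
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b", re.IGNORECASE)
    # A function replacement avoids interpreting backslashes in the placeholder.
    return lambda text: pattern.sub(lambda _match: placeholder, text)

redact = build_redactor(["password", "confidential", "secret"])
print(redact("This document contains Confidential and secret information."))
# This document contains [REDACTED] and [REDACTED] information.
```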


## Exercise 3: Identify the language of a text

Create a function `detect_language` that uses `langdetect` to identify the language of a text and returns the language code (e.g., en, es). Check the language before sending the text to the OpenAI model. You need to install the `langdetect` library.


In [21]:
pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993222 sha256=ae5b05034118457494fe948a28ed1f886e1f4f03ab22790292afc4a77de813a4
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711

In [22]:
from langdetect import DetectorFactory, detect

# langdetect is non-deterministic by default; fixing the seed makes results reproducible
DetectorFactory.seed = 0

def detect_language(text):
    """Detect the language of a text and return its code (e.g. 'en', 'es')."""
    # Detect the language using langdetect's detect function
    return detect(text)

# Test the function
text = "Hola, este es un texto en español."
language = detect_language(text)
print(language)  # Expected result: 'es' (Spanish)
# Expected result:
# The code of the detected language (es, en, etc.) is returned.


es
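
To act on the detected language before calling the model, the result can be compared against the languages an application supports. The sketch below takes the detector as a parameter so that it can be exercised with a stub. In the notebook one would pass `detect_language`. The names `is_supported_language` and `allowed`, and the choice to treat any detector error as "unsupported", are assumptions made for this example.

```python
from typing import Callable, Iterable

def is_supported_language(text: str, detector: Callable[[str], str],
                          allowed: Iterable[str] = ("en", "es")) -> bool:
    """Return True if the detector reports one of the allowed language codes."""
    if not text.strip():
        return False  # nothing to detect
    try:
        code = detector(text)
    except Exception:
        # Assumption: a detector failure (e.g. text without letters) counts as unsupported.
        return False
    return code in set(allowed)

# Stub detector standing in for detect_language, so the example runs without langdetect.
def fake_detector(text: str) -> str:
    return "es" if "español" in text else "fr"

print(is_supported_language("Hola, este es un texto en español.", fake_detector))  # True
print(is_supported_language("Bonjour tout le monde.", fake_detector))              # False
```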


## Exercise 5: Generate effective prompts

Write a function `generate_effective_prompt` that generates an optimized prompt for OpenAI, given a topic and a target audience. Use templates such as:
“Explain [topic] in simple terms for a [audience].”



In [23]:
def generate_effective_prompt(topic, audience="general audience"):
    """Generate an optimized prompt for OpenAI."""
    # Template used to build the prompt
    return f"Explain {topic} in simple terms for a {audience}."

# Test the function
topic = "machine learning"
prompt = generate_effective_prompt(topic, audience="beginner")
print(prompt)  # Expected result: "Explain machine learning in simple terms for a beginner."





Explain machine learning in simple terms for a beginner.
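
The exercise speaks of templates in the plural. One way to support several is sketched below, under the assumption that templates live in a dictionary keyed by a short name. The `TEMPLATES` dictionary, its keys, the second template, and the function `build_prompt` are all invented for illustration.

```python
TEMPLATES = {
    "explain": "Explain {topic} in simple terms for a {audience}.",
    # Hypothetical second template, included only to show how the lookup works.
    "summarize": "Summarize the key ideas of {topic} for a {audience}.",
}

def build_prompt(template_name: str, topic: str, audience: str = "general audience") -> str:
    """Fill the named template with a topic and an audience."""
    try:
        template = TEMPLATES[template_name]
    except KeyError:
        raise ValueError(f"unknown template: {template_name!r}") from None
    return template.format(topic=topic, audience=audience)

print(build_prompt("explain", "machine learning", audience="beginner"))
# Explain machine learning in simple terms for a beginner.
```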


## Exercise 6: Estimate the cost of using OpenAI

Create a function `estimate_cost` that calculates the estimated cost of a request to OpenAI based on the number of tokens. Assume a cost of $0.02 per 1,000 tokens.



In [24]:
pip install tiktoken



In [25]:
import tiktoken

def estimate_cost(text, encoding_name='cl100k_base', cost_per_1k_tokens=0.02):
    """Calculate the estimated cost of a text based on its number of tokens."""
    # Load the appropriate encoding
    encoding = tiktoken.get_encoding(encoding_name)

    # Count the number of tokens in the text
    num_tokens = len(encoding.encode(text))

    # Calculate the cost from the number of tokens
    cost = (num_tokens / 1000) * cost_per_1k_tokens

    return cost

# Test the function
text = "This is a sample text." * 100
cost = estimate_cost(text)
print(f"Estimated cost: ${cost:.4f}")  # Expected result: approximate cost based on the number of tokens




Estimated cost: $0.0100
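
The arithmetic can be checked by hand. At the assumed price of $0.02 per 1,000 tokens, the output above ($0.0100) corresponds to about 500 tokens, since 500 / 1000 * 0.02 = 0.01. The sketch below separates that calculation from tokenization and sums it over several requests. The helper names `cost_from_tokens` and `total_cost` are hypothetical.

```python
from typing import Iterable

def cost_from_tokens(num_tokens: int, cost_per_1k_tokens: float = 0.02) -> float:
    """Cost of one request, given its token count and a price per 1,000 tokens."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return num_tokens / 1000 * cost_per_1k_tokens

def total_cost(token_counts: Iterable[int], cost_per_1k_tokens: float = 0.02) -> float:
    """Sum the cost over several requests."""
    return sum(cost_from_tokens(n, cost_per_1k_tokens) for n in token_counts)

print(f"${cost_from_tokens(500):.4f}")             # $0.0100
print(f"${total_cost([500, 1500, 2000]):.4f}")     # $0.0800
```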


## Exercise 7: Clean JSON text for OpenAI

Write a function `clean_json_text` that takes a JSON object as input and strips non-ASCII characters from all keys and values.


In [26]:
import json
import re

def clean_value(value):
    """Clean a value by removing non-ASCII characters."""
    # Use a regular expression to remove non-ASCII characters
    if isinstance(value, str):
        return re.sub(r'[^\x00-\x7F]+', '', value)
    return value

def clean_json_text(json_obj):
    """Clean JSON text by removing non-ASCII characters from top-level keys and values."""
    return {clean_value(k): clean_value(v) for k, v in json_obj.items()}

# Test the function
data = {"key1": "Hello 😊", "key2": "Café and thé"}
cleaned_data = clean_json_text(data)
print(cleaned_data)  # Expected result: {"key1": "Hello ", "key2": "Caf and th"}


{'key1': 'Hello ', 'key2': 'Caf and th'}
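
`clean_json_text` only processes the top level of a dictionary, so strings inside nested objects or lists pass through unchanged. A recursive variant is sketched below. The name `clean_json_deep` is hypothetical, and the sketch assumes the input is already-parsed JSON (dicts, lists, strings, numbers, booleans, `None`).

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def clean_json_deep(value):
    """Recursively strip non-ASCII characters from every string key and value."""
    if isinstance(value, str):
        return NON_ASCII.sub("", value)
    if isinstance(value, dict):
        return {clean_json_deep(k): clean_json_deep(v) for k, v in value.items()}
    if isinstance(value, list):
        return [clean_json_deep(item) for item in value]
    return value  # numbers, booleans and None are returned unchanged

data = {"user": {"name": "Zoë", "tags": ["café", "thé"]}, "count": 3}
print(clean_json_deep(data))
# {'user': {'name': 'Zo', 'tags': ['caf', 'th']}, 'count': 3}
```

Note that two keys which differ only in non-ASCII characters collapse into one after cleaning, so the later value overwrites the earlier one.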
