To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

Features in the notebook:
1. Uses Maxime Labonne's [FineTome 100K](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset.
2. Converts ShareGPT to HuggingFace format via `standardize_sharegpt`.
3. Trains on completions / assistant responses only via `train_on_responses_only`.
4. Unsloth now supports Torch 2.4, all TRL & Xformers versions & Python 3.12!

In [10]:
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

[notice] A new release of pip is available: 23.3.1 -> 24.2
[notice] To update, run: python -m pip install --upgrade pip
Found existing installation: unsloth 2024.10.3
Uninstalling unsloth-2024.10.3:
  Successfully uninstalled unsloth-2024.10.3
Collecting unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-m8w3ulkw/unsloth_4d007f8437474ee68cf83f53a8e8013f
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-m8w3ulkw/unsloth_4d007f8437474ee68cf83f53a8e8013f
  Resolved https://github.com/unslothai/unsloth.git to commit 1f52468fa31bf0b641ec96217ef0f5916a07fce5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
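
To make the RoPE scaling bullet above more concrete, here is a minimal sketch of linear position scaling, which is our reading of the linked method: positions are divided by a scale factor before the rotary angles are computed, so a longer sequence reuses the angle range seen during pretraining. The function name `rope_angles`, the `base` of 10000 and the toy head dimension are assumptions for illustration. This is not Unsloth's internal code.

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotary angles for one position under linear position scaling (illustrative)."""
    scaled_position = position / scale  # scale > 1 compresses positions
    # One angle per pair of dimensions: pos * base^(-2i/d)
    return [scaled_position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# With scale = 2, position 4096 gets the same angles as position 2048 unscaled
assert rope_angles(4096, 8, scale=2.0) == rope_angles(2048, 8)
```

Under this assumption, doubling `max_seq_length` relative to the pretraining context would correspond to `scale = 2.0`.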

In [12]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "model2", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    load_in_8bit=False,
    trust_remote_code=True
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

Are you certain you want to do remote code execution?
==((====))==  Unsloth 2024.10.3: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA H100 NVL. Max memory: 93.122 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 9.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.10.3 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
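
As a rough check on how few parameters LoRA trains, the count can be worked out by hand: each adapted linear layer gains two small matrices with `r * (in_features + out_features)` weights in total. The sketch below assumes Llama-3.2-1B shapes (hidden size 2048, MLP size 8192, key/value projection width 512, 16 layers). The notebook does not state these dimensions, so treat them as an assumption; with them, the result matches the trainable-parameter count that the training banner reports later.

```python
def lora_params(in_features, out_features, r):
    """A LoRA pair adds A (r x in) and B (out x r) to one linear layer."""
    return r * (in_features + out_features)

# Assumed Llama-3.2-1B shapes (not stated in the notebook)
hidden, intermediate, kv_dim, layers, r = 2048, 8192, 512, 16, 16

per_layer = (
    lora_params(hidden, hidden, r)              # q_proj
    + 2 * lora_params(hidden, kv_dim, r)        # k_proj, v_proj (grouped-query attention)
    + lora_params(hidden, hidden, r)            # o_proj
    + 2 * lora_params(hidden, intermediate, r)  # gate_proj, up_proj
    + lora_params(intermediate, hidden, r)      # down_proj
)
total_trainable = per_layer * layers
print(f"{total_trainable:,}")  # 11,272,192
```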


<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The cells below train for two full epochs (`num_train_epochs=2`); for a quick test you can instead cap the run with `max_steps` (for example 60 steps). We also support TRL's `DPOTrainer`!

In [4]:
from datasets import load_dataset
data_path="dataset.csv"
corpus=load_dataset('csv', data_files=data_path,column_names=['instruct', 'input', 'output'],cache_dir=None)

Generating train split: 0 examples [00:00, ? examples/s]

In [5]:
print(corpus)

DatasetDict({
    train: Dataset({
        features: ['instruct', 'input', 'output'],
        num_rows: 21084
    })
})


In [6]:
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN so the model learns where to stop
def formatting_prompts_func(examples):
  instructions = examples["instruct"]
  inputs = examples["input"]
  outputs = examples["output"]
  texts = []
  # Loop variable renamed to input_text to avoid shadowing the built-in input()
  for instruction, input_text, output in zip(instructions, inputs, outputs):
    # Join the three fields with single spaces and terminate with EOS
    text = instruction + " " + input_text + " " + output + EOS_TOKEN
    texts.append(text)
  return {"text":texts}

dataset = corpus.map(formatting_prompts_func, batched = True, keep_in_memory=False, num_proc=1)

Map:   0%|          | 0/21084 [00:00<?, ? examples/s]
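
The training text above joins the three fields with plain spaces, while the inference cells later in this notebook prompt the model as `Instruction: ... Input: ... Output:`. One way to keep the two in step is to build both strings from a single template. The sketch below is a suggestion under that assumption, not something this notebook does; `PROMPT_TEMPLATE`, `build_prompt` and `build_training_text` are hypothetical names, and the `"</s>"` string stands in for `tokenizer.eos_token`.

```python
PROMPT_TEMPLATE = "Instruction: {instruction} Input: {input_text} Output:"

def build_prompt(instruction, input_text):
    """Prompt used at inference time (everything before the answer)."""
    return PROMPT_TEMPLATE.format(instruction=instruction, input_text=input_text)

def build_training_text(instruction, input_text, output, eos_token):
    """Training example: the same prompt, then the target and the EOS token."""
    return build_prompt(instruction, input_text) + " " + output + eos_token

# Placeholder values; "</s>" stands in for tokenizer.eos_token
example = build_training_text("Translate", "hello", "<target text>", "</s>")
assert example.startswith(build_prompt("Translate", "hello"))
assert example.endswith("</s>")
```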

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth import FastLanguageModel
from datasets import Dataset

# LoRA model configuration. The adapters were already attached in the earlier cell,
# so Unsloth skips this call (see the "Already have LoRA adapters" message below).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # You can adjust this value (suggested: 8, 16, 32, 64, 128)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # 0 is optimized
    bias="none",  # "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for long context
    random_state=3407,
    use_rslora=False,  # Rank-stabilized LoRA is supported
    loftq_config=None  # LoftQ is supported
)

# Make sure to select the 'train' split of the dataset
train_dataset = dataset['train']

# Function that tokenizes the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=max_seq_length)

# Tokenize the dataset
train_dataset = train_dataset.map(tokenize_function, batched=True)

# Function that adds the 'labels' column to the tokenized dataset
def add_labels(examples):
    examples['labels'] = examples['input_ids']  # Set 'labels' equal to 'input_ids'
    return examples

# Apply the function to add the 'labels' column
train_dataset = train_dataset.map(add_labels, batched=True)

# Data collator configuration for sequences
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
# Trainer configuration
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # Use the 'train' split with 'labels'
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=data_collator,  # Use the collator defined above
    dataset_num_proc=2,
    packing=False,  # packing=True can make training 5x faster for short sequences
    args=TrainingArguments(
        per_device_train_batch_size=32,
        gradient_accumulation_steps=2,
        warmup_steps=5,
        num_train_epochs=2,  # Adjust the number of epochs here
        #max_steps=5000,
        learning_rate=0.001,
        fp16= not is_bfloat16_supported(),
        bf16= is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Use this for WandB, TensorBoard, etc.
    ),
)

Unsloth: Already have LoRA adapters! We shall skip this step.


Map:   0%|          | 0/21084 [00:00<?, ? examples/s]

Map:   0%|          | 0/21084 [00:00<?, ? examples/s]
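
The cell above copies `input_ids` into `labels` and leaves padding to `DataCollatorForSeq2Seq`, which by default pads labels with `-100` so that padded positions are ignored by the loss. The following is a minimal pure-Python sketch of that padding idea, assuming right-padding; `pad_batch` is a hypothetical helper written for illustration, not the collator's actual code.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def pad_batch(batch, pad_token_id):
    """Right-pad input_ids with pad_token_id and labels with IGNORE_INDEX."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        n_pad = max_len - len(ex["input_ids"])
        padded["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        padded["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        padded["labels"].append(ex["labels"] + [IGNORE_INDEX] * n_pad)
    return padded

batch = [
    {"input_ids": [5, 6, 7], "labels": [5, 6, 7]},
    {"input_ids": [8, 9], "labels": [8, 9]},
]
out = pad_batch(batch, pad_token_id=0)
print(out["labels"])  # [[5, 6, 7], [8, 9, -100]]
```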

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 21,084 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 32 | Gradient Accumulation steps = 2
\        /    Total batch size = 64 | Total steps = 658
 "-____-"     Number of trainable parameters = 11,272,192


**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers and Unsloth!


| Step | Training Loss |
|------|---------------|
| 1 | 4.8453 |
| 2 | 4.7897 |
| 3 | 4.6444 |
| 4 | 4.4775 |
| 5 | 4.2882 |
| 6 | 4.0312 |
| 7 | 4.0117 |
| 8 | 3.889 |
| 9 | 3.674 |
| 10 | 3.7124 |
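
The "Total steps = 658" figure in the training banner can be reproduced with a little arithmetic. The sketch below assumes that a trailing partial gradient-accumulation group at the end of an epoch is not counted as an optimizer step, which is our understanding of the Trainer behaviour in this transformers version; `total_optimizer_steps` is a hypothetical helper, not a library function.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer steps for a run (illustrative arithmetic only)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # Assumption: a trailing partial accumulation group is not counted as a step
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)

# 21,084 examples, batch size 32, accumulation 2, 2 epochs
print(total_optimizer_steps(21084, 32, 2, 2))  # 658
```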


In [10]:
model.save_pretrained("modelo")
tokenizer.save_pretrained("modelo")

('modelo/tokenizer_config.json',
 'modelo/special_tokens_map.json',
 'modelo/tokenizer.json')

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

The Unsloth template suggests `min_p = 0.1` and `temperature = 1.5` (read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why). The cells below instead sample with `temperature = 0.7` and `top_p = 0.9`.
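
To give a feel for the sampling settings in this section (temperature, `top_p`, `min_p`), here is a small pure-Python sketch of what each one does to a next-token distribution. It is an illustration of the general techniques, not the `transformers` implementation; all function names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Higher temperature flattens the distribution, lower sharpens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_keep(probs, top_p):
    """Indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return sorted(kept)

def min_p_keep(probs, min_p):
    """Indices whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

probs = [0.6, 0.25, 0.1, 0.04, 0.01]
print(top_p_keep(probs, 0.9))  # [0, 1, 2]
print(min_p_keep(probs, 0.1))  # [0, 1, 2]
```

In both cases sampling then continues over the kept tokens only, after renormalizing their probabilities.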

In [13]:
def format_prompt(instruction, input_text):
    return f"Instruction: {instruction} Input: {input_text} Output:"

In [14]:
def tokenize_prompt(prompt, tokenizer, max_seq_length):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_seq_length)
    return inputs


In [15]:
import torch

def generate_response(prompt, model, tokenizer, max_seq_length=256, max_new_tokens=50):
    # Format the prompt
    formatted_prompt = format_prompt(prompt["instruction"], prompt["input"])

    # Tokenize the prompt
    # Note: the inputs stay on the CPU here; the redefinition in the next cell moves them to the model's device
    inputs = tokenize_prompt(formatted_prompt, tokenizer, max_seq_length)

    # Pass the prompt to the model and generate a response
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # Maximum number of tokens to generate
        do_sample=True,  # Sample randomly for more diversity
        temperature=0.7,  # Controls how creative the response is
        top_p=0.9,  # Nucleus sampling threshold
        eos_token_id=tokenizer.eos_token_id  # ID of the end-of-sequence token
    )

    # Decode the generated sequence
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Return the decoded text (it still includes the prompt)
    return response


In [18]:
from unsloth import FastLanguageModel

# Prepare the model for inference
model = FastLanguageModel.for_inference(model)

def format_prompt(instruction, input_text):
    return f"Instruction: {instruction} Input: {input_text} Output:"

def tokenize_prompt(prompt, tokenizer, max_seq_length):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_seq_length)
    return inputs

import torch


def generate_response(prompt, model, tokenizer, max_seq_length=256, max_new_tokens=50):
    # Pick the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Format the prompt
    formatted_prompt = format_prompt(prompt["instruction"], prompt["input"])

    # Tokenize the prompt
    inputs = tokenize_prompt(formatted_prompt, tokenizer, max_seq_length)

    # Move the inputs to the device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Pass the prompt to the model and generate a response
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id
    )

    # Decode the generated sequence (prompt included)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response


# Example prompt (kept in Spanish, the language of the training data)
prompt_example = {
    "instruction": "Traduce el siguiente texto a Nahuatl",
    "input": "Dame un pedazo de ese chocolate amigo"
}

# Generate a response
response = generate_response(prompt_example, model, tokenizer)
print(f"Respuesta del modelo: {response}")


Respuesta del modelo: Instruction: Traduce el siguiente texto a Nahuatl Input: Dame un pedazo de ese chocolate amigo Output: Xinehualti notlatl axoxa notlatl axoxa notlatl axoxa notlatl axoxa notlatl axoxa notlatl axoxa notlatl axoxa notlatl
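
As the output above shows, the decoded text still starts with the prompt. If only the continuation is wanted, one simple option is to strip the prompt prefix after decoding. The helper below is a hypothetical sketch with placeholder strings, not part of the original notebook.

```python
def strip_prompt(decoded_text, prompt_text):
    """Return only the continuation when the decoded text starts with the prompt."""
    if decoded_text.startswith(prompt_text):
        return decoded_text[len(prompt_text):].strip()
    return decoded_text  # leave untouched if the prefix does not match exactly

prompt_text = "Instruction: Translate Input: hello Output:"
decoded = prompt_text + " <model continuation>"
print(strip_prompt(decoded, prompt_text))  # <model continuation>
```

An alternative that avoids string matching is to slice off the prompt tokens before decoding, for example `outputs[0][inputs["input_ids"].shape[1]:]`, assuming the generated sequence begins with the unmodified input ids.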


In [22]:
import torch

def get_embeddings(prompt, model, tokenizer, max_seq_length=256):
    # Pick the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Format and tokenize the prompt
    formatted_prompt = format_prompt(prompt["instruction"], prompt["input"])
    inputs = tokenize_prompt(formatted_prompt, tokenizer, max_seq_length)

    # Move the inputs to the device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Run a forward pass without generating text
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # Take the hidden states of the last layer as token embeddings
    embeddings = outputs.hidden_states[-1]

    # Average the embeddings over the sequence dimension (optional)
    averaged_embeddings = embeddings.mean(dim=1)

    # Convert to float32 before handing the tensor to NumPy
    return averaged_embeddings.cpu().float().numpy()

# Get the embeddings of the example prompt
embeddings = get_embeddings(prompt_example, model, tokenizer)
print(f"Embeddings del prompt: {embeddings}")


Embeddings del prompt: [[-1.78125    -0.1484375   0.6640625  ... -0.5703125   0.48046875
  -1.625     ]]
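
The `mean(dim=1)` pooling used here averages over every position, which is fine for a single unpadded prompt. If prompts were ever batched with padding, the padded positions would need to be excluded from the average. Below is a minimal pure-Python sketch of such a masked mean over one sequence; `masked_mean` is a hypothetical helper and the numbers are toy values.

```python
def masked_mean(hidden_states, attention_mask):
    """Average token vectors, skipping positions whose mask is 0."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]

tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
print(masked_mean(tokens, [1, 1, 0]))  # [2.0, 3.0]
```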


In [23]:
import torch

def get_output_embeddings(prompt, model, tokenizer, max_seq_length=256, max_new_tokens=50):
    # Pick the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Format the initial prompt
    formatted_prompt = format_prompt(prompt["instruction"], prompt["input"])
    inputs = tokenize_prompt(formatted_prompt, tokenizer, max_seq_length)

    # Move the inputs to the device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate the output with the model
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            eos_token_id=tokenizer.eos_token_id
        )

    # Decode the generated sequence to text (prompt plus continuation)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Re-tokenize the generated text to compute its embeddings
    output_inputs = tokenizer(generated_text, return_tensors="pt", truncation=True, max_length=max_seq_length)

    # Move the tokenized output to the device
    output_inputs = {k: v.to(device) for k, v in output_inputs.items()}

    # Run the tokenized output through the model to get the last-layer hidden states
    with torch.no_grad():
        output_embeddings = model(**output_inputs, output_hidden_states=True).hidden_states[-1]

    # Average the embeddings over the sequence dimension (optional)
    averaged_output_embeddings = output_embeddings.mean(dim=1)

    # Convert to float32 before handing the tensor to NumPy
    return averaged_output_embeddings.cpu().float().numpy(), generated_text

# Get the embeddings of the generated output
output_embeddings, generated_text = get_output_embeddings(prompt_example, model, tokenizer)

print(f"Texto generado: {generated_text}")
print(f"Embeddings del output: {output_embeddings}")


Texto generado: Instruction: Traduce el siguiente texto a Nahuatl Input: Dame un pedazo de ese chocolate amigo Output: Xinehuetzti quin tletl inin tlacualli quipia noca cuicatl amoné Nahuatl  Xinehuetzti quin tletl inin tlacualli qu
Embeddings del output: [[-1.3203125  -0.51171875 -0.9609375  ...  0.50390625 -0.48828125
  -2.359375  ]]


In [29]:
import pandas as pd
import torch

def process_dataset(dataset_path, model, tokenizer, max_seq_length=256, max_new_tokens=50, output_csv='output_embeddings.csv'):
    # Load the dataset and select the first 200 rows
    dataset = pd.read_csv(dataset_path)
    selected_inputs = dataset.head(200)

    # List that collects the output embeddings
    all_output_embeddings = []

    # Iterate over the first 200 rows of the dataset
    for index, row in selected_inputs.iterrows():
        prompt = {
            "instruction": row.get("instruction", "Translate to Nahuatl"),  # Adjust to match the instruction column in the CSV
            "input": row.get("input", "")
        }

        # Generate text and compute the embeddings of the output
        output_embeddings, generated_text = get_output_embeddings(prompt, model, tokenizer, max_seq_length, max_new_tokens)

        # Add the embeddings to the list
        all_output_embeddings.append(output_embeddings.flatten())

        print(f"Processed input {index + 1}: {generated_text}")

    # Save the output embeddings to a new CSV file
    output_df = pd.DataFrame(all_output_embeddings)
    output_df.to_csv(output_csv, index=False, header=False)
    print(f"Embeddings guardados en {output_csv}")

# Run the function to process the dataset and save the embeddings
process_dataset('dataset.csv', model, tokenizer)


Processed input 1: Instruction: "Traduce el siguiente texto a Nahuatl" Input: "Y así cuando hizo su ofrenda de fuego se sienta delante de los demás y una persona se queda junto a él" Output: "Ihuan ihcuac ye otlaneci in tlatoli oncan motlalito in occequih tlacatl ica quipiaya ipan ihuan"
Processed input 2: Instruction: "Traduce el siguiente texto a Nahuatl" Input: "Si es jade si es oro acaso no tendrá que ir allá" Output: "Tlen cuauhxochimeh tlen cuauhquihtozme acan ye tlapatihui"
Processed input 3: Instruction: "Traduce el siguiente texto a Nahuatl" Input: "Y cuando el Sol estuvo solo en el cielo enseguida comenzó a amarillear y fue oscureciendo poco a poco hasta que el Sol desapareció cuando frente a él fue a colocarse la Luna alcanzando a cubrir el disco del Sol y así lentamente desapareció el Sol" Output: "Auh in ihcuac in tonatiuh zan ye huitze quixohuac niman ye opeuh in ye huitze huel onpa ye huitze yn ihcuac in tonatiuh ye
Processed input 4: Instruction: "Traduce el siguiente t

In [30]:
import pandas as pd
import torch

def get_input_embeddings(prompt, model, tokenizer, max_seq_length=256):
    # Pick the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Tokenize the prompt
    inputs = tokenize_prompt(prompt, tokenizer, max_seq_length)

    # Move the inputs to the device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Run a forward pass without generating text
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # Take the hidden states of the last layer as token embeddings
    embeddings = outputs.hidden_states[-1]

    # Average the embeddings over the sequence dimension (optional)
    averaged_embeddings = embeddings.mean(dim=1)

    # Convert to float32 before handing the tensor to NumPy
    return averaged_embeddings.cpu().float().numpy()

def process_input_embeddings(dataset_path, model, tokenizer, max_seq_length=256, output_csv='input_embeddings.csv'):
    # Load the dataset and select the third column
    # (the Nahuatl text, judging by the rows printed below)
    dataset = pd.read_csv(dataset_path)
    third_column = dataset.iloc[:, 2]  # Select the third column

    # List that collects the input embeddings
    all_input_embeddings = []

    # Iterate over the first 200 entries of the third column
    for index, input_text in enumerate(third_column.head(200)):
        prompt = input_text  # Use the text of the third column as the prompt

        # Get the embeddings of the input
        input_embeddings = get_input_embeddings(prompt, model, tokenizer, max_seq_length)

        # Add the embeddings to the list
        all_input_embeddings.append(input_embeddings.flatten())

        print(f"Processed input {index + 1}: {prompt}")

    # Save the input embeddings to a new CSV file
    input_df = pd.DataFrame(all_input_embeddings)
    input_df.to_csv(output_csv, index=False, header=False)
    print(f"Input embeddings guardados en {output_csv}")

# Run the function to process the input embeddings of the third column
process_input_embeddings('dataset.csv', model, tokenizer)


Processed input 1: "Auh in ye yuhqui in on tlenamacac niman ye ic teixpan on motlalia ce tlacatl itech mocaua"
Processed input 2: "In chalchihuitl teocuitlatl mach ah ca on yaz"
Processed input 3: "Auh yn oyuh in yoca hualmotlalli tonatiuh ylhuicatitech niman yc peuh yn huel ye tlacoçahuia çan ihuiantzin ye tlayohuatiuh ynic ye poliuhtiuh tonatiuh ynic huel ixpan  147 ye yatiuh ynic huel ixpan ye onmomana metztli huel cacitimoman ynic yahualtic tonatiuh ynic quixtzacuilli y çan ihuiantzin huel onpolihuico tonatiuh"
Processed input 4: "Yn oncan mohuicatza yhuan yn ciudad cabildo tlaca yhuan oydores Audiencia Real tlacan quinepantlahuitiaque yn tlahtocacorona real quinapalotia ce tlacatl coxín ypan mantia çan ye ynmamanian yn mantiaque"
Processed input 5: "Kualtia"
Processed input 6: "Auh ynin ca huel yehuatl quimocenteotia  31r yn huey tlacatecolotl ma cenca quinmicnelli ma quinpalehui ma quinmaquixti ynic amo mochtin quinmictizque ynic amo quincenpopolozque ma cana occeccan quinhuica m

In [31]:
import pandas as pd
from scipy.spatial.distance import cosine

def calcular_promedio_distancia_coseno(embeddings_csv1, embeddings_csv2):
    # Load the two CSV files of embeddings
    embeddings1 = pd.read_csv(embeddings_csv1, header=None)
    embeddings2 = pd.read_csv(embeddings_csv2, header=None)

    # Compare only as many rows as both files have
    num_filas = min(len(embeddings1), len(embeddings2))

    # List that collects the cosine distances
    distancias_coseno = []

    # Iterate over the rows and compute the cosine distance
    for i in range(num_filas):
        vector1 = embeddings1.iloc[i].values
        vector2 = embeddings2.iloc[i].values

        # Compute the cosine distance between the two vectors
        distancia = cosine(vector1, vector2)

        # Add the distance to the list
        distancias_coseno.append(distancia)

    # Compute the average of the cosine distances
    promedio_distancia = sum(distancias_coseno) / len(distancias_coseno)

    # Print the average
    print(f"El promedio de la distancia coseno es: {promedio_distancia}")

# Run the function to compute the average cosine distance
calcular_promedio_distancia_coseno('output_embeddings.csv', 'input_embeddings.csv')


El promedio de la distancia coseno es: 0.22025931930152842
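
For reference, the cosine distance averaged above is one minus the cosine similarity of the two vectors, so 0 means the vectors point the same way and 1 means they are orthogonal. A minimal pure-Python sketch of the same formula, with toy vectors and a hypothetical function name:

```python
import math

def cosine_distance(u, v):
    """1 - (u . v) / (|u| * |v|), the same quantity scipy's `cosine` returns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal vectors)
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # close to 0 (parallel vectors)
```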
