# Text-to-Text Semantic Matching with AutoMM



Computing the similarity between two sentences or passages is a common NLP task with many practical applications, such as web search, question answering, document deduplication, plagiarism detection, natural language inference, and recommendation engines. In general, a text similarity model takes two sentences or passages as input and transforms them into vectors. A similarity score computed with cosine similarity, dot product, or Euclidean distance then measures how alike or different the two pieces of text are.
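
To make the three scoring functions concrete, here is a minimal sketch in plain Python. The two toy vectors are made up for illustration and stand in for the embeddings a real model would produce; nothing in this snippet is specific to AutoMM.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product of the L2-normalized vectors; ranges from -1 to 1."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean_distance(u, v):
    """Straight-line distance between the vectors; 0 means they are identical."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical 3-dimensional "embeddings" (real models use hundreds of dimensions).
vec_a = [1.0, 2.0, 2.0]
vec_b = [2.0, 4.0, 4.0]  # same direction as vec_a, twice the length

print(dot(vec_a, vec_b))                 # 18.0
print(cosine_similarity(vec_a, vec_b))   # 1.0
print(euclidean_distance(vec_a, vec_b))  # 3.0
```

Cosine similarity ignores vector length, so it reports a perfect match for these two vectors, while the dot product and Euclidean distance are both sensitive to the difference in magnitude.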

## Prepare your Data
In this tutorial, we demonstrate how to use AutoMM for text-to-text semantic matching with the Stanford Natural Language Inference ([SNLI](https://nlp.stanford.edu/projects/snli/)) corpus. SNLI contains around 570k human-written sentence pairs labeled *entailment*, *contradiction*, or *neutral*. It is a widely used benchmark for evaluating the representation and inference capability of machine learning methods. The following table contains three examples taken from this corpus.

| Premise                                                   | Hypothesis                                                           | Label         |
|-----------------------------------------------------------|----------------------------------------------------------------------|---------------|
| A black race car starts up in front of a crowd of people. | A man is driving down a lonely road.                                 | contradiction |
| An older and younger man smiling.                         | Two men are smiling and laughing at the cats playing on the floor.   | neutral       |
| A soccer game with multiple males playing.                | Some men are playing a sport.                                        | entailment    |

Here, we treat sentence pairs labeled *entailment* as positive pairs (label 1) and those labeled *contradiction* as negative pairs (label 0). Sentence pairs with a neutral relationship are discarded. The following code downloads and loads the corpus into dataframes.

In [None]:
!pip install autogluon.multimodal

Collecting autogluon.multimodal
  Downloading autogluon.multimodal-1.1.1-py3-none-any.whl.metadata (12 kB)
Collecting scipy<1.13,>=1.5.4 (from autogluon.multimodal)
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting boto3<2,>=1.10 (from autogluon.multimodal)
  Downloading boto3-1.35.22-py3-none-any.whl.metadata (6.6 kB)
Collecting torch<2.4,>=2.2 (from autogluon.multimodal)
  Downloading torch-2.3.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting lightning<2.4,>=2.2 (from autogluon.multimodal)
  Downloading lightning-2.3.3-py3-none-any.whl.metadata (35 kB)
Collecting transformers<4.41.0,>=4.38.0 (from transformers[sentencepiece]<4.41.0,>=4.38.0->autogluon.multimodal)
  Downloading transformers-4.40.2-py3-none-any.whl.metadata (137 kB)

In [None]:
from autogluon.core.utils.loaders import load_pd
import pandas as pd

snli_train = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/snli/snli_train.csv', delimiter="|")
snli_test = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/snli/snli_test.csv', delimiter="|")
snli_train.head()

| Unnamed: 0 | premise                                           | hypothesis                                      | label |
|------------|---------------------------------------------------|-------------------------------------------------|-------|
| 0          | A person on a horse jumps over a broken down a... | A person is at a diner , ordering an omelette . | 0     |
| 1          | A person on a horse jumps over a broken down a... | A person is outdoors , on a horse .             | 1     |
| 2          | Children smiling and waving at camera             | There are children present                      | 1     |
| 3          | Children smiling and waving at camera             | The kids are frowning                           | 0     |
| 4          | A boy is jumping on skateboard in the middle o... | The boy skates down the sidewalk .              | 0     |
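
The `label` column above already holds 0/1 values, so the labeling rule described earlier (entailment becomes 1, contradiction becomes 0, neutral pairs are dropped) has been applied before these CSV files were published. As a rough sketch of how such a conversion might look on raw SNLI-style data, assume a hypothetical `raw` dataframe whose `label` column holds the three string labels; the rows below are the three examples from the table earlier in this section.

```python
import pandas as pd

# Hypothetical raw data with string labels (not the format of the CSV files loaded above).
raw = pd.DataFrame({
    "premise": [
        "A black race car starts up in front of a crowd of people.",
        "An older and younger man smiling.",
        "A soccer game with multiple males playing.",
    ],
    "hypothesis": [
        "A man is driving down a lonely road.",
        "Two men are smiling and laughing at the cats playing on the floor.",
        "Some men are playing a sport.",
    ],
    "label": ["contradiction", "neutral", "entailment"],
})

label_map = {"entailment": 1, "contradiction": 0}

# Keep only entailment/contradiction rows, then map the strings to integers.
binary = raw[raw["label"].isin(list(label_map))].copy()
binary["label"] = binary["label"].map(label_map)
binary = binary.reset_index(drop=True)

print(binary[["hypothesis", "label"]])
```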


In [None]:
!pip uninstall torch torchvision torchaudio -y

Found existing installation: torch 2.3.1
Uninstalling torch-2.3.1:
  Successfully uninstalled torch-2.3.1
Found existing installation: torchvision 0.18.1
Uninstalling torchvision-0.18.1:
  Successfully uninstalled torchvision-0.18.1
Found existing installation: torchaudio 2.4.0+cu121
Uninstalling torchaudio-2.4.0+cu121:
  Successfully uninstalled torchaudio-2.4.0+cu121


In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Looking in indexes: https://download.pytorch.org/whl/cpu
Collecting torch
  Downloading https://download.pytorch.org/whl/cpu/torch-2.4.1%2Bcpu-cp310-cp310-linux_x86_64.whl (194.9 MB)
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cpu/torchvision-0.19.1%2Bcpu-cp310-cp310-linux_x86_64.whl (1.6 MB)
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/cpu/torchaudio-2.4.1%2Bcpu-cp310-cp310-linux_x86_64.whl (1.7 MB)
Installing collected packages: torch, torchvision, torchaudio
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. T

In [None]:
from autogluon.multimodal import MultiModalPredictor

# Initialize the model
predictor = MultiModalPredictor(
    problem_type="text_similarity",
    query="premise",        # column name of the first sentence
    response="hypothesis",  # column name of the second sentence
    label="label",          # label column name
    match_label=1,          # label value indicating that query and response have the same semantic meaning
    eval_metric="auc",      # evaluation metric
)

# Fit the model; time_limit is given in seconds
predictor.fit(
    train_data=snli_train,
    time_limit=180,
)

No path specified. Models will be saved in: "AutogluonModels/ag-20240919_013302"
AutoGluon Version:  1.1.1
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          2
Pytorch Version:    2.4.1+cpu
CUDA Version:       CUDA is not available
Memory Avail:       10.50 GB / 12.67 GB (82.9%)
Disk Space Avail:   67.26 GB / 107.72 GB (62.4%)
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdi
    ```


GPU Count: 0
GPU Count to be Used: 0

INFO: GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: 
  | Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | query_model       | HFAutoModelForTextPrediction | 33.4 M | train
1 | response_model    | HFAutoModelForTextPrediction | 33.4 M | train
2 | validation_metric | BinaryAUROC                  | 0      | train
3 | loss_func         | ContrastiveLoss              | 0      | train
4 | miner_func        | PairMarginMiner              | 0      | train
---------------------------------------------------------------------------
33.4 M    Trainable params
0         Non-trainable params
33.4 M    Total params
133.440   Total estimated model params size (MB)



INFO: Time limit reached. Elapsed time is 0:03:01. Signaling Trainer to stop.



INFO: Epoch 0, global step 7: 'val_roc_auc' reached 0.88325 (best 0.88325), saving model to '/content/AutogluonModels/ag-20240919_013302/epoch=0-step=7.ckpt' as top 3
Start to fuse 1 checkpoints via the greedy soup algorithm.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/content/AutogluonModels/ag-20240919_013302")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).




<autogluon.multimodal.predictor.MultiModalPredictor at 0x7e5ca9f937c0>

In [None]:
score = predictor.evaluate(snli_test)
print("evaluation score: ", score)


evaluation score:  {'roc_auc': 0.8977426699305903}


In [None]:
pred_data = pd.DataFrame.from_dict({
    "premise": ["The teacher gave his speech to an empty room."],
    "hypothesis": ["There was almost nobody when the professor was talking."],
})

# Predict whether the two sentences match (1) or not (0)
predictions = predictor.predict(pred_data)
print("Predicted label:", predictions[0])

Predicted label: 1


In [None]:
probabilities = predictor.predict_proba(pred_data)
print(probabilities)


          0         1
0  0.207859  0.792141
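
`predict_proba` returns one column per class, and the predicted label of 1 is consistent with column `1` holding the larger probability. If a different operating point is needed, for example to favor precision over recall, the match probability can be thresholded by hand. The sketch below rebuilds the output shown above as a plain dataframe so that it runs on its own; `is_match` is a hypothetical helper, and the 0.9 threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Stand-in for the predict_proba output shown above (columns are the class labels 0 and 1).
probabilities = pd.DataFrame({0: [0.207859], 1: [0.792141]})

def is_match(proba, threshold=0.5, match_label=1):
    """Return a 0/1 Series: 1 where the match-class probability reaches the threshold."""
    return (proba[match_label] >= threshold).astype(int)

print(is_match(probabilities).tolist())                 # [1]
print(is_match(probabilities, threshold=0.9).tolist())  # [0]
```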


In [None]:
embeddings_1 = predictor.extract_embedding({"premise":["The teacher gave his speech to an empty room."]})
print(embeddings_1.shape)
embeddings_2 = predictor.extract_embedding({"hypothesis":["There was almost nobody when the professor was talking."]})
print(embeddings_2.shape)


(1, 384)



(1, 384)
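
Embeddings like these can be compared directly, which is what makes applications such as search or deduplication practical: embed every candidate once, then score a query against all of them. The following is a minimal sketch using NumPy. The random arrays are placeholders with the same 384-dimensional shape as the output above, not real model embeddings, and `rank_by_cosine` is a hypothetical helper rather than part of AutoMM.

```python
import numpy as np

def rank_by_cosine(query, candidates):
    """Rank candidates by cosine similarity to the query.

    query: array of shape (1, d); candidates: array of shape (n, d).
    Returns (indices sorted from most to least similar, similarity scores).
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = (c @ q.T).ravel()   # cosine similarity of each candidate with the query
    order = np.argsort(-scores)  # descending order of similarity
    return order, scores

rng = np.random.default_rng(0)
query_emb = rng.normal(size=(1, 384))       # placeholder for a premise embedding
candidate_embs = rng.normal(size=(5, 384))  # placeholders for hypothesis embeddings
candidate_embs[3] = 2.0 * query_emb[0]      # make candidate 3 point in the query's direction

order, scores = rank_by_cosine(query_emb, candidate_embs)
print(order[0])                    # 3
print(round(float(scores[3]), 6))  # 1.0
```

In a real workflow, `query_emb` and `candidate_embs` would come from `predictor.extract_embedding` calls like the ones above.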
