# Multi-Modal: Natural Language Binary Classification

A common kind of data set contains **just** a natural language feature and a binary label. Examples include sentiment analysis (positive or negative sentiment) or, as in this case, deciding whether an email is "ham" (legitimate) or "spam" (unsolicited).

With a single column of natural language text as our only feature, it makes sense to use the MultiModalPredictor from AutoGluon.

Our data set is sourced from here:
https://www.kaggle.com/datasets/venky73/spam-mails-dataset

## Imports

In [None]:
!pip install autogluon

In [1]:
from autogluon.multimodal import MultiModalPredictor
from autogluon.tabular import TabularDataset

In [2]:
data = TabularDataset("/content/spam_ham_dataset.csv")

In [3]:
data.head()

| Unnamed: 0 | label | text |
|---|---|---|
| 0 | ham | Subject: enron methanol ; meter # : 988291\r\n... |
| 1 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... |
| 2 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... |
| 3 | spam | Subject: photoshop , windows , office . cheap ... |
| 4 | ham | Subject: re : indian springs\r\nthis deal is t... |


In [4]:
data['label'].value_counts()

ham     3672
spam    1499
Name: label, dtype: int64
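
The classes are imbalanced, so accuracy on its own can look flattering. As a rough reference point, the counts above tell us what a trivial classifier that always predicts the majority class would score. This is a minimal sketch in plain Python; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 3672, "spam": 1499}

total = sum(counts.values())
# Share of each class in the full data set
shares = {label: n / total for label, n in counts.items()}
# Accuracy of a trivial model that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"total emails: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"always-'{majority_label}' baseline accuracy: {baseline_accuracy:.3f}")
```

Always predicting "ham" would be right about 71% of the time, so any trained model should be judged against that figure rather than against 50%.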

In [5]:
print(data.iloc[4]['label'])
print(data.iloc[4]['text'])

ham
Subject: re : indian springs
this deal is to book the teco pvr revenue . it is my understanding that teco
just sends us a check , i haven ' t received an answer as to whether there is a
predermined price associated with this deal or if teco just lets us know what
we are giving . i can continue to chase this deal down if you need .


---

In [6]:
print(data.iloc[3]['label'])
print(data.iloc[3]["text"])

spam
Subject: photoshop , windows , office . cheap . main trending
abasements darer prudently fortuitous undergone
lighthearted charm orinoco taster
railroad affluent pornographic cuvier
irvin parkhouse blameworthy chlorophyll
robed diagrammatic fogarty clears bayda
inconveniencing managing represented smartness hashish
academies shareholders unload badness
danielson pure caffein
spaniard chargeable levin



## Train Test Split

In [7]:
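# Use 80% of the rows for training and hold out the rest for testing
train_size = int(len(data) * 0.8)
seed = 42
# Randomly sample the training rows, then keep every remaining row as test data
train_data = data.sample(train_size, random_state=seed)
test_data = data.drop(train_data.index)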
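
The split above is purely random, so the ham/spam ratio in the test set can drift slightly from that of the full data set. If that matters, one option is a stratified split. The following is a minimal plain-Python sketch of the idea, not what this notebook does: `stratified_split` is a hypothetical helper, and the toy labels only mirror the class counts shown earlier.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=42):
    """Return (train_idx, test_idx), splitting each label group by train_fraction."""
    rng = random.Random(seed)
    # Group row positions by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = int(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with the same class counts as the data set (3672 ham, 1499 spam)
labels = ["ham"] * 3672 + ["spam"] * 1499
train_idx, test_idx = stratified_split(labels)

spam_share_test = sum(labels[i] == "spam" for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx))
print(f"spam share in test set: {spam_share_test:.3f}")
```

The returned positions could then be used to select rows, for example with `data.iloc[train_idx]` and `data.iloc[test_idx]`.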

## MultiModalPredictor

Let's train a MultiModalPredictor on this dataset. This predictor is a good fit when your only feature is natural language text. Keep in mind that a TabularPredictor is also quite capable; the difference is that with a MultiModalPredictor, AutoGluon will also train a Transformer network (the training log below shows a Hugging Face text model with 108 M parameters).

Note that without a GPU, training might take a while!

In [8]:
save_path="sentiment_analysis"

In [9]:
predictor = MultiModalPredictor(label="label", path=save_path)

In [10]:
# ONLY RUN THIS IF YOU HAVE A LOT OF TIME!
# THIS WILL TAKE HOURS TO RUN ON A CPU!
predictor.fit(train_data)

INFO:lightning_fabric.utilities.seed:Global seed set to 123
AutoMM starts to create your model. ✨

- Model will be saved to "/content/sentiment_analysis".

- Validation metric is "roc_auc".

- To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /content/sentiment_analysis
    ```

Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai



INFO:pytorch_lightning.utilities.rank_zero:Using 16bit None Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 108 M 
1 | validation_metric | AUROC                        | 0     
2 | loss_func         | CrossEntropyLoss             | 0     
-------------------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     T

INFO:pytorch_lightning.utilities.rank_zero:Epoch 0, global step 14: 'val_roc_auc' reached 0.98960 (best 0.98960), saving model to '/content/sentiment_analysis/epoch=0-step=14.ckpt' as top 3


INFO:pytorch_lightning.utilities.rank_zero:Epoch 0, global step 28: 'val_roc_auc' reached 0.99603 (best 0.99603), saving model to '/content/sentiment_analysis/epoch=0-step=28.ckpt' as top 3


INFO:pytorch_lightning.utilities.rank_zero:Epoch 1, global step 43: 'val_roc_auc' reached 0.99867 (best 0.99867), saving model to '/content/sentiment_analysis/epoch=1-step=43.ckpt' as top 3


INFO:pytorch_lightning.utilities.rank_zero:Epoch 1, global step 57: 'val_roc_auc' reached 0.99933 (best 0.99933), saving model to '/content/sentiment_analysis/epoch=1-step=57.ckpt' as top 3


INFO:pytorch_lightning.utilities.rank_zero:Epoch 2, global step 72: 'val_roc_auc' reached 0.99970 (best 0.99970), saving model to '/content/sentiment_analysis/epoch=2-step=72.ckpt' as top 3


INFO:pytorch_lightning.utilities.rank_zero:Epoch 2, global step 86: 'val_roc_auc' reached 0.99958 (best 0.99970), saving model to '/content/sentiment_analysis/epoch=2-step=86.ckpt' as top 3


INFO:pytorch_lightning.utilities.rank_zero:Epoch 3, global step 101: 'val_roc_auc' reached 0.99966 (best 0.99970), saving model to '/content/sentiment_analysis/epoch=3-step=101.ckpt' as top 3


INFO:pytorch_lightning.utilities.rank_zero:Epoch 3, global step 115: 'val_roc_auc' reached 0.99996 (best 0.99996), saving model to '/content/sentiment_analysis/epoch=3-step=115.ckpt' as top 3


INFO:pytorch_lightning.utilities.rank_zero:Epoch 4, global step 130: 'val_roc_auc' reached 1.00000 (best 1.00000), saving model to '/content/sentiment_analysis/epoch=4-step=130.ckpt' as top 3
Start to fuse 3 checkpoints via the greedy soup algorithm.


AutoMM has created your model 🎉🎉🎉

- To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/content/sentiment_analysis")
    ```

- You can open a terminal and launch Tensorboard to visualize the training log:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /content/sentiment_analysis
    ```

- If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub: https://github.com/autogluon/autogluon




<autogluon.multimodal.predictor.MultiModalPredictor at 0x7fd65f28ce80>

## Evaluation

After training, we can easily evaluate our predictor on separate test data formatted similarly to our training data.

In [11]:
test_score = predictor.evaluate(test_data, metrics=['acc', 'f1'])
print(test_score)

{'acc': 0.9922705314009662, 'f1': 0.9872204472843451}
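
To make the two reported metrics concrete, here is a minimal plain-Python sketch of how accuracy and F1 can be computed from true and predicted labels. The helper `accuracy_and_f1` and the tiny label lists are made up for illustration. Treating "spam" as the positive class is an assumption; this is not a description of how AutoGluon computes its scores internally.

```python
def accuracy_and_f1(y_true, y_pred, positive="spam"):
    """Compute accuracy and the F1 score for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Made-up labels purely for illustration, not results from this notebook
y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "spam", "ham", "ham"]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

Unlike accuracy, F1 ignores true negatives, so on an imbalanced data set like this one it is less influenced by the large number of correctly classified "ham" emails.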


## Prediction on New Data

In [12]:
email1 = "Subject: free pills\n Give me your social security number and bank account for free pills"
email2 = "Subject: Work Meeting\n Let me know if you have availability for Friday. Thanks!"

In [13]:
print(email1)

Subject: free pills
 Give me your social security number and bank account for free pills


In [14]:
print(email2)

Subject: Work Meeting
 Let me know if you have availability for Friday. Thanks!


In [15]:
predictions = predictor.predict({'email': [email1, email2]})
print('"email":', email1, '"Predicted Label":', predictions[0])
print('"email":', email2, '"Predicted Label":', predictions[1])

"email": Subject: free pills
 Give me your social security number and bank account for free pills "Predicted Label": spam
"email": Subject: Work Meeting
 Let me know if you have availability for Friday. Thanks! "Predicted Label": ham


For classification tasks, you can ask for predicted class-probabilities instead of predicted classes.

In [16]:
probabilities = predictor.predict_proba({'email': [email1, email2]})
print('"email":', email1, '"Predicted Probabilities":', probabilities[0])
print('"email":', email2, '"Predicted Probabilities":', probabilities[1])

"email": Subject: free pills
 Give me your social security number and bank account for free pills "Predicted Probabilities": [3.2066830e-05 9.9996793e-01]
"email": Subject: Work Meeting
 Let me know if you have availability for Friday. Thanks! "Predicted Probabilities": [9.9997187e-01 2.8188662e-05]
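
Class probabilities are useful when you want to pick your own decision threshold, for instance to flag an email as spam only when the model is very confident. Below is a minimal plain-Python sketch using the two probability rows printed above. The column order (index 0 = "ham", index 1 = "spam") is an assumption inferred from the predicted labels earlier, and `to_label` is a hypothetical helper.

```python
# Probabilities copied from the predict_proba output above
probabilities = [
    [3.2066830e-05, 9.9996793e-01],  # email1
    [9.9997187e-01, 2.8188662e-05],  # email2
]
# Assumed column order, inferred from the predicted labels (not confirmed by the source)
class_names = ["ham", "spam"]


def to_label(probs, spam_threshold=0.5):
    """Label an email as spam only if its spam probability reaches the threshold."""
    spam_probability = probs[class_names.index("spam")]
    return "spam" if spam_probability >= spam_threshold else "ham"


for probs in probabilities:
    # Compare the default threshold with a stricter one
    print(to_label(probs), to_label(probs, spam_threshold=0.99))
```

Raising the threshold reduces the chance of a legitimate email being flagged, at the cost of letting more spam through.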


