# Support Vector Machine

### Text classification with SVM encoded by Sentence Transformers

This notebook trains an SVM text classifier on a custom CSV dataset. The classifier setup follows the [scikit-learn](https://scikit-learn.org/stable/auto_examples/svm/plot_iris_svc.html#sphx-glr-auto-examples-svm-plot-iris-svc-py) SVM example.

In [1]:
%pip install scikit-learn
%pip install pandas
%pip install numpy
%pip install matplotlib
%pip install sentence_transformers
%pip install datasets

Note: you may need to restart the kernel to use updated packages.


## Load the Dataset
1. Read the `dataset.csv` file using pandas
2. Convert the DataFrame into a Hugging Face `Dataset` object (`hf_dataset`)
3. Split `hf_dataset` into a training set and a test set (80/20)

In [1]:
from datasets import Dataset
import pandas as pd

df = pd.read_csv(r"C:\Users\ISS-User1\Desktop\eugene\Glowing-Torch\datasets\dataset.csv", encoding='latin1')

hf_dataset = Dataset.from_pandas(df)
# Hold out 20% of the rows for testing
hf_dataset = hf_dataset.train_test_split(
    test_size=0.2, shuffle=True)

print(hf_dataset)


DatasetDict({
    train: Dataset({
        features: ['product_id', 'name', 'default', 'extract_date', 'category'],
        num_rows: 25352
    })
    test: Dataset({
        features: ['product_id', 'name', 'default', 'extract_date', 'category'],
        num_rows: 6339
    })
})
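
The split above is a plain shuffled split. If some categories are rare, a stratified split keeps the class proportions the same in both parts. The sketch below is an alternative, not what this notebook does: it uses scikit-learn's `train_test_split` on made-up product names and categories rather than the real CSV.

```python
from sklearn.model_selection import train_test_split

# Made-up product names and categories, for illustration only.
names = ["tee", "polo", "tank", "blouse", "shirt",
         "sneaker", "boot", "sandal", "loafer", "heel"]
categories = ["tops"] * 5 + ["shoes"] * 5

# stratify=categories keeps the 50/50 class balance in both splits;
# random_state makes the split reproducible.
train_names, test_names, train_cats, test_cats = train_test_split(
    names, categories, test_size=0.2, stratify=categories, random_state=42)

print(len(train_names), len(test_names))
print(sorted(test_cats))
```

With two equally sized classes and a test share of 20%, each class contributes one item to the test split.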


## Load the embedding model

The sentence transformer maps each input string to a 384-dimensional embedding vector, which is then used as the feature vector for classification.

In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

sentence = ['Sample sentence to encode', 'And another one']

embedding = model.encode(sentence)
print(embedding.shape)

(2, 384)


## Encode the dataset using the sentence transformer model

In [3]:
X = model.encode(hf_dataset['train']['name'])
y = hf_dataset['train']['category']

## Create an SVM classifier and fit the model

Create a `LinearSVC` classifier and fit it to the training data.

In [4]:
from sklearn.metrics import hinge_loss
from sklearn import svm

svm_clf = svm.LinearSVC(C=1, dual="auto")
svm_clf.fit(X, y)
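
With more than two classes, `LinearSVC` trains one linear scorer per class (one-vs-rest by default), so `decision_function` returns one column per class and `predict` picks the class with the largest score. The sketch below illustrates this on synthetic data; `make_blobs` is only a stand-in for the sentence embeddings, and `dual=False` is chosen here just so the example runs on older scikit-learn versions too.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Synthetic stand-in for the embeddings: 3 clusters in 8 dimensions.
X_toy, y_toy = make_blobs(n_samples=150, centers=3, n_features=8, random_state=0)

toy_clf = LinearSVC(C=1, dual=False)
toy_clf.fit(X_toy, y_toy)

toy_scores = toy_clf.decision_function(X_toy)  # one column of scores per class
toy_pred = toy_clf.predict(X_toy)

# predict() should agree with taking the arg-max over the per-class scores.
argmax_pred = toy_clf.classes_[np.argmax(toy_scores, axis=1)]

print(toy_scores.shape)
print(toy_clf.coef_.shape)
print(np.array_equal(toy_pred, argmax_pred))
```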

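Compute the hinge loss of the fitted model on the training data. A value close to zero means that most training samples are scored with a margin of at least 1.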

In [5]:
decision_scores = svm_clf.decision_function(X)

hinge_loss_value = hinge_loss(y, decision_scores)
print("Hinge loss:", hinge_loss_value)

Hinge loss: 0.009142243299475506
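
To make the number above less opaque, here is a minimal NumPy sketch of a multiclass hinge loss: for each sample, the margin is the score of the true class minus the best score among the other classes, and the loss is the mean of `max(0, 1 - margin)`. To my understanding this matches what `hinge_loss` computes in the multiclass case, but treat that as an assumption. The function name, the toy scores, and the use of integer class indices as labels are all illustrative.

```python
import numpy as np

def multiclass_hinge_loss(y_idx, scores):
    """Mean of max(0, 1 - (true-class score - best other-class score))."""
    scores = np.asarray(scores, dtype=float)
    y_idx = np.asarray(y_idx)
    rows = np.arange(scores.shape[0])

    true_scores = scores[rows, y_idx]

    # Mask out the true class so the max runs over the competing classes only.
    others = scores.copy()
    others[rows, y_idx] = -np.inf
    best_other = others.max(axis=1)

    margins = true_scores - best_other
    return float(np.mean(np.maximum(0.0, 1.0 - margins)))

# Toy decision scores for 3 samples and 3 classes (made-up numbers).
toy_scores = np.array([[2.0, 0.5, -1.0],
                       [0.2, 0.4, 0.1],
                       [-1.0, 0.0, 3.0]])
toy_labels = np.array([0, 1, 2])

toy_loss = multiclass_hinge_loss(toy_labels, toy_scores)
print(toy_loss)
```

Only the second sample has a margin below 1 (0.4 - 0.2 = 0.2), so it alone contributes to the loss, with 0.8, averaged over three samples.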


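## Predicting on new data

The decision scores are passed through a softmax to obtain a normalised score per class. These values are not calibrated probabilities; they only indicate how the classes rank relative to each other.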


In [6]:
import numpy as np

query = ["RIBBED COTTON TOPS", "Crossbody BAG", "Nike river swimsuit"]
encoding = model.encode(query)
print(f"Classes: {svm_clf.classes_}")

def softmax(logits):
    # Subtract the row-wise maximum before exponentiating for numerical stability
    exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    probabilities = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
    return probabilities

# One row of per-class scores for each query
decision_scores = svm_clf.decision_function(encoding)
prediction = svm_clf.predict(encoding)
probabilities = softmax(decision_scores)
for i, name in enumerate(query):
    # Convert to plain floats so the list prints as simple numbers
    probability = [round(float(x), 6) for x in probabilities[i]]
    print(f"Query: {name}")
    print(f"Predicted: {prediction[i]} with {probability}")

Classes: ['accessories' 'beauty' 'bottoms' 'outerwear' 'shoes' 'socks' 'sportswear'
 'swimwear' 'tops' 'underwear']
Query: RIBBED COTTON TOPS
Predicted: tops with [0.004342, 0.011304, 0.012236, 0.001095, 0.001865, 0.000713, 0.001223, 0.001807, 0.961335, 0.00408]
Query: Crossbody BAG
Predicted: accessories with [0.928597, 0.021926, 0.013652, 0.000393, 0.010699, 0.000254, 0.018557, 0.000738, 0.005024, 0.000159]
Query: Nike river swimsuit
Predicted: swimwear with [0.000135, 0.126044, 0.056685, 0.006987, 0.02838, 0.005764, 0.209357, 0.465675, 0.098355, 0.002617]
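
The softmax over raw decision scores gives a convenient ranking, but the values are not calibrated. If calibrated probabilities are needed, one option is to wrap the classifier in scikit-learn's `CalibratedClassifierCV`. The sketch below is an alternative to the cell above, not the notebook's method; it uses synthetic `make_blobs` data as a stand-in for the embeddings, and the choice of `method="sigmoid"` and `cv=3` is arbitrary.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Synthetic stand-in for the embeddings: 3 clusters in 8 dimensions.
X_toy, y_toy = make_blobs(n_samples=300, centers=3, n_features=8, random_state=0)

# Fit LinearSVC inside 3-fold cross-validation and learn a sigmoid
# mapping from decision scores to probabilities on the held-out folds.
calibrated_clf = CalibratedClassifierCV(
    LinearSVC(C=1, dual=False), method="sigmoid", cv=3)
calibrated_clf.fit(X_toy, y_toy)

toy_proba = calibrated_clf.predict_proba(X_toy[:5])
print(calibrated_clf.classes_)
print(np.round(toy_proba, 3))
```

Each row of `predict_proba` sums to 1, with one column per entry in `classes_`.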


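## Evaluate on the test set

Encode the held-out test split and compute the mean accuracy of the classifier on it.

In [11]: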
X_test = model.encode(hf_dataset['test']['name'])
y_test = hf_dataset['test']['category']

score = svm_clf.score(X_test, y_test)
print(score)

0.9960561602776463
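
A single accuracy figure can hide weak classes, so it may be worth looking at per-class results as well. The sketch below shows `confusion_matrix` and `classification_report` on a handful of made-up labels; in the notebook, the inputs would presumably be `y_test` and `svm_clf.predict(X_test)`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels, for illustration only.
toy_true = ["tops", "tops", "shoes", "socks", "shoes", "tops"]
toy_pred = ["tops", "shoes", "shoes", "socks", "shoes", "tops"]
labels = ["shoes", "socks", "tops"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(toy_true, toy_pred, labels=labels)
report = classification_report(toy_true, toy_pred, labels=labels, zero_division=0)

print(cm)
print(report)
```

Here one `tops` item is misclassified as `shoes`, which shows up as the off-diagonal entry in the last row of the matrix.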
