## Zero-shot classification with embeddings

In this notebook we will classify the sentiment of reviews using embeddings and zero labeled data! The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb).

We'll define positive sentiment to be 4- and 5-star reviews, and negative sentiment to be 1- and 2-star reviews. 3-star reviews are considered neutral and we won't use them for this example.

We will perform zero-shot classification by embedding descriptions of each class and then comparing new samples to those class embeddings.

In [1]:
#r "nuget:Microsoft.DotNet.Interactive.AIUtilities, 1.0.0-beta.24129.1"

Loading extension script from `C:\Users\dicolomb\.nuget\packages\microsoft.dotnet.interactive.aiutilities\1.0.0-beta.24054.2\interactive-extensions\dotnet\extension.dib`

In [2]:
using Microsoft.DotNet.Interactive;
using Microsoft.DotNet.Interactive.AIUtilities;

In [3]:
public record DataRow(string ProducIt, string UserId, int Score, string Summary, string Text, int TokenCount, float[] Embedding);

In [4]:
using System.Text.Json;
using System.Text.Json.Serialization;
using System.IO;

var filePath = Path.Combine("..","..","..","Data","fine_food_reviews_with_embeddings_1k.json");

var foodReviewsData = JsonSerializer.Deserialize<DataRow[]>(File.ReadAllText(filePath));

## Zero-Shot Classification
To perform zero-shot classification, we want to predict labels for our samples without any training. To do this, we embed a short description of each label, such as positive and negative, and then compare the cosine distance between each sample embedding and each label embedding.

The label most similar to the sample is the predicted label. We can also define a prediction score as the difference between the cosine distance to the positive label and the cosine distance to the negative label. Sweeping a threshold over this score yields a precision-recall curve, from which a different tradeoff between precision and recall can be chosen.
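To make the scoring rule concrete, here is a minimal sketch of cosine similarity and of the positive-minus-negative prediction score. The helper `CosineSimilarity` and the two-dimensional toy vectors are hypothetical and exist only for illustration; the notebook itself relies on `CosineSimilarityComparer` from the AIUtilities package.

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|). 1 means same direction, 0 means orthogonal.
double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Hypothetical toy embeddings; real embeddings have far more dimensions.
var toyPositive = new float[] { 1f, 0f };
var toyNegative = new float[] { 0f, 1f };
var toySample = new float[] { 3f, 4f };

// Prediction score: similarity to the positive label minus similarity to the negative label.
var toyScore = CosineSimilarity(toySample, toyPositive) - CosineSimilarity(toySample, toyNegative);

// A threshold of 0 picks the closer label; toySample points closer to toyNegative.
var toyPrediction = toyScore > 0 ? "positive" : "negative";
Console.WriteLine(toyPrediction);
```

Raising or lowering the threshold (0 in this sketch) trades precision against recall for the positive class.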

The code defines two public records, `Label` and `Labelleditem`. The `Label` record holds a label's text and its embedding. The `Labelleditem` record holds a review's product ID, summary, text, and score, together with its actual label, its predicted label, and a `Probability` value, which in this code is the cosine similarity between the review and the predicted label.

The `PredictLabels` method is used to predict labels for a given set of data. It takes three parameters: `positiveLabel` and `negativeLabel` which are strings representing the labels for positive and negative sentiments, and `data` which is an enumerable collection of `DataRow` objects representing the data to be classified.

Inside the method, a list of `Label` objects is created, and the method calculates an average embedding for each label. It filters `data` on the `Score` property (4 and above for the positive label, below 4 for the negative label) and passes the matching `Embedding` arrays to `Centroid()`. The label embeddings here are therefore derived from the rated reviews themselves rather than from an embedded text description of each label.

After calculating the average embeddings, the method creates new `Label` objects with the calculated embeddings and adds them to the `labels` list.
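The averaging step can be sketched with plain LINQ. This assumes that `Centroid()` computes an element-wise mean, which is not confirmed here; `MeanVector` and the toy `(Score, Embedding)` tuples are hypothetical stand-ins for the library call and for `DataRow`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Element-wise mean of a set of equal-length vectors.
float[] MeanVector(IEnumerable<float[]> vectors)
{
    var list = vectors.ToList();
    if (list.Count == 0) throw new ArgumentException("At least one vector is required.");
    var mean = new float[list[0].Length];
    foreach (var v in list)
        for (var i = 0; i < mean.Length; i++)
            mean[i] += v[i] / list.Count;
    return mean;
}

// Hypothetical (score, embedding) pairs standing in for DataRow.
var toyReviews = new (int Score, float[] Embedding)[]
{
    (5, new[] { 1f, 0f }),
    (4, new[] { 3f, 2f }),
    (1, new[] { 0f, 4f }),
};

// Same split as the notebook: 4 and above is positive, below 4 is negative.
var positiveCentroid = MeanVector(toyReviews.Where(r => r.Score >= 4).Select(r => r.Embedding));
var negativeCentroid = MeanVector(toyReviews.Where(r => r.Score < 4).Select(r => r.Embedding));

Console.WriteLine(string.Join(", ", positiveCentroid)); // prints "2, 1"
```

Each review is then compared against these two vectors, and the closer one determines the predicted label.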

Finally, the method builds a list of `Labelleditem` objects by iterating over `data`. For each item, it calculates the similarity score against each label in the `labels` list, selects the label with the highest score, and creates a new `Labelleditem` holding the actual label, the predicted label, and that score. The list is then returned.


In [6]:
public record Label(string Text, float[] Embedding);
public record Labelleditem(string ProducIt,string Summary, string Text, float Score, string Label, string PredictedLabel, float Probability);

public List<Labelleditem> PredictLabels(string positiveLabel, string negativeLabel, IEnumerable<DataRow> data){
    var labels = new List<Label>();
   
    // Calculate the average (centroid) embedding for each label.
    // Note that `Score < 4` also places any 3-star reviews in the negative class.

    labels.Add(new Label(positiveLabel, data.Where(d => d.Score >= 4).Select(d => d.Embedding).Centroid()));
    labels.Add(new Label(negativeLabel, data.Where(d => d.Score < 4).Select(d => d.Embedding).Centroid()));
    
    var predictions = data.Select(review => 
    {
        var scoredLabel = labels.ScoreBySimilarityTo(review.Embedding, new CosineSimilarityComparer<float[]>(l => l), l => l.Embedding)
            .OrderByDescending(e => e.Score)
            .First();

        var itemLabel = review.Score < 4 ? negativeLabel : positiveLabel;

        return new Labelleditem(review.ProducIt, review.Summary, review.Text, review.Score, itemLabel, scoredLabel.Value.Text, scoredLabel.Score);
    }).ToList();
    return predictions;
}

In [7]:
var predictions = PredictLabels("positive", "negative", foodReviewsData);

In [8]:
predictions.OrderByDescending(r => r.Score).DisplayTable();

ProducIt,Summary,Text,Score,Label,PredictedLabel,Probability
B003XPF9BO,where does one start...and stop... with a treat like this,Wanted to save some to bring to my Chicago family but my North Carolina family ate all 4 boxes before I could pack. These are excellent...could serve to anyone,5,positive,positive,0.8781883
B001BORBHO,Happy with the product,My dog was suffering with itchy skin. He had been eating Natural Choice brand (cheaper) since he was a puppy. I was nervous to change foods. The vet suggested to change foods sand see if the skin issues cleared up. Wellness brand did the job. My dog seems to love the food and the skin issues cleared up within a few weeks.,5,positive,positive,0.83263487
B008YA1LQK,Blackcat,Great coffee! Love all Green Mountain coffee and all the wonderful flavors. Would and do recommend this coffee to all my friends.,5,positive,positive,0.8887383
B001KP6B98,Excellent product,After scouring every store in town for orange peels and not finding anything satisfactory I turned to the online options.<br /><br /> I received the candied orange peels today and I found exactly what I was looking for. The peels are perfect for the fruit cake I plan to bake. The peels are not crystallized with sugar which is great I like the texture and the taste of the peels and I am gonna order another box soon.,5,positive,positive,0.86872625
B008YA1LQK,Bulk k-Cups,This is the best way to buy coffee for my office. Least expensive way to buy convenience with harder to find flavor and brand. I also buy this way for home.,5,positive,positive,0.90695256
B000H9K4KA,FABULOUS...,Absolutely wonderful. A real licorice taste. No phony baloney here!<br />It has a great flavor. I'd purchase it again for sure.,5,positive,positive,0.8903128
B004QDA8WC,"Exactly what I was looking for: Fast, fantastic Chai!","I was skeptical as to how good an all-in-one Chai Tea for Keurig could be, but my doubts were erased at my first sip! The spice blend is great, as is the sweetening and the dairy component. This is the kind of thing I was hoping I could get from my single-serve coffeemaker!<br /><br />Order some, you will thank yourself every time you brew a cup.",5,positive,positive,0.8873766
B0051C0J6M,Makes me drool just thinking of them,"The Brit's have out done us. The flavor is supreme,they satisfy my hunger for steak and onions...<br />Get them while you can... Their other flavors are great tooo",5,positive,positive,0.87322354
B008JKSJJ2,"Loved these gluten free healthy bars, saved $$ ordering on Amazon",These Kind Bars are so good and healthy & gluten free. My daughter came across them and loves them for a quick snack between her hectic schedule of classes & work. Most times she won't have time to eat a full meal and these are such a great alternative to fast food. I will order again & this time I'll get a few for moi! Really loved the coconut too..,5,positive,positive,0.8839456
B006N3HZ6K,Great bold taste-- compare to Emeril's Bold,"I've been drinking Emeril's Bold for a year and a half, and wanted to try something different. A review led me to this brand, and I love it too! I'm a strong coffee gal-- I like Starbuck's-- so this is right up my alley.",5,positive,positive,0.88369954


In [1]:
#r "nuget: Microsoft.ML, 3.0.0"

First, an instance of `MLContext` is created. `MLContext` is the main entry point for working with ML.NET, providing methods and properties for loading data, creating machine learning models, and more.

Next, a `dataView` is created from the predictions. Before `LoadFromEnumerable` loads the data, the `predictions` collection is projected into an anonymous type with three properties: `Label`, `PredictedLabel`, and `Probability`. `Label` and `PredictedLabel` are set to 1f if the corresponding label is "positive" and to 0f otherwise. `Probability` is copied unchanged from the prediction.

After the data is loaded, the `Evaluate` method of the `BinaryClassification` catalog is called on the `context` object. This method computes various metrics that can be used to evaluate the performance of a binary classification model. The `dataView` is passed as the first argument, and the names of the label and score columns are specified as "Label" and "PredictedLabel", respectively.

Finally, the `Display` method is called on the `metric` object to print the evaluation metrics to the console.


In [10]:
using Microsoft.ML;

var context = new MLContext();
var dataView = context.Data.LoadFromEnumerable(predictions.Select(r => new
{
    // Encode "positive" as 1 and anything else as 0.
    Label = r.Label == "positive" ? 1f : 0f,
    PredictedLabel = r.PredictedLabel == "positive" ? 1f : 0f,
    Probability = r.Probability
}));

var metric = context.BinaryClassification.Evaluate(dataView, labelColumnName: "Label", scoreColumnName: "PredictedLabel");

metric.Display();

index,value
LogLoss,0.769665510965833
LogLossReduction,-0.04766976027719402
Entropy,0.7346451526501956
AreaUnderRocCurve,0.9014862804878049
Accuracy,0.9139784946236559
PositivePrecision,0.9673295454545454
PositiveRecall,0.9227642276422764
NegativePrecision,0.7477876106194691
NegativeRecall,0.8802083333333334
F1Score,0.9445214979195561

index,value
PerClassPrecision,"[ 0.9673295454545454, 0.7477876106194691 ]"
PerClassRecall,"[ 0.9227642276422764, 0.8802083333333334 ]"
Counts,"[ [ 681, 57 ], [ 23, 169 ] ]"
NumberOfClasses,2
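As a sanity check, the headline metrics can be recomputed from the confusion-matrix counts printed above. The sketch below assumes row 0 is the actual-positive class and column 0 the predicted-positive class. That reading is an inference, chosen because it reproduces the reported values. The variable names are illustrative only.

```csharp
using System;

// Counts from the evaluation output (assumed layout: rows are actual, columns are predicted).
double tp = 681, fn = 57;  // actual positive: predicted positive, predicted negative
double fp = 23, tn = 169;  // actual negative: predicted positive, predicted negative

var accuracy = (tp + tn) / (tp + fn + fp + tn);
var positivePrecision = tp / (tp + fp);
var positiveRecall = tp / (tp + fn);
var negativePrecision = tn / (tn + fn);
var negativeRecall = tn / (tn + fp);
var f1Score = 2 * positivePrecision * positiveRecall / (positivePrecision + positiveRecall);

Console.WriteLine($"Accuracy: {accuracy}");
Console.WriteLine($"PositivePrecision: {positivePrecision}, PositiveRecall: {positiveRecall}");
Console.WriteLine($"NegativePrecision: {negativePrecision}, NegativeRecall: {negativeRecall}");
Console.WriteLine($"F1Score: {f1Score}");
```

The negative class has noticeably lower precision (about 0.75) than the positive class (about 0.97), which reflects the 57 positive reviews that were predicted as negative.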
