# Large Language Models and their applications

In this notebook, we will look at some applications of Large Language Models (LLMs). We will make use of open-source models from Hugging Face.

In [1]:
!pip install datasets transformers sentence_transformers

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl.metadata (20 kB)
Collecting transformers
  Downloading transformers-4.38.2-py3-none-any.whl.metadata (130 kB)
Collecting sentence_transformers
  Downloading sentence_transformers-2.5.1-py3-none-any.whl.metadata (11 kB)
Collecting filelock (from datasets)
  Downloading filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
Collecting pyarrow>=12.0.0 (from datasets)
  Downloading pyarrow-15.0.1-cp311-cp311-win_amd64.whl.metadata (3.1 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)


In [2]:
import numpy as np
import pandas as pd

from datasets import load_dataset
from transformers import pipeline




## Sentiment Analysis
Here, we classify documents as having Positive or Negative sentiment (some models also predict other classes, e.g. Neutral).

**Model**: DistilBERT model fine-tuned on the SST-2 sentiment dataset (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)

### Create the sentiment analysis pipeline

In [3]:
sentiment_clf = pipeline(
    task="sentiment-analysis",  # this is an alias for text-classification
    model="distilbert-base-uncased-finetuned-sst-2-english",  # this is optional, the pipeline will automatically load a relevant default model which is the same in this case
    # truncation=True,
)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


### Example: Toy movie review corpus

In [4]:
corpus = [
    "The movie was fantastic. It redefines the horror movie genre. OMG the piano!!!",
    "It was a terrible movie, I couldn't sit through it. The piano was so lame!",
    "Probably the worst movie ever made in the entire history of movies. Don't even get me started on the piano...",
]

In [5]:
# Pass the toy corpus through the sentiment analysis pipeline and check the results
results = sentiment_clf(corpus)

In [6]:
print(results)

[{'label': 'POSITIVE', 'score': 0.999795138835907}, {'label': 'NEGATIVE', 'score': 0.9997960925102234}, {'label': 'NEGATIVE', 'score': 0.9997865557670593}]
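The pipeline returns one dictionary per input document, in the same order as the input. As a minimal sketch, the printed results can be paired with their reviews as shown below. The results are copied from the output above as literal data, so no model is loaded here; the `threshold` value and the `confident_labels` name are illustrative assumptions and not part of the notebook.

```python
# Toy corpus and pipeline output, copied from the cells above.
corpus = [
    "The movie was fantastic. It redefines the horror movie genre. OMG the piano!!!",
    "It was a terrible movie, I couldn't sit through it. The piano was so lame!",
    "Probably the worst movie ever made in the entire history of movies. Don't even get me started on the piano...",
]
results = [
    {"label": "POSITIVE", "score": 0.999795138835907},
    {"label": "NEGATIVE", "score": 0.9997960925102234},
    {"label": "NEGATIVE", "score": 0.9997865557670593},
]

# Print each predicted label and score next to the start of its review.
for text, result in zip(corpus, results):
    print(f"{result['label']:<8} {result['score']:.4f}  {text[:40]}")

# Keep only predictions whose score reaches an (arbitrary) threshold.
threshold = 0.9
confident_labels = [r["label"] for r in results if r["score"] >= threshold]
```

The score is the model's probability for the predicted label only, so a low score signals an uncertain prediction rather than the opposite sentiment.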


### Example: IMDB Movie Review dataset 
https://huggingface.co/datasets/imdb

In [7]:
imdb_dataset = load_dataset("imdb")


Downloading readme: 100%|██████████| 7.81k/7.81k [00:00<00:00, 7.67MB/s]
Downloading data: 100%|██████████| 21.0M/21.0M [00:04<00:00, 4.98MB/s]
Downloading data: 100%|██████████| 20.5M/20.5M [00:03<00:00, 6.03MB/s]
Downloading data: 100%|██████████| 42.0M/42.0M [00:07<00:00, 5.99MB/s]
Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 118280.82 examples/s]
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 103763.52 examples/s]
Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 124141.93 examples/s]


In [8]:
# Inspect the dataset
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

Let us look at a random sample from the test set

In [35]:
# Load the test split from the dataset
test_data = imdb_dataset['test']

In [15]:
test_data[123]

{'text': "Alas, another Costner movie that was an hour too long. Credible performances, but the script had no where to go and was in no hurry to get there. First we are offered an unrelated string of events few of which further the story. Will the script center on Randall and his wife? Randall and Fischer? How about Fischer and Thomas? In the end, no real front story ever develops and the characters themselves are artificially propped up by monologues from third parties. The singer explains Randall, Randall explains Fischer, on and on. Finally, long after you don't care anymore, you will learn something about the script meetings. Three endings were no doubt proffered and no one could make a decision. The end result? All three were used, one, after another, after another. If you can hang in past the 100th yawn, you'll be able to pick them out. Despite the transparent attempt to gain points with a dedication to the Coast Guard, this one should have washed out the very first day.",
 'labe

In [11]:
sentiment_clf(test_data[123]["text"])

[{'label': 'NEGATIVE', 'score': 0.9991201758384705}]

**Note**: Here, label 0 corresponds to negative sentiment and label 1 corresponds to positive sentiment
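The pipeline returns string labels while the dataset stores integers, so comparing the two requires a conversion. A minimal sketch of that conversion, assuming the label strings `NEGATIVE` and `POSITIVE` seen in the outputs above (the `LABEL_TO_ID` and `to_label_id` names are hypothetical helpers):

```python
# Mapping from the pipeline's string labels to the dataset's integer labels.
LABEL_TO_ID = {"NEGATIVE": 0, "POSITIVE": 1}


def to_label_id(prediction):
    """Convert one pipeline prediction dict to an IMDB-style integer label."""
    return LABEL_TO_ID[prediction["label"]]


# Prediction for test_data[123], copied from the output above.
prediction = {"label": "NEGATIVE", "score": 0.9991201758384705}
label_id = to_label_id(prediction)
print(label_id)
```

Other sentiment models may use different label strings, in which case the mapping would need to be adapted.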

Let us look at another random sample from the test set

In [16]:
test_data[24321]

{'text': 'I channel surfed past this many times, mainly because the synopsis sounded so cheesy, so "Love American Style". However, it turned out to be quite good, very well done. The two stand-out features are the dialog and acting. Great cast. The premise is actually well executed and there aren\'t too many weak moments. I guess what I was most amazed by was how often you thought the wheels are going to come off the cart, and instead, the cart just banks the turns, so to speak, and the movie keeps flying. There are some nice little sub-plots, particularly the relationship that develops between the character played by former Conan sidekick Andy Richter. Also, want to mention that the music accompanying it was good.',
 'label': 1}

In [13]:
sentiment_clf(test_data[24321]['text'])

[{'label': 'POSITIVE', 'score': 0.9994550347328186}]

### Exercise - Sentiment Analysis
- Load 500 random examples from the test set
  - Hint: You can generate 500 random numbers less than 25000, store them in a list and use `test_data.select(random_number_list)`
- Classify the sentiment for these examples (this might take a couple of minutes)
  - Note: You may get an error saying that the token sequence length is longer than the maximum sequence length (512). This means you are trying to encode a sequence that is longer than the model can handle (i.e. longer than 512 tokens). One way to overcome this is to include `truncation=True` as an additional argument when initialising the pipeline object.
- Compute the model accuracy 
  - Hint: You can compute this using the scikit-learn library

Search for another model for sentiment classification from the Hugging Face model repository (https://huggingface.co/models) and use that in your pipeline.
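The sampling and accuracy steps of the exercise can be sketched without loading any model. The snippet below draws 500 distinct indices and computes accuracy by hand as the fraction of matching labels, which is what scikit-learn's `accuracy_score` reports by default. The seed value, the `accuracy` helper and the toy label lists are assumptions made for illustration.

```python
import random

random.seed(42)  # arbitrary fixed seed so that the sample is reproducible
dataset_size = 25000  # number of rows in the IMDB test split
random_number_list = random.sample(range(dataset_size), 500)


def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Toy labels (made up for illustration): three of the four predictions match.
toy_true = [1, 0, 1, 1]
toy_pred = [1, 0, 0, 1]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```

`random.sample` draws without replacement, so no review is selected twice.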

In [79]:
import random

# Re-create the pipeline with truncation enabled so that reviews longer than
# the model's 512-token limit are truncated instead of raising an error
sentiment_clf = pipeline(
    task="sentiment-analysis",  # this is an alias for text-classification
    model="distilbert-base-uncased-finetuned-sst-2-english",
    truncation=True,
)

# Draw 500 distinct random indices and select those examples from the test set
test_data = imdb_dataset['test']
random_number_list = random.sample(range(len(test_data)), 500)
sample = test_data.select(random_number_list)

# Classify every review in the sample (this might take a couple of minutes)
sentiment_output = sentiment_clf(list(sample['text']))
true_label_list = list(sample['label'])

# Convert the string labels to the dataset's integer labels (0 = negative, 1 = positive)
predicted_labels = [item['label'] for item in sentiment_output]
binary_predicted_labels = [1 if label == 'POSITIVE' else 0 for label in predicted_labels]


In [80]:
print(true_label_list)
print(binary_predicted_labels)

[1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 

In [82]:
from sklearn.metrics import accuracy_score
# Compute accuracy
accuracy = accuracy_score(true_label_list, binary_predicted_labels)
print(accuracy)

0.904


## Summarization
Here, we are looking to provide a summary for a given document.

**Model**: t5-small model (https://huggingface.co/t5-small)

### Create the summarization pipeline

In [17]:
summarizer = pipeline(
    task="summarization",
    model="t5-small",  # this is not the default model for this pipeline so it has to be explicitly included
    min_length=20,  # this sets the minimum length for the summary
    max_length=100,  # this sets the maximum length for the summary
)




### Example: Summarising the NLP Introduction chapter from Wikipedia (https://en.wikipedia.org/wiki/Natural_language_processing)

In [18]:
document = """Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate speech. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation. 
"""
print(document)


Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate speech. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation. 



In [19]:
# Pass the document through the summarizer
summarizer(document)

[{'summary_text': 'natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics . it involves processing natural language datasets, such as text corpora or speech corporia . the goal is a computer capable of "understanding" the contents of documents .'}]

### Exercise - Summarization of news articles from the xsum dataset

- Load the xsum dataset from Hugging Face (https://huggingface.co/datasets/xsum)
- Select 5-10 random articles from the test set and summarise them using a model of your choice and compare the results against the reference summary.

Additional reading: Evaluating summarization models: https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization
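One simple way to compare a generated summary against its reference is word overlap. The sketch below computes a unigram-overlap F1 score in the spirit of ROUGE-1. It is a simplified stand-in, not the official ROUGE implementation: it splits on whitespace only and applies no stemming. The `unigram_f1` helper and the example sentences are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up sentences: five of the six words in each sentence are shared.
candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
score = unigram_f1(candidate, reference)
print(round(score, 3))
```

Overlap scores reward shared wording rather than shared meaning, so a good abstractive summary that paraphrases the reference can still score low.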


In [84]:
from datasets import load_dataset

xsum_data = load_dataset("xsum")


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
Downloading builder script: 100%|██████████| 5.76k/5.76k [00:00<?, ?B/s]
Downloading readme: 100%|██████████| 6.24k/6.24k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 255M/255M [00:36<00:00, 7.03MB/s] 
Downloading data: 2.72MB [00:00, 10.2MB/s]                           
Generating train split: 100%|██████████| 204045/204045 [00:28<00:00, 7264.40 examples/s]
Generating validation split: 100%|██████████| 11332/11332 [00:16<00:00, 671.89 examples/s]
Generating test split: 100%|██████████| 11334/11334 [00:16<00:00, 677.93 examples/s]


In [101]:

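import random

test_data = xsum_data['test']
random_number_list = random.sample(range(len(test_data)), 10)

# Collect the sampled articles together with their reference summaries
sample_data = [test_data[i]['document'] for i in random_number_list]
reference_summaries = [test_data[i]['summary'] for i in random_number_list]
print(sample_data)

# Summarise each article individually; truncation=True cuts off articles
# that exceed the model's maximum input length
summarisations = []
for article in sample_data:
    summarisations.append(summarizer(article, truncation=True))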


['Kuba Moczyk, 22, died in hospital after he was knocked out in an unlicensed fight at the Tower Complex, Great Yarmouth, Norfolk, on 19 November.\nA memorial mass has been held at St Mary\'s Church in the town.\nFather Philip Shryane told the congregation Mr Moczyk\' was a "good man" whose "life was boxing".\nMore on this story and others from Norfolk\nHe said Mr Moczyk was "a young man with a good heart, with so much to give and so much to look forward to... but always a gentle smile".\nHis uncle, Marcin Smigaj gave a tribute, in Polish, on behalf of the family. Mr Moczyk was due to be cremated.\nMr Moczyk, originally from Poland, worked at a chicken factory and lived in the town.\nHis trainer Scott Osinski said earlier that Mr Moczyk was winning the fight when he took the fatal blow.\nHis opponent is believed to be aged 17.', '"Nobody risks £15,000 on a hunch," said a spokesman for William Hill, who make the band 6/4 favourites to get the gig.\nLadbrokes, Coral and Paddy Power have 

In [100]:
print(summarisations)

[]


## Similarity Search
Here, we will look at comparing documents and finding similar ones. This can be performed either by using classical encoding techniques such as Bag of Words or TF-IDF or by extracting features using LLMs. 

For this example, we will make use of the `sentence_transformers` module.

To compare embeddings, we can use cosine similarity, a common similarity measure that computes the cosine of the angle between two embeddings (represented as vectors). It produces a value between -1 and 1: 1 means the vectors point in the same direction, 0 means they are orthogonal (no similarity) and -1 means they point in opposite directions.
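Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python, using small made-up vectors instead of real embeddings (the `cosine` helper is hypothetical and, unlike a library implementation, does not handle zero-length vectors):

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # vectors at a right angle
opposite = cosine([1.0, 2.0], [-1.0, -2.0])  # vectors pointing apart
print(same_direction, orthogonal, opposite)
```

Because the lengths are divided out, only the direction of the vectors matters and not their magnitude.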

In [20]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


**Model**: all-MiniLM-L6-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

In [21]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")



### Example: Toy corpus 

In [22]:
corpus = [
    "The cat and the mouse ate together",
    "The old man put the cigarette in the ashtray and placed it on the table",
    "The cat and mouse game",
]


In [23]:
# Call the encode method of the embedding model on the corpus
embeddings = embedding_model.encode(corpus)

In [24]:
embeddings

array([[ 0.06908117,  0.00346941,  0.06142804, ...,  0.08479892,
         0.01098884,  0.07748368],
       [ 0.03966445,  0.14998706, -0.10435025, ...,  0.0289601 ,
         0.00873841,  0.05782938],
       [ 0.03251994,  0.02064463,  0.05969327, ...,  0.07268786,
        -0.00851978,  0.08279924]], dtype=float32)

In [25]:
# Check the shape of the embeddings.
embeddings.shape

(3, 384)

In [26]:
# Calculate the cosine similarity on the embeddings
cosine_similarity(embeddings)

array([[1.0000001 , 0.15666504, 0.67292565],
       [0.15666504, 1.        , 0.09970371],
       [0.67292565, 0.09970371, 0.99999994]], dtype=float32)
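Each row of the similarity matrix compares one sentence against all sentences, and the diagonal holds each sentence's similarity with itself. As a sketch of the search step, the snippet below finds the most similar other sentence for each entry, using the matrix values printed above as literal data (the `most_similar` helper is a hypothetical name).

```python
corpus = [
    "The cat and the mouse ate together",
    "The old man put the cigarette in the ashtray and placed it on the table",
    "The cat and mouse game",
]

# Cosine similarity matrix, copied from the output above.
similarity = [
    [1.0000001, 0.15666504, 0.67292565],
    [0.15666504, 1.0, 0.09970371],
    [0.67292565, 0.09970371, 0.99999994],
]


def most_similar(index, matrix):
    """Return (other_index, score) of the closest sentence, excluding itself."""
    candidates = [(score, j) for j, score in enumerate(matrix[index]) if j != index]
    best_score, best_index = max(candidates)
    return best_index, best_score


for i, sentence in enumerate(corpus):
    j, score = most_similar(i, similarity)
    print(f"{sentence!r} -> {corpus[j]!r} ({score:.3f})")
```

The two cat-and-mouse sentences are each other's nearest neighbours, while the unrelated sentence has low similarity to both.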