# Natural Language Processing with Transformers

# SentenceTransformers

* SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings.

* You can use this framework to compute sentence and text embeddings for more than 100 languages.
* These embeddings can then be compared, e.g. with cosine similarity, to find sentences with a similar meaning. This is useful for semantic textual similarity, semantic search, and paraphrase mining.

* The framework is based on PyTorch and Transformers and offers a large collection of pre-trained models tuned for various tasks. Further, it is easy to fine-tune your own models.

# Projects

* Sentence Embeddings and Similarity project
* Semantic Search project
* K-Means Clustering on Text Data project
* Fast Clustering project
* Similar Research Paper Recommendation System project
* Extractive Text Summarization project

# Project 1: Sentence Embeddings and Similarity

In [None]:
%pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)

In [None]:
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained model; its files are downloaded on first use
model = SentenceTransformer('all-MiniLM-L6-v2')

* Sentence embedding is the collective name for a set of techniques in natural language processing (NLP) where sentences are mapped to vectors of real numbers.

In [None]:
sentences = ['the cat sits outside', 'the new movie is awesome', 'the new movie is really great', 'the dog bark on strange']

In [None]:
# Encode all sentences at once; convert_to_tensor=True returns a PyTorch tensor
embeddings = model.encode(sentences, convert_to_tensor=True)

In [None]:
for sent, embed in zip(sentences, embeddings):
  print('sentence:', sent)
  print('embedding length:', len(embed))
  # print('embedding:', embed)

sentence: the cat sits outside
embedding length: 384
sentence: the new movie is awesome
embedding length: 384
sentence: the new movie is really great
embedding length: 384
sentence: the dog bark on strange
embedding length: 384


# cos_sim
* util.cos_sim(a, b) computes the cosine similarity between a[i] and b[j] for all i and j. It returns a matrix res with res[i][j] = cos_sim(a[i], b[j]).
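As a rough illustration of what such a matrix contains, the pure-Python sketch below computes the same quantity for a few small vectors. The helper names `cosine` and `cos_sim_matrix` and the toy vectors are made up for this example; this is not the library's implementation, which works on tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def cos_sim_matrix(a, b):
    """Return res with res[i][j] = cosine(a[i], b[j]) for all i and j."""
    return [[cosine(u, v) for v in b] for u in a]

# Toy 2-dimensional vectors (hypothetical, not real sentence embeddings)
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = cos_sim_matrix(vectors, vectors)
for row in matrix:
    print([round(x, 4) for x in row])
```

Comparing a list with itself gives a symmetric matrix whose diagonal is 1, since every vector is maximally similar to itself.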

# util
* sentence_transformers.util defines different helpful functions to work with text embeddings.

In [None]:
# Compare every sentence embedding with every other one (4 x 4 matrix)
cosine_scores = util.cos_sim(embeddings, embeddings)

In [None]:
cosine_scores

tensor([[ 1.0000, -0.0247, -0.0258,  0.1960],
        [-0.0247,  1.0000,  0.9074,  0.1464],
        [-0.0258,  0.9074,  1.0000,  0.1379],
        [ 0.1960,  0.1464,  0.1379,  1.0000]], device='cuda:0')

In [None]:
sentences

['the cat sits outside',
 'the new movie is awesome',
 'the new movie is really great',
 'the dog bark on strange']
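
To read the score matrix against the sentence list, one option is to scan the entries above the diagonal and keep the largest one. The sketch below does this in plain Python, with the scores copied by hand from the cosine_scores output above (four decimals); the variable names are illustrative only.

```python
sentences = [
    'the cat sits outside',
    'the new movie is awesome',
    'the new movie is really great',
    'the dog bark on strange',
]

# Scores copied from the cosine_scores output shown earlier
scores = [
    [1.0000, -0.0247, -0.0258, 0.1960],
    [-0.0247, 1.0000, 0.9074, 0.1464],
    [-0.0258, 0.9074, 1.0000, 0.1379],
    [0.1960, 0.1464, 0.1379, 1.0000],
]

# The matrix is symmetric and its diagonal is 1 (each sentence compared
# with itself), so only the entries with j > i carry new information.
best_score, best_i, best_j = max(
    (scores[i][j], i, j)
    for i in range(len(sentences))
    for j in range(i + 1, len(sentences))
)
print(sentences[best_i], '<>', sentences[best_j], '-->', best_score)
```

The two movie sentences (indices 1 and 2) come out as the closest pair, which matches the intuition that they say nearly the same thing.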

# Paraphrase mining
* Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score.

# Parameters
model – SentenceTransformer model for embedding computation

sentences – A list of strings (texts or sentences)

show_progress_bar – Whether to display a progress bar

batch_size – Number of texts that are encoded simultaneously by the model

query_chunk_size – Search for most similar pairs for #query_chunk_size at the same time. Decrease, to lower memory footprint (increases run-time).

corpus_chunk_size – Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease, to lower memory footprint (increases run-time).

max_pairs – Maximal number of text pairs returned.

top_k – For each sentence, up to top_k other sentences are retrieved.

score_function – Function for computing scores. By default, cosine similarity.

# Returns
A list of triplets in the format [score, id1, id2], where id1 and id2 are indices into the input list of sentences.
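
Conceptually, this resembles scoring every pair of sentences and listing the pairs from most to least similar. The sketch below shows that idea in plain Python on made-up 2-dimensional vectors. `brute_force_pairs` is a hypothetical helper, not part of sentence_transformers, and it ignores the chunking, batching, and top_k options described above.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def brute_force_pairs(vectors, max_pairs=None):
    """Score every unordered pair (i, j) with i < j, sorted by decreasing score."""
    pairs = [[cosine(vectors[i], vectors[j]), i, j]
             for i, j in combinations(range(len(vectors)), 2)]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs if max_pairs is None else pairs[:max_pairs]

# Toy vectors (hypothetical): 0 and 1 point in almost the same direction
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
pairs = brute_force_pairs(toy_vectors)
for score, i, j in pairs:
    print(i, j, round(score, 4))
```

Comparing all pairs grows quadratically with the number of sentences, which is why parameters such as query_chunk_size and corpus_chunk_size exist to trade run-time for a smaller memory footprint.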

In [None]:
paraphrases = util.paraphrase_mining(model, sentences)

In [None]:
# Each entry is [score, i, j], where i and j index into sentences
for score, i, j in paraphrases[:10]:
  print(sentences[i], '<>', sentences[j], "-->", score)
  print()

the new movie is awesome <> the new movie is really great --> 0.9074223041534424

the cat sits outside <> the dog bark on strange --> 0.19601118564605713

the new movie is awesome <> the dog bark on strange --> 0.14636774361133575

the new movie is really great <> the dog bark on strange --> 0.13786301016807556

the cat sits outside <> the new movie is awesome --> -0.02468010224401951

the cat sits outside <> the new movie is really great --> -0.025751957669854164

